Word Expert Semantics: An Interlingual Knowledge-Based Approach 9783110883480, 9783110133318


English Pages 262 [268] Year 1986

Table of contents :
Contents
Preface
Introduction
Part I - Machine Translation and the Problem of Ambiguity
Chapter 1. Natural Language Processing
Chapter 2. Lexical Ambiguity
Chapter 3. A Concise Historical Survey
Chapter 4. MT & Understanding - Contemporary Techniques
Chapter 5. The Importance of the Dictionary
Chapter 6. The Need for Integration
Part II - The Semantic Word Expert System
Chapter 1. General Lay-out of the DLT System
Chapter 2. The Structure of the Lexical Knowledge Bank
Chapter 3. Disambiguation with SWESIL
Chapter 4. The Disambiguation Dialogue
Part III - The Semantic Work Bench - A Development Tool
Chapter 1. Developing SWESIL and the LKB
Chapter 2. The Semantic Work Bench
Chapter 3. Tests and Results
Chapter 4. Melby Test Results
Part IV - Future Developments
Chapter 1. Computerized Lexicography
Chapter 2. Macrocontext and Discourse Analysis
Chapter 3. The Self-Improving System
References
Alphabetical Index of Authors & Names
Alphabetical Index

Word Expert Semantics: An Interlingual Knowledge-Based Approach
9783110883480, 9783110133318

Author / Uploaded
B. C. Papegaaij


Distributed Language Translation

The goal of this series is to publish texts which are related to computational linguistics and machine translation in general, and the DLT (Distributed Language Translation) research project in particular.

Series editor: Toon Witkam, B.S.O./Research, P.O. Box 8348, NL-3503 RH Utrecht, The Netherlands

B.C. Papegaaij V. Sadler and A.P.M. Witkam (eds.)

WORD EXPERT SEMANTICS an Interlingual Knowledge-Based Approach

1986 FORIS PUBLICATIONS Dordrecht - Holland/Riverton - U.S.A.

Published by: Foris Publications Holland, P.O. Box 509, 3300 AM Dordrecht, The Netherlands

Sole distributor for the U.S.A. and Canada: Foris Publications USA Inc. Providence U.S.A.

CIP-DATA

In co-operation with BSO, Utrecht, The Netherlands ISBN 90 6765 262 3 (Bound) ISBN 90 6765 261 X (Paper) © 1986 Foris Publications - Dordrecht

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.

Printed in the Netherlands by ICG Printing, Dordrecht.

Contents

Introduction

Part I - Machine Translation and the Problem of Ambiguity
Chapter 1. Natural Language Processing - The Computer (R)evolution - Natural Language - Machine Translation - Limitations of the Computer - Natural Language versus the Computer
Chapter 2. Lexical Ambiguity - Paradigms of Meaning - Non-determinism - Knowledge of the World - The Shiftiness and Creativity of Language
Chapter 3. A Concise Historical Survey - Word-by-Word Translation - Syntax Only - Closed Worlds - Selection Restrictions - Case Grammar - Scripts and Plans - Preference Semantics - The Word Expert
Chapter 4. MT & Understanding - Contemporary Techniques - 'Decompose and Capture' - Classification by Means of Primitives - The Knowledge Bank - Storing Facts and Figures - Sublanguage & Limited Domains - Semantic Networks - Frames - Logic, Inferencing & Mechanical Reasoning
Chapter 5. The Importance of the Dictionary - The Lexical Knowledge Bank - Lexical Universe & Lexical Function - Innovations in Lexicography
Chapter 6. The Need for Integration - Many Solutions for Many Problems - High Quality MT through Combined Strategies

Part II - The Semantic Word Expert System
Chapter 1. General Lay-out of the DLT System - Modular Approach - On-line Parsing - Interleaved Semantic Processing - End-of-Sentence Macrocontext Semantics - Disambiguation Dialogue
Chapter 2. The Structure of the Lexical Knowledge Bank - Principal Structure - SL to IL - IL to TL - The Central Role of IL - The Lexical Taxonomy - Contextual Dependency: SOLL Pairs - Logical & Ontological Information - The Core - Relator Matching - Closed Sets, Cyclic Data and other Information
Chapter 3. Disambiguation with SWESIL - Syntactic Analysis - Interleaved Semantics - The IST Pair Generator - Paradigmatic Expansion - The Matching Cycle - Ordering the Alternatives - Second-Order Matching
Chapter 4. The Disambiguation Dialogue - Closed Format Question - Types of Questions & On-screen Presentation - Experiments with Disambiguation Dialogues

Part III - The Semantic Work Bench - A Development Tool
Chapter 1. Developing SWESIL and the LKB - Building a Lexicon - Experimental Lexicographic Approaches - The Melby Test - The Size of the LKB
Chapter 2. The Semantic Work Bench - The Experimental Environment - Modular Approach - Tools
Chapter 3. Tests and Results - Single IST Pairs - Sentences with First-Order Matching Only - Sentences with Second-Order Matching
Chapter 4. Melby Test Results

Part IV - Future Developments
Chapter 1. Computerized Lexicography - Diversity of Sources - Formalizing Information - Storing Entries in the LKB - Lexicographic Tools - The Ultimate Aim - The General Purpose IL Knowledge Bank
Chapter 2. Macrocontext and Discourse Analysis - Beyond the Sentence Boundary - Text Models - Incremental Understanding - Validation of Information from Different Sources - Applications - Summaries, Indexing, Rephrasing - Increasing Dialogue Efficiency
Chapter 3. The Self-Improving System - Changing Language - Variety of Language - Learning from Experience - Improving Performance while Processing Text - The Dialogue as SWESIL's Tutor - Updating and Expanding

References

Preface

This book reflects a small part of the current R & D activity in the area of Natural Language Understanding by computers, an area which relies on a combination of computational linguistics, lexicography and translation science, but also on software innovation and artificial intelligence. Amidst a dozen major Machine Translation projects in Europe and Japan, The Netherlands are involved in three long-term projects, and the present book is closely connected with one of them: DLT (Distributed Language Translation), initiated by BSO and financially supported both by that software house and the Netherlands Ministry of Economic Affairs.

Though DLT is an industrial research project, and in the front line of pre-competition in East and West, we feel that steady progress in this field will greatly depend on rapid communication of views, approaches adopted, intermediate results achieved, etc., among research teams. At the same time, university students of computational linguistics should be given a chance to inform themselves of present industrial research programs, without having to take recourse to the grey (or even black) literature circuit. This has motivated the start of a DLT series in computational linguistics, of which this book is the first volume.

The author and editors wish to emphasize that much of the contents is the result of intensive team work, involving more or less all their colleagues at BSO/Research. We are indebted to the BSO management for their critical but stimulating interest, and we thank all other persons who helped with their advice or their patience.

Bart Papegaaij, Victor Sadler, Toon Witkam

Introduction

Distributed Language Translation

In 1985, after a thorough feasibility study [WITKAM 1983] sponsored by the Commission of the European Community, BSO/Research in the Netherlands started the DLT project: a multi-million dollar research & development effort aimed at high quality machine translation for the 1990s. The feasibility study outlined the strategies and features deemed necessary to develop semi-automatic High Quality Machine Translation (HQMT). Among the more salient features were:

1) Designed for the office of the future; DLT will be integrated into advanced text processing apparatus incorporating spelling, syntax and style correction.

2) As fully automatic HQMT is not thought to be possible within the next few decades, MT systems will to some extent have to rely on human assistance. In the case of DLT this is realized by the Disambiguation Dialogue. After each sentence the system may ask the user to answer some multiple choice questions, thus interactively aiding in the correct interpretation of the text and clearing up any unresolved ambiguities.

3) Modern offices will increasingly rely on computer networks for a fast and efficient flow of information. The process of Distributed Language Translation is ideally suited to such a network configuration: the source text will be analyzed once, at the supplier's side, after which an intermediate representation of the text will be put on the network. Each receiver can choose the language into which the text is to be translated, since the final translation into the target language is accomplished (fully automatically) by the receiving work station. Human assistance in the process is limited to the supplier's side; the end-user sees only the finished product.

4) The intermediate representation has been chosen to be compact, for efficient transmission through the network, and readily inspectable, which is particularly important during the first stages of development. The choice of an Esperanto-based Intermediate Language (IL) offers the following advantages:
- It is powerful and regular, allowing full and unambiguous representation of the source text;
- Because of its strong affinity with natural language, it is easily readable on human inspection;
- Its system of unchanging morphemes allows for compact, morpheme-based coding, more efficient than the standard character-based ASCII code;
- By adopting an Esperanto-based IL, DLT can make use of the accumulated experience of a century's use and study of the language. In particular, the existence of many bilingual Esperanto dictionaries is of great help when creating practical bilingual lexicons for the system;

5) The system will make use of an extensive knowledge bank, written entirely in the IL and stored on modern optical disks (CD-ROM), which will form the heart of the system's intelligence;

6) To achieve proper, knowledge-based interpretation of Natural Language, the system will incorporate techniques for problem solving, reasoning and understanding, belonging to the realm of Artificial Intelligence;

7) A form of Machine Learning will be aimed at which should enable the system to adjust its performance on the basis of the material it is given to process, continually improving and expanding its lexicon.
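The compactness claim in feature 4 can be illustrated with a small sketch. It assumes that every morpheme receives one fixed-width 16-bit code and that word boundaries need no extra storage; both are simplifications made for illustration only, and the tiny morpheme table is hypothetical rather than DLT's actual coding scheme. The apostrophe marks morpheme boundaries, as in the IL examples elsewhere in the text.

```python
# Hypothetical morpheme table; the real DLT inventory is not given in the text.
MORPHEMES = ["vetur", "i", "per", "trajn", "o"]
CODE = {morpheme: index for index, morpheme in enumerate(MORPHEMES)}


def encode(text):
    """Replace every morpheme (separated by ' or a space) with its integer code."""
    return [CODE[m] for word in text.split() for m in word.split("'")]


def ascii_size(text):
    """Bytes needed for character-based coding (morpheme marks removed)."""
    return len(text.replace("'", "").encode("ascii"))


def morpheme_size(text, bytes_per_code=2):
    """Bytes needed when each morpheme is stored as one fixed-width code."""
    return bytes_per_code * len(encode(text))


phrase = "vetur'i per trajn'o"
codes = encode(phrase)
character_bytes = ascii_size(phrase)
morpheme_bytes = morpheme_size(phrase)
```

Under these assumptions the phrase takes fewer bytes in morpheme coding than in character coding; the actual saving would depend on the size of the morpheme inventory and on how word boundaries are represented.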

The Purpose of this Book

The project is now well into its second year and many of the proposed features are being realized in designs and implementation. To facilitate development, a modular approach has been chosen, subdividing the work into more clearly defined subgoals. A central subdivision is between syntax and semantics. Though the DLT system will eventually integrate both kinds of linguistic processing, the separation of the two for the purposes of development stresses the notion that syntax and semantics are two distinct elements of an MT system, each with its own problems and requirements. The separation will be evident throughout this book, which is concerned with the semantic side of DLT: whenever syntax and the interface between syntax and semantics are discussed, we simply assume that a suitable syntactic representation is available, without bothering about its exact structure or derivation. (An accurate description of the syntax of DLT's IL can be found in SCHUBERT 1986).

This book is divided into four parts. Part I is a general introduction to the problems of Machine Translation and the techniques that have been developed to solve them, aimed at those readers not intimately familiar with the details of Computational Linguistics, Natural Language Processing and Artificial Intelligence. It will set the scene for Part II, a description of the (semantic) solutions DLT proposes. Part III is centered round the so-called Semantic Work Bench, the environment in which Lexicographers and Linguists are working to develop their semantic system and the Lexical Knowledge Bank that lies at its center. Part IV concludes the book with a brief survey of developments planned for the near (and not so near) future.

PART I: MACHINE TRANSLATION AND THE PROBLEM OF AMBIGUITY

Chapter 1

Natural Language Processing

The Computer (R)evolution

Since their first appearance half a century ago, as huge, buzzing number crunchers, computers have rapidly developed into compact, fast and powerful all-purpose machines that are of invaluable assistance to human endeavours in almost any field. The ongoing miniaturization, increase in power and decline of prices have made computers available to a large part of the population instead of the small, privileged group of specialists that used to be associated with them.

One problem with the spreading use of computers is that an increasing number of people actually operating them are not really trained to do so; the typical computer user is no longer a technically skilled computer buff, familiar with all aspects of the machine, but an office worker interested only in the easiest way to produce the desired results. Unfortunately, most computers do require quite some knowledge about the way they operate, and at least a more than superficial knowledge of the various commands that can be used - especially the exact rules and syntax of their usage. Computers are very unforgiving when it comes to mistakes and either fail to respond or, even worse, perform some undesired action that can have disastrous effects on data-files or programs.

Natural Language

The computer industry has come to recognize the need for user-friendly front-ends, that shield the user from the inner workings of the machine, allowing only those things to be done that are of direct use, and allowing them to be done in a way that is easy to learn and easy to perform. Examples of such interfaces are menus, help pages, exhaustive prompts and especially the 'icon' approach made popular by the Lisa and Macintosh computers: instead of commands to be typed in, there are small pictures representing the actions that can be performed, and these need only to be pointed at and activated with the click of a button.

Nevertheless, although great advances have been made with regard to user-friendliness, the methods used are still computer-specific, i.e. they have to be learned anew for each new type of computer and require of the user some level of skill and familiarity with the machine's operations. As computers come to be integrated more and more into everyday activities the need grows for a 'natural' way of operating them. What is needed is the same kind of flexibility and ease of communication that exists between people, in other words: something like normal everyday language.

However, although the idea of computers using natural language is as old as the computer itself, there seems to be an essential incompatibility between computers and human language. They may be very good number crunchers and efficient data managers, capable of storing enormous amounts of information, but when it comes to natural language computers are slow, inefficient and often incapable of doing anything useful with the input they are given.

Machine Translation

Parallel to the spreading use of the computer runs the development of the field of electronic communication. Information exchange networks have become possible which transfer information from one side of the world to the other almost instantaneously. Such electronic communication is rapidly changing international business. A growing number of business organizations are coming to rely on a fast, efficient and continuous flow of information both to control their internal organization and to maintain their share of the market. The same need for direct access to information is felt in scientific centers and public service institutions.

The more the possibilities of (nearly) instantaneous communication across the world increase, the more the language barrier becomes a real bottleneck in the flow of information. Translating information from one language into another takes valuable time and involves employing highly trained and expensive specialists. Moreover, the same process has to be repeated for every language the information has to appear in.

An intermediate solution is the use of a single 'international' language (for historical reasons this is often English). Such an international language does eliminate the need for multiple translations and transmissions, but it has serious drawbacks as well:

- For most countries it still means employing trained personnel to produce English texts and to translate information in English into the native language when it is meant for a wider audience.

- For people whose native language is not closely related to English, extra effort and time/cost investment is needed to learn the language. Japan, for example, one of the world's leading economic powers, has serious problems supplying adequate manuals for its products, publishing scientific work of high stylistic quality and providing its own people with information from the West.

Fig. I.1.1.a. TRADITIONAL (MACHINE) TRANSLATION

One source text has to be translated as many times as there are target languages. After that, each target text has to be sent (or transmitted) separately to its place of destination. In many MT-systems, the receiving side will have to do post-editing before the text is ready for wide-spread use.

Fig. I.1.1.b. DISTRIBUTED LANGUAGE TRANSLATION

A source text needs to be translated and transmitted only once. Each receiving station offers the user a choice of languages in which to access the text, which is being produced without any assistance of the end-user. The only human assistance is provided by the person producing the text in the first place, as integral part of the creation of the text as a publishable (or transmittable) object.
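The contrast between the two figures can be sketched in a few lines. This is a minimal illustration only: the class and function names are hypothetical, and the "analysis" and "generation" steps are string placeholders standing in for the real processing, which the text does not specify at this point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ILText:
    """Placeholder for the intermediate representation put on the network."""
    content: str


class Sender:
    """Supplier's side: analyses the source text once, with human assistance."""

    def __init__(self):
        self.analyses = 0

    def publish(self, source_text):
        self.analyses += 1  # the only step involving the disambiguation dialogue
        return ILText(content=f"IL({source_text})")


class Receiver:
    """Receiving station: generates its own target language automatically."""

    def __init__(self, language):
        self.language = language

    def read(self, il_text):
        return f"{self.language}:{il_text.content}"


sender = Sender()
il_text = sender.publish("The pen is in the box.")
receivers = [Receiver("French"), Receiver("German"), Receiver("Dutch")]
outputs = [receiver.read(il_text) for receiver in receivers]
```

However many receiving stations are added, the source analysis (and with it the human assistance) happens once; only the automatic generation step is repeated per target language.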

A better solution would be to eliminate the bottleneck of human translation altogether. Ideally, the same computer that made world-wide electronic communication possible should also be able to take care of this aspect of the process and provide translations as an integral part of the information exchange.

Machine Translation has been worked on seriously ever since World War II, but up to now the incompatibility between natural language and computers has prevented any really satisfactory system from being produced. In the 1950s it was felt that the rapid increase in power, speed and capacity of the computer would soon bring Natural Language Processing (NLP) and Machine Translation (MT) within reach, but even though the evolution of the computer has surpassed almost all expectations, those high hopes have yet to be realized. The high expectations, and successive disappointments when results failed to be produced, eventually led to the US National Research Council's ALPAC report, whose verdict was that MT could not be achieved in the foreseeable future, and that funds for research could be better spent elsewhere [NRC 1966]. The sentiment of the ALPAC report was worded a few years earlier by Bar-Hillel [BAR-HILLEL 1960]. One of the examples he gives to make his point has since become famous:

(1a) The pen is in the box.

(1b) The box is in the pen.

- together with his claim that: "no existing or imaginable program will enable an electronic computer to determine that the word pen [in (1b) means] an enclosure where small children can play" [BAR-HILLEL, 1960, p. 159]. In short, Bar-Hillel's argument was that the amount of detailed knowledge about innumerable aspects of the real world the human language user can call on in order to decide on the intended meaning of an utterance can never be incorporated in a computer program.
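To make concrete what kind of world knowledge the pen/box example calls for, here is a toy sketch. It is not a method described in the text, and it does not answer Bar-Hillel's objection, which concerns the sheer amount of such knowledge; it only shows that one piece of knowledge (rough object sizes, invented here for illustration) is enough to separate the two readings in this single case.

```python
# Hypothetical typical sizes in metres for each word sense (illustrative values).
SENSES = {
    "pen": {"writing instrument": 0.15, "playpen": 1.5},
    "box": {"container": 0.5},
}


def plausible_readings(inner, outer):
    """Return (inner sense, outer sense) pairs where the inner object fits inside."""
    return [
        (inner_sense, outer_sense)
        for inner_sense, inner_size in SENSES[inner].items()
        for outer_sense, outer_size in SENSES[outer].items()
        if inner_size < outer_size
    ]


reading_1a = plausible_readings("pen", "box")  # "The pen is in the box."
reading_1b = plausible_readings("box", "pen")  # "The box is in the pen."
```

Each sentence keeps exactly one reading under the size constraint. The difficulty Bar-Hillel pointed at is that every such constraint has to be anticipated and stored for every word.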

Fig. I.1.2. 'THE BOX IS IN THE PEN'

Although initially the ALPAC report caused the suspension of many research efforts aimed at MT, the 1970s brought renewed interest in the subject. This revival was due partly to the spectacular advances in computer technology, partly to the results obtained by research in the field of Artificial Intelligence (AI) in general, and partly to the renewed interest of linguists in computational linguistics, now that computers had started to become widely available. The announcement of Japan's "Fifth Generation Computers" project - a massive and centrally coordinated effort to produce a new generation of computers and programs specifically aimed at AI and the ability to handle Natural Language - signalled a new approach to the old problem and the renewed belief that this time useful results were within reach.

Limitations of the Computer

Advances in computational and theoretical linguistics and the achievements of various research projects all over the world have given us a good insight into the problems the earlier attempts stumbled over. The problems summarized below all center on the limitations of the computer compared to human competence when it comes to handling natural language.

Rigidity

Computers are capable of performing a great variety of tasks, but within each task they are essentially rigid and capable only of doing what they were explicitly programmed for. If situations arise which were not foreseen when the program was written, the machine will either crash or else produce nonsensical results. In no way is the computer able to recognize the incongruity between present input and what it was programmed to handle.

Absoluteness

While humans can handle a wide range of probabilities in describing or thinking about given situations, all the decisions the computer makes come down to one of two states being active: on or off, yes or no, true or false. Even when higher levels of programming appear to make balanced and finely-nuanced judgements, at the bottom level these decisions come down to the basic yes/no choices.

Lack of creativity

Computers need explicit instructions. It is not enough to indicate the more important points of a task, the way one would instruct a person, and leave the computer to work out the details. Somewhere in the program the details must be explicitly taken care of.

Limited Storage Capacity

Though the capacity, speed and cost-effectiveness of data storage have increased spectacularly over the last decade, there is still an absolute limit to the amount of data a computer can effectively store and access. For human beings (though arguably such a limit must exist) no absolute limit has ever been established.

Natural Language versus the Computer

In the light of the limitations mentioned above, a number of properties of natural language pose serious problems for computer processing. The following list is only meant to give some impression of the range of problems NLP researchers are facing and in what directions they hope to find solutions.

Complexity

Natural language is a complex, multi-leveled system, governed by a variety of interacting rules. Linguists generally divide the language system into a number of subsystems, the most important of which are:

- phonology, determining the combination of single sound elements into syllables and words;
- morphology, ruling the combination of word fragments into complete words in sentences;
- syntax, governing the structuring of word groups into phrases and sentences;
- semantics, providing the interpretation of utterances in terms of concepts that relate those utterances to the outside world.

The precise nature of each subsystem, the scope of their rules and the exact way in which they interact with each other are not at all clear, and at times it would appear that the total system's complexity defies all attempts to capture it in computer programs. Under the influence, in particular, of Chomsky and his theory of Transformational Generative Grammar

(Fr.)

acquérir obtenir gagner recevoir

The solution is the inclusion in the TL column of semantic information coded in IL - in exactly the same format as all the semantic information in the IL component of the LKB - whenever divergency is encountered. The information is chosen to illustrate, in IL, the difference in contextual dependency between the TL alternatives. When an IL text is being translated into TL, the TL-generation module can ask SWESIL to search the IL text for patterns that fit in with the disambiguation information in the TL column, thus allowing a semantically reasoned choice to be made.

This approach is an essential element in avoiding an explosive IL, as discussed above. The kind of nuances found on the IL-TL side are language specific - i.e. relevant only to the TL in question. Incorporating them in the IL would mean adding a burden which would most of the time mean a lot of useless extra work selecting the correct shades of meaning, since they will be lost in the other languages anyway. Instead, they are part of the TL side of the system, where they are immediately useful, and extra work on meaning selection is only undertaken when the TL dictionary indicates the need for it.
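The selection step described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the entry name `il-word`, the cue words attached to each French alternative, and the plain set-overlap scoring are all hypothetical stand-ins, whereas the text describes SWESIL performing a semantic pattern search over the IL text.

```python
# Hypothetical TL column: each IL word maps to (French word, IL-coded cue words).
# An empty cue set marks the alternative used when no cue is found in the text.
TL_COLUMN = {
    "abol'i": [("abroger", set())],
    "il-word": [  # placeholder name for a divergent IL entry
        ("gagner", {"mon'o", "salajr'o"}),
        ("recevoir", {"leter'o", "donac'o"}),
        ("obtenir", set()),
    ],
}


def choose_translation(il_word, il_context, tl_column=TL_COLUMN):
    """Pick a TL word; consult the context only when the entry is divergent."""
    alternatives = tl_column[il_word]
    if len(alternatives) == 1:
        return alternatives[0][0]  # no divergency, so no extra work is done

    best_word, best_score, default = None, 0, None
    for word, cues in alternatives:
        if not cues and default is None:
            default = word
        score = len(cues & il_context)  # stand-in for SWESIL's pattern search
        if score > best_score:
            best_word, best_score = word, score
    return best_word if best_word is not None else default


simple = choose_translation("abol'i", {"leĝ'o"})
with_cue = choose_translation("il-word", {"salajr'o", "monat'o"})
without_cue = choose_translation("il-word", {"libr'o"})
```

The design point the sketch tries to preserve is that the language-specific distinctions live only in the TL column, so an IL text is inspected for them only when the chosen target language actually requires it.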

Fig. II.2.4. An IL entry with its French translations. Note the way divergency is marked and detailed in IL-coded form. In this way IL-to-TL ambiguity can be linked back to the LKB to be solved using the semantic information and mechanisms present there.

[Figure: IL entry abol'i (abolish) with the French translation abroger; the remaining figure content is not recoverable.]

[Table fragment with the column headings FIRST ARGUMENT and RELATOR; surviving entries include konsili'o, karton'o and foli'o (cardboard).]

The relators constitute a refined system of thematic roles within the dependency pairs, allowing compact coding of detailed world-knowledge.

The CORE

The CORE is the set of IL words and morphemes that have not been given key-terms which would link them to more general concepts. They are themselves the most general concepts available and form the top of the hierarchy. As mentioned earlier, frequency counts of words used in traditional dictionary definitions show that there is a small number of words that form the basic stock of defining terms. There is considerable overlap between these words and the kind of primitives used by many AI researchers (WILKS, 1972, CALZOLARI, 1984) which seems to justify the belief that these words stand for the basic concepts in which all words can ultimately be defined.

There is a problem here, however. It appears that it is almost impossible to establish a single coherent set of such primitive words which covers all requirements of semantic decomposition. When decomposition reaches the level of the most primitive elements, the boundaries between the elements begin to fade, and it becomes increasingly difficult to select a representation that clearly expresses the difference between one concept and another.

Interestingly enough, Amsler, in his study of structure in dictionaries [AMSLER, 1980], found that the hierarchies that could be constructed from dictionary definitions did not lead to a single top-node, but rather to a cluster of words the meaning of which could not be defined 'without using another member of some small set of near synonymous words ...' [AMSLER, 1980]. It appears that instead of a single word as top-node, many of the subtrees in the lexical taxonomy have a primitive cluster as top-node, the elements of which are more or less equivalent.

In traditional dictionaries the use of such clusters often goes unnoticed, partly because most users of a dictionary will not pursue the path of a word up to its most primitive defining concept, but will stop as soon as they have found the desired information, and partly because synonymity and fuzziness are such basic characteristics of natural language communication that they are rarely noticed by human language users. They rely on their knowledge of the world to resolve any ambiguities, and will do this so easily that they will not even be noticed.

For the purpose of SWESIL's semantic processing, however, the fuzziness and ambiguity caused by the absence of a single top-node in a hierarchy could be dangerous. Decisions about conceptual proximity are based on the existence of a strict direction in the hierarchy. Lack of this direction, as in the case of the clusters found by Amsler, can lead the system into an endless loop: thinking it can find a more primitive term, while in fact it is going round in a closed circle of synonymous words.

In the LKB, therefore, such clusters are avoided by marking the elements that are part of them as CORE words, causing SWESIL to treat them as top nodes (see also ch. II.3). To prevent loops from occurring unnoticed - not unlikely considering the size of the LKB - the dictionary is continually tested for the occurrence of loops, both when new words are entered and during processing of actual input. The lexicographer can then remove the loops by entering an additional word-sense, removing faulty ones, adjusting the links of a sense to its top-node etc., or may decide to add an element to the CORE or delete one from it.

Apart from the need to avoid closed loops, efficiency requirements may also play a role when deciding to add words to the CORE. Words or morphemes that are used very frequently as defining elements for other words can be marked as CORE words, causing SWESIL to stop processing there instead of continuing right up to the actual top of the tree. This is SWESIL's way of 'chunking' information: when a certain combination of characteristics is used very often, it ceases to be a composite structure and starts to behave as a single, indivisible unit. If necessary, when the concept itself is the subject of some reasoning process and more detailed information is needed, it can be decomposed further. As this kind of inclusion in the CORE depends on frequency of occurrence in the LKB, it can largely be done automatically. SWESIL continually collects frequency information about the defining terms used, and can signal those elements that cross a certain threshold. It is then up to the lexicographer to decide whether or not the element should be included in the CORE. In a future implementation of SWESIL this process may become completely automatic.
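The two checks described above, loop testing and frequency signalling, can be illustrated with a minimal sketch. It assumes that the LKB can be reduced to a simple mapping from each word to its key-term. The function names, the threshold value and the toy entries are invented for the illustration and are not part of SWESIL.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def find_loop(word: str, key_term: Dict[str, str], core: Set[str]) -> Optional[List[str]]:
    """Climb from `word` via key-terms; return the closed circle if one is hit."""
    path: List[str] = []
    position: Dict[str, int] = {}
    current = word
    # CORE words and words without a key-term are treated as top nodes.
    while current not in core and current in key_term:
        if current in position:
            return path[position[current]:]
        position[current] = len(path)
        path.append(current)
        current = key_term[current]
    return None


def core_candidates(key_term: Dict[str, str], core: Set[str], threshold: int) -> List[str]:
    """Defining terms used more than `threshold` times and not yet in the CORE."""
    usage = Counter(key_term.values())
    return sorted(w for w, n in usage.items() if n > threshold and w not in core)


# Invented toy entries (word -> key-term); not taken from the LKB.
toy_lkb = {"dom'eg'o": "grand'a", "grand'a": "vast'a", "vast'a": "grand'a"}

print(find_loop("dom'eg'o", toy_lkb, core=set()))         # the circle grand'a / vast'a
print(find_loop("dom'eg'o", toy_lkb, core={"grand'a"}))   # None: climbing stops at the CORE word
print(core_candidates(toy_lkb, core=set(), threshold=1))  # terms to signal to the lexicographer
```

In this sketch, marking one member of the circle as a CORE word is enough to make the climb terminate. The decision which member to mark remains with the lexicographer, as in the text.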

Relator Matching

The IL provides us with a large number of relators, a set which is further enlarged by the possibility of combining relators into more detailed ones, following a number of composition rules. To allow SWESIL to generalize over relators as well as over lexemes, a number of principles have been observed:

reciprocity

Most relators link two elements to each other in a directional way, i.e. the relation holds from one element to the other but not vice versa. Often, however, when there is a certain relation of A to B, there is an inverse relation of B to A. E.g.:

(34a) vetur'i per trajn'o (to be transported | by means of | train)

is the inverse pair of:

(34b) trajn'o por vetur'i (train | for the purpose of | transport)

Fig. II.2.10

equivalent relators (near-synonyms):

tra (through/over/across)   <->  en (in)
tra                         <->  sur (on)
super (above/over)          <->  sur
cel'e al (with the aim of)  <->  por (for)
dank'al (thanks to)         <->  pro (because of)
nom'e (namely)              <->  ekzempl'e (for example)
pri (about)                 <->  koncern'e al (concerning)

inverse relators:

per (by means of)           <->  por (for (the purpose of))
as (AGENT-verb)             <->  far (verb-AGENT)
a (NOUN-MODIFIER)           <->  ia-de ((kind) of)
ie-el ((place) out of)      <->  ien-el ((place) into)
nom'e (namely)              <->  est'as (is-a)

The SWESIL system has a number of relator tables it can use to map thematic roles onto each other, allowing LKB information to be matched on various surface forms. N.B.: all the equivalents and inversions in the relator tables are subject to various contextual conditions, determined by the syntactic functions of the words in the pair.

When A is an action specifically carried out with instrument B, it can be inferred that B is an object with which to carry out action A. To recognize instances of reciprocity, the LKB contains lists of relators that are each other's inverse counterparts. In this way SWESIL, finding in the context A with relation R to B, can also look under B, using inverse relation R', in an attempt to find B with relation R' to A.

inclusion

A number of relators are composed of two (or more) morphemes. In this way a general relator is restricted in meaning by a modifier. E.g.:

(35a) viv'i (ie-)en akv'o (living | (place-)in | water)
(35b) trink'i (io-)n akv'o (drink | (thing-)patient | water)

SWESIL can treat such relators first in their most restrictive meaning - i.e. it will use the composite relator in its entirety and search for the specific relation - and can then drop the modifying element, searching for the more general relation if the restrictive sense produces no match.

synonymity

There are a number of relators that are more or less synonyms of each other. These are listed in the LKB to allow the lexicographer more freedom in constructing SOLL-pairs. In addition, it may be the case that a certain relation is preferred, followed closely by one of its (near-)synonyms. In this case a score-number can be added, which tells SWESIL to down-rate a match found with the second-best relator.
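The three principles can be pictured as small table lookups. The sketch below is an illustration only: the relators are taken from the examples above, but the table layout, the function names and the down-rating value 0.9 are assumptions, and the contextual conditions mentioned in the N.B. are left out.

```python
from typing import Dict, List, Optional, Tuple

# Reciprocity: relators that are each other's inverse counterparts.
INVERSE: Dict[str, str] = {"per": "por", "por": "per", "as": "far", "far": "as"}

# Synonymity: near-synonyms with a score-number; 0.9 is an invented value.
NEAR_SYNONYMS: Dict[str, List[Tuple[str, float]]] = {
    "pri": [("koncern'e al", 0.9)],
    "koncern'e al": [("pri", 0.9)],
}


def inverse_pair(a: str, relator: str, b: str) -> Optional[Tuple[str, str, str]]:
    """Given A with relation R to B, return B with inverse relation R' to A."""
    inverse = INVERSE.get(relator)
    return (b, inverse, a) if inverse is not None else None


def relator_candidates(relator: str) -> List[Tuple[str, float]]:
    """Relators to try in order: the full form, the general form, near-synonyms."""
    candidates = [(relator, 1.0)]
    if "-" in relator:
        # Inclusion: drop the modifying element of a composite relator.
        candidates.append((relator.split("-", 1)[1], 1.0))
    candidates.extend(NEAR_SYNONYMS.get(relator, []))
    return candidates


print(inverse_pair("vetur'i", "per", "trajn'o"))
print(relator_candidates("ie-en"))
print(relator_candidates("pri"))
```

The order of the candidates reflects the text: the most restrictive relation is tried first, the general one only as a fallback, and a near-synonym carries a reduced weight.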

Closed Sets, Cyclic Data and Other Information

To account for various aspects of the world, knowledge of which constitutes a large part of the LKB, various special structures can be incorporated into the LKB when normal SOLL-pairs are not sufficient. Though none of this is implemented at the moment, the design of SWESIL plans for the inclusion of various kinds of additional information, most of which will be accessible through the CORE in the form of specially marked CORE elements which can tell SWESIL where the additional information can be found.

Closed Sets

A substantial part of human knowledge of the world concerns the way it is organized and classified. Most traditional dictionaries include a special section for this, rather encyclopaedic, kind of knowledge. To name a few groups:

Biographical references - i.e. names of famous historical and contemporary persons
Geographical references - e.g. seas, rivers, cities, countries, people
Numbers
Weights and Measures

Though this knowledge is not, strictly speaking, linguistic, much of it can be of great help when trying to understand texts. Biographical and geographical names are often used without further explanation, their references taken to be common knowledge. Weights, measures and numbers form a vocabulary in themselves, which has to be related to actual calculations to be understood, but also to fuzzy sets such as large-small, much-little, heavy-light, etc. It is this encyclopaedic knowledge which can provide the system not only with, for example, the proper translation of geographical names, but also with the concepts behind the names and the semantic preferences and constraints connected to them.

Cyclic Data

Names of time-units, for example, are not just closed sets, but most of them form cycles with a fixed order. In addition the cycles are hierarchically ordered, larger units being built up from smaller ones. A few, obvious, examples:

days of the week
months of the year
seasons of the year

We will have to provide SWESIL with a 'sense of time', enabling it to place references to actions in the proper sequence, unravel causes and effects that are not explicitly made clear other than by their sequence in time, etc.
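What distinguishes a cycle from an ordinary closed set is that every element has a successor and that the last element leads back to the first. The following sketch shows one possible representation. It is an assumption made for illustration, since the text states that none of these structures has been implemented, and English day names stand in for IL words.

```python
from typing import List

# A closed set with a fixed cyclic order (English names used for readability).
DAYS: List[str] = ["Monday", "Tuesday", "Wednesday", "Thursday",
                   "Friday", "Saturday", "Sunday"]


def shift(cycle: List[str], item: str, steps: int) -> str:
    """Element reached from `item` after `steps` moves around the cycle."""
    return cycle[(cycle.index(item) + steps) % len(cycle)]


def steps_forward(cycle: List[str], start: str, end: str) -> int:
    """Number of forward moves needed to get from `start` to `end`."""
    return (cycle.index(end) - cycle.index(start)) % len(cycle)


print(shift(DAYS, "Sunday", 1))                   # wraps around to the start of the cycle
print(steps_forward(DAYS, "Friday", "Monday"))    # distance measured in the fixed order
```

The hierarchical ordering mentioned above (larger units built up from smaller ones) is not covered by this sketch.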

Social and Cultural Organization

Our society is a complex, tangled and hierarchical structure, in which numerous groups of people fulfill special functions. A part of this structure is culturally defined and, therefore, to a certain extent language specific, but a large part seems to be of a universal nature, viz. it relates to structures that can be found in all societies. A few examples:

Governments - i.e. leaders, civil servants, political parties, administrations
Family relations
Economy - i.e. monetary systems, commerce, division of labour
Social status - i.e. titles, functions, jobs

Though complex and very often not very logical, this kind of information is so essential to our own functioning in society that we handle most of it with great ease. Reference to it is so frequent that it can hardly be avoided in any natural language processing system.

Abbreviations

A very large and continually changing group of identifiers is formed by the abbreviations. Social organizations, companies, political parties and scientific institutions all frequently use abbreviations instead of full names. Another group is formed by the products of industry and science - chemical products, machines, systems, formulas, scientific principles, natural phenomena; abbreviations for them are widely used. It is important for an NLP system to be able to find the correct referent of an abbreviation: partly because of the semantic information connected with it, but also to be able to recognize variation in usage - viz. a full name which is later only referred to by the abbreviation - and the references necessary to form a coherent interpretation of a text. In addition, many abbreviations change in translation, either because the full name is translated before the new abbreviation is formed, or because of historical conventions.


Chapter 3

Disambiguation with SWESIL

This chapter describes the way SWESIL - the Semantic Word Expert System for the Intermediate Language - is used to aid disambiguation by calculating semantic preferences and ordering alternatives. SWESIL's main task is to apply semantic preference rules and context expectations to the input, and to rank the alternative interpretations of ambiguities according to the way they match their context. In contrast to the syntactic parser, SWESIL itself is non-deterministic, and will not judge alternatives to be "correct" or "incorrect", but will rank them on a scale of semantic appropriateness instead - a scale which will, among other things, be used at the SL-to-IL stage to reduce the burden of the disambiguation dialogue. Only in clear-cut cases, determined by various pre-set thresholds, can SWESIL discard the most unlikely alternatives.

Syntactic Analysis

To understand SWESIL properly, it must be seen as an integral part of the total DLT system, where it fulfills an important function in the semantic part of the analysis phase. The first step in the analysis phase (see Chapter II.1) is the analysis of the source text into syntactic structures. The DLT system makes use of dependency trees (for a full description cf. SCHUBERT '86) that represent the syntactic relations between the words in a sentence. A few examples:

(36) Compass drawings produce images of high quality.

produce
    SUBJE  drawings
        MOD  compass
    OBJEC  images
        PREA  of
            PARG  quality
                MOD  high

Fig. II.3.1: DEPENDENCY-TREE OF A 49-WORD IL SENTENCE

IL words at the nodes, relations at the branches. Note the superficial resemblance to a semantic network. With relatively little extra work the dependency tree can be transformed into dependency pairs to be processed by SWESIL.

> pairs:

XS         RS                          YS
skrib'i    io-n (INANIMATE PATIENT)    vort'o (word)
skrib'i    io-n                        liter'o (a letter of the alphabet)

For each YS-YI match this results in a so-called word match score. As SWESIL is only interested in the extent to which the IST sense pair fulfills the expectations of XI over relation RI, and not in the relative strength of the various expectations, SWESIL takes as its final score the maximum score of all selected SOLL pairs, thus making the strongest fulfillment of XI's expectations the measure of appropriateness.

Logical Matching

The word match, which determines how similar YS and YI are, may be a complete one - when the two are exactly the same word - but more often than not there will be differences. The SOLL pairs have been chosen for their typicality: i.e. the words in the SOLL pairs are examples of the kind of words expected by X; they are in no way exhaustive lists.

When two words - YS and YI - are not exactly the same, they must be compared on a conceptual level: SWESIL must find out how far apart they are in the lexical taxonomy in the LKB. The method for doing this is called logical matching. As discussed in Chapter II.2, all word senses in the IL are defined by the addition of a key-term and super-key-term which relate the word sense to more abstract, higher-level concepts of which the word's meaning is an instance. Because of these key-terms and super-key-terms, the LKB is a taxonomy: a hierarchical structure in which all words are related to others on the basis of their meaning. To find out whether two words are related in meaning - i.e. whether they refer to the same concept at some higher level in the taxonomy - one can follow for each word a unique upward path defined by its key-term and super-key-term, until one reaches a point where the key-terms of both words match. By keeping track of the number of steps taken to reach this point, one can establish not only the kind of relation between the words - viz. the more abstract concept they both refer to - but also the strength of that relation. The more steps that have to be taken, the less closely related the two words are.

Fig. II.3.6: Some related IL words with their key-terms and super-key-terms

intenc'i
    cel'i
        kalkul'i
    plan'i
        pens'i

fluid'aj'o
    gas'o
        aer'o
    akv'o
        mar'o

esplor'i
    pri'esplor'i
        trakt'i
            komput'i
    stud'i
        ekzamen'i
            kontrol'i

Key-Term Substitution

The mechanism SWESIL uses to step through the lexical taxonomy is a recursive key-term substitution: the two IL words (YS and YI) are replaced by their respective key-terms, the key-terms are compared and the resultant score is stored. The two key-terms are then replaced by their key-terms (the original words' super-key-terms) and the process is repeated. The score at any given moment is the maximum scored so far, i.e. the strongest similarity determines the measure of conceptual closeness. Of course, a match on a lower level - i.e. closer to the actual YS and YI - should be stronger evidence that the two words are conceptually related than a match on a higher level in the lexical taxonomy. To take this into account, the system has a pre-defined value known as the attenuation factor: a match score on a certain level is multiplied by this factor for each step - i.e. each key-term substitution - taken to reach that level. If, for example, the attenuation factor is set to 0.8, a full match two steps up from the actual YS and YI would be reduced to 0.8 * 0.8 = 0.64.
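The substitution cycle and the effect of the attenuation factor can be made concrete with a short sketch. It is written iteratively rather than recursively, uses the simplest possible yes/no comparison in place of a real proximity function, and assumes a toy word-to-key-term mapping. The function names, the `max_steps` safeguard and the particular key-term links are assumptions made for the illustration.

```python
from typing import Dict


def same_word(a: str, b: str) -> float:
    """Stand-in for a proximity function: 1 for identical words, otherwise 0."""
    return 1.0 if a == b else 0.0


def logical_match(ys: str, yi: str, key_term: Dict[str, str],
                  attenuation: float = 0.8, max_steps: int = 10) -> float:
    """Substitute key-terms for both words in step, keeping the best attenuated score."""
    best = 0.0
    factor = 1.0
    for _ in range(max_steps + 1):
        best = max(best, factor * same_word(ys, yi))
        if ys not in key_term or yi not in key_term:
            break                      # one of the words is a top node
        ys, yi = key_term[ys], key_term[yi]
        factor *= attenuation          # one more key-term substitution
    return best


# Assumed toy links (word -> key-term), for illustration only.
toy_taxonomy = {"aer'o": "gas'o", "gas'o": "fluid'aj'o",
                "mar'o": "akv'o", "akv'o": "fluid'aj'o"}

print(logical_match("aer'o", "aer'o", toy_taxonomy))   # full match at the lowest level
print(logical_match("aer'o", "mar'o", toy_taxonomy))   # match two steps up: about 0.64
```

With an attenuation factor of 0.8, the second call reproduces the figure from the text: a full match found two substitutions above the actual words is worth 0.8 * 0.8 = 0.64, up to floating-point rounding.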

Calculating Proximity

The matching cycle in SWESIL has been designed to permit the calculation of a value which expresses the conceptual closeness between words. The LKB, however, expresses concepts in terms of words - or rather: morphemes. There is no level of abstraction within the LKB in which meaning can be viewed as separate from words. Ultimately, therefore, conceptual closeness is expressed as the likeness between two words. The taxonomic structure of the LKB guides the selection of the words that have to be compared with each other, but in the end these are just two symbols, meaningless in themselves, whose similarity must be expressed as a number. What must be calculated at each step of the matching cycle is how much one word resembles, on purely external criteria, the other word. This likeness, expressed as a number, is called the proximity.

One method for calculating proximity between words is to treat them as two lists of elements - the letters - and to compare those lists in two respects: how many elements they have in common, and how much the order of the elements differs [YIANILOS, 1983]. Some sophistication can be achieved by attaching weights to the letters according to their relative frequency in the language, by assigning extra weight to certain letters when they occur at marked positions - such as front or back of the list - and by varying the influence of letters that do occur in one list but are absent from the other. Such a sophisticated proximity calculation can give a finely balanced measure of similarity between words seen as lists of letters.

Fig. II.3.7: THE MATCHING CYCLE

Searching through the lexical taxonomy, SWESIL tries to match the words on many different levels. When needed, wider context matching can be used to find confirmation of scores found by normal matching.

This approach would be equally possible for the IL where, after all, words are lists of letters as well. There is a better way, however, if IL words are to be regarded as lists: they are lists of morphemes. Though words in other languages are lists of morphemes too, in most languages these morphemes undergo changes in form when combined to form words. This variation in form makes it difficult to treat them as fixed entities. In the IL no such changes take place. The IL is a truly agglutinative language, using invariable morphemes. Because of this, words can be as easily divided into morphemes as into letters. There are a number of advantages to a division into morphemes when it comes to calculating proximity:

1) Morphemes are the smallest meaning-carrying elements of the IL. Moreover, the way they combine to form words is clearly regulated - the form of the morphemes never changes, while their function is determined by their position relative to the other morphemes in the word - making it relatively easy to distinguish the central meaning - the root morpheme - from the function morphemes that modify the central meaning. This arguably places a proximity score based on morphemes much closer to semantic proximity than one based on mere letters. One could compare a proximity score based on letters to a description of the similarities between two machines in terms of the raw materials they are made of, while the proximity score based on morphemes compares the two machines in terms of their main components. The former may often be able to distinguish between machines, but the latter will come much closer to a differentiation based on the intended function of the machines, just as we want to distinguish between intended meanings of words.

2) Though the number of morphemes in the IL is much larger than the number of letters in the alphabet, the problem of number is drastically reduced by dividing the IL morphemes into classes, the largest class being the root morphemes - the unmodified carriers of meaning. This class is potentially infinite and unstable - words may be added or become obsolete as the language develops. The only function, however, these morphemes have in a proximity calculation is that of identifier. If two words have the same root morpheme they are similar; if they have different ones they are not. Besides the large and open set of root morphemes there is a limited and finite set of function morphemes: morphemes that modify the meaning of the root morphemes to some extent. These can be further divided into sub-classes, allowing a finely tuned weighting by which different types of function morphemes influence the proximity score on the basis of their position in the word.

3) The IL morphemes will in the near future be coded, each morpheme receiving a uniquely identifying code based on its class and frequency of occurrence. This will speed up processing as it reduces the number of elements the word structure contains (a 13-letter word like aŭtomobil'ist'o contains only 3 morphemes), while the frequency information that is built into the codes can be directly used to influence the weights attached to the various morphemes.

Because of the experimental nature of SWESIL, the proximity calculation that forms the basis of the matching cycle has been implemented in three different ways to test the consequences of the different strategies.

1) No Proximity Function

The simplest matching that can be done is a simple yes/no type of matching. If two words are not literally the same the system returns the score 0, otherwise it returns 1. To rule out some supposedly semantically less relevant function morphemes, the system can handle plural endings and the negation 'ne' by modifying the score somewhat (see below).

(43) No Proximity Function

IST word                             SOLL word                 MATCHES
hund'o (dog)                         hund'in'o (female dog)    0.000
hund'in'o                            hund'o                    0.000
hund'in'o                            hund'in'o                 1.000

hund'et'o (small dog)                hund'o                    0.000
hund'o                               hund'et'o                 0.000
hund'et'o                            hund'et'o                 1.000

hund'ej'o (dog-place)                hund'o                    0.000
hund'o                               hund'ej'o                 0.000
hund'ej'o                            hund'ej'o                 1.000

hund'estr'o (dog-keeper)             hund'o                    0.000
hund'o                               hund'estr'o               0.000
hund'estr'o                          hund'estr'o               1.000

hund'estr'in'o (female dog-keeper)   hund'o                    0.000
hund'o                               hund'estr'in'o            0.000
hund'estr'in'o                       hund'estr'in'o            1.000

hund'o'j (dogs)                      hund'o                    1.000
hund'o                               hund'o'j                  0.800*

*): this lower score is explained as follows: if the SOLL word contains the plural form, that plural form is an extra expectation. Absence of it in the IST word reduces the score. If the SOLL word does not contain a plural, the word is unmarked and returns the same score for singular and plural forms.
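The behaviour shown in (43) can be reproduced with a few lines of code. The sketch below assumes that morphemes are separated by apostrophes and that the plural is the final morpheme 'j'. The value 0.8 is taken from the last row of (43). The treatment of the negation 'ne' is left out, since the text gives no figures for it, and all function names are invented for the illustration.

```python
from typing import List, Tuple

PLURAL_ENDING = "j"
PLURAL_PENALTY = 0.8   # score in (43) when the SOLL plural is absent from the IST word


def split_morphemes(word: str) -> List[str]:
    """Divide an IL word into its morphemes, e.g. hund'in'o -> hund, in, o."""
    return word.split("'")


def strip_plural(morphemes: List[str]) -> Tuple[List[str], bool]:
    """Remove a final plural ending and report whether one was present."""
    if morphemes and morphemes[-1] == PLURAL_ENDING:
        return morphemes[:-1], True
    return morphemes, False


def no_proximity(ist_word: str, soll_word: str) -> float:
    """Yes/no matching with the plural handled as an extra expectation."""
    ist, ist_plural = strip_plural(split_morphemes(ist_word))
    soll, soll_plural = strip_plural(split_morphemes(soll_word))
    if ist != soll:
        return 0.0
    if soll_plural and not ist_plural:
        return PLURAL_PENALTY
    return 1.0


for ist, soll in [("hund'o", "hund'in'o"), ("hund'in'o", "hund'in'o"),
                  ("hund'o'j", "hund'o"), ("hund'o", "hund'o'j")]:
    print(ist, soll, no_proximity(ist, soll))
```

The asymmetry in the last two pairs follows the footnote to (43): only a plural in the SOLL word counts as an expectation that the IST word can fail to meet.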

2) Simple Proximity Function

In this mode the calculation is based on the number and position of the identical morphemes. Apart from handling the same function morphemes as in mode 1), balancing the proximity score can be done by attaching weights to individual morphemes. This allows for the diminishing of the influence of function morphemes relative to root morphemes. The scores produced in this mode lie on a continuous scale from 0 to 1.

(44) Simple Proximity Function

IST word                             SOLL word                 MATCHES
hund'o (dog)                         hund'in'o (female dog)    0.71
hund'in'o                            hund'o                    0.71
hund'in'o                            hund'in'o                 1.000

hund'et'o (small dog)                hund'o                    0.71
hund'o                               hund'et'o                 0.71
hund'et'o                            hund'et'o                 1.000

hund'ej'o (dog-place)                hund'o                    0.71
hund'o                               hund'ej'o                 0.71
hund'ej'o                            hund'ej'o                 1.000

hund'estr'o (dog-keeper)             hund'o                    0.71
hund'o                               hund'estr'o               0.71
hund'estr'o                          hund'estr'o               1.000

hund'estr'in'o (female dog-keeper)   hund'o                    0.55
hund'o                               hund'estr'in'o            0.554
hund'estr'in'o                       hund'estr'in'o            1.000

hund'o'j (dogs)                      hund'o                    1.000
hund'o                               hund'o'j                  0.800

3) Complex Proximity Function

In this mode a relatively large amount of knowledge about the distribution of IL morphemes is included. Various function morphemes are recognized and their influence included in the form of weights. An important addition is a frequency distribution - attaching more weight to less frequent morphemes - and the influence of length - longer words score higher than shorter words. Calculation of the score is done by first matching the stems of the words (the central content morphemes), which, if the two are identical, gives a score of 1.000, and adding to or subtracting from this score the scores for the various function morphemes surrounding the stem. Because of this, scores produced by the complex proximity function can be greater than 1.

(45) Complex Proximity Function

IST word                             SOLL word                 MATCHES
hund'o (dog)                         hund'in'o (female dog)    0.746
hund'in'o                            hund'o                    1.200*
hund'in'o                            hund'in'o                 1.931

hund'et'o (small dog)                hund'o                    1.200
hund'o                               hund'et'o                 0.700
hund'et'o                            hund'et'o                 2.056

hund'ej'o (place for dogs)           hund'o                    0.000
hund'o                               hund'ej'o                 0.000
hund'ej'o                            hund'ej'o                 2.101

hund'estr'o (dog-keeper)             hund'o                    0.000
hund'o                               hund'estr'o               0.000
hund'estr'o                          hund'estr'o               2.150

hund'estr'in'o (female dog-keeper)   hund'o                    0.000
hund'o                               hund'estr'in'o            0.000
hund'estr'in'o                       hund'estr'in'o            2.911

hund'o'j (dogs)                      hund'o                    1.200
hund'o                               hund'o'j                  1.200

*): As with plurals in 3.5) and 3.6), the difference between scores is caused by hund'o being the unmarked case, hund'in'o being marked. A marked SOLL word scores lower with an unmarked IST word than vice versa. A marked SOLL word scores higher with an equally marked IST word than an unmarked SOLL word with an unmarked IST word. Note that plural endings (as in hund'o'j) are no longer treated as semantically marked in the complex proximity function.

Experimentation with the three kinds of proximity matching soon showed mode 1, no proximity, to be much too rough and unsophisticated: it ignores important similarities because of small deviations. The method can be helpful for some kinds of error finding, but is of no real use in the working SWESIL system. Mode 2, simple proximity, has proven to be more promising, but fails to distinguish several important function morphemes. This can lead to strange results (see (&&&) above) when function morphemes that have semantic consequences are given the same weight as those which have none. Mode 3, complex proximity, gave, as expected, the best results. The addition of extra knowledge about the semantic importance of the morphemes and the weights derived from frequency of occurrence give a much more balanced and intuitively satisfying score. Also, the way longer words (i.e. those with more morphemes) tend to score higher for a full match reflects the intuition that, since longer words carry more semantic information, a full match for a long word is relatively more important than a full match for a short one. (The way the complex proximity has been implemented has also led to an increase in speed, in itself unimportant, but an advantage for researchers who have to produce long lists of figures.)

Intra-word Grammar

Complex proximity - or "quick proximity" because of its higher speed of execution - gives fairly good results, thanks to its 'knowledge' about IL morphemes. It was felt, however, that the statistical kind of knowledge available now would not be enough to make the kind of sophisticated and balanced judgements SWESIL is expected to make. Therefore the proximity function will eventually be replaced by a complete intra-word grammar which describes in detail the way morphemes combine to form words and what the semantic consequences of the combinations are. This intra-word grammar is now being implemented and will be used by the syntactic module and IST pair generator as well.

Ontological Matching

Logical matching means comparing YS and YI on rising levels of abstraction - i.e. going further and further up in the lexical hierarchy. This (recursive) process results in a score which expresses the degree of (logical) similarity between the words being compared. This word match score is then combined with the relator match score to indicate how well the expectations of XI over relator RI are fulfilled by YI. These expectations are taken from the SOLL pairs found in the definition of XI in the LKB. Now XI, just like YI, is an IL word sense, complete with key-term and super-key-term. The key-term and super-key-term are themselves IL word senses as well, complete with their own expectations in the form of SOLL pairs. It may well be that XI, when seen on a higher level of generalization - i.e. on the level of its key-term or super-key-term - contains expectations that match YI better than those on the actual IST pair level. The next step, therefore, is to apply key-term substitution on XI, retrieving the SOLL pairs that belong to this key-term, and restarting the matching cycle. To distinguish this use of the hierarchy to match expectation patterns at different levels from the use of the same hierarchy for logical matching - matching words - the former is called ontological matching.

Figure: alternative senses ranked by match score (0.745, 0.532, 0.501, 0.356, 0.199, 0.099). Legible sense labels: 6. STAGE; 4. TABLE SPREAD WITH MEAL; 1. SIDE OF A SHIP; 8. GROUP OF PERSONS ORGANIZED FOR SPECIAL RESPONSIBILITY; 5. DAILY MEAL.

Preference Semantics attempts to order possible alternatives on a scale representing likelihood or semantic appropriateness.

Ordering the Alternatives

When the matching cycle comes to an end, SWESIL's match score table contains three scores for the IST sense pair: a forward score, a backward score and a total score, derived from the first two. These scores express in the form of a number the strengths of fulfillment for the expectations of both IST senses (XI and YI). Yet by themselves these values are of little use: there is no absolute judgement that can be made, no line above which an IST pair is said to be acceptable and below which it is unacceptable. The match scores derive their significance from comparison with the scores for other IST pairs. Therefore, when all IST pairs have been processed, SWESIL produces a list sorted in order of scores (Fig. II.3.8). At each word interval, as the syntactic tree structure grows, SWESIL produces additional lists, one for each word pair on each syntactic trail. From time to time, as determined by syntactic processes, an evaluation of these lists will be made, to decide whether some alternatives can safely be abandoned. This evaluation will be done on the basis of thresholds which determine when the distance between a high-scoring alternative and a low-scoring one is large enough for the latter to be abandoned, and will also take into account any reinforcing scores - i.e. whether there are high-scoring alternatives forming a sequence in which one IST pair confirms another. Of course, lists of alternatives are dropped for those trails that are aborted by the parser for syntactic reasons, and whose sorted alternatives are no longer syntactically valid. In a future implementation the macro context module (see Chapters II.2 & IV.2) will add more powerful decision procedures based on the contents of the text as a whole. The undecided alternatives that are left when the entire sentence has been processed are to be used in the presentation of the disambiguation dialogue (see Chapters II.2 & II.4).
To ask the human user sensible questions and avoid presenting absurd or confusing alternatives, the dialogue module will consult the SWESIL lists and construct its questions around the highest scoring alternatives. The others will only be presented if the user rejects those suggested first.
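The ordering and the threshold-based evaluation can be sketched as follows. The sketch assumes that each alternative has already been given a single total score. The labels, the threshold value and the function names are invented, and reinforcing scores between neighbouring IST pairs are not modelled.

```python
from typing import Dict, List, Tuple


def rank(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort the alternatives by total score, highest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def prune(ranked: List[Tuple[str, float]], max_distance: float) -> List[Tuple[str, float]]:
    """Abandon alternatives that lie more than `max_distance` below the best one."""
    if not ranked:
        return []
    best = ranked[0][1]
    return [(alt, score) for alt, score in ranked if best - score <= max_distance]


# Invented labels with hypothetical total scores.
totals = {"sense_a": 0.199, "sense_b": 0.745, "sense_c": 0.532}

ranked = rank(totals)
print(ranked)                           # list sorted in order of scores
print(prune(ranked, max_distance=0.3))  # only the clear-cut low scorer is dropped
```

Note that the pruning is relative: an alternative is abandoned because of its distance from the best candidate, not because it falls below some absolute line, in keeping with the remark above that no absolute judgement can be made.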

Second-Order

Matching

In the present implementation of SWESIL) ordering of the alternatives is only done for each SL word pair. The X and Y elements of the alternative 1ST pairs are only compared with each other's expectations) without considering the other words in their context.

127

An important addition, now being designed, is the so-called second-order matching process, which should enable SWESIL to take into account not only the expectations of X for Y and of Y for X, but also the expectations of X for Y, or Y for X, given a third word somewhere else in the context and standing in the proper relation to either X or Y. This principle is best explained with an example. The phrase:

(46a) a compass needle

can be translated into the IL in two ways:

(46b) nadl'o de kompas'o (instrument for finding the north)

(46c) nadl'o de cirkel'o (instrument for drawing circles)

The dictionary definitions of both kompas'o and cirkel'o express an expectation for the word needle - a needle is a part of both. What differs is the function of the needle in each instrument. The function of the needle can be found in the second-order SOLL pairs - i.e. pairs that do not have XI on either side of the relator. For kompas'o, second-order pairs with nadl'o are:

(47a) nadl'o as montr'i (= AGENT of 'to show')
      montr'i io-n nord'o ('the north' = INANIMATE PATIENT of 'to show')

whereas a second-order SOLL pair for cirkel'o is:

(47b) turn'i ie-ĉirkaŭ nadl'o ('to turn (place-)about a needle')

If the sentence now continues:

(48) a compass needle pointing ...

then SWESIL, when encountering needle pointing - which translates to montr'ant'a a nadl'o - can match this (via certain transformation rules) with (47a), thus gathering additional evidence for the interpretation of compass as kompas'o rather than cirkel'o.
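As an illustration of the principle only (second-order matching is described here as still being designed), the compass example can be sketched as a lookup of context pairs among the second-order pairs of each candidate sense. The data layout, the function name `second_order_support`, the simple counting of matches, and the assumption that the transformation rules have already rewritten the context pair into dictionary form are all hypothetical; the relator spelling in the cirkel'o entry is likewise assumed.

```python
# Second-order pairs per dictionary entry, stored as (left, relator, right).
# The contents paraphrase examples (47a) and (47b); the layout is an assumption.
SECOND_ORDER = {
    "kompas'o": [("nadl'o", "as", "montr'i"), ("montr'i", "io-n", "nord'o")],
    "cirkel'o": [("turn'i", "ie-ĉirkaŭ", "nadl'o")],  # relator spelling assumed
}


def second_order_support(candidates, context_pairs):
    """Count, for each candidate sense, the context pairs that occur among
    its second-order pairs.

    Context pairs are assumed to be normalized already, e.g. the participle
    construction for "needle pointing" rewritten as nadl'o as montr'i.
    """
    support = {}
    for sense in candidates:
        known = set(SECOND_ORDER.get(sense, []))
        support[sense] = sum(1 for pair in context_pairs if pair in known)
    return support


# "a compass needle pointing ...": the needle is the AGENT of 'to show'.
context = [("nadl'o", "as", "montr'i")]
support = second_order_support(["kompas'o", "cirkel'o"], context)
print(support)
```

In this toy setting only kompas'o receives support from the wider context, which is the kind of additional evidence the text has in mind.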


Chapter 4

The Disambiguation Dialogue

In spite of the development of advanced techniques for syntactic and semantic processing and significant progress in the field of AI, Fully Automatic High Quality Machine Translation is generally held to be at least 25 years away. In the meantime, either some form of human assistance has to be incorporated, or the MT system has to be very much restricted in either the quality of the output it produces or the range of language it can handle. For systems that aim at High Quality MT, three kinds of human assistance are generally distinguished:

1) Post-editing: the system produces a complete translation, which is then checked and corrected by a human editor - often a qualified translator.

2) Pre-editing: before the system is given text to process, the text is prepared by human specialists to remove various kinds of problems - such as ambiguities, complex sentences, unknown words, etc.

3) Interactive assistance: while the system is processing text, it can report unsolved problems to a human user, who can then solve them on-line, enabling the system to continue processing.

[Figure: fragment of a dependency tree. Recoverable nodes: "concerning", "the", "course", "of", "development", "economy", "board", "agricultural"; recoverable relation labels: PARG, PREA, ATR1.]

Table III.[?] IL translations resulting from the test

1. agricultural    agrikultur'a
2. authorities     instanc'o'j; aŭtoritat'ul'o'j; rajt'ig'o'j; aŭtoritat'o'j
3. board           tabul'o; karton'o; pension'nutr'ad'o; konsili'o
4. concerned       okup'ajt'a; koncern'ajt'a; implik'ajt'a; mal'trankvil'ig'ajt'a
5. course          kurs'o; daŭr'o; voj'o; evolu'ad'o; ir'direkt'o
6. decision        decid(em)o
7. development     nov'aĵ'o; aper'aĵ'o; mal'san'iĝ'o (+ je); dis'volv(iĝ)o; evolu(ig)o; el'labor(i*)o; rivel(iĝ)o; kresk[...]

[Score-table residue: candidate translations of "endorsed" (ĝir'i, ĝir'i with the gloss trans'don'i, aprob'i), of "authorities concerned" (instanc'o'j, aŭtoritat'ul'o'j and aŭtoritat'o'j with koncern'ajt'a) and of "decision" (decid'o, decid'em'o); the scores are not recoverable.]

konsili'o de konsili'o karton'o de rivel'o voj'o de rivel'o ir'direkt'o de rivel'o voj'o ia'-'el kresk'o voj'o ia'-'el kresk'o kurs'o de nov'ajx'o kurs'o de aper'ajx'o kurs'o da nov'ajx'o kurs'o da aper'ajx'o evolu'ad'o de nov'ajx'o evolu'ad'o de aper'ajx'o kurs'o de sxangx'o kurs'o de evolu'o kurs'o pri mal'san'igx'o kurs'o ia'-'el mal'san'igx'o kurs'o <lecion'seri'o> de mal'san'igx'o kurs'o de kresk'o kurs'o da mal'san'igx'o voj'o pri nov'ajx'o voj'o ia'-'el nov'ajx'o voj'o de nov'ajx'o voj'o da nov'ajx'o voj'o pri nov'ajx'o voj'o ia'-'el nov'ajx'o voj'o de nov'ajx'o voj'o da nov'ajx'o voj'o de dis'volv'o voj'o de dis'volv'o kurs'o ia'-'el kresk'o kurs'o de rivel'o

"development of the economy"

Scores, in descending order: 2.060, 1.906, 1.873, 1.448, 1.350, 1.200, 0.960, 0.768, 0.768, 0.768, 0.723, 0.678, 0.678, 0.614, 0.614, 0.492, 0.469, 0.381, 0.378, 0.328, 0.328, 0.324, 0.324, 0.259, 0.259, 0.259, 0.259, 0.203, 0.195

[Score-table residue: the IL candidates can no longer be aligned with the scores. Recoverable left-hand words: kresk'ig'o, dis'volv'igx'o, sxangx'igx'o, el'labor'o, evolu'o, evolu'ig'o, sxangx'o, rivel'o, ek'est'o, nov'ajx'o, aper'ajx'o; relators: de, de'n; right-hand words: ekonomi'o, ekonomi'o with the gloss situaci'o, sxpar'em'o, sxpar'o.]

"the agricultural economy"

0.131  agrikultur'a a ekonomi'o

[Residue of the stage 1-2 results: [...] a instanc'o'j is konfirm'i; [...]'ig'ajt'a a aŭtoritat'ul'o'j is ĝir'i; [...] a aŭtoritat'ul'o'j is ĝir'i]


3. agricultural board

The input has now switched to a different branch of the tree (Fig. III.3). Consequently the semantic module, too, has to start a separate chain fragment:

0.581  agrikultur'a a konsili'o

decision of the board

The top-ranking pair here ties in nicely with stage 3, so we have an unambiguous choice for this branch of the tree:

1.781  decid'o de agrikultur'a a konsili'o

4. agricultural economy

The process again jumps to a different tree branch:

0.131  agrikultur'a a ekonomi'o

development of the economy

The highest-scoring pair fits onto the previous result, producing the chain fragment:

2.191  evolu'ig'o de'n agrikultur'a a ekonomi'o

course of development

Two pairs score equal first. Neither is compatible with the existing chain fragment, so we have to look back at the other results and start two new chains:

2.099  ir'direkt'o de ŝanĝ'o de'n agrikultur'a a ekonomi'o
2.779  evolu'ad'o de kresk'o de agrikultur'a a ekonomi'o

At the same time we have to maintain the previous chain, though with an undefined link for the translation of "the course of", as evolu'ig'o failed to score on this pair:

2.191  * * evolu'ig'o de'n agrikultur'a a ekonomi'o

5. decision concerning the course

This is the pair which ties together the results of steps 3 and 4. Here, however, no positive scores were obtained. Since the relator in this pair is unambiguous, and step 3 led to only one translation, we can combine that fragment with the top-scoring fragment for step 4:


4.560  decid'o de agrikultur'a a konsili'o koncern'e al evolu'ad'o de kresk'o de agrikultur'a a ekonomi'o

6. endorsed the decision

The top-scoring translation can be tied on to the chain from stage 5:

5.520  aprob'i n decid'o de agrikultur'a a konsili'o koncern'e al evolu'ad'o de kresk'o de agrikultur'a a ekonomi'o

However, we have now reached the top of the tree, and it is time to link up with the left-hand branch from stages 1 and 2. The problem is that none of the main verbs suggested there is aprob'i. Consequently we have to develop each of the stage 1-2 chains forwards, and at the same time develop the stage 3-6 chain backwards, in order to find the highest total score. To start with the latter possibility, we find that the highest stage 2 score with aprob'i is

0.532  aŭtoritat'ul'o'j is aprob'i

which couples with the equal-first scorers at stage 1 to give:

1.197  mal'trankvil'ig'ajt'a a aŭtoritat'ul'o'j is aprob'i
1.197  okup'ajt'a a aŭtoritat'ul'o'j is aprob'i

These can now be linked to the right-hand branch, giving two equal-scoring complete chains for the sentence in which only the first word differs:

6.717  mal'trankvil'ig'ajt'a/okup'ajt'a a aŭtoritat'ul'o'j is aprob'i n decid'o de agrikultur'a a konsili'o koncern'e al evolu'ad'o de kresk'o de agrikultur'a a ekonomi'o

It now only remains to develop the stage 1-2 chains forwards. As neither of the main verbs suggested there (konfirm'i, ĝir'i) scored at all at stage 6, each of these fragments can be coupled with the result of stage 5. However, as we are only interested in the top scorers, we can discard the pairs with ĝir'i along the way, leaving a single IL chain for the total sentence, in which only the exact sense of the word instanc'o'j remains in doubt. (This is of no great consequence, since the final output will in any case be a simple IL string, with key-terms deleted.) This interpretation now reads:

6.396  okup'ajt'a a instanc'o'j is konfirm'i n decid'o de agrikultur'a a konsili'o koncern'e al evolu'ad'o de kresk'o de agrikultur'a a ekonomi'o
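The chain totals quoted in this walk-through are consistent with simple addition of fragment scores. The following sketch checks that arithmetic with the figures given in the text; the helper name `chain_total` and the use of exact decimal arithmetic are choices made for this illustration, and the additive rule itself is inferred from the reported totals rather than stated as SWESIL's definition.

```python
from decimal import Decimal


def chain_total(*fragment_scores):
    """Add fragment scores exactly (Decimal avoids binary rounding noise)."""
    return sum((Decimal(s) for s in fragment_scores), Decimal("0"))


# Fragment totals quoted in the text.
board_fragment = "1.781"    # decid'o de agrikultur'a a konsili'o
course_fragment = "2.779"   # evolu'ad'o de kresk'o de agrikultur'a a ekonomi'o
stage5 = chain_total(board_fragment, course_fragment)
print(stage5)               # 4.560

stage6 = Decimal("5.520")   # chain after tying on aprob'i n decid'o
subject_fragment = "1.197"  # [...] a aŭtoritat'ul'o'j is aprob'i
backward_chain = chain_total(stage6, subject_fragment)
forward_chain = Decimal("6.396")  # chain built forwards with konfirm'i

best = max(backward_chain, forward_chain)
print(backward_chain, best == backward_chain)   # 6.717 True
```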

Comparison of the total scores for the two complete translations shows that the backward-developed chain has a slight advantage, scoring 6.717 as against 6.396. This, then, is our final selection, which on conversion to standard IL would read:

La mal'trankvil'ig'ajt'a'j/okup'ajt'a'j aŭtoritat'ul'o'j aprob'is la decid'o'n de la agrikultur'a konsili'o koncern'e al la evolu'ad'o de la kresk'o de la agrikultur'a ekonomi'o.

A rough back-translation into English should give an idea of the significance of the discrepancies (underlined):

The authorities (i.e. persons in authority) concerned (i.e. disturbed/occupied) approved the decision of the agricultural board concerning the development of the growth of the agricultural economy.

The highest-scoring "correct" translation (of which there are several possible versions) has a total score of 5.334, as against the 6.717 obtained for the above version. Without the aid of second-order matching or other information from the wider context it is difficult to resolve the remaining ambiguity of mal'trankvil'ig'ajt'a/okup'ajt'a. One possibility would be to play safe by choosing the translation which also appears in the second-best chain: okup'ajt'a.

An alternative approach to this whole pair-chaining process would be to wait until the end of the sentence before attempting to link the individual pairs together. This would allow the use of a strategy reminiscent of recent work in speech recognition (the so-called "island-driven" method). It consists in starting with those items which are least in doubt and working out from there. In our example, the pair which gives the least equivocal result (as measured by the ratio between the first and second highest scores) is:

"endorsed the decision" -> aprob'i n decid'o

Working out from there would in actual fact lead to the same IL chain selected by the branch-driven method used above. However, there is no reason why this should necessarily be so. The chain selected first by the island-driven technique can only be regarded as a provisional hypothesis. In order to find the highest-scoring total chain, this method still requires chain scores to be computed for all translations which score higher than the hypothesis chain on any of the individual pairs. The branch-driven technique seems likely to be more efficient and more consistent, as well as being more in harmony with the syntactic strategy. It also normally permits chain computation to begin well before the end of a sentence.

Whatever the chain-building method, the resultant IL pairs can now be combined into a dependency tree structure (Fig. III.4). The words and relators in bold type may be regarded as correct translations. The remainder (underlined) spotlight cases where recourse to other parts of the DLT system (syntactic constraints, text analysis or even the disambiguation dialogue) might be needed to achieve a sound IL translation.

Fig. III.4

aprob'i
  is
    aŭtoritat'ul'o'j
      a
        okup'ajt'a
  n
    decid'o
      de
        konsili'o
          a
            agrikultur'a
      koncern'e al
        evolu'ad'o
          de
            kresk'o
              de
                ekonomi'o
                  a
                    agrikultur'a

A study in depth of the matches involved in this test sentence showed up a number of points where improvements could be made, particularly in the Relator Database. In the phrase "the course of development of the economy", for example, it was found that adequate information was present in the Semantic Dictionary to permit SWESIL to extract the correct translation, but that certain omissions in the Relator Database had prevented this information from being fully utilised.

Sentences with Second-Order Matching

Second-order matching means using secondary valencies in the Semantic Dictionary to resolve or reinforce word selection based on primary pairs.


For example, in the dictionary entry for cirkel'o ("a pair of compasses") we find a primary (or first-order) pair linking this instrument to the act of drawing. We also find a secondary (or second-order) pair further specifying the act of drawing by linking it to an object such as a circle or curve. Now given a simple sentence such as

Draw a circle with a compass.

the syntactic module will produce a dependency-tree analysis along the lines of Fig. III.5.

From this tree SWESIL will receive two input pairs, one linking "draw" with "circle", the other linking "draw" with "compass". There is, however, no direct relation between "circle" and "compass". Thus the dictionary information linking these two concepts cannot be utilised until second-order matching is implemented.

Fig. III.5 Dependency tree for "Draw a circle with a compass."

draw
  OBJ
    circle
      ATR1
        a
  PREA
    with
      PARG
        compass
          ATR1
            a

In this specific example, where the alternative translation of "compass" (kompas'o - "a magnetic compass") is unlikely to yield any first-order connection with the act of drawing, the second-order connection with a circle or curve would only serve to reinforce an already clear choice. In other examples, however, this secondary reinforcement might well make all the difference between success and failure to disambiguate the text. At the time of writing, a fresh series of tests is in preparation aimed at defining the best procedure to be followed in checking for second-order matches and the ways such additional scores can be integrated into the overall score sheet.


To return to the questions raised at the beginning of this chapter, we can conclude that the tests carried out to date have provided reasonably reassuring answers. In general, SWESIL was able to follow a logical and productive path through the lexical hierarchy. Exceptions were reported and corrected. The number of matches involved was impressive, but the search remained under control and no explosion threatened. The expansion of abbreviations, transformation of complex IL forms, substitution of relators and other rule-driven extensions of the search functioned well, on the whole. Errors were traceable and could be corrected easily.

One important question remains open for further research: what is the influence of the depth-of-search threshold, i.e. does unlimited search through the hierarchy provide much more (or more useful) information than more limited searches? This is a complex question, since the answer is obviously also influenced by the attenuation factor (which reduces scores for matches at higher levels). Some systematic experimentation with several variables will be necessary before a working conclusion can be reached.
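The interaction between the two variables can be made concrete with a toy calculation. The multiplicative form of the attenuation, the function names and the example matches below are all assumptions made for the sake of illustration; the text states only that scores are reduced for matches at higher levels of the hierarchy.

```python
def attenuated_score(base_score, level, attenuation):
    """Score of a match found `level` steps up the hierarchy.

    Assumed form: every level multiplies the score by `attenuation`,
    where 0 < attenuation <= 1.
    """
    return base_score * attenuation ** level


def best_match(matches, max_depth, attenuation):
    """Best attenuated score among `matches`, given as (level, base_score).

    Matches above `max_depth` are ignored; max_depth=None means that the
    search through the hierarchy is unlimited.
    """
    scores = [
        attenuated_score(score, level, attenuation)
        for level, score in matches
        if max_depth is None or level <= max_depth
    ]
    return max(scores, default=0.0)


# Hypothetical case: a weak direct match and a strong match three levels up.
matches = [(0, 0.25), (3, 1.0)]
for attenuation in (0.5, 1.0):
    limited = best_match(matches, max_depth=1, attenuation=attenuation)
    unlimited = best_match(matches, max_depth=None, attenuation=attenuation)
    print(attenuation, limited, unlimited)
```

With strong attenuation the distant match adds nothing to the limited search, whereas without attenuation it dominates the result. Under these assumptions, the value of a deeper search cannot be judged independently of the attenuation factor, which is why the question calls for experimentation with several variables at once.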


Chapter 4

Melby Test Results

At the time of writing, the databases and software to be used for the Melby Test were still in preparation. For this reason, no actual results of the test can be presented here. What can be presented is an example of the type of preliminary syntactic analysis which Melby's students were required to perform in their function of "human syntactic module". An in-house test of the feasibility of this procedure was carried out in Utrecht using a local third-year student of English as subject. After an explanatory introduction about the nature of the Melby Test, the subject was given the following (summarized) instructions:

1. Work through the material sentence by sentence.

2. Identify pronouns with their antecedents. Otherwise treat each individual sentence as an independent unit.

3. Decide on a single syntactic analysis of the sentence. You are not required to produce a complete and explicit syntactic analysis. All you need to do at this stage is to think about the syntactic structure and resolve any structural ambiguities in your own mind.

4. Identify the content words in the input sentence. As a rule of thumb, you may count as content words all verbs (except modals and auxiliaries), all nouns, adjectives and adverbs (except intensifiers) -- thus leaving auxiliary verbs, pronouns, articles and other determiners, intensifiers, prepositions, conjunctions and perhaps numerals as function words.

5. Augment the content words with basic part-of-speech information.

6. List all the possible syntactically related pairs of content words which can be derived from the original text. It is generally advisable to carry out a preliminary expansion of the original text, filling in such things as implicit conjunctions, pronoun antecedents etc. which can help to make the syntactic structure more transparent.

7. For each pair, specify the type of relation involved. First, function words such as prepositions, the infinitive particle "to" etc. can be used to help specify the type of relation. Second, you can refer to the Relation Table and select relations from those suggested there. If you feel the list is incomplete, you are free to suggest other relations of your own invention.

Table III.10 shows the relation table used.

Table III.10 SYNTACTIC RELATION TABLE

N.B.
1. The presence in the text of a variable relator such as a preposition is shown by "...".
2. It may be necessary to reverse the order of the content words in the word pair to match one of the relations in the table.
3. The table is not complete. Other relations can be generated as required.

Relation type          Example

VERB-"to"-VERB         require to provide
VERB-OBJECT-NOUN       require the subdivision
VERB-ADNT-NOUN         go home
VERB-"..."-NOUN        work for hours
VERB-COM-ADJ           paint [something] green
VERB-COM-NOUN          elect [someone] president
VERB-ADNT-ADV          drive carefully
NOUN-SUBJECT-VERB      the management requires
NOUN-BE-PAST           the subdivision is required
NOUN-HAVE-PAST         the management has required
NOUN-BE-PRES           the division is producing
NOUN-MOD-NOUN          documentation subdivision
NOUN-APPO-NOUN         the language Esperanto
NOUN-POSS-NOUN         the president's men
NOUN-MOD-PAST          aluminium based
NOUN-MOD-PRES          electricity generating
NOUN-"..."-NOUN        choice of words
NOUN-POST-PAST         the authorities concerned
ADJ-MOD-NOUN           green walls
ADJ-"..."-NOUN         necessary for success
ADV-MOD-VERB           slowly awake
ADV-MOD-ADJ            slightly green
ADV-MOD-ADV            extremely slowly

Abbreviations:

ADJ     Adjective
ADNT    Adjunct
ADV     Adverb
APPO    Apposition
BE      Any form of the auxiliary verb "to be"
COM     Complement
HAVE    Any form of the auxiliary verb "to have"
MOD     Modifier
PAST    Past participle (verb or adjective)
POSS    Possessive
POST    Postmodifier
PRES    Present participle (verb or noun or adjective)
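For readers who wish to experiment with the relation table, it can be held as a simple set of relation types against which a proposed word pair is checked. The sketch below is illustrative only: the function name `lookup`, the convention of writing function-word relators in double quotes, and the fallback to the variable relator "..." are assumptions, not part of the test procedure described here.

```python
# Relation types as listed in the syntactic relation table.
RELATION_TABLE = {
    'VERB-"to"-VERB', "VERB-OBJECT-NOUN", "VERB-ADNT-NOUN", 'VERB-"..."-NOUN',
    "VERB-COM-ADJ", "VERB-COM-NOUN", "VERB-ADNT-ADV", "NOUN-SUBJECT-VERB",
    "NOUN-BE-PAST", "NOUN-HAVE-PAST", "NOUN-BE-PRES", "NOUN-MOD-NOUN",
    "NOUN-APPO-NOUN", "NOUN-POSS-NOUN", "NOUN-MOD-PAST", "NOUN-MOD-PRES",
    'NOUN-"..."-NOUN', "NOUN-POST-PAST", "ADJ-MOD-NOUN", 'ADJ-"..."-NOUN',
    "ADV-MOD-VERB", "ADV-MOD-ADJ", "ADV-MOD-ADV",
}


def lookup(pos1, relator, pos2):
    """Return the matching table entry, or None if the relation is not listed.

    A relator written in double quotes is taken to be a function word from
    the text; if it is not listed literally, the variable relator "..." is
    tried instead.
    """
    literal = f"{pos1}-{relator}-{pos2}"
    if literal in RELATION_TABLE:
        return literal
    if relator.startswith('"') and relator.endswith('"'):
        generic = f'{pos1}-"..."-{pos2}'
        if generic in RELATION_TABLE:
            return generic
    return None


print(lookup("VERB", '"for"', "NOUN"))   # work for hours
print(lookup("ADJ", "MOD", "NOUN"))      # green walls
print(lookup("NOUN", "BE", "ADJ"))       # not in the table
```

A `None` result corresponds to the case in which the student is free to propose a relation of his or her own; reversing the order of the two content words, as allowed by note 2, is left out of this sketch.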

The test material consisted of a set of 13 independent sentences. Two examples, together with the expected analysis, are given in Table III.11. The sentences had been specially compiled to include as many syntactic stumbling blocks as possible. They therefore represented a kind of "worst case" for the procedure under test.

Table III.11 Examples of preliminary dependency-pair analysis

1. The deficit was covered by the engineers and teachers paying higher contributions.

deficit        NOUN  -BE-         PAST  covered
covered        PAST  -"by"-       PRES  paying
engineers      NOUN  -SUBJECT-    PRES  paying
engineers      NOUN  -"and"-      NOUN  teachers
teachers       NOUN  -SUBJECT-    PRES  paying
paying         PRES  -OBJECT-     NOUN  contributions
higher         ADJ   -MOD-        NOUN  contributions

2. The project coordinator may take exception to students implementing his design without reference to the source.

project        NOUN  -MOD-        NOUN  coordinator
coordinator    NOUN  -SUBJECT-    VERB  take
take           VERB  -OBJECT-     NOUN  exception
exception      NOUN  -"to"-       PRES  implementing
students       NOUN  -SUBJECT-    PRES  implementing
implementing   PRES  -OBJECT-     NOUN  design
implementing   PRES  -"without"-  NOUN  reference
coordinator    NOUN  -POSS-       NOUN  design
reference      NOUN  -"to"-       NOUN  source

The first general observation was that the subject felt constrained by the need to relate words to each other rather than phrases. In analyzing the string "more advanced space hardware", for example, she felt that "advanced" should relate to "space hardware" and not just to "hardware". This is a natural result of her specific linguistic background; the introduction now contains a note on the dependency grammar model chosen for DLT.

From the 13 test sentences the subject produced a total of 100 word pairs, as against 96 previously listed by the experimenter. The first striking difference was in the resolution of syntactic ambiguities. Five of the sentences were given a different syntactic structure from that judged most likely by the experimenter. In one case the difference was unimportant. The other four cases could be attributed to incomplete grasp of the sentences' meaning. It was considered unlikely that native English speakers would have chosen these interpretations. However, in order to judge the subject's success in pair identification, her interpretations of ambiguities were accepted as plausible.

Of the 100 pairs thus produced, 10 were judged incorrect. Of these 10 pairs, 5 errors concerned the choice of relator where apparently the actual relation had been correctly identified. These mistakes could easily be eliminated by improving the instructions. Of the remaining 5 errors, 3 concerned the identification of part-of-speech ("failure" identified as a verb, the hyphenated form "high-technology" as a noun instead of adjective and "endorsed" as a verb instead of a past participle). However, none of these errors would have been fatal; SWESIL would have compensated for them in the specific pair contexts. The last 2 errors were serious because they concerned mistaken relations. In the sentence

"For the government to make a new approach to industry would represent real progress."

the noun "approach" was identified as the subject of the verb "represent" and "for the government" was treated as a prepositional phrase dependent on "represent". It is to be hoped that a native speaker of English would not fall into this kind of confusion. However, to be on the safe side, some examples of difficult constructions like this were added to the instruction manual.

Although the subject identified 4 more pairs than the experimenter (mainly by adding pairs for quantifiers etc. which the experimenter had chosen to disregard), 4 other, important relations were omitted. In 2 of these cases the oversight would have been discovered if the subject had drawn a rough network of the relations in the sentence. This would have shown up a discontinuity between different parts of the sentence. Consequently the revised instructions recommend the student to draw a network as a check on completeness.
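The completeness check recommended to the student, drawing a network and looking for discontinuities, amounts to testing whether the word pairs form a single connected graph. A minimal sketch of such a check follows; the function name `components` is hypothetical, and the pairs are those of the first example sentence in Table III.11 with the relators left out.

```python
from collections import defaultdict


def components(pairs):
    """Group the words occurring in `pairs` into connected components."""
    graph = defaultdict(set)
    for left, right in pairs:
        graph[left].add(right)
        graph[right].add(left)
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            word = stack.pop()
            if word in group:
                continue
            group.add(word)
            stack.extend(graph[word] - group)
        seen |= group
        result.append(group)
    return result


# "The deficit was covered by the engineers and teachers paying higher
# contributions."
pairs = [
    ("deficit", "covered"), ("covered", "paying"), ("engineers", "paying"),
    ("engineers", "teachers"), ("teachers", "paying"),
    ("paying", "contributions"), ("higher", "contributions"),
]
print(len(components(pairs)))        # 1: the network is complete

# Omitting one relation splits the sentence into two unconnected parts.
incomplete = [p for p in pairs if p != ("covered", "paying")]
print(len(components(incomplete)))   # 2: a discontinuity shows up
```

A result greater than one signals that at least one relation is missing, although a single component does not by itself prove that every relation has been found.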

A third case concerned a sentence beginning

"At one time there were two space bodies ..."

Here the subject failed to identify the relation between "were" and "time", even commenting that "at one time" is unrelated to the remainder of the sentence! This failure seems to be due to a kind of "word-order conditioning", and the instructions now place more emphasis on the need to discount word order in identifying pairs. The last case of omission concerned the possessive pronoun "his", which should have been replaced by its antecedent ("coordinator") but was discarded as a "function word" instead. The instructions have now been made more explicit with respect to anaphora.

In conclusion it appeared from this experiment that it should be quite feasible for native-speaker linguistics students to generate adequate word pair input to the DLT software on the basis of a set of written instructions. The error rate for native English speakers working on normal running text, given the improvements in the instruction manual, should be quite low. Further tests of this procedure will now be included in the run-up to the Melby Test, in order to make sure that these assumptions are correct.


PART IV: FUTURE DEVELOPMENTS

Chapter 1

Computerized Lexicography

An important aspect of Language Processing by Computers has been the creation of special dictionaries - digitally stored and suited to the needs of the computer. Such dictionaries differ from traditional paper dictionaries in a number of ways:

the medium: They are, of course, stored in binary-coded form on tape, disk or in memory banks, instead of being printed on paper.

organization: Alphabetical coding, though it may be used, is not usually very useful for a computer system. Various other ways of organizing material - hashing, setting pointers to related entries etc. - are better suited to the computer's needs.

accessibility: Traditionally, entries are ordered alphabetically and all references to entries rely on access via the user's knowledge of the alphabet. The computer, on the other hand, allows many types of referencing: forward or reverse alphabetical, referencing by topic, thesaurus-like referencing. The organization of the dictionary is determined in part by how much of the information is accessed at one time. Conventional paper dictionaries present all the information for any one entry in one continuous stream of text - though often subdivided into several parts to separate the various pieces of information. The computerized dictionary can be organized to access only specific parts of the available information, i.e. only that part which is required at a particular moment.

size: Over the last few years the prices of digital storage media have decreased sharply, making digitized storage cheaper than printing on paper and opening the way for vast amounts of information to be compiled on one medium. Computerized access to the information also ensures that even amounts of data that on paper would become useless because of their sheer size are still perfectly manageable.

mutability: Conventional dictionary makers face the insurmountable problem of having to portray statically a vocabulary that is continually changing. A paper dictionary takes a great deal of time to compile, correct and print, so much time, in fact, that any such dictionary is outdated as soon as it appears in print. An electronic dictionary, once the basic vocabulary has been compiled, can be continually updated to keep up with the changes in the language. Though any dictionary necessarily runs slightly behind the facts, a properly organized electronic dictionary can incorporate changes in language usage as soon as they have become common enough to be identified as such.

As already discussed in Chapters II.5 and III.1, publishers of conventional dictionaries have gradually incorporated computers into their means of production. Gradually, the requirements of NLP projects seem to be overlapping those of dictionary publishers, which may in the future lead to the creation of an additional branch of dictionary publication: the electronic, custom-built dictionary, extracted from a vast central corpus and tailored to the needs of the customer. (The Dutch SPIN projects are in fact partly devoted to creating just such a vocabulary, in a form that allows for customized 'spin-offs' to interested parties.)

At present, however, we are still a long way from that happy situation, and, like all other NLP projects, the DLT project will have to create and maintain dictionaries of its own. It would be inexcusable, however, not to make use of developments in computer lexicography and not to incorporate existing technology and databases as much as possible. To not only re-invent the wheel, but to try building a sports car from scratch as well, would be a foolish waste of time and resources and, moreover, doomed to failure. In the planning of the DLT project, therefore, dictionary construction has been allotted an important place, and special attention is being devoted to innovative aspects of lexicography that could prove helpful.

Diversity of Sources

The DLT Lexical Knowledge Bank is intended to contain a large amount of knowledge on a broad range of subjects, comparable to the kind of knowledge-of-the-world human language users continually - if unconsciously - refer to. As discussed in Chapters II.2 and III.1, conventional monolingual dictionaries already contain much of this knowledge in a condensed form. These dictionaries will therefore form the primary sources for DLT's LKB. Since we are dealing with a translation system, however, bilingual dictionaries are of particular importance. In the case of DLT, this means bilingual dictionaries with Esperanto as one of the languages. Over the hundred years since Esperanto first appeared, a considerable number of such dictionaries have been produced. The same is true of Esperanto textbooks, which provide a rich source of information on metataxis rules which can be stored in the lexicon.

Another important source of information is to be found in specialized dictionaries and glossaries, i.e. those restricted to a specific field such as law, economics, physics etc. As the DLT system is aimed at the translation of informative and technical texts, such specialized knowledge of vocabulary is necessary to enable the system to distinguish everyday usages of words from their technical senses, as well as to recognize technical terms as such. Here again, a fair amount of pioneer work has been done with regard to Esperanto, although the same kind of intensive development work will be required as is presently being invested in some of the less dominant languages of science and technology such as Arabic and Hebrew.

Various kinds of thesauri can also be of value for DLT. SWESIL's LKB relies heavily on a lexical taxonomy coded by means of key-terms and super-key-terms. These key-terms and super-key-terms must be carefully chosen and cross-checked. Thesauri, with their lists of synonyms, near synonyms and hyponyms, are a valuable source of ideas and comparative material.

In order to be able to translate technical texts, a human translator must have at least a basic understanding of the subject. Incorporating basic technical information in the LKB may improve SWESIL's performance when it comes to crunching technical terms.

The last, but certainly not least, of the sources which will contribute to the LKB is the linguistic intuition of DLT's lexicographers, who have, after all, the kind of language expertise we would like SWESIL to have. The broader and more international the experience that can be brought to bear on the LKB, the more reliable it is likely to prove as a standard and multi-directional interface between DLT's source and target languages. At this moment, arrangements are being made (and software developed) for farming out of lexicographic work to external sub-contractors in various countries, subject always to revision and standardization by the in-house specialists. A useful spin-off from this approach is the detailed documentation which it demands, including formal specification of lexicographic procedures, explicit and operational definition of principles and the avoidance of ad hoc solutions not founded in wider language usage.


Formalizing Information

The most important job of the lexicographer is the conversion of the information that must be incorporated into the Semantic Dictionary into appropriate SOLL pairs. SOLL pairs constitute the formalism in which information from several sources is captured. The formal requirements of SOLL pairs are:

They are subject to several rules for well-formedness. They conform to a 'mini-syntax', which governs the way morphemes are separated, alternatives are abbreviated and suchlike. The relators used are also subject to a 'mini-syntax' governing the use of hyphens etc. The range of basic relators, though finite, is large [...] prepositions in English!) and can [...] of prefixes. (See Chapter II.2.)

[...] no attempt has been made to eliminate such redundancy by inheritance mechanisms and the like, although, of course, the process of ontological matching within SWESIL is itself a kind of inheritance mechanism, with the difference that SWESIL hunts up the hierarchy to find the information, instead of the information being passed down. However, the burden of lexicographic work can undoubtedly be reduced by offering the lexicographer "inherited" information, i.e. by suggesting contextual models derived from those of the selected key-term (where this is already present in the dictionary). This type of feedback can, at the same time, provide a useful check on the appropriateness of the chosen key-term. The rationale here is that if we decide to enter the word "biology" with "science" as its key-term, many of the contextual expectancies of "science" will be appropriate for "biology" as well - perhaps after a certain amount of editing to make them more specific. If, on the other hand, they do not seem appropriate, then this may be a good reason to ask ourselves whether "science" is indeed the right key-term to use.

Consistency checking is something that has to be done centrally, in order to find out whether one lexicographer's work contradicts that of others. However, the same checks can be carried out locally as well. When an external sub-contractor is entering material on a PC, for example, consistency checks can be run on the data that has already been entered. The only type of consistency check already implemented concerns the key-term hierarchy: loops are identified and displayed, and warnings are issued whenever an undefined key-term is used. Consistency checks of the ontological type are very much tied up with the future development of inferencing routines. Here again, the most obvious kind of check is the discovery of loops: e.g. in transitive relations such as "part-of", "consisting-of", "group-of" etc.
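The key-term check described above (finding loops and flagging undefined key-terms) might be sketched as follows. This is a minimal Python illustration, not the DLT implementation, which the text describes as PROLOG-based. It assumes, purely for illustration, that the hierarchy is available as a mapping from each entry to its single key-term and that the topmost super-key-terms are listed separately; the function name `check_key_terms`, the `roots` parameter and the sample entries are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def check_key_terms(
    key_term_of: Dict[str, str], roots: Set[str]
) -> Tuple[List[List[str]], List[Tuple[str, str]]]:
    """Return (loops, undefined) for an entry -> key-term mapping.

    `roots` holds the topmost terms, which have no key-term of their own
    (an assumption of this sketch).
    """
    # An undefined key-term is one that is neither an entry nor a root.
    undefined = sorted(
        (entry, term)
        for entry, term in key_term_of.items()
        if term not in key_term_of and term not in roots
    )

    loops: List[List[str]] = []
    already_reported: Set[str] = set()
    for start in key_term_of:
        path: List[str] = []
        position: Dict[str, int] = {}
        node = start
        # Walk upwards until we leave the dictionary or revisit a node.
        while node in key_term_of and node not in position:
            position[node] = len(path)
            path.append(node)
            node = key_term_of[node]
        if node in position and node not in already_reported:
            loop = path[position[node]:]
            loops.append(loop)
            already_reported.update(loop)
    return loops, undefined


# Hypothetical sample data: one loop and one misspelt key-term.
hierarchy = {
    "biology": "science",
    "science": "discipline",
    "discipline": "science",   # loop: science -> discipline -> science
    "physics": "sciense",      # undefined key-term (typing error)
    "law": "rule",
    "rule": "abstraction",
}
loops, undefined = check_key_terms(hierarchy, roots={"abstraction"})
```

Because each entry has exactly one key-term in this sketch, every walk is a simple chain, so a loop is detected as soon as the walk returns to a node already on the current path. The same walk could in principle be applied to transitive relations such as "part-of", given a comparable mapping.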

Storing Entries in the LKB

The Semantic Dictionary entries are stored as a database of PROLOG predicates, indexed under entry and under key-term. The indexing is hashed for fast retrieval. Before new entries are added, they have to be converted from the external format in which they have been created, to the internal PROLOG form. This is mostly a matter of expanding abbreviated forms to sets of simple pairs - a task performed by a simple parser that checks the syntax of the entries - and of replacing certain symbols that are meaningful in PROLOG (such as brackets and inverted commas). This conversion is automatic.

As soon as new entries have been converted, they are available to the SWESIL system, but not in its fastest form. The PROLOG environment offers the choice of interpreted or compiled execution. In the latter case the predicates are translated beforehand into the machine's inner format, allowing much faster execution. For this reason, when a substantial number of entries have been collected, they are converted in one go, after which the whole Semantic Dictionary is compiled anew.

At present the LKB exists on hard disk and tape, but in the future working system, when the contents have a more permanent character, it will be brought out on optical disks (CD-ROM). This offers the advantage of massive storage capacities at high access speeds, while being very incorruptible. The disadvantage is that CD-ROM is (up to now) a Read-Only storage device, so that updating of the LKB is only feasible at longer intervals. Progress in the field of optical storage, however, leads us to believe that true Read/Write optical storage can be achieved within the next ten years.
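The conversion and indexing steps can be illustrated with a small sketch. It is written in Python rather than PROLOG, and the external format is an assumption made for this example only: a three-part string `left relator right`, where `/` separates abbreviated alternatives. The real external syntax is defined elsewhere in the book (Chapter II.2) and is not reproduced here; the names `expand` and `SemanticDictionary`, and the sample words and relator, are hypothetical.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Pair = Tuple[str, str, str]  # (word, relator, word)


def expand(abbreviated: str) -> List[Pair]:
    """Expand 'a/b REL c/d' into every simple (word, relator, word) pair.

    Raises ValueError if the entry does not have the assumed three-part form,
    which stands in for the syntax check performed by the parser.
    """
    parts = abbreviated.split()
    if len(parts) != 3:
        raise ValueError(f"expected 'left relator right', got: {abbreviated!r}")
    left, relator, right = parts
    lefts, rights = left.split("/"), right.split("/")
    if "" in lefts or "" in rights:
        raise ValueError(f"empty alternative in: {abbreviated!r}")
    return [(l, relator, r) for l, r in product(lefts, rights)]


class SemanticDictionary:
    """Toy store indexed under entry and under key-term (hashed via dict)."""

    def __init__(self) -> None:
        self.by_entry: Dict[str, List[Pair]] = defaultdict(list)
        self.by_key_term: Dict[str, List[Pair]] = defaultdict(list)

    def add(self, entry: str, key_term: str, abbreviated: str) -> None:
        for pair in expand(abbreviated):
            self.by_entry[entry].append(pair)
            self.by_key_term[key_term].append(pair)


lkb = SemanticDictionary()
lkb.add("drink", "consume", "drink/sip OBJ water/wine")
```

Python dictionaries are hash tables, so lookups under either index take roughly constant time; this mirrors the hashed indexing mentioned in the text without claiming anything about how the PROLOG environment implements it.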

Lexicographic Tools

What kind of assistance can a lexicographer working on the creation of an advanced NLP/MT lexical knowledge bank expect to have at his disposal? He will, of course, work on a modern Personal Computer with sufficient memory and permanent storage to cope with fair amounts of text. Text processing and data retrieval will be integrated with the programs for entry generation, checking and storing. A number of useful facilities, as discussed above, will be a standard feature of the working environment. More importantly, he will probably have on-line access to all kinds of lexicographic material stored in central databanks around the world, allowing him to check, compare and use information from a variety of sources without being drowned in a sea of paper and books.

All the above-mentioned tools are either already in existence or are about to become available. For the specific task of the lexicographer, however, a number of tools can be developed which make use of advances in various areas of NLP and which not only help to edit, enter and store information efficiently, but also provide substantial assistance in the creative process itself. A number of possibilities are briefly discussed below - some of them already developed, others potential extensions to be added to existing programs.

[...] it will be equally accessible to all the DLT modules for individual languages and its growth and improvement will be equally relevant for all the language pairs involved. Ultimately, the LKB will have become the heart of an expert system that can be used for much more than translation alone. A few of the possibilities are: information retrieval from databases in many languages, multi-lingual indexing, and the writing of abstracts and summaries. Another very interesting possibility would be to use the LKB as the heart of a powerful NL user-interface able to communicate with the user in the language of his choice.


Chapter 2

Macrocontext and Discourse Analysis

Beyond the Sentence Boundary

In our discussion of MT so far, we have implicitly assumed the sentence to be the central unit to be processed. This assumption has largely been historically determined: the sentence - like the word - is such a 'natural' unit that most grammarians through the ages have taken it to be the largest structure to be described by their grammars. The efforts of TGG - the prime example of modern grammar - were also concentrated solely on describing the grammatical sentence.

Syntactically, at first sight, the choice seems to be right. Most rules of Natural Language grammars can adequately account for a large number of phenomena within the boundaries of the single sentence. Phenomena beyond that range appear to be much more difficult to capture in rules. This does not mean that grammarians do not recognize the fact that a text is more than just a random sequence of sentences. It is just that it is commonly felt that syntactic description stops short at the sentence boundary, leaving the description of larger units to other disciplines.

Semantically speaking, the sentence boundary is a much less comfortable phenomenon. When one attempts to represent the meaning of a single sentence, one soon discovers that the possibilities are virtually unlimited and that it is almost always impossible to make an adequate choice on that basis alone. Semantics based on the single sentence can most of the time do no more than describe the multitude of meanings that are possible so long as the context is not taken into account. Yet the speed and ease with which humans process text, without as much as noticing many elements which appear highly ambiguous when seen from the perspective of the isolated sentence, suggests that the disambiguation process is very much guided by the text as a whole. Previously processed information directs subsequent processing, eliminating many alternatives on the basis of what is already known about the text.
Very important too is the extra-linguistic knowledge the human reader has about the text: its subject matter, the reason why it was published, the kind of text it is likely to be in the light of its physical context (e.g. in a book, a magazine, a pamphlet). These are matters that influence the process of text understanding in a way that sentence-by-sentence analysis cannot capture. A proficient NLP/MT system will have to include some form of text understanding if it is to analyse with any confidence the texts given to it, without relying too heavily on human assistance. Ideally, the SL analyzer of an MT system must be

capable of taking into account the kind of superstructure that turns a collection of sentences into a coherent text.

In terms of semantics this means that the meaning representations of previously processed sentences must not be forgotten, but must be used by the system in order more successfully to disambiguate later sentences, at the same time building up a coherent meaning representation of the text as a whole. Some (relatively simple) examples of cases in which macrocontext semantics is essential for correct processing of a sentence are:

anaphoric reference

The most obvious example of meaning that can only be understood by using the text as a unit rather than the sentence. There is absolutely no way a personal pronoun, for instance, can be assigned any contents - apart from some basic features like 'male', 'singular', which, in any case, border on the syntactic rather than semantic domain - when it occurs as the subject of an isolated sentence. Only when one studies the sentences occurring in its surroundings can an identification be made and the pronoun be represented in terms of its reference to an earlier statement.

polysemy

Many words have a number of potential meanings that are 'filtered out' because their surrounding words clash with them. The same goes for sentences. The famous "Flying planes can be dangerous" loses much of its ambiguity when it is preceded by the sentence "Balloonists are advised to stay clear of the airport". Often, too, the meaning of single words is more clearly restricted by the text as a whole than by the words in its immediate surroundings. The meaning of the word "alcohol" in the sentence "Alcohol is a dangerous liquid" is different in a text on dangerous drugs and in a chemistry handbook.

The DLT project planning incorporates an intensive effort to create a text analysis module that can analyse texts as coherent structures and use the information gained in this way for the more accurate disambiguation of single words, improving the system's quality of performance by deepening its understanding of the texts it analyzes.
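The "alcohol" example above can be made concrete with a deliberately crude sketch of macrocontext disambiguation: each sense is given a set of cue words, and the sense whose cues occur most often in the whole text wins. This is an illustrative stand-in, not the text analysis module planned for DLT; the sense labels, the cue lists and the function `choose_sense` are assumptions invented for the example.

```python
import re
from collections import Counter
from typing import Dict, Optional, Set

# Hypothetical sense inventory: sense label -> cue words for that sense.
SENSES: Dict[str, Dict[str, Set[str]]] = {
    "alcohol": {
        "intoxicating drink": {"drugs", "addiction", "abuse", "drinking"},
        "chemical compound": {"ethanol", "solvent", "molecule", "hydroxyl"},
    }
}


def choose_sense(word: str, text: str) -> Optional[str]:
    """Pick the sense best supported by cue words anywhere in the text.

    Returns None when the text gives no evidence or the evidence is tied,
    i.e. when the question would have to be left open.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        sense: sum(counts[cue] for cue in cues)
        for sense, cues in SENSES[word].items()
    }
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


sentence = "Alcohol is a dangerous liquid."
drug_text = sentence + " Like other drugs, it can lead to addiction and abuse."
chemistry_text = sentence + " Ethanol is a flammable solvent; each molecule carries a hydroxyl group."
```

The isolated sentence yields no decision, while the same sentence embedded in either text does, which is the point of the example: the evidence comes from the text as a whole, not from the words immediately surrounding "alcohol".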


Text Models

To analyze text as a whole, one needs an appropriate model with which to describe the elements that constitute the text and the relations between them. An interesting model of this kind has been proposed by John Swales [SWALES, 1981]. His model is related to the - traditional - school of rhetorics, and is largely based on the intentions of the writer as a unifying principle. When analyzing informative texts - the type of text DLT is aimed at - one can safely make the assumption that much of the writer's linguistic behaviour is determined by his wish to be properly understood, to 'get the message across'. Swales' study shows that there are clearly recognizable patterns in a text which correspond to the various stages of 'persuasion' an author may use to 'get the message across'. To name a few:

1. Establishing the Field
   a) Showing Centrality
   b) Stating Current Knowledge
   c) Ascribing Key Characteristics
2. Summarizing Previous Research
   a) Strong Author-Orientations
   b) Weak Author-Orientations
   c) Subject Orientations
3. Preparing for Present Research
   a) Indicating a Gap
   b) Question-Raising
   c) Extending a Finding
4. Introducing Present Research
   a) Giving the Purpose
   b) Describing present research

[SWALES, 1981]

Interestingly enough, within the type of text Swales studied - introductions to scientific articles - there was a strong regularity in this pattern, the stages of which were clearly recognizable not only on semantic grounds but on syntactic grounds as well. In particular, the interplay between the range of syntactic structures used and the intentions of the writer offers a promising line of study (in this respect the studies of Gallais-Hamonno, e.g. [GALLAIS-HAMONNO, 1980 & 1982] are of interest, as they show that typical syntactic patterns exist within a given field (economics) that have a markedly different semantic interpretation inside and outside the field), as it may well be possible to equip a system with the ability to recognize statistically relevant changes in the syntactic structures used. If that system contains knowledge about the various rhetorical stages a text typically goes through, it may combine this knowledge with the detected changes in syntactic patterns to focus more accurately on the writer's intentions, which in turn may provide valuable information to aid disambiguation.

As studies by the followers of Schank & Abelson [e.g. DYER, 1983] have shown, texts that contain stories can often be analyzed quite adequately by finding certain stereotypical patterns - Dyer's "adages": one-line summaries of stories by means of proverbs and sayings - and using these as a semantic representation of the text's meaning. Semantic details within the sentences of the text can then be verified against the overall pattern. If we can find an appropriate stock of typical (stereotype) patterns for the kind of informative texts we are dealing with, we may be able to construct a model of the text that contains its overall meaning, while accounting for local syntactic and semantic phenomena.

Incremental Understanding - Validation of information from different sources

The process of understanding language is a process that is largely determined by the fact that language is a linear phenomenon, progressing in time. This means that at any given moment in time a language processor processing text will have more information at hand than before that moment, and less information than later on. In many cases this can mean that certain choices cannot be made until some additional information is received. On sentence level, this phenomenon leads, for instance, to the generation of parallel parse trails to account for the possible structures that can be constructed on the basis of information received so far. It is possible to do this only when the grammar of the system has a fairly strong sense of what can be expected, limiting the choices to a manageable number. On text level, it is not at all clear whether such restrictions on what is to be expected can be made strong enough to limit the possibilities to a manageable number. A strong grammar in this sense would mean the ability to construct a finite number of possible text structures on the basis of the first sentence, gradually limiting the number of possibilities as more sentences are read, sometimes generating additional ones to account for some special type of sentence.


For complete texts, such a method of parallel development of possible interpretations is not, however, very practical, if indeed possible. In spite of the kind of structuring discussed above, the interplay between individual sentences in a text seems to be much less rigidly determined than the interplay between the words in the sentence. Whereas a given word in a sentence can often limit the possible words (or word classes) that can follow to a manageable number, it is not very likely that a given sentence will contain expectations strong enough to likewise limit the number of sentences that can possibly follow, in a practical way.

Where expectations are as vague as this, it is more likely that the process of text understanding must be done by constructing a very sketchy skeleton, taking as fixed points those pieces of information that appear to be fairly certain. As more text comes in, details that had to be left undecided can be filled in, often causing a complete re-evaluation of the skeleton, filling in places that had previously been left open, perhaps removing some undecided alternatives. In this way, understanding of the meaning of the text gradually advances until, by the end of the text, as complete a representation as possible has been built up.

Implementing such incremental understanding in an NLP/MT system would involve providing it with a way of filling in blanks in the knowledge that could not be filled earlier, at the same time checking what the consequences of this newly acquired knowledge are all along the internal interpretation. It would appear in this approach that the more text that has been taken in (assuming, of course, that the text is a coherent structure) the more certain the system can become about its choices, leading increasingly to postponement of the dialogue until the end of the text: i.e. fewer questions, fewer alternatives, and alternatives that are less far-fetched or even nonsensical.
An additional advantage of the DLT system where (incremental) text understanding is concerned is the dialogue, which is carried out on a sentence-by-sentence basis. This means that a number of undecided questions can be firmly resolved through human assistance, long before the system would be able to do so on its own. The system can add these decisions to its points of certainty, reinforcing the skeleton understanding of the text, enabling it to make better and more certain decisions on later problems. In this respect the dialogue becomes an important additional source of information, which is directly available. Another important source of information is the LKB itself. Normal dictionary retrieval is used to solve word sense ambiguities, but the LKB is potentially a source of much wider knowledge (knowledge of the world - see Chapter IV.1). Total accessibility of the LKB can enormously help understanding of the text by searching for familiar patterns or significant cooccurrences of words over stretches of text and by checking out logical inferences against the facts stored there.
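The skeleton idea described in this section (fixed points, open places, and re-evaluation as new information arrives) resembles simple constraint propagation, and can be sketched that way. The sketch below is an assumption-laden illustration, not a description of DLT: the class `Skeleton`, its "slots" and the pairwise compatibility constraints are all hypothetical constructs chosen to keep the example small.

```python
from typing import Dict, Iterable, List, Set, Tuple


class Skeleton:
    """Open readings per slot, plus pairwise compatibility constraints."""

    def __init__(self) -> None:
        self.candidates: Dict[str, Set[str]] = {}
        self.constraints: List[Tuple[str, str, Set[Tuple[str, str]]]] = []

    def add_slot(self, slot: str, readings: Iterable[str]) -> None:
        self.candidates[slot] = set(readings)

    def add_constraint(
        self, a: str, b: str, allowed: Iterable[Tuple[str, str]]
    ) -> None:
        """Only the (reading_a, reading_b) combinations in `allowed` cohere."""
        self.constraints.append((a, b, set(allowed)))
        self._propagate()

    def decide(self, slot: str, reading: str) -> None:
        """Fix a slot, e.g. after a dialogue answer or decisive later text."""
        if reading not in self.candidates[slot]:
            raise ValueError(f"{reading!r} is no longer open for {slot!r}")
        self.candidates[slot] = {reading}
        self._propagate()

    def _propagate(self) -> None:
        """Drop readings that no longer have support, until nothing changes."""
        changed = True
        while changed:
            changed = False
            for a, b, allowed in self.constraints:
                keep_a = {x for x in self.candidates[a]
                          if any((x, y) in allowed for y in self.candidates[b])}
                keep_b = {y for y in self.candidates[b]
                          if any((x, y) in allowed for x in keep_a)}
                if keep_a != self.candidates[a] or keep_b != self.candidates[b]:
                    self.candidates[a], self.candidates[b] = keep_a, keep_b
                    changed = True

    def open_questions(self) -> List[str]:
        return sorted(s for s, r in self.candidates.items() if len(r) > 1)


text = Skeleton()
text.add_slot("s1:bank", {"riverside", "institution"})
text.add_slot("s3:deposit", {"sediment", "money"})
text.add_constraint("s1:bank", "s3:deposit",
                    {("riverside", "sediment"), ("institution", "money")})
open_before = text.open_questions()
text.decide("s3:deposit", "money")   # a new point of certainty
open_after = text.open_questions()
```

Here a single decision about a word in the third sentence also settles an ambiguity left open in the first, so one question fewer would need to be put to the user. Real texts would need much richer constraints than pairwise tables; the sketch only shows the bookkeeping.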

Applications - Summaries, Indexing, Rephrasing

A system based on text understanding may be used for more tasks than translation only. Some of the more interesting possibilities - in the light of the intended office environment - are:

automatic generation of summaries or abstracts:

Coupled to MT this would mean that databases could store texts and have summaries generated automatically, on the basis of which potential customers could then, in their own language, select texts they are interested in and receive them in their own language as well.

automatic indexing:

The system could generate appropriate indexes of the text, using its global understanding of the text to single out important key words and phrases.

rephrasing:

A powerful facility, made possible by text understanding. Rephrasing will necessarily take an important place in the dialogue module (see Chapter II.4), but it could also be used as an additional writing tool. When a writer has written a text, the system might produce a summary first, to show how it will be interpreted by the system (and, probably, by the readers). After that, those passages can be selected, or suggested by the system, which could be phrased more clearly. The system would then suggest a number of alternative phrasings, from which the one that best suits the author's purpose can be chosen. Note: such rephrasing must leave the meaning intact: this writing tool would be a stylistic aid, helping the author to write more clearly. It cannot suggest additional information or changes of meaning.

Increasing Dialogue Efficiency

The size and complexity of the dialogue questions depend to a large extent on the depth of understanding the system is capable of. As long as the system's understanding is restricted to individual sentences, most words will be ambiguous and often undecidably so.


The more global knowledge that can be taken into account, the more confident the system can become about its choices. Text understanding would, as the text progresses, lead to growing certainty and understanding, leading in turn to a decrease in the number of alternatives to be considered in the dialogue module. For the user this would mean that the number of questions and the number of alternatives offered per question would decrease as more text is typed in. Note, however, that as long as there is any doubt about the correct interpretation, the ultimate decision will always remain with the user. The system may feel confident enough to offer only one interpretation - e.g. "is this correct, yes/no?" - but when the answer is no it will have to show the other alternatives, as it may mean that in this case the user actually intended an interpretation that the system had calculated to be less likely.
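The dialogue policy just described can be sketched in a few lines. The following Python fragment is an illustration under stated assumptions, not the DLT dialogue module: the scoring of alternatives, the `threshold` parameter, and the two callback functions `ask_yes_no` and `ask_choice` (which stand in for whatever user interface is available) are all hypothetical.

```python
from typing import Callable, Dict, List, Optional


def resolve(
    word: str,
    scored: Dict[str, float],
    ask_yes_no: Callable[[str], bool],
    ask_choice: Callable[[str, List[str]], str],
    threshold: float = 0.8,
) -> Optional[str]:
    """Let the user settle an ambiguity, asking as little as possible.

    If one reading holds at least `threshold` of the total score, offer it
    alone as a yes/no question; on "no", fall back to the other readings.
    The final decision is always the user's.
    """
    if not scored:
        raise ValueError("no alternatives to offer")
    ranked = sorted(scored, key=scored.get, reverse=True)
    total = sum(scored.values())
    best = ranked[0]
    confident = total > 0 and scored[best] / total >= threshold
    if confident:
        if ask_yes_no(f'"{word}" = {best}: is this correct, yes/no?'):
            return best
        ranked = ranked[1:]          # the user rejected the best guess
        if not ranked:
            return None              # nothing else to offer
    return ask_choice(word, ranked)


# Simulated users for demonstration purposes.
shown: List[List[str]] = []

def pick_first(word: str, options: List[str]) -> str:
    shown.append(list(options))
    return options[0]

scores = {"financial institution": 9.0, "river bank": 1.0}
accepted = resolve("bank", scores, lambda question: True, pick_first)
rejected = resolve("bank", scores, lambda question: False, pick_first)
```

As the system's scores become more decisive over the course of a text, more questions take the short yes/no form, which is one way to read the claim that the dialogue shrinks as understanding grows.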


Chapter 3

The Self-Improving System

Changing Language

AI and MT projects are typically long-term efforts, and the life cycle of the resulting system should be measured in decades rather than in years. Faced with this kind of time scale, even the delineation of an NL grammar becomes a moving target. Expressions that are considered colloquial today may have become generally accepted tomorrow, including their usage in informative texts of a more formal nature. The rate of change in vocabulary is overwhelming: standardization bodies at the national and international level can scarcely cope with the wave of new technical terms produced by certain fields of modern science and industry, like aerospace, informatics, biochemistry etc. But also areas such as public administration, corporate management, banking and finance, contribute new terms and phrases.

The continuous demand for a growing and more precise apparatus of technical terminology (viz. 'disk', 'diskette', 'minidiskette', 'microdiskette', ...) will probably not be satisfied by direct human creativity alone. Creation of new terms is likely to be partially automated (as is already the case with the invention of trade names). In this respect, it is significant that INFOTERM (the Vienna-based UNISIST institute on methodological aspects of terminology and terminography) has announced its first congress on "Terminology and Knowledge Engineering" (planned for October 1987). And achievements like "The Wordtree" [Burger, 1984] seem to bridge the gap between traditional lexicography and computer-driven language engineering.

But even stronger than the "market pull" will be the "technology push" in language change in the near future. New media demand new writing styles: the first studies on the characteristics of language in videotex systems have already appeared [Lurquin, 1985], and "Writing to be read from a screen" is becoming a regular topic in language courses for business and industry.
Particularly fascinating is the flexibility with which, for instance, Japanese office workers show themselves prepared nowadays to adapt their language to the possibilities and limitations of modern electronic word processing and keyboard equipment. The handling and understanding of natural language by computers opens up such a wealth of applications and benefits that people will be prepared to pay the price for clearer - less ambiguous - writing styles or speaking habits (the latter relating to automatic speech recognition).

Multilingual communication capacity is one of the benefits: for the next decades, it will largely rely on the combination of modified, controlled language use by humans on the one hand, and computerized, AI-supported translation on the other. Striking the proper balance at the right time will not be easy, but this is a challenge shared by the language-isolated Japanese as well as the language-divided Europeans and the industrialized third-world countries. With the shift of the world's economic center of gravity from the Atlantic to the Pacific, it is doubtful whether English can maintain its dominant position in the 21st century. Its chances of continuing as a world language could be substantially improved by a step-wise but radical spelling reform, the feasibility of which will much depend on advanced and widespread word processing facilities, i.e. on 'language technology'. In this respect, innovation and proliferation in keyboard design and text input devices offers new prospects: e.g. syllabic keyboards [e.g. the VELOTYPE, invented by Berkelmans and Den Outer in The Netherlands], software-redefinable keyboards [the all-LED keyboard announced by Ericsson], multilingual keyboards, etc.

Variety of Language

Experts [e.g. Slocum, 1984] have pointed out that the coverage of a new field of technical terminology can be of greater impact on an MT system than its extension by a new source language. Indeed, the mere addition of many tens of thousands of lexical entries (including compounds or multi-word terms) is a heavy burden on a multilingual translation system. But besides lexical variation - new terms and special usage of common-language words - phraseology and syntax too are often affected in texts relating to specific disciplines or applications. In linguistics and language teaching, a whole area of research (coined 'LSP', Language for Special Purposes) is dedicated to this phenomenon. Talking about language change and varieties of language, three trends are worth noting:

- simplification
- internationalization
- grading

The first two are closely related. English for an international audience should be simple and clear, void of the "richness" and idiosyncrasies that serve to show off the author's rather than to meet the reader's command of the language. Exporters of high-tech products requiring piles of technical documentation have increasingly been faced with the need to impose writing rules on their technical authors. In several cases, this has resulted in extensively defined varieties of 'streamlined' English: XEROX's MCE (Multinational Customized English), Caterpillar's ILSAM, and the FOKKER-initiated SE (Simplified English) adopted by European aircraft manufacturers (AECMA) for maintenance manuals.

Since the well-known nuclear reactor accident in Harrisburg some years ago, more attention has been given to the problem of optimizing man-computer dialogues in complex control room situations. A principle gaining adherence rapidly now is that of 'graded dialogues'; the operator or user can select between a number of levels of guidance, i.e. of comprehensiveness of the computer-generated texts. In this way, the system will communicate with a novice in a completely different way from what it would do with a skilled user. One can compare this with the different levels offered in chess programs and other computer games.

As a consequence of CD-ROM technology, consultation of paper and microfiche manuals will gradually be replaced more and more by querying a computer system and reading from a screen, which means an enormous potential for extending the 'grading' approach. The far end of this evolution will be computer-generated language customized (or 'personalized') to the intelligence, skill, knowledge and beliefs of its momentary reader. A proper mix of pictures (photos, diagrams, cartoons), pictorial symbols, alphanumeric texts, audio signals and voice output, will also be part of this customizing.

Learning from Experience - Improving Performance while Processing Text

An expert system on word semantics, not narrowed down to a limited application domain, but directed at general NL understanding of a wide range of informative texts, implies an immense task of knowledge acquisition. Where does one start, when the objective is the description and storage of all 'knowledge-of-the-world', notably in such a way that a computer or robot can effectively use this knowledge for a practical purpose, such as making an abstract or providing a translation?

As explained throughout this book, the knowledge-based approach of word expert semantics in DLT centers around the careful use of a lexicographer's skill to make such a start. But (traditional) lexicography is only the launching platform of the enterprise: the real thrust will come from a two-stage rocket, the stages of which are briefly summarized in the paragraphs below.

The first stage consists in the booster of new tools and techniques brought about by computer-aided lexicography (see also Chapter IV.1). Rapid access to up-to-date word and term banks, corpus-based instant lexicography, multi-window and multi-font presentation of all and only the essential information on one screen, all these things can contribute to a work environment close to the "lexicographer's dream". Here too, the computer system can 'personalize' its answers and displays to the measured skills and habits (including systematic errors!) of the individual in front of it. The productivity of future lexicographers will thus increase enormously, provided that their training is brought into more modern orbits in time.

The second stage of innovative knowledge acquisition goes off as soon as the system starts learning from the human knowledge exposed in the disambiguation dialogue (Fig. IV.3.1).

Fig. IV.3.1. EVOLUTION [...] TO AI. The knowledge bank will gradually expand and improve its performance by watching millions of man-computer dialogues. The dashed line indicates that the learning effect is not a closed loop, at least not initially.

As we have seen (Chapter II.4), the SWESIL system behaves not unlike the student of a foreign language who, when encountering a problematic phrase, will turn to a native speaker of that language for clarification. In the DLT translation process, the immediate reason to start the dialogue is of course the need to analyze the current sentence, so that it can be transmitted to the TL side of the system. It would be a waste of information, however, if that were all we used it for. Whenever the DLT system initiates a dialogue (the 'ultimate' semantics), it has first made an attempt to solve the problem automatically (by refined SWESIL or advanced macrocontext semantics): in other words, the system always makes its best guess before letting the human operator decide (on certain word meanings etc.). It stands to reason, therefore, to equip the system with the ability to systematically compare its own solutions with those of its human masters, and to improve its own knowledge and performance accordingly, i.e. to learn.

The eventual machine learning effect indicated in Fig. IV.3.1 will probably be preceded by several years of DLT operation on a large corpus of suitable informative texts. At selected user sites, provisions will be made for continuous automatic archiving of the system's "guesses" (including a trace from SWESIL) and the operator's "last words", together with the corresponding micro- and macro-context fragments. These semantic performance statistics will periodically (e.g. twice a year) be gathered by a future DLT support center, where they can be further processed and evaluated. Conclusions and measures for system improvement will still involve a considerable amount of intervention by human specialists at that stage, so a closed-loop machine learning mechanism is not to be expected very soon.
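The archiving and comparison step described above can be pictured with a minimal sketch. It assumes a very simple record layout: the class `DialogueRecord`, its field names, the sense labels and the function `agreement_by_word` are all hypothetical illustrations, not part of the DLT or SWESIL design. The sketch only shows how archived guesses and operator decisions could be turned into per-word agreement statistics for later evaluation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DialogueRecord:
    """One archived disambiguation event (all field names are hypothetical)."""
    word: str             # the ambiguous source-language word
    system_guess: str     # interpretation the system preferred before asking
    operator_choice: str  # interpretation the human operator finally selected
    context: str          # context fragment stored together with the event


def agreement_by_word(records):
    """Return, per word, the fraction of events where guess and choice agree."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec.word] += 1
        if rec.system_guess == rec.operator_choice:
            hits[rec.word] += 1
    return {word: hits[word] / totals[word] for word in totals}


# Invented sample archive, for illustration only.
archive = [
    DialogueRecord("bank", "river_bank", "financial_bank",
                   "she paid the cheque into the bank"),
    DialogueRecord("bank", "financial_bank", "financial_bank",
                   "the bank raised its interest rate"),
    DialogueRecord("pen", "writing_pen", "writing_pen",
                   "he signed with a pen"),
]
print(agreement_by_word(archive))
```

Words with a low agreement rate would be the natural candidates for inspection by the human specialists mentioned above; the sketch deliberately stops short of changing any knowledge automatically.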
Deep inside the already difficult and partly impenetrable research area of AI, machine learning is an even more difficult and less explored sub-area. Progress will be slow in the beginning, partly because valid conclusions from probabilistic data can only be drawn over a sufficiently large volume of experience. Once it really gets off the ground, the results of a self-learning language understanding system will defy all imagination. As the dependence of such systems on human assistance diminishes, they can gradually be dedicated to the reading of (optically stored) texts 24 hours a day: news reports, articles, encyclopedias, patent files, courseware etc. The systems of the future will show their understanding of NL text by the way they meet certain operational objectives such as rephrasing or translation, for which they will get 'notes' (praises or reprimands) from their human teachers. With proper coaching, these systems will even develop a feeling for humour and jokes, a popular projection in science fiction stories.

At the present time, a considerable amount of progress in, for instance, discourse analysis (of written texts, see also Chapter IV.2) still separates us from the ideal situation just sketched.

The Dialogue as SWESIL's Tutor

Fortunately, the combination of a knowledge base and an interactive disambiguation dialogue lends itself very well to the stepwise realization of an AI-based system (Fig. IV.3.1). Referring back to some technical and organizational aspects of SWESIL (as covered in Chapters II and III), we first of all have the flexible and powerful definition structures that were designed to give the lexicographers freedom and power of expression. Because of the uniformity underlying the whole system, adding new information to it can be largely a matter of copying existing structures, altering them slightly (inserting different key terms, for instance) and tying them into the existing semantic network. Given a limited number of rules to direct this, its gradual implementation as a partially automated process appears to be feasible. Through the Expert System environment, the lexicographer could presumably instruct the system exactly how and when to perform this function, making the addition of new definitions a matter of typing in a few words and a single command.
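The copy-alter-link procedure just described can be illustrated with a small sketch. The representation used here (a dictionary of words, each with a list of relation/term pairs), the relation names and the function `derive_definition` are assumptions made purely for illustration; the actual definition structures of the lexical knowledge bank are those described in Part II.

```python
import copy


def derive_definition(network, template_word, new_word, replacements):
    """Copy an existing definition, swap key terms, and link it into the network."""
    new_definition = copy.deepcopy(network[template_word])
    # Alter the copy slightly: substitute the key terms that differ.
    new_definition["relations"] = [
        (relation, replacements.get(term, term))
        for relation, term in new_definition["relations"]
    ]
    # Tie the new entry into the network: every term it refers to must exist.
    for _, term in new_definition["relations"]:
        network.setdefault(term, {"relations": []})
    network[new_word] = new_definition
    return new_definition


# Hypothetical miniature network, for illustration only.
network = {
    "vehicle": {"relations": []},
    "road": {"relations": []},
    "car": {"relations": [("is_a", "vehicle"), ("moves_on", "road")]},
}
derive_definition(network, "car", "train", {"road": "rail"})
print(network["train"]["relations"])
```

In this toy form the "single command" is one function call, given a template word, a new word and the key terms to replace. The template entry itself is left untouched because the copy is a deep one.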
If we (again) see SWESIL as a student of a foreign language, with a relatively modest amount of knowledge to start with, we would expect it to look over the shoulder of the human operator answering the questions in the disambiguation dialogue, and to use - directly or indirectly - this additional information to improve its proficiency in that language. Without this, SWESIL would keep asking exactly the same question, in exactly the same format, even in the same context, over and over again, each time it encountered the same problem. By assessing information on problem cases and the corresponding interpretations chosen by the human users, SWESIL might be able to modify the valency strengths of the word definitions involved. Re-occurrence of the same problem could then lead to a stronger preference for a particular interpretation. If the same problem occurs often enough (something that must be statistically determined), the valency strengths may eventually reach a level at which the relative distances between mutually conflicting interpretations surpass a certain threshold value, causing the less likely one to be rejected.
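A minimal numerical sketch may make this threshold mechanism concrete. Everything in it is an assumption chosen for illustration: the update rule (moving the chosen interpretation's strength a fixed fraction toward 1 and the others toward 0), the learning rate, the threshold value and the sense labels are hypothetical and are not taken from SWESIL.

```python
def reinforce(strengths, chosen, rate=0.1):
    """Move the chosen interpretation's strength toward 1, the others toward 0."""
    return {
        sense: value + rate * ((1.0 if sense == chosen else 0.0) - value)
        for sense, value in strengths.items()
    }


def resolve(strengths, threshold=0.5):
    """Return the preferred sense once its lead over the runner-up exceeds
    the threshold; return None while a dialogue is still needed.
    Expects at least two competing interpretations."""
    ranked = sorted(strengths.items(), key=lambda item: item[1], reverse=True)
    (best, best_value), (_, second_value) = ranked[0], ranked[1]
    return best if best_value - second_value > threshold else None


# Two conflicting interpretations start out equally strong (invented values).
strengths = {"financial_bank": 0.5, "river_bank": 0.5}
dialogues = 0
while resolve(strengths) is None:
    # The operator keeps choosing the same interpretation in this context.
    strengths = reinforce(strengths, "financial_bank")
    dialogues += 1

print(dialogues, resolve(strengths), strengths)
```

The loop counts how many identical operator decisions are needed before the system would stop asking. In a real system the rate and the threshold would have to be determined statistically, as noted above, and the strengths would be tied to the context in which the problem occurred.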

Updating and Expanding

This book has hardly covered the logistics of distributing LKB updates to future MT system users and thus expanding the general capacity of translation networks such as DLT. As already indicated in one of our previous publications [Witkam, 1983], Digital Optical Recording (more widely known now as CD-ROM) will be the medium for providing each subscriber with his personal 500-MegaByte copy of an interlingual knowledge bank, and supplying him with a new release once a year or so. Besides, the growing availability of write-once, write-twice etc. optical disk units offers prospects for an additional downline loading mechanism for intermediate updates.

In the full-grown multilingual translation network of the future, not only a modular assortment of SL and TL modules will be available, but also a choice of expansion modules for specialized fields of terminology or LSP is likely to be offered by the electronic publishers. Partitioning of dictionaries and knowledge banks might be tempting to information suppliers, at least for commercial reasons. But as MT system developers know too well, such a partitioning can be a threat to the high-quality processing of a wide range of texts, the contents of which often cross these partitions. The expansion of a general-purpose LKB system, in such a way that it remains a coherent and integrated whole, will perhaps demand new cooperation structures in the publisher's world of tomorrow.

References

ALPS - 1985 [AUTOMATED LANGUAGE PROCESSING SYSTEMS]
Avenue Beauregard 3, CH 2035 Corcelles (CH).

AL, B. - 1986
Ordinateur et Lexicographie
In: A. Zampolli, Ed. Scritti in Onore di Roberto Busa s.j.
Giardini, Pisa (I).

AL-KASIMI, A.M. - 1985
Linguistics and Bilingual Dictionaries
E.J. Brill, Leiden (NL).

AMSLER, R.A. - 1980
The Structure of the Merriam-Webster Pocket Dictionary
University of Texas, Austin, Texas (USA).

APRESYAN, YU.D., I.A. MEL'ČUK & A.K. ZOLKOVSKY - 1969
Semantics and Lexicography: Towards a New Type of Unilingual Dictionary
In: F. Kiefer, Ed. Studies in Syntax and Semantics
D. Reidel Publishing Co., Dordrecht (NL).

BAR-HILLEL, Y. - 1960
The Present Status of Automatic Translation of Languages
In: F.L. Alt, Ed. Advances in Computers, Vol. 1
Academic Press, New York (USA): pp. 91 - 163.

BARR, A. & E.A. FEIGENBAUM - 1981
The Handbook of Artificial Intelligence, Vol. I
Pitman Books Ltd., London (GB).

BRACHMAN, R.J. - 1983
What IS-A Is and Isn't: An Analysis of Taxonomic Links in Semantic Networks
In: Computer, Vol. 16, No. 10
IEEE Society (USA): pp. 30 - 36.

BRUCE, B. - 1975
Case Systems for Natural Language
In: Artificial Intelligence, Vol. 6
North-Holland, Amsterdam (NL): pp. 327 - 360.

BURGER, H.G. - 1984
The Wordtree: A Transitive Cladistic for Solving Physical & Social Problems
The Wordtree, Merriam, Kansas (USA).

CALZOLARI, N. - 1984
Detecting Patterns in a Lexical Database
In: Proceedings of Coling '84
Association for Computational Linguistics, Stanford University, California (USA): pp. 170 - 173.

CARBONELL, J.G. & M. TOMITA - 1985
New Approaches to Machine Translation
Carnegie-Mellon University, Dept. of Computer Science, Pittsburgh (USA).

CHOMSKY, N. - 1957
Syntactic Structures
Mouton, The Hague (NL).

COATES, J. - 1983
The Semantics of the Modal Auxiliaries
Croom Helm, London (GB).

CULLINGFORD, R.E. & B.A. ONYSHKEVYCH - 1985
Lexicon-Driven Machine Translation
In: Proceedings of the Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages
Colgate University, New York (USA): pp. 75 - 115.

DYER, M.G. - 1982
In-Depth Understanding: A Computer Model of Integrated Processing for Narrative Comprehension
The MIT Press, Cambridge, Massachusetts (USA).

FANTOM, I.D. - 1985
An Expert System for Semantic Disambiguation of Ill-Formed Text in an Esperanto-Based Intermediate Language for Machine Translation (SWESIL)
Polytechnics of the South Bank, Dept. of Electrical & Electronic Engineering, London (GB).

FASS, D. & Y.A. WILKS - 1983
Preference Semantics, Ill-Formedness, and Metaphor
In: American Journal of Computational Linguistics, Vol. 9, No. 3 - 4
Association for Computational Linguistics (USA).

FILLMORE, C. - 1968
The Case for Case
In: E. Bach & R. Harms, Eds. Universals in Linguistic Theory
Holt, Rinehart, and Winston, New York (USA): pp. 1 - 88.

HOFSTADTER, D.R. - 1979
Gödel, Escher, Bach: An Eternal Golden Braid
Vintage Books, New York (USA).

GALLAIS-HAMMONO, J. - 1980
The Characteristics of English for Economists
In: Fachsprache, 2. Jahrgang/Volume, Heft 2
Wilhelm Braumüller, Wien (AU): pp. 60 - 71.

GALLAIS-HAMMONO, J. - 1982
Langage, Langue et Discours Economiques
Centre d'Analyse Syntaxique de l'Université de Metz (F).

ILSON, R., Ed. - 1985
Dictionaries, Lexicography and Language Learning
Pergamon Press (in Association with The British Council), Oxford (GB).

JACKENDOFF, R. - 1983
Semantics and Cognition
The MIT Press, Cambridge, Massachusetts (USA).

KELLY, E. & P. STONE - 1973
Computer Recognition of English Word Senses
North-Holland Publ. Co., Amsterdam (NL).

KNOWLES, F.E. - 1982
The Pivotal Role of the Various Dictionaries in an MT System
In: V. Lawson, Ed. Practical Experience of Machine Translation, Proceedings of a Conference
ASLIB, North-Holland, Amsterdam (NL): pp. 149 - 162.

KOSTER, L. - forthcoming (1986)
Vervolgonderzoek naar de Psychologische Gebruikersaspecten van Disambiguëringsdialogen, Part 2
Rijksuniversiteit Utrecht, Dept. of Psychology, Utrecht (NL).

LOFFLER-LAURIAN, A. - 1983
Pour une Typologie des Erreurs Dans la Traduction Automatique
In: Multilingua, Vol. 2, No. 2
Mouton, Amsterdam (NL): pp. 65 - 78.

LURQUIN, G. - 1985
Quelle Langue pour le Vidéotex?
In: Language Monthly, July 1985
Praetorius Ltd.

MEL'ČUK, I.A. - 1985
Lexicography and Verbal Government
Societas Linguistica Europea, Mouton, The Hague (NL).

MEL'ČUK, I.A. & A.K. ZHOLKOVSKI [ZHOLKOVSKY] - 1984
Tolkovo-Kombinatornyj Slovar' Sovremennogo Russkogo Jazyka
Wiener Slawistischer Almanach, Sonderband 14, Vienna (AU).

MELBY, A.K. - 1984
Letter to BSO.

MERRIAM-WEBSTER - 1974
The Merriam-Webster Dictionary
G. & C. Merriam Co., New York (USA).

NRC - 1966 [NATIONAL RESEARCH COUNCIL, AUTOMATIC LANGUAGE PROCESSING ADVISORY COMMITTEE]
Language and Machines; Computers in Translation and Linguistics, Publication 1416
National Academy of Sciences, National Research Council, Washington, D.C. (USA).

NAGAO, M. - 1985
Structural Transformation in the Generation Stage of the MU Japanese to English Machine Translation System
In: Proceedings of the Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages
Colgate University, New York (USA): pp. 200 - 223.

OETTINGER, A.G. - 1955
The Design of an Automatic Russian-English Technical Dictionary
In: W.N. Locke & A.D. Booth, Eds. Machine Translation of Languages
Technology Press of MIT and Wiley, New York (USA): pp. 47 - 65.

RASKIN, V. - 1985
Linguistics and Natural Language Processing
In: Proceedings of the Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages
Colgate University, New York (USA): pp. 268 - 282.

SCHANK, R.C. & R.P. ABELSON - 1977
Scripts, Plans and Knowledge
In: Proceedings of the 4th International Joint Conference on Artificial Intelligence
IJCAI/4, Tbilisi.

SCHANK, R.C. - 1975
The Primitive ACTs of Conceptual Dependency
In: R.C. Schank & B. Nash-Webber, Eds. Theoretical Issues in Natural Language Processing: an Interdisciplinary Workshop in Computational Linguistics, Psychology, Linguistics, and Artificial Intelligence.

SCHUBERT, K. - 1986
Syntactic Tree Structures in DLT
BSO/Research, Utrecht (NL).

SCHULZE, H. - 1985
On Semantic Primitives and Categories
Institut für Kommunikationsforschung und Phonetik, University of Bonn, for BSO/Research, Utrecht (NL).

SLOCUM, J. - 1984
Machine Translation, its History, Current Status, and Future Prospects
In: Proceedings of Coling '84
Association for Computational Linguistics, Stanford University, California (USA): pp. 546 - 651.

SMALL, S. - 1980
Word Expert Parsing: A Theory of Distributed Word-Based Natural Language Understanding
University of Maryland, Dept. of Computer Science, Maryland (USA).

SMITH, F.J., K. DEVINE & P. CRAIG - 1984
Searching Single-Word and Multi-Word Dictionaries
Queen's University of Belfast, Dept. of Computer Science, Belfast (GB).

SWALES, J. - 1981
Aspects of Article Introductions
The University of Aston, The Language Studies Unit, Birmingham (GB).

TURNER, R. - 1984
Logics for Artificial Intelligence
Ellis Horwood Ltd., Chichester, West Sussex (GB).

WEAVER, W. - 1949 [1955]
Translation
In: W.N. Locke & A.D. Booth, Eds. Machine Translation of Languages
Technology Press of MIT and Wiley, New York (USA): pp. 15 - 23.

WILKS, Y.A. - 1972
Grammar, Meaning and the Machine Analysis of Language
Routledge & Kegan Paul, London (GB).

WILKS, Y.A. - 1973
An Artificial Intelligence Approach to Machine
In: R.C. Schank & K.M. Colby, Eds. Computer Models of Thought and Language
W.H. Freeman, San Francisco (USA).

WINOGRAD, T. - 1972
Understanding Natural Language
Massachusetts Institute of Technology, Massachusetts