JANUA LINGUARUM
STUDIA MEMORIAE NICOLAI VAN WIJK DEDICATA
edenda curat
C. H. VAN SCHOONEVELD
Indiana University
Series Maior, 91
PAPERS IN COMPUTATIONAL LINGUISTICS
Edited by
FERENC PAPP and GYÖRGY SZÉPE
1976
MOUTON, THE HAGUE · PARIS
Consulting Editors
DAVID G. HAYS, WIKTOR ROSENZWEIG, BERNARD VAUQUOIS
Assistant Editor
TAMÁS TERESTYÉNI
Proceedings of the 3rd International Meeting on Computational Linguistics, held at Debrecen, Hungary
ISBN 90 279 3285 9
Copyright © 1976, Akadémiai Kiadó, Budapest
Joint edition with Akadémiai Kiadó, Budapest
PRINTED IN HUNGARY
CONTENTS

Editorial Foreword  9
Tamás, L. (Hungary) Discours d'ouverture  13

General Questions
Culik, K. (Czechoslovakia) A Comparison of Natural and Programming Languages  19
Hays, D. G. (USA) The Field and Scope of Computational Linguistics  21
Rasiowa, H. (Poland) On Algorithmic Logic  27
Šrejder, Ju. A. (USSR) Some Peculiarities of the Mathematical Description of Linguistic Objects  31
Szanser, A. J. (United Kingdom) Elastic Matching of Coded Strings and its Applications  43
Wilks, Y. (USA) One Small Head – Some Remarks on the Use of "Model" in Linguistics  53

Syntactical Analysis of Natural and Artificial Languages
Botos, I. (Hungary) Some Questions of Analyzing Verbal Constructions with Definite Direct Object in Hungarian  65
Dostert, B. H.–Thompson, F. B. (USA) Syntactic Analysis in REL English  69
Joshi, A. K. (USA) How Much Hierarchical Structure is Necessary for Sentence Description?  97
Moessner, L. (G.F.R.) Analyse der englischen but-Konstruktionen mit mengentheoretischen Begriffen  105
Paduceva, E. V. (USSR) Coordinative Projectivity  119
Peters, S. (USA) An Analysis Procedure for Transformational Grammars  125
Pitha, P. (Czechoslovakia) Remarks on the Description of Possessivity  129
Raskin, V. V. (USSR) The Theory of Small Language Sub-Systems as a Basis of Automatic Analysis of Texts  139
Reeker, L. H. (USA) An Extended State View of Parsing Algorithms  141
Smith, R. N. (USA) Interactive Lexicon Construction  161

Semantic Analysis and Synthesis
Edmundson, H. P.–Epstein, M. N. (USA) Research on Synonymy and Antonymy: a Model and its Representation  175
Hajičová, E. (Czechoslovakia) Some Remarks on Presuppositions  189
Klein, S.–Oakley, J. D.–Suurballe, D. J.–Ziesemer, R. A. (USA) A Program for Generating Reports on the Status and History of Stochastically Modifiable Semantic Models of Arbitrary Universes  199
Russell, S. W. (USA) A Semantic Category System for Conceptual Dependency Analysis of Natural Language  221
Satyanarayana, P. (India) A Semantics-Oriented Syntax Analyzer of Natural English  245
Schank, R. C. (USA) Understanding Natural Language Meaning and Intention  259

Topic and Comment in Formal Description
Takahasi, H.–Fujimura, O.–Kameda, H. (Japan) A Behavioral Characterization of Topicalization  295
Martem'ianov, Y. (USSR) Vers la grammaire de la PF  309
Sgall, P. (Czechoslovakia) Focus and Topic/Comment in a Formal Description  325

Morphology
Kis, E. (Roumania) Morphismes et la pression du système linguistique  337
Klein, S.–Dennison, T. A. (USA) An Interactive Program for Learning the Morphology of Natural Languages  343
Lötz, J. (USA) Relational and Numerical Characteristics of Language Paradigms  355
Tesitelova, M. (Czechoslovakia) On Quantitative Research in the Field of Morphology  363

Phonology
Hell, G. (Hungary) Automatic Hyphenation in Hungarian  375
Kok, G. H. A. (The Netherlands) The Automatic Conversion of Written Dutch to a Phonetic Notation  381
Spitzbardt, H. (G.D.R.) Automatic Phoneme Transformation: Sanskrit–Indonesian  389
Tretiakoff, A. (France) Identification de phonèmes dans la langue écrite par un critère d'information maximum  403

Statistical Methods in Language Description
Smith, R. N. (USA) A Probabilistic Model for Performance  411
Wachal, R. S. (USA) The Statistics of Word-Frequency Distributions: a Computational Analysis  419

Automatic Translation
Borščev, V. B.–Xomjakov, M. V. (USSR) Neighbourhood Grammars and Translation  427
Colmerauer, A.–Dansereau, J.–Harris, B.–Kittredge, R.–van Caneghem, M. (Canada) An English–French MT Prototype  433
Garvin, P. L. (USA) Machine Translation in the Seventies  445
Kulagina, O. S. (USSR) On an Algorithm of Syntactic Analysis in the French–Russian Machine Translation System  461
Ljudskanov, A.–Klimonov, G. (Bulgaria–G.D.R.) Sur l'identification des antécédents des pronoms dans le processus de l'analyse automatique des textes  471
Munnich, A. (Hungary) A Common Language for Man-Machine Interactive Systems  493
Perschke, S. (Italy) SLC-II: One More Software to Resolve Linguistic Problems  505

Other Applications
Feliciangeli, H.–Herman, G. T. (Paraguay–USA) Algorithms for Producing Grammars from Sample Derivations: a Common Problem of Formal Language Theory and Developmental Biology  523
Gorodeckij, B. Ju. (USSR) Semantic Inventories in Applied Linguistics  547
Voigt, V. (Hungary) Means and Aims of Computer Folklore Research  549
Waaub, J. M. (Belgium) Problems of Computer Simulation of Second Language Learning  555
Žuravlev, A. P. (USSR) Automatic Analysis of the Symbolic Aspect of the Subject-Matter of a Poetic Text  563
Yang, S. C. (USA) Interactive Language Learning through a Private Computer Tutor  565

* * *

Hays, D. G. (USA) The Past, Present, and Future of Computational Linguistics (Closing Address)  583
EDITORIAL FOREWORD
The International Committee on Computational Linguistics was organized in 1967 at the First International Congress of Computational Linguistics at Grenoble, France. The second Congress was held in Sweden, near Stockholm. Thus the 1971 Congress in Hungary was the third international gathering of computational linguists. As for the place, we decided in favor of Debrecen, an old university city with respectable achievements in the field of computational linguistics.

In 1971 the I.C.C.L. was composed of the following persons:
Chairman: Bernard Vauquois (Grenoble/Montréal)
American Secretary: A. Hood Roberts (Washington, D.C.)
European Secretary: Yves Gentilhomme (Besançon)
Honorary Member: David G. Hays (Buffalo, N.Y.)
Members: Hans Karlgren (Stockholm), Martin Kay (Santa Monica, Cal.), Olga Kulagina (Moscow), Helmut Schnelle (West Berlin), György Szépe (Budapest), Peter Verburg (Groningen), Hiroshi Wada (Tokyo)

This committee selected the following Program Committee, which was responsible for the acceptance of papers submitted to the Organizing Committee:
Chairman: Susumo Kuno (Cambridge, Mass.)
Members: Manfred Bierwisch (Berlin), I. Ishiwata (Tokyo), László Kalmár (Szeged), Sheldon Klein (Madison, Wisconsin), D. Krallmann (Bonn), Igor Melchuk (Moscow), A. Tomberg (The Hague), A. Veillon (Grenoble)

The preparation and organization proper of the Meeting was directed by an Organizational Committee, appointed by the Mathematical and Applied
Linguistics Committee of the Hungarian Academy of Sciences. The composition of the Organizational Committee was the following:
Chairman: György Szépe (Research Institute of Linguistics of the Hungarian Academy of Sciences and Department of General and Applied Linguistics of the Loránd Eötvös University, Budapest)
Secretary General: Ferenc Papp (Department of Slavic Studies and Russian Language of the Lajos Kossuth University, Debrecen)
Members: József Dénes (Institute for the Coordination of Computing Technology of the National Commission for Technical Development and Budapest Institute of Technology, Budapest), Bálint Dömölki (INFELOR System Research Center, Budapest), György Hell (Institute of Languages of the Budapest Institute of Technology, Budapest), László Kalmár (Laboratory of Cybernetics of the Attila József University and Research Institute of Mathematics of the Hungarian Academy of Sciences, Szeged), József Kelemen (Research Institute of Linguistics of the Hungarian Academy of Sciences, Budapest), Ferenc Kiefer (Computing Center of the Hungarian Academy of Sciences, Budapest, and Department of General Linguistics of the University of Stockholm, Stockholm), Dénes Varga (Computing Center of the National Planning Commission, Budapest)

The secretarial duties of the Organization Committee were carried out by József Csapó (Department of English of the Lajos Kossuth University, Debrecen) and Béla Lévai (Department of Slavic Studies and Russian Language of the Lajos Kossuth University, Debrecen).

This Third International Meeting on Computational Linguistics in Debrecen (abbreviated as CLIDE '71) was officially organized by the Research Institute of Linguistics of the Hungarian Academy of Sciences, Budapest, and the Department of Slavic Studies and Russian Language of the Lajos Kossuth University, Debrecen, and co-sponsored by the John von Neumann Society of Computing Technology, Budapest.

There were 160 full participants in CLIDE '71 representing the following countries: Austria (2), Belgium (4), Bulgaria (2), Canada (2), Czechoslovakia (7), Denmark (7), France (10), German Democratic Republic (2), German Federal Republic and West Berlin (21), Hungary (51), India (1), Italy (3), Japan (2), Netherlands (4), Norway (3), Poland (1), Roumania (2), Sweden (8), United Kingdom (2), United States of America (21), and U.S.S.R. (8).

During the five-day conference forty papers were read at a plenary meeting and several section meetings devoted to the general problems of
mathematical linguistics, computational linguistics, theoretical linguistics, the syntactic analysis of natural and artificial languages, and semantic analysis and synthesis.

An interesting experiment to foster a freer exchange of ideas was also ventured during the Meeting: a Round Table Discussion on the Field and Scope of Computational Linguistics. This Round Table was chaired by Bernard Vauquois (Grenoble/Montréal) and moderated by György Szépe (Budapest). Introductory statements were prepared by Bernard Vauquois, David G. Hays (Buffalo, N.Y.), and W. Rosenzweig (Moscow). Invited discussants were the following: László Kalmár (Szeged), Martin Kay (Santa Monica, Cal.), Hans Karlgren (Stockholm), A. Ljudskanov (Sofia), Ferenc Papp (Debrecen), Helmut Schnelle (West Berlin), Petr Sgall (Prague), H. Spang-Hanssen (Copenhagen), P. Verburg (Groningen), and Hiroshi Wada (Tokyo). Other participants were (in the order of their intervention): Yves Gentilhomme (Besançon), K. Culik (Prague), Harry Spitzbardt (Jena), Sheldon Klein (Madison, Wisconsin), Hugo Brandt-Corstius (Amsterdam), Petr Pitha (Prague), József Kelemen (Budapest), H. Rasiowa (Warsaw), and Antal Munnich (Budapest).

The International Committee on Computational Linguistics had two official sessions, both chaired by Bernard Vauquois. One of the sessions was a closed meeting of the members; the other was open to one representative of each country present at the congress. We mention here for the record that immediately after the CLIDE '71 meeting the International Federation for Documentation's Committee "Linguistics in Documentation" held two official sessions. This meeting was presided over by D. Locke (Cambridge, Mass.) with 12 members present, including the secretary of the FID/LD Committee, A. Hood Roberts (Washington, D.C.).

It was decided at the closing session to publish the papers presented at the conference, but unfortunately not every lecturer was able to provide a prepared manuscript at that time. On the other hand, other manuscripts were submitted for publication which circumstances prevented from being read at the meeting. So the material of the volume reflects grosso modo the scholarly content of CLIDE '71.

The Editorial Board for this volume was composed of the President and the Honorary Member of the International Committee on Computational Linguistics, of Professor Wiktor Rosenzweig, the Doyen of the Moscow School of Mathematical Linguistics, and of Ferenc Papp and György Szépe from the Organizational Committee (the last being the responsible editor). The editors would like to thank Tamás Terestyéni, who served as Assistant Editor, and Julie Burgoyne, who offered English editorial assistance.
OPENING ADDRESS
LAJOS TAMÁS

Mr. President, Ladies and Gentlemen,

Allow me to convey the warm greetings that the Hungarian Academy of Sciences addresses to our meeting. I perform this task in two capacities. First, as representative of the Academy's section of linguistic and literary sciences: I am standing in for Professor Gyula Ortutay, president of the section, who, because of his many commitments and to his very great regret, cannot take part in our gathering. Although he is a folklorist, he attaches great importance to questions of the development of linguistics, including mathematical linguistics. As director of the Academy's Institute of Linguistics, I speak in the name of one of the bodies that undertook to organize our meeting. Our partner, as you know, is the chair of Russian Language and Slavic Linguistics founded in 1970 at the Lajos Kossuth University of Debrecen. As for the third body that kindly took part in the work of organization, it is the János Neumann Society of Computing Science, whose generosity allowed us to widen considerably the scope of our meeting. I have the honor of greeting most warmly the members of the Society here present, and particularly Professor László Kalmár and Bálint Dömölki.

Initially we had envisaged only a national meeting, whose title would have been "Automation in Linguistics"; let me point out that the Hungarian name of our conference has remained the same. However, the president of the International Committee on Computational Linguistics, Professor Bernard Vauquois, addressed to us the request to organize this year's biennial meeting in Hungary. ICCL meetings were organized in 1965 in New York, in 1967 in Grenoble, and in 1969 in Stockholm.

It was in Slovenia, in Ljubljana, that the congress of the International Federation for Information Processing took place. In Yugoslavia, however, no body could be found that would accept to undertake the organization of the ICCL meeting. It was in the course of the summer of 1970 that one of the committees of our Academy, the Committee on Mathematical and Applied Linguistics, whose president, Professor Zsigmond Telegdi, I have the honor to greet, began to take an interest in this question. The president of the Hungarian Academy of Sciences approved the Committee's proposal to give our meeting an international character. Such is the genesis of our meeting, which is, in Hungary, the most important international event of this year in the domain of the modern disciplines of linguistics.

Allow me to greet, in the name of the Hungarian Academy of Sciences, the members of the ICCL here present. In the first place Professor David G. Hays, honorary member of the ICCL, who has spared no effort to advance the cause of the modern interdiscipline of Computational Linguistics. Likewise I greet the two secretaries of the committee, A. Hood Roberts (for America) and Professor Yves Gentilhomme (for Europe), as well as the members of the Committee: Hans Karlgren of Sweden, Martin Kay of the United States, Miss Olga S. Kulagina, who represents the Soviet Union, Professor Pieter A. Verburg of the Netherlands, and Professor Hiroshi Wada of Japan. Last but not least, I also greet one of my former students, György Szépe, Hungary's representative on the Committee and at the same time general coordinator of our meeting.

I have already greeted three members of the Hungarian organizing committee. Let me add the others: Professor József Dénes, who is at present taking part in another conference in Yerevan; the distinguished specialist György Hell, Hungarian pioneer of machine translation; Ferenc Kiefer, whom we shall see among us only at the lectures; József Kelemen, representative of our Institute of Linguistics; and Dénes Varga, who worked on the composition of our program. It remains for me to greet, in the name of us all, our host, Professor Ferenc Papp, secretary general of the organizing committee, whose capacity for work seems to rival the output of a computer. Likewise, I warmly thank the members of the committee who took charge of organizing our program. We deeply regret that the president of that committee, Professor Susumo Kuno, cannot be among us in Debrecen.

I should like, in particular, to welcome Professor W. N. Locke, president of the committee on Linguistics in Documentation, which operates within the framework of the International Federation for Documentation, A. Hood Roberts, secretary of that Committee as well as of the ICCL, and the other members of the committee. I wish that their deliberations, after our meeting, may be very fruitful.

And finally, let us be permitted to enumerate, in French alphabetical order, all the countries whose representatives have honored us with their participation: GERMAN DEMOCRATIC REPUBLIC, GERMAN FEDERAL REPUBLIC, AUSTRIA, BELGIUM, BULGARIA, CANADA, DENMARK, UNITED STATES OF AMERICA, FRANCE, GREAT BRITAIN, INDIA, ITALY, JAPAN, NORWAY, NETHERLANDS, POLAND, ROUMANIA, SWEDEN, CZECHOSLOVAKIA, SOVIET UNION, and finally HUNGARY. In all, twenty-one countries, which attests to very appreciable results. Once more I have the honor of greeting in Hungary, in great friendship, the representatives of all these countries. The Hungarian Academy of Sciences conveys to you its best wishes. It is convinced that the exchanges of views that will arise in the course of our meeting, as well as the fraternal relations between the scholars of the various countries, will contribute greatly to the development of our discipline.

Institute of Linguistics of the Hungarian Academy of Sciences, Budapest
GENERAL QUESTIONS
A COMPARISON OF NATURAL AND PROGRAMMING LANGUAGES (Summary) K. CULIK
An assignment statement "x + y =: z" gets its full sense only together with a state of storage S to which it is to be applied, where S is a function by which a value (an object) is assigned to each address x, y (e.g. if S(x) = 2, S(y) = 3, and if "+" is interpreted by the choice of a computer, the assignment 5 =: z is to be executed). The state S can be considered as a set of denotations (or exhibitions) and therefore may be called a notation in general. If we have in mind a natural language, then the states (notations) are not expressed but are determined by the relevant circumstances, that is, by the environment. In fact, in natural languages only proper names play this purely denotational role of being the names of objects. All nouns, pronouns, and noun phrases are, however, also used for the same purpose. In the sentence "John put a book on the table", pronounced yesterday in my room, the objects denoted by "John", "a book" and "the table" are determined by additional information, including knowledge about my room, etc., which is not actually expressed. Without the added circumstances this sentence itself cannot be judged as true or false. These circumstances are expressed separately by the corresponding notation ("John" = exh …, "a book" = exh …, "the table" = exh …), where the objects are actually exhibited, i.e. seen or touched, etc.; thus the expressions together with the various notations concern the actual use of phrases of a natural language. Therefore two levels of semantics (one interpretation and many notations) are distinguished, which allow the clarification of certain differences between natural and programming languages concerning: 1) the different types of homonymy, 2) the number of admissible interpretations (with respect to the historical development of a natural language), and 3) the sorts of meanings (indicative sentences, commands, and denotational–functional expressions).

The system is programmed in Fortran on a Burroughs 5500 time-sharing computer. A highly efficient dictionary lookup procedure is incorporated to facilitate close associations among various natural languages. Discussions of program logic, operations, syntax, and multi-level learning are augmented with sample illustrations of a multi-lingual dictionary and of computer–student dialogues.

Prague
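The storage-state example at the opening of the summary lends itself to a small illustration. The following is a minimal sketch (ours, not Culik's, and in Python rather than the Fortran mentioned above; all names are hypothetical) of how the fixed interpretation of "+" combines with a state, i.e. a notation, to give the statement its full sense:

```python
# Sketch of Culik's two ingredients: a fixed interpretation of "+"
# and a state S ("notation") mapping addresses to denoted objects.

def execute_assignment(state, x, y, z):
    """Execute the statement "x + y =: z" against a storage state S."""
    state = dict(state)              # S: address -> value (set of denotations)
    state[z] = state[x] + state[y]   # "+" is fixed by the computer's interpretation
    return state

S = {"x": 2, "y": 3}                 # S(x) = 2, S(y) = 3, as in the summary
print(execute_assignment(S, "x", "y", "z"))  # {'x': 2, 'y': 3, 'z': 5}
```

With a different state the same statement denotes a different act, which is the analogue of a natural-language sentence acquiring truth conditions only once the circumstances supply its notation.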
THE FIELD AND SCOPE OF COMPUTATIONAL LINGUISTICS* DAVID G. HAYS

* Based on remarks to open a round table discussion at the 1971 International Meeting on Computational Linguistics held at Debrecen, Hungary, September 6, 1971.

Why is computational linguistics a separate professional specialty? Would it not be more appropriate to consider it as the intersection of linguistics with computation? In that case, each specialist could attach himself to one field primarily, and claim a secondary interest in the other. Special interest groups exist in the professional societies of linguists and computer specialists. Are they not sufficient?

The irony of this question is best understood by those who remember the struggles of both linguists and computer specialists to make their disciplines independent. It is possible to regard language as part of culture; hence to make linguistics a branch of anthropology. Or, if language is one of man's cognitive faculties, linguistics is a branch of psychology. To subsume linguistics into the domain of humane studies, think of philology as language history, as philosophy, and as poetics. Looking in the other direction for a moment, we see that computation is readily attached to mathematics (recursive function theory, mathematical logic) or to electrical engineering. Yet societies of linguists thrive, departments of linguistics are solidly established, and the number of full-time professional specialists in computation in the United States is about equal to the number of engineers in all other fields combined. The arguments for giving language and the computer back to the social scientists, mathematicians, and others clearly have no validity in the face of the actual numbers involved, the practical importance of computation, and so on. One may nevertheless argue that computation is used by many sciences, and that no one feels the need for societies of computational physics, computational crystallography, and so on. The list has to be cut short, of course, to avoid stumbling into certain key fields for which specialized societies do
exist: simulation, for example. In many sciences, the problems for which computation is wanted are statistical problems. The computer programs belong neither to computation nor to the science, but to the statistician. Monte Carlo methods, differential equation solutions, and other procedures have nothing much to do with the internal structure of the sciences that apply them.

The situation in computational linguistics is different. Notwithstanding the obscurity of the competence–performance distinction as expounded in the linguistic literature, the possibility and even the value of making such a distinction is clear, and Chomsky is to be congratulated for introducing it. The origin of the idea, I believe, is in mathematical logic. John McCarthy, evidently seeing a different use for the same idea, distinguishes between knowing the solution of a problem (say, knowing that 13 is the square root of 169) and knowing how to obtain the solution of a problem (say, knowing that Newton's approximation yields the square root to any desired degree of precision). The square root function can be exhibited as a curve in a plane, but the algorithm has to be described as a process. One is like competence, the other like performance. Or would it be more correct in the sense of the history and philosophy of science to say that competence–performance is like algorithm–function? That question is irrelevant to my theme, which is about the similarity and not the priority. It is approximately correct to say that computation is the intellectual field that studies algorithms—which is to say, performance—and that linguistics is the intellectual field that studies competence, at least human competence with respect to information. The two fields are, in this sense, complementary.

Two problems arise immediately. The first is that the algorithms of computation have for the most part little to do with language or information; they have to do with the approximation of mathematical functions, and of those functions that are of interest in the infinitesimal calculus more often than any others. The second problem is that computation supplies the algorithms for many fields, not only for linguistics.

Concerning the first problem. The integrity of the field of computation is assured by all those theories, the earliest antedating the whole art of computation and others being due to von Neumann and to Markov, that show how all computations can be performed with a minuscule set of primitive operations. The simplest automatic computer is as well equipped for linguistic algorithms as for numerical algorithms—but of course the most complex automatic computer is richly equipped for numerical computation and has still only the primitive operations of linguistics. And so it is with the field of computation; its integrity is primitive and fundamental, but not practical. By adding to the simple ideas of algorithm theory, taking the new elements from, say, calculus, a field arises—call it numerical analysis. But if the additional elements come from mathematical logic, or from linguistics, or from somewhere else, then the new field is not numerical analysis and must have a name of its own. The answer to the first problem, therefore, is that if computational linguistics did not exist (because the whole field of computation was engrossed in numerical analysis) it would be necessary to invent it. The foundations of linguistics are not yet fully understood, but they are somewhat different from the foundations of the infinitesimal calculus, combinatorics, and even mathematical logic—the latter being the field in which the study of the foundations of linguistics originated. Because the foundations are different, the higher development of computational linguistics must be different from the higher development of computational calculus, computational combinatorics, or even computational logic.

Concerning the second problem. We now wish to say that computational calculus supplies the algorithms for many fields, all of which had used calculus as their fundamental analytic tool before computation was invented. Also, computational combinatorics and computational logic supply the algorithms for many fields. Computational linguistics is destined, I believe, to supply the algorithms for many fields, when it is more fully developed. To understand the issue more clearly, let us contrast two technical fields, computational calculus and computational physics. The first exists, being known as numerical analysis. It combines the fundamental ideas of two branches of mathematics—algorithm theory and the infinitesimal calculus—and develops methods of wide applicability. The second does not exist; it would combine algorithm theory with certain ideas about the universe; but that marriage of incompatibles would not be fruitful.

Now to computational linguistics again. If linguistics is understood to be a branch of mathematics, then it is compatible with algorithm theory and the union can bear fruit. If linguistics is understood to be a branch of natural science, like physics or psychology, then linguistics is incompatible with algorithm theory and nothing good can come out. The term historically is used in both senses, and confusion has been the result. Solomon Marcus says that formal linguistics is a pilot science, emphasizing at the same time that the ordinary field of linguistics is not. But that is to say that linguistics as a branch of mathematics will supply methods to many fields of science, whereas linguistics as a descriptive field, a branch of natural history or natural science, does not. I think the case for the independence of linguistics as made by Bloomfield and his contemporaries really has to do with linguistics as a branch of mathematics.

I hesitate to call mathematics a science, taking the view that mathematics is out of touch with the natural universe. Likewise I do not believe that computing is a science; it belongs instead on the side of mathematics. Computing would, I suppose, follow the same laws in any universe whatever. In general,
I believe that information is different from the substance that realizes it, that mathematics and computation are two disciplines dealing with information, and that formal linguistics is a third. Another way to put it would be that computation and formal linguistics are branches of mathematics. However we put it, if linguistics is independent because it deals with the internal laws of language, then linguistics is not a science, as I see it. Let us say that mathematics is about competence in the domain of number and that computation is about algorithms. Then the common interest of mathematics and computation is in algorithms with heuristic or approximative value in the domain of number. We can say that linguistics is about competence in the domain of information; then computational linguistics is about algorithms with heuristic or approximative value in the domain of information. Hence computational linguistics is as different from numerical analysis as the infinitesimal calculus is different from theories of formation and transformation of strings and theories of denotation. The special place of computational linguistics relative to computation is due to the fact that algorithms themselves are in the domain of information. Whereas numerical analysis helps solve problems of nuclear physics, astronautics, and biology, computational linguistics helps solve problems of, among other things, computation. (Much work in computational linguistics has been done by systems programmers and other specialists in computation; note that phrase-structure grammar is also called Backus-Naur form.)

Now consider descriptive linguistics, the science. It surely has use for formal linguistics as a source of models for help in understanding the empirical phenomena it studies. It also needs computational linguistics, as physics needs numerical analysis, as a tool in practical data gathering and data analysis. Since language occurs on earth only in the human species, only in acoustic form or surrogates thereof, and in great diversity around the world, descriptive linguistics needs close contacts with psychology, with phonetics, and with anthropology—among other relevant sciences of man. But the condition of descriptive linguistics today calls for special attention and even sympathy. After having applied the term 'taxonomic' to earlier work on human language, members of the transformational school must now be distressed to recognize that transformational grammar, too, is 'merely' taxonomic. That is to say, transformational theory is essentially without causal laws. If linguistics is a branch of cognitive psychology, then the causal laws come from psychology, and that is what Chomsky seems to assert when he asks for theories of language that show how its structure can be explained in terms of brain structure. But something else has to be said; the subsumption of linguistics into psychology begs and does not answer the question.

The goal of linguistics is not clear. The notion of simplicity that appears
so often merely says that when all other criteria fail to settle the issues of linguistics it has the same ultimate recourse as all other sciences, to be simple rather than complex without motivation. But what are the criteria that have to fail before the linguist resorts to the non-criterion of simplicity? Perhaps other criteria will be discovered in the future. For the present, I can see only two kinds. The first and most immediately important are the psycholinguistic criteria, having to do with the nature of the brain as computer. What kind of algorithms is the brain equipped to run? Human language must be such that algorithms of that kind are able to handle it. The second and ultimately most important are the abstract criteria of formal linguistics: what kind of stuff is information? Whatever it is, the human brain must be the kind of machine that can handle it, and human language must be the kind of stuff that can convey it. The psycholinguistic criteria draw linguistics into the circle of the sciences of man; the formal linguistic criteria draw linguistics away into its classic position of independence.

Now, in my opinion a rich theory of information can be developed only in terms of the classes of machines that are involved in processing it. But a theory of machines can be a competence theory or a performance theory; the more it is about competence, the more it will look like linguistics, and the more it is about performance, the more it will look like psychology. A four-way scheme can be arranged, with psychology, computation, formal linguistics, and descriptive linguistics at the poles. Psychology and computation are about performance, formal and descriptive linguistics are about competence; computation and formal linguistics are abstract, and psychology and descriptive linguistics are sciences. But two other fields have to find places in this scheme: psycholinguistics joins psychology with linguistics, and seems at this time a most fruitful field, one in which great progress can be made with benefit to both parent fields. Correspondingly, on the abstract side computational linguistics joins computation with formal linguistics and also seems a fruitful area, one in which rapid progress can be expected with benefit to both parent fields and with beneficial application to psycholinguistics. The most likely place to arrive at a working idea of how competence and performance—algorithms and information—are related is therefore computational linguistics. The more we achieve on the formal, abstract side, the better the chance of formulating goals and criteria for linguistics that will help the linguist decide whether a grammatical invention merits prolonged study.

State University of New York at Buffalo
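The contrast Hays borrows from McCarthy can be made concrete with a short sketch. The code below is an editorial gloss (ours, not anything from the address; Python, hypothetical names): the square-root *function* is the competence-like object, while Newton's approximation is the performance-like *algorithm* that obtains it to any desired precision.

```python
# Newton's approximation: knowing HOW TO OBTAIN the square root (an algorithm,
# a process) as opposed to knowing THAT 13 is the square root of 169 (a fact
# about the function, exhibitable as a curve in a plane).

def newton_sqrt(n, tolerance=1e-10):
    """Approximate the square root of n to the desired precision."""
    estimate = n if n > 1 else 1.0
    while abs(estimate * estimate - n) > tolerance:
        estimate = (estimate + n / estimate) / 2  # one Newton iteration step
    return estimate

print(newton_sqrt(169))  # ~13.0
```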
ON ALGORITHMIC LOGIC (Summary) H. RASIOWA
The aim of this paper is to give a brief survey of results dealing with a formalization of the concept of program. This research is conducted at the University of Warsaw, Seminar on Logical Foundations of the Algorithm Theory (directed by Professors Z. Pawlak and H. Rasiowa). Other topics of this group are concerned with various approaches to the concept of program, in particular with that based on Pawlak's machine, and moreover deal with problems on automata proving theorems.

A formal approach to the concept of program and related questions yielded the notion of FS-expressions (interpreted as programs) and algorithmic logic (A. Salwicki [1], [2], [3]). The notion of FS-expression is equivalent, with respect to realizations, to that of Scott's program [4] (A. Salwicki [2]). Roughly speaking, algorithmic formalized languages differ from the first-order predicate languages by another kind of quantifiers, i.e., the iteration quantifiers appear instead of those usually adopted. Moreover, FS-expressions and formulas of the form Kφ (where K is an FS-expression and φ is a formula) occur in these languages. A. Salwicki constructed a mapping (see [1]) which assigns to every FS-expression K a formula E(K) describing the stop problem for K. These formulas are finite expressions, in opposition to Engeler's formulas (cf. [5]). An infinitistic axiomatization of a fragment of algorithmic logic is due to A. Salwicki, who proved also that the set of all valid formulas in this fragment is not recursively enumerable. An infinitistic axiomatization of the whole of algorithmic logic is due to G. Mirkowska. Other interesting results concerning equivalence of FS-expressions, prenex form of formulas, compactness, deduction theorems, existence of models, the Skolem–Löwenheim theorems, arithmetics based on algorithmic logic,
connections between partially recursive functions and functions computable by means of programs, and others, are obtained by G. Mirkowska and A. Salwicki. Moreover, the degree of undecidability of the set of all valid formulas and the propositional calculus with programs were investigated by A. Kreczmar and M. Grabowski, respectively. A generalization of FS-expressions and of algorithmic logic based on many-valued predicate calculi (cf. H. Rasiowa [6]) is also examined.
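As an informal gloss (ours, not part of Salwicki's formalism), the effect of an iteration quantifier, and with it the stop problem for a program K, can be imitated by bounded iteration of a state transformer: "some finite number of repetitions of K makes φ true". A minimal Python sketch, with all names hypothetical:

```python
# Bounded approximation of the iteration quantifier: does phi become true
# after some finite number n of applications of the program K?

def iterate_until(K, phi, state, bound=1000):
    """Return True if phi holds after some n < bound applications of K."""
    for _ in range(bound):          # finite cut-off of the quantifier over n
        if phi(state):
            return True
        state = K(state)
    return False                    # undecided within the bound

K = lambda s: {"x": s["x"] - 2}     # toy program: x := x - 2
phi = lambda s: s["x"] <= 0         # exit condition
print(iterate_until(K, phi, {"x": 9}))  # True: "while not phi do K" halts here
```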
University of Warsaw
REFERENCES

[1] Salwicki, A. 'Formalized Algorithmic Languages', Bull. Ac. Pol. Sci. 18, 1970, 227–232.
[2] Salwicki, A. 'On the Equivalence of FS-Expressions and Programs', ibid., 275–278.
[3] Salwicki, A. 'On the Predicate Calculi with Iteration Quantifiers', ibid., 279–285.
[4] Scott, D. 'Some Definitional Suggestions', J. Comp. Syst. Sci. 1, 1968, 187–203.
[5] Engeler, E. 'Algorithmic Properties of Structures', Math. Syst. Theory 1, 1967, 183–196.
[6] Rasiowa, H. 'A Theorem on the Existence of Prime Filters in Post Algebras', Bull. Ac. Pol. Sci. 17, 1969, 347–364.
SOME PECULIARITIES OF THE MATHEMATICAL DESCRIPTION OF LINGUISTIC OBJECTS
JU. A. ŠREJDER

§ 1

The question of the relation between the mathematical description of linguistic objects and linguistic reality still serves as the subject of the most varied discussions, in which the views of different linguistic schools and of different methodological positions collide. Objections to the mathematization of linguistic research are often linked with objections to structural methods in linguistics. Statements of this kind may contain perfectly logical arguments pointing out the weakness or the limitations of this or that concrete mathematical scheme of language, but they usually offer no objective assessment of the principles of the mathematical approach to the description of language. It therefore seems appropriate to analyze the sphere of applicability of mathematical methods in the description of linguistic reality without tying the analysis, as far as possible, to one or another scientific current within linguistics itself.

The subject of our discussion will be mathematical linguistics and its relation to linguistic reality. Here (the author's point of view is not original; it is shared by the majority of mathematical linguists) mathematical linguistics itself is regarded as a branch of mathematics, and not as a part of linguistics. For that reason we do not consider it possible to tie mathematical schemes to any particular linguistic point of view. On the other hand, the point of view presented here was worked out in the course of collective work and is reflected in a whole series of publications and reports of the group working at VINITI of the USSR Academy of Sciences (Moscow). In particular, the papers [1]–[5] should be mentioned, in which this point of view found concrete embodiment (more detailed references can be found in those papers).

It probably makes sense to begin with some analogies from the domain of physics. In physics (see [6]) a distinction is drawn between laws of nature (equations of motion, relations between the basic physical quantities, rules of invariance, etc.) and 'initial conditions', i.e. the characteristics of a concrete physical situation. […] on which, in fact, the success of the discovery of a law of nature depends. Thus the law of the constancy of acceleration in falling bodies could be discovered only because the experiment was carried out with sufficiently heavy bodies, for which air resistance did not make itself felt. […] it is clear that in a number of cases mathematics can give only qualitative or even purely heuristic considerations, while the quantitative aspects of a phenomenon have to be investigated by experiment or by modelling. Sometimes this is connected with the unwieldiness of the computation that lies ahead; then one may turn to a computer for help, or model the phenomenon by similar ones. […] we are not in a position to take account of all the necessary factors in the mathematical scheme, and then nothing remains but observation or experiment. Such a situation means that we have not succeeded in creating an adequate theory of the given physical phenomenon. This state of affairs determines the place of mathematics in the physical and technical sciences.

In the author's opinion, the existing disputes about the applicability of mathematical methods to the description of linguistic reality are to a considerable extent explained by the fact that within linguistics itself a sufficiently clear division of domains has not yet taken place. In essence, linguistics does not separate nearly so consistently the study of the individual peculiarities of concrete words, grammatical constructions, etc., the methods of describing linguistic constructions, and the investigation of fundamental linguistic regularities. If an analogous situation obtained in the physical sciences, it would be impossible to work on the foundations of quantum mechanics or on the general theory of gravitation independently of the calculation of concrete physical effects or of observations of the motion of celestial bodies. The general theory of relativity won recognition by predicting certain irregularities in the motion of Mercury; but that is a quite incidental confirmation of its fruitfulness, and far more important consequences of the theory arose in cosmogony. Natural science […] simple and general laws. […] keeping to the analogy already used, one may say that the effort is spent not on the search for the laws of dynamics but on methods of computing planetary orbits. Thus generative grammars are a good means of describing syntactic structure, but the means is employed before it has been clarified what syntactic structure is. Nevertheless even this direction in mathematical linguistics is fully justified and useful for the study of language. Finally, where the occasional facts of language are investigated, […] quite another science, in fact separate from mathematics. The task of statistics is not to guess at the inner regularities of a phenomenon, but to give a competent description of its outward features.

§ 2

M. I. Steblin-Kamenskij [7] caustically remarked that structural linguistics in general is fond of occupying itself with the naming of linguistic phenomena. It would be more accurate to say that structural and mathematical linguistics, alongside the immediate linguistic object, create and study new meta-objects, which are indeed expanded names (i.e. structural descriptions) of the original linguistic objects. But in doing so, essential linguistic […] laws of the motion of the planets in their orbits. Yet it was Newton who, having been the first to write down¹ the equations of motion, discovered the law of universal gravitation and showed that Kepler's laws can be derived purely mathematically. True, such a way of naming the phenomena of nature […] it is very essential to find a felicitous way of description (naming): it must be not taxonomic, not positivistic, but must proceed from the deep schemes of the phenomenon.

It is appropriate to stress the following circumstance here. The main feature of the mathematical description of a phenomenon is by no means its pedantic formalism, but its striving to express the inner mechanism embodied in the given phenomenon. The formal rigor of mathematical language is rather relative, and the logicians have their own complaints about it. Rigor and precision of statements are general requirements made of any science. What is specific to mathematical description is that it applies not to individual objects but to classes of objects, and seeks in these classes the embodiments of a single theory. The theory itself deals not with concrete objects and their properties, but with the names of predicates. As applied to mathematical linguistics, this point of view is formulated in detail in [8]; here we shall give only an outline of it. At its base lie the general logical concepts of theory and model. What is important to us now is not so much the exact formulations of these concepts² as their general sense.

We take as our basis the treatment of the concepts 'model' and 'theory' which, since the work of the Polish logician A. Tarski, has been generally accepted in mathematical logic. However, we shall not restrict ourselves to strictly formal theories and models, as the logicians do, since what is essential for us is the general significance of these concepts. The strict concepts of mathematical logic are, from our point of view, only partial explications of the corresponding 'fuzzy' concepts. Their content can be explained without resorting to a complicated mathematical apparatus. A theory describes certain properties of things, but not the things themselves. In essence the language of any theory is adapted to describing the names of properties and relations regardless of whether any objects with such properties and relations exist. One can construct a 'theory of colors' in which there are the names 'red', 'crimson', 'blue', etc. In this theory it is asserted that the first two are similar to one another, while neither resembles the third. The correlation of these names with really colored objects, however, is not the business of the theory itself. The theory regards these names as conventional codes admitting the most varied interpretation. For example, we may agree to consider 'red' an object weighing more than 10 kg, 'crimson' one weighing more than 12 kg, and 'blue' one weighing less than 5 kg, similarity being judged by weight. This interpretation is remarkable in that 'crimson' appears in it as a special case of 'red'. The arrangement of the colors in the order of the spectrum, or around the color circle, is all merely theory so long as we do not correlate the color names with real objects. And here is one more 'strange' interpretation of the 'theory of colors'. Take a set of circles, slightly elongated ellipses, and equilateral triangles. Let us consider (simply speaking, call) the circles red, the ellipses crimson, and the triangles blue. It is natural to consider circles similar to slightly elongated ellipses. Thus it turns out that a theory can also be interpreted on a 'model' made up of abstract elements.

So a concrete interpretation is that which […] of objects, not depending on the existence of those objects. A theory may use a strictly formal language³; in this case it is called […] may have different embodiments (models). Our 'theory of colors' can be embodied in the ordinary colors of physical objects and in abstract geometric figures. It is curious to recall here Goethe's polemic with the followers of Newton about the nature of color. At first sight their theories of color were incompatible. In reality both theories are equally sound, but they have different embodiments: Newton's theory relates to monochromatic […] can be regarded as a kind of interpretation […] of the meaning of this object. Just as a theory that has no models is senseless, so a phenomenon that has received no theoretical interpretation does not yet have scientific meaning; it thereby lies outside the system of scientific concepts. This means, not that the given phenomenon is devoid of inner meaning, but only that this meaning has not been divined and does not exist within the framework of scientific knowledge. To learn it means precisely to understand the inner meaning of the phenomenon, to understand of what it is the embodiment.

Thus, for example, it is natural to regard generative immediate-constituent grammars as a theory embodied in the syntactic structure […] such a theory is above all our conjecture about that reality. Chomsky's grammar points out […] in the statics of a concrete […] embodied in one or another phrase of the language. In the papers of V. B. Borščev and M. V. Xomjakov ([1], [4]) a theory language is constructed which is based on the notion of a neighborhood in a text. Such a theory states the axiomatics of the neighborhoods of a given language, and every text of the language is interpreted as a model of this theory, the neighborhood grammar; the text […] as a linguistic phenomenon embodying the structure which the neighborhood grammar resolves. Here we have an attempt to divine the noumenal reality of language […] of its use in a real sign situation. The fact that a text corresponds to some denotatum is itself occasional, but the possibility of acquiring a meaning is determined by the grammatical correctness of the text. The quasi-Russian word-formations cited above are built according to the rules of Russian grammar, but their appearance in a text was determined by an extra-linguistic situation, by some occasional need. The fixation of these words in the Russian language may prove to be just as accidental; at present one can only say that it would be improbable.

From what has been said it is evident that semantics is closely connected with statistics, with the measure of the currency of texts, which reflects the level of their ready-made meaningfulness. To express this thought more precisely, it is useful to separate the primary semantics of a text from its secondary semantics. By the former we shall understand the standardized, generally current, obvious sense of the text. Every word has in the language a semantic cliché, fixed in the language […] possesses a primary semantics that does not depend on context or situation. But besides its primary semantics a text may have a secondary semantics, determined by the concrete intention of the speaker. Here a text with the simplest primary semantics may, in a situational context, acquire a highly complex secondary semantics; and in the same way a text possessing no primary semantics may acquire one in a concrete sign situation. An example of the first case (banal primary semantics and non-trivial secondary semantics) is given by the dialogues of Chekhov and of Hemingway; the second case is the literature of the absurd. In the development of a language the secondary semantics of a text is capable of passing over into firmly fixed primary semantics. Since primary semantics deals with sufficiently sharp semantic clichés and their quite definite combinations, it is entirely reasonable to speak, following the authors of [10], of a 'grammar of meanings', understanding by this the rules for forming combinations of primary semes, for transforming texts while preserving their primary semantics, and so on.

The features of grammar and semantics just considered have a direct correspondence in the mathematical apparatus of language description. Grammar describes the class of correct texts, which it is natural to treat as models (relational systems) embodying the grammatical relations. In that case grammar can be described as a theory (in the logical sense) whose models are the texts of the language. The totality of the grammatically correct texts of a language forms an axiomatizable class of models of the given theory. In the cited papers of V. B. Borščev and M. V. Xomjakov a felicitous way is developed of writing down such a theory in […] while one may hope to obtain an axiomatics of the potentially realizable (potentially meaningful) texts, one can hardly construct a sufficiently complete axiomatics of the texts actually in use. The existence of such an exhaustive axiomatics would contradict the principle of human freedom.

A way out of this situation can be found, in principle, as follows. One must introduce the notion of a quasi-model of a given theory, i.e. of a relational system which almost embodies the given axiomatics. Categories of this kind arise naturally in linguistics, for example in connection with the notion of projectivity: a real sentence may deviate somewhat from projectivity. The next step consists in introducing almost-axiomatizable classes of models. What is meant is a class of relational systems among which almost all (from the point of view of some measure) satisfy a certain axiomatic theory. It seems highly natural to suppose that the class of meaningful texts of a language is an almost-axiomatizable class of models of some theory, all the texts of this class being almost models of that same theory [8]. In substance this supposition says that the 'majority' of texts with clearly expressed primary semantics are well described by a certain axiomatics, while the remainder are obtained by small violations of that axiomatics. Freedom to violate the rules of the language is admitted here, but this freedom is bought at a definite price; hence the violations themselves are not so frequent in speech. If, in the course of the development of a language, some rule is violated sufficiently often, then it is thereby itself transformed, and the whole axiomatics of the language changes. In this approach, estimates of the complexity of particular linguistic phenomena begin to play an essential role in the mathematical description of language.⁴ Naturally, with such an approach […] the distinction between the relevant and the occasional facts of language. Thus in [3] quantitative regularities are established for the numbers of words of various ranks that are preserved from an ancestor language or borrowed from other languages, but no predictions can be made as to precisely which words will be preserved in a language or borrowed from others. For that matter, facts that are occasional for linguistics may prove relevant for psychology, history, or sociology.

¹ I.e. the first to designate the motion of a planet by its equation.
² Which can be found in [8].
³ For example, the language of the narrow predicate calculus with a universally interpreted relation of identity.
⁴ Cf. the connection established in [11] between complexity and […]

VINITI, Moscow
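Šrejder's central example, one theory realized by quite different models, can be rendered in a few lines of code. The sketch below is an editorial illustration (ours, not Šrejder's; Python, all names hypothetical): the same three axioms of the 'theory of colors' are satisfied both by an interpretation in weights and by one in geometric shapes.

```python
# One theory, two embodiments: "red and crimson are similar; neither
# resembles blue", checked against two unrelated interpretations.

theory = [("red", "crimson", True), ("red", "blue", False), ("crimson", "blue", False)]

def satisfies(model, similar):
    """Does an interpretation satisfy every axiom of the 'theory of colors'?"""
    return all(similar(model[a], model[b]) == same for a, b, same in theory)

# Interpretation 1: the names denote weight classes; similar = both over 10 kg.
weights = {"red": 11, "crimson": 13, "blue": 4}
print(satisfies(weights, lambda u, v: (u > 10) == (v > 10)))  # True

# Interpretation 2: the names denote shapes; similar = both rounded figures.
shapes = {"red": "circle", "crimson": "ellipse", "blue": "triangle"}
rounded = {"circle", "ellipse"}
print(satisfies(shapes, lambda u, v: (u in rounded) == (v in rounded)))  # True
```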
REFERENCES

[1] Borščev, V. B., Xomjakov, M. V. 'Okrestnostnye grammatiki i modeli perevoda' [Neighborhood grammars and models of translation], Naučno-techničeskaja informacija, Ser. 2, Nos. 3, 4, 8, 1970.
[2] Arapov, M. V., Šrejder, Ju. A. 'O zakone raspredelenija dlin predloženij v svjaznom tekste' [On the law of distribution of sentence lengths in connected text], ibid., No. 3, 1970.
[3] Arapov, M. V., Xerc, M. M. 'Model' glottochronologii' [A model of glottochronology], in: Informacionnye voprosy semiotiki, matematičeskoj lingvistiki i avtomatičeskogo perevoda, Moscow, VINITI, 1972.
[4] Borščev, V. B., Xomjakov, M. V., in: Trends in Soviet Formal Linguistics, The Hague, 1971.
[5] Šrejder, Ju. A., ibid.
[6] Wigner, E. Ėtjudy o simmetrii [Essays on symmetry], Moscow, 1971.
[7] Steblin-Kamenskij, M. I., Voprosy jazykoznanija, No. 5, 1971.
[8] Šrejder, Ju. A. O ponjatii 'matematičeskaja model' jazyka' [On the concept of a mathematical model of language], Moscow, Znanie, 1971 (series 'Matematika i kibernetika').
[9] Šrejder, Ju. A. 'Jazyk kak instrument i ob"ekt izučenija nauki' [Language as an instrument and object of the study of science], Priroda, No. 5, 1972.
[10] Žolkovskij, A. K., Mel'čuk, I. A. 'O semantičeskom sinteze' [On semantic synthesis], Problemy kibernetiki 19, Moscow, Nauka, 1968.
[11] Šrejder, Ju. A. 'O vozmožnosti teoretičeskogo vyvoda statističeskich zakonomernostej teksta' [On the possibility of a theoretical derivation of the statistical regularities of a text], Problemy teorii peredači soobščenij, No. 1, 1967.
ELASTIC MATCHING OF CODED STRINGS AND ITS APPLICATIONS

A. J. SZANSER
1. The nature of elastic matching

It is well known that direct matching of strings of coded elements does not provide a quantitative comparison; it is solely a matter of acceptance or rejection, and rejection follows even if the error is insignificant. The reason is that a string is determined not only by its constituent elements but also by their arrangement. A method to cope with this difficulty and so produce a quantitative basis for comparison has been named by the author 'elastic matching'. It was presented at the preceding conference (1969) in application to natural languages in the written form and, more precisely, to automatic correction of errors in keyed input [3]. Briefly, the method consists in spreading the elements of a string (the letters in the given case) so as to preserve their order and yet to allow for unambiguous matching for each element. The spreading is guided by a standard sequence of all the elements used, in a pre-arranged order. In the computer representation, the presence of an element is recorded as a ONE-bit at the corresponding place in the machine word. If a letter in the text word falls before its preceding letter, a new machine word must be started. Thus, using the English alphabet as the guiding sequence, the word CERTITUDE divides into three machine words (from now on to be called 'lines'), viz.: CERT - ITU - DE, the letters in each 'line' following the order of the alphabet. The words to be compared are 'linearized' first, and then matched line by line, for example:¹

I.  Standard version    C E R T    I T U    D E
II. Letter missing      C E   T    I T U    D E
¹ In this and following examples, the ordinary sequence of the English alphabet was taken as the guide, for illustration purposes. This is, however, not the optimum sequence — cf. below, 2.1.
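Mechanical as it is, the linearization step invites a short illustration in code. The following sketch (an illustration added here, not the author's; Python, with the plain alphabet as the guiding sequence and hypothetical function names) splits a word into 'lines' and represents each line as ONE-bits over the guiding sequence, as in the machine-word representation described above.

    GUIDE = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # the optimum sequence of 2.1 could be substituted
    POS = {letter: i for i, letter in enumerate(GUIDE)}

    def linearize(word):
        # A new 'line' starts whenever a letter does not come strictly
        # after its predecessor in the guiding sequence.
        lines, current = [], word[0]
        for letter in word[1:]:
            if POS[letter] > POS[current[-1]]:
                current += letter
            else:
                lines.append(current)
                current = letter
        lines.append(current)
        return lines

    def to_bits(line):
        # One bit per position of the guiding sequence, as in a machine word.
        mask = 0
        for letter in line:
            mask |= 1 << POS[letter]
        return mask

    print(linearize("CERTITUDE"))   # ['CERT', 'ITU', 'DE']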
III. Letter extra       C E R T    I S T U    D E
                                     ↑
IV.  Wrong letter       C E S T    I T U      D E
                            ↑ ↑
The corresponding 'lines' are matched by the 'non-equivalent' logical operation and the errors stand out as ONE-bits. In the diagram above they are indicated by arrows. If their total number in the word is not greater than a pre-determined threshold, the version is accepted. In our work the maximum allowed was one error per word, defined as follows: one letter extra, missing or changed, or two adjacent letters interchanged. This definition requires some additional conditions; for example, in case (IV) above the two non-equivalent bits are accepted as one error, because they do not break the sequence of other letters. On the contrary, in:
V.    C E    R T    I T U    D E    T
                      ↑             ↑
the two outstanding bits are separated by U, D, E and so they represent two independent errors. If the misspelled word breaks into a number of 'lines' different from the correct version, the procedure is to shift back the remaining lines of the longer version, re-matching the next line with the result of the first disagreeing match, e.g.:

VI. (Diagram: the longer version yields a first disagreeing result on an early 'line'; its remaining lines are shifted back and re-matched, giving a final disagreeing result of one error in the word, and the version is accepted.)

In using this method the misspelled words can be quickly and effectively identified in a list of standard words, and then corrected. The minimum length of a word assumed in this work was 4 letters (the correction of shorter words is discussed below). When the list grows and eventually becomes a complete dictionary, there arise two important problems, namely securing a reasonable total checking time, and selecting the most appropriate word of the many which can possibly be retrieved from such a large set. In the 1968 paper, the author gave some indications of how he was going to attack these problems. This paper presents the results of the investigations completed, formulates new ideas concerning these same applications and discusses the possibility of other applications.
2. The work done and the results achieved

2.1 The optimum sequence

The first thing to be done was to establish an optimum sequence of English letters, with which the segmentation into 'lines' would be the most efficient (the smallest number of lines from a chosen text). This was done by both statistical and linguistic methods and the results were given in Ref. [2], Pt. I. The guiding sequence, assumed as the basis for subsequent work, was:

JCFVWMPLOQUEAINRKGBSTHXDYZ²

2.2 The matching technique

Next, the procedure of elastic matching was developed, including practical details, such as special precaution against 'self-cancelling' matching in some cases. A full report was issued as Part III of Ref. [2].

2.3 The long-list matching

The complete technique was then tested using a large set of standard words, in fact a section of an English dictionary. The section comprised about 6 thousand words starting with the letter 'S' and representing roughly 10% of the dictionary. The corrective matching for a batch of 20-30 misspelled words took less than half a minute of computer time ('central processor time'),³ including preliminary procedures such as linearizing, whereas the total time, involving the use of peripheral equipment (the dictionary section was stored on magnetic tape, in several 'blocks'), was about 1-2 minutes. These operational times can be extrapolated to using a complete dictionary, which was done taking into account certain possible improvements, with the result that the times of operation would rise more slowly than the bulk of the word-list (see below, 3.1). It has to be stressed that, with the error-threshold assumed, the matching needs to be done only against words containing the identical number of 'lines' or differing by one only. Among other conclusions, the investigation proved that the real need for selective methods to be applied to multiple retrievals is limited to shorter words (4-5 letters), whereas long words have usually only one valid alternative. The detailed findings were included in the next memorandum in the series [2], Part IIIa.
² Since this result is unavoidably connected with the corpus used, no universality is claimed for it; it is, however, a good working approximation.
³ The KDF 9 machine has been used throughout.
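The criterion of 2.1, the smallest number of 'lines' from a chosen text, is easy to state operationally. A hedged sketch (this is not the statistical-linguistic procedure actually used, only the scoring function it optimizes):

    def average_lines(words, guide):
        # Mean number of 'lines' produced by a candidate guiding sequence;
        # smaller is better.
        pos = {letter: i for i, letter in enumerate(guide)}
        total = 0
        for word in words:
            lines = 1
            for prev, cur in zip(word, word[1:]):
                if pos[cur] <= pos[prev]:   # a new machine word must start
                    lines += 1
            total += lines
        return total / len(words)

    sample = ["CERTITUDE", "MATCHING", "STRING"]
    print(average_lines(sample, "ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
    print(average_lines(sample, "JCFVWMPLOQUEAINRKGBSTHXDYZ"))   # the paper's sequence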
2.4 The 'General Content' check

At the same time an investigation was carried out on one method of selection (any selection must, of course, be based on the context), namely the quasi-semantic method of the 'General-Content' (GC) check, using the observed fact of word repetition in a coherent sample of natural text. Putting aside the so-called function words (such as prepositions, auxiliary verbs or pronouns), those remaining, which include the syntactic classes of nouns, verbs, adjectives and most adverbs, are, in general, characteristic of the subject matter. If any of them becomes distorted and multiple retrievals are produced, that among the latter which is repeated elsewhere in the same text is likely to be the right choice. The technique of this check was investigated and the results have been reported on in another memorandum [2], Pt. IV. Among other findings, it may be mentioned that the 'content' (that is, non-function) words are conveniently and sufficiently represented by their stems, i.e. their grammatically immutable parts, and even these can be shortened for the purpose. A workable maximum length, for English, was assumed to be 6 letters. If the stem itself, owing to the peculiarities of grammar, undergoes change (like 'make' - 'made'), a suitable provision is made in the dictionary (which otherwise only shows the number of letters to be retained for the GC word list). Recognized prefixes are detached and replaced by a short (two numerical characters) code, in order to retain a sizeable portion of the root. The GC list is compiled during an initial dictionary check, and used against the multiple versions produced by elastic matching. Special rules apply to proper names, abbreviations, etc. Apart from the series of memoranda quoted, papers dealing with this subject in a more general way were presented at international conferences [3, 4], or published elsewhere [5].
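The GC check, too, can be caricatured in a few lines (a toy sketch added here: the function-word list is token, and truncation to six letters stands in for the dictionary-guided stemming and prefix coding described above):

    FUNCTION_WORDS = {"THE", "A", "AN", "OF", "TO", "IS", "AND", "IT", "HE"}

    def gc_key(word):
        return word[:6]        # at most six letters of the stem are kept

    def build_gc_list(text_words):
        # Truncated stems of the content words seen so far in the text.
        return {gc_key(w) for w in text_words if w not in FUNCTION_WORDS}

    def select(candidates, gc_list):
        # Prefer, among multiple retrievals, a word repeated elsewhere
        # in the same text; otherwise leave the choice open.
        preferred = [c for c in candidates if gc_key(c) in gc_list]
        return preferred or candidates

    gc = build_gc_list("THE MATCHING OF CODED STRINGS".split())
    print(select(["MATCHING", "MARCHING"], gc))   # ['MATCHING']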
3. The work in progress

The investigation in progress proceeds in two main directions: elastic matching of words using a complete dictionary, in particular finding operational times, and an additional selection method, based on syntactic analysis.

3.1 Using a complete dictionary

The use of a complete dictionary requires compiling and organizing it in such a way that the number of actual matchings for each word should be brought to an absolute minimum. Compilation consists of (a) rejecting unwanted entries (short words of 3 and fewer letters, proper names and the like) and (b) linearizing all accepted words according to the standard sequence (see above, 2.1). Organization requires grouping the linearized words into classes containing identical numbers of 'lines' in each word (this has already been done for the part-dictionary — see above, 2.3) and again, dividing the words in each class into sub-classes, according to the number of their letters. Since, under the error-threshold assumed, a distorted word needs only to be matched with either the same or the adjoining subdivision, such an organization will substantially reduce the number of matchings. This is illustrated in the table given below, showing numbers of words in the more frequent classes⁴ (numbers are given in units of thousands and rounded to the nearest 500). The total dictionary contains about 65,000 words.
(Table: numbers of words, in thousands, in the more frequent classes, arranged by numbers of 'lines' against numbers of letters.)
Thus, in the worst case, a word of 8 letters and 4 lines would require matching with 7 sub-classes, as shown by the thicker line in the table, containing in all 26,000 entries. (It may be observed that the classes 3 lines/9 letters and 5 lines/7 letters are excluded, these conditions being contradictory.) The words of 8 letters, however, form the largest classes only dictionary-wise and, in dealing with texts, the numbers of matchings will, on the average, be considerably less. The successive classes of the linearized dictionary are being stored on magnetic tape in blocks not exceeding 25,000 machine words.⁵ When this is completed, matching will be done against a whole dictionary, bringing blocks from the magnetic tape into the core store as required, and the operational times will be noted in various cases. It is expected that the increase in time, compared with the use of a 10% dictionary, will be only two to three times as great. Further reduction in the number of matchings and, therefore, also in the time of operation, may be expected from ordering entries within each sub-division according to the standard sequence.

⁴ Frequencies here refer to the dictionary; in actual texts, shorter words (both by 'lines' and by letters) prevail.
⁵ This number is conditioned by the size of the core store of the computer used and the length of the programme.
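The grouping into sub-classes, and the restriction to the same or adjoining subdivisions, can be sketched as follows (an added reconstruction, reusing linearize from the first sketch; the test for contradictory combinations mirrors the excluded classes mentioned above):

    from collections import defaultdict

    def organize(dictionary_words):
        # Sub-classes keyed by (number of 'lines', number of letters).
        classes = defaultdict(list)
        for word in dictionary_words:
            classes[(len(linearize(word)), len(word))].append(word)
        return classes

    def candidate_classes(word):
        # Same or adjoining subdivisions only; one letter more cannot go
        # with one 'line' fewer, nor one letter fewer with one 'line' more
        # (cf. the excluded classes 3 lines/9 letters and 5 lines/7 letters),
        # which leaves at most 7 sub-classes.
        n_lines, n_letters = len(linearize(word)), len(word)
        keys = []
        for dl in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dl, dc) in ((-1, 1), (1, -1)):
                    continue
                keys.append((n_lines + dl, n_letters + dc))
        return keys

    print(len(candidate_classes("CERTTUDE")))   # 7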
Another improvement may be obtained by storing the dictionary on a magnetic disc, instead of magnetic tape. This will, however, affect only the total time (by reducing the access time), not the internal computer time, since the numbers of matchings required will remain the same. These matters will be investigated later on. The results should be known before the time of presentation of this paper.

3.2 Syntactic analysis

Syntactic analysis can be expected to provide a selection complementary to that secured by the GC check, because the syntactic and semantic aspects of a natural language to a large extent complement each other. For example, syntactic rules operate usually by means of function words, while content words are normally the carriers of semantic information. The results of the application of syntactic analysis to Russian-English machine translation carried out at this Laboratory [6] allow for reasonable optimism concerning its use in automatic error correction. Among the multiple results of elastic matching there are frequently some differing by their syntactic range, and analysis can often indicate the proper ones, which is especially important if they are not suitable for the GC check (e.g. function words). At the present time, work is carried on in syntactic analysis applied to resolving phonetic ambiguities produced in the N.P.L. system of machine-shorthand (Palantype) automatic transcription [7]. The principles involved are very similar to those which will be used in the application to the results of elastic matching. Since this work has not yet been described anywhere, a broad account is given here, pointing out also differences, where they are found. Among the many existing, and working, English syntactic analysis systems, preference was given to the approach used in the method designed by Thorne et al. [8]. Its main advantage, which is especially valuable in the application to Palantype automatic transcription, is that it regards all words other than function words (which present a not very numerous group, anyhow) as a sort of 'general' class of English words, which can be used as nouns, verbs, adjectives or even adverbs, depending on their syntactic position. For the analysis based on such classification, the only words that must be marked with coded syntactic information in the dictionary are function words and exceptions to the standard-ending rules.⁶ Otherwise, the syntactic role is defined by the ending, if any. The full syntactic analysis of Thorne is not used, since the exigencies of the N.P.L. system (real-time operation) make it too difficult to
cope with a full sentence or even phrase. The analysis, instead, consists of scanning the current sentence a few words backwards and forwards each time an ambiguity is met,⁷ until one of the established clue-words is found. To give an example, let us consider the phrase (taken from an actual recorded text): ONE DOES NOT FIND/FINED THAT . . . The ambiguity FIND/FINED is met, with the following grammatical characteristics, coded in the dictionary or defined by the ending: FIND — general word; FINED — past participle or past tense. Search back reveals: NOT — neutral word, passed; DOES — auxiliary verb, compatible with the first alternative, but not the second.⁸ However, the appropriate subroutine has a rule to cover the exception: a past participle/past tense word may sometimes head a noun block (for example: 'someone . . . does forced labour'), so one more check, forward, finds THAT — demonstrative pronoun/relative conjunction, which prevents that possibility. Hence, FIND is selected. It is clear that an analysis proceeding under such restrictions cannot be complete, and therefore many ambiguities remain unresolved (on the other hand, there is a much smaller chance of wrong resolution). A point should be made that, although the unmarked dictionary words, not ending in a recognizable way, can only be said to be 'general', the ambiguity components (since there are only a few hundred of those) receive full grammatical characteristics. These contain not only syntactic class specifications ('systemic' characteristics), but also those relative to the particular words in question ('exponential' or 'individual' characteristics), which can lead to particular resolution rules. For example, in the ambiguity BUY/BY, the second component can be the subject not only of the rules proper to prepositions in general, but also of individual rules such as: 'preference given, if preceded by a past participle and followed by a noun, or noun block'. As was mentioned before, the whole system of rules will also apply to the selection of elastic matching results, with the great advantage, however, that there will be no need for the 'general' word category. When a standard, full dictionary is compiled, there will be no inherent difficulty in marking each word with its complete syntactic information. On the other hand, recognition based on a short search may be very valuable if, ultimately, an approach to real-time operation (which will be discussed in the next section) is contemplated.
⁶ Words such as 'bed' or 'wing', which are not the past tense/past participle or gerund/present participle respectively.
⁷ As regards the 'real-time' condition, a limited forward-scanning is made possible by suitable buffering.
⁸ HAVE (and its other forms) does not share this property. All auxiliary verbs are divided into three groups, differing in their syntactic properties.
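A toy rendering of the scanning procedure (added here, with entirely hypothetical word classes and rules; the real system coded its syntactic information in the dictionary) shows the shape of the FIND/FINED resolution:

    CLASSES = {
        "DOES": "aux_do",        # the auxiliary group compatible with a base form
        "NOT":  "neutral",
        "THAT": "demonstrative",
    }

    def resolve(words, i, alternatives):
        # Scan back a few words until an established clue-word is found;
        # check forward once for the noun-block exception.
        for j in range(i - 1, max(i - 4, -1), -1):
            cls = CLASSES.get(words[j], "general")
            if cls == "neutral":
                continue                         # passed
            if cls == "aux_do":
                nxt = CLASSES.get(words[i + 1], "general") if i + 1 < len(words) else "none"
                if nxt == "demonstrative":       # noun-block reading prevented
                    return alternatives[0]       # e.g. FIND
            return alternatives                  # leave unresolved
        return alternatives

    sentence = "ONE DOES NOT FIND THAT".split()
    print(resolve(sentence, 3, ["FIND", "FINED"]))   # FIND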
4. The applications

4.1 The keyed input

Elastic matching was first proposed in application to the keyed input of natural language text. Indeed, all work done until now has been directed to that end. Before we consider, therefore, other applications, a picture should be presented as to how this would operate in practice. First of all, the errors must be spotted. For the present, this means finding words not included in the dictionary.⁹ There exists a complete linearized and organized dictionary as described above. When the input is keyed, each word is checked in the dictionary by means of an addressing routine. This is tree-organized, proceeding letter by letter, and it can be comfortably made to operate in the real time of keying, as is done now in automatic machine-shorthand transcription [7]. If, however, the word is not found, it becomes a 'defective' word and is subject to the elastic matching procedure, which requires additional time. The corrective matching can be visualized in two distinct ways, as follows:

(a) The errata-sheet type is the method now used in our investigations; therefore there is no need for further description, except for providing a quick and efficient back-reference to the text. Such reference can be secured by automatic counting of either sentences and words, or else lines of output (in the ordinary sense, not the elastic-matching segments) and words, transferring the reference to the defective word, and storing the latter, until the matching is done for all defective words after the unit of input (article, speech) is finished. This method is obviously slower than any simultaneous one, but it has the advantage of using the possibilities of the selection methods (GC check and syntactic analysis) to the full. If an automatic insertion of corrections is required, an intermediate storage of text would be necessary.

(b) Simultaneous correction uses¹⁰ operator-computer interaction. The operator acts also as part-editor. The keyboard is connected to a visual display device, showing, for example, the current line of text. If the basic check comes across a 'not-in-dictionary' word, an alarm is flashed (and/or the keys are stopped) and, as corrective matching proceeds, the screen shows an array of alternatives, possibly with preferences established by selection (the automatically rejected lines are not shown at all). The operator, using a suitable control, chooses the proper word, or else overrules the automatic correction (this is
important for proper names¹¹ or other unusual words). In the case of 'no alternatives', the procedure is similar, with the one difference that the word may contain more than one error, in which case the operator cancels and replaces it. This second method seems more attractive, but it loses the advantage of using the selection methods to the full (the GC check cannot become operative until a substantial portion of the text has been input, and syntactic analysis is incomplete without a forward look; one can imagine, however, a compromise in which the latter is made possible by buffering for a few words or until the sentence ends).

4.2 Other applications

The principle of elastic matching can be applied, theoretically, to any array of elements. It would require, first of all, a definition as to what constitutes an element in a given case. As examples, we shall consider two cases: spoken language and handwriting. In spoken language, the true elements (phonemes) are themselves difficult to identify, so that little use can be made of elastic matching if an uncertainty occurs at that level. We can use, however, the fact that phonemes are naturally grouped into classes, in which they are phonetically related,¹² and these classes can be treated as elements forming, as it were, 'super-phonemes'. Also, the pattern of stress over a word, or a closely connected word-group, is recognizable without difficulty. If, therefore, the elements, in the sense defined above, are superimposed on the recorded stress pattern, the obtained array is amenable to elastic matching. In automatic recognition of handwritten language the search for elements turns in the opposite direction, namely below the letter level. In many pattern-recognition systems letters are split into 'features' (such as vertical stroke, upward half-loop, etc.). The features and their order define the letter and the word, and they can therefore be used as elements for elastic matching. Finally, there is no theoretical obstacle against applying the elastic-matching principle to non-linguistic sets, always provided that they are defined by the selection and order of their elements.
⁹ In the future a possible refinement may include a check of an (otherwise correct) word for its fitting the context, both syntactically and semantically.
¹⁰ 'Simultaneous' does not mean 'real-time' if this is understood as 'without any interruption or slowing down of the keying rate'. The latter is not yet available.
¹¹ Proper names can also be basic-checked and elastic-matching corrected, if their list is provided (e.g. in a known field of discourse) or built up gradually, in analogy to the GC check. This also applies, and even more so, to the errata-sheet type method of correcting.
¹² This has also been confirmed experimentally [9].
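Put together, the errata-sheet application amounts to a pipeline like the following sketch (an addition assembling the functions from the earlier sketches; a threshold of two outstanding bits is a crude stand-in for the paper's one-error definition, under which a changed letter gives two adjacent bits counted as one error):

    def correct_word(word, classes, gc_list, max_bits=2):
        # A 'defective' word is elastic-matched against the same or
        # adjoining sub-classes; survivors are narrowed by the GC check.
        candidates = []
        for key in candidate_classes(word):
            for entry in classes.get(key, []):
                outstanding = match(linearize(word), linearize(entry))
                if outstanding is not None and outstanding <= max_bits:
                    candidates.append(entry)
        return select(candidates, gc_list)

    classes = organize(["CERTITUDE", "CERTAIN", "MATCHING"])
    gc = build_gc_list("ELASTIC MATCHING OF CODED STRINGS AND CERTITUDE".split())
    print(correct_word("CETITUDE", classes, gc))   # ['CERTITUDE']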
REFERENCES
[1] Szanser, A. J. 'Error-Correcting Methods in Natural Language Processing', Information Processing 68, North Holland Publ. Co., Amsterdam, 1968, Vol. II, pp. 1412-1416.
[2] Szanser, A. J. A Series of Internal Memoranda (Private Communications), including: Part I: Optimum Letter Sequence for Longest Strings in English, 1968; Part III: Elastic Matching Technique in the Processing of English, 1969; Part IIIa: Elastic Matching Using a Dictionary Section, 1970; Part IV: General Content Check, 1970. This series will eventually be completed and published as an official report.
[3] Szanser, A. J. 'Automatic Error-Correction in Natural Languages', presented at the International Conference on Computational Linguistics, September 1969, Sanga-Saby, Sweden; published in Information Storage and Retrieval, 6, Feb. 1970, pp. 169-174, and also in Statistical Methods in Linguistics, Stockholm, 6, 1970, pp. 52-59.
[4] Szanser, A. J. 'Error Correction in Natural Language Processing by Computer', presented at the VIth International Congress on Cybernetics, Namur, Belgium, September 1970; to be published in the Congress Proceedings.
[5] Szanser, A. J. 'Resolution of Ambiguities by Contextual Word Repetition', Review of Applied Linguistics, Louvain, Belgium, 7, 1970, pp. 48-56.
[6] McDaniel, J. et al. 'Translation of Russian Scientific Texts into English by Computer - a Final Report', National Physical Laboratory, Teddington, July 1967, 76 + 20 pp.
[7] Price, W. L. 'The Viability of Computer Transcription of Machine Shorthand', presented at the Conference on Man-Computer Interaction, Teddington, England, 2-4 Sept. 1970; published in the Conference Proceedings, pp. 1-6.
[8] Thorne, J. P. et al. A Model for the Perception of Syntactic Structure, University of Edinburgh, 1967.
[9] Yates, D. M. and Dymott, E. R. Private communication, 1970.
'ONE SMALL HEAD' — SOME REMARKS ON THE USE OF "MODEL" IN LINGUISTICS¹

YORICK WILKS
"And still they gazed, and still the wonder grew, That one small head could carry all he knew." Goldsmith's rustics were quite right about the village schoolmaster, of course, well in advance of their time and, apparently, of Goldsmith. B u t perhaps the time has come, especially for linguists, to do less of such gazing, and to pay more attention to their proper business. I am not suggesting t h a t formal linguistics 2 has a single proper task, b u t I am sure, for reasons I shall t r y to make clear, t h a t the present situation, where almost every piece of work in t h a t field is proposed as a new "model of the h u m a n brain or behavior", is an undesirable one. I t is not hard t o see in a sympathetic way how linguistics got into t h a t situation. For a good while there have been serious suspicions, not all voiced f r o m outside t h e subject itself, about t h e other principal explanation of what linguists were up to, namely providing structural descriptions for sentences. For it is not at all easy to be clear about t h e status of conclusions of the form " X x is a correct structural description for x, but X 2 is n o t " . Nor has it been merely lack of the appropriate training t h a t has impeded the understanding of non-linguists, for t h e experts themselves seemed to have no way of deciding the t r u t h of such statements in a manner consistent with normal standards of rational argument. The presentation of linguistic work, therefore, as being ultimately no less t h a n a "brain model", was a natural, and worthier, alternative to a final justification in terms of t h e a t t a c h m e n t to sentences of questionable descriptions. I shall argue here, though, t h a t t h e present widespread use of "model" in linguistics is unfortunate, above all because it indicates a certain resignation about our almost total ignorance of how the brain actually works. Moreover, I think this situation obscures the proper importance of computational lin1
¹ The writing of this paper has been supported by ONR Contract N000 14-67-A-0112-0049.
² When I speak generally of linguistics in this paper, it will be clear that I am referring to recent developments in the subject and not to its traditional comparative and classificatory concerns.
linguistics (CL), which is capable of providing another, more defensible, justification of the aims of formal linguistics at this stage of neurophysiological research. In due course I shall examine some recent remarks about models by Mey [8], who also seeks to defend the independent position of CL. I have chosen his remarks, rather than other easily available and yet more startling remarks about models by non-computational linguists, only because I agree largely with what he argues for. Ten years ago, Chao [3] surveyed the usage of "model" by his fellow participants in a congress, and Suppes [11] has carried out a more rigorous and contemporary study. Both adopted a Websterian, or what might better be called the hundred flowers approach to the diverse uses of the word in research of that time. They would both, I think, have accepted Mey's opening remark: "An important notion in the behavioral sciences is that of a model as a set of hypotheses and empirical assumption leading to certain testable conclusions called predictions (on this cf. Braithwaite . . .)." Now, of course, that is precisely the kind of entity that Braithwaite wrote should be called not a model but a theory, though he did admit that confusion need not necessarily result if "model" is used in this way. Admittedly Braithwaite's is a conservative view of how "model" should be used. He has tried to assimilate its use in empirical science to its use in mathematics, where it is used to mean a second interpretation of a calculus yielding an understood branch of the subject. The fact that a model, in this sense, exists shows that the first interpretation of the calculus (which is the theory in question, the one being "modelled") is a consistent interpretation. Or, in Tarski's words, "a possible realization in which all valid sentences of a theory are satisfied is called a model of the theory". Let us call this standard view of Tarski's MATH. Braithwaite's view has been widely discussed, and criticized on the ground that it puts its emphasis on the calculus, and the theory as an interpretation of the calculus, in a way that is untrue to the actual psychological processes of scientists. Opponents of that view argue (Hesse [7], Achinstein [1]) that the model comes first and that working scientists import features metaphorically or analogically from their chosen model into the theory under construction. It may subsequently turn out, as Braithwaite says, that model and theory can be shown to be interpretations of a single calculus, but that is all formal tidying up, such opponents would say, after the real work is over. However, this difference of views, which I shall call BRAITH and SIMPLESCI respectively, is more a difference of emphasis than might appear. Braithwaite, for example, has discussed how, within his scheme of things, one can talk of a "modellist" (SIMPLESCI) moving from model to theory by disinterpreting his model's calculus in order to reinterpret it in the terms of the theory proper. Braithwaite contrasts this with his own "contextualist" view that the
theory is an interpretation of an originally uninterpreted calculus, with the (BRAITH) model entering the picture only subsequently. For my purposes, though, it is important to emphasize Braithwaite's point that both the BRAITH and SIMPLESCI views of models envisage the theoretical terms of the theory gaining their interpretations from the bottom-most, empirical level of the theory upwards. Braithwaite refers to this process as a "semantic ascent".³ Mey is of course correct when he says that "model" is used in a sense different from these three in the behavioral sciences, and in linguistics in particular. The interesting question is: why is it used differently, and is there any need to do so, unless something previously unclear is made clear. Let us refer by "MEY" to the view quoted at the beginning; namely that a model is a set of hypotheses, etc., leading to testable conclusions called predictions. That is to say that a model (MEY) is what is otherwise called a theory. Let us now imagine a linguistic (MEY) model. It does not much matter what it is, but presumably it will produce word strings of some sort. Those who believe in the overall importance of syntax will want them to be "grammatically correct" strings in one sense, others will want them to be meaningful strings. And that need not just mean sentence strings, in the conventional typesetter's sense, but could mean utterances of any length, including dialogues that were meaningful and coherent. The distinction between "grammatical" and "meaningful" need not disturb us at this point. After all this clearing of the ground, I want to offer a suggestion as to why the entity under discussion is called a model (MEY) of something, rather than a theory of it. The something in question is, of course, the human language apparatus, or part-brain if that is preferred. In the first line of his paper Mey says of CL, as is often said of linguistics in general, that it has to do with human behavior. But that is really a separate matter from the overall point of view of a theory of language production, since it is agreed by all parties that human language is produced in actual fact by the human brain and associated organs, hence anything that is to be a theory of human language behavior must, ambulando, be a theory of some part of the brain. My point is concerned with the process of interpretation of theoretical terms that I referred to earlier: the terms of a theory are given meaning by reference to lower levels of the theory, that is to say the empirical base. At the highest levels of a scientific theory there may be entities with no direct interpretation in terms of the observational base. These, like "neutrinos" in quantum physics, are sometimes referred to as occult entities, and philosophers go about asking of them "do neutrinos really exist?" Nonetheless such entities usually
³ My paper does not assume BRAITH. I take it that any view of models requires some such process, not at present possible in the case of the brain.
have a firm place in a theory provided they only occur at the topmost levels: in other words, provided that the process of interpretation can get some reasonable way up from the observational base. But in the case of a linguistic theory, proffered as a MEY model of the brain, the situation is quite different from the one I have just described. At the bottom level, as it were, we can observe what people actually say and write, or, if one prefers, what they ought to say and write; it does not matter which for the moment. But, in the present state of neurophysiological investigations, the matter ends there. There is no possibility of interpreting further, of identifying any item of structure in the brain corresponding to any item or structure, at any "level", of the linguistic theory. And that situation is quite different from that of the empirical "unreachability", as it were, of neutrinos, for the brain items or structures are not so much unreachable as unreached. It is this point, I think, that Chomsky missed when he compared [6] the role of unreachable occult entities in linguistics, grammars innate in the mind⁴ in this case, with the positive role of occult entities like gravitation in Newton's theory. Gravitation features as a topmost item of a theory that admits of a paradigmatic semantic ascent from an observational base. But in the case of the brain there is as yet no agreement at all on how the brain stores and processes information of the type under discussion, and there can be no question of a semantic ascent up a linguistic theory of the brain until that is known. I think this point explains the MEY use of "model": linguistics cannot provide theories of the brain, or human language production, so what it does provide is called a "model". The MEY use expresses an implicit resignation. On the other hand, this usage does undoubtedly express an aspirational, SIMPLESCI element as well, in that linguistics could, in principle, offer helpful suggestions to brain investigators as to what to look for, though I know of no evidence that such suggestions are being accepted. Mey is correct then in recognizing, albeit implicitly, that linguistics cannot at present offer full-blooded theories⁵ of human language behavior with semantic ascent. However, to go from that to endorse a resigned, and diminished, use of "model" seems to me unfortunate, confusing, and moreover inconsistent
⁴ No one should be misled at this point by the fact that Chomsky speaks of "mind" rather than "brain" in the source referred to. In so far as he is speaking particularly in the traditional "mind-mode", his arguments have, I think, all been effectively dealt with in such writing as Putnam [10]. I therefore take him to be making remarks about the brain. At other times, of course, Chomsky writes as if such grammars are not occult entities but are actually physically present in the brain.
⁵ There are of course other objections to any apparently deterministic theory of the human brain, language behavior or whatever; objections well known to any reader of Wittgenstein. But those would only arise when such a theory has actually been produced, and we need not concern ourselves with them here.
with his argument for the independence of computational linguistics as a subject. But if linguistics cannot provide structures capable of being interpreted as theories of the brain, and so of human behavior, and if, also, the depleted sense of "model" is less than adequate to cover these would-be theories, is there then any other alternative? It surely cannot be enough for formal linguistics to go into an academic hibernation to await a breakthrough in the description of the brain itself. If linguistics is to offer theories, what are they to be described as theories of? Two obvious alternatives present themselves: firstly, that what linguistics provides are ultimately theories of sets of sentences. Secondly, that what it provides are theories of a particular class of algorithms. Even if "sentences" is taken in a wide sense, so as to include whole discourses of any reasonable length, there seem strong and traditional objections to the first of these suggestions. For the proposal may sound like no more than a resurrection of the form of logical empiricism in the philosophy of science most closely associated with Neurath [9]. For Neurath, a theory was no more or less than a production system for the basic sentences, or Protokollsätze, of a science. Beyond that the theory was wholly dispensable, and there was no place in his views for models of any sort, or "semantic ascents" up the levels of a theory. There are well-known objections to such a view of theories in general. From the standpoint of the argument expressed here, the view is unacceptable because a linguistic theory that was merely a theory of a set of sentences, with no added qualification, would, in the Neurathian view of theories, also be a theory of anything producing such a set of sentences, and among these things would be human beings and their brains. That is the view of linguistic theory, of course, which makes the easy and almost imperceptible shifts of many generative grammarians, between talking of theories of sentences and theories of human brains and behavior, most plausible and acceptable. However, it is a general view of theories which, if they thought about it, most of them would wholeheartedly reject: Chomsky himself, for example, has argued many times against any such empiricist view of theories. Chomsky himself makes these transitions frequently, though he is by no means a consistent user of "model" to mean "theory" in this context in the way I argued against earlier. He frequently writes of theories, though in a number of different ways: "There is a certain irreducible vagueness in describing a formalized grammar as a theory of the linguistic intuition of the native speaker" (Chomsky [5], p. 533). Chomsky is arguing for such theories here, of course, and this is a formulation of his position apparently different from any of the views of the role of
theories mentioned so far: the sentence view, or the grammar-in-the-brain-or-mind views.⁶ However, if we ignore the limitation to grammar in any narrow sense, this statement reduces to something very like the linguistic-theories-are-of-sentences view under discussion, at least if the intuitions in question are restricted to intuitions as to what are and are not sentences. I would myself suggest a version of the second of the above views, namely that we view linguistic theories as theories of the production of particular sets of sentences by programs or algorithms. There is an implicit restriction included there, naturally enough, to non-trivial methods, that would exclude the printing out of any prestored list of sentences. That formulation may sound like no more than an analytic definition of the phrase "computational linguistics", and indeed, as Professor Wisdom has so often pointed out, philosophical proposals are usually no more than the announcement of a platitude. But in the current state of the use of "model" and "theory" in linguistics, any single way of speaking of theories would be an advantage if it replaced the current Babel in a generally acceptable manner. Most importantly, and here I think Mey might agree with me, the proposed view of theories in linguistics would make CL the foundational part of formal linguistics, and not the poor relation it is treated as at present. Yet, if, as Chomsky has always argued, linguistics is to be more than the mere classification and comparison it once used to be, then I do not see how generative grammarians can resist some such view, of what linguistic theories are theories of, as the one proposed here. On this view, the items of a linguistic theory could, without too much difficulty, be identified with subparts of the algorithm, in a way that cannot be done for the brain. More importantly, this view could be related in a coherent fashion to current notions of theory and model, and in that sense would have an obvious advantage over the loose talk of "psychological modelling" with which contemporary linguistics is so beset. For example, it would be possible for such a theory of CL to have a model (BRAITH and MATH), in the sense of an area of logic or mathematics with suitably related properties. These models almost certainly exist for a number of CL theories: those using phrase-structure algorithms for example. Again, there is no reason why what people say about their language structure, and what facts psychological experiments can elicit about the associations between speech items, should not serve as suggestive models (SIMPLESCI) for proper theories of CL. Linguists, who are wondering
⁶ I have argued elsewhere and in detail [14] against the way in which Chomsky makes these transitions, and also that these intuitions that justify any particular set of sentences cannot, whatever they are, be syntactic ones in any serious sense. But that disagreement need not affect the point under discussion.
if they read that last sentence the right way round, need not read it again, they did.⁷

⁷ Those who, like Hesse [7], adopt an "interaction" view of the role of (SIMPLESCI) models would say that this possibility was to be expected.

It may be objected at this point that such a view is too particular. Given the flourishing state of automata theory proper and the theory of algorithms, whether viewed as a part of mathematics, logic, or mathematical linguistics, it is as absurd to suggest this view of CL as to seek to propagate "the chemistry of the apple" as an independent subject. However, there need be no conflict here, and on the view under discussion it would be quite reasonable to conceive of CL within either mathematical linguistics or automata theory as their implemented aspect, one which might be expected ipso facto to be less mathematically interesting than the general theory of algorithms or the theory of abstract machines. There can be no objection in principle, though, to a CL theory being a theory of algorithms, on the grounds that the algorithm might have been described in some other way. At least, not if the objector is a linguist who does want "psychological modelling" and theories of the brain, for he would hardly take it as an objection to some future theory of language production in the brain that the area of the brain in question might just as easily have been used to process, say, visual data. There are two ancillary arguments which, I think, justify the introduction of the notion of an algorithm into the definition of a theory of formal linguistics. Firstly, it is a fact of academic observation, as I mentioned earlier, that the descriptions which linguists provide for utterances are disputable, what one might call undecidably so. The production, or non-production, of strings (or analysis, or non-analysis, of course) by algorithm provides a non-disputable justification for whatever linguistic classification and description had been initially imposed and programmed. To put an old and well-labored point briefly: classification, in linguistics at least, requires some purpose, or something one wants to do, and CL can provide it. It is not usually necessary to operate a logical system very far in order to see whether or not it produces the set of strings that are in question, the theorems, for that can usually be seen by inspection. But the rules of linguistics are generally so much more numerous and complicated that inspection is not usually sufficient. Furthermore, inspection in such cases is prey to all the well-known weaknesses of investigators for looking for what supports their case and ignoring what does not. If the strings are produced by algorithm, possibly out of a machine, it is more difficult to select unconsciously in that way. Let us now take a warning look at the distinction among models (MEY) that Mey actually proposes. The BRAITH theory, or MEY model, Mey proposes to call a descriptive model (MEYD). He cautions us that "it need not be
(and should not be) considered a faithful reproduction of reality, in the sense that to each part of the model there corresponds, by some kind of isomorphic mapping, a particular chunk of 'real' life. In other words, this descriptive kind of model does not attempt to imitate the behavior of the descriptum" (p. 2). The last sentence might leave one asking: well, if it does not do that, what does it do that deserves interest and attention? There is also an ambiguity about the notion of "mapping" here. It might seem that by "mapping" Mey refers to the interpretation of model (MEYD) items at different levels by brain items. But he goes straight on to discuss the non-equivalence of behavior, which suggests that assumption is wrong, and that he means only that the model (MEYD) need not even give output like human behavior. He goes on, "The other kind of model I propose to call the simulative one . . . a conscious effort to picture, point by point, the activities that we want to describe" (ibid., pp. 2-3). The elucidation of the distinction between MEYD and what I shall call MEYS is wholly in terms of that philosophical monster, Chomsky's competence-performance distinction. For example, MEYD models are said to be like Chomsky's competence models, yet he writes of MEYD's: "the model that is a grammar does not attempt to explain linguistic activity on the part of the speaker or hearer by appealing to direct similarities between that activity and the rules of the grammar. Rather, the activity of the speaker (his performance) is explained by pointing to the fact that the rules give exactly the same result (if they are correct, that is) as does the performance of the speaker-hearer" (p. 4). But there is serious trouble here. If these two entities, the human and the MEYD, give the same result then, as I have pointed out at some length, the one does not explain the other in the sense that an interpreted theory explains what it is a theory of. Mey writes in the last quoted passage as if there are other similarities between grammar and human, other than output identity. But what can they be? Moreover, dubious as Chomsky's distinction is, I am not sure that Mey has got it right, for which he can be forgiven, of course. For it seems odd to identify MEYD with the Chomskyan comparison between the "outputs" of grammars and humans, since that is surely within what Chomsky would call "performance". Again, the last quoted passage makes clear that MEYS's describe by definition, just as do MEYD's. According to the definition given, the distinguishing feature of MEYS's is that they picture "point by point" the human language activity. But, as I have argued at some length above in connection with the general notions of model and theory, that is just what they cannot conceivably do, at least not at the moment, while there is no hint available as to what the "points" to be pictured are. In the case of human beings as of machines, output is output, so what distinction can Mey offer between MEYD's and MEYS's, since ultimately all they both have to "model" is human output?
MEY models will not do, partly because one can do better with the notion of a theory of CL, and partly because the distinction between MEYD and MEYS is tied to the inscrutable competence-performance distinction. Chomsky means different things by that at different times: if he is attacked on one version he shifts to another [see 10 and 14]. As Paul Garvin put the matter some while ago, "with the new linguistics came a new style of argument". Take just Chomsky's version of competence that Mey begins with (p. 2), "a speaker's knowledge of a language", which Chomsky takes essentially to include a grammar; yet, as should be widely known by now, the majority of the world's competent speakers probably do not even know their language has a grammar. The main point that Chomsky has tried to express by means of that distinction is, I think, that the behavior of all actual things differs significantly from theoretical predictions and idealizations. Actual billiard balls collect dust as they roll and slow down, so their experimental performance never quite lives up to their theoretical mathematical competence. People as speakers are not different in this respect from other things in the natural world: they insert "ums" into otherwise perfect sentences. But this is true in every field, and deserves no special terminological recognition in the case of human speech or writing. In his paper Mey is arguing for worthwhile things, and in particular that linguistics, by which he means CL, should concern itself more with meaning and less with grammar, and that it should concentrate on the acceptance and interpretation of utterances rather than their acceptance-or-rejection offhand. He argues that the latter will require new kinds of theories in CL. I agree and have tried in [13] to suggest what their general form might be. His mistake, I think, is to try and make these valuable points with the aid of an ill-thought-out distinction between two kinds of "models". I say ill-thought-out advisedly and for two reasons: first, as I argued in detail, "model" is best kept for other and more conventional uses, and CL would benefit more from a suggested extension of the term "theory". Secondly, because Mey thinks that whatever it is he has to say, it must have something to do with what Chomsky meant when he at various times tried to distinguish competence and performance, and in particular that CL can find an acceptable theoretical niche by being the long-awaited "theory of performance". If my general argument in this paper is correct there can only be "theories of performance", and for that CL is a foundation stone and in no need of a niche in linguistics.

Stanford University
REFERENCES
[1] Achinstein, P. 'Theoretical Models', Brit. Jl. Phil. of Sci., 1965.
[2] Braithwaite, R. B. 'Models in the Empirical Sciences', in (eds.) Nagel, Suppes and Tarski, Proc. 1960 International Congress on Logic, Methodology and Philosophy of Science, Stanford, 1962.
[3] Chao, Y. R. 'Models in Linguistics and Models in General', as [2].
[4] Chomsky, N. 'Three Models for the Description of Language', IRE Transactions in Information Theory, Vol. II-2 (1956).
[5] Chomsky, N. 'Explanatory Models in Linguistics', as [2].
[6] Chomsky, N. Language and Mind, New York, 1969.
[7] Hesse, M. 'The Explanatory Function of Metaphor', in Proc. 1964 International Congress on Logic, Methodology and Philosophy of Science, Amsterdam, 1965.
[8] Mey, J. 'Toward a Theory of Computational Linguistics', presented at the Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, 1970.
[9] Neurath, O. 'Protocol Sentences', in (ed.) A. J. Ayer, Logical Positivism, Glencoe, Illinois, 1959.
[10] Putnam, H. 'The Innateness Hypothesis and Explanatory Models in Linguistics', Synthese 17, 1967.
[11] Suppes, P. 'A Comparison of the Meaning and Uses of Models in Mathematics and Empirical Sciences', Synthese 12, 1960.
[12] Wilks, Y. 'On-line Semantic Analysis of English Texts', Mechanical Translation and Computational Linguistics, 1968.
[13] Wilks, Y. 'Decidability and Natural Language', Mind, 1971.
[14] Wilks, Y. Grammar, Meaning and the Machine Analysis of Language, London, 1971.
SYNTACTICAL ANALYSIS OF NATURAL AND ARTIFICIAL LANGUAGES
SOME QUESTIONS OF ANALYZING VERBAL CONSTRUCTIONS WITH DEFINITE DIRECT OBJECT IN HUNGARIAN

I. BOTOS
1. The definite or indefinite character of an accusative object is expressed in Hungarian not only within the object itself (the kind of article, absence of article, special pronouns), but also within the form of the verb upon which it depends. There are two different types of personal verbal endings: one of them is used if the verb has no direct object or if it has an indefinite direct object (olvasok 'I am reading' — subjective form), the other, if the verb has a definite object (olvasom a könyvet 'I am reading the book' — objective form). The following examples will help to explain the degree of redundancy in the twofold grammatical expression of one and the same category:

(1) (a) Könyvet olvasok.      'I am reading a book.'
    (b) Olvasok egy könyvet.  'I am reading a book.'
    (c) Olvasom a könyvet.    'I am reading the book.'
(2) (a) Azt mondok, amit csak akarok. 'I say whatever I want to.'
    (b) Azt mondom, amit tudok. 'I say what I know.'
(3) (a) Várok, hogy utolérjen. 'I am waiting so that he can catch me up.'
    (b) Várom, hogy utolérjen. 'I am waiting for him to catch me up.'

Explanations of the examples:
(1) könyvet 'book' (accusative)
    egy (indefinite article)
    a (definite article)
(2) azt 'that' (demonstrative pronoun, accusative)
    mondok 'I say' (subjective form)
    mondom 'I say' (objective form)
    amit 'what' (relative pronoun, accusative)
    amit csak 'whatever' (accusative)
    akarok 'I want (to)' (subjective form)
    tudok 'I know' (subjective form)
(3) várok 'I am waiting' (transitive verb, subjective form)
    várom 'I am waiting' (objective form)
    hogy 'that', 'so that' (subordinate conjunction, used both in final and object clauses)
    utolérjen 'he/she/it shall catch (me) up' (optative)

In example (1) the expression of the indefinite/definite character of the object is fully redundant: in relation (a) : (c), -k/-m and 0/a; in relation (b) : (c), -k/-m and egy/a. In example (2) this category is formally expressed only by the verbal endings (-k/-m), though contained, as well, in the semantics of each subordinate clause ('whatever I want to' — something indefinite; 'what I know' — something definite). In example (3) the only difference between (a) and (b) is the verbal form in the main clause, contrasting in this case two different types of subordinate clauses: a final clause (a) and an object clause (b).

2. From the point of view of automatic translation, the only case of the above examples in which the special verbal form is relevant is that of example (3). If this type of contraposition (where the difference between final clause and object clause is formally expressed by the subjective/objective form of the verb in the main clause) is neglected, there may be a low percentage of cases in which the translation into a number of target languages would be slightly inaccurate although nearly adequate. This would be the case, for instance, if the target language is English or German (in the latter case: 'Ich warte, damit er mich erreiche' (a); 'Ich warte, dass er mich erreicht' (b)). The translation of sentences (3a) and (3b) will, however, be the same if the target language is, e.g., Russian: 'Ja zdu, ctoby on menja dognal'. If, on the other hand, the case of example (3) is taken into consideration and special symbols are introduced for the objective verbal forms (i.e. forms referring to definite objects), the whole system becomes more complicated. Not only must the number of symbols for personal verbal endings be doubled, but the number of symbols and rules for possible object phrases must be trebled in order to discriminate three groups: 1) phrases which require subjective conjugation, 2) phrases which require objective conjugation and 3) phrases which allow both types of conjugation. Finally, it would be necessary to introduce further symbols for elements which can be parts of such possible object phrases.

3. Thus, the question arises of how to deal with the two types of personal verbal forms in the elaboration of a system of automatic analysis of Hungarian texts.
One possible approach to the question is the elaboration of symbols and rules in full correspondence with the linguistic facts concerned (version A); another possibility is a system that does not take into consideration the difference between the subjective and objective conjugations (version B); and a third, the combination of both, i.e. discrimination between subjective and objective verbal endings in the system of initial symbols and in part of the contracted symbols and rules, but neglect of the difference in another part of the contracted symbols and rules, where there is a case of redundancy (version C). So, for instance, in version A the adjective pronoun valamelyik 'one (of them)' must be assigned a symbol different from that of the adjective pronoun valamely 'some', for, when forming part of an accusative noun phrase, the first requires an objective, the second a subjective verbal form. This is one of the cases where grammatical definiteness does not coincide with factual definiteness, but is merely required by the suffix -ik in valamelyik (cf. amelyet olvasok : amelyiket olvasom). Another case of merely grammatical definiteness is that of a genitive construction (cf. Valamilyen író valamilyen könyvét olvasom. 'I read some book of some author'). Some initial symbols for morphological endings in versions A and C will represent homonymous endings of the subjective and objective conjugations (such as in láttam 'I have seen', 'I saw': láttam egy könyvet : láttam a könyvet). In conversation, hence in literary prose as well, there occur sentences such as: Olvasok. 'I am reading.' Olvasom. 'I am reading it.' There are even cases in which the occurrence of the definite article does not help in determining the definiteness or indefiniteness of the object. Namely, the preceding attribute of the object may begin with a definite article, and in this case there can be no second definite article belonging to the phrase as a whole. (Unlike Greek: Ὁρῶ τὸν ἐν τῷ κήπῳ περιπατοῦντα νέον.) An example: a kertben sétáló fiút (kert 'garden'; -ben 'in'; sétáló 'walking'; fiút 'boy' [accusative]). In this construction the article a 'the' can belong to the noun fiút (in this case the phrase means: 'the boy walking in a garden'), to the noun kertben (in this case: 'a boy walking in the garden'), and, finally, to both nouns (with the phrase meaning: 'the boy walking in the garden'). Consequently, the subjective or objective verbal form becomes relevant in specifying the definiteness of a sentence containing such an object:

(a) Ha kinézek az ablakon, a kertben sétáló fiút látok.
(b) Ha kinézek az ablakon, a kertben sétáló fiút látom.

These sentences are translated as follows:

(a) 'If I look out of the window, I see a boy walking in the garden.'
(b) 'If I look out of the window, I see the boy walking in the garden.'
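For illustration only (a toy fragment added here, in ASCII spelling, and not the symbol system of version A): the disambiguating role played by the verbal form in these sentences can be read off the first-person-singular ending.

    def object_definiteness(verb):
        # Toy rule for 1st person singular present: the objective
        # conjugation ends in -m, the subjective in -k.
        if verb.endswith("m"):
            return "definite object"     # e.g. 'latom' -> 'the boy'
        if verb.endswith("k"):
            return "indefinite object"   # e.g. 'latok' -> 'a boy'
        return "unknown"

    for verb in ("latok", "latom"):
        print(verb, "->", object_definiteness(verb))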
Phrases like a kertben sétáló fiút do not allow a subjective verbal form before them, only after them. One cannot say: Látok a kertben sétáló fiút, for if an indefinite direct object follows the verb, the indefinite article is obligatory. One does not say Olvasok könyvet, but Olvasok egy könyvet. Similarly: Látok egy, a kertben sétáló fiút.
If the aim of the recognition grammar is to create a model of the grammatical structure of sentences in Hungarian, there can be no doubt that version A is to be used. If the aim is translation of any texts whatsoever, version C can be used. If the aim is translation of non-literary texts, further experiments are necessary to decide whether version B is acceptable.
Computing Center of the Hungarian Academy of Sciences, Budapest
SYNTACTIC ANALYSIS IN REL ENGLISH¹
BOZENA HENISZ DOSTERT and FREDERIK BURTIS THOMPSON
I. INTRODUCTION
1. The Core Idea of REL. The assertion is often made in modern linguistics that natural language is a species-specific, uniquely human phenomenon. Linguists using the computer distinguish between "natural" languages and artificial languages, the former being largely an unconscious human invention which we inherit either innately or through learning (obviously both) and use in communicating with other humans. The latter, artificial languages, are the product of conscious invention for purposes other than human communication, in particular as a means of controlling a non-natural phenomenon, the computer. Our purpose in developing REL English is to carry over the use of natural language to this new area of man's communication with his environment—to bring together man's two most powerful tools for information analysis, the age-old tool of natural language and the new one, the digital computer. However, the languages we specify for the computer must perforce be artificial. With the advanced state of linguistic knowledge, we can approximate more and more closely the natural languages of communication. Further, we can provide an essential characteristic of language—extensibility, primarily through the medium of definitions. When a user himself extends his language in a natural way and in his own idiolect, what starts as an artificial approximation becomes natural through use. The language used by the researcher can be kept in pace with his developing conceptual structure, as he himself makes his understanding explicit through definition and statement. To provide a user with such a means of communicating with the computer is the central objective of the REL (Rapidly Extensible Language) system.
2. The REL System. Only a brief characterization of the REL software system can be given here. It is discussed more fully in other publications [1, 2, 3]. Although many changes have recently been made, partly as a result of changes
¹ The research reported here was supported in part by the National Institutes of Health, grant GM15537.
of the computers available to us and partly as the result of many improvements aimed at optimizing system operation, the publications referred to still provide a useful overall view of the REL system. At any rate, many of you are familiar with the REL system, its aims and its special characteristics. One point needs clarification, however. It has been our frequent experience that REL is referred to as a "language". This is probably due to the emphasis we have placed on English as an REL language. At the cost of being repetitious, then, let us state once again that REL is not a language, but a total software system allowing a variety of user's languages which are tightly coupled with user's data bases. This stems from our philosophy, as discussed in earlier publications, that a user's language and data base become highly integrated, especially as he proceeds, in the process of data analysis and building of conceptual structures, to define new concepts and build upon his data base. He may start with REL English as a base, adding his initial lexicon and data; but it is the interplay between data analysis, language extension and conceptual structuring that builds his language/data base package into a powerful idiosyncratic tool of his research. English is, at the present time, the most prominent language within the REL system. It is, on the other hand, not the only language in the system. The REL Animated Film Language was developed and used interactively on the IBM 2250 display for computer graphics and for producing computer movies. One of these, MATRIX, was recently made by the computer artist of international repute, Mr. John Whitney. The REL system is currently operating on the IBM 370/155 in batch mode. We are preparing for a return to a fully interactive, multiprogrammed system (which REL was until August 1970) to be operational in the summer of 1972 on an IBM 370/135.
3. REL English. This paper is devoted to aspects of current REL syntactic analysis and the most recent extensions of REL syntax. REL English as a man-machine communication language is a somewhat restricted model of natural English. Earlier discussions are found in [2, 3]. The scope of constructions currently handled is illustrated in section IV of this paper, which also includes examples from the actual experience of users working with their data. REL English includes, naturally, a semantic component. As will be seen in section II, we have incorporated many of the ideas of Fillmore's case grammar [4]; this as well as Chafe's semantic analysis [5] have suggested means of handling verbs, prepositions and dependent nouns which we have recently implemented in our semantic analysis of English. These semantic aspects, however, will await separate publication.
II. SYNTACTIC ANALYSIS: LINGUISTIC ASPECTS
1. Characterization of REL English Grammar. REL English grammar consists of over 300 rewrite rules, the application of which is governed by syntactic features which accompany them. The scope of an earlier version of the grammar is discussed in [2]. Although the grammar has been revised to a large extent, that reference still provides a useful background to the variety of English constructions handled in REL English.
2. Features. The feature mechanism is extremely important and powerful in REL grammar. The average number of features that are checked in the application of a rule is 6. This, however, varies widely, since many simple rules, such as those that combine auxiliary verbs with the main verb, have few feature checks, while the application of major rules, e.g. noun phrase collection and relative clause rules, call for extensive checks of features. The feature mechanism is discussed in some detail in [3]. The discussion in this paper is illustrative and concentrates on the use of features in constructions which have been recently added or revised. Briefly, the role of features is to:
(a) subcategorize parts of speech, e.g. plural and possessive nouns are distinguished by features,
(b) prevent ungrammatical strings, e.g. *some the boy is ruled out since the quantifier "some" cannot be combined with a noun bearing the feature "definite" which results from the presence of the definite article,
(c) determine the preferred order of syntactic groupings, and prevent unnecessary parsings. Thus (i) below is the preferred grouping and (ii)-(iv) are ruled out by several features:
(i) friends of (Caltech's (female students))
(ii) (friends of Caltech)'s (female students)
(iii) (friends of (Caltech's female)) students
(iv) ((friends of Caltech)'s female) students
The importance of preventing spurious ambiguous parsings is evident from the point of view of computational efficiency. We have conducted experiments to determine the time-saving role of features, the results of which are reported in section IV. Suffice it to say here that the parsing time for a typical sentence of, say, about 12 words is now less than a tenth of a second. The role of features is illustrated here by the rules which handle the collection of the subject NP (noun phrase) and the VP (verb phrase). Rules will be discussed discursively in order to avoid cumbersome notation. (The feature
checks on the noun phrases are omitted from consideration in all of this discussion, except when the animate feature is considered.)
(1) VP' → NP VP
Feature check: VP must be (-Passive), (-Subject) and (-Agentive)
Feature set: assign (+Subject) and (+Agentive) to VP' together with features of VP.
Example: John (ate an apple)
(2) VP' → NP VP
Feature check: VP must be (+Passive), (-Subject) and (-Objective)
Feature set: assign (+Subject) and (+Objective) to VP' together with features of VP.
Example: Books (were given to Mary)
(3) VP' → NP VP
Feature check: VP must be (+Passive), (-Subject) and (+Objective)
Feature set: assign (+Subject) and (+Dative) to VP' together with features of VP.
Example: Mary (was given books)
It will be noticed that these features refer both to surface structure and to deep structure. The (+Subject) feature primarily concerns surface structure and control of syntactic grouping, e.g. permitting the first but not the second of:
(i) John (ate the apple)
(ii) (John ate) (the apple)
On the other hand, the deep case features of (+Agentive), (+Objective) and (+Dative) are assigned by rules (1), (2) and (3) respectively, marking the roles the NP plays in deep structure. These rules also illustrate that in REL English we adhere to the centrality of the verb. The verb is the fulcrum of the sentence and accumulates its noun phrase modifiers, its adverbial modifiers and auxiliaries. We even go so far as to indicate a clause as a VP with the subject feature (+Subject). And indeed this is convenient, as in the rule:
(4) VP' → will VP
Feature check: VP must be (+Subject) and (-Question Transformation)
Feature set: assign (+Question Transformation) to VP' together with features of VP.
Example: Will (John eat an apple)
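A rough rendering of how such feature-checked rules can be applied mechanically is sketched below. This is not the REL implementation (which, as section III explains, stored features as bit masks and was coded in assembly language); features are modeled here as Python sets purely for illustration.

```python
# Sketch of feature-checked application of rules (1)-(3).
# Features are modeled as Python sets; REL stored them as bit masks.

RULES = [
    # (rule, features VP must have, features VP must lack, features set on VP')
    ("(1)", set(),                    {"Passive", "Subject", "Agentive"}, {"Subject", "Agentive"}),
    ("(2)", {"Passive"},              {"Subject", "Objective"},           {"Subject", "Objective"}),
    ("(3)", {"Passive", "Objective"}, {"Subject"},                        {"Subject", "Dative"}),
]

def attach_subject(vp_features):
    """Apply the first VP' -> NP VP rule whose feature check succeeds,
    returning (rule name, feature set of the resulting VP')."""
    for name, must_have, must_lack, to_set in RULES:
        if must_have.issubset(vp_features) and not (must_lack & vp_features):
            return name, vp_features | to_set
    return None

print(attach_subject({"Past"}))                  # rule (1): 'John (ate an apple)'
print(attach_subject({"Passive"}))               # rule (2): 'Books (were given ...)'
print(attach_subject({"Passive", "Objective"}))  # rule (3): 'Mary (was given books)'
```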
3. Transformations. REL English is a transformational grammar, reflecting the transformational character of natural English. However, the transformational "component" is not treated as a separate component within REL English grammar. (For that matter, no component is treated as a separate level of analysis, though discussion of this basic philosophy of design must await another paper.) We do not have a special formal language for describing the transformational operations that take place. Such formalisms are useful and illuminating in writing a general theory of language. But in systems that put natural language to use, we have found that a separate apparatus for transformations in a computational form is more of a hindrance than might be the advantages of the readability which the inclusion of such an apparatus would yield to the grammar. This matter is taken up from the computational side in section III. The purpose of a grammar in a computational system handling natural language is primarily functional efficiency. Certainly descriptive adequacy is being tested. As for explanatory adequacy, experience shows that work on natural language with the constraints of artificial systems always leaves unresolved much that is really interesting about natural language. That which has been contained in a computer program or recursive function throws little light on that which is of real importance to the students of language. But the question of the proper relationship between computational systems for handling natural language and explanatory adequacy is too controversial and basic a problem to be expanded upon here. As for the actual transformations that have been identified in the literature and which bear specific names in transformational theory, a great many are handled in very straightforward terms in REL grammar. Some of these will be evident in examples in this section and particularly in examples in section IV. Here we consider just two—the passive and the question transformations. The way REL English grammar effects the structural changes expected of the passive transformation can be seen in rules (2) and (3). Rule (2), which handles such structures as "books were given", marks "books" as being both surface Subject and deep structure Objective. Rule (3), which handles "Mary was given books", marks "Mary" as surface Subject and deep structure Dative. A passive verb phrase results from the expected passive rule:
(5) VP'' → VP' VP
Feature check: VP' must be (+Copula) and (-Subject), VP must be (+Past Participle)
Feature set: assign (+Passive) to VP'' as well as non-copula features of VP'.
Example: was defeated
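The interplay of these rules can be traced step by step. The sketch below is our own reconstruction (not REL code) of the rule sequence (5) (10) (2) that, as listed later in this paper, analyzes "Books were given to Mary"; rule (10), a dative rule given later in the text, is anticipated here, and only the feature bookkeeping is modeled. The assertions mirror the feature checks.

```python
# Reconstructed derivation of "Books were given to Mary",
# i.e. the rule sequence (5) (10) (2) cited later in the paper.

def rule5(copula, participle):
    # VP'' -> VP' VP : 'were' + 'given' yields a passive VP
    assert "Copula" in copula and "PastParticiple" in participle
    return {"Passive"}

def rule10(vp):
    # VP' -> VP to NP : attach 'to Mary', assigning Dative
    assert "Subject" not in vp and "Dative" not in vp
    return vp | {"Dative"}

def rule2(vp):
    # VP' -> NP VP : attach subject 'books', marked deep Objective
    assert "Passive" in vp and "Subject" not in vp and "Objective" not in vp
    return vp | {"Subject", "Objective"}

vp = rule5({"Copula"}, {"PastParticiple"})   # were given
vp = rule10(vp)                              # were given to Mary
vp = rule2(vp)                               # Books were given to Mary
print(sorted(vp))  # ['Dative', 'Objective', 'Passive', 'Subject']
```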
Another rule handles "given by Mary":
(6) VP' → VP by NP
Feature check: VP must be (-Subject) and (-Agentive) and either (+Passive) or (+Past Participle)
Feature set: assign (+Passive) and (+Agentive) to VP' together with features of VP.
Example: given by Mary
It is interesting to note that this rule does not require the VP to be passive, while the resulting VP' is marked as passive. The reason for this is to allow a subsequent question transformation. Thus, in the sentence:
John ((was given books) by Mary)
the verb phrase "was given books" is marked as passive by an application of rule (5) prior to application of rule (6). However, in the sentence:
Was (John ((given books) by Mary))
rule (5) does not apply prior to rule (6). However, the "by" phrase with the agentive case marks the verb phrase as passive. This fact insures that "John" will be recognized as subject by rule (3) rather than rule (1), and thus be assigned (+Dative) in deep structure.
Rule (4) exemplifies the handling of question transformations. One result of setting the question transformation feature can be seen in the following rule, which handles a type of WH-question:
(7) VP' → NP VP
Feature check: VP must be (+Question Transformation), (+Subject) and (-Objective)
Feature set: assign (+Objective) to VP' together with features of VP.
Example: (what door) (will John open)
Rule (7) allows an NP in Objective case to precede the verb phrase if the verb phrase has undergone the question transformation. The same is true for the passive, e.g.
(what book) (was John given)
4. Deep Case Structure. By far the most interesting development within REL English grammar in the recent months is the inclusion of case grammar. We follow many of the ideas of Fillmore [4, 6] and Chafe [5], but use them selectively and modify them in application for the sake of handling efficiency. Our version of case grammar assumes the centrality of the verb and views the noun phrases around the verb as being in certain case relations with it. Although we do not assign to the verb phrase such features as "state", "process", "action", etc., we analyze verbs as propositions which express relations
between nouns. This is also noted in Fillmore [6]. For instance, "buy" means that the relation of ownership commenced between the entities named by the Agentive and Objective NPs; "own" means that the relation of ownership continues to hold; "arrive" means that the relation of location commenced between the entities named by the Agentive and Locative NPs; "depart" means that the relation of location terminated. This analysis adds, in our view, another dimension to the understanding of verb semantics. From the point of view of syntactic analysis, it is similar to Fillmore's solution for "buy/sell", "send/receive" [7]. We conceive of the deep structure of the sentence as first consisting of the central verb with auxiliaries and modals, such as adverbs of tense. Then around the central verb are accumulated noun phrases, each of which is related to the verb by a specific case. Thus, for example, the following sentences (i)-(iv) result in the same deep structure:
(i) John gave Mary flowers.
(ii) John gave flowers to Mary.
(iii) Mary was given flowers by John.
(iv) Flowers were given to Mary by John.
This deep structure corresponds to the diagram of Figure 1.
[Figure 1: the verb "give" at the center, with John attached by the agentive relation, Mary by the dative, and flowers by the objective.]
The cases currently used in R E L English grammar are: Agentive (A) Dative (D), Objective (0), Instrumental (I), Locative (L), and Genitive (G). These are discussed below with actual examples and the rules governing their use. Most recently, it also appears necessary to add a Prepositional case, especially for verbs of such immense semantic and syntactic complexity as "show" as it appears in such sentences as: The test John took showed negative. John showed improvement. John showed positive on the test. John showed Mary the picture. These problems, however, are not too well understood as yet. A given verb is defined in terms of a frame, following Fillmore [4], which specifies which case-related nouns are obligatory "satellites" and which are optional. For example, the frame for the verb " b u y " is (A, 0 , (D, G, I, L)). The cases in the inside parentheses are optional. This frame accounts for the following constructions:
(i) John (A) bought flowers (O).
(ii) Flowers (O) were bought by John (A).
(iii) John bought Mary (D) flowers.
(iv) John bought flowers with stolen money (I).
(v) John bought flowers from Peter (G).
(vi) John bought flowers in Boston (L).
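A frame of this kind lends itself to a simple table-driven check. The sketch below is our own illustration (the frame for "buy" is taken from the text; the representation and function are not from the paper): a set of deep cases is accepted only if it contains every obligatory case of the verb and nothing outside the frame.

```python
# Sketch of a table-driven frame check.
# (obligatory cases, optional cases) for "buy": (A, O, (D, G, I, L))
BUY_FRAME = ({"A", "O"}, {"D", "G", "I", "L"})

def frame_accepts(frame, cases):
    """Accept a set of deep cases iff it contains every obligatory
    case and nothing outside the frame."""
    obligatory, optional = frame
    return obligatory.issubset(cases) and cases.issubset(obligatory | optional)

print(frame_accepts(BUY_FRAME, {"A", "O", "L"}))   # True: construction (vi)
print(frame_accepts(BUY_FRAME, {"A", "O", "D"}))   # True: construction (iii)
print(frame_accepts(BUY_FRAME, {"A"}))             # False: Objective is obligatory
```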
The frame for "own" is considerably more restricted, probably only (A, 0, (L)), e.g. (i) John (A) owns a house (0). (ii) John owns a house in Pasadena (L). The frame for "give" is: (A, 0 , (D, I(?), L)) or (A, 0, D, (I(?), L)), depending on whether or not the sentence: (i) John gave books. is considered elliptical, or what comes to the same thing, whether Dative is an obligatory case for "give". Other examples of these frames are: (ii) John gave books to Mary. (iii) John gave Mary flowers at the airport. (iv) John gave Mary flowers with his left hand (?) The question marks in the frames and in (iv) indicate our doubt whether "gave" can take an instrumental. The frame for "open" is (0, (A, I, D)), e.g. (i) (ii) (iii) (iv)
The door was opened. John opened the door. The door was opened by a key. John opened the door for Mary.
The following sentence, discussed by Fillmore [4], presents greater difficulty:
(v) The door opened.
It appears that "open" may have two frames, the second being (O, (I, D)), with corresponding semantic interpretations.
5. From Surface to Deep Case Structure. We now turn to the question of how the deep case structure of a sentence is identified. Certain rules have already been given. We will first examine some further rules, and then show the range of constructions these rules can handle. The notational conventions are as in the earlier rules. How the parsing proceeds and the deep case structure is achieved computationally is discussed in section III.
(8) VP' → VP NP
Feature check: VP must be (-Subject), (-Agentive), (-Objective), (-Dative), (-Locative), (-Instrumental)
Feature set: assign (+Objective) to VP' together with features of VP.
Example: give flowers, was given flowers
(9) VP' → VP NP NP'
Feature check: VP must be (-Passive), (-Subject), (-Objective), (-Dative), (-Locative), (-Instrumental)
Feature set: assign (+Objective), (+Dative) to VP' together with features of VP.
Example: give Mary flowers
These two rules are the rules that put the direct and indirect objects on the verb phrase. They reflect the preposition dropping transformations when the objective and dative case nouns have been transformed to the positions in surface structure immediately following the verb phrase. The various features that are checked prevent noun phrases with non-deleted prepositions from intervening, thus ruling out:
(i) *give to Mary flowers
(ii) *give at the airport flowers
How NP is assigned Dative and NP' Objective in deep case structure when rule (9) applies will be clarified in section III.
(10) VP' → VP to NP
Feature check: VP must be (-Subject), (-Dative)
Feature set: assign (+Dative) to VP' together with features of VP.
Example: (give flowers) to Mary, flowers ((were given) to Mary)
(Note that "to" is not the only dative preposition. See discussion of prepositions below.)
Let us first illustrate the above rules (1) through (10), which deal with the assignment of what traditionally are called subject and direct and indirect object in active and passive sentences. In (i)-(xv), the numbers in parentheses refer to the relevant rules, the letters in parentheses show the case assignments in deep structure.
(i) John (A) gave. (1)
(ii) John (A) gave books (O). (8) (1)
(iii) John (A) gave to Mary (D). (10) (1)
(iv) John (A) gave books (O) to Mary (D). (8) (10) (1)
(v) John (A) gave Mary (D) books (O). (9) (1)
(vi) John (A) gave Mary (O). (8) (1)
(vii) Books (O) were given. (5) (2)
(viii) Books (O) were given by John (A). (5) (6) (2)
(ix) Books (O) were given to Mary (D). (5) (10) (2)
(x) Mary (D) was given books (O). (5) (8) (3)
(xi) Books (D) were given Mary (O). (5) (8) (3)
(xii) Books (O) were given by John (A) to Mary (D). (5) (6) (10) (2)
(xiii) Books (O) were given to Mary (D) by John (A). (5) (10) (6) (2)
(xiv) Mary (D) was given books (O) by John (A). (5) (8) (6) (3)
(xv) Books (D) were given Mary (O) by John (A). (5) (8) (6) (3)
Note that in (xi) and (xv), the assignment of cases does not follow the interpretation a fluent speaker would give these sentences. One solution would be to make use of the feature of animateness. But there are problems. We can not restrict Dative to Animate nouns, as does Fillmore [4], because of such sentences as:
(i) John gave books to the library.
(ii) John gave money to the campaign fund.
While "the library" could be considered a locative, "the campaign fund" could hardly be. Another solution might be to restrict the deletion of the dative preposition to animate nouns. We include the feature Animate in our own research version of REL English. However, there is in general practice a disadvantage to such features. REL English uses no pre-set lexicon. A user adds his own lexicon. If a user, who has his own substantive interests, were required to know how to assign such features as Animate to new items he wished to introduce, he would be handicapped in using one of the system's strongest assets, its definitional flexibility. However, Animate is a useful feature for syntactic analysis, as illustrated by the rules below:
(1') VP' → NP VP
Feature check: VP must be (-Passive), (-Subject), (-Agentive), and NP must be (+Animate)
Feature set: assign (+Subject) and (+Agentive) to VP' together with features of VP.
(11) VP' → NP VP
Feature check: VP must be (-Passive), (-Subject), (-Agentive), (-Instrumental), (+Objective), and NP must be (-Animate)
Feature set: assign (+Subject) and (+Instrumental) to VP' together with features of VP.
These two rules allow the handling of:
(i) The key opened the door.
(ii) The janitor opened the door.
(iii) *The key opened the door by the janitor.
correctly making "the key" Instrumental in (i) and "the janitor" Agentive in (ii).
6. Prepositions. Prepositions are notoriously difficult to handle. As Fillmore [6] mentions, some are dependent on the verb. This may account for the different dative prepositions of "buy" and "give". But one wonders whether
(i) John bought Mary flowers.
(ii) John bought flowers for Mary.
are indeed synonymous in the same way as
(iii) John gave Mary flowers.
(iv) John gave flowers to Mary.
Mary is not quite so definitely a recipient in (ii) as she is in the other sentences. Consider:
(v) John bought flowers for Mary, but he gave them to Alice.
If (ii) is ambiguous, then (v) contains a "dative of intention", while the other sentences contain "true" datives. This kind of analysis, illustrated in the paragraph above, would take us too far into the field of "presuppositions" and "knowledge of the world", which, though of importance for language theory, are usually resolvable by context in systems using natural language for communication. Some prepositions are case-connected independently of the verb. As Fillmore [4] points out, the presence of prepositions can always be assumed in deep structure to mark case relationships. The agentive preposition seems to be "by". The objective case seems to be governed by "of", as is evidenced by the derived nominal in the following phrase: the giving of the book to Mary. Fillmore [4] also states that the instrumental preposition "is 'with' just in case the proposition contains an agent phrase, otherwise it is 'by'". This is reflected in our rules, which employ the Animate feature:
(6') VP' → VP by NP
Feature check: VP must be (-Subject), (-Agentive) and (-Instrumental), NP must be (+Animate)
Feature set: assign (+Passive) and (+Agentive) to VP' together with features of VP.
(12) VP' → VP by NP
Feature check: VP must be (-Subject), (-Agentive) and (-Instrumental), NP must be (-Animate)
Feature set: assign (+Passive) and (+Instrumental) to VP' together with features of VP.
(13) VP' → VP with NP
Feature check: VP must be (-Subject) and (-Instrumental), NP must be (-Animate)
Feature set: assign (+Instrumental) to VP' together with features of VP.
These rules result in the correct case assignments in:
(i) The door (O) was opened by the janitor (A). (6')
(ii) The door (O) was opened by the key (I). (12)
(iii) The janitor (A) opened the door (O) with the key (I). (13)
The prepositions "with" and "by" are indeed the most obvious instrumental prepositions, but the picture is more complicated than that. Consider, for example:
(iv) John bought flowers with stolen money.
(v) John bought flowers for ten dollars.
both of which contain likely candidates for instrumental. Prepositions which carry semantic information are even more interesting. Examples are prepositions connected with the locative, as in:
(i) John arrived in New York.
(ii) John arrived from Boston.
Sentence (ii) implicitly includes the meaning
(iii) John departed from Boston.
Sentences (i) and (iii) can be paraphrased:
(iv) Location of John commenced being New York.
(v) Location of John terminated being Boston.
Thus the deep structure of propositions with the locative may involve another distinct verb, related to the central verb of the proposition through such aspects as their temporal character (e.g., initiation and termination of state or
action).² And indeed this deep structure reflects the way the data must be recorded in the data base. This analysis is further supported by sentences which contain two nouns in the locative with different prepositions:
(vi) John arrived in New York from Boston.
whose deep structure is:
[Figure: the deep structure of (vi), with two location relations attached to John, one commencing at New York and one terminating at Boston.]
If (vi) had to contain only one verb in its deep structure, we would be forced to abandon Fillmore's theory [4] that there can be only one instance of a given case in a proposition. But more importantly, the correct semantic interpretation could not be assigned. Our analysis is somewhat complicated by such sentences as:
(vii) John departed from Boston for Washington, but then decided to stop in New York.
Perhaps there exists a case which might be called locative of direction (e.g. "John headed for home"). We saw earlier that there probably exists a dative of intent (e.g. "John bought flowers for Mary, but gave them to Alice."). Interestingly enough, the preposition involved is "for" in both cases. Further support for our analysis is offered by such verbs as "send" and "receive". Sentences including these verbs often have a third part in their deep structure, namely the location of the agent, e.g.
(viii) John sent the box from Boston to New York.
(ix) John received the box from Boston in New York.
Because of the semantic content provided by the preposition, our rules for handling locatives take a different form.
² At least for some verbs, such as "arrive" and "buy", an antonymous verb seems to be involved. Cf. Chafe's analysis of antonyms [5].
(14) LO → in NP
(15) LO → at NP
(16) LO → from NP
(17) LO → to NP
(18) VP' → VP LO
Feature check: VP must be (-Subject)
Feature set: assign (+Locative) to VP' together with features of VP.
Example: John ((bought books) (in Boston))
(19) VP' → LO VP
Feature check: VP must be (+Subject)
Feature set: assign (+Locative) to VP' together with features of VP.
Example: (At the airport) (John bought Mary flowers)
Extending our current work on prepositions is one of the most interesting aspects of our current research.
7. Pronouns. The incorporation of the notions of Fillmore's case grammar has been a major step forward for REL English beyond what had been discussed in previous publications. Another step of similar importance is the inclusion of pronouns. We had observed that the lack of them was a real hindrance to users, especially in embedded clauses, which have naturally been available in REL English for some time. Use of embedded clauses often required the repetition of an identical noun, e.g.
(i) When John attended Harvard, did John and John's sister live in Cambridge?
in place of:
(ii) When John attended Harvard, did he and his sister live in Cambridge?
Certain desirable constructions were simply not available, e.g.
(iii) What is the average income of people who own their home?
From the purely syntactic point of view, simple pronouns are straightforward. They are introduced by lexical rules, e.g.:
(20) NP → he
Feature set: assign (+Pronoun) to NP
(21) NP → his
Feature set: assign (+Pronoun) and (+Possessive) to NP
In building more complex phrases, pronouns are handled syntactically like any other noun phrase. Semantic processing of pronouns, which will be discussed in section III, is much more interesting. One important aspect must be mentioned here. In REL English, nouns refer to individuals, classes or relations, e.g., respectively, "Alice", "girl", and "parent". These subcategorizations are marked by features, even though they are semantic in character and, in fact, directly reflect the way data is organized in the data base. However, these subcategorizations do have syntactic significance, especially in regard to pronouns.
As the following examples illustrate, a class noun preceded by the definite article "the" is equivalent to a pronoun.
(i) John and Betty raced and she won.
(ii) John and Betty raced and the girl won.
This gives rise to the rule:
(22) NP' → the NP
Feature check: NP must be (+class)
Feature set: assign (+pronoun) to NP'
Semantically, the range of the resulting pronoun is the class named by the NP, that is, the referent for the resulting pronoun must be a member of this class; e.g., in the example above "Betty" is the referent for "the girl" only if Betty is a member of the class of girls. Postal [8] argued convincingly that pronouns are definite articles. His treatment is in some ways different from ours, since he includes such quantifiers as "some" and "all" among pronouns. As is discussed in section III, pronouns for us are a special case of "variables". Thus, "the" followed by a class noun forms a variable. Quantifiers followed by a class noun similarly form "variables", but of a somewhat different kind than pronouns. Thus such phrases as "all girls", "the girl" and "she" are closely related, all acting syntactically like phrases starting with the definite article. It is interesting to note that Postal's syntactically-based solution and our semantically-based solution point to a similar conclusion.
III. SYNTACTIC ANALYSIS: COMPUTATIONAL ASPECTS
1. The Parser and the P-Marker. The now classic notion of parsing is that the parsing procedure develops one or more P-markers in the form of an inverted "tree". At the top is the part of speech "sentence" and at the bottom are elements of the terminal vocabulary or lexicon. The intervening nodes identify the constituent phrases of the sentence. Transformational rules of grammar transform one such P-marker into another. The result of parsing is one or more such P-markers, each giving the deep structure of the initial sentence. The parsing algorithm must keep a certain amount of information about the developing structure of the sentence as required for the further analysis of the sentence. It might appear that the developing P-marker, or P-markers, are precisely the information that is necessary. In an abstract sense, this is true. However, in practice, the efficient parsers which are in use today organize their working information in different ways. The two methods currently considered best suited for handling natural language are the augmented transition network parser of William Woods [9] and the modified forms of the Martin
Kay parser [10] as used by both Charles Kellogg [11] and in our own REL work. In the latter case, the working information is maintained in a directed graph structure. This "parsing graph" is closely related to the P-marker but is distinct from it. The following remarks are confined to the REL parsing procedures. There are two main problems in parsing context free and above languages. The first is ambiguity, not only on the sentence level, but ambiguity of partial segments of the sentence, which may never propagate to parsings of the entire sentence. The second problem is the inability to predict which segment of the sentence should be the next one to be parsed. This necessitates the exploration of all alternative parsings. One way to handle these problems is to keep the parsing information in the form of P-markers (i.e., trees) and, when encountering two alternative ways to proceed, to make copies of these trees and proceed separately with every one. Since much of the further parsing of every one of these would be the same, the resulting redundancy would make such a method highly inefficient. However, there is an efficient alternative in which all the information is contained in a single graph in such a way that no redundant steps need to be taken. It is just such a parsing graph that underlies our parser. Figure 2 gives an example of the type of parsing graph we use. In the classical P-marker, the constituent phrases of the sentence are represented by the nodes; the structural relationships among them are shown by the directed arcs (see Figure 2a). In the parsing graph, however, the constituent phrases are attached as labels to the arcs. As is illustrated in Figure 2b, an arc that spans given segments of the initial sentence has as a label the phrase resulting from these constituent segments. The nodes of the parsing graph have little linguistic significance and their justification can only be found in the nature and efficiency of the parsing algorithm. Let us examine more closely these labels; they constitute the information that is maintained about each of the constituent phrases of the sentence. This information includes:
(i) The part of speech or syntactic category of the word or phrase;
(ii) The syntactic features that mark the phrase;
(iii) A routine, or procedure, that is associated with the grammar rule which generated the given phrase;
(iv) In the case of a lexical item, a pointer to the associated data elements in the data base;
(iv') In the case of a phrase, a list of its immediate constituents.
The parsing algorithm follows all possible paths in the parsing graph, matching sequences of parts of speech with the rewrite rules of the grammar. Whenever a match is found, a new arc is built into the parsing graph connect-
ing the two nodes on either end of the segments matched. A new phrase is constructed, which contains the above four items of information, and assigned as a label to this new arc.
[Figure 2a: a classical P-marker, with the constituent phrases as nodes; Figure 2b: the corresponding parsing graph, with the phrases attached as labels to the arcs.]
Consider this phrase information on the arcs and in particular item (iv') in the above list. As mentioned above, each phrase contains a list of its immediate constituent phrases. Each of these in turn contains a list of its immediate constituents. Thus, this resulting list structure is in fact the P-marker for the segment of the sentence constituting that phrase. The developing P-markers therefore "hang off", as it were, from the arcs of the parsing graph. To restate briefly, when a phrase is constructed during sentence analysis, it includes a list structure which constitutes the P-marker for the phrase. But more than that is accomplished. In the case of a typical rewrite rule, the constructed phrase does indeed contain just such a list of its immediate constituents, which link to their constituents in turn. However, if a transformational rule is called for, this P-marker can be transformed according to the structural change called for by the given transformational rule. Thus, the structure developed in the course of parsing is the deep structure P-marker. In other words, the label for each arc of the parsing graph is in fact a representation of the deep structure of that part of the sentence which is bridged by the given arc. When the sentence is finally parsed, any arc of that parsing graph which spans from the beginning to the end of the sentence has as a label a deep structure representation of that sentence. In section II, the rules that add nouns to the central verb were discussed. Here we illustrate how the deep case structure of the sentence is actually developed. Consider as an example the following sentence:
(A) John gave flowers to Mary.
In the deep structure, the three nouns are attached to the verb in a way that identifies their case relationships, as diagramed in Figure 1 (page 75). This deep structure can also be put in list form as:³
(B) (gave, (Agentive, John), (Objective, flowers), (Dative, Mary))
The order in the list of the last three case/noun elements is immaterial. (B) is thus a representation of the deep structure of sentence (A).⁴
³ Using LISP notation.
⁴ Such a representation could be typical of the kind of mappings of deep structure onto logical structure recently considered by Chomsky [12].
Consider now the parsing of this sentence, applying rules (8), (10) and (1) of section II, as well as lexical rules, and building the parsing graph as described above.
[Figure 3: the parsing graph for sentence (A). Arcs spanning the individual words are labelled NP (John), VP (gave), NP (flowers) and NP (Mary); successive spanning arcs are labelled VP (gave, (Objective, flowers)), then VP (gave, (Objective, flowers), (Dative, Mary)), and finally VP (gave, (Agentive, John), (Objective, flowers), (Dative, Mary)), which bridges the entire sentence.]
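The arc labels of Figure 3 can be mimicked with an explicit list of arcs. The following toy sketch is our reconstruction, not the REL parser itself (which was a modified Kay parser coded in assembly language); it merely shows how each rule application adds an arc whose label is the growing deep-structure list.

```python
# Toy parsing graph for "John gave flowers to Mary" (word positions 0..5).
# Each arc is (start, end, category, deep_structure_label).

arcs = [
    (0, 1, "NP", "John"),
    (1, 2, "VP", ("gave",)),
    (2, 3, "NP", "flowers"),
    (4, 5, "NP", "Mary"),
]

def add_arc(start, end, cat, label):
    arcs.append((start, end, cat, label))
    return label

# Rule (8): VP -> VP NP, assigning Objective.
vp = add_arc(1, 3, "VP", ("gave", ("Objective", "flowers")))
# Rule (10): VP -> VP to NP, assigning Dative ('to' occupies 3-4).
vp = add_arc(1, 5, "VP", vp + (("Dative", "Mary"),))
# Rule (1): VP -> NP VP, assigning Agentive to the subject.
vp = add_arc(0, 5, "VP", ("gave", ("Agentive", "John")) + vp[1:])

print(vp)
# ('gave', ('Agentive', 'John'), ('Objective', 'flowers'), ('Dative', 'Mary'))
```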
In the parsing graph of Figure 3, we have shown how the deep structure of a phrase spanned by a given arc is attached by the parsing algorithm to that arc as a label at the time the arc is added to the graph. By the time the parsing is complete, the deep case structure (B) of the sentence (A) is attached to that arc which spans the entire sentence. It is this deep structure that forms the basis for semantic processing. A word needs to be said concerning the computational procedures involved. In the first place, the feature checks are an integral part of the parsing algorithm. The features are part of the phrase information and are stored as a bit mask. Checking features is thus only a matter of matching "on"-masks and "off"-masks that are carried in the rule of grammar to the feature masks of each of the constituent phrases. That part of the algorithm which accomplishes this is particularly fast and tight. What instrument, then, is available for accomplishing the more involved checks of structural conditions and carrying out the structural change that creates the deep structure P-markers within each phrase? It will be recalled that in listing the items of information to be found in each phrase, one — item (iii) in the list — was a routine, or procedure, that is associated with the rule
of grammar from whose application the phrase arose. It is this routine that accomplishes such tasks as those involved in application of a typical transformational rule. These routines have access to all aspects of sentence analysis — parsing graph, P-markers, etc. — as well as the capability to store and use working information of their own. Most rules do not in fact involve such routines. Since they are invoked in only a few cases in the analysis of a sentence, they do not seriously affect parsing times.
2. Variables, Quantifiers and Pronouns. Typically, a simple sentence states a single proposition about a person or object, e.g.:
(i) John lived in Cambridge.
In more complex sentences, e.g.:
(ii) When John attended Harvard, he and his sister lived in Cambridge.
several propositions are stated or implied, possibly concerning the same subject. Language has mechanisms for indicating that these various propositions are about the same subject. One such mechanism is pronouns; another, as pointed out in section II, is anaphoric constructions using the definite article:
(iii) John and Betty raced and the girl won.
Such mechanisms for insuring proper cross-referencing in a complex statement have been formalized in mathematics and particularly in mathematical logic as "variables". Thus in the expression:
(iv) (x + 3) * x
the variable "x" is always construed as taking on the same value in each of its instances. Following this usage, we will speak of pronouns and anaphoric phrases as special kinds of variables. We shall see that there are other kinds of variables in natural language as well. A pronoun is a "dangling" pronoun unless or until a proper referent for it is found in a prior part of the sentence. When such a referent is found, the meaning of the pronoun is established. Again following the phraseology of logic, the pronoun is "bound" to its referent; and, until its referent is found, the pronoun will be called "free". Thus the pronoun "he" is free in the phrase:
(v) whom he saw last Friday
while it is bound to "John" in the phrase:
(vi) John wanted to meet the girl whom he saw last Friday
Free and bound variables in natural language are not limited to pronouns and anaphoric phrases. Quantified phrases, such as "all boys", "some boys", "what boys", and "how many boys" are also variables, as the following anal-
ysis indicates. Consider the phrase "what boys". As in the case of the phrase "the boy" used anaphorically, the ultimate denotation of "what boys" depends on the remainder of the sentence in which it is embedded. "What boys" designates quite different boys in the two sentences:
(vii) What boys attend Harvard?
(viii) What boys do not attend Harvard?
Such phrases function as free variables until they are bound at sentence or clause boundaries. To see the difference in binding of such variables as "what courses" and "all courses", consider the following two sentences:
(ix) Did the school commend the students who received A's in all courses?
(x) The school commended the students who received A's in what courses?
In sentence (ix), one is seeking first to ascertain those students who received all A's. Once this is established, the answer to the question depends only on whether these particular students were commended. Thus the binding of the free variable "all courses" can be made at the relative clause boundary. That is to say, once the referent is found of the phrase: "students who received A's in all courses" (say, John, Tom and Mary) further consideration of courses is no longer germane to the sentence. This is not the case in sentence (x), where: "students who received A's in what courses" still has the force of a variable and cannot be bound until the entire sentence is considered. "All . . ." and "some . . ." form variables which are bound at clause boundaries; "what . . ." and "how many . . ." form variables which are bound at sentence boundaries. Pronouns and anaphoric phrases form variables which are bound when noun phrases preceding them in the sentence are found which name members of their range, e.g., in sentence (iii) "the girl" is bound at the occurrence of "Betty", since Betty is a girl. The problem of handling the semantic aspects of variables, whether in formal languages of logic and computing or in natural language, is twofold: (a) they must remain an active part of sentence analysis as long as they are free variables, and (b) their meaning must be established at the appropriate point in the analysis, namely at the point where they should be bound. For example, in the analysis of the sentences:
(iii) John and Betty raced and she won.
(iv) John and Betty raced and the girl won.
simple recognition of "she" or "the girl" as noun phrases is not sufficient. A computational mechanism must also exist which establishes the proper referent "Betty" and binds the "variable" phrase to it. Such a computational method of handling variables is a very important part of the REL language processor, one that depends on certain features of the parsing algorithm, as will now be explained. In section III.1, we stated four items of information that are included in each phrase. There is also a fifth one. Each phrase contains a list of all free variables that occur in that phrase and, for each such variable, pointers to each of its occurrences. This list also includes information concerning the range and types of each variable. When a rule of grammar is applied which forms a phrase where a type of variable is to be bound, e.g., a rule that forms a relative clause in the cases of "all . . ." or "some . . ." variables, the routine or procedure associated with the rule can bind the variable in the appropriate way, making whatever checks are required. Carrying the list of free variables along makes this procedure for binding efficient. Consider, for example, the case of pronouns. Suppose a rule applies whose list of constituent phrases in left to right order is (P1, P2, . . ., Pn). Suppose also the pronoun "she" is free somewhere in the phrase Pi, and is so indicated in the free variable list associated with Pi. When the rule applies, the associated routine notes the existence of the pronoun and recursively checks the P-markers of each of the phrases P1, P2, . . ., Pi-1 to see if a noun phrase naming an element of the range of the pronoun, in this case a member of the class female, can be found. If found, the occurrence of the variable is changed to refer to this new referent and the pronoun omitted from the free variable list of the resulting phrase. If not found, the variable propagates upward until a phrase is reached that spans a part of the sentence including an appropriate referent. To illustrate, the parsing graph for the following sentence is shown in Figure 4.
(v) Will Betty's brother give his wife flowers?
The constituent phrases are numbered in the order in which they are added to the parsing graph by the parsing algorithm. Only constituents of the correct P-marker are shown. Phrase NP3 is marked as a free variable. A pointer to NP3 is carried in the free variable list of NP4 and VP6. On application of the three rules that create the phrases NP4, VP6 and VP11, a check is made to see whether this variable can be resolved. In the case of NP4, since no constituent is to the left of NP3, this check immediately fails. In the case of VP6, the single constituent to be checked is VP5, which, being a verb phrase, is not eligible to be a referent. In the case of VP11, of the four constituent noun phrases NP7, NP8, NP9 and NP10, only NP8 and NP10 are candidates, since NP9 is possessive and NP7 is a rela-
tion phrase. Thus VP11 is marked, for the time being, as ambiguous, yielding the potential referents NP8 and NP10 respectively for resolution of the variable "his". It could well be that neither NP8 nor NP10 would satisfy necessary requirements. Thus a third alternative would also be followed, namely carrying the free variable to VP11, VP12 and finally to S13. Since there is no possibility of binding in any of these, it remains a "dangling" pronoun and thus this third alternative is dropped as ungrammatical. During the subsequent semantic analysis, the two remaining cases will be examined again. Betty will be found not to be in the range of "his", and thus the analysis that included NP8 as a candidate will be abandoned. Betty's brother will have been found to be Tom, and the resulting correct analysis will find that Tom did indeed give flowers to his wife.
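A skeletal rendering of this free-variable bookkeeping is sketched below. It is our own schematic illustration of the mechanism just described, not the REL routines; in particular it substitutes textual substitution for REL's pointer rewriting, and its phrase representation is invented for the example.

```python
# Sketch of free-variable propagation and binding for pronouns.
# A phrase carries its free variables; when a rule combines phrases,
# each pronoun tries to bind to an earlier NP inside its range.

FEMALE = {"Betty", "Mary", "Alice"}

def combine(constituents):
    """constituents: list of (text, free_vars) in left-to-right order,
    each free var a (pronoun, range_set) pair. Binds what it can."""
    free, texts = [], []
    for i, (text, vars_i) in enumerate(constituents):
        for pronoun, rng in vars_i:
            # scan earlier constituents for a referent in the range
            referent = next((w for t, _ in constituents[:i]
                             for w in t.split() if w in rng), None)
            if referent is None:
                free.append((pronoun, rng))   # propagate upward unbound
            else:
                text = text.replace(pronoun, referent)
        texts.append(text)
    return " and ".join(texts), free

phrase, free = combine([("John and Betty raced", []),
                        ("she won", [("she", FEMALE)])])
print(phrase)  # 'John and Betty raced and Betty won'
print(free)    # [] -- the pronoun was bound, nothing dangles
```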
IV. THE SCOPE OF CURRENT REL ENGLISH The problems of expanding the variety of syntactic structures and improving the expressiveness of an artificial approximation to natural language, such as REL English, are in themselves quite interesting. While experimenting with the language and observing actual users, one finds a class of structures which cannot yet be handled. In working with them, difficult linguistic problems are encountered, often problems discussed at length in linguistic literature. Applying the insights put forward by other linguists and one's own ingenuity, ways are found to incorporate this new class of structures. Then new rules can be added, making a variety of new expressions available. Most recently, the greatest weaknesses of REL English were (a) the inability to handle indirect objects and locative and instrumental modifiers of the verb, and (b) the lack of pronouns. To handle the former, we have incorporated the deep case grammar notions of Fillmore. To handle the latter, we have made
basic changes in the language processor, introducing new methods for handling variables. These new solutions have been discussed in sections II and III. The scope of classes of syntactic structures now incorporated in REL English is illustrated below.
(a) Noun phrase constructions, as in:
John gave flowers to the daughter of Tom's sister.
John gave flowers to his Boston friend.
(b) Auxiliary verbs
Will Mary be given flowers?
Flowers had been given to Mary before now.
(c) Adverbial phrases of time
John gave Mary flowers in June, 1970.
John gave flowers before last Monday.
Three days ago, John gave Mary flowers.
(d) Relative clauses
John gave flowers which he bought from Tom.
John gave flowers whose colors were bright.
John thanked the man by whom he was paid.
John arrived in the city where he had met Mary.
(e) Subordinate clauses
What John earned he gave to Mary.
When John received the flowers, he gave them to Mary.
(f) Negation
John did not give flowers to Mary.
John doesn't earn more than $7,000.
(g) Quantifiers
Which boys gave flowers to Mary?
John gave flowers to some girl.
How many of the boys gave flowers to more than three girls?
(h) Conjunctions
Did John give flowers to Mary or Betty?
John gave flowers to Mary and candy to Betty.
Mary's mother and father gave flowers to Tom's brother and Betty's sister.
John gave flowers and Tom gave candy,
and a number of others:
Was there a boy who gave candy?
John earned less than Tom. (ellipsis)
John has a sister who goes to Yale.
These various structures can of course be combined in innumerable ways. For example:
Before John arrived in Boston and Mary left Los Angeles, Tom had departed from the city where he had lived since his married sister's baby was born.
At the present time, the areas on which we are working are prepositions and nominalizations of the verb. We expect our work with deep case grammar to be useful in attacking these new problems. We are also working on methods for handling the semantics of more complex verbs, such as "show", including methods for introducing them by the user as part of his ongoing conversation with the computer. If a computational system for handling natural language is to be operationally useful, an adequate syntactic capability is necessary, but not sufficient. Computational efficiency is also necessary. In this paper we focus on the efficiency of the syntactic analysis. This efficiency is achieved on two fronts. First, the structure of the grammar in controlling excessive syntactic ambiguity is of prime importance. Our use of features has played a major role in this regard. Consider the following two sentences:
(i) Do Cambridge girls who attend Yale love Harvard boys?
(ii) Has John attended the school of Cambridge's mayor?
Using the REL grammar, we obtained the following statistics:

                                               Sentence (i)   Sentence (ii)
No. of rules applied before feature checks          508             317
No. of rules actually applied                        38              27
No. of rules used in single complete parsing         17              15
By-passing the feature checks altogether, but otherwise using the same grammar, thus essentially using a featureless grammar, we obtained the following statistics:

                               Sentence (i)   Sentence (ii)
No. of rules applied               6,316           8,176
No. of ambiguous parsings            668           2,701
Second, computational efficiency depends on the efficiency of the parsing algorithm and of its implementation. For this reason we have carried through our programming at the assembly language level with careful attention to tight, efficient code. As a result, the computation time for the syntactic analysis of each of the above two sentences is less than one tenth of a second. (This is on an IBM 360/75 computer.) In the development of REL English, we have been deeply involved with the descriptive adequacy of this artificial language as an approximation of natural English. However, we have also had a second, more pragmatic goal, namely to provide a system for the social scientist and others working with large bodies of highly interrelated data that they would find natural and con-
venient. We hypothesized that there is a threshold in terms of flexibility and versatility of the syntax provided, beyond which the user feels that he does indeed have natural English at his disposal, even though in a somewhat limited form. Our experience with the interactive REL system that was operational in the spring and summer of 1970 led us to believe that we were beyond, though just barely beyond, this threshold. The definitional capability certainly adds a great deal, as was amply demonstrated, by allowing the user to add his own specialized phrases and idioms. The most noticeable deficiency in the summer 1970 English was pronouns. Indirect objects, certain locative constructions and durative tenses were also missed, but to a lesser extent. On the positive side, quantifiers, the use of "have" verbs, and especially the rather complete and natural handling of tense were features of the summer 1970 English that were extensively used and seemed to contribute most to the natural English-like feel. All in all, we are now convinced that an approximation to English can be achieved that is well beyond the threshold between a formal-feeling language and a natural-feeling language. Just how far we can go in achieving the feel on the part of the user that he has natural English available to him has now reached the level of an exciting challenge.
California Institute of Technology
REFERENCES
[1] Thompson, F. B., Lockemann, P. C., Dostert, B. H., and Deverill, R. S. 'REL: A Rapidly Extensible Language System', Proc. 24th National ACM Conference, August, 1969.
[2] Dostert, B. H., and Thompson, F. B. 'A Rapidly Extensible Language System: REL English', 1969 International Conference on Computational Linguistics, Stockholm, 1969.
[3] Dostert, B. H., and Thompson, F. B. 'How Features Resolve Syntactic Ambiguity', Proc. of Symposium on Info. Storage and Retrieval, J. Minker and S. Rosenfield (eds.), University of Maryland, April, 1971.
[4] Fillmore, C. J. 'The Case for Case', Universals in Linguistic Theory, E. Bach and R. Harms (eds.), Holt, Rinehart and Winston, New York, 1968.
[5] Chafe, W. L. Meaning and the Structure of Language, University of Chicago Press, 1970.
[6] Fillmore, C. J. 'A Proposal Concerning English Prepositions', 17th Annual Round Table Meeting on Linguistics and Language Studies, F. P. Dinneen (ed.), Georgetown University Press, 1966.
[7] Fillmore, C. J. 'Subjects, Speakers and Roles', Ohio State University Working Papers in Linguistics, No. 4, Columbus, 1970.
[8] Postal, P. M. 'On So-Called Pronouns in English', 17th Annual Round Table Meeting on Linguistics and Language Studies, F. P. Dinneen (ed.), Georgetown University Press, 1966.
[9] Woods, W. A. 'Transition Network Grammars for Natural Language Analysis', Comm. of the ACM, vol. 13, Oct., 1970, pp. 591-606.
[10] Kay, M. 'Experiments with a Powerful Parser', Second International Conference on Computational Linguistics, Grenoble, August, 1967.
[11] Kellogg, C., Burger, J., Diller, T., and Fogt, K. 'The Converse Natural Language Data Management System: Current Status and Plans', Proc. of Symposium of Info. Storage and Retrieval, J. Minker and S. Rosenfield (eds.), University of Maryland, April, 1971.
[12] Chomsky, N. Lecture at the Summer Linguistic Institute, State University of New York, Buffalo, July, 1971.
HOW MUCH HIERARCHICAL STRUCTURE IS NECESSARY FOR SENTENCE DESCRIPTION?

ARAVIND K. JOSHI
1. Introduction: In this paper, we will present some results which have bearing on the amount of hierarchical structure necessary for sentence description. In its full generality, it is difficult to find a precise answer to this question, mainly because it is not clear how this question can be formulated. However, for some nontrivial cases we can formulate this question and obtain some interesting answers. Context-free languages (cfl's), although not adequate by themselves, play an important role in the theory of grammars. We will give a formulation of our question for this class and then try to answer it. First we will describe a tree generating system consisting of a set of basic trees and rules for their composition. We will then formulate our question in terms of this characterization (Section 2). Our results can be roughly described as follows. For a large class of cfl's almost no hierarchical structure is necessary (Theorems 2.1 and 2.2), and for the full class of cfl's essentially the same result holds if a certain very restricted type of discontiguity is allowed (Theorem 3.1).* Our presentation here will be rather informal. Detailed definitions, proofs, etc., will be reported later.
2. A tree generating system: Let V be a finite alphabet and Σ ⊂ V. We call Σ the terminal alphabet and V − Σ the nonterminal alphabet. Let T(V) be the set of trees over V (i.e., nodes of trees in T(V) are labelled by symbols in V) subject to the condition that the interior nodes of a tree (i.e., nodes which are not terminal) must be labelled by symbols in V − Σ. Terminal nodes may be labelled by a terminal or a nonterminal symbol. The nodes of a tree will be addressed as follows. The root node has address 0; immediate descendants of the root (from left to right) have addresses 1, 2, . . .; immediate descendants of 1 are 1.1, 1.2, . . .; immediate descendants of 2 are 2.1, 2.2, . . ., etc. α(p) denotes the label of the node at address p in a tree α. A tree α ∈ T(V)
* The results in this paper can be regarded as belonging to the general area of syntactic complexity. See [1] for some other results in this area.
is called a center tree if α(0) = S, where S is a distinguished symbol, and y(α) ∈ Σ*, where y(α) is the yield of α (i.e., the terminal string of α). A tree β ∈ T(V) is called an adjunct tree if β(0) = X ∈ V − Σ and y(β) ∈ Σ*XΣ*, i.e., the terminal nodes of β are all labelled by symbols of Σ except one terminal node, which has the same label as the label of the root of β. Definition 2.1: A tree adjunct grammar (tag) is a pair = (

envisioned. a_i but a_j ∈ A = Vietnamese units have assisted regularly, but not as extensively as is now envisioned. (CS) The same operation also holds when we interpret the sets A, OK, FK and D differently: let A be the set of nominal groups which contain at least one satellite, OK the set of their nuclei, and FK the set of their satellites. Any two elements of A can then be represented as follows:
a_i = {fk_i1, fk_i2, . . . fk_in, ok_i}    (n ≥ 1)
a_j = {fk_j1, fk_j2, . . . fk_jm, ok_j}    (m ≥ 1)
The domain D is a subset of the Cartesian product A × A (= nominal group × nominal group). Operation (2) then covers nominal groups of the following structure:
a_i ∈ A = its similar fleet
a_j ∈ A = its smaller fleet
a_i but a_j ∈ A = its similar, but smaller fleet¹³ (CS)
a_i ∈ A = the sad evocation
a_j ∈ A = the proud evocation
a_i but a_j ∈ A = the sad but proud evocation¹³ (OBS)
¹³ Identical satellites in a_i and a_j (its, the) appear only once in the image.
As with mapping (1), a semantic implication holds here as well. It reads: fk_i ⇒ ¬fk_j.

Construction Type 3
In but-constructions of this type, but expands the nuclei of elements of morphological classes. Let MK be the set of elements which belong to the same morphological class. This set can be partitioned into two disjoint subsets, T1 and T2. T1 contains elements whose nuclei are not expanded by but; T2 contains elements whose nuclei are expanded by but. In set notation this can be represented as follows:
MK = {mk_i, mk_j, . . . mk_n}    (n ≥ 2)
mk_i = {sat_i1, sat_i2, . . . sat_im, nuk_i}    (m ≥ 1; element of subset T2)¹⁴
mk_j = {sat'_j1, sat'_j2, . . . sat'_jp, nuk_j}    (p ≥ 0; element of subset T1)¹⁵
By the mapping rule
T1 → T2    (3)
the elements of subset T1 are mapped onto the elements of subset T2. As with Construction Type 2, the sets occurring in the mapping admit several interpretations.
a) We interpret MK as the set of nominal groups. Then T1 contains nominal groups without but, T2 nominal groups with but. The nominal groups expanded by but can stand for arbitrary sentence constituents. Some examples follow:
mk_i ∈ T1 = an episode → mk_i ∈ T2 = but an episode
Love is but an episode. (Moon, p. 178)¹⁶
¹⁴ sat_1, sat_2, . . . sat_m are elements of the set SAT, which denotes the satellites of the elements of subset T2. nuk_1, nuk_2, . . . nuk_n are elements of the set NUK, which denotes the nuclei of the elements of the set MK.
¹⁵ sat'_1, sat'_2, . . . sat'_p are elements of the set SAT'. SAT' denotes the set of satellites of the elements of subset T1. Since but is not contained in this set, it is the difference set SAT \ {but}.
¹⁶ Abbreviation for William Somerset Maugham: The Moon and Sixpence, Penguin Book No. 468.
mk_i ∈ T1 = an insignificant part → mk_i ∈ T2 = but an insignificant part
And yet they [= his relations to women] were but an insignificant part of his life. (Moon, p. 178)
mk_i ∈ T1 = a place of passage → mk_i ∈ T2 = but a place of passage
They [= the populous streets] remain but a place of passage. (Moon, p. 207)
mk_i ∈ T1 = one power → mk_i ∈ T2 = but one power
Christ Jesus was aware of but one power, the power of God. (CS)
mk_i ∈ T1 = one room → mk_i ∈ T2 = but one room
The Rue Boutery is a narrow street of one-storeyed houses, each house consisting of but one room. (Moon, p. 196)
mk_i ∈ T1 = a little while → mk_i ∈ T2 = but a little while
These recollections . . . took but a little while to pass through my head. (Cakes, p. 109)
b) We interpret MK as the set of prepositional groups. Mapping (3) then covers constructions of the following kind:
mk_i ∈ T1 = by island hopping → mk_i ∈ T2 = but only by island hopping
The ocean has been flown by helicopters before; but only by island hopping. (CS)
mk_i ∈ T1 = with moderation → mk_i ∈ T2 = but with moderation
The best judges praised him, but with moderation. (Cakes, p. 108)
mk_i ∈ T1 = for a little while → mk_i ∈ T2 = but for a little while
And perfection . . . holds our attention but for a little while. (Cakes, p. 106)
c) MK can also be interpreted as a set of adverbs. Our mapping then covers the following examples:
mk_i ∈ T1 = once → mk_i ∈ T2 = but once
I met Thomas Hardy but once. (Cakes, p. 7)
mk_i ∈ T1 = mildly → mk_i ∈ T2 = but mildly
She but mildly interested me. (Cakes, p. 46)
mk_i ∈ T1 = little → mk_i ∈ T2 = but little
Though his books sold but little. (Cakes, p. 102)
d) If MK is interpreted as the set of verbal groups, the additional rule holds that either the verbal group itself or the nominal group preceding it must be negated. In the first case the nucleus of the verbal group is usually also expanded by a modal auxiliary. The following examples belong here:
mk_i ∈ T1 = revived → mk_i ∈ T2 = but revived
Not a tree, not a bush, scarce a wildflower in their path, but revived in Rosamund some recollection. (JES III, § 9.7)
mk_i ∈ T1 = could not feel → mk_i ∈ T2 = could not but feel
Even I . . . could not but feel that there, . . . , was real power. (Moon, p. 171)
mk_i ∈ T1 = cannot be → mk_i ∈ T2 = cannot but be
It cannot but be to the greater glory of English literature. (Cakes, p. 135)
mk_i ∈ T1 = could not think → mk_i ∈ T2 = could not but think
I could not but think it mean and paltry. (Cakes, p. 99)

Construction Type 4
While for Construction Type 3 the domain and the image set are disjoint subsets of the same morphological class, in the last construction type still to be described one morphological class is mapped onto another morphological class. This mapping is effected not by but alone, but by the morpheme sequence but for. The domain is formed by the set of nominal groups (MK1), the image set by the set of prepositional groups (MK2). The mapping reads
MK1 → MK2    (4)
The elements of the image set usually manifest sentence expansions. The following examples may serve as illustration:
mk1_i ∈ MK1 = an unfortunate attack of measles → mk2_i ∈ MK2 = but for an unfortunate attack of measles
He was president of the Union and but for an unfortunate attack of measles might very well have got his rowing blue. (Cakes, p. 13)
mk1_i ∈ MK1 = the hazard of a journey to Tahiti → mk2_i ∈ MK2 = but for the hazard of a journey to Tahiti
I've said already that but for the hazard of a journey to Tahiti I should doubtless never have written this book. (Moon, p. 184)

Universität zu Freiburg
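Read as string functions, mappings (3) and (4) are easy to make concrete. The following Python sketch is ours, not part of the paper; all function names and the toy data are purely illustrative:

    # Minimal sketch of mappings (3) and (4) as functions over phrase sets.
    # Names (t1_to_t2, mk1_to_mk2) are illustrative, not from the paper.

    def t1_to_t2(phrase: str) -> str:
        """Mapping (3): expand the nucleus of a T1 element with 'but'."""
        return "but " + phrase

    def mk1_to_mk2(nominal_group: str) -> str:
        """Mapping (4): map a nominal group (MK1) to a prepositional
        group (MK2) via the morpheme sequence 'but for'."""
        return "but for " + nominal_group

    T1 = ["an episode", "one power", "a little while"]
    T2 = [t1_to_t2(p) for p in T1]
    assert len(set(T2)) == len(set(T1))    # distinct inputs, distinct images

    print(mk1_to_mk2("an unfortunate attack of measles"))
    # -> but for an unfortunate attack of measles

Since each element of the domain receives exactly one image, both are mappings in the sense of the appendix below; mapping (3) is moreover injective by construction.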
APPENDIX
Definitions of the set-theoretic terms used:¹⁷
Set: By a set we understand any collection of definite, well-distinguished objects of our intuition or of our thought. (after Georg Cantor)
Element: Objects belonging to a set are elements of that set. Sets defined by enumerating their elements are written between curly braces, e.g. A = {a_1, a_2, a_3, . . . a_n}. The relation "belongs to the set" is written "∈", e.g. a_1 ∈ A.
Subset: If A, B are sets, A is called a subset of B if x ∈ A always implies x ∈ B, i.e. every element of A also belongs to B. The subset relation is written "⊂".
Difference set: Let A, B be sets. The difference set A \ B consists of all elements which belong to A but not to B.
Complement: Let A, B be sets. If B ⊂ A, the difference set A \ B is called the complement of B in A. It is written "C_A".
¹⁷ Cf. Studienbegleitbriefe 1-4 zum Funkkolleg Mathematik.
Implication: A implies B means: if A has the value "true", then B also has the value "true". An implication is written with a double arrow, e.g. A ⇒ B.
Cartesian product: Let A, B be sets. The set of all ordered pairs (x, y) with x ∈ A and y ∈ B is called the Cartesian product of A and B. A Cartesian product is written with "×", e.g. A × B = {(x, y) | x ∈ A ∧ y ∈ B}.
Mapping: An assignment between the elements of a nonempty set A and those of a set B is called a mapping if 1. every x ∈ A is assigned an element of B, and 2. every x ∈ A is assigned only a single element of B. Mappings are written with arrows, e.g. A → B.
Domain: The set of elements to which elements of another set are assigned under a mapping is called the domain. In the mapping A → B, A is the domain.
Image set: The set of elements which are assigned to the elements of the domain under a mapping is called the image set. If the numbers 1, 2, 3 are to be mapped onto the numbers −1, −2, −3, the set M = {−1, −2, −3} is the image set.
Image element: Every element of the image set is called an image or image element.
Codomain: The codomain of a mapping is any set which contains the image set as a subset. If A = {1, 2, 3} is to be mapped into B = {−1, −2, −3}, then the set of integers, among others, is a codomain.
injective: A mapping is injective if distinct elements of the domain are assigned distinct elements of the codomain.
surjective: A mapping is surjective if every element of the codomain corresponds to at least one element of the domain. In this case image set and codomain coincide.
bijective: A mapping is called bijective if it is injective and surjective.
Operation: Let V ≠ ∅ be a set. Every mapping V × V → V is called an operation on V.
disjoint: Let A, B be sets. They are disjoint if the following implication holds: x ∈ A ⇒ x ∉ B.
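As a quick illustration of the last few definitions, the following small Python sketch (ours, not part of the paper) tests a finite mapping for injectivity and surjectivity, using the appendix's own example of mapping {1, 2, 3} onto {−1, −2, −3}:

    # Checking the appendix's mapping notions on finite sets
    # (illustrative only; not part of the original paper).

    def is_injective(mapping: dict) -> bool:
        """Distinct domain elements get distinct images."""
        return len(set(mapping.values())) == len(mapping)

    def is_surjective(mapping: dict, codomain: set) -> bool:
        """Every codomain element has at least one preimage."""
        return set(mapping.values()) == codomain

    f = {1: -1, 2: -2, 3: -3}              # the appendix's example A -> B
    print(is_injective(f))                 # True
    print(is_surjective(f, {-1, -2, -3}))  # True, so f is bijective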
REFERENCES
Cooper, William S., Set Theory and Syntactic Description (Janua Linguarum, Series Minor XXXIV), The Hague, 1964.
Gleitman, Lila R., 'Coordinating Conjunctions in English', Language, XLI (1965), 260-293.
Pilch, Herbert, 'Drei Diskussionsbeiträge zu Zwirners Arbeiten', Theorie und Empirie in der Sprachforschung, ed. H. Pilch and H. Richter, München-Paris-New York, 1970, 9-22.
Pilch, Herbert, Altenglische Grammatik, München, 1970.
Stockwell, Robert P., 'The Transformational Model of Generative or Predictive Grammar', Natural Language and the Computer, ed. P. Garvin, New York, 1963, 23-46.
Studienbegleitbriefe 1-4 zum Funkkolleg Mathematik, ed. Deutsches Institut für Fernstudien an der Universität Tübingen, Weinheim-Berlin-Basel, 1970.
Wells, Rulon S., 'Immediate Constituents', Language, XXIII (1947), 81-117.
COORDINATIVE PROJECTIVITY

E. V. PADUČEVA
1. Coordinative projectivity is a certain restriction on word order in sentences with coordinate members. The term is coined by analogy with usual projectivity, which is a well-known general regularity of word order: if there is a syntactic structure of a sentence represented by a dependency tree [1], then projectivity holds for this structure if a group of words is a coherent stretch of a sentence, not interrupted by words which do not belong to the group [2]. Coordinative projectivity is a much less general property of word order than projectivity because it only involves sentences with coordinate members, but it is universal in the sense that it is not limited to any particular language but occurs, more or less regularly, in any language. The examples below will be taken from Russian. The supposition about the universality of coordinative projectivity rests upon the fact that a model of speech production can be postulated which seems sufficiently general and well-motivated and which generates only sentences that are coordinatively projective, analogous to Yngve [3].
2. To explain coordinative projectivity, several auxiliary notions are needed. First is the notion of a coordinative group. In a sentence with coordination the following parts can be delimited: a) a conjunction; b) coordinate members — words directly connected by a conjunction; c) groups of coordinate members, each group including all and only the words that depend, directly or indirectly, on one of the coordinate members. Thus, in the sentence (1)
(1) Петя (схватил и быстро обезоружил) противника
'Peter-caught-and-quickly-disarmed-enemy'
coordinate members are схватил 'caught' and обезоружил 'disarmed'; the group of the first coordinate member consists of one word; the group of the second coordinate member is быстро обезоружил 'quickly disarmed'; the coordinative group is схватил и быстро обезоружил 'caught and quickly disarmed'.
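Anticipating the definition given below, the coordinative group must occupy one uninterrupted stretch of the sentence. A minimal Python sketch of that contiguity test follows; the function name and the toy data are ours, purely illustrative:

    # The contiguity test behind coordinative projectivity. A sentence is
    # indexed by word positions; the coordinative group is the set of
    # positions of the conjunction, the coordinate members, and their
    # direct or indirect dependents.

    def coordinatively_projective(group_positions: set[int]) -> bool:
        """True iff the coordinative group occupies a coherent,
        uninterrupted stretch of the sentence."""
        lo, hi = min(group_positions), max(group_positions)
        return set(range(lo, hi + 1)) <= group_positions

    # Sentence (1): Petja [schvatil i bystro obezoruzhil] protivnika
    # positions:      0       1     2    3        4           5
    print(coordinatively_projective({1, 2, 3, 4}))   # True
    # A word connected to both members but placed inside the group
    # would break the stretch:
    print(coordinatively_projective({1, 2, 4, 5}))   # False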
In addition, we need the notion of syntactic connection, which differs from the notion of dependency in that it is more vague and less subject to formalization; e.g., the diagram of this relation need not be a tree. The use of this term will be made clear from the examples below. Using the notion of syntactic connection we can make some refinements in our definition of a coordinative group. Namely, it can be said that words which are syntactically connected not with one of the coordinative members but with both of them are not included in the coordinative group — e.g., Петя (Peter) and противника (the enemy) in the sentence (1). Thus, coordinative projectivity can be defined as the requirement that a coordinative group form a coherent stretch of a sentence, uninterrupted by words which are not members of the coordinative group (in particular, those words which do not enter the coordinative group because of being syntactically connected with both of the coordinate members). For example, in (1) Петя and противника are situated one to the left, the other to the right of the coordinative group, so that the group itself forms a continuous stretch. Coordinative projectivity is independent of projectivity: a sentence can be projective but not coordinatively projective. Yet it is difficult to exemplify this thesis because sentences with coordination have no natural dependency trees, cf. [4].
3. In Russian there are numerous sentence types which can be considered as violations of coordinative projectivity. Yet it can be shown that at least some of them are only apparent counterexamples:
(2) Все знали его и никто не любил
('Everybody-knew-him-and-nobody-liked');
(3) Сохранил он письмо или выкинул?
('Kept-he-letter-or-threw-away');
(4) Вы над собой смеётесь или над собеседником?
('You-at-yourself-laugh-or-at-interlocutor').
Each of these sentences can be given two syntactic interpretations.
Interpretation 1. The underlined words are syntactically connected with both coordinate members and therefore do not belong to the coordinate group. But they occupy a place inside the group, so that coordinative projectivity is violated.
Interpretation 2. The words in question are syntactically connected only with the first of the coordinate members. Then the second member has some of its syntactic valences unrealized. But this is naturally explained by the fact that the sentences are elliptic in the sense that the second of the coordinate members has a zero substitute morpheme; the zero substitute is
syntactically connected with the second coordinate member and has the underlined word as its antecedent (cf. R. Jakobson [5]: "Ellipsis is a zero anaphoric (deictic) sign"):
(2) Все знали его и никто не любил ∅
In this case the underlined words belong to the coordinative group and projectivity is not violated. It can be shown that interpretation 2 is essentially preferable to interpretation 1, in that it provides a better insight into the structures analyzed. Namely, there are at least two properties of these structures which would not have been explained (and most probably not even noticed at all) if structures (2)-(4) had not been treated as elliptical. The first question which deserves attention is why the words which violate coordinative projectivity always occur within the first of the coordinate members, that is, to the left of the conjunction and not to the right; indeed, when they are placed to the right, the sentences become ungrammatical:
(2') *Все знали и никто его не любил.
If sentence (2) is interpreted as elliptical, then an explanation of this asymmetry readily presents itself: for substitution morphemes with minimal meaning (e.g., third person pronouns and zero substitution morphemes) it is natural for the antecedent to occur on the left. Within the frame of interpretation 1 this peculiarity of word order seems difficult to explain.
Note. The English sentence Everybody knew him and nobody liked him (which might seem to contradict this statement) is not structurally equivalent to the Russian sentence (2'), because it is coordinatively projective and can be interpreted as containing coordinate members and no ellipsis. In general, in English, as far as we can judge, coordinative projectivity is quite trivially fulfilled.
The second property of the structures analyzed is that the underlined words in all three examples bear no phrasal stress. Two of them are pronouns (not by chance, for it is natural for words occupying this position in a coordinative group to be pronouns), and it is natural for pronouns to be unstressed. But all the other words are also unstressed. Now this fact can be explained easily in terms of ellipsis. There is an interesting general regularity that the antecedent of a zero substitution morpheme is almost always a word bearing no phrasal stress (several related ideas can be found in [6]). Actually, the place of the phrasal stress is used to resolve the ambiguity of the following sentence:
(5) Я взял зонтик Николая и шляпу.
(I-took-umbrella-of-Nicolas-and-hat)
Sentence (5) has the following two readings:
a. Я взял зонтик Николая и шляпу ∅
'I took the umbrella of Nicolas and his hat';
b. Я взял зонтик Николая и шляпу
'I took the umbrella of Nicolas and a hat'.
Sentence 5a is understood as containing a zero substitution morpheme with the antecedent Николая, which has no stress. For sentence 5b, where Николая is stressed, this interpretation is impossible.
4. To demonstrate that coordinative projectivity is a useful notion we certainly need not prove that all sentences are coordinatively projective. There are sentences which undoubtedly violate coordinative projectivity:
(6) Сегодня мы читаем; анализ же и перевод будет завтра.
'Today we are reading; analysis and translation are postponed till tomorrow'.
The particle же, which is syntactically connected with the whole coordinate group, is placed inside the group, and there can be no ellipsis here whatsoever: this is the general law of the discontinuous placement of же, cf. (7):
(7) Петя здесь, Иван же Николаевич задерживается
'Peter is here, while Ivan Nikolaevich will come later'.
5. Some of the facts included here under the common heading of coordinative projectivity were known before (see [7], [8], [9]). Still it should be pointed out that only when taken in their totality can they receive their proper explanation. A model of speech production which will give this explanation would be as follows. It can easily be proved that if sentences with coordinate members were the result of so-called conjunction reduction, then their coordinative projectivity would be the automatic consequence of this fact. The rules of conjunction reduction can be stated as follows (see [10]; the competing solution stated in [11] seems to be wrong):
Rule I. (A subord B) coord (A' subord C) ⇒ A subord (B coord C);
Rule II. (B subord A) coord (C subord A') ⇒ (A coord C) subord B.
In this notation A, A', B and C are constituents which can be connected with each other either by subordination or coordination; if constituents A and A' are, in some sense, identical, and B and C are, in some sense, structurally similar, then one of the constituents A, A' can be eliminated, with the subsequent structural reorganization. (It must be stressed that the word order is indeed essential here; if the order of constituents does not correspond to any of the rules, conjunction reduction results in an ungrammatical sentence):
(8) Директором был Гаврила; Гаврила работал добросовестно ⇒ Директором был Гаврила и работал добросовестно.
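Rule I can be mimicked over small labelled tuples. The sketch below is our illustration, with a plain equality test standing in for the paper's looser "identical in some sense", applied to a structure like (8):

    # Sketch of Rule I over labelled constituent pairs. Constituents are
    # tuples ('subord', A, B) or ('coord', X, Y).

    def rule_one(tree):
        """(A subord B) coord (A' subord C)  =>  A subord (B coord C)."""
        if tree[0] != "coord":
            return tree
        left, right = tree[1], tree[2]
        if (left[0] == right[0] == "subord"
                and left[1] == right[1]):          # A identical to A'
            a, b, c = left[1], left[2], right[2]
            return ("subord", a, ("coord", b, c))
        return tree

    ex = ("coord",
          ("subord", "Gavrila", "was director"),
          ("subord", "Gavrila", "worked conscientiously"))
    print(rule_one(ex))
    # ('subord', 'Gavrila', ('coord', 'was director', 'worked conscientiously'))

Note that the function, like the rule, only fires when the shared constituent stands in the positions the rule names; any other word order is left unreduced, mirroring the ungrammaticality remark above.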
In this respect conjunction reduction as it is presented here differs from many of the other paraphrastic transformations, which do not demand for their application that the word order be fixed. Perhaps it would have been more adequate to formulate the restriction not in terms of word order but in terms of functional sentence perspective: most probably, the word order restrictions follow from the demand that coordinate members must play identical roles in the functional perspective of a sentence.
6. As any other syntactic regularity, coordinative projectivity can be used in syntactic analysis. On the other hand, coordinative projectivity can serve as a convincing example of how transformations may serve as a heuristic method for discovering purely syntactic structural regularities.

VINITI, Moscow
REFERENCES
[1] Hays, D. G., 'Dependency Theory: a Formalism and Some Observations', Language 40, No. 4, 1964, 511-525.
[2] Lecerf, Y., 'Une représentation algébrique de la structure des phrases dans diverses langues naturelles', Comptes rendus de l'Académie des sciences, 1969, No. 2, 232-234.
[3] Yngve, V., 'The Depth Hypothesis', The Structure of Language and its Mathematical Aspects, Providence, 1961.
[4] Падучева, 'О порядке слов в предложениях с сочинением: сочинительная проективность', Научно-техническая информация, серия 2, 1971, No. 3.
[5] Jakobson, R., 'Signe zéro', in R. Jakobson, Selected Writings, II, 1971, 211-219.
[6] Шахматов, Синтаксис русского языка, М.-Л., 1941.
[7] Дрейзин, 'Зависимые слова и группы слов при однородном существительном с точки зрения автоматического синтаксического анализа', Научно-техническая информация, 1966, No. 7, 42-45.
[8] Шрейдер, 'Свойство проективности языка', Научно-техническая информация, 1964, No. 8, 38-41.
[9] Горкина, 'О структуре предло
[Extraction residue of a page of axioms and a rotated full-page figure, largely unrecoverable. The legible fragments give local axioms for a synonymy relation S and an antonymy relation A (e.g. xAy → yAx; xAy & yAz → xSz; xAy & ySz → xAz) and global axioms 10-13, labelled Closure, Intersection, Union and Disjoint (e.g. s(x)·s(y) = s(x) ↔ xSy).]
through an explicit relation (preposition). Although we will not go into details here, we have written a program which judges such phrases with regard to the intended conceptual relation expressed by a syntactic preposition which potentially has multiple senses.

6. Feature Categorization of ACTs
A method has been outlined for the classification and dictionary definition of ACTs and ACT-based verbs which is similar to that for the nominals. The method postulates a number of "basic ACTs". All other verbs are defined systematically in terms of these ACTs, "basic parameters", higher-level primitives such as CAUSE, and verbs which have been previously defined. The idea of defining verbs in terms of categorized ACTs is similar to that underlying Schank's verb-ACT dictionary [4], but our method defines the nature of a given ACT or verb more systematically and makes more specific some of the previous thought on ACTs. The resulting advantage is that 1) the word-definer can be guided in his descriptions by choices offered by an editing program, as for PPs, and 2) relationships between verbs can be identified. The latter result is particularly important with respect to two considerations. By stating how two verbs are related, we provide valuable information for inference making, which is of use to any kind of dialogue program which "understands" through the parser. Also, by noting common components of verbs or ACTs, we provide some basis for understanding metaphorical senses of a verb, as will be illustrated below. (For a more extensive discussion of considerations related to these points, see Schank [5].)
6.1. Outline of the Method
As space does not permit us to present the system in its entirety or to fully explain our terms here, we will merely suggest something of its nature, relying on isolated examples. There are five tentative groups of ACTs (they reflect conceptual ACT criteria given in Weber [9], but present a different emphasis and arrangement of concepts):
1) neutral (objective) experiences: KNOW, TOUCH, PERCEIVE
2) value (subjective) experience: BELIEVE, LIKE, ACCEPT
3) expression: EXPRESS_i (information), EXPRESS_s (e.g. sounds)
4) individual action or state, not involving other objects: ACTSTATE, i.e. be ± PA, walk, sleep, etc.
5) transfer type actions: TRANS
Because of their frequency and recognizability, we can immediately establish as "basic" the following verbs based on TRANS: MOVE, TRANSFER, TAKE_D (locative, i.e. Directive), PUT, GIVE, OFFER, TAKE_R (abstract, i.e. as Recipient), LOSE. Basic parameters which can be applied to a basic ACT to define a verb are:
Ts (start)
Tf (finish)
K (continuous)
R (repeated)
I (intentional, conscious)
A (attempted)
In the conceptual diagram in which the verb is represented, these parameters will appear above the two-way (predicative) link. Negation might also be considered a parameter, but is an area for discussion in itself. In addition to the foregoing, we recognize the higher level "parameters" CAUSE (by which a verb may be defined in terms of a predefined verb and immediate causal link) and INSTR (by which a verb may be defined in terms of a specified instrumental or "means" conceptualization). Each ACT has a specification for K and I already built into its definition. The value '−' is considered more basic than '+'; if a parameter of a basic ACT has a '+' specification, then there is no known concept which has the definition of the ACT but has a '−' value for the given parameter. An example of the definition of verbs based on the ACT 'know' is:

    ACT     +I            A       Ts      Tf       R    CAUSE    INSTR
    know    contemplate   study   learn   forget   —    inform   —
For instance, here we have interpreted 'learn' as 'starting to have information', or 'starting to know'. 'Know' has no corresponding R- or INSTR-verb, since it is +K, i.e. a kind of state; states are not thought of as being repeatable and cannot be "done with an instrumental aid". Dependency information concerning the ACT in the dictionary would look like the following for the example of 'touch':

    PP (+PHYS −ENVMT) ⇔ touch ← PP (+PHYS)

This means a possible "actor" must have the feature +PHYS but not the feature +ENVMT. The "object" must be a physical PP. In addition to parameter specifications, ACTs and verbs also have a LEVEL-specification. LEVELs now recognized are PHYSICAL, MENTAL and ABSTRACT (meaning "concerning possession", "social"). (See Tesler [8] for more discussion relating to levels.)
Aside from the value of having ACT-information which corresponds to PP-features, for purposes of knowing what PPs and ACTs can be "mixed", this method of description can be used in determining metaphorical usages of a verb or ACT. By "metaphorical" we are referring to analogy: one verb sense is used metaphorically with respect to another if they are both of the same category, i.e. based on the same ACT, but operate on different levels. The physical level is usually considered the "non-metaphorical" one. The admissible features for dependent or associated PPs will correspond to the difference in level. If we start with only non-metaphorical definitions of verbs in our dictionary, we might recognize and represent metaphorical uses of the word in the following manner: we understand 'John killed the cat' as (ignoring irrelevant notation):
    John ⇔ DO   (e.g. John shot cat)
         ⇑ (causal link)
    cat ⇔ not be alive
We know this because 'alive' as a PA, or 'be alive' ('live') as an ACTSTATE, can be dependent on all PPs which are +ANIM. If we now consider 'The House killed the bill', we note that bills cannot be alive according to our definitions. However, if we ignore the contextual conditions which specify an +ANIM object and adopt only the parameters and structure (i.e. CAUSE to NOT BE) specified for 'kill', the resulting representation makes sense. The metaphorical 'kill' can thus be defined in the dictionary as based on the non-metaphorical 'kill', with the differences that the associated PPs will have other feature specifications and that the associated INSTR-conceptualization will be different. We can make similar inferences about both senses of 'kill' because we have preserved that which is common to both senses. For instance, if a "real-world" model containing psychological implications [5] tells us that if A kills B, then A does not like B, we can apply this inference to both usages of 'kill'. As regards inferences, consider also 'The blotter drank the ink'. If 'drink' is based ultimately on TAKE_D, we know that the ink was removed from somewhere and is now in the blotter. In the light of such examples, the method outlined appears to be able to provide a link between the conceptual level and some kind of "logical" level.
The question naturally arises as to whether we can use the parameter specifications for verbs in the same way as we use feature specifications for PPs, namely for information on conceptual selectional restrictions. The answer is affirmative, but such selectional restrictions have to do with higher levels than that of the conceptualization. In other words, parameter descriptions play an important role when dependencies among ACTs and conceptualizations themselves are considered. For instance, duration imposes a +K or +R parameter on the conceptualization which it qualifies. We can say 'He beat the dog for five minutes' ('beat' is defined as 'hit' with an +R specification), but not 'He hit the dog for five minutes' ('hit' = −R, −K), unless we mean 'hit repeatedly', in which case we must add the +R to the conceptual representation. In order to define formally all the selectional restrictions which exist at higher conceptual levels, we need a "conceptual grammar", which seeks to describe extra-linguistically all that people might want to communicate to one another, and all the sensible semantic relations between components of such communication. Such a "grammar" would be invaluable in informing a dialogue program of what has been said and what would be interesting to find out.
7. Conclusion
The semantic category system we have outlined represents an attempt in the direction of profitably systematizing conceptual dependency rules and semantic descriptions of the objects involved. Dealing at the conceptual rather than at any syntactic or "deep-structure" level, it relies on semantics and its role in determining dependencies and is thus language-independent. The system, together with computer experimentation with it, should lead to a better understanding of the definition of the "conceptualization" and lays a basis for a more rigorous treatment of conceptual relations at higher levels. In addition to lending itself to the solution of problems concerning consistency of semantic descriptions of PPs and ACTs, the system (with its emphasis on components of meaning) is suitable for carrying on further analysis as to how we grasp the meaning of language. This is a step towards achieving a "valid" sort of computer understanding of language.

Stanford University
REFERENCES
[1] Arnheim, Rudolf, Visual Thinking.
[2] Russell, S. W., 'Categories of Conceptual Nominals', Stanford A. I. Memo, Computer Science Department, Stanford University (forthcoming).
[3] Schank, R., 'A Conceptual Dependency Representation for a Computer-Oriented Semantics', Stanford A. I. Memo 83, Computer Science Department, Stanford University, March 1969.
[4] Schank, R., Tesler, L., and Weber, S., 'Spinoza II: Conceptual Case-Based Natural Language Analysis', Stanford A. I. Memo 109, Computer Science Department, Stanford University, January 1970.
[5] Schank, R., 'Intention, Memory, and Computer Understanding', Stanford A. I. Memo 140, Computer Science Department, Stanford University, January 1971.
[6] Smith, David C., 'MLISP', Stanford A. I. Memo, Computer Science Department, Stanford University, March 1971.
[7] Su, 'A Semantic Theory Based upon Interactive Meaning', Computer Science Technical Report 68, University of Wisconsin.
[8] Tesler, L., 'New Approaches to Conceptual Dependency Analysis', in [4].
[9] Weber, S., 'Conceptual ACT Categories', in [4].
A SEMANTICS-ORIENTED SYNTAX ANALYZER OF NATURAL ENGLISH PULAVARTHI SATYANARAYANA*
During the past two decades, many groups of researchers have attempted to analyze natural language sentences by computer.¹ In the beginning the emphasis was mostly on syntactic aspects of the analysis. This approach allowed some of the analyzers to give meaningless analyses to some types of sentences. But the trend is now changing, and some recent analysis programs have also taken semantics into consideration.² At the Coordinated Science Laboratory at the University of Illinois, we are analyzing natural English as a part of the development of an automated natural language question-answering system, called the R2 system.³ In our research efforts, it has been found that in addition to using semantics for checking semantic well-formedness of phrases recognized as syntactically well-formed, we also need to orient the analyzer in such a way that the output explicitly reflects all the semantic content of a sentence. This involves, among other things, giving different analyses to restrictive and nonrestrictive relative clauses, and mentioning explicitly whether a particular prepositional phrase in a sentence is a time adverb, a place adverb, or something else. Keeping this in mind, our analysis has thus far concentrated on recognizing restrictive and nonrestrictive relative clauses, recognizing the various meanings associated with conjunctions, finding out automatically which prepositional phrases mean what, and, depending upon these results, giving a suitable representation for the entire semantic content of a sentence. This paper will briefly describe our analysis and the results obtained so far, in regard to the topics mentioned above. A brief comparison with some other systems that use semantics will also be made.
* The author is deeply indebted to Professor Robert T. Chien and Fred A. Stahl for their constant encouragement, helpful discussions and valuable criticisms of this manuscript. He also thanks Kenneth O. Biss and Karumuri V. Subbarao for their encouragement and participation in discussions and making constructive comments that have improved the paper. Thanks also are due to Barbara Champagne, who typed this manuscript.
¹ For example, Bobrow and Fraser (1969), Coles (1968), Kuno and Oettinger (1963), Norton (1968), Petrick (1965), Schank and Tesler (1969), Schwartz et al. (1970), Thorne et al. (1968), Winograd (1971), and Woods (1970). For some excellent reviews of work in this field, see Bobrow (1963), Simmons (1966), Kuno (1967), Bobrow, Fraser and Quillian (1967), Salton (1968), and Montgomery (1969).
² For example, Woods (1968), Schank and Tesler (1969), Schwartz (1970), and Winograd (1971).
³ See Biss et al. (1971) for details about this system.
THE ANALYSIS
Since a natural language analysis involves the use of a dictionary, we will begin with the description of how the dictionary entries appear in our analysis. This will be followed by a description of the syntactic analysis, where semantics is used for disambiguation, and the analysis of adverbs, relative clauses, and conjunctions.
The Structure of the Dictionary
Our analysis program relies heavily on the semantic as well as syntactic properties of each word in a sentence. Accordingly, in the system each word is given information about not only its syntactic category but also the semantic information carried by it. For example, the word plane, which can function in three different ways syntactically, is given the information in Figure 1.
    WORD    ATTRIBUTE (Category)   VALUE (Semantic-syntactic information)
    Plane   N                      (Concrete — count (2 (A flying vehicle) (A level surface)))
            V                      (Transitive — human subject (Carpentry (Smoothing a surface)))
            Adj                    (— (2 (Mathematics (Lying on a level surface)) (Any (Even))))

Figure 1. Property List of plane (in the dictionary)
Each category in which a word can occur syntactically is called an attribute or property of that word. The additional semantic and syntactic information is given as the value associated with that attribute for that word. In Figure 1 plane is marked as a concrete-count noun; that is, it has both these features. As a noun it has two meanings, viz., "a flying vehicle" and "a level surface". In addition, plane can also function as a verb or as an adjective. As a verb it requires an object and a human subject. As an adjective it again has two meanings — one used in mathematics and another in any context. Similarly the words pitcher and angry have the properties noun and adjective respectively, as shown in Figure 2.
    WORD      ATTRIBUTE   VALUE
    Pitcher   N           (Human (Baseball (Player)); Inanimate — count (Container of water))
    Angry     Adj         (Animate subject (A mood))

Figure 2. Property List of pitcher, angry
Even though in the above figure the word pitcher is only marked human, it is in fact marked for human, animate, concrete, and count internally in the system. The values animate and count are predictable from the value human. All features of a word are explicitly carried on its property list to cut down on the deduction needed when checking the semantic well-formedness of a phrase. This explicit marking of features does not involve space problems because only one bit is needed for each feature. In fact, more information can be carried in one computer word by the efficient use of single bits than by using the characters H, U, M, A and N to refer to the single feature HUMAN. At present features are marked manually. This could be done to a large extent by the computer, using rules for predicting the features. In the case of a word like pitcher, which has two meanings, each with a different group of syntactico-semantic features, we list each group of these features separately, followed by the corresponding purely semantic information. In this case, as a human noun, pitcher is a kind of player in the context of baseball. As an inanimate concrete object, it means a water container. The mention of the context, as in the case of pitcher, or as in the adjective sense of plane, is to help decide the topic being discussed during the analysis of sentences. For example, by analyzing the sentence "The angry pitcher ran off the field", we not only get the information that pitcher refers to a kind of player, but also that the discussion is probably on baseball. Though this kind of information may not be of immediate use, we hope to use it in our future work on question-answering and natural language discourse with computers. Another point of departure from the existing systems is in not using a separate entity called a dictionary explicitly. We use the programming language LISP in our analysis. We make use of the LISP atoms and the property lists associated with them, thus leaving the burden of searching the dictionary to the interpreter. It remains to be seen whether this approach is more efficient than having a separate unit called a dictionary and using a different search strategy which is independent of the LISP interpreter.
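The property-list scheme with explicit one-bit features can be sketched as follows. The original is LISP, so this Python rendering, the bit assignments and the toy entries are ours:

    # Sketch of property lists with explicit bit-packed features
    # (illustrative only; not the system's actual dictionary).

    HUMAN, ANIMATE, CONCRETE, COUNT, INANIMATE = (1 << i for i in range(5))

    # Redundant features are stored explicitly, one bit each, so no
    # deduction (e.g. HUMAN implies ANIMATE) is needed at check time.
    LEXICON = {
        "pitcher": {"N": [
            {"features": HUMAN | ANIMATE | CONCRETE | COUNT,
             "sense": ("Baseball", "Player")},
            {"features": INANIMATE | CONCRETE | COUNT,
             "sense": ("Any", "Container of water")},
        ]},
        "angry": {"Adj": [
            {"requires": ANIMATE, "sense": ("Any", "A mood")},
        ]},
    }

    def has(features: int, required: int) -> bool:
        return features & required == required

    sense = LEXICON["pitcher"]["N"][0]
    print(has(sense["features"], ANIMATE))   # True: human carries animate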
Syntactic Analysis
In general, the analysis consists of checking the category of the first word in a sentence and, depending upon that, looking for various phrases following it. These phrases in turn check the category of the first word and look for the smaller phrases that can follow it to form a constituent. Often there is a choice of phrases that can occur after a given kind of phrase, usually each giving different structures to the sentence. Depending upon the sequences of phrases built, and the kinds of words occurring in the phrase (i.e. their context sensitivity), more phrases of particular kinds are searched for. As an example, after finding a noun phrase and a transitive verb, the analyzer automatically looks for a direct object and makes sure that it is in the accusative form if it is a pronoun. There may be more words left in the sentence, in which case the analyzer looks for the presence of adverbs. We are using a modified transformational grammar for our syntactic analysis. It does not have explicit phrase-structure and transformational rules separately. Instead, the transformations are incorporated into the analysis. For example, during the analysis, if a be followed by a verb in the participial form is encountered, then our analysis treats it as a passive. Accordingly, the surface subject and agent are treated as object and subject respectively in the final analysis. The analysis proceeds as follows (see Figures 3 and 4): The first word is processed. If it is a conjunction, then the sentence is checked for the appropriate number of clauses. If it is not, then it is checked to see if it is a preposition followed by a noun phrase making a prepositional phrase. If not, a check is made for its being one of: what, which, who, whom, whose (called Type 1 question words); or one of why, how, where, when (called Type 2 question words). The reason for this distinction is that Type 1 question words signal a missing noun phrase (subject, direct object or indirect object, etc.) in the sentence, whereas Type 2 question words, being adverbs, do not. After finding a question word at the beginning of a sentence, the analyzer looks for a verb to conclude that it is a content question. If the second word in the input sentence is not a verb, then it should be the beginning of a noun phrase, making the beginning of the sentence a relative clause. For example, "Who wrote this on the board?" and "Why did John withdraw his money from the bank?" are both content questions, whereas "Why John withdrew his money is a mystery" and "Whom you saw now is the President of the U.S." are only sentences with relatives. If the first word in the sentence is not a question word, then a check is made to see if it is an auxiliary verb. If so, then the sentence is a multiple choice type question. For example, "Do you see a blue object in the sky?", "Can I study law or engineering or physics?" Notice that the latter sentence is ambiguous, and this can be recognized by the presence of or.
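The first-word dispatch just described can be summarized in a small decision procedure. The following Python sketch is our paraphrase of the flow diagram, with placeholder lexicon predicates rather than the R2 code:

    # Sketch of the first-word dispatch; category predicates are
    # placeholders for dictionary lookups.

    TYPE1 = {"what", "which", "who", "whom", "whose"}   # signal a missing NP
    TYPE2 = {"why", "how", "where", "when"}             # adverbial

    def sentence_type(words, is_aux, is_verb, is_conj, is_prep):
        w = words[0].lower()
        if is_conj(w):
            return "multi-clause"
        if is_prep(w):
            return ("question" if words[1].lower() in TYPE1 | TYPE2
                    else "declarative")
        if w in TYPE1 or w in TYPE2:
            # a verb in second position signals a content question;
            # otherwise the wh-word opens a relative clause
            return "content question" if is_verb(words[1]) else "declarative"
        if is_aux(w):
            return "multiple-choice question"
        if is_verb(w):
            return "imperative"
        return "declarative"

    print(sentence_type(["Why", "did", "John", "leave"],
                        is_aux=lambda w: w in {"do", "did", "can"},
                        is_verb=lambda w: w in {"did", "leave"},
                        is_conj=lambda w: False,
                        is_prep=lambda w: False))   # content question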
If the first word is not any of the above, but rather a verb, then the sentence is recognized as an imperative and is checked for the appropriate number of objects. If it is not a verb either, it could be an adverb like obviously, or the beginning of a noun phrase, i.e., an article, etc. In any case the sentence is declarative and the analysis continues with a search for the auxiliaries and the verb, etc. If the first word was a preposition, then it could be part of an adverb or an indirect object, etc., if the sentence is a question. This will be decided when the second word is encountered. If the second word is a question word, then it is obviously a question. Otherwise it should be part of a noun phrase. Thus, the first word, along with the second word if necessary, tells pretty much about the type of the sentence. Similarly, when it is decided that a sentence is a multiple choice type question, i.e., an auxiliary verb is found first, then the analyzer looks for a subject noun phrase followed by more optional auxiliaries and a main verb and the proper number of objects and adverbs, if any. Sometimes we need to backtrack in the analysis. For example, after finding a question word, say who, and an auxiliary verb, if we come across a word which is both a noun and a verb, and we happen to treat it as a noun and thus assume that it is a question, but cannot find a verb, then we backtrack and see if treating the third word as a verb will lead to analyzing the input string as a well-formed sentence. The checks for noun phrases, adjectival phrases, and prepositional phrases are made by separate subroutines or programs, thus permitting recursion. Since the search for auxiliaries, etc., is common to all types of sentences, the program branches to the appropriate place in each case.

Figure 3. Flow diagram for syntactic analysis

Semantic Check and Disambiguation

Every time a phrase is constructed by the analysis, a semantic check is performed to see if it "conflicts" with any of the phrases formed so far, be it in the same sentence or in the sentences before, thus reducing the number of spurious intermediate structures as well as spurious ambiguities at the end. The semantic check is exemplified below. There are two main sources of ambiguity in sentences — lexicon and transformations. This excludes the many ambiguities caused by pronouns. Lexical ambiguity is due to the existence of words having more than one meaning but functioning the same way syntactically, and also due to the existence of words belonging to more than one syntactic category, as for example "light" and "time", respectively. Transformational ambiguity is due to two sets of transformations yielding the same structure. The classical example of this is "They are flying planes". The first type of lexical ambiguity is eliminated by the use of the semantic check. For example, though "pitcher" is ambiguous, "the angry pitcher" is not. So when the adjective "angry" and the noun "pitcher" are combined to make a noun phrase, because of the restrictions mentioned in the lexical item "angry", "pitcher" is treated as a player, not a water container. Transformational ambiguity is treated by giving more than one analysis if certain types of constructions occur. A final semantic check is made to see if that ambiguity is also semantically valid. For example, when a perfect marker have-en or a progressive be-ing is encountered in a sentence, it is checked to see if the have or be could be the main verb itself. For example, "They have stolen jewelry" is ambiguous (have functioning as the auxiliary and as the main verb in the two interpretations), whereas "They have rotten tomatoes" is not (it can only be interpreted as possession of a certain kind of tomatoes). This is recognized by the property of the verb rot, which cannot take an object. All this is accomplished by associating syntactic and semantic features with each word, as described in the section on the structure of the dictionary.
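The feature clash test that prunes the 'pitcher' senses can be sketched as set containment over feature sets. The tiny lexicon below is our illustration, not the system's dictionary:

    # Sketch of the modifier/head feature check used for sense pruning.

    PITCHER = [
        {"sense": "player",
         "features": {"human", "animate", "concrete", "count"}},
        {"sense": "water container",
         "features": {"inanimate", "concrete", "count"}},
    ]

    def restrict(senses, required):
        """Keep only the head senses compatible with the modifier."""
        return [s for s in senses if required <= s["features"]]

    # 'angry' requires an animate head, so only one sense survives:
    print(restrict(PITCHER, {"animate"}))
    # -> the 'player' sense only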
Analysis of Adverbs
There are several types of adverbs in English. Some of these are single lexical items, for example yesterday, quickly, daily, etc. But there are many adverbs that are formed by placing a preposition in front of noun phrases. These have all been grouped under prepositional phrases by Chomsky (1965). But grouping all of them together results in a loss of information. It is not a problem to call both the subject and the object of a sentence noun phrases, because the relative position of the phrases with respect to the verb unambiguously distinguishes them. But there is no such "clue" for finding the function of a prepositional phrase in Chomsky's notation. Being oriented towards semantics, we group these different prepositional phrases into time adverbs, place adverbs, etc. These are recognized in our analysis by looking at the combination of the preposition and the noun phrase. For example, "at" followed by a "time" noun like "5 o'clock" is recognized as a time adverb, though the same "at" followed by a noun like "school", marked "a place" in the dictionary, is recognized as a place adverb. Similarly, "by my house" is recognized as a place adverb because "house" is a "place" noun, "by Monday" as a time adverb, and "by John" as either an agentive phrase in passive sentences or as a place adverb meaning "next to John".
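The preposition-plus-noun-feature pairing can be sketched as a lookup followed by a few rules. The feature table and function below are our illustration, not the system's dictionary:

    # Sketch of prepositional-phrase classification by noun features.

    NOUN_FEATURES = {
        "5 o'clock": {"time"}, "Monday": {"time"},
        "school": {"place"}, "house": {"place"}, "John": {"human"},
    }

    def classify_pp(prep, noun, passive=False):
        f = NOUN_FEATURES.get(noun, set())
        if "time" in f:
            return "time adverb"
        if "place" in f:
            return "place adverb"
        if prep == "by" and "human" in f:
            return "agent phrase" if passive else "place adverb (next to)"
        return "unclassified"

    print(classify_pp("at", "5 o'clock"))           # time adverb
    print(classify_pp("by", "John", passive=True))  # agent phrase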
of Relative
Clauses
I t is an accepted fact t h a t there are two types of relative clauses in English — the restrictive and the appositive (nonrestrictive). Even transformational grammarians have been unable to capture all the generalizations among these two types of relative clauses. Neither of the two branches of transformational grammarians — the lexicalists and the transformationalists — have s t a t e d precisely where a clause is to be treated as a restrictive relative clause a n d where it is to be treated as an appositive. Furthermore, t h e y either propose t h a t both types are derived f r o m embedded sentences in the underlying representation or both from conjunctions in the underlying representation. 4 Instead of discussing the merits and demerits of these syntactically oriented approaches we will present the results we obtained in our a t t e m p t s to distinguish these two types of sentences. Any relative clause whose he ad noun is preceded by the quantifiers any, all, every or no is f o u n d to be a restrictive relative. Accordingly, the sentence " E v e r y antelope which has three legs limps" is understood not as " E v e r y antelope has t h r e e f e e t " b u t as "If an antelope has three legs, it limps". Since t h e restrictive rel ative and t h e conditional statement just made are identical 4
See for example Stockwell et at. (1968) and Thompson (1970).
SEMANTIC ANALYSIS A N D
SYNTHESIS
253
in meaning, for any practical application like ours they need to be represented alike. So, in our semantics-oriented analysis the output will be the same as for the conditional statement. Another case where restrictive relatives are found is when the head noun phrase has the article " t h e " followed by a numeral. An example is " I took the three chairs which you showed me yesterday". For the present, the system only gives the analysis as a sentence with embedding. However, we believe t h a t eventually the system should be capable of replacing the whole noun phrase "the three chairs which you showed me yesterday" with a pointer to the particular set of chairs if the system already knows the set, or else ask back " I did not show you three chairs !" In contrast to this, if there is only a numeral but no article, then the clause is appositive and so "Thirty tigers, which were hungry, were brought into the circus" will be analyzed as "Thirty tigers were brought into the circus. They were hungry." Two more clear cases for appositives have been found — one with the demonstratives "this" and "these" and the other with the proper names. These are analyzed as conjunctions. Certain expressions have been found to be consistently ambiguous. For example, the sentence "Those people, who talk more, work less" gives an appositive meaning. However, with the commas removed, it is restrictive. So, if we insist on putting in commas, these can be analyzed correctly, too. Analysis
of
Conjunctions
There are a number of semantic questions concerning the use of conjunctions. While attempting analysis at discourse level it has been noticed that in actual discourse people use groups of sentences — not single sentences — to express a single unit of thought. They try to make the individual sentences coherent and make sure t h a t there is a "flow" of thought from one sentence to the next. I n doing so they use conjunctions like and, or, however, but, even i f , etc. When interpreting English discourse much information is derived from these "little" words. An attempt is being made to analyze these conjunctions with an automated question-answer system in mind. For a long time linguists a n d logicians have assumed t h a t English " a n d " and " o r " are identical to the semantic or logical " a n d " and "or", respectively. The reason for this assumption is the existence of many cases like " J o h n and Mary went to the store", and " H a r r y will meet J a n e or Linda in the supermarket". This assumption was further supported by the English conjunctions along with the word " n o t " satisfying the De Morgan's law. For example, "There will not be a meeting on Saturday or Sunday" means "There will not be a meeting on Saturday and there will not be a meeting on S u n d a y " .
254
SEMANTIC ANALYSIS A N D
SYNTHESIS
However, our research efforts have shown that this is not always the case. There are cases where English "or" functions as semantic "and". In the Illinois Rules of the Road manual there is a statement "Sirens, bells, or whistles are prohibited in unauthorized cars". This statement can only mean "Sirens are prohibited, bells are prohibited and whistles are prohibited in unauthorized cars". In a similar vein the statement "It may be useful to reorganize the file structure or to utilize a drum" means "It may be useful to reorganize the file structure and also it may be useful to utilize a drum". One relevant feature that has been found for determining when "or" is semantically "and" is the existence or nonexistence of the modal "may" in the sentence. Though "John will call Tony, you, or Harry" means "John will call Tony, or John will call you, or John will call Harry", the same sentence with "may", i.e. "John may call Tony, you, or Harry", means "John may call Tony, and John may call you and John may call Harry". The nature of the phrases being conjoined also seems to influence the decision. For example, if they share certain properties, the "or" functions as a set-theoretic union and thus gives the sense of "and". The use of these conjunctions needs to be investigated more.
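The "may" cue can be put schematically as follows. This is our illustration of the observation, not the authors' implementation; the function and its string-based representation are invented.

    def expand_coordination(actor, aux, verb, objects):
        """Expand 'actor aux verb X, Y, or Z' into clause-level readings.
        Per the observation above, 'or' distributes as logical 'and' when
        the modal 'may' is present, and as logical 'or' otherwise."""
        clauses = [f"{actor} {aux} {verb} {obj}" for obj in objects]
        connective = " and " if aux == "may" else " or "
        return connective.join(clauses)

    # expand_coordination("John", "may", "call", ["Tony", "you", "Harry"])
    # -> 'John may call Tony and John may call you and John may call Harry'
    # expand_coordination("John", "will", "call", ["Tony", "you", "Harry"])
    # -> 'John will call Tony or John will call you or John will call Harry'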
Other Conjunctions
Linguists have, for a long time, known that the conjunction "but" presupposes the existence of an inherent contradiction in the passage. For example, in the following sentences the "but" is the result of the apparent contradiction between "their flying" and "my not wanting to fly": Mary, John and I went to the airport on Saturday. I did not want to fly planes. But they were flying planes. So, I flew, too.
This helps in disambiguating the third sentence. It appears that the contradiction could be with the matrix sentence or with the embedded sentence, as in the present case.

COMPARISON WITH OTHER ANALYZERS

A few brief comments need to be made comparing our analysis with those of others. Simmons' PROTOSYNTHEX III uses Semantic Event Forms (SEF) in order to disambiguate sentences, but this is done at the end of syntactic analysis. Further, the semantic check involves considerable deduction because the redundant features are only found by going through the SUP chains each time they are needed. The network analysis described first by Thorne et al. (1968) and later by Bobrow and Fraser (1969) and lately by
Woods (1970) is very similar to our analysis procedure, syntactically; but they do not use any semantics. Also, it came to our attention recently that Winograd (1971) uses the same kind of analysis as ours. One major advance of our analysis is in attempting to analyze relative clauses and conjunctions from a semantic point of view, and that is what is needed for practical application.

University of Illinois
REFERENCES

Biss, K. O., R. T. Chien, and F. A. Stahl (1971), 'R2 — A Natural Language Question Answering System', AFIPS Conference Proceedings, 1971, SJCC, Montvale, N.J.: AFIPS Press, pp. 303-308.
Bobrow, Daniel G. (1963), 'Syntactic Analysis of English by Computer — A Survey', AFIPS Conference Proceedings, 1963, FJCC, Baltimore, Md.: Spartan Books, Inc., pp. 365-387.
Bobrow, D. G., J. B. Fraser, and M. R. Quillian (1967), 'Automated Language Processing', in Carlos A. Cuadra (ed.), Annual Review of Information Science and Technology, V. 2, New York: Interscience Publishers, pp. 161-186.
Bobrow, Daniel G. and Bruce Fraser (1969), 'An Augmented State Transition Network Analysis Procedure', Proc. of the International Joint Conference on Artificial Intelligence, pp. 657-67.
Chomsky, Noam (1965), Aspects of the Theory of Syntax, Cambridge, Massachusetts: MIT Press.
Coles, L. Stephen (1968), 'An On-Line Question-Answering System with Natural Language and Pictorial Input', Proc. of 1968 ACM National Conference, pp. 157-159.
Kuno, Susumu (1967), 'Computer Analysis of Natural Languages', in J. T. Schwartz (ed.), Proc. of Symposia in Applied Mathematics, V. XIX, Providence, R.I.: American Mathematical Society, pp. 52-110.
Kuno, Susumu and Anthony G. Oettinger (1963), 'Syntactic Structure and Ambiguity of English', AFIPS Conference Proceedings, 1963, FJCC, Baltimore, Md.: Spartan Books, Inc., pp. 397-418.
Montgomery, Christine A. (1969), 'Automated Language Processing', in Carlos A. Cuadra and Ann W. Luke (eds.), ASIS Annual Review of Information Science and Technology, V. 4, Chicago: Encyclopedia Britannica, Inc., pp. 145-174.
Norton, Lewis M. (1968), 'The SAFARI Text-Processing System: IBM 360 Programs', MITRE Corporation Report MTP-103, Bedford, Mass.: MITRE Corporation.
Petrick, S. R. (1965), 'A Recognition Procedure for Transformational Grammars', Ph.D. Dissertation, MIT.
Salton, Gerard (1968), 'Automated Language Processing', in Carlos A. Cuadra (ed.), ASIS Annual Review of Information Science and Technology, V. 3, Chicago: Encyclopedia Britannica, Inc., pp. 169-200.
Schank, Roger C. and Lawrence G. Tesler (1969), 'A Conceptual Parser for Natural Language', Proc. of the International Joint Conference on Artificial Intelligence, pp. 569-578.
Schwartz, Robert M., John F. Burger, and Robert F. Simmons (1970), 'A Deductive Question-Answerer for Natural Language Inference', CACM, V. 13-3, pp. 167-183.
Simmons, Robert F. (1966), 'Automated Language Processing', in Carlos A. Cuadra (ed.), ASIS Annual Review of Information Science and Technology, V. 1, New York, London, Sydney: Interscience Publishers, pp. 137-169.
Stockwell, Robert P., Paul Schachter, and Barbara H. Partee (1968), Integration of Transformational Theories on English Syntax, Los Angeles, California: University of California.
Thompson, Sandra A. (1970), 'Relative Clause Structures and Constraints on Types of Complex Sentences', Working Papers in Linguistics, No. 6, Columbus, Ohio: Computer and Information Science Research Center.
Thorne, J., P. Bratley, and H. Dewar (1968), 'The Syntactic Analysis of English by Machine', in Donald Michie (ed.), Machine Intelligence 3, New York: American Elsevier Publishing Co., Inc., pp. 281-310.
Winograd, Terry (1971), 'Procedures as a Representation for Data in a Computer Program for Understanding Natural Language', Report MAC TR-84, Project MAC, Cambridge, Mass.: MIT.
Woods, W. A. (1968), 'Procedural Semantics for a Question-Answering Machine', AFIPS Conference Proceedings, 1968, FJCC, Baltimore, Md.: Spartan Books, Inc., pp. 457-471.
Woods, W. A. (1970), 'Transition Network Grammars for Natural Language Analysis', CACM, V. 13-10, pp. 591-606.
UNDERSTANDING NATURAL LANGUAGE MEANING AND INTENTION

ROGER C. SCHANK
The conceptual analysis procedure described at the 1969 International Conference on Computational Linguistics has been expanded so that it may now be considered to be a more accurate simulation of a human engaged in the same process. The reliance on grammar rules has been eliminated, and a powerful conceptual mechanism that functions interlingually has been created that drives the analysis process, searching for combinations of concepts that are consonant with the system's model of the world. This has allowed for an analyzer that knows when it is in need of information that has not yet been provided. Use of associative information with respect to concepts is made to fill in missing pieces that are ordinarily inferred in human understanding but often not explicitly stated. This paper presents the conceptual base structure that is used by this system. A mechanism for using these structures to infer the intention of an utterance in a natural language conversation is also described.
1. The Theoretical Framework

1.1. Conceptual-Based Theory
The theory presented here has as its initial premise that the basis of natural language is conceptual. That is, I claim that there exists a conceptual base that is interlingual, onto which linguistic structures in a given language map during the understanding process and out of which such structures are created during generation. Thus we are primarily concerned with the representation of the conceptual base that underlies all natural languages. The simple fact that it is possible for humans to understand any given natural language if they are immersed in it for a sufficient amount of time, and to be able to translate from that language to whatever other natural language they are well acquainted with, would indicate that such a conceptual base has psychological reality. People fluent in
many languages can pass freely from one to another, sometimes without even being overtly aware of what language they are speaking at a given instant. What they are doing is invoking a package of mapping rules for a given language from the conceptual base. The conceptual base has in it the content of the thought that is being expressed. This conceptual content is then mapped into linguistic units via realization rules. We will not discuss realization rules in any detail in this paper. The primary purpose here will be to explain what such a conceptual base looks like and how it functions during the understanding process. There is evidence that such an interlingual conceptual base exists in people's heads. Both Lenneberg (1967) and Furth (1966) found that thinking is not impaired by a lack of language. Inhelder and Piaget (1958) note that logical thinking does not find its base in the verbal symbol. Anderson (1971) notes that subjects tend to remember the conceptual content of an utterance rather than a visual image or a more linguistic representation. Bower (1970) notes the need for hypothesizing such a conceptual base in order to handle similarities in visual and verbal processing. What I am suggesting, then, is that such a conceptual base exists; that its elements are concepts and not words; that the natural language system is stratified, with the actual language output being merely an indicator of what conceptual content lies beneath it; and that the conceptual apparatus that we tend to call thinking functions in terms of this conceptual base, with concepts and the relations between these concepts as the operands. I shall refer in this paper to the conceptual content underlying an utterance, that is, what would be represented in the conceptual base, as the meaning of an utterance. The theory that is proposed here is doing its job if two linguistic structures, whether in the same or different languages, have the same conceptual representation whenever they are translations or paraphrases of each other. Early attempts at computer programs that used natural language were primarily concerned with syntactic analysis (e.g. Kuno and Oettinger (1963)). Whereas no one would claim today that syntactic analysis of a sentence is sufficient for programs which use natural language, it may not even be necessary. This is not to say that syntax is not useful: it most certainly is. But its function is as a pointer to semantic information rather than as a first step of semantic analysis, as has traditionally been assumed. It is necessary to recognize that an important part of the understanding process is in the realm of prediction. Kuno and Oettinger realized this in the predictive analysis program for syntactic analysis. However, humans engaged in the understanding process make predictions about a great deal more than the syntactic structure of a sentence, and any adequate understanding theory must predict much of what is received as input in order to know how to handle it.
1.2. Conceptual Dependency
The conceptual base is responsible for formally representing the concepts underlying an utterance without respect to the language in which that utterance was encoded. A given word in a language may or may not have one or more concepts underlying it. We seek to extract the concepts that the words denote and relate them in some manner to those concepts denoted by other words in a given utterance. We are dealing here with two distinct levels of analysis that are part of a stratified system (cf. Lamb (1966)). On the sentential level, the utterances of a given language are encoded within a syntactic structure of that language. The basic construction of the sentential level is the sentence. The next highest level in the system that we are presenting is the conceptual level. We call the basic construction of this level the conceptualization. A conceptualization consists of concepts and certain formal relations that exist between these concepts. We can consider that both of these levels exist at the same point in time and that for any unit on one level, some corresponding realizate exists on the other level. This realizate may be null or extremely complex (see Lamb (1964) for discussion of this general idea). The important point is that underlying every sentence in a language there exists at least one conceptualization. Conceptualizations may relate to other conceptualizations by nesting or other specified relationships, so it is possible for a sentence in a language to be the realization of many conceptualizations at one time. This is like saying that one sentence can express many complete ideas and the relation of those ideas. The basic unit of the conceptualization is the concept. There are three elemental kinds of concepts. A concept can be either a nominal, an action or a modifier. Nominals are considered to be those things that can be thought of by themselves without the need for relating them to some other concept. That is, a word that is a realization of a nominal concept tends to produce a picture of that real-world item in the mind of the hearer. We thus refer to nominal concepts as PP's (for picture producer). A PP, then, is the concept of a general thing, for example, a man, a duck, a book or a pen; or of a specific thing, for example, John, New York or the Grand Canyon. An action is that which a nominal can be said to be doing. There are certain basic actions (henceforth referred to as ACT's) that are the core of most verbs in a language, but this will be explained in section 4. A modifier is a concept that makes no sense without the nominal or action to which it relates. It is a descriptor of the nominal or action to which it relates and serves to specify an attribute of that nominal or action. We refer to modifiers of nominals as PA's (picture aiders) and modifiers of actions as AA's (action aiders).
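The four categories just named lend themselves to a simple typed encoding. The following minimal sketch is ours and is not part of Schank's notation; it merely fixes the vocabulary used in the examples that follow.

    from dataclasses import dataclass
    from enum import Enum

    class Cat(Enum):
        PP = "picture producer"    # nominal: man, duck, John, New York
        ACT = "action"             # basic actions underlying most verbs
        PA = "picture aider"       # modifier of a nominal
        AA = "action aider"        # modifier of an action

    @dataclass(frozen=True)
    class Concept:
        name: str
        cat: Cat

    john = Concept("John", Cat.PP)
    hit = Concept("hit", Cat.ACT)
    little = Concept("little", Cat.PA)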
It should be emphasized here that what we have said so far about concepts refers to their conceptual properties and not their sentential ones. While it is possible to have a sentence without a verb, or without a subject, for example, or to have an adjective without a noun, conceptually the corresponding things cannot exist. Each of these conceptual categories (PP, ACT, PA and AA) can relate in specified ways to each other. These relations are called dependencies. They are the conceptual analogue of syntactic dependencies used by Hays (1964), Klein (1965) and others. A dependency relation between two conceptual items indicates that the dependent item predicts the governing item. A governor need not have a dependent, but a dependent must have a governor. The rule of thumb in establishing dependency relations between two concepts is whether one item can be understood without the other. A governor can be understood by itself. However, in order for a conceptualization to exist, even a governor must be dependent on some other concept in that conceptualization. We represent the conceptual base by a linked network of concepts and dependencies between concepts that is called a conceptual dependency network (which we abbreviate "C-diagram"). Let us look at what such a network is like. Consider sentence (1):

(1) John hit his little dog.

'John' is the name of an object, so it represents a concept which can be understood by itself, and it is thus a PP. 'Hit' represents a concept of action. Each of these concepts is necessary to the conceptualization in the sense that if either were not present, there would be no conceptualization. Thus, we say that a two-way dependency exists between them. That is, they each act as governors which can be understood by themselves but cannot be understood as a conceptualization without the other. We denote the two-way dependency by ⇔. The words 'his' and 'little' both represent dependent concepts in that in order to understand them it is necessary to hold them in waiting until what they modify appears. That is, they cannot be understood alone. 'Dog' is the name of a concept which is a PP and therefore a governor. The PP 'dog' is conceptually related to the ACT 'hit' as object. That is, it is dependent on 'hit' in that it cannot be understood with respect to the conceptualization except in terms of 'hit'. We denote objective dependency by ←. We thus have the following network so far:

    John ⇔ hit ← dog

We can now add the dependents that were waiting for 'dog' as governor. 'Little' represents a PA that is dependent on 'dog'. We call this attributive dependency and denote it as ↑. The concept given by 'his' would appear to be dependent on 'dog' as well, and it is, but it is not a simple concept. 'His' is
really another syntactic representation of the PP 'John' that is being used in the syntactic form that indicates possession. What we have is one PP acting as dependent modifier to another PP. We denote prepositional dependency by ⇕ between the two PP's involved, with a label indicating the type of prepositional dependency. (Here POSS indicates that the governor possesses the dependent.) The final network is then:

    John ⇔ hit ← dog
                  ↑ little
                  ⇕ POSS John

A conceptual dependency network may be treated as a whole unit by reference to the two-way dependency link. Thus the time of the events of this conceptualization may have been 'yesterday'. This would be indicated by the use of an attributive dependency between the time-PP (PPT) 'yesterday' and the ⇔, as follows:

    John ⇔ hit ← . . .
         ↑
     yesterday

2. PP ⇔ PA
   Ex: John ⇔ tall
   Sent: John is tall
   An attributive conceptualization exists when an attribute is being predicated about a given PP.

3. PP ⇔ PP
   Ex: John ⇔ doctor
   Sent: John is a doctor
   This rule is similar to Rule 2 and is a set-inclusion type of predication conceptualization.

4. PP ↑ PA
   Ex: man ↑ tall
   Sent: the tall man
   This is a dependency between a concept and an attribute of that concept that has already been predicated.

5. PP ⇕ PP
   Two conceptual objects in the world can be related to each other in various fashions. The three principal ones are containment, location, and possession, and these are marked on the ⇕ arrows.
   Ex: man ⇕LOC New York; dog ⇕POSS John
   Sent: the man in New York; John's dog
6. ACT ← PP
   Ex: hit ← boy
   Sent: (He) hit the boy
   This is objective dependency. The PP is related as object to the ACT which governs it.

The six rules given thus far are enough to express the conceptual representation of the conceptualization underlying sentence (1). We will present other conceptual rules as they are needed for the examples.
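As an illustration only (the tuple encoding below is ours), the network for sentence (1) can be written as a list of typed links, one per conceptual rule used:

    # Typed dependency links for "John hit his little dog".
    # Link kinds mirror the rules above; the encoding itself is illustrative.
    network = [
        ("two-way",       "John", "hit"),           # rule 1: PP <=> ACT
        ("objective",     "hit",  "dog"),           # rule 6: ACT <- PP
        ("attributive",   "dog",  "little"),        # rule 4: PP, PA
        ("prepositional", "dog",  "John", "POSS"),  # rule 5: PP, PP, labeled
    ]

    def dependents_of(governor, links):
        """All concepts directly dependent on a governor."""
        return [l[2] for l in links if l[1] == governor]

    # dependents_of("dog", network) -> ['little', 'John']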
3.2. Underlying ACT's
Until this point, it may have seemed that what we are passing off as conceptual ACT's are really no more than verbs in a different guise. This is only partially true. Actually, ACT's are whatever is verbal in nature on the sentential level, rewritten into a primitive form. 'Love', for example, is an ACT, even if it is possible to nominalize it in English. In other words, it is the responsibility of the conceptual level to explicate underlying relationships that speakers know to exist. Although 'love' might be a noun in a given sentence, all speakers know that somewhere a subject and an object (for the right sense of 'love') must exist. It is the purpose of the conceptual rules to mark an ACT such as 'love' as requiring conceptual rules (1) and (6). Thus, when 'love' is encountered in a sentence, it is discovered to be the realizate of the ACT 'love' and immediately the question of the PP's that are its actor and object is raised. The conceptual processor can then search through the sentence to find the candidates for these positions. It knows where to look for them by the sentential rules, and what it is looking for by the syntax and semantics of the conceptual level. Thus, it is the predictive ability of the formulation of the conceptual rules that makes them powerful tools. It is interesting at this point to look at ACT's that do not have direct English realizates in order to more clearly pinpoint the problem of conceptual representation. Consider sentence (5):

(5) The man took a book.

Since 'man' is the actor here and 'book' is the object of the action 'took', it might seem appropriate to conceptually analyze this sentence as:

    man ⇔(p) take ← book
(We write a 'p' over the ⇔ to denote that the event being referred to occurred in the 'past'.) However, in attempting to uncover the actual conceptualization underlying a sentence, we must recognize that a sentence is often more than its component parts. In fact, a dialogue is usually based on the information that is left out of a sentence but is predicted by the conceptual rules. For example, in
this sentence, we know that there was a time and location of this conceptualization, and furthermore that the book was taken from 'someone' or 'someplace' and is, as far as we know, now in the possession of the actor. We thus posit a two-pronged recipient case, dependent on the ACT through the object. The recipient case is used to denote the transition in possession of the object from the originator to the recipient. Thus we have the following network:

    man ⇔(p) take ← book ←R(to: man, from: someone)

Now, suppose we were given sentence (6):

(6) I gave the man a book.

It is interesting to compare the underlying representations of these two sentences. 'Give' is like 'take' in that it also requires a recipient and an original possessor of the object. Then we have:

    I ⇔(p) give ← book ←R(to: man, from: I)
Note that these two conceptualizations look very much alike. They differ in the identity of actor and recipient in (5) and actor and originator in (6), and in the verb. But is there any actual reason that the verbs should be different? It would appear that conceptually the same underlying action has occurred. What is actually different between (5) and (6) (assuming that in (5) the originator was also 'I') is that the initiator of the action, the actor, is different in each instance. The action that was performed, namely transition of possession of an object, is the same for both. We thus conceptually realize both 'give' and 'take' by the ACT 'trans'. Thus the conceptualizations underlying (5) and (6) are:

    (5) man ⇔(p) trans ← book ←R(to: man, from: someone)
    (6) I ⇔(p) trans ← book ←R(to: man, from: I)
'Give' is then defined as 'trans' where actor and originator are identical, while 'take' is 'trans' where actor and recipient are identical.
What is most important here is that not only are 'give' and 'take' realized as 'trans' plus other requirements, but a great many other verbs are conceptual realizations of 'trans' plus something else as well. For example, 'steal', 'sell', 'own', 'bring', 'catch' and 'want' all have senses whose complex realizates include as their ACT the ACT 'trans'. It is this conceptual rewriting of sentences into conceptualizations with common elements that allows for recognition of similarity or paraphrase between utterances.
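By way of illustration (the dictionary shape and names below are ours), the decomposition can be stored as a primitive ACT plus a statement of which role the actor fills:

    VERB_LEXICON = {
        # verb: (primitive ACT, role identified with the actor)
        "give": ("trans", "originator"),
        "take": ("trans", "recipient"),
    }

    def conceptualize(verb, actor, obj, other_party):
        """Build the common 'trans' conceptualization for (5) and (6)."""
        act, actor_role = VERB_LEXICON[verb]
        roles = {"originator": other_party, "recipient": other_party}
        roles[actor_role] = actor
        return {"actor": actor, "act": act, "object": obj, **roles}

    # conceptualize("take", "man", "book", "someone")
    # -> {'actor': 'man', 'act': 'trans', 'object': 'book',
    #     'originator': 'someone', 'recipient': 'man'}
    # conceptualize("give", "I", "book", "man") differs only in which
    # identities fill the roles; the ACT is the same 'trans', which is
    # what makes the paraphrase relation visible.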
3.3. Conceptual Cases
We have seen in the last section a new conceptual rule which we label (7):
7. ACT ←R(to: PP, from: PP)
   This conceptual rule states that certain ACT's require a two-part recipient in a dependency similar to that of objective dependency. The similarity lies in the fact that this type of dependency is demanded by certain members of the category ACT. If it is present at all, it is because it was required. There is no option, and no ACT can have a recipient dependency without having required it. Thus, this dependency can in a sense be considered to be a part of the ACT itself.
We call those dependents that are required by the ACT conceptual cases. There are 4 conceptual cases in conceptual dependency, namely OBJECTIVE, RECIPIENT, DIRECTIVE and INSTRUMENTAL. We use conceptual case as the basic predictive mechanism available to the conceptual processor. That is, if I say 'I am going', a very reasonable inquiry would be 'Where?'. Dialogues are often partly concerned with the filling in of the case slots in a conceptualization. People do not usually state all the parts of a given thought that they are trying to communicate. Usually this is because the speaker was trying to be brief and leave out assumed or unessential information, or simply information that he did not want to communicate. A conceptual case often may not be realized in a given sentence. The sentence will nevertheless appear well-formed syntactically. But a C-diagram that contains only the sententially realized information may not be well-formed conceptually. That is, a conceptualization is not complete until all the conceptual cases required by the ACT have been explicated. The conceptual processor makes use of the unfilled case slots to search for a given type of information in a sentence or larger unit of discourse that will fit the needed slot. Linguists who are interested in syntactic well-formedness have recently begun to look into the possibility of using cases in their representations (e.g. Fillmore (1968)). However, the cases used by the linguists are syntactic cases
and should not be confused with conceptual cases. Syntactic cases have to do with the well-formedness of a sentence given a particular verb. Conceptual cases have to do with the well-formedness of a conceptualization given a particular ACT. The two are not always the same. Consider, for example, the problem of instrument in sentence (7):

(7) John grew the plants with fertilizer.

Here, 'fertilizer' is the instrument of the verb 'grow'. This is a syntactic case relation and is treated as such by certain linguists, including Fillmore. But conceptually the job is not so simple. Why? Simply because 'grow' is not something that 'John' can do to something else. The conceptualization

    John ⇔ grow

is perfectly all right, but it means that 'John grew'. If we mean 'the plants grew', then the representation of that is:

    plants ⇔ grow

But where is 'John'? In the conceptualization underlying this sentence, 'John' was doing something that caused these plants to grow. What was he 'doing'? We don't know exactly, but we do know that he did it with the fertilizer. We might be tempted to posit a conceptual instrument here and have a conceptualization with no particular ACT in it (represented by a dummy ACT — 'do'). Thus we could have (allowing a new conceptual rule (ACT ←I PP)):

    John ⇔ do ←I fertilizer

We then could relate this conceptualization to the previous conceptualization. They relate to each other causally, since John's action caused the plants to grow. We denote causality by ⇑ between the two-way dependency links. This indicates that it was not the actor or the action by itself that caused the new conceptualization, but rather the combination of the two that caused a new actor-action combination. The causal arrow is a dependency, and consequently the direction of the arrow is from dependent to governor, or in this instance from caused to causer. That is, the caused conceptualization could not have occurred without the causer occurring, so it is dependent on it. Thus we have (placing an 'i' over the ⇑ to denote intentional causation):

    John ⇔ do ←I fertilizer
       ⇑i
    plants ⇔ grow
Now, while this is roughly a characterization of what is going on here, it is not correct. In fact, 'fertilizer' was not the instrument of the action that took place. It was the object. Consider what probably happened. John took his
fertilizer bag over to the plants and added the fertilizer to the ground where the plants were. This enabled the plants to grow. This is conceptually another instance of 'trans'. What we have is this:

    John ⇔ trans ← fertilizer ←R(to: ground ⇕LOC plants, from: bag)
       ⇑i
    plants ⇔ grow
Thus, what appeared to be an instrument syntactically, and then conceptually, turned out to be an object of an action after all. This, as it turns out, is what always happens to a conceptual instrument. That is not to say that there is no conceptual instrumental case, but simply that a single PP cannot be a conceptual instrument. Consider sentence (8):

(8) Fred hit the boy with a stick.

This sentence means that the boy was hit by a stick. We assume that the stick did not do the action of its own accord, so it is reasonable to have the actor be 'Fred'. But we have just stated that PP's cannot be conceptual instruments. Thus

    Fred ⇔(p) hit ← boy ←I stick
is no good. What actually happened here is that Fred threw the stick or swung the stick or performed some other action with the stick as object. In other words, what we call 'Fred hitting the boy' is really another action entirely which we interpret as Fred hitting the boy. We thus allow a new conceptual rule (8):

8. ACT ⇑I ⟦conceptualization⟧
This rule states that an entire conceptualization can be the instrument of a given ACT. Since this is a case dependency, the new conceptualization can be considered to be a part of the original ACT, and thus the original conceptualization in some sense subsumes the instrumental conceptualization. Thus the underlying conceptualization for sentence (8) is:

    Fred ⇔(p) hit ← boy
             ⇑I
    Fred ⇔ do ← stick ←D(to: boy, from: Fred)
The 'do' here indicates that we really do not know what the action was exactly. However, the requirements on what could possibly fit the 'do' slot are rather stringent. That is, it must be an action that takes OBJECTIVE and DIRECTIVE cases (DIRECTIVE case is denoted by ←D and will be discussed below), and furthermore the particular semantic categories of 'boy' and 'stick' must be allowed. It turns out that 'throw' and 'swing' are about the only ACT's that will fit those requirements. The instrumental conceptualization is written vertically simply because we have no Z-coordinate to write in. The instrumental conceptualization should be considered to be, like the other cases, a main-line dependent of the ACT. Syntactic instruments in English are nearly always realized conceptually as the object in the instrumental conceptualization. Usually, these syntactic instruments are so stated because of brevity and because of the possible endless nesting of instruments. All conceptualizations that take RECIPIENT or DIRECTIVE case, and most that take OBJECTIVE case, take INSTRUMENTAL case as well. Thus, every time a conceptualization is used instrumentally, it is likely that there is an instrument of that instrumental conceptualization. For example, in the instrumental conceptualization for (8) the instrument there could be something like 'Fred grabbed the stick'. In turn, that conceptualization's instrument could be something like 'Fred moved his hand to the stick'. In other words, there is a great deal of information underlying the information actually stated in a given sentence. Which of this information must be retrieved is of concern to the individual in the dialogue situation. Our concern here is to make explicit the relationships that exist between concepts that have been referenced, and to be able to discover those we deem it necessary to know.
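The kind of constrained search involved in filling the 'do' slot can be pictured as follows, with a toy lexicon and an encoding that are entirely ours:

    ACT_LEXICON = {
        # ACT: (cases the ACT takes, semantic categories allowed as object)
        "throw": ({"OBJECTIVE", "DIRECTIVE", "INSTRUMENTAL"}, {"physobj"}),
        "swing": ({"OBJECTIVE", "DIRECTIVE", "INSTRUMENTAL"}, {"physobj"}),
        "love":  ({"OBJECTIVE"}, {"animate"}),
    }

    def candidates_for_do(required_cases, object_category):
        """ACTs that could fill a dummy 'do' slot under the given constraints."""
        return [act for act, (cases, cats) in ACT_LEXICON.items()
                if required_cases <= cases and object_category in cats]

    # candidates_for_do({"OBJECTIVE", "DIRECTIVE"}, "physobj")
    # -> ['throw', 'swing'], matching the observation above.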
9. ACT ←D(to: PP, from: PP)
   Ex: go ←D(to: N.Y., from: L.A.)
   Sent: (He) went to N.Y. from L.A.
   This rule indicates that two PP's can be dependent on the ACT as DIRECTIVE case.

Rules 6, 7, 8 and 9 constitute the conceptual cases of conceptual dependency theory. Thus there are only four cases, of which there can be as few as none or as many as three for a given ACT. Each ACT category (explained in section
3.4) requires a certain set number of cases. Thus, any given ACT requires a given number of cases and, no matter what the English realization, will have exactly that number in the C-diagram.

3.4. ACT Categories
Central to the problem of analyzing natural language conceptually is the problem of predicting what conceptual information must be found in order to complete a given conceptualization. These predictions are based principally upon the conceptual rules and the ACT categories. The category of an ACT is an indication of what cases an ACT must take. Thus, when a word is realized as an ACT, the ACT's category can be looked up, and the case requirements of that ACT made known. The semantic requirements of each case are then discovered for the individual ACT. This allows for powerful predictions to be made and changes the basic analysis process from bottom-up to top-down and bottom-up. There are the following ACT-types:

PACT — Physical ACT — PACT's require an objective case and an instrumental case. They are representative of the traditional actor-action-object type constructions. Example: hit, eat, touch.
EACT — Emotional ACT — EACT's require an objective case only, and are abstract in nature. Example: love.
TACT — Transfer ACT — TACT's require objective, recipient, and instrumental cases and express alienable possession. Example: trans.
CACT — Communication ACT — CACT's also require objective, recipient and instrumental cases. The object, however, is never concrete and in fact is never a PP but always a conceptualization. Example: communicate, say, read. In order to do this, we need a new conceptual rule:

10. CACT ← ⟦conceptualization⟧
This rule states that a particular type of ACT, a CACT, takes a conceptualization in the objective slot rather than a PP.
An example of this type of ACT can be seen in the conceptual construction underlying sentence (9):

(9) John told Mary that he loves her.

    John ⇔(p) communicate ← ⟦John ⇔ love ← Mary⟧ ←R(to: Mary, from: John)
This conceptual construction says that 'John communicated to Mary that he loves her'. Actually, John could have done that in any of a number of ways not involving speech. We say that 'tell' means conceptually 'communicate by saying', i.e., you can communicate ideas by saying words. Thus there is an instrument missing here whose object must be the actual words. Since the object consists of words and not concepts, as is usual, we write the items in quotes to indicate that they actually are not part of the C-diagram:
    John ⇔(p) communicate ← ⟦John ⇔ love ← Mary⟧ ←R(to: Mary, from: John)
             ⇑I
    John ⇔ say ← 'I love you' ←R(to: Mary, from: John)
Here again, this might not be exactly what was said in (9), since the instrument could have been left out of the sentence. The sentence could have ended with 'by kissing her', and then that would belong in the instrument conceptualization. We use the C-diagram shown above until given reason to do otherwise. That is, we make assumptions about the information that was not explicitly stated. This is something that humans do, and therefore something that a theory of understanding and an interactive computer program must do. We now return to the ACT categories.

DACT — Direction ACT — DACT's take directive case, objective case, and instrumental case and express motion of objects that are inanimate. Example: move.
RACT — Reflexive ACT — RACT's take directive and instrumental case. The object of the action is the same as the actor in these instances.
IACT — Intransitive ACT — IACT's are actions that are performed by an actor in isolation. That is, IACT's have to do with the state of existence of the actor and nothing else. Thus, IACT's take no case at all. Example: sleep, be.
SACT — State ACT — SACT's are a special kind of ACT in that they serve to introduce another conceptualization that they are not a part of. Example: want, believe.

11. SACT ⟦conceptualization⟧ — This rule indicates that certain conceptualizations can be dependent on an ACT in a manner other than that of a case dependency.
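Gathering the eight ACT-types into one table makes the predictive use of the categories concrete: once an ACT's type is known, the unfilled case slots are exactly what the processor must look for. The encoding is ours, summarizing the definitions above:

    CASES_BY_ACT_TYPE = {
        "PACT": {"OBJECTIVE", "INSTRUMENTAL"},
        "EACT": {"OBJECTIVE"},
        "TACT": {"OBJECTIVE", "RECIPIENT", "INSTRUMENTAL"},
        "CACT": {"OBJECTIVE", "RECIPIENT", "INSTRUMENTAL"},  # object is a conceptualization
        "DACT": {"OBJECTIVE", "DIRECTIVE", "INSTRUMENTAL"},
        "RACT": {"DIRECTIVE", "INSTRUMENTAL"},
        "IACT": set(),
        "SACT": set(),  # introduces a dependent conceptualization instead
    }

    def unfilled_cases(act_type, filled):
        """Case slots the processor must still search for (its predictions)."""
        return CASES_BY_ACT_TYPE[act_type] - set(filled)

    # unfilled_cases("TACT", {"OBJECTIVE"}) -> {'RECIPIENT', 'INSTRUMENTAL'}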
[Dictionary entry fragment: 'see', sense seex — ACT-cat: CACT; actor: animal; realized for a physical object as look-at.] Thus the following conceptual structure is found:

    I ⇔ seex ← Grand Canyon

The next word in the sentence is 'flying'. Since 'flying' is realized by an ACT only, it is the candidate for the verb part of a new triple to be sent up to the conceptual processor. Since a prepositional phrase beginning with 'to' follows, it is a vio verb. However, the problem exists of what the subject of the triple for 'flying' is. This problem is not divorced from the problem of where to attach the new conceptualization that will be formed from this triple. Previously, when we entered the conceptual processor with a triple, there was a place waiting for it in the old conceptualization, or else there was no old conceptualization, in which case there was no problem. But here we are within a sentence and we already have all the main-line elements of a conceptualization filled in. These two problems have the same basis for solution. There is no place in the right side of the conceptualization to place a new conceptualization. But there is also no immediate candidate for the position of subject of the new conceptualization. The rule in situations such as this is always the same. If there is no PP available as subject, the last PP which was placed in the conceptualization from the sentence is the prime candidate. This takes care of the problem of the entering triple for the conceptual processor, but still does not explicate where the new conceptualization is to be placed. The actual problem is where to place the 'fly' construction, since 'Grand Canyon' has already been placed in the old conceptualization. The conceptual rules are checked to find a place for connecting 'fly' to something already in the conceptualization. Again, 'Grand Canyon' is the prime candidate for attaching 'fly', since it was the last concept placed in the conceptualization. The conceptual rule PP ⇔ ACT is found, and now it is clear where such a connection would go. So the verb-ACT dictionary is entered with the triple (Grand Canyon, fly-vio, nil). The entry for fly is:
[Dictionary entry fragment: 'fly' — sense fly-vio: X ⇔ fly, where X is a bird, plane or insect; sense fly-vt: . . .]
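The subject-selection rule used in this walkthrough, namely taking the last PP placed in the conceptualization when no other candidate is available, can be sketched as follows (our illustration, not the program's code):

    def subject_for_new_triple(available_subject, placed_pps):
        """Pick the subject for a newly arrived verb ('flying')."""
        if available_subject is not None:
            return available_subject
        return placed_pps[-1]   # last PP placed in the conceptualization

    placed = ["I", "Grand Canyon"]
    triple = (subject_for_new_triple(None, placed), "fly-vio", None)
    # -> ('Grand Canyon', 'fly-vio', None), as in the walkthrough above.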
    Cms . Cm-xRELp    (III)

in order to impose on this symbol a particular articulation:

    Cms . Cm-xRELp ⇒ (Cm+1-x«RELp», Cm+1s /xRELp/) Cms    (IV)

the seme x making the sense of the left constituent (CG) concrete, with the category in guillemets signifying the semantic relation of the same name introduced between the respective CG and right constituent (CD), and at the same time marking the eventual role of the seme x; the same semanticized category is added to the CD, but this time between slashes, recalling the immediate semantic environment of the right constituent.

2.1. Superposition of Categories
Rule (III), reiterated, applies a superposition of categories to the initial semantic symbol, e.g. Cms . Cm-xRELp . Cm-yRELq . . . In that case, every later category will target the right constituent of the preceding articulation:

    Cms . Cm-xRELp . Cm-yRELq . . . ⇒ (Cm+1-x«RELp», Cm+1s /xRELp/ . Cm+1-yRELq . . .) Cms    (V)
Ex. To 'program' articulation (8), it suffices to imagine, applied to the initial symbol Cms, the superposition of categories Cm-M16ADAT . BACT . SCACT . SNOM, which appeals to rule (V), reiterated (for superpositions with the same destination, the address of the targeted constituent is indicated in the first category), giving:

    Cms . Cm-M16ADAT . BACT . SCACT . SNOM ⇒ (Cm+1-M16A«DAT», Cm+1s /M16ADAT/ . Cm+1-BACT . SCACT . SNOM) Cms ⇒ . . .

    . . . (Cm+2-w1«JUNC», Cm+2 /wJUNC/ . y1REL4 . . .) Cm+1-x«RELp»    (XVII)

This is how the definite determiners in the junctions already cited can be explained: le matin du 16 avril ( = Cm-M1«JUNC»), le docteur Bernard Rieux, son cabinet = le cabinet du docteur Rieux. By contrast, un rat remains indeterminate since, although junctive, it possesses no determined elements in its junction. Comparing intercalation with superposition, one finds that the latter does no more than program an ordered series of conjoined utterances; thus, for superposition (13), these are the utterances (26):

— (a) The author states the affirmation (b).
— (b) The morning of April 16 is the date of the event (c).
— (c) Doctor Bernard Rieux is the actant of the change (d).
— (d) Being out of the office is the state due to the action (e).
— (e) '1 time' is the structure of the process (f).
— (f) Perfection is the aspect of the accomplishment (g).
— (g) Faire ('do') is the name of the state (h).
— (h) There exists a state.
Intercalation, for its part, with the junction anticipating the primary superposition, serves to refuse in advance the character of utterance to all the intercalated propositions, and thus to declare them pure designations: before appearing in its quality of utterance, the sense (24) 'le docteur . . . sortit de son cabinet', preceded by the junction on the DAT, becomes the designation 'le moment où le docteur . . . sortit de son cabinet' (see (25)).10

10 The distinction between 'substantial' categories (such as ACT or ASP) and 'formal' ones (JUNC) seems to be lacking in the set of 'labels' (see in [5]), where that of 'epithet' or 'specification' enters on the same footing as 'agent' or 'object'.
It is time to insist on the equivalence of the programming categorizations — superposition or intercalation — and of the respective programmed articulations — progression or ingression; since the latter merely realize the former according to the invariable rules (IV) and (XIV), one can, if need be, operate on the programs instead of 'wading' through their realizations.

2.3. Particularization of Categories
The semes ranged in the semantic categories are susceptible to particularization, i.e. a special intercalation effected by rule (XVII):

    aRELp ⇒ aRELp ('aJUNC . 'aRELpart . 'aNOM)    (XVII)
Within the integral seme a one distinguishes a particular seme 'a, derived from a by a relation of particularization ('aRELpart): 'a is a particular case of a. The particular seme differs from the integral value in a regular way, given the category in which it is ranged: the seme 'a of the DAT category generally narrows down to 'a non-final moment of a' — except, of course, in cases where the difference is specially marked; for the ACT or NOM categories, the seme 'a constitutes a particular manifestation of a — including the zero manifestation — related to an occasional moment or place, etc.

Ex. One may suppose that the semes contained in categorization (27) are due to the prior particularizations (28).

    Cms . Cm-'M16ADAT . 'CDBRXH...LOC . 'DBRACT . XHCDBR-DBROBJ(. . . XHCDBRLOC . DBRACT) . 'RMACT . 1FSTRUCT . PASP . 'FBNOM    (27)

    M16ADAT('M16AJUNC . 'M16ARELpart . 'M16ANOM)  ('M16A = 'a moment of the morning of April 16');
    CDBRXH...LOC('CDBRXH...JUNC . 'CDBRXH...RELpart . 'CDBRXH...NOM)  ('CDBRXH... = CDBR = 'the office of Doctor Bernard Rieux, one of the successive places of his itinerary');
    DBRACT('DBRJUNC . 'DBRRELpart . 'DBRNOM)  ('DBR = 'a manifestation of Doctor Bernard Rieux, related to a fixed date and place');
    XHCDBR...LOC('XHCDBRJUNC . 'XHCDBRRELpart . 'XHCDBRNOM)  ('XHCDBR... = XHCDBR = 'a place outside the doctor's office . . ., one of the successive places of his itinerary');
    RMACT('RMJUNC . 'RMRELpart . 'RMNOM)  ('RM = 'a manifestation of the dead rat, here the zero manifestation, i.e. absence');
    FBNOM('FBJUNC . 'FBRELpart . 'FBNOM)  ('FB = 'a partial manifestation of the complex action faire and buter, here the action faire alone').    (28)
Given the values assigned by (28) to the particularized semes of (27), one recognizes in (27) the superposition (24).

Categorization-particularization. In order to be realized as an articulation, categories with particularized semes require the construction of a coherent superposition of them, detached from the initial superposition with the respective integral semes:

    Cm-aRELp(aJUNC . . .) . bRELq(bJUNC . . .) . . . ⇒ Cm+1-'aRELp('aJUNC . . .) . 'bRELq('bJUNC . . .) . Cm-aRELp(aJUNC . . .) . bRELq(bJUNC . . .) . . .    (XVIII)
i.e. the superposition of categories with integral semes, targeting a constituent Cm, can be anticipated by an analogous superposition of categories with the respective particularized semes, which will target the left constituent of the next order (Cm+1). A superposition containing an anticipating particularization imposes on the targeted constituent an altogether special articulation (XIX), which we shall call 'scission':

    Cms . 'aRELp('aJUNC . . .) . Cm-aRELp(aJUNC . . .) . . . ⇒ (Cm+1s . Cm+1-'aRELp('aJUNC . . .) . . . , Cm+1s . Cm+1-+aRELp(+aJUNC . +aRELcompl . 'a!NOM) . . .) Cms    (XIX)
i.e. the constituent Cms is divided into two immediate constituents, of which the left receives the whole particularized, anticipating superposition, while the right takes on a complementary superposition, that is, one containing reduced semes obtained from the integral semes by deducting, for each category of the superposition, the corresponding particularized seme: +aRELp = aRELp − 'aRELp; e.g. +aDAT takes, relative to 'aDAT, the value 'period of a posterior to 'a'; the seme of +aLOC designates 'all the places provided for by aLOC, beginning with 'aLOC', etc.; the symbol of determination (!) is granted to the seme 'a intercalated in the right constituent — merely for having figured in the respective left constituent (according to rule (XVII)), which causes the corresponding junctive seme, hence +a, to be considered determined.
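As a loose illustration only, the scission operation can be rendered in Python; the representation and the seme 'algebra' below are drastic simplifications of the author's notation, invented to make the left/right split concrete:

    def scission(superposition, particularize):
        """superposition: list of (category, integral_seme) pairs;
        particularize: maps an integral seme to its particular part 'a.
        Returns the left (particularized) and right (complementary)
        superpositions produced by rule (XIX)."""
        left = [(cat, particularize(seme)) for cat, seme in superposition]
        right = [(cat, f"{seme} - {particularize(seme)}")  # complement: +a = a - 'a
                 for cat, seme in superposition]
        return left, right

    # E.g. for DAT: particularizing 'the morning of April 16' to one moment
    # leaves the posterior period as the complementary DAT seme:
    # scission([("DAT", "M16A"), ("ACT", "DBR")], lambda s: "'" + s)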
TOPIC/COMMENT I N
FORMAL
DESCRIPTION
321
Ex. 1. Suppose a superposition is constructed from the categories with integral semes enumerated in (28), and that it is preceded by an analogous categorization with the respective particularized semes (underlined in (28)):

    'M16ADAT . 'CDBRLOC . 'DBRACT . XHCDBR-DBROBJ(. . . XHCDBRLOC . 'DBRACT) . 1FSTRUCT . PASP . 'FNOM . Cm-M16ADAT . DBRACT . RMACT . 1FSTRUCT . CDBRXHCDBR...LOC . PASP . FBNOM    (29)
The articulation imposed on the initial symbol Cms by superposition (29) would be scission, which would evidently result in categorization (27) (equivalent to (24)) in its left constituent, and would give, for the right constituent, the complementary superposition (30):

    Cm+1s . Cm+1-+M16ADAT(+M16AJUNC . +M16ARELcompl . 'M16A!NOM)  ( = 'period posterior to the moment 'M16A!, known for having served as DAT in the respective left constituent (see (27))');
    +CDBRXHCDBR...LOC(+CDBRXHCDBR...JUNC . +CDBRXHCDBR...RELcompl . XHCDBR!NOM)  (+CDBRXHCDBR... = XHCDBR... = 'all the places, except CDBR, beginning with the "outside of the office" known from (27)');
    +DBRACT(+DBRJUNC . +DBRRELcompl . 'DBR!NOM)  (+DBR = 'Doctor Bernard Rieux after having left the office');
    +RMACT(+RMJUNC . +RMRELcompl . zeroNOM)  (+RM = RM = 'a dead rat, unknown, which has just manifested itself');
    +FBNOM(+FBJUNC . +FBRELcompl . BNOM)  (+FB = B = 'the action of stumbling (buter)').    (30)
Ex. 2. The complementary superposition (30), issuing from the supposed scission, can in turn undergo a further particularization, yielding a new anticipating superposition (31):

    Cm-'+M16ADAT(. . .)  ( = 'a moment of the period +M16A, see (30)');
    'XCDBRMPLOC('XCDBRMPJUNC . 'XCDBRMPRELpart . XCDBRMPNOM)  ( = 'the middle of the landing, a particular place outside the office');
    '+DBRACT(. . .)  ( = 'Doctor Bernard Rieux in the middle of the landing, after having left the office');
    '+RMACT(. . .)  ( = 'a dead rat, in the middle of the landing');
    1FSTRUCT . PASP . '+BNOM(. . .)  ( = 'the action of stumbling'),    (31)
which results, after regular articulation (according to rules (IV)-(V)), in utterance (32): 'A un moment du matin du 16 avril, de la période postérieure à l'instant où le docteur Bernard Rieux sortit de son cabinet, le docteur . . . buta contre un rat mort, au milieu du palier' ('At a moment of the morning of April 16, in the period posterior to the instant when Doctor Bernard Rieux left his office, the doctor . . . stumbled over a dead rat, in the middle of the landing'). (32)

Exhausted particularization. It may happen that the particularization semes equal, in their values, the respective integral semes, so that, at the moment of scission, the complements are reduced to zero, and the scission merely reveals the absolute exhaustion, in the right constituent, of the sense to be articulated.

Ex. 1. Two articulation-scissions sufficed to exhaust the sense programmed by categorization (28):

    Cms . Cm-'M16ADAT . . . . Cm-M16ADAT . . . ⇒ (Cm+1s . Cm+1-'M16ADAT . . . , Cm+1s . Cm+1-+M16ADAT . . .) Cms
    Cm+1s . Cm+2-'+M16ADAT . . . . Cm+1-+M16ADAT . . . ⇒ (Cm+2s . Cm+2-'+M16ADAT . . . , Cm+2s . Cm+2-+'+M16ADAT . . .) Cm+1s

where
    (left of the first scission) = 'at a moment of the morning of April 16, the doctor . . . left his office'
    (left of the second scission) = 'at a moment of the posterior period . . ., the doctor . . . stumbled over a dead rat, in the middle of the landing'
    (right of the second scission) = zero    (33)
Ex. 2. The articulation-scission procedure suffices to exhaust, while structuring it, a sense of any intercalated complexity, even if, at the 'last synthesis', it had the volume of a whole paragraph (34):

«Le matin du 16 avril, le docteur Bernard Rieux sortit de son cabinet et buta contre un rat mort, au milieu du palier (1). Sur le moment, il écarta la bête sans y prendre garde et descendit l'escalier (2). Mais, arrivé dans la rue, la pensée lui vint que ce rat n'était pas à sa place et il retourna sur ses pas pour avertir le concierge (3). Devant la réaction du vieux M. Michel, il sentit mieux ce que sa découverte avait d'insolite (4). La présence de ce rat mort lui avait paru seulement bizarre tandis que, pour le concierge, elle constituait un scandale (5). La position de ce dernier était catégorique: il n'y avait pas de rats dans la maison (6). Le docteur eut beau l'assurer qu'il y en avait un sur le palier du premier étage, et probablement mort, la conviction de M. Michel restait entière (7). Il n'y avait pas de rats dans la maison, il fallait donc qu'on eût apporté celui-là du dehors (8). Bref, il s'agissait d'une farce (9). Le soir même, Bernard Rieux, debout dans le couloir de l'immeuble, cherchait ses clefs avant de monter chez lui (10).»

For an overall view, easy to elaborate in detail, one would have the tree (35):

    Cm-1s (1-10)
      Cms (1-9)      Cms (10)
      Cm+1s (1)      Cm+1s (2-9)
      Cm+2s (2-3)    Cm+2s (4-9)
      Cm+3s (4-5)    Cm+3s (6-9)
      Cm+4s (6)      Cm+4s (7-9)
      . . .
      Cm+8s (zero)    (35)

Conclusion
The three types of categorization presiding over semantic articulation fulfil a triple function — that of regularizing, enriching, and distributing the sense to be articulated. Thus a three-dimensional space takes shape which apparently suffices to comprehend the canonical semantic perspective. To embrace the 'functional' perspective, a special dimension is required, which is the subject of another work.11

Institut Thorez, Moscow

11 The main ideas of the present article, with more details and examples, are set out in [6], [12], [13] and [14].
REFERENCES

[1] Jespersen, O. The Philosophy of Grammar, 1924.
[2] Cros, R. C., Gardin, J.-C., Lévy, F. L'automatisation des recherches documentaires, un modèle général: le SYNTOL, Paris, 1964.
[3] Chomsky, N. Aspects of the Theory of Syntax, The MIT Press, 1965.
[4] Weinreich, U. 'Explorations in Semantic Theory', in T. A. Sebeok (ed.), Current Trends in Linguistics, Vol. III, 1967.
[5] Veillon, G., Veyrunes, J., Vauquois, B. 'Un métalangage de grammaires transformationnelles', IIe Conférence internationale sur le traitement automatique des langages, Grenoble, 1967.
[6] Martem'ianov, Y., Mouchanov, Y. 'Une notation sémantique et ses implications linguistiques', ibid.
[7] Fillmore, C. J. 'The Case for Case', Universals in Linguistic Theory, Austin, Texas, 1968.
[8] Gladkij, A. A. 'О способах описания синтаксической структуры предложения', Computational Linguistics, VII, Budapest, 1969.
[9] Dorofeev, G. V., Martem'ianov, Y. 'Логический вывод и выявление связей между предложениями в тексте', Машинный перевод и прикладная лингвистика, 12, Москва, 1969.
[10] Halliday, M. A. K. 'The Place of "Functional Sentence Perspective" in the System of Linguistic Description', Symposium in Mariánské Lázně, 1970.
[11] František Daneš, 'FSP and the Organisation of the Text', ibid.
[12] Martem'ianov, Y. 'К описанию текста: язык валентно-юнктивно-эмфазных отношений', Машинный перевод и прикладная лингвистика, 13-14, Москва, 1970-1971.
[13] Martem'ianov, Y. 'Актуальное членение: позиционный и лексический способы выражения', Тезисы конференции по автоматической обработке текста, Кишинев, 1971.
[14] Leontjeva, N., Martem'ianov, Y., Rozencveig, V. 'О выявлении и представлении смысловой структуры текстов экономических документов', Семантические проблемы автоматизации информационного поиска, Киев, 1971.
FOCUS AND TOPIC/COMMENT IN A FORMAL DESCRIPTION

PETR SGALL
The aim of the present paper is to show that Chomsky's (1968) treatment of "presupposition" and "focus" should be relieved of some inconsistent formulations (§ 1), and that some difficulties connected with this approach make it worthwhile to reconsider whether a dependency-based description would not be preferable (§ 2), especially if it contains a hierarchy or ordering of the elements of the semantic (and/or deep) structure of the sentence, which determines the range of permissible focus.

1. Problems of the topic/comment articulation (under the names of topicalization, theme, functional sentence perspective, communicative dynamism, or more recently of presupposition and focus) belong to the most important and decisive issues in the present dispute among transformationalists on the status of the semantic component. This has been shown once more in Chomsky (1968), where it is argued that such pairs of sentences as (1), (2) could be analyzed into presupposition and focus, defining the latter as "the predicate of the dominant proposition" in deep structure (p. 31), while to account also for such pairs as (3), (4), it is necessary to assume that "the focus is determined . . . as the phrase containing the intonation center" in the surface (or even phonetic) representation of the sentence (ibid.).

(1) It isn't JOHN who writes poetry.
(2) John doesn't write POETRY.
(3) Does John write poetry in his STUDY?
(4) It isn't in his STUDY that John writes poetry.

Before considering the question whether the "surface structure" alternative avoids all the difficulties connected with the "deep structure" one, we want to point out an inconsistency in Chomsky's formulations. The use of the definite article in the passages quoted above (. . . the predicate . . . , . . . the phrase . . .) suggests the idea that according to the author there is a single such phrase in every sentence, i.e., that the focus is determined uniquely.
Of course, this is not the case, and in further discussion as well as in his closing summary (p. 43, where he characterizes focus as "a phrase containing the intonation center", italics mine), Chomsky considers only the "range of permissible focus" as determined uniquely by the grammatical structure of the sentence. His examples (one of which is reproduced here as (5) and (6), for ease of reference) show that the range of permissible focus corresponds to a certain hierarchy of phrases, with phrases embedded gradually one into another, so that an idea of a linear ordering suggests itself.

(5) It wasn't an ex-convict with a red SHIRT that he was warned to look out for.
(6) (i) an ex-convict with a red SHIRT
    (ii) with a red SHIRT
    (iii) a red SHIRT
    (iv) SHIRT

This idea is corroborated by the fact that in the cases for which Chomsky provides a more or less explicit analysis, mostly the permissible choices of focus yield a phrase that figures as rightmost in the deep structure. This would be even more conspicuous, perhaps, if one were to work not only with examples of the second or even third (contrastive) layer of the topic/comment articulation of the sentence, as Chomsky does, but to add examples of the first layer, too. (Actually, this is the case for sentence (2); in such examples as (7), (8), unmarked from the point of view of topicalization, the predicate noun is the rightmost of the main constituents; and the intonation center is at the end of the sentence even in (9), (10) as well as other sentences.)

(7) The autumn was very WARM.
(8) During these years he became an old MAN.
(9) John came from SPRINGFIELD.
(10) John went from Springfield to CHICAGO.
It is well known that the transformationalists' interest in studying the phenomena of topic/comment articulation was highly influenced by the writings and lectures of M. A. K. Halliday; and Halliday (1967) himself quotes Czechoslovak scholars as one of the sources of his approach. It could be advantageous, then, to review what Mathesius, Daneš, Firbas and others have to say about the nature of these phenomena, especially when comparing English, as a language in which the surface word order is highly fixed by grammar, with Slavic languages having a rather "free" word order. We have tried to introduce some of their results into a generative framework elsewhere (Sgall, 1967; 1972). Here we would like to state only that the idea of a hierarchy more or less similar to a linear ordering has been formulated and discussed in detail for years in several writings by Firbas.
Firbas claims that with the phenomena involved it is not sufficient to work with a mere dichotomy (be it in terms of "given" vs. "new", or "what is talked about" vs. "what is said about it"), but that it is necessary to work with a whole scale of degrees of "communicative dynamism (CD)". Thus in our example (10) the subject carries the lowest degree of CD (it is the "theme proper", the topic), and the adverbial of direction the highest (being the comment or rheme proper). The verb carries a higher degree of CD than the subject, but a lower one than both the complements. The position of individual items in the hierarchy of CD is in many types of phrases basically determined by their semantic structure: for instance, the actor usually carries a lower degree of CD than the verb, while the goal carries a higher degree than the verb; the "newcomer" on the scene is higher on the scale of CD than the verb referring to his appearance on the scene, while the scene itself or "local setting" usually belongs to the items with a low degree of CD. This basic distribution of CD may, however, be overridden by other factors, such as the contextual dependence of an item that, according to its position in the semantic structure of the sentence, would carry a relatively high degree of CD; this item is topicalized, and the sentence gets a shape belonging to what we call the second layer.

2. Transformationalists who are used to working with phrase structure grammars, i.e. with the "immediate constituent" approach, assume, of course, that the units of the topic/comment articulation (topicalization, etc.) are phrases. Chomsky also treats the concept of focus as a phrase, and finding such examples as (11), he regards them only as corroborating the view that surface phrases, rather than deep phrases, are involved.

(11) John is neither EASY to please, nor EAGER to please, nor CERTAIN to please, nor INCLINED to please, nor HAPPY to please.

Ignoring the question of the necessity of repeating the words to please five times in this sentence (which could perhaps shed more light on its semantic or deep structure), we might add other examples showing that even if one has recourse to surface (or phonetic) structure, one must speak of units other than phrases. We do not consider here such examples as (12), quoted by Chomsky, which belong to the third (contrastive) layer (to the "second instance" of Bolinger, 1952), since in this layer stress can be placed on any item, even on one which does not bear a word accent or does not have a syllabic shape, thus requiring quite general rules operating also with phonetic units.

(12) John is more concerned with AFfirmation than with CONfirmation.

But let us restrict ourselves to sentences in which the placement of stress is not influenced by some contrast. Some of our examples, such as (13), might
be subject to discussion in so far as their semantic structure could, under the approach of Lakoff and Peters (1966), have a form such that even our approach characterized below would not produce a proper solution. We do not think it necessary to derive the with-construction from conjunction in these cases, but it would be beyond the scope of this presentation to discuss the issue here. On the other hand, such examples as (14) clearly demonstrate that at least for some cases our treatment is more adequate than the usual phrase structure analysis.

(13) You shouldn't argue with Bill about MONEY.
(14) John went to SICILY for a week.

If the possible "natural responses" can serve as a criterion for the determination of what can be chosen as the focus, and if (15) and (16) are possible natural responses to (13) and (14), respectively, then it is necessary to state that with Bill about money in (13) and to Sicily for a week in (14) can be chosen as the focus, even though neither of them is a phrase of any level (semantic, deep, shallow, surface).

(15) It'll be better to argue with the director about a free TICKET.
(16) Oh no, he went only to his PARENTS for the weekend.

Difficulties of a similar sort would be much more apparent in languages with free word order if phrase structure syntax were used for their description. If, for instance, the Czech sentence Tu knihu zanesl otec domů (the book-Accusative took father-Nominative home) may have as its natural response Ne, zanesl ji Jan do knihovny (No, took it-Accusative John-Nominative to the library), then we do not reach the desired result by analyzing sentences into phrases. Perhaps it is not by accident that phrase structure (immediate constituent) syntactic analysis has almost never been attempted in connection with such languages; rather, they have always been analyzed by means of dependency syntax. Within systems using phrase structure, even if more levels are used for the description of syntax, the difficulties characterized in the preceding paragraphs can scarcely be avoided. As for transformational descriptions, we have seen that Chomsky's "surface structure alternative" does not provide a solution for all the difficulties connected with the "deep structure" one. Moreover, to abandon the assumption that transformations preserve meaning leads, as is well known, to far-reaching difficulties connected with the questions of "reversibility" of the description (existence of a recognition routine, etc.), with questions of the description as a part of a model of a mechanism internalized and used by a speaker/hearer (see Sgall, 1965; 1971), and even to a serious weakening of the methodological advantages of the generative description, distinguishing a deep structure that determines the semantic
interpretation and a surface structure that underlies the phonetic manifestation of messages (as discussed especially by Chomsky, 1966); the heuristic significance of the hypothesis has been very aptly demonstrated by Partee (1970). It seems, however, at least in the questions concerning us here, that the advantages connected with the standard theory (and necessary for its generative semantics variant) could be preserved if the following two conditions are met:

(a) the relations in deep structure (or in the semantic representations) should be defined in terms of dependency trees rather than in terms of phrase markers (for various forms and uses of dependency trees, see esp. Hays, 1964; Robinson, 1970a, b; Sgall et al., 1969; Sgall and Hajičová, 1970). Such a segment of text as easy to PLEASE, which can be taken as focus, then corresponds to a deep structure syntagma governed by the verb (see Fig. 1);

(b) the deep (or semantic) structure must provide for Chomsky's concept of the range of permissible focus and for the scale (or hierarchy) which underlies this concept; this can be achieved on the basis of the ordering known as the scale of communicative dynamism in the terminology of Firbas, a possibility which will be discussed in § 3.

We see that, even if in all other respects the phrase structure approach, or, more precisely, the class of context-free grammars, and the class of dependency grammars are equivalent, with respect to the phenomena of topic/comment articulation the dependency approach has certain advantages.

3. On the basis of the hierarchy of communicative dynamism (CD, see § 1) the notion of focus can be characterized as a connected part of the deep (semantic) structure of the sentence that includes the item carrying the highest degree of CD. This formulation assumes that a substring corresponding to a highest part of the scale of CD can be chosen as focus, the rest of the scale belonging, in some sense, to the presuppositions of the sentence (see Hajičová's paper in this volume for an analysis of various uses of the term presupposition). If this proves to be true, then it does not matter that the segment chosen as focus is not always a surface, or a deep, phrase; for sentence (14), for instance, the hierarchy could be determined as in (17), where the subscripts stand for the degrees of CD, with 0 denoting the lowest:

(17) John_0 went_1 for a week_2 to Sicily_3

According to the chosen description, all the sentences (18) to (20) belong to the presupposition-sharing responses to (14); either the part containing only degree 3, or also that bearing degree 2, as in (19), or the whole part consisting of the syntagmas carrying the degrees 1, 2, 3, as in (20), can be chosen as focus.
(18) No, he went for a week to PARIS.
(19) No, he went for two days to CHICAGO.
(20) No, he stayed at HOME.

We have hitherto referred to Chomsky's procedure of "natural response" as a procedure for determining the range of permissible focus. The well-known "question test", used for a long time now to determine the topic and comment of a sentence (see Hatcher, 1956; Daneš, 1969), is another very useful procedure of the same sort. The item that must appear in all questions to which the given sentence can be considered an answer (with the exception of quite general questions, such as "What happened?") has the lowest degree of CD; that which cannot occur in any such question has the highest degree of CD; and those items that must appear in any question in which an item A appears carry a lower degree of CD than A. Thus possible questions having (14) as their answer are, among others: Where is John? Where did John go? Where did John go for a week?

The examples discussed above have one property in common, namely that their surface word order coincides with their scale of CD. Of course, this is not always so. But it is worth examining whether the scale of CD can be taken as a "deep word order"; this means that the left-to-right ordering of the elements of semantic representations of sentences (and/or of their deep structures) would be interpreted as the scale of CD (for a more detailed account see Sgall, 1972). In the transformational or transductive part, there would be some rules providing for the order of elements to be changed in cases where this is either conditioned grammatically in the given language, or connected with a stylistically marked character (and with a special intonation shift), or both.

We must admit, of course, that the phenomena involved in this discussion are not yet well described from the empirical point of view. Only after phenomena of intonation, free word order, emphasis, "natural" questions, responses and contexts have been studied in much more detail at the level of observational adequacy will it be possible to evaluate various proposed approaches with more certainty, or to find new, more adequate ones. Thus, we do not claim to have solved all the unclear questions connected with focus and topicalization. There are, for instance, open questions connected with the internal structure of individual constituents, especially that of noun phrases. One has to reckon with the fact that the hierarchy of CD, as well as the syntax of the sentence, is not linear but has an articulated internal structure; some of the properties of this structure have been investigated by Svoboda (1968).

4. Let us add one remark concerning Chomsky's notion of focus and its relationship to the phenomena of topic/comment articulation. What is actually uniquely determined by the structure of the sentence is not focus itself, but
the range of permissible focus. It remains open to further discussion whether such a sentence as (5) should be regarded as ambiguous, with different meanings corresponding to the different possibilities of choice of focus indicated in (6), or whether the choice of focus belongs only to individual utterance tokens, not to the semantics of the sentence as such. If the latter solution proves more adequate, this would mean that "what is spoken about" (Chomsky's presupposition) and "what is said about it" (focus) must be distinguished for individual utterance tokens as their presupposition and focus, respectively; but the structure of the sentence, its CD-hierarchy, restricts the speaker's choice of focus to certain limits, given by the range of permissible focus. In this sense, Chomsky's notion can serve to answer a question that has remained unclear in structural linguistics, namely, whether the topic/comment articulation is an articulation of a sentence as a unit of the language system or of an utterance token in a given text.

[Figure 1: two dependency trees, for John is easy to please and John is eager to please; in the first, John bears the Objective role, in the second the Agentive role.]

NOTE: The nodes of the dependency trees are labelled by complex symbols having also elements interpreted as indications of their syntactic roles, denoted here as subscripts: O (Objective), A (Agentive), P (Predicative); of course, many unclear points are not considered here. The left-to-right ordering of the nodes corresponds to the scale of CD, see § 1.
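As an aside, the characterization of § 3 lends itself to mechanical enumeration. The following is a minimal sketch of ours in Python (the function name is invented), not part of the original description: it lists the connected final segments of a CD-ordered list, each of which contains the item carrying the highest degree of CD.

# Sketch: permissible focus choices as connected final segments of the
# scale of CD (ordered here from lowest to highest degree).
def permissible_foci(cd_scale):
    return [cd_scale[i:] for i in range(len(cd_scale) - 1, -1, -1)]

# Example (17): John_0 went_1 for-a-week_2 to-Sicily_3
for focus in permissible_foci(["John", "went", "for a week", "to Sicily"]):
    print(focus)
# ['to Sicily']                          cf. response (18)
# ['for a week', 'to Sicily']            cf. response (19)
# ['went', 'for a week', 'to Sicily']    cf. response (20)
# ['John', 'went', 'for a week', 'to Sicily']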
5. We attempted to show that there is a way out of the dispute between the adherents of interpretive semantics, who, as Chomsky admits, cannot maintain the hypothesis that transformations preserve meaning, and those of generative semantics, who work with such devices as global constraints,
which are not specific enough, and thus make the given type of description rather uninteresting from the theoretical point of view. Our approach (see Sgall, 1972) makes it possible to avoid these difficulties and to handle not only such sentences as quoted above, but also the much discussed examples Many men read few books, Everybody in this room knows at least two languages, Many arrows didn't hit the target, I made every log into a canoe, and their variants differing with respect to phenomena of topic/comment articulation, or, in other terms, of the ordering of quantifiers.

Charles University, Prague
Laboratory of Algebraic Linguistics
REFERENCES

Bolinger, D. L. (1952) 'Linear Modification', Publications of the Modern Language Association of America, LXVII, 1117-1144.
Chomsky, N. (1966) Cartesian Linguistics, New York and London.
Chomsky, N. (1968) 'Deep Structure, Surface Structure, and Semantic Interpretation', mimeo.
Daneš, F. (1969) 'Zur linguistischen Analyse der Textstruktur', Folia Linguistica, IV, 1-2, 72-78.
Firbas, J. (1970a) 'On the Concept of Communicative Dynamism in the Theory of Functional Sentence Perspective', mimeo for the Seminar on Complex Systems (Cambridge, Mass., June 1970); to be printed in Sborník prací filosofické fakulty brněnské university (Publications of the Philosophical Fac. of Brno Univ.), Brno.
Firbas, J. (1970b) 'Some Aspects of the Czechoslovak Approach to Problems of Functional Sentence Perspective', Functional Sentence Perspective, Papers prepared for the Symposium held at Marienbad on October 12th-14th, 1970 (mimeo, Institute of the Czech Language, Prague).
Halliday, M. A. K. (1967) 'Notes on Transitivity and Theme in English', Part 2, Journal of Linguistics, 3, 199-244.
Hatcher, A. G. (1956) 'Syntax and the Sentence', Word, 12, 234-250.
Hays, D. G. (1964) 'Dependency Theory: A Formalism and Some Observations', Language, 40, 511-525.
Lakoff, G., Peters, S. (1966) 'Phrasal Conjunction and Symmetric Predicates', MLAT-NSF, 17, part VI.
Partee, B. H. (1970) 'On the Requirement that Transformations Preserve Meaning', mimeo.
Robinson, J. J. (1970a) 'Case, Category, and Configuration', Journal of Linguistics, 6, 57-80.
Robinson, J. J. (1970b) 'Dependency Structures and Transformational Rules', Language, 46, No. 1, Part 1, 259-285.
Sgall, P. (1965) 'Generation, Production, and Translation', submitted at the Int. Conf. of AMTCL, New York; reprinted in Prague Bulletin of Mathematical Linguistics, 8, 1968, 3-13.
Sgall, P. (1967) 'Functional Sentence Perspective in a Generative Description', Prague Studies in Mathematical Linguistics, 2, Academia, Prague, 203-225.
Sgall, P. (1971) 'Status of Semantics in Generative Description', submitted at the International Congress for Logic, Philosophy, and Methodology of Sciences, Bucharest; in press in Teorie a metoda, 3, No. 3.
Sgall, P. (1972) 'Topic, Focus, and the Ordering of Elements of Semantic Representations', Philologica Pragensia, 15, No. 1.
Sgall, P. et al. (1969) A Functional Approach to Syntax in a Generative Description of Language, New York.
Sgall, P., Hajičová, E. (1970) 'A "Functional" Generative Description', Prague Bulletin of Mathematical Linguistics, 14, 3-38.
Svoboda, A. (1968) 'The Hierarchy of Communicative Units and Fields as Illustrated by English Attributive Constructions', Brno Studies in English, VII, Brno University, 49-101.
MORPHOLOGY
MORPHISMS AND THE PRESSURE OF THE LINGUISTIC SYSTEM

EMESE KIS
This paper sets out to study the particular relations between linguistic borrowings of two languages that are unrelated from the genetic-structural point of view: Romanian (a Romance language, of the inflectional type) and Hungarian (a Finno-Ugric language, of the agglutinating type). On the Romanian side these relations, or mappings, are characterized by the set of units of the etymon: we denote it by X (every unit x belongs to the domain of definition X, i.e. x ∈ X); let Y be the set of units of the linguistic borrowing, i.e. the set of values, y ∈ Y, each borrowing corresponding to an etymon in Hungarian; and let f be a single-valued law of correspondence which acts so that to each x ∈ X there corresponds a y ∈ Y; thus f(x) = y. The relation is therefore the triple (X, Y, f), which may also be written f: X → Y. This mapping is compatible with two relations defined by mathematical logic (which seems to contradict the opinion of E. Coseriu [3: 243] that language is neither logical nor illogical): the relation of dependence, which we denote by →, and the relation of constellation, denoted by *. These relations are used in the sense given to them by L. Hjelmslev [6] and by H. S. Sørensen [25]. By the compatibility of our mapping f with the relations →, * we mean:

f(x1 → x2) = f(x1) → f(x2)
f(x1 * x2) = f(x1) * f(x2)

On the left-hand side of each equality the relations are defined between the x ∈ X, in the set of units of the etymon; on the right-hand side they are defined in the set of units of the borrowing: f(x) = y, y ∈ Y. In the course of the borrowing process x may be represented by its constituent units. At the level of phonetics we denote these constituent units [21] by a, at the level of morphology by b, at the level of syntax by c.
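The compatibility conditions just stated can be checked mechanically. The following Python sketch is our own illustration (the unit names and relation pairs are invented toy data, not material from the paper):

# A mapping f is compatible with a relation if, whenever the relation
# holds between x1 and x2 among the etymon units, it also holds between
# f(x1) and f(x2) among the units of the borrowing.
def compatible(f, rel_X, rel_Y):
    return all((f[x1], f[x2]) in rel_Y for (x1, x2) in rel_X)

def anti_compatible(f, rel_X, rel_Y):
    # the anti-isomorphism of section 2: the image relation is reversed
    return all((f[x2], f[x1]) in rel_Y for (x1, x2) in rel_X)

# Toy data, schematically after Hung. hatar > Rum. hotar:
f = {"a": "o", "t": "t"}      # vowel maps to vowel, consonant to consonant
dep_X = {("a", "t")}          # vowel governs consonant in the etymon
dep_Y = {("o", "t")}          # vowel governs consonant in the borrowing

print(compatible(f, dep_X, dep_Y))        # True: dependence is preserved
print(anti_compatible(f, dep_X, dep_Y))   # False: it is not reversed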
1. Adaptation to the phonetic system is compatible with the relation of dependence.

At the level of phonemes, let a1 be a vowel and a2 a consonant. According to the observations of Kuryłowicz [14], a relation of dependence can be defined between the vowel and the consonant, the vowel being the governing element. The relation of dependence is preserved in the set of values as well:

f(a1 → a2) = f(a1) → f(a2)

For example, in Rum. hotar < Hung. határ the vowels are transformed into vowels and the consonants into consonants in the course of the borrowing process; in other words, the vowel-consonant dependence is an invariant of adaptation, or an invariant of transplantation.

2. The pressure of the morphological system is compatible with the relation of dependence.

At the level of morphology, in conformity with the law of vowel harmony of Hungarian, the thematic morphemes impose the nature of the desinential morphemes [8]. In Romanian, under the action of synharmonism, the situation is the inverse: the desinential morphemes imply certain thematic morphemes [23]. Provided that we consider the theme as a centre, and that b1 is the morpheme of the theme of the etymon and b2 the desinential morpheme, we obtain an anti-isomorphism [22] in the centripetal direction. By an anti-isomorphism we mean a one-to-one correspondence which keeps invariant the existence of the non-commutative relation, but commutes the terms of the relation:

f(b1 → b2) = f(b2) → f(b1)

We speak of an "anti-isomorphism in the centripetal direction" when, as in this formula, the dependence in the set of values is directed towards the image of the theme. For example, Rum. meşteşug < meştersug < meştersiugu < meştersigu < Hung. dial. mestersíg; likewise Rum. mirişug, beteşug, chelciug, vicleşug < Hung. dial. nyeresíg, betegsíg, kö(l)csíg, hitlensíg [24, 26].

At the level of syntax, the form of the etymon provided with a Hungarian personal possessive suffix appears in oral contexts as often as forms provided with a suffix such as that of the instrumental, the accusative, or the plural. All these functional variants are preserved in the borrowing language through the mediation of the syntactic context. Related to the context, these variants stand in a relation of constellation. The pressure of the syntactic system is compatible with the relation of constellation. Denoting possible coexistence by *, one of the etymons by c1,
and the syntactic function by c2, we have:

f(c1 * c2) = f(c1) * f(c2)

For example, the Romanian nominatives or accusatives explicable by the existence of an etymon form provided with a personal possessive suffix or with an instrumental suffix: Rum. harjă, hasnă, labă, seamă, talpă, Rum. dial. levesă, labosă, cocie, Old Rum. helge, fuglă, o(a)că < Hung. harca, haszna, lába, száma, talpa, levese, lábosa, kocsija, hölgye, foglya, oka etc. [10], or Hung. harcca(l), lábba(l), talppa(l), levesse(l), lábossa(l), hölggye(l), okka(l) [9].

3. The adaptation of borrowings to the phonetic system of Hungarian is compatible with the relation of dependence.

Let us denote by X' the set of Romanian etymons, by Y' the set of borrowings of Romanian origin in Hungarian, and by f' the law establishing a one-to-one correspondence between x' ∈ X' and y' ∈ Y', such that f'(x') = y'. At the level of phonemes we replace the corresponding units x' by a', in morphology by b', in syntax by c'. We have the isomorphism

f'(a'1 → a'2) = f'(a'1) → f'(a'2)

For example, in Rum. ficior > Hung. ficsúr [1] the vowels remain vowels and the consonants remain consonants, so that the dependence constitutes an operational invariant at the phonetic level with respect to adaptation to the system. In morphology, according to Romanian synharmonism, by which the desinential morpheme governs the thematic morpheme, and according to vowel harmony, by which the thematic morphemes in Hungarian imply the desinential morphemes, we always have an anti-isomorphism in the centrifugal direction. Considering the theme b'1 as a centre, we speak of an anti-isomorphism in the centrifugal direction when the dependence in the set of values is directed away from the image of the theme:

f'(b'2 → b'1) = f'(b'1) → f'(b'2)

E.g. Rum. bucălaie, cetină, cîrtiţă > Hung. dial. bukelája, csetenye, kertice.

4. The formula f'(c'1 * c'2) = f'(c'1) * f'(c'2) is likewise the expression of conditioning by the syntactic context.

We observe that the structural monotypy of the mappings f and f' is the expression of their isomorphic character. Situations of non-isomorphic adaptation [9, 11, 13] arise when the pressure of the phonetic system tends to prevail over the pressure of the morphological system. For example, in Romanian the appearance of the diphthong -ea- is conditioned by the existence of a palatal or central vowel in the following syllable; cf. words of Hungarian origin such as Rum. beteag 'infirm', şireag 'line, order, army' < Hung. beteg, Hung. dial. betäg, Hung. sereg, Hung. dial. seräg. Thus the vowel ä was commuted with the diphthong ea, interpreted as one of its variants. Instead of the correspondence Hung. vowel > Rum. vowel, we have Hung. vowel > Rum. consonant +
vowel. Likewise, in the opposite direction, the diphthong ea is underdifferentiated and treated as a variant of a; cf. e.g. Rum. buleandră > Hung. bulándra.

In conformity with the analogies observed in the penetration of foreign elements into Hungarian and into Romanian, we have drawn the following conclusions:

(1) For a parallel typological analysis of two languages that are unrelated from the genetic-structural point of view, it is important to bring out their structural identity.
(2) The study of borrowings permits the revelation of phenomena of interlinguistic isomorphism, among which a particular role falls to the forms of relational invariance.
(3) The invariance of certain relations of dependence and of constellation is in accordance with the pressure of the system of the language that assimilates the borrowing.
(4) While adaptation to the phonetic system and to the syntactic system presents phenomena of direct isomorphism on the intralinguistic as well as on the interlinguistic plane, in morphological integration anti-isomorphism appears as well.
(5) In the borrowing process, isomorphic schemata have an importance similar to that of the economy of schemata in human translation.

Babeş-Bolyai University, Cluj
REFERENCES

[1] Blédy, G. Influenţa limbii române asupra limbii maghiare, Sibiu, 1942.
[2] Byck, J., Graur, Al. 'L'influence du pluriel sur le singulier', BL, I, p. 21 ff.
[3] Coseriu, E. Teoría del lenguaje y lingüística general, Madrid, 1969.
[4] Drăganu, N. 'Etimologii, elemente ungureşti', DR, VI, 301-302.
[5] Frumkina, R. M., Zolotariov, V. M. 'Cu privire la modelul probabilistic de propoziţie', Probleme de lingvistică matematică. Traduceri din literatura sovietică de specialitate, 1960, 28-29.
[6] Hjelmslev, L. Prolegomena to a Theory of Language, Baltimore, 1953.
[7] Iordan, I. Limba română contemporană, Bucureşti, 1965, 276.
[8] Horváth, E. 'A román nyelv magyar jövevényszavai alaktani beilleszkedésének néhány kérdése', Magyar Nyelv, LXV, 1970, 63-59.
[9] Kis, E. 'Aspecte din încadrarea morfologică a substantivelor de origine maghiară în limba română', StUBB, 1962, 53-66.
[10] Kis, E. 'Cu privire la terminaţia -ă a substantivelor româneşti de origine maghiară', 1958, 145-153.
[11] Kis, E. 'Oglindirea evoluţiei consoanelor tematice maghiare -ly şi -ny în tema substantivelor româneşti de origine maghiară', CL, 1962, 145-153.
[12] Kis, E. 'O problemă de izomorfism în limba română', CL, 1965, 187-193.
[13] Kis, E. 'Sufixul -ău în cuvintele de origine maghiară în limba română', CL, 1960, 74-84.
[14] Kuryłowicz, J. 'La notion d'isomorphisme', Travaux du Cercle Linguistique de Copenhague, V, 1948, p. 48 ff.
[15] Kuryłowicz, J. 'Allophones et allomorphes', Omagiu lui Iorgu Iordan, 1958, 495-500.
[16] Marcus, S. 'Aspecte ale modelării matematice în lingvistică', Studii şi cercetări de lingvistică, XIV, 1963, No. 4, 487-502.
[17] Marcus, S. Lingvistica matematică. Modele matematice în lingvistică, Bucharest, 1963; 2nd ed., Bucharest, 1966.
[18] Makaev, E. A. 'K voprosu ob izomorfizme' [On the question of isomorphism], Voprosy jazykoznanija, X, 1961, No. 5, p. 51 ff.
[19] Makaev, E. A. 'Ponjatie davlenija sistemy i ierarchija jazykovych edinic' [The concept of the pressure of the system and the hierarchy of language units], Voprosy jazykoznanija, XI, 1962, No. 5, 47-52.
[20] Melnikov, G. P. 'Limbajul maşinii şi planul conţinutului', Probleme de lingvistică matematică, 1960, 35-40.
[21] Petrovici, E. 'Evoluţia foneticii, substituire de sunete sau adaptare morfologică? (În legătură cu tratamentul lui o final în elementele slave în limba română)', CL, VI, 25-30.
[22] Pic, G. Algebra superioară, Bucureşti, 1966.
[23] Puşcariu, S. 'Le morphème et l'économie de la langue', Études linguistiques roumaines, Cluj-Bucureşti, 1937, p. 260 ff.
[24] Sala, M. 'În legătură cu originea sufixului românesc -şug', Omagiu lui Iorgu Iordan, 1958, 763-764.
[25] Sørensen, H. S. 'On the Logic of Classes and Relations in Linguistics', Travaux du 1er Congrès de sémiotique, Warszawa, 25-28 August 1968.
[26] Tamás, L. Etymologisch-historisches Wörterbuch der ungarischen Elemente im Rumänischen - Unter Berücksichtigung der Mundartwörter, Budapest, 1966.
AN INTERACTIVE PROGRAM FOR LEARNING THE MORPHOLOGY OF NATURAL LANGUAGES*

SHELDON KLEIN and TERRY A. DENNISON
Introduction

The morphology learning program is a subcomponent of AUTOLING, a program that learns transformational grammars of artificial and natural languages through interaction with a human informant [3, 4]. The AUTOLING system as a whole was operational on a Burroughs B5500 computer located at the University of Wisconsin and was written in extended ALGOL for that machine. This computer is no longer available to the researchers. The AUTOLING work has been transferred to a Burroughs B6700 computer located at the University of California, San Diego, where the phrase structure learning component and portions of the morphology learning component are operational, rewritten in extended ALGOL for that machine. The transformation learning component at this date is not yet fully operational on the new machine.

Serious precursors of this work include that of Alicia Towster [4] and Paul Garvin [1]. Towster's work was connected with the AUTOLING research group, but the state of development of the program did not involve a component that was integrable with the overall AUTOLING system. Garvin and his assistants developed an elaborate but unimplemented program design involving specific tests for semantic and morphological structure types. The methodology anticipated the usage of a knowledge of linguistic universals.

Our particular approach requires the integration of the morphology learning process with the grammar discovery methods of the total AUTOLING learning system. Primary emphasis is placed on the assumption that full automation of discovery methods will require a complete functional analysis of the semantics of the meta-language used in the learning process (in this case English). Because such an analysis is not available, glosses for forms are rewritten in a semantic notation adequate for the solution of the particular problems presented. Ultimately, a program might be created that would reinterpret English in a proper semantic form automatically.

* Research sponsored by National Science Foundation Grant GS-2595.
Yet more would be required, for the ever-present problems of meaning unit mappings between languages (several units in one language mapping into one unit of the other and vice versa, or worse) suggest a future methodology involving the ultimate rewriting of glosses, perhaps as universal semantic features, and a program logic capable of determining the distinctive ones for a particular language.

In the following sections we present a description of what is currently implemented or readily implementable within the framework of available computational and theoretical resources. All discovery methods used are heuristic rather than algorithmic, which means that they are part of what has been called a linguist's "bag of tricks": methods that may work, but that do not guarantee resolution of problems.

The AUTOLING program creates grammars with unordered context-free phrase structure rules coupled with ordered transformations to handle context-sensitive phenomena. In the original version (without a morphology learning component) the informant was assumed to be bilingual and was required to input sentences with spaces between morphemes. The new version does not require such preanalysis, and will yield grammars in which a transformational model is used down to the level of morpheme strings. From that point on a structuralist description will be derived as well as a transformational one. To some relationalist philosophical grammarians this may seem a strange mixture. As a logical positivist interested in creating a system that works, the first author reserves the right to be eclectic; we note that each model is particularly suited to automation of certain discovery procedures.

The discovery heuristics of the phrase structure learning program are described in [3]. We note that they involve extensive use of distributional criteria (especially frame tests). The transformation learning component makes use of informant corrections to faulty productions of the program in testing mode. Generality of specific transformations is obtained by heuristics analogous to the ones that are used in the phrase structure learning component.
The Analytic Philosophy
At every stage of the analysis, the program attempts to formulate a grammar to account for the observed data base. Accordingly, the learning heuristics must assume that the grammar is never complete, and constantly subject to revision. This problem has been solved for the phrase structure and transformation learning components through mechanisms for creating, destroying
and substituting classes of morphemes and higher level units in already existing rules. The same capabilities applied to the morphology learning component create a separate set of data maintenance problems. We note that every input from an informant is stored and used for reference at various stages of analysis. Accordingly, in an intermediate state of analysis, a new morphological breakdown of a unit previously treated as monomorphemic requires updating not only in the hierarchies of possible rule chains that may reference it, but in all stored inputs that may contain it.
A Sample Problem
The quickest explication of the methodology can be provided by an example analysis. We note that the following is a hand simulation, as the currently working portion of the system merely isolates morphs and does not provide complex updating of existing rules. In the example many features of the system, including checks of informant glosses and their consistency, are not indicated; these will be discussed in a later section.

We note that the system operates in two modes, syntactic and morphological. In the morphological mode, almost none of the phrase structure learning heuristics and none of the transformation learning ones are used. At the point of return to the syntax mode, all inputs that were entered during the morphological mode are reentered as inputs to the system in their analyzed, morphologically partitioned form. A major portion of heuristic strategy can govern the time the system stays in each mode, and the circumstances that demand a switch. For the following example we will adopt the rule that any time an input contains morphological material not recognized by the system, it will automatically switch to morphological mode, and return to the syntactic mode only when a tentative analysis of that material has been made. We note that this is not likely to be the optimum analytic strategy.

The basic analytic method consists of double matches of forms and glosses. The matching of glosses involves simple set intersection rather than the ordering of elements that is observed when comparing forms in the language under analysis. (An exception to this is possible if brackets are placed around bundles of elements in the gloss; in this case partial orderings are possible.) None of the current heuristics handle problems with discontinuous morphemes. A major heuristic involves the choice of what to compare with what; forcing the system to wait for maximally similar forms before undertaking matchings would simplify the computations. The problem as formulated by Nida [2] is as follows:

Problem (data from the Elisabethville dialect of Congo Swahili, a language of the Belgian Congo)

Instructions:
a. List all morphemes.
b. Give the meaning of each.

1. ninasema 'I speak'
2. wunasema 'you (sg.) speak'
3. anasema 'he speaks'
4. ninaona 'I see'
5. ninamupika 'I hit him'
6. tunasema 'we speak'
7. munasema 'you (pl.) speak'
8. wanasema 'they speak'
9. ninapika 'I hit'
10. ninanupika 'I hit you (pl.)'
11. ninakupika 'I hit you (sg.)'
12. ninawapika 'I hit them'
13. ananipika 'he hits me'
14. ananupika 'he hits you (pl.)'
15. nilipika 'I have hit'
16. nilimupika 'I have hit him'
17. nitakanupika 'I will hit you (pl.)'
18. nitakapikiwa 'I will be hit'
19. wutakapikiwa 'you (sg.) will be hit'
20. ninapikiwa 'I am hit'
21. nilipikiwa 'I have been hit'
22. nilipikaka 'I hit (remote time)'
23. wunapikizwa 'you (sg.) cause being hit'
24. wunanipikizwa 'you (sg.) cause me to be hit'
25. wutakanipikizwa 'you (sg.) will cause me to be hit'
26. sitanupika 'I do not hit you (pl.)'
27. hatanupika 'he does not hit you (pl.)'
28. hatutanupika 'we do not hit you (pl.)'
29. hawatatupika 'they do not hit us'
Supplementary information:
1. The future -taka- and the negative -ta- are not related.
2. The final -a may be treated as a morpheme. Its meaning is not indicated in this series.
3. The passive morpheme may be described as having two forms, -iw- and -w-. Its form depends on what precedes it (see Principles 2 and 3).
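Before the step-by-step trace, the double-match operation that drives it can be sketched in Python. This is an illustrative reconstruction of ours, not the original extended-ALGOL code, and the function names are invented:

# Two forms are compared only if their glosses intersect; a left
# alignment looks for a shared tail, a right alignment for a shared
# head, and the longer of the two matches is segmented off.
def common_suffix(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[len(a) - 1 - n] == b[len(b) - 1 - n]:
        n += 1
    return a[len(a) - n:]

def common_prefix(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

def shared_segment(form1, gloss1, form2, gloss2):
    if not (set(gloss1) & set(gloss2)):
        return None                    # no semantic intersection: skip
    suf = common_suffix(form1, form2)
    pre = common_prefix(form1, form2)
    return max(suf, pre, key=len) or None

print(shared_segment("wunasema", {"2sg", "speak", "present"},
                     "ninasema", {"1sg", "speak", "present"}))  # nasema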
The first input is:

(1) ninasema (1st person sg.) (speak present)

The system automatically jumps to morphological mode and coins the phrase structure rule:

-S1: = ninasema

A note on the form of the rules: the left hand side of each rule is given the label 'S' plus a number. A prefix of '*' indicates that the construction occurred as a free form input during the syntactic mode; a prefix of '-' indicates a morpheme or morpheme class. The absence of a prefix indicates that the construction is an analytically derived higher level intermediate node.

The next input is:

(2) wunasema (2nd person sg.) (speak present)

The system now tries a left alignment searching for a right match of the forms in the language under analysis (only if there is a common intersection in their glosses). If both alignments yield a match, the longest is taken. If no matches are found, the system continues to the next input. In this case both alignments are identical:

wunasema
ninasema

The following are entered in the dictionary:

wu (2nd person sg.)
ni (1st person sg.)
nasema (speak present)

and the phrase structure rules are rewritten:

S1: = S3 S5
S2: = S4 S5
-S3: = wu
-S4: = ni
-S5: = nasema

At this point the system returns to the syntactic mode, and the forms are automatically reentered as 'ni nasema' and 'wu nasema'. Among other things that happen is the application of some combinatory phrase structure rules that operate on rules S1 and S2, which have now been given asterisked, free form status. The resultant grammar is:

*S1: = S6 S5
-S3: = wu
-S4: = ni
-S5: = nasema
S6: = S3
S6: = S4

The next input is:

(3) anasema (3rd person sg.) (speak present)

Again the system switches to morphological mode. At this point, the dictionary is used to parse the input, i.e. to identify previously determined morphemes on the basis of longest embedded matches and of set inclusion of the dictionary gloss in the total input gloss. 'nasema' is found, and 'a (3rd person sg.)' is added to the dictionary. The following rules are subsequently added to the phrase structure grammar:

S7: = S8 S5
-S8: = a

An attempt is made to find this new 'a' in other dictionary entries, but it fails because of a lack of semantic intersection. After a return to syntax mode, rule S7 is deleted and S6: = S8 is added.

The next input is:

(4) ninaona (1st person sg.) (see present)

The dictionary check yields 'ni (1st person sg.)', and 'naona (see present)' is added to it. At this point, the newly entered item is matched against other dictionary entries, yielding the common element:

na      present

and the newly segmented disjunctive items:

sema    speak
ona     see

The final result after reentry into syntax mode includes a rewriting of rule S5 and a combination of 'ona' and 'sema' into one class:

*S1: = S6 S5
-S3: = wu
-S4: = ni
S5: = S9 S10
-S6: = S3
-S6: = S4
-S6: = S8
-S8: = a
-S9: = na
-S10: = sema
-S10: = ona
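The dictionary parse used in the steps above can also be given as a short sketch (again ours, not the actual program; the gloss feature names are invented):

# Known morphs are matched greedily, longest first; a dictionary entry
# is accepted only if its gloss is included in the gloss of the whole
# input. Whatever is left over is new morphological material.
def parse(form, gloss, dictionary):
    found, rest = [], form
    for morph, morph_gloss in sorted(dictionary.items(),
                                     key=lambda kv: -len(kv[0])):
        if morph in rest and morph_gloss <= gloss:
            found.append(morph)
            rest = rest.replace(morph, "|", 1)   # mark the matched span
    residue = [r for r in rest.split("|") if r]  # unanalyzed material
    return found, residue

dictionary = {"ni": {"1sg"}, "wu": {"2sg"}, "nasema": {"speak", "present"}}
print(parse("anasema", {"3sg", "speak", "present"}, dictionary))
# (['nasema'], ['a'])  --  'a (3rd person sg.)' is then entered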
The elements of the next input:

(5) ninamupika (1st person sg.) (hit present) (3rd person sg.)

are all accounted for by the dictionary except for 'mu', which is entered with the gloss '(3rd person sg.)'. At this point we might have written the gloss for 'mu' in the original input as '(3rd person sg. object)'. Having done the problem in advance, we know that 'mu' and 'a' should eventually be combined as allomorphs, making the specification of 'object' in the semantics superfluous. This suggests that the system must have heuristics capable of determining non-distinctive semantic features if there is over-specification.
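A sketch of how such over-specification might be detected (our illustration; the feature names are invented): if the glosses of two candidate allomorphs differ in exactly one feature, that feature is a candidate non-distinctive one.

def non_distinctive(gloss_a, gloss_b):
    diff = gloss_a ^ gloss_b        # symmetric difference of feature sets
    return diff if len(diff) == 1 else None

# 'a' glossed (3rd person sg.) vs. 'mu' glossed (3rd person sg. object):
print(non_distinctive({"3sg"}, {"3sg", "object"}))   # {'object'}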
The anticipated tactic at this point is to avoid attempts at immediate resolution, and to treat the two morphemes as independent entities. The resultant grammar offers further processing in both morphological and syntactic modes and yields the addition of two rules:

*S11: = S6 S9 S12 S10
-S12: = mu

However, the syntax mode heuristics continue to analyze the rules and seek to combine the partially similar S1 and S11, resulting in the deletion of S11 and the rewriting of S1 as

*S1: = S6 S13

and the addition of two new rules:

S13: = S5
S13: = S9 S12 S10

The input form:

(6) tunasema (1st person pl.) (speak present)

adds segment 'tu' to the dictionary, and the rule:

-S14: = tu

Input 7:

(7) munasema (2nd person pl.) (speak present)

at first yields 'mu (2nd person pl.)' as a dictionary entry. Comparison with the already existing entry 'wu (2nd person sg.)' yields a reanalysis, resulting in the new dictionary entries:

u    (2nd person)
w    sg
m    pl
It should be possible for the reader to anticipate the future course of the analysis. The program will split some pronoun morphemes into two components and leave others as single units; 'li past', 'taka future' and 'pika hit' will be cut, and the system will decide that 'pik hit' is also a morph when determining 'iwa passive' and 'wa passive'; 'ka remote time' will be cut, and 'si', 'ha', 'hatu' and 'hawa' will all attain pronoun morph status at the time 'ta negative' is cut. The reader, of course, might wish to handle the problem somewhat differently, but the program cannot take Nida's hint about the final 'a' because no meaning is indicated, and the negative is unsatisfactorily solved partly for this reason and partly because of the program's inability to segment into discontinuous morphemes.
Hierarchies of Heuristics

The hand simulation we have been doing does not indicate all the testing done by the system. Especially, it does not indicate the yet-to-be-programmed heuristics that will monitor and govern the basic segmentation program described above. As in the fully functioning syntax learning component, a basic design principle is the use of higher level analytic components to analyze and perhaps reject tentative results of lower level, brute force heuristics. Some of the blocking criteria can be very specific and even stylistic. In the preceding problem one might wish to block the segmentation of the pronoun system into person and number for special reasons. The pronoun systems of many languages lend themselves to very complicated segmentation of little or no generality; many linguists prefer to avoid such cutting and prefer to treat as single units entities that might otherwise be segmented.

Another key heuristic involves the avoidance of comparison of forms for segmentation purposes except under conditions likely to produce optimum results: such heuristics might require maximal similarities in shape and glosses within a minimum size sample.

Perhaps the most powerful heuristics used have not yet been indicated; they involve the testing of the grammar at each stage of rule modification through the generation of test productions whose generative history includes the newly created or modified rules. Such testing permits phrase structure rule modification, or may lead to the learning of a transformation if the program should require the informant to supply a correction. Testing in the system under construction includes a translation of the test production that is also offered to the informant for acceptance or rejection. Accordingly, the informant has five possible responses: acceptance of test production and translation; rejection of both with refusal of correction; rejection of both with correction of both; correction just of the gloss; and correction of just the form, with acceptance of the gloss. The kind of corrections provided by the informant provides the data for determining allomorphic status and the coining of morphophonemic rules.

Of course it is possible to develop the program in such a way that the morphology is handled implicitly in the form of transformations. However, accidents in the input sequence predictably can lead to situations where only the analytic techniques of structuralist taxonomic morphological analysis will be able to recover the pertinent data. Once such techniques have been applied, it is possible to reformulate the information in a transformational model. The heuristics for discovering morphophonemic relations can involve the use of phonological distinctive features, set intersection and resultant generalization, as sketched below.
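A sketch of generalization by set intersection of distinctive features (ours; the feature bundles below are simplified illustrations): the features shared by the segments of an attested rule define a natural class over which the rule may tentatively be extended.

features = {
    "p": {"stop", "labial", "voiceless"},
    "t": {"stop", "alveolar", "voiceless"},
    "k": {"stop", "velar", "voiceless"},
}

def natural_class(segments):
    # intersect the feature bundles, then collect every segment that
    # carries all of the shared features
    common = set.intersection(*(features[s] for s in segments))
    members = {s for s, f in features.items() if common <= f}
    return members, common

# A rule attested for p and t generalizes to the class {p, t, k}:
print(natural_class(["p", "t"]))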
The use of an articulatory phonetic chart, in the form of an array in several dimensions, also lends itself to use in a powerful heuristic for the extension of the generality of morphophonemic rules. Given an established morphophonemic rule involving a single phonemic unit, similar rules that involve members of the same row or column in the articulatory array may be tested. A similar heuristic could be obtained from set intersection of distinctive features. Also, the heuristic in array form might yield hypotheses of greater phonological plausibility at an earlier stage.

For taxonomic unification of morphs into morphemes, the system provides outstanding information about distribution. Two morphs with identical glosses but variant shapes initially will have different phrase structure rule numbers assigned to them. The existing program maintains an inverse index of all rules containing a given class descriptor. Accordingly, retrieval of all constructions involving a particular morph is simple (see the sketch at the end of this section). Environments of candidates for merger into a single morpheme are readily compared.

We have a reluctance to abandon any heuristics that may be peculiar to a specific grammatical model just for the sake of theoretical purity. Actually, it seems worthwhile to permit the system to formulate morphological treatment in both transformational and taxonomic models, and to provide mechanisms for convertibility just to preserve the heuristic techniques available to each formulation.
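As an illustration of the inverse index mentioned above, a sketch of ours, using the rule format of the sample problem:

# Map every class descriptor to the rules whose right-hand sides
# mention it, so that all constructions involving a given morph class
# can be retrieved in one lookup.
rules = [("*S1", ["S6", "S5"]), ("S5", ["S9", "S10"]),
         ("-S6", ["S3"]), ("-S6", ["S4"])]

index = {}
for lhs, rhs in rules:
    for symbol in rhs:
        index.setdefault(symbol, []).append(lhs)

print(index["S10"])   # ['S5']: every rule that uses the sema/ona class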
A Note on Semantics and Meta-languages
The construction of a high quality morphological analysis program will undoubtedly prove to be more difficult than the task of automating all other aspects of grammar discovery. Even given complete control over the semantics of the language in which the glosses are formulated, and even given a well-developed theory of universal semantic features, the problem of learning the mappings of meaning units from one language to another in the general case is quite difficult. (Let us define the general case as consisting of a language situation wherein an utterance of m morphemes containing n semantic units is translated by a gloss containing p morphemes representing q semantic units, where m, n, p, q may take on any independent integer values.) Of the work actually done in this area, the program designs of Garvin are the most developed, although unimplemented and untested [1]. A fruitful approach, in conjunction with other methods, would well include attempts to perform simultaneous analyses of both a language and the semantic structure of related glosses.

The Whorfian hypothesis, the notion that the structure of a language determines the speaker's perception of the universe, is at least partially antithetical to a notion of semantic universals. The implication of the Whorfian hypothesis
for discovery methodology is an inversion of the original formulation: a knowledge of the extralinguistic universe of a language speaker is a prerequisite to a knowledge of the semantic structure of his language. In the absence of proof or disproof of the total accuracy of either view, an empirical researcher must be prepared to analyze linguistic situations involving a little of both. Indeed, even if the universalist position is the correct one, the techniques of an approach assuming non-universality cannot yield false results, but rather corroborative data.

There is one area in linguistic analysis where the first author will vigorously defend the validity of the Whorfian hypothesis: the meta-languages associated with grammatical descriptions. Few linguists (if any) acknowledge that at least two meta-languages are associated with every linguistic description. Many might acknowledge the language of gloss representation as one, but few would concede that the theoretical model used to formulate the description is really another. In each of the two, the structure determines the linguist's perception of the realities of the language it is used to describe. One may bring yet a third meta-language to the scene with a tale of the first author's experience that the choice of programming language in computational linguistic work can alter radically the structure of the solutions to particular problems. Often the choice of a particular programming language can make theoretical problems that appear difficult in their original formulation seem trivial in their programmed treatment, and vice versa. At this point the reader can guess the implied methodology: with regard to theories, linguists should be exploitive masters rather than servants.

University of Wisconsin
Computer Sciences Department
REFERENCES

[1] Garvin, Paul L. Computer-Based Research on Linguistic Universals, Bunker-Ramo Corporation Quarterly Progress Reports, under contract NSF 576, Series G096-8U3, Canoga Park, California, 1967-1970.
[2] Nida, E. A. Morphology, the Descriptive Analysis of Words, 2nd edition, University of Michigan Press, Ann Arbor, 1949.
[3] Klein, S. & Kuppin, M. A. 'An Interactive Heuristic Program for Learning Transformational Grammars', University of Wisconsin Computer Sciences Department Technical Report #97, August 1970. Also in press, Journal of Computer Studies in the Humanities & Verbal Behaviour.
[4] Klein, S., Fabens, W., Herriot, R. G., Katke, W. J., Kuppin, M. A., & Towster, A. E. 'The AUTOLING System', University of Wisconsin Computer Sciences Department Technical Report #43, September 1968.
RELATIONAL AND NUMERICAL CHARACTERISTICS OF LANGUAGE PARADIGMS

JOHN LÖTZ
In this brief paper a few thoughts will be presented on the relational and numerical characteristics of language paradigms. The relational structure among the elements in a given language paradigm, the vectorial ordering (orientation) of the relational structure, and the numerical characteristics of the structure will be discussed. In order to tie the presentation closely to natural languages and not to artificially constructed, and therefore linguistically misleading, cases, concrete examples from natural languages will be used: the Swedish noun paradigm, the set of short vowels in Turkish, and the nominal bases in Hungarian.

1. By language paradigm we mean a set of intensively well-defined language data. A paradigm can include: a) sounds, e.g. the stops in English; b) inflectional items, e.g. the declensional and conjugational paradigms in the Classical grammatical tradition; c) syntactic constructions, e.g. the sentence types referring to interrogation and negation; and d) lexical items, e.g. verbs referring to locomotion or, for that matter, simply a conventional dictionary as an alphabetically arranged, well-ordered set of words.

2. A set of cohesive language paradigms can be classified in the following way. We call a normal paradigm a set which has the most frequent general membership. What is "normal" is determined by its productivity and frequency and is set up on the basis of common sense in the framework of all paradigms. Setting up normal paradigms implies that the membership of paradigms can also include those with a larger or smaller number of elements in their paradigmatic set than found in the normal paradigm.

Larger-number paradigms occur in the following two cases. (The examples are chosen from inflection.)

(a) Abundance. One element in the normal paradigm corresponds to at least two elements in an abundative paradigm, with semantic distinction. For example, in English brothers and brethren correspond to a "normal" single plural.

(b) Variation. One element in the normal paradigm corresponds to at least two elements in a variative paradigm, but without semantic distinction. For
example, Swedish flickar, flickur and flickor are regional-social variants which can all be glossed by English 'girls'.

Smaller numbers of elements in the paradigmatic set occur in the following two cases. (Examples are again taken from morphology.)

(c) Syncretism. A single form corresponds to at least two elements in the normal paradigm without loss of semantic coverage. For example, the Latin plural form regibus corresponds to rege and regi in the singular.

(d) Defectiveness. There is no form that corresponds to an element in the normal paradigm, i.e., the paradigmatic set has lacunae. For example, Latin vis 'strength' has no genitive and dative, nor does it have a plural.

In addition we might distinguish paradigms which use one stem form from those with more than one stem form, the case of suppletion. For example, Latin amo, amavi, amatum correspond to fero, tuli, latum.

To sum up: a paradigm may have n + y forms (abundance, variation), n forms (the normal paradigm), or n - x forms (syncretism, defectiveness).
3. The maximal number of distinctions of a set

Π = {a1, a2, . . ., an}

is expressed by the formula

D = n(n - 1) / 2

where D is the number of distinctions. This view is implicit in de Saussure's famous dictum, "Dans la langue il n'y a que des différences" ('in language there are only differences'), and also in the Bloomfieldian or Hjelmslevian approach to phonology, where the phonemic units were established by using the method of structural differentiation and any further systematization according to phonetic features was rejected as a transgression into alien substance matters.
Topologically:

[Diagrams omitted: the complete graphs on 1, 2, 3, . . . elements, every pair of elements joined by a distinction.]

4. It seems clear that language operates more economically than this model suggests. More economy can be achieved in two ways: a) certain binary distinctions are grouped together, or b) relations of more than two elements are introduced (ternary . . . n-ary relations).

4.1. The simplest example of the binary type is achieved by setting up a branching tree diagram with one element set aside at each node. This was the favorite model of de Groot, the Dutch linguist. He depicted the Latin cases in the following way, where each node was characterized semantically:

[Branching tree diagram: all cases divide into vocative and all other cases; the remaining cases are then split off one at a time (accusative, . . ., genitive, dative, ablative).]
The mathematical formula for this type is:

D = n - 1

4.2. The strongest binary model is that of maximally recurring binary distinctions. Roman Jakobson is the proponent of this approach, which uses minimal, i.e. binary, relations and maximizes the utilization of each distinction, thus increasing efficiency.
Such maximal utilization of binary distinctions is illustrated by the Swedish nominal paradigm or the Turkish short vowel system:

Swedish:
1. flicka      'girl'
2. flickas     'of girl'
3. flickan     'the girl'
4. flickans    'of the girl'
5. flickor     'girls'
6. flickors    'of girls'
7. flickorna   'the girls'
8. flickornas  'of the girls'

Turkish:
1. a   2. e   3. o   4. ö   5. i   6. ü   7. ı   8. u
Mathematically this can be expressed by the formula D = 2n. I t can also be depicted graphically by a topological cube graph where the three distinctions are: Swedish
Turkish
a. Case b. Definiteness c. Number
a. Vertical tongue action b. Horizontal tongue action c. Labial action
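The cube model amounts to addressing each form by a vector of binary features. A minimal sketch (our illustration, not part of the original paper; the feature names and the particular coding are assumptions keyed to the three distinctions just listed):

    # Each of the 8 Swedish forms sits at one corner of the feature cube,
    # so D = log2(8) = 3 binary relations suffice instead of 8*7/2 = 28.
    FORMS = {
        (0, 0, 0): "flicka",    (1, 0, 0): "flickas",
        (0, 1, 0): "flickan",   (1, 1, 0): "flickans",
        (0, 0, 1): "flickor",   (1, 0, 1): "flickors",
        (0, 1, 1): "flickorna", (1, 1, 1): "flickornas",
    }

    def form(genitive: int, definite: int, plural: int) -> str:
        """Look up a form by its corner of the cube (case, definiteness, number)."""
        return FORMS[(genitive, definite, plural)]

    assert form(1, 1, 1) == "flickornas"   # 'of the girls'

Each coordinate realizes one of the distinctions a-c above; this is what "maximal utilization" of a binary distinction means: every relation partitions the whole paradigm in half.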
If the number of elements in the paradigmatic set is n, the information content is not correctly expressed by the customary formula I = log₂ n; rather, since the number of distinctive relations must be an integer, a Diophantic solution has to be sought:

    I = -[-log₂ n],

where [ ] refers to the integral part. The usual formula quoted in information theory, I = log₂ n, would not do because if n is not
an exact power of 2, the number of distinctive relations would be 1 less than required. For example, 9 elements require at least 4, and not the 3 distinctive relations which would result from the plain formula. Of course, a new notation could take care of this difficulty.

4.3. Relations with more than two terms would also reduce the number of maximal distinctions. For instance, ternary relations can be postulated in the case of the Hungarian back vowels a, o, u, or in the case of the English stops p, t, k. The extreme case of distinctions is a single relation with n terms. This can be exemplified by the numerals or by the dictionary example mentioned above. To sum up, the numerical properties of a paradigm with n members move diophantically, i.e. using only integers, in the following triangle:

    D₁ = n(n - 1)/2
    D₂ = -[-log₂ n]
    Dₙ = 1
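A small computational sketch of the three bounds (ours; the function names are arbitrary):

    import math

    def d1(n: int) -> int:
        """Maximal number of pairwise distinctions: n(n-1)/2."""
        return n * (n - 1) // 2

    def d2(n: int) -> int:
        """Maximally recurring binary distinctions: the integer -[-log2 n]."""
        return math.ceil(math.log2(n))

    def dn(n: int) -> int:
        """A single n-ary relation."""
        return 1

    for n in (8, 9, 14):
        print(n, d1(n), d2(n), dn(n))   # 8: 28, 3, 1;  9: 36, 4, 1;  14: 91, 4, 1

For n = 9 this reproduces the 4 distinctive relations required above, and for n = 14 the 4 semantic dimensions of the Hungarian nominal bases mentioned below.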
The corresponding graphic representation is a topological graph. E.g.:

    English:      he o———o she

    Hungarian:    ő 'he, she'    o———o    ők 'they'
                                 |   |
                  őt 'him, her'  o———o    őket 'them'
or the cube examples above. A more complex example would be the 14 nominal base forms in Hungarian, which require 4 distinctive semantic relations (dimensions).

5. Up to this point the relational structures have been viewed in terms of differences only. There is also, however, a certain orientation in these graph structures, since they cannot be rotated freely in space. This orientation can be of two kinds: A) metric and B) topological.

A) Metric. If we take the Turkish high vowels, i and u represent the extremes in the second formant position and ü and ı represent intermediate positions, expressed in cycles per second. In such a case, since the structure is determined by measurements, but not conversely, one can question the
validity of postulating only abstract, absolute differences in phonology. If this criticism is correct, most of modern structural phonology represents an oversimplification and will have to be revised.

B) Topological. There is no doubt that the semantic structures represent absolute differences. Therefore, in these cases the topological representation is an adequate one. Here the concept basic general vs. specific gives a vectorial orientation to the graph, corresponding to the unmarked vs. marked opposition. E.g., in the Swedish noun case:

    Nominative → Genitive
    Indeterminate → Determinate
    Singular → Plural

I would like to caution, however, against the zealous enthusiasm of believing that the concept of markedness is universally valid and applicable in all binary distinctions. In the above I have tried to give a few examples of number and set theory for linguistic paradigms. Because of lack of time I will mention only a few further applications:

(1) The Use of Pascal's Triangle for Distinctions. The Swedish three-dimensional semantic cube graph corresponds morphologically to the following inflectional morpheme scheme:

    0 morphemes (1 form):  flicka
    1 morpheme  (3 forms): flicka-s, flicka-n, flick-or
    2 morphemes (3 forms): flicka-n-s, flick-or-s, flick-or-na
    3 morphemes (1 form):  flick-or-na-s
This is the coefficient distribution given by the expansion of the binomial:

    (x + y)³ = x³ + 3x²y + 3xy² + y³

x and y can be interpreted as unmarked and marked, where y = x̄, i.e. the complementary class to x.
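The correspondence can be checked mechanically; a one-line sketch (ours):

    from math import comb
    # binomial coefficients of (x + y)**3 = numbers of forms with k suffixes
    assert [comb(3, k) for k in range(4)] == [1, 3, 3, 1]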
(2) Degeneration of the Numbers by Syncretism. The common Swedish noun type hus 'house' has only 6 forms, because singular and plural syncretically collapse in the paradigm itself. Hence, the above structure has to be redrawn and its numerical characteristics recalculated.

(3) Statistics. The best known case for the use of numbers is, of course, statistics, as utilized in information theory.

Center for Applied Linguistics, Washington, D. C. and Research Institute of Linguistics of the Hungarian Academy of Sciences, Budapest
ON QUANTITATIVE RESEARCH IN THE FIELD OF MORPHOLOGY

MARIE TĚŠITELOVÁ
When applying statistical methods of investigation within the framework of grammar, special conditions have to be met in the field of morphology. Above all, it is required that phenomena (categories) of the system are clearly distinguished from those of context ('parole'). The system of morphology, especially that of inflected languages, represents for the most part an intricately combined and regulated whole, codified with lists of admissible forms and with respect to separate categories (e.g., case, number, and gender for nouns; person, number, tense, mood, and gender for verbs). Thus, not only a limited number of morphological categories, but also a limited number of forms (regarding the norm, or, as the case may be, codification) appears in the system of morphology. It is known that the set of morphological forms falls into various subsets of forms, the so-called paradigms. This grouping, however, often displays a certain "arbitrariness" in putting characteristic features together, differing according to special purposes, e.g. school practice. Although this problem is beyond the scope of this paper, it may be worth mentioning that up to now paradigmatic classification has been and continues to be based upon qualitative characteristics; in my opinion quantitative characteristics should be employed as well, when sufficient data and necessary evidence have been accumulated in different languages (e.g. the number of the so-called paradigms with respect to the number of nouns following them, such as the Czech paradigm kuře). The number of categories of nominal parts of speech, even in inflected languages, is relatively limited. For example, in Czech, where nouns, adjectives, pronouns, and numerals are classed together (the latter two only in case they have special forms, such as pronominal declension with pronouns, e.g. the personal pronouns já, ty, etc., or with numerals, e.g. jeden, etc.), we find three categories: gender, number, case. In each of these categories further subcategories are distinguished: with gender in Czech, they are masculine, feminine, and neuter; with masculine a further division into masculine animate and masculine inanimate takes place. With adjectives, which, in their semantics, are primarily attached to nouns, the same categories as with nouns are found. With pronouns, categories devoid of gender appear with
the personal pronouns such as já, ty, se, etc. On the other hand, with some numerals, i.e. with cardinal numerals from pět to devětadevadesát, the category of gender is negligible. The category of nominal gender is also secondarily reflected in transgressives and participles, cf. píše, píšíc, píšíce; žák psal, žákyně psala, dítě psalo. The category of number manifests itself, on the whole, in an analogous way within the framework of Czech nominal parts of speech (singular, plural, and, only exceptionally, dual, exclusively with nominal parts of speech). Of course, the category of number plays a very important part with the verb as well, appearing not only with some indefinite forms, transgressives and participles, but also with definite forms (cf. píši, píšeme). In contradistinction to the categories of gender and number, the category of case is specific to nominal parts of speech. In most languages it is modified according to number. The number of cases varies from one language to another, being relatively high (7) in Czech. Full use is made of the category of case in present-day Czech especially with nouns and adjectives, in singular and plural; we find, however, a limited use of case with pronouns and numerals, especially as regards the plural; cf. e.g. the special situation with personal pronouns, the declension of numerals from pět upwards, etc. The enumeration of the basic categories of the nominal parts of speech (and, partially, of verbs) has been presented in order to show their system and indicate its representation within the different parts of speech. It is evident, of course, that within the confines of the system of morphology (in the present case that of nominal parts of speech) all the categories have their place, and their position is determined by a qualitative aspect. All these categories are to be taken into consideration if we intend to examine their quantitative aspect, which, naturally, presents itself in context, discourse, in communication. It is from this aspect that we can quantify different morphological categories, show their hierarchy and the proportional relations among them, as well as their employment in style. From the technical point of view this means that when designing a code for machine processing of data for purposes of morphological quantitative analysis, we have to take into account all the morphological categories; this is why the code for the morphological analysis of inflected languages is considerably complex, especially as far as the verb is concerned. With definite forms very complicated combinations of tense, mood and voice are to be respected, e.g. psal jsem, psal, byl psán, psal bych, byl by psán, etc. With indefinite forms, the same holds true for nominal gender, cf. psal, psala, psalo, etc. The proper examination of morphology from the statistical point of view concentrates upon context, where categories play different roles according to their special character, and where they acquire their meaning, or where their meaning is given more precision or modified. This I have already demonstrated
elsewhere, on the problem of morphological homonymy in Czech [1]. Morphological homonymy is a complex phenomenon conditioned by an accordance of the categories of nominal parts of speech and those of the verb, which find their expression in nominal and verbal forms usually in such a harmonious way that in most cases the homonymy itself escapes recognition or even passes unnoticed merely owing to its self-evidence. A prerequisite for the quantification of morphological categories is created by their various combinations that are formally identifiable in inflected languages, e.g. by virtue of the combinations of endings, which differ according to gender, number, and often also according to case (cf. mladý, mladá, mladé, mladí, mladé, mladou, etc.); in context it is a combination with other forms of other parts of speech, cf. takovou mladou ženu, takového mladého muže, našla si takového mladého muže, zřekla se takového mladého muže, etc. Another prerequisite for the quantification of morphological categories, especially of forms (in which various combinations of individual morphological categories find their expression), is given by the fact that there is a disproportion between the number of forms in the system (often the number of potential forms) and the number of forms found in context. For example, in Czech, if we supposed one special form for each case of nouns following the 12 regular paradigms (pán, hrad, muž, stroj; žena, růže, píseň, kost; město, moře, kuře, stavení), 168 forms would be required; in fact, in the system of noun forms in Czech (all possible alternative forms included), 192 forms are found altogether; in context, however, only 142 forms appear, viz.:

    masculines:  75 in the system,  59 in context = 79%
    feminines:   59 in the system,  48 in context = 81%
    neuters:     58 in the system,  35 in context = 60%

The differences in numbers of forms in the context naturally depend upon the numbers of forms in the system. As far as the forms of Czech adjectives are concerned, we should theoretically assume that for the paradigms mladý and jarní 2 times 48 different forms occur, i.e. 96 forms altogether. In the system, however, we find only 11 forms with the paradigm mladý and only 6 forms with the paradigm jarní. With the exception of nom. pl. of the paradigm mladý — mladí — all forms are homonymous, i.e. of identical form, though different as to their function. In consideration of these facts, the paradigm mladý has 47 forms, of which 37 are found in the context, i.e. 79%; the paradigm jarní has 48 forms, of which 40 are found in the context, i.e. 81%.
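A small sketch of this system-versus-context bookkeeping (ours; the counts are those quoted above):

    # (forms in the system, forms attested in context), from the text
    NOUNS = {"masculines": (75, 59), "feminines": (59, 48), "neuters": (58, 35)}

    for gender, (system, context) in NOUNS.items():
        print(f"{gender}: {context}/{system} = {100 * context / system:.0f}%")

    total_system = sum(s for s, _ in NOUNS.values())    # 192
    total_context = sum(c for _, c in NOUNS.values())   # 142

The same two numbers per paradigm suffice to express the degree to which a paradigm is exploited in discourse.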
The morphology of the Czech verb being exceedingly complicated, it will not be dealt with in this paper. Although the number of morphological categories in the system is relatively constant and their occurrence is subject to laws of a qualitative nature, their occurrence within the context depends on quantitative laws as well, and that is why it is important to pay attention to the frequency of individual morphological categories and to their mutual relations. In revealing these relations among the morphological categories we contribute particularly to the understanding of their semantics. This step is very important in the investigation of semantics from the quantitative point of view, although it constitutes only a component part of the complex examination of the problems in question. For example, in context the homonymous forms of nouns in Czech are to a considerable degree or fully determined by words (or forms of words) that immediately precede or follow, i.e. by the so-called environment of the word. The neighbouring words are largely stabilized as regards their classification with the parts of speech and their morphological or syntactical characteristics, according to the syntactical function performed by the homonymous form of the noun in the sentence. Therefore it is possible to distinguish the environments of the subject, of the object, of the adverbial modification, and of the attribute, especially of the so-called incongruent attribute. Thus the homonymous form of the nominative sing. or pl. representing the subject in the sentence is, as a rule, preceded by an attributive expression (adjectival, pronominal, or numeral), which usually is homonymous in itself, e.g. staré lidové zvyky (pomalu zanikly), ten oheň (již uhasl), etc. The verb in any form precedes in approximately half of the cases; in the other half it follows. The decisive factor here is the theme-rheme articulation, i.e. semantic factors. In spite of the limited number of morphological categories in the system and notwithstanding their relatively high rate of recurrence in context (i.e. their high frequency), differences are found in the relative frequencies of individual categories in general, a fact which demonstrates a certain hierarchy [2] among them, given quantitatively, especially with respect to the realization of the discourse (written or spoken) and to the functional style. We shall illustrate this statement by adducing the example of the category of case in present-day Czech. If we examine the frequency of cases [3] in present-day Czech with regard to their realization in written discourses in general (on the basis of the Frequency Dictionary of Czech [4], henceforth referred to as FDC) and in special groups, viz. group A (fiction), E (popular science literature), G (scientific literature), D (drama) and H (the so-called spoken discourses, i.e. delivered and subsequently published) on the one hand, and in spoken scientific dis-
courses (on the basis of approximately 100,000 words from 12 spoken scientific discourses) [5] on the other, we state the following relations among individual cases (cf. Table 1):

Table 1
The rank of cases of nouns ordered according to decreasing frequency

                     Singular                  Plural
    Material   N  G  D  A  V  L  I     N  G  D  A  V  L  I
    FDC        1  2  6  3  7  4  5     2  1  6  3  7  4  5
    Group A    1  3  6  2  7  5  4     1  2  6  3  7  5  4
    Group E    2  1  6  3  7  4  5     2  1  6  3  7  4  5
    Group G    2  1  6  3  7  4  5     2  1  6  3  7  4  5
    SSD        1  3  6  2  —  4  5     2  1  6  3  —  4  5
    Group D    1  3  6  2  7  4  5     3  2  6  1  7  4  5
    Group H    2  1  6  3  7  4  5     3  1  6  2  7  4  5

(N = nominative, G = genitive, D = dative, A = accusative, V = vocative, L = locative, I = instrumental.)
If we compare the frequency of cases in the singular in written discourses in Czech (according to FDC) with their frequency in spoken scientific discourses (henceforth SSD), we find agreement above all with the most frequent case, i.e. nom. sing., further with the cases of medium frequency (locative sing., instrumental sing.) and eventually with those of the lowest frequency (dat. sing. and vocative sing.). The main differences (that is to say, an inverted ratio) in rank according to descending frequency can be found with gen. sing. and acc. sing.: in the written discourses in Czech the second most frequent case is gen. sing., the third acc. sing.; in SSD the second most frequent case is acc. sing., the third place being occupied by gen. sing. This is obviously a characteristic feature of spoken discourses in general, as is shown by a comparison with the frequency of cases in group D in FDC, which, in fact, represents the spoken discourses of fiction style (see Table 1). Here, the first three most frequent cases are: nom. sing., acc. sing., and gen. sing. On the other hand, in the written scientific discourses (in FDC represented by the groups of texts E and G, but also by the group H) the most frequent case is gen. sing., then nom. sing. and acc. sing. In general, the nom. sing., gen. sing., and acc. sing. are the cases that respond most sensitively both to the kind of discourse and the functional style, as well as to the individual style of the author, a fact to be illustrated below. This is obviously connected with expressing subjects and direct as well as indirect objects; here, of course, a relation to expressing negation with verbs
is involved. As we have already mentioned on another occasion, the point in question here is obviously the theme-rheme articulation of the utterance. As far as the frequency of cases in the plural is concerned, their frequency in SSD agrees with that in FDC, that is, in fact, in written Czech: the most frequent case here is gen. pl., then nom. pl., and in third place acc. pl. The cases with medium frequency are represented by locative pl. and instrumental pl., which corresponds to the situation in the singular, the least frequent being dat. pl. and vocative pl., which conforms to the singular as well. On the other hand, in group D (drama) the most frequent case is acc. pl., the second most frequent one gen. pl., and in the third place nom. pl. It is interesting that in group H the third most frequent case is nom. pl.; the most frequent case is gen. pl., the second place being taken by acc. pl. As in the singular, the three most important cases in the plural as to their frequency, and sensitivity to various factors, are nom. pl., gen. pl., and acc. pl., but in a different order: gen. pl., nom. pl., and acc. pl. Again, they are cases that play a significant role in the theme-rheme articulation. As regards the comparison of spoken discourses of different kinds, especially those of common communication, the ratio of these three cases in the plural would deserve special attention. The differences here, however, may be caused by a relatively small number of data in the plural. The ratio of absolute to relative frequency of cases in singular and plural, that is to say, the frequency of cases with respect to the category of number, proves to be an interesting problem as well. In written discourses in Czech (according to FDC), about 75% of all forms of nouns (exactly 75.81%) are singular, and approximately 25% (24.19%) plural. In spoken scientific discourses (SSD) the number of singular forms is smaller, roughly 70% (70.38%), the number of plural forms being, naturally, larger, about 30% (29.62%). The ratio of the two basic morphological categories, case and number, cannot be accounted for in terms of differences in the realization of discourses only, i.e. with reference to their written or spoken form. The comparison with the frequencies of cases and numbers in individual functional styles covered by FDC shows that, e.g., in dramas the singular forms represent as much as 83% of all forms of nouns (83.27%), the plural forms comprising only 17% (16.73%). In the framework of fictional style in general the same ratio of singular and plural forms as with SSD is found only in the language of poetry (about 70% of forms in the singular and 30% in the plural); in my opinion this congruency can be regarded as purely accidental. It is well worth noting, however, that in the so-called scientific discourses (G) there is a tendency towards a decrease in the number of singular forms (approximately 73%), and a corresponding increase in plural forms. This is also confirmed, of course, by group H (the so-called spoken discourses), where the singular forms represent roughly 72% of all forms (71.96%) and the plural
forms approximately 28% (28.04%).

Table 2
Frequency of number with nouns in SSD

    Text No.      N      Singular F      %      Plural F      %
    1           1,672      1,325       79.25       347       20.75
    2           1,564      1,042       66.62       522       33.38
    3           1,324        853       64.43       471       35.57
    4           2,125      1,554       73.13       571       26.87
    5           1,149        825       71.80       324       28.20
    6           1,787      1,342       75.10       445       24.90
    7           1,772      1,160       65.46       612       34.54
    8           1,689      1,158       68.56       531       31.44
    9           1,578      1,075       68.12       503       31.88
    10          2,574      1,630       63.33       944       36.67
    11          1,406      1,080       76.81       326       23.19
    12          2,039      1,510       74.06       529       25.94
    Σ          20,679     14,554       70.38     6,125       29.62
    x̄           1,723      1,213       70.40       510       29.60
    s²        136,810     63,068                 25,779
    s          369.82     251.13                 160.56

The smaller number of forms in the singular and the larger number of plural forms with nouns obviously represent the scientific character of the discourse rather than its spoken form, and, conversely, the larger number of singular forms indicates a greater looseness with respect to current spoken language. Both these cases, of course, depend
upon the subject of the discourse in question; in highly specialized papers delivered with concentration on the given subject and premeditated to a great extent as to details, or even prepared in advance, a much smaller number of singular forms, and, correspondingly, a larger number of plural forms of nouns, is found. There is a considerable span in individual texts (see Table 2): with forms in the singular 63-79%, with plural forms 37-21%. These are, of course, special problems we will not go into here.

Using the categories of case and number as examples, we have attempted to demonstrate that the examination of morphology from the quantitative point of view not only requires statistical data on individual categories, but also a comparison of them, and an analysis of their combinations and interrelationship. This is of crucial importance for the interpretation of statistical data on morphological phenomena, and, at the same time, it offers a clue for the study of semantics. To this purpose it is necessary — especially in the inflected languages — to examine and to confront not only morphological categories within the confines of individual parts of speech, but also to study the combinations of morphological categories of different parts of speech, e.g., in Czech, of nouns and adjectives, of verbs and pronouns, of both these groups, etc. The material indispensable to such an analysis must necessarily be subject to machine processing. At present we find ourselves only at the very beginning of this work; up to now, we have succeeded in getting some data on morphological categories in the framework of individual parts of speech. In the future, a more profound confrontation and examination of the combinations of different morphological categories of different parts of speech will be necessary. I regard the successful solution of this task as one of the basic objectives of machine linguistics in the field of morphological and semantic analysis.

Institute of Czech Language of the Czechoslovak Academy of Sciences
REFERENCES

[1] Těšitelová, M. O morfologické homonymii v češtině, Praha, 1966.
[2] Těšitelová, M. 'Zur Quantifikation der grammatischen Kategorien', Linguistics (in press).
[3] Těšitelová, M. 'K frekvenci pádů v současné spisovné češtině', SaS, 30, 1969, 269-275.
[4] Jelínek, J., Bečka, J., Těšitelová, M. Frekvence slov, slovních druhů a tvarů v českém jazyce, Praha, 1961.
[5] The material was collected and is being analyzed in the Department of Mathematical Linguistics of the Institute of Czech Language of the Czechoslovak Academy of Sciences in Prague.
PHONOLOGY
AUTOMATIC HYPHENATION IN HUNGARIAN GYÖRGY HELL
The need for mechanical hyphenation stems from an economical motivation for automatic typesetting in printing. The realization of the task depends on whether the rules of hyphenation can be executed by a computer. The difficulties of the work are well known. Principally the difficulties lie in the fact that rules for hyphenation are founded not only on the phonological structure of words but also on their position in the grammatical structure of a given language. For example, we distinguish two sets of rules in Hungarian hyphenation: (1) rules of natural syllabication (rules of pronunciation), (2) rules based on the morphological structure of words (etymological principle). The rules for natural syllabication in Hungarian can be expressed in formalized form in the following way:

    [/C/V . . .] — [/C/V . . .]

This means that in Hungarian syllables the consonants either follow the vowel, or one consonant precedes it. According to a count of a corpus of 38,326 syllables, eleven types have been found, in the following order of decreasing frequency:

    1)  CVC     39.1%
    2)  CV      38.3%
    3)  VC      10.1%
    4)  CVCC     5.9%
    5)  V        5.3%    (together 98.7%)
    6)  VCC      0.7%
    7)  CCV      0.1%
    8)  CCVC     0.1%
    9)  CCVCC    0.01%
    10) CVCCC    0.01%
    11) VCCC     0.01%
Most Hungarian syllables have one of the constructions V, CV, CVC, VC, CVCC. It can be demonstrated that a "rule of thumb" program, according to which words are hyphenated before a consonant preceding a vowel, would give a correct result of about 83%. Such a program correctly splits all the words which have syllables of the types CVC, CV, CVCC. If we take into consideration the fact that syllables of the type VC occur mainly at the head of words — about 30% of Hungarian words begin with a vowel — an even higher percentage of correctness can be achieved by a very simple program. The question is, what percentage of correctness is considered satisfactory? The main difficulty in elaborating a completely correct hyphenation is the fact that many Hungarian words are composita. For a perfect solution let us suppose we have a computer program with the entire dictionary. As Hungarian composita form an open set, we must provide for an analysis of words to determine the units of composition. This task is complicated not only by the nearly unlimited possibilities of composition but also by the characteristic structure of Hungarian. As an agglutinative language, Hungarian has a large number of suffixes and suffix combinations identical to word morphemes, so that an analysis would evaluate simple words as composita. (E.g.: haladó, okok, Lázár, vigasztal, karóra, elég, megint, etc.) It is clear that such cases require an extended and reliable sentence analysis that has not yet been developed. A 100% solution can nevertheless be achieved if we state our aim in another way: as the computer is unable to solve the problem of hyphenation in the same way as the human mind, we must have a program according to which the computer itself determines the places where a formally correct hyphenation is possible. A starting point for such an aim can be given by looking for places in Hungarian texts where hyphenation is absolutely always correct. Trivially such places are: (a) in hyphenated word forms (hyphenated composita); (b) at places where two or more vowels occur together. There are not many such places in a text, and to obtain a satisfactorily large number of places for hyphenation other measures must be taken. (c) Correct hyphenation is possible in consonant clusters, if the clusters are either word endings or word heads. Consonant clusters occurring as noninflected word endings are not too numerous; their number is 120. At word heads considerably fewer such clusters (23) can be found. If such clusters are found, hyphenation cannot be undertaken; otherwise one of the consonants joins the next vowel.
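A minimal sketch of such a "rule of thumb" program (our reconstruction for illustration; the paper does not reproduce its own code):

    VOWELS = set("aáeéiíoóöőuúüű")

    def rule_of_thumb(word: str) -> str:
        """Hyphenate before every consonant that immediately precedes
        a vowel (except word-initially); exact for CV, CVC, CVCC syllables."""
        out = []
        for i, ch in enumerate(word):
            if 0 < i < len(word) - 1 and ch not in VOWELS and word[i + 1] in VOWELS:
                out.append("-")
            out.append(ch)
        return "".join(out)

    print(rule_of_thumb("paradicsom"))   # pa-ra-dic-som: wrong, cs is a digraph

Digraphs (cs, sz, gy, . . .) and composita are exactly the residue that the paper's cluster lists and two-syllable analysis are designed to catch.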
(d) A much higher frequency of exact hyphenation can be achieved if we find a solution for the hyphenation of all two-syllable words. While the distribution by length of noninflected Hungarian words gives a Gaussian curve, the same distribution in running text gives a quite different picture. A count of a corpus of 20,000 words gives the following results:

    characters   words      %
    1             1734      8.7
    2             2519     12.6
    3             2404     12.1
    4             1917      9.6
    5             2654     13.3
    6             2004     10.0
    7             1585      7.9
    8             1578      7.9
    9             1020      5.1
    10             894      4.4
    11             615      3.1
    12             362      1.8
    13             265      1.3
    14             140      0.6
    15              73      0.3
    16              50      0.2
    17              15      0.06
    18              11      0.05
    19              19      0.08
    20               7      0.003
    21               2      —
    22               1      —
    23               2      —
Most of the words are those with five characters and words up to 7 characters give 74.2% of the whole text. Even words with 5 characters — which in all probability are no more than two-syllable forms — give more than 50% (56.3%) of the text. So it seems quite reasonable to expect to find a solution for the hyphenation of two-syllable words. Another, smaller count of the syllables confirms this presupposition (see Table 1). Two-syllable words can be inflected monosyllabics or two-syllabic composita. The hyphenation of the two-syllabic inflected words takes place either just before the second vowel or before the preceding consonant. Difficulties may arise only in composita (including prefixed words), so a reliable analysis is necessary.
An analysis of the two-syllabics shows that not all monosyllabic words are necessary in an analysis for convenient splitting. Words with a vowel at the head (52 items), words beginning or ending with two consonants (47 and 286 respectively) and words with three consonants are sufficient for the program. If we do not take the words but only the consonant clusters — as mentioned earlier — these numbers can be reduced to 22 and 120.

Table 1
Syllable types in Hungarian words (percentages)

                                      V     CV    VC    CVC   CVCC  VCC  VCCC  CCVC  CCV
    Monosyllabic words              18.7    5.7  32.6  26.4  13.6   2.3   0.2    —    —
    Word head in polysyllabic
      word forms                    11.8   37.2  14.5  32.7   2.5   0.6    —    0.3  0.1
    All the syllables in the words   3.3   47.9   2.6  44.3   1.1   0.3    —     —   0.1
    All word final syllables         1.2   27.8   2.9  56.2  11.7    —     —     —    —

    [Further rows of the original table give the same breakdown for the
    2nd-5th syllables within word forms and for the 2nd-6th word-final syllables.]
Some difficulties arise in words of type . . . VCV . . . where the splitting can take place just after the vowel or just before the second vowel if the word is a compositum. To avoid any mistakes in such cases we had to store not only the words with a vowel head (for the identification of the second constituent) but also the words ending with one consonant. Their number is fairly large, about 400. To diminish this number in our program, not all such words were listed, but only those which can possibly give a compositum with any of the types VC . . . . The monosyllabic words stored in the computer can also be used for the hyphenation of polysyllabic items. If a word ending corresponds to no monosyllabic word, a correct hyphenation can always be undertaken before the vowel or before the preceding consonant. This fact and the high relative frequency of two-syllable words give a good ratio of correct hyphenation.
But as hyphenation does not occur everywhere in words where it is possible, the printed lines must have a minimal length so that the final word may be correctly split at the end if necessary. Our program allows relatively short lines of about six to eight syllables.

Institute of Languages at the Technical University of Budapest
THE AUTOMATIC CONVERSION OF WRITTEN DUTCH TO A PHONETIC NOTATION G. H. A. KOK
Man is able to convert written text to spoken language and vice versa. Can a machine learn this too? At present we are not interested in physical and physiological questions, nor in individual differences in speech. That is why we restrict ourselves to the conversion from and to a phonetic notation.

Definition of the phonetic notation (as a meta-language we use the Backus Normal Form [1]):

    <short vowel>         ::= a | o | u | i | e
    <long vowel>          ::= aa | oo | uu | ee | ie | oe | eu
    <diphthong>           ::= ou | ei | ui
    <glide>               ::= ai | oi | aai | ooi | oei | eeu | ieu
    <vowel>               ::= <short vowel> | <long vowel> | <diphthong> | <glide>
    <voiced consonant>    ::= v | z | g | b | d | q
    <voiceless consonant> ::= f | s | c | p | t | k
    <liquid>              ::= l | m | n | r
    <consonant>           ::= <voiced consonant> | <voiceless consonant> | <liquid> | j | w | h | ŋ
    <C1>                  ::= <combination of consonants pronounceable before a vowel> | <empty>
    <C2>                  ::= <combination of consonants pronounceable after a vowel> | <empty>
    <syllable>            ::= <C1> <vowel> <C2>
    <word>                ::= <syllable> | <word> <separator> <syllable>
Interpretation: We give an interpretation of this notation with the help of the International Phonetic Alphabet [2]:

    a  = ɑ          aa = a        ou = ɑu       ai  = ai
    o  = ɔ          oo = o        ei = ɛi       oi  = oi
    u  = ə or œ     uu = y        ui = œy       aai = aːi
    e  = ɛ          ee = e                      ooi = oːi
    i  = ɪ          ie = i                      oei = ui
                    oe = u                      eeu = eːu
                    eu = ø                      ieu = iw

    v = v    f = f    l = l    j = j
    z = z    s = s    m = m    w = w
    g = ɣ    c = x    n = n    h = h
    b = b    p = p    r = r    ŋ = ŋ
    d = d    t = t
    q = g    k = k

If certain nuances are ignored, we can consider this an unambiguous notation for the Dutch phonemes. Every Dutch word can be expressed with it, but of course the grammar of <word> generates a lot more words than actually occur. This system was chosen because it looks so much like written Dutch; if we replace ŋ by ng (not n g) and c by ch (not c h), we have a notation for spoken Dutch which uses a (combination of) symbol(s) for every phoneme that is often used in written Dutch.

We can divide the ambiguity of written Dutch into two components:
(1) Cases in which more than one notation is used for a single phoneme, e.g. ou, which can be written as ou or au, and ei, which can be written as ei or ij.
(2) Cases in which one notation can stand for different phonemes in different words, e.g. ch can stand for c or sj, qu for k and kw.

The difficulty of converting written Dutch to our notation is defined by component 2, the opposite conversion by component 1. In some important cases of component 1 we need a grammatical analysis of the sentence (t written as t, d or dt in verb declensions), or the meaning of the word (ei, ou). For this reason automatic conversion of spoken to written Dutch does not seem practical.

Some important differences between written Dutch and our notation are (the triples below are Dutch, phonetic, English):
(a) In written Dutch a long vowel in an open syllable is noted with one symbol (baken, baakun, beacon).
(b) A short vowel in an open syllable, from a phonetical point of view, is noted by doubling a consonant (bakken, bakun, to bake).
(c) Assimilated consonants are not written as such (maatbeker, maadbeekur, measuring-glass).
(d) The very frequent letter e can stand for the phonemes e, ee and u, while u can be written in Dutch as e, i or ij.

Differences of type a. cause ambiguities of case 2., b. of case 1., and c. causes both kinds.

We wrote a program in ALGOL 60 for the conversion of written Dutch to our notation. The basis of the program is hyphenation; we use Brandt Corstius' Sylsplit [3] for it. After hyphenation, prefixes are split off. Placement of stress is obtained by the following rule: in general, stress is on the first syllable; some prefixes (be, ge, ver, te and ont) put the stress on the next syllable*; in loan words some suffixes either take stress or put it on the preceding syllable. In this paper only two parts of the program are described: one part handles assimilation and the doubling of a consonant (c. and b.), and the other decides whether an e or i is an u or not.

    * Not every occurrence of one of the syllables be, ge, ver and te is an instance of a prefix which puts the stress on the next syllable; sometimes this syllable is simply part of the root of the word. E.g. in vergezellen (to accompany) both ver and ge put the stress on the next syllable: verge'zellen, but in vergeven (to forgive), ge is part of the root geven, thus ver'geven.

Assimilation: Assimilation is performed only inside a word, not between words. The assimilation between the j-th and (j+1)-th syllable is as follows: if the j-th syllable is the last one, then, of course, the change is only in the last C2. By . . . letter 1 — letter 2 . . . → . . . letter 3 — letter 4 . . . we mean that, if the last letter of syllable j is letter 1 and the first of syllable j+1 is letter 2, then they are replaced by letter 3 and letter 4.

Algorithm:
(1) If syllable j is not the last one then:
(2) All voiced consonants of C2 of syllable j are set voiceless. If this syllable is the last one, or if its last letter is not subject to assimilation (it is a liquid or a j, w or h), then go to 5.
(3) Progressive assimilation: after a voiceless consonant (f, s, c, p, t or k),
        v → f,  z → s,  g → c
(4) Regressive assimilation (before a voiced plosive b or d a voiceless consonant is made voiced; cf. baat-den → baad-den below).
(5) If the j-th syllable is not the last one, then . . . letter 1 — letter 1 . . . → . . . — letter 1

Examples:
              scale plating     greed        (we) bathed    to lie
              schubpantser      hebzucht     baadden        liggen
              schub-pant-ser    heb-zucht    baad-den       lig-gen
    (1)       schub-pant-ser    heb-zucht    baad-den       li-gen
    (2)       schup-pant-ser    hep-zucht    baat-den       li-gen
    (3)       schup-pant-ser    hep-sucht    baat-den       li-gen
    (4)       schup-pant-ser    hep-sucht    baad-den       li-gen
    (5)       schu-pant-ser     hep-sucht    baa-den        li-gen
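A sketch of steps (2)-(5) in Python (our transcription for illustration; the paper's actual program was written in ALGOL 60, and step (4) is restated here from the baat-den → baad-den example):

    DEVOICE = {"v": "f", "z": "s", "g": "c", "b": "p", "d": "t", "q": "k"}
    VOICE = {v: k for k, v in DEVOICE.items()}
    VOICELESS = set("fscptk")

    def assimilate(syls):
        """Apply assimilation steps (2)-(5) at each syllable boundary."""
        syls = list(syls)
        for j in range(len(syls) - 1):
            # (2) set the final voiced consonant of syllable j voiceless
            #     (C2 is at most one consonant in these examples)
            syls[j] = syls[j][:-1] + DEVOICE.get(syls[j][-1], syls[j][-1])
            # (3) progressive: v, z, g lose their voice after f s c p t k
            if syls[j][-1] in VOICELESS and syls[j + 1][0] in "vzg":
                syls[j + 1] = DEVOICE[syls[j + 1][0]] + syls[j + 1][1:]
            # (4) regressive: a voiceless consonant is voiced before b or d
            if syls[j + 1][0] in "bd" and syls[j][-1] in VOICE:
                syls[j] = syls[j][:-1] + VOICE[syls[j][-1]]
            # (5) degemination: drop one of two identical boundary letters
            if syls[j][-1] == syls[j + 1][0]:
                syls[j] = syls[j][:-1]
        return syls

    print(assimilate(["schub", "pant", "ser"]))   # ['schu', 'pant', 'ser']
    print(assimilate(["heb", "zucht"]))           # ['hep', 'sucht']
    print(assimilate(["baad", "den"]))            # ['baa', 'den']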
The letters e and i. This section covers three different phenomena: 1. the letter e which is used as u (noted here as 0), 2. the letter i which is used as u (noted as *), 3. the letter i followed by ng (noted as t). These three symbols are treated together because they play a comparable part in the declension. E.g.: stell0n (to declare), stelling (thesis), stellig (positive).
Some Boolean procedures were devised which determine what phoneme is actually meant by e or i. They are explained in a manner similar to that used for assimilation. C1 and C2 are separated from the vowel by a comma; (n) means that C1 or C2 consists of n symbols. L is used for l m n r.

(1) bool proc i is an i;    n—g, 0,
Examples: be-we-gtng 'movement', be-we-gtn-g0n 'movements'.

(2) bool proc i is an i;    an unstressed syllable and
        (1) g — k, i, (0)
        (0) st — k,
        e g, 0 g,
Examples: za-lig 'blessed', le-nigst 'supplest', ha-vik 'hawk', ha-vi-ken 'hawks'; ver-e-ni-gtng 'union', ver-e-ni-g0n 'to unite', ern-stig 'grave'.
Procedures 3, 4 and 5 are local to 6.

(3) bool proc e is an ee;    , e, (0)—(1), but not , e, (0)—L,
Examples: true in sal-pe-t0r (salpetre), aard-be-vtng (earthquake), paar-de-de-k0n (horse-blanket); false in sta-me-l0n-d0 (stammering), sta-me-ling (a stammer), zeur-de-rig (tedious).

(4) bool proc e is an e;    , e, (1)—(1), but not , e, L—d or L—s,
Examples: true in vei-lig-stel-ltng (assurance), bloed-stel-p0nd (styptic); false in bo-ven-st0 (upper), stel-pen-d0 (staunching), wes-ter-s0 (western).
(5) bool proc e is an 0 in the last syllable;    the syllable is the last one and
        , e, L | Ld | lt | mt | rt | nds | ndst | rst | rdst
    or  k, e, nt
Examples: true in sta-p0lt (-he- piles), ge-sta-p0ld (piled), uit-0rst (utmost), rek0nt (-he- calculates); false in di-ri-gent (conductor), con-su-ment (consumer).
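A sketch (ours) of procedure (5), keyed to the endings recoverable from the list above; as a simplification, the liquid class L is here narrowed to l:

    ENDINGS = ("l", "ld", "lt", "mt", "rt", "nds", "ndst", "rst", "rdst")

    def e_is_schwa_in_last_syllable(syllable: str) -> bool:
        """True when the e of a word-final syllable is the reduced vowel 0."""
        if "e" not in syllable:
            return False
        after_e = syllable.rsplit("e", 1)[1]
        return after_e in ENDINGS or (syllable.startswith("k") and after_e == "nt")

    assert e_is_schwa_in_last_syllable("pelt")        # sta-p0lt
    assert e_is_schwa_in_last_syllable("kent")        # re-k0nt
    assert not e_is_schwa_in_last_syllable("gent")    # di-ri-gent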
(6) bool proc e is an 0;    unstressed on this syllable and e is an 0 in the last syllable, or e (but not an e or an ee) in one of the contexts
        —w, e, l     L
          , e, n—t   Ls
          , e, s—s   s
        —r, e, s
        —s, e, s
Starting at the end of the word every e or i is (if necessary) replaced by the correct symbol. An e which is not an 0 becomes an ee in an open syllable.

Examples: vervelende (boring): ver-'ve-len-de (ver moves the stress to the second syllable), ver-'ve-len-d0, ver-'ve-l0n-d0, ver-'vee-l0n-d0, v0r-'vee-l0n-d0. stervelingen (mortals): 'ster-ve-lin-gen, 'ster-ve-lin-g0n, 'ster-ve-ltn-g0n, 'ster-v0-lin-g0n.

Results: Though it is difficult to say whether a phonetic representation of some word is correct or not, we found 97 percent of the word-tokens in a sample from a Dutch newspaper [4] to be converted correctly. At the moment the program is being adapted to the speech-synthesizer of the Institute for Perception Research at Eindhoven (IPO). Valuable improvements of our program are expected as a result of this project.

Some examples of the complete program:

    Dutch             phonetic                    English
    schubpantser      scu pant sur                scale plating
    hebzucht          hep suct                    greed
    wodka             wot kaa                     vodka
    zakdoek           zaq doek                    handkerchief
    baadden           baa dun                     bathed
    baden             baa dun                     (to) bathe
    alleseter         a lu zee tur                omnivore
    begevende         bu gee vun du               going
    revolutionaire    ree vo luu tsie oo ne ru    revolutionary
    vergevingen       vur gee vin un              forgivenesses
    afwezig           af wee zuc                  absent

Center for Mathematics, Amsterdam
REFERENCES

[1] Backus, J. W. 'The Syntax and Semantics of the Proposed International Algebraic Language of the Zürich ACM-GAMM Conference', ICIP, Paris, June, 1959.
[2] Principles of the International Phonetic Association, London, 1949.
[3] Brandt Corstius, H. Exercises in Computational Linguistics, Mathematisch Centrum, Amsterdam, 1970.
[4] Van Berckel, J. A. Th. M., Brandt Corstius, H., Mokken, R. J., Van Wijngaarden, A. Formal Properties of Newspaper Dutch, Mathematisch Centrum, Amsterdam, 1965.
AUTOMATIC PHONEME TRANSFORMATION: SANSKRIT-INDONESIAN HARRY SPITZBARDT
1. General Remarks The continuous intake of socially, culturally, scientifically, and economically important loanwords as well as newly coined internationalisms into the system of a modern national language is of paramount interest for problems of language planning in general and the modernization of national languages in particular. As has been shown by recent developments in African and Asian countries, the choice and subsequent modernization process of national languages, especially in these areas of the world, reveal a great complex of various theoretical and practical aspects. Once the selection of the national or official language has been made out of several existing languages or dialects of each country, there is the problem of standardization of the chosen official communication system. This problem of standardization — and we may add communicative optimization — of a given natural language is closely connected with a steady improvement and enrichment of its vocabulary, and the creation of specified systems of terminology in the field of science, technology, economics, education, etc. As everybody knows, the cultural and linguistic history of European countries has been greatly influenced by Classical Greek and Latin. The borrowing from these ancient models of lexical and word-formation elements continues to serve the entire intellectual world with ever new and well-defined, internationally accepted and standardized scientific and technical terms. A similar role is played by Sanskrit in the national language of the Indonesian Republic, the Bahasa Indonesia, henceforth simply called "Indonesian" for the sake of brevity. In many respects the growth and structure of Modern Indonesian may be compared, from its developmental point of view, with the history of English. The Roman, Scandinavian, and French invasions into England in their cultural and linguistic effects bear a certain resemblance to the Indian, Islamic, Portuguese, Dutch, and Japanese invasions of the Indonesian archipelago. The consequence for the present Indonesians may be seen in a diverse bulk of loanwords, including loan formations, mainly from Sanskrit, Arabic, Dutch, and English, not only in the local dialects, but also
in that basic interinsular vernacular that has developed from the original Malay language and that officially has come to be accepted as the unitary lingua franca, under the name of "Bahasa Indonesia". In his contribution to the Conference on "The Modernization of Languages in Asia", held in Kuala Lumpur in September 1967, the Indonesian novelist and philologist Sutan Takdir Alisjahbana made it clear that there is a passionate rivalry in the lexical field between Sanskrit, Arabic, Graeco-Latin and local languages and dialects in the standardization and modernization of the Indonesian language. In Sukarno's time there was even a noticeable tendency to screen off intruding Anglo-Americanisms by resorting to ancient borrowings or newly patterned forms from Sanskrit, using "pramugari" for "stewardess" and "prasedjarah" (a Sanskrit-Arabic hybrid) for "prehistory", "wartawan" for "journalist", "wisatawan" for "tourist", "swasraja" for "self-service", "dwibahasa" for "bilingual". Of late "konamatru" has been introduced in mathematics and engineering as a substitute for "goniometry". Moreover, it should be noted from the sociolinguistic point of view that the general trend to apply highfaluting Sanskritisms in Modern Indonesian, much in the sense of the so-called "hard words" of American English usage, has at the same time a significant bearing on style in its social frame. In many cases the preference of a Sanskrit loan-word or neologism to a simple-sounding vernacular word is meant to show that the speaker is an intellectual, educated person, saying "pria" and "wanita" instead of Malay "laki-laki" and "perempuan" for masculine or feminine or "boy" and "girl" respectively. The purifiers must have felt themselves rather at a loss when faced with such innovations of Anglo-American origin as, for instance, "tape-recorder", "musical", "computer", "laser", "astronaut", which have been retained in their English form, albeit with varied pronunciation, up to this day. Nothing can be done in cases like "beat" and "pop" music, and a lot of other words have been "Indonesianized", like "television" into "televisi". With all this, however, there still is a growing tendency to use Sanskritisms in the national language of Indonesia, whose very designation, "Bahasa", meaning "language", has been derived from Sanskrit "bhāṣā", of the same meaning. The fact that we have two forms in Indonesian — bisyllabic "basa" and trisyllabic "bahasa" — posed a certain difficulty for our experiments in automatic language data processing, where a number of hitherto disregarded or unknown irregularities could be discovered and analyzed. A tentative semantic grouping of Sanskrit words in the Bahasa Indonesia displays the following fields of application:

— Religion and Philosophy
— Scholarship, Science, Numbers
— Abstract Words
— Man and Parts of the Body
— Family Relations
— Official Appointments and Titles
— Literary Terms and Notions
— Natural Phenomena and Geographical Expressions
— Animals and Plants
— Metals, Minerals, and Other Materials
— Notions of Time
— Buildings and Institutions
— Trade and Business.
In addition to the main stock of lexical items from both the nominal and verbal complex, there is a group of function-words (prepositions, adverbs, pronouns, and conjunctions) and prefixal as well as suffixal elements to be used in word formation, all of them derived from Sanskrit. There is a considerable number of incontestable Sanskritisms unrecognized and consequently not registered as such in Indonesian dictionaries, such as the "Kamus Moderen Bahasa Indonesia", Djakarta 1954, by Sutan Mohammad Zain, or the Indonesian-German Dictionary by Otto Karow and Irene Hilgers-Hesse, published in Wiesbaden 1962. The two German orientalists, in fact, must have been unable to recognize a Sanskritism when they saw one; otherwise they would have designated such evident items as "bahasa" = "language", "bakti" = "devotion", "bisa" = "poison", "muka" = "face", "prasangka" = "prejudice", "sardjana" = "scholar", "tjita" = "idea" (from the Skt. past participle "citta" = "thought"), and a great many others as Sanskritisms. Phonematic correlations between the two languages under consideration have been drafted intuitively and empirically by J. Gonda in his book Sanskrit in Indonesia, Nagpur 1952, in H. Kähler's Grammatik der Bahasa Indonesia, Wiesbaden 1956, and by G. Kahlo with his chapter on the sound changes of Sanskrit loan words in Malay, contained in his booklet Indonesische Forschungen — Sprachbetrachtungen, Leipzig 1941. Their inventories of phonematic transformation rules between Sanskrit and Indonesian are far from being exhaustive and carry a certain amount of mistakes with them. In the case of the Indonesian homonym "bisa", for example, we have to take into account two etymological traces, one coming directly from the native tongue and meaning as much as "can, may, possible", the other leading straight back to Sanskrit "visa", which is "poison". That is why we decided to use a quadruple configuration consisting of a Sanskrit word, its Indonesian counterpart and their respective German meanings, each serving as one entry in our input format for the purpose of data processing. The German equivalents may serve as semantic markers in further investigations. A regular howler
is Mohammed Zain's wilful explanation of Indonesian "balai" = "house, building" from Sanskrit "valaya" = "bracelet, circle, enclosure". We always have one reliable method at our disposal by which to prove whether a given item is of Sanskrit origin or not. This is the Polynesian matching test, because Sanskrit did not spread into the Pacific beyond the Philippines. Thus, if we find for our Indonesian word "balai" the form "bale" in Javanese, "fale" in Samoan, and "whare" in Maori, all expressing the same meaning, we may be absolutely certain that we are not dealing with a Sanskritism. The same author's identification of the Indonesian word "meditase" = "meditation" as of Sanskrit origin should not be considered as a howler, but as a mere joke. After statistically checking different dictionaries as to their content in Sanskritisms we arrived at the following results:

— Karow and Hilgers-Hesse booked 545 Sanskritisms among 19,070 dictionary entries, which is 2.9%, as compared to 8.7% Arabisms.
— Sutan Mohammed Zain records 565 Sanskritisms from a total of 13,182 entries, which is 4.3%, as compared to 8.4% Arabisms.

Already from these numerical data we can infer a general agreement with regard to Arabisms and a considerable uncertainty when it comes to Sanskritisms. Our own file of Sanskrit loanwords in the Indonesian language, collected by hand in the good old philological tradition from novels, scientific and technical publications, magazines and newspapers, private letters, together with those specimens found — if designated — in the dictionaries, amounts to 760 items altogether, i.e. both borrowings and new formations with the help of Sanskrit elements. It is to be expected that the actual number of the Sanskrit share in Indonesian word usage is much higher, because our collection increases from day to day under our own eyes. As about 100 of our 760 Sanskrit words are still more or less doubtful in view of the shortcomings mentioned above, the need for an exact and complete solution of the problem becomes obvious. In order to arrive at an approximately accurate number, an automatic matching of an Indonesian word list with a Sanskrit dictionary, applying routine techniques of computer systems, seems to be inevitable. A given Indonesian word should, in accordance with a system of phonological transformation rules, be converted into its initially quite fictitious Sanskrit counterpart, a process which, after a suggestion made by Hans Karlgren, might duly be called a "translation into quasi-Sanskrit". The fictitious Sanskrit word produced in this way will then be looked up in the Sanskrit dictionary for a match; if successful, it will be stripped of its fictitious character and be turned into a genuine Sanskritism; if there is no match, the next Indonesian word has to be converted.
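A minimal sketch of this quasi-Sanskrit lookup (ours; the rule set is reduced to a few correspondences named in the paper, and the lexicon is a stand-in, not a real Sanskrit dictionary):

    # illustrative Indonesian -> Sanskrit phoneme correspondences
    RULES = [("tj", "c"), ("dj", "j"), ("b", "v")]
    SANSKRIT_LEXICON = {"visa", "bhasa", "vicara", "bhakti"}   # stand-in

    def quasi_sanskrit(word: str) -> str:
        """Convert an Indonesian word into its fictitious Sanskrit counterpart."""
        for indo, skt in RULES:
            word = word.replace(indo, skt)
        return word

    def is_sanskritism(word: str) -> bool:
        return quasi_sanskrit(word) in SANSKRIT_LEXICON

    print(is_sanskritism("bisa"))      # True: bisa -> visa 'poison'
    print(is_sanskritism("bitjara"))   # True: bitjara -> vicara
    print(is_sanskritism("balai"))     # False: fails the dictionary match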
As may easily be guessed, the first step in the entire lexicological research scheme has to be an automatic phoneme transformation Sanskrit-Indonesian, yielding the fundamental inventory of phonemic correlation rules between the two languages. Another preliminary procedure to facilitate the dictionary comparison will be an automatic root analysis of current Indonesian text materials, because only root morphemes are to be finally matched so that the Sanskritisms among them can be selected by machine routines. Moreover, an automatic root analysis is indispensable for the preparation of dictionaries, thesauri, and word-frequency lists, which may then be subjected to various procedures of natural language data processing, and be further exploited in automatic documentation systems, as for instance in keyword-in-context (KWIC) indexing, or in mechanized information retrieval with indexing, content analysis, abstracting, and — to some degree at least — for purposes of machine translation of telegraphic abstracts, again in the field of information and documentation. Furthermore, it will become feasible, after all, to state the relative frequencies of Sanskrit words in actual usage of the Bahasa Indonesia covering all provinces of life, i.e. from newspapers, wireless broadcasting and television, from publications of any kind, political speeches, university lectures, bazaar slang, and so forth. To explain this by way of an example, let us take as input for our automatic text analysis on a medium-size computer just one sentence from the speech held by Sukarno on the occasion of the national ceremonies for the 19th anniversary of the Day of Proclamation on 17th August 1964. Incidentally, Sukarno's name is derived from a Sanskrit Bahuvrihi-Compound "su-karna" meaning "being provided with a nice ear", just as Suharto's goes straight back to "su-artha", which meant as much as "showing nice endeavour, good property". Sukarno said in his speech: "Karena itulah maka pada permulaan pidato ini saja bitjara tentang pengalaman dimasa jang lampau, dan djurusan untuk masa jang akan datang." In English: "For this reason at the beginning of my speech I will talk about the experience drawn from the past and our tasks for the future." The output in the form of a string of root words after the procedure of automatic root analysis looks like this: KARENA ITU MAKA PADA MULA PIDATO INI SAJA BITJARA TENTANG ALAM MASA JANG LAMPAU DAN DJURUS UNTUK MASA JANG AKAN DATANG. Only in this form may any word usually be looked up in an Indonesian dictionary. It is this intermediary output which in its turn may be used for different processes of theoretical and practical analysis. If, however, we should want to get nothing else but merely an indication of the content in Sanskritisms, an automatic scanning of this small piece of text would print out the following words of Sanskrit origin or pattern: KARENA, MULA, SAJA, BITJARA,
MASA, from "karana" = "reason", "mula" = "root, basis", "sahaya" = "fellow", "vicara" = "consideration", "masa" = "month". That is to say that 23% of all words in our selected sentence are Sanskritisms.
2. Phonematic Transformation through Sanskrit-Indonesian Word Matching
Our collected data file comprises a list of 662 quadruples, each quadruple consisting of:

— one Sanskrit word
— one Indonesian word
— one German equivalent of the Sanskrit word
— one German equivalent of the Indonesian word

An illustration of this input format is given in Figure 1.

    Word-Quadruple:   VAACAA   BATJA   WORT   LESEN

    Figure 1
From this given list of quadruples we want to obtain:

(1) A complete set of transformation rules from Sanskrit phonemes to Indonesian phonemes;
(2) Alphabetically arranged word lists for each type of transformation;
(3) A series of tables, recording both the absolute and relative frequencies of the occurrence of the particular transformation types.

Our basic idea is a man-machine interplay, where man feeds the machine an empirically, or otherwise, gained initial set of transformation rules and the machine, as it were, recursively "learns" more and more rules by which the whole batch of Sanskrit-Indonesian word-couples is filtered. There will always remain a residual amount of words not corresponding to the given rules from the inventory. Taking this fact into consideration, we formulated an algorithmic strategy, the main operations of which are:

(A) successively increasing the inventory of phonematic transformation rules;
(B) at the same time, successively reducing the list of residual word quadruples.

This process may be illustrated by a block diagram in the following way:
Figure 2. Block diagram. [The diagram shows the loop of operations (A) and (B): new rules are added and the word list is filtered again, with discontinuance if the RESIDUAL WORD LIST is sufficiently small.]
Methodologically, the algorithm for the computer program was based on principles of set theory. As a primary source of rules by which to set the whole process into action, we used a limited collection of phonemic transformations adopted from the descriptions found in traditional grammars, verified and enlarged by a Sanskrit-Indonesian word tabulation carried out on a BULL Gamma 10 tabulator. An illustration of this preliminary procedure is given in Figure 3.
Sanskrit   Input Format   German Meaning   Indonesian Input Format   German Meaning
agama      AA/G/A/M/A     RELIGION         A/G/A/M/A                 RELIGION
visa       V/I/.S/A       GIFT             B/I/S/A                   GIFT
dâsa       D/A/Z/A        ZEHN             D/A/S/A                   ZEHN
bhùmi      BH/UU/M/I      ERDE             B/U/M/I                   ERDE

Transformation rules read off the tabulation: AA → A, V → B, .S → S, Z → S, BH → B, UU → U, etc.

Figure 3. Phoneme transformation by tabulation
The algorithm of the computer program for an automatic phoneme transformation from Sanskrit into Indonesian is made up of the following main steps:
(0) An initial set of transformation rules P(0) is given (|P(0)| ≥ 1).
(1) Filtering the set of word quadruples by means of the set P(0), yielding as a result:
    a number of lists of corresponding word quadruples Lj(P(0));
    a list of residual word quadruples R(P(0)).
(2) Construction of a new, supplementary set of rules P(1) by philological reasoning, i.e. by manual scanning of R(P(0)).
(3) Filtering of the residual word list R(P(0)) by means of the unified set of rules P(0) ∪ P(1), yielding as a result:
    a new list of corresponding word quadruples Lj(P(1));
    a diminished list of residual word quadruples R(P(1)).
(4) Recursions of step 3, until the list of residual word quadruples is judged sufficiently small.
(5) Printout of:
    a complete catalogue of phonematic transformation rules;
    alphabetically arranged word lists for each type of phoneme transformation, including the sufficiently reduced remainder list of irregularities;
    frequency lists for the occurrence of particular transformations.
The algorithm turned out to be slightly susceptible in cases where the Sanskrit-Indonesian word couples were of different lengths. This happens when the
Indonesian syllabic glide vowel (ə) (spelled as an "e") is inserted between a consonant and (r), e.g. Sanskrit ISTRII = "wife" vs. Indonesian ISTERI = "wife"; or when an aspirated consonant in Sanskrit either loses its aspiration in Indonesian or is expanded, by insertion of a phoneme (a), into a full syllable, as in Indonesian BAHAGIA corresponding to Sanskrit BHAAGYA = "fortune", whereas for Sanskrit BHAA.SAA = "language" we have two corresponding forms in Indonesian, viz. BASA or BAHASA. Likewise, the Sanskrit suffix YA becomes either JA or simply I in Indonesian. Apart from these cases, a great number of philologically interesting irregularities due to reduction, contraction, and syllable formation have been revealed through our automatic transformation and have been formulated as regularized correspondences between the two languages. In this respect, hitherto unnoticed cases of nasalization in Indonesian words, especially before dental consonants, are of particular interest to orientalists.
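The recursive filtering loop of steps (1)-(4) is straightforward to sketch in code. In the fragment below, which is our own reading of the procedure rather than the original program, a rule is a pair of phoneme strings (Sanskrit, Indonesian), words are phoneme lists in the slash-delimited machine notation of Figure 3, and a couple "corresponds" if the Indonesian side can be derived phoneme by phoneme from the Sanskrit side; the length-changing cases just described are deliberately left in the residual list for the philologist.

    # Sketch of filtering steps (1)/(3): split the word couples into those
    # explainable by the current rule set and a residual list R.

    def derivable(skt, ind, rules):
        """True if the Indonesian phoneme list can be obtained from the
        Sanskrit one by rewriting each phoneme with an applicable rule
        (identity is always allowed)."""
        if len(skt) != len(ind):          # length-changing cases (ISTERI,
            return False                  # BAHAGIA) stay residual
        return all(s == i or (s, i) in rules for s, i in zip(skt, ind))

    def filter_couples(couples, rules):
        matched, residual = [], []
        for skt, ind in couples:
            (matched if derivable(skt, ind, rules) else residual).append((skt, ind))
        return matched, residual

    rules = {("V", "B"), ("Z", "S"), (".S", "S"), ("AA", "A"), ("UU", "U"), ("BH", "B")}
    couples = [
        (["V", "I", ".S", "A"], ["B", "I", "S", "A"]),                        # visa -> bisa
        (["D", "A", "Z", "A"], ["D", "A", "S", "A"]),                         # dasa
        (["BH", "AA", "G", "Y", "A"], ["B", "A", "H", "A", "G", "I", "A"]),   # residual
    ]
    matched, residual = filter_couples(couples, rules)
    # The analyst scans `residual`, adds new rules, and the loop repeats
    # until the residual list is sufficiently small (step 4).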
3. Phonomorphological Typology by Statistical Contrastive Analysis
As has repeatedly been pointed out by many investigators using different methods (cf., for example, Gabriel Altmann (Bratislava), with his statistical analysis of Indonesian morpheme structures, and his countryman Jiri Kramsky (Prague), who was concerned with a quantitative approach to the phonological typology of languages), a classification of natural languages appears feasible from various aspects of phonology and morphology. Both consonantal clustering, as with Altmann, and vowel frequency, as with Kramsky, may be taken as typological criteria. In our phonomorphological typology by statistical contrastive analysis of the most frequent morpheme types, as well as of the extent of consonant clustering, between Sanskrit on the one hand and Indonesian on the other, we obtained significant characteristics for the two different languages.
Using a modified form of P. Menzerath's terminology, a table of all morpheme types occurring in both our Sanskrit and Indonesian word collections was arranged according to syllable groups (G) and phoneme classes (C). "Classes" are defined by the number of phonemes (from 1 to 12 for Indonesian) present in a given morpheme and have been entered in the columns of our table. "Groups" are defined by the number of syllables (from 1 to 5 for Indonesian) and have been entered in the rows of the table. "Syllables" are considered merely as "word portions", consisting of exactly one vowel with from zero to n optional consonants before and/or behind it. The diphthongs AU and AI are counted as one vowel each. The output of our statistical analysis has shown that the maximum indices for C and G are greater in Sanskrit than in Indonesian, from which fact a general trend of reduction in Indonesian may be inferred.
A "morpheme type" (T) is defined by the intersection of phoneme class Ci and syllable group Gs, i.e. by the formula

T = Ci ∩ Gs.
In his article 'The Structure of Indonesian Morphemes' (Asian and African Studies, Vol. III, 1967, p. 23 ff.), G. Altmann described 143 different morpheme types of Modern Indonesian in this manner. As may be seen from his general table of morpheme types, the five most frequent forms of Modern Indonesian, each described by an intersection Ci ∩ Gj, are:

I.   CG(5; 2): CVCVC     Example: bagus = "fine"            (35%)
II.  CG(6; 2): CVCCVC    Example: bentuk = "form"           (17%)
III. CG(4; 2): CVCV      Example: batu = "stone"             (7%)
IV.  CG(7; 3): CVCVCVC   Example: halaman = "yard"           (6%)
V.   CG(4; 2): VCVC      Example: arak = "rice whisky"       (4%)
From another table, extracted from the first one, we may read the distribution of the 143 types within the syllable groups and phoneme classes, where each of the up to 12 classes has its maximal and minimal number of syllables and, similarly, each group has its maximal and minimal number of phonemes; or, as Altmann puts it, "each class spreads over one or more groups and each group spreads over one or more classes". If the range W is defined as the difference between the maximal and the minimal value of the phoneme or syllable number, then the mean range W̄ may be used as a significant phonomorphological criterion of linguistic typology. Thus, computing the group ranges Wg as the differences of class maxima and minima by the formula

Wg = CM − Cm,   where M and m denote maximal and minimal values respectively,

we obtain the mean group range by the formula

W̄g = (1/n) Σ Wg,   where n = 5, i.e. the number of syllable groups in Indonesian.
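Under the syllable definition above (one vowel nucleus per syllable, AU and AI counting as single vowels), the whole tabulation reduces to a few lines of counting. The sketch below is our own construction: it assumes words arrive as phoneme lists in the machine notation, with vowels identifiable by a fixed set, and the vowel inventory shown is illustrative only.

    # Sketch: build the C x G morpheme-type table and the mean group range.
    from collections import Counter

    VOWELS = {"A", "AA", "I", "II", "U", "UU", "E", "O", "AI", "AU"}  # illustrative

    def morpheme_type(phonemes):
        c = len(phonemes)                               # phoneme class C
        g = sum(1 for p in phonemes if p in VOWELS)     # syllable group G
        return (c, g)

    def mean_group_range(words):
        table = Counter(morpheme_type(w) for w in words)
        groups = {}                                     # G -> (min C, max C)
        for (c, g) in table:
            lo, hi = groups.get(g, (c, c))
            groups[g] = (min(lo, c), max(hi, c))
        ranges = [hi - lo for lo, hi in groups.values()]
        return sum(ranges) / len(ranges)

    words = [["B", "A", "G", "U", "S"], ["B", "A", "T", "U"],
             ["B", "E", "N", "T", "U", "K"], ["A", "R", "A", "K"]]
    print(mean_group_range(words))
    # group 2 spans classes 4..6 here, so Wg = 2 and the mean is 2.0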
The resulting index of 4.60 for the mean group range W̄g in Indonesian, as obtained by Altmann, tells us that on the average every syllable group spreads over 4.60 phoneme classes. On the whole, G. Altmann's findings for the structure of Indonesian morphemes were confirmed by our own computations, which, in addition, are used for a phonomorphological contrastive analysis between Indonesian and Sanskrit. Owing to the considerably smaller number of words subjected to the statistical analysis, the results gained in
our own computations showed slight deviations in the rank ordering of morpheme types, the combination CVCV (of third rank in Altmann's list) ranking topmost in our findings. Likewise, the value for the mean group range W̄g is somewhat lower in our own computations, viz. 2.83, whereas for Sanskrit it turned out to be 3.17. The results obtained so far from our investigations, especially the values of W̄g, are to be considered not only an informative measure of syllable and morpheme distribution but, above all, an accurate measure of consonant clustering in both languages. Our own computations served as a verification of the general rule: the greater the consonant clusters admitted in a given language, the greater the value of W̄g. This is quite plausible: if only one consonant may stand on either side of a vowel (syllable nucleus), yielding the morpheme type CVC, then the maximum phoneme number is 3; let its minimum be 1, and the group range is Wg = 3 − 1 = 2. If the phonomorphological system of a given language admits two consonants on both sides, yielding the structure CCVCC, then the class maximum is 5 and accordingly Wg = 4; and so forth. As Sanskrit is known as a language relatively rich in consonants, and Indonesian as rather poor in them, as is true of the entire Malayo-Polynesian family, the difference between W̄g (Sanskrit) and W̄g (Indonesian) may at the same time be regarded as an expression of the degree to which foreign structures can be adapted into the phonomorphological system of the Bahasa Indonesia. As the phonemic transformation rules in the first part of our investigation and the quantitative contrastive analysis in the second part have shown, words of Sanskrit origin or pattern are easily and skillfully assimilated to the phonomorphological standards of the Indonesian language and incorporated, not without a certain natural elegance and gracefulness, into the analytic-agglutinative system of its grammar.
In order to overcome the difficulties of Sanskrit spelling, we used a machine code developed by Bart van Nooten at the American Institute of Indian Studies, Deccan College, Poona, India. The full code may be seen in Figure 4. The input for the automatic language data processing consisted of word quadruples on punched tape; the output consisted of printed lists and numerical tables. A small-size computer of the type C 8206 (made in the G.D.R.), equipped with only a drum store of 4096 cells of 33 bits, served our present purposes quite well.
Figure 4. Machine code for the Sanskrit alphabet, designed by B. van Nooten, American Institute of Indian Studies, Deccan College, Poona, India. [The table pairs each conventional transcription symbol with its machine transcription: long vowels are doubled (A/AA, I/II, U/UU), the syllabic liquids and other marked symbols carry a point prefix (.R, .RR, .L; .M for anusvāra, .H for visarga), the retroflex series is .T, .TH, .D, .DH, .N, the velar and palatal nasals are .G and .J, ś is Z and ṣ is .S.]
The only disadvantage of smaller types of computers is that the programs usually cannot be written in an adequate problem-oriented programming language, such as FORTRAN, COBOL, or SNOBOL, but only in a rather unwieldy autocode. In our case, however, this seems to be a matter of software rather than hardware impediments.

Friedrich-Schiller University, Jena
REFERENCES
Altmann, G. 'The Structure of Indonesian Morphemes', Asian and African Studies, Vol. III, 1967, p. 23 ff.
Beskrovnyj, V. M. 'О роли санскрита в развитии новоиндоарийских литературных языков' [On the Role of Sanskrit in the Development of the New Indo-Aryan Literary Languages], Современные литературные языки стран Азии (Collected Papers), Moscow, 1965, p. 62 ff.
Gonda, J. Sanskrit in Indonesia, Nagpur, 1952.
Kähler, H. Grammatik der Bahasa Indonesia, Wiesbaden, 1956.
Kahlo, G. Indonesische Forschungen. Sprachbetrachtungen. Abschnitt 3: Die Lautveränderungen bei Sanskritlehnwörtern im Malayischen, Leipzig, 1941.
Karow, O. and Hilgers-Hesse, I. Indonesisch-Deutsches Wörterbuch, Wiesbaden, 1962.
Soebadio, Nj. H. 'Penggunaan bahasa Sanskerta dalam pembentukan istilah baru' [The Use of Sanskrit in the Formation of Modern Terminology], Madjalah Ilmu-Ilmu Sastra Indonesia, No. 1, 1963, p. 47 ff.
Spitzbardt, H. 'Zur Entwicklung der Sprachstatistik in der Sowjetunion', Wissenschaftliche Zeitschrift der Friedrich-Schiller-Universität Jena, Gesellschafts- und Sprachwissenschaftliche Reihe, Heft 4, 1967, p. 487 ff.
Spitzbardt, H. 'Sanskrit Loan Words in the Bahasa Indonesia', Beiträge zur Linguistik und Informationsverarbeitung, Heft 19, 1970, p. 62 ff.
Staal, J. F. 'Sanskrit and Sanskritization', The Journal of Asian Studies, Vol. XXII, No. 3, 1963, p. 261 ff.
Wirjosuparto, S. 'Sanskrit in Modern Indonesia', United Asia, International Magazine of Afro-Asian Affairs, No. 4, Bombay, 1966, p. 165 ff.
Zain, St. M. Kamus Moderen Bahasa Indonesia [Modern Indonesian Dictionary], Djakarta, 1954.
IDENTIFICATION OF PHONEMES IN THE WRITTEN LANGUAGE BY A MAXIMUM-INFORMATION CRITERION: APPLICATION TO TURKISH, RUSSIAN, UZBEK, KIRGIZ AND TADJIK

A. TRETIAKOFF
1. We seek to classify the characters of a written language into 2 categories. We shall show that the most significant classification is the one that maximizes a quantity of information associated with the classification, in the way we shall now indicate.
1.1. To this end let us take, for example, a written text of modern Turkish in which 28 different characters appear. Let us arbitrarily class the first 14 characters (in alphabetical order) in one category and the last 14 in the other. Consider the groups of 2 consecutive characters; they can be of 4 different types:
(1) a character of the 1st category followed by a character of the 1st category (for example AA, AL, LC, etc.)
(2) a character of the 1st category followed by a character of the 2nd category (for example AP, KU, NZ, etc.)
(3) a character of the 2nd category followed by a character of the 1st category
(4) a character of the 2nd category followed by a character of the 2nd category.
Let N11, N12, N21 and N22 denote the respective frequencies of these different types, and p11, p12, p21, p22 the corresponding probabilities
(pij = Nij/N, etc., with N = N11 + N12 + N21 + N22). We may associate with this classification the quantity of information

I = p11 Log2 [p11/(p1. p.1)] + p12 Log2 [p12/(p1. p.2)] + p21 Log2 [p21/(p2. p.1)] + p22 Log2 [p22/(p2. p.2)]
where p1. = p11 + p12 (the probability of finding a character of the 1st category at the beginning of a group), and so on. If the categories follow one another at random, then p11 = p1. p.1, etc., which gives I = 0. For example, in the chosen Turkish text we found:

N11 = 3321    p11 = 0.406
N12 = 2043    p12 = 0.250
N21 = 1842    p21 = 0.225
N22 =  976    p22 = 0.119
If the chosen categories followed one another at random, we would have:

N11 = 3385    p11 = 0.414
N12 = 1979    p12 = 0.242
N21 = 1778    p21 = 0.217
N22 = 1040    p22 = 0.127
As this is not the case, there is a correlation between the categories so defined, and knowledge of their law of succession contributes the quantity of information I = 0.0017. This quantity of information is very small, because the chosen categories follow one another almost at random.
1.2. There are 2^27 ways of classifying the 28 characters of Turkish into 2 categories. To each of these classifications corresponds a quantity of information that can be computed in the way just indicated. One of these classifications yields the greatest possible value of I. This maximum quantity of information amounts to 0.560 for the chosen Turkish text, and the corresponding classification divides the characters into consonants and vowels.
The frequencies of the groups of the different types are now:

Nvv =   17
Nvc = 3407
Ncv = 3812
Ncc =  946

whereas, if vowels and consonants followed one another at random, we would have:

Nvv = 1602
Nvc = 1822
Ncv = 2227
Ncc = 2531
We see that the vowel-vowel group is very infrequent: only 17 occurrences, whereas it should appear 1602 times at random.
1.3. These results bring out the qualitatively well-known law of consonant-vowel alternation, of which we provide here a measure. If this law were perfect, that is, if a vowel were always followed by a consonant and conversely, the associated quantity of information would amount to 1 unit ("bit").
2. We have applied the same method to the following languages: Russian, Uzbek, Kirgiz and Tadjik, transcribing the Cyrillic characters Я by JA, Ё by JO and Ю by JU. We shall show that this transcription is justified, except for Russian. As in the case of Turkish, the classification into 2 categories associated with the maximum of information again divides the characters into consonants and vowels. The quantities of information obtained are respectively:

Russian  0.419     Kirgiz  0.453
Uzbek    0.506     Tadjik  0.458
This shows that the consonant-vowel alternation is well marked in all these languages.
3. For the non-Slavic languages of the USSR, which have used the Cyrillic alphabet since 1940, the characters Я, Ё, Ю have been employed to note sounds which, in the Turkish of Turkey, are written with 2 characters: YA, YO, YU. In Turkish, indeed, the "yod" is regarded as an autonomous consonant and not as palatalization of a vowel.
3.1. In order to compare these two notations, we repeated the computation of the quantity of information associated with the optimal classification into 2 categories, keeping all the Cyrillic characters for Russian, Uzbek, Kirgiz and Tadjik, and transcribing for Turkish the groups YA, YO, YU by the Cyrillic characters.
3.2. The results obtained are presented in Table 1. One observes that the associated quantity of information decreases in this case for all the languages except Russian. One may conclude that, if the quality of the consonant-vowel alternation is used as a criterion, Russian is the only one of the languages studied in which the use of a single character to note the sounds Я, Ё, Ю is justified.
4. An analogous problem arises for phonemes peculiar to certain Turkic languages of Asia, such as NG, which is noted sometimes by a single sign and sometimes by 2.
4.1. For Kirgiz and Uzbek we carried out the optimal classification into 2 categories, once writing NG with 2 distinct characters, and a second time writing it with a single special character.
Table 1

            JA, JO, JU noted      JA, JO, JU noted
            by 1 character        by 2 characters
TURKISH          0.534                 0.560
RUSSIAN          0.427                 0.419
UZBEK            0.487                 0.506
KIRGIZ           0.443                 0.453
TADJIK           0.427                 0.458
In every case we again find the division between consonants and vowels. The special character for NG is classified among the consonants, and its use improves the consonant-vowel alternation, as Table 2 shows.

Table 2

          NG noted by          NG noted by
          two characters       one character
Kirgiz        0.442                0.452
Uzbek         0.506                0.546
We see that, by this criterion, the use of a special character to represent this phoneme is justified in Kirgiz and would be desirable in Uzbek.
5. We hope to have shown by these various examples that a maximum-information criterion associated with the classification of the characters into 2 categories yields a measure that is the better, the more faithfully the phoneme-character correspondence is realized. This suggests a procedure for the automatic identification of phonemes in a written text.
APPENDIX

A.1. Quantity of information associated with a classification into categories

A.1.1. Let Nij be the number of groups consisting of the character i followed by the character j, and let N = Σij Nij be the total number of two-character groups analyzed. We define the following probabilities:

pij = Nij/N           (probability of the group i, j)
pi. = Σj pij          (probability of having i in 1st position)
p.j = Σi pij          (probability of having j in 2nd position)
p(j|i) = pij/pi.      (probability of having j after i)
A. 1.2. Supposons les caractères classés en catégories. On définira les probabilités suivantes: Pab =
2
PU (probabilité d'avoir un caractère quelconque de la catégo-
i cat. A j cat. B
rie A suivi d'un caractère quelconque de la catégorie B) pA. =
^
pj. (probabilité d'avoir en 1ère position un caractère quel-
i cat. A
conque de la catégorie A) p. B =
^
p.j (probabilité d'avoir en 2ème position u n caractère quel-
j cat. B
conque de la catégorie B) Pab/Pa(probabilité d'avoir un caractère quelconque de la catégorie B après un caractère quelconque de la catégorie A) A. 1.3. L a quantité moyenne d'information par caractère placé en 2ème position est, si l'on ne possède aucun renseignement sur la loi de succession des caractères, c'est-à-dire si on connaît seulement les p.j P b si a
=
^ - J p - j W P - I ) i
Si l'on connaît la loi de succession des caractères, c'est-à-dire les probabilités Py, la quantité moyenne d'information par caractère placé en 2ème position devient: r = -
^PuMPjsii)
The mean loss of information per character in 2nd position is therefore

I − I' = Σij pij Log2 [pij/(pi. p.j)].

A.1.4. If we know only, for each character i, the category A to which it belongs, together with the law of succession of the categories, i.e. the pAB, we can compute an approximate value of p(j|i). Indeed, if we know that i belongs to category A and j to category B, we have

p(j|i) ≈ p(B|A) × p.j/p.B = [pAB/(pA. p.B)] p.j,

or again, writing A = cat(i) and B = cat(j),

p(j|i) ≈ [p(cat i)(cat j)/(p(cat i). p.(cat j))] p.j.

The mean quantity of information per character in 2nd position then becomes

I'' = − Σij pij Log2 {[pAB/(pA. p.B)] p.j}.

The mean loss of information per character is therefore

I − I'' = Σij pij Log2 [pAB/(pA. p.B)],

or again, summing the pij over like categories,

J = ΣAB pAB Log2 [pAB/(pA. p.B)].

We shall say that the quantity of information J given by this expression is associated with this classification into categories.

A.1.5. If the characters are classified into two categories 1 and 2, and if

p11 = p22 = 0,   p12 = p21 = 0.5,

then p1. = p.1 = p2. = p.2 = 0.5 and J = 1.

We see, then, that for a perfect alternation of categories 1 and 2, J attains its maximum value, equal to 1.
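The quantity J and the search for the classification that maximizes it are straightforward to express in code. The sketch below is our own reading of the procedure, not the author's program: it computes J from bigram counts and, instead of enumerating all 2^27 bipartitions, improves an arbitrary starting partition by greedy single-character moves.

    # Sketch: the information J of a bipartition of characters (A.1.4)
    # and a greedy search for the partition that maximizes it.
    from collections import Counter
    from math import log2

    def info_J(bigrams, cat):
        """J = sum over category pairs of p_AB * log2(p_AB / (p_A. * p_.B))."""
        N = sum(bigrams.values())
        pAB, pA, pB = Counter(), Counter(), Counter()
        for (i, j), n in bigrams.items():
            a, b = cat[i], cat[j]
            pAB[a, b] += n / N
            pA[a] += n / N
            pB[b] += n / N
        return sum(p * log2(p / (pA[a] * pB[b])) for (a, b), p in pAB.items())

    def best_partition(text):
        bigrams = Counter(zip(text, text[1:]))
        chars = sorted(set(text))
        cat = {c: k % 2 for k, c in enumerate(chars)}   # arbitrary start
        best, improved = info_J(bigrams, cat), True
        while improved:
            improved = False
            for c in chars:              # try moving one character across
                cat[c] ^= 1
                j = info_J(bigrams, cat)
                if j > best:
                    best, improved = j, True
                else:
                    cat[c] ^= 1          # undo the move
        return cat, best

    text = "karena itu maka pada mula pidato ini saja bitjara".replace(" ", "")
    categories, J = best_partition(text)
    # with a long enough text the two categories come out as vowels vs consonants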
STATISTICAL METHODS IN LANGUAGE DESCRIPTION
A PROBABILISTIC MODEL FOR PERFORMANCE

RAOUL N. SMITH
An interesting place for probabilistic models has been suggested by recent studies in various areas of linguistic performance. These activities include the resurgence of statistical stylistics (for example, Bailey and Dolezel 1968), the attempts at correlating various social and linguistic phenomena by sociolinguists (e.g. Labov 1964), and the continued interest in probabilistic models of learning by psychologists (for example, Suppes 1969). An increase of activity in some of these areas can be attributed to the paucity of interesting models to account for the variety of performance data that has not yet been incorporated in any model of language use. A partial solution to the model problem is emerging from recent research by formal language specialists interested in communications problems (see, in particular, various recent articles in Information and Control). It is the purpose of this paper to apply the concepts of one such formal model (proposed in Salomaa 1969) to a particular grammar of a natural language and to suggest possible uses and interpretations of this expanded model for accounting for facts of performance.
Chomsky (1965: 4) has defined performance as "the actual use of language in concrete situations". Further he states that a theory of performance must account for "how the speaker or hearer might proceed, in some practical or efficient way, to construct a derivation" (Chomsky 1965: 4). The factors involved in or accounted for by any performance model should include "memory limitations, intonational and stylistic factors, 'iconic' elements of discourse (for example, a tendency to place logical subject and object early rather than late . . .), and so on" (Chomsky 1965: 11). If we also add to these factors such phenomena as hesitation pauses, complexity of sentence construction (in terms of embedding and conjoining), choice of certain linguistic items within specific extra-linguistic contexts, and others, we see that all of these and many more are what the socio-linguist, psycho-linguist, style analyst, and communication engineer are interested in, that is, performance variables. (Some of these factors could also be accounted for by richer competence models, for example by incorporating constraints from genre, which, to me, belong to a theory of competence.)
As a basic ingredient a performance model should obviously incorporate characteristics of the native speaker's competence. As Chomsky says (1965: 9), "No doubt, a reasonable model of language use will incorporate, as a basic component, the generative grammar that expresses the speaker-hearer's knowledge of the language." And later: "it seems that the study of performance models incorporating generative grammars might be a fruitful study; furthermore, it is difficult to imagine any other basis on which a theory of performance might develop" (Chomsky 1965: 15). In addition, the choice of underlying competence model will help to determine the richness of the performance model. The existence of a competence model interacting in some way with, or as part of, the performance model is probably accepted by most researchers. What form the performance model should take, however, especially with respect to the role of probabilities, appears to be an open question. Since no interesting or generally accepted models for performance have been proposed (except for the recent suggestions in Fromkin 1971), I would like to submit the adoption of a probabilistic model, at least as an interim solution, to the problem of accounting for some of the performance factors mentioned above.
A recent model proposed by Salomaa (1969) has appeared in the formal grammar literature that offers particularly appealing characteristics for solving the performance problem. This model is appealing because it has built in the machinery for accounting for some of the particular factors mentioned earlier, in particular "the tendency to place logical subject and object early rather than late", sentence complexity, and so forth, and in addition for accounting for what Chomsky refers to as acceptable,¹ as relating "not to a particular rule, but rather to the way in which the rules interrelate in a derivation" (1965: 12). In his paper Salomaa describes a class of grammars whose rewrite rules are restricted in some form. The grammar which he describes and formalizes in most detail is a probabilistic grammar. It is the purpose of the present study to apply the concepts of the formal model developed by Salomaa to a non-probabilistic competence model proposed for a natural language (Rosenbaum 1967), but with probabilistic constraints conforming in format to Salomaa. In this way the native speaker's competence is still accounted for (i.e., all and only the sentences of the grammar are generated), but certain propensities for particular structures, etc., that is, individual performance characteristics, can also be accounted for. The resultant probabilistic grammar is presented as a performance model rather than a competence model and is therefore interpretable in psychological, stylistic, social or other performance terms.

¹ In Chomsky, acceptability relates to performance, while grammaticality relates to competence.

Let G = {VN, VT, S, F} be the context-free phrase structure grammar described in Rosenbaum (1967), where VN = {VP, NP, S}, VT = {VB, N, T, #}, S ∈ VN is the initial symbol, and F is the set of core CF rewrite or production rules fi. In addition let GP = {G, S, Φ}, where G is as above, S is a stochastic vector representing the initial probability distribution of the rewrite rules, and Φ is a stochastic vector associated with each rewrite rule f whose i-th component indicates the probability that the i-th rewrite rule is applied after f. This defines a probabilistic grammar GP. To make these statements concrete, let us look at the actual rules posited for English by Rosenbaum:

S  → # T NP VP #
VP → VB (NP) (NP / S)
NP → NP S
NP → N (S)

Inflating these rules in the usual way and putting them into Salomaa's format, with probabilities calculated for an equiprobable expansion at each choice, we obtain:

f1: S  → # T NP VP #   (0, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8)
f2: VP → VB            (0, 0, 0, 0, 0, 0, 1/3, 1/3, 1/3)
f3: VP → VB NP         (0, 0, 0, 0, 0, 0, 1/3, 1/3, 1/3)
f4: VP → VB NP NP      (0, 0, 0, 0, 0, 0, 1/3, 1/3, 1/3)
f5: VP → VB NP S       (1/4, 0, 0, 0, 0, 0, 1/4, 1/4, 1/4)
f6: VP → VB S          (1/4, 0, 0, 0, 0, 0, 1/4, 1/4, 1/4)
f7: NP → NP S          (1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9)
f8: NP → N S           (1/6, 1/6, 1/6, 1/6, 1/6, 1/6, 0, 0, 0)
f9: NP → N             (0, 1/5, 1/5, 1/5, 1/5, 1/5, 0, 0, 0)

where the parenthesized expression for each rule fi is the vector Φ = (pi(1), pi(2), . . ., pi(n)), pi(j) representing the probability of applying rule fj after rule fi, with Σ(j=1..n) pi(j) = 1, and the initial probability distribution is S = (1, 0, 0, 0, 0, 0, 0, 0, 0).
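Before looking at example derivations, it may help to see the mechanics of GP in executable form. The simulator below is our own sketch, not part of the paper: it samples a control word by drawing each successive rule from the Φ-vector of the previously applied rule and rewriting the leftmost occurrence of that rule's left-hand side.

    # Sketch: sample a derivation (control word) from the probabilistic
    # grammar GP above. PHI[i][j] is the probability of applying rule j
    # after rule i; S_INIT is the initial distribution.
    import random

    RULES = [("S",  ["#", "T", "NP", "VP", "#"]),
             ("VP", ["VB"]), ("VP", ["VB", "NP"]), ("VP", ["VB", "NP", "NP"]),
             ("VP", ["VB", "NP", "S"]), ("VP", ["VB", "S"]),
             ("NP", ["NP", "S"]), ("NP", ["N", "S"]), ("NP", ["N"])]

    S_INIT = [1, 0, 0, 0, 0, 0, 0, 0, 0]
    PHI = [[0] + [1/8] * 8,
           [0]*6 + [1/3]*3, [0]*6 + [1/3]*3, [0]*6 + [1/3]*3,
           [1/4] + [0]*5 + [1/4]*3, [1/4] + [0]*5 + [1/4]*3,
           [1/9]*9,
           [1/6]*6 + [0]*3, [0] + [1/5]*5 + [0]*3]

    NONTERMINALS = {"S", "NP", "VP"}

    def derive(max_steps=30):
        string, control, dist = ["S"], [], S_INIT
        for _ in range(max_steps):
            if not any(x in NONTERMINALS for x in string):
                return string, control            # terminal string reached
            i = random.choices(range(9), weights=dist)[0]
            lhs, rhs = RULES[i]
            if lhs in string:                     # the zero entries of PHI
                k = string.index(lhs)             # normally guarantee this
                string[k:k+1] = rhs
                control.append(f"f{i+1}")
                dist = PHI[i]
        return string, control

    print(*derive())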
For example, the following trees might have been generated by applying rules f1, f2, and f9 in the order described by the labels on the left.
[Tree diagram: the derivation applying f1, then f2, then f9, yielding # T N VB #]

where p(f1) = 1, p(f2 after f1) = 1/8, p(f9 after f2) = 1/3.

This tree is equivalent, except for the order of application of the rules, to

[Tree diagram: the derivation applying f1, then f9, then f2]

where p(f1) = 1, p(f9 after f1) = 1/8, p(f2 after f9) = 1/5.

Both of these trees represent the sentence P = # T N VB # of the language L(GP). We can represent their derivational histories by a control word, which is a record of the application of the rewrite rules in the form of the left-to-right succession of the labels of the rewrite rules applied in the derivation. There are two different control words for these two trees (contrary to the usual interpretation of the ordering of rule application in a competence model; see Bach 1964). Let

D:  P0 →(fj(1)) P1 →(fj(2)) . . . →(fj(k)) Pk

be a derivation according to GP, where Pi represents a line in the derivation and fj(i+1) is the label of the production going from Pi to Pi+1. Then for these two terminal strings:

D1: P0 = S
    P1 = # T NP VP #
    P2 = # T NP VB #
    P3 = # T N VB #

and

D2: P0 = S
    P1 = # T NP VP #
    P2 = # T N VP #
    P3 = # T N VB #
we can define two different control words: for D1, f1(1) f2(2) f9(3), and for D2, f1(1) f9(2) f2(3). These two derivations differ only in line P2, which records the different order of expansion of NP and VP. But although the order of application in a non-probabilistic grammar is irrelevant, it would be relevant for a performance grammar, for example for the left-to-right stringing of words in time. If more weight should be given to generating an NP before a VP, then more weight should be given to the derivation D2 than to D1. We can characterize this notion by defining a function ψ(D), a measure of the probability of deriving a sentence as the weights of application differ. (This definition of ψ(D) differs from Salomaa, who uses length of derivation as a measure, since his function presupposes a grammar in a normal form where all rules are expansion rules into more than one symbol.) We can achieve this by considering
E: the distance between a given P and its Aa within a given reference mechanism or within a UT; Emax: the maximal distance between a P and its Aa. The distance is measured in sentences and in segments [see [14]: independent segments, partial segments and expanding partial segments (see below)] and is designated by a sequence of two figures and the sign «−», which means «to the left»; the first of these two figures gives the distance in sentences and the second the distance in segments. For example, in the following UT:

«Известно, что членораздельная речь (Aa I, II, IV) со всеми свойственными ей (P I) особенностями, как структурными, так и функциональными — это чрезвычайно сложное явление. // Она (P II) возникла, разумеется, не сразу, // как не сразу возник и сам человек (Aa III) с его (P III) подлинно человеческим мышлением. // Она (P IV) явилась результатом (Ap V) длительного развития (Ap V) трудовой деятельности первобытного человека (Aa V), его (P V) мышления, всё усложняющихся социальных связей, мозга и периферического речевого аппарата. //»

the value of E (for P IV) is (−2, −2).

S: segment; a predicative utterance, that is, a simple clause or a construction with a participle or a gerund (the Russian «причастные» and «деепричастные обороты») (see [14]; see also [15]). Thus in the UT
(P I) OCOÔeHHOCTHMH, KaK CTpyKTypHbIMH, T3K H (J>yHKliHOHajIbHbMH — 3T0 4pe3BbmaHH0 cjio>KHoe HBjieHHe //. Ona (P II) B03HHKJia, pa3yMeeTca, He cpa3y,// KaK ne cpa3y B03hhk h caM neAoeeK (Aa III) c ezo (P III) n0AJiHHH0 qejiOBeqecKHM MbiuiJTeHHeM. Il Ona (P IV)flBHjiacb pe3yAbmamoM (Ap V) AJiHTejrbHoro pa3eumun (ApV) TpyAOBOH AeHTejibHOcra nepBoôbiTuoro ueAoeeica (AaV), ezo (P V) MbimjieHHH, Bce ycjio>KHHK)mnxcH coi;HajibHbix CBH3en, M03ra h nepH(J)epHqecKoro pe^eBoro annapaTa. //» La valeur de E (pour P IV) est (-2, -2). S — segment; c'est un énoncé prédicatif, c'est-à-dire une proposition simple une construction avec un participe ou un gérondif (les «npH^acTHbie» et les «AeenpimacTHbie oSopora») (voir [14]; voir aussi [15]). Ainsi dans l'UT «Tenepb y w e H3 y 3 H a T b (C 1), // k
HenocpeACTBeHHoro b o c i i p h h t h h KaKOMy npedMemy
(Ap I)
HJIH
uepozAutfa
(Aa)
HeB03M0>KH0
KAaccy (Ap II) ripe^MeTOB
OH
(P)
OTHOCHTCH ( C 2).»
nous avons deux segments (ils sont désignés dans les exemples russes par C 1, C 2 etc. et séparés par //). SP — segment partiel (dans la terminologie slave MC); segment qui représente un noyau incomplet d'une proposition simple ou d'une construction avec un participe (ou un gérondif). Par exemple: «I4ejibH00(J)0pMJieHH0CTb cAoea (Aa) (C 1 / 4 C 1), Il BbWBJiHiomaflCH b ezo ( P I )
BnyrpeHHeM crpoeHHH (C 2), // oTrpaHuqHBaeT ezo (P II) o t cjiOBOcoqeTaHHii (C 1/MC2),// Kcvropbie MoryT SbiTb onpe/i,eJieHbi KaK o6pa3oeaHHH «pa3AejibH0o^opMJieHHbie» (C 3). Il»
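The distance bookkeeping lends itself to a simple representation. In the sketch below, which is our own construction rather than part of the paper, a working unit is a list of sentences with a known number of segments each, and each pronoun or antecedent occurrence is addressed by a (sentence index, segment index) pair; E then comes out as a pair of leftward offsets.

    # Sketch: compute E = (distance in sentences, distance in segments)
    # between a pronoun P and its antecedent Aa, both addressed as
    # (sentence_index, segment_index) within a segmented working unit.

    def distance_E(p_pos, aa_pos, segments_per_sentence):
        (ps, pg), (as_, ag) = p_pos, aa_pos
        d_sent = as_ - ps                    # negative = to the left
        # flatten segment addresses to count segments across sentences
        flat = lambda s, g: sum(segments_per_sentence[:s]) + g
        d_seg = flat(as_, ag) - flat(ps, pg)
        return (d_sent, d_seg)

    # illustrative case: a P standing two sentences (and two segments)
    # to the right of its Aa, in a fragment of three one-segment sentences
    print(distance_E((2, 0), (0, 0), [1, 1, 1]))   # -> (-2, -2)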
SPD: expanding partial segment (in the Slavic terminology ЧСР); a group of words which expands an independent segment having a complete nucleus. For example:

«В отличие от языка (Aa) (C 2/ЧСР), // который (P) создавался людьми неосознанно (C 1), // письменность является результатом сознательного творчества людей и на всём протяжении своего исторического развития изменялась вполне преднамеренно (C 2).»
Type of structure: by the term structure we conventionally understand the graphic representation of the relations between the Aa and the P, and between the Ap and the P. In UT with a simple reference mechanism (MRS) without Ap this relation is obviously direct; in UT with several MR (or with Ap), on the other hand, more complicated types of relations arise, which we shall call: successive multi-level structure (type 2), intersecting structure (type 3) and nested structure (type 4). These types of structure, which combine in very complicated ways in actual MR and UT, may be represented graphically as follows:

type 1:  Aa → P
type 2:  Aa → P I → P II → P III
type 3:  Aa II . . . Aa I . . . P II . . . P I    (the two mechanisms cross)
type 4:  Aa II . . . Aa I . . . P I . . . P II    (one mechanism enclosed in the other)
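Classifying a pair of reference mechanisms into these types reduces to comparing the relative order of their Aa and P positions. The sketch below is our own, deliberately simplified illustration: each mechanism is reduced to the span between its Aa and its last P in a linearized UT, and types 3 and 4 are told apart by whether the spans cross or nest.

    # Sketch: classify the relation between two reference mechanisms,
    # each given as a span (aa_position, last_p_position), aa < p.

    def relation(m1, m2):
        """Return 'successive', 'intersecting' or 'nested'."""
        (a1, p1), (a2, p2) = sorted([m1, m2])     # make m1 start first
        if p1 < a2:
            return "successive"                   # no overlap
        if a1 < a2 <= p1 < p2:
            return "intersecting"                 # type 3: spans cross
        return "nested"                           # type 4: one inside the other

    print(relation((0, 3), (5, 8)))   # successive
    print(relation((0, 6), (2, 9)))   # intersecting
    print(relation((0, 9), (2, 6)))   # nested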
C. Types of classification
Having analyzed the chosen text corpus according to the accepted parameters and on the basis of the notions introduced, we carried out the following classifications. First, all the working units were classified according to the number (1, 2, 3, etc.) of MRS, i.e. UT containing 1 P, 2 P, etc. After that, within the groups thus constituted, the following classifications were made: according to the type and the combinations of the P; according to the position
of the Aa with respect to the Ap and their E; according to the position of the Aa with respect to their P and their E; according to Qmax; according to Emax; according to the repeated A and the coinciding A; etc. This done, the results obtained for the texts of each of the three languages were compared on the following criteria: the number of UT in 100 pages of text; the mean number of MR in 100 UT; the maximal number of P per Aa; the distribution of the UT over the values of Q; etc.

*
The analysis carried out gave us a large quantity of data concerning the status and the functioning of the P and the MR in Russian, Bulgarian and German scientific texts. In the following part of our paper we shall note some of these data, above all those on which, in one respect or another, the organization of the automatic analysis and the fixing of the criteria for choosing the Aa depend.

III. Some results of the analysis

1. The mean number of UT in 100 pages of text in each of the three languages we have chosen is approximately the following (Table 1):

Table 1. Mean number of UT in 100 pages of text

R: 240     B: 230     A: 570
2. In 240 UT (100 pages of Russian text) there are on average 350 MR; in 230 UT (100 pages of Bulgarian text), 340 MR; and in 570 UT (100 pages of German text), on average 1288 MR. Consequently, for 100 UT we obtain the following mean numbers of MR (Table 2):

Table 2. Mean number of MR in 100 UT

R: 140     B: 137     A: 226

3. According to the number of P, and consequently of MRS, that they contain, the UT are distributed in the three languages as follows (Table 3):

Table 3. Distribution of the UT according to the number of P in each

Number of P     R     B     A
 1 P          175   177   263
 2 P           37    36   157
 3 P           18    15    72
 4 P            8     1    41
 5 P            1     1    20
 6 P            -     -     4
 7 P            -     -     5
 8 P            -     -     6
12 P            -     -     1
22 P            -     -     1
4. According to the maximal number of P referring to one Aa we have the following distribution (Table 4):

Table 4. Maximal number of P per Aa

R: 7     B: 6     A: 16
5. According to Qmax (i.e. the maximal number of Aa and Ap) in a UT we have the following distribution (Table 5):

Table 5. Distribution according to Qmax

R: 15     B: 9     A: 20
6. Emax (i.e. the maximal distance in sentences and in segments between a given P and its Aa) takes approximately the following values (Table 6):

Table 6. Distribution according to Emax

R: −2, (−7)     B: −2, (−5)     A: −10, (−27)

7. The approximate distribution of the UT according to the respective positions of the Aa and the P (in the same segment; in neighboring segments; in non-neighboring segments), given on the basis of 200 Russian UT, may be represented as follows (Table 7):
Table 7. Distribution of the UT according to the respective positions of the Aa and the P

 Q    Total UT    Same segment    Neighboring segments    Non-neighboring: −2   −3   −4    Emax
 1       51            9                 38                       4              0    0     −2
 2       65           23                 39                       1              2    0     −3
 3       35            9                 21                       3              2    0     −3
 4       20            8                 10                       2              0    0     −2
 5       14            5                  7                       1              0    1     −4
 6        5            3                  0                       1              1    0     −3
 7        4            2                  1                       0              1    0     −3
 8        3            2                  1                       0              0    0     −1
 9        3            1                  2                       0              0    0     −1
10        1            1                  0                       0              0    0     −1
8. In the UT taken from 100 pages of Russian text one finds approximately 52 A that are repeated; 136 coinciding A; 26 Aa that are formally distinguishable and 41 Aa that are not formally distinguishable.
9. Quantitatively, the types of structure are distributed in the Russian texts approximately as follows: type 1, 21; type 2, 51; type 3, 33; type 4, 10.
10. The linguistic material has broadly confirmed our initial hypothesis, namely that in Russian, Bulgarian and German scientific texts the Aa and the Ap can stand only to the left of the corresponding P. But an exception was likewise established for the pronoun свой, which can also have a postcedent, i.e. an A standing to the right of the P. For example:

«Действительно, в практическом применении к своей (P) «французской стилистике» «свободная таблица идентифицирующих слов и их синонимов» Ш. Балли (Post) вводит в синонимические ряды слов важнейшие фразеологические единицы.»

IV. Some conclusions
The results obtained in the course of our linguistic analysis throw some light on the status of the P and the functioning of the MR in Russian, Bulgarian and German texts of a scientific (more exactly, linguistic) character, and permit us to make some observations, to draw some general conclusions and to formulate certain proposals concerning the organization of the analysis and the character of the choice criteria in the model for identifying the actual antecedents of pronouns in the automatic analysis of texts. Here are some of them:
1. The data of Tables 1 and 2 prove clearly that reference mechanisms (MR) are a very frequent phenomenon in all the languages considered (as, probably, in all developed languages): approximately 240-250 per 100 pages of Russian and Bulgarian scientific text and 560-590 per 100 pages of German scientific text. This observation confirms, on the one hand, the capital importance of solving the problem of antecedent identification in the process of automatic text analysis; on the other hand, it shows that it would be vain to attempt such a solution by way of palliatives and fragmentary enlargements of already existing analysis schemes; it can be obtained only by elaborating a compact model (see below).
2. The quantitative data cited in these two tables, as well as the data obtained for all the other parameters, show that, while the status and the functioning of the P and the MR in the Russian and Bulgarian texts are almost identical both quantitatively and qualitatively, the German texts are very different: on the one hand, the number of UT and MR is more than twice as great; on the other, these UT and MR are considerably more complicated syntactically and semantically. This leads us, among other things, to the conclusion that a model for identifying the actual antecedents of pronouns should first be elaborated for Russian scientific texts and then, after some not very complicated modifications, adapted to Bulgarian scientific texts. This will give us the necessary conditions and will let us see more clearly the ways of solving the problem for German texts.
3. The values we have established for Emax between the P and the Aa in sentences (see Table 6) show that we have every reason to suppose that in most cases the value of Emax in scientific texts will not exceed two or three sentences to the left.8 (As follows from Table 6, for the German texts the value of Emax in sentences reaches −10; but the UT cited in the example on pages 15-17 can be subdivided into three UT according to the corresponding Aq, and consequently −10 is reduced to −3.) The finding that in texts of the type considered the value of Emax is not greater than −2 or −3 is of great importance for the organization of the first stages of automatic analysis. In essence, as is well known, this analysis can be organized on two dia-

8 The analysis shows that in the case of positive values of E, i.e. when the distance is measured to the right of the P up to its postcedent, this postcedent always stands in the same sentence as the corresponding P, and consequently in these cases the value of E in sentences is always equal to zero.
metrically opposed principles: excentric and concentric. But in both cases it is very important to be able to establish beforehand the minimal and sufficient limits of the working fragment of the text. Thus, relying on the results of our analysis, we intend to fix the preliminary left limit of the working fragment established by the computer at −2 sentences.
4. Let us likewise examine the following data: (a) as follows from Table 3, a considerable number of UT contain only one P and, consequently, one MRS; (b) in the cases where the UT count more than one MRS, it is possible in first approximation to accept that their mean number per UT varies from 2 to 4; (c) in each MRS the mean number of Ap is approximately 2 to 3. These data permit us to formulate the following view of the organization of the first steps of the automatic analysis, once the computer has established the preliminary working fragment of at least two sentences to the left of the first P (counting the sentence in which the P stands): considering that, on the one hand, the majority of UT contain only one MRS and, on the other, that the mean number of MR in the complicated UT is 2 to 4, which produces very intricate interlacings (without losing sight of the fact that each MR may in addition contain on average 2-4 Ap), it seems useful to us to provide a preliminary procedure for decomposing the complicated UT (i.e. those containing more than one MR) into MRS and treating them separately one by one (as well as a procedure for subdividing the MRC into MRS).
5. The decomposition of UT and MRC into simple MR, cited above, poses the problem of establishing beforehand the limits of these MRS within the preliminary UT established by the computer (the preliminary text fragment). Relying on the data of Table 7, which show that with respect to their mutual position the Aa and the P stand in most cases in neighboring segments, it seems reasonable to us to fix the left limit of the given simple MR at −1 segment. In case the application of the choice criteria within this segment does not yield positive results, its left limit will have to be widened to the limit of the preliminary UT.
6. From the linguistic point of view it is interesting to note the following fact: one sees in Table 4 that in none of the texts analyzed does the number of P referring to one Aa exceed 7 (for the German texts one must take into account the possibility of decomposing the UT according to the Aq). One may suppose that this fact results from the same deep psycho-physiological mechanism and capacities of human memory which
condition the property of natural languages stated by Yngve's hypothesis. Perhaps an explanation of the same type should be given for the fact that the value of Emax in segments does not in principle exceed −7. One should likewise think of a linguistic explanation of the fact (see Table 7) that in most cases the Aa and the P stand in neighboring segments (this fact is, of course, partially conditioned by the strong influence of the very high frequency, in the texts of the three languages, of the relative pronoun «который», which is almost always situated in the segment adjacent, on the right, to the segment containing its antecedent).

*
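Conclusions 3 and 5 amount to a concrete windowing policy: candidate antecedents are first sought one segment to the left of the pronoun, inside a preliminary fragment of two sentences to the left, widening only on failure. A sketch of that policy, in our own notation (the function names and the `choose` callback are invented for the illustration):

    # Sketch of the windowing policy of conclusions 3 and 5.

    def candidate_window(segments, p_index, widen=False):
        """`segments` is the linearized list of segments of the preliminary
        fragment (at most 2 sentences to the left of the pronoun);
        `p_index` is the index of the segment containing the pronoun."""
        lo = 0 if widen else max(0, p_index - 1)    # -1 segment by default
        return segments[lo:p_index + 1]

    def find_antecedent(segments, p_index, choose):
        """`choose` applies the choice criteria and returns an Aa or None."""
        aa = choose(candidate_window(segments, p_index))
        if aa is None:                              # widen to the whole fragment
            aa = choose(candidate_window(segments, p_index, widen=True))
        return aa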
Beyond the conclusions and observations mentioned above, the analysis we have carried out, together with some of our earlier work, permits us to formulate some reasoning on the most important problem arising in the elaboration of a model for identifying the Aa: the problem of the general principles on which this model and the choice criteria must be founded.
Ideally, the model for identifying the Aa would be constructed as follows: in the course of a complete semantic analysis (i.e. an analysis able to take account of extralinguistic references), "interpreting" the text from left to right, the computer would establish and fix in explicit form the meaning of each sentence; on encountering a P in a given sentence, and bearing in mind that, on the one hand, its Aa must in principle stand to the left (with the exceptions noted) and, on the other, that under this hypothesis the meaning is already established, attaching the given P to its Aa would pose no problems; it is clear that in such a situation the problem of excluding the Ap would not even exist. But today we are very far from such a solution being possible. That is why, in the present state of things, it seems to us rational (and alone possible) to elaborate a specialized model whose aim would be to choose the Aa and to exclude the Ap on the basis of definite criteria identified in the text. These criteria must be morphological, syntactic and, in the first place, semantic.
The work we have done so far allows us to say that morphological and, in part, syntactic criteria will make it possible to resolve approximately 20-30% of the ambiguities; consequently, they should be used only as preliminary filters. The majority of the ambiguities must be resolved by semantic criteria, applied in the course of running a system of semantic analysis, after the accumulation of sufficient information on the syntactic structure
of the corresponding working units (and likewise using the extensive information given in the dictionary).
As to the preliminary filters, the following possibilities may be noted:
A. First, certain morphological features could be used:
(1) Grammatical gender, to achieve the «formal distinction» in identifying the MR.
(2) In treating the MR with the pronoun свой, the information on the number of the verb can be used (e.g. Иван взял свою книгу).
(3) Fairly elementary morpho-syntactic information can be drawn on in treating the MR with the P который: in principle its Aa must be situated in the first segment to the left, and the possibility of Ap appearing between который and its Aa is minimal, approximately 7 in 100.
B. Relying on general information about the syntactic structure of the UT, some supplementary data could be used as preliminary filters:
(1) In verbal constructions with the double conjunctions как . . . так и, the Aa of a pronoun standing after the second conjunction must stand after the first (for example: учёный исследовал как язык в целом, так и отдельные его структуры).
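These preliminary filters translate directly into predicate tests over candidate antecedents. The sketch below is our own composition, covering only filters A(1) and B(1); the attribute names and the toy data are invented for the illustration.

    # Sketch: two preliminary filters as predicates over candidates.

    def gender_filter(pronoun, candidates):
        """Filter A(1): keep only candidates agreeing in grammatical gender."""
        return [c for c in candidates if c["gender"] == pronoun["gender"]]

    def double_conjunction_filter(pronoun, candidates, kak_pos, tak_pos):
        """Filter B(1): for a pronoun after the second member of kak ... tak i,
        the Aa must stand after the first member."""
        if pronoun["pos"] > tak_pos:
            return [c for c in candidates if c["pos"] > kak_pos]
        return candidates

    # uchjonyj issledoval kak jazyk v tselom, tak i otdel'nye ego struktury
    pronoun = {"form": "его", "gender": "m", "pos": 9}
    candidates = [{"form": "учёный", "gender": "m", "pos": 0},
                  {"form": "язык", "gender": "m", "pos": 3}]
    candidates = gender_filter(pronoun, candidates)
    candidates = double_conjunction_filter(pronoun, candidates, kak_pos=2, tak_pos=6)
    print([c["form"] for c in candidates])   # ['язык']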
(2) Similarly, in constructions with repeated verbs and the conjunctions и, или, the Aa of a pronoun standing after the second verb must in principle stand after the first (e.g. В ходе анализа было установлено, что морфема может иметь это свойство, или не иметь его).
(3) In noun groups with possessive pronouns (Pb), the Aa and the Pb cannot be situated in the same noun group. Take as an example the fragment of the following UT:

«. . . в ряде случаев закрепление синтаксической функции (Ap 1) одного члена предложения (Aa) может привести к существенному изменению (Ap 2) его (P) значения.»

At the end of this UT we have the noun group «изменению его значения» with a Pb, in which the word «изменению» is the nucleus, the word «значения» is an attribute of this nucleus, and the Pb «его» is an attribute of the attribute. All 3 words are linked by an «immediate» syntactic structure

изменению → значения → его   (each arrow pointing from head to dependent)
within which an Aa cannot replace the corresponding P. Thanks to this filtering, Q = 3 for the given fragment is reduced to Q = 2.
(4) Establishing the exact type of the segment (independent, partial or expanding partial) makes it possible to exclude certain Ap beforehand. Thus the division of the following UT into segments excludes a certain number of Ap for the P свой:
.
For all other symbols x, y and z in G, 8(x, y, z) = {y} . Since S satisfies the unique successor condition, there is no conflict in the definition of S and so L is a well-defined deterministic Lindenmayer model. Also, if S satisfies the propagating condition, then L is propagating. I t is not too difficult to prove in detail that, for 1 < i < m, S; is an 1-regular subprocess associated with L, but we shall just demonstrate this by an example. Let S = {