Speech Production and Language
Speech Research 13
Editors
Vincent J. van Heuven Louis C.W. Pols
Mouton de Gruyter Berlin · New York
Speech Production and Language In Honor of Osamu Fujimura
Edited by
Shigeru Kiritani Hajime Hirose Hiroya Fujisaki
Mouton de Gruyter Berlin · New York 1997
Mouton de Gruyter (formerly Mouton, The Hague) is a Division of Walter de Gruyter & Co., Berlin.
Printed on acid-free paper which falls within the guidelines of the ANSI to ensure permanence and durability.
Library of Congress Cataloging-in-Publication Data

Speech production and language : in honor of Osamu Fujimura / edited by Shigeru Kiritani, Hajime Hirose, Hiroya Fujisaki. p. cm. - (Speech research ; 13) Includes bibliographical references and index. ISBN 3-11-015277-0 (alk. paper) 1. Speech. 2. Language and languages. I. Fujimura, Osamu, 1927- . II. Kiritani, Shigeru, 1940- . III. Hirose, Hajime, 1933- . IV. Fujisaki, H. (Hiroya) V. Series. P95.S646 1997 302.2'242-dc21 96-53596 CIP
Die Deutsche Bibliothek Cataloging-in-Publication Data
Speech production and language : in honor of Osamu Fujimura / ed. by Shigeru Kiritani ... - Berlin ; New York : Mouton de Gruyter, 1997 (Speech research ; 13) ISBN 3-11-015277-0 NE: Kiritani, Shigeru [Hrsg.]; Fujimura, Osamu: Festschrift; G T
© Copyright 1997 by Walter de Gruyter & Co., D-10785 Berlin All rights reserved, including those of translation into foreign languages. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording or any information storage and retrieval system, without permission in writing from the publisher. Printing: Arthur Collignon GmbH, Berlin. Binding: Lüderitz & Bauer, Berlin. Printed in Germany.
Preface

This book is a collection of research papers written in honor of Osamu Fujimura by his friends and pupils. It recognizes Osamu Fujimura's own outstanding contributions and leadership in the field of speech science. Among contemporary speech scientists, Osamu Fujimura is especially noted for his interest and competence in a wide variety of subjects ranging from physics, physiology and phonetics to linguistics and artificial intelligence. Through a fusion of these diverse disciplines he has shown us new ways of looking into human speech and language which relate the physical and physiological processes in phonetics to abstract, higher-level linguistic structure. Osamu Fujimura was born in Tokyo in 1927. After graduating from the Faculty of Science of the University of Tokyo in 1952, he became a Research Assistant at the Kobayashi Institute of Physical Research, Tokyo. His first work in speech science was "Speech Analysis and Synthesis Using ADP-Crystal Optomechanical Resonators", and for this work he received The Asahi Award for Promotion of Science. In 1958 he became an Assistant Professor at the University of Electro-Communications, Tokyo, and in 1962 he received the Doctor of Science degree from the University of Tokyo. During the 1950's and 1960's, he was a guest researcher at two meccas of modern speech science, the Research Laboratory of Electronics of the Massachusetts Institute of Technology (1958-1961) and the Speech Transmission Laboratory of the Royal Institute of Technology, Stockholm (1963-1965). During these years, he carried out many outstanding studies focused mainly on the physical process of speech production. Through these studies, he acquired a solid international reputation. From the early days of his career, Osamu showed deep insight into the necessity of interdisciplinary research on human speech and language, based on the integration of physical, physiological and linguistic studies. To realize this goal, he devoted himself to the establishment of the Research Institute of Logopedics and Phoniatrics at the Faculty of Medicine of the University of Tokyo. The Institute was founded in 1965, and was given great encouragement by the international speech-research community, especially by the researchers at the Royal Institute of Technology, the Massachusetts Institute of Technology, Haskins Laboratories and Bell Telephone Laboratories.
Osamu Fujimura served as Director of the Research Institute of Logopedics and Phoniatrics from 1969 to 1973. During that time, he succeeded in extending the bases of the physiological study of speech. The x-ray microbeam method, electropalatography and a flexible laryngeal fiberscope, which are all currently used world-wide as basic research tools in speech science, were the fruits of Osamu's work and insight. At this time, he also established a good cooperative link between the Institute and Haskins Laboratories. This link later formed the basis of the development of physiological research both in Japan and in the United States. In 1973, Osamu Fujimura moved to Bell Telephone Laboratories, where he headed the Department of Linguistics and Speech Analysis Research (later, the Department of Linguistics and Artificial Intelligence Research). There, he expanded his sphere of research to devote himself to bridging the gap between experimental phonetic studies and linguistic and artificial-intelligence studies of speech. He also was the driving force behind the development of the large-scale, second-generation x-ray microbeam system at the University of Wisconsin. From the time of its installation until today, this system has played an important role as the most powerful and effective source of articulatory data in the field of basic speech research. In 1990, Osamu Fujimura moved to The Ohio State University as a Professor in the Department of Speech and Hearing Science. He is also affiliated with the Center for Cognitive Science and the Biomedical Engineering Center of The Ohio State University. He has continued active research up to the present, stimulating us with his strikingly fresh ideas across the whole field of basic speech science. In addition to his research activities, Osamu Fujimura has also served as a prominent member of many academic societies. He is a fellow of the Acoustical Society of America and the New York Academy of Sciences. He is also a member of the Editorial Board of several important international journals, including Phonetica, Zeitschrift für Phonetik and Journal of Phonetics. Reflecting Osamu Fujimura's long-standing interests, the chapters in this book provide wide perspectives on the various aspects of speech production (physical, physiological, syntactic and information-theoretic) and their relationship to the structure of speech and language. Section 1 contains an introduction to the background of Osamu's research career from the viewpoint of a physicist. Section 2 consists of five physiological studies of
laryngeal functions related to the production of the voice source, and Section 3 consists of four chapters on the acoustic study of the voice source. These two sections provide a comprehensive survey of the current state of the art in these areas. Section 4 consists of six chapters on segmental features and their temporal organization. These chapters range from acoustic phonetic studies to neurobiological studies. Finally, Section 5 contains three chapters which treat the higher-level structure of the speech signal - phonetic, syntactic and information-theoretic. We believe that these 19 chapters represent the current status of the interdisciplinary research field to which Osamu has devoted himself. The editors would like to express their gratitude to the contributors for their articles. We also thank Professor Seiji Niimi, Drs. Satoshi Imaizumi, Kenji Itoh and Sotaro Sekimoto (Research Institute of Logopedics and Phoniatrics, University of Tokyo) for their help in editing this volume. Special thanks are due to Professor Morihiro Sugishita, current director of the Institute, for his kind support in publishing this volume. Finally, we wish to thank Mrs. Michiyo Tasaki for her patience and skill in editing the contributed manuscripts.
Contents

1. Background
Speech: a physicist remembers
Manfred R. Schroeder 1

2. Laryngeal functions in speech
Male-female differences in anterior commissure angle
Minoru Hirano, Kiminori Sato and Keiichiro Yukizane 11
Correlations among intrinsic laryngeal muscles during speech gestures
Christy L. Ludlow, Susan E. Sedory Holzer and Mihoko Fujita 19
Regulation of fundamental frequency with a physiologically-based model of the larynx
Ingo R. Titze 33
High-speed digital image analysis of temporal changes in vocal fold vibration in tremor
Shigeru Kiritani and Seiji Niimi 53
Phonetic control of the glottal opening
Masayuki Sawashima 69

3. Voice source characteristics in speech
Frequency domain analysis of glottal flow: The LF-model revisited
Gunnar Fant 77
Consequences of intonation for the voice source
Janet Pierrehumbert 111
Fundamental frequency rule for English discourse
Noriko Umeda 131
Physiological and acoustical correlates of voicing distinction in esophageal speech
Hajime Hirose 157

4. Articulatory organization
The postalveolar fricatives of Polish
Morris Halle and Kenneth N. Stevens 177
A note on the durations of American English consonants
Thomas H. Crystal and Arthur S. House 195
Articulatory coordination and its neurobiological aspects
Shinji Maeda and Kiyoshi Honda 215
Token-to-token variation of tongue-body vowel targets: The effect of context
Joseph S. Perkell and Marc H. Cohen 229
The phonetic realization of the haiku form in Estonian poetry, compared to Japanese
Ilse Lehiste 241
Synthesis and coding of speech using physiological models
M. Mohan Sondhi 251

5. Verbal behavior: sound structure, information structure
Comparison of speech sounds: distance vs. cost metrics
John J. Ohala 261
A note on Japanese passives
James D. McCawley 271
Sentence production and information
Hiroya Fujisaki 279

Index 297
1. Background
Speech: a physicist remembers

Manfred R. Schroeder
1. Dedication

Osamu Fujimura, a dear friend and colleague, with whom I shared two fruitful decades at Bell Laboratories, received his basic education as a physicist, a background he often "betrayed" in his many tangible contributions to fundamental speech research, such as his ingenious microbeam X-ray scanner (Kiritani—Itoh—Fujimura 1975). I myself was trained as a physicist at Göttingen, whence I returned as a University Professor after a long and satisfying career in communications research at Bell. Now Osamu, too, has returned to Academia, and I take this opportunity to wish him well in his new sphere of influence at Ohio State University. May we continue to see many more beautiful ideas spawned by Osamu's felicitous fusion of diverse talents, ranging from physics and signal theory to linguistics and artificial intelligence. In this hope I dedicate these remembrances to Osamu's future welfare.
2. A physicist turns to speech

My interest in speech was awakened by a talk given by Professor Werner Meyer-Eppler at the Göttingen Physics Colloquium in 1952. Although my physicist friends pretended "not to understand a word", I was completely captivated by Meyer-Eppler's excursion into the wonderful world of signals and symbols, by Shannon's Communication Theory, Hamming's Error-Correcting Codes, Rice's noise analysis and Dudley's Vocoder. Two years later, in March 1954, William Shockley (the coinventor of the transistor) and James Fisk (then Director of Research at Bell Laboratories) visited Göttingen (and other academic centers in Europe) to look for "bright young physicists" for Bell Labs. I had just completed my doctoral dissertation (using microwaves to investigate the statistical distribution of eigenfrequencies in irregularly shaped cavities, which turned out to be the Wigner distribution, now recognized as archetypical for nonintegrable ("chaotic") systems) and I was about to become "independently famous" for having introduced statistically interfering waves to the study of sound transmission in
closed spaces (Schroeder 1987). (This statistical wave theory not only demolished preexisting erroneous "explanations", but abolished the need, once and for all, to read steady-state frequency responses of concert halls - tea-leaf style - for acoustical quality criteria. These responses, I was able to show, were nothing but (complex) Gaussian noises in the frequency domain.) Small wonder then that I found myself in the employ of Ma Bell only a few months after the visit by Shockley and Fisk.
3. Early years at Bell Laboratories

There ensued a difficult transition period (that Osamu must have endured, too, when he first came to the United States) - a transition from one language and culture to another, from the life of a carefree student to that of a "regular employee" and, soon, from the quintessential bachelor to the married man and father. At Bell I was encouraged to continue my work in room acoustics and random waves, but, remembering Meyer-Eppler's talk, I opted for research in speech, which struck me as much more germane to Bell's business. (Little did I anticipate the recent renewed interest in random standing waves, both acoustic and electromagnetic, for conference telephony, cordless handsets, mobile communication and laser speckles.) I immediately focussed on the two then outstanding problems in speech analysis: pitch "tracking" and the measurement of formant frequencies. To pin down the formants, I suggested a spectral moment method that became my first U.S. patent (and the subject of my first oral paper at the Acoustical Society of America). Concerning the pitch problem, I was soon reminded of Dante's motto for hell: "Lasciate ogni speranza, voi ch'entrate". To make things worse, John Pierce (who had become my "executive director" in 1955) asked me to design a "high-fidelity" vocoder. John's idea was to use Dudley's vocoder principle, not to save bandwidth, but to improve the quality of telephone speech by compressing a 10-kHz wide speech signal into 3 kHz for transmission over ordinary telephone lines. But how to produce high-quality vocoder speech, given the fact that vocoder speech had always been "distinguished" by a lamentable "electronic accent"? And how to solve the pitch problem with the added constraint that the synthetic pitch of the compressed portion of the speech spectrum must match the pitch of the uncoded "baseband" exactly! I concluded that it just couldn't be done, not without invoking the devil's good services, anyhow. But then, John Pierce was my boss and perhaps my
propensity to please him propelled me to invent the voice-excited vocoder (or VEV, as it was soon nicknamed by Edward E. David, Jr.). In the VEV, as the name implies, the excitation energy for the coded portion of the spectrum is obtained by nonlinear distortion and spectrum flattening, a principle to which Ben Logan, Mohan Sondhi and, last but not least, Osamu made important contributions. The VEV, it turned out, was a smashing hit. It was in fact the first vocoder that sounded not like a vocoder but, miraculously, like a human being. The VEV, incidentally, was the last speech processor I built with a soldering iron (and winding my own coils for the bandpass filters). After that, I became addicted to digital simulation, transfused to Bell by Max Mathews and brought to full bloom in collaboration with Ed David, Hank McDonald and, later, Vic Vyssotsky and John Kelly, who fathered, with Carol Lochbaum, the Block Diagram (BLODI) Compiler.
4. Speech to please the human ear

In the course of my work on speech, I soon became aware that I had to know more about hearing - a realization that stood me in good stead later when I introduced the masking properties of the human ear into speech coding. The resulting noise-spectrum shaping really allowed predictive coding to take off, with near perfect quality at bit rates as low as 1/4 bit per sample for the prediction residual. But that harvest was not reaped until 20 years later. One of the first problems that caught my ear after joining the "speech gang" at Bell Laboratories1 was the buzzy quality of vocoder speech. I thought the peaky excitation function, a zero-phase periodic pulse train, was to blame. My first experiments in hearing, on monaural phase sensitivity, convinced me that not only was the human ear not phase deaf but it could in fact discriminate a great variety of different phase spectra for signals with a fixed power spectrum. Indeed, I succeeded in generating simple melodies solely by changing phase angles. (The contraption I had built for this purpose became appropriately known as the Phase Organ.) Many of the different timbres I obtained by phase changes sounded much like vowel sounds, and I was convinced that intelligible speech could be generated without changing the amplitude spectrum - a dream that was realized almost 30 years later with Hans Werner Strube (Schroeder—Strube 1986). As I had hoped, speech synthesizers driven by excitation signals less peaky than zero-phase pulses sounded more natural and less buzzy than "traditional" vocoder speech. Obviously, low peak-factors were good for
speech. But what phase combination gave the lowest peak-factor? This is a problem of combinatorial complexity, one of the most active research areas today. With 32 harmonic frequency components and allowing just 8 different phase angles per component, there are $8^{30} \approx 10^{27}$ possibilities (allowing for a factor 4 for time and sign-reversal symmetry) - a bit much, even for the fastest supercomputers.2 Thus, Vic Vyssotsky, with whom I shared an office, and I turned to Monte Carlo computation. The best of 4096 randomly chosen phases (0° and 180° only) reduced the peak-factor more than two-fold.3 Later I derived a formula for the phase angles ("Schroeder phases"), based on the theory of large-index FM, that reduced the peak-factor of a 31-component signal 2.6-fold, and still later I discovered that "Galois sequences" from number theory gave the largest reductions so far realized (Schroeder 1986).
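The scale of this search is easy to reproduce numerically. The sketch below is an illustration written for this edition, not the original programs: it measures the peak factor of an equal-amplitude harmonic sum, repeats the Monte Carlo experiment over random binary phases, and tries one commonly cited form of the quadratic "Schroeder" phases. The trial count and all names are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 32                                       # number of harmonic components
t = np.linspace(0, 1, 4096, endpoint=False)  # one fundamental period

def peak_factor(phases):
    """Peak (crest) factor: max|s| / rms for a sum of equal-amplitude harmonics."""
    s = sum(np.cos(2 * np.pi * (k + 1) * t + phases[k]) for k in range(N))
    return np.max(np.abs(s)) / np.sqrt(np.mean(s ** 2))

# Zero-phase pulse train: the worst case; peak factor = sqrt(2N) = 8 for N = 32
print(peak_factor(np.zeros(N)))

# Monte Carlo over 4096 random binary (0 or 180 degree) phase choices,
# as in the experiment described above; expect roughly a two-fold reduction
best = min(peak_factor(rng.integers(0, 2, N) * np.pi) for _ in range(4096))
print(best)

# Quadratic phases derived from large-index FM ("Schroeder phases"),
# in one commonly cited form for a flat amplitude spectrum
n = np.arange(1, N + 1)
print(peak_factor(-np.pi * n * (n - 1) / N))
```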
5. Predictive coding

Still, even with low peak-factor excitation, vocoders sounded less than human. Becoming increasingly irritated by the low quality of synthetic speech, the thought struck me, in the mid-1960s, that speech coding was somehow based on too rigid a recipe, relying on oversimplified models of speech production. Could one not analyze and synthesize speech in a more forgiving manner, deliberately leaving room for error? How about exploiting the quasiperiodicity and formant structure of voiced speech signals without discarding any resulting deviation? These thoughts soon led to linear prediction, undertaken jointly with Bishnu Atal, whom I had invited to join us from Bangalore in 1961 (Atal—Schroeder 1967). (Itakura and Saito, in Japan, arrived at basically the same method starting from a different philosophy (Itakura—Saito 1969).) Somewhat later, it became clear - and I found this very reassuring - that linear prediction was equivalent to a much more general principle: that of maximizing entropy, which has also been used in geophysical data processing (hunting for oil!) and in image compression and restoration (in radio astronomy, for example) (Schroeder 1982). Using just 8 predictor coefficients to describe - all-pole fashion - the spectral envelope (and 3 more coefficients to specify the spectral fine structure), we succeeded in generating a prediction residual that gave us excellent speech quality at as little as 1 bit per sample (8 kilobits per second) (Atal—Schroeder 1970). Interestingly, the measured signal-to-noise ratio wasn't all that good, giving me a first inkling that the ear was quite tolerant of certain types of error.
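The flavor of linear-prediction analysis can be conveyed in a few lines. The following is a minimal sketch, not the original analysis: it fits an 8-coefficient all-pole predictor to a frame by the autocorrelation method and measures how much energy remains in the prediction residual; the test signal is a made-up decaying resonance.

```python
import numpy as np

def lpc(x, order=8):
    """All-pole predictor coefficients via the autocorrelation (normal) equations."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:][:order + 1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def residual(x, a):
    """Prediction residual: the part of the signal the predictor cannot explain."""
    order, e = len(a), x.copy()
    for n in range(order, len(x)):
        e[n] = x[n] - np.dot(a, x[n - order:n][::-1])
    return e

fs = 8000
t = np.arange(400) / fs
x = np.sin(2 * np.pi * 500 * t) * np.exp(-20 * t)  # crude one-resonance test frame
a = lpc(x, order=8)
e = residual(x, a)
print(np.var(e) / np.var(x))   # residual carries only a small fraction of the energy
```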
6. The (un)masking of quantizing noise

Old-style predictive coders minimized the mean-square prediction error, a purely mathematical distortion measure. The resulting prediction residual has a flat ("white") spectrum. As a consequence, the spectral level of the quantizing error (for low bit rates) exceeds the speech spectrum in the "valleys" between the formants and thus becomes audible. Can the quantizing noise not be "swept" under the speech spectrum? Auditory masking suggests that such a spectrally shaped noise might be much less audible - or even inaudible. This thought was confirmed in masking experiments undertaken with Joe Hall in which tones masked noise rather than the other way around (Schroeder—Atal—Hall 1979). Of course, a non-white noise does not have minimum power, but it can have minimum audibility. This introduction of properties of human hearing into the design of speech synthesizers allowed Bishnu and me to design Code-Book Excited Linear Prediction (CELP) at bit rates as low as 1/4 bit per sample of the prediction residual (Schroeder—Atal 1985). Specifically, "vectors" of 40 residuals (5 milliseconds) are encoded by $2^{10} = 1024$ different codewords (each specified by a 10-bit "address", i.e. 1 bit for every 4 samples). Remembering Shannon's random coding proofs, the codebooks were essentially randomly generated without loss of efficiency. Of course, searching a 1024-word codebook for a best match every 5 milliseconds is not exactly easy to do in real time. Thus, we used tree and trellis coding initially to speed up the coding process. Later Neil Sloane and I introduced a coding scheme with sufficient algebraic structure to be amenable to fast computation, based on the Fast Hadamard Transform (Schroeder—Sloane 1987). Now, 20 years after the invention of linear prediction and 10 years after the introduction of the masking properties of the ear, CELP vocoders are beginning to dominate speech synthesis. There is hardly a talking toy or speaking computer that does not use linear prediction. It is an idea whose time has come. And I am doubly happy because for three decades I had devoted much of my free energy as a physicist to better speech coders, with few applications to see the light of day. The huge information capacities of transistorized transatlantic cables, communication satellites and optical fibres made bandwidth and bit rates so plentiful that there was little interest in speech compression. The picture began to change, however, in the 1970s with the introduction of talking dolls and learning games (Speak & Spell and others) that required sizeable speech repertoires while eschewing moving parts (magnetic tape). Speech synthesis by linear prediction, it turned out, was the ideal answer,
featuring both very high data compression and extremely simple synthesis that could be accommodated on cheap chips. (In these mass-market applications, predictive analysis, which is not quite as simple as all-pole synthesis, is a one-time operation, performed at the factory and "engraved" in read-only memories.) And speech compression is finally coming into its own, too. Optical fibres may have an all but unlimited capacity, but you can't run fibres to moving cars. Thus, mobile telephones, to become a universal service, must use compressed speech. And with linear prediction the bit rates are so low that digital encryption right at every handset becomes feasible.
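The codebook search at the heart of CELP, as described in the preceding section, can be sketched as follows. This is a toy illustration under simplifying assumptions: the perceptual weighting filter and long-term predictor are omitted, and both the codebook and the residual frame are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
CODEBOOK = rng.standard_normal((1024, 40))   # 2**10 random codewords, 40 samples each

def encode_frame(residual_frame):
    """Exhaustive codebook search: index and gain of the best-matching codeword.
    (A real CELP coder would weight the error perceptually; plain MSE here.)"""
    # Optimal gain for each codeword, then the codeword with the smallest error
    gains = CODEBOOK @ residual_frame / np.sum(CODEBOOK ** 2, axis=1)
    errors = np.sum((residual_frame - gains[:, None] * CODEBOOK) ** 2, axis=1)
    i = int(np.argmin(errors))
    return i, gains[i]          # a 10-bit index (plus a quantized gain in practice)

frame = rng.standard_normal(40)              # stand-in for a 5-ms prediction residual
index, gain = encode_frame(frame)
decoded = gain * CODEBOOK[index]             # the decoder only needs the shared codebook
print(index, np.sum((frame - decoded) ** 2) / np.sum(frame ** 2))
```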
7. Articulatory research

While I emphasized physics and signal theory in my speech work, Osamu branched out toward linguistics and artificial intelligence - the names of the two research departments at Bell Laboratories that he headed and to which he attracted such outstanding workers as Mark Liberman, Mitch Marcus, Joan Miller, Ken Church and Janet Pierrehumbert. But Osamu and I never drifted far apart. Our offices were always within convenient walking distance (not the rule at Bell, given its "mile-long" corridors) and we shared a common interest in articulatory analysis. Around 1965, I formulated a simple perturbation theory connecting area functions and formant frequencies (Schroeder 1967). Impedance measurements at the lips (or sufficiently constrained models) removed the (articulatory) ambiguity encountered in most "inverse" problems (Borg 1946). My modest beginnings led to much good work by Osamu, Atal, Max Mathews, John Tukey, Mohan Sondhi, Strube and others (Fujimura—Shibata—Kiritani—Simada—Satta 1968; Atal—Chang—Mathews—Tukey 1978; Sondhi 1977). My student Wolfgang Möller worked with Osamu's X-ray data to derive "principal components" of articulator motions (Möller 1978). Apart from better speech synthesis from text (for the blind and for spoken information services), this work also promises reading aids for the deaf ("visible speech", displaying articulatory motions).
8. Speech recognition

In addition to our articulatory work at Göttingen, based on new physical measurement methods (Schroeder—Strube 1979), we embarked on a program on speech and speaker recognition, resulting not only in more theses, but also in very respectable recognition rates (Strube—Helling—Krause 1985). In fact, Osamu saw fit to ask me to edit a book (Schroeder 1985) on speech and speaker recognition that had the first parallel processing and associative models expounded by Jeffrey Elman, James McClelland and Stephen Marcus. (The cover of the book was embellished by one of my favorite computer graphics, rooted in a solution of the eikonal equation of geometrical optics. Not unintentionally, the picture looked vaguely like a contour speech spectrogram - which Gunnar Fant in fact thought it was.) I dedicated the volume both to John Pierce and to his pseudonym as a science fiction author, J. J. Coupling. (The publisher was anxious to know who this J. J. Coupling was, but I never told them, or they might have "decoupled" Coupling from the dedication.)
9. Conclusion

Speech has come a long way since Count von Kempelen's talking robot with which he amused the European courts 200 years ago. The quality of coded speech is now near perfect, even at very low bit rates. Useful applications in mobile and private communication and aids for the deaf and the blind abound. Spoken information services proliferate and even (limited) speech recognition shows some promise, thanks to neural networks and parallel processing. Osamu, I am happy to conclude, is working in a vigorous field at a vital time.

Notes

1. Harold Barney, Ralph Miller, Ed David, John Kelly, Bruce Bogart and Erich Weibel; soon to be joined by Max Mathews, Hank McDonald, Ben Logan, Jim Flanagan, Mohan Sondhi, Bishnu Atal and Mike Noll.
2. $10^{27}$ is a very large number. For example, it is 10 times larger than $10^{26}$, which is also very large.
3. It would be interesting to see whether supercomputers (with or without "simulated annealing") would find an ultrametric structure in the space of relative minima of the peak factor. (In an ultrametric space all triangles are equilateral or isosceles with a short base. Distances in phylogenetic trees are the best-known example of an ultrametric space.)
References

Atal, B. S.—M. R. Schroeder
1967 "Predictive coding of speech signals", Proc. 1967 IEEE Conf. on Communication and Processing, 360-361.
1970 "Adaptive predictive coding of speech signals", The Bell Syst. Techn. J. 49: 1973-1986.
Atal, B. S.—J. J. Chang—M. V. Mathews—J. W. Tukey
1978 "Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer-sorting technique", J. Acoust. Soc. Am. 63: 1535-1555.
Borg, G.
1946 "Eine Umkehrung der Sturm-Liouvilleschen Eigenwertaufgabe. Bestimmung der Differentialgleichung durch die Eigenwerte", Acta Math. 78: 1-96.
Fujimura, O.—S. Shibata—S. Kiritani—Z. Simada—C. Satta
1968 "A study of dynamic palatography", Rept. 6th Int. Congr. Acoust., Tokyo, Vol. II: B21-24.
Itakura, F.—S. Saito
1969 "Speech analysis-synthesis system based on the partial autocorrelation coefficient", Rept. Acoust. Soc. Japan Meeting.
Kiritani, S.—K. Itoh—O. Fujimura
1975 "Tongue pellet tracking by a computer-controlled x-ray microbeam system", J. Acoust. Soc. Am. 57: 1516-1520.
Möller, J. W.
1978 "Regressive Schätzung artikulatorischer Parameter aus dem Sprachsignal", Dissertation, Göttingen.
Schroeder, M. R.
1967 "Determination of the geometry of the human vocal tract by acoustic measurements", J. Acoust. Soc. Am. 41: 1002-1010.
1982 "Linear prediction, extremal entropy and prior information in speech signal analysis and synthesis", Speech Communication 1: 9-20.
1986 Number theory in science and communication. (2nd enlarged edition.) Berlin: Springer.
1987 "Statistical parameters of the frequency response curves of large rooms", J. Audio Eng. Soc. 35: 299-306.
1987 "Normal frequency and excitation statistics in rooms: Model experiments with electric waves", J. Audio Eng. Soc. 35: 307-316.
Schroeder, M. R. (ed.)
1985 Speech and speaker recognition. Basel: Karger.
Schroeder, M. R.—B. S. Atal
1985 "Stochastic coding of speech signals at very low bit rates: The importance of speech perception", Speech Communication 4: 155-162.
Schroeder, M. R.—B. S. Atal—J. L. Hall
1979 "Optimizing digital speech coders by exploiting masking properties of the human ear", J. Acoust. Soc. Am. 66: 1647-1652.
Schroeder, M. R.—N. J. A. Sloane
1987 "New permutation codes using Hadamard unscrambling", IEEE Trans. Inf. Theory 33: 144-146.
Schroeder, M. R.—H. W. Strube
1979 "Acoustic measurements of articulator motions", Phonetica 36: 302-313.
1986 "Flat-spectrum speech", J. Acoust. Soc. Am. 79: 1580-1583.
Sondhi, M. M.
1977 "Estimation of vocal-tract areas: The need for acoustical measurements", in: Carre—Descout—Wajskop (eds.), Articulatory modeling and phonetics. Brussels: Institut de Phonétique, Université libre de Bruxelles, 77-88.
Strube, H. W.—D. Helling—A. Krause—M. R. Schroeder
1985 "Word and speaker recognition based on entire words without framewise analysis", in: M. R. Schroeder (ed.), Speech and speaker recognition. Basel: Karger, 80-114.
2. Laryngeal functions in speech
Male-female difference in anterior commissure angle
Minoru Hirano, Kiminori Sato and Keiichiro Yukizane
Hirano—Kiyokawa—Kurita (1988) reported that the anterior commissure angle (AnAC), i.e. the angle between the bilateral vocal folds at the anterior commissure, was greater for females than for males in excised human larynges. This paper discusses causes of the male-female difference in AnAC.
1. Review of our previous work

Some AnAC-related portions of our previous work (Hirano—Kiyokawa—Kurita 1988) will be reviewed in this section. Ten male and ten female excised larynges were investigated. The average age was 58 years in males and 66 years in females. Figure 1 shows the dimensions and angle of interest related to the glottic shape and Table 1 shows the mean value for each measure. Figure 2 depicts the glottic shape for both sexes drawn on the basis of the mean values reported by Hirano—Kiyokawa—Kurita (1988). AnAC was 16° for males and 25° for females. This was caused by the fact that the length of the membranous vocal fold (LMF) was greater for males than for females while the glottic width at the vocal process (GWP) was almost the same for both sexes. All the other ventrodorsal dimensions, including the length of the anterior glottis (LAG), length of the posterior glottis (LPG) and length of the entire glottis (LEG), were also greater for males than for females. The greater ventrodorsal dimensions for males were naturally attributed to the fact that the laryngeal framework was bigger in males than in females. A question was raised: why was the GWP almost the same for both sexes? In order to find an answer to this question, we investigated the laryngeal framework.
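The reported means make the geometric origin of the angle difference easy to check. Approximating the glottis of Figure 1(b) as an isosceles triangle with its apex at the anterior commissure (an approximation introduced here, not stated by the authors),

$$\tan\left(\frac{\mathrm{AnAC}}{2}\right) = \frac{\mathrm{GWP}/2}{\mathrm{LMF}},$$

which, with the means of Table 1 below, gives $2\arctan(2.15/15.4) \approx 16°$ for males and $2\arctan(2.1/9.8) \approx 24°$ for females - close to the measured 16° and 25°.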
Figure 1. Glottal shape, dimensions and angle of interest (Hirano—Kiyokawa—Kurita 1988). (a) anatomic view: AC: anterior commissure; VP: tip of vocal process. (b) geometric view: AC: anterior commissure; VP: tip of vocal process; AnAC: angle of bilateral vocal folds at AC; GWP: glottic width at vocal process level; LEG: length of entire glottis; LAG: length of anterior glottis; LPG: length of posterior glottis; LMF: length of membranous vocal fold.

Table 1. Average values for AnAC, LMF, GWP, LAG, LPG and LEG (Hirano—Kiyokawa—Kurita 1988).
                   Male    Female    Ratio (M/F)
AnAC in degrees     16       25
LMF in mm          15.4      9.8        1.57
GWP in mm           4.3      4.2        1.02
LAG in mm          15.1      9.5        1.59
LPG in mm           9.5      6.8        1.40
LEG in mm          24.5     16.3        1.50
Figure 2. Glottic shape drawn on the basis of the mean values for the measures reported by Hirano—Kiyokawa—Kurita (1988). AC: anterior commissure; VP: tip of vocal process.
2. Investigation of laryngeal framework

2.1 Method

Twenty laryngeal frameworks obtained from autopsy cases were subjected to this study. They were different from the larynges employed in the previous study. Of the 20, 10 were male and 10 were female. All subjects were in their 50's. The cartilaginous framework was isolated by removing the soft tissue. Measurements were conducted on the framework of each larynx.

2.2 Location of cricoarytenoid joint

We first thought that the same GWP for both sexes might be accounted for by the location of the cricoarytenoid joint. Therefore, we made some measurements related to the cricoarytenoid joint (Figure 3). Measurements were conducted on photographs of the cartilaginous framework taken from above. The results are shown in Table 2. The angle (AnJP) between bilateral lines that connected the midpoint of
the cricoid arch (MCA) and the posterior end of the cricoid articular facet of the cricoarytenoid joint (JP) did not differ significantly between the two sexes. The distance between the bilateral JPs (WJP) was greater in males than in females.
Figure 3. Dimensions and angles related to the cricoarytenoid joint, the midpoint of the cricoid arch (MCA) and the anterior commissure (AC). (a) MCA: midpoint of cricoid arch; JP: posterior end of cricoid articular facet; AnJP: angle between bilateral lines that connect JP and MCA; LJP: ventrodorsal dimension from MCA; WJP: distance between the bilateral JPs. (b) AC: anterior commissure; JP: posterior end of cricoid articular facet; AnACJP: angle between bilateral lines that connect JP and AC; LACJP: ventrodorsal dimension from AC; WJP: distance between the bilateral JPs.
Table 2. Average values for AnJP, LJP, WJP, AnACJP and LACJP (10 males and 10 females).

                     Male    Female    Ratio (M/F)
AnJP in degrees       41       38
LJP in mm            22.1     18.5        1.19
WJP in mm            16.5     11.9        1.39
AnACJP in degrees     32       38
LACJP in mm          26.2     18.7        1.40
The nearly identical GWP values for the two sexes, therefore, could not be accounted for by the location of the joint on the cricoid cartilage. The ventrodorsal dimension (LJP) was greater for males than for females. The angle (AnACJP) between bilateral lines that connected the location of the anterior commissure (AC) and JP was slightly greater for females than for males. The ventrodorsal dimension of the cricothyroid complex (LACJP) was greater for males than for females. The LACJP/WJP ratio, however, was 1.59 for males and 1.57 for females, hardly differing between the two sexes.
Figure 4. Angles of the cricoarytenoid joint. (a) lateral view of cricoid cartilage; AnJH: angle of the longitudinal axis of the cricoid articular facet against the horizontal plane. (b) superior view of cricoid cartilage; AnJM: angle of the longitudinal axis of the cricoid articular facet against the midsagittal plane.
Table 3. Average values for AnJM and AnJH (10 males and 10 females).

                   Male   Female
AnJM in degrees     34      33
AnJH in degrees     40      42
The spatial relationship between the thyroid and cricoid cartilages, therefore, could not account for the same GWP value for both sexes either. In our previous work, we presumed that the distance between the bilateral cricoarytenoid joints relative to the ventrodorsal dimension of the glottis was greater in females than in males (Hirano—Kiyokawa—Kurita 1988). The present results, however, do not support this presumption.

2.3 Angle of cricoarytenoid joint

The next possible factor that might explain the same GWP for both sexes was thought to be the inclination of the cricoarytenoid joint. Two angles related to the joint were investigated (Figure 4): the angle of the longitudinal axis of the cricoid articular facet against the midsagittal plane (AnJM) and that against the horizontal plane (AnJH). Measurements were conducted for both sides of each larynx on photographs. The results are shown in Table 3. There was no difference in either AnJM or AnJH between the two sexes.
3. Comments

The results of the study described in the previous section indicate that the male-female difference in AnAC cannot be attributed to differences in the location of the cricoarytenoid joint relative to the ventrodorsal dimension of the cricothyroid complex.
Figure 5. Schematic presentation demonstrating the spatial relationship among the vocal process (VP), the cricoarytenoid joint (JP) and other related structures.
This suggests that the tip of the vocal process of the arytenoid cartilage is located more medially relative to the cricoarytenoid joint in males than in females (Figure 5). The reason for this is not clear on the basis of the present study. One of the possible reasons is the fact that the angle between the bilateral thyroid alae is smaller for males than for females in adults (Kahane 1978; Malinowski 1967). The thyroid ala may constrain the location of the structures around the glottis, including the vocal process. Hirano—Kiyokawa—Kurita (1988) reported that the adductory effect of the cricothyroid muscle on the vocal fold located in the neutral position was greater in females than in males. This can be accounted for by the fact that the vocal process in the neutral position is placed more laterally relative to the cricoarytenoid joint in females than in males, giving it greater potential to move medially in females.
References

Hirano, M.—K. Kiyokawa—S. Kurita
1988 "Laryngeal muscles and glottic shaping", in: O. Fujimura (ed.), Vocal physiology: Voice production, mechanisms and functions. New York: Raven Press, 49-64.
Kahane, J.
1978 "A morphological study of the human prepubertal and pubertal larynx", Am. J. Anatomy 151: 11-20.
Malinowski, A.
1967 "The shape, dimensions and process of calcification of the cartilaginous framework of the larynx in relation to age and sex in the Polish population", Folia Morphologica Warszawa 26: 118-128.
Correlations among intrinsic laryngeal muscles during speech gestures Christy L. Ludlow, Susan E. Sedory Holzer, and Mihoko Fujita
1. Introduction

Electromyographic data may be examined to determine the major actions of the intrinsic laryngeal muscles on vocal fold motion. For example, the thyroarytenoid muscle, or vocalis, is thought to have both adductory and shortening actions on the vocal fold (Faaborg-Andersen 1957). On the other hand, the thyroarytenoid may have different actions dependent upon the simultaneous actions of the other laryngeal muscles (Gay—Strome—Hirose—Sawashima 1972). The shortening of the vocalis may be modified by simultaneous contraction of the cricothyroid, resulting in vocal fold lengthening (Hirano—Ohala—Vennard 1969; Atkinson 1978). During speech, three aspects of vocal fold movement must be controlled: 1) adductory movements for speech onset; 2) rapid alternations between adductory and abductory movements for phonation onsets and offsets during speech; and 3) fine adjustment in adductory, abductory, lengthening and shortening forces to produce vocal fold vibration for phonation. For regular phonation to occur with minimal levels of frequency perturbation (16 to 50 microseconds) (Horii 1979; Ludlow—Bassich—Connor—Coulter—Lee 1987), both vocal folds should vibrate symmetrically. If both vocal folds have the same mass and tissue characteristics, then the biomechanical actions of the muscles on the two folds should be the same, to result in similar vibratory patterns. However, given the right-left asymmetry in body and facial structure, it is likely that the right and left vocal folds would not have exactly the same mass, layer and shape characteristics in an individual (Hirano—Kurita—Yukizane—Hibi 1989). If such is the case, then the muscles acting on each vocal fold must have different actions to produce the same vibratory pattern in each of the two folds. Two possible mechanisms may be used to achieve symmetric vibratory patterns between the right and left sides during phonation in an individual. One, if the two vocal folds are exactly the same in mass, layer composition and shape on the two sides, then the right and left pairs of each of the laryngeal muscles must have exactly the same activation patterns on the two sides. Thus, activation of the motoneuron pools for the same muscle on the right
and left sides must be inter-related. On the other hand, if the two vocal folds in an individual are not highly similar in their mass, then the biomechanical forces of the muscles on the two sides must be independent to achieve the same resultant vibratory pattern for each side in response to the same subglottal pressure. In this event, the motoneuron pools of the two sides for the same muscle must be independent, with close coordination between different muscles on each side of the larynx to achieve similar vibratory patterns in vocal folds with different mass, layer composition and shape. Our purpose was to examine the degree of correlation among muscle activation patterns within individuals to determine whether higher levels of correlation occur between two different muscles on the same side of the larynx than between the same muscles on the two different sides of the larynx. Further, we wanted to determine whether the degree of correlation among muscles is greater during prolonged vocal fold vibration than when the vocal folds are adducting for speech onset or when they are rapidly abducting and adducting for phonation onset and offset during speech. This investigation examined the correlations between two laryngeal muscles, the thyroarytenoid and cricothyroid, during four different movements: vocal fold adduction for speech onset, vocal fold vibration during an extended vowel, production of a sentence with no voice onsets and offsets, and sentence production with voice offsets.
2. Methods

Five normal volunteers (two males), naive to the purpose of the study, participated after informed consent. They ranged in age from 20 to 43 years. Following a subcutaneous injection of 2% Xylocaine to reduce discomfort, bipolar concentric 27 gauge 30 mm electromyographic needles were inserted percutaneously into the right thyroarytenoid (RTA), the left thyroarytenoid (LTA), the right cricothyroid (RCT), and the left cricothyroid (LCT). Recordings were obtained in all four muscles simultaneously. Verifying gestures for the RTA and LTA were phonation and effort closure. The location of maximum activation was obtained by moving the electrode. None of the thyroarytenoid recordings had prominent phonatory onset and offset bursts indicative of lateral cricoarytenoid placement (Hirano—Ohala 1969). For verification of the LCT and RCT, a pitch glide and phonation at different pitch levels confirmed placements without significant activation on head turning or elevation indicative of strap muscle interference. The needle electrode was moved to the location of maximal response. A sixth subject participated in a comparison of two types of electrodes,
bipolar concentric and bipolar needle electrodes, to determine whether electrode cross talk could contribute to higher correlations between the TA and CT on the same side of the larynx than between the same muscle on opposite sides of the larynx (Dedo 1970). The bipolar needle electrode has a much smaller recording field than the bipolar concentric (Dedo 1970; Loeb—Gans 1986; DISA 1985). In this subject, bipolar concentric electrodes were placed in the right TA and CT and the left TA. Bipolar needle electrodes were inserted parallel to and as close as possible to the bipolar concentric electrodes in the same muscles without causing interference. Verifying gestures were used to assure accurate placement for both sets of electrodes. After electrode placement verification, the lights were dimmed and the room quieted. After 5 minutes, or when the subject was relaxed, one minute of quiet respiration was recorded. The EMG signals were bandpass filtered between 100 and 5000 Hz, amplified, and recorded on FM tape along with the speech signal. Subjects phonated the vowel /i/ three times and repeated the sentences "We mow our lawn all year" (all voiced) and "A dog dug a new bone" (up to 3 voice offsets), three times each. Two-volt peak-to-peak sawtooth calibration signals were recorded for each subject.

Analytic Procedures

The EMG signals were digitized at 5000 Hz with anti-aliasing filtering at 2000 Hz. Speech was digitized at 10 kHz with filtering at 5 kHz. Linear interpolation based on calibration signal measures converted the EMG signals to microvolts. Following full-wave rectification of each EMG channel, the minimum noise level in μV was measured during quiet respiration between motor unit firings and subtracted from all signals for that channel to correct for impedance. A 20 ms sliding window was used to smooth each EMG signal prior to automatic measurement. For each phonation and speech task, the onset of muscle activation was identified as the point when the EMG signal exceeded 150% above the mean of the maximum activation points during respiration in the same muscle for that subject. The time of activation offset was similarly identified where the EMG signal became less than 150% above the mean respiratory minimum (Figure 1). The automatically detected onset and offset points were reviewed for accuracy by visual inspection. Speech onset and offset points were identified from the rectified and smoothed acoustic signal. Automatic processing was used to calculate a correlation coefficient, using a 50 point sliding window, with the following formula (Ferguson 1966):

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
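The measurement chain just described - rectification, smoothing over a 20 ms window, threshold detection against the respiratory baseline, and the 50-point sliding correlation - can be sketched as follows. This is an illustrative reimplementation on synthetic signals, not the authors' software; the threshold interpretation and all signal parameters are assumptions of the sketch.

```python
import numpy as np

def envelope(x, fs, win_ms=20):
    """Full-wave rectify and smooth with a sliding rectangular window."""
    w = int(fs * win_ms / 1000)
    return np.convolve(np.abs(x), np.ones(w) / w, mode="same")

def activation_bounds(env, resp_max_mean):
    """First/last samples where the envelope exceeds 150% of the mean
    maximum activity seen during quiet respiration."""
    idx = np.flatnonzero(env > 1.5 * resp_max_mean)
    return (idx[0], idx[-1]) if idx.size else (None, None)

def sliding_r(x, y, win=50):
    """Pearson r between two envelopes in a sliding window of `win` points."""
    r = np.empty(len(x) - win + 1)
    for i in range(len(r)):
        r[i] = np.corrcoef(x[i:i + win], y[i:i + win])[0, 1]
    return r

fs = 5000
rng = np.random.default_rng(0)
t = np.arange(fs) / fs                               # one second of "recording"
burst = (t > 0.3).astype(float)                      # toy activation from 0.3 s on
rta = rng.standard_normal(t.size) * (0.1 + burst)    # stand-in right TA signal
lta = np.roll(rta, 25) + 0.05 * rng.standard_normal(t.size)  # similar left TA

e1, e2 = envelope(rta, fs), envelope(lta, fs)
print(activation_bounds(e1, resp_max_mean=0.1))      # onset/offset sample indices
print(np.mean(sliding_r(e1, e2)[int(0.4 * fs):]))    # high once both are active
```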
Figure 1. Plots of signals recorded from a subject during extended phonation of the vowel /i/. The acoustic signal and the electromyographic recordings from three muscles have been rectified, downsampled to 2500 Hz, and the maximum point selected for this 5 second display from fixed adjacent windows of 50 points.

Time series plots of r values were derived from the 20 ms sliding window between 1) the right TA and the left TA, 2) the right CT and the left CT, 3) the right TA and right CT, and 4) the left TA and left CT. For each task, the mean r values were computed for 3 time periods: 1) the time between the
onset of muscle activation and speech onset, when the vocal folds would be adducting, 2) the time between phonation onset and muscle activation offset, and 3) the entire period from muscle activation onset to muscle activation offset (Figure 2). For each of the time periods during which mean r values were computed, the maximum r value was automatically detected.
Figure 2. Point-by-point correlations plotted for the right and left thyroarytenoid and the right thyroarytenoid and cricothyroid, computed from the signals plotted in Figure 1. The maximum point contained in adjacent windows of 150 points is plotted for each signal. The three time periods of vocal fold movement are labeled as: (1) muscle activation onset to phonatory onset, (2) phonatory onset to muscle activation offset, and (3) muscle activation onset to offset.
3. Results

The analyses addressed the following questions: (1) whether the mean correlation coefficients for the right TA and CT and for the left TA and CT (within-side correlations) were greater than for the right and left TA and for the right and left CT (within-muscle correlations); (2) whether the mean within-side and within-muscle correlation coefficients were higher during speech than during vocal fold adduction prior to speech; (3) whether the maximum within-side and within-muscle correlation coefficients were greater during speech than during vocal fold adduction for speech.
Figure 3. The group mean and one standard deviation above the mean for the mean Pearson correlation coefficients obtained from the time of muscle activation onset to muscle activation offset between the right and left thyroarytenoids (TA), the right and left cricothyroids (CT), the right TA and CT and the left TA and CT.
Using the mean r computations from muscle activation onset to offset, a repeated-measures MANOVA was computed to compare differences between the two muscle correlation types (within-side versus within-muscle correlations) during speech in each of the different tasks (phonation, the voiced sentence and the sentence containing devoicing). There were significant differences between muscle correlation types (F = 9.972, p ...).

Sentence production and information

Hiroya Fujisaki

$$H_{L_k} = -\lim_{n \to \infty} \frac{1}{n} \sum P(u_{i_1} u_{i_2} \cdots u_{i_n}) \log_2 P(u_{i_1} u_{i_2} \cdots u_{i_n}) \quad \text{(bit/descriptive unit)},$$
where the subscript k indicates the type of descriptive unit, viz. k = 1, 2, 3, 4 corresponds to letter, word, rule, and sentence, respectively, and the summation is taken over all combinations of $(u_{i_1}, \ldots, u_{i_n})$. If we calculate these four kinds of $H_{L_k}$'s for the same linguistic expressions and convert them to the same unit, say, information per letter, then they should give identical values. The estimated value of $H_{L_k}$ decreases monotonically with the string length n, and the true value of $H_{L_k}$ is given by its lower limit. Since we assume a one-to-one correspondence between events and sentences, we can estimate $H_S$ by first estimating $H_{L_k}$ and then multiplying it by the average number of descriptive units per sentence. The channel capacity $C_k$ of a language is defined as the upper bound of $H_{L_k}$ with regard to the probability of occurrence $P_i$ of event $E_i$, and is dependent on the choice of descriptive units.

3.2. Estimation of information contained in linguistic expressions

In this section we describe the estimation of information using n-grams of descriptive units which appear in linguistic expressions, assuming that
events generated by a source with known characteristics are successively transformed into sentences via a simple context-free grammar. Although it is known that context-free rules are insufficient for the description of the syntax of natural languages and transformational rules are necessary, here we confine our discussion to languages that can be generated by context-free grammars, since the underlying deep structure of a sentence is determined by context-free rules even in transformational generative grammar theory, and a transformational rule can be approximated by a set of context-free rules for practical purposes. The estimation of information is illustrated by two examples with different source characteristics: (a) the case of a zero-memory source which generates each event independently, and (b) the case of a simple Markov source where every two successive events have some correlation.
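As a concrete illustration of the n-gram estimate defined in the previous section (a sketch written for this edition; the two-symbol Markov source below is a stand-in, not one of the chapter's examples), the entropy estimated from n-gram frequencies can be seen to decrease with n toward the true rate:

```python
import math, random
from collections import Counter
from itertools import islice

def entropy_per_unit(units, n):
    """Estimate H_{L_k} from relative frequencies of n-grams of descriptive
    units, in bits per single unit (the n-gram entropy divided by n)."""
    grams = Counter(zip(*(islice(units, i, None) for i in range(n))))
    total = sum(grams.values())
    return -sum(c / total * math.log2(c / total) for c in grams.values()) / n

# Toy correlated source: a two-state Markov chain emitting 'a'/'b'
random.seed(0)
p_stay = 0.9                      # strong correlation between successive units
seq, state = [], "a"
for _ in range(500_000):
    seq.append(state)
    state = state if random.random() < p_stay else ("b" if state == "a" else "a")

true_rate = -(p_stay * math.log2(p_stay) + (1 - p_stay) * math.log2(1 - p_stay))
for n in (1, 2, 4, 8):
    print(n, round(entropy_per_unit(seq, n), 3))   # decreases toward true_rate
print("entropy rate:", round(true_rate, 3))        # about 0.469 bit/unit
```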
(a) Zero-memory source

Table 1 shows the context-free grammar used in this example. Here the letters consist of the Japanese kana-letters and the 'period', which indicates the separation between sentences. The 'period' is regarded as an independent word when the word is adopted as the descriptive unit. From the grammar shown in Table 1, we can derive an infinite number of sentences, such as これはきれいなはなだ。 (Kore-wa kireina hana-da.) 'This is a pretty flower.', これはとてもきれいなはなだ。 (Kore-wa totemo kireina hana-da.) 'This is a very pretty flower.', これはとてもとてもきれいなはなだ。 (Kore-wa totemo totemo kireina hana-da.) 'This is a very very pretty flower.', etc. In actual situations, sentences containing many repetitions of とても (totemo) are rarely seen. We assume that there are no real events corresponding to sentences containing repetitions of とても, so that there remain a finite number (18) of events. These events are further assumed to occur with equal probabilities.

Table 1. A context-free grammar adopted as an example in the estimation of the amount of information generated by a zero-memory source.
G = (S, V_N, V_T, P)
V_N = {S, NP, VP, ADJ}
V_T = {これは (W1), それは (W2), あれは (W3), とても (W4), きれいな (W5), … (W6), … (W7), はなだ (W8), 。 (W9)}
P = {S → NP VP (P1), NP → これは (P2), NP → それは (P3), NP → あれは (P4), VP → ADJ はなだ (P5), ADJ → とても ADJ (P6), ADJ → きれいな (P7), ADJ → … (P8), ADJ → … (P9)}
In this case, the amount of information produced by the source is given by

$$H_S = -\sum_i P_i \log_2 P_i \quad \text{(bit/event)},$$

where $P_i$ denotes the probability of occurrence of the event $E_i$. Figure 2 shows an example of the representation of a sentence in terms of the three kinds of descriptive units, i.e., letter, word, and rule. The representation as a string of re-writing rules obeys the leftmost derivation of the sentence. In the following calculation all the obligatory rules are treated as one rule. Because of strong correlations among the descriptive units that constitute a sentence, better approximations to the statistical characteristics of sentences can be obtained by adopting longer strings of descriptive units.

EVENT      E_i ...
SENTENCE   これはとてもきれいなはなだ。 (Kore-wa totemo kireina hana-da.) ...
LETTERS    ... L5 L12 L10 L8 L7 L11 L4 L12 L2 L9 L10 L9 L14 L16 ...
WORDS      W1 W4 W5 W8 W9
RULES      P1 P2 P5 P6 P7
Figure 2. A sentence represented as letter-, word-, and rule-strings.

Figure 3 shows the relation between the estimated value of the information $H_{L_k}$ contained in expressions and the length n of the strings of the three kinds of descriptive units used in the estimation, under the condition that each event is successively generated from the source and transformed into a sentence according to the grammar shown in Table 1. The ordinate represents the estimated value of $H_{L_k}$ in terms of bits per Japanese kana-letter. The horizontal dashed line in the figure indicates the amount of information produced by the source ($H_S$) in terms of bits per letter. The results shown in the figure indicate that the estimated value of $H_{L_k}$ tends to a certain value with the increase in the size of the n-grams regardless of the choice of descriptive unit. This asymptotic value represents the true value of $H_{L_k}$ and corresponds to the source information $H_S$. Furthermore, the higher the level of the descriptive unit, the smaller the value of n with which we can obtain a good estimate. The difference is very large between letter and word but relatively small between word and rule. In this example, the average number of descriptive units necessary to express an event is 12.5, 4.5, and 3.5 for letter, word, and rule, respectively.
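These figures can be checked directly. With 18 equiprobable events,

$$H_S = \log_2 18 \approx 4.17 \ \text{bits/event},$$

and dividing by the average numbers of units per event quoted above gives about $4.17/12.5 \approx 0.33$ bit per letter (the level of the dashed line in Figure 3), $4.17/4.5 \approx 0.93$ bit per word, and $4.17/3.5 \approx 1.19$ bit per rule.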
Figure 3. Asymptotic estimation of the amount of information $H_{L_k}$ contained in linguistic expressions generated by a zero-memory source.
In the above discussion we assumed that the occurrence of each event from the information source was equiprobable, hence the amount of information produced by the source ($H_S$) has its maximum value ($H_{S\max}$). We will next examine cases where the source information $H_S$ does not reach its maximum value. Let η be the ratio of the source information $H_S$ to its attainable maximum $H_{S\max}$. This quantity will be referred to as the relative amount of the source information and will be used as an index of the statistical characteristics of the source. Any change in the relative amount of the source information η causes a change in the value of the linguistic information $H_{L_k}$, and thus affects its estimation. Here we investigate the relation between the relative amount of the source information η and the relative error in estimating $H_{L_k}$. As in the previous example, we used the grammar of Table 1 and the information source that produces 18 events. Using the relative amount of the source information η as a parameter, the relation between the size n of the n-grams and the relative estimation error of $H_{L_k}$ was calculated. The results are shown in Figure 4 through Figure 6, indicating that for the same n the estimation error is smaller for larger values of η regardless of the descriptive unit, and that in all of these cases the estimation error is approximately inversely proportional to the size of the n-grams.
Figure 4. Relative error in the estimation of the information $H_{L_1}$ using letter $n$-grams.
Figure 5. Relative error in the estimation of the information $H_{L_2}$ using word $n$-grams.
Figure 6. Relative error in the estimation of the information $H_{L_3}$ using rule $n$-grams.

(b) Simple Markov source

Under ordinary circumstances successive events, and hence successive sentences, are generally correlated with each other. In other words, the information source cannot be regarded as a zero-memory source, but should be considered to possess a more complex structure. As a first approximation, we consider the case where a simple Markov source generates each event, i.e., where an event is correlated only with the immediately preceding event. The amount of source information $H_S$ in this case is given by

$$H_S = -\sum_{i,j} P_i P_{ij} \log_2 P_{ij} \quad \text{(bit/event)},$$

where $P_i$ stands for the stationary probability of an event $E_i$, and $P_{ij}$ stands for the conditional probability of occurrence of $E_j$ given that $E_i$ has occurred. Even though the information source has memory and successive events have some correlation, the same method can be applied to estimate the amount of information contained in linguistic expressions by using the rate of occurrence of $n$-grams of descriptive units. In addition to the three kinds of descriptive units adopted in the previous examples, we adopt here still another, higher-level descriptive unit, viz. the sentence. In the present example, a simple Markov source is assumed to generate six events, corresponding to the six sentences shown in Table 2.
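The formula above can be evaluated directly once the transition probabilities are known. The sketch below, with a hypothetical three-event transition matrix, obtains the stationary probabilities $P_i$ by power iteration and then computes $H_S$:

    import math

    def stationary(P, iters=500):
        """Stationary distribution of a row-stochastic matrix P,
        obtained by power iteration."""
        n = len(P)
        pi = [1.0 / n] * n
        for _ in range(iters):
            pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        return pi

    def markov_source_information(P):
        """H_S = -sum_ij P_i P_ij log2 P_ij (bit/event)."""
        pi = stationary(P)
        n = len(P)
        return -sum(pi[i] * P[i][j] * math.log2(P[i][j])
                    for i in range(n) for j in range(n) if P[i][j] > 0)

    # Hypothetical three-event transition matrix (rows sum to one):
    P = [[0.0, 0.5, 0.5],
         [0.8, 0.0, 0.2],
         [0.5, 0.5, 0.0]]
    print(f"H_S = {markov_source_information(P):.3f} bit/event")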
Consideration of the consistency of all ordered pairs of sentences of Table 2 leads to constraints on the possible combinations of sentences listed in Table 3. Assuming that successive events are generated stochastically under the constraints of Table 3 and then expressed as sentences, the amount of information in these linguistic expressions is estimated using each of the four kinds of descriptive units and is shown in Figure 7 as a function of the size of the $n$-grams. The ordinate indicates the value of the information after conversion into bit per letter for the sake of mutual comparison, and the horizontal dashed line indicates the source information $H_S$. The figure shows that, regardless of the descriptive unit used in the estimation, the estimated value of the information tends to the same asymptote as the size of the $n$-grams is increased. This asymptotic value is equal to the true value of the information in linguistic expressions $H_{L_k}$, as well as to the source information $H_S$. The figure also indicates that the higher the level of the descriptive unit, the smaller the size of the $n$-gram necessary to achieve a certain estimation accuracy. In this example, the average number of descriptive units necessary to express an event is 12.1, 4.1, 4.0, and 1.0 for letters, words, rules, and sentences, respectively.
Table 2. Sentences used in the estimation of the amount of information generated by a simple Markov source.

S1: (Kore-wa utsukushii hana-da.) 'This is a pretty flower.'
S2: (Kore-wa kiiroi hana-da.) 'This is a yellow flower.'
S3: (Sore-wa utsukushii.) 'It is pretty.'
S4: (Sore-wa akai.) 'It is red.'
S5: (Watashi-wa sore-o konomu.) 'I like it.'
S6: (Watashi-wa sore-o anata-ni agemasu.) 'I will give it to you.'

[The original also gives each sentence in Japanese orthography.]
Table 3. Allowed (o) and forbidden (×) transitions between sentences. [A 6 × 6 matrix over S1-S6: each row gives a preceding sentence, each column a subsequent sentence, and each cell marks the transition as allowed (o) or forbidden (×).]
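To illustrate how such a constraint table yields a value of the source information, the following sketch assumes a hypothetical allowed-transition pattern (not the actual entries of Table 3) and takes the allowed successors of each sentence to be equiprobable:

    import math

    # Hypothetical allowed-transition pattern over six sentences S1..S6
    # (1 = allowed, 0 = forbidden); the actual entries of Table 3 differ.
    A = [[0, 1, 1, 0, 1, 1],
         [1, 0, 0, 1, 1, 1],
         [1, 1, 0, 1, 1, 1],
         [1, 1, 1, 0, 1, 1],
         [1, 1, 1, 1, 0, 1],
         [1, 1, 1, 1, 1, 0]]

    # Allowed successors are taken to be equiprobable.
    P = [[a / sum(row) for a in row] for row in A]

    # Stationary distribution by power iteration, then the entropy rate.
    pi = [1 / 6] * 6
    for _ in range(500):
        pi = [sum(pi[i] * P[i][j] for i in range(6)) for j in range(6)]
    H_S = -sum(pi[i] * P[i][j] * math.log2(P[i][j])
               for i in range(6) for j in range(6) if P[i][j] > 0)
    print(f"H_S = {H_S:.3f} bit/event")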
Figure 7. Asymptotic estimation of the amount of information $H_{L_k}$ contained in linguistic expressions generated by a simple Markov source.

3.3 Estimation of source information by extrapolation

In the previous examples we assumed that the stochastic characteristics of the information source are already known. In actual situations, however, the nature of the information source is usually unknown. Moreover, successive events generated from the source are mutually dependent and form a complex structure. In these cases the source information, too, must be defined as the limit of the entropy obtained by using the probabilities of occurrence of $n$-grams of events. However, as long as there is no ambiguity in the correspondence between events and linguistic expressions, we can estimate the source information through estimation of the amount of information contained in linguistic expressions. According to information theory, the error in estimating the amount of information produced by an ergodic Markov source decreases with the size of the $n$-grams and is inversely proportional to $n$ when $n$ is sufficiently large (Abramson 1963). In estimating the information $H_{L_k}$ contained in linguistic expressions, the estimated value is therefore also expected to vary with $n$ approximately according to the formula $A + B/n$. As an example, the relationship between the estimated value of $H_{L_k}$ and the inverse of the size of the $n$-grams of the descriptive unit used in the estimation is shown for both of the two previous examples in Figures 8 and 9.
Figure 8. Approximation of the behavior of the estimated $H_{L_k}$ generated by a zero-memory source with the formula $A + B/n$.
Figure 9. Approximation of the behavior of the estimated $H_{L_k}$ generated by a simple Markov source with the formula $A + B/n$.
These figures show that the above-mentioned formula fits the actual estimated values quite well, not only in the case of a zero-memory source but also in the case of a simple Markov source. They also indicate that an accurate estimate of $H_{L_k}$ can be obtained by extrapolating rough estimates computed with relatively small values of $n$: the point where the straight line connecting the estimated values of $H_{L_k}$ for various $n$ intersects the vertical axis indicates the asymptotic value of the estimate of $H_{L_k}$. The results in these figures show that the estimated value obtained by extrapolation is almost identical irrespective of the choice of descriptive unit. These figures also show that a higher-level unit gives a good fit to the approximate formula at smaller values of $n$ than lower-level units do. Consequently, one can obtain an accurate estimate efficiently through extrapolation by using a higher-level descriptive unit.
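The extrapolation amounts to fitting $\hat{H}_n \approx A + B/n$ by least squares and reading off the intercept $A$ at $1/n = 0$. A minimal sketch with made-up estimates $\hat{H}_n$ (chosen only to mimic the shape of the zero-memory example):

    # Made-up n-gram estimates H_n, indexed by n:
    H = {1: 4.90, 2: 4.55, 3: 4.42, 4: 4.36, 5: 4.33}

    xs = [1 / n for n in H]              # regressor: 1/n
    ys = [H[n] for n in H]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    B = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    A = y_mean - B * x_mean              # intercept at 1/n = 0
    print(f"extrapolated H_S = A = {A:.2f} bit")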
4. Information contained in English text

While the foregoing analysis may serve to illustrate the basic relationships between the source information and its estimates obtained by various methods, the numerical results may not be realistic, since both the information source and the grammar are artificial and highly restricted. In order to obtain an estimate of the amount of information contained in an ordinary text written in a natural language, the full text of a scientific article was analyzed. The material was the English text of the paper "Transmission of meaning by language" by H. Fujisaki, K. Hirose and Y. Katagiri, and contained a little less than 4000 words. The amount of information was estimated using both the letter and the word as the descriptive unit. In adopting letters as the descriptive units, upper- and lower-case letters were regarded as separate symbols. Punctuation marks and other symbols, as well as spaces, were also treated as letters. The number of space symbols between adjacent words and between adjacent sentences was always assumed to be one. In adopting words as the descriptive units, on the other hand, spaces and punctuation marks were neglected, except for the period at the end of each sentence, which was counted as one word. Figure 10 shows the plot of the estimated amount of information $H_{L_k}$ (in open circles) against the inverse of the size of the letter or word $n$-grams, together with the amount of source information obtained by extrapolation, all expressed in units of bit per letter. The straight line for the letter $n$-grams indicates the linear regression for $n > 5$, while that for the word $n$-grams is the line connecting the two points for $n = 1$ and 2. The two straight lines, extrapolated (in broken lines), converge at 0.5 bit/letter, which is taken as the estimated value of $H_S$.
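The following sketch reproduces the two tokenizations described above and computes small-$n$ estimates in bit per letter. The input file name is a placeholder, and the details (e.g., the exact symbol inventory and the regression step) are assumptions for illustration, not the chapter's exact procedure:

    from collections import Counter
    import math
    import re

    def ngram_entropy(seq, n):
        """Per-unit n-gram entropy estimate in bits."""
        grams = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
        counts = Counter(grams)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values()) / n

    text = open("article.txt").read()    # hypothetical input file

    # Letters: case-sensitive; punctuation and spaces count as letters,
    # with any run of whitespace collapsed to a single space.
    letters = list(re.sub(r"\s+", " ", text))

    # Words: spaces and punctuation dropped, except that each
    # sentence-final period is counted as a word of its own.
    words = re.findall(r"[A-Za-z]+|\.", text)

    letters_per_word = len(letters) / len(words)
    for n in (1, 2):
        print("letter", n, round(ngram_entropy(letters, n), 3))
        print("word  ", n,
              round(ngram_entropy(words, n) / letters_per_word, 3))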
Figure 10. Estimation of the amount of information contained in English text.
The figure also indicates, in dotted lines, the upper and lower bounds calculated by Shannon on the basis of prediction experiments (Shannon 1951: 50-64). It is interesting to note that our results for the letter $n$-grams fall between the two predicted bounds for $n > 5$, and tend to a value that is only slightly larger than Shannon's lower bound at $n = 100$. It is also interesting that approximately the same estimate is obtained by using only the values $n = 1$ and 2 for the word $n$-grams, although much more experimental data is needed to confirm the reliability of this estimate. Figure 11 shows the effect of text size on the results of estimation, and indicates a monotonic increase in $H_{L_k}$ with increasing text size. The results, however, are seen to level off for text sizes beyond 2000 words. The variability of the estimates due to sample location was also investigated by dividing the original text into eight parts of approximately equal size and estimating $H_{L_k}$ for each part. Figure 12 shows the mean and the range of variation of the estimates obtained from these eight parts. The vertical line segment indicates the range of variation of the eight estimates, showing the stability of the estimation and hence the uniformity of the text under study.
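A sketch of this variability check, again with a placeholder file name and using, for simplicity, only the letter-unigram estimate for each part:

    from collections import Counter
    import math

    def unigram_entropy(seq):
        counts = Counter(seq)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in counts.values())

    text = open("article.txt").read()    # hypothetical input file

    k = 8
    size = len(text) // k
    parts = [text[i * size:(i + 1) * size] for i in range(k)]

    estimates = [unigram_entropy(part) for part in parts]
    print("mean :", round(sum(estimates) / k, 3))
    print("range:", round(min(estimates), 3), "-", round(max(estimates), 3))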
While the above-mentioned results were obtained from an actual English text, the results are expected to vary to some extent depending on the content as well as on the style of the text. In order to obtain a crude estimate for