Stress and Non-Stress Accent [Reprint 2012 ed.] 9783110874020, 9783110137293



English Pages 252 [256] Year 1992



Table of contents :
PREFACE
CHAPTER 1. DEFINING ACCENT
1.1 Differentiating accent from tone
1.2 Relating accent to intonation
CHAPTER 2. ACCENT SYSTEMS AND TONE SYSTEMS
2.1. Historical overview
2.2. Characteristic differences between accent and tone
CHAPTER 3. ACCENT AND INTONATION
3.1. Stress levels
3.2. Tonetic stress marks
3.3. Early experimental evidence
3.4. Pitch-accent theory
3.5. Recent experimental evidence
3.6. Metrical theory
3.7. The intonation contour in metrical theory
3.8. The metrical tree
3.9. The metrical grid
3.10. Hierarchical structures in other languages
CHAPTER 4. FUNDAMENTAL FREQUENCY AND PITCH
4.1. Signals with stationary fundamental frequency and pitch
4.2. Signals with changing fundamental frequency and pitch
4.3. Segmental effects on fundamental frequency and pitch
CHAPTER 5. INTENSITY, DURATION, AND LOUDNESS
5.1. The loudness function
5.2. Intrinsic duration, intensity, and loudness
CHAPTER 6. ACOUSTIC CORRELATES OF ACCENT IN ENGLISH AND JAPANESE
6.1. Methods
6.2. Results
6.3. Comparison with earlier studies
6.4. Hierarchies of acoustic correlates of accent
CHAPTER 7. PERCEPTUAL CUES TO ACCENT IN ENGLISH AND JAPANESE
7.1. Methods
7.2. Results
7.3. Discussion
7.4. Conclusion
REFERENCES
APPENDICES
Appendix A
Appendix B
Appendix C

Stress and Non-Stress Accent

Netherlands Phonetic Archives

The Netherlands Phonetic Archives (NPA) are a modestly priced series of monographs or edited volumes of papers, reporting recent advances in the field of phonetics and experimental phonology. The archives address an audience of phoneticians, phonologists and psycholinguists.

Editors
Marcel P.R. Van den Broecke, University of Utrecht
Vincent J. van Heuven, University of Leyden

Other books in this series:

I. Nico Willems, English Intonation from a Dutch Point of View
IIA. A. Cohen and M.P.R. Van den Broecke (eds.), Abstracts of the Tenth International Congress of Phonetic Sciences
IIB. M.P.R. Van den Broecke and A. Cohen (eds.), Proceedings of the Tenth International Congress of Phonetic Sciences
III. J.R. de Pijper, Modelling British English Intonation
IV. Lou Boves, The Phonetic Basis of Perceptual Ratings of Running Speech
V. Renee van Bezooyen, Characteristics and Recognizability of Vocal Expressions of Emotion
VI. Robert Channon & Linda Shockey, In Honor of Ilse Lehiste

Mary E. Beckman

Stress and Non-Stress Accent


1986 FORIS PUBLICATIONS Dordrecht - Holland/Riverton - U.S.A.

Published by: Foris Publications Holland, P.O. Box 509, 3300 AM Dordrecht, The Netherlands

Sole distributor for the U.S.A. and Canada: Foris Publications U.S.A., P.O. Box C-50, Riverton N.J. 08077, U.S.A.

CIP-DATA
Beckman, Mary E.
Stress and Non-Stress Accent / Mary E. Beckman. - Dordrecht [etc.] : Foris. - (Netherlands Phonetic Archives ; 7)
ISBN 90-6765-243-1 bound
ISBN 90-6765-244-X paper
SISO 805.2 UDC 801.16
Subject heading : phonetics.

Front cover illustration taken from F.M. Helmont [An unabbreviated representation of a true natural Hebrew alphabet, which simultaneously shows how those born deaf can be taught not only to understand others who speak but even to produce speech themselves], Pieter Rotterdam, Amsterdam 1697.

ISBN 90 6765 243 1 (Bound)
ISBN 90 6765 244 X (Paper)

© 1986 Bell Telephone Laboratories, Inc.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.

Printed in the Netherlands by ICG Printing, Dordrecht.

Table of Contents

PREFACE
CHAPTER 1. DEFINING ACCENT
1.1 Differentiating accent from tone
1.2 Relating accent to intonation
CHAPTER 2. ACCENT SYSTEMS AND TONE SYSTEMS
2.1. Historical overview
2.1.1. Early classifications based on phonetic criteria
2.1.2. Bloomfield's primary versus secondary phonemes
2.1.3. Later structuralist treatments of accent
2.1.4. Trubetskoy's correlation of accent
2.1.5. Later functionalist treatments of accent: a critique of Trubetskoy
2.1.6. The organizational function in metrical theory
2.2. Characteristic differences between accent and tone
2.2.1. Speakers' attitudes
2.2.2. Historical development
2.2.3. Distinctive load
2.2.4. Alternations and restrictions
CHAPTER 3. ACCENT AND INTONATION
3.1. Stress levels
3.2. Tonetic stress marks
3.3. Early experimental evidence
3.4. Pitch-accent theory
3.5. Recent experimental evidence
3.6. Metrical theory
3.7. The intonation contour in metrical theory
3.8. The metrical tree
3.9. The metrical grid
3.10. Hierarchical structures in other languages
CHAPTER 4. FUNDAMENTAL FREQUENCY AND PITCH
4.1. Signals with stationary fundamental frequency and pitch
4.2. Signals with changing fundamental frequency and pitch
4.2.1. The pitch of changing signals
4.2.2. Frequency discrimination of changing signals
4.2.3. Effect of rate of F0 movement
4.2.4. Effect of direction of F0 movement
4.2.5. Effect of concomitant amplitude change
4.3. Segmental effects on fundamental frequency and pitch
4.3.1. Consonantal effects on fundamental frequency
4.3.2. Vocalic effects on fundamental frequency and pitch
CHAPTER 5. INTENSITY, DURATION, AND LOUDNESS
5.1. The loudness function
5.1.1. Critical bands and loudness summation
5.1.2. Critical duration and temporal summation
5.2. Intrinsic duration, intensity, and loudness
CHAPTER 6. ACOUSTIC CORRELATES OF ACCENT IN ENGLISH AND JAPANESE
6.1. Methods
6.1.1. The corpora
6.1.2. The subjects and utterance tokens
6.1.3. The measurements
6.1.4. Statistics
6.2. Results
6.2.1. Fundamental frequency patterns
6.2.2. Peak and average amplitude patterns
6.2.3. Duration and total amplitude patterns
6.3. Comparison with earlier studies
6.4. Hierarchies of acoustic correlates of accent
CHAPTER 7. PERCEPTUAL CUES TO ACCENT IN ENGLISH AND JAPANESE
7.1. Methods
7.1.1. Stimuli
7.1.2. Subjects
7.1.3. Testing procedure
7.1.4. Statistics
7.2. Results
7.2.1. Overall group means and individual subject means
7.2.2. Group means by word type and by speaker
7.3. Discussion
7.3.1. Comparison to earlier studies
7.3.2. The total amplitude data and temporal summation
7.3.3. Bilingualism and accent
7.4. Conclusion
REFERENCES
APPENDICES
Appendix A
Appendix B
Appendix C

Preface

This book presents some data in support of the hypothesis that lexical accent in languages such as Dutch and English (henceforth 'stress accent') differs phonetically from accent in other languages such as Japanese ('non-stress accent') in that it uses to a greater extent material other than pitch. This hypothesis, which will be called the 'stress-accent hypothesis', makes two presuppositions. The first presupposition is that there is such a thing as accent that can be identified and separated from other phonological phenomena in a language. Since the term 'accent' has been used in so many ways to mean so many different things, this presupposition amounts to an assumption of a theory of accent. As a matter of necessary groundwork, therefore, Chapters 1 through 3 of this book will present a definition of accent and review some of the arguments for it. The second presupposition is that phonological categories are not necessarily phonetically uniform across languages or even within a language. A phonological property in one language may be phonetically different from 'the same property' in another language. Moreover, the difference need not be an absolute difference in the phonetic characteristics cueing the property, but can be merely a difference in the relative weight of these characteristics. (Thus the stress-accent hypothesis does not claim that stress accent differs from non-stress accent in not utilizing pitch as a cue, but rather that it differs in the extent to which it uses other characteristics in addition to pitch.) This presupposition seems obvious to the phonetician, who is familiar with the many different phonetic factors that can cue, for example, the single phonological property 'voiced'. I labor the point, however, since many earlier categorizations of accent have been inadequate precisely because the categorists overlooked the phonetic complexity of even the apparently simplest phonological elements. 
Presuming, then, that there is some definable phonological phenomenon 'accent', and that the phenomenon in one language can differ phonetically from the same phenomenon in another language while still using many of the same phonetic cues, the stress-accent hypothesis proposes that stress accents differ from non-stress accents in the degree to which they use phonetic attributes other than pitch patterns. In proposing these phonetic differences, the hypothesis does not also claim that they are the only differentiating factor or even the major differentiating factor. Indeed, the systems that will be cited in this book as typical stress accents often seem to differ from many of the non-stress-accent systems in other ways as well, ways that might be more interesting to the phonologist. I have chosen, however, to state the hypothesis entirely in terms of the phonetic differences, for two reasons. The first reason is that the stress-accent hypothesis, if tenable, may explain why the notion of a type of phonological prominence utilizing loudness or force instead of pitch has persisted so long in the face of evidence that stress accent is not cued primarily by differences in acoustic energy level. Contrastive phonetic studies of like phonological phenomena in different languages often reveal important phonetic detail that is overlooked when only the phonological patterns are compared. In Chapters 6 and 7, data from a contrastive study of accent in English and Japanese will show that English uses duration and other 'secondary' physical attributes as correlates of accent far more than does Japanese, although both accentual systems seem to rely heavily on fundamental frequency. Interpreted in terms of certain experiments in psychoacoustics, this result may explain why English accentual contrasts have been described so often as loudness contrasts, whereas the Japanese accentual system has always been called a 'pitch accent'. The second reason for the focus on phonetic differences is that phonological patterns often cannot be explained adequately without reference to the phonetic material that they utilize.
As Ohala has repeatedly pointed out, 'the inherent physical constitution of sounds, i.e., how they are made and how they sound, [has] as much or more importance than system-internal relations, in determining the behavior of speech sounds' (Ohala, 1979, p. 49). For accentual systems, this means that the particular phonetic media through which accentual prominence is achieved will have significance in everything from how lexical accent might interact with intonational phrasing to how it will figure in sound change. In other words, once the phonetic attributes that characterize stress accents are known, the differences in phonological patterning may fall out naturally from them. In contrasting stress accent to other types of accent, therefore, the stress-accent hypothesis concentrates on the phonetic differences between them. In the preliminary separation of accent from other phonological categories, however, it will be necessary to focus instead on the phonological patterns typical of accent, because, physically, accent is similar to and usually realized concurrently with certain other phonological phenomena. The suprasegmental physical attributes of fundamental frequency and duration that are used in the formation of accentual contrasts are used also in tonal contrasts and in phonemic length contrasts, as well as in the more 'ideophonic' aspects of certain intonational structures. If accent is to be separated from such other phonological uses of the same phonetic material, the delimiting criteria must refer to attributes other than the physical characteristics of the sound patterns. They must refer to those aspects of the category's distribution and occurrence that give clues to its phonological function. The first half of this book, therefore, departs from the strictly phonetic emphasis of the stress-accent hypothesis to present a definition of accent and to relate accent to other phonological phenomena that use the same phonetic material. In this part, there will also be a discussion of other, earlier definitions of accent, comparing the classificatory systems implicit in those definitions with the one proposed here. The second half of the book then presents some evidence for the stress-accent hypothesis and the phonetic typology implicit in it. This evidence is only a first small step toward proving the hypothesis, because a claim for two such broad categories as stress-accent languages versus non-stress-accent languages cannot be tested conclusively without carefully controlled investigations of dozens of languages. A first approximation to such a test, however, can be made by comparing data from one or two representative languages from each group. The comparative data will be from two experiments comparing production and perception patterns in English and Japanese.
The production experiment compares measurements of various acoustic parameters in English and Japanese minimal pairs elicited in comparable linguistic environments, and the perception experiment compares accent judgements by native speakers of synthetic stimuli made from the utterances used to obtain the acoustic measurements. Data from other experiments and other languages will be referred to where available, but it must be emphasized that only the two English-Japanese experiments will have been controlled for such matters as having identical measurement criteria in the production tests, or the same methods of varying the acoustic patterns in the synthetic stimuli for the perception tests. The different emphases in these two parts of the book necessitate a difference in the types of arguments presented for the statements made. The first three chapters rely heavily on anecdotal or qualitative evidence, whereas the last two chapters look only at experimental or quantitative evidence. This difference in style of argument necessitates also a difference in the terminology used for the various phonetic attributes discussed. In the first three chapters, the conventional phonological usage is followed; no attempt is made to consistently differentiate between terms such as pitch and those such as fundamental frequency. (This usage is unavoidable where earlier phonological treatments are discussed and compared.) In the later chapters, on the other hand, usage is stricter; 'pitch' and 'loudness' are reserved for the psychoacoustic attributes (or for the phonetic interpretation of the psychoacoustic attributes) and are not used interchangeably with the terms for the physical attributes, 'fundamental frequency' and 'amplitude'. (This usage is followed especially stringently in Chapters 4 and 5, which review the relationships among the relevant psychoacoustic and physical attributes.) The various chapters in this book differ also in the extent to which they replicate material presented earlier in my doctoral dissertation, Toward Phonetic Criteria for a Typology of Lexical Accent, which was written while I was a graduate student in the Department of Linguistics at Cornell University. The account of accent and tone systems in Chapters 1 and 2 and the review of the psychoacoustics literature in Chapters 4 and 5 are only slightly modified or updated versions of sections of the dissertation. The report of the experiment in Chapter 6 has been reorganized to rid it of earlier redundancies, but presents no new data. The account of accent and its relationship to intonation in Chapters 1 and 3, on the other hand, has been completely rewritten to accommodate changes in my views on the topic, changes that result from work done since the completion of my dissertation. The experiment reported in Chapter 7 also is work done more recently, at the welcome urging of Marcel P.R. van den Broecke and Vincent J. van Heuven.
Since parts of this book replicate my dissertation, this preface gives me the opportunity to thank once again the many people who contributed in one way or another to the original dissertation. I am especially grateful to those at Cornell who aided or encouraged me while there, a long list of people that begins with my dissertation adviser, Frans van Coetsem, whose enthusiastic support first prompted my choice of thesis topic. The work in this book done since the dissertation owes initially to Osamu Fujimura, who arranged for me to join his department at AT&T Bell Laboratories as a post-doctoral fellow upon my leaving Cornell, and to John Ohala, who persuaded me to try to publish the work. I thank also the many other people who have helped and encouraged me at Bell Laboratories, especially Janet B. Pierrehumbert. Her discussions on intonation and accent in general and her collaboration with me on intonation and accent in Japanese have been a major impetus in the development of my understanding of how these two prosodic categories relate, an understanding which I hope will benefit further by future amicable arguments on the points with which she disagrees. This preface also gives me the chance to again thank John S. Cikoski, whose contributions to this book went far beyond the mere tolerance of domestic neglect normally expected of authors' families.

CHAPTER 1

Defining accent

Hypothesis: Stress accent differs phonetically from non-stress accent in that it uses to a greater extent material other than pitch.

The stress-accent hypothesis proposed above refers to two phonological categories: accent and stress. Since these two words have been used by linguists to mean so many different things, a preliminary definition of the terms is always necessary when they are to be used. In the statement of the hypothesis, 'accent' means a system of syntagmatic contrasts used to construct prosodic patterns which divide an utterance into a succession of shorter phrases and to specify relationships among these patterns which organize them into larger phrasal groupings. And 'stress' means a phonologically delimitable type of accent in which the pitch shape of the accentual pattern cannot be specified in the lexicon but rather is chosen for a specific utterance from an inventory of shapes provided by the intonation system. This definition of accent and stress is intended to provide specific functional criteria for relating accent to certain other prosodic categories that linguists have often mentioned together with accent, but have not always related to accent in ways consistent with the stress-accent hypothesis. The two such prosodic categories which are most crucial to the correct interpretation of the stress-accent hypothesis are phonemic tone and intonation. This chapter outlines how the definition of accent presented here relates it to these two categories.

1.1 Differentiating accent from tone

The specification that accent involves syntagmatic contrasts is taken from Garde (1968), and is meant expressly to separate accent from paradigmatic prosodic contrasts, such as the opposition between long and short vowels in Japanese or the opposition between high-level syllables and high-rising syllables in Mandarin Chinese. Contrastive vowel length and tone seem to function primarily to distinguish one word from another that could have occurred in the same place. Their salient function is, in Trubetskoy's terminology, the distinctive one. Thus the length of a vowel in Japanese is just one more of the distinctive features that together oppose it to all other phonemes in the language, making it possible to distinguish, for example, the surnames Oogawa and Ogawa. Similarly in Mandarin, the tone pattern of a syllable is merely part of a large cluster of features distinguishing it from all other phonotactically possible syllables. Accent, by contrast, seems to function less as a distinctive feature than as an organizational feature. In any given utterance, more prominent portions alternate and contrast syntagmatically with less prominent portions, creating a series of accentual phrases that are delimited by or centered around the prominent portions. This organizational function is reminiscent of Trubetskoy's delimitative and culminative functions, but it is unlike them in that it operates on several domains. Trubetskoy defined the two functions in reference to the word, with delimitative features being those that mark the boundaries between words and culminative features being those that signal the number of words without reference to their boundaries. The organizational function of accent, by contrast, often creates accentual phrases that are larger or smaller than anything that could be called a 'word'. In standard Japanese, for example, there is a well-defined level of accentual phrase that in citation form might be identified with the word or noun-phrase, but in actual speech more often corresponds to some larger piece, such as an adjective together with the following modified noun, or a noun in accusative or locative case followed by the governing verb.
In English, similarly, there is a type of prosodic unit defined by alternations among reduced and unreduced vowels, and in actual words, there can be two or more such units. On the other hand, in many languages, it is possible to define a phonological unit 'word' as the smallest piece that can stand alone as a separate phrase defined by a single accentual prominence at some level. Moreover, in some languages, words can contrast paradigmatically by the placement of this culminative accentual prominence when they stand as complete accentual phrases. In such languages, accentual patterns can fill the distinctive function as well as the organizational function. Because of this possibility, it is not always easy to separate accent from tone. A more appropriate view is perhaps to set up a continuum between 'pure' accent and 'pure' tone, locating phonological phenomena in various languages along the continuum by the relative salience of the two different functions.


The idea that accent is different from paradigmatic oppositions such as tone is by no means original with the above-stated definition of accent as an organizational feature. Trubetskoy's delimitative and culminative functions are early statements of such an idea, although Trubetskoy classified culminative features as a special subtype of distinctive feature rather than including them with delimitative features as would the otherwise very similar treatments of Arisaka (1941), Martinet (1965), and Garde (1968). Like the organizational definition of accent, all of these treatments differentiate accent from tone by explicit reference to some phonological function vaguely similar to the organizational function described above. The idea that accent is different from tone is seen also in the work of some generative linguists. McCawley (1970; 1978), for example, has made use of what might be called a culminative 'principle' without referring explicitly to the phonological function. In McCawley's taxonomy, accent is distinguished from tone by the type of phonological rules that characteristically operate on it and by the type of specification that it requires in the lexicon. Whereas the phonological rules that operate on tones produce the familiar assimilations and dissimilations seen in segmental distinctive features, rules operating on accent 'apply in such a way as to yield outputs in which each phrase has at most one accent' (McCawley, 1978, p. 119). Whereas the dictionary entry of a lexical unit for a tone system must specify tone features for each separate syllable, that for an accent system need specify only an accentual feature at a single location in the word. The use of accent in autosegmental phonology to mean a formal place marker for some basic tone shape (Goldsmith, 1976; 1982) is a more recent restatement of this same idea.
These generative linguists define accent in terms of the symptoms of the culminative function rather than in terms of the function per se, but their definitions do yield nearly the same taxonomy that the explicitly functional treatments do, and in this respect they differ from many other earlier classificatory schemes. One earlier American usage, for example, was to analyze stress accent as a system of paradigmatically opposed 'stress levels', and to use accent as a general cover term for any set of prosodic properties that can perform the distinctive function (e.g., Trager, 1941; Hockett, 1958). In these classificatory schemes, accent is defined functionally, but only the distinctive function is recognized. These early structuralist treatments ignore the extreme dependence upon intonational context that characterizes the distinctive use of accent patterns. Another very common usage that has an even longer history is to ignore function altogether and concentrate instead on the phonetic material supposedly involved. This usage can be exemplified by Passy's (1891; 1906) contrast between languages with 'l'accent de force' and languages with 'l'accent musical' or by Jones's (1950) contrast between 'stress languages' and 'tone languages'. In this usage, stress is defined as the linguistic use of articulatory or acoustic energy, and prosodic systems using stress are distinguished phonetically from tone systems, which use pitch. Both of these earlier usages imply a classification of prosodic systems that is fundamentally different from the categorization implicit in the organizational definition of accent. The classificatory scheme that recognizes only the distinctive function excludes from accent those prosodic systems that distribute accentual prominences delimitatively, and includes those systems that would be classed separately as tone if the organizational function were recognized. The classificatory scheme that recognizes only phonetic criteria, on the other hand, puts in tone many of the non-stress accents that use pitch levels or pitch contours culminatively as part of the lexical level of the organizing pattern. Both earlier usages assume a definition of stress and a classification of stress accent relative to tone that is incompatible with the stress-accent hypothesis. In the first usage, the stress-accent hypothesis is a trivial statement, since stress accent is obviously an accent system that uses stress levels rather than pitch to contrast words. In the second usage, the stress-accent hypothesis is a self-contradictory statement, since a system that uses pitch in any way would be not accent at all but tone. Neither of these usages gives a classification of stress relative to tone that captures the essential similarity among all accent systems and their apparent difference from systems of phonemic tone.
A complete understanding of the difference between accent systems and tone systems will probably not be possible until the relationship between each of these and the total prosodic system of the language (including intonation) has been thoroughly described. Even without such a thorough description, however, there is already a large body of data more readily at hand to support the classification of accent as functionally different from primarily distinctive categories such as tone. This evidence comes from areas as diverse as tonogenesis and the phonology of synchronically productive derivational patterns. Moreover, in many of the earlier taxonomies of prosodic systems, there are often hints toward the classification implicit in the definition offered above. Indeed, the emerging consensus among linguists over the last eighty years seems to be that accent is different from tone and that the difference lies in something like its organizational characteristics. The separation of accent from tone is thus neither original to this book nor likely to be very controversial. The development of the organizational definition of accent systems as separate from tone systems and the more readily available evidence for that separation will be discussed further in Chapter 2.

1.2 Relating accent to intonation

By contrast to the separation of accent from tone, a second aspect of the organizational definition of accent is more controversial and rather more difficult to justify, namely, the relationship that it implies between accent and intonation. On the one hand, by avoiding any reference to phonetic material beyond the vague specification that accentual contrasts be prosodic, the definition is meant to preclude any assumption of purely phonetic criteria for separating accentual patterns from intonational patterns in the suprasegmental makeup of an utterance. On the other hand, by stating that these patterns organize the utterance at some basic level of phrasal structure, the definition is meant to preclude also the assumption of a perfect identity between the accentual pattern and some of the more paradigmatic, iconic aspects of the intonational pattern. The organizational definition of accent rejects phonetic criteria for differentiating accent from intonation because the experimental literature on intonation and accent has shown the two systems to be inextricably linked together in the prosodic patterns of utterances. It is impossible to give an adequate description of the production and perception of accent patterns in English without describing at the same time the phonetic and phonological structures of intonation. More recently, experimental investigations of accent and intonation in several other languages have shown that English is not unique in this regard. The notion that accent and intonation are phonetically independent, however, has a long history and has persisted in some version in every linguistic school.
Several sections of Chapter 3 will describe the various incarnations of this notion and review the evidence against it. The organizational definition of accent is not the first theory of accent to reject the notion that accent patterns can be distinguished from intonation patterns on the basis of their phonetic composition. Bolinger's pitch-accent theory also does, and for the same reasons (Bolinger, 1958; 1978). However, Bolinger's account of the relationship between accent and intonation differs from the functional account in several crucial ways. Bolinger equates accent in English directly with certain prominence-lending pitch obtrusions and does not separate the obtrusions into such syntagmatic features as accent placement and more paradigmatic features such as choice of accent shape. Moreover, Bolinger identifies the function of prominence-lending pitch obtrusions completely with the more ideophonic aspects of intonational structure. In pitch-accent theory there is no such thing as a neutral intonation pattern; all uses of accentual prominence have some sort of meaning having to do with an intonational focus on the accented constituent and including an attitudinal component conveyed by the pitch-accent type. The organizational definition of accent, however, precludes such an analysis of the relationship between accent and intonation in English. It assumes a theory of intonation in which there can be such a thing as a 'neutral' intonation pattern specifying little more than the phrasal organization of the utterance to which it is applied. In such an intonational pattern, the accent system predicts many of the prosodic elements at the different levels of the phrasal hierarchy. Accent also provides the link between the intonational system and the lexicon by specifying any aspects of the intonational pattern that are peculiar to individual words and by stating the derivational or inflectional regularities that govern the patterns of larger lexical structures. The organizational definition of accent assumes such a theory of intonation over Bolinger's pitch-accent theory in part because Bolinger's equation between accent and prominence-lending pitch shapes is incompatible with the extent to which accent patterns in English are correlated with and can be cued by other characteristics of the prosodic pattern such as duration and vowel quality. More important, Bolinger's pitch-accent theory of intonation is rejected for its inadequacy as a general theory of accent and intonation.
Despite his claims for its universality (Bolinger, 1978), pitch-accent theory does not provide an accurate description of intonation in languages which do not share the rich inventory of pitch-accent shapes that characterizes English and instead specify the shape of the pitch accent from within the lexicon, leaving to intonation only the choice of deleting or not deleting a pitch accent to accord with the larger prosodic organization of an utterance. Consider, for example, the intonation system of standard (Tokyo) Japanese as described by Pierrehumbert and Beckman (Beckman and Pierrehumbert, 1985; Pierrehumbert and Beckman, in preparation), and compare it with the intonation system of English, as described in Pierrehumbert's earlier work (Pierrehumbert, 1980, 1981; Liberman and Pierrehumbert, 1984; Anderson et al., 1984). An intonation contour in standard Japanese can be described as a sparsely-specified sequence of high and low tones which are grouped into several types of tone 'morphemes'. As in English, these tone morphemes include boundary tone shapes, which do not belong to specific syllables or morae in the utterance, but rather are aligned with the edges of prosodic phrases. The F0 contours in Figures 1.1a and 1.1b illustrate

Figure 1.1. Fundamental frequency contours for two renditions of the phrase migigawa ga sagerare'ru. Version in (a) is a statement with a L% boundary tone, version in (b) is a question ending in a H% boundary tone.

two very common types of boundary shapes that can occur at the ends of major intonational phrases in Japanese. They are very similar to the contrasting English boundary tones shown in Figures 1.2a and 1.2b. In addition to the boundary shapes, the tone morphemes of a Japanese intonation contour also include shapes that belong to specific syllables or morae in the utterance. The high tone followed by a low tone in the utterances in Figure 1.1, for example, is associated to the penultimate syllable. This high-low shape must occur at that particular place in this intonation contour. Similarly in the English utterance in 1.2a, the high tone on the first syllable must go on that syllable in this word. Such associated tone shapes are the 'pitch accents' of the utterance. If any of the utterances in these two sets of intonation curves were longer and included more than a single accentual phrase, more such pitch accents would be required. Then

Figure 1.2. Fundamental frequency contours for three renditions of the word Anna. Version in (a) is a simple neutral declarative intonation (H* L L%). Version in (b) is a contour conveying surprise or incredulity (L+H* L H%). Version in (c) is a typical interrogative intonation (L* H H%).

the scaling of the accents within the pitch range of the utterance could be compared, and it would soon become clear that the high tones are not all at the same high value within the pitch range, nor are the low tones all at the same low value. The relative placement of its component tones within the pitch range is an important part of the prominence of a pitch accent relative to other pitch accents. The relationships of greater or lesser prominence among the accents in turn contribute to the larger organization of the intonation contour, as governed by certain language-specific rules. In Japanese, for example, a more prominent accentual phrase cannot occur to the right of a less prominent accentual phrase within the same intermediate-level phrase grouping. In an English intonation contour, no accentual tone-shape can occur after the most prominent accent within such a phrase grouping. The existence of pitch accents and the principles governing the relationships among the accents are important similarities between these two accent languages. There is also one major difference between the two languages that is evident in the contrast between the English intonation contours in Figures 1.2b and 1.2c. These two utterances have the same accentual structure and the same H% boundary tone. However, whereas the accent in Figure 1.2b consists of a pitch rise on the accented syllable, the accent in Figure 1.2c is an associated low tone. The choice of this particular pitch shape gives the utterance a different meaning from the incredulous rhetorical question in Figure 1.2b. This availability of several different possible shapes for the accent is similar to the availability of different possible shapes for boundary configurations in the two languages. But the choice of an alternate shape for the accent is not a possibility in Japanese. The accent on the penultimate syllable in the Japanese intonation contours in Figures 1.1a and 1.1b must consist of a high tone followed by a low.
The only choice that exists for the speaker is whether to change the organization of the utterance by subordinating this high-low accent completely or partially to other accents within the same utterance. In describing standard Japanese, the phonologist can specify both the place and the pitch shape of the accent within the lexicon or he can specify only the place of accent in the lexicon and list the single possible shape in the intonational inventory of tone morphemes. There are no clear language-internal grounds for choosing between these two modes of description, although the existence of other nonstress-accent languages in which both the place and the shape of the accent must be specified (e.g., Swedish) might influence a choice for the former. In describing English, on the other hand, the phonologist can choose only the latter mode of description, because the pitch shape of the accent is by no means the property of the word with the accent, but rather, like the shape of the boundary configuration, is a property of the specific intonation contour.


This characteristic of the relationship between the lexical specification of accent placement and the intonational specification of accent shape in English is the defining characteristic that differentiates stress accent from non-stress accent. When a nonstress-accent language has several possible pitch shapes for accents, the shape is a phonological feature of the individual lexical item. Of the two different accentual pitch shapes of standard Swedish, for example, some words have one and some words have the other. It is an arbitrary feature of the word itself, just as the placement of the primary stress is for an English word. In a stress-accent language, by contrast, the choice of pitch shape for an accent is like the choice of the boundary configuration. It is part of the paradigmatically contrasting inventory of tone morphemes available in building the intonational meaning of the utterance and is not a phonological feature of the word. The organizational definition of accent recognizes that not all intonation systems are like that of English, and instead takes the organizational capacity of accents and prominence relationships among them to be the universal defining characteristic of an accent system. Moreover, since accentual prominence is not defined a priori to be a matter of pitch obtrusion, the definition allows for the possibility of phonetic differences among accent languages. Indeed, this possibility is the motivation for the stress-accent hypothesis. Since stress-accent systems can associate the same accent within an accent pattern with several different pitch shapes for different intonation contours, might they not then compensate for this phonetic uncertainty by using other phonetic cues more? Might there not be, for example, an accompanying durational pattern to ensure that the tonal pattern of an utterance is correctly interpreted for its particular accentual organization? These issues will be discussed in more detail in Chapter 3.
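The two ordering constraints on accentual prominence stated earlier in this section (the Japanese constraint within an intermediate-level phrase, and the English constraint on what may follow the most prominent accent) can be made concrete with a small sketch. It is only an illustration under stated assumptions: prominence is represented here by an arbitrary number per accent, the function names are hypothetical, and the treatment of equally prominent accents is a choice made for the example, not something the text specifies.

```python
from typing import Sequence


def japanese_order_ok(prominence: Sequence[float]) -> bool:
    """Check the Japanese-style constraint within one phrase grouping.

    No accentual phrase may be more prominent than one to its left,
    so the values must never rise from left to right. Equal values
    are allowed here (an assumption of this sketch).
    """
    return all(left >= right for left, right in zip(prominence, prominence[1:]))


def english_order_ok(prominence: Sequence[float]) -> bool:
    """Check the English-style constraint within one phrase grouping.

    No accent may follow the most prominent accent, so the last accent
    must be the most prominent one. A tie for the maximum is treated
    as a violation (an assumption of this sketch).
    """
    if not prominence:
        return True  # a phrase with no accents satisfies the constraint trivially
    *earlier, last = prominence
    return all(p < last for p in earlier)


if __name__ == "__main__":
    # Hypothetical prominence values, left to right within one phrase.
    print(japanese_order_ok([3, 2, 2]), japanese_order_ok([2, 3]))
    print(english_order_ok([1, 2, 3]), english_order_ok([3, 1]))
```

The sketch deliberately says nothing about pitch shape. It checks only the syntagmatic relationships among accents, which is the part of the system that the organizational definition treats as common to both languages.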

CHAPTER 2

Accent systems and tone systems

As noted in Chapter 1, the definition of accent assumed there implies that accent systems are distinct from primarily paradigmatic systems such as tone. This distinction is based upon apparent differences in the phonological functions of the prosodic patterns of words in languages that can be classified as having accent or as having tone. Not all linguists, however, have recognized these functional differences. Indeed, some linguists have classified the prosodic systems included in these categories in ways having almost nothing to do with phonological function. This chapter will review the treatment of accent versus tone in other earlier categorizations of prosodic phenomena, and then discuss the more easily discernible characteristics that differentiate the two, showing how these characteristics are symptomatic of their differing functions.

2.1 Historical overview

2.1.1 Early classifications based on phonetic criteria

Early descriptive linguists based their categorizations of prosodic phenomena almost entirely upon phonetic criteria. Accent was categorized as separate from tone in these early taxonomies because of its supposedly different physical properties. The physical properties attributed to accent are stated most explicitly in Sweet's definitions of 'stress' and 'force':

Physically [force] is synonymous with the effort by which breath is expelled from the lungs.... Acoustically it produces the effect known as 'loudness' which is dependent on the size of the vibration-waves which produce the sensation of sound. (Sweet, 1906, p. 47)

The comparative force with which the syllables that make up a longer group are uttered is called 'stress'. (ibid, p. 49)

Sweet's understanding of the physical constitution of 'stress' was typical of linguists of his time, and served as the basis for the various taxonomies that separated prosodic phenomena involving loudness or force from those involving pitch. Passy, for example, opposed an accent musical, which utilized pitch, to an accent de force, which was directly equivalent to Sweet's 'stress':

La force (Allemand lautheit, anglais loudness. Nous n'avons pas de bonne expression équivalente.) provient de la rapidité avec laquelle l'air est chassé des poumons. (Passy, 1891, p. 41-42)

Quant à la force relative des diverses parties d'un groupe, il est facile de distinguer des syllabes fortes, moyennes et faibles.... On dit souvent que la syllabe forte est accentuée ou porte l'accent de force; que les autres sont des syllabes inaccentuées ou atones. (Passy, 1906, p. 27)

Although Passy thus separated accent from tone entirely in terms of their supposed physical makeup, he did touch on some functional differences among languages with l'accent de force. He noted, for example, that by contrast to French, the opposition between accented and unaccented syllables in some other languages is 'very marked' and can differentiate meanings:

En français, la différence est si peu sensible, que des observateurs étrangers ont pu croire que toutes nos syllabes étaient également fortes. Dans les langues germaniques, surtout en allemand, l'opposition est au contraire très marquée; de même en italien, en espagnol et en portugais. Elle peut alors servir à changer complètement le sens, par exemple d'un mot composé: anglais 'drawback «inconvénient»; to 'draw 'back «reculer». (Passy, 1891, p. 63)

These differences in the way languages use stress, however, were clearly secondary to the primary opposition between languages with l'accent de force and languages with l'accent musical. Examples of the latter were languages like Swedish, Lithuanian, Chinese and Vietnamese: languages in which 'deux mots, identiques pour tout le reste, sont néanmoins parfaitement différenciés par leur intonation' (Passy, 1891, p. 70-71). The asymmetry of this opposition should be noted. L'accent de force, on the one hand, is a completely phonetic category, including uses of force ranging from that of distinguishing words ('Elle peut alors servir à changer complètement le sens.') to that of highlighting particular words in a sentence (Passy, 1906, p. 32-35). L'accent musical, on the other hand, is phonetically delimited from l'accent de force, but it is also functionally delimited from other uses of pitch. It does not include languages in which 'les intonations ... sont employées uniquement pour indiquer le sens général d'une phrase' (Passy, 1891, p. 70). Passy's categorization of l'accent de force versus l'accent musical translates exactly into Jones's (1950) distinction between 'stress languages' and 'tone languages'. Like Passy's l'accent de force, 'stress languages' is a phonetically delimited category covering all possible phonological functions:

Force of utterance, abstracted from the other attributes of speech sounds is termed stress. (Jones, 1950, p. 134)

Languages in which meaning depends in any degree upon types of stress or upon the location of strong stresses in sequences of syllables are termed 'stress languages'. (ibid, p. 136)

By saying 'depends in any degree', Jones meant to include in this category everything from languages which have minimal pairs contrasting the placement of stress in words to languages which use variations in the placement of accentual prominence only to signal special intonational meanings such as contrastive focus or emphasis. Opposed to this undifferentiated phonetic category was a category that corresponds exactly to Passy's l'accent musical:

Languages in which voice-pitches are used for the purpose of distinguishing words are called 'tone languages'. (Jones, 1950, p. 152)

This primarily phonetic categorization of stress languages and tone languages coincides most closely with popular usage of the English terms 'stress' and 'tone', and has been assumed in some introductory linguistics texts even very recently. Even among linguists who subscribed to this popular and persistent usage, however, there have also always been doubts about the validity of the proposed phonetic delimitation of the category 'stress'. From the very beginnings of descriptive linguistics, some phoneticians have suspected that the linguistic phenomenon 'stress' does not correspond exactly to the physical phenomenon identified as 'stress'. Even Sweet, for example, acknowledged:

The discrimination of degrees of stress is no easy matter in any case, because of counter-associations of quantity, intonation, and vowel-quality, which make us apt to fancy that long, high-toned, or clear-voweled syllables have stronger stress than they really have. (Sweet, 1906, p. 51)

Similarly, Jones distinguished carefully between stress, 'which is a subjective activity on the part of the speaker,' and prominence, 'an effect perceived by the hearer' (Jones, 1950, p. 137). The latter, he emphasized, could be due to many phonetic attributes other than physical force of utterance. He was receptive to early experimental research on stress perception, such as that of Scott (1939), and acknowledged especially the importance of pitch patterns as cues to primary stress:

The linking of stress with special intonations in English is so frequent and so necessary that the opinion has been expressed by some that stress is immaterial, or at least negligible, and that all necessary prominence is given to syllables by intonation only.... [T]here is much to be said in favour of this view. (Jones, 1950, p. 147)


He was most confident about this dependence between primary stress and pitch in English, but observed that it seemed to be a characteristic of stress languages in general:

It would seem that in other stress languages, such as Italian, Spanish, Russian and Greek, differences of stress are likewise almost always accompanied by special intonations. (Jones, 1950, p. 147)

Jones was also skeptical of any simple equation between physical force of utterance and linguistic stress when it came to secondary stresses. He ascribed these to 'tamber [timbre], sound-groupings, length or voice-pitch, or combinations of these, with or without the accompaniment of [physical] stress' (Jones, 1950, p. 148). Moreover, at the same time that Jones acknowledged phonetic factors other than physical 'stress' in the production and perception of linguistic stress, he was also noticing the very special status that stress had in the sound systems of most languages, a status that differentiated it from tone as well as from other phonological features:

[I]t is the location of strong stresses and not their type which has significance in stress languages. In this respect stress differs from all other features of speech. In a stress language every word of more than one syllable must have a strong stress somewhere, and one word may be distinguished from another by the position of the strong stress. Being however a matter of location and not of kind, ordinary stresses cannot be grouped in any way into families corresponding to phonemes, chronemes and tonemes. (Jones, 1950, p. 152)

Jones's approach to the classification of speech-sounds, however, always started from a phonetician's perception of the physical properties involved, rather than from a grammarian's conception of an underlying linguistic structure. Despite the admitted problems, therefore, he kept the phonetic criteria paramount in his delimitation of stress from tone. Hence, like Passy, he classified Norwegian, Swedish, and Lithuanian as tone languages along with Cantonese and Thai, and opposed them as a group to 'stress languages' such as English. Alongside this longstanding phonetic taxonomy, however, there were developing many taxonomies of prosodic phenomena in which functional criteria were given relatively more weight. Bloomfield's (1933) distinction between primary and secondary phonemes is an early example.

2.1.2 Bloomfield's primary versus secondary phonemes

In his discussion of suprasegmental features in Chapter 7 of Language, Bloomfield points out that many languages use suprasegmental features such as pitch and duration as primary phonemes, 'quite on par with' segmental phonemes. In these languages, he implies, the linguist's distinction between 'basic speech sounds' (segmentals) and 'modifying features' (suprasegmentals) is merely an expositional convenience. However, Bloomfield then immediately contrasts another use of suprasegmental features:

On the other hand, most languages ... use some of the modifying features as secondary phonemes – phonemes which are not part of the simplest linguistic forms, but merely mark combinations or particular uses of such forms. (Bloomfield, 1933, p. 109)

Bloomfield's distinction between primary and secondary phonemes allowed him to make a more formal statement of the special status of stress that Jones describes. It also allowed him to categorize with stress certain prosodic phenomena that share the special status of stress, but that recognizedly involve pitch patterns. For example, whereas Jones's phonetically delimited category 'tone languages' lumps Swedish together with Cantonese, Bloomfield can contrast them by function. Swedish, for him, is an example of 'languages ... using secondary phonemes of pitch as we use those of stress,' as contrasted to Cantonese, in which 'features of pitch are used as primary phonemes' (Bloomfield, 1933, p. 116). Bloomfield's grouping of the Swedish accentual system together with English stress and in opposition to Cantonese tone is identical to the grouping that would result from applying the definition of accent proposed above. A closer examination of Bloomfield's definition of secondary phonemes as opposed to primary phonemes, however, makes it clear that secondary phonemes are not exactly equivalent to accents as defined and contrasted to tone by the later functionalists. The most important difference is that Bloomfield's category 'secondary phonemes' does not differentiate between the more purely organizational aspects of accent that function even in accent patterns as abstracted from particular intonational contexts and other uses of prominence that are achieved by the actual realization of accent patterns within a given intonation pattern.
That Bloomfield recognized some distinction between these different aspects of accent is clear from his discussion of the various uses associated with stress and pitch as secondary phonemes in English:

The stress phonemes step in only when two or more elements of speech are joined into one form: a simple word, like John, contains no distinctive feature of stress; to hear a distinctive feature of stress we must take a phrase or a compound word or, at least, a word containing two or more parts, such as contest. The pitch phonemes, on the other hand, occur in every utterance, appearing even when a single word is uttered, as in John! John? John. On the other hand, the pitch phonemes in English are not in principle attached to any particular words or phrases, but vary, with differences of meaning, in otherwise identical forms. (Bloomfield, 1933, p. 116)

In this passage, Bloomfield differentiates between the two functions that he assigns to secondary phonemes in the definition quoted above. Some secondary phonemes exist to 'mark combinations of simple forms' and others exist to 'mark particular uses'. This distinction seems to be in accordance with the organizational definition of accent, which would make the latter a function of paradigmatic intonational variations, and the former a part of syntagmatic accentual contrasts. However, this appearance of agreement is misleading. It is clear from Bloomfield's elaboration on the stress phonemes that 'marking combinations' includes more than the purely organizational aspects of accent. For example, in Bloomfield's discussion of the English stress system, no distinction is made between the culminative prominence that characterizes the general prosodic organization of lexical forms (e.g., the prominence of the root syllable of 'parking', where the lack of final punctuation and the lowercase p imply the abstract dictionary specification) and the special emphasis that can be produced by aligning a specific intonation pattern in a particular way (e.g., the prominence of the nuclear pitch accent on my in 'This is my parking place.', where the punctuation and italics imply a H* L- L% intonation with an exaggerated peak for the H*). Both the abstract accentual prominence and the particular intonational prominence are somehow the same function of 'marking combinations'. Moreover, Bloomfield's discussion of secondary phonemes makes some apparently arbitrary distinctions among accent systems. For example, only Russian is singled out from among stress languages as having contrastive placement of stress as a primary phoneme. Bloomfield says:

[S]ome languages of this type contain simple linguistic forms (such as unanalyzable words) of more than one syllable, which may be differentiated accordingly, by the place of the stress; thus Russian ['gorot] 'city' and [mo'ros] 'frost' are both simple words, containing no prefix or suffix; here, accordingly, the place of stress has the value of a primary phoneme. (Bloomfield, 1933, p. 111)

He implies by this differential treatment of Russian that English, by contrast, never distinguishes 'simple linguistic forms' by stress. He implies that because the prefix in- can be identified in insert, the contrast in accent patterns between the verb insert and noun insert is functionally unlike the contrast in the Russian pair. Indeed, Bloomfield lists this English minimal pair along with other examples of English stress patterns such as the sentences:

I'm going out. [ajm ˌgoiŋ 'awt]

This is my parking place. ['ðis iz "maj 'parkiŋ ˌplejs]

Apparently he considered the assignment of stress in the lexical forms [in'srt] and ['insrt] to be no different from the assignment of the phrase accent to the ['awt] of 'I'm going out.' or even from the assignment of the special emphatic prominence to the my in this sentence. Thus unlike the place of stress in gorod and moroz, the place of stress in insert (verb) and insert (noun) are, for Bloomfield, secondary phonemes that 'merely mark combinations', just like the stress patterns in the larger phrases. Bloomfield's discussion of stress patterns in English and other stress languages makes it seem as if he had intuitively grasped the fact that oppositions of stress in English noun-verb pairs are not functionally an exact equivalent to oppositions among segmental phonemes. But recognizing only the distinctive function, and not understanding the hierarchical nature of accentual organization and its relationship to intonation, he could only state the difference in terms of the domain of the distinctive function in the two types of oppositions. Therefore, he sets up primary versus secondary phonemes. That is, instead of positing a different function altogether for stress, he speaks of 'stress phonemes' and 'distinctive features of stress', assuaging his qualms about their equivalence to more obviously distinctive features by limiting the domain of their function in English to 'a phrase or a compound word or, at least, a word containing two or more parts, such as contest' (Bloomfield, 1933, p. 116). Notice, too, that he carefully states the domain in terms of morphological rather than phonological complexity, because stating it in terms of phonological complexity ('words of two or more syllables, such as contest') would have made stress a primary phoneme in English as well as in Russian.
It is interesting that when Bloomfield turns his discussion to languages that 'differ from English in using secondary phonemes of pitch as we use those of stress', he abandons this criterion that the domain of the distinctive function for secondary phonemes be morphologically complex (Bloomfield, 1933, p. 116ff). He cites examples of such secondary pitch phonemes from Norwegian and Japanese, but in these examples, two of the Norwegian words cited have no inflectional endings, and all of the Japanese forms are monomorphemic nouns. In citing the tone patterns of the lexical accent in these forms as functional equivalents of English stress, Bloomfield must have been considering not the domain of the contrasting patterns so much as the fact that the tonal oppositions are syntagmatic and culminative, since otherwise these should be classified together with the monomorphemic Russian forms. Bloomfield must have recognized that the 'normal' and 'higher' pitch levels in the Japanese forms are 'relative' to each other within the word rather than being paradigmatically opposed, and that the distinctive placement of the higher pitch level can occur only once per word. He must have informally recognized the existence of something like the culminative function, since he states the principle directly in his description of stress systems in general:

In stress-using languages like these [English, Italian, etc.], the stress characterizes combinations of linguistic forms; the typical case is the use of one high stress in each word in the phrase, with certain unstressed or low-stressed words as exceptions. (Bloomfield, 1933, p. 111)

Despite these hints at other possible phonological functions, however, Bloomfield formally recognized only the distinctive function, and, therefore, could formally state the difference between accent and tone only in terms of the domain of the distinctive function. As a result, his categories are muddled and imprecise. Place of stress is a primary phoneme in Russian but not in English. The Swedish and Japanese accentual systems are described under secondary phonemes, but the Lithuanian and Serbian accents are called primary phonemes.

2.1.3 Later structuralist treatments of accent

It is perhaps because of the cloudiness of Bloomfield's category 'secondary phonemes' that later American linguists ignored altogether the functional distinctions that Bloomfield hinted at. In Trager (1941), for example, all phonological structure is reduced to the distinctive function. Where Bloomfield distinguished the use of prosodic properties as secondary phonemes from their use as primary phonemes, Trager sets up only the single category 'accents', covering everything from tone in Mandarin and contrastive length in Estonian to the stress systems of English and Spanish. This taxonomy forces Trager to abandon Bloomfield's insight that it is the place of stress that has the value of a primary phoneme. Instead, he must state the oppositions of stress in terms of paradigmatically opposed 'stress registers'. Trager realizes that this equation between the functions of tone and of stress makes it 'necessary to look for the possibility of soft stress occurring by itself, as it should be able to do theoretically if it is parallel to low tone,' but he can offer only a few feeble examples such as the perennially reduced vowel in the adverb just (Trager, 1941, p. 137). Hockett (1955; 1958) similarly treats the distinctive function as paramount.
He too analyzes stress as a system of paradigmatically contrasting 'stress levels', and defines the category 'accentual system' as any system in which minimal pairs can contrast only in prosodic pattern. Hockett's distinction between systems with and without zero (Hockett, 1955, p. 65-72) in many cases distinguishes between languages that Garde would classify as having and not having accent, but Hockett recognizes no functional difference between the two types. That the opposed terms in a system with zero can be described as positive and negative (the presence or absence of a mark) is to Hockett merely a notationally convenient fact about their pattern of distribution, and not an indication of a more fundamental difference. Hence his discussion of why linear systems without zero always involve contrasts of pitch rather than of stress is stated in terms of the different physical property supposedly involved in 'stress':

Pitch levels are always relative matters.... Yet there is such a thing as 'absolute pitch' in the musical sense ... and a syllable of a tone language pronounced in isolation can usually be identified as to its structurally relevant pitch level.... The loudness scale [by contrast] is apparently much more purely relative.... The relative loudness of a syllable, or of any other sound, can be measured and judged only relative to the noise-level of the background in which it is presented. (Hockett, 1955, p. 68-69)

What Hockett seems to be saying here is that linear systems without zero do not involve stress because 'stress' (i.e., loudness) is harder to use distinctively. There is no suggestion in this passage that stress is 'relative' not because its physical quantity can only be measured in contrast to its background, but because its phonological quality as an accentual property is manifest in its function of contrasting to its background. In other words, for Hockett, phonological contrast is always relative to other phonological entities that could have occurred in the same place, never to other entities actually occurring in adjacent places. Thus, like Bloomfield and Trager, Hockett recognizes formally only paradigmatic contrasts and the distinctive function. For an explicit statement of any other phonological function, it is necessary to turn to the European linguists, beginning with Trubetskoy and his Grundzüge der Phonologie.1

2.1.4 Trubetskoy's correlation of accent

Whereas American linguists after Bloomfield recognized only the single phonological function of distinguishing different meanings, Trubetskoy recognized two others as well. In addition to signaling different lexical units (the distinctive function), phonic features could be used in the phonological system of a language to signal the number of lexical units (the culminative function) or the boundaries between lexical units (the delimitative function). The notion of the culminative function allowed Trubetskoy to identify a culminative use of prosodic properties and to describe a 'correlation of accent' that in several important respects is identical to the category 'accent' adopted in Chapter 1. The defining difference between culminative and nonculminative prosodic oppositions is that in the former, one of the properties opposed can occur only once in a word:

[In culminative oppositions], the prosodemes are distributed in such a way that each word has only a single prosodeme which by virtue of its differential property stands out among all others. The remaining prosodemes of the same word show the opposite differential property. (Trubetskoy, 1969, p. 182)

1. Reference will be to Baltaxe's (1969) English translation, Principles of Phonology.

This culminative distribution differs fundamentally from the nonculminative use of prosodic features. In a nonculminative opposition, each minimal prosody-bearing unit is independently assigned one of the opposed prosodic features, and limitations on the distribution of these features are restricted to the same sort of neutralizations as can befall segmental oppositions. To use one of Trubetskoy's own examples, a polysyllabic word in Czech can have all short vowels, mostly long vowels, or a mixture of long and short vowels in any sequence, the only restriction being that long vowels cannot occur in initial position. In a culminative prosodic opposition, by contrast, the assignment of a prosodic feature is utterly dependent on the assignment of prosodic features in the rest of the word. One prosodeme is singled out by bearing one of the differential properties, and then the rest of the word is automatically restricted to bearing the opposite differential property. In a polysyllabic word in Russian, for example, one syllable is singled out by being stressed and the other syllables are then necessarily unstressed. Similarly in Lithuanian, one mora in a word is singled out by being high-pitched and the other morae are then necessarily low.

Two aspects of this singling out are especially important. The first is that it involves a ranking of the culminative prosodeme against the other prosodemes in the word; the culminative prosodeme 'stands out from all other prosodemes and is not second in prominence to any other prosodemes of the same word' (Trubetskoy, 1969, p. 189). The actual physical means by which the culminative prosodeme is made to stand out can be a mixture of several phonetic factors. What the factors are matters less than that they make the culminative prosodeme protrude above its surroundings.

What is phonologically important here is only a general prominence of the culminative prosodeme, that is, the fact that this prosodeme stands out among all others. (Trubetskoy, 1969, p. 183)

The second important aspect of the way that culminative prominence singles out a culminative prosodeme is that the differential properties it uses are then involved primarily in a syntagmatic contrast, and only indirectly in a paradigmatic one. The culminative prosodeme is opposed first to the nonculminative prosodemes in the same word. Then, if the prosodic opposition figures also in a paradigmatic contrast, it is not because one word has one differential property and the contrasting word an opposing one, but because the position of the prominent prosodeme differs (Trubetskoy, 1969, p. 189).²

It is these two characteristics that differentiate accent from nonculminative oppositions such as tone. Tone is a system of paradigmatic oppositions just like any segmental phonemic opposition. Variations of pitch constitute tone if a particular pitch pattern on a prosodeme in one word contrasts with a different pitch in the analogous prosodeme in an otherwise identical word. Accent, on the other hand, can figure in paradigmatic oppositions, but this is not its primary function. Rather, its primary function is to set up syntagmatic contrasts among the prosodemes of an utterance, and to thereby organize the utterance around the location of the units that are marked by the prominences. Tone is a primarily paradigmatic opposition, and, therefore, it is phonologically important that one tone be distinct from the other tones that might have occurred in the same place in the utterance. The necessity of distinguishing the tones implies no ranking of the phonetic features effecting the distinction. Accent, by contrast, is primarily syntagmatic, and therefore, it is phonologically more important that the accented portions of the utterance be distinguished from the adjoining unaccented portions by being relatively more prominent, necessarily involving a ranking of the phonetic features involved.
It should be noted, however, that of these two defining characteristics of accent, only the notion of the ranking that is involved in culminative prominence is stated explicitly in Principles of Phonology. Although Trubetskoy implied the syntagmatic contrast between the accented and unaccented prosodemes in his description of the correlation of accent, he did not recognize the primacy of this contrast over the distinctive opposition. Although he introduces the culminative, delimitative, and distinctive functions in his 'Preliminary Remarks' as if they were three independent uses of phonic properties, it is clear from the major divisions of the book that he considered the culminative function to be a mere addendum to the distinctive function. Culminative prosodic oppositions (free accent) are discussed together with nonculminative prosodic oppositions under 'The Theory of Distinctiveness', and both these uses of prosodic properties are thus opposed to fixed accent, which is discussed separately under 'The Theory of Delimitative Elements'.

2. Cf. Bloomfield's insight that it is the place of the prominence that 'has the value of a primary phoneme' (Bloomfield, 1933, p. 111).

2.1.5 Later functionalist treatments of accent: a critique of Trubetskoy

In the case of segmental properties, Trubetskoy's assessment of the distinctive function as the primary phonological function is without doubt justified. The validity of this judgement is seen immediately when the delimitative and culminative uses of segmental properties are examined. Consider, for example, the contrast [g] versus [ŋ] in Japanese. For many speakers of the standard dialect, the stop occurs only word-initially and the nasal occurs only word-medially, so that the contrast between the two velar consonants has only the delimitative function. Nevertheless, [g] and [ŋ] do take part in many other contrasts that are distinctive. For example, [g] contrasts with [d], as in goo (a unit of measure) versus doo 'how', and [ŋ] contrasts with [n], as in kagi 'key' versus kani 'crab'. Thus the two phones never function merely as delimitative units. The presence of [g] at a particular place in an utterance can never signal only that this is the beginning of a word. It must necessarily signal also that the word is one that begins with /g/ rather than with one of any number of other possible phonemes.
Similarly, in a language with mostly monosyllabic morphemes, the opposition between phones that can occur as syllable nuclei and those that cannot fulfills something like the culminative function. The syllable nuclei contrast syntagmatically with the surrounding consonants to signal the number of morphemes in any given utterance. At the same time, however, the syllabic phones are also always involved in a paradigmatic distinctive opposition; a particular vowel occurring in a particular place in an utterance signals that here is a morpheme, but it also necessarily (and primarily) signals that the morpheme is one containing that particular vowel and not one of any number of other possible syllable nuclei.

Because the distinctive function is so obviously primary in the case of segmental contrasts, linguists have almost automatically assumed its primacy in the case of prosodic contrasts. In Martinet's words:

Troubetzkoy a bien marqué la nécessité d'un examen fonctionnel des faits accentuels. Il l'a tenté dans le cadre de sa distinction entre trois fonctions: la fonction distinctive, la fonction démarcative et la fonction culminative. Mais l'étude de ces deux dernières a souffert du fait que, la fonction distinctive paraissant en général de beaucoup la plus importante, c'est par elle qu'on commençait, et que les faits prosodiques, où s'entremêlent les trois fonctions, voyaient, dès l'abord, leur rôle distinctif éventuel si bien mis en valeur que celui-ci semblait, à tort, partout et toujours le plus décisif: à rechercher en anglais, en russe ou en espagnol, les paires de mots du type (to) increase – (an) increase, múka – muká, cortes – cortés, on oubliait de se demander si la fonction réelle de l'accent n'était pas ailleurs que dans la distinction de quelques douzaines de paires de mots ou de formes qui, le plus souvent, ne sauraient guère figurer dans le même contexte. (Martinet, 1965, p. 149)

As Martinet suggests in this passage, however, prosodic properties are not necessarily like segmental properties; for some prosodic oppositions, the primacy of the distinctive function must be questioned. The most obvious of such cases are the accentual systems in which accent never contrasts two otherwise identical words. In languages such as Czech, where every word can have at most one prominent syllable and the position of that syllable is fixed relative to the initial or final boundary of the word, the complex of prosodic properties that make up the prominence has only a delimitative function. It can contrast only syntagmatically with the lack of prominence in the surrounding syllables. It cannot also be opposed paradigmatically to some other complex of prosodic properties that could have appeared in the same place to signal a different word. The delimitative use of accentual contrasts is thus qualitatively different from the delimitative use of segmental contrasts, such as the [g] : [ŋ] opposition in Japanese cited above.

While the difference between the two types of oppositions is most obvious when they are used delimitatively, accentual contrasts also differ qualitatively from segmental contrasts when they are used distinctively. Consider, for example, the two minimal pairs (a) house : (to) house and (an) increase : (to) increase. In the first pair, the distinction between the noun and the verb hinges directly on the paradigmatic contrast between the voiceless /s/ and voiced /z/ as the final consonant. In the pair (an) increase : (to) increase, on the other hand, the distinction does not hinge on any contrast between a single set of opposed properties at a single place in the two words. Instead, the contrast is between the total pattern of prominent first syllable followed by unaccented second syllable in the noun as opposed to the total pattern of unaccented first syllable followed by prominent second syllable in the verb. Whereas the segmental contrast figures directly in the paradigmatic contrast between the two words, the accentual contrast figures directly only in the syntagmatic contrast that makes up the accentual pattern. It figures merely indirectly, via the opposition of the differing accentual patterns, in the paradigmatic opposition between the two different forms. In short, even when accentual contrasts are used distinctively, there is reason to question the primacy of the distinctive function.

Once it is recognized that the distinctive opposition is secondary to the syntagmatic culminative contrast, the basic similarity between delimitative and purely culminative uses of prominence becomes apparent. Whether the place of the highest prominence is predictable given the sequence of phonemes making up a word (fixed accent) or must be specified along with the sequence of phonemes (free accent), accent's salient function is the same. Accent syntagmatically contrasts higher prominence with adjoining stretches of lower prominence, and by associating at most one significant high prominence to each 'word' in an utterance, it serves to organize the utterance into the phonological units that are defined by the accentual patterns of the words and larger accentual phrases. Thus whether a language uses accent delimitatively or distinctively, it also uses it at the same time culminatively. In Martinet's view, it is this common culminative function that is basic, and not the fact that in free accent the place of the significant high prominence can distinguish word pairs. He summarizes as follows:

Ce qu'il est important de relever ici c'est que, dans tous les cas, une fois choisie l'unité accentuable, ... elle ne connaît jamais qu'un seul accent, et cette constance dans l'unicité indique clairement le caractère fonctionnel particulier de l'accent. Il ne peut guère faire de doute que l'accent sert essentiellement à individualiser les unités sémantiques dans la chaîne parlée et, par raccroc seulement, à les opposer dans le système par la place qu'il occupe dans l'unité accentuelle. (Martinet, 1965, p. 151-152)

Once accent's culminative organizational function is recognized, the minimal grammatical unit associated with accent also becomes apparent. An accent pattern is something defined over at least a word. Conversely, the word is phonologically defined by accent; it is the smallest unit that can stand alone with its own accent pattern. In this way accent is again fundamentally different from primarily distinctive units such as segmental phonemes or tones; the minimal grammatical unit defined by those units is the morpheme rather than the word.


On the other hand, the combination of morphemes into words is sometimes marked by distributional restrictions on the distinctive oppositional units that make up the morphemes. For example, the combination of morphemes sitting in English is marked as a single lexical unit in part because it usually contains an alveolar flap, which normally occurs as a variant of /t/ only medially. Similarly, the combination of morphemes Bunde in German is identifiable as a single word because it contains a voiced stop, which cannot occur word-finally. In many languages, distinctive features are thus used also as organizational features to mark off words, either by the restriction of particular phonetic variants to certain positions in the word, or by the neutralization of a particular opposition in certain environments defined in relation to the word's boundaries. However, this organizational use of distinctive oppositional units is sporadic; it can mark off only the proportion of the words in the language that contain the particular oppositional unit. The restriction on the occurrence of the flap in English, for example, would be of no help in identifying sitting as a single lexical unit if it did not contain the phoneme /t/. Similarly, the neutralization of the opposition between /t/ and /d/ word-finally in German would be of no help in identifying Bunde as a single lexical unit if it did not contain the phoneme /d/. The flap in sitting, moreover, has another function, that of opposing sitting to forms such as sinning. Similarly, the /d/ in Bunde opposes it to bunte. In both cases, the association between the phoneme and the word is secondary to the association between it and the morpheme. By contrast to the sporadic and secondary association between phonemes and the word, the link between accent patterns and the word is neither sporadic nor secondary.
Sitting and sinning, Bunde and bunte can all be reliably identified as single lexical units because they each contain only one stressed syllable. Indeed, one could almost say that it is the stress pattern itself that defines these sequences as single words. One could say without being too far wrong that it is only because English and German both have accentual systems that it is possible to identify in both of them a formal grammatical unit intermediate between the morpheme and the sentence. Moreover, they share this property with all other languages with accentual systems. This point is emphasized by Garde:

Il apparaît donc que dans les langues sans accent il existe seulement deux types d'unités signifiantes susceptibles d'une définition rigoureuse du point de vue de la linguistique générale et sans référence particulière à la structure de chaque langue: le morphème et la phrase. Toutes les unités intermédiaires (différents types de syntagmes, mots) ne peuvent être définies que selon des critères propres à chaque type de langue. Mais dans les langues à accent il existe un troisième type d'unité, l'unité accentuelle, qui admet une définition valable pour toutes les langues à accent, parce qu'elle possède une marque formelle: l'existence entre ses limites d'un contraste entre les unités accentuables, syllabes ou mores, qu'elle contient. (Garde, 1968, p. 18)

Accent, then, differs fundamentally from tone both in function and in the size of the grammatical unit with which this function is associated. Tonal oppositions are like any other distinctive oppositions. Their function is to distinguish meanings, and the grammatical unit with which this function is associated is the minimal unit for which a meaning can be distinguished: the morpheme. Accentual contrasts, on the other hand, have as their primary function the production of culminative patterns that phonologically conjoin closely cohering combinations of morphemes and set up larger sense groups in an utterance. The smallest grammatical unit with which this function is associated is therefore a unit usually larger than the morpheme, the minimal unit for which an accentual pattern can be specified: the word.

2.1.6 The organizational function in metrical theory

While the word is typically the smallest grammatical unit for which the function of accentual contrasts can reliably be defined, it does not often show more than the most basic syntagmatic contrast. One does not usually see the true complexity of the organizational structure that is indicated by the accent pattern in the citation forms that are the usual utterances examined by linguists. In formulating their definition of accent and contrasting the function of accent to that of tone, the European functionalists concentrated almost exclusively on the prosodic patterns of words, and did not explore the place of accent in the larger prosodic structures of longer, more complicated utterances. The likely reason for this neglect is that they had no formalism for describing prominence relationships among accents. In discussing the prosodic patterns of individual accentual phrases, the functionalists had the useful notion of an absolute syntagmatic contrast between the accented syllable or mora and surrounding unaccented syllables or morae within the accentual phrase. But they had no mechanism for representing the relative contrasts of prominence among different accentual phrases.

In the last decade, metrical theory has provided such a formal mechanism. And while the proponents of metrical theory have rarely been so explicit about phonological function as were the European functionalists, it is clear from the occasional reference to function that accent is considered to be primarily an organizational feature. The earliest metrical treatment of stress accent in English, for example, described the aims of the new theory as follows:


It will be argued that certain features of prosodic systems like that of English, in particular the phenomenon of 'stress subordination', are not to be referred primarily to the properties of individual segments (or syllables), but rather reflect a hierarchical rhythmic structuring that organizes the syllables, words, and syntactic phrases of a sentence. (Liberman and Prince, 1977, p. 240)

This earliest formulation of metrical theory identified the organizational structure of an utterance with the phonetic manifestations of rhythm in its duration pattern, and hence did not generalize the organizational function to types of accent systems other than stress accent (where the connection between pitch and prominence has not always been recognized). There is, however, nothing inherent in the representational mechanisms provided by the theory that precludes a more general interpretation. If the metrical grid is interpreted as a formal means of representing abstract prominence relationships rather than as a structure specific to rhythm or timing, it might be possible to use the grid to represent accentual patterns of words and phrases in languages that are superficially very unlike English. For example, Kubozono (in preparation) gives a description of accentuation and phrasing in Japanese noun compounds using a grid representation of the prominence relationships involved.

In Chapter 3, the application of metrical theory to the organizational definition of accent will be discussed further in the context of relating accent to intonation. There it will be suggested that the fundamental similarities among accent systems are best illuminated by an examination of the role of accent within the more complicated phrasal organization of larger utterances. The functional difference between tone systems and accent systems, on the other hand, can perhaps be seen more clearly by examining matters at a less complicated level. The rest of this chapter will describe several characteristic differences between tone systems and accent systems that give some evidence for the different functions posited above, and, hence, for the separate categories set up on the basis of the different functions.

2.2 Characteristic differences between accent and tone

2.2.1 Speakers' attitudes

One way in which tone and accent differ is in the attitudes that they characteristically evoke in native speakers. Attitudes are difficult to document quantitatively, but anecdotal evidence invariably points to different views of the two types of prosodic contrasts. To the native speaker describing tone, the linguist's distinction between segmental phonemes and suprasegmental tonemes is not at all obvious. R.B. Jones reports (p.c.) that when he asked one educated Burmese informant how she categorized the tonal contrasts in a minimal pair, she answered that she perceived the words as having different vowels. Similarly, Chao and Yang report of Mandarin speakers:

The common [Western] attitude of treating the tone as an epiphenomenon on top of the solid sounds – consonants and vowels – is to the Chinese mind quite unintelligible or at least highly sophisticated. (Chao and Yang, 1947, p. xv)

To the native speaker describing accent, however, this attitude is the intuitively correct one. The linguist's classification of accent patterns as belonging to suprasegmental 'epiphenomena' is highly intelligible even to the most unsophisticated freshman taking an introductory course in linguistics. Indeed, it is possible for native speakers to regard as 'epiphenomena on top of the solid sounds' even those accentual features that include segmental oppositions. In English, for example, there are pairs of words contrasting two vowel phonemes that are commonly said to have minimally contrasting stress patterns, e.g., the noun contract versus the verb contract, where the nuclei of the initial syllables are [ɑ] versus [ə]. The native speaker's attitude toward such pairs is that the perceived stress contrast is something extra added to the segmental specification of the words. He views the unstressed syllable in the verb as having the same vowel as the stressed syllable of the noun, except that the verb's vowel is 'weakened' or 'reduced'. This attitude is in sharp contrast to that of the Burmese speaker cited above. The native speaker of the tone language has trouble hearing syllables differing only in tone as having the same vowel, but the native speaker of the accent language can hear syllables differing in accent as having the same vowel even when they do not, because the vowel-quality differences are part of the accentual system of contrasts in prominence.

Thus whereas the speaker of a tone language finds it difficult to separate an utterance's tonal specification from its segmental specification, the speaker of an accent language separates them quite easily. Whereas the tones are viewed as part of a word's phonemic makeup, the accent pattern seems to be something fastened onto the word after the segmental content is specified. Accent is, in this sense, more suprasegmental than is tone.
The complement of this difference is that the even more suprasegmental phenomenon of intonation is easily separable from tone, but not from accent. Since the tonal specification is such a permanent integral part of a word's phonemic makeup, the native speaker finds it obvious that tone is distinct from intonation, even though the two prosodic phenomena are phonetically intertwined in the tempo and pitch contour of an utterance. Thus Chao (1956), for example, describes the interactions between tone and intonation in his native (standard Mandarin) dialect as a simple algebraic summation to produce an utterance's pitch contour. Chang, similarly, has no trouble distinguishing the contributions of tone and intonation in utterances in his native Chengtu:

Tones apply to individual syllables whereas intonation covers the whole sentence. Unlike tones, furthermore, a change of intonation does not affect the lexical value of words. It only adds shades of meaning to the sentence spoken and brings out the attitude of the speaker and the emotional state he is in. (Chang, 1958, p. 70)

In other words, the tones, like segmental phonemes, are integral to the morphemes; they 'affect the lexical value of words'. Hence they are easy to separate from intonation, which 'only adds' those parts of the meaning that are determined by the utterance's context rather than by its morphemic content. Accent, on the other hand, shares in the context-dependent, added-on quality of the rest of intonation. An accent pattern can be abstracted away from any specific utterance within which the pattern might fill its organizational capacity. But it is extremely difficult for the native speaker (and for the linguist) to perform this abstraction completely. The confusion between various degrees of abstraction is especially evident in our use of terms. In English, for example, we say that the first syllable is stressed in the general phonological representation of the word delegate, but we also say that the first word is stressed in the specific intonation pattern of YOU did it. Nor is the confusion limited to languages traditionally classified as stress languages. In Norwegian, for example, it is common to include a few contrasts between neutral and emphatic intonation patterns when listing minimal pairs for the pitch accent contrast. Thus Grundt describes the emphatic prominence on så in Hun er SÅ snill 'She is SO nice' as giving accent 2 to the verb er, as opposed to a more usual accent 1 in the neutral Hun er så snill 'She is so nice' (Grundt, 1977, p. 185). This description implies that Grundt sees no difference between the level of abstraction necessary to describe the specific intonation pattern that displaces the sentence accent shape from snill onto så in 'She is SO nice.' and that necessary to describe the pitch shape of the lexical accent pattern in the phonological form bønnen 'the bean'. Grundt's confusion between the intonational variation and the lexical accent contrast is typical. In fact, Kloster Jensen's discussion of intonational versus lexical patterns in Tonemicity seems to be the only treatment that differentiates the two (Kloster Jensen, 1961, p. 22).

One final example of this sort of confusion between the abstract lexical accent and specific intonational prominences was provided by one of the native Japanese speakers who read the corpus of sentences for the English-Japanese production experiment that will be described in Chapter 6. Among the test words in the corpus was ki'ku 'initial phrase' (i.e., first line of a haiku), a rather rare Sino-Japanese compound. During the recording session the informant unhesitatingly pronounced this word each time he encountered it, now with one, now with another of the three possible accent patterns. When asked after the recording session about the word's accent, the informant replied that he had never heard the word said, but that given its meaning, the first syllable should be 'stronger' and therefore should be accented. In other words, to this native speaker, the lexical accent in this word was something that could be determined by the intonational logic of making prominent that which should be emphasized. It is difficult to imagine how a native speaker could apply such logic when trying to decipher the tones of an unfamiliar written word.

The different attitudes towards tone and accent are both illustrated in the following passage from Chao's Grammar of Spoken Chinese:

In working with written texts, it is common practice to treat as 'the same sentence' whatever is written alike. But since most systems of writing omit indication of significant prosodic elements, one should always remember that the same sequence of words may not represent the same sentence. For example: (1) Ta tzoou (,) bu hao. 'That he goes is not good, – he had better not go.': (2) Ta tzoou.bu hao. 'He cannot walk well.' Here the full tone on bu (buh) and the optional pause between Ta tzoou and bu hao in sentence (1) make it a different sentence from sentence (2), in which tzoou.bu-hao is one word, with no possibility of any break within. (Chao, 1968, p. 58)

By 'significant prosodic elements' in this passage, Chao does not mean the tones. Insofar as different morphemes are written with different graphs, the Chinese logographic writing system cannot 'omit indication' of the tones any more than it can 'omit indication' of the vowels and consonants. What Chao means is rather the contrast between the accented (full tone) bu in the first sentence and the unaccented (neutral tone) .bu in the second. To native speakers of Mandarin, such as Chao, the tones are not part of 'prosodic elements', whereas the accent pattern is. Thus in his example, the tones are part of the phonemic specification that makes the two utterances 'the same sequence of words', whereas the accent pattern is something added on later with the more subtle junctural cues that differentiate them as sentences.


In other words, the full tone bu of sentence (1) and the neutral tone .bu of the second sentence are really the same word for the Mandarin speaker, just as the stressed bird of It's a black bird and the unstressed bird of It's a blackbird are 'the same word' for the English speaker. By contrast, the third tone jaang (pin-yin zhǎng) 'to rise (of prices)' and the fourth tone janq (pin-yin zhàng) 'to swell' would no more be the same word than are poach and poke. (To the etymologist they might be, but hardly to the average speaker.)

This difference in attitudes on the part of native speakers toward tone and toward accent can easily be explained in terms of their different functions. Tone is difficult to view as distinct from the segmental specification of a word because its salient function is identical to that of the segmental phonemes; both specify the morphemic content of an utterance. Accent, on the other hand, is nearly impossible to abstract away from the intonation specified for an utterance because its salient function is fulfilled in part by intonation; it is there to organize the morphemes in an utterance into larger prosodic structures by making some parts of the utterance more prominent than others.

2.2.2 Historical development

A second difference between tone and accent is in the way that they figure in the history of a language's sound system. The two differ especially in the ways in which contrasts can develop where historically there were none. Whenever the history of a particular tonal contrast can be reconstructed to something other than tone, it is traced to an original segmental contrast. In a few cases, the existence of intermediate dialects with both the tonal and the original segmental contrast as concurrent cues provides incontrovertible confirming evidence for the reconstruction.
For example, the rising and falling tones in various dialects of Panjabi go back to preceding and following breathy-voice consonants in the proto-language (Gill and Gleason, 1969; Haudricourt, 1972). This reconstruction is confirmed by the existence of intermediate Hindi dialects that have both the segmental contrasts and, redundantly, the tonal contrasts (Gumperz, 1958). Even when intermediate dialects are not available, however, there is often experimental phonetic evidence that confirms the posited reconstruction. When the comparative evidence suggests that the proto-language had a contrast between two consonant types where some daughter languages have a tonal contrast on an adjacent vowel, it is not unusual to find that the particular consonantal types reconstructed cause similar, non-distinctive variations in the fundamental frequencies of adjacent vowels in living non-tone
languages. The overwhelming number of cases in which the comparative and experimental evidence coincides in just this way makes the following generalization possible: Tonal contrasts as a rule develop when the pitch perturbations universally accompanying various consonants are exaggerated and then reinterpreted as the primary cues in a particular phonological opposition (see Ohala, 1974, 1978; Hombert, 1978; Hombert et al., 1979). This generalization, however, is not true of accentual contrasts. I know of no instance in which an accentual contrast has developed historically from the phonological reanalysis of a segmental opposition. This is not to say that segmental oppositions play no role in changes to accentual systems. Languages have gone from a simple fixed accent to free accent and back again because of segmental sound changes. Mohawk and Oneida, for example, developed free accent when epenthetic vowels were introduced to break up certain consonant clusters in Proto-Northern-Iroquoian, a language that is reconstructed as having a fixed penultimate accent (Chafe, 1977). Similarly, when Latin lost the distinctions in vowel length that partially conditioned the placement of its demarcative accent, Common Romance developed contrastive accent placement. However, conditioning a change in a phonological subsystem is a very different matter from being the source of that subsystem, and all of these examples of accentual development presume the existence of accent contrasts already in the proto-languages. There are as yet no theories of the genesis of accent so well substantiated as is the now standard theory of tonogenesis, but comparative and typological evidence suggest that accent patterns do not develop from segmental contrasts. Rather, they seem to develop from other suprasegmental patterns, in one of two ways. 
The two possible sources are the reanalysis of tone, as proposed by Clements and Goldsmith (1984), and the lexicalization of intonation, as proposed by Hyman (1977a). Each of these proposed origins seems plausible, but for different types of accent systems. Hyman's theory relies on typological evidence, and seems most appropriate for delimitative stress systems. Hyman gathered data on lexical accent placement in a large number of languages, and showed that three places were especially favored in languages that use stress delimitatively, namely the initial syllable, the final syllable, and the penultimate syllable. Hyman sees a connection between the favoring of these three positions and the nearly universal occurrence of certain intonation patterns at utterance or clause boundaries. For example, the falling configuration that comes from the L% boundary
tone of the neutral declarative intonations in many languages might be reanalyzed as an accentual prominence on the final syllable, or on the penultimate syllable if the fall were interpreted as a high-low sequence over the last two syllables. The accompanying durational effect of phrase-final lengthening might also contribute to this reanalysis. Hyman's theory gains much plausibility from the fact that many linguists have described French as having delimitative lexical stress on the final syllable of the word (to the immense confusion of all the native speakers that I have questioned on this subject). The explanation for this fact seems to be that the usual intonation pattern for medial phrase boundaries in French consists of a H% boundary tone accompanied by a very marked final lengthening. Linguists who are native speakers of stress accent languages would be especially prone to interpreting this phrase-final configuration as a property of the word rather than of the larger phrase containing the word, but it may mirror actual processes by which the speakers of a language without word accent lexicalize larger phrasal intonation patterns. Hyman, in fact, suggests that such a lexicalization of phrase boundary intonation is part of a common diachronic process of generalizing from utterance or clause boundaries to word boundaries. He cites, for example, the common phonotactic constraint against voiced segments word-finally as the generalization of a phonetically motivated utterance-final devoicing. He concludes that delimitative stress accents develop as the result of just such a generalization of common intonational patterns:

[We] can hypothesize that a stress-accent comes into being when an intonational feature becomes associated with a grammatical unit smaller than a clause (where a pause is frequently to be expected). In other words, intonation becomes grammaticalized as word-stress when the suprasegmental features of pitch, duration, and intensity that would have characterized a word in isolation are encoded with the word, and thus came to function in words not in isolation. (Hyman, 1977a, p. 44)

Hyman's typological evidence for his theory of accent development is far less convincing than the comparative and experimental evidence for the theory of tone development cited above. However, as more and better descriptive and experimental data are gathered on phrase-boundary intonations and accent systems in various languages, we should be in a better position to judge these speculations on the origins of delimitative stress accent. The other theory of accent development now current is supported by comparative evidence in at least two language groups containing languages with non-delimitative non-stress accents. It was first proposed for Bantu. Clements and Goldsmith (1984) explain the rich
variety of tone and accent systems in the Bantu languages as the result of a process of leveling or reanalysis of the Proto-Bantu tone system. They suggest that the various Bantu languages that have accent systems corresponding to tone systems in the other Bantu languages might have acquired these systems by rephonologizing tone sequences into culminative accent patterns. They list two characteristics of Bantu tone that make it particularly susceptible to such a reanalysis. The first characteristic is what they call the 'mobility' of the tones. Underlying morphophonemic high and low tones can be realized on different syllables in different surface forms, and the rules governing the realization of these highs and lows must refer to units generally much larger than the morpheme. The second characteristic that they list is the presence of underlyingly atonal affix morphemes which acquire surface tones from neighboring root morphemes. A third characteristic that Clements and Goldsmith do not list, but that seems equally important, is that Bantu tone systems have only two contrasting tones, high versus low. Given these three characteristics, Clements and Goldsmith offer a hypothetical account of how a reanalysis as accent might occur. Suppose that a language has predominantly bisyllabic roots. There are only four contrasting patterns that these roots might have, namely HH, HL, LH and LL. Then, the language need only lose the HH stems (perhaps through a sound change turning all *HH stems into HL stems) in order for the roots to be subject to reanalysis as having first-syllable accent (HL), second-syllable accent (LH), or no accent (LL). Since the language has affixes whose tones are determined to a large extent by the tones of the roots, the language now has culminative accent placement organizing the utterance into words or larger sense-groups.
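The hypothetical scenario just described can be made concrete with a small sketch. It assumes exactly the conditions of that scenario (bisyllabic roots, two tones, and a merger of *HH into HL); the names merge_hh, ACCENT_READING and reanalyze are invented for this illustration and are not taken from Clements and Goldsmith.

```python
# Toy model of the hypothetical reanalysis of bisyllabic tone patterns as accent.
# Assumption: a sound change has turned every *HH root into an HL root.

def merge_hh(pattern):
    """Apply the assumed sound change *HH > HL; other patterns are unchanged."""
    return "HL" if pattern == "HH" else pattern

# After the merger, each surviving pattern can be read as an accent placement:
# 1 = first-syllable accent, 2 = second-syllable accent, None = unaccented.
ACCENT_READING = {"HL": 1, "LH": 2, "LL": None}

def reanalyze(pattern):
    """Map an original tone pattern to its accentual reinterpretation."""
    return ACCENT_READING[merge_hh(pattern)]

for tones in ("HH", "HL", "LH", "LL"):
    print(tones, "->", reanalyze(tones))
```

The point of the sketch is only that, once HH is gone, the three remaining patterns carry no more information than 'accent on syllable one, accent on syllable two, or no accent'; nothing here models the mobility of tones or the atonal affixes.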
The comparative evidence in Bantu supports this suggested leveling process as an explanation for the diversity of systems along the continuum between 'pure' tone and 'pure' accent in the modern Bantu languages. A similar process seems to be responsible for the variety of accentual systems in the dialects of modern Japanese. Proto-Japanese had a very complicated tone-like system in which both the place of a pitch rise and the place of a following pitch fall had to be specified for noun roots (Hayata, 1973). This system has been simplified to various degrees in the modern dialects. In the Kansai dialects, for example, the place of the pitch fall is still completely free, but the placement of the rise has simplified to a distinction between words that have a high tone on the initial syllable and those that do not. In standard (Tokyo) Japanese, the system is even more of a 'pure' accent. The specification of the lexical pattern is simpler, requiring only the single
accent mark for the pitch fall, with the phrasal patterns inserting an initial rise by rule. Moreover, the culminative domain is usually larger, with some lexical accents subordinated to an extent commonly associated only with stress accent languages. Where the Osaka speaker would give each word in an object-verb sequence its own independent accent pattern, the Tokyo speaker generally makes the whole sequence a single accentual phrase or greatly reduces the tonal prominence of the second phrase unless this degree of subordination is incompatible with the requirements of a special intonational focus (Yamada et al., 1982; Beckman and Pierrehumbert, 1985). Finally, in dialects spoken north of Tokyo, the proto-Japanese tone-like system has leveled even further to a system of no accent contrasts. In these dialects both the rise and the fall are completely predictable from the word boundaries; the lexical accent pattern is merely delimitative. This variety of accent systems in the modern Japanese dialects shows that they have leveled to various degrees the proto-Japanese tone system. It suggests that they have undergone to various extents some sort of rephonologization similar to the one that Clements and Goldsmith have posited for Bantu. More comparisons of tonal and accentual systems in other language families may yield further comparative evidence for the hypothesis that pitch-accent might develop in a language as a reanalysis of tone. Some stress-accent systems may then be a further development from non-stress accent via the association with the culminative syllable of other prosodic features in addition to the pitch prominence, and the complete disassociation of the pitch-shape specification from the lexical accent specification. The Modern Greek stress system seems to be an example of such a development, since the grammarians' descriptions suggest strongly that Classical Greek had a pitch accent system like that of most varieties of modern Swedish (cf.
Buck, 1933, p. 162). The dialects of Swedish that do not contrast two types of tone shape for the lexical accent pattern might similarly be interpreted as one step further in the development from a tone system toward a stress-accent system. Neither Hyman's grammaticalization of intonation nor Clements and Goldsmith's rephonologization of tone is anywhere near as well-substantiated as the Hombert et al. theory of tonogenesis, but both are tenable hypotheses about how accent might come into being. If substantiated, they would fit in very well with the defining differences in function posited for tone and accent here. Tonal oppositions function like any segmental opposition because they develop historically from segmental oppositions. Accentual contrasts, on the other hand, are rather more removed from segmental oppositions, developing from culminative reanalyses of other suprasegmental
patterns.

2.2.3 Distinctive load

A third way in which accentual contrasts and tonal contrasts differ is the relative amount of work they do as distinctive features. This amount of work has traditionally been called 'functional load'. However, because the term 'functional load' implies that distinguishing utterances is the only possible phonological function, the term 'distinctive load' will be substituted for it in the following discussion. Distinctive load has been measured in various ways, most commonly by counts of the number of minimal pairs that exemplify each contrast (e.g., Hockett, 1955, p. 215ff). Martinet has suggested that the relatively small number of minimal pairs exemplifying many lexical accent contrasts can be taken as evidence that the function of accent is not primarily to distinguish words (Martinet, 1965, p. 149). A possible argument against Martinet's suggestion is that few phones figure in only one opposition, and that, therefore, the distinctive load borne by a phonic property (as measured by the number of minimal pairs in which an opposition involving that property figures) cannot be a good indication of the property's primary function. This objection seems valid when the opposition under discussion contrasts segmental phonemes. For example, the opposition between the voiced and voiceless interdental fricatives in English figures in only a very few minimal pairs: mouth:mouthe, thigh:thy, wreath:wreathe, ether:either, and loath:loathe. However, both of these phonemes occur in opposition with every other consonantal phoneme in English, so that it would be difficult to argue that they are not primarily oppositional units. A prosodic property, on the other hand, usually figures in only a very few oppositions. Each tone in Mandarin, for example, is opposed to only three other tones. Vowel length in Japanese figures only in the single opposition between long vowels and short vowels.
Lexical stress patterns in English can be said to figure only in the three-way opposition among primary accent (full stress), secondary accent (an 'unstressed' full vowel), and tertiary accent (a reduced vowel). It seems legitimate, therefore, to use the number of minimal pairs in which a prosodic opposition figures as one indication of the relative salience of the distinctive function for that opposition. Computed in this way, tone always bears a considerably higher distinctive load than does accent, and thus is more obviously phonemic. In Mandarin, for example, there are minimal quadruplets of morphemes contrasting only in tone for an overwhelming majority of
phonotactically possible monosyllables. This fact can be ascertained easily by a glance through the index of Chao and Yang (1947). Similarly in Yoruba, 47% of the possible CV syllables occur in all three tones to form minimally contrasting triplets of monosyllabic verbs (Stahlke, 1974, p. 143). In English, by contrast, relatively few minimal pairs are distinguished only by the stress pattern. Moreover, most of these are noun-verb pairs such as íncrease:incréase or pérmit:permít, which are unlikely to occur in contexts where one could be misunderstood for the other. It is doubtful, therefore, that accent patterns in English carry enough of a distinctive load to be termed primarily oppositional units. English is not unique in this regard. In bokmål Norwegian, for example, Kloster Jensen (1958, cited in Kloster Jensen, 1961, p. 35) counted altogether only 2400 pairs of words minimally contrasting accent 1 with accent 2. If 2400 minimal pairs seems rather high, it should be remembered that a complete paradigm requires only the single contrast between accent 1 and accent 2, whereas a full paradigm of tonal contrasts requires three forms in Yoruba and four forms in Mandarin. Thus the initial odds are biased more against getting the full tonal contrast. Moreover, unlike these tonal contrasts, seven-eighths of the Norwegian accentual contrasts involve words from different grammatical classes, e.g., the noun versus verb-infinitive pairs such as hoppet 'the jump' versus hoppe 'to jump' (Haugen, 1967, p. 190). In Swedish the total number of minimal pairs contrasting accent 1 and accent 2 is even smaller. Elert (1972) was able to find only about 350 pairs, and many of these pairs involved an obsolete imperative or second plural indicative, as in tågen 'march (imperative)' versus tågen 'the trains' (Elert, 1964, p. 27).
Likewise in standard Japanese, there are three possible accent patterns in disyllabic forms, but I was able to find fewer than fifty minimal triplets in the Japan Broadcasting Corporation's (1979) dictionary of Standard Japanese pronunciations. And in longer forms, it is difficult to find even minimal pairs, much less minimal groups showing a complete paradigm of possible accent patterns. A second indication of accent's smaller distinctive load is the extent to which the accentual pattern of a word is predictable. One way to measure this degree of predictability is to determine the percentage of the vocabulary in which the accentual pattern can be generated by rule given the segmental content.
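The predictability measure just described can be sketched in a few lines. Everything in the example is a stated assumption: the toy lexicon is invented and represents no real language, the words are given already divided into syllables, and penultimate_rule is a hypothetical delimitative rule chosen only to show how the percentage would be computed.

```python
# Sketch: share of a lexicon whose accent placement is generated by a rule.
# Each entry pairs a syllabified word with the 0-based index of its accent.

def penultimate_rule(syllables):
    """Hypothetical rule: accent the penult (or the only syllable)."""
    return max(len(syllables) - 2, 0)

def predictability(lexicon, rule):
    """Fraction of entries whose listed accent matches the rule's output."""
    hits = sum(1 for syllables, accent in lexicon if rule(syllables) == accent)
    return hits / len(lexicon)

# Invented toy lexicon; the last entry is a lexical exception (final accent).
TOY_LEXICON = [
    (("ta", "ka", "na"), 1),
    (("mi", "lo"), 0),
    (("so", "pe", "ri", "ta"), 2),
    (("ka", "ru"), 1),
]

share = predictability(TOY_LEXICON, penultimate_rule)
print(f"{share:.0%} of the toy lexicon is predictable by rule")
```

On this measure a strictly delimitative system would score at or near 100%, while a system with more lexical exceptions would score lower; any real count would of course depend on how the lexicon is sampled and how 'word' is defined.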


In a language with strictly delimitative accent, this percentage is very high, essentially 100%. In such systems, the only source of unpredictability is that it is not always clear what is or is not a 'word'. In Latin, for example, morphemes such as que and ne must be labeled as enclitics. Given this specification, however, the accentual pattern of any word can then be stated as a single rule: accent falls on the penultimate syllable if this syllable is long, and otherwise on the antepenultimate. In an inflectionally fixed accent system such as this, the accent pattern need not be described as part of the abstract phonological specification of the word; it is just another level in the phrasal hierarchy of the utterance whose prosodic markings can be predicted from the grammatical structure along with those of the larger phrasal constituents. Although only a strictly delimitative system provides a productive grammatical predictability of this sort, many non-delimitative systems have a kind of 'derivational predictability' that looks very similar. One kind of derivational predictability is seen in what the functionalists call 'restricted free' accent. In such a system, the exact position of the accent must be specified for some words, but the approximate position (an 'accentable zone') can be predicted in terms of segmental content for all words in the lexicon. In Provençal, for example, accent must fall within the last two syllables of a word. Similarly in Latvian, accent must fall within the first syllable of the word, and if the first syllable is short, no additional information is necessary. Only if the first syllable is long is it necessary to specify further whether it is the first mora or the second mora that is accented. (See Trubetskoy, 1969, p. 192-93, and Garde, 1968, p. 137-39, for these and other examples.) Another kind of derivational predictability is seen in many languages with 'unrestricted' free accent.
In these languages, there are large subsets of the lexicon in which the lexical accent pattern can be described by delimitative rules. In Japanese, for example, there is an accent pattern for recent (i.e., non-Sino-Japanese) loan words that is remarkably similar to the Latin stress rule. The accent occurs in the penultimate syllable if that or the final syllable is long and otherwise on the antepenultimate syllable. (If the word is only two syllables long, the accent is on the first syllable.) This pattern can be called the default pattern for recent loans for a number of reasons. First, it occurs in an overwhelming majority of the words in this category. For example, of the twenty non-Sino-Japanese loans on the first two pages of the Japan Broadcasting Corporation's (1979) pronunciation dictionary, only one, aamen 'amen,' does not have this accent pattern, and it is the oldest loanword in the group, dating from the Portuguese missionaries of the 17th century. Second, this
pattern is being generalized to some words originally borrowed with a different accent pattern. For example, the pronunciations pi'kunikku 'picnic,' ma'sukotto 'mascot,' and a'amondo 'almond' (with accent on the first syllable as in the source words) are being replaced in the speech of younger Tokyo speakers by pikuni'kku, masuko'tto and aamo'ndo (with the default accent pattern). Finally, when longer loan words are truncated (as they almost inevitably are), the accent often shifts in the shorter form to conform to the default pattern. Thus depaatomento-suto'a 'department store' has been shortened to depa'ato, terebi'zyon 'television' to te'rebi, sutora'iki 'strike' to su'to, and gurote'suku 'grotesque' to gu'ro. (Note that in the last two forms, the accent has shifted to a syllable that did not even exist in the loan source.) Other languages with 'unrestricted' free accent have similar default delimitative patterns which are even more widespread. In Spanish, for example, the accent can be predicted in a large proportion of the total vocabulary by a simple rule: stress the penultimate syllable if the last syllable ends in a vowel, /n/, or /s/; otherwise stress the last syllable. As with the Japanese pattern discussed above, this pattern can be called a default accent because the words that conform to it vastly outnumber the exceptions. Much research in the generative tradition in America has been concerned with discovering a default pattern for English (e.g., Chomsky and Halle, 1968, p. 69ff; Halle and Keyser, 1971; Liberman and Prince, 1977, p. 264ff; Selkirk, 1980; Hayes, 1981, 1982). Another type of grammatical predictability is seen in languages with root accent. In such languages, there are classes of more and less accentable morphemes corresponding to roots and affixes or to content and function morphemes. The accentual pattern of a word is then predictable from its morphological structure. Proto-Germanic must have had such a system; its traces are seen in modern English accentual patterns such as beget versus beggar and akin versus aching and in modern German pairs such as Gebet 'prayer' versus gebet 'give (2nd plural archaic)'. The Mandarin system is also essentially a root accent system, as can be seen from such contrasts as lian-tz 'hanging screen' versus lian tzyy 'lotus seed'. The unaccented (neutral tone) -tz in the first word in this pair is a noun suffix, whereas the accented (full tone) tzyy in the second is the morpheme 'seed'.
Just as the modern English system is complicated by a competing derivationally delimitative system introduced with Romance loans, however, Mandarin is complicated by 'loans' from a more literary style of Chinese that does not have the level of accentual organization defined by the neutral tone (Chao, 1968, p. 38-39).


Root-accent systems are especially interesting when they show an intermediate level of accentability between unaccented suffixes and fully accented roots. This intermediate level is best illustrated in compound words in the Germanic languages. For example, English blackbird, warehouse, and strikeout have each only one primary accent, as demanded by the culminative principle. This primary accent falls on the first element in each word. But the second element, rather than having a reduced vowel (as an inflectional ending would), keeps its original long unreduced vowel. That is, it has a secondary accent, showing its underlying status as a root. Similarly in the German compounds Bürgermeister 'mayor' and aussergewöhnlich 'extraordinary,' the -mei- and -wöhn- are not as prominent as the first syllables in each word, but they are more prominent than the surrounding affixes, and must be said to have secondary accent. Martinet (1965, p. 151) and Garde (1968, p. 53-56) insist that these secondary accents are 'real' accents and not the sort of delimitatively determined 'echo' accent seen in, for example, Southern Paiute. The situation in English is complicated by the existence also of derivational 'echo' accents in, for example, chandelier, ravioli, Tennessee, etc. In words such as these, the secondary stress seems to be determined by the syllabic complexity of the word. The primary accent goes on a syllable relatively late in the word and is balanced by an earlier secondary accent if there are several syllables preceding it. In words like redó, on the other hand, the secondary accent is determined instead by the morphological structure of the word. It is there because redó is two full morphemes, as opposed to reduce, which is synchronically a single unanalyzable morpheme. 
The secondary accents thus provide an intermediate organizational type between two unconjoined words (with two primary accents) and a single monomorphemic word (with only the single culminative primary accent). The distribution of accent in Swedish and Norwegian also seems to be derived from the Germanic root accent system. Whether accent 2 is analyzed as stress followed by a 'nucleus-extending juncture' (Jasanoff, 1966), or as stress followed by 'tone' (Haugen, 1967), or as a retraction onto the stressed syllable of a following 'secondary stress' (Grundt, 1977), it is generally thought of as giving a small degree of prominence to a post-tonic syllable that is predictable from the morphological structure of the word. For example, Norwegian bønnen 'the bean' has accent 2 because the -e- is morphologically a gender suffix forming the disyllabic noun base bønne, whereas bønnen 'the prayer' has accent 1 because its -e- is only an epenthetic vowel breaking up the consonant cluster when -n 'the' is attached to the
monosyllabic base bønn 'prayer'. Analogous differences in morphological structure account for nearly all of the minimal pairs contrasting accent 1 and accent 2. The only exceptions are loan words (which always have accent 1) and a few pairs such as aksel 'shoulder' versus aksel 'axle' (where the accent 1 in 'shoulder' reflects the fact that the -e- is historically an epenthetic vowel; i.e., ON axl 'shoulder' versus ON oxull 'axle'). Except for these few exceptions, the accent pattern of a native Norwegian word is grammatically predictable from its synchronic morphological organization (Haugen, 1967). In contrast to this use of accentual prominence in root-accent systems, when tone patterns reflect the morphological structure of words, they generally do so not because the tone is predictable from the organizational relationship among the component morphemes, but rather because the tone itself is a morpheme. In Igbo, for example, the morphological structure of possessive phrases is reflected in the tonal pattern because the associative morpheme is a high tone, as in Hyman's (1974) analysis. Hyman gives the following example:

àgbà ('jaw') + ´ (associative) + èŋwè ('monkey') = àgbà èŋwè ('monkey's jaw')

Other examples of grammatical tone morphemes are the uses of tone to signal person, number, tense or aspect in various African and Amerindian languages (Pike, 1948, p. 22-23). This use of tone is very different from the 'grammatical' use of accent. When the Norwegian speaker says bønnen 'the prayer' versus bønnen 'the bean,' for example, he can redundantly predict the differing accents from the fact that bønn 'prayer' is one syllable whereas bønne 'bean' is two.
From the hearer's viewpoint, then, the accent might signal first that the two have different morphological organization (bønn-en 'the prayer' versus bønne-n 'the bean') and only indirectly, through the consequent identification of the one- or two-syllable noun, that different morphemes are involved. By contrast, when the Bini speaker says i ma 'I show' rather than i ma 'I showed,' the tonal pattern is no more predictable than the different vowel phonemes in English swam versus swim. Instead, it is a direct consequence of his wish to signal past tense. From the hearer's viewpoint, conversely, the different tone, like the different vowel in English swam, is directly involved in the identification of the tense, and inasmuch as it occurs in other past tense forms is the tense morpheme. Thus the grammatical uses of tone, far from showing the predictability of tone, actually show the opposite; they are another indication of tone's greater distinctive load. The differing distinctive loads of lexical accent patterns and of tone, like the different attitudes of speakers towards them, are a
direct consequence of their different functions. Tonal oppositions have the distinctive load appropriate to any primarily phonemic opposition precisely because their primary function is to differentiate morphemes. Accentual oppositions in words have smaller distinctive loads because their primary function is instead to build, either delimitatively or through a hierarchy of culminative accentability, the lexical level of the organizational structure of an utterance.

2.2.4 Alternations and restrictions

A final indication of the different functions of accent and tone is the different sorts of phonological alternations and phonotactic restrictions to which they are subject. Tone alternations are no different from the alternations and neutralizations to which segmental phonemes are subject. Every third-tone morpheme in Mandarin, for example, has an alternate form in the second tone when it occurs before another syllable in the third tone. (That is, the opposition between the second and third tones is neutralized in this environment.) Tone sandhi of this sort is exactly analogous to such segmental patterns as the alternation between /t/ and /d/ in the inflectional forms of many German nouns that is due to the neutralization of the opposition between voiced and voiceless obstruents word-finally. Like segmental alternations, tone sandhi can sometimes be limited to certain classes of morphemes. In Mixteco, for example, low and mid tones are raised after some words with high tones but not after others (Pike, 1948, p. 26, 77-81). As Pike points out, however, this sort of 'arbitrary tone sandhi' is no different from limited morphophonemic segmental processes, such as the alternation between /f/ and /v/ in some English nouns (leaf, leaves) but not in others (chief, chiefs).
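The Mandarin third-tone sandhi mentioned above is a strictly local rule, which makes it easy to state as a short sketch. The function name third_tone_sandhi and the coding of tones as the integers 1 to 4 are conventions adopted only for this illustration. The sketch also makes a simplifying assumption: each syllable is checked against the underlying tone of the syllable that follows it, so a run of three third tones comes out as 2 2 3, whereas the actual treatment of longer runs depends on phrasing that is not modeled here.

```python
# Sketch of Mandarin third-tone sandhi: a third tone is replaced by a second
# tone when the next syllable is (underlyingly) also in the third tone.

def third_tone_sandhi(tones):
    """Return surface tones for a list of underlying tones (integers 1-4)."""
    surface = list(tones)
    for i in range(len(tones) - 1):
        # The trigger is the adjacent following syllable, nothing further away.
        if tones[i] == 3 and tones[i + 1] == 3:
            surface[i] = 2
    return surface

print(third_tone_sandhi([3, 3]))     # third + third
print(third_tone_sandhi([3, 1]))     # no following third tone: unchanged
print(third_tone_sandhi([3, 3, 3]))  # simplified treatment of a longer run
```

What matters for the comparison with accent is that the rule needs to look only one syllable ahead.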
Whether it is a general or a limited alternation, tone sandhi is nearly always induced by the immediate phonetic environment, and it can usually be understood as the synchronic results of a conditioned tone change (just as segmental alternations are the synchronic residue of conditioned segmental change). For example, Hyman (1978, p. 262) interprets the Mandarin third-tone sandhi cited above as the result of a dissimilatory sound change induced by the need to minimize ups and downs in rapid speech. (The change from '3rd-tone 3rd-tone' to '2nd-tone 3rd-tone' replaces a 'dip-rise dip-rise' with a 'rise dip-rise'.) The same phonetic constraints seem to be operating in the second-tone sandhi that in rapid speech substitutes a high-level (1st) tone for a rising (2nd) tone before another rising tone. (See Chao, 1968, p. 26-29, for a description of these and other tone sandhis in Mandarin.) In any case, it is the tone of the adjacent syllable that triggers both
these tone sandhis, and this is the typical situation in tonal alternations. Accentual alternations, by contrast, are rarely triggered by just the nearby phonetic environment. For example, the alternation on the second syllable of German gewöhnlich between primary stress (in isolation) and secondary stress (in the compound aussergewöhnlich) is triggered by the primary stress on au- three syllables away. The alternation in Japanese sya'kai ('society') between first-syllable accent in isolation and no accent in the forms syakai-syu'gi ('Socialism') or syakai-u'ndoo ('social movement') is triggered by the placement of the accent in the following syu'gi or u'ndoo that marks the compound forms.³ Unlike tonal alternations, therefore, accentual alternations are difficult to attribute to phonetically conditioned sound change. Instead they are best described as the suppression of an accent in order to preserve the culminative principle of one primary accent per lexical unit. The different kinds of phonotactic restrictions that characterize tone and accent are also indicative of functional differences. Accentual systems are characterized by phonotactic restrictions in monosyllables. In many languages, accentual oppositions cannot occur at all in monosyllabic words; forms must be at least two syllables long for there to be a minimally contrasting accent pattern. This is the case, for example, in English, Swedish, Serbocroatian, German, Mandarin, and Burmese. In other languages, such as Slovenian and Latvian, it is possible for monosyllabic words to contrast in accentual pattern, but only if they are two morae long; short syllables do not contrast. In Japanese, on the other hand, short monosyllabic words can contrast accent versus no accent (as in hi 'sun' versus hi' 'fire'), but this contrast is neutralized in isolation where there is no following syllable to realize the low tone of the accent.
These characteristic restrictions against accentual oppositions in monosyllables are not usually shared by tonal oppositions, and they can be understood in terms of a single underlying functional motivation. They indicate clearly that the culminative syntagmatic contrast between prominence and adjacent lack of prominence is more important than any paradigmatic opposition in which accent might

3. See Kubozono (in preparation) for a description of this sort of productive compound formation in Japanese. See Hirayama (1960) and McCawley (1968; 1977) for examples of accentual alternations in other, more limited p a t t e r n s of compound formation and in inflectional forms.

44

Accent systems and tone systems

figure. The phonotactic restrictions that tonal systems are subject to, by contrast, have no single unifying raison d'etre. Instead, they are an assortment of historical accidents just like segmental phonotactic restrictions. In standard Burmese, for example, nasalized vowels do not occur in the fourth (stopped) tone (Cornyn, 1944, p. 7-9). But there is no profound functional motivation for this restriction. Rather, the hazards of sound change dictated that Old Burmese vowels before final nasals would become nasalized vowels and Old Burmese vowels before final stops would become vowels in the stopped tone, and since vowels in Old Burmese could not be simultaneously before a final nasal and before a final voiceless stop, nasalized vowels in Modern Burmese cannot occur in the stopped tone. 4 Thus the tonal restriction in Burmese is exactly like the segmental restriction in English against syllable-initial [q]. Since modern English [q] developed from an older [ng], and since [ng] was not a permissible initial cluster, [q] does not occur initially in English. Similarly in Dogri, the restriction that rising or falling tone can occur only once per word is also a mere historical accident. The rising tone developed on accented vowels that preceded breathy-voiced consonants and the falling tone developed from accented vowels following breathy-voiced consonants. Since accent, being culminative, could occur only once per word, the same restriction applies to the tones (Ghai, 1980, p. 52-55). This is no different from the restriction in Russian that mid vowels can occur only once per word. In this aspect again, tones are like segments and unlike accents. 
Taken together, the four differences just discussed — namely, the different kinds of restrictions and alternations characteristic of accent, the smaller distinctive load it bears, its probably different historical origin, and the different attitude it invokes in native speakers — all are convincing evidence that accent is fundamentally different from tone. Moreover, they suggest that the fundamental difference lies in tone's functional role in differentiating words as opposed to accent's role in organizing utterances.

4. See Burling (1967) for Atsi and Maru cognates confirming the Old Burmese final consonants indicated in the writing system.

CHAPTER 3

Accent and intonation

As demonstrated in the previous chapter, it is possible to define accent systems relative to tone systems on the basis of a rather superficial survey of characteristics common to exemplars of the two types. The contrast between the syntagmatic functions of accent and the paradigmatic functions of tone is evident in symptomatic differences which are immediately apparent and which can be discovered using the familiar techniques of linguistic description such as the comparative method and lexical inventory. The relationship between accent and intonation, by contrast, is much more difficult to define. This difficulty is due in part to the size of the description necessary. It is not enough to discover how certain features are distributed in words in the lexicon and how they function in deriving compound forms. Instead, one needs an accurate understanding of the total prosodic system of a language as used in many types and sizes of utterance. The difficulty is also due to the inherent difficulty of studying intonation. A basic description of paradigmatic tone structures can be obtained using transcribed citation forms and only the minimal control of having enough different forms to be able to look for complementary distribution and obvious sandhi patterns. Even the simplest accurate description of intonation structures, on the other hand, requires the analysis of large amounts of quantitative data from experiments designed with extensive controls for segmental effects, phrasing effects, and so on. The qualitative, anecdotal sort of evidence that can be used to contrast accent systems to tone systems seems inevitably to lead to some simplistic assumption about the phonetic composition of accentual patterns which then becomes a precarious basis for an account of the relationship between accent and intonation.
The most common form that such an assumption takes is the notion that the production and perception of stress accent do not involve pitch and, therefore, that it is a system completely independent of and easily distinguishable from intonation on the basis of its different phonetic composition. This understanding of the relationship between accent and intonation is explicit in most
descriptions of accent and intonation before the 1960's and is implicit in some more recent accounts, including those versions of metrical theory that interpret the metrical grid as a structure for the timing of segmental or prosodic events that are separate from any associated intonational events. The less common form that the assumption takes is the opposite notion that stress accent involves only pitch and, therefore, that accents have no phonological existence outside of the intonational melody. Bolinger (1958) was the first to develop this view into a comprehensive theory of intonation and accent in any particular language, but similar accounts have been adopted by, for example, 't Hart and Collier (1975, 1979). As will be shown below, neither of these assumptions about the phonetics of stress accent is entirely correct. Moreover both lead to unsatisfactory classifications of accent systems. The first method of defining accent relative to intonation forces a classification in which any system that does not noticeably involve force or rhythm is not termed an accentual system, even though studies of accent and intonation in several such languages have suggested that they all use accentual prominence to organize an utterance in ways very similar to its use in the better-studied stress-accent languages. The second method, on the other hand, must assume that all accent systems are stress-accent systems; it cannot explain the different way in which the pitch shapes of accents are specified in non-stress-accent languages. The following sections will briefly trace the history of these earlier treatments of accent relative to intonation, reviewing at appropriate points the relevant experimental literature that shows the simplism of the phonetic assumptions underlying them.

3.1 Stress levels

Most early descriptive linguists assumed that stress is a simple physical attribute, produced by varying the degree of articulatory force, and perceived as varying degrees of loudness due to the resulting differences in acoustic energy. Bloomfield, for example, gives the following description:

Stress — that is, intensity or loudness — consists in greater amplitude of sound-waves, and is produced by means of more energetic movements, such as pumping more breath, bringing the vocal chords closer together for voicing, and using the muscles more vigorously for oral articulation. (Bloomfield, 1933, p. 110-111)

(Compare Sweet's definition of stress and force quoted above in section 2.1.1.) Because of this assumption about the physical nature of stress, early descriptions of prosodic phenomena in stress-accent languages
could separate accent from intonation using the definitions 'accent equals loudness' and 'intonation equals pitch'. One of the best exemplars of this approach is the description of English stress and pitch phonemes in An Outline of English Structure (Trager and Smith, 1951), much of which is translated directly into generative feature analysis in The Sound Pattern of English (Chomsky and Halle, 1968). In An Outline of English Structure, Trager and Smith first define stress as follows:

English utterances containing more than one vowel exhibit marked differences in loudness, concentrated on the vowels. These different loudnesses are found to be consistent in their RELATIVE strengths, and their location is seen to be constant within systematic possibilities of variation. The presumption is that they are indications or results of phonemic entities. (Trager and Smith, 1951, p. 35)

On the basis of this presumption that they are dealing with phonemic (i.e., distinctive) entities, the two authors then set about searching for pairs of words contrasting various degrees of loudness. Pronouncing to themselves the citation forms yes, go, under, going, above and allow, they decide that the prosodic patterns of the disyllabic utterances are evidence that 'loud stress, [´] and soft stress, [˘], are two different entities,' and they accordingly set up for the first a 'phoneme' of loudness called 'primary stress', enclosing it appropriately in slashes, '/´/' (Trager and Smith, 1951, p. 36). Continuing in this vein, they decide that the last two syllables of animal are both 'soft-stressed', but that the last is louder, demonstrating two 'allophones' of a second 'phoneme of loudness, in this case a WEAK stress /˘/' (p. 36). Then, they decide that the final syllable of refugee is louder than that of effigy, and they accordingly set up a third phoneme, 'tertiary stress /`/' for the former (p. 37). Finally, after a short digression in which they set up a 'phoneme' of 'plus-juncture /+/', they consider the stress patterns of compound words. Deciding that the first syllable of operator in elevator-operator is softer than the primary stress on the first syllable of elevator, but louder than the tertiary stress on the third syllable of operator in isolation, they set up a fourth phoneme, 'secondary stress /^/'. They do note that there can only be as many occurrences of /^/ as there are instances of /+/ in an utterance, but they do not consider the implications of this culminative principle; they offer as everyday phonemic contrasts such pairs as the compound word White House versus the intonational fragment (he lives in a) white house (not a brown one) (p. 39). Once they have set up these 'stress phonemes' of loudness, Trager and Smith then turn to what, for them, is an entirely independent
phonetic attribute — namely, pitch. As with their discussion of the 'degrees of loudness', the description of pitch begins with the assumption that they are dealing with discrete levels constituting 'phonemes':

The examination of the pitch phenomena of English involves the understanding of the fact that it is relative, not absolute, pitch that is being discussed. Also to be noted is the prelinguistic finding that pitch as used in language is heard around a limited number of points rather than as a continuum.

Extensive testing of spoken English material has convinced us of the correctness of the independent conclusions of Pike and Wells that there are four pitch phonemes in English. Our presentation of the supporting data will be in terms of this conclusion. (Trager and Smith, 1951, p. 41)

In their presentation of this 'supporting data', Trager and Smith soon introduce a third kind of 'phoneme' — the 'terminal junctures' — in order to be able to deal with phrase-final intonational configurations without multiplying their four pitch levels. This separation of pitch phenomena into entities of pitch height (the 'pitch phonemes') and entities of direction of pitch change (the 'terminal junctures') is quite ingenious. With it, Trager and Smith separate various aspects of what they consider to be a single phonetic attribute. There is even a hint of a functional distinction between the two; the terminal junctures are described as 'different manners of transition' from one major part of an utterance to another, whereas the pitch phonemes are defined as occurring only within the 'major part' before a juncture. By contrast to this distinction between the two types of phonemes involving pitch, the separation between stress phonemes and pitch phonemes is quite simply between two different phonetic attributes. Distinctive levels of loudness (the stress phonemes) are opposed to distinctive levels of pitch. There is no indication that Trager and Smith were at all aware of the early experiments showing that stress patterns might be realized physically as fundamental frequency patterns or be cued by 'intonation' (e.g., Muyskens, 1931; Scott, 1939). Thus they claim that in their example sentence How do they study?, the level /3/ pitch peak can be moved from study to How '[w]ithout changing the stresses' (Trager and Smith, 1951, p. 42). They claim that, in the phonological analysis of an utterance, the pitch phonemes and junctures can be stripped off, leaving intact all the stresses (ibid, p. 50).
They absolve themselves from testing this claim by calling the result of stripping the pitch an 'unpronounceable abstraction', taking no lesson from Jones, who used whispering as a non-instrumental means of stripping the fundamental frequency of an utterance to test his hypotheses about the perception of stress (e.g.,
Jones, 1950, p. 138). This same distinction between stress and pitch is maintained also in their discussion of larger prosodic patterns. In the later section of the monograph, Morphemics, Trager and Smith describe two types of compound prosodic structure: 'intonational patterns' and 'superfixes' (stress patterns). Like the smaller primitives of 'pitch phenomena' and 'phonemes of loudness' set up earlier under Phonology, these two complex types are distinguished primarily in terms of their supposed phonetic makeup. Intonation patterns are defined as 'suprasegmental morphemes ... consisting of pitches and a terminal juncture', whereas superfixes are '[s]uprasegmental morphemes consisting of patterns of stress [i.e., loudness]' (Trager and Smith, 1951, p. 56). This identification between assumed phonetic material and phonological structure results in a characteristic inability to distinguish degrees of abstractness in their description of prosodic patterns. For example, since stress is, for Trager and Smith, a simple phonetic attribute, what they identify as a particular stress pattern is the same superfix whatever its function. They say:

Many of the superfixes, then, are not limited to one function; ' + * and \/ + form constructions substitutable for nouns, but are also the result when [stress] shift is applied to such a phrase as Tell John, giving /tél+jàn/, based on a normal /tél+ján/. (ibid, p. 73)

In other words, they hear the prosodic patterns in blackboard and TELL John to be formally identical sequences, with primary stress on the first syllable and secondary stress on the second. Therefore, the two patterns are the same suprasegmental morpheme, despite the fact that they represent very different types of descriptive analysis. In the one case the stress pattern is a statement of the abstract prominence relationship signaling the organization of a compound noun, whereas in the other it is an intonation-specific prominence relationship signaling special contrastive emphasis. In the first case, the stress pattern describes a prominence relationship that would be true for the compound word in more than one utterance context with any of a large number of intonation patterns associated to the utterance in any of several ways. In the second case, by contrast, the stress pattern describes a prominence relationship that is the specific result of saying the two words as a complete utterance with a particular falling intonation pattern (H* L- L% in Pierrehumbert's notation) associated in such a way as to put the single high-prominence nuclear pitch-accent shape on Tell. Because they assume stress to be a homogeneous concrete phonetic quality, however, Trager and Smith can confound these two types of description.


The assumptions underlying this account of stress patterns are adopted intact in The Sound Pattern of English (Chomsky and Halle, 1968). Although the generative linguists rephrase the account in terms of 'features' rather than 'phonemes', Chomsky and Halle, like their structuralist predecessors, assume that there is a phonetic attribute 'stress', and that this phonetic attribute is systematized as discrete levels along a scale from strongest to weakest. Also, from their disclaimer that 'we have omitted pitch from consideration' (Chomsky and Halle, 1968, p. ix), it is obvious that they, too, assume that the levels of prominence that they describe involve a physical attribute other than fundamental frequency. As in the earlier account, this assumption leads to a confusion between the general description of abstract accentual patterns for words in the lexicon and the particular description of specific organizational structures for phrases in actual sentences. Although the notion of the transformational cycle in stress assignment (ibid, pp. 15-24) eventually inspired the development of more useful mechanisms for representing the organizational structures of relative prominence among prosodic phrases in metrical theory, it was at this stage merely a more elaborate representation of the earlier linguists' notion that 'the relation of the phrase superfixes to the word superfixes is ... a matter of syntax' (Trager and Smith, 1951, p. 59). In Chomsky and Halle, as in Trager and Smith, no formal distinction is made between the abstract prominence pattern of the compound word blackboard and the specific prominence relationship in any given intonation pattern for the sentence John left. Both are represented as sequences of discrete stress levels determined by syntactic structure, and are assigned by rule in the course of the derivation of the utterance (Chomsky and Halle, 1968, p. 17).
Like their predecessors, Chomsky and Halle ignore the experimental literature on the acoustic and perceptual properties of stress. Although they nowhere state what physical property stress might be, it is for them a simple phonetic attribute, comparable to aspiration (ibid, pp. viii, 26), and distinct from segmental features and from pitch. Although they cite Lieberman's (1965) study of the perceptual salience of Trager-Smith style pitch and stress levels, they interpret his results not as an indication of the inadequacy of the homogeneous levels analysis, but as a vindication of the transformational cycle (p. 26). The fact that Lieberman's two linguist subjects represented the pitch and stress levels differently when segmental information was suppressed does not suggest to Chomsky and Halle that linguists are artificially trained to transcribe prominence relationships among different vowel qualities as if they were phonetically identical to prominence relationships among syllable durations, or that the
linguists might be trained to transcribe abstract lexical prominence contrasts even when they are leveled in certain intonational contexts where the stressed syllable is not associated with a pitch accent. Rather they interpret this result as evidence that all native speakers have a strong compulsion to hear an utterance's stress contours in accordance with internalized stress assignment rules. The extent to which a stress pattern is actually physically present in the utterance is 'of little significance', because the competent hearer will construct the abstract phonetic representation on the basis of the internalized rules (Chomsky and Halle, 1968, p. 25). By invoking the 'competence' of their 'idealized speaker-hearer' (ibid, p. 3), Chomsky and Halle absolve themselves of any necessity to demonstrate their stress levels in actual speech as produced and perceived by actual speakers. Without making any controlled tests, they say:

We do not doubt that the stress contours ... constitute some sort of perceptual reality for those who know the language.... [A] speaker who utilizes the principle of the transformational cycle and the Compound and Nuclear Stress Rules should 'hear' the stress contour of the utterance that he perceives and understands, whether or not it is physically present in any detail. (ibid, p. 25)

Thus despite the intervening decades of research showing otherwise, Chomsky and Halle, like Trager and Smith, assert that stress is some kind of phonetic feature opposed to pitch, that it comes in discrete levels along a single phonetic dimension, and that it can be studied adequately using no analytical technique other than the linguist's listening to himself saying various test utterances.¹

3.2 Tonetic stress marks

The American structuralist and generativist dichotomy between stress patterns that use contrasting loudness levels and intonational patterns that use contrasting pitch levels displays the traditional misapprehension of stress in one of its purest forms. British linguists, by contrast, were somewhat more aware of the problems in simply equating their intuitions about linguistic prominence with a single phonetic feature such as loudness or force of utterance. Jones, for example, cites such early experimental work as Scott (1939), and emphasizes that factors other than 'stress', such as 'intonation' (i.e., pitch) and vowel quality, were also very important in the perception of prominence (Jones, 1950, p. 134-45).

1. See Ohala (1970, p. 104-117) for a more thorough criticism.


Kingdon, also, was aware of the phonetic complexity of accentual prominence and its heavy dependence on the intonation pattern. This awareness is obvious in his invention of 'tonetic stress marks' (Kingdon, 1939), and is stated again in his longer description of English prosody, where he lists duration and intensity among the factors entering into the composition of kinetic tones (along with direction, height and range of the pitch change) (Kingdon, 1958a, pp. xxv-xxvi). Moreover, Kingdon often seems to be distinguishing the more iconic paradigmatic functions of the prosodic pattern from its organizational syntagmatic functions in his use of the terms 'intonation' and 'stress'. For example, he describes the kinetic tones as 'moving and meaningful' (Kingdon, 1958a, p. xxv), thus suggesting that it is their function of 'expressing feelings' (Kingdon, 1958b, p. 2) as much as their phonetic content that makes them part of intonation. Conversely, he says of the static tones:

[F]rom the stress point of view, they are prominences, but ... from the intonation point of view they make no contribution to the meaning or feeling of an utterance. (Kingdon, 1958b, p. 2)

[T]hough naturally containing a tonal element, [they] are functionally stresses. (1958a, p. xxv)

At the same time, however, Kingdon is not willing to make an explicitly functional account of the relationship between accentual prominence and intonation. He still clings to the old equation between stress and force of utterance, and makes the traditional contrast between it and intonation in terms of phonetic material:

Stress is the relative degree of force used by a speaker in the various syllables he is uttering. It gives a certain basic prominence to the syllables, and hence to the words, on which it is used. Intonation is the variation given to the pitch of the voice in speaking. (Kingdon, 1958b, p. 1)

Kingdon's recognition of the phonetic complexity of prominence is thus just an incompatible tangential elaboration of what is basically the same old dubious phonetic categorization.

3.3 Early experimental evidence

The British phonologists' suspicions that stress could not be so simply separated from intonation began to get more sophisticated experimental confirmation in the 1950's, aided by several technological advances. With the commercialization of the sound spectrograph and the development of practical speech synthesis equipment (e.g., the Pattern Playback at Haskins), there was a sudden surge in instrumental investigations of the perceptual cues and acoustic correlates of stress. The best known of these are the
series of experiments performed by Fry (1955; 1958; 1965). Fry synthesized tokens of words from such minimal pairs as subject (noun) versus subject (verb), and then varied the vowel intensities, durations, and fundamental frequency patterns between end-point values that he determined from measurements of the original natural models for these utterances. He presented these synthesized tokens to a large group of subjects for identification as having accent on the first syllable or on the second. The percentage of subjects that heard accent on the first syllable ('% noun responses') could then be compared from token to token as a measure of their success in simulating accent on one or the other syllable and hence as a measure of the varied parameter's effectiveness in cueing stress.

In the first experiment in the series (Fry, 1955), Fry tested intensity and duration. When responses for all the tokens with a particular intensity pattern were pooled over various duration patterns and compared, it was found that the entire range of intensity variation accounted for a relatively small shift in response values. The various duration patterns, by contrast, produced a larger shift in response values, ranging from 12% noun responses for the tokens with the most verb-like pattern to 92% noun responses for those with the most noun-like pattern. Fry concluded from this result that intensity is a rather ineffective cue to stress when compared with vowel duration.

In the second experiment (Fry, 1958), Fry then compared duration with fundamental frequency. Combining the different duration patterns for the test words with a wide variety of stylized fundamental frequency patterns, Fry again presented the synthesized tokens to a large group of subjects for identification as 'first-syllable accented' or 'second-syllable accented.' Comparing the responses to the various tokens, Fry found that many of the fundamental frequency patterns biased the responses toward noun or verb.
For example, one pattern had (on the first syllable) a linear fall from a high initial F0 value and (on the second syllable) a steady low F0 value. This pattern never evoked less than 57% noun responses whatever the duration pattern of the token. Conversely, the opposite pattern (which had a low level first vowel followed by a linear fall from a high value on the second vowel) never evoked more than 38% noun responses. Most of the patterns that biased the responses in this way sounded like natural English intonational patterns, so that the syllable favored by the bias could be heard as carrying the nuclear pitch accent of an intonation pattern for a one-word sentence. Fry interpreted these results as 'suggest[ing] that sentence intonation is an over-riding factor in determining the perception of stress and that in this sense the fundamental frequency cue may outweigh the duration cue' (Fry, 1958, p. 151).
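
The pooling comparison on which these rankings rest can be made concrete with a short sketch. The Python fragment below is only an illustration under stated assumptions: the function names (percent_noun, pooled_percent_noun, response_shift), the two-level coding of each parameter, and the listener judgments themselves are hypothetical and are not taken from Fry's materials. It pools the judgments over one parameter, scores the percentage of noun responses at each level of the other, and reports the size of the resulting shift.

```python
from collections import defaultdict


def percent_noun(responses):
    """Return the percentage of 'noun' (first-syllable accent) judgments."""
    if not responses:
        raise ValueError("no responses to score")
    return 100.0 * sum(1 for r in responses if r == "noun") / len(responses)


def pooled_percent_noun(trials, factor):
    """Pool judgments over every other parameter and score each level of `factor`.

    trials: list of dicts with keys 'intensity', 'duration', 'response'.
    factor: 'intensity' or 'duration'.
    """
    pooled = defaultdict(list)
    for trial in trials:
        pooled[trial[factor]].append(trial["response"])
    return {level: percent_noun(rs) for level, rs in pooled.items()}


def response_shift(scores):
    """Size of the shift in % noun responses across the levels of one factor."""
    return max(scores.values()) - min(scores.values())


# Hypothetical judgments, invented for illustration only (not Fry's data).
trials = [
    {"intensity": "noun-like", "duration": "noun-like", "response": "noun"},
    {"intensity": "noun-like", "duration": "verb-like", "response": "verb"},
    {"intensity": "verb-like", "duration": "noun-like", "response": "noun"},
    {"intensity": "verb-like", "duration": "verb-like", "response": "verb"},
    {"intensity": "noun-like", "duration": "noun-like", "response": "noun"},
    {"intensity": "verb-like", "duration": "verb-like", "response": "verb"},
    {"intensity": "noun-like", "duration": "verb-like", "response": "noun"},
    {"intensity": "verb-like", "duration": "noun-like", "response": "noun"},
]

by_intensity = pooled_percent_noun(trials, "intensity")
by_duration = pooled_percent_noun(trials, "duration")
print("shift due to intensity:", response_shift(by_intensity))
print("shift due to duration:", response_shift(by_duration))
```

With these invented judgments the duration patterns move the pooled score further than the intensity patterns do, which is the kind of comparison behind the ranking of cues; the figures themselves carry no empirical weight.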


Putting the two experiments together, then, intensity was shown to be a very weak cue indeed. It ranked below duration, which in turn ranked below fundamental frequency inflections. Other similar experiments performed around the same time tended to confirm Fry's results. Moreover, measurements of peak intensities in natural speech often showed that syllables heard as stressed did not have higher peak intensity levels than did those heard as unstressed (see Bolinger, 1958, p. 24ff). Experiments such as these made the traditional dichotomy between simple phonetic categories intensity/stress and pitch/intonation untenable even in the phonetically more informed version of Jones and Kingdon. Some new reformulation of the relationship between stress and intonation was clearly necessary to reconcile linguistic description to these experimental findings.

3.4 Pitch-accent theory

The first linguist to respond to these experiments with a comprehensive account of prosodic phenomena in English was Bolinger (1958). Bolinger's solution was to demote intensity and duration to the role of quality-enhancing helpers, equating stress directly with fundamental frequency inflections. Arguing from the results of Fry's experiments and from many similar (albeit more informal) experiments of his own, Bolinger concluded that since stress is perceived predominately as a result of fundamental frequency obtrusions, and since intensity is such an ineffectual cue, the fundamental frequency obtrusions cannot be explained away as mere accompaniments to stress. Instead, the obtrusions must themselves be stress. To avoid confusion with the older categorizations, however, he replaced the term 'stress' with the term 'pitch accent', which he defined as 'prominence due to the configuration of pitches' (Bolinger, 1958, p. 36).
Since the adoption of the term 'pitch accent' might make the theory seem more different from the older categorizations than it really is, it is useful to compare Bolinger's pitch accents to the pitch primaries in the older accounts. Bolinger's pitch accents are in some ways reminiscent of Kingdon's kinetic tones. They are variously shaped pitch obtrusions that occur on intonationally prominent syllables to give the intonational meaning of an utterance. Unlike Kingdon's tones, however, which are merely associated with stresses, Bolinger's pitch accents are stresses. Where Kingdon attributed the prominence on the syllable taking the tone to the greater intensity of the associated stress, Bolinger attributed the prominence (the stress) directly to the tonal shape. The radicalness of Bolinger's account is even more obvious when his pitch accents are compared to the pitch primaries in the account by his contemporary Americans. In Trager and Smith's analysis, the
pitch levels are phonemes that make up meaningful intonational morphemes only in combination. In Bolinger's analysis, by contrast, the accents themselves are morphemes (Bolinger, 1958, p. 51). They are meaningful pitch obtrusions of a particular shape, gestalt configurations rather than analyzable sequences of levels. When an accent occurs at a particular place in an utterance, it gives prominence to the accented syllable and at the same time contrasts with other accents that could have occurred in the same place. Note that both of these functions are all-or-none paradigmatic oppositions. Either a particular syllable is or is not assigned an accent by the intonation pattern, and either that accent is one configurational morpheme or it is another. Although the name 'pitch-accent theory' emphasizes Bolinger's unhappiness with the traditional phonetic definition of stress as intensity, it is equally important that the theory does away with any attempt to divide intonation into such paradigmatic functions of particular intonation patterns and the syntagmatic organizational function that can be abstracted away from particular intonation contours. The earlier accounts may have erred in trying to identify the latter function of the prosodic pattern of an utterance with a different phonetic quality, and they often mixed various levels of descriptive abstraction in their examples of it, but their account of stress as a category separate from intonation has some truth if understood as an attempt to separate the syntagmatic prominence relationships among parts of the utterance from such things as the choice of pitch shapes that can paradigmatically contrast different utterances having the same organizational structure. Bolinger, on the other hand, denies that this distinction is possible. He reduces accent to its paradigmatic aspects, and denies that it can exist as a phonological property abstracted away from particular occurrences.
An accent is always peculiar to the specific utterance in which it occurs. It occurs 'for some purpose related to focus on a particular constituent or on a whole sentence' (Bolinger, 1978, p. 474). This last aspect of Bolinger's disagreement with earlier accounts is stated most clearly in Bolinger's article on intonation universals (Bolinger, 1978). Since this article is also the one in which Bolinger claims the greatest generality for the theory that he originally developed just for English, it is worth quoting in some detail:

Earlier discussion[s] of 'stress' have emphasized its demarcative function and overlooked its place in the melodic line. This was mainly a result of restricting the discussion to citation forms, which instead of being recognized as one-word sentences and hence bearing a sentence accent, were regarded as morphologically basic.... Given this assumption, it was then necessary, in a description of longer stretches, to provide for the 'suppression' of all 'stresses' except particular ones – a suppression of something that was never there in the first place as component of the word as such, but only as a manifestation of a sentence with a sentence accent. The most that can be said is that in the majority of languages when a given word is accented a given syllable ... is marked to receive the accent. The occurrence of the accent is an intonational matter. (Bolinger, 1978, p. 480)

This definition of accent as a property of specific intonation patterns is the source of another major disagreement between Bolinger and his contemporary Americans. Where Trager and Smith and the later generative linguists attempted to predict accent patterns of sentences from syntactic organization, Bolinger rejects the idea that accent can sometimes be predicted. There is no such thing as a neutral use of accent to signal only the organizational relationships among constituents without placing any particular paradigmatic focus on the accented constituents, and therefore no sentence has an accent pattern that can be predicted from its syntactic organization. In Bolinger's words, 'Accent is predictable (if you're a mind-reader)' (Bolinger, 1972). Bolinger extends this insistence on the dependence of accent on specific utterance contexts even to the smallest phrase level. The fixed place of the accent in a word is redundant to its segmental organization and can be overridden whenever pragmatic considerations governing the intonation pattern require that a different syllable be accented. 'If the speaker knows the identity of a word, he can accent any syllable in it for contrastive or attitudinal purposes' (Bolinger, 1978, p. 384). It is possible to criticize Bolinger on this point, but it is important to understand his theory within the context of the earlier treatments against which Bolinger was reacting. Thus, concerning the existence of neutral intonation patterns, Bolinger's account must be credited for understanding that prosodic structure is independent of syntactic structure. The attempts by Trager and Smith and by Chomsky and Halle to prescribe direct mappings between syntactic constituency relationships and prosodic prominence relationships were such an oversimplification as to be wrong. In this context, Bolinger was correct in emphasizing the role of pragmatics in determining the accentual organization of an utterance.
On the other hand, the prosodic structure of an utterance can differ from the syntactic structure for other than pragmatic reasons. The existence of such phenomena as the rhythm rule in English has long been noted (e.g., Kingdon, 1958a, p. 165; Bolinger, 1965; Liberman, 1975, p. 157ff), and similar phonological considerations can play a part in determining the prosodic organization of an utterance in non-stress-accent languages as well. For example, a long sequence of nouns in an entirely left-branching syntactic structure in Japanese is often broken up into several groups of prosodic phrases, as shown below in an example from Haruo Kubozono (p.c.):

(3.1)   syntactic structure:   [[[Ta'roo-no ito'ko-no] o'kusan-no] ne'esan]
                               Taroo's cousin's wife's sister

        prosodic structure:    [Ta'roo-no ito'ko-no] [o'kusan-no ne'esan]

Moreover, languages always seem to set some sort of phonological limits on the extent to which pragmatics can override the neutral accentual organization. In Mandarin, for example, certain lexically unaccented syllables cannot be given contrastive or emphatic prominence no matter how persuasive the intonational logic for doing so. Thus it is ungrammatical to contrastively stress the neutral tone .de or .bu in kann.de-jiann 'can see,' kann.bu-jiann 'can't see' (Chao, 1968, p. 150-151). Similarly in English, it is impossible to place stress on some inherently low-stressed syllables. One cannot contrastively shift the stress to the second syllable in thirty, forty, etc., even though these are often confused with thirteen, fourteen, etc. One must say 'I said THIRty, not thirTEEN,' even though logically it should be *'I said thirTY, ...' D. Robert Ladd (p.c.) says that there are similar constraints against stress shifts for emphasis in Roumanian. This last aspect of the predictability of accent also shows that Bolinger erred in the extent to which he refused to abstract the accent patterns of words away from actual occurrences. It is clearly not always true that the speaker can accent any syllable in a word for contrastive or attitudinal purposes. In reaction to the erroneous claim that phrasal accent patterns are entirely a syntactic matter, Bolinger insisted that accent is entirely an intonational matter, an insistence that inevitably led to errors in treating the grammar of lexical patterns. Grammatical functions that can be deduced from accents specified in the lexicon are a problem in pitch-accent theory. Thus Bolinger must ignore completely the culminative function. He grudgingly acknowledges the demarcative and distinctive functions, but claims that they are significant only in certain special utterances:

The demarcative function is important only in the early stages of language learning.... [A]dults exaggerate the prosodic features of words ... as a way of helping children learn them. A fixed stress on a word is a handy peg on which to hang the syllables.... But it is not essential, and it becomes redundant after the vocabulary is learned. The distinctive function is simply one more way ... of increasing the lexicon. Like the demarcative function, it is abated whenever the word is not in accent position. (Bolinger, 1978, p. 483-484)

Because of this trivialization of the grammatical role of lexical accent patterns, the relationship between the specific intonational organization of an utterance and the general prosodic organization of forms in the lexicon becomes for Bolinger a baffling fact. He cannot understand why it is that:

Of all the various kinds of stress systems, the most advantageous of all for intonation, is the rarest: that in which ... no particular syllable of a word is marked to carry the accent. (Bolinger, 1978, p. 483)

The treatment of accent and the lexicon reveals several other inadequacies of the theory as well. One of these is the account of accent-conditioned segmental properties. The phonological reality of lexical accent patterns separate from specific intonational contours is evident in the way that accent interacts with other phonological patterns. Lexical accent has served as the phonological environment in more than one sound change. Verner's Law is the usual example, but there are others, such as the development of tone from breathy voice in Dogri only in accented syllables (Ghai, 1980), or the development of voiced obstruents in Burmese only in unaccented ones (Okell, 1969, p. 15-18). And there are similar conditions in synchronic phonology. In English there is a phonotactic constraint against lax vowels occurring in open stressed syllables, a restriction that interacts with several other sound patterns involving syllable structure to form many stress-conditioned segmental alternations. For example, a medial /l/ after a stressed central vowel must close the syllable, and therefore will necessarily be a dark velarized liquid rather than a light alveolar lateral. Because the accent pattern of a word abstracted from its intonational context has no good place in Bolinger's theory, however, Bolinger must downplay such evidence of the link between the accent pattern of a word and its segmental organization. Thus in discussing demarcative accent he says:

[A]dults do not require more information than is supplied by the segmentals (including vowel quality, of course, which is often a historic residue of accent) in order to distinguish the words in a stream of speech. In fact, more words probably lack accentual prominence than have it, or have it in such a vestigial form that it serves no purpose. (Bolinger, 1978, p. 483-484)

Another misunderstanding arises in the treatment of the relationship between the lexically specified accent pattern and the larger intonation pattern in non-stress-accent languages, where lexical accent has not only a prescribed place in the word but also a prescribed pitch shape. For example, in cataloguing the occurrence of intonation patterns with low main accent followed by some sort of rise, Bolinger says that 'The C accent is common in Western languages' but that 'there is some evidence that C accents are excluded' in Japanese (Bolinger, 1978, p. 493-494). The reference that he cites here makes it clear that he realizes that the 'C accent' is impossible because of the inherent high tone of the accented syllable in Japanese; he clearly recognizes that some accent languages do not have the intonational choice of pitch shape for accent that is characteristic of stress-accent languages. But this difference among accent languages never becomes explicitly stated in his theory. Indeed, it cannot be stated in his theory because of what pitch accents are. They are meaningful gestalt configurations specified by the intonation. They include not only the pieces that are associated with the accent, but also any following phrase-boundary shapes. Thus, the 'C accent' in English is what Pierrehumbert (1980) would analyze as a complex intonation pattern composed of a low tone on the nuclear accent followed by a phrase-final configuration with high phrase-accent and an upstepped high boundary tone (L* H H%). In Bolinger's catalogue of 'C accents' in various languages, it is surely the following rising phrase-final configuration and not the low tone at the accent that contributes the shared attitudinal effect of 'antiassertive, downtoning ... inconclusiveness'.
While the Japanese system precludes a low tone for accents, it does include in its repertoire of boundary configurations a LH boundary sequence with just the attitudinal meaning that Bolinger attributes to the 'C accent' as an unanalyzable whole. In doing away with the older distinction between stress and intonation, however, Bolinger erased an important distinction between pitch accents and the other parts of intonation. In rejecting the structuralists' analysis Bolinger also rejected the insight that intonation in English can be divided into 'terminal junctures' and the pieces of the intonation contour that do not belong to phrase-final configurations. In defining stress as a complex pitch shape specific to a given intonation pattern, Bolinger precluded a proper understanding of what aspects of the relationship to intonation are shared by all accent languages and what aspects are peculiar to stress-accent languages. One final criticism that can be made of Bolinger's pitch-accent theory is that he underestimates the role of acoustic correlates other than pitch in the makeup of stress accent. From recognizing the importance in prominence of pitch obtrusion, Bolinger goes the further step of defining prominence as pitch obtrusion. While he recognizes that other factors do play a part in the perception of prominence, he relegates all these to secondary assisting roles. He speaks of 'the intimate relationship between syllabic length and accent', but claims that this relationship originates in the intonation pattern:

[The relationship] has its beginning in the need to assist pitch accent in the absence of flanking unaccented syllables.... A monosyllable that readily falls in accent position cannot execute a clear pitch turn unless it is stretched.... From repeated use in accent position, this stretching feature is built into the syllable and is encountered whether or not the syllable is actually accented. (Bolinger, 1965, p. 45)

He acknowledges that this durational effect is 'especially important,' but insists that duration is only a 'co-variable' with pitch accent rather than an actual part of the accent:

The experiments have made it clear that in the duration-pitch complex it is pitch that primarily signals accent. I therefore assume that duration is ancillary. Figuratively speaking, it is there IN ORDER TO make room for the accent. (Bolinger, 1958, p. 45)

However, while Bolinger's (and others') experiments have shown that F0 inflection can be an overriding factor in the perception of stress, in natural speech it almost never occurs unaccompanied by other cues. Algorithms for locating stressed syllables consistently do better if they incorporate a combination of acoustic cues rather than just a single cue such as F0 shape or value (Lea, 1977). Moreover, a more recent experiment has suggested that other acoustic correlates of stress can be more important than the intonational accent shape in some circumstances. Since this experiment has never been published, it will be described in detail in the following section.

3.5 Recent experimental evidence

Nakatani and Aston (1978) took advantage of two recent technical advances to avoid some of the problems of early experiments in stress perception. First, they used the technique of reiterant speech (cf. Liberman and Streeter, 1978) to get their test stimuli. This technique allowed them to test words in various positions in longer utterances without sacrificing the control of using minimal stress pairs. Thus they obtained more generally valid results than they could have if they had used words in isolation or in a fixed quotative frame, without resorting to the informal uncontrolled comparisons that constituted so many of Bolinger's experiments. Second, they used the technique of LPC analysis and resynthesis (Atal and Schroeder, 1967; Atal and Hanauer, 1971) to vary the test parameters in their stimuli. Therefore, they could take their test parameter values directly from speech, instead of merely modeling them on measurements from natural utterances as Fry did in his synthesized test stimuli tokens. They obtained their test stimuli by having six speakers read pairs of sentences such as The lawyer badgered the client versus The lawyer convinced the client, substituting sequences of ma syllables with the appropriate stress patterns for the test words in the pairs (e.g., MAma for badgered versus maMA for convinced). They then made various hybrid renditions of the test words by shuffling the F0, the duration, the amplitude, and the spectral patterns of the minimally contrasting mama pairs. For example, one hybrid might have the F0 pattern taken from the reiterant rendition of badgered combined with the other three patterns taken from the reiterant rendition of convinced. Nakatani and Aston presented the sentences with the hybrid and the original mama renditions to twelve native listeners who gave each stimulus a 'stress rating' on a six point scale ranging from 'MAma with a high degree of confidence' to 'maMA with a high degree of confidence'.
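The hybrid design lends itself to a small illustration. The following is a minimal sketch, not a reconstruction of Nakatani and Aston's procedure: it assumes a full crossing of the four parameters (the text says only that 'various' hybrids were made), and the `toy_listener` function with its weights is a purely hypothetical stand-in for real listeners, so the printed numbers are not experimental results. The per-parameter measure computed here is the mean rating of all tokens that carry the maMA version of a parameter minus the mean rating of all tokens that carry the MAma version.

```python
from itertools import product
from statistics import mean

PARAMS = ("f0", "duration", "amplitude", "spectrum")
SOURCES = ("MAma", "maMA")  # the natural rendition a parameter is copied from


def make_hybrids():
    """Assumption: every combination of the four parameters is synthesized."""
    return [dict(zip(PARAMS, combo)) for combo in product(SOURCES, repeat=len(PARAMS))]


def toy_listener(token):
    """Hypothetical rater: 1 = confidently MAma ... 6 = confidently maMA.

    The weights are invented for illustration only.
    """
    weights = {"f0": 2.0, "duration": 1.5, "amplitude": 0.5, "spectrum": 1.0}
    return 1 + sum(w for p, w in weights.items() if token[p] == "maMA")


def change_in_stress_rating(ratings, param):
    """Mean rating with the maMA value of `param` minus mean with the MAma value."""
    second = [r for token, r in ratings if token[param] == "maMA"]
    first = [r for token, r in ratings if token[param] == "MAma"]
    return mean(second) - mean(first)


hybrids = make_hybrids()
ratings = [(token, toy_listener(token)) for token in hybrids]
for p in PARAMS:
    print(p, change_in_stress_rating(ratings, p))
```

Because the toy design is fully crossed, the influence of the other three parameters averages out, and each printed value simply recovers the invented weight for that parameter. A larger value means that swapping that parameter alone moves the ratings further toward the other stress pattern.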
They then computed a 'change in stress rating' value for each parameter by subtracting the mean stress rating for all the tokens having the MAma pattern for that parameter from the mean for all tokens having the maMA pattern for it. The change-in-stress-rating value was thus an indicator of the extent to which the particular parameter influenced the perception of the reiterant word as having stress on the first or on the second syllable. The results of the experiment confirmed earlier experiments in regard to the limited efficacy of the amplitude level as a cue to stress. Amplitude consistently had a very low change-in-stress-rating value, lower than that of any of the other three parameters. The results also showed, however, that the relative effectiveness of the F0, duration, and spectral pattern varied depending on the position of the word in the sentence. Sentence-finally, where the test words had nuclear stress, the F0 pattern for the test words far outweighed the other parameters, as would be expected from earlier experiments such as those of Fry (1958). In prenuclear positions, however, duration and spectral pattern (i.e., vowel quality) vied with F0, sometimes ranking somewhat below and sometimes a little higher. And in postnuclear position, duration outranked F0 as highly as F0 outranked it in nuclear position. Given these results, it is difficult to agree with Bolinger's claim that duration is necessarily ancillary to pitch obtrusion in English.

3.6 Metrical theory

The previous section described recent research that showed that stress in English is cued by a complex of phonetic correlates including duration and spectral quality as well as fundamental frequency. This research poses a problem for Bolinger's account of stress and its relationship to intonation. While pitch-accent theory corrects the earlier error of equating stress directly with intensity level, it commits the same error of equating stress with a single phonetic feature. Perhaps this error was inevitable given the lack of an adequate means for representing the hierarchical nature of abstract prominence relationships. The traditional equation between stress and intensity follows almost directly from the representation of stress as a specific phonological feature of particular segments. Similarly, Bolinger's equation between stress and pitch obtrusion follows from the representation of accentual prominence as a property of particular patterns in specific intonation contours. The functionalist definition of accent as a syntagmatic contrast among the syllables or morae of a word captured the organizational nature of prominence as a property of lexical patterns, but the convenient representation of the relative prominences that organize larger prosodic structures remained a problem that showed little promise of solution until the development of metrical theory. Metrical theory was first proposed as an account of English prosody in Liberman (1975) and Liberman and Prince (1977).
This earliest formulation of metrical theory provided three hierarchical devices to represent various aspects of the prosodic pattern of an utterance. The 'tune' was a series of tones organized into a tree to represent the pitch shape and prominence pattern of the intonational contour of an utterance. The 'metrical tree' was the surface syntactic structure of the utterance with the nodes labeled to represent the stress pattern of prominence relationships among constituents. And the 'metrical grid' was a multi-layered sequence of beats to represent the underlying rhythmic structure of the timing of the prosodic events in the metrical tree and the associated tune. In addition, the theory provided detailed accounts of the alignment between the intonational contour and the stress pattern as represented in the metrical tree (Liberman, 1975), and of the alignment between the stress patterns in the tree and the timing patterns in the grid (Liberman and Prince, 1977). A given tune was aligned to the text by matching the node labels in its tree structure to those in the metrical tree, and a grid was aligned to the tree by associating the successive columns in the grid to the successive terminal constituents of the trees subject to the constraint that a grid could not be associated with a particular text unless at every node in the text, the head of the stronger constituent would correspond to a higher column in the grid. As an illustration of these devices and alignment procedures, consider the phrase Joey Davis. The stress pattern in the phrase might be represented as the following tree structure:

(3.2)      w        s
          / \      / \
         s   w    s   w
          Joey    Davis

where s means 'strong' and w means 'weak'. In an actual utterance of this phrase, the organizational structure of the tree could be associated with the following timing grid of beats:

(3.3)   [timing grid in musical notation: note values over the syllables of Joey Davis]

In idealized examples, the representation avoids the reference to any particular time values by replacing the musical notation with layers of numerals or tick marks (usually represented with x's):

(3.4)            3                         x
           2     2                   x     x
           1  1  1  1       or       x  x  x  x
           Jo-ey Da-vis              Jo-ey Da-vis
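The constraint on associating a grid with a tree can be made concrete with a short sketch. It is only one reading of the constraint, under stated assumptions: 'a higher column' is taken to mean higher than the column of the head of the weaker sister, the head of a constituent is taken to be the terminal reached by following strong branches only, and the class and function names (`Leaf`, `Node`, `grid_fits_tree`) are hypothetical. The tree is (3.2) and the column heights are those of the numeral grid in (3.4).

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    syllable: str


@dataclass
class Node:
    left: "Tree"
    right: "Tree"
    labels: str  # "sw" or "ws": the labels of the left and right branch


Tree = Union[Leaf, Node]


def size(tree):
    """Number of terminal syllables under a node."""
    return 1 if isinstance(tree, Leaf) else size(tree.left) + size(tree.right)


def head_index(tree, offset=0):
    """Position of the terminal reached by following strong branches only."""
    if isinstance(tree, Leaf):
        return offset
    if tree.labels == "sw":
        return head_index(tree.left, offset)
    return head_index(tree.right, offset + size(tree.left))


def grid_fits_tree(tree, heights, offset=0):
    """At every node, the strong head must have a taller column than the weak head."""
    if isinstance(tree, Leaf):
        return True
    mid = offset + size(tree.left)
    left_head = head_index(tree.left, offset)
    right_head = head_index(tree.right, mid)
    strong, weak = (left_head, right_head) if tree.labels == "sw" else (right_head, left_head)
    return (heights[strong] > heights[weak]
            and grid_fits_tree(tree.left, heights, offset)
            and grid_fits_tree(tree.right, heights, mid))


joey_davis = Node(Node(Leaf("Jo"), Leaf("ey"), "sw"),
                  Node(Leaf("Da"), Leaf("vis"), "sw"), "ws")

print(grid_fits_tree(joey_davis, [2, 1, 3, 1]))  # the column heights of (3.4)
print(grid_fits_tree(joey_davis, [3, 1, 2, 1]))  # tallest column moved onto the weak word
```

The first call accepts the grid in (3.4); the second rejects a grid whose tallest column sits on Joey, since the root of the tree labels Davis as the strong constituent.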

In an actual utterance of this phrase there would also be an associated intonation pattern. The intonation of calls, for example, would be transcribed as a sequence of tones LHM organized into a right-branching tree. This tune would be fitted to the text of Joey Davis as follows:

(3.5)   [tree diagrams: the s/w tree of Joey Davis, with the matched nodes circled, combined with the tune tree whose terminals L H M are labeled w s w]

        Joey   Davis
        L      H M

where the circled nodes in the text tree are the ones that are matched by the nodes in the tune's tree. Two characteristics of this account are especially noteworthy. First, stress is represented as a prominence relationship among constituents. Second, the stress pattern of the utterance is separate from any intonation pattern that might be aligned to it. The relational definition of stress distinguishes metrical theory from all of the earlier accounts described above. Where earlier definitions made stress a paradigmatic phonetic feature specified locally for a syllable or a paradigmatically chosen type of pitch obtrusion localized in the intonation contour around a syllable, metrical theory defines stress as a syntagmatic feature of comparison that cannot inhere to a single locality in the utterance. This definition allows the notion of accent as an organizational feature to be extended beyond the accent pattern of the word. Prominence can be more than a feature of a syllable that contrasts it to neighboring syllables so as to organize them into words. It can be a feature of a larger pattern that contrasts words within phrases and smaller phrases within larger phrases, creating an ever larger organizational structure up to the level of the entire utterance. The separation between stress and intonation also makes metrical theory obviously different from Bolinger's pitch-accent theory. Moreover, the way in which the separation is accomplished makes it different from other earlier American accounts as well. The two pieces of the prosodic pattern are not completely independent of each other as in the theory of stress levels and pitch levels, but rather are linked via an alignment procedure that is heavily constrained by a notion of intonational prominence, represented in the tree structure of the tune.
Although this particular representation of the organization of the tune proved to be problematic, it does capture Bolinger's insight that some parts of the intonation pattern are directly related to prominence patterns, and it does so without forfeiting the notion that stress patterns might be related to grammatical organization.


Later metrical descriptions have proposed various modifications to this earliest metrical account of the prosodic structure of an English utterance. These modifications resolve several problems in the earliest formulation of the theory, but they also raise several questions that are of especial importance if the devices of metrical theory are to be used to explain the relationship between intonation and accent. Pierrehumbert (1980) addresses the question of whether the tonal trees of the early theory accurately represent the structure of the intonational contour in English. Her proposed answers to this question resolve certain inconsistencies between observed F0 patterns and the earlier tree representation of the tune, but they also raise other questions involving the correct alignment between the tune and the text. In particular, there is the problem of how to align the boundary tones to the edges of constituents in the text when the alignment is not accomplished through the metrical tree. Selkirk (1980, 1981) addresses the question of how to correctly represent parts of stress patterns that incorporate paradigmatic distinctions. Her proposed answer to this question involves the reinterpretation of the tree as an organization of phonological constituents rather than as phonologically labeled syntactic organization. This answer resolves the problem of how to represent the accentual distinction between light and heavy syllables in the metrical tree, and also suggests answers to certain other problems stemming from the early identification of the metrical tree with syntactic constituency. At the same time, however, her solution to these problems raises several other questions concerning the nature of the tree and its relationship to the grid. In particular, the introduction of prosodic labels for some levels within the tree suggests that a more explicit account of the actual accentual features defining these levels is in order.
Also, the possibility that some of these features may be exactly those features of rhythmic alternation represented earlier with the grid raises the question of whether the grid might be redundant to the tree. This latter issue is discussed more recently in Prince (1983) and in Selkirk (1984). One critical question in this discussion is whether phenomena such as stress shift can be adequately represented by the metrical tree alone or need the more explicit representation of rhythm afforded by the grid. Other affiliated issues are the correct phonetic interpretation of the grid and the description of rhythm and accentual timing in general. Another critical question is whether the grid can represent phenomena that were earlier assumed to be a matter of constituent structure. An affiliated issue is the representation of delimitative prosodic features such as boundary tones and derivationally fixed accents.


The following four sections discuss these issues and attempt to relate them to the more general questions of what the relationship between accent and intonation is and whether the representational devices used to account for phrase-accent patterns and intonational contours in English can be applied to homologous structures in other languages.

3.7 The intonational contour in metrical theory

The first issue raised above in the list of questions that need to be addressed in evaluating the metrical account of English prosodic structure is the correct representation of the intonational contour of an utterance. In the original formulation of metrical theory Liberman (1975) proposed that the intonational contour was chosen from a lexicon of tunes, each with its own attitudinal meaning and prominence structure. The four tones that made up these tunes are reminiscent of the four pitch phonemes of Trager and Smith (1951), except that in combining together to make up a tune they are organized into a labeled tree structure which is intended to describe the accentual factors which constrain the association between the intonation contour and text of an utterance. Pierrehumbert (1980) discusses several problems with this representation of the tune, and proposes that the tune is not chosen from a finite lexicon of possible contours, but is instead generated by an intonational grammar from three types of tone 'morphemes': pitch accents, phrase accents, and boundary tones. The calling intonation in (3.5), for example, would be analyzed in this account as a L* pitch accent followed by a H*+L pitch accent and then by a downstepped H phrase accent and an upstepped L% boundary tone.
No tree or other higher structure is provided in this account, but it preserves the notion of intonational prominence by the distinction between tonal pieces that are associated with prominent syllables (the starred tones of the pitch accents) and pieces that are not (unstarred tones of two-tone pitch accents, phrase accents, and boundary tones). The linkage between tune and text is constrained by rules stating that no pitch accent can be associated to the phrase after the metrically strongest beat (the nuclear stress), and that if a beat in the metrical grid of the text is associated to a pitch accent, any beat of equal or greater strength must also be associated with a pitch accent. Thus in the Joey Davis utterance represented above, the strongest beat is on the first syllable of Davis, so the starred tone of the final pitch accent in the tune must be aligned to it. The next strongest syllable is the first syllable in Joey, and so the first pitch accent must be associated there, as shown below:

(3.6)            x
           x     x
           x  x  x  x
           Jo-ey Da-vis     +     L*  H*+L  H  L%

           Joey   Davis
           L*     H*+L H L%
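The two association rules stated above can be checked mechanically against a grid. The sketch below is an illustration under stated assumptions rather than Pierrehumbert's own formulation: the grid is reduced to a list of column heights, the nucleus is taken to be the unique tallest column, and the 'equal or greater strength' rule is applied only to beats up to and including the nucleus, since the first rule already bars accents on later beats. The function name `accents_allowed` is hypothetical.

```python
def accents_allowed(heights, accented):
    """Check a set of accented syllable positions against the two rules.

    heights  -- grid column height for each syllable, in order
    accented -- set of syllable positions that receive a pitch accent
    """
    nucleus = heights.index(max(heights))  # assumption: one strongest beat

    # Rule 1: no pitch accent after the metrically strongest beat.
    if any(i > nucleus for i in accented):
        return False

    # Rule 2: if a beat is accented, every beat of equal or greater
    # strength (up to the nucleus) must be accented as well.
    for i in accented:
        for j in range(nucleus + 1):
            if heights[j] >= heights[i] and j not in accented:
                return False
    return True


joey_davis = [2, 1, 3, 1]  # Jo, ey, Da, vis, as in the grid of (3.6)

print(accents_allowed(joey_davis, {0, 2}))  # L* on Jo-, H*+L on Da-
print(accents_allowed(joey_davis, {2}))     # nuclear accent only
print(accents_allowed(joey_davis, {0}))     # accent on Jo- but not on Da-
print(accents_allowed(joey_davis, {2, 3}))  # accent after the nucleus
```

Under these assumptions the first two associations pass and the last two fail, which matches the text: the prenuclear accent on Joey is optional, but it cannot occur unless the stronger beat on Davis is accented too.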

This account of the tune as a grammatical sequence of smaller tone morphemes seems in general to provide a more accurate description of the possible intonation patterns than did the earlier lexical approach to the intonational contour. The decomposition of this particular contour into two pitch accents, phrase accent, and final boundary tone, for example, neatly distinguishes between the optional prenuclear accent and the obligatory phrase-final configuration. The optionality of the initial L* is evident when the same intonational meaning of calling out to someone is applied to a monosyllabic utterance such as Joe. The grammar also allows for any number of pitch accents to precede the nucleus. This possibility must be allowed because in a longer phrase such as Joey Allen Davis, the calling intonation contour might have two prenuclear L* pitch accents, one on Joey and the other on Allen. And in less stylized intonation contours there are even more choices. For example, if the surprise/redundancy contour described in Sag and Liberman (1975) were applied to a long phrase such as That rascally Joey Davis, there could be a single L* accent on rascally, two L* accents on rascally and Joey, or a H* on rascally and a L* on Joey, with a somewhat different meaning for each different combination of prenuclear tones. A complete discussion of the merits of this compositional approach to intonation contours in English is beyond the scope of this chapter and is not directly relevant to the issue at hand. Instead of presenting further arguments for Pierrehumbert's intonational grammar, therefore, I will simply assume its correctness. On the other hand, the question of how to associate the various elements of the intonation contour to the constituents of the text is directly relevant to the other questions listed above.
In the following sections, various questions concerning the interpretation of metrical trees and grids and their relation to the perceived stress pattern will be discussed at some length. One issue that will occur and reoccur in this discussion is which device allows the better procedure for relating the intonation contour to the accentual pattern. Accordingly, the description of Pierrehumbert's arguments for aligning the tune to the text by means of the grid rather than by means of the tree will be deferred to the next section, which presents various interpretations of the metrical tree.
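The compositional view of the tune described in this section, one or more pitch accents followed by a phrase accent and a boundary tone, can be sketched as a pattern over tone labels. This is a simplified illustration and not Pierrehumbert's grammar itself: the pitch accents listed are only those named in this chapter (L*, H*, H*+L) rather than the full inventory, the L phrase accent is added alongside the H mentioned in the text, downstep and upstep are ignored, and the name `is_tune` is hypothetical.

```python
import re

# Inventories restricted to tones mentioned in the chapter (an assumption,
# not the complete system).
PITCH_ACCENT = r"(?:H\*\+L|H\*|L\*)"
PHRASE_ACCENT = r"[HL]"
BOUNDARY_TONE = r"[HL]%"

# One or more pitch accents (the last is the nuclear accent), then the
# obligatory phrase-final configuration: phrase accent plus boundary tone.
TUNE = re.compile(rf"(?:{PITCH_ACCENT} )+{PHRASE_ACCENT} {BOUNDARY_TONE}")


def is_tune(tones):
    """True if a space-separated tone string fits the sketch grammar."""
    return TUNE.fullmatch(tones) is not None


print(is_tune("L* H*+L H L%"))     # the calling contour on Joey Davis
print(is_tune("H*+L H L%"))        # the same contour on monosyllabic Joe
print(is_tune("L* L* H*+L H L%"))  # two prenuclear accents, as on Joey Allen Davis
print(is_tune("H L%"))             # no pitch accent at all
```

The first three strings are accepted and the last is rejected, reflecting the contrast drawn above between optional prenuclear accents and the obligatory nuclear accent and phrase-final tones.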


3.8 The metrical tree

In the earliest formulation of metrical theory, the metrical tree is understood to be the surface syntactic structure of the text, with the syntactic constituents that are bracketed together at each node of the tree labeled by phonological rule to be strong or weak. Certain purely phonological operations can rearrange this structure. The rhythm rule of English, for example, is represented as a reversal of the normal w/s labeling for a pair of nodes, with the reversal conditioned by a particular configuration of the columns in the metrical grid representing those nodes in the tree (Liberman and Prince, 1977, p. 316ff.). But the original labeling that is reversed by this operation is a labeling of syntactic constituents. The neutral organization indicated by the neutral stress pattern is the syntactic organization. A second important characteristic of the metrical tree in this earliest formulation is that all of the branchings at the nodes of the tree are binary. The stricture against other than binary branching was originally justified on the grounds that it ensures that the labels s and w be understood as a property of the relationship between the branches of a node rather than as properties of the labeled branches themselves. An isolated [s] or an isolated [w] makes no sense if prominence is a strictly relational property; nonbranching nodes are not allowed for the same reason that the patterns [ss] and [ww] are not permitted. The fact that the stricture also prohibits patterns such as [sww] with no further internal bracketing is less obviously justified on these grounds, and the only argument advanced for it is Liberman's statement that it extends the essentially binary character of the opposition between strong and weak (Liberman, 1975, p. 33). In the earliest version of metrical theory, the s/w tree of the text fulfilled two separate functions.
It was used to represent the underlying stress pattern of the utterance and to constrain the possible rhythmic structures that could be associated with a given stress pattern, and it was used to align the appropriate syllables with the notes in the intonation contour. This latter use presupposed a matching metrical structure for the tune; like the constituents of the text, the notes in a tune were organized into a strictly binary-branching tree whose nodes are labeled [sw] or [ws]. As Pierrehumbert has pointed out, however, the latter use presents several serious difficulties. One very important problem is that the accompanying tree representation of the intonation contour does not distinguish between notes that must be aligned with prominent syllables and those that simply follow or precede prominent syllables at some given distance in time. The alignment by matching the two trees, therefore, makes certain consistent errors in predicting how the F0 pattern lines up with the text. For example, the tree representation of intonation patterns often analyzed the nuclear pitch accent and the following phrase-final configuration as a pair of bracketed tones labeled [sw]. When such a pattern is aligned to the phrase bulldozer drivers' union, the tree-matching method predicts that the second tone will fall on the first syllable of union, as shown here:

(3.7)

[Diagram: the s/w-labeled metrical tree for bulldozer drivers' union matched to a tonal tree in which T1 is labeled s and T2 is labeled w.]

Whenever the location of the second tone can be identified clearly in the F0 curve of an actual utterance of this phrase, however, it is inevitably placed much earlier, at or near the end of bulldozer (Pierrehumbert, 1980, p. 44). Instead of being governed by the prominence relationship between the heads of the nuclear constituent and some other following constituent in the text, the second tone seems to be positioned relative to the text by a principle of fixed time after the nuclear tone.

A second class of problems arises in aligning intonation contours that have only one or two pitch accents to texts that are longer than one or two phrases. Pierrehumbert (1980) gives the following example. Suppose the surprise/redundancy tune were to be applied to the sentence The weather was unusually dry. This intonation contour is represented in Liberman (1975, pp. 60-70) by a right-branching tree with the middle H tone (the nuclear accent) as the designated terminal element (DTE) that is dominated only by strong nodes:

(3.8)

[w L] [s [s H] [w L]]

The text of the sentence, on the other hand, would have the following top-level structure if it is built simply by labeling the syntactic constituent structure: (3.9)

w w s
The weather was unusually dry.

If the intonation contour is aligned to the text by matching the two trees, the two final tones bracketed together under the branching s must go together on dry, and the first tone (the prenuclear L) must go on weather. To generate the possible pronunciation that puts the first accent on unusually, the tree structure for the text must be:

(3.10)

w s s
The weather was unusually dry.

Pierrehumbert describes this tree structure as something that 'would arise by some kind of extended cliticization' (1980, p. 45), but it is clearly problematic in an account that equates the textual tree structure with the syntactic surface structure of the sentence. Because of difficulties such as these, Pierrehumbert (1980) proposes that the intonation contour does not have constituent structure above the lowest level (where the starred and unstarred tones of two-tone pitch accents are bracketed together). She further proposes that control of the alignment between the intonation contour and the text is accomplished not by matching tree structures, but by means of the metrical grid. This latter proposal is a rather important one because Prince (1983) has used it as an argument for doing away with the metrical tree altogether in favor of representing the stress pattern of an utterance directly with the grid. However, while it is true that aligning the pitch accents to strong beats in the grid would solve the problem of how to get a pre-nuclear pitch accent on weather in (3.10), it is also the case that this problem with the metrical tree is an artifact of the assumed isomorphism between the tree structure of the text and syntactic constituency. Matching the weaker first pitch accent in the surprise/redundancy contour to weather rather than to unusually is not problematic if the organization in (3.10) is not assumed to be a drastic restructuring of the syntactic organization in (3.9), but rather is taken to be a separate prosodic organization.

The identification of the textual tree structure with syntactic structure in the earliest formulation of metrical theory and the accompanying insistence on having only binary-branching nodes caused problems also for the other use to which the tree was put, the alignment between the abstract accentual organization of the text and the actual perceived rhythmic structure that was to be represented by the metrical grid. One immediate problem was how to map the rather rich constituent structure represented in the strictly binary bracketing of the tree onto the often much coarser organization of the perceived pattern of rhythmically alternating stronger and weaker elements.
This is a problem because the most obvious mapping rule states that each level of bracketing in the tree corresponds to a separate level in the grid where the DTE of any branch labeled w does not have a beat in the grid and its sister branch labeled s does. This mapping would preserve the full richness of the tree's organization that is dictated by the stricture against other than binary-branching nodes. For example, to have only binary branches, the phrase John's three red shirts must have three layers of bracketing:

(3.11)

[[w John's] [s [w three] [s [w red] [s shirts]]]]

(three layers of binary bracketing, numbered 3, 2, 1 from the outermost to the innermost)

By the mapping rule stated above, therefore, this phrase must be aligned to a grid that has beats for at least three levels above the level that gives a beat to each syllable:

(3.12)

[Grid: John's three red shirts, with three levels of beats above the level that gives a beat to each syllable.]

The results of this mapping are in perfect mimicry of earlier generative accounts having as many stress levels. And they are subject to the same criticism as the earlier accounts: the rule predicts far more levels of stress than are actually perceived. Because of this discrepancy between the metrical organization of the tree and the perceived stress pattern, Liberman and Prince instead propose a less structure-preservative mapping via the 'Relative Prominence Projection Rule' (Liberman and Prince, 1977, p. 316). Instead of insisting that each layer of bracketing be represented directly in the grid, the RPPR states only that for each bracketed pair of branches, the designated terminal element dominated by the strong branch must correspond to a beat at a higher level in the grid than that containing the highest beat for the DTE of the weak branch. By this formulation of the mapping rule, the only prominence structure required for a grid to be aligned with the tree in (3.11) is that the beat on shirts be stronger than the beat on each other word, since shirts is the DTE of the strong branch in every bracketed pair. The minimal grid that can be aligned with the phrase, therefore, has only a single layer of phrase stress above the syllable beats:

(3.13)

                 x
x      x     x   x
John's three red shirts
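
As an illustration only, the Relative Prominence Projection Rule can be read as a small procedure that computes the flattest grid compatible with a labeled tree. The following Python sketch rests on assumptions that are not part of the original account: a terminal is encoded as a string, a branching node as a pair of (label, subtree) tuples, every terminal starts with a single beat, and the names minimal_grid and JOHNS_SHIRTS are hypothetical.

```python
# Hypothetical encoding (an assumption of this sketch): a terminal is a string,
# and a branching node is a pair of (label, subtree) tuples, label "s" or "w".
JOHNS_SHIRTS = (
    ("w", "John's"),
    ("s", (("w", "three"), ("s", (("w", "red"), ("s", "shirts"))))),
)


def minimal_grid(tree):
    """Return (terminals, column heights, DTE index) for the flattest grid
    in which, at every node, the strong DTE is higher than the weak DTE."""
    if isinstance(tree, str):
        # Every terminal receives one beat at the lowest level.
        return [tree], [1], 0
    (label_left, left), (_label_right, right) = tree
    words_l, heights_l, dte_l = minimal_grid(left)
    words_r, heights_r, dte_r = minimal_grid(right)
    words = words_l + words_r
    heights = heights_l + heights_r
    dte_r += len(words_l)  # shift the right DTE index into the joined list
    strong, weak = (dte_l, dte_r) if label_left == "s" else (dte_r, dte_l)
    # Raise the strong DTE only as far as the rule requires.
    heights[strong] = max(heights[strong], heights[weak] + 1)
    return words, heights, strong


def render(words, heights):
    """Return the grid as text, one row per level, highest level first."""
    widths = [len(word) for word in words]
    rows = []
    for level in range(max(heights), 0, -1):
        cells = [("x" if h >= level else "").ljust(w) for h, w in zip(heights, widths)]
        rows.append(" ".join(cells).rstrip())
    rows.append(" ".join(words))
    return "\n".join(rows)


if __name__ == "__main__":
    words, heights, _ = minimal_grid(JOHNS_SHIRTS)
    print(render(words, heights))
```

Applied to the bracketing in (3.11), the sketch adds a single beat above shirts and leaves the other three words at the lowest level, which is the pattern described for the minimal grid.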

The question that now comes to mind is how to explain the discrepancy between the many layers of phrasal organization represented in the metrical tree and the single layer of organizational structure represented in the minimal grid for this phrase. In the earliest formulation of metrical theory, this discrepancy is not an anomaly, because the metrical tree is assumed to be the syntactic organization of the phrase, which need not correspond exactly to the prosodic organization. Indeed, Liberman and Prince (1977) considered the possibility of discrepancy between grid structure and tree structure to be an actual advantage of the theory. They point out, for example, that the two different metrical tree structures corresponding to the two different parsings of American history teacher would have identical grid alignments by the RPPR, as shown in the diagram below (which leaves out the word-internal tree structure so that the phrasal constituency differences will be more obvious):

(3.14)

          x
  x       x       x
x x x x   x x x   x  x
American history teacher

          x
  x       x       x
x x x x   x x x   x  x
American history teacher

Liberman and Prince call this result a 'credit' to their theory, since the two parsings are usually observed not to be differently stressed (Liberman and Prince, 1977, p. 123-124). What is left out of this account, however, is how to explain the discrepancies between tree structure and grid at the level of the word. The word American in (3.14), for example, is given the internal tree structure shown in (3.15):

(3.15)



[Tree diagram: a binary-branching tree over the four syllables of American, with terminal labels w s w w.]

a structure which would indicate the bracketing structure shown in (3.16): (3.16)

[A [[[mer] i] can]]

Within the word, however, there is no syntactic justification for bracketing comparable to the syntactic tree structure that gives the metrical bracketing of John's three red shirts in (3.11) above. The organizational structure indicated by the tree structure in (3.15) must be a phonological organization rather than a syntactic organization. There is therefore no explanation for the discrepancy between the metrical tree structure assigned to this word and the grid that is aligned to it. There is no justification for the complex organizational structure in (3.15) other than a desire to make the representational devices within the word look like the representational devices above the word. Since the phrasal structure always brackets only two adjacent constituents, early metrical theory always brackets only two adjacent constituents word internally. Thus, the syllables of American get the three layers of binary-branching nodes shown in (3.15) even though the prosodic structure necessary to achieve the correct grid alignment justifies no organizational structure more complicated than what might be represented by the quaternary branching structure shown in (3.17):

(3.17)

[Tree diagram: a single node branching directly into the four syllables of American, with terminal labels w s w w.]

Another related problem in the alignment between the tree and the grid in the earliest metrical accounts is that some of the factors contributing to the perceived stress pattern at the lowest levels of the hierarchy could not be represented in the same uniform way that the prominence relationships at higher levels were. While in general, accent can be contrasted to paradigmatic systems such as lexical tone by its predominantly syntagmatic characteristics, it usually also has certain paradigmatic aspects. One such aspect in English is that at the very lowest level, the perceived stress pattern involves a contrast between two different classes of vowels. Thus the full versus reduced second vowels in gymnast and modest can cause these two words to be perceived as having somewhat different stress patterns, even though both have primary stress on the first syllable. In Liberman and Prince (1977), this distinction is represented by introducing the segmental feature [± stress] in addition to the relational features s and w. The patterns for gymnast versus modest, for example, would be:

(3.18)

 s   w       s  w
gymnast     modest
 +   +       +  -

Liberman and Prince justify this representation on the grounds that the two words have the same patterns of relative prominence in the sense that both make the first syllable stronger than the second. The s/w labeling of their metrical tree structures must be identical. The distinction between strong full vowels and weak reduced vowels, therefore, is 'submetrical'. It cannot be represented in the perceived stress pattern of the metrical grid aligned to the tree. It must be represented in a different way from distinctions in stress pattern at higher levels of the phrasal hierarchy, because of the insistence on having only binary-branching nodes. Later metrical treatments, on the other hand, avoid the introduction of a separate feature by allowing a nonbranching node (labeled Σ for 'stress foot') to dominate strong syllables, and a nonbranching node (labeled ω for 'prosodic word') to dominate a single stress foot, as first proposed in Selkirk (1980). Under this new metrical treatment, the contrast between gymnast and modest would be represented by the contrasting tree structures shown in (3.19):

(3.19)

   ω            ω
  / \           |
 Σ   Σ          Σ
 |   |         / \
 σ   σ        σ   σ
gymnast       modest

These different tree structures would presumably then be aligned to different grid structures that represented the differing degrees of relative prominence implied by the presence or absence of the separate nonbranching node labeled Σ by giving the first syllable in modest an extra level of beat strength above the beat on the second syllable: (3.20)

 x           x
 x   x       x
 x   x       x  x
gymnast     modest

The introduction of nonbranching nodes within the word was the first step toward a new interpretation of the metrical tree. The constituents labeled by Σ or ω were explicitly phonological units and not syntactic units. The notion that the metrical tree represents phonological structure rather than syntactic structure was soon extended to higher levels by the introduction of phrasal constituents such as 'intonational phrase' and 'phonological phrase' (Selkirk, 1981). These higher phrasal constituents were sometimes defined by paradigmatic distinctions comparable to the distinction between full and reduced vowels at the level of the stress foot. The intonational phrase, for example, was defined as the unit associated with an intonational melody, possessing a single nuclear tone (Selkirk, 1981, p. 385). Like the definition of the stress foot, this definition calls for the possibility of nonbranching nodes, since an utterance might contain only one intonational phrase, or an intonational melody might cover only a single phonological phrase.

In doing away with the earlier equation between syntactic constituency and the metrical tree structure, however, these later metrical accounts did not examine the implications of this reanalysis for certain other aspects of the original formulation of metrical theory. Selkirk's reinterpretation of the metrical tree relaxes the original stricture against other than binary branchings by allowing nonbranching nodes where necessary to represent paradigmatic aspects of certain prosodic categories, but it retains the converse side of the rule that prohibits a higher node from branching into more than two lower sister nodes. In the original formulation of metrical theory, this aspect of the rule had much less justification than the prohibition against singular nodes [s] or [w]. With the new understanding of the tree it has even less justification. If the stress pattern of an utterance of John's three red shirts is as indicated by the flattest grid in (3.13), what possible phonological information does the tree structure in (3.11) give that justifies it over the tree structure in (3.21)?

(3.21)

w      w     w   s
John's three red shirts

Indeed, the tree structure in (3.11) is not only less justified than the quaternary branching in (3.21), it is in one sense less correct, because it misleadingly suggests too much information. If the metrical tree is to be a phonological organization rather than a syntactic organization it should represent only such constituent structure as is actually indicated by the prosodic categories.

This line of thought suggests a further criticism of existing treatments of the metrical tree. The Relative Prominence Projection Rule states that the DTE of the s-labeled node in any bracketed pair be stronger than that of the w node, but it says nothing about how much stronger, and therefore creates no ordering among the elements in the terminal string that are immediately dominated by w. This lack of ordering usually means that many different metrical grids can be aligned with a given metrical tree. The tree structure shown in (3.11), for example, could be aligned with any of the following grids:

(3.22)

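a.
                 x
x      x     x   x
John's three red shirts

b. [Grid: John's three red shirts, with four syllable-level beats, two beats at an intermediate level, and one phrase-stress beat at the top level.]

c. [Grid: John's three red shirts, with four syllable-level beats, an alternating pattern of two beats at an intermediate level, and one phrase-stress beat at the top level.]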

The grid in (3.22a) is the minimal grid pattern needed, and the grid in (3.22c) is cited in Liberman and Prince (1977, p. 327) as an example of a common kind of alternating pattern that breaks up long sequences of otherwise equal pretonic stresses. If the tree structure in (3.11) is understood to be the syntactic structure of the phrase, then the possibility of aligning it with any of the perceived stress patterns in (3.22) might be a natural consequence of the intuitively obvious claim that the prosodic structure of an utterance is influenced by such things as speaking rate and situational pragmatics as well as by syntactic organization. On the other hand, if the metrical tree is understood to be the prosodic structure of the phrase, the possibility of aligning it to any of several different grids amounts to the strong claim that the stress structures represented in the metrical grid are not part of prosodic organization. Such a claim is intuitively false. Even the earliest formulators of metrical theory, who equated the metrical tree with syntactic structure, felt that the different grids represent different phonological constituent structures, as can be seen in their suggestion that it might be necessary to build an independent metrical tree on the metrical grid to reflect the different phonological 'foot' organization of the different grid patterns in (3.22a) and (3.22c) (Liberman and Prince, 1977, p. 327).

A possible remedy to these problems is to allow each node in the tree to have one, or two, or more branches leading into it as necessary to represent all and only the organizational structures that are well defined by the prosodic pattern, and to recognize the metrical foot suggested by Liberman and Prince as one of the well-defined organizational structures of the tree. The different organizations indicated by the three different grid patterns in (3.22), for example, might be represented by the trees in (3.23).
(Here the strong node among sisters that are bracketed together is labeled with the relational feature s, but the weak sisters among bracketed nodes and singular nodes that have no sisters are labeled neutrally with a place marker x):

(3.23)

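a. [Tree diagram for John's three red shirts with levels labeled intonational phrase, metrical foot, and prosodic word; intonational-phrase label: x; metrical-foot label: x; prosodic-word labels: x x x s.]

b. [Tree diagram with the same three levels; intonational-phrase label: x; metrical-foot labels: x s; prosodic-word labels: x x x s.]

c. [Tree diagram with the same three levels; prosodic-word labels: x s x s.]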

A question that such a use of metrical trees immediately raises is what it means for an organizational structure to be well defined. For some of the structural pieces of the English prosodic system, we can offer confident answers to this question. For example, the organization into intonational phrases in English is well defined by the nuclear stresses, that is, by the pitch accents in the utterance's intonation pattern that are immediately followed by the tonal configuration consisting of a phrase accent plus final boundary tone, as identified by Pierrehumbert (1980). We can probably be similarly confident about the organization into stress feet within the word, which seems to be well defined by the segmentally strong syllables, as identified by Selkirk (1980) and others. This level is recognized in several otherwise incompatible accounts. It is, for example, the level that Vanderslice and Ladefoged (1972) identify by their feature [± heavy]. There is fairly good evidence also for the level of organization that Selkirk has called the 'prosodic word'. This is the level that Vanderslice and Ladefoged (1972) have identified with their feature [± stress]. From the experiments of Fry and others (described above in sections 3.3 and 3.5), I will hazard to claim that structure here is well defined by some phonetic feature of prominence involving vowel duration. (More will be said about this in the next section and at intervals in Chapters 6 through 8.)

For much of the space in the hierarchy between the stress foot and the intonation contour, however, we have only some vague intuitions about relative prosodic strength. The level of organization identified as the metrical foot in the trees in (3.23), for example, is conjectured to be well defined on the basis of the types of metrical grids that linguists such as Liberman and Prince (1977) have set up for these and similar utterances. I suspect that the definition of this prosodic category corresponds to something like the placement of prenuclear pitch accents in the intonation contour. I suspect also that the similar alternating stress patterns that have been described as products of a rhythm rule are instances of this category and are achieved primarily by the placement of pitch accents at relatively regular intervals on pretonic words. If this surmise is proved true, it would explain why the rhythm rule is said not to apply to postnuclear stress clashes, as in sports contest. The grammar of English intonation patterns fills the space after the nuclear pitch accent with the phrase accent and final boundary tone.
The perceived greatest stress in words in this space cannot be shifted by the judicious placement of a pitch accent. Selkirk (1984), on the other hand, claims that stress shift and other alternating stress patterns are defined by some sort of prominence structure that is independent of the intonational organization. She claims, for example, that a normal neutral pronunciation of the sentence It's organized on the model of a gallon of worms might have no pitch accent before the last word, and yet will reliably bear a rhythmic prominence on either organized or model (Selkirk, 1984, p. 49-50). She recognizes an intermediate prosodic level defined by the prominences of pretonic pitch accents, but she also makes a very strong claim that 'in either pre- or postnuclear position, a non-pitch-accent-bearing syllable will be locally and reliably prominent if it is the last main-stressed syllable on a cyclic domain containing another stress word (but no pitch accents)' (Selkirk, 1984, p. 154-155). There is some experimental evidence against this claim for so much prosodic organization defined by prominence features not involving pitch accents, but that evidence is more appropriately discussed in the next section on the metrical grid. Precise definitions of prosodic structure at these intermediate organizational levels in English will require extensive experimental investigation into the phonetics and phonology of the perceived prominence patterns that prompt such strong claims.

Since we are not yet at a stage where we can give a complete definition of the prosodic organization of English, it would be premature to claim that the account of metrical tree structure proposed above is representationally adequate. I am encouraged, however, to believe that it will prove to be so because of several advantageous characteristics that are not always shared by the earlier standard account in which more than two branches were not allowed at any node. The first characteristic is that in terms of representing the rhythmically adjacent prominences of stress clashes, the revised metrical tree should be at least as powerful as the metrical grid. It should be able to represent the patterns that give rise to the rhythm rule in a way that is equivalent to that of the grid, because the nodes of the revised account can be in one-to-one correspondence to the relevant beats in the grid. Indeed, they must be in one-to-one correspondence if the relevant levels in the tree are defined by the same prominence structures that give rise to the perception of the corresponding layers of beats in the grid structure. This fact is more obvious if we place each node label above the DTE of the node rather than centering it above the entire constituent.
Compare, for example, the tree structure in (3.24a) to its corresponding metrical grid representation in (3.24b), with the stress clash in the grid marked in the usual way with asterisks: (3.24)

a. [Tree diagram for thirteen men with levels labeled intonational phrase, metrical foot, and prosodic word; intonational-phrase label: x; metrical-foot labels: x s; prosodic-word labels: x s x.]

b.
          x
    *x   *x
 x   x    x
thirteen men

Furthermore, there are a number of cases in which the tree representation would capture certain facts about stress clash in a more natural way than does the grid. These cases are the pairs of phrases (such as invalid sample versus insipid coffee) that would ordinarily be assigned identical grid representations by the Relative Prominence Projection Rule but which behave differently with respect to the applicability of the rhythm rule. Both invalid sample and insipid coffee would have the grid in (3.25a) as their flattest interpretation, but only invalid sample can undergo the rhythm rule to get the grid in (3.25b). (3.25)

a.
         x
  x      x
x x  x   x  x
invalid sample
insipid coffee

b.
         x
x        x
x x  x   x  x
invalid sample

Liberman and Prince suggest that such differences are related to the inherent metrical strength of lexical monosyllables and productive prefixes, and they propose an addendum to the Relative Prominence Projection Rule stating that such units always be given at least two levels of strength in the grid (1977, p. 322). By this ex post facto modification to the RPPR, the two phrases would differ in that (3.25a) would no longer be a valid grid for the phrase with the strong prefix. The flattest grid that could be aligned to an utterance of invalid sample that had not undergone the rhythm rule would instead be the one shown in (3.26):

(3.26)

         x
  x      x
x x      x
x x  x   x  x
invalid sample
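
To see why the extra level on the prefix matters for the rhythm rule, it may help to state the notion of clash procedurally. The Python sketch below assumes one common formulation, which is not spelled out in the text above: two beats clash if they are adjacent at some level of the grid and no beat of the next lower level intervenes between them. Grids are reduced to lists of column heights, one per syllable; the height lists are my reading of the patterns discussed here and, like the name clashes, should be treated as assumptions.

```python
# Column heights per syllable (assumed readings of the grids discussed above).
FLATTEST_GRID = [1, 2, 1, 3, 1]        # in-va-lid sam-ple or in-si-pid cof-fee
STRONG_PREFIX_GRID = [2, 3, 1, 4, 1]   # in-va-lid sam-ple with a strong prefix
THIRTEEN_MEN = [1, 2, 3]               # thir-teen men, unshifted
THIRTEEN_MEN_SHIFTED = [2, 1, 3]       # thir-teen men after stress shift


def clashes(heights):
    """Return (level, i, j) for each pair of columns i < j that are adjacent
    at `level` with no intervening beat at `level - 1`."""
    found = []
    for level in range(2, max(heights) + 1):
        at_level = [i for i, h in enumerate(heights) if h >= level]
        for i, j in zip(at_level, at_level[1:]):
            between = range(i + 1, j)
            # A beat one level down between i and j makes the pair alternate.
            if not any(heights[k] >= level - 1 for k in between):
                found.append((level, i, j))
    return found


if __name__ == "__main__":
    for name, grid in [
        ("flattest grid", FLATTEST_GRID),
        ("strong-prefix grid", STRONG_PREFIX_GRID),
        ("thirteen men", THIRTEEN_MEN),
        ("thirteen men, shifted", THIRTEEN_MEN_SHIFTED),
    ]:
        print(name, clashes(grid))
```

Under these assumptions the flattest grid contains no clash, whereas the grid with the strengthened prefix contains one at the third level between the stressed syllable of invalid and the first syllable of sample, which is the configuration that would license the rhythm rule.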

In a labeled tree representation, on the other hand, the difference between invalid and insipid would automatically be represented without the introduction of a new constraint on the alignment between textual constituents and rhythmic structure. If the prosodic characteristics of monosyllables and strong prefixes are well defined, these units would already be incorporated as a separate explicit organizational level in the tree, as shown in (3.27a):

(3.27)

a. [Tree diagram for invalid sample with levels labeled intonational phrase, metrical foot, prosodic word, and syllable; syllable labels: x s x s x.]

b. [Tree diagram for insipid coffee with the same four levels; syllable labels: x s x s x.]

Another advantage of the revised metrical tree representation is that its explicitly labeled organizational levels are far more conducive to precisely stated, testable hypotheses about the exact phonological and phonetic form of the patterns of perceived prominences at the different levels of the hierarchy. For example, the tree representation of the stress clash in thirteen men in (3.24a) above separates the two syllables of thirteen at the level of the prosodic word, and does not organize them into a single unit below the metrical foot. I have chosen to represent thirteen in this way because the word does not seem to have a fixed lexical stress pattern as does, say, raccoon. Although both words are marked alike as having primary stress on the second syllable by most dictionaries, there are many situations in which native speakers claim an initial stress for thirteen, or perceive equal stress on the two syllables. In counting, for example, one often hears ...thirteen, fourteen, fifteen.... And in the sentence he's thirteen, Kenyon and Knott's (1944) pronunciation dictionary marks the word as having two primary stresses. (The syllabic organizations of the citation forms also suggest that the stress patterns are not really the same; the first syllable of raccoon must be unstressed because the medial [k] unambiguously goes with the second syllable, whereas the [t] in thirteen can be ambisyllabic or geminated.)

The tree structure in (3.24a) is meant to capture such facts about the word thirteen, and at the same time to offer an explicit hypothesis about the phonetic patterns that are responsible for the perceived stress patterns, including the perceived stress shift in thirteen men. The hypothesis is that the vowel duration differences and related syntagmatic phonetic contrasts that define the organization of the two syllables of raccoon into one prosodic word do not differentiate the first and second syllables of thirteen. Any organization of these two syllables into a single prosodic unit must be accomplished instead by the same mechanisms that might be used to make good night a single phrase. If the conjecture above about the definition of the metrical foot is correct, then the tree structure in (3.24a) amounts to a hypothesis that a second syllable stress on thirteen in thirteen men could only be accomplished by putting a pretonic pitch accent there, and that the perception of stress shift in thirteen men is due to a tendency to place pretonic pitch accents at regular intervals so that their tonal structure can be fully realized. Some evidence in favor of this interpretation is the fact that the unshifted stress pattern thirteen men is possible if the tonal shape for the prenuclear pitch accent on thirteen is one that can be realized easily in the context of the tonal shape chosen for the nuclear pitch accent. The surprise/redundancy contour L* H* L- L%, for example, can be said with the L* pitch accent aligned to either the first or the second syllable of thirteen, but the hat pattern contour H* H* L- L% is difficult to say fluently unless the prenuclear H* is placed on the first syllable.

In any case, this representation offers an explicit hypothesis about the perception of shifted rhythm that can be tested experimentally. The claim for the necessity of a pitch accent can be tested by synthesizing various versions of such phrases with different intonational structure. The prior claim that thirteen and raccoon (or invalid and insipid) have different prosodic structure at the level of the prosodic word might be tested by comparing duration patterns and other phonetic characteristics in reiterant renditions of the contrasting words.
The requirement that the levels of the metrical representation be defined by specific accentual features reveals the kinds of issues that need to be addressed experimentally before an adequate characterization of the rhythm rule can be achieved.

A related benefit is that the revised tree representation apparently resolves the problem of how to align the intonation pattern to the text. Instead of being a separate phonological structure whose prominences must somehow be made to match the textual prominences, the various pieces of the intonation contour are incorporated directly into the prosodic pattern represented by the tree. The pretonic pitch accents go where they do because they are the prominences that formally define one level of prosodic organization for the utterance. The phrase accents go where they do because they define the preceding nuclear accents as the prominences organizing the utterance at another higher level. And the boundary tones go where they do because the information about constituent edges necessary for their proper alignment is automatically represented in the constituent structure of the tree.

Another especially encouraging characteristic of the revised metrical tree is that it seems capable of representing analogous prosodic organization in other languages. Pierrehumbert and Beckman have recently used such a metrical tree representation in a description of accent and intonation structures in standard Japanese (Beckman and Pierrehumbert, 1984; Pierrehumbert and Beckman, in preparation). In this description, the initial LH tonal pattern and the culminative placement of the accentual HL define a relatively low level of metrical organization that seems comparable to the prosodic word in English. The tonal subordination phenomenon known as 'downstep' or 'catathesis' (Poser, 1983, 1984) defines a next higher level of organization comparable to the metrical foot. And the presence of a final L% or LH% boundary tone accompanied by phonetic final lowering or raising defines an even higher level that is comparable to a level that might be called 'utterance' having similar phonetic manifestations in English.

This last characteristic of the revised metrical tree merits further discussion. Before it can be discussed properly, however, another issue needs to be addressed. It was suggested above that when the stricture against nonbinary branches is relaxed sufficiently, the metrical tree is capable of representing stress clash as well as does the metrical grid. This fact makes the use of the grid as a separate phonological device in English seem of dubious value. The need for the grid to represent patterns such as stress clash has been cited as an important difference between accent systems like that of English and accent systems like that of Japanese (Prince, 1983). If the rhythmic requirements of stress patterns can be represented just as well using the metrical tree, this difference disappears. The use of the grid as a separate phonological device will be examined more fully in the next section.
The issue of the applicability of the revised tree representation to other languages will be examined further in section 3.10.

3.9 The metrical grid

In section 3.8, a use of the metrical tree was proposed that would incorporate the prosodic structures of intonation directly into the representation of the hierarchy of accentual prominences that organizes an English utterance. This use of the tree is foreshadowed in earlier proposals to understand the tree as a structure built from phonological constituents rather than as a phonological labeling of syntactic constituents, and is in that sense a natural extension of already existing developments within metrical theory. In another sense, however, this proposal is fundamentally different from earlier metrical descriptions of English prosody. Because it incorporates the higher-level prominences defined by the intonation contour and the lower-level prominences defined by accentual features other than tone into the same representational device, this use of the tree would erase the distinction between stresses and intonational prominences that has been assumed as a basic tenet of the theory. (This consequence of the incorporation of intonation directly into the representation of the perceived stress pattern is somewhat reminiscent of Bolinger's views. Bolinger's nonhierarchical account, however, would make pitch accents part of the stress pattern at the expense of all other accentual features.)

The complete distinction between intonation and stress in earlier metrical accounts is accomplished in large part by the hypothesized existence of the metrical grid. The grid is a hierarchical prominence structure for the timing of events in the utterance as distinct from the strong/weak relationships of the metrical tree and from any prominence relationships defined over the tones in the intonation contour. In Liberman (1975), this metrical grid structure is defined as the abstract organization of the occurrence of prosodic events in time, and is contrasted to the textual and tonal trees, which represent the structural organization of the events themselves. In Liberman and Prince (1977), the grid structure and the constituent relationships of the metrical tree are said to be responsible together for the perceived stress pattern of an utterance. Liberman and Prince further posit that this perceived stress pattern is realized phonetically in a way that is perfectly separable from the phonetic implementation of the associated intonation contour. They say:

English is a stress language, not a tone or pitch-accent language; English stress patterns, within and among words, have phonetic reality as rhythmic patterns entirely independent of their role in orchestrating the placement of intonation contours. (Liberman and Prince, 1977, p. 250)

In the very earliest metrical account, the independence of stress from intonation was assured not only by the assumption of an 'independent phonetic reality as rhythmic patterns', but also by the way in which stress patterns were related to the alignment between text and tune. Thus, in Liberman (1975), the intonation contour was aligned to the text by means of the s/w relationships built on the syntactic constituent tree, making the tune one step further removed from any contribution of the grid to the perceived stress pattern. Any possible interaction between prominences in the tune and the perceived stresses from the grid had to be mediated through their common alignment to the metrical tree. More thorough research into the phonetics and phonology of the intonation contour soon led to the abandonment of this method of aligning the tune to the text in favor of an alignment procedure that matched tonal prominences directly to the perceived strong beats in the grid (see Pierrehumbert, 1980, 1981, and the summary of her criticisms of the earlier tree-matching techniques in section 3.7 above). This new alignment procedure or something like it is assumed in all later metrical treatments of the grid and the relationship between stress and intonation in English. For example, Prince (1983) cites the criticisms of Pierrehumbert (1980) in arguing that only the flattest interpretation of the metrical tree is compatible with the correct alignment between tune and textual prominences. He further outlines ways in which the strictly binary bracketings of the metrical tree are more complex than necessary for expressing rules that generate such patterns as default compound-word stress and phrasal nuclear stress. In short, he argues that the metrical tree is irrelevant for the linguistic representation of stress patterns and proposes to do away with the tree entirely in favor of having only the grid. Selkirk (1984) similarly does away with the metrical tree in favor of the grid for representing stress patterns, although she gives a somewhat different interpretation to the alignment of the intonation contour via the grid. In her account, the intonation pattern of an utterance is prior to the rhythmic pattern and the pitch accents in the contour induce some of the rhythmic beats that are built up in the derivation of the grid. Both of these later accounts also assume the validity of the original assertion that there is a completely separate phonetic reality for the stress pattern as represented in the metrical grid. Prince, for example, contrasts the types of tonal prominences that belong to lower levels of the accentual hierarchy in non-stress-accent systems to the hierarchy of accentual prominences represented in the grid. He calls them 'notably distinct' modes of prominence, and ascribes the stress patterns to 'rhythmic effects' that include 'vowel reduction, vowel and consonant lengthening, and lenitions and fortitions of various kinds' but not F0 patterns (Prince, 1983, p. 89). Selkirk, similarly, assumes that the rhythmic structures represented at various levels in the metrical grid are independent of the F0 patterns of the pitch accents in the intonation contour, although the Pitch Accent Prominence Rule will cause a rhythmic prominence (a beat) to be added onto the grid for all syllables associated to a pitch accent in the intonation contour (Selkirk, 1984, p. 152). This assumption that stress as embodied in the grid has a completely separate phonetic reality is critical to the understanding of the relationship between the intonation pattern and the accent pattern in English, because an alignment procedure between the two is necessary only if stresses are demonstrably different from pitch accents at all levels. It is appropriate, therefore, to examine what experimental evidence there is for such a separate phonetic reality.

In their original statement of the distinction between stress and pitch accent, Liberman and Prince cite a set of experiments on segment durations as evidence that stress patterns have a separate phonetic reality as rhythmic patterns. They describe this evidence as follows:

One promising line of inquiry relies on the fact that it is possible to mimic an arbitrary English utterance while substituting reiteration of a single syllable (e.g., me) for each syllable of the original. Such 'reiterant speech' shows stable durational patterns, which depend on the stress pattern and constituent structure of the utterance (cf. Liberman and Streeter, 1976), just as durational patterns in natural speech do. It has been shown (by Nakatani and Schaffer, 1976) that listeners are able to extract stress and constituent-structure information from reiterant speech, and that (under the condition of the cited experiment) duration is the dominant cue in both cases. (Liberman and Prince, 1977, p. 250)

The implication of citing this particular type of experiment seems to be as follows. If stress patterns are a metrical structure that is at all levels independent of an utterance's intonation pattern, this structure should be evident in the utterance's durational pattern, and the durational pattern should outweigh the F0 pattern as a cue to stress. The particular experiment cited in this passage, however, does not show duration being 'the dominant cue' to stress. The 1976 Nakatani and Schaffer paper that they refer to was later expanded into an article (Nakatani and Schaffer, 1978), but nowhere in the article is there any mention of a comparison of cues to stress separate from constituent structure. Instead, what the experiment compares is the efficacy of various cues to word boundaries that cannot be disambiguated by the stress pattern as it would be represented in the metrical grid. The experimenters used the hybrid speech-synthesis technique described above in section 3.5 to discover which of the four parameters F0, duration, amplitude, and spectrum dominate as cues in the parsing of reiterant renditions of three-syllable phrases such as noisy dog and bold design, phrases in which the MAmaMA sequence could be understood either as MAma MA or as MA maMA. The experiment showed clearly that the durational pattern was the only cue that disambiguated such phrases effectively, but this result says nothing about duration as a cue to the stress patterns as they would be represented in the grid, since the minimal grids that would be aligned to the contrasting phrases are the same. (In the revised tree representation outlined in the previous section, on the other hand, the phrases might differ in having the first MA in the bold design type dominated by an unbranching node at the level of the prosodic word.)

On the other hand, Nakatani and Aston did later use the hybrid speech-synthesis technique to compare cues to contrasting stress patterns that did not also differ in constituent structure, and that experiment does show duration to be an effective cue to stress in many contexts (Nakatani and Aston, 1978). But it was not always the dominant cue. In sentence-final nuclear-accent position, for example, the change-in-stress rating for the F0 pattern was far higher than the rating for the durational pattern. The only context in which the durational pattern outweighed the F0 pattern to the same extent was in postnuclear position after an emphatically stressed word, as in reiterant renditions of the words jacket and balloon in The DUSTY jacket was in the corner, versus The DUSTY balloon was in the corner. This particular example in Nakatani and Aston's experiment is especially interesting, because the authors cite it as evidence for the high degree of redundancy that speech provides in cueing stress. They say that 'pitch was nullified as a stress cue for words following an emphatic adjective, but duration and vowel quality enabled listeners to hear the stress pattern of the words nevertheless' (Nakatani and Aston, 1978, p. 25). That is, the contrast between MAma and maMA was apparently not neutralized in this postnuclear position where pitch accents are not allowed by the grammar of English intonation contours. (This result is some evidence counter to Bolinger's position that the presence of a pitch accent is the only relevant factor in the perception of stress.) This result is important because it contradicts two earlier experiments that used natural minimal pairs to show that stress contrasts are neutralized in this position (Scott, 1939; Huss, 1978). The minimal pairs in these other experiments were pairs such as IMport versus imPORT and DEcrease versus deCREASE, pairs in which the stress patterns are not much reflected in the vowel qualities and therefore must be differentiated by something like duration or loudness pattern if they are not aligned with intonational prominences.
In Huss's experiment, the stimuli were obtained by having several speakers read dialogues ending in a sequence that put a contrastive emphasis in the answer on the word preceding the test word (e.g. The GERMANS import sinks, or The GERMANS' import sinks). Subjects listening to these sentences out of context could not reliably identify the test words as the verb or the noun. Measurements of the contrasting test words, however, showed them to have reliably different duration patterns, a fact that suggests that the difference between Huss's results and those obtained by Nakatani and Aston reflects some difference in experimental design rather than a real contradiction concerning the perceptual dominance of the duration pattern as a cue to lexical stress patterns in the absence of phrasal pitch accents.


These experiments suggest that duration is indeed the dominant cue to lexical stress patterns in some intonational positions in English. However, this proof does not constitute sufficient evidence that the metrical grid has an entirely separate phonetic reality, since duration could conceivably signal stress at a single well-defined level of the prosodic structure via the simple equation 'long equals stressed' (or alternatively, through the equation 'long equals loud equals stressed'; see chapter 5). As Liberman and Prince acknowledge, the metrical grid as a phonetic structure completely separate from the intonation contour implies a more complicated relationship. The stress patterns of English utterances must 'have phonetic reality as rhythmic patterns' and the patterns must be both 'within and among words'. Their effect on the durational structure of the utterance must not simply be to produce sequences of longs and shorts corresponding to the succession of stressed and unstressed syllables within the word at a level of the hierarchy below the level where pitch accents exist. Their effect must rather be to produce a complex intersection of layers of alternations between long and short durations.

In another early discussion of the metrical grid, Liberman argues that there is evidence for such hierarchic rhythmic structure in the durational patterns seen in a certain class of sentences, namely those in which the predicate includes a verb-particle construction that can be stressed on either component (Liberman, 1975, pp. 188-193). The example that he gives is the sentence John struck out my friend, in which either the struck or the out could be more stressed, at the speaker's discretion. According to Liberman's account, the rhythmic structure of John differs in the two cases. When the following word is stressed, John must have two beats at the lowest level of the hierarchy in order to maintain the optimal grid structure for the utterance, as shown here:

(3.28)  a.

    2           2      2
    1    1      1   1  1
    John struck out my friend.
         w      s

        b.

    2    2             2
    1  1 1      1   1  1
    John struck out my friend.
         s      w
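The work done by John's second beat in (3.28b) can be made concrete with a small sketch. The Python below is an illustration under stated assumptions only: the `Beat` record, the `has_clash` helper, and the encoding of a grid as a list of lowest-level positions are hypothetical conveniences introduced here, not part of Liberman's formalism. A clash is taken to be two marks at the same level on adjacent lowest-level positions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Beat:
    """One position on the lowest grid level (hypothetical encoding)."""
    label: str   # syllable text; "(John)" stands for John's extra beat
    height: int  # number of grid levels on which this position has a mark


def has_clash(grid: List[Beat], level: int) -> bool:
    """True if two adjacent lowest-level positions both carry a mark at `level`."""
    marks = [beat.height >= level for beat in grid]
    return any(left and right for left, right in zip(marks, marks[1:]))


# (3.28a): 'out' is the stronger element, so the level-2 marks already alternate.
grid_a = [Beat("John", 2), Beat("struck", 1), Beat("out", 2),
          Beat("my", 1), Beat("friend", 2)]

# (3.28b) with a single beat for John: the marks on John and 'struck' are adjacent.
grid_b_one_beat = [Beat("John", 2), Beat("struck", 2), Beat("out", 1),
                   Beat("my", 1), Beat("friend", 2)]

# (3.28b) as posited: a second lowest-level beat for John separates the two marks.
grid_b = [Beat("John", 2), Beat("(John)", 1), Beat("struck", 2),
          Beat("out", 1), Beat("my", 1), Beat("friend", 2)]

for name, grid in [("a", grid_a), ("b, one beat", grid_b_one_beat), ("b", grid_b)]:
    print(name, "clash at level 2:", has_clash(grid, 2))
```

Under this encoding only the one-beat variant of (b) reports a clash, which is the configuration that the additional beat on John is posited to avoid.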


Liberman shows duration measurements and F0 curves for a pair of sample renditions of the two sentences, and observes that the only noticeable difference between the two is in the length of John. In version (b) John is substantially longer, in accordance with the differences in the posited metrical grids. In fact, the actual F0 contours shown are not incompatible with a claim that the (a) version has a H* pitch accent on out, but this example is nevertheless very interesting because it suggests a class of tests that could differentiate between evidence for the rhythmic durational patterns of the grid and evidence for the claim that a binary stress feature involving durational contrasts defines a single relatively low level in the prosodic hierarchy.

Unfortunately, this line of experimentation has so far yielded mixed results. These results, moreover, are usually made still more ambiguous by the possibility of phrase-boundary effects separate from any prominence features. Ladd, for example, has attempted to reinterpret older data from measurements of lighthouse keeper versus LIGHT housekeeper as evidence for the rhythmic patterns that the metrical grid is to represent (Ladd, 1980, pp. 41-48). But this reinterpretation is not convincing because the original interpretation of 'disjuncture' has gained solid support from more recent experiments showing the durational effects of constituent structure on segment lengths near boundaries (e.g., Lehiste et al., 1976). Similarly, when data from reiterant speech experiments are reexamined, they more often than not support the simpler binary feature hypothesis. Consider, for example, the durations of the syllables in the reiterant sequences in Nakatani and Schaffer (1978). The stressed syllables are longer than the unstressed syllables in corresponding positions in the mamama sequences, but this seems to be the only major effect of the stress structure. The duration of the first ma in the reiterant renditions of near future types is not substantially longer than the first ma in the bold design types, although this adjustment would be predicted from assigning them different metrical grids in accordance with the different grids assigned to the two renditions of John struck out... in (3.28):

(3.29)

    2    2
    1  1 1 1
    near future

    versus

    2      2
    1    1 1
    bold design

Again, in a later experiment using many more adjective-noun phrases of differing lengths (Nakatani et al., 1981), it was found that a syllable's duration was strongly influenced by its stress level and by its position relative to word and phrase boundaries, but there was little evidence of adjustments for rhythmic patterns. One of the figures in the report of this experiment (ibid., p. 101) shows some evidence that a few speakers lengthened the durations of stressed monosyllables when another stressed syllable followed (a result that is inconsistent with the mean durations seen in Nakatani and Schaffer, 1978). But otherwise there was no evident effect of a rhythmic stress structure. The durations of the metrical feet increased approximately linearly with foot size, and were not compressed at larger foot sizes to preserve a regular rhythmic succession of stressed beats. This last set of results is especially important because reiterant speech should give the ideal medium for detecting minute adjustments for rhythmic structure, without interference from segmental influences on duration. Experiments on the effects of foot size on duration using nonreiterant speech sometimes agree with this result and sometimes do not (see Lehiste, 1977, and Fowler, 1977, pp. 30-35, for reviews), but it is even harder to design such experiments using nonreiterant speech that are not contaminated by boundary effects. Indeed, the difficulty of experimentally verifying rhythmic interstress isochrony was anticipated by Liberman (1975, p. 279). Among the possible obstacles he cites is that the implementation of grid patterns in speech might be characterized by 'graceful neglect', whereby the speaker does not aim for perfectly isochronous beats, but only for such rhythmic structure as would ensure that the abstract pattern is perceived.

Another line of inquiry that relates to this last point is the set of experiments by Donovan and Darwin (Darwin and Donovan, 1979; Donovan and Darwin, 1979). These experiments have often been interpreted as evidence that stresses are perceived as occurring at regularly spaced intervals despite the lack of any good documentation of a physical regularity.
The experimental paradigm is to have subjects tap to the timing of the stressed syllables in sentences. In the original results, the subjects' averaged tap interval sizes were found to be relatively more isochronous than the foot durations in the models. This response pattern has often been cited as evidence for a kind of 'perceptual isochrony', suggesting there might be a more regular relationship between the posited metrical structure and foot size if psychological rather than physical duration were measured. However, later experiments, such as those by Bell and Fowler (1984) and Scott et al. (1985) showed that only some subjects responded to interstress intervals in this way, and these subjects responded similarly to other sorts of intervals, suggesting that 'perceptual isochrony' is an artifact of the tapping task.
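The two predictions at issue in these duration studies, foot durations that grow linearly with foot size versus durations compressed toward isochrony, can be stated numerically. The following Python sketch is illustrative only: the duration values are invented for the example and are not data from any of the experiments cited, and the function names `fit_line` and `coefficient_of_variation` are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def fit_line(sizes: Sequence[float], durations: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept of duration on foot size."""
    mx, my = mean(sizes), mean(durations)
    sxx = sum((x - mx) ** 2 for x in sizes)
    sxy = sum((x - mx) * (y - my) for x, y in zip(sizes, durations))
    slope = sxy / sxx
    return slope, my - slope * mx


def coefficient_of_variation(intervals: Sequence[float]) -> float:
    """Population standard deviation divided by the mean; 0 means perfect isochrony."""
    return pstdev(intervals) / mean(intervals)


# Invented values in milliseconds, NOT measurements from the cited studies.
foot_sizes = [1, 2, 3, 4]          # syllables per foot
additive = [300, 450, 600, 750]    # each added syllable adds a constant amount
compressed = [400, 420, 440, 460]  # feet squeezed toward equal length

for name, durations in [("additive", additive), ("compressed", compressed)]:
    slope, intercept = fit_line(foot_sizes, durations)
    cv = coefficient_of_variation(durations)
    print(f"{name}: slope={slope:.1f} ms/syllable, CV={cv:.3f}")
```

A steep slope together with a large coefficient of variation corresponds to the approximately linear growth reported for reiterant speech, whereas a rhythmic timing structure of the kind attributed to the grid would predict a shallow slope and a coefficient of variation near zero.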


Another sort of evidence against the hypothesis that the stress patterns of speech are a rhythmic structure characterized by 'graceful neglect' is that even when subjects are asked to be consciously rhythmic, they apparently do not produce durational patterns that support the phonetic reality of grid structure. In 1984, Mark Liberman and Gösta Bruce reported (p.c.) a series of informal preliminary tests using reiterant speech produced normally and 'as rhythmically as possible'. They found that the stressed syllables were much longer relative to the unstressed syllables in the 'rhythmic' variants, but the interfoot intervals were no more isochronous than in the 'normal' readings. Thus it seems that even the most generous interpretation of the current pool of experimental data on the durational effects of stress patterns does not provide evidence for the completely separate reality of stresses as rhythmic patterns, which is the usual phonetic interpretation of the metrical grid.

If there is so little experimental evidence for the phonetic reality of such a timing structure, however, how can we explain the perceived hierarchy of rhythms that the grid was meant to embody? A possible explanation is that the perceived rhythms are not strictly temporal periodicities, but rather are a more formal matter. Suppose that the units in the layers of beats are not phonetic units defined uniformly at each level by the intersection of the interval durations with those at the immediately lower level, but rather are phonological constituents defined by the prosodic features peculiar to that level. Such prosodic features need not always be characterized to the same extent by the salience of the accentual function, and at some levels in some languages there is hardly any component of prominence in the prosodic features defining the level.
In English, however, there seems to be at each layer in the hierarchy a tendency toward some sort of syntagmatic alternation that manifests the organization at that level. At a relatively low level within the word, the very weak syllables with reduced vowels alternate with the stronger syllables that define the organization into stress feet. At a higher level, prosodic words that are aligned with pitch accents alternate with toneless prosodic words to define the organization into metrical feet. The extra-metrical factors constraining the alternation differ at the different levels, but the underlying organizational logic of the alternation is the same. This explanation amounts to a proposal that the metrical grid be interpreted as a metaphor for the particular way that prominent features alternate with nonprominent features at various levels of the prosodic hierarchy in English, rather than as an extrinsic device that controls the timing of the alternations. The question of the validity of the grid then becomes not whether the proposed phonetic reality exists, but whether the grid is an adequate metaphor for the already attested phonetics.

The first test in answering this question might be to see how the metrical grid compares as a representational device to the revised metrical tree proposed in the last section. For example, both Prince (1983) and Selkirk (1984) have claimed that the metrical grid is better suited to representing the patterns that give rise to the application of the rhythm rule in English. I have suggested above that this advantage is an artifact of the stricture against more than two branches at any node in the tree. At the same time, I argued that the labeled layers in the revised metrical tree representation were more conducive to explicit hypotheses about the phonetic or phonological sources of the perceived stress shifts. To make the comparison fair, however, the layers of the grid must be assumed to correspond to the well-defined layers of the tree, and so we will give them the same labels. (Some earlier treatments already do assume some correspondence. Prince (1983), for example, gives labels to strata in the beat levels corresponding to the tree levels that Selkirk (1980) identified as the stress foot and the prosodic word.) The greater explicitness of the tree would of course be negated if the levels of the grid were made to be in one-to-one correspondence with those of the revised tree. In terms of representing the rhythms of metrical foot structure, therefore, the two representational devices are equivalent.

The grid has also been claimed to be necessary because of the problems that arose from aligning the intonation contour to the text by matching tree structures. These problems, however, are more an indictment of the original representation given to the intonation contour than they are a real justification for the grid. In representing the intonation contour as a metrical tree built on the tones, Liberman (1975) did distinguish to some extent between tones that are aligned directly to prominent syllables and those that are not.
The boundary tones are labeled B rather than s or w, and are understood to go at the edges of the utterance rather than at the head of some metrically prominent constituent within the utterance. The lack of any comparable mechanism for labeling the pitch accent and the unstarred tones of bitonal pitch accents was clearly a drawback that is corrected in Pierrehumbert's description. On the other hand, the tree representation does suggest a fact about intonation that in Pierrehumbert's description can only be stated less formally, namely that not all pitch accents are equal: the last pitch accent always has the greatest metrical strength. When the intonation contour is aligned to the text just by associating the pitch accents to the more prominent beats in the grid, the extra prominence of the nuclear accent is a curious fact that is unconnected to any other property of the intonation contour. A revised tree representation along the lines suggested in the previous section, however, would immediately suggest a hypothesis about the prominence of the nuclear pitch accent. Since the last pitch accent in an intonation contour in English is always followed immediately by the phrase accent, perhaps the following tonal configuration is an accentual feature that defines the next higher level of constituent in the prosodic hierarchy. In other words, perhaps the definition of the intonational phrase in English is not a prominence-neutral feature like the H boundary tone of the intermediate phrase level in French. Perhaps the phrase accent is actually what its name implies, an 'accent' that gives the preceding pitch accent a prominence above that of any earlier prenuclear pitch accents. A revised tree representation of the prosodic structure of English could incorporate this into the organizational structure by marking the nuclear accent to be an s constituent at the level of the metrical foot and defining this s feature to be something like [+following phrase accent]. The fact that there can be no further division into metrical feet past the nuclear accent would be an automatic consequence of the fact that the phrase accent spreads from the linked nuclear pitch accent to the final boundary tone. Of course, such a definition of the nuclear pitch accent as the strongest metrical foot in an intonational phrase would be possible also in a metrical grid representation if the grid is understood to be a metaphor for the accentual organization of the utterance rather than as a timing device. In terms of this aspect of the alignment between the intonation contour and the rest of the prosodic structure, therefore, the grid is at no disadvantage to the tree. (Nor is it at any advantage, counter to the earlier claims for its necessity.)
There is, however, one aspect of the alignment between the intonation contour and the text that is a problem for a grid representation but not for a tree representation. That aspect is the alignment of boundary tones. In a tree representation, the placement of the boundary tones with respect to the text is no problem, because the edges of the constituents where the boundary tones go are represented directly in the constituent tree structure. To obtain correct placement of boundary tones in a grid representation, by contrast, either there must be 'silent beats' between constituents, of the sort that Selkirk has suggested to explain final lengthening effects (1984), or the intonational phrase must be bracketed. Either of these modifications to the grid representation amounts to the reintroduction of something like the boundary segments of earlier nonhierarchical accounts. It may be difficult to argue against boundary segments at the level of the intonational phrase in English, but it is appropriate to remember some of the problems that argued against them at other levels. The syllable boundary as a segment, for example, prevented any good representation of ambisyllabicity. Moreover, problems such as the representation of ambisyllabicity do have their analogues at higher prosodic levels in other languages. For example, the implementation of the L tone of the accentual phrase boundary in standard Japanese is tractable only if the tone is considered to be ambiphrasal.

On the other hand, the grid has an advantage over the tree in that it does not require constituent edges to be defined at every level. In example (3.23c) above, for instance, I have tentatively grouped the prosodic words into metrical feet with the final element always as the head. But there is no good evidence that I know of that would favor this grouping over a bracketing that makes red part of the same metrical foot as John's three.

One final advantage of the revised tree representation over the grid is that it can represent organizational structure that is not at all defined by prominence features. For example, if the intonational phrase in English were not characterized by the greater prominence accorded to the nuclear pitch accent (by virtue of the following phrase accent), it would be well defined only by the presence of the boundary tone at the end; an intonational phrase would be a constituent without a head. Without a head, it would not be possible to represent it as a layer in the grid, unless in addition to introducing bracketing or silent beats, we made the further modification of allowing some layers in the grid to be empty. Since English has a particular dearth of headless prosodic constituents, this disadvantage of the grid representation is not fully appreciated unless the question of grid or tree is extended to other languages, as it is in the next section.
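The point about headless constituents can be illustrated with a short sketch. The Python below is hypothetical throughout: the `Node` class, the convention that a grid column's height is one plus the number of constituents the terminal heads, and the toy phrase of four syllables `s1` to `s4` are assumptions made for the illustration, not a formalization taken from the metrical literature.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A prosodic constituent (hypothetical encoding, for illustration only)."""
    label: str                                  # level name or terminal text
    children: List["Node"] = field(default_factory=list)
    head: Optional[int] = None                  # index of strong child; None = headless


def terminals(node: Node) -> List[Node]:
    """Left-to-right list of the terminal nodes under `node`."""
    if not node.children:
        return [node]
    return [t for child in node.children for t in terminals(child)]


def head_terminal(node: Node) -> Optional[Node]:
    """Follow head links to a terminal; None if a headless constituent intervenes."""
    while node.children:
        if node.head is None:
            return None
        node = node.children[node.head]
    return node


def grid_heights(root: Node) -> List[Tuple[str, int]]:
    """Column height per terminal: 1 plus the number of constituents it heads."""
    heights = {id(t): 1 for t in terminals(root)}

    def visit(node: Node) -> None:
        if node.children:
            strongest = head_terminal(node)
            if strongest is not None:
                heights[id(strongest)] += 1
            for child in node.children:
                visit(child)

    visit(root)
    return [(t.label, heights[id(t)]) for t in terminals(root)]


def phrase(head: Optional[int]) -> Node:
    """Two two-syllable words grouped into one phrase, headed or headless."""
    word1 = Node("word", [Node("s1"), Node("s2")], head=0)
    word2 = Node("word", [Node("s3"), Node("s4")], head=0)
    return Node("phrase", [word1, word2], head=head)


print(grid_heights(phrase(head=1)))     # phrase headed by the second word
print(grid_heights(phrase(head=None)))  # headless phrase: no mark at the top layer
```

In the headed case the third syllable receives a column of height 3; in the headless case no column exceeds height 2, so the phrase level leaves no trace in the derived grid even though the tree still records the constituent and its edges.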
3.10 Hierarchical structures in other languages

The three previous sections have outlined some of the issues involved in applying the constructs of metrical theory to the representation of the relationship between accent and intonation in English. It was suggested that a suitably revised metrical tree structure could provide a reasonable representation that incorporated the insights of earlier nonhierarchical treatments while avoiding their errors, especially the error of equating accentual prominence at all levels to a single phonetic dimension. Further indication of the merit of the representation was the demonstration that it seems to be capable of explaining the set of distributional facts that has been considered to be the special province of the metrical grid in earlier metrical treatments. One final test of the suitability of this approach to the relationship between accent and intonation in English is to see what applicability it might have to the prosodic systems of other languages. Can this revised metrical tree structure be used to represent similar prosodic functions in other languages?

A first test might be to apply the tree to the accentual system of Stockholm Swedish. This seems at first glance to be a simple exercise, because most of the descriptive accentual features provided by Bruce (1977) can be translated directly into the defining features of the various levels of the prosodic hierarchy, as illustrated by the tree below, which follows exactly the feature matrix for the same phrase given by Bruce (1977, p. 13):

(3.30) [Metrical tree for stoppa dansskorna 'STOP the dancing-shoes.', with levels labeled, from top to bottom: sentence accent (phrase tone); word accent (accent 1 or 2); stress (long syllable).]

One thing that is not immediately clear in this translation is whether accent 2 should be represented as a separate layer in the hierarchy intermediate between the word accent and the sentence accent levels. Having the pitch shape of the word accent define a separate organizational layer would seem to be in accord with the feature [±accent II] that Bruce sets up in parallel to [±word stress], and it would seem to make sense of the productive rule that gives accent 2 to the element with word accent in all compounds other than specific lexicalized items such as days of the week (which are marked as special also by lacking any stresses other than on the syllable with word accent). On the other hand, it would also mean that the sentence accent could not be placed distinctively on either word in a phrase containing both an accent 1 and an accent 2 word, as the tree structure in (3.31) shows:


(3.31) [Tree diagram for långa nummer 'long numbers', with levels labeled sentence accent, word accent, accent II, and stress.]

But a distinction in sentence accent placement is clearly possible, as the following example from Bruce's test sets shows:

(3.32)
                     LÅNGA nummer       långa NUMMER
                     'LONG numbers'     'long NUMBERS'
   stress            +  -  +  -         +  -  +  -
   word accent       +  -  +  -         +  -  +  -
   accent II         +  -  -  -         +  -  -  -
   sentence accent   +  -  -  -         -  -  +  -

The possibility of this contrast shows that accent 2 is not really an organizational feature like primary word stress in English. Instead it is more like accent in standard Japanese, which gives a kind of prominence to a word that carries it, but which cannot be said to create a separate layer of organization because of the possibility of having an accentual phrase without an accent. Accent 2 is thus just one of two possible types at the word accent level. The distribution of accent type in compounds must be accounted for by a separate level-specific rule stating that word accents of type 1 can label only nodes that are nonbranching at this level. In the facts reviewed thus far, there has been nothing to argue for the metrical tree over the grid as necessary to the adequate representation of Stockholm Swedish accentual features in a hierarchical structure. The distribution of accent type at the word accent level, for example, must be accounted for with a feature extra to the strictly organizational structure in either a tree or a grid representation. There is, however, an argument for the tree at the next higher organizational level. The sentence accent, like the nuclear accent in English, is given to the word that is the focal center of the sentence, and it affects the realization of word accent on all words to the right, eliminating the rise for the immediately following accent


and reducing considerably the pitch range after that. What is problematic about the sentence accent is that its tonal realization is always associated to a specific syllable in the text, but not always to the syllable with word accent. In a compound word, the sentence accent is realized on the stressed syllable in the second element rather than on the earlier stressed syllable with word accent in the first element of the compound, as in the following examples from Bruce (1977, pp. 12, 14):

(3.33) [Feature matrices for lamadjur 'llamas' and klarastegen 'the steps of Klara', with rows for stress, word accent, and sentence accent.]

Note that this feature analysis contradicts earlier accounts which interpreted the realization of sentence accent in compounds as putting a type 1 word accent on the second element. The validity of Bruce's analysis is evident, however, in his experiments showing that there are differences between compound words such as these and segmentally matched two-word phrases containing an accent 2 type word followed by an accent 1 type word bearing the sentence accent (Bruce, 1977, pp. 50-57). To reiterate, then, compound words in citation intonation place the word accent on a strong syllable in the first element, but the sentence accent on a strong syllable in the second element. This fact about the location of the sentence accent in compound words is extremely problematic in a grid representation. It could only be represented if the grid were allowed to have gaps in one level underneath a beat at a higher level. Such an asynchronous arrangement of beats could be interpreted only if we imagine the sentence accent and word accent layers to be in a parallel rather than a hierarchical arrangement to each other. Such an interpretation would be incompatible with the basic assumption that accentual features are hierarchical. It would mean that the hierarchical arrangement can no longer be used to explain the fact that a sentence can have several word accents but only one sentence accent. It would make it difficult to explain the fact that the sentence accent gives a focus to the entire word, not just to the morpheme containing the accented syllable. It would also be incompatible with the fact that the sentence accent does not give extra prominence to the syllable associated to the accent, as evident in the earlier more traditional accounts that assigned the same tertiary stress level to a long syllable not having a word accent regardless of whether it carried the sentence


accent. In a tree representation, by contrast, the distribution of sentence accent in compound words is less problematic. The hierarchical arrangement of constituents in the tree does not necessitate any claims about the details of the alignment of the tonal elements contributed by the different levels within the tree. Since the tree structure preserves the constituent boundaries as well as the constituent heads, the notion of constituency is not necessarily lost by having one strong syllable carry the word accent and another carry the sentence accent. The presence of the sentence accent within the unit marked out by the word accent still makes that unit the head of the larger unit at the next higher organizational level. The distribution of the sentence accent within compounds must be specified by a special extra-structural alignment rule that refers to two lower levels of the hierarchy rather than the more usual single layer. (It must pick out the word-final strong syllable at the level below the word accent rather than just picking out a particular word accent.) But this rule, while curious, is not fundamentally incompatible with the basic assumption of a hierarchical arrangement. Thus the revised tree seems far more suitable a device than the grid for translating the accentual features of Stockholm Swedish into a hierarchical metrical representation. A second test of the revised metrical tree is to see whether it can be used to provide a correct hierarchical description of the accentual and intonational structures of Tokyo Japanese. I suggested above that the metrical tree is eminently suitable for representing the organization of an utterance into accentual phrases, which are grouped into intermediate phrases, which are grouped in turn into larger utterance structures. On the other hand, in terms of representing the specifically accentual features of this organization, there seems to be nothing to recommend the tree over the grid.
For example, the accentual phrase is defined by the occurrence of an initial rise that results from the juxtaposition of a L boundary tone and an immediately following phrasal H tone. In a grid representation, the phrasal H could be represented as a beat at the lowest organizational level of the grid where every syllable is grouped into some accentual phrase. The intermediate phrase, similarly, is defined by a successive accentual subordination of each component accentual phrase to the immediately preceding accentual phrase. When the preceding accentual phrase is accented, this accentual subordination is realized as a catathesis of the tones in the subordinated phrases. In a grid representation, it could be represented by adding beat structure at the next higher levels above the accentual phrase. There is, however, at least one problem with a grid


representation of the phrase levels in Japanese, having to do with the correct treatment of the boundary tones. In English, the alignment of boundary tones is a problem only at the very highest levels of the organization because there are no such elements below the intonational phrase. In Japanese, by contrast, boundary tones are part of the organizational structure already at the level of the accentual phrase. The phrasal H of the accentual phrase is always preceded by a L boundary tone which defines the end of the preceding accentual phrase and the start of a new one. On the other hand, since the phrasal H is usually placed on the second sonorant mora of the accentual phrase, the boundary L would seem superficially to be far less of a problem than the boundary tones in English. The alignment of the L would seem to be completely predictable from the alignment of the phrasal H; the L could be placed by rule on the mora preceding the H. But two regularly occurring types of phrases show that this simple solution is not correct. In initial long syllables and in initial accented morae, the L is not aligned to the first mora of the phrase and is instead left to be a true boundary tone at the constituent edge. Moreover, in certain other cases where the boundary tone is aligned to the first mora, its intonational behavior can affect the alignment of the phrasal H. See, for example, the accentual phrase in Figure 3.1b, where the phrasal H has been delayed to the point of being swallowed up in the later accent H as a consequence of the boundary L being lengthened and lowered to emphasize the juncture and put focus on this phrase. The grid cannot account for such behavior without introducing some representation of constituent edges. In the previous section, it was suggested that the grid could overcome the problems of aligning boundary tones in English if bracketing or silent beats were introduced to separate adjacent intonational phrases.
Perhaps the same solution can be applied to the similar problem of aligning lower-level boundary tones in standard Japanese. If brackets or silent beats were introduced as textual boundary segments between the accentual phrases, then the L boundary tone could be associated to the first mora after the bracket or silent beat interval when the first syllable of the phrase is short and unaccented, and otherwise it could be aligned at the left edge of the accentual phrase right after the lowest-level bracket or last silent beat. There is one further fact about the L boundary tone, however, that cannot be conveniently represented when constituent edges are represented by boundary segments such as brackets or silent beats. The method of alignment outlined above assumes that the L boundary tone belongs exclusively to the accentual phrase following


Figure 3.1. Two versions of the utterance uma'i ame-wa arima'sen with contrastive emphasis on ame. The focus on ame has caused an intermediate phrase break to its left. In version (b), the disjuncture here is made even more prominent by lowering and lengthening the L boundary tone.

the boundary, but that assumption happens not to be correct. At medial phrase boundaries, the tonal value of the boundary L is determined in part by its association or lack of association to the initial syllable in the following phrase, but it is also determined by the occurrence or lack of occurrence of any catathesis-inducing accents in the previous phrase, as illustrated in Figure 3.2. The upper part of this figure plots the F0 value of the L boundary tone as a function of that of the phrasal H in the preceding phrase, and fits regression lines to the two sets of data points representing a contrast between accent and no accent in the preceding phrase. The lower part of the figure shows mean residuals from those regression lines grouped according to whether the L is a strong tone associated to the first mora in the following phrase or a weak unassociated tone. The wide separation of the regression lines and the differences between the mean residuals from the lines show that both of these factors influence the F0 value


[Figure 3.2 plots. Upper panel: F0 of the boundary L against F0 maximum at preceding phrase H (horizontal axis marked at 140, 160, 180, and 200). Lower panel: mean residuals for strong L and weak L, shown separately for the no-downstep and downstep groups.]

Figure 3.2. Upper figure shows boundary L as function of preceding phrase peak, with regression lines fitted to two groups of data. Points plotted with o's are for utterances with unaccented preceding phrase, x's for accented. Lower figure shows mean residuals for these two groups divided further according to whether the initial syllable in the following phrase has induced an associated (strong) L or an unlinked (weak) L.


of the boundary tone. Rather than belonging only to the following phrase, the boundary tone acquires some features of its tonal realization from both the following and the preceding phrase. (See Pierrehumbert and Beckman, in preparation, for further discussion of this issue.) This sort of ambiguous constituency is difficult or impossible to represent in a grid, whereas it is far less of a problem in a tree, which does not treat the edges of constituents by separating the constituents with some kind of boundary segment. Further tests of the suitability of the revised metrical tree will be possible when we have more complete descriptions of accentual and phrasal structures in other languages. Some indication of its applicability to standard Mandarin Chinese is that the lowest levels of a hierarchical representation are already known to be well defined by the existence of atonal syllables and tone sandhi. The atonal syllables are a precise phonological analogue of reduced syllables in English and define the comparable level in the hierarchy. They are also to some extent a phonetic analogue of the English reduced syllables, since in addition to not having an assigned tone, they share with the English syllables the same shorter durations and lenited vowel qualities. Tone sandhi then defines the next higher level in the prosodic hierarchy. This again might be a level of the hierarchy that is defined by an accentual feature, since a tonally modified syllable could be viewed as a weak sister of any syllable that retains its full original tone. The third-tone and second-tone sandhi rules would then be just more obvious grammaticalized subrules in a general rule that reduces the full tonal scaling of syllables that are strong at this level. The revised metrical tree might also be applicable to l'accent in French.
It is not clear how much hierarchical structure is needed below and above the phrase that is defined by the H boundary tone in French, but there may be smaller units that are well defined by rules of liaison and there must be a larger unit for the possible sentence-final intonations. If we exclude the use of l'accent d'insistance as a garden-variety organizational feature that is common in the speech of public speakers and younger speakers, the intermediate-level phrase is defined exclusively by a boundary tone and not by any accentual feature. In other words, this level is a well-defined organizational level that has no head. The grid cannot represent this sort of organization unless it allows some levels of the hierarchy to be empty of beats. (The organization at these levels would then be defined exclusively by the presence of silent beats or bracketing at lower levels.) In a tree representation, by contrast, the nonaccentual levels of organization would simply be those levels that cannot have s-labeled nodes.
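Returning briefly to the Japanese boundary tone, the analysis summarized in Figure 3.2 above (one regression line per preceding-phrase condition, then mean residuals within subgroups) can be sketched as follows. This is a minimal illustration under stated assumptions: the numbers are invented and are not the measurements behind the figure, the direction of the strong/weak difference is arbitrary, and the names `Token`, `fit_line`, and `mean_residuals` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (preceding-phrase H peak in Hz, boundary L in Hz,
#  preceding phrase accented?, L type: "strong" or "weak")
Token = Tuple[float, float, bool, str]


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_residuals(tokens: List[Token]) -> Dict[Tuple[bool, str], float]:
    """Fit one line per accent condition, then average residuals by L type."""
    by_accent: Dict[bool, List[Token]] = defaultdict(list)
    for token in tokens:
        by_accent[token[2]].append(token)

    residuals: Dict[Tuple[bool, str], List[float]] = defaultdict(list)
    for accented, group in by_accent.items():
        slope, intercept = fit_line([t[0] for t in group], [t[1] for t in group])
        for h_peak, l_value, _, l_type in group:
            residuals[(accented, l_type)].append(l_value - (slope * h_peak + intercept))
    return {key: sum(vals) / len(vals) for key, vals in residuals.items()}


# Invented data: two parallel lines 20 Hz apart, with a 2 Hz offset by L type.
tokens: List[Token] = []
for h in (160.0, 180.0, 200.0):
    tokens.append((h, 0.5 * h + 40 - 2, False, "strong"))
    tokens.append((h, 0.5 * h + 40 + 2, False, "weak"))
    tokens.append((h, 0.5 * h + 20 - 2, True, "strong"))
    tokens.append((h, 0.5 * h + 20 + 2, True, "weak"))

for key, value in sorted(mean_residuals(tokens).items()):
    print(key, round(value, 3))
```

Separating the two steps mirrors the logic of the figure: the gap between the fitted lines reflects the preceding phrase, while the residuals isolate what the following phrase contributes.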


Far more work on these and other languages is necessary before we can fairly judge whether the revised metrical tree is adequate as a general device for representing prosodic organization. It does seem capable of representing the organization of prosodic categories in certain stress and non-stress languages. And it does seem to provide a representation of the relationship between abstract stress patterns and intonation in English that is more in accord with the experimental data on stress perception than was either the traditional structuralist account or Bolinger's pitch accent theory. If the application of the metrical tree to English and other languages should prove to be as problem-free as the above discussions would make it seem, then we have another way to state the dichotomy between stress accent and non-stress accent defined in Chapter 1. Non-stress-accent languages are those in which tonal features enter into the definition of some levels of the organizational hierarchy that are marked in the lexicon. Or putting it another way, stress-accent languages are those in which the lower lexical levels of the hierarchy are well-defined only in terms of such non-intonational features as syllable weight. If the stress-accent hypothesis stated in Chapter 1 is true, then the greater role of phonetic material other than pitch patterns in cueing stress would be a reflection of the higher levels at which nontonal accentual features enter into the definition of the prosodic organization of an utterance. In later chapters, some experimental evidence for the stress-accent hypothesis will be presented. This evidence will come from a comparison of phonetic patterns associated with contrasting accentual patterns in English and Japanese utterances.
In this and the previous two chapters, I have reviewed some of the structural or functional linguistic considerations that must be taken into account before the contributions of different phonetic patterns to accent patterns can be compared in such different languages. The next two chapters will review certain other considerations of a nonlinguistic nature that must also be taken into account. They cover the perception of loudness and pitch and how these psychological measures relate to the physical measures frequency, intensity, and duration.

CHAPTER 4

Fundamental frequency and pitch

Among the various phonetic attributes that can be associated with accentual prominence, pitch patterns are in some ways the easiest and in other ways the most difficult to interpret. Pitch is easier to evaluate than loudness, for example, in the sense that the psychophysical mapping from frequency to pitch seems to be somewhat less distorted by other physical attributes than is the mapping from intensity to loudness. Thus it probably is relatively less erroneous to equate hertz or semitones with pitch than it is to equate decibels with loudness. On the other hand, pitch is more difficult to interpret than, for example, duration in the sense that a speech sound only rarely has a fundamental frequency value in the way that it has, for example, a duration. Fundamental frequency can and usually does change over time, making it difficult to evaluate the pitch of a sound in relation to its prominence. It is obvious that, other things being equal, a longer syllable nucleus should be more prominent than a shorter one, but which of two syllable nuclei should be more prominent if one has fundamental frequency rising from X Hz to Y Hz and the other has fundamental frequency falling over the same range? Since questions of this kind clearly affect the interpretation of fundamental frequency measurements for accent, this chapter will present a brief overview of the relevant experimental literature on the pitch of complex tones in general and of speech sounds in particular.
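Since the discussion moves between hertz and semitones, the standard conversion between them may be a useful reminder. The sketch below uses the usual equal-tempered definition, 12 log2(f / ref); the function name `semitones` and the example frequencies are arbitrary choices for illustration and are not values from the experiments reported in this book.

```python
import math


def semitones(f_hz: float, ref_hz: float) -> float:
    """Interval from ref_hz up to f_hz in semitones (negative if f_hz is lower)."""
    return 12.0 * math.log2(f_hz / ref_hz)


# A doubling of frequency is one octave, i.e. 12 semitones.
print(semitones(200.0, 100.0))

# The same 50 Hz step is a larger interval low in the range than high in it.
print(round(semitones(150.0, 100.0), 2), round(semitones(250.0, 200.0), 2))

# A rise and a fall over the same range are equal in size and opposite in sign.
print(semitones(200.0, 100.0) + semitones(100.0, 200.0))
```

On this scale a rise from X Hz to Y Hz and a fall from Y Hz to X Hz cover the same interval, so the scale by itself does not say which contour should count as more prominent.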

4.1 Signals with stationary fundamental frequency and pitch

The pitch of a pure tone is a nonlinear, monotonically increasing function of its frequency; the higher the frequency, the higher the pitch, with identical increments of frequency producing smaller and smaller increments of pitch at higher and higher frequency ranges. (See Stevens and Volkmann, 1940, for a pitch scale approximating this function.) The pitch of a complex tone is complicated by the presence of components at more than one frequency. When the tone is anharmonic (i.e., when its components are not all integral multiples of


some fundamental frequency within the auditory range), it may or may not yield a single unambiguous pitch sensation. If the anharmonic tone has a pitch, then it can be assigned a pitch value, defined as the frequency of a pure tone with matching pitch. Whether it has a pitch and what its pitch value might be depends on several factors such as the number of components (Patterson, 1973), their spacing relative to each other and to the auditory frequency range (Patterson and Wightman, 1976), and how far they are from being harmonic. Even if a complex tone is harmonic, however, it may not have a pitch if its lowest components are of too high a frequency relative to their spacing (Ritsma, 1962; 1963). Pitchless harmonic tones and the attested pitches of anharmonic tones have greatly interested psychoacousticians, because they are the critical case against which the explanatory adequacy of any model of pitch perception is judged. They are not, however, directly relevant to speech, because they are very unlike the vowels and other voiced sounds that carry pitch in speech. A steady-state vowel (or a musical note) is a classic example of a harmonic tone with all or most of its lower harmonics present. Such a richly harmonic tone, when presented to an auditor, will always yield an unambiguous pitch, with a value corresponding fairly closely to its fundamental frequency.¹ In fact, it is this usual correspondence between fundamental frequency and pitch for music and speech that is the context within which the pitches of anharmonic tones, etc., are anomalies to be explained. Speech sounds are considered such good examples of the usual correspondence between fundamental frequency and pitch that one recent theory of pitch perception even proposes to explain anomalous pitches in terms of a learning grid that responds to the large amounts of speech in the auditory environment (Terhardt, 1974; Terhardt et al., 1982a). 
According to this theory, the correspondences among the 'spectral pitches' of the fundamental and other low components of speech sounds are heard so often that they establish learned patterns of 'virtual pitches' from which the pitches of anharmonic tones, of harmonic tones with 'missing' fundamentals, etc., can be derived.

1. This correspondence between pitch and F0 should not be interpreted as meaning that the component at the fundamental frequency is responsible for the pitch, but only that the pitch values of different harmonic tones with equal fundamental frequencies will be equal. In the F0 range relevant for speech, it is the third through fifth harmonics which are actually dominant in the perception of pitch (Ritsma, 1976).
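The compressive relation between pure-tone frequency and pitch described at the start of this section can be illustrated numerically. The sketch below uses a widely cited analytic approximation to the mel scale, 2595 log10(1 + f / 700). That formula is a later curve fit and is an assumption here; it is not the scale as published by Stevens and Volkmann, and the function name `hz_to_mel` is hypothetical.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Approximate pitch in mels for a pure tone (assumed analytic curve fit)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


# The same 100 Hz increment yields a smaller pitch increment higher in the range.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(1100.0) - hz_to_mel(1000.0)
print(round(low_step, 1), round(high_step, 1))

# By construction of this approximation, 1000 Hz maps to roughly 1000 mels.
print(round(hz_to_mel(1000.0)))
```

Within the F0 range of speech the curve is close to linear, which is one reason hertz values are often treated as a workable stand-in for pitch in that range.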


Indeed, the correspondence between fundamental frequency and pitch for speech sounds is in some ways better than that between frequency and pitch for pure tones. The pitch of a pure tone can be drastically affected by its intensity (Stevens, 1935; Snow, 1936) and by its duration (Doughty and Garner, 1948). For complex tones, however, the effect of intensity has been shown to be very much smaller than the effect for pure tones (Terhardt, 1975), a finding confirmed for the special case of synthesized vowels by Chuang and Wang (1978). The effects of duration on pitch, moreover, are limited to stimuli that are much shorter than the durations of most vowels, suggesting that these two complicating factors can both be ignored in assessing the pitch of a speech sound with stationary fundamental frequency. On the other hand, while the effect of duration on the perceived pitch of a complex tone can probably be ignored, its effect on pitch discrimination is more critical. Liang and Chistovich (1960), for example, found that duration has a drastic effect on difference limens for pure tones even at durations that are typical of short and medium length vowels. For both subjects in their experiment, the mean difference limens were larger for shorter stimuli up to about 150 or 200 ms. Henning (1970) found a similar effect for his two subjects. While these results have not been tested for complex tones, they suggest that shorter vowels might require larger fundamental frequency differences to be perceived as different in pitch. If this proves true, then the small difference limens that Flanagan and Saslow (1958) found for their 500 ms long synthetic vowel stimuli cannot be applied in analyzing the typically short vowels of, for example, Japanese. This probable effect of duration on frequency discrimination should be borne in mind whenever the significance of small differences in fundamental frequency is being evaluated. 
One final factor that may influence the pitch of a steady harmonic tone is its spectral shape (i.e., the relative amplitudes of the various components). Results of several experiments have shown that spectral shape does affect pitch, but these experiments do not agree as to the nature of the effect. Lichte and Gray (1955) found, for example, that energy at higher harmonics tends to shift pitch upward. They asked subjects to adjust the frequency of a pure tone until its pitch matched that of a harmonic tone used as the test stimulus. The mean frequency for the adjusted pure tone was 255.8 Hz when matching the pitch of a complex tone containing the five lowest harmonics of a 250 Hz fundamental, and it was even higher (263.43 Hz) when matching a


tone containing harmonics one through twenty-seven. A later experiment by Terhardt (1971), on the other hand, shows opposite results. In Terhardt's experiment, the complex tones were matched by pure tones with frequencies lower than the test tone's fundamental, indicating that the presence of energy at the higher harmonics lowered pitch rather than raising it.² The finding that spectral shape affects the pitch of complex tones is potentially very important for evaluating the F0 contours of speech, since it suggests that different speech sounds could have different 'intrinsic' pitches due to their differing characteristic spectral shapes. In fact, two studies have shown that different vowels have different intrinsic pitches (see Section 4.3.2 below). However, it is difficult to relate these intrinsic pitches to the contradictory results of Lichte and Gray's and Terhardt's experiments. More studies are necessary on these general psychophysical effects before the intrinsic pitch of vowels can be explained as due to them or judged to be unrelated to them.

4.2 Signals with changing fundamental frequency and pitch

The previous section reviewed the literature on pitches of harmonic tones (including synthesized vowels) when the stimuli to be matched or differentiated have unchanging fundamental frequency. The fundamental frequencies of speech sounds, however, are rarely stationary over even relatively short periods of time. It is important, therefore, to know how changing frequency is perceived. Can a richly harmonic signal with changing F0 have a single unambiguous pitch, and if so, what is that pitch? Does changing F0 affect the frequency discrimination function, and if so, what is that effect? Finally, since the frequency movement itself might have some linguistic significance, it is important to know how frequency change is related to pitch change.
How much of a frequency excursion is necessary before a signal will be perceived as changing in pitch, and what other factors besides the amount of frequency change affect the perception of changing pitch?

2. The contradictory results of the two experiments may have been due in part to the differing intensities of the pure tones used as comparison stimuli, since Lichte and Gray had subjects first adjust the intensity of the pure tone until it was equal in loudness to the test stimulus, whereas Terhardt kept the intensity of the comparison stimuli at fixed levels 10 or 20 dB higher than the overall intensity of the test stimuli.


4.2.1 The pitch of changing signals

One study bearing on the first of these questions is the series of experiments reported in Nabelek et al. (1970). In these experiments, subjects adjusted the frequency of a stationary comparison tone to match the frequency of test tones with various different linear frequency changes. Two types of signals were tested: those in which the frequency transition occurred over the entire duration of the test stimulus, and those in which a shorter transition covering only 25% of the stimulus duration was preceded and/or followed by a steady-state portion (as shown in Figure 4.1). The two types were tested for various stimulus durations and various sizes of frequency excursion. The results of the tests showed that if the transition duration was short enough or the frequency excursion small enough, the pitch matches clustered tightly around a single frequency, indicating that the subjects perceived a single unambiguous pitch. For those stimuli in which the transition did not cover the entire duration of the signal, the value of this cluster of pitch matches depended on where in the signal the transition occurred. The center of the cluster moved through the frequency values covered by the transition in accordance with the relative durations of the preceding and following steady-state portions, centering around the initial frequency value for the two shapes (g), and around the final frequency value for shapes (a). As the frequency excursion increased, variability showed up as a tendency toward two pitch matches, one at the initial frequency and the other at the final frequency of the transition. For test stimuli with no steady-state portion, there was a simpler rule. The pitch corresponded to the frequency value at a point just past the center of the transition. As the stimulus duration or frequency excursion increased in these stimuli, there was no tendency for two pitch matches.
However, the clustering of pitch match values became less tight and the center of the cluster shifted more toward the end value of the transition. Although the Nabelek et al. experiment used simple sinusoidal signals rather than complex tones, the results seem to be applicable to speech. They agree with the results of two more recent experiments that test the pitch of vowels with linearly changing F0 (Rossi, 1971a; 1978a). In these experiments, Rossi presented subjects with pairs of natural vowels, in which the first had a steady F0 and the second had rising or falling F0. (He controlled for such factors as timbre and duration by using only [a]'s and making the test and comparison [a]'s equal in duration.) The subjects' task was to identify the pitch of the vowel with changing F0 as lower than, equal to, or higher than that of

Figure 4.1. Shapes of test stimuli used in Nabelek et al. (1970), shown in two groups: stimuli with a shorter transition, and stimuli with a transition covering the entire stimulus. (Adapted from their Figures 3, 4, 8, and 9.)

the vowel with stationary F0. From these judgments, Rossi extrapolated the value of the fundamental frequency at which the stationary and changing vowels would be equal in pitch. This value corresponded to the frequency at a point roughly two-thirds of the way through the F0 excursion. This result is in close agreement with that of the Nabelek et al. experiment for pure-tone stimuli with comparable durations and frequency excursions. It will be applied in determining the appropriate point for measuring the F0 of each syllable in the test word tokens in the English-Japanese production experiment (see Section 6.1.3.1 below).
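The two-thirds finding can be turned into a simple measurement rule, sketched below. The sketch assumes a strictly linear glide in hertz, so that two-thirds of the way through in time and two-thirds of the way through in frequency coincide. The function name `pitch_match_estimate` and the example values are hypothetical, and the default fraction is only the approximate figure quoted above, not an exact constant.

```python
def pitch_match_estimate(f_start_hz: float, f_end_hz: float,
                         fraction: float = 2.0 / 3.0) -> float:
    """F0 at a given fraction of the way through a linear glide.

    Assumes a linear change in Hz from f_start_hz to f_end_hz; the default
    fraction of two-thirds is the rough value reported for vowel stimuli.
    """
    return f_start_hz + fraction * (f_end_hz - f_start_hz)


# A rise and a fall over the same range give different single-pitch estimates,
# because the two-thirds point lies nearer the end value in each case.
print(round(pitch_match_estimate(100.0, 130.0), 1))  # rising glide
print(round(pitch_match_estimate(130.0, 100.0), 1))  # falling glide
```

For measured contours that are not linear, the same idea would be applied by reading off the F0 sample located two-thirds of the way through the excursion, which is an extension of the rule rather than something the experiments above tested directly.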


4.2.2 Frequency discrimination of changing signals

Since signals with changing fundamental frequency can produce unambiguous pitch sensations, the next question to ask is how well these pitches can be differentiated. Klatt (1973) did an experiment bearing on this question. He presented his three subjects with pairs of synthesized [ε] or [ya], and asked them to judge which of the stimuli was higher (AX method). In some pairs, the stimuli to be compared had stationary fundamental frequencies and, in others, they had steadily falling fundamental frequencies. The results of the experiment showed that the two types of stimuli had different discrimination functions. For the stimuli with level F0, the average difference limens (DL) for the three subjects were 0.3 Hz for the [ε] pairs and 0.5 Hz for the [ya] pairs, comparable to the small DL's that Flanagan and Saslow (1958) found for their synthetic vowel stimuli. For the stimulus pairs with changing fundamental frequencies, however, the mean DL's were much larger (2 Hz for the [ε] pairs and 2.5 Hz for the [ya] pairs), indicating that frequency discrimination is much coarser for changing than for stationary fundamental frequency. These results imply that very fine pitch resolution is in some way a function of time, i.e., that it is possible only when the frequency is relatively invariant over some length of time. They are thus in agreement with Liang and Chistovich's results showing that fine pitch discrimination is dependent on the duration of the signal (see Section 4.1).

4.2.3 Effect of rate of F0 movement

Further evidence that relatively stable frequency is necessary for very fine resolution of differences can be seen in the various studies of DL's for rate of change. In experiments using pure-tone glides, Pollack (1968) and Nabelek and Hirsh (1969) found that just noticeable differences (JND) were larger for steeper slopes. The more rapid the change, the coarser the resolution of differences. Of these two experiments, Pollack's is the less interesting for pitch contours in speech, because his stimuli were all centered around 707 Hz (a frequency that is considerably higher than the range of the fundamental for most speakers). Nabelek and Hirsh, on the other hand, deliberately chose one subset of their stimuli from a frequency region that would be relevant to the perception of F0 contours in speech. The stimuli in this group all either began at 250 Hz and fell to 222, 167, or 125 Hz, or began at one of these three lower frequencies and rose to 250 Hz, so that they had frequency drops or rises of 28, 83, or 125 Hz. The slopes of these drops and rises were varied by


changing the durations of the transition portions over which the frequency rose or fell. For each of the three values of rise or fall, there were reference stimuli with transition durations of 300, 100, 30 and 10 ms. These reference stimuli were paired with test stimuli having the same or a longer duration (i.e., the same or a less steep frequency slope), and were presented to subjects for same/different judgments (AX method). Figure 4.2 shows the JND's for each of these sets as a function of the slope of the reference stimulus. Whether one compares the four reference durations for a particular frequency change or the three frequency changes for a particular reference duration, the JND is larger for a reference stimulus with a steeper slope. If applicable to the complex signals of speech, these JND's would suggest, for example, that it would be easier to distinguish two gently-sloping falls than to distinguish two steep falls.

These results seem to be confirmed for complex tones by Klatt's (1973) experiments using synthetic vowel stimuli. Klatt had his three subjects match a test stimulus with one of two preceding comparison stimuli (ABX method). All three stimuli in each group had the same average fundamental frequency, but one of the comparison stimuli had different beginning and ending frequencies and, therefore, a different F0 slope from the other comparison stimulus and the test stimulus. In some of the groups, the F0 inflections were steep falls, whereas in others, they were relatively flat and either rises or falls. Klatt found that difference limens for the steeply falling stimuli were much larger than for the stimuli with frequency changing only slightly (32 Hz/sec versus 12 Hz/sec), a result which seems in keeping with Nabelek and Hirsh's findings for pure-tone glides. However, the results of the pure tone experiment and those of the synthetic vowel experiment may not really be comparable. While both ostensibly test the perception of rate of frequency change, either or both could have been measuring something completely different. As Pollack (1968, p. 538) has pointed out, it is impossible to design a perfectly controlled experiment for rate of change. If the frequency range covered by the stimulus pairs is kept constant (as in Nabelek and Hirsh's experiment) then varying the slope necessarily changes the duration of the frequency transition, so that the subjects could actually be responding to the differences in length. Conversely, if the transition duration is kept constant (as in Klatt's experiment) then varying the slope necessarily changes the endpoint frequencies of the transitions, so that the subjects could actually be responding to the differences in beginning or ending pitch.
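The confound that Pollack describes follows directly from the arithmetic of slope = extent / duration. The sketch below works this out for the twelve reference stimuli described above (three frequency extents by four transition durations, as given in the text); the function names are hypothetical, and the halved-slope comparison is an illustrative case rather than a stimulus from either experiment.

```python
from itertools import product

FREQ_CHANGES_HZ = (28, 83, 125)     # extents of rise or fall given in the text
DURATIONS_MS = (300, 100, 30, 10)   # reference transition durations given in the text


def slope_hz_per_ms(extent_hz, duration_ms):
    """Rate of frequency change of a linear glide."""
    return extent_hz / duration_ms


def reference_slopes():
    """Slope of each of the twelve extent-by-duration reference stimuli."""
    return {(e, d): slope_hz_per_ms(e, d)
            for e, d in product(FREQ_CHANGES_HZ, DURATIONS_MS)}


def duration_for_slope(extent_hz, slope):
    """Extent held constant: a new slope forces a new duration."""
    return extent_hz / slope


def extent_for_slope(duration_ms, slope):
    """Duration held constant: a new slope forces a new extent (new endpoints)."""
    return duration_ms * slope


for (extent, duration), slope in sorted(reference_slopes().items()):
    print(f"{extent:>4} Hz over {duration:>4} ms -> {slope:.3f} Hz/ms")

# Illustrative case: halve the slope of the 83 Hz / 100 ms glide.
half = slope_hz_per_ms(83, 100) / 2
print(duration_for_slope(83, half))   # duration changes when extent is fixed
print(extent_for_slope(100, half))    # extent changes when duration is fixed
```

Either way of changing the slope alters a second property of the stimulus, which is the sense in which neither design isolates rate of change.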

Figure 4.2. JND's for rate of change. Data for pure tone stimuli from Nabelek and Hirsh (1969) (points connected by dashed lines) and for synthesized vowels from Klatt (1973) (points connected by solid line). The three symbol types and four dashed line types for Nabelek and Hirsh's data indicate, respectively, the three frequency differences and four transition durations of the twelve comparison stimuli used. [Horizontal axis: rate of change of comparison (Hz/ms).]

It is difficult to evaluate which of the two methods is more applicable to speech analysis, because it is not clear how rate of change corresponds to linguistically meaningful categories such as tone or accentual prominence. For example, suppose that differences in the extent of pitch change are linguistically relevant, but that, within certain limits, the amount of time used in effecting a pitch change is irrelevant. Then, the results obtained by Nabelek and Hirsh's method are applicable to speech (and comparable to those obtained by Klatt) only if it is shown that the perception of differences is accomplished in part by judging the rate of frequency


change. One experiment that suggests that this would be the case is Black (1970). The stimuli for Black's experiment were natural vowels excised from two sets of recorded utterances: one a speech recited by an actor and the other a corpus of sentences read in various emotional contexts by the subjects for another experiment. Four stimuli were chosen from each set of recordings for each of eight possible intersections of three pairs of dichotomous categories:

1. fast (steep) changes versus slow (flat) changes,
2. small changes versus large changes, and
3. rises versus falls.

The twenty subjects participating in the experiment performed four tasks that were designed to reveal the relative magnitudes of the pitch inflections associated with the stimuli's F0 inflections. These tasks were:

1. to assign a number with value proportional to the extent of the pitch rise or fall of each stimulus,
2. to draw a line with length proportional to the pitch inflection of a stimulus,
3. to score the extent of pitch change of a stimulus on a ten-point scale, and
4. to identify which stimulus in a pair had the larger rise or fall.

Results of the four different tasks were remarkably consistent. Stimuli with slow rates of change were assigned higher values or longer lines in the first three tasks than were stimuli covering the same F0 range at a faster rate of change. They also were identified as covering the larger pitch range in the fourth task. In other words, rate of change apparently does affect the perception of amount of frequency change. Moreover, since the division of the stimuli between 'slow' and 'fast' rates was determined by the length of the stimuli, this result suggests that Nabelek and Hirsh's method of testing JND's for rate of change is applicable to the evaluation of F0 changes in speech.

4.2.4 Effect of direction of F0 movement

A second important result of Black's experiment is that the direction of change also affected the perception of the amount of change. Rising inflections were generally judged as covering a greater pitch range than falling inflections covering the same fundamental


frequency range. A similar effect is seen in 't Hart's (1974) study of just-noticeable differences in extent of fundamental frequency change. The stimuli in this set of experiments consisted of natural tokens of various four-syllable Dutch words in which the F0 had been altered so as to produce a linear rise or fall over the accented third syllable. Subjects were presented with pairs of these stimuli and asked to judge which member of the pair had a larger pitch excursion. The median JND for the rising pairs was very much smaller than that for the falling pairs (1.5 semitones versus 3.0 semitones). Thus rising F0 seems to be more salient than falling F0 both in the extent of pitch excursion associated with a given F0 excursion and in the discriminability of different excursion sizes.

It would seem from these results that rises should also be more detectable than falls. However, there is another experiment that indicates that they are not. In a series of tests using both natural and synthesized vowels under various experimental conditions, Rossi (1971a; 1978a) determined how much of an F0 excursion was necessary for a vowel to be perceived as changing rather than static in pitch. The obtained 'glissando threshold' was nearly the same (about 2.5 semitones) for all experimental conditions, including rising versus falling fundamental frequency. This discrepancy between Rossi's results and those of 't Hart and of Black could be interpreted as evidence that the perception of a vowel's F0 as moving or as static is somehow independent of the perception of the extent of such a movement. Other differences may have been responsible for the differing results, however. For example, there may have been systematic differences in the amplitude patterns among the stimuli in the various experiments (see discussion of the effects of amplitude in the next section). Alternatively, the effects of direction of change seen in Black's and 't Hart's studies may have been mediated by the accentual patterns of the native languages of the subjects (as suggested by 't Hart, 1974, p. 61), since Black's and 't Hart's subjects were English and Dutch speakers, respectively, whereas Rossi used French speakers. Further experiments are needed to resolve the discrepancy.

4.2.5 Effect of concomitant amplitude change

One final factor that has been shown to affect the perception of changing pitch is changing amplitude. Two studies that are particularly relevant to speech are Cohen (1982) and Rossi (1978b). Of these, Cohen's experiment is somewhat more interesting, because it tests a specific prediction about the effects of amplitude change that follows from his model of the neural coding of pitch.


Cohen's model of pitch coding is based on recent studies (e.g., Sachs and Young, 1980) showing that a simple place model of neural coding cannot account for the representation of spectral detail in complex signals such as speech sounds. These studies show that most auditory nerve fibers fire at rates at or near saturation level when responding to speech signals at normal loudness levels. Thus variations in amplitude at different frequencies could not possibly be encoded in the neural signal simply in terms of the average firing rate of fibers at different regions along the basilar membrane. On the other hand, many spectral details can be recovered from the temporal relationships among successive pulses in the neural signal, which reflect the timing of amplitude peaks in the original waveform. This account is compatible with any of the current theories of pitch, as shown in Figure 4.3. Figure 4.3 is a block diagram of the essential elements of a model of the auditory system. In this model, the pitch of complex tones is not directly coded in the periphery, but rather is computed in the central nervous system (CNS) from the spectral details of the acoustic signal as represented in the neural signal from the auditory periphery. In order to accurately predict the attested pitches of anharmonic tones, this computation requires that the auditory signal to the CNS include not just the components of the original acoustic signal, but also the correct combination tones. In most current theories of pitch, the introduction of these combination tones is an unexplained extratheoretical process, as shown in the dotted box (a) in the figure. Cohen's contribution to pitch theory is to incorporate into the neural coding module the two processes shown in dotted box (b) in the figure. These two processes are amplitude-dependent effects that were discovered after the development of the current standard model of neural coding. Rate adaptation is the name for variations in the amplitudes of pulses in the neural signal, and latency variation is the name for adjustments in their timing relative to the peaks in the input waveform that they represent. As a result of rate adaptation, the amplitude of each successive peak is inversely correlated to the amplitudes of previous pulses. As a result of latency variation, the timing delay between the acoustic peak and the neural pulse is inversely correlated to the amplitude of that acoustic peak. Cohen has shown that the combination tones necessary for the correct computation of the pitches of anharmonic tones will be automatically represented in the neural signal if rate adaptation and latency variation are incorporated into the neural coding module. Cohen's model is thus especially interesting for general theories of pitch because it predicts these necessary combination tones from attested patterns in the neural encoding process instead of merely adding them

Figure 4.3. Block diagram of the essential elements of models of the auditory system. Figure adapted and expanded from Figures 4.1 and 4.2 in Cohen (1982). [Diagram stages, in order: acoustic signal; outer and middle ear (sound pressure transducer); basilar membrane (narrow band-pass filter); auditory nervous system (standard model of neural coding), with dotted box (a), combination-tone generator, and dotted box (b), rate adaptation and latency variation; neural signal; central nervous system (processing of neural code), which includes pitch computation, e.g., harmonic sieve (Goldstein), learned patterns (Terhardt), or autocorrelation (Wightman).]

ad hoc. A second consequence of Cohen's model that is even more interesting for pitch in speech, however, is the prediction that it makes about the effect of amplitude change on pitch. Figure 4.4 illustrates this prediction for the effect of increasing amplitude.
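The latency relation stated above (a shorter delay for a larger acoustic peak) can be made concrete with a toy calculation. This is not Cohen's model: the linear latency function, its two constants, the 100 Hz pulse train and the amplitude values are all hypothetical choices made only to show what an inverse amplitude-latency relation implies for pulse timing.

```python
def neural_pulse_times(peak_times_ms, peak_amplitudes,
                       base_delay_ms=1.0, delay_drop_per_unit_ms=0.2):
    """Toy latency variation: the delay shrinks linearly as peak amplitude grows.

    Both constants are made up for illustration; only the inverse direction
    of the relation is taken from the text.
    """
    return [t + base_delay_ms - delay_drop_per_unit_ms * a
            for t, a in zip(peak_times_ms, peak_amplitudes)]


def mean_rate_hz(pulse_times_ms):
    """Average pulse rate implied by the intervals between successive pulses."""
    intervals = [b - a for a, b in zip(pulse_times_ms, pulse_times_ms[1:])]
    return 1000.0 / (sum(intervals) / len(intervals))


period_ms = 10.0                                  # hypothetical 100 Hz signal
peaks = [i * period_ms for i in range(6)]
steady_amp = [1.0] * 6                            # constant amplitude
rising_amp = [1.0 + 0.2 * i for i in range(6)]    # steadily increasing amplitude

steady_rate = mean_rate_hz(neural_pulse_times(peaks, steady_amp))
rising_rate = mean_rate_hz(neural_pulse_times(peaks, rising_amp))
print(round(steady_rate, 2), round(rising_rate, 2))
```

Under these assumptions a constant amplitude delays every pulse equally and leaves the intervals unchanged, whereas a steadily increasing amplitude shortens each successive delay, so the intervals between pulses come out slightly shorter than the acoustic period.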
