English Pages 162 [164] Year 1983
Modelling British English Intonation
Netherlands Phonetic Archives
The Netherlands Phonetic Archives (NPA) are a modestly priced series of monographs or edited volumes of papers, reporting recent advances in the field of phonetics and experimental phonology. The archives address an audience of phoneticians, phonologists and psycholinguists.
Editors
Marcel P.R. van den Broecke University of Utrecht
Vincent J. van Heuven University of Leyden
Jan Roelof de Pijper
Modelling British English Intonation
1983 FORIS PUBLICATIONS Dordrecht - Holland/Cinnaminson - U.S.A.
Published by: Foris Publications Holland P.O. Box 509 3300 AM Dordrecht, The Netherlands Sole distributor for the U.S.A. and Canada: Foris Publications U.S.A. P.O. Box C-50 Cinnaminson N.J. 08077 U.S.A.
CIP Pijper, Jan Roelof de Modelling British English Intonation: an analysis by resynthesis of British English Intonation / Jan Roelof de Pijper. - Dordrecht [etc.]: Foris Publications. - (Netherlands Phonetic Archives; III) - With references. ISBN 90-6765-004-8 hdb. ISBN 90-6765-003-X ppb. SISO enge 837.2 UDC 802.0-4 Subject Heading: English; Linguistics; Phonetics; Intonation.
Front cover illustration taken from: F.M. Helmont [An unabbreviated representation of a true natural Hebrew alphabet, which simultaneously shows how those born deaf can be taught not only to understand others who speak but even to produce speech themselves], Pieter Rotterdam, Amsterdam 1697.
ISBN 90 6765 004 8 (Bound) ISBN 90 6765 003 X (Paper) © 1983 Foris Publications - Dordrecht. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner. Printed in the Netherlands by ICG Printing, Dordrecht.
Table of contents

ACKNOWLEDGEMENTS
CURRICULUM VITAE

CHAPTER 1. BACKGROUND, AIMS AND RESEARCH TECHNIQUES
1.1. Introductory
1.2. Dutch intonation
1.3. Features of the IPO-approach
1.3.1. Discrete pitch movements
1.3.2. The acoustic signal as starting point
1.3.3. Analysis-through-resynthesis
1.3.4. Verifiability
1.4. Intonation research and speech synthesis
1.4.1. Introductory
1.4.2. Mattingly
1.4.3. Witten
1.5. Aims and motivation of this study
1.6. Fields of application
1.7. Research techniques
1.7.1. Introductory
1.7.2. Measuring fundamental frequency
1.7.3. Analyzing and resynthesizing speech
1.7.4. The use of the logarithmic scale in F0 recordings
1.8. Some definitions
1.9. Brief outline of the study

CHAPTER 2. CLOSE-COPY STYLIZATIONS
2.1. Finding the perceptually relevant elements in F0-curves
2.1.1. Interpreting acoustic visual recordings of F0-curves
2.1.2. Stylization and perceptual evaluation
2.2. Close-copy stylizations
2.2.1. Introductory
2.2.2. Usefulness of close-copy stylizations
2.2.3. Criteria for close-copy stylizations
2.3. Testing the perceptual equality of close-copy stylizations
2.3.1. Introductory
2.3.2. Set-up of the experiment
2.3.2.1. Test material
2.3.2.2. Presentation of the test
2.3.3. Statistical analysis of the results
2.3.3.1. Differences between groups of subjects
2.3.3.2. Differences between tests 1 and 2
2.3.3.3. Overall test results
2.3.3.4. Category A-items
2.3.4. Discussion and conclusions
2.4. Testing the perceptual acceptability of close-copy stylizations
2.4.1. Introductory
2.4.2. A method of measuring acceptability of pitch contours
2.4.2.1. Introductory
2.4.2.2. The method of successive intervals
2.4.3. Set-up of the experiment
2.4.3.1. Test material
2.4.3.2. Presentation of the test
2.4.4. Statistical analysis of the results
2.4.5. Discussion and conclusions
2.5. General conclusions and observations
2.5.1. Stylizing F0-curves
2.5.2. Close-copy stylizations
2.5.3. Performance of the subjects
2.6. Summary

CHAPTER 3. A MELODIC MODEL OF BRITISH ENGLISH INTONATION
3.1. Introductory
3.2. Description of the melodic model
3.2.1. General description
3.2.2. Pitch movements
3.2.2.1. Melodic characterization
3.2.2.2. Position with respect to the syllable
3.2.2.3. Survey and coding of pitch movement types
3.2.3. Declination
3.2.4. Final frequency of the pitch contour
3.3. Computer implementation of the melodic model
3.3.1. Introductory
3.3.2. How the program works
3.3.3. An example
3.4. Testing the perceptual adequacy of the melodic model
3.4.1. Introductory
3.4.2. Set-up of the experiment
3.4.2.1. Speech material
3.4.2.2. PES-stylizations
3.4.2.3. FS-stylizations
3.4.2.4. DUTCH-stylizations
3.4.2.5. WITTEN-stylizations
3.4.2.6. Composition of test tape
3.4.2.7. Presentation of the test
3.4.3. Statistical analysis of the results
3.4.4. Discussion and conclusions
3.5. Summary

CHAPTER 4. GENERATING BRITISH ENGLISH PITCH CONTOURS AUTOMATICALLY
4.1. Introductory
4.2. Characterizations of Halliday's tones
4.2.1. Introductory
4.2.2. Analysis
4.2.3. Tone 1
4.2.4. Tone 2
4.2.5. Tone 3
4.2.6. Tone 4
4.2.7. Tone 5
4.2.8. Tones 13 and 53
4.3. Automatic generation of Halliday's tones
4.4. Testing the perceptual adequacy of automatically generated pitch contours
4.4.1. Introductory
4.4.2. Set-up of the experiment
4.4.2.1. Speech material
4.4.2.2. SATO-contours
4.4.2.3. DITO-contours
4.4.2.4. Other stylizations
4.4.2.5. Composition of test tape
4.4.2.6. Presentation of the test
4.4.3. Statistical analysis of the results
4.4.4. Discussion and conclusions
4.6. Summary

CHAPTER 5. IN CONCLUSION
5.1. Introduction
5.2. General evaluation
5.2.1. The IPO-approach
5.2.2. The melodic model
5.3. Dutch and BE intonation compared
5.3.1. Introduction
5.3.2. General picture
5.3.3. Declination
5.3.4. Levels
5.3.5. Pitch movements
5.3.6. Conclusion
5.4. Possible wider implications
5.4.1. A suggestion
5.4.2. Evidence
5.4.2.1. Declination
5.4.2.2. Physiological evidence
5.4.2.3. Evidence from other languages
5.5. Possible relevance to the theoretical linguist
5.6. Suggestions for further research
5.6.1. Introductory
5.6.2. Spontaneous speech
5.6.3. Longer stretches of speech
5.6.4. Pitch patterns
5.6.5. Some final observations

REFERENCES

APPENDICES
Appendix 2.1
Appendix 2.2
Appendix 2.3
Appendix 2.4
Appendix 2.5
Appendix 2.6
Appendix 3.1
Appendix 3.2
Appendix 3.3
Appendix 3.4
Appendix 3.5
Appendix 4.1
Appendix 4.2
Appendix 4.3

SUMMARY

SAMENVATTING
Acknowledgements
This research was supported by the Foundation for Linguistic Research, which is funded by the Netherlands Organization for the advancement of Pure Research (Z.W.O.).

I want to avail myself of this opportunity to thank everybody who, in whatever way and capacity, has contributed to the completion of this work. Some of those I would like to mention by name.

First of all, my teacher and supervisor, Prof. Dr. A. Cohen. He has proved a constant support and has always been ready to discuss the text with me. In particular his patience and forbearance, and his constant show of optimism in those times when things did not seem to be progressing at all have gained him my admiration.

Also the contribution that Hans 't Hart has made to this work cannot be overemphasized. He has been my roommate and immediate supervisor for three years and his great expertise, his enthusiasm and his eagerness to discuss anything at any time has been a constant source of inspiration to me.

Prof. Dr. W.H. Vieregge, Dr. M.P.R. van den Broecke and Dr. V.J.J.P. van Heuven, for their critical and valuable comments on the text.

Mr. A.J.G. O'Connor, Dr. C. Darwin and Mr. H.H. de Pijper, for giving me every assistance to make the experiments in England and the pilot experiments in Holland possible.

Drs. I. de Raaff, for her moral support and keeping me at work when I did not.

Dr. N.J. Willems, for rooming and pubbing with me in Brighton, and for not letting my work go to waste.

The Institute for Perception Research, for keeping every facility available to me after the official completion of the project.

Ing. L.L.M. Vogten, for analyzing and resynthesizing speech, and being always and instantly ready to give assistance.

All colleagues and supporting staff of the Institute for Perception Research, whose cheerfulness, knowledge and willingness to help have made working there a pleasure.

Mr and Mrs Lennie, for putting me up in Brighton.

All those who have participated as subjects in the experiments, without whom this work could not have been written.

Last but not least, my parents, for supporting me in more ways than one.
Curriculum vitae
The author was born in Geldrop on August 1, 1952. After finishing grammar school, he started studying English language and literature at Utrecht University in 1970, specializing in experimental phonetics. The bachelor's and master's degrees were obtained in 1973 and 1975, respectively, with minors in medieval history and didactics. From 1975 till 1976 he taught English at a secondary school in Amersfoort, after which he taught phonetics at Utrecht University for half a year. From 1977 till 1980 he carried out the present research at the Institute for Perception Research (IPO) in Eindhoven. From 1980 onwards he has been a teacher of English at secondary schools in Dongen.
CHAPTER 1
Background, aims and research techniques
1.1. INTRODUCTORY
A great deal of research has been done, and is still being done, on intonation, and perhaps more so for English than for any other language. It is therefore necessary to explain why yet another book should be written on the subject and what new contribution to the field it will try to make. Before even this, however, the notion 'intonation' as it will be used throughout this book must be defined. In this study, the term 'intonation' is used only to refer to variations in pitch, to the exclusion of such other prosodic speech features as amplitude, duration and the like. By far the greater part of all the work done on English intonation has been inspired by educational necessity. English is widely used all over the world in such important areas as aviation, business, commerce, technology and science, and therefore a great many people whose native language is not English find they have to learn English in the performance of their jobs. Thus, a lot of work has been done on teaching English as a foreign language and on teaching English intonation as part of it. A description of intonation that is used to teach people has as an advantage that it can afford not to be completely explicit. People bring to the situation their linguistic competence: they know what a language is and how it may work. For this reason, for instance, the listen-and-repeat system can be effective. This is one of the reasons why so far not much experimental work on English intonation has been done. If an impressionistic approach, where the researcher relies mainly on his own ears to discover a system in the intonation of a language, yields good enough results for this purpose then it is perfectly legitimate. Another reason why systematic experimental work on intonation has been neglected may have to do with the practical technical problems involved. 
Until recently, it was hardly possible and very time-consuming even to obtain a reliable measurement of the course of the fundamental frequency of an utterance. This has changed over the past few decades and drastically so over the past few years, with computers improving and becoming more easily accessible.
In the Netherlands, and more particularly at the Institute for Perception Research (IPO), speech researchers have been quick to catch on to the improving of technical facilities and their possible applicability to intonation research. As early as 1965, they started working with a vocoder, because this machine made it possible for them to provide an analyzed and then resynthesized speech signal with a controlled, artificial pitch contour, so that it now became possible to actually hear an utterance with an artificially generated pitch contour. They found that they could stylize original fundamental frequency curves by means of a relatively small number of straight lines without the resulting pitch contours becoming unacceptable, or even essentially different, as long as these pitch movements were of the right size and slope and in the right position. The first major publication of their findings was Cohen and 't Hart (1967). Encouraged, they continued their research along the same lines, with the result that now a fairly complete description is available of all the melodic properties of Dutch intonation. This approach to the study of intonation, which has meanwhile acquired the epithet 'Dutch School' (Delgutte 1976), was novel in a number of respects. In the first place, it was no longer purely impressionistic: a stylized pitch contour was not just a set of lines drawn on paper, but something which could actually be given an acoustic reality. Secondly, as a logical consequence, any claims made about the melodic properties of Dutch intonation became verifiable by having subjects listen to specimens of artificially generated pitch contours. Thirdly, it now became possible to judge models of intonation on perceptual grounds. It is on this 'IPO-approach' to the study of intonation that the present research is based.
Obviously, the usefulness of the method developed and applied at the IPO would be greatly enhanced if it could be shown to be applicable also to other languages than Dutch. British English (henceforward BE) seemed a logical choice as target language, since it is the foreign language most widely used and taught in the Netherlands.
1.2. DUTCH INTONATION
As explained, the research reported on here is based directly on the work done on Dutch intonation at the IPO since 1965, notably by Cohen, Collier and 't Hart. Their results have been amply published in the literature, for which the most important references are Cohen and 't Hart (1967), 't Hart and Cohen (1973) and 't Hart and Collier (1975). It might also be of interest to the reader to refer to the recently published course in Dutch intonation: Collier and 't Hart (1981). Moreover, Willems, in his 'English intonation from a Dutch point of view' gives a good summary of the work done on Dutch intonation (Willems 1982, pp. 36-42). It was nevertheless felt necessary, in view of the background, to include a short account here, for the convenience of those readers not familiar with the 'Dutch School'. This is done in the next section. No attempt will be made to give a historical survey of the IPO-work, nor will an effort be made to be exhaustive. Rather, I will try to set out the gross results of the research as concisely as is compatible with clarity. The subsequent sections will highlight several important features which are essential to the approach.
In the IPO-approach, the pitch curve of an utterance is replaced by an artificial contour, using resynthesized speech. The artificial contour is built up of a number of discrete pitch movements, which take the shape of straight lines whose slope and duration can be varied at will. This artificial contour is manipulated in such a way that perceptual equivalence is achieved with the original pitch curve, using as few pitch movements as possible. These movements are then said to possess perceptual relevance. By comparing many such stylized pitch contours and looking for consistent similarities and differences, one tries to reach a certain level of standardization. In this way it was found that it is possible to produce standardized stylizations of the great majority of Dutch intonation contours using only ten perceptually relevant pitch movements, which move from and to a low and a high declination line. The low declination line can be seen as a slightly downward tilted line on which the contour is superimposed. Figure 1.1. gives an example of such a standardized Dutch intonation contour.
Figure 1.1. A typical Dutch standardized intonation contour. (Horizontal axis: t in s.)
A grammar has been developed which describes how these various pitch movements can be combined to form entire contours. Moreover, it has been found what intonational cues can be used in Dutch to give pitch accents and what intonational cues may be used to indicate syntactic phrasing of utterances. Using this inventory and the grammar, it is possible to provide any Dutch utterance with a correct pitch contour.
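By way of illustration, the construction described above can be sketched in a few lines of Python. This is not the IPO implementation: the function names, the starting frequency of 110 Hz, the declination slope of -2 semitones per second and the 6-semitone excursions are all hypothetical values chosen for the example. The sketch assumes that the declination line and the pitch movements are straight lines on a logarithmic (semitone) frequency scale, and that each movement adds its excursion to the distance above the low declination line.

```python
def declination_hz(t, f_start=110.0, slope_st_per_s=-2.0):
    """Low declination line: straight on a semitone scale, hence exponential in Hz.

    f_start and slope_st_per_s are hypothetical example values.
    """
    return f_start * 2 ** (slope_st_per_s * t / 12.0)


def stylized_f0(t, movements, f_start=110.0, slope_st_per_s=-2.0):
    """F0 in Hz at time t for a contour built from discrete pitch movements.

    movements: list of (start_time_s, duration_s, excursion_in_semitones).
    Each movement is a straight line on the semitone scale; transitions
    between movements are abrupt (no smoothing).
    """
    offset_st = 0.0  # distance above the low declination line, in semitones
    for start, duration, excursion in movements:
        if t >= start + duration:
            offset_st += excursion  # movement fully completed
        elif t > start:
            offset_st += excursion * (t - start) / duration  # movement in progress
    return declination_hz(t, f_start, slope_st_per_s) * 2 ** (offset_st / 12.0)


# A rise to the high level followed later by a fall back to the low line.
example_movements = [(0.3, 0.1, 6.0), (0.8, 0.1, -6.0)]
contour = [(round(0.05 * i, 2), stylized_f0(0.05 * i, example_movements))
           for i in range(25)]
```

Sampling the function at regular intervals, as in the last lines, gives a sequence of F0 values that could in principle be handed to a resynthesis system in place of the measured pitch curve.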
1.3. FEATURES OF THE IPO-APPROACH
Basically, the IPO-approach may be said to aim at giving an acoustically completely explicit description of the melodic properties of the intonational system of a language. In doing so, the original speech signal is taken as a starting point, with as few linguistic preconceptions as possible. The description is considered adequate only if it is satisfactory from a perceptual point of view and if this can be established by means of experimentation. The description is in terms of discrete pitch movements and the principal research tool is the so-called analysis-through-resynthesis technique. The following sections will deal with these various features in some more detail; they are further developed and elaborated on in the following chapters.
1.3.1. Discrete pitch movements
For Dutch intonation it has been found possible to generate perceptually adequate artificial pitch contours that consist of a succession of discrete pitch movements that can be represented by straight lines in a visual recording if F0 is plotted on a logarithmic scale against time. The term 'discrete' is used here in its most literal sense: the artificial contours generated are actually built up of separate lines, each of which can be defined independently of the others. The transitions between pitch movements are abrupt and not smoothed in any way. This does not appear to have any negative effects from a perceptual point of view. It has furthermore proved possible to prepare an inventory of only 10 such standard movements, sufficient to generate acceptable versions of the great majority of patterns permissible in Dutch. This obviously makes for an efficient and easy method to define pitch contours acoustically. There is, however, also evidence from physiology that it is exactly these pitch movements that the speaker of a language intentionally produces to build up the intonation pattern he has chosen (see section 5.4.2.2.). Consequently, in this study, too, we shall try and give a description of BE intonation in terms of discrete pitch movements.
1.3.2. The acoustic signal as starting point
Also fundamental to the IPO-approach is that the researcher starts from the bottom of the speech chain, i.e. the acoustic signal, with as few linguistic preconceptions as possible. After gathering his acoustic data, he tries to generalize from them and find out what are the common elements apparently possessing perceptual relevance. In this systematic way he may eventually obtain a complete description of the melodic properties of the intonation of a language. Compared with this, the theoretical linguist may be said to do exactly the opposite: he starts from his linguistic knowledge of a language and tries to deduce from that what functions might be fulfilled by intonation and how (see also section 5.5.).
1.3.3. Analysis-through-resynthesis
The principal technique which makes it possible to apply the approach to the study of pitch phenomena mentioned in the previous section is the so-called analysis-through-resynthesis technique, which works as follows. The original speech signal can be instrumentally analyzed into its separate components (spectral composition, amplitude, fundamental frequency and voiced/unvoiced indication). The researcher can then change those aspects that he has singled out for attention in some systematic way of his own choosing. The original speech signal, but with the changes, can then be reconstructed (resynthesis) and the result can be made audible and compared with the original. Thus, the researcher can find out what is the perceptual impact of certain changes in the acoustic signal. This bridge between the acoustic signal and its perceptual impact is absolutely essential for a perceptually relevant description of intonation.
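The data flow of this technique can be made concrete with a small sketch. It covers only the middle step, in which the analyzed fundamental frequency is replaced by an artificial contour while all other components are left untouched; the analysis and resynthesis stages themselves, which require a vocoder or comparable system, are not modelled. The `Frame` structure, its field names and the function `impose_contour` are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Frame:
    """One analysis frame; the fields mirror the components named in the text."""
    time: float                  # frame position in seconds
    spectrum: Tuple[float, ...]  # spectral composition (representation assumed)
    amplitude: float
    f0: float                    # fundamental frequency in Hz
    voiced: bool                 # voiced/unvoiced indication


def impose_contour(frames: List[Frame],
                   contour: Callable[[float], float]) -> List[Frame]:
    """Return new frames whose F0 follows the artificial contour.

    Only voiced frames are changed; spectrum, amplitude and the
    voiced/unvoiced indication are copied unaltered, so that any audible
    difference after resynthesis is due to the pitch manipulation alone.
    """
    return [replace(frame, f0=contour(frame.time)) if frame.voiced else frame
            for frame in frames]


# Example: flatten the pitch of two analyzed frames to a monotone of 100 Hz.
analyzed = [
    Frame(0.00, (1.0, 2.0), 0.5, 120.0, True),
    Frame(0.01, (1.0, 2.0), 0.1, 0.0, False),
]
manipulated = impose_contour(analyzed, lambda t: 100.0)
```

Keeping the frames immutable means the original analysis stays available for comparison with the manipulated version, which is the comparison the listening experiments depend on.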
1.3.4. Verifiability
The analysis-through-resynthesis technique paves the way for another essential feature of the present approach, the requirement of verifiability: no claim should be made about melodic properties of a language unless it can be and has been verified by means of experimentation. Moreover, the experiments should always aim at perceptual verification, i.e. phonetically naive subjects should express their satisfaction that utterances with properly manipulated artificial pitch contours are acceptable to them. After all, this can be the only criterion for the validity of such a claim.
1.4. INTONATION RESEARCH AND SPEECH SYNTHESIS
1.4.1. Introductory
It is quite surprising, when one consults the literature, to find that it is very difficult to find any reports of studies where speech (re)synthesis is used as a research tool to gain insight into the melodic structure of a language. Usually, it is the other way round: people want to set up a speech synthesis system and so find themselves in need of rules for pitch control. These rules often make a crude impression, suggesting a lack of sufficient basic research and one feels that they have been added to the segmental synthesis rules more or less as an afterthought. Consequently, the results of the synthesis, as far as pitch is concerned, are generally disappointing. Moreover, it is very rare indeed to find instances where proper perceptual experiments have been carried out to assess the adequacy of artificial rule-generated pitch contours. As exceptions to this I would like to mention Maeda (1976) and Martin (1978) (cf. section 2.4.2.1.).
Fortunately, there are clear indications at the moment that the situation is changing for the better: more and more research into intonation using speech (re)synthesis is being done all over the world, and perhaps perceptual evaluation of automatically generated pitch contours will also become a matter of course before too long. After all, nearly all authors reporting on pitch rules in speech synthesis admit quite freely that, naturally, acceptability of speech output is the most important criterion for the success of a melodic model. The present study hopes to contribute to this development.
It does not fall within the scope of this report to provide a survey of all the work done which concerns melodic models of intonation sufficiently explicit to be applicable to speech synthesis systems, but I do want to give some relevant references:
American English: Maeda (1976); Pierrehumbert (1981);
British English: Mattingly (1966); Witten (1978);
Danish: Thorsen (1980);
Dutch: Cohen and 't Hart (1967); 't Hart and Cohen (1973); 't Hart and Collier (1975);
French: Vaissiere (1971); Grundstrom (1972); Martin (1978);
German: Isacenko and Schädlich (1964);
Japanese: Fujisaki and Nagashima (1967); Fujisaki and Hirose (1982);
Swedish: Bruce (1977).
Since the present study deals with BE intonation, the next two sections will give a summary of Mattingly (1966) and Witten (1978), the only instances in the literature of explicit rules for the generation of artificial BE pitch contours.
1.4.2. Mattingly
Mattingly (1966) is the first instance of BE pitch rules explicit enough to be applied in a speech synthesizer. Mattingly bases himself on two BE intonation courses: Armstrong and Ward (1931) and O'Connor and Arnold (1973), both of which favour the 'tone' approach. Sentences are split up into sense-groups (in accordance with Jones, 1962) and each of these sense-groups is assigned a tone. Armstrong and Ward recognize two tunes, and O'Connor and Arnold sixteen; Mattingly's melodic model provides for three tunes: falling, rising and falling-rising. To provide an utterance with a pitch contour, one has to indicate in the segmental input string where the sense-group boundaries are, which are the prominent words, which syllables of the prominent words should be stressed, and which tunes are to be applied. The appropriate contour is then automatically generated by means of a set of general program parameters and an external set of prosodic tables, both organized so as to allow easy manipulation if needed. Mattingly describes the quality of the resulting synthetic speech as 'not unsatisfactory' and the rules which control variations in fundamental frequency as 'fairly successful', but does not report any perceptual acceptability tests. Although he refers to the rules as 'a first attempt' and to the results as 'preliminary', there has not yet been a follow-up to the article.
1.4.3. Witten
After Mattingly (1966) the only instance of explicit rules for the control of pitch in a speech synthesis system is Witten (1978). As was the case with Mattingly, Witten's interest is not so much the study of intonation in itself, as setting up a complete speech synthesis-by-rule system. In the course of this, he naturally found himself in need of pitch control rules, besides features for the synthesis of segmental features and timing rules. Another resemblance between Mattingly and Witten is that they both based their rules on existing BE intonation courses, but while Mattingly concentrated on Armstrong and Ward (1931) and O'Connor and Arnold (1973), Witten used Halliday (1970) as a starting point for his pitch rules.
One of the merits of the article in which Witten presents his rules for pitch control is that he states these rules so explicitly and clearly that I was able, without too much difficulty, to apply his rules on the speech synthesizer at my disposal. Since pitch contours generated in accordance with Witten's pitch rules will be used in the experiments to be described in chapters 3 and 4, these rules will be given here in some detail; for a complete description, however, the reader should refer to Witten (1978).
Halliday uses the 'tone group' as the unit of intonation. In each tone group there is one part that is especially prominent, called the 'tonic'. The first 'salient' (lexically stressed) syllable of the tonic is called the 'tonic syllable'. The tonic may or may not be preceded by a 'pretonic segment'. Tone groups can either be 'simple', containing one tonic, or 'compound', with a double tonic. Each tone group can be pronounced with one of seven tones: 1, 2, 3, 4, 5, 13, or 53, the first five of which are simple and the last two compound. Every tone group consists of a number of 'feet', the foot being the unit on which 'the rhythm of spoken English is based'. Figure 1.2. gives two examples of sentences transcribed according to Halliday's notational system.
// 1 Arthur and / Jane / left for / Italy this / morning //.
// 53 he's / never / taken / Jane on / any of his / visits / though //.
Figure 1.2. Two (unconnected) sentences transcribed according to Halliday's notational system. A double slash indicates a tone group boundary; a single slash indicates a foot boundary; the number denotes the tone used; tonic syllables are underlined.
Witten gives rules for the automatic generation of the five primary non-compound tone-groups distinguished by Halliday. Before a pitch contour can be applied the experimenter has to specify which tone-group is to be chosen, after which three positions in the tone-group must be determined: the beginning of the tone-group, the beginning of the tonic syllable and the end of the tone-group. Besides this, the tone-group must be divided into feet and the boundaries of these must be known, too. After the tone-group has been segmented in this way, the pitch contour is applied in two stages: first an overall contour is calculated and then a 'foot-pattern' is superimposed to give the stressed (salient) syllables of each foot added prominence. For this purpose, six quantities must be specified, which may differ per tone-group. Figure 1.3, which is an example of tone 4, illustrates the system. The numbers 1 to 6 refer to the six quantities specified for each tone-group and their meaning is given here:
1. indicates the value of F0 at the beginning of the tone-group - in this case 280 Hz;
2. indicates the value of F0 at the end of the pretonic - in this case 180 Hz;
3. indicates the value of F0 at the beginning of the tonic syllable - in this case 200 Hz;
4. indicates the value of F0 at the end of the tone-group - in this case 245 Hz;
5. indicates how many Hz below the overall contour F0 is at 1/3 of each foot, in the pretonic ('non-linearity parameter');
6. indicates how many Hz below the overall contour F0 is at 1/2 of the first foot of the tonic ('non-linearity parameter').
The first four quantities define the shape of the overall contour. As can be seen, there is an abrupt pitch change at the tonic. Quantities 5 and 6 define the exact shape of the foot pattern. The various points found in this way are connected by straightforward linear interpolation. Table 1.1. gives the values of the 6 quantities that Witten associates with each of the five primary tones.

Figure 1.3. An example of tone 4, according to Witten's rules for pitch control. The thin line represents the overall contour and the solid line the completed contour, after superposition of the foot pattern. The meaning of the numbers 1 to 6 is explained in the text. (Horizontal axis: time in s.)

Table 1.1. The values of the six quantities that Witten associates with each of Halliday's five primary tones, in Hz.

Halliday's            Quantities
tone group      1     2     3     4     5     6
1             175   175   175    75   -40   -40
2             280   280    90   190   -40     0
3             175   175   105   150   -40   -10
4             280   180   200   245   -40   -45
5             175   235   215   170   -40   +45

The above is in fact an oversimplified picture of Witten's pitch rules: I have given only such information as is necessary to understand how he has synthesized Halliday's five-tone system. In reality, his system is quite flexible: he also allows for an 'initialization' and a 'continuation' possibility, the fraction along the foot in the non-linearity parameters (numbers 5 and 6 above) can be varied, and all the quantities given in table 1.1. can be given any value wanted. Thus, a very large range of different contours can be generated and experimented with. Witten concludes his article with this statement: 'We are currently experimenting with the system, and testing various hypotheses about the effects of the parameter settings. Results of these tests, together with a proper evaluation of the speech generated, will be presented in a future paper' (p. 260). To date, unfortunately, this future paper has not been published.
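The simplified version of Witten's rules summarized above can be sketched as follows. This is an illustration of the summary given here, not Witten's own program. The six values per tone are those of Table 1.1. Three points are assumptions: quantities 5 and 6 are treated as signed offsets added to the overall contour (so that -40 means 40 Hz below it), the contour is taken to lie on the overall contour at every foot boundary, and interpolation is linear in Hz. The function name and the foot timings in the example are hypothetical.

```python
# Quantities 1 to 6 (Hz) for Halliday's five primary tones, from Table 1.1.
WITTEN_TONES = {
    1: (175, 175, 175, 75, -40, -40),
    2: (280, 280, 90, 190, -40, 0),
    3: (175, 175, 105, 150, -40, -10),
    4: (280, 180, 200, 245, -40, -45),
    5: (175, 235, 215, 170, -40, 45),
}


def lerp(x, x0, y0, x1, y1):
    """Linear interpolation between (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def witten_breakpoints(tone, foot_starts, tonic_start, end):
    """Return (time, F0) breakpoints; connect them by straight lines.

    foot_starts: start times of all feet, the first being the start of the
                 tone-group; tonic_start must be one of these values.
    end:         end time of the tone-group.
    """
    q1, q2, q3, q4, q5, q6 = WITTEN_TONES[tone]
    start = foot_starts[0]
    bounds = list(foot_starts) + [end]
    points = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        if a < tonic_start:
            # Pretonic foot: overall contour runs from q1 to q2,
            # with an offset of q5 at one third of the foot.
            points.append((a, lerp(a, start, q1, tonic_start, q2)))
            third = a + (b - a) / 3.0
            points.append((third, lerp(third, start, q1, tonic_start, q2) + q5))
        elif a == tonic_start:
            if a > start:
                points.append((a, q2))  # end of the pretonic
            points.append((a, q3))      # abrupt change at the tonic syllable
            half = a + (b - a) / 2.0
            points.append((half, lerp(half, tonic_start, q3, end, q4) + q6))
        else:
            # Later feet of the tonic simply follow the overall contour.
            points.append((a, lerp(a, tonic_start, q3, end, q4)))
    points.append((end, q4))
    return points


# Tone 4 with two pretonic feet and a one-foot tonic (timings are invented).
tone4 = witten_breakpoints(4, [0.0, 0.6, 1.2], tonic_start=1.2, end=1.8)
```

For the tone 4 example, the breakpoints start at 280 Hz, reach 180 Hz at the end of the pretonic, jump to 200 Hz at the tonic syllable and end at 245 Hz, in agreement with the values quoted for Figure 1.3.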
1.5. AIMS AND MOTIVATION OF THIS STUDY
Generally speaking, the main aim of this research is to study BE intonation in accordance with the principles that have been used in the study of Dutch intonation. The motivation for this is twofold. Firstly, to see if these principles can indeed be used on a language other than Dutch, since this would show the IPO-approach as a tool in intonation research to have a more general applicability. Secondly, because so little experimental work has been done on BE intonation, and none at all where serious experiments have been performed to evaluate the perceptual adequacy of proposed melodic models of intonation. We will therefore try, using the IPO-approach outlined in previous sections, to develop certain melodic precepts for the automatic generation of artificial BE pitch contours. These artificial pitch contours should satisfy three requirements. In the first place, they must be perceptually acceptable to native speakers of BE, and demonstrably so. In the second place, they must be so explicit acoustically that they can be implemented on a speech synthesis system. In the third place, it must be possible to use essentially identical pitch contours with different utterances without a decrease in perceptual acceptability: otherwise the pitch contours would have no generalizing value. To do this, it is necessary to 1) make an inventory of BE standardized perceptually relevant pitch movements, 2) find rules that dictate how these pitch movements may be combined to form complete pitch contours, and 3) make an inventory of BE pitch patterns, i.e. combinations of pitch movements that appear to recur frequently. Another aim is to compare the Dutch and English intonational systems to see if any systematic differences and similarities can be found. 
From the start, since only three years were available for this research, it was not expected that these aims could be realized in any complete sense, but it was hoped that a fair start towards them could be made, providing a sound basis for follow-up research.
1.6. FIELDS OF APPLICATION
Several possible fields of application suggest themselves if the aims outlined above are met. First, there is the possibility of incorporating the melodic precepts found into speech synthesis-by-rule systems. The use of such systems is clearly on the increase, e.g. as a speech research tool, in automatic answering service systems, in reading machines for the blind and even in electronic games. Generally, the weak link in such systems, and a factor contributing greatly to unnaturalness of the synthetic speech, is the lack of adequate rules for pitch control. The results of this study could contribute to providing such rules.
Second, the melodic precepts could be used as a basis for a BE intonation course, much as has been done for Dutch intonation. The advantage that this course would have over the many BE intonation courses already in existence would be that 'in contrast with most existing intonation courses ( ) the user instructions in this course will be based on experimental evidence and as such stand a better chance of representing an explicit and internally consistent survey of allowable melodic structures of the language concerned. Moreover, the suggested notational system of straight-line contours is straightforward and thereby easy to comprehend.' (Willems 1982, p.58). Moreover, since a description of Dutch intonation in exactly the same terms already exists, the basis is laid for a grammar aimed directly at Dutch learners of BE, by concentrating on the oppositions between the two. A lot of work has in fact already been done on such a 'contrastive' intonation course (see Willems 1982, chapter 7).
Other applications, also in the domain of education, are the training of laryngectomees with an electrolarynx specially prepared for the easy generation of pitch contours (Van Geel 1983), and the suitability of the melodic precepts proposed here for training students of intonation by means of visual feedback (De Bot 1982).
1.7. RESEARCH TECHNIQUES
1.7.1. Introductory
The approach to the study of intonation as used here requires a great deal of technical facilities: it is necessary to make acoustical analyses and visual recordings of the speech signal, to manipulate the course of the fundamental frequency of utterances, to resynthesize speech with stylized pitch contours and make them audible. This section is dedicated to giving details on how this is achieved. Strictly speaking, the term 'research techniques' includes experimental techniques and techniques of statistical analysis of experimental results. These, however, will be dealt with in the sections where the experiments proper are described. The present section will also be used to give some definitions.
1.7.2. Measuring fundamental frequency
In the experiments described in this report it was essential to make audible resynthesized speech with the original fundamental frequency curve left intact. It was therefore necessary to have an accurate method of determining the course of the fundamental frequency of utterances. Measuring the rapidly varying fundamental frequency of the quasi-periodic signal that is speech is a problem of long standing. The various methods in existence to do it automatically invariably make mistakes - or did until recently - and are thus not sufficiently reliable. Alternatively, the method of doing it 'by hand', by indicating the periods that one can see in the oscillographic recording of the speech signal, is reliable, but extremely time-consuming. In this research, a mixed method has been used: fundamental frequency is measured automatically by means of an auto-sign correlation method; the adequacy of this measurement is checked both visually on recordings, and by ear, by comparing it with the original after resynthesis; if necessary, the measurement is corrected 'by hand'. In this way, one can be certain of correct measurements and the method is not prohibitively time-consuming. The reader who wants to know more about auto-sign correlation measurement of fundamental frequency is referred to Rabiner (1977).
1.7.3. Analyzing and resynthesizing speech
Most of the work on Dutch intonation at the IPO has been done by means of the so-called 'intonator', basically a vocoder-type machine used for speech analysis and resynthesis, coupled to a function generator which controls the repetition rate of the periodic source of the vocoder in resynthesis. By the time this project started, more sophisticated means were available. Use has been made of the technique of Linear Predictive Coding (from now on LPC). LPC analysis/resynthesis systems have become quite well-known and popular in the past few years and I will therefore not go into their workings here. The interested reader is referred to Atal and Hanauer (1971). The system as it has been used in this research accepts digitized speech at the input and analyzes it in terms of the amplitude envelope and the first five formants with their bandwidths. Independent of the LPC-analysis, a voiced/unvoiced decision is made and fundamental frequency is measured by means of an auto-sign correlation method which may be corrected 'by hand' if called for. This analyzed speech signal can be resynthesized and made audible. In the meantime, any or all of the parameters mentioned can be manipulated at will. A more complete description of the system is given by 't Hart, Nooteboom, Vogten and Willems (1982). The LPC-vocoder has considerable advantages over the intonator: the quality of the resynthesized speech is higher, it is more versatile, it is more reliable and it is possible to resynthesize an utterance with the original fundamental frequency curve left intact. All these advantages have been gratefully exploited; in fact, with the intonator it would have been much more difficult to do the experiments to be described later.
1.7.4. The use of the logarithmic scale in F0 recordings
In this report a number of examples of recordings of fundamental frequency curves and stylized pitch contours are included (see e.g. figure 2.1.). In these recordings, time is plotted along the horizontal and frequency along the vertical axis. It is important to note that frequency is always plotted along a logarithmic, rather than a linear scale. Thus, changes in frequency that appear as straight lines in these graphs would appear as curved lines if frequency were plotted linearly. This is especially relevant to remember if one should want to compare these recordings with similar recordings in other publications by different authors who plot frequency along a linear scale. The reason why a logarithmic scale is used has to do with the fact that intonation is looked at from a basically perceptual point of view; not what happens in a purely acoustical sense is important, but what it does to perception. Thus, when a woman's voice rises from 200 Hz to 400 Hz within a certain time interval, and a man's voice rises from 100 Hz to 200 Hz in the same time, these two pitch changes are perceived as essentially the same. This is more adequately represented by using a logarithmic scale: the two pitch changes have the same slope and the same size.
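The two points made above can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the octave is used as the log-frequency unit purely for convenience.

```python
import math


def interval_in_octaves(f_start, f_end):
    """Size of a pitch change as a log-frequency interval (1.0 = one octave)."""
    return math.log2(f_end / f_start)


def log_scale_samples(f_start, f_end, n):
    """n + 1 equally spaced points on a straight line in the log-frequency domain."""
    ratio = f_end / f_start
    return [f_start * ratio ** (k / n) for k in range(n + 1)]


woman = interval_in_octaves(200, 400)  # a rise of 200 Hz
man = interval_in_octaves(100, 200)    # a rise of 100 Hz
print(woman, man)                      # identical on the logarithmic scale

# Equal steps in time give unequal steps in Hz, so the same movement
# looks curved when frequency is plotted on a linear axis.
print([round(f, 1) for f in log_scale_samples(100, 200, 4)])
```

Halfway through the 100 Hz to 200 Hz rise the frequency is about 141 Hz rather than 150 Hz, which is the curvature a linear frequency axis would show.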
1.8. SOME DEFINITIONS
Throughout this report a number of terms and notions recur frequently. It is important for the reader to have a clear idea of what, in this context, is meant by them. Therefore, definitions of the most important terms are given here.
Pitch and fundamental frequency. Although, strictly speaking, 'pitch' is a perceptual term, here it will be used also to refer to the acoustic signal. In these contexts it is identical to 'fundamental frequency' (or the inverse of the vocal cord periodicity). The notion 'fundamental frequency' may be abbreviated to F0.
Contour and Curve. The word 'contour' is used exclusively to refer to stylized pitch curves; similarly, pitch 'curves' are always understood as not stylized.
Perceptual equivalence. It has never been possible to give the term 'perceptual equivalence' a really good definition. 't Hart and Cohen (1973) say that a stylization that is perceptually equivalent to the natural course of the fundamental frequency is established by having a resynthesizer 'perform the smallest number of such movements as are needed to achieve, between input and output signal, a resemblance which is satisfactory according to a judgment obtained in analytic listening'. This resemblance is, of course, to be understood as melodic.
Perceptual relevance. Perceptually relevant pitch movements are those pitch movements that cannot be deleted from a pitch contour without a clearly audible change in the melody of the pitch contour as a whole. Also, it is assumed that it is these and only these perceptually relevant pitch movements that a speaker produces under voluntary control as part of an intonation contour. All other pitch phenomena that one may observe, with the exception of declination (see below), are regarded as micro-intonation.
Declination. This is the tendency of pitch to float down over the course of an utterance. In this study it is assumed that declination is not produced under voluntary control by the speaker and it is therefore not regarded as a pitch movement. It does, however, possess perceptual relevance, since it cannot be deleted from a pitch contour with perceptual impunity.
Micro-intonation. By micro-intonation is meant those fluctuations in fundamental frequency which are not intended by the speaker as part of the intonation pattern and which have no relevance for the perception of intonation. Such fluctuations are caused for instance by involuntary changes in the transglottal pressure drop or in the laryngeal musculature due to the segmental environment. For more information on micro-intonation see e.g. Lehiste (1970).
Semitones. Since changes in fundamental frequency are thought of as taking place in the logarithmic frequency domain, size and slope of pitch movements are expressed in terms of the logarithmic notions of semitones (ST), and semitones per second (ST/s), respectively. Where necessary, the following formula is used to go from semitones to Hertz and vice versa:
$$\text{size} = \frac{12}{\log 2}\,\log\left(\frac{f_2}{f_1}\right)$$
where size = size of a frequency change in semitones, $f_2$ = end frequency in Hertz of a frequency change, $f_1$ = start frequency in Hertz of a frequency change.
Positive values for size indicate rises, and negative ones falls.
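A minimal Python sketch of this conversion in both directions is given below. The function names are hypothetical; the inverse is obtained simply by solving the formula for the end frequency.

```python
import math


def size_in_semitones(f1, f2):
    """Size in semitones of a frequency change from f1 Hz to f2 Hz."""
    return 12 / math.log(2) * math.log(f2 / f1)


def end_frequency(f1, size):
    """End frequency in Hz of a change of `size` semitones starting at f1 Hz."""
    return f1 * 2 ** (size / 12)


print(size_in_semitones(100, 200))  # an octave rise: positive size
print(size_in_semitones(200, 100))  # an octave fall: negative size
print(end_frequency(100, 12))       # back from semitones to Hertz
```

An octave corresponds to 12 semitones, so the doubling from 100 Hz to 200 Hz and the doubling from 200 Hz to 400 Hz have the same size.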
1.9. BRIEF OUTLINE OF THE STUDY
Chapter 2 describes a method to determine which elements in the F0 curve of an utterance are relevant for the perception of the pitch contour. This is important, because any model of intonation need take only these perceptually relevant elements into account and can ignore the rest. To this purpose, a certain type of stylization is introduced, the so-called close-copy stylization. This is a stylization consisting of as few pitch movements as possible, while still being perceptually indistinguishable from the original F0 curve from which it is derived. Also, a method is proposed to objectively test the perceptual acceptability of utterances with artificial pitch contours. Naive subjects are asked to score the attribute 'acceptability of pitch contour' on a 5-point scale, and then the so-called method of successive intervals is used to process the results in a statistically justifiable manner. As a result, each utterance used in the experiment can be given a 'scale value', which represents the perceptual adequacy of the pitch contour.
In chapter 3, a model is proposed that claims to give an account of the melodic system native speakers of BE make use of. The model comprises eight melodically distinct pitch movements, which may take up any of three positions with respect to the syllable with which they are associated, and rules governing the ways in which these pitch movements may be merged to form complete pitch contours that are fully specified acoustically. An experiment is described which tests whether the model can be used to generate perceptually acceptable pitch contours. The two main conclusions to be drawn from this experiment are that pitch contours generated in terms of the model are not distinguishable from original fundamental frequency curves, as far as perceptual acceptability is concerned, and that the subjects have shown themselves remarkably capable in assessing the abstract attribute 'acceptability of pitch contour'.
In chapter 4, explicit rules are given for the automatic generation of each of Halliday's seven primary tones, with and without pretonic. These rules are in accordance with the melodic model outlined in the previous chapter. An experiment is described which tests whether pitch contours generated in accordance with these rules are perceptually adequate. Although the results are not conclusive, they suggest that such pitch contours have a high degree of perceptual acceptability. However, the need for further research and experimentation is clearly indicated.
Chapter 5, finally, evaluates to what extent the aims of the study have been realized. The Dutch and BE intonational systems are compared and suggestions are given for further research.
CHAPTER 2
Close-copy stylizations
2.1. FINDING THE PERCEPTUALLY RELEVANT ELEMENTS IN F0-CURVES
2.1.1. Interpreting acoustic visual recordings of F0-curves
As has been explained in section 1.7.2, it is possible, using a combination of automatic and manual analysis, to obtain an accurate visual recording of the course of the fundamental frequency of an utterance. It therefore seemed logical to use such visual recordings as a starting point for stylization. However, the problem of interpreting acoustic visual recordings of F0-curves is well-known to all who deal with them. What one finds oneself confronted with is a rapidly varying and capricious looking curve, in which any possibly present system and regularity is not immediately apparent. The reason for this is that the observed F0-curve is the result of a number of contributing factors, some of which are intentional, while others are not. First of all, there are those changes in pitch that the speaker produces under voluntary control as part of the intonation contour, but in addition to that there are changes in pitch that are due to such things as imperfections in the articulatory mechanism and influences of the segmental environment. Which of the details observed in an F0-curve are pertinent for the perception of the intonation contour and which details can be smoothed out because they are not, is a question that cannot be decided on the basis of the acoustic visual recording alone.
2.1.2. Stylization and perceptual evaluation
In this research a method was developed to determine which elements in an F0-curve can be considered perceptually relevant, which involves stylization of the original F0-curves and subsequent perceptual evaluation. As an example, consider figure 2.1.a, which shows the fundamental frequency curve of the utterance 'where did you hide that thing', as spoken by a male native speaker of BE. Given this curve, one might decide to make a stylization by connecting points 1 through 6 as indicated in figure 2.1.a. by means of straight lines. This results in the stylization as shown in figure 2.1.b. The dots represent the original curve and the straight lines the stylization. During periods of silence and during the voiceless segments of the utterance the periodic source of the LPC-vocoder is switched off, which is reflected in the pictures as an interruption in the recording, the dots over the figure indicating the voiceless segments. Using the LPC-vocoder, an utterance can be made audible with the stylized pitch contour and compared directly with the utterance with the original F0 curve. This original, too, first undergoes the analysis/resynthesis process for purposes of comparability; after all, the quality of the resynthesized speech differs clearly from unprocessed speech. In this instance, it was found that there were hardly any audible differences at all between the two intonational versions. In fact, I found that it takes some very analytic listening to determine whether the two are indeed different. In such cases, where a stylized pitch contour is perceptually distinguishable from the corresponding original fundamental frequency curve only through very attentive and analytic listening, I will refer to the stylized pitch contour as being 'perceptually equal' to the original. It can be seen that a great deal of detail has been eliminated from the curve without audible consequences. These details are therefore regarded as perceptually irrelevant, i.e. not relevant to the perception of the intonation contour.
In the case of the utterance 'where did you hide that thing' I found that I could make the stylization even more economical than the one shown in figure 2.1.b. The stylization in figure 2.1.c. is more economical in the sense that it employs fewer straight lines. Perceptual comparison, however, showed that it was still hardly distinguishable from the original. Put differently, the stylization in figure 2.1.c, though more economical than the one in figure 2.1.b, is still perceptually equal to the original in the sense given above. In such cases, the conclusion must be that the stylization in figure 2.1.b. contains elements that are not relevant for the perception of the intonation contour. Apparently, the small rise at the beginning of the word 'hide', the third straight line from the left, is not a perceptually relevant pitch movement in this particular instance. It is quite possible that in a somewhat different segmental environment an identical acoustic event would have had perceptual relevance and could not have been smoothed out with impunity. As it is, it is decided after resynthesizing and perceptual evaluation, that the rise here must be interpreted as a micro-intonational phenomenon. One might propose an even more economical stylization, such as shown in figure 2.1.d, which involves only three, rather than four distinct pitch movements, but this turns out to be immediately audible: the steep rise at the beginning of the contour has perceptual relevance and cannot be left out.
Figure 2.1. Fundamental frequency curve (a), a perceptually equal stylization (b), the close-copy stylization (c) and a perceptually different stylization (d) of the utterance 'where did you hide that thing'.
2.2. CLOSE-COPY STYLIZATIONS
2.2.1. Introductory
Such stylizations, where virtually no perceptual differences appear to exist between stylized pitch contour and original fundamental frequency curve and which are as economical as possible in terms of the number of pitch movements needed, will henceforward be called 'close-copy stylizations', or simply 'close copies'. The rest of this chapter will be dedicated mainly to this notion of close-copy stylizations, their use, their generation, and tests involving close-copy stylizations. This is because they are quite useful as an instrument in intonation research.
2.2.2. Usefulness of close-copy stylizations
The close-copy stylization is useful not so much as an end in itself, but as an intermediate stage between original, capricious fundamental frequency curves and completely standardized pitch contours. Figure 2.2.a. gives the fundamental frequency curve of the utterance 'yes, it was in Sweden that I think the most embarrassing thing that ever happened to me occurred' as spoken by a male native BE speaker. Such a curve shows a great amount of detail. Which of these details are pertinent for the perception of the intonation contour, and which details can be smoothed out because they are not, is something that cannot be decided on the basis of the acoustic recording alone. The close-copy stylization of this curve, however, given in figure 2.2.b, shows up the perceptually relevant elements; it is more transparent and more easily interpretable than the original curve. Since the two are perceptually equal, the straight lines making up the contour are regarded as the perceptually relevant pitch movements, i.e. the movements that the speaker has produced under voluntary control as part of the intonation contour. Thus, close-copy stylizations allow the researcher of intonation to gain insight into the structure of the melodic system of a language. Using the accumulated evidence of a large number of close-copy stylizations, it is possible to find reasons to discriminate between various standard pitch movements, and also how they may be combined to form entire contours.
Another important advantage of close-copy stylizations is the simplicity with which they can be acoustically defined. The contour consists of a relatively small number of interconnected straight lines. Each of these lines is fully defined in terms of only five parameters: start frequency, end frequency, slope, size and duration. Table 2.1, for instance, is an exhaustive description of the stylized pitch contour shown in figure 2.1.c, of the utterance 'where did you hide that thing'. Even this small list is redundant, because only three parameters are really needed; the others are then fixed and can be calculated by means of the formula given in section 1.8. Thus, the first rise in the contour goes from 106 Hz to 178 Hz in 150 ms. From this it can be calculated that it has a slope of 59.8 ST/s and a size of 9.0 ST. The advantage of having such an efficient way of listing the perceptually relevant information present in pitch contours needs no further elaboration.
Figure 2.2. Original fundamental frequency curve (a) and close-copy stylization (b) of the utterance 'yes, it was in Sweden that I think the most embarrassing thing that ever happened to me occurred.'
Table 2.1. Start frequency, end frequency, duration, slope and size of the four pitch movements making up the pitch contour shown in figure 2.1.c.
Pitch       Start frequency   End frequency   Slope     Size     Duration
movement    (Hz)              (Hz)            (ST/s)    (ST)     (ms)
1.          106               178              59.8       9.0    150
2.          178               196               3.9       1.7    430
3.          196                84             -66.7     -14.7    220
4.           84                76              -2.3      -1.7    760
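The redundancy noted above can be made concrete with a short Python sketch that derives slope and size from start frequency, end frequency and duration, using the formula of section 1.8. The function and variable names are hypothetical; the input values are those of Table 2.1.

```python
import math

# (start frequency in Hz, end frequency in Hz, duration in ms), from Table 2.1.
MOVEMENTS = [(106, 178, 150), (178, 196, 430), (196, 84, 220), (84, 76, 760)]


def derive(start_hz, end_hz, duration_ms):
    """Return (slope in ST/s, size in ST), rounded to one decimal place."""
    size = 12 * math.log2(end_hz / start_hz)   # formula of section 1.8
    slope = size / (duration_ms / 1000)        # semitones per second
    return round(slope, 1), round(size, 1)


for number, movement in enumerate(MOVEMENTS, start=1):
    slope, size = derive(*movement)
    print(f"{number}. slope {slope:6.1f} ST/s   size {size:6.1f} ST")
```

For the first movement this gives 59.8 ST/s and 9.0 ST, the values quoted in the text, and the other three rows likewise agree with the slope and size columns of the table to one decimal place.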
2.2.3. Criteria for close-copy stylizations
So far, the notion of close-copy stylization has been only loosely defined. It seems desirable that the relation between close-copy stylizations and the original fundamental frequency curve from which they are derived be more formally established. That is, it would be nice to have some objective criteria that make it possible to derive close-copy stylizations from original fundamental frequency curves in a reliable and reproducible way. This, however, is not at all an easy matter. When one looks at recordings of pitch contours and their corresponding close-copy stylizations (cf. appendix 2.1.), one may be inclined to think of them as a sort of best straight line fit through the dots making up the original curve. If this were so, it would probably be quite possible to design an excellent set of objective criteria that would establish a 100% perfect one-to-one relationship between original and corresponding close-copy. However, such criteria cannot be limited to the acoustic recording of the fundamental frequency curve alone. As has been mentioned above, similar acoustic events in a fundamental frequency curve may be different as far as the perception of intonation is concerned. Frequency fluctuations that are important in one instance may be of no consequence to the perception of the intonation contour in another. This may be the result of micro-intonational phenomena, but linguistic criteria, such as prominence, can also be of considerable importance to a listener's interpretation of the intonation contour he hears. Other comparable influences are the syntactic structure of the utterance and the semantic content. Since not enough is known about these influences at the moment, I found myself frustrated in attempts to establish a set of objective criteria for the generation of correct close-copy stylizations. More knowledge of the workings of the entire intonational system of BE is needed before this can be done.
The reader who is interested in problems encountered in automatic stylization of pitch contours in terms of straight lines is referred to Boves and Rietveld (1978). Thus, instead of criteria guiding the derivation of close-copy stylizations from original fundamental frequency curves, I have had to satisfy myself with three criteria that are operational in character and concern not the derivation of the stylizations, but rather the resulting stylizations themselves. In summary, the first criterion states that the close-copy stylization must consist of a sequence of straight lines in the logarithmic frequency domain. The second criterion states that the stylization must be perceptually equal to the original. The third criterion states that it must not be possible to satisfy the second criterion with a stylization consisting of fewer lines, in other words, that one must use as few pitch movements as possible. As it is, the generation of a close-copy stylization is a constant interaction between changing stylized pitch contours, listening, and comparing with the
appropriate original fundamental frequency curve. This goes on until one is satisfied that the stylization is both perceptually equal to the original and as economical as possible. It should be clear that in consequence of this procedure there is not just one close-copy stylization for a particular F0 curve. In principle an infinite number are possible, as long as the differences remain under the threshold.
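The first criterion, a sequence of straight lines in the logarithmic frequency domain, can be illustrated with a small Python sketch that evaluates such a contour at any moment in time. The breakpoints are built from the values of Table 2.1 under the assumption that the four movements follow one another without gaps, starting at time 0; the names are hypothetical. The second and third criteria are perceptual and cannot be captured in code of this kind.

```python
from bisect import bisect_right

# (time in s, F0 in Hz): Table 2.1 movements chained end to end (assumption).
BREAKPOINTS = [(0.00, 106), (0.15, 178), (0.58, 196), (0.80, 84), (1.56, 76)]


def contour_hz(t, breakpoints=BREAKPOINTS):
    """F0 in Hz at time t on a contour that is piecewise linear in log frequency."""
    times = [bt for bt, _ in breakpoints]
    if not times[0] <= t <= times[-1]:
        raise ValueError("time outside the contour")
    # Index of the breakpoint that closes the segment containing t.
    i = min(bisect_right(times, t), len(times) - 1)
    (t0, f0), (t1, f1) = breakpoints[i - 1], breakpoints[i]
    fraction = (t - t0) / (t1 - t0)
    # Linear interpolation of log(F0) equals geometric interpolation of F0.
    return f0 * (f1 / f0) ** fraction


def number_of_movements(breakpoints=BREAKPOINTS):
    """Number of straight lines, the quantity the third criterion minimizes."""
    return len(breakpoints) - 1


print(number_of_movements())
print(round(contour_hz(0.075), 1))  # halfway up the first rise
```

A contour stored in this way needs only one (time, frequency) pair per breakpoint, which is one way of keeping the three independent parameters per movement mentioned in section 2.2.2.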
2.3. TESTING THE PERCEPTUAL EQUALITY OF CLOSE-COPY STYLIZATIONS
2.3.1. Introductory
Up till now, the alleged perceptual equality of close-copy stylizations and the fundamental frequency curves from which they have been derived has not been further supported. The applicability of close-copy stylizations as a reliable tool in intonation research, however, depends completely on the two being perceptually equal. It is therefore necessary to find experimental evidence for the alleged perceptual equality of close-copy stylizations and originals. In fact, if such evidence were provided, several purposes would be served at once. Firstly, it would be clearly indicated that the methods that have been used successfully in the study of Dutch intonation are also usable in the study of BE intonation. Secondly, it would be proved that it is possible to describe the pitch of BE utterances in terms of a limited number of discrete, perceptually relevant pitch movements. These are two of the questions that the present research set out to answer. Thirdly, of course, there would be experimental evidence for the claim that close-copy stylizations can be regarded as perceptually equal to the original fundamental frequency curves from which they were derived. Accordingly, an experiment was designed to find this experimental evidence. This test will be dealt with in the following sections.
2.3.2. Set-up of the experiment
2.3.2.1. Test material
By the time this experiment was done, close-copy stylizations had been made of about 50 utterances. The majority of these, about 35, are utterances that were 'borrowed' from the tapes containing demonstrations and exercises that come with the BE intonation courses by Halliday and by O'Connor and Arnold (see references). These were chosen for the following reasons. The utterances on the tapes have a duration of about two to three seconds, which seemed a convenient length. The recording quality is good on both tapes, which is important in connection with the analysis/resynthesis process the utterances are subjected to. The utterances are articulated carefully and special attention is paid to the intonation contour. Since utterances were taken from all tones and tone groups, a large variety of pitch contours is covered. The other 15 utterances were spoken by four different native speakers of BE, who did not pay any special attention to intonation. All the utterances were spoken by male speakers, since the quality of the resynthesized speech spoken by women turned out to be so poor as to be hardly acceptable.
For the purposes of this experiment, 20 utterances were chosen from the available 50, on a random basis. Of each of these utterances four or five different versions were available. These were: the original version and the close-copy stylization, and two or three other stylizations that differed from the original only with respect to the pitch contour and did so in a variety of ways and degrees, but so much that, according to the experimenter, the differences should usually be easily audible to any native speaker of BE. These other versions are from now on called 'alternative' versions. More about these alternative versions is said later in this section. Appendix 2.1. gives graphical representations of all intonational versions of five of the utterances used in the experiment.
Two test tapes were then constructed. Each tape contained 40 test items and 40 dummy items, 80 items in all. Each item consisted of a pair of utterances which were either exactly identical, or different from each other with respect to pitch. The test items were the same on both tapes; some of the dummy items were different on the two tapes. Figure 2.3. shows the composition of the tapes. The forty test items contained twenty pairs of exactly identical utterances, i.e. whose members were both originals or both close-copy stylizations of the same utterances. These will henceforward be referred to as items belonging to category B. The other twenty pairs consisted of an original and a close-copy stylization of the same utterance. These are called items belonging to category A.
The forty dummy items likewise contained twenty items consisting of pairs whose members differed. Here, one member was either an original or a close-copy stylization and the other was an 'alternative' contour. These pairs were such that every native Englishman should easily be able to hear the difference between the two members. This is category C. The last category, category D, is the only one in which the test tapes differed. On both test tapes, only 'alternative' versions (as defined above) were used for the category D-items. On test tape 2, the pairs making up the D-items had identical members. On test tape 1, however, the D-items consisted of pairs with different members. Here, the degree of difference between the two versions varied. Sometimes the differences were quite big and sometimes they were quite small. This was done for exploratory purposes and has no direct bearing on the actual purpose of the experiment. All the same, the experimenter felt safe in assuming that most of the differences also in this category would be audible to the subjects.

80 items
    40 test items
        20 different              CATEGORY A
        20 equal                  CATEGORY B
    40 dummy items
        20 different              CATEGORY C
        20 equal or different     CATEGORY D

Figure 2.3. Composition of the test tapes used in the experiment. For explanation, see the text.

In fact, the alternative contours served as an exploratory preparation for the melodic model introduced in the next chapter and, among other things, the various parameter settings in the model were experimented with. Sometimes, the only difference between two alternative versions of an utterance is a steeper slope of the pitch movements, or a slightly higher final frequency of the pitch contour. In other cases, however, a steep fall in pitch is shifted to a different syllable, giving rise to a clearly audible shift in stress, or a completely different type of pitch contour is implemented. Several examples of differences between alternative contours can be found in appendix 2.1.

Another reason for making the difference between the two test tapes was the following. If the assumption of perceptual equality between original and close-copy stylization holds, then subjects are supposed to hear no differences in categories A and B, in both tests; to hear clear differences in category C, also in both tests; and in category D no differences in test 2 and in test 1 differences in most cases.
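The composition of the two tapes described above can be tallied in a few lines. The sketch below is only an illustration: the dictionary layout and the function names are assumptions introduced here, not part of the original study. It counts, for each tape, the share of pairs that are physically identical and the share that would be heard as identical if originals and close-copy stylizations are perceptually equal and every other difference is heard.

```python
from fractions import Fraction

# True means that the two members of a pair differ physically.
# Category D differs between the tapes, as described in the text.
PHYSICALLY_DIFFERENT = {
    1: {"A": True, "B": False, "C": True, "D": True},   # test tape 1
    2: {"A": True, "B": False, "C": True, "D": False},  # test tape 2
}
PAIRS_PER_CATEGORY = 20  # each of the four categories holds 20 pairs


def share_physically_equal(tape: int) -> Fraction:
    """Fraction of the 80 pairs whose members are physically identical."""
    categories = PHYSICALLY_DIFFERENT[tape]
    equal = sum(PAIRS_PER_CATEGORY for differs in categories.values() if not differs)
    return Fraction(equal, PAIRS_PER_CATEGORY * len(categories))


def share_heard_equal(tape: int) -> Fraction:
    """Same count, but category A is treated as sounding identical."""
    categories = PHYSICALLY_DIFFERENT[tape]
    equal = sum(
        PAIRS_PER_CATEGORY
        for name, differs in categories.items()
        if name == "A" or not differs
    )
    return Fraction(equal, PAIRS_PER_CATEGORY * len(categories))


for tape in (1, 2):
    print(tape, share_physically_equal(tape), share_heard_equal(tape))
```

Keeping the category definitions in one table makes the difference between the two tapes explicit: only the entry for category D changes.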
Thus, in test 1 the subjects will be confronted with physically equal stimuli in only one out of four cases, whereas in test 2 this happens in half the cases. On the other hand, if the assumption of perceptual equality holds, the subjects will actually hear equal stimuli in half the cases in test 1, and in three out of four cases in test 2. It may be of some interest to see if these different environments in which the test stimuli, i.e. the items in categories A and B, are presented will influence the test results.

2.3.2.2. Presentation of the test

Two groups of subjects participated in the experiment, and they were all native speakers of BE, both female and male. One group of subjects were 38 students of the University of Sussex, from various disciplines. Half of these did test 1 and half did test 2. The second group of subjects were 26 employees of the M.E.L. Equipment Company Ltd. in Crawley, ranging from secretaries to managers. Here, too, half did test 1 and half test 2. It was a point of interest to see if the two groups would perform in the same way, in spite of their different backgrounds.

Listening conditions for the two groups of subjects differed only in that the student group listened to the test tapes in semi-soundproof cubicles; this was unfortunately impossible in Crawley, where such cubicles were not available. All subjects listened to one of the test tapes through high-quality headphones and were asked simply to note down for each pair whether they thought the members of the pair were identical or not. They heard each of the eighty pairs only once. On the scoring forms, the verbal content of each of the items was given.

Since most of the subjects were not at all or hardly familiar with the concept of intonation, they all listened to a ten-minute introduction which was on tape as well as on paper. The introduction was meant to get the subjects accustomed to the quality of the resynthesized speech, to the test situation, to the notion of intonation and to the sort of stimuli they could expect in the test proper. The full text of this introduction is given in appendix 2.2. The introduction was then followed, after any possible questions had been asked, by the test proper. This took another twenty minutes. The student group of subjects were paid for their services. The M.E.L. group received no financial reward, since they did the test during working hours.

2.3.3. Statistical analysis of the results

2.3.3.1. Differences between groups of subjects

Since the object of the test was to see whether subjects could correctly determine if the two members of a pair of utterances were identical or not, the results have been analyzed in terms of 'correct' and 'incorrect' scores. Thus, for the category A-items, where the two members of a pair are different, an 'equal' response was incorrect, while an 'equal' response on category B-items was correct, since the two members of a pair were actually identical. Table 2.2. shows which responses were correct and incorrect, for each test category. Test category D is split up over tests 1 and 2, since in test 1 the pairs of utterances consisted of different members, whereas in test 2 the members were equal. Table 2.3. shows that in group 1 (the students) subjects made an average of 22.2 (28%) incorrect scores, and in group 2 (the M.E.L. group) an average of 23.4 (29%) incorrect scores. The difference between these means is not significant (t-test for 2 means: t = 1.1, df = 62, p = .29; or Mann-Whitney U-test: z = 1.04, p = .30).

Table 2.2. Definition of what are correct and incorrect scores for each test category, as a function of the response.

                Response
Category        Equal      Different
A               incorr     corr
B               corr       incorr
C               incorr     corr
D test 1        incorr     corr
D test 2        corr       incorr
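The scoring rule of table 2.2 reduces to one comparison: a response is correct when it matches the physical relation between the two members of the pair. The following is a minimal sketch of that rule; the function names and the string labels "equal" and "different" are illustrative choices, not taken from the original scoring forms.

```python
def pair_differs(category: str, test: int) -> bool:
    """Whether the two members of a pair differ physically."""
    if category in ("A", "C"):
        return True
    if category == "B":
        return False
    if category == "D":
        # Category D pairs differ on test tape 1 and are identical on tape 2.
        return test == 1
    raise ValueError(f"unknown category: {category!r}")


def is_correct(category: str, test: int, response: str) -> bool:
    """Score one response ('equal' or 'different') as in table 2.2."""
    if response not in ("equal", "different"):
        raise ValueError(f"unknown response: {response!r}")
    return (response == "different") == pair_differs(category, test)
```

Under this rule an 'equal' response to a category A item counts as incorrect, even though it is the response the experiment hopes for.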
Table 2.3. Incorrect scores per subject for the two groups of subjects.

                      Group 1    Group 2    Total
Number of subjects    38         26         64
Incorrect scores      844        608        1452
Average incorrect     22.2       23.4       22.7
Standard deviation    4.2        4.5        4.3
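The t value quoted for the two groups can be approximated from the summary figures in table 2.3 alone. The sketch below assumes the pooled-variance (equal-variance) form of the two-sample t-test; the text only says 't-test for 2 means', so this choice is an assumption, and the function name is illustrative. No p-value is computed.

```python
import math


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# Group 2 (M.E.L.) against group 1 (students), values from table 2.3.
t_value, df = pooled_t(26, 23.4, 4.5, 38, 22.2, 4.2)
print(round(t_value, 2), df)
```

With the rounded means and standard deviations of the table this gives a t of about 1.09 on 62 degrees of freedom, in line with the reported t = 1.1.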
Thus, on the average, subjects in one group did not make more incorrect scores than in the other group. In other words, I feel free to assume that in this particular test there is no reason to suspect any difference in performance between the two groups of subjects, and therefore the two groups will be treated as one in the discussion to follow.

2.3.3.2. Differences between tests 1 and 2

As mentioned before, the only difference between tests 1 and 2 is in the items in category D. As far as the actual test items, those in categories A and B, are concerned, the two tests are identical. Table 2.4. shows an average of 24.3 (30%) incorrect scores per subject in test 1 and 21.1 (26%) in test 2. The difference between these two means is significant (t-test for 2 means: t = 3.1, df = 62, p = .003; or Mann-Whitney U-test: z = 3.2, p = .0007). Thus, there seems to be little doubt that, on the average, subjects gave more incorrect responses in test 1 than they did in test 2.

Table 2.4. Incorrect scores per subject for tests 1 and 2.

                      Test 1     Test 2     Total
Number of subjects    32         32         64
Incorrect scores      777        675        1452
Average incorrect     24.3       21.1       22.7
Standard deviation    5.0        2.3        4.3

Table 2.5. gives the average number of incorrect scores per subject for tests 1 and 2, but only for the actual test items, categories A and B. Now, the average for test 1 is 17.9 and for test 2 18.7. This difference is not significant (t-test for 2 means: t = 1.5, df = 62, p = .15; or Mann-Whitney U-test: z = 1.3, p = .21).

Table 2.5. Incorrect scores per subject in test categories A and B, test 1 and test 2.

                      Test 1     Test 2     Total
Number of subjects    32         32         64
Incorrect scores      572        597        1169
Average incorrect     17.9       18.7       18.3
Standard deviation    2.4        1.9        2.2

The explanation for the significant result mentioned above is found in category D. In table 2.6. we can see that when we look at the total number of incorrect scores made in each category, the only conspicuous difference between tests 1 and 2 is in category D: 173 v. 35 incorrect responses. This is because in test 1 the items in category D have members with different pitch contours and the degree of difference has been varied from hardly audible to clearly audible. Therefore, I certainly expected more incorrect scores to be made here than in category D in test 2, where the members of each pair do not differ and are therefore essentially the same as those in category B. The conclusion, then, is that as far as the actual test items are concerned, i.e. those in categories A and B, there is no reason to assume that on the average subjects gave more incorrect responses in one test than they did in the other. In discussions about the items in categories A and B, the two tests will therefore be treated as one.

2.3.3.3. Overall test results

Having established that we can ignore differences between groups and, for categories A and B, between tests 1 and 2, it is time to take a look at the achievements of the subjects in general. Appendices 2.3. and 2.4. provide some useful surveys of overall test results. Appendix 2.3. gives the number of incorrect scores per subject, split up over the four test categories. Appendix 2.4. does the same, but for the number of incorrect scores per item. From these surveys, table 2.7. is derived. From this table we can see that generally the subjects have performed according to expectation.
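The percentages of table 2.7 follow directly from the counts per category in table 2.6, given 32 subjects per test and 20 items per category. The sketch below shows the arithmetic; the names are illustrative, and the only assumption is that categories A, B and C are pooled over both tests while category D is kept separate per test, as the text describes.

```python
SUBJECTS_PER_TEST = 32
ITEMS_PER_CATEGORY = 20

# Incorrect responses per category, as listed in table 2.6.
INCORRECT = {
    "test 1": {"A": 543, "B": 29, "C": 32, "D": 173},
    "test 2": {"A": 568, "B": 29, "C": 43, "D": 35},
}


def percent_incorrect(count: int, n_subjects: int) -> float:
    """Incorrect responses as a percentage of all responses given."""
    return round(100 * count / (n_subjects * ITEMS_PER_CATEGORY), 1)


def summary() -> dict:
    """Counts and percentages per category, pooled as in table 2.7."""
    rows = {}
    for category in "ABC":
        total = sum(INCORRECT[test][category] for test in INCORRECT)
        rows[category] = (total, percent_incorrect(total, 2 * SUBJECTS_PER_TEST))
    for test in INCORRECT:
        count = INCORRECT[test]["D"]
        rows[f"D {test}"] = (count, percent_incorrect(count, SUBJECTS_PER_TEST))
    return rows


for name, (count, percent) in summary().items():
    print(name, count, percent)
```

The same counts also add up to the totals of table 2.4: 777 incorrect scores in test 1 and 675 in test 2.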
Table 2.6. Number of incorrect responses per category, test 1 and test 2.

              Category
              A       B      C      D
Test 1        543     29     32     173
Test 2        568     29     43     35

Table 2.7. Number of incorrect responses per category, with percentages.

              Category
              A       B      C      D test 1    D test 2
Number        1111    58     75     173         35
Percentage    86.8    4.5    5.9    27.0        5.5

Categories B and C give the best indication of the subjects' performance. In category B there were no differences between the members of a pair, and subjects made incorrect scores in only 4.5% of the cases. In category C there were always differences and these, too, the subjects picked out correctly, making incorrect scores in 5.9% of the cases. Category D, as has been explained, is different in the two test versions. In test 1 the character of this group of items was exploratory and therefore subjects' reactions were unpredictable to a certain extent. In fact, in test 1 subjects made 27% incorrect scores and in test 2, where again there were no differences between the members of a pair, there were 5.5% incorrect scores, much like in category B. All this indicates that where matters were as clear as in categories B and C and category D of test 2, subjects performed quite well.

Category A is a different matter altogether. Here, we have the opposition close-copy versus original. Since it was hoped that subjects would not be able to hear the differences, there should have been 100% incorrect scores, ideally speaking. In effect, incorrect scores were made in 86.8% of the cases. This is still quite a lot when compared with the other categories, but not really enough. This becomes clear when we look at table 2.8. This table gives the observed number of responses which deviated from expectation, i.e. correct scores in category A and incorrect scores in categories B and C and category D in test 2. The scores in category D in test 1 are ignored here. From the total number of observed deviant scores, expected numbers for each of the categories are given by equal proportions, taking into account that the number of scores in category D is based on half the number of subjects as compared to the other categories. The discrepancies between observed and expected frequencies in this table are significant (Chi-square: χ² = 78.4, df = 3, p [...]

[...] .05), while all other pairwise differences between means are significant (Scheffé, p [...]

[Figure 3.7; axis label: SCALE VALUES]
Figure 3.7. Psychological continuum derived from the experimental results. For further explanation, see the text.
3.4.4. Discussion and conclusions

A first conclusion to be drawn is that, as in the close-copy acceptability test, the subjects who participated in the experiment have shown a remarkable capability in assessing the abstract attribute 'acceptability of pitch contour', as demonstrated in section 3.4.3. This experiment proves even more conclusively than the close-copy acceptability test described in the previous chapter that experiments of this type are perfectly feasible and yield reliable and consistent results. In other words, it is quite possible to test the adequacy of an explicit model of the melodic system of a language.

Secondly, the basic question that this experiment set out to answer, viz. whether the model outlined in this chapter can be used successfully in generating standardized pitch contours that are perceptually acceptable to native speakers of BE, can be answered with an emphatic and confident 'yes'. This is most clearly seen in figure 3.7: originals, PES- and FS-stylizations are equally acceptable to native speakers of BE, from a statistical point of view, while they are significantly more acceptable than the other two types of stylization: they form a homogeneous group, clearly set apart from the rest.

One might wonder why subjects do not score the originals closer to the top of the scale: figure 3.7. shows that on the average the originals get a '4' in terms of the original 5-point scale. This may be because subjects really found some originals better than others (scale values range from 1.644 to 3.252). Alternative explanations could be that the use of synthetic speech introduces a bias, or that the scaling task itself is a biasing factor: maybe subjects feel obliged to introduce variety in their scoring behaviour, or to score on the conservative side. However this may be, the subjects' scoring behaviour is consistent and the outcome of the experiment is not affected.
The mean scale value of the originals is the reference point and stylizations that do not differ significantly from this can be said to be as acceptable to native speakers of BE as original utterances. For a final word on the model, see section 5.6.5.
3.5. SUMMARY
In this chapter, a model is proposed that claims to give an account of the melodic system of which native speakers of BE make use. The model does not include any knowledge concerning what intonation patterns exist in BE, or of their functions. The model comprises eight melodically distinct pitch movements, which may take up any of three positions with respect to the syllable with which they are associated, and rules governing the ways in which these pitch movements may be merged to form complete pitch contours that are fully specified acoustically.
An experiment is described which tests whether the model can be used to generate perceptually acceptable pitch contours. The two main conclusions to be drawn from this experiment are that pitch contours generated in terms of the model are not distinguishable from original fundamental frequency curves, as far as perceptual acceptability is concerned, and that the subjects have shown themselves remarkably capable in assessing the abstract attribute 'acceptability of pitch contour'.
CHAPTER 4
Generating British English pitch contours automatically
4.1. INTRODUCTORY
In the previous chapter a melodic model of BE intonation was presented and it was demonstrated that the model can be applied successfully to generate pitch contours that are perceptually acceptable to native speakers of BE. It was emphasized in the introduction to that chapter that the model is a purely melodic one, lacking all 'knowledge' concerning intonation patterns that exist in English, and their possible functions. The input specifications to the model must therefore specify which syllables are to be provided with a pitch movement and which type of pitch movement this must be. It is assumed here that the native speaker has at his disposal an inventory of abstract intonation patterns, just as he has an inventory of abstract pitch movements that may be used. Each time he produces an utterance, he makes choices from this inventory; how these choices are made and on what grounds, is a matter that will not concern us here. It can be seen that the generalizing power of the model would be greatly enhanced if we could incorporate in it knowledge of possible intonation patterns. Rather than specifying a type of pitch movement for each syllable in an utterance that is to receive a pitch movement, a system with virtually no generalizing power at all, it would then suffice to indicate an intonation pattern, together with such information as is necessary to allow the intonation pattern to be realized as an actual pitch contour, correctly synchronized with the utterance. The notion of intonation pattern is, of course, not a new one. Indeed, it becomes difficult to understand how intonation can be used in human communication efficiently if we do not assume such an organization, or one similar to it. Thus, we encounter the concept in most attempts at a systematic description of the intonational system of a language. 
In the traditional descriptions of BE intonation we usually find the terms 'tones' or 'tunes' to refer to these higher-order organizations of intonational elements. From the point of view of the experimental phonetician it is again the acoustic signal that must be the starting point in a search for intonation patterns used by native speakers of a language. Only those intonation patterns that can be distinguished on the basis of the acoustic signal can be recognized as different patterns used by the speaker. The ideal method would therefore be to collect a large corpus of representative speech material and analyze it using the perceptual-acoustic methods outlined in the previous chapters. Such a procedure, however, is extremely time-consuming and, in the case of the present research project, it was prohibitively so. As a result, it was decided once more to look to Halliday's course in English intonation for help, thereby increasing our indebtedness to him still further.

Halliday (1970) recognizes seven different tones and this notion 'tone' corresponds roughly to the notion 'intonation pattern' employed here. The spoken examples of the various tones on the tape that comes with the intonation course, then, take the place of the corpus of speech material mentioned above. In fact, they make up a corpus themselves, but one that is already selected and pre-arranged. The corpus is also a limited one and, moreover, one that may be biased to some extent. By this I mean that the utterances are examples of the tones distinguished by Halliday and they are pronounced in accordance with his instructions. Consequently, how representative the corpus is of all the intonational possibilities of BE depends on how exhaustive and justified Halliday's seven-tone system is. O'Connor and Arnold in their 'Intonation of Colloquial English' (1973), for instance, distinguish ten different tone groups and a comparison between the two categorizations reveals many similarities, but also fairly fundamental differences. Thus, many intonation patterns find a place in both systems (for instance, O'Connor and Arnold's tone group 1, the low drop, corresponds roughly to Halliday's tone 1), but sometimes a pattern in one system finds no parallel in the other: O'Connor and Arnold's tone group 10, the 'terrace', does not exist in Halliday's system.
Quite often, the emphasis on certain patterns is different in the two systems; Halliday's tone 1 with rising pretonic is considered so important by O'Connor and Arnold that it is a separate tone group in their system: the long jump, tone group 6. Halliday himself, in fact, explicitly states that his seven-tone system is not necessarily the one-and-only correct categorization: 'There is nothing surprising in the fact that different books on English intonation give different numbers of tones. This does not mean that one or the other is wrong; it simply means that the authors are using different methods for explaining and teaching the language' (Halliday 1970, p. 8).

It must be understood, then, that the fact that use is made here of Halliday's seven-tone system and of the spoken realizations of its tones on tape does not mean that we accept his categorization as necessarily valid and correct. Quite likely, the present approach will ultimately yield a different inventory of intonation patterns. The main reason for this expectation is that in the present approach, different patterns can be distinguished only on melodic grounds; Halliday's tones are distinguished only partly for melodic reasons, while other criteria are applicability in the teaching situation and also the syntactic and semantic functions of the tones.
4.2. CHARACTERIZATIONS OF HALLIDAY'S TONES
4.2.1. Introductory

An attempt is made to give an explicit melodic characterization of each of Halliday's seven primary tones. These characterizations are based on acoustic and perceptual analysis of a fairly large number of examples of the tones on the tape that comes with Halliday's intonation course. The perceptual adequacy of these characterizations is experimentally verified. For a better understanding of the following the reader not familiar with Halliday's system is advised to reread the short description that was given in section 1.4.3.

4.2.2. Analysis
The source material consisted of the examples contained in study units 11, 12, 13, 15, 16, 17 and 18, i.e. those units dealing with the primary tones, but excluding tone sequences. These study units are subdivided into a number of sections containing examples of increasing complexity. From each of these sections, some four examples were chosen for analysis, by male as well as female speakers. First, they were analyzed in terms of LPC-parameters and the fundamental frequency curves were measured. Then, perceptually equivalent standardized approximations were made (PES-approximations, cf. section 3.4.2.2.). As before, these PES-approximations are pitch contours generated by the melodic model outlined in the previous chapter, but with some of the standard parameters allowed to be varied so as to permit a closer approximation to the original. In the previous chapter, only the standard slope and the final frequency were allowed to vary; for the present purpose, one more parameter was given this freedom, viz. the position of a pitch movement with respect to the syllable. This was done because in some instances this position in the originals (or rather the close-copy stylizations) seemed to differ consistently from the standard positions prescribed by the model and might be a perceptually important feature of the character of the tone. In addition, it was found that for some tones it was desirable to use pitch movements covering only one quarter of the range. Thus, a pitch movement coded 2110 is a steep fall covering only a quarter of the full range. In standard parameters, its slope would be -75 ST/s, its duration 40 ms and its range 3 ST. In this way, 26 examples were analyzed of tone 1, 22 of tone 2, 20 of tone 3, 18 of tone 13, 14 of tone 4, 12 of tone 5, and 12 of tone 53. Approximately half of these were spoken by a male speaker and half by a female speaker.
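The four-digit movement codes can be unpacked mechanically. The sketch below rests on a digit layout inferred from the codes quoted in this chapter (for example 2110, a steep quarter fall): first digit direction, second digit steep or gradual, third digit extent in quarters of the range, fourth digit position. That layout, the class and the function names are assumptions of this sketch, and the position digit is left uninterpreted. For steep movements the duration is taken as 40 ms per quarter of the range, so that the range equals slope times duration.

```python
from dataclasses import dataclass

STANDARD_SLOPE_ST_PER_S = 75.0  # standard slope of the model, per the text
QUARTER_DURATION_S = 0.040      # a steep quarter-range movement lasts 40 ms


@dataclass(frozen=True)
class PitchMovement:
    direction: str  # "rise" or "fall"
    rate: str       # "steep" or "gradual"
    quarters: int   # extent in quarters of the full range (1 to 4)
    position: int   # position digit, not interpreted in this sketch


def decode(code: str) -> PitchMovement:
    """Decode a four-digit movement code under the assumed digit layout."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected four digits, got {code!r}")
    direction, rate, quarters, position = (int(ch) for ch in code)
    if direction not in (1, 2) or rate not in (1, 2) or not 1 <= quarters <= 4:
        raise ValueError(f"unrecognised code {code!r}")
    return PitchMovement(
        direction="rise" if direction == 1 else "fall",
        rate="steep" if rate == 1 else "gradual",
        quarters=quarters,
        position=position,
    )


def steep_parameters(code: str, slope: float = STANDARD_SLOPE_ST_PER_S):
    """Signed slope (ST/s), duration (s) and range (ST) of a steep movement."""
    movement = decode(code)
    if movement.rate != "steep":
        raise ValueError("no standard slope is assumed here for gradual movements")
    duration = movement.quarters * QUARTER_DURATION_S
    signed_slope = slope if movement.direction == "rise" else -slope
    return signed_slope, duration, slope * duration


print(steep_parameters("2110"))
```

For the code 2110 this reproduces the figures given in the text: a slope of -75 ST/s, a duration of 40 ms and a range of 3 ST.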
On the basis of these PES-approximations, a general characterization was then derived for each tone, generalizing as much as possible the common elements. It is interesting to note that no systematic differences were found between the contours produced by the men and those by the women, except, of course, that women speak at a higher average pitch than men. Structure of the contours and slope, range and duration of the pitch movements did not appear to differ consistently. This observation is of course true only if frequency is plotted logarithmically (cf. section 1.7.4.), which is an argument in favour of the use of a logarithmic rather than a linear frequency scale when dealing with pitch in speech.

In the following sections, the characterization of each of the tones thus found will be given and commented upon. In the figures, the vertical lines represent, from left to right, the beginning of the utterance, the vowel onset of the first salient syllable of the pretonic, the vowel onset of the tonic syllable, and the end of voicing of the utterance. Each subdivision on the horizontal time scale corresponds to 40 ms.

4.2.3. Tone 1

As can be seen in figure 4.1, the common element in all four variants of tone 1 is a full steep fall 2142 on the tonic syllable. This fall starts 50 ms after the vowel onset of the tonic syllable, rather than 30 ms which is the value prescribed by the model; although 30 ms sounded quite acceptable, 50 ms was found to be slightly pleasanter in most cases. When there is no pretonic, this fall is immediately preceded by a half rise 1120, positioned so that there is a 30 ms interval between the end of the rise and the beginning of the fall; the contour starts at mid-level. This is variant (a). It was found that when the distance between the beginning of the utterance and the tonic syllable was shorter than 250 ms, this rise-fall combination had a tendency to sound too abrupt.
In that case, variant (b) is applied; the fall is now preceded by a gradual half rise 1220. When there is also a pretonic, there is an early half rise 1121 on the first salient syllable of the pretonic; in accordance with the model, this rise is completed at the vowel onset of the syllable. The contour then remains high until the fall on the tonic. This is variant (c). When the distance between the pretonic and the tonic (measured from vowel onset to vowel onset) exceeds 750 ms, the contour rapidly loses acceptability, because of the relatively long stretch of high declination. In that case, variant (d) is more acceptable: after the pretonic syllable is completed, a quarter fall 2113 is implemented. To avoid the necessity of specifying an extra reference point in the pretonic syllable, viz. the end of syllable voicing, this position is fixed at 250 ms after the vowel onset. The tonic fall is then preceded by a quarter rise 1111, which is completed at the tonic vowel onset.
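The choice between the four variants of tone 1 depends only on whether there is a pretonic and on two distances. A minimal sketch of that decision follows; the function name, the keyword arguments and the movement lists are illustrative, and the list for variant (d) assumes that it keeps the early pretonic rise of variant (c).

```python
# Movement codes per variant, in temporal order, as named in the text.
TONE1_MOVEMENTS = {
    "a": ["1120", "2142"],                  # half rise, then full steep fall
    "b": ["1220", "2142"],                  # gradual half rise instead
    "c": ["1121", "2142"],                  # early half rise on the pretonic
    "d": ["1121", "2113", "1111", "2142"],  # assumed to keep the 1121 rise
}


def tone1_variant(*, start_to_tonic_ms=None, pretonic_to_tonic_ms=None) -> str:
    """Pick variant a, b, c or d of tone 1 from the distances in ms.

    pretonic_to_tonic_ms is measured from vowel onset to vowel onset and is
    given only when the utterance has a pretonic.
    """
    if pretonic_to_tonic_ms is not None:
        return "d" if pretonic_to_tonic_ms > 750 else "c"
    if start_to_tonic_ms is None:
        raise ValueError("one of the two distances is required")
    return "b" if start_to_tonic_ms < 250 else "a"


print(tone1_variant(start_to_tonic_ms=200))
print(TONE1_MOVEMENTS[tone1_variant(pretonic_to_tonic_ms=900)])
```

The thresholds are used exactly as stated: variant (b) applies below 250 ms, variant (d) above 750 ms.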
[Figure 4.1: four panels, a to d, with the positions labelled PRETONIC and TONIC]

Figure 4.1. General characterization of Tone 1
For all four variants, the standard slope is 60 ST/s, giving a standard range of 9.6 ST. This is slightly less than the values specified in the model (75 ST/s and 12 ST, respectively). The final frequency is 70 Hz, rather than 65 Hz. Naturally, this final frequency is based on the average performance of the male speaker only; the standard slope of 60 ST/s is based on the average performance of both the male and female speakers. This observation goes for the other tones as well. Sometimes, an utterance may start immediately with the tonic or pretonic syllable; in those instances, the initial rise may only partly be there or it may not be there at all. For instance, the utterance 'everybody knows about it', where the first syllable is the tonic, will have only the fall 2142 on 'everybody'.

4.2.4. Tone 2

The common characteristic of the tonic movement of tone 2 is that the contour rises gradually over half the range, starting at the tonic, and then ends with a late steep half rise 1123 on the last syllable. In the model, the gradual rise starts at the vowel onset of the tonic syllable, so that the preceding half fall 2121 (when there is a pretonic) starts 80 ms before this vowel onset. However, this meant that the fall often began audibly late in the preceding syllable, lending that syllable undesirable pitch prominence. To avoid this, the gradual rise begins 40 ms after the tonic vowel onset, so that the preceding fall begins 40 ms before the vowel onset.

The final half rise 1123 is a perceptually important feature of this pattern. According to the model, this rise must end 30 ms before the end of the voiced part of the syllable; here, the end of syllable voicing coincides with the end of the rise. This, too, has a practical reason: if the tonic is the last word in the utterance, there is often not enough room. It can be seen that when the distance between tonic vowel onset and the end of final syllable voicing is 200 ms, there is exactly 160 ms available for both the gradual half rise and the final steep rise together. In this case, the two movements blend into a steep full rise 1143, as shown in figure 4.2.c. If the available space is still less, only part of this rise is realized; this may happen for instance in the utterance 'are you sure it's the right address?'.

When there is a pretonic, the contour starts at mid-level and has an early half rise 1121 on the first salient syllable of the pretonic. Immediately following this rise there is a gradual fall over half the range, 2220, continuing till 40 ms before the tonic vowel onset. At this point, a steep half fall 2121 is implemented, which ends 40 ms after the tonic vowel onset. The final frequency is 70 Hz. The standard slope for tone 2 is set at 90 ST/s, yielding a standard range of 14.4 ST. This range is considerably greater than that of tone 1 (9.6 ST).

[Figure 4.2: panels including c, with the positions labelled PRETONIC and TONIC]

Figure 4.2. General characterisation of Tone 2

4.2.5. Tone 3

Tone 3 is obviously very similar to tone 2. Here, too, the common characteristic of the tonic movement is a gradual half rise 1120, followed by a late steep half rise 1123 on the last syllable. As with tone 2, the gradual rise 1220 does not start exactly at the tonic vowel onset, but somewhat later, in order to avoid the occurrence of pitch prominence on the syllable preceding the tonic when there is a pretonic.
Since the fall used, 2141, is a full fall, extending over the entire range and lasting 160 ms, the gradual rise in this case starts 80 ms after the tonic vowel onset. For the same reason as with tone 2, the end of the final rise coincides with the end of syllable voicing of the last syllable, rather than 30 ms earlier. When the distance between the tonic vowel onset and the end of final syllable voicing is 240 ms, the gradual and steep half rises blend into one full steep rise 1143, as shown in figure 4.3.e; if the available time is even less, only the initial part of this rise is realized. It was found that when the distance between the end of syllable voicing of the final syllable and the beginning of the gradual rise increases, there is a tendency to decrease the range covered by the gradual rise, in favour of an increase of the range covered by the final steep rise. For that reason, when this distance exceeds 500 ms, the gradual half rise 1120 is replaced by a gradual quarter rise 1210, and the final half rise 1123 by a three-quarter rise 1133 (versions (b) and (d)).
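The timing rules for the tonic rise of tones 2 and 3 can be put together in one small function. This is a sketch under stated assumptions: the function name and return format are illustrative, the gradual half rise is written 1220 (the code used for it under tone 1), and the realized fraction of a blended rise is taken to be proportional to the time available, which the text does not spell out.

```python
STEEP_FULL_MS = 160             # duration of a full steep movement
RISE_DELAY_MS = {2: 40, 3: 80}  # start of the gradual rise after tonic vowel onset
LONG_TAIL_MS = 500              # tone 3 only: beyond this the rise is redistributed


def tonic_rise_plan(tone: int, onset_to_voicing_end_ms: int):
    """Return (movement codes, fraction of the rise that is realized).

    onset_to_voicing_end_ms is the distance from the tonic vowel onset to the
    end of voicing of the final syllable.
    """
    if tone not in RISE_DELAY_MS:
        raise ValueError("this sketch covers tones 2 and 3 only")
    available = onset_to_voicing_end_ms - RISE_DELAY_MS[tone]
    if available <= STEEP_FULL_MS:
        # Gradual and steep half rises blend into one full steep rise 1143;
        # with less room, only its initial part is realized.
        return ["1143"], max(available, 0) / STEEP_FULL_MS
    if tone == 3 and available > LONG_TAIL_MS:
        # Gradual quarter rise plus late steep three-quarter rise.
        return ["1210", "1133"], 1.0
    # Default: gradual half rise followed by a late steep half rise.
    return ["1220", "1123"], 1.0


print(tonic_rise_plan(2, 200))
print(tonic_rise_plan(3, 600))
```

With a distance of 200 ms for tone 2 or 240 ms for tone 3, exactly 160 ms remains and the two rises blend into the single movement 1143, as the text describes.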
[Figure 4.3: five panels, a to e, with the positions labelled PRETONIC and TONIC]

Figure 4.3. General characterization of Tone 3
When there is a pretonic, the contour starts at mid-level and has an early rise 1121 on the first salient syllable of the pretonic. After that, it follows the high declination line until 80 ms before the tonic vowel onset, where a full fall 2141 starts. This fall ends 80 ms after the tonic vowel onset. Without pretonic, the contour begins low, and follows the low declination line till 80 ms after the tonic vowel onset. The final frequency for tone 3 is 70 Hz. The standard slope is 70 ST/s, with a standard range of 11.2 ST.
4.2.6. Tone 4
The characteristic movements in tone 4 involve a rising movement reaching to the top level, immediately followed by a steep fall covering the full range, and finally a full steep late rise 1143 on the last syllable. We find here a difference between the versions without pretonic and with pretonic. Without pretonic, the contour begins low and follows the low declination line until it reaches the vowel onset of the tonic syllable. There, a full rise 1142 begins. On the other hand, when there is a pretonic, the contour is on the middle level when it reaches the tonic vowel onset and there a half rise 1122 begins. In both cases, the rise begins at the vowel onset. As a result, the fall comes relatively late in the syllable: when there is a pretonic, the fall starts 80 ms after the tonic vowel onset and when there is no pretonic, it even starts 160 ms after the vowel onset. It seems typical of tone 4 that the whole rise takes place in the tonic syllable rather than partly before it, which would be the case if the fall were to start 30 ms after the tonic vowel onset; it adds a sweeping sensation to the contour that seems quite characteristic.
When there is a pretonic, the contour starts at mid-level and has a half rise 1122 on the first salient syllable of the pretonic. Following this, there is a gradual half fall 2220 which extends to the vowel onset of the tonic syllable. Notice that the pretonic rise begins at the vowel onset of the first salient syllable, whereas in tones 1, 2 and 3 the pretonic rise is completed at that point. This is done for the same reason as in the tonic, viz. to obtain the sweeping sensation characteristic of the contour.
When the distance between the tonic vowel onset and the end of final syllable voicing is less than 400 ms (with pretonic) or 480 ms (without pretonic), which is nearly always the case when the tonic consists of only one or two syllables, there is not enough room for all pitch movements. In that case, the slope of the pitch movements is increased, so that the same range is covered in a shorter time. When this measure still does not save enough space, the range covered by the tonic pitch movements is decreased as much as is necessary. Figure 4.4.c. gives an example of this. The final frequency for tone 4 is 70 Hz. The standard slope is 75 ST/s, with a standard range of 12 ST.
4.2.7. Tone 5
The characteristic movements in the tonic of tone 5 strongly resemble those of tone 4: there is a rising movement which begins at the tonic vowel onset and which reaches to the top level. This rise is immediately followed by a full fall 2143. The main difference is that the contour now remains low; there is no final rise. There is, however, still another difference. It seems typical of tone 5 that the fall should not begin till after completion of the tonic syllable. Therefore, a gradual rise is applied which extends over 200 ms, rather than the standard steep rise. Again, there is a difference between the versions with and without pretonic. Without pretonic, the contour begins low and follows the declination line until it reaches the vowel onset of the tonic syllable. There, a full rise 1240 with a duration of 200 ms begins. The pretonic of tone 5 differs essentially from the other tones, involving, as it does, a gradual rise. After a start on the low level, there is a small rise 1111 which is finished at the vowel onset of the first salient syllable of the pretonic. There, a gradual half rise 1220 begins, which ends 80 ms before the tonic
vowel onset. It is followed by a steep half fall 2121, which is completed at the tonic vowel onset. The final frequency for tone 5 is 70 Hz. The standard slope is 90 ST/s, with a standard range of 14.4 ST.
Figure 4.5. General characterization of Tone 5 (panels a and b, each divided into PRETONIC and TONIC).
4.2.8. Tones 13 and 53
Tones 13 and 53 are simply sequences of their two component tones, and their specifications can be found under those tones. In both compound tones, tone 3 is without pretonic, by definition. For both compound tones, the final frequency is 70 Hz. For tone 13, the standard slope is 70 ST/s, giving a standard range of 11.2 ST. For tone 53, these values are 90 ST/s and 14.4 ST, respectively.
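The standard ranges quoted for the individual tones are consistent with one simple relation: range equals slope times the 160 ms duration of a full steep movement. The sketch below checks this for the slopes stated in this section. The dictionary layout and function name are illustrative assumptions; tone 1 is left out because only its range (9.6 ST) is quoted here, not its slope.

```python
# Duration of a full steep pitch movement, as stated for the full fall 2141.
FULL_STEEP_DURATION_S = 0.160

# Standard slopes in ST/s as quoted per tone in this section.
STANDARD_SLOPE_ST_PER_S = {"2": 90, "3": 70, "4": 75, "5": 90, "13": 70, "53": 90}


def standard_range_st(slope_st_per_s: float) -> float:
    """Range in semitones covered by a full steep movement at the given slope."""
    return round(slope_st_per_s * FULL_STEEP_DURATION_S, 1)


for tone, slope in STANDARD_SLOPE_ST_PER_S.items():
    print(f"tone {tone}: {slope} ST/s -> {standard_range_st(slope)} ST")
```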
4.3. AUTOMATIC GENERATION OF HALLIDAY'S TONES
Since the melodic model described in the previous chapter had already been implemented on the computer (cf. section 3.3.), it was an easy matter to adapt the program to automatic generation of Halliday's tones, in accordance with the rules given above. To provide an utterance with a pitch contour, it is now necessary only to provide the computer with the following specifications:
- what tone is to be used;
- the block number that corresponds to the vowel onset of the tonic;
- if there is a pretonic, the block number that corresponds to the vowel onset of the first salient syllable of the pretonic;
- if the tone involves a final rise, the block number that corresponds to the end of voicing of the final syllable.
After these four specifications have been given, the program provides the utterance with the corresponding contour fully automatically, after which it can be made audible at once. It should be noted that, even though in the characterization of the tones declination has not been mentioned and has been omitted from the schematized pictures, declination is most certainly present in the contours that are generated. Generating the same contours, but without allowing for declination, gives rise to unacceptable contours. In principle, the program generates the pitch contours in exactly the same way as described in section 3. The only difference is that now it need not be given information about which types of pitch movements are to be implemented, and in what position with respect to the associated syllable; this information is already available within the program itself.
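The four specifications listed above can be pictured as a small record with a few consistency checks. This is a hypothetical sketch, not the author's program: the class name, field names and validation rules are assumptions made for illustration, and the set of tones with a final rise is inferred from the tone descriptions in section 4.2. The sketch does not model how the compound tones specify their second tonic.

```python
from dataclasses import dataclass
from typing import Optional

TONES = {1, 2, 3, 4, 5, 13, 53}
# Tones whose description in section 4.2 includes a final rise (assumed set).
FINAL_RISE_TONES = {2, 3, 4, 13, 53}


@dataclass(frozen=True)
class ContourSpec:
    tone: int                                      # which tone is to be used
    tonic_onset_block: int                         # block of the tonic vowel onset
    pretonic_onset_block: Optional[int] = None     # first salient syllable, if any
    final_voicing_end_block: Optional[int] = None  # needed only for a final rise

    def __post_init__(self) -> None:
        if self.tone not in TONES:
            raise ValueError(f"unknown tone: {self.tone}")
        if self.tone in FINAL_RISE_TONES and self.final_voicing_end_block is None:
            raise ValueError("this tone has a final rise: end of voicing required")
        if (self.pretonic_onset_block is not None
                and self.pretonic_onset_block >= self.tonic_onset_block):
            raise ValueError("the pretonic must precede the tonic")
        if (self.final_voicing_end_block is not None
                and self.final_voicing_end_block <= self.tonic_onset_block):
            raise ValueError("end of voicing must follow the tonic vowel onset")


# Block numbers below are made-up illustration values.
spec = ContourSpec(tone=4, tonic_onset_block=60, pretonic_onset_block=12,
                   final_voicing_end_block=95)
print(spec)
```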
4.4. TESTING THE PERCEPTUAL ADEQUACY OF AUTOMATICALLY GENERATED PITCH CONTOURS
4.4.1. Introductory
Now that rules have been made for the automatic generation of pitch contours, another experiment is in order, to test whether the pitch contours thus produced are perceptually adequate. The similarities with the aim of the previous experiment stand out. There, the aim was to test the perceptual acceptability of the PES- and FS-stylizations; here, the aim is to test the perceptual acceptability of automatically generated pitch contours. Accordingly, the same experimental set-up and the same methods of statistical analysis to process the results are used. Since this set-up had worked well in the previous experiment, there seemed no reason why it should not do so now. Moreover, using an identical experimental set-up in both experiments would make a comparison of the results feasible. Nevertheless, a few minor changes with respect to the previous experimental set-up were made, notably in the number of test items.
4.4.2. Set-up of the experiment
4.4.2.1. Speech material
Since the rules for the pitch contours to be tested are based on Halliday's tone system, again a choice was made from Halliday's tape to serve as a basis for the speech material to be used in the experiment. As in the previous experiment, two examples of each of the seven tones were chosen, all spoken by a male speaker, with a duration between one and three seconds, and containing both a tonic and a pretonic segment. In addition, it was made a requirement that the material be fresh, i.e. that it had not been used in one of the earlier experiments and that it should not contain any of the 124 utterances analyzed to come to the general characterization of each tone (see section 4.2.2.). Therefore, the material was taken partly from chapter 1 (introduction) of Halliday's course, and partly from chapter 3 (secondary tones), rather than chapter 2 (primary tones), from which nearly all material used so far came. The fourteen selected utterances are listed in appendix 4.1. Of each of these fourteen utterances, five different versions were made, differing only in intonation. These will be discussed in the following sections. Seven examples, one for each tone, with all five versions, are given in appendix 4.1.
4.4.2.2. SATO-contours
SATO-contours are contours which are generated automatically in accordance with the rules and specifications given above in section 4.2. They are called SATO (same tone), because the choice of which of the seven available tones to use was determined by the tone that was used on Halliday's tape in pronouncing the utterance. Thus, if on Halliday's tape an utterance is pronounced with tone 5, it was now given a contour in accordance with the specifications given above for tone 5. In most cases, the resulting contours were clearly different from the original, but the expectation was that they would not be less acceptable. The differences were naturally greatest when the original was a secondary tone. For instance, the utterance 'in another twenty years or thereabouts' is pronounced on tape with a tone 3 with a low pretonic; the SATO-contour has a high pretonic (see section 4.2.5.).
4.4.2.3. DITO-contours
If the automatically generated patterns are to have any generalizing value, it must be possible to use essentially identical pitch contours with different utterances without a decrease in perceptual acceptability (cf. section 1.5.). Within tones, of course, this already happens: two different utterances are provided with the same basic pattern.
I wanted to see if this was also possible when the utterance had originally been pronounced with a different type of pattern. These contours are called DITO-contours (different tone) and they are also automatically generated in accordance with the rules and specifications given in section 4.2. Now, however, a different tone was applied than the one used on Halliday's tape. Table 4.1. shows which tone was used in the DITO-contour for each tone used on Halliday's tape. Thus, if on Halliday's tape an utterance is pronounced with tone 4, it was now given a contour in accordance with the specifications given above for tone 1.
Table 4.1. The tones used in the DITO-contours, for each tone used on Halliday's tape.
Tone used on tape    Tone used in DITO-contour
1                    4
2                    3
3                    2
4                    1
5                    1
13                   4
53                   2
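Table 4.1 amounts to a simple lookup from the tone on tape to the tone applied in the DITO-contour. A minimal sketch follows; the names `DITO_TONE` and `contour_tones` are hypothetical and serve only to make the mapping explicit.

```python
# Tone used on Halliday's tape -> tone used in the DITO-contour (Table 4.1).
DITO_TONE = {1: 4, 2: 3, 3: 2, 4: 1, 5: 1, 13: 4, 53: 2}


def contour_tones(tape_tone: int) -> dict[str, int]:
    """Return the tones used for the SATO- and DITO-contours of an utterance."""
    return {"SATO": tape_tone, "DITO": DITO_TONE[tape_tone]}


# By construction a DITO-contour never reuses the tone heard on tape.
assert all(tape != dito for tape, dito in DITO_TONE.items())
print(contour_tones(5))
```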
Theoretically, the resulting utterances, though obviously different from the original, should still be acceptable. Yet now a lower acceptability rating was anticipated, because of what might be called 'conflicting cues'. For instance, a tone 4 sounds livelier (more dynamic) than a tone 1; in the original, this liveliness is reflected not only in the pitch curve, but also in other speech parameters, such as amplitude variations and duration. In the DITO-contour, only the pitch is changed and all other parameters are left the same. Thus, the liveliness suggested by amplitude variations and durational make-up may be in conflict with the 'soberness' of the pitch contour, with a resulting decrease in acceptability. Sometimes, the consequences can be even more serious. For example, if an original tone 1-utterance is given a tone 4-contour, the final rise (present in tone 4, but absent in tone 1) may be all but inaudible, because of very low final amplitude in the original. Inversely, if an original tone 4-utterance is given a tone 1-contour, the end may sound quite unnatural, because there is a clearly audible increase in amplitude towards the end, without the expected concomitant change in pitch. Another disturbing factor that can be imagined is that the DITO-contour sounds unnatural with regard to the semantic contents of the utterance. I tried to counteract this to a certain extent by the choice of the tone to be used in the DITO-contours. Thus, if on Halliday's tape an utterance is pronounced with tone 5, I decided to use tone 1 for the DITO-contour, because I felt that this would keep the risk of a clash between semantic contents and tone used as low as possible. Admittedly, though, this was based mainly on intuition. In spite of all these considerations, it was decided not to start manipulating other speech parameters, such as amplitude and duration, for fear this might bias the results and make their interpretation more difficult and less reliable.
4.4.2.4. Other stylizations
The remaining three intonational versions were the same as in the previous experiment, viz. the ORIGINAL version, the DUTCH-contour and the WITTEN-contour. This was done because they give a good indication of the scoring behaviour of the subjects and also to make a good comparison with the results of the previous experiment possible. For descriptions of the DUTCH- and WITTEN-contours, see sections 3.4.2.4. and 3.4.2.5., respectively.
4.4.2.5. Composition of test tape
As in the previous experiment, there are fourteen test utterances and five intonational versions of each, totalling 70 different test items. In the previous experiment, each item was presented twice, which, after addition of ten dummy items, gave a test tape containing 150 stimuli. For the purposes of the present experiment it was decided to reduce this number, because several subjects had complained that the test was rather long, which might have negative effects on their powers of concentration (although this was not borne out in the results).
Therefore, a tape was constructed on which half the test items occurred twice and the other half only once. Thus, there are 105 stimuli, rather than 140, while it would still be possible to test the subjects' consistency of scoring behaviour. The test stimuli were preceded and followed by five dummy stimuli, so that the whole test tape consisted of 115 stimuli. Each of these stimuli was presented twice, with a 500 ms interval in between, after which subjects had three seconds to note down their reaction. Each stimulus was preceded by a short 500 Hz warning tone (cf. figure 3.6.).
4.4.2.6. Presentation of the test
The test was presented to 29 students of London University, all native speakers of BE, both male and female. They listened to the test tape over high quality earphones. The test proper was preceded by a ten-minute introduction, identical to the one used in the previous experiment. Including the introduction, the test lasted about 25 minutes. After the introduction, the tape was stopped and subjects could ask any questions they wanted. Their task was then to score the acceptability of each of the 115 items on a 5-point scale, 1 corresponding to least and 5 to most acceptable. The subjects were paid for their services.
4.4.3. Statistical analysis of the results
The results were processed statistically in the same way as in the previous experiment. Scale values were calculated for all 105 test items. They are given in appendix 4.2, together with the psychological continuum to which they refer. For the 35 items that were presented twice, a paired t-test showed that there was no significant difference between first and second presentations (t = -.845, df = 34, p = .40). Pearson's correlation coefficient shows that the scale values of first and second presentations are highly correlated (r = .98). On the basis of these observations, the definitive scale values of the 35 items with two presentations were calculated as the arithmetic mean of the two available scale values. The resulting values are given in table 4.2. For each of the fourteen test utterances, the scale values are given for each intonational version, from left to right. The item numbers are the same as those used in appendix 4.1, where the test utterances are given. To determine if the differences between the five conditions (intonational versions) are statistically significant, an analysis of variance (single factor design with repeated measurements on the same elements) was carried out. The ANOVA summary table and a table with all pairwise differences between means can be found in appendix 4.3. The differences between the mean scale values of the five conditions are found to be highly significant (F(4,52) = 35.78, p [...] .05), while all other pairwise differences between means are significant (Scheffé, p [...]
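The consistency check on repeated presentations (paired t-test, Pearson correlation, then averaging) can be sketched as follows. This is a generic illustration of the named statistics, not the procedure actually run for the thesis; the function names are hypothetical and the input values are made-up numbers, not the scale values of appendix 4.2.

```python
import math


def paired_t(first, second):
    """Paired t statistic and degrees of freedom for two matched lists."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n), n - 1


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def definitive_scale_values(first, second):
    """Arithmetic mean of the two presentations of each item."""
    return [(a + b) / 2 for a, b in zip(first, second)]


# Made-up illustration values for four items presented twice.
first = [1.0, 2.0, 3.0, 4.0]
second = [1.5, 2.5, 2.5, 4.5]
t, df = paired_t(first, second)
print(t, df, pearson_r(first, second), definitive_scale_values(first, second))
```

With 35 doubly presented items the same computation gives the 34 degrees of freedom reported in the text.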
Figure 5.1. A typical English standardized pitch contour (a) and a typical Dutch one (b).
5.3.4. Levels
In Dutch, the various pitch movements move from and to a higher and a lower declination level, and pitch movements covering only half the range are relatively rare. In BE, not only is the standard range between lower and upper declination level twice as large, viz. 12 ST rather than 6 ST, but half pitch movements also seem to be as frequent as full ones: all tones described in chapter 4 use at least as many half pitch movements as full ones. In BE, a description in terms of three rather than two levels thus seems appropriate: a clearly language-specific distinction.
5.3.5. Pitch movements
In view of the larger standard range between upper and lower declination levels in BE, and the greater importance of half pitch movements, it is not surprising that the total number of standard pitch movements is larger in BE than it is in Dutch: the present melodic model for BE recognizes 16 (or rather more, if one counts pitch movements covering a quarter of the range), the one for Dutch 10 distinct standardized perceptually relevant pitch movements. Moreover, BE pitch movements are both steeper and longer than their Dutch counterparts, as shown in table 5.1, which gives standard parameter values for steep pitch movements covering the full range, for Dutch and BE.
Table 5.1. Standard parameters of Dutch and BE standardized perceptually relevant pitch movements covering the full range compared.
           Slope      Duration   Range
Dutch      60 ST/s    100 ms     6 ST
English    75 ST/s    160 ms     12 ST
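To give a feeling for what the ranges in Table 5.1 mean in hertz, semitones can be converted to a frequency ratio with the standard relation f = f0 * 2^(ST/12). The sketch below is illustrative only: the 70 Hz reference is the final frequency quoted for the BE tones, and applying the same reference to Dutch is an assumption made purely for comparison.

```python
def shift_semitones(freq_hz: float, semitones: float) -> float:
    """Frequency reached when moving `semitones` up from `freq_hz`."""
    return freq_hz * 2 ** (semitones / 12)


LOW_HZ = 70.0  # final frequency quoted for the BE tones (assumed for both rows)

# (slope in ST/s, duration in ms, range in ST) from Table 5.1.
TABLE_5_1 = {"Dutch": (60, 100, 6), "English": (75, 160, 12)}

for language, (slope, duration_ms, range_st) in TABLE_5_1.items():
    # The tabulated range is slope times duration.
    assert slope * duration_ms / 1000 == range_st
    top = shift_semitones(LOW_HZ, range_st)
    print(f"{language}: {range_st} ST above {LOW_HZ} Hz is about {top:.1f} Hz")
```

A full BE movement thus spans an octave, whereas a full Dutch movement spans half an octave.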
As for the positions that pitch movements may take up with respect to the syllable with which they are associated, here the differences between Dutch and BE are much smaller. In both languages steep, full pitch movements, both rises and falls, may take up any of three positions with respect to the syllable: early, middle or late. The exact positions may differ by some tens of milliseconds one way or the other, but it must be emphasized that here, too, there is a certain tolerance in the standard values and this goes for both languages. Thus, shifting the positions of the standard pitch movements within certain limits will usually not affect either the identity or the acceptability of the pitch contour. Although experiments to back this up have not been carried out, I suspect that a set of standard positions with respect to the syllable could be devised that would yield satisfactory results for both BE and Dutch. As far as half, steep pitch movements are concerned, the Dutch melodic inventory recognizes only one rise and one fall, which occur relatively infrequently, so here a comparison between Dutch and BE is hardly possible. In Dutch, the half movement is always completed at the vowel onset of the associated syllable; in BE, the half movements may take up the full set of possible positions early, middle and late. In both languages, though, the half movements have the same slope as the full ones and half the duration. Finally, both languages recognize gradual pitch movements, extending over more syllables. In BE, gradual pitch movements may be either half or
full; Dutch only knows full ones. In both languages, the gradual movements start at the vowel onset of the first associated syllable.
5.3.6. Conclusion
In conclusion, there are both obvious similarities and obvious differences between the Dutch and BE melodic schemes. We will try to summarize both categories.
Similarities:
- standardized pitch contours in terms of a limited number of discrete perceptually relevant pitch movements represented by straight lines;
- pitch movements move up and down between a higher and a lower declination line;
- the standard declination slope is the same in both cases;
- pitch movements take up similar positions with respect to the syllable with which they are associated;
- pitch contours can be described in terms of the same set of parameters.
Differences:
- three levels in BE versus two in Dutch;
- pitch movements differ with respect to slope, duration and range;
- more pitch movements in BE than in Dutch;
- total pitch range covered much larger in BE than in Dutch.
It is important to realize that what we have been comparing are one set of standard parameters for Dutch and one for BE. Experience has shown that there is always a certain amount of tolerance in these parameters and that usually they can be given somewhat different values without affecting the identity or acceptability of the resulting pitch contours. If we look for instance at the results of the last experiment, they show that providing BE utterances with pitch contours that conform to Dutch melodic precepts may sometimes lead to surprisingly acceptable results, in spite of the large differences in standard values between Dutch and BE. However, more research would be needed to give an indication as to the limits of this tolerance for each parameter (see also section 5.6.5.). As stated, it is not yet possible to compare standardized pitch patterns, since this research has not yielded sufficient information about possible BE pitch patterns.
5.4. POSSIBLE WIDER IMPLICATIONS
If we are to look for possible wider implications in the results of this study, we must obviously concentrate on the similarities found between Dutch and BE, rather than the differences. At the same time, it is clear that linguistic similarities found between any two languages, both of them Germanic languages at that, cannot lead one to statements about universality. Therefore, what is said in the remainder of this section should be understood as possibilities and suggestions, and not as claims.
5.4.1. A suggestion
Generally speaking, it seems quite likely to me that it should be possible to draw up melodic models for the intonation of most if not all languages in the same terms as has now been done for both Dutch and BE. These models would have the following elements in common:
- a set of standardized perceptually relevant discrete pitch movements, represented by straight lines;
- a lower and an upper declination line between which these pitch movements go up and down.
Differences between languages might then concentrate on elements such as the following:
- the number of different pitch movements;
- their slope, duration and range;
- perhaps their possible positions with respect to the syllable;
- the absence or presence of half, or even quarter pitch movements;
- the rules dictating their combinability into pitch patterns.
If this were indeed true, it is an attractive notion that it would be possible to draw up such models for a number of the more important languages in the world and then compare them. Since the descriptions would be along the same lines and in exactly the same terms, such a comparison would instantly show up the differences between the languages and it would become very much easier to research, describe and teach the intonation of any individual language.
5.4.2. Evidence
The notion that it should be possible to draw up melodic models in these terms for more languages is based on more than the similarities found between Dutch and BE in this study. There is already some (informal) evidence from other languages, and there is physiological evidence that the notions of declination and discrete pitch movements may be more than handy devices yielding perceptually acceptable results, and may in fact be a reflection of what actually goes on in the articulatory domain, as explained in the following sections.
5.4.2.1. Declination
It seems hardly necessary nowadays to argue for the language-universality of declination (or downdrift, baseline, running-down pattern and whatever else the phenomenon is called). It has now been observed in many widely different languages, especially by those working with speech (re)synthesis. The discussion now seems to centre not so much on whether declination is an actual phenomenon, but more on its origin, possible function and shape. There are those who regard declination basically as a purely physiological manifestation, the result of gradually decreasing subglottal pressure, and those who claim that declination is under more active control of the speaker and may serve a variety of linguistic functions. Thorsen (1980), for instance, claims that in Danish the slope of the declination line is related to the syntactic sentence category. The results of this study do not support anything like Thorsen's findings for BE: one simple formula has been used to calculate the declination slope of all utterances used in the experiments, with very satisfactory results, and in this formula declination slope varies only with duration of the utterance. In addition, this formula is identical to the one that is used with equally favourable results to calculate declination slopes for Dutch pitch contours. And finally, the formula was inspired by the one that Maeda (1976) uses to calculate the slope of his 'baseline' in American English (see also section 5.4.2.3.). For an extensive survey and discussion of the current positions taken up on the subject of declination, the reader is referred to Cohen, Collier and 't Hart (1982).
5.4.2.2. Physiological evidence
The idea that a description of pitch patterns in terms of discrete, relatively fast pitch movements superimposed on a declination line may be a reflection of articulatory reality, is supported by physiological research. In this context I would like to draw attention in particular to Collier (1975). He pronounced one Dutch sentence with 15 different Dutch pitch contours, repeating each contour 20 to 30 times, and measured subglottal air pressure and the electromyographical activity of the following laryngeal muscles, all of which were known to participate in the control of fundamental frequency: the cricothyroid, the sternohyoid, the sternothyroid and the thyrohyoid. He found that the rises and falls in pitch patterns are primarily controlled by contraction and relaxation of one muscle only, the cricothyroid, and he concludes that 'the major differences between any two F0 contours appear to be related systematically to differences in the activity of this single muscle.' (p.254). He adds that, when this muscle is passive, F0 is controlled by the steadily decreasing subglottal air pressure, resulting in a steadily decreasing pitch. Evidently, this analysis gives strong physiological backing to an acoustic description of pitch patterns in terms of discrete pitch movements superimposed on a declination line.
5.4.2.3. Evidence from other languages
Evidence that American English (AmE) can be described in the same terms as has been done for Dutch and BE is provided by Maeda (1976). In his description of AmE he introduces three basic attributes: BL (for 'baseline'), R (for 'rise') and L (for 'lowering'). BL corresponds to what in this study is called declination and R and L to rises and falls, respectively. Figure 5.2. illustrates his system: on a baseline (a) is superimposed a sequence of pitch movements (b) to produce an artificial pitch pattern (c). Obviously, both this procedure and the resulting pitch pattern bear strong resemblances to what we find in both Dutch and BE. Maeda calls the type of pattern as shown in (c) 'hat-pattern', in imitation of the name Cohen and 't Hart (1967) gave to the pattern they found to be most frequent in Dutch, and claims that, also in AmE, it is the 'most basic and common pattern'. A perceptual experiment provided evidence for the viability of his approach and he concludes that 'it is, perhaps, safe to state that the speech quality is not significantly reduced by the rule-generated F0 contours' (cf. section 2.4.2.1.).
Figure 5.2. An illustration of Maeda's system to describe schematized AmE pitch patterns; (a) is the baseline, (b) is a sequence of pitch movements, (c) is the result when (a) and (b) are added.
A further point of interest in this context is that the formula that has been used in this study to calculate the slope of the declination line (see section 3.2.3.) was inspired by the formula that Maeda used for the slope of his baseline. The main difference is that Maeda not only uses a fixed final frequency, but also a fixed start frequency (although the values may differ from speaker to speaker); with the formula used in the present study the final frequency is fixed, but the start frequency gets higher as utterance duration increases. In both cases, however, the slope of the declination line gets smaller as utterances get longer.
Much more informal and tenuous 'evidence' comes from German and Italian. During the annual meeting of the 'Deutsche Gesellschaft für Sprachwissenschaft' (Berlin, March 1980), 't Hart presented a paper about the synthesis of stylized pitch contours. For this purpose he prepared some demonstration material using German speech. Not only did he find that applying the methods used in the study of Dutch and BE intonation was no problem at all, but the audience was also impressed with the high quality of the resulting German synthetic speech. For instance, it turned out to be quite easy to make close-copy stylizations and, just as was found for BE, the audience could not distinguish these from the corresponding original. In Italian, Bertinetto and Vivalda have been providing Italian synthetic speech utterances with Dutch hat-patterns in complete accordance with the rules for Dutch intonation, just to find out how this would sound. In a personal communication they let 't Hart know that they were surprised at the relatively high resulting acceptability.
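The declination behaviour described in this section (fixed final frequency, start frequency rising with utterance duration, slope shrinking as utterances get longer) can be illustrated with a short sketch. The formula of section 3.2.3 is not reproduced in this passage, so the form D = -a / (t + b) and the constants a = 11 and b = 1.5, which follow the declination formula usually quoted for Dutch, are assumptions here; the function names are hypothetical.

```python
FINAL_FREQUENCY_HZ = 70.0  # fixed final frequency quoted for the BE tones


def declination_slope(duration_s: float, a: float = 11.0, b: float = 1.5) -> float:
    """Declination slope in ST/s; assumed form D = -a / (t + b)."""
    return -a / (duration_s + b)


def baseline_hz(t_s: float, duration_s: float) -> float:
    """Lower declination line at time t_s, anchored at the fixed final frequency."""
    semitones_above_final = -declination_slope(duration_s) * (duration_s - t_s)
    return FINAL_FREQUENCY_HZ * 2 ** (semitones_above_final / 12)


def start_frequency_hz(duration_s: float) -> float:
    """Frequency of the lower declination line at the start of the utterance."""
    return baseline_hz(0.0, duration_s)


for duration in (1.0, 2.0, 3.0):
    print(f"{duration:.0f} s: slope {declination_slope(duration):.2f} ST/s, "
          f"start {start_frequency_hz(duration):.1f} Hz")
```

Under these assumptions a longer utterance gets a shallower slope but a larger total drop, hence a higher start frequency.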
5.5. POSSIBLE RELEVANCE TO THE THEORETICAL LINGUIST
There is a natural limitation as to what can be achieved with the approach to the study of intonation that has been employed in this study. As explained in the introductory chapter, the acoustic signal is taken as the starting point and, with as few linguistic preconceptions as possible, one starts working up through the speech chain. In principle, the task is mainly one of sifting and cataloguing observed pitch phenomena; the ultimate aim is to give an exhaustive account of all the melodic properties of the intonation of a language. Thus, only one aspect of intonation is dealt with, viz. its melodic structure. In terms of the speech chain, however, even if this were completed to perfection, one would still be no higher up than the phonetic or phonological component. This seems to me exactly where the experimental phonetician and the theoretical linguist should, but do not always, meet and support each other. It is just as difficult for the theoretical linguist to bridge the gap between his abstract phonological concepts and their concrete acoustic realization as it is for the phonetician to penetrate into the more abstract levels above the phonetic component: the specializations have drifted too far apart. Thus, once an inventory of possible pitch patterns in a language is available, the phonetician depends on the linguist to tell him which patterns should be applied to which sentences and exactly how the pattern should be lined up with the sentence. He will have to take into account things like pitch accents, semantic contents, syntactic structure, attitudinal meaning and perhaps even phonological structure. The phonetician will then be able to 'translate' this abstract representation into an actual acoustic realization, perceptually acceptable to the native speaker.
It is my conviction that one of the reasons why comparatively little headway seems to have been made so far towards a satisfactory linguistic theory of intonation is that there have never really been any objective, reliable and complete phonetic data on this elusive suprasegmental phenomenon. The melodic properties of a language have always been very difficult to get access to and researchers have had to rely on their personal impressionistic observations. Data collected in this fashion inevitably lack not only objectivity, but also specificity. It is to be hoped that through this type of experimental research the linguist will be provided with a reliable phonetic foundation on which a sound theory of intonation can be based.
5.6. SUGGESTIONS FOR FURTHER RESEARCH
5.6.1. Introductory
It is, perhaps, superfluous to point out that the research of which this is a report is by no means a finished piece of work; it is a beginning. It is therefore pleasant for the author to know that, as this is being written, another research project is already well under way which is a direct successor to the project reported on here. Nevertheless, I feel I would fall short in my duties if I did not include here a list of what I experience as the main shortcomings in this study, coupled with suggestions for further research. On the one hand, the experiments reported here have been limited in a number of ways, particularly the last one, and thus the need for follow-up experiments is indicated. On the other hand, exploratory research has suggested a number of interesting possibilities which, through lack of time, have not been worked out and systematically investigated and have, therefore, never reached the stage of experimentation. Some of the more promising of these will be mentioned in this final section.

One important restriction in this study is in the speech material used in the experiments, which has been limited in at least three ways. In the first place, only read-out, rather than spontaneous, speech has been used. Secondly, the material has been drawn almost exclusively from Halliday's intonation course. Thirdly, nearly all utterances used in the experiments are shorter than three seconds. It is clear that, for a complete and valid description of BE intonation, it will be necessary to work with spontaneous speech and longer utterances.

5.6.2. Spontaneous speech

Theoretically, there is a risk that the research methods that have worked well for shorter and carefully read-out samples of speech could fail to produce adequate results when applied to longer stretches of speech, taken from spontaneous conversation. However, preliminary try-outs have shown that any fear that this might be so is groundless. It appears perfectly feasible, for instance, to make close-copy stylizations of F0 curves of spontaneous utterances; in fact, the utterance 'Yes, it was in Sweden that I think the most embarrassing thing that ever happened to me occurred' in section 2.2.2, demonstrating how the technique of close-copy stylizations may be used to show up perceptually relevant elements in a pitch curve, was taken from spontaneous speech. Various other try-outs with samples of spontaneous speech, mostly recorded from radio broadcasts, have, moreover, shown that the melodic model presented in chapter 3 can be used to approximate pitch curves of samples of spontaneous speech just as well as samples of read-out speech. PES- and FS-stylizations (see chapter 3) are easily made and sound quite acceptable with spontaneous speech.

5.6.3. Longer stretches of speech

As far as working with longer stretches of speech is concerned, in principle this seems to introduce only one extra factor to be taken into account, which will be discussed below. For the rest, there seems no reason why this should prove any more difficult than working with short utterances, apart from purely practical matters that have to do with the larger amount of data to be handled. Thus, for instance, more computer memory is needed and the task will be more time-consuming. The one extra factor mentioned above is concerned with declination and, more particularly, the declination slope. This slope becomes smaller as the duration of the utterance increases, since otherwise in longer utterances the pitch at the beginning would become too high, or the pitch at the end too low, or both.
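The dependence of the declination slope on utterance duration can be sketched in a few lines. The formula itself is not restated in this section, so the sketch below assumes a duration-dependent rule of the kind commonly cited in the literature on Dutch intonation (slope = -11/(t + 1.5) ST/s up to five seconds, -8.5/t ST/s beyond that); this assumed rule reproduces the value of -0.85 ST/s mentioned in this section for a ten-second utterance. The function names and the part durations are hypothetical and serve only as illustration.

```python
def declination_slope(duration_s):
    """Declination slope in ST/s for an utterance of the given duration.

    Assumed rule (not restated in this section):
      t <= 5 s : -11 / (t + 1.5)
      t >  5 s : -8.5 / t
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if duration_s <= 5.0:
        return -11.0 / (duration_s + 1.5)
    return -8.5 / duration_s


def part_slopes(part_durations_s):
    """Slope of each part when every part gets its own declination line."""
    return [declination_slope(d) for d in part_durations_s]


# One declination line over a ten-second utterance.
whole = declination_slope(10.0)

# The same ten seconds divided into four parts (hypothetical durations).
parts = part_slopes([2.5, 3.0, 2.0, 2.5])

print(f"whole utterance: {whole:.2f} ST/s")
for i, slope in enumerate(parts, start=1):
    print(f"part {i}: {slope:.2f} ST/s")
```

Under this assumed rule every short part gets a markedly steeper slope than the single line over the whole stretch, which is the motivation for treating the parts independently.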
In consequence, however, when utterances get really long, the declination slope becomes so small that it is hardly noticeable and the result is loss of acceptability. To give an example, for an utterance with a duration of ten seconds, the standard declination slope is only -0.85 ST/s. The solution seems to be to divide longer stretches of speech up into parts of relatively short duration, each with its own declination line, whose slope is calculated independently of the others. That this approach can work will be demonstrated by means of part of a 'spontaneous monologue' from Halliday's intonation course (study unit 35). Of the pitch curve of this relatively long stretch of speech a PES-stylization was made. The text of this passage and the stylization are given in figure 5.3. Also indicated there are the breaks in the passage separating the parts which were treated independently as far as declination is concerned.

by the time the great central was built, the trains could manage the gradients much more easily / and the great central line usually went across the valleys instead of round them like the earlier railways. / so the distances were shorter and you got better views. / the whole of the way between nottingham and london there was nowhere where another railway line crossed overhead. / it was a fast line and very pleasant to travel on. / much the most interesting route to the north. / but for some reason it was neglected and all the trains have been withdrawn. / now if they'd decided to keep the track in good condition and run the trains at the speeds at which they could have been run, the line would have been very popular, / so I thought of a way of making use of it.

[Stylized pitch contour drawn above the text not reproduced.]

Figure 5.3. The PES-stylization of a longer stretch of speech. The slashes in the text indicate the breaks that separate the parts which were treated independently as far as declination is concerned.

The resulting stylized pitch contour was surprisingly acceptable, certainly for a first attempt. In this case, the positions of the breaks were determined ad hoc, by trial and error; the problem will be to find rules to do this. A break in the wrong position was found to sound very abrupt and unacceptable because the effect of starting a new declination line is a sort of resetting of F0 to a higher level. Figure 5.4 illustrates this. The first part of the standardized contour shown in (a) ends on the lower level; the second part, after the break, starts on the middle level, so that there is a virtual half-rise in between. However, since a 'resetting' of the declination line also takes place at the break, the actual rise is larger and may easily be from 65 to 130 Hz. This would be a full octave and, in terms of standard parameters, a full rise rather than a half one. Nevertheless, if the break is positioned appropriately, the listener will still interpret the rise as a half-rise, as intended. The effect of resetting the declination line is shown in (b), which gives the same contour as (a), but now with each part of the contour superimposed on its own declination line.

5.6.4. Pitch patterns
Another important aspect that this study has really only touched on is finding out what often recurring pitch patterns can or should be distinguished in BE, in other words, in what combinations the various pitch movements tend to occur. The seven pitch patterns described in chapter 4 are only a first step towards establishing such an inventory. It was already stated there that, although these patterns offer a reasonable variety, to what extent they are representative of all possible BE pitch patterns depends on how exhaustive and justified Halliday's seven-tone system is. In addition, only Halliday's seven primary tones with neutral pretonic have been used, whereas Halliday actually distinguishes a variety of secondary tones and pretonics.
Figure 5.4. A typical BE standardized pitch contour, showing the effect of a 'reset' of declination; (a) is the schematized pattern with declination omitted, (b) shows the pattern with declination included.
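The size of a reset such as the one shown in figure 5.4 can be checked numerically. The following minimal sketch converts between frequencies in Hz and intervals in semitones (ST), using the standard definition of twelve semitones to the octave; the function names are hypothetical. It confirms that a jump from 65 to 130 Hz amounts to a full octave.

```python
import math


def semitones(f_from_hz, f_to_hz):
    """Interval in semitones between two frequencies (positive = rise)."""
    return 12.0 * math.log2(f_to_hz / f_from_hz)


def shift(f_hz, interval_st):
    """Frequency reached from f_hz after moving by interval_st semitones."""
    return f_hz * 2.0 ** (interval_st / 12.0)


# The reset from 65 Hz to 130 Hz is exactly one octave (12 ST).
print(f"{semitones(65.0, 130.0):.1f} ST")

# Conversely, a rise of 12 ST starting at 65 Hz ends at 130 Hz.
print(f"{shift(65.0, 12.0):.1f} Hz")
```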
To achieve this aim, the work done on Dutch intonation can be taken as a lead, since for Dutch intonation an inventory of possible pitch patterns has already been worked out. In the Dutch system, standard pitch movements can be combined in certain prescribed ways to form three types of so-called 'intonational blocks', and a set of cyclic rules, or a grammar, dictates how these blocks may be combined to make up complete pitch contours. This grammar was devised on the basis of a large corpus of spontaneous speech material and in the end could account for 94% of all utterances in the corpus (Collier 1972; 't Hart and Collier 1975). On the basis of my experiences so far, I am confident that a similar approach should prove quite workable for BE.

In conclusion, I should like to give an example of how this might work out for the pattern that so far seems to be the most frequent one in BE. Basically, it is the pattern presented in chapter 4 as tone 1 (section 4.2.3), and what I regard as its most 'neutral' form is shown here in figure 5.5.a. In words: the contour starts at mid-level; there is a half-rise 1120 on the first accented syllable and a full fall 2140 on the last accented syllable. This 'neutral' form naturally only applies when there are two accented syllables. The pattern is similar to what in Dutch intonation has been termed the 'hat pattern' and which is shown in figure 5.5.b. There is the obvious similarity in shape, but there is also a similarity in function: in either language, this seems to be the pattern that is most frequently used and which can be said to be most nearly neutral in character. It is, in other words, the pattern that a speaker would choose if he simply wanted to convey a message, without any attitudinal or emotional overtones. Now the point is that, just as in Dutch there are a number of variations of the hat pattern, in BE there are a number of frequently occurring patterns which can be seen as variations on the pattern shown in figure 5.5.a. Two of these variations are shown in figure 5.6.

Figure 5.5. The basic shapes of the most frequent pitch patterns in BE (a) and in Dutch (b).

Figure 5.6. Two variations on the basic BE pitch pattern shown in figure 5.5.a.
Using the code for pitch movements explained in section 3.2.2.3 and conveniently omitting the last digit of the code, which denotes position of the pitch movement with respect to the syllable, these three BE patterns could be characterized as 112-214, 112-212-112-214, and 112-222-112-214, for figures 5.5.a, 5.6.a and 5.6.b, respectively. The pitch movements thus indicated are understood as being connected by declination. In a generalizing form, these three characterizations can be summarized in the following rule (rule 1):

    0 ( 112 { 212, 222 } ) 112 214        (rule 1)

where 0 stands for a mid-level start, everything between round brackets is optional and a choice must be made among anything included in braces. Since the rule does not specify position of pitch movements, but only type and order (an added convention should take care of position), the same rule covers the case of only one pitch accent, as shown in figure 5.7.
Figure 5.7. The shape of the basic pattern shown in figure 5.5.a. when there is only one pitch accent.
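As an illustration of how rule 1 can be read, the sketch below enumerates the contours it allows. It is only one reading of the rule under stated assumptions: the optional part is taken to be a half-rise 112 followed by a choice between 212 and 222, and the names `rule1_contours` and `OPTIONAL_CHOICES` are hypothetical. Position of the movements is left out, as in the rule itself.

```python
# Movement codes without the final position digit, as used in rule 1.
OPTIONAL_CHOICES = ["212", "222"]  # the alternatives inside the braces


def rule1_contours():
    """Return every movement sequence allowed by rule 1 (assumed reading).

    The leading '0' (mid-level start) is implicit in every contour and is
    therefore not listed among the movements.
    """
    contours = [["112", "214"]]  # optional part omitted
    for choice in OPTIONAL_CHOICES:
        contours.append(["112", choice, "112", "214"])
    return contours


for contour in rule1_contours():
    print("-".join(contour))
```

The three sequences printed correspond to the characterizations given for figures 5.5.a, 5.6.a and 5.6.b.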
Suppose that for three pitch accents the four patterns shown in figure 5.8 are found. In that case, a very simple addition to the rule would make it valid also for these patterns, at the same time recognizing their generally similar build-up (rule 2).

Figure 5.8. Four variations of the basic pattern shown in figure 5.5.a, valid when there are three pitch accents.

[Rule 2: rule diagram not recoverable; only its final element, 214, survives.]

In rule 2, the arrow indicates that a sequence of pitch movements may be repeated if necessary or desirable. Going one step further, it appears that with longer utterances in BE people tend to repeat this same pattern. Since the pattern ends low, but starts on the middle level, a half-rise must be inserted between any two repetitions. Sometimes this half-rise is audible as a 'continuation rise', and sometimes it is contrived so unobtrusively as to be virtually unrecognizable as a rise, in which case we might speak of a 'reset' to the middle level. Figure 5.9 gives two examples of this phenomenon.

ALAN'S IN CAMBRIDGE. [RESET] STUDYING BOTANY
THE OTHERS WERE SATISFIED. [RESET] BUT I DIDN'T LIKE IT

Figure 5.9. Two examples of pitch patterns where variations of the basic pattern shown in figure 5.5.a follow each other.
Rule 3 takes this possibility to repeat the basic pattern into account and is now capable of generating a large number of different contours, all of which can be regarded as variations of one basic pattern. This rule, of course, is meant as an illustration; the contours it generates certainly occur quite frequently in BE, but a more definite and complete version of the rule would undoubtedly be more complicated.

[Rule 3: rule diagram not recoverable; the surviving elements are 112, 214 and 112.]
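The need for a half-rise between repetitions of the basic pattern can be made concrete with a small level-tracking sketch. It follows the prose only, not the rule diagram. It assumes three pitch levels (low, mid, high), a half movement of one level and a full movement of two levels, and it assumes that the linking half-rise can be written as 112; all names are hypothetical.

```python
# Level change of each movement code (levels: 0 = low, 1 = mid, 2 = high).
# The step sizes are an assumption: half = one level, full = two levels.
STEP = {"112": +1, "212": -1, "222": -1, "214": -2}

BASIC = ["112", "214"]  # half-rise followed by full fall


def levels(moves, start=1):
    """Level reached after each movement, starting from mid-level."""
    reached, level = [], start
    for move in moves:
        level += STEP[move]
        reached.append(level)
    return reached


def is_valid(moves):
    """A contour is valid if it never leaves the low..high range."""
    return all(0 <= level <= 2 for level in levels(moves))


def repeat_pattern(pattern, times, link="112"):
    """Chain copies of a pattern, with a linking half-rise between copies."""
    chained = []
    for i in range(times):
        if i > 0:
            chained.append(link)  # 'continuation rise' or 'reset'
        chained.extend(pattern)
    return chained


print(is_valid(BASIC + BASIC))            # False: second copy starts low
print(is_valid(repeat_pattern(BASIC, 2)))  # True: linked by a half-rise
```

Without the linking half-rise the second copy would start from the low level and its full fall would drop below the range, which is why a rise or 'reset' to the middle level is required.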
To see if this type of pattern can be used on longer stretches of speech, the same 'spontaneous monologue' from Halliday's intonation course that was used earlier in this section was used. It was now provided with the pitch contour shown in figure 5.10. Generally speaking, this version, too, sounded reasonably acceptable.

5.6.5. Some final observations

I should like to conclude this chapter with a few observations meant to put the proposed model and characterizations of Halliday's tones into their right perspective. As observed in the introduction to chapter 3, the melodic model presented there is inspired partly by the existing model for Dutch intonation and partly it is based on empirical evidence obtained from an analysis of a large number of close-copy stylizations. It should be clear, however, that this is a melodic model of BE intonation and not the model. It is based on a more or less rough averaging of a limited amount of observed phenomena and found to yield perceptually acceptable results. This is not to say that a different model could not yield equally acceptable results. Specifically, no systematic research has been done into the tolerances in the standard parameters of the model. For instance, the standard slope of steep pitch movements is 75 ST/s in the present model, a value based on averaging observed phenomena, but it is quite possible that any standard slope between, say, 60 ST/s and 90 ST/s would hardly have affected the result. Very much the same can be said of all other standard parameters. For instance, the steep rise in the middle position starts 30 ms before the vowel onset of the syllable, whereas the steep fall in the middle position starts 30 ms after the vowel onset. Perhaps having them both start exactly on the vowel onset would do no great harm. Research into this has simply not come within the scope of this study.
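To give an impression of what the tolerances mentioned above mean in terms of timing, the following minimal sketch computes when a steep movement in the middle position starts and ends. It uses the values given in the text (75 ST/s, 30 ms before or after the vowel onset) and takes a full movement to span one octave, i.e. 12 ST; the function name, its parameters and the vowel onset at 1.0 s are hypothetical.

```python
def movement_interval(vowel_onset_s, direction, size_st,
                      slope_st_per_s=75.0, offset_s=0.030):
    """Start and end time (s) of a steep movement in the middle position.

    A rise starts offset_s before the vowel onset, a fall offset_s after it.
    The duration follows from the size of the movement and the slope.
    """
    if direction == "rise":
        start = vowel_onset_s - offset_s
    elif direction == "fall":
        start = vowel_onset_s + offset_s
    else:
        raise ValueError("direction must be 'rise' or 'fall'")
    return start, start + size_st / slope_st_per_s


# A full rise of 12 ST on a syllable whose vowel starts at 1.0 s,
# with the standard slope and with the two tentative tolerance limits.
for slope in (60.0, 75.0, 90.0):
    start, end = movement_interval(1.0, "rise", 12.0, slope_st_per_s=slope)
    print(f"{slope:.0f} ST/s: {start:.3f} s to {end:.3f} s")
```

With these values the duration of a full movement ranges from about 0.13 s at 90 ST/s to 0.20 s at 60 ST/s, against 0.16 s for the standard slope.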
Summarizing, the model is the reflection of the average performance of a number of native speakers of BE and does not pretend to be the one-and-only correct melodic model of BE intonation. Very much the same observations go for the melodic characterizations of Halliday's tones in chapter 4. They should not be seen as the seven contours making up BE intonation, but much less ambitiously as a set of contours that can be generated automatically and made to yield perceptually acceptable results. As pointed out in the introduction to chapter 4, there has been no room in the present research for an exhaustive cataloguing of contours based on an analysis of a representative corpus of speech material. As with the model proper, the standard values used in the characterizations are based on the average performance of a limited number of speakers in a limited number of utterances and there is bound to be a range of values each parameter could assume without affecting acceptability. Further research might, for instance, show that on purely melodic grounds there is no reason to distinguish between tones 2 and 3: given enough tolerance in the parameters, the two would simply blend into one. In view of all this I repeat that the present research is only a beginning and that a lot of research will have to be done to bring it to a successful completion.

by the time the great central was built, the trains could manage the gradients much more easily / and the great central line usually went across the valleys instead of round them like the earlier railways. / so the distances were shorter and you got better views. / the whole of the way between nottingham and london there was nowhere where another railway line crossed overhead. / it was a fast line and very pleasant to travel on. / much the most interesting route to the north. / but for some reason it was neglected and all the trains have been withdrawn. / now if they'd decided to keep the track in good condition and run the trains at the speeds at which they could have been run, the line would have been very popular, / so I thought of a way of making use of it.

[Standardized pitch contour drawn above the text not reproduced.]

Figure 5.10. A longer stretch of speech, provided with a standardized pitch contour consisting of variations of the basic pattern shown in figure 5.5.a. The slashes in the text indicate the breaks that separate the parts which were treated independently as far as declination is concerned.
References
Armstrong, L.E. and I.C. Ward (1931) A handbook of English intonation, Cambridge.
Bot, C.L.J. de (1982) Visuele feedback van intonatie, doctoral dissertation, University of Nijmegen.
Boves, L. and T. Rietveld (1978) Stylization of pitch contours, Proceedings of the Institute of Phonetics of the Catholic University in Nijmegen, 34-40.
Bruce, G. (1977) Swedish word accents in sentence perspective, Lund: Gleerup.
Cohen, A., Collier, R. and J. 't Hart (1982) Declination: construct or intrinsic feature of speech pitch?, Phonetica 39, 254-273.
Cohen, A. and J. 't Hart (1967) On the anatomy of intonation, Lingua 19, 177-192.
Collier, R. (1972) From pitch to intonation, doctoral dissertation, University of Leuven.
Collier, R. (1975) Perceptual and linguistic tolerance in intonation, International Review of Applied Linguistics in Language Teaching XIII/4, 293-308.
Collier, R. and J. 't Hart (1981) Cursus Nederlandse intonatie, Acco, Leuven.
Delgutte, B. (1976) Fundamental frequency contours of French: a perceptual study, MSc thesis, M.I.T.
Edwards, A.L. (1957) Techniques of attitude scale construction, Appleton-Century-Crofts, New York.
Fujisaki, H. and S. Nagashima (1967) A model for the synthesis of pitch contours of connected speech, Annual Report, Eng. Res. Inst. Faculty of Engineering University of Tokyo 28, 53-60.
Fujisaki, H. and K. Hirose (1982) Modelling the dynamic characteristics of voice fundamental frequency with application to analysis and synthesis of intonation, Proceedings of the XIIIth International Congress of Linguists, 57-70.
Geel, R.C. van (1983) Pitch inflection in electrolaryngeal speech, doctoral dissertation, University of Utrecht.
Grundstrom, A.W. (1972) Des formes acoustiques de l'intonation interrogative en français, Acta Universitatis Carolinae - philologica I, Phonetica Pragensia III, 97-104.
Halliday, M.A.K. (1970) A course in spoken English: intonation, Oxford University Press, London.
Hart, J. 't and A. Cohen (1973) Intonation by rule, a perceptual quest, Journal of Phonetics 1, 309-327.
Hart, J. 't and R. Collier (1975) Integrating different levels of intonation analysis, Journal of Phonetics 3, 235-255.
Hart, J. 't, Nooteboom, S.G., Vogten, L.L.M. and L.F. Willems (1982) Manipulations with speech sounds, Philips Technical Review 40, 134-145.
Isačenko, A.V. and H.J. Schädlich (1964) Untersuchungen über die Deutsche Satzintonation, Akademie-Verlag, Berlin.
Jones, D. (1962) An outline of English phonetics, Cambridge.
Lehiste, I. (1970) Suprasegmentals, the M.I.T. Press, Cambridge, Mass.
Lesmo, L., Mezzalama, M. and P. Torrazzo (1978) Int. J. Man-Machine Studies 10, 569-591.
Maeda, S. (1976) A characterisation of American English intonation, Ph.D. thesis, M.I.T. Press, Cambridge, Mass.
Martin (1978) Perception de séquences de contours prosodiques des phrases synthétisées, Actes des 9èmes JEP, Galf, Lannion, 23-29.
Mattingly, I. (1966) Synthesis by rule of prosodic features, Language and Speech 9, 1-13.
O'Connor, J.D. and G.F. Arnold (1973) Intonation of colloquial English, Longmans, London, second edition.
Pierrehumbert, J. (1981) Synthesizing intonation, J. Acoust. Soc. Am. 70(4), 985-995.
Rabiner, L.R. (1977) On the use of auto-correlation analysis for pitch detection, I.E.E.E. Trans. A.S.S.P. 25, 24-33.
Thorsen, N. (1980) A study of the perception of sentence intonation, evidence from Danish, J. Acoust. Soc. Am. 67(3), 1014-1030.
Torgerson, W.S. (1958) Theory and models of scaling, Wiley, New York.
Vaissière, J. (1971) Contribution à la synthèse par règles du Français, doctoral dissertation, Université de Grenoble.
Willems, N.J. (1982) English intonation from a Dutch point of view, doctoral dissertation, University of Utrecht, Foris Publications, Dordrecht.
Winer, B.J. (1962) Statistical principles in experimental design, McGraw-Hill, New York.
Witten, I.H. (1978) A flexible scheme for assigning timing and pitch to synthetic speech, Language and Speech 20, 240-260.
Appendices
APPENDIX 2.1

The twenty utterances used in the experiments testing the perceptual equality and the perceptual acceptability of close-copy stylizations (chapter 2) are the following:

1. is there any more news of the French elections
2. I was afraid I should be late
3. when's the best time to catch him do you suppose
4. you forgot to put a stamp on that letter
5. it would be an excellent idea
6. they both look alike to me
7. do you know why it doesn't work
8. when did they say they sent it
9. so that's why Arthur was looking so gloomy
10. is that a new house
11. try the yellow one on
12. try not to worry too much
13. where did you find my spectacles
14. why did he do it
15. keep the coffee hot for me
16. let's all go and wish him a happy birthday
17. you knew I didn't want you to do it
18. will there be rain tomorrow
19. would you like to borrow mine
20. where did you hide that thing
The next pages give graphical representations of all intonational versions (original, close-copy stylization and 'alternative' versions) of five of the utterances. For each utterance, the first picture gives the original fundamental frequency curve (dotted) and the corresponding close-copy stylization (solid). The subsequent pictures give the fundamental frequency curve and one of the alternative versions. Each of the utterances has been roughly segmented in time, the vertical bars corresponding to the positions of the vowel onsets of some of the more prominent syllables.
[Figures: fundamental frequency plots (F0 in Hz on a logarithmic axis from 50 to 500 Hz, against time t in s) for the utterances 'I was afraid I should be late', 'you forgot to put a stamp on that letter', 'it would be an excellent idea', 'keep the coffee hot for me' and one further utterance whose text is not recoverable.]
APPENDIX 2.2

This appendix gives the text of the introduction that the testees got before they started on the actual experiments. This introduction was also on tape, so that the testees read and listened to it at the same time.

'The test you are about to take uses synthetic speech, i.e. speech generated by a machine. The quality of this speech does not equal that of natural speech, but is still quite understandable. Listen to the utterance Try not to worry too much, first with normal speech, and then synthetic.

EXAMPLE: Try not to worry too much (2x)

As a second example, listen to the utterance You knew I didn't want you to do it.

EXAMPLE: You knew I didn't want you to do it (2x)

In this test you will hear pairs of English utterances. In some cases the members of each pair will be exactly identical, but in other cases there will be differences. Whenever the members of a pair do differ, they differ only with respect to intonation. By intonation we mean the variations of pitch in speech, or the speech melody. The following examples will give you a feeling for what intonation is. All examples use synthetic speech. In the next example, you will hear the utterance You forgot to put a stamp on that letter, first with normal intonation and then with no pitch variations at all, so that it is monotonous.

EXAMPLE: You forgot to put a stamp on that letter (2x)

In the next example, you will hear the utterance Where did you hide that thing, first with a gradually rising intonation and then with a gradually falling intonation:

EXAMPLE: Where did you hide that thing (2x)

You will now hear two examples of pairs of utterances that differ from each other only in their intonation. In the first example, the differences are big and can easily be heard. In the second example the differences are small and you may not hear them at all. Each example is given twice.

EXAMPLE: It would be an excellent idea (2x)
EXAMPLE: Keep the coffee hot for me (2x)

You are now ready to begin with the actual test. You will hear 80 pairs of utterances. All you have to do is indicate for each pair whether you think the members of the pair are identical or not. Remember that if there are differences, they are only in intonation, and that sometimes differences may be very small. You have about seven seconds' time after each pair to cross the column of your choice. One second before each pair, you will hear a short high tone to warn you. Moreover, as you can see, you have the text of the utterances in front of you. This should make it easier for you to concentrate on the intonation. If you have any questions at all, you can now ask them. Good luck and thank you for your cooperation.'
APPENDIX 2.3

This appendix gives a survey of the number of incorrect responses for each testee in the experiment testing the perceptual equality of close-copy stylizations (chapter 2). They are split up over the four test categories. There are two tables, one for each test version.

Test 1                       Test 2
Testee  Cat A  Total         Testee  Cat A  Cat B  Cat C  Cat D  Total
 1.     20     34            33.     18     1      1      0      20
 2.     19     32            34.     20     0      0      1      21
 3.     19     24            35.     19     0      1      0      20
 4.     18     24            36.     17     1      0      1      19
 5.     13     24            37.     18     0      1      0      19
 6.     14     22            38.     18     0      0      0      18
 7.     17     22            39.     15     1      0      1      17
 8.     14     21            40.     20     0      4      0      24
 9.     15     20            41.     19     0      2      0      21
10.     12     14            42.     11     2      0      5      18
11.     19     24            43.     18     1      0      1      20
12.     17     28            44.     16     0      0      2      18
13.     11     16            45.     15     4      0      2      21
14.     19     21            46.     19     2      0      0      21
15.     19     29            47.     19     0      2      0      21
16.     16     25            48.     19     3      1      3      26
17.     18     25            49.     13     6      0      4      23
18.     19     25            50.     20     0      1      0      21
19.     18     29            51.     19     0      2      0      21
20.     18     23            52.     16     0      3      3      22
21.      8     17            53.     13     2      0      5      20
22.     16     23            54.     17     1      0      0      18
23.     14     17            55.     20     0      4      0      24
24.     20     32            56.     19     0      1      0      20
25.     19     27            57.     20     0      3      0      23
26.     17     21            58.     20     1      1      1      23
27.     20     25            59.     16     0      1      2      19
28.     17     33            60.     20     1      1      0      22
29.     20     31            61.     19     2      1      2      24
30.     20     25            62.     17     2      0      0      19
31.     18     24            63.     19     0      2      0      21
32.     19     24            64.     19     0      11     2      32
Total   543    777           Total   568    29     43     35     675

Test 1 column totals: Cat A 543, Cat B 29, Cat C 32, Cat D 173, Total 777. [The per-testee entries for categories B, C and D of Test 1 are not recoverable.]
APPENDIX 2.4

This appendix gives a survey of the number of incorrect responses for each test item in the experiment testing the perceptual equality of close-copy stylizations (chapter 2). They are split up over the two test versions. There are four tables, one for each test category.

        Category A                  Category B
Item    Test 1  Test 2  Total       Test 1  Test 2  Total
 1.     24      25      49          5       2       7
 2.     32      30      62          1       0       1
 3.     19      21      40          1       2       3
 4.     25      31      56          3       1       4
 5.     29      30      59          3       2       5
 6.     24      22      46          1       2       3
 7.     29      31      60          0       2       2
 8.     29      29      58          1       2       3
 9.     29      28      57          0       3       3
10.     30      31      61          1       0       1
11.     28      29      57          0       2       2
12.     32      31      63          4       1       5
13.     25      29      54          1       2       3
14.     26      31      57          1       2       3
15.     27      30      57          0       0       0
16.     31      28      59          2       0       2
17.     20      24      44          1       1       2
18.     31      31      62          0       0       0
19.     29      28      57          2       1       3
20.     24      29      53          2       4       6
Total   543     568     1111        29      29      58

        Category C                  Category D
Item    Test 1  Test 2  Total       Test 1  Test 2  Total
 1.     2       3       5           1       4       5
 2.     2       1       3           2       1       3
 3.     10      13      23          2       1       3
 4.     0       1       1           1       2       3
 5.     0       2       2           6       1       7
 6.     0       2       2           0       2       2
 7.     1       0       1           13      2       15
 8.     1       1       2           4       2       6
 9.     0       1       1           27      1       28
10.     0       1       1           0       4       4
11.     0       3       3           0       0       0
12.     5       4       9           3       0       3
13.     0       0       0           0       1       1
14.     0       0       0           0       1       1
15.     5       2       7           29      1       30
16.     0       0       0           23      1       24
17.     1       3       4           9       5       14
18.     0       0       0           31      1       32
19.     4       5       9           19      2       21
20.     1       1       2           3       3       6
Total   32      43      75          173     35      208
APPENDIX 2.5

Scale values of each of the intonational versions of each of the twenty utterances used in the experiment testing the perceptual acceptability of close-copy stylizations (chapter 2), with the psychological continuum to which they refer.

                                   'Alternative' versions
Item    Originals  Close-copies    first     second
 1.     2.804      2.275           1.994     2.063
 2.     2.187      1.878           1.740     1.189
 3.      .971      1.304            .579      .998
 4.     2.853      2.650           1.740     1.419
 5.     2.804      2.341           2.341     2.457
 6.     2.927      2.927           2.110     2.853
 7.     1.361      2.143           2.032     1.419
 8.     1.878      1.189            .971     1.094
 9.     2.893      3.001           2.804      .884
10.     2.434      2.215           1.878     1.740
11.     2.032      2.143           2.407     2.701
12.     2.407      1.878           1.740     2.032
13.     2.804      2.539            .971     1.419
14.     2.853      2.495           2.804     1.878
15.     2.110      2.248           1.681     1.878
16.     1.602      1.189           1.189      .383
17.     2.341      2.248            .617      .596
18.     1.878      2.225            .808      .348
19.     2.804      2.341            .617      .383
20.     2.010      1.740           1.419     1.878

Mean    2.298      2.148           1.534 (all 'alternative' versions)
SD       .567       .505            .703 (all 'alternative' versions)

Third 'alternative' version (eleven utterances only; the item numbers are not recoverable): 1.878, 1.456, .426, 2.495, .596, 1.878, 1.740, 1.465, 1.740, 1.878, .644.

[Figure: the psychological continuum of scale values, with category labels 1 to 5 and marked points at 0.000, 0.426, 1.189, 1.678, 2.604 and 3.296.]
APPENDIX 2.6

Summary table of the analysis of variance as applied to the results of the experiment testing the perceptual acceptability of close-copy stylizations (chapter 2).

SOURCE                  SS        DF    MS       F
Between utterances      12.428    19
Within utterances       11.254    40
  intonation             6.411     2    3.205    25.148
  residual               4.843    38     .127
Totals                  23.682    59

Pairwise differences between the mean scale values found for the three conditions in the experiment testing the perceptual acceptability of close-copy stylizations (chapter 2); the differences marked with an asterisk (italics in the original) are significant at the .01 level; the other is not significant at the .05 level.

Condition       Mean     Originals   Close-copies   Alternative
                         2.298       2.148          1.534
Originals       2.298                .150           .764*
Close-copies    2.148                               .614*
Alternative     1.534
APPENDIX 3.1
The 14 test utterances used in the experiment testing the perceptual adequacy of the melodic model outlined in chapter 3 are the following:
1. It would be an excellent idea (tone 1)
2. They're just starting a regular helicopter service (tone 1)
3. Anybody at home (tone 2)
4. Can you lend me half a crown (tone 2)
5. I'll just take these letters to the post (tone 3)
6. Arthur likes to have it while he's there (tone 3)
7. You can go if you've finished (tone 13)
8. Valerie can't stand dogs (tone 13)
9. I wouldn't mind going there in summer (tone 4)
10. It's better than I expected (tone 4)
11. No wonder they don't grow (tone 5)
12. By all means say so (tone 5)
13. No-one will mind (tone 53)
14. Mind you I never intended to sell it (tone 53)
The next pages give graphical representations of all intonational versions of seven of the utterances, one for each tone. For each utterance, the first picture gives the original fundamental frequency curve (dotted) and the PES-stylization (solid); the second picture gives the PES-stylization (dotted) and the FS-stylization (solid); the third picture gives the original fundamental frequency curve (dotted) and the DUTCH-stylization (solid); the fourth picture gives the original fundamental frequency curve (dotted) and the WITTEN-stylization (solid). Each of the utterances has been roughly segmented in time, the vertical bars corresponding to the positions of the vowel onsets of some of the more prominent syllables.
[Figures: frequency (Hz) plotted against time t (s), four panels per utterance, for 'It would be an excellent idea' (tone 1), 'Can you lend me half a crown' (tone 2), 'Arthur likes to have it while he's there' (tone 3), 'You can go if you've finished' (tone 13), 'I wouldn't mind going there in summer' (tone 4), 'No wonder they don't grow' (tone 5) and 'No-one will mind' (tone 53).]
APPENDIX 3.2
Scale values of both presentations of each of the 70 test items used in the experiment testing the perceptual adequacy of the melodic model outlined in chapter 3, for the 'Brighton' group of testees and the 'Crawley' group of testees separately, with the psychological continua to which they refer.

'Brighton' group of testees

Test items   ORIG            PES             FS              DUTCH           WITTEN
 1.          2.476  2.454    3.026  2.652    3.262  3.353    2.052  2.097    2.119  1.513
 2.          2.234  2.499    1.884  2.202    2.167  2.192    1.726  1.748     .926   .676
 3.          2.065  2.364    1.964  2.369    2.261  2.143    1.765  1.469     .423   .393
 4.          3.026  3.262    3.000  3.262    2.863  3.074    1.611  2.239     .323   .344
 5.          1.120  2.065    1.071  1.347    1.862  2.277    1.862  2.065     .366   .434
 6.          2.413  2.791    2.185  2.176    2.095  2.362    1.938  1.748     .819   .955
 7.          3.053  3.207    3.262  2.529    2.895  2.948    1.938  1.515    1.720  1.781
 8.          2.811  2.712    2.838  2.932    2.975  3.113    3.327  2.682     .634   .515
 9.          3.096  3.014    2.565  2.659    2.865  2.903    1.869  2.483     .358   .393
10.          2.964  3.151    3.127  3.026    3.262  2.885    2.224  3.074     .351   .393
11.          3.327  3.312    3.213  3.168    3.297  3.280     .718   .678     .434   .375
12.          2.567  2.885    1.980  2.230    2.696  2.159     .976   .634     .569   .402
13.          2.418  2.319    2.167  2.319    2.659  2.405     .957  1.142     .458   .550
14.          3.064  2.827    2.483  2.378    2.696  2.659    2.052  2.319     .393   .366

Mean         2.697           2.501           2.701           1.818            .678
SD            .467            .552            .427            .644            .496

SCALE VALUES   0.000 |--1--| 0.634 |--2--| 1.557 |--3--| 2.319 |--4--| 3.262 |--5--| 3.747
'Crawley' group of testees

Test items   ORIG            PES             FS              DUTCH           WITTEN
 1.          2.345  2.215    2.650  1.997    2.990  2.937    1.997  1.997    1.810  1.022
 2.          2.287  2.577    2.215  1.997    2.432  2.432    1.577  1.437     .608   .473
 3.          2.287  1.997    2.868  2.171    2.937  2.723    1.837  2.142     .426   .355
 4.          2.990  3.068    2.990  2.990    2.868  2.990    2.345  2.246     .355   .327
 5.          1.230  2.287    1.161  1.299    2.287  2.215    1.437  1.624     .387   .426
 6.          1.885  2.287    1.661  1.717    1.997  1.810    1.717  1.437     .532   .726
 7.          2.723  2.868    2.868  2.577    2.990  2.142    1.299  1.299    1.082  1.230
 8.          2.519  2.937    2.868  2.868    2.937  3.068    2.990  2.577     .963   .746
 9.          2.694  2.619    2.432  2.384    2.432  2.868    2.287  2.650     .355   .355
10.          2.577  2.121    2.577  2.432    2.519  2.432    2.287  2.937     .327   .426
11.          2.868  2.868    2.937  2.694    2.868  2.868     .815   .976     .426   .355
12.          2.650  2.937    1.997  2.577    2.519  2.345     .963  1.126     .774   .355
13.          2.345  1.997    1.624  1.773    2.937  2.215    1.230  1.022     .532   .473
14.          2.723  2.215    2.121  1.885    2.519  2.370    2.142  2.432     .387   .355

Mean         2.496           2.298           2.595           1.815            .593
SD            .360            .508            .308            .616            .329

SCALE VALUES   0.000 |--1--| 0.608 |--2--| 1.437 |--3--| 1.997 |--4--| 2.868 |--5--| 3.420
APPENDIX 3.3
Definitive scale values of each of the 70 test items used in the experiment testing the perceptual adequacy of the melodic model outlined in chapter 3, for the 'Brighton' group of testees and the 'Crawley' group of testees separately, with the psychological continuum to which they refer.

'Brighton' group of testees

Items   ORIG    PES     FS      DUTCH   WITTEN
 1.     2.465   2.839   3.308   2.075   1.816
 2.     2.367   2.043   2.180   1.737    .801
 3.     2.215   2.167   2.202   1.617    .408
 4.     3.144   3.131   2.969   1.925    .334
 5.     1.593   1.209   2.070   1.964    .400
 6.     2.602   2.181   2.229   1.843    .887
 7.     3.130   2.896   2.922   1.727   1.751
 8.     2.762   2.885   3.044   3.005    .575
 9.     3.055   2.612   2.884   2.176    .376
10.     3.058   3.077   3.074   2.649    .372
11.     3.320   3.191   3.289    .698    .405
12.     2.726   2.105   2.428    .805    .486
13.     2.369   2.243   2.532   1.050    .504
14.     2.946   2.431   2.678   2.186    .380

Mean    2.697   2.501   2.701   1.818    .678
SD       .467    .552    .427    .644    .496

SCALE VALUES   0.000 |--1--| 0.634 |--2--| 1.557 |--3--| 2.319 |--4--| 3.262 |--5--| 3.747
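The Mean and SD rows can be recomputed from the item values of a column. The following is a minimal sketch for the ORIG column of the 'Brighton' group; it assumes that the SD is the sample standard deviation (n - 1 in the denominator), which the source does not state explicitly, and the variable names are illustrative.

```python
import statistics

# ORIG column of the 'Brighton' group, items 1-14, as listed in this appendix.
orig_brighton = [2.465, 2.367, 2.215, 3.144, 1.593, 2.602, 3.130,
                 2.762, 3.055, 3.058, 3.320, 2.726, 2.369, 2.946]

mean = statistics.mean(orig_brighton)
# Assumption: sample standard deviation (n - 1 denominator).
sd = statistics.stdev(orig_brighton)

print(f"mean = {mean:.3f}, SD = {sd:.3f}")
```

Under that assumption the result agrees with the Mean and SD listed for ORIG to three decimals.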
'Crawley' group of testees

Items   ORIG    PES     FS      DUTCH   WITTEN
 1.     2.280   2.324   2.964   1.997   1.416
 2.     2.432   2.106   2.432   1.507    .541
 3.     2.142   2.520   2.830   1.990    .391
 4.     3.029   2.990   2.929   2.296    .341
 5.     1.759   1.230   2.251   1.531    .407
 6.     2.086   1.689   1.904   1.577    .629
 7.     2.796   2.723   2.566   1.299   1.156
 8.     2.728   2.868   3.003   2.784    .855
 9.     2.657   2.408   2.650   2.469    .355
10.     2.349   2.505   2.476   2.612    .377
11.     2.868   2.816   2.868    .896    .391
12.     2.794   2.287   2.432   1.045    .565
13.     2.171   1.699   2.576   1.126    .503
14.     2.469   2.003   2.445   2.287    .371

Mean    2.496   2.298   2.595   1.815    .593
SD       .360    .508    .308    .616    .329

SCALE VALUES   0.000 |--1--| 0.608 |--2--| 1.437 |--3--| 1.997 |--4--| 2.868 |--5--| 3.420
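The definitive scale values appear to be the average of the two presentations of each item given in Appendix 3.2; the source does not say so explicitly, so this is an inference from the figures. The sketch below checks it for the first three ORIG items of the 'Crawley' group; the names are illustrative only.

```python
# Scale values of the two presentations of ORIG items 1-3 ('Crawley' group,
# Appendix 3.2) and the corresponding definitive values (Appendix 3.3).
presentations = {1: (2.345, 2.215), 2: (2.287, 2.577), 3: (2.287, 1.997)}
definitive = {1: 2.280, 2: 2.432, 3: 2.142}


def mean_of_presentations(pair):
    """Average the scale values obtained for the presentations of one item."""
    return sum(pair) / len(pair)


averaged = {item: mean_of_presentations(pair)
            for item, pair in presentations.items()}

for item in presentations:
    print(f"item {item}: averaged {averaged[item]:.3f}, "
          f"listed {definitive[item]:.3f}")
```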
APPENDIX 3.4
Scale values of both presentations of each of the 70 test items used in the experiment testing the perceptual adequacy of the melodic model outlined in chapter 3, for both groups of testees together, with the psychological continuum to which they refer.

Test items   ORIG            PES             FS              DUTCH           WITTEN
 1.          2.484  2.431    2.974  2.575    3.233  3.291    2.059  2.093    2.073  1.387
 2.          2.252  2.514    1.961  2.197    2.324  2.284    1.714  1.714     .836   .604
 3.          2.136  2.321    2.007  2.361    2.332  2.364    1.874  1.645     .419   .381
 4.          3.090  3.258    3.079  3.233    2.855  3.156    1.743  2.284     .326   .337
 5.          1.145  2.143    1.085  1.349    1.997  2.284    1.773  1.997     .367   .428
 6.          2.322  2.706    2.083  2.120    2.099  2.284    1.915  1.710     .738   .900
 7.          3.013  3.163    3.205  2.547    2.995  2.783    1.848  1.481    1.546  1.657
 8.          2.777  2.777    2.844  2.936    2.984  3.219    3.280  2.665     .786   .541
 9.          3.037  2.957    2.591  2.636    2.800  2.932    1.931  2.523     .354   .381
10.          2.910  2.898    3.021  2.928    3.108  2.824    2.247  3.085     .343   .395
11.          3.258  3.246    3.205  3.094    3.233  3.219     .746   .777     .428   .367
12.          2.659  2.936    1.989  2.284    2.689  2.217     .976   .800     .587   .388
13.          2.437  2.284    2.079  2.217    2.711  2.390    1.008  1.123     .467   .527
14.          3.021  2.744    2.437  2.284    2.689  2.629    2.083  2.381     .388   .361

Mean         2.676           2.476           2.712           1.839            .654
SD            .438            .541            .384            .625            .454

SCALE VALUES   0.000 |--1--| 0.623 |--2--| 1.596 |--3--| 2.284 |--4--| 3.205 |--5--| 3.695
APPENDIX 3.5
Summary table of the analysis of variance as applied to the results of the experiment testing the perceptual adequacy of the melodic model outlined in chapter 3.

SOURCE                  SS       DF     MS        F
Between utterances       7.021   13
Within utterances       50.981   56
   intonation           42.022    4     10.505    60.974
   residual              8.959   52       .172
Totals                  58.002   69
Pairwise differences between the mean scale values found for the five conditions in the experiment testing the perceptual adequacy of the melodic model outlined in chapter 3; the differences indicated in italics are significant at the .01 level; the others are not significant at the .05 level.
Condition   Mean     ORIG     PES      FS       DUTCH    WITTEN
                     2.676    2.476    2.712    1.839     .654
ORIG        2.676    -         .200    -.036     .837    2.022
PES         2.476             -        -.236     .637    1.822
FS          2.712                      -         .873    2.058
DUTCH       1.839                               -        1.185
WITTEN       .654                                        -
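Each entry in the table of pairwise differences is the mean of one condition minus the mean of a condition listed after it, so the whole table can be recomputed from the five means. The sketch below does that; the function name and the dictionary layout are illustrative assumptions, and no significance testing is attempted.

```python
from itertools import combinations

# Mean scale values of the five conditions, in the order used in the table.
means = {"ORIG": 2.676, "PES": 2.476, "FS": 2.712,
         "DUTCH": 1.839, "WITTEN": 0.654}


def pairwise_differences(condition_means):
    """Return {(a, b): mean[a] - mean[b]} for every pair in listing order."""
    return {(a, b): condition_means[a] - condition_means[b]
            for a, b in combinations(condition_means, 2)}


diffs = pairwise_differences(means)
for (a, b), d in diffs.items():
    print(f"{a:>6} - {b:<6} = {d:+.3f}")
```

A negative entry, as for ORIG minus FS, simply means that the second condition has the higher mean scale value.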
APPENDIX 4.1
The 14 test utterances used in the experiment testing the perceptual adequacy of automatically generated pitch contours (chapter 4) are the following:
1. Arthur and Jane left for Italy this morning (tone 1)
2. it's the only place I've ever lived in (tone 1)
3. do they take the car when they go abroad (tone 2)
4. have you brought your tape recorder (tone 2)
5. in another twenty years or thereabouts (tone 3)
6. it isn't worth arguing with them (tone 3)
7. I didn't mean to come home as late as this (tone 4)
8. not unless he's willing to apologize (tone 4)
9. I didn't know they'd ever been to Italy (tone 5)
10. so that's why you said you were too busy (tone 5)
11. I thought I saw a kingfisher down by the stream just now (tone 13)
12. you can still see the old lighthouse across the harbour (tone 13)
13. he's never taken Jane on any of his visits though (tone 53)
14. he didn't come home last night (tone 53)
The next pages give graphical representations of all intonational versions of seven of the utterances, one for each tone. For each utterance, the first picture gives the original fundamental frequency curve (dotted) and the SATO-stylization (solid). The second picture gives the original fundamental frequency curve (dotted) and the DITO-stylization (solid); the third picture gives the original fundamental frequency curve (dotted) and the DUTCH-stylization (solid); the fourth picture gives the original fundamental frequency curve (dotted) and the WITTEN-stylization (solid). Each of the utterances has been roughly segmented in time, the vertical bars corresponding to the positions of the vowel onsets of some of the more prominent syllables.
[Figures: frequency (Hz) plotted against time t (s), four panels per utterance, for 'Arthur and Jane left for Italy this morning' (tone 1), 'do they take the car when they go abroad' (tone 2), 'in another twenty years or thereabouts' (tone 3), 'I didn't mean to come home as late as this' (tone 4), 'I didn't know they'd ever been to Italy' (tone 5), 'I thought I saw a kingfisher down by the stream just now' (tone 13) and 'he's never taken Jane on any of his visits though' (tone 53).]
APPENDIX 4.2
Scale values of all presentations of each of the 70 test items used in the experiment testing the perceptual adequacy of automatically generated pitch contours (chapter 4), with the psychological continuum to which they refer.
Test items 1. 2. 3.
ORIG
SATO
DITO
DUTCH
WITTEN
2.675 2.520 2.703
2.573
2.122 1.697 2.071
2.401
1.085 1.313 1.606
3.216 3.464 2.860
4.
.982 1.024 1.616
2.478 2.678 .876
5. 6. 7.
1.218 1.187 2.969
3.173 3.385 3.042
8.
2.233 2.046 2.920
3.504 3.520 3.563
9. 10. 11.
3.464 3.441 2.172
2.747 2.675 2.625
12. 13. 14.
3.550 3.536 2.703
Mean SD
2.852 .682
> 0.000
2.707 2.637 2.414
1
I 0.554
2.244 2.605 2.573
2.133 2.468 .771
1.336 .993 1.959
2.478 2.099 2.393
2.172 2.115 2.468
2.252 2.747 3.323
2.860 3.114 1.521
2.172 2.252 2.520
2.071 2.172 1.906
1.187 1.060 .984
2.304 .674
2
1.028 1.018 2.155
1.959 2.101 .908
2.046 2.187
1 1.5B7
2.393 2.187 1.398
1.356 1.616
1.953 .741
3
1.930 .551
1 2.350
4
.535 .422 .321 1.028 1.313 .446 .638 .655 .365 .446 .401 .472 1.697 1.606 1.450 .287 .277 .382
.778 .515
1 3.388
5
1 3.B5B
SCALE VALUES
APPENDIX 4.3
Summary table of the analysis of variance as applied to the results of the experiment testing the perceptual adequacy of automatically generated pitch contours (chapter 4).

SOURCE                  SS       DF     MS       F
Between utterances      14.709   13
Within utterances       44.121   56
   intonation           32.362    4     8.090    35.776
   residual             11.759   52      .226
Totals                  58.830   69
All pairwise differences between the mean scale values found for the five conditions in the experiment testing the perceptual adequacy of automatically generated pitch contours (chapter 4); the differences indicated in italics are significant at the .01 level; the others are not significant at the .05 level.
Condition   Mean     ORIG     SATO     DITO     DUTCH    WITTEN
                     2.852    2.304    1.953    1.930     .778
ORIG        2.852    -         .548     .899     .922    2.074
SATO        2.304             -         .351     .374    1.526
DITO        1.953                      -         .023    1.175
DUTCH       1.930                               -        1.152
WITTEN       .778                                        -
Summary
Most existing descriptions of British English intonation are impressionistic in character, since they are mainly intended for use in teaching English as a foreign language. The present research sets out to give a melodic description of British English intonation which is so explicit that it can be used for the control of pitch in a speech synthesis system. Intonation is studied in accordance with the principles that have been followed with success in the study of Dutch intonation at the Institute for Perception Research. The description is based on a mixture of acoustic and perceptual analysis of speech samples and, using speech resynthesis as a research tool, any claims made are experimentally verified.

Chapter 2 addresses the problem of determining which elements in the fundamental frequency curve of an utterance are relevant for the perception of intonation and which can be ignored. For this purpose, the so-called close-copy stylization is introduced. This is a relatively rigorous and simple stylization of an original fundamental frequency curve which is perceptually virtually indistinguishable from it, thus reflecting more closely the pattern that the speaker meant to produce. The perceptual equality of original fundamental frequency curve and close-copy stylization is experimentally verified. Also in this chapter, a method is proposed to measure the perceptual acceptability of pitch contours. This involves letting subjects score the acceptability of pitch contours on a five-point scale and then processing the resulting raw scores to obtain a so-called 'scale value' for each utterance, which represents its acceptability.

In chapter 3, a melodic model of British English intonation is proposed. The model contains an inventory of eight melodically distinct pitch movements that can each take up any of three positions with respect to the syllable, and rules that specify how these pitch movements may be combined to form complete pitch contours. The model is completely explicit acoustically and can be used for pitch control in speech synthesis. A perceptual experiment confirms that the model can be used with success to provide utterances with artificial pitch contours that are no less acceptable than original fundamental frequency curves.

Chapter 4 gives rules for the automatic generation of seven different types of pitch pattern, based on the seven-tone system proposed by Halliday (1970). Another experiment shows that pitch contours generated in accordance with these rules have a high degree of perceptual acceptability.

Chapter 5, finally, evaluates how successful the approach to the study of intonation used in the present research has been. A comparison is made between the Dutch and English intonational systems, and suggestions are given for possible future research.