
TENTH INT. CONGRESS OF PHONETIC SCIENCES, 1-6 AUGUST 1983
UTRECHT, THE NETHERLANDS

Editors: A. COHEN and M.P.R. VAN DEN BROECKE Institute of Phonetics, University of Utrecht

Utrecht, 1-6 August 1983

Abstracts of the Tenth International Congress of Phonetic Sciences


1983 FORIS PUBLICATIONS Dordrecht - Holland/Cinnaminson - U.S.A.

Published by: Foris Publications Holland, P.O. Box 509, 3300 AM Dordrecht, The Netherlands. Sole distributor for the U.S.A. and Canada: Foris Publications U.S.A., P.O. Box C-50, Cinnaminson, N.J. 08077, U.S.A.

ISBN 90 70176 89 0 © 1983 by the authors. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner. Printed in the Netherlands by ICG Printing, Dordrecht.

Table of Contents

Preface vii
Previous Congresses ix
Congress Committee x
Plenary Sessions 1
Symposia 63
Section Abstracts 329
Index of Authors 793
Workshop on Sandhi 805

Preface

The festive occasion of this 10th International Congress of Phonetic Sciences is seriously overshadowed by the sudden and untimely death of Dennis B. Fry, the President of the Permanent Council, who died in London on 21st March of this year. His presence will be sorely missed by all who have known him.

The 10th Congress will take place in Utrecht, the Netherlands, from 1-6 August 1983 at the Jaarbeurs Congres Gebouw, which adjoins the Central Railway Station. As at the previous 9th Congress, the various scientific activities will be divided over a number of categories: 5 plenary sessions, each addressed by two invited speakers, 6 symposia, and a great number of section papers and poster sessions. There will also be an exhibition of scientific instruments, as well as an educational one organized by the University of Utrecht Museum. We are fortunate in being able to announce that, with one exception, all originally invited lecturers/speakers and symposium chairmen have notified us of their willingness to participate. Due to the shift of emphasis over the last few years in our field, a rather large part of the invited contributions will deal with technological advances in the speech sciences.

This volume contains 4-page abstracts of all invited speakers, as well as one-page abstracts of all the section papers/posters that we could accommodate in the program. The Congress Proceedings will contain the complete texts of all invited speakers as well as a selection of 4-page abstracts of section papers, and will appear in book form before the end of this year.

We are grateful to the following institutions, which have given us support in one form or another:
The Royal Dutch Academy of Arts and Sciences, under whose auspices this congress will be held.
The Dutch Ministry of Education, for providing us with financial support.
The Netherlands Organization for the Advancement of Pure Research (ZWO), which, thanks to the Foundation of Linguistics, was able to contribute to the financing of the Proceedings.
The City Council/Municipality of Utrecht, for their financial support.
The Executive Council of Utrecht University, for their willingness to make the organization of this congress possible in a number of ways.
The Faculty of Arts, for helping us to engage a secretary for all administrative work.
Last but not least, all those persons who have shown tremendous willingness to take part in various capacities, sometimes more than one, in an advisory role.

The Editors

Previous Congresses

First: Amsterdam, 1932
Second: London, 1935
Third: Ghent, 1938
Fourth: Helsinki, 1961
Fifth: Münster, 1964
Sixth: Prague, 1967
Seventh: Montreal, 1971
Eighth: Leeds, 1975
Ninth: Copenhagen, 1979

Congress Committee

Chairman: A. Cohen, University of Utrecht
Executive Secretary: M.P.R. van den Broecke, University of Utrecht
Congress Secretary: A.M. van der Linden-van Niekerk
Members:
F.J. Koopmans-van Beinum, University of Amsterdam
S.G. Nooteboom, Leyden University, Institute of Perception Research (IPO), Eindhoven
L.C.W. Pols, University of Amsterdam

Plenary Sessions

Opening Address: E. Fischer-Jørgensen
Keynote Address: G. Fant

1. Speech and hearing
R. PLOMP: Perception of speech as a modulated signal 5
MANFRED R. SCHROEDER: Speech and hearing: some important interactions 9

2. Relation between speech production and speech perception
L.A. CHISTOVICH: Relation between speech production and speech perception 15
H. FUJISAKI: Relation between speech production and speech perception 21

3. Can the models of evolutionary biology be applied to phonetic problems?
BJÖRN LINDBLOM: Can the models of evolutionary biology be applied to phonetic problems? 27
PETER LADEFOGED: The limits of biological explanations in phonetics 31

4. Psycholinguistic contributions to phonetics
W.D. MARSLEN-WILSON: Perceiving speech and perceiving words 39
WILLEM J.M. LEVELT: Spontaneous self-repairs in speech: structures and processes 43

5. Speech technology in the next decades
J.L. FLANAGAN: Speech technology in the coming decades 49
J.N. HOLMES: Speech technology in the next decades 59

PERCEPTION OF SPEECH AS A MODULATED SIGNAL
R. Plomp
Institute for Perception TNO, Soesterberg, and Department of Medicine, Free University, Amsterdam, The Netherlands

Acoustically, speech can be considered to be a wide-band signal modulated continuously in time in three different respects: (1) the vibration frequency of the vocal cords is modulated, determining the pitch variations of the voice, (2) the temporal intensity of this signal is modulated by narrowing and widening the vocal tract by means of the tongue and the lips, and (3) the tongue and the lips in combination with the cavities of the vocal tract vary the sound spectrum of the speech signal, which may be considered as a modulation along the frequency scale. For each of these three types of modulation, it is of interest to study the modulations present in the speech sound radiated from the mouth, the extent to which these modulations are preserved on their way from the speaker to the listener, and the ability of the ear to perceive them. The modulations can be studied by means of frequency analysis, resulting in frequency-response characteristics for the modulation frequencies. In this abstract I will restrict myself to the temporal and spectral modulations, that is, the horizontal and vertical axes of the speech spectrogram, leaving frequency variations of the fundamental out of consideration.

Modulation in time
The varying intensity of speech, given by the speech-wave envelope, can be analysed by means of a set of one-third octave band filters. Experiments (Steeneken & Houtgast, 1983) have shown that the envelope spectrum is rather independent of the speaker and his way of speaking. The envelope spectrum covers a range from about 0.5 Hz up to about 15 Hz, with a peak at 3 to 4 Hz. The latter value seems to be related to the number of words and syllables per second.

On its way from the speaker to the listener, the speech wave is usually affected by reverberation. The higher the modulation frequency and the longer the reverberation time, the more the modulations are smoothed. For instance, at a large distance from the speaker in a room with moderate reverberation (T = 0.5 sec), the intensity modulations at the listener's position are reduced by a factor of two for a modulation frequency of about 8 Hz (Houtgast et al., 1980). This implies that the fast modulations of speech are not preserved, resulting in a decrease of speech intelligibility.

In the same way as the transfer from the speaker to the listener, the ear of the listener can be described by a temporal modulation transfer function. Recent experiments (Festen & Plomp, 1981) have shown that the modulation cut-off frequency of the ear is, on the average, about 25 Hz. This value is considerably higher than both the upper limit of modulation frequencies present in the speech signal and the cut-off frequency of a typical room.
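As an aside for the modern reader, the band-filtering and envelope analysis Plomp describes can be sketched in a few lines of Python. This is an illustrative reconstruction, not the original TNO analysis; the filter orders, band centre, and the synthetic test signal are my own assumptions.

```python
# Sketch of an envelope (modulation) spectrum for one one-third-octave band,
# in the spirit of the analysis described above. Illustrative only.
import numpy as np
from scipy.signal import butter, sosfilt, sosfiltfilt

def envelope_spectrum(x, fs, f_center=1000.0):
    """Modulation spectrum of the intensity envelope in one 1/3-octave band."""
    # One-third-octave band edges around f_center.
    lo, hi = f_center / 2 ** (1 / 6), f_center * 2 ** (1 / 6)
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    band = sosfilt(sos, x)
    # Intensity envelope: squared signal, smoothed below ~30 Hz
    # (speech modulations of interest lie roughly between 0.5 and 15 Hz).
    env_sos = butter(4, 30.0, btype="lowpass", fs=fs, output="sos")
    env = sosfiltfilt(env_sos, band ** 2)
    # Spectrum of the envelope: running speech peaks near 3-4 Hz.
    spec = np.abs(np.fft.rfft(env - env.mean()))
    freqs = np.fft.rfftfreq(len(env), d=1 / fs)
    return freqs, spec

# Synthetic "speech": a 1 kHz carrier fully modulated at 4 Hz.
fs = 16000
t = np.arange(0, 2.0, 1 / fs)
x = (1 + np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
f, s = envelope_spectrum(x, fs)
print("envelope peak near %.1f Hz" % f[1:][np.argmax(s[1:])])
```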


Modulation along the frequency scale
In the same way as for the fluctuations in time, the variations occurring in the sound spectrum as a function of frequency at a particular moment of the speech signal can be analysed by means of a set of one-third octave filters. Since these data are not available at the moment, we have to follow another way. We may expect that the spectral modulation transfer function has to be sufficiently high to resolve the two spectral peaks corresponding to the most important formant frequencies F1 and F2. From spectral data on Dutch vowels (Klein et al., 1970) we may decide upon 1 period per octave as a rough estimate of the upper limit of modulation frequencies present in speech signals.

The effect of reverberation on the transfer of the signal from the speaker to the listener (too often neglected in speech research) is, for constant sounds, comparable with introducing a standard deviation of 5.57 dB in the sound-pressure level of any harmonic (Schroeder, 1954; Plomp & Steeneken, 1973). If we may assume that this value also holds for the actual speech signal, it can be calculated that the variance introduced by reverberation is, roughly, about as large as the spectral differences when the same vowel is pronounced by different speakers.

The spectral modulation transfer function of the human ear can be determined in a way similar to that for the temporal modulation transfer function. Festen and Plomp (1981) found that the higher cut-off frequency of the ear is about 5 periods per octave. This value is much larger than the 1 period per octave we estimated for the speech spectrum.

Conclusions
The data presented above indicate that, both for temporal and spectral modulations, the human ear is sensitive to modulation frequencies exceeding those present in the speech signal. In indoor situations reverberation is the most important limiting factor in the perception of the speech-envelope variations in time. It can be argued that the high frequency-resolving power of the ear is not related to the perception of a single speech signal, but to the discrimination of speech interfered with by other sounds.

References
Festen, J.M. & Plomp, R. (1981). Relations between auditory functions in normal hearing. Journal of the Acoustical Society of America, 70, 356-369.
Houtgast, T., Steeneken, H.J.M. & Plomp, R. (1980). Predicting speech intelligibility in rooms from the Modulation Transfer Function. I. General room acoustics. Acustica, 46, 60-72.
Klein, W., Plomp, R. & Pols, L.C.W. (1970). Vowel spectra, vowel spaces, and vowel identification. Journal of the Acoustical Society of America, 48, 999-1009.
Plomp, R. & Steeneken, H.J.M. (1973). Place dependence of timbre in reverberant sound fields. Acustica, 28, 50-59.
Schroeder, M. (1954). Die statistischen Parameter der Frequenzkurven von grossen Räumen. Acustica, 4, 594-600.
Steeneken, H.J.M. & Houtgast, T. (1983). The temporal envelope spectrum of speech and its significance in room acoustics. To be published in Proceedings of the 11th International Congress of Acoustics, Paris.

SPEECH AND HEARING: SOME IMPORTANT INTERACTIONS
Manfred R. Schroeder
Drittes Physikalisches Institut, Universität Goettingen, FRG, and Bell Laboratories, Murray Hill, N.J., USA
Speech Quality and Monaural Phase
In the 1950s, when I first became interested in speech synthesis, I was almost immediately intrigued by the problems of subjective quality of synthetic speech. Vocoders had a reedy, electronic "accent" and I thought that the excitation waveform, consisting of sharp pulses, was perhaps to blame. To investigate this question more deeply, I built a generator for 31 coherent harmonics of variable fundamental frequency. The phase of each harmonic could be chosen to be either 0 or π, giving a total of 2³⁰ = 1,073,741,824 different waveforms, each of which appeared to have its own intrinsic timbre, their identical power spectra notwithstanding. (I wish Seebeck, Ohm and Helmholtz had had a chance to listen to these stimuli!)

For all phase angles set equal to 0, one obtains a periodic cosine-pulse. When this waveform is used as an excitation signal for a speech synthesizer, the result is the reedy quality already mentioned. By contrast, if one randomizes the phase angles, one gets a less peaky waveform and a mellower sound (Schroeder, 1959). A better-than-random choice for the phase angles (one that gives an even less peaky waveform) is given by the formula φₙ = πn²/N, where n is the harmonic number and N the total number of harmonics in the flat-spectrum stimulus. More general formulae, for arbitrary spectra and phase angles restricted to 0 or π, are given by Schroeder (1970).

The great variety of timbres producible by phase manipulations alone suggested to me that it might be possible to produce intelligible voiced speech from stimuli having flat (or otherwise fixed)
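The peak-factor claim is easy to verify numerically. The sketch below is mine, not Schroeder's original generator, and it uses the quadratic phase rule φₙ = πn²/N as reconstructed above; the sample rate and fundamental are arbitrary choices.

```python
# Compare peak factors of a 31-harmonic flat-spectrum complex under
# three phase choices: zero phase, random 0/pi, and Schroeder phases.
import numpy as np

N = 31                       # number of coherent harmonics, as in the abstract
f0, fs = 100.0, 16000.0      # fundamental and sample rate (illustrative)
t = np.arange(0, 1 / f0, 1 / fs)            # one fundamental period
n = np.arange(1, N + 1)[:, None]

def peak_factor(x):
    return np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2))

phases = {
    "zero phases (cosine pulse)": np.zeros((N, 1)),
    "random 0/pi phases": np.pi * np.random.randint(0, 2, (N, 1)),
    "Schroeder phases pi*n^2/N": np.pi * n ** 2 / N,
}
for name, phi in phases.items():
    x = np.sum(np.cos(2 * np.pi * n * f0 * t + phi), axis=0)
    print(f"{name:28s} peak factor = {peak_factor(x):.2f}")
```

The zero-phase pulse gives the largest peak factor (the "reedy" excitation), while the quadratic phases flatten the waveform markedly.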


line spectra. Although this appears to contradict the most basic tenets of auditory perception of speech, "formant-less" speech has now been demonstrated in collaboration with H. W. Strube of Goettingen. To achieve this flat-spectrum (voiced) speech, we generate a synthetic speech signal which is symmetric in time and has maximum peak factor:

s(t) = Σₙ₌₁ᴺ Sₙ cos(nωt),

where ω is the fundamental frequency and the Sₙ are the amplitudes of individual harmonics. The Sₙ are determined, for example, by linear predictive analysis of a natural speech signal; the Sₙ reflect its formant structure. To the speech signal s(t) is then added another signal, a(t), of like fundamental frequency, with harmonic amplitudes Aₙ and phases φₙ chosen so that the combined signal s(t) + a(t) has a flat line spectrum.


Symposium 2: Units in speech synthesis

UNITS FOR SPEECH SYNTHESIS
Jonathan Allen
Research Laboratory of Electronics, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, USA

The goal of synthetic speech algorithms is to provide a means to produce a large (infinite) set of speech waveforms. Since this number is so large, the units cannot be the messages themselves, so we must look for ways to compose the message from a smaller number of units. In this summary the desiderata for these units are discussed.

1. Clearly there must be a relatively small number of units at each level. There is a well known tendency for longer and more complex units to be larger in number. Thus the word lexicon for any natural language is large, but the number of phonemes is small.

2. The choice of units must allow for the freedom of control to implement all significant aspects of the waveform. This leaves open the question of what is significant, but it is clear that perceptual results should be the guide to such a choice. Historically these aspects have included spectral sections, including spectral prominences (formants), plus the prosodic structure of the speech.

3. In order to derive the necessary control for the desired units, it must be possible to analyze speech in terms of these units. If this can be done algorithmically (automatically), so much the better. Nevertheless, many units are deep abstractions and (so far) require human analysis. Examples of such analysis include the discovery of syllable structure and the determination of stress from acoustic correlates.

4. Conversely, it must be possible to compose an utterance by interpretive processes on the selected units for synthesis. This consideration points out the dual aspects of internal versus external structure. There are divergent opinions as to where to emphasize the overall structure in terms of synthesis units.
Thus, for example, phonemically based synthesis systems specify relatively little internal structure, but leave much of the transition information at the boundaries to be provided interpretively. On the other hand, diphone based systems have a relatively larger amount of internal structure, since the transition information is internal to the unit, and there is intentionally a minimal amount of external boundary structure that must be specified interpretively.

This contrast points out the distinction between compiled and interpretive use of units. When phonemes are used, there is relatively little compiled structure and more interpretive composition at boundaries. Diphones represent more compiled internal structure which can be represented lexically. These differences reflect the variety and the complexity of the compositional function used to join units together. They also emphasize the choice as to whether the knowledge (of transitions, say) is to be represented structurally (in terms of static lexical forms) or procedurally. That is, is the richness of detail (phonetically) to be captured as intrinsics of the units themselves or in terms of the connective framework?

The nature of the compositional function is also highly connected to the problem of binding time. Clearly in the case of transitions, this information is bound early (and not later modified) for diphones, and hence the information can be considered to be compiled into structural lexical form at the time the overall system is built. On the other hand, the use of phonemes contemplates the late binding of the transitional information. This provides for more flexibility, but requires that explicit procedures for transition specification be developed. For consonant-vowel and vowel-consonant transitions much work has been done, but for clusters and vowel-vowel transitions a great deal of further research must be completed. So far, we have thought of transitions in terms of the spectral part of the assumed source-filter model, but there is also the problem of transitions at the source level. Onset and offset of voicing must be specified (the compositional function is now far too simple: just voiced/unvoiced/silence), as well as mixed excitation.
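The compiled-versus-interpretive distinction can be made concrete with a toy example. The sketch below is purely illustrative (none of the tables, names, or numbers come from Allen's text): a diphone table binds the transition early, at system-build time, while a phoneme-based rule binds it late, at synthesis time.

```python
# Early binding: an F2 transition track for each phone pair is stored
# lexically in a (hypothetical) diphone table, fixed when the system is built.
DIPHONE_TABLE = {
    ("d", "a"): [1700, 1500, 1300, 1200],   # F2 track in Hz, compiled once
    ("a", "d"): [1200, 1300, 1500, 1700],
}

def compiled_transition(left, right):
    """Early binding: the stored track cannot be varied at run time."""
    return DIPHONE_TABLE[(left, right)]

# Late binding: only phone targets are stored; the transition is a rule
# applied at synthesis time, so it can be varied (e.g. with speaking rate).
F2_TARGET = {"d": 1700, "a": 1200}

def interpreted_transition(left, right, n_points=4):
    """Late binding: the track is recomputed by rule, hence flexible."""
    a, b = F2_TARGET[left], F2_TARGET[right]
    return [round(a + (b - a) * i / (n_points - 1)) for i in range(n_points)]

print(compiled_transition("d", "a"))      # fixed at system-build time
print(interpreted_transition("d", "a"))   # computed interpretively
```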

5. The units selected should be insightful and relevant to the accumulated research literature. They should mark distinctions that are linguistically relevant. Here an important notion is similarity versus contrast. Within the class of utterances that correspond to a unit there should be a cohesion and similarity of internal structure, but between these units there should be a contrast in features. Thus phonology was invented to provide a notation for contrasts in meaning. It may of course be that there are new units and representational frameworks that are not related to existing understanding and yet are important for high quality synthesis. This is most likely to happen in areas where the need for structure is felt, but sufficient research has not yet been completed, as in the study of articulation.
6. There are several levels of units that are important. These vary in scope, and address different aspects of structure. It is probably the case that all aspects of linguistic structure are reflected somehow in the acoustic waveform. The levels of structure currently recognized are: discourse, sentence, clause, phrase, word, morpheme, metrical foot, syllable, phoneme, and feature. These have all been found useful in linguistic analysis for reasons of distribution and contrast, but they each exhibit some cohesion and intrinsic properties themselves; that is, the units normally carry properties within themselves, and don't place focus on how they combine. The sequence of these units suggests a natural hierarchy and nesting in terms of size, but this is not necessary, as with the asynchronous behavior of features described in autosegmental phonology. Features such as nasality, for example, span several segments, and the control of this feature should probably not be thought of as being dominated solely at the phoneme level.
7. The progression through the unit levels described above is gradual, thus maintaining fine gradations and providing for relatively direct transformations between these levels. Thus a representation of phonemes plus stress marks plus boundaries can be converted into a structure of allophones plus boundaries, which in turn can be converted to a generalized target framework including both the prosodic frame and the segmental target structure. These are then interpreted into a parameter string relevant to the vocal tract model to be utilized, which then interprets these parameters to form the final acoustic waveform. By providing such a large number of units, the jump between any two levels of representation can be made small enough that a reasonable understanding can be developed. If there were not such a proliferation of units, then many meaningful generalities would be missed, but more importantly the means to convert from a linguistic representation to the waveform would have to be expressed in a hopelessly complex way. Each of these smaller transformations can be thought of as a procedural interpreter of the preceding framework level or levels. At each of these levels the units must be chosen in a way to capture all important generalities, thus simplifying the rule structure and increasing the overall modularity of the system.
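As a reading aid, the gradual progression of representations just described can be caricatured as a chain of small interpreters. Every rule, symbol, and number in this sketch is hypothetical; it only illustrates the phonemes-to-allophones-to-targets-to-parameters flow.

```python
# Toy rendering of the level-by-level interpretation Allen describes.
# All rules and values are invented for illustration.

def allophone_level(phonemes, stress):
    """Phonemes + stress marks -> allophones (one toy aspiration rule)."""
    out = []
    for i, p in enumerate(phonemes):
        if p == "t" and i + 1 < len(phonemes) and stress[i + 1]:
            out.append("t_h")               # aspirated allophone of /t/
        else:
            out.append(p)
    return out

def target_level(allophones):
    """Allophones -> segmental targets (duration in ms, F1/F2 in Hz)."""
    table = {
        "t_h": {"dur": 90, "f1": 0, "f2": 0},     # voiceless closure
        "t":   {"dur": 70, "f1": 0, "f2": 0},
        "a":   {"dur": 120, "f1": 750, "f2": 1200},
        "k":   {"dur": 80, "f1": 0, "f2": 0},
    }
    return [dict(table[a], unit=a) for a in allophones]

def parameter_level(targets, frame_ms=10):
    """Targets -> a frame-by-frame parameter string for the synthesizer."""
    frames = []
    for tg in targets:
        frames += [(tg["unit"], tg["f1"], tg["f2"])] * (tg["dur"] // frame_ms)
    return frames

phonemes, stress = ["a", "t", "a", "k"], [False, False, True, False]
params = parameter_level(target_level(allophone_level(phonemes, stress)))
print(len(params), "frames; first:", params[0])
```

Each function is deliberately a small, separate interpreter of the preceding level, mirroring the modularity argument above.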

8. The notion of target is central in this set of levels and transformations, since it bridges between the abstract linguistic labels and physical properties of the vocal tract model. The target may be thought of as a spectral contour, parameter value, fundamental frequency gesture or other more complex structure. The target represents those attributes of the desired realization that are felt sufficient to give rise to the complete percept. By implication, the target represents an abstraction from the speech realizations of an individual utterance, omitting elements that are felt to be less important to the intended percept. The trend is towards more complex target structures, reflecting increased understanding of phonetic cues, and the integration of smaller targets into larger target frameworks. Thus targets, like other units, have scope and intrinsic cohesion, and participate in composition functions which are largely as yet undiscovered. Simple addition of smaller targets into larger ones, as in the modification of phrase-level fundamental frequency contours by segmental fundamental frequency gestures, is inadequate to describe the phonetic facts.
9. Many of the units specify attributes that are necessary (usually at some abstract level) in order to realize the desired utterance. Thus they specify invariants over all possible realizations of the utterance. Nevertheless, there is a great deal of observed change, as in allophonic variation. What is true variability (not subject to law) and what is superficial variation due to context-dependent composition functions is a major problem, but contemporary research reveals that much of this seemingly freely varying detail is deterministic, at least for a given talker. The trend is toward a richer set of interpretive processes operating on the linguistic abstractions to produce the rich set of phonetic detail observed at many levels. The management of redundant cues that integrate together to form complex percepts is a procedure requiring vast new amounts of data analysis and theoretical foundation.
Undoubtedly research in this area will lead to the formulation of a variety of highly complex target frameworks. Today's synthesis utilizes only a few robust targets, leading to an awkward lack of naturalness and reflecting the rigidity that current synthetic speech demonstrates. While there is hope to find some invariance in terms of fundamental articulatory processes, the possibility remains that the procedural process of integrating cues to form percepts may itself be the invariant. From today's perspective, such a discovery seems remote, awaiting a vastly increased amount of research and understanding.

These observations are intended to characterize the general process guiding the selection of units of speech synthesis. The role of these units must be clearly stated together with a complete view of the total synthesis procedure that exploits them. There is a tendency to be guided by technological constraints which may permit greater use of memory (compiled strategies) or real-time processing (interpretive strategies). What is more important, however, is the provision of a correct theory fully parameterized to reflect the naturally occurring facts. The technology will easily rise to carry these formulations into practical real systems.
REMARKS ON SPEECH SYNTHESIS
Osamu Fujimura
Bell Laboratories, Murray Hill, N.J., USA

Segmental Units
Speech synthesis by rule, since the pioneering work by Holmes, Mattingly and Shearme [1964], has always been based on the principle of concatenating some segments, with prosodic modulations. Most systems use phonemic/phonic segments. The basic idea is to assume a single target value for each of the control variables per segment, and connect these sample values smoothly by rule (see [Liberman, et al., 1959] for an early discussion of this concept). Obviously, the rules that generate control time functions must do more than simply (e.g. linearly) connect the specified sample points, and the phone inventory must carry information about transitional characteristics of the given (type of) segments as well as target values. Also, more than a single target is often needed per phoneme-size segment.

In the efforts to improve the quality of the output speech, it was found necessary for the rules to look several segments ahead in the string before they could decide which parametric values to use. For example, the duration (as well as quality) of the vowel [aɪ] in [paɪnt] crucially depends on the tenseness of the final consonant. In addition to this long-range context sensitivity, there are other variations of segmental properties that escape explanation by any simple notion of (hard) coarticulation [Fujimura & Lovins, 1978].

Adopting phonic diads (i.e. diphones) as the concatenative units [Dixon, 1968] solves part of this problem. The number of entries in the inventory has to be relatively large, but the transitional elements can be specified effectively by entering only endpoint values for the LPC pseudo-area
parameters [Olive, 1977]. This assumes that all transitions are approximated by straight lines, the endpoints being coincident in time among different parameters (see below for further discussion). Furthermore, the relatively stationary patterns, for example in the middle of a vowel portion (or consonantal constriction), are effectively represented by straight lines connecting the endpoints of the transitions into and out of the vowel (or consonant). While this approach is actually very effective, given a very limited storage for the inventory, it has inherent problems. For example, a given sequence of phonemes, say /st/, manifests different phonetic characteristics depending on the roles of the segments as syllable constituents. Thus the stop is usually aspirated in 'mis-time', but not much in 'mistake' (see [Davidsen-Nielsen, 1974]). Even if this problem is solved by expanding the inventory to include transitions into and out of a syllable boundary for each phoneme/phone, we still have the same problem as the phonemic approach with respect to the long-range phenomena.

Interestingly, most of the long-range phenomena are contained within the syllable. There are many ad hoc (i.e. non-coarticulatory) tautosyllabic interactions (see Fujimura & Lovins, ibid.) but almost none (assuming that the ambisyllabicity problem [Kahn, 1976] is treated correctly) across syllable boundaries, apart from what we can describe as prosodic effects on segmental quality (see below). For this reason, we advocate the use of syllables, or more practically demisyllables with phonetic affixes, for concatenative synthesis [Fujimura, Macchi, and Lovins, 1977]. Each syllable is divided into an initial demisyllable, a final demisyllable, and optional (in word-final position) phonetic affixes. The crucial information about transitions is totally preserved. The final demisyllable contains the quasi-stationary part of the vowel, and the strong effects of the final consonantal features are well represented throughout the demisyllable. On the other hand, there is little interaction between initial and final demisyllables, with respect to both linguistic distributional patterns and phonetic quality (apart from coarticulation). Therefore, the quality of demisyllabic synthesis is about the same as that of syllabic synthesis, and the inventory size is considerably smaller (around 800 items in our current version for English, as opposed to more than 10,000 for syllables).
The intelligibility of monosyllabic words presented in isolation is virtually limited by the LPC process itself (see Lovins et al., 1979).
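For illustration, the demisyllable decomposition described above can be expressed as a toy routine. This is my sketch, not Fujimura's inventory software; the phone symbols and the affix set are assumptions.

```python
# Toy split of a syllable into initial demisyllable, final demisyllable,
# and optional word-final phonetic affixes, following the scheme above.
VOWELS = {"a", "e", "i", "o", "u", "aI"}
AFFIXES = {"s", "z"}                     # e.g. plural/genitive endings (toy set)

def split_syllable(phones):
    """Return (initial demisyllable, final demisyllable, affixes)."""
    v = next(i for i, p in enumerate(phones) if p in VOWELS)
    coda = phones[v + 1:]
    affixes = []
    while coda and coda[-1] in AFFIXES:  # peel word-final affixes off the coda
        affixes.insert(0, coda.pop())
    initial = phones[:v + 1]             # onset consonants + vowel
    final = [phones[v]] + coda           # vowel + core coda
    return initial, final, affixes

# "pints": initial [p aI], final [aI n t], affix [s]
print(split_syllable(["p", "aI", "n", "t", "s"]))
```

Note how the vowel is shared by both demisyllables, preserving the transition information on each side.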


A parametric schematization, as in Olive's approach, should be effective for demisyllabic patterns also.

Is Concatenation of Concrete Units Workable?
As an abstract representation of speech, a string of words seems basically an appropriate picture, albeit obviously not complete. One can argue about the validity of phonemic segments [Chomsky & Halle, 1968], but there must be some purely phonological unit that gives us a useful picture of sound streams. Just as syntax represents a sentence by a tree structure of phrasal units, phonology may specify, for a syntactic unit, a word, a phrase, or a sentence, a structural organization of phonetic units. This phonological structure, however, must relate segmental features to suprasegmental elements. This can be done via a separate structure (metrical grid) [Liberman and Prince, 1977], or by a more direct linking (autosegmental theory), see [Goldsmith, 1976]. There has been remarkable activity in this area of theoretical linguistics, some based on experimental and quantitative work, in an effort to establish an integral theory of phonology and phonetics [Pierrehumbert, 1980] [Liberman & Pierrehumbert, to appear].

The basic problem that has to be solved is how to describe speech in those aspects where the concatenative (linear) models fail. The more accurately and systematically we observe speech phenomena, given now-available experimental tools, the more immediate our concerns become. We now understand, at least in part, why speech synthesis schemes invariably sound machine-like rather than human, and why automatic recognition schemes fail to identify even isolated words that are nearly perfectly identifiable for human native speakers. All the existing schemes, including diadic and demisyllabic synthesis, and template-matching or feature-based recognition systems, use some concatenative segmental units in concrete forms, or even, in some cases, such as dynamic programming, assume preservation of finely divided subsegments (sample frames) as nearly invariant units. According to this principle, all variables characterizing speech signals (except pitch) have to move concurrently when prosody dictates temporal modulation and other suprasegmental modifications. Furthermore, apart from simple smoothing (hard coarticulation), alterations are implemented only through a symbolic selection between alternative forms, such as allophones. This picture is incorrect, in our opinion.


A Multidimensional Model
In our recent work on articulatory movements as observed by the x-ray microbeam, we found that articulators moved somewhat independently from each other [Fujimura, 1981]. Thus the notion of "simultaneous bundles" [Jakobson, Fant and Halle, 1963] does not apply literally to concrete physical phenomena. At the same time, we observed that there were relatively stable patches of movement patterns in the time domain of the multi-dimensional articulatory movements for speech utterances. These relatively invariant local movement patterns (which we called "iceberg patterns") represent crucial transitions with respect to the demisyllabic identity, presumably being responsible for perceptual cues for consonant (place) identification. The patterns of articulatory movements between icebergs, including quasi-stationary gestures for vowels, as well as the relative timings of icebergs, seem highly variable, depending on various factors such as stress, emphasis, and global and local tempo.

One important implication of this multidimensionality is that the articulatory interaction between syllable nuclei may be, to a large extent, independent from intervening consonantal gestures. The tongue body gestures for different vowel qualities seem to undergo intricate assimilation-dissimilation processes. In some languages, such as Portuguese, effects of such vocalic interaction are so strong that they have been described with symbolic descriptions, phonologically (metaphony) and phonetically (vowel gradation) [Maia, 1980]. In other words, there are different phonetic processes that affect different aspects of sound units selectively. Vowel (or consonant) harmony is an extreme case, and hard coarticulation is the other extreme, but both affect a particular articulatory dimension representing a particular phonological feature, rather than the segmental unit as a whole. What precisely governs such interactions, unfortunately, is largely unknown in most cases. In the domain of symbolic representation, linguists in the newly emerging field called nonlinear phonology are addressing themselves to closely related theoretical problems (see, in addition to Liberman and Prince, and Goldsmith, ibid., [Williams 1976] [Halle and Vergnaud 1980] [Selkirk 1980] [Kiparsky 1982]).


The extent of variability of, for example, vowel quality, either in terms of formant frequencies or articulator positions, is remarkably large: for the same phoneme in exactly the same segmental environment (i.e. word sequence), different emphases, phrasing, or other utterance characteristics can easily affect the vowel quality by a third of the entire range of variation (for all vowels) of the given variable (say F2). Iceberg patterns, however, are affected relatively little as long as they belong to the same (phonological) demisyllable (probably with some qualifications that are not clear at this time). Whether the iceberg model is correct or not, the point is that speech is highly variable in a specific way. Until we understand precisely in what way speech is variable and what influences different aspects of speech signals, we obviously cannot achieve a satisfactory synthesis (or recognition) system. We are only beginning to gain this understanding.

An adequate articulatory model, which we still have to work out, may be a particularly powerful tool of research from this point of view. Ultimately, there could well be a way to handle acoustic parameters, such as formant frequencies, directly by appropriate rules, in interpreting phonological, symbolic representations of speech for synthesis. But in the meantime, we should exploit whatever we have as the research tool, in order to understand the key issues involved. Likewise, a machine also ought to be able to approach the human performance in speech recognition based on acoustic signals only. As the history of speech research clearly shows, analysis (observation/interpretation) and synthesis must go hand in hand. Also, theory and experimentation, including well-guided designs of concrete, potentially practical systems, are bound to help each other.


REFERENCES
Chomsky, N., and M. Halle (1968). The Sound Pattern of English. New York: Harper & Row.
Davidsen-Nielsen, N. (1974). Syllabification in English Words with Medial sp, st, sk. J. Phonetics 2: 15-45.
Dixon, N.R. (1968). Terminal Analog Synthesis of Continuous Speech Using the Diphone Method of Segment Assembly. IEEE Trans. Audio Electroacoust. AU-16(1): 40-50.
Fujimura, O. (1981). Temporal Organization of Articulatory Movements as a Multidimensional Phrasal Structure. Phonetica 38: 66-83.
Fujimura, O. and Lovins, J.B. (1978). Syllables as Concatenative Phonetic Units. In A. Bell and J.B. Hooper (eds.), Syllables and Segments. Amsterdam, Holland: North-Holland Publ. Co.
Fujimura, O., Macchi, M.J. and Lovins, J.B. (1977). Demisyllables and Affixes for Speech Synthesis. In Cont. Pap. 1:513, Proceedings 9th Int. Congr. on Acoust., Madrid, Spain.
Goldsmith, J. (1976). An Overview of Autosegmental Phonology. Ling. Anal. 2: 23-68.
Halle, M., and Vergnaud, J.-R. (1980). Three-Dimensional Phonology. J. Ling. Res. 1: 83-105.
Holmes, J.N., Mattingly, I.G. and Shearme, J.N. (1964). Speech Synthesis by Rule. Lang. Speech 7(3): 127-143.
Jakobson, R., Fant, C.G.M. and Halle, M. (1963). Preliminaries to Speech Analysis: The Distinctive Features and their Correlates. Cambridge, MA: MIT Press.
Kahn, D. (1976). Syllable-based Generalizations in English Phonology. (Doctoral dissertation, MIT, 1976.) Available from Indiana Univ. Ling. Club, Bloomington, IN.
Kiparsky, P. (1982). From Cyclic Phonology to Lexical Phonology. In H. van der Hulst and N. Smith (eds.), The Structure of Phonological Representations. Dordrecht, Holland: Foris Publications.
Liberman, A.M., Ingemann, F., Lisker, L., Delattre, P., and Cooper, F.S. (1959). Minimal Rules for Synthesizing Speech. JASA 31(11): 1496-1499.
Liberman, M.Y., and Pierrehumbert, J. Intonational Invariance Under Changes in Pitch. (To appear.)
Liberman, M.Y., and Prince, A. (1977). On Stress and Linguistic Rhythm. Ling. Inq. 8: 249-336.
Lovins, J.B., Macchi, M.J. and Fujimura, O. (1979). A Demisyllable Inventory for Speech Synthesis. In Speech Communication Papers, presented at the 97th Mtg. of ASA, Cambridge, MA, 1979: pp. 519-522.
Maia, E.A. DaMotta (1981). Phonological & Lexical Processes in a Generative Grammar of Portuguese. (Doctoral dissertation, Brown Univ., 1980.)
Olive, J.P. (1977). Rule Synthesis of Speech from Dyadic Units. Conference Record of the IEEE Int. Conf. on Acoust., Speech, and Signal Processing, Hartford, CT, 1977: pp. 568-570.
Pierrehumbert, J. (1981). The Phonology and Phonetics of English Intonation. (Doctoral dissertation, MIT, 1980.)
Selkirk, E. (1980). The Role of Prosodic Categories in English Word Stress. Ling. Inq. 11: 563-605.


Williams, E.S. (1976). Underlying Tone in Margi and Igbo. Ling. Inq. 7: 463-484.

UNITS IN SPEECH SYNTHESIS
J.N. Holmes
Joint Speech Research Unit, Cheltenham, UK

In this discussion I am assuming that the purpose of speech synthesis is for machine voice output, rather than for phonetic research or for the receiving end of an analysis/synthesis telephony link.
There is currently a wide variety of units used in speech generation for machine voice output. Systems using stored human speech waveforms are only suitable for very restricted applications, but the advent of vocoder techniques using digitally stored control signals opened a new dimension to stored word methods. It became possible to modify the timing and fundamental frequency of stored vocoder control signals to get the desired prosodic form of sentences. The development of very compact low-cost devices for LPC-vocoder synthesis, and the fairly modest storage requirement for the control signals for a moderate sized vocabulary, has currently given the stored-vocoder-word approach great economic importance. Although vocoder distortion is still noticeable, the speech quality is generally adequate for many practical applications.
Stored coded words or larger units of human speech are obviously not suitable for arbitrarily large vocabularies, but the co-articulation problems are too severe for it to be acceptable to use units corresponding to single phonemes. The alternative is to exploit the co-articulatory effects produced by a human speaker as far as they affect adjacent phonemes. Boundaries of the stored units can then be made in the middle of the acoustic segments associated with the phonemes. With this method it should be possible to make up all messages from a number of units of the order of the square of the number of significantly different allophones in the language. Different variants of this approach have been investigated, and the terms 'dyad' (Olive, 1977), 'diphone' (Schwarz, Klovstad, Makhoul, Klatt and Zue, 1979) and 'demi-syllable' (Browman, 1980) have been used. With these smaller units, human speech analysis can be used only to provide the spectrum envelope information for the appropriate sounds and their transitions: the durations and fundamental frequency information must be completely calculated by rule to achieve the desired prosodic structure. Use of spectral envelope patterns with rule-generated prosody is undoubtedly successful to a first order, but there are several factors that will limit the performance. These include the inherent distortion of the vocoder representation, the fact that co-articulation cannot easily be varied with segment duration, and the problems of selection of a suitable set of units from human utterances.

A completely different approach to speech synthesis is to use a system of rules for the entire process of converting from a linguistic specification to the acoustic signal. Human speech is then not used at all except to assist in determining the rules. Rules need to be applied at several levels, but the higher levels are essentially the same as are needed for systems using very small units of coded human speech, discussed above. At the lowest level detailed rules have to deal with the conversion of a phonetic specification (including pitch and timing information for each phone) into a waveform. When humans learn to speak, the criterion of success is completely auditory, and hence to use an acoustic-domain model for synthesis, aiming merely to copy the most significant auditory features, seems inherently straightforward. It has already been demonstrated (Holmes, 1973) that a fairly simple parallel-formant model of speech production can produce a very close approximation to human speech, adequate for all practical purposes. Although minimal rules for synthesizing speech with this model would probably be more complicated than for an articulatory model, it is much easier to derive such rules by study of human speech, because the information provided by spectrographic analysis is directly related to the way the input to a parallel-formant synthesizer is specified. I therefore very strongly believe that the future of phonetic synthesis by rule is, for good practical reasons, firmly wedded to formant synthesis.

The specification of input for the phonetic level is most conveniently supplied in terms of allophones. In a practical synthesis system the only penalty for freedom in regarding phones in special phonetic combinations as distinct allophones is the increase in the number of allophone specifications, with corresponding storage requirements. The table-driven system of Holmes, Mattingly and Shearme (1964) for Received Pronunciation (RP) had an extremely simple rule structure and very few special allophones for any of the RP phonemes. Based on experience of that system, I think it is very unlikely that the number of entries in the rule synthesis tables would have to be very large to deal with all significantly different allophones of RP. The inclusion of a separate stage in the synthesis process to select the correct allophone according to phonetic environment would be a negligible extra computational load. I thus believe that the units for synthesis at the phonetic level should be allophones, of which there would have to be not more than 100-150 in most languages. At the input to the allophone selection stage the units would be phonemes.

The best performance produced so far by any synthesis-by-rule system would never be mistaken for human speech. However, in contrast to systems using a concatenated selection from a small number of coded human segments, it should in principle be possible to devise rules to approximate very closely to natural speech. Moreover, with formant synthesis it should also eventually be possible to produce different voice qualities (man, woman, child etc.) speaking with the same accent, by modifications to the synthesizer parameters or by transformations of the synthesizer control signals, but without having to modify the phonetic rules. I therefore expect that in due course this approach will completely displace the stored human speech methods for all applications except those with a few fixed messages.
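For the modern reader, the parallel-formant idea Holmes advocates can be sketched as follows: a pulse-train excitation drives second-order resonators in parallel, one per formant, and their weighted outputs are summed. This is a minimal illustration, not the JSRU synthesizer; all frequencies, bandwidths, and gains are assumed values.

```python
# Minimal parallel-formant vowel sketch: impulse-train excitation through
# parallel second-order resonators, weighted and summed. Illustrative only.
import numpy as np
from scipy.signal import lfilter

fs = 10000
f0 = 120                                          # fundamental frequency, Hz
formants = [(700, 90), (1220, 110), (2600, 160)]  # (freq, bandwidth), /a/-like
amps = [1.0, 0.5, 0.25]                           # per-formant amplitude controls

# Simple impulse-train excitation, 0.3 s of voicing
x = np.zeros(int(0.3 * fs))
x[::fs // f0] = 1.0

def resonator(f, bw):
    """Second-order digital resonator (Klatt-style difference equation)."""
    c = -np.exp(-2 * np.pi * bw / fs)
    b = 2 * np.exp(-np.pi * bw / fs) * np.cos(2 * np.pi * f / fs)
    a = 1.0 - b - c                   # normalizes low-frequency gain to one
    return [a], [1.0, -b, -c]

# Parallel connection: each resonator filters the same excitation,
# and the weighted outputs are summed.
y = sum(g * lfilter(*resonator(f, bw), x) for g, (f, bw) in zip(amps, formants))
print("synthesized %d samples, peak %.3f" % (len(y), np.abs(y).max()))
```

The per-formant amplitude controls are what make the parallel (rather than cascade) configuration attractive for rule-driven synthesis.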
There is so much difference between the prosodic patterns of different languages (and even between different accents within the same language) that the structure of prosodic rules needs to be to a large extent language-specific. Very successful systems have been developed for both pitch and duration rules of some languages; for example there are systems for English for which the input consists of phonemes, stress marks and punctuation. The stress marking for such a system corresponds fairly well to the type of stress marking used by phoneticians in phonemic transcription, and punctuation marks control the pauses and the selection of certain types of intonation contour for particular stressed syllables. In principle it should be possible to refine this approach to approximate as closely as desired to the performance of a skilled human in reproducing the prosody of an utterance from an appropriately marked phonetic transcription.
The next level up in the hierarchy of programs for synthesis depends on the application. In a reading machine for the blind it is obviously necessary to be able to accept conventionally spelled text, and this brings in all the problems of converting from spelling to pronunciation (this is particularly difficult for English, and a pronouncing dictionary for many of the words is an essential part of any effective system). Lexical stress assignment within words is sometimes dependent on the syntactic function of the word in a sentence (such as for the English word 'contract'). The level of stress used in any word in a sentence will also vary with the semantics, dependent on the desire to emphasise certain words to make a subtle distinction of meaning. The more general variations of this type require a reading machine to have linguistic knowledge approaching that of a human reader, but more modest systems that can give very useful speech output from text are now clearly practicable (Allen, Hunnicutt, Carlson and Granström, 1979).
For announcing machines in various types of information services the obvious approach is for an operator of the system to specify the message components in textual form. To reduce the need for operator training, it is advantageous for conventional orthography to be used where possible, but in this application there is no difficulty in correcting occasional pronunciation anomalies by ad hoc addition of exceptions to the pronouncing dictionary, by deliberate mis-spelling to facilitate pronunciation by rule, or by embedding phonetic symbols within a conventional orthographic sequence. Assignment by rule of both lexical and semantic stresses can also be over-ridden by the operator whenever necessary to achieve special effects. Parts of messages so constructed may subsequently be re-arranged entirely automatically, but such changes will not normally involve phonemic or stress changes within the message parts.
At a slightly higher level the 'speech from concept' approach of Young and Fallside (1979) could be used to go straight from the concept to the input of the prosodic synthesis program. In systems of this type there is no point in using conventional orthography at any stage in the synthesis process, and knowledge of the concept obviously defines the stress assignments needed to represent particular semantic interpretations of the words. I expect these three techniques for high-level input to large-vocabulary synthesis will continue to have their place for appropriate applications, but ultimately they will nearly all use formant synthesis by rule for the low-level stage.
References
Allen, J., Hunnicutt, S., Carlson, R. and Granström, B. (1979). MITalk-79: The 1979 MIT text-to-speech system. In: Speech Communication Papers Presented at the 97th Meeting of the Acoustical Society of America (Wolf, J.J. and Klatt, D.H., Eds.), Acoustical Society of America, New York, pp. 507-510.
Browman, C.P. (1980). Rules for demisyllable synthesis using Lingua, a language interpreter. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Denver, Co., 561-564.
Holmes, J.N. (1973). The influence of glottal waveform on the naturalness of speech from a parallel formant synthesizer. IEEE Transactions on Audio and Electroacoustics AU-21, 298-305.
Holmes, J.N., Mattingly, I.G. and Shearme, J.N. (1964). Speech synthesis by rule. Language and Speech 7, 127-143.
Olive, J.P. (1977). Rule synthesis of speech from dyadic units. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Hartford, Ct., 568-570.
Schwarz, R., Klovstad, J., Makhoul, J., Klatt, D. and Zue, V. (1979). Diphone synthesis for phonetic vocoding. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Washington, D.C., 891-894.
Young, S.J. and Fallside, F. (1979). Speech synthesis from concept: a method for speech output from information systems. Journal of the Acoustical Society of America 66, 685-695.
UNITS IN TEXT-TO-SPEECH SYSTEMS
Rolf Carlson and Björn Granström
Department of Speech Communication and Music Acoustics, KTH, Stockholm, Sweden

Current text-to-speech systems make use of units of different kinds depending on the goals involved. If the purpose is adequate output only, the internal structure and the choice of units is, to some extent, arbitrary. The choice of units in a research system, with the ambition to model the speech production process adequately, might be quite different. The paper deals with this problem and makes some comparisons to research in speech recognition.

INTRODUCTION
Speech synthesis systems have existed for a long time. About ten years ago the first text-to-speech systems were designed. The speech quality has been improved to a high degree, but the progress has been gradual. Adjustments of parameter values or formulations of linguistic rules have been the base for this work. It is important to note that the progress has been made within the already established framework, and broader views have been exceptions.

In the whole text-to-speech process many units are conceivably of interest, ranging from concepts to motor commands given to individual articulators. The title of this session, "Units in Speech Synthesis," suggests that we could select some specific units on which a synthesis system could be based. Rather than taking on the task of deciding on the "best" unit or set of units, we want to briefly discuss some considerations in designing a text-to-speech system. The choice of units will be shown to be highly dependent on the goal of the synthesis project.

A TYPICAL TEXT-TO-SPEECH SYSTEM
Let us, as a starting point, describe a typical text-to-speech system. The system is meant to accept any kind of text input. The first task is to generate a phonetic transcription of the text. This can be done by a combination of rules and lexical lookups. The lexicon can either be built on a word-by-word basis or can use some kind of morph representation. Optionally, a grammatical analysis is performed and syntactic information included. At the next stage, the phonetic representation is transformed to control parameters that are adjusted depending on context, prosodic information and type of synthesizer. The acoustic output is generated with the help of an LPC, formant or articulatory model. Let us, already at this point, conclude that a system of this kind is a sequential system to a very high degree. The text is successively transformed into a speech wave.
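The sequential structure just described can be summarized in a skeletal program. This sketch is mine, not the authors' system; the lexicon, the letter-to-sound placeholder, and the fixed durations are stand-ins for the rule components discussed in the text.

```python
# Skeleton of a sequential text-to-speech pipeline: text -> transcription
# (rules + lexicon) -> control parameters -> acoustic model. Placeholders only.

LEXICON = {"speech": "s p ii tsh", "wave": "w ei v"}    # toy word lexicon

def letter_to_sound(word):
    return " ".join(word)          # placeholder grapheme-to-phoneme rule

def transcribe(text):
    """Stage 1: phonetic transcription by lexicon lookup, else by rule."""
    return [LEXICON.get(w, letter_to_sound(w)) for w in text.lower().split()]

def to_parameters(transcription):
    """Stage 2: phones -> synthesizer control frames (context rules go here)."""
    frames = []
    for word in transcription:
        for phone in word.split():
            frames.append({"phone": phone, "dur_ms": 80})
    return frames

def synthesize(frames):
    """Stage 3: an LPC, formant, or articulatory model would go here."""
    return sum(f["dur_ms"] for f in frames) / 1000.0       # seconds of speech

frames = to_parameters(transcribe("speech wave"))
print(len(frames), "frames,", synthesize(frames), "seconds")
```

The strictly one-way flow of data through these stages is exactly the "sequential system" property the authors go on to question.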

Systems of this kind have been designed at several research institutes: Allen et al. (1979), Carlson and Granström (1976), Coker et al. (1973), Cooper et al. (1962), Fujimura (1976), Klatt (1982), Mattingly (1968) and Olive (1977). Some are now produced for different applications. The output is often quite far from natural human speech, but nevertheless understandable due to the eminent capability of man to cope with a great variety of dialects under poor signal-to-noise conditions.

UNITS FOR WHAT?
Synthesis systems may be divided into two groups with sometimes different goals:

I. Systems that can be commercialized and used in applications.
II. Research systems that model our present knowledge of speech production.

There is not a distinct borderline between these groups. Presently there is no guarantee that a system of type II will produce better speech than those of type I. Consider, for example, the present level of articulatory models. Much more has to be done before the output can be compared to the best

terminal analog systems. This will probably change in the future. Another example is the use of syntactic information. It is obvious that information of this kind can be used to gain higher speech quality. An erroneous syntactic analysis is, however, often worse than no analysis at all.

An LPC-based system of type I forms a rigid system with limited flexibility and often produces an uneven speech quality. Rules for coarticulation and reduction are much harder to implement in this framework than in an articulatory model. On the other hand, rules of this kind are presently only partially known. This is compensated for by using a variety of allophones or diphones that approximates the variations found in human speech.

The designer of systems of type I often takes a very pragmatic view on the selection of units. The availability of information, the ease of gathering speech data, computational efficiency, hardware cost, etc., are important considerations. Any kind of feature, diphone or parameter could be a helpful unit. The problem is rather to choose the optimal description at a certain point in the system. In a system of type II, the choice of units is obviously important since it reflects the researcher's view on speech production. For example, a system exclusively based on features and their acoustical correlates forms a strong hypothesis and generates important research activity, but does not necessarily produce the highest speech quality.

MACHINES AND MODELS
Computers have had an enormous impact on present society, including speech research. A computer program is normally a single sequential process. This has, to a high degree, guided the development of speech synthesis systems. No system makes use of extensive parallel processing, or of physiological or auditory feedback. No system compensates for noise level or the quality of the loudspeaker. If such processes were included, the choice of units would be different, and the dynamic properties of speech production would have been emphasized. In contrast, speech recognition research has, to some degree, chosen an alternative way by exploring analysis-by-synthesis techniques and a combination of top-down and bottom-up processes.

UNITS AND DIMENSIONS

Despite the great number of units that could be chosen in speech synthesis research, the approach is still "two-dimensional" to a very high degree: ordered rules, linguistic tree structures, formant versus time trajectories, sequential phonemes, allophones or diphones. Articulatory models are sometimes important exceptions. The multidimensionality that clearly exists in speech production has to be explored. Conflicting cues have been studied in speech perception research, and must also be considered in speech production.

FUTURE

Present synthesis systems produce output that is fairly intelligible. Considering the difficulty of the task of generating speech from an unknown string of letters, the quality of the speech is astonishingly high in some systems. This partial success could have a negative effect for the near future. It does not immediately force synthesis research to explore new approaches and to investigate more complete models of the speech production process.

We may learn from research on speech recognition, which has had a different evolution compared to speech synthesis. At a very early stage it was realized that recognition based on pattern matching techniques was inadequate to solve the general problem. Thus, in parallel, complex systems designed for recognition of continuous speech started to be developed. It was clear that all aspects of speech communication had to be included.

We now face a scientific dilemma. It is possible to make more shortcuts in speech synthesis than in speech recognition. Does this mean that the effort to generate high quality synthetic speech will be reduced? The number of papers in scientific journals and at conferences dealing with speech synthesis has been drastically reduced. A reorientation of speech synthesis efforts appears to be needed to make it again an interesting and active field of research.


REFERENCES

Allen, J., Hunnicutt, S., Carlson, R. and Granstrom, B. (1979). MITalk-79: The MIT text-to-speech system. In J.J. Wolf and D.H. Klatt (eds.), Speech Communication Papers Presented at the 97th Meeting of the Acoustical Society of America. Acoustical Society of America.
Carlson, R. and Granstrom, B. (1976). A text-to-speech system based entirely on rules. Conf. Rec. 1976 IEEE International Conf. on Acoustics, Speech, and Signal Processing.
Coker, C.H., Umeda, N. and Browman, C.P. (1973). Automatic synthesis from ordinary English text. IEEE Trans. Audio and Electroacoustics, AU-21, 293-397.
Cooper, F., Liberman, A., Lisker, L. and Gaitenby, J. (1962). Speech synthesis by rules. Paper F2, Speech Communication Seminar, Stockholm.
Fujimura, O. (1976). Syllables as concatenated demisyllables and affixes. 91st Meeting of the Acoustical Society of America.
Klatt, D.H. (1982). The KLATTalk text-to-speech conversion system. IEEE Int. Conf. on ASSP, IEEE Catalog No. 82CH1746-7, 1589-1592.
Mattingly, I. (1968). Synthesis by rule of American English. Supplement to Status Report on Speech Research, Haskins Laboratories, New Haven, Conn., USA.
Olive, J.P. (1977). Rule synthesis from dyadic units. IEEE Int. Conf. on ASSP, IEEE Catalog No. 77CH1197-3 ASSP, 568-570.

LINGUISTIC UNITS FOR F0 SYNTHESIS

Janet Pierrehumbert
Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974

1. Introduction

In this note, I will summarize some results on the structure of the English intonation system and relate them to rules for F0 synthesis. So far, speech synthesis programs have not used the full repertoire of English intonation. However, study of the complete system can help guide the choice of units and the formulation of implementation rules. Thus, the first part of the paper will be devoted to theoretical description and the latter part to practical issues. Comparisons to other proposals in the literature can be found in Pierrehumbert (1981).

2. A Phenomenology

2.1 Melody, Stress, and Phrasing

In English, the words which comprise an utterance constrain but do not determine its intonation. Three dimensions of choice are melody, stress, and phrasing. The same text, with the same stress pattern, can be produced with many different melodies. For example, Figure 1 shows three of the many F0 patterns that could be produced on the monosyllable "Anne". 1A is a declarative, 1B is a question, and 1C could convey incredulity. Conversely, the same melody can occur on many different texts. These basic observations have led both British and American linguists to treat melodies as autonomous constructs which are coordinated with the text.

The coordination of melody and text is determined by the stress pattern and the phrasing. Some features of the melody are associated with stressed syllables, while others are associated with the intonation phrase boundary. The F0 configurations which are (optionally) assigned to stressed syllables are "pitch accents". The English pitch accents include the peak in 1A, the valley in 1B, and the scoop, or valley plus peak, in 1C. In a polysyllabic word, these configurations attach themselves to the main word stress.

[Figure 1: F0 contours on the monosyllable "Anne" - 1A "ANNE." (declarative), 1B "ANNE?" (question), 1C "ANNE!" (incredulity).]

[Figure 2: F0 contour, on a 100-300 Hz scale, for "THE CARDAMOM BREAD WAS PALATABLE".]

In multiword utterances, the phrasal stress subordination controls their attachment. In Figure 2, for example, focus on the subject has moved the single scoop accent to the front of the sentence. The relation of stress to melody also shows up in a different form. The more stressed syllables there are in a phrase, the more opportunities for a pitch accent. Thus, longer phrases can have more complex melodies than shorter ones.

Figures 1 and 2 also exemplify some of the options in the treatment of the end of the phrase. The rise at the end of contour 1C remains at the end when the scoop accent is moved leftward, as in 2. The terminal low value in 1A would behave similarly. Multi-word utterances can be divided into several intonational phrases, each with a phrase-final tonal configuration at its end. The boundary between one phrase and the next can be marked with a pause or lengthening of the final syllable. It is important to note that the intonation phrase is a unit of phonological rather than syntactic description. Intonation phrases are not always syntactic constituents. The same sentence, with the same syntax, can often be produced with more than one intonational phrasing.

2.2 What controls the variants observed?

The previous section described options in the English intonational system. What governs choices among these options in natural speech? A tradition exemplified by Halliday (1967) defines intonation phrases as informational units, or "sense groups". Selkirk (1978) attempts to predict the set of permissible phrasings as a function of the syntax. What controls the choice among alternative phrasings is not well understood.

The phrasal stress pattern is a function of what is focused and what is presupposed in the discourse. The F0 contour in Figure 2, for example, is only appropriate if "palatable" is in some sense already known. Carlson (1982) and Selkirk (1983) take interesting approaches to this problem and provide additional references.

Efforts to describe the usage of particular melodies suggest that they function much like pragmatic particles in other languages, such as "doch" in German. They convey information about the attitude of the speaker and the relation of the utterance to others in the discourse. Sag and Liberman (1975) describe instances in which melody disambiguates between the literal and rhetorical readings of questions. Ladd (1978) suggests that a common vocative melody has a broader meaning as a marker of conventionalized expressions. The rather sparse experimental work on the perception of melody tends to support this viewpoint; see especially Nash and Mulac (1980) and Bonnet (1980).

2.3 Micromelody

The speech segments introduce perturbations of the prosodically derived F0 contour. All other things equal, the F0 during a high vowel is higher than during a low vowel. Voiceless obstruents raise the F0 on a following vowel; voiced obstruents depress it. A large literature documents the existence of these effects, but does not provide a complete picture of their quantitative behaviour in different prosodic environments. (See Lea (1973) for a review.) Making even an approximation to them is likely to add to the naturalness of F0 synthesis.
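These segmental perturbations can be approximated with simple additive corrections, as in the following sketch. The segment classes are illustrative and the offset values are invented placeholders rather than measured data.

```python
# A toy approximation of the micromelody effects described above: raise F0
# slightly on high vowels and after voiceless obstruents, lower it after
# voiced obstruents. The offsets are invented placeholders, not measured
# values from any study.

HIGH_VOWELS = {"i", "u"}
VOICELESS_OBSTRUENTS = {"p", "t", "k", "s", "f"}
VOICED_OBSTRUENTS = {"b", "d", "g", "z", "v"}

def perturb_f0(segments, f0_targets, delta_hz=8.0):
    """segments: phone labels; f0_targets: prosodically derived F0 (Hz)."""
    out = []
    for i, (seg, f0) in enumerate(zip(segments, f0_targets)):
        if seg in HIGH_VOWELS:
            f0 += delta_hz                      # intrinsic vowel F0
        prev = segments[i - 1] if i > 0 else None
        if prev in VOICELESS_OBSTRUENTS:
            f0 += delta_hz                      # raising after a voiceless C
        elif prev in VOICED_OBSTRUENTS:
            f0 -= delta_hz                      # depression after a voiced C
        out.append(f0)
    return out

print(perturb_f0(["t", "i", "d", "a"], [120, 120, 118, 116]))
```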

3. A Theory

3.1 Phonological Units

A theory of intonation developed in Pierrehumbert (1980) decomposes the melody into a sequence of elements. The minimal units in the system are low (L) and high (H) tones. A pitch accent can consist of a single tone or a pair of two tones, with one marked to fall on the stress; altogether seven different accents are posited. The melodic elements associated with the end of the phrase are the boundary tone, which is located right at the phrase boundary and may be either L or H, and the phrase accent, which controls the F0 between the last pitch accent and the boundary tone, and may also be either L or H. The boundary tone was proposed in Liberman (1975). The idea of the phrase accent is adapted from Bruce's (1977) work on Swedish. As far as is known, different pitch accents, phrase accents, and boundary tones combine freely, so that the phrasal melodies can be treated as the output of a finite state grammar.
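The finite state claim can be illustrated directly: any melody is one or more pitch accents followed by a phrase accent and a boundary tone. The sketch below uses a reduced accent inventory (the theory posits seven) and the starred-tone notation of Pierrehumbert (1980); it is an illustration, not the full grammar.

```python
# Enumerating well-formed phrasal melodies as the output of a finite state
# grammar: PITCH_ACCENT+ PHRASE_ACCENT BOUNDARY_TONE. The accent inventory
# is abbreviated for the example.

from itertools import product

PITCH_ACCENTS = ["H*", "L*", "L*+H"]   # subset of the seven accents posited
PHRASE_ACCENTS = ["H-", "L-"]
BOUNDARY_TONES = ["H%", "L%"]

def melodies(n_accents):
    """All well-formed melodies with exactly n_accents pitch accents."""
    for accents in product(PITCH_ACCENTS, repeat=n_accents):
        for pa, bt in product(PHRASE_ACCENTS, BOUNDARY_TONES):
            yield list(accents) + [pa, bt]

for m in list(melodies(1))[:4]:
    print(" ".join(m))
```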

3.2 Realization

There are two aspects of the implementation process to consider. Tones are realized as crucial points in the F0 contour; transitions between one tone and the next fill in the F0 contour. The transitions can generate contours on syllables with only one L or H tone, and they supply F0 values for unstressed or unaccented syllables which do not carry a tone.

Experiments reported in Liberman and Pierrehumbert (1983) shed light on the mapping from tones to crucial points in the contour. It appears that computations can be made left to right in a "window" of two pitch accents. Superficially global trends arise from iterative application of these local rules.

The transitions between crucial points have not been as well studied. The intonation synthesis system described in Pierrehumbert (1981) posits nonmonotonic transitions to achieve a sparse representation. Between two H pitch accents, a transition which dips as a function of the separation in frequency and time between the two target values is computed. Our impression from comparing rule systems is that the ear is much less sensitive to the shape of the transitions than to the configuration of the crucial points in the F0 contour.

3.3 Disagreements

The picture we have presented differs in important ways from others in the literature. In the speech communication literature, F0 sometimes figures as a transducer of stress: the more a syllable is stressed, the higher its F0 or the greater its F0 change. The existence of many different pitch accents means that this approach cannot be maintained in the general case. We also deny that intonation is built up in layers, by superposing local movements on a global component. We differ from theorists who decompose the melody into F0 changes rather than into crucial points. Pierrehumbert (1980) and Liberman and Pierrehumbert (1983) offer arguments for this position.
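A minimal version of the crucial-point scheme of sections 3.2-3.3 can be written down as follows: tones become (time, Hz) targets and transitions fill in the rest, with a nonmonotonic dip between two H targets. The dip constant and the linear backbone are invented simplifications, not the actual Bell Laboratories rules.

```python
# Sketch of realization by crucial points plus transitions. Each crucial
# point is (time_s, hz, tone); the contour between two H targets dips by
# an amount that grows with their separation. Constants are placeholders.

def interpolate(points, step=0.01, dip_per_sec=20.0):
    """points: time-ordered crucial points; returns sampled (t, Hz) pairs."""
    contour = []
    for (t0, f0, tone0), (t1, f1, tone1) in zip(points, points[1:]):
        dip = dip_per_sec * (t1 - t0) if tone0 == "H" and tone1 == "H" else 0.0
        t = t0
        while t < t1:
            x = (t - t0) / (t1 - t0)            # position in transition, 0..1
            f = (1 - x) * f0 + x * f1           # linear backbone
            f -= dip * 4 * x * (1 - x)          # parabolic dip, deepest mid-way
            contour.append((round(t, 2), round(f, 1)))
            t += step
    return contour

crucial = [(0.10, 180.0, "H"), (0.60, 175.0, "H"), (0.80, 110.0, "L")]
print(interpolate(crucial)[:5])
```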

4. Synthesis Rules

Section 3 suggests that F0 contours can be represented as a sequence of tonal units assigned relatively sparsely to the text. Deriving the continuous F0 contours needed for speech synthesis from such a representation does not appear to be problematic. The major problems arise at the level of assigning the representation to a text; only part of the information which governs real speakers' choices is available in a speech synthesis program. Punctuation is an imperfect representation of phrasing. A representation of new and old information sufficient to predict the phrasal stress contour is not available. Nor do current systems model the pragmatic and affective information on which choice of melody and overall pitch range appear to be based.

Two responses to this state of affairs suggest themselves. The missing information can be specified in the input, using an appropriate transcription of melody and phrasing. Pierrehumbert (1981) describes a program based on this approach. In a text-to-speech system, however, it is necessary to make the best of conventional English text input. What principles can be used to generate an intonation transcription from the information available? Phrasing breaks are posited at punctuation marks. A parse may be helpful for further subdivision of the sentence, but only if it is accurate and feeds a good set of phrasing rules. To date, the problem of selection among accent types has always been side-stepped by the use of a reduced inventory. For example, rules for text-to-speech given in Pierrehumbert (1981) use H accents, and the MITalk rules, which are based on observations in O'Shaughnessy (1979), appear to use a mixture of two accent types. Rules for accent placement rely on tendencies in the language. Pierrehumbert's rules are based on the tendency for rhythmic alternation. O'Shaughnessy's depend on the fact that some parts of speech, such as nouns, receive accents more often than others, such as verbs.

In our experience, the most conspicuous deficiencies of such systems are inappropriate phrasing and wrong placement of accents. Assigning accents on a purely syntactic or rhythmic basis often gives undue importance to words which are presupposed in the discourse, or insufficient importance to words which should be emphasized. Errors involving wrong placement of accent in compounds are especially abrasive. Additional study of phrasing and accent placement is likely to suggest rules of thumb for improving this aspect of F0 synthesis. The restricted melodic inventory used in F0 synthesis programs is monotonous even for reading texts of any length. Examination of F0 contours in Maeda (1976) and O'Shaughnessy (1976) shows that speakers use a variety of pitch accents even in neutral declarative materials. Work on unobtrusive variation of accent type is likely to pay off in a more lively result.
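For concreteness, the kind of baseline this section criticizes - phrase breaks at punctuation, accents assigned by part of speech - can be reduced to a few lines. The tags and the accent inventory here are invented placeholders, not the MITalk or Pierrehumbert (1981) rules.

```python
# Toy accent-placement heuristics: posit a phrase break at punctuation and
# assign pitch accents by part of speech (nouns and adjectives more readily
# than verbs). This reproduces exactly the purely syntactic assignment
# whose deficiencies are described above.

ACCENTABLE = {"NOUN": "H*", "ADJ": "H*"}      # content words get accents
BREAKS = {".", ",", "?", "!"}

def assign(tagged_words):
    """tagged_words: list of (word, pos) pairs; punctuation has pos 'PUNCT'."""
    out = []
    for word, pos in tagged_words:
        if word in BREAKS:
            out.append((word, "PHRASE-BREAK"))
        else:
            out.append((word, ACCENTABLE.get(pos, "-")))
    return out

sent = [("the", "DET"), ("cardamom", "NOUN"), ("bread", "NOUN"),
        ("was", "VERB"), ("palatable", "ADJ"), (".", "PUNCT")]
print(assign(sent))
```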

References

Bonnet, G. (1980), "A Study of Intonation in the Soccer Results," Journal of Phonetics, 8, 21-38.
Bruce, G. (1977), "Swedish Word Accents in Sentence Perspective," Travaux de l'Institut de Linguistique de Lund. Malmö: CWK Gleerup.
Carlson, L. (1982), "Dialogue Games: An Approach to Discourse Analysis," Ph.D. dissertation, MIT (forthcoming from Reidel).
Halliday, M.A.K. (1967), "Intonation and Grammar in British English," The Hague: Mouton.
Ladd, D.R. (1978), "Stylized Intonation," Language, 54, 517-540.
Liberman, M. (1979), "The Intonational System of English," New York: Garland.
Liberman, M. and J. Pierrehumbert (forthcoming in 1983), "Intonational Invariance under Changes in Pitch Range and Length," in M. Aronoff and R. Oehrle, eds., Language Sound Structure. Cambridge: MIT Press.
Maeda, S. (1976), "A Characterization of American English Intonation," Ph.D. dissertation, MIT.
Nash, R. and A. Mulac (1980), "The Intonation of Verifiability," in L.R. Waugh and C.H. van Schooneveld, eds., The Melody of Language. Baltimore: University Park Press.
O'Shaughnessy, D. (1976), "Modelling Fundamental Frequency, and its Relationship to Syntax, Semantics, and Phonetics," Ph.D. dissertation, MIT.
O'Shaughnessy, D. (1979), "Linguistic Features in Fundamental Frequency Patterns," Journal of Phonetics, 7, 119-146.
Pierrehumbert, J. (1980), "The Phonology and Phonetics of English Intonation," Ph.D. dissertation, MIT (forthcoming from MIT Press, Cambridge).
Pierrehumbert, J. (1981), "Synthesizing Intonation," Journal of the Acoustical Society of America, 70, 985-995.
Sag, I. and M. Liberman (1975), "The Intonational Disambiguation of Indirect Speech Acts," Papers from the Eleventh Regional Meeting of the Chicago Linguistic Society, 487-497.
Selkirk, E. (1978), "On Prosodic Structure and its Relation to Syntactic Structure," paper presented to the Sloan Foundation Conference on Mental Representation in Phonology. (MS, Dept. of Linguistics, U. Mass., Amherst.)
Selkirk, E. (forthcoming in 1983), "Phonology and Syntax: The Relation between Sound and Structure," Cambridge: MIT Press.

Symposium 3

Models of the larynx

LARYNX MODELS AS COMPONENTS IN MODELS OF SPEECH PRODUCTION

Celia Scully
University of Leeds, U.K.

The most relevant dictionary definition of a model is the simplified representation of a process or system. To be manageable, a complicated system, such as that which generates speech, needs to be reducible to a number of quasi-independent components. It is helpful to identify the larynx as one such component. Its complexity and importance in speech are indicated by the number of scholars investigating larynx activity and by the wide variety of techniques employed. Since the symposium on the larynx at the 8th International Congress of Phonetic Sciences (Fant and Scully, 1977), several conferences and books have been partly or entirely devoted to the larynx in speech and singing (Fink, 1975; Carré, Descout & Wajskop, 1977; Fink & Demarest, 1978; Boë, Descout & Guérin, 1979; Lass, 1979, 1981; Lawrence & Weinberg, 1980; Stevens & Hirano, 1981).

Modelling is one path towards a greater understanding of the larynx, but it needs to be considered as several systems:
1. neural control mechanisms;
2. anatomical structures, tissue properties and muscle mechanics;
3. articulation;
4. aerodynamics;
5. acoustic sources;
6. acoustic filters.
These are tentative, simplified and, to some extent, arbitrary divisions. The processes listed are not truly independent. Insofar as each system may be considered as a stage of speech production, conditions in a later stage are determined from the variables defined in more than one of the earlier stages. The larynx does not operate in isolation. To different degrees, in each of the systems listed above, links between the larynx and both subglottal and supraglottal regions need to be considered. Neural systems will not be discussed here (but see Muller, Abbs & Kennedy, pp. 209-227 in Stevens & Hirano, 1981).

Anatomical and physiological links

Anatomically, it may seem reasonable to treat the larynx structures, from cricoid cartilage to hyoid bone, as a component in a total speech-producing system which is reducible in the cybernetic sense. But through extrinsic muscles the larynx is linked to structures which shape the vocal tract; because it forms the top of the trachea, the larynx is affected by forces associated with changes of lung volume. To what extent could a simplified representation of the forces associated with the extrinsic laryngeal muscles explain, for example, the hyoid bone movements observed in real speech? To what extent do larynx settings and vocal tract shapes covary? Fink and Demarest (1978) have provided a foundation for this kind of larynx modelling.

The articulatory system

Here, articulation is taken to mean all actions needed for speech production, not only those which shape the supraglottal vocal tract. At least three kinds of articulatory action may tentatively be ascribed to the larynx:
1. the slowly-changing (d.c.) component of glottal area, or abduction and adduction of the vocal folds;
2. vertical movements of the larynx;
3. changes in the effective stiffness and mass of the vocal folds, as a component of fundamental frequency control.
There are unanswered questions here. To what extent does an action of one of these three articulators affect the other two? Should there be a fourth articulatory action for the control of phonation type? Timing and coordination are of the essence in an articulatory model. Much has been learned in recent years about changes of glottal width in real speech and patterns of associated muscle activity (for example, Hirose & Sawashima, Chapter 11 and Sawashima & Hirose, Chapter 22 in Stevens & Hirano, 1981). Coordination of actions, both within the larynx and also between laryngeal and other articulators, may be modelled. What is not yet well established is the variability - both in timing and in articulatory distance - found in real speech. Is variability fixed for a given speaker and/or a given speech rate? Are kinematic descriptions sufficient for an articulatory larynx model, or is there a pressing requirement to model the dynamics?

The aerodynamic system

The low frequency (slowly changing, d.c.) components of volume flow rate of air through the glottis and pressure drop across the glottis are variables in the aerodynamic system of speech production. This is irreducible: aerodynamic conditions at the glottis depend on the changing configurations of the whole respiratory tract. Significant subglottal factors include the rate of lung volume decrement and subglottal airways resistance. Significant supraglottal factors include the cross-section area of any severe constrictions of the vocal tract and changes in cavity volumes: both active, due to articulatory actions, and passive, due to wall compliance. More data are needed to develop improved representations of the glottal orifice as a flow-dependent resistance.

Laryngeal processes for acoustic sources

Quantification of the myoelastic-aerodynamic theory of voicing is an important goal for modelling. Larynx muscle forces, the viscoelastic properties of the body and cover of the vocal folds and their geometry combine with local aerodynamic forces to make the vocal folds vibrate. The resultant waveform for the a.c. component of volume velocity of glottal airflow can be said to define a 'short circuit current source': the voice source as it would appear in the absence of the vocal tract acoustic tube. Each vocal fold may be represented as two or more mass-spring systems, as a continuous bounded medium or as a beam; a functional approach may be employed instead. The choice of model may depend on the application. Among the questions as yet unanswered by modelling is the exact nature of the aerodynamic contribution to fundamental frequency. The processes generating turbulence noise are not well understood,
especially for the intermittent airflow and moving boundaries associated with vibration of the vocal folds. Because of the unity of the whole aerodynamic system, glottal and supraglottal source processes are interrelated. A complex auditory framework has been proposed for laryngeal and other components of voice quality (Laver, 1980). Relationships between wave descriptions for the acoustic sources and perceptual attributes need to be further explored. It may be hoped that modelling will take us towards a consistent descriptive framework for vibration modes of the vocal folds and the co-occurrence of voice and turbulence noise sources. An agreed set of phonetic features for the larynx may follow, eventually. Clarification of nomenclature could be one short-term goal, but even this may have to await increased understanding of the concepts of register and phonation type.

Acoustic coupling between source and filter

The subglottal filter might be included. A voice source as defined above is assumed to be linearly separable from the acoustic filters, but source and filters interact, especially in the frequency region of the first formants. One approach is to derive true glottal flow from lung air pressure and waveforms of glottal area. To what extent do glottal losses and the effects of F0 - F1 proximity vary with speaker type?

Concluding remarks

The larynx presents us with a large number of questions about the control and operation in speech and singing of its tiny and rather inaccessible structures. Some challenges and possibilities for modelling may be offered. Simplification is advantageous, but only if it does not seriously distort the system and change the whole picture. Dominant factors may emerge, but decisions about the need to include specific variables in a model must be judgements (or guesses) based on fragmentary data. Modelling and data gathering can complement each other and give a 'leap-frog' progress in understanding portions of the complex whole. Individual factors may be identified and manipulated as they cannot be in real speech. For example, acoustic coupling and anatomical links between larynx and vocal tract can be dissociated. Since modelling of acoustic coupling alone gives variations of fundamental frequency with vowel type in the opposite sense to that of real speech, physiological links should give overriding effects (Guérin, Degryse and Boë, pp. 263-277 in Carré et al., 1977). Modelling should be able to make predictions beyond what can be measured in real speech. Normal adult speech is so skilful, reliable and, probably, stylised, that the tasks performed in speech are not obvious. A model can get things wrong (only too easily!) by using inappropriate settings, actions and temporal organisation, and thus help to explain the particular patterns found in real speech. Regions of stability for the larynx may be identified. Models ought to conform to physical principles and constraints, whilst being capable of simulating different speaker types and different options; also different degrees of control, as in singing versus speech. Increasingly realistic representations will have obvious applications in speech synthesis (both terminal-analog and line-analog) and in diagnosis of pathological states of the vocal folds. One objection to modelling as an analysis-by-synthesis approach is that, with many ill-quantified parameters to manipulate, too many uninformative matches to real speech can be achieved. Practical experience suggests that this need not be a major worry for the present larynx modelling symposium; where current models are not completely successful simulations, the ways in which they fail are illuminating.
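One of the quantitative ingredients mentioned above - the glottal orifice treated as a flow-dependent ("kinetic") resistance - can be sketched numerically. The constants below are rough illustrative values, not parameters from any of the models cited in this paper.

```python
import math

# A toy version of the glottal orifice as a flow-dependent resistance:
# with only the "kinetic" pressure loss retained, the pressure drop dP
# across the glottis and the volume flow U through a glottal area A obey
# dP = k * rho * (U/A)**2 / 2, i.e. U = A * sqrt(2*dP/(k*rho)).
# Rough textbook constants; a sketch, not any published implementation.

RHO = 1.14e-3   # air density, g/cm^3
K = 1.0         # empirical orifice coefficient (order of 1)

def glottal_flow(area_cm2, dp_dyn_cm2):
    """Volume velocity (cm^3/s) through the glottis for a given area and
    transglottal pressure drop (dyn/cm^2; 1 cm H2O is about 980 dyn/cm^2)."""
    return area_cm2 * math.sqrt(2.0 * dp_dyn_cm2 / (K * RHO))

# A schematic open phase: area rises and falls within one period.
for ms in range(0, 8):
    a = max(0.0, 0.15 * math.sin(math.pi * ms / 8.0))     # peak ~0.15 cm^2
    u = glottal_flow(a, 7840.0)                           # ~8 cm H2O drop
    print(ms, "ms  U =", round(u), "cm^3/s")
```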

References

Boë, L.J., Descout, R. & Guérin, B., eds. (1979). Larynx et Parole. Proceedings of a GALF Seminar, Grenoble, February 8-9, 1979. Grenoble: Institut de Phonétique de Grenoble.
Carré, R., Descout, R. & Wajskop, M., eds. (1977). Articulatory Modelling and Phonetics. Proceedings of a Symposium, Grenoble, July 10-12, 1977. Brussels: GALF Groupe de la Communication Parlée.
Fant, G. & Scully, C., eds. (1977). The Larynx and Language. Proceedings of a Discussion Seminar at the 8th International Congress of Phonetic Sciences, Leeds, August 17-23, 1975. Phonetica 34.
Fink, B.R. (1975). The Human Larynx, A Functional Study. New York: Raven Press.
Fink, B.R. & Demarest, R.J. (1978). Laryngeal Biomechanics. Cambridge, MA: Harvard University Press.
Lass, N.J., ed. (1979, 1981). Speech and Language: Advances in Basic Research and Practice, Volumes 2 & 5. New York: Academic Press.
Laver, J. (1980). The Phonetic Description of Voice Quality. Cambridge: Cambridge University Press.
Lawrence, V.L. & Weinberg, B., eds. (1980). Transcripts of the Eighth Symposium Care of the Professional Voice, Parts I, II, & III, June 11-15, 1979, The Juilliard School, NYC. New York: The Voice Foundation.
Stevens, K.N. & Hirano, M., eds. (1981). Vocal Fold Physiology. Proceedings of a Conference, Kurume, January 15-19, 1980. Tokyo: University of Tokyo Press.

THE VOICE SOURCE - ACOUSTIC MODELLING

Gunnar Fant
Department of Speech Communication & Music Acoustics, Royal Institute of Technology, Stockholm, Sweden

Abstract

Recent improvements in the source-filter concept of voice production take into account interaction between the time variable nonlinear glottal impedance and the pressure-flow state of the entire sub- and supraglottal systems. Alternative definitions of source and filter function in a physical production model and in synthesis schemes are reviewed. Requirements for time and frequency domain parameterization of the voice source are discussed with reference to speech dynamics.

Introduction

The acoustics of speech production is based on the concept of a source and a filter function - in a more general sense a raw material and a sound shaping process. In current models the source of voiced sounds is represented by a quasiperiodic succession of pulses of air emitted through the glottis, as the vocal cords open and close, and the filter function is assumed to be linear and short-time invariant. Not much work has been done on the description of the voice source with reference to speaker specifics and to contextual factors. Speech synthesis has gained a fair quality on the basis of conventional idealizations, such as a -12 dB/oct average spectrum slope and uniform shape. Source and filter have been assumed to be linearly separable.

This primitive view has served well as the foundation of acoustic phonetics, but it is time to revise and improve it. In the last few years it has become apparent that we need a firmer theoretical basis of voice production for descriptive work. Also, there is demand for a more natural quality in speech synthesis and a way out of the male voice dominance.

A more profound view of the production process is now emerging. It is clear that the filter and source functions are mutually dependent; there is acoustical and mechanical interaction (Ishizaka and Flanagan, 1972). The concept and the definition of a source will differ in a true, maximally human-like model and in a terminal analog synthesizer. Also, there exist various combinations of a source and a filter function that will produce one and the same or approximately the same output. A part of the theoretical foundation lies in the development of suitable models for parametrical representation of the voice source together with measurement techniques, such as time domain and frequency domain inverse filtering and spectrum matching. These tools are still in a developing stage.

Source-filter decomposition of voiced sounds

The major theoretical complication in human voice production models is that in the glottal open state, the sub- and supraglottal parts of the vocal tract are acoustically coupled through the time variable and nonlinear glottis impedance, whereas when the glottis is closed the sub- and supraglottal systems execute approximately free and separate oscillations. Resonance frequencies and especially bandwidths may differ in the two states. The glottal open period can be regarded as a phase of acoustical energy charging, Ananthapadmanabha and Fant (1982), followed by a discharge at glottal closure. During the glottal closed conditions, the formants prevail with a constant bandwidth, i.e., a constant rate of decay, followed by a relatively faster decay of amplitude during the next glottal open interval. This truncation effect (Fant, 1979, 1981; Fant and Ananthapadmanabha, 1982) is especially apparent in maximally open back vowels of high F1. The intraglottal variations in the system function are only approximately definable by time varying frequencies and bandwidths of vocal resonances.

This brief introduction to the production mechanism shows that terminal analog synthesizers with independent source and filter function linearly combined have inherent limitations. The physically most complete speech production model that has been developed is that of Ishizaka and Flanagan (1972), Flanagan et al. (1975). With the two-mass model of the vocal cords incorporated in a distributed parameter system, their model does not have a specific source in the linear network sense. It is a self-oscillating and self-adjusting system, the main power deriving from the expiratory force as represented by the lung pressure.

In the work of Ananthapadmanabha and Fant (1982), the acoustical modeling of voice production starts by assuming a specific glottal area function Ag(t) within a fundamental period and a specific lung pressure, PL. The flow and pressure states in other parts of the system are then calculated by techniques similar to those of Ishizaka and Flanagan (1972), leading to numerical determinations of the glottal volume velocity Ug(t), the output volume velocity at the lips U0(t), and the sound pressure, pa(t), at a distance of a centimeters from the speaker's mouth. By defining the filter function as the relation of pa(t) to Ug(t) we define Ug(t) as the source, or we may go to the underlying glottal area function, Ag(t), as a conceptual reference which may attain the dimensionality of a flow source by calculating the current in a submodel with the lung pressure PL inserted as a constant voltage across the time-variable glottal impedance represented by the "kinetic" term only, thus ignoring glottal inductance and frictional resistance. These are of secondary importance only and may be taken into account later, Ananthapadmanabha and Fant (1982). From the relation ...

... excitations, i.e., at discontinuities at a glottal opening which are highly susceptible to glottal damping. Excitations during glottal closure reported by Holmes (1976) would require additional discrete elements of excitation functions in the glottal source. The theoretical modelling undertaken here to convert glottal area function into glottal flow could be extended to take into account the displacement of air associated with the lateral and longitudinal movements of the vocal cords (Flanagan and Ishizaka, 1978).

Conclusions, source dynamics

The theory of voice production, as well as parametric models and data collection techniques, has advanced significantly during the last years but we are still in a developing stage gradually approaching the central objects of research, a quantitative rule-oriented description of voice individualities, and speech dynamics. A voiced sound may be decomposed into a source function and a filter function, the definition of which varies with the particular production or synthesis model. We do not yet have sufficient experience of how well the complete interactive model may be approximated in various synthesis schemes. Terminal analog synthesizers may be modified in various ways, e.g., to adopt the smoothed glottal flow as source and to simulate truncation effects, which should ensure a sufficient amount of naturalness for synthesis of low F0 voices. It is possible that the situation is not that simple at high F0, where the interaction effects appear to be more demanding.

Up till now inverse filtering has mostly served as a tool for qualitative, indirect studies of the phonatory process at the level of the vocal cords and of temporal patterns of formant excitation. It is time that we develop more quantitatively oriented analysis techniques. This requires specifications of simultaneous values of both source and filter functions and of significant interaction effects, e.g., truncation phenomena as a constituent in estimates of effective bandwidths. Source spectrum shapes are not adequately described by a single slope value only. An important feature derivable from the particular combination of the F1, F0, and K parameters (Fant, 1979) is the amplitude level of the F0 source maximum relative to the source spectrum level at some higher frequency, e.g., at 1000 Hz. In connected speech, the F0 level stays rather invariant whilst formant levels, in addition to F-pattern-induced variations, tend to fall off as a result of a progressive abduction. Similar effects occur as a reaction from supraglottal narrowing when extended beyond that of close vowels. The amplitude level of F1 relative to that of F0 and also the absolute level of F0 are ...

Goldstein: An outline of recent research progress

... 4,5). This finding was synthesized in a model of auditory analysis that provides a tonotopically organized short-term frequency amplitude spectrum which preserves all the information required by the psychophysics of auditory frequency measurements and is constrained by auditory-nerve physiology (6,18). This model of auditory analysis, coupled to an optimum central processor of fundamental frequency, replicates the predictions of the earlier non-physiological optimum processor theory of pitch of complex tones. The new analysis model is more general because the short-term spectrum signalling the central processor is defined for arbitrary acoustic signals including speech. Indeed the potential relevance of this auditory analysis model for speech signals has been directly demonstrated in physiological experiments (19,13).

Several aspects of this research on the perception of fundamental pitch are relevant for speech research. Most directly, because fundamental pitch is a significant conveyer of linguistic information, the basic understanding of its auditory communication limits should be useful in the normal situation as well as in the cases of deteriorated speech signal and damaged auditory system. Secondly, the short-term speech spectrum described by the new auditory analysis model, or its variants, is likely to be more relevant for speech communication than conventional spectrogram representations. Finally, the general research strategy of relating both experiment and theory of psychophysics and physiology and the use of ideal communication models for representing central processing should be productive as well for the more complex problems of speech communication where the tasks of the central processor are less understood (1,17).

References
1. Delgutte, B. (1982). Some correlates of phonetic distinctions at the level of the auditory nerve. Pp. 131-149 in Carlson and Granstrom (eds.), The Representation of Speech in the Peripheral Auditory System. Elsevier Biomedical Press.
2. Gerson, A. and Goldstein, J.L. (1978). Evidence for a General Template in Central Optimal Processing for Pitch of Complex Tones. J. Acoust. Soc. Am. 63, 498-510.
3. Goldstein, J.L. (1973). An Optimum Processor Theory for the Central Formation of the Pitch of Complex Tones. J. Acoust. Soc. Am. 54, 1496-1516.
4. Goldstein, J.L. (1978). Mechanisms of Signal Analysis and Pattern Perception in Periodicity Pitch. Audiology 17, 421-445.
5. Goldstein, J.L. (1980). On the Signal Processing Potential of High Threshold Auditory Nerve Fibers. Pp. 293-299 in van den Brink and Bilsen (eds.), Psychological, Physiological and Behavioural Studies in Hearing. Delft Univ. Press.
6. Goldstein, J.L. (1981). Signal Processing Mechanisms in Human Auditory Processing of Complex Sounds. Final Report, U.S.-Israel Binational Fund, Grant No. 1286/77, 4/1977-3/1980, Tel-Aviv University, School of Engineering.
7. Goldstein, J.L. and Srulovicz, P. (1977). Auditory-Nerve Spike Intervals as an Adequate Basis for Aural Spectrum Analysis. Pp. 337-345 in Evans and Wilson (eds.), Psychophysics and Physiology of Hearing. Academic Press.
8. Goldstein, J.L., Gerson, A., Srulovicz, P. and Furst, M. (1978). Verification of the Optimal Probabilistic Basis of Aural Processing in Pitch of Complex Tones. J. Acoust. Soc. Am. 63, 486-497.
9. Houtsma, A.J.M. and Goldstein, J.L. (1971). Perception of Musical Intervals: Evidence for the Central Origin of Musical Pitch. MIT Res. Lab. Elec. Technical Rpt. 484.
10. Houtsma, A.J.M. and Goldstein, J.L. (1972). The central origin of the pitch of complex tones: evidence from musical interval recognition. J. Acoust. Soc. Am. 51, 520-529.
11. Plomp, R. and Smoorenburg, G.F., eds. (1970). Frequency Analysis and Periodicity Detection in Hearing. Sijthoff, Leiden.
12. Rabiner, L.R., Cheng, M.J., Rosenberg, A.E. and McGonegal, C.A. (1976). A comparative performance study of several pitch detection algorithms. IEEE Trans. ASSP 24, 399-418.
13. Sachs, M.B. and Young, E.D. (1980). Effects of Nonlinearities on Speech Encoding in the Auditory Nerve. J. Acoust. Soc. Am. 68, 858-875.
14. Schouten, J.F. (1970). The residue revisited. In Plomp and Smoorenburg, op. cit.
15. Siebert, W.M. (1968). Stimulus Transformations in the Peripheral Auditory System. Pp. 104-133 in Kolers and Eden (eds.), Recognizing Patterns. M.I.T. Press, Cambridge.
16. Siebert, W.M. (1970). Frequency Discrimination in the Auditory System: Place or Periodicity Mechanisms. Proc. IEEE 58, 723-730.
17. Soli, D. and Arabie, P. (1979). Auditory versus phonetic accounts of observed confusions between consonant phonemes. J. Acoust. Soc. Am. 66, 46-59.
18. Srulovicz, P. and Goldstein, J.L. (1983). A Central Spectrum Model: A Synthesis of Auditory-Nerve Timing and Place Cues in Monaural Communication of Frequency Spectrum. J. Acoust. Soc. Am., scheduled March.
19. Young, E.D. and Sachs, M.B. (1979). Representation of Steady-State Vowels in the Temporal Aspects of the Discharge Patterns of Populations of Auditory-Nerve Fibers. J. Acoust. Soc. Am. 66, 1381-1403.
Symposium 5

Phonetic explanation in phonology

253 THE DIRECTION OF SOUND J o h n J.

CHANGE

Ohala

P h o n o l o g y L a b o r a t o r y , D e p a r t m e n t of L i n g u i s t i c s , v e r s i t y of C a l i f o r n i a , B e r k e l e y , U S A

Uni-

Introduction The striking success of the comparative method for the reconstruction of the linguistic past rests in part on linguists' intuitions as to the favored direction of sound change. Given dialectal variants such as [tjuzdi] and [tjuzdi] "Tuesday", linguists know that the second is more likely to have developed from a form similar to the first rather than viceversa. This intuition is based primarily on experience, i.e., having previously encountered several cases of the sort [tj] » [tj] (where the directionality is established on independent grounds) and few or none of the reverse. If there were assumptions about the physical workings of speech production and speech perception which informed these intuitions, they were, with few exceptions, naive and empirically unsupported, e.g., the still unproven teleological notion of 'ease of articulation.' The history of progress in science and technology, e.g., in medicine, iron smelting, bridge construction, demonstrates, however, that although intuitions can take a field a long way, more can be achieved if these are complemented by empirically-founded models (theories) of the system the field concerns itself with, i.e., if induction is united with deduction. In this paper I briefly review some of the universal phonetic factors that determine the direction of sound change. Articulatory Constraints One of the clearest and most well-known examples of an articulatory constraint determining the direction of sound change is the aerodynamic factors which lead to the devoicing of stops, especially those with a long closure interval, e.g., geminates, see Table I.

T a b l e I. Devoicing of Geminate Stops in Moré 1953; transcription simplified).

(from

Alexandre

Morphophonemic Form

Phonemic Form

French

Gloss

pabbo

papo

"frapper"

bad + do

bato

"corbeilles"

lug + gu

luku

"enclos"

254

Phonetioc explanation in phonology

As noted by Passy (1890:161) voicing is threatened by the increasing oral pressure (and thus the decreasing pressure drop across the glottis) caused by the accumulation of the glottal airflow in the oral cavity. Recent mathematical models provide us with a much better understanding of this process (Rothenberg 1968, Ohala 1976, Ohala and Riordan 1979, Westbury 1979). Therefore, unless the closure duration of the stop is shortened (which may happen in word-medial, intervocalic position) or the oral cavity is actively enlarged during the closure (i.e., becomes implosive) or the oral cavity is vented somehow (say, via the nasal passage), then there is a strong physical motivation for the direction of a change in voicing to be from [+voice] to [-voice] (Ohala 1983). Acoustic Factors It has also long been recognized that certain distinct articulations may give rise to speech sounds which are substantially similar acoustically and auditorily, such that listeners may inadvertently substitute a different articulation for the original one (Sweet 1874:15-16). Sound changes such as those in Table II are very common cross-linguistically and can be shown to arise, plausibly, due to the acoustic similarity of the sounds or sound sequences involved in the change. Table II. Sound Changes Precipitated by Acoustic Similarity Different Articulations. Palatalized Labials Roman Italian

>

Apicals [pjeno]

Genoese

Italian

[bjaoko] Pre-Classical

Greek khalep-jo

>

[tjena]

(1f u l l

[d^aqku]

"white"

Classical Greek

guam-jo Labialized Velars

of

p r o v o k e ii

khaIept o

»

baino

"I

c o m e II

Labials

Proto-Indo-European

ekwos

Classical

Proto-Bantu

-kumu

W. Teke

Greek

hippos

"horse"

pfumu

"chief"

Ohala: The direction of sound change

255

The problem, though, is that if these speech sounds are just similar, then the substitutions should occur in either direction. In fact, the substitutions are strongly asymmetrical, i.e., in the direction shown in Table II. The change of labials to velars, for example, though attested (Lemle 1971) are much rarer and often occur only in highly restricted environments, e.g., Venezuelan Spanish [ e k l i k s e ] < [ e k l i p s e ] "eclipse", [ k o n s e k s i o n ] < [ k o n s e p s i o n ] "conception".where the presence of the following apical seems to be essential (Maríscela Amador, personal communication). The same asymmetry is evident in the confusion matrices resulting from laboratory-based speech perception studies such as that by Winitz, Scheib, and Reeds (1972); see Table III. Table III. Asymmetry of Identification Errors in the P e r c e p t i o n Study of W i n i t z e t al. (1972).

Speech

[p] >

[t]/

[i]

(34%) but [t] >

[p]/

[i]

(6%)

[k] >

[p] /

[u]

(27%) b u t [p] >

[k]/

[u]

(16%)

Attributing these listeners identify the will not work since in /p/ (Wang and Crawford

asymmetries to "response bias" (when in doubt, sound as that which is most frequent in the language) English, at least, /k/ occurs more frequently than 1960).

To try to understand the causes of this asymmetry in misperception it may be useful to examine similar asymmetries in the perception of stimuli in different sensory channels. (See also Ohala 1982.) When subjects' task is the identification of briefly glimpsed capital letters of the Roman alphabet (A, B, C, etc.), the following asymmetries in misidentification occur (where ' >' means misidentification in the given direction more often than the reverse): E > F, Q > 0, R > P, B > P, P > F, J > I, W > V (Gilmore, Hersh, Caramazza, and Griffin 1979). Again, "response bias" is of no help to account for the favored direction of these errors since •E1 is far more common than 1F1 (in printed English). In each of these pairs the letter on the left is graphically equivalent to that on the right plus some extra feature. As Garner (1978) has pointed out, it follows that if this extra feature is not detected, for example, in the case of the first pair, the "foot" of the 'E1, then it will be misperceived as the letter that is similar except for the lack of this feature, 1F' in the cited example. Inducing ("hallucinating") the presence of this extra feature when it is not present in the stimulus is less likely. (This does not mean that inducing absent features or segments is uncommon; in fact, if the listener has reason to suspect that he has "missed" something because of background noise, this may be the most common source of error.) To understand the asymmetries in the errors of speech perception and thus the favored direction of sound change due to this cause we should look for the features which, for example, /kw/ has but /p/ lacks. In the case of /kw/ » /p/, it seems likely that it is the relatively intense stop burst with a compact spectrum that velars have but which is missing in labials. Research on these details is likely to benefit not only diachronic phonology but also such practical areas as automatic speech recognition.

256

Phonetioc explanation in phonology

Auditory factors There is considerable evidence that speech perception is an active process such that "top-down" information is applied to resolve ambiguities and to factor out distortions in the speech signal (Warren 1970). Ohala, Riordan, Kawasaki, and Caisse (forthcoming) and Kawasaki (1978, forthcoming) demonstrated that listeners alter their identification of speech sounds or phonetic features as a function of the surrounding phonetic context-apparently by factoring out of the stimulus the distortions that the surrounding sounds would be likely to create. For example, Kawasaki found that listeners judged the same phonetically nasalized vowel to be less nasal when it was flanked by full nasal consonants, vis-a-vis the case where the nasal consonants were attenuated or deleted. Ohala et al. found that listeners would identify a more front vowel on the /i/-to-/u/ continuum as an /u/ when it was flanked by apical consonants, vis-a-vis when it was flanked by labial consonants. (That apical consonants have a fronting effect on /u/ is well known--Lindblom 1963, Stevens and House 1963--; it is this distortion, presumably, that the subjects were factoring out from the stimuli surrounded by apicals in Ohala et al.'s study.) What this means is that if the articulatory constraints of the vocal tract cause a shift of the sound A to B, then the listener's "reconstructive" ability allows him to "undo" this change and convert perceived B into recognized A. Most of the time this process seems to succeed in enabling the listener to recognize what the speaker intended to say in spite of the fact that his speech is encrusted--like a boat with barnacles--with the unintended distortions caused by articulatory constraints. There is evidence, however, that in some cases these reconstructive processes are applied inappropriately to factor out parts of the intended signal. This is a kind of "hypercorrection" at the phonetic level. As argued by Ohala (1981), this is the basic nature of dissimilation. According to this hypothesis, dissimilatory changes of the sort Latin /k u iqk u e/ > /kiqk w e/ > Italian /tjiokwe/ arose due to listeners (or a listener) thinking that the labialization on the first velar was a distortion caused by spillover of labialization from the second velar and therefore factoring it out when they pronounced this word. There is support for this hypothesis. It predicts that the only phonetic features which could be dissimilated at a distance, i.e., undergo the change in (1), would be those which could spread (1)

[CCfeature] — >

[-0-feature] /

X [^feature]

like a prosody across intervening segments. Thus features such as labialization, retroflexion, palatalization, voice quality (including that induced by aspiration), and place of articulation of consonants would be prime candidates for dissimilation and features that did not have this property, e.g., [+obstruent] or [+affricate], would not. This prediction seems to agree with the available data (see Ohala 1981). Furthermore, although it is often the case that sound changes due to assimilatory processes, as in (2), involve the (apparently simultaneous) loss of the conditioning environment (italicized in (2)), this could not happen in the case of

257

Ohala: The direction of sound change (2)

an > a foti_ » f«St

dissimilation. The conditioning environment must be present if the listener is to be misled in thinking it is the source of the phonetic feature which is factored out, i.e., dissimilated. Thus, sound changes such as those in (3), hypothetical (3)

b h and h

>

kuic]kwe >

ban kiqe

versions of Grassmann's Law and the Latin labialization dissimilations, should not occur, i.e., the dissimilating segment or feature should not be lost simultaneously only in the cases where it had caused dissimilation. Again, this prediction seems to be borne out by the available data.

Many linguists have been reluctant to include dissimilation in their list of natural or expected processes by which sounds may change, if this means giving it the same status as assimilation (Sweet 1874:13, Bloomfield 1933:390, Schane 1972:216). This is understandable, since it appears unprincipled to claim that changes of the sort AB > BB (assimilatory) are expected if changes in the reverse direction, BB > AB (dissimilatory), are also expected. No doubt to the prescientific mind the fact that wood falls down in air (when elevated and released) but rises if submerged in water presents a serious obstacle to the development of a coherent generalization regarding the expected or natural motions of objects in the world. A scientific understanding of these phenomena removes these obstacles. Similarly, the account given here delineates the different circumstances under which assimilation and dissimilation may occur, so that there is no contradiction between them. Dissimilation is perpetrated exclusively by listeners (through a kind of perceptual hypercorrection); assimilation is largely attributable to the speaker.
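A minimal sketch of how rule (1) might be mechanized, assuming an invented feature notation: a feature is factored out only when a later segment carries the same feature and the feature belongs to a stipulated "spreadable" set. The segment encoding and the SPREADABLE list are illustrative assumptions, not Ohala's formalism.

```python
# Minimal sketch of listener-based dissimilation per rule (1): a feature
# on one segment is "factored out" when the same feature occurs later in
# the word, but only if the feature is one that can spread like a prosody.
# The encoding and the SPREADABLE set are illustrative assumptions.
SPREADABLE = {"labialized", "retroflex", "palatalized", "aspirated"}

def dissimilate(segments):
    """segments: list of (base, set_of_features). Returns a new list in
    which the earlier of two matching spreadable features is removed."""
    out = [(base, set(feats)) for base, feats in segments]
    for i, (base, feats) in enumerate(out):
        for f in sorted(feats):  # sorted() copies, so discarding below is safe
            later = any(f in out[j][1] for j in range(i + 1, len(out)))
            if f in SPREADABLE and later:
                feats.discard(f)  # listener treats it as spillover: factor it out
    return out

# Latin kʷiŋkʷe > kiŋkʷe: the first velar loses its labialization.
word = [("k", {"labialized"}), ("i", set()), ("ŋ", set()),
        ("k", {"labialized"}), ("e", set())]
print(dissimilate(word))
# -> the first /k/ surfaces plain; the second keeps its labialization.
```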

References

Alexandre, R. P. (1953). La langue moré. Mémoires de l'Institut français d'Afrique noire, No. 34. Dakar.
Bloomfield, L. (1933). Language. New York: Holt, Rinehart & Winston.
Garner, W. R. (1978). Aspects of a stimulus: features, dimensions, and configurations. In Cognition and categorization (E. Rosch & B. B. Lloyd, Eds.), pp. 99-133. Hillsdale: Lawrence Erlbaum Associates.
Gilmore, G. C., Hersh, H., Caramazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431.
Kawasaki, H. (1978). The perceived nasality of vowels with gradual attenuation of adjacent nasal consonants. J. Acous. Soc. Am., 64, S19. [Forthcoming in Experimental phonology (J. Ohala, Ed.).]
Lemle, M. (1971). Internal classification of the Tupi-Guarani linguistic family. In Tupi studies I (D. Bendor-Samuel, Ed.), pp. 107-129. Norman: Summer Institute of Linguistics.


Lindblom, B. (1963). Spectrographic study of vowel reduction. J. Acous. Soc. Am., 35, 1773-1781.
Ohala, J. J. (1976). A model of speech aerodynamics. Report of the Phonology Laboratory, 1, 93-107.

Ohala, J. J. (1981). The listener as a source of sound change. In Papers from the Parasession on Language and Behavior (C. S. Masek, R. A. Hendrick, & M. F. Miller, Eds.), pp. 178-203. Chicago: Chicago Linguistic Society.
Ohala, J. J. (1982). The phonological end justifies any means. In Preprints of the plenary sessions, 13th Int. Congr. of Linguists, Tokyo, 1982, pp. 199-208. Tokyo: ICL Editorial Committee.
Ohala, J. J. (1983). The origin of sound patterns in vocal tract constraints. In The production of speech (P. F. MacNeilage, Ed.), pp. 189-216. New York: Springer-Verlag.
Ohala, J. J. & Riordan, C. J. (1979). Passive vocal tract enlargement during voiced stops. In Speech communication papers (J. J. Wolf & D. H. Klatt, Eds.), pp. 89-92. New York: Acoustical Society of America.
Ohala, J. J., Riordan, C. J., Kawasaki, H., & Caisse, M. (Forthcoming). The influence of consonant environment upon the perception of vowel quality.
Passy, P. (1890). Études sur les changements phonétiques. Paris: Librairie Firmin-Didot.

Rothenberg, M. (1968). The breath-stream dynamics of simple-released-plosive production. (Bibliotheca Phonetica, No. 6.) Basel: S. Karger.
Schane, S. (1972). Natural rules in phonology. In Linguistic change and generative theory (R. P. Stockwell & R. K. S. Macaulay, Eds.), pp. 199-229. Bloomington: Indiana University Press.
Stevens, K. N. & House, A. S. (1963). Perturbation of vowel articulations by consonantal context: an acoustical study. J. Speech & Hearing Res., 6, 111-128.
Sweet, H. (1874). History of English sounds. London: Trübner.

Wang, W. S.-Y. & Crawford, J. (1960). Frequency studies of English consonants. Language & Speech, 3, 131-139.
Warren, R. (1970). Perceptual restoration of missing speech sounds. Science, 167, 392-393.
Westbury, J. R. (1979). Aspects of the temporal control of voicing in consonant clusters in English. Unpub. Doct. Diss., University of Texas at Austin.
Winitz, H., Scheib, M. E., & Reeds, J. A. (1972). Identification of stops and vowels for the burst portion of /p,t,k/ isolated from conversational speech. J. Acous. Soc. Am., 51, 1309-1317.

VOWEL FEATURES AND THEIR EXPLANATORY POWER IN PHONOLOGY
Eli Fischer-Jørgensen, University of Copenhagen, Denmark

Phonetics cannot explain the phonological pattern of a given concrete language and its development within a given period of time, but it can attempt to explain some universal constraints and tendencies in phonological patterns and phonological change. For this purpose one needs detailed quantitative models of speech production and speech perception, but it is also necessary to have a general frame of reference for the description of the individual languages in the form of a system of phonetic dimensions, according to which the speech sounds of a language can be grouped into classes with common features. These dimensions (which, according to a now generally accepted, but not quite unambiguous, terminology are also called features) must on the one hand be correlated with speech production and speech perception, and on the other hand be adequate for the description of phonological patterns and rules. In the present paper I will consider only vowel features, and only some basic features (excluding nasality, r-colouring, etc., but including tenseness).

Since the days of Bell and Sweet it has been the tradition to describe vowel systems by means of the basic features high-low, front-back, and rounded-unrounded and (for some languages) tense-lax. This system, which was defined in articulatory terms, has been severely criticized for not covering the articulatory facts, e.g. by Russell (1928) and, more recently, by Ladefoged (e.g. 1976 and 1980), Wood (1982) and Nearey (1978). It has been objected that, according to X-ray photos, the highest point of the tongue is, e.g., often lower for [i] than for [e], for [o] than for [ɑ], for [u] than for [ɪ], and that, on the whole, this point is rather variable. However, almost all these objections are only valid as regards the revisions introduced by Daniel Jones in the classical system for the purpose of his cardinal vowel chart, which was meant as a practical tool in phonetic field work and not as a theoretical vowel system. English phoneticians have simply identified the cardinal vowel chart with the system of classical phonetics. None of the founders of classical phonetics (e.g. Bell, Sweet, Jespersen, Sievers, Viëtor) ever used the term "the highest point of the tongue". It was introduced in Jones' Outline (1918). This point is indeed often variable (although I agree with Catford (1981) and Lindau (1977) in finding the criticism somewhat exaggerated). It is also very rarely used in later phonetic works by continental phoneticians except when they quote Jones. Moreover, it was Jones who, for practical reasons, placed [ɑ] as representing the lowest degree of height in the series [u o ɔ ɑ], and who discarded tenseness as a separate dimension and thus placed [ɪ] between [i] and [e] in his chart. If tenseness is considered a separate dimension and height is taken to mean the relative distance between the whole surface of the tongue and the palate within each of the series of rounded or unrounded, tense or lax, front or back vowels, most of the inconsistencies between these traditional labels and the articulatory facts disappear.

It is true that tenseness has been defined in several different ways, and not very convincingly, within classical phonetics, and it has probably been applied to too many languages. But it is a useful feature for the description of some languages, e.g. German, Dutch, various Indian languages, and for the high vowels of English. It seems to be correlated with the acoustic property of a relatively more peripheral vs. a more central placement in the vowel space (Lindau 1977) and, as far as articulation is concerned, with a higher vs. lower muscular tension, which has a number of observable consequences: a flattening of the tongue accompanied by a narrower pharyngeal cavity, less pronounced lip articulation, and a relatively small mandibular distance (I am here in agreement with Wood 1982). "Advanced tongue root" captures only one of these consequences, and does not work for [o] vs. [ɔ].

However, the classical system has rightly been criticized for not taking account of the pharynx and thus for constituting an inadequate starting point for the calculation of the acoustic output.

Various more recent descriptions of vowel articulation have seen it as their main purpose to establish a more direct connection with the acoustic consequences. In one version, place and degree of constriction play a central role. This is no doubt a better starting point for computing the area function of the vocal tract, but I do not think that it is a useful basis for setting up a new feature system. The constriction parameter has only been used as one factor of the tense-lax feature, and it will probably be difficult to find any phonological rule that is common to the class of the most constricted vowels [i u o a]. Wood invokes sound typology, but four-vowel systems containing these four vowels, which he considers basic, are extremely rare, whereas [i u a e] and [i o a e] are more common. But both Lindblom and Sundberg (1969) and Wood (1982) have proposed feature systems where the traditional front-back dimension has been replaced by ±palatal, ±velar, and ±pharyngeal, which in the Lindblom-Sundberg system define three places of articulation, and in Wood's system four places, since he defines [o] as pharyngo-velar and [a] as low pharyngeal. Both use jaw opening to define height differences, Lindblom-Sundberg operating with ±close and ±open, Wood with only ±open.

Wood uses his feature ±pharyngeal to describe the retracted and lowered allophones of the Greenlandic vowels before uvular (pharyngeal) consonants. It might also be used to describe the allophone of Danish /a/ after [ʁ]. But apart from these cases I think the front-back dimension is more useful.


Place features are needed in the description of e.g. the Germanic i-Umlaut and of vowel harmony of the type found in Finnish and Turkish, but these facts are expressed in a much simpler way by the features front and back than by combinations of three different features for back vowels. And, incidentally, X-ray photos show the place of the narrowest constriction to be just as variable as the highest point of the tongue. Front vowels may have their narrowest constriction in the alveolar region, and the narrowest constriction of an [o] may be found at the velum, at the uvula, or in the pharynx. This does not make much difference for the general area function, but it does make a difference for the feature system. Moreover, Wood's ±open feature is not sufficient for describing and explaining the many cases where one needs three or four degrees of height, e.g. the Danish vowel system, the English Great Vowel Shift, the general tendency for relatively high vowels to be shorter than relatively low vowels (which has phonological consequences in various languages), the role of vowels of different height in palatalization processes, etc.

Ladefoged does not consider place and degree of constriction to be a good basis for a feature system, nor does he find his own physiological front and back raising parameters adequate for this purpose (1980), whereas he recognizes that the traditional features have proved their value in phonological description. He solves the problem in a different way, i.e. by defining the (multivalued) features front-back and high-low in auditory-acoustic terms, high-low corresponding to the frequency of F1 and front-back to F2-F1. Rounding is retained as a feature defined in articulatory terms.

This interpretation of the traditional features raises some problems. In the first place, the auditory "front-back" dimension (as well as its acoustic correlate F2-F1) corresponds to a combination of the articulatory dimensions of rounding and front-back tongue position. Ladefoged has demonstrated himself that even phoneticians have difficulty in distinguishing between front rounded and back unrounded vowels, and it is difficult to elicit a dimension of rounding in multidimensional scaling experiments of vowel perception. Terbeek (1977) did not succeed until using five dimensions. The F2-F1 dimension corresponds to the auditory bright-dark dimension which was used in vowel descriptions in the pre-Bell period. As this dimension includes the effect of rounding, it seems problematic to set up an independent rounding dimension at the same time. (A binary opposition, like Jakobson's grave-acute, does not raise quite the same problems.) Moreover, an auditory definition of the front-back dimension implies that processes like i-Umlaut and vowel harmony of the Finnish and Turkish type should be described and explained in auditory terms. But this is not adequate. In these processes rounding and the front-back dimension are kept strictly apart. In i-Umlaut the vowels retain their rounding feature ([a] does not become [œ]), and the same is the case in the Finnish vowel harmony. In the Turkish vowel harmony both rounding and front-back dimensions are involved, but according to separate rules. This seems to indicate an articulatory process. It also seems more plausible to explain such processes of assimilation in motor terms as an anticipation of an articulatory movement. As far as the i-Umlaut is concerned, perception may play a role at the last stage, where the i of the ending may have become so weak that the listener does not hear it and therefore perceives the front feature of the stem vowel as an independent feature (cf. Ohala 1981), but in its start it must be a mainly articulatory process. It therefore seems preferable to define the front-back dimension in articulatory terms, and there is a sufficiently clear basis in X-ray photos for a definition in terms of a forward, respectively backward, movement of the tongue body. But at the same time it must be recognized that from an auditory point of view rounding and front-back combine in one dimension, and I think this aspect is prevalent in the patterning of vowel systems, where [u] and [i] are extremely common because they are maximally different in a two-dimensional auditory space, whereas [y] and [ɯ] are rare (cf. Lindblom 1982).
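The auditory-acoustic proposal discussed above can be made concrete with a small sketch: height read off F1, front-back read off F2-F1. The formant values below are rough textbook-style figures supplied only for illustration, not measurements from any study cited here.

```python
# Sketch of auditory-acoustic vowel features in Ladefoged's style:
# high-low from F1, front-back from F2 - F1. Formant values (Hz) are
# rough illustrative figures only.
vowels = {"i": (280, 2250), "e": (400, 2100), "a": (700, 1300),
          "o": (450, 800), "u": (310, 870), "y": (280, 1700)}

for v, (f1, f2) in vowels.items():
    height = 1 - (f1 - 280) / (700 - 280)   # 1 = high, 0 = low (rescaled F1)
    frontness = (f2 - f1) / 2000            # larger = more "front" (F2 - F1)
    print(f"[{v}]  height={height:.2f}  front-back={frontness:.2f}")

# Note how [y] (front rounded) lands between [i] and [u] on the F2 - F1
# scale: rounding and tongue fronting collapse into one auditory
# dimension, which is exactly the problem discussed above.
```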


As for the height dimension, it is probably true that it has a somewhat simpler connection with its physical than with its physiological correlates, even when the distance of the whole upper surface of the tongue is considered, but I think a two-sided correlation can be retained. It would also be very difficult to find an auditory explanation of some of the rules involving this feature, e.g. the differences in duration and fundamental frequency, the influence on palatalization, and the tendency towards unvoicing of high vowels, whereas plausible, if not definitive, physiological explanations have been advanced for these facts.

References:
Catford, J. C. (1981). Observations on the recent history of vowel classification. In Towards a History of Phonetics (Asher and Henderson, Eds.). Edinburgh: University Press.
Jones, D. (1918). Outline of English Phonetics. London: Heffer and Sons.
Ladefoged, P. (1976). The phonetic specification of the languages of the world. UCLA Working Papers in Phonetics, 31, 3-21.
Ladefoged, P. (1980). What are linguistic sounds made of? Language, 55, 485-502 (also UCLA Working Papers, 45, 1-24).
Lindau, M. (1977). Vowel features. UCLA Working Papers, 38, 49-81.
Lindblom, B. and Sundberg, J. (1969). A quantitative model of vowel production and the distinctive features of Swedish vowels. Speech Transmission Laboratory, Quarterly Progress and Status Report, Stockholm, 14-32.
Lindblom, B. E. (1982). Phonetic universals in vowel systems (to be published in Experimental Phonology, Academic Press).
Nearey, T. M. (1978). Phonetic feature systems of vowels. Indiana University Linguistics Club.
Ohala, J. (1981). The listener as a source of sound change. Papers from the Parasession on Language and Behaviour (Miller et al., Eds.), Chicago. 26 pp.
Ohala, J. (1982). The origin of sound patterns in vocal tract constraints (to be published in The Production of Speech, ed. MacNeilage), 32 pp.
Russell, G. O. (1928). The Vowel. Columbus: Ohio State University Press.
Terbeek, D. (1977). A cross-language multidimensional scaling study of vowel perception. UCLA Working Papers, 37.
Wood, S. (1982). X-ray and model studies on vowel articulation. Working Papers, Lund, 23, 1-191.

VOWEL SHIFTS AND ARTICULATORY-ACOUSTIC RELATIONS
Louis Goldstein, Haskins Laboratories and Yale University, USA

Phonologists have often supposed that the phonetic variability of speech is somehow related to sound change. Bloomfield (1933, p. 365), for example, depicts phonetic change as "a gradual favoring of some non-distinctive [phonetic] variants," which are also subject to imitation and analogy. Hockett (1958) makes the connection quite explicitly; he views sound change as a mechanism of phonological change, whereby small asymmetries in the distribution of pronunciation errors about some acoustic target result in pushing the phoneme's target, imperceptibly over time, in the error-prone direction. In this view, then, speech variability represents the seeds out of which a particular sound change may grow. The sprouting and development are, of course, dependent on many other linguistic and social factors. Sound change does not appear to be random, as phonologists and phoneticians have long noted, in that there seem to be certain patterns of change that recur in a number of unrelated languages. In this paper, I speculate about how patterns of variability consistent with certain types of sound change might emerge from the resonance properties of the human vocal tract, given essentially random articulatory variability. If these principles are generalized in an optimistic way, we might look at them as defining possible sound changes. In this paper, however, discussion will be restricted to one type of sound change: vowel shifts.

Vowel Shifts

When vowels change their qualities, it is possible to ask if any general patterns of change emerge, or if vowels simply shift to some random (but roughly adjacent) quality. Labov, Yaeger and Steiner (1972) examined a number of ongoing chain shifts, changes in which the qualities of a number of vowels are interdependent across the various stages of the change. Together with data from completed sound changes, the investigations led to three principles of chain-shifting: (1) front or back "tense" (peripheral) vowels tend to rise in chain shifts, (2) front or back "lax" (less peripheral) vowels tend to fall, (3) back vowels tend to be fronted. Thus (reorganized slightly), the main movement in vowel quality is in the dimension of vowel height (where "vowel height" is being used here as a general term for the dimension along which vowels like [i], [ɪ], [e], [ɛ] lie, rather than some unitary acoustic or articulatory parameter).

For front vowels, height is the only direction of movement; back vowels can also be fronted. While there are a number of methodological problems in this investigation (for example, the use of an acoustic F1 X F2 chart as a representation of vowel quality allowed no way to isolate lip-rounding), the general picture seems robust, and will be taken as a working hypothesis about the nature of vowel shifts.

We would like to understand why speech variability develops along the particular dimensions that are involved in vowel shifts. One might attempt to find a reason in constraints that the speech production mechanism employs to coordinate the activities of the various muscles. Random perturbation of a vowel within such a system of constraints might result in non-random displacement from the target (see discussion below of Perkell and Nelson, 1982). However, it is possible that even if articulatory variability were completely random, the acoustic consequences of such variability would be directed along certain dimensions, given the non-linear relationships that obtain between vocal tract configurations and their acoustic consequences. It is to this latter approach we turn.

Non-linearities in articulatory-acoustic relations

The acoustic sensitivity of the vocal tract to articulatory perturbation can be examined by observing, with a vocal tract analog, how the tube resonances vary as a function of changes in shape. For example, the nomograms of Fant (1960) show the formant frequencies of a horn-shaped analog of the vocal tract as a function of constriction location, constriction size, and lip area. In such figures, it is possible to see that there are certain constriction locations that are acoustically stable, in the sense that small variations in constriction location result in little or no change in resonances, whereas at other locations, small changes result in rather large resonance shifts. Stevens (1972) first called attention to these stable regions in his quantal theory, in which he proposed that the contrasts employed in human language make use of these regions. A vowel like [i] has an acoustically stable palatal constriction, and therefore small changes in constriction location will have little or no effect on its resonances. However, a change in the size (i.e. narrowness) of the constriction will produce changes in the formants. A small, random perturbation of tongue position for [i] might affect either the constriction location or the constriction size. However, only the perturbations that, in fact, modify the constriction size will have any substantial acoustic effect. Thus, random articulatory error would be reflected in the acoustic medium in an almost identical fashion to variation in constriction size only. Since the dimension of vowel height (for a front vowel like [i]) corresponds to differences in the size of a palatal constriction, the variability produced by random perturbation will lie along the vowel height dimension. This is, of course, the dimension along which vowel shifts occur.

Simulation of articulatory variability

In order to examine the acoustic effects of random articulatory perturbation more systematically, tongue shape variability was simulated using the Haskins Laboratories articulatory synthesizer (Rubin, Baer and Mermelstein, 1981). This program allows modification of a mid-sagittal representation of the vocal tract by means of the six articulatory parameters shown in Figure 1. The shape of the upper part of the tongue body is modelled as a segment of a circle of fixed radius. Different tongue shapes are produced by moving the center of this circle (parameter C). Random articulatory variability was simulated by choosing a set of target vowel shapes and, for each one, generating 100 new vocal tract shapes, each of whose tongue body center lies on a circle of radius 2 mm about the center of the target tongue shape. Thus, the set of shapes represents a constant error of 2 mm in any direction from the hypothetical tongue body center target.
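The perturbation design just described can be sketched in a few lines. The code below reproduces only the sampling geometry (100 tongue-body centers on a 2 mm circle around a target); the mapping from a center position to formant values is a made-up placeholder, since the real computation requires the Haskins articulatory synthesizer.

```python
# Sketch of the perturbation design: 100 vocal tract shapes whose tongue
# body centers lie on a circle of radius 2 mm around a target center.
# The formant mapping below is an arbitrary placeholder, NOT the Haskins
# synthesizer; it only illustrates how directional acoustic variability
# can emerge from non-directional articulatory error.
import math

def perturbed_centers(cx, cy, radius_mm=2.0, n=100):
    """Return n tongue-body-center positions on a circle about (cx, cy)."""
    return [(cx + radius_mm * math.cos(2 * math.pi * k / n),
             cy + radius_mm * math.sin(2 * math.pi * k / n))
            for k in range(n)]

def fake_formants(cx, cy):
    """Placeholder articulatory-to-acoustic map: F1 responds strongly to
    constriction size (vertical position), weakly to location (horizontal),
    mimicking an acoustically stable palatal constriction like [i]."""
    f1 = 300 + 40.0 * cy + 0.5 * cx   # sensitive to size, insensitive to place
    f2 = 2200 - 5.0 * cy - 1.0 * cx
    return f1, f2

target = (0.0, 0.0)                   # hypothetical [i] tongue-body center
shapes = perturbed_centers(*target)
f1s = [fake_formants(x, y)[0] for x, y in shapes]
print(f"F1 spread: {max(f1s) - min(f1s):.0f} Hz for a uniform 2 mm error circle")
```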

Figure 1. Vocal tract control parameters for the Haskins articulatory synthesizer (L = lips, H = hyoid).

The vowels chosen for study were three front vowels ([i], [e], [æ]), three back vowels ([u], [o], [ɑ]), and a mid central vowel [ə].

[n]), [e, ɛ], e.g. ['ladi] "laddie", ['lose] (Dieth 1932, 72 ff). Voicing can be maintained more easily if the tongue is pushed into a high front position. It is to be expected that the /a/ is also different in the two environments. In this connection we can also mention the Slavic assonances, in which voicing is the only relevant consonantal feature, e.g. doba - droga - woda - koza - sowa vs. kopa - sroka - rota - rosa - sofa in Polish oral and written poetry (Jakobson, Fant, Halle 1951, 42).

In the Romance and Slavonic languages, in Dutch and Rhineland German, and in Scottish English the voicing feature has a high rank in the hierarchy of phonetic indices distinguishing the two obstruent classes, over and above the lenis/fortis contrast. Periodicity is actively controlled in these languages or dialects, whereas other types of English and German commonly regulate vocal fold vibration passively. The greater articulatory effort in the fortis articulation does not only affect the vocal tract movement, but also the tension in the larynx (Halle and Stevens 1971), resulting in a quicker decay of voicing; in the lenis obstruent, periodicity may continue if the pressure drop across the glottis is favourable, e.g. because of a short stricture duration in intervocalic position. In the languages with active voicing control, on the other hand, articulatory maneuvers deliberately maintain a pressure differential, which may even set in too soon, leading to regressive assimilation of voice in all the languages mentioned above, but unknown in e.g. the other Germanic languages.

In certain languages, the fortis/lenis feature is supplemented by an aspiration contrast, either in combination with the voicing feature (as in the languages of India) or on its own (as in Standard German and most varieties of English). Aspiration is often associated with the fortis element, but in Danish the strongly aspirated [pʰ, tʰ, kʰ] (as against the slightly aspirated or unaspirated, but commonly voiceless [b̥, d̥, ɡ̊]) are characterised by weaker articulation and shorter closure duration (Fischer-Jørgensen

[Vm] ("habe", "haben"), for instance, but preserve fortis stops as occlusions, simply reducing the degree of aspiration and making passive voicing possible in certain contexts, e.g. "hat er" [d] (Kohler 1979c). In historical sound change, these developments are well-known, for instance from Latin to the western Romance languages, e.g. Spanish ("vida" < "vita", "porfía" < "perfidia"). The relation of fortis to lenis is maintained in the different timings of articulatory movements, even if the absolute values are lowered. This relationship between more-fortis stops and lenis approximants also applies to the allophonics of modern Spanish: utterance-initial or post-pausal position as well as greater emphasis, which both require more articulatory effort, demand voiced stops, as against intervocalic approximants, and this is even true of the original semi-vowels /w/ and /j/, e.g. "huevo" [bw], "hierro" [dʝ] ('rehilamiento', cf. Barbón Rodríguez 1978).

A fortis/lenis opposition with the same types of articulatory reduction also manifests itself in the consonant gradation of Finnish (cf. Karlsson 1981, 36 ff), as the result of a tendency to make corresponding open and closed syllables the same length: long voiceless stops - short voiceless stops (e.g. "kukka" - "kukan"); voiceless stops after homorganic nasals - long nasals (e.g. "Helsinki" - "Helsingin"); [lt, rt] - [ll, rr] (e.g. "kulta" - "kullan"); otherwise [t] - [d] (e.g. "katu" - "kadulla"), [p] - [v] (e.g. "tupa" - "tuvassa"), [k] - 0 (e.g. "jalka" - "jalan").

The timing of articulator movement and of the concomitant laryngeal activity may be reorganised in such a way that utterance-final lenis and fortis consonants coalesce in their glottal features, whereas the duration contrast in the preceding vowel remains and may even be accentuated. English is developing in this direction. In Low German this change is complete: "ick riet" [rit] and "ick ried" [riːt] (from OSax "writan" and "ridan") are now differentiated by short and long vowel, respectively (cf. Kohler 1982a), after the final /e/ apocope led to a levelling in the consonants themselves. The explanation usually given for this phenomenon - compensatory vowel lengthening in connection with the elimination of the following /e/ (e.g. Bremer 1929) - is wrong, because it does not only misrepresent the genesis of the quantity opposition, but it cannot even account for the differentiation of the two examples given. The distinction in vowel duration is tied to an original fortis/lenis contrast in the following consonant and to the structures 'vowel + fortis consonant' vs. 'vowel + morpheme boundary + fortis consonant' (as in "Brut" [ut] - "bru-t" [uːt]), the latter case preserving final vowel length. This vowel quantity feature is modified by the timing at the utterance level, but the details of this interaction still require thorough investigation.

References
Bannert, R. (1976). Mittelbairische Phonologie auf akustischer und perzeptorischer Grundlage. Travaux de l'Institut de Linguistique de Lund X. Lund: Gleerup/Fink.
Barbón Rodríguez, J. A. (1978). El rehilamiento: descripción. Phonetica, 35, 185-215.
Bremer, O. (1929). Der Schleifton im Nordniedersächsischen. Niederdeutsches Jahrbuch, 53, 1-32.
Chen, M. (1970). Vowel length variation as a function of the voicing of the consonant environment. Phonetica, 22, 129-159.
Dieth, E. (1932). A Grammar of the Buchan Dialect. Cambridge: Heffer.
van Dommelen, W. (1983). Parameter interaction in the perception of French plosives. Phonetica, 40.
Elert, C.-C. (1964). Phonological Studies of Quantity in Swedish. Stockholm: Almqvist & Wiksell.
Fischer-Jørgensen, E. (1954). Acoustic analysis of stop consonants. Miscellanea Phonetica, II, 42-59.
Fischer-Jørgensen, E. (1980). Temporal relations in Danish tautosyllabic CV sequences with stop consonants. Annual Report of the Institute of Phonetics, University of Copenhagen, 14, 207-261.
Fitch, H. L. (1981). Distinguishing temporal information for speaking rate from temporal information for intervocalic stop consonant voicing. Haskins Laboratories Speech Research, 65, 1-32.
Fujimura, O. & Miller, J. E. (1979). Mandible height and syllable-final tenseness. Phonetica, 36, 263-272.
Halle, M. & Stevens, K. (1971). A note on laryngeal features. MIT Research Laboratory of Electronics, Quarterly Progress Report, 101, 198-213.
Jakobson, R., Fant, C. G. M. & Halle, M. (1951). Preliminaries to Speech Analysis. Cambridge, Mass.: The MIT Press.
Karlsson, F. (1981). Finsk grammatik. Suomalaisen Kirjallisuuden Seura.
Kim, C.-W. (1970). A theory of aspiration. Phonetica, 21, 107-116.
Kohler, K. J. (1979a). Dimensions in the perception of fortis and lenis plosives. Phonetica, 36, 332-343.
Kohler, K. J. (1979b). Parameters in the production and the perception of plosives in German and French. Arbeitsberichte des Instituts für Phonetik der Universität Kiel (AIPUK), 12, 261-280.
Kohler, K. J. (1979c). Kommunikative Aspekte satzphonetischer Prozesse im Deutschen. In Phonologische Probleme des Deutschen (Vater, H., ed.), Studien zur deutschen Grammatik 10, 13-39. Tübingen: G. Narr.
Kohler, K. J. (1982a). Überlänge im Niederdeutschen? Arbeitsberichte des Instituts für Phonetik der Universität Kiel (AIPUK), 19, 65-87.
Kohler, K. J., van Dommelen, W. & Timmermann, G. (1981). Die Merkmalpaare stimmhaft/stimmlos und fortis/lenis in der Konsonantenproduktion und -perzeption des heutigen Standardfranzösisch. Arbeitsberichte des Instituts für Phonetik der Universität Kiel (AIPUK), 14.
Kohler, K. J., van Dommelen, W., Timmermann, G. & Barry, W. J. (1981). Die signalphonetische Ausprägung des Merkmalpaares fortis/lenis in französischen Plosiven. Arbeitsberichte des Instituts für Phonetik der Universität Kiel (AIPUK), 16, 43-94.
Kohler, K., Krützmann, U., Reetz, H. & Timmermann, G. (1982). Sprachliche Determinanten der signalphonetischen Dauer. Arbeitsberichte des Instituts für Phonetik der Universität Kiel (AIPUK), 17, 1-48.
Öhman, S. E. G. (1966). Coarticulation in VCV utterances: spectrographic measurements. Journal of the Acoustical Society of America, 39, 151-168.
Port, R. F. (1981). Linguistic timing factors in combination. Journal of the Acoustical Society of America, 69, 262-274.
Port, R. F., Al-Ani, S. & Maeda, S. (1980). Temporal compensation and universal phonetics. Phonetica, 37, 235-252.
Slis, J. H. & Cohen, A. (1969). On the complex regulating the voiced-voiceless distinction I. Language and Speech, 12, 80-102.

Symposium 6

Human and automatic speech recognition

THE PROBLEMS OF VARIABILITY IN SPEECH RECOGNITION AND IN MODELS OF SPEECH PERCEPTION
Dennis H. Klatt, Massachusetts Institute of Technology, Cambridge, U.S.A.

EXTENDED ABSTRACT

Human listeners know, implicitly, a great deal about what acoustic properties define an acceptable pronunciation of any given word. Part of this knowledge concerns the kinds of environmental variability, within-speaker variability, and across-speakers variability that is to be expected and discounted during the process of identification. Current computer algorithms that recognize speech employ rather primitive techniques for dealing with this variability, and thus often find it difficult to distinguish between members of a relatively small set of acoustically distinct words if spoken by many talkers. We will try to identify exactly what is wrong with current "pattern-recognition" techniques for getting around variability, and suggest ways in which machines might significantly improve their speech recognition performance in the future by attending to constraints imposed by the human speech production apparatus. We will also consider the additional variability that arises when the task is continuous speech recognition. A "second-generation" LAFS model of bottom-up lexical access is offered as a means for identifying words in connected speech, and as a candidate perceptual model. The refinements concern a more efficient generalizable way of handling cross-word-boundary phonology and coarticulation, and a model of learning that may be powerful enough to explain how listeners optimize acoustic-phonetic decisions and discover phonological rules.

speech

in a 1 9 7 7 p a p e r

1 9 7 7 ) , and it is a b o u t 4 y e a r s s i n c e I p u b l i s h e d a

(Klatt,

theoretical

paper o n m o d e l i n g of s p e e c h p e r c e p t i o n a n d l e x i c a l a c c e s s 1979a)

(Klatt,

that w a s b a s e d in l a r g e m e a s u r e o n i d e a s from the A R P A

project.

S i n c e t h a t time t h e r e h a s b e e n m u c h a c t i v i t y

on

i s o l a t e d w o r d r e c o g n i t i o n , s o m e l i m i t e d a c t i v i t y in c o n n e c t e d s p e e c h r e c o g n i t i o n , p a r t i c u l a r l y a t IBM, and some of p e r c e p t u a l m o d e l s

(Elman and M c C l e l l a n d ,

proliferation

19xx).

This paper

w i l l be an a t t e m p t to look c r i t i c a l l y at a c t i v i t y in e a c h of t h e s e a r e a s , b u t p a r t i c u l a r l y to focus o n w h a t a p p e a r s to be a bottleneck limiting A typical

progress

in i s o l a t e d w o r d

recognition.

isolated word recognition system might

c h a r a c t e r i z e a n i n p u t s p e e c h w a v e f o r m as a s e q u e n c e of c o m p u t e d e v e r y t e n to t w e n t y m s e c .

spectra

E a c h v o c a b u l a r y item m i g h t

t h e n be r e p r e s e n t e d b y o n e or m o r e s e q u e n c e s of s p e c t r a from training d a t a .

R e c o g n i t i o n c o n s i s t s of finding

match between input and vocabulary

derived

the

best

templates.

R e c o g n i t i o n of a small set of w o r d s w o u l d n o t be

difficult

w e r e it n o t for the r e m a r k a b l e v a r i a b i l i t y s e e n in the p r o n u n c i a t i o n of a n y g i v e n w o r d .

In the s y s t e m s we are

about, within-speaker variability

in p r o n u n c i a t i o n a n d

talking speaking

Klatt: The problems of variability in speech recognition rate are h a n d l e d b y a clustering

(1) including m o r e t h a n one w o r d t e m p l a t e

algorithm decides that a single template

a d e q u a t e l y d e s c r i b e the t r a i n i n g d a t a using d y n a m i c p r o g r a m i n g temporal

289

cannot

(Rabiner et a l . , 1 9 7 9 ) ,

to t r y e s s e n t i a l l y all

C h i b a , 1971; X t a k u r a , 1 9 7 5 ) , and

word templates (3) using

(Sakoe

the linear

r e s i d u a l s p e c t r a l d i s t a n c e m e t r i c to q u a n t i f y p h o n e t i c b e t w e e n p a i r s of s p e c t r a

(Itakura,

(2)

reasonable

a l i g n m e n t s of the u n k n o w n s p e c t r a l s e q u e n c e w i t h

spectral sequences characterizing

if

the

and

prediction similarity

1975).

E a c h of t h e s e t h r e e t e c h n i q u e s r e p r e s e n t s a n

important

s c i e n t i f i c a d v a n c e m e n t over s c h e m e s used p r e v i o u s l y .

However,

the t h e m e to b e d e v e l o p e d

techniques

are n o t a d e q u a t e .

in t h i s p a p e r

One m u s t u n d e r s t a n d

is t h a t t h e s e

the p r o c e s s e s b y w h i c h

v a r i a b i l i t y a r i s e s and by w h i c h we as l i s t e n e r s h a v e l e a r n e d ignore i r r e l e v a n t a c o u s t i c v a r i a t i o n w h e n r e c o g n i z i n g spoken by many

to

words

talkers.

T R E A T M E N T OF V A R I A B I L I T Y V a r i a b i l i t y in the a c o u s t i c m a n i f e s t a t i o n s of a g i v e n utterance

is s u b s t a n t i a l

and a r i s e s f r o m m a n y s o u r c e s .

These

include: [1] r e c o r d i n g c o n d i t i o n s (background n o i s e , room reverberation, microphone/telephone characteristics) [2] w i t h i n - s p e a k e r v a r i a b i l i t y ( b r e a t h y / c r e a k y v o i c e q u a l i t y , c h a n g e s in v o i c e f u n d a m e n t a l f r e q u e n c y , s p e a k i n g r a t e - r e l a t e d u n d e r s h o o t in a r t i c u l a t o r y t a r g e t s , s l i g h t s t a t i s t i c a l v a r i a b i l i t y in a r t i c u l a t i o n t h a t c a n lead to big a c o u s t i c c h a n g e s , v a r i a b l e a m o u n t of f e a t u r e p r o p a g a t i o n , s u c h as n a s a l i t y or r o u n d i n g , to a d j a c e n t sounds),

290

Human and automatic speech recognition [3] c r o s s - s p e a k e r v a r i a b i l i t y ( d i f f e r e n c e s in d i a l e c t , v o c a l - t r a c t l e n g t h and n e u t r a l s h a p e , d e t a i l e d a r t i c u l a t o r y habits) [4] w o r d e n v i r o n m e n t in c o n t i n u o u s s p e e c h (cross-word-boundary coarticulation, phonological p h o n e t i c r e c o d i n g of w o r d s in s e n t e n c e s )

and

The c u m u l a t i v e e f f e c t s of this v a r i a b i l i t y are so g r e a t

that

c u r r e n t s y s t e m s d e s i g n e d to r e c o g n i z e o n l y the i s o l a t e d

digits

z e r o - t o - n i n e h a v e c o n s i d e r a b l e d i f f i c u l t y d o i n g so in a speaker-independent manner. is p e r h a p s the m o s t

A poor u n d e r s t a n d i n g of

important stumbling block inhibiting

d e v e l o p m e n t of r e a l l y p o w e r f u l There

variability the

isolated word recognition

devices.

is a c r y i n g n e e d for a s y s t e m a t i c a c o u s t i c s t u d y of

variability. CONTINUOUS SPEECH

RECOGNITION

T h e r e is n o t a s m u c h d i f f e r e n c e b e t w e e n the p r o b l e m s the d e s i g n e r of a d v a n c e d

i s o l a t e d word r e c o g n i t i o n s y s t e m s

t h o s e facing the d e s i g n e r of a c o n t i n u o u s s p e e c h d e v i c e as was o n c e s u p p o s e d .

recognition

in c o n t i n u o u s s p e e c h

in s t r e s s , d u r a t i o n , a n d

c o n t e x t s p e c i f i e d by a d j a c e n t w o r d s . considered

and

H o w e v e r , a w o r d is s u b j e c t to a

g r e a t e r n u m b e r of p e r m i t t e d v a r i a t i o n s in i s o l a t i o n due to v a r i a t i o n s

facing

than

phonetic

One q u e s t i o n to b e

in this s e c t i o n is h o w to c h a r a c t e r i z e

by r u l e a n d use the r u l e - b a s e d k n o w l e d g e

in a

these

processes

recognition

algorithm.

R e p r e s e n t a t i o n of A c o u s t i c - P h o n e t i c LAFS. digits.

Consider

and P h o n o l o g i c a l

the p r o b l e m of r e c o g n i z i n g

Knowledge

connected

In o r d e r to r e c o g n i z e the d i g i t "8" w h e n p r e c e d e d

or

Klatt: The problems of variability in speech recognition

291

f o l l o w e d by a n y o t h e r d i g i t or s i l e n c e , o n e m u s t s p e c i f y c o n s i d e r a b l e d e t a i l the a c o u s t i c m o d i f i c a t i o n s at w o r d t h a t are l i k e l y to t a k e p l a c e

in e a c h c a s e .

in

boundaries

If n o t , t h e n o n e

m u s t r e l y o n the r o b u s t n e s s of the a c o u s t i c c e n t e r of the w o r d and t r e a t the c o a r t i c u l a t i o n o c c u r r i n g m o s t l y a t o n s e t and

offset

as one further s o u r c e of n o i s e .

and

This is o f t e n d o n e

(Sakoe

C h i b a , 1971; M y e r s and R a b i n e r , 1 9 8 1 ) , b u t it is a c o m p r o m i s e t h a t , it is c l e a r , we as l i s t e n e r s do n o t m a k e .

The

alternative

is to d e s c r i b e t e n d i f f e r e n t e x p e c t e d o n s e t s and t e n d i f f e r e n t e x p e c t e d o f f s e t s for e a c h d i g i t , and f u r t h e r m o r e , to r e q u i r e t h e s e a l t e r n a t i v e s o n l y b e used if in fact the a p p r o p r i a t e is s e e n at each end of the

digit

"8".

C o n s t r u c t i o n of a n e t w o r k of a l t e r n a t i v e s p e c t r a l for e a c h p o s s i b l e d i g i t - d i g i t t r a n s i t i o n (though tedious)

that

s o l u t i o n , as d e s c r i b e d

(lexical a c c e s s from spectra)

is a

sequences

straight-forward

in m y paper o n

(Klatt, 1979a).(1)

I am

LAFS currently

doing just t h a t in o r d e r to e x p l o r e the p r o p e r t i e s of LAFS a n d to provide a testbed

for e v a l u a t i o n of a l t e r n a t i v e d i s t a n c e

metrics.

P e r h a p s p r e l i m i n a r y r e s u l t s will be a v a i l a b l e b y the t i m e of

the

conference. Word-Boundary Effects.

Coarticulation

c r o s s - w o r d - b o u n d a r y p h o n o l o g y are t r e a t e d creating

and

in the LAFS s y s t e m

a large n u m b e r of s t a t e s and p a t h s b e t w e e n w o r d

and all w o r d b e g i n n i n g s .

Can such a brute-force method

(1) A n i m p r o v e d " v e r s i o n of L A F S is d e s c r i b e d later section.

by

endings be

in this

292

Human and automatic speech recognition

applied successfully

in v e r y l a r g e - v o c a b u l a r y c o n n e c t e d

r e c o g n i t i o n , a n d c a n it s e r v e as a m o d e l of

speech

perceptual

strategies?

All a c o u s t i c d e t a i l s c a n n o t be l e a r n e d

for

word because

it w o u l d r e q u i r e too m u c h l a b e l e d t r a i n i n g

every data

(even for I B M ) , and t h a t g e n e r a l i z a t i o n s , m o s t p r o b a b l y a t the level of the d i p h o n e , are the o n l y w a y to c a p t u r e r u l e s at w o r d b o u n d a r i e s

coarticulation

in a m e a n i n g f u l a n d u s e f u l

way.

C h i l d r e n are e x p o s e d to a n e n o r m o u s a m o u n t of s p e e c h d a t a , the s a m e a r g u m e n t m u s t a p p l y — perceptual

e l s e we w o u l d e x p e c t to

a b e r r a t i o n s w h e r e e . g . the p a l a t a l i z a t i o n

you" fame w a s k n o w n for s o m e w o r d p a i r s , y e t o t h e r s , p a l a t a l i z e d , s l o w e d d o w n p e r c e p t i o n or c a u s e d So h o w do w e r e s o l v e the p a r a d o x

find

rule of

misperceptions.

i m p l i e d b y the n e e d to and

a l a b y r i n t h of " h y p o t h e s i z e ,

and p o s t - t e s t - f o r - v a l i d - c o n t e x t "

"did

when

apply cross-word-boundary coarticulation rules rapidly a u t o m a t i c a l l y w i t h o u t invoking

yet

h e u r i s t i c s ? (1)

t h a t w e m u s t d e v i s e a w a y to g e t the a d v a n t a g e of

The a n s w e r

test, is

precompiled

n e t w o r k s of s p e c t r a l e x p e c t a t i o n s for e a c h w o r d , b u t w i t h i n

the

c o n s t r a i n t t h a t c o a r t i c u l a t o r y e f f e c t s b e t w e e n w o r d s m u s t be d e s c r i b e d by s e g m e n t a l l y b a s e d r u l e s sub-network)

(or a s i n g l e

word-boundary

rather t h a n c o m p l e t e e l a b o r a t i o n of the

at e a c h w o r d b o u n d a r y in the

alternatives

network.

(1) N e w e l l (1979) a r g u e s t h a t c o g n i t i v e s t r a t e g i e s s u c h a s a n a l y s i s b y s y n t h e s i s or h y p o t h e s i z e and t e s t are too time c o n s u m i n g to be r e a l i s t i c m o d e l s of h u m a n p e r c e p t i o n , and that o n e m u s t find w a y s of d e v i s i n g h i g h l y p a r a l l e l a u t o m a t i c p r o c e s s e s to m o d e l h u m a n t h o u g h t . A s i m i l a r v i e w is e x p r e s s e d in the w o r k of E l m a n a n d M c C l e l l a n d (19xx).
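The single shared word-boundary sub-network can be sketched concretely. In the toy Python sketch below, the two-word vocabulary, the phoneme labels, and the boundary_net table are invented for illustration; only the organizational idea (one shared table of boundary diphones plus backpointers, rather than word-pair-specific paths) is taken from the text.

```python
# Sketch of a shared word-boundary network: word endings jump into one
# table of boundary diphones, carrying a backpointer to the word that
# would be recognized if this path remains best. All data are invented.
word_endings = {"eight": "t", "nine": "n"}   # final phoneme of each word
word_onsets = {"eight": "ey", "nine": "n"}   # initial phoneme of each word

# One sub-network entry per (final, initial) phoneme pair, shared by all
# words, instead of separate paths for every word-word combination.
boundary_net = {("t", "ey"): "t-ey spectra", ("t", "n"): "t-n spectra",
                ("n", "ey"): "n-ey spectra", ("n", "n"): "n-n spectra"}

def boundary_paths(ending_word):
    """Enumerate boundary states reachable after ending_word, each carrying
    a backpointer to the word hypothesis that got us here."""
    final = word_endings[ending_word]
    for next_word, initial in word_onsets.items():
        state = boundary_net[(final, initial)]
        yield {"backpointer": ending_word, "next": next_word, "templates": state}

for path in boundary_paths("eight"):
    print(path)
```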

Klatt: The problems of variability in speech recognition The N e w LAFS.

293

C o n s i d e r the s i m p l e s t form that

s t r u c t u r e c o u l d take b y a s s u m i n g

for the m o m e n t the

this simplifying

assumption that coarticulation across word boundaries

is

r e s t r i c t e d to the d i p h o n e c o n s i s t i n g of the last h a l f of the p h o n e m e at the e n d of the w o r d and the f i r s t h a l f of the p h o n e m e of all w o r d s .

initial

T h e n , i n s t e a d of c r e a t i n g a l a r g e s e t of

p a t h s a n d s t a t e s for e a c h w o r d in the l e x i c o n , so as to

connect

the end of the w o r d to all w o r d b e g i n n i n g s , it s u f f i c e s to to the a p p r o p r i a t e p l a c e in a s i n g l e w o r d - b o u n d a r y carrying

forward a backpointer

recognized

network,

to the w o r d that w o u l d be

if t h i s c o n t i n u e s to be the b e s t n e t w o r k p a t h .

word-boundary network specifies spectral traversed

The

sequences that must be

in o r d e r to g e t to the b e g i n n i n g of w o r d s w i t h

possible beginning

jump

each

phoneme.

It is p o s s i b l e to c o n c e i v e of m o r e g e n e r a l v a r i a n t s of a p p r o a c h t h a t a l l o w c o a r t i c u l a t i o n and p h o n o l o g i c a l g r e a t e r p o r t i o n s of w o r d s , and that m i g h t

recoding

incorporate

into

words, such as -to", "and", "a", This p r o p o s a l

function

LAFS d e s i g n

A s a p r a c t i c a l m a t t e r , it m a k e s

p o s s i b l e the c o n s t r u c t i o n of l a r g e - v o c a b u l a r y n e t w o r k s the p r o h i b i t i v e c o s t of full d u p l i c a t i o n of n e t w o r k s at the b e g i n n i n g

and

"the".

is a n e x t e n s i o n of the o r i g i n a l

t h a t I d e s c r i b e d in 1 9 7 9 .

over

this

s p e c i a l p a r t of the n e t w o r k regular s u f f i x e s s u c h a s p l e u r a l p a s t , and even i n c o r p o r a t e the s h o r t h i g h l y m o d i f i a b l e

this

without

cross-word-boundary

and end of e a c h w o r d .

m o d e l , the n e w L A F S is an i m p r o v e m e n t b e c a u s e

As a

it m e a n s

perceptual that

294

Human and automatic speech recognition

w o r d - b o u n d a r y r u l e s are a u t o m a t i c g e n e r a l i z e d lexical

items.

to all

If t a k e n as a p e r c e p t u a l m o d e l , the

appropriate word-boundary

s u b - n e t w o r k s e e m s to imply t h a t o n l y the b e s t - s c o r i n g w o r d e a c h p o s s i b l e p h o n e t i c ending c a n be "seen" by the m o d u l e

for or

d e m o n that s e a r c h e s for the b e s t s c o r e , w h i c h w o u l d b e a s t r o n g constraint on bottom-up lexical Unsupervised IBM s y s t e m

Learning

(Jelinek, 1976)

search.

of A c o u s t i c - P h o n e t i c

Decisions.

is o n e of s e v e r a l a t t e m p t s

The

to

a u t o m a t i c a l l y o p t i m i z e a d e c i s i o n s t r u c t u r e o n the b a s i s of experience

(see also

Lowerre and Reddy, 1980).

Generalization

t a k e s p l a c e at the level of the p h o n e m e , and thus m a y a n a t t r a c t i v e m o d e l of h o w c h i l d r e n f i r s t a t t e m p t that d e p a r t from w h o l e - w o r d a c o u s t i c p a t t e r n s . IBM s y s t e m is l o o k i n g

for a c o u s t i c i n v a r i a n c e

generalizations

In a s e n s e , the in the

r e p r e s e n t a t i o n s t h a t it s e e s , a n d thus the s y s t e m

spectral

is p a r s i m o n i o u s

w i t h s e v e r a l c u r r e n t a c c o u n t s of c h i l d r e n ' s l a n g u a g e (Stevens,

d e c i s i o n s c o n s t i t u t e s a strong invariance

is r e q u i r e d

in m a k i n g

fine

phonetic

r e f u t a t i o n of the idea

that

is s u f f i c i e n t for s p e e c h u n d e r s t a n d i n g .

q u e s t i o n we pose is t h i s — model

acquisition

19xx).

The w e a k n e s s of the IBM s y s t e m

phonemic

constitute

how great a modification

to the

in o r d e r to d i s c o v e r m o r e a p p r o p r i a t e

The IBM

acoustic

generalizations? The a n s w e r

is s u r p r i s i n g l y s i m p l e : w h e n a s e q u e n c e

s p e c t r a is m a p p e d o n t o a p a r t i c u l a r

phoneme

in a n input

of utterance

Klatt: The problems of variability in speech recognition

295

(that is c o r r e c t l y r e c o g n i z e d ) , do n o t u p d a t e p r o b a b i l i t i e s all

i n s t a n t i a t i o n s of a p h o n e m e , as IBM d o e s n o w , b u t

update only probabilities at those network locations the same p h o n e t i c e n v i r o n m e n t as is o b s e r v e d

rather possessing

in the i n p u t .

d o n e o n a d i p h o n e b a s i s , t h e n n e t w o r k s t a t e s near the

If

beginning

of a p h o n e m e d e f i n i t i o n are t u n e d o n l y to inputs i n v o l v i n g p h o n e m e p r e c e d e d by the a p p r o p r i a t e p h o n e m e , a n d

that

correspondingly,

n e t w o r k s t a t e s near the end of the p h o n e m e d e f i n i t i o n are o n l y to i n p u t d a t a h a v i n g the a p p r o p r i a t e

at

following

tuned

phoneme.

W h i l e this is c o n c e p t u a l l y s i m p l e to i m p l e m e n t in a c o m p u t e r , it i m p l i e s a m u c h larger

set of p r o b a b i l i t i e s to b e

e s t i m a t e d and s t o r e d t h a n in the s t a n d a r d

IBM s y s t e m .

In the

s t a n d a r d s y s t e m , t h e r e m i g h t be 40 p h o n e m e s , 10 n e t w o r k t r a n s i t i o n s per p h o n e m e , and 200 t e m p l a t e p r o b a b i l i t i e s to be estimated

for e a c h t r a n s i t i o n , or a b o u t 1 0 0 , 0 0 0 p r o b a b i l i t i e s

b e e s t i m a t e d and s t o r e d .

If training

d i p h o n e s , and the t e m p l a t e

to

is d o n e o n the b a s i s of

i n v e n t o r y is i n c r e a s e d

in o r d e r

p e r m i t finer p h o n e t i c d i s t i n c t i o n s , the n u m b e r s m i g h t be

to

about

1000 d i p h o n e s t i m e s , s a y , 6 n e t w o r k t r a n s i t i o n s per d i p h o n e

times

1000 t e m p l a t e p r o b a b i l i t i e s , or 6 m i l l i o n p r o b a b i l i t i e s to be e s t i m a t e d and s t o r e d . impractical

increase

A 60-fold

i n c r e a s e w o u l d imply a n

in r e q u i r e d training d a t a as w e l l as a

computer memory greater

t h a n is e a s i l y r e f e r e n c e d

in m o s t

computers. An alternative

An alternative is to return to the framework of Harpy and LAFS in which probability is replaced by a distance metric that computes the likelihood that an input spectrum is the same as a spectral template (or small set of alternative templates) representing each network state. Lowerre and Reddy (1980) describe one means of using unsupervised learning to cause the templates to converge toward data seen during recognition, simply by averaging input spectra with template spectra. It is doubtful that direct spectral averaging will work best because spectral peaks become less peaked during averaging. Some other method of averaging will have to be found, but the concept of automatic template tuning, while not new, seems to be sufficiently powerful to serve as an attractive mechanism for both machine and man.
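A rough sketch of such automatic template tuning is given below. Averaging log magnitudes is only one guess at a method gentler on spectral peaks than the direct averaging the text explicitly doubts, and the weighting scheme is an assumption of this sketch, not a method proposed in the text.

    import numpy as np

    def tune_template(template, matched_frame, weight=0.1):
        # Move the stored template a small step toward an input frame that
        # was matched to it during a correct recognition.  Averaging is done
        # on log magnitudes here (an assumption) so that formant peaks are
        # eroded less than by direct averaging of linear spectra.
        log_t = np.log(template)
        log_f = np.log(matched_frame)
        return np.exp((1.0 - weight) * log_t + weight * log_f)

    template = np.array([1.0, 8.0, 2.0])   # toy 3-channel spectrum with a peak
    frame = np.array([1.2, 6.0, 2.5])      # matched input spectrum
    template = tune_template(template, frame)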

Unsupervised Learning of New Network Configurations. Averaging of input with templates can cause network structures to converge toward optimal performance, but how does one create new network structures to characterize newly discovered phonological rules or unforeseen acoustic-phonetic possibilities? It is necessary to be able to detect when a correctly recognized acoustic input does not match the correct path through the network very well, and establish that this acoustic data could be generated by a human vocal tract obeying rules of English phonetics and phonology. Detecting a poor match may not be difficult, but to be able to determine whether the deviations are worthy of inclusion as new network paths of local or global import requires expert knowledge of the rules of speech production and their acoustic consequences. It is my belief that the role of analysis by synthesis and the motor theory of speech production arises exactly here, to serve as a constraint on the construction of alternative network paths during unsupervised learning. Construction of a LAFS-like computer simulation possessing these skills must await progress in understanding the detailed relations between speech production, perception, and phonology.

We have described possible mechanisms for tuning networks through experience and for augmenting their connectivity pattern, but how does the whole process get started in the child? Do general principles exist that, when confronted with speech data like that to which a child is exposed, create networks of this sort? The alternative, innate processes and structures for speech perception, goes well beyond the kinds of simple innate feature detectors that have been proposed from time to time in the speech perception literature. While it may turn out that much of the structure of the perception process must be assumed to be specified genetically, it is best to continue the search for data-driven mechanisms rather than bow down to the god of innateness too soon.

CONCLUSIONS

This is a very exciting time for engineers, linguists, and psychologists interested in speech recognition and speech perception, for we are clearly at the threshold of a breakthrough in both understanding and machine performance. It has been argued here that this breakthrough will be expedited by careful study of variability in speech, development of better phonetically-motivated distance metrics, and the description of acoustic-phonetic details within the framework of a recognition algorithm that is both simple and powerful, such as LAFS.

ACKNOWLEDGEMENT

This work was supported in part by grants from the National Science Foundation and the Department of Defense.

PROPOSAL FOR AN ISOLATED-WORD RECOGNITION SYSTEM BASED ON PHONETIC KNOWLEDGE AND STRUCTURAL CONSTRAINTS
Victor W. Zue
Massachusetts Institute of Technology, Cambridge, U.S.A.

During the past decade, significant advances have been made in the field of isolated word recognition (IWR). In many instances, transitions from research results to practical implementations have taken place. Today, speech recognition systems that can recognize a small set of isolated words, say 50, for a given speaker with an error rate of less than 5% appear to be relatively common. Most current systems utilize little or no speech-specific knowledge, but derive their power from general-purpose pattern recognition techniques. The success of these systems can at least in part be attributed to the introduction of novel parametric representations (Makhoul, 1975), distance metrics (Itakura, 1975), and the very powerful time alignment procedure of dynamic programming (Sakoe and Chiba, 1971).

While we have clearly made significant advances in dealing with a small portion of the speech recognition problem, there is serious doubt regarding the extendibility of the pattern matching approach to tasks involving multiple speakers, large vocabularies and/or continuous speech. One of the limitations of the template matching approach is that both computation and storage grow (essentially) linearly with the size of the vocabulary. When the size of the vocabulary is very large, e.g., over 10,000 words, the computation and storage requirements associated with current IWR systems become prohibitively expensive. Even if the computational cost were not an issue, the performance of these IWR systems for a large vocabulary would surely deteriorate (Keilin et al., 1981). Furthermore, as the size of the vocabulary grows, it becomes imperative that such systems be able to operate in a speaker-independent mode, since training of the system for each user will take too long.

This paper proposes a new approach to large-vocabulary, isolated word recognition which combines detailed acoustic-phonetic knowledge with constraints on the sound patterns imposed by the language. The proposed system draws on the results of two sets of studies; one demonstrating the richness of phonetic information in the acoustic signal and the other demonstrating the power of structural constraints imposed by the language.

Spectrogram Reading

Reliance on general pattern matching techniques has been partly motivated by the unsatisfactory performance of early phonetically-based speech recognition systems. The difficulty of automatic acoustic-phonetic analysis has also led to the speculation that phonetic information must be derived, in large part, from semantic, syntactic and discourse constraints rather than from the acoustic signal. For the most part, the poor performance of these phonetically-based systems can be attributed to the fact that our knowledge of the context-dependency of the acoustic characteristics of speech sounds was very limited at the time. However, this picture is slowly changing. We now have a far better understanding of contextual influences on phonetic segments. This improved understanding has been demonstrated in a series of spectrogram reading experiments (Cole et al. 1980). It was found that a trained subject can phonetically transcribe unknown sentences from speech spectrograms with an accuracy of approximately 85%. This performance is better than the phonetic recognizers reported in the literature, both in accuracy and rank order statistics. It was also demonstrated that the process of spectrogram reading makes use of explicit acoustic phonetic rules, and that this skill can be learned by others. These results suggest that the acoustic signal is rich in phonetic information, which should permit substantially better performance in automatic phonetic recognition.

However, even with a substantially improved knowledge base, a completely bottom-up phonetic analysis still has serious drawbacks. It is often difficult to make fine phonetic distinctions (for example, distinguishing the word pair "Sue/shoe") reliably across a wide range of speakers. Furthermore, the application of context-dependent rules often requires the specification of the correct context, a process that can be prone to error. (For example, the identification of a retroflexed /t/ in the word "tree" depends upon correctly identifying the retroflex consonant /r/.) Problems such as these suggest that a detailed phonetic transcription of an unknown utterance may not by itself be a desirable aim for the early application of phonetic knowledge.

Constraints on Sound Patterns

Detailed segmental representation of the speech signal constitutes but one of the sources of encoded phonetic information. The sound patterns of a given language are not only limited by the inventory of basic sound units, but also by the allowable combinations of sound units. Knowledge about such phonotactic constraints is presumably very useful in speech communication, since it provides native speakers with the ability to fill in phonetic details that are otherwise not available or acoustically distorted. Thus, as an extreme example, a word such as "splint" can be recognized without having to specify the detailed characteristics of the phonemes /s/, /p/, and /n/. In fact, "splint" is the only word in the Merriam Pocket Dictionary (containing about 20,000 words) that satisfies the following description: [CONS] [CONS] [CONS] [VOWEL] [NASAL] [STOP]. In a study of the properties of large lexicons, Shipman and Zue (1982) found that knowledge of even a broad specification of the sound patterns of American English words, both at the segmental and suprasegmental levels, imposes strong constraints on their phonetic identities. For example, if each word in the lexicon is represented only in terms of 6 broad manner categories (such as vowel, stop, strong fricative, etc.), then the average number of words in a 20,000-word lexicon that share the same sound pattern is about 2. In fact, such crude classification will enable about 1/3 of the lexical items to be uniquely determined.
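The lexical filtering power of such broad classes is easy to illustrate; the class inventory and mini-lexicon below are invented for the example and are not Shipman and Zue's materials. Note how "splint" and "sprint" collide because /l/ and /r/ fall in the same class, echoing the figure of about 2 words per pattern.

    BROAD = {  # invented broad manner classes for the example
        "s": "SFRIC", "sh": "SFRIC", "z": "SFRIC",
        "p": "STOP", "t": "STOP", "k": "STOP", "b": "STOP", "d": "STOP",
        "m": "NASAL", "n": "NASAL",
        "l": "LIQUID", "r": "LIQUID",
        "i": "VOWEL", "e": "VOWEL", "a": "VOWEL", "o": "VOWEL", "u": "VOWEL",
    }

    LEXICON = {
        "splint": ["s", "p", "l", "i", "n", "t"],
        "sprint": ["s", "p", "r", "i", "n", "t"],
        "mint":   ["m", "i", "n", "t"],
    }

    def broad_pattern(phonemes):
        return tuple(BROAD[p] for p in phonemes)

    def candidates(observed):
        # Return every word whose broad-class pattern matches the input.
        target = broad_pattern(observed)
        return [w for w, ph in LEXICON.items() if broad_pattern(ph) == target]

    print(candidates(["s", "p", "l", "i", "n", "t"]))  # ['splint', 'sprint']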

There is indirect evidence that the broad phonetic characteristics of speech sounds and their structural constraints are utilized to aid human speech perception. For example, Blesser (1969) has shown that people can be taught to perceive spectrally-rotated speech, in which place cues are severely distorted while detailed manner cues and suprasegmental cues are preserved. The data on misperception of fluent speech reported by Bond and Garnes (1980) and the results of experiments on listening for mispronunciation reported by Cole and Jakimik (1980) also suggest that the perceptual mechanism utilizes information about the broad phonetic categories of speech sounds and the constraints on how they can be combined.

Proposed System

Based on the results of the two studies cited above, we propose a new approach to isolated-word recognition. This approach is distinctly different from previous phonetically-based attempts in that detailed phonetic analysis of the acoustic signal is not performed. Rather, the speech signal is segmented and classified into several broad manner categories. The broad (manner) classifier serves several purposes. First, errors in phonetic labeling, which are most often caused by detailed phonetic analyses, would be reduced. Second, by avoiding fine phonetic distinctions, the system should also be less sensitive to interspeaker variations. Finally, we speculate that the sequential constraints and their feature distributions, even at the broad phonetic level, may provide powerful mechanisms to reduce the search space substantially. This last is particularly important when the size of the vocabulary is large (of the order of several thousand words or more).

Once the acoustic signal has been reduced to a string (or lattice) of phonetic segments that have been broadly classified, the resulting representation will be used for lexical access. The intent is to reduce the number of possible word candidates by utilizing knowledge about the structural constraints, both segmental and suprasegmental, of the words. The result, as indicated previously, should be a relatively small set of word candidates. The correct word will then be selected through judicious applications of detailed phonetic knowledge.

In summary, this paper presents a new approach to the problem of recognizing isolated words from large vocabularies and multiple speakers. The system initially classifies the acoustic signal into several broad manner categories. Once the potential word candidates have been significantly reduced through the utilization of structural constraints, then a detailed examination of the acoustic differences would follow. Such a procedure will enable us to deal with the large vocabulary recognition problem in an efficient manner. What is even more important is the fact that such an approach bypasses the often tedious and error-prone process of deriving a complete phonetic transcription from the acoustic signal. In this approach, detailed acoustic phonetic knowledge can be applied in a top-down verification mode, where the exact phonetic context can be specified.

REFERENCES

Bond, Z.S. and Garnes, S. (1980) "Misperceptions of Fluent Speech," Chapter 5 in Perception and Production of Fluent Speech, ed. R.A. Cole, 115-132 (Lawrence Erlbaum Asso., Hillsdale, New Jersey).
Cole, R.A. and Jakimik, J. (1980) "A Model of Speech Perception," Chapter 6 in Perception and Production of Fluent Speech, ed. R.A. Cole, 133-163 (Lawrence Erlbaum Asso., Hillsdale, New Jersey).
Cole, R.A., Rudnicky, A.I., Zue, V.W., and Reddy, D.R. (1980) "Speech as Patterns on Paper," Chapter 1 in Perception and Production of Fluent Speech, ed. R.A. Cole, 3-50 (Lawrence Erlbaum Asso., Hillsdale, New Jersey).
Itakura, F. (1975) "Minimum Prediction Residual Principle Applied to Speech Recognition," IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. ASSP-23, 67-72.
Keilin, W.J., Rabiner, L.R., Rosenberg, A.E., and Wilpon, J.G. (1981) "Speaker Trained Isolated Word Recognition on a Large Vocabulary," J. Acoust. Soc. Am., Vol. 70, S60.
Makhoul, J.I. (1975) "Linear Prediction: A Tutorial Review," Proc. IEEE, Vol. 63, 561-580.
Shipman, D.W. and Zue, V.W. (1982) "Properties of Large Lexicons: Implications for Advanced Isolated Word Recognition Systems," Conference Record, IEEE 1982 International Conference on Acoustics, Speech and Signal Processing, 546-549.
Sakoe, H. and Chiba, S. (1971) "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. ASSP-26, 43-49.

TIME IN THE PROCESS OF SPEECH RECOGNITION
Stephen M. Marcus
Institute for Perception Research (IPO), Eindhoven, The Netherlands

1 introduction

The process of speech recognition involves finding an optimum match between an unknown input and a sequence of stored representations of known words in a listener's vocabulary. In normal conversational speech a number of problems arise. Firstly, word onsets and offsets are not clearly marked, if at all, in the acoustic signal. Secondly, the durations of words may show large variations, which may result in non-linear changes in segment duration within a word. Finally, the spectral or even phonetic realisations of a word may exhibit considerable variation from production to production, even given the same speaker. Most approaches consider speech as a sequence of events ordered in time, and implicitly assume that such a linear sequential representation is used both to represent the unknown input and to code the stored lexicon. It is not of great importance here whether such a representation is in terms of phoneme-like segments, short-term spectral descriptions, or longer diphone or syllabic elements. The same problems, of onset detection, time normalisation, and variant production, arise in all cases. Adequate solutions have been developed, respectively involving testing all possible starting points for all possible words in the lexicon, "time warping" algorithms to find an optimal temporal match between the input and the representation of each possible word, and representations built of nodes with alternative branching paths for known variations in production. Given the rapid increase in speed and decrease in cost of computer hardware, it seems certain that, in the not-too-distant future, such approaches will become economically feasible, even for relatively large vocabularies.

Figure 1. "Time warping" to match an input with its corresponding stored representation. (Axes: states in the input sequence against states in the stored representation.)


However, the most efficient speech recognition device constructed to date, which will probably remain so for a considerable time to come, is characterised not by serial processing hardware with cycle times measured in nanoseconds, but by electro-chemical "wetware" with transmission latencies measured in milliseconds, and owes its speed to a large capacity for independent parallel processing. The device I am referring to is of course the human brain. Seen from this viewpoint, speech recognition becomes essentially a problem of memory access, and we need to ask how speech might be stored in our memories.

2 context-sensitive coding

One model of human memory is the "slot theory", in which memory is seen as organised into a set of discrete "slots" in which differing items are stored. Items are recalled by accessing the contents of each slot in turn, and forgetting results when the contents of a slot become overwritten or lost. An alternative theory, one of whose major proponents is Wickelgren (1969), is that memory has an associative structure, based on a context-sensitive code in which the sequence of items is not stored, but only the association between adjacent items. Thus in the slot theory, the sequence "m", "e", "n", "i" would be stored as: (1: "m"; 2: "e"; 3: "n"; 4: "i"). In an associative model the same sequence would be represented by: ("m"-"e"; "e"-"n"; "n"-"i") where "-" should be read as "followed by".

The slot theory is analogous to the use of a linear sequential representation in coding speech. It presents the same problems of onset detection, normalisation, and variations in production. An associative or context-sensitive code offers a far greater flexibility in these matters. Since the temporal sequence is no longer explicitly represented, the associations (which will here be limited to associations between neighbouring pairs of states, and termed state-pairs) may be compared directly as they occur in the unknown input with those anywhere in any stored representation. Such a matching process is ideally suited to a parallel processing system working with a content addressable memory (as we know the human brain is: for example, remember a moment in your early childhood when you were outside in the sunshine, or the first time you heard of speech research). Wickelgren (1969) has pointed out that such a system based on pair information will tend to make errors, somewhat like stuttering or

some slips of the tongue, when used for speech production. I would like to suggest that conversely, it may offer just the flexibility we need in perception, in dealing with the variability in the speech signal. Figure 2 illustrates the use of a context-sensitive code based on state-pairs in recognizing the same sequence as in Figure 1.

Figure 2. Associative coding used to match an input with a number of stored representations. (The stored representations are coded as state-pairs.)

Note that it is neither necessary to locate the moment of word onset in the unknown input, nor to have special procedures to deal with local changes in speech rate. Each state-pair in the input finds its match in certain stored representations, regardless of its temporal location. Nor do all state-pairs need to match, and alternative productions may be catered for by including their corresponding state-pairs in the stored representation. All that is required for correct recognition is that the number of state-pairs corresponding between the input and the correct representation is greater than for all other stored representations.
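The matching idea can be reduced to a few lines of code. The sketch below uses letters as stand-ins for the quantized spectral states of the actual simulation, so the vocabulary and scores are purely illustrative.

    def state_pairs(states):
        # Context-sensitive code: keep only adjacent pairs, not positions.
        return list(zip(states, states[1:]))

    STORED = {word: state_pairs(list(word)) for word in ["meni", "nime"]}

    def activations(input_states):
        # Each input pair may match anywhere in any stored representation,
        # so no word onset detection or time normalisation is required.
        pairs = state_pairs(input_states)
        return {w: sum(p in rep for p in pairs) for w, rep in STORED.items()}

    # The pairs of "meni" are found even when the input starts mid-stream;
    # the partial score for "nime" shows the ambiguity pair coding allows.
    print(activations(list("xmenix")))  # {'meni': 3, 'nime': 2}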

3 a computer simulation

Such a simple coding may introduce ambiguity in the recognition of some sequences. It remains an empirical question whether this ambiguity will result in problems in recognition, or will turn out to result in the flexibility needed to deal with the idiosyncrasies of the speech signal. Since the effectiveness of such a context-sensitive code with real speech is difficult to estimate by purely theoretical considerations, a simple computer speech recognition system was implemented (Marcus, 1981). Speech was


analysed at 10 msec intervals, the first three formants being extracted automatically by the IPO linear-prediction vocoder system (Vogten & Willems, 1977). No assumptions were made about the nature of possible intermediate states, such as phonemes or syllables, between such a spectral representation and each word in the vocabulary. The system performed a direct mapping between the spectral representation of the input and that for each word. Such an approach has been demonstrated to be highly successful in conventional approaches to automatic speech recognition by Bakis (1974) and Lowerre (1976), and has also been advocated by Klatt (1979). In this case however, both the input and the stored lexicon were represented using a context-sensitive coding of adjacent spectral states. These "state-pairs" were used in a recognition system for a small vocabulary - the English digits "one" to "nine". Figure 3 shows the performance of nine independent recognition units, one for

Figure 3. The response of the computer simulation using a context-sensitive code to "unknown" tokens of the digits "one", "two" and "three". Each figure shows the response of all recognition units for all nine digits to the unknown input. The horizontal axis is stimulus time, the vertical axis is a logarithmic transform of recognition unit activity.


each digit, in response to tokens of the words "one", "two" and "three". For each recognition unit, a logarithmic transform of cumulative probability is plotted against stimulus time. The system displays a number of interesting properties. The most superficial, though not the least impressive, is that the upper trace in each case is the "correct" recognition unit, corresponding in type with the stimulus word. Secondly, within 100 to 150 ms from stimulus onset, only a small number of recognition units remain as possible candidates, the rest having become so improbable that they have been deactivated from the system. Since each recognition unit operates independently, this characteristic is not critically dependent on the number of recognition units in the system. This time period is of the same order of magnitude as that suggested by Marslen-Wilson for restricting activity to a "word initial cohort" in the Cohort Model (Marslen-Wilson & Welsh, 1978). Thirdly, it is generally possible to make a recognition decision well before the end of the word, as soon as the activation of a particular recognition unit has risen sufficiently above that of all others. Though no such recognition decision was built into the computer simulation, it can be seen in Fig. 3 that such a decision can be made within 300 ms of stimulus onset with these stimuli. This performance is also typical for tokens of the other stimuli. This simulation also allows us to compare the effectiveness of state-pair information to that contributed simply by the presence of single states. The performance shown for "one" in Figure 3a should be contrasted with Figure 4, where only single-state information, rather than state-pairs, is used. The increase in performance using a context-sensitive code is clear and quite dramatic.

Figure 4. The response of the computer simulation to the same "one" as in Figure 3, here using only single-state information. The vertical axis is to the same scale.

4 word boundaries

Though a system as outlined above provides an elegant solution to the problem of word onset detection - by making such detection unnecessary - it contains no component for producing an actual recognition decision. This would presumably base its decision on the relative activity of all the word candidates, waiting until one had risen sufficiently in relation to all others. Contextual information could also be incorporated into such a decision algorithm. Lacking such a component, it was presumed that word onsets could be difficult or impossible to detect. The high level of activity of the candidate corresponding to the previous word was expected to mask the rise in activity resulting from the onset of the following word. However, Figure 5 shows the activation of the model in response to the sequence "one-nine-eight-two". The only modification to the system described above is that the activity of recognition units cannot fall below zero - that is, evidence is not collected against improbable candidates, only for or against those which have a significant chance of being present (a recognition unit cannot be "deader than dead"). Under the horizontal axis the number corresponding to the most active recognition unit is displayed; above it, the actual stimulus sequence is also shown.

Figure 5. The response of the computer simulation to the sequence "one-nine-eight-two". No recognition component has been incorporated to deactivate each unit after recognition in preparation for the onset of the next. The numbers under the horizontal axis indicate the digit corresponding to the most active recognition unit at each moment in time.

5 conclusion

One conclusion which may be drawn even from this limited simulation is that the solution to many current problems in speech recognition may be simpler than we usually suppose. In particular the representation of time and problems associated with word onsets and offsets may not require the extremely complex approaches currently being used. It remains to be seen whether this same approach will be valid for a much larger vocabulary. There is good reason to suppose this will be the case; firstly the extreme rapidity (in terms of stimulus time) with which this simulation discriminates one word from all others indicates the presence of much more power than needed to discriminate nine words. Secondly, trials with words not in the current vocabulary show good rejection of these words, unless, of course, they share much in common with a known word. Then, just as with hyper-efficient human speech recognition, such mispronunciations will be ignored, and the closest candidate selected.

references

Bakis, R. (1974) Continuous-speech word spotting via centisecond acoustic states. IBM speech processing group, report RC 4788.
Klatt, D.H. (1979) Speech perception: a model of acoustic-phonetic analysis and lexical access. Journal of Phonetics, 7, 279-312.
Lowerre, B.T. (1976) The HARPY speech recognition system. Unpublished PhD thesis, Carnegie-Mellon University.
Marcus, S.M. (1981) ERIS - context-sensitive coding in speech perception. Journal of Phonetics, 9, 197-220.
Marslen-Wilson, W. & Welsh, A. (1978) Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology, 10, 29-63.
Vogten, L.L.M. & Willems, L.F. (1977) The formator: a speech analysis-synthesis system based on formant extraction from linear prediction coefficients. IPO Annual Progress Report, 12, 47-62. Eindhoven, The Netherlands.
Wickelgren, W.A. (1969) Context-sensitive coding, associative memory, and serial order in (speech) behaviour. Psych. Rev., 76, 1-15.

ON THE ROLE OF PHONETIC STRUCTURE IN AUTOMATIC SPEECH RECOGNITION
Mark Liberman
Bell Laboratories, Murray Hill, N.J., USA

Introduction

Three principal assumptions underlie this presentation: (1) sub-lexical information in speech (e.g. that available in fluently spoken nonsense words) is ordinarily of very high quality; better access to this information (which is traditionally studied under the name of phonetics and phonology) is the main requirement for better automatic speech recognition (ASR); (2) in order to improve ASR it is now especially important to devise improved algorithms for describing the phonetic structure of speech signals; (3) the phonological structure of speech must ultimately play a part in ASR, but is of limited relevance without robust phonetic descriptions of the sort just mentioned. As always, the best evidence for such assumptions will prove to be the value of their consequences. I will sketch a sample recipe for exploring the consequences of these assumptions, suggesting what some appropriate phonetic descriptions might be like, and how they might be used.

Recognizing speech on a phonetic basis means using the phonetic structure of the human speech communication system in mapping from sound streams to linguistic messages. It does not imply any necessary choice between "top-down" and "bottom-up" processing directions, and in particular it does not imply the construction of a standard "phonetic transcription" as the first stage of the recognition process. Phonetic recognition requires the design of phonetically robust signal descriptions: that is, acoustically definable predicates that have reliable connections to lexically relevant categories, or, to put it in plain language, things that reliably relate sounds to words. To avoid confusion, I will call these phonetic predicates "descriptors." Such descriptors need not impose any well-defined segmentation and labelling ("phonetic transcription"); rather, they are any appropriate set of descriptions of points or regions in the sound stream — what we might think of (very loosely) as a commentary on a spectrogram. Given a set of such descriptors, the recognition process can proceed by hypothesis-and-test ("top-down"), or by describe-and-infer ("bottom-up") methods, or by a mixture of such methods, as we please. For any particular application, we also need a control structure that makes effective use of the information contained in the signal descriptions. In the end, perhaps, a single algorithm can be made to work for all cases, but this goal is surely far in the future, and phonetically-based recognition can meanwhile produce useful results in less general forms. In what follows, I will specify a general philosophy for the selection of descriptors, a particular set of descriptors with which I have sufficient experience to be confident of their usefulness, and a particular class of recognition control structures that can make effective use of the proposed descriptor set. My hope is to find a framework within which research on phonetically based speech recognition can proceed productively, with useful applications available from the beginning.


Design Criteria for a Descriptor Set

Evidently, the descriptor set should work. The general strategy is to use phonetic knowledge to look in the right places for the crucial information; for instance, in distinguishing among the phonetically confusable set of letter names "E,B,D,V,P,T," the crucial information is in the burst spectrum, the first 40 msec or so of F2 and F3 transitions, and the voice onset time. Unless this information is extracted and made available in some relatively pure form, it will be swamped by phonetically irrelevant acoustic variation from other sources.

Appropriate signal transforms are better than complex logic. For instance, in looking for candidate bursts, finding peaks in the (smooth) time derivative of the log energy is preferable to a complex procedure that "crawls around" in the rms amplitude. Such an "edge detector" is relatively immune to level changes and to background noise, and its output is characterized by a small set of numbers (e.g. the value of the derivative at the peak, the duration of the region of positive slope, etc.) that can easily be used in training the system. Because such signal transforms are mathematically simple, their behavior is also more predictable and less likely to suffer from "bugs" due to unexpected interactions among the branches of a complex decision tree. In general, the descriptor definitions should be as simple as possible, so as to minimize surprises and maximize trainability. However, as a last resort, temporary logical "patches" that seem to work are not to be disdained.

All the descriptors used should be insensitive to pre-emphasis, band-limiting, broadband noise, and other phonetically irrelevant kinds of distortion. For instance, using spectral balance as a measure of voicing status fails badly if F1 is sometimes filtered out, as it often is in high vowels in telephone speech.

The most interesting descriptors are lexically independent ones, since they are the most general, and will best repay the time spent to develop them. The fact that speech is phonologically encoded guarantees that an adequate set of general descriptors will exist. However, lexically specific descriptors are perfectly legitimate for applications that justify their development. For instance, the rising F2 transition in "one" will function quite reliably as a "one-spotter" in connected digit recognition.

Some previously made points about the proposed "descriptors" may now be clearer. These are not phonological or phonetic features in the usual sense, but phonetically relevant acoustic descriptions. The more closely they correspond to phonologically defined categories, the better they will work, of course, but they will still be useful if the correlation is imperfect. These descriptors carry with them sets of parameter values that can be used by higher levels of the recognizer. For instance, in constructing a spotter for the rising F2 transition in "one," we find all monotonically rising "ridges" between 500 and 2000 Hz in an appropriately smoothed time-frequency-amplitude surface; the attached numbers include (say) the lowest frequency, the highest frequency, the duration, and the rate of spectral balance change over the corresponding time period. In the space defined by the joint distribution of


these parameters, "one" will be fairly well separated from anything else that can arise in connected digit sequences. There are a variety of different ways to use this separation: we can send all candidates up to a higher level parsing algorithm; we can prune hopeless cases first; we can combine the rising-F2 feature in a bottom-up way with descriptors relating to the expected nasal implosion; and so forth.

The descriptors under discussion can be divided roughly into a "first tier," made up of fairly direct characterizations of local properties of the signal (like "peak in the first derivative of mid frequency log energy"), and a "second tier" of descriptions that are more abstract and closer to linguistic categories (like "apparent voiceless stop burst"). In all cases, the point is to find ways to look at signal parameters through phonetic spectacles.

A Candidate Descriptor Set

In the conference presentation, a candidate set of phonetic descriptors will be given, with a sketch of algorithms for their computation. The "first tier" of descriptors will be emphasized, and includes (1) time dimension "edges," (2) voicing and F0 determination, (3) formant-like frequency-time trajectories of local spectral features, and (4) local spectral balance changes. In the "second tier," predicates like "pre-vocalic voiceless burst" are defined, largely in terms of first-tier descriptors.

A Class of Control Structures

The general idea is to match descriptors predicted on the basis of a linguistic transcription against candidate descriptors found in the waveform. Three cases are contemplated: (1) phonetic alignment, in which the correct transcription is known or hypothesized, and then matched against a description of the speech; (2) sequence spotting, which is like (1) above except that the endpoints of the match are free to float in the input descriptor stream; (3) phonetic parsing, in which the transcriptional pattern is generalized into the form of a grammar, and the matching algorithm is generalized to become a form of probabilistic (or otherwise "fuzzy") parsing. Type (1) algorithms can be used to construct annotated data bases, and can do isolated-word recognition on a hypothesize-and-test basis. Type (2) algorithms can be used as the first pass in a connected word recognizer, or as part of a phonological recognizer that looks for syllable-sized patterns. Type (3) algorithms will be an effective architecture for a connected word recognizer if the local information (provided by the descriptors under discussion) is good, since then a beam search or shortfall-scheduled search will work effectively. The conference presentation is divided into three parts: the nature of the lexically-derived phonetic patterns to be matched against speech-derived information; the definition of a match and its overall evaluation; and some of the algorithms that can be used to find nominally optimal matches. One general point: the hardest problem, the one that most stands in the way of progress, is the definition of effective local phonetic descriptors. Control structures for integrating local information are an interesting topic in their


own right, but for now, I think, the goal should be to find simple, adequate algorithms that don't get in the way of research on phonetic description.

The Role of Phonological Structure

It is a useful as well as interesting fact that the acoustic properties of the class of noises corresponding to a given word can be predicted from its phonological representation. As a result, a system of phonetic description can hope eventually to achieve closure, so that the introduction of a new vocabulary item does not require the inference of any information beyond its phonological "spelling." This is by no means just a matter of storing word templates at a slightly lower bit rate, which would be a matter of little consequence. The important thing is to "modularize" the mutual influence of adjacent words at their boundaries, the effects of phrasal structure, the vagaries of dialect, the consequences of rate, emphasis and affect, and so forth, so that these need not be learned anew for every word and every combination of other relevant circumstances.
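As an illustration of the kind of "first tier" descriptor discussed above, the sketch below picks peaks in a smoothed time derivative of log energy; the window length and threshold are arbitrary illustrative values, not Liberman's, and the filter-bank input format is an assumption.

    import numpy as np

    def energy_edges(frames, win=5, threshold=0.5):
        # frames: 2-D array, time x frequency channels (e.g. filter-bank output).
        log_e = np.log(np.maximum(frames.sum(axis=1), 1e-10))  # log energy per frame
        kernel = np.ones(win) / win
        d = np.diff(np.convolve(log_e, kernel, mode="same"))   # smoothed derivative
        edges = []
        for t in range(1, len(d) - 1):
            if d[t] > threshold and d[t - 1] <= d[t] >= d[t + 1]:
                # attach the numbers later stages can use for training
                edges.append({"frame": t, "slope": float(d[t])})
        return edges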

A COMPUTATIONAL MODEL OF INTERACTION BETWEEN AUDITORY, SYLLABIC AND LEXICAL KNOWLEDGE FOR COMPUTER PERCEPTION OF SPEECH
Renato de Mori
Department of Computer Science, Concordia University, Montreal, Quebec, Canada

ABSTRACT

A system organization for extracting acoustic cues and generating syllabic and lexical hypotheses in continuous speech is outlined. A model for machine perception involving a message passing through knowledge instantiations is proposed. Some components of this model can be implemented by a pipeline scheme, some others by an Actor's system and some others with special devices for performing parallel operations on a semantic network.

1 INTRODUCTION

Complex speaker-independent systems capable of understanding continuous speech must take into account many diverse types of features extracted by processes performing different perceptual tasks. Some of these processes have to work directly on the speech patterns for extracting acoustic cues. Some other processes operate on representations obtained by intermediate processing steps and generate feature hypotheses. Knowledge-based systems using rules for hypothesis generation can be enriched as new experience is gained, allowing simulation of human behaviour in learning new word pronunciations, the characteristics of new speakers or new languages. Algorithms for extracting a unique non-ambiguous description of the speech data in terms of acoustic cues have been proposed (De Mori [1]). These algorithms are based on a knowledge representation in which structural and procedural knowledge are fully integrated. Knowledge representation shows how the complex task of describing the acoustic cues of the speech message can be decomposed into sub-tasks that can be executed in parallel. Task decomposition and parallel execution of cooperating processes have the advantage of speeding up the operation of a speech understanding system which should work close to real-time. Furthermore, conceiving the system as a collection of interacting modules offers the more important advantage of designing, testing and updating separately and independently a large portion of the knowledge of each module. The performance of each module can be improved by adding specific knowledge which can be acquired in any particular area of speech and language research.

2 A COMPUTATIONAL MODEL FOR THE INTERACTION BETWEEN AUDITORY, SYLLABIC AND LEXICAL KNOWLEDGE

The knowledge of a Speech Understanding System (SUS) can be subdivided into levels. The choice of these levels is not unique.


Following past experience and in agreement with speech perception results (De Mori [1]), three levels are considered through which the speech signal is transformed into lexical hypotheses. These levels are called the auditory, the syllabic and the lexical one. Fig. 1 shows a three-dimensional network of computational activities. Activities Ai are performed at the auditory level, activities Si at the syllabic level, activities Lij at the lexical level. Each activity processes a portion of hypotheses in a time interval Tli corresponding to the duration of a syllable.

(0 t C1 r V ... (C3...), where C1 denotes the velar and palato-dental stops (k, g, t, and d) and L liquids (Japanese has only r). Historically, there are three distinct stages. Though the manner of borrowing into Japanese was once uniquely determined, drastic processes like Forward and Backward Vowel Assimilation of epenthetic vowels have gradually given way to Forward Consonant Assimilation, which represents a weaker process, but a more complicated rule. Thus, native phonology and loan phonology seem to differ in the power of analogy to promote rule generalization. The examples cited suggest that, for native phonology, once some members of a class submit to a change, analogy works to include those with less phonetic motivation, while loan phonology has the option of restructuring borrowed classes in ways not always predictable by analogy.

THE TYPOLOGICAL CHARACTER OF ACOUSTIC STRUCTURE OF EMOTIONAL SPEECH IN ENGLISH, RUSSIAN AND UKRAINIAN
E.A. Nushikyan
English Department, the University of Odessa, USSR

This paper deals with the general problem of typological investigations of the acoustic parameters of emotional speech. Six English, six Russian and six Ukrainian native speakers participated as subjects. To preserve comparability of the experimental material, situations were chosen from English fiction and then translated into Russian and Ukrainian, their overall number being two hundred statements, questions and exclamations expressing the most frequently observed positive and negative emotions. The material was recorded twice: in isolation (as non-emotional utterances) and in situations (as samples of emotional speech). Then oscillograms were obtained, and the fundamental frequency, intensity and duration of the speech signal were recorded. The detailed contrastive analysis of the acoustic structure of emotional and non-emotional speech has revealed that the frequency range, the frequency interval of the terminal tone, the frequency interval of the semantic centre, the peak intensity and the nuclear syllabic intensity are always greater in the emotional utterance in all languages under investigation. In this way the similarity of emotion expression is manifested. But it is the movement of the fundamental frequency in the emotional utterance that is peculiar to each language. The nuclear syllable duration and the duration of emotional utterances on the whole exceed the duration of neutral ones. On the other hand duration turns out to be the most variable acoustic parameter. For evaluating the average data, standard methods of mathematical statistics were applied (t ratio, Student's t). Formant frequencies were measured from broad-band spectrograms made on a Kay Sonograph (6061 B). A shift of the intensity of the formant frequencies of the stressed vowels into higher regions is observed in emotional speech. A constant increase of the total energy is observed at the expense of the first and second formants. We also found enlargement of the frequency ranges, as well as greater importance of the third and fourth formant frequency ranges. The results of this contrastive study permit us to suppose that the acoustic structure of emotional speech in English, Russian and Ukrainian displays more universal than particular properties in the manifestation of emotion.


STYLISTIC VARIATION IN R.P.
Susan Ramsaran
Department of Phonetics and Linguistics, University College, London, U.K.

A corpus of twenty hours of spontaneous English conversation was gathered in order to study the phonetic and phonological correlates of style in R.P. Six central subjects were recorded in both casual and formal social situations as well as in monologue. Each informant was recorded in different dyadic relationships, an attempt being made to hold all social variables constant apart from the subjects' relation to their interlocutors. Since the identity of the interlocutor is not an adequate criterion for determining speech style, a supplementary grammatical analysis of the data was also made. In formal situations where the topic of conversation is also formal, there is a tendency for speech to contain more stressed syllables, audibly released plosives and concentrations of nuclear tones than occur in the most casual stretches of speech. Assimilations (which do not increase with pace) and elisions are more frequent in the casual contexts. Linking /r/ and weak forms are not indicators of style. Although some of these features are distributionally marked, none is unique to a speech variety. It is concluded that in R.P. there is no shift from a distinct 'casual' style to a distinct 'formal' style. Instead gradual variation is to be seen as inherent in a unitary system.

CHARACTERISTIC FEATURES IN USAGE OF "LIAISON" IN THEATRICAL SPEECH
V.A. Sakharov
Department of Phonetics at the Institute of Foreign Languages, Tbilisi, Georgian SSR, USSR

According to the data of our research, all three categories of "liaison" are used in theatrical speech: liaison obligatoire, facultative and interdite. Facultative "liaison" constantly exchanges places with the obligatory kind, and the latter with facultative "liaison"; the conditions under which they change places depend on: a) the language and style of a literary work; b) the author and the period when a literary work was written; c) the direction of a theatre; d) the individuality of an actor. The usage of linkage in the contemporary language is largely determined by a number of factors, among which the following are essential: a) functional style; b) semantic-syntactic connection; c) the grammatical factor; d) phonetic conditions; e) the historical factor. Our work enables us to show that "liaison" is most frequently used in theatrical speech. According to our research we may say that it is theatrical speech which combines both the classical norms of the literary language and contemporary tendencies of the colloquial language, comparison of which enables us to trace the evolution undergone by "liaison" at the present stage of development of the French language.


STRATEGIES FOR REPLACING THE APICAL R WITH THE POSTERIOR R
L. Santerre
Département de Linguistique, Université de Montréal, Québec, Canada

Posterior R's tend more and more to replace anterior R's in the careful speech of francophones in Montreal. Many speakers produce now one, now the other variphone, but also very commonly both at once in word-final position, e.g. Pierre [pjɛRr]. It is the first R which is the more intense and the one perceived, and hence transmitted to children; X-ray films show that the second is realised by an occlusion of the apex against the alveolar ridge, but is barely audible. Other strategies: reduction of /r/ to a single occlusion in intervocalic position, e.g. la reine [laren]; after a consonant there is epenthesis, e.g. drame [deram], crème [kərem], une robe [ynarob], ardoise rouge [ardwazaruʒ]. Before an apical consonant, [r] is reduced to its implosive phase, e.g. retourne [return], regarde [regard], université [yniversite]; before a voiced non-apical constrictive an epenthetic vowel is produced, e.g. large [larəʒ], pour vous [puravu]; before stops and before voiceless consonants, [r] may be reduced to its implosive phase, e.g. remarque [rəmark], marche [marʃ], parfumeuse [parfymøz]. Speakers tend to reduce to a minimum the number of anterior taps. Note: the epentheses are not perceived as such, but as a second tap. Presentation of articulatory X-ray tracings synchronised with spectral analysis. = only the implosive phase of the tap

SOCIO-STYLISTIC VARIABILITY OF SEGMENTAL UNITS
V.S. Sokolova, N.I. Portnova
Foreign Languages Institute, Moscow, USSR

1. The problems of studying the socio-stylistic variability of units at the segmental level are very acute in modern sociophonetics.
2. The study of the socio- and stylistically informative value of the variants of segmental units serves to reveal the inner mechanism of sound modifications and the peculiarities of the combinations and distribution of concrete realizations in such varieties of speech as functional stylistic varieties and the speech of social groups.
3. The results of the present experimental phonetic investigation, carried out on the basis of the French language, make it possible to state the following:
3.1. The inner mechanism of the variability of vowels lies in changes of the acoustic parameters: the frequency index of the second and zero formants in the spectrum of vowels, the energy saturation of the spectrum and the length of vowels.
3.2. The realization of the so-called semi-vowels depends upon extralinguistic factors.
3.3. The timbre characteristics of vowels, and especially the functioning of semi-vowels, correlate with such extralinguistic factors as the age status of an informant, which affects interpersonal sound variation, and changing communicative situations, which determine individual sound variability in stylistic varieties of oral texts.


DISTANCES BETWEEN LANGUAGES WITHIN LANGUAGE FAMILIES FROM THE POINT OF VIEW OF PHONOSTATISTICS
Yuri A. Tambovtsev
Department of Foreign Languages at the Novosibirsk University, Novosibirsk, USSR

There may be different classifications and measurements of the distances between languages in language families. It is proposed in this report to measure the distances by the values of frequencies of certain phoneme groups in speech. These phonostatistical methods are believed to give the necessary precision and objectivity. The following methods of phonostatistics are used in this study: 1) the value of the consonant coefficient; 2) the value of the ratios and differences of frequencies of certain groups of consonants in speech and in the inventory; 3) the value of ratios and differences of frequencies in different languages derived by a unified four-features method. In discussing the closeness of some languages to others within the language family, it is important to take into consideration both typological and universal characteristics among the languages subjected to phonostatistical investigation. Special coefficients which take into account the values of the genuine language universals are introduced. Some powerful methods of mathematical statistics are used in order to avoid mistaken conclusions caused by a number of common features due to pure chance. The group of experimental linguistics of Novosibirsk University has studied by phonostatistical methods the languages of the Finno-Ugric, Turkic, Tungus-Manchurian and Paleo-Asiatic families and some isolated languages of Siberia and the Far East. Some phonostatistical studies are based on acrolects, taking a language as a whole; others deal with dialects. This study involved dialects, to measure the distances between the dialectal forms and the literary language on the one hand, and it considered the distances between languages as a whole on the other. The main objective of the study is to demonstrate the relations of languages not only on a qualitative but also on a quantitative scale. It is believed that it is quite essential to determine the correct degrees of relationship between the dialects of a language and between the languages in language families. The method is believed to be basically new and is being applied to some Siberian, Finno-Ugric, Turkic and Tungus-Manchurian languages for the first time. These phonostatistical methods lead to outcomes similar to the results achieved by classical comparative linguistic methods. The degrees of closeness in the relationship between some Finno-Ugric, Slavonic and Turkic languages may encourage scholars to use such analyses for the whole Finno-Ugric, Slavonic, Turkic, Tungus-Manchurian and other language families, and to help to include some isolated languages in language families, since this particular approach seems to be accurate, precise and promising.
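The abstract does not define its coefficients, so the sketch below simply assumes the "consonant coefficient" is the proportion of consonant tokens in running text and takes the absolute difference of two such values as a toy distance; real phonostatistics would use frequencies of several phoneme groups and proper significance tests.

    def consonant_coefficient(tokens, consonants):
        # Proportion of consonant tokens in a transcribed text sample.
        return sum(1 for t in tokens if t in consonants) / len(tokens)

    def distance(coeff_a, coeff_b):
        return abs(coeff_a - coeff_b)

    CONS = {"p", "t", "k", "s", "m", "n", "r", "l"}
    sample_a = list("takamanasira")   # invented phoneme strings
    sample_b = list("krtsmntrplks")
    print(distance(consonant_coefficient(sample_a, CONS),
                   consonant_coefficient(sample_b, CONS)))  # 0.5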

THE PROBLEM OF EXPLAINING PHONIC INNOVATIONS AND THE CASE OF THE LABIOVELARS IN ANCIENT GREEK
A. Uguzzoni
Istituto di Glottologia, Università di Bologna, Italia

The application of Henning Andersen's model has proved particularly interesting for the study of the origin of the phonic changes which underlie diachronic correspondences and which lead to the formation of dialectal divergences and convergences. Such a perspective makes it possible to formulate satisfactory hypotheses on the nature, motivations and modalities of evolutive innovations, and makes a constructive contribution to the redefinition of the controversial problem of explanation in historical linguistics. The research of which we present some extracts here falls within this theoretical and methodological framework. The study of the elimination of the labiovelars in Ancient Greek reveals anomalies which can be re-examined and overcome by taking into account both the universal aspects of phonetic substance and the specific aspects of the phonology of the Greek dialects. A highly controversial case is represented by the developments of the labiovelars followed by e. We propose to consider the diachronic correspondence labiovelars : dentals as the result of two changes of essentially different character: a contextual change, by which the labiovelars are palatalised before i, e, and a non-contextual one, by which the products of palatalisation are replaced by dentals. The first is a deductive innovation, consisting in the introduction of a new pronunciation rule, while the second is an abductive innovation, consisting in the reinterpretation of the phonetic entities resulting from the preceding process. The distinction between these evolutionary stages highlights the explanatory contribution both of articulatory factors and of acoustic and auditory factors. The exegesis proposed in this paper rests on the hypothesis of two preconditions. On the one hand, it is postulated that the divergence between the Aeolic and the non-Aeolic outcomes depends on a different degree of palatalisation affecting, respectively, the labiovelars followed by i and the labiovelars followed by i, e. On the other hand, it is observed that what permits the confusion between the palatalised labiovelars and the dentals is probably an acoustic-auditory resemblance connected with the second-formant transitions. Examination of the diachronic correspondence labiovelars : labials confirms the role of the learner-listener in evolutionary processes and the importance of studying the acoustic ambiguities which can constitute the premises for abductive innovations. It is nevertheless not excluded that the definitive phonetic projection of the reinterpretation of the labiovelars as labials was reached through intermediate articulatory stages, similar to those envisaged by traditional analyses of the evolution of the labiovelars.


663 THE RELATIVE IMPORTANCE OF VOCAL SPEECH PARAMETERS FOR THE DISCRIMINATION AMONG EMOTIONS R. van Bezooijen and L. Boves Institute of Phonetics, University of Nijmegen, The Netherlands In the experimental literature on vocal expressions of emotion two quite independent mainstreams may be distinguished, namely research which focusses on a parametric description of vocal expressions of emotion and research which examines the recognizability of vocal expressions of emotion. In the present contribution an effort is made to link the two approaches in order to gain insight into the relative importance of various vocal speech parameters for the discrimination among emotions by human subjects. One hundred and sixty emotional utterances (8 speakers x 2 standard phrases x 10 emotions) were rated by six slightly trained judges on 13 vocal speech parameters. The speakers were native speakers of Dutch of between 20 and 26 years of age. The phrases were "two months pregnant" (/tve: ma:ndə zvaŋər/) and "such a big American car" (/zo:n ɣro:tə amerika:nsə o:to:/). The emotions included disgust, surprise, shame, interest and joy; the parameters were pitch level, pitch range, loudness/effort, tempo, precision of articulation, laryngeal tension, laryngeal laxness, lip rounding, lip spreading, creak, harshness, tremulousness, and whisper. The means of the ratings on 12 of these parameters - lip rounding was discarded because an analysis of variance failed to reveal a significant effect of the factor emotion - were subjected to a multiple discriminant analysis. The aim of this analysis was to attain an optimal separation of the 10 emotional categories by constructing a limited set of dimensions which are linear combinations of the original 12 discriminating variables. In a 3-function solution 62.5% of the utterances were correctly classified. The same 160 emotional utterances which were auditorily described were offered to 48 Dutch adults with the request to indicate for each one which of the 10 emotional categories had been expressed. Of the responses 67% proved to be correct. The confusion data resulting from this recognition experiment were first symmetrized and then subjected to a multidimensional scaling analysis, which aimed at providing insight into the nature of the dimensions underlying the classificatory behavior of the subjects. A comparison of the dimensions emerging from the two analyses suggests that the dimension of level of activation - and the vocal speech parameters related to level of activation, such as loudness/effort, laryngeal tension, laryngeal laxness, and pitch range - plays a central role in the discrimination among emotional expressions, not only in a statistical sense, but also in connection with the classificatory behavior of human subjects. An evaluative dimension was not clearly present.
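
As an illustration of the symmetrization step mentioned above, the following Python fragment shows one standard way of doing it; the 3x3 matrix is hypothetical (the study used 10 emotion categories):

    import numpy as np

    # Hypothetical confusion matrix C: C[i, j] counts how often emotion i
    # was labelled as emotion j by the listeners (the study had 10 categories).
    C = np.array([[40,  5,  3],
                  [ 6, 35,  7],
                  [ 2,  8, 38]], dtype=float)

    S = (C + C.T) / 2    # symmetrize: average the confusions in both directions
    D = S.max() - S      # turn similarities into dissimilarities for scaling
    print(D)

A dissimilarity matrix of this kind is then the input to the multidimensional scaling analysis described.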

664 A CROSS-DIALECT STUDY OF VOWEL PERCEPTION IN STANDARD INDONESIAN¹ Ellen van Zanten & Vincent J. van Heuven Dept. of Linguistics/Phonetics Laboratory, University of Leyden, the Netherlands When the Republik Indonesia was founded as an independent state in 1947, Indonesian was declared the national language, after having served as a lingua franca for the entire archipelago for centuries. In our research we aim to establish both acoustic and perceptual aspects of the Indonesian vowel system, as well as the (potential) effects of a speaker's local vernacular or substrate dialect on his performance in the standard language. In the present experiment we sought to map the internal representation of the 6 Indonesian monophthongs (/i, e, a, o, u, ə/) for 3 groups of subjects: 4 Toba Bataks, 5 Javanese, 4 Sundanese. The Javanese vowel system is identical to that of the standard language; the Batak dialect lacks the central vowel, whereas Sundanese has an extra central (high) vowel /ɨ/. 188 vowel sounds were synthesized on a Fonema OVE IIIb speech synthesizer, sampling the acoustic F1/F2 plane in both dimensions with 9% frequency steps. Subjects listened to the material twice, in counterbalanced random orders. They were instructed to label each stimulus as one of the 6 Indonesian monophthongs (forced choice), as well as to rate each token along a 3-point acceptability scale. The results contain important differences in the preferred locations of the response vowels, which are argued to reflect properties of the substrate systems of the three groups of subjects. For Batak listeners, /ə/ is the least favoured response category, and its distribution is highly irregular and restricted; for Sundanese listeners, however, it is the most favoured vowel, occupying a relatively large area in the F1/F2 plane, to the detriment of /u/, whose distribution is conspicuously more restricted here than with the Javanese and Batak listeners. Javanese listeners performed their task better than the other groups, suggesting that the Javanese substrate interferes least with Indonesian. We conclude that the labelling method is a highly sensitive tool in the study of vowel systems that merits wider application in field studies. ¹This research was supported in part by a grant from the Netherlands Organisation for the Advancement of Pure Research (ZWO) under project # 17-21-20 (Stichting Taalwetenschap).
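
The construction of the stimulus grid can be imagined as in the sketch below; the abstract specifies the 9% step size and the total of 188 stimuli, but not the frequency ranges, which are assumed here:

    # Sketch of an F1/F2 stimulus grid with 9% frequency steps.
    # The start and end frequencies are assumptions, not the study's values.
    def geometric_steps(start_hz, stop_hz, step=1.09):
        values, f = [], start_hz
        while f <= stop_hz:
            values.append(round(f))
            f *= step
        return values

    f1_values = geometric_steps(250.0, 1000.0)
    f2_values = geometric_steps(600.0, 2600.0)
    grid = [(f1, f2) for f1 in f1_values for f2 in f2_values if f2 > f1]
    print(len(f1_values), len(f2_values), len(grid))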


665 THE PHONEMIC SYSTEM AND CHANGE IN PRONUNCIATION NORMS L.A. Verbitskaja University of Leningrad, Leningrad, USSR Questions concerning the connection between a phonemic system and pronunciation norms are considered from the point of view of the interrelation between the phonological system and the norms of pronunciation. A phonemic system includes not only the inventory of phonemes, but also the distribution of phonemes, their alternations, their functions, and phonemic combinations. The problem of pronunciation norms arises because a system has not just one but two or more possible designations for one and the same linguistic realization. Variation within the system is limited by the norms. Pronunciation norms can change either within the limitations of a system or as a result of changes internal to the system. The influence of orthography can hardly be considered a major factor affecting changes in pronunciation norms.

666 THE STRUCTURAL UNITS OF EMOTIONALLY EXPRESSIVE SPEECH E.N. Vinarskaya Foreign Language Institute, Moscow, USSR Any speech utterance is formed of structural units of one of three types: sign units of language, extralinguistic sign units, and non-language inborn biological units. Language units belong to the uppermost hierarchical level and dominate the units of the lower hierarchical levels, thus making their structure less defined. It is in simplified speech structures (utterances by infants, or the everyday vernacular) that the structural peculiarities of the lower sign and non-sign units stand out most explicitly. The extralinguistic sign units of speech can also be described as specific units of emotional expressiveness. Active formation of these sign units takes place in the pre-language period of early childhood. Just as the peripheral speech organs build themselves upon the earlier formed digestive and respiratory systems, language units develop on the basis of already existing emotionally expressive signs. Emotionally expressive signs participate in the processes of speech in two functions: one that provides the transmission of emotional meaning, and whose evolution was completed long ago, and another that was acquired much later and is functionally linguistic.


667 TENDENCIES IN CONTEMPORARY FRENCH PRONUNCIATION I. Zhgenti Head of the Department of Phonetics at the Institute of Foreign Languages, Tbilisi, Georgian SSR, USSR Sounds in contemporary French were studied on the basis of the phonological distribution of phonemes and with the help of research conducted in Paris in 1972, after which a complex experimental investigation was carried out, aimed at establishing pronunciation tendencies and articulatory-acoustic changes by means of spectral and X-ray analysis, oscillography and the synthesis of speech sounds. As a result of our research, the following tendencies can be outlined in French pronunciation: a) a tendency towards front articulation; b) a tendency towards delabialization and a reduction of the number of nasal vowels; c) a tendency towards open articulation; d) the stabilization of uvular "r", or "r grasseyé", in standard French pronunciation, which also supports the general tendency towards front pronunciation. As for the pronunciation of the variphones of "r" in French, German, Swedish, Dutch, Danish, etc., we may conclude from our observations that a uvular "r" similar to the French "r grasseyé" is pronounced in those vocalic-type languages whose phonological systems include 12 vowels, of which 8 are of front articulation, i.e. languages which have a tendency towards front pronunciation. The regularities revealed here can be considered universal.

Section 16

Phonetics and Phonology


671 DIFFERENCES IN DISTRIBUTION BETWEEN ARABIC /l/, /r/ AND ENGLISH /l/, /r/ Mohammad Anani Department of English at the University of Jordan, Amman, Jordan Phonetic differences between lateral and trill articulations in 'emphatic' and 'non-emphatic' contexts raise special problems for Arabic speakers of English. The contextual distribution of the Arabic varieties of /l/ and /r/ is examined, and differences in distribution between Arabic /l/, /r/ and English /l/, /r/ are stated. Some instances of pronunciation errors due to such differences are mentioned.

672 THE PHONETIC CHARACTERIZATION OF LENITION Laurie Bauer Victoria University of Wellington, New Zealand If lenition is a phonological process that does not have a phonetic definition, it very soon becomes clear that there are instances where it is not possible to state non-circularly whether lenition or fortition is involved. Most attempts to give a phonetic characterization of lenition seem to suggest that: 1) the voicing of voiceless stops; and 2) the spirantization of voiced stops are lenitions; but that 3) the centralization of vowels to schwa is not a lenition, although it is usually described as such in the literature. This raises the question of whether lenition is a unitary phenomenon as it affects both consonants and vowels, and whether a single characterization is possible. These are the questions that will be posed in this paper, and a number of possible phonetic characterizations of lenition will be considered in an attempt to provide an answer.

673 VOWEL REDUCTION IN DUTCH G.E. Booij Department of General Linguistics, Free University, Amsterdam, the Netherlands In Dutch, a vowel in a non-wordfinal, unstressed syllable can be reduced to a schwa. If the syllable is the word-initial one, the vowel to be reduced must be in syllable-final position. This process of reduction manifests itself in three ways: (1) as a diachronic process, with concomitant restructuring of the underlying form, as in repetítie, conferéntie, reclame, televísie, where the underlined e always stands for a schwa; (2) as an obligatory synchronic rule, as shown by the following pairs of related words: proféet-profetéer, juwéel-juwelíer, géne-genánt; (3) as an optional style-dependent synchronic rule, e.g. in banáan, polítie, minúut, relátie, economíe, where the underlined vowel can be pronounced as a schwa. In Koopmans-van Beinum (1980) it is shown that vowel reduction (= vowel contrast reduction) is a general phonetic tendency of Dutch, occurring in all speech styles, both in lexically stressed and lexically unstressed syllables. This tendency, presumably a manifestation of the principle of minimal effort, does not suffice, however, to explain the phenomena observed above, although it makes them understandable. The vowel reduction in (1)-(3) is a grammaticalization of a phonetic tendency, manifesting itself in three different ways in the grammar of Dutch. This grammaticalization can also be inferred from the fact that vowel reduction is lexically governed (vowels in words of high frequency reduce more easily). Consequently, the acoustic realization of a Dutch vowel is determined by at least the following factors: (i) the systematic phonetic representation of the vowel; this may be a schwa, due to the diachronic or synchronic rule of vowel reduction; (ii) the general tendency of contrast reduction in both stressed and unstressed syllables. This implies that in many cases phonological intuitions with respect to vowel reduction cannot be checked by means of phonetic measurements. If the phonological rule claims that reduction is possible, but we do not find it phonetically, this may be due to the optionality of the rule. Conversely, if the rule says that reduction is impossible (e.g. in lexically stressed syllables), and yet we do find reduction, this can be ascribed to the general phonetic tendency. I conclude, therefore, that - even in the case of relatively low-level rules - it is not possible to provide direct evidence for the reality and correctness of phonological rules by means of phonetic experimentation.

Reference: F.J. Koopmans-van Beinum, Vowel contrast reduction. An acoustic and perceptual study of Dutch vowels in various speech conditions. Amsterdam: Academische Pers, 1980.

674 ON THE PHONETICS AND PHONOLOGY OF CONSONANT CLUSTERS Andrew Butcher Department of Linguistic Science, University of Reading, UK. Syllable-final consonant clusters of the type NASAL + FRICATIVE and NASAL + STOP + FRICATIVE have attracted the interest of both phoneticians and phonologists over the years. On the one hand, study of variation in the temporal organization of such complex sequences (across languages, across speakers and within speakers) provides insight into various aspects of the speech production mechanism. On the other hand, the often apparently optional presence in such clusters of phases perceived as oral stops has provoked some discussion as to whether phonemes are being inserted, deleted or inhibited. This paper presents an evaluation of some acoustic, electroglottographic, pneumotachographic and electropalatographic data recorded simultaneously, using British English speakers. The clusters concerned - with (phonologically) both voiced and voiceless (STOP +) FRICATIVE cognates - were pronounced under three different conditions: isolated words, sentences and connected texts. The results indicate that the occurrence of an oral stop during fully voiced sequences - and therefore the maintenance of the phonological opposition between /-ndz/ and /-nz/ clusters - is extremely rare. The occurrence of 'stopped' versus 'stopless' transitions between nasals and voiceless fricatives, on the other hand, is highly variable, although for most speakers it bears no relation to the phonological opposition predicted by the orthography. It depends on the relative timing of velic closure, oral release and glottal abduction. Simultaneous execution of all three gestures is most infrequent, and most 'stopless' pronunciations in fact include a period of simultaneous oral and nasal air flow as the velic closure lags behind. Transitions in which a stop is perceived sometimes include a true stop, where oral release occurs after the other two gestures. The most common strategy of all, however, is the delay of both velic closure and oral release until well after glottal opening, producing a voiceless nasal phase, which is nonetheless perceived as a stop. On the basis of this and further auditory-based data, the following factors seem to play a role in determining the occurrence of a perceived stop: firstly, speakers may be divided, apparently on a regional/social basis, into habitual 'stoppers', 'non-stoppers' and 'distinguishers'; secondly, increased tempo leads to less stopping and less consistent distinction in general. Much less importantly, there is also less stopping in homorganic sequences and more stopping in cases where a semantic opposition depends on the stop versus stopless distinction. Obviously the range of variation to be found in such sequences is rather wide, and presents difficulties for any monosystemic phonemic type of analysis. It is suggested that a polysystemic approach might be somewhat more satisfactory, or even one which treated such clusters as single phonemes and differences in their realization at the level of diaphonic and allophonic variation.

675 VOWEL CONTRAST REDUCTION IN TERMS OF ACOUSTIC SYSTEM CONTRAST IN VARIOUS LANGUAGES Tjeerd de Graaf¹ and Florien J. Koopmans-van Beinum² ¹Institute of Phonetic Sciences, Groningen University, The Netherlands. ²Institute of Phonetic Sciences, University of Amsterdam, The Netherlands. In a previous study on vowel contrast reduction (Koopmans-van Beinum, 1980) a measure was introduced to indicate the degree of acoustic contrast of vowel systems in various speech conditions. For the Dutch vowel system the values of this measure ASC (Acoustic System Contrast = the total variance of a vowel system) decrease in a similar way for male as well as for female speakers when passing from vowels pronounced in isolation via isolated words, stressed and unstressed vowels in a text read aloud, and stressed vowels in a retold story and in free conversation to, finally, unstressed vowels in a retold story and in free conversation. Thus the ASC measure, its value being to a large extent dependent on speaker and on speech condition, provides a quantitative description of the process of vowel contrast reduction. Can this measure ASC also be used for the comparison of vowel contrast reduction in other languages? This is of particular interest when these languages have vowel systems differing in the number of vowels involved, or when the individual vowels assume deviant positions in the acoustic vowel space. We might hypothesize that systems involving fewer vowels would show a larger degree of vowel contrast reduction than richer vowel systems. However, our first results in the comparison of Dutch (12 vowels), Italian (7 vowels), and Japanese (5 vowels) display a similar pattern of vowel contrast reduction, independent of the number of vowels or of their position in the acoustic space, as can be seen in the illustration. Another point that will be discussed is the question whether the ASC measure can also be applied to vowel systems with nasalized vowels and to systems with distinctive pairs of long and short vowels. Koopmans-van Beinum, F.J. (1980). 'Vowel Contrast Reduction'. Diss. Amsterdam. Illustration: Individual ASC values in various speech conditions for Dutch (2 male: D1 and D2, and 2 female: D3 and D4), Italian (2 male: I1 and I2), and Japanese (3 male: J1, J2, and J3) speakers.
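
Taking "total variance" literally, the ASC measure can be sketched as follows; the formant values are hypothetical, and the exact definition used in the dissertation may differ in detail:

    import numpy as np

    # ASC sketch: total variance of a vowel system in the formant plane,
    # computed as the mean squared distance of the vowels to their centroid.
    vowels = np.array([      # hypothetical (F1, F2) means in Hz
        [300, 2300],         # i
        [500, 1900],         # e
        [750, 1300],         # a
        [500,  900],         # o
        [300,  700],         # u
    ], dtype=float)

    centroid = vowels.mean(axis=0)
    asc = ((vowels - centroid) ** 2).sum(axis=1).mean()
    print(asc)

Computed per speaker and per speech condition, such a value drops as the realized vowels crowd together, which is the contrast reduction the abstract describes.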


676 VOWEL QUALITY IN DIFFERENT LANGUAGES AS PRONOUNCED BY THE SAME SPEAKER Sandra Ferrari Disner UCLA Phonetics Laboratory, Los Angeles, USA Most cross-linguistic comparisons of vowel quality are impaired by the diversity of speakers in the language groups. If significant differences are found between the formant frequencies of, say, the vowel transcribed as [a] in one language and the vowel transcribed as [a] in another, one cannot be certain that such differences are entirely linguistic in nature. While they may be due to shifts along linguistic parameters such as tongue height or pharyngeal constriction, there is always a possibility that the observed acoustic differences between pairs of similar vowels are due to consistent anatomical differences between the samples, such as a difference in mean vocal tract length or lip dimensions. This difficulty can be avoided by studying a group that speaks both of the languages to be compared. Over a period of time 27 subjects were recorded who spoke two or more of the following languages: English, Dutch, German, Norwegian, Swedish, and Danish. The list of monosyllabic test words pronounced by each speaker was submitted to a panel of native-language judges for evaluation; only those speakers who were judged to have native proficiency in at least two languages were selected for this study. The sample of speakers who passed this evaluation enables many of the possible pairs of these languages to be compared. Since each speaker employs the same vocal apparatus to produce the vowels of two or more languages, the observed differences between pairs of similar vowels cannot be attributed to anatomical differences between the language groups. Rather, they are a close approximation of all and only the linguistic differences which hold between the vowels of these six closely-related languages. Formant-frequency charts of eight different pairs of languages reveal that many of the differences frequently noted in conventional language comparisons, such as the relatively low F1 of Danish vowels and the relative centralization of English vowels, are supported by these bilingual results. In addition, a number of more specific differences, affecting a single vowel rather than the entire vowel system, are found across all speakers. Many of these differences are consistent but quite small, and as such have not received attention in the literature to date.

677 CONTRASTIVE STUDY OF THE MAIN FEATURES OF CHINESE AND ENGLISH SOUND SYSTEMS Gui Cankun Guangzhou Institute of Foreign Languages, Guangzhou, China
1. Differences in segmental phonemes: a. different ways of classification; b. differences in distinctive features; c. differences in distribution; d. trouble spots for Chinese students learning English.
2. Tone language vs. intonation language: Chinese is a tone language and English an intonation language; each has its own characteristics; trouble spots for Chinese students.
3. Differences in rhythm: different arrangements of rhythm patterns; Chinese is syllable-timed while English is stress-timed; examples in poetry and conversation; trouble spots for Chinese students.
4. Differences in juncture: Chinese speech flow is staccato, with no smooth linking of syllables or words; English speech flow is legato, with very smooth linking of words with the exception of breaks and pauses; trouble spots for Chinese students.

678 DURATION AND COARTICULATION IN PERCEPTION: A CROSS-LANGUAGE STUDY Charles Hoequist Institut für Phonetik der Universität Kiel, West Germany A perception experiment was run to test predictions concerning two aspects of speech production: rhythm and coarticulation. The rhythm hypothesis under test concerns the postulation of rhythmic categories, such as stress-timing, syllable-timing, and mora-timing, into which languages can be grouped. Some experimental evidence indicates that these impressionistic categories do correspond to differences in the way and degree syllabic duration is utilized by speakers of languages claimed to belong to different rhythm categories. The coarticulation hypothesis is based on evidence indicating that vowel production in speech usually coarticulates across consonants, so that a consonant does not necessarily serve as the boundary of a vowel. The 'edges' of vowels (and therefore perhaps of syllables) seem to depend more on adjacent vowels than on intervening consonants. If the perception of duration uses vowel boundaries as markers, then it may be that perceived syllable duration (along with whatever is signaled by syllable duration in language) can be changed by altering degrees of coarticulation of vowels in adjacent syllables, without changing the temporal relations among the consonants. In other words, there might be a perceptual tradeoff between coarticulation and duration. The results of the experiment indicate little if any tradeoff between coarticulation and duration perception under the conditions tested. However, duration perception alone does partially correlate with durational characteristics of the subjects' native languages.

679 VOWEL SPACE AND PERCEPTUAL STRUCTURING: AN APPLICATION TO THE VOWEL SYSTEM OF SWAHILI J.M. Hombert and G. Puech Centre de Recherche en Linguistique et Sémiologie, UER Sciences du Langage, Université Lyon, France The model proposed by Lindblom and Sundberg (1972) predicts that vowels distribute themselves in the acoustic space in such a way as to maximize the distances which separate them. In structural phonology it is held, on the other hand, that the durability of a system rests on the maintenance of distinctions, even if the phonetic distance between certain vowels, measured for example from the values of the first three formants, is minimal. Our contribution to this debate consists in the first place in comparing three five-vowel systems: that of Swahili, studied with 6 speakers (3 men and 3 women), and those of Japanese and Spanish, for which published data exist. This comparison shows that while Lindblom's hypotheses account satisfactorily for the case of Swahili, they fit the two other cases poorly. We believe that the most significant method is in fact the analysis not of acoustic distances but of perceptual distances. We will therefore present to the 6 speakers whose systems we have already analysed 53 synthetic vowels distributed over the whole vowel space, each presented five times in random order, and ask them to compare these vowels with those of their own system. The 6 subjects will thus be led to make a phonological partitioning of their vowel space. This method makes it possible to collect data which are comparable from one speaker to another without running into the traditional problem of normalization. The results obtained will show to what extent the perceptual structuring carried out by speakers whose system maximizes the acoustic distances between vowels is comparable to that which must necessarily be assumed for systems in which this same distance is minimal.
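
A dispersion measure in the spirit of the model cited above can be sketched as follows; the formant values are hypothetical, and the Bark conversion used here (the Zwicker-Terhardt approximation) is only one of several possible perceptual scales:

    import math

    def bark(f_hz):
        # Zwicker-Terhardt approximation of the Bark scale
        return 13 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500) ** 2)

    # Hypothetical five-vowel system, (F1, F2) in Hz.
    vowels = {'i': (300, 2300), 'e': (450, 1900), 'a': (750, 1300),
              'o': (450, 900),  'u': (300, 700)}

    def energy(system):
        """Summed inverse squared distances; smaller values = better dispersed."""
        pts = [(bark(f1), bark(f2)) for f1, f2 in system.values()]
        total = 0.0
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                d2 = (pts[i][0] - pts[j][0]) ** 2 + (pts[i][1] - pts[j][1]) ** 2
                total += 1.0 / d2
        return total

    print(energy(vowels))

Alternative five-vowel configurations can be compared by this value; a maximally dispersed system minimizes it, which is what the prediction discussed above amounts to.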

680 DISTINCTIVE FEATURES IN A PHONOLOGY OF PERCEPTION Tore Janson Department of Linguistics, University of Stockholm, Sweden In modern phonology, it has often been assumed that phonological features and rules possess "psychological reality". This may imply, among other things, that distinctions corresponding to a feature classification are actually made in perception and/or production. Here some consequences are discussed of the assumption that such distinctions are made in perception. In that case, there must be a feature classification in the memorized lexical items, to be matched in some way with a corresponding classification of the incoming signal. The classification in the lexicon must be fully specified for all distinctions made in careful pronunciation. That is, all relevant features have to be assigned one discrete value (+ or -, in most phonological notations). The classification of an incoming signal consisting of a carefully pronounced word in isolation could receive an identical classification. In ordinary speech, however, many segments are pronounced in a way that does not allow complete phonological classification of the ensuing acoustic signal. The most important causes are reduction and various forms of assimilation. An example is centralized pronunciation of vowels. A phonetic form [ləv] in English may represent either live or love. Under such circumstances, the relevant features for the incoming vowel are indeterminate, and can be denoted ± (e.g. ±high, ±low). An indeterminate feature for the incoming signal is different from a feature that has not been determined (because of interfering noise, for example), which could be denoted by a question mark (?). While the presence of a ? does not allow any further conclusions, a ± in the incoming matrix is a feature value associated with a particular configuration of phonetic parameters. It supplies not only the information that the lexical matrix should have + or - in the relevant slot, but also that the feature is one that can be subjected to indeterminacy in this particular context. For example, ±high in English is incompatible with +tense, since tense vowels do not reduce to schwa. Thus, in matrices for incoming signals, binary features may take on at least four values: +, -, ±, ?. In the oral presentation, several cases of assignment of ± will be demonstrated, with some discussion of the theoretical implications.
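
The matching of a four-valued incoming matrix against fully specified lexical matrices can be sketched as follows; the feature sets, words and values are invented for illustration, with '+-' standing for the indeterminate value ± and '?' for an undetermined feature:

    # Sketch of lexical matching with four feature values: '+', '-', '+-', '?'.
    LEXICON = {
        'live':  {'high': '+', 'tense': '-'},
        'love':  {'high': '-', 'tense': '-'},
        'leave': {'high': '+', 'tense': '+'},
    }

    def compatible(incoming, lexical):
        for feature, value in incoming.items():
            if value in ('?', '+-'):      # no constraint on + vs. - in the lexicon
                continue
            if lexical.get(feature) != value:
                return False
        return True

    # A centralized vowel: height indeterminate, tenseness known to be minus,
    # since tense vowels do not reduce.
    incoming = {'high': '+-', 'tense': '-'}
    print([w for w, m in LEXICON.items() if compatible(incoming, m)])
    # -> ['live', 'love'] but not 'leave'

A fuller model would also exploit the ± value itself, since, as argued above, it marks the feature as one that is reducible in the given context; the sketch shows only the basic matching step.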


681 ON THE PERCEPTION OF DISTINCTIVE FEATURES OF PHONEMES Z.N. Japaridze The Institute of Linguistics of the Academy of Sciences of the Georgian SSR, Tbilisi, USSR 1. In acts of speech, at the stage of perception, distinctive features have no form; therefore the description of this form (one of the aims of Jakobson's theory of distinctive features) is in principle impossible. This form (the sound corresponding to each feature) is not necessary for language functioning and is not part of normal language perception. 2. In language perception distinctive features are differentiated only on a subsensory level. On the sensory level a phoneme cannot be broken down into either consecutive or simultaneous components. 3. In some cases, however, the observation of speech sounds makes it possible to speak about the perception of distinctive features. This is the case for those features which are added to feature complexes and change their characteristics only slightly. In these cases speakers can perceive the sound of the feature, such as nasality in vowels or aspiration in consonants, although we are unable to perceive the sounds of features which cause considerable changes in the formant structure or the dynamics of the sound, e.g. the sound of the features "diffuse", "interrupted", etc.

682 ON THE LEVEL OF SYLLABIC FUNCTIONING L. Kalnin Institute of Slavic and Balkan Studies, Moscow, USSR The syllabic composition of speech possesses suprasegmental and segmental aspects (that is, respectively, the creation of syllables, producing unbroken syllabic chains, and the division of words into syllables, with the syllable as the main segment). Suprasegmental characteristics of the language realize themselves in the syllabic chain, not in the syllable itself. The syllable as a segment has a different functional status in languages with unified and with non-unified types of syllables. Unified syllables may be identified with phonemes. The function of the non-unified syllable may be determined by correlation with: a) its evaluation by the language-user; b) the sound syntagmatics of the word; c) the morphemic composition of the word. Dialects offer the most favourable possibilities for research into the phonetics of syllabic division, because there the phonetic notions of the language-users are not influenced by morphemic and orthographic associations. Investigation of the syllable and syllabic division in the Russian dialects brought us to the conclusion that the syllable functions on the borderline between the levels of speech and language. Indices of the speech level: a) the process of syllabic division is intuitive; it is influenced by factors that are not within the scope of the speaker's consciousness; b) the lack of a functional meaning of the syllable - among the phonetic phenomena there is no function which is possessed exclusively by the syllable. Indices of the language level: a) the speaker has his own point of view on the correctness/incorrectness of a syllabic division, and the phonetic structure of the syllable is one of the peculiarities of the phonetic system of a given language; b) syllabic division destroys any audibly perceptible connection between sounds within the word (assimilation, dissimilation); it does not only change the phonetics of the word, but also destroys such phonematic characteristics as neutralization according to distinctive features; c) if the morphemic borders do not coincide with the phonetic borders of the word, then syllabic division destroys the morpheme itself. In languages with the non-unified type of syllable, the latter's functioning on the borderline between the speech and language levels constitutes an ontological characteristic of the syllable.

683 ON UNIVERSAL PHONETIC CONSTRAINTS Patricia A. Keating Linguistics Department, UCLA, Los Angeles CA, USA The model of grammar currently assumed within generative phonology, e.g. Chomsky and Halle 1968, divides phonetics into two components. First, the largely language-specific phonetic detail rules convert binary phonological feature specifications into quantitative phonetic values, called the "phonetic transcription". The phonetic transcription is the end product of the grammar. Next, a largely universal phonetic component, which is technically not part of the grammar, translates the segmental transcription into continuous physical parameters. It is therefore a question of some theoretical interest to determine which aspects of speech sound production are universal, i.e. independent of any given language, and non-grammatical. In this paper I will consider three cases of phonetic patterns that have been, or could be, considered automatic consequences of universal phonetic constraints. Recent investigations indicate that none of these phonetic patterns is completely automatic, though they may be quite natural results of other choices made by languages. The first case is intrinsic vowel duration, i.e. that lower vowels are generally longer. Lindblom (1967) proposed that this pattern is an automatic consequence of jaw biomechanics, and that no temporal control need be posited. However, recent EMG data demonstrate that such temporal control is exercised by a speaker. Thus the phonetic pattern, even if universal, must be represented directly at some point in the production of vowels, and cannot be automatic. The second case is an extrinsic vowel duration pattern: vowels are shorter before voiceless consonants. Chen (1970) proposed that some part of this pattern is universal, and that its origin is in universal speech physiology. I will argue that the pattern is not found universally, since at least Czech and Polish are exceptions. In addition, the fact that in other languages (e.g. English, Russian, German) the pattern is made opaque by the later application of other phonological rules suggests that the synchronic source of the variation cannot be automatic physical constraints, and that the pattern is itself represented as a rule in the grammar. The third case is the timing of voicing onset and offset in stop consonants, as affected by the place of articulation of the stop and the phonetic context in which it is found. Articulatory modeling shows that some such patterns can be derived from certain articulatory settings, without explicit temporal control. However, those articulatory settings themselves are not automatic, and languages may differ along such physical dimensions. Thus none of these candidates for universal phonetic patterns can be motivated as automatic consequences of the articulatory apparatus. It could be that most phonetic patterns are like this - even at a low phonetic level, they follow from language-specific, rule-governed behavior. This result would suggest that the most productive approach to the role of phonetics in the grammar will treat phonetic patterns like other, "phonological", patterns with formal representations, with no patterns being attributable only to mechanical constraints.

684 GRADATIONAL PHONOLOGY OF THE LANGUAGE E.F. Kirov Department of Philology of the Kazan University, Kazan, USSR The phonological concept is a development from Baudouin de Courtenay's ideas: the phoneme is understood as the image of the sounds (Lautvorstellung) in the individual and social consciousness. The concept of the phoneme has increasingly narrowed, and this has led phonologists to attempt to find at least two juxtapositions within this concept: phoneme - archiphoneme (N. Trubetzkoy); phoneme - hyperphoneme (Sidorov); phoneme - general type (Smirnitskij); strong phoneme - weak phoneme (Avanesov). Thus a paradigm of phonological units is formed. This paradigm should correspond closely to the paradigm of phonological positions: the phoneme is perceived in the strong position, whereas a more generalized unit, termed here the megaphoneme, is perceived in the weak position. The investigation of reduced sounds in the Russian language prompts the idea that there is still another position - the superweak position - with a corresponding phonological unit, termed here the quasiphoneme. In Russian, quasiphonemes are found only among the vowels. The quasiphoneme has the status of a phonological unit since it has a phonological function: it makes up words and syllables and provides the strong position for the preceding consonant phoneme. The phoneme has the largest range of distinctive features (DF), the megaphoneme is characterized by a narrowed range of DF, whereas no DF are to be found within the quasiphoneme. Unit alternation corresponds to position alternation: phonemes, megaphonemes and quasiphonemes alternate corresponding to the alternation of phonological positions. However, this is not always the case. Among the Russian words with a frequency of 10 or more, which constitute 92.4% of all the words used in standard and everyday speech, we found only 28% of words whose unstressed sounds can be verified by stress. Hence, 72% of Russian high-frequency words contain vowel quasiphonemes and megaphonemes which do not alternate with phonemes, i.e. their phonemic composition proper is impossible to establish.

685 THE THEORY OF PHONOLOGICAL FEATURES AND THE DEFINITION OF THE PHONEME G.S. Klychkow N.K. Krupskaya Moscow Regional Teacher Training Institute, Moscow, USSR A reverse postulate about the phonological features, concerning their inner structure, may lead to a fundamental reformulation of basic phonological concepts. If a feature can be regarded as a complex structure in turn consisting of features (the Tbilisi phonological school of Soviet linguistics), a combination or rather amalgamation of features can generate a new feature. Feature synthesis presupposes several theoretical possibilities. A group of low-level features may be treated as a low-level "cover feature" (e.g. coronal in generative phonology). Synthesis of low-level features may lead to a complex feature of the same phonological level if the input features show the relation of interdependence - e.g. if all phonologically back vowels are rounded, and all rounded vowels are back, the features can be treated as one complex feature back/rounded. Synthesis of two contradictory features produces a feature of a higher level (front and back are peripheral). Synthesis of higher-level features may produce a lower-level feature (non-consonantal, non-vocalic means sonorant; non-consonantal, non-vocalic, interrupted means nasal). The most interesting case appears when a feature determines the direction of the markedness of an opposition at the next node of the classificational tree, generating a pair of lower-level features. The feature turbulent (strident) presupposes the pair interrupted - noninterrupted; the feature nonturbulent determines the opposition continuant - noncontinuant. The affricates are marked among the turbulent by the feature interrupted; the fricatives are marked among the nonturbulent by the feature continuant. Theoretically, synthesis/analysis of features is connected with absorption/emanation of information (functional load). The functional load freed from a segmental phonological feature is used in phonotactics or prosodic structures. Thus group phonemes and prosodemes may be treated as transforms of phonemes and vice versa. There is no gap between phonetic and phonological units. Levels of abstraction in phonology form a continuum.

686 ON THE USES OF COMPLEMENTARY DISTRIBUTION Anatoly Liberman Department of German at the University of Minnesota, Minneapolis, U.S.A. The idea of complementary distribution (CD) played a great role in the development of the phonological schools that recognize the phoneme as an autonomous linguistic unit. Since the allophone is the manifestation of the phoneme in a given context, all the allophones of one phoneme stand in CD. The question important from a methodological point of view is whether it is enough for two sounds to stand in CD in order to qualify as allophones of the same phoneme. This question often arises in historical phonology, in the study of combinatory changes. For example, ë in West Germanic was not allowed before i (West Germanic breaking), and it stood in CD with i. Most scholars are ready to treat ë and i as allophones of one phoneme. Later, Old High German a became e just before i (umlaut), and this new e did not coalesce with ë. Again, there are no serious objections to viewing e and a as belonging to the same phoneme. But as a combined result of West Germanic breaking and Old High German umlaut, ë and e also found themselves in CD: ë never before i, e only before i. Uniting these two sounds as allophones of an e-like phoneme seems a very dubious operation, even if we disregard the fact that i and a will also turn out to be allophones of this artificial unit. Synchronic phonology, too, works with rules beginning with the statement: "Two sounds belong to the same phoneme if ..." CD is a usual condition following the IF. Trubetzkoy realized that CD was not a sufficient condition for assigning two sounds to one phoneme (his example was h and ŋ) and suggested that such sounds should also possess a unique set of distinctive features. This is a correct but impracticable rule, because at the earliest stage of decoding, when the phonemes have not yet been obtained, their features remain unknown. L.R. Zinder showed that for two sounds to belong to the same phoneme they must stand in CD AND alternate within the same morpheme. Indeed, in a language in which vowels are nasalized before m and n, all the nasalized vowels are in CD with all the non-nasalized ones. Yet we unite ã with a, not with o or e. In Russian, all vowels are fronted before palatalized consonants, but, as in the previous case, we assign them to phonemes pairwise, though all the fronted vowels stand in CD with all the more retracted ones. Trubetzkoy's example is unique only because it deals with two phonemes rather than two allophones. Zinder's rule is an improvement on Trubetzkoy's theory, but it is equally inapplicable, for it also presupposes that the phonologist can work with phones before the phonemes have been isolated and fully described. Only in historical phonology, when we have the written image of the word segmented into letters and each unit implicitly analyzed by the scribe, so that we draw conclusions about sounds from letters, can we resort to this rule. For instance, ë and e never alternate within the same morpheme and consequently are not allophones of one phoneme. But the phonological analysis of living speech starts with morphological segmentation, which yields phonemes as bundles of distinctive features. Allophones, together with the phonetic correlates of distinctive features, are the last units to be obtained. Once the entire analysis has been carried out, the concept of CD can add something to our knowledge of the language, but as a tool of assembling the phoneme it is worthless, just as the idea itself of assembling phonemes from their widely scattered allophones is worthless. The problem of CD is not isolated. Together with many others, such as neutralization, biphonemicity, and the role of morphology in phonological analysis, it can be solved only by a method that starts with a flow of speech and ends up with a set of discrete phonemes.
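
The naive distributional test at issue can be made concrete in a few lines; the toy transcribed corpus is invented, and, as the abstract argues, passing the test is not sufficient for allophony:

    # Two sounds are in complementary distribution (CD) if the sets of
    # contexts (preceding, following) in which they occur are disjoint.
    def contexts(corpus, sound):
        ctx = set()
        for word in corpus:
            w = '#' + word + '#'          # '#' marks the word boundary
            for i, ch in enumerate(w):
                if ch == sound:
                    ctx.add((w[i - 1], w[i + 1]))
        return ctx

    def in_cd(corpus, a, b):
        return not (contexts(corpus, a) & contexts(corpus, b))

    corpus = ['hat', 'hɛn', 'sɪŋ', 'lɔŋ']
    print(in_cd(corpus, 'h', 'ŋ'))        # True, yet h and ŋ are two phonemes

The output illustrates Trubetzkoy's point quoted above: h and ŋ pass the CD test in such data, although no one would unite them in one phoneme.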

687 IS THERE A VALID DISTINCTION BETWEEN VOICELESS LATERAL APPROXIMANTS AND FRICATIVES? Ian Maddieson and Karen Emmorey Phonetics Laboratory, Department of Linguistics, UCLA, Los Angeles, California, USA. This paper examines the validity of a distinction between types of voiceless laterals. They have been divided into two major manner of articulation classes. Some languages (such as Zulu, Welsh, Navaho and Bura) have been described as having voiceless lateral fricatives, and others (such as Burmese, Iai, Sherpa and Irish) have been described as having voiceless lateral approximants. Sounds in this latter class are sometimes referred to as voiceless aspirated laterals. No language has been reported as having a contrast between these two segment types. This could be the case because there is no basis for making a distinction between them. The majority of phonetic typologies do not mention the difference. Compare voiceless nasals: no language is known to contrast two different types of voiceless nasals, and phoneticians have never suggested any basis for making such a linguistic distinction. In the same way, because of their phonetic similarity and lack of contrast, there is some doubt about the reality of the phonetic distinction between the two classes of voiceless laterals. The different descriptions applied to voiceless lateral segments in various languages could be simply a result of different terminological traditions. If this is true, then the claim that languages do not contrast voiceless lateral fricatives and approximants would be vacuous. However, we find that there are both phonological and phonetic grounds for affirming that voiceless lateral fricatives and voiceless lateral approximants are distinct types of sounds. The phonological case is made on the basis of the phonotactics and allophony of laterals in languages with voiceless lateral phonemes. In general, the following distinctions between the two voiceless types hold. The approximants occur only in prevocalic position; fricatives may occur in final position and in clusters. Fricatives may have affricate allophones; approximants do not. Fricatives may occur in languages with no voiced lateral phonemes; approximants may not. The phonetic case is made on the basis of measurements on voiceless lateral sounds from several languages from each of the two groups mentioned above. The following properties were measured: (i) the relative amplitude of the noisy portion of the acoustic signal, (ii) the spectral characteristics of the noise, and (iii) the duration of the portion of the lateral which is assimilated in voicing to a following voiced vowel. It was found that voiceless fricative laterals are noisier and have less voicing assimilation than voiceless approximant laterals. Several of the phonological and phonetic attributes which distinguish these two types of laterals can be related to what is presumed to be the main articulatory difference between them, namely, a more constricted passage between the articulators for the fricative. Hence indirect evidence of this articulatory difference is obtained. Because of the differences revealed by this study, the claim that languages do not contrast voiceless lateral fricatives and approximants has been shown to have meaningful content. The more important lesson for linguistics is that a small, often overlooked, difference in phonetic substance is associated with major differences in phonological patterning.

688 ON THE RELATIONSHIP BETWEEN PHONETIC AND FUNCTIONAL CHARACTERISTICS OF DISTINCTIVE FEATURES I.G. Melikishvili The Institute of Linguistics of the Academy of Sciences of the Georgian SSR, Tbilisi, USSR 1. The comparison of the functional characteristics of distinctive features (their frequency characteristics and their different capacities to form simultaneous and linear units) with the results of experiments on perception reveals a relationship between the phonetic and the functional data: the greater the sonority and distinguishability of a sound, the more intensively it is utilized in speech. 2. The functional characteristics of distinctive features reflect the complexity of their internal structure. These characteristics can be correlated with the different phonetic components of distinctive features. The study of the features of laryngeal articulation - voice, aspiration and glottalization - reveals functional equivalents for the various phonetic components: for voice, aspiration and glottalization proper, and for different degrees of tenseness and duration.

689 A PROCEDURE IN FIELD PHONETICS E. Niit Institute of Language and Literature, Academy of Sciences of the Estonian SSR, Tallinn, USSR In field linguistics one often faces a problem like the following: there is a great bulk of screening material awaiting computer analysis, but the process is rendered unduly complicated by recording disturbances on the one hand and unavoidable errors in segmentation and parameter discrimination on the other. Here the possibility of basing fieldwork on a set of test sentences may sometimes come in handy. I tape-recorded 12 sentences from 173 informants living along the whole coast of Estonia and on the islands. Clipped speech was fed into the computer (storing the zero-crossings of the signals) and its F0 contours were found by means of a special algorithm. Since my problem consisted in an analysis of the distances between turning-points in the contours, it was the turning-points that had to be determined first. As the points could not be discerned in the "naive" way, i.e. by searching the contours for their significant rises and falls only, I also made use of the fact that the sentences were known beforehand. The computing procedure, combining the "naive" with the "sophisticated", consisted of the following steps: (i) All the rises and falls of sufficient length were extracted from the contour. (ii) The sequence of rises and falls (and "nothings") so obtained was juxtaposed with a sequence of theoretically expected contours of the corresponding stressed and unstressed vowels (the long and overlong marked separately). (iii) As the number of rises and falls in the initial F0 contour considerably exceeded the theoretical expectancy, the following stage of the procedure consisted in smoothing the contours by joining shorter rises (falls) - where their occurrence was densest - with the preceding and following ones, until either an "expected" contour was obtained or the joining would have had to continue over rests more than 50 ms long. As a result the number of falls and rises in the original contour, which exceeded the theoretical expectancy by 3.7 times, dropped to an excess of 1.4 times only. Evidently the procedure can be counted on in fieldwork planning.
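
Step (iii) of the procedure can be sketched as follows; the data structures are assumptions (the original algorithm operated on zero-crossing-coded F0 contours), but the 50 ms joining limit is as described above:

    # Sketch of the smoothing step: join a short rise (fall) with a following
    # movement in the same direction, but never across a rest longer than 50 ms.
    def smooth(moves, min_len=50, max_rest=50):
        """moves: list of (direction, duration_ms, rest_to_next_ms)."""
        moves = list(moves)
        changed = True
        while changed:
            changed = False
            for i, (d, dur, rest) in enumerate(moves[:-1]):
                nxt = moves[i + 1]
                if dur < min_len and d == nxt[0] and rest <= max_rest:
                    moves[i:i + 2] = [(d, dur + rest + nxt[1], nxt[2])]
                    changed = True
                    break
        return moves

    raw = [('rise', 30, 20), ('rise', 40, 10), ('fall', 80, 0)]
    print(smooth(raw))    # -> [('rise', 90, 10), ('fall', 80, 0)]

In the reported application the joining continued until the sequence matched one of the theoretically expected contours, a stopping condition which the sketch leaves out.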

690 ON THE CORRELATION OF PHONETIC AND PHONEMIC DISTINCTIONS A. Steponavičius Department of English Philology at the University of Vilnius, Vilnius, USSR The correlation and interdependence of phonetics and phonology may be defined in terms of the dichotomies between language and speech, paradigmatics and syntagmatics. Phonetics lies in the domain of both speech and language (in that it is the level both of indiscrete speech sounds and of the discrete "sound types" of language), and has both syntagmatic and paradigmatic aspects. Phonology lies in the domain of language, but not speech, and likewise has both paradigmatic and syntagmatic aspects. Phonemes are intrinsically related to sounds (their phonetic realizations) in that distinctive features (DFs) have anthropophonic correlates. Yet phonemes are purely structural-functional entities, which are defined as members of phonematic oppositions and are endowed with constitutive and distinctive functions, whereas sounds are above all material entities, "substance". Furthermore, DFs need not necessarily be directly related to phonetic features (cf. negatively expressed DFs, distinctive features which cover several phonetic properties, or the functional difference of DFs based upon the same phonetic properties). The above generalizations may be illustrated by data from the Baltic, Slavonic and Germanic languages.

691 PHONETIC CORRELATES OF REGISTER IN VA Jan-Olof Svantesson Department of Phonetics, University of Lund, Sweden Va belongs to the Palaungic branch of the Austroasiatic (Mon-Khmer) languages. It is spoken by some 500 000 people in China and Burma. Like many other Austroasiatic languages, Va has lost the original opposition between voiced and voiceless initial stops, and has in its place developed a phenomenon usually termed "register". The registers in Va are referred to as tense and lax; they correspond to original voiceless and voiced initial consonants respectively. The phonetic correlates of register differ from language to language, but the following factors seem to be involved: (1) phonation type (properties of the voice source); (2) pharyngeal effects (widening or constriction of the pharynx); (3) fundamental frequency (pitch); (4) vowel quality. For this investigation minimal tense/lax pairs were recorded for each vowel from two speakers of the Parauk dialect of Va, spoken in western Yunnan province, China. The following results have been obtained. There is a small but systematic difference between the vowel qualities of the two registers, such that the lax register vowels are somewhat centralized in relation to the tense register vowels. There is no systematic difference in fundamental frequency between the two registers. There is a phonation type difference, which is reflected in a somewhat greater difference between the level of the first formant area and the fundamental frequency area of vowel spectra for the tense register vowels compared to the lax register. (This measure has been proposed by Peter Ladefoged and Gunnar Fant.) There also seems to be a greater slope in the spectra above the F1 area for lax than for tense vowels, which indicates that the source spectrum also has a greater slope in the lax register than in the tense.

692 A DISTINCTIVE FEATURE BASED SYSTEM FOR THE EVALUATION OF SEGMENTAL TRANSCRIPTION IN DUTCH W.H. Vieregge, A.C.M. Rietveld and C.I.E. Jansen Institute of Phonetics; E.N.T.-clinic, University of Nijmegen, the Netherlands However extensive the literature on transcription systems may be, it remains astonishing to see that data on inter- and intra-subject reliability are almost completely lacking. One of the major problems in the assessment of reliability is that it requires a system with which differences between transcription symbols can be assigned numbers corresponding to the distances between the transcription symbols, or rather corresponding to the distances between the segments that the transcription symbols stand for. Preferably, these distances should be defined articulatorily rather than auditorily, since training in the use of transcription symbols is largely articulatorily based as well. For the construction of a system in which the distances between the Dutch vowels are numerically expressed, enough experimental data may be found in the literature (e.g. Nooteboom, 1971/72; Rietveld, 1979). We find the available data with respect to the Dutch consonants less satisfactory. Spa (1970) describes the Dutch consonants by means of 16 distinctive features. One of our main objections against Spa's system is that the front-back dimension - a dimension which is crucial for the classification and the adequate use of transcription symbols - is only implicitly represented by the features cor, ant, high, low, and back. Moreover, the validity of Spa's system was not experimentally tested. We therefore decided to develop a new consonant system for Dutch with a heavier emphasis on articulation. The validity of this system was assessed by means of an experiment in which subjects were asked to make dissimilarity judgments on consonant pairs. Twenty-five first-year speech therapy students were offered 18 Dutch consonants pairwise in medial word position; they were asked to rate each pair for articulatory dissimilarity on a 10-point scale. The stimulus material consisted of 153 word pairs. In the instructions it was emphasized that during the rating process the whole articulatory apparatus should be taken into consideration. Multidimensional scaling was carried out on the dissimilarity judgments of the subjects, yielding five dimensions that can be interpreted in terms of phonetic categories. The configuration which resulted from the multidimensional scaling led us to revise our consonant feature system. Distances calculated between the consonants using the revised system correlated rather highly with the dissimilarities obtained in our experiment (r = .75). An additional evaluation of our system was performed by correlating Spa's DF-system, our DF-system and the similarities obtained in Van den Broecke's (1976) experiment and our experiment. The correlations showed that Spa's DF-system is a better predictor of the auditorily based similarity judgments gathered by Van den Broecke. Our own DF-system, however, is more successful in predicting the dissimilarity judgments which are based on articulation.
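
The kind of numerical distance such a system must deliver can be illustrated as follows; the feature coding below is hypothetical and much coarser than the 16-feature systems discussed:

    # City-block distance over a (hypothetical) articulatory feature coding.
    FEATURES = {   # (place 0-6 front-to-back, manner 0-3, voice 0/1)
        'p': (0, 0, 0), 'b': (0, 0, 1), 't': (2, 0, 0), 'd': (2, 0, 1),
        's': (2, 1, 0), 'z': (2, 1, 1), 'k': (5, 0, 0), 'x': (5, 1, 0),
    }

    def distance(a, b):
        return sum(abs(x - y) for x, y in zip(FEATURES[a], FEATURES[b]))

    print(distance('p', 'b'))   # 1: differs in voicing only
    print(distance('p', 'x'))   # 6: differs in place and manner

Distances of this sort, computed between a target transcription and a transcriber's response, are what allow transcription disagreements to be scored numerically rather than counted as all-or-none errors.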


693 ASSIMILATION VS. DISSIMILATION: VOWEL QUALITY AND VERBAL REDUPLICATION IN OBOLO
K. Williamson and N. Faraclas
Department of Linguistics and African Languages, University of Port Harcourt, Port Harcourt, Nigeria
In Obolo, a Lower Cross (Niger-Congo) language of S.E. Nigeria, verbal reduplication involves the lowering of the quality of stem vowels. Changes in vowel quality due to verbal reduplication in other West African languages, such as Igbo (Williamson 1972), however, are usually in an upward rather than a downward direction. The exceptional behavior of Obolo in this respect may be considered to be the result either of dissimilatory or of assimilatory processes. The dissimilatory explanation is based on Ohala's (1980) perceptual model and the hierarchies of vowel assimilation in Obolo established by Faraclas (1982). The assimilatory explanation relies primarily upon recent work indicating that Proto-Benue-Kwa had a 10-vowel system (Williamson 1982) and some apparent areal characteristics of 'Bantu Borderland' languages. Both explanations will be discussed in light of the questions that each raises regarding the production, perception, and historical development of vowels in West African languages as well as in languages universally.

Section 17 Phonology

697 ON DIFFERENT METHODS OF CONSTRUCTING A PHONOLOGICAL MODEL OF LANGUAGE
L.V. Bondarko
University of Leningrad, Leningrad, USSR
Two different approaches to the formation of a phonological system are considered: the classical linguistic approach, based only on the analysis of existing oppositions differentiating meaning, and an approach in which the speech habits of the speakers of a language are taken into consideration. The system of phonemes, their distinctive features, and their relation to other layers in the organization of a language are seen differently from these two perspectives. For example, the classical model of Russian vocalism becomes "deformed" if we do not take into account speech habits, and above all speech perception. Those problems of phonological interpretation are considered that native speakers of Russian perceive in one and the same way, e.g. the phonological status of the feature front vs. back and the relation between [i] and [ɨ]. For the speakers of a language, the position of a vowel in the phonemic system is determined not only (and not exclusively) by possible oppositions, but also by the extent to which it alternates with other phonemes, by its frequency of occurrence, and by the realization of its potential to be an exponent of linguistic meaning. The speech habits of the speakers of a language can be considered as evidence of linguistic competence, which can in turn be seen as a result of the influence of the language system on that competence.

698 NOTIONS OF STOPPING CLOSURE AND SOUND-PRODUCING STRICTURE FOR CLASSIFYING CONSONANTS AND ANALYSING AFFRICATES
L.N. Cherkasov
Department of English Philology at the Teachers' Training College, Yaroslavl, USSR
The widely used term "complete closure" is not entirely devoid of ambiguity. It therefore seems proper to distinguish between a directing closure, which directs the flow of air along one of the two existing ways (oral or nasal), and a stopping closure which, though "mute", affects the following sound. Only the latter is essential for descriptive phonetics. It helps to divide consonants into stops, articulated with a stopping closure, and continuants, which have only directing closures. The stopping closure is not regarded as a kind of stricture, the latter being defined here as a sound-producing obstruction. Any stopping closure is pre-strictural. The off-glide in stops and the retention stage in continuants are strictural. The stricture can thus be dynamic (in stops) or static (in continuants). In English the dynamic stricture is either abrupt or gliding, the static stricture being narrow or broad. Accordingly, the English consonants can be divided into abrupt and gliding stops, and narrow and broad continuants. Acoustically, they are plosives, affricates, fricatives, and sonorants. The stopping closure and the stricture not only serve as a basis of classification, but also help us to better understand the monophonemic nature of affricates, for they show that there is no essential difference in the basic structure of a plosive and an affricate: both of them consist of two elements, one of which is a stopping closure and the other a stricture, the difference between the two types of consonants being strictural, not structural. This does not mean that any combination of stopping closure and gliding stricture is an affricate, as is evidently believed by those phoneticians who consider /ts, dz, tθ, dð, tr, dr/ to be affricates. According to our observation, an abrupt stop becomes gliding before a homorganic continuant. To explain this phenomenon we postulate an assimilation affecting the muscular tension in the articulator. With plosives, all the muscles of the fore part of the tongue are strong and release the closure abruptly and instantaneously. With affricates this part of the tongue is weak. When an abrupt stop is followed by a homorganic continuant, a redistribution of muscular tension takes place so that the tip of the tongue is weakened and a gliding stricture substitutes for the abrupt one. What occurs in those clusters, and even in /t+j, d+j/, is affrication of /t, d/ and reduction of the continuants. These two factors make /ts, dz, etc./ sound like affricates. Since the latter are regarded as monophonemic entities, we suggest for the aforenamed clusters the term "gliding consonants", for it adequately shows the articulatory peculiarities of those sounds and bears no phonological connotations.

699 PHONETIC PROCESSES IN SPOKEN MOLDAVIAN
G.M. Gozhin
Academy of Sciences of Moldavia, Kishinev
This paper systematizes for the first time the sound structure of the spoken style of Moldavian in comparison with the codified literary language. 482 subjects were examined. The material is analysed according to the method of E.A. Zemskaya (Spoken Russian, Moscow, 1973). No quantitative changes take place in the structure of spoken Moldavian: its vowel system comprises 7 vowel phonemes, like that of the codified literary language, characterized by the same distinctive features. Nevertheless, a whole series of specific phonetic processes is found there. Full reduction affects above all the close vowels. Partial reduction affects the post-stress close vowels i and u. Closing affects the stressed and unstressed open and half-open vowels. Opening affects the close vowels. Backing and delabialization affect further subsets of the vowels. Complete hiatus resolution removes one of the constituents of the hiatus; partial hiatus resolution affects only the vowel quality. Pre-stress vowel groups are diphthongized into rising diphthongs (14), while post-stress groups are diphthongized into falling diphthongs (14). The consonant system of spoken Moldavian (22 phonemes) is characterized by a firm mode of articulation; certain processes of affrication and depalatalization, among others, are observed, and consonantal syllables appear. The phonetic processes at work remove the system of spoken Moldavian considerably from the codified literary language.

700 STABLE ELEMENTS IN THE PHONETIC EVOLUTION OF LANGUAGE
Ts.A. Gventsadze
The Tbilisi State University, Tbilisi, USSR
There are two permanently interacting forces in language: one directed at language change, the other acting towards the preservation and stability of its elements. The historical study of languages is mainly aimed at the analysis of the first force, which finds its explanation in terms of function-, substance- or structure-orientated approaches. However, it does not seem contradictory to assert that the same approaches can be taken into consideration in the analysis of the stable elements and features of the phonetic system. Elements of the phonetic system may have various degrees of stability, which can be defined according to a number of criteria. The first criterion may be called phonological. It was clearly formulated by Martinet, who asserted that elements of the phonetic system are preserved in a language if they are functionally loaded, i.e. if there is a necessity for a phonological opposition. The second criterion is phonetic. It is based on the phenomenon of diaphonic variation: sound elements are preserved when the degree of diaphonic variation is minimal. The third criterion may be called phonotactic and has been least investigated. It is based on the phenomenon of allophonic variation, the characteristic arrangement of phonological items in sequence and their possible co-occurrence. An example is provided by the word-initial, biphonemic complexes [br], [fr], [tr], [gr] in French, Spanish and Italian: their phonetic substance has changed while the phonotactic pattern remains intact. All this testifies to the fact that in analysing the degree of stability of a phonological element in the system, one should take all three criteria into consideration.

701 A STRUCTURAL-FUNCTIONAL ANALYSIS OF ONOMATOPOEIAS IN JAPANESE AND ENGLISH
Kakehi, Hisao
Kobe University, Japan
(1) Originally, onomatopoeias are meant to imitate the sounds of the outer world. Since they are linguistically devised, they are influenced by the phonological systems of the respective languages. In the present section, our chief concern is limited to the reduplicated forms of onomatopoeias. In Japanese, the same set of vowels is repeated in the first and second parts of the onomatopoeic expression (e.g. gon-gon, pata-pata, etc.), while in English a different set of vowels is usually employed (e.g. ding-dong, pitter-patter, etc.). This is mainly because Japanese has pitch accent while English has stress accent. The above-mentioned phenomenon is found not only in onomatopoeic words but also in the phonemic structures of the words of these languages. In English, no two vowels with equal stress and quality can occur in a word, but this does not apply to Japanese (e.g. aka 'red', kimi 'you', kokoro 'mind', etc.). Since the sound of the outer world is a physical continuum, Japanese seems to be more suitable for expressing sounds as they are, in that it permits the repetition of the same set of vowels. English, on the other hand, describes the natural sound at a more indirect (i.e. more lexicalized*) level. This is proved by the fact that in English a part of the reduplicated form of an onomatopoeia can operate as a free form (e.g. ding, dong, patter, etc.); in Japanese, however, this is not the case. For example, gon of gon-gon, or pata of pata-pata, remains at the onomatopoeic level. (2) Japanese onomatopoeic expressions are usually realised as adverbials, while those in English, except nonce formations, are most frequently realised as nouns and verbs. From the syntactic point of view, adverbials, functioning as optional modifiers of the verb, enjoy much wider positional freedom in a clause than nouns and verbs, which function as obligatory components of a clause such as subjects, objects and predicate verbs (e.g. Kodomotachi ga wai-wai, kya-kya itte-iru. 'Children are screaming and shrieking.'). Chiefly for this reason, the Japanese language can more freely create onomatopoeias such as run-run (an expression indicating that things are going on quite smoothly), and shinzori, which describes snow piled up high and still on the leaves of an evergreen, about to fall off them in the glittering morning sunshine. What is stated in the two sections above may explain some of the reasons why the Japanese language abounds in onomatopoeic expressions.
* With regard to "degrees of lexicalization", see Kakehi, H. (forthcoming) "Onomatopoeic expressions in Japanese and English", in Proceedings of the 13th International Congress of Linguists.

702 THE FATE OF NON-CORRELATING PHONEMES IN REALIZATION
R. Kaspranskij
College of Foreign Languages, Gorki, USSR
The realization of phonemes is subject to system-determined and to normative regulation. System-determined regulation depends on the oppositional and correlative connections of the phoneme: the more extensive these connections are, the greater the systemic pressure the phoneme is subjected to, and vice versa. The liquids stand apart in the consonant system: they enter into no correlations of series or order (only in a few languages do they correlate with other consonants according to the modal distinctive feature "sharp vs. non-sharp"). The fact that the liquids do not correlate with other consonants in the consonantal opposition system, and that within it they are characterized solely by negative distinctive features ("non-plosive", "non-fricative", "non-nasal"), means that their domain is not "threatened" by other consonants and is therefore not strictly fixed and localized. This "freedom" of their position in the consonant system explains why, in realization, the liquids often overstep their boundaries and may deviate from their realization norms through semi-consonantal to semi-vocalic and vocalic realizations, sometimes up to complete loss of the consonant, i.e. zero realization. Examples from Germanic, Slavic, Romance and other languages support this, such as [l'] > [j] in French, [l] > [u] in Dutch, [l] > [w] in Polish and Ukrainian, and [l] > [i] in Viennese German. Some researchers see the conditions and causes of these "liberties" of realization in the fact that these phonemes carry little or no functional load (M. Lehnert, G. Meinhold, H. Weinrich, among others). Against this, however, speak facts from the Slavic languages, where, for example, the liquid [l] is functionally very important (especially in the verb paradigm). The realization of the liquids is accordingly regulated not by systemic pressure but mainly by social-traditional norms of realization.

703 ONOMATOPOEIA IN TABASCO CHONTAL
K.C. Keller
Summer Institute of Linguistics, Mexico
Most onomatopoetic words in Chontal, a Mayan language spoken in southern Mexico, not only fit the systematic phonetic patterns of the language, but also have regular grammatical formations and functions. There is no general term for sound or noise, but many specific terms. Many of the specific terms are formed by an onomatopoetic root followed by the suffix -lawe "sound of" to form a noun, or the root plus the suffix -law to form words with stative verb function or adverbial function. Examples: top'lawe "sound of bursting, as of cocoa beans when toasting", tsahlawe "sound of frying", ramlawe "whirring or whizzing sound", hak'lawe "hiccough", hamlawe "roar or zoom of something going rapidly by, as a bus", wets'lawe "sound of water being thrown out", wek'lawe "sound of water moving in a vessel". The adverbial function is shown by ramlaw u nume "it goes whizzing by". The verbal function is shown in wek'law ha? tama t'ub "the water sounds from being shaken in the gourd". Whereas the -lawe/-law forms represent non-specified or unmarked action sounds, iterative or repetitive action sounds are represented by complete reduplication of the basic CVC onomatopoetic root plus the suffix -ne for nouns or the suffix -na for adverb or stative verb function. Examples: wohwohne, wohwohna "barking of dogs", tumtumne, tumtumna "beating of drums", ts'i?ts'i?ne, ts'i?ts'i?na "sound made by rats or bats", loklokne "gobble of turkey", t'oht'ohne "sound of pecking or of knocking on wood", ?eh?ehne "grunt of pig", ramramna "hum or buzzing of bees or mosquitoes". Complete reduplication for iterative action is a productive pattern, and occurs also in words which are not onomatopoetic, as in liklikne, liklikna "shivering", p'ikp'ikna "blinking the eyes", welwelna "flapping in the wind". Continuous action sound words are formed by reduplication of the stem vowel plus -k followed by the suffix -ne for nouns or the suffix -na for adverb or stative verb function words. Examples: sahakne "sound of wind in the trees", hanakne "purr of cat or roar of tiger", kats'akne yeh "gnashing of teeth". Onomatopoetic roots sometimes occur as the stem for transitive or intransitive verbs, as in wohan "bark at" (compare with wohwohna "sound of dog barking"), tsahe? "fry it" (compare with tsahtsahne "sound of frying"), t'ohe? "peck at it" (compare with t'oht'ohne "sound of pecking or knocking on wood, as a woodpecker"), u top'e "it bursts" (compare with top'lawe "sound of bursting"), wets'an "throw out water" (compare with wets'lawe "sound of throwing out water").
Onomatopoetic roots may also enter into compounds, as in ramhule? "throw it with whizzing sound" (hule? "throw"), rahhats'e? "spank or hit with spanking noise" (hats'e? "hit").

704 COMPARISON OF THE SEGMENT-RHYTHMIC ORGANIZATION OF THE RUSSIAN SPEECH OF GERMAN SPEAKERS AND TATARS
R.E. Kulsharipova
Kazan University
The investigation of language contact must be regarded as important in view of language policy, teaching practice and the functioning of language in society. The main shortcoming in the analysis of language contact specifics is the neglect of segment-rhythmic speech typology. Our research is built on a comparison of the speech sound content of German students and of Tatars studying at the Russian philology department of Kazan University. The Russian speech of Germans and Tatars is characterized by two variants of syntagmatic division: each syntagma is correlated with a breath group and vice versa, and each rhythmic structure represents an independent syntagma. Incorrect intonation of syntagma melodic schemes was noticed here, viz. neutralization of complete vs. incomplete communicative types; most cases of rhythmic interference occur in polylogue speech, fewer in dialogue. Acoustic analysis showed the activation of certain vowel representatives, which testifies to zones of vowel allophone dispersion other than those of the Russian literary language. The phonostylistic information of each sound type makes it possible to distinguish the following positions in unprepared speech, meaningful for the neutralization of vowels by non-Russians: the stressed syllable in logical and non-logical centres of the syntagma; the first pretonic and first posttonic open syllables in logical and non-logical centres of the syntagma; and the second pretonic and the posttonic closed syllables. The distinguishable peculiarities of non-Russian segment-rhythmic speech organization testify to a mixed type of prosody under conditions of contact between speakers of different language systems. We consider this important for the creation of general typological complex models of speech prosody.

705 TOWARDS A TYPOLOGY OF PHONOLOGICAL SYSTEMS OF LANGUAGES OF CENTRAL AND SOUTH-EAST EUROPE
M.I. Lekomtseva
Institute of Slavic and Balkan Studies, Academy of Sciences, Moscow, USSR
The theoretical aspect of our report concerns the interrelation between a phonological opposition and phenomena within a certain area. The facts under consideration are the palatal occlusives widespread in Latvian, Czech, Slovak, Hungarian, Albanian, Macedonian, Serbo-Croatian and Slovenian, as well as in dialects of Polish, Romanian and Bulgarian. These languages are surrounded by both types of languages, i.e. those with the palatalization correlation and those lacking both palatalized and palatal consonants. To the East, Lithuanian, Russian, Byelorussian, Ukrainian and Bulgarian possess the palatalization correlation; here and there small islands are included where k' and t' are interchangeable. To the West and to the North we find the Germanic languages, without either type of correlation. In the old Slavic areas of Polabian and Sorbian, as well as further to the West in the Celtic group of languages, one again encounters the palatalization correlation; it is remarkable that one finds palatalization in those languages where one finds the interchange of k' and t'. From the extralinguistic point of view it is important to emphasize that the entire zone of languages with palatal occlusives coincides with the genetic pool of gene B at a statistical level of 10-15%. Apart from the genealogical and typological types of linguistic interconnection, the term "linguistic pool" must be introduced in order to indicate the similarity between languages caused by interchange of genetic information. The linguistic pool of palatal consonants is the borderline zone between the satem and centum groups of Indo-European languages. It may be suggested that Proto-Indo-European possessed palatal occlusives, at least for the satem languages. In the historical shifts of the palatal consonants, a vacant cell (in the sense of A. Martinet) was repeatedly filled from different sources. The satem languages, where the retroflex order had developed on the basis of a different substratum, were prevented from developing the palatals. The languages of Central and South-East Europe, developing according to an archaic phonological pattern correlated with the genetic model of articulation processes, created time and again new layers of palatal consonants.

706 A CERTAIN REGULARITY OF AFFRICATIZATION IN KARTVELIAN LANGUAGES
Amiran E. Lomtadze
Tbilisi, USSR
I. According to an accepted point of view, one of the ways of affricatization in Kartvelian languages is the organic fusion of obstruent and spirant consonants: t + s > c, t + š > č, d + z > ʒ ... Literary Georgian: at + sammeti > cameti, at + švidmeti > cvidmeti, bed + šavi > bet-šavi > bečavi ... Khevsurian: erturt-s > ertur-c ... Gurian-Adjarian: gverd-ze > gver-ʒe, gverd-ši > gvert-ši > gverči ... Svanian: padasa-s > padsa-s > patsa-s > paca-s ...
II. According to some Kartvelologists, spirants immediately following the resonants /m, n, l, r/ are also affricatized (I. Megrelidze, Sh. Gaprindashvili, Kobalava, G. Klimov ...): kal-s > kal-c, vard-s > var-s > var-c, xelsaxoci > xelcaxoci, arsva > arcva ... Sh. Gaprindashvili is inclined to explain this phenomenon by holding that the affricatization of spirants in this position is caused by a special kind of off-glide in the resonants.
III. Contrary to this last consideration, our observations suggest that the spirants turn into affricates when the spirants s or z, š or ž come directly after an obstruent consonant. A noise-obstruent consonant (voiced, aspirate, glottalized) provokes affricatization in the same manner as the resonants do. It is immaterial whether they are pure obstruents, affricates or the so-called mixed ones; the important thing is that their articulation is characterized by occlusion. The partial fusion of the occlusion component with the spirant produces an affricate. The phenomenon is of a labile nature and characterizes all three Kartvelian languages: Georgian, Svanian and Colchian (Megrelian-Chanian).
a) Colchian (Megrelian dialect): Nominative Case dxoul-ep-i, Dative Case dxoul-ep-s > dxoul-ep-c; dud-i - dud-s > dud-c; toronǯ-i - toronǯ-s > toronǯ-c > toron-c; koč-i - koč-s > koč-c > ko-c ... čxom-i - čxom-s > čxom-c ... cil-i - cil-s > cil-c ... čkun-s > čkun-c, lur-s > lur-c ... kurs-i > kurc-i, erskem-i > erckem-i, rskin-i > rckin-i. Chanian dialect: memsxveri > memcxveri ... mzora > mʒora, mzvabu > mʒvabu ...
b) In literary Georgian, affricatization of the kind of spirants mentioned above occurs rarely, but in the dialects, especially in Gurian, it is comparatively frequent. Gurian: mčad-s > mčad-c, datv-s > dat-s > dat-c, kunʒ-s > kunʒ-c > kun-c, kvic-s > kvic-c (> kvi-c), lurʒ-s > lurʒ-c, bič-s > bič-c (> bic) ... ucqvet-s > ucqvet-c > ucqve-c ... elisabedi > elsabedi > elcabedi ... Gurian, Imeretian, Kartlian: šanšiani > šančiani ... Adjarian: bžavil-s > bžavil-c, ertsaxe > ersaxe > ercaxe ... Literary Georgian: sabzeli > sabʒeli, anderzi (Persian andarz) > anderʒi ... mzaxali > mʒaxali. Gurian: elanze > elanʒe ... Gurian, Adjarian: xarsva > xarcva ... Meskhian: cver-ši > cver-či ... Imeretian: midžem-ši > midžem-či ...
c) The above regularity of the spirants' affricatization is also observed in the Svanian language: xat-si > xatyci, xečsix > xeččix ...

707 THE PHONETIC STRUCTURE OF THE WORD IN LITERARY UZBEK
A. Makhmudoff
Tashkent, Uzbekistan, USSR
The author examines the phonetic structure of the word in literary Uzbek as four mutually connected substructures: a) the substructure of phonemes; b) the substructure of phoneme combinations; c) the syllabic substructure; d) the accentual-rhythmical substructure. The study of the phonetic system of Uzbek leads to the conclusion that vowel harmony is the basic sign of the phonetic word in the Turkic languages, whereas in literary Uzbek the sound-compounding role belongs to the accent; in the Turkic languages the phonetic kernel of the word is located in its initial part, while in Uzbek the phonetic structure of the root and affixes depends on the accent. Thus, in literary Uzbek the phonetic structure of the word has distinguishing features of its own, unlike the other Turkic languages. In Uzbek, combinations of consonants may be situated not only in the middle of the word but at its beginning too; this is an innovation in modern Uzbek. The Uzbek syllables and the principles of syllabation are also examined in this work.

708 REGULARITIES IN THE FORMATION OF CONSONANT CLUSTERS
M. Rannut
Institute of Language and Literature, Academy of Sciences of the Estonian SSR, Tallinn, USSR
This paper provides the results of statistical processing of Estonian consonant clusters, presenting a complete list of the clusters with their frequencies of occurrence as well as their structural regularities. Estonian consonant clusters can be divided into the following three groups: 1) genuine clusters; 2) clusters of later origin (developed through vowel contraction); 3) clusters occurring in foreign words only. Some of the phonetic constraints characteristic of genuine consonant clusters are abandoned in clusters of later origin. Structural rules determining the consonant clusters in modern usage can be characterized by means of prediction formulas, which are presented together with their corresponding probabilities. To handle certain phenomena that change the cluster structure (syllable boundary, assimilation), we offer additional formulas to be applied in those cases where the syllable boundary passes through the consonant cluster or where one of the components of the cluster is not pronounced.
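The kind of frequency count underlying such a list can be illustrated with a small sketch; this is not the author's procedure, and the word list and consonant inventory below are simplified placeholders.

```python
import re
from collections import Counter

CONSONANTS = "bdfghjklmnprstv"           # simplified stand-in for the Estonian inventory
words = ["linn", "kaks", "mets", "kolm", "sild", "tartu"]   # placeholder word list

clusters = Counter()
for w in words:
    # a cluster here is any maximal run of two or more consonant letters
    for m in re.finditer(f"[{CONSONANTS}]{{2,}}", w):
        clusters[m.group()] += 1

total = sum(clusters.values())
for cluster, count in clusters.most_common():
    print(f"{cluster}: count = {count}, relative frequency = {count / total:.2f}")
```

From such counts, conditional probabilities of one cluster member given another can be derived, which is the shape the paper's prediction formulas take.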

709 SCHWA AND SYLLABLES IN PARISIAN FRENCH
Annie Rialland
C.N.R.S., E.R.A. 433, Paris, France
The problem of schwa is the most written about in French phonology, whereas the problem of phonetic syllabation is not well known. Using phonetic data as a starting point, we shall show that two syllabic statuses must be posited for schwas. Some of them are phonological nuclei, while the others are epenthetic vowels following a closed syllable. We shall designate the first as nucleus schwas and the second as non-nucleus schwas. The nucleus schwa occurs inside lexemes; the non-nucleus schwa occurs elsewhere. Several facts show that the nucleus schwa is a phonological nucleus: 1 - the preceding syllable has the open-syllable allophones [ø] and [e]; 2 - the nucleus schwa has the property of being represented in certain contexts (mainly at the beginning of words) by a nucleus when the [ə] is elided. For example, the consonant t of t(e)renverser does not become the onset of a syllable [tra], the realization of t and r being different from the realization of the cluster tr; nor does it become the coda of a preceding syllable, the coarticulation of t and r in t(e)renverser being stronger than that of t and r when they belong to two different syllables. For t(e)renverser, we shall posit the following syllabation at the phonetic level:

[Diagram: the syllabic structure of t(e)renverser at the phonetic level]

The nucleus being free because of the elision of the [ə], it is filled by the fricative (or vibrant) r. The process can be schematized as follows:

was sufficient to cue accentedness. Correct recognition was 63% and 71% (for the initial stop in /ti/ and /tu/), and 66% and 69% (for the vowels of /ti/ and /tu/). Finally, Exp. 5 showed that listeners were able to detect accentedness when just the first 30 ms of /tu/ was presented, even though these "/t/-bursts" were not identifiable as speech sounds. These results are discussed in terms of the role of phonetic category prototypes in speech perception.
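The gating manipulation mentioned in the fragment above (presenting only the first 30 ms of /tu/) is simple to sketch; the following is a hypothetical reconstruction using the praat-parselmouth library, with a placeholder file name.

```python
import parselmouth

tu = parselmouth.Sound("tu.wav")                         # placeholder recording of /tu/
burst = tu.extract_part(from_time=0.0, to_time=0.030)    # keep only the first 30 ms
burst.save("tu_burst_30ms.wav", "WAV")
```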

736 PATTERNS OF ENGLISH WORD STRESS BY NATIVE AND NON-NATIVE SPEAKERS
Joann Fokes, Z.S. Bond, Marcy Steinberg
School of Hearing and Speech Sciences, Ohio University, Athens, Ohio, U.S.A.
The purpose of this study was to examine the acoustical characteristics of English stressed and unstressed syllables in the speech of non-native speakers, in comparison to the same syllables produced by native speakers. The non-native speakers were college students of five different language backgrounds, learning English as a second language and enrolled in an English pronunciation class. The native speakers were college students from the midwestern United States. The students were tape-recorded while reading five types of words: 1) prefixed words with second-syllable stress, e.g. "confess"; 2) the same words with an "-ion" suffix, e.g. "confession"; 3) and 4) words which change stress patterns upon suffixation, e.g. "combine"/"combination"; and 5) words of similar phonetic form but stressed on the initial syllable, e.g. "conquer". Selected suffixed words given in citation form were also compared with a sentence form, spoken in answer to a question: "Was the confession accepted?"/"The confession was accepted." Measurements were made of fundamental frequency, relative amplitude, and duration for the stressed and unstressed syllables. Mean differences of these three acoustical correlates of stress were compared between the native and non-native speakers for all classes of test syllables. The correlates of stress as used by non-native and native speakers are discussed.
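Measurements of this kind can be sketched with present-day tools; the following is a hypothetical illustration using the praat-parselmouth library (the file name and the syllable interval times are placeholders, not the authors' materials).

```python
import parselmouth
from parselmouth.praat import call

# Placeholder file and syllable interval (e.g. the stressed syllable of "confession")
snd = parselmouth.Sound("confession.wav").extract_part(from_time=0.12, to_time=0.31)

duration_ms = snd.get_total_duration() * 1000
mean_f0 = call(snd.to_pitch(), "Get mean", 0, 0, "Hertz")       # fundamental frequency
mean_db = call(snd.to_intensity(), "Get mean", 0, 0, "energy")  # relative amplitude
print(f"duration = {duration_ms:.0f} ms, F0 = {mean_f0:.0f} Hz, level = {mean_db:.1f} dB")
```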

Section 19

737 CORRECTIVE PRONUNCIATION TEACHING ON AN AUDITORY BASIS
Hans Grassegger
Institut für Sprachwissenschaft der Universität Graz, Austria
In pronunciation teaching the auditory aspect plays a fundamental role, since it precedes production in the learning process. Phonetic interference phenomena in L2 acquisition can also rest on characteristic hearing errors, which have one of their causes in the auditory similarity of source-language and target-language phones. The present study investigates the auditory similarity of the voiceless dental lateral fricative [ɬ] (which does not occur in German) to a series of possible substitutions, as judged by German listeners. From the similarity judgments over all sound pairs containing the lateral fricative as first or second stimulus, consequences can be drawn for the construction of a corrective programme for particular problem sounds of the target language. Accordingly, target-language problem sounds should be contrasted with their auditorily motivated source-language substitutions, thereby establishing the auditory-discrimination prerequisite for correct production of the target-language sound.

738 PHONIC TRANSFER: THE STRUCTURAL BASES OF INTERLINGUAL ASSESSMENTS
Allan R. James
Engels Seminarium, Universiteit van Amsterdam, Netherlands
The paper discusses certain inadequacies in the phonetic and phonological explication of foreign language pronunciation data which derive from a restricted view of the structural determinants of TL pronunciation behaviour. Above all, existing frameworks of description seem unable to account for the inherent variability in TL pronunciation. This, it is claimed, is a product of a 'compartmentalized' view of the structure of sound systems as well as of a simplified view of the dynamics of the TL acquisition and production processes themselves. Central to the latter, however, are assessment procedures of the foreign language learner in which TL forms are judged for their compatibility with forms of his NL. These assessments in turn lay the basis for the phonic transfer of elements of the NL into the TL. The potentiation for transfer of NL forms is a product of the degree of relatedness as perceived between TL and NL sound elements within phonological, phonetic and articulatory levels of sound structure. Actuation of transfer is a product of properties of the suprasegmental context within which the TL form is located. These structural properties of context account for the positional constraints on TL production and determine the form of the pronunciation variant used. An analysis of the typical occurrence of non-target segments in the English pronunciation of Dutch speakers compared to that of German speakers is offered by way of illustration of the points made.

739 THE SYLLABIC-ACCENTOLOGICAL MODELS OF RUSSIAN NOUNS
E. Jasová
Pädagogische Hochschule, Banská Bystrica, CSSR
Among the fundamental problems of teaching Russian as a foreign language in Slovak and Czech schools is, in our view, stress, which is characterized by a complex of phonetic and morphological properties. A typical feature of Russian is its so-called free stress, in contrast to Slovak or Czech, where stress is bound to the first syllable of the word. The main object of our investigation is the syllabic-accentological relations of Russian nouns, on the basis of the frequency dictionary of Russian [1]. These relations are examined for a limited number of nouns (underived and derived; simple and compound; "native" and "foreign") under two aspects; the nouns are registered in a list by syllable count and decreasing frequency. The aspects of investigation are: 1. The syntagmatic aspect: we establish the distribution of free Russian stress in nouns according to the dictionary citation form (Nom. Sing.), from the standpoint of the grammatical category of gender. 2. The paradigmatic aspect: we investigate a) the syllabic variability of the word forms of the paradigm and b) the movement of stress within the word forms of the nouns. In our investigation we proceed from the accentological conception of V. Straková [2], which seems to us particularly suitable from the standpoint of pedagogical practice in teaching Russian as a foreign language.
[1] Častotnyj slovar' russkogo jazyka, ed. L.N. Zasorina, Moskva 1977.
[2] Straková, V.: Ruský přízvuk v přehledech a komentářích, Praha 1978.

740 INTERLANGUAGE PRONUNCIATION: WHAT FACTORS DETERMINE VARIABILITY?
J. Normann Jørgensen
Dept. of Danish, Royal Danish School of Educational Studies at Copenhagen
The material for this investigation is the Danish language as spoken by young Turkish immigrants attending the public schools of Metropolitan Copenhagen. Several social and educational factors have been covered, and certain linguistic characteristics have been investigated. Regarding the young Turks' pronunciation of Danish, it was found that "contrastive" deviations from native Danish were frequent; naturally, so were non-deviant features. When Turkish-Danish pronunciation is described as an interlanguage, a variation must be accounted for which cannot simply be described in the usual terms for deviation: interference, generalization etc. Rather, the variation seems to depend on the variation within natively spoken Danish as well. An example: Danish short non-high back vowels ([o+, or, oT]; Turkish has [o]). For Danish [o+] we often hear in Turkish-spoken Danish. Adjacent to [ə], however, [o] is predominant. The native Danish [o+] and [or] are represented in Turkish-spoken Danish by [o]. That native Danish [oT] is not represented by [o+] is probably due to the fact that the dominant sociolect in immigrant-dense parts of Copenhagen often has [oT] for standard Danish [o+]. On the other hand, the young Turks sometimes do have [o+], i.e. non-deviant standard Danish pronunciation; this is particularly frequent among young Turkish women. Such variation is similar to sex-related variation among native Danish speakers. It is tempting to describe an interlanguage as a series of stages - or a continuum - between a source language and a target language. Of course, it has long been realized that not every deviation from the target language can be related to the source language, and that the interlanguage is therefore more complicated than that. It seems, however, that the target (as such) is to be understood in no simpler way. Any description of at least this particular (Turkish-spoken-Danish) interlanguage must take the complicated reality of Danish variation into account. Consequently, purely phonological considerations will conceal some of the interlanguage features. The interlanguage variation is no less systematic for that reason: it is partly systematic in the way described by Dickerson, partly systematic like intrinsic language variation as described by e.g. Labov, but above all, as a whole, more complex than either in the way social and linguistic factors interrelate.
References: M. Søgaard Larsen: Skolegang, etnicitet, klasse og familiebaggrund. Danmarks pædagogiske Bibliotek, Copenhagen 1982. M. Søgaard Larsen & J. Normann Jørgensen: Kommunikationsstrategier som grænsedragningsfænomen. KURSIV 3, Copenhagen (Dansklærerforeningen), 1982, p. 19-35. J. Normann Jørgensen: Det flade a vil sejre. SAML 7, Copenhagen University Dept. of Applied Ling., 1980, p. 67-124. J. Normann Jørgensen: Kontrastiv udtalebeskrivelse. In Gabrielsen & Gimbel (eds): Dansk som fremmedsprog, Lærerforeningens Materialeudvalg, Copenhagen 1982, p. 297-336. Lonna J. Dickerson: The Learner's Interlanguage as a System of Variable Rules. TESOL Quarterly, Vol. 9, No. 4, December 1975.

741 TIMING OF ENGLISH VOWELS SPOKEN WITH AN ARABIC ACCENT
Fares Mitleb
Department of English, Yarmouk University, Irbid, Jordan
This study attempts, first, to determine to what extent the temporal properties of Arabic-accented English vowels resemble the first or the second language, and second, to examine the extent to which abstract-level differences between Arabic and English affect Arabs' production of the phonetic manifestation of the phonological rule of flapping in American English and of the novel English syllable type CVC. Results show that the Arab speakers failed to exhibit a vowel duration difference for voicing and produced an exaggerated length contrast of tense vs. lax vowels that more closely resembles Arabic long and short vowels. Thus the temporal properties of Arabic-accented English vowels are only slightly altered from the corresponding Arabic values. On the other hand, the Arab speakers thoroughly acquired the American segmental phonological rule that changes intervocalic /t, d/ into an apical flap [ɾ], and correctly produced the novel English CVC syllable type instead of their native CVCC. These results support the hypothesis that phonetic implementation rules are more difficult for an adult language learner to change than rules which can be stated at the level of abstract segmental units. (Research supported by Yarmouk University.)

742 INDIVIDUAL STUDENT INPUT IN THE LEARNING OF FL PRONUNCIATION
H. Niedzielski
Department of European Languages at the University of Hawaii, Honolulu, Hawaii, USA
Acoustic phonetics based on contrastive recordings of the source and target languages can be learned individually by students with a minimum of intervention by the teacher. The help of the latter is needed more for articulatory phonetics. However, here again, with contrastive descriptions of both languages, most of the work can be performed by students themselves on an individual basis. As I often say to my students in French phonetics, I can show you how to improve your pronunciation but I cannot improve it for you; that is your part. Consequently, I provide them with various oral and visual contrastive materials and assign them various tasks to perform individually. Among others, they keep a diary of their own efforts in the language laboratories, at home or anywhere. They report their problems, solutions, failures and successes, and we discuss all these in class, in my office or in the cafeteria. This paper will present some of my students' comments about the materials, techniques, and approaches which they find so attractive and efficient. Sample materials will be exhibited.

743 VOICE QUALITY AND ITS IMPLICATIONS
Paroo Nihalani
Department of English Language and Literature at the National University of Singapore, Singapore
The widespread use of Daniel Jones' English Pronouncing Dictionary in the Commonwealth countries seems to imply that British Received Pronunciation (BRP) is the model of English prescribed for learners of English in these countries. The speaker feels that this form of pronunciation represents an unrealistic objective, and one that is perhaps undesirable. BRP is the 'normative' model that limits itself to the consideration of communicative intentions attributed to the speaker only. The speaker argues in favour of a pragmatic model, a two-way interactional model within the framework of Speech Act theory, which considers the hearer an active participant: only the observation of the hearer's answer can tell whether the speaker has succeeded in performing his/her speech act. The importance of para-phonological features such as 'pleasant' voice quality for communicative purposes will be discussed. It is suggested that a course in Spoken English based on 'diction' and 'dramatics', rather than on the exact phonetic quality of sounds, may prove to be more effective. Phonetic correlates of what is called 'pleasant' voice quality are also discussed.

744 INVESTIGATIONS OF THE FRICATIVE CORRELATION IN GERMAN
E. Stock
Wissenschaftsbereich Sprechwissenschaft der Sektion Germanistik und Kunstwissenschaften, Martin-Luther-Universität Halle-Wittenberg, German Democratic Republic
In many descriptions of the northern pronunciation standard of German, the consonantal correlation of plosives and fricatives is still presented as a correlation of voiced and voiceless. The aim of the present study is to show that it is more adequate and more useful to speak not of a voicing correlation but of a tension correlation. This problem is significant above all for phonetic exercises in teaching German as a foreign language. If learners have automatized in their mother tongue that the phonemes /b/-/p/ are distinguished in realization predominantly by the distinctive feature of voice, they will without hesitation apply the same feature in German if the phonological-phonetic description encourages them in that direction. The result is errors in phoneme realization and false assimilations; the pronunciation remains conspicuously foreign, and the peculiarity of German consonantism is missed. On the basis of experimental phonetic investigations, Meinhold and Stock showed as early as 1963 that the closure phase of the mediae /b, d, g/ is voiced only after voiced allophones, and voiceless after pauses and voiceless allophones, without these phonemes thereby being realized as tenues. The occurrence of voicing is position-dependent; the dominant and more frequent distinctive feature is fortis-lenis. An experimental phonetic investigation carried out in 1982 shows similar conditions for the realization of the fricatives /v, z, j/. The results are illustrated with tables and oscillograms. For the phonological description, this justifies speaking of a consonantal tension correlation in German. Analyses of teaching conclude by demonstrating the practical usefulness of this description.

745 A METRIC FOR THE EVALUATION OF PROSODIC ERRORS
P. Touati
Institute of Linguistics and Phonetics, Lund, Sweden
The aim of this paper is to present a method for measuring the prosodic errors made by Swedish speakers of French. The first difficulty in judging a speaker's pronunciation errors is to separate what is due to segmental errors from what is due to prosodic errors. Thanks to an analysis-synthesis system with predictive coding (LPC), we circumvented this difficulty by introducing Swedish prosody into a French sentence while keeping the French segments. This manipulation is the starting point of our method.
Experiment. Swedish rhythm and intonation were introduced into the original French sentence by carrying out systematic modifications of its duration and fundamental frequency parameters. Four stimulus sentences were thus obtained: the first a simple synthetic copy of the original, the second with Swedish rhythm, the third with Swedish intonation, and the fourth with both Swedish rhythm and intonation. These stimuli were presented to five listeners, who were asked to rate their degree of foreign accent. The listeners then compared the stimuli with the original French sentence as produced by three Swedish speakers. This comparison allowed us to evaluate the degree of foreign accent of our speakers and, above all, to determine more systematically which prosodic parameter was responsible for this accent.
Results and conclusion. For the majority of the listeners, the stimulus with both Swedish rhythm and intonation was judged to have the strongest foreign accent, followed by the one with Swedish intonation and finally the one with Swedish rhythm. As for the speakers, the parameter judged responsible for their accent varies; the causes of this variation will be discussed. These first results nevertheless seem to indicate the validity of this method as a metric for the evaluation of prosodic errors.
References
Touati, P. (1980) Etude comparative des variations de la fréquence fondamentale en suédois et en français. Working Papers, Department of Linguistics, Lund University 19: 60-64.
Gårding, E., Botinis, A. & Touati, P. (1982) A comparative study of Swedish, Greek and French intonation. Working Papers, Department of Linguistics, Lund University 22: 137-152.
Gårding, E. (1981) Contrastive prosody: A model and its application. Special Lecture to AILA Congress 1981. In Studia Linguistica, Vol. 35, No 1-2.
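The prosody-transplantation idea can be sketched with present-day tools; the following is a hypothetical analogue of the LPC manipulation described above, using Praat's overlap-add resynthesis through the praat-parselmouth library. The file names and the two-point stand-in for the "Swedish" contour are placeholders.

```python
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("french_sentence.wav")     # placeholder recording
dur = snd.get_total_duration()
manipulation = call(snd, "To Manipulation", 0.01, 75, 600)

# A simple falling tier stands in for the transplanted foreign F0 contour
pitch_tier = call("Create PitchTier", "swedish_f0", 0, dur)
call(pitch_tier, "Add point", 0.0, 180)
call(pitch_tier, "Add point", dur, 120)

# Replace the original F0 and resynthesize, keeping the French segments intact
call([pitch_tier, manipulation], "Replace pitch tier")
hybrid = call(manipulation, "Get resynthesis (overlap-add)")
hybrid.save("hybrid.wav", "WAV")
```

Duration manipulation, the other parameter varied in the experiment, can be handled analogously with a DurationTier on the same Manipulation object.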

746 RESEARCH IN THE FOREIGN-LANGUAGE PHONETICS CLASS
Joel C. Walz
Department of Romance Languages at the University of Georgia, Athens, Georgia, USA
The foreign-language phonetics class in the United States is usually a combination of theoretical aspects based on articulation and corrective procedures designed to improve students' pronunciation. Students participating in the course are most often in their third or fourth year of undergraduate work and have had little linguistic training. Theory certainly can help students become cognizant of their own weaknesses, but it does not always integrate well into class activities. The author proposes that teachers require all students to complete a research project, which will combine phonetic theory with practical applications. Since few foreign-language teachers have elaborate equipment or the expertise to use it, topics will involve only standard tape recording. Six problems can be studied by students at an elementary level, yet their findings can prove pedagogically quite useful.
1) Phonetics students can test a beginning language student and design a corrective program for him.
2) They can test a native speaker of the language they are studying and describe variations from the orthoepic system.
3) They can design and administer a test of register.
4) In cosmopolitan areas, a test of regional variation is possible.
5) They can administer a self-test and develop hypotheses for their errors (interference, intralingual confusion).
6) A test of native speaker reaction to pronunciation errors could have immediate applications.
The research project in the foreign-language phonetics class is an effective way of uniting the theoretical and practical aspects of the course.

747 TEACHING UNSTRESSED VOWELS IN GERMAN: THE EFFECT OF DIMINISHED STRESS UPON VOWEL DIFFERENTIATION
R. Weiss
Department of Foreign Languages and Literatures at Western Washington University, Bellingham, WA 98225, U.S.A.
This paper addresses itself to the phenomenon of reduction of stress and its resultant effect upon the length and quality of German vowels. Prior research has indicated that in German a maximum of 15 vowel oppositions are operative in fully stressed syllable position. The total number of oppositions is minimized due to a complex but systematic relationship of length and quality. (See Weiss, "The German Vowel: Phonetic and Phonemic Considerations," Hamburger Phonetische Beiträge, 25 (1978), 461-475.) In unstressed syllable position the system potentially increases in vowel diversity to include as many as 29 vowels (Duden). Although this maximal diversity is reflected primarily in borrowed words of non-German origin, the increase in vowel opposition is due largely to the fact that length and quality may function more independently in unstressed syllable position. Although vowel differentiation thus appears to be more diverse in unstressed syllable position, it will be demonstrated that in reality quite the opposite is true: in actual practice vowel differentiation dramatically decreases in unstressed syllable position. An attempt will be made to correlate diminishing degrees of stress with the loss of certain phonetic features, such as lip-rounding, length, extremes in quality, etc. A priority system for the sequential loss of features, both in regard to perception and to normal production, will be proposed. Additionally, other factors which play a role in unstressed syllable position, such as phonemic and morphophonemic considerations, will be taken into account. It will be demonstrated that there exists a direct and positive correlation between vowel diversity and vowel stress. A set of postulates operative in unstressed syllable position will be given with respect to different levels of stress. An attempt will be made to present in hierarchical fashion the different vowel systems operative at different levels of stress, from fully stressed to totally unstressed. In addition, the implications of the above findings for foreign language teaching will be discussed. Pedagogical guidelines for a practical treatment of unstressed vowels will be given. A relatively simple and practical seven-vowel system will be proposed which not only more accurately reflects the principles actually operative in unstressed vowel production, but also more closely reflects the actual articulation most commonly associated with unstressed vowels.

748 THE VISUALISATION OF PITCH CONTOURS: SOME ASPECTS OF ITS EFFECTIVENESS IN TEACHING FOREIGN INTONATION *)
B. Weltens & K. de Bot
Institute of Phonetics / Institute of Applied Linguistics, University of Nijmegen, The Netherlands
One of the contributions from phonetic research to applied linguistics is the development of technical aids for the teaching of pronunciation. Such aids have been developed for both segmental and suprasegmental aspects of pronunciation, the latter having - until recently - received comparatively little attention (cf. e.g. Abberton & Fourcin, 1975; Hengstenberg, 1980; James, 1976, 1977; Léon & Martin, 1972). Since 1976 work has been carried out towards developing a microcomputer-controlled set of equipment for displaying pitch contours of sentences. The aim was to produce a practical set-up in which target sentences recorded on tape and ad-hoc imitations by learners/subjects could be displayed simultaneously on the upper and lower halves of a t.v. screen, and to test this set-up with different target languages and different groups of learners under different conditions. Over the past seven years a number of experiments have been carried out with several different target languages, experimental designs and set-ups of the equipment. Intonation contours of three languages have been used in consecutive experiments: Swedish (with Dutch subjects who had no previous knowledge of the target language), English (with advanced Dutch learners: 1st-year undergraduate students) and Dutch (with Turkish immigrants of different levels of proficiency in the target language). In this series of experiments we have investigated the influence of the following variables on the ability to imitate foreign intonation patterns:
- feedback mode: auditive vs. audio-visual,
- feedback delay: the time lag between the moment of producing the utterance and the moment of plotting its pitch contour on the screen,
- proficiency level of the learner in the target language: measured by means of a written editing test (Mullen, 1979),
- age of the learner: 9- to 12-year-old children vs. adults.
The outcome of these investigations will be presented in very general terms: we will discuss the effectiveness of visual feedback compared with auditive feedback, the effect of the feedback mode on the practice behaviour of the subjects during the experimental session, and the influence of potentially interfering variables (feedback delay, proficiency level and age of the learner) on the effectiveness of visualising pitch contours in intonation learning. We will also briefly describe the latest set-up of the equipment, which has proved highly workable for the individual learner and could form a major contribution to intonation teaching in many areas of language teaching and speech therapy.
*) This research was partly sponsored by the Research Pool of the University of Nijmegen, and by the Netherlands Organisation for the Advancement of Pure Research (ZWO).
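The display principle described above (target contour on the upper half of the screen, imitation on the lower half) can be sketched as follows; this is a hypothetical modern stand-in for the Nijmegen equipment, with placeholder file names.

```python
import numpy as np
import matplotlib.pyplot as plt
import parselmouth

fig, axes = plt.subplots(2, 1, sharex=True)   # upper half: target; lower half: imitation
for ax, path, label in zip(axes, ["target.wav", "imitation.wav"], ["target", "imitation"]):
    pitch = parselmouth.Sound(path).to_pitch()
    f0 = pitch.selected_array["frequency"]
    f0[f0 == 0] = np.nan                      # blank out unvoiced frames
    ax.plot(pitch.xs(), f0)
    ax.set_ylabel(f"{label} F0 (Hz)")
axes[1].set_xlabel("time (s)")
plt.show()
```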

749 CROSS-LINGUISTIC INFLUENCE IN SECOND LANGUAGE ACQUISITION: THE RHYTHMIC DIMENSION
B.J. Wenk
Department of Modern Languages, Université de Strasbourg II, Strasbourg, France
As temporal patterning constitutes a crucial feature of language performance - possibly more fundamental to comprehension and expression than segmental features in and of themselves - the acquisition of target rhythmic organisation by learners whose first language is disparate in this regard from the target language is clearly an area worthy of careful analysis. Perhaps because of the naivety of popular rhythmic typologies and the exclusion of rhythm from the purview of much phonological theory, the problem has not so far received the attention it merits. However, the descriptive framework applied to French and English rhythms in Wenk and Wioland (1982), revealing a hitherto unsuspected set of interrelationships between temporal patterning and a range of phonetic features, permits the discovery of a generalisable order of acquisition for which experimental confirmation is presented. The data are also analysed with respect to variation due to speech style and phonetic context.
Reference
Wenk, B.J. and F. Wioland (1982): "Is French really syllable-timed?", Journal of Phonetics, 10, 193-216.

ENGLISH INTONATION FROM A DUTCH POINT OF VIEW
N.J. Willems
Institute for Perception Research, Eindhoven, The Netherlands

When native speakers of Dutch speak (British) English, their pronunciation will generally differ from that of English native speakers. These differences give rise to the perception of a non-native 'accent'. This paper reports on an experimental-phonetic investigation which attempts to characterize and describe the intonational, or rather melodic, aspects of this non-nativeness. This will be used to design an experimentally based intonation course for Dutch learners of English. Contrary to traditional courses in English intonation, which are mainly based on 'drills', the planned course could make students aware of intonational structures of the target language by providing them with an explicit but simple description. Our approach is largely based on the research methods of the 'Dutch school' of intonation, which describes the detailed fundamental frequency curves in terms of discrete perceptually equivalent pitch movements using a straight-line approximation (stylization).
An extensive comparison was made between about 600 fundamental frequency curves of English utterances produced by a dozen native Dutch and English speakers. This analysis yielded six fairly systematic melodic deviations produced by Dutch native speakers. Major deviations were substantially smaller excursions, rising instead of falling pitch movements, and a too low (re)starting level. In a first perception test, in which synthetic speech was used, English native speakers were asked to assess the acceptability of systematic variations in magnitude of the excursion and position of the pitch movement in the syllable with respect to vowel onset. The outcome of this experiment showed that English pitch contours can be adequately described with an average excursion of 12 semitones. In order to establish the perceptual relevance of the deviations found, a second perception experiment was performed in which all original fundamental frequency curves were replaced by systematically varied artificial contours by means of LPC-resynthesis. These contours were superimposed on the utterances produced by the native speakers of English. This technique was used to prevent the native English judges from being influenced by deviations other than those in pitch. According to 55 English native speakers, who appeared to be very consistent in their judgments, the deviations were to a greater or lesser extent unacceptable. On the basis of the results of this experiment the usefulness of some stylized melodic pronunciation precepts was tested. Results showed high acceptability scores, suggesting the potential effectiveness of the precepts. Our results suggest it should be possible to set up a melodic intonation course for Dutch students based on experimental evidence. Moreover, the success of the stylization method for English suggests that there is great promise in developing a notational system of straight-line contours.
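The excursion sizes above are stated in semitones; the conversion from fundamental frequency is the standard logarithmic one. A minimal sketch (ours, not the analysis software used in the study):

    # An excursion of 12 semitones corresponds to a doubling of F0.
    import math

    def semitones(f_hz: float, ref_hz: float) -> float:
        """Pitch distance in semitones between f_hz and a reference frequency."""
        return 12 * math.log2(f_hz / ref_hz)

    print(semitones(240.0, 120.0))  # 12.0 semitones: one octave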

Section 20

Speech Pathology and Aids for the Handicapped


INTONATION PATTERNS IN NORMAL, APHASIC, AND AUTISTIC CHILDREN
Christiane A.M. Baltaxe, Eric Zee, James Q. Simmons
Department of Psychiatry, University of California, Los Angeles, USA

Prosody comprises the acoustic parameters of fundamental frequency, intensity, duration, and their covariation. Prosody presents an important dimension in language development. Prosodic variables may affect aspects of language comprehension, retention, and production in the acquisition process. Prosodic patterns are mastered and stabilize prior to segmental and syntactic patterns of language, and several investigators have proposed that they form 'frames' or 'matrices' for subsequently developing segmental and syntactic units. However, these prosodic frames have not been sufficiently characterized, and the phonetic and linguistic details of their development are presently unclear. Prosody as a potential and powerful variable in delayed or deviant language development also awaits investigation. The present study examines the frequency characteristics of intonation patterns of three matched groups of young children (MLU 1.5-4.0): normal subjects, aphasic subjects (language delayed), and autistic subjects (language delayed/deviant). Only the autistic group also had perceptible prosodic abnormalities. The parameter of frequency was chosen for study since, developmentally, it appears to stabilize first. The present investigation focuses on simple declarative subject-verb-object utterances produced spontaneously under controlled conditions. Frequency measurements were obtained using a pitch meter and Oscillomink tracings, and measurements were subjected to appropriate statistical analysis. Results show that all three groups can be differentiated on the basis of intonation contours of declarative utterances, based on visual inspection of the pitch contours as well as on comparisons based on statistical analyses of the measurements taken. Only the normal group showed a frequency pattern which was comparable to normal adult speech. Descriptively, the pattern can best be characterized by initial rise and terminal fall of the fundamental frequency contour. Although the aphasic and autistic groups generally showed initial rise, terminal fall was absent in most subjects, who showed level or rising pitch finally. Most characteristic of the autistic group was a saw-tooth pattern of pitch modulation. In addition, a flat pitch pattern was also seen; this was also the dominant pattern of the aphasic group. Based on frequency measurements and statistical analyses, significant between-group differences were seen in overall frequency range, frequency differences within each syllable, and frequency shift between syllables. Differences in frequency modulation between content and function words were significant for all three groups. This appears to indicate that the normal as well as the language-deficient groups differentiated between the two semantic categories and generally adhered to the related stress pattern. Some significant differences in frequency modulation were also seen within the content word category, depending on whether subject, verb, or object positions were involved. The occurrence of primary sentence stress on the final stressable syllable, i.e. object position, was not supported by greater pitch modulation in that position. Theoretical implications of these findings and their clinical significance are discussed.
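The three between-group measures named above are simple to state. The sketch below is only our illustration of them (the study itself worked from pitch-meter and Oscillomink tracings, not software), and the per-syllable F0 values are invented:

    # Toy per-syllable F0 values (Hz) for one S-V-O utterance.
    utterance = [[210, 250, 240], [230, 220], [200, 180, 150]]

    def f0_range(f0_values):
        """Overall frequency range across an utterance."""
        return max(f0_values) - min(f0_values)

    def within_syllable_difference(syllable_f0):
        """Frequency difference within one syllable."""
        return max(syllable_f0) - min(syllable_f0)

    def between_syllable_shift(prev_f0_end, next_f0_start):
        """Frequency shift from the end of one syllable to the start of the next."""
        return next_f0_start - prev_f0_end

    print(f0_range([f for syl in utterance for f in syl]))            # 100 Hz
    print(within_syllable_difference(utterance[0]))                   # 40 Hz
    print(between_syllable_shift(utterance[0][-1], utterance[1][0]))  # -10 Hz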

AN INVESTIGATION OF SOME FEATURES OF ALARYNGEAL SPEECH
R. Beresford
Sub-department of Speech, University of Newcastle upon Tyne, England

This is part of a larger investigation being made with clinical colleagues and is concerned with the following questions:
(1) Is the oesophagus passive during oesophageal speech?
(2) What use is made of the capacity of the oesophagus during alaryngeal speech?
(3) Is flow-rate more important than the oesophageal volume used?
(4) Is duration of phonation dependent upon the air reservoir in the oesophagus?
(5) Is oesophageal pressure low in 'good' oesophageal speech?
(6) What are the variables of expulsion?


THE INTELLIGIBILITY OF SENTENCES SPOKEN BY LARYNGECTOMEES
Gerrit Bloothooft
Faculty of Medicine, Free University, Amsterdam, The Netherlands
Present address: Institute of Phonetics, Utrecht University, Utrecht, The Netherlands

Laryngectomees are severely handicapped in their speech communication because of the relatively low quality of their second voice. We investigated their handicap in terms of reduced speech intelligibility. For 18 laryngectomees, 9 of whom developed esophageal speech and 9 speech from a neoglottis obtained by a surgical reconstruction after Staffieri, the intelligibility of short, phonetically balanced, everyday sentences was determined. Intelligibility was measured in terms of the speech-reception threshold (SRT): the sound level for which 50% of the sentences was reproduced correctly by listeners of normal hearing. SRT was determined both in quiet and in interfering noise of 60 dB(A) with a spectrum equal to the long-term average spectrum of normal speech. With the SRT values for normally spoken sentences as a reference (Plomp and Mimpen, 1979), we determined the speech-intelligibility loss (SIL) for all 18 alaryngeal speakers. The SIL in noise was not significantly different from the SIL in quiet, indicating that the intelligibility loss for alaryngeal speech was largely due to distortion. The SIL varied interindividually between 2 dB and 20 dB with an average of 10 dB. No significant differences in SIL values between esophageal and Staffieri neoglottis speakers were found. A model will be presented in which the limiting conditions in speech communication for laryngectomees can be demonstrated as a function of the ambient noise level. In this model not only the SIL but also the lower average vocal intensity of alaryngeal speech (on average 12 dB in comparison to normal speech under the same conditions) is included. It will be shown that many laryngectomees are already severely handicapped in their speech communication with an ambient noise level as low as typically present in living-rooms (about 40 dB(A)).
The cooperation of patients and the staff of the Logopedic and Phoniatric Department of the Free University Hospital is kindly acknowledged.

Reference
Plomp, R. and Mimpen, A.M. (1979). Speech-reception threshold for sentences as a function of age and noise level. J. Acoust. Soc. Am. 66, 1333-1342.
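A minimal numeric sketch of the two quantities the model combines. The SRT values below are placeholders, not the Plomp and Mimpen reference data, and the model form is our simplification:

    # SIL: how much higher a laryngectomee's SRT is than the normal reference.
    def sil_db(srt_patient: float, srt_reference: float) -> float:
        return srt_patient - srt_reference

    # Simplified communication margin: the signal-to-noise ratio a listener
    # receives, given the average 12 dB vocal-intensity deficit reported above.
    def effective_snr(noise_db: float, vocal_level_db: float,
                      intensity_deficit_db: float = 12.0) -> float:
        return (vocal_level_db - intensity_deficit_db) - noise_db

    print(sil_db(srt_patient=8.0, srt_reference=-2.0))        # 10 dB, the mean found
    print(effective_snr(noise_db=40.0, vocal_level_db=60.0))  # 8 dB margin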

SPECIAL PROBLEMS IN DIAGNOSIS: PROSODY AND RIGHT HEMISPHERE DAMAGE
Pamela Bourgeois, M.A., Amy Veroff, Ph.D. and Bill LaFranchi, M.A.
Casa Colina Hospital for Rehabilitative Medicine, Pomona, California, USA

Control of the syntactic, semantic, and phonological aspects of language has long been associated with the left or "dominant" hemisphere. In contrast, recent research has correlated dysfunctions in the prosodic components of language with right hemisphere lesions (Ross, 1981). Although advances have been made in the scientific study of the minor hemisphere's role in speech and language function, especially with regard to dysprosody, this information has not been integrated into the rehabilitation of patients with disorders arising from minor hemisphere pathology. Dysprosody has been operationally defined as "the inability to impart affective tone, introduce subtle shades of meaning, and vary emphasis of spoken language" (Weintraub, Mesulam, and Kramer, 1981). Specifically, this disturbance reflects the individual's inability to express stress and melodic contour in communication. The absence or reduction of prosodic components in verbal language produces a demonstrable communication deficit in which the pragmatic or socio-linguistic aspects of communication are affected. Our research on dysprosody offers techniques which speech/language pathologists and psychologists can use to differentiate disturbances of the affective components of speech, as isolated symptoms, from real disturbances of affect such as depression. A complete neuropsychological evaluation is also performed, which provides valuable information on the patient's overall cognitive status. This information is currently being integrated into the rehabilitation of patients with disorders arising from minor hemisphere pathology. The purpose of this presentation is to offer a clinical approach to disorders of prosody. It will discuss 1) the effects of minor hemisphere lesions upon speech and language, 2) the need for communicative rehabilitation with patients presenting prosodic impairments, and 3) a systematic diagnostic and treatment protocol.

SOME SPEECH CHARACTERISTICS OF PARKINSONIAN DYSARTHRIA: ELECTROMYOGRAPHIC STUDY OF FOUR LABIAL MUSCLES AND ACOUSTIC DATA
M. Gentil+, J. Pellat+ & A. Vila++
Centre Hospitalier Universitaire de Grenoble, Grenoble, France

Early studies provided acoustic information about parkinsonian speech diseases (Alajouanine T., Sabouraud O. & Scherrer J., Contribution à l'étude oscillographique des troubles de la parole, Larynx et Phonation, pp. 145-158; Grémy F., Thèse de Médecine, Paris, 1957). More recent investigations used electromyographic techniques to determine neuro-muscular dysfunctions in the orofacial system of parkinsonian patients (Hunker C.J., Abbs J.H. & Barlow J.M., The relationship between parkinsonian rigidity and hypokinesia in the orofacial system, Neurology 32, 1982). However, in most experiments only a few muscles were monitored. It seems necessary, given the phenomenon of motor equivalence, to monitor many muscles to obtain a valid perspective of the control process (Gentil M., Gracco V.L. & Abbs J.H., 11e ICA, Paris, 1983). The purpose of the present study was to analyze the activity of 4 labial muscles of parkinsonian subjects performing lower lip movements, and to see whether our observations were consistent with findings concerning limb muscles, in spite of different physiopathological mechanisms. Furthermore, oscillographic recordings of the data specified the characteristics of the subjects' voices: low and uniform intensity, poor timbre, irregular speech rate and abnormalities of pitch.
The Parkinson dysarthrics for this study were 2 female and 1 male subjects, aged respectively 40, 74 and 70 years. All were judged to manifest rigidity and hypokinesia in the limbs and reduced facial mimic. No labial tremors were noticed. These patients were under L-Dopa medication. Two normal female subjects were investigated in parallel as controls. Our observations concerned the lower lip because of 1) its contribution to speech production, and 2) its usual rigidity in comparison with the upper lip in parkinsonism. EMG activity was recorded with needle electrodes from orbicularis oris inferior (OOI), depressor labii inferior (DLI), mentalis (MTL) and buccinator (BUC). Acoustical signals were simultaneously recorded. The subjects repeated 3 times 3 sentences (cf. speech rate measurement) and a list of 27 words of CV or CVCV type (V = a, i, u). These were selected because of the particular activity of the recorded muscles during their production. Coarticulatory effects were especially studied in monosyllabic words.
The results of our analyses indicated well-defined EMG abnormalities for all patients. We observed 1) impairment of the functional organization of the antagonistic muscles (lack of reciprocal inhibition), 2) the existence of a resting activity between productions, 3) the presence of a sustained hypertonic background activity during the productions, and 4) differences concerning coarticulatory effects between normal and parkinsonian subjects. These results will be discussed in terms of the eventual analogy they have with general symptoms observed in limbs.

+ Service de Neurologie Dejerine
++ Service d'Electromyographie

PHONO-ARTICULATORY STEREOTYPES IN DEAF CHILDREN
L. Handzel
Independent Laboratory of Phoniatrics, Medical Academy, Wroclaw, Poland

Phono-articulatory phenomena were analysed in deaf children aged 7 to 14 years, using a visible-speech sonograph, an apparatus for registration of tone pitch, and an oscillograph to register acoustic energy while a short sentence was uttered. The studies made it possible to distinguish a number of phono-articulatory stereotypes and their variants. Observations of phono-articulatory events in the deaf can be regarded as a contribution to diagnostic tools as regards the time of onset and degree of the hearing impairment, which opens the possibility of developing appropriate rehabilitation methods.

QUANTITATIVE STUDY OF ARTICULATION DISORDERS USING INSTRUMENTAL PHONETIC TECHNIQUES
W.J. Hardcastle
Department of Linguistic Science, University of Reading, Reading, U.K.

In this study* a detailed analysis of the speech of five articulatory dyspraxic and five normal children, ranging in age from 8 to 14, was carried out using the instrumental techniques of electropalatography and pneumotachography. Electropalatography provided data on the dynamics of tongue contacts with the palate, and pneumotachography was used to measure air-flow characteristics of obstruent sounds. The temporal relationships between lingual patterns and air-flow and acoustic characteristics were determined by synchronizing the recordings from each instrument. Speech measures included:

(1) place of lingual contact during obstruent sounds;
(2) timing of approach and release phases of obstruents;
(3) voicing of obstruents (including V.O.T.);
(4) articulatory variability;
(5) timing of component elements in consonant clusters.

The speech of the disordered group was found to differ from the normals in all five areas. Specific abnormal articulatory patterns found included (a) simultaneous contact at alveolar and velar regions of the palate during alveolar consonants, (b) abnormal temporal transitions between elements in clusters such as [st], [sk], (c) abnormal tongue grooving for fricatives, and (d) lack of normal V.O.T. distinctions. The advantages of this quantitative approach over traditional assessment techniques relying solely on impressionistic auditory judgments are discussed.

* This work is supported by a research grant from the British Medical Research Council.
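Of the five measures, voice onset time is the most compact to state. A hypothetical sketch, assuming hand-labelled burst and voicing-onset times rather than the synchronized instrumental records used in the study:

    # VOT: time from stop release (burst) to onset of voicing, in seconds.
    # Negative values indicate prevoicing (voicing before the release).
    def voice_onset_time(release_s: float, voicing_onset_s: float) -> float:
        return voicing_onset_s - release_s

    print(voice_onset_time(0.512, 0.575))  #  0.063 s: long-lag stop, e.g. [t]
    print(voice_onset_time(0.512, 0.490))  # -0.022 s: prevoiced stop, e.g. [d]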


THE IMPLICATIONS OF STUTTERING FOR SPEECH PRODUCTION
J.M. Harrington
Department of Linguistics, University of Cambridge, Cambridge, England

A recent pilot study has suggested that stuttering on the post-palatal [c] of [ci:n] (keen) manifests itself as a repetition of the mid-velar [kə]. Stuttering on the affricate [tʃ] of [tʃu:z] (choose) manifests itself as a repetition of the entire affricate [tʃ], whereas stuttering on the onset of [tɹæntə] (tranter) manifests itself as either a repetition of the entire [tɹ] cluster or simply the alveolar stop plus schwa vowel [tə]. The data is then interpreted in terms of a speech production model and auditory, proprioceptive and internal feedback channels. It is argued that stuttering involves a breakdown at the level of internal feedback which results in the inability to integrate the onset and rhyme of a syllabic cluster.


ACOUSTICAL MEASUREMENT OF VOICE QUALITY IN DYSPHONIA AFTER TRAUMATIC MIDBRAIN DAMAGE
E. Hartmann, D. v. Cramon
Neuropsychological Department, Max-Planck-Institute for Psychiatry, Munich, FRG

In a former study the characteristics of dysphonia after traumatic mutism and the patterns in the recovery of normal laryngeal functions had been established. Traumatic dysphonia is initially characterized by breathy voice quality; with decreasing breathiness, voice quality becomes in particular tense and mildly rough. This had been described in terms of auditory judgement. In order to provide quantitative and objective measures for these qualitative descriptions, several acoustical parameters were developed. About twenty male and female patients suffering from traumatic midbrain syndrome were examined. The isolated cardinal vowels, repeated by the patients in a phonatory test, were recorded, digitized and automatically analyzed by an F0 analysis and an FFT spectral-analysis routine on a PDP 11/40 computer. Subsequently the following parameters were calculated: 'mean fundamental frequency', 'fundamental period perturbation', 'spectral energy above 5 kHz' and 'variance of spectral energy'. Additionally, with the aid of a segmentation routine, the 'time lag of exhaled air' preceding the vocal onset was measured. These parameters showed significant differences between patients with different pathological voice qualities and a control group of speakers without voice disorders. Besides this classification, the different stages in the process of recovery of normal phonation could be described quantitatively. The results encourage acoustical analysis as a tool in clinical examination and treatment of voice disorders.
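As an illustration of two of these parameters, the sketch below computes 'mean fundamental frequency' and a relative 'fundamental period perturbation' from a sequence of glottal period durations. The definitions are our assumptions (one common jitter formulation), not the PDP 11/40 routines:

    import numpy as np

    def mean_f0(periods_s: np.ndarray) -> float:
        """Mean fundamental frequency (Hz) from glottal period durations (s)."""
        return 1.0 / periods_s.mean()

    def period_perturbation(periods_s: np.ndarray) -> float:
        """Mean absolute difference of consecutive periods, relative to the
        mean period: a dimensionless jitter-style measure."""
        return np.abs(np.diff(periods_s)).mean() / periods_s.mean()

    periods = np.array([0.0080, 0.0082, 0.0079, 0.0081, 0.0083])  # toy data
    print(mean_f0(periods), period_perturbation(periods))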

A CONTRIBUTION TO THE PHONOLOGIC PATHOLOGY OF SPEECH STRUCTURE IN CHILDREN WITH IMPAIRED HEARING
A. Jarosz
Independent Laboratory of Phoniatrics, Medical Academy, Wroclaw, Poland

Examinations were made of 12 pupils (4 girls, 8 boys) from the 5th form of a school for hearing-deficient children in Wroclaw. They were 12 to 13 years old, except two aged 14 and 15 years. Phono-articulatory, audiometric and neurologic aspects were taken into consideration. The utterances consisted of naming objects, events, etc., as well as telling picture stories from everyday life and experience. Vowel phoneme patterns were analysed. Pathologic features of the phonemes are presented and interpreted in the light of linguistic rules, environmental factors and others.

VOCAL PROFILES OF ADULT DOWN'S SYNDROME SPEAKERS
J. Laver, J. Mackenzie, S. Wirz and S. Hiller
Phonetics Laboratory, Department of Linguistics, University of Edinburgh, Scotland

A descriptive perceptual technique has been developed, for use in speech pathology clinics, for characterizing a patient's voice in terms of his long-term characteristics of supralaryngeal and laryngeal vocal quality, of prosodic features of pitch and loudness, and of temporal organization features of continuity and rate. The product of the technique, expressed in 40 scalar parameters, is the speaker's 'Vocal Profile'. As part of a large-scale project investigating the vocal profiles of eight different speech disorders (1), 26 male and female adult Down's Syndrome speakers and a sex-matched normal control group were tape-recorded. Three trained judges each independently constructed a vocal profile for each subject, and a representative consensus profile was established for the subject under strict criteria of agreement. Comparison of the vocal profiles of subjects in the two groups shows that the profiles of the Down's Syndrome group differ significantly from those of the control group on a majority of the parameters, and that the detailed differences are plausibly related to organic differences of vocal anatomy between the two groups.

(1) This research was supported by a project grant from the Medical Research Council.
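A consensus profile could be derived along the following lines. The agreement rule used here is hypothetical, since the abstract does not specify the 'strict criteria of agreement', and the ratings are invented:

    # Hypothetical consensus over scalar profile parameters: accept the median
    # judge rating only when all three judges lie within one scalar point of
    # each other; otherwise flag the parameter (None) for re-judging.
    import statistics

    def consensus(profiles: list[list[int]], tolerance: int = 1):
        result = []
        for ratings in zip(*profiles):   # one parameter's ratings, three judges
            if max(ratings) - min(ratings) <= tolerance:
                result.append(statistics.median(ratings))
            else:
                result.append(None)
        return result

    judge_a = [3, 4, 2]                  # toy ratings for 3 of the 40 parameters
    judge_b = [3, 5, 2]
    judge_c = [4, 2, 2]
    print(consensus([judge_a, judge_b, judge_c]))  # [3, None, 2]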

PSYCHOTIC LANGUAGE AND COMMUNICATION
V. Rainov
Sofia, Bulgaria

On the basis of our previous investigations into the specific character of psychotic language, and in particular the peculiarities of sound substitution in speech production (Zaimov, Rainov, 1971, 1972), an attempt is made to determine the role and the degree of communicative ability in psychiatric patients. Patients with schizophrenia, cyclophrenia and amentia-like symptomatology were examined. Two aspects were analysed: 1) the linguo-statistical data of speech disintegration, and 2) the degree of impairment of communicative ability. In agreement with Watzlawick, Beavin and Jackson (1967), we take the position that "one cannot not communicate", and that the global active or passive strategy characterizing the language of the psychiatric patient already constitutes a form of communication.

A NEW TYPE OF ACOUSTIC FEATURES CHARACTERIZING LARYNGEAL PATHOLOGY AT THE ACOUSTIC LEVEL
Jean Schoentgen, Paul Jospa
Institut de Phonétique, Université Libre de Bruxelles, Bruxelles, Belgium

Features characterizing vocal jitter and vocal shimmer have become by now fairly standard with investigators interested in the quantification of voice quality or in the detection of laryngeal pathology at the acoustic level. In the meantime some authors have reported several cases of laryngeal pathologies (e.g. light laryngitis, hypotonia of the vocal folds) which are not characterized by higher perturbation, but by an abnormally low excitation of the vocal tract at the instant of glottal closure. We have proposed a family of algorithms which allow for the estimation of the evolution of the local damping constant inside one glottal period. These algorithms have been described elsewhere. In this paper we present the results of an evaluation of the discrimination performance, between normal and dysphonic speakers, of a set of acoustic features making explicit use of the estimation of the local damping measure provided by this type of algorithm. The performance of the features in discriminating between normal and dysphonic subjects was evaluated by a clustering analysis. Open and closed tests were performed. We also computed the more classic jitter and shimmer features in order to allow for comparisons and to evaluate the combined detection power of jitter/shimmer and damping features. We extracted our features from stable portions of the sustained vowel /a/, uttered by 35 dysphonic speakers and 39 normal speakers respectively. Several authors have reported that in the presence of laryngeal pathology continuous speech is, at the acoustic level, much more likely to exhibit abnormalities than sustained vowels. A careful study of the literature convinced us that this is only true with reference to the narrow-band equipment (e.g. pitch analyzers) used by these investigators. For large-band equipment (sampling frequencies up to 10 kHz) this question is still unsettled. These matters will also be discussed in our paper.

A TACTUAL "HEARING AID" FOR THE DEAF
Karl-Erik Spens
The Swedish Institute for the Handicapped, and the Department of Speech Communication and Music Acoustics, Royal Institute of Technology, Stockholm, Sweden

In order to make a wearable aid for the deaf which transforms acoustic information into tactile stimuli, there are several problems to be solved before the aid actually gives a positive net benefit to the user. Some of the basic problems to be solved are power consumption, size and weight, feedback risks, dynamic range and intensity resolution. For high-information-rate aids with several channels, power consumption and size become even more important and difficult to solve. A one-channel aid has been developed and will be demonstrated. It has the following features: 1. battery life of 50 hrs; 2. size and weight similar to a body-worn hearing aid; 3. no feedback problems; 4. signal processing which gives a good intensity resolution and a large dynamic input range in order to fit the intensity characteristics of the skin. The aid conveys information about the speech rhythm, which is a support for lipreaders. A test with unknown sentences and 14 normal-hearing subjects indicates an average increase in lipreading scores, shown in the figure. The aid is now used on an everyday basis by five postlingually deaf subjects. All subjects report that the aid facilitates lipreading and that the ability to monitor the acoustic environment adds to their confidence. A condensed collection of subjective evaluation data from the long-term use of the aid and quantitative evaluation with lipreading tests will be presented at the conference.

[Figure: percent correct lipreading scores with and without the tactile aid; each point = 280 stimuli.]
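The signal processing is described only in outline above. The sketch below shows one plausible shape for it, envelope extraction followed by logarithmic compression onto the skin's usable range; all parameter values and the structure are our own assumptions, not those of the actual aid:

    import numpy as np

    def tactile_drive(signal: np.ndarray, fs: int,
                      floor_db: float = -50.0,  # assumed usable input range: 50 dB
                      out_lo: float = 0.2,      # assumed skin threshold (arbitrary units)
                      out_hi: float = 1.0) -> np.ndarray:
        """Map a wide-dynamic-range speech signal onto the narrow intensity
        range of a single vibrotactile channel."""
        win = max(1, int(0.02 * fs))            # ~20 ms smoothing window
        env = np.convolve(np.abs(signal), np.ones(win) / win, mode="same")
        level_db = 20 * np.log10(np.maximum(env / (env.max() + 1e-12), 1e-12))
        # Compress [floor_db, 0] dB linearly onto the skin's usable range.
        x = np.clip((level_db - floor_db) / -floor_db, 0.0, 1.0)
        return out_lo + (out_hi - out_lo) * x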