TENTH INTERNATIONAL CONGRESS OF PHONETIC SCIENCES
1-6 AUGUST 1983, UTRECHT, THE NETHERLANDS
Editors: A. COHEN and M.P.R. VAN DEN BROECKE, Institute of Phonetics, University of Utrecht
Utrecht, 1-6 August 1983
Abstracts of the Tenth International Congress of Phonetic Sciences
1983 FORIS PUBLICATIONS Dordrecht - Holland/Cinnaminson - U.S.A.
Published by: Foris Publications Holland, P.O. Box 509, 3300 AM Dordrecht, The Netherlands. Sole distributor for the U.S.A. and Canada: Foris Publications U.S.A., P.O. Box C-50, Cinnaminson, N.J. 08077, U.S.A.
ISBN 90 70176 89 0 © 1983 by the authors. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner. Printed in the Netherlands by ICG Printing, Dordrecht.
Table of Contents

Preface vii
Previous Congresses ix
Congress Committee x
Plenary Sessions 1
Symposia 63
Section Abstracts 329
Index of Authors 793
Workshop on Sandhi 805
Preface
The festive occasion of this 10th International Congress of Phonetic Sciences is seriously overshadowed by the sudden and untimely death of Dennis B. Fry, the President of the Permanent Council, who died in London on 21st March of this year. His presence will be sorely missed by all who have known him.

The 10th Congress will take place in Utrecht, the Netherlands, from 1-6 August 1983 at the Jaarbeurs Congres Gebouw, which is located adjacent to the Central Railway Station. As at the previous 9th Congress, the various scientific activities will be divided over a number of categories, i.e. 5 plenary sessions, each addressed by two invited speakers, 6 symposia, and a great number of section papers and poster sessions. There will also be an exhibition of scientific instruments, as well as an educational one organized by the University of Utrecht Museum.

We are fortunate in being able to announce that, with one exception, all originally invited lecturers/speakers and symposium chairmen have notified us of their willingness to participate. Due to the shift of emphasis over the last few years in our field, a rather large part of the invited contributions will deal with technological advances in the speech sciences.

This volume contains 4-page abstracts of all invited speakers, as well as one-page abstracts of all the section papers/posters that we could accommodate in the program. The Congress Proceedings will contain the complete texts of all invited speakers as well as a selection of 4-page abstracts of section papers, and will appear in book form before the end of this year.

We are grateful to the following institutions, which have given us support in one form or another:
The Royal Dutch Academy of Arts and Sciences, under whose auspices this congress will be held.
The Dutch Ministry of Education, for providing us with financial support.
The Netherlands Organization for the Advancement of Pure Research (ZWO), which, thanks to the Foundation of Linguistics, was able to contribute to the financing of the Proceedings.
The City Council/Municipality of Utrecht, for their financial support.
The Executive Council of Utrecht University, for their willingness to make the organization of this congress possible in a number of ways.
The Faculty of Arts, for helping us to engage a secretary for all administrative work.
Last but not least, all those persons who have shown tremendous willingness to take part in various capacities, sometimes more than one, in an advisory role.

The Editors
Previous Congresses

First: Amsterdam, 1932
Second: London, 1935
Third: Ghent, 1938
Fourth: Helsinki, 1961
Fifth: Münster, 1964
Sixth: Prague, 1967
Seventh: Montreal, 1971
Eighth: Leeds, 1975
Ninth: Copenhagen, 1979
Congress Committee
Chairman: A. Cohen, University of Utrecht
Executive Secretary: M.P.R. van den Broecke, University of Utrecht
Congress Secretary: A.M. van der Linden-van Niekerk
Members: F.J. Koopmans-van Beinum, University of Amsterdam; S.G. Nooteboom, Leyden University / Institute of Perception Research (IPO), Eindhoven; L.C.W. Pols, University of Amsterdam
Plenary sessions
Opening Address: E. Fischer-Jørgensen
Keynote Address: G. Fant
1. Speech and hearing
R. PLOMP: Perception of speech as a modulated signal 5
MANFRED R. SCHROEDER: Speech and hearing: some important interactions 9

2. Relation between speech production and speech perception
L.A. CHISTOVICH: Relation between speech production and speech perception 15
H. FUJISAKI: Relation between speech production and speech perception 21

3. Can the models of evolutionary biology be applied to phonetic problems?
BJÖRN LINDBLOM: Can the models of evolutionary biology be applied to phonetic problems? 27
PETER LADEFOGED: The limits of biological explanations in phonetics 31

4. Psycholinguistic contributions to phonetics
W.D. MARSLEN-WILSON: Perceiving speech and perceiving words 39
WILLEM J.M. LEVELT: Spontaneous self-repairs in speech: structures and processes 43

5. Speech technology in the next decades
J.L. FLANAGAN: Speech technology in the coming decades 49
J.N. HOLMES: Speech technology in the next decades 59
PERCEPTION OF SPEECH AS A MODULATED SIGNAL
R. Plomp
Institute for Perception TNO, Soesterberg, and Department of Medicine, Free University, Amsterdam, The Netherlands

Acoustically, speech can be considered to be a wide-band signal modulated continuously in time in three different respects: (1) the vibration frequency of the vocal cords is modulated, determining the pitch variations of the voice, (2) the temporal intensity of this signal is modulated by narrowing and widening the vocal tract by means of the tongue and the lips, and (3) the tongue and the lips in combination with the cavities of the vocal tract vary the sound spectrum of the speech signal, which may be considered as a modulation along the frequency scale. For each of these three types of modulation, it is of interest to study the modulations present in the speech sound radiated from the mouth, the extent to which these modulations are preserved on their way from the speaker to the listener, and the ability of the ear to perceive them. The modulations can be studied by means of frequency analysis, resulting in frequency-response characteristics for the modulation frequencies. In this abstract I will restrict myself to the temporal and spectral modulations, that is, the horizontal and vertical axes of the speech spectrogram, leaving frequency variations of the fundamental out of consideration.

Modulation in time

The varying intensity of speech, given by the speech-wave envelope, can be analysed by means of a set of one-third octave bandfilters. Experiments (Steeneken & Houtgast, 1983) have shown that the envelope spectrum is rather independent of the speaker and his way of speaking. The envelope spectrum covers a range from about 0.5 Hz up to about 15 Hz, with a peak at 3 to 4 Hz. The latter value seems to be related to the number of words and syllables per second.

On its way from the speaker to the listener, the speech wave is usually affected by reverberation. The higher the modulation frequency and the longer the reverberation time, the more the modulations are smoothed. For instance, at a large distance from the speaker in a room with moderate reverberation (T = 0.5 sec), the intensity modulations at the listener's position are reduced by a factor of two for a modulation frequency of about 8 Hz (Houtgast et al., 1980). This implies that the fast modulations of speech are not preserved, resulting in a decrease of speech intelligibility.

In the same way as the transfer from the speaker to the listener, the ear of the listener can be described by a temporal modulation transfer function. Recent experiments (Festen & Plomp, 1981) have shown that the modulation cut-off frequency of the ear is, on the average, about 25 Hz. This value is considerably higher than both the upper limit of modulation frequencies present in the speech signal and the cut-off frequency of a typical room.
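The envelope analysis described above can be made concrete in a few lines. The following is a minimal sketch, not from the abstract, assuming NumPy/SciPy, a mono signal, and a single one-third-octave band: it band-filters the signal, extracts the intensity envelope, and returns the envelope modulation spectrum (peak expected near 3-4 Hz for running speech).

```python
# Minimal sketch of a temporal envelope modulation spectrum (assumed
# tooling: NumPy/SciPy; band edges and filter orders are illustrative).
import numpy as np
from scipy.signal import butter, sosfilt, sosfiltfilt

def envelope_modulation_spectrum(x, fs, band=(500.0, 630.0)):
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    band_sig = sosfilt(sos, x)
    intensity = band_sig ** 2                    # instantaneous intensity
    lp = butter(2, 30.0, btype="lowpass", fs=fs, output="sos")
    env = sosfiltfilt(lp, intensity)             # keep slow fluctuations
    env = env / env.mean() - 1.0                 # relative modulation depth
    spec = np.abs(np.fft.rfft(env)) / len(env)
    freqs = np.fft.rfftfreq(len(env), 1.0 / fs)
    return freqs, spec
```

The factor-of-two reduction quoted for T = 0.5 s at about 8 Hz is consistent with the standard room-acoustics modulation transfer function for exponential reverberant decay, m(F) = [1 + (2πFT/13.8)²]^(-1/2) (the Houtgast-Steeneken form, an assumption here rather than a formula from the abstract), which gives m ≈ 0.48 for those values.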
Modulation along the frequency scale

In the same way as for the fluctuations in time, the variations occurring in the sound spectrum as a function of frequency at a particular moment of the speech signal can be analysed by means of a set of one-third octave filters. Since these data are not available at the moment, we have to follow another way. We may expect that the spectral modulation transfer function has to be sufficiently high to resolve the two spectral peaks corresponding to the most important formant frequencies F1 and F2. From spectral data on Dutch vowels (Klein et al., 1970) we may decide upon 1 period per octave as a rough estimate of the upper limit of modulation frequencies present in speech signals.

The effect of reverberation on the transfer of the signal from the speaker to the listener (too often neglected in speech research) is, for constant sounds, comparable with introducing a standard deviation of 5.57 dB in the sound-pressure level of any harmonic (Schroeder, 1954; Plomp & Steeneken, 1973). If we may assume that this value also holds for the actual speech signal, it can be calculated that the variance introduced by reverberation is, roughly, about as large as the spectral differences when the same vowel is pronounced by different speakers.

The spectral modulation transfer function of the human ear can be determined in a way similar to that for the temporal modulation transfer function. Festen and Plomp (1981) found that the higher cut-off frequency of the ear is about 5 periods per octave. This value is much larger than the 1 period per octave we estimated for the speech spectrum.

Conclusions

The data presented above indicate that, both for temporal and spectral modulations, the human ear is sensitive to modulation frequencies exceeding those present in the speech signal. In indoor situations reverberation is the most important limiting factor in the perception of the speech-envelope variations in time. It can be argued that the high frequency-resolving power of the ear is not related to the perception of a single speech signal, but to the discrimination of speech interfered with by other sounds.

References
Festen, J.M. & Plomp, R. (1981). Relations between auditory functions in normal hearing. Journal of the Acoustical Society of America, 70, 356-369.
Houtgast, T., Steeneken, H.J.M. & Plomp, R. (1980). Predicting speech intelligibility in rooms from the Modulation Transfer Function. I. General room acoustics. Acustica, 46, 60-72.
Klein, W., Plomp, R. & Pols, L.C.W. (1970). Vowel spectra, vowel spaces, and vowel identification. Journal of the Acoustical Society of America, 48, 999-1009.
Plomp, R. & Steeneken, H.J.M. (1973). Place dependence of timbre in reverberant sound fields. Acustica, 28, 50-59.
Schroeder, M. (1954). Die statistischen Parameter der Frequenzkurven von grossen Räumen. Acustica, 4, 594-600.
Steeneken, H.J.M. & Houtgast, T. (1983). The temporal envelope spectrum of speech and its significance in room acoustics. To be published in Proceedings of the 11th International Congress of Acoustics, Paris.
SPEECH AND HEARING: SOME IMPORTANT INTERACTIONS
Manfred R. Schroeder
Drittes Physikalisches Institut, Universität Göttingen, FRG, and Bell Laboratories, Murray Hill, N.J., USA
Speech Quality and Monaural Phase

In the 1950s, when I first became interested in speech synthesis, I was almost immediately intrigued by the problems of subjective quality of synthetic speech. Vocoders had a reedy, electronic "accent" and I thought that the excitation waveform, consisting of sharp pulses, was perhaps to blame. To investigate this question more deeply, I built a generator for 31 coherent harmonics of variable fundamental frequency. The phase of each harmonic could be chosen to be either 0 or π - a total of 2^30 = 1,073,741,824 different waveforms, each of which appeared to have its own intrinsic timbre - their identical power spectra notwithstanding. (I wish Seebeck, Ohm and Helmholtz had had a chance to listen to these stimuli!)

For all phase angles set equal to 0, one obtains a periodic cosine-pulse. When this waveform is used as an excitation signal for a speech synthesizer, the result is the reedy quality already mentioned. By contrast, if one randomizes the phase angles, one gets a less peaky waveform and a mellower sound (Schroeder, 1959). A better-than-random choice for the phase angles (one that gives an even less peaky waveform) is given by the formula φ_n = πn²/N, where n is the harmonic number and N the total number of harmonics in the flat-spectrum stimulus. More general formulae, for arbitrary spectra and phase angles restricted to 0 or π, are given by Schroeder (1970).

The great variety of timbres producible by phase manipulations alone suggested to me that it might be possible to produce intelligible voiced speech from stimuli having flat (or otherwise fixed)
line spectra. Although this appears to contradict the most basic tenets of auditory perception of speech, "formant-less" speech has now been demonstrated in collaboration with H. W. Strube of Goettingen. To achieve this flat-spectrum (voiced) speech, we generate a synthetic speech signal which is symmetric in time and has maximum peak factor:

s(t) = Σ_{n=1}^{N} S_n cos(nωt),

where ω is the fundamental frequency and the S_n are the amplitudes of individual harmonics. The S_n are determined, for example, by linear predictive analysis of a natural speech signal. The S_n reflect its formant structure. To the speech signal s(t) is then added another signal, a(t), of like fundamental frequency, with harmonic amplitudes A_n and phases φ_n chosen such that the combined line spectrum of s(t) + a(t) is flat.
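Both constructions lend themselves to a compact numerical sketch. The code below is an illustration under stated assumptions (NumPy; unit-normalized harmonic amplitudes), not Schroeder's own implementation; in particular, the quadrature-complement rule for a(t) is an assumption inferred from the stated goal of a flat line spectrum, not a formula given in the abstract.

```python
# Sketch: (1) low-peak-factor complex with Schroeder phases pi*n^2/N;
# (2) a flat-spectrum signal whose formant information lives in the phases.
import numpy as np

def schroeder_complex(N=31, f0=100.0, fs=16000, dur=0.5):
    t = np.arange(int(fs * dur)) / fs
    n = np.arange(1, N + 1)[:, None]
    phases = np.pi * n**2 / N                 # better-than-random phases
    return np.sum(np.cos(2 * np.pi * f0 * n * t + phases), axis=0)

def flat_spectrum_speech(S, f0=100.0, fs=16000, dur=0.5):
    """S: harmonic amplitudes (0..1), e.g. from LPC analysis of speech."""
    S = np.asarray(S, dtype=float)[:, None]
    t = np.arange(int(fs * dur)) / fs
    n = np.arange(1, len(S) + 1)[:, None]
    A = np.sqrt(1.0 - S**2)                   # assumed complement amplitudes
    s = np.sum(S * np.cos(2 * np.pi * f0 * n * t), axis=0)
    a = np.sum(A * np.sin(2 * np.pi * f0 * n * t), axis=0)
    return s + a    # every harmonic has unit power: flat line spectrum
```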
Symposium 2
Units in speech synthesis
UNITS FOR SPEECH SYNTHESIS
Jonathan Allen
Research Laboratory of Electronics, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, USA

The goal of synthetic speech algorithms is to provide a means to produce a large (infinite) set of speech waveforms. Since this number is so large, the units cannot be the messages themselves, so we must look for ways to compose the message from a smaller number of units. In this summary the desiderata for these units are discussed.

1. Clearly there must be a relatively small number of units at each level. There is a well known tendency for longer and more complex units to be larger in number. Thus the word lexicon for any natural language is large, but the number of phonemes is small.

2. The choice of units must allow for the freedom of control to implement all significant aspects of the waveform. This leaves the question of what is significant, but it is clear that perceptual results should be the guide to such a choice. Historically these aspects have included spectral sections, including spectral prominences (formants), plus the prosodic structure of the speech.

3. In order to derive the necessary control for the desired units, it must be possible to analyze speech in terms of these units. If this can be done algorithmically (automatically), so much the better. Nevertheless, many units are deep abstractions and (so far) require human analysis. Examples of such analysis would include the discovery of syllable structure and the determination of stress from acoustic correlates.

4. Conversely, it must be possible to compose an utterance by interpretive processes on the selected units for synthesis. This consideration points out the dual aspects of internal versus external structure. There are divergent
opinions as to where to emphasize the overall structure in terms of synthesis units. Thus, for example, phonemically based synthesis systems have relatively little internal structure, but leave much structure to be provided interpretively at the boundaries. On the other hand, diphone based synthesis systems specify a relatively larger amount of information in terms of the internal structure of the unit, since the transition is internal to the unit, and there is intentionally a minimal amount of external boundary structure, and that
must be specified interpretively. This contrast points out the distinction between compiled and interpretive use of
units. When phonemes are used, there is relatively little compiled structure and more interpretive composition at boundaries. Diphones represent more compiled internal structure which can be represented lexically. These differences reflect the variety and the complexity of the compositional function used to join units together. They also emphasize the choice as to whether the knowledge (of transitions, say) is to be represented structurally (in terms of static lexical forms) or procedurally. That is, is the richness
of detail (phonetically) to be captured as intrinsics of the units themselves or in terms of the connective framework? The
nature of the compositional function is also highly connected to the problem of binding time. Clearly in the case of transitions, this information is bound early (and not later modified) for diphones, and hence the information can be considered to be compiled into structural lexical form at the time the overall system is built. On the other hand, the use of phonemes contemplates the late binding of the transitional information. This provides for more flexibility but requires that explicit procedures for transition specification be developed. For consonant-vowel and vowel-consonant transitions much work has been done, but in clusters and in vowel-vowel transitions, a great deal of further research must be completed. So far, we have thought of transitions in terms of the spectral part of the assumed source-filter model, but there is also the problem of transitions at the source level. Onset and offset of voicing must be specified (the compositional function is now far too simple: just voiced/unvoiced/silence), as well as mixed excitation.

5.
The units selected should be insightful and relevant to the accumulated
research literature. They should mark distinctions that are linguistically relevant. Here an important notion is similarity versus contrast. Within the class of utterances that correspond to a unit there should be a cohesion and similarity of internal structure, but between these units there should be a contrast in features. Thus phonology was invented to provide a notation for contrasts in meaning. It may of course be that there are new units and representational frameworks that are not related to existing understanding and yet are important for high quality synthesis. This is most likely to happen in areas where the need for structure is felt, but sufficient research has not yet been completed, as in the study of articulation.

6.
There are several levels of units that are important. These vary in scope, and address different aspects of structure. It is probably the case that all aspects of linguistic structure are reflected somehow in the acoustic waveform. The levels of structure currently recognized are: discourse, sentence, clause, phrase, word, morpheme, metrical foot, syllable, phoneme, and feature. These have all been found useful in linguistic analysis for reasons of distribution and contrast, but they each exhibit some intrinsic cohesion, and their properties normally place focus on the units themselves rather than on how they combine. The sequence of these units suggests a natural hierarchy and nesting in terms of size, but this is not necessary, as with the asynchronous behavior of features described in autosegmental phonology. Features such as nasality, for example, span several segments, and the control of this feature should probably not be thought of as being dominated solely at the phoneme level.

7.
The progression through the unit levels described above is gradual, thus maintaining fine gradations and providing for relatively direct transformations between these levels. Thus a representation of phonemes plus stress marks plus boundaries can be converted into a structure of allophones plus boundaries, which in turn can be converted to a generalized target framework including both the prosodic frame and the segmental target structure. These are then interpreted into a parameter string relevant to the vocal tract model to be utilized, which then interprets these parameters to form the final acoustic waveform. By providing such a large number of units, the jump between any two levels of representation can be made small enough such that a reasonable understanding can be developed. If there were not such a proliferation of units, then many meaningful generalities would be missed, but more importantly the means to convert from a linguistic representation to the waveform would have to be expressed in a hopelessly complex way. Each of these smaller transformations can be thought of as a procedural interpreter of the preceding framework level or levels. At each of these levels the units must be chosen in a way to capture all important generalities, thus simplifying the rule structure and increasing the overall modularity of the system.

8.
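The level-by-level organization just described reads naturally as a chain of small interpreters. A hypothetical sketch (all names and rule contents invented, not Allen's system) of that organization:

```python
# Each level is a small procedural interpreter of the preceding
# representation, so the jump between adjacent levels stays small.
def to_allophones(phonemes):      # phonemes + stress -> allophones
    return {"allophones": phonemes, "boundaries": []}

def to_targets(allo):             # allophones -> prosodic frame + targets
    return {"frame": "declarative", "targets": allo["allophones"]}

def to_parameters(tgt):           # targets -> parameter string for the
    return [("F1", 500), ("F2", 1500)] * len(tgt["targets"])  # tract model

def to_waveform(params):          # parameters -> final acoustic waveform
    return bytes(len(params))     # placeholder for real signal generation

def synthesize(phonemes_with_stress):
    rep = phonemes_with_stress
    for stage in (to_allophones, to_targets, to_parameters, to_waveform):
        rep = stage(rep)          # one small, well-understood transformation
    return rep
```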
The notion of target is central in this set of levels and transformations since it bridges between the abstract linguistic labels and the physical properties of the vocal tract model. The target may be thought of as a spectral contour, parameter value, fundamental frequency gesture or other more complex structure. The target represents an abstraction from the realizations of speech that represents the attributes felt to be sufficient to give rise to the desired percept. By implication, elements of an individual utterance that are needed to complete the realization are felt to be less important to the intended percept. The trend is towards more complex target structures, reflecting increased understanding of phonetic cues, and the integration of smaller targets into larger target frameworks. Thus, targets, like other units, have scope and intrinsic cohesion, and participate in composition functions which are largely as yet undiscovered. Simple addition of smaller targets into larger ones, as in the modification of phrase level fundamental frequency contours by segmental fundamental frequency gestures, is inadequate to describe the phonetic facts.

9.
Many of the units specify attributes that are necessary (usually at some abstract level) in order to realize the desired utterance. Thus they specify invariants over all possible realizations of the utterance. Nevertheless, there is a great deal of observed change, as in allophonic variation. What is true variability (not subject to law) and what is superficial variation due to context dependent composition functions is a major problem, but contemporary research reveals that much of this seemingly freely varying detail is deterministic, at least for a given talker. The trend is toward a richer set of interpretive processes operating on the linguistic abstractions to produce the rich set of phonetic detail observed at many levels. The management of redundant cues that integrate together to form complex percepts is a procedure requiring vast new amounts of data analysis and theoretical foundation.
Undoubtedly research in this area will lead to the formulation of a variety of highly complex target frameworks. Today's synthesis utilizes only a few robust targets, leading to an awkward lack of naturalness reflecting itself in the rigidity that current synthetic speech demonstrates. While there is hope to find some invariance in terms of articulatory processes, the process of integrating cues to form percepts may itself be such a fundamental invariant. From today's perspective, the possibility of such a procedural discovery seems remote, awaiting a vastly increased amount of research and understanding.

These observations are intended to characterize the general process guiding the selection of units for speech synthesis. The role of these units must be clearly stated together with a complete view of the total synthesis procedure that exploits them. There's a tendency to be guided by technological constraints, which may permit greater use of memory (compiled strategies) or real time processing (interpretive strategies). What is more important, however, is the provision of a correct theory fully parameterized to reflect the naturally occurring facts. The technology will easily rise to carry these formulations into practical real systems.
REMARKS ON SPEECH SYNTHESIS
Osamu Fujimura
Bell Laboratories, Murray Hill, N.J., USA
Segmental Units

Speech synthesis by rule, since the pioneering work by Holmes, Mattingly and Shearme [1964], has always been based on the principle of concatenating some segments, with prosodic modulations. Most systems use phonemic/phonic segments. The basic idea is to assume a single target value for each of the control variables per segment, and connect these sample values smoothly by rule (see [Liberman et al., 1959] for an early discussion of this concept). Obviously, the rules that generate control time functions must do more than simply (e.g. linearly) connect the specified sample points, and the phone inventory must carry information about transitional characteristics of the given (type of) segments as well as target values. Also, more than a single target is often needed per phoneme-size segment. In the efforts to improve the quality of the output speech, it was found necessary for the rules to look several segments ahead in the string before they could decide which parametric values to use. For example, the duration (as well as quality) of the vowel [aɪ] in [paɪnt] crucially depends on the tenseness of the final consonant. In addition to this long-range context sensitivity, there are other variations of segmental properties that escape explanation by any simple notion of (hard) coarticulation [Fujimura & Lovins, 1978]. Adopting phonic diads (i.e. diphones) as the concatenative units [Dixon, 1968] solves part of this problem. The number of entries in the inventory has to be relatively large, but the transitional elements can be specified effectively by entering only endpoint values for the LPC pseudo-area
parameters [Olive, 1977]. This assumes that all transitions are approximated by straight lines, the endpoints being coincident in time among different parameters (see below for further discussion). Furthermore, the relatively stationary patterns, for example in the middle of a vowel portion (or consonantal constriction), are effectively represented by straight lines connecting the endpoints of the transitions into and out of the vowel (or consonant).

While this approach is actually very effective, given a very limited storage for the inventory, it has inherent problems. For example, a given sequence of phonemes, say /st/, manifests different phonetic characteristics depending on the roles of the segments as syllable constituents. Thus the stop is usually aspirated in 'mistime', but not much in 'mistake' (see [Davidsen-Nielsen, 1974]). Even if this problem is solved by expanding the inventory to include transitions into and out of a syllable boundary for each phoneme/phone, we still have the same problem as the phonemic approach with respect to the long range phenomena.

Interestingly, most of the long range phenomena are contained within the syllable. There are many ad hoc (i.e. non-coarticulatory) tautosyllabic interactions (see Fujimura & Lovins, ibid.) but almost none (assuming that the ambisyllabicity problem [Kahn, 1976] is treated correctly) across syllable boundaries, apart from what we can describe as prosodic effects on segmental quality (see below). For this reason, we advocate the use of syllables, or more practically demisyllables with phonetic affixes, for concatenative synthesis [Fujimura, Macchi, and Lovins, 1977]. Each syllable is divided into an initial demisyllable, a final demisyllable, and optional (in word-final position) phonetic affixes. The crucial information about transitions is totally preserved. The final demisyllable contains the quasi-stationary part of the vowel, and the strong effects of the final consonantal features are well represented throughout the demisyllable. On the other hand, there is little interaction between initial and final demisyllables, with respect to both linguistic distributional patterns and phonetic quality (apart from coarticulation). Therefore, the quality of demisyllabic synthesis is about the same as that of syllabic synthesis, and the inventory size is considerably smaller (around 800 items in our current version for English, as opposed to more than 10,000 for syllables).
The intelligibility of monosyllabic words presented in isolation is virtually limited by the LPC process itself (see Lovins et al., 1979).
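The demisyllabic decomposition is easy to picture with a toy example. The sketch below is hypothetical (invented vowel set and segmentation heuristic, not Fujimura's procedure); it splits a syllable at the vowel into an initial demisyllable, a final demisyllable, and optional word-final affixes.

```python
# Toy demisyllable decomposition: initial demisyllable = onset + vowel
# onset; final demisyllable = vowel steady state + coda; plus affixes.
VOWELS = set("aeiou")  # illustrative vowel inventory

def decompose(syllable, affixes=()):
    nucleus = next(i for i, ch in enumerate(syllable) if ch in VOWELS)
    initial = syllable[:nucleus + 1]
    final = syllable[nucleus:]
    return [("I", initial), ("F", final)] + [("A", a) for a in affixes]

# decompose("stand", affixes=("z",))   # 'stands'
# -> [('I', 'sta'), ('F', 'and'), ('A', 'z')]
```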
A parametric schematization, as in Olive's approach, should be effective for demisyllabic patterns also.

Is Concatenation of Concrete Units Workable?

As an abstract representation of speech, a string of words seems basically an appropriate picture, albeit obviously not complete. One can argue about the validity of phonemic segments [Chomsky & Halle, 1968], but there must be some purely phonological unit that gives us a useful picture of sound streams. Just as syntax represents a sentence by a tree structure of phrasal units, phonology may specify for a syntactic unit, a word, a phrase, or a sentence, a structural organization of phonetic units.
This phonological structure, however, must relate segmental features to
suprasegmental elements. This can be done via a separate structure (metrical grid) [Liberman and Prince, 1977], or by a more direct linking (autosegmental theory), see [Goldsmith 1977]. There has been remarkable activity in this area of theoretical linguistics, some based on experimental and quantitative work, in an effort to establish an integral theory of phonology and phonetics [Pierrehumbert, 1980] [Liberman & Pierrehumbert, to appear]. The basic problem that has to be solved is how to describe speech in those aspects where the concatenative (linear) models fail. The more accurately and systematically we observe speech phenomena, given now available experimental tools, the more immediate our concerns become. We now understand, at least in part, why speech synthesis schemes invariably sound machine-like rather than human, and why automatic recognition schemes fail to identify even isolated words that are nearly perfectly identifiable for human native speakers. All the existing schemes, including diadic and demisyllabic synthesis, template matching or feature-based recognition systems, use some concatenative segmental units in concrete forms, or even, in some cases, such as dynamic programming, assume preservation of finely divided subsegments (sample frames), as nearly invariant units. According to this principle, all variables characterizing speech signals (except pitch) have to move concurrently, when prosody dictates temporal modulation and other suprasegmental modifications. Furthermore, apart from simple smoothing (hard coarticulation) alterations are implemented only through a symbolic selection between alternative forms, such as allophones. This picture is incorrect, in our opinion.
A Multidimensional Model

In our recent work on articulatory movements as observed by the x-ray microbeam, we found that articulators moved somewhat independently from each other [Fujimura, 1981]. Thus the notion of "simultaneous bundles" [Jakobson, Fant and Halle, 1963] does not apply literally to concrete physical phenomena. At the same time, we observed that there were relatively stable patches of movement patterns in the time domain of the multi-dimensional articulatory movements for speech utterances. These relatively invariant local movement patterns (which we called "iceberg patterns") represent crucial transitions with respect to the demisyllabic identity, presumably being responsible for perceptual cues for consonant (place) identification. The patterns of articulatory movements between icebergs, including quasi-stationary gestures for vowels, as well as the relative timings of icebergs, seem highly variable, depending on various factors such as stress, emphasis, global and local tempo.

One important implication of this multidimensionality is that the articulatory interaction between syllable nuclei may be, to a large extent, independent from intervening consonantal gestures. The tongue body gestures for different vowel qualities seem to undergo intricate assimilation-dissimilation processes. In some languages, such as Portuguese, effects of such vocalic interaction are so strong that they have been described with symbolic descriptions, phonologically (metaphony) and phonetically (vowel gradation) [Maia, 1980]. In other words, there are different phonetic processes that affect different aspects of sound units selectively. Vowel (or consonant) harmony is an extreme case, and hard coarticulation is the other extreme, but both affect a particular articulatory dimension representing a particular phonological feature, rather than the segmental unit as a whole. What precisely governs such interactions, unfortunately, is largely unknown in most cases. In the domain of symbolic representation, linguists in the newly emerging field called nonlinear phonology are addressing themselves to closely related theoretical problems (see, in addition to Liberman and Prince, Goldsmith, ibid., [Williams 1976] [Halle and Vergnaud 1980] [Selkirk 1980] [Kiparsky 1982]).
The extent of variability of, for example, vowel quality, either in terms of formant frequencies or articulator positions, is remarkably large: for the same phoneme in exactly the same segmental environment (i.e. word sequence), different emphases, phrasing, or other utterance characteristics can easily affect the vowel quality by a third of the entire range of variation (for all vowels) of the given variable (say F2). Iceberg patterns, however, are affected relatively little as long as they belong to the same (phonological) demisyllable (probably with some qualifications that are not clear at this time).

Whether the iceberg model is correct or not, the point is that speech is highly variable in a specific way. Until we understand precisely in what way speech is variable and what influences different aspects of speech signals, we obviously cannot achieve a satisfactory synthesis (or recognition) system. We are only beginning to gain this understanding. An adequate articulatory model, which we still have to work out, may be a particularly powerful tool of research from this point of view. Ultimately, there could well be a way to handle acoustic parameters, such as formant frequencies, directly by appropriate rules, in interpreting phonological, symbolic representations of speech for synthesis. But in the meantime, we should exploit whatever we have as the research tool, in order to understand the key issues involved. Likewise, a machine also ought to be able to approach the human performance in speech recognition based on acoustic signals only. As the history of speech research clearly shows, analysis (observation/interpretation) and synthesis must go hand in hand. Also, theory and experimentation, including well guided designs of concrete, potentially practical systems, are bound to help each other.
REFERENCES
Chomsky, N. and Halle, M. (1968). The Sound Pattern of English. New York: Harper & Row.
Davidsen-Nielsen, N. (1974). Syllabification in English Words with Medial sp, st, sk. J. Phonetics 2: 15-45.
Dixon, N.R. (1968). Terminal Analog Synthesis of Continuous Speech Using the Diphone Method of Segment Assembly. IEEE Trans. Audio and Electroacoustics AU-16(1): 40-50.
Fujimura, O. (1981). Temporal Organization of Articulatory Movements as a Multidimensional Phrasal Structure. Phonetica 39: 66-83.
Fujimura, O. and Lovins, J.B. (1978). Syllables as Concatenative Phonetic Units. In A. Bell and J.B. Hooper (eds.), Syllables and Segments. Amsterdam, Holland: North-Holland Publ. Co.
Fujimura, O., Macchi, M.J. and Lovins, J.B. (1977). Demisyllables and Affixes for Speech Synthesis. In Cont. Pap. 1:513, Proceedings 9th Int. Congr. on Acoustics, Madrid, Spain.
Goldsmith, J. (1976). An Overview of Autosegmental Phonology. Ling. Anal. 2: 23-68.
Halle, M. and Vergnaud, J.-R. (1980). Three-Dimensional Phonology. J. Ling. Res. 1: 83-105.
Holmes, J.N., Mattingly, I.G. and Shearme, J.N. (1964). Speech Synthesis by Rule. Lang. Speech 7(3): 127-143.
Jakobson, R., Fant, C.G.M. and Halle, M. (1963). Preliminaries to Speech Analysis: The Distinctive Features and their Correlates. Cambridge, MA: MIT Press.
Kahn, D. (1976). Syllable-based Generalizations in English Phonology. (Doctoral dissertation, MIT, 1976.) Available from Indiana Univ. Ling. Club, Bloomington, IN.
Kiparsky, P. (1982). From Cyclic Phonology to Lexical Phonology. In H. van der Hulst and N. Smith (eds.), The Structure of Phonological Representations. Dordrecht, Holland: Foris Publications.
Liberman, A.M., Ingemann, F., Lisker, L., Delattre, P. and Cooper, F.S. (1959). Minimal Rules for Synthesizing Speech. JASA 31(11): 1496-1499.
Liberman, M.Y. and Pierrehumbert, J. (to appear). Intonational Invariance Under Changes in Pitch.
Liberman, M.Y. and Prince, A. (1977). On Stress and Linguistic Rhythm. Ling. Inq. 8: 82-90.
Lovins, J.B., Macchi, M.J. and Fujimura, O. (1979). A Demisyllable Inventory for Speech Synthesis. In Speech Communication Papers, presented at the 97th Mtg. of ASA, Cambridge, MA, 1979: pp. 519-522.
Maia, E.A. DaMotta (1981). Phonological and Lexical Processes in a Generative Grammar of Portuguese. (Doctoral dissertation, Brown Univ., 1980.)
Olive, J.P. (1977). Rule Synthesis of Speech from Dyadic Units. Conference Record of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Hartford, CT, 1977: pp. 568-570.
Pierrehumbert, J. (1981). The Phonology and Phonetics of English Intonation. (Doctoral dissertation, MIT, 1980.)
Selkirk, E. (1980). The Role of Prosodic Categories in English Word Stress. Ling. Inq. 11: 563-605.
Williams, E.S. (1976). Underlying Tone in Margi and Igbo. Ling. Inq. 7: 463-484.
UNITS IN SPEECH SYNTHESIS
J.N. Holmes
Joint Speech Research Unit, Cheltenham, UK
In this discussion I am assuming that the purpose of speech synthesis is for machine voice output, rather than for phonetic research or for the receiving end of an analysis/synthesis telephony link.

There is currently a wide variety of units used in speech generation for machine voice output. Systems using stored human speech waveforms are only suitable for very restricted applications, but the advent of vocoder techniques using digitally stored control signals opened a new dimension to stored word methods. It became possible to modify the timing and fundamental frequency of stored vocoder control signals to get the desired prosodic form of sentences. The development of very compact low-cost devices for LPC-vocoder synthesis, and the fairly modest storage requirement for the control signals for a moderate sized vocabulary, has currently given the stored-vocoder-word approach great economic importance. Although vocoder distortion is still noticeable, the speech quality is generally adequate for many practical applications.
Stored coded words or larger units of human speech are obviously not suitable for arbitrarily large vocabularies, but the co-articulation problems are too severe for it to be acceptable to use units corresponding to single phonemes. The alternative is to exploit the co-articulatory effects produced by a human speaker as far as they affect adjacent phonemes. Boundaries of the stored units can then be made in the middle of the acoustic segments associated with the phonemes. With this method it should be possible to make up all messages from a number of units of the order of the square of the number of significantly different allophones in the language. Different variants of this approach have been investigated, and the terms 'dyad' (Olive, 1977), 'diphone' (Schwartz, Klovstad, Makhoul, Klatt and Zue, 1979) and 'demi-syllable' (Browman, 1980) have been used.

With these smaller units, human speech analysis can be used only to provide the spectrum envelope information for the appropriate sounds and their transitions: the durations and fundamental frequency information must be calculated completely by rule to achieve the desired prosodic structure. Use of spectral envelope patterns with rule-generated prosody is undoubtedly successful to a first order, but there are several factors that will limit the performance. These include the inherent distortion of the vocoder representation, the fact that co-articulation cannot easily be varied with segment duration, and the problems of selection of a suitable set of units from human utterances.
A completely different approach to speech synthesis is to use a system of rules for the entire process of converting from a linguistic specification to the acoustic signal. Human speech is then not used at all except to assist in determining the rules. Rules need to be applied at several levels, but the higher levels are essentially the same as are needed for systems using very small units of coded human speech, discussed above.
At the lowest level the rules have to deal with the conversion of a detailed phonetic specification (including pitch and timing information for each phone) into a waveform. When humans learn to speak, the criterion of success is inherently auditory, and hence it seems that an acoustic-domain model for synthesis, aiming merely to copy the most significant auditory features, is completely adequate. It has already been demonstrated (Holmes, 1973) that a fairly simple parallel-formant model of speech production can produce a very close approximation to human speech, adequate for all practical purposes. Although minimal rules for synthesizing speech with this model would probably be more complicated than for an articulatory model, it is much easier to derive such rules by study of human speech, because the information provided by spectrographic analysis is directly related to the way the input to a parallel-formant synthesizer is specified. I therefore very strongly believe that the future of phonetic synthesis by rule is, for good practical reasons, firmly wedded to formant synthesis.
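A parallel-formant model of the kind referred to here can be sketched compactly. The following is a toy illustration (invented parameter values and a rough gain normalization, not the JSRU synthesizer): three second-order resonators driven in parallel by a common excitation, each with its own amplitude control.

```python
# Toy parallel-formant frame: sum of independently weighted formant
# resonators (second-order IIR sections) sharing one excitation signal.
import numpy as np

def resonator(x, fc, bw, fs):
    r = np.exp(-np.pi * bw / fs)              # pole radius from bandwidth
    a1, a2 = -2 * r * np.cos(2 * np.pi * fc / fs), r * r
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        y[n] = (1 - r) * x[n] - a1 * y[n - 1] - a2 * y[n - 2]
    return y

def formant_frame(excitation, formants, amps, fs=10000):
    return sum(a * resonator(excitation, fc, bw, fs)
               for (fc, bw), a in zip(formants, amps))

# e.g. formant_frame(pulses, [(500, 60), (1500, 90), (2500, 120)],
#                    [1.0, 0.6, 0.3])
```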
The specification of input for the phonetic level is most conveniently supplied in terms of allophones. In a practical synthesis system the only penalty for freedom in regarding phones in special phonetic combinations as distinct allophones is the increase in the number of allophone specifications, with corresponding storage requirements. The table-driven system of Holmes, Mattingly and Shearme (1964) for Received Pronunciation (RP) had an extremely simple rule structure and very few special allophones for any of the RP phonemes. Based on experience of that system, I think it is very unlikely that the number of entries in the rule synthesis tables would have to be very large to deal with all significantly different allophones of RP. The inclusion of a separate stage in the synthesis process to select the correct allophone according to phonetic environment would be a negligible extra computational load. I thus believe that the units for synthesis at the phonetic level should be allophones, of which there would not have to be more than 100-150 in most languages. At the input to the allophone selection stage the units would be phonemes.
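The allophone-selection stage proposed here is naturally table-driven. A hypothetical sketch (the rule table is invented for illustration, not from the JSRU tables):

```python
# Phonemes in, context-dependent allophones out; a negligible extra
# computational load, as argued above. Rules here are illustrative only.
VOWELS = set("aeiou")
RULES = [
    # (phoneme, context predicate on (prev, next), allophone)
    ("t", lambda prev, nxt: prev == "s", "t_unaspirated"),
    ("t", lambda prev, nxt: (nxt in VOWELS) if nxt else False, "t_aspirated"),
    ("l", lambda prev, nxt: nxt is None, "l_dark"),
]

def select_allophones(phonemes):
    out = []
    for i, p in enumerate(phonemes):
        prev = phonemes[i - 1] if i > 0 else None
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        allo = next((a for q, pred, a in RULES
                     if q == p and pred(prev, nxt)), p)   # default: itself
        out.append(allo)
    return out

# select_allophones(list("stil")) -> ['s', 't_unaspirated', 'i', 'l_dark']
```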
The best performance produced so far by any formant synthesis-by-rule system would never normally be mistaken for human speech. However, in contrast to systems using a concatenated selection from a small number of coded human segments, it should be possible to devise rules to approximate very closely to natural speech. Moreover, with the same principle it should also eventually be possible to produce any different voice qualities (man, woman, child etc.) speaking with the same accent, by modifications to the synthesizer parameters and/or by transformations to the synthesizer control signals, but without having to modify the phonetic rules. I therefore expect that in due course this approach will completely displace the stored human speech methods for all applications except those with a few fixed messages.
There is so much difference between the prosodic patterns of different languages (and even between different accents within the same language), that the structure of prosodic rules needs to be to a large extent language-specific. Very successful systems have been developed for both pitch and duration rules of some languages; for example there are systems for English for which the input consists of phonemes, stress marks and punctuation. The stress marking for such a system corresponds fairly well to the type of stress marking used by phoneticians in phonemic transcription, and punctuation marks control the pauses and selection of certain types of intonation contour for particular stressed syllables. In principle it should be possible to refine this approach to approximate as closely as desired to the performance of a skilled human in reproducing the prosody of an utterance from an appropriately marked phonetic transcription.
The next level up in the hierarchy of programs for synthesis depends on the application. In a reading machine for the blind it is obviously necessary to be able to accept conventionally spelled text, and this brings all the problems of converting from spelling to pronunciation (this is particularly difficult for English, and a pronouncing dictionary for many of the words is an essential part of any effective system). Lexical stress assignment within words is sometimes dependent on the syntactic function of the word in a sentence (such as for the English word 'contract'). The level of stress used in any word in a sentence will also vary with the semantics, dependent on the desire to emphasise certain words to make a distinction of meaning. Subtle variations of this type would require a reading machine to have general linguistic knowledge approaching that of a human reader, but more modest systems that can give very useful speech output from text are now clearly practicable (Allen, Hunnicutt, Carlson and Granström, 1979).

For announcing machines in various types of information services the obvious approach is for an operator of the system to specify the message components in textual form. To reduce the need for operator training, it is advantageous for conventional orthography to be used where possible, but in this application there is no difficulty in correcting occasional pronunciation anomalies by ad hoc addition of exceptions to a pronouncing dictionary, by deliberate mis-spelling to facilitate pronunciation by rule, or by embedding phonetic symbols within a conventional orthographic sequence. Assignment by rule of both lexical and semantic stresses can also be over-ridden by the operator whenever necessary to achieve special effects. Parts of messages so constructed may subsequently be re-arranged entirely automatically, but such changes will not normally involve phonemic or stress changes within the message parts.
At a slightly higher level the 'speech from concept' approach of Young and Fallside (1979) could be used to go straight from the concept to the input of the prosodic synthesis program. In systems of this type there is no point in using conventional orthography at any stage in the synthesis process, and knowledge of the concept obviously defines the stress assignments needed to represent particular semantic interpretations of the words.

I expect these three high-level input techniques for large-vocabulary synthesis will continue to have their place for appropriate applications, but ultimately they will nearly all use formant synthesis by rule for the low-level stage.
References

Allen, J., Hunnicutt, S., Carlson, R. and Granström, B. (1979). MITalk-79: The 1979 MIT text-to-speech system. In: Speech Communication Papers Presented at the 97th Meeting of the Acoustical Society of America (Wolf, J.J. and Klatt, D.H., Eds.), Acoustical Society of America, New York, pp. 507-510.

Browman, C.P. (1980). Rules for demisyllable synthesis using Lingua, a language interpreter. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Denver, Co., 561-564.

Holmes, J.N. (1973). The influence of glottal waveform on the naturalness of speech from a parallel formant synthesizer. IEEE Transactions on Audio and Electroacoustics AU-21, 298-305.

Holmes, J.N., Mattingly, I.G. and Shearme, J.N. (1964). Speech synthesis by rule. Language and Speech 7, 127-143.

Olive, J.P. (1977). Rule synthesis of speech from dyadic units. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Hartford, Ct., 568-570.

Schwartz, R., Klovstad, J., Makhoul, J., Klatt, D. and Zue, V. (1979). Diphone synthesis for phonetic vocoding. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Washington, D.C., 891-894.

Young, S.J. and Fallside, F. (1979). Speech synthesis from concept: a method for speech output from information systems. Journal of the Acoustical Society of America 66, 685-695.
UNITS IN TEXT-TO-SPEECH SYSTEMS
Rolf Carlson and Björn Granström
Department of Speech Communication and Music Acoustics, KTH, Stockholm, Sweden

Current text-to-speech systems make use of units of different kinds depending on the goals involved. If the purpose is adequate output only, the internal structure and the choice of units is, to some extent, arbitrary. The choice of units in a research system, with the ambition to model the speech production process adequately, might be quite different. The paper deals with this problem and makes some comparisons to research in speech recognition.

INTRODUCTION

Speech synthesis systems have existed for a long time. About ten years ago the first text-to-speech systems were designed. The speech quality has been improved to a high degree, but the progress has been gradual. Adjustments of parameter values or formulations of linguistic rules have been the base for this work. It is important to note that the progress has been made within the already established framework, and broader views have been exceptions. In the whole text-to-speech process many units are conceivably of interest, ranging from concepts to motor commands given to individual articulators. The title of this session, "Units in Speech Synthesis," suggests that we could select some specific units on which a synthesis system could be based. Rather than taking on the task of deciding on the "best" unit or set of units, we want to briefly discuss some considerations in designing a text-to-speech system. The choice of units will be shown to be highly dependent on the goal of the synthesis project.
A TYPICAL TEXT-TO-SPEECH SYSTEM

Let us, as a starting point, describe a typical text-to-speech system. The system is meant to accept any kind of text
input. The first task is to generate a phonetic transcription of the text. This can be done by a combination of rules and lexical lookups. The lexicon can either be built on a word-by-word basis or can use some kind of morph representation. Optionally, a grammatical analysis is performed and syntactic information included. At the next stage, the phonetic representation is transformed to control parameters that are adjusted depending on context, prosodic information and type of synthesizer. The acoustic output is generated with the help of an LPC, formant or articulatory model. Let us, already at this point, conclude that a system of this kind is a sequential system to a very high degree. The text is successively transformed into a speech wave.
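The first stage of this pipeline, text to phonetic transcription via lexicon plus rules, can be sketched as follows. Everything here (lexicon entries, rule set, notation) is invented for illustration, not taken from any particular system:

```python
# Toy grapheme-to-phoneme stage: word-by-word lexicon lookup with
# ordered letter-to-sound rules as the fallback path.
LEXICON = {"speech": "s p i: tS", "the": "D @"}
LTS_RULES = [("ch", "tS"), ("ee", "i:"), ("s", "s"), ("p", "p"),
             ("t", "t"), ("h", "h"), ("e", "E")]   # longest match first

def letter_to_sound(word):
    out, i = [], 0
    while i < len(word):
        for graph, phone in LTS_RULES:
            if word.startswith(graph, i):
                out.append(phone)
                i += len(graph)
                break
        else:
            i += 1                                  # skip unmatched letters
    return " ".join(out)

def transcribe(text):
    return [LEXICON.get(w, letter_to_sound(w)) for w in text.lower().split()]

# transcribe("the speech") -> ['D @', 's p i: tS']
```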
Systems of this kind have been designed at several research institutes: Allen et al. (1979), Carlson and Granström (1976), Coker et al. (1973), Cooper et al. (1962), Fujimura (1976), Klatt (1982), Mattingly (1968) and Olive (1977). Some are now produced for different applications. The output is often quite far from natural human speech, but nevertheless understandable due to the eminent capability of man to cope with a great variety of dialects under poor signal-to-noise conditions.

UNITS FOR WHAT?

Synthesis systems may sometimes be divided into two groups with different goals:

I. Systems that can be commercialized and used in applications.
II. Research systems that model our present knowledge of speech production.

There is not a distinct borderline between these groups. Presently there is no guarantee that a system of type II will produce better speech than those of type I. Consider, for example, the present level of articulatory models. Much more has to be done before the output can be compared to the best
terminal analog systems. This will probably change in the future. Another example is the use of syntactic information. It is obvious that information of this kind can be used to gain higher speech quality. An erroneous syntactic analysis is, however, often worse than no analysis at all.

An LPC-based system of type I forms a rigid system with limited flexibility and often produces an uneven speech quality. Rules for coarticulation and reduction are much harder to implement in this framework than in an articulatory model. On the other hand, rules of this kind are presently only partially known. This is compensated for by using a variety of allophones or diphones that approximates the variations found in human speech.

The designer of systems of type I often takes a very pragmatic view on the selection of units. The availability of information, the ease of gathering speech data, computational efficiency, hardware cost, etc., are important considerations. Any kind of feature, diphone or parameter could be a helpful unit. The problem is rather to choose the optimal description at a certain point in the system.

In a system of type II, the choice of units is obviously important since it reflects the researcher's view on speech production. For example, a system exclusively based on features and their acoustical correlates forms a strong hypothesis and generates important research activity, but does not necessarily produce the highest speech quality.

MACHINES AND MODELS

Computers have had an enormous impact on the present society, including speech research. A computer program is normally a single sequential process. This has, to a high degree, guided the development of speech synthesis systems. No system makes use of extensive parallel processing, physiological or auditory feedback. No system compensates for noise level or the quality of the loudspeaker. If such processes were included, the choice of units would be different. The dynamic properties of speech production would have been emphasized. In contrast, speech recognition research has, to some degree, chosen an alternative way by exploring analysis-by-synthesis techniques and a combination of top down - bottom up processes.
UNITS AND DIMENSIONS

Despite the great number of units that could be chosen in speech synthesis research, the approach is still "two-dimensional" to a very high degree: ordered rules, linguistic tree structures, formant versus time trajectories, sequential phonemes, allophones or diphones. Articulatory models are sometimes important exceptions. The multidimensionality that clearly exists in speech production has to be explored. Conflicting cues have been studied in speech perception research, and must also be considered in speech production.

FUTURE

Present synthesis systems produce output that is fairly intelligible. Considering the difficulty of the task of generating speech from an unknown string of letters, the quality of the speech is astonishingly high in some systems. This partial success could have a negative effect for the near future. It does not immediately force synthesis research to explore new approaches and to investigate more complete models of the speech production process.

We may learn from research on speech recognition, which has had a different evolution compared to speech synthesis. At a very early stage it was realized that recognition based on pattern matching techniques was inadequate to solve the general problem. Thus, in parallel, complex systems, designed for recognition of continuous speech, started to be developed. It was clear that all aspects of speech communication had to be included.

We now face a scientific dilemma. It is possible to make more shortcuts in speech synthesis compared to speech recognition. Does this mean that the effort to generate high quality synthetic speech will be reduced? The number of papers in scientific journals and at conferences dealing with speech synthesis has been drastically reduced. A reorientation of speech synthesis efforts appears to be needed to make it again an interesting and active field of research.
Carlson and Granstrom: Units in text-to-speech systems
135
REFERENCES Allen, J., Hunnicutt, S., Carlson, R., and Granstrom, B. (1979). MITalk-79. The MIT text-to-speech system, in J.J. Wolf and D. H. Klatt (Eds) Speech Communication papers presented at the 97th Meeting of the Acoustical Society of America, published by the Acoustical Society of America. Carlson, R. and Granstrom, B. (1976). A text-to-speech system based entirely on rules. Conf. Rec. 1976 IEEE International Conf on Acoustics, Speech, and Signal Processing. Coker, C.H., Umeda N. and Browman C.P. (1973). Automatic Synthesis from Ordinary English Text, IEEE Audio and Electroacoustics, AU-21,293-397 Cooper, F., Liberman, A., Lisker, L. and Gaitenby, J. (1962). Speech Synthesis by Rules. Paper F2. Speech Communication Seminar. Stockholm. Fujimura, 0., (1976). Syllables as Concatenated Demisyllables and Affixes. 91st Meeting of the Acoustical Society of America. Klatt, D.H. (1982) The KLATTalk Text-to-speech Conversion System. IEEE Int. Conf. on ASSP, IEEE Catalog No. 82CH1746-7, 1589-1592 Mattingly, I. (1968) Synthesis by Rule of American English. Supplement to Status Report on Speech Research at Haskins Laboratories, New Haven, Conn. USA Olive, J.P. (1977) Rule Synthesis from Diadic Units. IEEE Int. Conf. on ASSP, IEEE Catalog No. 77CH1197-3 ASSP, 568-570
137 LINGUISTIC UNITS FOR FO SYNTHESIS Janet Pierrehumbert Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974 1.
Introduction
In this note, I will summarize some results on the structure of the English intonation system and relate them to rules for TO synthesis.
So far, speech
synthesis programs have not used the full repertoire of English intonation. However, study of the complete system can help guide the choice of units and the formulation of implementation rules. Thus, the first part of the paper will be devoted to theoretical description and the latter part to practical issues.
Comparisons to other proposals in the literature can be found in
Pierrehumbert (1981). 2.
A Phenomenology
2.1 Melody, Stress, and Phrasing In English,
the words which
comprise
an utterance
constrain
but
do
not
determine its intonation. Three dimensions of choice are melody, stress, and phrasing. The same text, with
the same
stress pattern, can be produced
with many
different melodies. For example, Figure 1 shows three of the many FO patterns that could be produced on the monosyllable "Anne". 1A is a declarative, 1B is a question, and 1C could convey incredulity.
Conversely, the same melody can
occur on many different texts. These basic observations have led both British and American linguists to treat melodies as autonomous constructs which are coordinated with the text. The coordination of melody and text is determined by the stress pattern and the
phrasing.
Some
features
of
the melody
are
associated
with
stressed
syllables, while others are associated with the intonation phrase boundary. The FO configurations which are (optionally) assigned to stressed syllables are "pitch accents". The English pitch accents include the peak in 1A, the valley in 1B, and the scoop, or valley plus peak, in 1C. In a polysyllabic word, these configurations attach themselves to the main word
stress. In
138
Units in speech synthesis
ANNE.
ANNE?
ANNE !
Figure 1
300 N X
S 150 S I
100
THE CARDAMOM BREAD WAS PALATABLE
Figure 2
Pierrehumbert: Linguistic units for FO synthesis multiword
utterances,
attachment.
the
In Figure
phrasal
139
stress
2, for example,
subordination
focus on the
single scoop accent to the front of the sentence.
controls
their
subject has moved
the
The relation of stress to
melody also shows up in a different form. The more stressed syllables there are in a phrase, the more opportunities for a pitch accent. Thus, longer phrases can have more complex melodies than shorter ones. Figures 1 and 2 also exemplify some of the options in the treatment of the end of the phrase. The rise at the end of contour 1C remains at the end when the scoop accent is moved leftward, as in 2. The terminal low value in 1A would behave
similarly.
Multi-word
utterances
can
be
divided
into
several
intonational phrases, each with a phrase-final tonal configuration at its end. The boundary between one phrase and the next can be marked with a pause or lengthening
of
the
final
syllable.
It
is
important
to
note
that
the
intonation phrase is a unit of phonological rather than syntactic description. Intonation phrases are not always syntactic constituents.
The same sentence,
with the same syntax, can often be produced with more than one
intonational
phrasing. 2.2 What controls the variants observed? The
previous
section described
options in the English intonational
system.
What governs choices among these options in natural speech? A
tradition
exemplified
by
Halliday
(1967)
defines
intonation
phrases
as
informational units, or "sense groups". Selkirk (1978) attempts to predict the set of permissible phrasings as a function of the syntax.
What controls the
choice among alternative phrasings is not well understood. The
phrasal
presupposed
stress
pattern
is
in the discourse.
a
function of what
is
focused
and
what
is
The FO contour in Figure 2, for example, is
only appropriate if "palatable" is in some sense already known. Carlson (1982) and
Selkirk
(1983)
take interesting
approaches
to
this problem and
provide
additional references. Efforts
to
describe
the
usage
of
particular
melodies
suggests
that
they
function much like pragmatic particles in other languages, such as "doch" in German.
They
convey
information
about
the
attitude
of the speaker and
the
relation of the utterance to others in the discourse. Sag and Liberman (1975)
140
Units in speech synthesis
describe
instances
in
which
melody
disambiguates
between
the
literal
and
rhetorical readings of questions. Ladd (1978) suggests that a common vocative melody has a broader meaning as a marker of conventionalized expressions.
The
rather sparse experimental work on the perception of melody tends to support this viewpoint; see especially Nash and Mulac (1980) and Bonnet (1980). 2.3 Micromelody The speech segments introduce contour.
perturbations of
the
prosodically
derived
FO
All other things equal, the FO during a high vowel is higher than
during a low vowel. Voiceless obstruents raise the FO on a following vowel; voiced obstruents depress it.
A large literature documents the existence of
these effects, but does not provide a complete picture of their quantitative behaviour in different prosodic environments.
(See Lea (1973) for a review.)
Making even an approximation to them is likely to add to the naturalness of FO synthesis. 3-
A Theory
3-1 Phonological Units A theory of intonation developed in Pierrehumbert (1980) decomposes the melody into a sequence of elements. The minimal units in the system are low (L) and high (H) tones. tones,
with
accents
are
A pitch accent can consist of a single tone or a pair of two
one marked posited.
to
fall
on
The melodic
the
stress;
altogether
elements associated
seven
different
with the end of
phrase are the boundary tone, wh'ich is located right at the phrase
the
boundary
and may be either L or H, and the phrase accent, which controls the FO between the
last
pitch
accent
and
the boundary
tone, and may also be
H. The boundary tone was proposed in Liberman (1975).
either L or
The idea of the phrase
accent is adapted from Bruce's (1977) work on Swedish.
As far as is known,
different pitch accents, phrase accents, and boundary tones combine freely, so that
the phrasal melodies can be
treated
as
the
output
of a
finite
state
grammar. 3.2 Realization There
are two aspects of the implementation
process to consider. Tones are
realized as crucial points in the FO contour; transitions between one tone and
Pierrehumbert: Linguistic units for FO synthesis the next fill in the FO contour.
141
The transitions can generate contours
on
syllables with only one L or H tone, and they supply FO values for unstressed or unaccented syllables which do not carry a tone. Experiments mapping
reported in Liberman and Pierrehumbert
from
computations
tones
to
crucial
points
in
the
(1983) shed light on the contour.
It
appears
that
can be made left to right in a "window" of two pitch accents.
Superficially global trends arise from iterative application
of these
local
rules. The
transitions
intonation nonmonotonic
between
synthesis
crucial system
transitions
to
points
have
described
achieve
not been as well
in
Pierrehumbert
a sparse
representation.
pitch accents, a transition which dips as a function of the
studied.
(1981)
The
posits
Between two separation
H in
frequency and time between"the two target values is computed. Our impression from comparing rule systems
is that the ear is much less sensitive
to
the
shape of the transitions than to the configuration of the crucial points in the FO contour. 3-3 Disagreements The picture we have presented differs in important ways
from others in
the
literature. In the speech communication literature, FO sometimes figures as a transducer of stress-the more a syllable is stressed, the higher its FO or the greater its FO change. The existence
of many different
pitch accents
means
that this approach cannot be maintained in the general case. We also deny that intonation is built up in layers, by superposing local movements on a global component. We differ from theorists who decompose the melody into FO changes rather
than
into
crucial
points.
Pierrehumbert
(1980)
and
Liberman
and
Pierrehumbert (1983) offer arguments for this position. 4.
Synthesis Rules
Section 3 suggests that FO contours can be represented as a sequence of tonal units assigned contours
needed
relatively for
sparsely
speech
appear to be problematic.
to
synthesis
the text. Deriving from
such
a
the continuous
representation
does
FO not
The major problems arise at the level of assigning
the representation to a text; only part of the information which governs real speakers' choices is available in a speech synthesis program.
Punctuation is
142
Units in speech synthesis
an imperfect
representation
of phrasing.
A
representation
of new
and
old
information sufficient to predict the phrasal stress contour is not available. Nor do current systems model the pragmatic and affective information on which choice of melody and overall pitch range appear to be based. Two
responses
to
this
state
of
affairs
suggest
themselves.
The
information can be specified in the input, using an appropriate of melody and
phrasing.
this approach.
Pierrehumbert
(1981)
missing
transcription
describes a program based
on
In a text-to-speech system, however, it is necessary to make
the best of conventional English text input. What principles can be used to generate an intonation transcription from the information available?
Phrasing
breaks are posited at punctuation marks. A parse may be helpful for
further
subdivision of the sentence, but only if it is accurate and feeds a good set of phrasing rules. always
been
To date, the problem of selection among accent types has
side-stepped
by the use of a reduced
inventory.
For
example,
rules for text-to-speech given in Pierrehumbert (1981) use H accents, and The MITalk rules, which are based on observations in O'Shaughnessy (1979), appear to use a mixture
of two
accent
for rhythmic alternation. of
speech,
types.
Rules
for accent
placement
rely
on
Pierrehumbert's rules are based on the tendency
tendencies in the language.
O'Shaughnessy's depend on the fact that some parts
such as nouns,
receive
accents more often than others,
such
as
verbs. In
our
experience,
the
most
conspicuous
deficiencies
of
such
systems
are
inappropriate phrasing and wrong placement of accents. Assigning accents on a purely syntactic or rhythmic basis often gives undue importance to words which are presupposed in the discourse, or insufficient importance to words should be emphasized.
are especially abrasive. likely to suggest rules.
The
monotonous (1976) accents
even
Additional study of phrasing and accent placement is
rules of thumb for improving this aspect of FO
restricted melodic
for
and
texts
of any
O'Shaughnessy in
which
Errors involving wrong placement of accent in compounds
reading
inventory
length.
(1976)
shows
used
in
Examination that
FO
synthesis
of FO
speakers
use
contours a
synthesis
programs
is
in
Maeda
variety
pitch
neutral declarative materials. Work on
unobtrusive
variation of accent type is likely to pay off in a more lively result.
Pierrehumbert: Linguistic units for FO synthesis
143
References Bonnet, G. (1980), "A Study of Intonation in the Soccer Results,"
Journal of
Phonetics, 8, 21-38. Bruce, G. (1977), "Swedish Word Accents in Sentence Perspective," l'Institut de Linguistique de Lund.
Travaux de
Malmo: CWK Gleerup.
Carlson, L. (1982), "Dialogue Games: An Approach to Discourse Analysis," Ph.D dissertation, MIT (forthcoming from Reidel). Halliday,
M.A.K.
(1967),
"Intonation
and
Grammar
in
British
English,"
The
Hague: Mouton. Ladd, D.R. (1978), "Stylized Intonation," Language, 54, 517-540. Liberman, M. (1979), "The Intonational System of English," Liberman, Invariance
M. and
J. Pierrehumbert
under
Changes
in
(forthcoming
Pitch
Range
in
and
New York: Garland.
1983)-
Length,"
"Intonational
In
M. Aronoff
and
R. Oehrle, eds, Language Sound Structure. Cambridge: MIT Press. Maeda,
S. (1976),
"A Characterization
of American English Intonation,"
Ph.D
disseration, MIT. Nash,
R. and
Waugh
and
A. Mulac
C. H.
van
(1980),
"The
Schooneveld,
Intonation eds,
The
of Verifiability,"
Melody
of
Language.
In
L.R.
Baltimore:
University Park Press. O'Shaughnessy,
D. (1976),
"Modelling
Fundamental
Frequency,
and
Relationship to Syntax, Semantics, and Phonetics,"
Ph.D dissertation, MIT.
O'Shaughnessy,
in
D. (1979),
"Linguistic
Features
Fundamental
its
Frequency
Patterns," Journal of Phonetics, 7, 119-146. Pierrehumbert, J. (1980), "The Phonology and Phonetics of English Intonation," Ph.D dissertation, MIT, (forthcoming from MIT Press, Cambridge) Pierrehumbert,
J. (1981),
"Synthesizing
Acoustical Society of America, 70, 985-995-
Intonation,"
Journal
of
the
144
Units in speech synthesis
Sag, I. and M. Liberman Speech
Acts,"
Papers
(1975), "The Intonational Disambiguation of Indirect from
the
Eleventh
Regional
Meeting
of
the
Chicago
Linguistic Society, 487-497. Selkirk,
E. (1978),
Structure,"
Paper
"On
Prosodic
presented
to
Structure
the
Sloan
and
its
Foundation
Relation
to
Conference
Syntactic on
Mental
Representation in Phonology. (MS Dept. of Linguistics, U. Mass, Amherst.) Selkirk,
E. (forthcoming
in
between Sound and Structure,"
1983),
"Phonology
and
Cambridge: MIT Press.
Syntax:
The
Relation
Symposium 3
Models of the larynx
147 LARYNX M O D E L S AS COMPONENTS IN M O D E L S OF SPEECH DUCTION Celia
PRO-
Scully
University of Leeds, U.K. The most relevant dictionary d e f i n i t i o n of a model is the simplified representation of a process or system. To be manageable, a complicated system, such as that which generates speech, needs to be reducible to a number o f quasi-independent components. It i s helpful to i d e n t i f y the larynx as one such component. I t s complexity and importance in speech are indicated by the number o f scholars investigating larynx a c t i v i t y and by the wide v a r i e t y of techniques employed. Since the symposium on the larynx at the 8th International Congress o f Phonetic Sciences (Fant and Scully, 1977), several conferences and books have been partly or e n t i r e l y devoted to the larynx in speech and singing (Fink, 1975; Carre, Descout & Wajskop, 1977; Fink & Demarest, 1978; Boe, Descout & Gu^rin, 1979; Lass, 1979, 1981; Lawrence & Weinberg, 1980; Stevens & Hlrano, 1981). Modelling i s one path towards a greater understanding of the larynx, but i t needs to be considered as several systems: 1. neural control mechanisms; 2. anatomical structures, tissue properties and muscle mechanics; 3. articulation; 4. aerodynamics; 5. acoustic sources; 6. acoustic f i l t e r s . These are t e n t a t i v e , s i m p l i f i e d and, to some extent, a r b i t r a r y d i v i s i o n s . The processes l i s t e d are not truly independent. Insofar as each system may be considered as a stage of speech production, conditions in a l a t e r stage are determined from the variables defined in more than one of the e a r l i e r stages. The larynx does not operate in i s o l a t i o n . To d i f f e r e n t degrees, in each of the systems l i s t e d above, links between the larynx and both subglottal and supraglottal regions need to be considered. Neural systems w i l l not be discussed here (but see Muller, Abbs & Kennedy, pp.209-227 in Stevens & Hirano, 1981). Anatomical and physiological links Anatomically, i t may seem reasonable to treat the larynx structures, from cricoid c a r t i l a g e to hyoid bone, as a component in a t o t a l speech-producing system which i s reducible in the cybernetic sense. But through extrinsic' muscles the larynx is linked to structures which shape the vocal t r a c t ; because i t forms the top of the trachea, the larynx is a f f e c t e d by f o r c e s associated with changes o f lung volume. To what extent could a simplified representation of the forces associated with the extrinsic laryngeal muscles explain, for example, the hyoid bone movements observed in real speech? To what extent do larynx settings and vocal tract shapes covary? Fink and Demarest ( 1978) have provided a foundation f o r this kind of larynx modelling.
148
Models of the larynx
The a r t i c u l a t o r y system Here, a r t i c u l a t i o n i s taken to mean a l l actions needed f o r speech production, not only those which shape the supraglottal vocal t r a c t .
At l e a s t
three kinds o f a r t i c u l a t o r y action may t e n t a t i v e l y be ascribed to the l a r y n x : 1.
the slowly-changing ( d . c . ) component o f g l o t t a l area, or abduction and
2.
v e r t i c a l movements o f the larynx;
3.
changes in the e f f e c t i v e s t i f f n e s s and mass o f the vocal f o l d s , as a
adduction o f the vocal
folds;
component o f fundamental frequency c o n t r o l . There are unanswered questions here.
To what extent does an action o f one
o f these three a r t i c u l a t o r s a f f e c t the other two?
Should there be a fourth
a r t i c u l a t o r y action f o r the control o f phonation type? are o f the essence in an a r t i c u l a t o r y model.
Timing and coordination
Much has been learned in recent
years about changes o f g l o t t a l width in r e a l speech and patterns o f associated muscle a c t i v i t y ( f o r example, Hirose & Sawashima, Chapter 11 and Sawashima & Hirose, Chapter 22 in Stevens 4 Hirano, 1981).
Coordination o f a c t i o n s , both
within the larynx and also between laryngeal and other a r t i c u l a t o r s , may be modelled.
What i s not yet well established i s the v a r i a b i l i t y - both in timing
and in a r t i c u l a t o r y distance - found in r e a l speech. a given speaker and/or a given speech rate?
Is v a r i a b i l i t y f i x e d
for
Are kinematic d e s c r i p t i o n s
s u f f i c i e n t f o r an a r t i c u l a t o r y larynx model, or i s there a pressing
requirement
to model the dynamics? The aerodynamic system The low frequency ( s l o w l y changing, d . c . ) components o f volume f l o w r a t e o f a i r through the g l o t t i s and pressure drop across the g l o t t i s are v a r i a b l e s in the aerodynamic system of speech production.
This i s i r r e d u c i b l e :
aerodynamic
conditions a t the g l o t t i s depend on the changing configurations o f the whole respiratory t r a c t .
S i g n i f i c a n t subglottal
f a c t o r s include the r a t e o f lung
volume decrement and subglottal airways r e s i s t a n c e .
Significant
supraglottal
f a c t o r s include the c r o s s - s e c t i o n area o f any severe c o n s t r i c t i o n s o f the vocal t r a c t and changes in c a v i t y volumes:
both a c t i v e , due to a r t i c u l a t o r y a c t i o n s ,
and passive, due to wall compliance.
More data are needed to develop improved
representations o f the g l o t t a l o r i f i c e as a flow-dependent r e s i s t a n c e . Laryngeal processes f o r acoustic
sources
Quantification o f the myoelastic-aerodynamic theory o f v o i c i n g i s an important goal f o r modelling.
Larynx muscle f o r c e s , the v i s c o e l a s t i c
p r o p e r t i e s o f the body and cover o f the vocal f o l d s and t h e i r geometry combine with l o c a l aerodynamic f o r c e s to make the vocal f o l d s v i b r a t e . waveform f o r the a . c . component o f volume v e l o c i t y o f g l o t t a l said to d e f i n e a ' s h o r t c i r c u i t current s o u r c e ' :
The r e s u l t a n t a i r f l o w can be
the v o i c e source as i t would
appear in the absence o f the vocal t r a c t acoustic tube.
Each vocal f o l d may be
represented as two or more mass-spring systems, as a continuous bounded medium or as a beam;
a functional approach may be employed instead.
model may depend on the a p p l i c a t i o n .
The choice o f
Among the questions as yet unanswered by
modelling i s the exact nature o f the aerodynamic contribution to fundamental frequency.
The processes generating turbulence noise are not well
understood,
Scully: Larynx models as components of speech production
149
e s p e c i a l l y f o r the i n t e r m i t t e n t a i r f l o w and moving boundaries a s s o c i a t e d with v i b r a t i o n of t h e vocal f o l d s . Because of the u n i t y of the whole aerodynamic system, g l o t t a l and s u p r a g l o t t a l source processes a r e i n t e r r e l a t e d . A complex a u d i t o r y framework has been proposed f o r laryngeal and other components of voice q u a l i t y (Laver, 1980). Relationships between wave d e s c r i p t i o n s f o r the a c o u s t i c sources and perceptual a t t r i b u t e s need to be f u r t h e r e x p l o r e d . I t may be hoped t h a t modelling w i l l take us towards a c o n s i s t e n t d e s c r i p t i v e framework f o r v i b r a t i o n modes of t h e vocal f o l d s and the co-occurrence of voice and t u r b u l e n c e noise sources. An agreed s e t of phonetic f e a t u r e s f o r the larynx may f o l l o w , e v e n t u a l l y . C l a r i f i c a t i o n of nomenclature could be one short-term g o a l , but even t h i s may have to await increased understanding of the concepts of r e g i s t e r and phonation t y p e . Acoustic coupling between source and f i l t e r The s u b g l o t t a l f i l t e r might be i n c l u d e d . A voice source as defined above i s assumed to be l i n e a r l y s e p a r a b l e from the a c o u s t i c f i l t e r s , but source and f i l t e r s i n t e r a c t , e s p e c i a l l y in the frequency region of the f i r s t f o r m a n t s . One approach i s to d e r i v e t r u e g l o t t a l flow from lung a i r p r e s s u r e and waveforms of g l o t t a l a r e a . To what e x t e n t do g l o t t a l l o s s e s and the e f f e c t s of FQ - Fj_ proximity vary with speaker type? Concluding remarks The larynx p r e s e n t s us with a l a r g e number of q u e s t i o n s about the c o n t r o l and operation in speech and singing of i t s t i n y and r a t h e r i n a c c e s s i b l e s t r u c t u r e s . Some c h a l l e n g e s and p o s s i b i l i t i e s f o r modelling may be o f f e r e d . S i m p l i f i c a t i o n i s advantageous, but only i f i t does not s e r i o u s l y d i s t o r t the system and change the whole p i c t u r e . Dominant f a c t o r s may emerge, but d e c i s i o n s about the need to include in a model s p e c i f i c v a r i a b l e s must be judgements (or guesses) based on fragmentary d a t a . Modelling and data g a t h e r i n g can complement each other and give a ' l e a p - f r o g ' progress in understanding p o r t i o n s of t h e complex whole. Individual f a c t o r s may be i d e n t i f i e d and manipulated as they cannot be in r e a l speech. For example, a c o u s t i c coupling and anatomical l i n k s between larynx and vocal t r a c t can be d i s s o c i a t e d . Since modelling of a c o u s t i c coupling alone g i v e s v a r i a t i o n s of fundamental frequency with vowel type in the o p p o s i t e sense to t h a t of r e a l speech, p h y s i o l o g i c a l l i n k s should give o v e r r i d i n g e f f e c t s (Gu^rin, Degyrse and Boe, pp.263-277 in Carre e t a l . , 1977). Modelling should be able to make p r e d i c t i o n s beyond what can be measured in r e a l speech. Normal a d u l t speech i s so s k i l f u l , r e l i a b l e and, probably, s t y l i s e d , t h a t the t a s k s performed in speech a r e not obvious. A model can get t h i n g s wrong (only too e a s i l y ! 
) by using i n a p p r o p r i a t e s e t t i n g s , a c t i o n s and temporal o r g a n i s a t i o n ; and thus help t o explain the p a r t i c u l a r p a t t e r n s found in r e a l speech. Regions of s t a b i l i t y f o r the larynx may be i d e n t i f i e d . Models ought to conform to physical p r i n c i p l e s and c o n s t r a i n t s , w h i l s t being capable of simulating d i f f e r e n t speaker types and d i f f e r e n t o p t i o n s ; a l s o d i f f e r e n t degrees of c o n t r o l , as in singing versus speech. I n c r e a s i n g l y r e a l i s t i c
150
Models of the larynx
r e p r e s e n t a t i o n s w i l l have o b v i o u s a p p l i c a t i o n s i n speech s y n t h e s i s ( b o t h t e r m i n a l - a n a l o g and l i n e - a n a l o g ) and in d i a g n o s i s o f p a t h o l o g i c a l s t a t e s o f t h e v o c a l f o l d s . One o b j e c t i o n t o m o d e l l i n g a s an a n a l y s i s - b y - s y n t h e s i s a p p r o a c h i s t h a t , w i t h many i l l - q u a n t i f i e d p a r a m e t e r s t o m a n i p u l a t e , t o o many u n i n f o r m a t i v e matches t o r e a l speech can be a c h i e v e d . Practical experience s u g g e s t s t h a t t h i s need n o t be a m a j o r worry f o r t h e p r e s e n t l a r y n x m o d e l l i n g symposium; where c u r r e n t models a r e n o t c o m p l e t e l y s u c c e s s f u l s i m u l a t i o n s , t h e ways in which t h e y f a i l a r e i l l u m i n a t i n g .
References Boë, L . J . , D e s c o u t , R. & G u é r i n , B., e d s . ( 1 9 7 9 ) . Larynx e t P a r o l e . P r o c e e d i n g s o f a GALF Seminar, G r e n o b l e , F e b r u a r y 8 - 9 , 1979. Grenoble: I n s t i t u t de Phonétique de G r e n o b l e . L a s s , N . J . , e d . (1979, 1981). Volumes 2 & 5 . Speech and Language Advances i n Basic Research and P r a c t i c e . New York: Academic P r e s s . C a r r é , R., D e s c o u t , R. & Wajskop, M., e d s . ( 1 9 7 7 ) . A r t i c u l a t o r y Modelling and P h o n e t i c s . P r o c e e d i n g s o f a Symposiun, G r e n o b l e , J u l y 10-12, 1977. B r u s s e l s : GALF Groupe de l a Communication P a r l e e . F a n t , G. & S c u l l y , C., e d s . ( 1 9 7 7 ) . The Larynx and Language. P r o c e e d i n g s o f a D i s c u s s i o n Seminar a t t h e 8 t h I n t e r n a t i o n a l Congress o f P h o n e t i c S c i e n c e s , Leeds, August 17-23, 1975, P h o n e t i c a 31 C O . F i n k , B.R. ( 1 9 7 5 ) . The Human Larynx, A F u n c t i o n a l S t u d y . New York: Raven Press. F i n k , B.R. & Demarest, R . J . ( 1 9 7 8 ) . Harvard U n i v e r s i t y P r e s s .
Laryngeal Biomechanics.
L a v e r , J . ( 1 9 8 0 ) . The Phonetic D e s c r i p t i o n of Voice Q u a l i t y . Cambridge U n i v e r s i t y P r e s s .
Cambridge, MA:
Cambridge:
Lawrence, V.L. & Weinberg, B., e d s . ( 1 9 8 0 ) . T r a n s c r i p t s o f t h e Eighth Symposium Care of t h e P r o f e s s i o n a l V o i c e , P a r t s I , I I , & I I I . , June 11-15, 1979, The J u i l l i a r d S c h o o l , NYC. New York: The Voice F o u n d a t i o n . S t e v e n s , K.N. & H i r a n o , M., e d s . ( 1 9 8 1 ) . Vocal Fold P h y s i o l o g y . P r o c e e d i n g s o f a C o n f e r e n c e , Kurune, J a n u a r y 15-19, 1980. Tokyo: U n i v e r s i t y o f Tokyo P r e s s .
151 THE VOICE SOURCE - ACOUSTIC MODELLING Gunnar Fant Department of Speech Communication & Music Acoustics, Royal Institute of Technology, Stockholm, Sweden Abstract Recent improvements in the source-filter concept of voice production take into account interaction between the time variable nonlinear glottal impedance and the pressure-flow state of the entire sub- and supraglottal systems.
Alternative defini-
tions of source and filter function in a physical production model and in syntesis schemes are reviewed.
Requirements for
time and frequency domain parameterization of the voice source are discussed with reference to speech dynamics. Introduction The acoustics of speech production is based on the concept of a source and a
filter function - in a more general sense a
raw material and a sound shaping process.
In current models the
source of voiced sounds is represented by a quasiperiodic succession of pulses of air emitted through the glottis, as the vocal cords open and close, and the filter function is assumed to be linear and short-time invariant.
Not much work has been done on
the description of the voice source with reference to speaker specifics and to contextual factors.
Speech synthesis has gained
a fair quality on the basis of conventional idealizations, such as a -12 dB/oct average spectrum slope and uniform shape.
Source
and filter have been assumed to be linearly separable. This primitive view has served well as the foundation of acoustics phonetics, but it is time to revise and improve it.
In
the last few years it has become apparent that we need a firmer theoretical basis of voice production for descriptive work.
Al-
so, there is demand for a more natural quality in speech syn-
152
Models of the larynx
thesis and a way out of the male voice dominance. A more profound view of the production process is now emerging. It is clear that the filter and source functions are
mutually depen-
dent, there is acoustical and mechanical interaction (Ishizaka and Flanagan, 1972).
The concept and the definition of a source
will differ in a true, maximally human-like model and in a terminal analog synthesizer.
Also, there exist various combinations
of a source and a filter function that will produce one and the same or approximately the same output. A part of the theoretical foundation lies in the development of suitable models for parametrical representation of the voice source together with measurement technqiues, such as time domain and frequency domain inverse filtering and spectrum matching. These tools are still in a developing stage. Source -filter decomposition of voiced sounds The major theoretical complication in human voice production models is that in the glottal open state, the sub- and supraglottal parts of the vocal tract are acoustically coupled through the time variable and nonlinear glottis impedance, whereas when glottis is closed the sub- and supraglottal systems execute approximately free and separate oscillations. Resonance frequencies and especially bandwidths may differ in the two states.
The glottal
open period can be regarded as a phase of acoustical
energy
charging, Ananthapadmanabha and Fant (1982), followed by a discharge at glottal closure.
During the glottal closed conditions,
the formants prevail with a constant bandwidth, i.e., a constant rate of decay followed by a relative faster decay of amplitude during the next glottal open interval.
This truncation effect
(Fant, 1979, 1981; Fant and Ananthapadmanabha, 1982) cially apparent in maximally open
is espe-
back vowels
of high F^. The intraglottal variations in the system function
Fant: The voice source - acoustic modelling
153
are only approximately definable by time varying frequencies and b a n d w i d t h s of vocal resonances. This brief introduction to the production mechanism shows that terminal analog synthesizers with independent source and filter function linearly combined have inherent
limitations.
The physically m o s t complete speech production model that has been
developed
is that of
Flanagan et al (1975).
Ishizaka
and
Flanagan
(1972),
W i t h the t w o - m a s s m o d e l of the vocal
cords incorporated in a distributed parameter system, their model does not have a specific source in the linear network sense. is a self-oscillating and self-adjusting
system,
It
the main power
deriving from the expiratory force as represented by the lung pressure. In the work of Ananthapadmanabha and Fant (1982), the acoustical modeling of voice production starts by assuming a specific glottal area function A g ( t ) within a fundamental period and a specific lung pressure, other
P^.
parts of the system
The flow and pressure states are then calculated
by
in
techniques
similar to those of Ishizaka and Flanagan (1972) leading to numerical d e t e r m i n a t i o n s of the glottal v o l u m e velocity Ug(t), the output volume velocity at the lips U Q (t), and the sound pressure, p
a(t),
at a distance of a centimeters from the speaker's mouth.
By defining the filter function as the relation of P a (t) to Ug(t) w e define Ug(t) as the source or w e m a y go to
underlying the
glottal area function, Ag(t) as a conceptual reference which may attain the d i m e n s i o n a l i t y of a flow source by calculating the current in a submodel w i t h the lung pressure Pj^ inserted as a constant voltage across the time-variable glottal impedance represented
by the
"kinetic" term
only,
inductance and frictional resistance.
thus
ignoring
glottal
These are of secondary
importance only and may be taken into account later, Ananthapadmanabha and Fant (1982).
From the relation
Models of the larynx
154
, S1 •S £ Ü cn •H M4(D Vl
!! i
M-l M-l
h
iI s
172
Models of the larynx
excitations, i.e., at discontinuities at a glottal opening which are highly susceptable to glottal damping.
Excitations during
glottal closure reported by Holmes (1976) would require additional discrete elements of excitation functions in the glottal source.
The theoretical modelling undertaken here to convert
glottal area function into glottal flow could
be extended to
take into account the displacement of air associated with the lateral and longitudinal movements of the vocal cords (Flanagan and Ishizaka, 1978). Conclusions, source dynamics The theory of voice production, as well as parametric models and data collection techniques, has advanced significantly during the last years but we are still in a developing stage gradually appproaching the central objects of research,
a quantitative
rule-oriented description of voice individualities, and speech dynamics. A voiced sound may be decomposed into a source function and a filter function, the definition of which varies with the particular production or synthesis model.
We do not yet have suffi-
cient experience of how well the complete interactive model may be approximated in various synthesis schemes.
Terminal analog
synthesizers may be modified in various ways, e.g., to adopt the smoothed glottal flow as source and to simulate truncation effects which should ensure a sufficient amount of naturalness for synthesis of low Fg voices.
It is possible that the situation is
not that simple at high F Q , where the interaction effects appear to be more demanding. Up till now inverse filtering has mostly served as a tool for qualitative, indirect studies of the.phonatory process at the level of the vocal cords and of temporal patterns of formant
Fant: The voice source - acoustic modelling excitation.
173
It is time that we develop more
oriented analysis techniques.
quantitatively
This requires specifications of
simultaneous values of both source and filter functions and of significant interaction effects, e.g., truncation phenomena as a constituent in estimates of effective bandwidths. Source spectrum shapes are not adequately described by a single slope value only.
An important feature derivable from the
particular combination of the F^ , FO, and K parameters (Fant, 1979) is the amplitude level of the F
source
maximum relative
the source spectrum level at some higher frequency, e.g., at 1000 Hz.
In connected speech,
the F g level stays rather invariant
whilst formant levels in addition to F-pattern-induced variations tend to fall off as a result of a progressive abduction.
Similar
effects occur as a reaction from supraglottal narrowing when extended beyond that of close vowels. relative to that of F
q
The amplitude level of F1
and also the absolute level of F „ are
4 , 5 ) .
This f i n d i n g was synthesized in a
model of auditory analysis that provides a t o n o t o p i c a l l y organized short-term frequency amplitude spectrum which preserves a l l
the information required by
the psychophysics of auditory frequency measurements and i s constrained by auditory-nerve physiology (_6»18).
This model of auditory a n a l y s i s , coupled
to an optimum central processor of fundamental frequency, r e p l i c a t e s the predictions of the e a r l i e r non-physiological optimum processor theory of pitch of complex tones.
The new analysis model i s more general because the
short-term spectrum s i g n a l l i n g the central processor i s defined f o r a r b i t r a r y acoustic signals including speech.
Indeed the potential relevance of
this
auditory analysis model f o r speech signals has been d i r e c t l y demonstrated in physiological
experiments
Cl9»13).
Several aspects of t h i s research on the perception of fundamental pitch are r e l e v a n t f o r speech research.
Most d i r e c t l y , because fundamental
pitch
Goldstein: An outline of recent research progress
247
is a significant conveyer of linguistic information, the basic understanding of Its auditory communication limits should be useful in the normal situation as well as in the cases of deteriorated speech signal and damaged auditory system. Secondly, the short-term speech spectrum described by the new auditory analysis model, or its variants, is likely to be more relevant for speech communication than conventional spectogram representations.
Finally, the general research
strategy of relating both experiment and theory of psychophysics and physiology and the use of ideal communication models for representing central processing should be productive as well for the more complex problems of speech communcation where the tasks of the central processor are less understood (1,17).
248
Auditory analysis and speech perception
1. Delgutte, B. (1982). Some correlates of phonetic distinctions at the level of the auditory nerve, pp. 131-149 in Carlson and Grandstrom (eds.). The Re-presentation of Speech in the Peripheral Auditory System. Elsevier Biomedical Press. 2. Gerson, A. and Goldstein, J.L, (1978). Evidence for a General Template in Central Optimal Processing for Pitch of Complex Tones. J. Acoust. Soc. Am. (tf, 498-510. 3. Goldstein, J.L. (1973). An Optimum Processor Theory for the Central Formation of the Pitch of Complex Tones. J. Acoust. Soc. Am. 54, 1496-1516. 4. Goldstein, J.L. (1978). Mechanisms of Signal Analysis and Pattern Perception in Periodicity Pitch. Audiology 17, 421-445, 5. Goldstein J.L. (1980). On the Signal Processing Potential of High Threshold Auditory Nerve.Fibers, pp. 293-299 in van den Brink and Bilsen (eds.). Psychological, Physiological and Behavioural Studies in Hearing, Delft Univ. Press. 6. Goldstein, J.L. (1981). Signal Processing Mechanisms in Human Auditory Processing of Complex Sounds. Final Report U,S.-Israel Binational Fund, Grant No. 1286/77, 4/1977-3/1980, Tel-Aviv University, School of Engineering. 7. Goldstein J.L. and Srulovicz, P. (1977). Auditory-Nerve Spike Intervals as an Adequate Basis for Aural Spectrum Analysis, pp. 337-345 in Evans and Wilson (eds.) Psychophysics and Physiology of Hearing. Academic Press. 8. Goldstein J.L., Gerson, A., Srulovicz P. and Furst, M. (1978). Verification of the Optimal Probabilistic Basis of Aural Processing in Pitch of Complex Tones. J. Acoust. Soc. Am. 486-497. 9. Houtsma, A.J.M. and Goldstein, J.L. (.1971). Perception of Musical Intervals: Evidence for the Central Origin of Musical Pitch. MIT Res. Lab. Elec. Technical Rpt. 484. 10. Houtsma A.J.M. and Goldstein J.L. (.1972). The central origin of the pitch of complex tones: evidence from musical interval recognition, J. Acoust. Soc. Amer, 51^, 520-529. 11. Plomp, R. and Smoorenburg, G,F. Eds. (1970). Frequency Analysis and Periodicity Detection in Hearing. Sijthoff, Leiden. 12. Rabiner, L.R. Cheng, M.J. Rosenberg, A.E. and Mc Gonegal, C.A. C1976). A comparative performance study of several pitch detection algorithms. IEEE Trans. ASSP 24, 399-418 13. Sachs, M.B. and Young E.D. (1980), Effects of Nonlinearities on Speech Encoding in the Auditory Nerve. J. Axoust. Soc. Am. 68, 858-875. 14. Schouten, J.F. (1970). The residue revisited
in Plomp and Smoorenburg, op. cit.
15. Siebert, W.M. (.1968). Stimulus Transformations in the Peripheral Auditory System. pp. 104-133 in Kohlers and Eden (eds.) Recognizing Patterns. M.I.T. Press, Cambridge. 16. Siebert, W.M. (1970). Frequency Discrimination in the Auditory System: Place or Periodicity Mechanisms. Proc. IEEE 58, 723-730. 17. Soli, D. and Arabie, P. (1979). Auditory versus phonetic accounts of observed confusions between consonant phonemes. J. Acoust. Soc. Am. 66^, 46-59.
Goldstein: An outline of recent research progress
249
18. Sruluovicz, P. and Goldstein J.L. (1983), A Central Spectrum Model: A Synthesis of Auditory-Nerve Timing and Place Cues in Monaural Communication of Frequency Spectrum J. Acoust. Soc. Am. Scheduled March. 19. Young, E.D. and Sachs, M.B. (1979). Representation of Steady-State Vowels in the Temporal Aspects of the Discharge Patterns of Populations of Auditory-Nerve Fibers. J. Acoust. Soc. Am. 66, 1381-1403.
Symposium 5
Phonetic explanation in phonology
253 THE DIRECTION OF SOUND J o h n J.
CHANGE
Ohala
P h o n o l o g y L a b o r a t o r y , D e p a r t m e n t of L i n g u i s t i c s , v e r s i t y of C a l i f o r n i a , B e r k e l e y , U S A
Uni-
Introduction The striking success of the comparative method for the reconstruction of the linguistic past rests in part on linguists' intuitions as to the favored direction of sound change. Given dialectal variants such as [tjuzdi] and [tjuzdi] "Tuesday", linguists know that the second is more likely to have developed from a form similar to the first rather than viceversa. This intuition is based primarily on experience, i.e., having previously encountered several cases of the sort [tj] » [tj] (where the directionality is established on independent grounds) and few or none of the reverse. If there were assumptions about the physical workings of speech production and speech perception which informed these intuitions, they were, with few exceptions, naive and empirically unsupported, e.g., the still unproven teleological notion of 'ease of articulation.' The history of progress in science and technology, e.g., in medicine, iron smelting, bridge construction, demonstrates, however, that although intuitions can take a field a long way, more can be achieved if these are complemented by empirically-founded models (theories) of the system the field concerns itself with, i.e., if induction is united with deduction. In this paper I briefly review some of the universal phonetic factors that determine the direction of sound change. Articulatory Constraints One of the clearest and most well-known examples of an articulatory constraint determining the direction of sound change is the aerodynamic factors which lead to the devoicing of stops, especially those with a long closure interval, e.g., geminates, see Table I.
T a b l e I. Devoicing of Geminate Stops in Moré 1953; transcription simplified).
(from
Alexandre
Morphophonemic Form
Phonemic Form
French
Gloss
pabbo
papo
"frapper"
bad + do
bato
"corbeilles"
lug + gu
luku
"enclos"
254
Phonetioc explanation in phonology
As noted by Passy (1890:161) voicing is threatened by the increasing oral pressure (and thus the decreasing pressure drop across the glottis) caused by the accumulation of the glottal airflow in the oral cavity. Recent mathematical models provide us with a much better understanding of this process (Rothenberg 1968, Ohala 1976, Ohala and Riordan 1979, Westbury 1979). Therefore, unless the closure duration of the stop is shortened (which may happen in word-medial, intervocalic position) or the oral cavity is actively enlarged during the closure (i.e., becomes implosive) or the oral cavity is vented somehow (say, via the nasal passage), then there is a strong physical motivation for the direction of a change in voicing to be from [+voice] to [-voice] (Ohala 1983). Acoustic Factors It has also long been recognized that certain distinct articulations may give rise to speech sounds which are substantially similar acoustically and auditorily, such that listeners may inadvertently substitute a different articulation for the original one (Sweet 1874:15-16). Sound changes such as those in Table II are very common cross-linguistically and can be shown to arise, plausibly, due to the acoustic similarity of the sounds or sound sequences involved in the change. Table II. Sound Changes Precipitated by Acoustic Similarity Different Articulations. Palatalized Labials Roman Italian
>
Apicals [pjeno]
Genoese
Italian
[bjaoko] Pre-Classical
Greek khalep-jo
>
[tjena]
(1f u l l
[d^aqku]
"white"
Classical Greek
guam-jo Labialized Velars
of
p r o v o k e ii
khaIept o
»
baino
"I
c o m e II
Labials
Proto-Indo-European
ekwos
Classical
Proto-Bantu
-kumu
W. Teke
Greek
hippos
"horse"
pfumu
"chief"
Ohala: The direction of sound change
255
The problem, though, is that if these speech sounds are just similar, then the substitutions should occur in either direction. In fact, the substitutions are strongly asymmetrical, i.e., in the direction shown in Table II. The change of labials to velars, for example, though attested (Lemle 1971) are much rarer and often occur only in highly restricted environments, e.g., Venezuelan Spanish [ e k l i k s e ] < [ e k l i p s e ] "eclipse", [ k o n s e k s i o n ] < [ k o n s e p s i o n ] "conception".where the presence of the following apical seems to be essential (Maríscela Amador, personal communication). The same asymmetry is evident in the confusion matrices resulting from laboratory-based speech perception studies such as that by Winitz, Scheib, and Reeds (1972); see Table III. Table III. Asymmetry of Identification Errors in the P e r c e p t i o n Study of W i n i t z e t al. (1972).
Speech
[p] >
[t]/
[i]
(34%) but [t] >
[p]/
[i]
(6%)
[k] >
[p] /
[u]
(27%) b u t [p] >
[k]/
[u]
(16%)
Attributing these listeners identify the will not work since in /p/ (Wang and Crawford
asymmetries to "response bias" (when in doubt, sound as that which is most frequent in the language) English, at least, /k/ occurs more frequently than 1960).
To try to understand the causes of this asymmetry in misperception it may be useful to examine similar asymmetries in the perception of stimuli in different sensory channels. (See also Ohala 1982.) When subjects' task is the identification of briefly glimpsed capital letters of the Roman alphabet (A, B, C, etc.), the following asymmetries in misidentification occur (where ' >' means misidentification in the given direction more often than the reverse): E > F, Q > 0, R > P, B > P, P > F, J > I, W > V (Gilmore, Hersh, Caramazza, and Griffin 1979). Again, "response bias" is of no help to account for the favored direction of these errors since •E1 is far more common than 1F1 (in printed English). In each of these pairs the letter on the left is graphically equivalent to that on the right plus some extra feature. As Garner (1978) has pointed out, it follows that if this extra feature is not detected, for example, in the case of the first pair, the "foot" of the 'E1, then it will be misperceived as the letter that is similar except for the lack of this feature, 1F' in the cited example. Inducing ("hallucinating") the presence of this extra feature when it is not present in the stimulus is less likely. (This does not mean that inducing absent features or segments is uncommon; in fact, if the listener has reason to suspect that he has "missed" something because of background noise, this may be the most common source of error.) To understand the asymmetries in the errors of speech perception and thus the favored direction of sound change due to this cause we should look for the features which, for example, /kw/ has but /p/ lacks. In the case of /kw/ » /p/, it seems likely that it is the relatively intense stop burst with a compact spectrum that velars have but which is missing in labials. Research on these details is likely to benefit not only diachronic phonology but also such practical areas as automatic speech recognition.
256
Phonetioc explanation in phonology
Auditory factors There is considerable evidence that speech perception is an active process such that "top-down" information is applied to resolve ambiguities and to factor out distortions in the speech signal (Warren 1970). Ohala, Riordan, Kawasaki, and Caisse (forthcoming) and Kawasaki (1978, forthcoming) demonstrated that listeners alter their identification of speech sounds or phonetic features as a function of the surrounding phonetic context-apparently by factoring out of the stimulus the distortions that the surrounding sounds would be likely to create. For example, Kawasaki found that listeners judged the same phonetically nasalized vowel to be less nasal when it was flanked by full nasal consonants, vis-a-vis the case where the nasal consonants were attenuated or deleted. Ohala et al. found that listeners would identify a more front vowel on the /i/-to-/u/ continuum as an /u/ when it was flanked by apical consonants, vis-a-vis when it was flanked by labial consonants. (That apical consonants have a fronting effect on /u/ is well known--Lindblom 1963, Stevens and House 1963--; it is this distortion, presumably, that the subjects were factoring out from the stimuli surrounded by apicals in Ohala et al.'s study.) What this means is that if the articulatory constraints of the vocal tract cause a shift of the sound A to B, then the listener's "reconstructive" ability allows him to "undo" this change and convert perceived B into recognized A. Most of the time this process seems to succeed in enabling the listener to recognize what the speaker intended to say in spite of the fact that his speech is encrusted--like a boat with barnacles--with the unintended distortions caused by articulatory constraints. There is evidence, however, that in some cases these reconstructive processes are applied inappropriately to factor out parts of the intended signal. This is a kind of "hypercorrection" at the phonetic level. As argued by Ohala (1981), this is the basic nature of dissimilation. According to this hypothesis, dissimilatory changes of the sort Latin /k u iqk u e/ > /kiqk w e/ > Italian /tjiokwe/ arose due to listeners (or a listener) thinking that the labialization on the first velar was a distortion caused by spillover of labialization from the second velar and therefore factoring it out when they pronounced this word. There is support for this hypothesis. It predicts that the only phonetic features which could be dissimilated at a distance, i.e., undergo the change in (1), would be those which could spread (1)
[CCfeature] — >
[-0-feature] /
X [^feature]
like a prosody across intervening segments. Thus features such as labialization, retroflexion, palatalization, voice quality (including that induced by aspiration), and place of articulation of consonants would be prime candidates for dissimilation and features that did not have this property, e.g., [+obstruent] or [+affricate], would not. This prediction seems to agree with the available data (see Ohala 1981). Furthermore, although it is often the case that sound changes due to assimilatory processes, as in (2), involve the (apparently simultaneous) loss of the conditioning environment (italicized in (2)), this could not happen in the case of
257
Ohala: The direction of sound change (2)
an > a foti_ » f«St
dissimilation. The conditioning environment must be present if the listener is to be misled in thinking it is the source of the phonetic feature which is factored out, i.e., dissimilated. Thus, sound changes such as those in (3), hypothetical (3)
b h and h
>
kuic]kwe >
ban kiqe
versions of Grassmann's Law and the Latin labialization dissimilations, should not occur, i.e., the dissimilating segment or feature should not be lost simultaneously only in the cases where it had caused dissimilation. Again, this prediction seems to be borne out by the available data. Many linguists have been reluctant to include dissimilation in their list of natural or expected process by which sounds may change, if this means giving it the same status as assimilation (Sweet 1874:13,Bloomfield 1933:390, Schane 1972:216). This is understandable since it appears unprincipled to claim that changes of the sort AB BB (assimilatory) are expected if changes in the reverse direction, BB AB (dissimilatory), are also expected. No doubt to the prescientific mind the fact that wood falls down in air (when elevated and released) but rises if submerged in water presents a serious obstacle to the development of a coherent generalization regarding the expected or natural motions of objects in the world. A scientific understanding of these phenomena removes these obstacles. Similarly, the account given here delineates the different circumstances under which assimilation and dissimilation may occur so that there is no contradiction between them. Dissimilation is perpetrated exclusively by listeners (through a kind of perceptual hypercorrection); assimilation is largely attributable to the speaker. References Alexandre, R. P. (1953). La langue more. français d'Afrique noire, NoI 34. Bloomfield, L.
(1933).
Language.
Dakar:
New York:
Mémoires de Institut
Holt, Rinehart, & Winston.
Garner, W. R. (1978). Aspects of a stimulus: features, dimensions, and configurations. In Cognition and categorization (E. Rosch & B. B. Lloyd, Eds.), pp. 99-133. RTllsdaTël Lawrence Erlbaum Associates. Gilmore, G. C., Hersh, H., Cararnazza, A., & Griffin, J. (1979). Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics, 25, 425-431. Kawasaki, H. (1978). The perceived nasality of vowels with gradual attenuation of adjacent nasal consonants. J. Acous. Soc. Am., 64, S19. [Forthcoming in Experimental phonology (J. I5hala, Ed.).J Lemle, M. (1971). Internal classification of the Tupi-Guarani linguistic family. In Tupi studies I (D. Beridor-Samuel, Ed.), pp. 107-129. Norman: Summer Institute of Linguistics.
258
Phonetioc explanation in phonology
Lindblom, B. (1963). Spectrographic study of vowel reduction. J. Acoust. Soc. Am., 35, 1773-1781.
Ohala, J. J. (1976). A model of speech aerodynamics. Report of the Phonology Laboratory, 1, 93-107.
Ohala, J. J. (1981). The listener as a source of sound change. In Papers from the Parasession on Language and Behavior (C. S. Masek, R. A. Hendrick, & M. F. Miller, Eds.), pp. 178-203. Chicago: Chicago Linguistic Society.
Ohala, J. J. (1982). The phonological end justifies any means. In Preprints of the plenary sessions, 13th Int. Congr. of Linguists, Tokyo, pp. 199-208. Tokyo: ICL Editorial Committee.
Ohala, J. J. (1983). The origin of sound patterns in vocal tract constraints. In The production of speech (P. F. MacNeilage, Ed.), pp. 189-216. New York: Springer-Verlag.
Ohala, J. J. & Riordan, C. J. (1979). Passive vocal tract enlargement during voiced stops. In Speech communication papers (J. J. Wolf & D. H. Klatt, Eds.), pp. 89-92. New York: Acoustical Society of America.
Ohala, J. J., Riordan, C. J., Kawasaki, H., & Caisse, M. (Forthcoming). The influence of consonant environment upon the perception of vowel quality.
Passy, P. (1890). Étude sur les changements phonétiques. Paris: Librairie Firmin-Didot.
Rothenberg, M. (1968). The breath-stream dynamics of simple-released-plosive production. (Bibliotheca Phonetica 6.) Basel: S. Karger.
Schane, S. (1972). Natural rules in phonology. In Linguistic change and generative theory (R. P. Stockwell & R. K. S. Macaulay, Eds.), pp. 199-229. Bloomington: Indiana University Press.
Stevens, K. N. & House, A. S. (1963). Perturbation of vowel articulations by consonantal context: an acoustical study. J. Speech & Hearing Res., 6, 111-128.
Sweet, H. (1874). History of English sounds. London: Trübner.
Wang, W. S.-Y. & Crawford, J. (1960). Frequency studies of English consonants. Language & Speech, 3, 131-139.
Warren, R. (1970). Perceptual restoration of missing speech sounds. Science, 167, 392-393.
Westbury, J. R. (1979). Aspects of the temporal control of voicing in consonant clusters in English. Unpub. Doct. Diss., University of Texas at Austin.
Winitz, H., Scheib, M. E., & Reeds, J. A. (1972). Identification of stops and vowels for the burst portion of /p,t,k/ isolated from conversational speech. J. Acoust. Soc. Am., 51, 1309-1317.
VOWEL FEATURES AND THEIR EXPLANATORY POWER IN PHONOLOGY
Eli Fischer-Jørgensen
University of Copenhagen, Denmark
Phonetics cannot explain the phonological pattern of a given concrete language and its development within a given period of time, but it can attempt to explain some universal constraints and tendencies in phonological patterns and phonological change. For this purpose one needs detailed quantitative models of speech production and speech perception, but it is also necessary to have a general frame of reference for the description of individual languages in the form of a system of phonetic dimensions, according to which the speech sounds of a language can be grouped into classes with common features. These dimensions (which, according to a now generally accepted, but not quite unambiguous, terminology are also called features) must on the one hand be correlated with speech production and speech perception, and on the other hand be adequate for the description of phonological patterns and rules. In the present paper I will consider only vowel features, and only some basic features (excluding nasality, r-colouring, etc., but including tenseness).

Since the days of Bell and Sweet it has been the tradition to describe vowel systems by means of the basic features high-low, front-back, and rounded-unrounded, and (for some languages) tense-lax. This system, which was defined in articulatory terms, has been severely criticized for not covering the articulatory facts, e.g. by Russell (1928) and, more recently, by Ladefoged (e.g. 1976 and 1980), Wood (1982) and Nearey (1978). It has been objected that, according to X-ray photos, the highest point of the tongue is, e.g., often lower
for [i] than for [e], for [o] than for [a], for [u] than for [ɪ], and that, on the whole, this point is rather variable. However, almost all these objections are only valid as regards the revisions introduced by Daniel Jones in the classical system for the purpose of his cardinal vowel chart, which was meant as a practical tool in phonetic field work and not as a theoretical vowel system. English phoneticians have simply identified the cardinal vowel chart with the system of classical phonetics. None of the founders of classical phonetics (e.g. Bell, Sweet, Jespersen, Sievers, Viëtor) ever used the term "the highest point of the tongue"; it was introduced in Jones' Outline (1918). This point is indeed often variable (although I agree with Catford (1981) and Lindau (1977) in finding the criticism somewhat exaggerated), and it is very rarely used in later phonetic works by continental phoneticians except when they quote Jones. Moreover, it was Jones who, for practical reasons, placed [ɑ] as representing the lowest degree of height in the series [u o ɔ ɑ], and who discarded tenseness as a separate dimension and thus placed [ɪ] between [i] and [e] in his chart. If tenseness is considered a separate dimension, and height is taken to mean the relative distance between the whole surface of the tongue and the palate within each of the series of rounded or unrounded, tense or lax, front or back vowels, most of the inconsistencies between these traditional labels and the articulatory facts disappear. It is true that tenseness has been defined in several different ways, and not very convincingly, within classical phonetics, and it has probably been applied to too many languages. But it is a useful feature for the description of some languages, e.g. German, Dutch, various Indian languages, and for the high vowels of English. It seems to be correlated with the acoustic property of a relatively more peripheral vs. a more central placement in the vowel space (Lindau 1977) and, as far as articulation is concerned, with a higher vs. lower muscular
tension, which has a number of observable consequences: a flattening of the tongue accompanied by a narrower pharyngeal cavity, less pronounced lip articulation, and a relatively small mandibular distance (I am here in agreement with Wood 1982). "Advanced tongue root" captures only one of these consequences, and does not work for [o] vs. [ɔ].

However, the classical system has rightly been criticized for not taking account of the pharynx and thus for constituting an inadequate starting point for the calculation of the acoustic output. Various more recent descriptions of vowel articulation have seen it as their main purpose to establish a more direct connection with the acoustic consequences. In one version, place and degree of constriction play a central role. This is no doubt a better starting point for computing the area function of the vocal tract, but I do not think that it is a useful basis for setting up a new feature system. The constriction parameter has only been used as one factor of the tense-lax feature, and it will probably be difficult to find any phonological rule that is common to the class of the most constricted vowels [i u o a]. Wood invokes sound typology, but four-vowel systems containing these four vowels, which he considers basic, are extremely rare, whereas [i u a e] and [i o a e] are more common. But both Lindblom and Sundberg (1969) and Wood (1982) have proposed feature systems where the traditional front-back dimension has been replaced by ±palatal, ±velar, and ±pharyngeal, which in the Lindblom-Sundberg system define three places of articulation, and in Wood's system four places, since he defines [o] as pharyngo-velar and [a] as low pharyngeal. Both use jaw opening to define height differences, Lindblom-Sundberg operating with ±close and ±open, Wood with only ±open.

Wood uses his feature ±pharyngeal to describe the retracted and lowered allophones of the Greenlandic vowels before uvular (pharyngeal) consonants. It might also be used to describe the allophone of Danish /a/ after [ʁ]. But apart from these cases I think the front-back dimension is more useful.
Place features are needed in the description of e.g. the Germanic i-Umlaut and of vowel harmony of the type found in Finnish and Turkish, but these facts are expressed in a much simpler way by the features front and back than by combinations of three different features for back vowels. And, incidentally, X-ray photos show the place of the narrowest constriction to be just as variable as the highest point of the tongue. Front vowels may have their narrowest constriction in the alveolar region, and the narrowest constriction of an [o] may be found at the velum, at the uvula, or in the pharynx. This does not make much difference for the general area function, but it does make a difference for the feature system. Moreover, Wood's ±open feature is not sufficient for describing and explaining the many cases where one needs three or four degrees of height, e.g. the Danish vowel system, the English Great Vowel Shift, the general tendency for relatively high vowels to be shorter than relatively low vowels (which has phonological consequences in various languages), the role of vowels of different height in palatalization processes, etc.

Ladefoged does not consider place and degree of constriction to be a good basis for a feature system, nor does he find his own physiological front and back raising parameters adequate for this purpose (1980), whereas he recognizes that the traditional features have proved their value in phonological description. He solves the problem in a different way, i.e. by defining the (multivalued) features front-back and high-low in auditory-acoustic terms, high-low corresponding to the frequency of F1 and front-back to F2−F1. Rounding is retained as a feature defined in articulatory terms.

This interpretation of the traditional features raises some problems. In the first place, the auditory "front-back" dimension (as well as its acoustic correlate F2−F1) corresponds to a combination of the articulatory dimensions of rounding and front-back tongue position. Ladefoged has demonstrated himself that even phoneticians have difficulty in distinguishing
between front rounded and back unrounded vowels, and it is difficult to elicit a dimension of rounding in multidimensional scaling experiments of vowel perception; Terbeek (1977) did not succeed until using five dimensions. The F2−F1 dimension corresponds to the auditory bright-dark dimension which was used in vowel descriptions in the pre-Bell period. As this dimension includes the effect of rounding, it seems problematic to set up an independent rounding dimension at the same time. (A binary opposition, like Jakobson's grave-acute, does not raise quite the same problems.) Moreover, an auditory definition of the front-back dimension implies that processes like i-Umlaut and vowel harmony of the Finnish and Turkish type should be described and explained in auditory terms. But this is not adequate. In these processes rounding and the front-back dimension are kept strictly apart. In i-Umlaut the vowels retain their rounding feature ([a] does not become [ɒ]), and the same is the case in the Finnish vowel harmony. In the Turkish vowel harmony both rounding and front-back dimensions are involved, but according to separate rules. This seems to indicate an articulatory process. It also seems more plausible to explain such processes of assimilation in motor terms, as an anticipation of an articulatory movement. As far as the i-Umlaut is concerned, perception may play a role at the last stage, where the i of the ending may have become so weak that the listener does not hear it and therefore perceives the front feature of the stem vowel as an independent feature (cf. Ohala 1981), but at its start it must be a mainly articulatory process. It therefore seems preferable to define the front-back dimension in articulatory terms, and there is a sufficiently clear basis in X-ray photos for a definition in terms of a forward, respectively backward, movement of the tongue body. But at the same time it must be recognized that from an auditory point of view rounding and front-back combine in one dimension, and I think this aspect is prevalent in the patterning of vowel systems, where [u] and [i] are extremely common because they are maximally different in a two-dimensional auditory space, whereas [y] and [ɯ] are rare (cf. Lindblom 1982).
As for the height dimension, it is probably true that it has a somewhat simpler connection with its physical than with its physiological correlates, even when the distance of the whole upper surface of the tongue is considered, but I think a two-sided correlation can be retained. It would also be very difficult to find an auditory explanation of some of the rules involving this feature, e.g. the differences in duration and fundamental frequency, the influence on palatalization, and the tendency towards devoicing of high vowels, whereas plausible, if not definitive, physiological explanations have been advanced for these facts.
References

Catford, J. C. (1981). Observations on the recent history of vowel classification. In Towards a History of Phonetics (Asher and Henderson, eds.). Edinburgh: Edinburgh University Press.
Jones, D. (1918). An Outline of English Phonetics. London: Heffer and Sons.
Ladefoged, P. (1976). The phonetic specification of the languages of the world. UCLA Working Papers in Phonetics, 31, 3-21.
Ladefoged, P. (1980). What are linguistic sounds made of? Language, 55, 485-502 (also UCLA Working Papers, 45, 1-24).
Lindau, M. (1977). Vowel features. UCLA Working Papers, 38, 49-81.
Lindblom, B. and Sundberg, J. (1969). A quantitative model of vowel production and the distinctive features of Swedish vowels. Speech Transmission Laboratory, Quarterly Progress and Status Report, Stockholm, 14-32.
Lindblom, B. E. (1982). Phonetic universals in vowel systems (to be published in Experimental Phonology, Academic Press).
Nearey, T. M. (1978). Phonetic feature systems for vowels. Bloomington: Indiana University Linguistics Club.
Ohala, J. (1981). The listener as a source of sound change. Papers from the Parasession on Language and Behavior, ed. Miller et al., Chicago. 26 pp.
Ohala, J. (1982). The origin of sound patterns in vocal tract constraints (to be published in The Production of Speech, ed. MacNeilage), 32 pp.
Russell, G. O. (1928). The Vowel. Columbus: Ohio State University Press.
Terbeek, D. (1977). A cross-language multidimensional scaling study of vowel perception. UCLA Working Papers, 37.
Wood, S. (1982). X-ray and model studies on vowel articulation. Working Papers, Lund, 23, 1-191.
VOWEL SHIFTS AND ARTICULATORY-ACOUSTIC RELATIONS
Louis Goldstein, Haskins Laboratories and Yale University, USA
Phonologists have often supposed that the phonetic variability of speech is somehow related to sound change. Bloomfield (1933, p. 365), for example, depicts phonetic change as "a gradual favoring of some non-distinctive [phonetic] variants," which are also subject to imitation and analogy. Hockett (1958) makes the connection quite explicit: he views sound change as a mechanism of phonological change, whereby small asymmetries in the distribution of pronunciation errors about some acoustic target result in pushing the phoneme's target, imperceptibly over time, in the error-prone direction. In this view, then, speech variability represents the seeds out of which a particular sound change may grow. The sprouting and development are, of course, dependent on many other linguistic and social factors. Sound change does not appear to be random, as phonologists and phoneticians have long noted, in that there seem to be certain patterns of change that recur in a number of unrelated languages. In this paper, I speculate about how patterns of variability consistent with certain types of sound change might emerge from the resonance properties of the human vocal tract, given essentially random articulatory variability. If these principles are generalized in an optimistic way, we might look at them as defining possible sound changes. In this paper, however, discussion will be restricted to one type of sound change: vowel shifts.

Vowel Shifts

When vowels change their qualities, it is possible to ask if any general patterns of change emerge, or if vowels simply shift to some random (but roughly adjacent) quality. Labov, Yaeger and Steiner (1972) examined a number of ongoing chain shifts, changes in which the qualities of a number of vowels are interdependent across the various stages of the change. Together with data from completed sound changes, the investigations led to three principles of chain-shifting: (1) front or back "tense" (peripheral) vowels tend to rise in chain shifts, (2) front or back "lax" (less peripheral) vowels tend to fall, (3) back vowels tend to be fronted. Thus (reorganized slightly), the main movement in vowel quality is in the dimension of vowel height (where "vowel height" is being used here as a general term for the dimension along which vowels like [i], [ɪ], [e], [ɛ] lie, rather than some unitary acoustic or articulatory parameter).
For front vowels, height is the only direction of movement; back vowels can also be fronted. While there are a number of methodological problems in this investigation (for example, the use of an acoustic F1 × F2 chart as a representation of vowel quality allowed no way to isolate lip-rounding), the general picture seems robust, and will be taken as a working hypothesis about the nature of vowel shifts. We would like to understand why speech variability develops along the particular dimensions that are involved in vowel shifts. One might attempt to find a reason in constraints that the speech production mechanism employs to
coordinate the activities of the various muscles. Random perturbation of a vowel within such a system of constraints might result in non-random displacement from the target (see discussion below of Perkell and Nelson, 1982). However, it is possible that even if articulatory variability were completely random, the acoustic consequences of such variability would be directed along certain dimensions, given the non-linear relationships that obtain between vocal tract configurations and their acoustic consequences. It is to this latter approach we turn.

Non-linearities in articulatory-acoustic relations
The acoustic sensitivity of the vocal tract to articulatory perturbation can be examined by observing, with a vocal tract analog, how the tube resonances vary as a function of changes in shape. For example, the nomograms of Fant (1960) show the formant frequencies of a horn-shaped analog of the vocal tract as a function of constriction location, constriction size, and lip area. In such figures, it is possible to see that there are certain constriction locations that are acoustically stable, in the sense that small variations in constriction location result in little or no change in resonances, whereas at other locations, small changes result in rather large resonance shifts. Stevens (1972) first called attention to these stable regions in his quantal theory, in which he proposed that the contrasts employed in human language make use of these regions. A vowel like [i] has an acoustically stable palatal constriction, and therefore small changes in constriction location will have little or no effect on its resonances. However, a change in the size (i.e. narrowness) of the constriction will produce changes in the formants. A small, random perturbation of tongue position for [i] might affect either the constriction location or the constriction size. However, only the perturbations that, in fact, modify the constriction size will have any substantial acoustic effect. Thus, random articulatory error would be reflected in the acoustic medium in an almost identical fashion to variation in constriction size only. Since the dimension of vowel height (for a front vowel, like [i]) corresponds to differences in the size of a palatal constriction, the variability produced by random perturbation will lie along the vowel height dimension. This is, of course, the dimension along which vowel shifts occur.

Simulation of articulatory variability
In order to examine the acoustic effects of random articulatory perturbation more systematically, tongue shape variability was simulated using the Haskins Laboratories articulatory synthesizer (Rubin, Baer and Mermelstein, 1981). This program allows modification of a mid-sagittal representation of the vocal tract by means of the six articulatory parameters shown in Figure 1. The shape of the upper part of the tongue body is modelled as a segment of a circle of fixed radius. Different tongue shapes are produced by moving the center of this circle (parameter C). Random articulatory variability was simulated by choosing a set of target vowel shapes and, for each one, generating 100 new vocal tract shapes, each of whose tongue body centers lies on a circle of radius 2 mm about the center of the target tongue shape. Thus, the set of shapes represents a constant error of 2 mm in any direction from the hypothetical tongue body center target.
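To make the setup concrete, here is a minimal Python sketch of this kind of Monte-Carlo perturbation. It does not use the Haskins synthesizer; toy_formants and all numbers are hypothetical stand-ins, chosen only to mimic the quantal behaviour discussed above (formants insensitive to constriction location near the target, but sensitive to constriction size).

```python
import numpy as np

def toy_formants(cx, cy):
    """Hypothetical stand-in for the synthesizer's mapping from
    tongue-body-center position (mm) to (F1, F2) in Hz. It is quadratic
    in cx, so near the quantal point cx = 0 the formants are insensitive
    to constriction *location*, while constriction *size* (cy) always
    moves them strongly."""
    f1 = 300.0 + 60.0 * cy + 1.5 * cx ** 2
    f2 = 2200.0 - 120.0 * cy - 4.0 * cx ** 2
    return np.array([f1, f2])

rng = np.random.default_rng(0)
target = np.array([0.0, 0.0])  # hypothetical tongue-body-center target

# 100 perturbed shapes on a circle of radius 2 mm around the target
angles = rng.uniform(0.0, 2.0 * np.pi, 100)
shapes = target + 2.0 * np.column_stack([np.cos(angles), np.sin(angles)])

acoustic = np.array([toy_formants(x, y) for x, y in shapes])

# Principal axes of the acoustic scatter: variance concentrates along
# one acoustic direction even though the articulatory error is isotropic.
eigvals, _ = np.linalg.eigh(np.cov(acoustic.T))
print("variance along minor/major acoustic axes:", eigvals)
```

Under these assumptions the isotropic 2 mm articulatory error comes out strongly elongated in the acoustic plane, which is the qualitative effect the simulation below probes with the real synthesizer.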
Figure 1. Vocal tract control parameters for the Haskins articulatory synthesizer (key includes L = lips, H = hyoid).
The vowels chosen for study were three front vowels ([i], [e], [æ]), three back vowels ([u], [o], [ɑ]), and a mid central vowel ([ə]).
PHONETIC TIMING AS EXPLANATION IN PHONOLOGY
K. J. Kohler

... [e, ɛ], e.g. ['ladi] "laddie", ['lɔse] (Dieth 1932, 72 ff). Voicing can be maintained more easily if the tongue is pushed into a high front position. It is to be expected that the /a/ is also different in the two environments. In this connection we can also mention the Slavic assonances, in which voicing is the only relevant consonantal feature, e.g. doba - droga - woda - koza - sowa vs. kopa - sroka - rota - rosa - sofa in Polish oral and written poetry (Jakobson, Fant, Halle 1951, 42).

In the Romance and Slavonic languages, in Dutch and Rhineland German, and in Scottish English the voicing feature has a high rank in the hierarchy of phonetic indices distinguishing the two obstruent classes, over and above the lenis/fortis contrast. Periodicity is actively controlled in these languages or dialects, whereas other types of English and German commonly regulate vocal fold vibration passively. The greater articulatory effort in the fortis articulation does not only affect the vocal tract movement, but also the tension in the larynx (Halle and Stevens 1971), resulting in a quicker decay of voicing; in the lenis obstruent, periodicity may continue if the pressure drop across the glottis is favourable, e.g. because of a short stricture duration in intervocalic position. In the languages with active voicing control, on the other hand, articulatory maneuvers deliberately maintain a pressure differential, which may even set in too soon, leading to regressive assimilation of voice in all the languages mentioned above, but unknown in e.g. the other Germanic languages.

In certain languages, the fortis/lenis feature is supplemented by an aspiration contrast, either in combination with the voicing feature (as in the languages of India) or on its own (as in Standard German and most varieties of English). Aspiration is often associated with the fortis element, but in Danish the strongly aspirated [pʰ, tʰ, kʰ] (as against the slightly aspirated or unaspirated, but commonly voiceless [b̥, d̥, ɡ̊]) are characterised by weaker articulation and shorter closure duration (Fischer-Jørgensen 1954). ... [m] ("habe", "haben"), for instance, but preserve fortis stops as occlusions, simply reducing the degree of aspiration and making passive voicing possible in certain contexts, e.g. "hat er" [d̥] (Kohler 1979c). In historical sound change, these developments are well known, for instance from Latin to the western Romance languages, e.g. Spanish ("vida" < "vita", "porfía" < "perfidia"). The relation of fortis to lenis is maintained in the different timings of articulatory movements, even if the absolute values are lowered. This relationship between more-fortis stops and lenis approximants also applies to the allophonics of modern Spanish: utterance-initial or post-pausal position as well as greater emphasis, which both require more articulatory effort, demand voiced stops, as against intervocalic approximants, and this is even true of the original semi-vowels /w/ and /j/, e.g. "huevo" [bw] ~ [gw], "hierro" [dʒ] ('rehilamiento'; cf. Barbón Rodríguez 1978). A fortis/lenis opposition with the same types of articulatory reduction also manifests itself in the consonant gradation of Finnish (cf. Karlsson 1981, 36 ff), as the result of a tendency to make corresponding open and closed syllables the same length: long voiceless stops - short voiceless stops (e.g. "kukka" - "kukan"); voiceless stops after homorganic nasals - long nasals (e.g. "Helsinki" - "Helsingin"); [lt, rt] - [ll, rr] (e.g. "kulta" - "kullan"); otherwise [t] - [d] (e.g. "katu" - "kadulla"), [p] - [v] (e.g. "tupa" - "tuvassa"), [k] - ∅ (e.g. "jalka" - "jalan").

The timing of articulatory movement and of the concomitant laryngeal activity may be reorganised in such a way that utterance-final lenis and fortis consonants coalesce in their glottal features, whereas the duration contrast in the preceding vowel remains and may even be accentuated. English is developing in this direction. In Low German this change is complete: "ick riet" [rit] and "ick ried" [ri:t] (from OSax "writan" and "ridan") are now differentiated by short and long vowel, respectively (cf. Kohler 1982a), after the final /e/ apocope led to a levelling in the consonants themselves. The explanation usually given for this phenomenon - compensatory vowel lengthening in connection with the elimination of the following /e/ (e.g. Bremer 1929) - is wrong: it not only misrepresents the genesis of the quantity opposition, it cannot even account for the differentiation of the two examples given. The distinction in vowel duration is tied to an original fortis/lenis contrast in the following consonant and to the structures 'vowel + fortis consonant' vs. 'vowel + morpheme boundary + fortis consonant' (as in "Brut" [ut] - "bru-t" [u:t]), the latter case preserving final vowel length. This vowel quantity feature is modified by the timing at the utterance level, but the details of this interaction still require thorough investigation.

References

Bannert, R. (1976). Mittelbairische Phonologie auf akustischer und perzeptorischer Grundlage. Travaux de l'Institut de Linguistique de Lund X. Lund: Gleerup/Fink.
Barbón Rodríguez, J. A. (1978). El rehilamiento: descripción. Phonetica, 35, 185-215.
Bremer, O. (1929). Der Schleifton im Nordniedersächsischen. Niederdeutsches Jahrbuch, 53, 1-32.
Chen, M. (1970). Vowel length variation as a function of the voicing of the consonant environment. Phonetica, 22, 129-159.
Dieth, E. (1932). A Grammar of the Buchan Dialect. Cambridge: Heffer.
van Dommelen, W. (1983). Parameter interaction in the perception of French plosives. Phonetica, 40.
Elert, C.-C. (1964). Phonological Studies of Quantity in Swedish. Stockholm: Almqvist & Wiksell.
Fischer-Jørgensen, E. (1954). Acoustic analysis of stop consonants. Miscellanea Phonetica, II, 42-59.
Fischer-Jørgensen, E. (1980). Temporal relations in Danish tautosyllabic CV sequences with stop consonants. Annual Report of the Institute of Phonetics, University of Copenhagen, 14, 207-261.
Fitch, H. L. (1981). Distinguishing temporal information for speaking rate from temporal information for intervocalic stop consonant voicing. Haskins Laboratories Speech Research, 65, 1-32.
Fujimura, O. & Miller, J. E. (1979). Mandible height and syllable-final tenseness. Phonetica, 36, 263-272.
Halle, M. & Stevens, K. (1971). A note on laryngeal features. MIT Research Laboratory of Electronics, Quarterly Progress Report, 101, 198-213.
Jakobson, R., Fant, C. G. M. & Halle, M. (1951). Preliminaries to Speech Analysis. Cambridge, Mass.: The MIT Press.
Karlsson, F. (1981). Finsk grammatik. Suomalaisen Kirjallisuuden Seura.
Kim, C.-W. (1970). A theory of aspiration. Phonetica, 21, 107-116.
Kohler, K. J. (1979a). Dimensions in the perception of fortis and lenis plosives. Phonetica, 36, 332-343.
Kohler, K. J. (1979b). Parameters in the production and the perception of plosives in German and French. Arbeitsberichte des Instituts für Phonetik der Universität Kiel (AIPUK), 12, 261-280.
Kohler, K. J. (1979c). Kommunikative Aspekte satzphonetischer Prozesse im Deutschen. In Phonologische Probleme des Deutschen (Vater, H., ed.), Studien zur deutschen Grammatik 10, 13-39. Tübingen: G. Narr.
Kohler, K. J. (1982a). Überlänge im Niederdeutschen? Arbeitsberichte des Instituts für Phonetik der Universität Kiel (AIPUK), 19, 65-87.
Kohler, K. J., van Dommelen, W. & Timmermann, G. (1981). Die Merkmalpaare stimmhaft/stimmlos und fortis/lenis in der Konsonantenproduktion und -perzeption des heutigen Standardfranzösisch. Arbeitsberichte des Instituts für Phonetik der Universität Kiel (AIPUK), 14.
Kohler, K. J., van Dommelen, W., Timmermann, G. & Barry, W. J. (1981). Die signalphonetische Ausprägung des Merkmalpaares fortis/lenis in französischen Plosiven. Arbeitsberichte des Instituts für Phonetik der Universität Kiel (AIPUK), 16, 43-94.
Kohler, K., Krützmann, U., Reetz, H. & Timmermann, G. (1982). Sprachliche Determinanten der signalphonetischen Dauer. Arbeitsberichte des Instituts für Phonetik der Universität Kiel (AIPUK), 17, 1-48.
Öhman, S. E. G. (1966). Coarticulation in VCV utterances: spectrographic measurements. Journal of the Acoustical Society of America, 39, 151-168.
Port, R. F. (1981). Linguistic timing factors in combination. Journal of the Acoustical Society of America, 69, 262-274.
Port, R. F., Al-Ani, S. & Maeda, S. (1980). Temporal compensation and universal phonetics. Phonetica, 37, 235-252.
Slis, I. H. & Cohen, A. (1969). On the complex regulating the voiced-voiceless distinction I. Language and Speech, 12, 80-102.
Symposium 6
Human and automatic speech recognition
THE PROBLEMS OF VARIABILITY IN SPEECH RECOGNITION AND IN MODELS OF SPEECH PERCEPTION
Dennis H. Klatt
Massachusetts Institute of Technology, Cambridge, U.S.A.

EXTENDED ABSTRACT
Human listeners know, implicitly, a great deal about what acoustic properties define an acceptable pronunciation of any given word. Part of this knowledge concerns the kinds of environmental variability, within-speaker variability, and across-speakers variability that is to be expected and discounted during the process of identification. Current computer algorithms that recognize speech employ rather primitive techniques for dealing with this variability, and thus often find it difficult to distinguish between members of a relatively small set of acoustically distinct words if spoken by many talkers. We will try to identify exactly what is wrong with current "pattern-recognition" techniques for getting around variability, and suggest ways in which machines might significantly improve their speech recognition performance in the future by attending to constraints imposed by the human speech production and perception apparatus. We will also consider the additional variability that arises when the task is continuous speech recognition. A "second-generation" LAFS model of bottom-up lexical access is offered as a means for identifying words in connected speech, and as a candidate perceptual model. The refinements concern a more efficient, generalizable way of handling cross-word-boundary phonology and coarticulation, and a model of learning that may be powerful enough to explain how listeners optimize acoustic-phonetic decisions and discover phonological rules.

BACKGROUND

It is nearly 7 years since the end of the ARPA speech understanding project, which I reviewed in a 1977 paper (Klatt, 1977), and it is about 4 years since I published a theoretical paper on modeling of speech perception and lexical access (Klatt, 1979a) that was based in large measure on ideas from the ARPA project. Since that time there has been much activity on isolated word recognition, some limited activity in connected speech recognition, particularly at IBM, and some proliferation of perceptual models (Elman and McClelland, 19xx). This paper will be an attempt to look critically at activity in each of these areas, but particularly to focus on what appears to be a bottleneck limiting progress in isolated word recognition.

A typical isolated word recognition system might characterize an input speech waveform as a sequence of spectra computed every ten to twenty msec. Each vocabulary item might then be represented by one or more sequences of spectra derived from training data. Recognition consists of finding the best match between input and vocabulary templates. Recognition of a small set of words would not be difficult were it not for the remarkable variability seen in the pronunciation of any given word. In the systems we are talking about, within-speaker variability in pronunciation and speaking
rate are handled by (1) including more than one word template if a clustering algorithm decides that a single template cannot adequately describe the training data (Rabiner et al., 1979), (2) using dynamic programming to try essentially all reasonable temporal alignments of the unknown spectral sequence with the spectral sequences characterizing word templates (Sakoe and Chiba, 1971; Itakura, 1975), and (3) using the linear prediction residual spectral distance metric to quantify phonetic similarity between pairs of spectra (Itakura, 1975).

Each of these three techniques represents an important scientific advancement over schemes used previously. However, the theme to be developed in this paper is that these techniques are not adequate. One must understand the processes by which variability arises and by which we as listeners have learned to ignore irrelevant acoustic variation when recognizing words spoken by many talkers.

TREATMENT OF VARIABILITY

Variability in the acoustic manifestations of a given utterance is substantial and arises from many sources. These include:
[1] recording conditions (background noise, room reverberation, microphone/telephone characteristics),
[2] within-speaker variability (breathy/creaky voice quality, changes in voice fundamental frequency, speaking-rate-related undershoot in articulatory targets, slight statistical variability in articulation that can lead to big acoustic changes, variable amount of feature propagation, such as nasality or rounding, to adjacent sounds),
[3] cross-speaker variability (differences in dialect, vocal-tract length and neutral shape, detailed articulatory habits),
[4] word environment in continuous speech (cross-word-boundary coarticulation, phonological and phonetic recoding of words in sentences).
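To make the template-matching scheme criticized here concrete, the following is a minimal illustrative sketch in Python of dynamic-programming alignment between an unknown spectral sequence and stored word templates. It is a toy under stated assumptions, not any cited system's code: a plain Euclidean frame distance stands in for Itakura's linear prediction residual metric, and the vocabulary, array shapes, and data are hypothetical.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic-programming (DTW) alignment cost between two spectral
    sequences, each an (n_frames, n_channels) array of spectra computed
    every 10-20 ms. A Euclidean frame distance stands in for Itakura's
    LPC residual metric."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # stretch, diagonal match, or shrink: all reasonable
            # temporal alignments are tried implicitly
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i - 1, j - 1],
                                 cost[i, j - 1])
    return cost[n, m]

def recognize(unknown, templates):
    """Pick the vocabulary item whose template aligns best."""
    return min(templates, key=lambda w: dtw_distance(unknown, templates[w]))

# hypothetical usage with random 10-channel "spectra"
rng = np.random.default_rng(1)
templates = {"eight": rng.random((30, 10)), "nine": rng.random((25, 10))}
unknown = templates["eight"] + 0.05 * rng.random((30, 10))
print(recognize(unknown, templates))  # -> eight
```

Clustering (technique 1) would simply add more than one such template per word. The sketch also makes the paper's point visible: nothing in it models where the variability in sources [1]-[4] comes from.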
The cumulative effects of this variability are so great that current systems designed to recognize only the isolated digits zero-to-nine have considerable difficulty doing so in a speaker-independent manner. A poor understanding of variability is perhaps the most important stumbling block inhibiting the development of really powerful isolated word recognition devices. There is a crying need for a systematic acoustic study of variability.

CONTINUOUS SPEECH RECOGNITION
There is not as much difference between the problems facing the designer of advanced isolated word recognition systems and those facing the designer of a continuous speech recognition device as was once supposed. However, a word in continuous speech is subject to a greater number of permitted variations than in isolation, due to variations in stress, duration, and phonetic context specified by adjacent words. One question to be considered in this section is how to characterize these processes by rule and use the rule-based knowledge in a recognition algorithm.
Representation of Acoustic-Phonetic and Phonological Knowledge

LAFS. Consider the problem of recognizing connected digits. In order to recognize the digit "8" when preceded or followed by any other digit or silence, one must specify in considerable detail the acoustic modifications at word boundaries that are likely to take place in each case. If not, then one must rely on the robustness of the acoustic center of the word and treat the coarticulation occurring mostly at onset and offset as one further source of noise. This is often done (Sakoe and Chiba, 1971; Myers and Rabiner, 1981), but it is a compromise that, it is clear, we as listeners do not make. The alternative is to describe ten different expected onsets and ten different expected offsets for each digit and, furthermore, to require that these alternatives only be used if in fact the appropriate digit is seen at each end of the "8". Construction of a network of alternative spectral sequences for each possible digit-digit transition is a straightforward (though tedious) solution, as described in my paper on LAFS (lexical access from spectra) (Klatt, 1979a).(1) I am currently doing just that in order to explore the properties of LAFS and to provide a testbed for evaluation of alternative distance metrics. Perhaps preliminary results will be available by the time of the conference.

(1) An improved version of LAFS is described in a later section.

Word-Boundary Effects. Coarticulation and cross-word-boundary phonology are treated in the LAFS system by creating a large number of states and paths between word endings and all word beginnings. Can such a brute-force method be applied successfully in very large-vocabulary connected speech recognition, and can it serve as a model of perceptual strategies? All acoustic details cannot be learned for every word, because it would require too much labeled training data (even for IBM), and generalizations, most probably at the level of the diphone, are the only way to capture coarticulation rules at word boundaries in a meaningful and useful way. Children are exposed to an enormous amount of speech data, and the same argument must apply — else we would expect to find perceptual aberrations where e.g. the palatalization rule of "did you" fame was known for some word pairs, yet others, when palatalized, slowed down perception or caused misperceptions. So how do we resolve the paradox implied by the need to apply cross-word-boundary coarticulation rules rapidly and automatically without invoking a labyrinth of "hypothesize, test, and post-test-for-valid-context" heuristics?(2) The answer is that we must devise a way to get the advantage of precompiled networks of spectral expectations for each word, but within the constraint that coarticulatory effects between words be described by segmentally based rules (or a single word-boundary sub-network) rather than by complete elaboration of the alternatives at each word boundary in the network.

(2) Newell (1979) argues that cognitive strategies such as analysis by synthesis or hypothesize-and-test are too time consuming to be realistic models of human perception, and that one must find ways of devising highly parallel automatic processes to model human thought. A similar view is expressed in the work of Elman and McClelland (19xx).
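Before turning to the revised model, the brute-force construction described above can be sketched as follows. This is a hypothetical illustration, not the actual LAFS networks: real entries would be spectral template sequences rather than string labels.

```python
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
CONTEXTS = DIGITS + ["silence"]

def boundary_variant(left, word, right):
    """Stand-in for the real acoustic knowledge: return the expected
    spectral sequence for `word` given its neighbors. Here it is just
    a label; in a real system it would be a template sequence."""
    return f"{left}|{word}|{right}"

# precompiled network: for each digit, one onset variant per possible
# predecessor and one offset variant per possible successor -- the
# "ten different expected onsets and ten different offsets"
network = {
    word: {(left, right): boundary_variant(left, word, right)
           for left in CONTEXTS for right in CONTEXTS}
    for word in DIGITS
}

# each digit carries 11 x 11 context-dependent variants: tedious but
# straightforward, exactly as described in the text
print(len(network["eight"]))                 # 121
print(network["eight"][("two", "silence")])  # two|eight|silence
```

The enumeration grows with the square of the vocabulary-plus-silence contexts, which is what motivates the shared word-boundary sub-network below.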
The New LAFS. Consider the simplest form that this structure could take by assuming, for the moment, that coarticulation across word boundaries is restricted to the diphone consisting of the last half of the phoneme at the end of the word and the first half of the initial phoneme of all words. Then, instead of creating a large set of paths and states for each word in the lexicon so as to connect the end of the word to all word beginnings, it suffices to jump to the appropriate place in a single word-boundary network, carrying forward a backpointer to the word that would be recognized if this continues to be the best network path. The word-boundary network specifies the spectral sequences that must be traversed in order to get to the beginning of words with each possible beginning phoneme.
It is possible to conceive of more general variants of this approach that allow coarticulation and phonological recoding over greater portions of words, that might incorporate into a special part of the network regular suffixes such as the plural and past, and that might even incorporate the short, highly modifiable function words such as "to", "and", "a", and "the". This proposal is an extension of the original LAFS design that I described in 1979. As a practical matter, it makes possible the construction of large-vocabulary networks without the prohibitive cost of full duplication of cross-word-boundary networks at the beginning and end of each word. As a perceptual model, the new LAFS is an improvement because it means that
word-boundary rules are automatically generalized to all lexical items. If taken as a perceptual model, the word-boundary sub-network seems to imply that only the best-scoring word for each possible phonetic ending can be "seen" by the module or demon that searches for the best score, which would be a strong constraint on bottom-up lexical search.

Unsupervised Learning of Acoustic-Phonetic Decisions. The IBM system (Jelinek, 1976) is one of several attempts to automatically optimize a decision structure on the basis of experience (see also Lowerre and Reddy, 1980). Generalization takes place at the level of the phoneme, and thus may constitute an attractive model of how children first attempt generalizations that depart from whole-word acoustic patterns. In a sense, the IBM system is looking for acoustic invariance in the spectral representations that it sees, and thus the system is parsimonious with several current accounts of children's language acquisition (Stevens, 19xx). The weakness of the IBM system in making fine phonetic decisions constitutes a strong refutation of the idea that phonemic invariance is sufficient for speech understanding. The question we pose is this — how great a modification to the IBM model is required in order to discover more appropriate acoustic generalizations?

The answer is surprisingly simple: when a sequence of spectra is mapped onto a particular phoneme in an input utterance (that is correctly recognized), do not update probabilities at all instantiations of a phoneme, as IBM does now, but rather update only probabilities at those network locations possessing the same phonetic environment as is observed in the input. If done on a diphone basis, then network states near the beginning of a phoneme definition are tuned only to inputs involving that phoneme preceded by the appropriate phoneme and, correspondingly, network states near the end of the phoneme definition are tuned only to input data having the appropriate following phoneme.
set of p r o b a b i l i t i e s to b e
e s t i m a t e d and s t o r e d t h a n in the s t a n d a r d
IBM s y s t e m .
In the
s t a n d a r d s y s t e m , t h e r e m i g h t be 40 p h o n e m e s , 10 n e t w o r k t r a n s i t i o n s per p h o n e m e , and 200 t e m p l a t e p r o b a b i l i t i e s to be estimated
for e a c h t r a n s i t i o n , or a b o u t 1 0 0 , 0 0 0 p r o b a b i l i t i e s
b e e s t i m a t e d and s t o r e d .
If training
d i p h o n e s , and the t e m p l a t e
to
is d o n e o n the b a s i s of
i n v e n t o r y is i n c r e a s e d
in o r d e r
p e r m i t finer p h o n e t i c d i s t i n c t i o n s , the n u m b e r s m i g h t be
to
about
1000 d i p h o n e s t i m e s , s a y , 6 n e t w o r k t r a n s i t i o n s per d i p h o n e
times
1000 t e m p l a t e p r o b a b i l i t i e s , or 6 m i l l i o n p r o b a b i l i t i e s to be e s t i m a t e d and s t o r e d . impractical
increase
A 60-fold
i n c r e a s e w o u l d imply a n
in r e q u i r e d training d a t a as w e l l as a
computer memory greater
t h a n is e a s i l y r e f e r e n c e d
in m o s t
computers. An alternative
is to r e t u r n to the f r a m e w o r k of H a r p y
LAFS in w h i c h p r o b a b i l i t y is r e p l a c e d by a d i s t a n c e m e t r i c
and that
computes the likelihood that an input spectrum is the same as a spectral template (or small set of alternative templates) representing each network state. Lowerre and Reddy (1980) describe one means of using unsupervised learning to cause the templates to converge toward data seen during recognition, simply by averaging input spectra with template spectra. It is doubtful that direct spectral averaging will work best, because spectral peaks become less peaked during averaging.
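For concreteness, a small numerical illustration of the peak-flattening problem just raised, using hypothetical Gaussian-shaped "spectra":

```python
import numpy as np

# two sharp spectral peaks (e.g. a formant) at slightly different channels
freq = np.arange(64)
template = np.exp(-0.5 * ((freq - 20.0) / 1.5) ** 2)   # stored template
observed = np.exp(-0.5 * ((freq - 23.0) / 1.5) ** 2)   # input spectrum

# direct spectral averaging, as in simple template tuning
tuned = 0.5 * (template + observed)

# both inputs peak at 1.0; the average's peak is lower and broader --
# the "peaks become less peaked" objection in the text
print(round(template.max(), 2), round(tuned.max(), 2))  # 1.0 0.61
```

Averaging after aligning the peaks (or averaging in some peak-preserving domain) would be one of the "other methods" the text calls for.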
Some other method of averaging will have to be found, but the concept of automatic template tuning, while not new, seems to be sufficiently powerful to serve as an attractive mechanism for both machine and man.
A v e r a g i n g of input w i t h t e m p l a t e s c a n c a u s e n e t w o r k s t r u c t u r e s c o n v e r g e t o w a r d o p t i m a l p e r f o r m a n c e , b u t h o w d o e s one c r e a t e n e t w o r k s t r u c t u r e s to c h a r a c t e r i z e n e w l y d i s c o v e r e d r u l e s or u n f o r e s e e n a c o u s t i c - p h o n e t i c
new
phonological
possibilities?
It is
recognized
n e c e s s a r y to b e a b l e to d e t e c t w h e n a c o r r e c t l y acoustic
to
input d o e s n o t m a t c h the c o r r e c t p a t h t h r o u g h
the
n e t w o r k v e r y w e l l , and e s t a b l i s h t h a t t h i s a c o u s t i c d a t a c o u l d be g e n e r a t e d by a h u m a n vocal t r a c t o b e y i n g p h o n e t i c s and p h o n o l o g y .
r u l e s of
English
D e t e c t i n g a poor m a t c h m a y n o t b e
d i f f i c u l t , b u t to be a b l e to d e t e r m i n e w h e t h e r
the d e v i a t i o n s
w o r t h y of i n c l u s i o n a s n e w n e t w o r k p a t h s of local or i m p o r t r e q u i r e s e x p e r t k n o w l e d g e of the r u l e s of p r o d u c t i o n a n d their a c o u s t i c c o n s e q u e n c e s .
too are
global
speech
It is m y b e l i e f
that
Klatt: The problems of variability in speech recognition
297
the r o l e of a n a l y s i s b y s y n t h e s i s and the m o t o r t h e o r y of
speech
p r o d u c t i o n a r i s e s e x a c t l y h e r e , to s e r v e a s a c o n s t r a i n t o n the c o n s t r u c t i o n of a l t e r n a t i v e n e t w o r k p a t h s d u r i n g learning.
Construction of a LAFS-like computer simulation possessing these skills must await progress in understanding the detailed relations between speech production, perception, and phonology. We have described possible mechanisms for tuning networks through experience and for augmenting their connectivity pattern, but how does the whole process get started in the child? Do general principles exist that, when confronted with speech data like that to which a child is exposed, create networks of this sort? The alternative, innate processes and structures for speech perception, goes well beyond the kinds of simple innate feature detectors that have been proposed from time to time in the speech perception literature. While it may turn out that much of the structure of the perception process must be assumed to be specified genetically, it is best to continue the search for data-driven mechanisms rather than bow down to the god of innateness too soon.
CONCLUSIONS

This is a very exciting time for engineers, linguists, and psychologists interested in speech recognition and speech perception, for we are clearly at the threshold of a breakthrough in both understanding and machine performance. It has been argued here that this breakthrough will be expedited by careful study of variability in speech, development of better phonetically-motivated distance metrics, and the description of acoustic-phonetic details within the framework of a recognition algorithm that is both simple and powerful, such as LAFS.

ACKNOWLEDGEMENT

This work was supported in part by grants from the National Science Foundation and the Department of Defense.
PROPOSAL FOR AN ISOLATED-WORD RECOGNITION SYSTEM BASED ON PHONETIC KNOWLEDGE AND STRUCTURAL CONSTRAINTS
Victor W. Zue
Massachusetts Institute of Technology, Cambridge, U.S.A.

During the past decade, significant advances have taken place in the field of isolated word recognition (IWR). In many instances, transitions from research results to practical implementations have been made.
Today, speech recognition systems that can recognize a small set of isolated words, say 50, for a given speaker with an error rate of less than 5% appear to be relatively common. Most current systems utilize little or no speech-specific knowledge, but derive their power from general-purpose pattern recognition techniques. The success of these systems can at least in part be attributed to the introduction of novel parametric representations (Makhoul, 1975), distance metrics (Itakura, 1975), and the very powerful time alignment procedure of dynamic programming (Sakoe and Chiba, 1971).

While we have clearly made significant advances in dealing with a small portion of the speech recognition problem, there is serious doubt regarding the extendibility of the pattern matching approach to tasks involving multiple speakers, large vocabularies and/or continuous speech. One of the limitations of the template matching approach is that both computation and storage grow (essentially) linearly with the size of the vocabulary. When the size of the vocabulary is very large, e.g., over 10,000 words, the computation and storage requirements associated with current IWR systems become prohibitively expensive. Even if the computational cost were not an issue, the performance of these IWR systems for a large vocabulary would surely deteriorate (Keilin et al., 1981). Furthermore, as the size of the vocabulary grows, it becomes imperative that such systems be able to operate in a speaker-independent mode, since training of the system for each user will take too long.
This paper proposes a new approach to large-vocabulary, isolated word recognition which combines detailed acoustic-phonetic knowledge with constraints on the sound patterns imposed by the language. The proposed system draws on the results of two sets of studies; one demonstrating the richness of phonetic information in the acoustic signal and the other demonstrating the power of structural constraints imposed by the language.

Spectrogram Reading
Reliance on general pattern matching techniques has been partly motivated by the unsatisfactory performance of early phonetically-based speech recognition systems. The difficulty of automatic acoustic-phonetic analysis has also led to the speculation that phonetic information must be derived, in large part, from semantic, syntactic and discourse constraints rather than from the acoustic signal. For the most part, the poor performance of these phonetically-based systems can be attributed to the fact that our knowledge of the context-dependency of the acoustic characteristics of speech sounds was very limited at the time. However, this picture is slowly changing. We now have a far better understanding of contextual influences on phonetic segments. This improved understanding has been demonstrated in a series of spectrogram reading experiments (Cole et al. 1980). It was found that a trained subject can phonetically transcribe unknown sentences from speech spectrograms with an accuracy of approximately 85%. This performance is better than the phonetic recognizers reported in the literature, both in accuracy and rank order statistics. It was also demonstrated that the process of spectrogram reading makes use of explicit acoustic phonetic rules, and that this skill can be learned by others. These results suggest that the acoustic signal is rich in phonetic information, which should permit substantially better performance in automatic phonetic recognition. However, even with a substantially improved knowledge base, a completely bottom-up phonetic analysis still has serious drawbacks.
It is often difficult to make fine phonetic distinctions (for example, distinguishing the word pair "Sue/shoe") reliably across a wide range of speakers. Furthermore, the application of context-dependent rules often requires the specification of the correct context, a process that can be prone to error. (For example, the identification of a retroflexed /t/ in the word "tree" depends upon correctly identifying the retroflex consonant /r/.) Problems such as these suggest that a detailed phonetic transcription of an unknown utterance may not by itself be a desirable aim for the early application of phonetic knowledge.

Constraints on Sound Patterns
Detailed segmental representation of the speech signal constitutes but one of the sources of encoded phonetic information. The sound patterns of a given language are not only limited by the inventory of basic sound units, but also by the allowable combinations of sound units. Knowledge about such phonotactic constraints is presumably very useful in speech communication, since it provides native speakers with the ability to fill in phonetic details that are otherwise not available or acoustically distorted. Thus, as an extreme example, a word such as "splint" can be recognized without having to specify the detailed characteristics of the phonemes /s/, /p/, and /n/. In fact, "splint" is the only word in the Merriam Pocket Dictionary (containing about 20,000 words) that satisfies the following description:

[1] [CONS] [CONS] [CONS] [VOWEL] [NASAL] [STOP].

In a study of the properties of large lexicons, Shipman and Zue (1982) found that knowledge of even broad specification of the sound patterns of American English words, both at the segmental and suprasegmental levels, imposes strong constraints on their phonetic identities. For example, if each word in the lexicon is represented only in terms of 6 broad manner categories (such as vowel, stop, strong fricative, etc.), then the average number of words in a 20,000-word lexicon that share the same sound pattern is about 2. In fact, such crude classification will enable about 1/3 of the lexical items to be uniquely determined.

There is indirect evidence that the broad phonetic characteristics of speech sounds and their structural constraints are utilized to aid human speech perception. For example, Blesser (1969) has shown that people can be taught to perceive spectrally-rotated speech, in which manner cues and suprasegmental cues are preserved while detailed place cues are severely distorted. The data on misperception of fluent speech reported by Bond and Garnes (1980) and the results of experiments on listening for mispronunciation reported by Cole and Jakimik (1980) also suggest that the perceptual mechanism utilizes information about the broad phonetic categories of speech sounds and the constraints on how they can be combined.

Proposed System

Based on the results of the two studies cited above, we propose a new approach to phonetically-based isolated-word recognition. This approach is distinctly different from previous attempts in that detailed phonetic analysis of the acoustic signal is not performed. Rather, the speech signal is segmented and classified into several broad manner categories. The broad (manner) classifier serves several purposes. First, the system should be less sensitive to interspeaker variations. Second, by avoiding fine phonetic distinctions, the errors in phonetic labeling, which are most often caused by detailed phonetic analyses, would also be reduced. Finally, we speculate that the sequential constraints and their distributions, even at the broad phonetic level, may provide powerful mechanisms to reduce the search space substantially. This last feature is particularly important when the size of the vocabulary is large (of the order of several thousand words or more).

Once the acoustic signal has been reduced to a string (or lattice) of phonetic segments that have been broadly classified, the resulting representation will be used for lexical access. The intent is to reduce the number of possible word candidates by utilizing knowledge about the structural constraints, both segmental and suprasegmental, of the words. The result, as indicated previously, should be a relatively small set of word candidates. The correct word will then be selected through judicious applications of detailed phonetic knowledge.

In summary, this paper presents a new approach to the problem of recognizing isolated words from large vocabularies and multiple speakers. The system initially classifies the acoustic signal into several broad manner categories. Once the potential word candidates have been significantly reduced through the utilization of structural constraints, then a detailed examination of the acoustic differences would follow. Such a procedure will enable us to deal with the large vocabulary recognition problem in an efficient manner. What is even more important is the fact that such an approach bypasses the often tedious and error-prone process of deriving a complete phonetic transcription from the acoustic signal. In this approach, detailed acoustic phonetic knowledge can be applied in a top-down verification mode, where the exact phonetic context can be specified.
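A minimal sketch of the two-stage procedure proposed above. The broad-class inventory, the phoneme mapping, and the `verify` scoring interface are all illustrative assumptions, not the actual system or the categories used by Shipman and Zue (1982):

```python
from collections import Counter

# Hypothetical mapping from phonemes to broad classes (illustrative only).
BROAD = {"s": "CONS", "p": "CONS", "l": "CONS", "t": "STOP",
         "n": "NASAL", "i": "VOWEL"}

def broad_pattern(phonemes):
    """Map a phonemic transcription to its broad-class pattern."""
    return tuple(BROAD[p] for p in phonemes)

def pattern_statistics(lexicon):
    """Average number of words sharing a pattern, and the fraction of
    words whose pattern is unique (cf. the 20,000-word figures above)."""
    counts = Counter(broad_pattern(ph) for ph in lexicon.values())
    avg = len(lexicon) / len(counts)
    unique = sum(1 for c in counts.values() if c == 1)
    return avg, unique / len(lexicon)

def recognize(input_pattern, signal, lexicon, verify):
    """Stage 1 admits only words whose broad pattern matches the
    classified input; stage 2 applies detailed phonetic knowledge
    top-down through `verify`, a hypothetical scoring function."""
    candidates = [w for w, ph in lexicon.items()
                  if broad_pattern(ph) == input_pattern]
    return max(candidates, key=lambda w: verify(w, signal))

# Toy check: "splint" as [CONS][CONS][CONS][VOWEL][NASAL][STOP].
print(broad_pattern(["s", "p", "l", "i", "n", "t"]))
```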
REFERENCES

Bond, Z.S. and Garnes, S. (1980) "Misperceptions of Fluent Speech," Chapter 5 in Perception and Production of Fluent Speech, ed. R.A. Cole, 115-132 (Lawrence Erlbaum Asso., Hillsdale, New Jersey).
Cole, R.A. and Jakimik, J. (1980) "A Model of Speech Perception," Chapter 6 in Perception and Production of Fluent Speech, ed. R.A. Cole, 133-163 (Lawrence Erlbaum Asso., Hillsdale, New Jersey).
Cole, R.A., Rudnicky, A.I., Zue, V.W., and Reddy, D.R. (1980) "Speech as Patterns on Paper," Chapter 1 in Perception and Production of Fluent Speech, ed. R.A. Cole, 3-50 (Lawrence Erlbaum Asso., Hillsdale, New Jersey).
Itakura, F. (1975) "Minimum Prediction Residual Principle Applied to Speech Recognition," IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. ASSP-23, 67-72.
Keilin, W.J., Rabiner, L.R., Rosenberg, A.E., and Wilpon, J.G. (1981) "Speaker Trained Isolated Word Recognition on a Large Vocabulary," J. Acoust. Soc. Am., Vol. 70, S60.
Makhoul, J.I. (1975) "Linear Prediction: A Tutorial Review," Proc. IEEE, Vol. 63, 561-580.
Sakoe, H. and Chiba, S. (1971) "A Dynamic Programming Optimization for Spoken Word Recognition," IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. ASSP-26, 43-49.
Shipman, D.W. and Zue, V.W. (1982) "Properties of Large Lexicons: Implications for Advanced Isolated Word Recognition Systems," Conference Record, IEEE 1982 International Conference on Acoustics, Speech and Signal Processing, 546-549.
TIME IN THE PROCESS OF SPEECH RECOGNITION
Stephen M. Marcus
Institute for Perception Research (IPO), Eindhoven, The Netherlands

1 introduction
The process of speech recognition involves finding an optimum match between an unknown input and a sequence of stored representations of known words in a listener's vocabulary. In normal conversational speech a number of problems arise. Firstly, word onsets and offsets are not clearly marked, if at all, in the acoustic signal. Secondly, the durations of words may show large variations, which may result in non-linear changes in segment duration within a word. Finally, the spectral or even phonetic realisations of a word may exhibit considerable variation from production to production, even given the same speaker. Most approaches consider speech as a sequence of events ordered in time, and implicitly assume that such a linear sequential representation is used both to represent the unknown input and to code the stored lexicon. It is not of great importance here whether such a representation is in terms of phoneme-like segments, short-term spectral descriptions, or longer diphone or syllabic elements. The same problems, of onset detection, time normalisation, and variant production, arise in all cases. Adequate solutions have been developed, respectively involving testing all possible starting points for all possible words in the lexicon, "time warping" algorithms to find an optimal temporal match between the input and the representation of each possible word, and representations built of nodes with alternative branching paths for known variations in production. Given the rapid increase in speed and decrease in cost of computer hardware, it seems certain that, in the not-too-distant future, such approaches will become economically feasible, even for relatively large vocabularies.
Figure 1. "Time warping" to match an input with its corresponding stored representation. (Axes: states in input sequence vs. states in stored representation.)
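For concreteness, the "time warping" match illustrated in Figure 1 can be sketched with a textbook dynamic-programming formulation; this is a generic version, not the specific algorithm of any system cited here:

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic-time-warping distance between two sequences of spectral
    frames (arrays of shape [time, freq]), allowing the non-linear
    temporal match sketched in Figure 1."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```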
However, the most efficient speech recognition device constructed to date, which will probably remain so for a considerable time to come, is characterised not by serial processing hardware with cycle times measured in nanoseconds, but by electro-chemical "wetware" with transmission latencies measured in milliseconds, and owes its speed to a large capacity for independent parallel processing. The device I am referring to is of course the human brain. Seen from this viewpoint, speech recognition becomes essentially a problem of memory access, and we need to ask how speech might be stored in our memories.
2 context-sensitive coding
One model of human memory is the "slot theory", in which memory is seen as organised into a set of discrete "slots" in which differing items are stored. Items are recalled by accessing the contents of each slot in turn, and forgetting results when the contents of a slot become overwritten or lost. An alternative theory, one of whose major proponents is Wickelgren (1969), is that memory has an associative structure, based on a context-sensitive code in which the sequence of items is not stored, but only the association between adjacent items. Thus in the slot theory, the sequence "m", "e", "n", "i" would be stored as: (1: "m"; 2: "e"; 3: "n"; 4: "i"). In an associative model the same sequence would be represented by: ("m"-"e"; "e"-"n"; "n"-"i")
where " - " should be read as "followed by". The slot theory is analogous to the use of a linear sequential representation in coding speech. It presents the same problems of onset detection, normalisation, and variations in production. An associative or context-sensitive code offers a far greater flexibility in these matters. Since the temporal sequence is no longer explicitly represented, the associations (which will here be limited to associations between neighbouring pairs of states, and termed state-pairs) may be compared directly as they occur in the unknown input with those anywhere in any stored representation. Such a matching process is ideally suited to a parallel processing system working with a content addressable memory (as we know the human brain is: for example, remember a moment in your early childhood when you were outside in the sunshine, or the first time you heard of speech research). Wickelgren (1969) has pointed out that such a system based on pair information will tend to make errors, somewhat like stuttering or
some slips of the tongue, when used for speech production. I would like to suggest that conversely, it may offer just the flexibility we need in perception, in dealing with the variability in the speech signal. Figure 2 illustrates the use of a context-sensitive code based on state-pairs in recognizing the same sequence as in Figure 1.
Figure 2. Associative coding used to match an input with a number of stored representations. (Labels: state-pairs in stored representations.)
Note that it is neither necessary to locate the moment of word onset in the unknown input, nor to have special procedures to deal with local changes in speech rate. Each state-pair in the input finds its match in certain stored representations, regardless of its temporal location. Nor do all state-pairs need to match, and alternative productions may be catered for by including their corresponding state-pairs in the stored representation. All that is required for correct recognition is that the number of state-pairs corresponding between the input and the correct representation is greater than for all other stored representations.
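A minimal sketch of this matching principle, assuming the spectral states have already been quantised to discrete symbols; the counting score is illustrative and much cruder than the probabilistic activation used in the simulation below:

```python
def state_pairs(states):
    """Context-sensitive code: keep only the associations between
    adjacent states, discarding their temporal positions."""
    return set(zip(states, states[1:]))

def recognise(input_states, lexicon):
    """Score each stored word by how many of its state-pairs occur
    anywhere in the input; no word-onset detection or time alignment
    is needed."""
    observed = state_pairs(input_states)
    return max(lexicon, key=lambda w: len(observed & state_pairs(lexicon[w])))

# Toy usage with letters standing in for spectral states.
lexicon = {"men": list("men"), "net": list("net")}
print(recognise(list("xmenx"), lexicon))  # -> "men"
```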
3 a computer simulation
Such a simple coding may introduce ambiguity in the recognition of some sequences. It remains an empirical question whether this ambiguity will result in problems in recognition, or will turn out to result in the flexibility needed to deal with the idiosyncrasies of the speech signal. Since the effectiveness of such a context-sensitive code with real speech is difficult to estimate by purely theoretical considerations, a simple computer speech recognition system was implemented (Marcus, 1981). Speech was
analysed at 10 msec intervals, the first three formants being extracted automatically by the IPO linear-prediction vocoder system (Vogten & Willems, 1977). No assumptions were made about the nature of possible intermediate states, such as phonemes or syllables, between such a spectral representation and each word in the vocabulary. The system performed a direct mapping between the spectral representation of the input and that for each word. Such an approach has been demonstrated to be highly successful in conventional approaches to automatic speech recognition by Bakis (1974) and Lowerre (1976), and has also been advocated by Klatt (1979). In this case however, both the input and the stored lexicon were represented using a context-sensitive coding of adjacent spectral states. These "state-pairs" were used in a recognition system for a small vocabulary - the English digits "one" to "nine". Figure 3 shows the performance of nine independent recognition units, one for
Figure 3.
The response of the computer simulation using a context-sensitive code to "unknown" tokens of the digits "one", "two" and "three". Each figure shows the response of all recognition units for all nine digits to the unknown input. The horizontal axis is stimulus time, the vertical axis is a logarithmic transform of recognition unit activity.
each digit, in response to tokens of the words "one", "two" and "three". For each recognition unit, a logarithmic transform of cumulative probability is plotted against stimulus time. The system displays a number of interesting properties. The most superficial, though not the least impressive, is that the upper trace in each case is the "correct" recognition unit, corresponding in type with the stimulus word. Secondly, within 100 to 150 ms from stimulus onset, only a small number of recognition units remain as possible candidates, the rest having become so improbable that they have been deactivated from the system. Since each recognition unit operates independently, this characteristic is not critically dependent on the number of recognition units in the system. This time period is of the same order of magnitude as that suggested by Marslen-Wilson for restricting activity to a "word initial cohort" in the Cohort Model (Marslen-Wilson & Welsh, 1978). Thirdly, it is generally possible to make a recognition decision well before the end of the word, as soon as the activation of a particular recognition unit has risen sufficiently above that of all others. Though no such recognition decision was built into the computer simulation, it can be seen in Fig. 3 that such a decision can be made within 300 ms of stimulus onset with these stimuli. This performance is also typical for tokens of the other stimuli. This simulation also allows us to compare the effectiveness of state-pair information to that contributed simply by the presence of single states. The performance shown for "one" in Figure 3a should be contrasted with Figure 4, where only single-state information, rather than state-pairs, is used. The increase in performance using a context-sensitive code is clear and quite dramatic.
Figure 4.
The response of the computer simulation to the same "one" as in Figure 3, here using only single-state information. The vertical axis is to the same scale.
4 word boundaries
Though a system as outlined above provides an elegant solution to the problem of word onset detection - by making such detection unnecessary - it contains no component for producing an actual recognition decision. This would presumably base its decision on the relative activity of all the word candidates, waiting until one had risen sufficiently high in relation to all others. Contextual information could also be incorporated into such a decision algorithm. Lacking such a component, it was presumed that word onsets could be difficult or impossible to detect. The high level of activity of the candidate corresponding to the previous word was expected to mask the rise in activity resulting from the onset of the following word. However, Figure 5 shows the activation of the model in response to the sequence "one-nine-eight-two". The only modification to the system described above is that the activity of recognition units cannot fall below zero - that is, evidence is not collected against improbable candidates, only for or against those which have a significant chance of being present (a recognition unit cannot be "deader than dead"). Under the horizontal axis, the number corresponding to the highest recognition unit is displayed; above, the actual stimulus sequence is also shown.
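The zero-clamping modification can be stated in a one-line sketch (the name and evidence scale are illustrative):

```python
def update_activity(activity, evidence):
    """Accumulate evidence for a recognition unit, clamping at zero:
    a unit cannot be 'deader than dead', so a new word's onset can
    still raise it quickly after a run of negative evidence."""
    return max(0.0, activity + evidence)
```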
Figure 5.
The response of the computer simulation to the sequence "one-nine-eight-two". No recognition component has been incorporated to deactivate each unit after recognition in preparation for the onset of the next. The numbers under the horizontal axis indicate the digit corresponding to the most active recognition unit at each moment in time.
5 conclusion
One conclusion which may be drawn even from this limited simulation is that the solution to many current problems in speech recognition may be simpler than we usually suppose. In particular the representation of time and problems associated with word onsets and offsets may not require the extremely complex approaches currently being used. It remains to be seen whether this same approach will be valid for a much larger vocabulary. There is good reason to suppose this will be the case; firstly the extreme rapidity (in terms of stimulus time) with which this simulation discriminates one word from all others indicates the presence of much more power than needed to discriminate nine words. Secondly, trials with words not in the current vocabulary show good rejection of these words, unless, of course, they share much in common with a known word. Then, just as with hyper-efficient human speech recognition, such mispronunciations will be ignored, and the closest candidate selected.
references

Bakis, R. (1974) Continuous-speech word spotting via centisecond acoustic states. IBM speech processing group, report RC 4788.
Klatt, D.H. (1979) Speech perception: a model of acoustic-phonetic analysis and lexical access. Journal of Phonetics, 7, 279-312.
Lowerre, B.T. (1976) The HARPY speech recognition system. Unpublished PhD thesis, Carnegie-Mellon University.
Marcus, S.M. (1981) ERIS - context-sensitive coding in speech perception. Journal of Phonetics, 9, 197-220.
Marslen-Wilson, W. & Welsh, A. (1978) Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology, 10, 29-63.
Vogten, L.L.M. & Willems, L.F. (1977) The formator: a speech analysis-synthesis system based on formant extraction from linear prediction coefficients. IPO Annual Progress Report, 12, 47-62. Eindhoven, The Netherlands.
Wickelgren, W.A. (1969) Context-sensitive coding, associative memory, and serial order in (speech) behaviour. Psych. Rev., 76, 1-15.
ON THE ROLE OF PHONETIC STRUCTURE IN AUTOMATIC SPEECH RECOGNITION
Mark Liberman
Bell Laboratories, Murray Hill, N.J., USA

Introduction

Three principal assumptions underlie this presentation: (1) sub-lexical information in speech (e.g. that available in fluently spoken nonsense words) is ordinarily of very high quality; better access to this information (which is traditionally studied under the name of phonetics and phonology) is the main requirement for better automatic speech recognition (ASR); (2) in order to improve ASR it is now especially important to devise improved algorithms for describing the phonetic structure of speech signals; (3) the phonological structure of speech must ultimately play a part in ASR, but is of limited relevance without robust phonetic descriptions of the sort just mentioned. As always, the best evidence for such assumptions will prove to be the value of their consequences. I will sketch a sample recipe for exploring the consequences of these assumptions, suggesting what some appropriate phonetic descriptions might be like, and how they might be used.

Recognizing speech on a phonetic basis means using the phonetic structure of the human speech communication system in mapping from sound streams to linguistic messages. It does not imply any necessary choice between "top-down" and "bottom-up" processing directions, and in particular it does not imply the construction of a standard "phonetic transcription" as the first stage of the recognition process. Phonetic recognition requires the design of phonetically robust signal descriptions: that is, acoustically definable predicates that have reliable connections to lexically relevant categories, or, to put it in plain language, things that reliably relate sounds to words. To avoid confusion, I will call these phonetic predicates "descriptors." Such descriptors need not impose any well-defined segmentation and labelling ("phonetic transcription"); rather, they are any appropriate set of descriptions of points or regions in the sound stream, what we might think of (very loosely) as a commentary on a spectrogram. Given a set of such descriptors, the recognition process can proceed by hypothesis-and-test ("top-down"), or by describe-and-infer ("bottom-up") methods, or by a mixture of such methods, as we please. For any particular application, we also need a control structure that makes effective use of the information contained in the signal descriptions. In the end, perhaps, a single algorithm can be made to work for all cases, but this goal is surely far in the future, and phonetically-based recognition can meanwhile produce useful results in less general forms. In what follows, I will specify a general philosophy for the selection of descriptors, a particular set of descriptors with which I have sufficient experience to be confident of their usefulness, and a particular class of recognition control structures that can make effective use of the proposed descriptor set. My hope is to find a framework within which research on phonetically based speech recognition can proceed productively, with useful applications available from the beginning.
Design Criteria for a Descriptor Set.

Evidently, the descriptor set should work. The general strategy is to use phonetic knowledge to look in the right places for the crucial information; for instance, in distinguishing among the phonetically confusable set of letter names "E, B, D, V, P, T," the crucial information is in the burst spectrum, the first 40 msec or so of F2 and F3 transitions, and the voice onset time. Unless this information is extracted and made available in some relatively pure form, it will be swamped by phonetically irrelevant acoustic variation from other sources. Appropriate signal transforms are better than complex logic. For instance, in looking for candidate bursts, finding peaks in the (smooth) time derivative of the log energy is preferable to a complex procedure that "crawls around" in the rms amplitude. Such an "edge detector" is relatively immune to level changes and to background noise, and its output is characterized by a small set of numbers (e.g. the value of the derivative at the peak, the duration of the region of positive slope, etc.) that can easily be used in training the system. Because such signal transforms are mathematically simple, their behavior is also more predictable and less likely to suffer from "bugs" due to unexpected interactions among the branches of a complex decision tree. In general, the descriptor definitions should be as simple as possible, so as to minimize surprises and maximize trainability. However, as a last resort, temporary logical "patches" that seem to work are not to be disdained. All the descriptors used should be insensitive to pre-emphasis, band-limiting, broadband noise, and other phonetically irrelevant kinds of distortion. For instance, using spectral balance as a measure of voicing status fails badly if F1 is sometimes filtered out, as it often is in high vowels in telephone speech. The most interesting descriptors are lexically independent ones, since they are the most general, and will best repay the time spent to develop them. The fact that speech is phonologically encoded guarantees that an adequate set of general descriptors will exist. However, lexically specific descriptors are perfectly legitimate for applications that justify their development. For instance, the rising F2 transition in "one" will function quite reliably as a "one-spotter" in connected digit recognition. Some previously made points about the proposed "descriptors" may now be clearer. These are not phonological or phonetic features in the usual sense, but phonetically relevant acoustic descriptions. The more closely they correspond to phonologically defined categories, the better they will work, of course, but they will still be useful if the correlation is imperfect. These descriptors carry with them sets of parameter values that can be used by higher levels of the recognizer. For instance, in constructing a spotter for the rising F2 transition in "one," we find all monotonically rising "ridges" between 500 and 2000 Hz in an appropriately smoothed time-frequency-amplitude surface; the attached numbers include (say) the lowest frequency, the highest frequency, the duration, and the rate of spectral balance change over the corresponding time period. In the space defined by the joint distribution of
these parameters, "one" will be fairly well separated from anything else that can arise in connected digit sequences. There are a variety of different ways to use this separation: we can send all candidates up to a higher level parsing algorithm; we can prune hopeless cases first; we can combine the rising-F2 feature in a bottom-up way with descriptors relating to the expected nasal implosion; and so forth. The descriptors under discussion can be divided roughly into a "first tier," made up of fairly direct characterizations of local properties of the signal (like "peak in the first derivative of mid frequency log energy"), and a "second tier" of descriptions that are more abstract and closer to linguistic categories (like "apparent voiceless stop burst"). In all cases, the point is to find ways to look at signal parameters through phonetic spectacles.

A Candidate Descriptor Set

In the conference presentation, a candidate set of phonetic descriptors will be given, with a sketch of algorithms for their computation. The "first tier" of descriptors will be emphasized, and includes (1) time dimension "edges," (2) voicing and F0 determination, (3) formant-like frequency-time trajectories of local spectral features, and (4) local spectral balance changes. In the "second tier," predicates like "pre-vocalic voiceless burst" are defined, largely in terms of first-tier descriptors.

A Class of Control Structures.

The general idea is to match descriptors predicted on the basis of a linguistic transcription against candidate descriptors found in the waveform. Three cases are contemplated: (1) phonetic alignment, in which the correct transcription is known or hypothesized, and then matched against a description of the speech; (2) sequence spotting, which is like (1) above except that the endpoints of the match are free to float in the input descriptor stream; (3) phonetic parsing, in which the transcriptional pattern is generalized into the form of a grammar, and the matching algorithm is generalized to become a form of probabilistic (or otherwise "fuzzy") parsing. Type (1) algorithms can be used to construct annotated data bases, and can do isolated-word recognition on a hypothesize-and-test basis. Type (2) algorithms can be used as the first pass in a connected word recognizer, or as part of a phonological recognizer that looks for syllable-sized patterns. Type (3) algorithms will be an effective architecture for a connected word recognizer if the local information (provided by the descriptors under discussion) is good, since then a beam search or shortfall-scheduled search will work effectively. The conference presentation is divided into three parts: the nature of the lexically-derived phonetic patterns to be matched against speech-derived information; the definition of a match and its overall evaluation; and some of the algorithms that can be used to find nominally optimal matches. One general point: the hardest problem, the one that most stands in the way of progress, is the definition of effective local phonetic descriptors. Control structures for integrating local information are an interesting topic in their
own right, but for now, I think, the goal should be to find simple, adequate algorithms that don't get in the way of research on phonetic description.

The Role of Phonological Structure

It is a useful as well as interesting fact that the acoustic properties of the class of noises corresponding to a given word can be predicted from its phonological representation. As a result, a system of phonetic description can hope eventually to achieve closure, so that the introduction of a new vocabulary item does not require the inference of any information beyond its phonological "spelling." This is by no means just a matter of storing word templates at a slightly lower bit rate, which would be a matter of little consequence. The important thing is to "modularize" the mutual influence of adjacent words at their boundaries, the effects of phrasal structure, the vagaries of dialect, the consequences of rate, emphasis and affect, and so forth, so that these need not be learned anew for every word and every combination of other relevant circumstances.
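As a minimal illustration of the "edge detector" proposed under Design Criteria above, the following sketch finds peaks in the smoothed time derivative of log energy; the smoothing width and threshold are illustrative assumptions, not Liberman's settings:

```python
import numpy as np

def energy_edges(log_energy, width=3, threshold=1.0):
    """Find candidate burst onsets as peaks in the smoothed time
    derivative of log energy.  Returns (frame index, peak derivative,
    duration in frames of the surrounding positive-slope region),
    i.e. the small set of trainable numbers mentioned in the text."""
    kernel = np.ones(width) / width
    smooth = np.convolve(log_energy, kernel, mode="same")
    d = np.diff(smooth)
    edges = []
    for t in range(1, len(d) - 1):
        if d[t] > threshold and d[t] >= d[t - 1] and d[t] >= d[t + 1]:
            left = right = t
            while left > 0 and d[left - 1] > 0:
                left -= 1
            while right < len(d) - 1 and d[right + 1] > 0:
                right += 1
            edges.append((t, d[t], right - left + 1))
    return edges
```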
A COMPUTATIONAL MODEL OF INTERACTION BETWEEN AUDITORY, SYLLABIC AND LEXICAL KNOWLEDGE FOR COMPUTER PERCEPTION OF SPEECH
Renato de Mori
Department of Computer Science, Concordia University, Montreal, Quebec, Canada

ABSTRACT

A system organization for extracting acoustic cues and generating syllabic and lexical hypotheses in continuous speech is outlined. A model for machine perception involving a message passing through knowledge instantiations is proposed. Some components of this model can be implemented by a pipeline scheme, some others by an Actor's system and some others with special devices for performing parallel operations on a semantic network.

1 INTRODUCTION

Complex speaker-independent systems capable of understanding continuous speech must take into account many diverse types of features extracted by processes performing different perceptual tasks. Some of these processes have to work directly on the speech patterns for extracting acoustic cues. Some other processes operate on representations obtained by intermediate processing steps and generate feature hypotheses. Knowledge-based systems using rules for hypothesis generation can be enriched as new experience is gained, allowing simulation of human behaviour in learning new word pronunciations, the characteristics of new speakers or new languages. Algorithms for extracting a unique non-ambiguous description of the speech data in terms of acoustic cues have been proposed (De Mori [1]). These algorithms are based on a knowledge representation in which structural and procedural knowledge are fully integrated. Knowledge representation shows how the complex task of describing the acoustic cues of the speech message can be decomposed into sub-tasks that can be executed in parallel. Task decomposition and parallel execution of cooperating processes have the advantage of speeding up the operation of a speech understanding system which should work close to real-time. Furthermore, conceiving the system as a collection of interacting modules offers the more important advantage of designing, testing and updating separately and independently a large portion of the knowledge of each module. The performance of each module can be improved by adding specific knowledge which can be acquired in any particular area of speech and language research.

2 A COMPUTATIONAL MODEL FOR THE INTERACTION BETWEEN AUDITORY, SYLLABIC AND LEXICAL KNOWLEDGE

The knowledge of a Speech Understanding System (SUS) can be subdivided into levels. The choice of these levels is not unique.
Following past experience and in agreement with speech perception results (De Mori [1]), three levels are considered through which the speech signal is transformed into lexical hypotheses. These levels are called the auditory, the syllabic and the lexical one. Fig. 1 shows a three-dimensional network of computational activities. Activities Ai are performed at the auditory level, activities Si at the syllabic level, activities Lij at the lexical level. Each activity processes a portion of hypotheses in a time interval Tli corresponding to the duration of a syllable

(0 t C ^ r V ^ ^ (C3...) where C^ denotes the velar and palato-dental stops (k, g, t, and d) and L, liquids (Japanese has only r). Historically, there are three distinct stages. Though the manner of borrowing into Japanese was once uniquely determined, drastic processes like Forward and Backward Vowel Assimilation of epenthetic vowels have gradually given way to Forward Consonant Assimilation, which represents a weaker process, but a more complicated rule. Thus, native phonology and loan phonology seem to differ in the power of analogy to promote rule generalization. The examples cited suggested that, for native phonology, once some members of a class submit to a change, analogy works to include those with less phonetic motivation, while loan phonology has the option of restructuring borrowed classes in ways not always predictable by analogy.
THE TYPOLOGICAL CHARACTER OF ACOUSTIC STRUCTURE OF EMOTIONAL SPEECH IN ENGLISH, RUSSIAN AND UKRAINIAN
E.A. Nushikyan
English Department, University of Odessa, USSR

This paper deals with the general problem of typological investigation of the acoustic parameters of emotional speech. Six English, six Russian and six Ukrainian native speakers participated as subjects. To preserve comparability of the experimental material, situations were chosen from English fiction and then translated into Russian and Ukrainian, their overall number being two hundred statements, questions and exclamations expressing the most frequently observed positive and negative emotions. The material was recorded twice: in isolation (as non-emotional utterances) and in situations (as samples of emotional speech). Then oscillograms were obtained, and the fundamental frequency, intensity and duration of the speech signal were recorded. The detailed contrastive analysis of the acoustic structure of emotional and non-emotional speech has revealed that the frequency range, the frequency interval of the terminal tone, the frequency interval of the semantic centre, the peak intensity and the nuclear syllabic intensity are always greater in the emotional utterance in all languages under investigation. In this way the similarity of emotion expression is manifested. But it is the movement of the fundamental frequency in the emotional utterance that is peculiar to each language. The nuclear syllable duration and the duration of emotional utterances on the whole exceed the duration of neutral ones. On the other hand, duration turns out to be the most variable acoustic parameter. For evaluating the average data, standard methods of mathematical statistics were applied (t ratio, Student's t). Formant frequencies were measured from broad-band spectrograms made on a Kay Sonograph (6061 B). A shift of the intensity of the formant frequencies of the stressed vowels into higher regions is noticed in emotional speech. A constant increase of the total energy is observed at the expense of the first and second formants. We also found an enlargement of the frequency ranges, as well as a greater importance of the third and fourth formant frequency ranges. The results of this contrastive study permit us to suppose that the acoustic structure of emotional speech in English, Russian and Ukrainian displays more universal than particular properties in the manifestation of emotion.
STYLISTIC VARIATION IN R.P.
Susan Ramsaran
Department of Phonetics and Linguistics, University College, London, U.K.

A corpus of twenty hours of spontaneous English conversation was gathered in order to study the phonetic and phonological correlates of style in R.P. Six central subjects were recorded in both casual and formal social situations as well as in monologue. Each informant was recorded in different dyadic relationships, an attempt being made to hold all social variables constant apart from the subjects' relation to their interlocutors. Since the identity of the interlocutor is not an adequate criterion for determining speech style, a supplementary grammatical analysis of the data was also made. In formal situations where the topic of conversation is also formal, there is a tendency for speech to contain more stressed syllables, audibly released plosives and concentrations of nuclear tones than occur in the most casual stretches of speech. Assimilations (which do not increase with pace) and elisions are more frequent in the casual contexts. Linking /r/ and weak forms are not indicators of style. Although some of these features are distributionally marked, none is unique to a speech variety. It is concluded that in R.P. there is no shift from a distinct 'casual' style to a distinct 'formal' style. Instead gradual variation is to be seen as inherent in a unitary system.
CHARACTERISTIC FEATURES IN USAGE OF "LIAISON" IN THEATRICAL SPEECH
V.A. Sakharov
Department of Phonetics, Institute of Foreign Languages, Tbilisi, Georgian SSR, USSR

According to the data of our research, all three categories of "liaison" are used in theatrical speech: liaison obligatoire, facultative and interdite. Facultative "liaison" constantly exchanges places with obligatory "liaison", and vice versa; the conditions of these exchanges depend on: a) the language and style of a literary work; b) the author and the period when a literary work was written; c) the direction of a theatre; d) the individuality of an actor. Usage of linkage in contemporary language is largely determined by a number of factors, among which the following are essential: a) functional style; b) semantic-syntactic connection; c) the grammatical factor; d) phonetic conditions; e) the historical factor. Our work enables us to show that "liaison" is most frequently used in theatrical speech. According to our research we may say that it is theatrical speech which combines both the classical norms of literary language and contemporary tendencies of colloquial language, a comparison of which enables us to trace the evolution undergone by "liaison" at the present stage of development of the French language.
STRATEGIES FOR THE REPLACEMENT OF APICAL R BY POSTERIOR R
L. Santerre
Département de Linguistique, Université de Montréal, Québec, Canada

Posterior R is increasingly replacing anterior R in the careful speech of French speakers in Montreal. Many speakers produce sometimes one variphone, sometimes the other, but also very commonly both at once in word-final position, e.g. Pierre [pjɛRr]. It is the first R that is more intense and is perceived, and hence transmitted to children; X-ray films show that the second is realised by an occlusion of the apex against the alveolar ridge, but is barely audible. Other strategies: reduction of /r/ to a single occlusion in intervocalic position, e.g. la reine [larɛn]; after a consonant there is epenthesis, e.g. drame [dəram], crème [kərɛm], une robe [ynərɔb], ardoise rouge [ardwazəruʒ]. Before an apical consonant, [r] is reduced to its implosive phase, e.g. retourne [rəturn], regarde [rəgard], université [ynivɛrsite]; before a voiced non-apical constrictive, an epenthesis is produced, e.g. large [larʒə], pour vous [purəvu]; before stops and voiceless consonants, [r] may be reduced to its implosive phase, e.g. remarque [rəmark], marche [marʃ], parfumeuse [parfymøz]. The tendency is to reduce the number of anterior taps to a minimum. Note that the epentheses are not perceived as such, but as a second tap. Articulatory tracings from X-ray films synchronised with spectral analysis are presented (legend: = only the implosive phase of the tap).
SOCIO-STYLISTIC VARIABILITY OF SEGMENTAL UNITS
V.S. Sokolova, N.I. Portnova
Foreign Languages Institute, Moscow, USSR

1. The problems of studying the socio-stylistic variability of units of the segmental level are very acute in modern sociophonetics.
2. The study of the socially and stylistically informative value of the variants of segmental units serves to reveal the inner mechanism of sound modifications and the peculiarities of the combinations and distribution of concrete realizations in such varieties of speech as functional stylistic varieties and the speech of social groups.
3. The results of the present experimental phonetic investigation, carried out on the basis of the French language, make it possible to state the following:
3.1. The inner mechanism of the variability of vowels lies in changes of the acoustic parameters: the frequency index of the second and zero formants in the spectrum of vowels, the energy saturation of the spectrum, and the length of vowels.
3.2. Realization of the so-called semi-vowels depends upon extralinguistic factors.
3.3. The timbre characteristics of vowels, and especially the functioning of semi-vowels, correlate with such extralinguistic factors as the age status of an informant, which affects interpersonal sound variation, and changing communicative situations, which determine individual sound variability in stylistic varieties of oral texts.
DISTANCES BETWEEN LANGUAGES WITHIN LANGUAGE FAMILIES FROM THE POINT OF VIEW OF PHONOSTATISTICS
Yuri A. Tambovtsev
Department of Foreign Languages, Novosibirsk University, Novosibirsk, USSR

There may be different classifications and measurements of the distances between languages in language families. It is proposed in this report to measure the distances by the values of the frequencies of certain phoneme groups in speech. These phonostatistical methods are believed to give the necessary precision and objectivity. The following methods of phonostatistics are used in this study: 1) the value of the consonant coefficient; 2) the value of the ratios and differences of the frequencies of certain groups of consonants in speech and in the inventory; 3) the value of ratios and differences of frequencies in different languages derived by a unified four-features method. In discussing the closeness of some languages to others within the language family, it is important to take into consideration both typological and universal characteristics among the languages subjected to phonostatistical investigation. Special coefficients which take into account the values of genuine language universals are introduced. Some powerful methods of mathematical statistics are used in order to avoid mistakes in conclusions caused by a number of common features due to pure chance. The group of experimental linguistics of Novosibirsk University has studied by phonostatistical methods the languages of the Finno-Ugric, Turkic, Tungus-Manchurian and Paleo-Asiatic families and some isolated languages of Siberia and the Far East. Some phonostatistical studies are based on acrolects, taking a language as a whole; others deal with dialects. This study involved dialects, to measure the distances between dialectal forms and the literary language on the one hand, and it considered the distances between languages as wholes on the other. The main objective of the study is to demonstrate the relations of languages not only on a qualitative but also on a quantitative scale. It is believed to be quite essential to determine the correct degrees of relationship between the dialects of a language and between the languages in language families. The method is believed to be basically new and is being applied to some Siberian, Finno-Ugric, Turkic and Tungus-Manchurian languages for the first time. These phonostatistical methods lead to outcomes similar to the results achieved by classical comparative linguistic methods. The degrees of closeness in the relationship between some Finno-Ugric, Slavonic and Turkic languages may encourage scholars to use such analyses for the whole Finno-Ugric, Slavonic, Turkic, Tungus-Manchurian and other language families, and to help to include some isolated languages in language families, since this particular approach seems to be accurate, precise and promising.
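A minimal sketch of a phonostatistical distance of the general kind described, assuming relative text frequencies of phoneme groups have already been counted; the grouping and the plain Euclidean metric are illustrative stand-ins, not Tambovtsev's exact coefficients:

```python
import math

def phonostatistic_distance(freq_a, freq_b, groups):
    """Euclidean distance between two languages over the relative
    frequencies of selected phoneme groups in running speech."""
    return math.sqrt(sum((freq_a[g] - freq_b[g]) ** 2 for g in groups))

# Toy usage; the consonant/vowel ratio is one such group-based value.
a = {"consonants": 0.58, "vowels": 0.42, "sonorants": 0.31}
b = {"consonants": 0.52, "vowels": 0.48, "sonorants": 0.36}
print(phonostatistic_distance(a, b, ["consonants", "vowels", "sonorants"]))
```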
THE PROBLEM OF THE EXPLANATION OF PHONETIC INNOVATIONS AND THE CASE OF THE LABIOVELARS IN ANCIENT GREEK
A. Uguzzoni
Istituto di Glottologia, Università di Bologna, Italia

The application of Henning Andersen's model has proved particularly interesting for the study of the origin of the sound changes that underlie diachronic correspondences and lead to the formation of dialectal divergences and convergences. Such a perspective makes it possible to formulate satisfactory hypotheses on the nature, motivations and modalities of evolutive innovations, and makes a constructive contribution to the redefinition of the controversial problem of explanation in historical linguistics. The research from which we present some extracts here falls within this theoretical and methodological framework. The study of the elimination of the labiovelars in Ancient Greek reveals anomalies which can be re-examined and overcome by taking into account both the universal aspects of phonetic substance and the specific aspects of the phonology of the Greek dialects. A highly controversial case is represented by the developments of the labiovelars followed by e. We propose to consider the diachronic correspondence labiovelars : dentals as the result of two changes of essentially different character: a contextual change, whereby the labiovelars are palatalised before i, e, and an acontextual one, whereby the products of palatalisation are replaced by dentals. The first is a deductive innovation, which consists in the introduction of a new pronunciation rule, while the second is an abductive innovation, which consists in the reinterpretation of the phonetic entities resulting from the preceding process. The distinction between these evolutive stages highlights the explanatory contribution both of articulatory factors and of acoustic and auditory factors. The exegesis proposed in this communication rests on the hypothesis of two preconditions. On the one hand, it is postulated that the divergence between the Aeolic and the non-Aeolic outcomes depends on a different degree of palatalisation affecting, respectively, the labiovelars followed by e and the labiovelars followed by i, e. On the other hand, it is noted that what permits the confusion between palatalised labiovelars and dentals is probably an acoustic-auditory resemblance connected with the transitions of the second formants. Examination of the diachronic correspondence labiovelars : labials confirms the role of the learner-listener in evolutive processes and the importance of studying the acoustic ambiguities which can constitute the premises for abductive innovations. Nevertheless, it is not excluded that the definitive phonetic projection of the reinterpretation of the labiovelars as labials was reached through intermediate articulatory stages, similar to those envisaged by traditional analyses of the evolution of the labiovelars.
THE RELATIVE IMPORTANCE OF VOCAL SPEECH PARAMETERS FOR THE DISCRIMINATION AMONG EMOTIONS
R. van Bezooijen and L. Boves
Institute of Phonetics, University of Nijmegen, The Netherlands

In the experimental literature on vocal expressions of emotion two quite independent mainstreams may be distinguished, namely research which focusses on a parametric description of vocal expressions of emotion and research which examines the recognizability of vocal expressions of emotion. In the present contribution an effort is made to link the two approaches in order to gain insight into the relative importance of various vocal speech parameters for the discrimination among emotions by human subjects. One hundred and sixty emotional utterances (8 speakers x 2 standard phrases x 10 emotions) were rated by six slightly trained judges on 13 vocal speech parameters. The speakers were native speakers of Dutch of between 20 and 26 years of age. The phrases were "two months pregnant" (/tve: ma:ndə zvaŋər/) and "such a big American car" (/zo:n ɣro:tə amerika:nsə o:to:/). The emotions included disgust, surprise, shame, interest and joy. The 13 parameters were pitch level, pitch range, loudness/effort, tempo, precision of articulation, laryngeal tension, laryngeal laxness, lip rounding, lip spreading, creak, harshness, tremulousness, and whisper. The means of the ratings on 12 of these parameters - lip rounding was discarded because an analysis of variance failed to reveal a significant effect of the factor emotion - were subjected to a multiple discriminant analysis. The aim of this analysis was to attain an optimal separation of the 10 emotional categories by constructing a limited set of dimensions which are linear combinations of the original 12 discriminating variables. In a 3-function solution 62.5% of the utterances were correctly classified. The same 160 emotional utterances which were auditorily described were offered to 48 Dutch adults with the request to indicate for each which of the 10 emotional categories had been expressed. Of the responses, 67% proved to be correct. The confusion data resulting from this recognition experiment were first symmetrized and then subjected to a multidimensional scaling analysis which aimed at providing insight into the nature of the dimensions underlying the classificatory behavior of the subjects. A comparison of the dimensions emerging from the two analyses suggests that the dimension of level of activation - and the vocal speech parameters related to level of activation, such as loudness/effort, laryngeal tension, laryngeal laxness, and pitch range - plays a central role in the discrimination among emotional expressions, not only in a statistical sense, but also in connection with the classificatory behavior of human subjects. An evaluative dimension was not clearly present.
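A minimal sketch of the discriminant step, using scikit-learn's LinearDiscriminantAnalysis as a stand-in for the multiple discriminant analysis reported; the data below are random placeholders, not the authors' ratings:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 160 utterances x 12 retained parameters; 10 emotion labels (placeholders).
ratings = np.random.rand(160, 12)
emotions = np.repeat(np.arange(10), 16)

lda = LinearDiscriminantAnalysis(n_components=3)   # 3-function solution
scores = lda.fit(ratings, emotions).transform(ratings)
accuracy = lda.score(ratings, emotions)            # proportion correctly classified
```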
664 A CROSS-DIALECT STUDY OF VOWEL PERCEPTION IN STANDARD INDONESIAN¹
Ellen van Zanten & Vincent J. van Heuven
Dept. of Linguistics/Phonetics Laboratory, University of Leyden, the Netherlands
When the Republik Indonesia was founded as an independent state in 1947, Indonesian was declared the national language, after having served as a lingua franca for the entire archipelago for centuries. In our research we aim to establish both acoustic and perceptual aspects of the Indonesian vowel system, as well as the (potential) effects of the speakers' local vernacular or substrate dialect on their performance in the standard language. In the present experiment we sought to map the internal representation of the 6 Indonesian monophthongs (/i, e, a, o, u, ə/) for 3 groups of subjects: 4 Toba Bataks, 5 Javanese, 4 Sundanese. The Javanese vowel system is identical to that of the standard language, the Batak dialect lacks the central vowel, whereas Sundanese has an extra central (high) vowel /ɨ/. 188 vowel sounds were synthesized on a Fonema OVE IIIb speech synthesizer, sampling the acoustic F1/F2 plane in both dimensions with 9% frequency steps. Subjects listened to the material twice in counterbalanced random orders. They were instructed to label each stimulus as one of the 6 Indonesian monophthongs (forced choice), as well as to rate each token along a 3-point acceptability scale. The results contain important differences in preferred locations of the response vowels, which are argued to reflect properties of the substrate systems of the three groups of subjects. For Batak listeners, /ə/ is the least favoured response category, and its distribution is highly irregular and restricted; for Sundanese listeners, however, it is the most favoured vowel, occupying a relatively large area in the F1/F2 plane, to the detriment of /u/, whose distribution is conspicuously more restricted here than with the Javanese and Batak listeners. Javanese listeners performed their task better than the other groups, suggesting that the Javanese substrate interferes least with Indonesian. We conclude that the labelling method is a highly sensitive tool in the study of vowel systems, which merits wider application in field studies.
¹This research was supported in part by a grant from the Netherlands Organisation for the Advancement of Pure Research (ZWO) under project # 17-21-20 (Stichting Taalwetenschap).
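The construction of the stimulus grid can be sketched as follows: a 9% frequency step means multiplicative spacing in the F1/F2 plane. The frequency ranges below are illustrative assumptions only (the abstract states the step size and the total of 188 stimuli, not the ranges), so the sketch will not reproduce the exact stimulus count.

    # Sample the F1/F2 plane in 9% (multiplicative) frequency steps.
    def frequency_grid(f_min, f_max, step=1.09):
        values, f = [], float(f_min)
        while f <= f_max:
            values.append(round(f))
            f *= step
        return values

    f1_values = frequency_grid(200, 1000)    # assumed F1 range (Hz)
    f2_values = frequency_grid(500, 3000)    # assumed F2 range (Hz)
    stimuli = [(f1, f2) for f1 in f1_values for f2 in f2_values if f2 > f1]
    print(len(stimuli), "candidate (F1, F2) points")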
665 THE PHONEMIC SYSTEM AND CHANGE IN PRONUNCIATION NORMS
L.A. Verbitskaja
University of Leningrad, Leningrad, USSR
Questions concerning the connection between a phonemic system and pronunciation norms are considered from the point of view of their interrelation. A phonemic system includes not only the inventory of phonemes, but also the distribution of phonemes, their alternations, functions, and phonemic combinations. The problem of pronunciation norms arises because a system has not just one, but two or more possible designations for one and the same linguistic realization. Variation within the system is limited by the norms. Pronunciation norms can change either within the limitations of a system or as a result of changes internal to the system. The influence of orthography can hardly be considered a major factor affecting changes in pronunciation norms.
666 THE STRUCTURAL UNITS OF EMOTIONALLY EXPRESSIVE SPEECH
E.N. Vinarskaya
Foreign Language Institute, Moscow, USSR
Any speech utterance is formed of structural units of one of three types: sign units of language, extralinguistic sign units, and non-language inborn biological units. Language units occupy the uppermost hierarchical level and dominate the units of the lower levels, thus making their structure less clearly defined. It is in simplified speech structures (utterances by infants, or everyday vernacular) that the structural peculiarities of the lower sign and non-sign units stand out most explicitly. Extralinguistic sign units of speech can also be described as specific units of emotional expressiveness. Active formation of these sign units takes place in the pre-language period of early childhood. Just as the peripheral speech organs build themselves upon the earlier formed digestive and respiratory systems, language units are developed on the basis of already existing emotionally expressive signs. Emotionally expressive signs participate in the processes of speech in two functions: one that provides the transmission of emotional meaning and whose evolution was completed long ago, and another that was acquired much later and is functionally linguistic.
667 TENDENCIES IN CONTEMPORARY FRENCH PRONUNCIATION
I. Zhgenti
Head of the Department of Phonetics, Institute of Foreign Languages, Tbilisi, Georgian SSR, USSR
Sounds in contemporary French were studied on the basis of the phonological distribution of phonemes, drawing on fieldwork carried out in Paris in 1972, after which a complex experimental investigation was conducted, aimed at establishing pronunciation tendencies and articulatory-acoustic changes by means of spectral and X-ray analysis, oscillography, and synthesis of speech sounds. As a result of our research, the following tendencies can be outlined in French pronunciation: a) a tendency towards front articulation; b) a tendency towards delabialization and a reduction of the number of nasal vowels; c) a tendency towards open articulation; d) stabilization of uvular "r" or "r grasseyé" in standard French pronunciation, which also reinforces the general tendency towards front pronunciation. As for the pronunciation of the variphones of "r" in French, German, Swedish, Dutch, Danish, etc., we may conclude, judging from our observations, that a uvular "r" similar to the French "r grasseyé" is pronounced in those vocalic-type languages whose phonological systems include 12 vowels, of which 8 are of front articulation, i.e. languages with a tendency towards front pronunciation. These regularities, as revealed by our research, can be considered universal.
Section 16 Phonetics and Phonology
671 DIFFERENCES IN DISTRIBUTION BETWEEN ARABIC /l/, /r/ AND ENGLISH /l/, /r/
Mohammad Anani
Department of English at the University of Jordan, Amman, Jordan
Phonetic differences between lateral and trill articulations in 'emphatic' and 'non-emphatic' contexts raise special problems for Arabic speakers of English. The contextual distribution of the Arabic varieties of /l/ and /r/ is examined, and differences in distribution between Arabic /l/, /r/ and English /l/, /r/ are stated. Some instances of pronunciation errors due to such differences are mentioned.
672 THE PHONETIC CHARACTERIZATION OF LENITION Laurie Bauer Victoria University of Wellington, New Zealand If lenition is a phonological process that does not have a phonetic definition, it very soon becomes clear that there are instances where it is not possible to state non-circularly whether lenition or fortition is involved. Most attempts to give a phonetic characterization of lenition seem to suggest that: 1) the voicing of voiceless stops; and 2) the spirantization of voiced stops are lenitions; but that 3) the centralization of vowels to schwa is not a lenition, although it is usually described as such in the literature. This raises the question of whether lenition is a unitary phenomenon as it affects both consonants and vowels, and whether a single characterization is possible. These are the questions that will be posed in this paper, and a number of possible phonetic characterizations of lenition will be considered in an attempt to provide an answer.
673 VOWEL REDUCTION IN DUTCH
G.E. Booij
Department of General Linguistics, Free University, Amsterdam, the Netherlands
In Dutch, a vowel in a non-word-final, unstressed syllable can be reduced to a schwa. If the syllable is the word-initial one, the vowel to be reduced must be in syllable-final position (a sketch of this condition follows the reference below). This process of reduction manifests itself in three ways: (1) as a diachronic process, with concomitant restructuring of the underlying form, as in repetítie, conferéntie, reclame, televísie, where the underlined e always stands for a schwa; (2) as an obligatory synchronic rule, as shown by the following pairs of related words: proféet-profetéer, juwéel-juwelíer, géne-genánt; (3) as an optional style-dependent synchronic rule, e.g. in banáan, polítie, minúut, relátie, economíe, where the underlined vowel can be pronounced as a schwa. In Koopmans-van Beinum (1980) it is shown that vowel reduction (= vowel contrast reduction) is a general phonetic tendency of Dutch, occurring in all speech styles, both in lexically stressed and lexically unstressed syllables. This tendency, presumably a manifestation of the principle of minimal effort, does not suffice, however, to explain the phenomena observed above, although it makes them understandable. The vowel reduction in (1)-(3) is a grammaticalization of a phonetic tendency, manifesting itself in three different ways in the grammar of Dutch. This grammaticalization can also be inferred from the fact that vowel reduction is lexically governed (vowels in words of high frequency reduce more easily). Consequently, the acoustic realization of a Dutch vowel is determined by at least the following factors: (i) the systematic phonetic representation of the vowel; this may be a schwa, due to the diachronic or synchronic rule of vowel reduction; (ii) the general tendency of contrast reduction in both stressed and unstressed syllables. This implies that in many cases phonological intuitions with respect to vowel reduction cannot be checked by means of phonetic measurements. If the phonological rule claims that reduction is possible, but we do not find it phonetically, this may be due to the optionality of the rule. Conversely, if the rule says that reduction is impossible (e.g. in lexically stressed syllables), and yet we do find reduction, this can be ascribed to the general phonetic tendency. I conclude, therefore, that - even in the case of relatively low-level rules - it is not possible to provide direct evidence for the reality and correctness of phonological rules by means of phonetic experimentation.
Reference: F.J. Koopmans-van Beinum, Vowel contrast reduction. An acoustic and perceptual study of Dutch vowels in various speech conditions. Amsterdam: Academische Pers, 1980.
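The structural condition on reduction stated in the first lines of the abstract can be expressed as a predicate over syllabified words. The sketch below is a minimal reading of that condition only; the Syllable representation and its field names are invented for illustration, and the optional versus obligatory status of the rule is not modelled.

    from dataclasses import dataclass

    @dataclass
    class Syllable:
        stressed: bool
        open: bool          # True if the vowel is syllable-final (no coda)

    def may_reduce(word, i):
        """word: list of Syllable; i: index of the syllable under consideration."""
        syl = word[i]
        if syl.stressed or i == len(word) - 1:   # must be unstressed and non-word-final
            return False
        if i == 0 and not syl.open:              # word-initial: vowel must be syllable-final
            return False
        return True

    # ba-naan: the initial open unstressed syllable may reduce (banaan -> b@naan).
    banaan = [Syllable(stressed=False, open=True), Syllable(stressed=True, open=False)]
    print(may_reduce(banaan, 0))                 # True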
674 ON THE PHONETICS AND PHONOLOGY OF CONSONANT CLUSTERS
Andrew Butcher
Department of Linguistic Science, University of Reading, UK
Syllable-final consonant clusters of the type NASAL + FRICATIVE and NASAL + STOP + FRICATIVE have attracted the interest of both phoneticians and phonologists over the years. On the one hand, study of variation in the temporal organization of such complex sequences (across languages, across speakers and within speakers) provides insight into various aspects of the speech production mechanism. On the other hand, the often apparently optional presence in such clusters of phases perceived as oral stops has provoked some discussion as to whether phonemes are being inserted, deleted or inhibited. This paper presents an evaluation of some acoustic, electroglottographic, pneumotachographic and electropalatographic data recorded simultaneously, using British English speakers. The clusters concerned - with (phonologically) both voiced and voiceless (STOP +) FRICATIVE cognates - were pronounced under three different conditions: isolated words, sentences and connected texts. The results indicate that the occurrence of an oral stop during fully voiced sequences - and therefore the maintenance of the phonological opposition between /-ndz/ and /-nz/ clusters - is extremely rare. The occurrence of 'stopped' versus 'stopless' transitions between nasals and voiceless fricatives, on the other hand, is highly variable, although for most speakers it bears no relation to the phonological opposition predicted by the orthography. It depends on the relative timing of velic closure, oral release and glottal abduction. Simultaneous execution of all three gestures is most infrequent, and most 'stopless' pronunciations in fact include a period of simultaneous oral and nasal air flow as the velic closure lags behind. Transitions in which a stop is perceived sometimes include a true stop, where oral release occurs after the other two gestures. The most common strategy of all, however, is the delay of both velic closure and oral release until well after glottal opening, producing a voiceless nasal phase which is nonetheless perceived as a stop. On the basis of this and further auditory-based data, the following factors seem to play a role in determining the occurrence of a perceived stop: firstly, speakers may be divided, apparently on a regional/social basis, into habitual 'stoppers', 'non-stoppers' and 'distinguishers'; secondly, increased tempo leads to less stopping and less consistent distinction in general. Much less importantly, there is also less stopping in homorganic sequences and more stopping in cases where a semantic opposition depends on the stop versus stopless distinction. Obviously the range of variation to be found in such sequences is rather wide, and presents difficulties for any monosystemic phonemic type of analysis. It is suggested that a polysystemic approach might be somewhat more satisfactory, or even one which treated such clusters as single phonemes and differences in their realization at the level of diaphonic and allophonic variation.
675 VOWEL CONTRAST REDUCTION IN TERMS OF ACOUSTIC SYSTEM CONTRAST IN VARIOUS LANGUAGES
Tjeerd de Graaf¹ and Florien J. Koopmans-van Beinum²
¹Institute of Phonetic Sciences, Groningen University, The Netherlands
²Institute of Phonetic Sciences, University of Amsterdam, The Netherlands
In a previous study on vowel contrast reduction (Koopmans-van Beinum, 1980) a measure was introduced to indicate the degree of acoustic contrast of vowel systems in various speech conditions. For the Dutch vowel system the values of this measure ASC (Acoustic System Contrast = the total variance of a vowel system) decrease in a similar way for male as well as for female speakers when passing from vowels pronounced in isolation via isolated words, stressed and unstressed vowels in a text read aloud, stressed vowels in a retold story and in free conversation to, finally, unstressed vowels in a retold story and in free conversation. Thus the ASC measure, its value being to a large extent dependent on speaker and on speech condition, provides a quantitative description of the process of vowel contrast reduction. Can this measure ASC also be used for the comparison of vowel contrast reduction in other languages? This is of particular interest when these languages have vowel systems differing in the number of vowels involved, or when the individual vowels assume deviant positions in the acoustic vowel space. We might hypothesize that systems involving fewer vowels would have a larger degree of vowel contrast reduction than richer vowel systems. However, our first results in the comparison of Dutch (12 vowels), Italian (7 vowels), and Japanese (5 vowels) display a similar pattern of vowel contrast reduction independent of the number of vowels or of their position in the acoustic space, as can be seen in the illustration. Another point that will be discussed is the question whether the ASC measure can also be applied to vowel systems with nasalized vowels and to systems with distinctive pairs of long and short vowels.
Koopmans-van Beinum, F.J. (1980). 'Vowel Contrast Reduction'. Diss. Amsterdam.
Illustration: Individual ASC values in various speech conditions for Dutch (2 male: D1 and D2; 2 female: D3 and D4), Italian (2 male: I1 and I2), and Japanese (3 male: J1, J2, and J3) speakers.
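Since ASC is defined above as the total variance of a vowel system, it admits a direct computation. The sketch below assumes formant measurements in an F1/F2 plane and no frequency transformation or normalization; the original procedure may differ on both points, and the toy data are invented.

    import numpy as np

    def acoustic_system_contrast(vowel_tokens):
        """ASC = total variance of the vowel system: the sum of per-dimension
        variances of all vowel tokens around the system mean.
        vowel_tokens: array of shape (n_tokens, n_dims), e.g. (F1, F2) in Hz."""
        x = np.asarray(vowel_tokens, dtype=float)
        return float(np.sum(np.var(x, axis=0)))

    # Well-dispersed isolated vowels vs. centralized unstressed vowels:
    # ASC should drop, mirroring vowel contrast reduction.
    isolated = [[300, 2300], [400, 2000], [700, 1200], [500, 900], [320, 800]]
    reduced = [[420, 1700], [450, 1600], [550, 1400], [500, 1250], [430, 1200]]
    print(acoustic_system_contrast(isolated) > acoustic_system_contrast(reduced))  # True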
676 VOWEL QUALITY IN DIFFERENT LANGUAGES AS PRONOUNCED BY THE SAME SPEAKER
Sandra Ferrari Disner
UCLA Phonetics Laboratory, Los Angeles, USA
Most cross-linguistic comparisons of vowel quality are impaired by the diversity of speakers in the language groups. If significant differences are found between the formant frequencies of, say, the vowel transcribed as [a] in one language and the vowel transcribed as [a] in another, one cannot be certain that such differences are entirely linguistic in nature. While they may be due to shifts along linguistic parameters such as tongue height or pharyngeal constriction, there is always a possibility that the observed acoustic differences between pairs of similar vowels are due to consistent anatomical differences between the samples, such as a difference in mean vocal tract length or lip dimensions. This difficulty can be avoided by studying a group that speaks both of the languages to be compared. Over a period of time 27 subjects were recorded who spoke two or more of the following languages: English, Dutch, German, Norwegian, Swedish, and Danish. The list of monosyllabic test words pronounced by each speaker was submitted to a panel of native-language judges for evaluation; only those speakers who were judged to have native proficiency in at least two languages were selected for this study. The sample of speakers who passed this evaluation enables many of the possible pairs of these languages to be compared. Since each speaker employs the same vocal apparatus to produce the vowels of two or more languages, the observed differences between pairs of similar vowels cannot be attributed to anatomical differences between the language groups. Rather, they are a close approximation of all and only the linguistic differences which hold between the vowels of these six closely-related languages. Formant-frequency charts of eight different pairs of languages reveal that many of the differences frequently noted in conventional language comparisons, such as the relatively low F1 of Danish vowels and the relative centralization of English vowels, are supported by these bilingual results. In addition, a number of more specific differences, affecting a single vowel rather than the entire vowel system, are found across all speakers. Many of these differences are consistent but quite small, and as such have not received attention in the literature to date.
677 CONTRASTIVE STUDY OF THE MAIN FEATURES OF CHINESE AND ENGLISH SOUND SYSTEMS
Gui Cankun
Guangzhou Institute of Foreign Languages, Guangzhou, China
1. Differences in segmental phonemes
   a. Different ways of classification
   b. Differences in distinctive features
   c. Differences in distribution
   d. Trouble spots for Chinese students learning English
2. Tone language vs. intonation language
   Chinese as a tone language and English as an intonation language; each has its own characteristics
   Trouble spots for Chinese students
3. Differences in rhythm
   Different arrangements of rhythm patterns: Chinese is syllable-timed while English is stress-timed
   Examples in poetry and conversation
   Trouble spots for Chinese students
4. Differences in juncture
   Chinese speech flow: staccato, no smooth linking of syllables or words
   English speech flow: legato, very smooth linking of words with the exception of breaks and pauses
   Trouble spots for Chinese students
678 DURATION AND COARTICULATION IN PERCEPTION: A CROSS-LANGUAGE STUDY
Charles Hoequist
Institut für Phonetik der Universität Kiel, West Germany
A perception experiment was run to test predictions concerning two aspects of speech production: rhythm and coarticulation. The rhythm hypothesis under test concerns the postulation of rhythmic categories, such as stress-timing, syllable-timing, and mora-timing, into which languages can be grouped. Some experimental evidence indicates that these impressionistic categories do correspond to differences in the way and degree syllabic duration is utilized by speakers of languages claimed to belong to different rhythm categories. The coarticulation hypothesis is based on evidence indicating that vowel production in speech usually coarticulates across consonants, so that a consonant does not necessarily serve as the boundary of a vowel. The 'edges' of vowels (and therefore perhaps of syllables) seem to depend more on adjacent vowels than on intervening consonants. If the perception of duration uses vowel boundaries as markers, then it may be that perceived syllable duration (along with whatever is signaled by syllable duration in language) can be changed by altering degrees of coarticulation of vowels in adjacent syllables, without changing the temporal relations among the consonants. In other words, there might be a perceptual tradeoff between coarticulation and duration. The results of the experiment indicate little if any tradeoff between coarticulation and duration perception under the conditions tested. However, duration perception alone does partially correlate with durational characteristics of the subjects' native languages.
679 VOWEL SPACE AND PERCEPTUAL STRUCTURING: AN APPLICATION TO THE VOWEL SYSTEM OF SWAHILI
J.M. Hombert and G. Puech
Centre de Recherche en Linguistique et Sémiologie, UER Sciences du Langage, Université Lyon, France
The model proposed by Lindblom and Sundberg (1972) predicts that vowels are distributed in an acoustic space in such a way as to maximize the distances separating them. In structural phonology, on the other hand, it is held that the stability of a system rests on the maintenance of distinctions, even if the phonetic distance between certain vowels, measured for instance from the values of the first three formants, is minimal. Our contribution to the debate consists, first, in comparing three five-vowel systems: that of Swahili, studied with 6 speakers (3 men and 3 women), and those of Japanese and Spanish, for which published data exist. This comparison shows that while Lindblom's hypotheses account satisfactorily for the case of Swahili, they fit the two other cases poorly. We believe that the most significant method is in fact not the analysis of acoustic distances but that of perceptual distances. We will therefore present to the 6 speakers whose systems we have already analysed 53 synthetic vowels distributed over the whole of the vowel space, each presented five times in random order, asking them to compare these vowels with those of their own system. The 6 subjects will thus be led to carry out a phonological partitioning of their vowel space. This method makes it possible to collect data that are comparable from one speaker to another without running into the traditional problem of normalization. The results obtained will show to what extent the perceptual structuring performed by speakers whose system maximizes the acoustic distances between vowels is comparable to that which must be assumed for systems in which this distance is minimal.
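The 'maximization of distances' hypothesis invoked above is often operationalized as minimizing a cost that grows when vowels crowd together. The sketch below uses the sum of inverse squared pairwise distances in an F1/F2 plane; this is one common operationalization, not necessarily the authors' (or Lindblom and Sundberg's) exact formulation, and the two toy systems are invented.

    import itertools

    def dispersion_cost(vowels):
        """Sum of inverse squared distances between all vowel pairs (F1, F2 in Hz).
        Lower cost = better dispersed system."""
        cost = 0.0
        for (x1, y1), (x2, y2) in itertools.combinations(vowels, 2):
            cost += 1.0 / ((x1 - x2) ** 2 + (y1 - y2) ** 2)
        return cost

    # A peripheral five-vowel layout scores a lower cost than a crowded one.
    peripheral = [(280, 2250), (450, 1950), (750, 1300), (450, 850), (300, 700)]
    crowded = [(400, 1800), (450, 1700), (500, 1600), (550, 1500), (600, 1400)]
    print(dispersion_cost(peripheral) < dispersion_cost(crowded))  # True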
680 DISTINCTIVE FEATURES IN A PHONOLOGY OF PERCEPTION
Tore Janson
Department of Linguistics, University of Stockholm, Sweden
In modern phonology, it has often been assumed that phonological features and rules possess "psychological reality". This may imply, among other things, that distinctions corresponding to a feature classification are actually made in perception and/or production. Here some consequences are discussed of the assumption that such distinctions are made in perception. In that case, there must be a feature classification in the memorized lexical items, to be matched in some way with a corresponding classification of the incoming signal. The classification in the lexicon must be fully specified for all distinctions made in careful pronunciation. That is, all relevant features have to be assigned one discrete value (+ or -, in most phonological notations). The classification of an incoming signal consisting of a carefully pronounced word in isolation could receive an identical classification. In ordinary speech, however, many segments are pronounced in a way that does not allow complete phonological classification of the ensuing acoustic signal. The most important causes are reduction and various forms of assimilation. An example is centralized pronunciation of vowels. A phonetic form with a centralized vowel in English may represent either live or love. Under such circumstances, the relevant features for the incoming vowel are indeterminate, and can be assigned a special indeterminacy value (e.g. indeterminate high, indeterminate low). An indeterminate feature for the incoming signal is different from a feature that has not been determined (because of interfering noise, for example), which can be denoted by a question mark (?). While the presence of a ? does not allow any further conclusions, an indeterminate value in the incoming matrix is a feature value associated with a particular configuration of phonetic parameters. It supplies not only the information that the lexical matrix should have + or - in the relevant slot, but also that the feature is one that can be subjected to indeterminacy in this particular context. For example, indeterminate high in English is incompatible with +tense, since tense vowels do not reduce to schwa. Thus, in matrices for incoming signals, binary features may take on at least four values: +, -, indeterminate, and ?. In the oral presentation, several cases of assignment of the indeterminate value will be demonstrated, with some discussion of the theoretical implications.
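The matching of a four-valued incoming matrix against fully specified lexical matrices can be sketched as follows. The compatibility rules below are one plausible reading of the abstract, not a worked-out proposal by the author, and the feature values for live and love are toy values.

    def compatible(incoming, lexical):
        """incoming, lexical: dicts from feature names to values.
        Lexical values are '+' or '-'; incoming values may also be
        'I' (indeterminate: reduction made the value unrecoverable) or
        '?' (undetermined, e.g. masked by noise). A fuller model would also
        let 'I' exclude lexical items whose vowel cannot reduce (e.g. +tense)."""
        for feature, lex_value in lexical.items():
            in_value = incoming.get(feature, '?')
            if in_value in ('?', 'I'):
                continue                  # no evidence against a match
            if in_value != lex_value:
                return False
        return True

    # A centralized vowel leaves the height features indeterminate,
    # so the incoming form matches both lexical items.
    incoming = {'high': 'I', 'low': 'I', 'back': '-'}
    live = {'high': '+', 'low': '-', 'back': '-'}
    love = {'high': '-', 'low': '+', 'back': '-'}
    print(compatible(incoming, live), compatible(incoming, love))   # True True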
681 ON THE PERCEPTION OF DISTINCTIVE FEATURES OF PHONEMES
Z.N. Japaridze
The Institute of Linguistics of the Academy of Sciences of the Georgian SSR, Tbilisi, USSR
1. In acts of speech, at the stage of perception, distinctive features have no form; therefore the description of this form (one of the aims of Jakobson's theory of distinctive features) is in principle impossible. This form (the sound corresponding to each feature) is not necessary for language functioning and is not part of normal language perception.
2. In language perception distinctive features are differentiated only on a subsensory level. On the sensory level a phoneme cannot be broken down into either consecutive or simultaneous components.
3. In some cases, however, the observation of speech sounds makes it possible to speak of the perception of distinctive features. This is the case for those features which are added to feature complexes and change their characteristics only slightly. In these cases speakers can perceive the sound of the feature, such as nasality in vowels or aspiration in consonants, although we are unable to perceive the sounds of features which cause considerable changes in the formant structure or the dynamics of the sound, e.g. the sound of the features "diffuse", "interrupted", etc.
682 ON THE LEVEL OF SYLLABIC FUNCTIONING
L. Kalnin
Institute of Slavic and Balkan Studies, Moscow, USSR
The syllabic composition of speech possesses suprasegmental and segmental aspects (that is, respectively, the creation of syllables, producing unbroken syllabic chains, and the division of words into syllables, with the syllable as the main segment). Suprasegmental characteristics of the language realize themselves in the syllabic chain, not in the syllable itself. The syllable as a segment has a different functional status in languages with unified and non-unified types of syllables. Unified syllables may be identified with phonemes. The function of the non-unified syllable may be determined by correlation with: a) its evaluation by the language user; b) the sound syntagmatics of the word; c) the morphemic composition of the word. Dialects offer the most favourable conditions for research into the phonetics of syllabic division, because the phonetic notions of the language users are not influenced by morphemic and orthographic associations. Investigation of the syllable and syllabic division in the Russian dialects brought us to the conclusion that the syllable functions on the verge of the levels of speech and language. Indices of the speech level: a) the process of syllabic division is intuitive; it is influenced by factors that are not within the scope of the speaker's consciousness; b) the lack of a distinctive functional meaning of the syllable - among the phonetic phenomena there is no function which is possessed exclusively by the syllable. Indices of the language level: a) the speaker has his own point of view on the correctness/incorrectness of a syllabic division, and the phonetic structure of the syllable is one of the peculiarities of the phonetic system of a given language; b) syllabic division destroys any audibly perceptible connection between sounds within the word (assimilation, dissimilation); it does not only change the phonetics of the word, but also destroys such phonematic characteristics as neutralization according to distinctive features; c) if the morphemic borders do not coincide with the phonetic borders within the word, then syllabic division destroys the morpheme itself. In languages with the non-unified type of syllable, the latter's functioning on the verge of the speech and language levels has the status of an ontological characteristic of the syllable.
683 ON UNIVERSAL PHONETIC CONSTRAINTS
Patricia A. Keating
Linguistics Department, UCLA, Los Angeles CA, USA
The model of grammar currently assumed within generative phonology, e.g. Chomsky and Halle 1968, divides phonetics into two components. First, the largely language-specific phonetic detail rules convert binary phonological feature specifications into quantitative phonetic values, called the "phonetic transcription". The phonetic transcription is the end product of the grammar. Next, a largely universal phonetic component, which is technically not part of the grammar, translates the segmental transcription into continuous physical parameters. It is therefore a question of some theoretical interest to determine which aspects of speech sound production are universal, i.e. independent of any given language, and non-grammatical. In this paper I will consider three cases of phonetic patterns that have been, or could be, considered automatic consequences of universal phonetic constraints. Recent investigations indicate that none of these phonetic patterns is completely automatic, though they may be quite natural results of other choices made by languages. The first case is intrinsic vowel duration, i.e. the fact that lower vowels are generally longer. Lindblom (1967) proposed that this pattern is an automatic consequence of jaw biomechanics, and that no temporal control need be posited. However, recent EMG data demonstrate that such temporal control is exercised by a speaker. Thus the phonetic pattern, even if universal, must be represented directly at some point in the production of vowels, and cannot be automatic. The second case is an extrinsic vowel duration pattern: vowels are shorter before voiceless consonants. Chen (1970) proposed that some part of this pattern is universal, and that its origin is in universal speech physiology. I will argue that the pattern is not found universally, since at least Czech and Polish are exceptions. In addition, the fact that in other languages (e.g. English, Russian, German) the pattern is made opaque by the later application of other phonological rules suggests that the synchronic source of the variation cannot be automatic physical constraints, and that the pattern is itself represented as a rule in the grammar. The third case is the timing of voicing onset and offset in stop consonants, as affected by the place of articulation of the stop and the phonetic context in which it is found. Articulatory modeling shows that some such patterns can be derived from certain articulatory settings, without explicit temporal control. However, those articulatory settings themselves are not automatic, and languages may differ along such physical dimensions. Thus none of these candidates for universal phonetic patterns can be motivated as automatic consequences of the articulatory apparatus. It could be that most phonetic patterns are like this - even at a low phonetic level, they follow from language-specific, rule-governed behavior. This result would suggest that the most productive approach to the role of phonetics in the grammar will treat phonetic patterns like other, "phonological", patterns with formal representations, with no patterns being attributable only to mechanical constraints.
684 GRADATIONAL PHONOLOGY OF THE LANGUAGE
E.F. Kirov
Department of Philology, Kazan University, Kazan, USSR
The phonological concept is a development of Baudouin de Courtenay's ideas: the phoneme is understood as the image of the sounds (Lautvorstellung) in the individual and social consciousness. The concept of the phoneme has increasingly narrowed, and this led phonologists to attempt to find at least two juxtapositions within this concept: phoneme - archiphoneme (N. Trubetzkoy); phoneme - hyperphoneme (Sidorov); phoneme - general type (Smirnitskij); strong phoneme - weak phoneme (Avanesov). Thus a paradigm of phonological units is formed. This paradigm should correspond closely to the paradigm of phonological positions: the phoneme is perceived in the strong position, whereas a more generalized unit, termed here the megaphoneme, is perceived in the weak position. The investigation of reduced sounds in the Russian language prompts the idea that there is still another position - the superweak position - with a corresponding phonological unit, termed here the quasiphoneme. In Russian, quasiphonemes are found only among vowels. The quasiphoneme has the status of a phonological unit since it has a phonological function: it makes up words and syllables and provides the strong position for the preceding consonant phoneme. The phoneme has the largest range of distinctive features (DF), the megaphoneme is characterized by a narrowed range of DF, whereas no DF are to be found within the quasiphoneme. Unit alternation corresponds to position alternation: phonemes, megaphonemes and quasiphonemes alternate corresponding to the alternation of phonological positions. However, this is not always the case. Among the Russian words with a frequency of 10 or more, which constitute 92.4% of all the words used in standard and everyday speech, we found only 28% of words whose unstressed sounds can be verified by stress. Hence, 72% of Russian high-frequency words contain vowel quasiphonemes and megaphonemes which do not alternate with phonemes, i.e. their phonemic composition proper is impossible to establish.
685 THE THEORY OF PHONOLOGICAL FEATURES AND THE DEFINITION OF THE PHONEME
G.S. Klychkow
N.K. Krupskaya Moscow Regional Teacher Training Institute, Moscow, USSR
A reverse postulate about phonological features, concerning their inner structure, may lead to a fundamental reformulation of basic phonological concepts. If a feature can be regarded as a complex structure in turn consisting of features (the Tbilisi phonological school of Soviet linguistics), a combination, or rather an amalgamation, of features can generate a new feature. Feature synthesis presupposes several theoretical possibilities. A group of low-level features may be treated as a low-level "cover feature" (e.g. coronal in generative phonology). Synthesis of low-level features may lead to a complex feature of the same phonological level if the input features show the relation of interdependence - e.g. if all phonologically back vowels are rounded, and all rounded vowels are back, the features can be treated as one complex feature back/rounded. Synthesis of two contradictory features produces a feature of a higher level (front and back are peripheral). Synthesis of higher-level features may produce a lower-level feature (non-consonantal, non-vocalic means sonorant; non-consonantal, non-vocalic, interrupted means nasal). The most interesting case appears when a feature determines the direction of the markedness of an opposition at the next node of the classificational tree, generating a pair of lower-level features. The feature turbulent (strident) presupposes the pair interrupted - noninterrupted; the feature nonturbulent determines the opposition continuant - noncontinuant. The affricates are marked among the turbulent by the feature interrupted; the fricatives are marked among the nonturbulent by the feature continuant. Theoretically, synthesis/analysis of features is connected with absorption/emanation of information (functional load). The functional load freed from a segmental phonological feature is used in phonotactics or prosodic structures. Thus group phonemes and prosodemes may be treated as transforms of phonemes and vice versa. There is no gap between phonetic and phonological units: levels of abstraction in phonology form a continuum.
686 ON THE USES OF COMPLEMENTARY DISTRIBUTION
Anatoly Liberman
Department of German, University of Minnesota, Minneapolis, U.S.A.
The idea of complementary distribution (CD) played a great role in the development of the phonological schools that recognize the phoneme as an autonomous linguistic unit. Since the allophone is the manifestation of the phoneme in a given context, all the allophones of one phoneme stand in CD. The question important from a methodological point of view is whether it is enough for two sounds to stand in CD in order to qualify as allophones of the same phoneme. This question often arises in historical phonology, in the study of combinatory changes. For example, e was not allowed before i in West Germanic (West Germanic breaking), and it stood in CD with i. Most scholars are ready to treat e and i as allophones of one phoneme. Later, Old High German a became e just before i (umlaut), and this new e (e2) did not coalesce with the old e (e1). Again, there are no serious objections to viewing e2 and a as belonging to the same phoneme. But as a combined result of West Germanic breaking and Old High German umlaut, e1 and e2 also found themselves in CD: e1 never before i, e2 only before i. Uniting these two sounds as allophones of an e-like phoneme seems a very dubious operation, even if we disregard the fact that i and a will also turn out to be allophones of this artificial unit. Synchronic phonology, too, works with rules beginning with the statement: "Two sounds belong to the same phoneme if ..." CD is a usual condition following the IF. Trubetzkoy realized that CD was not a sufficient condition for assigning two sounds to one phoneme (his example was h and ŋ) and suggested that such sounds should also possess a unique set of distinctive features. This is a correct but impracticable rule, because at the earliest stage of decoding, when the phonemes have not yet been obtained, their features remain unknown. L.R. Zinder showed that for two sounds to belong to the same phoneme they must stand in CD AND alternate within the same morpheme. Indeed, in a language in which vowels are nasalized before m and n, all the nasalized vowels are in CD with all the non-nasalized ones. Yet we unite nasalized a with a, not with o or e. In Russian, all vowels are fronted before palatalized consonants, but, as in the previous case, we assign them to phonemes pairwise, though all the fronted vowels stand in CD with all the more retracted ones. Trubetzkoy's example is unique only because it deals with two phonemes rather than two allophones. Zinder's rule is an improvement on Trubetzkoy's theory, but it is equally inapplicable, for it also presupposes that the phonologist can work with phones before the phonemes have been isolated and fully described. Only in historical phonology, where we have the written image of the word segmented into letters and each unit implicitly analyzed by the scribe, so that we draw conclusions about sounds from letters, can we resort to this rule. For instance, e1 and e2 never alternate within the same morpheme and consequently are not allophones of one phoneme. But the phonological analysis of living speech starts with morphological segmentation, which yields phonemes as bundles of distinctive features. Allophones, together with the phonetic correlates of distinctive features, are the last units to be obtained. Once the entire analysis has been carried out, the concept of CD can add something to our knowledge of the language, but as a tool for assembling the phoneme it is worthless, just as the very idea of assembling phonemes from their widely scattered allophones is worthless. The problem of CD is not isolated. Together with many others, such as neutralization, biphonemicity, and the role of morphology in phonological analysis, it can be solved only by a method that starts with the flow of speech and ends up with a set of discrete phonemes.
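The bare distributional test that the paper argues is insufficient can nonetheless be stated mechanically, which makes its insufficiency easy to demonstrate. The sketch below checks CD over right-hand contexts only; the corpus, the segment labels e1/e2, and the one-sided context window are all simplifying assumptions.

    def contexts(corpus, sound):
        """Collect right-hand contexts of `sound` ('#' marks the word boundary)."""
        ctx = set()
        for word in corpus:
            for i, seg in enumerate(word):
                if seg == sound:
                    ctx.add(word[i + 1] if i + 1 < len(word) else '#')
        return ctx

    def in_complementary_distribution(corpus, a, b):
        return not (contexts(corpus, a) & contexts(corpus, b))

    # Toy illustration of the West Germanic situation described above:
    # e1 never occurs before i, e2 occurs only before i.
    corpus = [['e1', 'r', 'a'], ['b', 'e1', 'r'], ['g', 'e2', 'i'], ['s', 'e2', 'i']]
    print(in_complementary_distribution(corpus, 'e1', 'e2'))   # True, yet uniting them is dubious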
687 IS THERE A VALID DISTINCTION BETWEEN VOICELESS LATERAL APPROXIMANTS AND FRICATIVES?
Ian Maddieson and Karen Emmorey
Phonetics Laboratory, Department of Linguistics, UCLA, Los Angeles, California, USA
This paper examines the validity of a distinction between types of voiceless laterals. They have been divided into two major manner-of-articulation classes. Some languages (such as Zulu, Welsh, Navaho and Bura) have been described as having voiceless lateral fricatives, and others (such as Burmese, Iai, Sherpa and Irish) have been described as having voiceless lateral approximants. Sounds in this latter class are sometimes referred to as voiceless aspirated laterals. No language has been reported as having a contrast between these two segment types. This could be the case because there is no basis for making a distinction between them. The majority of phonetic typologies do not mention the difference. Compare voiceless nasals: no language is known to contrast two different types of voiceless nasals, and phoneticians have never suggested any basis for making such a linguistic distinction. In the same way, because of their phonetic similarity and lack of contrast, there is some doubt about the reality of the phonetic distinction between the two classes of voiceless laterals. The different descriptions applied to voiceless lateral segments in various languages could be simply a result of different terminological traditions. If this is true, then the claim that languages do not contrast voiceless lateral fricatives and approximants would be vacuous. However, we find that there are both phonological and phonetic grounds for affirming that voiceless lateral fricatives and voiceless lateral approximants are distinct types of sounds. The phonological case is made on the basis of the phonotactics and allophony of laterals in languages with voiceless lateral phonemes. In general, the following distinctions between the two voiceless types hold. The approximants occur only in prevocalic position; fricatives may occur in final position and in clusters. Fricatives may have affricate allophones; approximants do not. Fricatives may occur in languages with no voiced lateral phonemes; approximants may not. The phonetic case is made on the basis of measurements on voiceless lateral sounds from several languages from each of the two groups mentioned above. The following properties were measured: (i) the relative amplitude of the noisy portion of the acoustic signal, (ii) the spectral characteristics of the noise, and (iii) the duration of the portion of the lateral which is assimilated in voicing to a following voiced vowel. It was found that voiceless fricative laterals are noisier and have less voicing assimilation than voiceless approximant laterals. Several of the phonological and phonetic attributes which distinguish these two types of laterals can be related to what is presumed to be the main articulatory difference between them, namely a more constricted passage between the articulators for the fricative. Hence indirect evidence of this articulatory difference is obtained. Because of the differences revealed by this study, the claim that languages do not contrast voiceless lateral fricatives and approximants has been shown to have meaningful content. The more important lesson for linguistics is that a small, often overlooked difference in phonetic substance is associated with major differences in phonological patterning.
688 ON THE RELATIONSHIP BETWEEN PHONETIC AND FUNCTIONAL CHARACTERISTICS OF DISTINCTIVE FEATURES
I.G. Melikishvili
The Institute of Linguistics of the Academy of Sciences of the Georgian SSR, Tbilisi, USSR
1. The comparison of the functional characteristics of distinctive features (their frequency characteristics and their different capacities to form simultaneous and linear units) with the results of experiments on perception reveals a relationship between the phonetic and the functional data. The greater the sonority and distinguishability of a sound, the more intensively it is utilized in speech.
2. The functional characteristics of distinctive features reflect the complexity of their internal structure. These characteristics can be correlated with different phonetic components of distinctive features. The study of the features of laryngeal articulation - voice, aspiration and glottalization - reveals functional equivalents for the various phonetic components: for voice, aspiration and glottalization proper, and for different degrees of tenseness and duration.
689 A PROCEDURE IN FIELD PHONETICS
E. Niit
Institute of Language and Literature, Academy of Sciences of the Estonian SSR, Tallinn, USSR
In field linguistics one often faces a problem like the following: there is a great bulk of screening material awaiting computer analysis, but the process is rendered unduly complicated by recording disturbances on the one hand and unavoidable errors in segmentation and parameter discrimination on the other. Here the possibility of basing fieldwork on a set of test sentences may sometimes come in handy. I tape-recorded 12 sentences from 173 informants living along the whole coast of Estonia and on the islands. Clipped speech was fed into the computer (storing the zero-crossings of the signals) and its F0 contours were found by means of a special algorithm. Since my problem consisted in an analysis of the distances between turning-points in the contours, it was the turning-points that had to be determined first. As the points could not be discerned in the "naive" way, i.e. by searching the contours for their significant rises and falls only, I also made use of the fact that the sentences were known beforehand. The computing procedure, combining the "naive" with the "sophisticated", consisted of the following steps: (i) All the rises and falls of sufficient length were extracted from the contour. (ii) The sequence of rises and falls (and "nothings") so obtained was juxtaposed with a sequence of theoretically expected contours of the corresponding stressed and unstressed vowels (the long and overlong marked separately). (iii) As the number of rises and falls in the initial F0 contour considerably exceeded the theoretical expectancy, the following stage of the procedure consisted in smoothing the contours by joining shorter rises (falls) - where their occurrence was densest - with the preceding and following ones, until either an "expected" contour was obtained or the joining would have had to continue over rests more than 50 ms long. As a result, the number of falls and rises in the original contour, which exceeded the theoretical expectancy by a factor of 3.7, dropped to an excess of a factor of 1.4 only. Evidently the procedure can be counted on in fieldwork planning.
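Smoothing step (iii) can be sketched as an iterative joining of the shortest movement with a neighbour, blocked when the intervening rest exceeds 50 ms. The data representation (label, start, end in ms) and the policy of letting the neighbour's direction win are assumptions; the abstract does not fix these details.

    def smooth(movements, expected, max_rest=50):
        """movements: list of (label, start_ms, end_ms); expected: list of labels."""
        moves = list(movements)
        while [m[0] for m in moves] != expected and len(moves) > 1:
            # find the shortest movement and try to join it with a neighbour
            i = min(range(len(moves)), key=lambda k: moves[k][2] - moves[k][1])
            j = i - 1 if i > 0 else i + 1
            lo, hi = sorted((i, j))
            rest = moves[hi][1] - moves[lo][2]       # silence between the two
            if rest > max_rest:
                break                                # joining over long rests is barred
            label = moves[j][0]                      # the neighbour's direction wins
            moves[lo:hi + 1] = [(label, moves[lo][1], moves[hi][2])]
        return moves

    raw = [('rise', 0, 80), ('fall', 90, 110), ('rise', 120, 260), ('fall', 280, 400)]
    print(smooth(raw, ['rise', 'fall']))
    # [('rise', 0, 260), ('fall', 280, 400)]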
690 ON THE CORRELATION OF PHONETIC AND PHONEMIC DISTINCTIONS
A. Steponavičius
Department of English Philology, University of Vilnius, Vilnius, USSR
The correlation and interdependence of phonetics and phonology may be defined in terms of the dichotomies between language and speech, and between paradigmatics and syntagmatics. Phonetics lies in the domain of both speech and language (in that it is the level both of indiscrete speech sounds and of discrete "sound types" of language), and has both syntagmatic and paradigmatic aspects. Phonology lies in the domain of language, but not speech, and has both paradigmatic and syntagmatic aspects. Phonemes are intrinsically related to sounds (their phonetic realizations) in that distinctive features (DFs) have anthropophonic correlates. Yet phonemes are purely structural-functional entities, which are defined as members of phonematic oppositions and are endowed with constitutive and distinctive functions, whereas sounds are above all material entities, "substance". Furthermore, DFs need not necessarily be directly related to phonetic features (cf. negatively expressed DFs, distinctive features which cover several phonetic properties, or the functional difference of DFs based upon the same phonetic properties). The above generalizations may be illustrated by data from Baltic, Slavonic and Germanic languages.
691 PHONETIC CORRELATES OF REGISTER IN VA
Jan-Olof Svantesson
Department of Phonetics, University of Lund, Sweden
Va belongs to the Palaungic branch of the Austroasiatic (Mon-Khmer) languages. It is spoken by some 500,000 people in China and Burma. Like many other Austroasiatic languages, Va has lost the original opposition between voiced and voiceless initial stops, and has in its place developed a phenomenon usually termed "register". The registers in Va are referred to as tense and lax. They correspond to original voiceless and voiced initial consonants respectively. The phonetic correlates of register differ from language to language, but the following factors seem to be involved: (1) phonation type (properties of the voice source); (2) pharyngeal effects (widening or constriction of the pharynx); (3) fundamental frequency (pitch); (4) vowel quality. For this investigation minimal tense/lax pairs were recorded for each vowel from two speakers of the Parauk dialect of Va, spoken in western Yunnan province, China. The following results have been obtained. There is a small but systematic difference between the vowel qualities of the two registers, such that the lax register vowels are somewhat centralized in relation to the tense register vowels. There is no systematic difference in fundamental frequency between the two registers. There is a phonation type difference, which is reflected in a somewhat greater difference between the level of the first formant area and the fundamental frequency area of the vowel spectra for the tense register vowels compared to the lax register. (This measure has been proposed by Peter Ladefoged and Gunnar Fant.) There also seems to be a greater slope in the spectra above the F1 area for lax than for tense vowels, which indicates that the source spectrum also has a greater slope in the lax register than in the tense.
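The register measure attributed above to Ladefoged and Fant, the level difference between the F1 region and the fundamental region of the vowel spectrum, can be approximated from a short vowel segment. The band edges, windowing, and peak-level estimate below are our assumptions, not the original procedure, and the test signal is synthetic.

    import numpy as np

    def f1_minus_f0_level(signal, sr, f0, f1, bandwidth=60.0):
        """Level difference (dB) between the F1 region and the F0 region."""
        spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)

        def band_level(centre):
            band = (freqs > centre - bandwidth) & (freqs < centre + bandwidth)
            return 20.0 * np.log10(spectrum[band].max() + 1e-12)

        return band_level(f1) - band_level(f0)

    # Toy vowel-like signal: an F0 harmonic series shaped by a resonance near F1.
    sr, f0, f1 = 16000, 120.0, 500.0
    t = np.arange(0, 0.2, 1.0 / sr)
    signal = sum(np.exp(-abs(k * f0 - f1) / 300.0) * np.sin(2 * np.pi * k * f0 * t)
                 for k in range(1, 40))
    print(round(f1_minus_f0_level(signal, sr, f0, f1), 1))   # positive: F1 region dominates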
692 A DISTINCTIVE FEATURE BASED SYSTEM FOR THE EVALUATION OF SEGMENTAL TRANSCRIPTION IN DUTCH
W.H. Vieregge, A.C.M. Rietveld and C.I.E. Jansen
Institute of Phonetics; E.N.T. clinic, University of Nijmegen, the Netherlands
However extensive the literature on transcription systems may be, it remains astonishing that data on inter- and intra-subject reliability are almost completely lacking. One of the major problems in the assessment of reliability is that it requires a system with which differences between transcription symbols can be assigned numbers corresponding to the distances between the transcription symbols, or rather corresponding to the distances between the segments that the transcription symbols stand for. Preferably, these distances should be defined articulatorily rather than auditorily, since the training in the use of transcription symbols is largely articulatorily based as well. For the construction of a system in which the distances between the Dutch vowels are numerically expressed, enough experimental data may be found in the literature (e.g. Nooteboom, 1971/72; Rietveld, 1979). We find the available data with respect to the Dutch consonants less satisfactory. Spa (1970) describes the Dutch consonants by means of 16 distinctive features. One of our main objections against Spa's system is that the front-back dimension - a dimension which is crucial for the classification and the adequate use of transcription symbols - is only implicitly represented by the features cor, ant, high, low, and back. Moreover, the validity of Spa's system was not experimentally tested. We therefore decided to develop a new consonant system for Dutch with a heavier emphasis on articulation. The validity of this system was assessed by means of an experiment in which subjects were asked to make dissimilarity judgments on consonant pairs. Twenty-five first-year speech therapy students were offered 18 Dutch consonants pairwise in medial word position; they were asked to rate each pair on articulatory dissimilarity on a 10-point scale. The stimulus material consisted of 153 word pairs. In the instructions it was emphasized that during the rating process the whole articulatory apparatus should be taken into consideration. Multidimensional scaling was carried out on the dissimilarity judgments of the subjects, yielding five dimensions that can be interpreted in terms of phonetic categories. The configuration which resulted from the multidimensional scaling led us to revise our consonant feature system. Distances calculated between the consonants using the revised system correlated rather highly with the dissimilarities obtained in our experiment (r = .75). An additional evaluation of our system was performed by correlating Spa's DF system, our DF system, and the similarities obtained in Van den Broecke's (1976) experiment and in our experiment. The correlations showed that Spa's DF system is a better predictor of the auditorily based similarity judgments gathered by Van den Broecke. Our own DF system, however, is more successful in predicting the dissimilarity judgments which are based on articulation.
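Once a feature system is fixed, distances between transcription symbols follow directly; the simplest choice is to count differing feature values. The sketch below uses that unweighted count with toy feature vectors, not the authors' revised system or its experimentally validated weights.

    # Toy binary feature vectors; the authors' revised system is not reproduced here.
    FEATURES = {
        'p': {'voice': 0, 'nasal': 0, 'continuant': 0, 'front': 1},
        'b': {'voice': 1, 'nasal': 0, 'continuant': 0, 'front': 1},
        'm': {'voice': 1, 'nasal': 1, 'continuant': 0, 'front': 1},
        's': {'voice': 0, 'nasal': 0, 'continuant': 1, 'front': 1},
        'x': {'voice': 0, 'nasal': 0, 'continuant': 1, 'front': 0},
    }

    def feature_distance(a, b):
        """Number of feature values in which segments a and b differ."""
        fa, fb = FEATURES[a], FEATURES[b]
        return sum(fa[f] != fb[f] for f in fa)

    # A transcription confusion /p/ -> /b/ then counts as a smaller error
    # than /p/ -> /x/.
    print(feature_distance('p', 'b'), feature_distance('p', 'x'))   # 1 2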
693 ASSIMILATION VS. DISSIMILATION: VOWEL QUALITY AND VERBAL REDUPLICATION IN OBOLO
K. Williamson and N. Faraclas
Department of Linguistics and African Languages, University of Port Harcourt, Port Harcourt, Nigeria
In Obolo, a Lower Cross (Niger-Congo) language of S.E. Nigeria, verbal reduplication involves the lowering of the quality of stem vowels. Changes in vowel quality due to verbal reduplication in other West African languages, such as Igbo (Williamson 1972), however, usually are in an upward rather than a downward direction. The exceptional behavior of Obolo in this respect may be considered to be the result either of dissimilatory or of assimilatory processes. The dissimilatory explanation is based on Ohala's (1980) perceptual model and the hierarchies of vowel assimilation in Obolo established by Faraclas (1982). The assimilatory explanation relies primarily upon recent work indicating that Proto-Benue-Kwa had a 10-vowel system (Williamson 1982) and some apparent areal characteristics of 'Bantu Borderland' languages. Both explanations will be discussed in light of the questions that each raises regarding the production, perception, and historical development of vowels in West African languages as well as in languages universally.
Section 17 Phonology
697 ON DIFFERENT METHODS OF CONSTRUCTING A PHONOLOGICAL MODEL OF LANGUAGE
L.V. Bondarko
University of Leningrad, Leningrad, USSR
Two different approaches to the formation of a phonological system are considered: the classical linguistic approach, based only on the analysis of existing oppositions differentiating meaning, and the approach in which the speech habits of the speakers of a language are taken into consideration. The system of phonemes, their distinctive features, and their relation to other layers in the organization of a language are seen differently from these two perspectives. For example, the classical model of Russian vocalism becomes "deformed" if we do not take into account speech habits and, above all, speech perception. Those problems of phonological interpretation are considered that native speakers of Russian perceive in one and the same way, e.g. the phonological status of the feature front vs. back, and the relation between [i] and [ɨ]. For the speakers of a language, the position of a vowel in the phonemic system is determined not only (and not exclusively) by possible oppositions, but also by the extent to which it alternates with other phonemes, by its frequency of occurrence, and by the realization of its potential to be an exponent of linguistic meaning. The speech habits of the speakers of a language can be considered as evidence of their linguistic competence, which can in turn be seen as a result of the influence of the language system.
698
NOTIONS OF STOPPING CLOSURE AND SOUND-PRODUCING STRICTURE FOR CLASSIFYING CONSONANTS AND ANALYSING AFFRICATES
L.N. Cherkasov
Department of English Philology at the Teachers' Training College, Yaroslavl, USSR

The widely used term "complete closure" is not entirely devoid of ambiguity. So it seems proper to distinguish between a directing closure, which directs the flow of air along one of the two existing ways (oral or nasal), and a stopping closure which, though "mute", affects the following sound. Only the latter is essential for descriptive phonetics. It helps to divide consonants into stops, articulated with a stopping closure, and continuants, which have only directing closures. The stopping closure is not regarded as a kind of stricture, the latter being defined here as a sound-producing obstruction. Any stopping closure is pre-strictural. The off-glide in stops and the retention stage in continuants are strictural. So the stricture can be dynamic (in stops) or static (in continuants). In English the dynamic stricture is either abrupt or gliding, the static stricture being narrow or broad. Accordingly, the English consonants can be divided into abrupt and gliding stops, and narrow and broad continuants. Acoustically, they are plosives, affricates, fricatives, and sonorants. The stopping closure and the stricture not only serve as the basis of classification, but also help to better understand the monophonemic nature of affricates, for they show that there is no essential difference in the basic structure of a plosive and an affricate: both of them consist of two elements, one of which is a stopping closure and the other a stricture, the difference between the two types of consonants being strictural, not structural. This does not mean that any combination of stopping closure and gliding stricture is an affricate, as is evidently believed by those phoneticians who consider /ts, dz, tθ, dð, tr, dr/ to be affricates. According to our observation, an abrupt stop becomes gliding before a homorganic continuant. To explain this phenomenon we postulate an assimilation affecting the muscular tension in the articulator. The fact is that with plosives all muscles of the fore part of the tongue are strong and release the closure abruptly and instantaneously. With affricates this part of the tongue is weak. When an abrupt stop is followed by a homorganic continuant, a redistribution of muscular tension takes place so that the tip of the tongue is weakened and the gliding stricture substitutes for the abrupt one. What occurs in those clusters, and even in /t+j, d+j/, is affrication of /t, d/ and reduction of the continuants. These two factors make /ts, dz, etc./ sound like affricates. Since the latter are regarded as monophonemic entities, we suggest for the aforenamed clusters the term "gliding consonants", for it adequately shows the articulatory peculiarities of those sounds and bears no phonological connotations.
699
PHONETIC PROCESSES IN SPOKEN MOLDAVIAN
G.M. Gozhin
Academy of Sciences of Moldavia, Kishinev

This paper systematizes for the first time the sound structure of the spoken style of Moldavian in comparison with codified literary Moldavian. 482 subjects were examined. The material is analysed according to the method of E.A. Zemskaya (Moscow, 1973). No quantitative changes take place in the structure of spoken Moldavian. The vowel system of spoken Moldavian comprises 7 vowel phonemes, like codified literary Moldavian, characterized by the same distinctive features. Nevertheless, a whole series of specific phonetic processes is found in it. Full reduction affects the close vowels, which are especially subject to this phenomenon (HyK.yu>°T - H^kU>Ôi "petit hoyer"). Partial reduction affects the post-tonic close vowels (MÂnyruJie - Jif)JlyP"jiu) ("the banks"). Closing affects the open and half-open vowels, stressed and unstressed (kAc3 - KÂCti, MîcÉM - fiiciJlU). Opening affects the close vowels (Bej>tir3 - iepéru). Backing affects the vowels (rnru?é3 - ryrypé3). Delabialization affects the vowels (cvHK.ywôr - c.m4Kaat6P, RHÀiroAt) - t)H/iiT34,fi). Complete hiatus reduction involves one of the constituents of the hiatus (joojoçûe - 30MOXÛË, COYUÂA - COf'/Cr). Partial hiatus reduction affects the vowels (reofiie - T"OrtÎu). Pretonic vowel groups are diphthongized into rising diphthongs (14), while post-tonic groups are diphthongized into falling diphthongs (14) (MoHAHAJl - MMAUùjiji, JlékifiMJie - Jtekyuûe). The consonant system of spoken Moldavian (22 phonemes) is characterized by a firm mode of articulation. Certain processes of affrication and depalatalization are observed (ytive - i(ûttn, jécrse - Ajécrre), among others. Consonantal syllables appear (C9 ce - BQA3 - C iÂAll). The phonetic processes at work remove the system of spoken Moldavian considerably from codified literary Moldavian.
700
STABLE ELEMENTS IN THE PHONETIC EVOLUTION OF LANGUAGE
Ts. A. Gventsadze
The Tbilisi State University, Tbilisi, USSR

There are two permanently interacting forces in language: one directed at language change, the other acting towards the preservation and stability of its elements. The historical study of languages is mainly aimed at the analysis of the first force, which finds its explanation in terms of function-, substance- or structure-orientated approaches. However, it does not seem contradictory to assert that the same approaches can be taken into consideration in the analysis of the stable elements and features of the phonetic system. Elements of the phonetic system may have various degrees of stability, which can be defined according to a number of criteria. The first criterion may be called phonological. This was clearly formulated by Martinet, who asserted that elements of the phonetic system are preserved in a language if they are functionally loaded, i.e. if there is a necessity for a phonological opposition. The second criterion is phonetic. It is based on the phenomenon of diaphonic variation. Sound elements are preserved when the degree of diaphonic variation is minimal. The third criterion may be called phonotactic and has been least investigated. It is based on the phenomenon of allophonic variation, the characteristic arrangement of phonological items in sequence and their possible co-occurrence. An example of this is provided by the word-initial, biphonemic complexes [br], [fr], [tr], [gr] in French, Spanish and Italian. Their phonetic substance has changed while the phonotactic pattern remains intact. This all testifies to the fact that in analysing the degree of stability of a phonological element in the system one should take into consideration all three criteria.
701
A STRUCTURAL-FUNCTIONAL ANALYSIS OF ONOMATOPOEIAS IN JAPANESE AND ENGLISH
Hisao Kakehi
Kobe University, Japan

(1) Originally, onomatopoeias are meant to imitate the sounds of the outer world. Since they are linguistically devised, however, they are influenced by the phonological systems of the respective languages. In the present section, our chief concern is limited to the reduplicated forms of onomatopoeias. In Japanese, the same set of vowels is repeated in the first and the second parts of the onomatopoeic expression (e.g. gon-gon, pata-pata, etc.), while in English, a different set of vowels is usually employed (e.g. ding-dong, pitter-patter, etc.). This is mainly because Japanese has pitch accent, and English has stress accent. The above-mentioned phenomenon is found not only in onomatopoeic words but also in the phonemic structures of the words of these languages. In English, no two vowels with equal stress and quality can occur in a word, but this does not apply to Japanese (e.g. aka 'red', kimi 'you', kokoro 'mind', etc.). Since the sound of the outer world is a physical continuum, Japanese seems to be more suitable to express sounds as they are, in that it permits the repetition of the same set of vowels. English, on the other hand, describes the natural sound at a more indirect (i.e. more lexicalized*) level. This is proved by the fact that, in English, a part of the reduplicated form of an onomatopoeia can operate as a free form (e.g. ding, dong, patter, etc.); in Japanese, however, this is not the case. For example, gon of gon-gon, or pata of pata-pata, still remains at the onomatopoeic level. (2) Japanese onomatopoeic expressions are usually realised as adverbials, while those in English, except nonce formations, are most frequently realised as nouns and verbs. From the syntactic point of view, adverbials, functioning as the optional modifiers of the verb, can enjoy much wider positional freedom in a clause, compared with nouns and verbs, which function as obligatory components of a clause like subjects, objects and predicate verbs (e.g. Kodomotachi ga wai-wai, kya-kya itte-iru. 'Children are screaming and shrieking.'). Chiefly for this reason, the Japanese language can more freely create onomatopoeias such as run-run (an expression indicating that things are going on quite smoothly), and shinzori, which describes the snow piled up high and still on leaves of an evergreen, about to fall off them in the glittering morning sunshine. What is stated in the two sections above may explain some of the reasons why the Japanese language abounds in onomatopoeic expressions.
* With regard to "degrees of lexicalization", see Kakehi, H. (forthcoming) "Onomatopoeic expressions in Japanese and English", in Proceedings of the 13th International Congress of Linguists.
702
THE FATE OF NON-CORRELATING PHONEMES IN REALIZATION
R. Kaspranskij
College of Foreign Languages, Gorki, USSR

The realization of phonemes is subject to system-based and normative regulation. System-based regulation depends on the oppositional and correlative connections of the phoneme: the more extensive these connections are, the greater the systemic pressure the phoneme undergoes, and vice versa. The liquids stand apart in the consonant system: they enter into no serial or row correlations (only in a few languages do they correlate with other consonants on the modal distinctive feature "strident vs. non-strident"). The fact that the liquids do not correlate with other consonants in the consonantal opposition system, and that within it they are characterized solely by negative distinctive features ("non-plosive", "non-fricative", "non-nasal"), means that their territory is not "threatened" by other consonants and is therefore not strictly fixed and localized. This "freedom" of their position in the consonant system explains why, in realization, the liquids often cross their boundaries and may deviate from their realization norms through a semi-consonantal state to semi-vocalic and vocalic ones, sometimes up to the complete loss of the consonant, i.e. zero realization. Examples from Germanic, Slavic, Romance and other languages attest to this, such as [l'] > [j] in French, [l] > [u] in Dutch, [l] > [w] in Polish and Ukrainian, [l] > [i] in Viennese, and others. Some researchers see the conditions and causes of these "liberties" in realization in the fact that these phonemes carry little or no functional load (M. Lehnert, G. Meinhold, H. Weinrich, among others). Facts from Slavic languages speak against this, however, where e.g. the liquid [l] is functionally very important (especially in the verb paradigm). The realization of the liquids is accordingly regulated not by systemic pressure but mainly by social-traditional realization norms.
703
ONOMATOPOEIA IN TABASCO CHONTAL
K. C. Keller
Summer Institute of Linguistics, Mexico

Most onomatopoetic words in Chontal, a Mayan language spoken in southern Mexico, not only fit the systematic phonetic patterns of the language, but also have regular grammatical formations and functions. There is no general term for sound or noise, but many specific terms. Many of the specific terms are formed by an onomatopoetic root followed by the suffix -lawe "sound of" to form a noun, or the root plus the suffix -law to form words with stative verb function or adverbial function. Examples: top'lawe "sound of bursting, as of cocoa beans when toasting", tsahlawe "sound of frying", ramlawe "whirring or whizzing sound", hak'lawe "hiccough", hamlawe "roar or zoom of something going rapidly by, as a bus", wets'lawe "sound of water being thrown out", wek'lawe "sound of water moving in a vessel". The adverbial function is shown by ramlaw u nume "it goes whizzing by". The verbal function is shown in wek'law ha? tama t'ub "the water sounds from being shaken in the gourd". Whereas the -lawe/-law forms represent non-specified or unmarked action sounds, iterative or repetitive action sounds are represented by complete reduplication of the basic CVC onomatopoetic root plus the suffix -ne for nouns or the suffix -na for adverb or stative verb function. Examples: wohwohne, wohwohna "barking of dogs", tumtumne, tumtumna "beating of drums", ts'i?ts'i?ne, ts'i?ts'i?na "sound made by rats or bats", loklokne "gobble of turkey", t'oht'ohne "sound of pecking or of knocking on wood", ?eh?ehne "grunt of pig", ramramna "hum or buzzing of bees or mosquitoes". Complete reduplication for iterative action is a productive pattern, and occurs also in words which are not onomatopoetic, as in liklikne, liklikna "shivering", p'ikp'ikna "blinking the eyes", welwelna "flapping in the wind". Continuous action sound words are formed by reduplication of the stem vowel plus -k followed by the suffix -ne for nouns or the suffix -na for adverb or stative verb function words. Examples: sahakne "sound of wind in the trees", hanakne "purr of cat or roar of tiger", kats'akne yeh "gnashing of teeth". Onomatopoetic roots sometimes occur as the stem for transitive or intransitive verbs, as in wohan "bark at" (compare with wohwohna "sound of dog barking"), tsahe? "fry it" (compare with tsahtsahne "sound of frying"), t'ohe? "peck at it" (compare with t'oht'ohne "sound of pecking or knocking on wood, as a woodpecker"), u top'e "it bursts" (compare with top'lawe "sound of bursting"), wets'an "throw out water" (compare with wets'lawe "sound of throwing out water"). Onomatopoetic roots may also enter into compounds, as in ramhule? "throw it with whizzing sound" (hule? "throw"), rahhats'e? "spank or hit with spanking noise" (hats'e? "hit").
704
COMPARISON OF RUSSIAN SPEECH SEGMENT-RHYTHMIC ORGANIZATION OF GERMAN SPEAKERS AND TATARS
R.E. Kulsharipova
Kazan University

Language contact investigation must be regarded as important in view of language policy, teaching practice and the functioning of language in society. The main disadvantage in the analysis of language contact specifics is the neglect of the segment-rhythmic typology of speech. Our research is built on a comparison of the speech sound content of German students and Tatars studying at the Russian philology department of Kazan University. The Russian speech of Germans and Tatars is characterized by two variants of syntagmatic division: each syntagma is correlated with an aspiration group and vice versa, and each rhythmic structure represents an independent syntagma; incorrect intonation of syntagma melodic schemes was noticed here, viz. neutralization of complete vs. incomplete communicative types; most cases of rhythmic interference occur in polylogue speech, fewer in dialogue speech. Acoustic analysis showed the activation of the representants (Î.O.IO), which testifies to other zones of vocalic allophone dispersion than in the Russian literary language. The phonostylistic information of each sound type makes it possible to distinguish the following positions in unprepared speech, meaningful for the neutralization of vowels by non-Russians: the stressed syllable in logical and non-logical centres of the syntagma; the first pretonic and first post-tonic open syllables in logical and non-logical centres of the syntagma; the second pretonic and post-tonic closed syllables. The distinguishable peculiarities of non-Russian segment-rhythmic speech organization testify to a mixed type of prosody under conditions of contact between speakers of different language systems. We consider this important for the creation of general typological complex models of speech prosody.
705
TOWARDS A TYPOLOGY OF PHONOLOGICAL SYSTEMS OF LANGUAGES OF CENTRAL AND SOUTH-EAST EUROPE
M.I. Lekomtseva
Institute of Slavic and Balkan Studies, Academy of Sciences, Moscow, USSR

The theoretical aspect of our report concerns the interrelation between a phonological opposition and phenomena within a certain area. The facts under consideration are the palatal occlusives widespread in the Latvian, Czech, Slovak, Hungarian, Albanian, Macedonian, Serbo-Croatian and Slovenian languages, as well as in dialects of Polish, Romanian and Bulgarian. These languages are surrounded by both types of languages, i.e. those with the palatalization correlation and those lacking both palatalized and palatal consonants. To the East, Lithuanian, Russian, Byelorussian, Ukrainian and Bulgarian possess the palatalization correlation; here and there small islands are included where k' and t' are interchangeable. To the West and to the North we find the Germanic languages without either type of correlation. In the old Slavic areas of Polabian and Sorbian, as well as further to the West in the Celtic group of languages, one again encounters the palatalization correlation; it is remarkable that one finds palatalization in those languages where one finds the interchange of k' and t'. From the extralinguistic point of view it is important to emphasize that the entire zone of languages with palatal occlusives coincides with the genetic pool of gene B at a statistical level of 10-15%. Apart from the genealogical and typological types of linguistic interconnection, the term "linguistic pool" must be introduced in order to indicate the similarity between languages caused by interchange of genetic information. The linguistic pool of palatal consonants is the borderline zone between the satem and centum groups of Indo-European languages. It may be suggested that Proto-Indo-European possessed palatal occlusives, at least for the satem languages. In the historical shifts of the palatal consonants a vacant cell (according to A. Martinet) was repeatedly filled from different sources. The satem languages, where the retroflex order had developed on the basis of a different substratum, were prevented from developing the palatals. The languages of Central and South-East Europe, developing according to an archaic phonological pattern correlated with the genetic model of articulation processes, created time and again new layers of palatal consonants.
706
A CERTAIN REGULARITY OF AFFRICATIZATION IN KARTVELIAN LANGUAGES
Amiran Lomtadze
Tbilisi, USSR

I. According to an accepted point of view, one of the ways of affricatization in Kartvelian languages is the organic fusion of obstruent and spirant consonants: t + s > c, t + š > č, d + z > ʒ ... Literary Georgian: at + sammeti > cameti, at + švidmeti > cvidmeti, bed + šavi > bet + šavi > becavi ... Khevsurian: erturt-s > ertur-c ... Gurian-Adjarian: gverd-ze > gver-ʒe, gverd-ši > gvert-ši > gverči ... Cvanian: padasa-s > padsa-s > patsa-s > paca-s ...
II. According to some Kartvelologists, spirants immediately following the resonants /m, n, l, r/ are also affricatized (I. Megrelidze, Sh. Gaprindashvili, Kobalava, G. Klimov ...): kal-s > kal-c, vard-s > var-s > var-c, žensia > žencia, xelsaxoci > xelcaxoci, aršva > arčva ... Sh. Gaprindashvili is inclined to explain this phenomenon by the assumption that the affricatization of spirants in this position is caused by a special kind of off-glide in the resonants.
III. Contrary to this last consideration, our observations suggest that the spirants turn into affricates when the spirants s or z, š or ž come directly after an obstruent consonant. A noise-obstruent consonant (voiced, aspirate, glottalized) provokes affricatization in the same manner as the resonants do. It is immaterial whether they are pure obstruents, affricates or the so-called mixed ones. The important thing is that their articulation is characterized by occlusion. The partial fusion of the occlusion component with the spirant produces an affricate. The phenomenon is of a labile nature and characterizes all three Kartvelian languages: Georgian, Svanian and Colchian (Megrelian-Chanian).
a) Colchian (Megrelian dialect): Nominative case dxoul-ep-i, Dative case dxoul-ep-s >// dxoul-ep-c; dud-i - dud-s >// dud-c; toronǯ-i - toronǯ-s >// toronǯ-c >// toron-c; koč-i - koč-s > koč-c > ko-c ... čxom-i - čxom-s // čxom-c ... cil-i - cil-s > cil-c ... čkun-s > čkun-c, lur-s > lur-c ... kurs-i >// kurc-i, erskem-i >// erckem-i, rskin-i >// rckin-i. Chanian dialect: memsxveri > memcxveri ... mzora > mʒora, mzvabu > mʒvabu ...
b) In literary Georgian, affricatization of the kind of spirants mentioned above occurs rarely, but in the dialects, especially in the Gurian dialect, it is comparatively frequent. Gurian: mčad-s > mčad-c, datv-s > *dat-s > dat-c, kunʒ-s > kunʒ-c // kun-c, kvic-s > kvic-c (// kvi-c), lurʒ-s > lurʒ-c, bič-s > bič-c (> bic) ... ucqvet-s > ucqvet-c > ucqve-c ... elisabedi > elsabedi > elcabedi ... Gurian, Imeretian, Kartlian: šanšiani > šančiani, mamzace > mamʒace ... Adjarian: bžravil-s > bžravil-c, ertsaxe > ersaxe > ercaxe ... Literary Georgian: sabzeli > sabʒeli, anderzi (Persian andarz) > anderʒi ... mzaxali > mʒaxali. Gurian: elanze > elanʒe ... Gurian, Adjarian: xarsva > xarcva ... Meskhian: cver-ši > cver-či ... Imeretian: midžem-ši > midžem-či ... buržuazia ...
c) The above regularity of the spirants' affricatization is also observed in the Svanian language: xat-si > xat-ci, xecsix > xeccix ...
707
THE PHONETIC STRUCTURE OF THE WORD IN LITERARY UZBEK
A. Makhmudoff
Tashkent, Uzbek SSR, USSR

The author examines the phonetic structure of the word in literary Uzbek as four mutually connected substructures: a) the substructure of phonemes; b) the substructure of phoneme combinations; c) the syllabic substructure; d) the accentual-rhythmical substructure. The study of the phonetic system of Uzbek leads to the conclusion that vowel harmony is the basic sign of the phonetic word in Turkic languages, while in literary Uzbek the sound-compounding role belongs to the accent; in Turkic languages the phonetic kernel of the word is located at the beginning, while in Uzbek the phonetic structure of the root and affixes depends on the accent. Thus, in literary Uzbek the phonetic structure of the word has its own distinguishing features, unlike the other Turkic languages. In Uzbek, combinations of consonants may be situated not only in the middle of the word but at its beginning too. This is an innovation in modern Uzbek. The Uzbek syllables and the principles of syllabation are also examined in this work.
708
REGULARITIES IN THE FORMATION OF CONSONANT CLUSTERS
M. Rannut
Institute of Language and Literature, Academy of Sciences of the Estonian SSR, Tallinn, USSR

This paper provides the results of a statistical processing of Estonian consonant clusters, presenting a complete list of the clusters with their frequencies of occurrence as well as their structural regularities. Estonian consonant clusters can be divided into the following three groups: 1) genuine clusters; 2) clusters of later origin (developed through vowel contraction); 3) clusters occurring in foreign words only. Some of the phonetic constraints characteristic of genuine consonant clusters are abandoned in clusters of later origin. Structural rules determining the consonant clusters in modern usage can be characterized by means of prediction formulas, which are presented together with their corresponding probabilities. To handle certain phenomena that change the cluster structure (syllable boundary, assimilation), we offer additional formulas to be applied in those cases when a syllable boundary passes through the consonant cluster or when one of the components of the cluster is not pronounced.
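A minimal Python sketch of the kind of statistical processing described, with a toy word list standing in for the Estonian corpus: clusters are counted, and a simple conditional probability of the next consonant given the current one stands in for the paper's prediction formulas.

    import re
    from collections import Counter, defaultdict

    # Toy word list (illustrative only, not the actual corpus).
    words = ["linn", "kombe", "kartma", "mustlane", "varsti", "kindel"]
    VOWELS = "aeiouõäöü"

    def clusters(word):
        # All maximal consonant sequences of length >= 2.
        return [c for c in re.split(f"[{VOWELS}]+", word) if len(c) >= 2]

    freq = Counter(c for w in words for c in clusters(w))
    print("cluster frequencies:", dict(freq))

    # P(next consonant | current consonant), weighted by cluster frequency.
    trans = defaultdict(Counter)
    for cl in freq.elements():
        for c1, c2 in zip(cl, cl[1:]):
            trans[c1][c2] += 1

    for c1, nxt in sorted(trans.items()):
        total = sum(nxt.values())
        for c2, n in sorted(nxt.items()):
            print(f"P({c2} | {c1}) = {n / total:.2f}")

The additional formulas mentioned for syllable boundaries and assimilation would amount to conditioning these counts on boundary position as well.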
709
SCHWA AND SYLLABLES IN PARISIAN FRENCH
Annie Rialland
C.N.R.S., E.R.A. 433, Paris, France
The problem of schwa is the most written about in French phonology, whereas the problem of phonetic syllabation is not well known. Using phonetic data as a starting point, we shall show that two syllabic statuses must be posited for the schwas. Some of them are phonological nuclei while the others are epenthetic vowels following a closed syllable. We shall designate the first schwas as nucleus schwas and the second ones as non-nucleus schwas. The nucleus schwa occurs inside lexemes, the non-nucleus schwa occurs elsewhere. Several facts show that the nucleus schwa is a phonological nucleus: 1 - the preceding syllable has the allophones of the open syllable [o] and [e]. 2 - the nucleus schwa has the property of being represented in certain contexts (mainly at the beginning of words) by a nucleus when the [ə] is elided. For example, the consonant t of t(e) renverser does not become the onset of a syllable, the realization of t and r being different from the realization of the cluster tr, and it does not become the coda of a preceding syllable, the coarticulation of t and r in t(e) renverser being stronger than that of t and r when they belong to two different syllables. For t(e) renverser, we shall posit the following syllabation at the phonetic level:
(syllabation schema for t(e) renverser)
The nucleus, being free because of the elision of the [ə], is filled by the fricative (or vibrant) r. The process can be schematized as follows.

was sufficient to cue accentedness. Correct recognition was 63% and 71% (for the initial stop in /ti/ and /tu/); and 66% and 69% (for the vowels of /ti/ and /tu/). Finally, Exp. 5 showed that listeners were able to detect accentedness when just the first 30 ms of /tu/ was presented, even though these "/t/-bursts" were not identifiable as speech sounds. These results are discussed in terms of the role of phonetic category prototypes in speech perception.
736
PATTERNS OF ENGLISH WORD STRESS BY NATIVE AND NON-NATIVE SPEAKERS
Joann Fokes, Z. S. Bond, Marcy Steinberg
School of Hearing and Speech Sciences, Ohio University, Athens, Ohio U.S.A.

The purpose of this study was to examine the acoustical characteristics of English stressed and unstressed syllables in the speech of non-native speakers, in comparison to the same syllables produced by native speakers. Non-native speakers were college students of five different language backgrounds, learning English as a second language, and enrolled in an English pronunciation class. Native speakers were college students from the midwestern United States. Students were tape-recorded while reading five types of words: 1) prefixed words with second syllable stress, e.g. "confess"; 2) the same words with an "-ion" suffix, e.g. "confession"; 3) and 4) words which change stress patterns upon suffixation, e.g. "combine"/"combination"; and 5) words of similar phonetic form but stressed on the initial syllable, e.g. "conquer". Selected suffixed words given in citation form were also compared with a sentence form, spoken in answer to a question, "Was the confession accepted?"/"The confession was accepted." Measurements were made of fundamental frequency, relative amplitude, and duration for the stressed and unstressed syllables. Mean differences of these three acoustical correlates of stress were compared between the native and non-native speakers for all classes of test syllables. The correlates of stress as used by non-native and native speakers are discussed.
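The comparison described reduces to simple bookkeeping once the syllables have been measured: mean stressed-minus-unstressed differences in fundamental frequency, amplitude and duration, per speaker group. A Python sketch with invented placeholder values (not the study's data):

    import numpy as np

    # (stressed, unstressed) measurements in Hz, dB and ms for words
    # like "confess"; fabricated demo numbers.
    native = {
        "f0":  ([142, 150, 147], [118, 120, 115]),
        "amp": ([72, 74, 73],    [64, 66, 65]),
        "dur": ([210, 230, 220], [120, 130, 125]),
    }
    nonnative = {
        "f0":  ([138, 136, 140], [130, 133, 129]),
        "amp": ([70, 71, 69],    [67, 68, 66]),
        "dur": ([200, 190, 205], [170, 180, 175]),
    }

    def mean_differences(group):
        # Mean stressed-minus-unstressed difference per acoustic correlate.
        return {k: round(float(np.mean(s) - np.mean(u)), 1)
                for k, (s, u) in group.items()}

    print("native:   ", mean_differences(native))
    print("nonnative:", mean_differences(nonnative))

Smaller differences for the non-native group on any correlate would indicate which cue of stress is under-exploited.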
Section 19

737
CORRECTIVE PRONUNCIATION TEACHING ON AN AUDITORY BASIS
Hans Grassegger
Institut für Sprachwissenschaft der Universität Graz, Austria

In pronunciation teaching, the auditory aspect plays a fundamental role, since it precedes production in the learning process. Sound interference phenomena in L2 acquisition may also rest on characteristic hearing errors, one cause of which lies in the auditory similarity of source-language and target-language phones. The present study investigates the auditory similarity of the voiceless dental lateral fricative [ɬ] (which does not occur in German) to a series of possible substitutions, as judged by German listeners. From the similarity judgments on all sound pairs containing the lateral fricative as first or second stimulus, consequences can be drawn for the construction of a corrective programme for particular problem sounds of the target language. Accordingly, target-language problem sounds should be contrasted with their auditorily motivated source-language substitutions, thereby establishing the auditory-discrimination prerequisite for correct production of the target-language sound.
738
PHONIC TRANSFER: THE STRUCTURAL BASES OF INTERLINGUAL ASSESSMENTS
Allan R. James
Engels Seminarium, Universiteit van Amsterdam, Netherlands

The paper discusses certain inadequacies in the phonetic and phonological explication of foreign language pronunciation data which derive from a restricted view of the structural determinants of TL pronunciation behaviour. Above all, existing frameworks of description seem unable to account for the inherent variability in TL pronunciation. This, it is claimed, is a product of a 'compartmentalized' view of the structure of sound systems as well as of a simplified view of the dynamics of the TL acquisition and production processes themselves. Central to the latter, however, are assessment procedures of the foreign language learner in which TL forms are judged for their compatibility with forms of his NL. These assessments in turn lay the basis for the phonic transfer of elements of the NL into the TL. The potentiation for transfer of NL forms is a product of the degree of relatedness as perceived between TL and NL sound elements within the phonological, phonetic and articulatory levels of sound structure. Actuation of transfer is a product of properties of the suprasegmental context within which the TL form is located. These structural properties of context account for the positional constraints on TL production and determine the form of the pronunciation variant used. An analysis of the typical occurrence of non-target segments in the English pronunciation of Dutch speakers compared to that of German speakers is offered by way of illustration of the points made.
739
THE SYLLABIC-ACCENTOLOGICAL MODELS OF RUSSIAN NOUNS
E. Jasová
Pädagogische Hochschule, Banská Bystrica, CSSR

In our view, one of the basic problems in teaching Russian as a foreign language in Slovak and Czech schools is stress, which is characterized by a complex of phonetic and morphological properties. The typical feature of Russian is its so-called free stress, in contrast to Slovak or Czech, in which stress is bound to the first syllable of the word. The main object of our investigation is the syllabic-accentological relations of Russian nouns, on the basis of the frequency dictionary of the Russian language.1 These are examined, for a limited number of nouns (underived and derived; simple and compound; "native" and "foreign"), under two aspects. They are recorded in a list by syllable count and decreasing frequency. The aspects of investigation are: 1. The syntagmatic aspect. We establish the distribution of free Russian stress in nouns according to the dictionary citation form (Nom. Sing.) from the standpoint of the grammatical category of gender. 2. The paradigmatic aspect. We investigate: a) the syllabic variability of the word forms of the paradigm, and b) the movement of stress in the word forms of the nouns. Our investigation proceeds from the accentological conception of V. Straková,2 which seems to us particularly suitable from the standpoint of pedagogical practice in teaching Russian as a foreign language.
1/ Častotnyj slovar' russkogo jazyka, ed. L.N. Zasorina, Moscow 1977.
2/ Straková, V.: Ruský přízvuk v přehledech a komentářích, Praha 1978.
740
INTERLANGUAGE PRONUNCIATION: WHAT FACTORS DETERMINE VARIABILITY?
J. Normann Jørgensen
Dept. of Danish, Royal Danish School of Educational Studies at Copenhagen

The material for this investigation is the Danish language as spoken by young Turkish immigrants attending the public schools of Metropolitan Copenhagen. Several social and educational factors have been covered, and certain linguistic characteristics have been investigated. Regarding the young Turks' pronunciation of Danish, it was found that "contrastive" deviations from native Danish were frequent. Naturally, so were non-deviant features. When Turkish-Danish pronunciation is described as an interlanguage, a variation must be accounted for which cannot simply be described in the usual terms for deviation: interference, generalization etc. Rather, the variation seems to depend on the variation within natively spoken Danish as well. An example: Danish short non-high back tongue vowels ([o+, or, oT]; Turkish has [o]). For Danish [o+] we often hear in Turkish-spoken Danish. Adjacent to [ð], however, [o] is predominant. The native Danish [o+] and [or] are represented in Turkish-spoken Danish by [o]. That native Danish [oT] is not represented by [o+] is probably due to the fact that the dominant sociolect in immigrant-dense parts of Copenhagen often has [oT] for standard Danish [o+]. On the other hand, the young Turks sometimes do have [o+], i.e. non-deviant standard Danish pronunciation. This is particularly frequent among young Turkish women. Such variation is similar to sex-related variation among native Danish speakers. It is tempting to describe an interlanguage as a series of stages, or a continuum, between a source language and a target language. Of course, it has long been realized that not every deviation from the target language can be related to the source language and that the interlanguage therefore is more complicated than that. It seems, however, that the target (as such) is to be understood in no simpler way. Any description of at least this particular (Turkish-spoken-Danish) interlanguage must take the complicated reality of Danish variation into account. Consequently, phonological considerations will conceal some of the interlanguage features. The interlanguage variation is no less systematic for that reason: it is partly systematic in the way described by Dickerson, partly systematic like intrinsic language variation as described by e.g. Labov, but first of all, as a whole, more complex than either, in the way social and linguistic factors interrelate.
References: M. Søgaard Larsen: Skolegang, etnicitet, klasse og familiebaggrund. Danmarks pædagogiske Bibliotek, Copenhagen 1982. M. Søgaard Larsen & J. Normann Jørgensen: Kommunikationsstrategier som grænsedragningsfænomen. KURSIV 3, Copenhagen (Dansklærerforeningen), 1982, p. 19-35. J. Normann Jørgensen: Det flade a vil sejre. SAML 7, Copenhagen University Dept. of Applied Ling., 1980, p. 67-124. J. Normann Jørgensen: Kontrastiv udtalebeskrivelse. In Gabrielsen & Gimbel (eds): Dansk som fremmedsprog, Lærerforeningens Materialeudvalg, Copenhagen 1982, p. 297-336. Lonna J. Dickerson: The Learner's Interlanguage as a System of Variable Rules. TESOL Quarterly, Vol. 9, No. 4, December 1975.
741
TIMING OF ENGLISH VOWELS SPOKEN WITH AN ARABIC ACCENT
Fares Mitleb
Department of English, Yarmouk University, Irbid, Jordan

This study attempts, first, to determine to what extent the temporal properties of Arabic-accented English vowels resemble the first or the second language; and second, to examine the extent to which abstract-level differences between Arabic and English affect Arabs' production of the phonetic manifestation of the phonological rule of flapping in American English and of the novel English syllable type CVC. Results show that the Arabs failed to exhibit a vowel duration difference for voicing and produced an exaggerated length contrast of tense vs. lax vowels that more closely resembles Arabic long and short vowels. Thus the temporal properties of Arabic-accented English vowels are only slightly altered from the corresponding Arabic values. On the other hand, the Arabs thoroughly acquired the American segmental phonological rule that changes intervocalic /t, d/ into the apical flap [ɾ] and correctly produced the novel English CVC syllable type instead of their native CVCC. These results support the hypothesis that phonetic implementation rules are more difficult for an adult language learner to change than rules which can be stated at the level of abstract segmental units. (Research supported by Yarmouk University)
742
INDIVIDUAL STUDENT INPUT IN THE LEARNING OF FL PRONUNCIATION
H. Niedzielski
Department of European Languages at the University of Hawaii, Honolulu, Hawaii, USA

Acoustic phonetics based on contrastive recordings of the source and target languages can be learned individually by students with a minimum of intervention by the teacher. The teacher's help is needed more for articulatory phonetics. However, here again, with contrastive descriptions of both languages, most of the work can be performed by students themselves on an individual basis. As I often say to my students in French phonetics, I can show you how to improve your pronunciation but I cannot improve it for you. That is your part. Consequently, I provide them with various oral and visual contrastive materials and assign them various tasks to perform individually. Among others, they keep a diary of their own efforts in the language laboratories, at home or anywhere. They report their problems, solutions, failures and successes; and we discuss all these in class, in my office or in the cafeteria. This paper will present some of my students' comments about the materials, techniques, and approaches which they find so attractive and efficient. Sample materials will be exhibited.
743
VOICE QUALITY AND ITS IMPLICATIONS
Paroo Nihalani
Department of English Language and Literature at the National University of Singapore, Singapore

The widespread use of Daniel Jones' English Pronouncing Dictionary in the Commonwealth countries seems to imply that British Received Pronunciation (BRP) is the model of English prescribed for learners of English in these countries. The speaker feels that this form of pronunciation represents an unrealistic objective, and one that is perhaps undesirable. BRP is a 'normative' model that limits itself to the consideration of communicative intentions attributed to the speaker only. The speaker argues in favour of a pragmatic model, a two-way interactional model within the framework of Speech Act theory, which considers the hearer as an active participant. Only the observation of the hearer's answer can tell whether the speaker has succeeded in performing his/her speech act. The importance of para-phonological features such as 'pleasant' voice quality for communicative purposes will be discussed. It is suggested that a course in Spoken English based on 'diction' and 'dramatics', rather than on the exact phonetic quality of sounds, may prove to be more effective. Phonetic correlates of what is called 'pleasant' voice quality are also discussed.
744
INVESTIGATIONS OF THE FRICATIVE CORRELATION IN GERMAN
E. Stock
Wissenschaftsbereich Sprechwissenschaft der Sektion Germanistik und Kunstwissenschaften, Martin-Luther-Universität Halle-Wittenberg, German Democratic Republic

In many descriptions of the northern pronunciation standard of German, the consonantal correlation in plosives and fricatives is still presented as a correlation of voiced and voiceless. The aim of the present work is to show that it is more appropriate and more useful to speak not of a voicing correlation but of a tension correlation. This problem is significant above all for phonetic exercises in teaching German as a foreign language. If a learner has automatized from his mother tongue that the phonemes /b/-/p/ are distinguished in realization chiefly by means of the distinctive feature voice, he will apply the same feature without hesitation in German as well, if the phonological-phonetic description encourages him in that direction. The result is errors in phoneme realization and false assimilations; the pronunciation remains conspicuously foreign, and the peculiarity of German consonantism is missed. As early as 1963, Meinhold and Stock showed on the basis of experimental-phonetic investigations that the closure phase of the mediae /b, d, g/ is voiced only after voiced allophones, and voiceless after pauses and voiceless allophones, without these phonemes thereby being realized as tenues. The occurrence of voicing is position-dependent; the dominant and more frequent distinctive feature is fortis-lenis. An experimental-phonetic investigation carried out in 1982 shows similar relations for the realization of the fricatives /v, z, j/. The results are illustrated with tables and oscillograms. For the phonological description this justifies speaking of a consonantal tension correlation in German. Analyses of teaching conclude by demonstrating the usefulness of this description.
745
A METRIC FOR THE EVALUATION OF PROSODIC ERRORS
P. Touati
Institut de Linguistique et de Phonétique, Lund, Sweden

The aim of this communication is to present a method for measuring the prosodic errors made by Swedish speakers of French. The first difficulty in judging a speaker's pronunciation errors is to separate what is due to segmental errors from what is due to prosodic errors. Thanks to a predictive-coding (LPC) analysis-synthesis system, we have overcome this difficulty by introducing Swedish prosody into a French sentence while keeping the French segments. This manipulation is the starting point of our method.
Experiment. Swedish rhythm and intonation were introduced into the original French sentence by carrying out systematic modifications of its duration and fundamental-frequency parameters. Four stimulus sentences were thus obtained: the first a simple synthetic copy of the original, the second with Swedish rhythm, the third with Swedish intonation, and the fourth with Swedish rhythm and intonation. These stimuli were presented to five listeners, who were asked to evaluate their degree of foreign accent. The listeners then compared the stimuli with the original French sentence produced by three Swedish speakers. This comparison allowed us to evaluate the degree of foreign accent of our speakers and, above all, to determine more systematically which prosodic parameter was responsible for that accent.
Results and conclusion. For the majority of the listeners the stimulus with Swedish rhythm and intonation was judged to have the strongest foreign accent, followed by the one with Swedish intonation and finally the one with Swedish rhythm. As for the speakers, the parameter judged responsible for their accent varies; the causes of this variation will be discussed. These first results nevertheless seem to indicate the validity of this method as a metric for the evaluation of prosodic errors.
References
Touati, P. (1980) Etude comparative des variations de la fréquence fondamentale en suédois et en français. Working Papers, Department of Linguistics, Lund University 19: 60-64.
Gårding, E., Botinis, A. & Touati, P. (1982) A comparative study of Swedish, Greek and French intonation. Working Papers, Department of Linguistics, Lund University 22: 137-152.
Gårding, E. (1981) Contrastive Prosody: A model and its application. Special Lecture to AILA Congress 1981. In Studia Linguistica, Vol. 35, No 1-2.
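The manipulation at the heart of the method (French segments kept, Swedish duration and/or F0 imposed) is, at the level of an LPC analysis-synthesis system, a frame-wise parameter swap. The Python sketch below shows only that bookkeeping, under the assumption of pre-aligned frame sequences; the actual signal processing is abstracted away and all names are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Frame:
        lpc: list    # spectral envelope coefficients (carries the segments)
        f0: float    # fundamental frequency in Hz
        dur: float   # frame duration in ms

    def transplant(seg, pros, swap_f0, swap_dur):
        # Keep the segmental (LPC) coefficients of one utterance, impose
        # the F0 and/or duration of the other, frame by frame.
        assert len(seg) == len(pros)
        return [Frame(lpc=s.lpc,
                      f0=p.f0 if swap_f0 else s.f0,
                      dur=p.dur if swap_dur else s.dur)
                for s, p in zip(seg, pros)]

    french = [Frame([0.1, -0.2], 180.0, 10.0), Frame([0.3, 0.1], 165.0, 12.0)]
    swedish = [Frame([0.0, 0.0], 210.0, 14.0), Frame([0.0, 0.0], 140.0, 8.0)]

    copy          = transplant(french, swedish, swap_f0=False, swap_dur=False)
    rhythm_only   = transplant(french, swedish, swap_f0=False, swap_dur=True)
    melody_only   = transplant(french, swedish, swap_f0=True,  swap_dur=False)
    rhythm_melody = transplant(french, swedish, swap_f0=True,  swap_dur=True)

The four results correspond to the four stimulus types of the experiment.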
746
RESEARCH IN THE FOREIGN-LANGUAGE PHONETICS CLASS
Joel C. Walz
Department of Romance Languages at the University of Georgia, Athens, Georgia, USA

The foreign-language phonetics class in the United States is usually a combination of theoretical aspects based on articulation and corrective procedures designed to improve students' pronunciation. Students participating in the course are most often in their third or fourth year of undergraduate work and have had little linguistic training. Theory certainly can help students become cognizant of their own weaknesses, but it does not always integrate well into class activities. The author proposes that teachers require all students to complete a research project, which will combine phonetic theory with practical applications. Since few foreign-language teachers have elaborate equipment or the expertise to use it, topics will involve only standard tape recording. Six problems can be studied by students at an elementary level, yet their findings can prove pedagogically quite useful.
1) Phonetics students can test a beginning language student and design a corrective program for him.
2) They can test a native speaker of the language they are studying and describe variations from the orthoepic system.
3) They can design and administer a test of register.
4) In cosmopolitan areas, a test of regional variation is possible.
5) They can administer a self-test and develop hypotheses for their errors (interference, intralingual confusion).
6) A test of native speaker reaction to pronunciation errors could have immediate applications.
The research project in the foreign-language phonetics class is an effective way of uniting the theoretical and practical aspects of the course.
747
TEACHING UNSTRESSED VOWELS IN GERMAN: THE EFFECT OF DIMINISHED STRESS UPON VOWEL DIFFERENTIATION
R. Weiss
Department of Foreign Languages and Literatures at Western Washington University, Bellingham, WA. 98225, U.S.

This paper addresses itself to the phenomenon of reduction of stress and its resultant effect upon the length and quality of German vowels. Prior research has indicated that in German a maximum of 15 vowel oppositions are operative in fully stressed syllable position. The total number of oppositions is minimized due to a complex but systematic relationship of length and quality. (See Weiss, "The German Vowel: Phonetic and Phonemic Considerations," Hamburger Phonetische Beiträge, 25 (1978), 461-475.) In unstressed syllable position the system potentially increases in vowel diversity to include as many as 29 vowels (Duden). Although this maximal diversity is reflected primarily in borrowed words of non-German origin, the increase in vowel opposition is due largely to the fact that length and quality may function more independently in unstressed syllable position. Although it appears that vowel differentiation is more diverse in unstressed syllable position, it will be demonstrated that in reality quite the opposite is true: in actual practice vowel differentiation dramatically decreases in unstressed syllable position. An attempt will be made to correlate diminishing degrees of stress with the loss of certain phonetic features, such as lip-rounding, length, and extremes in quality. A priority system for the loss of features, sequential both in regard to perception and normal production, will be proposed. Additionally, other factors which play a role in unstressed syllable position, such as phonemic and morphophonemic considerations, will be taken into account. It will be demonstrated that there exists a direct and positive correlation between vowel diversity and vowel stress. A set of postulates operative in unstressed syllable position will be given with respect to different levels of stress. An attempt will be made to present in hierarchical fashion the different vowel systems operative at different levels of stress, from fully stressed to totally unstressed. In addition the implications of the above findings for foreign language teaching will be discussed. Pedagogical guidelines for a practical treatment of unstressed vowels will be given. A relatively simple and practical seven-vowel system will be proposed which not only more accurately reflects the principles actually operative in unstressed vowel production, but also more closely reflects the actual articulation most commonly associated with unstressed vowels.
748
THE VISUALISATION OF PITCH CONTOURS: SOME ASPECTS OF ITS EFFECTIVENESS IN TEACHING FOREIGN INTONATION *)
B. Weltens & K. de Bot
Institute of Phonetics / Institute of Applied Linguistics, University of Nijmegen, The Netherlands

One of the contributions from phonetic research to applied linguistics is the development of technical aids for the teaching of pronunciation. Such aids have been developed for both segmental and suprasegmental aspects of pronunciation, the latter having, until recently, received comparatively little attention (cf. e.g. Abberton & Fourcin, 1975; Hengstenberg, 1980; James, 1976, 1977; Léon & Martin, 1972). Since 1976 work has been carried out towards developing a microcomputer-controlled set of equipment for displaying pitch contours of sentences. The aim was to produce a practical set-up in which target sentences recorded on tape and ad-hoc imitations by learners/subjects could be displayed simultaneously on the upper and lower halves of a t.v. screen, and to test this set-up with different target languages and different groups of learners under different conditions. Over the past seven years a number of experiments have been carried out with several different target languages, experimental designs and set-ups of the equipment. Intonation contours of three languages have been used in consecutive experiments: Swedish (with Dutch subjects who had no previous knowledge of the target language), English (with advanced Dutch learners: 1st-year undergraduate students) and Dutch (with Turkish immigrants of different levels of proficiency in the target language). In this series of experiments we have investigated the influence of the following variables on the ability to imitate foreign intonation patterns:
- feedback mode: auditive vs. audio-visual,
- feedback delay: the time lag between the moment of producing the utterance and the moment of plotting its pitch contour on the screen,
- proficiency level of the learner in the target language: measured by means of a written editing test (Mullen, 1979),
- age of the learner: 9-12 year old children vs. adults.
The outcome of these investigations will be presented in very general terms: we will discuss the effectiveness of visual feedback compared with auditive feedback, the effect of the feedback mode on the practice behaviour of the subjects during the experimental session, and the influence of potentially interfering variables (feedback delay, proficiency level and age of the learner) on the effectiveness of visualising pitch contours in intonation learning. We will also briefly describe the latest set-up of the equipment, which proved to be highly workable for the individual learner and could form a major contribution to intonation teaching in many areas of language teaching and speech therapy.
*) This research was partly sponsored by the Research Pool of the University of Nijmegen, and by the Netherlands Organisation for the Advancement of Pure Research (ZWO).
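The display principle (target contour on the upper half of the screen, the learner's imitation on the lower half) can be sketched with standard plotting tools. In the Python sketch below the two F0 tracks are synthetic placeholders; in the real set-up they would come from a pitch detector applied to the taped target and to the learner's ad-hoc imitation.

    import numpy as np
    import matplotlib.pyplot as plt

    t = np.arange(0.0, 1.5, 0.01)  # 10 ms steps
    target = 220 + 60 * np.exp(-((t - 0.40) ** 2) / 0.05)
    imitation = 210 + 35 * np.exp(-((t - 0.55) ** 2) / 0.08)

    fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
    top.plot(t, target)
    top.set_ylabel("target F0 (Hz)")
    bottom.plot(t, imitation)
    bottom.set_ylabel("imitation F0 (Hz)")
    bottom.set_xlabel("time (s)")
    plt.show()

The feedback-delay variable of the experiments corresponds to how soon after the utterance the lower panel is drawn.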
749
CROSS-LINGUISTIC INFLUENCE IN SECOND LANGUAGE ACQUISITION: THE RHYTHMIC DIMENSION
B. J. Wenk
Department of Modern Languages, Université de Strasbourg II, Strasbourg, France

As temporal patterning constitutes a crucial feature of language performance (possibly more fundamental to comprehension and expression than segmental features in and of themselves), the acquisition of target rhythmic organisation for learners whose first language is disparate in this regard to the target language is clearly an area worthy of careful analysis. Perhaps because of the naivety of popular rhythmic typologies and the exclusion of rhythm from the purview of much phonological theory, the problem has not so far received the attention it merits. However, the descriptive framework applied to French and English rhythms in Wenk and Wioland (1982), revealing a hitherto unsuspected set of interrelationships between temporal patterning and a range of phonetic features, permits the discovery of a generalisable order of acquisition for which experimental confirmation is presented. The data are also analysed with respect to variation due to speech style and phonetic context.

Reference
Wenk, B.J. and F. Wioland (1982): "Is French really syllable-timed?", Journal of Phonetics, 10, 193-216.
750
ENGLISH INTONATION FROM A DUTCH POINT OF VIEW
N.J. Willems
Institute for Perception Research, Eindhoven, the Netherlands

When native speakers of Dutch speak (British) English, their pronunciation will generally differ from that of English native speakers. These differences give rise to the perception of a non-native 'accent'. This paper reports on an experimental-phonetic investigation which attempts to characterize and describe the intonational, or rather melodic, aspects of this non-nativeness. This will be used to design an experimentally based intonation course for Dutch learners of English. Contrary to traditional courses in English intonation, which are mainly based on 'drills', the planned course could make students aware of intonational structures of the target language by providing them with an explicit but simple description. Our approach is largely based on the research methods of the 'Dutch school' of intonation, which describes perceptually relevant detail in fundamental frequency curves in terms of discrete equivalent pitch movements, using a straight-line approximation (stylization). An extensive comparison was made between about 600 fundamental frequency curves of English utterances produced by a dozen native Dutch and English speakers. This analysis yielded six fairly systematic melodic deviations produced by Dutch native speakers. Major deviations were substantially smaller excursions, rising instead of falling pitch movements and a too low (re)starting level. In a first perception test, in which synthetic speech was used, English native speakers were asked to assess the acceptability of systematic variations in the magnitude of the excursion and the position of the pitch movement in the syllable with respect to vowel onset. The outcome of this experiment showed that English pitch contours can be adequately described with an average excursion of 12 semitones. In order to establish the perceptual relevance of the deviations found, a second perception experiment was performed in which all original fundamental frequency curves were replaced by systematically varied artificial contours by means of LPC-resynthesis. These contours were superimposed on the utterances produced by the native speakers of English.
T h i s t e c h n i q u e was used to p r e v e n t the n a t i v e E n g l i s h j u d g e s f r o m being i n f l u e n c e d by d e v i a t i o n s o t h e r t h a n t h o s e in p i t c h . A c c o r d i n g to 55 E n g l i s h n a t i v e s p e a k e r s , w h o a p p e a r e d to b e v e r y c o n s i s t e n t in their j u d g m e n t s , the deviations w e r e to a g r e a t e r or lesser e x t e n t u n a c c e p t a b l e . O n the b a s i s of the r e s u l t s of this e x p e r i m e n t the u s e f u l n e s s of s o m e s t y l i z e d m e l o d i c p r o n u n c i a t i o n p r e c e p t s was tested. R e s u l t s s h o w e d h i g h a c c e p t a b i l i t y s c o r e s , s u g g e s t i n g the p o t e n t i a l e f f e c t i v e n e s s of the p r e c e p t s . O u r r e s u l t s s u g g e s t it s h o u l d be p o s s i b l e to set up a m e l o d i c i n t o n a t i o n course for D u t c h s t u d e n t s based o n e x p e r i m e n t a l e v i d e n c e . M o r e o v e r the s u c c e s s of the s t y l i z a t i o n m e t h o d for E n g l i s h s u g g e s t s that there is g r e a t p r o m i s e in d e v e l o p i n g a n o t a t i o n a l s y s t e m of s t r a i g h t - l i n e c o n t o u r s .
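The straight-line stylization the abstract relies on can be sketched compactly. The following Python fragment is a hypothetical illustration, not the authors' algorithm: the 100 Hz semitone reference, the 1.5-semitone tolerance, and the recursive split-at-worst-point strategy are all my assumptions.

```python
import numpy as np

def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert an F0 track from Hz to semitones relative to ref_hz."""
    return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)

def stylize(times, f0_st, tol_st=1.5):
    """Recursive straight-line approximation of an F0 curve (in semitones):
    split at the sample deviating most from the chord between the endpoints,
    until every sample lies within tol_st of its segment."""
    times = np.asarray(times, dtype=float)
    f0_st = np.asarray(f0_st, dtype=float)
    segments = []

    def fit(i, j):
        chord = np.interp(times[i:j + 1], [times[i], times[j]], [f0_st[i], f0_st[j]])
        err = np.abs(f0_st[i:j + 1] - chord)
        k = int(np.argmax(err))
        if err[k] > tol_st and 0 < k < j - i:
            fit(i, i + k)
            fit(i + k, j)
        else:
            segments.append(((times[i], f0_st[i]), (times[j], f0_st[j])))

    fit(0, len(times) - 1)
    return segments
```

A contour stylized this way can then be compared across Dutch and English speakers in the same semitone units, which is the kind of comparison the 600-curve analysis presupposes.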
Section 20
Speech Pathology and Aids for the Handicapped
753 INTONATION PATTERNS IN NORMAL, APHASIC, AND AUTISTIC CHILDREN
Christiane A.M. Baltaxe, Eric Zee, James Q. Simmons
Department of Psychiatry, University of California, Los Angeles, USA
Prosody comprises the acoustic parameters of fundamental frequency, intensity, and duration, and their covariation. Prosody represents an important dimension in language development. Prosodic variables may affect aspects of language comprehension, retention, and production in the acquisition process. Prosodic patterns are mastered and stabilize prior to the segmental and syntactic patterns of language, and several investigators have proposed that they form 'frames' or 'matrices' for subsequently developing segmental and syntactic units. However, these prosodic frames have not been sufficiently characterized, and the phonetic and linguistic details of their development are presently unclear. Prosody as a potential and powerful variable in delayed or deviant language development also awaits investigation. The present study examines the frequency characteristics of the intonation patterns of three matched groups of young children (MLU 1.5-4.0): normal subjects, aphasic subjects (language delayed), and autistic subjects (language delayed/deviant). Only the autistic group also had perceptible prosodic abnormalities. The parameter of frequency was chosen for study since, developmentally, it appears to stabilize first. The present investigation focuses on simple declarative subject-verb-object utterances produced spontaneously under controlled conditions. Frequency measurements were obtained using a pitch meter and Oscillomink tracings, and the measurements were subjected to appropriate statistical analysis. Results show that all three groups can be differentiated on the basis of the intonation contours of declarative utterances, based on visual inspection of the pitch contours as well as on statistical analyses of the measurements taken. Only the normal group showed a frequency pattern comparable to normal adult speech. Descriptively, this pattern is best characterized by an initial rise and a terminal fall of the fundamental frequency contour. Although the aphasic and autistic groups generally showed the initial rise, the terminal fall was absent in most subjects, who showed level or rising pitch finally. Most characteristic of the autistic group was a saw-tooth pattern of pitch modulation. In addition, a flat pitch pattern was also seen; this was also the dominant pattern of the aphasic group. Based on the frequency measurements and statistical analyses, significant between-group differences were seen in overall frequency range, frequency differences within each syllable, and frequency shift between syllables. Differences in frequency modulation between content and function words were significant for all three groups. This appears to indicate that the normal as well as the language-deficient groups differentiated between the two semantic categories and generally adhered to the related stress pattern. Some significant differences in frequency modulation were also seen within the content word category, depending on whether subject, verb, or object positions were involved. The occurrence of primary sentence stress on the final stressable syllable, i.e. object position, was not supported by greater pitch modulation in that position. Theoretical implications of these findings and their clinical significance are discussed.
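The three between-group frequency measures named above reduce to simple functions of a syllable-segmented F0 track. A minimal sketch, assuming each syllable is available as an array of F0 samples in Hz and reading 'frequency difference' as the max-min span; both assumptions are mine, since the abstract does not define the measures formally.

```python
import numpy as np

def intonation_measures(syllable_f0):
    """Per-utterance versions of the abstract's three measures, given a
    list of per-syllable F0 sample arrays (Hz), one array per syllable."""
    syls = [np.asarray(s, dtype=float) for s in syllable_f0]
    all_f0 = np.concatenate(syls)
    within = [float(s.max() - s.min()) for s in syls]        # modulation inside each syllable
    means = [float(s.mean()) for s in syls]
    between = [abs(b - a) for a, b in zip(means, means[1:])]  # shift from syllable to syllable
    return {
        "overall_range_hz": float(all_f0.max() - all_f0.min()),
        "within_syllable_hz": within,
        "between_syllable_shift_hz": between,
    }
```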
754 AN INVESTIGATION OF SOME FEATURES OF ALARYNGEAL SPEECH
R. Beresford
Sub-department of Speech, University of Newcastle upon Tyne, England.
This is part of a larger investigation being made with clinical colleagues and is concerned with the following questions: (1) Is the oesophagus passive during oesophageal speech? (2) What use is made of the capacity of the oesophagus during alaryngeal speech? (3) Is flow-rate more important than the oesophageal volume used? (4) Is duration of phonation dependent upon the air reservoir in the oesophagus? (5) Is oesophageal pressure low in 'good' oesophageal speech? (6) What are the variables of expulsion?
755 THE INTELLIGIBILITY OF SENTENCES SPOKEN BY LARYNGECTOMEES
Gerrit Bloothooft
Faculty of Medicine, Free University, Amsterdam, The Netherlands
Present address: Institute of Phonetics, Utrecht University, Utrecht, The Netherlands
Laryngectomees are severely handicapped in their speech communication because of the relatively low quality of their second voice. We investigated their handicap in terms of reduced speech intelligibility. For 18 laryngectomees, 9 of whom had developed oesophageal speech and 9 speech from a neoglottis obtained by surgical reconstruction after Staffieri, the intelligibility of short, phonetically balanced, everyday sentences was determined. Intelligibility was measured in terms of the speech-reception threshold (SRT): the sound level at which 50% of the sentences were reproduced correctly by listeners with normal hearing. SRT was determined both in quiet and in interfering noise of 60 dB(A) with a spectrum equal to the long-term average spectrum of normal speech. With the SRT values for normally spoken sentences as a reference (Plomp and Mimpen, 1979), we determined the speech-intelligibility loss (SIL) for all 18 alaryngeal speakers. The SIL in noise was not significantly different from the SIL in quiet, indicating that the intelligibility loss for alaryngeal speech was largely due to distortion. The SIL varied interindividually between 2 dB and 20 dB, with an average of 10 dB. No significant differences in SIL values between oesophageal and Staffieri neoglottis speakers were found. A model will be presented in which the limiting conditions in speech communication for laryngectomees can be demonstrated as a function of the ambient noise level. In this model not only the SIL but also the lower average vocal intensity of alaryngeal speech (on average 12 dB below normal speech under the same conditions) is included. It will be shown that many laryngectomees are already severely handicapped in their speech communication at an ambient noise level as low as that typically present in living-rooms (40 dB(A)).
The cooperation of patients and the staff of the Logopedic and Phoniatric Department of the Free University Hospital is kindly acknowledged.
Reference
Plomp, R. and Mimpen, A.M. (1979). Speech-reception threshold for sentences as a function of age and noise level. J. Acoust. Soc. Am. 66, 1333-1342.
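The quantities in the abstract combine in a straightforward way. A hypothetical numerical sketch, assuming SIL is the plain SRT difference in dB and adopting a 65 dB(A) conversational speech level purely for illustration; only the 12 dB offset comes from the abstract.

```python
def speech_intelligibility_loss(srt_patient_db, srt_reference_db):
    """SIL: the patient's speech-reception threshold relative to the
    Plomp and Mimpen (1979) reference for normally spoken sentences."""
    return srt_patient_db - srt_reference_db

def effective_snr_db(ambient_noise_dba,
                     normal_speech_level_dba=65.0,  # assumed, not from the abstract
                     alaryngeal_offset_db=12.0):    # average deficit reported
    """Signal-to-noise ratio reaching the listener when alaryngeal speech
    arrives on average 12 dB weaker than normal speech."""
    return (normal_speech_level_dba - alaryngeal_offset_db) - ambient_noise_dba

# At living-room noise: effective_snr_db(40.0) -> 13 dB, before the
# individual SIL penalty of 2-20 dB is taken into account.
```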
756 SPECIAL PROBLEMS IN DIAGNOSIS: PROSODY AND RIGHT HEMISPHERE DAMAGE
Pamela Bourgeois, M.A., Amy Veroff, Ph.D., and Bill LaFranchi, M.A.
Casa Colina Hospital for Rehabilitative Medicine, Pomona, California, USA
Control of the syntactic, semantic, and phonological aspects of language has long been associated with the left or "dominant" hemisphere. In contrast, recent research has correlated dysfunctions in the prosodic components of language with right hemisphere lesions (Ross, 1981). Although advances have been made in the scientific study of the minor hemisphere's role in speech and language function, especially with regard to dysprosody, this information has not been integrated into the rehabilitation of patients with disorders arising from minor hemisphere pathology. Dysprosody has been operationally defined as "the inability to impart affective tone, introduce subtle shades of meaning, and vary emphasis of spoken language" (Weintraub, Mesulam, and Kramer, 1981). Specifically, this disturbance reflects the individual's inability to express stress and melodic contour in communication. The absence or reduction of prosodic components in verbal language produces a demonstrable communication deficit in which the pragmatic or socio-linguistic aspects of communication are affected. Our research on dysprosody offers techniques which speech/language pathologists and psychologists can use to differentiate disturbances of the affective components of speech, as isolated symptoms, from real disturbances of affect such as depression. A complete neuropsychological evaluation is also performed, which provides valuable information on the patient's overall cognitive status. This information is currently being integrated into the rehabilitation of patients with disorders arising from minor hemisphere pathology. The purpose of this presentation is to offer a clinical approach to disorders of prosody. It will discuss 1) the effects of minor hemisphere lesions upon speech and language, 2) the need for communicative rehabilitation with patients presenting prosodic impairments, and 3) a systematic diagnostic and treatment protocol.
757 SOME SPEECH CHARACTERISTICS OF PARKINSONIAN DYSARTHRIA: ELECTROMYOGRAPHIC STUDY OF FOUR LABIAL MUSCLES AND ACOUSTIC DATA
M. Gentil+, J. Pellat+ & A. Vila++
Centre Hospitalier Universitaire de Grenoble, Grenoble, France
Early studies provided acoustic information about parkinsonian speech disorders (Alajouanine T., Sabouraud O. & Scherrer J., Contribution à l'étude oscillographique des troubles de la parole, Larynx et Phonation, pp. 145-158; Gremy F., Thèse de Médecine, Paris, 1957). More recent investigations used electromyographic techniques to determine neuromuscular dysfunctions in the orofacial system of parkinsonian patients (Hunker C.J., Abbs J.H. & Barlow J.M., The relationship between parkinsonian rigidity and hypokinesia in the orofacial system, Neurology 32, 1982). However, in most experiments only a few muscles were monitored. It seems necessary, given the phenomenon of motor equivalence, to monitor many muscles in order to obtain a valid perspective on the control process (Gentil M., Gracco V.L. & Abbs J.H., 11e ICA, Paris, 1983). The purpose of the present study was to analyze the activity of 4 labial muscles of parkinsonian subjects performing lower lip movements, and to see whether our observations were consistent with findings concerning limb muscles, in spite of the different physiopathological mechanisms. Furthermore, oscillographic recordings specified the characteristics of the subjects' voices: low and uniform intensity, poor timbre, irregular speech rate and abnormalities of pitch. The parkinsonian dysarthrics for this study were 2 female and 1 male subjects, aged 40, 74 and 70 years respectively. All were judged to manifest rigidity and hypokinesia in the limbs and reduced facial mimicry. No labial tremors were noticed. These patients were under L-Dopa medication. Two normal female subjects were investigated in parallel as controls. Our observations concerned the lower lip because of 1) its contribution to speech production, and 2) its usual rigidity in comparison with the upper lip in parkinsonism. EMG activity was recorded with needle electrodes from orbicularis oris inferior (OOI), depressor labii inferior (DLI), mentalis (MTL) and buccinator (BUC). Acoustic signals were simultaneously recorded. The subjects repeated 3 sentences 3 times each (cf. speech rate measurement) and a list of 27 words of CV or CVCV type (V = a, i, u). These were selected because of the particular activity of the recorded muscles during their production. Coarticulatory effects were especially studied in monosyllabic words.
The results of our analyses indicated well defined EMG abnormalities for all patients. We observed 1) impairment of the functional organization of the antagonistic muscles (lack of reciprocal inhibition); 2) the existence of a resting activity between productions; 3) the presence of a sustained hypertonic background activity during the productions; 4) differences concerning coarticulatory effects between normal and parkinsonian subjects. These results will be discussed in terms of their possible analogy with the general symptoms observed in the limbs.
+ Service de Neurologie Dejerine
++ Service d'Electromyographie
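Findings 2) and 3) amount to comparing EMG amplitude inside and outside the production intervals. A minimal sketch of such a comparison; the RMS measure, the window representation, and the names are my assumptions, since the abstract does not describe the authors' quantification.

```python
import numpy as np

def emg_rms(emg, fs, windows):
    """RMS amplitude of an EMG trace within each (start_s, end_s) window."""
    emg = np.asarray(emg, dtype=float)
    return [float(np.sqrt(np.mean(emg[int(t0 * fs):int(t1 * fs)] ** 2)))
            for t0, t1 in windows]

# Resting activity between productions vs. hypertonic background during them:
# rest = emg_rms(ooi_trace, fs, rest_windows)
# prod = emg_rms(ooi_trace, fs, production_windows)
```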
758 PHONO-ARTICULATORY STEREOTYPES IN DEAF CHILDREN
L. Handzel
Independent Laboratory of Phoniatrics, Medical Academy, Wroclaw, Poland
Phono-articulatory phenomena were analysed in deaf children aged between 7 and 14 years, using a visible-speech sonograph, an apparatus for the registration of tone pitch, and an oscillograph to register acoustic energy while a short sentence was uttered. The studies made it possible to distinguish a number of phono-articulatory stereotypes and their variants. Observations of phono-articulatory events in the deaf can be regarded as a contribution to diagnostic tools as regards the time of onset and degree of the hearing impairment, which in turn opens the possibility of developing appropriate rehabilitation methods.
759 QUANTITATIVE STUDY OF ARTICULATION DISORDERS USING INSTRUMENTAL PHONETIC TECHNIQUES
W.J. Hardcastle
Department of Linguistic Science, University of Reading, Reading, U.K.
In this study* a detailed analysis of the speech of five articulatory dyspraxic and five normal children, ranging in age from 8 to 14, was carried out using the instrumental techniques of electropalatography and pneumotachography. Electropalatography provided data on the dynamics of tongue contacts with the palate, and pneumotachography was used to measure the air-flow characteristics of obstruent sounds. The temporal relationships between lingual patterns and air-flow and acoustic characteristics were determined by synchronizing the recordings from each instrument. Speech measures included:
(1) place of lingual contact during obstruent sounds;
(2) timing of approach and release phases of obstruents;
(3) voicing of obstruents (including V.O.T.);
(4) articulatory variability;
(5) timing of component elements in consonant clusters.
The speech of the disordered group was found to differ from that of the normals in all five areas. Specific abnormal articulatory patterns found included (a) simultaneous contact at the alveolar and velar regions of the palate during alveolar consonants, (b) abnormal temporal transitions between elements in clusters such as [st], [sk], (c) abnormal tongue grooving for fricatives, and (d) lack of normal V.O.T. distinctions. The advantages of this quantitative approach over traditional assessment techniques relying solely on impressionistic auditory judgments are discussed.
* This work is supported by a research grant from the British Medical Research Council.
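Abnormal pattern (a), simultaneous alveolar and velar contact, can be read directly off electropalatographic contact frames. A hypothetical sketch, assuming a common 8x8 binary electrode layout with the front rows alveolar and the back rows velar; the row assignments, the 50% contact threshold, and all names are mine.

```python
import numpy as np

ALVEOLAR_ROWS = slice(0, 2)  # assumed: two front electrode rows
VELAR_ROWS = slice(6, 8)     # assumed: two back electrode rows

def double_articulation_frames(frames, threshold=0.5):
    """Indices of EPG frames (8x8 binary contact matrices) showing
    simultaneous alveolar and velar contact during a consonant."""
    flagged = []
    for i, frame in enumerate(frames):
        frame = np.asarray(frame, dtype=float)
        if (frame[ALVEOLAR_ROWS].mean() >= threshold
                and frame[VELAR_ROWS].mean() >= threshold):
            flagged.append(i)
    return flagged
```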
760 THE IMPLICATIONS OF STUTTERING FOR SPEECH PRODUCTION
J.M. Harrington
Department of Linguistics, University of Cambridge, Cambridge, England
A recent pilot study has suggested that stuttering on the post-palatal [c] of [ciːn] (keen) manifests itself as a repetition of the mid-velar [kə]. Stuttering on the affricate [tʃ] of [tʃuːz] (choose) manifests itself as a repetition of the entire affricate [tʃ], whereas stuttering on the onset of [tɹæntə] (tranter) manifests itself as either a repetition of the entire [tɹ] cluster or simply the alveolar stop plus schwa vowel [tə]. The data is then interpreted in terms of a speech production model and auditory, proprioceptive and internal feedback channels. It is argued that stuttering involves a breakdown at the level of internal feedback which results in the inability to integrate the onset and rhyme of a syllabic cluster.
761 ACOUSTICAL MEASUREMENT OF VOICE QUALITY IN DYSPHONIA AFTER TRAUMATIC MIDBRAIN DAMAGE
E. Hartmann, D. v. Cramon
Neuropsychological Department, Max-Planck-Institute for Psychiatry, Munich, FRG
In a former study the characteristics of dysphonia after traumatic mutism and the patterns in the recovery of normal laryngeal functions had been established. Traumatic dysphonia is initially characterized by a breathy voice quality; with decreasing breathiness the voice quality becomes in particular tense and mildly rough. This had been described in terms of auditory judgement. In order to provide quantitative and objective measures for this qualitative description, several acoustical parameters were developed. About twenty male and female patients suffering from traumatic midbrain syndrome were examined. The isolated cardinal vowels, repeated by the patients in a phonatory test, were recorded, digitized and automatically analyzed by an F0-analysis and an FFT spectral analysis routine on a PDP 11/40 computer. Subsequently the following parameters were calculated: 'mean fundamental frequency', 'fundamental period perturbation', 'spectral energy above 5 kHz' and 'variance of spectral energy'. Additionally, with the aid of a segmentation routine, the 'time lag of exhaled air' preceding the vocal onset was measured. These parameters showed significant differences between patients with different pathological voice qualities and a control group of speakers without voice disorders. Besides this classification, the different stages in the process of recovery of normal phonation could be described quantitatively. The results encourage the use of acoustical analysis as a tool in the clinical examination and treatment of voice disorders.
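Two of the four parameters have conventional definitions that fit in a few lines. A sketch, assuming 'fundamental period perturbation' is the mean absolute difference of consecutive glottal periods normalized by the mean period, and treating the 5 kHz measure as a band-energy ratio; the abstract gives no formulas, so both readings are stand-ins.

```python
import numpy as np

def period_perturbation(periods_s):
    """Mean absolute difference of consecutive glottal periods divided by
    the mean period (one common jitter definition)."""
    p = np.asarray(periods_s, dtype=float)
    return float(np.mean(np.abs(np.diff(p))) / np.mean(p))

def high_band_energy_ratio(signal, fs, cutoff_hz=5000.0):
    """Fraction of the spectral energy of a vowel segment above cutoff_hz."""
    x = np.asarray(signal, dtype=float)
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    return float(spectrum[freqs >= cutoff_hz].sum() / spectrum.sum())
```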
762 A CONTRIBUTION TO THE PHONOLOGIC PATHOLOGY OF SPEECH STRUCTURE IN CHILDREN WITH IMPAIRED HEARING
A. Jarosz
Independent Laboratory of Phoniatrics, Medical Academy, Wroclaw, Poland
Examinations were made of 12 pupils (4 girls, 8 boys) from the 5th form of the school for hearing-deficient children in Wroclaw. They were 12 to 13 years old, except 2 aged 14 and 15 years. Phono-articulatory, audiometric and neurologic aspects were taken into consideration. The utterances consisted of naming objects, events etc., as well as telling picture stories from everyday life and experience. Vowel phoneme patterns were analysed. Pathologic features of the phonemes are presented and interpreted in the light of linguistic rules, environmental factors and others.
763 VOCAL PROFILES OF ADULT DOWN'S SYNDROME SPEAKERS
J. Laver, J. Mackenzie, S. Wirz and S. Hiller
Phonetics Laboratory, Department of Linguistics, University of Edinburgh, Scotland.
A descriptive perceptual technique has been developed, for use in speech pathology clinics, for characterizing a patient's voice in terms of long-term characteristics of supralaryngeal and laryngeal vocal quality, of prosodic features of pitch and loudness, and of temporal organization features of continuity and rate. The product of the technique, expressed in 40 scalar parameters, is the speaker's 'Vocal Profile'. As part of a large-scale project investigating the vocal profiles of eight different speech disorders (1), 26 male and female adult Down's Syndrome speakers and a sex-matched normal control group were tape-recorded. Three trained judges each independently constructed a vocal profile for each subject, and a representative consensus profile was established for each subject under strict criteria of agreement. Comparison of the vocal profiles of subjects in the two groups shows that the profiles of the Down's Syndrome group differ significantly from those of the control group on a majority of the parameters, and that the detailed differences are plausibly related to organic differences of vocal anatomy between the two groups.
(1)
This research was supported by a project grant from the Medical Research Council.
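The consensus step can be made concrete. A hypothetical sketch of one plausible agreement rule, assuming the 40 parameters are numeric scale values; the abstract does not state the authors' 'strict criteria of agreement', so the median-plus-spread rule below is mine.

```python
import numpy as np

def consensus_profile(judge_profiles, max_spread=1.0):
    """Median of the judges' 40 scalar ratings per parameter, kept only
    where all judges agree within max_spread scale points; NaN marks
    parameters without sufficient agreement."""
    ratings = np.asarray(judge_profiles, dtype=float)  # shape (n_judges, 40)
    spread = ratings.max(axis=0) - ratings.min(axis=0)
    return np.where(spread <= max_spread, np.median(ratings, axis=0), np.nan)
```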
764 PSYCHOTIC LANGUAGE AND COMMUNICATION
V. Rainov
Sofia, Bulgaria
On the basis of our previous investigations into the specific character of psychotic language, and in particular the peculiarities of sound substitution in speech production (Zaimov, Rainov, 1971, 1972), an attempt is made to determine the role and the degree of development of communicative ability in the mentally ill. Patients with schizophrenia, cyclophrenia and amentive symptomatology were included in the investigation. Analysed were: 1) the linguo-statistical data on language disintegration, and 2) the degree of impairment of communicative ability. In agreement with Watzlawick, Beavin and Jackson (1967) we take the position that "one cannot not communicate", and that the global active or passive strategy which characterizes the language of the mentally ill already constitutes a form of communication.
765 A NEW TYPE OF ACOUSTIC FEATURES CHARACTERIZING LARYNGEAL PATHOLOGY AT THE ACOUSTIC LEVEL
Jean Schoentgen, Paul Jospa
Institut de Phonétique, Université Libre de Bruxelles, Bruxelles, Belgium
Features characterizing vocal jitter and vocal shimmer have by now become fairly standard among investigators interested in the quantification of voice quality or in the detection of laryngeal pathology at the acoustic level. In the meantime, some authors have reported several cases of laryngeal pathologies (e.g. light laryngitis, hypotonia of the vocal folds) which are characterized not by higher perturbation, but by an abnormally low excitation of the vocal tract at the instant of glottal closure. We have proposed a family of algorithms which allow for the estimation of the evolution of the local damping constant inside one glottal period. These algorithms have been described elsewhere. In this paper we present the results of an evaluation of the performance, in discriminating between normal and dysphonic speakers, of a set of acoustic features making explicit use of the local damping estimate provided by this type of algorithm. The performance of the features in discriminating between normal and dysphonic subjects was evaluated by a clustering analysis. Open and closed tests were performed. We also computed the more classic jitter and shimmer features in order to allow for comparisons and to evaluate the combined detection power of the jitter/shimmer and damping features. We extracted our features from stable portions of the sustained vowel /a/, uttered by 35 dysphonic speakers and 39 normal speakers respectively. Several authors have reported that in the presence of laryngeal pathology continuous speech is, at the acoustic level, much more likely to exhibit abnormalities than sustained vowels. A careful study of the literature convinced us that this is only true with reference to the narrow-band equipment (e.g. pitch analyzers) used by these investigators. For wide-band equipment (sampling frequencies up to 10 kHz) this question is still unsettled. These matters will also be discussed in our paper.
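The authors' damping algorithms are described elsewhere; as a rough stand-in, a damping constant for one glottal cycle can be estimated from the decay of the rectified waveform after the excitation instant. Everything below (the rectified 'envelope', the log-linear fit, the names) is an illustrative assumption, not the published method.

```python
import numpy as np

def local_damping(cycle, fs):
    """Crude damping estimate for one glottal cycle: negated slope of the
    log of the rectified waveform after the main excitation peak."""
    x = np.abs(np.asarray(cycle, dtype=float))
    k = int(np.argmax(x))                 # instant of strongest excitation
    env = np.maximum(x[k:], 1e-9)         # rectified tail, floored for log
    if env.size < 2:
        raise ValueError("excitation peak too close to cycle end")
    t = np.arange(env.size) / fs
    slope = np.polyfit(t, np.log(env), 1)[0]
    return -slope                         # larger value = faster decay
```

An abnormally low excitation at glottal closure would show up here as a weak peak followed by a shallow decay, which is the kind of case the abstract says jitter and shimmer miss.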
766 A TACTUAL "HEARING AID" FOR THE DEAF
Karl-Erik Spens
The Swedish Institute for the Handicapped, and the Department of Speech Communication and Music Acoustics, Royal Institute of Technology, Stockholm, Sweden
In order to make a wearable aid for the deaf which transforms acoustic information into tactile stimuli, several problems must be solved before the aid actually gives a positive net benefit to the user. Some of the basic problems are power consumption, size and weight, feedback risks, dynamic range and intensity resolution. For high-information-rate aids with several channels, power consumption and size become even more important and difficult to solve. A one-channel aid has been developed and will be demonstrated. It has the following features: 1. battery life of 50 hrs; 2. size and weight similar to a body-worn hearing aid; 3. no feedback problems; 4. signal processing which gives a good intensity resolution and a large dynamic input range in order to fit the intensity characteristics of the skin. The aid conveys information about speech rhythm, which is a support for lipreaders. A test with unknown sentences and 14 normal-hearing subjects indicates an average increase in lipreading performance, shown in the figure below. The aid is now used on an everyday basis by five postlingually deaf subjects. All subjects report that the aid facilitates lipreading and that the ability to monitor the acoustic environment adds to their confidence. A condensed collection of subjective evaluation data from the long-term use of the aid and a quantitative evaluation with lipreading tests will be presented at the conference.
[Figure: percent correct lipreading with and without the tactile aid; each point = 280 stimuli.]
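Feature 4 of the aid, a wide acoustic input range mapped onto the skin's limited intensity resolution, is essentially envelope compression. A minimal one-channel sketch, assuming a full-scale-normalized input, a 10 ms frame, and a 60 dB input window; none of these values come from the abstract, which does not disclose the aid's actual processing.

```python
import numpy as np

def envelope_to_vibrator(signal, fs, frame_ms=10.0,
                         floor_db=-60.0, ceil_db=0.0):
    """Frame-by-frame RMS level of the input, compressed logarithmically
    from a ~60 dB range onto a 0..1 vibrator drive, so that speech rhythm
    survives large differences in absolute level."""
    x = np.asarray(signal, dtype=float)
    hop = max(1, int(fs * frame_ms / 1000.0))
    n = x.size // hop
    drive = np.empty(n)
    for i in range(n):
        rms = np.sqrt(np.mean(x[i * hop:(i + 1) * hop] ** 2)) + 1e-12
        level_db = 20.0 * np.log10(rms)
        drive[i] = np.clip((level_db - floor_db) / (ceil_db - floor_db), 0.0, 1.0)
    return drive
```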