Speech Timing: Implications for Theories of Phonology, Phonetics, and Speech Motor Control
ISBN 0198795424, 9780198795421

This book explores the nature of cognitive representations and processes in speech motor control, based primarily on evidence …


English · 400 pages · 2020



OXFORD STUDIES IN PHONOLOGY AND PHONETICS

General editors: Andrew Nevins, University College London; Keren Rice, University of Toronto

Advisory editors: Stuart Davis, Indiana University; Heather Goad, McGill University; Carlos Gussenhoven, Radboud University; Haruo Kubozono, National Institute for Japanese Language and Linguistics; Sun-Ah Jun, University of California, Los Angeles; Maria-Rosa Lloret, Universitat de Barcelona; Douglas Pulleyblank, University of British Columbia; Rachid Ridouane, Laboratoire de Phonétique et Phonologie, Paris; Rachel Walker, University of Southern California

1 Morphological Length and Prosodically Defective Morphemes, Eva Zimmermann
2 The Phonetics and Phonology of Geminate Consonants, edited by Haruo Kubozono
3 Prosodic Weight: Categories and Continua, Kevin M. Ryan
4 Phonological Templates in Development, Marilyn May Vihman
5 Speech Timing: Implications for Theories of Phonology, Phonetics, and Speech Motor Control, Alice Turk and Stefanie Shattuck-Hufnagel

In preparation:
Phonological Specification and Interface Interpretation, edited by Bert Botma and Marc van Oostendorp
Doing Computational Phonology, edited by Jeffrey Heinz
Intonation in Spoken Arabic Dialects, Sam Hellmuth
Synchronic and Diachronic Approaches to Tonal Accent, edited by Pavel Iosad and Björn Köhnlein
The Structure of Nasal-Stop Inventories, Eduardo Piñeros


Speech Timing
Implications for Theories of Phonology, Phonetics, and Speech Motor Control

ALICE TURK AND STEFANIE SHATTUCK-HUFNAGEL


Great Clarendon Street, Oxford, OX2 6DP, United Kingdom

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries

© Alice Turk and Stefanie Shattuck-Hufnagel 2020

The moral rights of the authors have been asserted

First Edition published in 2020
Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above

You must not circulate this work in any other form and you must impose this same condition on any acquirer

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America

British Library Cataloguing in Publication Data
Data available

Library of Congress Control Number: 2019945699

ISBN 978–0–19–879542–1

Printed and bound in Great Britain by Clays Ltd, Elcograf S.p.A.

Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.


Contents

Series preface  ix
Acknowledgments  xi
List of figures and tables  xiii
List of abbreviations  xv

1. Introduction  1

2. Articulatory Phonology/Task Dynamics  8
   2.1 Introduction  8
   2.2 The dual function of gestures within Articulatory Phonology: contrast and constriction formation  10
   2.3 Using mass–spring systems to model gestural movement in TD  11
   2.4 Gestural control of individual articulators, and gestural activation  14
   2.5 Timing Control in AP/TD  20
   2.6 Key features of AP/TD  38
   2.7 Advantages of the AP/TD framework  43
   2.8 Conclusion  47

3. Evidence motivating consideration of an alternative approach  49
   3.1 AP/TD default specifications require extensive modifications  50
   3.2 Relationships among distance, accuracy, and duration are not fully explained in AP/TD  53
   3.3 Distinct synchronous tasks cause spatial interference  57
   3.4 Issues not currently dealt with  62
   3.5 Summary  62

4. Phonology-extrinsic timing: Support for an alternative approach I  64
   4.1 Introduction  64
   4.2 A challenge to the use of mass–spring oscillators in the implementation of timing effects  67
   4.3 Evidence for the mental representation of surface durations  75
   4.4 Further evidence for general-purpose timekeeping mechanisms to specify durations and track time  90
   4.5 Conclusion  100

5. Coordination: Support for an alternative approach II  102
   5.1 Introduction  102
   5.2 Evidence consistent with AP/TD inter-planning-oscillator coupling, and alternative explanations  105
   5.3 Evidence that requires the consideration of non-oscillatory approaches  112
   5.4 Evidence that timing relationships in movement coordination are not always based on movement onsets  119
   5.5 Possible mechanisms for endpoint-based timing and coordination  127
   5.6 Planning inter-movement coordination and movement-onset timing  129
   5.7 Summary of findings relating to movement coordination  130

6. The prosodic governance of surface phonetic variation: Support for an alternative approach III  132
   6.1 Introduction  132
   6.2 Evidence relating to Pi/MuT mechanisms for boundary- and prominence-related lengthening  133
   6.3 Evidence relating to the coupled oscillator hierarchy mechanism for poly-subconstituent shortening  135
   6.4 Evidence which challenges the use of oscillators in controlling overall speech rate  143
   6.5 Summary  144

7. Evidence for an alternative approach to speech production, with three model components  146
   7.1 Existing three-component models and some gaps they leave  150
   7.2 Why the timing evidence presented earlier motivates the three components of the XT/3C approach, despite the gaps  158
   7.3 Evidence for the separation between the Phonological and Phonetic Planning Components: Abstract symbols in Phonological Planning  162
   7.4 The translation issue  171
   7.5 Motivating the separation between Phonetic Planning and Motor-Sensory Implementation  178
   7.6 Key components of the proposed model sketch  188

8. Optimization  190
   8.1 General overview  194
   8.2 Key features  195
   8.3 What are the costs of movement?  199
   8.4 Predictions of Stochastic Optimal Feedback Control Theory  214
   8.5 Challenges for Optimal Control Theory approaches  218
   8.6 Optimization principles in theories of phonology and phonetics  220
   8.7 Conclusion  236

9. How do timing mechanisms work?  238
   9.1 General-purpose timekeeping mechanisms  239
   9.2 Lee's General Tau theory  256
   9.3 Summary  262

10. A sketch of a Phonology-Extrinsic-Timing-Based Three-Component model of speech production  264
   10.1 Phonological Planning  268
   10.2 Phonetic Planning  298
   10.3 Motor-Sensory Implementation  310
   10.4 Summary and discussion  312

11. Summary and conclusion  313

References  321
Index  363


Series preface

Oxford Studies in Phonology and Phonetics provides a platform for original research on sound structure in natural language within contemporary phonological theory and related areas of inquiry such as phonetic theory, morphological theory, the architecture of the grammar, and cognitive science. Contributors are encouraged to present their work in the context of contemporary theoretical issues in a manner accessible to a range of people, including phonologists, phoneticians, morphologists, psycholinguists, and cognitive scientists. Manuscripts should include a wealth of empirical examples, where relevant, and make full use of the possibilities for digital media that can be leveraged on a companion website with access to materials such as sound files, videos, extended databases, and software.

This is a companion series to Oxford Surveys in Phonology and Phonetics, which provides critical overviews of the major approaches to research topics of current interest, a discussion of their relative value, and an assessment of what degree of consensus exists about any one of them. The Studies series will equally seek to combine empirical phenomena with theoretical frameworks, but its authors will propose an original line of argumentation, often as the inception or culmination of an ongoing original research program.

Based on a theory involving planning the time between acoustic landmarks, this book provides a model of speech production utilizing symbolic (non-gestural, without specific spatiotemporal content) phonological representations and phonology-extrinsic, non-speech-specific, general-purpose timing mechanisms that have sufficient flexibility to account for empirically documented timing behavior. The model takes into account a variety of sources of evidence, including listener-related factors, and presents an elegant model of speech production that involves separate planning components for phonology and phonetics, an Optimal Control Theory approach, and movement coordination based on movement endpoints and continuous tau coupling, rather than on movement onsets. This volume is a ground-breaking culmination of many years of research by the authors, and offers up much serious discussion for consideration, alongside pronounced challenges to competing theories of speech timing and task dynamics.

Andrew Nevins
Keren Rice


Acknowledgments

It is a deeply felt pleasure to acknowledge the many people who have helped to shape our thinking over the years of writing this book. We especially thank Elliot Saltzman and Dave Lee for their indefatigable patience in explaining the fine details of AP/TD and General Tau theory, sometimes multiple times. Louis Goldstein took time to answer our questions, and Ioana Chitoran, Fred Cummins, Jelena Krivokapić, Bob Ladd, and Juraj Šimko have read through parts of the book in their intermediate stages, and provided remarkably useful feedback. The comments of three anonymous reviewers and of Khalil Iskarous have also improved the book; one reviewer in particular provided extensive pages of extremely valuable comments (we think we know who you are, and we thank you!). Katrina Harris' work on the bibliography and Ada Ren-Mitchell's work on the index saved us many weeks of time; Katherine Demuth provided useful feedback.

We thank Ken Stevens, who first sensitized us to the role of individual acoustic cues to distinctive features in speech processing, and who established an atmosphere of inquiry in the MIT Speech Communication Group that fostered critical and creative thinking, and inspired this collaborative effort. We are grateful to the Arts and Humanities Research Council, who funded the initial stages of research for the book (grant number AH/1002758/1 to the first author), and to the US National Science Foundation (grant numbers BCS 1023596, 1651190, and 1827598 to the second author).

Our friends and families also played a critical role in the completion of this volume, not least by asking us at regular intervals "Is Chapter 7 done yet?" We will not soon forget their support of the project, their understanding of research visits, and an inordinate number of strategically timed Skype calls (made necessary by our collaboration across 3000 miles of Atlantic Ocean). Finally, we would like to acknowledge the intellectual generosity of the developers of the AP/TD approach, whose work has been a model to us of how science should be done. All errors and omissions are, of course, our responsibility alone.


List of figures and tables

List of figures
2.1 Gestural scores for the words mad and ban, illustrating gestural activation intervals and their relative timing  15
2.2 Time functions of vocal tract variables, as measured using X-ray microbeam data, for the phrase pea pots, showing the in-phase (synchronous within twenty-five ms) coordination of the lip gesture for the /p/ in pots and the /a/ gesture for the vowel in pots  17
2.3 The coupling graph for spot (top) in which the tongue-tip (fricative) gesture and the lip-closure gesture are coupled (in-phase) to the tongue-body (vowel) gesture, while they are also coupled to each other in the antiphase mode  26
2.4 Coupling graphs for syllable sequences  27
2.5 Steady-state patterns of (slow) foot and (fast) syllable oscillators, with asymmetrical (foot-dominant) coupling between foot and syllable oscillators  31
2.6 A schematic gestural score for two gestures spanning a phrasal boundary instantiated via a π-gesture  32
3.1 Schematic diagrams of the templates for the four experimental conditions in Franz et al. (2001)  59
4.1 Start and end times (in milliseconds) of keypress movements for two repetitions of the same . . . an epic . . . sequence  69
4.2 Schematic illustration of data extraction  71
4.3 Scatter plots of protrusion duration interval versus consonant duration (left column); onset interval versus consonant duration (middle column), and offset interval versus consonant duration (right column) for lip-protrusion movements from four participants' /i_u/ sequences (shown in each of four rows)  72
4.4 Mean test vowel durations (in ms) in the baseline and three experimental conditions  82
4.5 Means and standard deviations for vowel duration as a function of lexical/morphological quantity—short stem in short grade (SS–SG), short stem in long grade (SS–LG), long stem in short grade (LS–SG), long stem in long grade (LS–LG)—and sentence context—(Medial, Final)  83
5.1 The distribution of keypress start and end times measured relative to the previous keypress  124
7.1 The utterance excerpt . . . caught her . . . produced by a Scottish female speaker from the Doubletalk corpus (Scobbie et al. 2013)  168
7.2 The utterance excerpt . . . caught her . . . produced by the same Scottish female speaker that produced . . . caught her . . . shown in Figure 7.1  169
7.3 A schematic diagram of the proposed XT/3C-v1 model  188
8.1 Prosodic structure as the interface between language and speech, illustrating some of the factors that influence Phonetic Planning  191
9.1 TauG guidance of the tongue when saying 'dad'  260
10.1 An example prosodic structure for Mary's cousin George baked the cake  271
10.2 A grid-like representation of prominence structure for one possible prosodification of Mary's cousin George  273
10.3 The complementary relationship between predictability (language redundancy) and acoustic salience yields smooth-signal redundancy (equal recognition likelihood throughout an utterance)  276
10.4 Factors that shape surface phonetics and their relationship to predictability, acoustic salience, and recognition likelihood  277
10.5 The utterance excerpt . . . elf in the mirror . . . spoken by a Southern British English speaker from the Doubletalk corpus (Scobbie et al. 2013)  309

List of tables
2.1 AP/TD tract variables and the model articulator variables that they govern  14
9.1 TauG guidance of the jaw, lips, and tongue in monologue recordings from the ESPF Doubletalk corpus  261


List of abbreviations

3C      three-component
AP/TD   Articulatory Phonology/Task Dynamics
C       consonant
DIVA    Directions into Velocities of Articulators
GLO     glottal aperture
GO      Gradient Order
GoDIVA  Gradient Order DIVA
LA      lip aperture
LP      lip protrusion
LTH     lower tooth height
MRI     magnetic resonance imaging
OCT     Optimal Control Theory
OFCT    Optimal Feedback Control Theory
OT      Optimality Theory
SOFCT   Stochastic Optimal Feedback Control Theory
TADA    Task Dynamic Application
TD      Task Dynamics
TDCD    tongue-dorsum constriction degree
TDCL    tongue-dorsum constriction location
TTCD    tongue-tip constriction degree
TTCL    tongue-tip constriction location
TTCO    tongue-tip constriction orientation
V       vowel
VEL     velic aperture
VITE    Vector Integration To Endpoint
XT/3C   phonology-extrinsic-timing-based three-component approach


1 Introduction

This is a book about speech timing, and about the implications of speech timing patterns for the architecture of the speech production planning system. It uses evidence from motor timing variation to address the question of how words come to have such different acoustic shapes in different contexts.

The book came about for two main reasons. First, it was written in reaction to a debate in the literature about the nature of phonological representations, which, together with a set of mechanisms that operate in relation to these representations, account for the range of systematic surface variation observed for phonologically equivalent forms. Phonological representations are proposed to be spatiotemporal by some (Fowler, Rubin, Remez and Turvey 1980), and symbolic (atemporal) by others (Henke 1966; Keating 1990; Fujimura 1992 et seq.; Guenther 1995 et seq.; Levelt et al. 1999; inter alia). The model of speech articulation which currently provides the most comprehensive account of systematic phonetic patterns, including timing, is a spatiotemporal approach called Articulatory Phonology (Browman and Goldstein 1985, 1992a; Saltzman, Nam, Krivokapić and Goldstein 2008). This model has many strengths, among them that it accurately captures a wide variety of complex characteristics of speech articulation (including smooth, single-peaked movement velocity profiles, coarticulation, and systematic effects of prosodic structure). However, its choice of spatiotemporal phonological representations and phonology-intrinsic timing mechanisms makes it structurally very different from approaches based on symbolic representations. Thus resolving the debate about spatiotemporal vs. symbolic representations has implications not only for phonological theory, but also for the architecture of the speech motor control system.

The second motivation was that our shared interest in timing patterns in speech led us to wonder about the type of motor control system that can best explain what is known about speech timing. Because one of the primary differences between symbolic and spatiotemporal approaches is in how they deal with timing, an evaluation of these theories in terms of available timing evidence simultaneously leads to answers to both questions, namely, about the nature of phonological representations, and about the type of motor
control system that can account for speech timing behavior. Evaluation of the Articulatory Phonology model from this point of view is presented in the first part of the book, and is made possible in large part by the model's exemplary explicitness. Because the lines of evidence presented here do not accord with Articulatory Phonology's spatiotemporal approach, in the second half of the book we provide a sketch of a model of speech production based on symbolic phonological representations and phonology-extrinsic timing mechanisms that has the flexibility to account for known timing behavior.

As we have noted, the choice between symbolic, atemporal phonological representations and spatiotemporal phonological representations has several fundamental implications for the architecture of the speech production planning system. One of the most significant of these implications is the number of planning components that are required. In systems that include a planning component with symbolic (i.e. discrete, without specific spatiotemporal content) phonological representations (Henke 1966; Klatt 1976; Keating 1990; Shattuck-Hufnagel 1992; Shattuck-Hufnagel, Demuth, Hanson and Stevens 2011; Munhall 1993; Kingston and Diehl 1994; van Santen 1994; Guenther 1995; Clements and Hertz 1996; Levelt, Roelofs, and Meyer 1999; Fujimura 1992 et seq.; Goldrick, Baker, Murphy and Baese-Berk 2011; Houde and Nagarajan 2011; Perkell 2012; Lefkowitz 2017), a separate phonetic planning component is required to plan the details of surface timing and spatial characteristics for each context. These aspects of an utterance are not specified by the symbolic phonological representation. As a result, a separate phonetic planning process is required to map, or 'translate', from the representational vocabulary of abstract phonological symbols to a different (i.e. fully quantitative) representational vocabulary that can specify the physical form of an utterance. In contrast, the Articulatory Phonology system has a very different architecture; because its phonological representations are already spatiotemporal and fully quantitative in nature, it does not require a separate phonetic planning component. That is, because phonology in the Articulatory Phonology framework is already spatiotemporal, a single type of representational vocabulary is used throughout production, and this avoids the need for a separate phonetic component to plan the spatiotemporal details that are required to implement a symbolic phonological plan in a spoken utterance.

In addition to its implications for the architecture of the planning model, the choice between spatiotemporal and symbolic phonological representations also has important consequences for how these two approaches deal with timing issues. This is because time is intrinsic to phonological representations
in Articulatory Phonology, but is not part of phonology in symbol-based models. Because timing is intrinsic to the phonology, surface timing characteristics simply emerge from the phonological system, and are not explicitly represented, specified, or tracked. In contrast, in symbol-based approaches, timing is extrinsic to the phonological representations, so that surface timing characteristics must be explicitly planned in a separate phonetic planning component.

These fundamental differences in how the two contrasting approaches deal with timing suggest an important criterion for comparing and evaluating them: that is, how well do they account for what is currently known about motor timing in general, and speech motor timing in particular? A large part of this book is devoted to such an evaluation, made possible by the explicit predictions of the Articulatory Phonology approach developed in the framework of Task Dynamics (Saltzman and Munhall 1989; Saltzman, Nam, Krivokapić and Goldstein 2008). This book presents a number of lines of evidence that are inconsistent with the Articulatory Phonology approach in particular and the phonology-intrinsic timing approach in general, and therefore suggest the need to consider a different approach based on phonology-extrinsic timing.

To begin, the first few chapters of the book lay out the key features of phonology-intrinsic-timing-based Articulatory Phonology in the Task Dynamics framework, and examine the oscillator-based mechanisms it uses. This model of speech production planning has evolved significantly over the years, under the influence of a number of important theoretical developments and observational findings. For example, the development of modern prosodic theory, with its hierarchy of prosodic constituents and prominence levels governing systematic patterns of duration variation (such as boundary-related lengthening, prominence-related lengthening, and poly-constituent shortening; see Chapters 6 and 10) led to the incorporation of planning oscillators for the syllable, the foot, and the phrase, and to the postulation of other timing-adjustment mechanisms for boundary- and prominence-related lengthening, and speech rate (Byrd and Saltzman 2003; Saltzman et al. 2008). These developments have resulted in a system which is significantly more complex than the original proposal, but provides a much-needed account of contextual variability, via the manipulation of the activation intervals for the spatiotemporal representations in different contexts. The added mechanisms begin to chip away at the model's initial attractive simplicity, but do not significantly undermine its core principles.

The second and more telling part of the evaluation involves an examination of current evidence in the non-speech motor timing literature and in the
related speech literature that is currently not modeled within Articulatory Phonology. This evaluation reveals phenomena which appear to be incompatible with the phonology-intrinsic timing approach, and which therefore motivate the consideration of an alternative approach based on phonology-extrinsic timing. Some of these phenomena appear to require the representation and planning of surface timing characteristics. These are not consistent with phonology-intrinsic timing theories, because in such theories surface timing characteristics are not explicitly represented or planned as goals. Instead, they emerge from interacting components within the spatiotemporal phonological system. As a result, there is no mechanism for explicitly specifying surface timing, yet a number of observations suggest speakers can do this. Other phenomena suggest the involvement of general-purpose timekeeping mechanisms, which are not invoked in Articulatory Phonology because they are at odds with its phonology-intrinsic timing approach, in which the timing mechanism is specific to speech, and surface timing characteristics are emergent. Still other phenomena, relating to timing precision at movement endpoints, are also at odds with spatiotemporal phonological representations and thus have no principled explanation within Articulatory Phonology. In contrast, they can be straightforwardly explained in a three-component speech production system which combines symbolic phonological representations with separate phonetic planning and motor-sensory implementation components.

The evaluation is then extended to the ways in which Articulatory Phonology has chosen to account for movement coordination and effects of prosodic structure on articulation, within the phonology-intrinsic timing framework. This evaluation in light of additional findings in the motor-control literature similarly suggests the need to consider approaches to coordination and suprasegmental structure that are different from the oscillator-based approach of Articulatory Phonology.

The results of this multipart evaluation in the first half of the book highlight the need to develop an alternative type of speech motor control model that can deal more straightforwardly with available motor timing evidence. Drawing on and extending existing proposals, the second half of the book sketches out a three-part model that includes a Phonological Planning Component, a separate Phonetic Planning Component, and a Motor–Sensory Implementation Component. This model has two goals: to provide a more complete description of the phonological planning process than is available in existing three-part symbol-based systems, and to provide an account of certain aspects of systematic variation in surface phonetic timing behavior that is not available either in existing three-component models or in Articulatory Phonology.

Like any model of speech production, including Articulatory Phonology, this alternative approach must meet a certain set of generally agreed-upon criteria for an adequate model. That is, it must make contact with the phonological information that specifies the lexical form of each word in the planned utterance; it must include some specification of the utterance-specific prosodic structure, including relative word prominence and the grouping of words into larger constituents; it must provide an account of the ways in which words and their sounds vary systematically in different contexts; and it must provide instructions to the articulatory control mechanisms that are adequate to match the observed quantitative facts about spoken utterances, such as appropriate articulator trajectories, acoustic patterns, and surface phonetic timing in both the acoustic and the articulatory domains.

The model proposed here builds on insights gained from existing symbol-based three-component models, but extends this approach to account more comprehensively for the details of surface phonetic variation, using general-purpose timing mechanisms that are extrinsic to the phonology. This extended three-component approach based on phonology-extrinsic general-purpose timing mechanisms follows traditional phonological theory in assuming symbolic phonological representations. However, early models based on symbolic representations, derived from Generative Phonology (Chomsky and Halle 1968), did not attempt to deal with the physical manifestation of speech—in a sense they stopped at the point when the surface form of an utterance was still symbolically represented. Later models in the Generative Phonology framework generate articulatory movements, but they do not provide a full account of surface timing. The approach advocated here develops these ideas further, by proposing a more comprehensive account of surface phonetic variability, including timing. In doing so, it incorporates some of the ideas in the existing literature (e.g. Keating 1990; Guenther 1995; Guenther, Ghosh, and Tourville 2006; Guenther 2016; Fujimura 1992, 2000 et seq.) but differs in three main ways that provide the flexibility necessary to account for the full range of systematic context-governed phonetic variability.

First, the proposed approach provides an account of the types of task requirements that are specified in the Phonological Planning Component. This aspect of the proposed model is based on evidence highlighting the large number of contextual factors that can influence utterance-specific surface phonetic form, including timing characteristics, and must therefore be included in the phonological plan in non-quantitative (but sometimes relational) symbolic terms, for later development in quantitative terms to form the phonetic plan. In this proposed model, task requirements include the production of phonological contrasts
appropriately in different contexts (where context is defined by factors such as location in a hierarchical prosodic structure, relative speaking rate, dialect and idiolect, speaking style/situation, and others, see Turk and Shattuck-Hufnagel 2014), as well as the choice of appropriate symbolically expressed acoustic cues to meet these task requirements.

The second difference from earlier approaches is that, although the proposal shares with other three-component models the assumption of phonology-extrinsic, general-purpose timekeeping mechanisms, it differs in its account of timing control. The proposed account is based on planning the timing between acoustic landmarks (Stevens 2002), and incorporates Lee's (1998) General Tau theory to plan appropriate movement velocity profiles and target-based movement coordination for the movements that achieve the landmarks. The computation of parameter values to be specified in the Phonetic Planning Component (including parameter values for timing) occurs via mechanisms proposed in Optimal Control Theory, to determine the optimum way of meeting utterance-specific goals specified in the Phonological Planning Component, at minimum cost. Optimal Control Theory models the choice of movements to satisfy multiple goals while economizing on effort, time, and other costs (cf. Nelson 1983; Lindblom 1990), and has been used in several recent models of surface timing variation in speech, e.g. Flemming (2001); Šimko and Cummins (2010, 2011); Katz (2010); Braver (2013); Lefkowitz (2017).

Finally, like other three-component approaches, the model incorporates a Motor–Sensory Implementation Component to carry out the optimized instructions, consistent with the evidence that speakers track and adjust their movements when possible, to ensure that their acoustic goals are achieved. Such a component is widely agreed to be necessary, and specific proposals for how this component works have been advanced by Houde and Nagarajan (2011) and Guenther (2016); see also Hickok (2014).

The chapters that follow first lay out the major tenets of the Articulatory Phonology approach, and some of its remarkable successes in providing an account of speech phenomena such as coarticulation (Chapter 2). Because a full description of the theory is necessary in order to evaluate it in light of accumulating evidence about the nature of movement planning and motor control, this chapter provides a comprehensive description of its current state, with elements pulled together from disparate parts of the extensive relevant Articulatory Phonology literature. Several chapters are then devoted to explicating why, in our view, currently available evidence from motor timing in general and speech timing in particular suggests that an alternative model is needed (Chapters 3–6). Chapter 7 summarizes this timing evidence and
presents additional spatial evidence which suggests the value of developing a three-component model with phonology-extrinsic timing and abstract symbolic phonological representations. Chapters 8–9 present a number of components from the existing literature that could provide some of the pieces of such an alternative model. These components include Stochastic Optimal Feedback Control Theory (Todorov and Jordan 2002; Houde and Nagarajan 2011), and General Tau theory (Lee 1998). Chapter 10 draws all these elements together, providing a sketch of a phonology-extrinsic-timing-based three-component model of speech production planning, and Chapter 11 provides a summary of the main points made in the book.

Although many of the components of this alternative approach are drawn from the existing literature, they have not previously been combined into a model of acoustic and articulatory speech planning based on symbolic phonological representations and phonology-extrinsic timing, which can account for systematic surface phonetic variation in speech, including systematic surface timing patterns. That is one of the tasks that we have set ourselves in this book. The proposed model is still at the beginning stages of development, but we believe that its eventual computational implementation will provide a more principled and comprehensive account of phonetic behavior, and a more realistic account of speech production processing in general, than is currently available. We hope that other researchers will be inspired to consider whether their phonetic observations could be accounted for by such a model, and we look forward to some lively interactions.


2 Articulatory Phonology/Task Dynamics

2.1 Introduction

Articulatory Phonology, developed in the Task Dynamics framework (hereafter AP/TD), provided the first comprehensive model of phonology, speech articulation, and the connection between them (Browman and Goldstein 1985, 1989, 1990, 1992, 1995; Saltzman and Munhall 1989; and more recent developments). This theory, like any theory of speech motor control, faces the challenge of explaining how 'the same sound' can be produced in systematically different ways in different contexts. AP/TD is based on the idea, first developed in non-speech motor control, that "it is tendencies in dynamics—the free interplay of forces and mutual influences among components tending toward equilibrium or steady states—that are primarily responsible for the order of biological processes" (Kugler, Kelso, and Turvey 1980, p. 6), and that biological processes therefore require minimal involvement of "intelligent regulators," i.e. minimal planning and computation. While acknowledging the importance of linguistic goals (tasks) in speech, the AP/TD approach attempts to reduce the burden of planning and regulation by adopting Fowler's (1977, 1980) proposal that phonological representations are (spatio)temporal, and thus that the timing of speech movements is intrinsic to the phonology. Because the dynamical spatiotemporal phonological representations determine the movements which shape the acoustic speech signal, there is no requirement for a separate phonetic planning component. And because the phonological representations are not symbolic, there is no requirement to translate from non-spatiotemporal, discrete symbols to quantitative, continuous articulatory movements. Thus the AP/TD approach addressed one of the most vexing problems in speech: the gap between symbolic representations in the mind and quantitative values in the speech signal.

Since the 1980s, the AP/TD approach has been further developed to account for many effects of context on speech articulation, including coarticulatory effects of adjacent context, and effects of prosodic position. Because of this, AP/TD currently provides the most comprehensive account of systematic spatiotemporal variability in speech. In doing so, it represents the standard
which any alternative theory of speech production must match or surpass, and provides a clear advantage over traditional phonological theories as a model of the speech production process. This chapter reviews the assumptions, mechanisms, and implications of this theory. Because one of its central tenets is that time is intrinsic to phonological representations, and a major goal of the book is to evaluate the theory in terms of its ability to account for timing behavior, a particular focus of the chapter is on the consequences of the commitment to phonology-intrinsic timing for the way speech timing phenomena are modeled. Understanding these aspects of the Articulatory Phonology approach is critical for identifying phenomena for which the model currently has no account (Chapters 3–6), and which have motivated the proposal of an alternative, phonology-extrinsic timing approach (laid out in Chapters 7–10).

Because there have been considerable developments to the AP/TD theory over the years in response to new data, the aim of this chapter is to pull together a full description of the structures and mechanisms which are proposed within the current theory to account for systematic variability in speech production. It is primarily based on a series of papers which together describe the current state of the theory: Browman and Goldstein (1985); Browman and Goldstein (1990a, b); Browman and Goldstein (1992a); Browman and Goldstein (2000); Browman and Goldstein (unpublished ms); Byrd and Saltzman (1998); Byrd and Saltzman (2003); Goldstein, Nam, Saltzman and Chitoran (2009); Krivokapić (2020); Nam, Goldstein, and Saltzman (2010); Nam, Saltzman, Krivokapić and Goldstein (2008); Saltzman and Byrd (2000); Saltzman, Löfqvist, and Mitra (2000); Saltzman and Munhall (1989); Saltzman, Nam, Goldstein, and Byrd (2006); and Saltzman, Nam, Krivokapić and Goldstein (2008). A computational implementation of the model has been developed by Nam, Browman, Goldstein, Proctor, Rubin, and Saltzman, and is described here: www.haskins.yale.edu/tada_download/index.php. Where appropriate, reference is also made to newer developments, e.g. Sorensen and Gafos (2016), which have not yet been incorporated in a fully working system.

This chapter first introduces gestures as units of contrast and constriction formation in AP/TD (Section 2.2), presents how a mass–spring system is used to model constriction formation (Section 2.3), and describes the function of gestures in controlling individual articulators and of gestural activation in controlling gestural movement (Section 2.4), as well as mechanisms for timing control (Section 2.5) within this system. It then summarizes the key features of the model (Section 2.6) and its advantages (Section 2.7). Finally, the last section looks ahead to the evidence laid out in Chapters 3–6, which motivates the alternative approach presented in the remainder of the book.


2.2 The dual function of gestures within Articulatory Phonology: contrast and constriction formation

In the AP/TD framework, basic units of phonological contrast and speech production are units of vocal tract constriction formation called gestures, e.g. tongue-tip closure, tongue-body opening. In this framework, the term 'gesture' has a very specific meaning, which is somewhat different from the common use of the term. Each gesture specifies a set of articulators responsible for achieving a particular constriction in the vocal tract. For example, the upper lip, lower lip, and jaw act together to form a bilabial constriction, and the tongue body, tongue tip, and jaw act together to form a tongue-tip constriction at the alveolar ridge.

A central tenet of AP/TD is that gestures, in the technical AP/TD sense, have a dual function. On the one hand, they are contrastive phonological units, that is, units that distinguish word meanings. At the same time, they each specify a family of movement trajectories with the same constriction target, and describe how these trajectories unfold over time.

In different utterances, the articulatory trajectories for a given gesture can be different for several reasons. The first is because a given gestural constriction represents the activity of a task-specific coordinative structure (cf. Kugler et al. 1980), and can thus be achieved through different contributions of individual articulators which make up the coordinative structure and act in a coordinated fashion to achieve the gestural goal. An example is when the upper lip, lower lip, and jaw act together to achieve a bilabial constriction gesture, and can compensate for one another when one of these articulators is perturbed from its normal pattern of activity (Folkins and Abbs 1975), or when one of its articulators is involved in the production of a different, overlapping gesture. Further reasons for variability in the articulatory trajectories for a gesture include 1) differences in gestural starting position, e.g. because of a different preceding gesture produced with the same articulators; 2) differences in overlapping gestures; or 3) differences in how long a gesture is active, due either to differences in prosodic position, or to differences in speech rate. Because the surface form of each gesture-related movement will differ depending on context, the gestures themselves can be considered abstract, although they are not symbolic because they contain intrinsic specifications of quantitative information.

The AP/TD view contrasts with traditional approaches to phonology and phonetics, in which phonological representations are symbolic, and therefore do not define quantitative aspects of articulatory movements or their timing. That is, in traditional approaches, the sequence of symbols /bæt/ cannot be
considered a recipe for generating the quantitative aspects of movements involved in producing the word bat. This is because, among other things, the symbols /bæt/ do not specify the movement times, the exact degree of lip compression for the stop closures, etc. In contrast, in Articulatory Phonology, phonological representations (i.e. equations of movement and their parameter values) do determine aspects of the way constrictions are formed over time. An oft-cited advantage of this approach is that it does not involve translating from one type of representation (categorical symbolic mental representation) to another (representation for specific phonetic form). This advantage is viewed as critically important, because such translation into context-specific variants has been argued to destroy information about the contrastive phonological categories of speech (Fowler, Rubin, Remez and Turvey 1980).

The rest of this chapter first presents a general overview of the Task Dynamic approach to motor control (2.3), followed by an introduction to speech motor control in AP/TD (2.4). There follows a detailed discussion of AP/TD mechanisms that relate most closely to speech timing (2.5). Finally, it discusses the specific features that distinguish AP/TD from other approaches, and highlights the advantages of the system (2.6 and 2.7).
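To make the contrast concrete, here is a minimal illustrative sketch of the two kinds of representation. It is ours, not the authors': the Python structures and every numeric value (targets, stiffnesses, activation intervals) are invented for illustration, though the tract-variable names follow Table 2.1 in Section 2.4.

```python
# A symbolic representation: discrete, atemporal, with no quantitative content.
symbolic_bat = ["b", "ae", "t"]

# A gesture-style representation: each entry names a tract variable and
# quantitative dynamical parameters, so it doubles as a movement recipe.
# All numbers here are hypothetical.
gestural_bat = [
    {"tract_var": "LA",   "target_mm": 0.0,  "stiffness": 8.0, "active_s": (0.00, 0.12)},  # bilabial closure for /b/
    {"tract_var": "TDCD", "target_mm": 11.0, "stiffness": 2.0, "active_s": (0.05, 0.30)},  # wide tongue-body constriction for the vowel
    {"tract_var": "TTCD", "target_mm": 0.0,  "stiffness": 8.0, "active_s": (0.25, 0.38)},  # alveolar closure for /t/
]
```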

2.3 Using mass–spring systems to model gestural movement in TD The AP/TD model generates articulatory trajectories for planned utterances, which can serve as input to an articulatory synthesizer.¹ In the Task Dynamics framework, gestural movement, or movement toward a constriction goal, is modeled as movement toward an equilibrium position in a damped, massspring system, i.e. the movement of a mass attached to a spring (Asatryan and Feldman 1965; Turvey 1977; Fowler et al. 1980). That is, the gestural starting position is analogous to the position to which the mass attached to the spring is stretched, and the equilibrium position is the target position that is approached by the mass after releasing the spring. A mass–spring system can be described as an oscillator, because if the spring is stretched and released, it will oscillate around its equilibrium position in the absence of friction

¹ The model is restricted to articulatory trajectories and does not describe muscle contractions. For discussions of issues relating to modeling muscle contractions, see e.g. Asatryan and Feldman’s (1965) equilibrium point hypothesis, and Bullock and Grossberg’s (1990) FLETE model.

OUP CORRECTED PROOF – FINAL, 30/1/2020, SPi

12  /  (i.e. when not damped²). Because the system is critically damped, the spring doesn’t oscillate, but rather reaches within a very short distance of the equilibrium position very quickly, and then continues moving asymptotically toward the equilibrium position but never quite reaches it. When an oscillator is damped (either critically damped, or over-damped) so that instead of oscillating it simply approaches a target, it is said to have point-attractor dynamics. When it oscillates freely because of less-stringent damping, it is said to have limit-cycle dynamics. Both types of oscillators are used within the AP/TD framework, but the discussion here will focus here on point-attractor dynamics, leaving the discussion of limit-cycle oscillators until Section 2.5.3. A key feature of oscillatory systems with point-attractor dynamics is that they will approximate the equilibrium (target) position, regardless of starting position. The use of this tool in the Task Dynamics model provides a way for the same context-independent phonological unit (a gesture) to have different physical instantiations depending on phonetic context (e.g. the starting position defined by the preceding gestural context). This is because a given gestural dimension is always described by the same equation of motion, with the exception that the value for the starting position parameter is dependent on context. Therefore phonological equivalence can be expressed in terms of the equations of motion that define each gesture (apart from the specification of the starting parameter value). In addition, in simple systems of this type that have linear damping and restoring forces (spring stiffness), gestural movement duration is proportional to the square root of the stiffness of the spring normalized for its mass, and is predicted to be the same for movements of different amplitude. That is, the spring will move back to equilibrium more quickly when stretched further (cf. Cooke 1980; Ostry and Munhall 1985), and more slowly when stretched less far, resulting in equivalent durations for the movements.³ The equation of motion that describes mass–spring oscillations, and is used for each dimension of gestural movement (i.e. for dimensions of constriction location and constriction degree) is the following: ::

:

mx þbx þkðx  x0 Þ ¼ 0 It contains one context-dependent parameter, x, which represents the gestural starting position, and four context-independent parameters: m for mass, b for ² An informal analogy of damping might be putting one’s feet on the ground to stop a swing. ³ See Sorensen and Gafos (2016) for a recent proposal for a mass–spring dynamical system with a nonlinear restoring force that makes slightly different and more realistic predictions for the relationship between movement amplitude, speed, and time. See Section 2.5.1.3 for further discussion.

OUP CORRECTED PROOF – FINAL, 30/1/2020, SPi

. –      

13

damping, k for spring stiffness, and x₀ for target position. The parameter b (the damping coefficient) is set to a value that ensures critical damping, that is, such that the system reaches very close to equilibrium very quickly, then moves asymptotically toward it, without oscillating. The parameter m (mass) is arbitrarily set to 1 in most implementations (although see Šimko and Cummins 2010 for recent work in which this parameter is varied in a principled way, so that the cost of movement can be computed). This equation defines the movement trajectory for the gestural dimension for each point in time. In the AP/TD model, values of k (spring stiffness) and x₀ (target position) are specific to the definition of one contrastive category vs. another, and therefore form part of the definition of each gesture (Saltzman, Löfqvist, and Mitra 2000). The values for these parameters that have been chosen in implementations of the model have been estimated from kinematic data. The value of the spring stiffness parameter k is identical for both dimensions (location and degree) that characterize a given gesture (Saltzman and Munhall 1989). Differences in k for consonants vs. vowels are implemented in the model as consistent with empirical data. In particular, the spring stiffness of vowels is lower than that of consonants, and as a result the gestural movements for vowels are slower than those for consonants (Saltzman and Munhall 1989). The target position x₀ is specific for each dimension of gestural movement (i.e. different for constriction location vs. degree). Differences in x₀ across gesture dimensions are to be expected, since these are required for different constriction locations and degrees characteristic of each linguistic category. For example, the configuration for lip closure for a labial stop will have a different constriction location and degree from the configuration appropriate for an alveolar fricative.

Gafos (2006) and Gafos and Beňuš (2006) have proposed that the underlying target position x₀ for a given gesture can change in particular utterance contexts, according to grammatical and extra-grammatical constraints. This proposal was made in order to account for incomplete voicing neutralization in word-final position in German. In German, e.g., Rad 'wheel' and Rat 'advice' are both pronounced as [ʁat], but in some experimental conditions the two types of words show subtle differences in voicing during closure, as well as differences in vowel, closure, and burst duration, suggestive of the underlying phonological categories (Port and Crawford 1989; Gafos 2006). Gafos (2006) and Gafos and Beňuš (2006) account for this incomplete neutralization behavior by proposing that the glottal aperture target position x₀ for the final consonant in e.g. Rad is under the influence of two weighted, competing attractors, one with a value corresponding to the target position for
voiced stops (consistent with lexical contrast), and the other with a value corresponding to the target position for voiceless stops (consistent with the apparent grammatical de-voicing rule or constraint).

This section has discussed how the equations of mass–spring oscillation are used to describe movements in each gestural dimension; the next section describes two additional aspects of the AP/TD system: how gestures control individual articulators and the mechanism for specifying gestural activation.
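The point-attractor behavior described above is easy to verify numerically. The sketch below is our own illustration, not part of the AP/TD implementation; the parameter values (k = 100, m = 1, and the 90% criterion) are arbitrary. It integrates the critically damped equation of motion and shows that, in the linear system, the time to cover a fixed proportion of the distance to target is the same for large and small movement amplitudes:

```python
import numpy as np

def gesture(x_start, x_target=0.0, k=100.0, m=1.0, dt=1e-4, t_max=1.0):
    """Euler-integrate m*x'' + b*x' + k*(x - x_target) = 0 with critical
    damping b = 2*sqrt(k*m), starting from rest at x_start."""
    b = 2.0 * np.sqrt(k * m)      # critical damping: approach without oscillation
    x, v = float(x_start), 0.0
    xs = [x]
    for _ in range(int(t_max / dt)):
        acc = (-b * v - k * (x - x_target)) / m
        v += acc * dt
        x += v * dt
        xs.append(x)
    return np.asarray(xs)

def time_to_90pct(xs, x_start, x_target=0.0, dt=1e-4):
    """First time at which 90% of the initial distance to target is covered."""
    return np.argmax(np.abs(xs - x_target) < 0.1 * abs(x_start - x_target)) * dt

big, small = gesture(10.0), gesture(2.0)
print(time_to_90pct(big, 10.0), time_to_90pct(small, 2.0))
# -> both ~0.39 s: the same duration despite a fivefold difference in amplitude,
#    and both trajectories approach the same target from different starting positions.
```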

2.4 Gestural control of individual articulators, and gestural activation

In the AP/TD framework, the movement of individual articulators is not controlled directly, but rather indirectly, via the selection of gestures (i.e. sets of yoked tract variables) and via gestural activation. First, each tract variable (e.g. constriction location or constriction degree) controls one aspect of the behavior of a group of articulators that are yoked together (i.e. a coordinative structure) (Table 2.1). Second, the tract variables control constriction formation for each gesture; when a gesture is specified for more than one tract variable, the tract variable specifying location and the tract variable defining degree together define the gesture. And third, there is a gestural score for each utterance, which dictates the amount of time each gesture is active (the gestural activation interval), as well as the relative timing/overlap of different gestures (inter-gestural coordination) (Figure 2.1).

Table 2.1 AP/TD tract variables and the model articulator variables that they govern

  Tract variable                              Model articulators
  LP    lip protrusion                        upper and lower lips
  LA    lip aperture                          upper and lower lips, jaw
  TDCL  tongue dorsum constriction location   tongue body, jaw
  TDCD  tongue dorsum constriction degree     tongue body, jaw
  LTH   lower tooth height                    jaw
  TTCL  tongue-tip constriction location      tongue tip, body, jaw
  TTCD  tongue-tip constriction degree        tongue tip, body, jaw
  TTCO  tongue-tip constriction orientation   tongue tip, body, jaw
  VEL   velic aperture                        velum
  GLO   glottal aperture                      glottal width

Source: Saltzman et al. (2008). Reproduced with permission.


Figure 2.1 Gestural scores for the words mad and ban, illustrating gestural activation intervals and their relative timing. Source: Goldstein, Byrd, & Saltzman (2006, p. 226; Figure 7.7). Reproduced with permission from Cambridge University Press © Cambridge University Press 2006
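As a rough computational analogue of the gestural score in Figure 2.1, the sketch below represents a score as a list of gestures with activation intervals and prints each interval as a row of 'boxes' on a shared timeline. This is our illustration only: the interval values in milliseconds are invented, not measured, and only the gesture labels are taken from the figure.

```python
# Hypothetical gestural score for "mad"; activation intervals (ms) are invented.
score = [
    ("VELUM: wide",                  (0, 140)),
    ("LIPS: bilabial closure",       (0, 120)),
    ("TONGUE BODY: pharyngeal wide", (40, 280)),
    ("TONGUE TIP: alveolar closure", (220, 320)),
]

def render(score, total_ms=340, cell_ms=20):
    """Print one row per gesture; '#' marks time slices where it is active."""
    for label, (on, off) in score:
        row = "".join("#" if on <= t < off else "." for t in range(0, total_ms, cell_ms))
        print(f"{label:30s} {row}")

render(score)
```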

This section discusses the gestural control of individual articulators (Section 2.4.1), as well as the impact on gestural movement of gestural activation, coordination, and the neutral attractor (Section 2.4.2). The details of timing control (both relative timing and the timing of gestural activation) are left until Section 2.5.

2.4.1 Gestural control of individual articulators Gestures represent movements toward constrictions along a set of relevant dimensions, specified by tract variables. For example, tongue-tip constriction location and degree are the tract variables (dimensions) for tongue-tip gestures; lip aperture and protrusion are the tract variables for lip gestures. Each tract variable, in turn, represents the collective movement of a set of articulators that cooperatively contribute to constriction formation in the dimension specified by the tract variable. These articulator sets are called coordinative structures, or synergies. For example, the separate articulators upper lip, lower lip, and jaw contribute to tract variables for lip aperture and protrusion in lip gestures, while the tongue tip, tongue body, and jaw contribute to tract variables for tongue-tip constriction degree in tongue-tip gestures. Coordinative structures are task- (gesture-)specific. Although there is a default speakerspecific relative contribution of each articulator, the model is configured so that articulators within a coordinative structure can compensate for one another when the need arises. This feature of the model makes it possible for the model to adapt to perturbations. For example, if a load is placed on the jaw during the production of a bilabial sound /p/, the upper lip will
compensate so that lip closure can nevertheless be achieved (e.g. Folkins and Abbs 1975, among many others). In non-perturbed situations, compensation is also seen when a single articulator is involved in the production of multiple overlapping gestures. For example, the jaw is involved in the production of bilabial consonants as well as vowels; if these overlap, and the overlapping vowel is low (low jaw position), the lips will be more involved in the bilabial production than they would be if the overlapping vowel were high (higher jaw position), to compensate for the fact that the jaw is governed by the vowel gesture as well as the consonant gesture. Each articulator will therefore contribute to a gestural tract variable in different proportion depending on context; this type of reorganization is often required in speech.
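Task-dynamic implementations resolve this kind of redundancy by distributing a tract-variable command over the synergy’s articulators, e.g. via a weighted pseudoinverse of a (linearized) articulator-to-tract-variable map. The sketch below is a toy one-dimensional version of that idea: the map J and the weights are invented for illustration, and zeroing the jaw’s weight stands in for a perturbation.

```python
import numpy as np

# Toy linearized map: d(lip aperture) / d(upper lip, lower lip, jaw)
J = np.array([[1.0, -1.0, -1.0]])

def distribute(delta_la, weights):
    """Weighted least-squares split of a lip-aperture change across articulators."""
    W = np.diag(weights)
    return (W @ J.T @ np.linalg.inv(J @ W @ J.T) @ np.array([delta_la])).ravel()

print(distribute(-5.0, [1.0, 1.0, 1.0]))  # default sharing among lips and jaw
print(distribute(-5.0, [1.0, 1.0, 0.0]))  # 'perturbed' jaw: lips take over
```

Both calls achieve the same −5 change in lip aperture, which is the sense in which the synergy, rather than any single articulator, is the controlled unit.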

2.4.2 Gestural activation, overlap in the gestural score, and the neutral attractor

The gestures that specify a particular utterance are organized into a gestural score, which specifies the temporal intervals (gesture activation intervals) during which gestures will be active during the utterance, and patterns of gestural overlap and coordination among gestures. As explained in more detail in Section 2.5, timing at the inter-gestural level is governed by an ensemble of undamped (limit-cycle) planning oscillators, one associated with each gesture (Goldstein et al. 2009), whose frequency, in turn, is influenced by the prosodic level. The prosodic level consists of a set of suprasegmental oscillators (syllable, foot, and phrase, where the foot is defined as a unit extending from one word-level stress to the next; see Saltzman et al. 2008; Nam et al. 2010; Krivokapić 2013).

2.4.2.1 Gestural activation

Gestural activation is schematized by ‘boxes’ on gestural score diagrams (see Figure 2.2), replaced by slightly different shapes in later versions of the theory. Gestural movement is generated by multiplying the parameters of the mass–spring equation for each tract variable by the gestural activation value at each point in time. This gestural activation value is 0 when activation is off, 1 when it is turned on completely, and an intermediate value if activation is partial; partial activation occurs during on- and off-ramps, as implemented in more recent versions of the model. When it activates a gesture, the activation function also indirectly triggers the movement of the set of articulators controlled by that gesture.
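A minimal simulation of this gating scheme, assuming a critically damped second-order tract variable with mass 1, a trapezoidal activation ramp, and a neutral target at 0; the stiffness, interval, and ramp values are illustrative rather than the model’s settings.

```python
import numpy as np

def simulate_gesture(k=900.0, target=1.0, act_on=0.05, act_off=0.30,
                     ramp=0.04, dt=0.001, T=0.5):
    b = 2.0 * np.sqrt(k)          # critical damping, with mass fixed at 1
    x = v = 0.0                   # tract-variable position and velocity
    traj = []
    for step in range(int(T / dt)):
        t = step * dt
        # trapezoidal activation: ramp on, plateau at 1, ramp off
        a = float(np.clip(min((t - act_on) / ramp, (act_off - t) / ramp, 1.0),
                          0.0, 1.0))
        # activation blends gestural dynamics with the neutral attractor
        # (target 0 here); the two influences sum to 1 at every time step
        acc = a * (-b * v - k * (x - target)) + (1.0 - a) * (-b * v - k * x)
        v += acc * dt
        x += v * dt
        traj.append(x)
    return np.array(traj)

traj = simulate_gesture()
print(f"closest approach to the target: {traj.max():.3f} (target 1.0)")
```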

[Figure 2.2 near here: time functions for Lip Aperture, Tongue Body distance to palate, and Tongue Tip Constriction; spatial scale bar 15 mm, time scale bar 100 ms.]

Figure 2.2 Time functions of vocal tract variables, as measured using X-ray microbeam data, for the phrase pea pots, showing the in-phase (synchronous within twenty-five ms) coordination of the lip gesture for the /p/ in pots and the /a/ gesture for the vowel in pots. Note: Tract variables shown are lip aperture (distance between upper and lower lips), which is controlled for lip closure gestures (/p/ in this example) and tongue-tip constriction degree (distance of the tongue tip from the palate), which is controlled in tongue-tip gestures (/t/ and /s/ in this example). Also shown is the time function for the distance of the tongue body from the palate, which is small for /i/ and large for /a/, when the tongue is lowered and retracted into the pharynx. (The actual controlled tract variable for the vowel /a/ is the degree of constriction of the tongue root in the pharynx, which cannot be directly measured using a technique that employs transducers on the front of the tongue only. So distance of the tongue body from the palate is used here as a rough index of tongue root constriction degree.) Boxes delimit the times of presumed active control for the oral constriction gestures for /p/ and /a/. These are determined algorithmically from the velocities of the observed tract variables. The left edge of the box represents gesture onset, the point in time at which the tract variable velocity toward constriction exceeds some threshold value. The right edge of the box represents the gesture release, the point in time at which velocity away from the constricted position exceeds some threshold. The line within the box represents the time at which the constriction target is effectively achieved, defined as the point in time at which the velocity toward constriction drops below the threshold. Source: Goldstein, Byrd, & Saltzman (2006, p. 230; Figure 7.9). Reproduced with permission from Cambridge University Press. © Cambridge University Press 2006

At a normal, i.e. default, rate of speech, the activation interval is long enough for the gesture to approximate (i.e. reach very close to) its constriction target. However, when the activation interval is shorter than the default (e.g. at faster rates of speech), target undershoot will occur, because the gesture doesn’t have enough time to approximate its target. In addition, if the activation interval is longer than the default (because it has been stretched via mechanisms discussed in Sections 2.5.3 and 2.5.4), the articulators will remain
in a quasi-steady state after the target has been approximated, for the remainder of the interval, as the gesture continues to move asymptotically toward the target. Figure 2.2 illustrates lip-aperture movement and tongue-body movement, which remain in a quasi-steady state after the targets have been approximated (dashed lines in grey boxes indicate the point of target approximation). More detail about the control of activation interval timing is given in Section 2.5. In sum, gesture activation intervals specify when and how long each gesture will be active; intergestural coordination patterns are specified by the gestural score, as described in the following section.

2.4.2.2 Intergestural coordination

Because the gestural score consists of parallel tiers, one for each gesture, it also specifies patterns of intergestural coordination. That is, the gestural score specifies how gesture activation intervals are timed relative to one another, and thus whether and by how much they overlap. Gestural overlap can have both spatial and temporal consequences. For example, if overlapping gestures make use of the same model articulators (e.g. the jaw is often involved in successive consonant and vowel articulations), then the activity of shared articulators is blended. In such cases the parameter values for the shared articulators will be a combination (i.e. a weighted average) of the parameter values that would have been specified in a non-overlapping configuration. For example, in a VdV sequence, the tongue-tip gesture for /d/ shares tongue-body and jaw articulators with the surrounding vowels, and the activity for these articulators will reflect the combined control of the tongue-tip and vowel gestures. However, tongue-tip activity will be controlled by the tongue-tip gesture alone, because the tongue tip is involved only in the consonant (Öhman 1966; Saltzman and Munhall 1989). Note that if the overlapping gestures share all articulators, target attainment for the overlapped gestures may be compromised. However, AP/TD has mechanisms to ensure target attainment in such circumstances. For example, /g/ in VgV sequences shares both of its oral articulators with the adjacent vowels (i.e. tongue body and jaw). In this situation, the constriction location for /g/ is a result of the combined (overlapping) vocalic and consonantal instructions for the tongue body and jaw, which are shared in the production of /g/ and the surrounding vowels. However, because the blending strength is set to favor constriction degree for consonants over constriction degree for vowels,
the constriction degree target for /g/ can still be reached, with undershoot of the vowel target (Fowler and Saltzman 1993). Gestural activation intervals and gestural scores specify when the vocal tract is governed by each gesture, but another mechanism is required so that gestural targets can be released. This mechanism, the neutral attractor, is described in the next section.

2.4.2.3 The neutral attractor

When a tract variable is inactive, the articulators which it governs return to their respective neutral positions. These neutral positions are specified by the target of the neutral attractor, which in English is the target configuration for the neutral schwa vowel (Saltzman and Munhall 1989). The neutral attractor governs articulators that aren’t governed by active gestures, and thus provides a way of implementing constriction releases. This is because all articulators governed by a gesture will move toward the targets specified by the neutral attractor once the gestural activation interval ends. If gestural activation is partial (e.g. at the beginning or end of a ramped activation interval, Byrd and Saltzman 1998), the vocal tract will be under the simultaneous influence of both the tract variables and the neutral attractor. This is because gestural + neutral attractor activation must always sum to 1 (Byrd and Saltzman 2003). The neutral attractor yokes together uncoupled, articulator-specific point attractors. Like the tract variables that specify gestural movement, these articulator-specific point attractors are defined by equations that specify movement toward their equilibrium positions (targets). Because the starting position of each articulator-specific point attractor is the point that the articulator has reached at the end of the activation interval for the preceding gesture, the acoustic signal will be influenced by the preceding gesture after it is no longer active (i.e. during the interval governed by the neutral attractor).⁴ The mechanisms that control gestural activation, gestural overlap, and the neutral attractor, described in Sections 2.2, 2.3, and 2.4, provide a general picture of motor control in AP/TD. The following sections provide considerable further detail about timing control mechanisms in particular.

⁴ Note that there are thus two types of coarticulation in AP/TD: 1) coarticulation due to gestural overlap, and 2) coarticulation due to the influence of a preceding gesture on the starting position(s) of the articulators governed by an immediately following gesture or neutral attractor.
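Before moving on, here is a toy version of the blending described in Section 2.4.2.2 and above: competing parameter values for a shared tract variable are averaged, weighted by each gesture’s activation and blending strength. The weighted-average form and all numbers are illustrative assumptions, not the implemented blending rules.

```python
def blend(values, activations, strengths):
    """Activation- and strength-weighted average for a shared tract variable."""
    num = sum(v * a * s for v, a, s in zip(values, activations, strengths))
    den = sum(a * s for a, s in zip(activations, strengths))
    return num / den

# Tongue-body constriction degree (mm) for overlapping /g/ and vowel gestures:
cons_target, vowel_target = 0.0, 12.0   # near-closure vs. open vocal tract
result = blend([cons_target, vowel_target],
               [1.0, 1.0],              # both gestures fully active
               [10.0, 1.0])             # blending strength favors the consonant
print(f"blended constriction degree: {result:.1f} mm")  # ~1.1 mm: /g/ nearly closes
```

The consonant’s constriction-degree target is nearly reached, while the vowel’s is undershot, as in the /g/ example above.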

2.5 Timing Control in AP/TD

In AP/TD, time is included as part of phonological representations (and is therefore intrinsic as proposed by Fowler 1977). How long each gesture shapes the vocal tract (gestural activation intervals), as well as how individual gestures are coordinated with one another (inter-gestural relative timing), is determined by a system of gesture-extrinsic (see Sorensen and Gafos 2016), but phonology-intrinsic, control mechanisms. Speakers do not need to explicitly plan or specify the timing patterns that can be measured from surface acoustics, or surface movement trajectories, because surface timing patterns emerge from the phonological system. Some of the resulting surface timing patterns derive from mass–spring modeling of gestures, whereas others come from the way AP/TD models the control of gestural activation within an utterance, i.e. how it uses prosodic structure and rate control to dictate the amount of time a gesture can shape the vocal tract (its gestural activation interval), and how it models inter-gestural relative timing. In this section, timing patterns and the way they emerge from control mechanisms are examined for: 1) the timing control of individual gestures (determined by three control mechanisms: gestural stiffness (part of lexical representation), gestural activation (including its rise time), and one gesture-specific, distance-dependent timing adjustment mechanism), discussed in Section 2.5.1; 2) inter-gestural (relative) timing, accomplished through gestural planning oscillator coupling, discussed in Section 2.5.2; and 3) prosodic timing, discussed in Section 2.5.3, which involves two mechanisms: a) trans-gestural timing mechanisms and b) coupled prosodic constituency oscillators, where the coupled prosodic constituency oscillators are also used for global timing control of overall speech rate, discussed in Section 2.5.4. Within this framework, all aspects of timing control are accomplished using oscillators, either critically damped oscillators (for movements toward constrictions and for local timing adjustments), or un-damped, freely oscillating, limit-cycle oscillators (for inter-gestural coordination, and prosodic constituent organization).

2.5.1 Timing control of individual gestures

This section discusses three ways in which timing patterns for utterances emerge in AP/TD from mechanisms for the timing control of individual gestures. These are 1) the surface timing consequences of gestures as mass–spring systems, 2) the amount of time that the vocal tract is
governed by a gesture (the gestural activation interval), as well as 3) mechanisms required to account for differences in movement duration for different distances.

2.5.1.1 Surface timing consequences of gestures as mass–spring systems

As noted above, phonological representations of lexical contrast in AP/TD, i.e. gestures, are spatiotemporal and are modeled as critically damped mass–spring systems. A stretched, critically damped spring will take a predictable amount of time to return to a state very close to its equilibrium position (the settling time), and will have a predictable movement time-course during its trajectory. Modeling gestures as critically damped springs in this way has several consequences for timing patterns:

1) Gestural representations determine gestural settling time, or the time it would take the gesture to approximate its target, assuming the gesture is fully active and active for long enough. Because each contrastive gesture is described phonologically as a mass–spring system, i.e. as an equation of oscillatory motion, the time it takes to approximate a target is a function of the mass, damping, and stiffness parameters of the equation. In the fully implemented Saltzman et al. (2008) model, damping is always fixed to critical, and mass is also invariant across phonological categories (but see Šimko and Cummins 2010 for an alternative approach). Temporal differences across phonological categories in the current version of AP/TD therefore relate exclusively to k, the stiffness parameter. Vowels are assumed to have lower stiffness than consonants (and consequently longer mass–spring settling times); as a result, vowels have slower movements toward vowel targets as compared to the movements toward consonantal targets.

2) Movements are produced with a smooth, single-peaked tangential velocity profile. The point-attractor mass–spring dynamics of this model generates the hallmark of practiced, purposeful movements: a smooth, single-peaked tangential velocity profile, i.e. with a single acceleration and a single deceleration phase. However, as discussed in more detail below, the velocity profiles generated by systems with linear restoring forces (as in the original AP/TD model) have velocity peaks that are much earlier than observed in empirical data. An extra mechanism, i.e. gradual activation interval on- and off-ramps, was therefore added to the original system to create more realistic velocity profiles. A more recent proposal with a nonlinear restoring force (Sorensen and Gafos 2016) can generate more realistic timing of the velocity peak without the extra mechanism.

3) Mass–spring dynamics predicts that movement peak velocity will be faster for movements of longer distances. In mass–spring systems with a given stiffness specification, the peak velocity is faster if the movement distance is longer. In fact, in mass–spring systems with linear restoring forces (Saltzman et al. 2008), the ratio of movement distance to movement peak velocity is proportional to mass–spring settling time, resulting in equal durations for movements of longer distance as compared to movements of shorter distance (cf. Ostry and Munhall 1985). Put another way, gestural movements of different distances are predicted to have the same duration if their stiffness specifications are the same. Observations of speech data do indeed show that peak velocity is faster for longer-distance movements as compared to shorter-distance movements, as this mechanism predicts. However, durations for movements of different distances nevertheless do differ (e.g. Ostry, Keller, and Parush 1983) (sometimes described as ‘the farther, the longer’ phenomena), and therefore require additional mechanisms to account for them (see Section 2.5.1.3). An alternative in the form of mass–spring dynamics with a nonlinear restoring force has been suggested by Sorensen and Gafos (2016). See also Chapter 8 for an alternative explanation in the Optimal Control Theory framework. (The sketch following this list checks properties 1) and 3) numerically.)
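Numerically, for mass 1 and stiffness k, a critically damped gesture starting at rest a distance d from its target follows x(t) = d(1 − (1 + ωt)e^(−ωt)) with ω = √k, so peak velocity (dω/e) grows with d while the time course does not. The sketch below checks properties 1) and 3) with an illustrative stiffness and distances.

```python
import numpy as np

def gesture_kinematics(distance, k=200.0, dt=1e-4, T=1.0):
    omega = np.sqrt(k)
    t = np.arange(0.0, T, dt)
    # critically damped response from rest toward a target `distance` away
    x = distance * (1.0 - (1.0 + omega * t) * np.exp(-omega * t))
    v = np.gradient(x, dt)
    settle = t[np.argmax(x >= 0.95 * distance)]   # time to within 5% of target
    return v.max(), settle

for d in (5.0, 10.0, 20.0):                       # movement distances (mm)
    vpeak, settle = gesture_kinematics(d)
    print(f"{d:5.1f} mm: peak velocity {vpeak:6.1f} mm/s, "
          f"time to 95% of target {settle * 1000:5.1f} ms")
```

Peak velocity scales with distance while the settling time stays fixed: exactly the linear-restoring-force behavior that the ‘farther, the longer’ data contradict.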

2.5.1.2 The amount of time the vocal tract is governed by a gesture: the Gestural Activation Interval

Activation intervals determine the amount of time for which the vocal tract is intended to be shaped by the movements of a given gesture. These intervals are controlled by a hierarchy of coupled planning and suprasegmental oscillators. In recent developments of AP/TD, gestural activation intervals are not specified in terms of milliseconds (or other units that correlate with solar time). Instead, each gestural activation interval corresponds to a fixed proportion of a gestural planning oscillator period, so the oscillation frequency of the planning+suprasegmental oscillator ensemble determines the duration of the gestural activation interval. The oscillation frequency of this planning+suprasegmental ensemble can be varied according to desired speech rate, and can be adjusted at appropriate prosodic positions (e.g. boundaries and prominences), in order to stretch the activation intervals at these positions. As discussed, the amount of time required to approximate the target will be dictated by gestural stiffness. At a normal, default, rate of speech, the activation interval for each gesture is
long enough for the gesture to approximate its constriction target. However, at faster rates of speech, the gesture may not have enough time to approximate its target. And if the rate of speech is slow, or if the gesture is in a prosodically prominent position or at a prosodic constituent boundary, the activation interval will be longer than its default at a normal rate of speech in non-prominent, non-boundary-adjacent position. As a result, the articulators will continue to move asymptotically toward the target for the remainder of the interval, effectively remaining in a quasi-steady-state configuration after the target has been approximated. In the case where the activation interval is shorter (assuming full activation and no overlap), peak movement speed is the same as in cases where activation intervals are longer. This is because peak movement speed is governed by mass-normalized gestural stiffness, which is the same regardless of activation interval duration. However, when two gestures overlap, the movement speed of participating articulators will be related to the blended tract variable stiffness specifications.

2.5.1.2.1 Time to peak velocity: Activation interval rise time

In early versions of TD, activation was either off (0) or on (1), as shown in the rectangular boxes on early figures of the gestural score. However, for systems with a linear restoring force (as in the fully implemented Saltzman et al. 2008 system), this type of step-function activation yields velocity peaks that occur much too early: approximately 20%–25% of the way through a movement (Byrd and Saltzman 1998; Sorensen and Gafos 2016). This is below the range of 30%–70% reported for speech by Perkell et al. (2002). Byrd and Saltzman (1998) showed that the relative timing of the tangential velocity peak (i.e. velocity profile symmetry) could be manipulated by changing the shape of the activation function. In particular, if gestural activation is sometimes partial (as would be the case early and late in a movement if the activation function gradually rose to a plateau and gradually fell from the plateau), the vocal tract is shaped by a blend of neutral attractor + gestural parameters, where the degree of neutral attractor vs. gestural influence is determined by the magnitude of activation at any given time point. Byrd and Saltzman (1998) show that gradual-rise-plateau-gradual-fall functions do a better job of modeling the near-symmetry of most movement velocity profiles than do activation functions that are step-functions (on vs. off). More recently, Sorensen and Gafos (2016) proposed an alternative. They have shown that a more realistic time-to-peak velocity can be generated if the restoring force in the mass–spring system is nonlinear. In systems of this type, gestural
activation can be instantaneously switched completely on or off, without gradual on- or off-ramps, while still generating appropriate velocity profiles. Generating appropriate velocity profiles autonomously (without gesture-extrinsic control) is advantageous, but the authors acknowledge that their system would nevertheless still require gesture-extrinsic, i.e. non-autonomous, gestural activation control in order to model prosodic timing effects.

2.5.1.3 Mechanisms for longer durations for longer-distance movements

Another advantage of Sorensen and Gafos’ (2016) nonlinear system is that it accounts for the fact that movement durations are typically longer for longer distances (Fitts 1954), in spite of the fact that the peak velocity of longer-distance movements is typically higher. In systems with a linear restoring force (e.g. Saltzman et al. 2008), movements of different distances have the same duration; without an additional mechanism there is no additional duration for longer-distance movements. Saltzman et al. (2000) proposed a type of purely temporal manipulation that could account for these longer durations for longer distances. They proposed that the time-course of individual gestures is slowed in proportion to distance-to-target, while ensuring that the gesture’s spatial characteristics remain unchanged (see also Kelso, Vatikiotis-Bateson, Saltzman, and Kay 1985 for an alternative approach based on gestural stiffness). The purely temporal consequence of Saltzman et al.’s (2000) proposed manipulation is different from the temporal+spatial consequences of changing gestural activation intervals, discussed earlier, which are used to account for the contextual variability of articulatory movement, including timing. Thus, timing control mechanisms that yield both purely temporal and temporal+spatial effects are available in this theory. This section has discussed how the timing of individual gestures is controlled in AP/TD, i.e. how some aspects of timing patterns for individual gestures emerge from the nature of the control mechanisms. The following section addresses the timing of sequences of gestures, i.e. how successive gestures are coordinated with each other.

2.5.2 Inter-gestural timing, i.e. gestural coordination

This section discusses the second way in which timing patterns emerge from control mechanisms, this time via gestural coordination control. Because the gestural score consists of parallel tiers, one for each gesture, it also specifies how gestural activation intervals are timed relative to one another, i.e. whether and by how much they overlap.
The relative timing of gestural activation intervals is governed by the relative phasing of the planning oscillators associated with each gesture, and is therefore specified in terms of AP/TD planning oscillator ‘clock’ time, rather than in terms of solar time, which can be specified e.g. in milliseconds (Nam, Goldstein, and Saltzman 2010; Goldstein, Nam, Saltzman, and Chitoran 2009). If the planning oscillators speed up or slow down, the relative timing in terms of phase proportions will stay the same, but the absolute, solar timing relationships will differ. The gestural planning oscillators can influence one another (i.e. they are coupled in pairwise fashion), and eventually entrain, i.e. settle into stable phasing relationships, much in the same way as wagging fingers of the left and right hands eventually entrain even when they start out moving at different frequencies (Haken, Kelso, and Bunz 1985). Once the planning oscillators for a given utterance have entrained (e.g. in-phase, or antiphase), they trigger the activation of corresponding gestural tract variables. In this way, the relative phasing of the entrained planning oscillators determines the relative timing of gestural activation intervals and therefore the relative timing of the onsets of gestural movement. Desired planning oscillator entrainment patterns are specified by ‘coupling graphs’ in the mental lexicon. For example, in English, the planning oscillator for a prevocalic onset consonant gesture in a CV syllable is specified to entrain in-phase with the planning oscillator for the following nucleus vowel gesture. Because there are no other competing coupling specifications, the onset consonant gesture and nucleus vowel gesture will consequently begin at the same time, consistent with data showing near-synchrony (within 50 ms) of movement onsets for bilabial-vowel syllables in Löfqvist and Gracco (1999). Because consonant gestures are stiffer than vowel gestures, and therefore reach their targets earlier than vowels, a consonant gesture and a vowel gesture that are activated simultaneously (as proposed for CV syllables) will result in articulatory approximation of the consonantal target before approximation of the vowel target. Although fast-moving onset consonant movements and slower-moving nuclear vowel movements may be envisioned as beginning in phase, some other mechanism must be invoked to ensure that successive onset consonants in an onset cluster are serially ordered. In this framework, the non-simultaneity of the surface production of gestures in a sequence of consonants in an onset cluster is ensured by specifying an additional antiphase relationship between consonants within the cluster (Goldstein, Nam, Saltzman, and Chitoran 2009), which competes with each of their in-phase relationships with respect to the nucleus vowel (Figure 2.3).

[Figure 2.3 near here: coupling graph and resulting gestural activations for spot, with tiers for Lips (labial clo), Tongue Body (pharyngeal narrow), Tongue Tip (alveolar crit, alveolar clo), and Glottis (wide); time scale bar 50 ms.]

Figure 2.3 The coupling graph for spot (top), in which the tongue-tip (fricative) gesture and the lip-closure gesture are coupled (in-phase) to the tongue-body (vowel) gesture, while they are also coupled to each other in the antiphase mode. The pattern of gestural activations that results from the planning model is also shown (bottom). Note: Lines indicate coupling relationships between pairs of gestures—solid and dashed lines indicate in-phase and antiphase coupling modes, respectively. Source: Goldstein et al. (2006, p. 227; Figure 7.8). Reproduced with permission of Cambridge University Press. © Cambridge University Press 2006

This competition gets resolved in the planning process, according to the specified relative strengths of each coupling relationship, so that gestural activations for the two onset consonants do not overlap completely on the gestural score (Browman and Goldstein 2000; Saltzman et al. 2006). Evidence consistent with this hypothesis for English can be found in Browman and Goldstein (1988). This evidence, often called the C-center effect, shows that the temporal midpoint of a sequence of onset consonant constrictions in a cluster shows the same timing relationship with a nucleus vowel as does the temporal midpoint of a singleton onset consonant.
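A toy version of this competitive resolution treats each coupling specification as a potential term s·(1 − cos(φi − φj − target)) over the planning oscillators’ relative phases and relaxes the phases by gradient descent. The strengths, starting phases, and step size are illustrative assumptions, not the model’s settings.

```python
import numpy as np

# (i, j, target phase in rad, strength): /s/-V in-phase, /p/-V in-phase,
# and /s/-/p/ antiphase: the competing specifications for an onset like "spot"
pairs = [(0, 2, 0.0, 1.0), (1, 2, 0.0, 1.0), (0, 1, np.pi, 1.0)]

phi = np.array([0.3, -0.2, 0.0])   # initial phases: /s/, /p/, vowel
for _ in range(5000):              # gradient descent on the coupling potential
    grad = np.zeros(3)
    for i, j, target, s in pairs:
        g = s * np.sin(phi[i] - phi[j] - target)
        grad[i] += g
        grad[j] -= g
    phi -= 0.01 * grad

rel = phi - phi[2]                 # phases relative to the vowel oscillator
print(np.round(rel, 2), "consonant midpoint:", round(rel[:2].mean(), 2))
```

The two consonants settle symmetrically on either side of the vowel’s phase, so their midpoint remains in-phase with the vowel, a toy analogue of the C-center effect.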

The relative timing of a coda consonant is also expressed in terms of planning oscillator phasing:⁵ in English, the planning oscillator for a post-vocalic syllable coda consonant gesture is specified to entrain in an antiphase relationship with gestures for a nucleus vowel; that is, the coda consonant gesture is planned to begin 180 degrees into the nucleus vowel cycle. Browman and Goldstein (1990a) proposed that pronunciation alternates such as perceived perfe[k] memory vs. perfe[kt] memory could be due to variability in gestural phasing. That is, in tokens heard as perfe[k] memory, gestures for memory might be phased earlier in relation to perfect than when the listener hears the perfe[kt] memory variant, so that the acoustic reflex of the initial /m/ of memory overlaps with the final /t/ of perfect. As a result, the earlier phasing of memory results in the acoustic ‘hiding’ of the tongue-tip gesture corresponding to /t/, and the result of this gestural overlap is that the /t/ is often not heard; see also Byrd (1996). Saltzman et al. (2006) show that different degrees of variability in the relative timing of clusters like this that span word boundaries (most variability) vs. those in coda position vs. those in onset position (least variability) can be modeled by adding random noise to the difference in natural frequencies of component oscillator pairs. This modeling result reflects differences in the coupling graphs for these three types of sequences; see Figure 2.4, where, for example, in the cross-boundary case (with highly variable phasing of consonants in the cross-boundary cluster), the adjacent words are coupled with each other only via coupling of the cross-boundary nuclei, but not the cross-boundary consonants. A prediction of the planning oscillator model is that gestures in stable phasing relationships should show behaviors characteristic of coupled oscillator systems. That is, the coupled oscillators should entrain in phase and frequency (phase-locking and frequency-locking, i.e. in a fixed phase-and-frequency relationship, e.g. 1:1, 1:2, 2:1, etc.).

⁵ See Tilsen (2013) for a more recent alternative proposal.

[Figure 2.4 near here: coupling graphs A: V # C C V;  B: V C C # V;  C: V C # C V.]

Figure 2.4 Coupling graphs for syllable sequences. Note: Consonant clusters are in onset position (A), are in coda position (B), or span the syllable boundary (C). #s denote syllable boundaries. Source: Saltzman, Nam, Goldstein, & Byrd (2006, p. 71; Figure 10). Reproduced with permission from Springer Nature. © Springer Nature 2006

Saltzman, Löfqvist, Kay, Kinsella-Shaw and Rubin (1998) present evidence consistent with the general phase-locking view from an experiment involving jaw perturbation during production of /paepaepaepaepae . . . / sequences. In their study, the glottal and oral gestures for /p/ re-established a stable relative timing pattern following perturbation; the coupled-oscillator explanation is that the glottal and oral planning oscillators for /p/ are phase-locked. Another prediction is that gestures should be attracted to the most stable phasing and frequency patterns at faster oscillation rates, where in-phase is more stable than antiphase, and one-to-one frequency-locking is more stable than two-to-one frequency-locking. For example, when the rate is sped up, coupled oscillators transition from a less stable to a more stable relationship (Haken et al. 1985). For phase, the prediction is that any phasing relationship other than in-phase should shift to in-phase at a fast rate. For example, for speech, /ap ap ap ap ap ap . . . / should shift to /pa pa pa pa pa pa . . . /, because in /pa pa pa pa pa pa . . . / the consonants are in a more stable relationship with the following vowel, i.e. they are in-phase (Tuller and Kelso 1990; but see de Jong 2001b, who shows that VC tokens do not neutralize completely with CV tokens at fast rates). Goldstein et al. (2007) and Goldstein and Pouplier (2010) present evidence supporting a shift to a more stable frequency pattern from intrusive articulations in repetitive top cop tongue-twister sequences. In these sequences, word-onset [k] tongue-dorsum and [t] tongue-tip articulations are intended to occur in a 1:2 frequency relationship with the [p] lip articulation, i.e. there is one /t/ or /k/ for every two /p/s. However, the tongue-dorsum constriction and the tongue-tip constriction are sometimes mistakenly produced synchronously, particularly at faster rates. These results are consistent with the view that the tongue-dorsum and tongue-tip articulations are attracted to a more stable 1:1 frequency-locking relationship with the lip oscillations at fast rates, and moreover, that this relationship is implemented using an in-phase coordination relationship, where alveolar and velar gestures are triggered simultaneously (Goldstein and Pouplier 2010). See Chapters 5 and 10 for further discussions of this issue.
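The loss of antiphase stability at fast rates can be illustrated with the relative-phase equation of the Haken, Kelso, and Bunz (1985) model cited above, dφ/dt = −a sin φ − 2b sin 2φ, where a shrinking b/a ratio stands in for an increasing movement rate; the parameter values and integration settings below are illustrative.

```python
import numpy as np

def settle(phi0, a, b, dt=0.001, steps=20000):
    """Integrate the HKB relative-phase equation to its attractor."""
    phi = phi0
    for _ in range(steps):
        phi += dt * (-a * np.sin(phi) - 2.0 * b * np.sin(2.0 * phi))
    return phi % (2.0 * np.pi)

for ratio in (1.0, 0.1):           # high b/a = slow rate; low b/a = fast rate
    final = settle(phi0=np.pi - 0.1, a=1.0, b=ratio)
    print(f"b/a = {ratio:3.1f}: started near antiphase, settled at {final:.2f} rad")
```

At the slow rate the trajectory stays at antiphase (π ≈ 3.14 rad); at the fast rate antiphase is no longer an attractor and the system falls into in-phase (0 rad), the analogue of /ap ap . . . / shifting to /pa pa . . . /.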

2.5.2.1 Gestural coordination in languages other than English

The foregoing discussion has shown how the AP/TD coupled planning oscillator proposal can account for coordination patterns in English. A growing body of research has shown that gestural and/or planning oscillator-based accounts constitute plausible accounts of coordination patterns in other languages. For example, Goldstein et al. (2009) show how the coupled planning oscillator framework can account for coordination patterns in Georgian. In addition, the coupled oscillator framework for modeling segmental coordination has been extended to account for fundamental frequency patterns and their coordination with segments in Mandarin tones (Gao 2008), as well as patterns of coordination of intonational pitch accents with segments in Catalan, German, and Italian (Mücke, Nam, Hermes, and Goldstein 2012; Niemann, Mücke, Nam, Goldstein, and Grice 2011). Gafos (2002) has also proposed an Optimality-Theoretic account of patterns of word forms in Moroccan Colloquial Arabic that involves a system of ranked phonological constraints relating to coordination patterns for sequences of consonantal gestures, as well as constraints relating to perceptual recoverability and faithfulness to underlying representations. Readers are referred to Krivokapić (2020) and Gafos (2002) for more discussion, and to Chapter 10 for an indication of how similar phenomena would be dealt with in the phonology-extrinsic-timing/three-component approach proposed in this book.

2.5.3 Prosodic Timing

Earlier sections have discussed mechanisms for controlling within-gesture and between-gesture timing; this section discusses a third set of timing control mechanisms in AP/TD, used to account for the effects of prosodic structure on surface characteristics. Prosodic effects on timing include phenomena such as poly-subconstituent shortening, prominence- and boundary-related lengthening, and variations in overall speaking rate. There are two mechanisms for prosodic timing in this framework: 1) for manipulating timing relationships among constituents at different levels in the prosodic hierarchy, discussed in Section 2.5.3.1, and for manipulating overall speech rate, discussed later in Section 2.5.4, and 2) for stretching gestural activation intervals for prominent syllables and at phrase boundaries, discussed in Section 2.5.3.2. The first set of mechanisms is used for modeling poly-subconstituent shortening, i.e. shortening subconstituents in higher-level constituents that contain more of them. These effects are controlled via an organizational hierarchy of coupled oscillators which specify the rates of syllable, cross-word foot, and phrase production. Here the foot is delimited by lexically stressed syllables, whether primary or secondary, and can include word fragments (i.e. it is a cross-word foot, as in e.g. make allowances → [make a-][-llowances]).⁶

⁶ The AP/TD cross-word foot is different from the Abercrombian cross-word foot, which is based on higher-level prominences, cf. examples of Abercrombian feet from Abercrombie (1973, p.11), which extend from one phrasal accent to the next, and aren’t delimited by lexical stresses: | Know then thy- | -self, pre- | -sume not | God to | scan |^ |. For example, thy- and not have lexical stress, but do not begin a new Abercrombian foot.

The oscillation frequencies of these higher-level oscillators in turn affect the frequencies of planning oscillations for individual gestures, which determine gestural activation interval durations, because each activation interval corresponds to a proportion of a planning oscillator cycle. The second mechanism for addressing prosodic effects on timing accounts for prominence- and boundary-related lengthening. These phenomena are modeled by adjustments that are made to the activation intervals of all gestures that are concurrently active within a specified interval. These adjustments are made either via direct, proportional stretching of activation intervals in the gestural score (using Pi-gesture adjustment), or via the more general MuT mechanism,⁷ which accomplishes the same thing, i.e. activation interval stretching, by slowing the planning+suprasegmental oscillator ensemble oscillation frequency during a specified interval, e.g. during a single period of the syllable oscillator.

2.5.3.1 Relationships among the timing of syllables, feet, and phrases

The first set of timing mechanisms that address prosody-related timing issues involves the manipulation of timing relationships among constituents in the prosodic hierarchy. Early claims of tendencies toward ‘stress-timing’ and ‘syllable-timing’ (Pike 1945; Abercrombie 1967) are suggestive of mechanisms that result in tendencies toward isochrony of units delimited by stressed syllables (for ‘stress-timed’ languages), or of all syllables (for ‘syllable-timed’ languages).⁸ AP/TD provides an oscillator-based mechanism that can model proposed tendencies toward isochrony at different levels. Saltzman et al. (2008) proposed a hierarchy of coupled oscillators for the syllable, cross-word foot, and phrase, all of which are entrained in-phase. Their oscillation frequencies influence the oscillation frequencies of lower-level planning oscillators for individual gestures, which in turn influence gestural activation intervals (Figure 2.5). In this model, the relative coupling strength of different levels can be manipulated so that oscillator periods (i.e. constituents) of a particular level can be less variable than those of another level. In this way, adjustments to coupling strength ratios can influence tendencies toward isochrony at each of the syllable, foot, and/or phrase levels.

⁷ Similarly, spatial modulation gestures (Mus gestures) have been proposed to modulate the spatial target parameters during the interval when the Mus gestures are active (Saltzman et al. 2008).
⁸ See Turk and Shattuck-Hufnagel (2013) for an extensive discussion of some of the difficulties with this view.

Figure 2.5 Steady-state patterns of (slow) foot and (fast) syllable oscillators, with asymmetrical (foot-dominant) coupling between foot and syllable oscillators. Note: Top panel: 2 syllables per foot, with both syllable durations = 1/2 foot duration; Bottom panel: 3 syllables per foot, with all syllable durations = 1/3 foot duration. Horizontal axis = time (s); vertical axis = oscillator position (arbitrary units). Each panel starts at ϕF = 0 rad. Source: Saltzman et al. (2008: Figure 7). Reproduced with permission.

For example, Saltzman et al. (2008), following O’Dell and Nieminen (1999), propose that asymmetry in the coupling strength between suprasegmental oscillators can account for polysyllabic shortening within cross-word feet delimited by lexically stressed syllables (see Chapter 6 for discussion of evidence related to polysyllabic, or more generally poly-subconstituent, shortening). Polysyllabic shortening refers to the fact that shorter stressed syllables are observed when more syllables occur within a larger constituent (sleep is shorter in sleepy and slightly shorter still in sleepiness, Lehiste 1972), although this shortening never results in sufficient change to create isochronous productions of the larger constituents.⁹ In Saltzman et al.’s model, the coupling strength ratio determines oscillator interdependency, and can be set so that one oscillator contributes more than the other to the overall timing pattern (asymmetry). For example, a higher relative coupling strength for the foot oscillator will yield a tendency toward foot isochrony (the system attempts to keep foot duration constant, with less dependency of foot duration on the number of syllables in each foot). A tendency toward foot isochrony results in polysyllabic shortening within feet, so that syllable durations accommodate somewhat to the target isochrony of foot durations. On the other hand, a higher relative coupling strength for the syllable oscillator will yield a greater tendency toward syllable isochrony, so that the system attempts to keep syllable durations constant, with foot durations that expand according to the number of syllables within each foot. Thus a greater tendency to syllable isochrony would yield less polysyllabic shortening; the sketch following footnote 9 illustrates this trade-off. Saltzman et al. (2008) model the observation that polysyllabic shortening usually affects stressed syllables, but not unstressed syllables, by stipulating that coupling strength is strong for stressed syllables but weak or absent for unstressed syllables.

⁹ See Chapter 6 for a critical discussion of polysyllabic shortening as evidence for oscillator-based timing mechanisms.
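A toy closed form in the spirit of O’Dell and Nieminen’s (1999) coupled-oscillator analysis illustrates the trade-off: let the entrained foot period be a compromise between the foot oscillator’s preferred period and n syllables at the syllable oscillator’s preferred period, weighted by the coupling-strength ratio r. The formula and all parameter values are illustrative simplifications, not the published model.

```python
def foot_duration(n, r, T_foot=0.5, T_syl=0.2):
    """Entrained foot period for n syllables, foot:syllable coupling ratio r."""
    return (r * T_foot + n * T_syl) / (r + 1.0)

for r in (0.2, 5.0):               # syllable-dominant vs. foot-dominant coupling
    for n in (1, 2, 3):
        d = foot_duration(n, r)
        print(f"r = {r:3.1f}, {n} syllable(s): foot {d:.3f} s, "
              f"per syllable {d / n:.3f} s")
```

With foot-dominant coupling (large r), foot durations stay nearly constant, so per-syllable durations shrink as n grows (polysyllabic shortening); with syllable-dominant coupling (small r), feet expand with n and shortening is much weaker.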

Saltzman et al. (2008) also propose oscillator coupling between stress-delimited feet and phrases, to account for behavior during rhythmical speech production (Cummins and Port 1998). Cummins and Port’s experiment showed a tendency toward stable stressed syllable/phrase phasing relationships when participants were asked to repeat phrases such as Big for a duck multiple times (where Big and duck were phrasally stressed) according to a specified rhythm. The second stressed syllable, e.g. duck, was preferentially produced either 1/2, 1/3, or 2/3 of the way through a phrasal cycle, even when participants were encouraged via auditory entrainment stimuli to produce a much wider range of phasing relationships. The coupled oscillator model accounts for these patterns on the assumption that feet are phase-locked to phrases.¹⁰

2.5.3.2 Trans-gestural timing

The prosody-related timing phenomena of prominence- and boundary-related lengthening are modeled in AP/TD via Pi gestures, which lengthen the gestural activation intervals with which they overlap. Pi gestures were later generalized to MuT gestures, which accomplish the same thing (i.e. lengthening of gestural activation intervals) using a slightly different mechanism (definition follows; see Figure 2.6). These mechanisms are trans-gestural in AP/TD because they apply to all gestural activation intervals within their scope.

[Figure 2.6 near here: a π-gesture on a prosodic tier spans constriction gestures on Tract Variables 1 and 2; the π-gesture’s extent marks the domain of slowing.]

Figure 2.6 A schematic gestural score for two gestures spanning a phrasal boundary instantiated via a π-gesture. Source: Byrd & Saltzman (2003, p. 160). Reproduced with permission from Elsevier. © Elsevier 2003

¹⁰ Note however that this type of coupling may only be appropriate for periodic, rhythmicized speech, which is likely to be evoked by such repetitive tasks; see also the discussion in Chapter 6.

Pi or MuT gestures are analogous to segmental gestures in the sense that they 1) have a temporal extent, 2) form part of the gestural score, and 3) can overlap other gestures in the gestural score. However, unlike that of segmental gestures, their activity does not involve movement toward a spatial target. Instead, the trajectory of a Pi or MuT gesture defines the degree of activation interval slowing that occurs at each moment in time. That is, the amplitude of the Pi or MuT gesture at any point in time dictates the amount of proportional slowing that will occur. The temporal extent of the Pi gesture can be specified in timing units of the gestural score (Byrd and Saltzman 2003), or, as in the new MuT formulation (Saltzman et al. 2008), in terms of (a proportion of) a period of one of the oscillators in the suprasegmental hierarchy (e.g. a syllable). MuT gestures are implemented within the planning oscillator framework, where MuT gestures slow the oscillation rate of the planning oscillator ensemble, and as a consequence, lengthen the gestural activation intervals within the MuT gesture scope. In principle, the shape of the Pi/MuT gesture can vary, so that the clock rate of overlapped gestural activation intervals can change either uniformly or dynamically. For example, for boundary-related lengthening, clock rate is changed dynamically with an asymmetric trajectory showing higher values for lengthening later in the domain, so that parts of gestures closest to the boundary are lengthened more than earlier parts (Byrd and Saltzman 2003). This shape of the Pi gesture is stipulated in order to generate the greater magnitude of final lengthening observed on the rhymes of phrase-final syllables as compared to their onsets (often termed progressive lengthening, Berkovits 1994 for Hebrew; but see Turk and Shattuck-Hufnagel 2007 for a more complex pattern in American English, where lengthening was progressive toward the boundary but discontinuous).
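The following sketch illustrates the local-slowing idea: let the π-gesture’s amplitude π(t) set the momentary clock rate as dτ/dt = 1/(1 + π(t)), so material laid out in clock time τ occupies more solar time t wherever π(t) > 0, and most where the amplitude peaks (at the boundary, given the asymmetric ramp below). The warping form, window, and amplitude are illustrative assumptions, not the published implementation.

```python
import numpy as np

dt = 0.001
t = np.arange(0.0, 0.6, dt)                  # solar time (s); boundary at 0.5
ramp = (t - 0.3) / 0.2                       # amplitude rises toward the boundary
pi_amp = np.where((t > 0.3) & (t < 0.5), ramp, 0.0)
tau = np.cumsum(1.0 / (1.0 + pi_amp)) * dt   # elapsed clock time

# a gestural activation interval spanning clock time 0.25-0.40:
on, off = np.searchsorted(tau, [0.25, 0.40])
print(f"solar duration: {(off - on) * dt * 1000:.0f} ms "
      f"(150 ms with no pi-gesture)")
```

Because the amplitude grows toward the boundary, later portions of the interval are stretched more than earlier ones, yielding the progressive lengthening pattern just described.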

Byrd and Saltzman (2003) showed that gestural activation interval stretching mechanisms are more successful than other possible lengthening mechanisms in modeling boundary-related lengthening within the AP/TD framework. That is, early proposals for the Pi-gesture time-slowing mechanism involved modulating gestural stiffness (Byrd 2000; Byrd, Kaun, Narayanan, and Saltzman 2000), consistent with findings of lower peak velocity/distance ratios for boundary-related movements (e.g. Edwards et al. 1991; Byrd et al. 2000; Cho 2002, 2006). However, Byrd and Saltzman (2003) showed that this stiffness adjustment mechanism could not account for additional observations of reduced articulatory overlap at phrase boundaries (e.g. Edwards, Beckman, and Fletcher 1991; Byrd et al. 2000; Cho 2002, 2006). Consequently, they proposed a Pi-gesture adjustment mechanism that stretched gestural activation intervals, and showed through simulations that they could successfully reduce overlap and lower peak velocity/distance ratios at boundaries, consistent with empirical observations. Lower peak velocity/distance ratios at boundaries can be accounted for by gestural activation interval stretching because activation intervals in Saltzman et al. (2008) are modeled as having gradual on-ramps and off-ramps. Temporal stretching of these kinds of ramped activation intervals increases the absolute amount of time a gesture is partially activated. Because gestural parameter values (stiffness, target position) at any point in time correspond to the gesture’s underlying parameter values multiplied by its activation value, partial activation for a longer period of time will lower peak velocity. This is because the gestural target will be closer to the default, neutral value for longer, and, assuming a constant stiffness, the gestural ‘spring’ will spring back toward its target equilibrium position more slowly because it is less stretched. This provides a possible account for the fact that the peak velocity/distance relationship for articulatory movement in phrase-final position is often lower than in phrase-medial position (e.g. Cho 2002, 2006; Bonaventura 2003, and others). Note, however, that in Sorensen and Gafos’ (2016) proposal, where the restoring forces in gestural mass–spring systems are nonlinear and gestural activation intervals do not have gradual on- and off-ramps, this mechanism for decreasing peak velocity for comparable distances at phrase boundaries would not be available. An additional advantage of gestural activation interval stretching mechanisms over e.g. stiffness modulation mechanisms for modeling prosodic timing effects is that they predict patterns of interaction between durational and spatial effects. This is because a longer activation interval provides more time for the gesture’s target to be approximated. And indeed, spatial effects are often observed along with timing effects in prominent syllables and at boundaries where lengthening occurs. For example, phrasally prominent syllables are typically more hyperarticulated than non-phrasally-prominent syllables, as well as longer. Phrase-initial consonants show a similar pattern. Byrd and Saltzman (2003) show that a Pi gesture at phrase onset can model articulatory lengthening + strengthening of an initial C in a phrase-initial CV, if the V gesture is assumed to begin later than the C gesture. In this type of situation, an initial Pi gesture will have a greater influence on early parts of the CV gestural complex; the earlier part of the C gesture will consequently be lengthened proportionally more than the gestural activation intervals associated with the following V. Proportionately less overlap will result in less CV blending; consonant articulations which are not blended with the following vowel articulations are predicted to be more hyperarticulated (strengthened),
as is observed in Fougeron and Keating’s (1997) study and in many other studies showing phrase-initial strengthening. Recent work (Katsika 2012; Katsika, Krivokapić, Mooshammer, Tiede, and Goldstein 2014) has started to explore patterns of coordination of Pi and MuT gestures, specifically in situations where final lengthening and prominence interact. These issues will not be discussed here; interested readers are referred to Krivokapić (2020).

2.5.4 Global timing: overall speech rate

This section describes how one of the mechanisms described in Section 2.5.3 can be used to account for some of the effects of varying overall speech rate. It is well established that overall speech rate manipulations have complex effects on the timing of movements within an utterance. For example, unconstricted vocalic intervals are affected to a greater extent than are the constriction intervals of consonants (Gaitenby 1965). This example, as well as other effects reported in a large body of literature, suggests that speech at a fast rate is not a simple proportional rescaling of speech at a slower rate (see also Shaiman, Adams, and Kimelman 1995). Instead, findings suggest that a variety of mechanisms contribute to the pattern of reduced durations observed at fast rates (see Chapters 4 and 6 for more detail). Some of these mechanisms may involve proportional rescaling, which preserves relative timing, and others may not. Within the AP/TD framework, the effects of varying speech rate have not been investigated or modeled fully. However, one available mechanism for manipulating speech rate is to change the oscillation frequency of the planning+suprasegmental oscillator ensemble (Byrd and Saltzman 2003). This manipulation would affect several aspects of timing:

1) Intra-gestural aspects: A faster oscillation frequency corresponds to shorter activation intervals, because activation intervals are defined as a proportion of the associated planning oscillator cycle; the proportion of the cycle that is activated would stay the same, but the duration of the movement in absolute time would be shorter. Because there would not be as much time for gestures to reach their targets, there should be more target undershoot (see the sketch at the end of this section).

2) Inter-gestural aspects: Inter-gestural timing intervals would also be affected, because these are specified in terms of phasing relationships within pairs of planning oscillators. Faster planning-oscillator
frequencies predict that, all things being equal, a given phasing relationship would correspond to a shorter absolute duration between gesture onsets (but the same relative duration).

3) Prosodic aspects: Changing the oscillation frequency of a planning-oscillator ensemble would affect both global and local aspects of prosody. Globally, a faster oscillation frequency for the planning+suprasegmental oscillator ensemble yields shorter syllable, foot, and phrasal cycles. Local prosodic (trans-gestural) effects relate to the fact that MuT gestures have a scope defined by (proportions of) suprasegmental oscillator cycles (e.g. a syllable oscillator period). Thus a faster oscillation rate for a planning+suprasegmental oscillator ensemble would mean that shorter intervals in solar time are governed by the associated MuT gestures (although structurally these would be the same, i.e. a syllable would still be a syllable even though its duration is shorter). For example, if a syllable’s default duration is 50 ms, and the MuT gesture lengthens it by 50%, the lengthening would be 25 ms. If a syllable’s duration is 100 ms, and the MuT gesture lengthens it by 50%, then the lengthening would be 50 ms. Thus, this type of manipulation would yield differences in absolute amounts of lengthening in solar time at fast vs. slow rates, but the same proportional amount of lengthening. This is because the MuT gesture value specifies planning oscillation frequencies that are fractions of the original planning oscillator frequencies, according to the value of the MuT gesture at each point in time. As a result, a faster planning-oscillator frequency divided by the same MuT gesture value would yield the same amount of proportional lengthening, but a smaller amount of absolute lengthening, at the faster rate.

To our knowledge, such predictions of the model relating to speech rate remain to be tested in detail.
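For the intra-gestural effect in 1), a quick check using the critically damped closed form from Section 2.5.1.1: treating the activation interval as a fixed proportion of a planning-oscillator period, the fraction of the target distance covered falls as the ensemble’s frequency rises. Stiffness, proportion, and frequencies are illustrative values.

```python
import numpy as np

def attained(freq_hz, k=200.0, proportion=0.6, distance=10.0):
    """Distance covered within one activation interval at a given ensemble rate."""
    T_act = proportion / freq_hz         # activation interval in solar time (s)
    w = np.sqrt(k)
    return distance * (1.0 - (1.0 + w * T_act) * np.exp(-w * T_act))

for f in (2.0, 4.0, 8.0):                # slow -> fast ensemble frequency (Hz)
    print(f"{f:3.0f} Hz: reaches {attained(f):5.2f} of a 10.00 mm target")
```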

2.5.5 Summary of timing control mechanisms in AP/TD

This section has reviewed AP/TD’s account of speech timing behavior. To account for extensive systematic variability in surface timing behavior, while conforming to its intrinsic-timing commitment to spatiotemporal gestures, AP/TD has had to propose a complex set of interacting mechanisms to adjust the default activation intervals for these gestures, and to control their relative timing. Here the AP/TD parameters which affect surface timing patterns are
listed. This list gives a flavor of the complexity required in this framework to account for observed effects. • Gestural parameters (the first three are coefficients in the equation of gestural motion in Section 2.3): ◦ Stiffness (lexically specified, affects time to target approximation) ◦ Damping (set to critical, so the gestures approximate the targets quickly, but don’t oscillate) ◦ Mass (fixed at 1 in current implementations) ◦ Gesture-specific temporal manipulation for longer durations for longer distances (to our knowledge, not currently implemented in TADA), required in mass–spring systems with a linear restoring force, but not in those with a nonlinear restoring force (Sorensen and Gafos 2016). • Parameters that affect gestural activation, i.e. time and degree to which each gesture shapes the vocal tract, including time available to reach the target and time (if any) to remain in a quasi ‘steady state’: ◦ Activation rise/fall time (affects time and magnitude of movement peak velocity), required in mass–spring systems with a linear restoring force (but not in those with a nonlinear restoring force). ◦ Proportion of gestural planning oscillator cycle that defines the gestural activation interval. This proportion gives gestures enough time to approximate their targets at the default speaking rate, without any Pi/MuT gesture adjustments, and is different for consonants and vowels ◦ Planning+suprasegmental oscillator ensemble oscillation frequency (higher frequency results in shorter activation intervals, since these are specified as a proportion of a gestural planning oscillator cycle) ◦ Suprasegmental oscillator coupling strength ratios (e.g. different weightings of syllable vs. foot oscillators give different tendencies toward syllable vs. foot isochrony for different languages) ◦ Trans-gestural modifications (Pi, MuT gestures), at particular places within utterances which stretch gestural activation intervals of the gestures with which the Pi or MuT gestures overlap (for e.g. boundary-related lengthening) • Parameters that affect inter-gestural timing: ◦ Gestural planning oscillator coupling and planned entrainment patterns (e.g. C and V lexically specified to be in-phase for CV syllables, antiphase for V and C in VC syllables in English)

OUP CORRECTED PROOF – FINAL, 30/1/2020, SPi

38  /  ◦

Gestural planning oscillator coupling strengths (affects relative timing of activation intervals in e.g. CCV syllables in English owing to competing in-phase C-to-V and antiphase C-to-C couplings)
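To make the first group of parameters concrete, the following is a minimal sketch — our illustration, not code from the TADA implementation — of the point-attractor equation of motion that those coefficients enter into, with unit mass and critical damping as stated above. The stiffness values and the 90%-of-target criterion are arbitrary choices for display.

```python
# Minimal sketch of a critically damped point-attractor "gesture":
# m*x'' + b*x' + k*(x - target) = 0, with mass m = 1 and lexically
# specified stiffness k. Stiffness values below are illustrative only.
import math

def simulate_gesture(k, x0=0.0, target=1.0, dt=0.001, t_max=0.5):
    """Integrate the equation of motion with critical damping b = 2*sqrt(k)
    (approach the target quickly, without oscillating); return the trajectory."""
    b = 2.0 * math.sqrt(k)
    x, v = x0, 0.0
    trajectory = []
    for step in range(int(t_max / dt)):
        a = -k * (x - target) - b * v   # acceleration from the equation of motion
        v += a * dt
        x += v * dt
        trajectory.append((step * dt, x))
    return trajectory

# A stiffer gesture (e.g. a consonant) approaches its target faster
# than a less stiff one (e.g. a vowel).
for k in (100.0, 400.0):
    t_90 = next(t for t, x in simulate_gesture(k) if x >= 0.9)
    print(f"k = {k:5.0f}: reaches 90% of target at ~{t_90 * 1000:.0f} ms")
```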

Surface timing properties of an utterance result from default values assigned to many of these parameters (as well as lexically stored values for mass–spring stiffness, damping, and gestural targets), combined with any required contextual adjustments, specified through Pi or MuT gestures, for lengthening at special prosodic positions within an utterance (phrase boundaries, prominent syllables), and through changes to the default planning+suprasegmental oscillator ensemble oscillation frequency, for changes in global speaking rate. As noted earlier, in this phonology-intrinsic-timing-based system, surface timing properties emerge from the default specifications and adjustments, and do not need to be explicitly specified, represented, or tracked. Although this system is highly complex, it has several advantages, discussed in Section 2.7, after a summary of its key features in Section 2.6. However, in the subsequent chapters, evidence is presented that challenges these features, and has led to the consideration of an alternative approach based on phonology-extrinsic rather than phonology-intrinsic timing.

2.6 Key features of AP/TD

The AP/TD approach is characterized by four key features that distinguish it from other frameworks. The first three are 1) its use of spatiotemporal gestures as units of lexical contrast, 2) surface characteristics that emerge from phonological structure without being explicitly represented or planned, and 3) its commitment to a single phonological/phonetic planning component. Of these three, the use of spatiotemporal gestures is the most fundamental, because it is this feature which leads to emergent surface characteristics, and makes it possible to do without a separate phonetic planning component. Other features are derivative of these three, including: the non-symbolic nature of its phonological representations (2.6.1.1), articulatory goals of speech production (2.6.1.2), phonology-intrinsic timing control (2.6.1.3), with a default-adjustment approach to contextual variation (2.6.1.4), with no straightforward correspondence between AP/TD phonological time and solar time (2.6.1.5), and with no involvement of phonology-extrinsic, general-purpose timekeepers, or representation of surface timing characteristics (2.6.1.6). A fourth key feature is 4) AP/TD's assumption of the fundamental commonality between periodic and non-periodic behaviors, reflected in its use of oscillators as a modeling tool for all aspects of speech motor control (articulatory movement, coordination, and suprasegmental organization).

2.6.1 Key features 1–3: Spatiotemporal gestures as units of lexical contrast, emergent surface characteristics, and a single phonological/phonetic planning component

The units of lexical contrast in AP/TD are spatiotemporal. In this respect, AP/TD is fundamentally different from other phonological theories in which units of lexical contrast are symbolic and therefore discrete and without specific spatiotemporal values. The spatiotemporal nature of lexical representations makes it possible for surface characteristics of speech to emerge from phonological structure without being explicitly planned, and thus enables the model to do without a separate phonetic planning component. In contrast, theories which propose symbolic phonological representations necessarily include a component of grammar in which the spatiotemporal details of movement are planned, i.e. a phonetic planning component, because articulatory movement and sound cannot emerge from phonological symbols. Fowler et al. (1980) have suggested that theories in which surface characteristics emerge from phonological representations provide the advantage of not requiring translation from one type of data structure (i.e. symbolic mental representations) to another (i.e. representations for specific phonetic forms) in production, and vice versa in perception. Citing Liberman and Studdert-Kennedy (1978), Fowler et al. suggest that such a translation process would be disadvantageous, because it "involves a 'drastic restructuring'" of intended segments (p. 376), and therefore "destroy[s] crucial information about segment identity" (p. 382). That is, in production, such translation destroys information about each segment's phonological category, which listeners must somehow reconstruct to identify segments in perception. In contrast, information about contrastive categories is ever-present in theories such as AP/TD, in which phonological representation and phonetic form are both specified in phonological representations.¹¹

¹¹ See Chapter 7 for a critique of AP/TD’s solution to the translation problem.


2.6.1.1 Derivative Feature 1: Units of lexical contrast are not symbolic

Because the units of lexical contrast are spatiotemporal, they are not symbolic as the phonological representations are in traditional phonological theories (e.g. Chomsky and Halle 1968). Nevertheless, they are abstract, because the surface forms that they govern can vary according to context. This context-governed variation arises from factors such as 1) differences in gestural starting position, 2) differences in overlap with other gestures which can lead to different contributions of governed articulators to the same gestural motion, 3) differences in prosodic position and speech rate which can lead to differences in gestural activation, and 4) possible external perturbations, which, like gestural overlap, can lead to differences in the contributions of individual articulators to the gestural constriction. The phonological equivalence of the different contextual variants of a gesture is expressed through consistent values of all but one of the parameters of the equation of oscillatory motion that defines the general form of gestural movement, i.e. all but the starting-position coefficient. That is, parameter values of mass–spring stiffness and target position are stored as part of lexical representation and remain constant regardless of differences in segmental, prosodic, and/or rate of speech contexts, and, in addition, the parameter values for mass and damping are invariant across all gestures.

2.6.1.2 Derivative Feature 2: Speech production goals are articulatory rather than sensory

Within AP/TD, gestures are defined as the synergetic motion of coordinative structures of multiple articulators which act together to produce vocal tract constrictions of varying positions and degree. The goals of speech production are therefore explicitly articulatory; AP/TD contrasts fundamentally in this feature with other frameworks in which speech production goals are sensory (i.e. auditory and somatosensory), e.g. Guenther (1995); Perkell (2012), although it should be noted that auditory approaches can make use of coordinative articulatory structures to achieve the auditory goals (e.g. Guenther 1995).

2.6.1.3 Derivative Feature 3: Timing control is intrinsic to the phonological system

Unlike the phonological units in symbol-based frameworks, the phonological gestures in AP/TD include both spatial and temporal information, and therefore conform to Fowler's (1977, 1980) proposal for phonology-intrinsic timing. As discussed in detail in Section 2.5, accounting for systematic, contextual variation in the surface timing of speech movements without a separate phonetic planning component requires extra-gestural mechanisms within the phonological planning component, to control the time for which each gesture is active as well as gestural coordination. These mechanisms include the gestural activation intervals themselves, the planning+suprasegmental oscillator hierarchy which controls their timing, as well as Pi/MuT mechanisms for stretching gestural activation intervals. This type of control system has led Sorensen and Gafos (2016) to term the AP/TD model a 'hybrid' intrinsic/extrinsic timing model: timing is both intrinsic to the gestures, and also controlled by extra-gestural mechanisms.¹² However, because timing control within AP/TD is accomplished via a set of mechanisms that are specific to the phonological system, and therefore phonology-intrinsic, it will be called a phonology-intrinsic timing model in this book, to distinguish it from models in which timing is fully extrinsic to the phonology, like the Phonology-Extrinsic-Timing-Based Three-Component model proposed in Chapters 7–10.

2.6.1.4 Derivative Feature 4: AP/TD has a default-adjustment approach to contextual variability

As discussed above, in AP/TD extra-gestural (but nevertheless phonology-intrinsic and phonology-specific) mechanisms are used to adjust default gestural activation intervals, to account for contextual variability. This approach contrasts with what would be required in symbol-based approaches, namely a mechanism for developing phonetic specifications for timing and spatial properties of articulatory movements. In AP/TD, default activation intervals are specified as proportions of gestural planning oscillator cycles, ensuring that gestures are active long enough to approximate their constriction targets at a normal rate of speech. To account for systematic patterns of timing differences related to speaking rate or structural context, these default activation intervals are modified in particular prosodic contexts (e.g. at phrase edges, and in prominent positions) and at different rates of speech, through the use of Pi gestures and by changing the oscillation frequency of the planning+suprasegmental oscillator hierarchy. See Chapter 6 for discussion of some of the problems that this approach encounters.

¹² Sorensen and Gafos (2016), who have attempted to make gestures more autonomous (i.e. less dependent on gesture-extrinsic control mechanisms) through nonlinear mass–spring restoring forces and non-ramping gestural activation, nevertheless invoke extra-gestural timing control in modeling periodic behavior, and acknowledge that extra-gestural timing control mechanisms may be required to model speech timing variability: “ . . . systems external to the gesture (but not necessarily external to the phonological system) may determine variability in the duration and coordination of gestures . . . ,” p. 212.


2.6.1.5 Derivative Feature 5: No correspondence between AP/TD planning+suprasegmental oscillator 'clock' time and solar time

An important consequence of the default-adjustment approach of the AP/TD system is that these temporal adjustments warp the correspondence between AP/TD planning+suprasegmental ensemble oscillation rates (AP/TD 'clock' time) and solar (e.g. millisecond) time. This topic is addressed in Chapters 4 and 6.

2.6.1.6 Derivative Feature 6: No system-extrinsic timekeeping mechanisms, no representation of surface time

A consequence of the fact that surface characteristics emerge from the AP/TD system without being explicitly planned is that AP/TD does not make use of a phonology-extrinsic, general-purpose timekeeper to assign desired durations of movements or intervals, to measure elapsed time, or to track it. Surface time is not represented, specified, tracked, or measured, because it emerges from the interacting, oscillator-based components of the system. In fact, as has been mentioned, all aspects of surface form (spatial and temporal) emerge from the phonological plan, without having to be specified in a separate phonetic planning component.

2.6.2 Key feature 4: Shared control mechanisms for periodic and non-periodic behavior

Theories of motor control can be divided into those which view all movements as controlled by oscillator-based mechanisms (e.g. damped or limit cycle), and those which posit that point-to-point movements and movements which are periodic and repetitive should be controlled in fundamentally different ways (see Hogan and Sternad 2007). AP/TD is of the former type. Its modeling framework uses oscillators as fundamental building blocks for modeling all aspects of speech motor control: for movements toward gestural constrictions, for gestural activation, and for the control structures used to modify the gestural activation defaults. In using these oscillators, TD borrows a well-tested set of mathematical modeling tools from dynamical systems research. Oscillator-based modeling is common in many fields, and has been used to describe and predict the behavior of circadian rhythms, heart rhythms, fluids, chemical reactions, electronic circuits, and semiconductors, among other phenomena (Strogatz 1994). Of closer relevance to speech, the use of point-attractor mass–spring systems to model movements is common in non-speech motor control (Turvey 1977; Cooke 1980; Kelso 1981; Saltzman and Kelso 1987; and others). The use of oscillators follows from the AP/TD assumption that rhythmic (periodic) and non-rhythmic behaviors "have a common underlying dynamical organization" (Saltzman and Byrd 2000, p. 503), that is, that all human motor behaviors are intrinsically vibratory, and as such are "intrinsically cyclic or rhythmic but . . . need not behave cyclically or rhythmically" (Fowler et al. 1980, p. 396). That is, underlying periodic mechanisms need not result in surface periodicity. This can occur in AP/TD because it uses limit cycle oscillators for suprasegmental organization and inter-gestural coordination, and because coupling among limit cycle oscillators for suprasegmental organization yields tendencies toward surface isochrony of different units (e.g. phrase, cross-word foot, syllable), without perfect surface periodicity. However, the underlying control structures are assumed to be periodic,¹³ and this feature sharply distinguishes the AP/TD approach from other approaches to speech motor control.

2.7 Advantages of the AP/TD framework

Articulatory Phonology provides a clear advantage over traditional phonological theories because it provides an account of temporal aspects of speech. Time in anything but its most abstract form (serial ordering and timing slots) was strikingly missing from traditional phonological accounts of sound patterning in e.g. generative phonology. Incorporating time into phonological representations provided a way to account for the overlapping dynamics and reductions of speech articulation, which was impossible in traditional approaches. In those traditional approaches, speech sounds were described as bundles of distinctive features. Sequences of distinctive feature bundles, or even categories of positional allophones, provided only a rudimentary description of the dynamically changing vocal tract during speech, and had little to say about the timing of speech movements and its patterns of variation.¹⁴ In contrast, AP/TD provides an account for an impressive range of attested time-varying phenomena using a small set of structures and modeling tools, and has some clear advantages over other approaches. Several of these advantages are outlined in Sections 2.7.1–2.7.6. However, later chapters (Chapters 3–6) review phenomena that challenge AP/TD in spite of these advantages, and suggest that other approaches are required.

¹³ Another model of speech timing which assumes periodic control structures is Rusaw (2013a, 2013b), based on central pattern generators. It models hierarchical prosodic structure via an artificial neural network that includes three interconnected oscillators, with small, medium, and large periods, respectively (for e.g. syllable, cross-word foot, and phrase oscillators). Excitatory or inhibitory connections between oscillators can decrease or increase the oscillator periods and can thus be used to model e.g. phrase-final lengthening.
¹⁴ Henke's (1967) account showed how distinctive features could be mapped onto articulatory patterns, and provided an initial idea about timing in this process. In his look-ahead model of coarticulation, he proposed that movements should start as early as possible as long as they are not inconsistent with current featural specifications.

2.7.1 The ability to model the gradient nature of many utterance-level processes in spoken language better than the alternative feature-rewriting-rules approach that was standard at the time of the theory's inception

As noted earlier, AP/TD can deal with examples of acoustically apparent segment deletion in which the articulation is nevertheless maintained, e.g. perceived perfek memory for articulated perfect memory. Such cases can be accounted for in terms of articulatory overlap of the tongue-tip constriction gesture for the final /t/ with the labial constriction gesture for the initial /m/, which hides the acoustic consequences of the tongue-tip constriction and its release. In addition, it can also deal with more complex examples. For example, in traditional approaches, the pronunciation of handbag as something akin to hambag could only be described in terms of categorical feature-changing operations: 1) deletion of /d/, and 2) place assimilation, i.e. the categorical change of the place of the nasal stop from alveolar /n/ to labial /m/. In contrast, the gestural scores of Articulatory Phonology provided a different account: the /d/ is inaudible in the perceived hambag pronunciation because it is 'hidden' by the second syllable's onset bilabial gesture. That is, the /n/ is overlapped by the bilabial gesture, and this changes its apparent place. Gow and Gordon's (1995) and Gow's (2002) work supports this type of account. They found that listeners listening to a highly overlapped production of right berries (which is reported by listeners as ripe berries if they are asked to make an explicit report) nonetheless show evidence that they have accessed the word right, in a cross-modal priming lexical decision task (in which no explicit report of the word as ripe or right is required). It is now generally accepted that gestural hiding can occur, with or without the preservation of acoustic cues to the target sounds of the words involved. However, there is also evidence that complete feature-changing processes can occur in phrase-level processing. For example, Ellis and Hardcastle (2002) showed, using articulatory measures, that /n/ can be articulated as a typical /ŋ/ in sequences such as . . . ban cuts . . . , where the articulation of the nasal was indistinguishable from that in . . . bang comes . . . , in some fast-rate tokens. See also Maiteq (2013) and Zsiga (1997), and further discussion in Chapters 7 and 10 of the implications of categorical assimilation processes for the nature of phonological representations.

In addition, AP/TD provides an account of many other phenomena that have been observed to be gradient, such as coarticulatory and reduction/lenition effects, which thus cannot easily be accounted for in a framework where contextually based variability involves one or more binary feature changes. Many of these cases are modeled by partial gestural overlap and/or differences in gestural activation intervals due to e.g. rate of speech or prosodic context.

2.7.2 An account of surface variability despite phonological invariance

Movement trajectories that instantiate gestures of the same phonological category differ according to their context in a particular utterance. For example, articulator movements toward a lip constriction target can differ on different occasions 1) because their starting positions differ, 2) because they are co-produced with other gestures, and/or 3) because articulators involved in producing the constriction can compensate for each other. Despite these differences, movements toward the same constriction target share an invariant phonological representation in AP/TD, because each one is generated by the same equation of motion (apart from the starting-position coefficient). Gestures exhibit this property because: 1) they are specified in terms of point-attractor dynamics, which means that they will converge toward the same constriction target regardless of starting position (they have the property of equifinality), 2) gestures can be co-produced with other gestures (and/or articulator-neutral attractors) when they overlap on the gestural score: gestural co-production results in articulations which reflect aspects of all concurrently active gestures (blending); and 3) the mapping from each gesture to the activity of a coordinative structure (synergy) of multiple articulators is such that a variety of different articulator contributions can be used to produce the same constriction. This type of non-unique gesture–articulator mapping accounts for the compensatory behavior exhibited when one articulator involved in a constriction is perturbed; e.g. when the jaw is loaded, so that the lower lip that rides on it cannot move toward the upper lip to form a bilabial constriction, the upper lip compensates during the production of such a constriction (Folkins and Abbs 1975, inter alia).


2.7.3 An account of the cross-linguistic frequency of CV syllables

It is well-established that CV syllables are cross-linguistically more common than VC syllables (Jakobson and Halle 1956; Greenberg, Osgood, and Jenkins 2004; Bell 1971). This fact is explained within AP/TD by the assumption that C-to-V coordination within a syllable is governed by the in-phase entrainment of planning oscillators, as compared to antiphase entrainment for V-to-C intra-syllable coordination. The preponderance of CV syllables is thus predicted by the greater stability of in-phase (as opposed to antiphase) coordination patterns, which is a general principle of oscillatory movements (e.g. Kelso, Holt, Rubin, and Kugler 1981), because more stable patterns are believed to survive over time.
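The stability asymmetry between in-phase and antiphase coordination can be illustrated with a small simulation. The sketch below is ours, not AP/TD code; it integrates a relative-phase equation of the Haken–Kelso–Bunz type that is standard in coordination dynamics, and the coupling constants a and b are arbitrary illustrative values. Both 0 (in-phase) and π (antiphase) are fixed points, but the in-phase well is steeper, so recovery from a perturbation is faster there.

```python
# Minimal sketch (our illustration): relative phase phi between two coupled
# oscillators follows dphi/dt = -a*sin(phi) - 2*b*sin(2*phi), the gradient
# of the potential V(phi) = -a*cos(phi) - b*cos(2*phi).
import math

def time_to_recover(phi_fixed, a=1.0, b=1.0, perturb=0.5, tol=0.01, dt=0.001):
    """Perturb phi away from a fixed point (0 = in-phase, pi = antiphase)
    and return how long it takes to settle back within `tol`."""
    phi = phi_fixed + perturb
    t = 0.0
    while abs(phi - phi_fixed) > tol:
        dphi = -a * math.sin(phi) - 2.0 * b * math.sin(2.0 * phi)
        phi += dphi * dt
        t += dt
        if t > 100.0:            # antiphase is not even stable when a > 4b
            return float("inf")
    return t

print(f"in-phase  recovery: {time_to_recover(0.0):.2f} s")
print(f"antiphase recovery: {time_to_recover(math.pi):.2f} s")  # slower: shallower well
```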

2.7.4 A sophisticated account of most known speech timing effects

One of the most impressive advantages of the task dynamics approach is that it provides a sophisticated account of most known speech timing phenomena. As discussed earlier, the set of surface behaviors that AP/TD's system can successfully generate includes velocity profiles of single movements, the approximately equal duration of movements of different distances, durational differences between segments, interactions between temporal and spatial effects via activation interval adjustments, inter-articulator relative timing (coordination), as well as higher-level structural (prosodic) timing properties, such as constituent-final lengthening, constituent-initial lengthening, prominence-related lengthening, and polysyllabic and polysegmental shortening. Although it will be argued later that the mechanisms by which AP/TD models surface behavior are not consistent with what is known about human speech processing, nevertheless AP/TD currently represents the most extensive and compelling account of speech timing available in the literature.

2.7.5 Planning an utterance does not require planning the details of movement

In AP/TD, planning an utterance involves determining the sequence of words to be said, their prosodic structure, and an overall rate of speech; once these phonological decisions have been made, the planned utterance unfolds without further cognitive processing. A substantial advantage of this approach is that speakers do not have to compute, plan, or specify the details of movement trajectories, because they are determined by the task dynamical system, given its linguistic input. Because timing is intrinsic to the AP/TD oscillators, surface timing patterns emerge from the model without requiring explicit specification by the speaker when planning an utterance. Speakers' timing-related planning is therefore minimal: Speakers can optionally adjust speaking rate by adjusting the frequency of the suprasegmental oscillator ensemble, and can make choices about prosodic structure that will dictate the occurrence of Pi- or MuT gestures at particular positions in prosodic structure (e.g. for boundary-related or prominence-related lengthening). Similarly, the spatial paths of unperturbed articulatory movements are determined by the linguistic plan and overall rate of speech: speakers don't have to plan the details of movement paths.

2.8 Conclusion

The goal for this chapter was to present a description of the AP/TD approach to phonology and speech motor control, in enough depth to allow for an evaluation of the theory in terms of its account of speech timing behavior (Chapters 3–6). This description highlighted AP/TD's commitment to phonology-intrinsic timing, and reviewed the types of control and adjustment mechanisms it has adopted to achieve emergent surface timing and spatial characteristics appropriate to different utterance contexts. These representations and mechanisms have allowed AP/TD to avoid the need for a separate phonetic planning component, and thus to avoid translating from the data structures of phonology to different data structures in phonetics. AP/TD avoids specifying surface timing characteristics in surface (solar) timing units, and relatedly avoids using phonology-extrinsic (e.g. general-purpose) timekeeping units and mechanisms, and thus retains its commitment to phonology-intrinsic timing. As summarized in Section 2.7, this approach has many advantages, including providing an account of lexical contrast, while at the same time generating realistic movement trajectories as well as many known aspects of speech timing patterns.

However, the evidence that will be presented in Chapters 3–6 motivates the consideration of phonology-extrinsic timing approaches that include a separate phonetic planning component. Chapter 3 provides a look at the design features of AP/TD that raise some questions about the desirability of its approach. Chapter 4 presents the core motivation for considering an alternative approach to speech motor control based on phonology-extrinsic, general-purpose timing mechanisms: Several lines of evidence from the speech and non-speech motor control literature challenge AP/TD's phonology-intrinsic approach to timing control, characterized by temporal phonological representations, emergent timing, and phonology-specific timekeeping. Chapter 5 presents evidence relating to movement coordination in speech production. This evidence suggests that inter-articulator coordination control can be based on utterance-specific movement endpoints, rather than on movement onsets as implemented in AP/TD. It further suggests that oscillator-based control mechanisms are not required, since other mechanisms are available, e.g. time-to-target-approximation and/or spatial control. And finally, Chapter 6 relates to AP/TD's mechanisms for modeling timing effects of suprasegmental structure. It reviews evidence presented in Chapter 4 that challenges AP/TD's default-adjustment approach for modeling boundary-related and prominence-related lengthening, and presents additional evidence that challenges AP/TD's coupled-oscillator approach to poly-subconstituent shortening. In sum, the evidence presented in Chapters 3–6 motivates the alternative approach based on symbols and phonology-extrinsic timing that is presented in Chapters 7–10.


3 Evidence motivating consideration of an alternative approach

As noted in Chapter 2, the AP/TD approach accounts for many aspects of the surface timing of speech movements, including higher peak velocities for longer-distance movements, inter-articulator coordination/overlap, and aspects of timing relating to prosodic structure. As a result, it generates overlapping articulatory trajectories which are plausible in light of what has been observed using articulatory measures, such as electromagnetic articulometry and real-time MRI. AP/TD has the advantage of accomplishing this while avoiding complex online computations during speech production: surface timing patterns can be achieved in this model without explicit planning, because the planning takes place in the representational language of a gesture-based phonology. That is, once the speaker has put the gesturally-defined words into an appropriate prosodic structure, and determined overall speaking rate, the timing unfolds automatically, without any need for further processing. These advantages have led to the computational implementation of the AP/TD theory (in the form of TADA software, www.haskins.yale.edu/tada_download/index.php) and an intensive effort to develop it further to account for additional findings (see Krivokapić 2020 and references therein).

However, as one looks more deeply into the implications of the AP/TD architecture, a disquieting lack of fit between several of its design features and what is known about the speech signal and human speech processing begins to emerge. This chapter presents three of these design features, along with the evidence that suggests the advisability of considering a different approach. This sets the stage for the presentation of evidence in Chapter 4 that more directly challenges a core assumption of AP/TD: that the timing of articulatory movements is intrinsic to the phonological system. As will be seen, that additional evidence supports a mechanism of general timing control that is extrinsic to the phonological system. Before turning to that, however, this chapter discusses three aspects of speech behavior that induce some disquiet with the AP/TD approach. First, because the number of contextual factors that influence surface phonetic form is large, the default-adjustment feature of AP/TD's design (e.g. the application of Pi and MuT gestures to the default activation intervals) will be substantially more extensive than what is currently envisioned; this suggests that default adjustment may not be the wisest choice of architecture for modeling this pervasive aspect of speech behavior. Adjustments to default specifications would be most appropriate if durational patterns were invariant most of the time; evidence suggesting pervasive and widespread adjustments, in response to a large number of different factors, provides a motivation for considering a different approach.

The next source of disquiet is the set of motor timing behaviors described by Fitts' law (Fitts 1954) relating distance, accuracy, and duration of movement. Some aspects of these behaviors have been modeled within AP/TD but others have not, and there is no single, unified explanation for all aspects of the law within the theory. An alternative theory with design features that can provide a natural account of the entire set of behaviors would be desirable.

A third design feature that begins to raise questions is AP/TD's gestural-score-based organization of utterances, which may increase the risk of spatial interference among gestural movements that are planned to occur simultaneously. Again, this characteristic raises the question of why an architecture was adopted that left open the possibility of such an undesirable behavioral consequence.

Taken together, these issues begin to indicate that a different type of model architecture might be worth considering for speech production, i.e. one based on phonology-extrinsic timing mechanisms. The following chapters will provide additional motivation for this alternative approach, in the form of evidence that supports symbolic phonological representations (Chapters 4 and 7), and phonology-extrinsic, general-purpose, non-speech-specific timing mechanisms to track and specify timing characteristics in units that correlate with solar time (Chapter 4), as well as evidence for non-oscillatory approaches to movement coordination (Chapter 5) and to suprasegmental organization (Chapter 6).

3.1 AP/TD default specifications require extensive modifications

The AP/TD architecture specifies activation intervals for each gesture, and defines these intervals as a proportion of a planning oscillator cycle. It assumes a default planning oscillator frequency that yields gestural activation intervals that give gestures enough time to approximate their targets. These default activation intervals can be adjusted according to both speech rate and prosodic position. Thus the AP/TD system can be considered a default-adjustment architecture. However, a default-adjustment system may not be ideal, because evidence from both non-speech and speech motor control suggests that the number of gestural modifications required to account for all aspects of surface behavior may be larger than currently acknowledged within AP/TD. A list of factors that are known to influence timing aspects of movement is provided below; these are in addition to the prosodic and rate-of-speech factors currently modeled within AP/TD in ways that have been described earlier. To account for these additional factors, AP/TD would need to make new use of existing mechanisms and/or would need to propose new mechanisms. As noted above, AP/TD's default-adjustment architecture would be most appropriate if durational patterns were invariant, or even proportionally invariant, most of the time. Instead, the evidence challenges the view that there is a single, predominant durational pattern for each gesture type. Factors known to influence durational aspects of movement that are currently not modeled within AP/TD include:

1) Improvement with practice (for non-speech: Hansen, Tremblay, and Elliott 2005; Elliott et al. 2004; Khan, Franks, and Goodman 1998; Khan and Franks 2000; for speech: e.g. Prather, Hedrick, and Kern 1975; Schulz, Stein, and Micallef 2001; Reilly and Spencer 2013). In general, practice tends to reduce overall movement times, with greatest reductions early in learning (Schulz et al. 2001). Whereas reductions in overall rate tend to reduce spatial accuracy in AP/TD because they can lead to undershoot, practiced movements generally become more accurate and more efficient, with relatively earlier peak velocities to allow movements more time during deceleration to home in on the target.

2) How a task was previously performed (for non-speech: Rosenbaum et al. 1986; van der Wel et al. 2007; Ganesh et al. 2010; for speech: Rosenbaum et al. 1986; Turk and White 1999; Chen 2006; Dimitrova and Turk 2012). For example, in van der Wel et al.'s (2007) study, participants moved a vertical dowel held upright on a planar surface from target to target in time to a metronome. Some of the movements involved clearing an obstacle between targets, and results showed that the spatial paths of dowel-raising hand movements between targets that followed the obstacle-clearing movements were higher than those seen in control, non-obstacle trials, even when the successive movements were made with the opposite hand. Van der Wel et al. propose that planning movements involves setting parameters (e.g. for hand, spatial characteristics, etc.) in abstract representations for movement. On their view, parameters are set by modifying settings used in previous movements, and there is a cost related to the size of parameter changes that is minimized over the whole task. Turk and White (1999), Chen (2006), and Dimitrova and Turk (2012) present speech-related timing findings that are consistent with this view. Their experiments show 'spill-over' effects in studies of focus-related phrasal prominence on duration in speech. For example, Chen (2006) showed that in Mandarin Chinese, speakers lengthen syllables within a constituent that is pragmatically focused (e.g. emphasized). In addition, syllable durations are longer in syllables immediately adjacent to (primarily following) the pragmatically focused constituent, where the magnitude of the 'spill-over' effect is smaller than the effect of focus within the focus domain.

3) Listener-related factors, such as the speaker's estimate of the ability of the listener to understand what is being said, and of the ability of the listener to see the talker (Kuhl et al. 1997; Burnham, Kitamura, and Vollmer-Conna 2002; Uther, Knoll, and Burnham 2007). For example, infant-directed speech shows enhanced vowel formant contrasts and longer vowel durations as compared to adult-directed speech (Uther et al. 2007), whereas foreigner-directed speech shows enhanced vowel formant contrasts without durational differences.

4) Stylistic factors. For example, Winter and Grawunder's (2012) study of formal vs. informal speech styles in Korean showed that formal speech is characterized by lower average fundamental frequency and reduced pitch range, longer durations, less spectral tilt, and more noisy breath intakes.

These examples from published data suggest that an extensive range of factors influence phonetic variability relating to timing. To account for the effects of each of these factors on gestures, AP/TD would need additional adjustment mechanisms and/or would need to make more extensive use of existing adjustment mechanisms such as suprasegmental oscillator frequency adjustments and/or Pi or MuT mechanisms. And because speakers are also able to vary the degree of durational adjustment (e.g. a greater degree of final lengthening for a particular reason), the theory would need to provide ways of computing the amount of adjustment for each context in each planned utterance.

The next section turns to observations about motor timing control which are addressed in the AP/TD framework to some extent, but do not seem to find a complete or principled account in that approach. Like the evidence presented above, these phenomena suggest that alternatives to AP/TD may be worth considering.

3.2 Relationships among distance, accuracy, and duration are not fully explained in AP/TD

It has been well-known since Woodworth (1899) that there is a systematic relationship between distance, spatial accuracy, and movement duration; this relationship is described by Fitts' law (Fitts 1954). This section explains how AP/TD accounts for some, but not all, aspects of the law, and suggests that phonology-extrinsic-timing-based models which allow the specification of desired surface-movement durations and the optimization of these values may provide more comprehensive and principled explanations (e.g. Schmidt et al.'s 1979 impulse-variability theory; see also Harris and Wolpert's 1998 minimum endpoint variance approach).

Fitts' law (Fitts 1954) was derived from results of experiments in which participants alternately tapped a stylus within each of two targets of width W whose centers were separated by a distance D. Participants were asked to minimize their movement times. Average movement time (MT) was found to increase with distance and decrease with target width in the following way: MT = a + b[log₂(2D/W)], where D = distance, W = target width, and a and b are constants; [log₂(2D/W)] is considered the 'index of difficulty'. Subsequently, Schmidt et al. (1979) found that for rapid simple aiming tasks, in which movements were required to be made in a specific time, the relationship between MT and D/W is linear. What Fitts' law means is that for a given target width (= spatial accuracy criterion), movement time increases with distance. In order to move faster while maintaining spatial accuracy, distance must be decreased. Moving faster while maintaining distance will have the consequence of decreased spatial accuracy, while decreasing a movement's accuracy criterion will make it possible to move the same distance in less time.
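As a rough illustration — ours, not drawn from any of the cited studies — the law can be expressed in a few lines of code. The constants a and b below are arbitrary placeholder values; in practice they are fit to data for a particular effector and task.

```python
# Illustrative sketch of Fitts' law: MT = a + b * log2(2D / W).
import math

def fitts_movement_time(distance, width, a=0.05, b=0.1):
    """Predicted movement time (s) for movement amplitude `distance` and
    target width `width` (same units); log2(2D/W) is the index of difficulty."""
    index_of_difficulty = math.log2(2 * distance / width)
    return a + b * index_of_difficulty

# Same accuracy criterion (width): doubling the distance adds one "bit"
# of difficulty, and hence b more seconds of movement time.
print(fitts_movement_time(distance=10, width=2))   # index of difficulty ~3.32
print(fitts_movement_time(distance=20, width=2))   # index of difficulty ~4.32
```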

Keele's (1981) review reports that Fitts' law applies to monkeys and to human children and adults, and to different human effectors (upper limbs, foot, arm, hand, fingers, eyes). It applies to different tasks, such as alternating movements of a stylus to targets on a desk, placing disks over pegs, pointing, reaching, grasping, positioning movements of keys and joysticks, moving a computer mouse, head movements, and speech movements, and it applies to different movement sizes. In speech, well-documented differences in intrinsic vowel duration for high vs. low vowels (Peterson and Lehiste 1960), where low vowels are consistently longer than high vowels for the same [tense] feature specification, have a natural explanation in this account (House 1961; Delattre 1962; Ostry, Keller, and Parush 1983). That is, longer distances must be traveled to produce low vowels from closed vocal tract configurations for consonants, as compared to high vowels, and according to Fitts' law, the increased distance should require longer durations if target spatial accuracy is to be maintained.

Another speech phenomenon that Fitts' law may explain is the co-occurrence of durational reduction and spectral variability in vowels such as schwa. For example, in English, schwa is notorious for its short duration and for the fact that its formants are variable, and F2 in particular is highly predictable from the surrounding phonemic context (Browman and Goldstein 1992b; Bates 1995, among others). In AP/TD, duration relates to the spatial accuracy of movement because the gesture's activation interval determines the amount of time available for the gesture to reach its target. A shorter activation interval thus predicts more target undershoot.¹ Along these lines, Browman and Goldstein (1992b) showed that AP/TD can account for the short duration and average spatial attributes of schwa in English in different segmental contexts.

A different aspect of Fitts' law concerns the relationship between distance and duration: moving further takes longer as long as accuracy (target width) remains constant. Two proposals exist in AP/TD for modeling this relationship: 1) Saltzman, Löfqvist, and Mitra's (2000) proposal to adjust the ticks of a gesture-specific clock according to gestural movement distance, and 2) Sorensen and Gafos' (2016) proposal to make the restoring force in the mass–spring system non-linear, which has a similar effect. Despite these proposals, some aspects of Fitts' law have no principled explanation in Saltzman, Löfqvist, and Mitra's (2000) proposal, and others are left unaccounted for in both proposals. In particular, Saltzman, Löfqvist, and Mitra's (2000) proposal to adjust the speed of a gesture-specific clock according to gestural movement distance has no principled explanation in their model. Sorensen and Gafos' (2016) proposal of a non-linear restoring force for AP/TD mass–spring systems is more principled, because it accounts for the longer duration of longer-distance movements while at the same time accounting for more realistic relative timing of the velocity peak than earlier proposals. However, neither the degree of clock speed adjustment in Saltzman, Löfqvist, and Mitra's (2000) proposal, nor the non-linear restoring force in Sorensen and Gafos' (2016) proposal, has any consequences for movement accuracy;² these AP/TD approaches therefore capture only part of Fitts' law. In addition, although duration decrease in AP/TD, as implemented through a decrease in gestural activation interval, results in a greater likelihood of target undershoot, this undershoot will be the same for all repeated movements governed by the same gestural score; there is no increase in movement endpoint spatial variability, or 'target width', with shorter activation intervals, as predicted by Fitts' law.

To sum up, although Sorensen and Gafos (2016) have provided a more principled explanation than Saltzman, Löfqvist, and Mitra (2000) for the longer duration of longer-distance movements, Fitts' law leads to the additional expectation that both duration decrease and distance increase for the same movement duration should result in a decrease in the spatial accuracy at the movement target. In AP/TD, activation interval duration relates to accuracy of target achievement in a limited way, by increasing the likelihood of undershoot, but distance manipulations have no effect on the spatial accuracy of target achievement. Thus, only a portion of the results of Fitts' law is accounted for. In addition, Fitts' law suggests that adjusting target spatial accuracy requirements should have predicted consequences for movement durations. AP/TD currently has no provision for explicitly specifying or adjusting the spatial accuracy requirements of constriction targets in speech. That is, in AP/TD, effects on spatial accuracy are only emergent from duration specifications, and not the other way around: specifying spatial accuracy as a goal is not possible. Yet specifying spatial accuracy might be required for e.g. clear speech, where spatial accuracy appears to be important and durations are often longer. If specifying and/or manipulating accuracy requirements should be part of the speech planning process, AP/TD would need to be amended. While it might be possible to account for all aspects of Fitts' law within a dynamical systems framework (see e.g. the suggestion in Kelso 1992), this will require some substantive changes. That is, the accuracy considerations presented here do not appear to emerge naturally from the existing AP/TD architecture, and suggest the need to at least amend important aspects of the current AP/TD approach.

¹ For further discussion of the relation between movement duration and both spatial and temporal accuracy, see Chapter 8.

² Whether the movement reaches the target is determined instead by the degree of overlap with other gestures and consequent gestural blending, and by the amount of time available in the gesture’s activation interval.


Alternatively, it could be useful to consider different approaches to modeling the control of timing in speech movements that have the potential to provide a more principled account of Fitts' law. Chapter 8 discusses alternative theories of this type. Many of these theories assume a mechanism to assign desired surface durations, and propose that desired duration (or time-to-target) can be optimized to meet task requirements (e.g. target spatial accuracy) at minimum cost in time or energy. Following Schmidt et al.'s (1979) explanation for Fitts' law, based on the idea that faster movements require bigger impulses, i.e. greater areas under the force–time curve when the movement is accelerating, Harris and Wolpert (1998) propose that faster movements require control signals of greater magnitude (i.e. greater neural activity associated with motor commands). Because neural control signals are assumed to be noisy in proportion to their magnitude, the larger neural control signals required by faster movements result in more endpoint spatial inaccuracy as compared to those for slower movements. Harris and Wolpert (1998) propose that movements of greater distance also require higher-amplitude control signals than their shorter counterparts, if they have a greater average velocity (as they would if produced in approximately the same amount of time as movements of shorter distance). The greater amount of motor noise associated with the larger control signal will result in more accumulated noise at the end of the movement, and as a result the movement will be less spatially accurate. In this approach, to ensure consistent spatial accuracy, additional time is required for longer-distance (and hence noisier) movements, in order to provide enough time to reach the target accurately. Thus, this approach provides a more complete account of Fitts'-law-related effects than AP/TD, and moreover provides an explanation of why these effects arise. That is, Harris and Wolpert's (1998) account of Fitts' law is attractive because it explains the observed relationships among distance, spatial accuracy, and time using a single principle, i.e. motor noise that derives from the size of the control signal.

Taken together, these observations suggest it would be useful to develop a model that provides a single unified account of the observations captured in Fitts' law, as well as a mechanism to allow speakers to specify spatial accuracy as a goal, with consequent effects on duration. Harris and Wolpert's (1998) optimization approach that relates duration to the effects of motor noise on spatial accuracy plausibly provides such a unified account; see Chapter 8 for a discussion of optimization and its relevance to speech modeling.
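The logic of this signal-dependent-noise account can be shown in a small simulation. The sketch below is ours, not Harris and Wolpert's implementation: it uses a simple bang-bang (accelerate-then-brake) point-mass reach, and the noise scale and other parameter values are arbitrary illustrative assumptions. The point is only that, for a fixed distance, a shorter duration demands larger commands and therefore yields more endpoint scatter.

```python
# Minimal sketch of signal-dependent motor noise: per-step noise is
# proportional to the command magnitude, so faster (larger-command)
# movements end up less spatially accurate.
import random
import statistics

def endpoint_sd(distance, duration, dt=0.01, noise_scale=0.05, trials=2000):
    """Simulate a bang-bang point-mass reach many times and return the
    standard deviation of the movement endpoint across trials."""
    n_steps = int(duration / dt)
    u = 4.0 * distance / duration ** 2   # |acceleration| needed to cover the distance
    endpoints = []
    for _ in range(trials):
        x = v = 0.0
        for step in range(n_steps):
            command = u if step < n_steps // 2 else -u   # accelerate, then brake
            noisy = command + random.gauss(0.0, noise_scale * abs(command))
            v += noisy * dt
            x += v * dt
        endpoints.append(x)
    return statistics.stdev(endpoints)

# Same distance: halving the duration quadruples the required command,
# and endpoint scatter grows accordingly.
for duration in (0.4, 0.2):
    print(f"{duration:.1f} s movement: endpoint sd ~ {endpoint_sd(10.0, duration):.3f}")
```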


3.3 Distinct synchronous tasks cause spatial interference

This section discusses a third feature of AP/TD which begins to raise questions about its adequacy as an account of established observations. In AP/TD, speech involves multiple, separately controlled, spatially independent tasks (gestures) which are often synchronous. Some synchronous tasks result in what is typically described as a single sound; for example, a bilabial closing gesture synchronized with a glottal abduction gesture is heard as a [p]. Other synchronized tasks in AP/TD result in what is typically described as a sound sequence: the production of a CV syllable involves the simultaneous onset of consonantal and nucleus vowel gestures, with the acoustic consequence that the vowel target landmarks reach fruition later in time, because vowel gestures are intrinsically slower (less stiff) than consonant gestures. The AP/TD synchronous-production view contrasts with other theories in which stretches of speech are conceptualized as a sequence of goals to be realized one after the other (Shaffer 1982; Stevens 2002). For example, Stevens' (2002) theory suggests that /apa/ is conceptualized as a sequence of three phonemes, which, in relatively clear speech, can be signaled by a sequence of four landmarks: 1) a vowel target, 2) a voiceless closure onset, 3) a closure release, and 4) a second vowel target, with each of these targets planned to occur at a separate moment in time. (The onset of voicing for the vowel is an acoustic event, but not an acoustic landmark, because it does not signal an articulator-free feature, but rather an articulator-bound feature; see Halle 1992.) On this view, although the close temporal proximity of the sequence of movement goals will result in the temporal overlap of the movements that produce these goals, the stretch of speech is nevertheless conceptualized by the speaker as a sequence of temporally separate target landmarks.

With regard to this issue of how the task is conceptualized, a potential problem with a model architecture like that of AP/TD, which places the overlapping and often synchronous coordination of separate, spatially independent tasks at center stage, is that the temporal overlap of distinct tasks can raise the risk of undesirable spatial interference. This has been clearly shown in non-speech domains. For example, patting one's head with one hand while rubbing one's tummy with the other is notoriously difficult; the up-and-down patting movement of one hand interferes with the circular rubbing motion of the other, and vice versa. This parlor trick has been reproduced and quantitatively measured in a number of experimental settings as well. For example, in a laboratory study of a related phenomenon, Franz, Zelaznik, and McCabe (1991) found that repetitively drawing a circle with one hand and a line with the other to a metronome beat made the circle more line-like, and the line more circle-like. Similar findings of spatial interference have been observed elsewhere: Franz and Ramachandran (1998) found that when participants used one hand to draw repetitive lines and made a twirling motion with their index finger with the other (using different muscles), the lines became more circle-like and the circles became more line-like. They found that this effect was observed even for amputees who report phantom movement for one of the hands. In addition, spatial interference does not appear to require repetition. For example, drawing a single repetition of the number 6 in the air with one hand while drawing a clockwise circle with the foot is just as difficult, even when the drawings are only made once;³ see Marteniuk, MacKenzie, and Baba (1984) for a laboratory demonstration of spatial interference for non-repetitive tasks. See also Mechsner et al. (2001) for another experimental demonstration that the way a task is conceptualized influences its implementation.

These findings raise the possibility that distinct, synchronous tasks in speech that involve different articulators moving in different directions (i.e. for different gestures on the gestural score that begin synchronously) might also engender undesirable spatial coupling. As is well-known, there is co-articulation (in the articulatory sense of gestural overlap) when two successive speech sounds are produced with the same articulator, but the arguments presented here apply to cases where different articulators moving in different directions are coordinated. For example, syllable-onset [n] production involves two synchronous movements of velum lowering and tongue-tip raising, and if these are both planned as distinct, synchronous tasks, there might be a risk of incomplete velum lowering and/or incomplete tongue tip–palate contact. In AP/TD, synchrony of movement onsets also occurs between traditionally sequential elements, e.g. C and V gesture activations in a CV sequence are proposed to begin synchronously. As a result, spatial coupling (and thus distortion of the movement of the articulators) might occur whenever articulators for synchronous C and V gestures move in opposite directions, as is required in e.g. [ta], where the tongue tip raises for [t], and the tongue body lowers for [a].⁴

³ Thanks to Jim Scobbie for pointing this out to us.
⁴ The production of single gestures can also involve the movement of different articulators in different directions, but in this case spatial interference is not expected, because the movements of different articulators that form part of a coordinative structure are in service of a single task (i.e. the gesture).


of single gestures can also involve the movement of different articulators in different directions, but in this case spatial interference is not expected because the movements of different articulators that form part of a coordinate structure are in service of a single task (i.e. the gesture).) However, Franz et al. (2001) propose that spatial interference can be reduced, if not eliminated, if there is a unified conception of the task. For example, although holding a jar with one hand and opening the lid with another can be planned as two separate, distinct tasks, which might engender spatial interference, people often conceptualize them as a single unified task. Franz et al. (2001) propose that conceptualizing the task in this way decreases the likelihood of spatial interference. They tested this proposal by asking participants to repetitively draw semicircles with each hand; in two conditions, the semicircles were parallel (bump up for each hand; bump down for each hand), and in two other, non-parallel, tasks the semicircles either 1) formed a circle, or 2) formed a less-recognizable configuration (concave-up semicircle on top of concave-down) (Figure 3.1). The two tasks which involved non-parallel movements provided opportunities for spatial interference between the hands, and, as expected, results showed greater variability and lower accuracy in the two non-parallel tasks. However, tellingly, the condition where the semicircles formed a circle was much less variable and more accurate than the condition in which the concaveup semicircle was on top of the concave-down semicircle. The authors interpreted their results to mean that spatial interference can be overcome by conceptualizing independent movements as part of a single unified task. That is, where the two hands contributed to drawing a circle, and therefore BOTTOM TOP BOTTOM TOP BOTTOM TOP BOTTOM TOP

Figure 3.1 Schematic diagrams of the templates for the four experimental conditions in Franz et al. (2001). Source: Franz et al. (2001, p. 106). Reproduced with permission of Taylor & Francis Ltd, http://www.tandfonline.com © Taylor & Francis 2001


That is, where the two hands contributed to drawing a circle, and therefore contributed to a single, unified task goal, accuracy improved and variability was lower than when the tasks could not be easily unified in this way. See Mechsner et al. (2001) for an additional example.

As noted above, separate movements that begin synchronously but contribute to the same gestural task would not be expected to show spatial interference (as for AP/TD's coordinative structures that create single gestures). However, movements that contribute to separate, synchronous gestures might be expected to show interference. As far as is known, however, such spatial interference doesn't occur in normal speech (although comprehensive experiments to test this hypothesis have not been carried out). There could be several reasons for its apparent absence. One possibility is that speech articulators behave differently from articulators that are known to show spatial interference in synchronized tasks, e.g. the two hands, arms, or legs. If the speech articulators are not predisposed to spatial interference when performing synchronous, distinct tasks, then findings of spatial interference in non-speech motor activity (as discussed) would not be relevant to speech production, and there would be no implications for AP/TD.

A second possibility is that speech doesn't show spatial interference because the movements used to produce speech sounds are not conceived of as temporally synchronous separate tasks (as they are in AP/TD), but are rather conceived of as contributing to a sequence of single unified tasks. For example, the lip protrusion and tongue-body raising movements involved in [u] production might be conceived of as contributing to a single phonemic, unified vocal-tract-configuration or acoustic landmark goal (involving low F1 and F2), rather than as synchronous but distinct gestural tasks, and this unified way of conceiving of [u] production may prevent spatial interference between bilabial protrusion and tongue-body raising movements. Likewise, sequences of sounds (e.g. CV syllables) may be conceived of as sequences of phonemes, or of vocal-tract-configuration or acoustic landmark (Stevens 2002) task goals, rather than as independent actions that are synchronized at their onsets (as proposed in AP/TD). For example, the movements involved in consonant–vowel syllables may be conceived of as contributing to the achievement of a sequence of separate speech landmark targets for the C and the V; this type of representation of speech goals may prevent spatial coupling between consonants and vowels in CV syllables. A system architecture in which articulations contribute to meeting sequentially organized phonemic, whole-vocal-tract-configuration, or acoustic landmark goals would thus be preferred over one in which different synchronized gestures are conceptualized as independent tasks, as in AP/TD.


Further experimentation will be necessary to determine whether spatial interference does arise in speech.

It is interesting to consider how this hypothesis might apply when phrasal intonation and/or lexically contrastive tonal patterns are taken into account. In symbol-based phonological theories, tones are associated with tone-bearing units, such as syllables or vowels. In this sense, at the phonological level of representation, tones and the segments that make up the tone-bearing units can be said to represent distinct tasks that are to be realized synchronously, and this might raise the risk of undesirable spatial coupling between segmental movements and tonal contours, e.g. difficulty in lowering the articulators when a rising intonation contour is being produced. However, translating these phonological representations into sequences of speech-sound targets or landmarks, in which each target is conceptualized as a unified speech-sound goal, may prevent this type of coupling. Recent findings suggest that the phonetic implementation of intonation involves tonal targets that are temporally aligned with respect to separately represented segmental landmarks (see Ladd 2008 for a review); these findings are consistent with the view that at the phonetic planning stage, connected speech is represented as a sequence of speech-sound targets.

To summarize this section, if movements of the speech articulators are similar to the movements of the two hands or the hand and the foot, in being at risk of spatial interference when performing distinct (but simultaneous) tasks, then the findings from general motor studies reviewed here suggest that models of speech production that avoid the risk of spatial interference may be desirable. That is, models in which the production of each speech-sound landmark⁵ is represented as a single unified goal, and connected speech is represented as a sequence of such unified speech-sound goals, may be preferable to models such as AP/TD in which each speech sound is represented by multiple, distinct, synchronous overlapping tasks, and connected speech is represented as a complex sequence of overlapping tasks. Alternatively, one could maintain a model which treats speech as an ensemble of distinct, synchronous tasks (such as AP/TD), but add mechanisms to prevent spatial interference. However, if AP/TD is to cope with data like these, it would need to evolve in the direction of greater complexity, thereby losing some of its original attractive simplicity.

⁵ Note that Stevens proposes that single phonemes are often realized with more than one landmark, e.g. a closure and a release landmark for obstruents.


3.4 Issues not currently dealt with

In addition to these problems, there are other aspects of the production planning process that remain to be dealt with within AP/TD. These include a principled way of determining Pi (or MuT) gesture amplitudes, and of determining the locations of the phrase boundaries which Pi gestures instantiate. See Chapter 10 for some discussion of how related issues can be handled within the proposed extrinsic-timing-based three-component model.

3.5 Summary

This chapter has discussed three types of phenomena that cause some concern about the general architectural assumptions of the AP/TD approach. First, there are many systematic timing behaviors that are currently not modeled in AP/TD; it might be possible to model them in the AP/TD framework, but doing so would require a large number of additional adjustment mechanisms. AP/TD's default-adjustment approach would be more reasonable if the default specifications could be used most of the time, but this appears not to be the case for speech. Second, Fitts' law lacks a full explanatory account within the theory. While certain aspects of Fitts'-law-related regularities have been modeled in the AP framework, and others might eventually find such a treatment there, these behaviors are not predicted by AP and do not find an explanatory account in that framework. Third, observations of spatial interference among synchronous movements in non-speech behavior suggest that AP/TD's gestural-score architecture may have some unforeseen disadvantages, due to its reliance on simultaneous in-phase onset of some gestures that are conceptualized as independent tasks, which leaves it open to such interference phenomena. Finally, there are additional aspects of the production planning process that lack an explanation in the current version of AP/TD.

While it is possible that these issues can be dealt with in AP, they do not find a natural account within that phonology-intrinsic-timing-based framework. In contrast, they suggest the desirability of a model architecture that allows for more flexibility in specifying contextually appropriate timing patterns, provides a more complete account of the speed, distance, and accuracy trade-offs that have been observed (Fitts' law), and has a sequential, non-overlapping architecture for speech-sound goals that avoids the possibility of spatial interference between discrete synchronous movements. More specifically, these problems can be dealt with in an alternative model of the type described later in this volume: one that postulates an abstract symbolic phonological system, and involves a Phonetic Planning Component separate from the Phonological Planning Component, to specify the details of speech acoustics and speech movement.


In this approach, the goals of speech production are sequences of acoustic landmarks and other cues to contrast, and the surface durations of movements and intervals appropriate to each context are computed using an optimization approach. These surface durations are represented and specified using general-purpose, phonology-extrinsic timekeeping mechanisms.

Before the presentation of that framework in later chapters, Chapter 4 will discuss findings from the timing-behavior literature that provide a more substantial challenge to the intrinsic-timing assumption that lies at the core of the AP/TD approach. These findings are perhaps the strongest motivation for considering an alternative, phonology-extrinsic-timing-based model, because they find a more straightforward account in theories with symbolic (i.e. non-gestural) phonological representations and phonology-extrinsic, non-speech-specific, general-purpose timing mechanisms that track and specify surface time in units that correlate with solar time. That chapter, which motivates phonology-extrinsic timing, will be followed by two other chapters which provide additional motivation for considering an alternative to phonology-intrinsic timing approaches to speech motor control: Chapter 5 (Coordination) and Chapter 6 (Prosodic governance of surface phonetic variation). Together, these chapters motivate the consideration of the alternative, phonology-extrinsic timing approach to speech production, which is sketched out in Chapters 7–10.
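To give a schematic sense of the optimization idea mentioned above (the concrete formulation is developed in Chapters 7–10; the cost terms and weights shown here are illustrative placeholders rather than the model's actual definitions), the planned surface duration T of an interval can be pictured as the value that minimizes a weighted sum of competing costs:

\[ T^{*} \;=\; \arg\min_{T}\; \big[\, w_{e}\,C_{\text{effort}}(T) \;+\; w_{p}\,C_{\text{perception}}(T) \;+\; w_{t}\,C_{\text{time}}(T) \,\big] \]

where effort-related costs typically fall, and time costs rise, as T increases, and perceptual costs penalize durations that endanger the recoverability of contrasts. The point of the sketch is only that the quantity being optimized is a surface duration, which presupposes phonology-extrinsic timekeeping.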


4 Phonology-extrinsic timing: Support for an alternative approach I

4.1 Introduction

Chapter 3 presented evidence that raises some initial questions about the AP/TD approach to modeling speech production. This chapter discusses further evidence that motivates consideration of an alternative to AP/TD's phonology-intrinsic-timing approach: an alternative that is based on phonology-extrinsic timing. Before turning to this evidence, a brief review and summary of phonology-intrinsic timing in AP/TD is in order.

In AP/TD, timing is phonology-intrinsic because phonological representations for lexical contrast (gestures) are spatiotemporal (Fowler 1977). Additional mechanisms extrinsic to the gestures, but still intrinsic to the phonology, are used to control the amount of time each gesture is active, as well as inter-gestural relative timing. Surface durations emerge from gestural representations once they are active, and do not have to be specified or tracked by timekeeping mechanisms extrinsic to the phonology. Timing control is accomplished through the use of system-specific oscillators that are part of phonological representation, without any reference to solar time (e.g. in milliseconds). As noted in Chapter 2, these oscillators are of two types: point-attractor (critically damped mass–spring) oscillators for forming constrictions, as well as for adjusting the timing of local gestural activation intervals, and limit-cycle (freely oscillating) oscillators for specifying default gestural activation intervals, overall speech rate, temporal compression effects within constituents, and inter-gestural (phase-based) coordination. Surface timing patterns result from oscillator-related properties such as their natural frequencies (mass–spring stiffness), the proportions of oscillator periods used for gestural activation, inter-oscillator coupling strength, and stable entrainment patterns. The planning+suprasegmental oscillator frequencies have default values, but local adjustments to these default values can be used to model e.g. boundary-related lengthening and prominence-related lengthening, by slowing or speeding those frequencies, resulting in longer or shorter surface intervals (as measured in solar time, i.e. milliseconds).


In this way, surface timing patterns emerge from the interaction between the intrinsic timing of spatiotemporal gestures and the phonology-specific timing-control mechanisms that determine gestural activation.

Consistent with the AP/TD view that surface timing properties do not have to be specified, because they emerge from the phonological system, findings in the literature suggest that surface timing patterns can be emergent, or at least partially emergent, in certain non-speech behaviors. For example, this is the case for continuous hand movements, particularly where well-defined temporal intervals are absent, e.g. in periodic, repetitive circle drawing (Robertson et al. 1999; Zelaznik, Spencer, and Doffin 2000; Zelaznik and Rosenbaum 2010; Repp and Steinman 2010). However, as will be shown in this chapter, many other movement behaviors show timing characteristics that require an alternative timing explanation, in which the surface durational patterns are explicitly planned in order to achieve the goal(s) of movement. That is, there is a substantial body of evidence that is inconsistent with emergent timing, because it suggests that the timing of speech production, as well as of many other types of motor activity, often involves the explicit planning of the timing of surface intervals, using one or more general-purpose timekeeping mechanisms. This evidence motivates consideration of an alternative approach to speech-timing control, based on phonology-extrinsic, general-purpose timing mechanisms, which is discussed in more detail in Chapters 7 and 10.

In the phonology-extrinsic-timing approach, phonological representations for speech sounds are symbolic,¹ i.e. categories without specific spatiotemporal content, rather than spatiotemporal. In these symbol-based systems, nothing about the symbolic representation predicts the surface timing plan; instead, an additional phonetic planning component is required to specify surface timing, and other aspects of context-governed surface phonetic form, to meet the task requirements of the utterance.

¹ Phonological representations in AP/TD are spatiotemporal, and are therefore not symbolic. However, the spatiotemporal representations in AP/TD can be considered abstract, because there is not a one-to-one mapping between phonological representation of each gesture and surface realization. This is because 1) the same gesture can be produced with differing contributions of the articulators that are used to produce it, e.g. lip closure can be produced with different contributions of the upper and lower lips, and jaw, depending on the nature of articulatory overlap and possible perturbations to speech, 2) starting positions of articulators producing a given gesture can vary depending on adjacent context, and 3) gestural realizations can be modified by adjustments to gestural activation intervals via suprasegmental Pi or MuT gestures and changes to overall speech rate. These suprasegmental and speech-rate adjustments will affect the amount of time for which a gesture influences the vocal tract (gestural activation), and will consequently affect how closely the articulators approximate the intended target at the end of gestural activation in a particular utterance.


This chapter presents a number of lines of evidence that challenge phonology-intrinsic-timing-based approaches to speech timing. The first line of evidence shows that one particular aspect of movement timing, i.e. the timing of the endpoint, is often less variable in repeated movements than the timing of other parts of movement. This evidence is difficult to explain in phonology-intrinsic timing approaches where all parts of movement correspond to the phonological goal, as they do in the particular types of phonology-intrinsic timing models used in AP/TD and Sorensen and Gafos (2016), and in some phonology-extrinsic models, e.g. Fujimura (1992) et seq., where a phonological representation maps onto an entire movement trajectory. That is, in models such as AP/TD, where the phonological representation is described by an equation of motion that defines all parts of movement (apart from the starting position), it is not possible to 'pick out' one part of movement so that it can be prioritized for timing accuracy. In contrast, phonology-extrinsic-timing models have the potential to provide a straightforward account of these observations, because a) a symbolic phonological goal can be associated with a particular part of the movement, e.g. the movement endpoint, and b) this part can therefore be prioritized for timing (and spatial) accuracy.

A second line of evidence shows that the representation of time in motor activity can be independent of the representation of spatial information. This evidence is difficult to reconcile with proposals where the timing information for movement is integrated with spatial information, as it is in AP/TD's spatiotemporal phonological representations. In contrast, it finds a natural explanation in accounts of speech production like the symbolic-phonology-based system with separate phonetic planning and motor-sensory components that is presented later in this volume, in which temporal and spatial information are separately represented.

A third set of studies provides evidence for the mental representation of surface time and time-to-expected-event occurrence, in the planning for both non-speech- and speech-related motor activity. This evidence is difficult to reconcile with AP/TD and other phonology-intrinsic timing approaches, in which surface time is emergent and not represented, specified, or tracked. It is also difficult to reconcile with AP/TD because the periods of the phonology-internal 'clock' used in AP/TD to control gesture activation (the planning+suprasegmental oscillator ensemble) bear no straightforward correspondence to solar time. This is seen, for example, in phrase-boundary and prominent positions, where the phonology-internal 'clock' is slowed, and also when the phonology-internal 'clock' is sped up for faster speaking rates. In these situations, there is no change in the number of phonology-internal 'clock' ticks, and therefore e.g. phrase-final syllables are not 'longer' in AP/TD time than phrase-medial syllables, although the intervals change in surface duration (as measured in solar, e.g. millisecond, time units).


Thus, AP/TD can generate appropriate surface durations in boundary-adjacent positions, but it doesn't have a representation of these durations, which the evidence presented in Section 4.3.1 requires. Finally, two additional lines of evidence are presented for the use of general-purpose mechanisms that specify planned surface durations: 1) the fact that timing variability increases with interval duration, suggesting that the duration of surface intervals is timed in solar time, consistent with a 'noisy' timekeeper, and 2) neural evidence that the brain tracks and represents time. Together, this evidence motivates the consideration of phonology-extrinsic-timing systems for speech processing that make use of general-purpose timekeeping mechanisms to represent, specify, and track surface time, as an alternative to phonology-intrinsic timing.
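The first of these two lines of evidence is often summarized as the scalar property of interval timing (Gibbon 1977). To a first approximation, and setting aside the duration-independent noise floor found in real data, the standard deviation of produced or estimated intervals grows in proportion to the mean interval:

\[ \sigma(T) \;\approx\; k\,T \]

where T is the mean target duration and k is a Weber-fraction-like constant. This is the signature expected of a noisy general-purpose timekeeper operating over surface durations.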

4.2 A challenge to the use of mass–spring oscillators in the implementation of timing effects

This section presents two types of evidence that challenge the use of mass–spring oscillators to model timing effects: one set of studies that supports the separate representation of the timing of the part of movement most directly related to the goal (often the endpoint, as opposed to the onset or other parts of a movement), and a second set of studies showing that temporal aspects of movement are represented separately from spatial aspects.
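For reference in what follows, the point-attractor dynamics at issue can be written in the standard task-dynamic form (cf. Saltzman and Munhall 1989); the notation here is the generic textbook statement rather than any one implementation:

\[ m\,\ddot{x} \;+\; b\,\dot{x} \;+\; k\,(x - x_{0}) \;=\; 0 \]

where x is the tract variable, x₀ is the gestural target, k is the stiffness, and b is the damping coefficient, set for critical damping (b = 2√(mk), with m often normalized to 1). Because this single equation generates the entire trajectory from the initial conditions, no individual point on the trajectory, such as the endpoint, has an independent status that would allow it to be separately prioritized for timing accuracy.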

4.2.1 Evidence for the separate representation of the timing of movement endpoints vs. other parts of movement

In his 1998 paper, Dave Lee notes: "it is frequently not critical when a movement starts—just so long as it does not start too late. For example, an experienced driver who knows the car and road conditions can start braking safely for an obstacle a bit later than an inexperienced driver." This observation suggests that the timing of goal attainment should be less variable than the timing of movement initiation. This section presents evidence from repeated movements elicited in controlled laboratory experiments that confirms Lee's observation.


These data suggest that actors are able to separately represent, and differentially prioritize, the timing of different components of movement, i.e. goal-related parts such as movement endpoint or release, over other parts of movement, such as movement onset.² These findings are difficult to explain in mass–spring point-attractor models, because in these models the representation for movement is described by a single equation of motion, which fully specifies the entire movement trajectory (path and timing). That is, the timing of movement offset is fully predicted by the timing of other parts of the movement trajectory, including the movement onset. On the other hand, models that involve symbolic representations are able to map these symbolic representations onto the parts of movement that are most closely tied to the movement goal(s). As a result, such models are able to represent the movement goal(s) separately from the means to achieve them, and thus are better able to account for findings of low timing variability in the parts of movement most closely related to the goal, on the view that these parts are prioritized for timing accuracy. (See Shaffer 1982, Semjen 1992, and Billon, Semjen, and Stelmach 1996 for the importance of endpoint timing, and Todorov and Jordan's 2002, 2003 Minimal Intervention Principle.) Note that the fact that the movement target is a parameter of movement in AP/TD does not mean that the target can be singled out as a part of movement that is independent of other parts. This is because the values for the movement target, starting position, spring stiffness, and damping all influence the entire trajectory of movement in AP/TD; as a result, no single part of the trajectory can be 'picked out' so that it can be prioritized above the others.

Consistent with Lee's observation, Gentner, Grudin, and Conway's (1980) study of keypress timing in typing found lower consistency in the start times of keypress movements than in the end times, for two typed repetitions of the same sequence performed by an experienced typist (Figure 4.1). The median difference in start times across the two repetitions was 58 ms (grey dots across the two panels), compared to a difference of 10 ms for end times (black dots). Additional evidence for lower timing variability at movement endpoint for repeated movements can be found in periodic tapping data (Billon, Semjen, and Stelmach 1996; Spencer and Zelaznik 2003; Zelaznik and Rosenbaum 2010). For example, Spencer and Zelaznik (2003) found that timing variability in repetitive tapping was lower at finger touchdown than at the time of peak velocity.

² It is often the case that the part of movement most closely related to the goal in speech is the movement endpoint. For example, the endpoint of lip protrusion is most closely related to the goal for /u/, but for geminate consonants followed by a vowel, the timing of the beginning of the release movement toward the following vowel may be the most relevant for signaling the geminate status of the consonant.



Figure 4.1 Start and end times (in milliseconds) of keypress movements for two repetitions of the same . . . an epic . . . sequence. Note: The start times for a and space were not measured. Top panel = first repetition, Bottom panel = second repetition. Based on a similar figure in Gentner et al. (1980, p. 3).

Zelaznik and Rosenbaum (2010) found similar results for tapping, in that timing variability of contact with the tapping surface was lower than that of maximum finger extension. Interestingly, however, both Spencer and Zelaznik (2003) and Zelaznik and Rosenbaum (2010) found a different pattern of results for circle drawing, that is, no evidence for differences in timing accuracy at different points in the circle cycle.


For example, in Zelaznik and Rosenbaum (2010), the variability at cycle onset (0 degrees) was no different from the timing variability at the spatial location opposite to cycle onset (180 degrees). This evidence is consistent with the emergent-timing view of continuous circle drawing, that is, that timing in such tasks is primarily emergent from dynamic characteristics and has minimal involvement from a timekeeping mechanism. See Zelaznik and Rosenbaum (2010) and Studenka, Zelaznik, and Balasubramaniam (2013) for evidence less consistent with emergent timing for circle drawing when it creates a perceptual (auditory or tactile) event, and Repp (2008) and Repp and Steinman (2010) for more nuanced discussions.

Although speech production data on this topic are limited, the available data show timing variability patterns that are consistent with those observed for periodic tapping and typing; that is, they show less timing variability at the goal-related movement endpoint than at other parts of movement, measured relative to a reference event. Perkell and Matthies (1992) studied timing variability for upper-lip protrusion movements during spoken /i_u/ sequences, where the number of intervocalic consonants was varied systematically.³ They observed lower variability in the timing of movement endpoint (maximum protrusion) relative to voicing onset for /u/, as compared to the timing of a point after movement onset (maximum acceleration) relative to voicing onset for the same vowel.⁴ This pattern suggests a tighter temporal coordination of maximum lip protrusion (movement endpoint) with voicing onset than of lip-protrusion movement onset with voicing onset, and suggests that the timing coordination of the movement endpoint has higher priority than the timing of other parts of these speech movements (see Figures 4.2 and 4.3). In other words, having protruded lips at the onset of voicing appears to be the prioritized goal for these speakers. Taken together, these results are consistent with the view that the most task-relevant features of motor performance have the least variability (Winter 1984; Lacquaniti and Maioli 1994; Scholz and Schöner 1999; discussed in Scott 2004 and Todorov and Jordan 2002). AP/TD's gestural representations do ensure that spatial targets can be reached regardless of starting position, and can thus account for less spatial variability at targets compared to other parts of movement when these are produced in different contexts.

³ Upper-lip movements in these sequences were less likely than lower-lip movements to be influenced by phonemes in the /i_u/ sequence other than /u/, because upper-lip movements are not required for those phonemes.
⁴ The point of maximum acceleration was chosen as the beginning of the measured movement interval to ensure that this interval was not governed by phonemes other than /u/; presumably, by the time maximum acceleration has been reached, the movement is entirely under the control of the /u/ segment.


.    –  i End

ACOUSTIC

71

VBeg CONSONANT DURATION

/Cn/

/i/ MBeg

/u/ MEnd

PROTRUSION

MOVEMENT INTERVAL VELOCITY

ONSET INTERVAL

ACCELERATION

OFFSET INTERVAL

PROTRUSION DURATION

ACC MAX .2 sec.

Figure 4.2 Schematic illustration of data extraction. Note: From top to bottom: (1) a segment of the acoustic signal (ACOUSTIC), (2) lip protrusion (PROTRUSION), (3) lip velocity (VELOCITY) and (4) lip acceleration (ACCELERATION) versus time. Acoustic events in the time-expanded acoustic signal are end of the /i/ (iEnd) and beginning of the /u/ (Vbeg). Movement events are: movement beginning (mBeg), movement end (mEnd), and maximum acceleration (AccMax). Source: Redrawn from Perkell, Joseph S., & Melanie L. Matthies (1992, p. 2915; Figure 3). Temporal measures of anticipatory labial coarticulation for the vowel /u/: within-subject and cross-subject variability. Journal of the Acoustical Society of America, 91(5) with the permission of the Acoustical Society of America.



Figure 4.3 Scatter plots of protrusion duration interval versus consonant duration (left column); onset interval versus consonant duration (middle column), and offset interval versus consonant duration (right column) for lip-protrusion movements from four participants’ /i_u/ sequences (shown in each of four rows). Note: Further details in Perkell and Matthies (1992). Source: Redrawn from Perkell, Joseph S., & Melanie L. Matthies (1992, Figure 9). Temporal measures of anticipatory labial coarticulation for the vowel /u/: within-subject and cross-subject variability. Journal of the Acoustical Society of America, 91(5) with the permission of the Acoustical Society of America.


However, findings of lower timing variability for repeated movements (of the same movement distance) at the goal-related movement endpoint, as compared to other parts of movement, are difficult to explain in mass–spring models, in which the timing of all parts of movement is specified by the same equation of motion. In these models, it isn't possible to represent the movement endpoint separately from other parts of the movement trajectory so that it can be timed more accurately. As proposed above, the differential timing variability of the (task-relevant) goal-related part of movement vs. other parts of movement can be more straightforwardly accommodated in models in which the timing of the goal-related part of a movement (often the endpoint) is represented and specified separately from other parts, and its temporal coordination is planned with higher priority (Shaffer 1982; Billon, Semjen, and Stelmach 1996); see also Chapter 5 for many additional pieces of evidence for endpoint-based coordination. As discussed in Chapter 7, these findings can be explained in three-component models of speech production which have symbolic phonological representations, where the parts of movement usually most closely related to the symbolic phonological representations are accorded highest priority, for both timing coordination and spatial accuracy. See Chapter 5 for further evidence of greater timing accuracy at goal-related parts of movement, including greater timing accuracy at the point of maximum extension of hand movements associated with stressed syllables in speech (Leonard and Cummins 2011).
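The analysis logic behind these comparisons can be made concrete with a small simulation. The sketch below (illustrative Python; the noise magnitudes and the control policy are our own assumptions, not any study's data or code) generates repeated movements in which endpoint timing is the planned goal, then compares across-repetition timing variability at onset vs. endpoint, as the studies above do:

    import numpy as np

    rng = np.random.default_rng(0)
    n_trials = 1000

    # Hypothetical control policy: the endpoint time is the planned goal and
    # carries only a small timekeeper noise; initiation time is freer, and
    # movement duration is adjusted to compensate (cf. Katsumata and Russell
    # 2012, where later starts were paired with faster, shorter movements).
    goal_time = 0.500                                        # planned endpoint (s)
    onset = 0.100 + rng.normal(0.0, 0.030, n_trials)         # loosely timed start
    endpoint = goal_time + rng.normal(0.0, 0.008, n_trials)  # tightly timed goal

    # Dependent measure used in the studies discussed above:
    # standard deviation of event times across repetitions.
    print(f"SD of onset times:    {onset.std() * 1000:.1f} ms")
    print(f"SD of endpoint times: {endpoint.std() * 1000:.1f} ms")

    # Under this policy, movement durations absorb the onset jitter,
    # so later starts yield shorter movements (a negative correlation).
    durations = endpoint - onset
    print(f"corr(onset, duration): {np.corrcoef(onset, durations)[0, 1]:.2f}")

A mass–spring gesture has no analogous degree of freedom: once onset time, stiffness, damping, and target are fixed, the time course of the approach to the target is fully determined, so onset jitter must propagate to the endpoint rather than being absorbed.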

4.2.2 Temporal information is represented independently of spatial information in motor activity

This section presents a second type of evidence which challenges mass–spring oscillator models of phonological representation. Mass–spring models are spatiotemporal in nature; that is, there is no separate representation of temporal vs. spatial information. Evidence from motor learning studies suggests, however, that temporal information can be represented and learned independently of spatial information, at least for intervals greater than 500 ms. This type of evidence is difficult to reconcile with AP/TD's gestural representations, since gestures in AP/TD are modeled as mass–spring point-attractor oscillators that integrate the spatial and temporal properties of movement, without an independent representation of timing.


An example is Torres and Andersen's (2006) study, in which monkeys were taught to reach to different targets for a small reward (juice), with blocks of straight reaches alternating with blocks of reaches around obstacles. Movements lasted 500–1000 ms. Although the spatial paths of the obstacle-avoidance movements were consistent from the first trial, remaining virtually invariant over months of training, the speed profiles of the movements changed over time, both in terms of the number of velocity peaks and in terms of the value of the first peak. Speeds became faster, and movements became more efficient, as evidenced by single velocity peaks in later trials, due to fewer accelerations and decelerations. These results show that the monkeys learned temporal aspects of movement independently of spatial aspects: spatial aspects were mastered from the very first trial, but mastery of temporal aspects proceeded much more slowly.

Kornysheva, Sierk, and Diedrichsen (2013) and Kornysheva and Diedrichsen (2014) present findings from human participants that corroborate these results, this time for sequences of movements. They show that timing information learned from one sequence of movements can be transferred to another, once the spatial task has become familiar. Kornysheva, Sierk, and Diedrichsen's (2013) experiment involved training participants to learn a sequence of five finger key presses and their timing (inter-press intervals took five possible values ranging from 500 to 1700 ms). The goal of the experiment was to see whether there would be a benefit in a series of new trials when 1) reproducing the same sequence of presses and their timing (same spatiotemporal pattern), 2) reproducing the same sequence of presses with a different timing (same spatial pattern, different temporal pattern), and 3) reproducing the same timing pattern but with a different sequence of presses (different spatial pattern, same temporal pattern), as compared to a baseline where neither the temporal pattern nor the sequence of presses was preserved. In all cases, participants were given visual instructions as to which finger to use for the press, and this information was visible until the time for the next key press. The time to begin each key press from the onset of each new visual finger cue (key-press reaction time) was measured. Results showed a benefit in key-press reaction time for all three types of reproduction, but the independent temporal benefit was only observed after the new sequence of key presses was established (3rd trial in a series of 10). These results suggest that timing information is represented separately from spatial information, and that the benefits of learning timing information can only take effect after the spatial information has been learned, i.e. after the timing patterns can be linked to a spatial representation.⁵


Kornysheva and Diedrichsen (2014) showed that the pre-motor areas (involved in movement planning) coded the temporal and spatial information involved in these types of tasks separately, in contrast to the motor cortex, where integrated spatiotemporal coding was observed.

The next section turns from evidence for separate representations of endpoints vs. other parts of movements, and of temporal vs. spatial aspects of movements, to the more general question of whether there is good evidence that surface durations must be represented in the control of movement.

4.3 Evidence for the mental representation of surface durations

AP/TD's phonology-intrinsic timing approach does not require (or even allow) speakers to represent the desired timing of surface events; instead, surface timing patterns emerge from an interaction among the spatiotemporal phonological representations, the default gestural activations, and the adjustments to the default gestural activations specified in the plan for an utterance. Surface durations are thus not themselves specified. Moreover, although AP/TD does have limit-cycle oscillators (its planning+suprasegmental oscillator ensemble, used to drive the length of time that each gesture is active) which could be considered a 'clock', the AP/TD 'clock' only operates before and during speech activity, and its time units bear no straightforward correspondence with time used in the external world (e.g. solar time). This is because the units of the planning+suprasegmental oscillator ensemble (i.e. the periods of the AP/TD 'clock') are adjusted for different speech rates, and are slowed at special prosodic positions by Pi and Mu gestures. This 'time-warping', used in the AP/TD system to account for systematic surface durational variability due to rate and prosodic structure, means that AP/TD 'clock' time bears no straightforward correspondence with solar time (with its relatively stable timing units). Moreover, even if AP/TD had a mechanism to represent the surface duration of intervals in AP/TD 'clock' units (which it currently does not), this type of representation would be difficult to use in interactions with predictably timed events in the external world, since the AP/TD 'clock' frequency, or AP/TD timing-unit rate, is so variable and has no fixed relationship with solar time.

⁵ Kornysheva, Sierk, and Diedrichsen (2013) report that these results reconcile previous seemingly contradictory results in the literature (Salidis 2001; Shin and Ivry 2002; Ullen and Bengtsson 2003; O'Reilly et al. 2008; Gobel, Sanchez, and Reber 2011), because the previous studies offered fewer opportunities for the spatial sequence to be learned prior to the testing of temporal transfer.


This section presents two types of evidence that suggest that humans have a set of stable and reliable timekeeping mechanisms to measure, represent, track, and specify surface durations during perception and action, and that temporal plans and representations can be separate from spatial plans and representations for movement. The first type of evidence comes from actors' interactions with perceived events (Section 4.3.1), and the second type of evidence suggests that constraints and goals in speech production are represented in terms of surface durations (Section 4.3.2). This evidence suggests that a) humans and other animals have representations of the surface durations of timed intervals, b) these representations can be used in both production and perception behaviors, and c) humans use general-purpose timekeeping mechanisms to keep track of surface durations in units that correlate with solar time.

4.3.1 Evidence for the mental representation of surface durations, from actors' interactions with perceived and predicted events

Two types of evidence from actors' interactions with external events suggest that actors represent the surface durations of movements. The first is that timing in action is often related to timing in perception (4.3.1.1), and the second is that patterns of anticipatory behavior in both humans and non-human animals require representations of timed intervals (4.3.1.2).

4.3.1.1 Timing in action is often related to timing in perception in tasks involving both
Both timing in perception and timing in action are involved in tasks such as interception (e.g. catching a ball) and braking to avoid collision. Evidence from such tasks suggests that humans are able to couple the timing of their actions to the timing of perceived events in order to 'be at the right place at the right time' for successful interception, and to successfully avoid collision. For example, Lee et al. (1983) suggest that the knee and elbow angles of participants jumping up to hit a falling ball appeared to be continuously tuned to the time it would take for the ball to reach the location of collision, assuming the ball continued at its current rate. Similarly, Lee (1976) showed that braking behavior is closely related to the perception of time-to-collision at the current gap-closure rate.


These findings are consistent with the view that time is represented; more specifically, they suggest that a timekeeper (or timekeeping system) continuously keeps track of time-to-event-occurrence at the current movement rate, and that the timing of action is continuously coupled to it (Lee 1998). However, these early findings are also consistent with a different view within the framework of Ecological Psychology (Gibson 1979): unfolding action might be coupled to unfolding perception without involving a timekeeping mechanism per se. For example, when looking directly at an oncoming object, the size of the oncoming object divided by its rate of change directly specifies the amount of time until collision at the current gap-closure rate (Lee 1976, 1980).

However, more recent work is more difficult to explain without representing time explicitly. In particular, Katsumata and Russell's (2012) study of the timing of participants batting a ball dropped from heights of 1, 1.3, and 1.5 m, with most of the ball-fall hidden (i.e. only the initial drop and the last 200 ms of the ball-fall were visible), showed that participants were equally successful at batting balls whose initial trajectory was mostly occluded as they were at batting balls whose trajectories were fully visible. Arguably, they could not have waited to initiate their swings until the ball reappeared: swings were initiated 36, 40, and 36 ms after the ball reappeared, strikingly less than the 50 to 135 ms visuo-motor reaction time for catching balls reported in Lee et al. (1983) and discussed in Benguigui, Baurès, and Le Runigo (2008). This result suggests that participants did not continuously couple their swing to visual information from the falling ball, but instead used the initial ball acceleration to predict the time-until-ball-reaches-the-batting-place, and timed the initiation of their swing accordingly. The ability to predict the time-until-the-ball-reaches-the-batting-place was presumably based on knowledge of the accelerating effects of gravity (see also Lacquaniti and Maioli 1987, 1989; McIntyre et al. 2001; Zago et al. 2004; Zago et al. 2008), and required a way of explicitly representing time. In addition, this experiment showed that movements were shorter in duration when they were initiated later in the ball-fall than when they were initiated earlier, suggesting that movement speed was controlled in order to hit the ball on time, regardless of movement initiation time. This is another example of how movement initiation time can be more variable than movement endpoint time, and therefore separately controlled, as per Perkell and Matthies (1992) for speech;⁶ cf. Section 4.2.1.

⁶ Additional evidence supported the view that movements were continuously modulated during the swing: 1) less timing variability was observed at ball–bat contact than at swing initiation, suggesting that timing was modulated as the swing was unfolding in order to minimize timing error, and 2) elbow acceleration continuously measured during the swing correlated positively with the velocity required to hit the ball on time (distance/time-to-ball-contact), measured at each point in time.
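The Ecological-Psychology alternative mentioned above rests on Lee's (1976) optical variable tau, which can be stated compactly (the notation below is the standard one from that literature, not specific to this book). For an approaching object whose image subtends a visual angle θ(t),

\[ \tau(t) \;=\; \frac{\theta(t)}{\dot{\theta}(t)} \]

approximates the time to contact at the current closure rate, without distance or speed having to be known separately. The force of the Katsumata and Russell (2012) result is that once the ball is occluded, no such currently available optical variable exists, so accurate batting must instead rely on an internally represented, explicitly timed estimate.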


4.3.1.2 Anticipatory behavior in both non-human animals and humans requires representations of timed intervals

4.3.1.2.1 Humans are able to predict the timing of the sensory consequences of movement
This section presents evidence from non-speech experiments showing that actors can interact with predictably timed events, suggesting that they represent and track the time-until-expected-event occurrence, and that they can reuse these representations in planning and specifying motor actions. Along these lines, an experiment by Blakemore, Wolpert, and Frith (2000) suggests that humans predict the timing of the sensory consequences of a self-induced tickle and use this information to attenuate its effects at the anticipated time (a proposed explanation for why we can't tickle ourselves). In this experiment, a robotic interface was used to introduce 100–300 ms delays between the participant's left-hand movement of a robot arm and the sinusoidal movement of soft foam across their right palm, controlled by the robot arm. The participants' judgment of the tickliness of the stimulation increased as the delay increased, consistent with the proposal that actors use an internal feedforward model to predict the sensory consequences of movement, including their timing, and time the tickle-inhibition activity to coincide with the predicted timing of the sensory consequences of movement. This evidence supports the view that surface timing is represented, tracked, and used to time tickle-inhibition activity. That is, the correlation between the judgment of stimulus tickliness and the delay in the tactile consequences of movement in Blakemore et al.'s experiment would be difficult to explain without recourse to a mechanism or set of mechanisms that can predict the timing of the sensory consequences of movement, can track the time left until expected event occurrence, and can use this information to specify the timing of tickle-inhibition activity.

4.3.1.2.2 Humans and animals are able to predict the expected arrival time of a stimulus
Another line of evidence for the representation of surface time comes from many examples in the literature on conditioned responses in humans and other animals, where anticipatory behavior is closely tuned to the expected arrival time of a stimulus (e.g. shock, food).



See Catania (1970), cited in Gibbon (1991); Gibbon (1977); Gibbon et al. (1997); Richelle and Lejeune (1980); Lejeune and Wearden (1991); and Rakitin et al. (1998) for examples.⁷ Such observations provide additional support for the view that timed intervals are represented, and that the time-to-event occurrence is tracked and represented for use in the production of movement, even when the conditioning perceptual stimulus is anticipated but absent. For example, Roberts (1981) pioneered a variant of the conditioned-response paradigm in which rats were trained to estimate a temporal interval: the first response (lever press) after a fixed interval (usually 40 s) was rewarded with food. These fixed-interval trials (80% of all trials) were interspersed with longer, 160-s trials in which food reinforcement never occurred (20% of trials). In these longer trials, peak response rate was centered at the time food was expected on the basis of the preceding fixed-interval trials. The timing of peak responses provided strong evidence that the animals had estimated the predominant time of arrival of food.

Many examples in the conditioning literature (such as those discussed) relate to timing intervals that are much longer than the sub-second timescales relevant for speech (Gibbon 1977 reviews studies that range from thirty seconds to fifty minutes). However, there are some examples in the literature relating to less-than-a-second timing intervals. In a study by Green, Ivry, and Woodruff-Pak (1999), participants first learned the timing relationship between a tone (conditioned stimulus) and a puff of air to the cornea (unconditioned stimulus). They then produced a blink (conditioned response) in response to the tone, in anticipation of the air puff. Different intervals between the tone and the puff of air (ranging from 325 to 500 ms) were trained in different sessions. Correspondingly, anticipatory eye-blinks were timed to occur just prior to the expected arrival of the air puff. Thus, the evidence that actors represent anticipated time intervals and use them in controlling the timing of their motor behavior extends to the scale of time intervals observed in speech. This evidence suggests that actors have a timekeeping mechanism, or set of mechanisms, that keeps track of surface time over a wide range of interval durations, including those appropriate for speech.

⁷ As Macar and Vidal (2009) discuss, the fact that some non-human animals show evidence of forming a temporal relationship between events, just as humans do, suggests that temporal representations can be established even without the complexity of the human brain. Zakay (1989) proposes that temporal representations are activated automatically, but fade from memory if not meaningful to the organism in some way, cf. supporting neural evidence in Jin, Fujii, and Graybiel (2009). Relevance to the organism encourages accurate consolidation of the interval in memory.


4.3.2 Evidence for surface duration requirements in speech

Section 4.3.1 presented evidence, drawn from patterns of movement interaction with perceived and predicted events, for a general timekeeping ability, and for the ability to represent surface durations. This section presents an additional kind of evidence that speakers represent surface time, this time from speech data. In AP/TD, patterns of systematic surface duration variability in speech due to phonemic identity, prosodic context, rate of speech, etc. are not goals in themselves, but instead emerge from model parameters and adjustments to gestural activation intervals. Surface duration requirements (goals) aren't represented, specified, or specifiable within the theory. In contrast, the previous section presented evidence suggesting that humans can explicitly represent and specify the surface durations of intervals used in a variety of motor tasks, suggesting that humans have timekeeping abilities that make it possible to track surface time, to represent the durations of timed intervals, and to store this information for later re-use, e.g. to plan appropriately timed actions to interact with anticipated events. This section discusses findings which suggest that these abilities are also demonstrated in speech, i.e. that surface durations are explicitly represented during speech production. These findings challenge the AP/TD view more directly, since surface time is not represented within the AP/TD theory.

The evidence comes from 1) apparent constraints on the magnitude of final lengthening in quantity languages (Section 4.3.2.1), and 2) systematic surface durational patterns that are achieved in a variety of articulatory ways (Section 4.3.2.2). Both lines of evidence suggest that the constraints and goals required to account for these processes are represented in terms of surface durations. The representation of surface durations challenges AP/TD's phonology-intrinsic timing approach, because it suggests that there must be a distinction between the phonological vs. phonetic representations of time; that is, AP/TD's phonology-intrinsic time would need to be transformed, or translated, into surface (e.g. millisecond) time. This implication appears to run counter to one of the core principles of AP/TD's phonology-intrinsic timing approach, which is to avoid having to translate from one type of representation into another, and moreover would be challenging because there is no straightforward correspondence between the oscillator frequency of AP/TD's planning+suprasegmental oscillator ensemble and solar time. The evidence in Section 4.3.2.2 provides a further challenge to AP/TD because it suggests that goals are specified separately from how the goals are to be achieved, and appears to require at least two stages of planning: one to specify the goals, and one to specify how the goals will be achieved.


In AP/TD, there is only one planning stage, which is phonological (although spatiotemporal), and the output of this stage, the phonological plan for an utterance, fully determines gestural movement.⁸

4.3.2.1 Constraints on magnitudes of final lengthening in quantity languages
A constraint on the surface durations of phonemically short vowels in phrase-final position appears to be required to preserve quantity contrasts in some languages, such as Northern Finnish and Dinka (Nakai et al. 2009; Nakai et al. 2012; Remijsen and Gilley 2008). In Finnish, phonemic quantity distinctions are signaled durationally; observed spectral correlates are subtle and possibly imperceptible (e.g. Wiik 1965; Engstrand and Krull 1994; Eerola and Savela 2012). Nakai et al. (2012) observed that the magnitude of phrase-final, phrasal-accent-related, and combined lengthening on phonemically short vowels in Northern Finnish is restricted compared to the lengthening on phonemically long vowels (Figure 4.4), suggesting that speakers of this language explicitly manipulate the degree of final lengthening to maintain the contrast in duration between phonemically short and long vowels. For example, the phonemically short vowel in the last syllable of CVCV(C) words (cf. the left-hand side of Figure 4.4, panel b) shows 17% combined accentual + final lengthening vs. 68% on the phonemically long vowel in the same context. In particular, the lengthening pattern on this so-called 'half-long vowel', i.e. a phonemically short vowel whose duration is intermediate between that of the phonemically short vowel in other contexts and that of the long vowel (VV), is suggestive of a duration constraint for phonemically short vowels.

Two types of empirical evidence for a duration constraint are provided in Nakai et al. (2009) and Nakai et al. (2012). First, Nakai et al. (2009) found a negative correlation between phrase-medial duration and the amount of final lengthening for V2 in CV1CV2 structures. One might initially imagine a mechanism by which speakers could learn to lengthen phonemically short vowels less, to avoid confusion in their listeners, without explicitly representing a durational constraint. However, this potential solution is ruled out by the observation that speakers adjust the amount of lengthening for their short vowels to avoid exceeding the criterion duration threshold.

⁸ As noted earlier, there are limited ways in which the realization of the gestures can differ in AP/TD, e.g. due to differing contributions of different articulators to each gesture that might occur when an articulator is perturbed and to other context factors.



Figure 4.4 Mean test vowel durations (in ms) in the baseline and three experimental conditions. Note: P. accent = phrasal pitch accent, Utt. final = utterance final, Combined = Combined-effect. The durations of (V)V1 are plotted in the upper panel; (V)V2 in the lower panel. Error bars represent 1SD. Source: Redrawn based on Nakai et al. (2012) with permission from Elsevier. © Elsevier 2012

That is, phonemically short vowels that are shorter are lengthened more, and phonemically short vowels that are longer are lengthened less, showing clear evidence of a surface duration constraint. Further support for a constraint comes from Nakai et al.'s (2012) study of final lengthening and accentual lengthening, which combine sub-additively for V2 in CV1CV2.
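To make the contrast-preservation logic concrete, the following toy calculation applies the lengthening percentages quoted above to hypothetical phrase-medial baseline durations; the 80 ms and 160 ms baselines are our assumptions for illustration only, not values from Nakai et al. (2012).

```python
# Hypothetical phrase-medial baseline durations (ms); these are assumed values,
# chosen only to illustrate the argument. The lengthening percentages (17% for
# the phonemically short vowel, 68% for the long vowel) are those quoted above.
short_v, long_v = 80, 160

uniform_short = short_v * 1.68    # if the short vowel were lengthened like the long one
observed_short = short_v * 1.17   # with the observed, restricted lengthening
lengthened_long = long_v * 1.68

print(round(uniform_short), round(observed_short), round(lengthened_long))
# 134 94 269
```

On these assumptions, restricting final lengthening keeps the lengthened short vowel (about 94 ms) well separated from the long-vowel range, whereas uniform lengthening (about 134 ms) would erode the durational contrast.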


[Figure 4.5 appears here: bar chart of nucleus duration (s) as a function of lexical and morphological quantity (SS–SG, SS–LG, LS–SG, LS–LG), each in phrase-final (Fin.) vs. phrase-medial (Med.) position.]

Figure 4.5 Means and standard deviations for vowel duration as a function of lexical/morphological quantity—short stem in short grade (SS–SG), short stem in long grade (SS–LG), long stem in short grade (LS–SG), long stem in long grade (LS–LG)—and sentence context—(Medial, Final). Note: Items ending in /r/ are excluded. SS–SG are considered to have short quantity, SS–LG and LS–SG are considered to have medium quantity, and LS–LG are considered to have long quantity. Source: Redrawn from Remijsen and Gilley (2008) with permission from Elsevier. © Elsevier 2008

These results thus support the view that the surface duration of the (phonemically short) half-long vowel is restricted in order to avoid endangering the surface-duration manifestation of the phonemic short vs. long vowel quantity contrast in this language. This type of constraint is difficult to express in a system which does not explicitly represent surface durations.

Dinka, a Nilotic language, also shows patterns of final lengthening consistent with this type of constraint. This language has a three-level quantity system, and the magnitude of final lengthening is smaller for vowels of short and medium quantities as compared to the long quantity (compare LS–LG with the other conditions in Figure 4.5; Remijsen and Gilley 2008).

On the assumption that contrasts in vowel quantity in AP/TD would be expressed in terms of different numbers of gestures in lexical representation (e.g. one vs. two in Finnish⁹), the phonological contrast could be maintained even if contrastive short and long vowels were lengthened by the same amount.

⁹ Different numbers of gestures (or moras, as traditionally expressed), e.g. 1 vs. 2, for phonemically short vs. long vowels would also appear to be required in Finnish to account for the patterning of long vowels with VC syllable rhymes in behavior relating to syllable weight.


However, this is not what is observed: phonologically contrastive short vowels are lengthened less than phonologically contrastive long vowels. This pattern of lengthening can only be implemented in AP/TD by an ad-hoc specification of a smaller amount of lengthening on the phonologically contrastive short vowels. In contrast, the restricted magnitude of lengthening on the contrastive short vowel can be explained if there is a constraint which preserves the duration distinction between short and long contrastive quantities.

To put it another way, if vowels of different quantities had the same phonological representation, the constraint on prosodic lengthening for short (and medium) vowels could be expressed as a constraint on the height of the Pi/MuT-gesture (i.e. a constraint on the degree of AP/TD 'clock' slowing). But in this hypothetical case, where short and long vowel quantities had the same phonological representation (i.e. the same number and type of gestures), there would be no lexical contrast, which would be highly undesirable. Instead, because AP/TD differentiates phonological categories with gestures, we assume that the phonological contrast between these types of vowels is expressed in the lexicon as one vs. two or more gestures. Thus, in AP/TD, the surface durations of these vowels would be due to a combination of 1) the number of AP/TD 'clock' timing units in their gestural activation intervals (determined by the number of gestures) and 2) the degree of AP/TD 'clock' slowing (determined by the height of the Pi or MuT gesture).

In this type of system, there is no way to account for the apparent surface-duration constraint on the lengthening of contrastively short vowels, because this constraint relates to the interaction of two different AP/TD properties: 1) the number of AP/TD 'clock' timing units in the activation interval and 2) the degree of AP/TD 'clock' slowing, both of which contribute to surface duration in solar time. AP/TD can refer to each of these quantities, but has no way of representing the fact that they both affect surface duration. That is, it has no way of relating their effects to a desired surface duration, since surface durations are only emergent in this theory and are not represented. AP/TD therefore has no explanation of different degrees of lengthening ('clock' slowing) on phonologically contrastive short vs. long vowels, because the explanation has to do with the maintenance of a surface duration distinction.

In sum, while the Pi- and MuT-gestures of AP/TD might provide a mechanism to specify different degrees of final- or phrasal-accent-related lengthening for phonologically contrastively short (or medium) vs. long vowels, AP/TD does not predict that this pattern should ever occur, and does not offer an explanation for why contrastively short vowels should be lengthened less than contrastively long vowels.
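In schematic terms (our simplification of the argument just made, not code from the AP/TD literature), surface duration in AP/TD is the product of the number of 'clock' timing units in the activation interval and the (possibly slowed) duration of each unit. A millisecond-valued constraint would have to refer to this product, which the theory does not represent; the names and values below are hypothetical.

```python
# Schematic illustration of the two interacting AP/TD properties discussed above.
# Surface duration emerges as: number of clock units x unit duration x slowing,
# where clock slowing (the Pi/MuT gesture height) stretches each unit.
def surface_duration_ms(n_units, base_unit_ms, slowing_factor):
    return n_units * base_unit_ms * slowing_factor

short_v = surface_duration_ms(n_units=1, base_unit_ms=80, slowing_factor=1.6)  # 128.0 ms
long_v = surface_duration_ms(n_units=2, base_unit_ms=80, slowing_factor=1.6)   # 256.0 ms
print(short_v, long_v)
# A constraint such as "keep the lengthened short vowel under ~150 ms" refers to
# the product of the two factors, i.e. a surface (solar-time) quantity that a
# phonology-intrinsic timing system has no way to represent or constrain.
```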


The results reviewed here suggest that the explanation relates to surface durational information which is represented in the minds of speakers, and is involved in the maintenance of phonological contrasts. These results are difficult to account for in models, such as AP/TD, in which surface durations are the emergent output of activation-interval durations + clock-slowing adjustments, are not explicitly represented, and so cannot be invoked as constraints on lengthening.

4.3.2.2 Different articulatory strategies for different speakers suggest the representation of surface duration goals as distinct from the articulations that achieve them

This section presents evidence that speakers specify surface interval duration requirements as goals of speech production, and meet these requirements using a variety of different strategies (an idea proposed in Edwards, Beckman, and Fletcher 1991 for phrase-final lengthening, and in Hertrich and Ackermann 1997 for vowel quantity). In AP/TD, surface duration goals are not represented within the theory, so although the theory would presumably have no trouble modeling each of the different articulatory strategies, there is no way to express the fact that these different strategies are equivalent, in the sense that they meet the goal of producing similar surface durational patterns. That is, the only thing shared by all of the different strategies is their equivalent effect on surface durations; the equivalence of these strategies goes unexplained in theories that cannot represent surface durations. These findings also suggest that surface duration goals are represented separately from the spatial path of goal achievement (Edwards, Beckman, and Fletcher 1991; Hertrich and Ackermann 1997); this challenges AP/TD's view that spatial and temporal properties of movement are integrated.

4.3.2.2.1 Different strategies for speech rate manipulations

Studies of overall rate-of-speech manipulations provide a good example of speakers using different articulatory strategies to achieve what appears to be a shared durational goal expressed in surface time. In laboratory studies, speakers easily produce shorter-duration utterances when asked to speak at a fast speech rate compared with a slower rate, but notably, they achieve this surface durational difference in different ways. Strategies for achieving shorter durations at faster speech rates include: 1) a reduced number and/or strength of prosodic constituent boundaries, as evidenced by the apparent deletion of fundamental-frequency boundary markers at optional intonational boundaries (Caspers 1994), 2) a similar result achieved by less final lengthening in absolute terms (Cummins 1999),


3) less prominence-related lengthening (Cummins 1999), 4) fewer and/or shorter-duration pauses (Goldman-Eisler 1956; Caspers 1994; Trouvain and Grice 1999; Trouvain 1999), 5) segmental lenition, assimilation, and omission (Kohler 1990; Trouvain 1999), 6) an increased slope of the peak speed/distance relationship (e.g. Ostry and Munhall 1985), 7) more coarticulation/articulatory overlap (e.g. Byrd and Tan 1996), 8) fewer speed peaks (i.e. normal and fast rates typically have one speed peak, whereas very slow movements can have more), and 9) shorter articulatory distances (expected to yield shorter durations according to Fitts' law).

Some strategies may be language- or language-variety-specific. For example, shortening of vowels before tautosyllabic voiced obstruents to increase overall speech rate occurs in varieties of English where these are longer than vowels before voiceless consonants, such as American English (Smith 2002) and other varieties, but this strategy is not available in Scottish English, where vowels before voiced and voiceless obstruents are both short in monomorphemic syllables (Scobbie, Hewlett, and Turk 1999).

Even within a language or language variety, inter-speaker differences in strategy are rampant. For example, in one of Trouvain's (1999) studies, although all three speakers used fewer and shorter pauses and increased the number of spoken syllables per second at faster rates, only two of the three speakers showed more segmental reductions at a fast rate, while the other speaker used the same surface segmental forms for normal and fast rates. Also, while it is often the case that fewer and shorter pauses occur at fast rates, some speakers choose not to vary the number of pauses (e.g. one speaker in one of Trouvain's 1999 studies), and some speakers do not decrease pause duration (Fletcher 1987). One speaker who reduced the number of pauses at a fast rate in one of Trouvain's (1999) studies even increased pause duration at this rate.

Widespread inter-speaker differences in strategy are also observed in kinematic studies. While the peak speed/distance ratio is very often higher for fast rates than for slower rates,¹⁰ speakers differ in how they achieve it. Some speakers choose to vary speed while keeping distance constant, while others vary distance (cf. Berry 2011 for a review). For example, in Ostry and Munhall's (1985) study of repeated /CV/ sequences (C = /k,g/, V = /a,o,u/) at a slow and fast rate, two of three speakers reduced tongue-dorsum displacement, and one speaker showed no change in displacement, but did increase movement speed.

¹⁰ Hertrich and Ackermann (2000) report different rate-related behavior of this ratio for the jaw vs. the lips and tongue, as might be expected for the different masses of these articulators; because the jaw is heavier than the lips and tongue, it would be expected to move more slowly.


Goozée, Lapointe, and Murdoch (2003) report similar findings from a study of /ta/ and /ka/ repetitions by eight speakers. Most speakers reduced vowel-related tongue displacement at a fast rate, but one did not; that speaker increased tongue peak speed instead (see also Abbs 1973 and Engstrand 1988 for similar findings). The fact that distance manipulations appear to be more common than speed manipulations without differences in distance is consistent with AP/TD's implementation of rate-of-speech effects: increased suprasegmental oscillator ensemble frequency at faster rates would have the effect of shortening gestural activation intervals; shorter gestural activation intervals would truncate gestural movements toward constrictions, causing less displacement, and would thus make them less likely to approximate their targets. However, as far as we know, there is no current AP/TD proposal to account for the alternative strategy, namely increasing movement speed while keeping distance constant. One possibility would be to manipulate gestural stiffness (parameter k). Regardless of the mechanism, it seems clear that it would be difficult to account for the equivalence among different strategies for changing rate of speech without modifying the theory.

Rate-dependent inter-speaker variability has also been observed for articulatory overlap. All three possible outcomes have been observed, i.e. rate-induced increase in overlap, decrease, and lack of change (Abbs 1973; Boyce et al. 1990; Byrd and Tan 1996; Engstrand 1988; Shaiman, Adams, and Kimelman 1995; Shaiman 2001, 2002; all cited in Berry 2011).

In all of these cases, instructions to speak more quickly or more slowly result in changes in the surface duration of the utterance. That is, all of the fast-rate strategies shorten total utterance duration, and all of the slow-rate strategies lengthen total utterance duration. However, because the different strategies have different types of effects on movement kinematics (e.g. movement speed, distance, and temporal overlap of articulation), the equivalence of these strategies cannot be expressed without reference to their effects on surface durations. This is at odds with AP/TD's emergent timing approach, where surface durations cannot be specified as a goal. Although gestural activation interval timing can be manipulated via planning+suprasegmental oscillator frequency, this manipulation provides only a single speech-rate strategy, with a particular kinematic signature (since spatial and temporal aspects of movement are not independent). AP/TD thus does not provide an explanation for the existence and equivalence of multiple strategies, which can be explained if surface durations are represented as goals, and are represented separately from spatial aspects of movement.


4.3.2.2.2 Different strategies for signaling quantity differences

Hertrich and Ackermann (1997) provide another example of speakers using different articulatory strategies to achieve what appears to be a common surface durational goal. In their study of quantity differences among German vowels in /pVp/ syllables, all speakers showed shorter decelerations of opening movements, shorter accelerations of closing movements, and faster distance-normalized peak speeds for lip+jaw opening movements of phonemically short vowels as compared to phonemically long vowels. However, there were differences across participants in the closing movements. Three out of six speakers showed higher distance-normalized peak speeds for phonemically short vowels than for phonemically long vowels, and the other three speakers showed no difference. These findings suggest that there is more than one kinematic way to achieve a desired durational distinction between phonemically short and long vowels in German. What all speakers share is that they achieve shorter durations for phonemically short as compared to long vowels, suggesting that their goal is to produce a difference in surface duration pattern for phonemically short vs. long vowels, and that they can achieve this pattern using different articulatory strategies.

In addition to evidence that different speakers use different strategies to achieve similar durational patterns, Hertrich and Ackermann also provide evidence that the same speaker can achieve the short vs. long distinction in different ways for different types of vowels. That is, some speakers showed a longer opening movement for /ɑ:/ than for /ɑ/, but a predominant pattern of a longer initial part of the closing movement for /u:/ than for /u/. This provides even stronger evidence for the equivalence of different strategies for producing the short vs. long quantity distinction and for the representation of surface duration patterns, because differences in strategy are observed within the same speaker.

4.3.2.2.3 Different strategies for final lengthening

A final line of evidence for the explicit representation of surface duration goals comes from the work of Edwards, Beckman, and Fletcher (1991), who suggest that different speakers can use different articulatory strategies for achieving the surface durational patterns traditionally described as final lengthening. In their study, four speakers produced the words pop and poppa in phrase-medial and final positions, at fast, normal, and slow rates. They observed that for three speakers, final lengthening in pop was characterized by comparable jaw displacements, but slower jaw movements toward final targets, as well as slightly later initiation of closing movements, than for the same word produced in phrase-medial position.


A fourth speaker showed no evidence of slower peak speeds for either the opening or the closing jaw movements in phrase-final position, but had even later initiation of the closing movement (less articulatory overlap) for final tokens, as well as greater displacement. As noted in Edwards et al. (1991), these findings suggest that the explanation for the equivalence of the strategies used to signal phrase-finality vs. non-finality lies in the similarity of the surface duration patterns that all of these strategies achieve, further suggesting that surface duration goals are represented by speakers, and are distinct from the articulatory strategies that achieve them.

As was the case for quantity distinctions, Edwards, Beckman, and Fletcher (1991) present evidence suggesting that the same speaker can use different strategies to produce a durational difference in phrase-final vs. phrase-non-final positions, depending on the context. They found that some speakers used different strategies at different rates of speech: at faster rates they slowed articulatory speed in phrase-final position compared to phrase-medial position, but at slower rates they held the articulation in quasi-steady states for longer.

Taken together, these studies of strategies for adjusting durations for rate of speech, vowel quantity, and final lengthening suggest that surface durations are speech production goals that can be achieved in a variety of ways. This type of motor equivalence supports the view that 1) surface utterance duration requirements can be specified as part of the speech production process, and 2) these requirements or goals are specified separately from how the goals are achieved. Particularly telling are cases where the same speakers show different articulatory strategies for achieving similar durational patterns in different contexts. This evidence challenges AP/TD because the model does not allow the specification of surface duration requirements. Moreover, AP/TD does not make a distinction between temporal goals and how the goals are achieved in articulation, since in this model spatial and temporal aspects of movement are not independent: in AP/TD, both are determined by the same phonological plan. In Chapters 7 and 10, we propose that 1) surface duration goals for intervals can be specified during phonetic planning, and 2) these goals are specified separately from how the goals are achieved articulatorily. This type of model architecture makes it possible for the same goal to be achieved in a variety of ways.


4.4 Further evidence for general-purpose timekeeping mechanisms to specify durations and track time

This section presents further evidence for general-purpose timekeeping mechanisms that can be used to specify surface durations and track time. The existence of these mechanisms makes atemporal, symbolic phonological representations more plausible, because it suggests that humans have the required general-purpose, phonology-extrinsic timing mechanisms to represent and specify durations that might form part of the phonetic implementation of phonological structure. The evidence comes from interval duration variability, both in non-speech and in speech (Section 4.4.1), and from neural evidence that the brain represents and tracks time (Section 4.4.2). That there should be some type of neural evidence for the representation of surface interval durations and for tracking time relative to event occurrence is expected, given the behavioral evidence presented earlier in this chapter. Section 4.4.2 is included in order to provide corroborative evidence for these representations and timekeeping abilities, and also to illustrate the form which the neural evidence can take, as well as the parts of the brain in which the neural activity takes place.

4.4.1 Evidence for a noisy timekeeper: Timing variability correlates with interval duration

This section presents evidence from timing variability that supports general-purpose timing mechanisms that could be used to specify and plan surface durations in speech. Many studies show that variability in interval duration grows linearly with interval duration (Treisman 1963; Gibbon 1977; Schmidt et al. 1979; Rosenbaum and Patashnik 1980a, 1980b; Wing 1980; Hancock and Newell 1985; Wearden 1991; Ivry and Corcos 1993; Ivry and Hazeltine 1995; Spencer and Zelaznik 2003; see Malapani and Fairhurst 2002 for a review). As Schmidt et al. (1979) explain, these findings are expected in models which make use of a noisy timing mechanism: "the mechanism that meters out intervals of time . . . is variable, and the amount of variability is directly proportional to the length of the interval of time to be metered out" (Schmidt et al. 1979, p. 422). The relationship of variability to mean duration follows Weber's law, with a coefficient of variation (standard deviation/mean) that is approximately constant across a range of intervals, for both humans and animals (Gibbon 1977).


This is known as 'scalar variability' or the 'scalar property' (but see Allan and Gibbon 1991 for a report of constant standard deviations across a range of intervals for well-practiced participants). Although the range of relevant intervals that typically show scalar variability is somewhat unclear, it appears to span 200–1300 ms (Rosenbaum and Patashnik 1980a, 1980b; Gibbon et al. 1997 and references cited therein; Melgire et al. 2005; Penney, Gibbon, and Meck 2000; Rakitin et al. 1998; Hinton and Rao 2004; cited in Allman et al. 2014; Merchant, Zarco, and Prado 2008; Grondin 2014).¹¹ Grondin (2014) and Gibbon et al. (1997) observed an increase in the Weber fraction for intervals from 1300 ms to 2 s. Longer intervals (from 1 s to many minutes) can show lower coefficients of variation, which can be attributed to counting strategies in some cases (Bangert, Reuter-Lorenz, and Seidler 2011, but see also Lewis and Miall 2009). Shorter intervals (50–200 ms) often show higher coefficients of variation (references cited in Getty 1975), with the likely explanation that these coefficients of variation are inflated by constant sensory/motor noise. Getty (1975) therefore described a generalized form of the relationship between standard deviation and interval duration that includes an intercept term reflecting the constant, duration-independent noise component:

sd² = k²D² + c

where D = interval duration, c is the intercept, and k is the Weber fraction. On this view, the variability can be attributed to two sources: 1) a duration-dependent source of variability, and 2) a duration-independent source. The duration-dependent variability is thought to reflect noise in a hypothesized timing process, such as reading the time from memory (Gallistel 1999; Gallistel and Gibbon 2000; Jones and Wearden 2004), or neural spike-count variability (Shouval et al. 2014), and is described by the slope of the linear relationship, k² (note that k represents the Weber fraction) (see Chapter 9 for more discussion). In production tasks, the duration-independent source of variability (described by the intercept c) can be related to noise inherent in motor implementation.

¹¹ Qualitatively different timing behavior is often observed for longer intervals. For example, in tapping to an external stimulus, the timing of the taps switches from anticipatory to reactive at about 2 s (Mates 1994); see Chapter 9 for more discussion.
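To illustrate Getty's generalized Weber relationship, here is a minimal simulation sketch in Python; the Weber fraction k = 0.05 and intercept c = 15 ms² are assumed values chosen for illustration, not estimates from any of the studies cited.

```python
import numpy as np

rng = np.random.default_rng(0)
k, c = 0.05, 15.0  # assumed Weber fraction and duration-independent variance (ms^2)

def produce_interval(D, n=100_000):
    """Simulate n productions of a target interval D (ms): duration-dependent
    (scalar) timing noise plus constant duration-independent motor noise."""
    timing_noise = rng.normal(0.0, k * D, n)       # sd grows linearly with D
    motor_noise = rng.normal(0.0, np.sqrt(c), n)   # constant sd, independent of D
    return D + timing_noise + motor_noise

for D in [100, 300, 500, 1000]:
    samples = produce_interval(D)
    predicted_sd = np.sqrt(k**2 * D**2 + c)        # Getty's sd^2 = k^2 D^2 + c
    print(f"D={D:4d} ms  sd={samples.std():6.1f}  predicted={predicted_sd:6.1f}  "
          f"CV={samples.std() / samples.mean():.3f}")
```

For long intervals the constant term is negligible and the coefficient of variation approaches k (the scalar property); for short intervals the constant term inflates the coefficient of variation, as described above.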


The generalized form of the Weber relationship between variability and interval duration, suggestive of noise in a timing process, is observed in a variety of tasks. These include both non-speech (Gibbon 1977; Roberts 1981; Ivry and Hazeltine 1995; Merchant, Zarco, Bartolo, and Prado 2008; Merchant, Zarco, and Prado 2008) and speech tasks (Byrd and Saltzman 1998; also data from Turk and Shattuck-Hufnagel 2007), that involve perception, production, or both, and the relationship has been shown for both auditory and visual stimuli (Merchant et al. 2008a, b). Types of motor tasks that have shown scalar variability include:

1) Single timed-interval production (Rosenbaum and Patashnik 1980a, 1980b; Ivry and Hazeltine 1995; Merchant, Zarco, and Prado 2008). In these tasks, participants were asked to reproduce a single interval whose duration should match that of a model, using e.g. taps (Ivry and Hazeltine 1995), or left- and right-hand keypresses (Rosenbaum and Patashnik 1980a, b) to delimit the interval. Intervals ranged from 325 to 550 ms in Ivry and Hazeltine (1995), from 50 to 1050 ms in Rosenbaum and Patashnik (1980a, b) (with a 0 ms 'interval' produced by simultaneous left- and right-hand keypresses), and from 350 to 1000 ms in Merchant, Zarco, and Prado (2008). The intervals used in the Rosenbaum and Patashnik studies are particularly relevant for speech, since they include short (50–200 ms) intervals; the durations of many speech-relevant intervals lie within this short range.

2) Repeated movements made to a metronome (Schmidt et al. 1979, for elbow flexion/extension movements made to a variety of metronome beats with 200–500 ms inter-beat intervals).

3) Repeated movements made to an internally recalled rhythm in a continuation paradigm (Wing 1980; Ivry and Hazeltine 1995; Spencer and Zelaznik 2003; Merchant, Zarco, and Prado 2008, among others). In this paradigm, participants first produce a movement (e.g. tapping) in synchrony with a metronome (pacing phase), and then are asked to continue the rhythm after the metronome is turned off (continuation phase). Typically, the interval duration measurements are made from the continuation phase; standard deviations and mean interval durations are computed over a series of trials. Ivry and Hazeltine (1995) and Merchant et al. (2008) found increased variability for longer tapping intervals, for intervals ranging from 325 to 1000 ms. Spencer and Zelaznik (2003) observed increased variability for longer tapping, continuous circle drawing, and back-and-forth line drawing intervals, for intervals ranging from 300 to 500 ms. However, the slope of the relationship between variability and interval duration was shallower for continuous circle drawing and line drawing than for tapping, and the coefficient of variation (standard deviation divided by the mean) was lower for these two tasks than for tapping.


Merchant, Zarco, and Prado (2008) also found that circle drawing showed shallower slopes than tapping, and additionally that repeated tapping showed lower coefficients of variation than single tapping tasks. The shallower slopes for continuous circle and back-and-forth line drawing may reflect less involvement of explicit timekeeping in these tasks, i.e. timing that is more emergent (Robertson et al. 1999; Zelaznik, Spencer, and Doffin 2000; Zelaznik and Rosenbaum 2010; Repp and Steinman 2010).

4) The timing of anticipatory behavior in animals and humans. In the experiments discussed in Section 4.3.1.1, non-human animals and humans made anticipatory responses at expected times-of-occurrence of previously conditioned stimuli (e.g. food, puffs of air, etc.). The timing of these responses shows variability that relates linearly to the duration of the timed interval (Gibbon 1977 and many others). For example, in the Roberts (1981) experiment in which rats anticipated the time of arrival of a food reward, a comparison of behavior on trials with different fixed interval durations (e.g. 20 vs. 40 seconds) showed that the deviation of response times increased linearly with the duration of the fixed interval, consistent with Weber's law. The Green, Ivry, and Woodruff-Pak (1999) experiment in which participants were trained to produce a blink (conditioned response) in response to an anticipated air puff likewise showed variability in the timing of the conditioned blink that increased linearly with the interval duration between the tone and the air puff.¹²

¹² Additionally, a common slope of the variability/interval relationship was observed for the timing of eye-blink conditioning and the timing of tapping in a periodic continuation task, where the same set of intervals (ranging from 325 to 500 ms) was tested. This finding is consistent with the view that a common timing mechanism is used for both tasks.

5) Speech movements and intervals. Although they did not provide an explanation for their findings relating to variability, Byrd and Saltzman (1998) found that variability increased with movement duration, for measured durations of lip aperture closings associated with a transboundary /m/-schwa-/m/ sequence. Movements of different durations were elicited in conditions designed to systematically vary the prosodic boundary strength before the second /m/. For example, the target sequence -mam- in mommamia, produced as a single word, had no prosodic word boundary before the last /m/ in the target sequence.


In other cases, the last /m/ in the target sequence was separated from the vowel by a stronger boundary, and was either prosodic word-, phrase-, or utterance-initial. Movement durations were generally longer at stronger boundaries, because of constituent-final lengthening, whose magnitude increases with perceived boundary strength (cf. Wightman, Shattuck-Hufnagel, Ostendorf, and Price 1992 for acoustic data), and because of constituent-initial lengthening (e.g. Keating 2006).

Data from the study described in Turk and Shattuck-Hufnagel (2007) show a similar pattern for measures of the acoustic duration of word-final syllable rhyme intervals in phrase-final vs. phrase-medial position, based on landmarks in the acoustic signal. Rhyme interval duration means and standard deviations (unpublished) were considerably higher for phrase-final words than for phrase-medial words. For example, the rhyme interval of monosyllabic words (e.g. -om in Tom) in phrase-final position had a mean duration of 346 ms (82 ms s.d.) vs. a mean duration of 193 ms (47 ms s.d.) in phrase-medial position, but their coefficients of variation were virtually identical, as predicted by scalar variability. Other examples can be found in the literature: Edwards, Beckman, and Fletcher (1991) present similar data for English final lengthening; Remijsen and Gilley (2008) found greater variability for the durations of acoustic intervals for phonemically long vowels in Dinka than for those of intervals for phonemically medium and short vowels; and Chen (2006) found greater variability for the durations of acoustic intervals corresponding to longer, focused constituents than to shorter, unfocused constituents in Mandarin. Similarly, Nakai et al. (2012) found greater variability for longer intervals in combined phrase-final + phrasally-accented conditions, compared to shorter intervals in non-final, non-accented conditions, and Lefkowitz (2017) found a positive correlation between interval duration and standard deviations across 128 experimental conditions.
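The 'virtually identical' coefficients of variation in the Turk and Shattuck-Hufnagel example can be verified directly from the means and standard deviations quoted above:

```python
# Coefficient of variation (sd/mean) for the rhyme-interval data cited above
for position, mean_ms, sd_ms in [("phrase-final", 346, 82),
                                 ("phrase-medial", 193, 47)]:
    print(f"{position}: CV = {sd_ms / mean_ms:.3f}")
# phrase-final:  CV = 0.237
# phrase-medial: CV = 0.244
```

Despite a near-doubling of mean duration, the two coefficients of variation differ by less than 0.01, which is what scalar variability predicts.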


In AP/TD, longer movement durations at phrase boundaries and in prominent positions arise by stretching activation intervals through the use of Pi or MuT gestures. As discussed in Chapter 2, activation intervals are specified as a fixed proportion of a planning-oscillator cycle. To stretch an activation interval, the oscillation frequency of the planning+suprasegmental oscillator ensemble can be slowed down; that is, the frequency of the planning+suprasegmental oscillator ensemble clock can be slowed. Within this framework, therefore, for phrase-final lengthening the oscillator periods are longer, but the number of oscillations stays the same. As a result, there are no additional timing units (or periods) of an utterance-specific clock to provide a source for the additional temporal variability, if we assume that additional periods would be associated with additional noise, as proposed by Gallistel (1999), Gallistel and Gibbon (2000), Jones and Wearden (2004), and Shouval et al. (2014). One possibility would be to associate noise with the height of a Pi or MuT gesture, but on this assumption, the prediction would be no timing variability for movements unaffected by a Pi/MuT gesture, which runs counter to observations of positive standard deviations for e.g. the duration of English schwa vowels in a phrase-medial, unstressed context (e.g. Turk and White 1999). The pattern of observed variability appears to relate instead to a general-purpose mechanism that governs surface time by marking out more solar-time units for longer intervals.

4.4.2 Neural evidence that the brain tracks time, plans the time-course of movement, and represents time: Timing information in changing neural firing rates, and neural tuning to particular intervals

This section presents neural evidence that the brain tracks and represents surface timing characteristics. Together with the evidence presented earlier in this chapter, this evidence suggests that humans have general-purpose mechanisms that could potentially be used to represent surface interval durations in speech production and to plan and track the time-course of movement. Evidence from methods such as imaging, transcranial magnetic stimulation (TMS), and single-unit recording, reviewed in Mauk and Buonomano (2004), Grondin (2010), Merchant, Harrington, and Meck (2013), and Allman et al. (2014), suggests that certain cortical structures form a neural network involved in timing. In particular, these areas include the medial premotor cortex (pre- and supplementary motor areas), prefrontal cortex, posterior parietal cortex, auditory cortex, and visual cortex, as well as subcortical structures (including the basal ganglia, cerebellum, and thalamus). Clinical lesion studies and TMS studies suggest that damage or temporary disruption to any one of these areas can affect timing accuracy and/or variability over multiple repetitions of the same task (see Allman and Meck 2012 for a review of the effects of Parkinson's disease, schizophrenia, attention deficit hyperactivity disorder, and autism on timing). For example, Parkinson's disease, which affects the basal ganglia, is known to cause abnormally slow movements (Pastor and Artieda 1996), and can cause longer intervals to be perceived as shorter (see Gräber et al. 2002 for evidence from speech perception), consistent with the idea that Parkinson's disease slows an internal clock.


Cerebellar lesions can cause substantial increases in temporal variability during repetitive tapping and in syllable repetitions in speech (but see Ackermann and Hertrich 2000, 1994 for some opposite effects of lower variability), as well as difficulties in producing appropriate stop VOT (Keller 1990; Franz, Ivry, and Helmuth 1996; Spencer and Ivry 2013). Such lesions can also prevent the discrimination of lexical contrasts based on stop closure duration (Ackermann et al. 1997). Many of these findings suggest that particular regions of the brain may make qualitatively different contributions to the processing of temporal information. However, other studies show that impairments to different regions can cause similar behavioral effects. For example, Ackermann and Hertrich (1997) showed that disorders of the cerebellum and basal ganglia cause similar disruptions to the production of the VOT of voiceless stops (in means and/or standard deviations). For this reason, and because different neural areas appear to be able to compensate for one another to a large extent (a common biological phenomenon described as degeneracy;¹³ see Lewis and Meck 2012, discussed in Merchant, Harrington, and Meck 2013, and Yu et al. 2007), isolating the exact contribution of each brain area to timing behavior is very difficult. Indeed, it does not seem to be the case that timing abilities can be completely abolished through injury, and there is no clinical condition that can be described solely as a disorder of timing (Allman and Meck 2012).

Neural activity relating to the timing characteristics of movement can be seen directly in the temporal evolution of neural firing rates recorded in animals (using embedded electrodes) as they perform a variety of tasks. These tasks include explicit timing tasks (e.g. periodic synchronization/continuation tasks, interval reproduction tasks, and tasks relating to interception or collision avoidance), as well as tasks that unfold in time but don't involve time as an explicit task goal (e.g. visual perception or reaching tasks with no explicit time requirement). In particular, a large number of studies show neural evidence of tracking the time that has passed since an event occurred and/or the time remaining until the next anticipated event or planned action. Some of these studies are reviewed below, including evidence for neural-firing-rate ramps that correlate with movement tau (4.4.2.1), patterns of time-varying neural activity that indicate time elapsed since a previous event (4.4.2.2), neurons that appear to be tuned to particular timed intervals (4.4.2.3), and evidence that neural activity involved in temporal processing is distinct from neural activity related to spatial processing (4.4.2.4).

¹³ Specifically, biological degeneracy is defined as "circumstances where structurally dissimilar components/modules/pathways can perform similar functions (i.e. are effectively interchangeable) under certain conditions, but perform distinct functions in other conditions" (https://en.wikipedia.org/wiki/Degeneracy_(biology)).


4.4.2.1 Neural ramps which correlate with movement tau (time-until-target-achievement-at-the-current-movement-rate)

Early studies of visual neurons in the locust (Hatsopoulos, Gabbiani, and Laurent 1995; Judge and Rind 1997) and of neurons in the nucleus rotundus of pigeons (Sun and Frost 1998; Rind and Simmons 1999) showed time-varying correlates of estimated time-to-target-achievement (time-to-contact) when the animals responded to looming stimuli. In these studies, neural firing rates ramped up and peaked before the time of contact. Similar rampings have been observed in embedded-electrode studies of medial prefrontal cortex neurons in behaving monkeys in collision-avoidance, interception, and reaching tasks (Merchant and Georgopoulos 2006; Lee, Georgopoulos, and Pepping 2017), and in synchronization-continuation tasks. In one of Merchant and Georgopoulos' interception tasks, where monkeys had to intercept a target that appeared to be moving on a screen by depressing a joystick at the appropriate moment, populations of neurons in the motor cortex and in area 7a of the posterior parietal cortex showed peaks of activity at a similar value of tau (time-to-target-achievement-at-the-current-movement-rate, Lee 1998) for different movement speeds.
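For readers unfamiliar with tau: Lee's (1998) tau of a gap is the current size of the gap divided by its current rate of closure. The following minimal sketch is our illustration of that definition for a hypothetical constant-velocity reach, not a model from any of the studies discussed here.

```python
import numpy as np

# Hypothetical constant-velocity approach: a 5 cm gap to the target closes at 10 cm/s
t = np.linspace(0.0, 0.45, 10)   # time (s), sampled up to just before contact at 0.5 s
x = 0.05 - 0.10 * t              # remaining gap to the target (m)
x_dot = np.gradient(x, t)        # rate of change of the gap (m/s); here -0.10 throughout

tau = x / x_dot                  # Lee's tau: gap size / closure rate (negative while closing)
print(np.round(-tau, 3))         # for constant velocity, -tau equals the true time to contact
```

The ramping firing rates described in this section can be read as tracking a quantity of this kind: as the gap closes, tau shrinks toward zero at a rate that depends on how the movement unfolds.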


In Lee, Georgopoulos, and Pepping's (2017) reaching task, changing neural firing rates in the basal ganglia of monkeys reaching to a visual target showed near-perfect correlations with changing movement tau (time-to-target-achievement-at-the-current-movement-rate), at different time delays. Neural activity led (in the globus pallidus external), co-occurred with (in the zona incerta), or followed by 100 ms on average (in the subthalamic nucleus) the movements in question, consistent with the authors' proposal that different ganglial structures are responsible for planning, executing, and monitoring the movement time course.

Like Lee, Georgopoulos, and Pepping's (2017) study, Jin, Fujii, and Graybiel's (2009) basal ganglia study also showed neural firing rate peaks in some populations of monkey neurons that followed events related to movement, and were time-locked to them, consistent with the idea that some neuronal activity reflects temporal monitoring. In Jin, Fujii, and Graybiel's study, monkeys were required to look at targets that moved on a screen every 400 ms. On a few trials, the task was modified to have different-duration intervals between targets. Their recordings from large populations of neurons in the dorsolateral prefrontal cortex and the caudate nucleus of the basal ganglia showed that separate sets of neurons had activity peaks that followed different events (e.g. perception of the signal to start the saccade, start of saccade, and end of saccade) by 150–500 ms, and were time-locked to them. They also found that the neural responses in the prefrontal cortex and the basal ganglia were tightly correlated, suggesting a functional connection between the two areas. Jin, Fujii, and Graybiel's (2009) interpretation of their findings was slightly different from Lee, Georgopoulos, and Pepping's movement time-course monitoring interpretation. Jin, Fujii, and Graybiel (2009) suggest that the neural firing rate peaks that follow relevant movement-related events could be a type of time-stamping, created to be available for use as and when needed, e.g. to form associations between events and precisely timed actions, or to represent how long an event lasted, even if explicit timing was not required during the event (cf. also Allman and Meck 2012).

Tan et al. (2009) provide additional evidence of neural activity that correlates with movement tau (time-to-target-achievement-at-the-current-movement-rate), this time for human participants performing a self-paced line-drawing task. They used an autoregressive multiple-regression model to determine the relationship of speed and tau (estimated time-remaining-until-event-occurrence, Lee 1998) with the time-varying MEG signal, when participants moved a joystick to draw lines on a screen to stationary targets. They found widespread correlations of speed with the MEG signal (for 81% of sensors) across the left frontal-parietal, the left parieto-temporal, and to some extent the right temporo-occipital sensor space, i.e. in the hemisphere contralateral to the moving limb. They also found significant correlations of tau (time-to-target-achievement-at-the-current-movement-rate) with the MEG signal, whereby 22% of sensors showed significant correlations in the parietal (bilateral), the right parietal-temporal, and to a lesser extent the left temporo-occipital sensor space. The tau effects often occurred concurrently with the speed effects and, in the case of the left front-parietal sensors, spatially overlapped them in the brain. Tan et al. observed that their findings of tau processing in the right parietal cortex in this production task were consistent with Harrington, Haaland, and Knight's (1998) and Rao, Mayer, and Harrington's (2001) findings of right parietal involvement in temporal processing in perception tasks.


4.4.2.2 Some patterns of time-varying neural activity indicate time elapsed since a previous event

The evidence discussed above suggests neural time-tracking of time-remaining-until-event-occurrence and time-stamping during a variety of production tasks. There is also evidence for continuous neural tracking of time-elapsed-since-a-previous-event. For example, Merchant et al. (2011) report results from a synchronization-continuation study where monkeys were required to tap to an auditory beat and continue tapping after the beat had stopped. Several types of ramping spike-density functions were observed for different populations of neurons in the medial premotor cortex (presupplementary and supplementary motor areas). These ramps differed depending on the duration of the intervals which were manipulated in the study (five different interval durations were used, ranging from 450 to 1000 ms). The ramps for some of the active neurons appeared to encode time-remaining-until-the-next-beat, because they peaked at a constant time interval before the following beat, and had slopes that correlated negatively with interval duration. Others did not occur at a constant time with respect to the following beat, and appeared to reflect time-elapsed-since-the-last-beat, either through their slopes or through the magnitude of their peaks. These findings were similar to Leon and Shadlen's (2003) findings that rhesus monkeys tracked elapsed time in a perceptual interval discrimination task.

4.4.2.3 Some populations of neurons appear to be tuned to particular timed intervals

The evidence discussed above shows that some populations of neurons encode time-remaining-until-event-occurrence, while others appear to encode time-elapsed-since-a-previous-event. Merchant and colleagues (reviewed in Merchant, Harrington, and Meck 2013 and Merchant et al. 2014) present findings consistent with yet another type of temporal processing. They suggest that certain populations of medial premotor cortex cells observed in behaving monkeys are tuned to particular interval durations; that is, they show greater activity when a monkey is required to produce particular interval durations, as compared to other temporal intervals, in both synchronization-continuation and single-interval duration reproduction tasks. (See also Mauk and Buonomano 2004, who discuss selective responses to temporal features in the frog auditory midbrain, reported in Alder and Rose 1998, 2000, and in the bat auditory brainstem, reported in Covey and Casseday 1999.) Some, but not all, of these interval-tuned cells also show ramping profiles (Merchant et al. 2014).


In addition, Merchant et al.'s (2014) study showed that the medial premotor cortex cells that were tuned to particular interval durations also showed sensitivity to the sequential organization of the task, suggesting that these cells co-represent event durations and their order in a sequence. Tanaka (2007) provides additional evidence of cells tuned to intervals of a particular duration, showing ramping activity in monkey thalamic cells before saccades that were required to be initiated at a fixed time interval after an external cue, but not before saccades whose initiation was immediate after an external cue.

4.4.2.4 Neural activity involved in temporal processing is distinct from neural activity related to spatial processing

While all of the above studies show that temporal information is encoded in neural activity, very few studies have explicitly tested the independent encoding of temporal vs. spatial information. Kornysheva and Diedrichsen (2014) is an exception. The authors found evidence for the separate neural encoding of temporal vs. spatial information in a transfer production study that allowed them to determine the representation of temporal features independently from that of spatial features (an idea discussed in Section 4.2.2). Results of an fMRI pattern-classification analysis showed that populations of neurons in the ventral premotor cortex of human participants represented temporal information independently of spatial information, consistent with the idea that timing and spatial characteristics are planned separately. Their study additionally showed that temporal and spatial information was represented integrally in the motor cortex.

Taken together, the evidence presented in this chapter is consistent with the view that human and non-human animals have the neural machinery to represent and track time relative to event occurrence during a variety of motor tasks. Although studies of neural time-tracking during speech production have not yet been performed, the available evidence supports the idea that general-purpose timing mechanisms are available for use during speech production, and motivates the inclusion of phonology-extrinsic timing in speech motor-control models that require such mechanisms (Chapters 7 and 10).

4.5 Conclusion

While it is possible that some aspects of observable surface timing patterns in speech are emergent, and not explicitly specified as part of a motor plan, the evidence presented in this chapter challenges phonology-intrinsic timing as proposed in AP/TD in several ways, and provides part of the motivation for considering a phonology-extrinsic-timing alternative.


First, evidence of lower timing variability at goal-related vs. other parts of movement presents a challenge to models which use equations of motion as phonological representations, because these equations do not (and cannot) prioritize the timing of any particular part of movement, e.g. the goal-related endpoint, over the timing of other parts of movement. The observation that movement endpoints, which are closely related to goals, are more accurately controlled than other aspects of movement is more consistent with models in which symbolic representations map onto particular parts of movement, which are prioritized for accuracy. Note that this evidence also challenges models in which symbolic representations map onto entire movement (or spectral) trajectories, because these models do not provide a way to 'pick out' a goal-related part of movement, such as the endpoint, so that it can be prioritized for timing accuracy.

Second, the evidence suggests that humans can represent temporal information separately from spatial information, which is at odds with models which make use of phonological representations in which temporal and spatial aspects of movement are inseparable.

Third, the evidence shows that humans have the ability to represent, specify, and track the timing of surface events and intervals, and use these abilities in many non-speech and speech activities. This evidence supports models of speech production which make use of phonology-extrinsic, general-purpose timekeeping mechanisms to measure, represent, and specify the timing of surface intervals in speech production. In contrast, it challenges models in which surface timing characteristics are emergent from interacting components of the phonological system and not represented, as well as models which do not make use of time that corresponds in a straightforward way with solar time.

The following chapters (5 and 6) present additional evidence that motivates the consideration of phonology-extrinsic timing. In later chapters (i.e. Chapters 7 and 10), we present a sketch of a phonology-extrinsic-timing model, on the assumption that many aspects of the surface timing patterns observable in speech are under voluntary control, and are explicitly specified as part of a phonetic planning stage of speech production.


5 Coordination: Support for an alternative approach II

5.1 Introduction

This chapter discusses evidence relevant to models of coordination in speech. AP/TD has two types of coordination: 1) the coordination of movements by different articulators controlled by a single gesture and its activation, and therefore synchronized from the beginning of movement until its end, and 2) inter-gestural coordination, used to coordinate movements of different gestures. In its current form, AP/TD inter-gestural coordination is conceived of as a matter of relative timing, accomplished via the relative timing of gestural activation intervals, controlled by phonology-intrinsic coupled, limit-cycle (freely oscillating) oscillators, also used for suprasegmental organization (syllable, foot, and phrase oscillators). These gestural activation intervals are coordinated via gestural-planning oscillators that entrain to one another during the utterance planning process, and arrive at stable entrainment patterns, e.g. in-phase for syllable onsets and nucleus vowels, anti-phase for syllable nuclei and codas. The number, type, and strength of competing coupling relationships among gestural-planning oscillators determine the resulting entrainment patterns, as well as the variability of these patterns in the presence of system-intrinsic noise. The entrainment patterns determine the relative timing of the onsets of gestural activation intervals for different gestures, e.g. synchronous for syllable-onset consonant and nucleus vowel gestures, asynchronous for nucleus vowel and coda consonant gestures. Because consonant gestures have higher spring stiffness than vowel gestures, and consonant gestures consequently approximate their targets faster than vowels, an in-phase entrainment pattern for a consonant gesture and a vowel gesture in a CV syllable will nevertheless result in the sequential approximation of consonant and vowel targets.

The evidence presented in this chapter comes primarily from movements that are coordinated, but are not completely synchronous from beginning to end (i.e. partially overlapped), and thus is most directly relevant to inter-gestural coordination in AP/TD.
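To give a feel for how coupled planning oscillators settle into in-phase or anti-phase patterns, here is a minimal sketch in the style of Kuramoto-type phase models. It is our illustration with arbitrary parameters (coupling strength k, 4 Hz intrinsic frequency), not the AP/TD implementation itself.

```python
import numpy as np

def entrain(target_phase, k=5.0, dt=0.001, steps=5000):
    """Two phase oscillators, coupled so that their relative phase is drawn
    toward target_phase (0 = in-phase, pi = anti-phase). Returns the final
    relative phase after `steps` Euler integration steps."""
    theta1, theta2 = 0.0, 2.0          # arbitrary initial phases (rad)
    omega = 2 * np.pi * 4.0            # common intrinsic frequency (4 Hz)
    for _ in range(steps):
        rel = theta2 - theta1 - target_phase
        theta1 += dt * (omega + k * np.sin(rel))   # each oscillator adjusts
        theta2 += dt * (omega - k * np.sin(rel))   # toward the target relation
    return (theta2 - theta1) % (2 * np.pi)

print(round(entrain(0.0), 3))      # converges near 0 (in-phase, e.g. onset C and V)
print(round(entrain(np.pi), 3))    # converges near 3.142 (anti-phase, e.g. V and coda C)
```

In AP/TD the coupling graph is richer (multiple gestures, competing coupling targets, and suprasegmental oscillators), but the basic mechanism of settling into stable relative phases is of this kind.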


However, the mechanisms that account for the data presented here could potentially be used to coordinate movements that contribute to the same constriction (intra-gestural in AP/TD terms), and thus are relevant to all types of coordination.

An advantage of using an oscillatory approach to movement coordination, as in AP/TD, is that a single tool (oscillators) can be used to model all aspects of timing: limit-cycle (freely oscillating) oscillators to model movement coordination and interactions among levels in a suprasegmental oscillator hierarchy, and point-attractor oscillators to model the unfolding of individual gestures (and their component articulators) over time, as well as adjustments to the time for which each gesture shapes the vocal tract (via Pi and MuT gestures). However, the previous chapter argued that the smaller timing variability at movement endpoint compared to other parts of movement challenges point-attractor oscillators as currently used in AP/TD to model individual gestures. Although evidence against point-attractor oscillators as a model of constriction formation does not necessarily rule out the use of oscillators to control movement coordination and to represent suprasegmental structure, it nevertheless removes some of the motivation for using oscillators to model coordination. This is because it suggests that oscillators are not appropriate descriptions of all aspects of phonological representation, so that the advantage of using a single type of representational/control mechanism for all aspects of speech motor control is diluted.

This chapter discusses the use of oscillators in the control of movement coordination in AP/TD; discussion of the use of oscillators for suprasegmental structure is reserved for Chapter 6. These two chapters show that 1) oscillator-based mechanisms as implemented in AP/TD cannot account for endpoint-based patterns of movement coordination (Section 5.4), and that 2) the available evidence does not unequivocally support the use of oscillators (i.e. mechanisms based on relative timing in terms of phase proportions) for movement onset timing (Section 5.3) or suprasegmental structure (Chapter 6). In this way they motivate the consideration of alternative ways to accomplish these functions, e.g. non-oscillatory spatial control, control based on absolute solar time (including tau coupling, Lee 1998) for coordination (this chapter and Chapter 9), and suprasegmental control based on a hierarchy of word-based constituents (Chapter 6). Control based on absolute solar time (including tau-coupling) would require the use of phonology-extrinsic timekeepers and would thus be incompatible with phonology-intrinsic timing, in which surface timing characteristics emerge from the phonological system without any involvement of system-extrinsic time. On the other hand, coordination based on spatial control would not necessarily be incompatible with AP/TD, if a mechanism for tracking the spatial positions of articulators were available.

The following section (5.2) begins by discussing evidence consistent with the AP/TD coupled-oscillator approach to coordination, but suggests that there are plausible alternatives that do not require oscillators as part of phonological representation. The subsequent sections (5.3 and 5.4) discuss two key features of the AP/TD approach: 1) treating coordination among gestures as relative timing control, accomplished via planning-oscillator phase relationships, rather than as coordination based on spatial information or absolute solar timing, and 2) basing coordination on the direct control of the (relative) timing of movement onsets, rather than the (relative) timing of the parts of movement most closely related to the movement goals (often movement endpoints). Evidence bearing on both of these issues is important for two reasons. First, it motivates the consideration of approaches to movement-onset timing other than those based on oscillator phase relationships; some of these alternative accounts may be compatible with AP/TD (e.g. spatial control) but others are not (e.g. control based on absolute solar timing). Second, the evidence suggests that coordination is often primarily based on the part of the movement most closely related to the movement goal (often the endpoint), rather than on the movement onset as proposed in the current version of AP/TD (described for example in Goldstein et al. 2009, although see Browman and Goldstein 1989, 1992a for earlier proposals that involve target-based coordination). In the current AP/TD system, coordination patterns specified in terms of gestural onsets do result in lawful relative timing patterns for gestural target approximation, if gestural activation intervals are long enough. This is because a gestural target is approximated (because gestures are modeled as critically damped mass–spring systems, they come very close to the target but never actually reach it) at a time that is determined by gestural activation and properties of the gestural mass–spring system such as stiffness, if the gestural activation interval is long enough. Properties of the mass–spring system fully determine mass–spring settling time, i.e. the amount of time it takes for a mass attached to a spring to 'settle' at its equilibrium position. Settling time is defined as the time it takes for the mass to come very close to the target, that is, within a fixed proportion of the distance from the target, e.g. 2%. If gestural activation intervals are long enough, the relative timing of target approximation for coordinated gestures emerges from gestural activation, from the properties of the mass–spring systems that determine mass–spring settling time, and from the inter-oscillator entrainment specifications for the coordination of gestural onsets.
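To make the settling-time construct concrete, the following is a minimal numerical sketch, not drawn from any AP/TD implementation: it simulates a critically damped mass–spring gesture, computes the settling time under the 2% criterion mentioned above, and shows the undershoot that results when the activation interval is shorter than the settling time. The stiffness values (expressed as natural frequencies) and the 80 ms activation interval are arbitrary illustrative choices.

    import numpy as np

    def gesture_position(t, x0, target, omega):
        # Critically damped mass-spring trajectory from rest:
        # x(t) = target + (x0 - target) * (1 + omega*t) * exp(-omega*t)
        return target + (x0 - target) * (1.0 + omega * t) * np.exp(-omega * t)

    def settling_time(omega, criterion=0.02, dt=1e-4):
        # Earliest time at which the remaining distance to the target
        # falls within `criterion` (e.g. 2%) of the initial distance.
        t = np.arange(0.0, 10.0 / omega, dt)
        frac_remaining = (1.0 + omega * t) * np.exp(-omega * t)
        return t[np.argmax(frac_remaining <= criterion)]

    # Illustrative natural frequencies: the 'consonant' gesture is
    # stiffer than the 'vowel' gesture, so it settles sooner.
    omega_C, omega_V = 40.0, 15.0                  # rad/s, arbitrary
    print(settling_time(omega_C))                  # ~0.146 s
    print(settling_time(omega_V))                  # ~0.389 s

    # An activation interval shorter than the settling time leaves the
    # target undershot: after 80 ms, ~17% of the distance remains.
    print(gesture_position(0.08, x0=1.0, target=0.0, omega=omega_C))

Note how the settling time is fixed by the system's parameters alone: stretching or shrinking the activation interval changes how much of the trajectory is realized, but not when the target would be approximated, a property exploited in the argument below.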

However, additional findings showing that coordination patterns can be based directly on movement endpoints (Section 5.4) are challenging for AP/TD. This is because 1) targets are not approximated (and thus the end of the activation interval does not correspond to the movement target) if gestural activation intervals are too short (i.e. at fast rates of speech), and 2) when gestural activation intervals are longer than mass–spring settling time, the timing of target approximation is not available, since this would involve a representation of target approximation time, which AP/TD's phonology-intrinsic timing system does not have. Although a gesture settles at a predictable time interval after it begins (assuming the gesture is active for a period longer than the settling time), current versions of AP/TD cannot refer to this time point, since this approach can refer to time points only as proportions of planning-oscillator cycles. Because target approximation time is determined primarily by mass–spring settling time, it corresponds to different proportions of a planning-oscillator cycle when the planning-oscillator cycle has been modified for a different speech rate, or stretched in a particular prosodic position, as for e.g. boundary-related lengthening. The findings of endpoint-based coordination presented in Section 5.4 are thus challenging for current versions of AP/TD. Finally, this chapter briefly discusses two mechanisms related to inter-articulator coordination and endpoint timing that will play a role in the alternative model proposed in this book. The first is Lee's (1998) proposed tau-coupling mechanism, which can be used for endpoint-based movement coordination, and the second is the set of Optimal Control Theory mechanisms for planning intervals between goal-related movement endpoints, and for planning the appropriate timing of movement onsets to reach goal-related movement endpoints on time. These issues will be treated in more detail in Chapters 8 and 9.

5.2 Evidence consistent with AP/TD inter-planning-oscillator coupling, and alternative explanations

Before turning to evidence that is not consistent with AP/TD (in Section 5.3), this section presents three types of evidence that are consistent with coordination based on inter-planning-oscillator coupling mechanisms, but do not require these mechanisms because they are also consistent with other approaches.

The first type of evidence comes from observations of proportional scaling of interval durations that is preserved across different rates of speech. This type of situation is predicted by oscillator-phase-relationship-based approaches to coordination, since phasing relationships maintain the proportional timing of gestural activation intervals (but not their absolute solar timing) when the planning+suprasegmental oscillator ensemble frequency increases or decreases. Section 5.2.1 reviews evidence suggesting that proportional (relative) timing is maintained in speech production in certain types of situations (e.g. intrasegmental coordination across rates), although it is not found in all situations. The situations which show stable proportional timing are consistent with inter-planning-oscillator coupling mechanisms, but other types of mechanisms, such as spatial coupling, could also generate these patterns. The second type of evidence (Section 5.2.2) comes from the typological prevalence of CV syllables, which finds an explanation in the greater stability of the inter-planning-oscillator phasing proposed in AP/TD for syllable onset and nucleus gestures as compared to syllable nucleus and coda gestures. For this line of evidence, there is also an alternative explanation that is not based on inter-oscillator coupling. The third type of evidence comes from observations of repeated speech sequences that relate to the relative stability of particular inter-oscillator phasing and cycle frequency patterns (Section 5.2.3). A prediction of coupled-oscillator systems is that fast rates of production should encourage less stable coupling modes to transition abruptly to more stable coupling modes (Haken, Kelso, and Bunz 1985). In-phase coordination relationships are inherently more stable than other coordination relationships, and 1:1 cycle relationships are more stable than other cycle patterns. However, the extent to which these data can be taken to support the use of inter-planning-oscillator relationships in normal speech coordination depends on assumptions about the relevance of evidence from repeated speech sequences to normal, non-repeated speech. That is, oscillator-governed mechanisms may be relevant for certain special varieties of speech where temporal periodicity is important. Taken as a whole, this evidence suggests that inter-planning-oscillator coupling mechanisms, and other possible mechanisms that lead to relative surface timing patterns, are plausible mechanisms for determining coordination patterns for some aspects of speech production, but it also raises the possibility that there may be non-oscillatory mechanisms that are equally suited to the task. Further evidence supporting the consideration of alternative, non-oscillatory mechanisms is presented in Section 5.3.
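The stability asymmetries appealed to here are standardly derived from the Haken–Kelso–Bunz (HKB) equation for the relative phase φ of two coupled oscillators, dφ/dt = −a sin φ − 2b sin 2φ, in which the antiphase attractor (φ = π) loses stability when the coupling ratio b/a drops below 0.25, as it is assumed to do at fast rates. The following sketch illustrates the predicted abrupt antiphase-to-in-phase transition; the parameter values and integration settings are arbitrary illustrative choices.

    import numpy as np

    def hkb_rate(phi, a, b):
        # HKB relative-phase dynamics: phi = 0 is in-phase coordination,
        # phi = pi is antiphase coordination.
        return -a * np.sin(phi) - 2.0 * b * np.sin(2.0 * phi)

    def relax(phi0, a, b, dt=0.001, steps=20000):
        # Simple Euler integration from an initial relative phase.
        phi = phi0
        for _ in range(steps):
            phi += dt * hkb_rate(phi, a, b)
        return phi

    phi0 = np.pi - 0.1                 # start near antiphase, slightly perturbed
    print(relax(phi0, a=1.0, b=1.0))   # b/a = 1: stays near pi (antiphase stable)
    print(relax(phi0, a=1.0, b=0.1))   # b/a < 0.25: switches to ~0 (in-phase)

The same family of coupled-oscillator models also underlies the 1:1 frequency-locking prediction discussed in Section 5.2.3.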

5.2.1 Relative timing evidence is consistent with coordination control based on oscillator phase relationships, but also with other mechanisms

Situations where three or more events (e.g. movement onsets) are not synchronous, but the durations of the intervals that they define scale proportionally across differences in overall movement rate, can be efficiently modeled using oscillator phase relationships. This is because a fixed phasing relationship preserves the relative, or proportional, timing of events that correspond to proportions of an oscillator period, even when oscillation frequency changes. Some examples of proportional scaling can be found in the non-speech motor control literature (Diedrichsen, Criscimagna-Hemminger, and Shadmehr 2007; Hore and Watts 2005; Schmidt 1988; Heuer and Schmidt 1988), and behaviors consistent with this view have also been reported for speech. For example, Gaitenby's (1965) study showed proportional scaling for almost all measured intervals in the utterance I consider myself across changes in speech rate (the closure period for the voiced stop /d/ was the exception). Similarly, de Jong (2001a) found that closure, VOT, and vowel interval durations in CV syllables (but not VC syllables) exhibit near-proportional scaling across differences in speaking rate in English (see also Allen and Miller 1999). Likewise, prevoicing for voiced stops and voice onset time for voiceless stops increase with syllable or word duration in Swedish (Allen and Miller 1999; Beckman et al. 2011). However, many claims of proportional scaling of duration have been criticized by Gentner (1987) on statistical grounds: in many cases, conclusions have been based on data averaged over participants and/or items, which yields ambiguous results. This is because proportional timing is proposed to hold for individual instances, and while it is true that if proportional timing holds for individual cases it will hold for the averaged data, the reverse is not necessarily true. Additionally, in some early studies, proportional timing was inferred from statistical correlations between a whole interval and part of the same interval (part–whole correlations; see discussion in Benoit 1986). These conclusions are dubious because statistically significant correlations are expected for any two intervals in a part–whole relationship. Gentner (1987) proposed that a more stringent test of proportional scaling would be to determine whether the ratio of a particular interval duration to the duration of an entire movement sequence is invariant across differences in movement sequence duration (e.g. rate). His regression-analysis-based meta-study of six speech and over twenty non-speech studies suggested that proportional invariance is almost never perfectly met, and that some intervals appear to scale more readily with differences in rate than others.
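The logic of Gentner's test can be stated compactly: if proportional invariance holds, the ratio of a component interval to the total sequence duration should show no systematic relationship with total duration, so a regression of that ratio on total duration should have a slope near zero. The following sketch applies this logic to simulated data; it is our illustration of the reasoning, not a reconstruction of Gentner's actual analyses, and all numerical values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    total = rng.uniform(0.4, 1.0, size=200)   # total durations (s) across rates

    # A truly proportional component: 30% of the total, plus noise.
    comp_prop = 0.30 * total + rng.normal(0.0, 0.01, size=200)

    # A component with a fixed 60 ms portion plus a proportional portion:
    # this violates proportional invariance.
    comp_mixed = 0.060 + 0.20 * total + rng.normal(0.0, 0.01, size=200)

    def invariance_slope(component, total):
        # Slope of the regression of ratio = component/total on total.
        # Proportional invariance predicts a slope near zero.
        slope, _ = np.polyfit(total, component / total, 1)
        return slope

    print(invariance_slope(comp_prop, total))    # near 0: invariance holds
    print(invariance_slope(comp_mixed, total))   # clearly negative: rejected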

However, to put these results in perspective, Schmidt et al. (1998) note that some of the departures from proportional invariance that Gentner (1987) observed were numerically trivial, and suggest that proportional invariance might hold at a planning level, but not at the surface, because of noise in the motor implementation system, at least in some situations. Löfqvist (1991) provided two examples of the application of Gentner's test to speech data at different rates. The first involves lip-raising and jaw-cycle intervals in CV-labial-consonant-VC sequences. In this example, the jaw-cycle interval was the duration from the onset of jaw lowering for the first vowel to the onset of jaw lowering for the second vowel, and the lip-raising interval was the duration of the interval from the onset of jaw lowering for the first vowel to the onset of lower-lip raising. The second example involves oral constriction and glottal opening intervals in voiceless stops and fricatives, where the duration of the glottal-opening interval was defined as the interval from the onset of glottal opening to its peak. He showed that proportional invariance could be rejected in 90% of the CV-labial-consonant-VC cases, but in only 33% of the voiceless obstruent cases. This result suggests the possibility of greater stability in relative timing for intrasegmental coordination than for intersegmental coordination (Löfqvist 1991; Byrd 1996), and thus, following Schmidt et al.'s (1998) logic, that relative timing control might be appropriate for intrasegmental control, even if the evidence for relative timing control is weaker for intersegmental coordination. De Jong's (2001b) results likewise suggest the possibility of relative timing control in some contexts. He found that closure durations, VOT, and voiced vowel interval durations scaled proportionally with speaking rate for CV syllables, but did not scale proportionally for VC syllables. VC syllables adapted to speaking rate in a different way: through the reduction and eventual disappearance of the glottal constriction between the release of the coda C and the onset of the next vowel, with only small differences in the absolute durations of the vowel interval and the coda C closure interval, as measured in solar time. These results suggest that mechanisms to implement proportional timing, or approximate proportional timing, may be required in some situations, e.g. for intrasegmental coordination and for CV syllables, but may not be appropriate for others. An important question, then, is whether proportional timing control (perhaps implemented via planning-oscillator phasing) would be required to account for cases such as these that show, or come close to, proportional invariance. Section 5.3.1 presents evidence suggesting that oscillator-based control is not required, because an alternative strategy based on spatial control can result in similar patterns of surface relative timing.

In this type of mechanism, actors might begin or end a movement at the time when another movement achieves a particular (relative) spatial position, rather than at a particular phase of a control oscillator. Tilsen (2013, 2018) provides a proposal of this type for coda coordination within the AP framework. This type of mechanism would require keeping track of (expected) positions, but would not necessarily require the use of oscillatory control mechanisms. Another important question is what mechanism best explains cases where proportional scaling does not occur. Some non-proportional mechanisms are available within AP/TD: for example, AP/TD predicts that time-to-target-approximation relative to gestural onset should not show proportional scaling, since time-to-target-approximation is determined by spring stiffness and other properties which determine mass–spring settling time, and is thus invariant because it is not affected by the mechanisms which stretch or shrink gestural activation intervals. Additionally, differences in inter-planning-oscillator coupling strength might also yield departures from proportional scaling. However, another possible mechanism is control based on absolute solar timing (discussed in Section 5.3.2). Absolute timing control might be used, for example, to compute and specify absolute durations (i.e. durations expressed in surface, solar time) between sequential segmental landmarks in a sequence of phones, e.g. between the onset of voicing for a vowel, the onset of closure for a consonant, and the release of closure in a vowel-consonant (VC) sequence, cf. de Jong (2001b). And a third possibility is that multiple control mechanisms are available to a speaker, with different mechanisms used in different situations. Section 5.3.3 discusses results from an experiment suggesting that this may indeed be the case, i.e. that multiple control mechanisms are available, including both spatially based and absolute-solar-timing-based mechanisms, and that different mechanisms are adopted depending on circumstances.
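To see how a spatial trigger differs from a phase trigger, and why it can nevertheless reproduce proportional surface timing, consider the following toy sketch. It is our construction, not Tilsen's implementation; the cosine trajectory shape and the 80% path criterion are arbitrary choices.

    import numpy as np

    def spatial_trigger(trajectory, times, criterion=0.8):
        # Trigger the second movement when the first has covered a given
        # fraction of its path, regardless of elapsed time.
        distance = np.abs(trajectory - trajectory[0])
        progress = distance / distance[-1]
        return times[np.argmax(progress >= criterion)]

    # The same point-to-point movement shape, at slow and fast rates.
    shape = lambda t, T: 0.5 * (1.0 - np.cos(np.pi * t / T))
    t_slow = np.linspace(0.0, 0.4, 401)
    t_fast = np.linspace(0.0, 0.2, 201)

    print(spatial_trigger(shape(t_slow, 0.4), t_slow))   # ~0.28 s
    print(spatial_trigger(shape(t_fast, 0.2), t_fast))   # ~0.14 s

Because the slow and fast movements trace the same path shape, the spatial criterion is crossed at the same proportion (here about 70%) of total movement time in both cases, so proportional surface timing emerges without any oscillator-phase representation.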

5.2.2 Evidence from the CV-favored syllable structure is consistent with oscillator-based control of coordination, but also with other mechanisms

Another piece of evidence that is consistent with the AP/TD inter-planning-oscillator view of coordination comes from cross-linguistic typology. CV structures are more prevalent than VC structures, both within and across languages. AP/TD's theory of coordination offers an explanation for this asymmetry: CV structures are preferred cross-linguistically because the in-phase coordination of C and V planning gestures for these syllables is inherently more stable than the antiphase coordination pattern for VC syllables.

While AP/TD's account is attractive in its simplicity, Ohala (1996) offers an alternative explanation: information about the place and manner of consonant articulation is more readily available from the acoustics in CV sequences than in VC sequences, owing to the consonantal release into the following vowel, which always occurs in CV sequences but not always in VC sequences.

5.2.3 Evidence from repeated speech sequences is consistent with the coupled-oscillator approach to coordination, but also with other mechanisms

Two further types of evidence offer support for a coupled-oscillator model of coordination (although both of them involve the production of highly repetitive and quasi-periodic speech, which may involve a planning process that is different from that used for typical speech). One such line of evidence comes from the repetition of simple CV vs. VC syllables at various rates, and the other from the repetition of alternating-CVC-based tongue twisters.

5.2.3.1 Repeated sequences of VC syllables appear to transition to CV syllables at fast rates

Support for the view that CV syllables have an inherently more stable coordination pattern than VC syllables has come from experiments in which sequences of CV or VC syllables are repeated at a rate that increases over the course of the experiment. These studies are relevant because coupled oscillators with less stable patterns of phasing shift to more stable phasing patterns at fast rates (e.g. Haken et al. 1985). Thus the prediction was that if gestural relative timing organization is oscillator-controlled, it is unstable in VC syllables. As a result, repetitions of VC syllables should shift to the phasing pattern for CV syllables, because CV syllables have a more stable, in-phase phasing pattern. Consistent with this view, several researchers (e.g. Tuller and Kelso 1990; de Jong 2001b) have found that VC syllables in repeated VC VC VC . . . sequences do become more similar to CV syllables in repeated CV CV CV CV . . . sequences at fast rates. For example, Tuller and Kelso (1990) showed that repeated VC /ip/ syllables at increasing rates of speech transitioned abruptly from a pattern of peak glottal opening occurring at the point of minimum lip aperture (the onset of lip opening movement) to a pattern of peak glottal opening timed slightly later relative to minimum lip aperture, as was the case for CV syllables.

In contrast, CV syllables maintained the phasing relationship between glottal opening and lip aperture at all rates. These results support the view that the phasing pattern of constriction gestures for consonants and vowels in sequences of repeated CV syllables is inherently more stable than that in sequences of repeated VC syllables. However, because the experiments consisted of periodic repetitions of syllables, it is an open question 1) whether the stability of consonant-vowel coordination patterns for CV syllables spoken in non-repetitive, non-periodic contexts, as is more typical of communicative speech, is similarly greater than for VC syllables, and 2) whether the coordination patterns observed in CV vs. VC syllables in non-repetitive, non-periodic contexts are generated by an oscillatory planning process. As discussed in Section 5.3, non-oscillatory control processes are used to coordinate multi-movement actions in non-speech activity, and may be available for use in speech as well.

5.2.3.2 Intrusive gestures in repeating sequences of alternating syllables in tongue-twister experiments

Goldstein et al. (2007) and Pouplier and Goldstein (2010) present another set of findings consistent with the coupled-oscillator approach to coordination, again from experiments involving repeated speech sequences spoken periodically. They found that speakers producing rapid repetitive productions of syllable sequences with alternating alveolar and velar onset consonants, e.g. top cop top cop . . . , often mistakenly produce 'double' alveolar-velar onset articulations, with gradient amounts of constriction at both locations. These 'double' articulations follow a pattern that is consistent with the prediction that coupled oscillators should entrain to more stable patterns of frequency and phasing at fast rates. In normal productions of top cop top cop, there are two cycles of labial oscillation (–p –p –p –p) for each cycle of alveolar or velar oscillation (t– k– t– k–), but at the fast rate, the oscillations move toward a more stable 1:1 frequency pattern, resulting in the intrusive production of an alveolar constriction along with the intended velar /k/ constriction, and vice versa. In addition, the synchronous occurrence of the 'double' articulations is consistent with the coupled-oscillator prediction that in-phase, synchronous articulations are more stable than out-of-phase, sequential articulations. These results are consistent with the view that oscillations can couple in normal speech production. However, Goldrick and Chu (2014) provide a different, non-oscillatory explanation for the findings. They suggest that the source of the double articulation is not coupled-oscillator entrainment, but is instead the graded co-activation of symbolic representations in a phonological planning stage of production (see also Goldrick 2006; Goldrick and Blumstein 2006; Goldrick et al. 2011; Smolensky, Goldrick, and Mathis 2014).

On their view, faster rates of speech involve faster processing, and this encourages errorful co-activation of symbolic phonological representations in similar structural positions. This simultaneous activation of multiple phonological representations leads to simultaneous articulations in a separate, phonetic implementation stage of production. Another possibility is that the results of the Goldstein et al. (2007) and Pouplier and Goldstein (2010) experiments do in fact reflect the behavior of coupled oscillators for these tasks. However, the repetitive, periodic nature of the tongue-twister tasks used in most experiments raises the possibility that behaviors characteristic of coupled oscillators are observed because the tasks are intrinsically oscillatory, i.e. they involve repetition and/or (near-)periodicity. This view suggests that synchronous, 'double' articulations would be less likely in tasks which involve less repetition and periodicity (see Chapter 10 for further discussion). In addition, it raises the possibility that the coordination patterns in speech may not require coupled planning oscillators, since coordination patterns planned using other mechanisms (e.g. non-oscillatory spatial control, or absolute solar timing control) might nevertheless become oscillatory on the surface, when the speech tasks are periodic. On this view, the characteristic behavior of coupled oscillators observed in these experiments would be due not to oscillatory planning (coordination control) mechanisms, but rather to the coupling of oscillations (e.g. between tongue-tip and tongue-dorsum movements) that emerge because of the repetitive, periodic nature of the task. To summarize, although the evidence presented in this section is consistent with AP/TD's oscillatory approach to coordination, other explanations for these findings are available. In addition, the periodic, repeated nature of the tasks raises the question of whether behaviors characteristic of coupled oscillators may emerge in these tasks even when the planning processes that generate the coordination patterns are not oscillatory.

5.3 Evidence that requires the consideration of non-oscillatory approaches

The evidence presented above offers some support for the use of coupled planning oscillators in the coordination of speech movements, but this support is equivocal because the evidence is also consistent with other, non-oscillatory accounts.

Some reports in the literature, however, are not consistent with the coupled-oscillator approach. This section presents three lines of evidence that are of this latter type, and that therefore provide even stronger motivation to consider alternative approaches to coordination that differ from the coupled-oscillator, relative-phasing approach of AP/TD. The first line of evidence suggests that the preservation of proportional timing at different movement speeds may be due to coordination based on spatial characteristics, rather than temporal phasing (Section 5.3.1). This evidence suggests that oscillatory control is not required to account for patterns of proportional timing and scalability (one of the main advantages of an oscillatory approach), and that spatial control is a viable alternative to oscillatory control approaches in explaining such patterns. The second line of evidence is provided by coordination patterns that appear to be based on absolute solar time, and, if applicable to speech, would appear to require a phonology-extrinsic-timing approach, rather than a phonology-intrinsic approach based on inter-oscillator-phasing relationships (Section 5.3.2). And finally, other findings suggest that coordination is flexibly dependent on different sources of information (e.g. spatial information or absolute solar time information), depending on their reliability, again suggesting that non-oscillatory approaches can be appropriate (Section 5.3.3).

5.3.1 Spatial control as an alternative to control based on temporal phasing/relative timing

In many cases, patterns of coordination that can be described in terms of temporal phasing can also be described in terms of coordination of the onset of movement of one effector with the spatial position of another. It is therefore plausible that some patterns of coordination that appear to be triggered at a particular relative time, e.g. at a particular phase of a planning oscillator, may instead be generated by a non-oscillatory spatial controller. See, for example, Tilsen's (2013, 2018) proposal that movements for coda consonants are controlled to begin once a particular spatial position for the nucleus vowel has been reached. The non-speech motor control literature provides several examples consistent with the use of spatial coordination mechanisms. For example, Hore and Watts (2005) suggest that finger opening during baseball throwing is triggered when the hand reaches a particular position in its path, on the basis of an internal representation of the path, learned through practice. These investigators measured the timing of finger opening in slow, medium, and fast throws to a target 3.1 m away.

They found that for throws of different speeds, and for different types of throws (sitting and standing), there was no difference in the relationship between the onset of finger opening and the relative spatial position in the hand's angular path. These results are consistent with the idea that finger opening can be controlled either on the basis of a positional representation of the hand's path in space, or on the basis of a temporal coordination mechanism that is specified in terms of proportional timing or phase. Alberts, Saling, and Stelmach (2002) also present results consistent with the view that movements can be coordinated on the basis of spatial information. In their experiment, participants reached to grasp and lift a cylinder positioned on a table 30 cm away from the seated participant. The transport distance was varied in the 'obstacle' conditions by positioning an obstacle between the participant and the cylinder, where the obstacle was either 10 cm away from the participant (near) or 20 cm away from the participant (far). Participants were asked to reach over the obstacle to grasp the cylinder. As expected, the distance traveled by the wrist increased in the 'obstacle' conditions, as transport distance increased. However, the distance that the wrist traveled during grasp closing was essentially constant for each participant across conditions, in spite of systematic variation in the timing of grasp closing relative to target achievement across conditions. The results are therefore consistent with the view that participants timed the onset of grasp closing to occur at a fixed distance from the cylinder, i.e. that they used spatial information to coordinate the closing component of the grasp with the reach, as proposed earlier by Wang and Stelmach (1998, 2001) and Saling et al. (1998). Rand, Squire, and Stelmach's (2006) study of reach-to-grasp movements at different speeds explicitly tested the hypothesis that closing movements would begin at a fixed proportion of the total movement time, as predicted under a phasing-control hypothesis, but showed instead that the time of grasp-closure initiation was proportionally variable across their speed conditions. Their results thus added support to the view that grasp closure is initiated at a fixed distance from the target, but with the added nuance that grasp closure is initiated at a greater distance from the object at the fastest speeds. This result suggests that participants need a minimum amount of time to appropriately and/or accurately close the grasp, and begin relatively earlier to allow enough time to complete the task accurately (cf. Fitts' law). It also suggests that they are using a combination of strategies, based on distance (a spatial strategy) as well as the required minimum time to complete the task accurately (an absolute solar timing strategy). Broadly speaking, all of these findings suggest that spatial control mechanisms and/or absolute timing control mechanisms may be preferable to oscillator-based relative timing mechanisms in accounting for at least some surface coordination patterns, although up to now such evidence has been reported only for non-speech motor-control tasks.


5.3.2 Other aspects of coordination require control based on absolute surface time

Other patterns of coordination are more difficult to account for either in terms of proportional timing or in terms of control based on spatial information, and so appear to require the primary use of absolute, surface-temporal information. Chapter 4 presented several pieces of evidence that actors use representations of absolute surface time in coordinating with predicted, upcoming external events. This section presents further evidence that actors use absolute time in coordination, this time from experiments in which multiple strategies were available (absolute timing control, spatial control, and/or proportional timing control). For example, Diedrichsen, Criscimagna-Hemminger, and Shadmehr's (2007) study of the coordination of a thumb press with an ipsilateral arm movement showed that patterns of coordination that involved temporal overlap of the thumb press with the arm movement were controlled either spatially or in terms of relative phase, but that patterns of coordination that did not involve temporal overlap were controlled on the basis of absolute (solar) time. Their results are consistent with the view that a timekeeper which specifies absolute solar time intervals may be involved in at least some aspects of movement coordination. Participants were asked to perform a thumb-press task in coordination with an 8 cm, 350 ms arm movement, and different groups of participants were trained to produce the thumb press at different timing intervals either before or after the onset of movement (−500, −250, −150, −50, +150, +250, and +350 ms relative to movement onset). In training, they were given feedback as to the accuracy of the timing of the arm movement and as to the accuracy of the timing of their thumb press relative to arm-movement onset. In test trials, they were asked to generalize the training task to slower movements, i.e. to produce 500 ms arm movements and "to press the button as you have in the previous training blocks." Results showed that when the thumb press overlapped with the arm movement, participants generalized the timing training in a different way than when the thumb press did not overlap with the arm movement. That is, when the thumb press overlapped with the arm movement, slowing down the arm movement in the generalization trials resulted in a temporal interval between arm-movement onset and thumb press that was scaled proportionally with the longer arm-movement time, as would be predicted if participants had used state information about the arm (e.g. position, velocity, arm rotation, or percentage of movement time or distance completed) to time their thumb presses.

However, when the thumb press did not overlap with the arm movement, participants reproduced the absolute time interval they were trained on, suggesting that they had used an internal estimate of absolute time in controlling their thumb presses in the new task. Kimura and Gomi (2009) also present results consistent with the use of absolute solar time information, in this case in the modulation of a somatosensory reflex response in shoulder muscles. In their experiments, participants performed planar reaching movements toward a goal, along a straight path in a forward direction away from the body. A set of experiments manipulated four factors: 1) the presence, absence, and direction (leftward, rightward) of a force field which overlapped a part of the movement, 2) the speed of movement (slow, medium, fast), 3) the spatial location of the force field, and 4) the timing of perturbations designed to elicit a reflex response. Before each reaching movement began, participants were told which type of force-field condition they would face (null force field, leftward, or rightward). Results showed that the timing of the disturbance imposed by the force field had an effect on the amplitude of the reflex response, with higher-amplitude responses occurring when the disturbance was closer in absolute solar time to when the arm would be affected by the upcoming force field. In contrast, the spatial distance of the hand from the force field at the time of the disturbance had no significant effect. These findings suggest that actors continuously kept track of the time until force-field occurrence, and modulated their reflex amplitudes accordingly. Kimura and Gomi (2009) note that their findings suggesting that reflex modulation is based on absolute solar time appear to contradict other reflex-modulation findings in the literature which suggest that modulations may be based on spatial information (Sinkjaer, Andersen, and Larsen 1996; Xia, Bush, and Karst 2005). Kimura and Gomi (2009) speculate that tasks that involve coordination with the external environment (i.e. as when coordinating with an upcoming force field) may be different from tasks that involve coordination in situations such as walking and throwing, where the coordinated environment is "relatively static and fixed." They propose that reflex modulation is set on the basis of absolute solar time when the environment is relatively unstable. They note on p. 2229: "It may be that the reflex amplitude setting shifts and fixates in the spatial domain as adaptation and/or learning to a given environment progresses."

It is unclear from these studies what might be predicted for sequences of segmental goals in speech. On the one hand, speech is a highly practiced activity involving large degrees of articulatory overlap; on this view, speakers might be predicted to use spatial control, or relative-timing (oscillator-phase-relationship) control, at least to time the onsets of movements. On the other hand, speech involves producing the same segments in many different contexts; on this view, the speech 'environment' is relatively unstable, and might encourage absolute timing control of a sequence of acoustic landmarks (Stevens 2002). The model sketch presented in Chapter 10 proposes that the surface durations of intervals between speech landmarks are planned and specified in terms of absolute solar time; this view is supported by evidence showing that intervals that are longer in duration show larger amounts of temporal variability (e.g. Byrd and Saltzman 1998; Chen 2006; Nakai et al. 2012; Remijsen and Gilley 2008). As discussed in Chapter 4, this relationship between duration and variability is difficult to explain without some type of phonology-extrinsic timekeeper. However, it is possible that some coordination patterns in speech (e.g. intrasegmental coordination, Löfqvist 1991) are controlled on the basis of spatial or relative timing information. This is a topic that deserves investigation by varying both the distance and the speed of each coordinated movement, so that the spatial vs. relative vs. absolute nature of coordination control mechanisms can be revealed.
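The duration-variability relationship just cited is what a noisy phonology-extrinsic timekeeper would produce if its noise is scalar, i.e. if the standard deviation of a produced interval grows in proportion to the interval's mean, as in scalar timing models. The following is a minimal sketch under that assumption; the 5% Weber fraction is an arbitrary illustrative value, not one estimated from the speech studies cited above.

    import numpy as np

    rng = np.random.default_rng(1)

    def timed_interval(target_ms, weber=0.05, n=10000):
        # Scalar-timing sketch: a phonology-extrinsic timekeeper whose
        # noise SD is proportional to the target duration.
        return rng.normal(target_ms, weber * target_ms, size=n)

    for target in (100, 200, 400):               # target durations in ms
        produced = timed_interval(target)
        print(target, round(produced.std(), 1))  # SDs near 5, 10, 20 ms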

5.3.3 Coordination control is flexibly dependent on both temporal and spatial information, depending on the reliability of the information and the actor's age

A further line of argument for considering alternatives to phase-based, relative timing coordination mechanisms comes from evidence that coordination control is flexible, in the sense that it can depend on different sources of information according to their reliability. For example, Medina, Carey, and Lisberger (2005) tested the contribution of absolute-time (in solar time units), positional, and distance information to monkeys' eye-movement changes of direction, using tasks in which monkeys tracked moving targets that appeared on a screen in front of them. In a baseline condition, monkeys tracked a horizontally moving target; in another condition, they tracked a target that first moved horizontally, then changed direction after a fixed time interval. In the direction-change condition, the experimenters included probe trials that had only a horizontally moving target.

After the monkeys had sufficient experience with the direction-change stimuli (50–100 repetitions), the probe trials evoked a change of direction in smooth-pursuit eye movement at about the time when the change in direction was expected. This result shows that the monkeys had learned to expect a change in direction. However, because any combination of temporal, positional, or distance information could have triggered the change-of-direction eye movements on these probe trials, a further set of experiments tested the contribution of each type of information. Results showed that the monkeys could rely on temporal or distance information (or both) to coordinate their movements with the expected moving target, but were unable to use information about the position of the target on the screen. In a third set of experiments, the experimenters manipulated the reliability of the different types of information. When either temporal or distance information was made unreliable, the monkeys planned their eye movements according to the more reliable source of information.² These results generally show that monkeys are flexible in the type of information they use for coordinating their eye movements with an external stimulus, depending on its reliability, and that they can also use more than one type of information simultaneously. An additional experiment, by Kayed and van der Meer (2009), showed age-related differences in the strategies used for coordinating movements with an external stimulus. Their experiment involved infants catching a toy approaching with either different constant velocities or different constant accelerations. Results suggested that while younger infants initiated their movements based on information about the distance of the toy from the catching place, older, 48-week-old infants used a strategy based on time-to-contact. In the time-to-contact strategy, infants would initiate their movement at a fixed time before the toy would reach the catching place. This strategy was effective because it gave them the same amount of time to complete their catching movements regardless of the oncoming toy's velocity or acceleration. In contrast, strategies based on initiating the movement when the toy was at a fixed distance would require different movement times in situations where the toy moved toward them at different velocities/accelerations. Although infants younger than 48 weeks used strategies based on distance and velocity, most infants changed to a time-based strategy by 48 weeks, a strategy which may have been more efficient because it allowed them to keep their movement parameters relatively constant.

² See Clayards, Tanenhaus, Aslin, and Jacobs (2008), Toscano and McMurray (2010), and Beddor, McGowan, Boland, Coetzee, and Brasher (2013) for analogous evidence from speech perception that listeners are able to assess the relative reliability of different speech cues and give more weight to speech cues that are more reliable.

Thus both of these experiments support the view that additional coordination mechanisms beyond phase-based models are required to account for the full range of available motor-control strategies, because the models must include strategies based on spatial information and/or absolute surface time. While it is still unknown whether the findings presented here generalize to the actions of the vocal articulators in humans, they suggest that models such as AP/TD may need to be modified if they are to account for all aspects of coordination abilities. In particular, the results suggesting that actors can rely exclusively on absolute timing information in coordinating movements are difficult to account for without a timing mechanism that specifies absolute timing relationships in solar time, which AP/TD cannot do. Thus these findings again suggest the importance of considering alternative accounts of articulatory coordination that operate in this way.

5.4 Evidence that timing relationships in movement coordination are not always based on movement onsets

This section discusses a particularly critical aspect of the way coordination is implemented in AP/TD, namely that coordination is planned directly in terms of the relative timing of movement onsets, as opposed to other parts of movement, e.g. target-related movement endpoints. That is, in AP/TD, gestural-planning-oscillator entrainment patterns (i.e. in-phase or antiphase) determine the relative timing of the onsets of gestural activation. When the entrainment pattern is in-phase, gestural activation interval onsets are simultaneous; when the entrainment pattern is antiphase, one gestural activation interval begins halfway through the other gesture's planning-oscillator period. Although the timing of gestural target approximation is not explicitly planned in AP/TD, if the gestural activation interval is long enough, target approximation will occur at a predictable time after the gestural activation onset (i.e. at a time defined by gestural activation onset time, plus the properties of the gestural mass–spring system that determine how long it takes to reach equilibrium from a starting position, plus gestural overlap). However, if the gestural activation interval is shorter than the mass–spring settling time, the target will be undershot and will not be approximated. This section presents a series of findings that together suggest that movement onsets are often less tightly coordinated than movement endpoints, supporting the idea that patterns of coordination are more likely to be based on the parts of movement most directly related to the movement goals (often the movement endpoints) than on movement onsets.

This evidence challenges AP/TD because the timing of target approximation (i.e. mass–spring settling time) in AP/TD is an emergent property of each mass–spring system and is not explicitly represented in its phonology-intrinsic timing system. The previous chapter presented evidence suggesting that the part of movement most closely related to the movement goal (often the movement endpoint) shows less timing variability than other parts of movement, in coordination with external stimuli, with internally generated periodic beats in synchronization-continuation tasks, and with voice onset in speech. For example, Perkell and Matthies (1992) reported less variability in the coordination of lip-protrusion endpoint timing with voice onset time for /u/ than for lip-protrusion onset timing. In addition to suggesting that individual movements are not controlled using mass–spring oscillators (because in mass–spring models, the timing of movement onset is not independent of the timing of target approximation), these findings suggest that movement coordination may be based primarily on the part or parts of movement most closely related to the movement goal (e.g. movement endpoints), rather than on movement onsets as is currently implemented in AP/TD (e.g. Goldstein et al. 2009). This section presents further evidence suggesting that coordination with respect to movement onsets does not always occur, and that instead coordination often occurs with respect to the part(s) of a movement most closely associated with the movement's goal, i.e. often the endpoint. This evidence suggests that AP/TD's current model would need to be modified to account for target- or endpoint-based coordination. Note in this regard that Browman and Goldstein (1989, 1992a) initially proposed coordination based on targets, in particular for nucleus vowels coordinated with coda consonants, as well as peak glottal opening coordinated with stop release for aspirated stops; see also Löfqvist (1991). However, in the current versions of AP/TD, which use planning-oscillator entrainment mechanisms for gestural coordination, movement-target- (or gestural-target-)based coordination would be challenging, since gestural targets do not occur at a fixed proportion of a planning-oscillator period. That is, they occur at a fixed duration in solar time from gestural onset, i.e. at mass–spring settling time, but at different proportions of a planning-oscillator period depending on speech rate and prosodic position. Because AP/TD does not have a representation of mass–spring settling time (like all other surface time intervals, it is an emergent property), it would be difficult to identify the time of gestural targets so that they could be coordinated.

Even if gestural targets could be identified and coordinated, it would be difficult to determine when the gestural activation interval should begin without an explicit representation of gestural mass–spring settling time. An additional problem arises when gestural activation is shorter than mass–spring settling time (i.e. at fast rates of speech). In these cases, the target is undershot, not approximated, and so the gestural target is not available for coordination. As explained in Section 5.3, there are other mechanisms, which do not rely on planning-oscillator entrainment, that may be better suited to the coordination task. In Section 5.4.1, evidence is presented from non-speech activity suggesting that movements of the two hands that are planned to be synchronous (either through explicit instruction that they begin at the same time, or through the accomplishment of a single, bimanual goal that requires some degree of synchronization) are not necessarily synchronized at their onsets. Evidence from speech is then reviewed that suggests the same thing. That is, movements toward consonant and vowel targets in CV syllables, proposed to be in an in-phase coordination relationship in AP/TD, often show variability in relative timing according to segmental context. This evidence suggests that movement onsets are not in a fixed coordination relationship. Finally, evidence is presented from studies which have compared the temporal coordination of the parts of movement related to goal achievement (often the endpoints) with that of other parts of movement. These studies suggest tighter temporal coordination at goal-related parts of movement (often the endpoints) than at other parts of movement. This evidence suggests that AP/TD's current model would need to be modified to account for endpoint-based coordination. As explained in Section 5.4.2, there may be other mechanisms which are better suited to the task; one of these, Lee's (1998) non-oscillatory tau-coupling mechanism, is discussed in some detail.

5.4.1 Evidence that synchronized non-speech movements do not necessarily begin synchronously

One type of activity that has informed theories of coordination control is bimanual activity, in which participants are instructed to move two hands as fast as possible toward separate targets in response to a 'GO' signal. Kelso, Southard, and Goodman (1979) is often cited as a classic paper showing evidence of temporal synchronization among movements, from beginning to end, because its results showed that bimanual activity can induce temporal assimilation in the behaviors of the two hands. However, as discussed further on in this section, later experiments cast doubt on this interpretation.

In the Kelso, Southard, and Goodman (1979) experiment, participants were asked to move their index fingers from a home position on a table in front of them to targets of different sizes (3.2 cm vs. 7.2 cm target width), positioned at different distances from the home position on the same table (6 cm vs. 24 cm). They were asked to perform lateral movements away from the midline, lateral movements toward the midline, and movements in a forward direction away from the midline, in several conditions: 1) unimanual movements to near, wide targets, predicted to be 'easy' by Fitts' (1954) law and therefore to require less movement time; 2) unimanual movements to far, narrow targets, predicted to be 'hard' by Fitts' law and to require more movement time; 3) bimanual movements to equidistant targets of the same width, one per hand; and 4) bimanual movements in which one hand moved to a near, wide target and the other hand moved to a far, narrow target. Participants were instructed to move to the targets as quickly and as accurately as possible, after a visual warning light and a subsequent auditory signal instructed them to begin moving. Results showed significant temporal assimilation in bimanual movements where one hand moved to an 'easy' target and the other moved to a 'hard' target. In these mixed-target bimanual conditions, the time to initiate the movements after the GO signal (reaction time) and the movement times to the 'easy' target were longer than in unimanual movements, and longer than in bimanual movements where both left- and right-hand targets were 'easy'. For example, while bimanual movements to matched easy targets showed a mean right-hand movement time of 85 ms, and bimanual movements to matched hard targets showed a right-hand movement time of 169 ms, a right-hand movement time of 133 ms was observed to the easy target when the left hand's target was 'hard', and a right-hand movement time of 158 ms was observed when the right hand's target was 'hard' and the left hand's target was 'easy'. The investigators concluded that in unmatched bimanual conditions, the movement to an easy target is set at a different speed so that its velocity and acceleration patterns are synchronized with those of the movement to the 'hard' target. That is, they interpreted their results as evidence for temporal synchronization of the movements of the two hands, from beginning to end.
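As a point of reference for the 'easy' and 'hard' labels, Fitts' (1954) index of difficulty can be computed for the two target types using the distances and widths given above. This worked check uses the standard formulation ID = log2(2D/W); the specific bit values are our calculation rather than figures reported by Kelso and colleagues.

    import math

    def fitts_id(distance_cm, width_cm):
        # Fitts' (1954) index of difficulty, in bits; predicted movement
        # time grows linearly with it: MT = a + b * ID.
        return math.log2(2.0 * distance_cm / width_cm)

    print(fitts_id(6.0, 7.2))    # 'easy' target:  ~0.74 bits
    print(fitts_id(24.0, 3.2))   # 'hard' target:  ~3.91 bits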

Marteniuk, MacKenzie, and Baba (1984) present results that supported Kelso, Southard, and Goodman's (1979) finding of temporal assimilation for bimanual movements to unmatched 'easy' and 'hard' targets. However, their results challenged Kelso, Southard, and Goodman's view that the movements themselves were synchronized from beginning to end; importantly, they showed significant differences in movement initiation times for movements to 'easy' vs. 'hard' targets in unmatched conditions. In Marteniuk, MacKenzie, and Baba's experiment, the targets were smaller than in the Kelso, Southard, and Goodman experiment (1 mm wide vs. 3.6 or 7.2 cm wide), and movement distance (10 cm vs. 30 cm) and stylus weight (50 g vs. 350 g) were varied systematically. They found that when one hand moved further, the other hand, moving a shorter distance, had an increased movement time (consistent with Kelso, Southard, and Goodman 1979), but its movement time was not as long as that of the hand moving the longer distance.³ Differences in movement time were in part compensated for by reaction time, i.e. the time of movement onset for each hand relative to the time of the GO signal: when the hands moved different distances, the delay before beginning the movement to the longer-distance target was often shorter than the corresponding delay when both hands moved the same distance. These findings suggest that bimanual target-directed movements are often not synchronized at their onsets, particularly when movement difficulty is very different for the two hands (because of differences in distance, transported mass, or target width). In summary, Marteniuk, MacKenzie, and Baba's data cast doubt on the view that movements that are triggered simultaneously by a common GO signal are necessarily synchronized at their onsets. These results raise questions about onset-based coordination mechanisms, because one might expect cases like these to be the most likely to show onset synchrony if coordination is usually based on movement onsets. Relatedly, speech studies have shown that coordinated movements can show systematically different timing patterns for movement onsets depending on the segmental context, a finding which is also not predicted by the current AP/TD model. For example, the onset of a C movement can occur later relative to the onset of a V2 movement in e.g. [ipa] compared to [api] (Löfqvist and Gracco 1999; Šimko 2009; O'Dell et al. 2011; Šimko, O'Dell, and Vainio 2014). Findings of different relative timing patterns, e.g. for [ipa] vs. [api], challenge the AP/TD view that C and V gestures in CV syllables are tightly and uniformly coordinated at their onsets. These studies raise the question of whether goal-related movement endpoint coordination might be more appropriate, and suggest at least that other principles of movement coordination are worthy of consideration.

³ Additionally, they found that temporal assimilation was accompanied by spatial assimilation: for example, overshoot was more often observed for 10-cm movements made by one hand when the other made movements of 30 cm than when both hands moved 10 cm.


5.4.2 Evidence for coordination of goal-related parts of movement, often the endpoints

Results from other tasks, where the coordination of onsets is explicitly compared with the coordination of the parts of movement most closely related to goal achievement (e.g. movement endpoints), suggest that coordination is often based on goal-related parts of movement. For example, Gentner et al.’s (1980) study of 147 keystrokes performed by a skilled typist showed that the timing of the ends of the keystroke movements (related to the goal of pressing the key) was much less variable than the timing of the starts of the movements, when measured relative to the timing of the previous key press (Figure 5.1).

[Figure 5.1 appears here: a histogram of keypress start times and end times; x-axis: time (ms) relative to the previous keypress; y-axis: number of instances.]

Figure 5.1 The distribution of keypress start and end times measured relative to the previous keypress. Based on a similar figure in Gentner et al. (1980, p. 2).
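The comparison at issue is simply one of variability: are endpoint times more tightly clustered around their mean than onset times? The sketch below illustrates the form of that comparison with invented data whose standard deviations are chosen to mimic the qualitative pattern in Figure 5.1; it is not Gentner et al.’s data or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented keystroke data (ms, measured relative to the previous keypress):
# endpoint times are generated with a small SD and onset times with a large
# SD, mimicking the qualitative pattern in Figure 5.1.
n_keystrokes = 147
start_times = rng.normal(loc=-300.0, scale=150.0, size=n_keystrokes)
end_times = rng.normal(loc=50.0, scale=30.0, size=n_keystrokes)

# The variability comparison: SD of onset times vs. SD of endpoint times.
print(f"SD of movement starts: {np.std(start_times, ddof=1):5.1f} ms")
print(f"SD of movement ends:   {np.std(end_times, ddof=1):5.1f} ms")
```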


Likewise, Bootsma and van Wieringen (1990) showed that the timing of initiating attacking forehand drives in table tennis was more than twice as variable as the timing of paddle contact with the ball. Forehand drives in this experiment had average movement times that ranged between 92 and 178 ms. The timing accuracy at paddle-ball contact was estimated on the basis of the ratio of the standard deviation of the direction of travel (an angular measure) to its mean rate of change, and was calculated to be within 2–5 ms. In contrast, standard deviations for total movement time ranged from 5 to 21 ms depending on the player, showing that movement initiation times were much more variable. See also Katsumata and Russell (2012) for similar evidence for endpoint-based coordination when hitting a ball with a rod.

In a task that was explicitly periodic, Craig, Pepping, and Grealy (2005) found that movement endpoints, but not movement onsets, were timed synchronously with an externally presented beat pattern. In their experiment, they asked participants to move a graphics-tablet pen repetitively between target zones of equal width, in time with the sounding of a 50 ms beat. For the auditory stimuli, ‘beeps’ defined a cycle of 3 or 4 seconds’ duration, where each cycle was subdivided by a tone of lower frequency (‘bop’). Subdivisions were either symmetric or asymmetric. When the subdivisions were symmetric, the time interval between ‘beep-bop’ was the same as that between ‘bop-beep’, and participants had equal time intervals to move between targets; when the subdivisions were asymmetric, there was a shorter time interval between ‘beep-bop’ than between ‘bop-beep’, and participants had to time their movements accordingly to reach the targets on time. Results showed that participants did not move continuously during the task, but rather waited in the target region before moving to the next target, in order to reach the movement target synchronously with the ‘beep’ or ‘bop’. This strategy of not moving all of the time, and of starting movement toward the target relatively late when the interval between targets was long, may have been adopted to achieve greater temporal accuracy of target attainment, since faster movements are known to be more temporally accurate than slower movements (Schmidt 1969, cited in Hancock and Newell 1985).

All in all, these results suggest that movement-tone coordination is based on the timing of movement endpoints, rather than on movement-onset timing. In other words, participants used the tones to condition target-attainment timing, rather than to condition onset timing directly.⁴

⁴ Craig, Pepping, and Grealy (2005) suggest that actors time movement targets with respect to the beat by estimating the time remaining until a beat occurs and timing their movements accordingly. They interpret their results in terms of the tau-coupling framework of General Tau theory (Lee 1998), discussed in more detail in Section 5.5 and in Chapter 9. According to Craig, Pepping, and Grealy (2005), following Lee (1998), participants continuously keep track of the time-to-the-next-beat, and control the time-course of their movements so that arrival of the pen at the target is achieved simultaneously with the next beat.

Haggard and Wing (1998) present results from reach-to-grasp movements that also suggest a coordination pattern based on movement-goal attainment as opposed to movement onsets.


In their experiment, participants reached to grasp a vertical target dowel placed 30 cm away on a table in front of them. They were instructed either to reach along a straight path toward the dowel, or to curve the path. When the paths were straight, hand-opening movements started synchronously with the transport movement, but when the paths were curved, and the transport movement therefore took longer, hand-opening movements were delayed. These results suggested that the two types of movements (transport and opening) were not necessarily synchronized at their onsets, but that movement onsets were timed so that the grasping goal could be achieved at the end of the reaching movement.

Kazennikov et al. (1994) also present evidence consistent with this view, for movements made by monkeys. These investigators trained three monkeys to open a spring-loaded drawer with the left hand, while retrieving food from the open drawer with the right hand. They examined the temporal coordination of the two hands at different points in the movement sequence. The monkeys showed tight temporal coordination of the two hands at the movement goal, that is, when the food was being picked up (e.g. r = .93 for one of the monkeys), but less stringent temporal coordination at other points in the sequence (e.g. r = .71 at movement onset for the same monkey).

The available data for speech (Perkell and Matthies 1992) also support goal-related, endpoint-based coordination. In this study, discussed in more detail in Chapter 4, upper-lip movement endpoint timing was less variable than timing for earlier parts of movement, relative to reference events in the speech signal, suggesting that the precise coordination of the timing of movement endpoints was of greater importance than the timing of other parts of movement, e.g. parts of movement closer to the movement onset.

Data for non-speech gestures aligned with speech show a similar pattern of least variability for goal-related parts of movement. Leonard and Cummins (2011) found lower timing variability at the endpoint of extension, compared to other parts of movement, for hand+arm ‘beat’ gestures that co-occur with speech. They recorded hand+arm movements (by recording the movement of an LED marker attached to the base of the thumb) while a speaker read two repetitions of three short fables. They found that the point of maximum extension of the hand before retraction had the least timing variability compared to other parts of movement (movement onset, peak velocity of extension, peak velocity of retraction, and retraction endpoint), measured relative to landmarks in the stressed syllable in each word. These findings suggest that the point of maximum extension is the part of movement which is coordinated with the stressed syllable.


Shaffer (1982) and Semjen (1992) note that these findings show that the function of a “sequence-level” timing plan “does not consist of triggering the movements, but of providing temporal reference points at which movements must produce their behaviorally meaningful effects . . . these reference points provide for coordination between streams of events that unfold in independent motor subsystems” (Semjen 1992, pp. 248, 256).

5.5 Possible mechanisms for endpoint-based timing and coordination

In the phonology-intrinsic-timing AP framework, planning-oscillator entrainment could potentially be used to coordinate the ends of gestural activation intervals with each other, in the same way that it is currently used to indicate when gestural activation intervals should begin. Thus, a gestural activation interval’s onset could potentially be specified to start at a particular gestural-planning-oscillator phase before the activation-interval offset. However, this mechanism could not be used to implement target-based or target-related endpoint coordination, since activation-interval offset corresponds to target approximation only at the default speech rate, and would not correspond to target approximation when speech rate is slower, or in prosodic positions where a Pi or MuT gesture stretches the activation interval. And in cases where the speech rate is faster than the default, the movement would be truncated, i.e. the target would not be approximated. As discussed above, this is because target approximation is determined by the mass–spring settling time, and not by the activation interval.

It is possible to identify the beginning and end of the activation interval in terms of planning-oscillator phase proportions, but not the mass–spring settling/target-approximation time, because the mass–spring settling time depends on gestural mass–spring stiffness (stored as part of the gestural representation), and not on planning-oscillator frequency. It is therefore difficult within the current AP/TD system to identify the time of target approximation so that it can be coordinated with other events. In addition, it would be difficult within this system to specify the timing of gestural onsets so as to approximate the gestural target on time, since an explicit representation of gestural mass–spring settling time is not possible. Finally, as discussed in Chapter 4, AP/TD’s mass–spring oscillators do not provide an account of the greater timing variability that is often observed at movement onsets compared to movement targets.
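To make the settling-time point concrete: in Task Dynamics a gesture is governed by critically damped mass–spring dynamics, so the rate at which the articulator approaches its target is fixed by the gesture’s own dynamical parameters, not by planning-oscillator phase. A minimal sketch, using the standard second-order form (the 1% settling criterion is an illustrative choice):

```latex
% Critically damped gestural dynamics (target x_0, damping b = 2\sqrt{mk}):
m\ddot{x} + b\dot{x} + k(x - x_0) = 0

% The solution decays as e^{-t/\tau_c} with time constant
\tau_c = \sqrt{m/k},

% so the time to settle to within (say) 1% of the target distance is
t_{\text{settle}} \approx 6.6\,\sqrt{m/k},

% a function of gestural stiffness k and mass m, with no term referring
% to planning-oscillator frequency.
```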


These arguments demonstrate some substantial problems faced by the AP/TD approach in modeling endpoint-based coordination, as well as the distribution of timing variability across a movement. In contrast, Lee’s (1998) General Tau theory (discussed in more detail in Chapter 9) provides a way to achieve endpoint-based coordination, and at the same time provides an account of the difference in timing variability observed at movement onsets vs. endpoints.

On this theory, the information used to guide movement is the continuously changing tau of the gap between the current position and the gap closure/target, where tau is the time-to-gap-closure at the current gap-closing rate. Actors plan to close each gap (i.e. reach each target) at a specified time, and the tau information specifies how soon the gap will close if movement continues at the current rate. On the basis of this information, movements can be adjusted to reach targets on time. For example, if the current tau specifies that gap closure would occur too soon if movement continued at the current rate, movements can be slowed to reach the target at the right time.

Coordinated movements can be achieved through tau coupling of one movement to another, via perceptual information, or through tau coupling to an internally (i.e. cognitively) generated tau Guide, which is an abstract ‘pattern’ for the time-course of movement, based on Newton’s equations of motion (see Chapter 9 for the tau Guide equation). The tau Guide coupling mechanism provides a way to achieve target-based timing and coordination even when sensory information about to-be-coordinated movements is not available. As discussed in more detail in Chapter 9, many different types of movement, including speech movements, show tau profiles (tau time series) that are in constant proportion to the tau profile of the proposed tau Guide. Tau coupling is achieved by keeping the taus of movements (or the taus of a movement and the taus of the tau Guide) in constant proportion: movements whose tau functions are in constant proportion will end at the same time. That is, as explained in Lee et al. (2001), when two movement tau functions are in constant proportion, or when a movement tau function is in constant proportion to the tau Guide function, so that tau_A = k · tau_B, tau_A reaches zero as the target is reached, and because tau_B is in constant proportion to tau_A, it reaches zero at the same time.

On this theory, movement coordination involves endpoint-based coordination. It does not require the time-course of the coordinated movements to be the same (only that they be in constant proportion), which means that two tau-coupled movements might have velocity peaks that don’t occur at the same time.


Nor is there a strict requirement for the movements to begin at the same time: if one of two coordinated movements starts later than the other, it is assumed that the later-onset movement is accelerated until tau_later = k · tau_earlier, and the relation is then maintained so that the two movements end at the same time. That is, the timing of movement onset can vary, as long as the movement doesn’t start too late for tau coupling to occur by the end of the movement. Similarly, if the taus of a movement are coupled onto the taus of an internally generated tau Guide, the movement will end at the end of the tau Guide (i.e. at the specified movement-target achievement time), as long as the taus are in constant proportion by the end of the movement. This provides an account of the greater timing variability at movement onset compared to target achievement.
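The quantities involved can be stated compactly. The definition of tau and the coupling relation are as in Lee (1998); the intrinsic tau Guide equation below is the form Lee gives for a movement of planned duration T (it is also presented in Chapter 9):

```latex
% Tau of a gap x(t) closing at rate \dot{x}(t):
\tau(t) = \frac{x(t)}{\dot{x}(t)}

% Tau coupling of movement A onto movement B (or onto the tau Guide),
% with coupling constant k: both taus reach zero (gap closure) together.
\tau_A(t) = k\,\tau_B(t)

% Lee's intrinsic tau Guide for a movement of planned duration T:
\tau_G(t) = \frac{1}{2}\left(t - \frac{T^2}{t}\right), \qquad 0 < t \le T,
% which is negative while the gap is closing and reaches zero exactly at t = T.
```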

5.6 Planning inter-movement coordination and movement-onset timing

Although tau theory offers a way of guaranteeing that movements end at the same time when they need to, and offers a way of reaching a target at a specified point in time, it does not provide a principled explanation for the timing of movement endpoints relative to other movements in a sequence. Optimal Control Theory approaches (discussed in Chapter 8) provide an explanation for interval durations (e.g. inter-endpoint interval durations), by proposing that movement parameter values are chosen that satisfy task requirements at minimum cost. Because time is a cost, inter-target intervals will be planned to be as short as possible while still achieving the requirements for the utterance. What this means is that movements will often overlap (cf. Šimko 2009; Šimko and Cummins 2010, 2011).

Tau theory also lacks a principled explanation for planning movement durations (i.e. for planning when a movement should start). Even though the timing of the movement onset is not critical according to this theory, because the endpoint can be reached on time as long as the movement does not start too late, a movement-duration specification (and thus a planned duration) is nevertheless required for the tau Guide equation.

Optimal Control Theory approaches propose that planning movement durations (and therefore their onset times, assuming the endpoint time has already been determined) requires balancing task requirements, such as 1) desired spatial accuracy and/or 2) desired temporal accuracy at target achievement, against the costs of movement, e.g. energy/effort and time.


As discussed in Chapter 4, shorter-duration movements are predicted to have greater temporal accuracy (because they involve fewer timing units of a noisy clock; Schmidt 1969; Hancock and Newell 1985), but, as discussed in Chapter 3, longer-duration movements are predicted to have greater spatial accuracy (because they have more time to home in on the target; Schmidt et al. 1979). Within many Optimal Control Theory approaches, actors balance these task requirements against the costs of movement (e.g. effort and time) to plan movements that have an optimal duration, given the task requirements and movement costs. This Optimal Control Theory approach suggests that movements are planned to start at an optimal time before the desired movement-endpoint time, and thus, like General Tau theory, suggests that actors explicitly plan movement durations.

An alternative to this type of explicit duration-planning strategy would be to plan to trigger a movement at the time of another event. For example, the influential look-ahead model of coarticulation (Henke 1966) proposed that movements should start when a relevant articulator is free to move (i.e. when it is not required by a preceding segment’s distinctive-feature specification). Because many speech movements are shorter in duration than predicted by the look-ahead theory (see e.g. Perkell and Matthies 1992, and the discussion in Fowler and Saltzman 1993), this theory appears inadequate to explain all movement-onset behavior. However, it is possible that movements are planned to start when articulators are free to move (as the look-ahead theory predicts), but fail to start on time because of inertia, noise, or error in the triggering system. The discussion of spatial strategies in Section 5.3.1 suggests that another strategy might be to trigger a movement when a particular (relative) spatial position of another articulator is reached.
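The accuracy/cost balance described above can be illustrated with a toy computation that selects the movement duration T minimizing a weighted sum of a time cost, a temporal-error term that grows with T (the noisy-clock assumption), and a spatial-error term that shrinks with T. All functional forms and weights here are invented for illustration; they are not taken from any particular Optimal Control Theory model.

```python
import numpy as np

# Toy cost model for planning a movement duration T (in ms).
# Assumed (illustrative) components:
#   time cost       ~ w_time * T             (longer movements cost time)
#   temporal error  ~ w_temp * (c_t * T)**2  (clock noise grows with T)
#   spatial error   ~ w_space / T            (more time -> better homing-in)
w_time, w_temp, w_space = 0.5, 2.0, 40000.0
c_t = 0.05  # assumed clock-noise coefficient (timing SD grows ~5% of T)

def cost(T):
    return w_time * T + w_temp * (c_t * T) ** 2 + w_space / T

# Evaluate over a grid of candidate durations and pick the minimum.
durations = np.linspace(50.0, 500.0, 1000)
T_opt = durations[np.argmin(cost(durations))]
print(f"Optimal movement duration under this toy model: {T_opt:.0f} ms")
```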

5.7 Summary of findings relating to movement coordination

Although some findings presented in this chapter relating to movement-onset timing are consistent with AP/TD’s planning-oscillator-phase-relationship-based account of gestural coordination in speech, they are also consistent with spatial control strategies, and these spatially based approaches find additional support in some of the literature on non-speech motor control. A more serious challenge to AP/TD is presented by the timing-accuracy findings relating to endpoint-based coordination, because of AP/TD’s inability to represent gestural mass–spring settling time. Because this settling time is an emergent property, it is not explicitly represented, and thus cannot serve as the basis for the control of movement coordination in AP/TD.


Endpoint-based coordination thus challenges any model which cannot reliably identify the timing of movement endpoints, and it requires a mechanism different from AP/TD’s planning-oscillator-based mechanism. Plausible suggestions for this mechanism include Lee’s General Tau theory, which proposes that movements are continuously coordinated on the basis of their tau profiles (the continuously changing time-to-target-achievement at the current movement rate), as well as mechanisms based on Optimal Control Theory.

If endpoint-based coordination is required, as these data illustrate, the timing of movement onset must also be planned, albeit with a lower requirement for accuracy than the timing of movement endpoints. For this, three different control strategies are possible: 1) starting a movement at a particular relative time with respect to another interval (similar to AP/TD’s inter-oscillator relative-timing control mechanism), 2) starting a movement when a particular event occurs, e.g. when an articulator is free to move, or when a spatial position is reached (e.g. a given distance from the movement target; see e.g. Tilsen 2013, 2018), or 3) starting a movement at a particular duration before the time of desired endpoint/target achievement. Further research is needed to determine which of these strategies is most appropriate for speech, or whether a combination of strategies is required.

Taken together, the findings presented in this chapter challenge the use of oscillators in AP/TD for onset-based movement coordination; the next chapter provides a further challenge to AP/TD’s use of oscillators, this time for patterns of suprasegmental structure.


6 The prosodic governance of surface phonetic variation: Support for an alternative approach III

6.1 Introduction

A great deal of evidence suggests that prosodic structure, including a hierarchy of constituents (e.g. syllables, words, and phrases), as well as prosodic prominences (e.g. word-level and phrase-level stress), affects the phonetic characteristics of utterances, including their timing. For example, acoustic and articulatory intervals that relate to phrase-final syllable rhymes and phrase-initial word onsets are longer than their phrase-medial counterparts; intervals that relate to phrasally stressed syllables are longer than those that relate to syllables that do not bear phrasal stress; and longer syllables and segments in these prosodic positions are also often hyperarticulated. The hierarchical nature of prosodic structure is evidenced by different degrees of lengthening at different junctures, and for prominence at different levels. For example, the final lengthening related to a full intonational phrase is usually greater than the final lengthening observed at more minor phrase boundaries, and syllable-related intervals which bear phrasal stress are longer than those which bear word-level stress but are not phrasally stressed. (See Chapter 10 for more detailed discussion and references.)

Although these facts about prosodic structure and its influence on speech are well accepted, there are several areas of disagreement. The contested issues include the relationship between prosodic structure and syntax; whether differences in ‘boundary strength’ implied by levels in the hierarchy and reflected in measurable phonetic differences are categorical or gradient; the types of constituents and number of levels that are included in the hierarchy; and the mechanisms that yield the observed phonetic effects that relate to prosodic structure.


Although AP/TD has little to say about the relationship between prosodic structure and other components of the grammar (such as syntax and semantics), and remains agnostic about the gradient vs. categorical nature of distinctions among constituent levels (Krivokapić 2020), it does make clear claims about the mechanisms that speakers use to produce the types of prosodic effects on surface phonetics that have been observed. These proposed mechanisms, discussed in detail in Chapter 2, include 1) Pi/MuT gestures, which adjust gestural activation intervals at phrase boundaries and in prominent positions, and 2) a hierarchy of coupled syllable-level, cross-word foot-level (where the foot, in AP/TD, is defined as an interval stretching from the onset of one lexically stressed syllable to the onset of the next), and phrase-level oscillators. The oscillation frequency of this coupled-oscillator hierarchy affects gestural activation intervals through effects on gestural-planning oscillators, because gestural activation intervals are defined as a fixed proportion of a gestural-planning-oscillator cycle. The coupling strength of each oscillator relative to each other oscillator in the hierarchy affects tendencies toward isochrony at different levels. And the overall oscillation frequency of this ensemble affects overall speech rate.

This chapter presents evidence relevant to these proposed mechanisms. The evidence is threefold: 1) evidence recapitulated from Chapter 4 that challenges AP/TD’s phonology-intrinsic timing and its default-adjustment approach based on Pi/MuT gestures and changes in planning+suprasegmental oscillation frequency (Section 6.2), here focusing on its relevance for prosodic mechanisms; 2) evidence relating to poly-subconstituent shortening that raises some questions about the constituents that are involved, and suggests that this phenomenon is not as uniform throughout an utterance as would be expected from AP/TD’s coupled-suprasegmental-oscillator control mechanism (Section 6.3); and 3) evidence relating to the control of overall speech rate (also presented in Chapter 4), which suggests that speaker strategies for manipulating speech rate are more diverse than would be expected from an oscillator account, and instead supports a phonology-extrinsic-timing account based on surface durations (Section 6.4). Together, these pieces of evidence suggest that it is time to consider alternative approaches to prosodic control.
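Before turning to that evidence, it may help to illustrate the kind of mechanism at issue. The sketch below simulates two coupled phase oscillators, a ‘foot-level’ oscillator and a slightly detuned ‘syllable-level’ oscillator, coupled in a 1:2 frequency relation. This is a generic Kuramoto-style illustration of phase-locking in a coupled-oscillator hierarchy, not the actual AP/TD implementation; the frequencies and coupling strength K are invented for illustration.

```python
import numpy as np

# Minimal sketch (not the AP/TD implementation): a "foot-level" oscillator
# near 2 Hz and a "syllable-level" oscillator near 4.1 Hz (slightly detuned
# from an exact 1:2 relation), coupled so that the syllable oscillator is
# attracted toward twice the foot oscillator's phase, and vice versa.
dt = 0.001
t = np.arange(0.0, 5.0, dt)
omega_foot = 2 * np.pi * 2.0
omega_syll = 2 * np.pi * 4.1
K = 5.0  # coupling strength (invented)

theta_f, theta_s = 0.0, 0.3
rel_phase = np.empty_like(t)
for i in range(len(t)):
    dtheta_f = omega_foot + K * np.sin(theta_s - 2 * theta_f)
    dtheta_s = omega_syll + K * np.sin(2 * theta_f - theta_s)
    theta_f += dtheta_f * dt
    theta_s += dtheta_s * dt
    rel_phase[i] = (theta_s - 2 * theta_f) % (2 * np.pi)

# After the initial transient, the relative phase settles to a constant:
# the oscillators are phase-locked in a 1:2 relation.
print(f"relative phase over the last 0.5 s: mean = {rel_phase[-500:].mean():.3f} rad, "
      f"SD = {rel_phase[-500:].std():.5f} rad")
```

After the transient, the relative phase settles to a near-constant value; this is the sense in which a coupled hierarchy imposes stable relative timing across levels.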

6.2 Evidence relating to Pi/MuT mechanisms for boundary- and prominence-related lengthening

AP/TD’s strategy for preserving intrinsic timing in gestural representations, while accounting for surface durational and spatial variability, is to propose a set of adjustment mechanisms that change the system’s default gestural activation intervals in different contexts.


These adjustment mechanisms include Pi (and, in more recent versions of the theory, the more general MuT) mechanisms for boundary- and prominence-related lengthening, as well as mechanisms for changing the oscillation frequency of the planning+suprasegmental oscillator ensemble. Chapter 4 presented a number of lines of evidence that challenge AP/TD’s phonology-intrinsic timing; here it is argued that that same evidence presents some serious challenges to the default-adjustment approach of AP/TD (including the use of Pi/MuT gestures) as an account of prosodically governed patterns of surface phonetic variation.

To review: First, the adjustments warp the correspondence between phonological ‘time’ (i.e. planning+suprasegmental oscillator periods) and solar time, making it challenging to explain how speakers and singers interact with externally timed stimuli (e.g. auditory stimuli such as instrumental accompaniments) without a representation of surface timing in solar-time units.

Second, Pi/MuT mechanisms provide no explanation for constraints on the amounts of final- and prominence-related lengthening observed on phonologically short vowels in some quantity languages; keeping surface durations distinct for phonologically short vs. long vowels across contexts in these languages requires a mechanism that refers to surface-duration representations, which are unavailable in AP/TD.

Third, Pi/MuT mechanisms provide no explanation for observations of greater timing variability for intervals with longer surface durations, which can be explained if the timing variability correlates with surface durations. This explanation is not available in AP/TD because 1) it has no representation of surface time, and 2) intervals are stretched in AP/TD not by adding units of AP/TD time, but by slowing the AP/TD clock. That is, phrase-final intervals that are longer in milliseconds are not longer in AP/TD timing units.

And finally, AP/TD’s spatiotemporal gestural system provides no explanation for the equivalence of different strategies which lead to similar surface-duration patterns in phrase-final position, at slower speech rates, and for phonemically long quantities, because longer gestural activation intervals in these contexts dictate a single spatiotemporal strategy. In contrast, the equivalence of these different articulatory strategies, both within and across speakers (as reviewed in Chapter 4), can be explained if surface durational patterns are speech-production goals.

These findings suggest strongly that speakers make use of explicit representations of surface time intervals when they plan and produce utterances. While they do not logically rule out the use of spatiotemporal phonological representations and Pi/MuT gestures in addition to a representation of surface durations, this scenario would require a mechanism of translation between AP/TD time and surface, solar time.


As noted in Chapter 4, the relationship between AP/TD time (i.e. AP/TD planning+suprasegmental oscillator ensemble time units) and surface, solar time is not fixed, but is highly complex and variable, changing with speaking rate and position in the utterance. Additional evidence presented in Chapter 4 calls even this possibility (i.e. spatiotemporal representations plus representations of intervals in solar-time units) into question, namely the evidence relating to temporal variability at goal-related endpoints vs. other parts of movement. This evidence challenges the temporal nature of gestural phonological representations, because temporal phonological representations do not provide a way of ‘picking out’ the movement endpoint so that its timing can be prioritized for accuracy. In contrast, if a symbolic phonological representation and a movement endpoint are closely mapped onto each other, the endpoint must be represented separately from other parts of movement, and in that case its timing can be prioritized for accuracy in motor-sensory implementation.

This stands in stark contrast to a default-adjustment approach. If phonological representations are atemporal (and symbolic), then there is no default time interval to adjust, and no requirement for mechanisms such as Pi/MuT gestures or changes in planning+suprasegmental oscillator frequency. Instead, what is needed to account for these findings are mechanisms to associate a movement endpoint and/or a set of acoustic landmarks with each phonological symbol, and to specify appropriate surface durations for the intervals defined by these endpoints or landmarks. Such mechanisms are a key feature of the phonology-extrinsic-timing-based three-component model proposed in this volume.

In short, the evidence presented in Chapter 4 challenges AP/TD’s default-adjustment account of timing variation in different prosodic contexts, and suggests instead that an account based on symbolic phonological representations and phonology-extrinsic timing is more appropriate.

6.3 Evidence relating to the coupled oscillator hierarchy mechanism for poly-subconstituent shortening

Chapter 5 discussed evidence that relates to AP/TD’s use of limit-cycle (undamped, freely oscillating) planning oscillators to model inter-articulatory relative timing, and concluded that this framework is not suited to explaining findings of endpoint-based coordination.


This section discusses evidence related to a second proposed use of limit-cycle oscillators in AP/TD, namely the use of a hierarchy of oscillators (at the syllable, cross-word-foot, and phrasal levels) to model tendencies toward isochrony at different levels, and to model changes in overall speaking rate. The evidence presented here weakens the motivation for using suprasegmental oscillators for these purposes.

Evidence cited in support of suprasegmental oscillators has come mainly from observations of polysegmental and polysyllabic shortening. These terms describe phenomena in which, when a constituent contains a larger number of subconstituents (e.g. more segments or syllables in an inter-stress interval, word, or larger constituent), those segments or syllables undergo temporal compression. For example, Lehiste (1972) showed that sleep in sleep is longer than sleep in sleepy, which in turn is longer than in sleepiness, indicating that the addition of –y or –iness causes temporal compression of sleep. Similarly, if the consonantal onset of a syllable includes multiple consonants, the increasing number of consonants results in temporal compression of those consonants compared to their individual durations as singleton onsets (e.g. Waals 1999, for Dutch, among others). This is illustrated by the shorter duration of frication for English /s/ in e.g. stop vs. sop. Thus, a larger number of subconstituents in a higher-level constituent results in temporal shortening of those segments or syllables.

English and Swedish are examples of languages that have often been claimed to show polysyllabic shortening within inter-stress intervals (and have been called ‘stress-timed’, on the assumption that this shortening represents an effort toward keeping inter-stress intervals somewhat constant). Campbell (1988, cited in Williams and Hiller 1994), Eriksson (1991), Williams and Hiller (1994), Kim and Cole (2005), and Kim (2006) have all shown that stressed-syllable durations are shorter when additional syllables follow the stressed syllable within an inter-stress interval. It should be highlighted that the effect size is small (e.g. 10%–15% per additional syllable in Williams and Hiller’s study), and that the total duration of the inter-stress interval increases linearly with additional syllables/segments (see also Dauer 1983). As a result, the compression effect does not result in actual inter-stress-interval isochrony, but only in a change in the direction of greater similarity of inter-stress-interval durations.

These surface shortening patterns might be due to periodic control at an inter-stress-interval level, e.g. a “rhythmic tendency” which “has to contend with other factors which obscure its effects” (Classe 1939, p. 87). Such a possible mechanism is modeled in AP/TD (Saltzman et al. 2008, based on O’Dell and Nieminen 1999 and Barbosa 2007), using the interaction of coupled suprasegmental planning oscillators at multiple levels of prosodic constituency (syllable, cross-word foot, and phrase), as described in Chapter 2.
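The linear increase just mentioned is central to the coupled-oscillator account. In O’Dell and Nieminen’s (1999) model, the mean duration of an inter-stress interval containing n syllables takes the simple form below, where the constants a and b are fitted to data; on one common reading of the model, the ratio a/b indexes the relative coupling strength of the stress-level and syllable-level oscillators:

```latex
% Mean inter-stress-interval duration as a function of the number of
% syllables n it contains:
T(n) = a + b\,n

% r = a/b as a relative-coupling-strength index: a large r means interval
% duration changes little with n (a stress-timing tendency), while r near 0
% means duration grows almost in proportion to n (a syllable-timing tendency).
r = \frac{a}{b}
```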


O’Dell and Nieminen (1999), Barbosa (2007), and Saltzman et al. (2008) have shown that periodic (i.e. oscillator-based) syllable, foot, and phrase control structures can be used to model surface durational patterns that are not strictly periodic, but show only a tendency toward isochrony.

Sections 6.3.1, 6.3.2, 6.3.3, and 6.3.4 argue, however, that phenomena that have been described as poly-subconstituent shortening do not provide as strong evidence for AP’s suprasegmental oscillator hierarchy as has been claimed. First, some patterns that appear to arise from poly-subconstituent shortening may actually arise from other mechanisms, such as boundary-related lengthening (Section 6.3.1), although other patterns more convincingly require a poly-subconstituent mechanism (Section 6.3.2). Second, various aspects of these timing patterns are highly ambiguous as to the larger constituents within which such shortening might occur (Section 6.3.3), and implicate word-based units that are not included in AP’s hierarchy of syllable-, cross-word-foot-, and phrase-based planning oscillators. Finally, Section 6.3.4 argues that even phenomena that may be correctly characterized as poly-subconstituent shortening effects should not be modeled using such oscillator-based mechanisms.

6.3.1 It is often difficult to distinguish poly-subconstituent shortening from other types of effects

Although findings of shorter subconstituents when more occur in a higher-level constituent are consistent with compression mechanisms that create a tendency toward isochrony, they are often difficult to distinguish from other types of effects, such as stress-adjacent lengthening or boundary-related lengthening, which do not depend on the number of subconstituents in a larger constituent. For example, Huggins’ (1975) experimental results showed evidence consistent with polysyllabic shortening within a cross-word foot, i.e. the vowel in bound was longer in . . . [bound]F out than in . . . [bound a-]F [-bout], and evidence consistent with polysyllabic shortening within a word, i.e. bound in [bound]W about was longer than in . . . [bounded]W out. Although both of these lines of evidence seem to implicate poly-subconstituent shortening, each of them has another possible interpretation. The bound vs. bound a- results are also interpretable in terms of a local effect of stress adjacency (van Lancker, Kreiman, and Bolinger 1987; White 2002, 2014), whereby syllables are longer if adjacent to stressed syllables.


Evidence supporting this view comes from Windmann, Šimko, and Wagner (2015a), who found that word-final syllables in a large corpus, whether stressed or unstressed, were longer when followed by a stressed syllable across a word boundary than when followed by an unstressed syllable.¹

The bound about vs. bounded out results, although consistent with polysyllabic shortening within a word, also have an alternative explanation, in terms of well-documented constituent-final lengthening effects (e.g. Wightman et al. 1992): bound in bound about is in absolute word-final position, whereas bound in bounded out is not, and is therefore less strongly influenced by word-final lengthening. Another factor that might explain Lehiste’s (1972) observation of a longer sleep in sleepy vs. sleepiness is that constituent-final lengthening effects are greater in magnitude the closer they are to the boundary (Berkovits 1994). Thus, sleep might be longer in sleepy because it is closer to the constituent-final boundary in sleepy than it is in sleepiness. Along these lines, Windmann, Šimko, and Wagner (2015a) observed that the polysyllabic shortening effects in their corpus study largely disappeared when the position of the target syllable with respect to the word boundary was taken into account.

6.3.2 However, some findings appear to require poly-subconstituent shortening

While some aspects of these timing patterns could be explained in terms of boundary-related lengthening, and so do not unambiguously support the need for a mechanism of poly-subconstituent shortening, other timing patterns offer less ambiguous evidence for such a shortening mechanism. For example, findings of a shorter mend in recommend vs. commend, observed when these words are phrasally stressed (Lindblom and Rapp 1972 for Swedish; White and Turk 2010 for English, among others), can’t be accounted for by final or initial lengthening on -mend, since -mend is word-final and non-initial in both cases, and thus appear to require a poly-subconstituent shortening mechanism. Similarly, findings of shorter syllable durations in utterances with additional, non-adjacent syntactic phrases, e.g. a shorter duke in The young duke (dis)armed his subjects against the advice of his counselors compared to duke in The young duke (dis)armed his subjects (Rakerd, Sennett, and Fowler 1987), also find no explanation in boundary-related lengthening.

¹ Note that stress-adjacency findings were observed only for word-final syllables, not word-medial syllables.


Assuming that the prominence of e.g. duke was comparable across the different utterance lengths, these findings appear instead to involve poly-subconstituent shortening within a higher-level constituent, such as a phrase or utterance.

Taken together, these findings suggest that, although many effects which have been described in terms of poly-subconstituent shortening have alternative accounts, there are some phenomena which appear to require such a mechanism, and these at first glance appear to support AP/TD’s use of a hierarchy of planning oscillators to account for prosodic effects on timing. However, Section 6.3.3 argues that there is substantial ambiguity about the nature of the constituents within which these effects occur, and thus about the appropriateness of the levels proposed in AP/TD’s syllable-, foot-, and phrase-based planning oscillators. Furthermore, Section 6.3.4 presents evidence that polysyllabic shortening is less pervasive than has been assumed: it is more likely to apply to the phrasally prominent syllables of an utterance, and applies more strongly to these syllables than elsewhere. This latter set of observations casts doubt on the oscillator-based approach to modeling prosodic effects on duration.

6.3.3 Evidence on the units which govern possible poly-subconstituent shortening is ambiguous

As discussed above, some durational patterns appear to depend on the number of subconstituents within a higher-level constituent, while many others are ambiguous in origin, and may depend on proximity to prominence and/or to a constituent boundary. For the phenomena that are interpretable as poly-subconstituent shortening phenomena (either unambiguously or ambiguously), it is often difficult to determine the types of higher-level constituents that govern these effects. AP/TD’s oscillator hierarchy of syllables, cross-word feet that may contain word fragments, and phrases allows for the possibility that poly-subconstituent shortening occurs at each of those levels. However, in corpus studies, cross-word feet defined on the basis of lexical stresses (as they are in AP/TD) can often be isomorphic with other constituents, like words, cross-word feet based on phrasal stresses, and/or clitic groups (a content word plus following unstressed syllables). Early proponents of ‘stress timing’ in English proposed that the relevant governing constituent is a foot based on phrasal stresses, rather than a foot based on lexical stresses, as proposed in AP/TD. To give an example, Abercrombie (1973, p. 11) shows the following foot-parsing (where | indicates a foot boundary, and ^ indicates what he calls a ‘silent stress’, Abercrombie 1968, 1991):


(32) | Know then thy- | -self, pre- | -sume not | God to | scan | ^ |,

Uldall (1971) provides additional examples from David Abercrombie’s recording of the North Wind and the Sun, parsed into feet in the Abercrombian tradition:

|Then the |sun |shone out |warmly and im-|mediately the |traveller took |off his |cloak,

In these examples, not all lexically stressed syllables are treated as foot-initial (thy in |Know then thy-|, out in |shone out|, and took in |traveller took|), indicating that his definition of a cross-word foot is different from that used in AP/TD. Abercrombie notes that the term sentence stress has been used to describe what he calls salience, a property assigned to “the first syllable in a foot, the syllable on which the beat of stress-timing falls” (Abercrombie 1991, p. 83). In contrast, Abercrombie uses the term accent to refer to abstract word-level prominence (now often called word-level, or lexical, stress). For Abercrombie (1991, pp. 82, 83), accent “exists only at the lexical level. We can . . . define ‘accent’ as a potentiality for salience.”

Kim and Cole (2005) and Kim (2006) tested for polysyllabic shortening within different types of cross-word feet (Abercrombian feet and cross-word feet delimited by lexical stresses) in the BU Radio News Corpus (Ostendorf, Price, and Shattuck-Hufnagel 1995). Their study showed that lexically stressed syllables were shorter when more unstressed syllables occurred within a cross-word foot, as long as the following unstressed syllables occurred in the same intermediate phonological phrase as the stressed syllable. The effect persisted when the position of word boundaries was taken into account; that is, the effect could not be explained by the number of syllables within a word or by the proximity of the stressed syllable to a word boundary. The comparison of cross-word feet based on lexically stressed syllables vs. cross-word feet based on phrasally stressed syllables (i.e. Abercrombian feet) suggested a stronger shortening effect for cross-word feet based on lexically stressed syllables.

However, despite the impressive amount of hypothesis testing in this study, the evidence is still ambiguous, because it remains possible that the clitic group (i.e. a content word plus following unstressed function words) is the major determiner of compression effects in the BU Radio News Corpus, rather than either type of cross-word foot. Testing this possibility is particularly important because the sequence content word + following unstressed function word occurs so frequently in American English.


One preliminary study did test all four structures (the clitic group, the word, cross-word feet based on lexical stresses, and the Abercrombian cross-word foot based on phrase-level stress) as possible domains of polysyllabic shortening (Shattuck-Hufnagel and Turk 2011). This study found more support for word-based prosodic units (words and clitic groups) than for either type of cross-word inter-primary-word-stress interval, even in poetic limerick contexts, where inter-stress-interval periodicity involving word fragments would be most likely to surface. Shattuck-Hufnagel and Turk (2011) found that the rhyme-interval duration in e.g. bake was reliably shorter in baking apples, bake us apples, bake an apple, and bake us an apple than e.g. bake in bake apples for two out of three participants, consistent with polysyllabic shortening either within a cross-word foot (from the stressed onset syllable in bake to the stressed onset of apples), or with polysyllabic shortening within word-based clitic groups (including the content word bake + the function words us and/or an).

A second test, designed to assess the relevance of the word-fragmenting Abercrombian foot vs. the word and/or clitic group, showed that the rhyme interval of bake (-ake) was reliably shorter in e.g. baking, bake us, bake an, and bake us an than in e.g. bake elixirs and bake avocadoes, for these speakers, suggesting that word-based units have a greater influence on rhyme-interval durations than cross-word feet which can include fragments of words.

One of the three participants in this experiment provided support for the Abercrombian foot, in showing shorter durations for e.g. -ake in bake elixirs (cross-word-foot structure: [bake e-] [-lixirs]) and bake avocadoes ([bake avo-] [-cadoes]) compared to durations of e.g. -ake in bake apples ([bake] [apples]). However, for this participant, e.g. -ake in bake elixirs and bake avocadoes was longer than e.g. -ake in baking apples and bake us/an apple, suggesting a stronger compressing role for word-based constituents such as the word and clitic group (e.g. bake, baking, bake us, bake an, bake us an) as compared to either type of cross-word foot (e.g. bake e- or bake avo-), even in these highly rhythmic contexts. This result, although derived from very limited data, is consistent with the view that polysyllabic shortening lies primarily in the domain of word-based structure, rather than of structure based on cross-word feet which can include word fragments.²

² The Shattuck-Hufnagel and Turk (2011) dataset included e.g. bake elixirs, where e.g. elixirs can sometimes begin with a full vowel, and can thus bear secondary word stress. Therefore, although Shattuck-Hufnagel and Turk had intended e.g. |bake e-| to be a disyllabic cross-word foot delimited by lexical stresses, the cross-word-foot-delimited-by-lexical-stresses boundary occurred earlier in these materials, as in e.g. |bake | elixirs. However, over half of the materials in their data set did not have this problem, e.g. |pick a-|romas, and these showed the same behavior as e.g. bake elixirs, i.e. there was limited evidence of shortening within cross-word feet, and greater shortening within word-based constituents, e.g. picking acorns, pick us acorns.


The evidence presented here suggests that it is difficult to unambiguously determine the higher-level constituents which govern possible poly-subconstituent shortening effects. The available evidence that distinguishes word-based constituents vs. cross-word feet as the relevant units is limited, but is more consistent with word-based constituents as the governing units. If this finding is supported by further evidence, it would motivate changing the AP/TD oscillator hierarchy to include word-based, word-sized constituents.

6.3.4 Evidence which challenges the use of oscillators in modeling poly-subconstituent shortening effects

The evidence presented above suggests that it is often difficult to unambiguously identify poly-subconstituent shortening effects on timing, as distinct from constituent-boundary effects, and that the units which govern possible poly-subconstituent shortening effects, while difficult to determine, may include word-sized, word-based constituents that are incompatible with a planning hierarchy that does not include such constituents. This section takes the argument further, presenting findings which challenge the use of oscillators to implement polysyllabic shortening effects where they may occur.

This evidence comes from a study by White and Turk (2010), who found that timing effects that are unambiguously attributable to polysyllabic shortening within words (i.e. a longer -mend in commend compared to recommend, in utterances containing the same number of syllables, e.g. John saw Jessie commend it again vs. John saw Jess recommend it again, where the phrasal prominence pattern was experimentally manipulated) do not occur in all contexts. Instead, these effects are often governed by phrase-level prominence. That is, word sets with final stress (e.g. mend, commend, recommend) showed evidence of polysyllabic shortening on -mend only when phrasally stressed, and word sets with initial stress (e.g. mace, mason, masonry) showed greater shortening effects on mas- when phrasally stressed than when not phrasally stressed.³

³ White and Turk (2010) note that differences in duration for e.g. mas- in mace vs. mason vs. masonry are ambiguous in origin, because they can be explained by polysyllabic shortening and/or progressive word-final lengthening, where e.g. mas- in mason is further from the boundary than mas- in mace, and closer to the boundary than mas- in masonry.


These results challenge an oscillator-based implementation system, because they suggest that words without phrasal stress can sometimes be overlooked by the suprasegmental compression system, or affected to a much lesser degree. This reduces the motivation for using a periodicity-based compression mechanism, since such a mechanism should arguably apply across the board to all constituents in an utterance, not just (or preferentially) to those bearing phrase-level stress. Instead, these observations support the idea that, rather than being a reflection of a periodic control mechanism tending toward surface isochrony, polysyllabic shortening may be one of a set of mechanisms that speakers use to signal the locations of word-based prosodic constituent boundaries in an utterance (Turk 2012). Such effects may be strongest in cases where words are relatively less predictable from context and bear phrasal prominence (Chapter 10; Aylett 2000; Aylett and Turk 2004; Turk 2010), i.e. often on pitch-accented words.

6.4 Evidence which challenges the use of oscillators in controlling overall speech rate

This section recalls evidence presented in Chapter 4 relating to the use of suprasegmental oscillators in the control of overall speech rate. That evidence suggested that different speakers use different strategies to change overall speech rate. The findings cited there support the view that saving time (as measured in surface, e.g. solar, units) is one of the goals of speech production, and that this goal can be achieved in a variety of ways by different speakers. For example, they may optionally reduce the number and/or duration of pauses, reduce movement distance, and/or increase movement speed. Differences in how individual speakers accomplish the goal of temporal efficiency appear to require different articulatory mechanisms for different speakers, and sometimes even for the same speaker in different circumstances. This presents a challenge to AP/TD, because it has difficulty referring to the shared goal of all these strategies, which is a timing pattern for a particular utterance in solar time. The AP/TD planning+suprasegmental oscillation-frequency speech-rate mechanism allows different speakers to change the oscillation frequency of the planning+suprasegmental oscillator ensemble to differing degrees, but this mechanism cannot lead to qualitatively different outcomes for different speakers, e.g. shorter movement distances for some vs. increased speed over a similar movement distance for others, or fewer and reduced pauses for some vs. fewer and lengthened pauses for others.


6.5 Summary

Patterns of speech timing, presented in Chapter 4 and in this chapter, motivate the consideration of alternatives to the mechanisms used in AP/TD to account for effects of prosodic structure on surface characteristics of spoken utterances. The use of a default-adjustment approach, which accounts for surface durational variability while preserving phonology-intrinsic timing (such as AP/TD’s Pi and MuT gestures), is challenged by the evidence presented in Chapter 4 that supports phonology-extrinsic timing. The use of periodic control structures and mechanisms to account for poly-subconstituent shortening is challenged by findings suggesting that this phenomenon does not occur on all words in an utterance, and does not apply uniformly to the words where it does occur, as argued in Section 6.3.4. Finally, the use of periodic control structures for the control of speech rate is challenged by findings of speaker-dependent strategies for manipulating overall speech rate.

Taken together with the findings presented in Chapter 5, which showed that oscillatory mechanisms are not required to account for coordination patterns, and do not provide an account of endpoint-based coordination, these findings suggest that there are reasons to be uncertain whether periodicity is a major factor in speech motor control in typical speaking circumstances. They call into question the use of suprasegmental oscillators, and motivate the consideration of alternative, non-oscillatory approaches to both coordination and suprasegmental control of timing. However, it is still possible that periodic control structures (or surface-periodic planning goals) are invoked for certain types of speech-production styles that might be called rhythmicized or periodicized speech, such as singing, or for stylistic purposes at certain points during typical communicative speech. The possibility that the timing pattern for an utterance could be computed by different mechanisms, depending on the task requirements (e.g. periodic vs. non-periodic requirements), is deserving of future experimental treatment.

The next chapter motivates the general architecture of an alternative approach to speech motor control, a Phonology-Extrinsic-Timing-Based Three-Component Model. This approach is inspired by several proposals in the literature, but differs from existing proposals in several key respects, including its focus on phonology-extrinsic timing. In this proposal (discussed in more detail in Chapters 7–10), surface timing patterns result from an optimization process which balances a set of prioritized requirements, where the set of requirements includes the production of words in their prosodic contexts, as well as other requirements,


such as speaking in a particular style or at a particular rate. Among other things, this approach has the potential to account for effects of prosodic structure, as well as for tendencies toward periodicity in certain periodic styles of speech, or in particular rhythmic contexts, but it does not invoke periodic control mechanisms for normal styles of speech.


7 Evidence for an alternative approach to speech production, with three model components

The preceding chapters presented a summary of AP/TD’s approach to modeling speech sound production (Chapter 2), and a number of lines of evidence that are difficult to reconcile with spatiotemporal phonological representations, emergent timing, and oscillator-based control structures (Chapters 3–6). This evidence strongly motivates the development of an alternative to the AP/TD approach, to account for a range of timing-related facts that do not fit comfortably into a phonology-intrinsic, oscillator-based timing framework.

The alternative approach proposed here shares AP/TD’s goal of modeling speech articulation, which is an advance on alternative proposals that do not include this component. Other approaches that have focused on predicting surface acoustic duration patterns directly from text, such as Klatt (1976), van Santen (1994), and current commercial speech-synthesis systems based on concatenation, have not attempted to model articulation. As a result, these approaches do not model the human speech-production process. Furthermore, because acoustic interval durations are thought to relate to the durations of the movements required to meet goals at minimum cost, models that do not address articulatory issues are primarily descriptive, in that they do not seek principled explanations for the duration patterns that they produce (Windmann 2016). The approach presented here is therefore designed to model the processes that control speech articulation, because it is hypothesized that this effort will lead to explanations for systematic surface duration patterns. In addition, to the extent that the principles that underlie such a model are well motivated, it is likely to predict and model systematic durational variability in a variety of contexts.

This chapter addresses the general nature of a model of speech production that can account for known timing behavior in human speakers. It argues that the timing evidence presented earlier requires a production process with three components, and supports this claim with additional, non-timing evidence.


Part of the motivation for a three-component model comes from two lines of argument presented earlier: a) the need for a set of phonology-extrinsic timing mechanisms to represent the surface timing of intervals, and to guarantee appropriate matching to the timing of perceived events (for example, temporally coordinated multi-performer musical performances), and b) the need to represent certain aspects of a movement's timing (e.g. the time of reaching the endpoint) separately from other aspects (such as the time of onset), to allow for higher timing accuracy at the endpoint. To summarize arguments presented earlier, these requirements are not compatible with the AP/TD approach for two reasons. First, in that model, surface timing is not explicitly represented or specified, but instead emerges from interactions among intrinsic gestural representations and contextual representations in the phonology. Oscillations in AP/TD do not correspond straightforwardly with surface time in the world, owing to phonology-intrinsic time warping via Pi and MuT gestures in different prosodic positions in an utterance, and to overall changes in planning+suprasegmental oscillator ensemble frequency for changes in speech rate. Yet, as the evidence in Chapters 3–6 suggests, planning for explicit specification of surface time is required to account for movement timing behavior, for instance greater timing variability for longer duration intervals, differential lengthening of contrastively long and short vowels in quantity languages, and different strategies for producing surface durational patterns, as well as other evidence (see Chapters 3–6 for more complete argumentation). Thus, a model in which surface time is explicitly and transparently represented is desirable. Second, AP/TD does not allow for independent timing or coordination characteristics for the part(s) of movement most closely related to the goal, e.g. the endpoint vs. other parts of movement, as required in order to account for higher timing precision at the part(s) of movement most closely related to the goal. This is because a gestural representation in AP/TD specifies an entire gestural trajectory as a unified whole. Although the time of movement onset can be identified as the gestural activation interval onset, the time of movement offset can't be identified with gestural activation offset, since the relationship of movement offset (time of target approximation/gesture settling time) to gestural activation interval offset varies according to context. Other models that lack a distinction between the goals of movement and how they are achieved (i.e. other models that merge phonology and phonetics, such as Šimko and Cummins 2010, 2011), as well as models which map phonological representations onto entire spatiotemporal goals (later versions of DIVA, Guenther 2016; and Fujimura's model, Fujimura 1992 et seq.), face the same challenge that AP/TD faces, in
identifying the part(s) of movement most closely related to the goal as conceptually distinct from other movement parts, and therefore able to be represented separately. As a result, these models currently have no account for differences in timing precision at e.g. goal-related movement endpoints vs. other parts of movements.

What kind of a model is likely to provide a plausible alternative to AP/TD, once it has been developed and tested with the same degree of rigor and comprehensiveness? The possibility explored here is a three-component framework in which 1) the requirement to signal abstract symbolic phonological representations, including lexical and prosodic structures, together with other requirements (e.g. rate and style of speech), defines the goals and guides the choice of qualitative acoustic cues to the distinctive features of the words that will appear in the utterance being planned; these goals and cues form the basis for 2) planning quantitative context-specific acoustic targets¹ to be reached at specific points in time, as well as optimal movements for reaching those acoustic targets on time, which are then 3) implemented articulatorily by a system that tracks and adjusts the unfolding movements and their auditory results, to ensure that those acoustic/articulatory goals are met at appropriate times. This can be called a Phonology-Extrinsic-Timing-Based Three-Component approach (XT/3C), because two of its most important aspects are its use of general-purpose timing mechanisms which are extrinsic to phonology (XT), and its three separate planning components (3C), for Phonological Planning, Phonetic Planning, and Motor-Sensory Implementation. This type of model provides a plausible architecture that can account for observed timing behaviors, including the wide range of systematic timing variability observed in speech, as well as for additional critical aspects of systematic context-governed surface phonetic variation.

This chapter motivates the general architecture of the XT/3C approach, and uses the evidence presented in earlier chapters, along with additional evidence, to argue that this is the right framework for developing and testing an alternative to AP/TD. It first introduces the three processing components (Section 7.1), and discusses how the three-component approach relates to proposals in the existing literature, considering some of the strengths as well as

¹ In the proposed model, as in Lindblom (1990), acoustic target specifications (as well as the targets of movements that produce them) are specific to a particular context within an utterance. This view of the acoustic and articulatory targets, as specific to each context, differs from proposals in the literature in which targets are envisioned as idealized, extreme spatial locations, which are often 'undershot' in 'weak' contexts, e.g. unstressed positions. To highlight this distinction for articulatory movements, the term 'movement endpoint' is employed here, for the context-dependent target that is planned for each movement.


the drawbacks of those proposals. The drawbacks of existing three-component models are largely related to the lack of comprehensive consideration of how to implement timing phenomena in speech articulation. The remainder of the section presents the evidence that, despite these drawbacks, the three-component architecture is well-suited to account for observed timing behaviors. In fact, the three-component architecture appears to be necessary, because it provides accounts of both timing and non-timing (spatial and spectral) aspects of speech production. To make this clear, Section 7.2 discusses how evidence presented earlier supports the XT/3C approach, and summarizes additional evidence for the separation of the three processing components. Section 7.3 more specifically lays out the arguments for the use of symbolic phonological representations in the Phonological Planning Component. Section 7.4 addresses the fact that a model that postulates symbolic phonological planning representations requires a translation mechanism to derive quantitative acoustic and articulatory specifications. It first argues that AP/TD's apparent advantage in avoiding this requirement is not compelling, since 1) a mechanism for representing, specifying, and tracking surface time (and thus a mechanism to translate AP/TD oscillator time units into surface solar time units) will be required in any case, and 2) avoiding the translation issue creates serious problems for both the listener and the learner. It then discusses how the proposed XT/3C framework provides such a translation mechanism, in the form of a process for selecting individual context-appropriate cues² to symbolic features. This translation process divides its work between the Phonological Planning Component, where symbolically and relationally represented individual context-appropriate acoustic cues to distinctive features are specified, and the Phonetic Planning Component, where these cues receive the quantitative acoustic-phonetic values appropriate for the particular utterance that is being planned. Section 7.5 then presents evidence for the separation between the Phonetic Planning Component and the Motor-Sensory Implementation Component, which tracks and adjusts movements to ensure the timely attainment of utterance-specific goals planned in the Phonological Planning Component. The chapter concludes with a brief review of a number of key elements of the proposed model (Section 7.6). In sum, this chapter lays the groundwork for the remainder of the volume, which includes a more detailed discussion of possible optimization mechanisms (based on

² The term ‘cue’ is used here rather than the term ‘correlate’ for convenience, although the question of which acoustic correlates actually serve as cues for the listener, as well as of which ones are independently represented and planned by the speaker, remains to be explored.


concepts from Optimal Control Theory, Chapter 8), a discussion of what is known about general-purpose timekeeping mechanisms, including Lee's General Tau theory, which provides a mechanism for movement coordination and for specifying the time-course of movement (Chapter 9), and a more detailed sketch of an initial proposal for an XT/3C model, which can be viewed as XT/3C-version 1 (Chapter 10).

7.1 Existing three-component models and some gaps they leave

As noted above, many of the components of the proposed XT/3C model have also played a role in earlier proposals for modeling speech production planning, although they have not been integrated into a framework that accounts for all known aspects of surface timing behavior. This section elaborates on the three planning components and relates them to existing proposals that make use of similar components (Section 7.1.1), while discussing some of the gaps that remain in the treatment of surface timing phenomena and in the specification of how articulatory plans are formulated (Section 7.1.2). It is argued that, despite the gaps in existing three-component models of articulation, which are largely due to the fact that they do not comprehensively address the issue of surface timing in speech, the 3C approach in general is well suited to accounting for patterns of variation in surface phonetic form that are typically observed in continuous speech, and thus provides a desirable framework for developing an alternative to AP/TD (Section 7.1.3).

7.1.1 The general architecture of a three-component model and precursors in the literature

The approach proposed here includes three stages which have played a role in a number of previous models of production. As noted above, the three components are (a minimal illustrative sketch follows the list):

1. a Phonological Planning Component, to set and prioritize abstract task requirements for the utterance, using symbolic representations;
2. a Phonetic Planning Component, to specify in quantitative terms how those goals will be achieved in the planned utterance; and
3. a Motor-Sensory Implementation Component, to implement the quantitatively specified plan.
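To make this division of labor concrete, here is a minimal Python sketch of the information flow through the three components. It is an illustration only: the data structures, field names, and function signatures are hypothetical assumptions, not part of the XT/3C proposal, which is stated at the level of cognitive architecture rather than implementation.

from dataclasses import dataclass
from typing import List

@dataclass
class SymbolicGoal:
    # Output of Phonological Planning: discrete, qualitative, relational,
    # with no spectral, spatial, or temporal numbers.
    cue: str          # hypothetical cue label, e.g. 'aspiration'
    feature: str      # the distinctive feature the cue signals
    prominence: str   # relational category, e.g. 'accented' vs. 'unaccented'

@dataclass
class QuantitativeTarget:
    # Output of Phonetic Planning: quantitative values in surface (solar) time.
    cue: str
    endpoint_time_ms: float   # when the goal-related endpoint must be reached
    spatial_target: float     # context-specific target, not an idealized extreme

def phonological_planning(words, prosody, style) -> List[SymbolicGoal]:
    ...  # set and prioritize task requirements; select and sequence symbolic cues

def phonetic_planning(goals: List[SymbolicGoal]) -> List[QuantitativeTarget]:
    ...  # assign quantitative values and phonology-extrinsic endpoint times

def motor_sensory_implementation(plan: List[QuantitativeTarget]) -> None:
    ...  # track and adjust unfolding movements so endpoints are met on time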


In the proposed XT/3C framework, the function of the Phonological Planning Component is to prioritize the set of task requirements for the planned utterance; these include its phonological goals as well as other goals, such as rate of speech (specified relationally), clarity, or stylistic characteristics. The process of setting the phonological goals for an utterance involves planning the prosodic structure for the intended utterance, as well as selecting and sequencing the acoustic cues to the contrastive phonological features of the words of the intended sentence, as appropriate to their prosodic and stylistic context.³ (See Halle 1992 and Stevens 2002 for definition of individual cues to distinctive features, and Chapter 10 for further discussion.) The cues for the features that define the target words are specified in patterns that are appropriate for the particular contexts of a specific planned utterance, including its prosodic structure. Thus the phonological goals reflect utterance-specific task requirements, such as signaling the words of the utterance and their structural relationships, but these goals are expressed in abstract symbolic terms. That is, at this point in the planning process, the cues to the contrastive phonological features of the utterance are the goals that will lead to the specification of the physical aspects of the utterance. These symbolic goals are discrete, qualitative, and relational, and do not contain specific spectral, spatial, and/or temporal information. This set of goals output by the Phonological Planning Component provides the input to the Phonetic Planning Component, whose function is to provide quantitative acoustic/articulatory specifications for the selected acoustic cues, including their timing characteristics, and to plan the coordinated and often overlapping articulatory movements that will generate those acoustic cues with appropriate parameter values. Subsequently, in the Motor-Sensory Implementation Component, the planned movements are tracked and adjusted to ensure that the goals defined in the Phonological Planning Component, and specified quantitatively in acoustic/articulatory terms in the Phonetic Planning Component, are met at appropriate times. Together, these components meet the two requirements for an alternative model reviewed above. First, the Phonetic Planning Component provides for explicit specification of timing characteristics of movement, and durations of surface time intervals between acoustic landmarks, using units provided by general-purpose timekeeping mechanisms. This model can be considered a Phonology-Extrinsic-Timing-Based model, because the surface-timing

³ Note that this means that the phonological representations for a particular utterance that emerge from the Phonological Planning Component contain substantially more information than the phonological representations of words in the lexicon. That is, the Phonological Planning Component is not equivalent to the Phonological Component of a linguistic grammar.

specifications are extrinsic to the phonological specification for the utterance, and are generated by phonology-extrinsic, general-purpose timing mechanisms. This contrasts with phonology-intrinsic-timing-based models such as AP/TD, where time is an intrinsic part of phonological representation, and surface-timing patterns emerge from these representations in combination with their utterance-specific activation specifications. Second, the goals that are represented in the Phonological Planning Component as feature cues are directly related to the part(s) of movement most closely related to the goal, such as the endpoint, that will be planned in the Phonetic Planning Component. This explicit representation of the part(s) of movement most closely related to the goal, and of their surface timing, makes it possible for these parts of movement, e.g. endpoints, to have priority over other parts of movements, such as onsets; this is required to account for the observation that there is often less variability in endpoint timing than in other movement parts. It also provides a mechanism for movements to be coordinated based on endpoint times, as opposed to the times of movement onsets (cf. Chapter 5, and the sketch below). Many of the key characteristics of the three-component approach have formed a part of earlier proposals in the literature. The next section describes some of these earlier proposals, and argues that, despite their incomplete treatment of surface timing, the three-component framework they adopt provides the best approach to developing an alternative to AP/TD.
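As a toy illustration of endpoint-based coordination, the following Python sketch schedules movement onsets backwards from required endpoint times; the function, numbers, and units are hypothetical, not drawn from the proposal itself.

def schedule_onsets(endpoint_times_ms, movement_durations_ms):
    # Endpoint-based coordination: fix the behaviorally meaningful endpoint
    # times first, then derive each onset as endpoint minus movement duration.
    # If a duration estimate changes, the onset shifts; the endpoint does not.
    return [end - dur
            for end, dur in zip(endpoint_times_ms, movement_durations_ms)]

# e.g. two movements whose endpoints must be reached at 250 ms and 400 ms:
onsets = schedule_onsets([250.0, 400.0], [120.0, 90.0])  # -> [130.0, 310.0]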

7.1.2 Precursors in the literature, and some of their drawbacks

One key characteristic of the three-component approaches proposed in the literature is the separation between phonological and phonetic planning (Henke 1966; Keating 1990; Shattuck-Hufnagel 1992; Kingston and Diehl 1994; Guenther 1995; Fujimura 1987, 1992, 1994, 2000, 2003; Bonaventura and Fujimura 2007; Fujimura and Williams 2015; Williams 2015; Levelt, Roelofs, and Meyer 1999; Guenther, Ghosh, and Tourville 2006; Ladd 2011; Houde and Nagarajan 2011; Goldrick et al. 2011; Perkell 2012; Lefkowitz 2017). Most of these proposals agree that the Phonological Planning Component contains categorical, symbolic information about grammatical categories. Shattuck (1975; Shattuck-Hufnagel 1992, 2015) draws heavily on this concept in her frame-and-content planning model (see also MacNeilage and Davis 1990). She emphasizes the need for a serial ordering process for inserting abstract symbolic phonological elements into a prosodic planning frame, with
subsequent specification of the surface phonetic form of these symbolic elements, as appropriate for their segmental and prosodic context. Of the models in the literature which contain a separation between Phonological and Phonetic Planning Components, however, only a small number have described how the stages of production could lead to articulatory movement, and these few do not fully address the question of surface timing. Examples include Henke (1966), Keating (1990), Fujimura (1992) et seq., Guenther (1995) et seq., and Levelt (1989), Levelt, Roelofs, and Meyer (1999). All of these proposals assume a categorical, symbolic phonological component, with a separate phonetic planning component that provides a quantitative specification of the articulatory movements that will implement that phonological plan, but none deal comprehensively with surface timing. Henke’s focus is on coarticulation, and thus his model provides an account of some aspects of surface timing. That is, he accounts for anticipatory coarticulation of succeeding phonemes, via the early-as-possible activation of goals for upcoming phonemes combined with the sluggish response of the articulators to these commands, which results in overlapping movements. In doing so, Henke’s model provides an account of the temporal co-production of articulatory movements, but does not provide an account of other aspects of timing, such as the durations between movement endpoints or how long each movement lasts, and also does not provide an account of how spatial aspects of movement vary in different contexts. Keating’s (1990) model relates to this latter spatial issue, by focusing on the contextual variability of movement paths. In her influential ‘window’ model, windows define a range of possible spatial target values for each physical articulatory dimension across utterances, and the path through a sequence of windows that is taken for a particular utterance is determined by considerations such as economy of effort, movement smoothness, and continuity. Although Keating’s model assumes that temporal specification occurs in the Phonetic Planning Component, the details of this process are left for future research. Moreover, Henke’s and Keating’s models were not able to take advantage of post-1990 advances in understanding how pervasively and systematically phrase-level prosodic structure governs aspects of surface form. Another three-component model has been proposed by Fujimura (Fujimura 1992, 1994, 2000, 2003; Bonaventura and Fujimura 2007; Fujimura and Williams 2015; Wilhelms-Tricarico 2015). This model is of note because, unlike Henke’s and Keating’s models, it puts prosodic structure at center stage, and provides a framework for modeling the influence of prosodic structure (including prominence and phrasal structure) on articulatory

OUP CORRECTED PROOF – FINAL, 29/1/2020, SPi

displacement, timing, and overlap. This framework assumes that phonological representations are expressed in terms of symbolic distinctive features, as well as symbolic representations of syllables (including their subconstituents, i.e. onsets, nuclei, and codas), and assumes higher-level prosodic constituency which can influence syllable durations in the vicinity of higher-level constituent boundaries. The syllable representations are mapped onto a 'syllable pulse train', i.e. a series of (usually symmetric) triangles corresponding to syllables and pauses (if they occur), whose bases are contiguous. Triangle heights represent an appropriate magnitude multiplication factor (the pulse) which controls syllable prominence and phrasal boundary effects, and triangle bases represent syllable or pause duration. Because the apex angle is assumed to be the same for all triangles, syllable triangle height correlates with syllable duration, so that longer duration and prominence are linked (a sketch of this geometry is given at the end of this section). Symbolic distinctive features are mapped onto 'elemental gestures'. Vocalic elemental gestures are turned on and off by step functions, whose edges are aligned with syllable boundaries. Consonantal elemental gestures are modeled as impulse response functions, whose specifications are stored in a table, and are triggered at appropriate delays or lags from the syllable pulse, specified as ratios of the syllable duration (Wilhelms-Tricarico 2015). The model thus assumes that the elemental gestures overlap, i.e. local 'fast time' consonant gestures are superimposed on slower vocalic gestures, cf. Öhman (1967). Because the height of each syllable triangle represents the prominence of the syllable, it therefore controls the spatial magnitude of each elemental gesture. While this model is promising, in that it provides a framework for modeling the influence of prosodic structure on speech articulation (including timing patterns), and provides an implementation of articulatory overlap, it leaves some important details unaccounted for. For example, it doesn't provide a way of determining what syllable durations should be for a given context, and, while it does provide movement parameters for different elemental gestures that make it possible to adjust the movement time-course, it doesn't provide a way of determining what the values of the movement parameters should be. In addition, because it maps phonological symbols onto elemental gestures which, together with their context, define entire movement trajectories, it doesn't provide a way to separately control behaviorally meaningful parts of movement, as required to account for lower timing variability at goal-related movement endpoints than at other parts of movement. And finally, while Bonaventura and Fujimura (2007) acknowledge the potential role of feedback in influencing speech articulation once it has begun, there is no explicit model
of a motor-sensory implementation component to track and adjust movements. Another approach that includes three components and addresses articulation is the comprehensive model of the entire utterance planning process proposed by Levelt and his colleagues at Nijmegen (Levelt 1989; Levelt et al. 1999; Levelt 2002, inter alia).⁴ Distinguishing earlier stages of conceptual planning from later stages of word-form planning, this model proposes three word-form planning components: phonological encoding, phonetic encoding, and articulation. A valuable aspect of the Nijmegen model is that, in its various presentations, many of the factors that influence speech timing are mentioned, including higher-level prosodic constituent structure and prominence. However, details of how prosodic structure is generated, how and when choices are made among candidate pitch accents and boundary tones, and how this structure influences timing goals are not developed. This is in part because the focus of the model, at least after the publication of Levelt (1989), has been primarily on how lexical access interacts with other components of speech production, and on accounting for patterns of utterance initiation time under various conditions, rather than on issues such as the computation of prosodic structure or the details of speech articulation and its acoustic goals. The phonetic encoding and articulation components of the Nijmegen approach borrow heavily from AP/TD: the model adopts the use of AP/TD gestures in its mental syllabary, which consists of a stored set of syllable-sized gestural scores, and uses these to generate spatial as well as temporal aspects of movement. However, in its current state it has no account of coarticulation among sequences of syllables, particularly across word boundaries, leaving the details of contextual articulatory accommodation for later work. Because it adopts AP/TD's use of gestures, the Nijmegen approach has the same timing-related challenges as AP/TD: no explanation for greater timing accuracy/less variability at the part of movement most closely related to the goal than elsewhere in a movement (also, as noted above, a problem for Fujimura's model), and no way to represent or specify the surface durations of intervals (required to explain the greater timing variability for longer duration intervals, differential lengthening of contrastively long and short vowels in quantity languages, and different strategies for producing surface durational patterns,

⁴ Levelt and colleagues have sometimes explicitly separated what they call Phonological Encoding from Phonetic Encoding (Levelt et al. 1999; Cholin, Levelt, and Schiller 2006), although at other times these two separate components are presented under the single umbrella of Phonological Encoding (Levelt 1989, 2002).

as reviewed in Chapters 3–6). And like other existing three-component models, it has no mechanism for modeling the influence of multiple factors on surface timing. Levelt's model assumes (but does not develop) a Motor-Sensory Implementation component. Guenther and colleagues' DIVA and GoDIVA models (Guenther 1994, Guenther 1995, further developed in Guenther, Ghosh, and Tourville 2006, Guenther 2006, reviewed in Perkell 2012, and Bohland, Bullock, and Guenther 2009) also have three separate components, i.e. they distinguish Phonological from Phonetic Planning, and have a Motor-Sensory Implementation Component to generate, track, and correct articulatory movement. These neural network models assume symbolic categories (e.g. phonemes and/or syllables) that are slotted into a suprasegmental frame. They provide a mechanism for mapping among distinct data structures (i.e. symbolic elements, articulatory movements, acoustics, and auditory/somatosensory consequences), and they propose sensory tracking mechanisms to ensure targets are reached. (Such mechanisms are also available in Houde and Nagarajan's 2011 State Feedback Control model, in which the use of state estimation and sensory tracking mechanisms in Motor-Sensory Implementation is a main focus; see also Hickok 2014). A primary focus of recent DIVA research has been to localize model components in particular regions of the brain according to their functions (e.g. Guenther 2016), adding valuable new perspective to the three-component approach. However, like most of the other three-component models (with the exception of that proposed by Fujimura), the DIVA and GoDIVA models have not focused on timing issues, and later developments of the DIVA model run into the same challenges that AP/TD faces in accounting for the findings presented earlier, in Chapter 4 (Phonology-Extrinsic Timing). That is, whereas early versions of DIVA took as input symbolic phonemes (which mapped onto targets that occur at single points in time), and included a type of phonology-extrinsic timing mechanism (a GO signal, which specifies the time-remaining-until-target-achievement, Bullock and Grossberg 1988), and were therefore broadly compatible with the three-component approach proposed here, in later versions the phonological representations correspond to time-varying targets, i.e. time-varying trajectories in auditory space, which are adjusted as part of phonetic planning to produce surface timing patterns. From the point of view adopted here, these later versions of the model, with quantitatively specified time-varying targets which undergo durational adjustments, represent a step in an undesirable direction. That is, this shift to adjustment of default time-varying trajectories means that the more
recent versions of this model share the challenges that AP/TD faces: 1) no explanation for findings of less timing variability at certain parts of movements, since phonological representations are mapped onto movement goals that are undifferentiated in terms of e.g. endpoints vs. other parts, and thus no one point in time is more important (i.e. more closely related to the speaker’s goals) than others (Fujimura’s model has this same problem), and 2) no explanation for evidence suggesting that surface durations are represented (e.g. no explanation for constraints relating to surface time), since surface timing in more recent incarnations of DIVA is emergent from time-varying goals for each speech sound unit + adjustment mechanisms, as in AP/TD. In sum, existing three-component models of speech articulation have focused on different aspects of the speech production process and have moved the field forward in important ways. However, all stop short of providing a comprehensive account for what is currently known about timing-related phenomena. Two issues stand out: 1) none have provided mechanisms for modeling the influence of multiple factors on surface durations, and 2) because most models map phonological symbols onto entire gestural or spectral trajectories, they do not provide a way to separately control, coordinate, and prioritize behaviorally meaningful parts of these trajectories, e.g. movement endpoints.
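As an aside, the syllable-pulse-train geometry described above for Fujimura's model can be made concrete: because all triangles are assumed to share the same apex angle, a syllable's base (its duration) is fully determined by its height (its prominence pulse). The Python sketch below illustrates this; the apex angle and pulse values are hypothetical placeholders, since the model's actual parameter settings are not reproduced here.

import math

def syllable_base_ms(pulse_height, apex_angle_deg=30.0):
    # For a symmetric triangle with a fixed apex angle,
    # base = 2 * height * tan(apex/2), so duration (the base) grows linearly
    # with prominence (the height): length and prominence are linked.
    return 2.0 * pulse_height * math.tan(math.radians(apex_angle_deg) / 2.0)

def consonant_trigger_ms(syllable_onset_ms, syllable_dur_ms, lag_ratio):
    # Consonantal elemental gestures are triggered at lags from the syllable
    # pulse specified as ratios of the syllable duration
    # (cf. Wilhelms-Tricarico 2015).
    return syllable_onset_ms + lag_ratio * syllable_dur_ms

# A more prominent syllable (a larger pulse) yields a proportionally longer base:
weak = syllable_base_ms(pulse_height=150.0)
strong = syllable_base_ms(pulse_height=300.0)  # twice the height, twice the duration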

7.1.3 Why the three-component approach is the right approach, despite the drawbacks of existing proposals with some similar characteristics

In spite of the drawbacks of existing proposals, the general approach provided by the multilevel architecture of such three-component models is well-suited to explain the patterns of variability observed in the surface phonetic forms of different utterances of the same word, phrase, or sentence, and, crucially, the timing phenomena discussed in Chapter 4. Even though none of the currently available models of this general type have a comprehensive account of surface timing patterns in speech, it appears that the three-component architecture is required for doing so. In part, this is because this architecture supports the development of a model in which timing is extrinsic to the phonology, and planned within a separate phonetic planning component. Later chapters of this volume will present a fuller sketch of a particular three-component model which is explicitly based on phonology-extrinsic timing, and incorporates many of the mechanisms and characteristics already
proposed in the literature. The next section of this chapter shows how experimental evidence presented earlier in the book, along with additional evidence, supports a phonology-extrinsic-timing-based three-component approach (XT/3C) to modeling speech production planning.

7.2 Why the timing evidence presented earlier motivates the three components of the XT/3C approach, despite the gaps

Earlier chapters presented evidence that supports the use of a set of general-purpose phonology-extrinsic timing mechanisms for representing, specifying, and tracking the timing and duration of movements, for the representation and specification of surface durations in speech production, and for greater timing precision at e.g. movement endpoints compared to other parts of movement. Taken together, this evidence suggests that timing specification does not occur within the phonological component of grammar, and therefore motivates proposals in which timing is extrinsic to the phonology. Section 7.2.1 revisits one piece of timing evidence discussed in Chapter 4, namely, the greater temporal accuracy at part(s) of movement most closely related to the goal, compared to other parts of movement, and shows how this argues for all three of the components in the proposed XT/3C approach, where 1) phonological representations are a-temporal and symbolic, 2) timing specification occurs in a separate Phonetic Planning Component, and 3) movements are monitored for accuracy in a Motor-Sensory Implementation Component and adjusted to ensure targets are reached appropriately. Section 7.2.2 then discusses the support provided for this proposal by evidence from the spatial domain. The implications of these arguments for the XT/3C approach are summarized in Section 7.2.3.

7.2.1 How timing precision at movement endpoints argues for an XT/3C approach

Chapter 4 presented findings of less timing variability at a movement's endpoint, compared to other points in a movement. These findings suggest that in these cases the endpoint is more 'behaviorally meaningful' than other parts of movement (Shaffer 1982; Semjen 1992). Semjen (1992) makes this point about the control of finger movements in typing:


When copying a text, the typist probably attempts to produce the successive keystrokes fluently and at a fast sustained rate. The typist would thus anticipate the temporal properties of a sequence of behaviorally meaningful events, rather than the characteristics of the individual movements producing them. . . . We are thus led to a notion of multilevel temporal organisation in serial movements, with some level(s) being more directly related to the subject’s intentions than others. (Semjen 1992, p. 248)

Along these lines, findings of greater temporal accuracy at particular parts of movement, e.g. the endpoints in speech and speech-related movements (Perkell and Matthies 1992; Leonard and Cummins 2011), suggest that these parts of movement are more 'behaviorally meaningful' and are more closely related to the speaker's goals for the utterance. For example, the various movements of the articulators must be coordinated to create particular configurations at appropriate times, or the goal of acoustically signaling the features, sounds, and words of the utterance will not be met. Other, less behaviorally meaningful parts of movement are produced in service of achieving those goals. Findings of less temporal variability at movement endpoints support all three components of the three-component approach. That is, these findings can be explained if 1) the utterance-specific endpoint is the part of a movement that is prioritized, because it is 'behaviorally meaningful', i.e. most closely related to the goals developed in the symbolic phonological representation (i.e. during Phonological Planning) that the speaker is trying to signal, and 2) other aspects of the movement (specified during Phonetic Planning) are organized in the service of reaching the high-priority movement endpoint at the right time. As a result, non-endpoint parts of a movement are less likely to be corrected and adjusted during Motor-Sensory Implementation, because their accuracy is less critical, as long as the endpoint can be reached on time (cf. Todorov and Jordan's 2002, 2003 Minimal Intervention Principle, motivated by spatial-accuracy evidence). Instead, the resources for tracking and adjusting are focused on the aspects of a movement that are most closely related to the goal of producing a planned set of acoustic cues, i.e. often its endpoint, but other parts of movement may also be relevant, e.g. constriction release for geminate consonants. The finding that particular parts of movement are more accurate/less variable than other parts of a movement is a critical part of this argument, because it is difficult to account for in a model that does not separate abstract symbolic phonological representations from quantitative phonetic representations. In AP/TD, for example, a phonological representation takes the form
of equations that define the full trajectory of a gesture (and trajectories of individual movements that form the gesture, given a starting position and overlap context), so that it is not possible to represent either the spatial or temporal aspects of e.g. the endpoint of a movement separately from its beginning; thus, it is not possible to ensure greater timing accuracy at particular parts of a movement (see the equation sketch below). In a three-component model that separates the goal (as an abstract, symbolic, phonological element) from the manner of carrying out the goal (as a quantitative phonetic specification), the phonological representation (the goal) can be related to the part(s) of articulatory movement most closely related to achieving the phonological goals. Critically, separate representation of those parts of movement makes it possible to prioritize them for more accurate production in the Motor-Sensory Implementation Component, as appears to be required by the distribution of timing accuracy across a movement. The Motor-Sensory Implementation Component, which tracks timing and position relative to the endpoint (based on prediction from an efference copy of the motor commands as well as on sensory information), is required to provide adjustments to the movements to ensure that the prioritized endpoint is reached at an appropriate time. These arguments provide compelling support for the three-component approach to speech-production planning, from the timing domain. The following section provides converging non-timing evidence for this view from the spatial domain.
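For concreteness, the kind of equation at issue can be written out. In standard Task Dynamics (e.g. Saltzman and Munhall 1989) each gesture is governed by critically damped point-attractor dynamics of roughly the following form; the notation here is generic, not quoted from any particular source:

\[
m\,\ddot{x}(t) + b\,\dot{x}(t) + k\,\bigl(x(t) - x_{0}\bigr) = 0, \qquad b = 2\sqrt{mk}\ \text{(critical damping)}
\]

Given the parameters (target \(x_{0}\), stiffness \(k\), damping \(b\)) and an initial state, \(x(t)\) is determined for the entire activation interval; no point on the trajectory, such as the endpoint, has a separate representation that could be prioritized for greater timing accuracy.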

7.2.2 Converging non-timing evidence: how the priority of movement endpoints in the spatial domain also argues for an XT/3C approach

Evidence from studies of repeated movements in the general motor control literature suggests that, just like timing accuracy, spatial accuracy is prioritized differently at different points in a movement. In particular, spatial accuracy has higher priority at endpoint achievement than at other points in a movement. Todorov and Jordan (2002) showed this in an influential experiment, in which they asked participants to move a pointer through a series of circular targets on a flat table, over repeated trials.⁵ When analyzing their results, Todorov and Jordan (2002) sampled each movement trajectory at 100 equally

⁵ Target-to-target movement durations in this experiment were comparable to those observed in speech (i.e. approximately 100–400 ms).


spaced points along the path. They computed the average movement path, and determined spatial deviations from the average path at each of the 100 points. Results showed that spatial deviations from the average path were lowest at the circular targets, and higher in between (the analysis is sketched at the end of this subsection). Paulignan et al. (1991) report similar results for comparable shorter-than-a-second reaching movements (variability was greater for the first half of the reaching movement than for the second half, as the hand approached the target), as do Liu and Todorov (2007) for two reaching tasks. Liu and Todorov (2007) found that spatial variability was lowest at the beginning (where it was constrained by a fixed starting position) and end of each movement, and highest in between. See also Scholz and Schöner (1999), Domkin et al. (2002), Yang and Scholz (2005), Todorov (2004), and Katsumata and Russell (2012) for additional evidence of lower spatial variability at target attainment than at other points in movement. Rosenbaum and colleagues present additional evidence consistent with the view that different parts of a movement can be separately represented, and their accuracy assigned different priorities. Rosenbaum observed that when grasping an overturned glass to fill it, waiters often initially grasp the glass in an uncomfortable, thumb-down position. This uncomfortable initial position allows them to enjoy a comfortable, thumb-up, end-state position while filling the upright glass. Rosenbaum et al. (2012) review a series of experimental tasks that demonstrate this effect. In these tasks, subjects typically grasp an object and move it to a target, and there are multiple options for the initial grasp configuration (e.g. palm up or palm down when grasping a horizontal dowel, thumb up or thumb down when grasping a handle oriented vertically in front of the participant) or position (e.g. grasp height when grasping a vertical dowel). These experiments show that participants often sacrifice comfort when initially grasping the object in order to achieve comfort at the end of their movement, an effect that is termed 'end-state comfort'. End-state comfort has the benefit of allowing greater precision in positioning the object in relation to the goal (Rosenbaum, van Heugten, and Caldwell 1996), as compared with end-state positions that are less comfortable. The link between end-state comfort and precision is supported by results presented by Rosenbaum, van Heugten, and Caldwell (1996), showing that when precision requirements of tasks were reduced, the end-state comfort effect was also reduced. Initial grasp configuration is also influenced by other factors (such as visibility of the object to be moved), which also contribute to positioning accuracy. The end-state comfort effect appears to increase in strength from childhood to adulthood (see review in Rosenbaum et al. 2012), suggesting that effective strategies for achieving endpoint accuracy must be learned.
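The variability analysis described above can be sketched as follows; this is a reconstruction of the method as summarized in the text, not Todorov and Jordan's own code, and the array shapes and names are assumptions.

import numpy as np

def variability_profile(trials: np.ndarray) -> np.ndarray:
    # trials: shape (n_trials, n_points, 2), each trial's x,y positions already
    # resampled at 100 equally spaced points along its path, as described above.
    mean_path = trials.mean(axis=0)                          # average movement path
    deviations = np.linalg.norm(trials - mean_path, axis=2)  # per-point distance
    # Averaging across trials gives one deviation value per path point; the
    # reported pattern is minima at the circular targets, maxima in between.
    return deviations.mean(axis=0)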


7.2.3 Summary of evidence for three separate components in speech sound production

Together, the three components in a framework based on phonology-extrinsic timing provide an account for the distribution of accuracy within a movement, motivating consideration of an approach to speech movement planning that separates the mechanism for planning how to achieve a goal (Phonetic Planning) from both the mechanism for setting the goals for the plan (Phonological Planning), and the mechanism for tracking and adjusting the implementation of the plan (Motor-Sensory Implementation). The following section turns to evidence for the symbolic nature of the representations used in the Phonological Planning component, in contrast to the abstract (because they do not fully specify the surface form), but nevertheless spatiotemporal (and hence non-symbolic), representations of AP/TD.

7.3 Evidence for the separation between the Phonological and Phonetic Planning Components: Abstract symbols in Phonological Planning

Almost every approach to modeling the phonological/phonetic level of human speech processing, whether in production or perception or learning, involves some degree of abstraction in the cognitive representations of words in the lexicon that subserves those processes. An example from the perception literature is the work of McQueen, Cutler, and Norris (2006), which provides evidence for abstract sub-lexical representations. This work shows that listeners exposed to recognizable words whose final fricative /s/ or /f/ has been acoustically distorted learn to recognize the distorted fricative even in new words. The fact that the training generalizes to untrained words is difficult to explain without a sub-lexical abstract unit that those words share. That is, the distorted signal is linked to an abstract representation of the target segment.⁶ Moreover, the fact that the same distorted fricative token can be perceived as evidence for two different phonemes, /s/ or /f/, depending on whether the listener has

⁶ An exception to the idea that all models share some degree of abstraction is found in extreme exemplar-based models; however, even a radical version of exemplar-based processing posits a cloud of stored tokens for each word, and the cloudlike relatedness of these forms might be taken as a representation of their equivalence, which could be viewed as a kind of abstract representation of that word. Hybrid models that combine stored exemplars with abstract representations (Ernestus 2014; Pierrehumbert 2002; Cutler et al. 2010) usually invoke symbolic sub-lexical representations.


heard it in /s/-final (like brass) or /f/-final (like golf) words, supports the view that these abstract representations don't necessarily have a fixed relationship with surface phonetic form. That is, the abstract category can't be inferred from the acoustics alone; it is only recoverable from the acoustics plus knowledge of abstract representations in the lexicon. See also e.g. Lisker and Abramson (1970) and Kazanina, Phillips, and Idsardi (2006) for evidence that listeners categorize the same acoustic stimuli differently depending on the phonological categories in their sound inventories; this suggests that some type of abstract category representation is used in perception that differs from auditory information. Thus, there is general agreement on the need for some kind of abstract representation of a word that differs from any given token of the surface signal that is produced or heard, yet links all of these acoustically different tokens together. However, there are significant differences among models in what that claim means, i.e. how abstract these representations are, and what kinds of units are involved. With respect to the degree or type of abstraction, for example, AP/TD postulates that the phonological form of a word stored in the lexicon is abstract, because it does not explicitly specify the surface form that the word will take in any particular utterance. That is, the abstract phonological elements (gestures) are fixed by the gesture's equation of motion, including its target (but not the value of the starting-position parameter, which varies with preceding context). As a result, the choice of articulators and the value of the target parameter for a gesture never change. However, through the operation of a variety of mechanisms, the surface characteristics of a gesture in a particular utterance are determined by contextual factors. These include the starting position, prosodic and overlap context, speaking rate, as well as any unexpected perturbations. As a result, the surface form emerges from the interaction between a) the fixed ensemble of gestures and the coupling graphs for words stored in the lexicon and b) the gestural score for a particular utterance, which specifies context-specific gestural activation. Thus the lexical representations in AP/TD are abstract in the sense that they do not fully specify the surface form, since this form depends on context. However, although lexical representations and gestures of AP/TD are abstract in this sense, they cannot be taken as symbolic. This is because the stored lexical forms include quantitative information, in the form of the equations of motion of their gestures, and also involve the specification of which sets of physical articulators (coordinative structures) will be involved in carrying out the gesture. In contrast, the abstract lexical representations postulated in the XT/3C approach are symbolic, because they do not contain
quantitative spectral, spatial, or temporal specifications and do not specify the articulators; instead, they specify contrastive, qualitative (often relational) categories which receive their quantitative specifications during phonetic planning. Thus we can identify two possibilities for types or levels of abstraction in lexical representations: 1) abstract representations, which provide quantitative information that does not (yet) correspond to surface forms (as in AP/TD), and so must be adjusted for different contexts, vs. 2) abstract symbolic representations (as in the XT/3C approach), which provide contrastive category information but without spectral, spatial, or temporal quantitative characteristics; these must be generated in a separate planning component. This section presents evidence that supports the latter type of symbolically realized abstraction, leaving to a later section (7.4) the question of how such symbolic representations are translated into quantitative acoustic specifications, and then into movement specifications which can serve as input to the motor system, during speech production. What is the size/nature of the abstract units? In addition to the question of whether phonological representations (both lexical and planning) are merely abstract or also symbolic, there is the question of whether they consist of smaller elements (such as gestures or distinctive features), or larger bundles of features (such as phonemes). With respect to the question of what kinds of planning units are involved, there are a number of plausible candidates, including gestures, distinctive features, phonemes, and syllable subconstituents such as onsets and rhymes. Chapter 10, Section 10.1.3 discusses some of the evidence that bears on this issue. One of the most compelling lines of evidence for the use of abstract symbols (rather than abstract but non-symbolic gestures) as the representational units in the lexicon that serve as inputs to the phonological/phonetic planning process is the phonological equivalence⁷ of sounds produced with different sets of articulators. It has long been recognized that a single phonological category can be realized in very different ways in different contexts. A classic example is American English /t/, which is often produced quite differently in e.g. the top (aspirated), stop (unaspirated), butter (flapped) and pot (often unreleased and/or glottalized). Moreover, native speakers of English have little trouble recognizing that all four of these variants, and others, are instances of the same sound category, even when produced in atypical ways. For example,

⁷ Phonological equivalence is reflected in the ability to signal the same abstract element in several different ways, as a speaker, and to recognize these different variants as members of the same abstract category, as a listener.


an American English speaker uttering the word butter in a particularly clear manner may produce it with the full closure, a burst of frication and aspiration (normally found for the /t/ in the top), instead of with the flapped or sometimes even glide-like version that is more typically heard in butter, and a listener hearing such an atypical production would have no trouble accepting it as a token of the word butter. Thus, listeners are accustomed to recognizing tokens with quite different acoustic manifestations as instances of the same phonological category. What is not as widely appreciated is that instances of the same phonological category may be produced with different sets of articulators, a pattern of behavior that is difficult to account for in a gesture-based framework. This section presents evidence that different (sets of) articulators can be used to produce different tokens of the same contrastive sound category, which argues for phonological equivalence across different gesture types. This evidence includes studies supporting the phonological equivalence of articulatorily and gesturally different types of /r/ (7.3.1), articulatorily and gesturally different types of /t/ (7.3.2), and articulatorily and gesturally different types of /n/ (7.3.3).

7.3.1 Phonological equivalence between different gestures for /r/

Several different strands of experimental work suggest that speakers use different types of constrictions, sometimes made at different places of articulation and with different articulators, to produce tokens of the same phonemic category.⁸ These findings raise problems for a gesture-based phonology, where phonological equivalence requires the use of the same gesture(s). The first challenging set of results comes from studies of /r/. For example, Scobbie, Sebregts, and Stuart-Smith's (2009) ultrasound tongue imaging study showed

⁸ There are various ways of determining whether two phonetic variants are members of the same phonological category. One is to note whether the two variants are used in the same word by the same speaker on different occasions; another is whether they occur in complementary structural positions in morphologically related words (such as the /t/ in cite vs. in citation), or in versions of words created in language games (as in the American English game Ubby-Dubby, where the syllable /bə/ is inserted between syllables of a word, so that butter is transformed into buh-bə-ter-bə, with aspirated /t/ rather than the more typical flap). Somewhat weaker evidence is found when the morphological evidence is missing but the two variants are found in complementary distribution. Additionally, if a writing system is created by native speakers to use the same symbol for two different variants, it suggests strongly that they are instantiations of the same symbolic phonological class; see Lindau (1985) for discussion of this phenomenon with respect to /r/ across the languages of the world.

166  -  that some Dutch speakers have qualitatively different tongue shapes for /r/ when it is realized in different positions, e.g. onset (uvular) vs. coda (usually alveolar approximant). This is consistent with observations of two gesturally distinct allophones or context-specific realizations for this phoneme (first suggested by Whorf in the 1940s, as cited in Trager and Smith 1951/1957; see also van Bezooijen 2005; Sebregts 2015). One speaker in the Scobbie et al. (2009) study showed tongue-tip raising for coda approximant /r/, but no involvement of the tongue tip for the uvular variant produced in onset position. This is consistent with observations from a large subset of speakers in an acoustic study, where many speakers showed the same pattern of coda approximant /r/ and onset uvular /r/ (Sebregts 2015). The fact that a speaker realizes the same phonemic category using two different parts of the tongue at two different places of articulation in two different structural contexts is difficult to reconcile with a gesture-based definition of the phonemic category, as in AP/TD. Scobbie et al. report similar variation between speakers, with regard to which part of the tongue (i.e. which articulator) is used to produce /r/ in the same structural context. Like the observation that a single speaker produces /r/ with different parts of the tongue in different contexts, the observation that different speakers produce /r/ with different articulators in the same context is challenging to a model like AP/TD, since a definition of the category in terms of gestures obscures the phonological equivalence between the two types of realization. Additional evidence for the cognitive equivalence of /r/ sounds made by different articulators comes from a shadowing study of the alveolar vs. uvular variants of /r/ in Dutch by Mitterer and Ernestus (2008). In this study, speakers of one variant were asked to shadow speech containing different non-words beginning with each variant; in most cases, the shadowers used their habitual variant regardless of the variant that occurred in the speech that they were asked to shadow. Critically, they did not show any evidence of slower production latencies when their produced variant did not match the variant they heard. These results suggest that the alveolar and uvular variants are phonologically equivalent for these participants, in spite of the fact that they are modeled as distinct gestures in AP/TD. This evidence is particularly compelling because it shows that the speaker regards the two variants as equivalent in the same word position. Evidence for the phonological equivalence of gesturally distinct variants of /r/ is not restricted to Dutch. In American English, Tiede et al. (2010) provide additional evidence for the phonological equivalence of /r/ produced by different parts of the tongue. They used an artificial palate with a protuberance that
interfered with the articulation of North American English /r/, to test whether speakers freely substitute a bunched vs. a retroflex articulation of this segment under perturbation. They report a variety of individual speaker behaviors, but conclude overall that “ . . . speakers (1) have a repertoire of articulatory strategies for /r/ (including the use of qualitatively different tongue configurations that produce equivalent acoustics) in normal speech, and (2) call on these strategies in the face of articulatory perturbations.” They note that their speakers’ ability to switch between two distinct articulatory configurations that produce very similar acoustic results is consistent with the primacy of acoustic goals in the production of /r/, and is also consistent with symbolic representations. Finally, Foulkes and Docherty (2000) present auditory and acoustic analyses suggesting that different /r/ variants are not limited to those produced by the tongue. They report that several Newcastle English speakers in their corpus produce both labiodental and alveolar approximant variants of /r/; the phonological equivalence of these variants is difficult to capture using a single set of gestures, and suggests a more abstract symbolic phonological category.

7.3.2 Phonological equivalence between different gestures for /t/

A second line of work that suggests phonological equivalence across different articulators concerns glottal stop vs. released variants of /t/ in British English (Heyward, Turk, and Geng 2014). They report evidence that qualitatively different sets of articulators (which in AP/TD terms would be classified as different gestures) can be used in stylistically and contextually governed variants of this phoneme. This evidence comes from an articulatory study of /t/ produced by eleven British English speakers in a corpus of multiple speech tasks in different styles (the Edinburgh Speech Production Facility DoubleTalk corpus; Scobbie, Turk, Geng, King, Lickley, and Richmond 2013). Their analysis showed that phrase-medial, intervocalic /t/ was very often produced as a glottal stop without any evidence of tongue-tip raising. Thus phonological equivalence between the variants of /t/ cannot be captured by a tongue-tip raising gesture, since there is no evidence for tongue-tip raising in these phrase-medial glottal stop variants. This pattern was in marked contrast to the articulation of released variants of /t/ produced on other occasions by the same speakers, either in other positions (e.g. syllable-initial position), where /t/ is produced with tongue-tip closure, release burst, and aspiration, or in the same intervocalic position. Figures 7.1 and 7.2 show examples of caught her
produced on two occasions by the same speaker, once with a released /t/ accompanied by expected tongue-tip raising (Figure 7.1), and once with a glottal stop without /t/-related tongue-tip raising (Figure 7.2). The fact that the same speaker produces glottal stop and released /t/ on different occasions in two variants of the same phrase strongly argues for their phonological equivalence. AP/TD currently cannot express the phonological equivalence of productions of the same speech-sound category that involve movements by different sets of articulators, as appears to be required for e.g. [th] (tongue-tip raising + glottal opening) vs. [ʔ] (glottal constriction only) allophones of Scottish and Southern British English /t/, or the uvular vs. alveolar variants of Dutch /r/. In contrast, these findings can be easily accommodated in a theory that involves symbolic categories (e.g. /t/ or /r/), which can be realized phonetically in different ways, by different sets of articulators, and (as a result) show different acoustic cues to their contrastive features in different tokens. This is because the lexical representations in such a theory do not include quantitative spatial, temporal, or acoustic specifications. The evidence showing that tokens are judged to be the same sound category even when they are produced by different articulators

(when that sound occurs either in different positions in a constituent, i.e. in syllable-onset vs. non-initial intervocalic position, or in the same context in different speech styles) challenges the view that phonological categories are defined by consistent gestural participation, and instead suggests that the category is represented in a symbolic way.

[Figure 7.1 panels: acoustic waveform (Amp), spectrogram (Freq), and tongue-tip vertical position (TTz) plotted against Time (s), 0–0.694 s, with the /t/ region boxed.]

Figure 7.1 The utterance excerpt . . . caught her . . . produced by a Scottish female speaker from the DoubleTalk corpus (Scobbie et al. 2013). Note: The top panel shows the acoustic waveform, the middle panel shows the spectrogram, with units of frequency in Hz, and the bottom panel shows the vertical movement (in cm) of a sensor attached less than 1 cm from the tip of the tongue. The tongue was raised for /t/ in caught to produce oral closure; see the region of the figure enclosed in a box.

[Figure 7.2 panels: acoustic waveform (Amp), spectrogram (Freq), and tongue-tip vertical position (TTz) plotted against Time (s), 0–0.503 s, with the /ɒt/ region boxed.]

Figure 7.2 The utterance excerpt . . . caught her . . . produced by the same Scottish female speaker that produced . . . caught her . . . shown in Figure 7.1. Note: The top panel shows the acoustic waveform, the middle panel shows the spectrogram (frequency values in Hz), and the bottom panel shows the vertical movement (in cm) of a sensor attached less than 1 cm from the tip of the tongue. The /t/ in this instance of caught her was produced with glottalization (cf. the acoustic waveform for the vowel), but without tongue-tip raising, and was heard as a glottal stop; see the region of the figure enclosed in a box.

7.3.3 Phonological equivalence between different gestures for /n/

The evidence cited above supports the view that speakers may use different sets of articulators to signal the same phonemic contrast on different occasions, consistent with the hypothesis that accounting for speaker behavior requires abstract symbolic planning elements combined with a range of potential ways of signaling those elements. Additional evidence for this view comes from a study of how different individual speakers produce sequences of sounds. When two sounds in sequence differ in the symbolic features that
specify their place of articulation, different speakers use different strategies to accomplish the patterns of temporal co-production/overlap known as assimilation or co-articulation. Ellis and Hardcastle (2002) report EPG data from ten English speakers producing, at varying rates of speech, utterances that contain two experimental sequences across a word boundary, i.e. /-n#k-/ and /-ŋ#k-/, in . . . ban cuts . . . and . . . bang comes . . . The latter provides a lexical velar–velar sequence, with which the assimilatory results for the alveolar–velar /-n#k-/ sequences can be compared. Ellis and Hardcastle note that, for the alveolar–velar /-n#k-/ sequence at a fast rate of speech, some of their speakers exhibited the kind of gradient assimilatory or gestural overlap behavior that is often reported, but two of their speakers did not. Instead, these two speakers produced full alveolars for some tokens of /n/ and complete assimilations to a velar articulation in others. Follow-up electromagnetic articulometry analysis yielded no trace of the alveolar gesture when the velar articulation was fully realized. Thus, for at least some speakers, the word-final /n/ in ban was realized as a full velar in some tokens and as an alveolar in others, with no evidence of gradient gestural overlap behavior. In AP/TD, a velar nasal would be produced with a velum-lowering gesture combined with an oral constriction gesture accomplished by movements of the tongue body + jaw, whereas an alveolar nasal would be produced with a velum-lowering gesture combined with an oral constriction gesture accomplished by movements of the tongue body + jaw + tongue tip. Ellis and Hardcastle's observation suggests the phonological equivalence of variants of the word-final phoneme produced as velars and as alveolars; the equivalence between these two forms cannot be expressed in terms of a tongue-body constriction gesture, as would be expected from an AP/TD lexicon. The studies described above show that speakers can use different sets of articulators to produce the same speech sound in different contexts, and different sets of articulators to signal the same sequence of phonemic contrasts on different occasions. These findings challenge an approach to phonology in which the spatiotemporal characteristics of a word are emergent from a lexical representation that defines a word in terms of fixed sets of articulators and the equations of motion for their constriction formation. Instead, these results support the view that words are represented in the lexicon in terms of abstract symbolic elements, which are translated into phonetic plans that can differ from one planned utterance to another, both in the sets of articulators involved and in the quantitative spatial and temporal specifications for articulatory movement. The issue of what exactly these abstract, symbolic sublexical constituents might be is addressed in Chapter 10.
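To make the representational contrast concrete, the toy sketch below contrasts a gesture-based lexical entry, which commits to a fixed articulator set, with a symbolic entry that leaves the choice of articulators to later, context-sensitive planning. The data structures, parameter names, context labels, and values are invented for illustration; this is not AP/TD's actual formalism, nor the model proposed in this book.

```python
# Toy contrast between gesture-based and symbolic lexical entries for the
# word-final nasal in 'ban' (all names and values are illustrative only).

# A gesture-based entry fixes the articulator set and dynamical parameters,
# so a velar realization cannot count as the "same" lexical item:
gestural_entry = {
    "velum": {"action": "lowering", "stiffness": 8.0},
    "oral": {"articulators": ("tongue_tip", "tongue_body", "jaw"),
             "constriction": "alveolar_closure"},
}

# A symbolic entry stores only contrastive categories; articulators are
# chosen per utterance, so alveolar and velar variants remain equivalent:
symbolic_entry = {"features": ("nasal", "coronal")}

def realize(entry, context):
    """Choose a context-appropriate articulatory plan for a symbolic nasal."""
    assert "nasal" in entry["features"]         # the category constrains choices
    if context == "fast_before_velar":          # e.g. 'ban cuts' at a fast rate
        return {"articulators": ("tongue_body", "jaw"), "place": "velar"}
    return {"articulators": ("tongue_tip", "tongue_body", "jaw"),
            "place": "alveolar"}

print(realize(symbolic_entry, "fast_before_velar"))  # velar variant
print(realize(symbolic_entry, "careful"))            # alveolar variant
```

On this sketch, the two surface variants are related through the shared symbolic entry rather than through any shared gestural parameters, which is the sense of 'phonological equivalence' at issue here.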


On the assumption that these representations are abstract and symbolic, a ‘translation’ mechanism is required to map these symbols onto quantitative phonetic instructions. The next section addresses this issue.

7.4 The translation issue

Evidence presented in the Phonology-Extrinsic Timing chapter (Chapter 4, as reprised in Section 7.2) shows that representing and specifying surface time are required for an adequate model of speech-production planning. Moreover, evidence presented in Section 7.3 above strongly suggests that a production model must also represent abstract symbolic categories (unspecified for articulators and other details), so that a category can be implemented by different sets of articulators; this adds to evidence from other domains that supports symbolic phonological representations. Since such symbolic representations do not include quantitative specifications for either the acoustic goals (including surface timing) or the articulatory means of achieving those goals, this view requires a translation mechanism that takes the abstract symbols as input and produces quantitative specifications as outputs. In the AP/TD framework, this translation requirement has been viewed as a compelling argument against symbolic representations in the lexicon (see for example Fowler 1980), and has motivated the phonology-intrinsic timing approach. However, it will be argued that in an XT/3C approach, an utterance-specific representation in terms of individual feature cues and their values can bridge the gap between the symbolic lexical representation of a word and the quantitative representations of the form that that lexical item will take in a particular utterance. This section highlights the complexities that arise in the AP/TD framework for speech processing (in production, perception, and learning) due to its avoidance of symbolic representations (Section 7.4.1), and describes how the XT/3C approach meets the translation requirement by adopting a feature-cue-based model of phonological/phonetic processing in production, along with an equivalence-classification account of perception (Section 7.4.2).

7.4.1 Why AP/TD needs to represent surface time and symbolic categories, and thus cannot avoid the translation issue

In addition to its success in pointing out that many coarticulatory phenomena do not involve a category change, but instead occur by gestural overlap,
AP/TD's postulation of phonology-intrinsically-determined default timing for the gestures in a phonological plan was particularly attractive for other reasons as well. When it was first proposed, it seemed to eliminate the need for complex computation of surface-timing patterns in production, and equally complex interpretation processes in perception. It also avoided an aspect of symbolic-feature-based models that seemed particularly challenging: the need to translate between cognitive/conceptual symbolic units in the phonology (defined by distinctive features) and quantitative specifications for physical instantiations (Fowler 1980; Fowler et al. 1980). Postulating that lexical representations and utterance processing representations share the same gestural vocabulary appeared to provide seamless integration between the two, and (in later versions) postulating that timing relationships among the gestures in particular utterances simply emerge from their governance by coupled planning oscillators, without explicit planning of relative timing (i.e. coordination), appeared to avoid the translation problem posed by symbolic feature-based representations. In fact, as the complex systematicity of the factors that influence speech timing, such as prosodic structure, has emerged more and more clearly over the succeeding decades of speech research, the complexity of the timing-control system proposed in the AP/TD model has continued to increase in response. As a result, the apparent advantages of spatiotemporal representations and plans for coordination stored in the lexicon, whose default activation intervals must be adjusted to the contexts of particular utterances, have lessened. In fact, there are even more such factors than are currently handled by the Pi- and MuT-gestures in AP/TD (see Chapter 3), so that maintaining this framework will require the development of many additional adjustment mechanisms, and more widespread use of the ones that are currently instantiated, substantially increasing the complexity of the model. The evolving complexity of the model is suggestive of the need to consider an alternative approach, but the lack of explicit representation of surface time in AP/TD provides even stronger motivation. Because experimental findings presented in Chapter 4 strongly suggest that speakers represent surface-timing information in solar time units, AP/TD would need to explicitly represent the correspondence between its temporal phonological representations (in planning+suprasegmental oscillator ensemble period units) + adjustments, and surface, solar, time. That is, the gesture/oscillator-based approach would need to incorporate a mechanism to translate between its set of phonology-specific abstract timing units + adjustments, on the one hand, and surface-timing units on the other, something that is currently absent from the model. This translation would be particularly complex to implement because there is no
straightforward correspondence between AP/TD planning+suprasegmental oscillator ensemble units and external-world ‘solar time’ units, owing to adjustments to the AP/TD planning+suprasegmental ensemble oscillation frequency for speech rate and for specific prosodic positions. These adjustments effectively warp AP/TD ‘clock’ time with respect to solar time differently for different rates of speech, and in different parts of an utterance, i.e. at boundaries and prominences vs. other parts of the utterance. This lack of consistent correspondence in the duration of units between the two types of clocks for specific utterances, and for specific positions in each utterance, has the potential to make translating between them extremely difficult, if in fact these relationships could be computed at all. In addition, translating between AP/TD ‘clock’ time and solar time would be contrary to the basic tenets of the phonology-intrinsic-timing-based approach, in which there is no separation between Phonological and Phonetic planning components, and thus no translation between them. Thus, in a framework like AP/TD, which relies on intrinsic ‘phonological timing’ (with surface time emerging from a set of interacting mechanisms without an explicit representation), it will nevertheless be necessary to translate to explicitly represented surface timing. By Occam’s Razor, then, it seems useful to explore an alternative processing model like the XT/3C model, in which the surface timing for a particular utterance is specified directly from atemporal symbolic representations, under the governance of a number of weighted grammatical and extra-grammatical factors, rather than by adjusting a phonologically specified default duration defined in terms of phonology-specific timing units. This approach avoids the necessity of adjusting default gestural activation intervals in complex ways in different contexts, and makes use of general-purpose timekeeping mechanisms to represent and specify surface durations in solar timing units, mechanisms which are well-motivated in other non-speech motor activities (see Chapter 4). Thus, the available evidence suggests strongly that the translation problem can’t be avoided simply by adopting a non-symbolic, spatiotemporal, gesture-based phonology, in which gestural-activation intervals are adjusted by a complex set of mechanisms that are not otherwise motivated by the theory. That is, if it is compelling (as argued in Chapter 4) that there is a need to represent, specify, and track surface time intervals, in speech as in other motor activities, then there is a need to generate representations of surface time in any case, whether the underlying representations are symbolic or gestural. This need for translation lessens the advantage of a non-symbolic, gesture-based account of speech production, where translation would be difficult
because there is no straightforward correspondence between internal (e.g. AP/TD) time and solar time, since the correspondence is different in different parts of an utterance. Moreover, as our understanding of the wide range of factors that can influence the surface form of words and their component sounds in different utterances deepens, the number of adjustment mechanisms required to ensure that a gesture is produced appropriately for a particular context must multiply as well. While it may be possible to craft such a set of adjustment mechanisms, as their number increases, the complexity of the resulting gesture-based system will move it away from the elegant simplicity which was part of its initial appeal. These challenges help to motivate the development of a different approach to speech-production planning, so that its predictions (and eventually the performance of its implementation in generating natural-sounding speech) can be tested against those of the AP/TD model.

7.4.1.1 Implications for perception

The complexities that arise for production in a model that lacks both symbolic representation and phonology-extrinsic timing have parallels for perception as well. In the AP/TD approach to perception, listeners directly apprehend the equations of motion specified by the speaker’s articulatory plan, and thus the corresponding lexical representations, so that (in parallel to the AP/TD approach to speech production) there is no requirement to translate from the quantitative values in the signal to an underlying sequence of phonological symbols corresponding to the speaker’s intended words. However, this creates a difficulty: a listener who hears an utterance must determine the coefficients of the equations of motion that underlie that spoken waveform, and this is not an easy task, because the surface forms vary so widely. To give a concrete example, assuming gestural stiffness can be inferred from the relationship between peak velocity and distance, a short-distance tongue-tip raising movement with a high peak velocity might be interpreted as either 1) a stiff movement with a relatively low tongue-height gestural target, or 2) a less stiff movement with a higher tongue gestural target, whose activation interval has been truncated so that the target isn’t approximated. Inferring target parameter values may be similarly difficult, since targets can be undershot owing to short activation intervals and/or gestural overlap. This example illustrates the more general problem, which is that there are many factors which lead to differences in the articulator trajectories associated with each gesture—factors that include e.g. adjacent elements in the plan, speaking rate, phrase-level prosodic structure (boundaries and prominences),
speaking style (casual vs. formal), etc. Many of these involve adjustments to gestural-activation intervals that result in spatial and temporal differences that can obscure the underlying coefficient values for each gesture’s equation of motion. Thus the listener must undo the effects of multiple factors whose number and effects must be discovered—rather like solving an equation with many unknowns. (This is arguably a much harder problem than category identification; see the following paragraphs for further discussion.) Another way of putting this point is that AP/TD puts the complexities of surface variation into the phonological planning component; complicating the phonological representation in this way may make it difficult for the listener to uncover various types of phonological equivalence among different surface forms. While it might be possible to devise a perception model in the AP/TD framework that addresses these complexities for the listener who already knows the language, additional challenges arise for the language learner. Not only do learners have to master the specific parameter values for gestural stiffness and target position as part of the phonological representation for each gesture; in addition, for each gesture type, they must also learn the gestural-planning oscillator phase proportions for gestural-activation intervals. And for each context, they must learn Pi/MuT heights and scopes, where scope refers to the amount of gestural score time overlapped by the Pi/MuT gesture (see Byrd and Saltzman 2003). Finally, for the language as a whole, they must learn relative frequencies and coupling strengths for syllable, foot, and phrase oscillators. Like listeners, language learners must therefore solve an equation in many unknowns for each gesture in each context. Thus the learner is faced with a particularly challenging many-part problem: how to figure out the gestural target and stiffness values and the coupling patterns (for coordination and suprasegmental structure) from the widely varying surface forms. Additionally, the learner must learn the values for all of the adjustments to the activation intervals and/or planning+suprasegmental ensemble frequencies that are appropriate to different contexts. Even if it is assumed that activation intervals can be determined with certainty, an activation interval that is longer than the default might be due to 1) Pi or MuT gesture-related lengthening, 2) slow global suprasegmental oscillation frequency (corresponding to slow speaking rate), or both. Moreover, while the stored forms of words in the mental lexicon include coupling graphs that specify e.g. the strength of coupling between gestures/configurations associated with the onset consonant of a syllable and its following vowel, there is no such road map for the coupling strength between offsets and onsets of successive words; such coordination is
governed by higher-level planning oscillators. The learner must therefore decipher which types of gestural overlap are specified in the lexicon, and which types are due to overall speech rate and coupling between different levels in the suprasegmental hierarchy. These problems may be more tractable in an XT/3C model, which does not require the listener/learner to infer specific quantitative values to define fundamental phonological elements, but instead requires learning to associate a variety of different surface forms with a symbolic category.

7.4.1.2 Summary

This section has discussed the challenges to AP/TD that arise because of the assumption that timing is intrinsic to the phonology, and the consequent lack of representation of surface timing intervals. It has argued that a translation mechanism will be required in any case, to account for the evidence (presented in Chapter 4, Phonology-Extrinsic Timing) that speakers explicitly represent surface time. An additional difficulty arises from the fact that, because words in the lexicon are defined in terms of equations of motion, listeners and learners must extract their quantitative coefficients from a signal whose widely varying surface properties are ambiguous as to their source in many different ways. The additional mechanisms, such as coupled planning+suprasegmental oscillators and Pi- and MuT-gestures, which have been added to the AP/TD framework over the decades, make this ‘reverse engineering’ for quantitative values even more challenging. In sum, these arguments suggest the advisability of considering an alternative approach to the planning of speech-production timing that directly addresses the translation problem. The next section describes how the XT/3C approach can do so.
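First, to make the ‘equation in many unknowns’ point concrete, consider the toy computation below. It assumes, purely for illustration, that peak velocity scales with a stiffness-like parameter times the planned movement distance, and that truncation of the activation interval cuts off the distance actually traversed; the numbers are invented and are not drawn from AP/TD's actual equations.

```python
# Toy illustration of the listener's/learner's inverse problem.
# Assumption (for illustration only): peak velocity reflects the planned
# distance (velocity peaks mid-movement), while truncation of the
# activation interval reduces the distance actually traversed.

def surface_form(stiffness, planned_distance, truncation=1.0):
    """Return the observable (traversed distance, peak velocity) pair."""
    peak_velocity = stiffness * planned_distance
    traversed = planned_distance * truncation
    return traversed, peak_velocity

# 1) A stiff movement toward a nearby target, fully realized:
print(surface_form(stiffness=10.0, planned_distance=0.5))
# 2) A less stiff movement toward a farther target, truncated halfway:
print(surface_form(stiffness=5.0, planned_distance=1.0, truncation=0.5))
# Both calls print (0.5, 5.0): identical surface observations arise from
# different underlying (stiffness, target, truncation) settings, so the
# coefficients cannot be recovered from the surface form alone.
```

This is the same ambiguity described for the tongue-tip example above; adding overlap, rate, and prosodic adjustments multiplies the unknowns without adding observations.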

7.4.2 Translating symbolic representations into quantitative specifications by using context-appropriate sets of individual feature cues, and some potential advantages to this approach

In the XT/3C approach proposed here, the goal representations in the Phonological Planning Component are symbolic and non-temporal, and the Phonetic Planning Component explicitly specifies surface-temporal characteristics (as well as spectral and spatial characteristics), as compatible with evidence presented in Chapter 4 and earlier in this chapter. However, this approach makes it necessary to provide a mechanism that can bridge the
gap between these two types of data structures, i.e. between the abstract symbolic representations in the phonology and the quantitative specifications of surface characteristics (including time) in the phonetics. Unlike the translation that is required in the AP/TD framework, which is between phonology-intrinsic and surface timing, the translation mechanism proposed in the XT/3C model extends or enriches the abstract symbolic phonological representations to introduce quantitative phonetic (spectral, spatial, and temporal) specifications of surface forms. It does this via a two-step process by which the speaker first makes binary choices among the possible individual cues to distinctive feature contrasts, in the Phonological Planning Component, and then specifies the quantitative values for those cues, in the Phonetic Planning Component. That is, the XT/3C approach postulates a level of representation (individual feature cues) that is symbolic in nature, but closer to the production of the signal than words, phonemes, or distinctive features themselves. The concept of a level of representation corresponding to individual acoustic cues to distinctive features is adopted from the model of perception for lexical access proposed in Stevens (2002). Because these symbolic cues can take on a range of quantitative values, they provide a mechanism for translating between the symbolic goals of the Phonological Planning Component and the surface specifications of the Phonetic Planning Component. Thus modeling the wide range of possible surface implementations of a given phonological symbol becomes a more tractable problem, because each possible surface form is related in a learnable way to the underlying category, even though the surface variants are not related to each other in any obvious way. From this perspective, the phonological task of the listener or learner is one of equivalence classification; that is, he or she must infer how each set of context-appropriate cues maps onto its underlying category. This is not a simple task, but it is a very different task from that of inferring specific underlying quantitative values from ambiguous surface information, and, it can be argued, a more tractable one. (See Poeppel, Idsardi, and van Wassenhove 2008 for discussion of Stevens and Halle’s 1967 analysis-by-synthesis approach to modeling speech perception.) Another way of describing this approach to the production planning mechanism is that the selection of feature cues (in the Phonological Planning Component) and specification of cue values (in the Phonetic Planning Component) provides an enrichment of the symbolic categories, by defining the acoustic parameters whose utterance-specific values will guide the development of the articulatory parameters which in turn will serve as input to the
Motor-Sensory Implementation Component.⁹ As will be discussed in Chapter 10 (Model Sketch), a critical aspect of this set of cues to distinctive features is that it includes feature cues called Landmarks (Stevens 2002), which signal manner features. These discrete events in the speech signal often resemble ‘acoustic edges’ or spectral turning points; they involve abrupt changes in energy across many different frequency bands at the same time. Because they occur at discrete time points, pairs of Landmark events define intervals that can be specified in surface time.

⁹ If the symbolic representation is phonemic, it is at this point that the boundaries between these successive phonemes become irrelevant, because the features map onto acoustic cues and the cues map onto specific parts of movement, e.g. movement endpoints; it thus becomes possible for the realization of cues to successive segments to overlap in time.

This section has discussed the need for a translation mechanism in the XT/3C approach, to generate quantitative acoustic and articulatory specifications from a structured sequence of symbolic phonological categories, and has proposed that individual acoustic cues to feature contrasts provide a representational mechanism for this translation. This completes our discussion of the motivation for separating the Phonological Planning Component and the Phonetic Planning Component. We turn now to arguments for the separate, third component, for the Motor-Sensory Implementation of the quantitative phonetic plan.
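Before turning to that component, the following toy sketch illustrates the two-step translation just described: symbolic cue selection in the Phonological Planning Component, followed by quantitative value assignment in the Phonetic Planning Component. The cue inventory, context labels, and millisecond values are hypothetical placeholders invented for illustration; they are not the model's actual specifications.

```python
# Toy two-step translation for a British English /t/ (cue names, contexts,
# and values are hypothetical placeholders, not the model's specifications).

def select_cues(segment, context):
    """Step 1 (Phonological Planning): a symbolic, binary choice among
    possible cue sets for the segment in this context."""
    if segment == "t" and context == "phrase_medial_intervocalic":
        return ("glottal_closure",)                  # glottal-stop variant
    return ("closure_landmark", "release_burst", "aspiration")

def specify_values(cues, rate_factor=1.0):
    """Step 2 (Phonetic Planning): quantitative (here, temporal) values for
    each selected cue; Landmark pairs such as closure-to-release define
    intervals specifiable in surface time.  rate_factor > 1 = slower speech."""
    default_ms = {"closure_landmark": 60, "release_burst": 5,
                  "aspiration": 50, "glottal_closure": 70}
    return {cue: default_ms[cue] * rate_factor for cue in cues}

print(specify_values(select_cues("t", "phrase_medial_intervocalic")))
print(specify_values(select_cues("t", "syllable_initial"), rate_factor=1.2))
```

The point of the sketch is architectural: the symbolic category constrains which cue sets are possible, while the quantitative values remain free to vary with rate, style, and prosodic context, so the glottal-stop and released variants share no articulatory parameters yet are both reached from the same symbolic /t/.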

7.5 Motivating the separation between Phonetic Planning and Motor-Sensory Implementation

The timing accuracy evidence presented above motivates a three-part division between Phonological Planning, Phonetic Planning, and Motor-Sensory Implementation. It suggests that the higher spatial and temporal accuracy at movement endpoints over repeated movements can be explained if symbolic phonological representations, planned in the Phonological Planning Component, map onto the most behaviorally relevant parts of movement, often the endpoints, whose spatial characteristics and timing are planned in the Phonetic Planning Component. These behaviorally relevant parts of movement are prioritized for spatial and temporal accuracy in the Motor-Sensory Implementation Component, according to Todorov and Jordan’s Minimal Intervention Principle. Indeed, most (if not all) models of speech production (including AP/TD) agree that there must be a separation between these two Planning and Implementation components. This is not least because some
characteristics of movement (e.g. goal-related movement endpoints) must be planned before talking begins, and therefore must be part of a Planning component, and also because some aspects of speech behavior can be explained only by processes that occur after speech has begun, e.g. speakers’ responses to unexpected perturbations or altered feedback (e.g. jaw loading and manipulated acoustics; Folkins and Abbs 1975 et seq.; Houde and Jordan 1998, 2002; Max, Wallace, and Vincent 2003; Purcell and Munhall 2006; Villacorta, Perkell, and Guenther 2007; Cai et al. 2011). These findings motivate planning and implementation components that operate at least somewhat sequentially, and are therefore independent. The following sections, 7.5.1 and 7.5.2, address the division of labor between Phonetic Planning and Motor-Sensory Implementation, by asking 1) how much planning occurs in the Phonetic Planning Component (Section 7.5.1) and 2) what processes occur in the Motor-Sensory Implementation Component (Section 7.5.2).

7.5.1 The division of labor between Phonetic Planning and Motor-Sensory Implementation

The main question addressed here is how much planning must take place in the Phonetic Planning Component. Put another way, to what extent can surface phonetic characteristics such as surface timing emerge from the Motor-Sensory Implementation Component, without prior planning in the Phonetic Component? Earlier, it was argued that the Phonological Planning Component involves abstract symbolic representations which are non-quantitative, and thus that at the very least the spatial characteristics and timing of the relevant movement endpoints, which relate most directly to the intended phonological categories, must be planned in a separate Phonetic Planning Component. What does the evidence say about planning other aspects of movement trajectories, such as spatial characteristics of the onset, or duration, or the time-course of movement (which relates to the velocity profile)? Could these possibly emerge from the Motor-Sensory Implementation Component without being planned explicitly for each movement in the Phonetic Component? This type of scenario might arise if speakers follow a fixed set of procedures for how to reach a target from the preceding one, e.g. straight movement paths, with fixed duration movements and symmetric velocity profiles. Under this scenario, movements could follow these procedures automatically, without having to be
planned individually in each new context. Section 7.5.1.1 reviews timing and non-timing evidence which suggests that this scenario does not hold: instead, these phonetic characteristics of individual movements and acoustic intervals vary systematically according to later context. This argues for extensive phonetic planning, because it suggests that speakers have a plan for how to produce each movement appropriately for each context before each movement begins. Given this evidence, an additional set of questions arises: how far in advance does phonetic planning occur? Do speakers plan one movement at a time, or do they pre-plan, i.e. plan longer stretches of speech in advance? Articulatory overlap evidence presented in Section 7.5.1.2 argues for pre-planning. Other evidence from phonetic characteristics at the beginnings of utterances (e.g. degree of inspiration) that depend on later characteristics (e.g. length of utterance) is consistent with this view, but is ambiguous because it could be explained by abstract (i.e. phonological) representations of the later-occurring context, and does not necessarily require the pre-planning of the phonetic characteristics of an entire utterance.

7.5.1.1 Evidence for pre-planning of specific aspects of movement, in addition to movement endpoints

As discussed earlier, results reported by Perkell and Matthies (1992) and by Leonard and Cummins (2011) show less variability at the end of movement than at other points in the movement, which supports the view that the timing accuracy of the movement endpoint has higher priority than that of other aspects of a movement, and thus that it is planned and controlled separately from other aspects of a movement. Additional studies from non-speech support the same point; see Chapters 4 and 5. In order to know which part of the movement to prioritize (and to make most accurate in terms of timing), the speaker must have planned when the movement endpoint will occur, prior to beginning to produce the other, more variable parts of movement. What about the other characteristics of movement, such as its duration, time-course (velocity profile shape), and spatial characteristics? The next section reviews evidence that shows systematic differences across contexts in these characteristics, suggesting that more than just the movement endpoint is planned prior to movement onset.

7.5.1.1.1 Evidence for pre-planning movement durations

Are movement durations systematically different across contexts? If so, speakers would need to plan a movement’s timing even before the movement began, in order to achieve the appropriate movement duration. Or, if
movement durations are the same across contexts, speakers would not need to plan different durations for each context, since they could begin all movements at a fixed duration before the planned movement endpoint. And a third possibility might be that speakers start movements at random times before a planned time-of-target attainment. For either of these last two possibilities, no systematic duration differences across contexts would be expected. Although it is the case that there is a strong positive correlation between articulatory movement distance and peak velocity (Ostry and Munhall 1985; Bonaventura 2003), with the consequence that articulatory movements of different distances are similar in duration, it is not the case that all movement durations are equal. As discussed in Chapter 3, Fitts’ law (Fitts 1954) leads to the prediction that movements of longer distances should be longer in duration if the same spatial accuracy is achieved at the movement endpoint (Harris and Wolpert 1998), and this prediction appears to be upheld in the literature. For example, Hertrich and Ackermann (1997) found systematic differences among /i: a: u:/ lip-movement-cycle (lip opening+closing) durations in German, with low vowels longer than high vowels (but few significant differences for phonemically short vowels).¹⁰ This supports Fitts’ law, because low vowels have longer-distance movements than high vowels. Another prediction from Fitts’ law is that movements with lower spatial accuracy requirements should be faster; English schwa is a vowel notorious for its spectral and spatial variability (Browman and Goldstein 1992b; Bates 1995), and on the assumption that it has lower spatial accuracy requirements, it would be predicted to have faster movements. Beckman and Edwards’ (1992, 1994) finding of a steeper slope of the peak velocity/distance relationship (and thus faster movements) for English schwa compared to full vowels is in line with this prediction. Systematic differences in movement durations across segments have also been observed for phonemically short vs. long vowels in German (Hertrich and Ackermann 1997); shallower slopes of the peak velocity/distance relationship were observed for opening movements for long vowels. And Summers’ (1987) study of vowels before voiced vs. voiceless consonants in American English showed that longer vowels before voiced word-final consonants are produced by later and slower closing movements toward the final consonant, with comparable jaw displacement.

¹⁰ Observed articulatory differences are consistent with vowel height-related differences reported for acoustic vocalic interval durations in e.g. Peterson and Lehiste (1960) and many others (e.g. Hertrich and Ackermann 1997).
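For concreteness, Fitts' law in its classic form relates movement time MT to movement distance D and target width W (an index of the required spatial accuracy) as MT = a + b·log₂(2D/W). The sketch below simply evaluates this relation; the constants a and b are speaker- and effector-specific, and the values used here are invented for illustration.

```python
import math

def fitts_mt(distance, width, a=0.05, b=0.1):
    """Fitts' law, classic form: movement time grows with the index of
    difficulty log2(2D/W).  Constants a and b are illustrative only."""
    return a + b * math.log2(2 * distance / width)

# Longer distance at the same endpoint accuracy -> longer movement time
# (cf. low vs. high vowels):
print(fitts_mt(distance=1.2, width=0.2))   # larger excursion: ~0.41
print(fitts_mt(distance=0.6, width=0.2))   # smaller excursion: ~0.31

# Laxer spatial accuracy (wider target) -> shorter movement time
# (cf. schwa vs. full vowels):
print(fitts_mt(distance=0.6, width=0.4))   # ~0.21
```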

Differences in movement duration have also been observed in controlled experiments in several different prosodic contexts. Longer-duration jaw movements have been found in some phrasally accented syllables as compared to phrasally unaccented syllables (e.g. de Jong 1991; Cho 2006). Because these longer durations occur in contexts where movement distances are longer (Harrington, Fletcher, and Beckman 2000; Erickson 2002), the findings are consistent with Fitts’ law. Longer-duration movements have also been found in phrase-final position as compared to phrase-medial position (e.g. Edwards, Beckman, and Fletcher 1991; Tabain and Perrier 2005, 2007; Cho 2006). In these positions, movement distances are often comparable to phrase-medial distances (at least when phrase-final lengthening is not confounded with phrasal-accent-related lengthening). In these contexts, the peak velocity/distance relationship has a shallower slope (Edwards, Beckman, and Fletcher 1991; Bonaventura 2003), as compared to phrase-medial position. Finally, studies of nasalization in syllable onset vs. coda position (Cohn 1993; Krakow 1989, inter alia) show that the movement to open the velum for a nasal consonant starts earlier relative to oral closure (and has a much longer duration) when the nasal consonant is in coda position (i.e. post-vocalic, as in seem ore) compared to a following nasal in syllable onset position (as in see more). All of these systematic durational differences in different contexts support the need for a planning mechanism that operates before the movement starts, in order to ensure that the context-dependent timing of the movement occurs appropriately.

7.5.1.1.2 Evidence for pre-planning the movement time-course

In addition to evidence for planning movement durations and speed, there is evidence that actors and speakers plan temporal aspects that determine the shape of a movement velocity profile, i.e. its symmetry or skewness. This evidence comes from non-speech movements of similar spatial characteristics in different contexts, as well as from the time-course of fundamental frequency changes in singing. Lee (2009) presents evidence that bowing movements in the playing of a string instrument (i.e. the way the bow moves across the strings) evolve over time in different ways in different contexts. This study showed that the way ‘intensity slides’ in bass playing evolve over time varies systematically for sad vs. happy moods in different parts of The Dance of the Sugar Plum Fairy. In this study, sad moods had longer-duration movements which had later velocity peaks. Evidence more directly relevant to speech comes from the behavior of accomplished singers singing Pergolesi: Lee
(2009) showed that musically unstressed ‘pitch slides’ had earlier velocity peaks than stressed pitch slides, though durations and intensities didn’t differ significantly. In order to produce these systematic differences in the time course of the pitch slides, i.e. in the pattern of change in velocity over time given the same pitch-slide duration, they need to be planned ahead of time. Just as for the ability to produce the same spatial endpoint with movements of different durations described above, these types of systematic differences in velocity patterns are incompatible with a model in which the time-course of movement speeds is fixed, and thus they require a plan.

7.5.1.1.3 Evidence for spatial pre-planning of movement onset

Evidence for planning of context-appropriate characteristics prior to the onset of movement is not limited to timing evidence. As noted earlier, the ‘end-state comfort’ findings reviewed in Rosenbaum et al. (2006) and in Rosenbaum et al. (2012) suggest that actors adopt a starting orientation and/or position that makes it possible to achieve precise movement endpoint positioning in an efficient way. Rosenbaum’s experimental studies show that for tasks requiring precise positioning at a target, initial hand orientation (in rotation tasks) and hand position (in object-moving tasks) change systematically according to the target position that will be adopted at the end of the movement. Analogous effects suggesting planning for spatial aspects of movement onset are found in speech for the amplitude of inspiration (expansion of the rib cage) at the beginning of phrases, which has been reported to vary systematically with phrase length. Sperry and Klich (1992) show that speakers take deeper breaths when required to read longer utterances extracted from the Rainbow passage, presumably to make it easier to maintain a relatively constant subglottal pressure throughout the utterance (Draper, Ladefoged, and Whitteridge 1960; Ladefoged 1963; Slifka 2006; see also Winkworth et al. 1994, Huber 2008, and Rochet-Capellan and Fuchs 2013, inter alia, for additional findings relating to the depth of inspiration and its relationship with phrase length). These findings show not only that the speaker knows quite a bit about quantitative aspects of the sequence of movements in the planned utterance before beginning to execute it, but also that he/she makes use of this knowledge in deciding how to begin the action.

7.5.1.2 How far in advance does (pre-)planning occur?

An early hypothesis about the production of movement sequences was that each successive movement in a sequence might be triggered by the end of the preceding movement (Bernstein 1967). On this view, since each movement
would not start until the previous one had ended, planning for the upcoming movement could be relatively late. As Kozhevnikov and Chistovich (1965) argued, this type of stimulus–response model is inappropriate for speech movements.¹¹ The main piece of speech evidence that argues against this view comes from studies of coarticulation, which show that planning for an upcoming sound must occur before movements associated with a preceding sound have ended, because the upcoming articulation overlaps with the preceding articulation. For example, Boyce, Krakow, and Bell-Berti (1991) showed that lip protrusion for /u/ in kiktluk begins around the onset of closure for the first consonant in the intervocalic consonant cluster, suggesting that the phonetic characteristics of /u/ must have been planned more than three segments in advance.¹² (See Whalen 1990 for supporting evidence.)

¹¹ Lashley (1951) makes a similar point about phonological planning.
¹² As Liberman, Cooper, Shankweiler, and Studdert-Kennedy (1967) have argued, coarticulation (articulatory overlap) makes it possible for speakers to be highly efficient at conveying information quickly.

Other evidence, such as the observations mentioned above relating to deeper breaths at the beginnings of longer utterances, is consistent with the view that at least some phonetic characteristics are planned much earlier, i.e. before the onset of the utterance or phrase. Similar evidence can be found for fundamental frequency (F0). For example, Beckman and Pierrehumbert (1986), Ladd and Johnson (1987), and Asu et al. (2016) showed higher F0 on an initial pitch-accented syllable for longer intonational phrases, which made it possible for speakers to avoid unnaturally low F0 at the end of the phrase after F0 downstep on subsequent pitch accents across the duration of the phrase. However, this evidence does not necessarily imply that speakers plan the phonetic characteristics of an entire phrase before beginning the phrase. It is also possible that they base their planning for the phonetic characteristics of the onset (e.g. deep inspiration, high F0) on knowledge of the symbolic phonological characteristics of the phrase, e.g. number of syllables or subsequent pitch accents, rather than on knowledge of the quantitative phonetic (spectral, spatial, and temporal) characteristics with which these elements will be realized in context. Thus, this evidence does not unequivocally support the view that phonetic (as opposed to phonological) characteristics are fully planned for an entire phrase before it starts.

7.5.1.3 Summary

These studies provide good evidence that actors carry out a substantial amount of planning for both the timing and the spatial aspects of a movement, before
its implementation. Most, if not all, models of speech production, including AP/TD (a two-component model) as well as the type of phonology-extrinsic-timing-based model proposed here (a three-component model), agree that planning is separate from, and precedes, implementation. The next section discusses the Motor-Sensory Implementation Component of the proposed XT/3C model in more detail.

7.5.2 Motor-Sensory Implementation

The evidence presented above supports the view that speakers formulate a plan for how to achieve the task requirements for a speech movement before they begin to produce it, and thus supports a separation between Phonetic Planning and Motor-Sensory Implementation Components. This section provides further motivation for the separation between Phonetic Planning and Motor-Sensory Implementation, in the form of evidence for monitoring and adjustment processes that occur after speaking has begun, i.e. after (at least some) planning has taken place. This evidence shows that speakers actively monitor both articulatory and acoustic aspects of their speech, and flexibly adapt their articulations in order to achieve the planned goals.

7.5.2.1 Evidence for continuous monitoring, and rapid, goal-oriented adaptations to unexpected somatosensory feedback

Early work showed that somatosensory information is used online to ensure accurate target production. For example, when the jaw is unexpectedly loaded during the production of /papapa/ sequences, the upper and lower lips start to compensate within 22–75 ms of perturbation, and bilabial closure is usually successfully produced in spite of the load (Folkins and Abbs 1975; Abbs and Gracco 1984; Abbs, Gracco, and Cole 1984). The timing of the onset of compensatory activity depends on the timing of the load relative to upper lip movement for the bilabial: compensatory activity occurs at a longer delay from load onset for loads that occur before the onset of upper-lip-lowering muscle activity for the bilabial, as compared to compensatory activity for loads that occur after the onset of upper lip muscle activity, suggesting that speakers time their compensatory activity in order to achieve temporally coordinated lip closure. In labiodental productions, e.g. /afa/, where the upper lip is not normally involved, the upper lip does not compensate for jaw-loading (Abbs, Gracco, and Cole 1984; Shaiman and Gracco 2002). This result suggests that
compensation is not reflexive, but rather occurs within synergies of articulators (also called coordinative structures) that are recruited specifically for each type of task (a different set of articulators for e.g. bilabial vs. labiodental constriction). The timing of vocal fold opening also changes in response to perturbations of the jaw in /i’pip/ productions (Munhall, Löfqvist, and Kelso 1994); the delayed glottal opening made it possible for speakers to achieve voicelessness at the onset of (delayed) oral closure.¹³ This suggests that task-oriented, compensatory responses can extend to articulations that are not involved in the affected oral constriction, which implies a tracking and adjustment mechanism beyond the coordinative structures proposed in AP/TD for a single gesture. Saltzman, Löfqvist, and Mitra (2000) propose such a mechanism within the AP/TD framework.

¹³ VOT at the release of /p/ closure was longer in perturbed as compared to normal trials, due to shorter closure duration in perturbed trials.

Taken together, the evidence suggests that speakers continuously monitor articulatory progress in achieving task goals, and are able to use somatosensory feedback ‘on the fly’ to assess the current state with respect to the goal, and to use this information to achieve these goals in spite of the unexpected perturbation. Both the fact that the timing of the onset of compensation varies with the timing of the load relative to expected goal achievement, and the fact that compensation is task-appropriate, argue against a simple reflexive compensation mechanism, and for a more flexible cognitive mechanism that is oriented toward producing coordinated task goals.

7.5.2.2 Evidence for continuous monitoring of speech acoustics and adaptation to unexpected auditory feedback

More recent studies have shown that auditory feedback can also affect the way speech sounds are produced (Houde and Jordan 1998, 2002; Max, Wallace, and Vincent 2003; Purcell and Munhall 2006; Villacorta, Perkell, and Guenther 2007; Cai et al. 2011). In these studies, perturbations in formant frequency and f0 delivered via real-time auditory feedback induce shifts in produced formant frequencies and f0 that occur approximately 120–160 ms after presentation of the perturbed sound. These shifts provide compensation in the direction opposite to the perturbation, although compensation is not complete, perhaps owing to the effect of veridical somatosensory feedback in these conditions (Cai et al. 2011). Mitsuya, MacDonald, and Munhall’s (2014) study of repeated productions of tipper and dipper with auditory feedback indicating that the speaker was producing the word starting with the opposite
voicing category (i.e. producing tipper while hearing dipper, and vice versa), showed a related effect. That is, speakers producing repeated words and hearing such feedback shift their VOT in a direction opposite to the direction of the auditory perturbation, over subsequent repetitions. These findings suggest that speakers monitor acoustic spectral (formant and F0) as well as temporal information (VOT) during online speech production, and adjust productions for accuracy as soon as is possible. As for the Cai et al. (2011) findings, the fact that compensation in these studies is not complete has been interpreted as an appropriate response to the combination of conflicting somatosensory and auditory feedback (Houde and Nagarajan 2011).

7.5.2.3 Models of the Motor-Sensory Component

Most, if not all, models of speech production assume some type of motor-sensory implementation component. AP/TD provides for compensation for the perturbation of individual articulations through the use of gestures, where a gesture is a task-dependent coordinative structure of articulators that together implement a gestural constriction. In the Task Dynamic system, the perturbation of one or more articulator(s) involved in a gestural constriction can be completely and automatically compensated by activity of another articulator involved in that constriction. Perturbations to individual articulations are assumed to be detected immediately, with immediate compensation as a consequence. However, AP/TD currently has no provision for the use of auditory feedback in online speech production. In addition, although AP/TD assumes that speakers continuously monitor individual articulators and modify the relative contributions of individual articulators to the achievement of a gestural constriction, the gestural (constriction) plan for an utterance will always be carried out as long as one or more articulator(s) in the relevant coordinative structure is available to produce each gesture in the plan. Thus, AP/TD currently has no way of modifying the gestural constriction plan, e.g. the plan to form a bilabial constriction, once an utterance has begun (although, as just noted, the contributions of articulators in a coordinative structure can be altered under perturbation conditions). The results described above suggest that such a mechanism is required, so that when hearing dipper instead of the intended tipper, speakers can modify their gestural glottal abduction targets according to the auditory information. Recent proposals that do model the use of auditory feedback (in addition to efference copy and somatosensory feedback) include Houde and Nagarajan (2011), Hickok (2014), and Guenther (2016).
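The kind of error-correcting dynamic these findings imply can be sketched as a toy trial-by-trial feedback loop. This is a generic illustration, not a rendering of Houde and Nagarajan's, Hickok's, or Guenther's models; the weights, learning rate, and values are invented.

```python
# Toy trial-by-trial compensation loop (generic sketch; weights, rate, and
# values invented, not taken from any of the cited models).

def compensate(target, produced, perturbation,
               w_aud=0.4, w_som=0.6, rate=0.5, trials=8):
    """Each trial combines the auditory error (computed from perturbed
    feedback) with the veridical somatosensory error; weighting the two
    yields partial compensation, with a steady-state offset of
    -perturbation * w_aud / (w_aud + w_som) relative to the target."""
    for t in range(trials):
        aud_error = (produced + perturbation) - target   # what is heard
        som_error = produced - target                    # what is felt
        produced -= rate * (w_aud * aud_error + w_som * som_error)
        print(f"trial {t}: produced = {produced:.1f}")
    return produced

# An upward 100 Hz formant perturbation: production drifts downward,
# opposing the perturbation, but settles near 660 Hz rather than 600 Hz,
# i.e. compensation remains incomplete, as observed empirically.
compensate(target=700.0, produced=700.0, perturbation=100.0)
```

The design point is that conflicting veridical somatosensory evidence (the w_som term) is what keeps compensation partial, matching the interpretation of Houde and Nagarajan (2011) described above.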

188  -  7.5.2.4 Summary of evidence for the separation between Phonetic Planning and Motor-Sensory Implementation Earlier sections have presented evidence suggesting that 1) a great deal of planning occurs before speaking begins, and 2) monitoring and adjustment mechanisms occur after speaking begins. The separation between Phonetic Planning and Motor-Sensory Implementation is thus motivated by qualitatively different types of activities that occur in the different components (i.e. planning in Phonetic Planning and monitoring and adjustment in MotorSensory Implementation).

7.6 Key components of the proposed model sketch

Figure 7.3 A schematic diagram of the proposed XT/3C-v1 model: Phonological Planning (qualitative specification of utterance goals; symbolic categories) yields a Phonological Representation; Phonetic Planning (quantitative specification of acoustic cue goals and of the articulatory means to reach them; gradient values) yields a Phonetic Representation; and Motor-Sensory Implementation carries out tracking and adjustment.

Earlier chapters have presented arguments and evidence to motivate the consideration of an alternative to AP/TD as a model of speech production;


this chapter has argued that a particular approach, involving three separate components in a Phonology-Extrinsic-Timing-Based framework, can provide such an alternative. The arguments for this general XT/3C approach to modeling speech production have been presented before turning to the specific proposal in more detail, because the requirements for an adequate model that these arguments suggest will stand, even if the specific instantiation proposed here is not correct or is incomplete. In addition to its Phonology-Extrinsic-Timing-Based Three-Component structure, some of the key elements of the proposed model include a prosodic planning frame; a feature-cue-based translation mechanism; an optimization mechanism for balancing the many competing factors that influence both surface timing and other aspects of surface phonetic form; general-purpose timekeeping mechanisms; a mechanism for planning the time-course of movement to generate appropriate velocity profiles; and a mechanism for tracking and adjusting ongoing movements to reach targets on time. The remaining chapters develop a proposal for an initial version of an XT/3C model in further detail. Each of the key components outlined above is either discussed in a separate chapter, or is described further in the model sketch in Chapter 10. As an introduction to this discussion, a visual summary of the proposed model is shown in Figure 7.3.


8 Optimization

As discussed in Chapter 3, AP/TD accounts for some of the factors that influence timing patterns in speech, but has not dealt comprehensively with the extensive interacting set that continues to be uncovered. Factors that influence timing patterns in speech include segment identity (e.g. low vowels longer than corresponding high vowels, fricatives longer than stop consonants, quantity contrasts in some languages), lexical tone identity (e.g. Mandarin third, low-falling-rising tone longer than fourth, high-falling tone), segmental context (e.g. consonantal closures in tautosyllabic clusters shorter than singleton consonants, vowels before voiced consonants longer than vowels before voiceless consonants in some languages), prosodic context (boundary-related lengthening, prominence-related lengthening, poly-subconstituent shortening, and speech rate), as well as extra-linguistic factors such as affect, sociolinguistic identity, interlocutor, practice, and speech style; see Lehiste (1970), Klatt (1987), Nooteboom (1972), van Santen (1992), Fletcher (2010), and Lefkowitz (2017) for reviews; cf. Figure 8.1, illustrating a number of factors that influence phonetic variables in speech. One of the things that any model of speech production must do, therefore, is to model how speakers might take this multiplicity of factors into account when planning individual utterances.

The previous chapter summarized the motivation for proposing three stages of speech production in a phonology-extrinsic-timing-based approach to speech production: 1) Phonological Planning, 2) Phonetic Planning, and 3) Motor-Sensory Implementation. Chapter 10 will flesh out these three stages in more detail. Meanwhile, this chapter will review a framework for balancing the influence of multiple factors on behavior. (Chapter 9 will discuss general-purpose timekeeping mechanisms for phonology-extrinsic timing.) The framework adopted here for balancing multiple factors that influence the surface form of an utterance was developed primarily in non-speech motor control: Optimal Control Theory, including its more recent development into Stochastic Optimal Feedback Control Theory (SOFCT) (e.g. Meyer, Abrams, Kornblum, and Wright 1988; Hoff and Arbib 1993; Todorov and Jordan 2002). Both of these theories are highly useful for the Phonetic Planning and Motor-Sensory Implementation stages in the model proposed here.


Figure 8.1 Prosodic structure as the interface between language and speech, illustrating some of the factors that influence Phonetic Planning. In the figure, syntax, semantics, pragmatics, the lexicon, and utterance length feed into prosodic structure and into segmental phonology and cue choice; these, together with non-grammatical factors (e.g. rate, clarity requirements, style, movement costs), shape planned phonetics: pronunciation, including surface prosody (timing, etc.). Note: Segmental phonology includes cues to distinctive features in their symbolic form (see more discussion in Chapter 10). Based on similar figures in Shattuck-Hufnagel and Turk (1996, p. 237; Figure 5) and Turk and Shattuck-Hufnagel (2014, Figure 4).

Optimal Control Theory approaches are important because they provide a principled way of determining values of controlled variables. They are particularly interesting for models of speech motor control because they provide a way of modeling the influence of multiple factors on these parameter values. In doing so, they show how it is possible to plan movements to accomplish multiple task goals in an optimal manner, which is the main function of the proposed Phonetic Planning Component. In addition, Stochastic Optimal Feedback Control Theory suggests mechanisms for using sensory feedback, as well as an internal model of the relationship between motor activity and sensory consequences, to flexibly adapt ongoing actions to accomplish task goals in ever-changing contexts; this is one of the main functions of the proposed Motor-Sensory Implementation component. Issues relating to the use of efference copy and sensory feedback in speech production modeling have been explored extensively in Houde and Nagarajan (2011), and will not be reviewed here. Instead, the main focus of this chapter is on the usefulness of Optimal Control Theory approaches for optimizing movements given a set of multiple task requirements and movement costs.

Optimal Control Theory is a branch of mathematics devoted to determining optimal control policies (Bellman 1957; Pontryagin, Boltyanskii, Gamkrelidze,


and Mischchenko 1962). It has been used in motor control models since the 1980s, when research in speech production in particular (Nelson 1983; Nelson, Perkell, and Westbury 1984) showed how useful optimization principles can be in explaining patterns of skilled movement. Nelson (1983) showed that the minimization of movement costs such as jerk (first time derivative of acceleration) and effort/energy could successfully explain smooth, single-peaked velocity profiles, i.e. with a single acceleration component followed by deceleration, and that the optimization of competing costs and task goals could explain relationships among movement peak velocity, distance, and duration for jaw movement in speech. Nelson et al. (1984) went further in showing that the minimization of movement costs could explain speakers’ reduction of jaw amplitude at increasingly fast rates of speech, because the effort and jerk costs for increasingly rapid movements of a given distance grow exponentially (Nelson 1983), and reducing movement distance provides a way of avoiding these high movement costs.

Optimization principles also provide an explanation for widespread observations in the non-speech motor control literature that undershoot is more common than overshoot for rapid movements toward a specified target (Carlton 1979, cited in Elliott, Helsen, and Chua 2001; Harris 1995; see Elliott et al. 2001; Engelbrecht, Berthier, and O’Sullivan 2003; and Elliott, Hansen, Grierson, Lyons, Bennett, and Hayes 2010 for recent reviews). Overshoot is costly in terms of effort/energy and time, because it requires a corrective sub-movement that involves overcoming inertia for a movement direction reversal, as well as longer total movement distance.¹

¹ Overshoot may also incur a reprogramming cost if agonists must become antagonists and vice versa. And a bias against overshoot may be less costly in terms of information processing, since it reduces the uncertainty as to corrective movement direction (Barrett and Glencross 1989, cited in Elliott et al. 2004).

Engelbrecht et al. (2003) and Lyons, Hansen, Harding, and Elliott (2006) present findings suggesting that undershoot is strategically planned to take the likelihood of corrective movements and their costs into account. Engelbrecht et al. (2003) asked participants to use a keypress to move a cursor on a screen to a target, and manipulated endpoint spatial variance by varying the degree of random motor error visible on the screen in three conditions. Over the course of the experiment, participants adjusted the amount of target undershoot according to the degree of random error (more undershoot for more error, resulting in a more optimal pattern toward the end of the experiment). This result suggests that the undershoot bias results from optimal movement strategies developed through practice (see also Worringham 1991; Elliott, Hansen, Mendoza, and


Tremblay 2004). And Lyons et al. (2006), who compared the magnitude of undershoot in upward, downward, forward, and backward target-directed reaching movements required to be as fast and as accurate as possible, found that undershoot magnitudes were greatest for downward movements, where corrective movement reversals would be costly in terms of time and energy among other things, because they would need to overcome the effects of gravity.² See also Oliveira, Elliott, and Goodman (2005) for further evidence that undershoot is a result of an optimization strategy, and can be reversed when costs of movement are experimentally manipulated. (A minimal numerical illustration of this undershoot logic is sketched at the end of this introduction.)

Optimal Control Theory approaches have received considerable interest in the phonology, phonetics, and speech motor control literature, and this interest is growing. Lindblom (1990) proposed that phonetic variability could be explained if speakers balance effort costs against perceptual distinctiveness requirements, and this view inspired several more recent Optimal Control Theory accounts of typological patterns in phonetics, as well as speech behavior. For example, Šimko and Cummins’ (2010) model of speech articulation, inspired by AP/TD, determines values for controlled variables by minimizing a cost function, and Flemming (1997, 2001), Kirchner (1998), Boersma (2009), Katz (2010), Braver (2013), Windmann (2016), and Lefkowitz (2017), among others, have made use of optimization principles to explain patterns of surface acoustic phonetic variability. In addition, Houde and Nagarajan (2011) and Hickok (2014) have advocated approaches to speech motor control based on state estimation and the use of feedback, inspired by Stochastic Optimal Feedback Control Theory approaches developed for non-speech motor control, although these investigators have not yet included optimization in their models.

This chapter has two main goals: First, it provides a summary of Stochastic Optimal Feedback Control Theory (SOFCT), a relatively recent development of Optimal Control Theory that has so far been applied primarily to non-speech motor control. This discussion has a special focus on evidence motivating different types of movement costs used in planning optimal movements (Section 8.3), on components of the theory that explain surface timing patterns (Section 8.3.4), and on the SOFCT account of movement synergies (coordinative structures) and hierarchical motor control (Section 8.4.2). It is based on the following references: Diedrichsen, Shadmehr, and Ivry (2010);

² These downward movements were also more variable at their endpoints; the magnitude of undershoot may also reflect the higher likelihood of movement reversal for these movements because the movements were less precise.


Ganesh, Haruno, Kawato, and Burdet (2010); Guigon (2011); Hoff and Arbib (1993); Hu and Newell (2011); Jordan and Wolpert (1999); Li (2006); Liu and Todorov (2007); Nelson (1983); Mitrovic (2010); Scott (2004); Shadmehr (2009); Shadmehr and Krakauer (2008); Shadmehr and Mussa-Ivaldi (2012); Shadmehr and Wise (2005); Tanaka, Krakauer, and Qian (2006); Todorov (2004); Todorov (2005); Todorov (2007); Todorov (2009); Todorov and Jordan (2002); Todorov, Li, and Pan (2005). This part of the chapter concludes with a discussion of some of the challenges to SOFCT (Section 8.5).

The second part of the chapter (Section 8.6) is devoted to an introduction to optimization approaches as currently used in phonology and phonetics. This discussion includes a review of current optimization accounts of durational patterns in speech.
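As promised above, here is a minimal numerical sketch of the undershoot logic: when corrective reversals after overshoot carry an extra penalty, the aiming point that minimizes expected total cost lies short of the target. The noise level, penalty, and cost structure are hypothetical and are not drawn from any of the studies cited.

```python
import numpy as np

# Expected-cost comparison of aiming points under endpoint noise (illustrative).
# Overshoot requires a direction reversal, which carries an extra penalty
# (inertia, reprogramming), so the optimal aim point falls short of the target.

rng = np.random.default_rng(0)
TARGET, SIGMA = 1.0, 0.1           # target distance and endpoint noise (hypothetical)
REVERSAL_PENALTY = 0.15            # extra cost of reversing direction (hypothetical)

def expected_cost(aim, n=100_000):
    endpoints = aim + SIGMA * rng.standard_normal(n)
    corrective_distance = np.abs(endpoints - TARGET)
    reversal = np.where(endpoints > TARGET, REVERSAL_PENALTY, 0.0)
    return np.mean(corrective_distance + reversal)

aims = np.linspace(0.85, 1.05, 41)
best = aims[np.argmin([expected_cost(a) for a in aims])]
print(f"optimal aim point: {best:.3f} (target = {TARGET})")
# Prints a value below 1.0: slight undershoot minimizes expected total cost.
```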

8.1 General overview

Stochastic Optimal Feedback Control Theory is a sub-theory within Optimal Control Theory that is specifically adapted to motor control. It assumes that speakers continuously monitor the states of their effectors in relation to the task goals, and continuously update their motor commands on the basis of state information, in order to accomplish the goals in a near-optimal way. It is stochastic because it explicitly models motor and sensory noise in the system, and is a feedback model because it uses (delayed) feedback to provide information about the current state relative to the goal(s). Because (near)-optimal goal-directed movements can be generated from the current state, regardless of where/what this might be, SOFCT is flexible enough to change course ‘on the fly’ in response to its current state, and can therefore adapt to perturbations. As will be discussed in more detail in this chapter, (near)-optimal movements are generated via a minimum cost control policy—that is, a policy (which can be a solution to a set of equations) that determines the sequence of optimal movement controls to achieve a target from any current state. The control policy is the result of minimizing a cost function, which defines the task goals, costs of movement, and relative weightings of these costs.

Timing within most SOFCT theories is tracked continuously throughout movement (see evidence in Chapter 4 that time is tracked in the anticipation of upcoming events). As mentioned above, movement duration in many of these theories is a parameter of movement that can be optimized; three types of factors contribute to these optimal duration specifications. First, there are durational consequences of movement costs that are not explicitly temporal, such as endpoint variance or effort. Second, there are durational consequences from minimizing


cost functions that include time as a cost of movement (i.e. longer duration has higher cost). And third, durational properties can sometimes be defined as an explicit goal, as in keeping time to a metronome, or generating a particular type of velocity profile, or creating an e.g. 200 ms movement.

Like its OCT precursors, SOFCT has been successful in accounting for well-documented aspects of movement, including single-peaked, smooth velocity profiles, and approximately straight movement paths. Additionally, it accounts for compensation for perturbations, movement synergies (coordinative structures), the undershoot bias, and observations that task-irrelevant variability tends to be greater than task-relevant variability (Todorov and Jordan 2002). Like all OCT theories, SOFCT provides a way to specify the balance among task requirements and costs (e.g. speed might be more heavily prioritized than a cost such as accuracy, or vice versa). It also provides a way to specify different balances among requirements and costs on different occasions.

Recent SOFC theories provide a general, flexible framework for representing effector state, where state is not limited to articulator position, as it is in some models (e.g. VITE, proposed by Bullock and Grossberg 1988, and DIVA, proposed in Guenther 1995). Instead, it can include current values for other attributes, such as velocity and acceleration. In addition, SOFCT provides a framework for determining optimal movements depending on these states. However, as noted by Shadmehr and Krakauer (2008), developing models within SOFCT is difficult, because of challenges associated with 1) determining how state estimation takes place, 2) determining task requirements, costs, and their relative weightings, 3) specifying task constraints, e.g. the plant dynamics, i.e. dynamics of the musculo-skeletal system, and 4) computing optimal control policies.

8.2 Key features

The key features of SOFCT include state estimation, weighted cost functions, and the control policy. Each of these features is discussed in more detail in Sections 8.2.1, 8.2.2, and 8.2.3.

8.2.1 State Estimation

Within SOFCT, estimation of the current state of the effectors occurs continuously throughout a movement. Depending on the level of representation, state information within SOFCT might be position, velocity, and acceleration


of the effectors (relatively high-level state information), and/or e.g. muscle length, muscle contraction velocity, joint angle, or joint angular velocities (low-level state information), at each point in time.³ State estimation is accomplished by combining two sources of information: 1) predicted information derived from an efference copy of the motor commands, coupled with an internal forward model of the relationship between motor commands and outcomes,⁴ and 2) sensory feedback. Each source of information is weighted according to its reliability. The internal forward model enables state estimation even when sensory feedback is absent or delayed. It is assumed that the forward model can be updated on the basis of sensory feedback. Systematic changes in feedback might occur because of changes in the physical plant, e.g. due to injury or growth during development, or because of artificial manipulations of feedback, e.g. in speech adaptation experiments. The latter include experiments where auditory feedback is manipulated via changes to acoustic properties (e.g. to formant frequencies or fundamental frequency), or where somatosensory feedback is manipulated via applying a mechanical load to one of the speech articulators (Folkins and Abbs 1975; Abbs and Gracco 1984; Abbs, Gracco, and Cole 1984; Munhall, Löfqvist, and Kelso 1994; Houde and Jordan 1998, 2002; Max, Wallace, and Vincent 2003; Purcell and Munhall 2006; Villacorta, Perkell, and Guenther 2007; Cai, Ghosh, Guenther, and Perkell 2011). An accurate internal model is important because, without one, the motor commands that will move an effector to a target most efficiently and effectively from its current state cannot be determined.

Support for the internal forward model comes from evidence showing that 1) predicted information is generated during motor activity, and 2) humans modify their productions on the basis of sensory feedback. To take an example from speech, evidence for (1) comes from reports by Niziolek, Nagarajan, and Houde (2013), who have shown that activity in the auditory cortex is suppressed in response to speakers’ own productions of specific speech sounds, but not in audio-only conditions, where the sounds are heard but are not produced. These Speech Induced Suppression findings suggest that speakers generate predicted auditory consequences of their productions, and use these predictions to inhibit the auditory response to their own specific productions. This generation of predicted information during the motor activity of self-produced speech, coupled with the suppression of the auditory response to

³ Currently most theories do not specify the size of the time step for state estimation. ⁴ Motor commands relate to the rate of change of muscle force or torque, where torque = rotational force around a joint (Haith and Krakauer 2013).


this activity, ensures that the outcome is not noticeably louder than speech produced by others, despite the effects of bone-conduction. Evidence from speech for (2) comes from findings in numerous adaptation experiments (e.g. Houde and Jordan 1998; Purcell and Munhall 2006; Cai, Ghosh, Guenther, and Perkell 2010; Tremblay, Shiller, and Ostry 2003; and Lametti, Nasir, and Ostry 2012) showing that changes to auditory and/or somatosensory feedback can lead to compensatory changes in behavior. These findings are consistent with the view that the internal forward model is modified on the basis of sensory feedback. Within the SOFCT framework, the accuracy of state estimation will depend on two sources of noise: 1) motor noise, which affects the accuracy of the predictions made by the internal forward model, and 2) sensory noise, which affects the accuracy of state estimation from sensory feedback. Both types of noise are incorporated in the Stochastic Optimal Feedback Control models of e.g. Todorov (2005).
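The reliability weighting at the heart of this state estimation scheme can be sketched as a one-dimensional Kalman-style update. The noise variances and articulator values below are hypothetical, and actual SOFCT models (e.g. Todorov 2005) are of course far richer.

```python
# One-dimensional sketch of state estimation: fuse a forward-model prediction
# with sensory feedback, weighting each source by its reliability (1/variance).

def fuse(prediction, var_pred, sensed, var_sense):
    """Kalman-style update: a reliability-weighted average of the two estimates."""
    gain = var_pred / (var_pred + var_sense)   # trust feedback more when prediction is noisy
    estimate = prediction + gain * (sensed - prediction)
    variance = (1.0 - gain) * var_pred         # fused estimate beats either source alone
    return estimate, variance

# Hypothetical example: the efference-copy prediction says the jaw is at 12.0 mm
# (motor-noise variance 4.0); proprioception reports 10.0 mm (variance 1.0).
estimate, variance = fuse(prediction=12.0, var_pred=4.0, sensed=10.0, var_sense=1.0)
print(f"fused estimate: {estimate:.2f} mm, variance: {variance:.2f}")
# fused estimate: 10.40 mm, variance: 0.80; pulled toward the more reliable source.
```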

8.2.2 Weighted cost functions and the Optimal Control Policy

Movement planning consists of formulating and minimizing a cost function, which, when minimized, determines the control policy,⁵ i.e. the mapping from any current state to the optimal controls to achieve the desired (target) state. The cost function relates potential movements to a weighted sum of 1) the scalar costs of not meeting the task requirements (sometimes called the reward, or value term), and 2) the costs of the movements themselves (the regularization term). Movement costs are discussed in more detail in Section 8.3. Some of the costs are incurred at the end of the movement (e.g. the cost of not reaching the target), whereas other costs are incurred at each time step, and accumulate over the duration of the movement (e.g. energy/effort costs). Because of signal-dependent noise, cost in terms of e.g. effort or accuracy is a stochastic variable; that is, the actual costs cannot be predicted with certainty because of the noise. Therefore it is expected costs that are minimized, rather than actual costs. The cost function is constrained by the

⁵ Note that it is likely that some movement cost components do not have to be chosen anew when each movement is planned—for example, the relationship of endpoint variance to the magnitude of neural signals results from physical characteristics of neurons and their connections, and is therefore likely to be relatively constant across movements. However, several components of the cost function will require specification for each new task, e.g. some or all task requirements, the costs of not achieving them, and the relative weighting or importance of these costs vs. the costs of movement.


dynamics of the system as specified by the internal forward model: The optimal control policy must obey the relationship between motor commands and their consequences specified by the internal model. What this means is that different optimal control policies are predicted for different systems; e.g. when an effector is injured, different motor commands will be optimal because the internal model will be different from when the effector is uninjured.

One of the most important claims of all Optimal Control Theories of motor control is that task requirements and movement costs are balanced against each other to generate movement (Nelson 1983). For example, in circumstances where speed is more highly weighted than endpoint accuracy as a goal of movement, fast movements will be produced that are likely to have endpoint errors. On the other hand, when endpoint accuracy is a goal of movement that is more highly weighted than speed, then slow but accurate movements will be produced (Fitts 1954; Harris and Wolpert 1998; Harris and Wolpert 2006; Guigon, Baraduc, and Desmurget 2008). The trade-off among rewards (achievement of task goals) and costs can be implemented within this framework by stipulating that the coefficients of the reward and cost terms must sum to a constant number.

Early instantiations of the theory treated movement goals (or task requirements) as inviolable constraints. However, later work (e.g. Liu and Todorov 2007) using a movement perturbation paradigm showed that the extent to which task goals such as target attainment and target stability (zero velocity at target) are met can vary in a way that is expected if the costs of scalar deviations from a task goal are balanced with other task requirements and movement costs. These later versions allow for the possibility that movement targets are approximated, but not fully met, if costs of movement are high. This feature could be useful in modeling apparent undershoot in speech and non-speech behavior.
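A minimal one-step sketch can make the weighting idea concrete. Suppose a command u produces displacement k·u, and the planner minimizes J(u) = w_task·(target − k·u)² + w_eff·u². All gains and weights here are hypothetical; the closed-form minimizer shows how a nonzero effort weight yields systematic undershoot of the target.

```python
# One-step sketch of a weighted cost function:
# J(u) = w_task * (target - k*u)**2 + w_eff * u**2.
# Setting dJ/du = 0 gives u* = w_task*k*target / (w_task*k**2 + w_eff).

def optimal_command(target, k=1.0, w_task=1.0, w_eff=0.2):
    """Closed-form minimizer of the quadratic cost, and the displacement reached."""
    u = w_task * k * target / (w_task * k**2 + w_eff)
    return u, k * u

for w_eff in (0.0, 0.2, 1.0):
    u, reached = optimal_command(target=10.0, w_eff=w_eff)
    print(f"w_eff={w_eff:.1f}: reached {reached:.2f} of 10.00")
# w_eff=0.0: reached 10.00; w_eff=0.2: reached 8.33; w_eff=1.0: reached 5.00.
# A nonzero effort weight makes the target goal violable: the planned movement
# deliberately stops short, mirroring the undershoot discussion above.
```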

8.2.3 Determining the control policy

Once the cost function is formulated for a given movement, it is assumed that the actor determines the optimal control policy, that is, the mapping from (a sequence of) states to (a sequence of) control actions that will accomplish the task requirements at minimum cost. The theory is not committed to a particular set of computational mechanisms for determining the control policy, but rather only makes the claim that movements are chosen to be optimal under environmental, task, and neuromuscular constraints. Optimal


movement strategies are presumably learned from practice, where relationships between movements, costs, and task goal achievement are inferred from the outcomes of repeated movements (Engelbrecht et al. 2003; Bertsekas and Tsitsiklis 1996; Sutton and Barto 1998; Schultz, Dayan, and Montague 1997). In support of this view, Engelbrecht et al. (2003) showed that optimal control policies are learnable from exploration in a reasonable time period.

However, as Todorov (2005) notes, computationally determining the control policy for any movement goal is mathematically non-trivial. It requires a model of the system dynamics, which constrains the control policy, as well as a cost function. Once these are available, the control policy can be determined numerically, via computerized search over actions, or analytically (by solving equations) in special cases. Approaches that afford analytical solutions are preferred over approaches that involve searches, because the latter are much more time consuming. Todorov (2005) and Todorov (2009) provide recent examples that afford analytical solutions. Computationally determining control policies in Optimal Control Theory is particularly difficult if factors that contribute to uncertainties regarding state estimation, such as the signal-dependent motor noise modeled in SOFCT (e.g. Todorov 2005, 2009), are taken into account, and if a realistic (high-dimensional) musculoskeletal model is used.

Many approaches to the computational problem are based on Bellman’s (1957) Optimality principle. Bellman showed that the optimal control policy can be determined if the optimal actions for the rest of the movement are known. Therefore, in principle, a control policy can be determined for movements of pre-determined duration by starting at the end of the movement and working backward to compute the motor commands for each state at each time step that minimize the total sum of costs over all remaining time steps (the cost-to-go). Bellman’s Optimality principle is embodied in Dynamic Programming (Bertsekas 2001) and Reinforcement Learning (Sutton and Barto 1998) approaches. In Reinforcement Learning approaches, it is assumed that an actor explores results of actions from different states, where (noisy) feedback is provided as to overall system performance.
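Backward induction can be sketched for the simplest tractable case, a scalar linear system with quadratic costs, where the cost-to-go is computed from the final time step backward (a Riccati-style recursion). The dynamics and cost values below are hypothetical; real musculoskeletal models are far higher-dimensional.

```python
# Backward induction (Bellman) for a scalar linear-quadratic problem:
# dynamics x[t+1] = a*x[t] + b*u[t]; cost = sum_t r*u[t]**2 + q*x[T]**2.
# Working backward from the terminal cost yields time-varying feedback gains.

def lqr_gains(a=1.0, b=1.0, r=0.1, q=100.0, T=10):
    """Return gains K[t] (forward time order) such that u[t] = -K[t]*x[t] is optimal."""
    S = q                       # curvature of the cost-to-go at the final step
    gains = []
    for _ in range(T):          # iterate backward over the remaining time steps
        K = (a * b * S) / (r + b * b * S)
        S = r * K * K + S * (a - b * K) ** 2   # cost-to-go one step earlier
        gains.append(K)
    return gains[::-1]          # reorder so the gains run forward in time

# Simulate: drive an initial error of 1.0 toward zero (the target state).
x = 1.0
for k in lqr_gains():
    x = x + (-k * x)            # apply u = -K*x, with a = b = 1 here
print(f"final error: {x:.4f}")  # near zero, with control effort spread over steps
```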

8.3 What are the costs of movement?

All Optimal Control Theory approaches to motor control assume that actors produce actions that accomplish task requirements at minimum cost. These


approaches require the specification and quantification of movement costs, in addition to the costs of not meeting the task requirements. This section reviews many of the movement costs that have been proposed in the speech and non-speech literature. Proposed costs include dynamic costs (which refer to applied forces, to joint torques, or to the motor commands that generate them), kinematic costs (which refer to costs associated with characteristics of movement trajectories), and other costs (e.g. the cost of accuracy and the cost of time). Within this framework, determining the costs of movement is required to explain the surface form of movements that people choose to make. For example, if actors minimize dynamic costs, i.e. those related to forces required to move articulators, then the explanations will rely at least in part on how heavy the articulators are, and on their acceleration, and (therefore) the forces that are required to move them. However, if the costs are kinematic, then the explanations will relate only to the movement trajectories, without direct reference to the forces that have been used to create them.

Determining these costs has been difficult for several reasons. First, several of the proposed costs make similar predictions for movement kinematics (i.e. related to trajectory), and are therefore difficult to distinguish (Nelson 1983; Guigon 2011). For example, cost minimization based on jerk (3rd derivative of position), torque change, effort, energy, or endpoint variance predicts smooth, single-peaked velocity profiles, as well as straight movement paths under normal conditions (Shadmehr and Krakauer 2008). This is because effort, energy, and endpoint variance all relate to force and/or torque production, and, all else being equal, movements with fewer or smaller velocity peaks, with shorter movement distances, and/or with fewer changes in direction (i.e. straighter paths) require less force (the required force per unit mass being proportional to acceleration, assuming negligible friction). Likewise, movements with fewer velocity peaks and/or shorter distances will have smaller integrated jerk values. Second, some findings in the literature appear to conflict: some suggest that kinematic costs are minimized (e.g. Kistemaker, Wong, and Gribble 2014), whereas other findings suggest that dynamic costs are minimized (e.g. Shadmehr and Mussa-Ivaldi 1994). Finally, there is growing recognition that more than one type of cost influences movement, and therefore the appropriate combination of costs must be determined. For example, Harris and Wolpert (2006) propose four types of costs: time, accuracy, stability, and energy, with the caveat that some types of movements may only require a subset of these costs. For example, eye movement saccades may not require energy as a cost, since saccades are resistant to fatigue (Fuchs and


Binder 1983), and also may not require stability costs, since the eye movement system is assumed to be highly overdamped, so that it doesn’t oscillate and has a slow approach to the target.

The following sections discuss various aspects of the costs of movement, including energy/effort (8.3.1), dynamic vs. kinematic costs (8.3.2), spatial accuracy/endpoint variance (8.3.3), time (8.3.4), motor memory/cost of reprogramming (8.3.5), and costs of not achieving the task requirements (8.3.6).

8.3.1 Energy/Effort

Energy or effort costs are very frequently proposed in models of both speech and non-speech motor control. Energy is the capacity to do work, and effort can be defined as the use of energy to perform an action. The terms ‘energy costs’ and ‘effort costs’ are therefore sometimes used synonymously. As mentioned earlier, these provide a potential explanation for smooth, single-peaked velocity profiles and straight movement paths, as well as amplitude-reduction strategies at fast rates of speech (Nelson 1983; Nelson et al. 1984). In addition, these costs may contribute to the undershoot bias, that is, widespread observations that undershoot is more common than overshoot for rapid, target-directed movements (Carlton 1979, cited in Elliott et al. 2001; see Elliott et al. 2001; Engelbrecht et al. 2003; and Elliott et al. 2010 for recent reviews). Because energy or effort costs relate to motor commands, applied muscle forces, and the associated joint torques, rather than to movement trajectories, they are dynamic costs, as opposed to kinematic costs. The distinction between dynamic and kinematic costs is important because it has implications for theories of speech production planning; if costs are dynamic, then the forces required to produce movement (and therefore both the masses that are moved, and their accelerations) must be taken into account in the planning process. In contrast, if costs are solely kinematic, then only the forms of the movement trajectories matter, regardless of the forces required to produce them.

When referring to the cost of performing a physical action, as opposed to a mental action, energy or effort costs can in principle be quantified in various ways. For example, they can be measured in terms of the magnitude of the control signal integrated over time (sum of the (squared) motor commands or action potentials of a motor group over time) (e.g. Todorov and Jordan 2002); in terms of quantity of oxygen or ATP required to activate all of the muscles involved in an action (Moon and Lindblom 2003); in terms of the


integral of applied force per unit mass with respect to time, or the integral of the square of applied force per unit mass with respect to time (Nelson 1983); or in terms of the integral of the squared derivative of torque with respect to time (Uno, Kawato, and Suzuki 1989). However, these costs are notoriously difficult to measure objectively. In a frictionless system, the integral of applied force per unit mass with respect to time can be measured as peak velocity,⁶ but systems are not always frictionless; e.g. force can be applied when pushing an effector against a hard surface, with no resulting motion (i.e. zero velocity); see Kirchner (1998) for more discussion. And it is difficult to measure quantities such as oxygen (or ATP) consumption for actions whose energy requirements are small, particularly at the time scales required for speech production (Moon and Lindblom 2003). And even for actions whose energy requirements are larger, there are disagreements about the way force production relates to energy consumption (e.g. Hatze and Buys 1977; Szentesi, Zaremba, van Mechelen, and Stienen 2001). Newer findings suggest that different cost functions may be required for different muscle groups (Praagman, Chadwick, van der Helm, and Veeger 2006), and additional findings suggest different energy requirements for the onset of muscle contraction (higher requirement) vs. sustained contraction (lower energy requirement) (Hogan, Ingham, and Kurdak 1998).

Despite all of these complications, a quadratic cost that is proportional to the sum of the squared motor commands has been shown to be successful, and superior to a cost proportional to the sum of the unsquared motor commands, in explaining the smooth, bell-shaped velocity profiles of movement (Nelson 1983), as well as actors’ choices of least-effortful arm reaches among choices of different distances and durations, in conditions which required participants to move against different magnitudes of resistive forces (Morel, Ulbrich, and Gail 2017). The success of the cost function based on the sum of squared motor commands has led Haith and Krakauer (2013, p. 12) to conclude that “the quadratic motor command penalty should be viewed as a more abstract notion of ‘effort’, which appears to be successful in describing certain aspects of behavior, despite having no clear theoretical foundation.” In sum, many researchers think that a dynamic cost of energy or effort is relevant for explaining patterns of movement, and the quadratic definition of this cost appears to have advantages.

⁶ Peak velocity in a frictionless system equals the integral of force per unit mass over time (Nelson 1983).
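The claim above that smooth, single-peaked profiles minimize such costs can be checked numerically. The sketch below compares the integrated squared jerk of a single minimum-jerk movement with that of two concatenated half-distance, half-duration movements (a two-peaked velocity profile). The minimum-jerk polynomial is the standard one; the comparison itself is illustrative.

```python
import numpy as np

# Compare integrated squared jerk for (a) one smooth minimum-jerk movement and
# (b) two concatenated half-distance, half-duration movements (two velocity peaks).

def min_jerk_position(D, T, t):
    """Standard minimum-jerk trajectory from 0 to distance D in time T."""
    tau = t / T
    return D * (10 * tau**3 - 15 * tau**4 + 6 * tau**5)

def squared_jerk_cost(x, dt):
    jerk = np.gradient(np.gradient(np.gradient(x, dt), dt), dt)
    return np.sum(jerk**2) * dt

dt = 1e-4
t = np.arange(0.0, 1.0, dt)
one_move = min_jerk_position(1.0, 1.0, t)

t_half = np.arange(0.0, 0.5, dt)
first_half = min_jerk_position(0.5, 0.5, t_half)
two_moves = np.concatenate([first_half, 0.5 + first_half])  # same distance, same total time

print(f"one smooth movement: {squared_jerk_cost(one_move, dt):8.1f}")
print(f"two sub-movements:   {squared_jerk_cost(two_moves, dt):8.1f}")
# The segmented movement incurs roughly 16 times the integrated squared jerk,
# so cost minimization favors the single smooth, single-peaked profile.
```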


8.3.2 Dynamic vs. kinematic costs

In spite of the importance of the distinction between dynamic and kinematic cost theories of motor planning, determining whether costs are dynamic or kinematic has proven difficult. One of the reasons why it is difficult to unambiguously distinguish dynamic from kinematic costs is that quadratic dynamic costs (described above) make similar predictions about movement trajectories to those made by a kinematic cost, namely the time integral of the square of the magnitude of jerk (the first time derivative of acceleration). That is, both types of costs predict trajectories with straight movement paths, and single-peaked, symmetric velocity profiles that scale for different durations and amplitudes (Nelson 1983; Flash and Hogan 1985; Uno et al. 1989),⁷ and they disfavor movement direction reversals that may contribute to the undershoot bias. How can these two types of costs be distinguished?

One potential way to determine whether costs are kinematic or dynamic is to see what happens when movements are made in the presence of a perturbing force field. The prediction of models with kinematic costs is that movements should be unaffected by force fields, since kinematic costs relate to movement trajectories, rather than applied forces or motor commands. In contrast, the prediction of models with dynamic costs is that movements will be different in the presence of force fields, since costs in terms of energy/effort or motor commands will be higher, and as a consequence, optimal movement strategies may be different. Several studies have tested the effects of applying a force field in a direction perpendicular to the normal movement path (e.g. Uno et al. 1989; Shadmehr and Mussa-Ivaldi 1994; Kistemaker, Wong, and Gribble 2014). However, evidence from these studies has been mixed. Kistemaker et al. (2014) found that the paths of forward-backward planar reaching movements were not influenced by a force field that made it “energetically beneficial to move to the right at the onset of movement and to move forward and left in a later stage” (Kistemaker, Wong, and Gribble 2010, p. 2987), even though the force field was strong enough to cause fatigue, and thus might have been

⁷ Flash and Hogan (1985) also showed that minimization of jerk can account for the velocity profiles of curved movements from one point to another that pass through a via point. In this kind of curved movement, a tangential velocity minimum (but not a velocity zero) typically occurs at the point of maximum curvature, and, as for straight-path trajectories, the velocity profiles scale for different movement times and amplitudes.


expected to influence the movement paths if movement costs were dynamic. The Kistemaker et al. (2014) results suggest a preference for approximately straight movement paths and therefore support the use of kinematic costs.

Other studies have produced different types of findings. For example, in Shadmehr and Mussa-Ivaldi’s (1994) study, where reaching movements were perturbed by a velocity-dependent force field in a direction perpendicular to the direction of movement, results showed that movements at the beginning of the experiment were perturbed in the direction of the perturbing force, but after a period of adaptation, participants overcompensated early in the movement in a direction opposite to the perturbing force in such a way as to minimize the amount of effort or torque change that would be required throughout the course of the whole movement (Shadmehr and Krakauer 2008). After further practice, the paths became straight. The fact that the paths were curved after a period of adaptation supports the view that movements were planned to minimize dynamic costs. However, the fact that the paths became straight after further practice suggests that a straight path was one of the goals of movement that participants eventually chose to achieve in spite of the higher dynamic costs, a result that supports kinematic goals or costs. This study therefore suggests that both types of costs (i.e. both kinematic and dynamic costs) may be relevant.

Models that include dynamic costs, such as a cost relating to the sum of the squared motor commands (or the integral of squared torque change), have the advantage of providing an explanation for findings that force production is most often shared among multiple muscles (Hoffman and Strick 1999), even when the pulling direction of some of the muscles is not parallel to the direction of movement. Section 8.4.2.1 (Synergies 1) discusses the fact that the cost of the sum of squared motor commands is minimized when work is distributed over multiple effectors. The advantage of synergies/coordinative structures goes unexplained by a kinematic cost such as jerk, thereby adding further support to the view that both types of costs may be at play in movement control.
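The force-sharing point can be verified with a tiny computation: for two effectors that must jointly supply a force F, the sum of squared commands is minimized by an even split, whereas a linear (unsquared) cost is indifferent to how the force is shared. Force units here are arbitrary.

```python
# Two effectors jointly supply a total force F = u1 + u2. Compare cost functions:
# a quadratic cost u1**2 + u2**2 rewards sharing; a linear cost |u1| + |u2| does not.

F = 10.0
for share in (0.0, 0.25, 0.5):          # fraction of the force assigned to effector 1
    u1, u2 = share * F, (1 - share) * F
    quadratic = u1**2 + u2**2
    linear = abs(u1) + abs(u2)
    print(f"share {share:.2f}: quadratic = {quadratic:6.2f}, linear = {linear:.2f}")
# share 0.00: quadratic = 100.00, linear = 10.00
# share 0.25: quadratic =  62.50, linear = 10.00
# share 0.50: quadratic =  50.00, linear = 10.00
# The quadratic command cost is halved by even sharing, which is why it predicts
# synergies; the linear cost cannot distinguish the distributions.
```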

8.3.3 Spatial accuracy/endpoint variance

Variance in endpoint position is a cost proposed by Schmidt, Zelaznik, Hawkins, Frank, and Quinn (1979), Schmidt and Lee (2005), Harris and Wolpert (1998, 2006), and Burdet and Milner (1998); see also Tanaka et al. (2006). Endpoint spatial variance is thought to be the consequence of noise


associated with motor commands. For movements produced with less than approximately 65% of maximum force, noise increases with control signal magnitude (Sherwood, Schmidt, and Walter 1988). For movements that involve more than 65% of maximum force, the noise appears to level off. For any given movement, the noise accumulates throughout its duration. Any movement that has a large degree of motor noise by the end of the movement is likely to have a greater degree of endpoint variance; movements with the least endpoint variance are likely to be those that have smaller (and therefore less noisy) control signals. As shown by Harris and Wolpert (1998), smooth, single-peaked velocity profiles are optimal, according to the minimum endpoint variance view, because profiles with abrupt velocity changes would require large, noisy control signals. Similarly, movements of higher curvature require higher control signals to maintain the same speed, as compared to movements of lower curvature; preferred movements are therefore slower in regions of higher curvature.

Harris and Wolpert (1998) have shown that the endpoint variance cost is as effective as the minimum jerk and minimum effort or torque change proposals in accounting for the smooth, single-peaked nature of velocity profiles, as well as for the relationship between speed and curvature (the 2/3rds power law, Lacquaniti, Terzuolo, and Viviani 1983). Moreover, it has the advantage of explaining additional phenomena, including 1) the speed–accuracy trade-off known as Fitts’ law (discussed in Chapter 3), 2) earlier peak velocities for movements with higher amplitude and longer duration, and 3) end-state (dis-)comfort effects observed by Rosenbaum and colleagues (Rosenbaum, Marchak, Barnes, Vaughan, and Slotta 1990; Rosenbaum, Cohen, Meulenbroek, and Vaughan 1996). As Harris and Wolpert (1998) note, endpoint spatial variance as a cost within Optimal Feedback Control is relatively easy to estimate (cost = endpoint variance as measured over repeated movements, or consequences of inaccuracy, such as time spent making corrective movements; Meyer, Abrams, Kornblum, and Wright 1988; Harris 1995). This is in contrast with e.g. minimum jerk, minimum effort, or minimum torque change cost computations, which require the integration of values over an entire movement trajectory and thus are more difficult to estimate.

8.3.3.1 Minimum endpoint spatial variance and Fitts’ law

As discussed in Chapter 3, Fitts’ law (Fitts 1954) relates movement time to distance and target width, with T = A + B·log₂(2D/W), where T = movement time, A and B are constants, D is distance, and W is target width. This law states that, ceteris paribus, movements of longer distance and/or more spatially accurate


movements take more time than their shorter-distance or less accurate counterparts; movements of shorter duration are less accurate than movements of longer duration over the same distance. The minimum endpoint variance model accounts for this speed–accuracy trade-off because faster movements require larger control signals, with more motor noise, and therefore result in endpoint spatial inaccuracy as compared to slower movements. Schmidt et al. (1979) provide an explicit proposal for the relationship between endpoint variance and speed: they note that the standard deviation of position at movement endpoints is proportional to average velocity plus a constant (see also Meyer et al. 1988; Schmidt and Lee 2005). As Harris and Wolpert (1998) discuss, movements of greater distance also require larger control signals than their shorter-distance counterparts. If the longer-distance movements are produced in the same amount of time as movements of shorter distance, they will have a greater average velocity, and will therefore have more accumulated noise at the end of the movement, and will be less accurate. To ensure consistent spatial accuracy, additional time would be required for longer-distance movements. Endpoint spatial accuracy therefore provides an explanation for the longer durations observed for longer-distance movements, on the view that movements are made in the shortest durations that achieve the required endpoint spatial accuracy.

8.3.3.2 Earlier velocity peaks for longer-duration movements

Although fast movements tend to have symmetric, bell-shaped velocity peaks, asymmetric velocity profiles (with peaks relatively earlier in the movement) are often observed for longer-duration movements (Engelbrecht and Fernandez 1997; Moore and Marteniuk 1986; Nagasaki 1989; Wiegner and Wierzbicka 1992; Collewijn, Erkelens, and Steinman 1988; Zelaznik, Schmidt, and Gielen 1986; Johnson 1991; see Perkell, Zandipour, Matthies, and Lane 2002 for similar results in a speech task). Harris and Wolpert’s (1998) minimum endpoint spatial-variance proposal provides a potential explanation. On this view, signal-dependent noise proportional to the size of the motor commands accumulates throughout the duration of the movement. However, endpoint spatial variance (at least for saccades, cf. Haith and Krakauer (2013) for a discussion of saccades vs. arm movements) is proposed to depend more on the signal-dependent noise associated with motor commands that arrive later in absolute time. This view predicts fast acceleration to peak velocity, with more time at the end of the movement to home in on the target. Thus, the prediction of this model is that the velocity profile should be skewed, with a relatively early velocity peak,


particularly for longer-duration movements where time is less restricted, and for movements that have large control signals, e.g. for longer-distance movements. That way, spatial variance at the end of the movement can be minimized by having a smaller control signal later in the movement.

Endpoint variance costs are difficult to distinguish from energy or effort costs involving the sum of the squared motor commands (control signal), because endpoint variance is related to energy/effort; that is, endpoint variance grows with a weighted sum of the (squared) motor commands. However, the velocity profiles predicted by minimum effort models are symmetric and bell-shaped, whereas, as discussed above, minimum variance models predict asymmetric velocity profiles (Harris and Wolpert 1998; Shadmehr and Mussa-Ivaldi 2012), with the asymmetry more pronounced for longer-duration movements. In sum, while energy/effort costs can account for symmetrical velocity profiles, the circumstances under which early-peaked velocity profiles are observed suggest the importance of including endpoint variance as a cost of movement.

8.3.3.3 End-state comfort

Rosenbaum and colleagues (Rosenbaum et al. 1990; Rosenbaum et al. 1996) observed that participants in grasp-and-move-item tasks prefer to optimize effector comfort at the end position, rather than comfort at the beginning of movement (cf. discussion in Chapter 7). These observations relate to effort/energy and/or endpoint accuracy, because end-state comfort has the benefit of allowing greater ease and precision in positioning the object in relation to the target (Rosenbaum, van Heugten, and Caldwell 1996), as compared with end-state positions that are less comfortable. The link between end-state comfort and endpoint accuracy (precision), as opposed to effort/energy, is supported by results presented by Rosenbaum et al. (1996), showing that when precision requirements of tasks were reduced, the end-state comfort effect was also reduced. Initial grasp configuration is also influenced by other factors, such as visibility of the object to be moved, which also contribute to positioning accuracy. This provides additional support for the view that grasp configuration is influenced by accuracy requirements at the target.

8.3.3.4 Since minimizing endpoint variance is key, is the effort cost superfluous?

The asymmetric velocity profile shape evidence might seem to suggest that minimizing endpoint variance is key, and thus that energy or effort as a cost


(Section 8.3.1) might be superfluous. However, Nelson’s (1983) and Nelson et al.’s (1984) finding of smaller movement distances when participants are asked to move quickly (in situations where the target is not fixed) is difficult to explain without an energy/effort or jerk cost. See also Mazzoni, Hristova, and Krakauer (2007) for evidence that endpoint accuracy is not sufficient to explain the preponderance of slower, shorter-distance arm movements produced by Parkinson’s patients; an additional cost, estimated in their study as absolute average acceleration, appears to be additionally required.

O’Sullivan, Burdet, and Diedrichsen (2009) present additional evidence that motivates effort in particular (instead of jerk or energy) as a cost that is independent from endpoint variance, because it explains patterns of force-sharing across effectors. They examined how force production is shared between two fingers (all possible two-way combinations of the index and little fingers of the left and right hands) in an isometric force production task in which participants produced a target force, as indicated on a screen. Predictably, participants applied more force with the index than with the little finger, and more force with the right than with the left hand. The minimum endpoint variance model correctly predicted this qualitative pattern of force sharing, but failed to predict the quantitative pattern, which was much more even across fingers than the minimum endpoint spatial variance model would predict. The model that did the best job of predicting observed force sharing across fingers included costs for both effort and endpoint variability, where effort was measured as the sum of the squared motor commands, not normalized by effector strength. In this best-fit model, the relative importance of unnormalized effort to variability was 7:1, demonstrating that effort costs can be more substantial than endpoint variance costs. In summary, the evidence presented here suggests that costs of both effort and endpoint spatial variance are required to explain movement behavior.

8.3.3.5 Time

Some temporal aspects of movement are predicted by costs other than time. For example, as discussed above, the shape of velocity profiles (i.e. the time-course of movement) is predicted by the endpoint variance cost, as is the fact that more time is required to maintain the accuracy of longer-distance movements (Fitts’ law). However, movement durations predicted by endpoint variance for these longer-distance movements are much longer than those that are observed (Harris and Wolpert 2006; Shadmehr, Orban de Xivry, Xu-Wilson, and Shih 2010; Shadmehr and Mussa-Ivaldi 2012). A number of researchers have therefore proposed a cost for time, which ensures that


movement times are those that are the minimum compatible with accuracy requirements; proponents include Harris and Wolpert (2006), Tanaka et al. (2006), Shadmehr et al. (2010), and Shadmehr and Mussa-Ivaldi (2012). Harris and Wolpert (2006) suggest that the time cost can account for the widely observed high correlation between movement distance and peak velocity for movements of many different effector types. That is, in spite of the fact that movements of longer distance are produced with longer durations when they obey a similar endpoint accuracy criterion, movements of longer distance are nevertheless produced with higher velocities. As a result, movements of longer distance are very similar in duration to movements of shorter distance (although not identical, cf. Fitts’ law). An explicit cost for time also helps to prevent the occurrence of corrective sub-movements. In their stochastic optimized-sub-movement model, Meyer et al. (1988) proposed that targets can be reached via an initial sub-movement followed by one or more optional corrective sub-movements. In this model, the number of sub-movements and their parameters are optimized in order to minimize the total movement time, while achieving target accuracy. On their view, producing (many) corrective sub-movements is suboptimal because it increases total movement time.

There are various proposed explanations for why time should be a cost, discussed in Shadmehr et al. (2010) and Shadmehr and Mussa-Ivaldi (2012). One possibility is that it is a cost because longer movements have more temporal variation (Newell 1980; Hancock and Newell 1985; Drew, Zupan, Cooke, Couvillon, and Balsam 2005; Gallistel, King, and McDonald 2004), explained by the view that the mechanism that meters out time is variable, and hence more variability is expected for longer intervals (cf. discussion in Chapter 4). However, this explanation does not account for durations observed in tasks where temporal accuracy is not an issue. Harris and Wolpert (2006) propose that minimizing saccade movement time is advantageous because it minimizes the amount of time spent with poor vision. However, this view does not predict the minimized movement durations that are found for other types of movements.

Shadmehr et al. (2010) and Shadmehr and Mussa-Ivaldi (2012) offer a more general explanation in terms of time-to-reward. For example, Jimura, Myerson, Hilgard, Braver, and Green (2009) found that thirsty experimental participants preferred to receive a small amount of water sooner, rather than a larger amount later. On the view that time-to-reward is a cost, moving fast is desirable because the reward is obtained quickly; moving slowly is sub-optimal because it delays, and therefore discounts, the reward of being in the next desirable state. In support of this view, Takikawa, Kawagoe, Itoh, Nakahara, and Hikosaka (2002) found that monkeys produced faster saccades to


remembered visual locations when these were rewarded, as compared to saccades made to remembered locations that were unrewarded. Results obtained for humans are also consistent with this idea, on the assumption that new visual stimuli are more rewarding than repeated stimuli; see Montagnini and Chelazzi (2005); Chen-Harris, Joiner, Ethier, Zee, and Shadmehr (2008); Golla, Tziridis, Haarmeier, Catz, Barash, and Thier (2008). In these experiments, longer-duration, lower-velocity saccades were found for repeated presentations of the same visual stimulus, and faster speeds at the presentation of a new stimulus.

There is additional evidence for the temporal-discounting-of-reward view, on which movement durations depend on the value of the upcoming state, that is, with shorter-duration movements for higher-value goal states. This supporting evidence comes from studies showing that saccades to faces are faster than saccades to other visual stimuli (Xu-Wilson, Zee, and Shadmehr 2009), on the assumption that faces are more valued than other types of visual stimuli. Other evidence consistent with this view comes from Parkinson’s patients with micrographia (who produce small letters at slow speed), thought to have impaired neural reward systems. Mazzoni, Hristova, and Krakauer (2007) propose that Parkinson’s patients have a devaluation of the state changes caused by the movements (i.e. devaluation of reward), and a relatively higher effort/energy cost, which leads to slower, smaller movements.

Movement durations are also predicted to depend on the rate at which the value of a state change is discounted over time. Shadmehr et al. (2010) evaluated the predictions of quadratic, linear, and hyperbolic discount functions in optimal feedback control models of human saccade movements of different amplitudes. A hyperbolic discount function (i.e. with steep discounting relatively early, and a more gradual fall-off in discounting over time) predicts that as movement amplitudes increase, durations should become more different; for a quadratic function, movement durations should become more similar, and for a linear cost, the durations should increase linearly. Evidence in other domains supported hyperbolic functions: Jimura et al. (2009) showed that a hyperbolic discount function predicted the temporal discount of liquid for their thirsty participants (see also Kobayashi and Schultz 2008; Louie and Glimcher 2010; Green, Myerson, Holt, Slevin, and Estle 2004; Myerson and Green 1995; cited in Shadmehr and Mussa-Ivaldi 2012). Consistent with the studies in other domains, Shadmehr et al. (2010) showed that a hyperbolic discount function for time predicts the temporal patterns of different amplitude saccades in humans. They showed that a hyperbolic discount function performed better than a linear discount function, and better than a quadratic
They showed that a hyperbolic discount function performed better than a linear discount function, and better than a quadratic discount function. In their model, the discount rate also depends on the value of the stimulus (reward value), so that higher rewards encourage faster movements than lower rewards, with steeper reward discount functions of time for the higher-value reward. That is, in their model, movement duration depends on the derivative of the cost of time \(J_p\), defined as

\[ \frac{dJ_p}{dp} = \frac{\alpha\beta}{(1+\beta p)^2}, \]

where \(\alpha\) is the reward value, \(\beta\) represents the temporal discounting rate, and \(p\) represents movement duration. As reward value increases, so does the derivative of the cost of time: with higher reward values, the cost of time increases more rapidly as durations increase. In Shadmehr et al.'s model, the reward value serves as the coefficient of the time term in a cost function containing accuracy, effort, and time costs. What this means is that as the reward value of a change in state decreases, the cost of time is weighted less heavily than the other cost terms, i.e. accuracy and effort.

Shadmehr et al.'s (2010) model was used to predict saccade durations for children and Parkinson's patients. As discussed above, Parkinson's patients are expected to have a reduced reward value for reaching their movement target as compared with normal controls. Consistent with this view, reducing the value of the reward parameter \(\alpha\) in the equation above provided a good fit to the movement durations of people with Parkinson's compared with those of typical participants. Likewise, children are typically more impulsive than adults, and predictably, their movement durations were best modeled with reward discount functions that discount more rapidly over time as compared with adults (in this case, via the parameter \(\beta\)).
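To make the roles of these parameters concrete, the following minimal sketch (ours, with invented parameter values, not Shadmehr et al.'s implementation) evaluates the discounted reward and the derivative of the cost of time for a low- and a high-value reward, illustrating that higher reward values make the cost of time grow faster and so favor shorter movement durations.

    # Illustrative sketch of a hyperbolically discounted cost of time.
    # alpha: reward value; beta: temporal discounting rate; p: movement duration (s).
    # All numerical values are invented for illustration.

    def discounted_reward(alpha, beta, p):
        """Hyperbolically discounted value of a reward alpha obtained after duration p."""
        return alpha / (1.0 + beta * p)

    def cost_of_time_derivative(alpha, beta, p):
        """dJp/dp = alpha * beta / (1 + beta * p)**2."""
        return alpha * beta / (1.0 + beta * p) ** 2

    beta = 1.0
    for alpha in (1.0, 4.0):            # low- vs. high-value reward
        for p in (0.1, 0.3, 0.5):       # candidate movement durations
            print(f"alpha={alpha}, p={p}: reward={discounted_reward(alpha, beta, p):.3f}, "
                  f"dJ/dp={cost_of_time_derivative(alpha, beta, p):.3f}")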

8.3.4 Timing control within SOFCT and other OCT theories

In SOFCT and other OCT theories that include time as a parameter of movement, cost-function minimization leads to the temporal specification for a movement, i.e. the movement duration that (in concert with other movement parameters) best satisfies the task requirements and movement costs. This movement duration results from several aspects of the cost function, including the specification of time as a task requirement (if appropriate), the cost of time, and the cost of temporal inaccuracy, as well as the temporal consequences of other movement costs, e.g. endpoint variance, effort, or jerk, where higher endpoint-variance, effort, or jerk costs lead to lower-velocity, longer-duration movements.
Optimal movement duration is determined in some OCT models by performing cost minimization for each potential movement duration: the optimum movement duration will be the duration for which the overall cost is minimal. Thus, the optimal duration of movement is determined even when movement duration is not an explicit goal. However, if a movement is required to be produced in a certain amount of time (as in periodic, rhythmic tasks), time would be an explicit task requirement, and temporal inaccuracy would be a cost. Hudson, Maloney, and Landy (2008) propose that when time is a task requirement, other constraints that are less important to the task should be downweighted.
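The duration-selection logic just described can be sketched as follows; the cost terms (a speed-dependent effort proxy and a linear cost of time) and all weights are invented placeholders, intended only to show how an optimal duration can emerge even though duration is not itself a goal.

    # Minimal sketch: choose a movement duration by minimizing an overall cost.
    # Cost terms and weights are invented placeholders, not a published model.

    def total_cost(duration, distance=0.1, w_effort=1.0, w_time=2.0):
        peak_velocity = distance / duration      # crude kinematic proxy
        effort = peak_velocity ** 2              # effort grows with speed
        return w_effort * effort + w_time * duration

    candidates = [d / 100 for d in range(10, 101)]       # 0.10 s .. 1.00 s
    best = min(candidates, key=total_cost)
    print(f"optimal duration: {best:.2f} s (cost {total_cost(best):.4f})")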

8.3.5 Motor memory/the cost of reprogramming

In Ganesh et al. (2010), participants often repeated a suboptimal but task-satisfying solution to a perturbed wrist-movement task even after experience with the optimal solution. This result is consistent with the view that there is a cost associated with reprogramming the task. Others also present evidence consistent with a cost associated with reprogramming (Van der Wel, Fleckenstein, Jax, and Rosenbaum 2007; Jax and Rosenbaum 2007, 2009), as discussed in Chapter 3, where the magnitude of the change in a program parameter value influences the cost. In van der Wel et al.'s (2007) task, participants moved the end of a vertical dowel from one target position to another, where target positions were indicated by dots arranged in a semi-circular arc on a desk in front of the participant. They moved in time to a metronome, from left to right and right to left. The experimental manipulation of interest was that on any given trial an obstacle was positioned between two of the target-position dots, where the location and the height of the obstacle were varied from trial to trial. Results showed that movements from target position to target position immediately after clearing an obstacle were higher than expected, even though no obstacle was present between target-position dots. The height of the post-obstacle avoidance movement correlated with the height of the actual-obstacle avoidance movement, and gradually decreased to baseline over successive moves. A second, bi-manual experiment enabled them to reject an account based on the time it takes for muscles to relax: they asked participants to avoid the obstacle with one hand, and to continue moving between subsequent targets with the other hand, and found results similar to those of the unimanual task.
In both experiments, anticipatory effects were also observed, whereby movements after an obstacle were higher when another obstacle was upcoming. This result suggests that biomechanical costs were not minimized, since minimizing energy would have led to lower rather than higher jump heights. On their view, biomechanical costs are balanced against the costs of altering a previous plan. What participants seemed to care about was minimizing changes between successive jump heights. On this view, plans for movements are based on plans for previous (and even upcoming) movements, but are only slightly altered, consistent with a cost that correlates with the change in a parameter value for an abstract spatiotemporal plan for movement, where the parameter could be trajectory shape, point of maximal excursion, velocity profile, etc. See Chapter 3 for a discussion of how these types of cost may apply to 'spill-over' effects in speech (Turk and White 1999; Chen 2006; Dimitrova and Turk 2012).
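The idea that a plan is reused and only slightly altered can be caricatured as below; the function names, the quadratic form of the reprogramming cost, and all weights are our inventions. Each trial's jump height minimizes a biomechanical cost plus a penalty on changing the previous plan, so post-obstacle heights decay only gradually toward baseline, qualitatively as in van der Wel et al.'s data.

    # Sketch: biomechanical cost balanced against a cost of altering the previous plan.
    # Both cost terms and their weights are invented for illustration.

    def plan_next_height(previous_height, min_clearance=0.05,
                         w_biomech=1.0, w_change=30.0):
        candidates = [h / 100 for h in range(int(min_clearance * 100), 31)]
        def cost(h):
            biomech = h                              # higher jumps cost more energy
            reprogram = (h - previous_height) ** 2   # cost of changing the plan
            return w_biomech * biomech + w_change * reprogram
        return min(candidates, key=cost)

    height = 0.25            # height used to clear an actual obstacle
    for trial in range(5):   # no obstacle on subsequent moves
        height = plan_next_height(height)
        print(f"trial {trial + 1}: planned height {height:.2f} m")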

8.3.6 The costs of not achieving the task requirements

The costs discussed above are thought to apply to any movement, regardless of the specific goals or task requirements for the movements. This section discusses how the costs of not meeting task requirements constrain movement choices. In early Optimal Control Theory models, task requirements were treated as inviolable constraints, and movement specifications were found that met the task requirements at minimum cost. However, later findings show that task requirements are not always met completely; actors can estimate the costs of potential deviations and may choose to e.g. undershoot targets when reaching them would be too costly. These findings suggest that the costs of not meeting the task requirements must be added to the cost function, and balanced against each other and the costs of movement. For example, Liu and Todorov (2007) showed systematic target undershoot, and longer movement times, in a reaching task in which participants were asked to move a cursor to a target on a screen within a prescribed time interval. In this experiment, the target was perturbed toward the end of the movement. Participants still had enough time to reach the perturbed target within the prescribed interval even after perturbation, but they often undershot it, and produced longer-duration movements than in unperturbed trials. This suggests that movement time could be adapted on the fly as circumstances changed.
Liu and Todorov (2007) showed that the degree of target undershoot could be modeled as a result of the interplay between the costs of exceeding the prescribed time limit, the costs of target undershoot, the costs of not stopping completely at the end of movement, and the costs of movement, such as energy and endpoint variance.

Hudson et al. (2008) provide further evidence that actors are able to take the scalar costs of not meeting task requirements into account in planning movements. In their task, participants were rewarded for reaches that fell within a specified time window ($0.12), and were penalized for being too early or too late ($0.36), or for failing to touch the spatial target ($0.60). Results suggested that participants minimized the overall (spatial and temporal) costs of inaccuracy, where temporal inaccuracy was assumed to be due to two types of temporal uncertainty or noise: 1) a standard deviation that increases with interval duration, and 2) uniform temporal uncertainty or noise that might grow with e.g. fatigue, and over a longer time scale, with e.g. age, injury, or disease.

Both of these examples illustrate that the costs of deviating from both temporal and spatial task requirements can be taken into account in planning movements. The quantification of these costs suggests that the extent to which they are taken into account might vary according to task relevance. For example, in a task where precise timing is more important than spatial accuracy, the cost of endpoint variance (and thus endpoint spatial inaccuracy) would have a smaller weighting than temporal inaccuracy. Similarly, in a task where spatial accuracy is more important than timing accuracy, the weighting of the timing-inaccuracy cost would be smaller.
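Hudson et al.'s expected-gain logic can be sketched as follows; the payoff values are patterned loosely after their design, and the timing-noise model (a Gaussian whose standard deviation grows with planned duration) follows their first noise type, but all numbers here are illustrative.

    # Sketch: choose a planned movement time that maximizes expected gain when
    # timing noise grows with duration. Payoffs and noise model are illustrative.
    import math

    def expected_gain(planned_t, window=(0.55, 0.65), reward=0.12,
                      penalty=-0.36, noise_coeff=0.05):
        sd = noise_coeff * planned_t                 # noise grows with duration
        def cdf(x):                                  # normal cumulative distribution
            return 0.5 * (1 + math.erf((x - planned_t) / (sd * math.sqrt(2))))
        p_in_window = cdf(window[1]) - cdf(window[0])
        return p_in_window * reward + (1 - p_in_window) * penalty

    candidates = [t / 1000 for t in range(500, 701, 5)]
    best = max(candidates, key=expected_gain)
    print(f"best planned time: {best:.3f} s, expected gain ${expected_gain(best):+.3f}")

Because the noise grows with the planned duration, the best planned time in this sketch sits slightly before the center of the rewarded window.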

8.4 Predictions of Stochastic Optimal Feedback Control Theory

As discussed above, the minimization of cost functions that include costs such as those reviewed in the preceding sections successfully predicts known characteristics of movement, e.g. smooth, single-peaked velocity profiles, approximately straight movement paths, velocity-profile asymmetry with longer durations, Fitts' law, the near isochrony of movements, slower movements where more curvature is required, and the undershoot bias. This section reviews additional phenomena predicted by the SOFCT model. The first of these is structured variability, that is, the observation that variability is structured according to task requirements, and is lower for task-relevant aspects of movements than for task-irrelevant aspects (Todorov and Jordan 2002). The second has to do with the prediction of SOFCT that task-related activity should be shared across multiple effectors, i.e. that the use of coordinative structures, or motor synergies, is an optimal strategy.

8.4.1 Model prediction: Structured variability

A prediction of the SOFCT model is that variability in task-relevant dimensions is low, and variability in task-irrelevant dimensions is higher. This is because adjusting motor commands, when not required by the task goals, would not help in achieving the task goals, and would also incur a cost: correcting task-irrelevant errors would be suboptimal in terms of energy use or effort, and these corrections might themselves need to be corrected, leading to a cascade of costly and non-critical corrections. Operating optimally involves respecting a 'minimal intervention principle' of correcting deviations from an average trajectory when they interfere with task goals, but not if they are task-irrelevant (Scholz and Schöner 1999; Domkin, Laczko, Jaric, Johansson, and Latash 2002; Yang and Scholz 2005; Todorov and Jordan 2002; Todorov 2004). What this means is that the model would adapt to perturbations, or to the effects of signal-dependent noise, but would do so only if the perturbation or noise interferes with accomplishing the task goal(s). Section 8.4.1.1 lists papers that provide supporting evidence for this idea; Section 8.4.1.2 discusses an implication of these findings for the nature of motor control models.

8.4.1.1 Supporting evidence from spatial and temporal variability

It is often the case that parts of a movement that are less relevant to the task goals are more variable than parts of the movement that relate more closely to the goals. This has been shown by spatial and temporal measurements from repeated movements in a variety of tasks (Todorov et al. 2005; Todorov and Jordan 2002; Liu and Todorov 2007 for spatial variability; and Bootsma and van Wieringen 1990; Billon, Semjen, and Stelmach 1996; Spencer and Zelaznik 2003; Zelaznik and Rosenbaum 2010; Katsumata and Russell 2012; Perkell and Matthies 1992; Leonard and Cummins 2011 for temporal variability, discussed in Chapters 4 and 7).

8.4.1.2 Findings of structured variability challenge 'desired trajectory' models

These findings not only conform to the minimal intervention principle, but also provide evidence about the nature of movement planning. Early Optimal Control Theory models were sometimes called 'desired trajectory' models, because they assumed that movement planning consisted of formulating a 'desired trajectory' goal. However, desired-trajectory models predict that spatial and temporal variability should be uniform throughout an entire movement.
Findings of structured variability challenge these models, because they suggest that movement goals do not always correspond to entire movement trajectories, and instead often correspond most closely to particular parts of movement, e.g. the endpoint in reaching or tapping tasks.⁸

⁸ However, other types of tasks, e.g. circle drawing on paper, may show patterns of uniform variability that are more consistent with an entire-trajectory goal (see e.g. Zelaznik and Rosenbaum 2010, who measured temporal variability in a circle-drawing task and found comparable temporal variability at opposite sides of the circles produced by their participants).
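The minimal intervention principle can be illustrated with a toy two-effector system (our construction) whose task constrains only the sum of two positions: feedback corrections are applied to the task-relevant sum, while the task-irrelevant difference is left alone and accumulates variability.

    # Toy illustration of the minimal intervention principle for a task x1 + x2 = T.
    # Corrections act only on the task-relevant deviation (the sum); the
    # task-irrelevant difference (x1 - x2) is left uncorrected and drifts.
    import random

    random.seed(1)
    T = 10.0
    x1, x2 = 5.0, 5.0
    for step in range(20):
        x1 += random.gauss(0, 0.2)      # noise perturbs both effectors
        x2 += random.gauss(0, 0.2)
        error = (x1 + x2) - T           # task-relevant deviation only
        x1 -= error / 2                 # correction shared across effectors
        x2 -= error / 2
    print(f"sum (task-relevant): {x1 + x2:.3f}")           # held at T
    print(f"difference (task-irrelevant): {x1 - x2:.3f}")  # free to drift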

8.4.2 Model prediction: Movement synergy

Another prediction of the SOFCT model is movement synergy. Movement synergy refers to the task-dependent shared activity of multiple effectors in accomplishing a single goal (Latash 2008), and is widespread in speech, where multiple effectors (coordinative structures, Kelso, Tuller, and Harris 1983) work together to achieve a single constriction goal. For example, the jaw, upper lip, and lower lip all work together to form a bilabial constriction, and evidence from bite-block experiments (e.g. Folkins and Abbs 1975) shows that when the movement of one or more of these articulators is restricted, the other articulator(s) in the coordinative structure compensate(s) to achieve the constriction goal. Because the activity of synergists within a coordinative structure is interdependent, they require hierarchical control. That is, the implementation of movement goals involves not the separate control of individual articulators, but rather the conjoined control of sets of articulators, in a task-specific way. The following sections describe how movement synergies find an explanation within SOFCT approaches, which can account for three aspects of synergistic control: the preference for multiple effectors over a single effector in accomplishing any single task (8.4.2.1); the interdependent control of redundant effectors (8.4.2.2); and the hierarchical control implied by this interdependency (8.4.2.3).

8.4.2.1 Synergies 1: The use of multiple effectors to accomplish a task is preferred over the use of a single effector

O'Sullivan et al. (2009) point out that, under the assumption that the costs of movement relate to the sum of the squared motor commands, it is more efficient to share task accomplishment evenly among multiple effectors than to accomplish a task using a single effector. Note that the squared motor commands relate both to effort and to endpoint variability.
The squared motor commands relate to endpoint variability because the variance of produced force varies proportionally with the square of the mean. To see why the cost of squared motor commands is minimized when work is distributed evenly over multiple effectors, imagine that an action requires a motor command magnitude of 5. If performed by a single effector, the squared motor command would be 25; if performed by two effectors, where the first had a command magnitude of 3, and the other a magnitude of 2, the sum of the squared motor commands would be 13, because 3² + 2² = 13. If the motor commands are distributed evenly across effectors, the sum of squared motor commands goes down to 12.5 (2.5² + 2.5² = 12.5). This view also predicts that when corrections occur as a result of feedback, the corrections should be distributed across effectors.
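The arithmetic above, and the advantage of even sharing, can be checked directly with a few lines of code (ours):

    # Check: distributing a total motor command evenly across effectors minimizes
    # the sum of squared commands (a proxy for both effort and endpoint noise).
    splits = {
        "one effector":            [5.0],
        "uneven split":            [3.0, 2.0],
        "even split, 2 effectors": [2.5, 2.5],
        "even split, 5 effectors": [1.0, 1.0, 1.0, 1.0, 1.0],
    }
    for name, commands in splits.items():
        cost = sum(u ** 2 for u in commands)
        print(f"{name}: sum of squared commands = {cost}")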
8.4.2.2 Synergies 2: The interdependent control of redundant effectors

Not only is the use of multiple effectors preferable; there is also evidence that interdependent control of redundant effectors is preferred over independent control (Todorov and Jordan 2002). This is because the manifold of acceptable states that accomplish a given task is larger when there are multiple, redundant effectors which share the task than when the task is accomplished by a group of effectors which each must reach a single, specified target value. For a controller whose task it is to move the system state toward the nearest acceptable goal state, it will be less costly in terms of control energy if there is a larger manifold of acceptable goal states for the system than if there is only a single goal state. Todorov and Jordan give the example of a simple system x₁ + x₂ whose goal is to equal a particular value T. If the two components x₁ and x₂ are controlled independently, they each must equal T/2. However, if x₁ and x₂ are controlled synergistically (interdependently), the goal can be achieved for a variety of values of x₁ and x₂ (e.g. if T = 10, x₁ can be 1 and x₂ can be 9, or x₁ can be 2 and x₂ can be 8, etc.). Therefore there is a larger manifold of acceptable goal states when x₁ and x₂ are controlled interdependently, and the precise contribution of each effector can be chosen to minimize movement costs and/or to conform with constraints on the system.

8.4.2.3 Synergies 3: Modeling hierarchical control

The evidence for motor synergies suggests a hierarchical control architecture in which a higher control level, which is related most closely to specified task requirements or goals, controls a lower level, e.g. of interdependent effectors (which in turn controls an even lower level of muscle activations). Todorov et al. (2005) present a framework for hierarchical control within SOFCT, and illustrate the framework with two levels of control: a higher level, which controls e.g. forces, velocities, and positions in Cartesian space,⁹ and a lower level, which controls e.g. muscle activations. They propose that the higher-level control policy minimizes costs that are approximations to the lower-level costs. In this two-level control framework, the high-level control optimization problem is easier to solve than one based on a (single) lower level, because the dynamics of the higher level is defined in fewer dimensions than a single lower level would require. Once movement has begun, low-level (e.g. muscular) state estimates are mapped onto higher-level state estimates of e.g. position, velocity, etc., and appropriate higher-level control signals are mapped onto lower-level control signals. That is, a low-level (e.g. muscular) state estimate x is passed to the low-level controller, which converts the state estimate into a more abstract state representation y(x), which is sent to the higher-level controller. The higher-level controller sends a control signal v(y) back down to the lower-level controller, which then converts it into motor commands u(v,x). The motor commands generated at the lower level depend on the higher-level control signal, as well as on the lower-level state estimate x. Motor commands at the lower level are thus constrained to be compatible with the higher-level commands. Details can be found in Todorov et al. (2005), and in Li (2006); see also Diedrichsen et al. (2010) for an example from a two-handed cursor-movement task. Note that actions which involve multiple effectors (and multiple muscles per effector), as is often the case for motor synergies, may require more than two levels of control.

In sum, Todorov and colleagues have demonstrated the optimality of hierarchical and synergistic control. This motivates including this type of approach in a model of speech production.
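The information flow of this two-level scheme can be sketched structurally as below; the function bodies are stand-in toys (ours), and only the mapping x → y(x) → v(y) → u(v, x) reflects the framework described above.

    # Structural sketch of two-level hierarchical control (after Todorov et al. 2005).
    # Numerical details are placeholders; the point is the information flow:
    # low-level state x -> abstract state y(x) -> high-level signal v(y) ->
    # low-level motor commands u(v, x).

    def y(x):
        """Map a low-level (e.g. muscle) state to an abstract state (e.g. position)."""
        return sum(x) / len(x)                  # toy abstraction: mean activation

    def v(y_state, goal=1.0, gain=0.5):
        """High-level control signal: reduce the abstract-state error to the goal."""
        return gain * (goal - y_state)

    def u(v_signal, x):
        """Convert the high-level signal into motor commands, given low-level state."""
        return [v_signal / len(x) for _ in x]   # toy: share the correction evenly

    x = [0.2, 0.4, 0.6]                         # low-level state estimate
    for step in range(5):
        x = [xi + ui for xi, ui in zip(x, u(v(y(x)), x))]
        print(f"step {step + 1}: abstract state y = {y(x):.3f}")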

8.5 Challenges for Optimal Control Theory approaches

The previous sections have presented arguments suggesting that the OCT approach is appropriate for modeling processes of movement planning, which involve balancing many different task requirements. Given the appropriateness of this general approach, it is also useful to consider some of its potential drawbacks.

⁹ An important question has to do with the appropriate reference frame for state estimation for any particular task, i.e. external vs. internal reference frames. Hore and Watts (2005) studied the timing of the release of a ball during actions of throwing at different speeds, to distinguish between these two types of reference frames. Their results rule out control of the timing of ball release based on the position of the elbow with respect to the body (an internal reference frame). Instead, their findings are consistent with either 1) the view that ball release is controlled via goals and state estimates that are represented in external, spatial coordinates, i.e. in this case, the angular position of the hand in external space, or 2) the view that ball release is timed to occur at a particular proportion of the total movement time. See Chapter 5 for more discussion of issues related to movement coordination.

This section lists known criticisms of Optimal Control Theory approaches, many of which focus on details of implementation, or on data which the theory has not yet addressed, as opposed to larger issues of whether the approach is appropriate.

1. The role of random factors. There are some choices relating to movement which may not reflect optimization, but may simply be the result of happenstances of history. For example, as discussed in Shadmehr and Mussa-Ivaldi (2012), marine mammals swim by moving horizontal tail flukes up and down, but fishes swim by moving vertical tail flukes from side to side; this difference may simply be due to chance, rather than to differences in what is optimal for the two species (Gould 1995).

2. Bimanual interactions even when each hand performs an independent task. These interactions are found in both the temporal (Kelso, Southard, and Goodman 1979; Marteniuk, Mackenzie, and Baba 1984, discussed in Chapter 5) and the spatial domains. As discussed in Diedrichsen et al. (2010), interactions between movements in bimanual tasks are not predicted by SOFCT if the two different tasks for the two hands are separate and independent. SOFCT currently has no account of this type of interaction.

3. Need for a neural mechanism. The theory rests on the ability to specify movement costs, and the ability to minimize cost functions to compute a control policy to achieve the task goals from any current state. However, it is as yet unclear how movement costs are neurally represented, and which neural mechanisms are involved in computing a control policy that specifies minimum-cost movements from any current state (Diedrichsen et al. 2010; Guigon 2011). In addition, weighting costs that are specified in different types of units is non-trivial (Todorov 2004; Rosenbaum, Meulenbroek, Vaughan, and Jansen 2001). It has been suggested (Rosenbaum, Vaughan, Meulenbroek, Jax, and Cohen 2008; Ganesh et al. 2010; Hu and Newell 2011) that an alternative to optimization may be satisficing (Simon 1955), which involves finding movements that have an acceptably low cost (but not necessarily the minimum cost). The advantage of this approach over optimization is that computation is less expensive than it is for optimization (see the sketch after this list).

4. The cost of planning. Another cost that typically isn't considered in these models is the planning cost, i.e. the cost of specifying the cost function and minimizing it to formulate the control policy.

5. Logical challenges due to unknowns. Diedrichsen et al. (2010) also point out that it is possible to explain any behavior as optimal if there are no restrictions on the cost function. Shadmehr and Krakauer (2008) point out that it is difficult to determine whether a movement might be suboptimal because of an inaccurate internal model, an inaccurate cost function, or both. Guigon et al. (2008) also note that different weights of costs will predict different results, and it is unclear how the weights are to be chosen. Finally, optimizing means finding the movement that minimizes the combined costs. However, finding the minimum-cost movement can be difficult, and requires time-consuming computations in modeling (see point 3 above).

Despite these criticisms, the OCT approach is proposed in the current XT/3C sketch because of its explanatory power, and because of its ability to model the systematic influence of multiple factors on movement behavior.
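To illustrate the satisficing alternative mentioned in point 3, the following invented sketch accepts the first candidate duration whose (toy) cost falls below a pre-set threshold, instead of exhaustively searching for the global minimum:

    # Sketch contrasting optimization with satisficing (Simon 1955): satisficing
    # stops at the first acceptably cheap candidate rather than the global minimum.
    def cost(duration):                  # toy cost: effort proxy + linear time cost
        return (0.1 / duration) ** 2 + 2.0 * duration

    candidates = [d / 100 for d in range(10, 101)]

    optimal = min(candidates, key=cost)              # exhaustive search
    threshold = 0.75                                 # 'acceptably low', set in advance
    satisficed = next(d for d in candidates if cost(d) <= threshold)

    print(f"optimal:    {optimal:.2f} s (cost {cost(optimal):.3f})")
    print(f"satisficed: {satisficed:.2f} s (cost {cost(satisficed):.3f})")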

8.6 Optimization principles in theories of phonology and phonetics

As noted at the beginning of the chapter, the relevance of optimization for theories of speech production was shown early on by the work of Nelson and colleagues (Nelson 1983; Nelson, Perkell, and Westbury 1984), which strongly suggested that speakers minimize effort and/or jerk when speaking. Nelson (1983) showed that the relationship between effort or jerk and movement duration for moving a given distance is non-linear, with effort/jerk costs much higher at very short durations than at longer durations. For example, whereas decreasing movement duration from 300 to 250 ms requires only a slight increase in peak velocity, and therefore only a slight increase in the effort or jerk cost, a decrease in movement duration from 150 to 100 ms more than doubles this cost. Nelson showed that an optimal strategy for avoiding prohibitively costly peak velocities at short movement durations would be to shorten movement distance by undershooting the assumed target, and this is in fact the strategy that some (but not all) speakers adopt at the fastest speaking rates (Kuehn and Moll 1976; Sonoda and Nakakido 1986; Lindblom 1963; Nelson et al. 1984). Nelson also showed that jaw-movement kinematics, measured in natural unconstrained jaw movements for two speakers from the x-ray microbeam database, exhibited relationships between movement time, distance, and peak velocity that conform to those predicted if effort or jerk is being optimized under constraints of time. That is, as Lindblom (1990) put it, speakers appear to conform to principles of "economy of effort."
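The shape of this trade-off can be illustrated with a crude proxy (ours) in which peak velocity scales with distance/duration and the effort cost scales with the square of peak velocity; the numbers below are illustrative, not Nelson's:

    # Crude illustration of Nelson's (1983) non-linearity: effort-like costs rise
    # steeply as movement duration shrinks. Scaling assumptions are ours.
    distance = 0.01                       # movement distance in m (e.g. a jaw movement)

    def effort_proxy(duration_s):
        peak_velocity = distance / duration_s
        return peak_velocity ** 2         # effort grows with the square of speed

    for ms in (300, 250, 150, 100):
        print(f"{ms} ms: effort proxy = {effort_proxy(ms / 1000):.4f}")

In this toy version, shortening from 150 to 100 ms raises the proxy by a factor of 2.25, while shortening from 300 to 250 ms costs comparatively little in absolute terms.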
Lindblom (1990) proposed that considerations of economy of effort (effort costs), balanced against requirements of perceptual distinctiveness, can explain a number of characteristics of human speech and language, including patterns of phoneme inventories in different languages (Liljencrants and Lindblom 1972), duration-dependent undershoot (Moon and Lindblom 1989), and patterns of VCV coarticulation. He argued that the interaction between what he called output constraints (e.g. perceptual distinctiveness) and system constraints (e.g. movement costs such as effort) results in forms that are more or less hyperarticulated: "When output constraints dominate . . . expect to see hyper forms, but with system constraints dominating, hypo speech will be observed" (Lindblom 1990, p. 418). His claim is that this variation in weighting occurs because what is sufficiently contrastive for lexical access will vary according to contextual predictability, where more predictable elements (e.g. segments, syllables, words) require less "explicit signal information for successful recognition than less predictable elements."

In addition to its contributions to modeling speech motor control, the general idea that optimization influences sound patterns in speech has been very influential in the field of phonology. Researchers interested in explaining cross-linguistic typological patterns, e.g. differences in phoneme inventories, cross-linguistic differences in syllable structure, stress patterns, phonotactics, patterns of assimilation, etc., have proposed that these patterns reflect the optimal choice of forms in each language (Liljencrants and Lindblom 1972; McCarthy and Prince 1993, 1994; Prince and Smolensky 1993 et seq.). Optimality Theory (OT; McCarthy and Prince 1993; Prince and Smolensky 1993 et seq.) proposed that the relationship between underlying phonological representations and surface form is governed by a set of universal, violable, sometimes-conflicting constraints. Differences in constraint rankings for different languages yield different optimal surface outputs, and can thus explain cross-linguistic differences in surface forms. The constraints widely used in OT proposals include faithfulness constraints, which penalize surface forms that differ from underlying representations, as well as markedness constraints, which penalize forms that are marked according to universal typology.

The earliest versions of Optimality Theory in phonology differed from Optimal Control Theory proposals for motor control in several key respects.
First, they did not explicitly invoke constraints relating to movement costs, although markedness constraints no doubt relate to these, on the assumption that universal patterns are likely to be those that are least costly in terms of e.g. effort or the kinematic jerk cost (Windmann 2016). Second, constraints in early versions of OT were categorical, unlike the scalar costs in Optimal Control Theory that are used to quantify different degrees of e.g. spatial deviation from a target, effort, or jerk. Third, the constraints in early Optimality Theory obeyed a principle of strict dominance. Under the principle of strict dominance, a surface form will always be preferred if it conforms to a higher-ranking constraint than an alternative possible form; the number and type of violations of lower-ranked constraints is irrelevant. In contrast, in cost functions where costs (constraints) are weighted and summed to yield the overall cost for a given form, all costs that are included in the cost function are relevant to determining the optimal form.¹⁰ And finally, the typological patterns explained by early versions of Optimality Theory were assumed to be the symbolic outputs of a phonological component of grammar, and were assumed to reflect categorical choices among competing forms. In contrast, the output of optimization in Optimal Control Theory approaches to motor control is non-symbolic: quantitative values of movement parameters.

Later researchers used optimization principles to explain typological differences in more fine-grained phonetic patterns, showing how different weightings of what Lindblom (1990) called system and output constraints, i.e. of movement costs vs. contrastiveness, can explain a variety of typological patterns that can be gradient and therefore difficult to describe in categorical terms. These include:

—the typology of place assimilation of nasal consonants (Boersma 1998),
—cross-linguistic patterns of lenition (Kirchner 1998),
—different degrees of /u/ fronting in the environment of coronal consonants in different languages (Flemming 2001),
—the typology of tonal coarticulation and assimilation (Flemming 2011),
—patterns of reduction in unstressed vowels (Flemming and Johnson 2007; Boersma 1998),
—cross-linguistic patterns of articulatory overlap in consonant clusters (Katz 2010), and
—cross-linguistic patterns of (complete vs. incomplete) neutralization (Flemming 2001; Braver 2013).

¹⁰ See Flemming (1997) and Pater (2009) for more discussion, as well as Boersma (1998) for a method of interleaving categorical sub-constraints that yields outcomes that are similar to those from minimized cost functions.
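The contrast between strict dominance and weighted summation can be made concrete with an invented example: candidate A violates the top-ranked constraint once, while candidate B violates two lower-ranked constraints several times, and the two evaluation schemes pick different winners.

    # Sketch: strict-dominance (classical OT) vs. weighted-sum (cost-function)
    # evaluation of two candidate forms. Constraints and violations are invented.
    # Constraints are listed highest-ranked first; weights fall with rank.
    violations = {                 # candidate -> violations per constraint
        "A": [1, 0, 0],            # one violation of the top constraint
        "B": [0, 4, 5],            # many violations of lower-ranked constraints
    }

    def ot_winner(cands):
        """Strict dominance: lexicographic comparison of violation vectors."""
        return min(cands, key=lambda c: cands[c])

    def weighted_winner(cands, weights=(3.0, 2.0, 1.0)):
        """Weighted constraints: sum the weighted violations, pick the cheapest."""
        return min(cands, key=lambda c: sum(w * v for w, v in zip(weights, cands[c])))

    print("strict-dominance winner:", ot_winner(violations))    # B wins
    print("weighted-sum winner:", weighted_winner(violations))  # A wins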

Boersma (1998) showed how gradient typological differences could be modeled using categorical rather than scalar constraints, as long as those constraints are sub-divided into sub-constraints that penalize different magnitudes of deviation from targets (differences in perceptual distinctiveness) and different magnitudes of jaw opening (a proxy for effort). However, Flemming (1997, 2001), Kirchner (1998), Zsiga (2000), and Braver (2013) all propose scalar cost functions, which include weighted (rather than ranked) constraints. Flemming (2001) shows how both categorical and gradient effects can emerge from a model with weighted constraints/costs, depending on the relative weights that are assigned to each.

The general idea of the interplay between what Lindblom calls "output constraints" (i.e. the task requirements of perceptual identification) and "system constraints" (i.e. the costs of movement) is well supported, and is implemented in most optimization accounts of phonetic patterns. However, it is sometimes difficult to unambiguously identify and classify the types of costs and task requirements that are involved. For example, although Kirchner (1998) argues that effort conservation plays an important role in explaining cross-linguistic patterns of consonantal lenition, Kingston (2008) and Kaplan (2010) suggest that the same facts can be better accounted for by perceptual requirements of signaling position in prosodic structure. That is, whereas Kirchner (1998) attributes surface patterns to system constraints (costs of movement), Kingston (2008) and Kaplan (2010) attribute the same surface patterns to output constraints (task requirements for the benefit of the listener).

Researchers adopting optimization approaches to explain cross-linguistic patterns differ in the number of components of grammar and/or the number of cognitive processing mechanisms¹¹ that they propose (cf. Chapter 7). For example, Boersma (1998, 2009), Zsiga (2000), Barnes (2006), and Iosad (2012) all propose separate phonological and phonetic components of grammar, whereas Flemming (2001) proposes a unified phonetics–phonology component. Flemming's (2001) proposal of a single phonetics–phonology component is based on findings that several categorical and gradient phenomena, e.g. categorical neutralization of vowels next to coronal consonants, as well as gradient patterns of coarticulation between vowels and adjacent consonants, can be explained using cost functions that include the same movement costs, with categorical patterns arising from the interaction of movement costs with costs relating to perceptual distinctiveness.

¹¹ On some views, the grammatical components and processing components are one and the same, but on other views, grammatical components can be part of stored knowledge, whereas processing components model how the speaker uses that knowledge.

However, other researchers (e.g. Barnes 2006; Iosad 2012) argue that it may not be the case that all phonetically motivated patterns are derived from a single phonology–phonetics component of grammar. They suggest instead that some phonetically motivated patterns are the result of historical processes of phonologization within particular languages, and are not the result of synchronic phonetic constraints, whereas others do result from synchronic phonetic pressures, e.g. reduction to schwa in Russian, encouraged by fast rates of speech.

8.6.1 Optimization models of variation within single utterances

While the focus of most optimization modeling in phonology and phonetics has been on providing explanations for cross-linguistic differences in sound patterning, there has been increasing use of optimization principles in modeling variation due to context within individual utterances. Approaches of this type that have addressed systematic durational variability within utterances (Šimko 2009; Šimko and Cummins 2010; Windmann 2016; Katz 2010; Lefkowitz 2017) are discussed in more detail in the following sections.

8.6.1.1 Optimization models of timing patterns in the production of an utterance

It is clear from decades of speech research that surface timing patterns in speech are systematically influenced by a large set of interacting factors (Lehiste 1970; Klatt 1976; Nooteboom 1972; Fujimura 1987, 2000; van Santen 1992; van Santen and Shih 2000; White 2002; Fletcher 2010; Fujimura and Williams 2015, among many others). Optimization approaches for modeling this type of systematic variation are attractive for two main reasons: 1) they have been used successfully in various domains to model the combined effects of multiple, conflicting factors on surface form, and 2) as Windmann (2016) notes, to the extent that costs and constraints are principled (e.g. related to Lindblom's economy of effort and perceptual distinctiveness), they can be seen as explanatory, rather than descriptive, models of surface variation.¹² This section reviews the way optimization principles have been used in recent models of timing patterns in speech.

¹² Windmann (2016) provides a useful review of existing models of systematic durational variation in speech.

Effects modeled in this way have been chosen by investigators to showcase the ability of optimization models to balance task requirements and costs, and the models have generally been successful in generating timing patterns that conform to observed effects. To date, most optimization models of surface durations generate durations of acoustically defined constriction and non-constriction intervals (sometimes called consonantal and vocalic intervals), and largely ignore the articulatory movements which generate them. However, there are exceptions, most notably Šimko and Cummins' (2010, 2011) articulatory model, discussed in Section 8.6.1.2. In addition, some models which do not explicitly model articulation nevertheless include costs which can be seen as stand-ins for articulatory costs (e.g. Flemming 2001; Windmann 2016). Windmann's model is of note in this regard because it operationalizes the cost of movement (effort) in terms of the effects that changes in effort are assumed to have on acoustic durations.

Optimization models of surface durations have accounted for a wide variety of timing phenomena. These include:

—articulatory overlap (Šimko 2009; Šimko and Cummins 2011; Katz 2010),
—durational differences associated with consonantal quantity (Šimko, O'Dell, and Vainio 2014),
—durational effects of prosodic prominence (Windmann 2016),
—durational effects of prosodic boundaries (Beňuš and Šimko 2014),
—rate-of-speech effects (Windmann 2016),
—segment incompressibility, or the minimum durations noted by Klatt (1976) (Windmann 2016),
—several types of what Katz (2010, 2012) calls "compression effects" (i.e. the shortening of subconstituents when more occur in a higher-level constituent):
  —closed-syllable vowel shortening in languages with phonemic quantity contrasts (Flemming 2001),
  —polysyllabic shortening within an accented word (White and Turk 2010; Windmann 2016),
  —shortening due to periodic pressure (e.g. metronome speech) (Katz 2010), as well as
—the effects of the interaction of a variety of factors on vowel duration in monosyllabic words, including vowel type, the voiced vs. voiceless status of a following consonant, the number of segments in the coda, phrasal accent, and phrase-finality (Lefkowitz 2017).

Existing proposals differ in several respects, including: whether the models assume timing targets, and which costs are evaluated (Section 8.6.1.2); the way costs are defined and operationalized (Section 8.6.1.3); the proposed source(s) of timing variability (Section 8.6.1.4); the number of model components, and the related issue of phonology-intrinsic vs. phonology-extrinsic timing (Section 8.6.1.5); as well as how the models integrate 'performance factors' such as rate of speech (Section 8.6.1.6). Each of these is discussed further in the following sections. Despite these differences, all of the approaches show how reasonable surface timing patterns can result from the optimal resolution of multiple, interacting, and conflicting demands, and thus support the plausibility of using the general optimization framework in modeling speech production and its timing.

8.6.1.2 Timing targets vs. emergent durational parameter values

Several models which account for surface timing patterns assume abstract, idealized timing targets (usually in ms) for different sizes of units (e.g. segmental intervals, syllable-sized intervals); weighted costs are assigned to the deviations of surface durations from the target values. These models contrast with models in which there are no duration targets per se; instead, durational parameter values emerge from a cost function which includes the cost of time, as well as other costs that affect timing parameters, e.g. effort and parsing (perceptual recoverability) costs.

8.6.1.2.1 Models with timing targets

Flemming (2001) and Katz (2010) both model polysegmental compression effects using cost functions in which weighted costs are assigned to the deviation of acoustically defined surface durations of syllables and segments from abstract, idealized target syllable- and segment-interval durations. See also Braver (2013) for a similar account of incomplete durational neutralization of Japanese mono- and bimoraic vowels. Lefkowitz (2017) models the durational behavior of vocalic intervals in a larger array of contexts, and proposes a single target for all vocalic intervals (regardless of context), as well as additional, more specific targets for segmental contexts (high, low, and tense vowels; nasal-onset, closed, voiceless-coda, and complex-coda contexts) and for phrasal contexts (accented and phrase-final). In all of these timing-target models, cost-function minimization results in durational specifications that reflect a compromise between conflicting target-interval-duration specifications for different types of units (or, in Lefkowitz' case, for the contextless target vs. more specific contexts).
For example, in Flemming's (2001) model of closed-syllable vowel shortening in quantity languages, durational targets are assigned to syllables and segments, and weighted syllable- and segment-duration deviation costs are balanced against the cost of not maintaining a minimum durational difference between phonemically distinct short vs. long vowels. If the cost of deviating from the syllable target is weighted more heavily than the cost of deviating from a segmental target, then segmental shortening will result as more segments are added to a syllable.

8.6.1.2.2 Models without timing targets

Models of the type discussed above, which propose durational targets either for particular types of structures (e.g. syllables and segments, as in Flemming 2001 and Katz 2010¹³), or for the same type of segment in different contexts (e.g. Lefkowitz 2017), address the complexity of the known facts about speech timing, but face the challenge of explaining how these abstract targets are learned, given that the targets are not directly observable from surface-interval durations. (See Chapter 3 for a discussion of a similar issue with AP/TD's default activation intervals.) Approaches without timing targets (e.g. Šimko and Cummins 2010, 2011; Windmann 2016) are attractive because they avoid this problem, but are nevertheless able to generate durational parameters of movement (and resulting surface durations). The following section reviews Šimko and Cummins' Embodied Task Dynamics model, in which optimal durational specifications emerge from weighted costs of effort, perceptibility, and time.

8.6.1.2.3 Šimko and Cummins' Embodied Task Dynamics model: parameters of movement relating to surface time emerge from effort, parsing, and time costs

Šimko and Cummins' Embodied Task Dynamics model (Šimko 2009; Šimko and Cummins 2010, 2011) generates articulatory trajectories for VCV sequences, where V = [i,a] and C = [b,d]. The model is a development of the Task Dynamics model used in AP/TD, in which the articulators are assigned masses, and optimization is used to determine model parameter values. In this model, gestural activation onset and offset timing and system stiffness (where system stiffness is a scaling factor for gestural and 'speech-ready' stiffness¹⁴) are optimized using three costs: an effort cost, a perceptual (parsing) cost, and a time cost. The effort cost is proportional to the integral of applied force over time,¹⁵ and can be estimated because the articulators in the model have masses. The parsing cost penalizes the degree of articulatory undershoot at each point in time, as well as temporal shortening of the interval when a gesture is within a criterion distance from the target and is not overlapped by other consonantal gestures that would prevent it from being perceived. The time cost penalizes the total utterance duration, and this provides a pressure toward gestural overlap and shorter activation intervals.¹⁶

The weighting of each of these costs has consequences for timing patterns. Because the effort cost is defined in terms of applied force, a higher weighting of this cost will cause lower movement speeds, implemented through a decrease in system stiffness, and shorter gestural activation intervals, because the gestures are active and moving for shorter periods of time. Because the parsing cost increases the more difficult it is to identify a gesture, increasing the weight of this cost will cause gestures to be produced in a way that makes it easier for listeners to identify them. This will be reflected in 1) increased movement speed (via increased system stiffness), because higher movement speed increases gestural precision by decreasing the amount of time away from the target, as well as 2) increased gestural activation intervals, and 3) decreased gestural overlap, which increases the duration of non-overlapped, 'steady-state' intervals. The overall effect of these changes, it is argued, is to make the gestures easier for the listener to identify. In support of this model, Šimko et al. (2014) showed how manipulations of the parsing cost could account for the difference in closure duration between singleton and geminate consonants in Finnish, if the parsing-cost weighting could be increased locally for geminates.

Although both effort costs and parsing costs affect surface timing patterns, the duration cost, that is, the cost of the duration of the entire utterance, is required to account for the ability of speakers to manipulate utterance duration (a.k.a. overall speaking rate) independently of effort and perceptual clarity (cf. Gay 1981). Increasing the duration cost decreases gestural activation intervals, increases gestural overlap, and increases gestural stiffness.

¹³ Katz (2010) notes that the costs of deviations from segmental duration targets have a similar effect on surface forms as costs associated with perceptual contrast, and therefore suggests it may be possible to replace costs of deviations from segmental duration targets with costs associated with perceptual contrast. However, he does not propose eliminating syllable duration costs, as these are required to explain compression effects, e.g. polysegmental shortening effects.

¹⁴ Speech-ready stiffness is analogous to the stiffness of the neutral attractor in the AP/TD framework, except that speech-ready dynamics is always turned on, even when gestures are active, and the speech-ready stiffness of individual articulators can be manipulated according to requirements for higher precision (Šimko 2009). The speech-ready position is assumed to be "an average constellation with regard to the entire set of mastered gestures" (Šimko, O'Dell, and Vainio 2014, p. 133).

¹⁵ The exception is the damping force required to model the articulators' contact with the vocal tract structures when making a constriction (Šimko and Cummins 2010).

¹⁶ See also Liberman, Cooper, Shankweiler, and Studdert-Kennedy (1967), who noted that one of the signatures of speech communication is that speakers are highly efficient at conveying information quickly, and one of the mechanisms for accomplishing this is articulatory overlap.
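A highly simplified stand-in for this three-way trade-off is sketched below; the cost expressions are our placeholders for the model's force-integral, undershoot, and utterance-duration terms, and the weights are arbitrary. Raising the time-cost weight shortens the optimal activation interval, mimicking a speaking-rate increase.

    # Highly simplified stand-in for an effort + parsing + time cost trade-off
    # over a gestural activation interval. All expressions are illustrative only.
    def total_cost(d, w_effort=1.0, w_parse=1.0, w_time=1.0):
        effort = 0.01 / d ** 2                    # short intervals demand fast, forceful movement
        parsing = max(0.0, 0.3 - d) * 2.0         # undershoot risk below ~300 ms
        return w_effort * effort + w_parse * parsing + w_time * d

    candidates = [i / 100 for i in range(5, 101)]     # 50 ms .. 1 s

    for w_time in (0.5, 1.0, 4.0):                    # increasing the duration cost ...
        best = min(candidates, key=lambda d: total_cost(d, w_time=w_time))
        print(f"time-cost weight {w_time}: optimal interval {best:.2f} s")
        # ... shortens the optimal activation interval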
Beňuš and Šimko (2014) showed that locally decreasing the duration cost in the vicinity of a phrase boundary can be used to model boundary-related lengthening in Slovak m(#)abi and m(#)iba sequences. Windmann (2016) shows how this same general approach, i.e. one without explicit durational targets, can be used to model durational effects of prominence (phrasal prominence, lexical prominence, and their interaction) and polysyllabic shortening, as well as interactions with speaking rate.

In sum, a wide range of findings supports the view that the optimization approach can provide an account of observed surface timing patterns in speech. Moreover, these findings suggest a difference in desirability between two types of optimization approaches: those with, and those without, timing-target goals that are abstract in the sense that they are unobservable from surface forms. Šimko, Windmann, and colleagues have shown that unobservable (and therefore less desirable) timing-target goals are not required in these approaches; instead, plausible timing specifications can result from the optimization of time and other movement costs (e.g. effort), as well as optimization of perceptual contrast.

8.6.1.3 Different definitions and operationalizations of costs

Models which include timing targets (e.g. Flemming 2001; Katz 2010; Lefkowitz 2017) have costs that are defined in terms of deviations from proposed, idealized target values. Flemming's and Katz' models assign penalties to the squared deviation from default target values. Because the deviation cost is squared, it equates negative and positive deviations, and it increases more rapidly with the size of the deviation than an unsquared deviation cost would. Lefkowitz' (2017) model also penalizes squared deviations from target values, but has different constraints for positive vs. negative deviations. Having different constraints for positive vs. negative deviations provides a way for these models to account for Klatt's (1973, 1976) observation that segmental intervals appear to be somewhat 'incompressible' (in that they do not appear to shorten beyond a minimum threshold¹⁷), and provides a way to account for lower temporal variability for shorter, more constrained interval durations (see Section 8.6.1.4 for further discussion).

¹⁷ Windmann (2016) presents data from Dellwo, Steiner, Aschenberner, Dankovičová, and Wagner (2004) showing that segmental intervals appear to be deleted when subject to time pressure that would otherwise shorten them beyond the incompressibility threshold.
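The effect of separate penalties for positive vs. negative deviations can be sketched with invented weights: when compression below the target is penalized much more steeply than lengthening, increasing pressure for brevity compresses the interval only down to a floor, giving Klatt-style incompressibility.

    # Sketch of asymmetric squared-deviation costs: shortening below the target
    # is penalized far more steeply than lengthening. All weights are invented.
    def deviation_cost(duration, target=0.20, w_short=100.0, w_long=1.0):
        dev = duration - target
        return (w_short if dev < 0 else w_long) * dev ** 2

    def best_duration(rate_pressure):
        """Choose a duration under a cost that also rewards brevity."""
        candidates = [d / 1000 for d in range(50, 401)]
        return min(candidates, key=lambda d: deviation_cost(d) + rate_pressure * d)

    for pressure in (0.5, 2.0, 8.0):
        print(f"rate pressure {pressure}: duration {best_duration(pressure) * 1000:.0f} ms")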
In Šimko and Cummins' articulatory model, scalar costs are assigned to possible gestural movements, e.g. those produced with more force (greater effort cost), with more undershoot (greater parsing cost), or with a longer movement duration (greater time cost). Windmann (2016) has similar costs, but they are operationalized differently, in all cases via the duration of different sizes of acoustically defined intervals. In Windmann's model, 1) effort costs are assigned to syllable durations, 2) perceptual costs are assigned to syllable and accented-word interval durations, and 3) time costs are assigned to utterance-interval durations, with different shapes of the cost functions for each. The time cost is a linear function of utterance duration, as it is in Šimko and Cummins (2010, 2011), but the effort cost is a non-linear, increasing function of syllable duration. It rises quickly for short syllable durations, with a gentler rise for longer durations. This type of function is motivated on the assumption that (effortful) movement will correspond to a larger proportion of a short syllable interval than of a long syllable interval, where long syllable intervals are assumed to be more static. Longer durations may continue to incur an effort penalty even when the oral articulators are not moving, because effort is required to maintain phonation (Howard and Messum 2011), as well as for controlled expiration.¹⁸ Perception costs are convex functions, with a high cost for short-syllable or short-accented-word durations, whose recognition likelihood is low, asymptoting to zero for longer durations. These cost functions represent the inverse of recognition likelihood: recognition likelihood of syllables and/or words is low where the duration is short and the perception cost is high. In theory, recognition likelihood becomes perfect at a longer duration, where the cost function asymptotes to zero; at this point recognition likelihood has reached a ceiling, and the syllable or word incurs no further perception cost. Windmann showed that his model could account for a wide variety of observed durational phenomena, including incompressibility (Klatt 1973, 1976), effects of word and phrasal stress, interactions of prominence with speaking rate, and polysyllabic shortening.

¹⁸ Windmann (2016) notes that although force is no doubt often required to hold an articulator in an endpoint position, force is not required to maintain articulators in a target equilibrium position in mass–spring systems of the type used by Šimko and Cummins (2011).

8.6.1.4 Proposed source(s) of timing variability

An assumption of most models of non-speech motor activity is that there is noise from a timekeeping mechanism that grows with the duration of the timed interval (Getty 1975; Schmidt et al. 1979; Gallistel 1999; Gallistel and Gibbon 2000; Jones and Wearden 2004). Additionally, depending on the timing requirements of particular tasks, some patterns of variability may have to do with the minimal intervention principle (Scholz and Schöner 1999; Domkin et al. 2002; Yang and Scholz 2005; Todorov and Jordan 2002; Todorov 2004), discussed in Section 8.4.1 above. However, most optimization models of surface durations in speech do not make explicit proposals about the source of timing variability for repeated productions of the same utterance. Lefkowitz (2017) is an exception, but provides a very different approach from those in the non-speech motor control literature.

Lefkowitz (2017) proposes that surface timing variability is due to the stochastic nature of the phonetic planning process, rather than to noise in a timekeeping process that occurs once a single planning choice has been made. His proposal was inspired by Maximum-Entropy models of phonology (e.g. Goldwater and Johnson 2003), which are stochastic models designed to model grammars with variable outputs (free variation). They are called Maximum Entropy because they are "designed to include as much information as is known from the data while making no additional assumptions (i.e. they are models that have as high an entropy as possible under the constraint that they match the training data)" (Goldwater and Johnson 2003, p. 112). In Lefkowitz' application of the Maximum-Entropy framework to modeling surface phonetics, the assumption is that speakers choose (planned) surface duration specifications from distributions of possible durations (analogous to probability density functions), where candidate values that are more 'harmonic' (less costly) are more likely to be chosen. The choice of duration will depend on the number of (conflicting) target values that speakers are under pressure to match in any given context, as well as on the penalties that are assigned for deviating from these target values. For example, to account for systematically shorter acoustic vowel-interval durations before voiceless consonants in American English, he proposes competing constraints that penalize deviations from 1) the default target duration for vocalic intervals, and 2) the target duration for vocalic intervals before voiceless consonants. Durations in contexts subject to a larger number of constraints and/or constraints with higher weights will have lower surface phonetic variability, because their probability density function is more constrained. This proposal contrasts with non-speech motor-control approaches, in which optimal parameter values are output from the optimization (planning) process, and variability is due to noise in the control signal during the implementation of the production process.
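Lefkowitz's proposal can be caricatured as sampling planned durations from a distribution whose probabilities fall off exponentially with weighted cost; the targets, weights, and scaling below are our inventions. Adding a second, conflicting, more heavily weighted target both shifts the distribution and narrows it, which is the variability prediction at issue.

    # Caricature of a Maximum-Entropy phonetic grammar: planned durations are
    # sampled with probability proportional to exp(-cost). Values are invented.
    import math, random

    random.seed(0)
    candidates = [d / 1000 for d in range(50, 351)]        # 50-350 ms

    def sample(targets, n=2000):                           # targets: (value, weight) pairs
        costs = [sum(w * (d - t) ** 2 for t, w in targets) for d in candidates]
        probs = [math.exp(-c * 200) for c in costs]        # lower cost -> higher probability
        return random.choices(candidates, weights=probs, k=n)

    def sd(xs):
        m = sum(xs) / len(xs)
        return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

    one = sample([(0.25, 1.0)])                            # default target only
    two = sample([(0.25, 1.0), (0.12, 3.0)])               # plus a competing contextual target
    print(f"one constraint:  mean {sum(one)/len(one)*1000:.0f} ms, sd {sd(one)*1000:.1f} ms")
    print(f"two constraints: mean {sum(two)/len(two)*1000:.0f} ms, sd {sd(two)*1000:.1f} ms")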


Because Lefkowitz observed lower durational variability in contexts with shorter durations, he concludes that these contexts must be subject to more highly weighted, or more numerous, constraints than contexts which yield longer durations. To achieve this, he proposes that each contextual target is associated with two types of hemiparabolic penalties: 1) a penalty for being shorter than the target (what he calls a STRETCH constraint), and 2) a penalty for being longer than the target (what he calls a SQUEEZE constraint). By stipulating that the default target durations are long, and that most contextual effects on duration reflect SQUEEZE rather than STRETCH constraints, he is able to account for the lower durational variability of shorter-duration intervals, because these are more constrained.

However, as noted in Chapter 4, lower variability for shorter-duration intervals is also observed in non-speech motor tasks where shorter-duration productions cannot be said to be subject to more (or more highly weighted) constraints. For example, lower variability for shorter-duration intervals is observed for tasks involving single interval production (Rosenbaum and Patashnik 1980a, 1980b; Ivry and Hazeltine 1995; Merchant et al. 2008b), tasks involving moving to metronomes or to internally recalled periodic ‘beats’ (Schmidt et al. 1979; Wing 1980; Ivry and Hazeltine 1995; Spencer and Zelaznik 2003; Merchant, Zarco, and Prado 2008, among others), and tasks involving the anticipation of a timed stimulus (e.g. Gibbon 1977; Roberts 1981; Green, Ivry, and Woodruff-Pak 1999). In all of these tasks, participants producing shorter vs. longer intervals are performing the same task, e.g. mimicking a stimulus, or tapping along to a metronome beat, and thus would not be subject to additional constraints for the shorter-interval productions. The similarity in patterns of variation for speech and these non-speech motor activities, where the shorter intervals in the non-speech motor activities cannot be said to be more constrained in terms of their required tasks or motor costs than longer intervals, is at odds with Lefkowitz’ view that speech-durational variability is due to phonetic grammar (as opposed to noise in the motor control system). Thus the similarity in variability patterns across speech and non-speech motor activities is more consistent with the view that durational variability is due to noise in the timekeeping mechanism, possibly in reading time from memory (Gallistel 1999; Gallistel and Gibbon 2000; Jones and Wearden 2004), rather than to the hypothesized stochastic nature of the phonetic candidate space, as proposed in Lefkowitz (2017).
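The effect of SQUEEZE weighting on both the mean and the variability of planned durations can be demonstrated with a small simulation. This is a toy sampler in the spirit of Lefkowitz’s proposal, not his implementation; the targets, weights, and duration grid are hypothetical values chosen for illustration:

import numpy as np

rng = np.random.default_rng(0)

def hemiparabolic_cost(d, target, w_squeeze, w_stretch):
    # SQUEEZE penalizes exceeding the target; STRETCH penalizes undershooting it.
    dev = d - target
    return w_squeeze * max(dev, 0.0)**2 + w_stretch * min(dev, 0.0)**2

durations = np.linspace(0.02, 0.40, 400)   # candidate surface durations (s)

def sample(constraints, n=5000):
    # constraints: list of (target, w_squeeze, w_stretch) tuples
    harmony = np.array([-sum(hemiparabolic_cost(d, *c) for c in constraints)
                        for d in durations])
    p = np.exp(harmony - harmony.max())
    p /= p.sum()
    return rng.choice(durations, size=n, p=p)

long_ctx  = sample([(0.30, 200, 200)])                   # default target only
short_ctx = sample([(0.30, 200, 200), (0.12, 4000, 0)])  # plus a strong SQUEEZE
print(long_ctx.mean(), long_ctx.std())     # longer mean, more variable
print(short_ctx.mean(), short_ctx.std())   # shorter mean, less variable

In the second context, the added high-weight SQUEEZE constraint both shortens the mean planned duration and narrows its distribution, reproducing the shorter-equals-less-variable pattern that the grammar-based account is designed to capture.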


8.6.1.5 The number of model components, and issues relating to phonology-intrinsic vs. phonology-extrinsic timing

Optimization approaches to timing patterns within utterances vary in terms of the number of proposed model components. Flemming (2001) has no separation between phonology and phonetics, and consequently has a two-component model, that is, a model with 1) a phonology–phonetics component, and 2) an assumed motor-sensory implementation component. Because there is no separation between phonology and phonetics, this model proposes that timing is planned and specified during the phonology–phonetics planning component. The representations that are used in this planning component for specifying timing are temporal, i.e. target durations for different size units (e.g. segments and syllables) that compete with one another in determining interval durations in particular contexts. Like AP/TD, therefore, this model has temporal phonological representations. However, unlike AP/TD, this model uses non-phonology-specific solar time units, and therefore allows for the possibility that timing is represented and specified using general-purpose timekeeping mechanisms.

Šimko and Cummins’ (2010, 2011) AP/TD-inspired model also has no separation between phonology and phonetics, and therefore specifies timing (as well as all other surface characteristics) within the phonological planning component. As in AP/TD, phonological representations are gestural and spatiotemporal, and include stored gestural stiffness and spatial target values as part of lexical representation. Timing in this model is therefore phonology-intrinsic. However, like Flemming’s (2001) model, and unlike AP/TD, Šimko and Cummins’ model uses solar time. That is, gestural activation onset and offset times are specified in solar time, and values of the system stiffness parameter are optimized. This optimization process uses a cost function that includes costs that are specified in solar time units, as are values for the gestural-activation onset and offset timing parameters. These include the cost of time (used to penalize long utterance durations) as well as a realization degree cost (used to penalize short productions of intervals in which gestures are least recognizable). This model therefore also requires general-purpose timekeeping mechanisms to represent and specify the values of these parameters, as well as costs that are quantified in solar time, and to track time as the utterance unfolds.

In contrast to these two-component models, other models propose a phonetic planning component that is separate from the phonological component, and therefore assume a three-component model of speech production. Braver (2013) and Lefkowitz (2017) are of this type. Braver (2013) assumes a symbolic phonological component, in which optimal symbolic representations of the output of the phonological component are chosen through the use of ranked constraints (classical Optimality Theory; McCarthy and Prince 1993; Prince and Smolensky 1993). This symbolic output then serves as the input to a phonetic planning component that employs weighted cost functions and cost


minimization to generate quantitatively specified surface forms. In this model, as in Flemming (2001), deviations from durational targets associated with each segment, as well as from durational targets associated with a higher-level unit, e.g. a mora or two moras, are assigned costs, and surface-form durations are chosen that minimize the total cost of deviating from the competing targets.

In summary, this section has highlighted that the debate about the two- vs. three-component nature of the speech-production system extends also to researchers working within the optimization framework. Optimization-based models with two components have phonology-intrinsic timing, as AP/TD does, but differ from AP/TD in their use of solar timing units, as opposed to phonology-specific oscillators.

8.6.1.6 Whether the outputs of the optimization process are surface durations or relative durations

Another difference among optimization-based models of surface durations in speech has to do with the way grammatical constraints interact with so-called ‘performance factors’, such as rate of speech. In Flemming’s (2001) model, which assumes a single phonology–phonetics optimization component of grammar, phonetic costs such as effort interact with costs relating to contrasts (such as costs that penalize categories that are not perceptually distinct, and costs that penalize small numbers of contrasts in an inventory). In his proposal, rate of speech is reflected in the effort cost, where the assumption is that higher rates of speech require higher movement velocities, and therefore higher effort penalties. In Šimko and Cummins’ (2010, 2011) and Windmann’s (2016) models, which simultaneously evaluate time, perception/parsing, and effort costs to yield optimal parameter values for each planned utterance, manipulation of the weighting of the time cost provides direct control over speech rate, although the weighting of the effort and perception costs will also influence output durations.

Katz’ (2010) model contrasts with these approaches in proposing that relative (rather than absolute) durations are emergent outputs of the constraint grammar. Once relative durations are optimized, absolute durations are specified at a later stage, under the influence of ‘performance factors’ such as rate of speech and isochrony (Katz 2010, pp. 127, 128):

We can think of the outputs of the grammar, which are the input to a speech task, as a series of timing units that have relative durations assigned to them by the grammar, but do not have absolute durations, which depend on speech rate and sundry performance factors . . . Let us assume that relative durations are turned into absolute durations when they are assigned a


speech-rate coefficient, a real number that is multiplied by relative duration. This may be a simplification . . . we could model partial isochrony by expanding the formalism to include competing weighted constraints on isochrony. Such a model would also be able to derive a tendency towards evenly-spaced words, but not complete isochrony. This might emerge from competition between the pressure to be isochronous and, for instance, a dispreference for sudden large changes in speech rate . . .

Katz thus proposes a three-component model of a different sort from the one proposed in this volume. That is, he proposes 1) a grammatical component that produces spectral aspects of surface phonetics, as well as relative timing, 2) a performance component that adjusts the output of the grammatical component and specifies absolute durations, and 3) presumably, an implementation component which implements the plan. Because the grammatical component produces proportional (i.e. quantitative) timing, it requires some type of phonology-intrinsic timing, but it does not use solar time for this purpose. In the performance component, however, timing is specified in terms of solar time. The XT/3C proposal is different, in that it is characterized by fully phonology-extrinsic timing, where all aspects of timing specification are left to the Phonetic Planning Component, which makes use of general-purpose timekeeping mechanisms (discussed briefly in Chapter 9).
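In its simplest form, the quoted scheme amounts to one multiplication per timing unit. The sketch below is our minimal rendering of that idea; the relative-duration values and the rate coefficient are invented for illustration:

# Minimal rendering of Katz's (2010) relative-to-absolute scheme: the
# grammar outputs relative durations; a speech-rate coefficient (a
# 'performance' parameter) maps them into absolute (solar) time.
relative_durations = [1.0, 1.6, 0.8, 2.2]   # grammar output (arbitrary units)
rate_coefficient = 0.11                     # seconds per relative unit (hypothetical)

absolute_durations = [rate_coefficient * r for r in relative_durations]
print(absolute_durations)                   # absolute durations in seconds

Modeling the partial isochrony Katz mentions would require replacing this single multiplication with a further optimization, e.g. weighted constraints pulling word intervals toward equal spacing against a dispreference for abrupt rate changes.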

8.6.2 Summary of optimization in phonology and phonetics

Optimization models in phonology and phonetics have shown how surface forms can be generated in a principled way under the influence of competing factors or constraints. These models differ in many respects, e.g. in the number of components they propose for the speech-production system, in the degree of detail they seek to explain, in the way forms are evaluated (constraint ranking vs. weighted scalar costs or constraints), and in the costs or constraints that are included in each model. The models that are of most relevance to the XT/3C approach presented in this book are those that account for surface phonetic properties of planned utterances. Many of these have been inspired by Lindblom (1990), and therefore include both input and output constraints, i.e. costs of movement as well as task requirements relating to perceptual contrast or recoverability, and most of these include scalar costs. The models that account for durational properties are of special relevance, and of these, Šimko and Cummins’ (Šimko 2009; Šimko and Cummins 2010, 2011) and


Windmann’s (2016) models are most attractive because they show how durational specifications can emerge from costs for effort, time, and perceptual recoverability, without having explicit, abstract (in the sense that they can vary in different contexts) durational targets. Abstract (in this sense) durational targets are a disadvantage, as noted in Chapter 7, because they present a non-trivial challenge to the listener and learner, who must infer those targets from surface forms which only rarely have durations that correspond to them. As also discussed in Chapter 7, explanatory accounts of timing variability require a model with three components, i.e. with separate Phonological and Phonetic Planning Components, that makes use of phonology-extrinsic, general-purpose timekeeping mechanisms to represent, specify, and track solar time. Cost-function optimization and the Minimal Intervention Principle within Optimal Control Theory provide a principled explanation for surface-timing patterns and for lower variability at movement endpoints than elsewhere (Scholz and Schöner 1999; Domkin et al. 2002; Yang and Scholz 2005; Todorov and Jordan 2002; Todorov 2004). Additional evidence presented in Chapter 4 suggests that the increase in timing variability with increases in interval duration is due to noise in the timekeeping process. This evidence suggests that accounts of timing variability that rely on a stochastic process of choosing planned surface durations from distributions of possible durations in the Phonetic Planning Component (Lefkowitz 2017) may be superfluous.

8.7 Conclusion

This chapter introduced OCT approaches to non-speech and speech motor control, and showed that OCT is a useful framework for determining parameter values for models that satisfy competing task requirements at minimum cost. Within this framework, OFCT approaches are of particular interest, because they determine the optimal motor commands to reach a target from any state the system is in at any point in time, where states are estimated on the basis of efference copy from the motor commands as well as on sensory feedback. These approaches can thus account for findings that show that actors can often successfully compensate for unexpected perturbations in the duration of goal-directed actions. The chapter also reviewed the largely unresolved debate about the types of motor costs that should be included in these models. Findings suggest that costs relating to effort or jerk, endpoint variance, and time are all likely to be relevant for most motor tasks, including speech, and all play a role in explaining temporal patterns.


Probably the greatest challenge in applying OFCT approaches to speech has to do with determining the control policy, i.e. the optimal control signal that should be issued from any state the system is in, in order to reach the specified target on time. This task is computationally difficult and time-consuming. Most approaches to phonetics and speech motor control have circumvented this problem by using optimal control theory in a more restricted sense, i.e. to determine optimal parameter values for models without state estimation or feedback. However, this problem will need to be resolved, in view of the evidence that state estimation and feedback are needed; see Houde and Nagarajan (2011) for recent work in this area.

Section 8.6 reviewed some of the ways OCT approaches have been used in phonology and phonetics. OCT approaches to phonetics have largely followed Lindblom (1990) in proposing constraints or costs relating to output (i.e. contrast or perceptual recoverability) as well as to input (i.e. movement costs, such as effort), and some have followed more recent OCT accounts of non-speech motor control in including a cost for time (e.g. Šimko and Cummins 2010, 2011; Windmann 2016). As has been shown for non-speech motor-control tasks, all of these costs have implications for durational patterns in speech. While many aspects of durational behavior can be accounted for by effort/jerk and endpoint-variance costs without an explicit cost of time, an additional cost of time appears to be required to account for 1) relatively fast movement speeds even in the presence of effort costs which penalize fast movements (Tanaka et al. 2006), as well as for 2) articulatory overlap, both of which are widespread in speech (Liberman et al. 1967).

Chapter 10 introduces a sketch of an XT/3C proposal which assumes the SOFCT approach, and incorporates many of these ideas. First, however, Chapter 9 discusses the general-purpose timekeeping mechanisms that are assumed to operate in the XT/3C proposal, including Lee’s General Tau theory for movement guidance and coordination.


9 How do timing mechanisms work?

Behavioral and neural evidence presented in Chapter 4 indicates that humans possess general-purpose timekeeping abilities, which they use to represent the surface durations of timed intervals, to estimate and track time until event occurrence, and to plan and specify surface-timing characteristics of movements. The first part of this chapter presents what is currently known about human abilities to represent and track time, with a particular focus on timing mechanisms that would be extrinsic to the phonological component of the speech planning process, because they are general-purpose, non-phonology-specific mechanisms. All phonology-extrinsic timing models of speech production assume these general mechanisms. This contrasts with AP/TD’s phonology-intrinsic timing approach, in which surface timing behaviors emerge from intrinsically timed phonological representations and phonology-specific timing-control mechanisms that use time units that often don’t relate to solar time in a straightforward way. This chapter addresses what is known about the nature of the general-purpose timekeeping mechanisms that are extrinsic to the phonology, and describes candidate models of these mechanisms (Section 9.1). The state of the field is such that although there is a clear consensus that humans have general-purpose timekeeping abilities (cf. the neural evidence presented in Chapter 4 showing that the brain tracks and represents time), there is currently a large amount of debate about the way neural timekeeping works. The discussion in Section 9.1 therefore focuses on key features of, and differences among, models, and guides the reader towards the issues that may be of most relevance to speech.

The second part of the chapter (Section 9.2) presents Lee’s General Tau theory (Lee 1998, 2009), a theory of the temporal guidance of voluntary movement. This theory provides a crucial component for the XT/3C model of speech production, because its tau-coupling mechanism provides a way to plan movements with appropriate velocity profiles, as well as endpoint-based movement coordination (see supporting evidence in Chapter 5). In doing so, it provides a general-purpose, phonology-extrinsic alternative to AP/TD’s use of oscillators for articulatory movement and coordination.
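Because tau-coupling figures centrally in the XT/3C sketch of Chapter 10, a numerical illustration may be useful here. In General Tau theory, the tau of a gap is the gap divided by its current closure rate, and coupled movements hold their taus in constant ratio. The following sketch, under our own simplifying assumptions (Euler integration; arbitrary duration, coupling constant, and initial gap), closes a movement gap by coupling it onto Lee’s intrinsic tau-guide; it illustrates the published equations rather than modeling any specific speech data:

# Sketch of tau-coupled gap closure (after Lee 1998). The tau of a gap x
# is x / x_dot; here x is guided so that tau_x = k * tau_G, where tau_G
# is the intrinsic tau-guide tau_G(t) = (t**2 - T**2) / (2*t).
T = 0.3        # movement duration (s); arbitrary illustrative value
k = 0.4        # coupling constant; shapes the velocity profile
dt = 1e-4      # Euler integration step (s)

x = 1.0        # initial gap, e.g. articulator-to-target distance (arbitrary units)
t = dt
while t < T:
    tau_guide = (t**2 - T**2) / (2 * t)   # negative while the gap is closing
    x_dot = x / (k * tau_guide)           # enforces tau_x = x / x_dot = k * tau_guide
    x += x_dot * dt
    t += dt
print(round(x, 6))  # gap is (numerically) closed as t reaches T

Varying k changes how velocity is distributed over the movement (for k in this range, smaller k yields an earlier velocity peak and a gentler approach to the target), which is how tau theory generates a family of realistic velocity profiles from a single coupling law.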



9.1 General-purpose timekeeping mechanisms

The large body of work on mechanisms for keeping track of time has addressed a number of issues that are critical for understanding speech timing. This section reviews the wide range of durations that humans can keep track of (9.1.1), differences between timekeeping mechanisms for long vs. short intervals (9.1.2), how timekeeping mechanisms for the speech-relevant range operate (9.1.3), and the wide variety of candidate proposals for these mechanisms (9.1.4). It also addresses apparent task- and modality-specific timing behaviors (9.1.5), and beat-based vs. non-beat-based timing (9.1.6), ending with a summary (9.1.7). Because the literature on this topic is vast, the discussion here is of necessity limited in scope, with a focus (wherever possible) on points that are particularly relevant to the issue of phonology-extrinsic, general-purpose timekeeping mechanisms in speech motor control.

9.1.1 Duration ranges relevant to timekeeping mechanisms used for speech production

One of the first things that need to be established when studying timekeeping mechanisms is the range of relevant durations, since humans have different neural mechanisms for different time-scales. This is clear for the very short and long human timing scales. On the short end of the scalar continuum, sound localization involves keeping track of the microsecond difference between the time of arrival of sounds at the two ears; neurons in the medial superior olive of the brain process these differences, using mechanisms that are likely to involve delay lines and coincidence detectors (Jeffress 1948; Oliver, Beckius, Bishop, Loftus, and Batra 2003). At the long end of human timing scales, the circadian clock controls bodily functions (the sleep/wake cycle, eating, body temperature, and hormone release) in multi-hour rhythms. Circadian timing mechanisms include a ‘master clock’ in the hypothalamic suprachiasmatic nucleus that synchronizes with external light-dark cycles, as well as a network of distributed cellular oscillators (Aschoff 1985, 1998).

9.1.1.1 The range of relevant durations for speech production lies between these two extremes

Individual speech movements range in duration from about 100–200 ms (Tasko and Westbury 2002), but some speech intervals with linguistic significance


(e.g. some stop-closure durations, voiced intervals for some unstressed vowels) can be shorter than this, and some inter-landmark intervals can be longer. For example, Fletcher and McVeigh’s (1993) study of Australian English and review of the English durational literature reported segment-related durations (measured from acoustic landmarks) up to 300 ms. Speakers must therefore be able to represent and specify durations at least as long as 300 ms. Whether they represent, plan, and specify durations of longer intervals is unclear; the answer to this question depends on whether speakers represent, plan, and specify the durations of superordinate units, e.g. syllable rhymes (for phrase-final lengthening), syllable-sized units (for prominence-related lengthening and polysegmental shortening), and words, clitic groups, or perhaps cross-word feet (for polysyllabic shortening), or whether the durational patterns of these superordinate units emerge from control regimes that operate on lower-level units (e.g. intervals between segmental landmarks), taking into consideration the position of segments in the prosodic hierarchy, and their relative prominence.

Van Santen and Shih (2000) showed that although the role of position in prosodic structure is clearly relevant in influencing phoneme-related durations, it is not the case that there is a fixed-duration goal for intervals related to higher-level prosodic constituents, e.g. a phrase-final prosodic constituent. Instead of segment durations being compressed or stretched to fit into specified syllable-rhyme time intervals, the durations of acoustic intervals related to individual phonemes preserve intrinsic segmental duration differences. For example, acoustic intervals and articulatory movements for /a/ in final position are longer than for /i/ in final position, and syllables that contain more phonemes have corresponding acoustic intervals that are longer than syllables that contain fewer phonemes (Dauer 1983). Flemming (2001), Katz (2010), and Braver (2013) have modeled the interacting effects of phonemes and superordinate units on surface durations using abstract durational targets for intervals relating to phonemes as well as durational targets for intervals relating to superordinate units; surface durations are chosen which optimize the deviations from both types of target. In this type of model, the representation of the duration of the intervals relating to the superordinate unit would require durations that are longer than the durations of intervals relating to individual phonemes, and depending on the type of superordinate unit (e.g. syllable vs. word), might require the representation of durations that are longer than 300 ms. However, although it is clear that the representation of a higher-level unit is required to explain these kinds of effects, the representation of the duration of the higher-level unit is not necessarily required. Van Santen and Shih (2000) suggest the possibility that timing is not specified at the level of higher prosodic units, like the rhyme, but is instead


specified at lower levels (like the acoustic intervals relating to phonemes), taking the context (e.g. position in prosodic structure) of these lower-level units into account. It is thus possible to account for the timing behavior of superordinate units such as the rhyme or word, without assuming that the durations of these intervals are necessarily specified.

At an even higher level, phrase length is known to influence the degree of inspiration, where deeper breaths are observed for longer phrases (Sperry and Klich 1992; Winkworth, Davis, Ellis, and Adams 1994; Huber 2008; Rochet-Capellan and Fuchs 2013). If phrase length is represented in terms of absolute time, then this would be an example of representing even longer time ranges in speech production, up to ca. five seconds. However, it is also possible that in computing the degree of inspiration at the onset of a phrase or utterance, speakers compute phrase length in terms of e.g. number of words or syllables, rather than in terms of the duration of the phrase. In short, it is clear that speakers need timing mechanisms for durations that are less than half of a second, but whether they need mechanisms for longer durations in the seconds-to-minutes range is less clear.

9.1.2 Timekeeping mechanisms in the 10s-of-milliseconds-to-minutes range are different from those used for shorter and longer intervals

Despite the lack of clarity about the relevance to speech production of timekeeping mechanisms for intervals lasting longer than approximately half a second, it is nevertheless clear that timekeeping mechanisms in the ms-to-minutes range are different from the timekeeping mechanisms used for sound localization (much shorter durations) and circadian timing (much longer durations). As noted above, sound localization and circadian timing functions are highly localized in the brain, with sound localization processes occurring in the medial superior olive, and circadian functions occurring in the hypothalamic suprachiasmatic nucleus. In contrast, timekeeping in the seconds-to-minutes range is far more distributed, with timing-related processing occurring in various parts of the cortex, the cerebellum, basal ganglia, and thalamus. Further evidence of a difference in timekeeping mechanisms at these different timescales comes from correlational studies of time estimation behavior with time-awake, where correlations with time-awake are taken as evidence of circadian timing processes. For example, Aschoff (1985, 1998) has shown that although time estimation on the scale of hours does correlate


with time-awake, time estimation on the scale of seconds and minutes does not, but does correlate with body temperature.

9.1.3 How do the timekeeping mechanisms work, for intervals ranging from 10s of ms to minutes?

A number of studies have demonstrated the existence of general-purpose timekeeping abilities in humans and other animals. This section reviews some of the evidence for the nature of general-purpose timekeeping processes, for intervals with durations in the range relevant for speech (9.1.3.1), and presents some findings that indicate the neural mechanisms involved in these abilities (9.1.3.2).

9.1.3.1 A core set of general-purpose timekeeping abilities, shared across perception and production

A great deal of evidence presented earlier in this chapter supports the idea that a core set of general-purpose timekeeping abilities (i.e. used for production and perception, and for all modalities within perception) is used for timekeeping in the ms-to-minutes range. This evidence includes, e.g., anticipatory behavior in animals and humans that requires representations of timed intervals learned in perception and used in production. This suggests general surface-timing representations and time-tracking abilities that are common to perception and action. Other researchers have provided evidence that representations learned in one perceptual modality can be reused in other modalities. For example, Nagarajan, Blake, Wright, Byl, and Merzenich (1998) showed that discrimination facilitation from training on intervals signaled via somatosensory information can generalize to discrimination performance on intervals signaled via auditory stimuli.

A core set of general-purpose timekeeping mechanisms is also supported by the fact that timing variability, in a variety of production and perception tasks involving a range of different interval durations, consistently increases with longer-duration intervals (often called the scalar property, discussed in Section 4.1), although the slope of the relationship between interval timing variability and interval duration can vary across tasks. Ivry and Hazeltine (1995) argued that if a single, common timing mechanism underpins timing performance on all tasks, albeit with some effector-, modality-, or task-specific modulations, individual performance on different timing tasks (assessed by timing variability in repetitions of each task over a range of


interval durations) should be correlated, i.e. good timers on one task should be good timers on another. This was in fact what Keele, Pokorny, Corcos, and Ivry (1985), Ivry and Hazeltine (1995), Spencer and Zelaznik (2003), and Merchant, Zarco, and Prado (2008) observed for many of the tasks they investigated. For example, Ivry and Hazeltine (1995) showed that under some circumstances, perception and production tasks can share the same slope of the relationship between squared standard deviation and squared interval duration, if the tasks involve a comparable number of timed intervals. They found, for instance, that the slope was .002 for isolated intervals (production as well as perception) vs. .0005 for four repeated intervals (four in production, four presentations of the standard in perceptual discrimination).¹,²

Tasks whose timing variability was not found to correlate with timing variability on tapping tasks were those that involved continuous, repetitive movements, e.g. continuous circle drawing, or back-and-forth line drawing, which lacked salient events that could be timed (Robertson, Zelaznik, Lantero, Bojczyk, Spencer, Doffin, and Schneidt 1999; Zelaznik, Spencer, and Doffin 2000; Zelaznik, Spencer, and Ivry 2002; Spencer and Zelaznik 2003). Surface timing in these types of tasks is said to be (mostly) emergent (see discussion in Repp and Steinman 2010; Delignières and Torre 2011; Zelaznik and Rosenbaum 2010), and does not involve an explicit representation of the surface interval duration. Merchant, Zarco, Bartolo, and Prado’s (2008) study, which involved a multidimensional scaling analysis, corroborated these earlier results.

Taken together, these results are at least consistent with the view that task- and modality-general timekeeping mechanisms underlie all non-emergent tasks, that is, tasks that involve a mental representation of the surface duration of the timed interval, regardless of the perceptual vs. production nature of the task. Therefore, if speech is a non-emergent task (as argued in Chapter 4), it is expected to be governed by the same task- and modality-general timekeeping mechanisms as are used for other non-emergent tasks.

¹ An additional experiment ruled out the possibility that major differences in Weber slope were due to differences in the continuity vs. discontinuity of the comparison interval presentation (in the case of perception tasks), or elicitation (in the case of production tasks), in relation to the standard. In this experiment, the standard was periodically repeated, but the presentation/elicitation of the comparison interval was either continuous with the standard, so that it occurred according to the same periodic rhythm (500 ms inter-tone-onset interval), or discontinuous with the standard, i.e. presented/elicited after an interval of either 900, 1000, or 1100 ms in duration. The continuous vs. discontinuous presentation/elicitation of the comparison interval with the standard was not found to affect the nature of the Weber slope relationship, for either production or perception. ² These results suggest that the nature of the representation of the standard in memory affects the degree of duration-dependent timing variability. On this view, repeated presentation of the standard leads to a more accurate representation of the interval-to-be-timed as compared to a single presentation of the standard; a more accurate representation results in less variability of the timing mechanism over repeated trials.
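Ivry and Hazeltine’s slope values can be unpacked with a line of arithmetic (our illustration). If squared standard deviation grows linearly with squared duration, SD² = k·T², then SD = √k·T. A slope of k = .002 gives SD ≈ 0.045·T, i.e. roughly 45 ms of variability on a 1000-ms interval, whereas k = .0005 gives SD ≈ 0.022·T, roughly 22 ms on the same interval. Repeated presentation of the standard thus halves timing variability in this dataset, while preserving the scalar (proportional) form of the variability–duration relationship.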


One possibility is that these general-purpose timekeeping functions are performed by a single, ‘dedicated’ core timekeeper, composed of specialized neural structures exclusively dedicated to timing, and not used for any other type of neural representation or processing. Another possibility is that the timekeeping functions are performed by neural structures that aren’t exclusively devoted to timing, i.e. that timing is an ‘intrinsic’ activity of neural networks that also encode other types of information (Ivry and Schlerf 2008; Buonomano and Laje 2010). That is, ‘intrinsic’ timekeeping models track and represent time, even though the neural structures which do this are not exclusively dedicated to timing and also do other things. However, note that according to the phonology-extrinsic vs. phonology-intrinsic timing definitions adopted in this book, ‘intrinsic’ and ‘dedicated’ timekeeping mechanisms are both considered to be phonology-extrinsic timing mechanisms, because from this perspective, they are all general-purpose timing mechanisms that would be used in the Phonetic Planning Component (and therefore outside of the Phonological Planning Component), to specify the surface timing for acoustic landmarks and movements associated with a-temporal (symbolic) phonological representations. The sections which follow discuss evidence and proposals relating to these issues.

9.1.3.2 A distributed network of neural structures involved in timekeeping

Two types of evidence clearly show that it isn’t the case that a single neural structure behaves as a neural ‘clock’ exclusively dedicated to timing, and nothing else. First, although lesions of individual areas can alter timing to some extent (e.g. leading to greater timing variability or slower movements), they do not abolish timing behavior altogether. Indeed, there are no known disorders that abolish timing altogether, or that affect only timing. Second, the basal ganglia, thalamus, cerebellum, and cortical structures are all involved in timing behavior, making it difficult to assign a timing role exclusively to one single structure. On the basis of this evidence, researchers such as Merchant, Harrington, and Meck (2013) have suggested that these structures appear to be part of an interconnected, distributed timing network, in which different structures within the network can compensate for each other (called a ‘degenerate’ network). However, attributing these structures to a network exclusively devoted to timing is difficult, since these structures are also involved in behaviors that


aren’t related to timing. So if the ‘dedicated’ timing view is to be maintained, the timing network must involve sub-parts of these structures whose exclusive timing function has yet to be identified. The other possibility is that general-purpose timekeeping is accomplished by neural structures that aren’t exclusively devoted to timing, i.e. that the timekeeping network is ‘non-dedicated’.

9.1.4 A large number of candidate timekeeping models

Even within the class of ‘dedicated’ models, that is, models that assume that there is a set of neural structures exclusively dedicated to timing, there is no consensus about the nature of the timing mechanisms. Below is a discussion of some of the available models.

The most influential ‘dedicated’ core timing model is the traditional clock-counter, or pacemaker-accumulator, model (Hoagland 1933; Creelman 1962; Treisman 1963; Gibbon 1977; Church 1984; Gibbon, Church, and Meck 1984; Wearden 1999; Wearden 2013), in which a clock (pacemaker) generates pulses that are sent to an accumulator during a timed interval. The accumulated pulses are stored in working memory, and are stored in long-term memory if reinforced. Temporal discrimination in perceptual tasks is made on the basis of a ratio comparison process between pulses stored in working memory and those stored in long-term memory. This model has been extremely influential because it accounts for many aspects of timing behavior, but is regarded by some to be neurally unrealistic because of its unbounded accumulator process and ratio comparisons (Gibbon, Malapani, Dale, and Gallistel 1997, cited in Matell and Meck 2004). A toy simulation of this architecture is sketched below, after the following list.

Many of the more recent dedicated models are based on the clock-counter model, but with components that are thought to be more neurally realistic. These include

• neurally inspired models, such as the neuromimetic model of interval timing, which provides an account of attentional effects on temporal processing (Touzet, Demoulin, Burle, Vidal, and Macar 2005),
• the striatal beat-frequency model (Matell and Meck 2004),
• models of cerebellar and olivo-cerebellar timing (e.g. Yamazaki and Tanaka 2007; Jacobson, Rokni, and Yarom 2008), and
• delay line models (Desmond and Moore 1988; Moore, Desmond, and Berthier 1989; Braitenberg 1966, all cited in Goel and Buonomano 2014).
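The promised simulation of the clock-counter idea follows. It is a toy model under our own assumptions (a Poisson pacemaker at an arbitrary 200 pulses per second, arbitrary trial counts and rate noise), intended only to show where duration-dependent variability can come from in this architecture:

import numpy as np

rng = np.random.default_rng(1)

# Toy clock-counter (pacemaker-accumulator) simulation. A pacemaker
# emits pulses at random (Poisson) times; an accumulator counts them
# during the timed interval, and the count represents the duration.
RATE = 200.0    # assumed pacemaker rate, pulses per second

for d in (0.2, 0.4, 0.8):
    counts = rng.poisson(RATE * d, size=10000)   # pulse counts over many trials
    print(d, counts.mean(), counts.std())
# With a fixed rate, the SD of the count grows only as sqrt(duration).
# Scalar (Weber-like) variability emerges if the pacemaker rate itself
# varies from trial to trial, e.g. by 15 per cent here:
rates = np.clip(rng.normal(RATE, 0.15 * RATE, size=10000), 1.0, None)
for d in (0.2, 0.4, 0.8):
    counts = rng.poisson(rates * d)
    print(d, counts.std() / counts.mean())       # CV approaches the rate CV (~0.15)

Trial-to-trial variation in clock speed (or, equivalently, in memory translation constants) is one standard way such models derive the scalar property.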


The neuromimetic model of interval timing (Touzet et al. 2005) provides a neural instantiation of the accumulator in traditional pacemaker-accumulator models, as well as of its interactions with memory components of processing. In this model, a layered pacemaker operates by emitting oscillations of decreasing frequencies in each layer during the timed interval, which are translated into a pattern of neural activity in the accumulator module that serves as the representation of the timed interval.

The striatal beat-frequency model (Matell and Meck 2004) assumes that banks of cortical neurons oscillating at different frequencies are phase-aligned at the onset of an event, and yield different patterns at different points in time; neurons in the striatum are tuned to detect particular oscillator coincidence patterns corresponding to particular durations. Jacobson et al.’s (2008) model of timing in the olivo-cerebellar system assumes that timing is based on sub-threshold oscillations in a network of neurons in the inferior olive. These oscillations are at the same frequency; their relative phasing can be controlled by the cerebellar cortex and nuclei, and this relative phasing gives rise to an output temporal pattern in the cerebellar nuclei, which can be either repeating or non-repeating.³

Delay line models (Desmond and Moore 1988; Moore et al. 1989; Braitenberg 1966, discussed in Goel and Buonomano 2014) are based on the idea that some neurons or groups of neurons might be activated at different time delays. An example of a delay-line model is a synfire chain, proposed for the temporal structure of birdsong (e.g. Long, Jin, and Fee 2010, discussed in Hardy and Buonomano 2016), where the activity of different groups of bursting neurons is propagated in sequence. In other delay-line models, different neurons or groups of neurons are activated or receive inputs at different time delays, and therefore are tuned to different time intervals (Goel and Buonomano 2014).

The question of which timekeeping mechanism(s) are most appropriate is made even more complicated by evidence that timekeeping behavior can differ for different sub-ranges within the ms-to-minutes range. There is also evidence suggesting that timekeeping behaviors can differ for different perceptual modalities, for production vs. perception, and for different types of tasks. This evidence, reviewed below, has led some researchers to propose ways in which a core, dedicated timekeeper might perform differently in different situations, and has led other researchers to explore ‘intrinsic’ timing models, where timekeeping is an intrinsic activity of neural structures not exclusively devoted to timing.

³ See also Yamazaki and Tanaka (2007) for an alternative model of cerebellar timing.
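The coincidence-detection logic of the striatal beat-frequency model can be illustrated schematically. The oscillator frequencies, learned delay, and firing threshold below are invented for the illustration; real proposals involve large banks of noisy oscillators and trained striatal detectors:

import numpy as np

# Schematic of the striatal beat-frequency idea (Matell and Meck 2004):
# cortical oscillators at different frequencies are phase-reset at event
# onset; a downstream detector is tuned to their joint phase pattern at
# a learned delay.
freqs = np.array([5.0, 7.0, 11.0, 13.0])    # oscillator frequencies (Hz)

def oscillator_state(t):
    # All oscillators phase-aligned (cos = 1) at t = 0.
    return np.cos(2 * np.pi * freqs * t)

learned_t = 0.35                            # duration the detector is tuned to (s)
template = oscillator_state(learned_t)      # 'synaptic weights' of the detector

for t in np.arange(0.0, 0.7, 0.05):
    activation = template @ oscillator_state(t)   # coincidence detection
    marker = ' <-- detector fires' if activation > 0.9 * (template @ template) else ''
    print(f'{t:.2f}s  activation={activation:+.2f}{marker}')

Because the oscillators return to the learned joint phase pattern only near the trained delay, a detector tuned to that pattern responds selectively to a particular elapsed duration.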


9.1.4.1 Commonalities and differences in timing behaviors for different time ranges

Although some timing behavior in the sub-second range is far more similar to timing behavior in the supra-second-to-minutes range than it is to timing behavior in the much shorter (micro-second) or much longer (hour) ranges, there is a considerable amount of evidence suggestive of distinct behavior in the sub- vs. supra-second ranges. Five such lines of evidence are reviewed here.

1. Different relationships between timing variability and duration. Although timekeeping within the ms-to-minutes range shows evidence of increasing timing variability for increases in interval durations (the scalar property) for short (sub-second) as well as longer intervals, there are different slopes of the relationship for short (sub-second) vs. longer intervals (Grondin 2014; Rammsayer and Troche 2014), with increasing variability for intervals longer than ca. 1 second.⁴

2. Different responses to medication. Although sub-second and 1s+ intervals can both be affected by antipsychotic medication (e.g. the dopaminergic antagonist haloperidol), other dopaminergic antagonists can be more selective. For example, the dopaminergic antagonist Remoxipride is selectively detrimental to the processing of 1-s intervals, as opposed to 50-ms intervals (Rammsayer 1997, 1999). See also Wiener, Lohoff, and Coslett (2011) for corroborating evidence suggesting that different dopaminergic systems underlie the processing of durations in different time ranges.

3. Different responses to distraction. Although two studies relating to speech perception have shown that timing in the sub-second and second+ ranges can be influenced by a secondary processing task (Casini, Burle, and Nguyen 2009) as well as by sleep deprivation (Casini, Ramdani-Beauvir, Burle, and Vidal 2013), consistent with a single set of core timekeeping mechanisms in the ms-to-minutes range, Rammsayer and Lima (1991) found evidence suggestive of separate timing processes for short vs. longer ranges. They reported that a simultaneous word-learning task (memorizing a visually presented word for later recall) impaired the discrimination of one-second intervals, but not the discrimination of 50-ms intervals, whether filled or empty.

⁴ Grondin (2014) proposed that observations of reductions of timing variability for intervals longer than 1.2–1.5 s could be due to the use of counting or chunking strategies, because intervals longer than 1.2–1.5 s “exceed the temporal capacity of working memory” (Grondin 2014, p. 28). An advantage of chunking, or subdividing longer intervals, is that it reduces timing variability (Killeen and Weiss 1987 and references cited therein). As Killeen and Weiss note, reduction of timing variability as a result of subdivision has been observed behaviorally, and also follows from the assumption that timing variance (standard-deviation-squared) is proportional to the square of duration; a worked example is given at the end of this subsection.


4. Different response to neural interference. Repetitive transcranial magnetic stimulation (rTMS) of the cerebellum interferes with a time reproduction task for short (400–600 ms), but not long (1600–2400 ms), intervals (Koch, Oliveri, Torriero, Salerno, Lo Gerfo, and Caltagirone 2007). In this study, reproductions of the 400–600-ms intervals under rTMS were produced as longer than the same intervals in the absence of rTMS, but variability in the two conditions was comparable. Results of applying the rTMS during encoding vs. reproduction phases of the task suggested that the cerebellum was involved in encoding the temporal interval in memory, but not in its retrieval from memory. These authors found that rTMS of the dorsolateral prefrontal cortex affected reproduction of 1600–2400-ms intervals, but not the shorter (400–600 ms) intervals.

5. Different response to disruption. Results of duration discrimination tasks (discrimination of a test interval vs. an implicit 100-ms standard) show that there is a disruptive effect of a preceding tone on perceptual processing of a test interval if the tone occurs 100 ms before the test interval, but not 300 ms or more before the test interval (Spencer, Karmarkar, and Ivry 2009), suggesting that immediate context affects perceptual timing on a very short timescale, but not a longer scale.

All of this evidence suggests commonalities and differences across shorter (sub-second) vs. (usually) supra-second time ranges, which complicates the search for a single, core timekeeper. Rammsayer and Troche’s (2014) correlational study was designed to test for the possibility of distinct timekeepers for different ranges. Their study evaluated statistical correlations among timing behaviors in different tasks (duration discrimination and standard-interval identification tasks for short, sub-second (42–108 ms) and longer (900–1100 ms) intervals). They compared the fit of models explaining timing precision that either 1) contained different variables for short vs. long intervals, 2) contained a single variable for both interval types, or 3) contained separate but correlated variables for short vs. long intervals. Results showed that the model that assumed a single timing mechanism for both ranges provided a better fit to the data than the model that assumed two completely independent ranges. However, the model with separate but correlated variables described the data equally well, and was preferred because it was more


parsimonious. The authors provided two possible explanations: 1) there are two distinct timing mechanisms for different ranges, but they have some cognitive (e.g. working memory) processes in common, or 2) the timing mechanism is hierarchically structured, with duration-specific components nested within a superordinate, common timing component.
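To make the subdivision point in footnote 4 concrete (our arithmetic, for illustration): if timing variance is proportional to the square of duration, Var(T) = k·T², then timing one interval of duration T gives SD = √k·T, whereas timing n successive subintervals of duration T/n with independent noise gives total Var = n·k·(T/n)² = k·T²/n, i.e. an SD smaller by a factor of √n. Chunking a 2-s interval into four 500-ms subintervals, for example, would be expected to halve the timing SD.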

9.1.5 Task- and modality-specific behaviors

Proponents of single-core, dedicated timekeepers face the further challenge of accounting for modality-, task-, and situation-specific timing characteristics. Weber slopes can be different across tasks (e.g. Merchant, Zarco, Bartolo, and Prado 2008), with steeper slopes for the relationship between variability and interval duration for tasks involving single intervals as compared to tasks involving multiple, repeated intervals, steeper slopes for tapping compared to repetitive circle drawing (Spencer and Zelaznik 2003; Robertson et al. 1999), and steeper slopes for some visual tasks compared to auditory tasks. In general, overall temporal accuracy also varies across tasks and modalities: temporal precision and accuracy are better for auditory compared to somaesthetic and visual stimuli (Goldstone and Lhamon 1974; Pastor and Artieda 1996; Merchant et al. 2008). For example, Pastor and Artieda (1996) report that the minimum time interval between two successive stimuli for these to be perceived as separate is around 20 ms for auditory stimuli, 40 ms for somaesthetic stimuli, and 60 ms for visual stimuli. Temporal accuracy is also better for filled (continuous stimuli) rather than empty intervals (Rammsayer and Lima 1991), better for production compared to perception tasks, and better for tasks which involve multiple successive intervals compared to those which involve single intervals (Ivry and Hazeltine 1995; Merchant et al. 2008), whether in perception or production. In addition, temporal judgments can be different for different modalities; e.g. Goldstone and Lhamon (1974) report that auditory intervals are judged longer than visual intervals of the same duration. Grondin (2010) reviews several other related findings.

One interpretation of these findings is that there is a common mechanism underlying these behaviors, but that different tasks and modalities lead to modifications in the operation of the mechanism. Along these lines, Merchant, Harrington, and Meck (2013) propose that findings of task- and modality-dependency are due to the neural connections between the core timer


(the cortico-(SMA)-thalamic-basal ganglia circuit, in their proposal⁵) and cortical areas, e.g. the parietal cortex for the interface between motor and sensory processes (Bueti, Walsh, Frith, and Rees 2008), and the cerebellum for motor processes, although they don’t provide an explicit mechanism for how this might work. A recent transcranial magnetic stimulation (TMS) study of time estimation in auditory vs. visual modalities (Kanai, Lloyd, Bueti, and Walsh 2011) supports this general idea, and additionally highlights the complexity of the connections among structures. Kanai et al. (2011) found that when TMS disrupted neural activity in the auditory cortex, time estimation in both the auditory and visual modalities was disrupted to the same degree, whereas when TMS disrupted neural activity in the visual cortex, time processing was only disrupted in the visual modality. These results suggest that the organization of the timing network is complex and possibly hierarchical, with auditory temporal processing having a privileged status. This privileged status is also reflected in the degree of precision that listeners have for processing time in the auditory modality, as noted earlier in this section.

Modality- and task-specific findings also raise the alternative possibility that separate timers may be used for different tasks and/or modalities, where the separate timing mechanisms all share characteristics that lead to the scalar property (Karmarkar and Buonomano 2007; Buonomano and Laje 2010 for motor timing). In these models, timing is coded as a by-product of coding for other sensory or spatial properties (Buonomano and Laje 2010; Laje and Buonomano 2013, for production; Karmarkar and Buonomano 2007 for perception; Buonomano 2014 for a review). That is, time is inherently ‘stamped’ in the neural traces of all activities,⁶ and as a result it can be ‘read out’ by adjusting the synaptic weights of excitatory connections from neurons in the cortex to downstream neurons which function as detectors tuned to intervals of specific durations (Buonomano 2000; Buonomano and Merzenich 1995). Proponents of such non-dedicated, ‘intrinsic’ timers cite perceptual evidence that timing acuity in perception is affected by immediate context, at least when the interfering contextual information is presented less than 300 ms earlier than the attended target interval (Karmarkar and Buonomano 2007; Spencer et al. 2009). And in production, neural processing of time can be difficult to differentiate from the processing of non-timing-related information, e.g. motivation (Kalenscher, Ohmann, Windmann, Freund, and Güntürkün 2006).

⁵ Whether other structures, e.g. the cerebellum, should be considered (part of) the core timer is controversial, cf. Diedrichsen, Ivry, and Pressing (2003), Jacobson et al. (2008).
⁶ See Jin, Fujii, and Graybiel’s (2009) findings, which are suggestive of time-stamping even in tasks where time is not an explicit requirement of the task.


For example, in Kalenscher et al.’s (2006) study, the timing of peaks in pigeon neural-firing ramps (i.e. monotonically increasing neural-firing rates) was related both to the time of a motor response and to the amount of anticipated reward. More generally, Buonomano (2014) suggests that timing acuity may actually relate to neural efficiency, citing Helmbold, Troche, and Rammsayer’s (2007) study showing that temporal sensitivity, as measured by e.g. temporal discrimination thresholds, correlates with general intelligence. On this view, the correlations of temporal acuity in different timing tasks across sensory modalities and motor tasks, observed by Ivry and Hazeltine (1995) and Merchant et al. (2008), would not be due to a shared core timer, but rather to shared neural processing efficiency. This view may be difficult to maintain, however, in the face of evidence that timing acuity can be selectively impaired: cerebellar pathology can cause impairment of temporal discrimination without causing impairment of loudness discrimination, and lesions of the cerebellum (though not the basal ganglia) can cause increases in timing variability (Spencer and Ivry 2013).

In sum, commonalities and differences in timing behavior across tasks and modalities raise many possibilities for the nature of timekeeping mechanisms and their interactions with other forms of processing. Nevertheless, there is clear agreement that humans have general timekeeping capabilities. The following section describes some additional complexities in this domain.

9.1.6 Periodicity- vs. non-periodicity-based timing

In addition to proposals for distinct timing mechanisms for different time ranges, and for motor vs. perceptual tasks in different modalities, different timing mechanisms have been proposed for periodicity-based (often called beat-based) vs. non-periodicity-based (often called non-beat-based) timing tasks. This distinction is relevant for speech for several reasons. Although speech production is not overtly periodic, some researchers (e.g. those in the AP/TD perspective; O’Dell and Nieminen 1999; Barbosa 2007) have proposed a central role for periodicity in movement coordination and suprasegmental timing control (Chapter 2). The XT/3C approach discussed in this book does not adopt this view (see Chapters 5 and 6), but does acknowledge that flexible periodicity must play a role in certain types of speech behavior, e.g. singing. Whether there are special timekeeping mechanisms for periodic tasks is therefore an important issue for speech production.


Several researchers claim that speakers and listeners use different procedures to perform timing tasks involving beats (i.e. which involve the extraction of perceived periodicity) vs. similar non-periodic timing tasks. However, as with other dichotomies in this theoretical domain, the evidence for the use of different timekeeping mechanisms in these two types of tasks is mixed. On the one hand, in production, beat-based tasks (which involve synchronization-continuation) and non-beat-based tasks (which involve single-interval reproduction) both show the scalar property (that is, an increase in variability with increases in timed interval duration), as well as significant correlations in timing behavior as measured by different tasks (Ivry and Hazeltine 1995; Merchant et al. 2008), suggesting that a common timekeeper might be used in the two types of tasks. On the other hand, there are also timing-related differences between the two types of tasks; for example, the slopes of the relationship between variability and interval duration are shallower for periodic tasks. This difference might be consistent with the idea of a different timing mechanism for periodic (beat-based) vs. non-periodic (non-beat-based) tasks, but another interpretation is that actors have a better representation of the to-be-timed interval in the periodic case, because it is repeated, and therefore are better able to maintain temporal accuracy in reproducing the intervals, even when the intervals are long in duration.

Another line of evidence that has been used to argue for a distinction in mechanisms between periodic and non-periodic, interval-based timing is the finding from perceptual studies that different brain regions are active during these two kinds of tasks. For example, neural studies have shown greater activation of the basal ganglia in beat-based discrimination tasks (i.e. in tasks in which there is a beat-inducing train of tones or pulses that precedes the intervals-to-be-compared), but greater activation of the cerebellum in non-beat-based discrimination tasks (Teki, Grube, Kumar, and Griffiths 2011). This has led to the proposal that beat-based timing is different from non-beat-based timing. Similarly, individuals with cerebellar degeneration, as well as individuals whose cerebellar function has been temporarily impaired in a TMS procedure (Grube, Cooper, Chinnery, and Griffiths 2010; Grube, Lee, Griffiths, Barker, and Woodruff 2010), show impaired discrimination ability for non-beat-based discrimination tasks (as evidenced by higher discrimination thresholds), but no impairment on beat-based tasks. This evidence is consistent with the idea of different mechanisms for beat- vs. non-beat-based timing, but it is not conclusive, given the differences in how listeners can respond to these two different types of tasks. In particular, the response mechanism to beat-based vs. non-beat-based tasks could be different, even


In particular, the response mechanism to beat-based vs. non-beat-based tasks could be different, even though the timing mechanisms involved in measuring interval durations and making comparisons might be the same. For example, in typical interval-comparison tasks involving beat-inducing stimuli in which an inter-beat interval is compared to a test interval (beat-based tasks), there is a beat-inducing train of tones or pulses that precedes the intervals-to-be-compared, and in non-beat-based tasks, there is either no precursor, or a precursor that doesn’t induce the perception of a beat. In beat-based tasks, listeners can find a beat, predict the occurrence of the next one, and detect a recurring, potentially abstract,⁷ temporal pattern over a stretch of time that spans multiple inter-beat intervals. In addition, the beat can persist in consciousness, even when the beat-inducing stimuli have ceased (Large and Palmer 2002). This persistence of the beat in consciousness means that an inter-beat interval can be considered a pre-activated standard to be used as a comparison interval in a discrimination task. In non-beat-based tasks, on the other hand, there is no beat to find, no prediction of an upcoming beat, no beat-based pattern that stretches over multiple intervals, and no repeated availability of the standard interval to be discriminated. That is, a standard interval is not pre-activated, and must be retrieved from memory to be compared with a test interval in a discrimination task.

These differences, rather than differences involved in tracking and comparing estimated durations of intervals with a standard, could potentially be driving the observed neural activation differences. In particular, there have been proposals that the basal ganglia are involved in beat prediction and beat adjustment (Grahn and Rowe 2013), as well as proposals that different mechanisms and neural structures are involved in the timing of relatively short (up to ca. 500 ms) vs. long durations, where the cerebellum has been proposed to be involved in the timing of relatively short durations (Koch et al. 2007; Spencer and Ivry 2013). Differences in neural activation for beat-based (periodic) vs. non-beat-based tasks therefore do not necessarily imply that the timekeeping mechanisms are different, since it may be the case that there are beat-detection (e.g. periodicity-detection) mechanisms in addition to general-purpose timekeeping mechanisms in the periodic, beat-based tasks that are not used in non-periodic, non-beat-based tasks, and/or that beat-based tasks may involve general-purpose timekeeping mechanisms used for both shorter- and longer-duration intervals, whereas non-beat-based tasks may only involve the mechanisms used for shorter intervals.

⁷ We would like to point out that these recurring temporal patterns are potentially abstract and symbolic, i.e. the inter-beat interval is not specified, so that the actor could reproduce them at a wide range of rates.


Moreover, the vocabulary used to refer to behavior in beat-based timing tasks vs. non-beat-based timing tasks suggests more of a qualitative difference in the actual timing mechanisms than may be warranted by the evidence. Behavior on beat-based timing tasks in perception is often referred to as ‘relative timing’, whereas behavior on non-beat-based timing tasks is referred to as ‘absolute timing’ (Grube et al. 2010; Teki et al. 2011; Teki, Grube, and Griffiths 2012). Because the perceptual tasks all involve discrimination, strictly speaking, they are all ‘relative timing’ tasks.⁸ However, as noted above, there are differences between the tasks that may give listeners an advantage in the beat-based task in forming a representation of the comparison standard, in the pre-activation of the standard, as well as other differences, e.g. of beat-finding, temporal pattern, etc. in the beat-based tasks.

Earlier researchers (e.g. Keele, Nicoletti, Ivry, and Pokorny 1989) proposed that beat-based timing should involve comparisons of the time of events, judged against a metronome-like benchmark set up from initial beat-inducing events. This type of timing was thought to contrast with non-beat-based, interval timing, proposed to function much more like a stopwatch, and to involve the comparison of interval durations, rather than time-relative-to-the-beat. Only interval timers were thought to be able to start at arbitrary times. This type of view was challenged by findings of Pashler (2001) and McAuley and Jones (2003), which showed that a beat-based ‘entrainment’ timer could model results of interval timing comparisons as well as comparisons that involved beat-inducing stimuli, as long as there was a minimal cost to oscillator resetting at the onset of to-be-timed intervals.

To summarize, although there are behavioral and neural differences associated with the performance of beat-based vs. non-beat-based tasks, behavioral timing differences across task types have not been qualitative. Neural activation differences may result from non-timing-related differences, or timing differences that relate to different timing ranges (second + millisecond ranges for beat-based tasks, as compared to only the millisecond range for non-beat-based tasks). As a result, a clear distinction between beat-based and non-beat-based timing is difficult to motivate unambiguously.

What does this mean for speech production? The answer to this question depends on several things: 1) the extent to which a distinction in beat-based vs. non-beat-based mechanisms can be maintained, 2) the extent to which the

⁸ Note that the authors do refer to what they call absolute timing as (delta-T), suggesting that they do acknowledge the relative timing nature of the non-beat-based task. They refer to what they call relative timing as (delta-T/Tbeat).


proposed distinct mechanisms are applicable in production as well as perception, and 3) the extent to which speech production shares characteristics of periodic tasks. As discussed in Turk and Shattuck-Hufnagel (2013), natural speech is not overtly periodic, and there is a great deal of debate as to whether, how, and under what circumstances speech might involve periodic control structures (see also Chapters 5 and 6). However, it may be that the structured, sequential nature of speech is shared with periodic tasks, but does not rely on a periodic mechanism. This type of commonality would suggest that findings relating to timekeeping abilities in periodic tasks may be relevant for speech because of its structure and sequencing, but that the relevance of periodicity-based mechanisms (e.g. beat-based timing) for speech is debatable, except in specific, periodic styles, e.g. singing.

9.1.7 Summary of timekeeping mechanisms

Although behavioral evidence suggests that humans and animals have a set of general-purpose timekeeping mechanisms for representing and keeping track of surface-timing characteristics in speech production and other motor and perceptual behaviors, it is far from clear how these mechanisms work. This section discussed some of the evidence that model developers must deal with: different neural structures involved in timing, commonalities and differences in timing behaviors for shorter (sub-second) vs. second and supra-second timing, commonalities and differences in timing behaviors for different task and perceptual modalities, and commonalities and differences in timing behaviors for ‘beat-based’ vs. non-beat-based tasks. The large number of available timekeeping models reflects a lack of consensus about how timekeeping works (cf. Wittmann 2013).

What does this mean for phonology-extrinsic timing models of speech production? Phonology-extrinsic timing models of speech production such as the XT/3C approach proposed in this book assume that timekeeping abilities exist that make it possible for speakers to represent surface durations, and to track time-remaining-until-event-occurrence. What this section suggests is that it is not currently possible to be certain about the exact nature of the timekeeping component(s) that are used in speech production. Researchers interested in this area will need to monitor ongoing developments and findings. While findings relating to motor production of movements and intervals lasting less than 500 ms are likely to be particularly relevant, findings relating to timing more generally, e.g. longer intervals, timing in perception,


beat-based and non-beat-based timing, etc. may also be relevant, because these intervals and tasks may be relevant to speech production, and/or because of commonalities in mechanisms across timescales and tasks. For the purposes of the XT/3C model proposed in this volume, it is only necessary that general-purpose timekeeping mechanisms exist, and that assumption is clearly supported by current findings.

The next section describes Lee’s General Tau theory of the temporal guidance of voluntary movement. This theory is particularly relevant for phonology-extrinsic models of timing in speech production, because it provides an account of the way movements evolve over time once they have begun, as well as of movement coordination. The theory assumes the existence of general-purpose timekeeping abilities to represent and track duration. The basic tenets of the theory are presented, along with supporting evidence from human and animal studies, including speech.

9.2 Lee’s General Tau theory

Like the Optimal Control Theory approaches discussed in Chapter 8, Lee’s (1998) General Tau theory models and predicts the smooth and single-peaked shape of velocity profiles (including observed velocity profile asymmetries; see Perkell, Zandipour, Matthies, and Lane 2002 and Chapter 2). In addition, General Tau theory also specifies how targets can be achieved at desired time points, and how movements can be coordinated with other events or movements, either occurring in the environment (e.g. a ball coming toward the catcher), or produced by the same individual (see discussion in Chapter 5). This approach provides an alternative to oscillator-based control (as in AP/TD) for modeling the time-course of movement, and for controlling the appropriate timing of target-related movement endpoints and movement coordination.

A basic assumption of Lee’s Tau theory is that all purposeful movements have goals of accomplishing some type of gap closure, i.e. to close a distance gap, a force gap, an angle gap, etc. For example, in a reaching movement, the gap to be closed is the distance between the hand’s current position and the target of the reaching movement. Tau is defined as the time to close a gap at the current gap-closing rate, equivalent to gap magnitude (e.g. distance) over instantaneous gap-closure rate. In typical practiced voluntary movements, tau is large at the beginning of movement, and gets smaller and smaller as the gap closes. Although tau is a function of spatiotemporal factors (gap magnitude and instantaneous gap-closure rate), tau itself is a duration, i.e. purely


temporal, and the way tau changes over time during a movement can be taken as a space-invariant description of the time-course of movement. The motor implementation of a tau profile in terms of speed (distance/time) requires a merger of spatial and temporal aspects of movement (cf. Georgopoulos 2002).

Tau theory was originally developed to account for collision-avoidance behavior (Lee 1976), where a prerequisite of successful collision avoidance was assumed to be perceiving tau at successive points in time, in order to estimate when collision was likely to occur. Lee and colleagues showed that the tau of an oncoming object can be directly specified by visual information. Given a gap between an observer and an approaching object in his/her line of sight, and assuming the gap closes at a constant rate, the time-to-contact of the observer with the object is given by 1 over the rate of dilation of the retinal image of the object. Neuronal mechanisms for the detection of tau have been discovered in the pigeon (Frost and Sun 2004), locust (Rind and Simmons 1999; Gabbiani, Krapp, and Laurent 1999), and gerbil (Shankar and Ellard 2000), particularly for cases where objects were on a collision course with the animal. By virtue of this direct apprehension of perceptual information, Tau theory appeared to conform to J. J. Gibson’s ecological theory of perception.

Recent work suggests that other types of information about tau often supplement the information given by the rate of retinal dilation of the image (Hecht and Savelsbergh 2004; see also de Azevedo Neto and Teixeira 2009). Thus there is a debate about how information relevant to tau can be computed and integrated. Despite this debate, the use of visually provided tau information in motor control is well supported (see Lee 2009 for a review). For example, Lee, Young, Reddish, Lough, and Clayton (1983) showed that knee and elbow angles of participants jumping up to punch a falling ball were continuously tuned to the tau of the ball and the expected place in space where the ball and the punching hand would meet. Such evidence suggests that tau is a common currency in perception and action.

Lee and Reddish (1981) showed that plummeting gannets used a critical value of tau to initiate their plunge. Crucially, the evidence suggested that the gannets initiated their dives when tau (time-to-contact at the current closure rate) reached a particular value, rather than initiating their dives at a time based on a more accurate estimate of time-to-contact, that is, one that took the constant acceleratory forces of gravity into account. The Lee and Reddish (1981) finding suggested that organisms might use a fixed tau criterion to trigger movements. However, since movement times depend on many things, e.g. distance-to-target (Fitts’ law), the fixed tau criterion does not appear to be


flexible enough to predict movement initiation times in many situations (cf. Lee 2009). However, as discussed in more detail later in this section, the tau framework is clearly valuable for guiding movement as it unfolds, even if movement initiation is not completely understood.

In the 1990s, Tau theory was further developed to include all sensory modalities (including the somatosensory system), as well as the closure of gaps of all varieties (force, distance, angle, etc.) (Lee 1998). In addition, it specified how the temporal profile of a movement can be planned before the movement begins, a development which makes the approach particularly useful for planning the temporal characteristics of speech movements. General Tau theory (Lee 1998) is thus a general timing theory, in which time-to-contact-at-current-rate-of-movement can be perceived, planned, and controlled in motor action.

The theory claims that actions can be continuously tau-guided in two ways: 1) on the basis of sensory information (TauS guidance), and 2) on the basis of an internal tau guide (the TauG Guide), which is an abstract ‘pattern’ for the time-course of movement; actors keep the tau of the gap to the target in constant ratio to the tau of the tau guide. The TauG Guide is proposed to be a neural-power gap G generated in the brain (neural power = rate of flow of energy in an ensemble of neurons = spike rate in the CNS), where neural power changes at a constant accelerating rate. The TauG Guide is one of a family of finite movements⁹ that can be described by the equation

τG(t) = ½(t − TG²/t),

derived from Newton’s equations of motion. The parameters of the equation are TG, the time the G-gap takes to close, and t, the elapsed time from the start of the movement (t runs from 0 to TG). The family of tau profiles for finite movements that can be generated by coupling onto the tau guide can be described by the equation

τY(t) = kY,G τG(t),

where different values of kY,G determine the shape of the Y-gap-closing movement’s velocity profile. The family of movements whose time-course is described by the τG coupling equation includes movement from rest at constant acceleration (like an object falling due to the force of gravity), where k = 1, but also includes movements from rest that involve acceleration followed by deceleration (as in a bell-shaped velocity profile), where k < 1. Different values of k < 1 show different velocity skewness profiles, that is,

⁹ Finite movements are movements from rest that have an acceleratory component and end after a finite duration.


earlier relative timing of the velocity peak for values of k closer to 0. Lee proposes that actors vary parameter k to control the way a target is approached; that is, the lower the k, the more gentle the approach and the longer relative time spent in deceleration, and the higher the k, the less gentle the approach, and the less relative time spent in deceleration. For example, movements that collide with their targets, and do not require precise positioning (e.g. for stop consonants), are expected to have higher values of k, whereas movements to targets that require precise positioning, e.g. for vowels, will be more gently controlled, and will therefore have lower values of k.
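To make the effect of k concrete, the following sketch (ours, with arbitrary parameter values) uses the closed-form gap trajectory Y(t) = Y0(1 − (t/T)²)^(1/k), which follows from integrating the coupling equation τY(t) = k·τG(t) with the tau guide above, and shows that the velocity peak moves earlier as k decreases:

import numpy as np

T = 0.2         # assumed movement duration (s)
Y0 = 12.0       # assumed initial gap size, e.g. distance to target (mm)
t = np.linspace(1e-4, T - 1e-4, 2000)

for k in (0.3, 0.5, 0.7, 1.0):
    # Gap coupled to the tau guide: Y(t) = Y0 * (1 - (t/T)^2)^(1/k)
    gap = Y0 * (1.0 - (t / T) ** 2) ** (1.0 / k)
    vel = np.gradient(-gap, t)             # gap-closing speed
    t_peak = t[np.argmax(vel)] / T         # relative time of the velocity peak
    print(f"k = {k:3.1f}: velocity peak at {t_peak:.2f} of movement time "
          f"(analytically, sqrt(k / (2 - k)) = {np.sqrt(k / (2 - k)):.2f})")

For k = 1 the profile accelerates throughout (peaking at the endpoint, as in free fall); for k < 1 the movement decelerates into its target, with gentler approaches (lower k) peaking earlier.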

9.2.1 Evidence for the operation of General Tau theory in speech motor control from a preliminary experiment

TauG-guidance can be empirically assessed by recursively linearly regressing τY(t) onto τG(t). TauG-guidance is diagnosed if more than 95% of the variance in the data is accounted for by the TauG Guide equation (i.e., r² > 0.95). The regression slope measures kY,G. Findings consistent with TauG-guidance have been observed for many types of human and animal movements (references in Lee 1998 and 2009), including preliminary findings for speech data. Results of a pilot experiment (Lee and Turk, in preparation) involving tongue-tip movements for /d/ in e.g. deed, did, dead, Dade, dad in different prosodic contexts, spoken by one speaker, indicate tauG guidance (r² > 0.95) for, on average, over 95% of data points. TauG guidance was consistent across all words, although the mean and s.d. of k, A (amplitude of movement), and T (duration of movement) differed. Figure 9.1 below illustrates TauG guidance for the tongue tip for the word dad.

Lee and Turk’s further work on electromagnetic articulometry data from five speakers in the Edinburgh Speech Production Facility Doubletalk corpus (Scobbie, Turk, Geng, King, Lickley, and Richmond 2013) confirms that the TauG Guide and tau-coupling equations provide a good fit to speech data, for sensors attached to the jaw (central and lateral), lips (upper and lower), and tongue (back, mid, front), as shown in Table 9.1.


[Figure 9.1: three panels. (a) Vertical tongue displacement z (mm) and velocity zvel (mm s–1) over time for tongue movement record 0004.1. (b) tau(z-zend) and tauG (s) plotted against time (s), roughly 0.65–1.0 s. (c) Recursive linear regressions of tau(z-zend) on tauG: downward movement, K = 2.336, A = 12.0 mm, T = 175 ms, r² = 0.969, 92% (32/35), y-intercept = 72 ms; upward movement, K = 0.338, A = 12.9 mm, T = 155 ms, r² = 0.998, 97% (30/31), y-intercept = 2 ms.]

Figure 9.1 TauG guidance of the tongue when saying ‘dad’. (a) Vertical displacement (z) and velocity (zvel) of tongue. z was sampled at 200 Hz and smoothed with a Gaussian filter, sigma 4, yielding a 9 Hz cut-off. (b) Tau(z-zend) = (z-zend)/zvel, computed from (a). Zend is z at the end of the downward or upward movement. tauG = 0.5(t-T²/t), where T = movement duration and time t runs from 0 to T. (c) Recursive linear regressions of tau(z-zend) on tauG, for downward movement (left panel) and upward movement (right panel). Note: K = slope of regression, which equals the coupling constant in the tauG guidance equation tau(z-zend) = K·tauG if r² = 1. A = movement amplitude. T = movement duration. r² = proportion of variance accounted for by the tauG guidance equation. % is the percentage of data points in the regression that yielded r² > 0.95.


Table 9.1 TauG guidance of the jaw, lips, and tongue in monologue recordings from the ESPF Doubletalk corpus

Participant       Direction of   Range of number of     Range of % of data    Range of standard
                  movement       analyzed movements,    points that yielded   deviations, across
                                 across sensors         r² > .95, across      sensors
                                                        sensors
R0033_0021_cs5    Down           63–80                  93–94                 4.3–12.2
                  Up             56–82                  93–95                 3.5–6.2
R0039_0020_cs5    Down           58–85                  93–95                 2.2–9.1
                  Up             65–91                  93–95                 5.0–6.9
R0036_0023_cs6    Down           20–131                 93–95                 4.1–10.4
                  Up             14–120                 89–95                 2.3–25.75
R0039_0022_cs6    Down           43–139                 95–95                 3.2–5.7
                  Up             40–142                 93–95                 3.4–15.6
R0036_cs5         Down           37–58                  91–95                 3.5–9.5
                  Up             32–55                  93–95                 2.1–7.7

Note: The number of analyzed movements differed per sensor, and the percentage of data points within movements that yielded r² > .95 (and standard deviations) was computed separately for each sensor.
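As a simplified sketch of this diagnostic (ours; the published analyses apply a recursive version of the regression to articulometry records, whereas this example fits synthetic, noise-free data once), the script below computes tau of the gap from displacement and velocity, regresses it on the tau guide, and recovers the coupling constant:

import numpy as np

T, A = 0.175, 12.0                         # assumed duration (s) and amplitude (mm)
k_true = 0.4                               # hypothetical coupling constant
t = np.linspace(0.01 * T, 0.99 * T, 200)

z_gap = A * (1.0 - (t / T) ** 2) ** (1.0 / k_true)   # synthetic tau-guided gap
z_vel = np.gradient(z_gap, t)              # gap is closing, so z_vel < 0

tau_z = z_gap / z_vel                      # tau of the gap: gap / closure rate
tau_g = 0.5 * (t - T ** 2 / t)             # the tau guide, tauG(t) = 0.5(t - T^2/t)

slope, intercept = np.polyfit(tau_g, tau_z, 1)
r2 = np.corrcoef(tau_g, tau_z)[0, 1] ** 2
print(f"K = {slope:.3f}, y-intercept = {intercept * 1000:.1f} ms, r^2 = {r2:.4f}")

With tau-guided input, K recovers k and r² approaches 1; real movement data are assessed against the r² > 0.95 criterion used in Figure 9.1 and Table 9.1.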

The TauG Guide mechanism therefore appears to be a plausible alternative to AP/TD’s mechanism for creating appropriately shaped velocity profiles, where AP/TD’s mechanism manipulates gestural activation rise time to shape the default velocity profile generated by mass–spring movements. Sorensen and Gafos (2016) suggest that mass–spring systems with non-linear restoring forces can also model more realistic velocity profiles, without needing gradual gestural activation rise times; however, General Tau theory has the advantage of providing a way to manipulate velocity skewness, which Sorensen and Gafos’ (2016) proposal does not have.

Lee’s General Tau theory also provides a mechanism to accomplish target-related, endpoint-based movement coordination (cf. discussion in Chapter 5). Specifically, according to tau theory, when two movements are tau-coupled, their taus are in constant proportion:

τ(Y, t) = kY,X τ(X, t),

where Y and X are the gaps (i.e. Y is the gap that is being actively controlled, and X is either the internal TauG-Guide gap or a sensed gap that is being coupled onto), t is time, and k (the coupling constant) determines the kinematics of the Y-gap-closing trajectory relative to the X-gap-closing trajectory. When two gaps are tau-coupled, they are guaranteed to reach their endpoints at the same time. This is because when two movement tau functions are in constant proportion, e.g. τY(t) = kY,X τX(t), τX(t) reaches zero as the endpoint is reached, and because τY(t) is in constant proportion to τX(t), it reaches zero at the same time (Lee, Georgopoulos, Clark, Craig, and Port 2001). Although two coordinated movements might begin at the same time, they do not have to, as long as they are tau-coupled by the end of


the movement. This aspect of the model is relevant to the findings discussed earlier (see Chapters 4 and 5), suggesting coordination based on movement endpoints, and showing lower timing variability at movement endpoints compared to other parts of movement.
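A small numerical illustration (ours, under the same closed-form assumption as the earlier sketches): two gaps of different sizes, coupled to a single internal tau guide with different coupling constants, close at exactly the same moment:

import numpy as np

T = 0.15                                   # assumed shared tau-guide duration (s)
t = np.linspace(0.0, T, 1001)

def coupled_gap(y0, k):
    # Gap obeying tau_Y(t) = k * tau_G(t): Y(t) = y0 * (1 - (t/T)^2)^(1/k)
    return y0 * np.clip(1.0 - (t / T) ** 2, 0.0, None) ** (1.0 / k)

lip_gap = coupled_gap(8.0, 0.4)            # hypothetical lip-closure gap (mm)
jaw_gap = coupled_gap(3.0, 0.8)            # hypothetical jaw-raising gap (mm)

for i in (500, 900, 1000):                 # halfway, late, and at t = T
    print(f"t = {t[i]:.4f} s: lip gap {lip_gap[i]:7.4f} mm, "
          f"jaw gap {jaw_gap[i]:7.4f} mm")
# Despite different k and starting distances, both gaps reach zero together
# at t = T: coordination is defined at the movement endpoints.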

9.2.2 The relevance of Tau theory to aspects of speech motor control discussed earlier

Lee’s General Tau theory (Lee 1998, 2009) provides an elegant account of three aspects of movement behavior relevant to speech: 1) smooth, single-peaked velocity profiles, 2) endpoint-based coordination, and 3) greater timing accuracy at movement endpoints compared to other parts of movement. As discussed in Chapter 2, in order to achieve realistic velocity profiles in AP/TD, the movements generated by mass–spring systems must be activated and deactivated in a gradual way. Lee’s General Tau theory provides a way to generate realistic velocity profiles with a simpler mechanism (one Tau-theory equation of motion, as compared to one mass–spring equation plus one activation equation). As discussed in Chapter 5, AP/TD uses onset-based coordination, but does not have a principled way of accounting for greater spatial and temporal accuracy at movement endpoints, which is suggestive of endpoint-based coordination. Lee’s theory provides a way of modeling this type of coordination, and has an explanation for greater timing accuracy at endpoint achievement compared to movement onset. For all of these reasons, Lee’s General Tau theory will be used as a key component of the XT/3C-v1 approach presented in Chapter 10.

9.3 Summary

This chapter provided an overview of the current debate about the nature of human timekeeping abilities, and discussed Lee’s General Tau theory of the temporal guidance of movement (Lee 1998, 2009). The bulk of the evidence suggests that general-purpose timekeeping abilities exist, but the large number of available candidate models of these abilities attests to many uncertainties about how they work. As a result of this uncertainty, we will assume that general-purpose timekeeping abilities exist, but will not hypothesize about their exact mechanisms. Lee’s General Tau theory provides an elegant alternative to AP/TD’s oscillators for the generation of movement velocities,


provides a mechanism for endpoint-based movement coordination, and provides an account of greater timing accuracy at movement endpoints compared to other parts of movement. We will therefore adopt this theory in the sketch of an XT/3C model presented in Chapter 10.


10 A sketch of a Phonology-Extrinsic-Timing-Based Three-Component model of speech production

The goal of this chapter is to present a sketch of a Phonology-Extrinsic-Timing-Based Three-Component model of speech production. Previous chapters provided motivation for the general architecture of the proposed approach (Chapter 7), suggested the Stochastic Optimal Feedback Control Theory computational framework (Todorov and Jordan 2002 et seq.) for determining movement parameter values and for ensuring that goals are met at minimum cost (Chapter 8), and discussed the nature of general-purpose timing mechanisms, including mechanisms to generate appropriate velocity profiles and movement coordination, required for a phonology-extrinsic timing model (Chapter 9). The current model sketch builds on these ideas, laying out in more detail what should occur in each of the three model components and the interfaces between them, and presents supporting evidence for the content of the components.¹

As noted earlier, the proposed approach is distinct from AP/TD in including three separate stages: 1) Phonological Planning, 2) Phonetic Planning, and 3) Motor-Sensory Implementation. Because the representations in the Phonological Planning stage are abstract and symbolic, i.e. do not include quantitative aspects of timing or other information, and because those quantitative details of movement (including timing) are planned in the Phonetic Planning stage, timing in this model is extrinsic to the phonology. That is, quantitative aspects of timing are computed in the Phonetic Planning Component, and tracked and adjusted in the Motor-Sensory Implementation Component, in order to realize the goals specified during Phonological Planning. Moreover, it is assumed that the mechanisms used to represent, specify,

¹ The operation of these three processing components makes use of, but is not limited to, the grammatical knowledge of the speaker. Thus in our view the process of planning a particular utterance and implementing that plan includes many components that are not a part of the grammar.



and track timing in the Phonetic Planning and Motor-Sensory Implementation components are general-purpose mechanisms that are also used for non-linguistic motor activity.

The proposed approach is not the only model of speech production that includes three separate components, symbolic representations, and phonology-extrinsic, general-purpose timekeeping mechanisms. However, none of the existing models provides a full articulatory account of timing behavior, with the flexibility and predictive power of current optimization models (cf. Chapter 7). The goal here is therefore to show how the three-component approach can be developed to provide a more comprehensive account of these and other well-established aspects of the control of timing behavior in speech.

The specific version of an XT/3C approach adopted here, called the Phonology-Extrinsic-Timing-Based Three-Component model, version 1 (XT/3C-v1), includes a number of key elements, which will be described briefly in this section and expanded on in the remainder of this chapter. The key features of the Phonological Planning component are that 1) symbolic representations that specify the forms of words (embodying lexical contrast) are first slotted into a Prosodic Planning frame (which itself is planned according to principles of Smooth Signal Redundancy, cf. Section 10.1.2), 2) task requirements for the utterance are specified and prioritized; these include its phonological structure (i.e. the phonemic representations of words in their prosodic context), as well as other requirements, e.g. to speak quickly or in a particular style, and 3) these are used to condition the choice of sets of context-specific qualitative acoustic cues to distinctive features (Stevens 2002); like the other task requirements, these cues are also given relative priorities. Feature cues provide part of the required translational link between the symbolic representations of Phonological Planning and the quantitative representations in the Phonetic Planning Component.

The Phonetic Planning Component is the stage where the quantitative details of the utterance are planned. The XT/3C-v1 approach borrows from other approaches in 1) mapping symbolic representations onto values of phonetic parameters (Guenther 2016) and 2) planning movement parameter values from any state (i.e. formulating a control policy, in Optimal Control Theory terms) that represent the optimal compromise between competing phonological task requirements and movement costs (Chapter 8). Timing in this approach involves planning 1) durations between acoustic landmarks, 2) durations of movements required to achieve the landmarks, and 3) values of parameters that determine the shape of velocity profiles and movement


coordination (Lee 1998, 2009). Values for these parameters result from the optimization process, which balances costs for time and effort, in addition to the costs of not meeting the task requirements identified in the Phonological Planning stage.

The proposed Motor-Sensory Implementation component follows SOFCT approaches in involving continuous monitoring of the current state of the system (from efference copy and sensory feedback), and adaptation of the control signal in order to achieve the goals for the utterance specified in the Phonological Planning stage. Because the formulation of the control policy within SOFCT yields the optimum motor commands to achieve the specified goals from any current state, SOFCT systems can adapt to perturbations that occur after movements have begun, and can predict and compensate for effects of motor noise if these are likely to interfere with the task goals.

This proposal shares several features with current proposals for the interactions among phonology, phonetics, and speech motor control. Like AP/TD and most current approaches to phonology, it assumes that requirements related to signaling lexical contrast and prosodic structure are planned in the Phonological Planning Component. Also like AP/TD, it assumes that articulatory movements are organized hierarchically in task-dependent synergies, or coordinative structures. (See Chapter 8’s discussion of Todorov and Jordan’s 2002 and O’Sullivan, Burdet, and Diedrichsen’s 2009 explanations for why these are advantageous.) Like AP/TD and Fujimura (1992 et seq.), it assumes that prosodic structure plays a substantial role in influencing surface-timing patterns. As in Perkell (2012) and DIVA (Guenther 1995, 2016), it assumes that the goals of speech production are to produce acoustic/auditory information that signals linguistic entities as well as para- and sociolinguistic characteristics, and that an internal model of the relationship between articulation and its sensory consequences (auditory and somatosensory) is learned during development, and maintained throughout the lifespan. It proposes the use of Optimal Control Theory principles in modeling speech production planning, like Nelson (1983), Nelson et al. (1984), Lindblom (1990), Šimko and Cummins (2010, 2011), Flemming (1997, 2001), Katz (2010), Braver (2013), Windmann (2016), and Lefkowitz (2017). And like Houde and Nagarajan (2011) and Hickok (2014), it advocates taking advantage of recent developments in stochastic optimal feedback control theory (SOFCT).

However, unlike AP/TD and other two-component models of speech production (e.g. Flemming 1997, 2001; Šimko and Cummins 2010, 2011), this proposal is committed to three components, with symbolic phonological representations and phonology-extrinsic timing. As a result, the model


assumes a bridging mechanism, in the form of context-appropriate selection of individual acoustic cues to distinctive features, that serves to interface the symbolic Phonological Planning Component with the quantitative Phonetic Planning Component. This bridging mechanism is different from the bridging mechanism proposed in Fujimura’s work, which is based on syllables and distinctive features that are mapped onto a syllable pulse train and elemental gestures (Fujimura 1992 et seq.; see Chapter 7 for more discussion).

Following Guenther (1995), Flemming (2001), Perkell (2012), and Guenther (2016), the primary goal of the phonetic and motor-sensory implementation components is to produce acoustic cues; articulatory movements are planned in service of this goal. In the XT/3C-v1 approach these cues are context-specific, that is, they are chosen to signal the characteristics of phonemes in a particular prosodic and morphophonological context, and are appropriate for a chosen style and rate of speech. Movement targets (planned endpoints) are chosen to produce these cues and satisfy other task requirements at minimum cost. In this regard our proposal is different from AP/TD and others who propose idealized, context-independent movement targets. Instead, it follows Lindblom’s (1990) proposal of context-specific movement targets that produce acoustic cues that are sufficiently distinct in context to support lexical access.

At the core of the proposed approach is the use of phonology-extrinsic timing implemented using general-purpose timekeeping mechanisms. As discussed in Chapter 9, although there is general agreement that such mechanisms exist, there is a great deal of debate about their exact nature. The proposal thus assumes mechanisms for the representation and specification of interval and movement durations, but will not adopt a specific timekeeping model for those functions. Lee’s (1998, 2009) General Tau theory is used to model the time-course of movement as well as endpoint-based movement coordination. Central to Lee’s theory, in its most general and extended form (Lee 1998, 2009), is the assumption that humans and animals keep track of tau, i.e. of the time-until-gap-closure at the current gap-closure rate, and that movement coordination is based on tau-coupling, either to another movement or to an internal tau guide whose duration can be pre-specified. It is assumed that tau guidance is one of the general-purpose timing abilities required for speech production, just as it is for other motor tasks; it is the mechanism that XT/3C-v1 proposes to use to generate movements that vary over time in an appropriate way. Lee’s General Tau theory thus provides a mechanism for generating the time-course of movement. When combined with spatial information (e.g. distance-to-target from the current position), it specifies how a movement reaches a goal, and provides an alternative to AP/TD’s mass–spring


systems for generating movement (cf. Chapter 2). In addition, this theory is also used to model goal-related, endpoint-based coordination, and thus provides an alternative to AP/TD’s planning oscillator coordination mechanism (cf. Chapter 5), which proposes coordination based on movement onsets rather than their endpoints. Finally, the proposed theory also provides a principled account of greater timing accuracy at movement endpoint compared to movement onset.

This chapter presents a sketch of each of the proposed components of the three-component speech planning model, i.e. Phonological Planning (Section 10.1), Phonetic Planning (Section 10.2), and Motor-Sensory Integration (Section 10.3). It provides more detail about these components than is given in Chapter 7, including information about the proposed structures and processes involved in each stage, along with supporting evidence. This version of a Phonology-Extrinsic-Timing-Based Three-Component Model (XT/3C-v1) represents our current best guess as to how these three stages might work; considerable computational development and further experimentation will be required to evaluate and refine these proposals.
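To make this division of labor concrete, here is a purely illustrative scaffold (our own; the authors provide no implementation, and every name below is hypothetical). The point is only architectural: the Phonological Plan contains symbols and priorities, and numbers such as durations and coupling constants appear first in the Phonetic Plan.

from dataclasses import dataclass
from typing import List

@dataclass
class PhonologicalPlan:            # symbolic only: no durations, no distances
    prosodic_frame: str            # e.g. a bracketed constituent/prominence string
    words: List[str]
    feature_cues: List[str]        # qualitative acoustic cues chosen per context
    task_priorities: List[str]     # e.g. "clarity > speed"

@dataclass
class PhoneticPlan:                # quantitative: times, distances, k values
    landmark_interval_ms: List[float]   # durations between acoustic landmarks
    movement_duration_ms: List[float]   # durations of landmark-achieving movements
    tau_coupling_k: List[float]         # velocity-profile shape parameters

def phonetic_planning(plan: PhonologicalPlan) -> PhoneticPlan:
    # Stand-in for the optimization over time, effort, and task-requirement
    # costs; a real planner would derive these numbers rather than fix them.
    return PhoneticPlan([80.0, 120.0], [150.0, 175.0], [0.4, 0.7])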

10.1 Phonological Planning

This proposal assumes that the Phonological Planning Component involves planning and prioritizing all of the task requirements for an utterance. In this sense it is different from the Phonological Component of the grammar, since the processing that occurs in this speech-production component is more extensive than that in the lexical phonology component of the grammar, although it makes use of similar abstract symbolic representations of word forms. The steps in this component of the planning process include 1) generating a prosodic planning frame that includes a hierarchy of prosodic constituents and prominence structure, 2) slotting the symbolic representations of the target words into the prosodic frame (Garrett 1980; Levelt 1989; Shattuck-Hufnagel 1992; Keating and Shattuck-Hufnagel 2002; and Pierrehumbert 2016), 3) specifying other task requirements for the utterance (e.g. style and rate of speech), 4) prioritizing the segmental, prosodic, and other task requirements in relation to one another, and 5) choosing and prioritizing appropriate acoustic/auditory cues to the features of the target words in order to achieve these task requirements, that is, specifying landmarks (Stevens 2002) and other acoustic cues for the characteristics of the phonemes in the target words. In the proposed model, therefore, planning the prosodic structure for a particular utterance, and the phonological elements to


go into that structure, is just part of the planning that takes place in the Phonological Planning Component. That is, the Phonological Planning Component involves planning the full set of task requirements, which additionally includes planning speech style, relative rate, and the types of acoustic cues for each context. The representations that are used in this component therefore include more than just the lexical representations of the words; they additionally include prosodic structure, as well as representations for types of acoustic cues chosen to signal each contrastive element in its context. Importantly, all of the task requirements planned in the Phonological Planning Component involve symbolic representations, structures, and relational expressions, but no quantitative specifications for how they are produced. The specification of the quantitative aspects of the utterance-specific surface realization of these symbolic elements occurs in the subsequent Phonetic Planning Component.

This section discusses and motivates the structures, cues, and other task requirements that are planned and prioritized during Phonological Planning. Following Aylett (2000), Aylett and Turk (2004), and Turk (2010), the prosodic frame is itself planned in order to achieve Smooth Signal Redundancy; that is, planning decisions are made based on the principle that linguistic elements across an utterance should have an equal chance of being identified by a listener.² The prosodic nature of this planning frame is motivated in Section 10.1.1, focusing on its account of systematic effects on speech timing; subsequent sections provide more detail about Smooth Signal Redundancy (Section 10.1.2), serial ordering of phonological elements into the prosodic frame (Section 10.1.3), planning other task requirements (Section 10.1.4), and planning the qualitative acoustic cues to features appropriate to each context (Section 10.1.5). These symbolic, qualitative cue specifications provide input to the Phonetic Planning Component, where they will receive their quantitative spectral, spatial, and temporal specifications.

² On this view, linguistic elements include things like phonemes, syllables, and words, as well as lexical tone and intonational tone. We assume that prosodic constituent and prominence structure is planned to facilitate recognition of these linguistic elements in their syntactic structures.

10.1.1 Prosodic structure: Hierarchies of constituents and prominences

In the proposed model, in addition to conveying meanings, prosodic structure has three key functions: 1) to constrain the choice of acoustic cues used to signal lexical contrast, 2) to determine their relative distinctiveness or clarity,


in order to enable successful utterance recognition by the listener, and 3) to serve as a planning framework into which the phonological elements that specify the target lexical items are serially ordered. This section first discusses behavioral evidence for planning based on prosodic structure, with a focus on evidence for its influence on systematic timing patterns, and then presents a hypothesis relating to the function of this planning framework as a sequence of sites for the serial ordering of sublexical elements.

10.1.1.1 Evidence for prosodic constituent structure

A number of findings in the literature suggest that the structure that directly defines the word groupings and grouping-based prominences observed in speech is influenced by syntax, but not isomorphic to it (cf. Shattuck-Hufnagel and Turk 1996). These findings include evidence from segmental phonology (e.g. Selkirk 1978; Nespor and Vogel 1986) and intonational phonology (Beckman and Pierrehumbert 1986), evidence from the distribution of pauses and their durations (Grosjean and Collins 1979; Gee and Grosjean 1983), reliable labeling of different boundary strengths at word junctures in spoken corpora, as reflected in e.g. ToBI³ labeling of the BU Radio News Corpus (Pitrelli, Beckman, and Hirschberg 1994; Ostendorf, Price, and Shattuck-Hufnagel 1995), and durational phenomena such as final and initial lengthening (e.g. Wightman, Shattuck-Hufnagel, Ostendorf, and Price 1992; Ferreira 1993; Keating 2006; Byrd and Saltzman 2003).

The structure implicated by these findings, called prosodic constituent structure, is hierarchical, and includes constituents such as prosodic words and perhaps feet and/or syllables at lower levels, and phrases of various sizes at higher levels. Although there are debates about many aspects of the prosodic hierarchy, e.g. about the number of levels in the hierarchy that are qualitatively distinct from each other, about whether there is recursion, and about the name and definition of each constituent type, there is general agreement that this hierarchy is flatter and more symmetric than syntactic structure (Selkirk 1978; Gee and Grosjean 1983). Moreover, there is broad agreement that there is a scale of boundary ‘strengths’. However, it is not completely clear whether differences in boundary strength are entirely tied to different levels in the prosodic hierarchy, or whether some differences in boundary strength are scalar, i.e. gradient (Wagner 2005). Prosodic constituent structure is a likely linguistic

³ ToBI stands for Tones and Break Indices. It is a system for transcribing intonation (pitch accents, phrase accents and boundary tones), as well as the strength of prosodic constituent boundaries. See e.g. Beckman, Hirschberg, and Shattuck-Hufnagel (2005) for a more complete discussion.


universal, although different languages may elect to employ different sets of levels from the universal hierarchy (Jun 2005). Support for the universality of prosodic structure comes from the ubiquitous occurrence of pausing and constituent-final and -initial lengthening patterns that reflect a structural hierarchy in languages of the world (see for example Vaissière 1983; Keating, Cho, Fougeron, and Hsu 2003). An example of proposed prosodic constituent structure for an English utterance is shown in Figure 10.1 (see Shattuck-Hufnagel and Turk 1996 for more discussion).

Prosodic constituent structure is important to include in any model of speech production designed to account for systematic patterns of duration, because of its measurable effects on durational phenomena. These include:

• initial lengthening (longer durations at the onsets of prosodic constituents, e.g. Cooper 1991; Oller 1973; Fougeron and Keating 1997; Fougeron 1998; Byrd 2000; Cho 2002; Cho and Keating 2009; Bombien, Mooshammer, Hoole, and Kühnert 2010; Byrd, Krivokapić, and Lee 2006), which often correlates with initial strengthening or hyperarticulation (Fougeron and Keating 1997 et seq.),

• final (or pre-boundary) lengthening (longer durations at the ends of prosodic constituents, e.g. Kohler 1983 for German; Wightman et al. 1992, Turk and Shattuck-Hufnagel 2007, and Cho 2002 for English; Cambier-Langeveld 1997, 2000 for Dutch; Krull 1997 for Estonian; and Berkovits 1994 for Hebrew);

[Figure 10.1: tree diagram. An Utterance node dominates two Phrase nodes; PWord nodes dominate Mary’s, cousin, George, and baked, with the and cake attached under the unlabeled nodes described in the caption below.]

Figure 10.1 An example prosodic structure for Mary’s cousin George baked the cake. Note: PWord = Prosodic Word. The nodes dominating the and the cake are left unlabeled because the nature of the constituents that dominate content + function word sequences is controversial (see e.g. Selkirk 1996). Source: Turk and Shattuck-Hufnagel (2014; Figure 2). Reproduced with permission from the Royal Society.


• polysyllabic shortening, i.e. the shortening of syllables when more of them occur within a prosodic constituent (e.g. Lehiste 1972 and Kim and Cole 2005 for English; Lindblom 1968 for Swedish; Nooteboom 1972 for Dutch), sometimes difficult to distinguish from other effects (White and Turk 2010; Windmann, Šimko, and Wagner 2015a),

• polysegmental shortening (the shortening of segments when more of them occur within a prosodic constituent, e.g. Abercrombie 1967; Jones 1950, cited in Maddieson 1985; Lehiste 1960; Munhall, Fowler, Hawkins, and Saltzman 1992; Waals 1999; Katz 2012; although this shortening does not achieve isochrony, Dauer 1983), and

• pause (e.g. Grosjean and Collins 1979; Krivokapić 2007; see Klatt 1976, White 2002, and White 2014 for reviews).

To elaborate, phrase-related initial and final lengthening affect specific parts of prosodic-phrase-initial and -final words, respectively (cf. White 2002, 2014). Initial lengthening appears to be primarily localized on the initial C in phrase-initial CV and CCV sequences (Cho and Keating 2009; Bombien et al. 2010; Byrd et al. 2006). In final position, most of the lengthening occurs on the rhyme of the final word; smaller, but significant, amounts of lengthening have also been observed on lexically stressed syllable rhymes when the lexically stressed syllable is pre-final, as in Michigan or Trinidad (see Cambier-Langeveld 1997 for Dutch, and Turk and Shattuck-Hufnagel 2007 for American English). (Lengthening at other sites, e.g. the onset consonant of the phrase-final syllable rhyme, has also been observed, but these effects are sporadic in the sense that they appear to be study- or material-dependent, and may possibly be speaker-dependent.) For both initial and final lengthening, the magnitude of the durational effects varies with boundary strength: stronger boundaries (boundaries of constituents at higher levels in the hierarchy, e.g. phrases) are generally associated with greater degrees of lengthening (Wightman et al. 1992; Keating 2006), but interestingly not with an expanded domain of lengthening (Cambier-Langeveld 1997). For discussions of polysyllabic shortening, see Turk and Shattuck-Hufnagel (2000), Turk (2012), Turk and Shattuck-Hufnagel (2013), White (2014), and Windmann, Šimko, and Wagner (2015), as well as Chapter 6; see also Krivokapić (to appear).
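As a deliberately crude illustration of how such adjustments could be expressed (our sketch, in the spirit of rule-based duration models such as Klatt 1976, and not the optimization-based account developed in this book), the multipliers below are invented, and real effects are localized to specific sub-word sites rather than applied to whole words:

def adjusted_duration_ms(base_ms, phrase_initial=False, phrase_final=False,
                         n_syllables_in_constituent=1):
    d = base_ms
    if phrase_initial:
        d *= 1.15                            # hypothetical initial-lengthening factor
    if phrase_final:
        d *= 1.40                            # hypothetical final-lengthening factor
    d *= n_syllables_in_constituent ** -0.1  # hypothetical polysyllabic shortening
    return d

print(adjusted_duration_ms(100.0, phrase_final=True, n_syllables_in_constituent=3))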


Prosodic constituent structure also affects non-durational phonetic parameters. These include constituent-initial and -final voice quality modifications (Pierrehumbert and Talkin 1992; Ogden 2004; Dilley, Shattuck-Hufnagel, and Ostendorf 1996; Redi and Shattuck-Hufnagel 2001; Tanaka 2004; Garellek 2014), supralaryngeal articulatory modifications, such as phrase-initial strengthening, and syllable-final and/or ambisyllabic lenition (Keating, Cho, Fougeron, and Hsu 2003; Fougeron and Keating 1997; Lavoie 2001; Turk 1994). Additional supporting evidence comes from phenomena described as phonological external ‘sandhi’ rules (e.g. French liaison, Italian raddoppiamento sintattico, etc.; Selkirk 1978; Kaisse 1985; Nespor and Vogel 1986), the placement of pitch accents near the beginnings or ends of constituents (Bolinger 1965, 1985; Shattuck-Hufnagel, Ostendorf, and Ross 1994; Selkirk 1995; Astésano, Bard, and Turk 2007), as well as from other intonational phenomena, e.g. phrase-final lowering and phrase-initial reset (cf. Beckman and Pierrehumbert 1986; Ladd 2008, among others). Thus, the evidence that a hierarchical prosodic constituent structure plays a role in speech production is extensive.

10.1.1.2 Evidence for prosodic prominence structure

Prosodic structure also includes a hierarchy of prosodic prominences, which describes different degrees of stress/accent found in words and/or phrases. For example, in one prosodification of the phrase Mary’s cousin George, George is the most prominent word in the phrase, and is said to bear nuclear phrasal stress (also called nuclear sentence stress, or nuclear accent; Chomsky and Halle 1968). In the words Mary and cousin, the word-initial syllables Ma(r)- and cou- are more prominent than the second syllables in these words, and are said to bear word- or lexical stress. In addition, Mary also bears phrasal prominence, called pre-nuclear prominence or pre-nuclear accent, although in this ‘prosodification’ its phrasal prominence is not as salient as the nuclear prominence on George. Figure 10.2 shows a grid-like representation of prominence structure (Liberman and Prince 1977; Hayes 1983; Selkirk 1984; Halle and Vergnaud 1987), illustrated for this phrase. Like prosodic constituent structure, prosodic prominence structure is thought to be hierarchical, with word stress near the bottom of the hierarchy, and phrasal stress at higher levels (Beckman and Edwards 1994).

[Figure 10.2: metrical grid, with columns of x’s over the syllables of Mary’s cousin George; the tallest column stands over George.]

Figure 10.2 A grid-like representation of prominence structure for one possible prosodification of Mary’s cousin George. Source: Turk and Shattuck-Hufnagel (2014; Figure 3). Reproduced with permission from the Royal Society.


Prosodic prominence structure has measurable effects on a set of acoustic characteristics, including fundamental frequency (F0), sound pressure level, duration, vowel quality, and spectral balance, with many documented cross-linguistic differences (Cho 2006; Sluijter and van Heuven 1995, 1996; van Heuven and Sluijter 1996). Beckman and Edwards (1994) suggested that different levels in the hierarchy might be associated with different acoustic correlates, e.g. F0 for phrasal prominence in English, but not for lower levels. However, duration appears to be associated with multiple levels, with longer durations for higher levels in the prominence hierarchy (e.g. Shattuck-Hufnagel and Turk 1996; Keating, Cho, Fougeron, and Hsu 2003, among others).

The effects of prominence on duration appear to be different from those related to prosodic constituent boundaries (Turk and Shattuck-Hufnagel 2000; Beckman and Edwards 1992; Turk and Sawusch 1997; Cho and Keating 2009; Mo, Cole, and Hasegawa-Johnson 2010; van Santen and Shih 2000). For example, at least in English, monosyllabic words show different locations for the effects of phrasal prominence versus final lengthening. That is, prominence increases the nucleus duration most, followed by the syllable onset, then optionally the coda; in contrast, final lengthening increases the rime duration most, with optional lengthening on the onset (see also Berkovits 1993a,b for evidence of progressively increasing lengthening from nucleus to coda consonant in Hebrew). In polysyllabic words, phrase-level prominence can affect multiple sites. For example, in English words bearing a single phrase-level prominence, the syllable whose duration is most likely to be affected is the syllable that bears primary word stress, but other syllables can also be lengthened. Lengthening sites include the syllable following the primary word stress (spill-over lengthening), the final syllable rhyme, the word onset, and secondary-stressed syllables (Dimitrova and Turk 2012; see also Eefting 1991 and Cambier-Langeveld and Turk 1999 for related findings in Dutch). In addition, words with more than one full-vowel syllable can bear more than one phrase-level prominence, as in e.g. It’s in MASSaCHUsetts (Shattuck-Hufnagel et al. 1994). This is most likely to occur when the word bears all of the pitch accents in the intonational phrase, supporting Bolinger’s (1965) observation that speakers prefer to produce an early accent and a late accent to bracket the intonational phrase.
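The grid itself is a simple data structure. In the sketch below (ours; the column heights approximate the pattern described in the text, with George tallest), the tallest column identifies the nuclear phrasal prominence:

# Metrical grid: each syllable carries a column of x's; taller = more prominent.
grid = {"Ma": 3, "ry's": 1, "cou": 2, "sin": 1, "George": 4}

nuclear = max(grid, key=grid.get)          # tallest column = nuclear prominence
print("nuclear phrasal stress on:", nuclear)
for syllable, height in grid.items():
    print(f"{syllable:>6} {'x' * height}")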

10.1.2 Factors that govern prosodic structure: The Smooth Signal Redundancy hypothesis

A long-standing question in the fields of phonology, phonetics, speech production, and psycholinguistics has to do with the function of prosodic


structure, and with the type of information used to plan it. Early views suggested that prosodic structure might help children learn syntax (Gleitman and Wanner 1982). However, although syntactic structure clearly has a large influence on prosodic constituent structure (see Selkirk 2011 for a recent discussion), prosodic structure is not always isomorphic with it: speakers can, and often do, produce major prosodic breaks that are internal to syntactic phrases. To give just one example (from Jackendoff 1987, p. 329), Sesame Street is brought to you by . . . the Children’s Television Workshop, where a pause indicating a prosodic boundary is placed inside a prepositional phrase. Shattuck-Hufnagel and Turk (1996), Turk and Shattuck-Hufnagel (2014), and Selkirk (2011) discuss a number of factors that appear to influence prosodic constituent structure, including syntax, pragmatics/semantics, utterance length, and prosodic markedness factors.

Aylett (2000), Aylett and Turk (2004), and Turk (2010) propose that prosodic structure, i.e. relative boundary strength and prominence, is planned to facilitate successful utterance recognition by the listener, that is, among other things, to highlight and demarcate words that are less language-redundant, i.e. less predictable from lexical frequency, syntactic and semantic context, and real-world context (pragmatics) (see also Lieberman 1963; Fowler and Housum 1987),⁴ in order to make their forms sufficiently clear for lexical access (Lindblom 1990). Extrapolating from Shannon’s (1948) information theory, Aylett (2000), Aylett and Turk (2004), and Turk (2010) proposed that speakers manipulate prosodic structure (within constraints of the prosodic grammar⁵) as a way to control the acoustic salience of linguistic elements so as to achieve an inverse relationship between acoustic salience (acoustic redundancy) and language redundancy (cf. Figure 10.3),⁶ resulting in smooth signal redundancy, i.e. recognition likelihood that is spread evenly throughout an utterance.

⁴ This proposal assumes that, in planning prosodic structure, the speaker can compute predictability (language redundancy) on the basis of his/her own language and real-world experience. The speaker can incorporate information about the listener’s knowledge, but need not do so.

⁵ For example, many English phrases exhibit default ‘broad’-focus prominence patterns in which nuclear (primary) phrasal prominence falls near the end of the phrase (see Cruttenden 1986; Ladd 2008), regardless of the predictability of the words which bear nuclear prominence. This suggests a role for prosodic grammar in the speech production process that is independent of language redundancy. See Vallduví (1991) and Ladd (2008) for discussions of cross-linguistic differences.

⁶ A current hypothesis is that the predictability of each element in an utterance relates to its predictability on the basis of both preceding and following elements (including position in syntactic structure), as well as its frequency of use and likelihood on the basis of real-world context, but note that it is an important research question to determine exactly what contributes to an element’s predictability/language redundancy.

OUP CORRECTED PROOF – FINAL, 30/1/2020, SPi

276    /

Recognition likelihood

1 0.8 0.6

Acoustic salience

0.4

Predictability

0.2 0 Who's

the

au

thor

Ordered elements within a phrase

Figure 10.3 The complementary relationship between predictability (language redundancy) and acoustic salience yields smooth-signal redundancy (equal recognition likelihood throughout an utterance). Source: Turk and Shattuck-Hufnagel (2014), based on a figure in Turk (2010). Reproduced with permission from the Royal Society.

be separated by strong prosodic boundaries. An inverse relationship between acoustic and language redundancy yields smooth-signal redundancy, that is, an even likelihood that the listener will identify each element in a linguistic sequence. An even likelihood of identifying each element in a linguistic sequence is advantageous from an information-theory perspective (Shannon 1948), since it increases the likelihood of recognizing the entire sequence. Support for the Smooth Signal Redundancy hypothesis comes from evidence for the shared contribution of prosodic prominence structure and predictability factors to variation in syllable duration and vowel formant frequency (Aylett 2000; Aylett and Turk 2004; Aylett and Turk 2006). Those results showed that syllables were longer, and vowel formants more distinct, when their words were less predictable and therefore more prosodically prominent. Turk (2010) proposed that prosodic constituent structure is planned on similar principles, that is, with stronger prosodic boundaries when words are less predictable. Supporting evidence comes from findings that word durations are longer, and pauses and intonational boundaries more likely, in less predictable syntactic sequences (Watson, Breen, and Gibson 2006; Gahl and Garnsey 2004; discussed in Turk 2010). However, on the Smooth Signal Redundancy view, syntactic structure is not the only factor that determines prosodic structure; rather, it is one of a set of factors that contributes to language redundancy/predictability, which in turn influences prosodic structure (cf. Figure 10.4). These factors include, for example, length of the

[Figure 10.4: flow diagram. Syntax, semantics, pragmatics, the lexicon, and utterance length feed language redundancy (predictability); language redundancy, together with non-grammatical factors (e.g. rate, clarity requirements, style, movement costs), shapes prosodic structure and segmental phonology and cue choice, which determine planned phonetics (pronunciation, including surface prosody, timing, etc.); acoustic redundancy (acoustic salience) plus language redundancy yields signal redundancy (recognition likelihood).]

Figure 10.4 Factors that shape surface phonetics and their relationship to predictability, acoustic salience, and recognition likelihood. Note: Based on similar figures in Aylett and Turk (2004), Turk (2010), and Turk and Shattuck-Hufnagel (2014).

These factors include, for example, length of the utterance (whether in words, syllables, or other units), semantics, real-world context, and frequency of use. Along these lines, Turk (2010) noted that the Smooth Signal Redundancy view provides a plausible account of the effects of utterance length on prosodic boundaries (more and stronger boundaries for longer utterances). This is because, all other things being equal, the likelihood of successful word recognition is decreased in longer utterances, owing to their increased number of possible word parsings; stronger word boundaries, which often align with the boundaries of higher-level prosodic constituents, can mitigate these effects by providing more and stronger cues to the speaker's intended phonemes, as well as to the way the sounds group into words. (This difficulty could also be mitigated to some extent by the additional disambiguating context that longer utterances provide.) More recently, Pate and Goldwater (2014) have shown correlations between measures of predictability and log word durations in speech (in several different listening contexts), all of which support the general principles of the Smooth Signal Redundancy hypothesis and are less easily explained in terms of other principles, such as difficulty of lexical access. See also Jaeger (2006), Levy and Jaeger (2007), and Jaeger (2010) for additional evidence supporting the general Smooth Signal Redundancy principle (often also called Uniform Information Density) in other domains, e.g. syntax.
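The inverse mapping from predictability to acoustic salience can be made concrete with a small numerical sketch. The Python fragment below is our illustration, not part of the model: it assumes that language redundancy is measured as surprisal (negative log probability) and that a fixed acoustic-salience 'budget' is allocated across the elements of an utterance in proportion to their surprisal; the probabilities are invented.

```python
import math

def surprisal(p):
    """Information content (bits) of an element with contextual probability p."""
    return -math.log2(p)

def salience_shares(probs, budget=1.0):
    """Allocate a fixed acoustic-salience budget in proportion to surprisal,
    so that less predictable elements receive more prominence/demarcation."""
    info = [surprisal(p) for p in probs]
    total = sum(info)
    return [budget * i / total for i in info]

# Invented contextual probabilities for the elements of 'Who's the author'
# (cf. Figure 10.3); 'the' is highly predictable, "who's" much less so.
elements = ["who's", "the", "au", "thor"]
probs = [0.05, 0.60, 0.10, 0.30]
for elem, p, share in zip(elements, probs, salience_shares(probs)):
    print(f"{elem:6s} p={p:.2f}  salience share={share:.2f}")
```

Under this toy allocation, signal redundancy (predictability plus salience) comes out flatter across the utterance than predictability alone, which is the sense in which the hypothesis calls the result 'smooth'.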

To summarize, following Aylett and Turk (2004) and Turk (2010), the hypothesis advanced here is that prosodic structure is planned with the goal of an even distribution of recognition likelihood by the listener throughout an utterance (smooth-signal redundancy). To this end, predictability information from the number of possible word parsings in an utterance due to its length, as well as predictability information from grammatical sources (such as syntax), real-world context, and frequency of use, is used to plan prosodic structure, within the constraints of the prosodic grammar, so that relatively unpredictable elements are highlighted. This highlighting is done either by manipulating relative prosodic prominence, or by manipulating relative prosodic boundary strength (through edge demarcation). Planning for smooth-signal redundancy in the Phonological Planning stage will yield a prosodic structure that includes 1) constituent structure with an indication of relative boundary strength, and 2) relative prominence.

10.1.3 The integration of lexical information with prosodic structure

The prosodic structure as described can be considered the planning frame for the utterance. Given the assumption of a prosodic planning frame, some mechanism is required to associate the elements that define the forms of the selected lexical items with that frame.⁷ Serial ordering errors such as rath meview for math review have long been taken as evidence for a serial ordering process in which sub-lexical units are inserted into a structured multi-word planning frame (Fromkin 1971, 1980). Shattuck-Hufnagel (1992) proposed that this planning frame is prosodic in nature, consistent with e.g. Ferreira (1993, 2007); and Keating and Shattuck-Hufnagel (2002) sketched out a mechanism for the serial ordering process. They adopted Levelt's (1989) and Levelt, Roelofs, and Meyer's (1999) proposal that information about the lexical form of the speaker's intended words is transferred into the planning frame in two steps: first the metrical information, and then the segmental information (see also Shattuck-Hufnagel 2014, 2015).

⁷ Garrett (1980) proposed such an insertion mechanism for serially ordering words and morphemes into a syntactic frame; here we are concerned with serial ordering at the phonological level.

Sections 10.1.3.1 and 10.1.3.2 address the question of the nature of these sub-lexical units. The evidence presented in Chapter 7 suggests that the phonological representations used in the Phonological Planning stage are symbolic; here several possibilities are considered for the nature of the serially ordered symbolic units, namely that they are distinctive features, phonemes, or larger units such as syllable subconstituents (i.e. syllable onsets, nuclei, codas, and/or rhymes). The symbolic units that would be analogous to AP/TD's gestures would be distinctive features, and these are considered here along with higher-level units. The focus of Section 10.1.3.1 is on evidence from speech errors, because it is most directly relevant. The evidence is consistent with the phoneme as the unit of serial ordering, but suggests that sub-phonemic and supra-phonemic units may also be required, at least in some situations. On the view that evidence for one such component should not be taken as evidence against another, i.e. that more than one type of symbolic sublexical component may be represented (Kazanina, Bowers, and Idsardi 2017), and on the basis of evidence for the phoneme presented in this section, XT/3C-v1 adopts the view that the symbolic phoneme is the best candidate for the unit of serial ordering, and proposes that the phonemic segments of the target words are serially ordered into the word-based prosodic planning frame discussed above, during the operation of the Phonological Planning Component. Once this has occurred, the structured sequence of phonemes in their respective prosodic positions can be considered a sequence of task requirements for the utterance. That is, one of the speaker's task requirements is to produce each phoneme appropriately for its structural position, as part of the full set of task requirements established in the Phonological Planning Component.

10.1.3.1 Speech error evidence for the nature of the units slotted into the prosodic frame during the serial ordering process

Interpreting speech error patterns as evidence for the symbolic units that undergo serial ordering during the phonological planning process is challenging, in part because finding a way to measure errors objectively is not easy. Errors occur relatively rarely, so one method has been to collect large corpora of errors by listening. In this method, investigators attend to errors in the copious amounts of speech they hear every day and transcribe the errors that they perceive (see the seminal report by Meringer and Mayer 1895 for German; Fromkin 1971, 1980, Shattuck 1975, and Shattuck-Hufnagel 1992 for English; del Viso Pavón 1990 and Perez, Santiago, Palma, and O'Seaghdha 2007 for Spanish; and a large number of smaller corpora, some in other languages). This method offers the advantage of harvesting errors made during typical speech production, but it has a problem: the speech perception process is notoriously biased in favor of distorting the acoustic input in the direction of recognizable linguistic elements, legitimate sequences, and existing words, and may also tend to include errors of the type that seem most interesting to the listener. Despite these drawbacks, patterns in error corpora collected in this way can provide some useful information, as will be noted below, and are also a rich source of hypotheses for experimental testing.

An alternative approach is to elicit errors in the laboratory; such elicitation experiments generally require the use of atypical speech stimuli (like tongue twisters or priming stimuli) to ensure the occurrence of an adequate number of errors for analysis in a reasonable amount of time (e.g. Baars and Motley 1976; Shattuck-Hufnagel 1992; Mowrey and MacKay 1990; Pouplier and Goldstein 2010). Tongue twisters like She sells sea shells or top cop top cop are specifically designed to be highly repetitive, to employ contrasting patterns of alternation (onset consonants sh-s-s-sh vs. rhymes i-ells-i-ells), and to be rhythmically regular, resulting in utterances with a quasi-periodic rhythm in which confusable elements occur at approximately equal time intervals. These stimulus characteristics have been found to elicit a high rate of speech errors and other disfluencies, but they do not necessarily reflect the planning processes of typical communicative speech, which generally involves utterances with minimal repetition/alternation and aperiodic rhythms, and employs a complex sequence of steps linking the meaning of an intended message to the syntactic, lexical, and prosodic shape of the utterance. Moreover, twisters need to be repeated several times in order to elicit errors; typically, the first repetition of such a stimulus is error-free (just like most utterances in more typical speaking circumstances). Thus the types of errors that occur in twister-elicited speech, while clearly revealing important attributes of the speech production apparatus, may not accurately reflect the types that occur in more typical speaking contexts.

High error rates have also been reported using the SLIP technique (Baars and Motley 1976), in which speakers are primed by a sequence of visually presented word pairs that contain confusable onset consonants in one order (e.g. for b/d, word pairs like dig beets, dock boats, deer bites), followed by a word pair that contains the same consonants in the opposite order (e.g. barn door), which speakers are required to speak aloud. Baars and Motley found that, under these conditions, speakers were likely to reverse the two onset consonants in their spoken output, producing e.g. darn bore. This result was obtained by listening to the spoken output, so that if gradient errors occurred they could not be reliably detected. However, Pouplier (2007), using the same elicitation method (a kind of priming) but measuring the results articulatorily, reported the same kinds of gradient intrusion errors that occur in highly repetitive, quasi-periodic tongue twisters with alternating patterns; i.e. in both types of elicitation, errors showed varying degrees of movement towards the target and/or the intrusion constriction. Again, while these results clearly show that such gradient errors can occur when appropriate priming is experienced, and are characteristic of the system under those priming conditions, such conditions are not often found in typical speaking situations. As a result, it is unclear whether these types of gradient intrusion errors are characteristic of the system when it operates in more typical speaking conditions. Taken together, the evidence from tongue-twister and SLIP experiments has revealed that the speech articulation system is capable of producing gradient errors, but doubt remains as to whether such errors make up the majority of sound-level errors in everyday speech.

A second problem in interpreting speech error patterns as evidence for the representational units used in typical speech production is that errors are notoriously ambiguous as to the unit involved, because of the hierarchical nature of linguistic structures. For example, in the exchange error I have to go home and give my bath a hot back, for . . . give my back a hot bath, the elements that changed places might have been the coda consonants (/k/ exchanged with /θ/), the rhymes (-ack exchanged with -ath), or the morphemes, syllables, or words (back exchanged with bath). Similarly, in highly-played payer for highly-paid player, the error unit may have been the single phoneme /l/, the word or syllable onsets (/p/ exchanged with /pl/), or the morphemes or syllables (pay- exchanged with play-). Such ambiguity is endemic in speech error corpora, making it difficult to determine precisely which representational unit is implicated.

Despite these challenges, speech error patterns provide some useful insights. As noted by Lashley (1951), they clearly rule out associative stimulus-response models of speech production, which cannot plan ahead. In addition, as noted by Fromkin (1971), the fact that errors generally involve linguistically defined units (rather than random parts of utterances) shows that these grammatically significant elements play an active role in speech planning; and as noted by Garrett (1975, 1980), they reveal important aspects of the order of processing steps in the planning process. For example, exchange errors involving pronouns in English often take the form of a change in person or gender but not case (e.g. She told them → They told her, but not *Them told she; and He told her → She told him, but not *Her told he). This suggests a two-step process in which the case is determined first and remains fixed with its position in the planning frame, and only later are the person, number, and gender determined, which can become misordered. Such error patterns also indicate that the pronouns take on their surface phonological form only after these earlier processes have been completed.

A number of speech production models have been developed based on the regularities that errors exhibit (e.g. Garrett 1980; Dell and Reich 1981; Dell et al. 1997, among others),⁸ but they won't be discussed further here because they do not model articulation. Instead, this section focuses on the error evidence that has been used to argue for units larger than the phoneme, i.e. the syllable and its subcomponents (onset, nucleus, coda, rhyme), as units that play a role in the speech planning process. It argues that this evidence offers little support for the view that the syllable and its sub-components are the only units of sublexical representation, i.e. for the view that phonemes as representational units can be dispensed with, as proposed, for example, by AP/TD and Fujimura (1992 et seq.). It then reviews the arguments that have been advanced for serial-ordering elements smaller than the phoneme, i.e. individual distinctive features, and draws a similar conclusion, namely that phonemes are still required to account for the data. Finally, some additional evidence is presented that supports the assumption that representations of individual phonemes are required in order to account for the full range of phonology-related behavior. Taken together, these lines of evidence motivate the assumption of the phoneme as the unit of sublexical serial ordering in the XT/3C-v1 model.

10.1.3.1.1 Phonemes vs. higher-level sublexical units

Error-based arguments that have been advanced for the role of the syllable in speech production planning take several forms. First, entire syllables can serve as error units; second, syllable subconstituents can serve as error units; and third, when two sublexical elements interact in an error, they generally share position in a syllable: onsets interact with onsets, nuclei with nuclei, and codas with codas. The problem is that, owing to the ambiguity of error units noted above, each of these observations can be accounted for in terms of another linguistic unit. When viewed in the context of an error corpus as a whole, it is clear that whole syllables rarely if ever participate in serial ordering errors, and at the same time, it is difficult to account for error patterns without a unit such as the phoneme.

⁸ Levelt (1999) points out that two major lines of thinking have inspired two separate approaches to modeling speech production: constraints from speech errors, and constraints from chronometric experiments, i.e. from timing measurements in elicitation and priming experiments.

10.1.3.1.2 Syllables as error units?

If syllables reliably appeared as error units, that would provide substantial evidence that they serve as units of serial ordering in the production planning process. But the evidence for such a claim is ambiguous at best, and is more plausibly accounted for in terms of the word or morpheme. First, although many interaction errors could be described in terms of the syllable as the error unit, the two 'syllables' involved are almost always monosyllabic words. Because so many of the words in spoken English are monosyllables, it is possible to describe most such interaction errors as involving syllables, morphemes, or words. But it is clear that the word or morpheme is required to account for errors involving polysyllabic units, like intelephoning stalls for installing telephones. In contrast, very few errors have been reported that unambiguously implicate the syllable. In particular, strikingly few errors have been observed in which two syllables from polysyllabic words interact, as would be the case in e.g. *[mor]tivating [cap]femes for captivating morphemes (but see Laubstein 1987). When the evidence for one type of error unit is clear (e.g. unambiguous morpheme errors) and the evidence for another, competing error unit is ambiguous (e.g. errors ambiguous between syllable- or morpheme-sized units), then logic suggests that the unit for which the evidence is clear should be assumed in the ambiguous cases.

10.1.3.1.3 Do syllabic subconstituents function as error units?

Much of the error evidence that bears on the question of whether error units are syllabic subconstituents (onsets, nuclei, codas, or rhymes) or simply single phonemes is also ambiguous, but the resolution of this ambiguity takes a different form. The ambiguity arises because most interaction errors involve syllabic subconstituents that are also single phonemes, such as shop talk → top shalk (onsets), come back → cam buck (nuclei), and sit down → sin dowt (codas). (Many similar examples can be found in Fromkin 1971 and Shattuck 1975, inter alia.) Such errors can be described either as syllabic-subconstituent errors or as individual-phoneme errors. However, neither description alone can account for the full range of patterns observed in sublexical interaction errors. That is, while some interaction errors unambiguously involve syllable constituents like complex onsets (e.g. speak fast → feak spast) and rhymes (e.g. back and forth → borth and fack), it is also the case that single phonemes are often moved into or out of complex syllable onsets, as in split brain → sprit blain. Thus, although the majority of the evidence is ambiguous, there is also unambiguous error evidence both for syllabic subconstituents and for single phonemes, supporting the idea of multiple levels of representation.

Note that a word-based alternative to the syllable-subconstituent view has been proposed by Shattuck-Hufnagel (1992, 2011). Because complex CC or CCC error units usually occur in word-onset position, it is possible that the planning representation takes the form [word-onset consonants] + [the rest of the word]. On this view, apparent syllable-onset constituents are seen as word-onset constituents, weakening the evidence for syllable-based serial ordering units. A similar argument can be made for errors that have been classified as involving syllable rhymes, such as mark the place → mace the plark, since these units also correspond to [the rest of the word]. This option finds some additional support from errors which seem to mis-order the entire [rest of the word] constituent in polysyllabic words, as in Claire and Howard → Cloward and Haire. It is also consistent with some forms of Pig Latin in English, in which the word-onset consonant cluster is moved to the end of the word before adding the sound /e/, as in stopping → opping-stay; this does not occur for word-internal syllables.

Whatever the nature of the multi-segmental constituents that sometimes undergo serial ordering errors, i.e. whether they are constituents of the word or constituents of the syllable, it appears that the individual phoneme is nevertheless required to account for errors that break up complex syllabic subconstituents into single phonemes, as in split brain → sprit blain. What is perhaps most significant about these observations is that a serial ordering error almost never involves a random sequence of segments, such as the -pli- sequence in split, or fragments from two successive words, such as -it brai-; instead, perceived mis-orderings can almost always be described as involving a linguistically motivated constituent, even if there is often ambiguity as to which level of linguistic constituent is involved (Fromkin 1971).

10.1.3.1.4 Are there syllable-based position constraints on interaction errors?

Two additional aspects of sublexical errors have sometimes been cited as evidence for the representation of syllable structure in production: the Position Similarity constraint and the Position Preference constraint. Both of these observations are often phrased in terms of position in the syllable. Position Similarity refers to the fact that when two sublexical elements interact in an error, they are generally from the same position in the syllable: onsets with onsets, codas with codas, etc. While the observation of shared position is correct, it is not obvious that it should be stated in terms of the syllable. In fact, a number of lines of evidence suggest that it is similar position in the word, rather than the syllable, that governs error interactions, while the evidence for the syllable is ambiguous (Shattuck-Hufnagel 1992, 2011). Consider the fact
that most sublexical errors involving consonants occur in onset position in both the word and the syllable. When the words are monosyllabic, this observation provides only ambiguous evidence for either unit. But the pattern of errors in polysyllabic words helps to resolve this ambiguity. That is, many errors occur between pairs of word-onset consonants in polysyllabic words (such as player piano → payer pliano), yet there are vanishingly few errors that involve interactions between potentially syllable-onset consonants inside a polysyllabic word (such as double decker → *duckle debber). The fact that most sublexical errors occur between consonants that share both word and syllable position, while very few errors involve consonants that share only syllable position (i.e. that are located internal to a polysyllabic word), suggests that the Position Similarity constraint may be best stated in terms of the word. Similar arguments apply to the Position Preference constraint: most errors occur in initial position in the word, which is of course also initial in the syllable. But when errors occur in polysyllabic words in English, they tend to occur in word-onset position, and seldom occur in syllable-onset position within the word.⁹

⁹ Findings are different for error patterns in other languages; for example, in del Viso’s (1990) corpus of errors in Spanish, word-internal interaction errors more commonly implicate word-internal syllable structure as a constraining factor.
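The corpus comparison behind this argument can be phrased as a simple tally. The sketch below is illustrative only: the coding scheme and the records are invented, but it shows how one might count, for each pair of interacting consonants, whether they share word position, syllable position, or both; the empirical claim in the text is that the 'syllable position only' cell stays nearly empty.

```python
from collections import Counter

# Invented, hand-coded interaction errors; each record marks whether the two
# interacting consonants share position in the word and in the syllable.
coded_errors = [
    ("player piano -> payer pliano",    dict(word=True,  syllable=True)),
    ("top cop -> cop top",              dict(word=True,  syllable=True)),
    ("shop talk -> top shalk",          dict(word=True,  syllable=True)),
    # The type below is reported to be vanishingly rare:
    ("double decker -> *duckle debber", dict(word=False, syllable=True)),
]

tally = Counter()
for _, shares in coded_errors:
    if shares["word"] and shares["syllable"]:
        tally["word + syllable position"] += 1
    elif shares["syllable"]:
        tally["syllable position only"] += 1
    elif shares["word"]:
        tally["word position only"] += 1

for cell, n in tally.items():
    print(f"{cell}: {n}")
```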

10.1.3.1.5 What is the role of the syllable in speech production planning?

Taken together, these arguments raise interesting questions about the role of the syllable in speech production planning, at least in English. If syllables and their subconstituents are not the only units that undergo serial ordering during phonological planning, and/or are not even the most likely units to function in this way, then what is their role? A number of investigators (e.g. Crompton 1981; Levelt, Roelofs, and Meyer 1999) have proposed that syllable-sized articulatory plans are stored and retrieved to create the sequence of events that will produce an utterance (see also Fujimura 1992 et seq. and AP/TD for a different type of proposal in which syllables also have a central role). But in American English, the syllable is often an elusive construct. That is, although the number of syllabic nuclei in an utterance is usually easy to count, the boundaries of syllables are not easy to determine, particularly in words with strong-weak metrical structure like butter or movies. Moreover, massive reductions (Johnson 2004) in American English are notoriously insensitive to syllable structure, often erasing it altogether while leaving only a few cues to individual features behind, as in e.g. probably → ~[praəli], or Do you have → ~[dʒəv].

While the concept of stored syllable-sized articulatory routines has seemed attractive, because their pre-specified coarticulatory pattern removes the task of determining the degree of overlap afresh for each utterance, this advantage is becoming less compelling as we learn more about the nature of systematic variability in typical communicative speech. These discoveries mean that, in order to account for the sometimes massive reductions that characterize surface phonetic variation, a planning representation consisting of syllable-sized articulatory plans retrieved from storage would need to undergo substantial adjustment for each new planned utterance, depending on the factors that influence phonetic form for that utterance, such as prosodic structure, adjacent words and their sounds, speaking rate, speaking style, etc. (cf. Chapter 3 and Figure 10.4). These discoveries also suggest that a planning process based on the selection of individual feature cues and the computation of appropriate quantitative values for those cues, rather than on the retrieval of syllabic motor plans and their adjustment, may provide a more tractable framework for describing the surface phonetic variation that speakers typically produce.

This is not to say that the syllable has no role to play in speech or in the grammar of the speaker's language. For example, it may be that American English is less reliant on the syllable than other languages, as has been suggested by Mehler, Cutler, and their colleagues (Cutler et al. 1986). The representational units employed in speech production planning may not be identical in all languages (or for all speakers of a language, or for a given speaker on all occasions), even though they may be drawn from a limited set of universally available possibilities. Further, as noted above, the fact that American English speakers can generally determine the number of syllables in a word or utterance, and can recognize their metrical pattern, doesn't necessarily implicate the syllable as a bounded representational unit separate from its surrounding syllables within the word. Instead, the syllable may well have an entirely different role to play. For example, in the grammar, syllable number and syllable weight figure in grammatical regularities such as word stress. Moreover, it is possible that the hierarchically structured prosodic planning framework includes syllable structure, and conditions the phonetic shape of a phoneme, as in the glottalization of /t/ in butler and Atkins in some varieties of English, even if syllables (or their subconstituents) do not serve as units that undergo serial ordering in speech planning.

In sum, the role of the syllable and syllabic subconstituents in speech production planning in American English or any other language is far from settled; much of the evidence that is often cited in support of these constituents may be equally well or better accounted for by morpheme or word structure. But the need for a smaller, lower-level representational unit is clearly supported by the speech error evidence. Is this sublexical unit phoneme-sized, or equivalent to an even lower-level unit, the individual feature, or a gesture (in the AP/TD sense of an abstract, yet spatiotemporally specified, element)? The next section turns to the speech error evidence that bears on this question.

10.1.3.1.6 Phonemes vs. lower-level sublexical entities

The evidence reviewed above suggests that larger sublexical units, such as syllable or word onsets, which can include more than one phoneme, are not sufficient to account for observed sublexical error patterns, because errors often break up these units into their individual phonemic components. The observation that many of the transcribed perceived errors involve single phonemes raises the possibility that the phoneme is a required unit. However, sublexical errors collected by listening to ongoing speech are subject to a number of biases, including potentially a bias toward hearing and recording phoneme-sized error units for events which in fact involve smaller units: perhaps the distinctive feature, the abstract articulatory gesture (as proposed by Pouplier and Goldstein 2010), or motoric units (as suggested by Mowrey and MacKay 1990).

10.1.3.1.7 The phoneme or the gesture?

Pouplier, Goldstein, and colleagues have made a significant contribution to our understanding of the full range of speech error behavior. In a series of experiments they have drawn attention to the fact that gradient sub-phonemic articulatory intrusion errors can occur under some circumstances, and that not all aspects of sub-lexical errors can be detected by listening (Pouplier 2007; Goldstein et al. 2007; Pouplier and Goldstein 2010). Using tongue-twister-like stimuli, such as repeated utterances of top cop, or the SLIP technique, these investigators have shown that it is possible to elicit sublexical errors which are not consistent with phoneme-sized units, because they show intermediate degrees of tongue movement for both the target and the intrusion segment at the same time. This result provides evidence that in at least some errors, the alveolar /t/ and the velar /k/ constriction gestures are co-activated. Errors of this type are inconsistent with a mechanism in which an intruding symbolic phoneme is chosen over the target segment during the serial ordering process, and produced in its normal fashion in its new context, leaving no residual evidence of the original target.

These articulatory observations from tongue twisters have been interpreted as evidence that sublexical errors in spontaneous speech, which are generally transcribed as phoneme substitutions, do not involve whole-phoneme error units, but instead involve partial activation of both the target and intrusion gestures, with acoustic results that are interpreted categorically by listeners, who fail to attend to (or cannot perceive) any residual acoustic evidence for the original target segment. This view finds additional support in results showing that when such co-activated tokens are presented to listeners, they often cannot perceive that an intruding gesture has occurred at the same time as the target gesture, or vice versa (Pouplier and Goldstein 2005). These results demonstrate beyond a doubt that such gradient intrusion errors can occur. However, they do not necessarily demonstrate that all (or even most) sound-level errors in typical communicative speech arise in this manner.

Recent work has raised the possibility that sublexical serial ordering errors that occur in typical speaking situations may differ from those that occur in tongue twisters or the SLIP technique. It is well known that tongue twisters elicit more errors than sentences that contain the same confusable phonemic patterns (Shattuck-Hufnagel 1992), but do they elicit different types of errors in different proportions? That is, do twisters elicit more gradient gesture-intrusion errors, and fewer whole-segment errors, than sentences, which may involve a different level of processing that includes, for example, complex syntax-related prosodic planning? Shattuck-Hufnagel, Bai, Tiede, Katsikis, Pouplier, and Goldstein (2013) addressed this question by directly comparing the errors perceived in experimentally elicited utterances of list-like twisters (e.g. top cop top cop top cop) with those elicited by sentences (e.g. The top cop saw a cop top). They found that, based on perceptual labeling, there was a difference in the proportion of errors of different types for sentence production vs. list production. That is, list-like stimuli elicited more onsets with sequences of two release bursts, like t-cop or k-top.¹⁰ In contrast, sentence-like stimuli invoked more errors that seemed to a listener to be the wholesale substitution of the intrusion segment for the target segment (e.g. top cop → cop top), with the intrusion segment produced fluently in its new context, and with no perceptible residual evidence (such as a release burst) of the original target segment.

¹⁰ Such errors are clearly different from the imperceptible intrusion/target gesture combinations discussed by Pouplier and colleagues, which can only be detected by articulatory instrumentation.

These results are preliminary, but they raise the possibility that the typical planning process for production of a grammatically well-formed sentence, with its complex integration of prosodic, syntactic, and lexical structure, fosters different proportions of the various possible types of error than does the planning for a twister-like list of words that is highly repetitive, quasi-periodic in timing, and involves ongoing alternation between two confusable segments. It is possible that the differences in the planning process for these two types of utterance result in the preponderance of a different kind of sublexical error in sentence-based utterances, i.e. that errors during sentence planning involve the mis-selection of individual phonemic segments as wholes (for serial ordering into a prosodic planning frame), rather than the simultaneous articulation of two target constrictions, whose origin may lie in the motor implementation of repetitive sequences (i.e. in the Motor-Sensory Implementation component of XT/3C-v1). Testing this hypothesis will require articulatory measures of sublexical error patterns in less repetitive sentences vs. highly repetitive list-like utterances, to determine whether a high proportion of the apparent whole-segment substitutions perceived in conversational sentence-based utterances are truly segmental substitutions, or whether these errors are actually gradient gestural intrusions with imperceptible (but measurable) articulatory vestiges of the target segment. Until such evidence is available, the hypothesis that the sublexical units that interact in errors during typical speech often correspond to phonemes remains plausible.

10.1.3.1.8 The phoneme or the feature?

Error evidence also sheds some light on how individual distinctive features, which specify the manner, place, and voicing aspects of a phonemic segment in abstract symbolic terms, may play a role in the processing representations. It has often been noted that many interaction errors at the sublexical level involve phonemic segments that differ by only one feature, such as kategeeper for gate-keeper, where the interacting segments /g/ and /k/ differ only in their voicing. Moreover, a small number of errors have been observed that seem to involve the exchange of individual feature values between two segments (such as ponato for tomato, where the place features of the /t/ and /m/ have apparently changed places, generating two new segments, /p/ and /n/). Such errors seem to support the representation, and perhaps the serial ordering, of individual feature cues. However, Shattuck-Hufnagel and Klatt (1979) showed that such unambiguous individual-feature errors are very rare compared to whole-segment serial ordering errors. They analyzed a small corpus of 70 errors in English that involved an exchange interaction between two segments differing by more than one feature, so that it was possible to determine whether the error implicated a single feature or the entire feature bundle. They found that 96% of these exchange errors involved whole-segment rather than single-feature changes. This finding is consistent with the idea that both features and phonemes are represented, but that it is the larger element that typically undergoes serial ordering during the utterance planning process.

In sum, taken together with the evidence for symbolic units of phonological representation presented in Chapter 7, the evidence presented above suggests that phoneme-sized units are good candidates for the symbols that are serially ordered into the prosodic frame. However, the evidence doesn't exclude other units, such as word (or perhaps syllable) onsets, or individual distinctive features, and in fact suggests that multiple levels in the linguistic hierarchy are reflected in error patterns. Section 10.1.3.2 briefly discusses additional evidence in support of the phoneme as a representational unit in phonology.

10.1.3.2 Additional evidence for the phoneme as a representational unit

Several pieces of evidence discussed in previous chapters, while not conclusive, are also consistent with the phoneme as a representational unit. These findings suggest that 1) the phoneme is a unit which may govern coordination, and 2) the phoneme provides an account of the phonological equivalence among phonetic variants.

10.1.3.2.1 The phoneme as a unit governing coordination

The Löfqvist (1991) findings discussed in Chapter 5 showed that proportional rate scaling appears to hold more strongly for measured intra-segment intervals associated with the gestures of a single phoneme than for measured inter-segment intervals associated with sequences of phonemes. That is, for the voiceless labial consonants /p/ and /f/, the duration of the glottal opening movement (measured from the onset of glottal opening to maximum glottal opening) relative to the oral constriction interval showed strong evidence of proportional rate scaling. Moreover, and critically for the present argument, this proportional rate scaling within a segment was stronger than that observed between adjacent segments, i.e. for the timing of the onset of movement towards an oral constriction relative to a vowel-cycle interval in a word-medial VCV sequence. Although Löfqvist's experiment was not explicitly designed to contrast the predictions of a phonemic vs. an AP/TD gestural coordination account, his finding that within-segment relative timing of movements is more stable than between-segment relative timing is consistent with the hypothesis that phonemes are represented, and govern coordination in a different way than constituents that group successive segments.
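One way to operationalize this comparison is to ask how stable the ratio of two measured intervals remains across speaking rates: under strict proportional scaling the ratio is constant. The sketch below is ours, with invented durations standing in for the kinds of measurements Löfqvist reports; the coefficient of variation of the ratio is one possible stability index.

```python
import statistics

def ratio_cv(intervals_a, intervals_b):
    """Coefficient of variation of the a/b duration ratio across rate
    conditions; near zero under strict proportional rate scaling."""
    ratios = [a / b for a, b in zip(intervals_a, intervals_b)]
    return statistics.stdev(ratios) / statistics.mean(ratios)

# Invented durations (ms) at slow, normal, and fast rates.
glottal_opening   = [120, 95, 70]   # intra-segment interval for /p/
oral_constriction = [130, 104, 77]  # same segment: near-proportional scaling
vowel_cycle       = [210, 150, 95]  # inter-segment interval: scales differently

print(f"within-segment ratio CV:  {ratio_cv(glottal_opening, oral_constriction):.3f}")
print(f"between-segment ratio CV: {ratio_cv(glottal_opening, vowel_cycle):.3f}")
```

With these invented numbers, the within-segment ratio is nearly constant across rates while the between-segment ratio drifts, which is the qualitative pattern the argument rests on.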

However, de Jong's results showing that closure durations, VOT, and voiced vowel interval durations scaled proportionally with speaking rate for CV syllables, but not for VC syllables (cited in Chapter 5), suggest that mechanisms to implement proportional timing, or approximate proportional timing, may also be required in some cross-segmental circumstances, e.g. for CV syllables. Thus these results support the phoneme as a unit that governs proportional timing, but leave open the possibility that proportional timing may also occur within other types of groupings.

10.1.3.2.2 The phoneme accounts for phonological equivalence among variants

The discussion of phonological equivalence presented in Chapter 7 as evidence for symbolic representations also supports the phoneme as a symbolic unit, because it suggests that phonological equivalence can't be expressed in terms of a single shared distinctive feature. Ellis and Hardcastle (2002) observed that the /n/ in ban cuts can be realized as either [n] or [ŋ], but not as [m] (see discussion in Chapter 7). The phonological equivalence between [n] and [ŋ] is difficult to account for without reference to an abstract phoneme /n/, whose realizations are not restricted to articulations that reflect distinctive feature sets such as [+coronal] and [+anterior]. Moreover, the feature [+nasal] is not adequate to define the phoneme which is realized with the [n] and [ŋ] variants; the phonemes /ŋ/ and /m/ cannot be realized in this way.

The British English evidence cited in Chapter 7 for 1) glottal-stop variants of /t/ without tongue-tip raising, along with 2) aspirated variants with tongue-tip raising, in different positions in the syllable or word (i.e. glottal stop does not occur initially), also provides evidence for phonemes. That is, this observation is difficult to account for without an abstract symbolic phoneme, since it suggests the phonological equivalence of a [+constricted glottis] consonant (i.e. glottal stop) with a [+spread glottis], [+alveolar] consonant (i.e. aspirated [tʰ]); no single traditional distinctive place feature describes both of these variants. Because these different variants can occur in different positions in a word (e.g. glottal stop frequently occurs word-finally, and is not found in initial position in the word or stressed syllable, where aspirated /t/ normally occurs), this evidence also shows that syllable subconstituents are not adequate to explain the phonological equivalence of the variants (because their status as syllable subconstituents, i.e. onset vs. coda, is different), and this provides even stronger supporting evidence for the phoneme.

The Dutch evidence of uvular and alveolar variants of /r/ cited in Chapter 7 provides similar evidence for the phoneme; again, no single set of distinctive features can describe the set that includes both uvular /r/ and alveolar /r/.

10.1.3.2.3 Additional evidence supporting the phoneme as a representational unit

Several other types of evidence are consistent with the phoneme as a representational unit, because they suggest that 1) larger units, such as syllables, syllable subconstituents, and morphemes, as well as 2) smaller units, such as distinctive features, are often not adequate to explain observed behavior. Kazanina et al. (2017) and Fowler (2015) review many findings from the classic and more recent literature that together make a convincing case for phonemes. Evidence they cite for the necessity of a phonemic unit includes: differing patterns of morpho-phonological alternation (e.g. leaf/leaves vs. cuff/cuffs, suggesting 1) a different morphophonological status for leaf vs. cuff, because they condition different variants of the plural morpheme, and 2) the necessity of a phonemic representation to capture the similarity of the /f/ sounds in both words; Swadesh 1934); patterns of alliteration in poetry that involve single segments (e.g. kV can alliterate with klV, suggesting the inadequacy of syllable subconstituents such as onsets to describe this phenomenon); the success and prevalence of alphabetic writing systems; language games which manipulate phoneme-sized units rather than syllable subconstituents; patterns of infixation that require sequences of phonemes that do not make up syllables or syllable subconstituents; and word-recognition behavior that suggests the tracking of transitional probabilities among consonants, as opposed to e.g. syllables.

Additional evidence for the inadequacy of smaller units, such as single distinctive features or particular phonetic variants (phones or allophones), to provide a full account of phonological behavior comes from patterns of contrastive sound inventories that require reference to combinations of features (e.g. English voiceless nasals are not allowed, although voiceless stops are; Fowler 2015), and from phonotactic patterns which can't be stated simply in terms of distinctive features (e.g. English /pl/ and /sl/ are allowed, but /tl/ is not). Taken as a whole, this evidence highlights the inadequacy both of units larger than the phoneme and of units smaller than the phoneme in explaining the totality of observed behavior, and many pieces of evidence appear to suggest that both larger and smaller units are inadequate by themselves, thus motivating the phoneme in particular. In sum, this evidence motivates our choice of phonemes as the units to be slotted into a prosodic frame in XT/3C-v1; the resulting sequence of phonemes in their prosodic context provides some of the task requirements for a planned utterance. The following section discusses additional task requirements developed in the Phonological Planning Component.

10.1.4 Planning other task requirements

As noted above, during the operation of the Phonological Planning Component, the phonemes slotted into the prosodic frame provide one set of task requirements for the utterance, by specifying the phonemes that must be signaled as distinct from other phonemes, as well as the structural context for which they must be signaled appropriately. In this component, other task requirements are also identified, such as requirements for signaling other aspects of lexical contrast (e.g. lexical tone, in languages that use it); intonational requirements (e.g. phrase-level tonal targets); and, as discussed below, the requirement of producing the appropriate sets of acoustic cues associated with each type of lexical contrast in each utterance-specific context. In addition, non-grammatical task requirements are also identified at this stage, such as speaking quickly, loudly, or in a particular style (e.g. clear speech, rhythmicized speech, etc.). All of these task requirements relate to the factors in the middle three boxes in Figure 10.4, which will eventually influence the planned phonetics. The requirements are assigned relative priorities in the Phonological Planning stage, so that in the Phonetic Planning stage they can be balanced against one another and against movement costs to yield optimal movements that meet the prioritized goals. Once the relative priorities of the task requirements and movement costs have been quantified as weights in the Phonetic Planning Component, these weights will dictate the planned surface expression of boundary strength and relative prominence, via the movements chosen to implement the cues for each context.

What all of this means is that the final phonetic shape of an utterance will depend on the types of task requirements chosen by the speaker and on their weightings, driven by their relative priorities. For example, if a speaker decides that holding the floor is a high priority, s/he may curtail cues to an intonational phrase boundary (e.g. final lengthening, pause, F0 markers, voice quality changes) in order to discourage the interlocutor from breaking in. And if a speaker chooses a fast speaking rate as a relatively high priority, then the optimal phonetic output may have fewer, or less salient, phonetic cues at prosodic constituent boundaries than if the fast rate had a lower priority, since producing the lengthened durational cues to a boundary takes time.
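As a toy illustration of this balancing, consider a single phonetic parameter, the amount of boundary-related final lengthening. The sketch below is ours, not the model's actual cost functions: it assumes quadratic costs for two competing requirements and shows how raising the weight on a fast-rate requirement shifts the optimal lengthening downward, i.e. toward less salient boundary cues.

```python
def total_cost(lengthening_ms, w_boundary=1.0, w_rate=1.0):
    """Weighted sum of two competing task-requirement costs (both invented):
    signaling the boundary (cost falls toward a nominal 100 ms cue) and
    speaking fast (cost grows with any added duration)."""
    boundary_cost = (100.0 - lengthening_ms) ** 2
    rate_cost = lengthening_ms ** 2
    return w_boundary * boundary_cost + w_rate * rate_cost

candidates = range(0, 101, 5)  # candidate lengthening values in ms
for w_rate in (0.5, 1.0, 4.0):
    best = min(candidates, key=lambda d: total_cost(d, w_rate=w_rate))
    print(f"rate weight {w_rate}: optimal lengthening = {best} ms")
```

Running this prints progressively shorter optimal lengthening values as the rate weight grows, mirroring the claim that a high-priority fast rate yields less salient boundary cues.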

Several aspects of Figure 10.4 are worthy of comment. First, a non-trivial assumption of this theory is that talkers are able to compute language redundancy; it is not known how this is accomplished (although evidence that they do so is increasing). Second, the effects of language redundancy (predictability from context) on planned phonetic form are assumed to be indirect, in that language redundancy affects planned prosodic structure, and prosodic structure and other factors affect planned surface phonetics. This view represents a current hypothesis, but it is possible that language redundancy might have additional, direct effects on phonetic form (in addition to those that are mediated by prosodic form). Third, it is assumed that non-grammatical factors (such as rate and style of speech) have a direct effect on planned surface phonetics. Although these factors have been observed to affect aspects of prosody (e.g. fewer 'breaks' at faster rates of speech, cf. Caspers 1994), the current proposal is that a speaker would plan the same phonological prosodic structure (i.e. the same relative prominence and relative boundary strength) for a given utterance at different rates of speech, but that the planned phonetic correlates of this structure would be different at different rates, because the rate-of-speech requirement would be balanced against the prosodic structure requirement in determining the optimum phonetic characteristics to meet the competing demands. This balancing takes place in the Phonetic Planning Component. Finally, the list of factors in the 'Non-grammatical factors' box is intended as a preliminary indicator of the many non-grammatical factors that might be at work, and may not be exhaustive.

10.1.5 Mapping symbolic phonemes to landmarks and other acoustic cues

At the Phonological Planning stage, each symbolic representation of a sound category in its context (prosodic, stylistic, etc.) is associated with one or more acoustic landmarks (spectral discontinuities), which in turn are associated with a set of additional qualitative acoustic/auditory cues; together, these landmark cues and landmark-related cues will signal the contrastive features of the sounds of the speaker's intended words (Stevens 2002, 2005). The choice of context-specific cues is a crucial step in providing a bridge, or translation, between the symbolic representations used in the Phonological Planning Component and the quantitative specifications of the Phonetic Planning Component. This section elaborates on the evidence for the claim that these goals are formulated in acoustic terms, and distinguishes between two types of
qualitative cues to features that are chosen at this stage: Landmark cues and Landmark-related cues.

10.1.5.1 Evidence for acoustic/auditory goals of speech production

The view that phonetic goals are ultimately acoustic/auditory, rather than exclusively motoric, is motivated by Perkell and colleagues' work (Perkell, Matthies, Svirsky, and Jordan 1993; Perkell, Matthies, Tiede, Lane, Zandipour, Marrone, Stockmann, and Guenther 2004; Perkell 2012) showing within-speaker motor equivalence among separate constrictions that contribute to a similar acoustic/auditory goal. For example, the low F1 and F2 pattern for English /u/ can be produced with relatively more lip rounding (constriction and protrusion) and relatively less tongue-back raising, or vice versa (Atal, Chang, Mathews, and Tukey 1978). Savariaux, Perrier, and Orliaguet (1995) present supporting evidence from a study of French /u/, in which a 2-cm-diameter lip tube was inserted between the lips of the participants in a perturbation condition. In this study, seven of the eleven participants showed compensatory backward tongue movement on the first post-perturbation trial, although none of the participants achieved full compensation. On subsequent trials, compensatory behavior improved, in the sense that most speakers achieved more /u/-like acoustic patterns (lower F2) and one speaker achieved complete compensation. This type of motor-equivalent trade-off is not possible in current versions of AP/TD, since lip protrusion and tongue-back raising are separate gestures in that theory, i.e. they involve two separate constrictions. In AP/TD, trade-offs are possible between articulatory movements that contribute to a single gesture (e.g. upper-lip, lower-lip, and jaw movements can trade off in bilabial constrictions), but not between separate gestures.

Guenther, Espy-Wilson, Boyce, Matthies, Zandipour, and Perkell (1999) also found motor-equivalent trade-offs for the English r-sound that would be difficult to explain in AP/TD. The low F3 appropriate for approximant 'r' can be produced with a bunched tongue configuration (modeled as a tongue-body constriction gesture in AP/TD), or with a retroflex tongue-tip configuration (modeled as a tongue-tip constriction gesture in AP/TD). Guenther et al. (1999) showed motor-equivalent trading relations between these two types of articulation within a speaker in different contexts: 1) a longer and narrower tongue-body constriction (typical of bunched configurations), and 2) a longer front cavity (typical of retroflex configurations), in producing an r-sound with a low F3 signature. This evidence suggests that a speaker can have several gesturally different, motor-equivalent ways of realizing similar acoustic or auditory characteristics. In AP/TD, this type of motor equivalence is not predicted, for two reasons. First, the only type of motor-equivalent
compensation that can occur within AP/TD is between articulators used to produce the same gesture; there are no phonemic representations that yoke together two or more gestures so that they can compensate for one another. And second, the goals in AP/TD are articulatory, rather than acoustic/auditory, so that yoking to achieve a common acoustic goal is not possible.

10.1.5.2 Landmark cues

Stevens (2002) proposes three types of acoustic landmarks, or spectral discontinuities, each associated with one of the three articulator-free distinctive features [vowel], [glide], and [consonant]. (This approach builds on the distinction between articulator-free features, such as consonantal, and articulator-bound features, such as labial, proposed by Halle 1995.) Each landmark signals a type of vocal-tract constriction configuration, without specification of the articulators which are performing the constriction. A vowel landmark consists of a maximum in first-formant frequency, accompanied by a maximum in low-frequency spectrum amplitude (Howitt 2000; Lee S-M and Choi 2012b). A glide landmark consists of a minimum in low-frequency spectrum amplitude, without spectral discontinuity (Espy-Wilson 1992). Consonant landmarks are discontinuities in amplitude across a range of frequencies in the spectrum, caused by the formation or release of oral constrictions (Liu 1996; Park 2008; Suchato 2004; Lee, Choi, and Kang 2011, 2012). Other spectral discontinuities that are not landmarks can also occur, e.g. at voice onset after glottal opening (Choi 1999; Lee and Choi 2012a), at the opening or closing of the velum (Chen 1996, 1997), and at the onset or offset of episodes of irregular pitch periods. However, these discontinuities are not created by oral constrictions/narrowings and releases/widenings, and do not signal contrasts among articulator-free features, so they are not classified as landmarks; instead, they are classified separately as 'acoustic events'. To give a concrete example, English pre-stressed /t/ in a word such as attack is likely to be associated with 1) a constriction-formation landmark, i.e. the abrupt change in spectrum and amplitude that occurs at consonantal closure after the voicing for a vowel; 2) a constriction-release landmark, i.e. the release burst; and 3) an additional acoustic event, i.e. a spectral discontinuity at the moment of voicing onset after the aspiration interval.
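As a rough illustration of how such landmark detection can be framed computationally, the sketch below is ours, greatly simplified relative to the detectors cited above, with invented frame tracks and thresholds: a vowel landmark is located where F1 and low-frequency amplitude peak together, and consonant landmarks where band energy changes abruptly between frames.

```python
def vowel_landmarks(f1, low_amp):
    """Frames where F1 and low-frequency amplitude peak simultaneously,
    approximating Stevens-style vowel landmarks."""
    hits = []
    for t in range(1, len(f1) - 1):
        f1_peak = f1[t - 1] <= f1[t] >= f1[t + 1]
        amp_peak = low_amp[t - 1] <= low_amp[t] >= low_amp[t + 1]
        if f1_peak and amp_peak:
            hits.append(t)
    return hits

def consonant_landmarks(energy_db, threshold_db=12.0):
    """Frames with an abrupt rise or fall in band energy, standing in for
    the amplitude discontinuities at constriction formation and release."""
    diffs = [b - a for a, b in zip(energy_db, energy_db[1:])]
    return [t + 1 for t, d in enumerate(diffs) if abs(d) > threshold_db]

# Invented 10 ms frame tracks for a VCV sequence like 'aba'.
f1      = [300, 550, 700, 650, 400, 280, 300, 600, 720, 640, 380]
low_amp = [40, 55, 62, 58, 45, 30, 32, 54, 63, 57, 42]
energy  = [60, 62, 63, 62, 50, 35, 36, 58, 62, 61, 55]

print("vowel landmarks at frames:", vowel_landmarks(f1, low_amp))     # [2, 8]
print("consonant landmarks at frames:", consonant_landmarks(energy))  # [5, 7]
```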


10.1.5.3 Landmark-related cues
Stevens (2002) proposes that each of the landmark cues is associated with cues to the articulator-bound features of the same phonemic segment, i.e. with cues to features specifying which articulator has formed the constriction and release, as well as cues to voicing, e.g. [stiff vocal folds], and other modulating features, e.g. [round] and [tense]. These cues can be considered landmark-related cues. Landmark-related cues to vowels might include the auditory spectral patterns associated with different formant frequency patterns, e.g. relatively low F1 and high F2 for /i/, relatively high F1 and low F2 for /a/, etc. In the attack example mentioned above, the consonantal landmarks are likely to be accompanied by cues to the place and voicing features of the consonant. For example, in word-onset or pre-stressed-vowel positions, these might include silence between the closure and release landmarks (indicating that it is a voiceless non-continuant obstruent), aspiration noise following the release burst (providing an additional cue to the [-voice] feature), and the formant transitions and high-frequency release burst (indicating that its constriction is produced at the alveolar ridge, as a cue to the place feature). However, in different positions in an utterance, e.g. in word-final position, the features of this same phoneme /t/ might be associated with a different set of cues, e.g. glottalization (either with or without formant transitions, depending on whether the tongue tip is involved in producing the constriction; cf. British /t/ glottal-stop allophones that don't involve tongue-tip raising, Heyward, Turk, and Geng 2014). Such considerations suggest that part of the operation of the Phonological Planning Component involves choosing the relevant set of cues for each segment in each context. In addition, sets of cues are chosen to signal position in prosodic structure. For example, in phrase-final position, cues that might be chosen in English include longer duration, irregular pitch periods (final creak), and F0 cues to final boundary tones. We propose that, in the Phonological Planning Component, these cues are represented in abstract symbolic or relational form, and that they take on explicit quantitative values (first as quantitative acoustic goals, and then as quantitative articulatory instructions for carrying out these goals) in the subsequent computations carried out in the Phonetic Planning Component. Like the task requirements relating to prosodic context, rate of speech, etc., production of the acoustic/auditory cues at landmarks can be considered task requirements in themselves. For example, for /i/, there is the requirement of specifying a spectrum at the vowel landmark with amplitude peaks corresponding to a relatively low F1 and high F2; and for a phonologically short vowel, the requirement to produce a short vocalic interval (e.g. Nakai, Kunnari, Turk, Suomi, and Ylitalo 2009; Nakai, Turk, Suomi, Granland, Ylitalo, and Kunnari 2012). Such relational expressions are consistent with the abstract symbolic specifications envisioned for the Phonological Planning Component.
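As a concrete illustration of what choosing a cue set in context might look like, the fragment below represents cue sets symbolically, keyed by segment and prosodic position, using the attack example above. The cue labels and the lookup function are illustrative assumptions about a possible encoding, not a proposal about the actual cue inventory.

```python
# Qualitative, symbolic cue sets for /t/ in two prosodic contexts
# (illustrative; real cue inventories are richer and speaker-specific).
CUE_SETS = {
    ("t", "pre-stressed"): [
        "closure_landmark", "release_burst_landmark",
        "silence_during_closure",     # cue to voiceless non-continuant obstruent
        "aspiration_noise",           # additional [-voice] cue
        "alveolar_burst_spectrum",    # place cue
        "formant_transitions",        # place cue
    ],
    ("t", "word-final"): [
        "glottalization",             # may occur with or without
        "formant_transitions",        # tongue-tip raising (dialect-dependent)
    ],
}

def choose_cues(segment: str, position: str):
    """Return the qualitative cue set for a segment in a prosodic context.

    At this stage cues are symbolic labels only; quantitative values are
    assigned later, in the Phonetic Planning Component."""
    return CUE_SETS[(segment, position)]
```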


Following Flemming (1997, 2001), it is also proposed that other task requirements may be associated with some types of cue production. For example, an associated requirement for vowels would be to produce spectra that maximize distinctions among contrastive vowel phonemes in each context (Lindblom 1986), and this type of contrast maximization may also be required for other types of cues, such as F0 for tones, duration for quantity distinctions, etc.
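The contrast-maximization requirement can be given a toy quantitative form, as in the dispersion-style sketch below: vowel systems whose members crowd together in formant space incur a higher cost. The inverse-squared-distance form and the example formant values are assumptions chosen for illustration, not a claim about the correct cost function.

```python
import itertools
import numpy as np

def dispersion_cost(vowel_targets):
    """A toy dispersion-style cost (after Lindblom 1986; Flemming 2001):
    penalize vowel systems whose members sit close together in (F1, F2)
    space."""
    cost = 0.0
    for (f1a, f2a), (f1b, f2b) in itertools.combinations(vowel_targets, 2):
        dist = np.hypot(f1a - f1b, f2a - f2b)
        cost += 1.0 / (dist ** 2 + 1e-9)
    return cost

# A spread-out /i a u/ system is cheaper than a crowded one.
spread = [(280, 2250), (700, 1200), (310, 870)]   # rough Hz values
crowded = [(400, 1500), (450, 1400), (500, 1300)]
assert dispersion_cost(spread) < dispersion_cost(crowded)
```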

10.1.6 Summary of Phonological Planning
To summarize, at the end of the Phonological Planning stage the speaker will have serially ordered the symbolic phonemic segments that specify the lexical content of the utterance into a prosodic frame. In addition, the speaker will have set the prosodic task requirements and the non-grammatical task requirements such as rate or style of speech, and will have prioritized these requirements relative to one another. And finally, the speaker will have selected sets of qualitative, relationally specified acoustic cues to signal each contrastive element (e.g. segment, tone) and prosodic context. As noted earlier, the choice of acoustic cues provides the bridge between the symbolic nature of the representations in the Phonological Planning Component and the quantitative nature of the representations in the Phonetic Planning Component, because, as discussed in the next section, each cue will be mapped onto a range of quantitative values for its acoustic characteristics.
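As a summary device, the sketch below gathers the products of Phonological Planning into a single container, emphasizing that everything at this stage is symbolic or relational and nothing is yet quantitative. The field names and types are illustrative assumptions, not part of the model proper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PhonologicalPlan:
    """Hypothetical container for the output of Phonological Planning."""
    segments: List[str]                # serially ordered phonemic segments
    prosodic_frame: Dict[str, object]  # constituents, boundary strengths,
                                       # relative prominences
    cue_sets: Dict[int, List[str]]     # per-segment qualitative cue labels
    task_requirements: List[str]       # e.g. rate, style, quantity signaling
    priorities: Dict[str, int] = field(default_factory=dict)  # relative ranks

plan = PhonologicalPlan(
    segments=["@", "t", "ae", "k"],    # "attack", ASCII-ized for illustration
    prosodic_frame={"phrase_final": True, "stress": [0, 1]},
    cue_sets={1: ["closure_landmark", "release_burst_landmark",
                  "aspiration_noise"]},
    task_requirements=["normal rate"],
)
```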

10.2 Phonetic Planning

10.2.1 Key features of Phonetic Planning
It is proposed that the output of the Phonological Planning Component provides the input to the Phonetic Planning stage, where the specifications of utterance goals (and their priorities), which are qualitative and relative, are mapped onto quantitative specifications for the acoustic and sensory characteristics of speech sounds, and for the movements that produce them. This first involves assigning quantitative weights to each of the goals and to the costs of movement (see Section 10.2.2.1). The goal of the Phonetic Planning Component is to determine values of phonetic parameters that will produce appropriate landmarks and other feature-relevant cues to signal the sequence of segments/contrastive elements in its prosodic frame, while


appropriately meeting the prioritized list of other task requirements for the utterance. Landmarks are assumed to be sequentially ordered or occasionally synchronous (as when the release of one closure corresponds to the closure for another, or vice versa, as when the 'release' of an /s/ occurs at the closure for a /t/ in an /st/ cluster), but they do not overlap, because they occur at single points in time. On this view, the speaker's goal is to produce an ordered series of landmarks, rather than overlapping (but nevertheless independent) gestures as in AP/TD. As discussed in Chapter 3, overlapping movements that are produced in events where the movements are thought of as independent tasks (e.g. rubbing the tummy while patting the head) have been shown to exhibit undesirable spatial coupling. Conceiving of speech as the production of a sequence of landmarks that occur at single points in time (i.e. as a sequence of tasks), and therefore can't overlap, diminishes this risk, even though the articulatory movements used to produce each landmark, as well as to produce sequences of landmarks, do overlap in time. Overlapping speech movements contributing to a single landmark are thus analogous to the movements of the two hands when unscrewing a jar lid (where they are unlikely to show spatial coupling because they contribute to a single unified task). It is assumed that the spectral characteristics of the landmarks and other feature cues, the time between them, and the spatial and temporal characteristics of the movements that produce them, are all determined using the principles of Optimal Control Theory (see Chapter 8). OCT, as well as its SOFCT development (Bellman 1957; Todorov and Jordan 2002; Shadmehr and Mussa-Ivaldi 2012, among others in the non-speech literature; and Nelson 1983; Lindblom 1990; Kirchner 1998; Zsiga 2000; Flemming 2001; Šimko and Cummins 2010; Katz 2010; Braver 2013; Windmann 2016; Lefkowitz 2017, among others in the speech literature), provides a promising framework for planning optimal values of parameters, particularly for movements that are influenced by a range of factors, as is the case in speech. Following this theory and its developments (e.g. Todorov, Li, and Pan 2005 and Li 2006) which account for motor synergies/coordinative structures (cf. Chapter 8), it is assumed that speakers compute near-optimal values for sensory (auditory + somatosensory) goals of movement and their timing, and that they optimize, and thus minimize, the costs of the motor commands used to produce them. To do this, an internal model of the relationship between motor commands and their motor and sensory consequences is assumed. For reasons detailed in Chapter 4, and like Georgopoulos (2002), the XT/3C-v1 model proposes separate planning and specification of temporal and


spatial aspects of movement. Temporal information is specified and tracked using general-purpose, phonology-extrinsic timing mechanisms, including tau coupling (Lee 1998, 2009, described in Chapter 9) to specify the movement time-course as well as endpoint-based movement coordination. To account for observations that movement endpoints often show greater spatial and timing accuracy than other parts of movement (discussed in Chapters 4 and 7), separate specification (and prioritization) is proposed for the parts of movement most closely related to the goals specified in the Phonological Planning Component vs. the parts of the movements that are not as closely related to those goals. In many cases, the parts of movement related most closely to the phonological goals are the movement endpoints. This proposal differs from the AP/TD proposal, in which temporal and spatial aspects of a movement are integrated, no phonology-extrinsic timing mechanisms are invoked, and the part of movement most closely related to a gestural target (i.e. the movement endpoint) cannot be treated differently from other parts of a movement. Lee's General Tau theory (Chapter 9) provides a mechanism for implementing the greater timing accuracy at movement endpoints, as well as appropriate shapes of velocity profiles (once temporal information is combined with spatial information to generate movement) and endpoint-based movement coordination. The role of the Phonetic Planning Component in the XT/3C-v1 approach is thus to determine values for phonetic parameters that will meet the task requirements specified in the Phonological Planning Component, at minimum cost. Key features include the separate specification of temporal and spectral aspects of the acoustic signal, and the separate specification of goal-related vs. non-goal-related parts of movement. The temporal evolution of movement and movement coordination are planned according to Lee's General Tau theory. All of these features are explained in more detail in subsequent sections.

10.2.2 Phonetic Planning Sub-Components
When the operation of the Phonological Planning Component is complete, speakers have chosen the appropriate set of qualitative landmark cues and landmark-related cues for the features of each speech sound in its context, and have specified the relative prominence of each syllable and word in the utterance, the relative strength of boundaries between words (i.e. the boundary strengths determined by prosodic constituent structure), and the relative


priority of each task requirement, including all of the prosodic and segmental requirements, as well as non-grammatical requirements (e.g. style etc.), cf. Figure 10.4. In the Phonetic Planning Component, these qualitative choices and relative differences need to be mapped onto quantitative values for movement parameters. The following sections provide a working hypothesis for how this mapping process might operate, i.e. how spectral targets and properties of movements might be specified, by describing the following steps: quantifying the weights for the task requirements and costs of movement (Section 10.2.2.1), quantifying sensory targets with ranges of parameter values (Section 10.2.2.2), computing movement parameters that satisfy task requirements at minimum cost (Section 10.2.2.3), computing durations between landmarks (Section 10.2.2.4), refining the sensory goals of a movement (Section 10.2.2.5), planning the time course of a movement (Section 10.2.2.6), and planning coordinated movements (Section 10.2.2.7).

10.2.2.1 Quantifying the task requirements + costs of movement
First, numerical costs are assigned to deviations from meeting the task requirements specified in the Phonological Planning stage, including a cost for time (corresponding to the speech-rate task requirement), such that greater costs are assigned for deviations from higher-priority requirements. Additional costs, i.e. costs of movement, are also quantified: these include the costs of energy or effort, and may include other costs (cf. discussion in Chapter 8).

10.2.2.2 Quantifying sensory targets with ranges of parameter values
Next, the qualitative cues chosen during the operation of the Phonological Planning Component are mapped onto sensory parameters, and value ranges for these parameters are chosen, which correspond to the full range of contextual variability for each contrastive sound (Keating 1990). Guenther (1995 et seq.) presents a neural network model of mappings of this type, which proposes that larger target ranges are used for parameters that are less relevant or critical, and smaller target ranges for more relevant parameters; this proposal is adopted here. It is assumed that the ranges for parameters that are less relevant or even non-relevant might be dictated solely by biomechanical constraints of the vocal tract, in contrast to the more relevant parameters, whose ranges would be more influenced by weighted task requirements. As discussed in Guenther (1995), this target range proposal is similar conceptually to Keating's (1990) window model of coarticulation, but differs in that the ranges are specified for sensory rather than articulatory variables.
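The relevance-dependent window idea can be sketched very simply, as below: contrast-relevant parameters get narrow windows, less relevant ones get wide windows, and a planned production counts as on-target when every parameter falls inside its window. The numerical ranges are invented for illustration only.

```python
# Illustrative sensory target ranges (Hz) for /i/: narrower windows for
# more contrast-relevant parameters, wider for less relevant ones
# (after Guenther 1995; the values themselves are assumptions).
TARGET_RANGES = {
    "F1": (250, 350),     # highly relevant for /i/: narrow window
    "F2": (2000, 2400),   # highly relevant: narrow window
    "F3": (2500, 3400),   # less relevant for /i/: wide window
}

def in_target(params: dict) -> bool:
    """True if every sensory parameter falls inside its target window."""
    return all(lo <= params[name] <= hi
               for name, (lo, hi) in TARGET_RANGES.items())

print(in_target({"F1": 300, "F2": 2200, "F3": 2800}))  # True
```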


10.2.2.3 Computing movement parameters that satisfy task requirements at minimum cost
Next, following SOFCT proposals discussed in Chapter 8, an optimal control policy is formulated to reach the sensory targets from any current state, based on a learned internal model of the relation between motor commands, articulation, and sensory consequences (cf. Guenther 1995; Kello and Plaut 2004; Toda, Black, and Tokuda 2008; Richmond 2009; Richmond, Ling, Yamagishi, and Uría 2013). The field's understanding of this relation has been shaped by the pioneering work of Fant (1960) and Stevens (1998). The optimal motor commands are chosen on the basis of minimizing the costs of moving from one target to the next (e.g. energy or effort costs, and the cost of time), as well as the costs of not achieving the task requirements specified in the Phonological Planning Component (i.e. the cost of not achieving phonemic contrast, the cost of not appropriately signaling prosodic position, and the costs of not meeting the stylistic requirements). In this model, movement duration is chosen to minimize costs of temporal and spatial accuracy, effort/energy, and time. That is, the movement should be long enough to guarantee that it reaches the target range specified above, i.e. with longer durations for longer-distance movements, and longer durations for greater spatial accuracy, e.g. as dictated by smaller target ranges (Fitts 1954; see discussions in Chapters 3 and 7). At the same time, the movement should be short enough in duration to guarantee sufficient temporal accuracy, i.e. to guarantee that the target range is reached at the specified time, because movements that are shorter in duration are more accurate in terms of their timing (Hancock and Newell 1985). A cost for time ensures that movements of longer distances are produced with higher peak velocities than movements of shorter distances, so that they are approximately of the same duration (Ostry, Keller, and Parush 1983; Harris and Wolpert 2006; Tanaka et al. 2006). It is also assumed that speakers will minimize costs of effort, where more effort would be required e.g. to produce higher velocities. Speakers must thus find the optimum balance between potentially conflicting costs: temporal accuracy and the cost of time encourage faster movements, but spatial accuracy (smaller target ranges) and the cost of effort encourage slower movements. See additional discussions of movement costs in Chapter 8.
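The competition among these costs can be made concrete with a toy duration optimization, as below: a spatial-accuracy term that favors longer movements is balanced against temporal-accuracy and time terms that favor shorter ones, and the optimal duration grows (sublinearly) with movement distance, in the spirit of Fitts' law. The functional forms, weights, and units are assumptions for illustration, not the model's actual cost function.

```python
import numpy as np

def movement_cost(T, dist, w_spatial=1.0, w_temporal=1.0, w_time=5.0):
    """Toy duration cost: endpoint scatter grows with speed (dist/T),
    timing noise grows with duration, and time itself is penalized."""
    endpoint_scatter = dist / T      # faster movements are spatially noisier
    timing_noise = T                 # longer movements are temporally noisier
    return (w_spatial * endpoint_scatter ** 2
            + w_temporal * timing_noise ** 2
            + w_time * T)

durations = np.linspace(0.05, 1.5, 300)   # candidate durations (arbitrary units)
for dist in (0.5, 1.0, 2.0):              # movement distances
    best = durations[np.argmin([movement_cost(T, dist) for T in durations])]
    print(f"distance {dist}: optimal duration {best:.2f}")
# Longer distances -> longer optimal durations, but less than proportionally,
# since the time cost also pushes long movements toward higher peak speeds.
```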


10.2.2.4 Computing durations between landmarks
A key component of this phonology-extrinsic-timing model is the specification of the optimal durations between landmarks, as well as the relationship between these inter-landmark durations and the optimized movement times identified above. This proposal contrasts with AP/TD's inter-gestural phasing mechanism for target sequencing. XT/3C-v1 assumes that the optimal durations between landmarks are found using Optimal Control Theory; Chapter 8 discussed several existing approaches to doing this that have been proposed in the phonetics literature. One option is to propose abstract durational targets for specific contexts, but this type of approach presents some problems, such as how to determine what the abstract targets should be, and how the learner could acquire them. The XT/3C-v1 model therefore follows the proposals of Šimko and Cummins (2010) and Windmann (2016) (based on many others in the non-speech motor control literature), in assuming that there is a cost of time, which competes with other costs and task requirements. Generally speaking, this time cost corresponds to the speech-rate task requirement specified in the Phonological Planning Component, which penalizes longer durations between speech landmarks (see Section 10.2.2.3 and Chapter 8 for the role of the cost of time in explaining higher peak velocities for longer-distance movements). As discussed in Chapter 8, a longer time before reaching a goal postpones the reward of reaching it, and is therefore undesirable, and the cost of time would be even higher for faster speech rates. It is hypothesized that the time between landmarks additionally depends on prosodic context, where intervals between landmarks in stretches of speech affected by prosodic position (e.g. at constituent boundaries and prominences) are required to be longer than in unaffected stretches (see also Šimko and Cummins 2010 and Windmann 2016 for similar proposals). Phrase-final lengthening is a relevant example, which Beňuš and Šimko (2014) model using a decreased cost of time in phrase-final position. Since the pre-boundary rhyme appears to be the interval most affected by phrase-final lengthening,¹¹ the cost of time between landmarks in rhymes in pre-boundary position would be lower than elsewhere, and consequently, their duration would be longer than the duration of comparable phrase-medial rhymes. Because time is costly, the prediction is that phrase-final lengthening should be minimized where it can be. Evidence consistent with this view can be found in van Santen and Shih (2000), who observed that it is not the case that e.g. every phrase-final segment has the same duration; rather, all segments of a particular type are longer in phrase-final position than medially, but the absolute amount of lengthening is segment-specific (e.g. different for different types of vowels; see also Berkovits 1993a,b). Nevertheless, the relative durational rank ordering among segments is preserved. That is, because time is costly, speakers use the shortest durations that can achieve listener perception of the intended segmental contrasts in their prosodic structural positions, without paying undue costs of e.g. effort. It is expected that other task requirements specified in the Phonological Planning Component, such as requirements to signal phonological quantity (Šimko, O'Dell, and Vainio 2014), as well as a possible requirement to produce a periodic speech style (as in limericks or metrically regular recitations of other forms of poetry), would also influence the time between landmarks, and, like the phrase-final example given above, might also require manipulation of the weight of the time cost in particular positions, during the operation of the Phonetic Planning Component.

The above discussion applies to situations where adjacent landmarks are produced by different articulators. To model cases in which adjacent landmarks are produced with the same articulator, e.g. where jaw movement contributes to both the consonant-release landmark and the vowel landmark in a CV syllable, the duration between landmarks would need to be at least as long as the optimal jaw-movement duration identified in Section 10.2.2.3. For example, longer movement durations would be required to produce low vowels in CVC syllables as compared to high vowels, owing to longer movement distances for low vowels, all other things being equal (see Harris and Wolpert 1998 for an explanation of the longer duration of longer-distance movements, described by Fitts' law). The requirement for longer-duration movements may account for the longer durations between C1 release and C2 closure landmarks observed in these cases (see e.g. Peterson and Lehiste 1960). A similar situation might be observed if one of a sequence of landmarks has a target range that prohibits the use of a particular articulator (Henke 1966), as might be the case for the involvement of lip protrusion in the production of /i/ in an /iCu/ sequence. In this case, lip protrusion in anticipation of the rounded /u/ vowel would be constrained to begin after production of the unrounded /i/ vowel landmark, and potentially even later, depending on the temporal vs. spatial accuracy costs of producing the landmark for /u/ (cf. Benguerel and Cowan 1974; Perkell and Matthies 1992).

Although segments in prosodically prominent syllables are typically longer than segments in prosodically non-prominent positions (for both phrasal and lexical prominence), these durational differences are usually accompanied by spectral differences suggesting that the targets are hyperarticulated (cf. de Jong 1995 for corroborating articulatory evidence). It is therefore possible that these durational differences are not caused by an explicit requirement for longer durations (modeled as a decrease of the time cost in Windmann 2016) between landmarks for prominent syllables, but are instead an indirect consequence of hyperarticulation, that is, of the greater distance required to produce their targets, which, according to Fitts' law (Fitts 1954), should result in longer movement durations. Phrase-initial lengthening is another phenomenon that might be explained in this way, since phrase-initial segments are known to be hyperarticulated (e.g. Fougeron and Keating 1997). Further work will be needed to determine if the longer durations between landmarks that are associated with both prosodic prominence and phrase-initial lengthening must be accounted for via a decreased (i.e. lower-weighted) time cost (as proposed in Windmann 2016), or whether the durational differences are an indirect consequence of the movement durations required to produce accurate targets, computed as described in Section 10.2.2.3.¹²

¹¹ Additional phrase-final lengthening has been found earlier than the pre-boundary rhyme, e.g. on the main-stress syllable rhyme -ad- in Madison, in addition to the lengthening on the phrase-final rhyme (e.g. -on), in words where primary lexical stress is non-final; see Turk and Shattuck-Hufnagel (2007), and discussion in Section 10.1.1.1.
¹² As this example illustrates, it is not always straightforward to determine the most appropriate way to explain or model surface-durational patterns; cf. also debates in the 1990s about appropriate ways to model the timing of anticipatory coarticulation, e.g. Bell-Berti, Krakow, Gelfer, and Boyce (1995); Perkell and Matthies (1992).
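Returning to the time-cost account of phrase-final lengthening, the sketch below shows how lowering the weight of the time cost in phrase-final position can lengthen every segment by a segment-specific amount while preserving the durational rank ordering, as van Santen and Shih (2000) observed. The cost forms, weights, and base durations are assumptions for illustration only.

```python
def optimal_interval(base, w_time):
    """Toy inter-landmark interval: a duration-accuracy term scaled by the
    segment's base duration, balanced against a linear cost of time.
    Minimizing (T - base)**2 / base + w_time * T gives
    T* = base * (1 - w_time / 2)."""
    return base * (1.0 - w_time / 2.0)

segments = {"i": 0.10, "a": 0.14, "s": 0.12}   # base durations (s), invented
for position, w_time in (("phrase-medial", 0.4), ("phrase-final", 0.1)):
    print(position, {s: round(optimal_interval(b, w_time), 3)
                     for s, b in segments.items()})
# A lower time cost phrase-finally lengthens all segments, by segment-
# specific amounts, while preserving their durational rank order
# (cf. van Santen and Shih 2000; Benus and Simko 2014).
```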


10.2.2.5 Refining the sensory goals of movement
The sensory (auditory + somatosensory) goals of movement, originally specified as target ranges, are then refined to correspond to the sensory consequences of the endpoints of the optimal movements chosen as described in Section 10.2.2.3. Updating the goals in this way is required to account for patterns of suppression of neural activity in the auditory cortex in response to self-generated speech (Speaking-Induced Suppression), as well as patterns of movement correction during speaking (Niziolek, Nagarajan, and Houde 2013). These patterns suggest that the planned goals of speech production are more specific than the parameter-value ranges discussed in Section 10.2.2.2. This evidence suggests that speakers use predictions of the auditory consequences of their own articulations to suppress their auditory response to their own speech (see Houde and Chang 2015 for a review), in a highly specific way, i.e. at the critical frequencies of the intended productions. Studies showing Speaking-Induced Suppression effects typically record neural activity in the auditory cortex in two types of conditions: 1) while participants speak and listen to their own productions while they are being uttered, and 2) while participants stay silent and listen to pre-recorded playback of the same utterances. These studies find that the auditory cortex response is suppressed in the speaking conditions compared to the no-speaking listening conditions, and suppression can be reduced or even abolished if auditory feedback during speaking is altered. Niziolek et al. (2013) showed that auditory suppression responses to more peripheral formant values in tokens of eat, Ed, and add were smaller in magnitude compared to more prototypical tokens, suggesting that the speakers had planned to produce prototypical tokens, and had suppressed the activity of their auditory cortices accordingly. Moreover, they found that speakers made on-line corrections of these more peripheral tokens, centering the formant frequencies by amounts that negatively correlated with the magnitude of the Speaking-Induced Suppression effect. That is, when speaking-induced suppression was smaller because the vowels were more peripheral and less prototypical, corrections were greater in magnitude. Together, these findings suggest that goals for repeated productions of the same vowel correspond to specific prototypical values of acoustic dimensions (e.g. median F1, F2, etc.), rather than to a range of values for these dimensions (Niziolek et al. 2013). Although these findings suggest that the acoustic goals that speakers formulate are highly specific, results from Niziolek and Guenther (2013) suggest that information about the sensory target ranges described in Section 10.2.2.2 is not discarded. Niziolek and Guenther's (2013) study of speakers' compensatory responses to altered auditory feedback when listening to their own speech showed that responses to unexpected shifts in F1 and F2 were greater in magnitude (by a factor of three) if the shifts induced cross-category percepts (i.e. percepts of different vowels), as compared to conditions where the same magnitude of shift induced within-category percepts (i.e. different versions of the same vowels). The mental representation of this range information may allow speakers to be strategic in terms of the adjustments or corrections they make in the Motor-Sensory Implementation stage (discussed in Section 10.3). That is, movements could be more likely to be corrected in response to state estimation or feedback if the uncorrected movement would lead to a target that falls outside of the allowable movement target range.
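One way to picture the coexistence of a specific prototype goal with a retained target window is the toy correction policy below, in which corrections are modest within the window and roughly three times larger when the predicted outcome falls outside it, echoing the cross-category vs. within-category asymmetry just described. The gains and the thresholding rule are assumptions for illustration.

```python
def correction(predicted_f2, prototype_f2, window, gain_in=0.3):
    """Toy correction policy: the refined goal is a specific prototype,
    but the retained target window modulates the response magnitude
    (cf. Niziolek et al. 2013; Niziolek and Guenther 2013)."""
    lo, hi = window
    gain = gain_in if lo <= predicted_f2 <= hi else 3 * gain_in
    return gain * (prototype_f2 - predicted_f2)

print(correction(2150, 2200, (2000, 2400)))   # within window: +15.0 Hz nudge
print(correction(1900, 2200, (2000, 2400)))   # outside window: +270.0 Hz
```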


10.2.2.6 Planning the time-course of movement
Once the durations between landmarks have been specified, the speaker can use this information (which determines the timing of movement endpoints), as well as the information about optimal movement durations, to plan when movements should start. However, another important component of phonetic planning relates to the time-course of movement, that is, the way movements evolve over time, once they have started. A general observation about well-practiced voluntary movements is that velocity profiles tend to be smooth and single-peaked, that is, with one acceleration phase and one deceleration phase. Although symmetric velocity profiles are commonly observed, asymmetric profiles are not uncommon (Perkell et al. 2002, discussed in Chapter 2). Chapter 8 discussed Stochastic Optimal Feedback Control Theory (SOFCT) approaches to generating optimal velocity profiles. Lee's (1998) General Tau theory approach is adopted here for generating the time-course of each movement, because it has the advantage of allowing temporal properties of movement to be specified separately from spatial properties, at least at an early stage in movement planning. This view fits well with evidence and proposals that temporal properties of movement can be specified independently of spatial properties (e.g. Georgopoulos 2002; see also supporting evidence discussed in Chapter 4). Although the SOFCT models discussed in Chapter 8 do not appeal to Lee's General Tau theory, Tau theory is not inconsistent with SOFCT models, since its parameters could be optimized using SOFCT cost functions. This approach is useful for the proposed model because it provides a mathematically simpler way of generating an appropriate movement time-course than OCT approaches, and provides an alternative to oscillator-based control for modeling the time-course of movement, for controlling the appropriate timing of movement-endpoint achievement, and for movement coordination. It is therefore proposed that once movement durations are determined (see Section 10.2.2.3), the optimal value of the coupling constant k_{Y,G} (in Lee's tau-coupling equation τ_Y(t) = k_{Y,G} τ_G(t)) is determined for each movement (see details and descriptions of equation parameters in Chapter 9). This value determines the coupling of the planned movement to the internal TauG Guide, and, once temporal information is combined with spatial information to generate movement, will determine the skewness of the velocity profile, with lower values of k_{Y,G} leading to more gentle approaches toward the endpoint. When combined with spatial information, the TauG-Guide coupling mechanism appears to be a plausible alternative to AP/TD's mechanism for creating appropriately shaped velocity profiles. In AP/TD, appropriately shaped velocity profiles are generated by manipulating gestural-activation rise time to shape the default velocity profile generated by mass–spring movements (but see Birkholz et al. 2011 and Sorensen and Gafos 2016 for different approaches). In XT/3C-v1, appropriate velocity profiles can be generated with the tau-theory equation of motion combined with spatial information, e.g. distance-to-planned-endpoint, as compared to one mass–spring equation + one activation equation in AP/TD. An advantage of General Tau theory is that it provides a plausible explanation for higher degrees of temporal accuracy at movement endpoints compared to other parts of movements, since accurate movement-endpoint timing can be achieved even when coupling to the TauG Guide is delayed (cf. discussion in Chapter 9).
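To illustrate how a TauG-guided time-course could be generated, the sketch below uses the standard intrinsic-guide form τ_G(t) = ½(t − T²/t); imposing τ_Y(t) = k τ_G(t) on a gap Y yields the closed-form solution Y(t) = y0(1 − (t/T)²)^{1/k}. This is a numerical illustration under assumed parameter values (a single movement of duration T and distance y0), not Lee's own implementation.

```python
import numpy as np

def taug_gap(t, T, y0, k):
    """Gap to the endpoint under coupling to the intrinsic TauG Guide,
    tau_G(t) = 0.5 * (t - T**2 / t).  Imposing tau_Y = k * tau_G gives
    Y(t) = y0 * (1 - (t/T)**2)**(1/k): the gap reaches zero exactly at
    t = T, whatever the values of y0 and k."""
    return y0 * (1.0 - (t / T) ** 2) ** (1.0 / k)

T, y0 = 0.3, 1.5          # movement duration (s) and distance (cm): assumed
t = np.linspace(1e-4, T, 500)
for k in (0.3, 0.5, 0.9):
    v = np.abs(np.gradient(taug_gap(t, T, y0, k), t))
    print(f"k={k}: velocity peaks at {t[np.argmax(v)]:.2f} s of {T} s")
# Lower k skews the velocity peak earlier, giving a gentler approach to
# the endpoint; higher k gives a later peak and a more abrupt arrival.
```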


10.2.2.7 Planning coordinated movements
This section discusses the XT/3C-v1 proposal for movement coordination, which adopts Lee's (1998) tau-coupling theory. This mechanism allows for endpoint-based coordination for synchronous endpoint achievement, or for endpoint achievement at particular points in time, and provides a mechanism for greater temporal accuracy at movement endpoints compared to other parts of movements.

10.2.2.7.1 Endpoint-based coordination
For reasons discussed in Chapter 5, it is proposed that movements to produce landmarks and other feature cues are primarily coordinated based on the parts of movement most closely related to their goals, and that tau coupling is used as a mechanism to accomplish this, either via coupling of multiple movements to an internal TauG Guide, or via direct coupling of movements to one another. As discussed in Chapter 9, when two gaps (e.g. gaps between two current positions and their target endpoint positions) are tau-coupled, they are guaranteed to reach their endpoints at the same time. Although two coordinated movements (or a movement and a TauG Guide) might begin at the same time, they don't have to, as long as they are tau-coupled before the end of the movement.

10.2.2.7.2 Coordination of movements to produce sequential acoustic targets
In the proposed model, the coordination of movements to produce sequential acoustic targets results from the relative timing of movement endpoints; endpoints planned to achieve a sequence of acoustic landmarks are timed to follow each other in the time that represents the optimal balance of task requirements and costs, i.e. with the movement times that are required to produce the targets with the specified accuracy and acceptable costs of movement (Sections 10.2.2.3 and 10.2.2.4). Thus in the XT/3C-v1 proposal, articulatory overlap is due in large part to the cost of time, in contrast to gestural-planning oscillator entrainment in AP/TD. This is similar to a proposal in Šimko and Cummins (2010, 2011). That is, in the XT/3C-v1 model, if movement endpoints are sequential but timed closely together, and are produced with different sets of articulators, then for a sequence of two endpoints AB, the movement to produce endpoint B will begin before movement endpoint A is reached. This is illustrated in Figure 10.5, where the upper-lip-constriction movement for /m/ in mirror starts before the endpoint of the movement for the merged dental nasal from /nð/ in in the. In XT/3C-v1, the planned timing of movement onsets is dictated by the timing of movement endpoints and by the optimal movement durations. According to General Tau theory, there is some flexibility in the timing of movement onsets, since, as long as the movement is coupled to the appropriately timed TauG Guide before the end of the movement, the movement endpoint will be reached at the appropriate time. The view that the timing of movement endpoints often has highest priority, and that the timing of movement onsets is often planned 'in service of' the movement endpoints, contrasts with AP/TD's view that gesture onsets are the parts of gestures that are coordinated with one another via stable phasing relationships between gestural-planning oscillators (Chapter 5).

[Figure 10.5 here: four panels over 0–0.5 s, showing the waveform (Amp), spectrogram (Freq), and vertical sensor-position traces (ULz, TTz).]
Figure 10.5 The utterance excerpt '. . . elf in the mirror . . .' spoken by a Southern British English speaker from the Doubletalk corpus (Scobbie et al. 2013). Note: The figure illustrates overlapping movements for /nð/ (merged into a dental nasal; see Manuel 1995), /m/, and /ɻ/, whose endpoints are sequential. The movement trajectories illustrated in the bottom two panels are movements in the vertical dimension of sensors attached to the vermilion border of the upper lip (ULz) and to the tongue, less than 1 cm posterior to the tongue tip (TTz). Vertical lines indicate the endpoints of these movements.


That chapter discussed the possibility that movement-onset coordination might be based on spatial information. While this type of mechanism is not currently part of the proposed model sketch, it may be the case that movement onsets preferentially occur when a particular (relative) spatial position of another articulator is reached; in that case, the sketch would need to be modified (see discussion in Chapter 7).
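The endpoint-synchronizing property of tau coupling can be demonstrated directly with the closed form from the earlier sketch: two gaps with different distances and different coupling constants, coupled to the same guide, both reach zero at the guide's end time. The parameter values are assumptions for illustration.

```python
import numpy as np

def coupled_gap(t, T, y0, k):
    """Gap tau-coupled to a shared TauG Guide of duration T (closed form;
    see the earlier sketch).  Any gap coupled to the same guide reaches
    zero exactly at t = T, whatever its distance y0 or constant k."""
    return y0 * np.maximum(1.0 - (t / T) ** 2, 0.0) ** (1.0 / k)

T = 0.25                                    # shared guide duration (s): assumed
t = np.linspace(1e-4, T, 100)
lip = coupled_gap(t, T, y0=0.8, k=0.4)      # two movements with different
tongue = coupled_gap(t, T, y0=1.6, k=0.7)   # distances and coupling constants
print(lip[-1], tongue[-1])                  # both 0.0: synchronous endpoints
```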

10.3 Motor-Sensory Implementation
In the Motor-Sensory Implementation Component of XT/3C-v1, motor commands are issued at appropriate times in order to achieve the goals of the planned utterance. Following Guenther and colleagues (Guenther 1995; Guenther, Ghosh, and Tourville 2006; Guenther and Vladusich 2012), it is assumed that speakers have an internal model of relationships among motor commands, articulator movements, and acoustic consequences, which they use to plan appropriate motor commands to achieve acoustic/sensory goals. This internal model is likely to be hierarchical (i.e. with control at a higher articulator level and a lower muscular level; cf. the discussion in Chapter 8 of Todorov et al. 2005 and Li 2006), and is thought to be learned and fine-tuned throughout the lifespan, by observing the relationship between motor commands and their somatosensory and auditory consequences. That is, the cognitive model of the articulation–acoustics relationship can be changed, e.g. when the vocal tract changes during development, or when the nature of the acoustic feedback is changed in experimental conditions, with the result that the speaker adjusts articulation so that the intended acoustic output is perceived; cf. discussion of literature in Chapter 7. Following proposals in Bullock and Grossberg (1988), cited in early versions of DIVA (Guenther 1995), in papers on Stochastic Optimal Feedback Control Theory for non-speech (Todorov and Jordan 2002; Shadmehr and Mussa-Ivaldi 2012; discussed in Chapter 8), and in Houde and Nagarajan (2011) and Hickok (2014) for speech, it is assumed that, during a movement, speakers use an efference copy of their motor commands, as well as available somatosensory and auditory feedback information, to continuously monitor estimated effector states (e.g. positions) relative to the planned target. Additionally, following Lee (1998), it is assumed that speakers continuously adjust their motor commands on the basis of estimated spatial and tau information to ensure accurate and timely acoustic-cue production. That is, speakers monitor the estimated time until planned endpoint achievement at


each time point, assuming the effector moves at the current rate (tau). On this view, tau information (i.e. time-to-endpoint-achievement-at-the-current-movement-rate) is continuously monitored and adjusted so that it is kept in constant proportion to the tau of the TauG Guide. The tau information would be continuously combined with spatial information (e.g. estimated distance-to-endpoint) to generate appropriate movement velocities to achieve the planned movement endpoint on time, and with an appropriate movement time-course (which determines the shape of the velocity profile). Most, if not all, models of speech production assume some type of motor-sensory implementation component. For example, AP/TD provides for compensation for the perturbation of individual articulations through the use of gestures, which are task-dependent coordinative structures of articulators that together implement a gestural constriction. In the Task Dynamic system, the perturbation of one or more articulators can be completely compensated by activity of another articulator involved in the gestural constriction. Perturbations to individual articulators in a coordinative structure are assumed to be detected immediately, with immediate compensation as a consequence. However, consistent with its focus on articulatory, rather than acoustic/auditory, factors, AP/TD currently has no provision for the use of auditory feedback in online speech production. In addition, although AP/TD assumes automatic response to perturbations, which suggests some mechanism for continuous monitoring of individual articulators and modification of their relative contributions to the achievement of a gestural constriction, the gestural (constriction) plan for an utterance will always be carried out as long as one or more articulators are available to produce each gesture in the plan. That is, AP/TD currently has no proposal for modifying the gestural-constriction goals once an utterance has begun. The results described in Chapter 7 showing compensatory responses to altered auditory feedback suggest that such a mechanism is required, because speakers appear to modify their productions in an attempt to reach what they think will be a more appropriate auditory target. Available results are therefore more consistent with Stochastic Optimal Feedback Control approaches to motor-sensory implementation (cf. Houde and Nagarajan's (2011) proposal for speech, currently under development), and other feedback approaches (e.g. Guenther 2016). These approaches assume that the current state of the articulators is tracked continuously throughout movement on the basis of predicted and sensory (proprioceptive and auditory) information, and that the speaker makes use of this information. That is, they allow for the possibility that movement goals are updated as


needed after an utterance starts, and also suggest that speakers can use auditory and other types of sensory feedback in assessing whether modifications are required.
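The continuous monitoring described here can be pictured as a simple control loop, sketched below: at each time step, the estimated time-to-endpoint at the current rate is compared against the guide's remaining time scaled by the coupling constant, and the commanded speed is nudged accordingly. The state estimates stand in for efference-copy plus sensory-feedback fusion, and the proportional update rule and gain are assumptions for illustration, not a worked-out SOFCT controller.

```python
def control_step(gap, speed, guide_time_left, k, dt, gain=20.0):
    """One toy cycle of Motor-Sensory Implementation: estimate the speed
    that keeps the effector's time-to-endpoint (gap / speed) in ratio k
    to the guide's remaining time, and move the commanded speed toward it."""
    desired_speed = gap / max(k * guide_time_left, 1e-6)
    return speed + gain * (desired_speed - speed) * dt

# Example: a movement that starts too slow for its deadline speeds up.
gap, speed, k, dt, remaining = 1.0, 2.0, 0.8, 0.01, 0.30
while remaining > dt:
    speed = control_step(gap, speed, remaining, k, dt)
    gap = max(gap - speed * dt, 0.0)
    remaining -= dt
print(round(gap, 2))   # the gap has largely closed as the guide time expires
```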

10.4 Summary and discussion
This chapter has presented a working hypothesis about the way a phonology-extrinsic timing model of speech production with three separate components might work. As far as possible, the proposal is based on mechanisms proposed for non-speech motor control (Stochastic Optimal Feedback Control Theory and General Tau theory), as well as on recent developments of these theories for speech. This XT/3C-v1 model sketch allows for 1) symbolic phonological representations, 2) phonology-extrinsic timing, 3) context-specific and non-overlapping movement goals or targets, and 4) separate specification of goals (Phonological Planning) from how those goals are achieved (Phonetic Planning). It thus provides a better fit to several types of existing data than the AP/TD model (cf. discussions in Chapters 3, 4, 5, and 6). However, there are many aspects of the model that require further testing. In particular, a major issue for further research is how best to describe and model movement coordination and the timing of sequential movement endpoints (see Chapter 5). It goes without saying that a major limit of the proposal is that it is still only a model sketch. As is the case for all Optimal Control Theory approaches, implementing it will be non-trivial for many reasons. These include difficulties in modeling the dynamics of the vocal-tract articulators, in identifying and quantifying movement costs, and in cost-function minimization (see Chapter 8). Attempts to implement the model will no doubt bring many deficiencies and oversights to light. But we believe it has the potential to provide a good account of currently available data, particularly timing data, as well as a framework for asking fruitful further questions about how to model speech production.


11 Summary and conclusion

The goal of this book was to determine the type of speech-production system which can best account for speech-timing patterns. A major divide among theories in this domain has to do with the type of phonological representations that are proposed: symbolic phonological representations in the case of traditional phonological theories vs. the spatiotemporal representations assumed by the most thoroughly worked-out model of speech production that includes articulation, AP/TD; this model currently has the most comprehensive account of the articulation of connected speech. Although AP/TD grew out of Haskins Laboratories, where the similarity of speech and non-speech movements was noted many years ago (e.g. Kelso, Tuller, and Harris 1983), the development of AP moved away from this tradition in its development of proposed phonology-intrinsic timing mechanisms. The findings reviewed in this book motivate the return to an approach to speech-motor control based on general-purpose, phonology-extrinsic timing mechanisms, and symbolic phonological representations. The chapters of this book have laid out in some detail the workings and assumptions of AP/TD; discussed ways in which additional findings about motor control in general, and speech timing in particular, motivate consideration of an alternative model based on phonology-extrinsic timing and abstract symbolic phonology; reviewed existing three-component models as well as additional mechanisms from the literature that might be useful in implementing an alternative type of model; and laid out, in general terms, an alternative approach in the XT/3C framework. Chapter 2 reviewed the currently best-worked-out theory of timing in speech motor control, Articulatory Phonology in the Task Dynamic framework. This is an intrinsic timing theory which proposes that phonological representations are spatiotemporal. On this theory, surface-timing patterns do not have to be computed or specified during speech production, because once a setting for overall speech rate has been set, these patterns emerge from spatiotemporal phonological representations and from the oscillatory representations of structural context that determine the amount of time during which each gesture shapes the vocal tract, as well as inter-gestural


coordination. In this framework, spatiotemporal phonological representations are viewed as particularly attractive because they make it possible to avoid 'translation' from the representational vocabulary of abstract phonological symbols to a different (i.e. quantitative spectral, spatial, and temporal) representational vocabulary for physical form. However, this design decision had a further consequence: it requires a complex system of activations of the spatiotemporal representations, and adjustments to these activations, to generate appropriate patterns of surface variability for a given phonological form, across a wide variety of contexts. Subsequent discoveries about the nature of this systematic context-governed variation in surface form have increased the complexity of these required mechanisms. In the XT/3C alternative approach proposed here, these complexities are addressed in a three-part architecture which provides for a direct mapping from symbolic representations to quantitative acoustic-phonetic specifications via the choice of utterance-appropriate acoustic cues to contrastive features, and the computation of utterance-appropriate values for those cues, to optimize the satisfaction of the task demands of each particular utterance and to create the specific pattern of contextual variation that the utterance requires. Chapters 3–6 reviewed evidence from speech and non-speech motor timing that suggests the need for an alternative to the AP/TD phonology-intrinsic-timing approach to modeling systematic timing variation in speech production. Chapter 3 presented three phenomena which raised questions about the appropriateness of the default activation-interval adjustment approach to contextual variation, its account of Fitts' law, and its gestural-score architecture. Chapter 4 presented the core of the motivation for considering a phonology-extrinsic alternative to phonology-intrinsic timing. Evidence of greater timing accuracy at movement endpoints vs. other parts of movement appears incompatible with the use of mass–spring oscillators as phonological representations. Indeed, it challenges any type of phonological representation that consists of, or maps onto, an entire movement trajectory. This is because it suggests that speakers are able to identify and prioritize the part of articulatory movement that is most closely related to the phonological goal, i.e. the part of the movement that produces the appropriate acoustic cue (often the movement endpoint). Identifying the part of movement most closely related to the goal is difficult when the entire gestural movement is represented as a unitary whole. This chapter also presented several pieces of evidence suggesting that humans represent and specify surface-timing properties in both non-speech and speech motor activity, that they do this using general-purpose, phonology-extrinsic timing mechanisms, and that temporal representations


can be separate from spatial representations, contra gestural representations, in which spatial and temporal information are inseparable. The speech evidence directly challenges the core assumption of phonology-intrinsic timing in AP/TD and appears to require translation from phonology-intrinsic timing to surface timing in phonetics, something that it was hoped AP/TD proposals would avoid. In particular, the phonology-intrinsic, time-warping, 'clock'-slowing mechanism proposed in more recent versions of AP/TD to account for boundary- and phrasal-prominence-related lengthening results in a discrepancy between speech-specific phonological time units and solar time units during the intervals that are slowed to different degrees throughout an utterance. A translation mechanism from phonology-intrinsic time to solar time is required in this framework, because available evidence suggests that it is the surface timing of speech intervals (rather than phonology-intrinsic timing) that is represented and specified. That is, surface-timing constraints are required to explain lengthening patterns related to phrasal prosody in some quantity languages, surface-timing goals are required to explain the multitude of strategies for realizing different rates of speech and other timing-related distinctions, and surface-timing mechanisms are required to explain patterns of greater timing variability for longer intervals. This evidence appears to be incompatible with phonology-intrinsic timing, which does not correspond to surface, i.e. solar, timing. Chapter 5 reviewed evidence suggesting that the coordination of behaviorally meaningful parts of movement (often endpoints) is required, and that there are alternatives to coupled oscillators for specifying the timing of movement onsets. Chapter 6 suggested that patterns of polysyllabic shortening, often cited as support for oscillatory suprasegmental structure because they appear to make inter-stress intervals more regular, are in fact complex in ways that remove some of the motivation for proposals of oscillatory suprasegmental structure. It will be of great interest to determine whether AP can accommodate the evidence presented here without compromising its core design principles. Chapters 7–10 introduced an alternative approach to speech motor control that includes three stages: a symbol-based Phonological Planning Component, a quantitative Phonetic Planning Component, and a Motor-Sensory Implementation Component. Chapter 7 explained that findings of greater timing accuracy at movement endpoints than at other parts of movement provide a key piece of evidence for the three-stage model architecture, and presented additional evidence from non-timing-related phenomena that supports symbolic phonological representations and the three-part structure. In this chapter, the XT/3C framework is linked to earlier models, which provide


useful analogues although they do not comprehensively address the question of speech timing. Chapter 8 reviewed Stochastic Optimal Feedback Control Theory (Todorov and Jordan 2002), as well as Optimal Control Theory approaches to speech timing. Optimal Control Theory is useful for the proposed XT/3C approach because it provides ways of planning optimal movements given a set of task requirements and movement costs, based on estimates of current movement states. Chapter 9 reviewed existing proposals for general timekeeper models and some questions that they raise, since, like all extrinsic-timing approaches, our proposal assumes the existence of general-purpose timekeeping mechanisms. In addition, Chapter 9 described Lee's General Tau theory (Lee 1998), which is useful for speech motor control modeling because it provides 1) a way to model context-appropriate movement-velocity profiles, 2) a mechanism for movement coordination based on goal-related movement endpoints, and 3) an account of less timing variability at movement endpoints compared to other parts of movement. Chapter 10 presented a sketch of the proposed three-component, phonology-extrinsic-timing approach to speech production that includes Phonological Planning, Phonetic Planning, and Motor-Sensory Implementation Components. In the Phonological Planning Component, symbolic phonological representations are sequentially ordered in a prosodic frame, appropriate feature cues are associated with each phonological representation in its context, and a prioritized list of task requirements for the utterance is generated; as a result, the representations in the Phonological Planning Component are enriched with substantial amounts of utterance-specific information that is not available in the lexical representations of words. In the Phonetic Planning Component, acoustic goals are translated into articulatory plans: acoustic goals are quantified as ranges of parameter values, task requirements and costs of movement are quantified, and inter-landmark interval durations and movements are chosen which minimize the costs while meeting the task requirements. Coordination is based on goal-related parts of movement and, like the time-course of movement, is planned according to Lee's General Tau-coupling theory (Lee 1998, reviewed in Chapter 9). In the Motor-Sensory Implementation Component, spatial and temporal information is combined and motor commands are issued for movements to reach their goals at appropriate times. The temporal and spatial evolution and the predicted sensory (auditory and somatosensory) consequences of movement, as well as any available sensory feedback, are continuously tracked, and movements are updated, as needed, to increase


the likelihood of reaching the planned movement endpoints at appropriate times, and thus of achieving the goals specified in the Phonological Planning Component. As described in Chapter 10, key features of the proposed XT/3C approach that differ from those of AP/TD include:

1. Separate planning components for phonology and phonetics, different from proposals such as AP/TD and Flemming (2001) in which phonology and phonetics are integrated.
2. Symbolic phonological representations, rather than spatiotemporal phonological representations.
3. Acoustic targets implemented articulatorily (as in Perkell 2012; Guenther 1995; Guenther 2016), rather than targets that are exclusively articulatory.
4. Specification of surface-timing patterns as part of phonetic planning, different from phonology-intrinsic-timing models in which surface-timing patterns are emergent from interacting phonological mechanisms. Surface-timing patterns include the timing between acoustic landmarks (Stevens 2002), as well as the durations of movements required to produce the landmarks.
5. The representation and tracking of surface timing using general-purpose, phonology-extrinsic timing mechanisms, which are not used in phonology-intrinsic models. In the XT/3C approach, timing variability observed in repeated movements is assumed to result from noise in the timekeeping mechanism(s), rather than from a stochastic phonetic planning process (as proposed in Lefkowitz 2017).
6. An Optimal Control Theory approach to determining movement parameter values that accomplish task goals at minimum cost (Todorov and Jordan 2002; Todorov 2004, 2005, 2007, 2009). This contrasts with the implementation of the influence of contextual factors by way of adjustments to default gestural-activation intervals.
7. Optimal surface-timing specifications that result from minimizing costs such as effort, endpoint variance, and time, as well as the cost of not signaling segment identity clearly enough in context (Šimko and Cummins 2010, 2011; Windmann 2016), different from proposals in which multiple, competing timing targets determine surface durations (Flemming 2001; Katz 2010; Braver 2013; Lefkowitz 2017).
8. Velocity profiles that result from tau-guidance control (Lee 1998), as opposed to control from mass–spring oscillators and gestural activation.


9. Movement coordination based on goal-related parts of movement, e.g. endpoints, and continuous tau coupling (Lee 1998), rather than on movement onsets and oscillatory entrainment.
10. Coarticulation that results from the close temporal proximity of specified sequential acoustic landmarks. On this view, the landmark goals are sequential; movements must overlap in order to produce the landmarks at the appropriate times. This view contrasts with views of coarticulation as the consequence of overlapping gestural goals.
11. Suprasegmental structure that is word-based, and control mechanisms that are periodic only for overtly periodic styles of speech. These control mechanisms contrast with suprasegmental control mechanisms in AP/TD, which are periodic for all types of speech.

While the key features of this approach are all motivated by experimental evidence in the non-speech and speech motor-control literature, the evidence from speech comes from fewer studies. It will be important, for a rigorous comparison of the AP/TD and XT/3C approaches, to address questions such as:

• To what extent do Fitts' law (1954) phenomena apply in speech? Although experimental research on non-speech strongly suggests that Fitts' law phenomena are ubiquitous, and findings in speech are consistent with this view, as of this moment it is only beginning to be rigorously tested in the speech domain. If Fitts' law applies to speech, appropriate ways of modeling it will need to be tested; Optimal Control Theory approaches are promising and have been successful in modeling non-speech movements.
• What is the most appropriate mechanism for generating realistic velocity profiles for speech movements? AP/TD proposes mass–spring movements + shaped activation profiles, whereas XT/3C proposes movements generated by coupling to the Tau Guide proposed in Lee (1998). To date, one study has shown a good fit of Lee's Tau Guide equation to speech-movement data, but it will be important to have a wider test of this proposal.
• How is coordination of articulatory movements controlled, for sequences of speech sounds, and what is the most appropriate way to model this? Coordination in the AP/TD approach is based on movement-onset-relative timing, whereas XT/3C proposes that the onset is normally not the critical aspect of the timing of a movement; rather, it is the part of movement most
closely related to the goal (often the endpoint) that is critical. Although evidence from non-speech and speech studies supports this view, few speech studies have been designed to address this question directly. Avoiding independent tasks that are performed synchronously appears to be important for preventing spatial interference; separate actions that are performed simultaneously to accomplish a single action goal do not encourage spatial interference in the same way (Franz, Zelaznik, Swinnen, and Walter 2001). In the XT/3C framework, spatial interference is avoided by having sequential landmarks as the goals of speech production; multiple articulatory movements contribute to the realization of each landmark, but are not represented as independent tasks. However, aspects of an utterance such as its fundamental frequency contour and speech-accompanying hand and other bodily gestures are considered to be independent, synchronous tasks in traditional phonological theories. Do these encourage spatial interference, and/or can their realization be considered an integral part of communicative-goal production? In general, it will be important to consider the extent to which all bodily movements related to the communicative act are coordinated and timed.
• What role does periodicity play in speech motor control? AP/TD proposes a central role for periodicity in movement coordination as well as in suprasegmental structure. Answers to questions relating to coordination may determine whether there is a role for periodicity in coordinated speech behavior, whereas other types of experiments will be required to determine whether periodicity is required for suprasegmental control. In the XT/3C approach, periodic control may be appropriate only for certain styles of speech, where perceived periodicity is an important task goal.
• If it turns out to be appropriate to model phonetic planning and motor-sensory implementation using Optimal Control Theory approaches, as proposed here, a large part of the computational work will involve modeling and testing different types of weighted task requirements and movement costs, to clarify their role in determining movement parameters in different contexts; a toy example of such cost weighting follows this list.
• Are there separate types of sound-level errors corresponding to symbolic planning processes vs. motor execution processes?
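Point 5 of the feature list above attributes repeated-movement timing variability to noise in general-purpose timekeeping mechanisms. The following minimal simulation, under assumptions that are ours rather than commitments of the model (Gaussian noise, a 5% Weber fraction), shows the signature scalar property discussed in the interval-timing literature (e.g. Gibbon 1977): the standard deviation of produced intervals grows in proportion to the target interval, so the coefficient of variation stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)

def produce_interval(target_ms, weber_fraction=0.05, n=10_000):
    # Scalar timekeeper noise: SD is proportional to the target interval.
    return rng.normal(loc=target_ms, scale=weber_fraction * target_ms, size=n)

for target in (100, 200, 400):  # target intervals in ms
    produced = produce_interval(target)
    cv = produced.std() / produced.mean()
    print(f"target {target} ms: SD = {produced.std():.1f} ms, CV = {cv:.3f}")
```

The constant coefficient of variation across short and long targets is the pattern a phonology-extrinsic, general-purpose clock with scalar noise would predict for repeated productions.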
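For the Fitts'-law question, the law's standard formulation (Fitts 1954) predicts a movement time MT = a + b log₂(2D/W) for a movement of amplitude D to a target of width W. The sketch below simply evaluates that equation; the intercept a and slope b are hypothetical illustrative values, not estimates from speech articulators.

```python
import math

def fitts_movement_time(a, b, distance, width):
    # Fitts (1954): MT = a + b * ID, where the index of difficulty
    # ID = log2(2 * D / W) is measured in bits.
    return a + b * math.log2(2 * distance / width)

# Hypothetical parameters: a = 50 ms intercept, b = 100 ms/bit slope.
# Halving the target width adds one bit of difficulty, i.e. one extra b.
print(fitts_movement_time(0.05, 0.10, distance=10.0, width=2.0))  # ~0.382 s
print(fitts_movement_time(0.05, 0.10, distance=10.0, width=1.0))  # ~0.482 s
```

Testing Fitts' law in speech would amount to asking whether articulatory movement durations scale with an analogous index of difficulty for their spatial goals.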
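For the velocity-profile question, Lee's (1998) intrinsic tau guide can be written τ_g(t) = (t² − T²)/(2t) for a movement of total duration T, and a gap x(t) is tau-coupled to the guide when τ_x(t) = k·τ_g(t), where τ_x = x/ẋ. One solution of this coupling is x(t) = x₀((T² − t²)/T²)^{1/k}, which closes the gap smoothly with a single-peaked velocity profile whose shape depends on k. The sketch below demonstrates these equations numerically with illustrative values of T, k, and x₀; it is a demonstration of the math, not a fitted model of speech-movement data.

```python
import numpy as np

T = 0.30    # movement duration (s); illustrative
k = 0.40    # coupling constant; peak-velocity timing depends on k
x0 = 0.01   # initial gap to the goal (m), e.g. a lip-aperture gap

t = np.linspace(1e-4, T, 2000)
# Gap satisfying tau_x(t) = k * tau_g(t), with tau_g(t) = (t**2 - T**2) / (2*t):
x = x0 * ((T**2 - t**2) / T**2) ** (1.0 / k)
v = np.gradient(x, t)  # gap-closure velocity (negative while closing)

i = np.argmin(v)       # index of fastest closure
print(f"gap closes from {x0} to {x[-1]:.4f}; peak speed at t/T = {t[i]/T:.2f}")
```

With k = 0.4 the peak occurs at half of movement time; smaller k shifts it earlier and larger k later. This is the kind of profile-shape control that AP/TD instead obtains from mass–spring dynamics and shaped gestural activation.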
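As an illustration of the cost-weighting work just described, the toy computation below chooses a movement duration by minimizing a weighted sum of costs. The functional forms and weights are assumptions made only for the demonstration: a time cost that grows with duration, and effort and endpoint-variance costs that shrink with it (the variance term echoes signal-dependent noise in the sense of Harris and Wolpert 1998). A serious XT/3C implementation would require empirically motivated cost functions and weights.

```python
import numpy as np

durations = np.linspace(0.05, 0.50, 500)  # candidate movement durations (s)

w_time, w_effort, w_variance = 1.0, 0.002, 0.01  # illustrative weights
cost = (w_time * durations            # slower movements delay the utterance
        + w_effort / durations**2     # faster movements demand more force
        + w_variance / durations)     # faster movements end less accurately

best = durations[np.argmin(cost)]
print(f"cost-minimizing duration: {best * 1000:.0f} ms")
```

Changing the weights shifts the optimum, which is one way contextual factors (e.g. prosodic position, or the need to signal a segmental contrast clearly) could be expressed without adjusting default gestural-activation intervals.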

In summary, this book has argued that the timing evidence from motor control in general, and from speech motor control in particular, motivates the development of an approach to speech articulation based on symbolic
phonological representations and phonology-extrinsic timing, as an alternative to the currently dominant and successful AP/TD approach. Our hope is that this book will inspire the development of Three-Component, Phonology-Extrinsic-Timing models of speech motor control that can account for timing behavior in speech, and that these models can be developed in enough detail to test their predictions quantitatively through experiments. As is evident from this volume, we are grateful to the developers of AP for the explicitness of their model, for the many insights it has provided, and for their responsiveness to emerging findings. These characteristics have both inspired and made possible the development of the alternative view presented here.


References

Abbs, James H. (1973). The influence of the gamma motor system on jaw movements during speech: A theoretical framework and some preliminary observations. Journal of Speech and Hearing Research, 16, 175–200.
Abbs, James H. & Vincent L. Gracco (1984). Control of complex motor gestures: Orofacial muscle responses to load perturbations of lip during speech. Journal of Neurophysiology, 51(4), 705–723.
Abbs, James H., Vincent L. Gracco, & Kelly J. Cole (1984). Control of multimovement coordination: Sensorimotor mechanisms in speech motor programming. Journal of Motor Behavior, 16(2), 195–231.
Abercrombie, David (1967). Elements of General Phonetics. Edinburgh: Edinburgh University Press.
Abercrombie, David (1968). Some functions of silent stress. Work in Progress 2, Edinburgh University Department of Linguistics.
Abercrombie, David (1973). A phonetician's view of verse structure. In W. E. Jones & John Laver (eds), Phonetics in Linguistics: A Book of Readings (pp. 6–13). London: Longman.
Abercrombie, David (1991). Fifty Years in Phonetics. Edinburgh: Edinburgh University Press.
Ackermann, Hermann & Ingo Hertrich (1994). Speech rate and rhythm in cerebellar dysarthria: an acoustic analysis of syllable timing. Folia Phoniatrica et Logopaedica, 46, 70–78.
Ackermann, Hermann & Ingo Hertrich (1997). Voice onset time in ataxic dysarthria. Brain and Language, 56, 321–333.
Ackermann, Hermann & Ingo Hertrich (2000). The contribution of the cerebellum to speech processing. Journal of Neurolinguistics, 13, 95–116.
Ackermann, Hermann, Susanne Gräber, Ingo Hertrich, & Irene Daum (1997). Categorical speech perception in cerebellar disorders. Brain and Language, 60, 323–331.
Alberts, Jay L., Marian Saling, & George E. Stelmach (2002). Alterations in transport path differentially affect temporal and spatial movement parameters. Experimental Brain Research, 143, 417–425.
Alder, Todd B. & Gary J. Rose (1998). Long-term temporal integration in the anuran auditory system. Nature Neuroscience, 1(6), 519–523.
Alder, Todd B. & Gary J. Rose (2000). Integration and recovery processes contribute to the temporal selectivity of neurons in the midbrain of the northern leopard frog, Rana pipiens. Journal of Comparative Physiology A, 186, 923–937.
Allan, Lorraine G. & John Gibbon (1991). Human bisection at the geometric mean. Learning and Motivation, 22, 39–58.
Allen, J. Sean & Joanne L. Miller (1999). Effects of syllable-initial voicing and speaking rate on the temporal characteristics of monosyllabic words. Journal of the Acoustical Society of America, 106(4, Pt 1), 2031–2039.
Allman, Melissa J. & Warren H. Meck (2012). Pathophysiological distortions in time perception and timed performance. Brain, 135, 656–677.
Allman, Melissa J., Sundeep Teki, Timothy D. Griffiths, & Warren H. Meck (2014). Properties of the internal clock: First- and second-order principles of subjective time. Annual Review of Psychology, 65, 743–771.
Asatryan, David G. & Anatol G. Feldman (1965). Functional tuning of the nervous system with control of movement or maintenance of a steady posture: I. Mechanographic analysis of the work of the joint or execution of a postural task. Biophysics, 10, 925–934.
Aschoff, Jürgen (1985). On the perception of time during prolonged temporal isolation. Human Neurobiology, 4, 41–52.
Aschoff, Jürgen (1998). Human perception of short and long time intervals: Its correlation with body temperature and the duration of wake time. Journal of Biological Rhythms, 13(5), 437–442.
Astésano, Corine, Ellen Gurman Bard, & Alice Turk (2007). Structural influences on initial accent placement in French. Language and Speech, 50(3), 423–446.
Asu, Eva Liina, Pärtel Lippus, Nele Salveste, & Heete Sahkai (2016). F0 declination in spontaneous Estonian: implications for pitch-related preplanning in speech production. Speech Prosody 2016, 1139–1142.
Atal, B. S., J. J. Chang, M. V. Mathews, & J. W. Tukey (1978). Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer-sorting technique. Journal of the Acoustical Society of America, 63, 1535–1555.
Aylett, Matthew P. (2000). Stochastic suprasegmentals: Relationships between redundancy, prosodic structure and care of articulation in spontaneous speech. (PhD), University of Edinburgh.
Aylett, Matthew P. & Alice Turk (2004). The Smooth Signal Redundancy Hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Language and Speech, 47(1), 31–56.
Aylett, Matthew P. & Alice Turk (2006). Language redundancy predicts syllabic duration and the spectral characteristics of vocalic syllable nuclei. Journal of the Acoustical Society of America, 119(5), 3048–3058.
Baars, Bernard J. & Michael T. Motley (1976). Spoonerisms as sequencer conflicts: Evidence from artificially elicited errors. The American Journal of Psychology, 89(3), 467–484.
Bangert, Ashley S., Patricia A. Reuter-Lorenz, & Rachel D. Seidler (2011). Dissecting the clock: Understanding the mechanisms of timing across tasks and temporal intervals. Acta Psychologica, 136, 20–34.
Barbosa, Plínio A. (2007). From syntax to acoustic duration: A dynamical model of speech rhythm production. Speech Communication, 49(9), 725–742.
Barnes, Jonathan (2006). Strength and weakness at the interface: Positional neutralization in phonetics and phonology. Berlin: Mouton de Gruyter.
Barrett, Nicholas C. & Denis J. Glencross (1989). Response amendments during manual aiming movements to double-step targets. Acta Psychologica, 70, 205–217.
Bates, Sally Alexandra Rosemary (1995). Towards a definition of schwa: an acoustic investigation of vowel reduction in English. (PhD), University of Edinburgh.
Beckman, Jill, Pétur Helgason, Bob McMurray, & Catherine Ringen (2011). Rate effects on Swedish VOT: Evidence for phonological overspecification. Journal of Phonetics, 39, 39–49.
Beckman, Mary E. & Jan Edwards (1992). Intonational categories and the articulatory control of duration. In Yoh'ichi Tohkura, Eric Vatikiotis-Bateson, & Yoshinori Sagisaka (eds), Speech Perception, Production and Linguistic Structure (pp. 356–375). Tokyo: OHM Publishing Co., Ltd.
Beckman, Mary E. & Jan Edwards (1994). Articulatory evidence for differentiating stress categories. In P. Keating (ed.), Papers in Laboratory Phonology III: Phonological Structure and Phonetic Form (pp. 7–33). Cambridge: Cambridge University Press.
Beckman, Mary E. & Janet B. Pierrehumbert (1986). Intonational structure in Japanese and English. Phonology Yearbook, 3, 255–309.
Beckman, Mary E., Julia Hirschberg, & Stefanie Shattuck-Hufnagel (2005). The original ToBI system and the evolution of the ToBI framework. In Sun-Ah Jun (ed.), Prosodic Typology: The Phonology of Intonation and Phrasing (pp. 9–54). Oxford: Oxford University Press.
Beddor, Patrice Speeter, Kevin B. McGowan, Julie E. Boland, Andries W. Coetzee, & Anthony Brasher (2013). The time course of perception of coarticulation. Journal of the Acoustical Society of America, 133(3), 2350–2366.
Bell, Alan (1971). Some patterns of occurrence and formation of syllable structures. Working Papers on Language Universals, Language Universals Project, Stanford University, 6, 23–137.
Bell-Berti, Fredericka, Rena A. Krakow, Carole E. Gelfer, & Suzanne E. Boyce (1995). Anticipatory and carryover effects: Implications for models of speech production. In Fredericka Bell-Berti & Lawrence J. Raphael (eds), Producing Speech: Contemporary Issues. For Katherine Safford Harris (pp. 77–98). New York: AIP Press.
Bellman, Richard (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.
Benguerel, André-Pierre & Helen A. Cowan (1974). Coarticulation of upper lip protrusion in French. Phonetica, 30, 41–50.
Benguigui, Nicolas, Robin Baurès, & Cyrille Le Runigo (2008). Visuomotor delay in interceptive actions. Behavioral and Brain Sciences, 31(2), 200–201.
Benoit, Christian (1986). Note on the use of correlation in speech timing. Journal of the Acoustical Society of America, 80, 1846–1849.
Beňuš, Štefan & Juraj Šimko (2014). Emergence of prosodic boundary: Continuous effects of temporal affordance on inter-gestural timing. Journal of Phonetics, 44, 110–129.
Berkovits, Rochele (1993a). Utterance-final lengthening and the duration of final-stop closures. Journal of Phonetics, 21, 479–489.
Berkovits, Rochele (1993b). Progressive utterance-final lengthening in syllables with final fricatives. Language and Speech, 36(1), 89–98.
Berkovits, Rochele (1994). Durational effects in final lengthening, gapping, and contrastive stress. Language and Speech, 37(3), 237–250.
Bernstein, Nikolai (1967). The Co-ordination and Regulation of Movements. London: Pergamon.
Berry, Jeff (2011). Speaking rate effects on normal aspects of articulation: Outcomes and issues. Perspectives on Speech Science and Orofacial Disorders, 21, 15–26.
Bertsekas, Dimitri (2001). Dynamic Programming and Optimal Control. 2nd edn. Belmont, MA: Athena Scientific.
Bertsekas, Dimitri & John Tsitsiklis (1996). Neuro-dynamic Programming. Belmont, MA: Athena Scientific.
Billon, Magali, A. Semjen, & G. E. Stelmach (1996). The timing effects of accent production in periodic finger-tapping sequences. Journal of Motor Behavior, 28(3), 198–210.
Birkholz, Peter, Bernd J. Kröger, & Christiane Neuschaefer-Rube (2011). Model-based reproduction of articulatory trajectories for consonant–vowel sequences. IEEE Transactions on Audio, Speech, and Language Processing, 19(5), 1422–1433.
Blakemore, Sarah-Jayne, Daniel Wolpert, & Chris Frith (2000). Why can't you tickle yourself? Neuroreport, 11(11), 11–16.
Boersma, Paul (1998). Functional Phonology: Formalizing the Interaction between Articulatory and Perceptual Drives. The Hague: Holland Academic Graphics.
Boersma, Paul (2009). Cue constraints and their interactions in phonological perception and production. In Paul Boersma & Silke Hamann (eds), Phonology in Perception (pp. 55–110). Berlin: Mouton de Gruyter.
Bohland, Jason W., Daniel Bullock, & Frank H. Guenther (2009). Neural representations and mechanisms for the performance of simple speech sequences. Journal of Cognitive Neuroscience, 22(7), 1504–1529.
Bolinger, Dwight L. (1965). Pitch accent and sentence rhythm. In I. Abe & T. Kenekiyo (eds), Forms of English: Accent, Morpheme, Order (pp. 139–180). Cambridge, MA: Harvard University Press.
Bolinger, Dwight L. (1985). Two views of accent. Journal of Linguistics, 21(1), 79–123.
Bombien, Lasse, Christine Mooshammer, Philip Hoole, & Barbara Kühnert (2010). Prosodic and segmental effects on EPG contact patterns of word-initial German clusters. Journal of Phonetics, 38, 388–403.
Bonaventura, Patrizia (2003). Invariant patterns in articulatory movements. (PhD), Ohio State University.
Bonaventura, Patrizia & Osamu Fujimura (2007). Articulatory movements and phrase boundaries. In M.-J. Solé, P. S. Beddor, & M. Ohala (eds), Experimental Approaches to Phonology (pp. 209–227). Oxford: Oxford University Press.
Bootsma, Reinoud J. & Piet C. van Wieringen (1990). Timing an attacking forehand drive in table tennis. Journal of Experimental Psychology: Human Perception and Performance, 16(1), 21–29.
Boyce, Suzanne E., Rena A. Krakow, & Fredericka Bell-Berti (1991). Phonological underspecification and speech motor organisation. Phonology, 8, 219–236.
Boyce, Suzanne E., Rena A. Krakow, Fredericka Bell-Berti, & Carole E. Gelfer (1990). Converging sources of evidence for dissecting articulatory movements into core gestures. Journal of Phonetics, 18, 173–188.
Braitenberg, Valentino (1966). Is the cerebellar cortex a biological clock in the millisecond range? Progress in Brain Research, 25, 334–346.
Braver, Aaron (2013). Degrees of incompleteness in neutralization: Paradigm uniformity in a phonetics with weighted constraints. (PhD), Rutgers, The State University of New Jersey.
Browman, Catherine P. & Louis Goldstein (1985). Dynamic modeling of phonetic structure. In V. A. Fromkin (ed.), Phonetic Linguistics (pp. 35–53). New York: Academic Press.
Browman, Catherine P. & Louis Goldstein (1988). Some notes on syllable structure in Articulatory Phonology. Phonetica, 45, 140–155.
Browman, Catherine P. & Louis Goldstein (1989). Articulatory gestures as phonological units. Phonology, 6, 201–251.
Browman, Catherine P. & Louis Goldstein (1990a). Representation and reality: Physical systems and phonological structure. Journal of Phonetics, 18, 411–424.
Browman, Catherine P. & Louis Goldstein (1990b). Tiers in Articulatory Phonology, with some implications for casual speech. In John Kingston (ed.), Papers in Laboratory Phonology: Between the Grammar and Physics of Speech (pp. 341–376). Cambridge: Cambridge University Press.
Browman, Catherine P. & Louis Goldstein (1992a). Articulatory phonology: an overview. Phonetica, 49(3–4), 155–180.
Browman, Catherine P. & Louis Goldstein (1992b). Targetless Schwa: An articulatory analysis. In G. Docherty & D. R. Ladd (eds), Papers in Laboratory Phonology II: Gesture, Segment, Prosody (pp. 26–67). Cambridge: Cambridge University Press.
Browman, Catherine P. & Louis Goldstein (2000). Competing constraints on intergestural coordination and self-organization of phonological structures. Les Cahiers de l'ICP. Bulletin de la communication parlée, 5, 25–34.
Browman, Catherine P. & Louis Goldstein (Unpublished ms). Articulatory Phonology. Available from https://sail.usc.edu/~lgoldste/ArtPhon/, Aug. 16, 2018.
Bueti, Domenica, Vincent Walsh, Chris Frith, & Geraint Rees (2008). Different brain circuits underlie motor and perceptual representations of temporal intervals. Journal of Cognitive Neuroscience, 20(2), 204–214.
Bullock, Daniel & Stephen Grossberg (1988). Neural dynamics of planned arm movements—Emergent invariants and speed-accuracy properties during trajectory formation. Psychological Review, 95(1), 49–90.
Bullock, Daniel & Stephen Grossberg (1990). FLETE—an opponent neuromuscular design for factorization of length and tension. IJCNN-90-Wash DC: International Joint Conference on Neural Networks, Vols. 1 and 2, B209–B212.
Buonomano, Dean V. (2000). Decoding temporal information: A model based on short-term synaptic plasticity. Journal of Neuroscience, 20(3), 1129–1141.
Buonomano, Dean V. (2014). Neural dynamics based timing in the subsecond to seconds range. In Hugo Merchant & Victor de Lafuente (eds), Neurobiology of Interval Timing (pp. 101–117). New York: Springer.
Buonomano, Dean V. & Rodrigo Laje (2010). Population clocks: motor timing with neural dynamics. Trends in Cognitive Sciences, 14(12), 520–527.
Buonomano, Dean V. & Michael M. Merzenich (1995). Temporal information transformed into a spatial code by a neural network with realistic properties. Science, 267, 5200.
Burdet, Etienne & Theodore E. Milner (1998). Quantization of human motions and learning of accurate movements. Biological Cybernetics, 78, 307–318.
Burnham, Denis, Christine Kitamura, & Uté Vollmer-Conna (2002). What's new, pussycat? On talking to babies and animals. Science, 296, 1435.
Byrd, Dani (1996). Influences on articulatory timing in consonant sequences. Journal of Phonetics, 24, 209–244.
Byrd, Dani (2000). Articulatory vowel lengthening and coordination at phrasal junctures. Phonetica, 57, 3–16.
Byrd, Dani & Elliot Saltzman (1998). Intragestural dynamics of multiple prosodic boundaries. Journal of Phonetics, 26(2), 173–199.
Byrd, Dani & Elliot Saltzman (2003). The elastic phrase: modeling the dynamics of boundary-adjacent lengthening. Journal of Phonetics, 31(2), 149–180.
Byrd, Dani & Cheng Cheng Tan (1996). Saying consonant clusters quickly. Journal of Phonetics, 24, 263–282.
Byrd, Dani, Jelena Krivokapić, & Sungbok Lee (2006). How far, how long: On the temporal scope of prosodic boundary effects. Journal of the Acoustical Society of America, 120(3), 1589–1599.
Byrd, Dani, Abigail Kaun, Srikanth Nagarajan, & Elliot Saltzman (2000). Phrasal signatures in articulation. In John Kingston (ed.), Papers in Laboratory Phonology V (pp. 70–87). Cambridge: Cambridge University Press.
Cai, Shanqing, Satrajit S. Ghosh, Frank H. Guenther, & Joseph S. Perkell (2010). Adaptive auditory feedback control of the production of formant trajectories in the Mandarin triphthong /iau/ and its pattern of generalization. Journal of the Acoustical Society of America, 128(4), 2033–2048.
Cai, Shanqing, Satrajit S. Ghosh, Frank H. Guenther, & Joseph S. Perkell (2011). Focal manipulations of formant trajectories reveal a role of auditory feedback in the online control of both within-syllable and between-syllable speech timing. Journal of Neuroscience, 31(45), 16483–16490.
Cambier-Langeveld, Gerda Martina (1997). The domain of final lengthening in the production of Dutch. In Jane A. Coerts & Helen de Hoop (eds), Linguistics in the Netherlands (pp. 13–24). Amsterdam: John Benjamins.
Cambier-Langeveld, Gerda Martina (1999). The interaction between final lengthening and accentual lengthening: Dutch versus English. In Jane A. Coerts & Helen de Hoop (eds), Linguistics in the Netherlands (pp. 13–25). Amsterdam: John Benjamins.
Cambier-Langeveld, Gerda Martina (2000). Temporal Marking of Accents and Boundaries. Leiden: Holland Institute of Generative Linguistics.
Campbell, W. N. (1988). Foot-level shortening in the Spoken English Corpus. Paper presented at the 7th FASE Symposium, Edinburgh, UK.
Carlton, Les G. (1979). Control processes in the production of discrete aiming responses. Journal of Human Movement Studies, 5, 115–124.
Casini, Laurence, Boris Burle, & Noël Nguyen (2009). Speech perception engages a general timer: Evidence from a divided attention word identification task. Cognition, 112(2), 318–322.
Casini, Laurence, Céline Ramdani-Beauvir, Boris Burle, & Franck Vidal (2013). How does one night of sleep deprivation affect the internal clock? Neuropsychologia, 51, 275–283.
Caspers, Johanna (1994). Pitch Movements under Time Pressure: Effects of Speech Rate on the Melodic Marking of Accents and Boundaries in Dutch. The Hague: Holland Academic Graphics.
Catania, A. Charles (1970). Reinforcement schedules and psychophysical judgments: A study of some temporal properties of behavior. In W. N. Schoenfeld (ed.), The Theory of Reinforcement Schedules (pp. 1–42). New York: Appleton-Century-Crofts.
Chen, Marilyn Y. (1996). Acoustic correlates of nasality in speech. (PhD), Massachusetts Institute of Technology.
Chen, Marilyn Y. (1997). Acoustic correlates of English and French nasalized vowels. Journal of the Acoustical Society of America, 102(4), 2360–2370.
Chen, Yiya (2006). Durational adjustment under corrective focus in Standard Chinese. Journal of Phonetics, 34(2), 176–201.
Chen-Harris, Haiyin, Wilsaan M. Joiner, Vincent Ethier, David S. Zee, & Reza Shadmehr (2008). Adaptive control of saccades via internal feedback. Journal of Neuroscience, 28, 2804–2813.
Cho, Taehong (2002). The Effects of Prosody on Articulation in English. New York and London: Routledge.
Cho, Taehong (2005). Prosodic strengthening and featural enhancement: Evidence from acoustic and articulatory realizations of /ɑ,i/ in English. Journal of the Acoustical Society of America, 117(6), 3867–3878.
Cho, Taehong (2006). Manifestation of prosodic structure in articulatory variation: Evidence from lip kinematics in English. In Louis Goldstein, D. H. Whalen, & Catherine T. Best (eds), Laboratory Phonology (Vol. 8, pp. 519–548). Berlin/New York: Mouton de Gruyter.
Cho, Taehong (2008). Prosodic strengthening in transboundary V-to-V lingual movement in American English. Phonetica, 65, 45–61.
Cho, Taehong & Patricia Keating (2009). Effects of initial position versus prominence in English. Journal of Phonetics, 37, 466–485.
Choi, Jeung-Yoon (1999). Detection of Consonant Voicing: A Module for a Hierarchical Speech Recognition System. (PhD), Massachusetts Institute of Technology.
Cholin, Joana, Willem J. M. Levelt, & Niels O. Schiller (2006). Effects of syllable frequency in speech production. Cognition, 99, 205–235.
Chomsky, Noam (1957). Syntactic Structures. The Hague/Paris: Mouton de Gruyter.
Chomsky, Noam & Morris Halle (1968). The Sound Pattern of English. New York: Harper & Row.
Church, Russell M. (1984). Properties of the internal clock. Annals of the New York Academy of Sciences, 423, 566–582.
Classe, André (1939). The Rhythm of English Prose. Oxford: Basil Blackwell.
Clayards, Meghan, Michael K. Tanenhaus, Richard N. Aslin, & Robert A. Jacobs (2008). Perception of speech reflects optimal use of probabilistic speech cues. Cognition, 108, 804–809.
Clements, George & Susan Hertz (1996). An integrated model of phonetic representation in grammar. In L. Lavoie & W. Ham (eds), Working Papers of the Cornell Phonetics Laboratory (Vol. 11, pp. 43–116). Ithaca, NY: CLC Publications.
Cohn, Abigail C. (1993). Nasalisation in English: phonology or phonetics. Phonology, 10, 43–81.
Collewijn, Han, Casper J. Erkelens, & Robert M. Steinman (1988). Binocular co-ordination of human horizontal saccadic eye movements. Journal of Physiology, 404, 157–182.
Cooke, J. D. (1980). The organization of simple, skilled movements. In G. E. Stelmach & J. Requin (eds), Tutorials in Motor Behavior (pp. 199–212). Amsterdam: North Holland.
Cooper, André M. (1991). Laryngeal and oral gestures in English /p,t,k/. Proceedings of the XIIth International Congress of Phonetic Sciences (Vol. 2, pp. 50–53). Aix-en-Provence.
Covey, Ellen & John H. Casseday (1999). Timing in the auditory system of the bat. Annual Review of Physiology, 61, 457–476.
Craig, Cathy, Gert-Jan Pepping, & Madeleine Grealy (2005). Intercepting beats in predesignated target zones. Experimental Brain Research, 165(4), 490–504.
Creelman, C. Douglas (1962). Human discrimination of auditory duration. Journal of the Acoustical Society of America, 34(5), 582–593.
Crompton, Andrew (1981). Syllables and segments in speech production. Linguistics, 19(7/8), 663–716.
Cruttenden, Alan (1986). Intonation. Cambridge, UK: Cambridge University Press.
Cummins, Fred (1999). Some lengthening factors in English speech combine additively at most rates. Journal of the Acoustical Society of America, 105(1), 476–480.
Cummins, Fred & Robert Port (1998). Rhythmic constraints on stress timing in English. Journal of Phonetics, 26, 145–171.
Cutler, Anne, Frank Eisner, James M. McQueen, & Dennis Norris (2010). How abstract phonemic categories are necessary for coping with speaker-related variation. In Cécile Fougeron, Barbara Kuehnert, Mariapaola D'Imperio, & Nathalie Vallée (eds), Laboratory Phonology 10 (pp. 91–112). De Gruyter.
Dauer, Rebecca M. (1983). Stress-timing and syllable-timing re-analysed. Journal of Phonetics, 11, 51–62.
de Azavedo Neto, Raymundo Machado & Luis Augusto Teixeira (2009). Control of interceptive actions is based on expectancy of time to target arrival. Experimental Brain Research, 199, 135–143.
de Jong, Kenneth J. (1991). The oral articulation of English stress accent. (PhD), Ohio State University.
de Jong, Kenneth J. (1995). The supraglottal articulation of prominence in English: Linguistic stress as localized hyperarticulation. Journal of the Acoustical Society of America, 97(1), 496–504.
de Jong, Kenneth J. (2001a). Effects of syllable affiliation and consonant voicing on temporal adjustment in a repetitive speech-production task. Journal of Speech, Language, and Hearing Research, 44, 826–840.
de Jong, Kenneth J. (2001b). Rate-induced resyllabification revisited. Language and Speech, 44(2), 197–216.
del Viso Pavón, Susana (1990). Errores espontáneos del habla y producción del lenguaje [Spontaneous speech errors and language production]. (PhD), Universidad Complutense de Madrid.
del Viso, Susana, José M. Igoa, & José E. García-Albea (1991). On the autonomy of phonological encoding: Evidence from slips of the tongue in Spanish. Journal of Psycholinguistic Research, 20(3), 161–185.
Delattre, Pierre (1962). Some factors of vowel duration and their cross-linguistic validity. Journal of the Acoustical Society of America, 34(8), 1141–1143.
Delignières, Didier & Kjerstin Torre (2011). Event-based and emergent timing: Dichotomy or continuum? A reply to Repp and Steinman (2010). Journal of Motor Behavior, 43(4), 311–318.
Dell, Gary S. & Peter A. Reich (1981). Stages in sentence production: An analysis of speech error data. Journal of Verbal Learning and Verbal Behavior, 20, 611–629.
Dell, Gary S., Lisa K. Burger, & William R. Svec (1997). Language production and serial order: A functional analysis and a model. Psychological Review, 104(1), 123–147.
Dellwo, Volker, Ingmar Steiner, Bianca Aschenberner, Jana Danikovicová, & Petra Wagner (2004). BonnTempo-Corpus and BonnTempo-Tools: A database for the study of speech rhythm and rate. Proceedings of Interspeech 2004 (pp. 777–780). Jeju Island, Korea.
Desmond, J. E. & J. W. Moore (1988). Adaptive timing in neural networks: The conditioned response. Biological Cybernetics, 58, 405–415.
Diedrichsen, Jörn, Sarah E. Criscimagna-Hemminger, & Reza Shadmehr (2007). Dissociating timing and coordination as functions of the cerebellum. Journal of Neuroscience, 27(23), 6291–6301.
Diedrichsen, Jörn, Richard B. Ivry, & Jeff Pressing (2003). Cerebellar and basal ganglia contributions to interval timing. In W. H. Meck (ed.), Functional and Neural Mechanisms of Interval Timing (pp. 457–483). Boca Raton, FL: CRC Press.
Diedrichsen, Jörn, Reza Shadmehr, & Richard Ivry (2010). The coordination of movement: Optimal feedback control and beyond. Trends in Cognitive Sciences, 14(1), 31–39.
Dilley, Laura, Stefanie Shattuck-Hufnagel, & Mari Ostendorf (1996). Glottalization of word-initial vowels as a function of prosodic structure. Journal of Phonetics, 24(4), 423–444.
Dimitrova, Snezhina & Alice Turk (2012). Patterns of accentual lengthening in English four-syllable words. Journal of Phonetics, 40, 403–418.
Domkin, Dmitri, Jozsef Laczko, Slobodan Jaric, Hakan Johansson, & Mark L. Latash (2002). Structure of joint variability in bimanual pointing tasks. Experimental Brain Research, 143, 11–23.
Draper, M. H., Peter Ladefoged, & David Whitteridge (1960). Expiratory pressures and air flow during speech. British Medical Journal, 1, 1837–1843.
Drew, Michael R., Bojana Zupan, Anna Cooke, P. A. Couvillon, & Peter D. Balsam (2005). Temporal control of conditioned responding in goldfish. Journal of Experimental Psychology: Animal Behavior Processes, 31(1), 31–39.
Edwards, Jan, Mary E. Beckman, & Janet Fletcher (1991). The articulatory kinematics of final lengthening. Journal of the Acoustical Society of America, 89(1), 369–382.
Eefting, Wieke (1991). The effect of "information value" and "accentuation" on the duration of Dutch words, syllables, and segments. Journal of the Acoustical Society of America, 89(1), 412–424.
Eerola, Osmo & Janne Savela (2012). Production of short and long Finnish vowels with and without noise masking. Linguistica Uralica, 48(3), 200–208.
Elliott, Digby, Werner F. Helsen, & Romeo Chua (2001). A century later: Woodworth's (1899) two-component model of goal-directed aiming. Psychological Bulletin, 127(3), 342–357.
Elliott, Digby, Steven Hansen, Jocelyn Mendoza, & Luc Tremblay (2004). Learning to optimize speed, accuracy, and energy expenditure: A framework for understanding speed-accuracy relations in goal-directed aiming. Journal of Motor Behavior, 36(3), 339–351.
Elliott, Digby, Steve Hansen, Lawrence E. M. Grierson, James Lyons, Simon J. Bennett, & Spencer J. Hayes (2010). Goal-directed aiming: Two components but multiple processes. Psychological Bulletin, 136(6), 1023–1044.
Ellis, Lucy & William J. Hardcastle (2002). Categorical and gradient properties of assimilation in alveolar to velar sequences: evidence from EPG and EMA data. Journal of Phonetics, 30, 373–396.
Engelbrecht, Sascha E. & Juan Pablo Fernández (1997). Invariant characteristics of horizontal-plane minimum-torque-change movements with one mechanical degree of freedom. Biological Cybernetics, 76, 321–329.
Engelbrecht, Sascha E., Neil E. Berthier, & Laura P. O'Sullivan (2003). The undershoot bias: Learning to act optimally under uncertainty. Psychological Science, 14(3), 257–261.
Engstrand, Olle (1988). Articulatory correlates of stress and speaking rate in Swedish CV utterances. Journal of the Acoustical Society of America, 88(5), 1863–1875.
Engstrand, Olle & Diana Krull (1994). Durational correlates of quantity in Swedish, Finnish and Estonian: Cross-language evidence for a theory of adaptive dispersion. Phonetica, 51, 80–91.
Erickson, Donna (2002). Articulation of extreme formant patterns for emphasized vowels. Phonetica, 59, 134–149.
Eriksson, Anders (1991). Aspects of Swedish Speech Rhythm. (PhD), University of Gothenburg.
Ernestus, Mirjam (2014). Acoustic reduction and the roles of abstractions and exemplars in speech processing. Lingua, 142, 27–41.
Espy-Wilson, Carol Y. (1992). Acoustic measures for linguistic features distinguishing the semivowels /w j r l/ in American English. Journal of the Acoustical Society of America, 92(2), 736–757.
Fant, Gunnar (1960). Acoustic Theory of Speech Production: With Calculations Based on X-ray Studies of Russian Articulations. 's-Gravenhage: Mouton.
Fant, Gunnar & Anita Kruckenberg (1989). Preliminaries to the study of Swedish prose reading and reading style. STL-QPSR, 2/1989, 1–83.
Fant, Gunnar, Anita Kruckenberg, & Lennart Nord (1991). Durational correlates of stress in Swedish, French and English. Journal of Phonetics, 19, 351–365.
Ferreira, Fernanda (1993). Creation of prosody during sentence production. Psychological Review, 100(2), 233–253.
Ferreira, Fernanda (2007). Prosody and performance in language production. Language and Cognitive Processes, 22(8), 1151–1177.
Fitts, Paul M. (1954). The information capacity of the human motor system in controlling the amplitude of movement. Journal of Experimental Psychology, 47(6), 381–391.
Flash, Tamar & Neville Hogan (1985). The coordination of arm movements: An experimentally confirmed mathematical model. Journal of Neuroscience, 5(7), 1688–1703.
Flemming, Edward (1997). Phonetic optimization: Compromise in speech production. University of Maryland Working Papers in Linguistics, 5, 72–91.
Flemming, Edward (2001). Scalar and categorical phenomena in a unified model of phonetics and phonology. Phonology, 18, 7–44.
Flemming, Edward (2011). La grammaire de la coarticulation [The grammar of coarticulation]. In Mohamed Embarki & Christelle Dodane (eds), La Coarticulation. Des Indices à la Représentation (pp. 189–211). Paris: L'Harmattan.
Flemming, Edward & Stephanie Johnson (2007). Rosa's roses: reduced vowels in American English. Journal of the International Phonetic Association, 37(1), 83–96.
Fletcher, Janet (1987). Some micro and macro effects of tempo change on timing in French. Linguistics, 25, 951–967.
Fletcher, Janet (2010). The prosody of speech: Timing and rhythm. In William J. Hardcastle, John Laver, & Fiona E. Gibbon (eds), The Handbook of Phonetic Sciences, 2nd edn (pp. 521–602). Wiley Online Library.
Fletcher, Janet & Andrew McVeigh (1993). Segment and syllable duration in Australian English. Speech Communication, 13, 355–365.
Folkins, John W. & James H. Abbs (1975). Lip and jaw motor control during speech: Responses to resistive loading of the jaw. Journal of Speech and Hearing Research, 18, 207–220.
Fougeron, Cécile (1998). Variations Articulatoires en Début de Constituants Prosodiques de Différents Niveaux en Français [Articulatory variation at the beginning of prosodic constituents of different levels in French]. (PhD), Université Paris III-Sorbonne Nouvelle.
Fougeron, Cécile & Patricia Keating (1997). Articulatory strengthening at edges of prosodic domains. Journal of the Acoustical Society of America, 101, 3728–3740.
Foulkes, Paul & Gerard J. Docherty (2000). Another chapter in the story of /r/: 'Labiodental' variants in British English. Journal of Sociolinguistics, 4(1), 30–59.
Fowler, Carol A. (1977). Timing Control in Speech Production (Vol. 134). Bloomington, IN: Indiana University Linguistics Club.
Fowler, Carol A. (1980). Coarticulation and theories of extrinsic timing. Journal of Phonetics, 8, 113–133.
Fowler, Carol A. (2015). Is segmentation real? In Eric Raimy & Charles E. Cairns (eds), The Segment in Phonetics and Phonology (pp. 23–43). Hoboken, NJ: John Wiley & Sons, Inc.
Fowler, Carol A. & Jonathan Housum (1987). Talkers' signaling of "New" and "Old" words in speech and listeners' perception and use of the distinction. Journal of Memory and Language, 26, 489–504.
Fowler, Carol A. & Elliot Saltzman (1993). Coordination and coarticulation in speech production. Language and Speech, 36(2,3), 171–195.
Fowler, Carol A., P. Rubin, R. E. Remez, & M. T. Turvey (1980). Implications for speech production of a general theory of action. In B. Butterworth (ed.), Language Production (pp. 373–420). New York: Academic Press.
Franz, Elizabeth A. & V. S. Ramachandran (1998). Bimanual coupling in amputees with phantom limbs. Nature Neuroscience, 1(6), 443–444.
Franz, Elizabeth A., Richard B. Ivry, & L. L. Helmuth (1996). Reduced timing variability in patients with unilateral cerebellar lesions during bimanual movements. Journal of Cognitive Neuroscience, 8(2), 107–118.
Franz, Elizabeth A., Howard N. Zelaznik, & George McCabe (1991). Spatial topological constraints in a bimanual task. Acta Psychologica, 77, 137–151.
Franz, Elizabeth A., Howard N. Zelaznik, Stephan Swinnen, & Charles Walter (2001). Spatial conceptual influences on the coordination of bimanual actions: When a dual task becomes a single task. Journal of Motor Behavior, 33(1), 103–112.
Fromkin, Victoria A. (1971). The non-anomalous nature of anomalous utterances. Language, 47(1), 27–52.
Fromkin, Victoria A. (ed.) (1980). Errors in Linguistic Performance: Slips of the Tongue, Ear, Pen, and Hand. New York: Academic Press.
Frost, Barrie J. & Hongjin Sun (2004). The biological bases of time-to-collision computation. In Heiko Hecht & Geert J. P. Savelsbergh (eds), Time-to-Contact (pp. 13–37). Amsterdam: Elsevier.
Fuchs, Albert F. & Marc D. Binder (1983). Fatigue resistance of human extraocular muscles. Journal of Neurophysiology, 49, 28–34.
Fujimura, Osamu (1987). A linear model of speech timing. In Robert Channon & Linda Shockey (eds), In Honor of Ilse Lehiste (pp. 108–123). Dordrecht/Providence, RI: Foris Publications.
Fujimura, Osamu (1992). Phonology and phonetics—A syllable-based model of articulatory organization. Journal of the Acoustical Society of Japan (E), 13(1), 39–48.
Fujimura, Osamu (1994). C/D Model: A computational model of phonetic implementation. In E. S. Ristad (ed.), DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 17, 1–20.
Fujimura, Osamu (2000). The C/D model and prosodic control of articulatory behavior. Phonetica, 57, 128–138.
Fujimura, Osamu (2003). The C/D model: A progress report. Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, Spain, 1041–1044.
Fujimura, Osamu & J. C. Williams (2015). Remarks on the C/D Model. Journal of the Phonetic Society of Japan, 19(2), 2–8.
Gabbiani, Fabrizio, Holger G. Krapp, & Gilles Laurent (1999). Computation of object approach by a wide-field, motion-sensitive neuron. Journal of Neuroscience, 19(3), 1122–1141.
Gafos, Adamantios I. (2002). A grammar of gestural coordination. Natural Language & Linguistic Theory, 20(2), 269–337.
Gafos, Adamantios I. (2006). Dynamics in grammar: Comment on Ladd and Ernestus & Baayen. In Louis Goldstein, D. H. Whalen, & Catherine T. Best (eds), Laboratory Phonology 8 (pp. 51–80). De Gruyter.
Gafos, Adamantios I. & Stefan Beňuš (2006). Dynamics of phonological cognition. Cognitive Science, 30, 905–943.
Gahl, Susanne & Susan Marie Garnsey (2004). Knowledge of grammar, knowledge of usage: Syntactic probabilities affect pronunciation variation. Language, 80, 748–775.
Gaitenby, Jane H. (1965). The elastic word. Status Report on Speech Research SR2. New Haven, CT: Haskins Laboratories.
Gallistel, C. R. (1999). Can response timing be explained by a decay process? Journal of the Experimental Analysis of Behavior, 71, 264–271.
Gallistel, C. R. & John Gibbon (2000). Time, rate, and conditioning. Psychological Review, 107(2), 289–344.
Gallistel, C. R., Adam King, & Robert McDonald (2004). Sources of variability and systematic error in mouse timing behavior. Journal of Experimental Psychology: Animal Behavior Processes, 30(1), 3–16.
Ganesh, Gowrishankar, Masahiko Haruno, Mitsuo Kawato, & Etienne Burdet (2010). Motor memory and local minimization of error and effort, not global optimization, determine motor behavior. Journal of Neurophysiology, 104, 382–390.
Gao, Man (2008). Mandarin Tones: An Articulatory Phonology Account. (PhD), Yale University.
Garellek, Marc (2014). Voice quality strengthening and glottalization. Journal of Phonetics, 45, 106–113.
Garrett, Merrill F. (1975). The analysis of sentence production. In Gordon H. Bower (ed.), The Psychology of Learning and Motivation (Vol. 9, pp. 133–177). New York: Academic Press.
Garrett, Merrill F. (1980). Levels of processing in sentence production. In Brian Butterworth (ed.), Language Production. Vol. 1: Speech and Talk (pp. 177–220). London: Academic Press.
Gay, Thomas (1981). Mechanisms in the control of speech rate. Phonetica, 38, 148–158.
Gee, James Paul & François Grosjean (1983). Performance structures: A psycholinguistic and linguistic appraisal. Cognitive Psychology, 15, 411–458.
Gentner, Donald R. (1987). Timing of skilled motor performance: Tests of the proportional duration model. Psychological Review, 94(2), 255–276.
Gentner, D. R., J. Grudin, & E. Conway (1980). Finger movements in transcription typing. Center for Human Information Processing, University of California, San Diego, Technical Report 8001, 1–8.
Getty, David J. (1975). Discrimination of short temporal intervals: A comparison of two models. Perception and Psychophysics, 18(1), 1–8.
Georgopoulos, Apostolos P. (2002). Cognitive motor control: spatial and temporal aspects. Current Opinion in Neurobiology, 12, 678–683.
Ghitza, Oded (2013). The theta-syllable: A unit of speech information defined by cortical function. Frontiers in Psychology, 4, Article 138.
Gibbon, John (1977). Scalar Expectancy Theory and Weber's law in animal timing. Psychological Review, 84(3), 279–325.
Gibbon, John (1991). Origins of scalar timing. Learning and Motivation, 22, 3–38.
Gibbon, John, Russell M. Church, & Warren H. Meck (1984). Scalar timing in memory. Annals of the New York Academy of Sciences, 423, 52–77.
Gibbon, John, Chara Malapani, Corby L. Dale, & C. R. Gallistel (1997). Toward a neurobiology of temporal cognition: Advances and challenges. Current Opinion in Neurobiology, 7(2), 170–184.
Gibson, James J. (1979). The Ecological Approach to Visual Perception. Boston, MA: Houghton Mifflin.
Gleitman, Lila R. & Eric Wanner (1982). Language acquisition: the state of the state of the art. In Eric Wanner & Lila R. Gleitman (eds), Language Acquisition: The State of the Art (pp. 3–48). Cambridge: Cambridge University Press.
Gobel, Eric W., Daniel J. Sanchez, & Paul J. Reber (2011). Integration of temporal and ordinal information during serial interception sequence learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37(4), 994–1000.
Goel, Anubhuti & Dean V. Buonomano (2014). Timing as an intrinsic property of neural networks: evidence from in vivo and in vitro experiments. Philosophical Transactions of the Royal Society B, 369, 20120460.
Goldman-Eisler, Frieda (1956). The determinants of the rate of speech output and their mutual relations. Journal of Psychosomatic Research, 1, 137–143.
Goldrick, Matthew (2006). Limited interaction in speech production: Chronometric, speech error, and neuropsychological evidence. Language and Cognitive Processes, 21(7–8), 817–855.
Goldrick, Matthew & Sheila E. Blumstein (2006). Cascading activation from phonological planning to articulatory processes: Evidence from tongue twisters. Language and Cognitive Processes, 21(6), 649–683.
Goldrick, Matthew & Karen Chu (2014). Gradient co-activation and speech error articulation: comment on Pouplier and Goldstein (2010). Language, Cognition and Neuroscience, 29(4), 452–458.
Goldstein, Louis (2010). (Personal communication).
Goldstein, Louis & Marianne Pouplier (2010). Language and Cognitive Processes, 25(5), 616–649.
Goldstein, L., D. Byrd, & E. Saltzman (2006). The role of vocal tract gestural action units in understanding the evolution of phonology. In M. A. Arbib (ed.), Action to Language via the Mirror Neuron System (pp. 215–249). Cambridge: Cambridge University Press.
Goldstein, Louis, Hosung Nam, Elliot Saltzman, & Ioana Chitoran (2009). Coupled oscillator planning model of speech timing and syllable structure. In G. Fant, H. Fujisaki, & J. Shen (eds), Frontiers in Phonetics and Speech Science (pp. 239–250). Beijing: The Commercial Press.
Goldrick, Matthew, H. Ross Baker, Amanda Murphy, & Melissa Baese-Berk (2011). Interaction and representational integration: Evidence from speech errors. Cognition, 121, 48–72.
Goldstein, Louis, Marianne Pouplier, Larissa Chen, Elliot Saltzman, & Dani Byrd (2007). Dynamic action units slip in speech production errors. Cognition, 103(3), 386–412.
Goldstone, Sanford & William T. Lhamon (1974). Studies of auditory-visual differences in human time judgment: 1. Sounds are judged longer than lights. Perceptual and Motor Skills, 39, 63–82.
Goldwater, Sharon & Mark Johnson (2003). Learning OT constraint rankings using a maximum entropy model. Proceedings of the Stockholm Workshop on "Variation in Optimality Theory" (pp. 111–120). Stockholm, Sweden.
Golla, Heidrun, Konstantin Tziridis, Thomas Haarmeier, Nicholas Catz, Shabtai Barash, & Peter Thier (2008). Reduced saccadic resilience and impaired saccadic adaptation due to cerebellar disease. European Journal of Neuroscience, 27, 132–144.
Goozée, Justine V., Leonard L. Lapointe, & Bruce E. Murdoch (2003). Effects of speaking rate on EMA-derived lingual kinematics: a preliminary investigation. Clinical Linguistics & Phonetics, 17(4–5), 375–381.
Gould, Stephen J. (1995). Dinosaur in a Haystack: Reflections in Natural History. New York: Harmony Books.
Gow, David W. Jr (2002). Does English coronal place assimilation create lexical ambiguity? Journal of Experimental Psychology: Human Perception and Performance, 28(1), 163–179.
Gow, David W. Jr & Peter C. Gordon (1995). Lexical and prelexical influences on word segmentation: Evidence from priming. Journal of Experimental Psychology: Human Perception and Performance, 21(2), 344–359.
Gräber, Susanne, Ingo Hertrich, Irene Daum, Sybille Spieker, & Hermann Ackermann (2002). Speech perception deficits in Parkinson's disease: underestimation of time intervals compromises identification of durational phonetic contrasts. Brain and Language, 82, 65–74.
Grahn, Jessica A. & James B. Rowe (2013). Finding and feeling the musical beat: Striatal dissociations between detection and prediction of regularity. Cerebral Cortex, 23, 913–921.
Green, John T., Richard B. Ivry, & Diana S. Woodruff-Pak (1999). Timing in eyeblink classical conditioning and timed-interval tapping. Psychological Science, 10(1), 19–23.
Green, Leonard, Joel Myerson, Daniel D. Holt, John R. Slevin, & Sara J. Estle (2004). Discounting of delayed food rewards in pigeons and rats: is there a magnitude effect? Journal of the Experimental Analysis of Behavior, 81, 39–50.
Greenberg, J. H., Charles E. Osgood, & James J. Jenkins (2004). Memorandum concerning language universals, presented to the Conference on Language Universals, Gould House, Dobbs Ferry, New York, April 13–15, 1961. In Joseph H. Greenberg (ed.), Universals of Language, 2nd edn (pp. 15–27). Cambridge, MA: MIT Press.
Grondin, Simon (2010). Timing and time perception: A review of recent behavioral and neuroscience findings and theoretical directions. Attention, Perception & Psychophysics, 72(3), 561–582.
Grondin, Simon (2014). About the (non)scalar property for time perception. In H. Merchant & V. de Lafuente (eds), Neurobiology of Interval Timing (pp. 17–32). New York: Springer Science+Business Media.
Grosjean, François & Maryann Collins (1979). Breathing, pausing, reading. Phonetica, 36, 98–114.
Grube, Manon, Freya E. Cooper, Patrick F. Chinnery, & Timothy D. Griffiths (2010). Dissociation of duration-based and beat-based auditory timing in cerebellar degeneration. Proceedings of the National Academy of Sciences, 107(25), 11597–11601.
Grube, Manon, Kwang-Hyuk Lee, Timothy D. Griffiths, Anthony Barker, & Peter W. Woodruff (2010). Transcranial magnetic theta-burst stimulation of the human cerebellum distinguishes absolute, duration-based from relative, beat-based perception of subsecond time intervals. Frontiers in Psychology, 1, Article 171.
Guenther, Frank H. (1994). A neural network model of speech acquisition and motor equivalent speech production. Biological Cybernetics, 72, 43–53.
Guenther, Frank H. (1995). Speech sound acquisition, coarticulation, and rate effects in a neural-network model of speech production. Psychological Review, 102(3), 594–621.
Guenther, Frank H. (2006). Cortical interactions underlying the production of speech sounds. Journal of Communication Disorders, 39, 350–365.
Guenther, Frank H. (2016). Neural Control of Speech. Cambridge, MA: MIT Press.
Guenther, Frank H. & Tony Vladusich (2012). A neural theory of speech acquisition and production. Journal of Neurolinguistics, 25, 408–422.
Guenther, Frank H., Satrajit S. Ghosh, & Jason A. Tourville (2006). Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language, 96, 280–301.
Guenther, Frank H., Carol Espy-Wilson, Suzanne E. Boyce, Melanie L. Matthies, Majid Zandipour, & Joseph S. Perkell (1999). Articulatory tradeoffs reduce acoustic variability during American English /r/ production. Journal of the Acoustical Society of America, 105(5), 2854–2865.
Guigon, Emmanuel (2011). Models and architectures for motor control: Simple or complex? In Frédéric Danion & Mark L. Latash (eds), Motor Control: Theories, Experiments, and Applications (pp. 478–502). Oxford: Oxford University Press.
Guigon, Emmanuel, Pierre Baraduc, & Michel Desmurget (2007). Computational motor control: Redundancy and invariance. Journal of Neurophysiology, 97, 331–347.
Guigon, Emmanuel, Pierre Baraduc, & Michel Desmurget (2008). Computational motor control: feedback and accuracy. European Journal of Neuroscience, 27(4), 1003–1016.
Haggard, Patrick & Alan Wing (1998). Coordination of hand aperture with the spatial path of hand transport. Experimental Brain Research, 118, 286–292.
Haith, Adrian M. & John W. Krakauer (2013). Theoretical models of motor control and motor learning. In Albert Gollhofer, Wolfgang Taub, & Jens Bo Nielsen (eds), Routledge Handbook of Motor Control and Motor Learning (pp. 7–28). London: Routledge.
Haken, Hermann, J. A. Scott Kelso, & Heinz Bunz (1985). A theoretical model of phase transitions in human hand movements. Biological Cybernetics, 51, 347–356.
Halle, Morris (1992). Phonological features. In W. Bright (ed.), International Encyclopedia of Linguistics (Vol. III, pp. 207–212). New York: Oxford University Press.
Halle, Morris (1995). Feature geometry and feature spreading. Linguistic Inquiry, 26(1), 1–46.
Halle, Morris & Jean-Roger Vergnaud (1987). An Essay on Stress. Cambridge, MA: MIT Press.
Halliday, Michael Alexander Kirkwood (1967). Intonation and Grammar in British English. The Hague: De Gruyter.
Hamilton, Antonia F. de C. & Daniel Wolpert (2002). Controlling the statistics of action: Obstacle avoidance. Journal of Neurophysiology, 87, 2434–2440.
Hancock, Peter A. & Karl M. Newell (1985). The movement speed-accuracy relationship in space-time. In Herbert Heuer, Uwe Kleinbeck, & Klaus-Helmut Schmidt (eds), Motor Behavior: Programming, Control, and Acquisition (pp. 153–185). Berlin: Springer-Verlag.
Hansen, Steve, Luc Tremblay, & Digby Elliott (2005). Part and whole practice: Chunking and online control in the acquisition of a serial motor task. Research Quarterly for Exercise and Sport, 76(1), 60–66.
Hardy, Nicholas F. & Dean V. Buonomano (2016). Neurocomputational models of interval and pattern timing. Current Opinion in Behavioral Sciences, 8, 250–257.
Harrington, Deborah L., Kathleen Y. Haaland, & Robert T. Knight (1998). Cortical networks underlying mechanisms of time perception. Journal of Neuroscience, 18(3), 1085–1095.
Harrington, Jonathan, Janet Fletcher, & Mary E. Beckman (2000). Manner and place conflicts in the articulation of accent in Australian English. In M. Broe & J. Pierrehumbert (eds), Papers in Laboratory Phonology V: Language Acquisition and the Lexicon (pp. 40–51). Cambridge: Cambridge University Press.
Harrington, Jonathan, Janet Fletcher, & Corrine Roberts (1995). Coarticulation and the accented/unaccented distinction: evidence from jaw movement data. Journal of Phonetics, 23, 305–322.
Harris, Christopher M. (1995). Does saccadic undershoot minimize saccadic flight-time? A Monte-Carlo study. Vision Research, 35(5), 691–701.
Harris, Christopher M. & Daniel M. Wolpert (1998). Signal-dependent noise determines motor planning. Nature, 394(6695), 780–784.
Harris, Christopher M. & Daniel M. Wolpert (2006). The main sequence of saccades optimizes speed–accuracy trade-off. Biological Cybernetics, 95(1), 21–29.
Hatsopoulos, Nicholas, Fabrizio Gabbiani, & Gilles Laurent (1995). Elementary computation of object approach by a wide-field visual neuron. Science, 270(5238), 1000–1003.
Hatze, H. & J. D. Buys (1977). Energy-optimal controls in the mammalian neuromuscular system. Biological Cybernetics, 27, 9–20.
Hayes, Bruce (1981). A Metrical Theory of Stress Rules. [Revised version distributed by Indiana University Linguistics Club, Bloomington. Published by Garland Press, New York, 1985.] (PhD (1980)), Massachusetts Institute of Technology.
Hayes, Bruce (1983). A grid-based theory of English meter. Linguistic Inquiry, 14(3), 357–393.
Hayes, Bruce (1989). The prosodic hierarchy in meter. In Paul Kiparsky & Gilbert Youmans (eds), Phonetics and Phonology: Rhythm and Meter, Volume 1 (pp. 201–259). San Diego, CA: Academic Press, Inc.
Hecht, Heiko & Geert J. P. Savelsbergh (eds) (2004). Time-to-Contact. Amsterdam: Elsevier.
Helmbold, Nadine, Stefan Troche, & Thomas Rammsayer (2007). Processing of temporal and nontemporal information as predictors of psychometric intelligence: A structural-equation-modeling approach. Journal of Personality, 75(5), 985–1006.
Henke, William L. (1966). Dynamic Articulatory Model of Speech Production Using Computer Simulation. (PhD), Massachusetts Institute of Technology.
Henke, William L. (1967). Preliminaries to speech synthesis based on an articulatory model. Proceedings of the 1967 IEEE Boston Speech Conference (pp. 170–171). New York: The Institute of Electrical and Electronic Engineers.
Hertrich, Ingo & Hermann Ackermann (1997). Articulatory control of phonological vowel length contrasts: Kinematic analysis of labial gestures. Journal of the Acoustical Society of America, 102(1), 523–536.
Hertrich, Ingo & Hermann Ackermann (2000). Lip–jaw and tongue–jaw coordination during rate-controlled syllable repetitions. Journal of the Acoustical Society of America, 107(4), 2236–2247.
Heuer, Herbert & Richard A. Schmidt (1988). Transfer of learning among motor patterns with different relative timing. Journal of Experimental Psychology: Human Perception and Performance, 14, 241–252.
Heyward, Jennifer, Alice Turk, & Christian Geng (2014). Does /t/ produced as [ʔ] involve tongue tip raising? Articulatory evidence for the nature of phonological representations. Poster presented at the 14th Conference on Laboratory Phonology, Tokyo.
Hickok, Gregory (2014). The architecture of speech production and the role of the phoneme in speech processing. Language, Cognition and Neuroscience, 29(1), 2–10.
Hinton, Sean C. & Stephen M. Rao (2004). "One-thousand one . . . one-thousand two . . .": Chronometric counting violates the scalar property in interval timing. Psychonomic Bulletin & Review, 11, 24–30.
Hoagland, Hudson (1933). The physiological control of judgments of duration: Evidence for a chemical clock. The Journal of General Psychology, 9(2), 267–287.
Hoff, Bruce & Michael A. Arbib (1993). Models of trajectory formation and temporal interaction of reach and grasp. Journal of Motor Behavior, 25(3), 175–192.
Hoffman, Donna S. & Peter L. Strick (1999). Step-tracking movements of the wrist. IV. Muscle activity associated with movements in different directions. Journal of Neurophysiology, 81, 319–333.
Hogan, Michael C., Erica Ingham, & S. Sadi Kurdak (1998). Contraction duration affects metabolic energy cost and fatigue in skeletal muscle. American Journal of Physiology-Endocrinology and Metabolism, 274(3), E397–E402.
Hogan, Neville & Dagmar Sternad (2007). On rhythmic and discrete movements: reflections, definitions and implications for motor control. Experimental Brain Research, 181, 13–30.
Hore, Jon & Sherry Watts (2005). Timing finger opening in overarm throwing based on a spatial representation of hand path. Journal of Neurophysiology, 93, 3189–3199.
Houde, John F. & Edward F. Chang (2015). The cortical computations underlying feedback control in vocal production. Current Opinion in Neurobiology, 33, 174–181.
Houde, John F. & Michael I. Jordan (1998). Sensorimotor adaptation in speech production. Science, 279, 1213–1216.

OUP CORRECTED PROOF – FINAL, 29/1/2020, SPi



337

Houde, John F. & Michael I. Jordan (2002). Sensorimotor adaptation of speech I: Compensation and adaptation. Journal of Speech, Language, and Hearing Research, 45, 295–310.
Houde, John F. & Srikantan S. Nagarajan (2011). Speech production as state feedback control. Frontiers in Human Neuroscience, 5(Article 82), 1–14.
House, Arthur S. (1961). On vowel duration in English. Journal of the Acoustical Society of America, 33, 1174–1178.
Howard, Ian S. & Piers Messum (2011). Modeling the development of pronunciation in infant speech acquisition. Motor Control, 15, 85–117.
Howitt, Andrew Wilson (2000). Automatic syllable detection for vowel landmarks. (ScD thesis), Massachusetts Institute of Technology.
Hu, Xiaogang & Karl M. Newell (2011). Modeling constraints to redundancy in bimanual force coordination. Journal of Neurophysiology, 105, 2169–2180.
Huber, Jessica E. (2008). Effects of utterance length and vocal loudness on speech breathing in older adults. Respiratory Physiology & Neurobiology, 164(3), 323–330.
Hudson, Todd E., Laurence T. Maloney, & Michael S. Landy (2008). Optimal compensation for temporal uncertainty in movement planning. PLoS Computational Biology, 4(7), 1–9.
Huggins, A. W. F. (1975). On isochrony and syntax. In G. Fant & M. A. A. Tatham (eds), Auditory Analysis and Perception of Speech (pp. 455–464). Orlando, FL: Academic Press.
Iosad, Pavel (2012). Vowel reduction in Russian: No phonetics in phonology. Journal of Linguistics, 48, 521–571.
Ivry, Richard & Daniel M. Corcos (1993). Slicing the variability pie: component analysis of coordination and motor dysfunction. In Karl M. Newell & Daniel M. Corcos (eds), Variability and Motor Control (pp. 415–447). Champaign, IL: Human Kinetics.
Ivry, Richard & R. Eliot Hazeltine (1995). Perception and production of temporal intervals across a range of durations: Evidence for a common timing mechanism. Journal of Experimental Psychology: Human Perception and Performance, 21(1), 3–18.
Ivry, Richard & John E. Schlerf (2008). Dedicated and intrinsic models of time perception. Trends in Cognitive Sciences, 12(7), 273–280.
Jackendoff, Ray (1987). Consciousness and the Computational Mind (Vol. 3). Cambridge, MA: MIT Press.
Jacobson, Gilad A., Dan Rokni, & Yosef Yarom (2008). A model of the olivo-cerebellar system as a temporal pattern generator. Trends in Neurosciences, 31(12), 617–625.
Jaeger, T. Florian (2006). Redundancy and syntactic reduction in spontaneous speech. (PhD), Stanford University, CA.
Jaeger, T. Florian (2010). Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology, 61, 23–62.
Jakobson, Roman & Morris Halle (1956). Fundamentals of Language. The Hague: De Gruyter.
Jax, Steven A. & David A. Rosenbaum (2007). Hand path priming in manual obstacle avoidance: Evidence that the dorsal stream does not only control visually guided actions in real time. Journal of Experimental Psychology: Human Perception and Performance, 33(2), 425–441.
Jax, Steven A. & David A. Rosenbaum (2009). Hand path priming in manual obstacle avoidance: Rapid decay of dorsal stream information. Neuropsychologia, 47, 1573–1577.
Jeffress, Lloyd A. (1948). A place theory of sound localization. Journal of Comparative and Physiological Psychology, 41(1), 35–39.
Jimura, Koji, Joel Myerson, Joseph Hilgard, Todd S. Braver, & Leonard Green (2009). Are people really more patient than other animals? Evidence from human discounting of real liquid rewards. Psychonomic Bulletin & Review, 16(6), 1071–1075.
Jin, Dezhe Z., Naotaka Fujii, & Ann M. Graybiel (2009). Neural representation of time in cortico-basal ganglia circuits. Proceedings of the National Academy of Sciences, 106(45), 19156–19161.
Johnson, Keith (1991). Dynamic aspects of English vowels in /bVb/ sequences. UCLA Working Papers in Phonetics, 80, 99–120.
Johnson, Keith (2004). Massive reduction in conversational American English. In Spontaneous Speech: Data and Analysis. Proceedings of the 1st Session of the 10th International Symposium (pp. 29–54). Tokyo.
Jones, Daniel (1950). The Phoneme: Its Nature and Use. Cambridge: Heffer.
Jones, Luke A. & J. H. Wearden (2004). Double standards: Memory loading in temporal reference memory. The Quarterly Journal of Experimental Psychology, 57B(1), 55–77.
Jordan, Michael I. & Daniel M. Wolpert (1999). Computational motor control. In M. Gazzaniga (ed.), The Cognitive Neurosciences (pp. 601–620). Cambridge, MA: MIT Press.
Judge, Sarah J. & F. Claire Rind (1997). The locust DCMD, a movement-detecting neurone tightly tuned to collision trajectories. Journal of Experimental Biology, 200, 2209–2216.
Jun, Sun-Ah (ed.) (2005). Prosodic Typology: The Phonology of Intonation and Phrasing. Oxford: Oxford University Press.
Kaisse, Ellen M. (1985). Connected Speech: The Interaction of Syntax and Phonology. Orlando, FL: Academic Press.
Kalenscher, Tobias, Tobias Ohmann, Sabine Windmann, Nadja Freund, & Onur Güntürkün (2006). Single forebrain neurons represent interval timing and reward amount during response scheduling. European Journal of Neuroscience, 24, 2923–2931.
Kanai, Ryota, Harriet Lloyd, Domenica Bueti, & Vincent Walsh (2011). Modality-independent role of the primary auditory cortex in time estimation. Experimental Brain Research, 209, 465–471.
Kaplan, Abby (2010). Phonology Shaped by Phonetics: The Case of Intervocalic Lenition. (PhD), University of California, Santa Cruz.
Karmarkar, Uma R. & Dean V. Buonomano (2007). Timing in the absence of clocks: Encoding time in neural network states. Neuron, 53, 427–438.
Katsika, Argyro (2012). Coordination of Prosodic Gestures at Boundaries in Greek. (PhD), Yale University.
Katsika, Argyro, Jelena Krivokapić, Christine Mooshammer, Mark Tiede, & Louis Goldstein (2014). The coordination of boundary tones and their interaction with prominence. Journal of Phonetics, 44, 62–82.
Katsumata, Hiromu & Daniel M. Russell (2012). Prospective versus predictive control in timing of hitting a falling ball. Experimental Brain Research, 216(4), 499–514.
Katz, Jonah (2010). Compression effects, perceptual asymmetries, and the grammar of timing. (PhD), Massachusetts Institute of Technology.
Katz, Jonah (2012). Compression effects in English. Journal of Phonetics, 40, 390–402.
Kayed, Nanna Sønnichsen & Audrey L. H. van der Meer (2009). A longitudinal study of prospective control in catching by full-term and preterm infants. Experimental Brain Research, 194, 245–258.
Kazanina, Nina, Jeffrey S. Bowers, & William J. Idsardi (2017). Phonemes: Lexical access and beyond. Psychonomic Bulletin & Review, 25(2), 560–585.
Kazanina, Nina, Colin Phillips, & William J. Idsardi (2006). The influence of meaning on the perception of speech sounds. Proceedings of the National Academy of Sciences, 103(30), 11381–11386.
Kazennikov, O., U. Wicki, M. Corboz, B. Hyland, A. Palmeri, E. M. Rouiller, & M. Wiesendanger (1994). Temporal structure of a bimanual goal-directed movement sequence in monkeys. European Journal of Neuroscience, 6, 203–210.
Keating, Patricia (1990). The window model of coarticulation: Articulatory evidence. In John C. Kingston & Mary E. Beckman (eds), Papers in Laboratory Phonology I: Between the Grammar and Physics of Speech (pp. 450–469). Cambridge: Cambridge University Press.
Keating, Patricia (2006). Phonetic encoding of prosodic structure. In Jonathan Harrington & Marija Tabain (eds), Speech Production: Models, Phonetic Processes, and Techniques (pp. 167–186). New York and Hove: Psychology Press.
Keating, Patricia & Stefanie Shattuck-Hufnagel (2002). A prosodic view of word form encoding for speech production. UCLA Working Papers in Phonetics, 101, 112–156.
Keating, Patricia, Taehong Cho, Cécile Fougeron, & Chai-Shune Hsu (2003). Domain-initial strengthening in four languages. In John Local, Richard Ogden, & Rosalind Temple (eds), Phonetic Interpretation: Papers in Laboratory Phonology VI (pp. 143–161). Cambridge: Cambridge University Press.
Keele, Steven W. (1981). Behavioral analysis of movement. In Vernon B. Brooks (ed.), Handbook of Physiology. Section 1: The Nervous System, Vol. 2: Motor Control (pp. 1391–1414). Baltimore, MD: Williams & Wilkins.
Keele, Steven W., Robert Nicoletti, Richard I. Ivry, & Robert A. Pokorny (1989). Mechanisms of perceptual timing: Beat-based or interval-based judgements? Psychological Research, 50, 251–256.
Keele, Steven W., Robert A. Pokorny, Daniel M. Corcos, & Richard Ivry (1985). Do perception and motor production share common timing mechanisms? A correlational analysis. Acta Psychologica, 60(2–3), 173–191.
Keller, Eric (1990). Speech motor timing. In W. J. Hardcastle & A. Marchal (eds), Speech Production and Speech Modelling (pp. 343–364). Dordrecht: Springer Netherlands.
Kello, Christopher T. & David C. Plaut (2004). A neural network model of the articulatory-acoustic forward mapping trained on recordings of articulatory parameters. Journal of the Acoustical Society of America, 116(4), 2354–2364.
Kelso, J. A. Scott (1981). On the oscillatory basis of movement. Bulletin of the Psychonomic Society, 18(2), 63.
Kelso, J. A. Scott (1992). Theoretical concepts and strategies for understanding perceptual-motor skill: From information capacity in closed systems to self-organization in open, nonequilibrium systems. Journal of Experimental Psychology: General, 121(3), 260–261.
Kelso, J. A. Scott & Betty Tuller (1987). Intrinsic time in speech production: theory, methodology, and preliminary observations. In Eric Keller & Myrna Gopnik (eds), Motor and Sensory Processes of Language (pp. 203–222). Hillsdale, NJ: Lawrence Erlbaum.
Kelso, J. A. Scott, Dan L. Southard, & David Goodman (1979). On the coordination of two-handed movements. Journal of Experimental Psychology: Human Perception and Performance, 5(2), 229–238.
Kelso, J. A. Scott, Betty Tuller, & Katherine S. Harris (1983). A “dynamic pattern” perspective on the control and coordination of movement. In P. F. MacNeilage (ed.), The Production of Speech (pp. 137–173). New York: Springer.
Kelso, J. A. Scott, Kenneth G. Holt, Philip Rubin, & Peter N. Kugler (1981). Patterns of human interlimb coordination emerge from the properties of non-linear, limit cycle oscillatory processes: Theory and data. Journal of Motor Behavior, 13(4), 226–261.
Kelso, J. A. Scott, Eric Vatikiotis-Bateson, Elliot L. Saltzman, & Bruce Kay (1985). A qualitative dynamic analysis of reiterant speech production: Phase portraits, kinematics, and dynamic modeling. Journal of the Acoustical Society of America, 77(1), 266–280.
Khan, Michael A. & Ian M. Franks (2000). The effect of practice on component submovements is dependent on the availability of visual feedback. Journal of Motor Behavior, 32(3), 227–240.
Khan, Michael A., Ian M. Franks, & David Goodman (1998). The effect of practice on the control of rapid aiming movements: Evidence for an interdependency between programming and feedback processing. The Quarterly Journal of Experimental Psychology Section A: Human Experimental Psychology, 51(2), 425–443.
Killeen, Peter R. & Neil A. Weiss (1987). Optimal timing and the Weber function. Psychological Review, 94(4), 455–468.
Kim, Heejin (2006). Speech Rhythm in American English: A Corpus Study. (PhD), University of Illinois at Urbana-Champaign.
Kim, Heejin & Jennifer Cole (2005). The stress foot as a unit of planned timing: evidence from shortening in the prosodic phrase. Proceedings of Interspeech 2005. Lisbon, Portugal.
Kimura, Toshitaka & Hiroaki Gomi (2009). Temporal development of anticipatory reflex modulation to dynamical interactions during arm movement. Journal of Neurophysiology, 102, 2220–2231.
Kingston, John (2008). Lenition. In L. Colantoni & J. Steele (eds), Selected Proceedings of the 3rd Conference on Laboratory Approaches to Spanish Phonology (pp. 1–31). Somerville, MA: Cascadilla Proceedings Project.
Kingston, John & Randy L. Diehl (1994). Phonetic knowledge. Language, 70(3), 419–454.
Kirchner, Robert Martin (1998). An Effort-Based Approach to Consonant Lenition. (PhD), University of California, Los Angeles.
Kistemaker, Dinant A., Jeremy D. Wong, & Paul L. Gribble (2010). The central nervous system does not minimize energy cost in arm movements. Journal of Neurophysiology, 104, 2985–2994.
Kistemaker, Dinant A., Jeremy D. Wong, & Paul L. Gribble (2014). The cost of moving optimally: kinematic path selection. Journal of Neurophysiology, 112, 1815–1824.
Klatt, Dennis H. (1973). Interaction between two factors that influence vowel duration. Journal of the Acoustical Society of America, 54, 1102–1104.
Klatt, Dennis H. (1976). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. Journal of the Acoustical Society of America, 59(5), 1208–1220.
Klatt, Dennis H. (1987). Review of text-to-speech conversion for English. Journal of the Acoustical Society of America, 82(3), 737–793.
Kobayashi, Shunsuke & Wolfram Schultz (2008). Influence of reward delays on responses of dopamine neurons. Journal of Neuroscience, 28, 7837–7846.
Koch, Giacomo, Massimiliano Oliveri, Sara Torriero, Silvia Salerno, Emanuele Lo Gerfo, & Carlo Caltagirone (2007). Repetitive TMS of cerebellum interferes with millisecond time processing. Experimental Brain Research, 179, 291–299.
Kohler, Klaus J. (1983). Prosodic boundary signals in German. Phonetica, 40, 89–134.
Kohler, Klaus J. (1990). Segmental reduction in connected speech in German: phonological facts and phonetic explanations. In William J. Hardcastle & Alain Marchal (eds), Speech Production and Speech Modelling (pp. 69–92). NATO Science Series D: Volume 55. Dordrecht: Springer Netherlands.
Kornysheva, Katja & Jörn Diedrichsen (2014). Human premotor areas parse sequences into their spatial and temporal features. eLife, 3, e03043.
Kornysheva, Katja, Anika Sierk, & Jörn Diedrichsen (2013). Interaction of temporal and ordinal representations in movement sequences. Journal of Neurophysiology, 109, 1416–1424.
Kozhevnikov, V. & L. Chistovich (1965). Speech: Articulation and Perception. Moscow–Leningrad: Nauka.
Krakow, Rena A. (1989). The Articulatory Organization of Syllables: A Kinematic Analysis of Labial and Velic Gestures. (PhD), Yale University.
Krakow, Rena A. (1999). Physiological organization of syllables: a review. Journal of Phonetics, 27, 23–54.
Krivokapić, Jelena (2007). Prosodic planning: Effects of phrasal length and complexity on pause duration. Journal of Phonetics, 35, 162–179.
Krivokapić, Jelena (2013). Rhythm and convergence between speakers of American and Indian English. Laboratory Phonology, 4(1), 39–65.
Krivokapić, Jelena (2020). Prosody in Articulatory Phonology. In Jonathan A. Barnes & Stefanie Shattuck-Hufnagel (eds), Prosodic Theory and Practice. Cambridge, MA: MIT Press.
Krull, Diana (1997). Prepausal lengthening in Estonian: Evidence from conversational speech. In Ilse Lehiste & Jaan Ross (eds), Estonian Prosody: Papers from a Symposium (pp. 136–148). Tallinn: Institute of Estonian Language.
Kuehn, David & Kenneth Moll (1976). A cineradiographic study of VC and CV articulatory velocities. Journal of Phonetics, 4, 303–320.
Kugler, Peter N., J. A. Scott Kelso, & M. T. Turvey (1980). On the concept of coordinative structures as dissipative structures: I. Theoretical lines of convergence. In G. E. Stelmach & J. Requin (eds), Tutorials in Motor Behavior (pp. 3–47). Amsterdam: North-Holland.
Kuhl, Patricia K., Jean E. Andruski, Inna A. Chistovich, Ludmilla A. Chistovich, Elena V. Kozhevnikova, Viktoria L. Ryskina, Elvira I. Stolyarova, Ulla Sundberg, & Francisco Lacerda (1997). Cross-language analysis of phonetic units in language addressed to infants. Science, 277, 684–686.
Lacquaniti, Francesco & Claudio Maioli (1987). Anticipatory and reflex coactivation of antagonist muscles in catching. Brain Research, 406, 373–378.
Lacquaniti, Francesco & Claudio Maioli (1989). The role of preparation in tuning anticipatory and reflex responses during catching. Journal of Neuroscience, 9(1), 134–148.
Lacquaniti, Francesco & Claudio Maioli (1994). Coordinate transformations in the control of cat posture. Journal of Neurophysiology, 72, 1496–1515.
Lacquaniti, Francesco, Carlo Terzuolo, & Pablo Viviani (1983). The law relating the kinematic and figural aspects of drawing movements. Acta Psychologica, 54, 115–130.
Ladd, D. Robert (2008). Intonational Phonology (2nd edn). Cambridge: Cambridge University Press.
Ladd, D. Robert (2011). Phonetics in phonology. In John Goldsmith, Jason Riggle, & Alan C. L. Yu (eds), The Handbook of Phonological Theory, Second Edition (pp. 348–373). Hoboken, NJ: Wiley-Blackwell.
Ladd, D. Robert & Catherine Johnson (1987). ‘Metrical’ factors in the scaling of sentence-initial accent peaks. Phonetica, 44, 238–245.
Ladefoged, Peter (1963). Some physiological parameters in speech. Language and Speech, 6(3), 109–119.
Laje, Rodrigo & Dean V. Buonomano (2013). Robust timing and motor patterns by taming chaos in recurrent neural networks. Nature Neuroscience, 16(7), 925–936.
Lametti, Daniel R., Sazzad M. Nasir, & David J. Ostry (2012). Sensory preference in speech production revealed by simultaneous alteration of auditory and somatosensory feedback. Journal of Neuroscience, 32(27), 9351–9358.
Large, Edward W. & Caroline Palmer (2002). Perceiving temporal regularity in music. Cognitive Science, 26, 1–37.
Lashley, Karl Spencer (1951). The problem of serial order in behavior. In Lloyd A. Jeffress (ed.), Cerebral Mechanisms in Behavior: The Hixon Symposium (pp. 112–146). Oxford: Wiley.
Latash, Mark L. (2008). Synergy. New York: Oxford University Press.
Laubstein, Ann Stuart (1987). Syllable structure: The speech error evidence. Canadian Journal of Linguistics/Revue canadienne de linguistique, 32(4), 339–363.
Laubstein, Ann Stuart (2006). Two constraint types: Slots and percolation. In C. Gurski (ed.), Proceedings of the 2005 Canadian Linguistics Association Annual Conference (pp. 1–12). Toronto.
Lavoie, Lisa M. (2001). Consonant Strength: Phonological Patterns and Phonetic Manifestations. London: Routledge.
Lee, David N. (1976). A theory of visual control of braking based on information about time to collision. Perception, 5, 437–459.
Lee, David N. (1980). The optic flow field: the foundation of vision. Philosophical Transactions of the Royal Society B, 290, 169–179.
Lee, David N. (1998). Guiding movement by coupling taus. Ecological Psychology, 10(3–4), 221–250.
Lee, David N. (2009). General Tau Theory: evolution to date. Special Issue: Landmarks in Perception. Perception, 38, 837–858.
Lee, David N. (2011). How movement is guided. (Unpublished ms.) Available from http://www.pmarc.ed.ac.uk/ideas/pdf/HowMovtGuided100311.pdf.
Lee, David N. & Paul E. Reddish (1981). Plummeting gannets: A paradigm of ecological optics. Nature, 293, 293–294.
Lee, David N., Apostolos P. Georgopoulos, & Gert-Jan Pepping (2017). Function of basal ganglia in tauG-guiding action. bioRxiv non-peer-reviewed preprint. Available from http://dx.doi.org/10.1101/143065.
Lee, David N., Apostolos P. Georgopoulos, Martyn J. O. Clark, Cathy M. Craig, & Nicholas Lindman Port (2001). Guiding contact by coupling the taus of gaps. Experimental Brain Research, 139, 151–159.
Lee, David N. & Alice Turk (in preparation). Vocalizing by tauG-guiding articulators.
Lee, David N., D. S. Young, P. E. Reddish, S. Lough, & T. M. H. Clayton (1983). Visual timing in hitting an accelerating ball. Quarterly Journal of Experimental Psychology Section A: Human Experimental Psychology, 35, 333–346.
Lee, David N., Reinoud J. Bootsma, Barrie J. Frost, Mike Land, David Regan, & Rob Gray (2009). General Tau Theory: evolution to date. Special Issue: Landmarks in Perception. Perception, 38, 837–858.
Lee, Suk-Myung & Jeung-Yoon Choi (2012a). Analysis of acoustic parameters for consonant voicing detection in clean and telephone speech. Journal of the Acoustical Society of America, 131(3), EL197–EL202.
Lee, Suk-Myung & Jeung-Yoon Choi (2012b). Tense-lax vowel classification with energy trajectory and voice quality measurements. IEICE Transactions on Information and Systems, 95(3), 884–887.
Lee, Jung-Won, Jeung-Yoon Choi, & Hong-Goo Kang (2011). Classification of fricatives using feature extrapolation of acoustic-phonetic features in telephone speech. Proceedings of Interspeech, 1261–1264.
Lee, Jung-Won, Jeung-Yoon Choi, & Hong-Goo Kang (2012). Classification of stop consonant place of articulation using feature extrapolation in telephone speech. Journal of the Acoustical Society of America, 131(2), 1536–1546.
Lefkowitz, Lee Michael (2017). Maxent Harmonic Grammars and Phonetic Duration. (PhD), University of California, Los Angeles.
Lehiste, Ilse (1960). An acoustic-phonetic study of internal open juncture. Phonetica, 5(Suppl.), 1–54.
Lehiste, Ilse (1970). Suprasegmentals. Cambridge, MA: MIT Press.
Lehiste, Ilse (1972). The timing of utterances and linguistic boundaries. Journal of the Acoustical Society of America, 51(6, part 2), 2018–2024.
Lejeune, H. & J. H. Wearden (1991). The comparative psychology of fixed-interval responding: some quantitative analyses. Learning and Motivation, 22(1–2), 84–111.
Leon, Matthew I. & Michael N. Shadlen (2003). Representation of time by neurons in the posterior parietal cortex of the macaque. Neuron, 38, 317–327.
Leonard, Thomas & Fred Cummins (2011). The temporal relation between beat gestures and speech. Language and Cognitive Processes, 26(10), 1457–1471.
Levelt, Willem J. M. (1989). Speaking: From Intention to Articulation. Cambridge, MA: MIT Press.
Levelt, Willem J. M. (1999). Models of word production. Trends in Cognitive Sciences, 3(6), 223–232.
Levelt, Willem J. M. (2002). Phonological encoding in speech production: Comments on Jurafsky et al., Schiller et al., and van Heuven & Haan. Laboratory Phonology, 7, 87–99.
Levelt, Willem J. M., Ardi Roelofs, & Antje S. Meyer (1999). A theory of lexical access in speech production. Behavioral and Brain Sciences, 22(1), 1–38.
Levy, Roger & T. Florian Jaeger (2007). Speakers optimize information density through syntactic reduction. In Bernhard Schölkopf, John Platt, & Thomas Hoffman (eds), Advances in Neural Information Processing Systems (NIPS) (Vol. 19, pp. 849–856). Cambridge, MA: MIT Press.
Lewis, Penelope A. & Warren H. Meck (2012). Time and the sleeping brain. The Psychologist, 25(8), 594–597.
Lewis, Penelope A. & R. C. Miall (2009). The experience of time: Neural mechanisms and the interplay of emotion, cognition and embodiment. Philosophical Transactions: Biological Sciences, 364(1525), 1897–1905.
Li, Weiwei (2006). Optimal Control for Biological Movement Systems. (PhD), University of California, San Diego.
Liberman, Alvin M. & Michael Studdert-Kennedy (1978). Phonetic perception. Handbook of Sensory Physiology, 8, 143–178.
Liberman, Mark & Janet Pierrehumbert (1984). Intonational invariance under changes in pitch range and length. In Mark Aronoff & Richard T. Oehrle (eds), Language Sound Structure (pp. 157–233). Cambridge, MA: MIT Press.
Liberman, Mark & Alan Prince (1977). On stress and linguistic rhythm. Linguistic Inquiry, 8, 249–336.
Liberman, Alvin M., F. S. Cooper, D. P. Shankweiler, & M. Studdert-Kennedy (1967). Perception of the speech code. Psychological Review, 74(6), 431–461.
Lieberman, Philip (1963). Some effects of semantic and grammatical context on the production and perception of speech. Language and Speech, 6(3), 172–187.
Liljencrants, Johan & Björn Lindblom (1972). Numerical simulation of vowel quality systems: The role of perceptual contrast. Language, 48(4), 839–862.
Lindau, Mona (1985). The story of /r/. In V. A. Fromkin (ed.), Phonetic Linguistics: Essays in Honor of Peter Ladefoged (pp. 157–169). Orlando, FL: Academic Press.
Lindblom, Björn (1963). A spectrographic study of vowel reduction. Journal of the Acoustical Society of America, 35, 1773–1781.
Lindblom, Björn (1968). Temporal organization of syllable production. Reports of the 6th International Congress of Acoustics (pp. B29–B30). Tokyo.
Lindblom, Björn (1986). Phonetic universals in vowel systems. In John J. Ohala & Jeri J. Jaeger (eds), Experimental Phonology (pp. 13–44). Orlando, FL: Academic Press.
Lindblom, Björn (1990). Explaining phonetic variation: A sketch of the H&H theory. In William J. Hardcastle & Alain Marchal (eds), Speech Production and Speech Modelling (Vol. 55, pp. 403–439). Dordrecht: Kluwer.
Lindblom, B. & K. Rapp (1972). Reexamination of the compensatory adjustment of vowel duration in Swedish words. Occasional Papers, University of Essex, 13, 204–224.
Lisker, Leigh & Arthur S. Abramson (1970). The voicing dimension: Some experiments in comparative phonetics. Proceedings of the 6th International Congress of Phonetic Sciences (pp. 563–567). Prague.
Liu, Dan & Emanuel Todorov (2007). Evidence for the flexible sensorimotor strategies predicted by optimal feedback control. Journal of Neuroscience, 27(35), 9354–9368.
Liu, Sharlene A. (1996). Landmark detection for distinctive feature-based speech recognition. Journal of the Acoustical Society of America, 100(5), 3417–3430.
Löfqvist, Anders (1991). Proportional timing in speech motor control. Journal of Phonetics, 19, 343–350.
Löfqvist, Anders & Vincent L. Gracco (1999). Interarticulator programming in VCV sequences: Lip and tongue movements. Journal of the Acoustical Society of America, 105(3), 1864–1876.
Long, Michael A. & Michale S. Fee (2008). Using temperature to analyse temporal dynamics in the songbird motor pathway. Nature, 456, 189–194.
Long, Michael A., Dezhe Z. Jin, & Michale S. Fee (2010). Support for a synaptic chain model of neuronal sequence generation. Nature, 468, 394–399.
Louie, Kenway & Paul W. Glimcher (2010). Separating value from choice: delay discounting activity in the lateral intraparietal area. Journal of Neuroscience, 30, 5498–5507.
Lyons, James, Steve Hansen, Suzanne Harding, & Digby Elliott (2006). Optimizing rapid aiming behaviour: movement kinematics depend on the cost of corrective modifications. Experimental Brain Research, 174, 95–100.
Macar, Françoise & Franck Vidal (2009). Timing processes: An outline of behavioural and neural indices not systematically considered in timing models. Canadian Journal of Experimental Psychology, 63(3), 227–239.
MacNeilage, Peter F. & Barbara Davis (1990). Acquisition of speech production: Frames, then content. In Marc Jeannerod (ed.), Attention and Performance XIII: Motor Representation and Control (pp. 453–476). Hillsdale, NJ: Lawrence Erlbaum Associates.
Maddieson, Ian (1985). Phonetic cues to syllabification. In V. A. Fromkin (ed.), Phonetic Linguistics: Essays in Honor of Peter Ladefoged (pp. 203–221). Orlando, FL: Academic Press.
Maiteq, Tareq Bashir (2013). Prosodic constituent structure and anticipatory pharyngealisation in Libyan Arabic. (PhD), University of Edinburgh.
Malapani, Chara & Stephen Fairhurst (2002). Scalar timing in animals and humans. Learning and Motivation, 33, 156–176.
Manuel, Sharon Y. (1995). Speakers nasalize /ð/ after /n/, but listeners still hear /ð/. Journal of Phonetics, 23, 453–476.
Marteniuk, R. G., C. L. MacKenzie, & D. M. Baba (1984). Bimanual movement control: Information processing and interaction effects. The Quarterly Journal of Experimental Psychology Section A, 36(2), 335–365.
Massion, J. (1984). Postural changes accompanying voluntary movements. Normal and pathological aspects. Human Neurobiology, 2, 261–267.
Matell, Matthew S. & Warren H. Meck (2004). Cortico-striatal circuits and interval timing: coincidence detection of oscillatory processes. Cognitive Brain Research, 21, 139–170.
Mates, Jiří (1994). A model of synchronization of motor acts to a stimulus sequence: I. Timing and error corrections. Biological Cybernetics, 70, 463–473.
Mauk, Michael D. & Dean V. Buonomano (2004). The neural basis of temporal processing. Annual Review of Neuroscience, 27, 307–340.
Max, Ludo, Marie E. Wallace, & Irena Vincent (2003). Sensorimotor adaptation to auditory perturbations during speech: Acoustic and kinematic experiments. Proceedings of the 15th International Congress of Phonetic Sciences (pp. 1053–1056). Barcelona: Universitat Autònoma de Barcelona.
Mazzoni, Pietro, Anna Hristova, & John W. Krakauer (2007). Why don’t we move faster? Parkinson’s disease, movement vigor, and implicit motivation. Journal of Neuroscience, 27, 7105–7116.
McAuley, J. Devin & Mari Riess Jones (2003). Modeling effects of rhythmic context on perceived duration: A comparison of interval and entrainment approaches to short-interval timing. Journal of Experimental Psychology: Human Perception and Performance, 29(6), 1102–1125.
McCarthy, John J. & Alan Prince (1993). Prosodic Morphology: Constraint Interaction and Satisfaction. Linguistics Department Faculty Publication Series (Vol. 14). Amherst, MA: University of Massachusetts.
McCarthy, John J. & Alan Prince (1994). The emergence of the unmarked: Optimality in prosodic morphology. Online. Available from https://doi.org/doi:10.7282/T3Z03663.
McIntyre, J., M. Zago, A. Berthoz, & F. Lacquaniti (2001). Does the brain model Newton’s laws? Nature Neuroscience, 4, 693–694.
McMurray, Bob & Allard Jongman (2011). What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychological Review, 118(2), 219–246.
McQueen, James M., Anne Cutler, & Dennis Norris (2006). Phonological abstraction in the mental lexicon. Cognitive Science, 30, 1113–1126.
Mechsner, Franz, Dirk Kerzel, Günther Knoblich, & Wolfgang Prinz (2001). Perceptual basis of bimanual coordination. Nature, 414, 69–73.
Medina, Javier F., Megan R. Carey, & Stephen G. Lisberger (2005). The representation of time for motor learning. Neuron, 45, 157–167.
Melgire, Manuela, Richard Ragot, Séverine Samson, Trevor B. Penney, Warren H. Meck, & Viviane Pouthas (2005). Auditory/visual duration bisection in patients with left or right medial-temporal lobe resection. Brain and Cognition, 58, 119–124.
Merchant, Hugo & Apostolos P. Georgopoulos (2006). Neurophysiology of perceptual and motor aspects of interception. Journal of Neurophysiology, 95, 1–13.
Merchant, Hugo, Deborah L. Harrington, & Warren H. Meck (2013). Neural basis of the perception and estimation of time. Annual Review of Neuroscience, 36, 313–336.
Merchant, Hugo, Wilbert Zarco, & Luis Prado (2008). Do we have a common mechanism for measuring time in the hundreds of millisecond range? Evidence from multiple-interval timing tasks. Journal of Neurophysiology, 99, 939–949.
Merchant, Hugo, Wilbert Zarco, Ramón Bartolo, & Luis Prado (2008). The context of temporal processing is represented in the multidimensional relationships between timing tasks. PLoS ONE, 3(9), e3169.
Merchant, Hugo, Wilbert Zarco, Oswaldo Pérez, Luis Prado, & Ramón Bartolo (2011). Measuring time with different neural chronometers during a synchronization-continuation task. Proceedings of the National Academy of Sciences, 108(49), 19784–19789.
Merchant, Hugo, Ramón Bartolo, Oswaldo Pérez, Juan Carlos Méndez, Germán Mendoza, Jorge Gámez, Karyna Yc, & Luis Prado (2014). Neurophysiology of timing in the hundreds of milliseconds: Multiple layers of neuronal clocks in the medial premotor areas. In H. Merchant & V. de Lafuente (eds), Neurobiology of Interval Timing (Vol. 829, pp. 143–154). New York: Springer.
Meringer, Rudolf & Carl Mayer (1895). Versprechen und Verlesen: Eine psychologisch-linguistische Studie. Stuttgart: Göschensche Verlagshandlung. Republished in 1978, Amsterdam: John Benjamins.
Meyer, David E., Richard A. Abrams, Sylvan Kornblum, & Charles E. Wright (1988). Optimality in human motor performance: Ideal control of rapid aimed movements. Psychological Review, 95(3), 340–370.
Mitrovic, Djordje (2010). Stochastic Optimal Control with Learned Dynamics Models. (PhD), University of Edinburgh.
Mitsuya, Takashi, Ewen N. MacDonald, & Kevin G. Munhall (2014). Temporal control and compensation for perturbed voicing feedback. Journal of the Acoustical Society of America, 135(5), 2986–2994.
Mitterer, Holger & Mirjam Ernestus (2008). The link between speech perception and production is phonological and abstract: Evidence from the shadowing task. Cognition, 109, 168–173.
Mo, Yoonsook, Jennifer Cole, & Mark Hasegawa-Johnson (2010). Prosodic effects on temporal structure of monosyllabic CVC words in American English. Paper presented at the 5th Speech Prosody, Chicago, IL.
Montagnini, Anna & Leonardo Chelazzi (2005). The urgency to look: Prompt saccades to the benefit of perception. Vision Research, 45, 3391–3401.
Moon, Seung-Jae & Björn Lindblom (1989). Formant undershoot in clear and citation-form speech: A second progress report. STL-QPSR, 30, 121–123.
Moon, Seung-Jae & Björn Lindblom (2003). Two experiments on oxygen consumption during speech production: vocal effort and speaking tempo. Proceedings of the 15th International Congress of Phonetic Sciences (pp. 3129–3132). Barcelona.
Moore, J. W., J. E. Desmond, & N. E. Berthier (1989). Adaptively timed conditioned responses and the cerebellum: a neural network approach. Biological Cybernetics, 62, 17–28.
Moore, Susan P. & R. G. Marteniuk (1986). Kinematic and electromyographic changes that occur as a function of learning a time-constrained aiming task. Journal of Motor Behavior, 18(4), 397–426.
Morel, Pierre, Philipp Ulbrich, & Alexander Gail (2017). What makes a reach movement effortful? Physical effort discounting supports common minimization principles in decision making and motor control. PLoS Biology, 15(6), e2001323.
Mowrey, Richard A. & Ian R. A. MacKay (1990). Phonological primitives: electromyographic speech error evidence. Journal of the Acoustical Society of America, 88(3), 1299–1312.
Mücke, Doris, Hosung Nam, Anne Hermes, & Louis Goldstein (2012). Coupling of tone and constriction gestures in pitch accents. In Philip Hoole, Marianne Pouplier, Lasse Bombien, Christine Mooshammer, & Barbara Kühnert (eds), Consonant Clusters and Structural Complexity (pp. 205–230). Berlin/New York: Mouton de Gruyter.
Munhall, Kevin G. (1993). The skill of speech production. In J. L. Starkes & F. Allard (eds), Cognitive Issues in Motor Expertise (pp. 201–221). Amsterdam: Elsevier Science Publishers B.V.
Munhall, Kevin G., Anders Löfqvist, & J. A. Scott Kelso (1994). Lip–larynx coordination in speech: Effects of mechanical perturbations to the lower lip. Journal of the Acoustical Society of America, 95, 3605–3616.
Munhall, Kevin G., Carol A. Fowler, Sarah Hawkins, & Elliot Saltzman (1992). “Compensatory shortening” in monosyllables of spoken English. Journal of Phonetics, 20, 225–239.
Myerson, Joel & Leonard Green (1995). Discounting of delayed rewards: Models of individual choice. Journal of the Experimental Analysis of Behavior, 64, 263–276.
Nagarajan, Srikantan S., David T. Blake, Beverly A. Wright, Nancy Byl, & Michael M. Merzenich (1998). Practice-related improvements in somatosensory interval discrimination are temporally specific but generalize across skin location, hemisphere, and modality. Journal of Neuroscience, 18(4), 1559–1570.
Nagasaki, H. (1989). Asymmetric velocity and acceleration profiles of human arm movements. Experimental Brain Research, 74, 319–326.
Nakai, Satsuki, Sari Kunnari, Alice Turk, Kari Suomi, & Riikka Ylitalo (2009). Utterance-final lengthening and quantity in Northern Finnish. Journal of Phonetics, 37, 29–45.
Nakai, Satsuki, Alice E. Turk, Kari Suomi, Sonia C. Granlund, Riikka Ylitalo, & Sari Kunnari (2012). Quantity constraints on the temporal implementation of phrasal prosody in Northern Finnish. Journal of Phonetics, 40, 796–807.
Nam, Hosung, Louis Goldstein, & Elliot Saltzman (2010). Self-organization of syllable structure: a coupled oscillator model. In François Pellegrino, Egidio Marsico, & Ioana Chitoran (eds), Approaches to Phonological Complexity. Berlin/New York: Mouton de Gruyter.
Nam, Hosung, Elliot Saltzman, Jelena Krivokapić, & Louis Goldstein (2008). Modeling the durational difference of stressed vs. unstressed syllables. Paper presented at the 8th Phonetic Conference of China, Beijing.
Nelson, Winston L. (1983). Physical principles for economies of skilled movements. Biological Cybernetics, 46, 135–147.
Nelson, Winston L., Joseph S. Perkell, & John R. Westbury (1984). Mandible movements during increasingly rapid articulations of single syllables: Preliminary observations. Journal of the Acoustical Society of America, 75, 945–951.
Nespor, Marina & Irene Vogel (1986). Prosodic Phonology. Dordrecht: Foris Publications.
Newell, Karl M. (1980). The speed-accuracy paradox in movement control: Errors of time and space. In G. E. Stelmach & J. Requin (eds), Tutorials in Motor Behavior. Amsterdam: North-Holland.
Niebuhr, Oliver & Klaus J. Kohler (2011). Perception of phonetic detail in the identification of highly reduced words. Journal of Phonetics, 39, 319–329.
Niemann, Henrik, Doris Mücke, Hosung Nam, Louis Goldstein, & Martine Grice (2011). Tones as gestures: the case of Italian and German. Proceedings of the 17th International Congress of Phonetic Sciences, Hong Kong, China, 1486–1489.
Niziolek, Caroline A. & Frank H. Guenther (2013). Vowel category boundaries enhance cortical and behavioral responses to speech feedback alterations. Journal of Neuroscience, 33(29), 12090–12098.
Niziolek, Caroline A., Srikantan S. Nagarajan, & John F. Houde (2013). What does motor efference copy represent? Evidence from speech production. Journal of Neuroscience, 33(41), 16110–16116.
Nooteboom, Sieb (1972). Production and perception of vowel duration: A study of durational properties of vowels in Dutch. (PhD), University of Utrecht.
O’Dell, Michael L. & Tommi Nieminen (1999). Coupled oscillator model of speech rhythm. In John J. Ohala, Yuko Hasegawa, Manjari Ohala, Daniel Granville, & Ashlee C. Bailey (eds), Proceedings of the XIVth International Congress of Phonetic Sciences (Vol. 2, pp. 1075–1078). American Institute of Physics.
O’Dell, Michael, Juraj Šimko, Tommi Nieminen, Martti Vainio, & Mona Lehtinen (2011). Relative timing of bilabial gestures in Finnish. Proceedings of the 17th International Congress of Phonetic Sciences, Hong Kong, China (pp. 1518–1521).
Ogden, Richard (2004). Non-modal voice quality and turn-taking in Finnish. In Elizabeth Couper-Kuhlen & Cecilia Ford (eds), Sound Patterns in Interaction: Cross-Linguistic Studies from Conversation (pp. 29–62). Amsterdam: John Benjamins.
Ohala, J. J. (1996). Speech perception is hearing sounds, not tongues. Journal of the Acoustical Society of America, 99(3), 1718–1725.
Öhman, Sven E. G. (1966). Coarticulation in VCV utterances: Spectrographic measurements. Journal of the Acoustical Society of America, 39(1), 151–168.
Öhman, Sven E. G. (1967). Numerical model of coarticulation. Journal of the Acoustical Society of America, 41, 310–320.
Oliveira, Flavio T. P., Digby Elliott, & David Goodman (2005). Energy-minimization bias: Compensating for intrinsic influence of energy-minimization mechanisms. Motor Control, 9, 101–114.
Oliver, Douglas L., Gretchen E. Beckius, Deborah C. Bishop, William C. Loftus, & Ranjan Batra (2003). Topography of interaural temporal disparity coding in projections of medial superior olive to inferior colliculus. Journal of Neuroscience, 23(19), 7438–7449.
Oller, D. Kimbrough (1973). The effect of position in utterance on speech segment duration in English. Journal of the Acoustical Society of America, 54(5), 1235–1247.
O’Reilly, Jill X., Katharine J. McCarthy, Mariagrazia Capizzi, & Anna Christina Nobre (2008). Acquisition of the temporal and ordinal structure of movement sequences in incidental learning. Journal of Neurophysiology, 99, 2731–2735.
Ostendorf, Mari, Patti J. Price, & Stefanie Shattuck-Hufnagel (1995). The Boston University radio news corpus. Linguistic Data Consortium, 1–19.
Ostry, David J. & Kevin G. Munhall (1985). Control of rate and duration of speech movements. Journal of the Acoustical Society of America, 77, 640–648.
Ostry, David J., Eric Keller, & Avraham Parush (1983). Similarities in the control of the speech articulators and the limbs: Kinematics of tongue dorsum movement in speech. Journal of Experimental Psychology: Human Perception and Performance, 9(4), 622–636.
O’Sullivan, Ian, Etienne Burdet, & Jörn Diedrichsen (2009). Dissociating variability and effort as determinants of coordination. PLoS Computational Biology, 5(4), e1000345.
Park, Chi-youn (2008). Consonant landmark detection for speech recognition. (PhD), Massachusetts Institute of Technology.
Pashler, Harold (2001). Perception and production of brief durations: Beat-based versus interval-based timing. Journal of Experimental Psychology: Human Perception and Performance, 27(2), 485–493.
Pastor, Maria A. & Julio Artieda (1996). Involvement of the basal ganglia in timing perceptual and motor tasks. In M. A. Pastor & J. Artieda (eds), Time, Internal Clocks and Movement (pp. 235–255). Amsterdam: Elsevier Science B.V.
Pate, John & Sharon Goldwater (2015). Talkers account for listener and channel characteristics to communicate efficiently. Journal of Memory and Language, 78, 1–17.
Pater, Joe (2009). Weighted constraints in generative linguistics. Cognitive Science, 33, 999–1035.
Paulignan, Y., C. MacKenzie, R. Marteniuk, & M. Jeannerod (1991). Selective perturbation of visual input during prehension: 1. The effects of changing object position. Experimental Brain Research, 83, 502–512.
Pélisson, Denis, Claude Prablanc, M. A. Goodale, & Marc Jeannerod (1986). Visual control of reaching movements without vision of the limb. II. Evidence of fast unconscious processes correcting the trajectory of the hand to the final position of a double-step stimulus. Experimental Brain Research, 62, 303–311.
Penney, Trevor B., John Gibbon, & Warren H. Meck (2000). Differential effects of auditory and visual signals on clock speed and temporal memory. Journal of Experimental Psychology: Human Perception and Performance, 26, 1770–1787.
Perez, Elvira, Julio Santiago, Alfonso Palma, & Padraig G. O’Seaghdha (2007). Perceptual bias in speech error data collection: Insights from Spanish. Journal of Psycholinguistic Research, 36, 207–235.
Perkell, Joseph S. (2012). Movement goals and feedback and feedforward control mechanisms in speech production. Journal of Neurolinguistics, 25, 382–407.
Perkell, Joseph S. & Melanie L. Matthies (1992). Temporal measures of anticipatory labial coarticulation for the vowel /u/: within-subject and cross-subject variability. Journal of the Acoustical Society of America, 91(5), 2911–2925.
Perkell, Joseph S., Melanie L. Matthies, Mario A. Svirsky, & Michael I. Jordan (1993). Trading relations between tongue-body raising and lip rounding in production of the vowel /u/: a pilot “motor equivalence” study. Journal of the Acoustical Society of America, 93(5), 2948–2961.
Perkell, Joseph, Majid Zandipour, Melanie L. Matthies, & Harlan Lane (2002). Economy of effort in different speaking conditions. I. A preliminary study of intersubject differences and modeling issues. Journal of the Acoustical Society of America, 112(4), 1627–1641.
Perkell, Joseph S., Melanie L. Matthies, Mark Tiede, Harlan Lane, Majid Zandipour, Nicole Marrone, Ellen Stockmann, & Frank H. Guenther (2004). The distinctness of speakers’ /s/–/ʃ/ contrast is related to their auditory discrimination and use of an articulatory saturation effect. Journal of Speech, Language, and Hearing Research, 47, 1259–1269.
Perrett, Stephen P. (1998). Temporal discrimination in the cerebellar cortex during conditioned eyelid responses. Experimental Brain Research, 121, 115–124.
Peterson, Gordon E. & Ilse Lehiste (1960). Duration of syllable nuclei in English. Journal of the Acoustical Society of America, 32(6), 693–703.
Pierrehumbert, Janet (2002). Word-specific phonetics. In Carlos Gussenhoven & Natasha Warner (eds), Laboratory Phonology 7 (pp. 101–139). Berlin: Mouton de Gruyter.
Pierrehumbert, Janet (2016). Phonological representation: Beyond abstract versus episodic. Annual Review of Linguistics, 2, 33–52.
Pierrehumbert, Janet & David Talkin (1992). Lenition of /h/ and glottal stop. In Gerard J. Docherty & D. Robert Ladd (eds), Papers in Laboratory Phonology II: Gesture, Segment, Prosody (pp. 90–127). Cambridge: Cambridge University Press.
Pike, Kenneth L. (1945). The Intonation of American English. Ann Arbor, MI: The University of Michigan Press.
Pitrelli, John F., Mary E. Beckman, & Julia Hirschberg (1994). Evaluation of prosodic transcription labeling reliability in the ToBI framework. Proceedings of the 3rd International Conference on Spoken Language Processing (ICSLP 94), Yokohama (Vol. 1, pp. 123–126).
Poeppel, David, William J. Idsardi, & Virginie van Wassenhove (2008). Speech perception at the interface of neurobiology and linguistics. Philosophical Transactions: Biological Sciences, 363(1493), 1071–1086.
Pontryagin, L. S., V. G. Boltyanskii, R. V. Gamkrelidze, & E. F. Mishchenko (1962). The Mathematical Theory of Optimal Processes (Russian). English translation: Hoboken, NJ: Interscience.
Port, Robert & Penny Crawford (1989). Incomplete neutralization and pragmatics in German. Journal of Phonetics, 17, 257–282.
Pouplier, Marianne (2003). The dynamics of error. In Proceedings of the 15th International Congress of Phonetic Sciences (pp. 2245–2248). Barcelona.
Pouplier, Marianne (2007). Tongue kinematics during utterances elicited with the SLIP technique. Language and Speech, 50(3), 311–341.
Pouplier, Marianne & Louis Goldstein (2005). Asymmetries in the perception of speech production errors. Journal of Phonetics, 33, 47–75.
Pouplier, Marianne & Louis Goldstein (2010). Intention in articulation: Articulatory timing in alternating consonant sequences and its implications for models of speech production. Language and Cognitive Processes, 25(5), 616–649.
Praagman, M., E. J. K. Chadwick, F. C. T. Van der Helm, & H. E. J. Veeger (2006). The relationship between two different mechanical cost functions and muscle oxygen consumption. Journal of Biomechanics, 39, 758–765.
Prather, Elizabeth M., Dona Lee Hedrick, & Carolyn A. Kern (1975). Articulation development in children aged two to four years. Journal of Speech and Hearing Disorders, 40, 179–191.
Prince, Alan & Paul Smolensky (1993 [2004]). Optimality Theory: Constraint Interaction in Generative Grammar. In John J. McCarthy (ed.), Optimality Theory in Phonology: A Reader (pp. 3–71). Blackwell.
Purcell, David W. & Kevin G. Munhall (2006). Adaptive control of vowel formant frequency: Evidence from real-time formant manipulation. Journal of the Acoustical Society of America, 120(2), 966–977.
Rakerd, Brad, William Sennett, & Carol A. Fowler (1987). Domain-final lengthening and foot-level shortening in spoken English. Phonetica, 44, 147–155.
Rakitin, Brian C., John Gibbon, Trevor B. Penney, Chara Malapani, Sean C. Hinton, & Warren H. Meck (1998). Scalar expectancy theory and peak-interval timing in humans. Journal of Experimental Psychology: Animal Behavior Processes, 24(1), 15–33.
Rammsayer, Thomas H. (1997). Are there dissociable roles of the mesostriatal and mesolimbocortical dopamine systems on temporal information processing in humans? Neuropsychobiology, 35, 36–45.
Rammsayer, Thomas H. (1999). Neuropharmacological evidence for different timing mechanisms in humans. The Quarterly Journal of Experimental Psychology, 52B(3), 273–286.
Rammsayer, Thomas H. & Susan D. Lima (1991). Duration discrimination of filled and empty auditory intervals: Cognitive and perceptual factors. Perception & Psychophysics, 50(6), 565–574.
Rammsayer, Thomas H. & Stefan J. Troche (2014). Elucidating the internal structure of psychophysical timing performance in the sub-second and second range by utilizing confirmatory factor analysis. In Hugo Merchant & Victor de Lafuente (eds), Neurobiology of Interval Timing (pp. 33–47). New York: Springer.
Rand, Miya K., Linda M. Squire, & George E. Stelmach (2006). Effect of speed manipulation on the control of aperture closure during reach-to-grasp movements. Experimental Brain Research, 174, 74–85.
Rao, Stephen M., Andrew R. Mayer, & Deborah L. Harrington (2001). The evolution of brain activation during temporal processing. Nature Neuroscience, 4(3), 317–323.
Redi, Laura & Stefanie Shattuck-Hufnagel (2001). Variation in the realization of glottalization in normal speakers. Journal of Phonetics, 29(4), 407–429.
Reilly, Kevin J. & Kristie A. Spencer (2013). Speech serial control in healthy speakers and speakers with hypokinetic or ataxic dysarthria: effects of sequence length and practice. Frontiers in Human Neuroscience, 7, 665.
Remijsen, Bert & Leoma Gilley (2008). Why are three-level vowel length systems rare? Insights from Dinka (Luanyjang dialect). Journal of Phonetics, 36, 318–344.
Repp, Bruno H. (2008). Perfect phase correction in synchronization with slow auditory sequences. Journal of Motor Behavior, 40(5), 363–367.
Repp, Bruno H. & Susan R. Steinman (2010). Simultaneous event-based and emergent timing: Synchronization, continuation, and phase correction. Journal of Motor Behavior, 42(2), 111–126.
Richelle, Marc & Helga Lejeune (eds) (1980). Time in Animal Behaviour. Oxford: Pergamon Press.
Richmond, Korin (2009). Preliminary inversion mapping results with a new EMA corpus. In Interspeech: 10th Annual Conference of the International Speech Communication Association, September 6–10 (pp. 2835–2838). ISCA Archive, www.isca-speech.org/archive/interspeech_2009. Brighton, UK.
Richmond, Korin, Zhen-Hua Ling, Junichi Yamagishi, & Benigno Uría (2013). On the evaluation of inversion mapping performance in the acoustic domain. In Interspeech: 14th Annual Conference of the International Speech Communication Association, August 25–29, ed. F. Bimbot, C. Cerisara, C. Fougeron, G. Gravier, L. Lamel, F. Pellegrino, & P. Perrier (pp. 1012–1016). ISSN 2308-457X; ISCA Archive, http://www.isca-speech.org/archive/interspeech_2013. Lyon, France.
Rind, F. Claire & Peter J. Simmons (1999). Seeing what is coming: building collision-sensitive neurones. Trends in Neurosciences, 22, 215–220.
Roberts, Seth (1981). Isolation of an internal clock. Journal of Experimental Psychology: Animal Behavior Processes, 7(3), 242–268.
Robertson, Shannon D., Howard N. Zelaznik, Dawn A. Lantero, Kathryn Gadacz Bojczyk, Rebecca M. Spencer, Julie G. Doffin, & Tasha Schneidt (1999). Correlations for timing consistency among tapping and drawing tasks: Evidence against a single timing process for motor control. Journal of Experimental Psychology: Human Perception and Performance, 25(5), 1316–1330.
Rochet-Capellan, Amélie & Susanne Fuchs (2013). The interplay of linguistic structure and breathing in German spontaneous speech. Interspeech: 14th Annual Conference of the International Speech Communication Association, August 25–29, ed. F. Bimbot, C. Cerisara, C. Fougeron, G. Gravier, L. Lamel, F. Pellegrino, & P. Perrier (p. 1228). ISCA Archive, http://www.isca-speech.org/archive/interspeech_2013. Lyon, France.
Rosenbaum, David A. & Oren Patashnik (1980a). A mental clock setting process revealed by reaction times. In G. E. Stelmach & J. Requin (eds), Tutorials in Motor Behavior (pp. 487–499). Amsterdam: North-Holland.
Rosenbaum, David A. & Oren Patashnik (1980b). Time to time in the human motor system. In R. S. Nickerson (ed.), Attention and Performance (Vol. VIII, pp. 93–106). Hillsdale, NJ: Erlbaum.
Rosenbaum, David A., Caroline van Heugten, & Graham C. Caldwell (1996). From cognition to biomechanics and back: The end-state comfort effect and the middle-is-faster effect. Acta Psychologica, 94, 59–85.
Rosenbaum, David A., Rajal G. Cohen, Ruud G. J. Meulenbroek, & Jonathan Vaughan (2006). Plans for grasping objects. In Mark L. Latash & Francis Lestienne (eds), Motor Control and Learning (pp. 9–25). Boston, MA: Springer-Verlag.
352  Rosenbaum, David A., Ruud G. J. Meulenbroek, Jonathan Vaughan, & Chris Jansen (2001). Posture-based motion planning: Applications to grasping. Psychological Review, 108(4), 709–734. Rosenbaum, David A., Robert J. Weber, William M. Hazelett, & Van Hindorff (1986). The parameter remapping effect in human performance: Evidence from tongue twisters and finger fumblers. Journal of Memory and Language, 25, 710–725. Rosenbaum, David A., Kate M. Chapman, Matthias Weigelt, Daniel J. Weiss, & Robrecht van der Wel (2012). Cognition, action, and object manipulation. Psychological Bulletin, 138(5), 924–946. Rosenbaum, David A., Frank Marchak, Heather J. Barnes, Jonathan Vaughan, & James Slotta (1990). Constraints on action selection: Overhand versus underhand grips. In Marc Jeannerod (ed.), Attention and Performance XIII (pp. 321–342). Hillsdale, NJ: Lawrence Erlbaum Associates. Rosenbaum, David A., Jonathan Vaughan, Rudd G. J. Meulenbroek, Steven Jax, & Rajal G. Cohen (2008). Smart moves: The psychology of everyday perceptual-motor acts. In Ezequiel Morsella, John A. Bargh, & Peter M. Gollwitze (eds), Oxford Handbook of Human Action, (pp. 121–135). Oxford: Oxford University Press. Rusaw, Erin (2013a). Modeling temporal coordination in speech production using an artificial central pattern generator neural network. (PhD), University of Illinois at Urbana-Champaign. Rusaw, Erin (2013b). A neural oscillator model of speech timing and rhythm. Interspeech 2013, 607–611. Salidis, Joanna (2001). Nonconscious temporal cognition: learning rhythms implicitly. Memory & Cognition, 29, 1111–1119. Saling, Marian, Jay Alberts, George E. Stelmach, & James R. Bloedel (1998). Reach-to- grasp movements during obstacle avoidance. Experimental Brain Research, 118, 251–258. Saltzman, Elliot & Dani Byrd (2000). Task-dynamics of gestural timing: Phase windows and multifrequency rhythms. Human Movement Science, 19(4), 499–526. Saltzman, Elliot & J. A. Scott Kelso (1987). Skilled Actions—a Task-Dynamic Approach. Psychological Review, 94(1), 84–106. Saltzman, Elliot L. & Kevin G. Munhall (1989). A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1(4), 333–382. Saltzman, Elliot, Anders Löfqvist, & Subhobrata Mitra (2000). ‘Glue’ and ‘clocks’: intergestural cohesion and global timing. In Michael B. Broe & Janet B. Pierrehumbert (eds), Papers in Laboratory Phonology V (pp. 88–101). Cambridge: Cambridge University Press. Saltzman, Elliot, Hosung Nam, Louis Goldstein, & Dani Byrd (2006). The distinctions between state, parameter and graph dynamics in sensorimotor control and coordination. In M. L. Latash & F. Lestienne (eds), Motor Control and Learning (pp. 63–73). Boston, MA: Springer. Saltzman, Elliot, Hosung Nam, Jelena Krivokapić, & Louis Goldstein (2008). A taskdynamic toolkit for modeling the effects of prosodic structure on articulation. In Plínio A. Barbosa, Sandra Madureira, & Cesar Reis (eds), Proceedings of the Speech Prosody 2008 Conference (pp. 175–184). Campinas, Brazil. Saltzman, Elliot, Anders Löfqvist, Bruce Kay, Jeff Kinsella-Shaw, & Philip Rubin (1998). Dynamics of intergestural timing: a perturbation study of lip-larynx coordination. Experimental Brain Research, 123(4), 412–424. Savariaux, Christophe, Pascal Perrier, & Jean Pierrre Orliaguet (1995). Compensation strategies for the perturbation of the rounded vowel [u] using a lip tube: A study of

OUP CORRECTED PROOF – FINAL, 29/1/2020, SPi



353

the control space in speech production. Journal of the Acoustical Society of America, 95 (5), 2428–2442. Schmidt, Richard A. (1969). Movement time as a determiner of timing accuracy. Journal of Experimental Psychology, 79, 43–47. Schmidt, Richard A. (1988). Motor Control and Learning: A Behavioral Emphasis. 2nd edn. Champaign, IL: Human Kinetics. Schmidt, Richard A. & Tim D. Lee (2005). Motor Control and Learning: A Behavioral Emphasis 4th edn. Champaign, IL: Human Kinetics. Schmidt, Richard A., Herbert Heuer, Dina Ghodsian, & Douglas E. Young (1998). Generalized motor programs and units of action in bimanual coordination. In Mark L, Latash (ed.), Progress in Motor Control, Vol 1: Bernstein’s Traditions in Movement Studies (pp. 329–360). Champaign, IL: Human Kinetics. Schmidt, Richard A., Howard Zelaznik, Brian Hawkins, James S. Frank, & John T. Quinn (1979). Motor-output variability: A theory for the accuracy of rapid motor acts. Psychological Review, 86(5), 415–451. Scholz, John P. & Gregor Schöner (1999). The uncontrolled manifold concept: identifying control variables for a functional task. Experimental Brain Research, 126, 289–306. Schultz, Wolfram, Peter Dayan, & P. Read Montague (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599. Schulz, Geralyn M., Lee Stein, & Ryan Micallef (2001). Speech motor learning; Preliminary data. Clinical Linguistics & Phonetics, 15, 157–161. Scobbie, James M., Nigel Hewlett, & Alice E. Turk (1999). Standard English in Edinburgh and Glasgow: the Scottish Vowel Length Rule revealed. In Paul Foulkes & Gerry Docherty (eds), Urban Voices (pp. 230–245). London: Arnold. Scobbie, James M., Koen Segbregts, & Jane Stuart-Smith (2009). Dutch rhotic allophony, coda weakening, and the phonetics–phonology interface. QMU Speech Science Research Centre Working Paper. Scobbie, James M., Alice Turk, Christian Geng, Simon King, Robin Lickley, & Korin Richmond (2013). The Edinburgh Speech Production Facility Doubletalk corpus. INTERSPEECH 2013, 764–766. Scott, Stephan H. (2004). Optimal feedback control and the neural basis of volitional motor control. Nature Reviews Neuroscience, 5, 534–546. Sebregts, Koen (2015). The Sociophonetics and Phonology of Dutch r. (PhD), Utrecht University. Selkirk, Elisabeth O. (1978). On prosodic structure and its relation to syntactic structure. In T. Fretheim (ed.), Nordic Prosody II (pp. 111–140). Trondheim: TAPIR. Selkirk, Elisabeth O. (1984). Phonology and Syntax: the Relation between Sound and Structure. Cambridge, MA: MIT Press. Selkirk, Elisabeth O. (1995). Sentence prosody: Intonation, stress and phrasing. In J. A. Goldsmith (ed.), The Handbook of Phonological Theory (pp. 550–569). Oxford: Blackwell. Selkirk, Elisabeth O. (1996). The prosodic structure of function words. In J. Martin & K. Demuth (eds), Signal to Syntax: Bootstrapping from Speech to Grammar in Early Acquisition (pp. 187–213). Mahwah, NJ: Lawrence Erlbaum. Selkirk, Elisabeth O. (2011). The Syntax-Phonology Interface. In John Goldsmith, Jason Riggle, & Alan C. L. Yu (eds), The Handbook of Phonological Theory, Second Edition (pp. 435–484). Wiley-Blackwell.


Semjen, A. (1992). Determinants of timing in serial movements. In F. Macar, V. Pouthas, & W. J. Friedman (eds), Time, Action and Cognition, NATO ASI Series (Series D: Behavioural and Social Sciences) (Vol. 66, pp. 247–261). Dordrecht: Springer.
Shadmehr, Reza (2009). Computational approaches to motor control. Encyclopedia of Neuroscience, 3, 9–17.
Shadmehr, Reza & John W. Krakauer (2008). A computational neuroanatomy for motor control. Experimental Brain Research, 185(3), 359–381.
Shadmehr, Reza & Ferdinando A. Mussa-Ivaldi (1994). Adaptive representation of dynamics during learning of a motor task. Journal of Neuroscience, 14(5), 3208–3224.
Shadmehr, Reza & Sandro Mussa-Ivaldi (2012). Biological Learning and Control: How the Brain Builds Representations, Predicts Events, and Makes Decisions. Cambridge, MA: MIT Press.
Shadmehr, Reza & Steven P. Wise (2005). The Computational Neurobiology of Reaching and Pointing: A Foundation for Motor Learning. Cambridge, MA: MIT Press.
Shadmehr, Reza, Jean-Jacques Orban de Xivry, Minnan Xu-Wilson, & Ting-Yu Shih (2010). Temporal discounting of reward and the cost of time in motor control. Journal of Neuroscience, 30(31), 10507–10516.
Shaffer, L. Henry (1982). Rhythm and timing in skill. Psychological Review, 89(2), 109–122.
Shaiman, Susan (2001). Kinematics of compensatory vowel shortening: The effect of speaking rate and coda composition on intra- and inter-articulatory timing. Journal of Phonetics, 29, 89–107.
Shaiman, Susan (2002). Articulatory control of vowel length for contiguous jaw cycles: The effects of speaking rate and phonetic context. Journal of Speech, Language, and Hearing Research, 45, 663–675.
Shaiman, Susan & Vincent L. Gracco (2002). Task-specific sensorimotor interactions in speech production. Experimental Brain Research, 146, 411–418.
Shaiman, Susan, Scott G. Adams, & Mikael D. Z. Kimelman (1995). Timing relationships of the upper lip and jaw across changes in speaking rate. Journal of Phonetics, 23, 119–128.
Shankar, Sunita & Colin Ellard (2000). Visually guided locomotion and computation of time-to-collision in the Mongolian gerbil (Meriones unguiculatus): the effects of frontal and visual cortical lesions. Behavioural Brain Research, 108, 21–37.
Shannon, Claude E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Shattuck, Stefanie R. (1975). Speech errors and sentence production. (PhD), Massachusetts Institute of Technology.
Shattuck-Hufnagel, Stefanie (1992). The role of word structure in segmental serial ordering. Cognition, 42, 213–259.
Shattuck-Hufnagel, Stefanie (2011). The role of the syllable in speech production in American English: A fresh consideration of the evidence. In Charles E. Cairns & Eric Raimy (eds), The Handbook of the Syllable (pp. 195–224). Leiden: Brill.
Shattuck-Hufnagel, Stefanie (2014). Phrase-level phonological and phonetic phenomena. In M. Goldrick, V. S. Ferreira, & M. Miozzo (eds), The Oxford Handbook of Language Production (pp. 259–274). Oxford: Oxford University Press.
Shattuck-Hufnagel, Stefanie (2015). Prosodic frames in speech production. In M. A. Redford (ed.), The Handbook of Speech Production (pp. 419–445). Chichester: John Wiley & Sons.
Shattuck-Hufnagel, Stefanie & Dennis H. Klatt (1979). The limited use of distinctive features and markedness in speech production: evidence from speech error data. Journal of Verbal Learning and Verbal Behavior, 18, 41–55.


Shattuck-Hufnagel, Stefanie & Alice Turk (1996). A prosody tutorial for investigators of auditory sentence processing. Journal of Psycholinguistic Research, 25(2), 193–247.
Shattuck-Hufnagel, Stefanie & Alice Turk (2011). Durational evidence for word-based vs. prominence-based constituent structure in limerick speech. Proceedings of the 17th International Congress of Phonetic Sciences. Hong Kong.
Shattuck-Hufnagel, Stefanie, Mari Ostendorf, & Ken Ross (1994). Stress shift and early pitch accent placement in lexical items in American English. Journal of Phonetics, 22, 357–388.
Shattuck-Hufnagel, Stefanie, Katherine Demuth, Helen Hanson, & Kenneth N. Stevens (2011). Acoustic cues to stop-coda voicing contrasts in the speech of 2–3 year olds learning American English. In G. Nick Clements & Rachid Ridouane (eds), Where do phonological features come from? (pp. 327–341). Amsterdam: John Benjamins.
Shattuck-Hufnagel, Stefanie, Cathy Bai, Mark Tiede, Argyro Katsikis, Marianne Pouplier, & Louis Goldstein (2013). A comparison of speech errors elicited by sentences and alternating repetitive tongue twisters. Journal of the Acoustical Society of America, 134(5, pt. 2), 4166.
Sherwood, David E., Richard A. Schmidt, & Charles B. Walter (1988). The force/force-variability relationship under controlled temporal conditions. Journal of Motor Behavior, 20, 106–116.
Shin, Jacqueline C. & Richard B. Ivry (2002). Concurrent learning of temporal and spatial sequences. Journal of Experimental Psychology: Learning, Memory and Cognition, 28, 445–457.
Shouval, Harel Z., Marshall G. Hussain Shuler, Animesh Agarwal, & Jeffrey P. Gavornik (2014). What does scalar timing tell us about neural dynamics? Frontiers in Human Neuroscience, 8, 438.
Šimko, Juraj (2009). The embodied modelling of gestural sequencing in speech. (PhD), University College Dublin.
Šimko, Juraj & Fred Cummins (2010). Embodied Task Dynamics. Psychological Review, 117(4), 1229–1246.
Šimko, Juraj & Fred Cummins (2011). Sequencing and optimization within an embodied Task Dynamic model. Cognitive Science, 35, 527–562.
Šimko, Juraj, Michael O’Dell, & Martti Vainio (2014). Emergent consonantal quantity contrast and context-dependence of gestural phasing. Journal of Phonetics, 44, 130–151.
Simon, Herbert A. (1955). A behavioral model of rational choice. Quarterly Journal of Economics, 69(1), 99–118.
Sinkjaer, Thomas, Jacob B. Andersen, & Birgit Larsen (1996). Soleus stretch reflex modulation during gait in humans. Journal of Neurophysiology, 76, 1112–1120.
Slifka, Janet (2006). Some physiological correlates to regular and irregular phonation at the end of an utterance. Journal of Voice, 20(2), 171–186.
Sluijter, Agaath M. C. & Vincent J. van Heuven (1995). Effects of focus distribution, pitch accent and lexical stress on the temporal organization of syllables in Dutch. Phonetica, 52, 71–89.
Sluijter, Agaath M. C. & Vincent J. van Heuven (1996). Spectral balance as an acoustic correlate of linguistic stress. Journal of the Acoustical Society of America, 100, 2471–2485.
Smith, Bruce L. (2002). Effects of speaking rate on temporal patterns of English. Phonetica, 59, 232–244.
Smolensky, Paul, Matthew Goldrick, & Donald Mathis (2014). Optimization and quantization in gradient symbol systems: A framework for integrating the continuous and the discrete in cognition. Cognitive Science, 38, 1102–1138.


Sonoda, Yorinobu & Kazuto Nakakido (1986). Effect of speaking rate on jaw movements in vowel sequences. Journal of the Acoustical Society of Japan (E), 7(1), 5–12.
Sorensen, Tanner & Adamantios Gafos (2016). The gesture as an autonomous nonlinear dynamical system. Ecological Psychology, 28(4), 188–215.
Spencer, Rebecca M. C. & Richard B. Ivry (2013). Cerebellum and timing. In Mario Manto, Donna L. Gruol, Jeremy D. Schmahmann, N. Koibuchi, & F. Rossi (eds), Handbook of the Cerebellum and Cerebellar Disorders (pp. 1201–1219). Dordrecht: Springer Science+Business Media.
Spencer, Rebecca M. C. & Howard N. Zelaznik (2003). Weber (slope) analyses of timing variability in tapping and drawing tasks. Journal of Motor Behavior, 35(4), 371–381.
Spencer, Rebecca M. C., Uma Karmarkar, & Richard B. Ivry (2009). Evaluating dedicated and intrinsic models of temporal encoding by varying context. Philosophical Transactions of the Royal Society B, 364, 1853–1863.
Sperry, Elizabeth E. & Richard J. Klich (1992). Speech breathing in senescent and younger women during oral reading. Journal of Speech and Hearing Research, 35, 1246–1255.
Stevens, Kenneth N. (1998). Acoustic Phonetics. Cambridge, MA: MIT Press.
Stevens, Kenneth N. (2002). Toward a model for lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America, 111(4), 1873–1891.
Stevens, Kenneth N. (2005). Features in speech perception and lexical access. In David Pisoni & Robert Remez (eds), The Handbook of Speech Perception (pp. 125–155). Wiley-Blackwell.
Stevens, Kenneth N. & Morris Halle (1967). Remarks on analysis by synthesis and distinctive features. In Welant Wathen-Dunn (ed.), Models for the Perception of Speech and Visual Form: Proceedings of a Symposium (pp. 88–102). Cambridge, MA: MIT Press.
Strogatz, Steven H. (1994). Non-Linear Dynamics and Chaos with Applications to Physics, Biology, Chemistry, and Engineering. Reading, MA: Addison-Wesley.
Studenka, Breanna E., Howard N. Zelaznik, & Ramesh Balasubramaniam (2013). The distinction between tapping and circle drawing with and without tactile feedback: An examination of the sources of timing variance. The Quarterly Journal of Experimental Psychology, 65(6), 1086–1100.
Suchato, Atiwong (2004). Classification of stop consonant place of articulation. (PhD), Massachusetts Institute of Technology.
Summers, W. Van (1987). Effects of stress and final-consonant voicing on vowel production: articulatory and acoustic analyses. Journal of the Acoustical Society of America, 82(3), 847–863.
Sun, Hongjin & Barrie J. Frost (1998). Computation of different optical variables of looming objects in pigeon nucleus rotundus neurons. Nature Neuroscience, 1, 296–303.
Suomi, Kari, Juhani Toivanen, & Riikka Ylitalo (2008). Finnish sound structure. Studia Humaniora Ouluensia, 9. Oulu: Oulu University Press.
Sutton, Richard S. & Andrew G. Barto (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Svirsky, Mario A., Harlan Lane, Joseph S. Perkell, & Jane Wozniak (1992). Effects of short-term auditory deprivation on speech production in adult cochlear implant users. Journal of the Acoustical Society of America, 92(3), 1284–1300.
Swadesh, Morris (1934). The phonemic principle. Language, 10(2), 117–129.
Szentesi, P., R. Zaremba, W. van Mechelen, & G. J. M. Stienen (2001). ATP utilization for calcium uptake and force production in different types of human skeletal muscle fibres. Journal of Physiology, 531, 393–403.


Tabain, Marija & Pascal Perrier (2005). Articulation and acoustics of /i/ in preboundary position in French. Journal of Phonetics, 33, 77–100.
Tabain, Marija & Pascal Perrier (2007). An articulatory and acoustic study of /u/ in preboundary position in French: The interaction of compensatory articulation, neutralization avoidance and featural enhancement. Journal of Phonetics, 35, 135–161.
Takikawa, Yoriko, Reiko Kawagoe, Hideaki Itoh, Hiroyuki Nakahara, & Okihide Hikosaka (2002). Modulation of saccadic eye movements by predicted reward outcome. Experimental Brain Research, 142, 284–291.
Tan, Heng-Ru May, Arthur C. Leuthold, David N. Lee, J. K. Lynch, & A. P. Georgopoulos (2009). Neural mechanisms of movement speed and tau as revealed by magnetoencephalography. Experimental Brain Research, 195, 541–552.
Tanaka, Hirokazu, John W. Krakauer, & Ning Qian (2006). An optimization principle for determining movement duration. Journal of Neurophysiology, 95, 3875–3886.
Tanaka, Hirokazu, Meihua Tai, & Ning Qian (2004). Different predictions by the minimum variance and minimum torque-change models on the skewness of movement velocity profiles. Neural Computation, 16, 2021–2040.
Tanaka, Hiroko (2004). Prosody for marking transition-relevance places in Japanese conversation: The case of turns unmarked by utterance-final objects. In Elizabeth Couper-Kuhlen & Cecilia E. Ford (eds), Sound Patterns in Interaction: Cross-linguistic Studies From Conversation (pp. 63–96). Amsterdam: John Benjamins.
Tanaka, Masaki (2007). Cognitive signals in the primate motor thalamus predict saccade timing. Journal of Neuroscience, 27(44), 12109–12118.
Tasko, Stephen M. & John R. Westbury (2002). Defining and measuring speech movement events. Journal of Speech, Language, and Hearing Research, 45, 127–142.
Teki, Sundeep, Manon Grube, & Timothy D. Griffiths (2012). A unified model of time perception accounts for duration-based and beat-based timing mechanisms. Frontiers in Integrative Neuroscience, 5(Article 90), 1–7.
Teki, Sundeep, Manon Grube, Sukhbinder Kumar, & Timothy D. Griffiths (2011). Distinct neural substrates of duration-based and beat-based auditory timing. Journal of Neuroscience, 31(10), 3805–3812.
Tiede, Mark K., Suzanne E. Boyce, Carol Espy-Wilson, & Vincent Gracco (2010). Variability in North American English /r/ production in response to palatal perturbation. In Ben Maassen & Pascal van Lieshout (eds), Speech Motor Control: New Developments in Basic and Applied Research (pp. 53–67). Oxford: Oxford University Press.
Tilsen, Sam (2013). A dynamical model of hierarchical selection and coordination in speech planning. PLoS ONE, 8(4), e62800.
Tilsen, Sam (2018). Three mechanisms for modeling articulation: selection, coordination, and intention. Cornell University Working Papers in Phonetics and Phonology.
Toda, Tomoki, Alan W. Black, & Keiichi Tokuda (2008). Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model. Speech Communication, 50(3), 215–227.
Todorov, Emanuel (2004). Optimality principles in sensorimotor control. Nature Neuroscience, 7(9), 907–915.
Todorov, Emanuel (2005). Stochastic optimal control and estimation methods adapted to the noise characteristics of the sensorimotor system. Neural Computation, 17, 1084–1108.
Todorov, Emanuel (2007). Optimal control theory. In Kenji Doya, Shin Ishii, Alexandre Pouget, & Rajesh P. N. Rao (eds), Bayesian Brain: Probabilistic Approaches to Neural Coding (pp. 269–298). Cambridge, MA: MIT Press.


Todorov, Emanuel (2009). Efficient computation of optimal actions. Proceedings of the National Academy of Sciences, 106(28), 11478–11483.
Todorov, Emanuel & Michael I. Jordan (2002). Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5(11), 1226–1235.
Todorov, Emanuel & Michael I. Jordan (2003). A minimal intervention principle for coordinated movement. In Becker et al. (eds), Advances in Neural Information Processing Systems (Vol. 15, pp. 27–34). Cambridge, MA: MIT Press.
Todorov, Emanuel, Weiwei Li, & Xiuchan Pan (2005). From task parameters to motor synergies: A hierarchical framework for approximately optimal control of redundant manipulators. Journal of Robotic Systems, 22(11), 691–710.
Torres, Elizabeth & Richard Andersen (2006). Space-time separation during obstacle avoidance learning in monkeys. Journal of Neurophysiology, 96, 2613–2632.
Toscano, Joseph C. & Bob McMurray (2010). Cue integration with categories: Weighting acoustic cues in speech using unsupervised learning and distributional statistics. Cognitive Science, 34(3), 434–464.
Touzet, Claude F., Pierrick Demoulin, Boris Burle, Franck Vidal, & Françoise Macar (2005). Neuromimetic model of interval timing. Proceedings of the 13th European Symposium on Artificial Neural Networks, 393–398. Bruges, Belgium.
Trager, George L. & Henry L. Smith (1951 [1957]). An Outline of English Structure. Norman, OK: Battenburg Press. Reprinted 1957 by American Council of Learned Societies, Washington, DC.
Treisman, Michel (1963). Temporal discrimination and the indifference interval: Implications for a model of the internal clock. Psychological Monographs, 77(13), 1–31.
Tremblay, Stéphanie, Douglas M. Shiller, & David J. Ostry (2003). Somatosensory basis of speech production. Nature, 423, 866–869.
Trouvain, Jürgen (1999). Phonological aspects of reading rate strategies. Phonus 4, Research Report Phonetics Saarbrücken, 15–35.
Trouvain, Jürgen & Martine Grice (1999). The effect of tempo on prosodic structure. Paper presented at the 14th International Congress of Phonetic Sciences (ICPhS), San Francisco, CA.
Tuller, Betty & J. A. Scott Kelso (1990). Phase transitions in speech production and their perceptual consequences. In Marc Jeannerod (ed.), Attention and Performance XIII (pp. 429–452). Hillsdale, NJ: Erlbaum.
Turk, Alice (1994). Articulatory phonetic clues to syllable affiliation: gestural characteristics of bilabial stops. In P. A. Keating (ed.), Phonological Structure and Phonetic Form: Papers in Laboratory Phonology III (pp. 107–135). Cambridge: Cambridge University Press.
Turk, Alice (2010). Does prosodic constituency signal relative predictability? A Smooth Signal Redundancy hypothesis. Journal of Laboratory Phonology, 1, 227–262.
Turk, Alice (2012). The temporal implementation of prosodic structure. In Abigail C. Cohn, Cécile Fougeron, & Marie K. Huffman (eds), The Oxford Handbook of Laboratory Phonology (pp. 242–253). Oxford: Oxford University Press.
Turk, Alice E. & James R. Sawusch (1997). The domain of accentual lengthening in American English. Journal of Phonetics, 25, 25–41.
Turk, Alice E. & Stefanie Shattuck-Hufnagel (2000). Word-boundary-related duration patterns in English. Journal of Phonetics, 28, 397–440.
Turk, Alice E. & Stefanie Shattuck-Hufnagel (2007). Multiple targets of phrase-final lengthening in American English words. Journal of Phonetics, 35(4), 445–472.


Turk, Alice & Stefanie Shattuck-Hufnagel (2013). What is speech rhythm? A commentary inspired by Arvaniti & Rodriquez, Krivokapić, and Goswami & Leong. Laboratory Phonology, 4(1), 93–118.
Turk, Alice & Stefanie Shattuck-Hufnagel (2014). Timing in talking: what is it used for, and how is it controlled? Philosophical Transactions of the Royal Society B, 369, 20130395.
Turk, Alice E. & Laurence White (1999). Structural influences on accentual lengthening in English. Journal of Phonetics, 27, 171–206.
Turvey, M. T. (1977). Preliminaries to a theory of action with reference to vision. Perceiving, Acting and Knowing, 2, 211–265.
Uldall, Elizabeth T. (1971). Isochronous stresses in R.P. In Louis L. Hammerich, Roman Jakobson, & Eberhard Zwirner (eds), Form and Substance: Phonetic and Linguistic Papers Presented to Eli Fischer-Jorgensen (pp. 205–210). Copenhagen: Akademisk Forlag.
Uldall, Elizabeth T. (1972). Relative durations of syllables in two-syllable rhythmic feet in R.P. in connected speech. Department of Linguistics, Edinburgh University, Work in Progress #5.
Ullén, Fredrik & Sara L. Bengtsson (2003). Independent processing of the temporal and ordinal structure of movement sequences. Journal of Neurophysiology, 90, 3725–3735.
Uno, Yoji, Mitsuo Kawato, & Rika Suzuki (1989). Formation and control of optimal trajectory in human multijoint arm movement. Biological Cybernetics, 61, 89–101.
Uther, M., M. A. Knoll, & D. Burnham (2007). Do you speak E-NG-L-I-SH? A comparison of foreigner- and infant-directed speech. Speech Communication, 49, 2–7.
Vaissière, Jacqueline (1983). Language independent prosodic features. In Anne Cutler & D. Robert Ladd (eds), Prosody: Models and Measurements (pp. 53–65). Berlin: Springer.
Vallduví, Enric (1991). The role of plasticity in the association of focus and prominence. Proceedings of the Eastern States Conference on Linguistics (ESCOL), 7, 295–306.
van Bezooijen, Renée (2005). Approximant /r/ in Dutch: Routes and feelings. Speech Communication, 47, 15–31.
van der Wel, Robrecht P. R. D., Robin M. Fleckenstein, Steven A. Jax, & David A. Rosenbaum (2007). Hand path priming in manual obstacle avoidance: Evidence for abstract spatiotemporal forms in human motor control. Journal of Experimental Psychology: Human Perception and Performance, 33(5), 1117–1126.
van Heuven, Vincent J. J. P. & Agaath M. C. Sluijter (1996). Notes on the phonetics of word prosody. In Rob Goedemans, Harry van der Hulst, & Ellis Visch (eds), Stress Patterns of the World, Part 1: Background (HIL Publications) (pp. 233–269). The Hague: Holland Academic Graphics.
van Lancker, Diana, Jody Kreiman, & Dwight Bolinger (1987). Anticipatory lengthening. Journal of Phonetics, 61, 339–347.
van Santen, Jan P. H. (1992). Contextual effects on vowel duration. Speech Communication, 11, 513–546.
van Santen, Jan P. H. (1994). Assignment of segmental duration in text-to-speech synthesis. Computer Speech and Language, 8, 95–128.
van Santen, Jan P. H. & Chilin Shih (2000). Suprasegmental and segmental timing models in Mandarin Chinese and American English. Journal of the Acoustical Society of America, 107(2), 1012–1026.
Villacorta, Virgillio M., Joseph S. Perkell, & Frank H. Guenther (2007). Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception. Journal of the Acoustical Society of America, 122(4), 2306–2319.


Waals, Juliette Adriana Johanna Simone (1999). An experimental view of the Dutch syllable. Amsterdam: Netherlands Graduate School of Linguistics (Landelijke Onderzoekschool Taalwetenschap).
Wagner, Michael (2005). Prosody and Recursion. (PhD), Massachusetts Institute of Technology.
Wang, Jinsung & George E. Stelmach (1998). Coordination among body segments during reach-to-grasp action involving the trunk. Experimental Brain Research, 123, 346–350.
Wang, Jinsung & George E. Stelmach (2001). Spatial and temporal control of trunk-assisted prehensile actions. Experimental Brain Research, 123, 346–350.
Watson, Duane, Mara Breen, & Edward Gibson (2006). The role of syntactic obligatoriness in the production of intonational boundaries. Journal of Experimental Psychology: Learning, Memory and Cognition, 32, 1045–1056.
Wearden, J. H. (1991). Do humans possess an internal clock with scalar timing properties? Learning and Motivation, 22, 59–93.
Wearden, J. H. (1999). ‘Beyond the fields we know . . .’: exploring and developing scalar timing theory. Behavioural Processes, 45, 3–21.
Wearden, John H. (2013). The cognitive neuroscience of time perception: How psychological studies might help to dissect the timing system. Neuropsychologia, 51, 187–190.
Whalen, D. H. (1990). Coarticulation is largely planned. Journal of Phonetics, 18, 3–35.
White, Laurence (2002). English speech timing: a domain and locus approach. (PhD), University of Edinburgh.
White, Laurence (2014). Communicative function and prosodic form in speech timing. Speech Communication, 63–64, 38–54.
White, Laurence & Alice E. Turk (2010). English words on the Procrustean bed: polysyllabic shortening reconsidered. Journal of Phonetics, 38, 459–471.
Whorf, B. (1938). Language: plan and conception of arrangement. Unpublished ms, Yale University. Reprinted in J. B. Carroll (ed.) (1956), Language, Thought and Reality: Selected Writings of Benjamin Lee Whorf (pp. 125–133). Cambridge, MA: MIT Press.
Wiegner, Allen W. & Margaret M. Wierzbicka (1992). Kinematic models and human elbow flexion movements: Quantitative analysis. Experimental Brain Research, 88, 665–673.
Wiener, Martin, Falk W. Lohoff, & H. Branch Coslett (2011). Double dissociation of dopamine genes and timing in humans. Journal of Cognitive Neuroscience, 23(10), 2811–2821.
Wightman, Colin W., Stefanie Shattuck-Hufnagel, Mari Ostendorf, & Patti J. Price (1992). Segmental durations in the vicinity of prosodic phrase boundaries. Journal of the Acoustical Society of America, 91(3), 1707–1717.
Wiik, Kalevi (1965). Finnish and English vowels: A comparison with special reference to the learning problems met by native speakers of Finnish learning English. (PhD), University of Turku. Turun Yliopiston Julkaisuja. Annales Universitatis Turkuensis, Series B, 94.
Wilhelms-Tricarico, Reiner (2015). A multi-language speech synthesizer based on syllables as the functional units of speech. Journal of the Phonetic Society of Japan, 19(2), 86–99.
Williams, Briony & Steven M. Hiller (1994). The question of randomness in English foot timing: a control experiment. Journal of Phonetics, 22, 423–439.
Windmann, Andreas (2016). Optimization-Based Modeling of Suprasegmental Speech Timing. (PhD), Universität Bielefeld.
Windmann, Andreas, Juraj Šimko, & Petra Wagner (2015a). Polysyllabic shortening and word-final lengthening in English. Proceedings of Interspeech 2015, 36–40.
Windmann, Andreas, Juraj Šimko, & Petra Wagner (2015b). What do regression analyses of inter-stress interval duration really measure? In The Scottish Consortium for ICPhS 2015 (ed.), Proceedings of the 18th International Congress of Phonetic Sciences. Glasgow: The University of Glasgow.
Wing, Alan M. (1980). The long and short of timing in response sequences. In G. E. Stelmach & J. Requin (eds), Tutorials in Motor Behavior (pp. 469–484). Amsterdam: North Holland.
Winkworth, Alison L., Pamela J. Davis, Elizabeth Ellis, & Roger D. Adams (1994). Variability and consistency in speech breathing during reading: Lung volumes, speech intensity, and linguistic factors. Journal of Speech and Hearing Research, 37, 535–556.
Winter, Bodo & Sven Grawunder (2012). The phonetic profile of Korean formal and informal speech registers. Journal of Phonetics, 40, 808–815.
Winter, David A. (1984). Kinematic and kinetic patterns in human gait: variability and compensating effects. Human Movement Science, 3, 51–76.
Winters, Jack M. & Laurence Stark (1985). Analysis of fundamental human movement patterns through the use of in-depth antagonistic muscle models. IEEE Transactions on Biomedical Engineering, 32, 826–839.
Wittmann, Marc (2013). The inner sense of time: how the brain creates a representation of duration. Nature Reviews Neuroscience, 14, 217–223.
Woodworth, Robert Sessions (1899). The accuracy of voluntary movement. Psychological Monograph Supplements, 3(2), i–114.
Worringham, Charles J. (1991). Variability effects on the internal structure of rapid aiming movements. Journal of Motor Behavior, 23(1), 75–85.
Xia, Ruiping, Brian M. H. Bush, & Gregory M. Karst (2005). Phase-dependent and task-dependent modulation of stretch reflexes during rhythmic hand tasks in humans. Journal of Physiology, 564, 941–951.
Xu-Wilson, Minnan, David S. Zee, & Reza Shadmehr (2009). The intrinsic value of visual information affects saccade velocities. Experimental Brain Research, 196, 475–481.
Yamazaki, Tadashi & Shigeru Tanaka (2007). The cerebellum as a liquid state machine. Neural Networks, 20, 290–297.
Yang, J.-F. & J. P. Scholz (2005). Learning a throwing task is associated with differential changes in the use of motor abundance. Experimental Brain Research, 163, 137–158.
Yu, Hong, Dagmar Sternad, Daniel M. Corcos, & David E. Vaillancourt (2007). Role of hyperactive cerebellum and motor cortex in Parkinson’s disease. NeuroImage, 35, 222–233.
Zago, Myrka, Joseph McIntyre, Patrice Senot, & Francesco Lacquaniti (2008). Internal models and prediction of visual gravitational motion. Vision Research, 48, 1532–1538.
Zago, Myrka, Gianfranco Bosco, Vincenzo Maffei, Marci Iosa, Yuri P. Ivanenko, & Francesco Lacquaniti (2004). Internal models of target motion: expected dynamics overrides measured kinematics in timing manual interceptions. Journal of Neurophysiology, 91, 1620–1634.
Zajac, Felix E. (1989). Muscle and tendon: properties, models, scaling and application to biomechanics and motor control. Critical Reviews in Biomedical Engineering, 17, 359–411.
Zakay, Dan (1989). Subjective time and attentional resource allocation: An integrated model of time estimation. In Iris Levin & Dan Zakay (eds), Time and Human Cognition: A Life-Span Perspective (pp. 365–397). Amsterdam: Elsevier Science Publishers B.V.
Zelaznik, Howard & David A. Rosenbaum (2010). Timing processes are correlated when tasks share a salient event. Journal of Experimental Psychology: Human Perception and Performance, 36(6), 1565–1575.


Zelaznik, Howard, Richard A. Schmidt, & Stan C. A. M. Gielen (1986). Kinematic properties of rapid aimed hand movements. Journal of Motor Behavior, 18(4), 353–372.
Zelaznik, Howard, Rebecca Spencer, & Julie G. Doffin (2000). Temporal precision in tapping and circle drawing movements at preferred rates is not correlated: Further evidence against timing as a general-purpose ability. Journal of Motor Behavior, 32(2), 193–199.
Zelaznik, Howard, Rebecca Spencer, & Richard B. Ivry (2002). Dissociation of explicit and implicit timing in repetitive tapping and drawing movements. Journal of Experimental Psychology: Human Perception and Performance, 28, 575–588.
Zsiga, Elizabeth C. (1997). Features, gestures, and Igbo vowels: An approach to the phonology–phonetics interface. Language, 73(2), 227–274.
Zsiga, Elizabeth C. (2000). Phonetic alignment constraints: consonant overlap and palatalization in English and Russian. Journal of Phonetics, 28, 69–102.


Index

2/3rds power law 205
Abercrombian foot 29, 43, 133, 135, 137, 139–42, 240
absolute time 25, 34–6, 86, 103–4, 106, 108–9, 112–17, 119, 138, 206, 208, 234–5, 241, 254, 303
abstract non-symbolic phonological representations 10, 40, 65, 163–4, 287, 303
abstract symbolic phonological representations 1–2, 4, 7, 50, 63, 66, 73, 90, 112, 135, 148–9, 151–3, 158–9, 164–5, 167, 169, 171, 173, 178–9, 184, 253, 264, 266, 268, 289, 291, 297, 312–15
abstract temporal pattern 253–4
accumulator 245–6
accuracy 50–1, 53–6, 59–60, 62, 66, 68–9, 73, 95, 101, 115, 124–5, 129–31, 135, 147, 155, 158–62, 178, 180–1, 187, 195, 197–8, 200–1, 204–9, 211–12, 214, 249, 252, 262–3, 268, 300, 302, 304, 308, 314–15
accuracy cost(s) 204–11, 214, 304
acoustic cue 44, 148–9, 151, 159, 168, 177–8, 188, 265, 267–9, 293–4, 298, 314
acoustic interval 94, 146, 180, 240–1
acoustic landmark 6, 57, 60–1, 63, 109, 117, 126, 135, 151, 240, 244, 265, 294, 296–300, 302–6, 308, 316–19
antipsychotic medication 247
architecture of the speech planning system 1–2, 49–51, 55, 57, 60, 62, 89, 144, 148–50, 157, 217, 264, 314–15
Articulatory Phonology 1–6, 8–48 (AP/TD throughout)
atemporal phonological representations (see also ‘abstract symbolic phonological representations’) 1–2, 90, 135, 158, 173, 244

auditory cortex 95, 196, 250, 305
auditory interval 249
auditory modality (see also ‘visual modality’) 70, 92, 242, 249, 250
basal ganglia 95–8, 241, 244, 250–3
beat adjustment 253
beat prediction 253
beat-based (see also ‘periodicity-based’) 239, 251–6
boundary strength 93–4, 132, 270, 278, 293–4
boundary-related lengthening (see also ‘final lengthening’ and ‘initial lengthening’) 33, 37, 64, 105, 137–8
braking 67, 76
cerebellum 95–6, 241, 244–6, 248, 250–3
chunking 247
circadian 42, 239, 241
clock-counter model 245
coarticulation (see also ‘gestural overlap’) 1, 6, 19, 43, 71–2, 86, 130, 153, 155, 184, 221–3, 301, 305, 318
coefficient of variation 90, 91, 93, 94
coincidence detector 239
collision avoidance 96, 257
conditioned response paradigm 79
constriction degree 12, 14–15, 17–19
constriction location 13–14, 18
context-governed 5, 40, 65, 148, 314
context-governed phonetic variation 5, 40, 65, 148, 314
context-independent movement targets 267
context-specific (see also ‘context-governed phonetic variation’) 11, 148, 163, 166, 265, 267, 294, 312
contextual factors 49, 163, 317
continuation paradigm/task 92, 96, 97, 120, 252


continuous, repetitive movement 243
continuous stimulus (see also ‘filled interval’) 249
contrastive short and long vowels 80–5, 88–9, 94, 134, 147, 155, 181, 190, 225–7, 298, 315
control policy 194–5, 197–9, 218–19, 237, 265–6
control signal 56, 201, 205–7, 218, 237, 266
coordination 4, 6, 14–18, 20, 24, 28–9, 35, 39, 41, 43, 46, 48–50, 57, 64, 70, 73, 102–31, 135, 144, 147, 150, 172, 175, 218, 237–8, 251, 256, 261–4, 266–8, 290, 300, 307–8, 310, 314–16, 318–19
coordinative structure (see also ‘synergy’) 10, 14–15, 45, 60, 163, 186–7, 193, 214, 216, 266, 299, 311
cost function 193–5, 197–9, 202, 211, 213–14, 219–20, 222–3, 226, 230, 233, 307
cost of accuracy 204–11, 214, 304
cost(s) of movement 13, 129–30, 193–5, 197–201, 213–14, 223, 225, 235, 301, 316
cost of planning 219
cost of reprogramming 212
cost of time 209–11, 226–8, 230, 233–4, 237, 302–5, 308
cross-word foot 29, 43, 133, 135, 137, 139–42, 240
curvature 203, 205
CV syllable, coordination patterns in 25, 28, 37, 57, 58, 102, 107, 108, 111, 121, 123, 291
CV syllable, frequency of 46, 106, 109, 110
damping 12–13, 21, 38, 40, 68, 228
dedicated timekeeping mechanism 244–6, 249–50
default adjustment 20, 38, 41–2, 48–51, 62, 133–5, 144
delay line model 239, 245
desired trajectory 215
Dinka 81, 83, 94, 351
discrimination task 96, 99, 242–3, 245, 248, 251–4
distance, accuracy, and duration, relationships among 50, 53–6, 62, 114, 181–2, 205, 208–9, 257, 304–5, 314, 318
distance-to-target 24, 257, 267

distributed timing network 244
DIVA 147, 156–7, 195, 266, 310
dopaminergic antagonist 247
duration range 239–41
durational target 226–7, 229, 234, 236, 240, 303
dynamic cost 200–4
dynamic vs. kinematic cost(s) 201–4
Ecological Psychology 77, 257
economy of effort 153, 221, 224
efference copy 160, 187, 191, 196, 236, 266, 310
effort cost 193, 197, 201, 207–8, 221, 228, 230, 234, 237, 302
emergent surface characteristic 38–9
emergent timing 48, 65, 70, 87, 146, 328
end-state (dis-)comfort 161, 205, 207
endpoint accuracy 53–6, 70, 73, 129–30, 133, 160–1, 181, 192, 194, 197–8, 200–1, 204–9, 211, 214–15, 236–7, 302, 304, 317
endpoint-based movement coordination 73, 105, 120–1, 125–8, 130–1, 135, 144, 238, 261–3, 267–8, 300, 308
endpoint variance 53–6, 70, 73, 129–30, 133, 160–1, 181, 192, 194, 197–8, 200–1, 204–9, 211, 214–15, 236–7, 302, 304, 317
energy cost 210
entrainment 25, 32, 37, 46, 102, 105, 111, 119–21, 254, 308, 318
equivalence of strategies 40, 85, 87–9, 134
Estonian 271
extrinsic timing 2–4, 7, 9, 41, 47, 50, 63–4, 100–1, 135, 144, 147, 156, 158, 174, 190, 226, 232, 235, 238, 244, 255, 264, 266, 300, 312–14, 316–17, 320
extrinsic-timing-based three-component model (see also ‘phonology-extrinsic-timing-based three-component model(s)’) 53, 62, 63, 135
Extrinsic-Timing-Based Three-Component model-version 1 (XT/3C-v1) 188, 262, 265, 267, 279, 282, 289, 292, 300, 303, 307–9, 312
faithfulness constraint 221
feature bundle 43


filled interval (see also ‘continuous stimulus’) 247–8
final lengthening (see also ‘boundary-related lengthening’ and ‘lengthening’) 33, 35, 43, 46, 52, 80–3, 85, 88–9, 94, 132–4, 293, 303
Finnish 81, 83, 228
Fitts’ law 50, 53–6, 62, 114, 181–2, 205, 208–9, 257, 304–5, 314, 318
flexible periodicity 251
foot 3, 12, 16, 29–32, 36–7, 43, 53, 58, 61, 102, 133, 135–7, 139–42, 175, 240, 270
foot oscillator 31, 37, 135–44
force field 116, 204
foreigner-directed speech 52
formal vs. informal speech styles 52
forward model 196–8
gap closure (see also ‘General Tau Theory’) 77, 128, 150, 158, 171, 177, 256–8, 261, 267, 308
general-purpose timekeeping 4, 6, 38, 42, 48, 63, 65, 67, 76, 90, 100–1, 148, 151–2, 173, 189, 233, 235–9, 242, 244, 253, 255–6, 262, 264–5, 267, 316
General Tau Theory 6–7, 74, 86, 96–8, 103, 105, 121, 125, 128–31, 150, 190, 237–8, 256–62, 267, 300, 307–12, 316–18
Generative Phonology 5, 43
German 13, 29, 88, 181, 271
gestural activation 9, 14–22, 24–6, 29–30, 32–4, 37, 40–2, 45, 50, 55, 58, 64–6, 75, 80, 84, 87, 102, 104–6, 119, 121, 127, 133–4, 147, 163, 173, 227–8, 233, 259, 261, 317
gestural blending 18, 34, 45, 55
gestural overlap 10, 14, 16, 18–19, 23, 25–7, 32–4, 37, 40, 43–5, 49, 55, 57–8, 61–2, 65, 86–7, 89, 98, 102, 115–17, 129, 151, 153–4, 170–1, 174–6, 178, 180, 184, 222, 225, 228–9, 237, 286, 299, 308–9, 312, 318
global timing 20, 35–6, 38, 175
Hebrew 33, 271, 274
hierarchy of prominences 273–4
hierarchy of prosodic constituents 3, 29–30, 103, 132, 240, 268, 270–3
hyperbolic discount 210


impaired discrimination 252
improvement with practice 51
impulse 53, 56, 154
initial lengthening (see also ‘boundary-related lengthening’ and ‘lengthening’) 46, 94, 270–1, 305
inter-beat interval 92, 253
inter-gestural coordination (see also ‘inter-gestural relative timing’) 14, 18–20, 43, 102–31
inter-gestural relative timing (see also ‘inter-gestural coordination’) 20, 64
inter-landmark interval 240, 302–5, 316
inter-speaker variability 86–7
interception 76, 96–7
internal forward model 196–8
interval duration 23, 55, 67, 85, 90–5, 99–100, 106, 108, 129, 136, 146, 181, 214, 226, 229–31, 233, 242–3, 247, 249, 252–4, 291, 316
interval timing 18, 87, 242, 246, 254
intonational 29, 85, 132, 184, 269–70, 273–4, 276, 293, 360
intonational phonology 132, 184, 273–4, 293
intrinsic timing 1, 3–4, 9, 36, 38, 40–1, 47, 63–4, 66–7, 75, 80, 101, 103, 120, 127, 133–4, 144, 171, 234–5, 238, 244, 313–15
jerk 192, 200, 203–5, 208, 211, 220, 222, 236–7
kinematic cost 200–1, 203–4
Korean 52, 361
landmark(s) 6, 57, 60–1, 63, 109, 117, 126, 135, 151, 240, 244, 265, 294, 296–300, 302–6, 308, 316–19
lengthening 3, 32–8, 43, 46–8, 52, 64, 80–6, 88–9, 94, 105, 132–4, 137–8, 142, 147, 155, 175, 182, 190, 229, 240, 270–2, 274, 293, 303, 305, 315
lexical contrast 14, 21, 38–40, 47, 64, 84, 265–6, 269, 293
lexical phonology 268
limit-cycle oscillator 12, 16, 64, 75, 102–3, 135
listener 27, 39, 44, 52, 81, 118, 149, 162–5, 174–7, 223, 228, 236, 250, 252–4, 269–70, 275–6, 278, 280, 288, 304, 306


long-term memory 245
look-ahead model of coarticulation 43, 130
Mandarin 29, 52, 94, 190
mapping 45, 65, 156, 197–8, 265, 301
mass-spring settling time 104–5, 109, 119–21, 127, 130, 147
mass-spring system 9, 11–14, 21–3, 34, 37–8, 40–2, 54, 104–5, 109, 119, 127
Maximum Entropy 231
minimal intervention principle 215, 231
minimum cost 6, 129, 146, 194, 198–9, 213, 219–20, 267, 301–2, 317
modality-specific timing behavior 239, 249–51
monitoring, continuous 185–6, 266, 311
motor memory 201
motor noise 56, 91, 197, 199, 205–6, 266
motor-sensory 4, 66, 135, 148, 150–1, 155, 158–60, 162, 178–9, 185, 187–8, 190–1, 233, 264–8, 289, 306, 310–12, 316
Motor-Sensory Implementation Component 150, 178–9, 185, 310–12, 316
movement cost 191–4, 197–213, 217, 219, 221–3, 229, 265, 293, 302, 312, 316, 319
movement goal (see also ‘phonological goal’) 57, 68, 104, 120, 125–6, 157, 198–9, 216, 311–12
movement initiation (see also ‘movement onset’) 67, 77, 123, 125, 131, 258, 262
movement initiation time (see also ‘movement initiation’) 77, 123, 125, 258
movement onset (see also ‘movement initiation’) 25, 48, 58, 68, 70, 104–5, 107, 115, 119–20, 123, 125–31, 147, 180, 183, 262, 268, 309–10, 315, 318
movement target (see also ‘target’) 55, 68, 105, 125, 127, 129, 131, 211, 267
movement time-course 21, 95, 97, 125, 128, 150, 154, 179–80, 182–3, 189, 256–8, 267, 300–1, 306–7, 311, 316
multidimensional scaling 243
multiple successive intervals 249
MuT gestures 30, 32–3, 36–8, 41, 47, 50, 52, 62, 65, 84, 94–5, 127, 133–5, 144, 147, 172, 175–6

neural control signal 56, 201, 205–7, 218, 231, 237, 266
neural firing ramp 96–100, 251
neural interference 248
neural timekeeping 238
neuromimetic model 245–6
neutral attractor 15–16, 19, 23, 45, 227
non-beat-based 239, 251–6
non-linear restoring force 54–5
olivo-cerebellar timing (see also ‘cerebellum’) 245
Optimal Control Theory 22, 105, 129–31, 150, 190–237, 256, 266, 299, 303, 312, 316–19
Optimality Theory 221–2, 233
optimization 53, 56, 63, 144, 149, 189, 190–237, 265–6
oscillator-based timing mechanism 3–4, 28, 30–1, 42, 48, 108–9, 115, 131, 137, 139, 142, 146, 172, 256, 307
output constraint 221–3, 235
pacemaker-accumulator model 245
Parkinson’s disease 95–6, 210–11
parietal cortex 95, 97–8, 250
parsing cost 226–8, 230, 234
pause 86, 143, 154, 270, 272, 275–6, 293
peak velocity 22–4, 33–4, 37, 69, 126, 174, 181–2, 192, 206, 209, 220
peak velocity/distance relationship 34, 181
perceptual distinctiveness 221, 224
periodic behavior 41–2
periodic rhythm 280
periodic speech 110, 304
periodic styles 145, 255, 318
periodic tasks 251–2, 255
periodicity 43, 106, 112, 141, 143–5, 251–3, 255, 319
periodicity-based (see also ‘beat-based’) 143, 251, 255
perturbations 15, 40, 65, 116, 163, 167, 179, 186, 195, 215, 236, 266, 311
phasing (see also ‘planning oscillator coordination mechanism’) 25–8, 32, 35–6, 106–8, 110–11, 113–14, 246, 303, 309, 355


phoneme 57, 60–1, 70, 153, 156, 162, 164, 166–7, 170, 177–8, 221, 240–1, 267–9, 277, 279, 281–4, 286–94, 298
phonemically short and long vowels 80–5, 88–9, 94, 134, 147, 155, 181, 190, 225–7, 298, 315
phonetic planning 2, 4, 8, 38–9, 41, 47, 61, 65–6, 89, 101, 152–3, 156–7, 164, 180, 233, 317, 319
Phonetic Planning Component 4, 6, 63, 149–51, 153, 158–71, 176, 178–88, 235–6, 244, 265, 269, 293–4, 297, 298–310, 315–16
phonological goal (see also ‘movement goal’) 66, 151, 160, 300, 314
phonological planning 4–5, 41, 149, 158–71, 175, 184, 233, 279, 285
Phonological Planning Component 4, 149–52, 158–71, 176–9, 266–9, 279, 297–8, 300, 303, 316
phonological representations (see also ‘abstract non-symbolic phonological representations’, ‘abstract symbolic phonological representations’, ‘phonological structure’, and ‘prosodic structure’) 1–5, 7–11, 20–1, 38–40, 45, 48, 50, 61, 63, 65–6, 73, 75, 101, 112, 135, 147, 149, 151, 156–9, 164, 171–2, 177–8, 221, 233, 238, 278, 313–17, 320
phonological structure 38–9, 90, 265
phonology-extrinsic timing 2–3, 7, 9, 47, 50, 63, 100, 135, 144, 147, 158, 174, 190, 232, 235, 238, 244, 255, 264, 266, 300, 312–14, 317, 320
phonology-extrinsic-timing-based model 53, 63
phonology-extrinsic-timing-based three-component model of speech production 264–312
phonology-intrinsic timing 1, 3–4, 9, 38, 40, 47, 63–4, 66–7, 75, 80, 101, 103, 120, 127, 133–4, 144, 171, 234–5, 238, 244, 313–15
phrase oscillator 31, 43, 102, 133, 175
phrase-final lengthening 33, 35, 43, 46, 52, 80–3, 85, 88–9, 94, 132–4, 293, 303


Pi gesture 30, 32–4, 36–8, 41, 47, 50, 52, 62, 65, 84, 94–5, 127, 133–5, 144, 147, 172, 175–6
planning cost 219
planning oscillator (see also ‘planning+suprasegmental oscillators’) 3, 4, 16, 20, 22, 25–8, 30–1, 33, 35–8, 41–3, 46–8, 50, 64–6, 75, 80, 87, 94, 102, 106, 112–13, 133, 134, 135, 137, 139, 143, 147, 172, 173, 175–6, 268, 308–9
planning oscillator coordination mechanism (see also ‘phasing’) 24–9, 268
planning+suprasegmental oscillators 22, 30, 36–8, 41–2, 64, 66, 75, 80, 87, 94, 106, 133–5, 143, 147, 172, 173, 176
point-attractor oscillator 12, 42, 103
poly-subconstituent shortening (see also ‘polysyllabic shortening’ and ‘polysegmental shortening’) 29, 31–2, 46, 133, 135–44, 190, 225, 227, 229–30, 240, 272, 315
polysyllabic shortening 31–2, 136–8, 140–3, 225, 229–30, 240, 272, 315
practice 21, 51, 91, 113, 117, 190, 192, 199, 204, 256, 307
practiced voluntary movement 256, 307
prediction 3, 12, 27–8, 36, 95, 106, 110–11, 160, 174, 181, 196–7, 200, 203, 206, 210, 214–16, 253, 290, 303, 305, 320
prefrontal cortex 95, 97–8
premotor cortex 95, 99–100
pre-planning 178–87
prominence 3, 5, 22, 29–30, 32, 46–8, 52, 64, 86, 132–4, 139–40, 142, 153–5, 173–4, 190, 225, 229–30, 240, 268–70, 273–6, 278, 293–4, 300, 303–5, 315
prominence hierarchy 273–4
prominence-related lengthening 3, 46–8, 64, 86, 133–4, 190, 240, 315
proportional timing (see also ‘relative timing’) 12, 30, 33–6, 51, 90, 106–9, 113–16, 235, 247, 290–1
prosodic constituent hierarchy 3, 29, 30, 103, 132, 240, 268, 270–3
prosodic constituent structure 3, 20, 23, 85, 136, 143, 154–5, 240, 268–77, 300
prosodic control of timing 41, 80, 132–45, 153, 163, 182, 190, 259, 265, 292, 297–8, 303


prosodic hierarchy 3, 29–30, 103, 132, 240, 268, 270–4
prosodic planning 152, 189, 268, 278–9, 286, 288–9
prosodic planning frame 152, 189, 268, 278–9, 286, 289
prosodic position 8, 10, 22, 38, 40, 75, 105, 120, 127, 132, 147, 173, 279, 302
prosodic prominence hierarchy 273–4
prosodic structure 1, 3–6, 20, 23, 29, 30, 43, 46–7, 49, 75, 85, 132, 136, 143–5, 148, 151, 153–5, 172, 174, 223, 240, 266, 268–78, 294, 297, 300
prosodic word 94, 270
prosody (see also ‘prosodic structure’ and ‘prosodic control of timing’) 30, 32, 191, 277, 294, 315
quadratic 202–3, 210
qualitative acoustic cues 148, 265, 269, 294–5, 298, 300–1
quantitative phonetic representation 2, 5, 8, 10–11, 39–40, 63, 65, 148–51, 156, 159–60, 163–4, 168, 170–2, 174, 176–9, 183–4, 208, 222, 234–5, 253, 264–5, 267, 269, 286–7, 294, 297–8, 301, 314–15, 320
quantity/quantity language 80–5, 88–9, 94, 134, 147, 155, 181, 190, 225–7, 298, 315
rate of speech 3, 10, 17, 20, 22–3, 29, 35–8, 40–1, 45–7, 51, 64–6, 75, 80, 85, 87, 105, 107–8, 120, 127, 133–6, 143–4, 147, 151, 163, 170, 173–6, 190, 220, 225–6, 228–30, 234–5, 267–8, 280, 286, 291, 293, 297, 303, 313
ratio comparison 245
relational representation 5, 149, 151, 164, 241, 269, 297–8, 339
relationship between variability and interval duration 91–2, 252
relative timing (see also ‘proportional timing’) 14–15, 20, 23, 25, 27–8, 35, 38, 46, 54, 64, 102–4, 108–10, 113, 115, 117, 119, 123, 131, 135, 172, 235, 254, 259, 290, 308, 318
repetitive circle drawing 65, 249
response to medication 247

saccade 98, 100, 200, 206, 209–11
scalar property 91–2, 242, 247, 250, 252
scalar variability 91–2, 242, 247, 250, 252
segmental 4, 16, 22, 28, 30–1, 33, 35–8, 40–3, 46–8, 50, 52, 54, 61, 64–6, 75, 80, 86–7, 94, 102–3, 106, 108–9, 117, 121, 123, 131, 133, 135–7, 143–4, 147, 153, 156, 172–3, 175–6, 190–1, 225–7, 229, 240, 251, 268, 270, 272, 278, 284, 289, 291, 301, 304, 315, 318–19
segmental phonology 191
sensory feedback 185–7, 191, 196–7, 266, 312, 316
sensory modality 251
sensory noise 194, 197
singing 144, 182, 251, 255
Smooth Signal Redundancy 265, 269, 274–8
SOFCT (Stochastic Optimal Feedback Control Theory) 190, 193–5, 197, 199, 214–17, 219, 237, 264–6, 299, 302, 307, 312, 316
solar timing 25, 104, 106, 109, 112, 114, 173, 234, 315
somatosensory 40, 156, 185–7, 196–7, 242, 258, 266, 299, 305, 310, 316, 341, 347
sound localization 239, 241, 337
Spanish 279, 285
spatial accuracy 53–6, 70, 73, 129–30, 133, 160–1, 181, 192, 194, 197–8, 200–1, 204–9, 211, 214–15, 236–7, 302, 304, 317
spatial interference 57–62, 319
spatial variability 53–6, 70, 73, 129–30, 133, 160–1, 181, 192, 194, 197–8, 200–1, 204–9, 211, 214–15, 236–7, 302, 304, 317
spatiotemporal representations (see also ‘abstract non-symbolic phonological representations’) 1–4, 8, 21, 36, 38–40, 64–6, 73–5, 81, 134–5, 146–7, 162, 170, 172–3, 213, 233, 256, 313–14, 317
speaking rate 3, 10, 17, 20, 22–3, 29, 35–8, 40–1, 45–7, 51, 64–6, 75, 80, 85, 87, 105, 107–8, 120, 127, 133–6, 143–4, 147, 151, 163, 170, 173–6, 190, 220, 225–6, 228–30, 234–5, 267–8, 280, 286, 291, 293, 297, 303, 313
Speech Induced Suppression 196, 305
speech rate 3, 10, 17, 20, 22–3, 29, 35–8, 40–1, 45–7, 51, 64–6, 75, 80, 85, 87, 105, 107–8, 120, 127, 133–6, 143–4, 147, 151, 163, 170, 173–6, 190, 220, 225–6, 228–30, 234–5, 267–8, 280, 286, 291, 293, 297, 303, 313
spill-over effect 52, 213, 274
stability 46, 106, 108, 111, 198, 200–1
stiffness 12–13, 20–3, 33–4, 38, 40, 64, 68, 87, 102, 104, 109, 127, 174–5, 227–9, 233
Stochastic Optimal Feedback Control Theory (SOFCT) 190, 193–5, 197, 199, 214–17, 219, 237, 264–6, 299, 302, 307, 312, 316
straight movement path 179, 195, 200–1, 203–4
stress-adjacent lengthening 137
stress-timing 30, 139–40
striatal beat-frequency model 245–6
strict dominance 222
structured variability 215
sublexical element 270, 282, 284
sum of the squared motor commands 202, 207–8, 216–17
suprasegmental (see also ‘prosodic structure’) 4, 16, 22, 31, 33, 35–8, 41–3, 47–8, 50, 64–6, 75, 80, 87, 94, 102–3, 106, 131, 133, 135–6, 143–4, 147, 156, 172–3, 175–6, 251, 315, 318–19
suprasegmental oscillator 16, 22, 30–1, 35–8, 41–2, 47, 64, 66, 75, 80, 87, 94, 103, 106, 133–6, 143–4, 147, 172, 173, 176
surface phonetic variation 5, 7, 49, 65, 133–4, 148, 150, 153, 157, 163, 179, 189, 231, 235, 277, 282, 286, 294
surface timing 2–6, 20, 36, 38, 40, 42, 47, 49, 63, 65–7, 75, 78, 80, 85, 95, 100–1, 106, 115, 119–20, 134, 147, 149–53, 156–7, 171–3, 176–9, 189, 193, 224, 226–7, 229, 231, 238, 244, 315, 317
syllable oscillator 30–1, 36
syllable pulse train 154, 267
symbolic phonological representation 1–2, 4, 7, 50, 63, 66, 73, 90, 112, 135, 148–9, 151–3, 158–9, 164–5, 167, 169, 171, 173, 178–9, 184, 266, 268, 289, 291, 297, 312–15
synaptic weight 250
synchronization 96–7, 99, 120–1, 252
synchronous task 57–8, 61, 319


synergy (see also ‘coordinative structure’) 15, 40, 45, 186, 193, 195, 204, 214, 216–18, 266, 299
synfire chain 246
system constraint 221, 223
tapping 53, 68–70, 91–3, 96, 99, 216, 232, 243, 249
target (see also ‘movement target’) 6, 10–14, 17–19, 21–5, 30–1, 33–5, 37–8, 40–1, 44–5, 48, 50–1, 53–7, 60–1, 65, 68, 70, 74, 87, 89, 93–4, 97–8, 102, 104–5, 109, 114, 117–23, 125–31, 135, 138, 147–8, 151, 153, 156, 158, 160–3, 174–5, 179, 181, 183, 185, 187, 189, 192–4, 196–8, 201, 205–9, 211–14, 217, 220, 222–3, 226–34, 236–7, 240, 250, 256–9, 261, 267–8, 270, 279, 281, 287–9, 291, 293, 300–6, 308, 310–12, 317, 322
Task Dynamics 3, 8, 11–12, 46–7, 227 (AP/TD throughout)
task-specific finding 10, 216, 242, 250
tau 6–7, 74, 86, 96, 103, 105, 121, 125, 128–31, 150, 190, 237–8, 256–62, 267, 300, 307–12, 316–18
tau guidance (see also ‘General Tau Theory’) 258–61, 267
tau-coupling (see also ‘General Tau Theory’) 103, 128–9, 308, 318
temporal discrimination 210–11, 251
temporal precision 249
temporal sensitivity 251
three-component models of speech production planning 4–7, 41, 62, 135, 146–89, 233–5, 264–313, 316
tickling 78
time cost 209–11, 226–8, 230, 233–4, 237, 302–5, 308
time estimation 241–2, 250, 338, 361
time range 125, 241, 247–8
time scale 202, 214, 239
time warping 42, 75, 134, 147, 173, 315
time-course of movement 21, 95, 97, 125, 128, 150, 154, 179–80, 182–3, 189, 256–8, 267, 300–1, 306–7, 311, 316
time-stamp 98–9, 250
time-to-contact (see also ‘time-to-target achievement/approximation’) 97, 118, 257


time-to-peak-velocity 23–4
time-to-target achievement/approximation (see also ‘mass-spring settling time’) 37, 48, 56, 97–8, 109, 131
time-tracking 99–100, 242
timekeeping mechanisms 4, 6, 38, 42, 47–8, 63–5, 67, 70, 76–7, 79–80, 90, 93, 101, 103, 115, 117, 151, 173, 189–90, 230–3, 235–9, 241–9, 251–3, 255–6, 262, 265, 267, 316–17
timing accuracy 4, 66, 68–9, 73, 95, 101, 124, 147–8, 155, 158, 160, 178, 214, 247, 262–3, 268, 300, 314–15
timing adjustment mechanism 20, 38, 41–2, 48–51, 62, 133–5, 144
timing cost 209–11, 226–8, 230, 233–4, 237, 302–5, 308
timing precision 4, 66, 68–9, 73, 95, 101, 124, 147–8, 155, 158, 160, 178, 214, 247, 262–3, 268, 300, 314–15
timing target 226–7, 229, 317
timing variability 1, 6, 41, 67–70, 73, 77, 90, 101, 120, 126–9, 134–5, 147, 154–5, 157–8, 226, 230–1, 236, 242–4, 247, 251, 262, 314–17
ToBI 270
torque 196, 200–2, 204–5
torque change 200, 204–5
transcranial magnetic stimulation (TMS) 95, 248, 250, 252
translation, phonology to phonetics 11, 39, 134, 149, 171–3, 176–8, 189, 265, 294, 301, 314–15
typing 68, 70, 158
undershoot 17, 19, 35, 51, 54–5, 192–3, 195, 198, 201, 203, 213–14, 220–1, 228, 230

universality 221, 271, 286
universality of prosodic structure 271
velocity (see also ‘peak velocity’, ‘time-to-peak-velocity’, and ‘peak velocity/distance relationship’) 1, 6, 17, 21–4, 33–4, 37, 46, 54, 56, 69, 71, 74, 78, 116, 118, 122, 126, 128, 174, 179–83, 192, 195–6, 198, 200–11, 213–14, 220, 238, 256, 258–62, 264–5, 300, 307, 311, 316, 318
velocity profile 1, 6, 21, 23–4, 46, 179–80, 182, 192, 195, 200–3, 205–8, 214, 238, 256, 258–9, 262, 264–5, 300, 307, 311, 316
visual cortex 95, 250
visual modality (see also ‘auditory modality’) 249–51
voicing 13–14, 57, 70, 107, 109, 187, 289, 296–7
voicing neutralization 13
voluntary movement 238, 256, 307
Weber’s law (see also ‘scalar property’) 90, 93
window model of coarticulation 153, 301
working memory 245, 247, 249
XT/3C (Phonology-Extrinsic-Timing-Based Three-Component Model of Speech Production) 148–51, 158, 160, 163–4, 171, 173, 176–7, 185, 188–9, 220, 235, 237–8, 251, 255–6, 262–3, 264–319
XT/3C-v1 (Phonology-Extrinsic-Timing-Based Three-Component Model of Speech Production, version 1) 188, 264–312