The Perceptual Structure of Sound (ISBN 3031255658, 9783031255656)

This book presents a comprehensive review of how acoustic waves are processed by the auditory system into structured sound.

English, 839 [840] pages, 2023

Table of contents:
Preface
Speech Perception or Music Perception
Prior Knowledge
Sound Demonstrations and Matlab Scripts
References
Contents
1 Introduction
1.1 Pure Tones
1.1.1 Amplitude
1.1.2 Frequency
1.1.3 Phase
1.2 Complex Tones
1.3 Speech Sounds
1.4 Musical Scales and Musical Intervals
1.5 The Sum of Two Sinusoids
1.5.1 Two Tones: The Range of Perception of a Steady Tone
1.5.2 Two Tones: The Range of Roughness Perception
1.5.3 Two Tones: The Range of Rhythm Perception
1.5.4 Two Tones: The Range of Hearing Slow Modulations
1.5.5 In Summary
1.6 Amplitude Modulation
1.6.1 Amplitude Modulation: The Range of Perception of a Steady Tone
1.6.2 Amplitude Modulation: The Range of Roughness Perception
1.6.3 Amplitude Modulation: The Range of Rhythm Perception
1.6.4 Amplitude Modulation: The Range of Hearing Slow Modulations
1.6.5 In Summary
1.7 Frequency Modulation
1.7.1 Frequency Modulation: The Range of Perception of a Steady Tone
1.7.2 Frequency Modulation: The Range of Roughness Perception
1.7.3 Frequency Modulation: The Range of Rhythm Perception
1.7.4 Frequency Modulation: The Range of Hearing Slow Modulations
1.7.5 In Summary
1.8 Additive Synthesis
1.8.1 The Saw Tooth and the Square Wave
1.8.2 Pulse Trains
1.8.3 Phase of Sums of Equal-Amplitude Sinusoids
1.9 Noise
1.9.1 White Noise
1.9.2 Pink Noise
1.9.3 Brown Noise
1.10 Subtractive Synthesis
1.11 Envelopes
References
2 The Ear
2.1 Overview
2.2 Two Ears
2.3 The Outer Ear
2.4 The Middle Ear
2.5 The Inner Ear
2.5.1 The Basilar Membrane
2.5.2 Distortion Products
2.5.3 The Organ of Corti
2.6 The Auditory Nerve
2.6.1 Refractoriness and Saturation
2.6.2 Spontaneous Activity
2.6.3 Dynamic Range
2.6.4 Band-Pass Filtering
2.6.5 Half-Wave Rectification
2.7 Summary Schema of Peripheral Auditory Processing
2.8 The Central Auditory Nervous System
References
3 The Tonotopic Array
3.1 Masking Patterns and Psychophysical Tuning Curves
3.2 Critical Bandwidth
3.3 Auditory-Filter Bandwidth and the Roex Filter
3.4 The Gammatone Filter
3.5 The Compressive Gammachirp
3.6 Summary Remarks
3.7 Auditory Frequency Scales
3.7.1 The Mel Scale
3.7.2 The Bark Scale
3.7.3 The ERBN-Number Scale or Cam Scale
3.8 The Excitation Pattern
3.8.1 Transfer Through Outer and Middle Ear
3.8.2 Introduction of Internal Noise
3.8.3 Calculation of the Excitation Pattern
3.8.4 From Excitation on a Linear Hertz Scale to Excitation on a Cam Scale
3.8.5 From Excitation to Specific Loudness
3.8.6 Calculation of Loudness
3.9 Temporal Structure
3.9.1 The Autocorrelation Model
3.9.2 dBA-Filtering
3.9.3 Band-Pass Filtering
3.9.4 Neural Transduction
3.9.5 Generation of Action Potentials
3.9.6 Detection of Periodicities: Autocovariance Functions
3.9.7 Peak Detection
3.10 Summary
References
4 Auditory-Unit Formation
4.1 Auditory Scene Analysis
4.2 Auditory Units
4.3 Auditory Streams
4.3.1 Some Examples
4.4 The Perceived Duration of an Auditory Unit
4.5 Perceptual Attributes
4.6 Auditory Localization and Spatial Hearing
4.7 Two Illustrations
4.8 Organizing Principles
4.8.1 Common Fate
4.8.2 Spectral Regularity
4.8.3 Exclusive Allocation
4.9 Consequences of Auditory-Unit Formation
4.9.1 Loss of Identity of Constituent Components
4.9.2 The Emergence of Perceptual Attributes
4.10 Performance of the Auditory-Unit-Formation System
4.11 Concluding Remark
References
5 Beat Detection
5.1 Measuring the Beat Location of an Auditory Unit
5.1.1 Tapping Along
5.1.2 Synchronizing with a Series of Clicks
5.1.3 Absolute Rhythm Adjustment Methods
5.1.4 Relative Rhythm Adjustment Methods
5.1.5 The Phase-Correction Method
5.1.6 Limitations
5.2 Beats in Music
5.3 Beats in Speech
5.3.1 The Structure of Spoken Syllables
5.3.2 Location of the Syllable Beat
5.3.3 The Concepts of P-centre or Syllable Beat
5.4 The Role of Onsets
5.4.1 Neurophysiology
5.5 The Role of F0 Modulation
5.6 Strength and Clarity of Beats
5.7 Interaction with Vision
5.8 Models of Beat Detection
5.9 Fluctuation Strength
5.10 Ambiguity of Beats
References
6 Timbre Perception
6.1 Definition
6.2 Roughness
6.3 Breathiness
6.4 Brightness or Sharpness
6.5 Dimensional Analysis
6.5.1 Timbre Space of Vowel Sounds
6.5.2 Timbre Space of Musical Sounds
6.6 The Role of Onsets and Transients
6.7 Composite Timbre Attributes
6.7.1 Sensory Pleasantness and Annoyance
6.7.2 Voice Quality
6.7.3 Perceived Effort
6.8 Context Effects
6.9 Environmental Sounds
6.10 Concluding Remarks
References
7 Loudness Perception
7.1 Sound Pressure Level (SPL) and Sound Intensity Level (SIL)
7.1.1 Measurement of Sound Pressure Level
7.1.2 "Loudness" Normalization
7.2 The dB Scale and Stevens' Power Law
7.3 Stevens' Law of a Pure Tone and a Noise Burst
7.4 Loudness of Pure Tones
7.4.1 Equal-Loudness Contours
7.5 Loudness of Steady Complex Sounds
7.5.1 Limitations of the Loudness Model
7.6 Partial Loudness of Complex Sounds
7.6.1 Some Examples of Partial-Loudness Estimation
7.6.2 Limitations of the Partial-Loudness Model
7.7 Loudness of Time-Varying Sounds
7.7.1 Loudness of Very Short Sounds
7.7.2 Loudness of Longer Time-Varying Sounds
7.8 Concluding Remarks
References
8 Pitch Perception
8.1 Definitions of Pitch
8.2 Pitch Height and Pitch Chroma
8.3 The Range of Pitch Perception
8.4 The Pitch of Some Synthesized Sounds
8.4.1 The Pitch of Pure Tones
8.4.2 The Duration of a Sound and its Pitch
8.4.3 Periodic Sounds and their Pitch
8.4.4 Virtual Pitch
8.4.5 Analytic Versus Synthetic Listening
8.4.6 Some Conclusions
8.5 Pitch of Complex Sounds
8.6 Shepard and Risset Tones
8.7 The Autocorrelation Model
8.8 The Missing Fundamental
8.8.1 Three Adjacent Resolved Harmonics
8.8.2 Three Adjacent Unresolved Harmonics
8.8.3 Three Adjacent Unresolved Harmonics of High Rank
8.8.4 Three Adjacent Shifted Unresolved Harmonics of High Rank
8.8.5 Seven Adjacent Unresolved Harmonics of High Rank
8.8.6 Seven Adjacent Unresolved Harmonics in the Absence of Phase Lock
8.8.7 Conclusions
8.9 Pitch of Non-periodic Sounds
8.9.1 Repetition Noise
8.9.2 Pulse Pairs
8.10 Pitch of Time-Varying Sounds
8.11 Pitch Estimation of Speech Sounds
8.12 Estimation of Multiple Pitches
8.13 Central Processing of Pitch
8.14 Pitch Salience or Pitch Strength
8.15 Pitch Ambiguity
8.16 Independence of Timbre and Pitch
8.17 Pitch Constancy
8.18 Concluding Remarks
References
9 Perceived Location
9.1 Information Used in Auditory Localization
9.1.1 Interaural Time Differences
9.1.2 Interaural Level Differences
9.1.3 Filtering by the Outer Ears
9.1.4 Reverberation
9.2 The Generation of Virtual Sound Sources
9.3 More Information Used in Sound Localization
9.3.1 Movements of the Listener
9.3.2 Rotations of the Sound Source Around the Listener
9.3.3 Movements of the Sound Source Towards and from the Listener
9.3.4 Doppler Effect
9.3.5 Ratio of Low-Frequency and High-Frequency Energy in the Sound Signal
9.3.6 Information About the Room
9.3.7 Information About the Location of Possible Sound Sources
9.3.8 Visual Information
9.3.9 Background Noise
9.3.10 Atmospheric Conditions: Temperature, Humidity, and the Wind
9.3.11 Familiarity with the Sound Source
9.4 Multiple Sources of Information
9.5 Externalization or Internalization
9.6 Measuring Human-Sound-Localization Accuracy
9.7 Auditory Distance Perception
9.7.1 Accuracy of Distance Perception
9.7.2 Direct-to-Reverberant Ratio
9.7.3 Dynamic Information
9.7.4 Perceived Distance, Loudness, and Perceived Effort
9.7.5 Distance Perception in Peripersonal Space
9.8 Auditory Perception of Direction
9.8.1 Accuracy of Azimuth Perception
9.8.2 Accuracy of Elevation Perception
9.8.3 Computational Model
9.9 Auditory Perception of Motion
9.9.1 Accuracy of Rotational-Motion Perception
9.9.2 Accuracy of Radial-Motion Perception
9.9.3 Perception of Looming Sounds
9.9.4 Auditory Motion Detectors?
9.10 Walking Around
9.10.1 Illusory Motion
9.11 Integrating Multiple Sources of Information
9.11.1 Cooperation and Competition?
9.11.2 Exclusive Allocation
9.11.3 Plasticity and Calibration
References
10 Auditory-Stream Formation
10.1 Some Examples
10.1.1 The Trill Threshold
10.1.2 Sequential Integration and Segregation
10.2 Measures of Integration or Segregation
10.3 The Perceived Number of Streams in an Auditory Scene
10.4 The Perceived Number of Units in an Auditory Stream
10.5 The Emergence of Rhythm
10.6 Frequency Scale
10.7 Instability
10.7.1 Build-Up and Resets
10.7.2 Bistability and Multistability
10.7.3 Modelling Multistability
10.7.4 Neurophysiology of Multistability
10.8 Factors Playing a Role in Sequential Integration
10.8.1 Tempo of the Tonal Sequence
10.8.2 Separation in Pitch
10.8.3 Differences in Timbre
10.8.4 Differences in Perceived Location
10.8.5 Familiarity
10.8.6 Attention
10.8.7 Syntax and Semantics
10.8.8 Visual Information
10.8.9 Loudness
10.8.10 Concluding Remarks
10.9 Organizing Principles
10.9.1 Proximity
10.9.2 Similarity
10.9.3 Completion
10.9.4 Connectedness
10.9.5 Good Continuation
10.9.6 Temporal Regularity
10.9.7 Exclusive Allocation
10.9.8 Figure-Ground Organization
10.9.9 Other Elements of Gestalt Psychology
10.10 Establishing Temporal Coherence
10.11 The Continuity Illusion
10.11.1 Four Rules Governing the Continuity Illusion
10.11.2 Information Trading
10.11.3 Onsets and Offsets
10.11.4 Restoration in Speech and in Music
10.11.5 Other Sound Demonstrations on the Continuity Illusion
10.12 Consequences of Auditory-Stream Formation
10.12.1 Camouflage
10.12.2 Order and Temporal Relations
10.12.3 Rhythm
10.12.4 Pitch Contours
10.12.5 Consonance and Dissonance
10.13 Primitive and Schema-Based ASA
10.13.1 The Nature of Schemas
10.13.2 Information Trading
10.14 Human Sound Localization and Auditory Scene Analysis
10.15 Computational Auditory Scene Analysis
10.15.1 Temporal Coherence
10.15.2 Predictive Coding
10.15.3 Neural Networks
References
11 Interpretative Summary
11.1 Introduction
11.2 The Ear
11.3 The Auditory Filter and the Tonotopic Array
11.4 Auditory-Unit Formation
11.5 Beat Detection
11.6 Timbre Perception
11.7 Loudness Perception
11.8 Pitch Perception
11.9 Perceived Location
11.10 Auditory-Stream Formation
11.10.1 Instability
11.10.2 Neurophysiology of Instability
11.10.3 Factors Playing a Role in Sequential Integration
11.10.4 Organizing Principles
11.10.5 Establishing Temporal Coherence
11.10.6 The Continuity Illusion
11.10.7 Consequences of Auditory-Stream Formation
11.10.8 Primitive and Schema-Based ASA
11.10.9 Human Sound Localization and Auditory-Stream Formation
11.10.10 Computational Auditory Scene Analysis
Reference
Index

Current Research in Systematic Musicology

Dik J. Hermes

The Perceptual Structure of Sound

Current Research in Systematic Musicology, Volume 11

Series Editors
Rolf Bader, Musikwissenschaftliches Institut, Universität Hamburg, Hamburg, Germany
Marc Leman, University of Ghent, Ghent, Belgium
Rolf-Inge Godoy, Blindern, University of Oslo, Oslo, Norway

The series covers recent research, hot topics, and trends in Systematic Musicology. Following the highly interdisciplinary nature of the field, the publications connect different views upon musical topics and problems with the field's multiple methodology, theoretical background, and models. It fuses experimental findings, computational models, psychological and neurocognitive research, and ethnic and urban field work into an understanding of music and its features. It also supports a pro-active view on the field, suggesting hard- and software solutions, new musical instruments and instrument controls, content systems, or patents in the field of music. Its aim is to proceed in the over 100 years international and interdisciplinary tradition of Systematic Musicology by presenting current research and new ideas next to review papers and conceptual outlooks. It is open for thematic volumes, monographs, and conference proceedings. The series therefore covers the core of Systematic Musicology:
• Musical Acoustics, which covers the whole range of instrument building and improvement, Musical Signal Processing and Music Information Retrieval, models of acoustical systems, Sound and Studio Production, Room Acoustics, Soundscapes and Sound Design, Music Production software, and all aspects of music tone production. It also covers applications like the design of synthesizers, tone, rhythm, or timbre models based on sound, gaming, or streaming and distribution of music via global networks.
• Music Psychology, both in its psychoacoustic and neurocognitive as well as in its performance and action sense, which also includes musical gesture research, models and findings in music therapy, forensic music psychology as used in legal cases, neurocognitive modeling and experimental investigations of the auditory pathway, or synaesthetic and multimodal perception. It also covers ideas and basic concepts of perception and music psychology and global models of music and action.
• Music Ethnology in terms of Comparative Musicology, as the search for universals in music by comparing the music of ethnic groups and social structures, including endemic music all over the world, popular music as distributed via global media, art music of ethnic groups, or ethnographic findings in modern urban spaces.
Furthermore, the series covers all neighbouring topics of Systematic Musicology.

Dik J. Hermes

The Perceptual Structure of Sound

Dik J. Hermes
Department of Industrial Engineering & Innovation Sciences, Human-Technology Interaction Group
Eindhoven University of Technology
Eindhoven, The Netherlands

ISSN 2196-6966    ISSN 2196-6974 (electronic)
Current Research in Systematic Musicology
ISBN 978-3-031-25565-6    ISBN 978-3-031-25566-3 (eBook)
https://doi.org/10.1007/978-3-031-25566-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

I want to dedicate this book to the memory of the old Institute for Perception Research/IPO (1957–2001), where many of the ideas discussed in this book originated

Preface

The path along which this book has come about has been quite tortuous. It started as a book that aimed at bridging the gap between sound designers on the one hand and engineers on the other. The provisional title was "Sound Perception: The Science of Sound Design". As such, the book was conceived as an introduction to the various subjects of hearing research, such as the anatomy and physiology of the peripheral hearing system, the auditory filter and the tonotopic array, loudness perception, pitch perception, auditory scene analysis, auditory localization, and the ecological psychology of hearing. These were the subjects of a lecture series on sound design and sound perception that I started to present about twenty years ago in the Human-Technology Interaction group at Eindhoven University of Technology. The readers of this book will see, however, that the order in which the various subjects of the chapters are presented is quite different from this original order. Indeed, in the course of writing, and certainly while reading the literature, various ideas about the way in which the perceptual structure of sound comes about crystallized. The result is the book as it is composed now.

Speech Perception or Music Perception

In the auditory domain, two important ways of human communication distinguish human beings from all other organisms. These ways are, of course, speech and music. Nevertheless, in its original conception, this book was planned to be about neither music nor speech, but to focus on the design of everyday sounds, and to discuss this in the framework of ecological psychology. After all, there are already various good books specifically dedicated to music perception, e.g., Deutsch [2], Jones, Fay, and Popper [3], and Roederer [7], and to speech, e.g., Behrman [1], Moore, Tyler, and Marslen-Wilson [4], and Pisoni and Remez [6]. The relation between music and speech is extensively discussed by Patel [5]. On the other hand, human conceptions of music and speech have significantly influenced the field of hearing research. For instance, the science of sound perception is so pervaded by terminology derived from
music that, in the introduction of this book, some elementary information will be presented on musical scales and musical intervals. In addition, a very large part of hearing research is carried out with musical or speech-like stimuli. I hope that those interested in speech and music perception will find a lot of information in the text that will influence their ideas and beliefs about the perceptual process of hearing in general and about music and speech in particular. Understanding the principles of the formation of coherent musical melodies or speech utterances, the emergence of timbre, loudness, pitch, etc., is, I think, basic to the understanding of the perception of music and speech in general. In this sense, I hope that this book will contribute to the understanding of these typically human ways of communication.

Prior Knowledge

The reader should have some background in mathematics. Although I aim to keep things as simple as possible, the description of sound would be oversimplified and would lead to misconceptions if no use were made of basic algebra, trigonometry, calculus, and spectral analysis. A simple example: the "intensity" of sound is mostly described in decibels. The reason for this is that, as a first and, within limits, good approximation, the human sound processor processes intensity in a logarithmic way. Therefore, the reader will need some elementary skills in algebra, specifically in dealing with logarithms. In addition, in various instances, some elementary calculus will be used. Moreover, sound is generally conceived of as consisting of "frequency components", in other words as sums of sinusoids. The reader must be familiar with the mathematical tools used to describe sums of sinusoids and know how to do calculations with them, that is, with trigonometry, the power spectrum, and the spectrogram. This will be summarized and illustrated in the introduction of this book.
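
As a minimal sketch of the kind of logarithmic arithmetic involved (my own example in Matlab, not one of the book's demo scripts; all numerical values are arbitrary), a ratio of intensities or amplitudes is converted into a level difference in decibels as follows:

  I1 = 1e-6;  I2 = 2e-6;                 % two sound intensities in W/m^2 (example values)
  dL_int = 10 * log10(I2 / I1);          % intensity ratio in dB: a factor of 2 is about 3 dB
  a1 = 0.5;   a2 = 1.0;                  % two pressure amplitudes (example values)
  dL_amp = 20 * log10(a2 / a1);          % amplitude ratio in dB: a factor of 2 is about 6 dB
  fprintf('%.1f dB, %.1f dB\n', dL_int, dL_amp)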

Sound Demonstrations and Matlab Scripts

This book contains many sound demonstrations, most of which have been generated with Matlab. They can be accessed through https://dhermes.ieis.tue.nl/PSS. Readers of the e-book can also listen to them by clicking on "demo" at the end of the corresponding figure caption; the Matlab scripts can be retrieved by clicking on "Matlab" just before "demo".

Eindhoven, The Netherlands

Dik J. Hermes

Acknowledgments

Herewith I would like to express my gratitude to Don Bouwhuis and Raymond Cuijpers, who read the whole book, proposed a multitude of corrections and improvements, and
came up with many suggestions. Without them, this book would have contained many more errors. I thank Erik Edwards for reading the sections on rhythm perception and Leon van Noorden for reading sections of the Chapter on “Auditory-Stream Formation”.

References

1. Behrman, A. Speech and Voice Science. San Diego, CA: Plural Publishing, Inc., 2018.
2. Deutsch, D., ed. The Psychology of Music. 3rd edition. Amsterdam: Elsevier, 2013. https://doi.org/10.1016/C2009-0-62532-0. URL: http://www.sciencedirect.com/science/book/9780123814609.
3. Jones, M. R., Fay, R., and Popper, A. N., eds. Music Perception. New York, NY: Springer Science+Business Media, 2010, pp. i–xii, 1–264. https://doi.org/10.1007/978-1-4419-6114-3.
4. Moore, B. C., Tyler, L. K., and Marslen-Wilson, W. D., eds. The Perception of Speech: From Sound to Meaning. Oxford, UK: Oxford University Press, 2009.
5. Patel, A. D. Music, Language, and the Brain. Oxford, UK: Oxford University Press, 2008.
6. Pisoni, D. B. and Remez, R. E., eds. The Handbook of Speech Perception. Oxford, UK: Blackwell Publishing Ltd., 2005.
7. Roederer, J. G. The Physics and Psychophysics of Music: An Introduction. 4th edition. New York, NY: Springer Science+Business Media, 2008.

Chapter 1

Introduction

Sound consists of audible fluctuations of the air pressure. The most direct way to represent these fluctuations is by graphically presenting the course of these pressure waves as a function of time. Such a representation will be referred to as the waveform or oscillogram of the signal. An example of an oscillogram, in this case of a speech signal, is given in Fig. 1.1 for the utterance "All lines from London are engaged". Two waveforms are presented: the upper panel shows the complete utterance; the lower panel shows the waveform of a segment from the vowel "i" in the word "lines". Other examples of oscillograms are shown at various places further on in this book, e.g., of some simpler sounds in the lower panels of Figs. 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, and 1.10. The waveform of a sound signal will often be used in the graphical presentations of this book, but it has limitations. First, it is only suitable for relatively short, slowly fluctuating signals; the longer the signal gets, the more fluctuations must be packed into the available graphical space. In the lower panel of Fig. 1.1, the separate pressure fluctuations can clearly be distinguished, but in the upper panel, the fluctuations are so densely packed that all details of the separate fluctuations are lost. For instance, in the first two words "All lines", the graph is clogged and almost reduced to a patch bounded below and above by the largest negative and positive peaks of the separate fluctuations.

Fig. 1.1 Two oscillograms of the utterance "All lines from London are engaged" spoken by a male speaker. The upper panel gives the oscillogram of the complete utterance; the lower panel a segment of the vowel "i" in the word "lines" indicated by the vertical lines in the upper panel. (Matlab) (demo)

A second disadvantage of the presentation of the waveform is that it is difficult to derive the frequency content from it. For a reason to be discussed later, we think of sound as consisting of a number of components that are in the first place characterized by their frequency; each of these frequency components then evolves in time. In fact, this represents the way we think about music. The problem is that this mental representation does not in general correspond to the acoustic structure of the sound. In music played by a band or an orchestra, or sung by a choir, various instruments or voices play together, and the successive notes of the voices compose into melodies, each with its own melodic and rhythmic structure. This process may seem self-evident, but it is not; each note played by one of the musical voices is generally itself the sum of a sometimes large number of frequency components, called partials. The lowest of these partials is called the fundamental, its frequency the fundamental frequency or F0. Partials with frequencies higher than the fundamental frequency are called overtones. In most musical signals, the frequencies of these overtones are integer multiples of the fundamental frequency. Such sounds are called harmonic. As a first approximation, it is the frequency of the fundamental of these harmonic sounds that corresponds to the frequency of the pitch of the tone [22]. The frequencies of these pitches make up our mental representation of the melody, not the frequencies of the overtones. We generally do not hear the fundamental and its overtones as separate sounds, because perceptually they merge into coherent tones. The amplitudes of the partials that merge together contribute to the loudness of the tone. Their relative amplitudes and their course in time define what kind of tone is heard, e.g., which instrument is heard. The auditory attribute that makes a sound different from other sounds is called timbre. So, besides by its pitch, a musical tone is perceptually characterized by its loudness and its timbre. These represent different aspects of what we hear and are thus auditory properties that, along with a number of other auditory attributes, will be discussed in detail in this book. It will become clear that none of them has a simple relation with the acoustic properties of the sound; each is the result of complex perceptual processing within our auditory system.

Sounds are very often described in terms of time and frequency. Such descriptions will be referred to as spectrotemporal or time-frequency representations. This will first be illustrated by means of the standard musical notation of Western music. An example is presented in Fig. 1.2 for the tune of the Westminster Abbey chimes. In this notation, the vertical position represents "frequency" and the horizontal position represents "time". The words "frequency" and "time" are written within quotation marks here because, in this context, their meaning is defined neither mathematically nor acoustically. In principle, different heights represent different pitch values of a musical scale. By including the flat sign (♭) or the sharp sign (♯) before the note, these pitches can be decreased or increased by a specific amount, called a semitone. Consequently, the set of pitch values that can be represented remains very limited, and equal distances in the vertical direction do not in general correspond to equal distances in tone height. In the horizontal dimension of the musical bars, indicating time, something comparable happens. Here, too, the set of possibilities is severely restricted by the rhythm and meter of the music. Differences in duration of the notes are not represented by differences in distances along the horizontal dimension, but symbolically by differences in the signs indicating a note, as can be seen in Fig. 1.2. This all shows that musical notation is a symbolic representation, in which the vertical position, "frequency", of each note indicates its intended pitch value in the melodic structure, and its horizontal position, "time", its intended position in the rhythmic structure. In the context of this book, more exact time-frequency representations are required.

Fig. 1.2 Musical score of the Westminster chimes tune. (demo)
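
To make the notions of fundamental, overtones, and a time-frequency representation concrete, here is a minimal Matlab sketch (not one of the book's own demos; the fundamental frequency, the 1/k amplitudes, and the sampling rate are arbitrary choices, and the spectrogram call assumes the Signal Processing Toolbox is available):

  fs = 44100;                             % sampling rate in Hz (assumed value)
  t  = 0:1/fs:1;                          % one second of time samples
  F0 = 220;                               % fundamental frequency in Hz (example value)
  s  = zeros(size(t));
  for k = 1:8                             % the fundamental (k = 1) plus seven overtones
      s = s + (1/k) * sin(2*pi*k*F0*t);   % k-th partial at an integer multiple of F0
  end
  s = s / max(abs(s));                    % normalize to avoid clipping
  sound(s, fs);                           % play the harmonic complex
  spectrogram(s, 1024, 512, 1024, fs, 'yaxis');   % time-frequency representation

Played back, these eight partials are generally heard as a single tone with a pitch corresponding to F0, not as eight separate sinusoids.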

1.1 Pure Tones

A pure tone is a tone that can be described mathematically as a single sinusoid. In general, sinusoids are considered "simple", but perhaps "elementary" would be a better term. In hearing research, sounds are very often described as sums of sinusoids, not only mathematically but also conceptually, and this will be done very regularly throughout the rest of this book. This is motivated by the structure of the peripheral hearing system, which, as a good first approximation, operates as a frequency analyser [20, 21]. This means that the peripheral hearing system, consisting of the outer ear, the middle ear, and the inner ear, splits a complex sound into a large set of separate frequency components. This will be discussed in close detail in the next chapters, because this is of decisive perceptual importance. Acoustically, a pure tone s(t) consists of an audible pressure wave the magnitude of which can be represented as a sinusoid, i.e., a sine or cosine wave, possibly shifted in phase:

s(t) = a sin(2πft + ϕ).    (1.1)


Fig. 1.3 Waveforms of two pure tones. The upper panel shows the start of a 440 Hz sine wave with relative amplitude of 0.76; the lower panel shows a 660 Hz cosine wave of amplitude 0.60. (Matlab) (demo)

Besides the time variable t, one can distinguish three parameters in Eq. 1.1: The amplitude a, the frequency f , and the phase ϕ. In Fig. 1.3, the waveforms of two sinusoids are depicted. In the upper panel, a 440 Hz sine wave with amplitude 0.76 is presented, so that a = 0.76, f = 440, and ϕ = 0. The lower panel shows a 660 Hz cosine wave with amplitude 0.60, hence a = 0.60, f = 660, and ϕ = π /2, since, of course, the cosine is a sine shifted in time over one quarter of its period, cos (2π f t) = sin (2π ( f t + 1/4)) = sin (2π f t + π/2).
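As a minimal sketch of how such waveforms can be generated (not the Matlab code provided with this book; the sample rate and the 10 ms plotting interval are arbitrary choices), the two tones of Fig. 1.3 could be synthesized as follows:

% Sketch: synthesize the two pure tones of Fig. 1.3 (assumed parameters).
fs = 44100;                             % sample rate in Hz (assumption)
t  = (0:1/fs:0.01);                     % 10 ms time axis, enough to show a few periods
s1 = 0.76 * sin(2*pi*440*t);            % 440 Hz sine, amplitude 0.76 (phase 0)
s2 = 0.60 * sin(2*pi*660*t + pi/2);     % 660 Hz cosine, amplitude 0.60 (phase pi/2)
subplot(2,1,1); plot(t*1000, s1); ylabel('s_1'); title('440 Hz sine, a = 0.76');
subplot(2,1,2); plot(t*1000, s2); ylabel('s_2'); xlabel('time (ms)');
title('660 Hz cosine, a = 0.60');
% soundsc(sin(2*pi*440*(0:1/fs:0.5)), fs)   % uncomment to listen to a 0.5 s tone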

1.1.1 Amplitude

A sinusoid has three parameters: Amplitude, frequency, and phase. Amplitude will be discussed first. If a sound signal is composed of only one sinusoid, the amplitude of this sinusoid represents the pressure of the sound wave relative to the average pressure of the atmosphere. In the absence of any other sounds, the amplitude of the pressure wave with which a pure sinusoid enters our ears determines its loudness, which is the perceptual attribute that listeners associate with the amplitude of the tone. If the amplitude of the sound-pressure wave increases, loudness increases, and when this amplitude decreases, loudness decreases. Note, however, that this is only true in general in the absence of other sounds. In the presence of other sounds, various things can happen to the perception of a pure sinusoid. The sinusoid may indeed be unaffected by these other sounds, and is then heard as a separate sound


which "sounds" the same as when heard in isolation, i.e., in the absence of other sounds. Another possibility is that the loudness with which the sinusoid is perceived diminishes due to the presence of the other sound, which is called partial masking, or that the sinusoid even becomes inaudible, which is called complete masking. Yet another possibility, and it is important to realize that this can happen, is that the sinusoid is not perceived as a separate sound but merges perceptually with components of another sound, which may result in a change in the timbre of that sound. This will be discussed in detail in later chapters of this book. In the absence of other sound components, the perceived loudness of a pure tone completely depends on the amplitude of the sinusoid at the entrance of our ears. This amplitude is determined by quite a few acoustic variables: The power of the sound source, the distance of the sound source from the listener, the reflections from the ceiling, the floor, the walls, and other reflecting surfaces in the room in which the listener is situated, the reverberation of the room, the position of the listener, and the direction from which the sound enters the ear canals of the listener. In hearing research, the amplitude of the sinusoid is mostly given by its sound pressure level (SPL). The normal unit of pressure in physics is newton per square meter (N/m²) or pascal (Pa), but sound pressure level is usually not expressed in newtons per square meter or pascals, but on a logarithmic scale, in decibels (dB). This is because, within certain limits, equal differences in dB are perceived as equal differences in loudness. This will be discussed in detail in Chap. 7.
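As an illustration of the decibel scale, here is a sketch based on the standard definition of sound pressure level, which uses a reference pressure of 20 µPa; the pressure amplitude chosen below is an arbitrary assumption, and this is not code from this book:

% Sketch: sound pressure level (SPL) in dB of a pure tone,
% using the standard reference pressure of 20 micropascal.
p_ref  = 20e-6;                    % reference pressure in Pa
a      = 0.1;                      % assumed pressure amplitude of the tone in Pa
p_rms  = a / sqrt(2);              % rms pressure of a sinusoid
SPL    = 20 * log10(p_rms / p_ref);
fprintf('Amplitude %.3f Pa corresponds to %.1f dB SPL\n', a, SPL);
% Doubling the amplitude adds 20*log10(2) = 6.02 dB, illustrating that equal
% amplitude ratios correspond to equal differences in dB.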

1.1.2 Frequency

The pitch of a purely sinusoidal sound signal can generally be associated with its frequency. Again, however, this is only true in the absence of any other sounds and even then, only as a first, yet quite accurate, approximation. Indeed, it appears that the pitch of pure tones with frequencies lower than 1–2 kHz decreases as the intensity increases, while the pitch of pure tones with frequencies higher than 1–2 kHz increases as the intensity increases [30, 34]. This effect is quite small, however. What happens to the pitch of a sinusoid when it is played in the presence of other sounds? In fact, various things can happen. For instance, if a second sinusoid is present with a frequency that differs only by a few Hz, both tones interfere. In fact, they merge perceptually into one sound signal perceived as a pure tone with a frequency that is intermediate between the frequencies of the two tones. This pure tone fluctuates in loudness with a rate equal to the difference frequency of the two tones. This will be discussed in more detail in Sect. 1.5. Another instance occurs when other sounds with rich spectral structures are present such as music. In that case, the pure sinusoid may merge with one of the musical signals so that it is no longer heard separately as a pure tone with its own pitch. This, too, will be discussed, e.g., in Chap. 4. Here, a sinusoid in the absence of other sounds will be discussed. In that case, the sinusoid is indeed perceived as a pure tone with a pitch that, as just mentioned, closely corresponds to the frequency of the


Fig. 1.4 Use of a 6.4 ms analysis window to gate out five segments of a pure 440 Hz sinusoid. The bottom panel gives the signal and five analysis windows; the middle panel presents the five windowed signals; the top panel their power spectra. (Matlab) (demo)

sinusoid. There is a problem here, however, and that relates to the fact that, mathematically, frequency resolution must be traded off against temporal resolution. Indeed, the accuracy with which the frequency of a sound component can be estimated depends on the number of periods on which that estimate is based. The more periods, the more accurate the frequency estimate can be. Hence, a high accuracy in frequency estimation requires a long sound interval. This can be problematic, however, when the signal changes within the duration of that interval. The consequences of this will first be illustrated based on a pure tone with constant frequency. The result of a frequency analysis based on a shorter interval will be compared with the result of a frequency analysis based on a longer interval. In signal processing, this is done by defining an analysis window, a function of time that gates out a limited interval of the signal, and then determining the amplitude spectrum of this gated-out interval. Figures 1.4 and 1.5 both show five such analysis windows in their lower panels as bell-shaped curves numbered 1–5. Figure 1.4 shows five short windows of 6.4 ms; in Fig. 1.5, the analysis windows last 51.2 ms. The 6.4 ms window in Fig. 1.4 covers about 0.0064 · 440 = 2.82 periods of the sinusoid, while the 51.2 ms window


Fig. 1.5 Use of a 51.2 ms analysis window to gate out five segments of a pure 440 Hz sinusoid. The bottom panel gives the signal and five analysis windows; the middle panel presents the five windowed signals; the top panel their power spectra. The peak in the spectra is now much narrower than in Fig. 1.4, indicating the higher frequency resolution. (Matlab) (demo)

in Fig. 1.5 covers about 0.0512 · 440 = 22.53 periods. The five gated-out segments are presented in the middle panel of the figures. The idea is now that the higher the number of periods, the more accurately the duration of one period, and hence the frequency of the signal, can be estimated. This is illustrated in the five top panels of Figs. 1.4 and 1.5, showing the power spectra of the gated-out sound intervals. The power spectrum gives the distribution of the power of the signal over the various frequencies. Of course, the main power is centred on a peak at 440 Hz, the frequency of the sinusoid. In the power spectrum calculated with the short analysis window, the peak is much wider than in the power spectrum with the longer analysis window. In the power spectra of Fig. 1.4, there are many smaller peaks, or ripples, 50 dB or more below the main peak. They are a consequence of the exact shape of the analysis window and, hence, do not represent properties of the signal. They will be ignored. The main point is that, in general, the shorter the analysis window, the wider the spectral peaks, which means that the high temporal resolution implied by the short


Fig. 1.6 The waveform, the narrow-band spectrogram, and the wide-band spectrogram of a pure 440 Hz tone in cosine phase with an abrupt onset and offset. Because of the "spectral splatter" at this abrupt onset and offset, the listener actually perceives three different sounds: two clicks and one pure tone. (Matlab) (demo)

analysis window goes at the cost of the spectral resolution, and the other way round. This can clearly be seen in the five panels in the upper parts of Figs. 1.4 and 1.5. One may now ask the general question: What must be the length of the analysis window? The answer to this question depends on the required temporal and frequency resolution of the analysis. In general, one can say that it can have advantages to use a long analysis window for slowly changing signals because the frequency of the peaks can then be determined more accurately. But if a signal does change significantly within an analysis window, the resulting power spectra will not show these changes but will distribute the spectral characteristics of the different signal segments over the duration of the whole analysis window. For such rapidly changing signals, an analysis window should be so short that the signal does not change significantly within the interval of the window. This is further elaborated in the next two figures, Figs. 1.6 and 1.7, in which the spectrogram is introduced. The spectrogram represents the distribution of the


Fig. 1.7 Waveform, the narrow-band spectrogram and the wide-band spectrogram of a pure 440 Hz tone with a gradual, 20 ms, onset and offset. Owing to this gradual onset, the “spectral splatter” at the onset and offset is much reduced, and the auditory system parses this sound into one single sound, a pure tone. Compare this with Fig. 1.6. (Matlab) (demo)

power of the signal over frequency, i.e., the power spectrum, as a function of time. The horizontal dimension represents time, and the vertical dimension frequency. In combination, this is called the spectrotemporal domain or time-frequency domain. To obtain the spectrogram, the power spectrum is calculated for a large number of overlapping analysis windows at equidistant moments of the signal. The spectrogram is presented on a gray scale; lighter colours will be used for lower power and darker colours for higher power. As explained above, the duration of the analysis window determines the temporal and spectral resolution of the spectrograms. A spectrogram calculated with a short analysis window has a high temporal resolution and a low spectral resolution resulting in wide spectral peaks, which is why it is called a wide-band spectrogram or broad-band spectrogram. On the other hand, a spectrogram calculated with a long analysis window has a low temporal resolution but a high spectral resolution resulting in narrow spectral peaks. This will be called a narrow-band spectrogram.
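The resolution trade-off just described can be explored with a few lines of Matlab. The sketch below is an illustration, not the code used for the figures in this book; the Hann window shape, the sample rate, and the 2 kHz plotting range are assumptions. It computes the power spectrum of a 440 Hz sinusoid gated out with a 6.4 ms and with a 51.2 ms window, as in Figs. 1.4 and 1.5.

% Sketch: power spectra of a 440 Hz sinusoid gated out with a short (6.4 ms)
% and a long (51.2 ms) analysis window, illustrating the resolution trade-off.
fs = 16000;                                  % sample rate (assumption)
t  = (0:1/fs:0.5-1/fs);                      % 0.5 s of signal
x  = sin(2*pi*440*t);
for winDur = [0.0064 0.0512]                 % window durations in s
    N   = round(winDur*fs);
    w   = 0.5 - 0.5*cos(2*pi*(0:N-1)/(N-1)); % Hann-shaped analysis window
    seg = x(1:N) .* w;                       % gate out one segment
    nfft = 8192;
    P   = abs(fft(seg, nfft)).^2;            % power spectrum
    f   = (0:nfft-1) * fs / nfft;
    PdB = 10*log10(P / max(P));              % normalize to 0 dB at the peak
    figure; plot(f(f <= 2000), PdB(f <= 2000));
    xlabel('frequency (Hz)'); ylabel('power (dB)');
    title(sprintf('%.1f ms window', winDur*1000));
end
% The peak at 440 Hz is much wider for the 6.4 ms window than for the 51.2 ms one.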


This all is demonstrated in Fig. 1.6 showing the waveform and both the narrow-band and the wide-band spectrogram of a 440 Hz cosine switched on at 0 ms and switched off 200 ms later. The top panel shows the wide-band spectrogram, calculated with a relatively short analysis window of 6.4 ms; the middle panel shows the narrow-band spectrogram, calculated with a longer analysis window of 51.2 ms; the bottom panel shows the waveform. First, the wide-band spectrogram in the top panel will be discussed. At the middle of the signal, at around 100 ms, the signal is relatively stable. This part of the tone is represented by a wide horizontal band between about 200 and 700 Hz. Compare this with the power spectra presented in the top panel of Fig. 1.4, which, in fact, represent vertical cross-sections through this wide-band spectrogram. As to the abrupt onset of the tone at 0 ms and its offset at 200 ms, they are represented by vertical bands in this wide-band spectrogram. The duration of these vertical bands is quite short, a few ms perhaps, but in the frequency dimension they extend from 0 to over 1.5 kHz; so, in the wide-band spectrogram, the onset and offset are well represented in time with an accuracy of a few ms, but, in frequency, they extend over a considerable range. In other words, the temporal resolution is relatively high, while the frequency resolution is relatively low, certainly when compared with the narrow-band spectrogram presented in the middle panel of Fig. 1.6. In this narrow-band spectrogram, the onset at 0 ms and the offset at 200 ms are represented in time over a much longer interval; at the onset from about −20 to 20 ms, and at the offset from about 180 to 220 ms. The vertical bands at the onset and offset, however, are now much better defined in frequency, and have almost vanished above about 1 kHz. The centre part of the tone is represented by a much narrower band to be compared with the power spectra presented in the upper panel of Fig. 1.5. This means that the temporal resolution is now relatively low, while the frequency resolution is relatively high. The abrupt onset and offset have scattered representations in both spectrograms, but in the wide-band spectrogram the distribution of the signal power is more scattered in the frequency dimension than in the time dimension; in the narrow-band spectrogram the power of the signal is less scattered in the frequency dimension, but more so in the time dimension. The scattering of power in the vertical or the horizontal dimension at abrupt transitions of a sound is called spectral splatter. What does all this mean for perception? What is heard when a 440 Hz sinusoid is abruptly switched on and then switched off after 200 ms? The amazing thing is that, although there is actually only one sound source, three sounds can be heard. In the first place, there is the 440 Hz tone, represented by the stable horizontal band in the spectrograms of Fig. 1.6, but it is accompanied by two clicks, represented by the spectral splatter in Fig. 1.6, one click at the beginning and one click, most often somewhat weaker, at the end of the tone. Recapitulating, the representations of rapid transitions such as abrupt onsets and offsets in the sound signal are not precisely localized in the spectrotemporal domain. In the wide-band spectrogram, they are smeared out in the frequency dimension; in the narrow-band spectrogram in the time dimension.
Such very abrupt changes in the signal can elicit the perception of clearly audible clicks that are perceived as produced by separate sound sources. In very loud sounds, such as fire alarms, they can even induce a startle response in the unsuspecting listener [19]. Apparently, the spectral


splatter at the onset and offset of the tone is not heard as hard attacks or releases, but as separate sounds that sound like clicks. The hearing system interprets the incoming information as three different sounds: a click at the onset of the pure tone, then the pure tone, and a click at the offset. When, e.g., due to transmission errors, interruptions in a sound succeed each other very rapidly, i.e., more than a few times per second, the successions of clicks can form a separate, very creaky sound. This can be so creaky that it dominates what is heard and can obscure the original sound. Apparently, the incoming acoustic information is perceptually interpreted as coming from two sound sources, one the original sound, and the other a disturbing creaky sound. The principles governing the distribution of auditory information among the various perceived sound sources, auditory streams, as they will be called, will be discussed in detail in Chaps. 4 and 10. In the time dimension of the narrow-band spectrogram, the changes are relatively gradual, at least much more gradual than those of the wide-band spectrogram, but realize that this is a property of the analysis window, not of the signal; physically, the signal has an almost instantaneous onset at the start. This is better represented in the wide-band spectrogram, where the temporal resolution is much better though still limited. In conclusion, the wide-band spectrogram represents the temporal properties of the signal more accurately than the narrow-band spectrogram. On the other hand, the spectral resolution of this wide-band spectrogram is less than that of the narrow-band spectrogram, as can be seen in Fig. 1.6, where the 440 Hz frequency is represented by a much narrower line in the narrow-band spectrogram than in the wide-band spectrogram. Studying spectrograms is a very powerful way to study sound signals, and they will be presented regularly, but one must always realize that the frequency and the temporal resolution are limited in all spectrograms, and that a higher resolution in one dimension always has its penalty in the other dimension. A more extended, non-mathematical treatment of the material in the context of speech processing can be found in Rosen and Howell [24]; a more mathematical treatment of Fourier analysis in the context of music processing can be found in, e.g., Müller [15, Chap. 2, pp. 39–114]. It has already been mentioned that the clicks at onsets and offsets can be very annoying. A simple way to remove them is to fade the signal in and out more gradually. This is illustrated in Fig. 1.7, showing the waveform, the wide-band spectrogram, and the narrow-band spectrogram of the same 440 Hz tone as in Fig. 1.6, but now switched on and off more gradually with a fade-in and fade-out time of 20 ms. The spectral splatter in the spectrograms is now much reduced. In fact, when listening to this sound, only one sound is heard, the short sound of a 200 ms 440 Hz pure tone. Clicks at onset and offset are no longer audible. Wright [36] showed that, for 80 dB 1000 Hz tones, rise and decay times as short as 5 ms are enough to prevent the audible clicks at the onset and offset of these tones. For tones of lower intensity, even shorter rise and decay times are enough. In the demos of this book, spectral splatter will always be removed in this way. This can be checked by clicking on Matlab at the end of the figure captions, which gives the Matlab code with which the figures and the demos have been produced.
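A sketch of such a fade-in and fade-out in Matlab (an illustration, not the book's own script; the raised-cosine ramp shape is one common choice among several):

% Sketch: remove onset/offset clicks by fading a tone in and out with
% raised-cosine ramps (ramp duration and shape are assumptions).
fs   = 44100;
dur  = 0.2;                         % 200 ms tone
t    = (0:1/fs:dur-1/fs);
x    = cos(2*pi*440*t);             % abrupt cosine-phase tone (clicks audible)
ramp = 0.02;                        % 20 ms fade, as in Fig. 1.7
Nr   = round(ramp*fs);
env  = ones(size(x));
env(1:Nr)         = 0.5 - 0.5*cos(pi*(0:Nr-1)/(Nr-1));   % fade-in
env(end-Nr+1:end) = env(Nr:-1:1);                        % mirrored fade-out
y    = x .* env;                    % faded tone: no clicks at onset or offset
% soundsc(x, fs); pause(1); soundsc(y, fs)   % compare the two versions by ear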


Fig. 1.8 Waveform, narrow-band spectrogram, and wide-band spectrogram of a pure 440 Hz tone with exponentially decaying envelope; the decay time is 25 ms. In spite of the abrupt onset no tick is heard at the start of the sound. (Matlab) (demo)

The next figure, Fig. 1.8, shows a sinusoid with an abrupt, almost instantaneous onset, and a gradually decaying amplitude. Mathematically, this sound can be represented as a function of time t:

s(t) = a e^(−t/τ) sin(2πft),  t ≥ 0.    (1.2)

This is a sine wave a sin(2πft) with amplitude a and frequency f modulated by the exponential function e^(−t/τ) with decay time τ. The decay time is the time it takes for the amplitude to decrease to a proportion of 1/e ≈ 0.368 of its original value. The corresponding half-life, the time it takes for the amplitude to halve, is ln(2) ≈ 0.693 times the decay time. In Fig. 1.8, the frequency of the sinusoid is 440 Hz and the decay time of the exponential is 25 ms, which corresponds to a half-life of about 17.3 ms. The waveform, the narrow-band spectrogram, and the wide-band spectrogram are presented in Fig. 1.8. The waveform, depicted in the bottom panel, shows, indeed, an abrupt onset followed by a sine wave gradually decreasing in amplitude. The onset is about the same as the onset of the constant tone presented


Fig. 1.9 Waveform, narrow-band spectrogram, and wide-band spectrogram of a pure 2000 Hz tone with exponentially decaying envelope; the decay time is 25 ms. The auditory system parses this into one single sound, that of a short metallic impact. (Matlab) (demo)

in Figs. 1.4 and 1.5. In the narrow-band spectrogram in the middle panel, this onset is smeared out in time, but the spectral splatter in the frequency dimension is limited. In the wide-band spectrogram in the top panel, the timing of the onset is much more accurate, but the spectral splatter in the frequency dimension is much stronger, extending beyond 1.5 kHz. In the narrow-band spectrogram, the slowly decreasing decay of the 440 Hz tone is represented by a much narrower horizontal line gradually decreasing in amplitude. In the wide-band spectrogram in the top panel, the decay of the tone is represented by a much wider band. Again, the temporal resolution is less fine in the narrow-band spectrogram than in the wide-band spectrogram, while the frequency resolution is finer in the narrow-band spectrogram than in the wide-band spectrogram. What does a pure tone with an exponentially decaying envelope sound like? A clear attack is heard, and the spectral splatter seen in the spectrogram of the decaying sinusoid in Fig. 1.8, is considerable, though not as large as in Fig. 1.6. This is because in this decaying exponential the sinusoid does not start in cosine phase, as in Fig. 1.6,


but in sine phase. The abrupt onset is hardly, if at all, accompanied by a separate, audible click. When listening to the sound presented in Fig. 1.8, some listeners may interpret it as an impact sound, e.g., as the sound of a soft stick on a wooden bar. The percept of an impact sound becomes much stronger when the frequency of the sinusoid is increased to, e.g., 2 kHz, shown in Fig. 1.9. In that case, the sound really has the character of a metallic impact. Apparently, the auditory system recognizes this as an impact of a hard object on another, metal object.
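A sketch of Eq. 1.2 in Matlab; the decay time follows Figs. 1.8 and 1.9, while the sample rate and total duration are arbitrary assumptions:

% Sketch: damped sinusoid of Eq. 1.2, s(t) = a*exp(-t/tau)*sin(2*pi*f*t).
fs  = 44100;
t   = (0:1/fs:0.5);              % half a second is enough for the sound to die out
a   = 1;
tau = 0.025;                     % 25 ms decay time, as in Figs. 1.8 and 1.9
for f = [440 2000]               % 440 Hz and 2000 Hz (metallic) versions
    s = a * exp(-t/tau) .* sin(2*pi*f*t);
    soundsc(s, fs); pause(0.7);  % listen to the impact-like sound
end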

1.1.3 Phase

The demo of Fig. 1.6 played a sinusoid starting in cosine phase, s(t) = a cos(2πft). In that demo, the tone is abruptly switched on and off, which implies that the signal value jumps from 0 to a at its start and then from a back to 0 at its end, so that the signal is not continuous there. As has been argued, the consequence is that the tone is not only heard as a pure tone, but is accompanied by two clearly audible clicks, one at the onset and the other at the offset of the tone. In the demo of Fig. 1.7, the tone is more gradually faded in and out, so that the signal is no longer discontinuous. As a consequence, the clicks at the onset and offset of the tone disappeared. This is more systematically illustrated in the demos of the next two figures, Figs. 1.10 and 1.11. In these demos, five 100 ms 440 Hz tones are played starting in different phases. When the tone is represented as s(t) = a sin(2πft + φ), φ is varied over −π/4, −π/6, 0, π/6, and π/2. The figure shows that the discontinuity at the onset and offset of the tones is largest for the first and the last tone, and intermediate in the second and the fourth. For φ = 0, the tone starts in sine phase and is continuous at the onset and offset, though its derivative is not continuous. Listening to the demo of Fig. 1.10, one clearly hears the clicks at the onset and offset of the tones, most clearly for the first and the last, less clearly for the second and the fourth, and least clearly in the third tone. In Fig. 1.11, the tones are gradually faded in and out in 10 ms. The clicks are no longer audible, and all five tones sound the same. It is often said that our auditory system is quite insensitive to phase. Indeed, if a signal has a limited number of components with widely spaced frequencies and the transients such as the onset and offset are not too abrupt, it does not matter perceptually in what phase the individual frequency components are played. The demos of Figs. 1.6 and 1.10 showed, however, that the phase at which a tone begins or ends determines the audibility of clicks at the onset and offset of the tone, when these are abrupt. Later in this chapter, in Sect. 1.8.3, it will additionally be shown that, if the components of a sound come close together in frequency, they start to interfere perceptually with each other. This has significant auditory consequences, which will be discussed at various instances later on.
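The size of the jump at an abrupt onset is simply |a sin φ|, the value of the signal at t = 0, and it gives a rough indication of how salient the onset click will be. A minimal sketch for the five start phases of Fig. 1.10 (an illustration, not the book's demo code):

% Sketch: size of the discontinuity at an abrupt onset for the five start
% phases used in Fig. 1.10; a larger jump means a more audible click.
a   = 1;
phi = [-pi/4 -pi/6 0 pi/6 pi/2];
for k = 1:numel(phi)
    jump = abs(a * sin(phi(k)));   % signal value at t = 0
    fprintf('phi = %6.3f rad: onset jump = %.3f\n', phi(k), jump);
end
% For phi = 0 (sine phase) the jump is 0 and no click is heard at the onset.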


Fig. 1.10 Short 100 ms 440 Hz sinusoids starting and ending in phases varying over −π/4, −π/6, 0, π/6, and π/2. Note the change in the loudness of the clicks at the onsets and offsets of the tones. (Matlab) (demo)

1.2 Complex Tones

In hearing research, an important role is attributed to what will be referred to as complex tones, tonal sounds, or simply tones. Tonal sounds are sounds consisting of the sum of a limited number of sinusoids. Each separate sinusoid is called a partial. The partial with the lowest frequency is called the fundamental, the higher partials are called overtones. Depending on the relation between the partials, complex tones will be referred to as harmonic or inharmonic tones. Roughly speaking, tones are harmonic if they are periodic. Fourier theory tells us that the frequencies of the partials of a periodic signal are all integer multiples of a common frequency, the fundamental frequency or F0, the inverse of the period duration. The integer with which F0 has to be multiplied to get the frequency of the harmonic partial is referred to as the harmonic rank of that partial. Some examples of such Fourier expansions will be presented in Sect. 1.8. If F0 is within the range of 50–5000 Hz, the partials of harmonic sounds generally merge perceptually into well-defined tones in which the various partials largely lose their identity. Although, at least for relatively stable tones, trained listeners may be able to hear out separate partials, the merged percept of one coherent tone dominates.


Fig. 1.11 Same tones as in Fig. 1.10, but now with a 10 ms fade-in and fade-out, effectively removing the clicks at the onsets and offsets. (Matlab) (demo)

Why is F0 limited here to the range between 50 and 5000 Hz? That is because, below 50 Hz and above about 5000 Hz, it is no longer possible to play musical melodies whose tones have well-defined pitches. This is why the frequency range between 50 and 5000 Hz will be referred to as the pitch range. This issue will be discussed in detail in Sect. 8.3. So, harmonicity is the property of periodic sounds. When the F0 of such a sound is within the pitch range, such harmonic sounds generally have a well-defined pitch, the frequency of which, at least as a very good first approximation, corresponds to F0. The pitch of sounds plays a very important role in, e.g., the intonation of our speech and the melodies of music, and is often an important attribute of sounds used in animal communication. If there is no frequency of which the frequencies of all partials are integer multiples, the tones are referred to as inharmonic tones. Inharmonic tones generally do not have a well-defined percept of pitch for the tone complex as a whole. One of the most straightforward ways to get a harmonic sound is to add up a number of sinusoids with frequencies that are all multiples of an F0. This is illustrated in Fig. 1.12, in which the first eight harmonics of a 440 Hz fundamental are successively added up. In the bottom panel the waveform is shown but, as the reader can see, only the envelope is well represented; the individual signal periods are not visible as they are completely clogged. What these signal periods look like will be discussed in more detail below


Fig. 1.12 Addition of harmonics. The upper panel shows the schematic time-frequency representation of the stimulus, the middle panel its spectrogram, and the bottom panel its waveform. First, only the first harmonic is played; then the next seven harmonics of rank 2–8 are successively included. The pitch of the tones remains the same; what changes is their timbre. (Matlab) (demo)

in Sect. 1.8. The envelope of the tones is trapezoidal, so that the onsets and offsets are not abrupt and no clicks are audible at the onsets and offsets. The amplitude of these envelopes gets higher with each added harmonic. The middle panel presents the narrow-band spectrogram of the stimulus. The splatter is insignificant. The upper panel shows what will be called the schematic time-frequency representation of the stimulus. This shows the separate partials in the spectrotemporal domain with a width in the vertical direction that is proportional to the amplitudes of the partials. This representation may give a more direct impression of how the stimulus is composed. When listening to the demo of Fig. 1.12, it is important to realize that the pitch of the successive tones remains the same, in spite of the addition of higher harmonics. The first tone is a pure tone and heard as such. As more harmonics are added, the timbre of the tones gets more and more complex, sometimes called "richer". The individual harmonics have lost their identity. The loudness also increases with every tone due to the addition of more harmonics. How the percept of loudness depends on the composition and time course of a complex tone will be discussed in detail in Chap. 7.
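A sketch of the additive synthesis underlying Fig. 1.12 (not the book's own script; equal harmonic amplitudes, a 300 ms tone duration, and 20 ms linear ramps are assumptions):

% Sketch: successively add harmonics 1..8 of a 440 Hz fundamental; each tone
% is gated with a trapezoidal envelope so that no onset/offset clicks are heard.
fs  = 44100; F0 = 440; dur = 0.3; rampDur = 0.02;
t   = (0:1/fs:dur-1/fs);
Nr  = round(rampDur*fs);
env = ones(size(t));
env(1:Nr)         = (0:Nr-1)/Nr;            % linear fade-in
env(end-Nr+1:end) = (Nr-1:-1:0)/Nr;         % linear fade-out
seq = [];
for nHarm = 1:8
    s = zeros(size(t));
    for h = 1:nHarm
        s = s + sin(2*pi*h*F0*t);           % add the harmonic of rank h
    end
    seq = [seq, s .* env, zeros(1, round(0.1*fs))];  % tone plus 100 ms silence
end
soundsc(seq, fs);    % pitch stays at 440 Hz; timbre gets "richer" with each tone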


1.3 Speech Sounds

Until now, only sounds consisting of a sinusoid or a sinusoid and its harmonic overtones have been discussed. Based on their mathematical description, it is not too difficult to get an impression of the waveforms and the spectrograms of such simple sounds. For more complex sounds, a mathematical description is not readily available and, based on their sound, it is very difficult to imagine what their waveforms and their spectrograms look like. In fact, the spectrogram is very often used as an analysis tool for these sounds, since it shows the distribution of the energy of their spectral components over time. How spectrograms are obtained has been illustrated and explained for pure tones in Figs. 1.4, 1.5, and 1.6 of the previous Sect. 1.1.2. In the next figures, the narrow-band and the wide-band spectrogram will now be described based on the speech utterance "All lines from London are engaged", the same utterance as presented in Fig. 1.1. In Fig. 1.13 the waveform and the narrow-band spectrogram are presented. The vertical lines at about 0.35 s mark the location of a short segment of this utterance for which more details will be presented of how the spectrogram is calculated (see Fig. 1.14). When one looks at the narrow-band spectrogram in Fig. 1.13, one of the first things that draws attention is the large number of wavy, more or less parallel lines that run from left to right. At the vertical lines at about 350 ms, about 35 such lines can be counted between 0 and 5000 Hz. These lines represent the harmonic partials of the voiced segments of the utterance. Hence, these segments can be described

Fig. 1.13 Waveform and narrow-band spectrogram of the utterance "All lines from London are engaged", spoken by a male speaker. The vertical lines in the spectrogram at about 0.35 s indicate the position of the segment selected for the more detailed analysis shown in Fig. 1.14. The horizontal line segments marked "uv" indicate the unvoiced segments. (Matlab) (demo)


Fig. 1.14 Detail of the short segment of the narrow-band spectrogram shown in Fig. 1.13. For the explanation, see the text. (Matlab) (demo)

as the sum of sinusoids with frequencies that are all multiples of the fundamental frequency, F0, here about 5000/35 ≈ 140 Hz. In other words, these speech sounds are periodic with a period of about 1000/140 ≈ 7.1 ms. Just as for the harmonic signal shown in Fig. 1.12, the pitch of the speech sound corresponds quite well to this F0. So, the course of the pitch of this utterance is represented by the lowest of all the harmonics shown in this narrow-band spectrogram. The course of the harmonics in Fig. 1.13 is only interrupted during the unvoiced parts of the utterance indicated in the oscillogram of the lower panel by the horizontal lines marked with "uv".


It must be remarked that, since fluent speech changes continuously, speech signals are generally not perfectly periodic or harmonic. It would, therefore, be better to talk about pseudo-periodic or pseudo-harmonic signals. As long as the changes in the speech signal are smooth and gradual, however, most often the shorter words "periodic" and "harmonic" will be used. One can see that the harmonics change in amplitude in the course of time, changes which are not random but form patterns in the form of the dark bands running over the harmonics from left to right. The width of these bands comprises several harmonics and, hence, is much greater than that of a single harmonic. They become weaker and stronger and go up and down in the course of the utterance. They are a consequence of the articulatory movements of the speaker and represent the resonances of the vocal tract, called formants [11]. These formants play an important role, e.g., in what vowel is heard. In a normal male voice, the resonant frequency of the first formant fluctuates around 0.5 kHz, the second around 1.5 kHz, the third around 2.5 kHz, etc. In Fig. 1.13, especially the first and the second formant are clearly visible as the two lower fluctuating dark bands running over the harmonics from left to right. The way in which this narrow-band spectrogram is calculated is illustrated in Fig. 1.14 for a short segment within the vowel "i" of the word "lines", the same as presented in the lower panel of Fig. 1.1. In Fig. 1.13, the location of this short segment is indicated by the vertical lines at about 350 ms. The waveform of the short segment is shown in the bottom panel of Fig. 1.14, also showing five analysis windows used in the calculation of this spectrogram. In this waveform, somewhat more than eleven periods of about 7.1 ms can be distinguished. These periods correspond to the periodic vibrations of the vocal cords. The five bell-shaped analysis windows are presented with a spacing of about half such a period. The windowed signals are presented in the second panel from below. As one can see, each windowed signal spans about seven periods of the speech signal. Since the signal does not change very much from period to period, each windowed signal very much resembles its predecessor except for a phase shift. When two signals are equal except for a phase shift, their power spectra are equal. These power spectra are presented in the top panel of Fig. 1.14, and one can see, indeed, that there are only minor differences between these spectra. These power spectra are vertical cross-sections through the spectrogram at the positions of the five vertical lines. The complete spectrogram is presented in the second panel from above. The large number of harmonics is quite clear, showing up as the nearly parallel, horizontal lines. Since the periodicity hardly changes, their frequency is about constant. What changes somewhat are the relative intensities of the harmonics due to the changes in formant frequencies induced by the articulatory movements of the speaker. The harmonics are relatively high in intensity at the positions of the formants. The frequencies of the first four formants are indicated by F1, F2, F3, and F4 in the power spectra of the windowed signals shown in the top panel. The frequency of the pitch of the speech, 140 Hz in this case, corresponds to F0, the frequency of the lowest harmonic of the spectrum, and also equals the distance in frequency between successive harmonics.
In summary, in the narrow-band spectrogram of this voiced speech signal, the periodicity of the signal expresses itself in well separated, more or less horizontal


Fig. 1.15 Waveform and wide-band spectrogram of the utterance “All lines from London are engaged”. The vertical lines in the spectrogram at about 0.35 s indicate the position of the segment selected for the more detailed analysis shown in Fig. 1.16. (Matlab) (demo)

lines, each of which represents a harmonic of the signal. The distance between the harmonics corresponds to the frequency of the periodicity of the speech signal. Due to the relatively long analysis window, the spectrogram changes relatively slowly, possibly obscuring rapid changes in the signal. Next, the wide-band spectrogram will be described. The wide-band spectrogram of the same utterance "All lines from London are engaged" is shown in the next two figures; that of the complete utterance in Fig. 1.15, and that of the short segment at about 0.35 s in Fig. 1.16. This instant is again indicated by the vertical lines in Fig. 1.15. The horizontally running lines representing the harmonics of the speech sounds in the narrow-band spectrogram can no longer be distinguished in this wide-band spectrogram. The periodicity of the speech signal is now represented by the densely packed, dark, vertical lines in the spectrogram. They mark the moments at which the acoustic energy produced by the closure of the vocal cords is highest. Every time the vocal cords close, a sound pulse is generated. These moments are indicated by the arrows in the bottom panel of Fig. 1.16. Such a pulse is filtered by the oral cavity resulting in amplification of the sound at the formant frequencies. After the closure of the vocal cords and the generation of the sound pulse, the sound damps out, so that within one period the sound intensity is first high as a result of the generated sound pulse, but then decays until the next closure of the vocal cords. This can be inferred from the signal waveform shown in the bottom panel of Fig. 1.16, and becomes clear when the wide-band spectrogram is calculated. The wide-band spectrogram is calculated based on relatively short analysis windows, shorter than the period between two successive closures of the vocal cords. At the same instants


Fig. 1.16 Detail of the wide-band spectrogram shown in Fig. 1.15. The arrows in the bottom panel indicate the estimated moments of the glottal closures. An interval between two successive arrows corresponds to one pitch period. (Matlab) (demo)

as in Fig. 1.14 for the narrow-band spectrogram, these bell-shaped analysis windows are shown in the bottom panel of Fig. 1.16, together with the waveform. The first, the third, and the fifth analysis window are positioned where the intensity of the speech signal is about maximum; the second and the fourth window are positioned in between, where the intensity is relatively low. This can be seen in the windowed signals shown in the second panel from below. The difference between the first, the third, and the fifth segment on the one hand, and the second and the fourth on the other


is evident, and so are the differences in their power spectra shown in the top panel of Fig. 1.16. Due to the fact that each analysis window covers significantly less than one period between the closures of the vocal cords, the periodicity no longer turns up as a parallel series of horizontally running harmonics. Since the analysis window is shorter than one period, changes in intensity within one period become visible, resulting in the dark, vertical lines just after the moments of the glottal closures. Since, as explained above, periodic signals with an F0 in the pitch range have a pitch frequency closely corresponding to this F0, one period is referred to as a pitch period. In summary, in the calculation of the wide-band spectrogram of speech, the analysis window is shorter than the pitch period. Consequently, the power spectra of the windowed signals no longer show any peaks at the positions of the harmonic frequencies. The periodicity of the signal now expresses itself in the periodic increases and decreases in intensity resulting in vertical lines in the spectrogram of the sound, a consequence of the fine temporal resolution, which now is finer than one pitch period. The downside is that, since an analysis window in the wide-band spectrogram covers less than one signal period, the frequency resolution is less than the distance between harmonic frequencies. Hence, in the power spectra shown in the top panel of Fig. 1.16, individual harmonics cannot be distinguished. The position of the formants is now clearer, at least in the spectra of the first, the third, and the fifth segments, where the intensity is relatively high. The formants are less well-defined in the spectra of the second and the fourth segment. Comparing the power spectra and the spectrograms in Figs. 1.13 and 1.14 on the one hand, and those in Figs. 1.15 and 1.16 on the other, the differences are striking. It is very important to realize, however, that these differences are not due to differences in the represented signal. The narrow-band and the wide-band spectrogram represent different aspects of the same signal. The narrow-band spectrogram has a relatively good spectral resolution so that components that are close in frequency can still be distinguished. Close in frequency here means that the analysis window is so long that it covers at least two periods of the signal. But this good frequency resolution comes at the expense of the temporal resolution. Changes in the signal with a time course that is shorter than one analysis window are smeared out over time and obscured in this way. The wide-band spectrogram on the other hand, obtained with a short analysis window, is good at resolving short-term changes in the signal. When the analysis window is shorter than one signal period, changes in the signal within one period can become apparent. This good temporal resolution is at the expense of the spectral resolution, however. Since the analysis window covers less than one signal period, the harmonics of the signal do not show up as a series of peaks in the power spectrum. One cannot say that wide-band spectrograms are better or worse than narrow-band spectrograms. Which one is more useful depends on what one wants to know, since both represent different aspects of the same sound signal. A different question is whether the wide-band or the narrow-band spectrogram better represents aspects of the sound that are perceptually relevant.
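Readers who want to explore this themselves can compute both kinds of spectrogram with Matlab's spectrogram function, assuming the Signal Processing Toolbox is available. The sketch below is an illustration, not the code used for the figures in this book; the file name is a placeholder for any recorded utterance, and the window lengths and overlap are illustrative choices.

% Sketch: narrow-band and wide-band spectrograms of a speech recording.
% 'utterance.wav' is a placeholder; supply any recorded utterance.
[x, fs] = audioread('utterance.wav');
x = x(:,1);                                    % use the first channel

Nnarrow = round(0.0512*fs);                    % ~51 ms window: narrow-band
Nwide   = round(0.0064*fs);                    % ~6 ms window: wide-band

figure;
subplot(2,1,1);
spectrogram(x, hann(Nnarrow), round(0.9*Nnarrow), 4096, fs, 'yaxis');
title('narrow-band: harmonics visible as horizontal lines');
subplot(2,1,2);
spectrogram(x, hann(Nwide), round(0.9*Nwide), 4096, fs, 'yaxis');
title('wide-band: glottal pulses visible as vertical striations');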
In the next chapters, it will appear that the auditory periphery can be considered as a long series of overlapping band-pass filters.


Each of these filters has a specific time and frequency resolution which depends on the centre frequency of the filters. It will turn out that this time and frequency resolution of the auditory filters plays a decisive role in determining what is perceptually relevant and what is not. The properties of these auditory filters will be discussed in detail in Chap. 3.

1.4 Musical Scales and Musical Intervals

Figure 1.17 shows a schematic representation of part of a piano keyboard. The number below a white key and above a black key represents the fundamental frequency (F0), or simply the frequency, of the tone that is produced when the key is played. The lowest note shown in the figure has a frequency of 174.6 Hz; the highest note one of 1319 Hz. In traditional, modern Western music, tuning an instrument starts with the A of 440 Hz, here presented somewhere in the middle of the figure. In music, the distance between two tones is called a musical interval or simply interval. Musical intervals are equal when the ratios of the frequencies of the two tones that constitute the intervals are equal. This is why a musical scale is often referred to as a ratio scale. Hence, the interval between two tones of which the lower has frequency f1,low and the higher has frequency f1,high is equal to the interval between two tones with frequencies f2,low and f2,high, if f1,high/f1,low = f2,high/f2,low. A central role in all music is played by the octave, for which this ratio of the frequencies of the two tones is 2. In the scales of Western music, musical notes that differ by one octave bear the same name. For the white keys on a keyboard, these names are the letters A to G, as shown in Fig. 1.17. For the black keys, the naming is somewhat more complicated, because it depends on the key in which the music is played. Without going into detail about harmony, they can either be thought of as increments of the white key to their left, or as decrements of the white key to their right. If they are regarded as increments of the key to their left, they are called sharp, the sign for which is ♯, e.g., F sharp,

Fig. 1.17 Some keys of the mid-range of a piano with their frequencies in hertz. The numbers of the various octaves into which the keyboard is divided are presented in the lowest row. (Matlab)


F♯, indicates the black key just above the F; if they are regarded as decrements of the white key to their right, they are indicated with flat, for which the sign is ♭, e.g., G flat, G♭, is the black key below the G, which, as one can see, indicates the same black key as F sharp, F♯. In order to indicate their position, the keyboard is divided into numbered octaves. The transition between the octaves is between the B and the C. The 440 Hz A note is in the fourth octave and so is indicated with A4; the black key to its right is A♯4 or B♭4, then comes B4, followed by C5, etc. It is important to realize that the white keys only represent a subset of the twelve notes of an octave, and that the interval between two adjacent white keys depends on whether there is a black key in between them or not. When there is a black key between two white keys, the interval between these white keys is twice as large as when there is no black key between them. As one can see in Fig. 1.17, the absence of a black key only occurs between the B and the C, and between the E and the F. In order to play all musical notes on a piano keyboard within one octave in succession, one must play both white and black keys. One then gets what is called a chromatic scale, demonstrated below in Fig. 1.20. In order to calculate the frequencies of these notes, the octave must be divided into twelve equal intervals, equal on a logarithmic frequency scale. One such interval, representing one twelfth of an octave, is called a semitone. The ratio of the frequencies of two tones differing by one semitone is 2^(1/12). As a consequence, any musical interval can be represented by 2^(k/12), k = ..., −3, −2, −1, 0, 1, 2, 3, .... In modern Western music, the A of 440 Hz is the base of all tunings. This implies that the frequencies of all notes of a piano can be represented by 440 · 2^(k/12), k = ..., −3, −2, −1, 0, 1, 2, 3, .... The frequency values calculated in this way for three octaves of a piano keyboard are given in hertz with an accuracy of four decimals in Fig. 1.17. The reader can now check that the interval between two adjacent white keys is one semitone when there is no black key in between them, whereas it is two semitones when there is a black key in between them. A combination of two concurrent tones is called a dyad. All possible musical intervals between the two notes of a dyad are integer multiples of a semitone. If n is that multiple, an interval is called a prime or unison for n = 0; for n = 1, it is called a minor second or semitone; for n = 2, a major second; for n = 3, a minor third; for n = 4, a major third; for n = 5, a perfect fourth; for n = 6, a tritone; for n = 7, a perfect fifth; for n = 8, a minor sixth; for n = 9, a major sixth; for n = 10, a minor seventh; for n = 11, a major seventh; and for n = 12, it is naturally called an octave. The terms minor, major, and perfect are based on consonance and the laws of harmony, about which somewhat more will be said in Sect. 10.12.5. This vocabulary can be extended with the terms "diminished" or "augmented" to indicate that an interval is decreased or increased, respectively, by one semitone. For the details of this, the reader is referred to a book on harmony, e.g., Parncutt [17]. Why are there white and black keys on a piano? The reason is that, in most Western music, a musical scale only consists of a subset of these twelve notes. Usually there are seven different notes within a scale, which is then called a diatonic scale.
When the musical interval between the first and the third note is a minor third, it is called


Fig. 1.18 Diatonic scale played in the key of C major. (demo)

Fig. 1.19 Two diatonic scales, the first in the key of A major and the second in the key of D♭ major. (demo)

a minor-third scale; when this interval is a major third, it is called a major-third scale. The most common of these is the scale that begins on C and consists only of the subsequent white keys of a keyboard. When played in ascending order, the C one octave higher than the starting C is mostly also played, resulting in an ascending diatonic scale on C. The successive notes are often indicated with Do, Re, Mi, Fa, Sol, La, Ti, Do. Including the lower and the upper Do, this series of notes comprises eight notes, which is why the interval between the first Do and the last Do is called an octave, from octo, the Latin word for eight. So, the first Do has an octave relation with the last Do, and, in this scale, represents the same note. Starting on C, one can then play a diatonic scale by playing the successive keys, C, D, E, F, G, A, B, and C. One says that this diatonic scale is played in the key of C. The musical notation of this scale starting at C5 is presented in Fig. 1.18. Since the interval between the first C and the E is a major third, this is a major-third scale. One says that this diatonic scale is played in the key of C major. Musical chords are formed by combinations of different notes. A triad is formed by a succession of three notes separated by thirds. When the interval between the first and the second note is a minor third, one gets a minor-third triad; when it is a major third, a major-third triad. A major-third triad remains a major-third triad when the notes are played in a different order or when the notes are increased or decreased by an octave. This can be extended to chords consisting of four notes, tetrads, and, in more modern music, to pentads, hexads, etc. In this way one can create a multitude of different musical chords, each with its own harmonic characteristic. For diatonic scales on other notes than C, one needs to include black keys. Since one now knows the distances between successive notes of the diatonic scale, one can determine the diatonic scale on, e.g., A. This scale, presented as the first eight notes in Fig. 1.19, comprises A, B, C♯, D, E, F♯, G♯, and finally A, again. The major-third scale on D♭, presented as the last eight notes in Fig. 1.19, consists of D♭, E♭, F, G♭, A♭, B♭, C, and finally D♭, again. You may check that the first two intervals of the major-third scale are two semitones, then there is one interval of one semitone, then there are three intervals of two semitones, and finally there is an interval of one semitone.
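A sketch for computing equal-tempered frequencies relative to the 440 Hz A4 (not code from this book; the note-name bookkeeping is a simplified illustration using sharps only and ignoring octave numbers):

% Sketch: equal-temperament frequencies, f = 440 * 2^(k/12), where k is the
% number of semitones above (positive) or below (negative) the 440 Hz A4.
names = {'A','A#','B','C','C#','D','D#','E','F','F#','G','G#'};
for k = -9:14                                  % from C4 up to B5
    f = 440 * 2^(k/12);
    noteIdx = mod(k, 12) + 1;                  % name within the octave (sharps only)
    fprintf('k = %3d  %-2s  %8.2f Hz\n', k, names{noteIdx}, f);
end
% Example checks: k = 3 gives C5 = 523.25 Hz, k = 12 gives A5 = 880 Hz,
% and k = -12 would give A3 = 220 Hz.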


Fig. 1.20 The chromatic scale over two octaves. The tones are short harmonic impact sounds. Their fundamental frequency F0 runs in steps of one semitone from 440 to 1760 Hz. The frequencies of the second and the third harmonic are twice and three times as high as the frequency of the fundamental, respectively. The upper panel shows the schematic time-frequency representation with a linear frequency ordinate, the lower panel with a logarithmic frequency ordinate. (Matlab) (demo)

Schematic time-frequency representations of the chromatic and the diatonic scale are presented in Figs. 1.20 and 1.21, respectively. The tones of the scales are complex tones consisting of the sum of three harmonics of rank 1, 2, and 3, with exponentially decaying envelopes. Such a tone sounds like a metallic impact sound, e.g., a note played on a simple, metal xylophone. First, the chromatic scale is illustrated in Fig. 1.20. In the bottom panel, the waveform is presented but, due to clogging, only the envelope of the tones can be distinguished, so that all tones look the same. In the middle and the top panel, the schematic time-frequency representation is shown, in the top panel on a linear frequency scale, in the middle panel on a logarithmic frequency scale. Since all twelve notes of the two octaves between 440 and 4·440 = 1760 Hz are played, 25 tones can be counted.


Fig. 1.21 The diatonic scale over two octaves, played by short impact sounds consisting of the sum of three harmonics of rank 1, 2, and 3. The waveform is shown in the lower panel, but only the envelope of the signal is well represented. As the diatonic scale starts on A, the key in which it is played is A major. (Matlab) (demo)

In the top panel of Fig. 1.20, the frequency ordinate is linear. For every tone, the distance in hertz between the first and the second harmonic is equal to that between the second and the third harmonic. But this distance increases from tone to tone, 440 Hz for the first tone to 1760 Hz for the last. In spite of this, the timbre of the tones does not change significantly, only the pitch increases. Moreover, this increment in pitch is perceived as equal from tone to tone, though in the figure the distance between the successive notes seems to increase. This changes when the schematic time-frequency representation is presented with a logarithmic ordinate. On this logarithmic frequency scale, the distance between the first and the second harmonic is much larger than the distance between the second and the third harmonic, which applies to all successive tones. Actually, the ratio between the frequency of the first and the second harmonic is always 2: 2 · f 0 / f 0 = 2. Similarly, the ratio between the frequencies of the second and the third harmonic is always 1.5: 3 · f 0 / (2 · f 0 ) = 1.5. A frequency ratio of 1.5 corresponds to 12 · log2 (1.5) = 7.0196 semitones. This is illustrated in the middle panel of Fig. 1.20, showing that, on this logarithmic frequency axis, the distance between the first and the second harmonic is the same for all tones, as is the distance between the second and the third harmonic. Aside from its loudness and pitch, the way a tone sounds is called its timbre. The fact that the distance between the successive harmonics remains the same is perceived by the listener as a more or less constant timbre, while the equidistant jumps in F 0 between successive notes are perceived as equal increments in pitch between successive notes. This shows that, for these musical scales, the time-frequency representation in which the ordinate is presented logarithmically gives a better picture of what is heard than when the ordinate is linear.
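The conversion used here, from a frequency ratio to a distance in semitones, is simply 12·log2 of the ratio. A minimal sketch (the set of ratios is an arbitrary choice):

% Sketch: express frequency ratios as musical intervals in semitones (and cents).
ratios = [2 1.5 4/3 5/4];            % octave, fifth-like, fourth-like, third-like
for r = ratios
    st = 12 * log2(r);               % size of the interval in semitones
    fprintf('ratio %.4f  =  %7.4f semitones  (%7.1f cents)\n', r, st, 100*st);
end
% 12*log2(1.5) = 7.0196 semitones, as mentioned above; an equal-tempered
% fifth is exactly 7 semitones, i.e., a ratio of 2^(7/12) = 1.4983.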


Fig. 1.22 Notes of a short melody in the key of A major. (demo)

The next figure, Fig. 1.21, shows the schematic time-frequency representation of a diatonic scale played over two octaves. The frequency ordinate is logarithmic. The scale starts with the 440 Hz A and, hence, is played in A major. If one looks carefully, one can see that the increments between harmonics of the same rank are not always the same for successive tones: The increment in frequency between corresponding harmonics of the third and the fourth tone, and the tenth and the eleventh tone, the Mi and the Fa in the diatonic scale, and the seventh and the eighth tone, and the fifteenth and the sixteenth tone, the Ti and the Do, is smaller than the increments between the harmonics of the other notes. This all corresponds to the musical construction of the diatonic scale. Both schematic representations shown in Figs. 1.20 and 1.21 may suggest that listeners would hear three separate melodies. Listening to these synthesized sounds shows, however, that they do not. As has been said before, the three harmonics fuse perceptually into one tone with a pitch frequency equal to the fundamental frequencies of the tones. This is further illustrated for a short melody the musical score of which is presented in Fig. 1.22. Its schematic time log-frequency representation is presented in Fig. 1.23. This melody consists of the notes: A4, A4, C♯5, E5, A4, A4, C♯5, E5, A4, and E5, corresponding to pitches of 440.0, 440.0, 554.4, 659.3, 440.0, 440.0, 554.4, 659.3, 440.0, 659.3 Hz, respectively. In the synthesized version of Fig. 1.22, every tone consists of three harmonics, again, the first, the second, and the third. What is perceived as melody consists of the fundamental frequencies, so the frequencies of the first harmonics. These fundamental frequencies are all within the horizontal rectangle drawn in Fig. 1.23. The second and third harmonics of the tones are outside this rectangle. Similar to the tones displayed in Fig. 1.12, the three harmonics merge perceptually into one tone, and the melody of only one musical instrument is heard. The second and the third harmonic are not perceived as separate tones, and they do not form any separate melody. So, one cannot hear one of the two tonal lines outlined by the slanted, dotted parallelograms, even though some of the partials are played one after the other without any significant difference in frequency; for instance, the second harmonic of the fourth note, an E, has a frequency of 2 · 659.3 = 1318.5 Hz. This is virtually the same as the frequency of the third harmonic of the sixth note, an A, 3 · 440.0 = 1320.0 Hz. For every tone, the three harmonics lose their identity, merge into one tone, and together determine its timbre. The succession of these tones is then perceived as a melody of frequencies corresponding to their F0s. In the descriptions of the diatonic and the chromatic scales, it has been assumed that the octave is divided into twelve semitone intervals of exactly the same size. This way of tuning an instrument is the common practice for all modern Western orchestras and is the standard in all usual music synthesis software. It is called equal


Fig. 1.23 Schematic time log-frequency representation of the melody shown in Fig. 1.22. The melody is defined by the F0s of the tones, here outlined by the rectangle, not by the more proximate successions of partials outlined by the slanted dotted parallelograms. (Matlab) (demo)

The big advantage of this way of tuning is that music can be played in all keys and that musicians can modulate from one key to any other key. There is also a disadvantage, however. Tuning an instrument in equal temperament means that the ratio of the F0s of two adjacent notes must be exactly 2^(1/12), so that the ratio of the F0s of two notes k semitones apart is 2^(k/12). Except when k is a multiple of 12, 2^(k/12) is an irrational number and, hence, cannot be written as a ratio of integers. This implies that, except for the octave, the ratio of the F0s of two tones is never a quotient of small integers. On the other hand, the harmonics of musical tones have frequencies that are integer multiples of the F0s of the tones. This implies that, except for the octave, the harmonics of two different tones are never exactly equal. For instance, the third harmonic of the A4 of 440 Hz is almost equal to the second harmonic of the E5 one fifth higher, which has a frequency of 659.26 Hz. Indeed, the third harmonic of this A is 3 · 440.00 = 1320.00 Hz, while the second harmonic of the E is 2 · 659.26 = 1318.51 Hz. The difference between these two harmonics is about 1.49 Hz, causing audible interferences at a rate of about 1.49 Hz, as will be discussed in Sect. 1.5. Such interferences between the harmonics of two tones are inevitable for simple intervals such as the fifth, the fourth, and the major third. The ratios of the F0s for these intervals are 1.4983, 1.3348, and 1.2599, close to, but not equal to, 3:2, 4:3, and 5:4, respectively. Phenomena like these cause audible interferences between the harmonics of tones played in equal temperament, and audible interferences are associated with dissonance, a property of simultaneously sounding tones to be discussed later in this book in Sect. 10.12.5. Dissonance is a property that is generally not appreciated in music. This is why some people say that all equally tempered instruments, such as modern pianos, are out of tune.
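As a minimal numerical illustration of this mismatch (this is only a sketch, not one of the Matlab demo scripts accompanying the book), the interference rate between the third harmonic of the equally tempered A4 and the second harmonic of the E5 a fifth higher follows directly from the definition of equal temperament:

    % Equal temperament: a note k semitones above A4 has an F0 of 440 * 2^(k/12) Hz
    fA4 = 440;                    % F0 of A4 in Hz
    fE5 = fA4 * 2^(7/12);         % E5 lies 7 semitones higher: about 659.26 Hz

    h3A = 3 * fA4;                % third harmonic of the A4: 1320.00 Hz
    h2E = 2 * fE5;                % second harmonic of the E5: about 1318.51 Hz
    abs(h3A - h2E)                % about 1.49 Hz, the rate of the audible interference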


This tuning problem, if it is a problem, applies to all musical instruments on which only a fixed number of different notes, usually twelve per octave, can be played. On other instruments, such as the violin, or with the singing voice, performers are free to choose the frequency ratios between the different tones. At least since the days of Pythagoras, it has been proposed that, in order to avoid interferences between overtones, the frequency ratios of musical intervals must be as close as possible to ratios of small integers. For instance, the ratio 2:1 defines the octave, the ratio 3:2 the fifth, the ratio 4:3 the fourth, etc. This has led to different tuning systems, e.g., just temperament, mean-tone temperament, or Pythagorean temperament. Such tuning systems have been used since Greek antiquity. The problem with them is that, ideally, every key needs a different tuning, so that one cannot arbitrarily modulate from one key to another. In the practice of early Western music, instruments were tuned in such a way that music could only be played in keys with a small number of flats and sharps, at most four. In the eighteenth century, equal temperament became the dominant tuning system. For a concise overview of tuning systems, the interested reader is referred to Roederer [23, pp. 176–180], Thompson [32], or Schneider [25, pp. 651–663]. A more elaborate account is presented by Gunther [10]. For those who are interested in an in-depth and adventurous treatise on musical scales and tuning systems, Sethares [29] is highly recommended. He also discusses some less conventional scales and tuning systems, e.g., those not based on the octave! Above, the structure of scales has been described from a purely technical point of view. Frequencies have mostly been presented with an accuracy of 0.1 or 0.01 Hz. This is necessary, e.g., when one wants to calculate the rate of the fluctuations caused by the interference between the harmonics of different tones, but it is far too precise for the perceived frequency of the pitch of a tone or for the perceived interval between two tones. The perception of the pitches of the tones of a musical scale and of their intervals is a much less accurate process than suggested by the mathematical precision with which the F0s of the equally tempered scales and their harmonics have been calculated. The values presented are at best a good first approximation of the perceptual attributes of musical pitch and its intervals as they play a role in music perception and production. A very critical and elaborate review of taking exact small-integer ratios of the F0s of tones as the basis of the musical interval is presented by Parncutt and Hair [18], which is also wholeheartedly recommended to the interested reader. In a musical score, timbre is represented by indicating the musical instrument for which the melody is meant. This is inadequate, however, for synthetic sounds, except perhaps when the synthesis is carried out to mimic a specific instrument. Because the timbre of a tone is determined by a wide variety of different factors, it is not possible to represent timbre unambiguously in the time-frequency domain. The perception of timbre will be discussed in much more detail in Chap. 6. But before this is done, there are still quite a few subjects to discuss. In the next sections, a description will be presented of what happens perceptually when a small number of sinusoids are added up.
First, focus will be on the simplest of complex tones, the sum of two sinusoids. It will be concluded that this can already lead to an interesting series of auditory phenomena.


1.5 The Sum of Two Sinusoids

It was shown that sinusoids are generally considered the building blocks of sounds, at least of relatively simple sounds. By adding up a number of sinusoids, a large number of different kinds of sounds can indeed be synthesized. Actually, many simple warning sounds used in everyday household equipment, such as microwaves, water heaters, or electrical toothbrushes, are simply synthesized by adding a number of sinusoids. This way of synthesizing sounds is called additive synthesis. As a first step, the most natural thing to do is to add just two sinusoids, which will be discussed in this section. Next, sinusoids sinusoidally modulated in amplitude, and sinusoids sinusoidally modulated in frequency will be discussed. They will be described as sounds modulated in time. The frequency of these modulations will prove to be paramount to the way these sounds are perceived. Based on this modulation frequency (MF), four auditory ranges will be distinguished. First, for the fastest MFs, there is the range of perception of a steady tone. Second, when the MF falls below 100–150 Hz, the range of roughness perception, or the roughness range for short, is entered. Roughness is maximum for MFs of about 70 Hz. Third, when the MF drops below about 15–20 Hz, the range of rhythm perception, or the rhythm range for short, is entered. Rhythm is maximum between 2 and 5 Hz. Finally, MFs below about 0.5–1 Hz are in the range of hearing slow modulations. These four ranges were, in somewhat different terms, first described in 1929 by Wever [35, p. 405]. They have not always been referred to as they are here. What is called the range of rhythm perception can be compared with the existence region of pulse sensation by Edwards and Chang [16], the range of sequence perception by McAdams and Drake [14], and the fluctuation range by Edwards and Chang [6]. A detailed description of these ranges is given by Edwards and Chang [6], including an elaborate review of the auditory and neurophysiological basis of the distinctions between them. In describing the synthesis and perception of the sum of two sinusoids, a number of different perceptual phenomena will be encountered that are a consequence of the frequency and time resolution of the human auditory system. First, the equations will be presented. Indeed, the sum of two sinusoids can be presented in two mathematically equivalent ways:

\[
\cos\alpha + \cos\beta = 2\cos\frac{1}{2}(\alpha - \beta)\,\cos\frac{1}{2}(\alpha + \beta).
\]

If this equation is applied to the sum of two sinusoids of amplitude a, one with frequency f1 and the other with frequency f2, this results in:

\[
a\cos(2\pi f_1 t) + a\cos(2\pi f_2 t) = 2a\,\cos\!\left(2\pi\frac{f_1 - f_2}{2}\,t\right)\cos\!\left(2\pi\frac{f_1 + f_2}{2}\,t\right) \tag{1.3}
\]


Fig. 1.24 Adding up two sinusoids, one of 400 Hz, shown in the top panel, and the other of 600 Hz, shown in the second panel. The result of the addition is shown in the bottom panel. The dashed lines indicate the positive and negative envelope. The sum of the sinusoids is periodic with a periodicity of 5 ms, corresponding to 200 Hz. (Matlab) (demo)

The sum of the two sinusoids, one with frequency f1 and the other with frequency f2, is presented on the left-hand side of this equation. This will be called a spectral representation. The right-hand side shows the product of two sinusoids: a cosine with a frequency of (f1 − f2)/2, half the frequency difference between the two tones, and a cosine with a frequency of (f1 + f2)/2, the average of the two frequencies f1 and f2. The latter will be called a temporal representation. The mathematical equivalence of these representations is illustrated in Figs. 1.24 and 1.25. In Fig. 1.24, the oscillogram of a 400 Hz sinusoid is presented in the top panel, while the middle panel shows the oscillogram of a 600 Hz tone. The result of adding them up is shown in the bottom panel. The dashed lines in the bottom panel are the absolute value of the modulator, or envelope, and its inverse. Notice that the envelope has a period corresponding to twice the frequency of the modulator, i.e., 200 Hz. This can be compared with Fig. 1.25, showing the product of two cosines, one with the average frequency, (600 + 400)/2 = 500 Hz, in the top panel, and one with half the difference frequency, (600 − 400)/2 = 100 Hz, in the middle panel. In the temporal representation, the sinusoid with the average frequency, 500 Hz in this case, can be viewed as a carrier modulated in amplitude by a modulator, the sinusoid with half the difference frequency, 100 Hz in this case. So, in the temporal domain, the sum of two sinusoids can be represented as the product of a carrier and a modulator.
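The equivalence of the two representations in Eq. 1.3 can also be verified numerically. The following is a minimal Matlab sketch, not the book's own demo script; the sampling rate and duration are arbitrary choices:

    fs = 44100;  t = (0:1/fs:1).';     % sampling rate and duration, chosen arbitrarily
    a  = 1;  f1 = 400;  f2 = 600;      % the two sinusoids of Figs. 1.24 and 1.25

    % Spectral representation: the sum of the two sinusoids
    xSpec = a*cos(2*pi*f1*t) + a*cos(2*pi*f2*t);

    % Temporal representation: modulator times carrier, right-hand side of Eq. 1.3
    xTemp = 2*a * cos(2*pi*(f1 - f2)/2*t) .* cos(2*pi*(f1 + f2)/2*t);

    max(abs(xSpec - xTemp))            % numerically zero, apart from rounding
    soundsc(xSpec, fs);                % play the two-tone complex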


Fig. 1.25 The product of two sinusoids, one with a frequency of 500 Hz, shown in the top panel, the other with a frequency of 100 Hz, shown in the middle panel. The result of the multiplication is shown in the bottom panel. In accordance with Eq. 1.3, the result in the bottom panel is exactly the same as that in the bottom panel of the previous figure, Fig. 1.24. (Matlab) (demo)

The frequency of the carrier is indicated by carrier frequency (CF); that of the modulator by modulation frequency (MF). The question is now: What is heard? Does one hear a complex of two tones with frequencies of 400 and 600 Hz? Or does one hear a 500 Hz tone modulated in amplitude at a rate of 200 Hz? Listening to the examples of Figs. 1.24 and 1.25, one does not hear any temporal fluctuations. Apparently, the temporal resolution of the auditory system is not fine enough to track the 200 Hz modulations of the envelope. Furthermore, the 500 Hz carrier is not heard out as a separate tone. Hence, one does not hear a modulated 500 Hz tone, but one hears a continuous complex tone with partials of 400 and 600 Hz, both multiples of 200 Hz, resulting in a harmonic sound with a periodicity of 5 ms. An inharmonic example is presented in Fig. 1.26. Two sinusoids are added, one of 421 Hz and the other of 621 Hz. This two-tone complex, too, is periodic, but the periodicity is as low as 1 Hz, which is below the range of pitch perception of 50 Hz to 5 kHz. What happens now when the difference between the two sinusoids diminishes and the modulation frequency decreases? In this section, the perceptual effects that will appear will be discussed systematically. As indicated at the start of this section, four different ranges of perception will be described: the region in which a continuous two-tone complex is perceived, the region of roughness perception, the region of rhythm perception, and the perceptual region in which a slow amplitude modulation is perceived.


Fig. 1.26 Sum of two sinusoids with frequencies of 421 and 621 Hz. (Matlab) (demo)

Since these regions overlap, there are no clear boundaries between them and, when the difference frequency between the two sinusoids is gradually increased or decreased, one percept is gradually replaced by another.

1.5.1 Two Tones: The Range of Perception of a Steady Tone

In the previous examples, shown in Figs. 1.24, 1.25, and 1.26, one could not hear the presence of any 200 Hz modulation corresponding to the difference frequency of the two components of the two-tone complex. In general, modulations are not audible as fluctuations for frequencies higher than 150 Hz. Moreover, one does not hear the average frequency of 500 Hz either. Apparently, the listener does not hear the sound as in its temporal representation, but as a steady complex tone. Indeed, listening carefully to the frequencies of the sound shown in Figs. 1.24 and 1.25, one may hear out one of the two partials of 400 and 600 Hz separately. What one hears depends on the listener. Some listeners are better at hearing out partials than other listeners. This will be discussed in Sect. 8.4.5. One final phenomenon is noteworthy. It can occur that one hears an additional tone with a frequency of 200 Hz, though this does not correspond to any frequency component of the two-tone complex. This frequency of 200 Hz corresponds to the periodicity of the sound signal, as can be verified in Figs. 1.24 and 1.25. Indeed, the lowest panel there shows a signal with a periodicity of 5 ms, corresponding to 200 Hz, the F0 of the two-tone complex. In Sect. 1.1.2 of this chapter, it was argued that the pitch of a periodic complex tone is largely determined by its period, at least when this period is in the pitch range.


Fig. 1.27 Waveforms, left, and the discrete amplitude spectra, right, of a number of harmonic two-tone complexes. The frequencies of the two tones are indicated with the discrete spectra. The short, upward ticks on the abscissae show the harmonic positions corresponding to the periodicity of the amplitude of the two-tone complex. The frequencies of the two partials fit exactly into this harmonic pattern. (Matlab) (demo)

The two-tone complex of 400 and 600 Hz with an F0 of 200 Hz is an example of this. Indeed, later on, in Sect. 8.4.4, it will appear that the pitch frequency of a harmonic tone corresponds in general very well with the F0, even if the complex tone does not contain a partial with that F0, as in the example of Figs. 1.24 and 1.25. This virtual pitch, already described by Seebeck [28] in 1843 and Schouten [26] in 1938, is quite weak for only two harmonic components, however, certainly when the rank of the harmonics is high. More attention will be paid to the virtual pitch of amplitude- and frequency-modulated sinusoids in the next Sects. 1.6 and 1.7. Now, some more examples are presented, shown in Fig. 1.27. The two sinusoids are now added in such a way that both the signal and the envelope start in sine phase. This can be done by subtracting two cosines. Indeed,

\[
a\cos(2\pi f_1 t) - a\cos(2\pi f_2 t) = 2a\,\sin\!\left(2\pi\frac{f_2 - f_1}{2}\,t\right)\sin\!\left(2\pi\frac{f_1 + f_2}{2}\,t\right).
\]

As just said, harmonic sounds consist of components with frequencies that are all multiples of a common fundamental frequency F0. Such sounds are periodic with a periodicity corresponding to this F0. Compare this with the situation in which the two components do not have an F0 in the pitch range between 50 and 5000 Hz. The two situations are illustrated in Figs. 1.27 and 1.28. The average frequency of the two tones is always 1200 Hz, so all signals of Figs. 1.27 and 1.28 have the same CF of 1200 Hz. The panels on the left-hand side of these figures show the waveforms and the envelopes of the signals; the panels on the right-hand side show their discrete amplitude spectra. A discrete amplitude spectrum shows the amplitudes of the sound components as a function of their frequency, not to be confused with the continuous spectra based on the Fourier analysis of a windowed sound segment as shown in Figs. 1.14 and 1.16. Besides the discrete spectra of the two-tone complexes, depicted as two thick vertical lines, the multiples of the frequency corresponding to the period of the modulator are indicated by the upward ticks on the abscissae. These ticks represent the harmonic pattern corresponding to the periodicity of the modulator. First, the harmonic situation will be discussed, shown in Fig. 1.27, where, from the top to the bottom panel, F0 is 300, 200, 150, 120, and 100 Hz, respectively. The first two-tone complex consists of the 3rd and the 5th harmonic of 300 Hz; the last of the 11th and the 13th harmonic of 100 Hz. Note, furthermore, that the F0s of the successive tones, 100, 120, 150, 200, and 300 Hz, are the 12th, the 10th, the 8th, the 6th, and the 4th subharmonic of the CF of 1200 Hz, respectively. It can easily be verified that, in all five cases, the signal is periodic with a period corresponding to half the difference frequency of the two tones. In the left panels, one such period is indicated by two thick, vertical lines, one at 0 ms and the other at the length of one period. The harmonic pattern corresponding to this period is indicated by the upward ticks at the abscissae of the discrete amplitude spectra on the right. One sees that the frequencies of the two components fit exactly into this harmonic pattern. The F0s of the tones are then 300, 200, 150, 120, and 100 Hz, respectively, which in principle may result in a melody consisting of tones with these frequencies as virtual pitches. As the number of harmonics is low and, especially for the later tones of the demo, their harmonic rank is high, these pitches will not be very salient and the melody will be unclear. The readers can judge for themselves as, in the demo of Fig. 1.27, a pure-tone melody consisting of these pitches is played after the sequence of two-tone complexes. In the following sections on amplitude and frequency modulation, the virtual pitches will be more salient. Figure 1.28 illustrates the inharmonic situation. The frequencies of the two tones deviate randomly from the values they have in Fig. 1.27 according to a normal distribution with a standard deviation of 5%. The resulting frequency values are given in the right-hand panels of Fig. 1.28, where also the frequency corresponding to the period of the modulator is presented. The absolute value of the modulator, i.e., the envelope, and its inverse are shown in the left-hand panels by the dashed lines. One of its periods is indicated by the two thick, vertical lines.


Fig. 1.28 Same as in the previous figure, Fig. 1.27, but now the two-tone complexes are not harmonic, and the frequencies of the two tones do not fit into the harmonic pattern corresponding to the periodicity of the amplitude of the two-tone complex. (Matlab) (demo)

As can be checked, no integer number of periods of the carrier fits into one period of the envelope. This inharmonicity expresses itself in the discrete spectra by the fact that the frequencies of the two tones do not fit into the harmonic pattern corresponding to the period of the modulator, indicated by the upward ticks on the abscissae. This shows that these two-tone complexes are not harmonic.
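A sketch of how the harmonic, sine-phase two-tone complexes of Fig. 1.27 can be generated is given below. This is not the book's own Matlab demo; the sampling rate, tone duration, and pauses are arbitrary choices:

    fs  = 44100;  t = (0:1/fs:0.5).';   % half-second tones, chosen arbitrarily
    fc  = 1200;                          % average frequency (CF) in Hz
    F0s = [300 200 150 120 100];         % F0s of the five harmonic complexes of Fig. 1.27
    x   = [];
    for F0 = F0s
        f1 = fc - F0;  f2 = fc + F0;     % e.g., 900 and 1500 Hz for F0 = 300 Hz
        % Subtracting two cosines makes both signal and envelope start in sine phase
        s  = cos(2*pi*f1*t) - cos(2*pi*f2*t);
        x  = [x; s; zeros(round(0.1*fs), 1)];   % 100 ms of silence between the tones
    end
    soundsc(x, fs);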

1.5.2 Two Tones: The Range of Roughness Perception

In the previous paragraphs, the difference between the frequencies of the two tones was relatively large, 200 Hz or more. When the frequency difference between the two components of the two-tone complex is decreased, there is a point at which the two components start to interfere audibly. This will certainly happen when the difference gets smaller than 100 Hz. The two frequency components are then so close in frequency that the frequency resolution of our peripheral hearing system no longer fully separates the two components and the resulting fluctuations become audible.


Fig. 1.29 Adding a 480 and a 520 Hz cosine. The result, presented in the bottom panel, shows a 500 Hz tone modulated with a 20 Hz cosine. The envelope is periodic with a period corresponding to 40 Hz. (Matlab) (demo)

This interference induces the percept of roughness [35]. Roughness is the percept that arises when, on the one hand, the sound is perceived as fluctuating but, on the other hand, the fluctuations follow each other so rapidly that it is not possible to follow each fluctuation separately. In general, roughness is associated with unfavourable properties of sound, such as dissonance in music and creaky voice in speech. Alarm clocks mostly produce rough sounds and many warning signals are rough. Other examples of rough sounds are those produced by crows or frogs. An example of a rough two-tone complex is presented in Fig. 1.29, showing the sum of a 480 and a 520 Hz sinusoid. The common F0 of this two-tone complex is 40 Hz, which corresponds to a periodicity of 25 ms. Listening to this sound, one no longer hears a steady tone. Apparently, the frequency difference between the two tones is so small that the two tones interfere with each other, resulting in audible fluctuations. In other words, in the roughness range, the difference between the two frequencies of the two-tone complex is so small that the frequency resolution of the auditory system does not resolve the two components. As to the temporal aspect, when the frequency difference is smaller than 100 Hz, the temporal resolution of the auditory system is sufficiently fine to make the fluctuations audible. Another important change in percept is that the sound loses its tonal character. Most listeners will not be able to hear out the frequencies, 480 and 520 Hz, of the two partials of the sound of Fig. 1.29; neither will they hear out the CF of 500 Hz.


Fig. 1.30 Waveforms and spectra of a number of rough two-tone complexes. The average frequency of the two tones, so the CF, is always 1200 Hz; the frequency intervals between the two tones are 100, 71, 50, 35, 25 Hz. The discrete spectra show the two lines representing the components of the two-tone complex. (Matlab) (demo)

Five more examples of rough two-tone complexes are presented in Fig. 1.30. The difference frequencies are equidistant on a logarithmic scale between 100 and 25 Hz, from top to bottom 100, 71, 50, 35, and 25 Hz, respectively. Listening to the sounds, one can clearly hear the rapid fluctuations associated with roughness. The upper three two-tone complexes, with difference frequencies of 100, 71, and 50 Hz, have lost almost all of their tonal character; they have much more the character of a penetrating hum. The lower two, however, get an increasingly tonal character, but the frequency of the tone one hears corresponds neither to one of the two partials of the tone nor to the difference frequency; it does correspond to the average frequency of the two tones, 1200 Hz, the frequency of the carrier when the complex is considered as the product of a carrier and a modulator. So, the frequencies of the two tones are now so close to each other that the auditory system processes the sound as one tone with a pitch frequency that corresponds to the average frequency of the two tones. Hence, the spectral description of the two-tone complex no longer corresponds to what is heard. Due to the limitations of the spectral resolution of the auditory system, none of the two frequencies of the two tones can be distinguished anymore. The temporal representation of the two-tone complex now is a better representation of what is heard. The temporal resolution of the auditory system is sufficiently fine to detect the fluctuations due to the interference between the two tones, and these can be heard clearly. In the examples of Fig. 1.30, the fluctuation rate is so high, however, that the listener is unable to keep track of every fluctuation separately. The percept of roughness will be discussed extensively in the section of that name, Sect. 6.2.

1.5.3 Two Tones: The Range of Rhythm Perception

If the difference in frequency between the two sinusoids is further decreased, e.g., below some 10–15 Hz, one will start hearing the successive fluctuations separately. Something perceptually very interesting then happens: the emergence of the percept of rhythm. Below a certain frequency, it becomes possible to count the fluctuations, perhaps not every fluctuation separately, but in groups of two, three, or four. When such sounds are played, it is common for some listeners to start moving in synchrony with this rhythm [7, 33]. Apparently, they perceive events in time with which they are inclined to synchronize their movements. These perceived events are called beats. When the tempo is relatively fast, listeners may not synchronize with every beat, but with every other beat, every third beat, or every fourth beat. The successive beats of a sound define the rhythm of that sound. Beats will play an important role in this book. A more exact definition and description of the beat will be presented in Sect. 4.2 and in Chap. 5. How to determine the beat location of a sound will be extensively discussed in Sect. 5.1. An example is shown in Fig. 1.31. Two sinusoids are added, one with a frequency of 496 Hz, the other with a frequency of 504 Hz. In fact, showing the waveforms in the upper and middle panel hardly makes any sense: the graphical resolution is not fine enough to track the periods of the sinusoids, and the waveforms are clogged, resulting in nothing more than a filled, horizontal band. From the display of the sum of the two tones one can see, however, that at 0 ± k · 125 ms the two sinusoids are in phase and amplify each other, while at 62.5 ± k · 125 ms they are out of phase and cancel each other. As a result, the envelope of the sound fluctuates between 2 and 0 with a period of 125 ms, corresponding to a rate of 8 beats per second, the rate of the rhythm heard. In conclusion, this is the range of rhythm perception or, in short, the rhythm range. Rhythm is determined by the beats of the sounds that make up the stimulus. For series of sounds with periods in the rhythm range, the beat locations can be found by asking listeners to tap along with the sounds, if not with every beat, then, for the faster rates, with every other, every third, or every fourth beat. In the rhythm range, this appears to be an easy task. Already in 1894, Bolton [3] gives a range of intervals between successive tones of 115–1581 ms with an optimum at about 600 ms. So, within this rhythm range, listeners appear to predict the coming beats relatively accurately and, when asked to tap along with them in synchrony, are in general easily capable of doing so [8, 13].


Fig. 1.31 Adding a 496 and a 504 Hz sinusoid. The individual phases of the two tones can no longer be distinguished. The result, presented in the bottom panel, shows a 500 Hz tone modulated with a 4 Hz cosine. The amplitude decreases from 2 to 0 and then increases to 2 again at 125 ms. This is perceived as a modulation with a frequency of 8 Hz. It is this 8 Hz modulation which corresponds to the beats of an 8 Hz rhythm. (Matlab) (demo)

This phenomenon is called rhythmic entrainment. Outside the rhythm range, the variance of the tap intervals increases quite rapidly. Experiments in which listeners are asked to tap along with isochronous tone sequences have shown that the upper limit of the range of rhythm perception is somewhere between 10 and 15 Hz. For higher presentation rates, the listeners are no longer able to do the task. This corresponds well to estimates of the maximum number of tones per second a professional pianist can realize, estimates that vary from 16 tones per second [9] to 20 tones per second [12]. Rhythmic entrainment will be discussed in more detail in Sect. 10.12.3.4. In addition to the demo of a beating two-tone complex of Fig. 1.31, some more examples are presented in Fig. 1.32. The CF is 1200 Hz again, as in the demos of Figs. 1.27, 1.28, and 1.30. The frequency differences between the two tones are 10 times smaller than in Fig. 1.30 for the two-tone complexes in the roughness range. This results in rhythms of 10.0, 7.1, 5.0, 3.5, and 2.5 beats per second. Presenting the spectra in Fig. 1.32 no longer makes sense, since the frequency difference of the two partials is now so small that, due to the limited resolution of the figure, the two spectral lines would merge into one spectral line and, hence, would not be separately visible. Moreover, in the rhythm range, no separate partials can be heard, which implies that they have no perceptual significance.
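The beating two-tone complex of Fig. 1.31 can be sketched as follows; this is not the book's demo script, and the plotted envelope simply follows from the temporal representation of Eq. 1.3:

    fs = 44100;  t = (0:1/fs:3).';      % three seconds, chosen arbitrarily
    f1 = 496;  f2 = 504;                % difference of 8 Hz, so 8 beats per second
    x  = cos(2*pi*f1*t) + cos(2*pi*f2*t);

    env = abs(2*cos(2*pi*(f1 - f2)/2*t));   % envelope, repeating every 1/8 s = 125 ms
    plot(t, x, t, env, 'k--');  xlim([0 0.5]);  xlabel('time (s)');
    soundsc(x, fs);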


Fig. 1.32 Waveforms of a number of beating two-tone complexes. The average frequency of the two tones, so the CF, is always 1200 Hz; the frequency intervals between the two tones are 10.0, 7.1, 5.0, 3.5, and 2.5 Hz. (Matlab) (demo)

Listening to the sounds of Fig. 1.32, one hears a pure tone rhythmically fluctuating in loudness. The pitch of the pure tones is that of the carrier, 1200 Hz. So, one can conclude that the auditory system processes these sounds as modulated pure tones. Hence, the temporal representation of the two-tone complex represents the perception of these two-tone complexes quite well. The spectral representation is perceptually not adequate.

1.5.4 Two Tones: The Range of Hearing Slow Modulations

Now, the difference in frequency is further decreased to 0.4 Hz. The periodicity corresponding to 0.4 Hz is 2.5 s. The result is presented in Fig. 1.33. It makes no sense now to show the waveforms of the two tones separately, since this results in clogged bands with even less detail than in Fig. 1.29.


Fig. 1.33 Addition of a 499.8 and a 500.2 Hz sinusoid. The individual phases of the carrier can no longer be distinguished. Only the 0.4 Hz envelope of the sound can be seen. (Matlab) (demo)

The reader will see, or rather hear, that the percept of this sound corresponds best to its temporal representation, i.e., a sinusoid modulated in amplitude. For the two tones of 499.8 and 500.2 Hz played in Fig. 1.33, the average frequency is 500 Hz and the difference in frequency is 0.4 Hz. When listening to the combination of these two tones, a 500 Hz tone is heard, slowly fluctuating in loudness at a rate of 0.4 fluctuations per second. If people are asked to move in synchrony with these fluctuations, they can do so. It appears, however, that the variance of the periodicity of their movements is much higher than for fluctuations in the rhythm range; in addition, they make more errors and find the task more difficult [1, 2]. Hence, rhythmic entrainment, i.e., a real sense of rhythm induced by series of beats in the rhythm range, no longer occurs. Some more examples in this range of very slow modulations are presented in Fig. 1.34 for two-tone complexes centred around 1200 Hz. The frequency differences between the two tones are 0.50, 0.44, 0.39, 0.34, and 0.30 Hz, respectively. Hence, in the sound with the most rapid fluctuations, shown in the top panel, one hears five fluctuations every ten seconds; in the slowest version one hears three fluctuations every ten seconds. Just a pure tone is perceived, gently fluctuating in loudness.

1.5.5 In Summary

The way in which one perceives even the simplest example of a complex sound, the sum of two sinusoids, strongly depends on the frequency and temporal resolution of the human hearing system. When the two components are widely separated from each other in frequency, the sound is perceived as a tonal sound. Only some listeners can still perceive the separate tones with the corresponding pitches. For these two-tone complexes, two different situations can be distinguished: an inharmonic situation in which the two tones do not have a common F0 in the pitch range, and a harmonic situation in which the two tones do have such an F0. In the latter case, some listeners may hear a pitch with the F0 of the two tones as frequency. This pitch of a complex tone is called virtual when there is no partial with the frequency corresponding to this pitch. For many listeners, this virtual pitch will be weak, certainly when the ranks of the two harmonics are high.


Fig. 1.34 Waveforms of a number of slowly modulating two-tone complexes. The average frequency of the two tones, so the CF, is always 1200 Hz; the frequency intervals between the two tones are 0.50, 0.44, 0.39, 0.34, 0.30 Hz. (Matlab) (demo)

The pitches will be “stronger”, more “salient”, for the examples shown in the coming sections on amplitude and frequency modulation. Virtual pitch will be discussed in detail in Sect. 8.4.4. In the range of hearing a steady tone, what is heard corresponds best to the spectral representation of the sum of two sinusoids. There is no indication of anything fluctuating. This changes when the frequency difference between the two tones becomes smaller. The partials start to interfere perceptually, which results in an auditory attribute called roughness. Fluctuations can be heard, but they are still so rapid that they cannot be heard separately. Neither can they be counted. One just hears that something in the sound fluctuates or vibrates very rapidly. Moreover, the smaller the difference in frequency of the two tones, the more difficult it becomes to hear out the pitches of the two separate frequency components. Apparently, the frequency resolution of the hearing system is not fine enough to resolve these frequency components. The fact that one perceives the fluctuations shows that the temporal resolution of our hearing system can now resolve the separate fluctuations. When the frequency difference decreases further, there is a moment at which one will more and more hear out the average frequency of the two tones, the CF, as pitch, and the interference between the tones will be perceived as fluctuations, which, below about 10–15 Hz, induce beats resulting in a regular rhythm. Then one can hear out all individual fluctuations, move or tap in synchrony with them, and count them. When the fluctuations are relatively rapid, they are counted in doublets, triplets, or quadruplets. What is perceived corresponds best to the temporal representation of the sound as a carrier modulated with a modulator. Down to about 1 Hz, the frequency of the modulation defines the rhythm of the sound. This percept of rhythmic beats is lost when the MF gets below 1 Hz. Then one hears a pure tone with the average frequency of the two tones as pitch, slowly increasing and decreasing in loudness. It is concluded that, even for the perception of a complex tone as simple as the sum of two sinusoids, four different perceptual ranges can be distinguished. The same perceptual ranges will be distinguished in the next two sections discussing sinusoidal AM and sinusoidal FM.

1.6 Amplitude Modulation

One of the ways to modulate a sinusoid is by modulating its amplitude. Amplitude modulation (AM) consists of varying the amplitude of one signal, the carrier, in time by means of another signal, the modulator. In general, the modulator is a positive function of time and the frequency of the modulator is lower than the frequency of the carrier. The situation will be discussed in which both the carrier and the modulator are sinusoidal. In that case, one speaks of a sinusoidally amplitude-modulated (SAM) sinusoid or SAM tone. Mathematically, a SAM tone can be written as

\[
s(t) = a\,[1 + m_d \sin(2\pi f_m t + \varphi_m)]\,\sin(2\pi f_c t + \varphi_c) \tag{1.4}
\]

In this equation, the factor 1 + md sin(2π fm t + φm) represents the modulator and the factor sin(2π fc t + φc) the carrier. The constant a is the average amplitude, md is the modulation depth (MD), fc the carrier frequency (CF), fm the modulation frequency (MF), and φm and φc are the phases of the modulator and the carrier, respectively. An example of a SAM cosine is presented in Fig. 1.35. The carrier, with a CF fc of 500 Hz, is shown in the top panel; the modulator, with an MF fm of 200 Hz and an MD md of 0.8, is shown in the middle panel. The product of carrier and modulator is shown in the bottom panel. Equation 1.4 presents the temporal representation of a SAM sinusoid. It can mathematically be rewritten into a spectral representation consisting of the sum of three sinusoids. For the sake of simplicity and without loss of generality, φm and φc in Eq. 1.4 are set to zero. Indeed,


Fig. 1.35 Sinusoidal amplitude modulation. A 500 Hz sinusoidal carrier with amplitude 1 is sinusoidally modulated by a 200 Hz modulator with an MD of 0.8. The top panel shows the carrier, the middle panel the modulator, and the bottom panel the modulated sinusoid. The upper dashed line, the amplitude of the signal, is the modulator, which is also the envelope of the signal. The negative envelope is presented as the lower dashed line. The signal is periodic with a period of 10 ms, corresponding to 100 Hz, the F0 of the tone. (Matlab) (demo)

\[
\begin{aligned}
s(t) &= a\,[1 + m_d \sin(2\pi f_m t)]\,\sin(2\pi f_c t)\\
     &= a\sin(2\pi f_c t) + a\,m_d \sin(2\pi f_m t)\sin(2\pi f_c t)\\
     &= a\sin(2\pi f_c t) - \tfrac{1}{2}a\,m_d \cos[2\pi(f_c + f_m)t] + \tfrac{1}{2}a\,m_d \cos[2\pi(f_c - f_m)t]
\end{aligned} \tag{1.5}
\]

The last expression, mathematically identical to the first, is the sum of three sinusoidal partials. Hence, the signal can be characterised by a discrete amplitude spectrum consisting of three equidistant components with frequencies fc − fm, fc, and fc + fm. The absolute values of the amplitudes of these three components are ½amd, a, and ½amd, respectively. In other words, the spectrum of a SAM tone consists of a central component and two sidebands. This is illustrated in Fig. 1.36, in which the top panel, the second panel, and the third panel show the waveforms of the three partials; the bottom panel shows their sum. Naturally, this is identical to the lower panel of Fig. 1.35. As mentioned before, harmonicity means that all partials of a complex tone have frequencies that are integer multiples of a common F0 in the pitch range. The pitch frequency of these sounds then corresponds to this F0. The importance of harmonicity has already been indicated in the discussion of a two-tone complex in the previous section.
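A minimal Matlab sketch of Eqs. 1.4 and 1.5, with the parameter values of Fig. 1.35 (this is not the book's demo script; sampling rate and duration are arbitrary choices):

    fs = 44100;  t = (0:1/fs:1).';           % sampling rate and duration, arbitrary
    a  = 1;  md = 0.8;  fc = 500;  fm = 200; % parameter values of Fig. 1.35

    % Temporal representation, Eq. 1.4 (phases set to zero)
    sTime = a*(1 + md*sin(2*pi*fm*t)) .* sin(2*pi*fc*t);

    % Spectral representation, Eq. 1.5: carrier plus two sidebands at fc +/- fm
    sSpec = a*sin(2*pi*fc*t) ...
          - 0.5*a*md*cos(2*pi*(fc + fm)*t) ...
          + 0.5*a*md*cos(2*pi*(fc - fm)*t);

    max(abs(sTime - sSpec))                  % numerically zero, apart from rounding
    soundsc(sTime, fs);                      % play the SAM tone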


Fig. 1.36 Harmonic sinusoidal amplitude modulation. A 500 Hz sinusoidal carrier is sinusoidally modulated by a 200 Hz modulator. Its spectral components, of 300, 500, and 700 Hz, respectively, are successively presented in the top panel, the second panel, and the third panel. The sum of these components is presented in the bottom panel. The two dashed lines are the positive and the negative envelope. (Matlab) (demo)

In the sound played in Fig. 1.36, the partials have frequencies of 300, 500, and 700 Hz, all multiples of 100 Hz. Hence, this is a harmonic signal, to which a virtual-pitch frequency of 100 Hz can be attributed. The inharmonic case is shown in Fig. 1.37. A 500 Hz carrier is modulated with a 220 Hz modulator. The top panel, the second panel, and the third panel show the waveforms of the three partials with frequencies of 280, 500, and 720 Hz, respectively. The bottom panel shows the waveform of the sum of these partials, and the positive and negative modulator as dashed lines. The F0 of the partials is 20 Hz, which is outside the range of pitch perception. Hence, the signal is not harmonic and, at the time scale shown in Fig. 1.37, no periodicity can be distinguished. In the demos of Figs. 1.36 and 1.37, steady tones are heard. Apparently, the frequencies of the three partials are so far apart that no fluctuations can be heard that may be attributed to the interference between adjacent partials. This shows that the frequency resolution of the auditory system is so high that the partials of the stimulus are spectrally resolved. The partials are called resolved partials and, when these partials are harmonic, resolved harmonics.


Fig. 1.37 Inharmonic sinusoidal amplitude modulation. Same as Fig. 1.35, except that the 500 Hz carrier is sinusoidally modulated by a 220 Hz modulator, so that the frequencies of the spectral components are 280, 500, and 720 Hz. These frequencies are inharmonic, so that no periodicity can be distinguished in the signal shown in the lower panel. (Matlab) (demo)

If not resolved, they are called unresolved partials and unresolved harmonics, respectively. As for the two-tone complex, this range of perception of a steady tone will now be discussed. Next, a description will be presented of what happens perceptually when the MF is decreased. The same perceptual ranges will be described as for the two-tone complex.

1.6.1 Amplitude Modulation: The Range of Perception of a Steady Tone

When the MF is high enough, the spectral components are so far separated in frequency that the spectral resolution of the auditory system is good enough to resolve these three components. In this case, one can hear a complex tone in which no temporal structure can be distinguished, except, of course, that the sound goes on and off.


Fig. 1.38 Harmonic sinusoidal amplitude modulation. A 1200 Hz sinusoidal carrier is successively modulated in amplitude by a modulator of 600, 400, 300, 240, 200 Hz. The MD is always 0.8. The waveforms of the sounds are presented in the left panels, their amplitude spectra in the right panels. The sounds are periodic with periods indicated in the successive panels. The harmonic positions are presented as upward ticks on the abscissae of the amplitude spectra. The spectral lines are exactly located on these harmonic positions. At the end of the demo, a pure-tone melody is played with the F 0 s as frequencies. (Matlab) (demo)

Above, harmonic AM was distinguished from inharmonic AM. First, some examples of harmonic AM will be demonstrated in Fig. 1.38. A 1200 Hz carrier is modulated in amplitude with harmonic MFs of 600, 400, 300, 240, and 200 Hz, respectively, all divisors of 1200 Hz. The amplitude of the carrier is 0.5 and the MD is 0.8 for all signals, so that the envelope of the sound fluctuates between 0.1 and 0.9. The dashed lines are the positive and negative envelopes. For these harmonic MFs, the periods of the envelopes are the same as the periods of the sound signals. In the panels on the right-hand side of Fig. 1.38, the discrete amplitude spectra of the sounds are displayed. As shown above, these consist of three lines with frequencies fc − fm, fc, and fc + fm, and amplitudes ½amd, a, and ½amd, respectively. These frequencies are harmonic and, hence, integer multiples of the F0s, which, in this example, are equal to the MFs. The harmonic positions corresponding to these F0s are indicated with the upward ticks on the abscissae of the spectra, the leftmost of which corresponds to F0. Hence, the demo shows a sequence of tones with F0s of 600, 400, 300, 240, and 200 Hz, respectively. And, since the pitch frequency of a tone complex is in general very well approximated by its F0, listeners should be able to hear a melody of tones with those pitches. For comparison, in the demo of Fig. 1.38, a melody of pure tones with these frequencies is played after the sequence of the SAM tones. Musically trained listeners will recognize this falling melody. It consists of a succession of a fifth, a fourth, a major third, and a minor third. This may not be so clear for every listener, not only because listeners differ in musical training, but also because some listeners, referred to as synthetic listeners, are more sensitive to virtual pitches than others, referred to as analytic listeners. This will be discussed in Sect. 8.4.5. Note that F0 is represented in the discrete spectrum only for the first tone in the demo. For the other tones, the spectral lines are positioned at harmonic frequencies of rank higher than 1. Hence, the pitches of these tones are virtual. Virtual pitch will be discussed in detail in Sects. 8.4.4 and 8.8. The inharmonic situation of SAM tones is shown in Fig. 1.39. The CF and the MD are the same as in Fig. 1.38, 1200 Hz and 0.8, respectively, but the modulation frequencies are now 564, 369, 284, 260, and 191 Hz. The harmonic positions defined by these MFs, in this case their integer multiples, are indicated by the small upward ticks on the abscissae of the spectra. As can be seen, the tones are no longer harmonic, and no melody with well-defined pitches can be heard. For comparison, at the end of the demo of Fig. 1.39, a melody of pure tones is played with pitches equal to the MFs of the SAM tones.
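The harmonic SAM sequence of Fig. 1.38, followed by the pure-tone melody, can be approximated by a sketch like the one below (not the book's own demo script; durations and pauses are arbitrary assumptions):

    fs  = 44100;  t = (0:1/fs:0.5).';    % half-second tones, chosen arbitrarily
    a   = 0.5;  md = 0.8;  fc = 1200;    % parameter values of Fig. 1.38
    MFs = [600 400 300 240 200];         % harmonic MFs, all divisors of 1200 Hz
    x   = [];
    for fm = MFs                         % the five SAM tones
        s = a*(1 + md*sin(2*pi*fm*t)) .* sin(2*pi*fc*t);
        x = [x; s; zeros(round(0.1*fs), 1)];
    end
    for fm = MFs                         % pure-tone melody with the F0s as pitches
        x = [x; 0.5*sin(2*pi*fm*t); zeros(round(0.1*fs), 1)];
    end
    soundsc(x, fs);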

1.6.2 Amplitude Modulation: The Range of Roughness Perception

As was discussed for the sounds consisting of only two sinusoids, temporal fluctuations become audible when the MF of SAM tones gets lower than, depending on the frequency of the carrier, about 100 to 150 Hz. These fluctuations induce the percept of roughness. Some examples are presented in the demo of Fig. 1.40. The listener is encouraged to compare this demo with the demo of Fig. 1.30 of the range of roughness perception for two-tone complexes. For both kinds of sounds, the fluctuations are so fast that they cannot be perceived one by one. One cannot, e.g., count the fluctuations and, although the fluctuations are regular in time, the sound has no regular rhythm.


Fig. 1.39 Inharmonic sinusoidal amplitude modulation. Same as Fig. 1.38 except that the 1200 Hz sinusoidal carrier is modulated in amplitude by a modulator of 564, 369, 284, 260, 191 Hz, respectively. The discrete amplitude spectra are no longer harmonic, and no musical pitches can be heard. For comparison, a pure-tone melody is played at the end of the demo with the MFs as frequencies. (Matlab) (demo)

1.6.3 Amplitude Modulation: The Range of Rhythm Perception

When the MF gets lower than 20 Hz, roughness is lost and, below 15 Hz, the fluctuations are heard as beats in the range of rhythm perception. This is illustrated in Fig. 1.41, in which the tones are modulated with frequencies of 10.0, 7.1, 5.0, 3.5, and 2.5 Hz, respectively. The reader is encouraged to compare this demo of the rhythm range with that of the two-tone complexes presented in Fig. 1.32. As for the roughness range, the phenomena are similar.


Fig. 1.40 Sinusoidal amplitude modulation in the roughness range. A 1200 Hz sinusoidal carrier is successively modulated in amplitude by a modulator of 100, 71, 50, 35, 25 Hz. The MD is always 0.8. The waveforms of the sounds are presented in the left panels, their amplitude spectra in the right panels. The envelopes of the sounds are periodic with periods indicated in the successive panels. (Matlab) (demo)

1.6.4 Amplitude Modulation: The Range of Hearing Slow Modulations

A number of SAM tones with MFs in the range of slow modulations is shown in Fig. 1.42. A 1200 Hz sinusoid is modulated in amplitude with frequencies of 0.50, 0.44, 0.39, 0.34, and 0.30 Hz. One can compare this with the similar demos for two-tone complexes of Fig. 1.34. A pure tone is perceived, gently fluctuating in loudness. For the fastest fluctuations, shown in the top panel of Fig. 1.42, one hears five fluctuations per ten seconds; for the slowest fluctuations, shown in the bottom panel, one hears three fluctuations per ten seconds.


Fig. 1.41 Sinusoidal amplitude modulation in the rhythm range. A 1200 Hz sinusoidal carrier is successively modulated in amplitude by a modulator of 10.0, 7.1, 5.0, 3.5, and 2.5 Hz. The MD is always 0.8. At the scale of the abscissa, the separate phases of the carrier cannot be distinguished and only the envelopes are significant. The envelopes of the sounds are periodic with periods indicated within the successive panels. (Matlab) (demo)

1.6.5 In Summary

Similar to what was shown for the sum of two sinusoids, the way in which a SAM tone is perceived strongly depends on the frequency and temporal resolution of our hearing system. Just as for the two-tone complex, four auditory ranges are distinguished. When the MF is larger than 100–200 Hz, the sound is perceived as a steady tonal sound. As for the two-tone complexes, a harmonic and an inharmonic situation are distinguished. In the harmonic situation, the MF and the CF have a common F0 in the pitch range. Listeners may hear a pitch at this F0, which is called virtual when none of the three partials of the SAM sinusoid has a frequency corresponding to this pitch. In the demo of Fig. 1.38, this is the case for the last four of the five complex tones played. For many listeners, these pitches will be weak, certainly when the rank of the harmonics is high.


Fig. 1.42 Sinusoidal amplitude modulation in the range of hearing slow modulations. Same as Fig. 1.41, except that the 1200 Hz sinusoidal carrier is modulated in amplitude by a modulator of 0.50, 0.44, 0.39, 0.34, and 0.30 Hz. The character of these sounds is that of 1200 Hz pure tones slowly fluctuating in loudness without inducing a rhythm. (Matlab) (demo)

These pitches will be more salient for the examples shown in the upcoming Sect. 1.7. When the MF is decreased below about 100–150 Hz, the partials of the SAM tones start to interfere perceptually, resulting in roughness. When the MF is decreased further, below about 15–20 Hz, there is a moment at which the CF of the SAM tones becomes easier to hear out, and the interference between the partials will no longer be perceived as roughness. Instead, a pure tone changing in loudness will be heard with the CF as pitch frequency. Moreover, below 15 Hz, the modulations induce beats that together form a regular rhythm. When the MF is further lowered below 1 Hz, this percept of beats is lost, and a pure tone is heard with the CF as pitch, slowly increasing and decreasing in loudness. This all corresponds quite well to the four auditory ranges described for the perception of the two-tone complexes. The same will be concluded in the next section on SFM tones.


1.7 Frequency Modulation

In frequency modulation (FM), a signal has a frequency that changes from moment to moment. The frequency of such a signal at a specific moment is called the instantaneous frequency. The temporal representation of a sinusoid sinusoidally modulated in frequency can mathematically be written as

\[
s(t) = a\,\sin[2\pi f_c t + \varphi_c + m_i \sin(2\pi f_m t + \varphi_m)] \tag{1.6}
\]

in which fc is the CF, fm the MF, φc and φm are phase shifts, and mi is the modulation index (MI), which determines the maximum deviation of the instantaneous frequency from the CF. Such tones are referred to as sinusoidally frequency-modulated (SFM) sinusoids or SFM tones. They will now be discussed for an SFM tone with a CF of 500 Hz, an MF of 200 Hz, and an MI of 1.0. This MF is so fast that no temporal fluctuations are heard, but a steady tone. So, though speaking of frequency modulation, nothing modulating is heard in the sound example of Fig. 1.43. For the sake of simplicity, and without loss of generality, the phase shifts φc and φm in Eq. 1.6 are chosen to be 0, so that the signal can be written as

\[
s(t) = a\,\sin[2\pi f_c t + m_i \sin(2\pi f_m t)]
\]
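The SFM tone of Fig. 1.43 can be sketched in Matlab as follows (not the book's demo script; sampling rate and duration are arbitrary assumptions):

    fs = 44100;  t = (0:1/fs:1).';            % sampling rate and duration, arbitrary
    a  = 1;  fc = 500;  fm = 200;  mi = 1.0;  % parameter values of Fig. 1.43

    s = a * sin(2*pi*fc*t + mi*sin(2*pi*fm*t));   % SFM tone with zero phase shifts
    soundsc(s, fs);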

The top panel of Fig. 1.43 presents the waveform of this signal. One can see a sinusoid contracting and expanding in time. This expresses itself, e.g., in the intervals between the zero crossings, which become shorter and longer. Naturally, the instantaneous frequency is higher than average during the contractions; during the expansions, it is lower than average. The middle panel of Fig. 1.43 shows the instantaneous frequency of the signal, calculated in the following way. Mathematically, a sinusoid can be written as

\[
s(t) = \sin[\Phi(t)].
\]

The instantaneous frequency I(t) can then be found by dividing the time derivative of Φ(t) by 2π:

\[
I(t) = \frac{\Phi'(t)}{2\pi} \tag{1.7}
\]

For the trivial case of an unmodulated sinusoid, Φ(t) is 2πft + φ, which yields I(t) = f, a constant frequency, as it naturally should be. In the case of SFM (see Eq. 1.6), Φ(t) = 2πfc t + mi sin(2πfm t), so that, since Φ'(t) = 2πfc + 2πfm mi cos(2πfm t), the instantaneous frequency I(t) is

\[
I(t) = f_c + f_m m_i \cos(2\pi f_m t) \tag{1.8}
\]
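The relation between Eqs. 1.7 and 1.8 can be checked numerically with a sketch like the following (not the book's demo script; the numerical derivative is evaluated between samples, so the match is approximate up to a half-sample offset):

    fs = 44100;  t = (0:1/fs:1).';      % sampling rate and duration, arbitrary
    fc = 500;  fm = 200;  mi = 1.0;     % parameter values of Fig. 1.43

    Phi   = 2*pi*fc*t + mi*sin(2*pi*fm*t);    % instantaneous phase
    Inum  = diff(Phi)/(2*pi) * fs;            % Eq. 1.7 with a numerical derivative
    Iform = fc + fm*mi*cos(2*pi*fm*t);        % Eq. 1.8
    plot(t(1:end-1), Inum, t, Iform, '--');   % the two curves virtually coincide
    xlabel('time (s)');  ylabel('instantaneous frequency (Hz)');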


Fig. 1.43 Harmonic sinusoidal frequency modulation of a sinusoid. A 500 Hz sinusoidal carrier is modulated in frequency by a 200 Hz modulator. The modulation index is 1.0. The top panel shows the waveform, the middle panel the instantaneous frequency, and the bottom panel the discrete amplitude spectrum. A continuous, complex tone with a pitch of 100 Hz is heard. (Matlab) (demo)

This shows that the instantaneous frequency, shown in the middle panel of Fig. 1.43, is a cosine with frequency fm fluctuating around fc; the maximum deviation from fc is fm · mi. Analogous to the maximum deviation from the average amplitude in AM, this is called the modulation depth (MD) or md:

\[
m_d = f_m \cdot m_i \tag{1.9}
\]

As, in Fig. 1.43, the MF is 200 Hz and the modulation index is 1, Eq. 1.9 gives an MD of fm · mi = 200 Hz. So, for the sound presented in Fig. 1.43, the instantaneous frequency is a 200 Hz cosine raised by 500 Hz and, hence, fluctuates between 300 and 700 Hz. Note, again, that, when listening to this sound, a steady tone is heard. Apparently, just as for the SAM tones with MFs higher than 100 Hz, the temporal resolution of the hearing system is not fine enough to detect any temporal change as presented in the time course of the instantaneous frequency shown in the middle panel of Fig. 1.43. Hearing a steady complex tone suggests that, just as a SAM tone, an SFM sinusoid can be considered as the sum of a number of sinusoids. This appears indeed to be the case, but the number of partials is not just three as for SAM tones. The calculation of the amplitudes of the partials of SFM tones requires some more advanced mathematics, in this case Bessel functions of the first kind, Jk(x), in which k = ..., −2, −1, 0, 1, 2, ... is the order of the Bessel function. Bessel functions have many applications, but here they are only presented in order to calculate the amplitude spectra of the SFM tones.

Their only property that will be used is that |J−k(x)| = |Jk(x)|. Indeed, an SFM tone with CF fc, MF fm, and MI mi can in its spectral representation be written as

\[
s(t) = \sum_{k=-\infty}^{\infty} J_k(m_i)\,\sin[2\pi (f_c + k f_m)\,t]. \tag{1.10}
\]

It can be seen that there is an infinite number of partials at equidistant frequencies that can be represented as fk = fc + k fm, for k = ..., −2, −1, 0, 1, 2, .... These frequencies are negative for k < −fc/fm. The partials with these negative frequencies have relatively low amplitudes, however, and are perceptually not significant when the MD fm · mi is considerably smaller than the CF. This means that the range of the instantaneous frequency should stay well above 0 Hz. This will be illustrated in the next demos. So, the discrete amplitude spectrum of an SFM tone consists of one central partial at the CF and sidebands that are equidistantly separated from each other by integer multiples of the MF. Since |J−k(mi)| = |Jk(mi)|, the amplitude spectrum is symmetric around the central partial. The amplitude spectrum of the demo of Fig. 1.43 is presented in the bottom panel. As the CF is 500 Hz and the MF 200 Hz, the frequencies of the partials are 500 ± k · 200 Hz. The largest partial is the central partial at the CF of 500 Hz. Then there are significant first-order sidebands at 300 and 700 Hz and second-order sidebands at 100 and 900 Hz. Higher-order sidebands have small amplitudes because the MD is 200 Hz, which is significantly smaller than the CF of 500 Hz, so that the minimum of the instantaneous frequency is 300 Hz. All these frequencies are integer, in this case odd, multiples of 100 Hz, so that the complex tone is harmonic with an F0 of 100 Hz. The harmonic positions are indicated by the upward ticks on the abscissa of the discrete spectrum. The corresponding pitch period of 10 ms is indicated in the upper panel of Fig. 1.43. What happens now when the modulation index is increased from 1 to 2? This situation is shown in Fig. 1.44. When the modulation index is 2, the MD increases to 2 · 200 = 400 Hz, so that the instantaneous frequency now fluctuates between 500 − 400 = 100 Hz and 500 + 400 = 900 Hz. So, the instantaneous frequency becomes as low as 100 Hz. Looking at the spectrum, depicted in the bottom panel of Fig. 1.44, it can be seen that the central partial has now become much smaller, while the contribution of the sidebands has increased. This is a general property of SFM sinusoids: the contribution of the central partial gets less for higher modulation indices. In Fig. 1.44, the largest partials are the partials of order −1 and 1, with frequencies at 300 and 700 Hz. The partials of order −2 and 2 are positioned at 100 and 900 Hz and are smaller, but still larger than the central partial. The partials of order −3 and 3 are also significant, although they are smaller than the partials of lower order. The partial of order k = 3 can still be distinguished in the spectrum; the partial of order k = −3 now has a frequency of −100 Hz and, hence, is not shown in the amplitude spectrum. So, just as the sound presented in Fig. 1.43, this sound, too, has a harmonic spectrum consisting of odd harmonics of 100 Hz.
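A sketch of how Eq. 1.10 can be checked numerically is given below. It uses Matlab's besselj for the sideband amplitudes and a truncated version of the sum; the truncation at order 20 is an arbitrary choice, sufficient here because |Jk(1)| is negligible for such high orders. This is not the book's demo script.

    fs = 44100;  t = (0:1/fs:1).';      % sampling rate and duration, arbitrary
    fc = 500;  fm = 200;  mi = 1.0;     % parameter values of Fig. 1.43

    sFM  = sin(2*pi*fc*t + mi*sin(2*pi*fm*t));        % temporal form
    sSum = besselj(0, mi) * sin(2*pi*fc*t);           % truncated Eq. 1.10
    for k = 1:20
        Jk   = besselj(k, mi);
        sSum = sSum + Jk * sin(2*pi*(fc + k*fm)*t) ...
                    + (-1)^k * Jk * sin(2*pi*(fc - k*fm)*t);   % J(-k) = (-1)^k * J(k)
    end
    max(abs(sFM - sSum))                              % numerically zero, apart from rounding
    abs(besselj(0:3, mi))                             % amplitudes of the partials of order 0 to 3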


Fig. 1.44 Harmonic sinusoidal frequency modulation of a sinusoid. Same as Fig. 1.43, except that the modulation index is 2.0, twice as high as in the previous figure. A continuous, harmonic complex tone with a pitch of 100 Hz is heard. (Matlab) (demo)

Consider now what happens when the MF has no harmonic relation with the CF. This situation is demonstrated in Fig. 1.45, in which a 500 Hz carrier is modulated by a 190 Hz modulator. The modulation index is 2.0. As in the previous figures, the top panel presents the waveform, the middle panel the instantaneous frequency, and the bottom panel the discrete amplitude spectrum. Since the amplitude of the partials, given by the absolute value of the Bessel functions |J_k(m_i)|, only depends on k and m_i, the amplitudes of the partials in the inharmonic spectrum of Fig. 1.45 are the same as those of the partials in the harmonic spectrum of Fig. 1.44. What has changed here is, except for the central partial, their frequency positions. The distance between the partials is 190 Hz, so that the partials of order −1 and 1 are now positioned at 310 and 690 Hz, those of order −2 and 2 at 120 and 880 Hz, and those of order −3 and 3 at −70 and 1070 Hz. The F_0 of the partials is 10 Hz, which is outside the pitch range of about 50–5000 Hz, so that the spectrum is not harmonic. Correspondingly, the waveform of the signal is not periodic, as can be checked in the top panel of Fig. 1.45. The instantaneous frequency, on the other hand, is a periodic function with a frequency of 190 Hz, as can be seen in the middle panel of Fig. 1.45; the period corresponding to this frequency, 1/190 = 5.26 ms, is indicated. The upward ticks on the abscissa of the bottom panel indicate the harmonic positions of the MF of 190 Hz, but these are clearly different from the frequencies of the partials of this inharmonic SFM tone.


Fig. 1.45 Inharmonic sinusoidal frequency modulation of a sinusoid. A 500 Hz sinusoidal carrier is modulated in frequency by a 190 Hz modulator. The modulation index is 2.0. The top panel shows the non-periodic waveform, the middle panel the instantaneous frequency, and the bottom panel the discrete amplitude spectrum. A continuous, inharmonic complex tone is heard. (Matlab) (demo)

This shows that complex tones with relatively complex spectra can be synthesized by modulating a simple sinusoid in frequency. Actually, the method can be extended in various ways to create sounds with spectra that resemble the spectra of the sounds of real musical instruments [4]. Since the 1970s, this has been intensively used in music synthesis [5]. Just as for the two-tone complex and the SAM tones, some more examples of harmonic and inharmonic FM in the range of hearing a steady tone will be described. Then the MF will be lowered, resulting in examples of FM in the roughness range, the rhythm range, and the range of hearing slow modulations.
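
As an illustration of how frequency modulation can be extended for sound synthesis along the lines of [4, 5], the sketch below generates a simple bell-like tone by letting the modulation index and the amplitude decay together over time. The carrier-to-modulator ratio, decay time, and maximum modulation index are illustrative guesses, not parameter values taken from the cited work.

```matlab
% Sketch: a simple FM "bell" in the spirit of FM sound synthesis
fs  = 44100;
t   = (0:1/fs:2).';               % 2 s tone
fc  = 500; fm = 700;              % inharmonic carrier/modulator pair (illustrative)
env = exp(-t/0.5);                % exponentially decaying amplitude envelope
mi  = 5*env;                      % modulation index decays along with the envelope
x   = env .* sin(2*pi*fc*t + mi .* sin(2*pi*fm*t));
% soundsc(x, fs)                  % uncomment to listen
```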

1.7.1 Frequency Modulation: The Range of Perception of a Steady Tone

Some examples of sinusoids harmonically modulated in frequency are demonstrated in Fig. 1.46. Their waveforms are presented in the left panels, and their discrete amplitude spectra in the right panels. Sinusoidal carriers of 1200 Hz are modulated in frequency by a modulator of, successively, 600, 400, 300, 240, 200 Hz. These CF and MFs are the same as those chosen in the demo of harmonic AM of Fig. 1.38.


Fig. 1.46 Harmonic sinusoidal frequency modulation. A 1200 Hz sinusoidal carrier is modulated in frequency by a modulator of 600, 400, 300, 240, 200 Hz, respectively; the MD is 600 Hz, so that, for all five sounds, the instantaneous frequency fluctuates between 600 and 1800 Hz. The left panels present the waveforms of the sounds, the right panels their discrete amplitude spectra and also present the modulation indices. The sounds are periodic with periods indicated in the left panels. The harmonic positions are presented as upward ticks on the abscissae of the amplitude spectra. The spectral lines are exactly located on these harmonic positions. At the end of the demo, a pure-tone melody is played with the MFs as frequencies. (Matlab) (demo)

In the demo of Fig. 1.46, the MD is kept constant at 600 Hz for all five sounds, so that the instantaneous frequency always fluctuates between 600 and 1800 Hz. The period of the modulation is indicated in the left panels of Fig. 1.46; the right panels give, besides the discrete amplitude spectra, the MIs, the CFs, and the MFs. Since the MD is kept constant, the MI increases with decreasing MF (see Eq. 1.9). Correspondingly, the contribution of the sidebands becomes larger. The modulation frequencies of 600, 400, 300, 240, and 200 Hz are all divisors of 1200 Hz, the CF. Hence, in these examples, the CF is an integer multiple of the MF, so that the MF is the F_0 of a harmonic complex tone with the CF as one of the harmonic frequencies. This is not true in general, as just shown in Figs. 1.43 and 1.44, where the F_0 of 100 Hz is lower than both the CF of 500 Hz and the MF of 200 Hz. The spectra of the sounds played in Fig. 1.46 are harmonic, however, with the MF as F_0. The corresponding harmonic positions are
presented as upward ticks on the abscissae of the amplitude spectra shown on the right-hand side of Fig. 1.46. The spectral lines are exactly located on these harmonic positions. So, the conclusion is that the sequence of tones played in the demo of Fig. 1.46 consists of harmonic tones with F_0s of 600, 400, 300, 240, and 200 Hz, just as for the demo on harmonic SAM presented in Fig. 1.38. The intervals between the successive tones are a fifth, a fourth, a major third, and a minor third. A melody of pure tones with these pitches is played after the sequence of SFM tones. As the number of harmonics is now larger than for AM, the strength or salience of these pitches is much larger, resulting in a melody that pops out more clearly.

The inharmonic case of SFM sinusoids is demonstrated in Fig. 1.47. The sounds are the same as in the previous figure, Fig. 1.46, except that the MFs are 564, 369, 284, 260, and 191 Hz, which have no harmonic relation with the CF of the sounds. The waveforms of the sounds are displayed in the left panels of Fig. 1.47. Since the instantaneous frequency is a cosine, the instantaneous frequency is maximum at the origin and at integer multiples of the period of the modulator, the first of which is indicated in the figure. The waveforms themselves, however, are not periodic, in contrast with those presented in the left panels of Fig. 1.46. Correspondingly, the frequencies of the partials do not coincide with multiples of the MF. This can be seen in the amplitude spectra shown on the right-hand side of Fig. 1.47, where the frequencies of the partials deviate from the harmonic positions defined by the MFs, indicated by the upward ticks on the abscissae. Due to this, the sequence of sounds played in the demo of Fig. 1.47 does not sound harmonic, in contrast with the sequence of sounds played in the demo of Fig. 1.46. For comparison with the harmonic case, a melody of pure tones with the MFs as frequencies is played after the sequence of SFM tones. What is heard illustrates the perceptual consequence of the lack of harmonicity.

In conclusion, for MFs higher than about 100–150 Hz, SFM of sinusoids results in steady complex tones. Depending on whether there is a harmonic relation between the MF and the CF, the sounds are harmonic or not. In the harmonic situation, SFM results in sounds with well-defined pitches.

1.7.2 Frequency Modulation: The Range of Roughness Perception

When the MF of an SFM tone is lowered below some 100–200 Hz, the temporal resolution of the hearing system is so fine that it starts to detect these modulations, resulting in the perception of roughness. As for two-tone complexes and SAM tones, the roughness range starts at about 20–30 Hz and ends, depending on the frequency of the carrier, somewhere between 100 and 200 Hz. Some examples are presented in Fig. 1.48. A 1200 Hz sinusoidal carrier is sinusoidally modulated with frequencies of 100, 71, 50, 35, and 25 Hz, frequencies equidistant on a logarithmic frequency scale.


Fig. 1.47 Inharmonic sinusoidal frequency modulation. Same as Fig. 1.46, except that the 1200 Hz sinusoidal carrier is modulated in frequency by a modulator of 564, 369, 284, 260, 191 Hz, respectively. The discrete amplitude spectra are no longer harmonic, and no musical pitches can be heard. For comparison, a pure-tone melody is played at the end of the demo with the MFs as frequencies. (Matlab) (demo)

The modulation indices of the sounds are again chosen in such a way that the MD is 600 Hz. Consequently, all five sounds have instantaneous frequencies fluctuating between 600 and 1800 Hz. The left panels of Fig. 1.48 show the waveforms of the sounds, the right panels their amplitude spectra. The period of the modulation is indicated in the left panels; the right panels give, besides the amplitude spectra, the modulation indices, the carrier frequencies, and the modulation frequencies. Looking at the waveforms in the left panels, the pattern of contracting and stretching sinusoids can still clearly be distinguished. In each panel, one period of the modulator is indicated by the two vertical lines, of which the first is drawn through the origin. Since the instantaneous frequency is a cosine, these lines give the times of two consecutive maxima of the instantaneous frequency of the sounds. Although the rough fluctuations can clearly be heard, no modulations in frequency can be heard corresponding to the periods of the instantaneous frequency. For the higher MFs, 100, 71, and 50 Hz, a pitch may be heard, but this is a low pitch corresponding with


Fig. 1.48 Frequency modulation in the roughness range. A 1200 Hz sinusoidal carrier is successively modulated in frequency by a modulator of 100, 71, 50, 35, 25 Hz; the MD is 600 Hz, so that, for all five sounds, the instantaneous frequency fluctuates between 600 and 1800 Hz. The waveforms of the sounds are presented in the left panels, their amplitude spectra in the right panels. The modulation indices are given in the right panels with amplitude spectra. The periods indicated in the left panels correspond to the periodicity of the instantaneous frequency of the signals. (Matlab) (demo)

the repetition frequencies of the modulations, not with the instantaneous frequency fluctuating around 1200 Hz. Hearing a pitch is associated with a harmonic spectrum, which brings us to the right panels of Fig. 1.48. Looking at the discrete spectra in the right panels of Fig. 1.48, a large number of partials can be distinguished. It is no coincidence that the frequencies of these partials more or less cover the range of the instantaneous frequency of the sounds, 600–1800 Hz. For the highest MF, 100 Hz, a harmonic spectrum can be distinguished consisting of harmonics with frequencies that are all integer multiples of 100 Hz. Due to the large number of significant harmonics, a clear 100 Hz pitch can be heard. For the 71 Hz modulator, a pitch of 71 Hz can still be heard, in spite of the fact that the spectrum is not strictly harmonic. Apparently, pitch perception is not a mathematically precise process in which the F_0s of the partials are calculated with mathematical precision. It will later be shown that pitch perception is much more a process in which the auditory system tries to find common periodicities


Fig. 1.49 Instantaneous frequencies for frequency modulation in the rhythm range. A 1200 Hz sinusoidal carrier is successively modulated in frequency by a modulator of 10.0, 7.1, 5.0, 3.5, and 2.5 Hz; the MD is 600 Hz, so that, for all five sounds, the instantaneous frequency fluctuates between 600 and 1800 Hz. The periods of the modulations are indicated by the two vertical lines. The modulations induce beats with a regular rhythm. (Matlab) (demo)

in the frequency components of the sounds. This is a statistical process in which deviations are tolerated. Consequently, sounds that are not strictly periodic can have pitch. Even noisy signals with pitch will be described. The discussion of these issues can be found in Chap. 8. Also for the 50 Hz modulator, a pitch may still be heard, but when the MF becomes lower than 50 Hz, the percept of pitch disappears completely, as can be heard in the last two sounds of the demo of Fig. 1.48. There, a rough sound is heard that is mostly evaluated as unpleasant. In the last sound, with the MF of 25 Hz, some listeners may already distinguish a component with a high frequency, the CF, which brings us to the range of rhythm perception.


1.7.3 Frequency Modulation: The Range of Rhythm Perception

Just as for two-tone complexes and SAM tones, the range of rhythm perception is entered when the MF of SFM tones is lowered to less than 15 Hz. Some examples are presented in the demo of Fig. 1.49. In this figure, neither the waveforms nor the amplitude spectra are presented, for the same reason that they are not shown for the SAM tones in the rhythm range. Only the instantaneous frequency of the sounds is presented, since that best represents the time course of what is perceived, a carrier modulated in frequency. In the rhythm range, a pure tone with a strongly modulating pitch frequency is now clearly heard. Moreover, the separate fluctuations induce beats that form a regular rhythm. When playing these sounds in a class, there are very often students who start moving in synchrony with this rhythm, demonstrating rhythmic entrainment, mentioned also for the two-tone complexes and the SAM tones in the previous sections.

1.7.4 Frequency Modulation: The Range of Hearing Slow Modulations

As for the two-tone complex and the SAM tones, real rhythm perception disappears when the MFs of the SFM sinusoids become lower than 1 Hz. Some examples are played in the demo of Fig. 1.50. A 1200 Hz carrier is modulated in frequency with MFs of 0.50, 0.44, 0.39, 0.34, and 0.30 Hz, respectively. The MD is 600 Hz, so that, for all five sounds, the instantaneous frequency fluctuates between 600 and 1800 Hz. Figure 1.50 shows the time course of the instantaneous frequency, of which one period is indicated in the corresponding panel by two vertical lines, one at the origin and the other at the period duration. Additionally, the MF and the MI are presented. A pure tone is heard, slowly fluctuating in pitch.

1.7.5 In Summary

Similar to what was shown for the sum of two sinusoids and for SAM tones, the way in which an SFM tone is perceived depends on the frequency and temporal resolution of our hearing system. In the temporal domain, four ranges are distinguished. When the MF is larger than 100–200 Hz, an SFM tone is perceived as a steady tonal sound. For these relatively high MFs, an inharmonic and a harmonic situation are distinguished. In the inharmonic situation, the MF and the CF do not have a common F_0 in the pitch range, while they do in the harmonic situation. In the latter case, listeners can


Fig. 1.50 Instantaneous frequencies for frequency modulation in the range of slow modulations. Same as Fig. 1.49, except that the 1200 Hz sinusoidal carrier is successively modulated in frequency by a modulator of 0.50, 0.44, 0.39, 0.34, and 0.30 Hz. The modulations are now so slow that they do not induce regular beats. (Matlab) (demo)

hear a pitch at the F_0 of the SFM tones. Since SFM tones have an infinite number of partials, this pitch is basically not virtual. For MFs higher than 100–200 Hz, the frequency resolution of our hearing system is so high that the frequency components are resolved and steady tones are heard. For lower MFs, the frequency components of the SFM tones start to interfere perceptually, resulting in roughness. When the MF is decreased further, roughness vanishes again and, below 20 Hz, the instantaneous frequency of the SFM tones can be distinguished better and better. Then a pure tone with a changing pitch frequency is heard. Moreover, below 15 Hz, the modulations induce beats that together form a regular rhythm. When the fluctuations are relatively rapid, the beats are counted in doublets, triplets, or quadruplets; when they are slower, they are mostly counted one by one. When the MF is lowered below 1 Hz, this percept of beats is lost in turn. Then a pure tone is heard with the slowly rising and falling instantaneous frequency as pitch. So,
as for the two-tone complex and the SAM tones, four different auditory ranges are distinguished for the perception of SFM tones: the range of hearing a steady tone, the roughness range, the rhythm range, and the range of hearing slow modulations. The boundaries between these ranges are not sharp. One range gradually merges into the next.

1.8 Additive Synthesis

One of the most common ways to synthesize a sound is to add a number of sinusoids. Moreover, many tonal sounds have a harmonic structure, i.e., the frequencies of the components are all multiples of a common frequency, F_0. Indeed, Fourier showed already in Napoleonic times that a periodic function f(t) can be decomposed into a series of sinusoids the frequencies of which are all multiples of this fundamental frequency F_0, i.e., the frequency corresponding to the periodicity of the signal:

$$ f(t) = a_0 + \sum_{k=1}^{\infty} \bigl( a_k \cos 2\pi k F_0 t + b_k \sin 2\pi k F_0 t \bigr) \qquad (1.11) $$

In this equation, called the Fourier expansion of f(t), a_k, k = 0, 1, 2, 3, ..., and b_k, k = 1, 2, 3, 4, ..., are the Fourier coefficients. The coefficient of lowest order, a_0, is the average of f(t) over one period T = 1/F_0. For k = 1, 2, 3, ..., the sequence a_k represents the cosine terms and the sequence b_k the sine terms. They can be found by the following Eq. 1.12:

$$ a_k = \frac{2}{T} \int_0^T f(t) \cos\Bigl(2\pi k \frac{t}{T}\Bigr) dt \quad \text{and} \quad b_k = \frac{2}{T} \int_0^T f(t) \sin\Bigl(2\pi k \frac{t}{T}\Bigr) dt. \qquad (1.12) $$

Since cos(−x) = cos x, the cosine is a symmetric function; the sine is an antisymmetric function, since sin(−x) = −sin(x). It follows that the series f_s(t) = \sum_{k=1}^{\infty} a_k \cos(2\pi k F_0 t) is symmetric, and that the series f_a(t) = \sum_{k=1}^{\infty} b_k \sin(2\pi k F_0 t) is antisymmetric, as can be checked in the next figure, Fig. 1.51, illustrating the Fourier expansions of a saw tooth and a square wave.
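
The Fourier coefficients of Eq. 1.12 can also be approximated numerically. The sketch below does so for one period of the antisymmetric saw tooth discussed next, using a simple rectangular-rule approximation of the integrals; the fundamental frequency and the number of samples per period are arbitrary choices for this illustration.

```matlab
% Sketch: numerical approximation of the Fourier coefficients of Eq. 1.12 for a saw tooth
F0 = 500; T = 1/F0;                       % fundamental frequency and period
N  = 1000;                                % samples per period (arbitrary)
t  = (0:N-1).' * T/N;                     % one period, sampled
f  = 2*mod(t/T + 0.5, 1) - 1;             % antisymmetric saw tooth (value 0 at t = 0)

a = zeros(1,5); b = zeros(1,5);
for k = 1:5
    a(k) = 2*mean(f .* cos(2*pi*k*t/T));  % rectangular-rule version of Eq. 1.12
    b(k) = 2*mean(f .* sin(2*pi*k*t/T));
end
a                                         % cosine terms: close to zero
b                                         % sine terms: close to 2/pi, -1/pi, 2/(3*pi), ...
```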

1.8.1 The Saw Tooth and the Square Wave

Two examples will be briefly discussed, the saw tooth waveform and the square waveform, both presented in Fig. 1.51. The average of the signals over one period is 0, which implies that the coefficient a_0 is 0 for both signals. Furthermore, the phase of the saw tooth is chosen in such a way that it is antisymmetric, which means that


Fig. 1.51 Additive synthesis of a 500 Hz saw tooth, upper panel, and a 500 Hz square wave, lower panel. They are plotted as the thick lines consisting of straight-line segments. The first five non-zero harmonics are plotted as thinner lines, with their sum as a thicker line. For both sounds, the demo plays successively the first harmonic, the sum of the first two harmonics, that of the first three harmonics, that of the first four harmonics, that of the first five harmonics, and is finished by playing the saw tooth and the square wave, respectively. (Matlab) (demo)

all cosine terms of the Fourier expansion have zero coefficients; the phase of the square wave is chosen such that it is symmetric, which means that all sine terms are 0. Applying Eq. 1.12 then shows that the saw tooth st(t) with frequency f_0 can be written as

$$ st(t) = \frac{2}{\pi} \sum_{k=1}^{\infty} \frac{(-1)^{k-1}}{k} \sin(2\pi k f_0 t) \qquad (1.13) $$

Hence, the sine terms b_k of the Fourier expansion of the saw tooth are equal to (−1)^{k−1} · 2/(kπ). This is shown in the upper panel of Fig. 1.51 for a saw tooth with an F_0 of 500 Hz corresponding to a period of 2 ms. The Fourier expansion of the square wave sq(t) can be calculated in the same way. All coefficients of even order are 0, so that the harmonics of rank 2k are zero. Only the odd harmonics of rank 2k − 1 have non-zero values. Indeed,

$$ sq(t) = \frac{4}{\pi} \sum_{k=1}^{\infty} \frac{(-1)^{k-1}}{2k-1} \cos\bigl(2\pi (2k-1) f_0 t\bigr) \qquad (1.14) $$

This means that the cosine terms a_k of the Fourier expansion of the square wave are equal to (−1)^{k−1} · 4/((2k−1)π). This is illustrated in the lower panel of Fig. 1.51.


Fig. 1.52 Same additive synthesis as in Fig. 1.51, but the harmonics are now added in random phase. The first five harmonics are plotted as thinner lines, with their sum as a thicker line. For comparison, the saw tooth and the square wave shown in Fig. 1.51 are also plotted as dashed lines. Although the waveforms of the sounds presented in Figs. 1.51 and 1.52 are very different, they sound the same. (Matlab) (demo)

There, the first five non-zero cosine terms of the expansion, i.e., the terms of rank 1, 3, 5, 7, and 9, are shown by thin lines. Their sum is shown by the thick line. The square wave itself, shown as the thick line consisting of straight-line segments, represents the final result of this addition if an infinite number of terms could be calculated and added.

In Fig. 1.51, the harmonics of the saw tooth are added in sine phase, those of the square wave in cosine phase. In Fig. 1.52, the harmonics are added with the same amplitudes as in Fig. 1.51, but now in random phase. The waveforms of the signals are now very different, and do not converge to the saw tooth or the square wave. Just as in Fig. 1.51, the saw tooth and square wave are shown again, but now as dashed lines. It will be clear that the sum of the first five harmonics, shown as the thick line, is very different from these dashed lines, and that the sum of the harmonics will not converge to the saw tooth and the square wave. In spite of these different appearances, the resulting sounds of Figs. 1.51 and 1.52 are virtually indistinguishable. This is one of the reasons why it is often said that the human hearing system is quite insensitive to phase. Phase will further be discussed, e.g., in Sect. 1.8.3.
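
As a minimal sketch of this kind of additive synthesis, the Matlab fragment below adds the first five harmonics of a 500 Hz saw tooth once in sine phase, as in the upper panel of Fig. 1.51, and once in random phase, as in Fig. 1.52. The sampling rate and duration are arbitrary; when played, the two versions sound virtually the same in spite of their different waveforms.

```matlab
% Sketch: first five harmonics of a 500 Hz saw tooth, in sine phase and in random phase
fs = 44100; F0 = 500;
t  = (0:1/fs:1).';
xSine = zeros(size(t)); xRand = zeros(size(t));
for k = 1:5
    ampl  = (2/pi) * (-1)^(k-1) / k;                       % Fourier coefficients of Eq. 1.13
    xSine = xSine + ampl * sin(2*pi*k*F0*t);               % sine phase
    xRand = xRand + ampl * sin(2*pi*k*F0*t + 2*pi*rand);   % random phase
end
% soundsc(xSine, fs); pause(1.5); soundsc(xRand, fs)       % the two sound the same
```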


1.8.2 Pulse Trains

In this book, pulse trains, i.e., periodic series of pulses, will be used at various instances. Fourier analysis applied to pulse trains shows that pulse trains can be approximated by adding an infinite number of harmonic cosines of equal amplitude. Indeed, if p(t) represents the pulse train,

$$ p(t) = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} a \cos(2\pi k f_0 t). \qquad (1.15) $$

This is illustrated in Fig. 1.53 for a 500 Hz pulse train in the same way as for the saw tooth and the square wave in Fig. 1.51. In Fig. 1.53, the pulse train is presented by the spikes; the first five harmonics of 500 Hz in cosine phase are presented as thin lines, with their sum as a thick line. One can see that this sum approximates the pulse train. A pulse train can, therefore, be seen not only as a series of pulses, but also as a harmonic complex tone of equal-amplitude sinusoids added in cosine phase.

Whether a complex tone or a series of separate pulses is perceived depends, again, on the pulse rate, i.e., the number of pulses per second. In fact, one can make the same distinction in four perceptually different ranges: the range of hearing a steady tone, the roughness range, the rhythm range, and the range of hearing slow modulations. This is illustrated in Fig. 1.54, where five pulse trains of different frequencies are presented. The frequencies of the pulses are 500, 200, 50, 5, and 0.5 Hz. Note the different abscissae of the five panels. The first two pulse trains, shown in the upper two panels of Fig. 1.54, are in the pitch range. They sound very busy and have well-defined pitch frequencies of 500 and 200 Hz, the frequencies of the two sounds. The third sound has a frequency of 50 Hz, which is well within the roughness range. Perhaps a pitch frequency of 50 Hz can still be distinguished, but the percept of roughness dominates. The fourth sound is within the rhythm range, and listeners will almost automatically count the pulses, either each pulse separately or in sets of two, three, or four. Finally, when the pulse rate is 0.5 Hz, no real rhythm is induced anymore. By the way, more pulse trains in the rhythm range are presented in the demo of Fig. 10.11.
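
The approximation of Eq. 1.15 can be checked with a few lines of Matlab; in the sketch below, the number of harmonics is limited to those below the Nyquist frequency, and the sampling rate and duration are arbitrary choices.

```matlab
% Sketch: a 500 Hz pulse train approximated by equal-amplitude cosines (Eq. 1.15)
fs = 44100; f0 = 500;
t  = (0:1/fs:0.02).';            % 20 ms, i.e., ten periods
N  = floor((fs/2)/f0) - 1;       % highest harmonic kept below the Nyquist frequency
p  = zeros(size(t));
for k = 1:N
    p = p + cos(2*pi*k*f0*t);    % equal-amplitude cosines in cosine phase
end
p = p/N;                         % normalization as in Eq. 1.15
plot(t, p)                       % narrow peaks at multiples of 1/f0 = 2 ms
```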

Fig. 1.53 Same as Fig. 1.51 for a 500 Hz pulse train. (Matlab) (demo)


Fig. 1.54 Pulse trains with different frequencies. From top to bottom, the frequencies are 500, 200, 50, 5, and 0.5 Hz. The highest two are in the pitch range, the third is in the transition region from pitch to roughness, and the fourth is in the rhythm range. The pulse train shown in the bottom panel is in the range where only separate pulses are perceived, without inducing a regular rhythm. (Matlab) (demo)

1.8.3 Phase of Sums of Equal-Amplitude Sinusoids

As has been illustrated in Figs. 1.51 and 1.52, it is often stated that the auditory system is insensitive to phase. In general, this is only true for resolved harmonics in the pitch range. Indeed, if a number of such harmonics is presented with different phase relations, their waveforms can look quite different, as shown in Figs. 1.51 and 1.52. Nevertheless, no clear difference between the various versions is audible. In these demos, the tones had F_0s of 500 Hz. In the demo of the next figure, Fig. 1.55, it will be demonstrated that this changes when the F_0 of these harmonics gets lower than the range of hearing a steady tone. All four sounds shown consist of the sum of the first 50 harmonics of 20 Hz. These harmonics have equal amplitudes. Consequently, all four


Fig. 1.55 Sine phase, cosine phase, and positive and negative Schroeder phase. All four sounds consist of the first 50 harmonics of 20 Hz, and have the same long-term amplitude spectrum. In spite of this, a clear difference can be heard between the sounds added in Schroeder phase and the other two sounds. (Matlab) (demo)

signals have the same discrete amplitude spectrum. The difference between the four signals only lies in the phase relations between the harmonics. In the upper panel, the harmonics are added in sine phase, in the second panel in cosine phase. These tones have a relatively high peak factor. The tones presented in the two lower panels have a relatively low peak factor. These tones were developed by Schroeder [27] in order to realize tones with as low a peak factor as possible. After their developer, these phase relations are called Schroeder phase. For these sounds, Schroeder derived the following equation: if ϕ_n is the phase of the nth of N harmonics, then ϕ_n = ±π · n²/N. As shown by the ± sign, the phases can be positive or negative. When they are positive, as shown in the third panel, this is called positive Schroeder phase; when
they are negative, as shown in the bottom panel, this is called negative Schroeder phase. Each 50 ms period has the character of a frequency glide. When the harmonics are added in positive Schroeder phase, one period of the signal has the character of a falling glide, whereas the character is that of a rising glide when the Schroeder phase is negative. It is clear that the peak factor is much higher when the harmonics are added in sine phase, as in the top panel, or in cosine phase, as in the second panel, than when they are added in Schroeder phase. Listening to the sounds shows that our hearing system is quite sensitive to these differences. Apparently, the temporal resolution of our hearing system is high enough to perceive the time course of the unequal distribution of energy over the period of the signal for the sine-phase and the cosine-phase signals. Similarly, the changing distribution of energy over the different frequency bands within one 50 ms period can be heard for the Schroeder-phase signals.
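
A minimal sketch of such sounds, using the phase formula ϕ_n = ±π · n²/N given above, is shown below; the sampling rate and duration are arbitrary, and the peak-factor comparison at the end is only meant to illustrate the difference with sine phase.

```matlab
% Sketch: the first 50 harmonics of 20 Hz in sine phase and in Schroeder phase
fs = 44100; F0 = 20; N = 50;
t  = (0:1/fs:1).';
xSin = zeros(size(t)); xPos = zeros(size(t)); xNeg = zeros(size(t));
for n = 1:N
    phi  = pi * n^2 / N;                       % Schroeder phase of the nth harmonic
    xSin = xSin + sin(2*pi*n*F0*t);            % sine phase
    xPos = xPos + sin(2*pi*n*F0*t + phi);      % positive Schroeder phase
    xNeg = xNeg + sin(2*pi*n*F0*t - phi);      % negative Schroeder phase
end
crest = @(x) max(abs(x)) / sqrt(mean(x.^2));   % peak factor
[crest(xSin), crest(xPos), crest(xNeg)]        % Schroeder phase gives much lower values
% soundsc(xPos, fs)                            % each 50 ms period sounds like a glide
```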

1.9 Noise

Up to now, only sounds consisting of a limited number of sinusoids have been discussed. In the frequency domain this means that the spectrum consists of just a limited number of spectral lines. Many different kinds of sound, however, cannot be described as simply as that, among them many environmental sounds. In this section, such so-called noisy sounds or, more simply, noise, will be discussed. Noise is not periodic and usually has no pitch, but noisy sounds that do have pitch will be described in Sect. 8.9.

Noise, in general, has a random, statistical character. This means that, when one selects a few stretches of noise at different times, they will be different. Noise is called stationary when its statistical properties do not change in time. These statistical properties comprise not only simple statistics such as the mean and the variance, but also the correlation between successive noise samples, in other words, the autocorrelogram. Since the Fourier transform of the autocorrelogram is the power spectrum, the expected value of the power spectrum of the noise is the same for every stretch of noise. This is why stationary noise is generally described by its power spectrum. Due to its random character, different short stretches of noise will have different power spectra but, on average, they will fluctuate around the same expected values. By averaging the power spectra of a sufficient number of successive stretches of noise, one can approximate the expected value of the power spectrum as closely as necessary, at least if the noise remains stationary.

The fact that two stretches of noise are different does not imply that they sound different. On the contrary, at various instances, different segments of noise will be played which, in spite of the differences, sound the same. Apparently, the auditory system integrates over time and frequency in such a way that the fluctuations that are inherently present in the noise are more or less averaged out. Based on the concept of entropy, Stilp, Kiefte, and Kluender [31] defined this in an exact way and showed that the less “random” the noise, the easier listeners can distinguish between different realizations.


Three different kinds of noise will be discussed, based on the relative distribution of power over low and high frequencies. When the power of the noise is equally distributed over all frequency bands when expressed in hertz, it is white noise. On average, on a linear frequency scale, the power spectrum of white noise looks flat. The term “white noise” comes from an analogy with light; light is white when it contains all frequencies, or wavelengths, in equal amounts. Besides white noise, two kinds of coloured noise are distinguished: pink noise and brown noise. When the power of the noise is equally distributed over all frequency bands when expressed not in hertz but in octaves, the noise is called pink. In this case, the power spectrum falls off with frequency f according to 1/f, which is why pink noise is also indicated as 1/f-noise, say one-over-f noise. When the power of the noise falls off with frequency according to 1/f², it is called brown noise.

1.9.1 White Noise

An example of two bursts of white noise is presented in Fig. 1.56. Each burst has a duration of 400 ms and the silent interval between them is 100 ms. The noise is generated by drawing independent random samples from a normal, also called Gaussian, distribution. Since the autocorrelogram of independently drawn samples is, by definition, zero everywhere except at 0, its Fourier transform is constant. Since the Fourier transform of the autocorrelogram is the power spectrum, the power spectrum of this noise is flat; hence the noise is white. So, what is heard in Fig. 1.56 is Gaussian white noise.

Figure 1.56 shows the wide-band and the narrow-band spectrogram. The wide-band spectrogram presented in the top panel is calculated with an analysis window of 1.6 ms; the narrow-band spectrogram presented in the middle panel has an analysis window of 51.2 ms. Comparing the wide-band and the narrow-band spectrograms shows that the upper one looks blurred in frequency but shows many details in the time domain, while the lower one looks more blurred in time but appears to show much more detail in the frequency domain. Remember that these differences are just properties of the analysis window and not of the noise. As one can see in the spectrograms, the energy of this noise is evenly distributed over all frequencies and also over time, but in a fluctuating, random way. The precise course of these fluctuations is different for the first and the second noise burst. Perceptually, however, these details are irrelevant. The two successive noise bursts sound the same.

Since Gaussian white noise contains all frequencies in an equal amount, and since normally distributed random processes have nice statistical properties, it is often used for measuring the transmission properties of sound-processing equipment. From a perceptual point of view, the statistical distribution of the random samples is quite irrelevant. Gaussian white noise sounds the same as white noise generated by drawing samples from a uniform distribution. Even the difference with noise consisting of samples drawn from a distribution consisting only of two opposite values, e.g., −1


Fig. 1.56 Waveform, narrow-band spectrogram, and wide-band spectrogram of two 400 ms bursts of Gaussian white noise. The duration of the analysis window is 1.6 ms for the wide-band spectrogram and 51.2 ms for the narrow-band spectrogram. Although the two successive spectrograms differ in details, the two noise bursts sound the same. (Matlab) (demo)

and +1, sounds the same as Gaussian white noise. Their spectra, too, look quite the same. This is illustrated in Fig. 1.57, where one can listen to three different kinds of white noise. The noise samples are drawn from three different distributions, all with a standard deviation of 0.25. The first noise burst consists of Gaussian white noise, hence of random samples drawn from a normal distribution. The second burst consists of samples drawn from a uniform distribution between −0.25·√12/2 and 0.25·√12/2, so that its standard deviation is also 0.25. The third burst consists of samples randomly drawn from either −0.25 or 0.25. Since the power spectra of these signals are the same, the difference between the signals is based on their phase spectra but, apparently, these are perceptually not very relevant. This is another example of the idea that the human auditory system is quite insensitive to phase.
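
The three white-noise bursts of Fig. 1.57 can be approximated as in the sketch below: samples drawn from a normal, a uniform, and a binary distribution, all scaled to a standard deviation of 0.25. The duration, sampling rate, and the pauses in the playback line are arbitrary choices.

```matlab
% Sketch: three kinds of white noise with the same standard deviation of 0.25
fs = 44100; n = round(0.4*fs);                  % 400 ms bursts
xGauss = 0.25 * randn(n, 1);                    % Gaussian white noise
xUnif  = 0.25 * sqrt(12) * (rand(n, 1) - 0.5);  % uniform white noise, std 0.25
xBin   = 0.25 * sign(randn(n, 1));              % random samples of -0.25 or +0.25
gap    = zeros(round(0.1*fs), 1);               % 100 ms silent interval
% sound([xGauss; gap; xUnif; gap; xBin], fs)    % the three bursts sound the same
```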

1.9.2 Pink Noise

A 400 ms interval of pink noise is presented in Fig. 1.58. The power spectrum of pink noise can be represented as c · f^{−1} or c · 1/f. This means that the power spectral density, the power in the spectrum per Hz, decreases from low frequencies to high frequencies. Light with this kind of spectrum looks pink, which is why this kind of


Fig. 1.57 Waveforms and narrow-band spectrograms of three different kinds of white noise with identical power spectra. The spectrograms are calculated with a 51.2 ms analysis window. The first noise burst consists of Gaussian white noise; the second of samples drawn from a uniform distribution; and the third burst consists of samples randomly drawn from just two values, −0.25 or 0.25. The three noise bursts sound the same. (Matlab) (demo)

noise is called pink. An important property of pink noise is that, although the spectral density per Hz decreases with increasing frequency, the spectral density per octave is constant. This implies that the power spectrum of pink noise falls off with 3 dB per octave. This is very relevant from a perceptual point of view because, from 20 Hz up to about 5 kHz, our hearing system becomes more sensitive by about 3 dB per octave. So, when our hearing system is stimulated with pink noise, the lower sensitivity for lower frequencies is more or less compensated by the higher spectral density of pink noise at these lower frequencies. In other words, up to about 5 kHz, pink noise excites our hearing system to a more or less equal extent. This is why pink noise is very often used when measuring the frequency-transfer characteristics of rooms such as concert halls, because it makes unwanted resonances better audible.

It can be seen in Fig. 1.58 that the waveform of pink noise deviates much more from the time axis than the waveform of white noise. This is due to the stronger presence of lower-frequency components. For the same reason, the dense band of the waveform looks narrower than for white noise. White noise can simply be generated by a noise generator that gives random samples with a Gaussian, or otherwise specified, distribution. There is no simple formula or simple filter that turns white noise into pink noise. In the demo of Fig. 1.58, pink noise is simply generated by adding 2000 sinusoids with frequencies equally spaced between 0 and 5000 Hz in random phase and with amplitudes that are inversely proportional to the square root of these frequencies, so that the power spectrum is proportional to 1/f.


Fig. 1.58 Pink noise. Waveform and narrow-band spectrogram of two 400 ms bursts of pink noise. (Matlab) (demo)

1.9.3 Brown Noise

Brown noise is called brown, not because of any connotation with colour, but because the signal looks like a one-dimensional random walk or Brownian motion, the motion of a large molecule or small particle in a fluid due to collisions with the molecules of the fluid. This motion is called Brownian after its discoverer, the biologist Robert Brown, who in 1827 observed random-like movements of pollen in a fluid. So, it would be more appropriate to write Brown noise or Brownian noise, instead of brown noise. In contrast with pink noise, for the synthesis of which there is no simple procedure, brown noise N_b(t) can easily be generated from white noise N_w(t) by integration of N_w(t):

$$ N_b(t) = \int_0^t N_w(\tau)\, d\tau. $$

After integration of a signal with power spectrum S(f), the power spectrum of the integrated signal can be represented as S(f) · f^{−2}. Since brown noise can be generated by temporal integration of white noise, which has a constant spectrum, the spectrum of brown noise looks like c · f^{−2}. Two 400 ms bursts of brown noise are presented in Fig. 1.59. As one can see, the high-frequency content of this noise is even further reduced than for pink noise.

In Fig. 1.60 the listener can compare the timbres of white, pink, and brown noise. The demo consists of three consecutive noise bursts: a burst of white noise, a burst of pink noise, and a burst of brown noise. When listeners are asked what kind of difference they hear, they may say that the sound gets less and less “sharp” or “bright”, which will be discussed in Sect. 6.4. Another possibility is that listeners may say that the


Fig. 1.59 Brown noise. Waveform and narrow-band spectrogram of two 400 ms bursts of brown noise. (Matlab) (demo)

Fig. 1.60 White noise, pink noise, and brown noise compared. The first noise burst has a white spectrum, the second pink, and the third brown. (Matlab) (demo)

noise sounds more and more muffled, as if produced behind a cloth or some other soft, sound-absorbing material. This corresponds to the better absorption of high-frequency sound components by many such materials.
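
Pink and brown noise can be generated along the lines just described. In the sketch below, pink noise is obtained by adding sinusoids with amplitudes inversely proportional to √f in random phase, and brown noise by cumulatively summing Gaussian white noise; the number of sinusoids, the frequency range, and the normalization are illustrative choices, not necessarily those of the demos.

```matlab
% Sketch: pink noise by additive synthesis, brown noise by integrating white noise
fs = 44100; t = (0:1/fs:0.4).';                % 400 ms
% Pink noise: power ~ 1/f, i.e., amplitude ~ 1/sqrt(f), in random phase
f = 2.5:2.5:5000;                              % 2000 equally spaced frequencies
xPink = zeros(size(t));
for k = 1:numel(f)
    xPink = xPink + (1/sqrt(f(k))) * sin(2*pi*f(k)*t + 2*pi*rand);
end
xPink = xPink / max(abs(xPink));
% Brown noise: running sum (discrete integration) of Gaussian white noise
xBrown = cumsum(randn(size(t)));
xBrown = xBrown - mean(xBrown);
xBrown = xBrown / max(abs(xBrown));
% soundsc(xPink, fs); pause(0.6); soundsc(xBrown, fs)
```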

1.10 Subtractive Synthesis

Besides additive synthesis there is subtractive synthesis. For subtractive synthesis an existing sound is chosen, and one or more components are removed from it. This is mostly done by filtering out specific frequency components. If low-frequency components can pass the filter and high-frequency components are attenuated, this is


Fig. 1.61 Low-pass filtered noise. The cut-off frequency is 1 kHz. (Matlab) (demo)

Fig. 1.62 High-pass filtered noise. The cut-off frequency is 4 kHz. (Matlab) (demo)

called a low-pass filter; a filter that passes high-frequency components and attenuates low-frequency components is called a high-pass filter; and if only a limited band of frequency components is passed and both low- and high-frequency components are attenuated, this is called a band-pass filter. Finally, if a certain band of frequencies is filtered out, while the lower and the higher frequency components are passed, the filter is called a band-stop filter. This will be illustrated by filtering white noise with low-pass, high-pass, band-pass, and band-stop filters. Figure 1.61 shows the waveform and spectrogram of a noise burst low-pass filtered at 1 kHz; Fig. 1.62 shows a noise burst high-pass filtered at 4 kHz; in Fig. 1.63 a noise burst is shown band-pass filtered between 1.414 and 2.828 kHz; and in Fig. 1.64 a band-stop filtered noise burst is shown, again with cut-off frequencies of 1.414 and 2.828 kHz.

When one listens to the low-pass filtered, the band-pass filtered, and the high-pass filtered noise bursts in succession, one can clearly hear a percept that is low for the


Fig. 1.63 Band-pass filtered noise. The lower cut-off frequency is 1.414 kHz; the higher cut-off frequency is twice that value, so 2.828 kHz. Hence, this noise spans one octave. (Matlab) (demo)

low-pass filtered noise, gets higher for the band-pass filtered noise, and is highest for the high-pass filtered noise. One may think that this corresponds to the pitch of these noise bursts. It will, however, be argued that this percept, which indeed does run from low to high, is not pitch but a perceptual attribute of a sound called brightness, one of the many perceptual attributes of a sound that are part of timbre. The main reason why brightness is not pitch is that no musical melodies are heard when these wide-band noise bursts are varied in frequency, not even when the centre frequencies of these wide-band noise bursts are varied in accordance with diatonic scales or harmonic progressions. In Fig. 1.67, later on in this section, an example will be presented. Brightness will extensively be discussed in Sect. 6.4. There, an as yet tentative model of brightness perception will be presented that assumes that brightness corresponds to a weighted mean of the frequencies of the sound components insofar as they excite our hearing system. When, on average, the sound components of a sound are higher in frequency, a higher brightness will be perceived than when, on average, the sound components are lower in frequency. Contrast this with what was said about the percept of pitch of tonal sounds, which, as a good first approximation, corresponds to the periodicity of the signal.

When one compares the timbres of these low-pass, band-pass, and high-pass filtered noise bursts, it is not so difficult to describe the changes one perceives (as said, one will probably say something about a lower or a higher sound), but this becomes more difficult in the case of band-stop noise or notched noise, shown in Fig. 1.64. Although one will hear a clear difference when a white-noise burst and a band-stop filtered noise burst are presented one after the other, it will be difficult to describe the difference in words. In Fig. 1.65, the four kinds of noise are shown one after the other for comparison, and in Fig. 1.66, an interval of white noise is presented without any interruption followed by an interval of band-stop noise. Although a clear change in the sound is audible, it is hard to describe.


Fig. 1.64 Band-stop filtered noise or notched noise. The notch is one octave wide and centred logarithmically around 2 kHz. Hence, its lower cut-off frequency is 1.414 kHz; its higher cut-off frequency is twice as high, so 2.828 kHz. Hence, the notch spans one octave. (Matlab) (demo)

Fig. 1.65 Low-pass, band-pass, high-pass, and band-stop noise compared. (Matlab) (demo)

Low-pass, high-pass, band-pass, and band-stop filtering, and hence subtractive synthesis, can, of course, not only be carried out on white noise bursts, but also on a wide variety of other sounds: pink noise, brown noise, complex tones, and recorded sounds. Furthermore, all kinds of filtering can be applied to different sounds, the filtered sounds can be combined, etc. In Fig. 1.67, the waveform and spectrogram are presented of band-pass filtered noise bursts the centre frequencies of which follow a musical melody. When one listens to this, one can clearly hear something getting higher, although, as just said, no clear melody with musical intervals can be perceived. This is due to the fact that the noise bursts do not induce a well-defined percept of pitch. What goes up in the demo of Fig. 1.67 but fails to induce pitch is the brightness of the sound.
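
As a minimal sketch of such subtractive synthesis, the Matlab fragment below band-pass filters a burst of white noise between 1414 and 2828 Hz by zeroing the FFT components outside the pass band. This brick-wall approach is only an illustration and not necessarily the filtering used for the demos of this section.

```matlab
% Sketch: band-pass filtering white noise (one octave around 2 kHz) in the frequency domain
fs = 44100; n = round(0.4*fs);                 % 400 ms of noise
x  = randn(n, 1);                              % white noise
X  = fft(x);
f  = (0:n-1).' * fs / n;                       % frequency of each FFT bin
f  = min(f, fs - f);                           % fold the mirrored bins back to 0 ... fs/2
pass = (f >= 1414) & (f <= 2828);              % one-octave pass band
y  = real(ifft(X .* pass));                    % brick-wall band-pass filtered noise
% soundsc(y, fs)
```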


Fig. 1.66 White noise without interruption followed by band-stop noise. Although one can hear a clear change in timbre, it is hard to describe this change in words. (Matlab) (demo)

Fig. 1.67 Band-pass filtered noise bursts with a bandwidth of one octave. The centre frequencies of the noise bursts rise in accordance with two consecutive diatonic scales spanning two octaves, as in the tonal example shown in Fig. 1.21. In spite of this, no clear musical scales are perceived as in Fig. 1.21. (Matlab) (demo)

The bandwidth of the noise played in the demo of Fig. 1.67 was fixed at one octave. If the bandwidth is decreased, there will be a moment when the noise bursts start to sound more tonal, and indeed do induce the percept of a pitch. In Fig. 1.68 this is shown for noise bursts with a bandwidth of one sixth of an octave. In the extreme case, when the noise bandwidth gets much smaller, say smaller than one twelfth of an octave, the sound will resemble the tune produced by someone whistling a melody. One may wonder when exactly one hears a musical melody, and when a series of noise bursts changing in brightness. Indeed, the distinction between these two categories is not so clear. If the listener expects a certain melody, he or she may “recognize” the tune in the series of wide-band noise bursts with centre frequencies rising and falling


Fig. 1.68 Same as Fig. 1.67, except that the noise bursts are bandpass filtered with a bandwidth of one sixth of an octave. In contrast with the much wider noise bands of Fig. 1.67, now two consecutive rising diatonic scales can be heard. (Matlab) (demo)

in accordance with this melody. But if one does not expect this melody, one may just hear a series of noise bursts varying in brightness. It will appear on various other occasions that the boundary between perceptual categories is not always clear. What one hears can depend on experience, context, and expectations.

1.11 Envelopes

In presenting the chromatic and diatonic scales, the amplitude course of the tones that comprised the scales was varied in different ways. Tones were synthesized with trapezoid envelopes, i.e., envelopes that are constant except for their onsets and offsets. At the onset, the amplitude of the tone increases linearly from 0 to the constant envelope value; at the offset, the amplitude returns linearly from its constant value back to 0. As discussed in Sect. 1.8.3 of this chapter, this removes spectral splatter and, with that, the generation of annoying audible clicks at these instants. Other tones had exponentially decaying envelopes, as illustrated in Figs. 1.20 and 1.21 for the chromatic and the diatonic scale, respectively. The character of these tones was very different; it was more that of an impact. This shows that the temporal envelope of a tone can have a large effect on the timbre of the tone.

In traditional music synthesis, e.g., in MIDI, the tone envelope is divided into four successive intervals: the attack A, the decay D, the sustain S, and the release R. The envelopes defined by such terms are called ADSR envelopes. An illustration is presented in Fig. 1.69. At the attack, the amplitude of the sinusoid rises linearly from zero to a certain value. During the decay, the amplitude decreases linearly to a level that remains fixed for a certain amount of time. This interval during which the amplitude is fixed is called the sustain. The sustain is followed by the release, in


Fig. 1.69 Illustration of a short 440 Hz tone with an “ADSR” envelope, an envelope defined by an attack, a decay, a sustain, and a release. The attack is 5 ms, the decay is 10 ms, the sustain is 60 ms, and the release is 25 ms, resulting in a short tone with a total duration of 100 ms. (Matlab) (demo)

Fig. 1.70 Nine 440 Hz 450 ms tones with ADSR envelopes with increasing attack times of 1, 2, 4, 8, 16, 32, 64, 128, and 256 ms, respectively. The decays are kept constant at 100 ms and the releases at 20 ms for all tones. One of the impressions that will arise, is that, due to the increasing attack times, the timbre of the tones changes from that of a percussive or a plucked tone to that of a blown tone. (Matlab) (demo)

which the amplitude returns linearly to 0. It appears that such envelopes can strongly influence the perceived character of the synthesized tone, its timbre. When the attack is very short, the timbre will be that of a plucked instrument, whereas, when it is relatively gradual, the timbre is more that of a tone played on a wind instrument. The tone shown in Fig. 1.69 sounds like a short plucked tone of a string instrument. In Fig. 1.70, a sequence of nine tones with ADSR envelopes is played with varying attack times. The frequency of the tones is 440 Hz, their duration 450 ms. The attack time of the first tone is 1 ms, followed by tones with attack times of 2, 4, 8, 16, 32, 64, 128, and 256 ms, respectively. Listening to this sequence, one will hear a transition from a tone that sounds like being plucked to a tone that sounds like being blown.
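
A trapezoid-like ADSR envelope as in Fig. 1.69 can be sketched in a few lines of Matlab; the segment durations follow the figure caption, while the sustain level of 0.5 and the sampling rate are assumptions made for this illustration.

```matlab
% Sketch: a 440 Hz tone with an ADSR envelope (attack 5 ms, decay 10 ms, sustain 60 ms, release 25 ms)
fs = 44100;
A = round(0.005*fs); D = round(0.010*fs); S = round(0.060*fs); R = round(0.025*fs);
sustainLevel = 0.5;                            % assumed; not specified in the text
env = [linspace(0, 1, A), ...                  % attack: rise linearly to the maximum
       linspace(1, sustainLevel, D), ...       % decay: fall linearly to the sustain level
       sustainLevel * ones(1, S), ...          % sustain: constant level
       linspace(sustainLevel, 0, R)].';        % release: fall linearly back to zero
t = (0:numel(env)-1).'/fs;
x = env .* sin(2*pi*440*t);                    % apply the envelope to a 440 Hz sinusoid
% soundsc(x, fs)                               % sounds like a short plucked tone
```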


It is concluded that the attack time defines to an important extent the perception of the way in which the sound is produced. Hence, the attack time is an important determinant of the timbre of the sound. Actually, not only the attack time of a musical sound plays an important role in the way in which a tone is perceived. At the onset of a tone, many musical instruments produce a broad spectrum of frequency components, each with its own time course; some components will die out very quickly, others will sound for a longer time. Moreover, some frequency components will fit into a harmonic pattern and, if these components do not decay too quickly, this harmonic pattern will determine the pitch of the produced tone. Other components, however, will not fit into this harmonic pattern, and will often die out relatively quickly. Furthermore, many different kinds of non-linear interactions may produce frequencies that are not originally present. It appears that these complex interactions, especially at the onset of musical sounds, are perceptually of great importance. Actually, the significance of the onset of a sound for its perception can hardly be overestimated. This will later be discussed in Sects. 5.4 and 6.6.

References

1. Bååth R, Madison G (2012) The subjective difficulty of tapping to a slow beat. In: Proceedings of the 12th international conference on music perception and cognition (23–28 July 2012, Thessaloniki, Greece), pp 82–85
2. Bååth R, Tjøstheim TA, Lingonblad M (2016) The role of executive control in rhythmic timing at different tempi. Psychon Bull Rev 23(6):1954–1960. https://doi.org/10.3758/s13423-016-1070-1
3. Bolton TL (1894) Rhythm. Am J Psychol 6(2):145–238. https://doi.org/10.2307/1410948
4. Chowning JM (1973) The synthesis of complex audio spectra by means of frequency modulation. J Audio Eng Soc 21(7):526–534. http://www.aes.org/e-lib/browse.cfm?elib=1954
5. Chowning JM, Bristow D (1986) FM theory & applications: by musicians for musicians. Yamaha Music Foundation, Tokyo, Japan. http://www.dxsysex.com/images/FM-SynthesisTheory-Applications-extract.pdf
6. Edwards E, Chang EF (2013) Syllabic (∼2–5 Hz) and fluctuation (∼1–10 Hz) ranges in speech and auditory processing. Hear Res 305:113–134. https://doi.org/10.1016/j.heares.2013.08.017
7. Fraisse P (1982) Rhythm and tempo. In: Deutsch D (ed) The psychology of music. Academic Press, London, UK, Chap. 6, pp 149–180
8. Garner WR (1951) The accuracy of counting repeated short tones. J Exp Psychol 41(4):310–316. https://doi.org/10.1037/h0059567
9. Goebl W, Palmer C (2013) Temporal control and hand movement efficiency in skilled music performance. PLoS ONE 8(1):e50901, 10 pages. https://doi.org/10.1371/journal.pone.0050901
10. Gunther L (2019) Tuning, intonation, and temperament: choosing frequencies for musical notes. In: The physics of music and color, 2nd edn. Springer Science+Business Media, Cham, Switzerland, Chap. 12, pp 303–324. https://doi.org/10.1007/978-3-030-19219-8_12
11. Hermann L, Matthias F (1894) Phonophotographische Mittheilungen. V. Die Curven der Consonanten. Pflügers Archiv 58(5):255–263
12. Lunney H (1974) Time as heard in speech and music. Nature 249(5457):592. https://doi.org/10.1038/249592a0
13. MacDougall R (1902) Rhythm, time and number. Am J Psychol 13(1):88–97. https://doi.org/10.2307/1412206
14. McAdams S, Drake C (2002) Auditory perception and cognition. In: Pashler H (ed) Stevens’ handbook of experimental psychology, volume 1: sensation and perception, 3rd edn. Wiley, New York, NY, Chap. 10, pp 397–452. https://doi.org/10.1002/0471214426.pas0110
15. Müller M (2015) Fourier analysis of signals. In: Fundamentals of music processing: audio, analysis, algorithms, applications. Springer International Publishing, Cham, Switzerland, Chap. 2, pp 39–114. https://doi.org/10.1007/978-3-319-21945-5_2
16. Parncutt R (1994) A perceptual model of pulse salience and metrical accent in musical rhythms. Music Percept: Interdiscip J 11(4):409–464. https://doi.org/10.2307/40285633
17. Parncutt R (1989) Harmony: a psychoacoustical approach. Springer, Berlin, pp i–xii, 1–206. https://doi.org/10.1007/978-3-642-74831-8
18. Parncutt R, Hair G (2018) A psychocultural theory of musical interval: bye bye Pythagoras. Music Percept: Interdiscip J 35(4):475–501. https://doi.org/10.1525/mp.2018.35.4.475
19. Patterson RD (1990) Auditory warning sounds in the work environment. Philos Trans R Soc B Biol Sci 327(1241):485–492. https://doi.org/10.1098/rstb.1990.0091
20. Plomp R (1964) The ear as frequency analyzer. J Acoust Soc Am 36(9):1628–1636. https://doi.org/10.1121/1.1919256
21. Plomp R, Mimpen AM (1968) The ear as frequency analyzer. II. J Acoust Soc Am 43(4):764–767. https://doi.org/10.1121/1.1910894
22. Plomp R (1967) Pitch of complex tones. J Acoust Soc Am 41(6):1526–1533. https://doi.org/10.1121/1.1910515
23. Roederer JG (2008) The physics and psychophysics of music: an introduction, 4th edn. Springer Science+Business Media, New York, NY
24. Rosen S, Howell P (2011) Signals and systems for speech and hearing, 2nd edn. Emerald, Bingley, UK
25. Schneider A (2018) Pitch and pitch perception. In: Bader R (ed) Springer handbook of systematic musicology. Springer GmbH Germany, Cham, Switzerland, Chap. 31, pp 605–685. https://doi.org/10.1007/978-3-662-55004-5_31
26. Schouten JF (1938) The perception of subjective tones. Proc K Ned Akad Wet 41:1086–1092
27. Schroeder MR (1970) Synthesis of low-peak-factor signals and binary sequences with low autocorrelation. IEEE Trans Inf Theory 16(1):85–89. https://doi.org/10.1109/TIT.1970.1054411
28. Seebeck A (1841) Beobachtungen über einige Bedingungen der Entstehung von Tönen. Annalen der Physik und Chemie 53(7):417–436. https://doi.org/10.1002/andp.18411290702
29. Sethares WA (2005) Tuning, timbre, spectrum, scale, 2nd edn. Springer, London, UK, pp i–xviii, 1–426. https://doi.org/10.1007/b138848
30. Stevens SS (1935) The relation of pitch to intensity. J Acoust Soc Am 6(3):150–154. https://doi.org/10.1121/1.1915715
31. Stilp CE, Kiefte M, Kluender KR (2018) Discovering acoustic structure of novel sounds. J Acoust Soc Am 143(4):2460–2473. https://doi.org/10.1121/1.5031018
32. Thompson WF (2013) Intervals and scales. In: Deutsch D (ed) The psychology of music, 3rd edn. Elsevier, Amsterdam, Chap. 4, pp 107–140. https://doi.org/10.1016/B978-0-12-381460-9.00004-3
33. Van Noorden LPAS, Moelants D (1999) Resonance in the perception of musical pulse. J New Music Res 28(1):43–66. https://doi.org/10.1076/jnmr.28.1.43.3122
34. Verschuure J, Van Meeteren AA (1975) The effect of intensity on pitch. Acta Acust United Acust 32(1):33–44
35. Wever EG (1929) Beats and related phenomena resulting from the simultaneous sounding of two tones: I. Psychol Rev 36(5):402–418. https://doi.org/10.1037/h0072876
36. Wright HN (1960) Audibility of switching transients. J Acoust Soc Am 32(1):138. https://doi.org/10.1121/1.1907866

Chapter 2

The Ear

The primary function of the auditory system is to make sense out of the acoustic waves that arrive at our ears. Though errors are made and precision has its limits, people with normal hearing can hear what happens around them, where it happens, and in what kind of space this all happens. This chapter is concerned with the way in which the acoustic waves arriving at our ears are transformed into information that can be used by the central nervous system to carry out these auditory functions. This transformation process takes place in our peripheral hearing system or, in short, our ears. Our ears can be divided into three consecutive parts: the outer ear, the middle ear, and the inner ear. The main auditory function of the outer ear is to concentrate acoustic pressure at the eardrum. The main function of the middle ear is to convert the pressure waves at the eardrum into fluid vibrations in the inner ear. The main function of the inner ear is to convert these fluid vibrations into neural information that can be passed on to and processed by the central nervous system. This neural information, as it comes out of the inner ear, consists of series of electric pulses, called action potentials or spikes, that propagate along nerve fibres, in this case the fibres of the auditory nerve, which runs from the inner ear to the lower part of our central nervous system, the medulla oblongata. These series of action potentials or spike trains contain essentially all auditory information used by the hearing system to carry out its functions, which are to interpret this information in a meaningful way so that listeners know what happens where, and in what environment.

Before discussing the anatomy and the physiology of our peripheral hearing system, it must be realized that, for the largest part of the frequency range to which we are sensitive, i.e., from about 20 to 20,000 Hz, our hearing system is extremely well adapted to naturally occurring sound levels, both low and high. As to the sensitivity for low-level sounds, it can be maintained that it would not make any sense to be more sensitive to sound than we actually are. If we were, sounds produced by our own bodies, such as heart beat, swallowing, chewing, and respiratory sounds with frequency components below about 500 Hz, would annoy us;


for higher frequencies—Sivian and White [78] give a range of 1000–6000 Hz—we would be able to hear the sound of the air molecules bouncing against our eardrums. As to high-level sounds, it is well known that many people, among them many schoolchildren, suffer from hearing loss due to exposure to high-level sounds, mostly at musical or sports events, where the sound level is often far too high. In this respect, it may seem strange to state that our hearing system is extremely well adapted to sound levels that occur in natural situations. But one must realize that this hearing loss is caused by sounds produced with the help of technology, such as heavy machines and electronic amplifiers. Indeed, harmful sound levels not produced by means of human technology are extremely rare. Even claps of thunder, though frightening, are only very seldom intense enough to cause any permanent hearing loss. For natural sound so intense that it can permanently damage our hearing, one must think of situations such as volcanic eruptions and raging tornadoes, which are, in general, much more dangerous in other respects than in the hearing damage they may inflict.

2.1 Overview

As just mentioned, the auditory system can be divided into three successive subdivisions, the outer ear, the middle ear, and the inner ear, schematically presented in Fig. 2.1. Here only a short description will be given; the three parts will be described in more detail later on. The outer ear consists of the pinna or auricle, and the ear canal. The latter is indicated in green in Fig. 2.1. The ossicles, indicated in blue, are positioned in the middle ear, which is indicated in red. The inner ear is coloured purple in Fig. 2.1. Each of these subdivisions plays its own functional role. First, there is the outer ear, consisting of the pinna and the ear canal with the eardrum or tympanic membrane at its end. One of the two main functions of the outer ear is to pick up sound and to maximize the acoustic pressure at the eardrum at the end of the ear canal. The other main function is sound localization. This function can be carried out thanks to the outer ear’s irregular shape. Due to this, the acoustic waves that arrive at the ear canal interact with each other in a way that depends on the direction from which the sound comes and that varies from frequency to frequency. As a consequence, we can hear whether a sound comes from above or from below, and from in front of us or from behind. The acoustic pressure waves at the end of the ear canal set the eardrum in motion which, in turn, sets the three ossicles within the middle ear in motion. The air-filled cavity of the middle ear is presented in red in Fig. 2.1, while the three ossicles are in blue. The first ossicle, in light blue, is attached to the eardrum, and the last, in dark blue, is attached to the entrance to the inner ear, the oval window. The main function of these ossicles is to transfer the motions of the eardrum to the fluid-filled canals of the inner ear. In addition, the last of these ossicles is attached to a small muscle that contracts when the sound level gets high, higher than about 60–80 dB, so that the excursions of the vibrations are reduced, in this way protecting


Fig. 2.1 Overview of the peripheral hearing system. Retrieved from https://commons.wikimedia.org/w/index.php?curid=49935036 under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0)

the inner ear from too intense vibrations. This brings us to the third subdivision of the peripheral hearing system, the inner ear, the auditory part of which is called the cochlea. In the cochlea, the sound signal is mechanically split up into a wide array of overlapping frequency components. The frequency of these components runs from about 20 Hz to about 20 kHz. Low-level frequency components are amplified, and then the transduction process takes place in which the mechanical motion is encoded into series of action potentials that contain the “neural code” [19] and convey this auditory information to the central nervous system. Acoustic waves with frequencies lower than 20 Hz are called infrasound, which by definition we cannot hear. But this does not mean that we cannot perceive it. The percept is different, however, from that of audible sound: “It is not only the sensitivity but also the perceived character of a sound that changes with decreasing frequency. Pure tones become gradually less continuous, the tonal sensation ceases around 20 Hz, and below 10 Hz it is possible to perceive the single cycles of the sound. A sensation of pressure at the eardrums also occurs” [49, p. 37].


2.2 Two Ears

Before discussing the three parts of the peripheral auditory system in more detail, a perhaps trivial but important aspect of our peripheral hearing system should be mentioned: we have two ears. Thanks to our two ears, we can hear whether a sound comes from the left-hand side of our head, from its right-hand side, or from a point in the median plane, i.e., the plane that divides the head into a left and a right half. When a sound is produced at a location in the median plane and encounters no obstacles on its way to the head, it will reach both ears at about the same time and with about the same intensity. But when the sound comes from a more lateral location, it will reach the ipsilateral ear a bit earlier than the contralateral ear and, due to the acoustic shadow of the head, it will also reach the ipsilateral ear with a somewhat higher intensity than the contralateral ear. When auditory localization is discussed, it will appear that these time and intensity differences with which sound reaches our ears play an important role in human sound localization. The three subdivisions of the peripheral auditory system, the outer ear, the middle ear, and the inner ear, will now be discussed in more detail.
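To give a rough idea of the order of magnitude of the interaural time difference mentioned above, a back-of-the-envelope estimate can be made. The MATLAB sketch below is only an illustration; the head radius and the simple spherical-head path-length approximation are assumptions, not values taken from this chapter, and auditory localization proper is treated in Chap. 9.

% Rough estimate of the maximum interaural time difference (ITD).
% Assumptions: a spherical head of radius r and the path-length
% approximation r*(theta + sin(theta)) for a source at azimuth theta.
r     = 0.0875;                         % assumed head radius in m
c     = 343;                            % speed of sound in air in m/s
theta = pi/2;                           % source directly to one side
itd   = r * (theta + sin(theta)) / c;   % extra path length divided by c
fprintf('Maximum ITD is roughly %.2f ms\n', itd * 1e3);   % about 0.66 ms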

2.3 The Outer Ear

The outer ear consists of the external ear, i.e., the pinna or auricle, and the ear canal. The cavity at the entrance of the ear canal, where we usually put earphones, is called the concha (see Fig. 2.1). The main function of the pinna is to concentrate acoustic energy at the concha. If this were its only function, the ideal shape of the pinna would be that of a circular saucer, a dish antenna. Actually, the pinna is quite asymmetrical, and it is additionally provided with irregularly shaped ridges and cavities, which are different from ear to ear. The asymmetry of the pinna has the effect that sound coming from in front of the listener is weighted more heavily than sound coming from behind. The cavities and ridges bring about resonances and antiresonances of the incoming sound waves, the frequencies of which depend on the direction from which the sound comes. This makes it possible to hear whether a sound is coming from above, from below, from behind, or from in front of us. All this will be discussed in closer detail in Chap. 9. The main function of the ear canal is that of a resonator [90]. The amplification due to the resonance properties of the outer ear is shown in Fig. 2.2, which shows in dB the power of an acoustic wave at the eardrum relative to its power at the entry of the ear. It shows a wide resonance peak with a maximum at about 3 kHz. This peak represents the combined resonance of the concha and the ear canal. Besides this main peak at 3 kHz, Fig. 2.2 shows more irregular behaviour above 10 kHz. At these frequencies, higher-order resonances of the ear canal play a role, together with the above-mentioned resonances and antiresonances of the outer ear that are involved in sound localization.
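The position of the main resonance peak near 3 kHz can be made plausible with a simple quarter-wavelength estimate. The following minimal MATLAB sketch treats the ear canal as a tube that is closed at the eardrum and open at the concha; the assumed canal length of 2.5 cm is a typical textbook value, not given in this chapter, and the contribution of the concha itself is ignored.

% Quarter-wavelength estimate of the main ear-canal resonance.
% Assumption: a tube of length L, closed at the eardrum, open at the concha;
% L = 2.5 cm is an assumed typical value.
c = 343;                    % speed of sound in air in m/s
L = 0.025;                  % assumed ear-canal length in m
f_res = c / (4 * L);        % quarter-wave resonance frequency in Hz
fprintf('Estimated ear-canal resonance: %.0f Hz\n', f_res);   % about 3.4 kHz

The estimate lands a little above the measured peak in Fig. 2.2, which also includes the resonance of the concha.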


Fig. 2.2 Filter function of the outer ear. Note the significant peak at around 3 kHz due to the main resonance of the concha and the ear canal. The more jagged behaviour above 10 kHz is due to a second order resonance and to resonances and antiresonances of the outer ear. Data derived from Moore, Glasberg, and Baer [51] (Matlab)

It follows that the anatomical construction of our outer ear can explain why we are most sensitive to frequencies between 2 and 5 kHz. Sensitivity to lower frequencies diminishes, and this is partly explained by the properties of the middle ear. Before turning to the discussion of this part of the peripheral hearing system, however, something will be said about another function of the ear canal, and that is to keep itself clean and open. When water, small insects, dust, or other particles find their way into the ear canal, it is not likely that these will vanish by themselves. So, they must be removed actively. As to the eardrum, its cells are continuously renewed, and the old cells are moved from the centre of the membrane outwards towards the sides of the ear canal. So, these used cell remnants of the eardrum must also be removed. To do so, the sides of the ear canal have glands that excrete a water-repellent substance, called earwax or cerumen; other cells have stretches of cilia that actively move the earwax outwards. If larger lumps of cerumen accumulate, chewing movements can loosen them from the wall of the ear canal. In this way, the ear is kept clean and blocking of the ear canal or the eardrum is generally prevented. In addition, earwax has antifungal and antibacterial properties which protect the ear against infections.

2.4 The Middle Ear

Due to the resonance function of the ear canal, pressure builds up at the end of the ear canal, where it can set the tympanic membrane or eardrum in motion. At the other side of the tympanic membrane is an air-filled cavity, the middle ear. The air in the middle ear can be refreshed through the Eustachian tube, which opens into the upper part of the throat, the nasopharynx. The Eustachian tube is mostly closed but opens every now and then, e.g., when we swallow or yawn. For the hearing process, the most important


Fig. 2.3 Schema of the ear showing the position of the ossicles malleus, incus, and stapes. The function of this construction is impedance matching, which guarantees that the sound vibrations in the air can be transmitted to the fluid-filled inner ear. Note: The positions of the two middle-ear muscles are not indicated. Retrieved from https://commons.wikimedia.org/wiki/File:Blausen_0330_EarAnatomy_MiddleEar.png under CC BY 3.0 (https://creativecommons.org/licenses/by/3.0)

part of the middle ear consists of three tiny bones, actually the tiniest bones in our body, the ossicles. These three ossicles are called hammer, anvil, and stirrup, or, scientifically, malleus, incus, and stapes. They are sketched in Fig. 2.3. The hammer is fixed to the eardrum, then there is the anvil, and finally the stirrup, which is fixed to the entrance of the inner ear, the oval window. The stapes is the smallest bone of the human body. The ossicles are kept in place by ligaments, shown in Fig. 2.3, and by two middle-ear muscles, not shown in Fig. 2.3, which will be discussed later in this section.


The main function of the ossicles is to convert the pressure waves in the air-filled ear canal into vibrations in the fluid-filled inner ear. Fluid is much more resistant to movement than air. Mechanically, this means that the impedance of fluid is much higher than that of air. A consequence of this is that, when a pressure wave in air arrives at the surface of a fluid, most of the wave is simply reflected and hardly any sound energy can enter the fluid. The ossicles must prevent this, and they do so through impedance matching. This means that the impedance of the air must be matched to that of the fluid, so that the air vibrations are indeed transmitted into fluid vibrations in the inner ear. The ossicles have two properties to realize this. The first is that the surface with which the hammer is attached to the tympanic membrane is much larger than the surface with which the stirrup is attached to the oval window. This implies that the air pressure exerted on the tympanic membrane is concentrated on a small area; compare this with the huge pressure that can be exerted by a stiletto heel. In fact, the ratio of the area of the tympanic membrane to that of the oval window is about 17 [98, p. 73]. A second property of the ossicles is that they have, to some extent, a kind of lever function. The ratio of the distance from the tympanic membrane to the centre of gravity of the malleus to the distance from this centre of gravity to the oval window is 1.3. Another factor of 2 in amplification is the result of the “buckling action” of the tympanic membrane [98, p. 74]. These calculations result in a total amplification by a factor of 17 · 1.3 · 2 = 44.2, corresponding to about 33 dB [98]. Actual measurements by Nedzelnitsky [52] show that near 1000 Hz the sound pressure at the oval window in the cochlea is 30 dB higher than that at the tympanic membrane. For lower frequencies, the pressure gain by the operation of the middle ear decreases to less than 0 dB. A transmission characteristic of the middle ear is presented in Fig. 2.4. The middle ear has a somewhat irregular band-pass characteristic; frequencies around 800 Hz are transmitted best. In Fig. 2.4 one can see that the transmission is about 15 dB less at 50 Hz than at 800 Hz. (An update of this characteristic has been published by Glasberg and Moore [24].) For low sound intensities, it is useful to transmit as much acoustic energy from eardrum to oval window as possible. For high sound intensities, however, this may not

Fig. 2.4 Filter function of the middle ear. Data derived from Moore, Glasberg, and Baer [51] (Matlab)


be the best thing to do, since the inner ear contains some very vulnerable structures, and too high sound intensities can damage these structures. One of the mechanisms to protect the inner ear against too intense sounds consists of what is called the stapedius reflex. When the sound intensity gets high, higher than about 60–80 dB, one of the two middle-ear muscles, the stapedius muscle, contracts. As the stapes is the smallest bone of the human body, the stapedius muscle is the smallest muscle of the human body. It is largely positioned inside the wall of the middle ear, with one end attached to the stirrup or stapes. When the stapedius muscle contracts, it reduces the mobility of the stapes to some extent, so that the vibrations of the tympanic membrane are less efficiently transmitted to the inner ear. The stapedius reflex is only effective for frequencies below about 1500–2000 Hz and, when active, reduces the sound level that reaches the inner ear by more than 15 dB [8, 10]. So, the stapedius reflex offers some protection against high-intensity sound. But there is another relevant aspect. Later in this chapter it will be shown that intense frequency components can mask other sound components with nearby frequencies. This masking effect is asymmetric in the sense that higher frequencies are more masked by lower frequencies than the other way round. For moderate, but not high, levels, Aiken et al. [1] find that the stapedius reflex neutralizes this upward spread of masking of speech by low-frequency sounds. As said, the stapedius reflex is only active for frequencies below 1500–2000 Hz. Another property of the stapedius reflex is that it is quite slow; after the onset of the sound, it takes almost 100 ms to become effective [15]. For high-intensity sounds with slow onsets this is not an issue, but it implies that impulsive sounds of high intensity are not effectively attenuated by the stapedius reflex. This is one of the reasons why impulsive sounds, called shot noise, can be much more damaging to the inner ear than more steady sounds of the same intensity. Hence, the protective role of the stapedius reflex is limited. This may hint at another important function of the stapedius reflex, and that is to reduce the intensity and the masking effects of sounds produced by the perceivers themselves. Indeed, the stapedius muscle is not only active when intense external sounds reach the listeners, but also when the listeners themselves make noise, e.g., when chewing, yawning, laughing, sneezing, coughing, and also when tightly closing the eyes [74], or when speaking [9]. In these situations, not only the stapedius muscle but also the second muscle in our middle ear is active, the tensor tympani. The tensor tympani is attached to the malleus and, in contrast with the stapedius muscle, does not contract in response to external sounds. The combined action of the stapedius muscle and the tensor tympani is called the middle-ear reflex. So, the most significant function of this middle-ear reflex is probably to reduce the intensity level of sounds produced by our own bodies. Indeed, the contractions of the two muscles start before the onsets of the actual sound production and also occur at low intensities [74]. Moreover, sounds produced by our own bodies in general have, with the exception of speech, a low-frequency content and, as mentioned, the middle-ear reflex predominantly attenuates low frequencies.
Hence, the function of the middle-ear reflex may not be so much protection against high sound levels as the reduction of the masking effects of sounds produced by the body itself, though


the significance of the stapedius muscle in this respect is debated [59]. Finally, the middle-ear reflex may serve to prepare the listener for the possible event of high-intensity sounds, since the reflex also occurs when tightly closing the eyes [74], when various parts of the head are stimulated, or when the eye is stimulated with a puff of air [38]. The last part of the middle ear to mention is the round window, a flexible membrane which functions as a pressure relief for the fluid in the inner ear. The fluid in the cochlea is set in motion by the movements of the stapes and, since fluid is incompressible, the displacement of this fluid would not be possible without such a relief. The round window functions as this pressure relief [18].
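As a worked check of the middle-ear amplification figures quoted earlier in this section, the three factors can be combined and expressed in decibels. The minimal MATLAB sketch below only restates the numbers given above; treating the combined factor as a pressure ratio, and hence using 20·log10, is the interpretation assumed here.

% Worked check of the middle-ear pressure gain quoted in this section [98].
area_ratio  = 17;     % area of tympanic membrane / area of oval window
lever_ratio = 1.3;    % lever action of the ossicles
buckling    = 2;      % buckling action of the tympanic membrane
gain        = area_ratio * lever_ratio * buckling;    % = 44.2
gain_dB     = 20 * log10(gain);                       % pressure gain in dB
fprintf('Pressure gain: %.1f, i.e. about %.0f dB\n', gain, gain_dB);   % about 33 dB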

2.5 The Inner Ear

After the outer and the middle ear comes the inner ear, sketched in purple in Fig. 2.1. It consists of a vestibular part, comprising the semicircular canals, clearly shown in Fig. 2.1, and the sacculus and the utriculus. As the vestibular system is not the subject of this book, it will not be discussed. The auditory part of the inner ear is called the cochlea. It consists of the spiral structure shown in the purple part of Fig. 2.1. In the cochlea, the mechanical motions induced by the vibrations arriving at the oval window are converted into neuronal spike trains that travel to the central nervous system. A cross section through the cochlea is presented in Fig. 2.5. In humans, the cochlea contains 2.5 spiral turns and is about 35 mm long; its width is about 1 cm and its height 0.5 cm. The cochlea is situated in the hardest part of our skull, the petrosal bone. The cochlea consists of a bony structure around a spiral-shaped cavity illustrated in Fig. 2.5. This cavity is divided into three parallel compartments, the names of which are indicated on the right-hand side of Fig. 2.5: the scala vestibuli, the cochlear duct (ductus cochlearis in the figure) or scala media, and the scala tympani. The scala media is positioned in between the two other scalae. These scalae spiral upwards from the basal part of the cochlea up to almost the apex of the cochlea. They are separated by two membranes: Reissner’s membrane, also called the vestibular membrane, which separates the scala vestibuli from the scala media, and the basilar membrane, which separates the scala media from the scala tympani. Note that the use of the terms “vestibuli” and “vestibular” in scala vestibuli and vestibular membrane does not mean here that these structures are part of the vestibular system. The scala vestibuli and the scala tympani merge with each other at the apex of the cochlea. This connection is called the helicotrema and allows the fluid in the scala vestibuli and the scala tympani to communicate. In fact, the scala vestibuli and the scala tympani are both filled with perilymph, a sodium-rich fluid that surrounds the membranous labyrinth of the inner ear. Perilymph contains little potassium. In contrast, the fluid in the scala media, called endolymph, contains much more potassium than sodium and is completely separated from the fluid in the other scalae. This high concentration of potassium is produced by the stria vascularis, a layer of


Fig. 2.5 Cross section through the cochlea. The details are discussed in the text and in the next figures. Retrieved in the public domain from https://commons.wikimedia.org/wiki/File:Gray928.png

cells against the bony side of the scala media. Due to this difference in sodium and potassium concentration between the perilymph in the scala vestibuli and the scala tympani on the one hand and the endolymph in the scala media on the other, there is a continuous voltage difference between endolymph and perilymph of 80–90 mV. This is of great importance for maintaining the energy supply of the cochlea. Together with the retina, the cochlea is among the most energy-consuming parts of our body [76]. Except for maintaining the difference in sodium and potassium concentration between scala vestibuli and scala media, Reissner’s membrane is quite passive. This contrasts sharply with the basilar membrane, which carries the organ of Corti, the actual sensory organ where mechanical vibrations are converted into trains of action potentials. The organ of Corti is positioned all along the basilar membrane from the oval window up to the helicotrema. The action potentials are generated in the endings of nerve fibres in the organ of Corti. These action potentials run along myelinated fibres in the direction of the central axis of the cochlea. There they pass the cell bodies of the nerve cells in the spiral ganglion. From this spiral ganglion they run further towards the axis of the cochlea along fibres that are also myelinated. These fibres unite in the cochlear or auditory nerve. This auditory nerve, which is the eighth of the twelve cranial nerves, carries auditory information to the central nervous system. There, the auditory-nerve fibres make contact with cells in the cochlear nucleus, the first auditory station in the central nervous system.


2.5.1 The Basilar Membrane

The conversion of mechanical energy into neuronal information is a delicate process. It takes place in the organ of Corti, a very vulnerable structure positioned all along the length of the basilar membrane within the cochlea. The basilar membrane is a long and thin membrane that vibrates and resonates along with the incoming acoustic waves. A schematic cross section through one turn of the scala media is presented in Fig. 2.6. Each location on the basilar membrane is specifically sensitive to a very narrow range of frequencies; the higher up in the cochlea, the lower the frequency to which the basilar membrane resonates. The frequency to which a location of the basilar membrane is most sensitive is called the characteristic frequency or the best frequency of that location. In humans, the basal portion of the basilar membrane near the oval window is sensitive to about 20,000 Hz at birth; at the apex of the cochlea, the characteristic frequency of the basilar membrane has decreased to less than 50 Hz.

Fig. 2.6 Cross section through the scala media, showing the three scalae, Reissner’s membrane, the tectorial membrane, and the basilar membrane with the organ of Corti. Retrieved from https://commons.wikimedia.org/wiki/File:Cochlea-crosssection.png under CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)


Fig. 2.7 Schematic overview of the basilar membrane. The numbers along the spiral give the approximate characteristic frequencies of the basilar membrane at that location in hertz. The data in the legend are from Stuhlman (1943), as cited by Warren [85, p. 11]. The thick cross lines are equidistant on a perceptual frequency scale and represent the frequency resolution of the basilar membrane. The H in the middle indicates the helicotrema (Matlab)

So, each different part of the basilar membrane resonates with a different frequency component of the sound, the more basal parts with the higher frequencies, and the more apical parts with the lower frequencies. This demonstrates the main function of the basilar membrane, frequency analysis [57, 58]. The basilar membrane splits the incoming sound into a large number of frequency components. In other words, the basilar membrane operates as a filterbank, a large array of band-pass filters, each with its own characteristic frequency. The distribution of centre frequencies over the basilar membrane is illustrated in Fig. 2.7, which shows a schematic view of the basilar membrane. At the basal part, close to the oval window, the basilar membrane is very narrow and stiff, resulting in sensitivity to high frequencies. As mentioned, the highest frequency we can hear is about 20 kHz, at least when we are young; for every ten years we get older, the upper limit of our hearing drops by at least 1 kHz. So, when we are 50 years


old, the maximum frequency we are sensitive to is at most 15 kHz. Ascending into the cochlea along the basilar membrane, the basilar membrane gets wider and less stiff, resulting in sensitivity to lower and lower frequencies. At the apex of the cochlea, the helicotrema, the characteristic frequency is as low as 40–50 Hz [32], and the lowest frequency that can set the basilar membrane in motion is about 20 Hz. Hence, the basilar membrane widens from base to apex. This may seem somewhat confusing, because cross sections through the cochlea show that the spiral-shaped cavity of the cochlea containing the basilar membrane narrows from base to apex. What is essential is that different locations on the basilar membrane resonate along with different frequency components of the sound arriving at our ears, with higher-frequency components closer to the base and lower-frequency components more towards the apex of the cochlea. Figure 2.7 suggests that every location of the basilar membrane is characterized by one specific characteristic frequency and that the excursions during stimulation with a sinusoid of that frequency are maximal at that location. This means that the location on the basilar membrane from which an auditory-nerve fibre originates specifies the frequency of the stimulus. This is referred to as the place code. It appears, however, that the location where the basilar membrane shows maximum excursions depends on the intensity of the stimulus. For each frequency, a much more stable location is the location where the travelling wave stops. For this reason, Greenwood [26] and Zwislocki and Nguyen [102] argue that it is not the location where the excursion of the basilar membrane is maximal that plays a role in the place code, but rather the location where it dies out. The most relevant part of the basilar membrane for hearing is the organ of Corti. Its position on the basilar membrane is shown in Fig. 2.6. Besides supporting cells, the organ of Corti consists of two kinds of hair cells, one row of inner hair cells and three rows of outer hair cells, covered by a third membrane in the cochlea, the tectorial membrane. The tectorial membrane is 0.1 mm wide at the base of the cochlea and broadens to 0.5 mm at the apex. The hair cells are called hair cells because each has an outward-pointing bundle of hair-like structures, the stereocilia, that protrude into the scala media just below the tectorial membrane. When the basilar membrane vibrates, the stereocilia bend left and right: the stereocilia of the outer hair cells because they are attached to the tectorial membrane, the stereocilia of the inner hair cells because of the viscosity of the endolymph between the organ of Corti and the tectorial membrane. Inner and outer hair cells have different functions. Both are connected to the endings of nerve fibres, the inner hair cells mainly to afferent fibres, and the outer hair cells mainly to efferent fibres. The primary function of the outer hair cells is to amplify frequency components of low intensity; their cell bodies can contract and relax [12]. Indeed, within a very narrow frequency range, they contract and relax actively in synchrony with low-level frequency components, thus amplifying these weak components [71]. This positive feedback action of the outer hair cells represents the cochlear amplifier. The cochlear amplifier is regulated by efferent fibres from the olivocochlear bundle. This function of the outer hair cells will be discussed in more detail in Sect. 2.5.3.1.


The inner hair cells are the true sensory cells; within the cell bodies of the inner hair cells, the mechanical movements of the basilar membrane are converted into a receptor potential or generator potential. This receptor potential induces the release of neurotransmitter into the synaptic cleft, the narrow slit between the inner hair cell and the nerve endings, the dendrites, of the auditory-nerve fibres with which these inner hair cells are closely connected. This neurotransmitter can then generate an action potential in these afferent auditory-nerve fibres. This function of the inner hair cells will be discussed in more detail in Sect. 2.5.3.2. What is clear is that, at low sound levels, the role of the outer hair cells is essential in inducing and amplifying the tiny movements of the basilar membrane. People who have lost a large part of the stereocilia of the outer hair cells, as all too often happens due to overexposure to high levels of noise, need much higher stimulation levels. An elaborate review of ideas concerning the mechanics of the basilar membrane is presented by Robles and Ruggero [65].
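The place–frequency map sketched in Fig. 2.7 can be approximated with the function proposed by Greenwood [26]. The MATLAB sketch below uses commonly cited human parameter values (A = 165.4, a = 2.1 per unit relative distance, k = 0.88); these specific values are not given in this chapter and should be taken as an illustrative assumption.

% Sketch of the human place-frequency map using Greenwood's function.
% The parameter values are commonly cited for humans, but they are an
% assumption here, not taken from this chapter.
x  = linspace(0, 1, 200);              % relative distance from apex (0) to base (1)
cf = 165.4 * (10.^(2.1 * x) - 0.88);   % characteristic frequency in Hz
semilogy(x, cf); grid on;
xlabel('Relative distance from apex');
ylabel('Characteristic frequency (Hz)');
% The curve runs from roughly 20 Hz near the apex to roughly 20 kHz at the
% base, in line with the values quoted in the text.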

2.5.1.1 Basilar-Membrane Response to a Pure Tone: The Travelling Wave

The image of the basilar membrane when excited by a pure tone is often depicted as a travelling wave [5] originating at the oval window, propagating upwards into the cochlea up to the location of maximum excitation, and then dying out abruptly. What this may look like is schematically illustrated in Fig. 2.8 for an uncoiled basilar membrane that is stretched out in width; the cochlear coils are uncoiled and the basilar membrane runs from the left, where it is sensitive to higher frequencies, to the right, where it is sensitive to lower frequencies. The dimensions of the basilar membrane as shown in Fig. 2.8 are not realistic; the real dimensions were presented in Fig. 2.7. The travelling wave of a continuous tone propagates from the oval window to the helicotrema, i.e., from left to right in Fig. 2.8. This figure shows the travelling wave at three rapidly succeeding moments, for a higher-frequency tone in the upper panel and for a lower-frequency tone in the lower panel. It shows that the amplitude of the travelling wave is first low close to the oval window, i.e., at locations with characteristic frequencies higher than the frequency of the tone; after that, the amplitude increases up to a maximum at the location whose characteristic frequency equals the frequency of the tone. Next, the amplitude rapidly returns to virtually zero once this location is passed, as shown on the right-hand side of Fig. 2.8. The fluctuations of the basilar membrane are naturally faster for the higher-frequency tone than for the lower-frequency tone, schematically indicated by the larger phase distance between the three waves in the upper panel than in the lower panel. Another way to illustrate the travelling wave, now for two tones, is presented in Fig. 2.9. It shows the fluctuations of the basilar membrane as a function of time. This is called a cochleogram, which will be discussed in more detail in Sect. 3.9.3. Figure 2.9 shows the cochleogram for stimulation with a two-tone complex consisting of a 500-Hz and a 5000-Hz tone. It is clearly seen that the basilar membrane naturally


Fig. 2.8 Schematic representation of a wave travelling along the basilar membrane for different phases of the stimulating sinusoid. In the upper panel a high-frequency sinusoid is used as stimulus, in the lower panel a low-frequency sinusoid. The lower the frequency of the sinusoid, the farther away the stimulated part of the basilar membrane is from the oval window. Naturally, the oscillations of the basilar membrane are much more rapid for the high-frequency sinusoid than for the low-frequency sinusoid. The dashed lines represent the positive and negative envelopes of the travelling waves (Matlab)

Fig. 2.9 The excursion of the basilar membrane as a function of time in response to two simultaneous tones, one of 500 and the other of 5000 Hz. The grey scale represents the excursion of the basilar membrane (Matlab)


vibrates much faster with the higher-frequency tone than with the lower-frequency tone. This image of a travelling wave was first presented in 1928 by Nobel laureate Von Békésy [5]. Von Békésy could, however, only carry out his measurements on basilar membranes of dead animals. As a result, he could only observe the passive response of the basilar membrane and not the active response, since it was found in the 1970s that the active amplification by the outer hair cells stops quickly after death [63]. The concept of a travelling wave assumes that mechanical energy is transported from the basal part of the basilar membrane up to where the basilar membrane resonates, after which most energy is rapidly dissipated and the wave stops. It is important to realize, however, that the oval window is set in motion by the stapes, and that the fluid in the cochlea is not compressible. As a result, the movements of the stapes induce a pressure wave in the fluid of the scala vestibuli. This pressure wave propagates through the perilymph in the scala vestibuli up to the helicotrema, from where it propagates through the scala tympani back down the cochlea to the round window, where it is finally released. Since this pressure wave propagates through fluid, it is almost instantaneous. This leads to pressure differences in the cochlea between the scala vestibuli above the basilar membrane and the scala tympani below the basilar membrane. It is these pressure differences that set the basilar membrane in motion. It is still not clear what exactly the relation is between the travelling wave and these pressure differences: what sets the basilar membrane in motion is still a matter of debate, see, e.g., Bell [6]. An amazing, and perhaps problematic, aspect of all this is that, for just audible low-intensity sounds, the excursions of the basilar membrane are of the order of magnitude of 0.1 nm, about the size of a hydrogen atom [75, 93]. A detailed review of the data is presented by Robles and Ruggero [65]. Every now and then people wonder whether these immensely small movements can indeed be the basis of the mechanics of the cochlea. For a review of current ideas about the travelling wave, see Olson, Duifhuis, and Steele [53]. It has been emphasized that the basilar membrane operates as a series of overlapping band-pass filters, in total about 3500, since there are about 3500 inner hair cells. Under stimulation with a pure tone of low intensity, a limited and well-defined interval of the basilar membrane is excited, an excitation mainly due to the active response of the outer hair cells that amplify the sinusoid in synchrony with the phase of the sinusoid. For higher intensities, the passive response of the basilar membrane more and more dominates the active response. Under stimulation with a pure tone, one can observe a vibration that starts just after the oval window, then increases in intensity up to a maximum, and then suddenly dies out. When the intensity of the pure-tone stimulus is high, the vibration can already start close to the oval window. In this way, the presence of high-intensity low-frequency components can raise the threshold for components with frequencies equal to the characteristic frequency of the basilar membrane at that location. This phenomenon is called upward spread of masking, to be discussed in more detail in Chap. 3.
Upward spread of masking was first described in 1924 by Wegel and Lane [86], though they did not yet use that term.
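A purely schematic snapshot in the spirit of Fig. 2.8 can be generated with a few lines of MATLAB. The envelope shape, the characteristic place, and the carrier used below are arbitrary choices for illustration only; they do not constitute a physical model of cochlear mechanics, and the shortening of the local wavelength towards the characteristic place is not modelled.

% Purely schematic snapshot of a travelling wave, in the spirit of Fig. 2.8.
% Envelope shape, characteristic place, and carrier are illustrative assumptions.
x    = linspace(0, 1, 1000);        % relative position, oval window (0) to apex (1)
x_c  = 0.6;                         % assumed characteristic place of the tone
env  = (x / x_c).^3;                % gradual build-up towards the characteristic place
env(x > x_c) = exp(-40 * (x(x > x_c) - x_c));   % abrupt decay beyond that place
wave = env .* cos(2 * pi * 8 * x);  % carrier with a fixed spatial frequency
plot(x, wave, x, env, '--', x, -env, '--');     % wave plus positive/negative envelopes
xlabel('Relative position along the basilar membrane');
ylabel('Excursion (arbitrary units)');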

2.5.1.2 The Impulse Response of the Basilar Membrane

It is known that the impulse response of a linear system completely specifies the system. It will appear, however, that the response of the basilar membrane is not linear. It has been argued that, due to the activity of the outer hair cells, sound components of lower intensities are amplified more than sound components of higher intensities. This means that the impulse response of the basilar membrane depends on the intensity of the stimulus. This will be discussed later on in this section. With this in mind, the impulse response of the basilar membrane at one fixed location will be described. The impulse response of the basilar membrane can be measured with various methods. One of the oldest methods is to use a capacitive probe [91]; another is to use the Mössbauer effect, with which the velocity of a radioactive substance can be measured. In the latter method, a small quantity of radioactive material is placed at a certain location on the basilar membrane, which then allows the measurement of the velocity of the basilar membrane at that location. Two results are presented in Fig. 2.10. One shows the impulse response of the basilar membrane of a guinea pig at 23 kHz as measured with a capacitive probe [93]; the other that of a squirrel monkey at about 7.2 kHz as measured with the Mössbauer effect [64]. In both cases, the impulse responses have the character of a band-pass filter, of which the first appears to be more broadly tuned than the second. A more recent method to measure the movements of the basilar membrane at a certain location is laser velocimetry [62]. In this method, a laser beam is aimed at a certain location of the cochlea, and the Doppler effect of the light reflected by the basilar membrane is then used to measure the velocity of the basilar membrane at that location. An example is presented in Fig. 2.11 for a location on the basilar membrane most sensitive to 10 kHz. The intensity of the stimulus pulses was varied in steps of 10 dB from 44 to 104 dB SPL. Two columns are presented: the left column presents the velocity of the basilar membrane after stimulation by an impulse; the right column presents the normalized velocity, i.e., the velocity divided by the amplitude of the impulse. If the response of the basilar membrane were strictly linear, the normalized velocities would have identical amplitudes, which is not the case. One can clearly see that the amplitude of the normalized response increases as the stimulus intensity decreases. Hence, the lower the stimulus intensity, the higher the amplification of the response of the basilar membrane. This amplification at lower intensities can be attributed to the amplification by the outer hair cells, i.e., to the activity of the cochlear amplifier.
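The band-pass character of the measured impulse responses in Figs. 2.10 and 2.11 is often approximated functionally by a gammatone, the filter treated in Chap. 3. The MATLAB sketch below generates such a gammatone-like impulse response for a 10-kHz place; the filter order of 4, the bandwidth factor of 1.019, and the ERB formula of Glasberg and Moore used for the bandwidth are commonly cited values, assumed here for illustration only.

% Sketch of a gammatone-like impulse response, a common functional
% approximation to basilar-membrane impulse responses (see Chap. 3).
% Order, bandwidth factor, and ERB formula are assumed, commonly cited values.
fs  = 44100;                                 % sampling frequency in Hz
t   = 0 : 1/fs : 0.005;                      % 5 ms of time
fc  = 10000;                                 % centre frequency in Hz, as in Fig. 2.11
erb = 24.7 * (4.37 * fc / 1000 + 1);         % equivalent rectangular bandwidth in Hz
b   = 1.019 * erb;                           % gammatone bandwidth parameter
g   = t.^3 .* exp(-2 * pi * b * t) .* cos(2 * pi * fc * t);   % 4th-order gammatone
g   = g / max(abs(g));                       % normalize the peak to 1
plot(t * 1000, g);
xlabel('Time (ms)'); ylabel('Normalized response');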

2.5.1.3 The Basilar-Membrane Response to a Pulse Train

In this section, the response of the basilar membrane to a wider-band stimulus will be discussed, a 200-Hz pulse train. The sound signal is depicted in the lower panel of Fig. 2.12. The upper panel shows the cochleogram, i.e., the excursions of the basilar membrane as a function of time; the lighter and darker regions in the upper panel


Fig. 2.10 Impulse responses of the basilar membrane measured with a capacitive probe, upper panel, and with the Mössbauer technique, lower panel. The upper figure is the impulse response at a characteristic frequency of 23 kHz measured from a guinea pig (Wilson and Johnstone [93, p. 713, Fig. 18]); the lower is the impulse response at about 7.2 kHz measured from a squirrel monkey (Robles, Rhode, and Geisler [64, p. 951, Fig. 5c]). (Reproduced from Pickles [55, p. 46, Fig. 3.12], with the permission of Koninklijke Brill NV and the Acoustical Society of America)

represent negative and positive excursions of the basilar membrane, respectively; an intermediate grey level represents the resting position of the basilar membrane. First, consider what happens at the locations of the basilar membrane that are sensitive to higher frequencies, i.e., frequencies higher than about 2000 Hz. One can see that, for those locations, the response to an impulse is damped out before the next pulse induces a response. Hence, to every pulse of the pulse train, the basilar membrane responds with one impulse response. In other words, the temporal resolution of the basilar membrane resolves every pulse of the train. But one can also see that, the lower the frequency, the longer the impulse responses last and, below about 2000 Hz, the impulse responses are not yet damped out when the response to the next impulse begins.


Fig. 2.11 Impulse response of the basilar membrane at the location where it is most sensitive to 10 kHz. In the left panel, the absolute velocities of the basilar membrane are plotted. In the right panel, the velocities are normalized to the sound pressure of the clicks. Reproduced from Recio et al. [62, p. 1974, Fig. 2], with the permission of the Acoustical Society of America

Second, consider what happens at the locations of the basilar membrane that are sensitive to lower frequencies, so for frequencies lower than about 1000 Hz. For describing what happens at those locations, the fact is used that a pulse train can be described as the sum of an infinite number of cosines of equal amplitudes with frequencies that are multiples of the fundamental frequency (see Eq. 1.15 and Fig. 1.53 discussed in Sect. 1.8). In this case, the fundamental frequency is 200 Hz, so that the pulse train can mathematically be described as the sum of harmonic cosines with frequencies of 200, 400, 600, 800 Hz, etc. In Fig. 2.12 one can see that, for the response to 200 Hz, there is one oscillation per pulse. Around 300 Hz the basilar membrane is not responding. At 400 Hz, the basilar membrane responds with two oscillations per pulse; so it vibrates with 400 Hz. Around 500 Hz the basilar membrane is not responding. At 600 Hz, the basilar membrane responds with three oscillations per pulse; so it vibrates with 600 Hz, etc., etc. So, one sees that, for these lower harmonics, the basilar membrane in fact responds as if it is stimulated by only one harmonic. This is a consequence of the fact that the adjacent lower harmonics are


Fig. 2.12 The cochleogram. Simulated linear response at various locations of the basilar membrane to a 200-Hz pulse train. The lower panel presents the signal, the upper panel the cochleogram (Matlab) (demo)

separated from each other by more than the bandwidth of one auditory filter. One says that the frequency resolution of the basilar membrane resolves the lower harmonics of the pulse train. It is concluded that, for higher frequencies, the temporal resolution resolves the response to every separate pulse of the pulse train, while, for lower frequencies, the frequency resolution resolves every separate harmonic. Another way to look at the response to the lower harmonics is to consider the corresponding locations at the basilar membrane as oscillators stimulated by pulses. The locations of the basilar membrane that are maximally sensitive to the lower harmonics of


200 Hz are then basically stimulated by the pulse train at the same phases of the impulse response: for 200 Hz at every period, for 400 Hz at every other period, for 600 Hz at every third period, for 800 Hz at every fourth period, etc. In this way, the separate responses amplify each other. For intermediate frequencies, i.e., frequencies not at harmonic positions, the pulses arrive at different phases of the impulse response, so that the separate stimulations no longer amplify each other but more or less cancel each other. Above about the fourth harmonic, that of 800 Hz in this case, one can already see that the amplitude of the response decreases in the course of the 5-ms period of the 200-Hz pulse train. In other words, those locations of the basilar membrane sensitive to frequencies higher than 800 Hz are excited by more than one of the harmonics of the pulse train; the frequency resolution of the basilar membrane is no longer sufficient to completely separate the responses to adjacent harmonics. All this shows that the lowest six to ten harmonics of the pulse train are in general spectrally resolved and excite different, separated intervals of the basilar membrane. At the higher end of the spectrum, i.e., for harmonic ranks higher than about eight to twelve, the harmonics are not spectrally resolved. Here, the temporal resolution of the basilar membrane is high enough to separate the response to each stimulus pulse, so that the response to every pulse appears as a separate impulse response. Naturally, these two descriptions are the extremes of a continuum: at the lower end of the continuum, the spectrally resolved low harmonics induce a continuous sinusoidal movement of the basilar membrane, and, at the other end, the basilar membrane is excited by every separate pulse of the pulse train and the response to every pulse has died out before the response to the next stimulus starts. As can be seen in Fig. 2.12, the transition is gradual. The cochleogram shown in Fig. 2.12 suggests that the higher frequencies arrive earlier at their characteristic location than the lower frequencies. Experiments by Wojtczak et al. [94] show that these longer delays at lower frequencies are probably compensated for in more central processing of the sound signals. In summary, the basilar membrane starts at the basal part of the cochlea, where the last of the three ossicles, the stirrup, is attached to the oval window. Initially, the movements of the oval window induce a pressure wave in the cochlea in the space above the basilar membrane, the scala vestibuli and the scala media. This pressure wave instantaneously ascends to the apical part of the cochlea, the helicotrema, where it propagates into the space in the cochlea on the other side of the basilar membrane. From there, it returns to the basal part of the cochlea, to the flexible and elastic round window, where it is released into the air-filled middle ear. In this way, pressure differences arise between the parts of the cochlea above and below the basilar membrane. These pressure differences can induce vibrations of the basilar membrane, but the extent to which this happens depends on the frequency sensitivity of the basilar membrane. The basilar membrane is relatively narrow and stiff at the basal side, and wide and flexible at the apical side. Thus, the more basal parts of the basilar membrane oscillate with the higher frequencies; the more apical parts with the lower frequencies.
In this process, the sound wave that enters the ear canal is converted into a large number of overlapping frequency bands. For frequency


components of low intensity, the vibrations of the basilar membrane are mostly the result of active amplification of the pressure wave by the outer hair cells; at higher intensities, the passive reaction of the basilar membrane gains more weight. Moreover, for a harmonic sound with many harmonics, all lower harmonics are resolved and, certainly at low and moderate intensities, every lower harmonic excites a relatively narrow segment of the basilar membrane; at these locations, the basilar membrane responds as if it were stimulated by one single sinusoid, and mutual phase relations are largely irrelevant. The higher the rank of the harmonics, however, the more the excitations of the basilar membrane by the individual harmonics start to overlap, and mutual phase relations begin to play a role. Furthermore, in a process of upward spread of masking, lower-frequency components of high intensity can also passively excite higher-frequency parts of the basilar membrane, thus reducing the sensitivity to such higher-frequency components.
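The statement that roughly the lowest six to ten harmonics of a 200-Hz pulse train are spectrally resolved can be made concrete by comparing the 200-Hz harmonic spacing with the auditory-filter bandwidth. The MATLAB sketch below uses the ERB formula of Glasberg and Moore, which is treated in Chap. 3; the criterion that a harmonic counts as resolved when the spacing exceeds one ERB is a simplifying assumption used here only for illustration.

% Rough check of which harmonics of a 200-Hz pulse train are spectrally
% resolved: compare the harmonic spacing with the auditory-filter bandwidth.
% The "spacing larger than one ERB" criterion is a simplifying assumption.
f0  = 200;                            % fundamental frequency in Hz
n   = 1 : 20;                         % harmonic ranks
f   = n * f0;                         % harmonic frequencies in Hz
erb = 24.7 * (4.37 * f / 1000 + 1);   % ERB at each harmonic frequency, in Hz
resolved = f0 > erb;                  % true where the spacing exceeds one ERB
fprintf('Highest resolved harmonic by this criterion: %d\n', find(resolved, 1, 'last'));

By this crude criterion the eighth harmonic is the highest resolved one, in line with the range of six to ten quoted above.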

2.5.2 Distortion Products

Up to now, the auditory filter has been described as a succession of a linear band-pass filter and a non-linear amplifier. There are, however, some other non-linearities in the system that generate distortion products. In general, these distortion products are not auditorily relevant. Therefore, only the two most significant ones will briefly be discussed, a quadratic and a cubic distortion product. As to the quadratic distortion product, when f_1 and f_2 are the frequencies of two components with f_1 < f_2, one of the quadratic distortion products is the difference tone with a frequency of f_2 - f_1. If s(t) is the sum of two sinusoids, sin(2π f_1 t) + sin(2π f_2 t), its square s^2(t) can be written as

\[
s^2(t) = [\sin(2\pi f_1 t) + \sin(2\pi f_2 t)]^2 = \sin^2(2\pi f_1 t) + 2\sin(2\pi f_1 t)\sin(2\pi f_2 t) + \sin^2(2\pi f_2 t)
\]

The second term of this equation is equal to

\[
\cos(2\pi (f_2 - f_1) t) - \cos(2\pi (f_1 + f_2) t) \tag{2.1}
\]

Auditorily, the first component of Eq. 2.1, with frequency f_2 - f_1, is the most important, as will be explained below. Similarly, the main cubic distortion product can be found. The third power s^3(t) of the sum of two sinusoids can be written as

\[
s^3(t) = [\sin(2\pi f_1 t) + \sin(2\pi f_2 t)]^3 = \sin^3(2\pi f_1 t) + 3\sin^2(2\pi f_1 t)\sin(2\pi f_2 t) + 3\sin(2\pi f_1 t)\sin^2(2\pi f_2 t) + \sin^3(2\pi f_2 t) \tag{2.2}
\]


For the cubic distortion products, the distortion product with frequency 2f_1 - f_2 is the most important. It arises from the second term of this equation, 3 sin^2(2π f_1 t) sin(2π f_2 t), which equals

\[
\tfrac{3}{2}\left[1 - \cos(2\pi \, 2 f_1 t)\right]\sin(2\pi f_2 t)
= \tfrac{3}{2}\sin(2\pi f_2 t) + \tfrac{3}{4}\sin(2\pi (2 f_1 - f_2) t) - \tfrac{3}{4}\sin(2\pi (2 f_1 + f_2) t) \tag{2.3}
\]

The second of these components has a frequency of 2f_1 - f_2 and, indeed, in the presence of two tones a third tone with frequency 2f_1 - f_2, not present in the stimulus, can often be heard. The third term, with frequency 2f_1 + f_2, is much weaker, if audible at all. In order to explain this, note that, if f_1 < f_2, 2f_1 - f_2 is smaller than both f_1 and f_2; for instance, if f_1 is 1000 Hz and f_2 is 1200 Hz, 2f_1 - f_2 is 800 Hz. It can be shown that this distortion product is generated on the basilar membrane, probably at the location where the two components interact [37, 66]. From there it is propagated upwards along the cochlea, just like the normal travelling wave, and amplified by the outer hair cells at the location where the basilar membrane is most sensitive to 2f_1 - f_2. The other cubic distortion product, 2f_1 + f_2, is much higher in frequency than both f_1 and f_2. It is much weaker, probably because, to be audible, it would have to be propagated downwards in the cochlea, against the natural direction of the travelling wave [50, p. 19]. These distortion products are also referred to as combination tones. The quadratic combination tone with frequency f_2 - f_1 and the cubic combination tone with frequency 2f_1 - f_2 are the ones that are best audible. The latter, cubic combination tone is the first of a series of combination tones the frequencies of which can be written as (k + 1) f_1 - k f_2, k ≥ 1. Distortion products are very revealing about the operation of the peripheral hearing system. In addition, they are only produced in healthy cochleas with intact outer hair cells and, as such, are extensively used in medical examinations of cochlear function [34]. Yet they will be mentioned only in passing in the rest of this book. They are weak and, by adding a small amount of noise to the stimuli, combination tones can easily be made inaudible when experimental meticulousness requires this. An elaborate and extensive review of the origin of auditory distortion products is presented by Avan, Büki, and Petit [4].
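A quick numerical way to see these distortion products is to pass a two-tone signal through a memoryless nonlinearity and inspect the spectrum. In the MATLAB sketch below, the nonlinearity y = x + 0.1x^2 + 0.1x^3 and its coefficients are arbitrary assumptions standing in for the far more complicated cochlear nonlinearity; only the frequencies of the resulting components, not their levels, are meaningful.

% Numerical illustration of quadratic and cubic distortion products.
% The nonlinearity and its coefficients are arbitrary illustrative assumptions.
fs = 16000;  t = (0 : fs - 1) / fs;          % 1 s of signal
f1 = 1000;   f2 = 1200;                      % primary frequencies in Hz
x  = sin(2 * pi * f1 * t) + sin(2 * pi * f2 * t);
y  = x + 0.1 * x.^2 + 0.1 * x.^3;            % distorting nonlinearity
Y  = abs(fft(y)) / length(y);                % magnitude spectrum
f  = (0 : length(y) - 1) * fs / length(y);   % frequency axis in Hz
% Besides f1 and f2, the spectrum contains peaks at f2 - f1 = 200 Hz
% (quadratic) and at 2*f1 - f2 = 800 Hz (cubic), among others.
plot(f(f <= 3000), Y(f <= 3000));
xlabel('Frequency (Hz)'); ylabel('Magnitude');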

2.5.3 The Organ of Corti

The question now is how the vibrations of the basilar membrane are converted into information that can be transmitted to the more central parts of our nervous system. This process, in which mechanical energy is converted into neural information, takes place in the organ of Corti and is called transduction. This neural information consists of spike trains, trains of electrical pulses of equal amplitude called action


potentials or spikes, generated at the nerve endings of the auditory-nerve fibres closely connected to the real sensory cells of the auditory system, the inner hair cells. Before turning to the operation and function of the true sensory organ, the organ of Corti, a remark will be made on the vulnerability of this structure in the inner ear. In Sect. 2.4, two reflexes were mentioned that have the function of reducing the intensity of high-level sounds before they are transmitted into the cochlea: the tensor-tympani reflex, which tightens the malleus, and the stapedius reflex, which tightens the stapes. So, these two reflexes play a suppressive role by stiffening the chain of ossicles. They are only active for frequencies lower than 1500–2000 Hz. Both reflexes anticipate the sounds produced by the body itself, such as during speaking, singing, chewing, and swallowing. Since most of these sounds, except speaking and singing, are low in frequency, these reflexes are quite effective. During speaking and singing, they are especially effective in suppressing the effects of upward spread of masking, discussed above. Only the stapedius reflex responds to extraneous sound, and this response only suppresses the low-frequency components. Moreover, it is quite slow, almost 100 ms, which makes it ineffective against sounds with abrupt onsets, such as shot noise. Apparently, auditory damage by very intense sounds was originally so rare that it has not played a significant role in the evolution of the human auditory system. This sharply contrasts with sounds produced by current technology, where machines and electronic equipment easily produce sound levels far exceeding the level at which they are still harmless. An additional problem is that many people, in particular young people, like to party and dance in extremely loud music environments such as discotheques and dance halls. Many of them lose some of their hearing faculties due to their presence at those events, where the sound levels are so high that the excursions of the basilar membrane become so large that the bundles of stereocilia break off. Once human outer hair cells have lost their stereocilia, they never recover. In order not to end too pessimistically: for people with this kind of hearing loss, there may be some hope in the far future. In reptiles and birds, it appears that hair-cell loss due to hearing damage can be restored, and “the most recent studies demonstrated that some genetic manipulations can drive mammalian supporting cells to convert into hair cell-like cells, but so far, these manipulations seem to be most effective in the neonatal ear and are insufficient to restore inner ear anatomy and function to normal levels in mature animals” [69, p. 48]. Recapitulating, the process of transduction takes place in the organ of Corti, which stretches out all along the upper side of the basilar membrane, from the basal, high-frequency part at the oval window up to the apical, low-frequency part at the helicotrema. A schematic cross section through the scala media has been presented in Fig. 2.6. A more detailed picture of the organ of Corti is presented in Fig. 2.13. Since the human basilar membrane has one row of inner hair cells and three rows of outer hair cells, these cross sections show one inner hair cell and three outer hair cells. Both the inner hair cells and the outer hair cells are supported by supporting cells.
The row of inner hair cells and their supporting cells are separated from the outer hair cells and their supporting cells by the tunnel of Corti. All along the basilar membrane, both rows of hair cells are covered by another membrane, the tectorial


Fig. 2.13 Cross section through the organ of Corti. The hair cells are green, the tectorial membrane is blue, the afferent fibres from the auditory nerve are red, and the efferent fibres from the olivocochlear bundle are yellow. Reproduced from Fettiplace and Hackney [23, Fig. 2, p. 22] with the permission of Springer Nature BV; permission conveyed through Copyright Clearance Center, Inc


membrane, which makes contact with the “hairs” of the hair cells, the stereocilia. The stereocilia of the outer hair cells are actually attached to the tectorial membrane, whereas the stereocilia of the inner hair cells come close to the tectorial membrane but end in the fluid between the organ of Corti and the tectorial membrane without being attached to it.

2.5.3.1 The Outer Hair Cells

What happens now when the basilar membrane starts to vibrate? Observe in Fig. 2.13 that the hinge point of the basilar membrane is farther away from the central axis of the cochlea than that of the tectorial membrane. In Fig. 2.13, this central axis is positioned out of view to the left. When the basilar membrane vibrates, the upper side of the organ of Corti slides over the lower side of the tectorial membrane. Since the stereocilia of the outer hair cells are attached to the tectorial membrane, they will move to the right in Fig. 2.13 when the basilar membrane goes up, as shown in the lower panel of Fig. 2.13. When the basilar membrane goes down, the reverse happens. As a result, the stereocilia of the outer hair cells bend to the left and the right in synchrony with the movements of the basilar membrane. This repeats itself in correspondence to the periodicity of the frequency to which the basilar membrane responds at that location. As has been mentioned previously, the main function of the outer hair cells is to amplify low-level frequency components. They do so by actively contracting and stretching in synchrony with the stimulus [12]. Ruggero and Temchin [71] showed that this cochlear amplifier is only active for a very narrow range of frequencies. A remarkable fact is that the active response by the cochlear amplifier induces in turn a pressure wave in the cochlear scalae which is transmitted by the ossicles back into the ear canal, where it can be measured [34, 35, 92]. Such sounds produced by the cochlear amplifier and measurable in the ear canal are called otoacoustic emissions. As described in the previous Sect. 2.5.2, combination tones are also amplified by the cochlear amplifier and, as such, are part of these otoacoustic emissions. Since they can be measured non-invasively in the ear canal, they have played and still play a very important role in the study of the cochlear amplifier [4]. A fascinating account of the discovery of the cochlear amplifier is presented by Brownell [11], a detailed and dedicated review on its operation by Ashmore et al. [3]. Remarkably, in various situations, e.g., after being present in very noisy environments, the cochlear amplifier can be spontaneously active, resulting in what are called spontaneous otoacoustic emissions. This can result in the perception of tones not produced by external sound sources, called tinnitus. Tinnitus may not only come from the activity of the cochlear amplifier, but can also originate in central parts of the auditory nervous system. When these virtual tones last long and are very loud, they can become a source of extreme nuisance. A review on tinnitus is presented in Eggermont et al. [20]. All this shows that the response of the basilar membrane is not only passive since, especially at low intensities, the outer hair cells respond actively to the mechanical

events in the cochlea. This process is not uncontrolled. Indeed, the outer hair cells are innervated by efferent nerve fibres from the olivocochlear bundle which originates in the medial olivocochlear complex in the medulla oblongata [84]. These efferent nerve fibres are depicted in yellow in Fig. 2.13. It has been shown that the activity of the basilar membrane is suppressed when the olivocochlear bundle is activated [17, 72]. It appears that this suppressive feedback plays a role in speech understanding in noisy environments [16]. Shastri, Mythri, and Kumar [77] demonstrated that the activity of the olivocochlear bundle plays a role in the perception of phonetic contrasts. Andéol et al. [2] argue that the olivocochlear bundle plays a role in suppressing the disturbing effect of concurrent noise in auditory localization. Reviews of this efferent system are presented by Elgoyhen and Fuchs [21], Guinan Jr [28, 29], Lopez-Poveda [43] and Smith and Keil [79]. These reviews sketch the image of two mechanisms that complement the passive frequency response of the basilar membrane. The first mechanism is that of active cochlear amplification which amplifies low-level frequency components and sharpens the frequency response of the basilar membrane. The second, mediated by the medial olivocochlear complex, is centrally regulated, and enhances sounds like speech, especially in noisy environments [48]. It is still difficult to present “easily-described conceptual frameworks” [29, p. 45] of the function of the efferent system, however. Furthermore, the anatomy of the efferent system is quite complex. For instance, Terreros and Delano [82, p. 6] describe three parallel descending pathways between the cortex and the cochlea.

2.5.3.2 The Inner Hair Cells

The operation of the inner hair cells is schematically depicted in Fig. 2.13. The inner hair cells are stimulated in a way resembling that of the outer hair cells, but there are some differences. First, they do not move actively as do the outer hair cells, but move passively with the basilar membrane, the movements of which are largely under control of the outer hair cells. Moreover, while there are three rows of outer hair cells, there is just one row of inner hair cells and, in contrast with those of the outer hair cells, the stereocilia of the inner hair cells are not attached to the overlying tectorial membrane. So, what happens to the stereocilia of the inner hair cells when the basilar membrane moves up and down, and drags the tectorial membrane along with it? During these movements the relative position of the inner hair cells and the tectorial membrane changes in a way comparable to that of the outer hair cells. Though the stereocilia of the inner hair cells are not attached to the tectorial membrane, their tips are so close to it that the fluid around the stereocilia of the inner hair cells will be set in motion. As a consequence, viscous forces exerted by the fluid around the stereocilia will bend the stereocilia, similar to the movements of the stereocilia of the outer hair cells. Furthermore, the layer of fluid between the rows of hair cells and the tectorial membrane will be pressed in and out, inducing fluid currents that can also set the stereocilia of the inner hair cells in motion. So, the stereocilia of the inner hair cells are stimulated by a combination of shearing and
viscous forces. A review of all factors playing a role in this process is presented and discussed by Guinan Jr [27]. The bending of the stereocilia of the inner hair cells initiates the first stage of the transduction process, the induction of a receptor potential in the cell bodies of the inner hair cells. An example of such a receptor potential is shown in Fig. 2.14, which presents the receptor potential of an inner hair cell stimulated by short pure tones, the frequency of which is indicated in Hz on the right-hand side of the measurements. As can be seen, this receptor potential is not a linear replica of the bending of the stereocilia: The bending of the stereocilia of the inner hair cells is excitatory when the basilar membrane goes up, but inhibitory when the basilar membrane goes down [31]. In other words, a kind of half-wave rectification takes place, so that the receptor potential approximately represents a half-wave rectified version of the basilar-membrane motions. This half-wave rectification is not exact, since, e.g., exact half-wave rectification would induce sharp edges at the zero-crossings of the receptor potential. As can be seen in Fig. 2.14, the receptor potential does not show such discontinuities of its derivative. This aspect can be modelled as a low-pass filter with a cut-off frequency of about 3 kHz. As a result of this combination of half-wave rectification and low-pass filtering, the receptor potential can be decomposed into two components. First, there is the sinusoidal response closely resembling the sinusoidal stimulus; by analogy with electronics, this component is referred to as the alternating-current (AC) component. It has the same frequency as the sinusoidal stimulus and can be described as the output of a linear band-pass filter with the characteristic frequency of the basilar membrane at that location as its centre frequency. Second, owing to the half-wave rectification, there is the more slowly changing component, the direct-current (DC) component. This component correlates with the envelope of the sinusoidal stimulus. This DC component may seem rather insignificant at 300 Hz, but the higher the frequency of the stimulus, the larger is its relative contribution. Above about 1–1.5 kHz, the DC component gets larger than the AC component. And, as one can see in Fig. 2.14, the AC component becomes smaller and smaller above about 2 kHz, vanishing almost completely above 4–5 kHz. This last aspect can, again, be modelled as the operation of a low-pass filter with a cut-off frequency of about 3 kHz. The presence of the half-wave rectification has one important consequence, and that is that we can hear frequencies much higher than 5 kHz. Indeed, the receptor potential does not contain frequencies higher than 4 to 5 kHz and, if a signal contains frequencies higher than the cut-off frequency of a linear low-pass system, these frequency components simply do not pass the filter. Half-wave rectification is a simple instantaneous non-linearity introducing a low-frequency component reflecting the envelope of the stimulus. It is this low-frequency component which makes it possible to hear sounds with frequencies higher than 5 kHz. The main effect of the receptor potential is the release of neurotransmitter, glutamate, into the synaptic cleft, the space between the inner-hair-cell body and the endings of auditory-nerve fibres. This neurotransmitter is packed in small vesicles, distributed in the cell body of the inner hair cell close to the outside of the cell

Fig. 2.14 Receptor potential of an inner hair cell stimulated by pure tones with a trapezoidal envelope at 10 different frequencies. Note the DC component as a result of some level of half-wave rectification. As a first approximation, the AC component represents synchrony to the phase of the sinusoid, while the DC component represents the envelope of the stimulus. Above 3000 Hz, the synchrony to the phase of the sinusoid vanishes. Reproduced from Palmer and Russel [54, Fig. 9, p. 9] with the permission of Elsevier Science & Technology Journals; permission conveyed through Copyright Clearance Center, Inc.

Fig. 2.15 Sketch of an inner hair cell

where the cell has synapses with the endings of the auditory-nerve fibres. This is schematically shown in Fig. 2.15. The receptor potential determines the probability that the content of these vesicles is emptied into the synaptic cleft; when the receptor potential is high, more synaptic vesicles will be emptied than when it is low. When the concentration of neurotransmitter in this synaptic cleft exceeds a certain level, it can trigger the generation of an action potential in the nerve fibre, thus completing the transduction process. This action potential is then propagated along the nerve fibre, through the auditory nerve, up to the first nucleus of the central auditory nervous system, the cochlear nucleus. From there, auditory information is processed by the central nervous system.
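The description above of the receptor potential as a half-wave rectified and low-pass filtered version of the basilar-membrane motion can be illustrated with a few lines of Matlab. The sketch below is not one of the book's demonstration scripts; it is a minimal simplification in which a pure tone is half-wave rectified and then smoothed by a cascade of simple one-pole low-pass filters standing in for the roll-off of the inner hair cell. The cascade length and the cut-off frequency are assumptions chosen only for illustration. Its output shows how the AC component shrinks with increasing stimulus frequency while the DC component, reflecting the envelope, remains.

% Minimal sketch: receptor potential as half-wave rectification plus low-pass filtering
fs = 44100;                         % sampling frequency in Hz
t  = (0:1/fs:0.05)';                % 50 ms of signal
fc = 3000;                          % assumed cut-off frequency of the smoothing
a  = exp(-2*pi*fc/fs);              % coefficient of a one-pole low-pass filter
for f0 = [300 1000 3000 5000]       % stimulus frequencies in Hz
    x = sin(2*pi*f0*t);             % pure-tone stimulus
    v = max(x, 0);                  % half-wave rectification
    for k = 1:4                     % cascade of one-pole filters as a crude stand-in
        v = filter(1-a, [1 -a], v);
    end
    v = v(round(end/2):end);        % discard the onset transient
    fprintf('%5d Hz: AC (peak-to-peak) = %.2f, DC (mean) = %.2f\n', ...
            f0, max(v)-min(v), mean(v));
end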

2.5.3.3 Sensory Adaptation

Before moving on to discuss the coding of the incoming acoustic information in the auditory-nerve fibres, another phenomenon will be discussed, adaptation, or more specifically sensory adaptation. Sensory adaptation is the phenomenon that the response of a sensor is stronger just at the onset of a stimulus than later on, and is temporarily suppressed just after the offset of the stimulus. In that sense, adaptation can be modelled as an automatic gain control. In the inner hair cell, sketched in Fig. 2.15, adaptation is a consequence of the following order of events: Before

the onset of stimulation, a reservoir of neurotransmitter is built up in the vesicles of the inner hair cells. In the absence of stimulation, a small number of vesicles are continuously emptied into the synaptic cleft. This spontaneous release of neurotransmitter into the synaptic cleft leads to the spontaneous generation of action potentials, especially in the fibres that have endings farther removed from the central axis of the cochlea, hence, closer to the outer hair cells; spontaneous generation of action potentials in the nerve endings closer to the central axis of the cochlea is in general much lower or absent [39]. As soon as stimulation starts, the number of vesicles that empty their content into the synaptic cleft increases rapidly, so that the reservoir of vesicles with neurotransmitter in the hair cell decreases rapidly and would run out if it were not replenished. Indeed, the reservoir of neurotransmitter is replenished, but not so fast that it can fully compensate for the loss of neurotransmitter at the onset of stimulation. As a consequence, after the initial burst at the onset, the amount of neurotransmitter emptied into the synaptic cleft diminishes and then remains constant at a level at which the release of neurotransmitter is compensated by the replenishment. When stimulation stops, the concentration of neurotransmitter is relatively low, lower than it was before stimulation. Hence, the spontaneous release of neurotransmitter will be less than before stimulation, so that the spontaneous activity will also be less. The replenishing of neurotransmitter will, however, after some time restore the situation as it was before stimulation. The generation of action potentials in the auditory-nerve fibres starts with the release of neurotransmitter in the synaptic cleft between the inner-hair-cell body and the ending of the auditory-nerve fibre. It is important to realize that this release of neurotransmitter and the generation of action potentials or spikes have a statistical nature, not in size, since the generation of an action potential is all or none, but in probability and in time. Not only does the concentration of neurotransmitter in the synaptic cleft depend on how many vesicles are emptied per unit of time; the spike-generation process in the ending of the nerve fibre is itself statistical. As a consequence, the timing of action potentials is subject to fluctuations, so that there is always an amount of jitter in the spike trains. This jitter has a spread of a few tenths of a millisecond.

A real example of adaptation is presented in Fig. 2.16 for a high-frequency auditory-nerve fibre of a cat. It shows the post-stimulus-time histogram (PSTH) in response to a 50-ms pure tone. A post-stimulus-time histogram is measured by repeatedly presenting a stimulus and measuring the time intervals between the onset of the stimulus and the action potentials. The PSTH is a histogram of these time intervals. In the example of Fig. 2.16, the characteristic frequency of the nerve fibre is higher than 5 kHz, so that there is no phase lock to this frequency. About 30 ms after the onset of the stimulus the firing rate abruptly increases from a spontaneous activity of about 15 spikes per second to more than 100 spikes per second. Adaptation expresses itself in the decrease in the firing rate in the 10–20 ms after the peak to a more steady level, in this case of about 40 spikes per second.
In addition, after the offset of the stimulus, the response is suppressed for 30–40 ms, after which the firing rate returns to the original level.
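Since the post-stimulus-time histogram returns repeatedly in the remainder of this chapter, a minimal sketch of how one is computed from recorded spike times may be helpful. The sketch below first generates synthetic spike times, simply so that it runs on its own; in practice spikeTimes{k} would hold the spike times recorded on the k-th presentation. The rate profile, bin width, and analysis window are arbitrary choices for illustration, not values from any experiment described here.

% Minimal sketch: computing a post-stimulus-time histogram (PSTH)
% Synthetic spike times; spikeTimes{k} holds the spike times (in s, re stimulus onset)
% of the k-th presentation.
nTrials = 200;  fsSim = 10000;  tSim = (0:1/fsSim:0.1)';
rateProfile = 20 + 80*(tSim >= 0.01 & tSim < 0.06);      % assumed rate in spikes/s
spikeTimes = cell(nTrials, 1);
for k = 1:nTrials
    spikeTimes{k} = tSim(rand(size(tSim)) < rateProfile/fsSim);
end
% The PSTH itself: a histogram of spike times relative to stimulus onset
binWidth = 0.001;  edges = 0:binWidth:0.1;               % 1-ms bins over 100 ms
counts = zeros(1, numel(edges) - 1);
for k = 1:nTrials
    counts = counts + histcounts(spikeTimes{k}, edges);
end
rate = counts / (nTrials * binWidth);                    % average firing rate in spikes/s
bar(edges(1:end-1) + binWidth/2, rate, 1);
xlabel('Time after stimulus onset (s)'); ylabel('Firing rate (spikes/s)');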

Fig. 2.16 Adaptation in the auditory-nerve fibre of a cat. The post-stimulus-time histogram shows a high response in the first 10–20 ms after the onset of the stimulus and a suppressed response 30–40 ms after its offset. Reproduced from Pickles [56, Fig. 4.2, p. 76], with the permission of Koninklijke Brill NV

Fig. 2.17 Simulation of sensory adaptation based on Westerman and Smith [89]. The amplitude of the 50-ms stimulus is presented in the lower panel. The response is presented in spikes per second in the upper panel. (Matlab)

A very simple simulation showing sensory adaptation is presented in Fig. 2.17, based on Westerman and Smith [89]. The lower panel shows the amplitude of a stimulus, in this case a high-frequency sound starting at 0 ms and ending at 50 ms. Since the frequency of this sound is higher than 5 kHz, the AC response can here, too, be neglected, so that the receptor potential roughly follows the stimulus envelope. The probability of an action potential is plotted in the upper panel. Westerman and
Smith [89] distinguish rapid adaptation with a time constant of 1 to 10 ms, and short-term adaptation with a time constant of about 60–80 ms. In Fig. 2.17, a time constant of 3 ms was chosen for the rapid adaptation and of 60 ms for the short-term adaptation. To illustrate the statistical nature of the spike-generation process, low-pass filtered noise with a cut-off frequency of 2000 Hz has been added to the response. As in the real example of Fig. 2.16, the spike activity rises rapidly at the onset of the stimulation from its spontaneous level to a peak level, after which it decreases in 10–20 ms to a more or less stationary level. After the offset of the stimulus, the spike activity is suppressed for some tens of milliseconds, after which it returns to its spontaneous-activity level. Realistic models of the process of adaptation, including simulation of the receptor potential of the inner hair cells and of the production, depletion, and replenishment of neurotransmitter, have been developed by Meddis and colleagues [46, 81] and by Zilany and colleagues [100, 101]. An important aspect of adaptation is that, in the spike activity of the auditory-nerve fibres, the onsets of the stimuli are enhanced compared to the steady parts, while their offsets are suppressed.
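A sketch in the same spirit as Fig. 2.17 can be written down in a few lines of Matlab. It is not the simulation used for that figure: the onset response is simply modelled as a steady rate plus two exponentially decaying components with the 3-ms and 60-ms time constants mentioned above, offset suppression as an exponential recovery from below the spontaneous rate, and the noise as plain Gaussian noise. All numerical values other than the two time constants are illustrative assumptions, not parameters of Westerman and Smith [89].

% Minimal sketch of two-time-constant sensory adaptation, in the spirit of Fig. 2.17
fs    = 10000;  t = (0:1/fs:0.15)';        % 150-ms time axis
tOff  = 0.05;                              % stimulus on at 0 ms, off at 50 ms
spont = 15;  steady = 40;                  % assumed spontaneous and adapted rates (spikes/s)
aRap  = 120;  aShort = 60;                 % assumed sizes of the two onset components
tauR  = 0.003;  tauS = 0.060;              % rapid and short-term time constants
rate  = spont * ones(size(t));
on    = t < tOff;  off = t >= tOff;
rate(on)  = steady + aRap*exp(-t(on)/tauR) + aShort*exp(-t(on)/tauS);
rate(off) = spont - 10*exp(-(t(off)-tOff)/tauS);     % assumed depth of offset suppression
rate = max(rate + 5*randn(size(rate)), 0);           % crude stand-in for spike-count noise
plot(t*1000, rate);
xlabel('Time (ms)'); ylabel('Firing rate (spikes/s)');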

2.6 The Auditory Nerve

The sensory cells of the auditory system, i.e., the inner and outer hair cells of the organ of Corti, are innervated by the nerve fibres of the auditory nerve. Various kinds of fibres can be distinguished. First, there are the afferent fibres running from the sensory cells, where the spikes are generated, to the cochlear nucleus in the central nervous system. The cell bodies of these afferents are located in the spiral ganglion, which runs along the central axis of the cochlea at the base of the basilar membrane (see Fig. 2.5). The vast majority of the afferent fibres, called type-I fibres, innervate the inner hair cells. Every type-I fibre innervates one inner hair cell. Type-I fibres are relatively thick and myelinated, which means that the propagation of the spikes along the fibre is very fast. Most of what will be said in this section refers to these type-I fibres. In addition to the type-I fibres, there are the type-II afferents, which are very thin and not myelinated, which means that the propagation of the action potentials along the fibre is relatively slow. These type-II fibres innervate between one and eleven outer hair cells, only the combined activity of which can initiate an action potential [88]. The function of the type-II fibres remains unclear [14, 87]. One possibility is that they operate as a warning system, as they have been suggested to be involved in inducing the percept of pain at dangerously high signal levels [41]. From here on, focus will only be on the function of the type-I auditory-nerve fibres. Basically, all the auditory information represented in the vibrations of the basilar membrane and the resulting receptor potentials of the inner hair cells is transmitted to the central nervous system by these afferents. In the previous Sect. 2.5.3.2, the receptor potential was roughly described as an amplified, half-wave rectified, and
low-pass filtered version of the basilar-membrane vibrations; for frequencies lower than the cut-off frequency of the low-pass filter, 3–5 kHz, this implies that the probability of an action potential is largest at the positive phase of the stimulus. In other words, these nerve fibres exhibit phase lock. For frequencies of 3–5 kHz, the jitter of the spike-generation process has the same order of magnitude as the period duration of the stimulus, some tenths of a millisecond. Hence, the timing of the spikes loses its relation with the phase of the stimulus. In other words, the spike trains are no longer phase locked to the stimulus frequency. The phase-lock limit of 3–5 kHz is the limit most often found for humans in the literature but still is a matter of dispute and varies from 1.5 to 10 kHz [83]. Summarizing the order of events between the sound arriving at our ears and the generation of action potentials in the auditory-nerve fibres, one can successively distinguish direction-dependent linear filtering by the outer ear, linear band-pass filtering by the middle ear, non-linear band-pass filtering by the basilar membrane, amplification by the outer hair cells and, finally, half-wave rectification and low-pass filtering by the inner hair cells. As a result of this processing, an AC component and a DC component can be distinguished in the receptor potential. The AC component, only present for low-frequency sounds, represents the phase of the stimulus, while the DC component approximates the stimulus envelope, be it a somewhat smoothed version of it. Due to adaptation, onsets are enhanced and offsets are suppressed. In the following sections, the main response properties of the auditory-nerve fibres will be summarized. Focus will be on those general properties that are important to understand the nature of the information the central nervous system processes in order to make sense out of the blend of acoustic waves that the outer ears pick up in their course through the environment. For a more elaborate and detailed review of the anatomy and operation of these fibres the reader is referred to, e.g., Rudnicki et al. [70] or Heil and Peterson [30].
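How a spread of a few tenths of a millisecond in spike timing destroys phase lock at high frequencies can be illustrated with a short simulation. The sketch below places one spike near a fixed phase of every stimulus period, jitters it with a Gaussian spread of 0.1 ms, and quantifies the remaining synchrony as the classical vector strength (1 for perfect phase lock, 0 for none). The jitter value and the choice of vector strength as a measure are assumptions made for illustration; they are not taken from the studies cited above.

% Minimal sketch: phase lock degraded by spike-timing jitter
jitter = 0.0001;                                  % assumed jitter of 0.1 ms (standard deviation)
for f = [250 500 1000 2000 4000 8000]             % stimulus frequencies in Hz
    T      = 1/f;
    spikes = (0:9999)'*T + jitter*randn(10000, 1);    % one jittered spike per stimulus period
    phases = 2*pi*spikes/T;                       % phase of each spike re the stimulus period
    VS     = abs(mean(exp(1i*phases)));           % vector strength
    fprintf('%5d Hz: vector strength = %.2f\n', f, VS);
end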

2.6.1 Refractoriness and Saturation

It is known by now that the probability of an action potential depends on the concentration of neurotransmitter in the synaptic cleft. There is, however, a clear limit to this relation, due to a property of nerve fibres called refractoriness. After a nerve fibre has fired an action potential, it cannot immediately fire again within a certain period, the absolute refractory period. The order of magnitude of the absolute refractory period is less than 1 millisecond; Heil and Peterson [30] give a time constant of 0.33 ms. After the absolute refractory period there is the relative refractory period, in which the fibre can fire again, but only in response to a stimulus that is stronger than would otherwise be required. The relative refractory period has an order of magnitude of 1 ms; Heil and Peterson [30] give a time constant of 0.41 ms. As a result of this refractoriness, the firing rate of a nerve fibre cannot exceed some hundreds of spikes per second. When the firing rate of a nerve fibre approaches this upper limit, the fibre is said to be saturated. Increasing the intensity of a stimulus
beyond this saturation level does not lead to a higher firing rate. Saturation determines the upper limit of the dynamic range of the neuron, to be discussed below.
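How a refractory period caps the firing rate can be made concrete with a toy simulation. The sketch below, which is only an illustrative simplification, drives a model fibre with a Poisson-like spike probability per time step and imposes an assumed dead time of 1 ms after every spike; whatever the driving rate, the output rate then saturates below 1/(dead time), i.e., at most some hundreds to a thousand spikes per second, and the relative refractory period of real fibres pushes the limit down further.

% Minimal sketch: a 1-ms dead time caps the firing rate of a model fibre
dt   = 1e-5;  T = 2;  n = round(T/dt);      % 2 s simulated at 0.01-ms resolution
dead = 0.001;                               % assumed absolute dead time of 1 ms
for drive = [50 200 1000 5000]              % driving rate in spikes/s before refractoriness
    lastSpike = -inf;  count = 0;
    for i = 1:n
        tNow = i*dt;
        if rand < drive*dt && tNow - lastSpike > dead
            count = count + 1;
            lastSpike = tNow;
        end
    end
    fprintf('drive %5d spikes/s -> output %4.0f spikes/s\n', drive, count/T);
end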

2.6.2 Spontaneous Activity

The probability of the generation of an action potential in an auditory-nerve fibre has a strong positive relation with the concentration of neurotransmitter in the synaptic cleft. In the absence of stimulation, the concentration of neurotransmitter in the synaptic cleft is low but certainly not zero, and subject to statistical fluctuations. As a result, also in the absence of auditory stimulation, many nerve endings generate action potentials, at a relatively low but non-zero rate, the spontaneous firing rate. This spontaneous activity may be taken as a kind of internal noise. This internal noise will play a role in the discussion on loudness perception. The spontaneous firing rate of auditory-nerve fibres can vary from zero to about 100 spikes per second [22, 73]. It is correlated with the location on the cell body of the inner hair cell where the fibre makes synaptic contact. Every inner-hair-cell body is connected with a number of nerve endings, so that there are many more nerve fibres coming from the inner hair cells than there are inner hair cells. The nerve endings closer to the outer hair cells have a higher spontaneous firing rate, of several tens or even more than 100 spikes per second, and generally respond to lower stimulus intensities than the fibres with endings attached further from the outer hair cells [39, 40].

2.6.3 Dynamic Range

As just mentioned, a nerve fibre is said to be saturated when increasing the intensity of a continuous stimulus does not result in an increased firing rate. This saturation level determines the upper value of the dynamic range. Hence, the upper limit of the dynamic range is the lowest intensity level above which the firing rate of the fibre cannot increase under continuous stimulation. The lower limit of the dynamic range is analogously defined as the highest intensity level below which the firing rate of the neuron cannot decrease, hence, is not lower than the spontaneous activity. So, the dynamic range of an auditory-nerve fibre is the intensity range of a stimulus below which the neuron just shows a spontaneous firing rate and above which the neuron is saturated. In this definition of dynamic range, the firing rate during continuous stimulation is implied. As discussed in Sect. 2.5.3.3, an auditory-nerve fibre can temporarily have a very high firing rate at the onset of stimulation. Similarly, a fibre can have a firing rate lower than its spontaneous firing rate but only after an offset of stimulation. These transient responses are not included in defining the dynamic range of an auditory-nerve fibre. The curves representing firing rate as a function of intensity level are

Fig. 2.18 Rate-intensity functions of five auditory-nerve fibres of a cat. Each fibre has a unit number presented in the figure. The characteristic frequency (CF) of the units is presented in kHz. Reproduced from Sachs and Abbas [73, p. 3, Fig. 3], with the permission of the Acoustical Society of America

called rate-intensity functions, or rate versus level functions. Some real examples from auditory-nerve fibres of a cat are presented in Fig. 2.18; simulated examples, based on Yates [96] are shown in Fig. 2.19. The rate-intensity functions shown in Figs. 2.18 and 2.19 are ordered from high-spontaneous-rate fibres to low-spontaneous-rate fibres. As mentioned in Sect. 2.5.3.2, nerve fibres with synapses at that side of the inner hair cell that is directed towards the outer hair cells generally have high spontaneous firing rates; nerve fibres with endings at the side of the inner hair cell directed away from the outer hair cells, so towards the central axis of the cochlea, show low spontaneous firing rates [39, 40]. Moreover, in Figs. 2.18 and 2.19, one sees that the dynamic range of high-spontaneous-rate fibres in general covers lower intensities; their threshold is low, and the intensity level at which they are saturated is also quite low, e.g., 40–50 dB above hearing threshold. Hence, the dynamic range of the high-spontaneous-rate

Fig. 2.19 Simulated rate-intensity functions. The upper curve shows a simulation of the rate-intensity function of a high-spontaneous-rate fibre, the lower curve a simulation of the rate-intensity function of a low-spontaneous-rate fibre. The other three are intermediate. Based on Yates [96] (Matlab)

fibres is quite small, less than about 40–50 dB. On the other hand, the rate-intensity functions of the low-spontaneous-rate fibres shown in Figs. 2.18 and 2.19 show that these nerve fibres generally respond only at relatively high intensity levels, higher than 35–50 dB. These low-spontaneous-rate fibres are only saturated at relatively high intensity levels, e.g., 100 or 120 dB. Hence, their dynamic range is wider than that of the high-spontaneous-rate fibres, wider than 60 dB. This shows that the dynamic range of an individual auditory-nerve fibre is limited. In order to cover the entire range of audible intensities, information from a number of nerve fibres must be combined. It appears, indeed, that every individual inner hair cell makes contact with the endings of a handful of high-spontaneous-rate fibres on the side facing the outer hair cells and with the endings of a handful of low-spontaneous-rate fibres on the opposite side [39]. This shows that, even if one looks at the fibres coming from only one inner hair cell, the larger part of the range of audible intensities is covered. The details of the simulation shown in Fig. 2.19 are quite complex. It consists of a five-parameter model in which, first, the displacement of the basilar membrane is estimated as a function of stimulus level. Next, the firing rate is estimated as a function of basilar-membrane displacement. The reader is referred to the paper on which the simulation is based for the details [96]. This paper also shows many actual rate-intensity functions together with the fits by the model. More advanced models
for the simulation of rate-intensity functions are presented by Sumner et al. [80] and Zilany and Bruce [99]. The rate-intensity functions shown in Figs. 2.18 and 2.19 are typical for rate-intensity functions measured at the characteristic frequency of the nerve fibres. For frequencies separated from the characteristic frequency, the thresholds will be higher, but Yates, Winter, and Robertson [97] argued that, in other aspects, the mechanism that determines the shape of the rate-intensity functions is not essentially different. Moreover, they conclude that “General characteristics of the derived basilar membrane input-output curves show features which agree well with psychoacoustic studies of loudness estimation” (p. 203).
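Figure 2.19 was generated with the five-parameter model of Yates [96], which is not reproduced here. A much cruder sketch already shows the qualitative point made above: a saturating function of sound level, shifted and scaled differently for high- and low-spontaneous-rate fibres, gives the former a narrow dynamic range at low levels and the latter a wide dynamic range extending to high levels. The logistic form and all parameter values below are assumptions made for illustration only.

% Minimal sketch: schematic rate-intensity functions (not the model of Yates [96])
L = 0:120;                                    % sound level in dB
% columns: spontaneous rate, saturated rate, level at half saturation (dB), slope (dB)
fibres = [ 60 250 20  8;                      % high-spontaneous-rate fibre
           15 220 45 12;                      % intermediate fibre
            1 200 70 18 ];                    % low-spontaneous-rate fibre
figure; hold on;
for k = 1:size(fibres, 1)
    sp = fibres(k,1); sat = fibres(k,2); L50 = fibres(k,3); s = fibres(k,4);
    rate = sp + (sat - sp) ./ (1 + exp(-(L - L50)/s));    % logistic rate-level function
    plot(L, rate);
end
xlabel('Level (dB)'); ylabel('Firing rate (spikes/s)');
legend('high spontaneous rate', 'intermediate', 'low spontaneous rate', 'Location', 'northwest');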

2.6.4 Band-Pass Filtering

It has been shown that every location of the basilar membrane can be modelled as a band-pass filter with a centre frequency determined by that location. This aspect is reflected in the frequency response of the nerve fibres originating from that location, e.g., in what are called physiological tuning curves. Six examples of physiological tuning curves, measured for six auditory-nerve fibres of a cat, are shown in Fig. 2.20. These six auditory-nerve fibres originate at six different locations of the basilar membrane. Physiological tuning curves give the intensity a pure tone must have in order to generate a response from the auditory-nerve fibre as a function of the frequency of the tone. In Fig. 2.20 the dashed lines on the left represent for each curve a sound pressure level of 90 dB. On the right the characteristic frequency of the nerve fibres is presented in kHz. The highest frequency here is 21.0 kHz, which may seem high, but the frequency sensitivity of cats goes up to about 60 kHz, much higher than the 20-kHz limit of humans. It can be seen that the shape of the different tuning curves is approximately constant, except for the highest one, the one with a characteristic frequency of 21.0 kHz, which is flatter. It is important to realize that this more or less constant shape of the tuning curves only applies when the curves are displayed on logarithmic axes, both as to frequency on the abscissa and as to intensity on the ordinate. The tuning curves presented in Fig. 2.20 show that every fibre has a well-defined characteristic frequency, corresponding to the characteristic frequency of the location on the basilar membrane from where the nerve fibre originates. Close to their characteristic frequency, the tuning curves are quite sharply tuned, which means that the sensitivity decreases quite rapidly as the frequency of the stimulus deviates from the characteristic frequency. As a consequence, the slopes of the tuning curves are quite steep close to the characteristic frequency. This sharp tuning is due to the active response of the outer hair cells to low-intensity sound. Farther away from the characteristic frequency of the fibre, the tuning curves are very asymmetric. The sensitivity of the fibres decreases much faster on the higher-frequency side of the characteristic frequency than on the lower-frequency side. The “tails” on the low-frequency side of the tuning curves in fact show that, if the level of the stimuli is high enough, the

Fig. 2.20 Physiological tuning curves of six auditory-nerve fibres of a cat. Reproduced from Kiang and Moxon [36, p. 622, Fig. 3], with the permission of the Acoustical Society of America

fibres still do respond to these lower frequencies. On the other hand, on the higher frequency side the slope of the tuning curves is very steep, showing that even very intense stimuli fail to excite the fibres. This is all in agreement with the idea of upward spread of masking discussed above in Sect. 2.5.1.1.
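The asymmetry just described, with a shallow low-frequency tail and a very steep high-frequency flank, is easy to sketch schematically. The sketch below simply draws V-shaped threshold curves with an assumed shallow slope below and a steep slope above the characteristic frequency on a logarithmic frequency axis; the slope values, best thresholds, and characteristic frequencies are illustrative assumptions, not the data of Fig. 2.20.

% Minimal sketch: schematic, asymmetric tuning curves on a logarithmic frequency axis
f = logspace(log10(100), log10(40000), 500);     % tone frequency in Hz
figure; hold on;
for CF = [500 2000 8000]                         % assumed characteristic frequencies
    oct = log2(f/CF);                            % distance from CF in octaves
    thr = 10 + 25*max(-oct, 0) + 150*max(oct, 0);    % shallow low side, steep high side
    plot(f, min(thr, 100));                      % clip at an assumed ceiling of 100 dB SPL
end
set(gca, 'XScale', 'log');
xlabel('Frequency (Hz)'); ylabel('Threshold (dB SPL)');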

2.6.5 Half-Wave Rectification

The process of half-wave rectification, originating in the receptor potential of the inner hair cells, expresses itself in various ways in the response of the auditory-nerve fibres. First, the post-stimulus-time histogram (PSTH) of an auditory-nerve fibre in response to a single click will be discussed. The PSTH was introduced above in the

Fig. 2.21 Post-stimulus time histograms and their combination of an auditory-nerve fibre stimulated by a rarefaction click in (A) and a condensation click in (B). These two histograms are combined in (C), in which the response to the rarefaction click is not changed, but the response to the condensation click is inverted. Note the resemblance to the impulse response of a band-pass filter with a characteristic frequency of 450 Hz. Reproduced from Pickles [56, p. 85, Fig. 4.10], with the permission of Koninklijke Brill NV and the Acoustical Society of America

discussion of Fig. 2.16 of Sect. 2.5.3.3. Two kinds of clicks are distinguished, condensation clicks and rarefaction clicks. Condensation clicks are produced by momentary rises in air pressure; rarefaction clicks by momentary drops in air pressure. Figure 2.21 shows three graphs, the PSTH to a rarefaction click (A), the PSTH to a condensation click (B), and a combination of the two, obtained by subtracting the PSTH to the condensation click from that to the rarefaction click (C) (data from Goblick Jr and Pfeiffer [25] published as Fig. 4.10 by Pickles [56]). The combination of the two PSTHs shown in C has all the characteristics of the impulse response of a band-pass filter, in this case a band-pass filter with a characteristic frequency of 450 Hz. Indeed, Fig. 2.21A presents the half-wave rectified response to the rarefaction click, while Fig. 2.21B shows the half-wave rectified response to the condensation click. By inverting the PSTH to the condensation click and adding it to the PSTH to the rarefaction click, one gets an approximation of the impulse response of the linear part of the system. These PSTHs estimate the firing probability of the nerve fibre after the onset of the stimulus, in this case the click. So, since the probability of firing of the nerve fibre depends on the concentration of

Fig. 2.22 PSTHs of an auditory-nerve fibre stimulated with two tones, a 50-dB tone of 907 Hz and a 60-dB tone of 1814 Hz. In A and B the PSTHs are presented for stimulation with only one of the tones. In C to L the relative phase between the two tones is varied. Reproduced from Brugge et al. [13, p. 388, Fig. 1], with the permission of the American Physiological Society; permission conveyed through Copyright Clearance Center, Inc

neurotransmitter in the synaptic cleft, which in turn depends on the receptor potential of the inner hair cell, these results are in good agreement with the idea of half-wave rectification in the generation of the receptor potential in the inner hair cell. Another example is presented in Fig. 2.22, showing PSTHs of an auditory-nerve fibre with a characteristic frequency of about 1280 Hz. The graphs in the upper right of the figure show, for various sound levels, the number of spikes in the five seconds of stimulation with a pure tone varied in frequency. The frequency is indicated at the abscissa; the intensity is indicated as a parameter in dB SPL. In (A) the PSTH is presented to stimulation with a 50-dB tone of 907 Hz, in (B) with a 60-dB tone of 1814 Hz. In (C) to (L) the relative phase between the two tones is varied. The continuous lines in (B) to (L) represent the waveform of the stimulus. Note, again, that the PSTHs closely match the half-wave rectified waveforms, a match that would be even better when combined with a low-pass filter, which would gently remove the discontinuities in the derivatives of the waveform at the zero crossings. The above two examples, illustrated in Figs. 2.21 and 2.22, show that, for the frequencies used, the timing of the action potentials is on average phase locked to the period of the sinusoidal stimulus. This may imply that the interval between
two successive action potentials has approximately the same length as the stimulus period, and this can indeed be the case. This phase lock, however, must be taken in a statistical sense: The fibre may fire once or sometimes even twice within a stimulus period, and the action potentials will then, on average, come at a preferred phase of the sinusoid, but there will also be periods during which the fibre does not fire at all. In fact, due to refractoriness, a considerable number of periods will not contain any firings, certainly for higher stimulus frequencies. As a consequence, many intervals between two successive spikes will have about twice, three times, or even more times the duration of one stimulus period. This expresses itself in the interspike interval histograms of the auditory-nerve fibres under stimulation with various frequencies. Examples are shown in the next figure, Fig. 2.23. An auditory-nerve fibre with a characteristic frequency of about 1100 Hz is stimulated with 80-dB pure tones of varying frequency. Figure 2.23 shows the histograms of the intervals between two successive action potentials for the various stimulus frequencies. The figure shows that, above about 1500 Hz, the fibre no longer responds to the stimulus; apparently, the frequency of the stimulus exceeds the range of frequencies for which the neuron is sensitive. Moreover, the abscissae of the peaks in the histograms closely match the integer multiples of the period of the stimulus. These multiples are indicated by small dots below the abscissa of the histograms. It is emphasized that these intervals are the integer multiples of the period of the stimulus, not of the period corresponding to the characteristic frequency of the fibre. Furthermore, the intervals in the histograms of the second, third, and higher order correspond to frequencies which are exactly half the stimulus frequency, one third of this frequency, etc. In other words, the first-order intervals in the spike trains of the auditory-nerve fibres correspond to the period of the stimulus, but higher-order intervals in the spike trains correspond to subharmonic frequencies, i.e., frequencies one gets when the stimulus frequency is divided by a positive integer. This will appear to be very important in a model of pitch perception that will be described in Chap. 8. All that was said in this section as to phase lock applies only to lower frequency components, i.e., components with frequencies lower than about 5 kHz. Due to statistical processes, here modelled as the low-pass filter, the timing of the action potentials is not accurate enough above this frequency and the synchrony of the spikes with the phase of the stimulus gradually vanishes. For these higher frequencies, the firing rate of the nerve fibres has no relation with the phase of the stimulus but rather with its envelope. Moreover, due to sensory adaptation, the onsets of the stimuli and increments in intensity are characterized by higher spike rates than the steadier parts of the stimulus. Similarly, the offsets are characterized by suppressed spike rates (see Figs. 2.16 and 2.17).
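The relation between first-order and higher-order interspike intervals is easy to verify in a toy simulation: if a model fibre fires near a fixed phase in only a random subset of stimulus periods, its interval histogram shows peaks at integer multiples of the stimulus period, just as in Fig. 2.23. The firing probability per period and the jitter value below are illustrative assumptions, not values taken from the recordings.

% Minimal sketch: interspike-interval histogram of a phase-locked model fibre
f0     = 1000;  T = 1/f0;                        % stimulus frequency and period
nPer   = 20000;                                  % number of stimulus periods simulated
pFire  = 0.25;                                   % assumed probability of firing in a given period
jitter = 0.0001;                                 % assumed spike-timing jitter of 0.1 ms
fires  = find(rand(nPer, 1) < pFire);            % periods in which the fibre fires
spikes = fires*T + jitter*randn(size(fires));    % one spike near a fixed phase of each such period
isi    = diff(spikes);                           % first-order interspike intervals
edges  = 0:T/10:10*T;
bar(edges(1:end-1) + T/20, histcounts(isi, edges), 1);
xlabel('Interspike interval (s)'); ylabel('Number of intervals');
% The peaks lie at 1, 2, 3, ... times the stimulus period T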

Fig. 2.23 Interspike interval histograms of an auditory-nerve fibre with a characteristic frequency of 1100 Hz, stimulated with 80-dB pure tones with a frequency of 412, 600, 900, 1000, 1100, 1200, 1300, 1400, 1500, and 1600 Hz, respectively (A–J). The peaks in the histograms correspond very closely to integer multiples of the stimulus period, indicated by small dots right below the abscissa. Reproduced from Rose et al. [68, p. 772, Fig. 1], with the permission of the American Physiological Society; permission conveyed through Copyright Clearance Center, Inc

2.7 Summary Schema of Peripheral Auditory Processing

A schematic overview of the auditory system is presented in Fig. 2.24. The first processing stage is indicated with “dBA filter” and represents the filter functions of the outer and middle ear. On the basilar membrane, the signal is split into a large number of frequency channels. Here, this is presented vertically by a large array of band-pass filters, or “BPF”. This filtering will be described in great detail in the next chapter. Of the next stage “AGC and HWR”, automatic gain control “AGC” represents the operation of the outer hair cells, and “HWR”, half-wave rectification, the operation of the inner hair cells. Next, the low-pass filter “LPF” represents the statistical nature of the transduction process and the spike-generation process. The spike-generation “SG” process then concludes the operation of the peripheral hearing system. It is the information contained in this multitude of spike trains running along the auditory-nerve fibres from the cochlea to the central hearing system that is used to create an auditory representation of what happens around the listener. With the schema presented in Fig. 2.24, tuned, of course, with the right parameter settings, one can in fact model many of the properties of the peripheral auditory system, certainly as a good first approximation (for more detailed reviews see Lopez-Poveda [44] and Meddis and Lopez-Poveda [47]). Models like these are indeed often used as a perceptual front end in pre-processing of recorded music, speech signals, or

Fig. 2.24 Schematic model of the auditory system, indicating the successive functional stages of the peripheral auditory system. BPF = band-pass filtering, AGC = automatic gain control, HWR = half-wave rectification, LPF = low-pass filtering, SG = spike generation. These functions are symbolically sketched by the icons in the rectangles. CNS = central nervous system (Matlab)

other sounds [45]. Elements of this model will be used in describing computational models for loudness perception in Chap. 7 and for pitch perception in Chap. 8. In the model on loudness perception, the spectral properties of the system and its frequency resolution play an essential role. An attempt will be made to answer the following question: If a sound consists of more than one frequency component, how much does every separate component contribute to the loudness of the sound? For this, a valid model of the filter properties of our hearing system is required. This model will be presented in the next chapter. For the model on pitch perception, the temporal properties of the system, such as phase lock and the spike-interval distributions of the spike trains of the auditory-nerve fibres, will appear to be essential.
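One channel of the processing chain of Fig. 2.24 can be strung together in a few lines of Matlab. The sketch below is a strongly simplified, single-channel stand-in, not the model used in later chapters: a crude resonator serves as the band-pass filter, a static power-law compression as the outer-hair-cell gain control, half-wave rectification and one-pole smoothing as the inner hair cell, and Bernoulli draws as the spike generator. All filter and rate parameters are illustrative assumptions; a realistic front end would use the auditory filters of the next chapter.

% Minimal single-channel sketch of the processing chain of Fig. 2.24
fs = 44100;  t = (0:1/fs:0.2)';
x  = sin(2*pi*1000*t) .* double(t < 0.1);             % input: a 100-ms, 1-kHz tone
% BPF: a crude second-order resonator around 1 kHz (one "cochlear" channel)
w0 = 2*pi*1000/fs;  r = 0.99;
y  = filter([1 0 -1], [1 -2*r*cos(w0) r^2], x);
y  = y / max(abs(y));
% AGC: instantaneous compressive non-linearity standing in for outer-hair-cell amplification
y  = sign(y) .* abs(y).^0.3;
% HWR + LPF: half-wave rectification and one-pole smoothing (inner hair cell)
a  = exp(-2*pi*3000/fs);
v  = filter(1-a, [1 -a], max(y, 0));
% SG: spike generation driven by the smoothed "receptor potential"
rate   = 50 + 400*v;                                  % assumed mapping to spikes/s
spikes = rand(size(v)) < rate/fs;                     % one Bernoulli draw per sample
fprintf('Mean firing rate during the tone: %.0f spikes/s\n', sum(spikes(t < 0.1))/0.1);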

2.8 The Central Auditory Nervous System

The central auditory system is divided into a number of auditory nuclei, most probably representing successive processing stages of the incoming auditory information. The most important of these nuclei are schematically presented in Fig. 2.25, including the ascending connections between the nuclei. As mentioned, there are also descending connections, even stretching down to the muscles of the middle ear and the hair cells, but for the sake of simplicity these connections have been omitted. The fibres in the auditory nerve, coming from the cochlea, arrive at the “lowest” auditory nucleus in the central auditory nervous system, the cochlear nucleus. This nucleus is located in the medulla oblongata. A second auditory nucleus, or rather complex, in the medulla oblongata is the superior olivary complex, which plays an important role in sound localization. Then the system ascends to the inferior colliculus of the midbrain or mesencephalon. From there it goes further up to the medial geniculate body in the interbrain or diencephalon. Finally, the ascending auditory pathway reaches the auditory cortex, where various auditory regions can be found. All these auditory nuclei can be divided into subdivisions. In general, two pathways are distinguished, the dorsal where system, and the ventral what system [7, 42, 60, 67]. As the names suggest, the where system processes information relevant for sound localization, while the what system processes information about the nature of the auditory events that take place. The various subdivisions of the auditory nuclei are attributed different roles in these two systems. Besides the ascending pathways shown in Fig. 2.25, all these nuclei are connected by descending pathways. Indeed, this descending system projects even back to the peripheral parts of the auditory system. The stapedius reflex and the tensor-tympani reflex are controlled by central auditory structures, as is the efferent control of the outer hair cells through the olivocochlear bundle which originates, as the name suggests, from the superior olivary complex. Actually, the central auditory system is a very extensive network consisting of many layers connected by ascending and descending pathways, and with nuclei divided into subdivisions to which different functions can be, and are, attributed.

Fig. 2.25 Overview of the ascending auditory pathway. CN = cochlear nucleus; SOC = superior olivary complex; IC = inferior colliculus; MGB = medial geniculate body (Matlab)

In addition to this, one must realize that, in many perceptual processes such as sound detection or sound localization, usually not only the auditory system is involved, but also one or two of our other senses, e.g., our visual system or our tactile system. Moreover, the reader will see that “cognitive” processes such as attention and memory can play an important role in auditory perception. This all makes the picture quite complex. Technology is developing rapidly, and a fast-growing number of studies is becoming available concerned with the complexities of these neural systems. Important as these studies are, and will appear to be even more so in the future, they will only be mentioned when they strengthen an argument or illustrate a new insight. The emphasis will be on finding the regularities and functionalities of human hearing. Attention will be paid to what kind of information is available in the spike trains of the auditory-nerve fibres and how it may be used in the various functional processes of the auditory system such as auditory event detection, loudness perception, or pitch perception. Only when it strengthens an argument will it be mentioned how and where in the central nervous system the calculations underlying these operations take place. It is realized that, in the end, perception research must comprise explanatory models of all subdivisions of our perceptual systems, models based on results from anatomical and physiological studies of these divisions and subdivisions. Indeed, the central nervous system consists of complex and coherent neural networks, and only models that incorporate the essential properties of such systems will be able to present a conclusive description of their function and operation, and to understand the “neural code” [19]. In the next few chapters, it will be shown how determinative the consequences of the anatomy and physiology of the peripheral auditory system are, especially with regard to temporal and frequency resolution. It is very unlikely that this close relation between the anatomy and the physiology on the one hand and the functionality on the other will stop just one step after the transduction process. Comprehensive comparative and functional overviews of the auditory central nervous system in humans and macaque monkeys are presented by Recanzone and Sutter [61] and Woods et al. [95]. Let us end with a summary of the operation of the peripheral auditory system and the function of the central auditory system: “The peripheral processing of sound by the external ear, middle ear and cochlea precedes the neural encoding of sound. The external ear collects sound power that is transmitted and reflected within the ear canal, absorbed by the middle ear, and transmitted as a coupled mechanical-fluid wave motion within the cochlea. The cochlea analyses the time-varying frequency components of the incident sound and converts them into a spatial-temporal distribution of neural spikes in the fibres of the auditory nerve. This encoded signal is subsequently analysed by the brain through detection and classification processes, which enables the human listener to construct a perceptual and cognitive map of the auditory world” (Keefe [33, p. 8]).

References 1. Aiken SJ et al (2013) Acoustic stapedius reflex function in man revisited. Ear Hear 34(4):e38– e51. https://doi.org/10.1097/AUD.0b013e31827ad9d3 2. Andéol G et al (2011) Auditory efferents facilitate sound localization in noise in humans. J Neurosci 31(18):6759–6763. https://doi.org/10.1523/JNEUROSCI.0248-11.2011 3. Ashmore R et al (2010) The remarkable cochlear amplifier. Hear Res 226:1–17. https://doi. org/10.1016/j.heares.2010.05.001 4. Avan P, Büki B, Petit C (2013) Auditory distortions: origins and functions. Physiol Rev 93(4):1563–1619. https://doi.org/10.1152/physrev.00029.2012 5. Békésy G (1928) Zur Theorie des Hörens: die Schwingungsform der Basilarmembran. Physikalische Zeitschrift 29:93–810 6. Bell A (2012) A resonance approach to cochlear mechanics. PLoS ONE 7(11):e47918, 21 pages. https://doi.org/10.1371/journal.pone.0047918 7. Bizley JK, Cohen YE (2013) The what, where and how of auditory-object perception. Nat Rev Neurosci 14(10):693–707. https://doi.org/10.1038/nrn3565 8. Borg E (1968) A quantitative study of the effect of the acoustic stapedius reflex on sound transmission through the middle ear of man. Acta Oto-Laryngol 66(1–6):461–472. https:// doi.org/10.3109/00016486809126311 9. Borg E, Zakrisson J-E (1975) The activity of the stapedius muscle in man during vocalization. Acta Oto-Laryngol 79(3–6):325–333. https://doi.org/10.3109/00016487509124694 10. Brask T (1978) The noise protection effect of the stapedius reflex. Acta Oto-Laryngol 86(S360):116–117. https://doi.org/10.3109/00016487809123490 11. Brownell WE (2017) What is electromotility? The history of its discovery and its relevance to acoustics. Acoust Today 13(1):20–27. http://acousticstoday.org/wp-content/uploads/ 2017/01/What-Is-Electromotility-The-History-of-Its-Discovery-and-Its-Relevance-toAcoustics-William-E.-Brownell.pdf 12. Brownell WE et al (1985) Evoked mechanical responses of isolated cochlear outer hair cells. Science 227(4683):194–196. https://doi.org/10.1126/science.3966153 13. Brugge JF et al (1969) Time structure of discharges in single auditory nerve fibers of the squirrel monkey in response to complex periodic sounds. J Neurophysiol 32(3):386–401. https://doi.org/10.1152/jn.1969.32.3.386 14. Carricondo F, Romero-Gómez B (2019) The cochlear spiral ganglion neurons: the auditory portion of the VIII nerve. Anat Rec 302(3):463–471. https://doi.org/10.1002/ar.23815 15. Church GT, Cudahy EA (1984) The time course of the acoustic reflex. Ear Hear 5(4):235–242. https://doi.org/10.1097/00003446-198407000-00008 16. Clark NR et al (2012) A frequency-selective feedback model of auditory efferent suppression and its implications for the recognition of speech in noise. J Acoust Soc Amer 132(3):1535– 1541. https://doi.org/10.1121/1.4742745 17. Cooper NP, Guinan JJ Jr (2006) Efferent-mediated control of basilar membrane motion. J Physiol 576(1):49–54. https://doi.org/10.1113/jphysiol.2006.114991 18. Culler E, Finch G, Girden E (1935) Function of the round window in hearing. Amer J Physiol 111(2):416–425 19. Eggermont JJ (2001) Between sound and perception: reviewing the search for a neural code. Hear Res 157:1–42. https://doi.org/10.1016/S0378-5955(01)00259-3 20. Eggermont JJ, et al (eds) (2012) Tinnitus. Springer, Berlin. https://doi.org/10.1007/978-14614-3728-4 21. Elgoyhen AB, Fuchs PA (2010) Efferent innervation and function. In: Fuchs PA (ed) Oxford handbook of auditory science: the ear. Oxford University Press, Oxford, UK, pp 283–306 22. 
Evans EF, Palmer AR (1980) Relationship between the dynamic range of cochlear nerve fibres and their spontaneous activity. Exp Brain Res 40(1):115–118. https://doi.org/10.1007/ BF00236671 23. Fettiplace R, Hackney CM (2006) The sensory and motor roles of auditory hair cells. Nat Rev Neurosci 7(1):19–29. https://doi.org/10.1038/nrn1828

24. Glasberg BR, Moore BC (2006) Prediction of absolute thresholds and equal-loudness contours using a modified loudness model. J Acoust Soc Amer 120(2):585–588. https://doi.org/10. 1121/1.2214151 25. Goblick TJ Jr, Pfeiffer RR (1969) Time-domain measurements of cochlear nonlinearities using combination click stimuli. J Acoust Soc Amer 46(4B):924–938. https://doi.org/10. 1121/1.1911812 26. Greenwood DD (1997) The Mel Scale’s disqualifying bias and a consistency of pitchdifference equisections in 1956 with equal cochlear distances and equal frequency ratios. Hear Res 103:199–224. https://doi.org/10.1016/S0378-5955(96)00175-X 27. Guinan JJ Jr (2012) How are inner hair cells stimulated? Evidence for multiple mechanical drives. Hear Res 292:35–50. https://doi.org/10.1016/j.heares.2012.08.005 28. Guinan JJ Jr (2006) Olivocochlear efferents: anatomy, physiology, function, and the measurement of efferent effects in humans. Ear Hear 27(6):589–607. https://doi.org/10.1097/01.aud. 0000240507.83072.e7 29. Guinan JJ Jr (2018) Olivocochlear efferents: their action, effects, measurement and uses, and the impact of the new conception of cochlear mechanical responses. Hear Res 362:38–47. https://doi.org/10.1016/j.heares.2017.12.012 30. Heil P, Peterson AJ (2015) Basic response properties of auditory nerve fibers: a review. Cell Tissue Res 361(1):129–158. https://doi.org/10.1007/s00441-015-2177-9 31. Hudspeth AJ (1989) How the ear’s works work. Nature 341(6241):397–404. https://doi.org/ 10.1038/341397a0 32. Jurado C, Pedersen CS, Moore BC (2011) Psychophysical tuning curves for frequencies below 100 Hz. J Acoust Soc Amer 129(5):3166–3180. https://doi.org/10.1121/1.3560535 33. Keefe DH (2012) Acoustical test of middle-ear and cochlear function in infants and adults. Acoustics Today (8):8–17. https://acousticstoday.org/wp-content/uploads/2019/ 09/ACOUSTICAL-TESTS-OF-MIDDLE-EAR-AND-COCHLEAR-FUNCTION-ININFANTS-AND-ADULTS-Douglas-H.Keefe_.pdf 34. Kemp DT (2002) Otoacoustic emissions, their origin in cochlear function, and use. Br Med Bull 63(1):223–241. https://doi.org/10.1093/bmb/63.1.223 35. Kemp DT (1978) Stimulated acoustic emissions from within the human auditory system. J Acoust Soc Amer 64(5):1386–1391. https://doi.org/10.1121/1.382104 36. Kiang N, Moxon EC (1974) Tails of tuning curves of auditory-nerve fibers. J Acoust Soc Amer 55(3):620–630. https://doi.org/10.1121/1.1914572 37. Kim DO, Molnar CE, Matthews JW (1980) Cochlear mechanics: nonlinear behavior in twotone responses as reflected in cochlear-nerve-fiber responses and in ear-canal sound pressure. J Acoust Soc Amer 67(5):1704–1721. https://doi.org/10.1121/1.384297 38. Klockhoff I, Anderson H (1960) Reflex activity in the tensor tympani muscle recorded in man: preliminary report. Acta Oto-Laryngol 51(1–2):184–188. https://doi.org/10.3109/ 00016486009124480 39. Liberman MC (1982) Single-neuron labeling in the cat auditory nerve. Science 216(4551):1239–1241. https://doi.org/10.1126/science.7079757 40. Liberman MC, Simmons DD (1985) Applications of neuronal labeling techniques to the study of the peripheral auditory system. J Acoust Soc Amer 78(1):312–319. https://doi.org/10.1121/ 1.392492 41. Liu C, Glowatzki E, Fuchs PA (2015) Unmyelinated type II afferent neurons report cochlear damage. Proc Natl Acad Sci 112(47):14723–14727. https://doi.org/10.1073/pnas. 1515228112 42. Lomber SG, Malhotra S (2008) Double dissociation of ‘what’ and ‘where’ processing in auditory cortex. Nat Neurosci 11(5):609–616. https://doi.org/10.1038/nn.2108 43. 
Lopez-Poveda EA (2018) Olivocochlear efferents in animals and humans: from anatomy to clinical relevance. Front Neurol 9:18, Article 197. https://doi.org/10.3389/fneur.2018.00197 44. Lopez-Poveda EA (2005) Spectral processing by the peripheral auditory system: facts and models. Int Rev Neurobiol 70:7–48. https://doi.org/10.1016/S0074-7742(05)70001-5

138

2 The Ear

45. Lyon RF (2017) Human and machine hearing: extracting meaning from sound. Cambridge University Press, Cambridge, UK 46. Meddis R (1986) Simulation of mechanical to neural transduction in the auditory receptor. J Acoust Soc Amer 79(3):702–711. https://doi.org/10.1121/1.393460 47. Meddis R, Lopez-Poveda EA (2010) Auditory periphery: From pinna to auditory nerve (Chap 2). In: Meddis R et al (eds) Computational models of the auditory system. Springer Science+Business Media, New York, pp 7–38. https://doi.org/10.1007/978-1-4419-5934-8_2 48. Mertes IB, Johnson KM, Dinger ZA (2019) Olivocochlear efferent contributions to speech-innoise recognition across signal-to-noise ratios. J Acoust Soc Amer 145(3):1529–1540. https:// doi.org/10.1121/1.5094766 49. Møller H, Pedersen CS (2004) Hearing at low and infrasonic frequencies. Noise Health 6(23):37–57. https://www.noiseandhealth.org/text.asp?2004/6/23/37/31664 50. Moore BC (2007) Cochlear hearing loss: physiological, psychological and technical issues, 2nd edn. Wiley, Cambridge, UK 51. Moore BC, Glasberg BR, Baer T (1997) A model for the prediction of thresholds, loudness, and partial loudness. J Audio Eng Soc 45(4):224–240. http://www.aes.org/e-lib/browse.cfm? elib=10272 52. Nedzelnitsky V (1980) Sound pressures in the basal turn of the cat cochlea. J Acoust Soc Amer 68(6):1676–1689. https://doi.org/10.1121/1.385200 53. Olson ES, Duifhuis H, Steele CR (2012) Von Békésy and cochlear mechanics. Hear Res 293:31–43. https://doi.org/10.1016/j.heares.2012.04.017 54. Palmer AR, Russel IJ (1986) Phase-locking in the cochlear nerve of the guinea-pig and its relation to the receptor potential of inner hair-cells. Hear Res 24(1):1–15. https://doi.org/10. 1016/0378-5955(86)90002-X 55. Pickles JO (1982) Introduction to the physiology of hearing. Academic, London, UK 56. Pickles JO (2012) Introduction to the physiology of hearing, 4th edn. Emerald Group Publishing Ltd, Bingley, UK 57. Plomp R (1964) The ear as frequency analyzer. J Acoust Soc Amer 36(9):1628–1636. https:// doi.org/10.1121/1.1919256 58. Plomp R, Mimpen AM (1968) The ear as frequency analyzer. II. J Acoust Soc Amer 43(4):764–767. https://doi.org/10.1121/1.1910894 59. Rainsbury JW et al (2015) Vocalization-induced stapedius contraction. Otol & Neurotol 36(2):382–385. https://doi.org/10.1097/MAO.0000000000000447 60. Rauschecker JP, Tian B (2000) Mechanisms and streams for processing of ‘what’ and ‘where’ in auditory cortex. Proc Natl Acad Sci 97(22):11800–11806. https://doi.org/10.1073/pnas.97. 22.11800 61. Recanzone GH, Sutter ML (2008) The biological basis of audition. Ann Rev Psychol 59:119– 142. https://doi.org/10.1146/annurev.psych.59.103006.093544 62. Recio A et al (1998) Basilar-membrane responses to clicks at the base of the chinchilla cochlea. J Acoust Soc Amer 103(4):1972–1989. https://doi.org/10.1121/1.421377 63. Rhode WS, Robles L (1974) Evidence from Mössbauer experiments for nonlinear vibration in the cochlea. J Acoust Soc Amer 55(3):588–596. https://doi.org/10.1121/1.1914569 64. Robles L, Rhode WS, Geisler CD (1976) Transient response of the basilar membrane measured in squirrel monkeys using the Mössbauer effect. J Acoust Soc Amer 59(4):926–939. https:// doi.org/10.1121/1.380953 65. Robles L, Ruggero MA (2001) Mechanics of the mammalian cochlea. Physiol Rev 81(3):1305–1352. https://doi.org/10.1152/physrev.2001.81.3.1305 66. Robles L, Ruggero MA, Rich NC (1997) Two-tone distortion on the basilar membrane of the chinchilla cochlea. J Neurophysiol 77(5):2385–2399. 
https://doi.org/10.1152/jn.1997.77. 5.2385 67. Romanski LM et al (1999) Dual streams of auditory afferents target multiple domains in the primate prefrontal cortex. Nat Neurosci 2(12):1131–1136. https://doi.org/10.1038/16056 68. Rose JE et al (1967) Phase-locked response to low-frequency tones in single auditory nerve fibers of the squirrel monkey. J Neurophysiol 30(4):769–793. https://doi.org/10.1152/jn.1967. 30.4.769

References

139

69. Rubel EW, Furrer SA, Stone JS (2013) A brief history of hair cell regeneration research and speculations on the future. Hear Res 297:42–51. https://doi.org/10.1016/j.heares.2012.12.014 70. Rudnicki M et al (2015) Modeling auditory coding: from sound to spikes. Cell Tissue Res 361(1):159–175. https://doi.org/10.1007/s00441-015-2202-z 71. Ruggero MA, Temchin AN (2005) Unexceptional sharpness of frequency tuning in the human cochlea. Proc Natl Acad Sci 102(51):18614–18619. https://doi.org/10.1073/pnas. 0509323102 72. Russell IJ, Murugasu E (1997) Medial efferent inhibition suppresses basilar membrane responses to near characteristic frequency tones of moderate to high intensities. J Acoust Soc Amer 102(3):1734–1738. https://doi.org/10.1121/1.420083 73. Sachs MB, Abbas PJ (1974) Rate versus level functions for auditory-nerve fibers in cats: Tone-burst stimuli. J Acoust Soc Amer 56(8):1835–1847. https://doi.org/10.1121/1.1903521 74. Salomon G, Starr A (1963) Electromyography of middle ear muscles in man during motor activities. Acta Neurol Scand 39(2):161–168. https://doi.org/10.1111/j.1600-0404. 1963.tb05317.x 75. Sellick PM, Patuzzi R, Johnstone BM (1982) Measurement of basilar membrane motion in the guinea pig using the Mössbauer technique. J Acoust Soc Amer 72(1):131–141. https:// doi.org/10.1121/1.387996 76. Sharpeshkar R (2010) Neuromorphic electronics (Chap 23). In: Ultra low power bioelectronics: fundamentals, biomedical applications, and bio-inspired system. Cambridge University Press, Cambridge, pp 697–752. https://doi.org/10.1017/CBO9780511841446.023 77. Shastri U, Mythri HM, Kumar UA (2014) Descending auditory pathway and identification of phonetic contrast by native listeners. J Acoust Soc Amer 135(2):896–905. https://doi.org/10. 1121/1.4861350 78. Sivian LJ, White SD (1933) On minimum audible sound fields. J Acoust Soc Amer 4(4):288– 321. https://doi.org/10.1121/1.1915608 79. Smith DW, Keil A (2015) The biological role of the medial olivocochlear efferents in hearing: separating evolved function from exaptation. Front Syst Neurosci 9:6, Article 12. https://doi. org/10.3389/fnsys.2015.00012 80. Sumner CJ et al (2003) A nonlinear filter-bank model of the guinea-pig cochlear nerve: rate responses. J Acoust Soc Amer 113(6):3264–3274. https://doi.org/10.1121/1.1568946 81. Sumner CJ et al (2002) A revised model of the inner-hair cell and auditory nerve complex. J Acoust Soc Amer 111(5):2178–2188. https://doi.org/10.1121/1.1453451 82. Terreros G, Delano PH (2015) Corticofugal modulation of peripheral auditory responses. Front Syst Neurosci 9:8, Article 134. https://doi.org/10.3389/fnsys.2015.00134 83. Verschooten E et al (2019) The upper frequency limit for the use of phase locking to code temporal fine structure in humans: a compilation of viewpoints. Hear Res 377:109–121. https://doi.org/10.1016/j.heares.2019.03.011 84. Warr WB, Guinan JJ Jr (1979) Efferent innervation of the organ of Corti: two separate systems. Brain Res 173(1):152–155. https://doi.org/10.1016/0006-8993(79)91104-1 85. Warren RM (1999) Auditory perception: a new synthesis. Cambridge University Press, Cambridge, UK 86. Wegel RL, Lane CE (1924) The auditory masking of one pure tone by another and its probable relation to the dynamics of the inner ear. Phys Rev 23(2):266–285. https://doi.org/10.1103/ PhysRev.23.266 87. Weisz CJC, Glowatzki E, Fuchs PA (2014) Excitability of type II cochlear afferents. J Neurosci 34(6):2365–2373. https://doi.org/10.1523/JNEUROSCI.3428-13.2014 88. 
Weisz CJC et al (2012) Synaptic transfer from outer hair cells to type II afferent fibers in the rat cochlea. J Neurosci 32(28):9528–9536. https://doi.org/10.1523/JNEUROSCI.619411.2012 89. Westerman LA, Smith RL (1984) Rapid and short-term adaptation in auditory nerve responses. Hear Res 15(3):249–260. https://doi.org/10.1016/0378-5955(84)90032-7 90. Wiener FM, Ross DA (1946) The pressure distribution in the auditory canal in a progressive sound field. J Acoust Soc Amer 18(2):401–408. https://doi.org/10.1121/1.1916378

140

2 The Ear

91. Wilson JP (1973) A sub-miniature capacitive probe for vibration measurements of the basilar membrane. J Sound Vib 30(4):483–493. https://doi.org/10.1016/S0022-460X(73)80169-5 92. Wilson JP (1980) Evidence for a cochlear origin for acoustic re-emissions, threshold fine-structure and tonal tinnitus. Hear Res 2(3–4):233–252. https://doi.org/10.1016/03785955(80)90060-X 93. Wilson JP, Johnstone JR (1975) Basilar membrane and middle-ear vibration in guinea pig measured by capacitive probe. J Acoust Soc Amer 57(3):705–723. https://doi.org/10.1121/1. 380472 94. Wojtczak M et al (2012) Perception of across-frequency asynchrony and the role of cochlear delays. J Acoust Soc Amer 131(1):363–377. https://doi.org/10.1121/1.3665995 95. Woods DL et al (2009) Functional maps of human auditory cortex: effects of acoustic features and attention. PLoS ONE 4(4):19, Article e5182. https://doi.org/10.1371/journal.pone. 0005183 96. Yates GK (1990) Basilar membrane nonlinearity and its influence on auditory nerve rateintensity functions. Hear Res 50(1):145–162. https://doi.org/10.1016/0378-5955(90)90041M 97. Yates GK, Winter IM, Robertson D (1990) Basilar membrane nonlinearity determines auditory nerve rate-intensity functions and cochlear dynamic range. Hear Res 45(3):203–219. https:// doi.org/10.1016/0378-5955(90)90121-5 98. Yost WA (2000) Fundamentals of hearing: an introduction, 4th edn. Academic, San Diego, CA 99. Zilany MSA, Bruce IC (2006) Modeling auditory-nerve responses for high sound pressure levels in the normal and impaired auditory periphery. J Acoust Soc Amer 120(3):1446–1466. https://doi.org/10.1121/1.2225512 100. Zilany MSA, Bruce IC, Carney LH (2014) Updated parameters and expanded simulation options for a model of the auditory periphery. J Acoust Soc Amer 135(1):283–286. https:// doi.org/10.1121/1.4837815 101. Zilany MSA et al (2009) A phenomenological model of the synapse between the inner hair cell and auditory nerve: Long-term adaptation with power-law dynamics. J Acoust Soc Amer 126(5):2390–2412. https://doi.org/10.1121/1.3238250 102. Zwislocki JJ, Nguyen M (1999) Place code for pitch: a necessary revision. Acta Oto-Laryngol 119(2):140–145. https://doi.org/10.1080/00016489950181530

Chapter 3

The Tonotopic Array

One of the most important functions of the peripheral auditory system is its role as a frequency analyser [49]. In this description, the basilar membrane is characterized as a filterbank, an array of band-pass filters of increasing centre frequency. The distance along this array of filters will turn out to play a paramount role in hearing; in various cardinal perceptual processes, such as loudness perception or pitch perception, this distance largely determines the extent to which the frequency components of a sound interact with each other. This array of auditory filters will be referred to as the tonotopic array. To arrive at a perceptually validated description of this frequency array, not only the centre frequencies of the band-pass filters are needed, but also their bandwidths. How the spectral and the temporal properties of these auditory filters can be described is the subject of this chapter. A frequency scale will be constructed on which the bandwidth of each auditory filter is constant. As a consequence, distances along this scale determine the amount of interaction between different frequency components.

In the spectral domain, the concept of the auditory filter has led to the concept of the critical bandwidth [6], which defines the frequency resolution of the auditory system. Roughly speaking, if two sound components are further apart than the critical bandwidth, they do not affect each other, but if they are closer, one affects the other. This interaction expresses itself as audible interferences such as beats or roughness, or as a reduction in loudness of one of the components. The reduction in loudness of one sound component due to the presence of another is called masking. Hence, masking is one of the phenomena with which the critical bandwidth can be measured. Before discussing how the critical bandwidth can be measured, more will first be said about masking itself.



3.1 Masking Patterns and Psychophysical Tuning Curves

It is a common, everyday experience that, when two sounds are played together, one of them can be so intense that the other is heard less loudly or is not heard at all. This phenomenon, which was already studied early in the history of hearing research, e.g., in the 1920s by Wegel and Lane [67], is called masking. In order to measure the critical bandwidth, use will be made of the fact that, roughly speaking, sound components outside each other's critical bandwidths do not mask each other, since they do not interfere with each other. In doing so, it must first be realized that masking is not an all-or-none phenomenon. When one sound is not heard at all in the presence of another sound, this is called complete masking. When the loudness of a sound is only reduced in the presence of another sound, this is called partial masking. Masking will only be complete when, first, the frequency components of the two sounds are very close in frequency and, second, the frequency components of one of the two sounds are much more intense than those of the other. When these two conditions are not fulfilled, masking, if any, will only be partial. The extent to which masking takes place depends on the distance between the frequency components of the two sounds and on their relative intensities. This will be discussed quantitatively in Chap. 7. For the moment, only masking between just two narrow-band sounds, pure tones or narrow-band noises, will be described.

One of the apparently most obvious ways to investigate masking is to present listeners with two tones, and then to find out how intense one of the tones must be in order to mask the other. The result of such an experiment will be described in a moment. Before doing so, the various perceptual effects that can take place when two sounds are played together will be recapitulated. First, as just mentioned, one sound can completely mask the other. Next, the presence of one sound may only partially mask the other, which means that it only reduces the loudness of the other sound; two separate sounds will still be heard, each with its own perceptual attributes such as loudness, timbre, etc. Another possibility is that the two sounds perceptually merge, partially or completely, into one sound. This will certainly happen with two pure tones close in frequency. Indeed, as demonstrated in Sect. 1.5, when two pure tones are very close in frequency, the listener does not perceive two separate tones; instead, one pure tone is heard with a fluctuating loudness. The pitch of this one tone remains constant and corresponds to a frequency lying between the pitch frequencies the two tones would have when played separately. In this situation, the listener will detect the addition of a masker tone not by hearing a separate tone, but by detecting temporal fluctuations in the loudness of the tone and, perhaps, a small change in perceived pitch. This is illustrated in Fig. 3.1. Indeed, in 1950, Egan and Hake [5] determined the masking pattern of a pure tone. This was done for a 400-Hz tone at three different intensities. The masking pattern induced by this tone was systematically investigated by varying the frequency of another tone, the target, and determining, for every frequency, the elevation of the threshold of the target tone in the presence of the masker. The result is presented in


Fig. 3.1 Masking patterns of a pure 400-Hz tone at three different intensity levels, 40, 60, and 80 dB SPL. The target sound is a pure tone varied in frequency, and the plots show the elevation of the threshold of the target tone as a function of this frequency. Reproduced from Egan and Hake [5, p. 623, Fig. 1], with the permission of the Acoustical Society of America

Fig. 3.1 for tones with intensities of 40, 60, and 80 dB SPL. The plot at the masker level of 40 dB SPL is more or less symmetric but, at higher intensities of the masker, certainly at 80 dB, the plot becomes very asymmetrical. At frequencies lower than the frequency of the masker, the slope of the masking pattern is quite steep, showing that the threshold of the target tone is only affected at frequencies close to the frequency of the masker. At frequencies higher than the frequency of the masker, the slope is less steep. For the 80-dB curve, the elevation of the threshold remains quite high up to 2 to 3 kHz. This is an illustration of upward spread of masking, mentioned in the previous chapter in Sect. 2.5.1.2. Indeed, the basal part of the basilar membrane, the part closer to the oval window, vibrates passively with the lower-frequency stimulus, which induces an increase in threshold for stimuli of higher frequency. So, when the target tone is higher in frequency, its threshold may be raised by the lower-frequency masker, especially when the intensity of the masker is high. When, on the other hand, the frequency of the target tone is significantly lower than that of the masker, the travelling wave induced by the masker has already died out before it has reached the location on the basilar membrane excited by the target [67]. Another relevant aspect of the three graphs shown in Fig. 3.1 is a dip at 400 Hz, the frequency of the masker. This is due to the fact that, when the masker and the target tone are close in frequency, they perceptually merge into one tone with a fluctuating loudness and a pitch frequency intermediate between the frequencies of the separate tones. In this situation, the presence of these fluctuations reveals to the listener that the second tone is switched on. So, one cannot simply say that one tone masks the other. Rather, the target tone merges with the masker tone, and the listener detects


Fig. 3.2 Masking patterns of a narrow band of noise with a centre frequency of 410 Hz. The intensity of this masker is varied over 40, 60, and 80 dB SPL. The target sound is a pure tone varied in frequency, and the plots show the elevation of the thresholds of the target tone as a function of this frequency. Reproduced from Egan and Hake [5, p. 627, Fig. 2], with the permission of the Acoustical Society of America

this based on the changes in the perceptual properties of this tone. Apparently, the listener is quite sensitive to such changes, since the dips at 400 Hz in the masking patterns presented in Fig. 3.1 are quite clear. Other small but significant dips can be seen in the 60- and 80-dB masking curves presented in Fig. 3.1, at 800 and 1200 Hz, multiples of 400 Hz, the frequency of the masking tone. These dips are due to interferences between the target tone and the aural harmonics, i.e., harmonics of the masking tone in this case. They represent harmonic distortion products of the masker interacting with the target tone. For the purposes of this book, they are not so important. In order to prevent audible interferences between masker and target, not a pure tone but a narrow band of noise can be used as a masker. Such a narrow noise band is perceived as an almost pure tone irregularly fluctuating in loudness or pitch, as demonstrated in Fig. 1.68. The masking pattern of such a narrow band of noise is presented in Fig. 3.2. The centre frequency of this noise band is a bit higher than in the previous figure, 410 Hz; its bandwidth is 90 Hz. The dips in the graphs at the centre frequency of the noise, 410 Hz, and its multiples are now gone, apparently because the listener can no longer detect the presence of the target based on the absence or presence of loudness fluctuations. As in the masking pattern of a pure tone shown in Fig. 3.1, the masking pattern of the narrow noise band of low intensity, 40 dB SPL in this case, is quite symmetrical. For higher intensities the masking pattern gets more asymmetrical, which can be attributed to upward spread of masking, just as for a pure-tone masker.


Another way to avoid the dips in the masking curves is to present the target and the masker not simultaneously but successively. This can be done because the effect of masking extends beyond the offset of a sound. This is called forward masking. Actually, a sound can also, though to a lesser extent, have a masking effect on a preceding sound, which is called backward masking. Forward and backward masking are usually studied by measuring the threshold of a very short tone, the probe, at different intervals just preceding or following the masker. Sensory adaptation, briefly described above in Sect. 2.5.3.3, plays an important role in forward masking, but more central mechanisms will also have effects. The result of forward and backward masking is that the onsets of sound components are strongly emphasized. For a summary description of forward and backward masking, the reader is referred to Moore [29, pp. 110–116]. The masking patterns shown in Figs. 3.1 and 3.2 were obtained by presenting a masker of constant frequency and intensity. What was varied was the frequency of the target tone, i.e., the tone the listener had to detect, and for each frequency the intensity was determined at which the target tone was just audible. Since the target varies in frequency, listeners will focus on those locations on the tonotopic array corresponding to this varying frequency. So, in measuring masking patterns, the masker is kept constant and the listeners focus their attention at various locations on the tonotopic array. In the measurement of psychophysical tuning curves, these two roles are reversed. Psychophysical tuning curves are obtained by fixing the frequency and intensity of the target tone; for a wide range of masker frequencies, the intensity a masker must have to completely mask the target is then determined [64]. These psychophysical tuning curves must not be confused with physiological tuning curves, examples of which have been presented in Fig. 2.20 of the previous chapter. Six examples of psychophysical tuning curves are presented in Fig. 3.3 [65]. The ordinate gives the intensity level of the masker, L_m, and the abscissa its frequency, f_m. Before interpreting tuning curves, one has to take account of the nature of the frequency axis on which the results are plotted. In Fig. 3.3, both the vertical intensity axis and the horizontal frequency axis are presented on logarithmic scales. On these scales, the tuning curves for target frequencies of 1 kHz and higher are about equally wide, but at lower frequencies, the tuning curves are significantly wider. Another observation is that tuning curves are steeper on their high-frequency sides than on their low-frequency sides. This can again be explained by the asymmetrical nature of masking. Due to upward spread of masking, the target tone will, relatively speaking, more readily be masked by a masker of lower frequency than by a masker of higher frequency. In fact, certainly if the target tone has a low intensity, it is almost always possible to mask it with a low-frequency masker, but the very steep upper sides of the tuning curves in Fig. 3.3 show that maskers with frequencies higher than the frequency of the target tone must be very intense in order to mask the target and, when the frequency difference is too large, cannot mask the target at all without damaging the ear. This is why tuning curves are in general much steeper on their right-hand side than on their left-hand side.
In both masking patterns and psychophysical tuning curves, the band-pass characteristics of the auditory system become quite clear. For various reasons, however,


Fig. 3.3 Psychophysical tuning curves as measured by Vogten [65]. The dashed line is the threshold curve of the participant. Both masker and target were short 50-ms pure tones. The level and frequency of the target tones are indicated by the dots between the threshold curve and the corresponding tuning curve. The frequencies of the target tones were 0.25, 0.5, 1, 2, 4, and 8 kHz. Reproduced from Vogten [65, p. 147, Fig. 5], with the permission of Springer; permission conveyed through Copyright Clearance Center, Inc.

the shape of these frequency-response curves can vary considerably. First, as shown in Figs. 3.1 and 3.2, the amount of upward spread of masking increases with increasing intensity of the masker, which makes the estimation of the bandwidth of the auditory filter dependent on intensity; the higher the intensity of the masker, the broader the filter appears to be. Second, the frequency response of a filter only represents the linear properties of the filter, while the auditory filter is not strictly linear. Third, tuning may also be influenced by the efferent input from the olivocochlear feedback [21, 22], discussed in the previous chapter. Another complicating factor is known as off-frequency listening [43]. When one measures a psychophysical tuning curve, say at 1 kHz, the level is determined at which another tone, say of 0.9 kHz, just masks the 1-kHz target. In this situation, listeners may not focus their attention on what happens at exactly 1 kHz but on a somewhat higher frequency where the effect of the 0.9-kHz masker is less. So, it is possible that one does not measure what happens in the auditory filter centred at exactly 1 kHz, but in a filter with a somewhat higher centre frequency. Dedicated experiments in which off-frequency listening cannot play a role have indeed shown that off-frequency listening results in somewhat broader estimates of the auditory-filter bandwidth. Such experiments will be described below in Sect. 3.3 of


this chapter. It is concluded that deriving the bandwidth of the auditory filter from its frequency response is not straightforward, and that the way in which it is measured should be well specified.

3.2 Critical Bandwidth

Above, it was shown that the bandwidth characteristics of the auditory system express themselves at various levels in both the physiological and the psychophysical domain. The bandwidths of these filters determine the frequency resolution of the auditory system and, hence, the capability to separate the frequency components which together make up a sound. One of the earliest masking-based methods used to estimate this resolution is to measure the threshold of a pure tone in combination with a more or less narrow noise band centred on the frequency of the tone. Such stimuli are demonstrated in Fig. 3.4. The threshold of this tone is then measured for different bandwidths of this noise band while keeping the spectral density, i.e., the intensity of the noise per hertz, fixed. The idea underlying this method is that only that part of the noise that passes the auditory filter with the frequency of the tone as characteristic frequency can contribute to masking the tone and, hence, to raising its threshold. As soon as the bandwidth of the noise is wider than the bandwidth of the auditory filter, further increasing the bandwidth of the noise can no longer affect the threshold of the target tone. This bandwidth beyond which the noise can no longer affect the threshold is called the critical bandwidth. So, when one plots the threshold of the target tone as a function of the bandwidth of the noise masker, the threshold will first increase with increasing noise bandwidth until this bandwidth equals the critical bandwidth. When the noise bandwidth is increased further, only noise components outside the critical bandwidth are added, and these do not contribute to masking the target tone.

Fig. 3.4 Spectrogram of a stimulus used to determine the critical bandwidth by masking. Eight noise bursts are played with bandwidths of 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 octave, centred on a logarithmic scale around 1.2 kHz. During each noise burst, ten 1.2-kHz pure tones are presented with levels increasing in steps of 3 dB. Modified after tracks 2 to 6 of the CD by Houtsma, Rossing, and Wagenaars [16]. (Matlab) (demo)


Hence, the threshold will remain constant. This is indeed what one finds. A demo with the kinds of sound used in such experiments is presented in Fig. 3.4. It shows the spectrogram of a sequence of pure 1.2-kHz tones with intensity increasing in steps of 3 dB. These tones are played with concurrent bursts of band-pass noise centred around 1.2 kHz. In the first three to five bursts, the number of tones that can be heard is more or less stable. But when the bandwidth of the noise bursts gets smaller than the critical bandwidth, the number of audible tones increases, and in the last burst nine or all ten tones can be heard. This is just one of the methods used for measuring the critical bandwidth. Another, perhaps less direct but no less precise, method is based on loudness summation. Loudness summation is the phenomenon that, when the frequency components of a sound do not mask each other, the total loudness of the sound can be obtained by adding the loudnesses of the separate components. When the frequency components of a sound are closer to each other than the critical bandwidth, mutual interactions come into play, and the total loudness will be less than the sum of the loudnesses of the separate components. This important phenomenon, which will be discussed in much closer detail in Chap. 7, can indeed be used to measure the critical bandwidth and leads to comparable estimates. In this procedure, the listener is presented with only one sound, a noise band. The overall intensity of this noise band is fixed while its bandwidth is varied. Keeping the intensity of the noise constant while varying its bandwidth amounts to spreading its energy over a wider range of frequencies, which means that the energy per hertz, i.e., the spectral density, decreases. In other words, for a wider band, the acoustic energy of the sound is spread more widely over the tonotopic array than for a narrower band of the same intensity. The underlying idea is now that, when the energy of the noise is more concentrated around the centre frequency, the noise components mutually interact, "suppress" each other as it were, to a larger extent than when the energy of the noise is distributed over a wider range of frequencies. As a consequence, below a certain bandwidth, and this is the critical bandwidth, the perceived loudness of the noise burst will remain constant at a relatively low level; but when the noise bandwidth exceeds the critical bandwidth and the frequency components are distributed over a wider range of frequencies, there will be less mutual interference, less "masking", and the loudness of the noise band will increase as a consequence. And this is indeed what one finds. This is illustrated in Fig. 3.5. The demo consists of a sequence of pairs of noise bursts. The first burst of a pair remains the same, but the second, while maintaining the same intensity, increases in bandwidth in steps of 1/3 of an octave from 0.2 octaves to 1.6 octaves. So, the intensity is spread out over a larger and larger part of the tonotopic array. When this exceeds the critical bandwidth, the loudness of the bursts starts to increase. Remarkably, this change in loudness may also be associated with a change in the perceived effort with which the sound is produced (see Sect. 6.7.3). These experiments, when carried out at moderate sound levels, have yielded quite reliable and consistent results.
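As an indication of how such a pair of bursts can be generated, the following Matlab sketch produces two noise bursts of equal overall intensity but different bandwidths, as in the demo of Fig. 3.5. It is a minimal illustration, not the script behind that demo; the band limits are imposed here by a simple brick-wall filter in the frequency domain.

% Two band-pass noise bursts of equal intensity but different bandwidths,
% centred around 1500 Hz (illustrative sketch only).
fs = 44100;  dur = 0.5;  n = round(fs*dur);
fc = 1500;  widthsOct = [0.2 1.6];               % narrow and wide burst (octaves)
pair = [];
for w = widthsOct
    fLo = fc * 2^(-w/2);  fHi = fc * 2^(w/2);
    X   = fft(randn(n, 1));
    f   = (0:n-1)' * fs / n;                     % frequencies of the FFT bins
    keep = (f >= fLo & f <= fHi) | (f >= fs-fHi & f <= fs-fLo);
    X(~keep) = 0;                                % keep only the pass band and its mirror
    burst = real(ifft(X));
    burst = 0.1 * burst / sqrt(mean(burst.^2));  % equalize the overall intensity
    pair  = [pair; burst; zeros(round(0.2*fs), 1)];   % add a short pause
end
sound(pair, fs);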
It turned out that the critical bandwidth below 500 Hz is more or less constant at about 100 Hz. Above 500 Hz, the critical bandwidth increases when expressed in hertz, but becomes more or less constant when expressed on a


Fig. 3.5 Spectrogram of a stimulus used to determine the critical bandwidth by loudness comparisons. Ten pairs of band-pass filtered noise bursts are played, centred on a logarithmic scale around 1500 Hz. The first burst of a pair has a bandwidth of 0.2 octaves and remains the same. The second burst of a pair has the same intensity, but its bandwidth increases in steps of 1/3 of an octave from 0.2 octaves to 1.6 octaves. The perceived loudnesses of the two bursts of a pair are equal for the first four to five pairs, but then the loudness of the second burst increases in spite of the equal intensity of the two members of a pair. (Adapted from track 7 of the CD by Houtsma, Rossing, and Wagenaars [16]). (Matlab) (demo)

logarithmic frequency scale. Actually, above 1 kHz the critical bandwidth becomes about one quarter of an octave, i.e., three semitones. This is summarized in Table 3.1, after Zwicker [69], for 24 different frequencies. These frequencies are presented in the second column and are chosen in such a way, and this is not a coincidence, that the interval between two successive frequencies corresponds closely to the critical bandwidth within this interval. This list of frequencies, therefore, represents points one critical bandwidth apart, points numbered from 1 to 24 as indicated in the first column of Table 3.1. At various instances it has been shown that the perceptual effect that different frequency components of a sound have on each other depends on whether or not they are within each other's critical bandwidths. Table 3.1 thus serves as a reference in which one can look up whether two frequency components are within or outside each other's critical bandwidths. Moreover, by interpolation of the values in the first and second columns of Table 3.1, one can calculate for any two frequencies how much they are separated from each other, not in hertz, octaves, or semitones, but in units that indicate how much closer together or farther apart they are than one critical bandwidth. This provides a handle for constructing a frequency scale on which distances are expressed relative to the critical bandwidths of the auditory filters. The data presented in Table 3.1 are the data that have led to the definition of what is called the Bark scale, the units of which are presented in the first column. It appears that below 500 Hz the critical bandwidth is more or less constant at about 100 Hz. Above 500 Hz, the Bark scale becomes more and more equal to a logarithmic frequency scale, and the critical bandwidth is about one quarter of an octave. The procedure with which the Bark scale is quantitatively computed will be discussed in a systematic way in Sect. 3.7 of this chapter.


Table 3.1 The critical bandwidth for a number of frequencies. Data reproduced from Zwicker [69, p. 248, Table I], with the permission of the Acoustical Society of America

Bark   Frequency (Hz)   Lower cut-off frequency (Hz)   Critical bandwidth (Hz)   Upper cut-off frequency (Hz)
 1         50               20                            80                       100
 2        150              100                           100                       200
 3        250              200                           100                       300
 4        350              300                           100                       400
 5        450              400                           110                       510
 6        570              510                           120                       630
 7        700              630                           140                       770
 8        840              770                           150                       920
 9       1000              920                           160                      1080
10       1170             1080                           190                      1270
11       1370             1270                           210                      1480
12       1600             1480                           240                      1720
13       1850             1720                           280                      2000
14       2150             2000                           320                      2320
15       2500             2320                           380                      2700
16       2900             2700                           450                      3150
17       3400             3150                           550                      3700
18       4000             3700                           700                      4400
19       4800             4400                           900                      5300
20       5800             5300                          1100                      6400
21       7000             6400                          1300                      7700
22       8500             7700                          1800                      9500
23      10500             9500                          2500                     12000
24      13500            12000                          3500                     15500

Zwicker and Terhardt [73] presented an analytical expression for the critical bandwidth as a function of frequency. Indeed, when f is frequency in Hz, the following equation gives the critical bandwidth CB:

$$\mathrm{CB} = 25 + 75\left[1 + 1.4\,(f/1000)^2\right]^{0.69} \qquad (3.1)$$
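As a simple check, Eq. (3.1) can be evaluated at the centre frequencies of Table 3.1; the values it returns should lie close to the critical bandwidths listed in the fourth column of that table. The following Matlab sketch, added here as an illustration and not one of the demo scripts accompanying this book, does just that:

% Critical bandwidth after Eq. (3.1), evaluated at the Bark-band
% centre frequencies of Table 3.1 (illustrative sketch).
f  = [50 150 250 350 450 570 700 840 1000 1170 1370 1600 1850 2150 ...
      2500 2900 3400 4000 4800 5800 7000 8500 10500 13500];   % Hz
CB = 25 + 75 * (1 + 1.4 * (f/1000).^2).^0.69;                 % Hz, Eq. (3.1)
fprintf('%6.0f Hz : CB = %5.0f Hz\n', [f; CB]);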


3.3 Auditory-Filter Bandwidth and the Roex Filter

It has already been indicated that, due to non-linear phenomena, to upward spread of masking, or to off-frequency listening, estimates of the auditory-filter bandwidth should be considered with great care. Furthermore, it was shown above that the critical bandwidth above about 500 to 1000 Hz is relatively constant on a logarithmic frequency scale, about one third of an octave, but turns out to be wider for lower frequencies. Experimentally measuring the critical bands below 500 Hz appears to be quite difficult because, at these low frequencies, the sensitivity and the efficiency of the auditory system rapidly diminish, while headphone technology can cause leakage of acoustic energy at low frequencies. In order to avoid these problems, more advanced and more accurate methods for measuring the auditory-filter bandwidth have been developed. One of the most often used methods is based on notched-noise maskers [41, 42]. Notched noise is wide-band noise from which a more or less narrow frequency band, the notch, has been filtered out. It is also referred to as band-reject noise or band-stop noise. Notched noise was introduced as band-stop noise in Chap. 1 of this book, where demos were presented in Figs. 1.64, 1.65, and 1.66. The notched-noise method is demonstrated in Fig. 3.6. A sequence of eight 5-s noise bursts is played with notches of decreasing bandwidth. The notches have a centre frequency of 1.2 kHz. During each burst, a sequence of ten 1.2-kHz pure tones is played with increasing intensity. At the beginning, when the notch is wide, all ten tones of the sequence can be heard but, as the notch in the noise gets narrower than the critical bandwidth, the threshold increases and fewer and fewer tones of the sequence can be heard. This shows how the bandwidth of the auditory filter at a certain frequency can be determined.

Fig. 3.6 Spectrogram of sounds used to demonstrate how notched noise can be used to measure the critical bandwidth. Eight noise bursts are played with notches of 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 octave centred on a logarithmic scale around 1.2 kHz. During each noise burst, ten 1.2-kHz pure tones are presented with levels increasing in steps of 3 dB. (Matlab) (demo)
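As an indication of how such a stimulus can be constructed, the following Matlab sketch generates a single notched-noise burst with a probe tone at the centre of the notch. It is a minimal illustration, not the script behind the demo of Fig. 3.6; there, the notch width is varied from burst to burst and the probe tones are stepped up in 3-dB increments.

% Minimal sketch of a notched-noise masker plus probe tone (illustration only).
% The notch is realized as a brick-wall band-stop filter in the frequency domain.
fs   = 44100;  dur = 1.0;  n = round(fs*dur);
fc   = 1200;                          % centre frequency of the notch (Hz)
wOct = 1/4;                           % notch width in octaves (varied in the demo)
fLo  = fc * 2^(-wOct/2);  fHi = fc * 2^(wOct/2);
x    = randn(n, 1);                   % wide-band noise
X    = fft(x);
f    = (0:n-1)' * fs / n;             % frequencies of the FFT bins
stop = (f >= fLo & f <= fHi) | (f >= fs-fHi & f <= fs-fLo);
X(stop) = 0;                          % remove the notch band and its mirror image
masker = real(ifft(X));
masker = 0.1 * masker / std(masker);  % set the masker to a convenient level
probe  = 0.05 * sin(2*pi*fc*(0:n-1)'/fs);   % probe tone at the notch centre
sound(masker + probe, fs);            % play masker and probe together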


In this way, by varying the width of the notch in the noise, the transfer characteristic of the auditory filters can be measured. It would be natural, as is common practice among electrical engineers, to express the bandwidth of the auditory filter as the distance between the -3-dB points, i.e., the points where the band-pass characteristic of the filter has decreased by 3 dB. For theoretical reasons that will be discussed later on, the width of the filter will instead be expressed as its equivalent rectangular bandwidth (ERB). The ERB of a band-pass filter is the bandwidth of an ideal rectangular band-pass filter with the same peak gain that, when fed with white noise, passes the same amount of energy as the filter at hand. Using this notched-noise method, the ERBs have been estimated for auditory filters covering the full frequency range of human hearing. Moore and Glasberg [34] presented an analytical approximation for the ERB of an auditory filter as a function of its centre frequency f:

$$\mathrm{ERB_N}(f) = 24.7\,(0.00437\,f + 1)\ \mathrm{Hz} \qquad (3.2)$$

where the subscript N in the variable ERB_N indicates that this equation represents the ERB of a normally hearing young listener at moderate sound levels. One of the main results of these experiments is that, in general, the auditory-filter bandwidth measured with notched noise is smaller than the bandwidth measured by the methods used in determining the Bark scale. For instance, according to the value found in Table 3.1, the CB at 1000 Hz is 160 Hz, about 0.23 octaves, while the above equation yields an ERB_N at 1000 Hz of 133 Hz, about 0.19 octaves. At 7000 Hz the difference has increased: the CB is 1300 Hz, about 0.27 octaves, while the ERB_N at 7000 Hz is 780 Hz, about 0.16 octaves. The method described here assumes a symmetrical auditory filter, which may be appropriate at low to moderate stimulus levels, since the physiological and psychophysical results described above show that, for such low stimulus levels, the filters are more or less symmetrical. For higher intensities, however, the auditory filters become more and more asymmetrical. This can be accounted for by positioning the target tone asymmetrically within the notch of the noise [43]. In addition, besides becoming more asymmetrical, the auditory-filter bandwidth gets wider for higher intensities [1]. In this way, the frequency characteristic of the auditory filter has been measured for a large range of frequencies covering the entire range of human hearing and for a large range of intensities. In order to come up with a generalizable description of the auditory filter that can be used for calculations on human hearing, this description should be concise and adequate with as few parameters as possible. This all led to a frequency- and intensity-dependent description of the frequency characteristic |H(f)|^2 of the auditory filter with skirts that are approximated by what are called rounded exponentials (roex). The filter described in this way is called a roex filter. Mathematically, a rounded exponential is a function whose derivative is 0 at the origin and which can be written as the product of a polynomial and an exponential function. When the polynomial is linear, the approximation has only one parameter p and the roex function of the variable g reduces to (1 + pg) e^{-pg}. To apply this


Fig. 3.7 Examples of roex filters for three different values of p. The corresponding equivalent rectangular filters are presented by the thin dashed lines. (Matlab)

formula to an auditory filter, g is defined as |f - f_c|/f_c, so the normalized deviation from the centre frequency f_c of the filter. In this way, one ends up with a frequency characteristic |H(f)|^2 of the auditory filter:

$$|H(f)|^2 = (1 + pg)\,e^{-pg}, \qquad g = |f - f_c|/f_c \qquad (3.3)$$

Examples of symmetrical rounded-exponential functions for three different values of p are presented in Fig. 3.7. The larger the value of p, the steeper the slope of the roex function. A simple calculation shows that the area under a roex function, in units of g, is 4/p, so that the bandwidth of the equivalent rectangular filter is also 4/p. The three equivalent rectangular filters are also presented in Fig. 3.7 by thin dashed lines. The auditory filter gets more and more asymmetrical at higher intensities. Hence, the estimated value of p not only depends on the frequency and the intensity of the stimulus, but also on whether it is applied to the left or to the right skirt of the filter. Figure 3.8 shows an example of an asymmetrical roex function. The parameter of the left skirt p_l is 18, while the parameter of the right skirt p_u is 25. Since the area under an asymmetrical roex function is 2/p_l + 2/p_u, this is also the ERB of such an asymmetrical filter. So, it appears that the auditory filter expresses itself in the masking properties of the auditory system. Actually, masking is used to measure the characteristics of the auditory filter. Some of these methods have now been sketched. For a very elaborate discussion of masking and the auditory filter, the reader is referred to Oxenham and Wojtczak [40].


Fig. 3.8 Example of an asymmetrical roex function. The corresponding equivalent rectangular filter is presented by the thin dashed lines. The different values of p on the left and the right skirts are indicated. (Matlab)
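To make the roex description concrete, the following Matlab sketch evaluates Eq. (3.3) for a symmetric and an asymmetric filter, similar to those of Figs. 3.7 and 3.8, and checks the equivalent rectangular bandwidth numerically. It is an illustration only; the value of p for the symmetric filter is an assumption, while p_l = 18 and p_u = 25 follow Fig. 3.8.

% Roex filters after Eq. (3.3) and a numerical check of their ERBs
% (illustrative sketch; p is assumed, pl and pu follow Fig. 3.8).
fc = 1000;                                    % centre frequency (Hz)
p  = 25;  pl = 18;  pu = 25;
f  = linspace(200, 2500, 10000);
g  = abs(f - fc) / fc;                        % normalized deviation from fc
Hsym = (1 + p*g)   .* exp(-p*g);              % symmetric roex
pa   = pl*(f < fc) + pu*(f >= fc);            % different p below and above fc
Hasy = (1 + pa.*g) .* exp(-pa.*g);            % asymmetric roex
% ERB in units of g: 4/p for the symmetric filter, 2/pl + 2/pu for the
% asymmetric one; verify the symmetric case by numerical integration.
gUp    = g(f >= fc);                          % upper skirt, g increasing from 0
erbNum = 2 * trapz(gUp, Hsym(f >= fc));
fprintf('numerical ERB = %.4f, analytical 4/p = %.4f\n', erbNum, 4/p);
plot(f, 10*log10(Hsym), f, 10*log10(Hasy), '--');
xlabel('Frequency (Hz)');  ylabel('Filter gain (dB)');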

3.4 The Gammatone Filter

In the previous section, the auditory filter has only been described in the frequency domain. In the time domain, a somewhat different approach is followed. In this domain, the impulse response of the auditory filter is most often described as a sinusoid with a gamma distribution as envelope:

$$h(t) = c\,t^{\gamma-1} e^{-t/\tau} \cos(2\pi f_c t + \varphi) \qquad (3.4)$$

The envelope, c t^{γ-1} e^{-t/τ}, is mathematically known as the gamma distribution. Hence, in the time domain, the impulse response of the auditory filter is often referred to as a gammatone. Examples of gammatones are presented in Fig. 3.9 for γ = 1 to 5. The dashed lines represent the positive and negative envelopes of the gamma distribution. In all five cases, τ is 2.5 ms and the frequency of the tones f_c is 0.5 kHz. Some straightforward calculations show that the gamma function t^{γ-1} e^{-t/τ} reaches its maximum at t = (γ - 1)τ; this maximum is 1 for γ = 1, and ((γ - 1)τ/e)^{γ-1} for γ > 1. The five gammatones presented in Fig. 3.9 are scaled in such a way that their maximum is 1. Figures 2.10 and 2.11 of Sect. 2.5.1.2 showed some mechanical impulse responses of the basilar membrane. There, the band-pass filter characteristics of these impulse responses have already been discussed. The shape of these impulse responses can be compared with the shape of the gammatones with γ = 1 to γ = 5 shown in Fig. 3.9. In general, the similarity between gammatones and actually measured impulse responses is best for γ = 3 or γ = 4.


Fig. 3.9 Five gammatones with different γ. The frequency f c of the tones is 0.5 kHz, and τ is 2.5 ms. The dashed lines represent the positive and negative envelopes of the gammatones. (Matlab)
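A minimal Matlab sketch, given here for illustration and not as one of the book's demo scripts, generates the five gammatones of Eq. (3.4) with the parameter values of Fig. 3.9 and scales each to a maximum of 1 using the peak value of the gamma distribution derived above:

% Gammatones of Eq. (3.4) for gamma = 1..5, fc = 500 Hz, tau = 2.5 ms
% (illustrative sketch; the phase phi is set to 0).
fs  = 44100;  tau = 2.5e-3;  fc = 500;
t   = (0:round(0.025*fs)-1) / fs;            % 25 ms of time
for gamma = 1:5
    env = t.^(gamma-1) .* exp(-t/tau);       % gamma-distribution envelope
    if gamma == 1
        peak = 1;                            % maximum of the envelope
    else
        peak = ((gamma-1)*tau/exp(1))^(gamma-1);
    end
    h = (env/peak) .* cos(2*pi*fc*t);        % gammatone, scaled to maximum 1
    subplot(5, 1, gamma);  plot(t*1000, h);
    ylabel(sprintf('\\gamma = %d', gamma));
end
xlabel('Time (ms)');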

In many applications, the first stage of auditory processing is simulated by a scheme such as presented in Fig. 2.24. The auditory filters presented in the second column of Fig. 2.24 are often implemented as gammatone filters. In that case, one speaks of a gammatone filterbank. Ideally, such filterbanks have the same temporal and spectral resolution as the human auditory filters and, as such, may well describe the limitations and capabilities of our hearing system as far as they depend on the temporal or spectral resolution of our auditory system. Many researchers have used a gammatone filterbank as a front end for speech and music processing. In this book, a computational model of pitch perception will be presented based on such a filterbank as described by Slaney [53]. The logical question one may now ask is: What is the relation between the gammatones presented here and the roex filters presented above? Or, can the amplitude of the frequency response of a gammatone filter indeed be described as a roex filter?


The exact answer is no, but with a proper choice of parameters one can be a good approximation of the other. The relation between the roex description and the gammatone description of the auditory filter is discussed by Patterson et al. [46].
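As an indication of how such a front end can be set up, the following Matlab sketch implements a small gammatone filterbank by direct convolution. It is a simplified illustration, not Slaney's [53] implementation: the centre frequencies are chosen freely for the example, a fourth-order gammatone is used for every channel, and its bandwidth is tied to ERB_N of Eq. (3.2) by the commonly used factor 1.019, which is an assumption here.

% Sketch of a gammatone-filterbank front end (illustration only).
fs    = 16000;  gamma = 4;
fcs   = [100 200 400 700 1200 2000 3300 5400];   % illustrative centre frequencies (Hz)
t     = (0:round(0.064*fs)-1) / fs;              % 64-ms impulse responses
x     = randn(1, fs/2);                          % half a second of test input
y     = zeros(numel(fcs), numel(x));             % one row per channel
for k = 1:numel(fcs)
    erb = 24.7 * (0.00437 * fcs(k) + 1);         % ERB_N, Eq. (3.2)
    b   = 1.019 * erb;                           % bandwidth parameter (assumption)
    h   = t.^(gamma-1) .* exp(-2*pi*b*t) .* cos(2*pi*fcs(k)*t);
    h   = h / sum(abs(h));                       % crude gain normalization
    yk  = conv(x, h);
    y(k, :) = yk(1:numel(x));                    % band-limited channel output
end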

3.5 The Compressive Gammachirp

In the simulations used in this book, only the gammatone description of the auditory filter presented in the previous section will be applied. This description, however, is just a first-order, linear approximation of the impulse response of the auditory filter. It is a simplification in several respects. The cochlear amplifier operates as a kind of compressive non-linearity that extends the range of intensities contributing to human hearing, especially towards low intensities. Another aspect is that the frequency of the impulse response of the auditory filter is not exactly constant but increases or decreases somewhat in its course [2, 51]. These findings have resulted in a more precise description encompassing various non-linearities of the auditory filter and the replacement of the gammatone with the gammachirp [19]. The sinusoid in the gammachirp is not only modulated in amplitude by a gamma distribution but also modulated in frequency:

$$h(t) = a\,t^{\gamma-1} e^{-t/\tau} \cos\bigl(2\pi f_c t + c \ln(t) + \varphi\bigr) \qquad (3.5)$$

In this equation, the parameters c and τ vary with the intensity of the stimulus in such a way that the resulting filter shape changes in a way similar to that of a real auditory filter. When Φ(t) is the argument of a sinusoid, its instantaneous frequency can be calculated according to Eq. 1.7, I(t) = Φ'(t)/(2π). For the compressive gammachirp, for which Φ(t) is 2π f_c t + c ln(t) + ϕ, this yields an instantaneous frequency I(t) of f_c + c/(2πt). In general, the constant c is positive for relatively high characteristic frequencies, higher than about 1500 Hz, resulting in an upward frequency glide, while for lower characteristic frequencies, lower than about 750 Hz, c is generally negative, resulting in a downward glide [57]. A graphical representation in which the gammachirp can be compared with the gammatone is presented in Fig. 3.10. In this example, f_c is 500 Hz. The periods of the gammachirp are shorter than those of the gammatone, but the difference diminishes in the course of the chirp. In the course of time, the precise definitions of the gammachirp have been adapted to newer insights [17, 18, 44]. This resulted in the dynamic compressive gammachirp auditory filterbank. A concise summary of this dynamic compressive gammachirp auditory filterbank and its applications in speech research is presented by Irino and Patterson [20]. As to the frequency characteristic of the gammachirp, Patterson, Unoki, and Irino [45] found that the single rounded-exponential or roex filter did not provide a sufficiently accurate description of the frequency characteristic of the compressive gammachirp. Instead, they propose a cascade of two roex filters, the double-roex filter. The first roex filter represents the passive response of the basilar membrane, the


Fig. 3.10 The gammachirp, thick line, and the gammatone, thin dotted line. For both signals, γ is 4. The frequency f c of the tones is 0.5 kHz, and τ is 2.5 ms. The constant c of the gammachirp is 2. The dashed lines represent the positive and negative envelopes of the gammatones. (Matlab)

second the cochlear amplifier. The first filter is linear and its output controls the gain of the second, non-linear narrow-band filter. Indeed, the gain of the second filter is inversely proportional to the output of the first filter, so that a smaller output of the first filter results in a higher gain of the second filter. All this operates as a compressive non-linearity, just like the cochlear amplifier. In addition to defining the double-roex filter, Unoki et al. [60] present a time-domain implementation of the two filters. Moore [32, pp. 23–24], however, states that "we have not yet found a time-domain filterbank model that is sufficiently well behaved and for which the parameters can be adjusted to give a good fit to the equal-loudness contours in ISO 226 (2003)." This statement also applies to other descriptions of the auditory filter with frequency glides as impulse responses [14, 57, 68]. Furthermore, any model of the auditory filter will have inherent limitations that cannot be lifted in a simple way. Moore [32] mentions three factors that are hardly, if at all, implementable in a general model of the auditory filter. First, cognitive factors of the participants can influence the results of the experiments on which the parameter settings of the auditory-filter models are based. Second, there appear to be significant individual differences between participants, so that it is not straightforward to develop a general model of the auditory filter. Finally, there are different ways to measure thresholds, which can also result in considerable differences in the parameter settings of the resulting model.
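To conclude this section, the following Matlab sketch compares a gammachirp generated directly from Eq. (3.5) with a gammatone sharing the same envelope, and computes the instantaneous frequency f_c + c/(2πt). It is a simple linear illustration with the parameter values of Fig. 3.10, not an implementation of the dynamic compressive gammachirp filterbank.

% Gammachirp of Eq. (3.5) versus a gammatone with the same envelope
% (illustrative sketch; gamma = 4, fc = 500 Hz, tau = 2.5 ms, c = 2, phi = 0).
fs  = 44100;  tau = 2.5e-3;  fc = 500;  gamma = 4;  c = 2;
t   = (1:round(0.025*fs)) / fs;              % start at 1/fs to avoid log(0)
env = t.^(gamma-1) .* exp(-t/tau);
env = env / max(env);                        % scale the envelope to a maximum of 1
gtone  = env .* cos(2*pi*fc*t);              % gammatone
gchirp = env .* cos(2*pi*fc*t + c*log(t));   % gammachirp, Eq. (3.5)
finst  = fc + c ./ (2*pi*t);                 % instantaneous frequency (Hz)
plot(t*1000, gtone, ':', t*1000, gchirp, '-');
xlabel('Time (ms)');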

3.6 Summary Remarks

A relatively simple description of the impulse response of the auditory filter has been presented. This impulse response has the shape of a gammatone, a sinusoid with a gamma distribution as envelope. The frequency pass characteristic of the gammatone can, for the purposes of this book, be approximated by a rounded-exponential (roex) function. This description of the roex filter will be applied in discussing a computational model of human loudness perception later in this chapter. A more advanced model of the auditory filter is presented by Chen et al. [3], which will briefly be discussed in Chap. 7. The gammatone description will be applied in


discussing a computational model of human pitch perception presented in Chap. 8. A more comprehensive review of such models of the peripheral auditory system is presented by Meddis and Lopez-Poveda [27] and by Lyon [25]. The values of the parameters of these models are mostly set based on the results of auditory experiments. Another approach, also briefly discussed in Chap. 7, is followed by Pieper et al. [48]. They developed a transmission-line model of the cochlea as described by, e.g., Duifhuis [4], Verhulst, Altoè, and Vasilkov [61], and Verhulst, Dau, and Shera [62]. It will be clear that the development of models of the auditory filter is progressing rapidly. A comparison of seven human auditory-filter models is presented by Saremi et al. [52]. The auditory filter expresses itself in many different aspects of hearing, e.g., in masking, in loudness perception, in pitch perception, and in various other auditory phenomena that will be described. The auditory filter is a band-pass filter that forms the basis of the time and frequency resolution of our hearing system. Moore [31] reviews the relevance of the spectral and temporal properties of the auditory system for the processing of speech sounds. Räsänen and Laine [50] argue that the spectrotemporal properties of our hearing system are specifically adapted to the processing of speechlike sounds. So, the significance of the auditory filter can hardly be overestimated.

3.7 Auditory Frequency Scales

In the previous sections, e.g., in Table 3.1, it appeared that, on a linear frequency scale, the auditory-filter bandwidth increases with frequency. On the other hand, when one looks at the physiological tuning curves of Fig. 2.20 or at the psychophysical tuning curves of Fig. 3.3, one sees that, on a logarithmic frequency scale, the auditory-filter bandwidth is relatively constant for frequencies higher than some 500 to 1000 Hz, but is larger for lower frequencies. Hence, when expressed in octaves, the filter bandwidth is relatively constant for higher frequencies but is larger for lower frequencies. The goal is now to construct a frequency scale on which the auditory-filter bandwidth is the same everywhere. This means that, starting from a linear frequency scale, the scale must be more compressed the higher the frequency on this scale. Starting from the logarithmic frequency scale, on the other hand, this means that the scale must be compressed for frequencies lower than about 500 to 1000 Hz. More specifically, a frequency scale must be constructed on which the bandwidth of the auditory filters is 1 everywhere. Three auditory frequency scales based on perceptual experiments will be discussed. The first is the mel scale, which was introduced in the 1930s. The second scale, the Bark scale, was developed in the 1950s and 1960s. The ERB_N-number or Cam scale was developed two decades later. In discussing this last scale, the question will be answered of how a frequency scale on which the bandwidth of the auditory filters is 1 everywhere can be constructed in a systematic and quantitative way. Greenwood [11] showed that this


scale best represents distances along the basilar membrane and, as such, is assumed to best represent the tonotopic scale as found in the auditory nervous system up to at least the level of the cortex [24].

3.7.1 The Mel Scale

The mel scale was first introduced in the 1930s at Harvard University in the US by Stevens, Volkmann, and Newman [56]. It did not originate from measurements of the auditory-filter bandwidth but was based on the results of experiments in which listeners were, e.g., asked to adjust the frequency of a variable test tone in such a way that its pitch was half the pitch of a fixed reference tone. In another experiment, listeners were asked to adjust the frequency of a variable tone in such a way that it perceptually divided the interval between two reference tones into two perceptually equal intervals. This was extended to experiments in which listeners adjusted the frequencies of three tones in such a way that they divided the interval between two reference tones into four perceptually equal quarter intervals [54–56]. In other words, the listeners were asked to perceptually bisect or quarter a pitch interval defined by two pure tones. The aim of these studies was to construct a frequency scale that represented pitch height. One may expect that such tasks induce listeners to divide an interval into musically equal intervals, which would amount to bisection or quartering of the interval on a logarithmic frequency scale. Curiously, although the reference tones were steady pure tones, the frequency scale derived from these experiments deviated from the logarithmic scale of music. There were some complications: For instance, there were systematic differences between quarterings in which the two reference tones were played in ascending order and quarterings in which these tones were played in descending order. By taking the average of these ascending and descending quarterings as the final result, however, a perceptual frequency scale was constructed on which the intervals adjusted as equal by the listeners corresponded to equal intervals on that scale. The unit of this scale was the mel, chosen in such a way that 1000 Hz was given the value of 1000 mel. Stevens and Volkmann [55] presented this scale as a table and as a function displayed in a graph connecting the points presented in the table. Based on this, Pedersen [47] calculated a fifth-order polynomial that best fitted the data, and published a very detailed table with values derived from this analytic expression:

$$\begin{aligned} f = {} & 0.146750532 \cdot 10^{-12}\,\mathrm{mel}^5 - 0.795481794 \cdot 10^{-9}\,\mathrm{mel}^4 + 0.152864002 \cdot 10^{-5}\,\mathrm{mel}^3 \\ & - 0.687099785 \cdot 10^{-3}\,\mathrm{mel}^2 + 0.805045046\,\mathrm{mel} + 2.14597315\ \mathrm{Hz} \end{aligned} \qquad (3.6)$$

A more concise analytic formula is presented by O’Shaughnessy [37]:

$$\mathrm{Mel}(f) = 2595 \log_{10}(1 + f/700) \ \mathrm{mel}. \tag{3.7}$$
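As a quick check of Eq. 3.7, the conversion from hertz to mels can be evaluated directly. The following lines are only a small illustration in Matlab, the language of the sound demonstrations accompanying this book; they are not taken from the book's own scripts.

% Hz-to-mel conversion according to O'Shaughnessy's formula (Eq. 3.7)
hz2mel = @(f) 2595 * log10(1 + f/700);
hz2mel(1000)              % approximately 1000 mel, as required by the definition of the scale
hz2mel([500 2000 8000])   % the formula works element-wise for vectors of frequencies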

Although not based on measurements of auditory-filter bandwidth, Stevens, Volkmann, and Newman [56] and Stevens and Volkmann [55] already mentioned a possible mapping between the mel scale and the location on the basilar membrane. They published measurements of the responses of guinea-pig cochleas to pure tones that showed a close mapping with the mel scale. In this way, a link was established between the mel scale and the basilar membrane as the origin of the tonotopic array. This was confirmed by Fletcher [6].

3.7.2 The Bark Scale

The Bark scale, already introduced in Table 3.1, was developed in the 1950s and 1960s in Munich [69, p. 557]. It is based on measurements of the critical bandwidth. The frequency points indicated in the second column of Table 3.1 are chosen in such a way that they are exactly 1 Bark apart, so that they correspond to the integer Bark values shown in the first column of Table 3.1. The relation between the mel scale and the Bark scale was indicated by Zwicker, Flottorp, and Stevens [71]. They concluded that, within statistical limits, they represented the same scale: “The average width of the critical band is about 137 mels. It varies from about 100 mels at low frequencies to about 180 mels at high frequencies. In view of the difficulty of determining the pitch scale with great precision, this degree of agreement between critical bands and intervals of subjective pitch makes it reasonable to entertain the hypothesis that the two may be closely related” (p. 557). This suggests that the mel scale and the Bark scale can be identified when 1 Bark is equated with 137 mel. Later, however, Zwicker [69] reported a correspondence of 100 mels to every Bark, which is now the accepted relation between the mel and the Bark scale. Various analytical expressions have been presented for the Bark scale. Zwicker and Terhardt [73] presented the following formula for f in kilohertz:

$$z(f) = 13 \arctan(0.76 f) + 3.5 \arctan\!\left(\left(\frac{f}{7.5}\right)^{2}\right) \ \mathrm{Bark}. \tag{3.8}$$

Traunmüller [59] presented the following equations for f in hertz:

$$z(f) = \frac{26.81 f}{1960 + f} - 0.53 \ \mathrm{Bark}, \qquad f(z) = \frac{1960\,(z + 0.53)}{26.28 - z} \ \mathrm{Hz}. \tag{3.9}$$

Finally, Hermansky [15] proposed, also for f in hertz,

$$z(f) = 6 \ln\!\left(\frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^{2} + 1}\right) \mathrm{Bark}. \tag{3.10}$$
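The three analytical approximations of the Bark scale are easily compared numerically. The sketch below is only an illustration of Eqs. 3.8–3.10, with frequencies given in hertz and converted to kilohertz where Eq. 3.8 requires it.

% Three approximations of the Bark scale (Eqs. 3.8-3.10), all for f in Hz
bark_zt = @(f) 13*atan(0.76*f/1000) + 3.5*atan((f/7500).^2);  % Eq. 3.8, f converted to kHz
bark_tr = @(f) 26.81*f./(1960 + f) - 0.53;                    % Eq. 3.9
bark_he = @(f) 6*log(f/600 + sqrt((f/600).^2 + 1));           % Eq. 3.10
f = [100 250 500 1000 2000 4000 8000];
[bark_zt(f); bark_tr(f); bark_he(f)]   % similar values, but differences can exceed 1 Bark at high frequencies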

The Bark scale is still widely used in computational models for various auditory phenomena, e.g., in standard calculations of loudness (ISO 532B) [70, 74]. In the same way as suggested by Stevens and Volkmann [55], Zwicker, Flottorp, and Stevens [71] argued that equal distances between points on the Bark scale would represent equal distances on the basilar membrane, a relation further discussed — and disputed — by Greenwood [11]. Greenwood [11] argued that the next scale to be discussed, the ERBN-number or Cam scale, is a more accurate representation of distances along the basilar membrane.

3.7.3 The ERBN-Number Scale or Cam Scale

The mel scale and the Bark scale were presented as frequency scales on which equal distances correspond to perceptually equal distances. The question is now how such a relation can be found in a quantitatively accurate way. This will be done for the ERBs at a moderate sound level of a young adult with normal hearing, hence the subscript N. The result must be such that the auditory-filter bandwidth on that scale is exactly one equivalent rectangular bandwidth (ERBN) everywhere. So, equal distances on this scale should correspond to equal distances in terms of the number of ERBs covered by these distances. In summary, the result must be such that, on this new scale of the variable z, a change in frequency f per unit z is exactly 1 ERBN [13]. Assuming that the new scale variable z is a continuously differentiable function of f, we have

$$dz = (\Delta z / \Delta f)\, df. \tag{3.11}$$

In this equation, 1/(Δz/Δf) = Δf/Δz represents the change in f per unit z which, as mentioned, is by definition exactly 1 ERBN at that f. Or,

$$dz = \frac{1}{\mathrm{ERB}_N(f)}\, df. \tag{3.12}$$

Equation 3.12 expresses that the frequency scale in hertz should be compressed where the auditory filters are relatively wide and stretched where they are relatively narrow. Assuming that this scale starts at 0 at 0 Hz, so that each position on this new scale can be presented as a number of ERBNs separating it from the origin, this yields

$$z(f) = \int_0^f dz = \int_0^f \frac{1}{\mathrm{ERB}_N(f')}\, df'. \tag{3.13}$$

Equation 3.13 gives the frequency scale on which changes in z are equal everywhere when expressed in ERBN. So, the rate at which z changes, expressed in ERBN, is equal everywhere on this scale and, hence, its integration yields the separation from the origin in number of ERBs, which is why it is referred to as the ERBN-number scale. Hartmann [13] proposed to indicate the number of ERBN by Cam after the
English city of Cambridge where this scale was developed [8]. This suggestion will be followed in this book, and the scale will be referred to as the Cam scale. Equation 3.13 makes it possible to derive an analytical expression for the Cam scale if there is an integrable analytical expression in hertz for the inverse of the equivalent rectangular bandwidth ERBN(f). Such an expression is presented by Glasberg and Moore [8]:

$$\mathrm{ERB}_N(f) = 24.7\,(0.00437 f + 1) \ \mathrm{Hz}. \tag{3.14}$$

Applying this in Eq. 3.13 results in

$$z(f) = 21.4 \log_{10}(0.00437 f + 1) \ \mathrm{Cam}, \tag{3.15}$$

which gives the required expression for the Cam as a function of the centre frequency of the auditory filter. The inverse equation is

$$f(z) = 229\left(10^{z/21.4} - 1\right) \ \mathrm{Hz}. \tag{3.16}$$
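For reference, Eqs. 3.14–3.16 can be put into three small Matlab functions; this is only a sketch for exploring the scale, not the implementation used for the figures in this book.

% ERB_N bandwidth (Eq. 3.14) and the Hz <-> Cam conversions (Eqs. 3.15 and 3.16)
erbN   = @(f) 24.7 * (0.00437*f + 1);        % equivalent rectangular bandwidth in Hz
hz2cam = @(f) 21.4 * log10(0.00437*f + 1);   % frequency in Hz to ERB_N number (Cam)
cam2hz = @(z) (10.^(z/21.4) - 1) / 0.00437;  % inverse mapping, approximately 229*(10^(z/21.4) - 1)
hz2cam(1500)    % about 18.8 Cam, the value quoted for the 1.5-kHz tone of Fig. 3.17
hz2cam(20000)   % slightly over 40 Cam: the audible range covers about 40 ERBs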

This equation represents the sought-after perceptual frequency scale, based on measurements of the bandwidths of the human auditory filters, on which, in principle, equal intervals represent perceptually equal distances. The Cam scale is based on measurements of the auditory-filter bandwidth by means of the notched-noise method. The Bark scale was derived from earlier measurements of the critical bandwidth. It was argued that the bandwidth measurements obtained with the notched-noise method were more accurate because, e.g., the method prevents off-frequency listening. Taking some other experimental precautions into account, especially in measuring the auditory-filter bandwidth at lower frequencies [23, 36], has also resulted in bandwidths that are in general narrower than the critical bandwidths found previously. The Bark values presented in Table 3.1 show a critical bandwidth that, up to 500 Hz, is more or less constant at about 100 Hz. The ERBs as measured by the more cautious notched-noise method give filter bandwidths that start increasing in hertz from the lowest frequencies on. Hence, the Cam scale is based on narrower filters than the critical bands of the Bark scale. The result is that, when the integration of Eq. 3.13 is carried out, the number of ERBs covering the frequency range of human hearing is higher than the number of Barks. All this is illustrated in Fig. 3.11, which presents the Bark scale and the Cam scale as a function of frequency in hertz in one figure. It shows that human hearing is covered by about 24 critical bands when expressed in Barks, while there are about 40 critical bands when expressed in ERBs. Just like the mel scale and the equivalent Bark scale, the Cam scale has been identified with distances on the basilar membrane [9, 10]. Greenwood [10] proposed a general equation for the relation between the centre frequency of a location on the basilar membrane and its distance x from the oval window:

$$f = A\left(10^{a x} - 1\right) \ \mathrm{Hz}. \tag{3.17}$$

Fig. 3.11 Barks and Cams as a function of frequency. (Matlab)

This equation appears to hold quite well, not only for human listeners, but also for seven other animal species, six mammals: the cow, the elephant, the cat, the guinea pig, the rat, and the mouse; and one bird: the chicken [10]. For humans, the values of the parameters A and a are 229 and 1/21.4, respectively, which yields Eq. 3.16 shown above, as presented by Glasberg and Moore [8]. One may ask whether the Cam scale or the Bark scale better represents distances on the basilar membrane. Both the Bark and the Cam scale are presented in Fig. 3.11. The formula used for the Bark scale is from Traunmüller [59]. As mentioned, there are some differences between the two scales. First, the auditory filters are generally narrower in the Cam scale than in the Bark scale. As a consequence, the Bark number is always smaller than the Cam number. Moreover, as can be seen in Table 3.1, for frequencies lower than 500 Hz, the filters of the Bark scale are more or less equally wide, 100 Hz, when expressed in hertz, while the filters of the Cam scale get wider in hertz with higher centre frequency. Finally, at frequencies higher than 2 to 5 kHz, the width of the Cam filters remains more or less constant when expressed on a logarithmic frequency scale, whereas it decreases somewhat for the Bark filters. This is discussed by Greenwood [11], who showed that the Cam scale better corresponds to distances on the basilar membrane than the Bark scale. Readers interested in a critical comparison of the various equations for concepts such as critical bandwidth, auditory-filter bandwidth, equivalent rectangular bandwidth, and Greenwood’s frequency-position functions are referred to a review by Völk [66]. The computational models that will be presented in this book for various auditory attributes, e.g., loudness, pitch, and brightness, will mostly use the Cam scale. For historical reasons, the Bark scale will be used in a model of roughness perception presented in Sect. 6.2.
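A comparison along the lines of Fig. 3.11 can be sketched in a few lines of Matlab, here with Traunmüller's formula for the Bark scale and Eq. 3.15 for the Cam scale; the exact layout of the published figure may differ.

% Barks and Cams as a function of frequency (cf. Fig. 3.11)
f    = 20:10:16000;
bark = 26.81*f./(1960 + f) - 0.53;    % Traunmüller's formula, Eq. 3.9
cam  = 21.4*log10(0.00437*f + 1);     % Eq. 3.15
plot(f, bark, f, cam);
xlabel('Frequency (Hz)'); ylabel('Scale value');
legend('Bark scale', 'Cam scale', 'Location', 'northwest');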

It is important to realize that these perceptually defined frequency scales, whether one uses the Bark scale or the Cam scale, are both essentially different from the musical frequency scale presented in Sect. 1.4. In music, two intervals are equal if the ratios between the frequencies of the two tones that make up the intervals are equal; the musical scale is thus a logarithmic frequency scale. The Cam scale comes close to a logarithmic frequency scale for frequencies higher than 500 Hz, but strongly deviates from it for lower frequencies. This applies even more so to the Bark scale. One may wonder why the human auditory system uses two different frequency scales. In Sect. 10.12.4, it will be argued that the logarithmic frequency scale, a ratio scale, only plays a role when it comes to musical harmonic intervals between simultaneous or consecutive notes. This will be discussed in Sect. 10.12.5. Outside the context of music, the perceptually defined tonotopic scale is more appropriate. All this indicates that the mechanical filters that constitute the basilar membrane are at the root of the perceptual processes discussed in this book.

3.8 The Excitation Pattern

The excitation pattern represents the power of the output of the auditory filters as a function of their centre frequency. The excitation pattern was introduced into hearing research as a first stage in a model of loudness perception by Fletcher and Munson [7] and by Zwicker and Scharf [72]. These authors associated the excitation pattern induced by a sound with the masking pattern of that sound. They argued that the excitation induced by a masker raises the threshold of other sounds by an amount proportional to the power of the masker. In this section, the first stages of a model for loudness perception will be presented as described by Moore, Glasberg, and Baer [35]. This model is based on the tonotopic Cam scale presented above in Sect. 3.7.3. Loudness perception will be discussed in detail in Chap. 7, but it will already be referred to at various instances here. And, actually, at the end of this section, a sketch will be presented of how the loudness of a sound can be estimated from its excitation pattern.

In the previous section, a frequency scale was calculated on which equal distances correspond to equal amounts of interaction between the frequency components of a sound. In the next step, the amount of excitation induced by a sound is calculated for every location on the tonotopic array. This calculation is based on the frequencies and the amplitudes of the components of the sound. For each of these components, the rounded-exponential description of the auditory filter will be used to calculate the excitation induced by this component. Adding the contributions of all these components will give the excitation pattern.

Before discussing how excitation patterns can be calculated, Fig. 3.12 presents the auditory-threshold curve of human hearing. It shows the level a pure tone must have in order to be correctly detected in 50% of all presentations. Auditory threshold curves can vary from person to person. Figure 3.12 presents a representative curve for a young adult without any hearing loss. One can see that human hearing starts

Fig. 3.12 The auditory-threshold curve. The intensity threshold of a pure tone is presented as a function of the frequency of that pure tone. Data derived from ISO226. (Matlab)

Fig. 3.13 Frequency transfer by the outer ear. Data derived from Moore, Glasberg, and Baer [35]. (Same as Fig. 2.2). (Matlab)

at about 20 Hz, where the threshold is relatively high. The threshold then decreases and has a minimum around 3000 Hz. This minimum and the secondary minimum around 15,000 Hz represent the resonances of the concha and the ear canal as shown in Fig. 3.13. In the course of this chapter, a description will be presented of how the shape of this threshold curve is determined by the various stages of acoustic and auditory processing in the outer ear, the middle ear, and the cochlea.

3.8.1 Transfer Through Outer and Middle Ear

Usually, a sound is produced at a certain location in the listener’s environment from where it propagates in all directions. The first sound from this acoustic event that
reaches the listener is the sound that travels directly to the listener, which is called the direct sound. There is not only this direct sound, but also the sound that travels indirectly to the listener via the ceiling, the floor, the walls, and all kinds of other objects in the room, as well as via the shoulders and other body parts of the listener. This indirect sound plays an important role in loudness perception, since it contributes to the loudness of the sound. It also plays a very important role in spatial hearing and in sound localization but, for the moment, this indirect sound will be ignored. The focus is on the direct sound as it arrives at the entrance of the outer ear, before it is filtered by the outer ear. This sound can be described as a set of components, each characterized by a frequency and an intensity. Mutual phase relations are ignored.

In calculating the excitation pattern, the first operation on the sound consists of simulating the filtering by the outer ear. The transfer characteristic of the outer ear, previously shown in Fig. 2.2 in Sect. 2.3, is again presented in Fig. 3.13. The clear peak at about 3 kHz shows the resonance of the concha and the ear canal. The irregular fluctuations above 6 to 8 kHz show some of the resonances and antiresonances of the ridges and cavities of the outer ear. Very precise loudness estimations require some adaptations, because these resonances and antiresonances differ from person to person. Moreover, the exact shape of these peaks and dips depends on the direction the sound comes from. Also, when the sound is listened to through headphones, adaptations are necessary. Finally, we hear with two ears, so the contributions of both ears have to be combined in order to achieve a correct and precise estimate of the perceived loudness of the auditory event. In most cases, however, e.g., when there is little power in the range above 6 kHz or when only relative loudness changes have to be estimated, an “average” transfer function as displayed in Fig. 3.13 will do. Anyway, after having done all this, one knows the intensity with which each frequency component arrives at the eardrum.

In the next stage of this loudness model, the filter characteristics of the middle ear are applied. The transfer function of the middle ear has already been shown in Sect. 2.4 in Fig. 2.4. It is shown again in Fig. 3.14. It has the characteristic of a wide band-pass filter with a peak at about 0.8 to 1 kHz and some secondary peaks at 3 to 5

Fig. 3.14 Frequency transfer by the middle ear. Data derived from Moore, Glasberg, and Baer [35]. (Same as Fig. 2.4) (Matlab)

and at 10 to 12 kHz; the dip between 2 and 3 kHz is more or less compensated by the peak in the transfer function of the outer ear shown in Fig. 3.13. Note the decreasing transfer for lower frequencies: The transfer at 50 Hz is about 15 dB lower than at 500 Hz. This may seem considerable but, when one looks at the threshold curve shown in Fig. 3.12, the threshold at 50 Hz is about 40 dB higher than at 500 Hz. So, there must be another factor that explains the loss of sensitivity for lower frequencies as shown in the threshold curve depicted in Fig. 3.12. This will be described in the next stage.

3.8.2 Introduction of Internal Noise

Up to now, the sound, after it arrived at our ears, has undergone two linear and passive filtering operations, one by the outer ear and one by the middle ear. This gives the intensities of all frequency components as they arrive at the oval window. Before calculating the actual excitation pattern, one must take into account that below 500 Hz the efficiency of the cochlear amplifier diminishes. As a consequence, a correction for frequencies lower than 500 Hz must be applied. In the model, this is not done by, e.g., subtracting a number of decibels from these components, but by introducing internal noise. This internal noise can be thought of as spontaneous activity in the absence of any other sound stimulation, starting with the spontaneous release of neurotransmitter by the inner hair cells into the synaptic cleft, resulting in spontaneous activity of the fibres in the auditory nerve. Another source of variability might be generated by the efferent system of the olivocochlear bundle [58] inducing spontaneous excitation or inhibition of the outer hair cells, as discussed above in Sect. 2.5.3.1. In fact, the concept of internal noise plays an important role in explaining detection at threshold level, which applies not only to the lower frequencies but to all frequencies. In order to deal with these phenomena, a minimum level of internal excitation is applied to all frequency components. This level is 3.6 dB for frequencies above 500 Hz but, below 500 Hz, the level gets higher the lower the frequency until, at 52 Hz, the internal noise induces an excitation of 26.2 dB. The graph representing this level of internal excitation as a function of frequency is shown in Fig. 3.15.

So, it appears that the transfer characteristic of the middle ear explains about 15 dB of the increase in threshold between 500 and 50 Hz. The increase in internal noise adds another 25 dB, resulting in a threshold that is about 40 dB higher at 50 Hz than at 500 Hz. And this is what was shown in the threshold curve of Fig. 3.12. It is concluded that the operation of the outer ear, the middle ear, and the efficiency of the cochlear amplifier to a large extent explain the form of the threshold curve. This threshold curve in fact gives the effective attenuation of the frequency components by the peripheral hearing system. Now, a sketch will be presented of how this can be used to calculate the excitation pattern induced by a sound. For the details of this procedure, the reader is referred to Moore, Glasberg, and Baer [35].

Fig. 3.15 Internal excitation, or the excitation at threshold, as a function of frequency. Above 500 Hz, it is constant at 3.61 dB. Adapted from Moore, Glasberg, and Baer [35]. (Matlab)

3.8.3 Calculation of the Excitation Pattern

In the first stage of the four-stage model just described, the intensity with which each frequency component of the sound arrives at the oval window is calculated. Next, for each frequency component, an amount of internal noise is added that is constant at 3.6 dB for frequencies higher than 500 Hz, but increases for frequencies lower than this. From these data, the level of excitation of the auditory filters induced by each of the frequency components of the stimulus can be calculated. This is illustrated in Fig. 3.16 for a single frequency component, a 1.5-kHz tone. The upper panel gives the transfer functions of seven roex filters centred at 1000, 1145, 1310, 1500, 1717, 1966, and 2250 Hz. These frequencies are equidistant on a logarithmic scale. The filters are calculated for an input level of 51 dB, the level where, on a linear frequency scale, the two skirts of the roex filters have about the same slope. The excitations of these seven filters by the 1500-Hz 51-dB tone are indicated by the small circles, labelled with the centre frequencies of the corresponding filters. Each of these values represents the excitation of one of the seven auditory filters by the 1.5-kHz tone. These same values are also indicated in the lower panel of Fig. 3.16 at the centre frequencies of the seven filters. Hence, the lower panel of Fig. 3.16 gives the excitation as a function of the centre frequency of the auditory filter. These excitations can be calculated for as many frequencies as necessary, yielding the complete excitation pattern of this 1.5-kHz component. This excitation pattern is given by the continuous line in the lower panel of Fig. 3.16. In this figure, the abscissa has a linear frequency scale. The asymmetric shape of the excitation pattern arises because, in hertz, the auditory filters get wider with increasing frequency.
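The following Matlab sketch illustrates the principle for the 1.5-kHz 51-dB tone, assuming a symmetric, level-independent roex(p) filter with p = 4 fc / ERBN(fc); the full model of Moore, Glasberg, and Baer [35] additionally includes the level-dependent asymmetry of the filters, the outer- and middle-ear transfer, and the internal noise, so the result is only an approximation of the lower panel of Fig. 3.16.

% Sketch of the excitation pattern of a 1.5-kHz 51-dB pure tone
erbN   = @(f) 24.7*(0.00437*f + 1);          % Eq. 3.14
hz2cam = @(f) 21.4*log10(0.00437*f + 1);     % Eq. 3.15
f0 = 1500; L0 = 51;                          % frequency (Hz) and level (dB) of the tone
fc = 100:10:8000;                            % centre frequencies of the auditory filters
g  = abs(fc - f0) ./ fc;                     % deviation of the tone from each centre frequency, normalized
p  = 4*fc ./ erbN(fc);                       % slope parameter of the roex(p) filter
W  = (1 + p.*g) .* exp(-p.*g);               % filter weight applied to the tone
E  = L0 + 10*log10(W);                       % excitation level in dB
plot(hz2cam(fc), E);
xlabel('Centre frequency (Cam)'); ylabel('Excitation (dB)');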

Fig. 3.16 Calculating the excitation pattern of a 1.5-kHz 51-dB pure tone. The upper panel shows the shapes of seven auditory filters. The excitations of these filters by the 1.5-kHz tone are indicated by the circles and the centre frequencies of the corresponding filters. The lower panel shows the complete excitation pattern of this tone. The excitations of the seven auditory filters shown in the upper panel are also indicated by the circles at the corresponding centre frequencies. (Matlab)

3.8.4 From Excitation on a Linear Hertz Scale to Excitation on a Cam Scale

In order to come to a representation that gives the contribution of every frequency component to the loudness of the sound, the linear frequency scale of the abscissa must be transformed to the tonotopic frequency scale, i.e., the Cam scale. This can be done by applying Eq. 3.15, z(f) = 21.4 log10(0.00437 f + 1), to the abscissa of the excitation pattern of Fig. 3.16. This gives the excitation pattern of Fig. 3.17. Figures 3.16 and 3.17 thus give the same information, the former on a linear scale, the latter on a Cam scale. The upper panel shows that, on this Cam scale, the shape of the auditory filters is the same for all displayed centre frequencies. In the discussion of the roex filter, it was mentioned that the auditory filters become more asymmetrical at higher intensities. Especially the lower-frequency skirt gets less steep at higher intensities. In order to show the effect of this, Fig. 3.18 presents the excitation patterns of a 1.5-kHz tone with intensities varied in steps of 10 dB from 20 to 90 dB. The abscissa in this figure is the Cam scale. Note the similarity of these figures with the masking patterns shown in Fig. 3.2. In both figures, the low-frequency skirts of the plots have more or less the same slopes, but the

Fig. 3.17 Same as Fig. 3.16 but now the abscissa is presented on a tonotopic frequency scale. Consequently, the shapes of the transfer functions of the seven auditory filters are about equal. The frequency of the pure tone, 1500 Hz, corresponds to 18.8 Cam. (Matlab)

Fig. 3.18 Excitation patterns of a 1500-Hz tone of intensities varied from 20 to 90 dB SPL in steps of 10 dB. The abscissa represents the tonotopic scale. (Matlab)

Fig. 3.19 Excitation patterns induced by the two lowest harmonics of 400 Hz, upper panel, and by the seven lowest harmonics of 400 Hz, lower panel. The intensity of the individual harmonics is 45 dB SPL. The excitation patterns of the tone complexes are presented as thick lines; those of the individual harmonics as thin lines. (Matlab)

higher-frequency skirts are much steeper for the lower-intensity tones. This naturally represents the upward spread of masking. In fact, excitation patterns are closely related to masking patterns. Indeed, the excitation at a certain location on the tonotopic array represents the power that drives the auditory filter at that location. The excitation by another sound must be larger than this in order to overcome this excitation by the masker and, thus, become perceptually significant. In this way, the excitation pattern of a sound is related to the increase in threshold induced by that sound as a function of frequency [7, 72].

For a sound consisting of one single frequency component, a pure tone with a specified intensity, it was shown how the excitation induced by this component can be calculated. In order to calculate the excitation pattern induced by a sound consisting of a number of frequency components, the excitation pattern of each frequency component is first calculated separately, after which the excitation patterns thus obtained are added. This is demonstrated in Fig. 3.19 for two harmonic tones with an F0 of 400 Hz, one consisting of the two lowest harmonics, the other of the seven lowest harmonics. The result shows the excitation induced by these tones over the tonotopic array. The thin lines are the excitation patterns of the separate harmonics; the thick line is that of the complex.
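Continuing the earlier sketch, the excitation pattern of a harmonic complex can be approximated by adding the component patterns; in this sketch the addition is carried out in the power domain, which is an assumption of this illustration, and the same simplified symmetric roex filter is used.

% Sketch of the excitation pattern of the two lowest 400-Hz harmonics, 45 dB SPL each
erbN   = @(f) 24.7*(0.00437*f + 1);
hz2cam = @(f) 21.4*log10(0.00437*f + 1);
fc = 100:10:6000;                            % centre frequencies of the auditory filters
E  = zeros(size(fc));                        % summed excitation in the power domain
for f0 = 400*(1:2)                           % harmonic frequencies: 400 and 800 Hz
    g = abs(fc - f0) ./ fc;
    p = 4*fc ./ erbN(fc);
    E = E + 10.^((45 + 10*log10((1 + p.*g).*exp(-p.*g)))/10);
end
plot(hz2cam(fc), 10*log10(E));               % compare with the upper panel of Fig. 3.19
xlabel('Centre frequency (Cam)'); ylabel('Excitation (dB)');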

3.8.5 From Excitation to Specific Loudness

The procedure sketched in this chapter is based on the model of loudness perception described in Moore, Glasberg, and Baer [35]. This has resulted in the excitation patterns demonstrated in the previous sections, which represent the effective power of the sound stimulus as a function of the location on the tonotopic array. It appears that there is no linear relation between this excitation at a location of the tonotopic array and the contribution to loudness perception. The next stage in this model, therefore, consists of the transformation that has to be carried out to calculate the contribution of each location on the tonotopic array to the perception of loudness, called the specific loudness. The loudness of the sound can then simply be found by integrating the specific-loudness distribution over the whole tonotopic array. The loudness of a sound will be expressed in sone, a perceptual unit chosen in such a way that equal differences in sone are perceived as equal differences in loudness. The transformation from excitation to specific loudness will only be described in a summary way because the details are very laborious. The interested reader is referred to Moore, Glasberg, and Baer [35]. The equations represent the compressive nonlinearity of the cochlear amplifier. The excitation is represented as $E_s(z)$, in which z is the location on the tonotopic array expressed in Cam and $E_s$ is the excitation, not expressed in dB but relative to the reference excitation $E_0$ of a 1-kHz pure tone of 0 dB: if the excitation level at z is L(z) dB, then $E_s(z) = E_0 \cdot 10^{L(z)/10}$. As a first approximation, the contribution of $E_s(z)$ to the loudness is presented as a power-law equation:

$$N'(z) = C\, E_s(z)^{\alpha}, \tag{3.18}$$

in which N'(z) is the specific loudness representing the contribution to the loudness of the sound as a function of its location on the tonotopic array. Its unit is sone/ERB. The constant C is chosen in such a way that the loudness of a 1000-Hz tone of 40 dB SPL is exactly 1 sone, the perceptual unit for loudness, as just said. The compression factor, the exponent α, is set to 0.2. This equation appears to operate quite well for frequencies higher than 500 Hz and intensity levels between 30 and 90 dB. For lower intensities and for lower frequencies, various adaptations have to be made. This results in an equation of the form:

$$N'(z) = C\left[\left(G(z)\, E_s(z) + A\right)^{\alpha} - A^{\alpha}\right]. \tag{3.19}$$

The constant A is introduced in order to deal with threshold behaviour, and represents the excitation produced by internal noise; G deals with the lower efficiency of the cochlear amplifier at frequencies lower than 500 Hz. Using these equations, the specific loudness can be calculated for all frequencies along the tonotopic array. It represents, for each frequency, the contribution of that frequency to the total loudness of the sound, and its unit is sone/ERB. For the 1500-Hz pure tones of varying intensities of which the excitation patterns are

Fig. 3.20 Specific-loudness distributions of the same sounds as in Fig. 3.18: pure tones of 1500 Hz with intensities varying in steps of 10 dB from 20 to 90 dB. The loudness of the tones is indicated on the left from low to high. (Matlab)

shown in Fig. 3.18, the result is shown in Fig. 3.20. Similarly, the specific-loudness distributions of the two 400-Hz harmonic complexes of Fig. 3.19 are presented in Fig. 3.21. This stage of the model, in which excitation is converted into specific loudness, may seem artificial, which, in fact, it is. In more recent models of the auditory filter presented by Chen et al. [3] and Pieper et al. [48], the output of the auditory filters directly represents the contribution to loudness as a function of the location on the tonotopic array. Hence, the transformation from excitation pattern to specific loudness can be omitted.
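To get a feel for the compression expressed by Eq. 3.18 above, the ratio of specific loudness for successive 10-dB steps of excitation can be computed; the constant C and the reference excitation E0 drop out of this ratio, so they are left symbolic in this small sketch.

% Compression of Eq. 3.18: each 10-dB step in excitation multiplies N' by 10^(alpha)
alpha = 0.2;                      % compressive exponent
L  = 30:10:90;                    % excitation levels in dB relative to E0
Ns = 10.^(alpha*L/10);            % specific loudness in units of C*E0^alpha
Ns(2:end) ./ Ns(1:end-1)          % constant ratio of about 1.58 per 10 dB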

3.8.6 Calculation of Loudness

In anticipation of Chap. 7, it can already be given away that the loudness of a sound can be found by integrating its specific-loudness distribution over the tonotopic array. This is called spectral loudness summation or, in short, loudness summation. The result is expressed in sone, the perceptual unit of loudness. Some results of this integration are presented to the left of the specific-loudness distributions in Fig. 3.20 and in the upper right of the panels in Fig. 3.21. The perception of loudness will be discussed extensively in Chap. 7. There it will be shown that many phenomena in loudness perception can be explained by this model. Moreover, in Sect. 6.4, the specific-loudness distribution of a sound will be used to estimate its brightness.

Finally, for each frequency of the tonotopic array, the excitation pattern presents the power of the outputs of the auditory filters as a function of their position along

Fig. 3.21 Specific-loudness distributions of the same sounds as shown in Fig. 3.19: the two lowest harmonics of 400 Hz and the seven lowest harmonics of 400 Hz. The intensity of the individual harmonics is 45 dB SPL. The loudness estimates of the tone complexes as a whole are indicated in the upper right of the panels. (Matlab)

the tonotopic array. As every location along this array specifies a frequency, this is often referred to as the place code of frequency. In the next section, the temporal code will be discussed.

3.9 Temporal Structure

Up to now in this chapter, the temporal structure of the oscillations of the basilar membrane and of the series of action potentials in the auditory nerve has been ignored. The oscillations of the basilar membrane are in phase with the frequency components of the sound that induce these oscillations. Moreover, it was shown in Chap. 2 that the series of action potentials of the auditory-nerve fibres are to a large extent phase-locked to the oscillations of the basilar membrane, up to at least 3 to 5 kHz. This will be referred to as the temporal fine structure (TFS) of the outputs of the auditory filters. Moore [30] gives a detailed account of the central role TFS plays—or may play—in sound processing in the auditory system, not only for normal
hearing, but also for impaired hearing and hearing at an older age. He shows that TFS plays a central role, e.g., in masking, in pitch perception, and in speech perception. For lower frequencies, at least up to about 5 kHz, the distribution of the intervals between successive action potentials has clear peaks at the period of the stimulating frequency component and at the multiples of this period. This has been illustrated in Fig. 2.23. The TFS thus contains information about the frequency of the component stimulating the auditory-nerve fibre.

In this book, it is generally assumed that phase lock, and with it the temporal coding of frequency, stops at about 5 kHz [12]. This, however, is a matter of debate. Moore and Ernst [33] presented results consistent with the idea that there is a transition from a temporal to a place mechanism at about 8 kHz, rather than at 4–5 kHz, as is commonly assumed. Overviews discussing various such models are presented by Micheyl, Xiao, and Oxenham [28] and Oxenham [38, 39]. They come to the conclusion that, in general, the results are consistent with the idea that frequency information based on the TFS of the spike trains in auditory-nerve fibres can be used up to about 5 kHz. The model for pitch perception that will be presented is based on this assumption. The matter, however, is not settled yet. A review summarizing the various viewpoints is presented by Verschooten et al. [63].

So, as to the frequency of the stimulus, there are two sources of information in the series of action potentials of an auditory-nerve fibre: One is the location on the basilar membrane where the fibre comes from, referred to as the place code discussed in the previous section; the other is the TFS, also indicated as the temporal code. But the spike trains from the auditory nerve contain more information than just the frequency of the sinusoidal stimulation. In Sect. 2.5.3.2, it was shown that, the higher the frequency of the stimulus, the larger the share of the DC component in the receptor potential. This could be described as a process of half-wave rectification and low-pass filtering in the inner hair cells. This DC component reflects the temporal envelope of the stimulus. As a consequence, the spike trains of the auditory-nerve fibres are also locked to the envelope of the output of the auditory filters. For frequencies higher than about 5000 Hz, when the nerve fibres no longer lock to the phase of the sinusoidal stimulus, they will only lock to its envelope. When the stimulus is periodic and contains harmonics of high rank, this means that the spike trains lock to the periodic envelope of the output of the auditory filters.

In the previous section, the first stages of a model of loudness perception have been used to derive the excitation pattern of an arbitrary stimulus specified by its amplitude spectrum. Next, the first stages of a model of pitch perception will be described to derive the temporal structure of auditory information in the peripheral tonotopic array. It is based on a gammatone filterbank as described in Sect. 3.4 of this chapter. Pitch perception in general will be discussed in Chap. 8.

3.9.1 The Autocorrelation Model

Various autocorrelation models of temporal processing of sound have been published. For an extensive review and discussion, the reader is referred to Lyon [25]. The model presented here largely follows the processing stages presented by Meddis and Hewitt [26], but is simplified and modified in various aspects. These modifications will be indicated. The stages of the model are shown in Fig. 3.22. The first few stages are the same as in the model of the peripheral auditory system in Fig. 2.24. The first stage represents the filtering by the outer and middle ear. Since pitch perception is not very sensitive to the relative intensity of the frequency components, a simple dBA filter is used, such as is used in standard measurements of sound pressure level, a subject to be discussed in more detail in Sect. 7.1.1. The second stage (BPF) represents the band-pass filtering by the auditory filters of the basilar membrane. The third stage (AGC and HWR) represents the automatic gain control exercised by the outer hair cells and the half-wave rectification in the generation of the receptor potential by the inner hair cells. The fourth stage (LPF) represents the low-pass filtering of the rectified receptor potential. The fifth stage (SG) represents the spike-generation mechanism in the synapse between the inner hair cells and the auditory-nerve fibres. The sixth stage (AC) represents the calculation of the interspike-interval distribution, which is implemented as an autocorrelation of the spike trains. The final stage (SAC)

Fig. 3.22 Schematic overview of the successive stages of the autocorrelation model of pitch perception. BPF = band-pass filtering, AGC = automatic gain control, HWR = half-wave rectification, LPF = low-pass filtering, SG = spike generation, AC = autocorrelation, SAC = summary autocorrelation. These stages are symbolically sketched by the icons in the rectangles. (Matlab)

represents a summation over the tonotopic array, resulting in the summary autocorrelation function, completed by a peak-detection process that yields the estimate of the pitch. Now each of these stages will be described in more detail.

3.9.2 dBA-Filtering

The first stage represents the filtering by the outer and the middle ear, and the introduction of internal noise as described in Sects. 3.8.1 and 3.8.2. In the implementation used here, this stage is approximated by a simple A-weighting filter as used in current dB meters. Its frequency characteristic is presented in Fig. 3.23. So, in the first stage of this model, the sound signal is processed in such a way that its output roughly approximates the intensity with which the frequency components excite the auditory filters.
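The A-weighting curve of Fig. 3.23 follows a standard analytical expression (as laid down in, e.g., IEC 61672); the sketch below evaluates that expression and is only an approximation of the combined outer-ear, middle-ear, and internal-noise characteristic, not the book's own implementation.

% Standard A-weighting in dB, evaluated at a few example frequencies
f  = [50 100 500 1000 4000 10000];                       % frequencies in Hz
Ra = (12194^2 * f.^4) ./ ((f.^2 + 20.6^2) .* ...
     sqrt((f.^2 + 107.7^2) .* (f.^2 + 737.9^2)) .* (f.^2 + 12194^2));
A  = 20*log10(Ra) + 2.00;                                % about 0 dB at 1 kHz by construction
[f; A]                                                   % note the strong attenuation at low frequencies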

3.9.3 Band-Pass Filtering

This stage simulates the operation of the auditory filters. It consists of a gammatone filterbank as described in Sect. 3.4. Its output is called the cochleogram. In this implementation, the signal is fed through 128 gammatone filters. Partly replicating Fig. 2.12, the result is illustrated in Fig. 3.24 for a 200-Hz pulse train. This pulse train is shown in the top panel. The second panel shows the outputs of eight gammatone filters with centre frequencies of 200, 300, 400, 500, 600, 1000, 2000, and 5000 Hz, respectively. For the lower centre frequencies, where the harmonics are resolved, the auditory filters are excited by at most one harmonic: For the displayed filter outputs

Fig. 3.23 Transfer characteristic of a dBA filter. This filter roughly combines the filter characteristics of the outer ear, presented in Fig. 3.13, that of the middle ear, presented in Fig. 3.14, and the introduction of internal noise presented in Fig. 3.15. (Matlab)

Fig. 3.24 The cochleogram of a 200-Hz pulse train. The waveform of the signal is presented at the top. The middle panel shows the excursions of the basilar membrane for eight positions on the basilar membrane. The surface plot in the bottom panel shows the complete cochleogram. (Matlab) (demo)

with centre frequencies of 200, 400, 600, and 1000 Hz, the filters resonate with the 1st, 2nd, 3rd, and 5th harmonic, respectively. For centre frequencies higher than about 2000 Hz, the filter outputs consist of series of the impulse responses of the auditory filters, which is most obvious for the output of the 5000-Hz auditory filter. This can be seen in more detail in the bottom panel, showing the cochleogram, i.e., the outputs of all filters of the filterbank on a grey scale, as described in Fig. 2.12. The ordinate on a Cam scale gives the centre frequencies of the auditory filters; the abscissa is time.
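A single gammatone filter of the kind used in this filterbank can be sketched directly from its impulse response; the order n = 4 and bandwidth factor b = 1.019 are the values commonly used to approximate the human auditory filter (Sect. 3.4), and the centre frequency and sampling rate below are arbitrary choices for the illustration.

% Impulse response of a 4th-order gammatone filter centred at 1000 Hz
fs = 16000; t = (0:1/fs:0.025)';       % 25 ms of time axis
fc = 1000; n = 4; b = 1.019;           % centre frequency, filter order, bandwidth factor
erbN = 24.7*(0.00437*fc + 1);          % Eq. 3.14
g = t.^(n-1) .* exp(-2*pi*b*erbN*t) .* cos(2*pi*fc*t);
g = g / max(abs(g));                   % normalized for plotting
plot(t, g); xlabel('Time (s)'); ylabel('Amplitude');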

3.9.4 Neural Transduction

The simulated inner-hair-cell response to a 200-Hz pulse train is presented in the next figure, Fig. 3.25. The top panel shows this pulse train. The compressive non-linearity is modelled by a non-linear half-wave rectifier. An instantaneous compressive nonlinearity is applied by raising the result of the rectification to the third power. The loss of phase lock at high frequencies is simulated by applying a low-pass filter with a cut-off frequency of 3 kHz. The result of this transduction process is shown for eight positions on the basilar membrane in the middle panel of Fig. 3.25. One can see

Fig. 3.25 Simulated inner-hair-cell responses to a 200-Hz pulse train. The waveform of the signal is presented at the top. The middle panel shows the simulated receptor potentials of inner hair cells at eight positions on the basilar membrane. A surface plot of the inner-hair-cell responses is presented in the bottom panel. (Matlab) (demo)

that phase lock is well represented up to 3000 Hz, but is mostly lost at 5000 Hz and above. The complete inner-hair-cell response is presented on a gray scale in the bottom panel.
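A sketch of this transduction stage, applied to a stand-in sinusoid rather than to an actual gammatone-filter output, could look as follows; the exponent and the 3-kHz cut-off follow the description above, whereas the second-order Butterworth low-pass filter is an assumption of this sketch.

% Simplified inner-hair-cell transduction: HWR, instantaneous nonlinearity, LPF
fs = 16000;
y  = cos(2*pi*200*(0:1/fs:0.05)');     % stand-in for the output of one gammatone filter
r  = max(y, 0);                        % half-wave rectification (HWR)
r  = r.^3;                             % raised to the third power, as described in the text
[b, a] = butter(2, 3000/(fs/2));       % 3-kHz low-pass filter (LPF); the filter order is an assumption
ihc = filter(b, a, r);                 % simulated inner-hair-cell response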

3.9.5 Generation of Action Potentials

The next stage of this model consists of estimating the interval distributions of the spike trains in the auditory nerve. In the original paper by Meddis and Hewitt [26], this is done by autocorrelation of the simulated inner-hair-cell responses. Since the probability of the generation of an action potential is directly proportional to the inner-hair-cell response, the peaks in the autocorrelograms are assumed to represent the peaks in the interspike-interval distributions of the auditory-nerve fibres. The idea is then to find the periodicity most common to all spike trains. The disadvantage of calculating the autocorrelograms is that they are all 1 at time 0. This is because the value of a correlogram at time τ represents the correlation coefficient between signal values separated from each other by an interval τ, and correlation coefficients are at most 1 in absolute value. Finding a common periodicity in these frequency channels is then based on equal contributions by all frequency channels. Hence, in this case, all channels of the auditory system have an equal share in the pitch estimation, whatever the excitation of the channels. It is presumed, however, that reliable pitch estimation requires that channels in which excitation is higher contribute more to the estimated pitch frequency than channels in which excitation is lower. In the procedure shown below, this is simply done by skipping the normalization in the calculation of the correlogram, which results in the autocovariance function (ACVF). At τ = 0, the ACVF gives the energy of the signal and, since, in this model, the ACVF is calculated from the output of the inner hair cells, this might be described as the approximate excitation by the stimulus in that channel. That is why the ACVFs of the inner-hair-cell responses were calculated and not the autocorrelation functions (ACFs). Finally, refractoriness will be ignored.

3.9.6 Detection of Periodicities: Autocovariance Functions

The ACVFs of the simulated inner-hair-cell responses are shown in the middle panel of Fig. 3.26. The top panel shows the stimulus, as usual. The ACVFs were calculated based on a Hanning-windowed interval of the inner-hair-cell responses with a width of twice 30 ms. This implies that the lowest pitch frequency considered in this implementation is 33 Hz. The bottom panel shows the sum of all ACVFs, or summary autocovariance function (SACVF). There is a vertical line at τ = 5 ms through the first peak at a delay τ > 0. It will be argued that this delay represents the duration of the most common interval in the spike trains and, hence, the estimate of the pitch period of the stimulus. Indeed, the ACVFs are horizontal cross sections through the

Fig. 3.26 Calculation of the summary autocovariance function for a 200-Hz pulse train. The waveform of the signal is presented at the top. The middle panel presents a surface plot of the autocovariance functions of the outputs of the inner hair cells. The bottom panel presents the summary autocovariance function. The abscissa of the highest peak, other than the peak at τ = 0, represents the estimate of the pitch period, in this case 5 ms, which corresponds to the periodicity of the 200-Hz pulse train. (Matlab) (demo)

surface plot shown in the middle panel of Fig. 3.26. In all significantly stimulated frequency channels, these ACVFs have peaks at 5 ms and at multiples of 5 ms. For the frequency channel at 200 Hz, the peak at 5 ms corresponds to spike intervals of the first order. Since the 200-Hz harmonic is completely resolved, that part of the basilar membrane is only excited by that harmonic. Hence, the peak at 5 ms corresponds to the intervals between spikes produced in specific phases of directly consecutive periods of this 200-Hz harmonic; the peak at 10 ms corresponds to spike intervals of the second order, hence intervals between spikes in trains that skipped one pitch period; the peak at 15 ms corresponds to spike intervals of the third order, hence intervals between spikes in trains that skipped two pitch periods, etc.

The second harmonic, the harmonic at 400 Hz, is completely resolved, too, but now the first-order intervals are spaced by 2.5 ms, the second-order intervals by 5 ms, the third-order intervals by 7.5 ms, etc. Hence, there are not only peaks at multiples of 5 ms but also exactly in between, i.e., at 2.5, 7.5, 12.5 ms, etc., corresponding to spike intervals of the first, third, and fifth order, respectively, in phase-locked responses. This can be continued to higher frequencies as long as phase lock is present, so up to more than 3000 Hz. This expresses itself in Fig. 3.26 in the presence of peaks in the ACVFs at corresponding distances. For higher frequencies, these peaks come closer together and form interference-like patterns up to about 3000 Hz. For frequencies higher than about 3000 Hz, phase lock is lost and the ACVFs only show peaks at 5 ms and its multiples. These peaks are a consequence of the fact that the inner-hair-cell responses run synchronously with the envelope of the cochleogram shown in Fig. 3.25. So, in all excited frequency channels, there are significant peaks at the time corresponding to one pitch period, and also at multiples of this period, but these peaks are less high. Consequently, summing the ACVFs over all frequencies gives a well-defined peak at the pitch period with secondary peaks at multiples of this pitch period. The result of this summation, the summary autocovariance function (SACVF), is shown in the bottom panel of Fig. 3.26.
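A simplified sketch of the ACVF and SACVF computation is given below. It assumes a matrix ihc with one column of simulated inner-hair-cell response per frequency channel, for instance obtained by applying the previous sketch to each gammatone-filter output; the unnormalized autocorrelation over a Hanning-windowed segment of twice 30 ms plays the role of the autocovariance function.

% ACVFs per channel and their sum, the summary autocovariance function (SACVF)
fs = 16000; maxlag = round(0.03*fs);              % lags up to 30 ms (lowest pitch about 33 Hz)
seg  = ihc(1:2*maxlag, :) .* hann(2*maxlag);      % Hanning-windowed segment of twice 30 ms, all channels
acvf = zeros(maxlag + 1, size(seg, 2));
for ch = 1:size(seg, 2)
    c = xcorr(seg(:, ch), maxlag, 'none');        % unnormalized, so the value at lag 0 is the energy
    acvf(:, ch) = c(maxlag+1:end);                % keep the non-negative lags only
end
sacvf = sum(acvf, 2);                             % sum over the tonotopic array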

3.9.7 Peak Detection

The final stage of this pitch-estimation procedure consists of detecting the most relevant peak in the SACVF. This is in principle done by taking all maxima in the SACVF for time intervals larger than the shortest pitch period considered here, 0.333 ms, which means that this algorithm can only estimate pitch frequencies correctly when they are lower than 3000 Hz. This peak detection is done by parabolic interpolation of all maxima and calculating the times where these parabolas attain their maxima. The time corresponding to the largest maximum then gives the estimated pitch period.
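A simplified sketch of this peak-detection step is shown below; it interpolates only the single largest maximum parabolically, whereas the text describes interpolating all maxima first, and it assumes the vector sacvf and the sampling rate fs from the previous sketch.

% Estimate the pitch period from the SACVF by parabolic peak interpolation
minlag = round(0.000333*fs);                      % ignore lags shorter than 1/3 ms (pitches above 3 kHz)
[pks, locs] = findpeaks(sacvf(minlag+1:end));     % all local maxima beyond that lag
[~, i] = max(pks);
k = locs(i) + minlag;                             % sample index of the largest maximum (lag of k-1 samples)
a = sacvf(k-1); b = sacvf(k); c = sacvf(k+1);     % three samples around the peak
delta = 0.5*(a - c) / (a - 2*b + c);              % parabolic-interpolation offset in samples
pitchPeriod = (k - 1 + delta) / fs;               % estimated pitch period in seconds
pitchFrequency = 1 / pitchPeriod                  % about 200 Hz for the 200-Hz pulse train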

3.10 Summary

The tonotopic array can be represented as a bank of overlapping band-pass filters. Each skirt of the transfer function of such a filter can be approximated by a rounded-exponential (roex) function. For each frequency, the parameters of these roex functions can be estimated by masking a pure tone of this frequency with noise that has a notch in its spectrum at this frequency. For moderate intensity levels, a frequency scale can be developed on which the bandwidth of these auditory filters is constant. This frequency scale represents the tonotopic organization of the auditory system and is found up to high levels in the central auditory system, up to at least the auditory cortex.

A computational model has been presented that calculates the amount of excitation over the tonotopic array, as expressed in dB above threshold. This is called the excitation pattern of that sound, and is supposed to estimate the power of the output of the auditory filters as a function of the location on the tonotopic array. This excitation pattern plays a pivotal role in the models of the perception of loudness presented in Chap. 7 and of some auditory attributes such as brightness discussed in Chap. 6. In calculating the excitation pattern, the temporal relations between the phases of the stimulating frequency components and the oscillations of the basilar membrane have been ignored. This does not mean that these phase relations are of no interest. On the contrary, they express themselves in the TFS of the series of action potentials in the auditory-nerve fibres that transmit the auditory information from the cochlea to the central nervous system. This TFS, represented in the summary autocovariance functions, will play an essential role, e.g., in Chap. 8.

References 1. Baker RJ, Rosen S (2006) Auditory filter nonlinearity across frequency using simultaneous notched-noise masking. J Acoust Soc Am 119(1 2006), 454–462. https://doi.org/10.1121/1. 2139100 2. Carney LH, McDuffy MJ, Shekhter I (1999) Frequency glides in the impulse responses of auditory-nerve fibers. J Acoust Soc Am 105(4 1999):2384–2391. https://doi.org/10.1121/1. 426843 3. Chen Z et al (2011) A new method of calculating auditory excitation patterns and loudness for steady sounds. Hear Res 282:204–215. https://doi.org/10.1016/j.heares.2011.08.001 4. Duifhuis H (2012) Cochlear mechanics: introduction to a time domain analysis of the nonlinear Cochlea. Springer Science & Business Media, New York, NY. https://doi.org/10.1007/978-14419-6117-4 5. Egan JP, Hake HW (1950) On the masking pattern of a simple auditory stimulus. J Acoust Soc Am 22(5 1950):622–630. https://doi.org/10.1121/1.1906661 6. Fletcher H (1940) Auditory patterns. Rev Mod Phys 12(1 1940), 47–55. https://doi.org/10. 1103/RevModPhys.12.47 7. Fletcher H, Munson WA (1937) Relation between loudness and masking. J Acoust Soc Am 9(1 1937):1–10. https://doi.org/10.1121/1.1915904 8. Glasberg BR, Moore BC (1990) Derivation of auditory filter shapes from notched-noise data. Hear Res 47(1–2 1990):103–138. https://doi.org/10.1016/0378-5955(90)90170-T 9. Greenwood DD (1990) A cochlear frequency-position function for several species—29 years later. J Acoust Soc Am 87(6 1990):2592–2605. https://doi.org/10.1121/1.399052 10. Greenwood DD (1961) Critical bandwidth and the frequency coordinates of the basilar membrane. J Acoust Soc Am 33(10 1961):1344–1356. https://doi.org/10.1121/1.1908437 11. Greenwood DD (1997) The Mel Scale’s disqualifying bias and a consistency of pitch-difference equisections in 1956 with equal cochlear distances and equal frequency ratios. Hear Res 103:199–224. https://doi.org/10.1016/S0378-5955(96)00175-X 12. Hanekam JJ, Krüger JJ (2001) A model of frequency discrimination with optimal processing of auditory nerve spike intervals. Hear Res 151:188–204. https://doi.org/10.1016/S03785955(00)00227-6 13. Hartmann WM (1998) Signals, sound, and sensation. Springer Science+Business Media Inc, New York, NY

14. Heinz MG, Colburn HS, Carney LH (2001) Auditory nerve model for predicting performance limits of normal and impaired listeners. Acoust Res Lett Online 2(3 2001):91–96. https://doi. org/10.1121/1.1387155 15. Hermansky H (1990) Perceptual linear predictive (PLP) analysis of speech. J Acoust Soc Am 87(4 1990):1738–1752. https://doi.org/10.1121/1.399423 16. Houtsma AJ, Rossing TD, Wagenaars WM (1987) Auditory demonstrations. eindhoven, The Netherlands: Institute for Perception Research (IPO), Northern Illinois University, Acoustical Society of America. https://research.tue.nl/nl/publications/auditory-demonstrations 17. Irino T, Patterson RD (2001) A compressive gammachirp auditory filter for both physiological and psychophysical data. J Acoust Soc Am 109(5 2001):2008–2022. https://doi.org/10.1121/ 1.1367253 18. Irino T, Patterson RD (2006) A dynamic compressive gammachirp auditory filterbank. In: IEEE transactions on audio, speech, and language processing, vol 14(6 2006), pp 2222–2232. https:// doi.org/10.1109/TASL.2006.874669 19. Irino T, Patterson RD (1997) A time-domain, level-dependent auditory filter: the gammachirp. J Acoust Soc Am 101(1 1997):412–419. https://doi.org/10.1121/1.417975 20. Irino T, Patterson RD (2020) The gammachirp auditory filter and its application to speech perception. Acoust Sci Technol 41(1 2020):99–107. https://doi.org/10.1250/ast.41.99 21. Jennings SG, Strickland EA (2012) Auditory filter tuning inferred with short sinusoidal and notchednoise maskers. J Acoust Soc Am 132(4 2012):2483–2496. https://doi.org/10.1121/1. 4746029 22. Jennings SG, Strickland EA (2012) Evaluating the effects of olivocochlear feedback on psychophysical measures of frequency selectivity. J Acoust Soc Am 132(4 2012):2497– 2513. https://doi.org/10.1121/1.4742723 23. Jurado C, Moore BC (2010) Frequency selectivity for frequencies below 100 Hz: comparisons with midfrequencies. J Acoust Soc Am 128(6 2010):3585–3596. https://doi.org/10.1121/1. 3504657 24. Kaas JH, Hackett TA, Tramo MJ (1999) Auditory processing in primate cerebral cortex. Curr Opin Neurobiol 9(2 1999):164–170. https://doi.org/10.1016/S0959-4388(99)80022-1 25. Lyon RF (2017) Human and machine hearing: extracting meaning from sound. Cambridge University Press, Cambridge, UK 26. Meddis R, Hewitt MJ (1991) Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification. J Acoust Soc Am 89(6 1991):2866–2882. https:// doi.org/10.1121/1.400725 27. Meddis R, Lopez-Poveda EA (2010) Auditory periphery: From pinna to auditory nerve. In: Meddis R et al (ed) Computational models of the auditory system. Springer Science+Business Media, New York, NY. Chapter 2, pp 7–38. https://doi.org/10.1007/978-1-4419-5934-8_2 28. Micheyl C, Xiao L, Oxenham AJ (2012) Characterizing the dependence of pure-tone frequency difference limens on frequency, duration, and level. Hear Res 292(1–2 2012):1–13. https://doi. org/10.1016/j.heares.2012.07.004 29. Moore BC (2012) An introduction to the psychology of hearing, 6th edn. Emerald Group Publishing Limited, Bingley, UK 30. Moore BC (2014) Auditory processing of temporal fine structure: effects of age and hearing loss. World Scientific, Singapore 31. Moore BC (2008) Basic auditory processes involved in the analysis of speech sounds. Philos Trans R Soc B: Biol Sci 363(1493 2008):947–963. https://doi.org/10.1098/rstb.2007.2152 32. Moore BC (2014) Development and current status of the ‘Cambridge’ loudness models. Trends Hear 18(2014):2331216514550620, 29 pages. 
https://doi.org/10.1177/2331216514550620 33. Moore BC, Ernst SMA (2012) Frequency difference limens at high frequencies: evidence for a transition from a temporal to a place code. J Acoust Soc Am 132(3 2012):1542–1547. https:// doi.org/10.1121/1.4739444 34. Moore BC, Glasberg BR (1996) A revision of Zwicker’s loudness model. Acustica 82(2 1996):335–345


Chapter 4

Auditory-Unit Formation

In the previous chapters, it was shown that all auditory information enters the central nervous system through a wide range of frequency channels arranged from low to high frequencies over what is referred to as the tonotopic array. Distance along this tonotopic array is expressed in the equivalent rectangular bandwidths (ERBs) of the auditory filters that make up the array. In fact, there are two such arrays, one coming from the left ear and the other from the right ear. The auditory information is represented by the excitation patterns and the temporal fine structure along these two arrays. The rest of this book is about how this information is turned into sounds that give us a sense of what is happening around us, where it is happening, and, albeit very briefly, in what kind of environment it is happening.

4.1 Auditory Scene Analysis

In everyday circumstances, we can hear traffic, various speakers, music, footsteps, etc. In the time-frequency domain, all these sound sources can and often do overlap considerably. The main function of the auditory system is to process this incoming information in such a way that listeners can interpret what happens around them and, if necessary, respond to it. In other words, the incoming auditory information has to be reorganized in such a way that listeners can identify and interpret the sound-producing events around them, where they occur, and in what kind of environment they are produced. After the book by Bregman [4], this general process is referred to as auditory scene analysis (ASA). Bregman [4] describes ASA as follows: “the main function of the auditory system is to build separate mental descriptions of the different sound-producing events in our environment,” which must be carried out in situations in which “the pattern of acoustic energy received by both ears is a mixture of the effects of different events” (p. 642). Although it was published in 1990,
everybody is advised to read Bregman’s book, in particular its excellent summary presented in its last chapter, “Summary and conclusions: What we do and do not know about auditory scene analysis”, from the first page of which the above citations were chosen.

Bregman [4] distinguishes two sub-processes of auditory integration: simultaneous integration and sequential integration. Simultaneous integration is concerned with the perceptual process that integrates concurrent components; these components may differ in frequency or in the location where they are actually produced, but they occur at the same time. Simultaneous integration is also called spectral integration. Many examples of simultaneous integration have, in fact, already been presented, since many of the complex tones discussed in this book consist of a number of concurrent, often harmonic partials. The individual partials of such a tone have lost their identity, and the complex tone is perceived as one tone. Indeed, the successive tones of a melody will be referred to as the auditory units of that melody. Similarly, the successive syllables of a speech utterance will be referred to as the auditory units of that utterance. In this book, simultaneous integration is defined as the process in which auditory units are formed. In other words, an auditory unit is the result of simultaneous integration.

In the process of sequential integration, successive auditory units are integrated into what will be indicated by auditory streams. Loosely defined, an auditory stream consists of a sequence of auditory units that are perceived as coming from the same sound-producing object or event. Examples of auditory streams are the melody played by one musician on one musical instrument or the sequence of syllables of a speech utterance spoken by one speaker. Sequential integration is also called temporal integration. Sequential integration is defined as the process in which auditory streams are formed. In other words, an auditory stream is the result of sequential integration.

It has been shown that sounds produced by human voices or musical instruments consist of a multitude of frequency components with complex time courses and, in the majority of cases, the sounds from environmental sources also have a complex structure in time and in frequency. In everyday situations, many of these sound sources may produce sound concurrently. In those circumstances, the auditory system must process this variety of sound components into auditory units and streams that make sense to the listeners. This perceptual process of ASA starts with the formation of auditory units, a process in which certain frequency components of the incoming sound are grouped together into what is perceived as one auditory unit. Other components may not be included in this auditory unit and instead form one or more other auditory units, a phenomenon which is called auditory-stream segregation or, for short, streaming. These processes of grouping and streaming are also indicated with integration and segregation, or with fusion and fission, respectively.


4.2 Auditory Units

The first sub-process of ASA that will be described is auditory-unit formation, the theme of this chapter. The term auditory unit will be used for what listeners perceive as one single sound, such as a footstep, a musical tone, or a single impact sound. In general, every auditory unit is assumed to be characterized by an auditory onset [25]. The auditory onset of an auditory unit is the moment we start hearing it. Every auditory unit, therefore, starts with an auditory onset.

Although every auditory unit has an auditory onset, much more attention will be paid to its beat, an auditory attribute briefly discussed earlier in Chap. 1. The beat of an auditory unit is heard at its perceptual moment of occurrence or beat location. When two sounds are perceived as played in synchrony, their beats coincide. Hence, the beats of auditory units are the perceptual events in time that are associated with their occurrence [85]. Moreover, they define the rhythms of the sounds we hear. They will be discussed in great detail in Chap. 5. One may wonder whether every auditory unit has a beat; this is not the case. Background sounds such as the wind or distant traffic noise can start so slowly and remain so steady that they do not induce any beats. In sounds such as speech and music, on the other hand, a clear rhythm can be distinguished, and with every beat of that rhythm, an auditory unit will be associated. In the case of speech, this is mostly a single syllable; in the case of music, a single tone.

It is well known that sensory systems are pre-eminently sensitive to increments in intensity. This is in the first place associated with the process of sensory adaptation, described in Sect. 2.5.3.3, which enhances onsets and attenuates offsets. In addition, there is the activation of low-spontaneous-rate auditory-nerve fibres with higher dynamic ranges, as discussed in Sect. 2.6.3. The importance of increments in intensity for the auditory system is discussed by, e.g., Heil [37, 38] and Phillips et al. [63], and for the central visual system by Donaldson and Yamamoto [23]. Later on, it will be argued that the beat of an auditory unit is associated with rapid increments, or clusters of rapid increments, in the intensity of the sound components that together form that auditory unit.

One may ask why increments in intensity are so important for perceptual systems. The timing of intensity increments is perceptually not easily affected by other environmental sounds, in contrast with the timing of the acoustic beginning of a sound, where the intensity is, by definition, quite low. Consequently, environmental sounds can easily affect the moment we start hearing a sound, i.e., its auditory onset. The moment of a rapid increment in intensity, on the other hand, is much more robust to the presence of environmental sounds. This is probably the main reason why the beat plays a dominant role in the perceived timing of auditory events. The auditory onset of an auditory unit, if it precedes the beat, will appear to play an insignificant role in the timing of auditory events. In Chap. 5, it will be shown that listeners synchronize auditory units by synchronizing their beats and not their auditory onsets [25]. The same holds for auditory offsets. In general, listeners are not able to synchronize auditory units by synchronizing their auditory offsets [91], which brings us to the third auditory event attributed to an auditory unit, its auditory
offset. The auditory offset of an auditory unit is the moment we stop hearing it. For very short auditory units such as clicks or handclaps, the auditory offsets coincide with their beats and their auditory onsets, but many other auditory units, e.g., musical tones, are heard as continuing after their beats.

In music, the beat of a tone is often very close to the onset of that tone, but this is certainly not always the case. In sung music, e.g., the tones most often consist of syllables, e.g., /na/, /sa/, or /stra/, and, when listeners are asked to synchronize another sound with such a syllable, they do not synchronize it with the beginning of the first consonant of the syllable, the auditory onset of the syllable, but with a moment that comes later in the syllable. Hence, the beat comes later than the auditory onset. It turns out that, if a syllable starts with such a consonant or a consonant cluster, the beat corresponds much more closely to the onset of the vowel than to the onset of the first consonant [24, 83]. The vowel onset is the moment in the syllable where restrictions in the oral cavity are removed and the oral cavity starts fully resonating. Consequently, the intensity in a considerable number of frequency channels rises rapidly. It will be argued that this rapid increase, in conjunction with other onsets during the prevocalic consonants, induces the beat. This shows that the beat can come later than the auditory onset of the syllable. In musical tones with abrupt onsets, they will coincide, but when one hears a sung syllable starting with a consonant or a consonant cluster, the beat comes after the auditory onset.

In conclusion, a unique auditory beginning is associated with each auditory unit, namely the moment we begin to hear it. Most auditory units also have a beat, possibly at the same time as the auditory beginning, but often somewhat later. A variety of auditory units have a separate auditory offset.

The term “auditory unit” as used here can be identified with the way it is used by Bregman [5] and Crum and Bregman [18]. It can also be identified with the term “perceptual unit” as used by Jones [41], the term “token” as used by Shamma et al. [78] and [79], or the term “auditory event” as used by McAdams and Drake [49] or by Nakajima et al. [60, p. 98]: “Auditory events are what we often call sounds in our everyday life: footsteps, hand claps, music tones, or speech syllables, of which we can count the number.” Counting the number of auditory units in an auditory stream will be discussed extensively in Sect. 10.4. Note that other authors give quite different definitions of an event, such as the one presented by Zacks and Tversky [94, p. 3 and 17]: “a segment of time at a given location that is conceived by an observer to have a beginning and an end.”

Another term often used in this context is that of auditory object. The use of this term is quite diverse, however. For instance, Denham and Nelken [93] define auditory objects quite abstractly as “representations of predictable patterns, or ‘regularities’, extracted from the incoming sounds” (p. 532). In this definition, the authors emphasize the mutual relation between successive sound elements as parts of a pattern or regularity, which corresponds to what is called an auditory stream in this book. A more concrete definition is given by Nudds [61], who defines an auditory object as “a sequence of sounds which are such that they are normally experienced as having been produced by a single source” (p. 121). This definition, too, corresponds well with what is called an auditory stream here. Others use the term auditory object for
what is called an auditory unit here. For instance, Gregg and Samuel [29] describe auditory objects as “relatively punctate events, such as a dog’s bark” (p. 998). So, in these definitions, the term “auditory object” is used for what in this book is called an auditory unit or an auditory stream. The term auditory object is also used, however, for auditory entities at a smaller time scale [46]. For instance, in a review of the plasticity of the auditory system, Heald et al. [36] use the term auditory object also for phonemes. In summary, the term auditory object has been used for almost anything auditory. Because of these diverging definitions, the term auditory object will be avoided here. If used, it will indicate an auditory unit or an auditory stream. Readers more interested in the concept of auditory object are referred to, e.g., Bizley and Cohen [31], Griffiths and Warren [46], Matthen [61], and Nudds [2], or Denham and Winkler [22, pp. 242–243, 247]. Moreover, Kubovy and Van Valkenburg [43] and Green [28] discuss the concepts of both auditory and visual objects. Readers interested in further discussions on the concept of perceptual objects in general are referred to, e.g., Gregory [30] and Handel [33].

4.3 Auditory Streams

In many cases, a sound source does not produce only one single sound, perceived as one single auditory unit, but goes on producing sound. For instance, a musician produces a series of tones that will mostly be perceived as one melody consisting of notes played one after another on the same instrument. This melody constitutes perceptually a coherent entity, different from other sounds such as the melodies played by other instruments. In the same way, a speaker can produce a series of syllables that together are perceived as one utterance produced by one and the same speaker. Again, this utterance constitutes perceptually a coherent entity, different from other sounds such as the speech produced by other speakers. The process in which auditory units such as series of tones or syllables are perceptually integrated into auditory streams, i.e., perceptually coherent entities such as musical melodies or speech utterances, is called auditory-stream formation [8]. Bregman [4] compares the role of a stream in the auditory system with that of an object in the visual system: “The stream plays the same role in auditory mental experience as the object does in visual” [4, p. 11]. Each separate melody or each separate utterance constitutes a separate stream. Hence, each auditory stream consists of a series of auditory units perceived as coming from one sound source and perceptually constituting one coherent entity. Naturally, the auditory units that constitute the auditory stream must have perceptual attributes in common in order to be integrated into one stream. This will be one of the main issues in Chap. 10, the penultimate chapter of this book.


4.3.1 Some Examples

Up to now, mainly musical notes and syllables have been presented as examples of auditory units, but there are many other examples outside the scope of music and speech. One example is the sound of a closing door; another is the sound of a falling drop of water. Such environmental sounds can also form auditory streams, e.g., the sound of a walking person. In this case, every single step will be perceived as one auditory unit, but the succession of steps will integrate into one auditory stream perceived as the sound of a walking person. Another example of an auditory stream is the sound of an elastic ball bouncing on a table. Each separate bounce is perceived as a separate auditory unit, while the series of bounces will together form one auditory stream, in this case a bouncing sound.

Now, two other simple illustrations will be presented. The first example is that of a pure sinusoid slowly modulated in amplitude. A schematic time-frequency representation of such a tone is presented in Fig. 4.1. (Note that, in a schematic time-frequency representation, the thickness of a line indicates the amplitude of a sound component. See Sect. 1.2.) The frequency of this tone in Fig. 4.1 is 400 Hz. The modulation frequency (MF) is 0.2 Hz, which is in the range of hearing slow modulations as discussed in Sect. 1.6. The modulation depth (MD) is 0.9, which means that the amplitude fluctuates in time between 10 and 190% of its average value. The tone starts almost instantaneously at 0 s. There is a fade-in of 5 ms to prevent spectral splatter. So the auditory onset is very close to 0 s. Since the onset is abrupt and not followed by another abrupt onset, a beat is heard at this onset, which virtually coincides with the auditory onset. After its onset, the tone rises in amplitude and reaches its maximum after 1.25 s. Then the amplitude decreases, reaches its minimum at 3.75 s, and rises again until, at 5 s, it attains its average value once more. In the sound presented in Fig. 4.1 this is repeated twice, so there are three modulations. At the end, at 15 s, there is an almost instantaneous fade-out of 5 ms and an auditory offset is perceived. When listening to this sound, one auditory unit is perceived, slowly fluctuating in loudness at a rate of 0.2 Hz. In summary, this sound is perceived as one long-lasting auditory unit.

Fig. 4.1 Schematic time-frequency representation of a pure 400 Hz tone sinusoidally modulated in amplitude with a frequency of 0.2 Hz. The fluctuating thickness of the line represents the fluctuating amplitude of the pure tone. (Matlab) (demo)


Fig. 4.2 Schematic time-frequency representation of a pure 400 Hz tone sinusoidally modulated in amplitude with an MF of 4 Hz. The fluctuating thickness of the line represents the fluctuating amplitude of the pure tone. (Matlab) (demo)

Its beat is at the beginning of the sound and coincides with its auditory onset. The auditory offset is at the end, at 15 s. Some listeners may object and state that they do hear a beat during the rising phase of the amplitude. Experimentally, this can be verified by studying how well defined in time and how reproducible this “beat” is. This will be discussed in Sect. 10.12.3.4.

The sound played in Fig. 4.1 will now be compared with the sound played in Fig. 4.2, in which the MF is 4 Hz, twenty times faster than the MF of the previous example. This 4 Hz is within the range of rhythm perception as discussed in Chap. 1. Since one auditory unit is associated with every beat, the sound of Fig. 4.2 constitutes an auditory stream consisting of a series of auditory units, each with its own beat. These beats, four per second, define the rhythm of the stream.

A good question now is: How can the perceptual moment of occurrence of an auditory unit be determined? For the sound presented in Fig. 4.2, one may try to determine the beat locations by asking listeners to produce sounds in synchrony with the beats of that sound, e.g., by tapping along. A tap is so short that its beat can be identified with its acoustic moment of occurrence. So, if listeners produce very short sounds such as taps or clicks in synchrony with the amplitude-modulated sound of Fig. 4.2, the beats of the amplitude-modulated sound are positioned at the instants of the taps or clicks. Another way is to ask listeners to synchronize a 4 Hz train of clicks with the stimulus. As a control, one may ask other listeners to indicate whether the two auditory streams are indeed perceived as synchronous. In principle, this makes it possible to determine the beat of auditory units, but there are some complications. This question will be discussed more generally in Sect. 5.1.
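
For readers who want to experiment with these two demos, the following MATLAB sketch generates such an amplitude-modulated tone. It is only a minimal reconstruction based on the parameter values mentioned above; the sampling rate, the scaling, and the shape of the 5-ms fades are assumptions, and the script is not the book's own demo code.

    fs  = 44100;                          % sampling rate (Hz), assumed
    fc  = 400;                            % carrier frequency (Hz)
    fm  = 0.2;                            % modulation frequency; use 4 for Fig. 4.2
    md  = 0.9;                            % modulation depth
    dur = 15;                             % duration (s)
    t   = 0 : 1/fs : dur - 1/fs;
    env = 1 + md * sin(2*pi*fm*t);        % fluctuates between 0.1 and 1.9 times the mean
    y   = env .* sin(2*pi*fc*t);
    nf  = round(0.005 * fs);              % 5-ms raised-cosine fade-in and fade-out
    w   = 0.5 * (1 - cos(pi * (0:nf-1) / nf));
    y(1:nf)         = y(1:nf) .* w;
    y(end-nf+1:end) = y(end-nf+1:end) .* fliplr(w);
    sound(0.4 * y / max(abs(y)), fs);     % normalize and play

Setting fm to 4, and the duration to, say, 3 s, approximates the stimulus of Fig. 4.2, in which each modulation cycle induces its own beat.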

4.4 The Perceived Duration of an Auditory Unit

Another seemingly good question is: Where are the auditory onsets and offsets of the auditory units located? One of the few publications in which this problem is discussed in the context of auditory-unit formation is by Crum and Bregman [18]. This problem will be discussed based on the demo of Fig. 4.2, in which a 4 Hz rhythm is heard.


Except for the first beat, no other beat is preceded by an acoustic beginning; neither is any beat, except for the last one, followed by an acoustic ending. In fact, one just hears a rhythmic succession of beats, without well-defined separate auditory onsets or offsets. Indeed, every beat indicates the start of a tone and, since one does not hear this tone continuing after the following beat, the tone ends perceptually at the following beat. This suggests that every auditory unit of this stream, except the first and the last, starts with its beat and ends at the next beat in the auditory stream. In other words, the auditory onset of every auditory unit coincides with the beat of that auditory unit, while the auditory offset coincides with the beat of the following auditory unit. For such successions of auditory units, in which there is no perceptually defined end, it will be argued that it makes sense to define the perceived duration of a unit as the time interval between the beat of that unit and the beat of the following auditory unit.

The next question is: “What is the perceived duration of an auditory unit in general?” It will be argued that, for various sounds, the perceived duration is not a well-defined perceptual attribute. The perceived duration of very short sounds such as clicks, taps, and hand claps may still be defined as zero, but this is precarious for sounds that begin and end more gradually. By definition, the intensity at gradual beginnings and endings is low, so that these parts of a sound can easily be masked by other sounds [7, p. 2703] [62, p. 2906]. This implies that the moment at which an onset or offset of a sound can still be perceived is quite sensitive to the presence of environmental sound. This contrasts with the beat of an auditory unit, which is associated with increments in intensity, often, e.g., in speech, over a wide range of the tonotopic array. As said in Sect. 4.2, if there is one thing to which sensory systems are sensitive, it is increments in intensity. Moreover, the presence of environmental sound will not seriously affect the perceived timing of such increments. Consequently, the beat of an auditory unit is robust to the presence of environmental sounds. This is an example of perceptual constancy, an important concept discussed at various places in this book. It states that attributes of perceptual objects remain constant under changes in the environment. This implies that perceptual attributes must be based on physical properties that are virtually invariant under environmental changes. The timing of abrupt increments in sound intensity is an example of such a property; hence the prominent role of these increments in the perceived timing of auditory events.

Another reason why an offset is often less precisely perceived in time is the presence of reverberation. This can be one of the reasons why sounds with abrupt onsets and more gradual offsets, so-called damped sounds, are perceived as shorter than so-called ramped sounds, which have a gradual onset and a more abrupt offset [26, 27, 77]. Since reverberation has the same frequency content as the direct sound, Grassi and Mioni [27] argued that the effect should be smaller for tones modulated upward or downward in frequency than for tones with constant frequencies. They synthesized complex tones with damped and ramped envelopes, and compared the perceived duration of tones with constant F0 with that of complex tones with a rising F0. The authors found, indeed, that only damped sounds with constant F0s
were perceived as much shorter than the ramped sounds of the same duration. The difference was much less for tones with modulated frequencies.

Above, it has been argued that the beat of an auditory unit cannot in general be identified with its acoustic beginning. As to the perception of the end of an auditory unit, it is clear that, when a musical note continues for a while after its beginning and then stops abruptly, a clear ending will be perceived. For such sounds, one may ask listeners to judge which of two such sounds lasts longer. Hence, it makes sense to attribute a perceived duration to them. Indeed, most research on auditory duration perception has been done with sounds with well-defined abrupt onsets and offsets. For such sounds, it is most likely that the perceived duration is predominantly defined by the acoustic interval between this onset and offset. This, however, is not the only factor that plays a role. Burghardt [10] showed that the perceived duration of pure tones shorter than about 800 ms depends on the frequency of the tone: Below 3 kHz, a tone with a lower frequency is perceived as shorter than a higher-frequency tone of the same duration; above 3 kHz, a tone with a higher frequency is perceived as shorter than a lower-frequency tone of the same duration.

Another interesting effect is described by Matthews et al. [47]. They asked listeners to compare the relative duration of two noise bursts, a standard stimulus of fixed duration, 600 ms, and a comparison stimulus that was either shorter or longer in duration. In addition to differing in duration, the two noise bursts differed in intensity. In one set of trials, the two noise bursts were played in a background of noise that was more intense than the two noise bursts. In another set of trials, the background of noise was less intense than the noise bursts. In this “quiet” context, Matthews et al. [47] found what has traditionally been reported: More intense stimuli are perceived as lasting longer than less intense stimuli of the same duration. In the “loud” context, however, they found the reverse: The less intense noise bursts were perceived as lasting longer than the more intense noise bursts of the same duration. The authors obtained similar results for visual stimuli.

The results obtained by Burghardt [10] and Matthews et al. [47] were obtained for stimuli with acoustically well-defined beginnings and endings. The effects found by such studies will be much more complex for sounds such as speech syllables or musical tones with gradual onsets and offsets. Moreover, in the demos of Figs. 4.6, 4.7, and 4.8, it will be shown that the perceived onset of an auditory unit is not necessarily accompanied by any increase in intensity. The interested reader can already listen to them.

In addition, there are more complicating phenomena. Many everyday sounds, such as claps, ticks, taps, and clicks, and also some percussion sounds in music, are so short that it does not make much sense to ask listeners to attribute a perceived duration to them. For the same reasons, it does not make much sense in other kinds of sounds, e.g., spoken utterances, to attribute a perceived duration to a specific syllable, since its perceptual beginning and ending are often not well defined, certainly not in noisy environments. In running speech, one can define the perceived duration of a syllable as the interval between the beat location of that syllable and that of the following syllable, as proposed by Barbosa and Bailly [1].
This problem will be discussed in Sect. 5.3. A similar problem arises in sung music or in legato
played musical melodies, in which tones end at the same time as or later than the start of the next tone.

What has been said until now in this section relates to sounds with a perceptually well-defined onset, its beat, and a perceptually well-defined offset. In the field of duration perception, the intervals between such an onset and offset are indicated with filled intervals. The counterpart of a filled interval is an empty interval, i.e., an interval between two generally short marker sounds, of which the first defines the start and the other the end of that time interval. Perceptually, the latter are, therefore, the intervals between the beat locations of two successive auditory units. Based on experiments with marker sounds of which one had a ramped envelope and the other a damped envelope, Chen [15] found that the perceived duration between two such marker sounds was determined by their perceptual moments of occurrence or P-centres, i.e., their beat locations. In general, filled intervals are perceived as longer than empty intervals of the same duration [17, 96], a phenomenon sometimes called the filled-duration illusion [89] [32, p. 430]. It is not a small effect. Wearden et al. [89] report that the perceived duration of empty intervals is only 45–55% of that of filled intervals of acoustically equal duration.
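
To make the damped/ramped distinction discussed above more concrete, the following MATLAB sketch generates a damped tone and its time-reversed, ramped counterpart. The carrier frequency, durations, and decay constant are arbitrary choices for illustration; they are not the stimuli used in the studies cited above.

    fs  = 44100;                             % sampling rate (Hz), assumed
    t   = 0 : 1/fs : 0.5 - 1/fs;             % 500-ms tones, duration assumed
    carrier = sin(2*pi*1000*t);              % 1-kHz carrier, chosen arbitrarily
    damped  = carrier .* exp(-t/0.1);        % abrupt onset, exponential decay
    ramped  = fliplr(damped);                % time-reversed: gradual onset, abrupt offset
    gap     = zeros(1, round(0.5*fs));       % half a second of silence in between
    sound(0.8 * [damped gap ramped], fs);    % the damped tone tends to sound shorter

Time reversal keeps the energy and spectrum of the two tones identical, so any difference in perceived duration must come from the shape of the envelope.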

4.5 Perceptual Attributes

In addition to the beat, our hearing system attributes a number of other perceptual attributes to each auditory unit. Important examples of such perceptual attributes are loudness, pitch, timbre, perceived location, and, as just discussed, perceived duration. These perceptual properties emerge from the process of auditory-unit formation and are determined by the auditory information that the hearing system uses in the formation of these auditory units. In this book, a number of these perceptual properties, namely timbre, loudness, pitch, and perceived location, will be discussed in the corresponding chapters.

4.6 Auditory Localization and Spatial Hearing

In the processes of auditory-unit formation and auditory-stream formation, the auditory system carries out two other important functions. The first function is to localize a sound source. Indeed, for most sounds, listeners are more or less able to hear where they are being produced, and whether the sound sources are stationary or moving. The process that makes this possible will be referred to as auditory sound localization, human sound localization, or, briefly, auditory localization. The discussion of this topic will largely be postponed to Chap. 9. This can be done because auditory-unit formation also occurs when a mixture of sounds comes from one acoustic sound source, something with which one is, in fact, very familiar. When listening to music or radio discussions over one loudspeaker, there is only one location where the sound
acoustically comes from, i.e., that loudspeaker. In spite of this, various musical instruments or various speakers are heard as separate auditory streams, perhaps not as clear and well defined as when the real sources are separated in the room, but separate streams can be heard. Even more amazingly, it can often be heard whether a sound is recorded closer to or farther away from the microphone, giving the illusion that one sound comes from farther away than the other. Apparently, the processes of auditory-unit formation and auditory-stream formation are also operative in situations where there is only one acoustic sound source. The focus in this chapter will be on the process of auditory-unit formation based on sound coming from one physical sound source.

The fourth function the auditory system carries out, besides auditory-unit formation, auditory-stream formation, and auditory localization, is to determine in what kind of environment the sound is produced. Apart from the fact that the reverberation of a room is very important in this respect, not much is known about the perceptual aspects of spatial hearing. Most studies are based on virtual acoustic simulations. It is, however, well known that, when listening to radio recordings, sounds are heard as coming from the radio and not as produced in the room in which one is listening. Moreover, one generally does not hear where in the recording room the sounds were produced. On the other hand, listeners often make judgments as to the size of the room in which a recording is made [11], and as to the kind of environment in which the sounds are recorded: whether this is a living room, a church, a street, a restaurant, a concert hall, etc. The process that enables listeners to hear in what kind of environment a sound is produced will be indicated with spatial hearing. Although of interest, not much attention will be paid to spatial hearing in this book. It must be remarked that various authors, e.g., Blauert [3] and Middlebrooks [52], use the term spatial hearing where in this book the term auditory localization is used.

4.7 Two Illustrations

In the next demo, illustrated in Fig. 4.3, a simple harmonic sound is played, the sum of ten sinusoids with frequencies that are all integer multiples of 440 Hz. During the demo, one of the harmonics is repeatedly switched off and on at a rate of 0.5 Hz. At the beginning, this sounds like an annoying buzz. Then, after 1 s, the third harmonic is switched off. This, however, is only perceived as a subtle change in timbre that is difficult to describe in words. After another second, the harmonic is switched on again for one second, so that the sound is acoustically identical to the sound played in the first second. The remarkable thing is that, in spite of this, it is not heard as the one coherent sound it was when it started, but as two sounds: the annoying buzz plus a separate pure tone whose pitch frequency is that of the harmonic that is switched off and on. Hence, the harmonic, after it is switched on again, pops out as a new auditory unit, a separate pure tone with the pitch frequency of that harmonic. A new auditory stream emerges, consisting of an intermittent sequence of pure tones with a pitch frequency of 3 × 440 = 1320 Hz.


Fig. 4.3 The third harmonic of a 440 Hz harmonic series is switched on and off. In the first second of the demo one sound is heard, whereas in the third, the fifth, the seventh and the ninth second, two sounds are heard. (Adapted from Houtsma, Rossing, and Wagenaars [40, Tr. 1]). (Matlab) (demo)
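
A minimal MATLAB sketch of this kind of stimulus is given below. The total duration and the scaling are assumptions; the book's own demo presumably also applies short ramps when switching the harmonic, which are omitted here for brevity.

    fs  = 44100;                              % sampling rate (Hz), assumed
    f0  = 440;
    dur = 10;                                 % total duration (s), assumed
    t   = 0 : 1/fs : dur - 1/fs;
    y   = zeros(size(t));
    for k = 1:10
        y = y + sin(2*pi*k*f0*t);             % ten equal-amplitude harmonics
    end
    off = mod(floor(t), 2) == 1;              % true during the 2nd, 4th, ... second
    y   = y - off .* sin(2*pi*3*f0*t);        % switch the 3rd harmonic off there
    sound(0.15 * y / max(abs(y)), fs);

Each time the third harmonic returns, it is heard as a separate pure tone on top of the buzz, even though the waveform is then identical to that of the first second.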

Fig. 4.4 Intensity contour of a 1000 Hz pure tone. In this example, two tones are heard, a continuous tone with a gradually increasing loudness, indicated by the dotted line, and a series of tones of decreasing loudness. Both sounds have pitch frequencies of 1000 Hz. Adapted from track 7 of the CD accompanying [88]. (Matlab) (demo)

Now, the second example will be discussed. From what has been said earlier, one may be inclined to think that auditory information close in frequency will, in general, be processed as coming from one auditory stream and will not be distributed over different auditory streams. One will see, however, that even information from within the bandwidth of one auditory filter can be allocated to more than one auditory unit. This example is illustrated in Fig. 4.4, showing the envelope of a 1000 Hz pure tone as a thick continuous line. The stimulus consists of a series of tones with a relative intensity level of 60 dB. Their duration is 200 ms, as are the intervals between them. During these intervals, the intensity level is not zero but starts 10 dB below the intensity level of the tones and then slowly increases up to 60 dB, the level of the tones. When one listens to this sound, not one but two auditory streams are heard. The first stream consists of one uninterrupted pure tone starting relatively soft, then gradually increasing in loudness, and finally ending relatively loud. The second stream is an intermittent series of pure tones of the same pitch that, especially towards the end, decrease in loudness. The remarkable thing is that, since the sound consists of one pure tone, this segregation into two auditory streams takes place within the bandwidth of one auditory filter. Hence, auditory information from within the bandwidth of one auditory filter is allotted to different auditory streams. This example will be discussed in much more detail in Chap. 7; for the moment, it is just mentioned to show that frequency separation of sound components is not
necessary for auditory-stream formation to occur. On the other hand, it will be shown in what follows that auditory-filter bandwidth does play a significant role in auditory-unit and auditory-stream formation.
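
The intensity contour of Fig. 4.4 can be approximated with the following MATLAB sketch. The number of segments, the absolute levels, and the scaling are assumptions; only the 200-ms segment duration, the 60-dB tone level, and the background level starting 10 dB below that level are taken from the description above.

    fs   = 44100;                             % sampling rate (Hz), assumed
    f    = 1000;
    nseg = 20;                                % twenty 200-ms segments, i.e., 4 s, assumed
    segN = round(0.2 * fs);
    N    = nseg * segN;
    t    = (0:N-1) / fs;
    L    = linspace(50, 60, N);               % background level rising from 50 to 60 dB
    for k = 1:2:nseg                          % odd segments carry the fixed 60-dB tones
        L((k-1)*segN + (1:segN)) = 60;
    end
    y = 10.^(L/20) .* sin(2*pi*f*t);          % convert the level contour (dB) to amplitude
    sound(0.3 * y / max(abs(y)), fs);

As the background level creeps up towards the level of the tone segments, the intensity increments at the segment onsets shrink, which is why the intermittent stream is heard to decrease in loudness.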

4.8 Organizing Principles

All auditory information enters the central nervous system distributed over two tonotopic arrays, one from the left ear and one from the right ear. This information can come from a large variety of sound sources overlapping in location, in frequency, and in time. In extreme cases, this may lead to a cacophony of sound that the listener is unable to process in any sensible way but, in most everyday circumstances, the auditory system is quite well capable of dealing with the complexity of the incoming acoustic information. In order to accomplish this, the auditory processing of acoustic information must be based on the internal structure within this information, on its coherence in time and in frequency. Only by exploiting the internal coherence of the information coming from one sound source, and by isolating it from the information coming from other sound sources, can the information-processing auditory nervous system accomplish its function of making sense of this mixture of acoustic information. In the process of carrying this out, the auditory system is assumed to follow a number of heuristic rules based on this coherence of the information coming from one single sound source. These rules will be referred to as the organizing principles of auditory scene analysis (ASA). They are inspired by a branch of vision research known as Gestalt psychology [42, 90]. These organizing principles will, therefore, also be indicated with Gestalt principles.

Gestalt psychology arose around 1930 as a reaction against structuralism, whose advocates argued that it should be possible to explain the perceptual properties of a visual object from the perceptual properties of its elements. The Gestalt psychologists argued, on the other hand, that the perceptual properties of the whole cannot be explained from the properties of its components, but that the whole is more than the sum of its parts. An overview of traditional and more recently defined organizational principles for the visual system is presented by Brooks [9]. A review of the role of Gestalt psychology in vision research is presented by Wagemans et al. [86, 87]. This will not be discussed here in detail for the visual system, but it will be discussed at various relevant places in relation to auditory-unit formation and auditory-stream formation.

There is much discussion as to the number of Gestalt principles to be distinguished. Some distinguish only four or five, while others distinguish more than twelve. In relation to ASA, Williams [92] describes nine Gestalt principles and illustrates them with sound examples. In this book, the principles discussed by Bregman [4] will mostly be followed. In the context of simultaneous integration, he discusses the Gestalt principles of regularity, which he more or less identifies with harmonicity, of common fate, and of exclusive allocation. In the context of sequential integration, he discusses proximity, similarity, good continuation, and closure. In this book, the principle of connectedness is added to this list. Finally, in combining simultaneous
and sequential integration, Bregman [4] discusses the Gestalt principle of organization, and the concepts of context, the perceptual field, innateness, and automaticity. These ideas as to the Gestalt principles in ASA will be discussed in Sect. 10.9. Now, the principles of common fate, spectral regularity, and exclusive allocation will be discussed in the context of auditory-unit formation.

4.8.1 Common Fate

In the complex tones played until now, all partials started at the same time. This synchrony of the various partials of a tone is one of the main factors contributing to their perceptual integration. It is an example of the Gestalt principle of common fate, which says that stimulus components that change in the same way are more likely to integrate perceptually than stimulus components that behave differently from each other. This implies that stimulus components that start at the same time, remain the same, and then end at the same time are very likely to integrate, and this, indeed, appears to be the case. The principle of common fate will be discussed for common onsets and offsets, common amplitude modulation, common frequency modulation, and common spatial origin.

4.8.1.1 Common Onsets and Offsets

What happens now when one of the harmonics of a complex tone does not start at the same time as the other harmonics? This is demonstrated in Fig. 4.5. In each complex tone, one randomly selected harmonic starts earlier or later than the other harmonics.

Fig. 4.5 A series of harmonic complex tones of which one randomly chosen harmonic starts asynchronously. In the first three tones, the deviant harmonic starts earlier than the other harmonics, 300, 100, and 30 ms, respectively. In the last three tones, it starts later than the other harmonics, 30, 100, and 300 ms, respectively. In the first two and the last two tones the deviant harmonic is clearly audible as a separate pure tone. (Matlab) (demo)


Fig. 4.6 Same as Fig. 4.5 except that the deviant harmonic does not start but ends asynchronously. (Matlab) (demo)

The deviations of the deviant harmonic are –300, –100, –30, 0, 30, 100, and 300 ms, respectively. So, in the first three tones, where the deviations are negative, the deviant harmonic starts “too early”, and in the last three tones “too late”. Listening to these tones, it becomes clear that, especially in the first two and the last two tones of the series, the deviant harmonic pops out as a separate auditory unit with the timbre of a pure tone and with a pitch corresponding to that of the deviant harmonic [19, 21].

In Fig. 4.6, a randomly chosen harmonic is desynchronized not at the beginning but at the end of the complex tone. The deviations are again –300, –100, –30, 0, 30, 100, and 300 ms, respectively. In this case, one hears a clear separate auditory unit only for the last two tones. In the first two tones, a change in timbre can be heard when the deviant harmonic stops, but it is hard to describe in words what kind of change one hears. When the deviant harmonic lasts longer than the other harmonics, something else happens. After the end of the other harmonics, one can clearly hear another auditory unit coming in, a pure tone with the pitch frequency of the deviant harmonic. In spite of the absence of any acoustic onset, this new auditory unit pops out as a separate sound.

These examples show that the synchrony of increases in intensity plays an important role in the integration of different frequency components. When sound components increase suddenly in intensity at the same time, they are “interpreted” by the auditory system as coming from the same sound source and integrated into one auditory unit. Similarly for decreases in intensity: when all sound components stop simultaneously except one, the auditory information of this exceptional component, until then used in the integration of the complex, is released and processed as a new auditory unit. This remarkable phenomenon shows that a new auditory unit can be formed without an increase in intensity within any auditory frequency channel. Also when auditory information induced by a frequency component is no longer integrated with the information induced by other frequency components, this released information can be allocated to a new auditory unit.
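
A minimal MATLAB sketch of one such tone is given below. The duration of the complex, the choice of the deviant harmonic, and the scaling are assumptions, and the harmonics are assumed to end together; only the ten-harmonic structure and the onset asynchronies follow the description above.

    fs    = 44100;                        % sampling rate (Hz), assumed
    f0    = 440;
    dur   = 1.0;                          % duration of the synchronous harmonics (s), assumed
    shift = -0.3;                         % onset asynchrony of the deviant harmonic (s)
    kdev  = 4;                            % index of the deviant harmonic, chosen arbitrarily
    tOthers = max(-shift, 0);             % onset time of the synchronous harmonics
    N  = round((tOthers + dur) * fs);     % all harmonics end together at tOthers + dur
    y  = zeros(1, N);
    for k = 1:10
        if k == kdev, tOn = tOthers + shift; else, tOn = tOthers; end
        n1 = round(tOn * fs) + 1;         % first sample of this harmonic
        tk = (0 : N - n1) / fs;           % time axis for this harmonic
        y(n1:N) = y(n1:N) + sin(2*pi*k*f0*tk);
    end
    sound(0.15 * y / max(abs(y)), fs);

With shift = -0.3, the deviant harmonic starts 300 ms before the others; a positive shift makes it start later, as in the last tones of the demo.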

202

4 Auditory-Unit Formation

Fig. 4.7 Time-stretching illusion for a 2.5 kHz pure tone. A wide band of noise is directly followed by a short pure tone, followed by a pause and a repetition of the tone. When played just after the wide-band noise, the tone sounds somewhat longer than when it is played in isolation. After Sasaki et al. [75] as cited by Sasaki et al. [76] (Matlab) (demo)

A related phenomenon, illustrated in Fig. 4.7, was described by Sasaki et al. [75] as cited by Sasaki et al. [76]. The demo starts with an 800-ms burst of wide-band noise; the bandwidth is 2 octaves, and the centre frequency is 2500 Hz. This burst is followed without interruption by a short 2500 Hz pure tone. Then there is a 1-s pause, after which the tone is repeated once. It appears that the first tone, played just after the noise burst, has a longer perceived duration than its repetition one second later. The illusion is quite subtle, so the reader is advised to listen to it a few times. The effect is, however, quite significant. Indeed, in order to measure how much longer the first tone was perceived to be, Sasaki et al. [75] made the duration of the second tone variable and asked listeners to adjust this duration in such a way that it sounded just as long as the first tone. It appeared that listeners set the second tone to a longer duration than that of the first tone, which shows that the perceived duration of the first tone was longer than its actual duration. This perceptual lengthening, or stretching, could be as much as 100 ms. Sasaki et al. [76] called this the time-stretching illusion.

Time stretching is also described by Carlyon et al. [14]. They replaced the short pure 2.5 kHz tone of Fig. 4.7 by a burst of narrow-band noise centred at 2.5 kHz. A demo is presented in Fig. 4.8. The authors showed that time stretching could amount to more than 100 ms. Interestingly, they did more. First, they showed that the illusion only occurred when the frequency content of the preceding wide-band noise overlapped with that of the narrow-band noise, as it does in the demo. Moreover, by performing rhythmic adjustment experiments, they demonstrated that the perceived onset of the narrow band of noise, i.e., its beat, was located before the end of the wide-band noise. Hence, auditory information from the wide-band noise burst is integrated with the narrow-band noise in such a way that the narrow-band noise seems to start before the end of the wide-band noise. As a result, the burst of narrow-band noise has a longer perceived duration than when played in isolation. One may wonder whether the time-stretching illusion would also occur if the narrow-band signal were not preceded but followed by a burst of wide-band noise.

Fig. 4.8 Time-stretching illusion for a narrow band of noise. A wide band of noise is directly followed by a short narrow band of noise, followed by a pause and a repetition of the short narrow band of noise. When played just after the wide-band noise, the narrow noise band sounds somewhat longer than when it is played in isolation. After Carlyon et al. [14]. (Matlab) (demo)

Remarkably, Kuroda and Grondin [44] showed that the answer is negative. Time stretching does not occur when a tone is followed by a burst of wide-band noise, nor does it occur when a tone is both preceded and followed by a burst of wide-band noise.
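
The stimulus of Fig. 4.7 can be approximated with the MATLAB sketch below. The tone duration, the relative levels, and the brick-wall filtering of the noise in the FFT domain are assumptions made for brevity; only the 800-ms two-octave noise band centred at 2.5 kHz, the 2.5-kHz tone, and the 1-s pause follow the description above.

    fs   = 44100;                                  % sampling rate (Hz), assumed
    fc   = 2500;
    fLo  = fc/2;  fHi = fc*2;                      % two-octave band centred at 2.5 kHz
    Nn   = round(0.8 * fs);                        % 800-ms noise burst
    f    = (0:Nn-1) * fs / Nn;                     % FFT bin frequencies
    keep = (f >= fLo & f <= fHi) | (f >= fs-fHi & f <= fs-fLo);
    nb   = real(ifft(fft(randn(1, Nn)) .* keep));  % brick-wall band-pass via the FFT
    nb   = nb / max(abs(nb));
    tt   = 0 : 1/fs : 0.1 - 1/fs;                  % 100-ms tone, duration assumed
    tone = 0.5 * sin(2*pi*fc*tt);
    y    = 0.5 * [nb tone zeros(1, fs) tone];      % noise, tone, 1-s pause, tone again
    sound(y, fs);

The tone directly after the noise should sound slightly longer than its repetition after the pause.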

4.8.1.2 Common Amplitude Modulation

As said, the Gestalt principle of common fate states that sound components that change in the same way are likely to be grouped together into one auditory unit. This is illustrated in the next demo, presented in Fig. 4.9, for sound components that have a common amplitude modulation. The sound starts as a buzz, a complex tone of ten harmonics of 440 Hz of equal amplitude. After one second, the odd harmonics are sinusoidally modulated in amplitude, while the even harmonics keep the same constant amplitude. The MF is 4 Hz, and the MD is 0.9, so that the amplitude is sinusoidally modulated at 4 Hz between 0.1 and 1.9 times its original value. In the schematic time-frequency representation of Fig. 4.9, these amplitude modulations appear as thinning and thickening of the lines representing the harmonics of the sound.

When one listens to this sound, one starts by hearing the 440-Hz buzz. After 1 s, the sound segregates into two auditory streams. One of these streams is formed by the even, unmodulated harmonics; the frequencies of these even harmonics are all multiples of 880 Hz, so one actually hears a complex tone consisting of the first five harmonics of 880 Hz. The other stream is formed by the odd harmonics that are modulated in amplitude. The frequency of this modulation is 4 Hz, which is in the rhythm range, so that each modulation induces a beat. Since one auditory unit is associated with every beat, this stream consists of twelve successive auditory units. In summary, between 1 and 4 s of the stimulus, one hears two simultaneous auditory streams: an 880-Hz buzz, and a sequence of twelve almost identical auditory units.


Fig. 4.9 Streaming through common amplitude modulation. The thickness of the lines in this schematic time-frequency representation is proportional to the instantaneous amplitude of the harmonics. The demo starts and ends with a one-second buzz. During the modulation, the sound segregates into two auditory streams. (Matlab) (demo)

As soon as the amplitude modulation of the odd harmonics stops at 4 s, the ten components integrate again, and one hears the same familiar 440-Hz buzz with which the sound started.
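
A minimal MATLAB sketch of this stimulus is given below; the sampling rate and scaling are assumptions, and no ramps are applied at the overall onset and offset.

    fs  = 44100;                              % sampling rate (Hz), assumed
    f0  = 440;
    dur = 5;                                  % 1 s buzz, 3 s modulation, 1 s buzz
    t   = 0 : 1/fs : dur - 1/fs;
    am  = 1 + 0.9 * sin(2*pi*4*(t - 1));      % 4-Hz modulator with depth 0.9
    am(t < 1 | t >= 4) = 1;                   % modulate only between 1 and 4 s
    y   = zeros(size(t));
    for k = 1:10
        if mod(k, 2) == 1
            y = y + am .* sin(2*pi*k*f0*t);   % odd harmonics share the modulation
        else
            y = y + sin(2*pi*k*f0*t);         % even harmonics keep constant amplitude
        end
    end
    sound(0.1 * y / max(abs(y)), fs);

Because the modulator equals 1 at t = 1 s and t = 4 s, the amplitude changes smoothly into and out of the modulated interval.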

4.8.1.3 Common Frequency Modulation

In the next demo, shown in Fig. 4.10, it is not the amplitude of the odd harmonics that is modulated but their frequency. The sound starts the same as in the previous demo, with the now familiar buzz consisting of the ten lowest harmonics of 440 Hz. After 1 s, the odd harmonics of the complex are modulated in frequency, while the even harmonics are kept constant in frequency. The MF is again 4 Hz, so that one period of the frequency modulation is 250 ms. This is continued for 3 s, so for twelve modulations, after which the frequencies of the partials are kept constant at their original values again, and the sound stops 1 s later. When one listens to this sound, one initially hears the buzz but, when the modulation starts after one second, the sound segregates into two different streams. One stream consists of the modulated harmonics, which many listeners will associate with an alarm sound; the modulations are in the range of rhythm perception, so that this stream consists of a succession of twelve auditory units. The other stream consists of the harmonics with constant frequencies, the even harmonics of the original complex. These even harmonics form another harmonic complex with a fundamental frequency twice as high as that of the original, resulting in an 880-Hz buzz. After three seconds, when all harmonics are kept constant in frequency again, the harmonics re-integrate, and the original 440-Hz buzz is heard again.

In the demo of Fig. 4.10, the difference in frequency between two successive modulated harmonics remains the same in semitones. This ensures that the partials of


Fig. 4.10 Streaming through common frequency modulation. Same as Fig. 4.9, except that not the amplitudes but the frequencies of the odd harmonics are modulated. (Matlab) (demo)

Fig. 4.11 Streaming through common frequency modulation. Same as Fig. 4.10, except that the distance between the modulated partials remains the same in hertz during the modulations. As a consequence, the relation between the modulated partials is not harmonic. The result sounds the same as in the demo of Fig. 4.10, except that now the timbre of the modulated stream is different. (Matlab) (demo)

the modulated stream keep their harmonic relationship. In the next demo, of Fig. 4.11, it will be shown, however, that harmonicity is not a prerequisite for the integration of the partials of a complex tone. In Fig. 4.11, the distance in frequency between two successive modulated partials remains the same, not in semitones but in hertz, in this case 2 · 440 = 880 Hz. The consequence of this is that, during the modulations, the frequencies of the modulated partials are no longer integer multiples of a common fundamental frequency, so that the tones of this stream are no longer harmonic. When the distance between successive partials of a complex tone is constant in hertz, this tone is called spectrally regular. The effect of the inharmonic modulations of Fig. 4.11 is about the same as that of the harmonic modulations in the example of Fig. 4.10; the main difference is that the timbres of the streams consisting of the modulated partials
are different. The different roles that harmonicity and spectral regularity play in auditory-unit formation will be discussed in Sect. 4.8.2. The last two audio demos show that differences in frequency modulation can segregate sound components into different auditory streams. The sound components that change in a coherent way are grouped together and are segregated from the sound components that follow a different time course.
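
The demo of Fig. 4.10 can be sketched in MATLAB as follows. The frequency excursion of the modulation, the sampling rate, and the scaling are assumptions; the common, multiplicative frequency modulation of the odd harmonics between 1 and 4 s follows the description above. For the spectrally regular variant of Fig. 4.11, the odd partials would instead be shifted by a common amount in hertz.

    fs  = 44100;                               % sampling rate (Hz), assumed
    f0  = 440;
    dur = 5;
    t   = 0 : 1/fs : dur - 1/fs;
    exc = 0.06;                                % +/- 6% frequency excursion, assumed
    g   = 1 + exc * sin(2*pi*4*(t - 1));       % common 4-Hz frequency-scaling factor
    g(t < 1 | t >= 4) = 1;                     % modulate only between 1 and 4 s
    ph  = 2*pi * cumsum(g) / fs;               % harmonic k then runs at k*f0*g(t)
    y   = zeros(size(t));
    for k = 1:10
        if mod(k, 2) == 1
            y = y + sin(k * f0 * ph);          % odd harmonics follow the common FM
        else
            y = y + sin(2*pi*k*f0*t);          % even harmonics stay at constant frequency
        end
    end
    sound(0.1 * y / max(abs(y)), fs);

Because all odd harmonics are scaled by the same factor g(t), the intervals between them remain the same in semitones, so the modulated stream stays harmonic, as in Fig. 4.10.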

4.8.1.4 Common Spatial Origin

For most environmental sounds, the human listener is in general well able to hear where they come from. This ability is based on a number of acoustic differences in the way sound arrives at our ears, which will be discussed in detail in Chap. 9. It will be shown there that one of the important acoustic sources of information playing a role in sound localization is the difference in time and intensity with which sound arrives at the two ears. Naturally, sound arrives earlier and with a higher intensity at the ipsilateral ear than at the contralateral ear. One would expect this to play a role in auditory-unit formation, so that sound components arriving with the same interaural time and intensity differences would be attributed to the same auditory unit, and components that do not would be attributed to different auditory units. This, however, appears not to be correct in general. Remarkably, one must conclude that the role of interaural time and intensity differences—and of other acoustic sources of information that play a role in auditory localization—is quite small in auditory-unit formation. In a review of spatial stream segregation, Middlebrooks [52] summarizes: “published studies of spatial stream segregation measured with tasks that demanded integration have demonstrated minimal disruption of integration by spatial cues and only for cues corresponding to extreme spatial separation, as in the opposite-ear condition. More realistic differences in spatial cues between two sounds apparently are insufficient to disrupt fusion, especially when sounds are bound by common onset and/or common fundamental frequency” (p. 141). This suggests that perceived location is attributed to an auditory unit after that unit has been formed. For this reason, the subject will not be pursued further here. Spatial sources of information appear to be more important in sequential integration; this topic will get more attention in Chaps. 9 and 10.

4.8.1.5 Common Fate in Speech

It was shown that sound components that change in similar ways are likely to be integrated into one auditory unit, while sound components that behave differently are more likely to be allocated to different auditory units. This refers in the first place to the timing of the increases in intensity of the sound components. In many sounds, these increases in intensity occur at the start of the sound components, but in sounds such as spoken syllables, the main increases are often not at the start of the syllable. Indeed, in syllables that start with a consonant or a consonant cluster, there will
usually be greater increases in intensity at the onset of the vowel, when the main constrictions in the oral cavity are released. It was already argued that the beat of a syllable is closely related to these rapid increases in intensity just before and at the vowel onset. This will be further elaborated in Chap. 5. Some other demos illustrated that, in order to remain integrated, the amplitude and the frequency of the components of a sound have to change in a coherent way. Looking at the spectrograms of a speech utterance such as shown in Figs. 1.13 and 1.15 of Sect. 1.3, one can also see very rapid changes at other moments than at the vowel onsets. Nevertheless, the sound is perceived as one auditory stream. Apparently, other factors play a role in the integration and segregation of sound components. For instance, the regular structure due to the harmonicity of the voiced parts of the speech signal is another strong factor playing a role in integration (see next section). This regular structure and the more or less synchronous increases in intensity at the syllable onsets are possibly two decisive factors that play a role in the integration of all speech components into one auditory stream, overruling other possible competing factors that promote segregation. This will be discussed more elaborately in Chap. 5. Another interesting phenomenon is demonstrated in the next example, shown in Fig. 4.12. The figure shows the pitch contour of a synthesized vowel, the vowel /a/. The pitch contour is fixed at 150 Hz in the first half of the demo, while the pitch frequency fluctuates randomly around 150 Hz in the second half. The sound with the flat contour sounds buzzy and robotic; only at the beginning does it sound like the vowel /a/, but it then loses its identity [82]. The sound with the micro modulation, on the other hand, sounds much fuller, more like a human vowel. A more elaborate example, with three different synthetic vowels sung on different harmonic frequencies, can be heard on track 24 of the CD by Bregman and Ahad [6]. Summerfield et al. [82] explain this loss of richness in the unmodulated part by peripheral adaptation. Another explanation is that, by the common frequency modulation of the harmonics, a larger part of the tonotopic array is scanned, resulting in a richer vowel sound. Remarkably, it appears that the addition of common frequency modulations to synthetic vowel sounds does not improve vowel identification [45]. On the

Fig. 4.12 Pitch contour of a vowel synthesized without micro modulation, left-hand side, and with micro modulation, right-hand side. Note the richer, more vowel-like timbre in the second part of the sound. (Adapted from [48] and [45]). (Matlab) (demo)


other hand, the coherent amplitude and frequency modulations in speech not only contribute to the richness of the vowel sounds. It has been shown that, especially in noisy conditions, they also contribute to speech recognition [95].
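
A rough impression of the effect of micro modulation as in Fig. 4.12 can be obtained with the following Matlab sketch. The formant frequencies and bandwidths, the jitter depth of about 0.3 semitone, and the way the random contour is smoothed are assumptions made for illustration only; they are not the parameter values of the original demo or of the studies cited above.

% Vowel-like complex: flat pitch contour in the first half, micro modulation in the second.
fs = 44100; dur = 4; f0 = 150;
t = (0:1/fs:dur-1/fs)';
jit = randn(ceil(dur*20), 1);                % assumed: about 20 random values per second
jit = interp1(linspace(0, dur, numel(jit)), jit, t, 'pchip');
jit = 0.3*jit/max(abs(jit));                 % assumed jitter depth: about 0.3 semitone
jit(t < dur/2) = 0;                          % first half: flat contour
F0 = f0 * 2.^(jit/12);                       % instantaneous fundamental frequency (Hz)
form = [700 1100 2600]; bw = [130 150 300];  % crude /a/-like formants (assumed)
x = zeros(size(t));
for k = 1:30
    a = sum(exp(-0.5*((k*f0 - form)./bw).^2));   % fixed spectral-envelope weight
    x = x + a*sin(2*pi*cumsum(k*F0)/fs);
end
sound(x/max(abs(x)), fs)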

4.8.1.6 Common Fate in Music

Western music is mostly performed by various instruments playing together. In general, the musicians will try to synchronize the beats of the concurrent notes they play as much as possible. In the case of perfect synchrony, this might in principle lead to the perceptual integration of the partials of different instruments. This, however, almost never actually happens. Indeed, Rasch [67] showed that deviations from synchrony in ensemble playing typically amount to 30–50 ms. These deviations are, on the one hand, large enough to prevent partials of tones played by different musicians from being perceptually integrated, so that the listener can indeed distinguish the notes of different instruments played on the same beat of the bar. On the other hand, these deviations are too short to hear which instrument precedes the other [68].
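
The order of magnitude of such asynchronies can be made audible with a simple Matlab sketch such as the one below, which plays two five-harmonic complex tones whose onsets differ by 40 ms. The fundamental frequencies, the duration, and the ramps are arbitrary choices for illustration; they are not stimuli from the experiments by Rasch [67, 68].

% Two complex tones with a 40-ms onset asynchrony (sketch).
fs = 44100; dur = 0.5; t = (0:1/fs:dur-1/fs)';
mk = @(f0) sum(sin(2*pi*t*(f0*(1:5))), 2) .* min(1, min(t, dur-t)/0.02);
toneA = mk(440); toneB = mk(554.4);        % roughly a major third apart
delay = round(0.040*fs);                   % 40-ms onset asynchrony
x = [toneA; zeros(delay, 1)] + [zeros(delay, 1); toneB];
sound(x/max(abs(x)), fs)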

4.8.2 Spectral Regularity

The principle of regularity means that elements arranged in a regular pattern integrate better into a perceptual object than elements arranged in an irregular pattern. The question now is whether spectral regularity plays a role in simultaneous integration. In previous chapters, the roles played by the logarithmic frequency scale in music and by the Cam scale in many auditory phenomena such as masking and loudness have been indicated. It will now become apparent that the linear frequency scale cannot be ignored in hearing research either. Indeed, spectral regularity is the property that the spectral components of a sound are equidistant on a linear frequency scale! Naturally, regularity may be an important property of many harmonic sounds, but harmonicity, defined in Sect. 1.2, is not the same as regularity. Sounds are harmonic when the frequencies of the components are all multiples of a common fundamental frequency in the pitch range. Hence, sounds with harmonics the ranks of which are primes, e.g., 200, 300, 500, 700, 1100, 1300, 1700 and 1900 Hz, are harmonic, but the partials are not regularly spaced on a linear scale. An example of a regular but not harmonic sound is a sound with what are called shifted harmonics, i.e., partials that deviate from harmonic positions by a specific number of hertz. For instance, a sound with components with frequencies 220, 420, 620, 820, 1020, and 1220 Hz is regular, since the partials are equally spaced by 200 Hz, but they are not harmonic, since the common fundamental frequency, 20 Hz, is below the pitch range of 50 to about 5000 Hz. Of course, sounds can be both regular and harmonic, since sounds consisting of the sum of a number of consecutive harmonics are also regular.
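
This distinction can be made concrete in a few lines of Matlab. The snippet below, a minimal sketch and not part of the book's demos, constructs the two example spectra just described and tests whether their components are equally spaced on a linear frequency scale; the numerical tolerance is an arbitrary choice.

% Harmonic but irregular: prime-rank harmonics of 100 Hz.
fHarm = 100*[2 3 5 7 11 13 17 19];
% Regular but inharmonic: harmonics of 200 Hz shifted upward by 20 Hz.
fReg = 20 + (200:200:1200);                    % 220, 420, ..., 1220 Hz
% A spectrum is regular when all spacings are equal (zero second differences).
isRegular = @(f) all(abs(diff(f, 2)) < 1e-9);  % tolerance is an arbitrary choice
disp([isRegular(fHarm) isRegular(fReg)])       % displays 0 1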


Fig. 4.13 Complex tones with frequencies that deviate more and more from harmonic positions. The deviations are drawn from normal distributions with standard deviations of 0, 1, 3, 10, 30, 10, 3, 1 and 0%, respectively. Adapted from track 3 of the CD by Plomp [66]. (Matlab) (demo)

Now, it will be demonstrated that the spectral regularity of such a harmonic complex strongly contributes to its integration into one auditory unit, in this case a tone. This is done in the next demo of Fig. 4.13, in which the partials of a tone have frequencies that are more or less displaced from their exact harmonic positions. In this example, ten sinusoids are added. In the first tone, the frequencies of these ten sinusoids are exact multiples of 440 Hz; but in the next four tones, the frequencies deviate successively from their harmonic positions according to a normal distribution with standard deviations of 1, 3, 10 and 30%. Subsequently, the deviations are reduced again, via 10, 3 and 1%, finally back to 0%. In the lower panel of Fig. 4.13, the waveforms are shown. These waveforms are so compressed that only the envelopes of the tones can be distinguished. The envelopes of the first and the last tone are flat, suggesting that the first and the last tone are periodic, which is indeed the case since the components have frequencies that are exact integer multiples of 440 Hz. In the other tones, the signal is no longer periodic due to the deviations from harmonicity of the partials. The result is that the tones have irregular envelopes. When one listens to these sounds, one will not only experience the tones as sounding “out of tune”, but one will also hear more partials pop out of the complex as the deviations from spectral regularity get larger. In other words, the larger the deviation from regularity, the weaker the integration between the partials, resulting in the percept of more than one auditory unit. In the next example, shown in Fig. 4.14, only one of ten regular partials is placed out of its spectrally regular position. The distance between the partials is 440 Hz, but a constant frequency is added to each partial to make the tones as inharmonic as possible. This is realized by increasing their frequencies by F0 times the golden ratio, (√5 − 1)/2, so by 440 · 0.618 = 271.9 Hz. This guarantees that the tone is maximally inharmonic [84], while maintaining the regularity of its spectrum. As in the demo of the previous Fig. 4.13, first a buzz is heard, but that buzz is now inharmonic. It consists of the first ten harmonics of 440 Hz shifted upward by 271.9 Hz, but then


Fig. 4.14 Complex of ten regular but inharmonic partials all of which, except one, have frequencies that are multiples of 440 Hz shifted upward by about 271.9 Hz. The third partial is placed more and more out of its regular position and back again. The deviations are 0, 1, 3, 10, 30, 50, 30, 10, 3, 1 and 0%. The tones within the rectangle are heard as a separate series of pure tones. Note the fluctuating envelopes of the tones in the lower panel, which are not perceived as fluctuations. (Matlab) (demo)

the third partial is first increased in frequency and then decreased until it returns to its regular position. This mistuning is 0, 1, 3, 10, 30, 50, 30, 10, 3, 1 and 0% of 440 Hz, respectively. As soon as the third partial deviates by more than about 1–3% from its regular position, it starts popping out of the complex tone [55]. It is then perceived as a separate pure tone with a pitch corresponding to its own frequency. This means that, when the deviation of the partial is more than about 1–3%, indicated by the rectangle in Fig. 4.14, one does not hear one sound, but two: the deviant partial popping out as a pure tone and the complex tone consisting of the remaining regular partials. When, at the end of the series, the frequency of the deviating partial returns to its regular position, this partial is reintegrated into the complex tone, and one no longer hears two sounds but just the same integrated complex tone as heard in the beginning. In other words, the first tone and the last tone of the series are perceived as one auditory unit. When one of the partials deviates sufficiently from its regular position, it is no longer integrated into the set of regular partials but pops out as a separate auditory unit. So, in the demo of Fig. 4.14 two auditory streams are perceived. The first stream consists of the sequence of buzzes, the second of the sequence of pure tones with a pitch that first increases and then decreases again. This demo shows that, when one partial in an otherwise regular complex deviates from its regular position, it is not, or much less well, integrated perceptually into the auditory unit formed by the other partials that do have regular positions. This, however, is not an all-or-none phenomenon. For a harmonic complex with a well-defined pitch, Moore et al. [54] showed that, for deviations smaller than 3%—so when the mistuned partial is still well integrated into the complex—it keeps on contributing to the pitch of the complex tone. This expresses itself in a small but significant shift in the pitch frequency of the complex tone. This shift is upward when the deviant
frequency is higher than its regular position and downward when it is lower. When the mistuning is more than 4%, the contribution of the mistuned partial to the pitch of the complex tone diminishes, but remains significant up to 8%. A significant finding in this respect is that the pitch of the mistuned partial itself is shifted from the value it would have when played in isolation [34, 35]. For a mistuning up to 4%, the pitch shift of the mistuned harmonic is somewhat “exaggerated”, which means that, when the mistuning is upward, the pitch frequency of the mistuned harmonic is higher than its actual frequency; when the mistuning is downward, the pitch frequency of the mistuned harmonic is lower than its actual frequency. It looks as if a little bit of information arising from the excitation induced by the mistuned partial close to the harmonic position of the complex still contributes to the pitch of that complex, and no longer contributes to the pitch of the mistuned partial. The remaining excitation induced by the mistuned partial has an average frequency that is farther away from the harmonic position, thus inducing the exaggerated pitch shift [39]. This indicates that the auditory information supplied by the mistuned harmonic is partitioned over two auditory units. Information close to the harmonic position of the harmonic tone contributes to the pitch—and other auditory attributes—of the complex tone. The remaining information contributes to the perceptual attributes of the mistuned partial. One may relate this with what will be said in the following Sect. 4.8.3 about the principle of exclusive allocation. The studies by Moore et al. [54, 55] were carried out with harmonic tones with fundamental frequencies of 100, 200, and 400 Hz, and the results described by them were obtained for a mistuning of the first six harmonics. For higher harmonics the authors describe something different, which will be demonstrated in Fig. 4.15, not for a harmonic, but for a spectrally regular tone. It is the same as Fig. 4.14 except that the frequencies of all partials are increased by 6 · 440 = 2640 Hz. Consequently, the partials are no longer resolved. The third partial is now shifted out of the regular pattern in the same way as in Fig. 4.14. (Note that Fig. 4.15 looks the same as Fig. 4.14, except that the ordinate is different.) In spite of this similarity, the percepts are quite different. In the sound presented in Fig. 4.15, the deviant partial does not stand out as a different auditory unit, but the deviation induces the percept of fluctuations corresponding to the envelope of the tones as shown in the lower panel of Fig. 4.15. Depending on their frequencies, these fluctuations are perceived as beats or roughness [58] [55, p. 480]. The important difference between the sounds of Figs. 4.14 and 4.15 is that the sounds consist of resolved partials in the former but of unresolved partials in the latter. In the former demo, the deviant partial is no longer processed as part of the regular complex tone, but perceived as a separate pure tone, whereas in the latter demo the deviant partial remains part of the complex and its interferences with other partials of the complex contribute to the way the complex tone is perceived. These examples indicate that spectral regularity is an important organizing principle only for resolved frequency components. For unresolved components, deviations from regularity do not result in segregation of the deviant components but result in perceivable fluctuations. 
Note, by the way, that this provides another example of the pivotal role played by the bandwidth of the auditory filters in auditory processing.
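
For the interested reader, a single tone from the demo of Fig. 4.14 can be sketched in Matlab as follows. Only the frequencies of the partials follow the description in the text; the duration, the sample rate, and the 20-ms onset and offset ramps are assumptions, not the parameters of the original demo.

% One tone of the mistuned-partial demo of Fig. 4.14 (sketch).
fs = 44100; dur = 1; t = (0:1/fs:dur-1/fs)';
f0 = 440; shift = f0*(sqrt(5)-1)/2;        % about 271.9 Hz
mistune = 0.10;                            % mistuning of the third partial (10% of 440 Hz)
f = (1:10)*f0 + shift;                     % ten regular, inharmonic partials
f(3) = f(3) + mistune*f0;                  % displace the third partial
x = sum(sin(2*pi*t*f), 2);                 % add the ten partials
ramp = min(1, min(t, dur-t)/0.02);         % assumed 20-ms on/off ramps
x = ramp .* x / max(abs(x));
sound(x, fs)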


Fig. 4.15 Same as previous figure but now the harmonics of 440 Hz with rank 6 to 15 are shifted in frequency by about 271.9 Hz. The ordinate now runs from 2.64 to 7.64 kHz. The envelopes of the tones are identical to those in the previous figure, but are now perceived as fluctuations, which in the middle tones give rise to some roughness. The deviant harmonic does not pop out from the rest of the complex. (Matlab) (demo)

Fig. 4.16 Shifted harmonics. The first and the last tone consist of the sum of the second to the tenth harmonic of 440 Hz. In the intermediate tones, the frequencies of the partials are shifted upward in hertz by a certain percentage of 440 Hz. Including the first and the last tone, these shifts are 0, 1, 3, 10, 30, 50, 30, 10, 3, 1 and 0%, respectively. (Matlab) (demo)

Naturally, the transition from resolved harmonics, those with harmonic rank ranging from 1 to about 6, to unresolved harmonics, those with rank higher than about 6, is not abrupt. The higher the rank of the harmonics, the closer to each other they will be on the tonotopic array. In practice, the transition is gradual and the percepts may not be unequivocal in the intermediate region. This section ends with two demos. In Fig. 4.16, a series of spectrally regular tones is presented, each consisting of nine equidistant partials. The spacing between the partials is 440 Hz. The first and the last tone consist of the second to the tenth harmonic of 440 Hz and, hence, are harmonic. The other tones are shifted upwards in frequency by a fixed percentage of 440 Hz. Including the first and the last tone, these shifts are 0, 1, 3, 10, 30, 50, 30, 10, 3, 1 and 0%, respectively. In spite of the fact that


Fig. 4.17 Same as previous figure, but now the first partial is included. Especially in the intermediate tones the first harmonic pops out as a separate auditory stream. (Matlab) (demo)

these intermediate tones consisting of shifted harmonics are not harmonic, they are perceived as more or less coherent sounds in which no separate partials pop out as in the demos of Figs. 4.13 and 4.14. The demo of Fig. 4.16 illustrates that harmonicity is not a prerequisite for simultaneous integration. Shifting the frequency of the partials by a fixed amount in Hz hardly affects the perceptual integrity of the tones. There are some complications, however. In the demo of Fig. 4.16, the first partial was not included. This was done because it appears to pop out of the complex when it is included. This is illustrated in Fig. 4.17. It is the same as Fig. 4.16, but now the first partial, 440 Hz in the first and the last tone, is included. The lowest partial now pops out as a separate tone, especially in the intermediate tones. It is difficult to present an explanation for this phenomenon. One explanation may be that the distance on the tonotopic array between the partials becomes larger the lower the rank of the partials. So, the distance in Cams between the first and the second lowest partial of the tone complexes is relatively large. This may explain the exclusion of the lowest partial from the regular complex. Roberts and Brunstrom [73] and Roberts and Holmes [74] argue that, in the first and the last tone in the demo of Fig. 4.17, which are both harmonic, the integration of the lowest with the higher harmonics is not so much determined by spectral regularity as by the simple harmonic octave relation of 1:2 between the first and the second partial. This simple harmonic relation is lost for the intermediate tones with shifted harmonics. It is concluded that harmonicity and regularity play very important but different roles in sound perception. The partials of sounds with regular spectra are in general perceptually integrated into one auditory unit [71–73]. Tones that are both regular and harmonic are generally perceived as one auditory unit with a well-defined pitch. For such sounds, the pitch frequency is well approximated by the fundamental frequency of the harmonic complex. Resolved frequency components that deviate from the regular pattern by more than 1–3% are no longer completely integrated with the regular components, and are more and more perceived as separate auditory units. When the mistuning is more than about 8%, this separation is complete [54].
Moreover, Roberts and Bailey [70] and Ciocca [16] found evidence that, indeed, the harmonic pattern and not the regular pattern of a complex tone determines the pitch of complex sounds. The fact that spectral regularity plays an important role in spectral integration does not necessarily imply that complex tones with partials not in a regular pattern are not perceived as one auditory unit. For instance, tones with harmonic but not regular spectra are mostly also perceived as one coherent auditory unit, although the integration of the partials will not be as strong as for harmonic partials with a regular spectrum. These issues are further discussed by Carlyon and Gockel [13], Micheyl and Oxenham [51], and Roberts [69]. An autocorrelation model for regularity perception is presented by Roberts and Brunstrom [72].
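
Analogously to the earlier sketch, a single tone from the shifted-harmonics demo of Fig. 4.16 can be sketched in Matlab as follows. This is a minimal sketch, not the script of the original demo; only the partial frequencies follow the description in the text, while the duration, the sample rate, and the 20-ms onset and offset ramps are assumptions.

% One tone of the shifted-harmonics demo of Fig. 4.16 (sketch).
fs = 44100; dur = 1; t = (0:1/fs:dur-1/fs)';
f0 = 440; shiftPct = 0.30;                 % e.g., the 30% shift of the demo
f = (2:10)*f0 + shiftPct*f0;               % regular spacing of 440 Hz is kept
x = sum(sin(2*pi*t*f), 2);                 % add the nine partials
ramp = min(1, min(t, dur-t)/0.02);         % assumed 20-ms on/off ramps
x = ramp .* x / max(abs(x));
sound(x, fs)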

4.8.3 Exclusive Allocation

According to Bregman [4, p. 12], “The exclusive allocation principle says that a sensory element should not be used in more than one description at a time.” When this Gestalt principle of exclusive allocation is applied to hearing, this means that, when a sound component integrates perceptually with some other sound components into an auditory unit, the auditory information generated by that sound component contributes only to the perceptual attributes of that auditory unit. Also, once a sound component segregates perceptually from other sound components, it no longer contributes to the loudness, the pitch, or the timbre of the auditory unit or units formed by the other components. An example demonstrating the exclusive-allocation principle has been given in the demo of Fig. 4.9. In that demo, the odd harmonics of a harmonic complex are amplitude modulated, while the even harmonics remain constant in amplitude. This induces the odd harmonics to segregate from the even harmonics. Consequently, the even harmonics form another auditory unit. The principle of exclusive allocation then implies that the odd harmonics no longer contribute to the pitch of the auditory unit formed by the even harmonics. Hence, the pitch frequency of the latter auditory unit is twice as high as the pitch of the stream of auditory units formed by the modulated odd harmonics. The reader may check that this is correct. Similar effects can be heard in the demos shown in Figs. 4.10 and 4.11, in which the odd harmonics are not modulated in amplitude but in frequency. The definition presented above is quite rigorous, and implies that a stimulus component is either part of one perceptual unit or of another. The situation, however, is not always so strict. Each sound component is processed by the auditory system in such a way that the information is distributed in frequency and time over an array of auditory filters. Part of this information can contribute to one auditory unit and another part to another auditory unit. The effect of each contribution must then correspond to the amount of information attributed to each auditory unit. This has been called energy trading by Shinn-Cunningham et al. [80] and Snyder et al. [81]. It is emphasized here that this principle operates not so much on the total amount of acoustic energy in the signal but on the distribution of information in the course of
auditory processing. This is why it will be called information trading in this book. An example of information trading has been mentioned in discussing the mistuned harmonics in the demo of Fig. 4.14. There, the allocation of auditory information from an upwardly mistuned harmonic in an otherwise harmonic complex was discussed. It was argued that, for small mistuning, part of this information, especially the lower-frequency part close to the harmonic position in the complex, is allocated to the harmonic complex thus increasing its pitch. The remaining information represents, on average, a higher frequency so that the pitch of the mistuned partial is slightly higher than when played in isolation as found by Hartmann and Doty [35] and Hartmann [34]. Another instance at which information trading may be operative is in masking experiments. In many masking experiments the aim of the study is to investigate whether listeners can hear a target sound in the presence of a masker. This is done by checking whether a listener can distinguish between the situation in which the target sound is present and the situation in which it is absent. A problem of this set-up is that listeners focus their attention on the audibility of the target and ignore the possible change in the masker due to the integration of the target with the masker. This change of the masker may consist of a change in loudness, pitch, or timbre. If listeners are not explicitly asked to pay attention to this change in a perceptual attribute of the masker, they may miss it. This problem is elaborately described by Bregman [4, pp. 314–320]. He concludes that, if a stimulus component cannot be perceived as a separate component, it does not necessarily mean that this component is masked. It may perceptually fuse with the other components and thus contribute to the perceptual properties of the perceptual object that results from the fusion process. This contribution can be so small that its perceptual effect can be difficult to hear, certainly when the perceiver tries hard to focus on the target signal, and is not paying attention to the possible changes in a perceptual attribute of the masker. This will be discussed in more detail in Chap. 10, especially in Sect. 10.13.2. In summary, the principle of exclusive allocation claims that perceptual information as it enters the perceptual system is partitioned over the perceptual units in such a way that information from a sound component that is allocated to one perceptual unit cannot additionally contribute to another perceptual unit. Either in this strict form or in the more moderate form of information trading, various examples of this principle will be discussed further on in this book.

4.9 Consequences of Auditory-Unit Formation

4.9.1 Loss of Identity of Constituent Components

The main consequence of auditory-unit formation is that those sound components that integrate into one auditory unit lose their auditory identity. In other words, frequency components are no longer audible as separate sounds with the auditory attributes
they would have when presented in isolation. This loss of auditory identity is not always absolute. When a harmonic tone is synthesized by adding sinusoids that start and stop simultaneously and remain constant in between, it remains possible to hear out the lower harmonics. Indeed, for steady tones with equal-amplitude harmonics, Plomp [64] and Plomp and Mimpen [65] found that the first five harmonics can be “heard out” as separate tones. Again based on listeners’ ability to hear out partials in inharmonic complexes, they suggested that hearing out a partial is only possible when it is separated from adjacent partials by more than one critical bandwidth. This is confirmed by Moore and Ohgushi [57] for inharmonic tone complexes in which the frequencies of the partials are equidistant on the Cam scale. They found that a partial was correctly heard out in 75% of all presentations when its distance from adjacent partials was 1.25 Cam. When the distance was reduced to 1 Cam, performance fell almost to chance. A review of the ability to hear out partials and the mechanism behind this ability is presented by Moore and Gockel [56]. In general, the authors of these experiments mention that hearing out partials is a difficult task that requires training. Moreover, in the experiments carried out to check listeners’ ability to hear out partials, the complex tone is preceded by a capture tone, a pure tone with a frequency that is either equal to the frequency of one of the partials or not. The task of the listeners is then to indicate whether or not the complex contains a partial with the same frequency as the capture tone. In Moore and Ohgushi [57], the capture tone was either somewhat lower or somewhat higher in frequency than the target harmonic, and listeners were asked to indicate whether the pitch of the target harmonic was higher or lower in frequency than the capture tone. If they could do so, they were supposed to be able to hear out the partial. The confounding factor in this experimental set-up is that the pure tone may sequentially integrate with the partial of the complex that is nearest in frequency, thus perceptually isolating it from the rest of the partials. The ability to hear out a partial measured in this way may thus, at least partly, be the result of sequential integration. This phenomenon of the capturing of a partial by a preceding tone will be discussed in much more detail in Chap. 10. It is clear that sound components integrated into one auditory unit lose at least a large part of their identity. Even in steady continuous complex tones, hearing out partials requires preceding capture tones or training. If other factors, such as common frequency or amplitude modulation, contribute to the integration of partials, the fusion of the constituent partials is even stronger, so that the partials almost completely lose their identity, and it becomes virtually impossible to hear them out, as is, e.g., the case for fluent speech.
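
The spacing used by Moore and Ohgushi [57] can be illustrated with the following Matlab sketch, which generates an inharmonic complex whose partials are equidistant on the Cam scale. The conversion between hertz and Cam is the common expression for the ERB-number scale (cf. Sect. 3.7.3); the frequency of the lowest partial, the duration, and the sample rate are assumptions.

% Inharmonic complex with partials equidistant on the Cam (ERB-number) scale (sketch).
hz2cam = @(f) 21.4*log10(4.37*f/1000 + 1);     % hertz to Cam
cam2hz = @(c) (10.^(c/21.4) - 1)*1000/4.37;    % Cam to hertz
fs = 44100; dur = 1; t = (0:1/fs:dur-1/fs)';
cams = hz2cam(500) + (0:9)*1.25;               % ten partials, 1.25 Cam apart (lowest partial 500 Hz, assumed)
f = cam2hz(cams);                              % partial frequencies in hertz
x = sum(sin(2*pi*t*f), 2);
sound(x/max(abs(x)), fs)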

4.9.2 The Emergence of Perceptual Attributes

One of the most remarkable consequences of auditory-unit formation is the emergence of perceptual attributes. In this book, Chap. 4 on auditory-unit formation deliberately precedes the chapters about the perceptual attributes of the auditory units such
as loudness and pitch. This suggests that auditory-unit formation precedes, e.g., pitch estimation or loudness perception. Indeed, spectral regularity and common fate appeared to be the main organizing principles in integrating spectral components into an auditory unit. This makes it plausible that, first, auditory units are formed based on these organizing principles, and that, second, based on the information used in this process, the perceptual attributes emerge. One of those perceptual attributes is pitch. The presence of a common periodicity in those frequency channels that contribute to the formation of the auditory unit then defines the perceived pitch [51]. This view is in line with models that assume two phases in auditory processing, a prerepresentational phase and a representational phase, e.g., Näätänen and Winkler [59]. In the prerepresentational phase, “features” are extracted from the ascending auditory information, the “afferent activation pattern”. These features form traces that persist for a relatively short period, less than, say, a few hundred milliseconds, and are distributed over different low-level locations in the central nervous system. Each trace contains only partial information that is not directly accessible for higher-level processes such as attention. One may think of information relevant for loudness perception, of information used in pitch perception, or of information used in sound localization. These different kinds of information are processed in different pathways, and the corresponding feature traces persist only within these different pathways. The formation of an auditory unit then requires that these different feature traces are re-assembled, and this happens in the representational phase. In this representational phase, the information from the feature traces is integrated into a “unitary auditory event or percept” [59, p. 846], i.e., what is called an auditory unit in this book. In contrast with the individual feature traces, the result of this representational phase, so the auditory unit, is accessible for conscious perception and attention. Näätänen and Winkler [59] describe the formation of these unitary auditory events mainly as a bottom-up process. Various authors, e.g., Carlyon [12], Darwin [20], and Mill et al. [53], have subscribed to this view in different contexts. McLachlan and Wilson [50] differ in this respect. They propose that long-term memory interacts with the early processing stages. They posit: “Pitch, loudness, and location are all processed in parallel in tonotopic arrays throughout the auditory pathways” (p. 176), and argue that these processes can be modulated by top-down processes that are involved in sound identification. These top-down processes are initiated by onset detection signalling the arrival of new auditory information [50, p. 179]. In their model, Näätänen and Winkler [59] mention various analogies between auditory- and visual-object formation, but also point out an interesting difference between a visual and an auditory object. In the visual system, an object is primarily represented in space. Unlike an auditory unit, it does not have a perceptual moment of occurrence.
In correspondence with the hypothesis of Näätänen and Winkler [59] and McLachlan and Wilson [50] that the auditory system represents auditory units in time, it is assumed in this book that an auditory unit is in the first place characterized by its beat. The beat will, therefore, be the first perceptual attribute of an auditory unit to be discussed here, before the perception of timbre, loudness, pitch, and perceived location.


4.10 Performance of the Auditory-Unit-Formation System

It has been argued that the function of the auditory scene analysis (ASA) system is to partition incoming acoustic information in such a way that the resulting perceived auditory units and streams afford listeners a meaningful interpretation of what happens around them, meaningful in the sense that the perceived objects correspond as much as possible to things happening in the physical environment of the listeners. This system can never be perfect, if only for the simple reason that there is no simple definition of what is correct. When recordings of two different speakers are played over one loudspeaker, there is only one physical sound source. In spite of that, one hears two different speakers. When a recording of one human speaker is played over two different loudspeakers in front of the listeners, they hear the sound of that one speaker coming from somewhere between the two loudspeakers, while acoustically there are two physical sound sources. This situation is more extreme when one listens to a recording of an orchestra or to the wind in the trees. In those situations, one does not perceive every single acoustical event as a different auditory unit, but the auditory information arriving at our ears is recombined in such a way that one hears the melodies of the music and the wind in the trees, respectively. It is concluded that auditory-unit formation can result in more perceived units than there are sound sources, whereas in other cases the auditory-unit-formation process will fuse information from many acoustically different sound sources into many fewer auditory units than there are sound sources, or even into just one auditory unit. The question now is whether the auditory system has a bias for one or the other. In general, the human ASA system appears to have a bias toward integration [4]. Only when there is evidence that sound components are generated by different acoustic events does the system segregate the corresponding components. In the situation of auditory-unit formation, segregation occurs when components evolve differently in time, or when the frequency of one or more components does not match the regular frequency pattern of the other components. Bregman [4] also introduces a functional argument: “Having integration as the default makes sense, since without specific evidence as to how to break down the spectrum into parts, an infinite number of subdivisions can be made, each equally plausible, with no principled way to choose among them. The conservative approach is to treat the spectrum as a unitary block except when there is specific evidence that this interpretation is wrong” (p. 378).

4.11 Concluding Remark

Up to now, auditory-unit formation has been described as the result of the integration of simultaneous sound components. Preceding sound has been ignored; in other words, it has been assumed that silence precedes the event of the formation of the auditory unit. This will appear to be much too simple. It will appear that what comes before and
after the formation of an auditory unit can strongly affect the result of the auditory-unit-formation process. This is the domain of auditory-stream formation, in which it will be shown that factors promoting sequential integration compete with factors that promote simultaneous integration. This will be discussed in Chap. 10.

References 1. Barbosa PA, Bailly G (1994) Characterisation of rhythmic patterns for text-to-speech synthesis. Speech Commun 15(1–2):127–137. https://doi.org/10.1016/0167-6393(94)90047-7. 2. Bizley JK, Cohen YE (2013) The what, where and how of auditory-object perception. Nat Rev Neurosci 14(10):693–707. https://doi.org/10.1038/nrn3565. 3. Blauert J (1997) Spatial hearing: the psychophysics of human sound localization, Revised. MIT Press, Cambridge, MA 4. Bregman AS (1990) Auditory scene analysis: the perceptual organization of sound. MIT Press, Cambridge, MA 5. Bregman AS (2008) Rhythms emerge from the perceptual grouping of acoustic components. In: Proceedings of Fechner Day, vol 24 (1), pp 13–16. http://proceedings.fechnerday.com/index. php/proceedings/article/view/163 6. Bregman AS, Ahad PA (1996) Demonstrations of scene analysis: the perceptual organization of sound, Montreal, Canada. http://webpages.mcgill.ca/staff/Group2/abregm1/web/ downloadsdl.htm 7. Bregman AS, Ahad PA, Kim J (1994) Resetting the pitch-analysis system. 2. Role of sudden onsets and offsets in the perception of individual components in a cluster of overlapping tones. J Acoust Soc Am 96(5):2694–2703. https://doi.org/10.1121/1.411277 8. Bregman AS, Campbell J (1971) Primary auditory stream segregation and perception of order in rapid sequences of tones. J Exp Psychol 89(2):244–249. https://doi.org/10.1037/h0031163 9. Brooks JL (2015) Traditional and new principles of perceptual grouping. In: Wagemans J (ed) The oxford handbook of perceptual organization, Oxford University Press, Oxford, UK, Chap. 4, p 31. https://kar.kent.ac.uk/35324/1/Brooks-GroupingChapter-OUPHandbookREPOSITORY.pdf 10. Burghardt H (1973) Die subjektive dauer schmalbandiger schalle bei verschiedenen frequenzlagen. Acust 28(5):278–284 11. Cabrera D, Pop C, Jeong D (2006) Auditory room size perception: a comparison of real versus binaural sound-fields. In: Proceedings of the 1st Australasian acoustic societies’ conference (Acoustics 2000), Christchurch, New Zealand, pp 417–422. https://www.acoustics.asn. au/conference_proceedings/AASNZ2006/papers/p107.pdf 12. Carlyon RP (2004) How the brain separates sounds. Trends Cogn Sci 8(10):465–471. https:// doi.org/10.1016/j.tics.2004.08.008 13. Carlyon RP, Gockel HE (2008) Effects of harmonicity and regularity on the perception of sound sources. In: Yost WA, Popper AN, Fay RR (eds) Auditory perception of sound sources, Springer Science+Business Media, New York, Chap. 7, pp 191–213. https://doi.org/10.1007/ 978-0-387-71305-2_7 14. Carlyon RP et al (2009) Changes in the perceived duration of a narrowband sound induced by a preceding stimulus. J Exp Psychol Hum Percept Perform 35(6):1898–1912. https://doi.org/ 10.1037/a0015018 15. Chen L (2019) Discrimination of empty and filled intervals marked by auditory signals with different durations and directions of intensity change. PsyCh J 8(2):187–202. https://doi.org/ 10.1002/pchj.267 16. Ciocca V (1999) Evidence against an effect of grouping by spectral regularity on the perception of virtual pitch. J Acoust Soc Am 106(5):2746–2751. https://doi.org/10.1121/1.428102
17. Craig JC (1973) A constant error in the perception of brief temporal intervals. Percept & Psychophys 13(1):99–104. https://doi.org/10.3758/BF03207241. 18. Crum PAC, Bregman AS (2006) Effects of unit formation on the perception of a changing sound. Q J Exp Psychol 59(3):543–556. https://doi.org/10.1080/02724980443000737 19. Darwin CJ (1981) Perceptual grouping of speech components differing in fundamental frequency and onset-time. Q J Exp Psychol 24(4):185–207. https://doi.org/10.1080/ 14640748108400785 20. Darwin CJ (2005) Simultaneous grouping and auditory continuity. Percept & Psychophys 67(8):1384–1390. https://doi.org/10.3758/BF03193643. 21. Darwin CJ, Ciocca V (1992) Grouping in pitch perception: Effects of onset asynchrony and ear of presentation of a mistuned component. J Acoust Soc Am 91(6):3381–3390. https://doi. org/10.1121/1.402828 22. Denham SL, Winkler I (2015) Auditory perceptual organization. In: Jaeger D, Jung R (eds) Encyclopedia of computational neuro-science. Springer Science+Business Media Inc, New York, NY, pp 240–252 23. Donaldson MJ, Yamamoto N (2016) Detection of object onsets and offsets: does the primacy of onset persist even with bias for detecting offset?. Atten Percept Psychophys 78(7):1901–1915. https://doi.org/10.3758/s13414-016-1185-5 24. Eggermont J (1969) Location of the syllable beat in routine scansion recitations of a dutch poem. IPO Annu Prog Rep 4:60–64 25. Gordon JW (1987) The perceptual attack time of musical tones. J Acoust Soc Am 82(1):88–105. https://doi.org/10.1121/1.395441 26. Grassi M, Darwin CJ (2006) The subjective duration of ramped and damped sounds. Percept Psychophys 68(8):1382–1392. https://doi.org/10.3758/BF03193737 27. Grassi M, Mioni G (2020) Why are damped sounds perceived as shorter than ramped sounds?. Atten Percept Psychophys 82(6):2775–2784. https://doi.org/10.3758/s13414-020-02059-2 28. Green EJ (2019) A theory of perceptual objects. Philos Phenomenol Res 99(3):663–693. https:// doi.org/10.1111/phpr.12521 29. Gregg MK, Samuel AG (2012) Feature assignment in perception of auditory figure. J Exp Psychol Hum Percept Perform 38(4):998–1013. https://doi.org/10.1037/a0026789 30. Gregory RL (1980) Perceptions as hypotheses. Philos Trans R Soc B Biol Sci 290(1038):181– 197. https://doi.org/10.1098/rstb.1980.0090 31. Griffiths TD, Warren JD (2004) What is an auditory object?. Nat Rev Neurosci 5(11):887–892. https://doi.org/10.1038/nrn1538 32. Grondin S et al (2018) Auditory time perception. In: Bader R (ed) Springer handbook of systematic musiclology, Springer-Verlag GmbH Germany, Cham, Switzerland, Chap. 21, pp 423–440. https://doi.org/10.1007/978-3-662-55004-5_21 33. Handel S (2019) Objects and events. Perceptual organization: an integrated multisensory approach. Palgrave Macmillan, Cham, Switzerland, pp 9–82. https://doi.org/10.1007/978-3319-96337-2_2 34. Hartmann WM, Doty SL (1996) On the pitches of the components of a complex tone. J Acoust Soc Am 99(1):567–578. https://doi.org/10.1121/1.414514 35. Hartmann WM, McAdams S, Smith BK (1990) Hearing a mistuned harmonic in an otherwise periodic complex tone. J Acoust Soc Am 88(4):1712–1724. https://doi.org/10.1121/1.400246 36. Heald SLM, Van Hedger SC, Nusbaum HC (2017) Perceptual plasticity for auditory object recognition. Front Psychol 8, Article 781, p 16. https://doi.org/10.3389/fpsyg.2017.00781 37. Heil P (1997) Auditory cortical onset responses revisited. I. First-spike timing. J Neurophysiol 77(5):2616–2641. https://doi.org/10.1152/jn.1997.77.5.2616 38. 
Heil P (2003) Coding of temporal onset envelope in the auditory system. Speech Commun 41(1):123–134. https://doi.org/10.1016/S0167-6393(02)00099-7 39. Holmes SD, Roberts B (2012) Pitch shifts on mistuned harmonics in the presence and absence of corresponding in-tune components. J Acoust Soc Am 132(3):1548–1560. https://doi.org/ 10.1121/1.4740487
40. Houtsma AJ, Rossing TD, Wagenaars WM (1987) Auditory demonstrations. Institute for perception research (IPO), northern illinois university, Acoustical Society of America, Eindhoven, Netherlands. https://research.tue.nl/nl/publications/auditory-demonstrations 41. Jones MR (1976) Time, our lost dimension: toward a new theory of perception, attention, and memory. Psychol Rev 83(5):323–335. https://doi.org/10.1037/0033-295X.83.5.323 42. Koffka K (1955) Principles of gestalt psychology, 5th edn. Routledge, London, UK 43. Kubovy M, Van Valkenburg D (2001) Auditory and visual objects. Cogn 80(1):97–126. https:// doi.org/10.1016/S0010-0277(00)00155-4 44. Kuroda T, Grondin S (2013) No time-stretching illusion when a tone is followed by a noise. Atten Percept Psychophys 75(8):1811–1816. https://doi.org/10.3758/s13414-013-0536-8 45. Marin CMH, McAdams S (1991) Segregation of concurrent sounds. II: Effects of spectral envelope tracing, frequency modulation coherence, and frequency modulation width. J Acoust Soc Am 89(1):341–351. https://doi.org/10.1121/1.400469 46. Matthen M (2010) On the diversity of auditory objects. Rev Philos Psychol 1(1):63–89. https:// doi.org/10.1007/s13164-009-0018-z 47. Matthews WJ, Stewart N, Wearden JH (2011) Stimulus intensity and the perception of duration. J Exp Psychol Hum Percept Perform 37(1):303–313. https://doi.org/10.1037/a0019961 48. McAdams S (1989) Segregation of concurrent sounds. I: Effects of frequency modulation coherence. J Acoust Soc Am 86(6):2148–2159. https://doi.org/10.1121/1.398475 49. McAdams S, Drake C (2002) Auditory perception and cognition. In: Pashler H (ed) Stevens’ handbook of experimental psychology, volume 1: sensation and perception, 3rd edn. Wiley, New York, Chap. 10, pp 397–452. https://doi.org/10.1002/0471214426.pas0110 50. McLachlan NM, Wilson S ( 2010) The central role of recognition in auditory perception: a neurobiological model. Psychol Rev 117(1):175–196. https://doi.org/10.1037/a0018063 51. Micheyl C, Oxenham AJ (2010) Pitch, harmonicity and concurrent sound segregation: Psychoacoustical and neurophysiological findings. Hear Res 266(1-2):36–51. https://doi.org/10. 1016/j.heares.2009.09.012 52. Middlebrooks JC (2017) Spatial stream segregation. In: Middlebrooks JC et al (eds) The auditory system at the cocktail party. Springer International Publishing, Cham, Switzerland, Chap. 6, pp 137–168. https://doi.org/10.1007/978-3-319-51662-2_6 53. Mill RW et al (2013) Modelling the emergence and dynamics of perceptual organisation in auditory streaming. PLoS Comput Biol 9(3), e1002925, p 21. https://doi.org/10.1371/journal. pcbi.1002925 54. Moore BC, Glasberg BR, Peters RW (1985) Relative dominance of individual partials in determining the pitch of complex tones. J Acoust Soc Am 77(5):1853–1860. https://doi.org/10. 1121/1.391936 55. Moore BC, Glasberg BR, Peters RW (1986) Thresholds for hearing mistuned partials as separate tones in harmonic complexes. J Acoust Soc Am 80(2):479–483. https://doi.org/10.1121/1. 394043 56. Moore BC, Gockel HE (2011) Resolvability of components in complex tones and implications for theories of pitch perception. Hear Res 276:88–97. https://doi.org/10.1016/j.heares.2011. 01.003 57. Moore BC, Ohgushi K (1993) Audibility of partials in inharmonic complex tones. J Acoust Soc Am 93(1):452–461. https://doi.org/10.1121/1.405625 58. Moore BC, Peters RW, Glasberg BR (1985) Thresholds for the detection of inharmonicity in complex tones. J Acoust Soc Am 77(5):1861–1867. https://doi.org/10.1121/1.391937 59. 
Näätänen R, Winkler I (1999) The concept of auditory stimulus representation in cognitive neuroscience. Psychol Bull 126(6):826–859. https://doi.org/10.1037/0033-2909.125.6.826 60. Nakajima Y et al (2014) Auditory grammar. Acoust Aust 42(2):97–101 61. Nudds M (2010) What are auditory objects? Rev Philos Psychol 1:105–122. https://doi.org/ 10.1007/s13164-009-0003-6 62. Peeters G et al (2011) The timbre toolbox: extracting audio descriptors form musical signals. J Acoust Soc Am 130(5):2902–2916. https://doi.org/10.1121/1.3642604
63. Phillips DP, Hall SE, Boehnke SE (2002) Central auditory onset responses, and temporal asymmetries in auditory perception. Hear Res 167(1-2):192–205. https://doi.org/10.1016/S03785955(02)00393-3 64. Plomp R (1964) The ear as frequency analyzer. J Acoust Soc Am 36(9):1628–1636. https:// doi.org/10.1121/1.1919256 65. Plomp R, Mimpen AM (1968) The ear as frequency analyzer. II. J Acoust Soc Am 43(4):764– 767. https://doi.org/10.1121/1.1910894 66. Plomp R (1998) Hoe wij Horen: over de Toon die de Muziek Maakt 67. Rasch RA (1979) Synchronization in performed ensemble music. Acta Acust United Acust 43(2):121–131 68. Rasch RA (1978) The perception of simultaneous notes such as in polyphonic music. Acta Acust United Acust 40(1):21–33 69. Roberts B (2005) Spectral pattern, grouping, and the pitches of complex tones and their components. Acta Acust United Acust 91(6):945-957 70. Roberts B, Bailey PJ (1996) Spectral regularity as a factor distinct from harmonic relations in auditory grouping. J Exp Psychol Hum Percept Perform 22(3):604–614. https://doi.org/10. 1037/0096-1523.22.3.604 71. Roberts B, Bregman AS (1991) Effects of the pattern of spectral spacing on the perceptual fusion of harmonics. J Acoust Soc Am 90(6):3050–3060. https://doi.org/10.1121/1.401779 72. Roberts B, Brunstrom JM (2001) Perceptual fusion and fragmentation of complex tones made inharmonic by applying different degrees of frequency shift and spectral stretch. J Acoust Soc Am 110(5):2479–2490. https://doi.org/10.1121/1.1410965 73. Roberts B, Brunstrom JM (2003) Spectral pattern, harmonic relations, and the perceptual grouping of low-numbered components. J Acoust Soc Am 114(4):2118–2134. https://doi.org/ 10.1121/1.1605411 74. Roberts B, Holmes SD (2006) Grouping and the pitch of a mistuned fundamental component: effects of applying simultaneous multiple mistunings to the other harmonics. Hear Res 222:79– 88. https://doi.org/10.1016/j.heares.2006.08.013 75. Sasaki T, Nakajima Y, Hoopen G ten (1993) The effect of a preceding neighbortone on the perception of filled durations. In: Proceedings of the Spring Meeting of the Acoustical Society of Japan. pp 347–348 76. Sasaki T et al (2010) Time stretching: illusory lengthening of filled auditory durations. Atten Percept Psychophys 72 (5):1404–1421. https://doi.org/10.3758/APP.72.5.1404 77. Schlauch RS, Ries DT, DiGiovanni JJ (2001) Duration discrimination and subjective duration for ramped and damped sounds. J Acoust Soc Am 109(6):2880–2887. https://doi.org/10.1121/ 1.1372913 78. Shamma SA, Elhilali M, Micheyl C (2011) Temporal coherence and attention in auditory scene analysis. Trends Neurosci 34 (3):114–123. https://doi.org/10.1016/j.tins.2010.11.002 79. Shamma SA et al (2013) Temporal coherence and the streaming of complex sounds. In: Moore BC et al (eds) Basic aspects of hearing: physiology and perception. Springer Science+Business Media, New York, Chap. 59, pp 535–543. https://doi.org/10.1007/978-1-4614-1590-9_59 80. Shinn-Cunningham BG, Lee AK, Oxenham AJ (2007) A sound element gets lost in perceptual competition. In: Proceedings of the National Academy of Sciences. vol 104 (29), pp 12223– 12227. https://doi.org/10.1073/pnas.0704641104 81. Snyder JS et al (2012) Attention, awareness, and the perception of auditory scenes. Front Psychol 3, Article 15, p 17. https://doi.org/10.3389/fpsyg.2012.00015 82. Summerfield Q et al (1984) Perceiving vowels from uniform spectra: phonetic exploration of an auditory aftereffect. Percept Psychophys 35(3):203–213. 
https://doi.org/10.3758/BF03205933 83. Van Katwijk A, Van der Burg B (1968) Perceptual and motoric synchronisation with syllable beats. IPO Annu Prog Rep 3:35–39 84. Verwulgen S et al (2020) On the perception of disharmony. In: Ahram T et al (eds) Integrating people and intelligent systems: proceedings of the 3rd international conference on intelligent human systems integration (IHSI 2020). Springer Nature Switzerland AG, Cham, Switzerland, pp 195–200. https://doi.org/10.1007/978-3-030-39512-4_31
85. Villing RC et al (2011) Measuring perceptual centers using the phase correction response. Atten Percept Psychophys 73(5):1614–1629. https://doi.org/10.3758/s13414-011-0110-1 86. Wagemans J et al (2012) A century of gestalt psychology in visual perception. Psychol Bull 138(6):1172–1217. https://doi.org/10.1037/a0029334 87. Wagemans J et al (2012) A century of gestalt psychology in visual perception: II. conceptual and theoretical foundations. Psychol Bull 138(6):1218–1252. https://doi.org/10.1037/a0029333 88. Warren RM (1999) Auditory perception: a new synthesis. Cambridge University Press, Cambridge, UK 89. Wearden, JH et al (2007) Internal clock processes and the filled-duration illusion. J Exp Psychol Hum Percept Perform 33(3):716–729. https://doi.org/10.1037/0096-1523.33.3.716 90. Wertheimer M (1923) Untersuchungen zur lehre von der gestalt. II. Psychol Forsch 4(1):301– 350 91. Whalen DH, Cooper AM, Fowler CA (1989) P-center judgments are generally insensitive to the instructions given. Phonetica 46(4):197–203. https://doi.org/10.1159/000261843 92. Williams SM (1994) Perceptual principles in sound grouping. Auditory display: sonification, audification and auditory interfaces. MA: Addison-Wesley Publishing Company, pp 95–125 93. Winkler I, Denham SL, Nelken I (2009) Modeling the auditory scene: predictive regularity representations and perceptual objects. Trends Cogn Sci 13(12):532–540. https://doi.org/10. 1016/j.tics.2009.09.003 94. Zacks JM, Tversky B (2001) Event structure in perception and conception. Psychol Bull 1:3–21. https://doi.org/10.1037/0033-2909.127.1.3 95. Zeng F-G et al (2005) Speech recognition with amplitude and frequency modulations. Proc Natl Acad Sci USA 102(7):2293–2298. https://doi.org/10.1073/pnas.0406460102 96. Zwicker E (1969) Subjektive und objektive Dauer von Schallimpulsen und Schallpausen. Acustica 22(4):214–218

Chapter 5

Beat Detection

The official American National Standards Institute (ANSI) standard on psychoacoustical terminology [5, 6] recognizes four perceptual attributes of a sound: its loudness, its pitch, its timbre, and its duration. The attributes of timbre, loudness, and pitch will be discussed in the next three chapters. As has already been discussed in Sect. 4.4 of the previous chapter, perceived duration is not a well-defined auditory attribute of many auditory units, such as the syllables of fluent speech. On the other hand, the duration of the time intervals between the beats of two successive syllables is perceptually well-defined. This chapter is about these beats. Terhardt and Schütte [109] found in the 1970s that, when series of tones or noise bursts with isochronous acoustic onsets but different rise times were played, the rhythm induced by these series was not isochronous. In order to make the series sound isochronous, the sounds with longer rise times had to start earlier with respect to acoustic isochrony than sounds with shorter rise times. This adjustment could amount to about 60 ms. In this way, Terhardt and Schütte [109] established that the perceptual moment of occurrence of a sound, or beat location as it is called in this book, was not at its acoustic onset but came later; moreover, the delay between these events was larger for sounds with longer rise times than for sounds with shorter rise times. This is demonstrated in Fig. 5.1. It is the same diatonic scale on 440 Hz as shown in Fig. 1.18, except that the tones alternately have exponentially decaying envelopes, yielding damped tones, and exponentially rising envelopes, yielding ramped tones. As in Fig. 1.18, the tones consist of the first three harmonics of F0. The scale played in the upper panel has isochronous acoustic onsets but sounds rhythmically irregular. In the scale played in the lower panel, the onsets of the ramped sounds are 76 ms earlier. In spite of these earlier acoustic onsets, this scale sounds much more isochronous, showing that the beat of the ramped tones comes about 76 ms later relative to the acoustic onset than that of the damped tones. In some further experiments, Schütte [97, 98] found that not only the rise time of a sound affected its beat location, but also its duration. The longer the duration of the tone or the noise burst, the later the beat location with respect to the acoustic onset. Varying some other experimental parameters of the tones such as their intensity or


Fig. 5.1 The beat location depends on the envelope of a tone. Damped tones are alternated with ramped tones. In the upper panel, the onsets of the tones are isochronous with a rate of 4 Hz. In the lower panel, the onsets of the ramped tones are 76 ms earlier. The rhythm of the tone sequence in the lower panel sounds more isochronous than that of the sequence in the upper panel. (Matlab) (demo)

Varying some other experimental parameters of the tones, such as their intensity or their fundamental frequency, on the other hand, had little effect on beat location [98]. Varying the intervals between the successive sounds, i.e., the tempo, did not have any significant effect either. So, it appears that the beat locations of pure tones or noise bursts predominantly depend on their temporal envelope.

Not only tonal sounds, but also speech sounds sound rhythmically irregular when they are played with isochronous acoustic onsets. Indeed, Morton, Marcus, and Frankish [75] wanted to synthesize rhythmic sequences of words and, in order to do so, first played these words in such a way that their acoustic onsets were isochronous. They found, however, that the rhythm of these acoustically isochronous sequences of words did not sound isochronous. This will be discussed in more detail later in this chapter.

In addition to the terms “beat” and “beat location”, there is a variety of rival terms associated with the perceptual moment of occurrence of an auditory unit. Although some early studies in the 1960s with speech sounds speak about “syllable beat” [4, 117], the word “beat” is generally associated with music, not with speech. Other terms used in music are subjective onset time [124], perceptual onset [121], perceptual attack time [37], or pulse [79]. These terms will all be identified with what is called “beat” in this book when it refers to the perceived event, and with “beat location” when it refers to the perceived timing of that event. Another concept, arising from speech research and related to the concept of beat, is that of perceptual centre, or P-centre, indicating the perceptual moment of occurrence of a word, not necessarily a monosyllabic word. The concept of P-centre was introduced for speech sounds by Marcus [69] and Morton, Marcus, and Frankish [75]. Morton, Marcus, and Frankish [75] also identified beat locations with P-centres: “what happens when a ballerina performs a movement ‘in time to the beat,’ it might be useful to consider that it is the P-centres of the production units of the movements that are adjusted to successive P-centres of the input music” (p. 408). In this quote, Morton, Marcus, and Frankish [75] identify the concept of beat location and P-centre not only with the perceptual moments of occurrence of auditory units but also with the perceptual moments of occurrence of the movements of the ballerina, implying an association between the perceptual system and the motor system.
This association is made explicit for speech in the studies by Fowler [31, 32].

The main reason why beats are presented as the most characteristic perceptual property of auditory units is that they can be counted, since every beat indicates one auditory unit. This is one of the most compelling arguments showing that auditory units are well-defined perceptual entities. The visual system defines visual objects primarily in space, while the auditory system defines auditory units primarily in time [77]. Köhlmann [56] asked listeners to tap along on a Morse key with sentences spoken in German, French, or English. He found that there was almost a one-to-one relation between the taps by the listeners and the syllables, even when the listener did not know the language in which the sentence was spoken or when a sentence was played backwards. In music, the ability to count auditory units, or notes, is self-evident as it is the basis of keeping track of the notes. Even more strikingly, rapid isochronous sequences of tones, too rapid to count individually, can be counted in groups of two, three, or four [94]. This possibility of rhythmic grouping of isochronous sequences of tones shows the identity of the individual tone as a perceptual unit. In speech, counting the syllables of an utterance is relatively easy. In fact, the ability to count syllables develops considerably earlier in life than the ability to count phonemes [60, 62, 68]. Apparently, even young children have an awareness of the identity of the syllable as a perceptual unit.

In summary, there is a moment in time that perceptually defines the presence of a syllable, a tone, or another auditory unit. This perceived event is called the beat of that auditory unit; its perceptual moment of occurrence, its beat location. This beat and its location in time are the subject of this chapter.

5.1 Measuring the Beat Location of an Auditory Unit

There are various ways to measure the temporal location of the beat of an auditory unit, relative methods and absolute methods. In relative methods, two auditory units are played alternately in such a way that the intervals between the beats are perceptually isochronous, so that the series of sounds induces an isochronous rhythm. When that is realized, one can measure the interval between a reference point of each sound, mostly the acoustic onset. This interval then gives the difference in the beat locations of the two sounds in respect of their acoustic onsets. An example of this was just shown in the second part of the demo of Fig. 5.1. In this demo, the acoustic onset of one of the two tones is played 76 ms earlier in the acoustically isochronous pattern in respect of the other. Since this results in an isochronous rhythm, this shows that the beat locations of the two tones differ by about 76 ms in respect of their acoustic onsets. Various authors, e.g., Morton, Marcus, and Frankish [75], Scott [101], and Villing et al. [120], state that this is the only valid way to measure beat locations, and do not recognize the existence of absolute measures for beat location. In this book, however, it will be assumed that some sounds, e.g., metronome ticks, are so short that their
beats can be assumed to coincide with their acoustic onsets. This makes it possible to measure the beat locations of other sounds. An overview of such methods will now be presented.

5.1.1 Tapping Along

One of the most natural ways to measure the beat location of an auditory unit is to present it repeatedly with isochronous intervals and to ask listeners to tap rhythmically along with these test sounds. This seems logical because, when a sound is very short (and a tap is such a sound), its beat can be assumed to coincide with its acoustic occurrence. If the listener then taps in synchrony with the sounds, one can assume that the taps are synchronous with the beats of the sounds. There is a problem with this tapping task, however, which becomes apparent when the stimulus consists of series of very short clicks. Indeed, already in 1902, Miyake [74], as cited by Fraisse [34], found that, when listeners are asked to tap along with such an isochronous sequence of clicks, their taps on average precede the clicks by some tens of milliseconds. This was confirmed by Dunlap [25] in 1910 and by Woodrow [126] in 1932. So, the taps lead the clicks. This discrepancy between taps and clicks goes unnoticed by the participants. Moreover, the slower the tempo, the longer this lead. This phenomenon is called negative asynchrony. Negative asynchrony is an important characteristic of rhythmic entrainment, to be discussed in Sect. 10.12.3.4. It shows that listeners anticipate the next click. Possible explanations for this systematic deviation from synchrony are presented by Vos, Mates, and Van Kruysbergen [122] and Müller et al. [76], but no definite answer can yet be given. A review is presented by Aschersleben [7].

5.1.2 Synchronizing with a Series of Clicks

Another way to determine the beat location of an auditory unit is to ask listeners to synchronize a pre-synthesized series of clicks with the series of the test sound, in such a way that the clicks are perceptually played at the same time as the test sounds. Of course, the inter-click interval must be the same as the inter-onset interval of the test sounds. This method has two problems, however. One is that the clicks and the sounds are played simultaneously so that they can mask each other, and it is not known to what extent a partially masked sound has the same beat location as the unmasked sound [120, p. 1617]. The other problem is that, most likely owing to the resultant synchrony of the onsets of the two sounds, the two sounds can to a greater or lesser extent integrate perceptually so that they are no longer perceived as two different sounds [37, 127]. These two problems, masking and simultaneous integration, may well have the same background as explained in Sect. 4.8.3 of the previous chapter. They do not arise in the next methods.


5.1.3 Absolute Rhythm Adjustment Methods

In absolute rhythm adjustment methods, the clicks are not synchronized with the test sounds but played alternately with the test sounds: ‘click—test sound—click—test sound—click—test sound—click—test sound’, etc. Listeners are then asked to position the test sound in such a way between the clicks that the rhythm of the series sounds as isochronous as possible. When this is realized, it is assumed that the beat of the test sound is positioned precisely in the middle between the two adjacent clicks, so that this moment gives the beat location of the test sound. The validity of this assumption will be discussed below in Sect. 5.1.6.

There are a number of variants of this method. One of them, illustrated in Fig. 5.2, is used by Pompino-Marschall [86]. Alternations of ‘click—test sound—click—test sound—click’ are played. The task of the participant is to adjust the test sound in such a way that the result sounds rhythmically isochronous. The test sound in Fig. 5.2 is the syllable /sap/. In the upper panel, the acoustic onset of the word is positioned halfway between the two clicks. So, if the acoustic onset and the beat coincided, the stimulus shown in the top panel should sound rhythmically isochronous, but most listeners will judge that the syllable comes too late. In the middle panel, the syllable is played 200 ms earlier in the cycle, and now the rhythm is isochronous. In the bottom panel, the syllable is shifted to a position another 200 ms earlier, and most listeners will now find that the syllable comes too early.

Fig. 5.2 Measurement of the beat location of the syllable /sap/. In the stimulus shown in the upper panel, the syllable comes too late; in the middle panel, the rhythm sounds isochronous, and in the bottom panel, the syllable comes too early. Adapted from Pompino-Marschall [86, 87]. (Matlab) (demo)


Since it is the stimulus shown in the middle panel that sounds rhythmically isochronous, and the beat is assumed to be located halfway between the second and the third click, it is concluded that the beat is located somewhere between the /s/ and the /A/. It appears that, in this way, the beat location of a syllable can be measured for most listeners with an accuracy of some tens of milliseconds and, for some rhythmically talented listeners, even of 20–30 ms.
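As a minimal numerical illustration of the logic of this method, the following sketch computes a beat location from an imaginary adjustment result; the click times and the adjusted onset are made-up numbers, not data from Pompino-Marschall [86, 87].

```matlab
% Worked example of the absolute rhythm adjustment logic (made-up numbers).
clickTimes    = [0.0, 1.0];   % two adjacent clicks (s); their beats coincide with the clicks
adjustedOnset = 0.30;         % acoustic onset of the test sound after the listener's adjustment (s)
% When the sequence sounds isochronous, the beat of the test sound is assumed
% to lie halfway between the two adjacent clicks:
beatTime     = mean(clickTimes);            % 0.5 s
% Beat location expressed relative to the acoustic onset of the test sound:
beatLocation = beatTime - adjustedOnset     % 0.2 s after the acoustic onset
```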

5.1.4 Relative Rhythm Adjustment Methods

Relative rhythm adjustment methods for measuring the beat location of an auditory unit are similar to absolute rhythm adjustment methods, except that the clicks are replaced by other, non-click auditory units. So, one has two sequences of test sounds, both played at the same rate. The task of the listeners is the same as in the absolute rhythm adjustment methods and, as for absolute adjustment tasks, there are a number of ways to do the task. One way is to ask listeners to adjust the two sound sequences so that they are perceived as synchronous. When that is realized, one can assume that their beats coincide. As already discussed above in Sect. 5.1.2, this method has two problems. One is that the two simultaneous sounds can mask each other to a greater or lesser extent; the other is that, most likely due to the synchrony of the onsets of the two sounds, the two sounds more or less integrate perceptually with each other [37]. These problems can be avoided by playing the two sounds alternately, and asking listeners to shift the inter-onset intervals of the two sound sequences in such a way that an isochronous rhythm arises. When this is realized, one can assume that the beat locations are equidistant in time. Since the beat location of a non-click auditory unit is not known in advance, one has to choose another reference point, and generally the acoustic onset of the sounds is chosen for this. Since the acoustic onset of a sound is generally different from its beat location, the only conclusion one can draw from the results of such experiments is that the beat of one test sound leads or lags the beat of the other test sound by a certain amount of time in respect of their acoustic onsets. In principle, it must be possible to predict the result obtained with these relative methods if one knows the beat locations of the two test signals as measured by absolute methods. No discrepancies could be found in the literature in this respect, except for one minor one to be discussed in Sect. 5.1.6 later in this chapter.

5.1.5 The Phase-Correction Method

A disadvantage of rhythm adjustment methods can be that they demand very careful listening on the part of the listeners. They have to listen very accurately to every stimulus in order to judge whether the test sound comes a little too late or a little too early. In general, they find this “difficult and fatiguing” [120, p. 1617]. To avoid putting such a high demand on the concentration of the participants, Villing et al. [120] introduced a new way to measure the beat location of a test sound
by using an automatic response of listeners to a sudden deviation from isochrony in an otherwise isochronous series of sounds. Listeners are asked to tap along with an isochronous series of the test sounds. Then, possibly unnoticed by the listener, a small deviation from isochrony is introduced. It then appears that, when the deviant test sound is too early in respect of the preceding isochronous sounds, the next tap is also somewhat earlier, but not to the same extent as the deviation. When the deviant test sound comes later than expected, the next tap also comes somewhat later. Apparently, participants show involuntary adaptations to the subtle rhythmic changes in the series of test sounds, which is called phase correction.

Villing et al. [120] used these involuntary phase corrections to measure the relative beat locations of two test sounds. Hence, this method is called the phase-correction method. In this method, one test sound, referred to as the base sound, is first played isochronously. Every now and then, this base sound is replaced by another test sound. If the test sound is played too early in respect of the rhythm of the base sound, the tap of the listener will also be somewhat earlier, indicating that the beat location of the second test sound is earlier than indicated by the beats of the base sounds. If the test sound is played too late in respect of the rhythm of the base sound, the tap of the listener will be somewhat later. In this way, the relative beat locations of the two sounds can be measured. Villing et al. [120] compared the results of the phase-correction method with the more traditional ways to measure relative beat locations and concluded that the results were essentially the same. Moreover, they found that the results were transitive, or “context independent” as they called it. This means that, when the relative beat locations of test sounds 1 and 2 are known and those of test sounds 2 and 3, the relative beat locations of test sounds 1 and 3 can be calculated simply by subtracting one from the other.

5.1.6 Limitations

In general, the methods discussed above for estimating beat location yield quite consistent results. Indeed, after a comparison of the results of the phase-correction method with various other methods to determine beat location, or P-centre as they call it, Villing et al. [120] conclude that the consistency of the “estimates obtained using different measurement methods, in different laboratories, and using different participants supports the nature of the P-centre as a reliable and universal percept and corroborates previous research (e.g., Marcus, 1981), which indicates that P-centres do not depend on individuals or groups” (p. 1629). An elaborate comparison of different methods to determine the P-centres of musical sounds, short clicks, and 100-ms noise bursts is presented by London et al. [66]. In general, for these sounds, they more or less confirmed the conclusions by Villing et al. [120]. The largest, and most puzzling, discrepancy they found was that between the task in which the click was the probe and the noise burst was the target, and the task in which the noise burst was the probe and the click was the target.
In the first case, the click was on average positioned 34 ms after the onset of the noise; in the latter case, it was 14 ms. The authors could not explain this discrepancy of 20 ms.

Two other problems remain. First, in the above-mentioned methods, the sounds are played sequentially and separated from each other by well-defined breaks. In most everyday situations, however, the successive auditory units are not well separated in time: In fluent speech, the boundaries between successive syllables are both acoustically and perceptually not well-defined; in music, the notes played by different instruments may overlap and be delayed or advanced in time for expressive purposes. Consequently, the series of beats of the successive or simultaneous auditory units are not isochronous. Fraisse [35] and Povel [89] showed that, when listeners are asked to synchronize to tone sequences with onsets that deviate from isochrony, the responses are more isochronous than the played tone sequences. Apparently, the perceived rhythm competes with the production system that produces the motor patterns, which have the tendency to be more isochronous. This problem may perhaps be solved when a neurophysiological correlate of beats and their locations can be found, as discussed later in this chapter in Sect. 5.4.1.

The second problem is that one may question whether the beat locations of auditory units as measured by the above-mentioned methods with isolated auditory units are the same as when those auditory units occur in fluent speech or in live music. For fluent speech, this will be briefly discussed in the upcoming Sect. 5.3.2 of this chapter.

5.2 Beats in Music

Schütte [97] presented a simple computational model for estimating the beat location of musical tones. First, the temporal envelope was estimated by low-pass filtering the rectified sound signal. Then he determined fast rises in the amplitude of the envelope, “transients”. The estimated beat locations were then the instants before the maxima in these envelopes where 16% of these maxima, i.e., −16 dB, was exceeded. By the way, the shift of 76 ms used in the demo of Fig. 5.1 to make the ramped and the damped tones sound isochronous was based on this model. Results similar to those of Schütte [97] were obtained by Vos and Rasch [121] for 400-Hz tones whose rise times they systematically varied, and by Gordon [37] for 16 musical notes played by a number of musicians on a number of different instruments. Both Vos and Rasch [121] and Gordon [37] present simple models for beat-location estimation similar to the model presented by Schütte [97]. Moreover, they propose that a process of adaptation underlies the process of beat formation, which is quite plausible since adaptation enhances fast rises in the signal [82].
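The following Matlab sketch implements an envelope-based estimator in the spirit of the model just described. The one-pole smoother and its 50-Hz cutoff stand in for the unspecified low-pass filter, and the function name is hypothetical; it is a sketch, not Schütte's original implementation.

```matlab
% Sketch of an envelope-based beat-location estimator (assumed filter settings).
function tBeat = beatLocationEstimate(x, fs)
  % Temporal envelope: rectify the signal and low-pass filter it
  fc    = 50;                               % envelope cutoff frequency (Hz)
  alpha = exp(-2*pi*fc/fs);                 % one-pole smoothing coefficient
  env   = filter(1-alpha, [1 -alpha], abs(x));
  % Find the maximum of the envelope of the transient
  [envMax, iMax] = max(env);
  % Beat location: the instant before the maximum at which the envelope
  % first exceeds 16% of that maximum, i.e., about -16 dB
  iBelow = find(env(1:iMax) < 0.16*envMax, 1, 'last');
  if isempty(iBelow), iBelow = 0; end
  tBeat = (iBelow + 1)/fs;                  % beat location in seconds
end
```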


5.3 Beats in Speech

5.3.1 The Structure of Spoken Syllables

Only the structure of spoken syllables will be discussed here. The way in which syllables are written can deviate considerably from the way they are pronounced but, in this section, vowels and consonants refer to the way they are realized in actual speech. First, a brief description of the structure of syllables will be presented. In general, a spoken syllable is divided into an onset, a nucleus, and a coda [112, 113]. Every syllable starts with an onset. This syllable onset can be virtually empty, e.g., when the first phoneme of the syllable is a vowel. In that case, the syllable onset consists of a glottal stop, i.e., a short but audible plosive sound preceding the vowel. In other cases, the syllable onset consists of one or more consonants, as in words like split /splIt/ or street /stRit/. The syllable nucleus is the centre of the syllable and mostly consists of a vowel or a diphthong, but can also consist of a vocalic consonant. Examples of vocalic consonants are the /l/ and the /m/ in the fluently spoken two-syllable words “apple” /æpl/ and “rhythm” /RITm/, respectively. The second syllables of these words have no vowel or a strongly reduced vowel. The nucleus of a syllable is followed by the syllable coda, which, like the onset, consists only of consonants. The coda can also be empty. The nucleus and the coda together form the syllable rhyme or rime. The transition from the syllable onset to the syllable rhyme will be referred to as the rhyme onset.

The perceptual and cognitive importance of the syllable and its rhyme is indicated by the fact that small children acquire an awareness of syllables and their rhymes at an early age, earlier than an awareness of phonemes. Indeed, small children, younger than first graders, can count the syllables in simple words [62], and it is easier for them to indicate the presence of a syllable in a word than to indicate the syllable boundaries [33]. The importance of the rhyme is indicated by the fact that these young children can also tell whether two syllables have the same rhyme [54, 55, 60, 68]. This shows that young children are aware of syllables and their rhymes before they acquire an awareness of phonemes, so before they can learn to read and write (for a review, see Plomp [84, pp. 74–82]).

The number of syllable onsets and syllable rhymes of a language is limited. In a tone language such as Cantonese Chinese, the number of syllable rhymes, or finals, is as low as 51, and the number of syllable onsets, or initials, a mere 19 [70]. In Mandarin Chinese, there are 36 finals and 21 initials [95]. Moreover, not all combinations of initials and finals are used. For Western languages such as English, German, or French, the number of initials and finals is much higher, but also there, once the rhyme onset is known, the search space for syllable onsets and rhymes is considerably reduced. This indicates that using the rhyme onsets as anchor points makes speech processing very efficient. Further arguments for this are provided by a study by Aubanel, Davis, and Kim [8]. They manipulated the temporal structure of speech in two different ways.
In one way, the manipulations made the rhyme onsets of the stressed syllables periodic; in another way, they made the amplitude maxima of the stressed syllables periodic. They measured the intelligibility of these manipulated kinds of speech by adding various amounts of noise. It appeared that the manipulated speech with isochronous rhyme onsets required more noise to be masked than the manipulated speech with isochronous amplitude maxima. The authors identified the rhyme onsets with the perceptual moments of occurrence or P-centres of the corresponding words. This means that, when the P-centres, in this book identified with the beat locations of the syllables, are isochronous, speech is easier to understand than when the amplitude maxima are isochronous. Apparently, knowing in advance when the syllable beats will occur enhances speech intelligibility more than knowing in advance when the amplitude maxima of the syllables will come.

5.3.2 Location of the Syllable Beat

A detailed early study of the beat location of a syllable was carried out in the 1960s and 1970s by Allen [2, 4]. First, he presented a review of older studies on the subject and, based on those, wanted to determine the reliability and validity of the experimental methods described in those studies. In particular, he wanted to test the reliability and validity of studies in which listeners were asked to do a tapping task and of studies in which listeners were asked to do a click-placing task. As described above, in a tapping task, listeners are asked to tap along with a spoken utterance in the rhythm of that utterance; in a click-placing task, listeners adjust a click on a specified syllable so that the syllable and the click are perceived as synchronous. In his studies, Allen [4] used short, meaningful utterances consisting of a number of words. These utterances were played repeatedly, and listeners were asked to tap along with the utterance, or to align a click with the beat of one of its syllables. These syllables varied in prosodic prominence: The most prominent ones were “stressed and rhythmically accented”; the least prominent ones were reduced and unaccented.

Some important conclusions can be drawn from these studies. The first conclusion Allen [4] draws is that the reliability of the tapping task and the click-placing task are comparable and that both supply valid measures of syllable beat location, though there are some complexities. One is the phenomenon of negative asynchrony mentioned above, which means that tap locations systematically precede beat locations in general; other complexities relate to individual differences. The second conclusion is concerned with the location of the beat in the phonetic structure of the syllable. Allen [4] finds that it generally precedes the vowel onset, but how much depends on the length of the syllable onset: “The displacement in time of a subject’s tapping mean from the onset of the nuclear vowel of a stressed syllable was found to be moderately correlated with the length of the consonant sequence preceding the vowels” (p. 112). The third conclusion is concerned with the precision of the measurements of the beat location: “The variances of the distributions were used as a measure of the rhythmicalness of the syllable” [underlining by author] (p. 112). So, the beat is not only characterized by a location in the syllable, but also by a temporal precision.
Allen [4] shows that the variance of the tapping distribution of stressed and rhythmically accented syllables is less than that of less prominent syllables. He said that the “rhythmicalness” of stressed and rhythmically accented syllables is higher than that of less prominent syllables. Rhythmicalness, called beat clarity in this book, will be discussed in Sect. 5.6. In addition to the tapping task and the click-placing task, Allen [3] included a third way to measure beat location. He placed a click at various locations within a specified syllable of an utterance and asked listeners to judge whether the click was in synchrony with the beat of that syllable or not. It appeared, however, that most listeners, 14 out of 16, were so tolerant in their judgments that this method did not result in a precise measurement of beat location. It remains unclear why this is so.

Another study on the location of the beat in the syllable was carried out by Rapp-Holmgren [92]. She did not use single syllables as test words but two-syllable utterances /a’[C]A:d/, in which /C/ indicates a single consonant or a consonant cluster, in this case an /l/, /n/, /d/, /s/, /t/, /st/, or /str/. So, this utterance consists of two syllables, an unstressed syllable /a/, and a stressed syllable with a consonant cluster as syllable onset and /A:d/ as syllable rhyme. Rapp-Holmgren [92] asked the participants to synchronize isochronous series of these words with metronome clicks presented over earphones. The average results are presented in Fig. 5.3. The origin of the abscissa represents the timing of the metronome ticks and hence, since the words were synchronized with these ticks, also the beat locations of the syllables. The results are ordered according to the duration of the consonant cluster. These results show that the beat location generally precedes the vowel onset, and the more complex the consonant cluster, the earlier the beat relative to this vowel onset. The results correspond quite well to those by Allen [4].

Fig. 5.3 Phoneme boundaries, indicated by dots, relative to the beat location of the second syllable of the word /a’[C]A:d/, in which [C] represents a consonant or a consonant cluster, i.c., /l/, /n/, /d/, /s/, /t/, /st/, or /str/. The first syllable /a/ was unstressed; the second was stressed. The short vertical lines are the boundaries between the onset, the nucleus, and the coda of the syllables. (Based on data derived from figure 1-B-6 of Rapp-Holmgren [92]). (Matlab)


Furthermore, the interval between the beginning of the first, constant syllable /a/ and the beat of the second syllable varies relatively little, showing that speakers spoke the successive words with almost identical rhythms. It is concluded that the more consonants precede the vowel of a syllable, the earlier the beat location in respect of the vowel onset. This is clear for the consonants that are part of the syllable onset, but no information could be found as to the effect of the last consonants in the coda of the preceding syllable. Certainly when a syllable starts with a vowel, the role of these preceding consonants may be considerable. Moreover, when the consonants of the syllable onset are coarticulated with the consonants in the coda of the preceding syllable, there is no good reason to believe that the consonants in the coda would not contribute to the beat location of the following syllable. This means that the beat location of a syllable depends not only on the phonetic composition of that syllable, but also on that of the preceding syllable.

As mentioned earlier, most studies on beat locations have been carried out with isolated syllables, and the few that were concerned with fluent speech, e.g., Allen [4], Eggermont [26], Van Katwijk and Van der Burg [117], did not address this issue. A solution to this problem has recently been suggested by Rathcke et al. [93]. These authors looped well-articulated read sentences of four to ten syllables and asked listeners to tap along with these sentences as soon as they felt confident that their taps would be in synchrony with the beats of the sentences. They found that participants started tapping after two to three loop presentations and that the tap locations stabilized after three to five loop repetitions. The authors recommend “using at least 10 repetitions of a sentence to produce stable, consistent, and representative patterns across individual participants” (p. 22). They tested five temporal “landmarks” in the speech signal as to their relation with the tap locations: the syllable onset, the local energy maximum, the local amplitude maximum, the maximum difference in the local energy contour, and the vowel onset. They found that the vowel onset performed best in this respect, but did not address the question of the effect of the number of consonants in the syllable onset on beat location.

The next question is whether the syllable rhyme affects the beat location. It probably does; the longer the syllable rhyme, the later the P-centre [17, 69, 86, 87]. The effect, however, is much smaller than that of the syllable onset, and sometimes no effect is found. Indeed, Cooper, Whalen, and Fowler [17] found a significant effect of rhyme duration for only one of three subjects, and that effect was much smaller than that reported by Marcus [69]. Harsin [44] found no effect of rhyme duration. In a tapping task, Janker [52] carefully corrected for the negative asynchrony, applied various models to a set of syllables of the form /[C]ast/, in which [C] represents a single consonant, and found no effect. For this reason, some authors have questioned its significance [52, 101, 106]. That the effect is real may, however, be concluded from the fact that a similar effect is found for musical tones, where the beat comes later for longer tone durations [109, 122]. It is for now concluded that the duration of the rhyme has a small but significant effect on beat location.
Not only the structure of the syllable onset can induce discrepancies between the measured beat locations and the location of the vowel onset of a syllable. There are at least two other effects but, in contrast with the effect of the syllable onset, these effects can be compensated for.
First, as to spoken syllables, when beat location is measured by a tapping-along task, there is the negative-asynchrony effect. At least for spoken syllables in experimental conditions, this effect appears to be quite constant, so that one can compensate for it [52]. Another example of a systematic deviation is described by Sundberg and Bauer-Huppmann [106]. These authors investigated the synchrony between a pianist and a singer performing together. They conclude: “The results show that, most commonly, the accompanists synchronized their tones with the singers’ vowel onsets. Nevertheless, examples of lead and lag were found, probably made for expressive purposes. The lead and lag varied greatly between songs, being smallest in a song performed in a fast tempo and longest in a song performed in a slow tempo” (p. 285). So, these two examples indicate that one can compensate for the negative-asynchrony effect in a tapping task and for the discrepancies between the beats of a singer and those of the notes played by an accompanist.

5.3.3 The Concepts of P-centre or Syllable Beat

In the early experiments by Morton, Marcus, and Frankish [75], the authors found that recordings of consecutive words played with isochronous onsets did not sound isochronous. They asked themselves what in a sequence of words, if not their acoustic onsets, must be equidistant in time in order to make the rhythm induced by the sequence of words sound isochronous. If it is not the acoustic onset of the words, there must be a “psychological moment of occurrence”, which the authors called P-centre. Later authors mostly defined this P-centre as the perceptual moment of occurrence. In this definition, the P-centre is a property of a complete word, including words consisting of more than one syllable. This makes sense because, e.g., series of numbers can be counted rhythmically, and many numbers, e.g., seven, eleven, and all numbers larger than twelve, consist of more than one syllable. In this book, it will, however, be presumed that every syllable has one P-centre, and that, in tasks involving sequences containing polysyllabic words, listeners will in general focus on the P-centre of the stressed syllable of these words. The concept of P-centre will be identified with that of beat location, since there is no reason to presume any difference. The term syllable beat is older, as it dates from the 1960s [4, 26, 117], and can also be used for musical auditory units. So, mostly the terms beat and beat location will be used in this book. Since, in the literature discussed, the term P-centre is often dominant, it will also be used occasionally in this book, but no difference in meaning should be attached to the terms P-centre and beat location.

Whalen, Cooper, and Fowler [125] wondered what would happen when listeners were not simply asked to align different syllables in such a way that they sounded isochronous, but were explicitly asked to align these syllables in such a way that their syllable onsets, the onsets of their vowels, or their offsets sounded isochronous. It appeared that listeners were simply not able to align the offsets of the syllables.
The results of experiments in which the instruction was to align the syllable onsets or the vowel onsets did not differ from the original result in which the listeners were simply asked to align the syllables so that they sounded rhythmically isochronous. In this way, Whalen, Cooper, and Fowler [125] showed that listeners were only able to align the beats of the syllables, even when explicitly instructed to align the syllable onsets, the vowel onsets, or the syllable offsets. This shows that listeners can only reliably position beat locations in a rhythmic pattern and not any other instant of the syllable.

One might think that the concept of P-centre or beat is specific to English or to languages with rhythmic properties similar to those of English. It appears, however, that these concepts also apply to other languages. Indeed, Hoequist [48] showed that the beats of the rhythmic units of a language behave the same for English, Spanish, and Japanese, languages with different rhythm structures. The difference between English and Spanish on the one hand and Japanese on the other is that, in Japanese, the rhythmic unit of speech is not the syllable but the mora, a subunit of the syllable. Hoequist [48] concluded that the “P-centre phenomenon” is universal to all languages. For the sake of simplicity, the syllable will be referred to as the rhythmic unit of fluent speech in almost all cases, as it is beyond the scope of this book to discuss the differences from moraic languages in more detail. It must, however, be realized that the syllable is not the rhythmic unit of speech in all languages.

Besides English, “P-centre phenomena” have been studied for Dutch [27], German [86], Portuguese [9], Czech [105], and Cantonese Chinese [15]. The results for the Western languages corroborate those obtained for English. For Cantonese Chinese, however, a notable difference appeared. The authors asked participants to synchronize spoken syllables with the ticks of a metronome. They found no effect of tone or rhyme, but found that, except for fricatives, the metronome ticks were more closely aligned with the syllable onset than with the vowel onset. As a possible explanation, they mention that, in Chinese, the initial consonant of a syllable is shorter and less variable than in Western languages. This indicates that these initial consonants are articulated with more effort, so that the weight of their onsets might get larger and, hence, compete with that of the vowel onset. Of course, this has to be confirmed.

Until now, most of what has been discussed was based on perceptual aspects of beat alignment. But speakers are also well able to synchronize spoken syllables with the ticks of a metronome. Moreover, two speakers are also well capable of synchronizing their speech. Indeed, Cummins [18] showed that speakers can synchronize read speech with an accuracy of about 40 ms. This confirms that the beat is not only a perceptual phenomenon, but can also be associated with the motor system. This was investigated by De Jong [22], who tested a number of articulatory and acoustic events as candidates for beat location. With none of them, however, could he find a definite association.

From the results discussed above, it is concluded that the beat is a temporally well-defined attribute of the syllable, the location of which, certainly for well-articulated syllables, can be estimated with an accuracy of 20–50 ms. The syllable beat precedes the vowel onset of the syllable by an interval the length of which correlates with the number of consonants preceding the vowel.
In other words, the longer the syllable onset, the earlier the beat location in respect of the vowel onset. A smaller effect is that of the rhyme: The longer the syllable rhyme, the later the beat location.

5.4 The Role of Onsets

Everybody agrees that onsets play a dominant role in determining the location of the beat in the syllable. Some emphasize the vowel onset in this respect. For instance, though Allen [4] qualifies this elsewhere, he writes: “The acoustic wave shows a great increase in energy and striking changes in spectral structure at the beginning of the nuclear vowel. As will be pointed out later, these changes occur to different degrees and at different rates for different types of consonantal release, but they are strong perceptual cues in all cases. The vowel onset is therefore a clearly distinguished event for the listener as well as for the speaker” (p. 91). The vowel onset is explicitly indicated as a temporal marker of spoken syllables by Barbosa et al. [9], Eriksson [29], and Janker [52], and for sung syllables by Lindblom and Sundberg [64] and Sundberg and Bauer-Huppmann [106]. Scott and McGettigan [99] conclude: “Perceptual centres are thus linked to the onsets of vowel sounds within syllables” (p. 2).

It is clear, however, that, for syllable onsets consisting of one or more consonants, the vowel onset is not the only onset in the cluster that forms the start of the syllable. Indeed, in the same study by Allen [4] as just cited, he found that “the rhythmic beats were closely associated with the onsets of the nuclear vowels of the stressed syllables, but precede those vowel onsets by an amount positively correlated with the length of the initial consonant(s) of the syllable” (p. 72). These discrepancies induced Morton, Marcus, and Frankish [75] to introduce the term “P-centre”, and to emphatically contest that the vowel onset is the acoustic correlate of the beat of a syllable. This is also done by, e.g., Patel, Löfqvist, and Naito [80]. So, the syllable onset is characterized by a temporal cluster of onsets. Since, at the vowel onset, constrictions in the vocal tract are released, the vowel onset will not only be the last but, in most cases, also the strongest onset. This is illustrated in Fig. 5.4, showing the wide-band spectrogram of the monosyllabic word “please”. Note the three onsets at the phonemes /p/, /l/, and /i/. In this chapter, it will be argued that the beat of a syllable emerges from a weighted integration of these onsets. As a result, the beat location of this syllable will be somewhere between the beginning of the /l/ and that of the /i/ in the example of Fig. 5.4. Models of how these onsets interact in fluent speech, and how this interaction eventually leads to the percept of one beat per syllable, are discussed in the upcoming Sect. 5.8.

The importance of onsets in beat generation is also indicated by the results of two other studies. Scott [100] manipulated rise times at the onsets and decay times at the offsets of the syllables “ae”, “sa”, and “wa”, and found that only manipulations at the onset had a significant effect on beat location. Another study is presented by Kato, Tsuzaki, and Sagisaka [53], who asked whether intervals between onsets or offsets in the speech signal determine speech-rate perception. Their stimuli consisted of five successions of a consonant (C) and a vowel (V), so /CVCVCVCVCV/.


Fig. 5.4 Wide-band spectrogram and waveform of the monosyllabic word “please”. The vertical lines in the lower panel indicate the onsets of the phonemes /p/, /l/, and /i/. (Matlab) (demo)

In the course of this series of five syllables, they increased or decreased the proportion between the duration of the consonant and that of the vowel. Due to this processing, the inter-onset intervals of the vowels were different from their inter-offset intervals. They found that the intervals between vowel onsets affected the perceived speech rate, whereas the intervals between vowel offsets did not. This shows that the intervals between the onsets indeed correspond to the intervals between the beats.

The issue as to the role of the vowel onset in relation to other onsets has various components. In the first place, the vowel onset can be measured in different ways. One may just inspect the waveform or the spectrogram and identify the first pitch period in which the resonances of the first and second formant become apparent, e.g., Allen [4]. Another way is to ask a trained phonetician to gate out small intervals of the speech signal and to find the moment at which the vowel becomes audible as a vowel [46]. A third way is to measure the envelope of the output of a band-pass filter that filters out the frequency band corresponding to the first-formant region of the speech signal, and to determine the instant at which the amplitude envelope rises most rapidly [80, 100]. In this last method, the vowel onset is identified with the instant of the syllable at which the intensity in the first-formant region rises most rapidly. Finally, from a phonological point of view, Janker [52] argues that the beat of a syllable is at the onset of the nucleus of the syllable. In this interpretation, the syllable consists of an onset and a rhyme separated by the beat. These discrepancies mark the inconsistency between some conclusions drawn in the literature. Some associate the beat of a syllable with the moment where there is a rapid and significant rise in intensity in the first-formant region [80, 100]. On the other hand, citing Scott [100], Villing et al. [120] conclude: “Unfortunately, no single, well-defined, objectively measurable time point of an event has yet been found that reliably corresponds to the P-centre” (p. 1614).
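A minimal Matlab sketch of this third, envelope-based way of estimating the vowel onset could look as follows. The 300–1000 Hz band taken as the first-formant region, the filter order, and the 50-Hz envelope smoother are illustrative assumptions, and butter and filtfilt require the Signal Processing Toolbox.

```matlab
% Sketch: vowel-onset estimate from the envelope rise in the first-formant band.
function tVowelOnset = vowelOnsetEstimate(x, fs)
  % Band-pass filter roughly covering the first-formant region (assumed band)
  [b, a] = butter(4, [300 1000]/(fs/2), 'bandpass');
  xF1 = filtfilt(b, a, x);
  % Amplitude envelope of the band-passed signal (one-pole smoother)
  fc    = 50;
  alpha = exp(-2*pi*fc/fs);
  env   = filter(1-alpha, [1 -alpha], abs(xF1));
  % Vowel onset: instant at which this envelope rises most rapidly
  [~, iRise] = max(diff(env));
  tVowelOnset = iRise/fs;                   % in seconds
end
```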


5.4.1 Neurophysiology

It has been mentioned several times that peripheral adaptation plays an important role in the enhancement of onsets. But there may be a second peripheral mechanism that enhances onsets. In Sect. 2.6.3, it was mentioned that auditory-nerve fibres can have very different dynamic ranges. There are high-spontaneous-rate fibres, which generally have low intensity thresholds and a small dynamic range. At higher intensities, the low-spontaneous-rate fibres are activated, which have higher thresholds and larger dynamic ranges. This shows that, at rapid increases in intensity, another population of auditory-nerve fibres gets activated. No information could be found, however, about how much this contributes to the dynamic processing of onsets.

In more central parts of the auditory nervous system, the role of onsets, in particular that of vowel onsets or beats, in processing the temporal structure of speech is evident. For instance, Oganian and Chang [78] recorded electrocorticograms (ECoGs), which are electrical potentials recorded intracranially at the surface of the cortex. They recorded such ECoGs at the surface of the superior temporal gyrus (STG) during stimulation with speech utterances. They related these recordings to “peakRate events”, corresponding to peaks in the rate of change of the amplitude envelope of the speech. They found that: “the STG does not encode the instantaneous, moment-by-moment amplitude envelope of speech. Rather, a zone of the middle STG detects discrete acoustic onset edges, defined by local maxima in the rate-of-change of the envelope. Acoustic analysis demonstrated that acoustic onset edges reliably cue the information-rich transition between the consonant-onset and vowel-nucleus of syllables. Furthermore, the steepness of the acoustic edge cued whether a syllable was stressed. Synthesized amplitude-modulated tone stimuli showed that steeper edges elicited monotonically greater cortical responses, confirming the encoding of relative but not absolute amplitude. Overall, encoding of the timing and magnitude of acoustic onset edges in STG underlies our perception of the syllabic rhythm of speech” (p. 1). Moreover, Oganian and Chang [78] report that 90% of the vowel onsets deviated less than 40 ms from the peakRate events. Similarly, Yi, Leonard, and Chang [131] propose that these events function as “temporal landmarks” for the neural processing of speech sounds: “the amplitude envelope may be encoded as a discrete landmark feature. Neural populations that are tuned to detect this feature provide a temporal frame for organizing the rapid stream of alternating consonants and vowels in natural speech, which are analyzed in local STG populations that are tuned to specific spectral acoustic-phonetic features” (p. 1102). Though neither Oganian and Chang [78] nor Yi, Leonard, and Chang [131] mention the relation between these temporal landmarks and beats or P-centres, the correspondence is evident. Coath et al. [16], Hertrich et al. [47, pp. 322–323], and Aubanel, Davis, and Kim [8] make this correspondence explicit.
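To make the notion of such envelope-derived landmarks concrete, the following sketch extracts peakRate-like events from an amplitude envelope. The simple derivative-and-local-maximum criterion, without any amplitude threshold, is an assumption for illustration and is not the analysis used by Oganian and Chang [78].

```matlab
% Sketch: peakRate-like events as local maxima in the rising rate of change
% of an amplitude envelope (illustrative criterion; a real analysis would
% also discard small peaks).
function tEvents = peakRateEvents(env, fs)
  env  = env(:);                       % force column vector
  dEnv = max(diff(env), 0);            % rate of change, rising portions only
  % A sample is a local maximum if it exceeds its two neighbours
  isPeak = [false; dEnv(2:end-1) > dEnv(1:end-2) & ...
                   dEnv(2:end-1) >= dEnv(3:end); false];
  tEvents = find(isPeak & dEnv > 0)/fs;  % event times in seconds
end
```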


5.5 The Role of F0 Modulation

Various authors have mentioned that beats are not only induced by rises in intensity but also by changes in fundamental frequency. An example is Köhlmann [57], who is one of the very few who includes this in a model of “rhythmic segmentation”. The role of frequency modulation (FM) in the induction of beats has been illustrated in the demos of Fig. 1.49, that of changes in fundamental frequency (F0) in Figs. 4.10 and 4.11. Another trivial example is presented in the next demo of Fig. 5.5, playing a diatonic scale on 110 Hz, sung by a synthetic voice singing /A/. The duration of the tones is 250 ms. Note the unmistakable 4-Hz rhythm induced by the beats at the turning points of the pitch contour. It must be remarked that, though the temporal envelope of the sound is constant in amplitude, the temporal envelopes of the outputs of the auditory filters with best frequencies in the range of the increasing F0s will show rises in amplitude at those increases. These rises in intensity, which will be enhanced by adaptation, may explain the induction of beats. So, whether it is necessary to include a separate module for the detection of F0 changes in a model of beat detection, as done by Köhlmann [57], remains to be investigated.

It is a common observation that F0 modulations in vibrato tones, especially the longer ones, are perceived as inducing beats. Remarkably, very few studies have been reported as to the role of FM or changes in F0 in rhythm perception. McAnally [71] used tapping experiments to study at what moments of the periods of sinusoidally frequency-modulated tones listeners perceived the beats. He found that, after correcting for negative asynchrony, the participants synchronized with the moments when the instantaneous frequency rose most rapidly. Negative asynchrony decreased with decreasing modulation index. Interestingly, Demany and McAnally [23] previously found that the global pitch of FM tones frequency modulated at 5 Hz corresponded to the peaks of the instantaneous frequency. Moreover, they remark that “generally, the local minima were not heard as auditory ‘events’” (p. 706). This resembles an aspect of the perception of intonation contours in speech. Indeed, it appears that pitch information in the syllable nucleus, i.e., information following the syllable beat, carries most information contributing to the perception of intonation contours [49] (for a review, see Hermes [45]).

Fig. 5.5 Diatonic scale in A major sung by a synthetic voice. The amplitude is constant, but a clear 4-Hz rhythm is heard. (Matlab) (demo)


Another study was carried out by Tanaka et al. [107]. They presented listeners with long-lasting pure tones of constant intensity but varying in frequency. The frequency contour of these tones was continuous and consisted of alternating rises and falls connected by steady parts. The durations of the glides were varied in such a way that either their onsets or their offsets were isochronous. They found that listeners were more sensitive to deviations from isochrony of the onsets of the glides than to deviations from isochrony of their offsets. They argued that the onset of a frequency glide was a more effective marker for indicating the beginning of a new auditory event than its offset. Compare this with the finding by Kato, Tsuzaki, and Sagisaka [53], discussed above, that the perceived speech rate is determined by the vowel onsets and not by the vowel offsets of a syllable.

Modulations in F0 also play a role in speech rhythm, though perhaps not as important a role as modulations in intensity. Cummins [18] asked speakers to read a text and then to speak it out in synchrony with a recording of that text, or with another speaker reading the same text. He found that speakers are very well able to do so, with a delay between the two voices of less than 40 ms. In another set of experiments, Cummins [19] investigated what information in a speech signal is used by speakers in synchronizing their speech. By varying or removing amplitude information, F0 information, or spectral information, he found that, though amplitude information played the dominant role, the roles of F0 and spectral information were also significant. Another effect has been found for the lexical tones of a tone language. In a tone language, the meaning of a syllable depends not only on its phonetic content but also on the pitch contour, the tone, realized on the syllable. The tone language Thai has five such lexical tones. Janker and Pompino-Marschall [51] found that the absolute F0 level of the tones had no influence on the beat location, or P-centre location as they called it, of the syllable, but the presence of pitch movements had. Remarkably, although the role of F0 modulation in rhythm perception has been little explored, FM sinusoids have been used in studying the neurophysiology of rhythmic entrainment [10]. Apparently, also at the level of the central nervous system, rhythmic entrainment by modulations in frequency expresses itself in the same way as rhythmic entrainment by modulations in amplitude.

It is concluded that, in fluent speech, the cluster of onsets at the start of a syllable largely determines the beat location of the syllable. Changes in F0 can also play a significant role. The duration of the vowel plays a minor role. It will be clear that the role played by FM and F0 modulations in the generation of beats in speech and in music needs more study. This will be discussed further in Sect. 10.12.4.4.

Finally, an example will be discussed about which little is known: the rhythm induced by pulse series modulated in frequency. This is schematically sketched in the demos of Figs. 5.6 and 5.7. In these demos, the impacts are simplified to pulses. In Fig. 5.6, the average rate of the series of pulses is 200 Hz. In the upper panel of Fig. 5.6, the pulse rate is modulated between 100 and 300 Hz; in the lower panel, a counterphase modulation in amplitude is added, so that the lower density of the pulse train is accompanied by a higher amplitude. Note the reduction in beat strength.


Fig. 5.6 Pulse series with pulse rates sinusoidally varying between 100 and 300 Hz. The modulation frequency is 3 Hz. In the upper panel, the pulses are constant in amplitude. Note the clear beats inducing a 3-Hz rhythm. In the lower panel, the amplitude is modulated in synchrony with the pulse rate in such a way that it is highest when the pulse rate is lowest and lowest when the pulse rate is highest. Note the change in the induced rhythm. (Matlab) (demo)

Fig. 5.7 Same stimulus as in Fig. 5.6, except that the pulses are jittered with a variance of 5 ms, the average interval between the pulses, so that no pitch percept is induced. Note the similar rhythm to that in Fig. 5.6. (Matlab) (demo)

The idea is that the decrease in presentation rate is more or less compensated by the increase in amplitude. The listener is invited to judge for themselves. The next demo illustrates that beats can be induced not only by changes in F0 but also by changes in the density of random pulses. The stimuli of Figs. 5.6 and 5.7 are the same except that, in the latter, jitter is applied to the pulses with a variance of 5 ms, equal to the average interval between successive pulses. Again, note the reduction in beat clarity in the second part of the demo. No auditory research has been found on the rhythm of these types of sounds, although some instruments such as maracas also produce rhythmic sounds consisting of clustered pulse sequences. These results show that beats can be induced by clusters of onsets, and that the density of the onsets and their amplitudes contribute to the strength and the clarity of the beats, which brings us to the next section.
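For readers who want to recreate stimuli of this kind, the following Matlab sketch synthesizes a pulse train whose rate varies sinusoidally between 100 and 300 Hz with a counterphase amplitude modulation, roughly as in the lower panel of Fig. 5.6. The modulation depth of the amplitude is an assumption, since the depth used in the demo is not specified here.

```matlab
% Sketch: pulse train with sinusoidally varying rate and counterphase AM.
fs = 44100; dur = 3; fm = 3;                  % modulation frequency of 3 Hz
t  = (0:round(dur*fs)-1)'/fs;
rate  = 200 + 100*sin(2*pi*fm*t);             % instantaneous pulse rate, 100-300 Hz
phase = cumsum(rate)/fs;                      % number of pulses elapsed
pulses = [0; diff(floor(phase))] > 0;         % one-sample pulse at each integer crossing
amp = 1 - 0.5*sin(2*pi*fm*t);                 % loudest where the pulse rate is lowest (assumed depth)
x   = pulses .* amp;
soundsc(x, fs);
% Jittering each pulse time by a few milliseconds, as in Fig. 5.7, removes
% the pitch percept while leaving a similar rhythm.
```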


5.6 Strength and Clarity of Beats

Up to now, beats have been discussed as if they were an all-or-nothing phenomenon: Every auditory unit has a beat, and the main concern was to find out where in the auditory unit the beat is located. One beat, however, is not equal to another. Both for speech and for music, it is well known that beats vary in strength, which will be referred to as beat strength. In music, the succession of stronger and weaker beats defines the metrical structure or metre of the music [61, 65, 79]. In speech, successive syllables vary in the strength of their beats, and most listeners are well able to decide which of two successive syllables is the stronger [63]. In short, both in speech and in music, the succession of beats forms a temporally linear structure whose successive elements vary in strength. This structure defines the rhythm of the speech or the music. In this reasoning, it is implicitly assumed that rhythm is an attribute of one auditory stream. Hence, the successive elements that define the rhythmic pattern are sequentially integrated and perceived as produced by the same sound source. In short, the beat is an attribute of an auditory unit, while rhythm is an attribute of an auditory stream. Rhythm will, therefore, be discussed later in the section of that name in Chap. 10.

Another aspect in which beats can differ is their definition in time. For speech, Allen [4] mentions that the variance of the distribution of tap locations in a tapping task could vary, and called this rhythmicalness. Villing et al. [120] also found that the beats of some syllables are better defined in time than those of others. This they called P-centre clarity. For the sake of consistency, it will be called beat clarity in this book. In this way, the beat location of an auditory unit is expressed as a distribution in time with a mean and a variance. One may wonder whether this characterization is complete. Since many sounds, e.g., syllables with a consonant cluster as onset, do not have just one onset but start with a cluster of onsets, the distribution of a beat location may be more complex. Indeed, Gordon [37] not only presents the average of a number of measurements of beat locations as an estimate of the "perceptual attack time" of a musical tone, but also presents and discusses whole distributions, which can show plateaus and can even be bimodal. This issue is studied and discussed in detail by Wright [127] and Danielsen et al. [20].

Not only syllables but also tones can vary in the clarity of their beats. It has already been mentioned that the rise time of a tone affects its beat location. It also affects the temporal definition of the beat, i.e., its clarity. Indeed, Bregman, Ahad, and Kim [13] synthesized four 1-s pure tones of different frequencies with triangular envelopes. They played these tones with inter-onset times of 60, 80, and 100 ms, and asked listeners to judge the order in which the tones were played. The rise time was varied in a range of 10–640 ms. They found that the order of the tones was much easier to judge for the tones with the shorter rise times. For longer rise times, the beats of the separate tones were less well defined in time, making it more difficult to hear the tones as separate auditory units and to judge the order in which they were played. This indicates that the beat clarity of the tones is higher for the tones with the shorter rise times. A demo of this effect is presented on track 21 of the CD by Bregman and Ahad [12].
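As a rough sketch of the type of stimulus used by Bregman, Ahad, and Kim [13], the code below synthesizes a sequence of pure tones with triangular envelopes and a fixed inter-onset interval. The tone frequencies, sampling rate, and scaling are assumptions; only the 1-s duration, the inter-onset intervals, and the range of rise times come from the study as described above:

import numpy as np

def tri_tone(freq, dur=1.0, rise=0.01, fs=44100):
    """Pure tone with a triangular envelope: linear rise, then linear decay."""
    t = np.arange(int(dur * fs)) / fs
    env = np.where(t < rise, t / rise, (dur - t) / (dur - rise))
    return env * np.sin(2 * np.pi * freq * t)

def tone_sequence(freqs, ioi=0.08, rise=0.01, dur=1.0, fs=44100):
    """Overlapping tones with a fixed inter-onset interval (ioi, in seconds)."""
    onsets = [int(round(k * ioi * fs)) for k in range(len(freqs))]
    n_tone = int(dur * fs)
    x = np.zeros(onsets[-1] + n_tone)
    for i0, f in zip(onsets, freqs):
        x[i0:i0 + n_tone] += tri_tone(f, dur=dur, rise=rise, fs=fs)
    return x / np.max(np.abs(x))

# Frequencies are arbitrary illustrative values, not those of the original study
sharp   = tone_sequence([500, 707, 1000, 1414], rise=0.010)  # short rise times: order easy to judge
blurred = tone_sequence([500, 707, 1000, 1414], rise=0.640)  # long rise times: beats poorly defined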


As a final remark on the perception of the strength and clarity of beats, Goswami et al. [39] showed that there is a relation between onset perception and developmental dyslexia. More specifically, Huss et al. [50] studied the relation between rise-time perception, dyslexia, and musical-rhythm perception. These studies indicate that people with developmental dyslexia may perceive beats with less strength and clarity than others. For reviews on this issue, see Goswami [38] or Ladányi et al. [59].

5.7 Interaction with Vision

Another interesting issue is that beat location can be affected by visual information. This is exemplified by what is called the temporal ventriloquist effect [104]. Readers will be more familiar with the spatial ventriloquist effect, i.e., the effect that, when we see and hear something happening in our environment, the perceived location will generally better match where we see it happening, even when the sound comes from a different location. This is generally attributed to the spatial resolution of the visual system being finer than that of the auditory system. One says that, as to location, the visual event "captures" the auditory event [73]. This spatial ventriloquist effect will be discussed more extensively in Sect. 9.3.8 on the contribution of visual information to perceived location.

It is often stated that, while the visual system may have the finer spatial resolution, the auditory system has the finer temporal resolution. One can then ask whether, when we see and hear something happening, the perceived moment of occurrence corresponds better with the moment of occurrence of the auditory event, its beat, or with that of the visual event. In other words, does the auditory event, as to timing, capture the visual event? This question, too, has a positive answer. More specifically, after an audiovisual event, the produced sound arrives later at our ears than the emitted light, a simple consequence of the much higher propagation speed of light. This delay usually goes unnoticed. What is more, when observers are asked to synchronize auditory events with visual events, they, on average, adjust the auditory event to about 40 ms later than the visual event. Moreover, Engel and Dougherty [28] found that, when judging the synchrony of auditory and visual events, listeners compensate for this delay up to distances of at least 20 m, a distance travelled by sound in about 60 ms. Most experiments in this respect have been carried out with sequences of tones or flashes with temporally well-defined onsets. It is concluded that, in this condition, the auditory information is more accurate than the visual information and that the auditory event then captures the visual event. One can now ask what happens when the onset of the auditory event is temporally less well specified than that of the visual event. This was studied by Vidal [118]. By adding noise to the sequence of tones, he decreased the accuracy of the perceived timing of the tones to a level at which it was as accurate as the perceived timing of the flashes. It appeared that, in that condition, the perceived moment of occurrence of the audiovisual event


was not captured by the auditory event but was determined by combining the auditory and the visual timing information in a more or less optimal way. Later in this book, more examples will be presented of this more general principle that the perceptual system combines information from different sources in a more or less optimal way [1]. In Sect. 10.12.3.5, visual stimuli will also be described that can induce rhythmically well-defined visual events, at least when these visual stimuli move rhythmically. Another example is the sound-induced flash illusion, presented by Shams, Kamitani, and Shimojo [102], which will be discussed in Sect. 9.1.
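The "more or less optimal" combination referred to here is usually formalized as maximum-likelihood, inverse-variance-weighted averaging of the unimodal estimates (cf. Alais and Burr [1]). A minimal sketch of that computation, with made-up timing estimates and standard deviations, could look as follows:

import numpy as np

def combine_timing(t_a, sigma_a, t_v, sigma_v):
    """Maximum-likelihood combination of an auditory and a visual timing estimate.

    Each estimate is weighted by the inverse of its variance; the combined
    estimate has a smaller variance than either estimate alone.
    """
    w_a = 1.0 / sigma_a**2
    w_v = 1.0 / sigma_v**2
    t_av = (w_a * t_a + w_v * t_v) / (w_a + w_v)
    sigma_av = np.sqrt(1.0 / (w_a + w_v))
    return t_av, sigma_av

# Precise auditory timing: the combined estimate stays close to the auditory one
print(combine_timing(t_a=0.000, sigma_a=0.010, t_v=0.040, sigma_v=0.040))
# Auditory timing degraded by noise (cf. Vidal [118]): both senses now contribute equally
print(combine_timing(t_a=0.000, sigma_a=0.040, t_v=0.040, sigma_v=0.040))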

5.8 Models of Beat Detection

A detailed description and discussion of various ways to model beat location, or P-centre as it is generally called in the literature, is provided by Villing [119]. Here, a summary of various methods will be given. The first formula for calculating the beat location of a syllable was presented by Marcus [69]. He assumes the acoustic onset and the vowel onset to be known and presents the following equation,

d_P = α d_C + β d_V,    (5.1)

in which d_P is the location of the beat after the acoustic onset, d_C the duration of the syllable onset, and d_V the duration of the syllable rhyme. The parameter values chosen are α = 0.65 and β = 0.25. As said, this equation gives the beat location relative to the acoustic start of the syllable. It can equivalently be written relative to the vowel onset as d_PV = −(1 − α) d_C + β d_V, in which d_PV is the beat location with respect to the vowel onset. When d_PV is negative, the estimated beat precedes the vowel onset; when it is positive, the estimated beat is later than the vowel onset. With the parameter values presented above, this gives d_PV = −0.35 d_C + 0.25 d_V. The model described by this equation "may be seen to incorporate two forces working relative to vowel onset, their resultant determining P-centre location. One, proportional to initial consonant duration, tends to pull the P-centre toward the onset of the stimulus; the other moves the P-centre toward stimulus offset and is proportional to vowel and final consonant duration" [69, p. 253]. As to the parameter values of 0.35 and 0.25, especially the latter has been found to be too large. As discussed above in Sect. 5.3.2, later studies found a much smaller effect, or even no effect, of rhyme duration on beat location. The equation presented by Marcus [69] is not much more than a descriptive model for estimating the beat location of spoken syllables with known syllable onsets and vowel onsets. In normal circumstances, this information is not directly available.
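A minimal implementation of Eq. (5.1) makes the two "forces" explicit. The syllable-onset and rhyme durations in the example calls are invented for illustration; only the parameter values α = 0.65 and β = 0.25 come from Marcus [69]:

def marcus_p_centre(d_c, d_v, alpha=0.65, beta=0.25):
    """Marcus's [69] descriptive P-centre model, Eq. (5.1).

    d_c: duration of the syllable onset (s)
    d_v: duration of the syllable rhyme (s)
    Returns the beat location measured from the acoustic onset and,
    equivalently, relative to the vowel onset.
    """
    d_p = alpha * d_c + beta * d_v             # after the acoustic onset
    d_pv = -(1.0 - alpha) * d_c + beta * d_v   # relative to the vowel onset
    return d_p, d_pv

# Hypothetical syllables: a longer onset pulls the beat further ahead of the vowel onset
print(marcus_p_centre(d_c=0.120, d_v=0.250))   # approx. (0.1405, 0.0205)
print(marcus_p_centre(d_c=0.220, d_v=0.250))   # approx. (0.2055, -0.0145)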

Scott [100] tested a model consisting of a gammatone filterbank and investigated whether fast rises in the intensity of the envelopes of the filter outputs could be associated with beats. Comparing the outputs of seven filters spaced 4 ERB apart, she found that increases in the frequency band centred around 578 Hz could predict about 86% of the variation. This was tested on a number of isolated syllables with manipulated onsets and rhymes. The frequency band centred at 578 Hz corresponds well to the first-formant region of vowels. So, the conclusion is that the data obtained by Scott [100] can largely be explained by assuming that the beat location is associated with a fast rise in intensity in the first-formant region, hence with the vowel onset.

A more elaborate model, based on the psychoacoustics of loudness perception, was developed by Pompino-Marschall [86, 87]. In this model, the outputs of critical-band filters are scanned for onsets. These onsets are defined by increments in time in the specific loudness of an auditory filter. The size of an onset is calculated as the difference between a maximum in the specific-loudness distribution and its preceding minimum. These increments are weighted according to their size and their spacing in time, and integrated, resulting in an estimate of the beat location of the sound. In a somewhat more sophisticated model, Pompino-Marschall [86, 87] also took offsets into account, but it is not clear how much improvement this yielded. The model was then evaluated, and it appeared to predict quite well a number of shifts in beat location as a function of phonetic variation. A comparable model was developed by Harsin [44]. For the set of CV and VC syllables he tested, he found a correlation coefficient of 0.99 between predicted and actual beat locations. How well this system performs for syllables starting with a consonant cluster is not shown.

The above models were developed to predict the shifts in beat location of single, well-pronounced syllables as a function of phonetic variations. They were not developed and evaluated for the detection of beat locations in complete spoken sentences. In general, there are two important differences in articulation between carefully pronounced single syllables and syllables in fluently spoken utterances. First, the successive phonemes in fluent speech are not realized as separate, independent units. The realization of a phoneme is seriously affected by the preceding and following phonemes. This process, called coarticulation, plays an essential role in speech recognition (for a review, see Plomp [84, pp. 70–73]). Coarticulation also operates between the last consonant of a syllable and the first consonant of the following syllable, so that, e.g., boundaries between syllables are unclear. Second, in fluent speech, some syllables are much more reduced in articulation than others.

The first direct computational model for the detection of beats in complete utterances was presented by Köhlmann [57]. He uses a model of loudness perception [81] and a model of pitch perception [110] to measure increases in loudness and changes in pitch, from which the timing of the syllable beats is estimated. The author claims to be able to detect 90% of all "rhythmic events" in speech and music correctly. The merit of this model is that it can be used for analysing running speech and music. Köhlmann [57] does not give precise data, however, regarding the temporal accuracy of the estimates and the presence of false positives. It is, therefore, not possible to say how well his model predicts the shifts in beat location as a function of the phonetic composition of the syllable onset and rhyme.

Hermes [46] developed a model for what he called vowel-onset detection.
The underlying basis of this model is the same as that of the models by Köhlmann [57] and Pompino-Marschall [86].


Indeed, for some fluently spoken utterances, Hermes [46] asked a trained phonetician to indicate the vowel onsets by gating out short segments of the speech signal and listening to where the vowel became audible as a vowel. The algorithm was based on the assumption that vowel onsets are characterized by simultaneous onsets in a number of frequency bands. This resulted in what was called vowel strength. In the calculation of vowel strength, the presence of resonances in the spectrum of the sound and its harmonicity were also taken into account. This resulted in a time function representing the course of the vowel strength. By applying a simplified adaptation model, fast rises in vowel strength were detected and selected as candidates for the vowel onsets. After applying some criteria regarding the size of the onsets and their spacing, the definite estimates of the locations of the vowel onsets were determined. It was found that 91% of these vowel onsets were correctly detected with an accuracy of at least 50 ms. There were 8% missed detections and 3% false positives. The missed detections mostly occurred in reduced syllables, and the false positives at other strong onsets in the speech signal, e.g., before sonorant consonants such as /l/, /r/, /w/, /n/, or /m/. Some of these sonorant consonants can function as vocalic consonants, e.g., the /l/ in "apple" /æpl/, or the /m/ in "rhythm" /rɪðm/. In summary, in this algorithm, the vowel-onset locations are estimated after applying an "adaptation" filter to the measured vowel strength of a sentence. In this process, onsets within the scope of this filter are integrated. In this respect, this vowel-onset detection algorithm has much in common with the algorithms presented by Pompino-Marschall [86] and Harsin [44]. This may indicate that the estimated vowel-onset locations correspond well to the beat locations of the syllables, as argued by Kortekaas, Hermes, and Meyer [58, pp. 1196–1197]. This remains to be investigated more closely.

Kortekaas, Hermes, and Meyer [58] developed two other vowel-onset-detection algorithms based on completely different principles. One was developed by training an artificial neural network; the other was a physiologically inspired model based on simulations of onset neurons, the chop-T cells, in the cochlear nucleus. These two models performed somewhat worse than the method based on vowel strength. In another respect, the three algorithms performed similarly: When used to detect the vowel onsets in well-articulated, isolated words, the percentage of false positives increased considerably, which was accompanied by a considerable decrease in missed detections. For the algorithm based on the detection of increments in vowel strength [46], the percentage of false positives rose from 3% to 15.8%, while the percentage of missed detections decreased from 8% to 0.7%. A rate of false positives of 3% and a rate of missed detections of 8% corresponds to a d′ of 3.3, whereas a rate of false positives of 15.8% and a rate of missed detections of 0.7% corresponds to a d′ of 3.5. This insignificant shift in d′ indicates that the sensitivity of the procedure hardly changed, and that the change in false-alarm rate and rate of missed detections was due to a shift in criterion.
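The d′ values quoted here follow directly from the hit rate (one minus the miss rate) and the false-alarm rate under the usual equal-variance Gaussian assumption, as the following check shows:

from scipy.stats import norm

def d_prime(miss_rate, false_alarm_rate):
    """Sensitivity index d' from a miss rate and a false-alarm rate."""
    hit_rate = 1.0 - miss_rate
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# Fluent speech: 8% missed detections, 3% false positives
print(round(d_prime(0.08, 0.03), 1))    # 3.3
# Isolated words: 0.7% missed detections, 15.8% false positives
print(round(d_prime(0.007, 0.158), 1))  # 3.5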
In the context of music and speech processing, many papers have been published on onset "detection" or "beat tracking" in music. An overview of applications, with the relevant literature, is presented by Wright [127, pp. 17–18]. Among quite a few other applications, he mentions tempo and metre tracking, analysis of expressive timing in recordings, computer accompaniment systems, sound segmentation, time scaling, audio compression, music information retrieval, and automatic transcription. Reviews of beat-tracking algorithms are presented, e.g., by Bello et al. [11], Gouyon [40], Gouyon and Dixon [41], Gouyon et al. [42], Hainsworth [43], and McKinney et al. [72]. A review of evaluation methods for beat detection in music is presented by Davies, Degara, and Plumbley [21]. Only some of these studies relate the performance of the beat-tracking procedures to the studies discussed above on beats in syllables, mostly indicated with P-centres, and in musical tones, mostly indicated with perceptual onsets or perceptual attack times. One of these studies is presented by Wright [127]. Another such study, using a neurophysiologically based model, is presented by Coath et al. [16]. They developed an algorithm for detecting the "perceptual onsets", or beats, in sung music. They apply models of cortical spectrotemporal receptive fields, as measured neurophysiologically in the mammalian cortex, to the output of a computational model of the auditory periphery. Onset detection in every output channel of the peripheral model is based on a measure of the "skewness", i.e., the asymmetry, of the temporal distribution of the "energy" as represented by the output of the peripheral auditory filters. In other words, in every frequency channel, onsets are enhanced. By summation over the various frequency channels, the candidates for the "perceptual onsets" of the sung syllables are selected. Note the similarity with the detection models for P-centres developed by Pompino-Marschall [86, 87] and Harsin [44], and with the vowel-onset detection algorithm developed by Hermes [46]. They all try to find moments in the sound signal characterized by rapidly rising intensities. Over the past decade, newly developed beat-tracking algorithms have largely been based on deep neural networks, e.g., Pinto et al. [83]. The use of neural networks in the study of auditory scene analysis will be discussed later in this book, in Sect. 10.15.3.

One may wonder whether the literature on automatic speech recognition lends itself to detecting transitions from consonants to vowels, i.e., vowel onsets. The literature on automatic speech recognition is immense, and it makes no sense to give an overview here. Moreover, the vast majority of it is not concerned with the detection of onsets in speech sounds. There are exceptions, however. An exception for Mandarin Chinese is Wang et al. [123], who start with a segmentation of the syllable into a consonant part and a vowel part. In Chinese, the number of syllable onsets and the number of syllable rhymes are limited, so that the search space is significantly reduced once the vowel onset in the syllable is known. As reported above, Mandarin Chinese has only 23 initials and 34 finals. Wang et al. [123] do not use any psychoacoustic knowledge in their algorithm, and the recognition part is based on trained neural networks. A more recent exception, explicitly based on the detection of "vowel onset points", is presented by Prasanna, Reddy, and Krishnamoorthy [90] and Rao and Vuppala [91]. These methods are based not only on detecting increments in intensity but also on detecting other characteristics of the speech signal associated with vowel onsets, such as those that can be found in the voice source. This approach was later supplemented with methods to detect the vowel offsets as well [130]. These techniques are reviewed in the framework of speech recognition by Sarma and Prasanna [96].
They can be quite efficient because, as mentioned, the set of syllable onsets, the initials, and the set of syllable rhymes, the finals, are limited.
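The common thread of the models reviewed in this section, from Pompino-Marschall [86, 87] and Harsin [44] to Hermes [46] and Coath et al. [16], is the search for moments at which intensity rises rapidly in a number of frequency channels. The sketch below captures only that shared principle: it replaces the auditory filterbanks, specific loudness, and adaptation stages of those models by a plain band-pass decomposition and frame-wise level differences, so the band centre frequencies, frame length, and filter order are all assumptions (a sampling rate of 16 kHz or higher is assumed as well):

import numpy as np
from scipy.signal import butter, sosfilt

def onset_function(x, fs, centre_freqs=(250, 500, 1000, 2000, 4000), frame=0.010):
    """Summed half-wave-rectified level increments across frequency bands.

    Peaks in the returned function are candidate beat (or vowel-onset) locations;
    one value is returned per analysis frame.
    """
    hop = int(frame * fs)
    onset = None
    for fc in centre_freqs:
        # One-octave band around fc (filter order and bandwidth are assumptions)
        sos = butter(2, [fc / 2**0.5, fc * 2**0.5], btype='band', fs=fs, output='sos')
        env = np.abs(sosfilt(sos, x))
        frames = env[:len(env) // hop * hop].reshape(-1, hop)
        level = 10.0 * np.log10(np.mean(frames**2, axis=1) + 1e-12)  # frame level in dB
        rise = np.maximum(np.diff(level), 0.0)                       # keep rises only
        onset = rise if onset is None else onset + rise
    return onset

# Frames where the onset function shows a clear local maximum are candidate beats;
# the weighting, temporal integration, and selection criteria of the published models are omitted.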


As to music, it was concluded that, as a first approximation, the beat of a musical tone is situated at the point in time where its intensity rises most rapidly. As to speech, from what has been argued, it was concluded that the start of every syllable is characterized by one onset or a cluster of onsets in intensity. Among these onsets, the vowel onset is mostly the last and the most significant. In the process of beat detection, these onsets are enhanced and weighted in a temporal integration process. Since the vowel onset is the last and most significant onset, the syllable beat is close to the vowel onset when the syllable onset is very short. The more consonants precede the vowel, the more the beat shifts away from the vowel onset into the syllable onset. The role of the consonants in the coda of the preceding syllable remains to be investigated. The syllable rhyme also plays a role, but a smaller one, and this, too, remains to be investigated further.

Finally, there is no good reason to believe that beats in the syllables of speech are processed by different perceptual processes than beats in musical tones. General models for beat detection, therefore, include peripheral filtering and enhancement of onsets based on peripheral adaptation and on other onset-enhancement mechanisms. In this process, the cluster of onsets at and preceding the vowel onset integrates perceptually and induces one unitary beat. Apparently, in fluent speech, the asynchrony between the separate onsets in the syllable onset is too small to induce segregation. Indeed, Simon and Winkler [103] showed that spectral integration of very short, spectrally different sounds becomes more likely when their inter-stimulus interval is less than 50–100 ms. Turgeon, Bregman, and Ahad [114] and Turgeon, Bregman, and Roberts [115] showed that onsets separated by less than about 20 ms integrate spectrally. At the neurophysiological level of the cortex, Yabe et al. [129] found that auditory information within an integration window with a length of 160–170 ms is processed into "auditory event percepts". Poeppel [85] provides arguments, at least for speech processing, that this window is located in the right hemisphere of the cortex.

5.9 Fluctuation Strength

Another perceptual measure often used to express the perceptual salience of beats is fluctuation strength [108, 111]. Terhardt [111, p. 223] introduces the perceptual unit vib for a 40-dB, 1-kHz tone sinusoidally modulated at 5 Hz with a modulation depth (MD) of 1. Fastl and Zwicker [30, p. 247] introduce the unit vacil for a 60-dB, 1-kHz tone modulated at 4 Hz with an MD of 1. In these models, the estimation of fluctuation strength is based on at least two successive maxima in the temporal envelope of the auditory-filter outputs and, as such, it does not do justice to the individual nature of beats. Nor does it predict their perceived moment of occurrence. The interested reader is referred to the review of the estimation of fluctuation strength by Fastl and Zwicker [30].


5.10 Ambiguity of Beats

Up to now, the beat has been described as an attribute of an auditory unit that emerges from the process in which this unit is formed. In the description of the factors determining this process, mainly properties of the signal have been discussed, in particular the presence of onsets. When listeners are asked to make judgments as to the presence or absence of a beat, however, higher-order processes can come into play. This can happen, e.g., in the perception of beats in speech sounds. Listeners speak different languages, and languages differ in the structure of their syllables. These differences and the corresponding conceptual problems are discussed by, e.g., Port [88] and Turk and Shattuck-Hufnagel [116]. The difference between Western languages such as English, French, and Spanish on the one hand, and a moraic language such as Japanese on the other, has already been mentioned. Moreover, various "consonants" such as /l/, /m/, /n/, or /r/ can operate as vowels in some languages. For instance, the /r/ functions as a vowel in the Czech word "prst", meaning finger. Even within one language, listeners may differ in their opinion as to what exactly constitutes one syllable. For instance, in the examples of "apple" and "rhythm" given above, one may question whether these words consist of one or two syllables. As a consequence, listeners may differ as to whether they perceive one or two beats.

In music, notes such as trills and grace notes play a role in the rhythm of the music different from that of the notes that determine its metre. As to grace notes, since they succeed each other very rapidly, they can be considered in the same way as consonant clusters in speech: The rapid succession of onsets fuses into one beat. How trills are perceived depends on the rate at which they are played. As long as the separate tones can be counted, singly, in doublets, triplets, or quadruplets, they can be considered as rapid sequences of tones, each with its own beat. When the succession of notes becomes more rapid, however, a transition to roughness perception occurs. A trill can then best be considered as one rough auditory unit, with a beat at its start.

These factors imply that there will always be transitions from perceptually well-defined sequences of beats, each belonging to a single auditory unit, to less distinct clusters of perceived onsets for which the number of actually heard auditory units is ambiguous. In everyday situations, the problem may not be so bad, however. At various points, a distinction has been made between modulations in the rhythm range and modulations in the roughness range and, as to speech, it was argued that syllables are produced within the rhythm range. Substantiated by neurophysiological evidence, Xiang, Poeppel, and Simon [128] and Chait et al. [14] argue that syllables and phonemes are processed by different neurophysiological systems. Moreover, the latter authors report average syllable durations for three different classes of syllables: For unstressed syllables consisting of just one vowel, they give an average duration of 73 ms; for stressed syllables consisting of just one vowel, they give 164 ms; and for stressed syllables consisting of three consonants and one vowel, 400 ms. This shows that the vast majority of syllables in spontaneous speech is produced


with average speech rates well within the rhythm range. A minority of unstressed syllables consisting of only one vowel may be missed. Similar conclusions can be drawn for music. Ding et al. [24] showed that temporal modulations in speech peak at 5 Hz, whereas they peak at 2 Hz in music, showing that the rate at which tones are played in music will, on average, be lower than the syllable rate in speech. In trills and other very rapid passages, the boundary of the roughness range may be crossed. Indeed, estimates of the maximum number of tones per second a professional pianist can realize vary from 16 [36] to 20 [67]. The data presented above were derived from recordings of actually played music and spontaneous speech. In experimental conditions, transitions are easily created. And, as always in perception studies, nothing is absolute.

Beats are the constituents of rhythm, and rhythm is mostly described as a complex, hierarchical system whose elements are characterized by their strengths: One element is weaker or stronger than another. Moreover, it has already been discussed that beats may vary not only in strength but also in their definition in time, a concept called beat clarity, discussed in Sect. 5.6. This means that weak beats with a low clarity can easily go unnoticed. A musician can slip over some rapid notes, or a speaker can omit some reduced syllables. This will often lead to doubt about whether some notes or some syllables have actually been realized. Acoustically, this may mean that information vanishes. Whether such information also vanishes perceptually is not always clear, however. This may seem contradictory, because it seems logical to maintain that one cannot perceive what is not there. Later on, it will be shown, however, that the auditory system is well equipped to fill in missing information, and that it can do this so perfectly that the listener cannot judge which information is actually present and which information is missing but restored by the auditory system. This happens, e.g., when information is masked by background noise. This phenomenon of "restoration" will be discussed in Sect. 10.11.

References

1. Alais D, Burr D (2019) Cue combination within a Bayesian framework. In: Lee AK et al (eds) Multisensory processes: the auditory perspective, Chap. 2. Springer Nature Switzerland AG, Cham, Switzerland, pp 9–31. https://doi.org/10.1007/978-3-030-10461-0_2 2. Allen GD (1972) The location of rhythmic stress beats in English: an experimental study I. Lang. Speech 15(1):72–100. https://doi.org/10.1177/002383097201500110 3. Allen GD (1972) The location of rhythmic stress beats in English: an experimental study II. Lang. Speech 15(2):179–195. https://doi.org/10.1177/002383097201500208 4. Allen GD (1967) Two behavioral experiments on the location of the syllable beat in conversational American English. The Center for Research on Language and Language Behavior. Ann Arbor, MI, pp 1–171, 190–195. https://eric.ed.gov/?id=ED017911 5. ANSI (1995) ANSI S3.20-1995. American National Standard bioacoustical terminology. New York, NY 6. ASA (1973) American National psychoacoustical terminology 7. Aschersleben G (2002) Temporal control of movements in sensorimotor synchronization. Brain Cognit 48(1):66–79. https://doi.org/10.1006/brcg.2001.1304


8. Aubanel V, Davis C, Kim J (2016) Exploring the role of brain oscillations in speech perception in noise: intelligibility of isochronously retimed speech. Front Hum Neurosci 10. Article 430, 11 p 9. Barbosa PA et al (2005) Abstractness in speech-metronome synchronisation: P-centres as cyclic attractors. In: Proceedings of the 6th Interspeech and 9th European conference on speech communication and technology (EUROSPEECH) (Lisboa, Portugal), vol 3, pp 1440–1443 10. Bauer A-KR et al (2018) Dynamic phase alignment of ongoing auditory cortex oscillations. NeuroImage 167:396–407. https://doi.org/10.1016/j.neuroimage.2017.11.037 11. Bello JP et al (2005) A tutorial on onset detection in music signals. IEEE Trans Speech Audio Process 13(5):1035–1047. https://doi.org/10.1109/TSA.2005.851998 12. Bregman AS, Ahad PA (1996) Demonstrations of scene analysis: the perceptual organization of sound. Montreal, Canada. http://webpages.mcgill.ca/staff/Group2/abregm1/web/downloadsdl.htm 13. Bregman AS, Ahad PA, Kim JJ (1994) Resetting the pitch-analysis system. 2. Role of sudden onsets and offsets in the perception of individual components in a cluster of overlapping tones. J Acoust Soc Am 96(5):2694–2703. https://doi.org/10.1121/1.411277 14. Chait M et al (2015) Multi-time resolution analysis of speech: evidence from psychophysics. Front Neurosci 9. Article 214, 10 p. https://doi.org/10.3389/fnins.2015.00214 15. Chow I et al (2015) Syllable synchronization and the P-center in Cantonese. J Phonet 55:55–66. https://doi.org/10.1016/j.wocn.2014.10.006 16. Coath M et al (2009) Model cortical responses for the detection of perceptual onsets and beat tracking in singing. Connect Sci 21(2–3):193–205. https://doi.org/10.1080/09540090902733905 17. Cooper M, Whalen DH, Fowler CA (1988) The syllable’s rhyme affects its P-center as a unit. J Phonet 16(2):231–241. https://doi.org/10.1016/S0095-4470(19)30489-9 18. Cummins F (2003) Practice and performance in speech produced synchronously. J Phonet 31(2):139–148. https://doi.org/10.1016/S0095-4470(02)00082-7 19. Cummins F (2009) Rhythm as entrainment: the case of synchronous speech. J Phonet 37(1):16–28. https://doi.org/10.1016/j.wocn.2008.08.003 20. Danielsen A et al (2019) Where is the beat in that note? Effects of attack, duration, and frequency on the perceived timing of musical and quasi-musical sounds. J Exp Psychol: Hum Percept Perform 45(3):402–418. https://doi.org/10.1037/xhp0000611 21. Davies ME, Degara N, Plumbley MD (2009) Evaluation methods for musical audio beat tracking algorithms. Centre for Digital Music, London, UK, pp i–ii, 1–15. https://www.researchgate.net/profile/ 22. De Jong KJ (1994) The correlation of P-center adjustments within articulatory and acoustic events. Percept Psychophys 56(4):447–460. https://doi.org/10.3758/BF03206736 23. Demany L, McAnally KI (1994) The perception of frequency peaks and troughs in wide frequency modulation. J Acoust Soc Am 96(2):706–715. https://doi.org/10.1121/1.410309 24. Ding N et al (2017) Temporal modulations in speech and music. Neurosci Biobehav Rev 81:181–187. https://doi.org/10.1016/j.neubiorev.2017.02.011 25. Dunlap K (1910) Reaction to rhythmic stimuli with attempt to synchronize. Psychol Rev 17(6):399–416. https://doi.org/10.1037/h0074736 26. Eggermont J (1969) Location of the syllable beat in routine scansion recitations of a Dutch poem. IPO Ann Prog Rep 4:60–64
27. Eling PA, Marshall JC, Van Galen GP (1980) Perceptual centres for Dutch digits. Acta Psychol 46(2):95–102. https://doi.org/10.1016/0001-6918(80)90002-5 28. Engel GR, Dougherty WG (1971) Visual-auditory distance constancy. Nature 234(5327):308. https://doi.org/10.1038/234308a0 29. Eriksson A (1991) Aspects of Swedish speech rhythm. University of Göthenburg, Allmän språkvetenskap, pp i–xii, 1–234. http://hdl.handle.net/2077/10854


30. Fastl H, Zwicker E (2007) Fluctuation strength. Psychoacoustics: facts and models, 3rd edn, Chap 10. Springer GmbH, Berlin, pp 247–256 31. Fowler CA (1979) ‘Perceptual centers’ in speech production and perception. Percept Psychophys 25(5):375–388. https://doi.org/10.3758/BF03199846 32. Fowler CA (1983) Converging sources of evidence on spoken and perceived rhythms of speech: cyclic production of vowels in monosyllabic stress feet. J Exp Psychol: General 112(3):386– 412. https://doi.org/10.1037/0096-3445.112.3.386 33. Fox B, Routh DK (1975) Analyzing spoken language into words, syllables, and phonomes: a developmental study. J Psycholinguist Res 4(4):331–342. https://doi.org/10. 1007/BF01067062 34. Fraisse P (1982) Rhythm and tempo. In: Deutsch D (ed) The psychology of music, Chap 6. Academic, London, UK, pp 149–180 35. Fraisse P (1946) Contribution a l’étude du rythme en tant que forme temporelle. J de Psychologie Normale et Pathologique 39:283–304 36. Goebl W, Palmer C (2013) Temporal control and hand movement efficiency in skilled music performance. PLoS ONE 8(1):e50901. 10 p. https://doi.org/10.1371/journal.pone.0050901 37. Gordon JW (1987) The perceptual attack time of musical tones. J Acoust Soc Am 82(1):88– 105. https://doi.org/10.1121/1.395441 38. Goswami U (2015) Sensory theories of developmental dyslexia: three challenges for research. Nat Rev Neurosci 16(1):43–54. https://doi.org/10.1038/nrn3836 39. Goswami U et al (2002) Amplitude envelope onsets and developmental dyslexia: a new hypothesis. Proc Natl Acad Sci 99(16):10911–10916. https://doi.org/10.1073/pnas. 122368599 40. Gouyon F (2005) A computational approach to rhythm description: audio features for the computation of rhythm periodicity functions and their use in tempo induction and music content processing. Barcelona, pp 1–xiv, 1–188. http://www.tdx.cat/bitstream/handle/10803/ 7484/tfg1de1.pdf?sequence=1 41. Gouyon F, Dixon S (2005) A review of automatic rhythm description systems. Comput Music 29(1):34–54. https://doi.org/10.1162/comj.2005.29.1.34 42. Gouyon F et al (2006) An experimental comparison of audio tempo induction algorithms. IEEE Trans Audio, Speech Lang Process 14(5):1832–1844. https://doi.org/10.1109/TSA. 2005.858509 43. Hainsworth S (2006) Beat tracking and musical metre analysis. In: Klapuri A, Davy M (eds) Signal processing methods for music transcription. Springer Science+Business Media, Inc, New York, NY, pp 101–129. https://doi.org/10.1007/0-387-32845-9_4 44. Harsin CA (1997) Perceptual-center modeling is affected by including acoustic rate-of-change modulations. Percept Psychophys 59(2):243–251. https://doi.org/10.3758/BF03211892 45. Hermes DJ (2006) Stylization of pitch contours. In: Sudhoff S et al (eds) Methods in empirical prosody research. Walter De Gruyter, Berlin, pp 29–62. https://doi.org/10.1515/ 9783110914641.29 46. Hermes DJ (1990) Vowel-onset detection. J Acoust Soc Am 87(2):866–873. https://doi.org/ 10.1121/1.398896 47. Hertrich I et al (2012) Magnetic brain activity phase-locked to the envelope, the syllable onsets, and the fundamental frequency of a perceived speech signal. Psychophysiology 49(3):322– 334. https://doi.org/10.1111/j.1469-8986.2011.01314.x 48. Hoequist CE (1983) The perceptual center and rhythm categories. Lang Speech 26(4):367– 376. https://doi.org/10.1177/002383098302600404 49. House D (1990) Tonal Perception in Speech. Lund, Sweden 50. 
Huss M et al (2011) Music, rhythm, rise time perception and developmental dyslexia: Perception of musical meter predicts reading and phonology. Cortex 47(6):674–689. https://doi. org/10.1016/j.cortex.2010.07.010 51. Janker PM, Pompino-Marschall B (1991) Is the P-center position influenced by ‘tone’? In: Proceedings of the international congress on phonetic sciences (ICPS’91) (19-24 August 1991, Aix-en-Provence), vol 3, pp 290–293


52. Janker PM (1996) Evidence for the p-center syllable-nucleus-onset correspondence hypothesis. ZAS Pap Linguist 7:94–124 53. Kato H, Tsuzaki M, Sagisaka Y (2003) Functional differences between vowel onsets and offsets in temporal perception of speech: Local-change detection and speaking-rate discrimination. J Acoust Soc Am 113(6):3379–3389. https://doi.org/10.1121/1.1568760 54. Knafle JD (1973) Auditory perception of rhyming in kindergarten children. J Speech, Lang Hear Res 16(3):482–487. https://doi.org/10.1044/jshr.1603.482 55. Knafle JD (1974) Children’s discrimination of rhyme. J Speech Lang Hear Res 17(3):367–372. https://doi.org/10.1044/jshr.1703.367 56. Köhlmann M (1984) Bestimmung der Silbenstruktur von fließender Sprache mit Hilfe der Rhythmuswahrnehmung. Acustica 56(2):120–125 57. Köhlmann, M (1984) Rhythmische Segmentierung von Sprach-und Musiksignalen und ihre Nachbildung mit einem Funktionsschema. Acustica 56(3):193–204 58. Kortekaas RWL, Hermes DJ, Meyer GF (1996) Vowel-onset detection by vowel-strength measurement, cochlear-nucleus simulation, and multilayer perceptrons. J Acoust Soc Am 99(2):1185–1199. https://doi.org/10.1121/1.414671 59. Ladányi E et al (2020) Is atypical rhythm a risk factor for developmental speech and language disorders? Wiley Interdiscip Rev: Cognit Sci e1528, 32 p. https://doi.org/10.1002/wcs.1528 60. Lenel JC, Cantor JH (1981) Rhyme recognition and phonemic perception in young children. J Psycholinguist Res 10(1):57–67. https://doi.org/10.1007/BF01067361 61. Lerdahl F, Jackendoff R (1981) On the theory of grouping and meter. Musical Quart 67(4):479– 506. http://www.jstor.org/stable/742075 62. Liberman IY et al (1974) Explicit syllable and phoneme segmentation in the young child. In: J Exp Child Psychol 18(2):201–212. https://doi.org/10.1016/0022-0965(74)90101-5 63. Liberman M, Prince, A (1977) On stress and linguistic rhythm. Linguist Inquiry 8(2):249–336. https://doi.org/10.1121/1.392492, http://www.jstor.org/stable/4177987 64. Lindblom B, Sundberg J (2007) The human voice in speech and singing. In: Rossing TD (ed) Springer handbook of acoustics, Chap 6. Springer Science+Business Media, New York, NY, pp 669–712. https://doi.org/10.1007/978-1-4939-0755-7_16 65. London J (2012) Hearing in time: psychological aspects of musical meter, 2nd edn. Oxford University Press, Oxford, UK 66. London J et al (2019) A comparison of methods for investigating the perceptual center of musical sounds. Atten Percept Psychophys 81(6):2088–2101. https://doi.org/10.3758/s13414019-01747-y 67. Lunney H (1974) Time as heard in speech and music. Nature 249(5457):592. https://doi.org/ 10.1038/249592a0 68. Maclean M, Bryant P, Bradley L (1987) Rhymes, nursery rhymes, and reading in early childhood. Merrill-Palmer Quart 33(3):255–281. http://www.jstor.org/stable/23086536 69. Marcus SM (1981) Acoustic determinants of perceptual center (P-center) location. Percept Psychophys 30(3):240–256. https://doi.org/10.3758/BF03214280 70. Matthews S, Yip V (1994) Cantonese: a comprehensive grammar. Routledge, New York, NY 71. McAnally K (2002) Timing of finger tapping to frequency modulated acoustic stimuli. Acta Psychol 109(3):331–338. https://doi.org/10.1016/S0001-6918(01)00065-8 72. McKinney MF et al (2007) Evaluation of audio beat tracking and music tempo extraction algorithms. J New Music Res 36(1):1–16. https://doi.org/10.1080/09298210701653252 73. Mershon DH et al (1980) Visual capture in auditory distance perception: Proximity image effect reconsidered. 
J Audit Res 20(2):129–136 74. Miyake I (1902) Researches on rhythmic activity. Stud From the Yale Psychol Lab 10:1–48 75. Morton J, Marcus SM, Frankish C (1976) Perceptual centers (P-centers). Psychol Rev 83:(51976):405–408. https://doi.org/10.1037/0033-295X.83.5.405 76. Müller K et al (1999) Action timing in an isochronous tapping task: evidence from behavioral studies and neuroimaging. In: Aschersleben G, Bachmann T, Müsseler J (eds) Cognitive contributions to the perception of spatial and temporal events, Chap 10. Elsevier Science B. V., Amsterdam, pp 233–250. https://doi.org/10.1016/S0166-4115(99)80023-5


77. Näätänen R, Winkler I (1999) The concept of auditory stimulus representation in cognitive neuroscience. Psychol Bull 126(6):826–859. https://doi.org/10.1037/0033-2909.125.6.826 78. Oganian Y, Chang EF (2019) A speech envelope landmark for syllable encoding in human superior temporal gyrus. Sci Adv 5(11):eaay6279, 13 p. https://doi.org/10.1126/sciadv. aay6279 79. Parncutt R (1994) A perceptual model of pulse salience and metrical accent in musical rhythms. Music Percept: Interdiscip J 11(4):409–464. https://doi.org/10.2307/40285633 80. Patel AD, Löfqvist A, Naito W (1999) The acoustics and kinematics of regularly timed speech: A database and method for the study of the p-center problem. In: Proceedings of the 14th international congress of phonetic sciences (ICPhS99) (San Francisco, CA), vol 1, pp 405– 408. www.internationalphoneticassociation.org/icphs-proceedings/ICPhS1999/papers/p14_ 0405.dpdf 81. Paulus E, Zwicker E (1972) Programme zur automatischen Bestimmung der Lautheit aus Terzpegeln oder Frequenzgruppenpegeln. Acustica 27(5):253–266 82. Pérez-González D, Malmierca MS (2014) Adaptation in the auditory system: an overview. Front Integrat Neurosci 8, Article 19, 10 p. https://doi.org/10.3389/fnint.2014.00019 83. Pinto AS et al (2021) User-driven fine-tuning for beat tracking. Electronics 10(13):1518, 23 p. https://doi.org/10.3390/electronics10131518 84. Plomp R (2002) The intelligent ear: on the nature of sound perception. Lawrence Erlbaum Associates, Publishers, Mahwah, NJ 85. Poeppel D (2003) The analysis of speech in different temporal integration windows: cerebral lateralization as ‘asymmetric sampling in time’. Speech Commun 41(1):245–255. https://doi. org/10.1016/S0167-6393(02)00107-3 86. Pompino-Marschall B (1989) On the psychoacoustic nature of the P-center phenomenon. J Phonet 17(3):175–192. https://doi.org/10.1016/S0095-4470(19)30428-0 87. Pompino-Marschall B (1991) The syllable as a prosodic unit and the so-called P-centre effect. Forschungsberichte des Instituts für Phonetik und Sprachliche Kommunication der Universität München (FIPKM) 29:65–123 88. Port RF (2007) The problem of speech patterns in time. In: Gaskell GM (ed) The Oxford handbook of psycholinguistics, Chap 30. Oxford University Press, Oxford, UK, pp 503–514 89. Povel D-J (1981) The internal representation of simple temporal patterns. J Exp Psychol: Hum Percept Perform 7(1):3–18. https://doi.org/10.1037/0096-1523.7.1.3 90. Prasanna SRM, Reddy BVS, Krishnamoorthy P (2009) Vowel onset point detection using source, spectral peaks, and modulation spectrum energies. IEEE Trans Audio, Speech, Lang Process 17(4):556–565. https://doi.org/10.1109/TASL.2008.2010884 91. Rao KS, Vuppala AK (2014) Speech processing in mobile environments. Springer International Publishing, Cham, Switzerland. https://doi.org/10.1007/978-3-319-03116-3 92. Rapp-Holmgren K (1971) A study of syllable timing. Quart Prog Status Rep 12(1):14–19. http://www.speech.kth.se/prod/publications/files/qpsr/1971/1971_12_1_014-019.pdf 93. Rathcke T et al (2021) Tapping into linguistic rhythm. Lab Phonol: J Assoc Lab Phonol 12(1):11, 32 p. https://doi.org/10.5334/labphon.248 94. Repp BH (2007) Perceiving the numerosity of rapidly occurring auditory events in metrical and nonmetrical contexts. Percept Psychophys 69(4):529–543. https://doi.org/10.3758/ BF03193910 95. Ross C, Ma J-HS (2017) Modern mandarin Chinese grammar: a practical guide. Routledge, Taylor & Francis Group, London, UK 96. 
Sarma BD, Prasanna SRM (2018) Acoustic-phonetic analysis for speech recognition: a review. IETE Tech Rev 35(3):305–327. https://doi.org/10.1080/02564602.2017.1293570 97. Schütte H (1978) Ein Funktionsschema für die Wahrnehmung eines gleichmässigen Rhythmus in Schallimpulsfolgen. Biol Cybernet 29(1):49–55. https://doi.org/10.1007/BF00365235 98. Schütte H (1978) Subjektiv gleichmaßiger Rhythmus: Ein Beitrag zur zeitlichen Wahrnehmung von Schallereignissen. Acustica 41(3):197–206 99. Scott S, McGettigan C (2012) Amplitude onsets and spectral energy in perceptual experience. Front Psychol 3(80) 2 p. https://doi.org/10.3389/fpsyg.2012.00080


100. Scott SK (1993) P-centres in speech: an acoustic analysis. University College London, London, UK 101. Scott SK (1998) The point of P-centres. Psychol Res 61(1):4–11. https://doi.org/10.1007/ PL00008162 102. Shams L, Kamitani Y, Shimojo S (2000) What you see is what you hear. Nature 408(6814):788. https://doi.org/10.1038/35048669 103. Simon J, Winkler I (2018) The role of temporal integration in auditory stream segregation. J Exp Psychol: Hum Percept Perform 44(11):1683–1693. https://doi.org/10.1037/xhp0000564 104. Slutsky DA, Recanzone GH (2001) Temporal and spatial dependency of the ventriloquism effect. NeuroReport 12(1):7–10 105. Šturm P, Volín J (2016) P-centres in natural disyllabic Czech words in a large-scale speechmetronome synchronization experiment. J Phonet 55:38–52. https://doi.org/10.1016/j.wocn. 2015.11.003 106. Sundberg J, Bauer-Huppmann J (2007) When does a sung tone start? J Voice 21(3):285–293. https://doi.org/10.1016/j.jvoice.2006.01.003 107. Tanaka S et al (2008) Auditory sensitivity to temporal deviations from perceptual isochrony: Comparison of the starting point and ending point of acoustic change. Jpn Psychol Res 50(4):223– 231. https://doi.org/10.1111/j.1468-5884.2008.00378.x 108. Terhardt E (1968) Über die durch amplitudenmodulierte Sinustöne hervorgerufene Hörempfindung. Acustica 20:210–214 109. Terhardt E, Schütte H (1976) Akustische rhythmus-wahrnehmung: Subjektive gleichmässigkeit. Acustica 35(2):122–126 110. Terhardt E, Stoll G, Seewann M (1982) Algorithm for extraction of pitch and pitch salience from complex tonal signals. J Acoust Soc Am 71(3):679–688. https://doi.org/10.1121/1. 387544 111. Terhardt E (1968) Über akustische rauhigkeit und schwankungsstärke. Acustica 20:215–224 112. Treiman R (1985) Onsets and rimes as units of spoken syllables: evidence from children. J Exp Child Psychol 39(1):161–181. https://doi.org/10.1016/0022-0965(85)90034-7 113. Treiman R (1983) The structure of spoken syllables: evidence from novel word games. Cognition 15(1):49–74. https://doi.org/10.1016/0010-0277(83)90033-1 114. Turgeon M, Bregman AS, Ahad PA (2002) Rhythmic masking release: contribution of cues for perceptual organization to the cross-spectral fusion of concurrent narrow-band noises. J Acoust Soc Am 111(4):1819–1831. https://doi.org/10.1121/1.1453450 115. Turgeon M, Bregman AS, Roberts B (2005) Rhythmic masking release: effects of asynchrony, temporal overlap, harmonic relations, and source separation on cross-spectral grouping. J Exp Psychol: Hum Percept Perform 31(5):939–953. https://doi.org/10.1037/0096-1523.31.5.939 116. Turk A, Shattuck-Hufnagel S (2013) What is speech rhythm? A commentary on Arvaniti and Rodriquez, Krivokapic, and Goswami and Leong. Lab Phonol 4(1):93–118. https://doi.org/ 10.1515/lp2013-0005 117. Van Katwijk A, Van der Burg B (1968) Perceptual and motoric synchronisation with syllable beats. IPO Ann Prog Rep 3:35–39 118. Vidal M (2017) Hearing flashes and seeing beeps: timing audiovisual events. PLoS ONE 12(2):e0172028, 19 p. https://doi.org/10.1371/journal.pone.0172028 119. Villing RC (2010) Hearing the moment: measures and models of the perceptual centre. National University of Ireland Maynooth, Maynooth, Ireland, pp i–xv1, 1–296. http://mural. maynoothuniversity.ie/2284/1/Villing_2010_-_PhD_Thesis.pdf 120. Villing RC et al (2011) Measuring perceptual centers using the phase correction response. Atten Percept Psychophys 73(5):1614–1629. https://doi.org/10.3758/s13414-011-0110-1 121. 
Vos J, Rasch R (1981) The perceptual onset of musical tones. Percept Psychophys 29(4):323– 335. https://doi.org/10.3758/BF03207341 122. Vos PG, Mates J, Van Kruysbergen NW (1995) The perceptual centre of a stimulus as the cue for synchronization to a metronome: evidence from asynchronies. Quart J Exp Psychol 48(4):1024–1040. https://doi.org/10.1080/14640749508401427


123. Wang J-F et al (1991) A hierarchical neural network model based on a C/V segmentation algorithm for isolated Mandarin speech recognition. IEEE Trans Signal Process 39(9):2141– 2146. https://doi.org/10.1109/78.134458 124. Wessel DL (1979) Timbre space as a musical control structure. Comput Music J 3(2):45–52. https://doi.org/10.2307/3680283 125. Whalen DH, Cooper AM, Fowler CA (1989) P-center judgments are generally insensitive to the instructions given. Phonetica 46(4):197–203. https://doi.org/10.1159/000261843 126. Woodrow H (1932) The effect of rate of sequence upon the accuracy of synchronization. J Exp Psychol 15(4):357–379. https://doi.org/10.1037/h0071256 127. Wright MJ (2008) The shape of an instant: measuring and modeling perceptual attack time with probability density functions. Stanford, CA, pp i–xiv, 1–188 128. Xiang J, Poeppel D, Simon JZ (2013) Physiological evidence for auditory modulation filterbanks: cortical responses to concurrent modulations. J Acoust Soc Am 133(1):EL7–EL12. https://doi.org/10.1121/1.4769400 129. Yabe H et al (1998) Temporal window of integration of auditory information in the human brain. Psychophysiology 35(5):615–619. https://doi.org/10.1017/S0048577298000183 130. Yadav J, Rao KS (2013) Detection of vowel offset point from speech signal. IEEE Signal Process Lett 20(4):299–302. https://doi.org/10.1109/LSP.2013.2245647 131. Yi HG, Leonard MK, Chang EF (2019) The encoding of speech sounds in the superior temporal gyrus. Neuron 102(6):1096–1110. https://doi.org/10.1016/j.neuron.2019.04.023

Chapter 6

Timbre Perception

At this stage in this book, the auditory information entering the central nervous system has been processed into auditory units. Unless these sounds begin very slowly, these auditory units are perceptually located in time by their beats at their perceptual moments of occurrence. Besides its beat, an auditory unit has several other auditory attributes. These attributes emerge from the processing of the information used in the formation of that auditory unit. This chapter is dedicated to the most elusive of these auditory attributes, the attribute of timbre. It appears that, among pitch, loudness, and timbre, the auditory system processes timbre the fastest. Indeed, pitch and loudness have integration times of more than ten milliseconds, whereas timbre only needs ten milliseconds or less. This was shown in an early study by Gray [89]. He gated out segments of various durations from different vowels and found that segments of not more than 3 ms were enough to identify the vowel. Similar results were obtained by Robinson and Patterson [205] for the identification of synthetic sung vowels, even when two vowels were presented simultaneously [152], and for the identification of synthetic musical-instrument sounds [204]. Moreover, in a study using natural sung-vowel sounds and natural musical-instrument sounds, Suied et al. [243] showed that an interval of less than five milliseconds was sufficient for participants to make an almost perfect distinction between the two sound categories, even when the interval was shorter than one pitch period. These results were obtained for stimuli the listeners had never heard before, so without any training. Apparently, the distinction between these sound categories is based on information from within very short sound segments, so short that the emergence of loudness or pitch has yet to come about [165].




6.1 Definition

Timbre was just mentioned as the most elusive among the various auditory attributes of a sound. The beat of an auditory unit can be attributed to a moment in time, its pitch can be scaled from low to high, its loudness from soft to loud, its perceived duration from short to long, and its perceived location can be indicated in space. Timbre is different. The sounds just mentioned in the introduction of this chapter are vowel sounds and the sounds of musical instruments. Both groups of sounds can be divided into well-defined categories, and the beginning of these sounds apparently contains enough information for rapid recognition. For other sounds, such as many environmental sounds, the categories will often not be so well defined, and the information needed for recognition will be spread over a longer time segment. All this makes it difficult to give a conclusive definition of timbre. Hence, timbre is generally defined on the basis of what it is not: not pitch, not loudness, and not duration. Indeed, the standard definition by the American National Standards Institute is: "Timbre is that attribute of auditory sensation in terms of which a listener can judge two sounds similarly presented and having the same loudness and pitch as dissimilar" [8]. This signifies that all properties attributed to a sound that are not pitch, loudness, or duration are part of the timbre of the sound. For both English and Danish, Pederson [176] presents more than 500 words used to describe the properties of a sound. These sound descriptors span "the semantic space of sound", the title of Pederson's report. The descriptors in this list cannot all be independent, and one of the main questions asked in timbre research is: How many descriptors are necessary to describe a sound?

To a later version of the ANSI definition of timbre, a note is added [6]: "Timbre. That attribute of auditory sensation that enables a listener to judge that two nonidentical sounds, similarly presented and having the same loudness and pitch, are dissimilar. Note: Timbre depends primarily upon the frequency spectrum, although it also depends upon the sound pressure and the temporal characteristics of the sound." This added note emphasizes the role played by the spectral content of the sound, but acknowledges that sound pressure level and the temporal characteristics of the sound also play a role. The important role played by the temporal characteristics of a sound, already demonstrated in Fig. 1.70, will become clear in the course of this chapter. As to the role played by sound pressure level, it will be argued that it is relatively small. Within limits, loudness and pitch are auditory attributes that can largely be described independently of timbre. For systematic discussions of the problems associated with these official definitions of timbre, see Bregman [31, pp. 92–94], Sethares [215, pp. 27–32], Sankiewicz and Budzyski [213], and Siedenburg and McAdams [224].

Despite these complexities, it will be shown that different attributes of timbre can be described more or less independently of other auditory attributes. The following attributes will be discussed successively: roughness, breathiness, and brightness or sharpness. Later in the chapter, three composite timbre attributes will be discussed: sensory pleasantness, voice quality, and perceived effort. Roughness will be discussed first.


6.2 Roughness

In Sect. 1.5, the perception of the sum of two pure tones has been described for varying differences in frequency. It was found that the percept depended strongly on the frequency difference between the two tones. While varying the frequency difference, four perceptual ranges were distinguished: the range of hearing slow modulations, the range of rhythm perception, the range of roughness perception, and the range of perception of a steady tone. Similar phenomena were described for sinusoidally amplitude-modulated (SAM) tones in Sect. 1.6 and for sinusoidally frequency-modulated (SFM) tones in Sect. 1.7. In the current section, the perception of roughness will be discussed more generally.

Roughness is defined as the presence of audible temporal fluctuations in sound, the rate of which is so fast that it is not possible for the listener to keep up with every single fluctuation. In other words, the roughness fluctuations are too fast to induce beats and, since a beat is associated with the appearance of a new auditory unit, the temporal fluctuations underlying roughness do not split the sound into separate auditory units. Roughness is, therefore, a perceptual attribute of an auditory unit or, as McAdams [149, p. 56] formulates it: "Roughness, or any other auditory attribute of a single sound event, is computed after auditory organization processes have grouped the bits of acoustic information together."

In this section, the discussion will focus primarily on the roughness of two-tone complexes, SAM tones, SFM tones, and AM noise, as most research on roughness has been conducted with these types of sounds. It should be mentioned, however, that pure tones with frequencies below about 100 Hz also have a certain amount of roughness [155, 156]. In addition, pulse trains can also sound rough when their fundamental frequency is not too high [251]. Another example of a rough sound has been presented in the randomly jittered pulse trains demonstrated in Fig. 5.7 of the previous chapter.

In the introduction, it was shown that the roughness of two-tone complexes is limited to a range of difference frequencies and, similarly, that the roughness of SAM tones and SFM tones is limited to a range of modulation frequencies. It appears that this range of frequencies over which roughness exists depends on the carrier frequency of the tones. One of the first to describe this existence region of roughness systematically was Terhardt [252]. He used both two-tone complexes and SAM tones. For both sets of sounds, comparable results were obtained.

First, the conclusions by Terhardt [252] obtained with two-tone complexes will be summarized. The perception of two-tone complexes has been discussed in Sect. 1.5. There, it was shown that a two-tone complex could not only be described as the sum of two tones but also as an amplitude-modulated pure tone with a frequency that is the average of the frequencies of the two tones. Recapitulating,

a \sin(2\pi f_1 t) + a \sin(2\pi f_2 t) = 2a \cos\left(2\pi \frac{f_1 - f_2}{2} t\right) \sin\left(2\pi \frac{f_1 + f_2}{2} t\right)    (6.1)
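Before moving on, the identity in Eq. 6.1 can be verified numerically. The following minimal Matlab sketch (not one of the scripts accompanying this book; the chosen frequencies are arbitrary) synthesizes both sides and confirms that they are identical up to rounding error.

fs = 44100;                     % sampling frequency (Hz)
t  = (0:fs-1)'/fs;              % one second of time samples
a  = 0.5;                       % amplitude of each tone
f1 = 1030;                      % frequency of the first tone (Hz)
f2 = 970;                       % frequency of the second tone (Hz)
twoTone = a*sin(2*pi*f1*t) + a*sin(2*pi*f2*t);                % left-hand side of Eq. 6.1
amTone  = 2*a*cos(2*pi*(f1-f2)/2*t) .* sin(2*pi*(f1+f2)/2*t); % right-hand side of Eq. 6.1
max(abs(twoTone - amTone))      % of the order of machine precision
% soundsc(twoTone, fs);         % uncomment to listen

With these values, the difference frequency |f1 − f2| is 60 Hz, so the sound falls within the roughness range described next.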


In this equation, (f_1 + f_2)/2 represents the carrier frequency (CF) f_c and (f_1 - f_2)/2 the modulation frequency (MF) f_m, or:

2a \cos(2\pi f_m t) \sin(2\pi f_c t) = a \sin(2\pi (f_c + f_m) t) + a \sin(2\pi (f_c - f_m) t)    (6.2)

Since the modulator \cos(2\pi f_m t) has positive and negative phases, its absolute value fluctuates with a frequency twice as high as the MF, i.e., |f_1 - f_2|, the difference frequency of the two-tone complex (see Fig. 1.29).

The region where a two-tone complex sounds rough is presented in Fig. 6.1 by the dotted area. The abscissa represents the average frequency of the two tones, CF, and the ordinate their difference in frequency, so twice MF. The two curved lines in this figure represent the two tones of the demo with their frequencies on the abscissa. Their average frequency CF, 1000 Hz, is indicated by the thin, dotted line. The sound demo starts at the upper part of the figure, where the two tones have frequencies of 200 and 1800 Hz, a difference of 1600 Hz. In the course of the demo, the frequency difference is gradually decreased and ends at 4 Hz at the bottom of the figure. As long as the frequency difference is large, two pure tones gliding in frequency are perceived: The frequency of the lower tone rises, while that of the higher tone falls. As the frequency difference gets smaller, there is a moment when listeners start hearing roughness. In the demo, this is at a frequency difference of about 150 Hz, when the two lines enter the roughness area. As the frequency difference is lowered further, roughness first increases and the percept of hearing two tones gradually fades. This transition is not abrupt: In the upper part of the roughness range, two separate tones can still be heard. Roughness is maximum at a frequency difference of about 70 Hz. Below 70 Hz, roughness decreases again and the percept of a two-tone complex vanishes. When the frequency difference is as low as about 23 Hz, the two lines leave the roughness area again, and roughness disappears altogether. It is gradually replaced by a percept of beats: A rapidly fluctuating 1000 Hz pure tone starts to be heard, inducing the percept of rhythm. As the frequency difference is decreased further, the tempo of this rhythm gets slower and slower. The demo ends when the frequency difference is 4 Hz. As said earlier, the transitions between the various ranges are not abrupt, as may be suggested by the figure, but gradual.

As shown in Fig. 6.1, the roughness range of a two-tone complex is somewhat lower for lower CFs than for higher CFs. As to the lower boundary: At a CF of 250 Hz, roughness is heard when the frequency difference gets larger than about 21 Hz, at 1 kHz larger than about 23 Hz, and at 4 kHz larger than about 30 Hz. The upper boundary of roughness perception is also lower for two-tone complexes with lower CFs than for two-tone complexes with higher CFs. Terhardt [252] finds that, when CF is lower than 300–500 Hz, roughness disappears at difference frequencies higher than about 100 Hz. This boundary then gradually increases up to about 250 Hz for CFs higher than 4–5 kHz.

In the example presented in Fig. 6.1, a two-tone complex was used to demonstrate the transition from the perception of roughness to that of beats on the lower side of the roughness range, and the transition from the perception of roughness to that of two steady tones on the upper side. Similar results are obtained for SAM tones [252]


Fig. 6.1 Existence region of roughness of a two-tone complex. The two curved lines represent the frequencies of the two tones; the average frequency of the two tones CF is 1000 Hz, indicated by the thin, vertical dotted line. At the start of the demo, the frequency difference between the tones is 1600 Hz and then gradually decreases to 4 Hz. Based on Terhardt [252, p. 213, Fig. 4] (Matlab) (demo)

and for SFM tones [116]. This has been demonstrated in Sects. 1.6 and 1.7. The reader is encouraged to listen to the demos presented there once again.

The lower boundary of the roughness range shown in Fig. 6.1 is defined by the transition from perceiving roughness to perceiving beating. As mentioned, in principle, every beat introduces a new auditory unit, whereas roughness is a property of one auditory unit. This distinction is not absolute, and the transition is generally perceived as rather gradual. The same applies to the upper boundary of the roughness range, but there the underlying perceptual mechanism is quite different. Confirming an earlier assumption by Von Helmholtz [269] and Plomp and Levelt [183], Terhardt [252] argues that, for CFs lower than about 1000 Hz, the upper boundary is defined by the critical bandwidth. Actually, the Bark scale as presented in Table 3.1 gives a critical bandwidth of about 100 Hz for centre frequencies up to 500 Hz, above which it slowly increases. This corresponds well to the upper boundary of the roughness range. For CFs higher than about 1000 Hz, the critical bandwidth is much larger than the upper boundary of the roughness range. Hence, the lack of roughness at MFs outside the roughness range but within the critical bandwidth cannot be explained by assuming that there is no longer an interaction between the excitation induced by the two tones or, in the case of SAM or SFM tones, by the partials of these tones. Terhardt [252] argues that, for those CFs, the lack of roughness is caused by the inertia of the auditory system in the processing of amplitude fluctuations, in other words, by its limited temporal resolution.

Not only the upper boundary of the roughness range but also the difference frequency of a two-tone complex inducing maximum roughness has been associated with the critical bandwidth. Indeed, Plomp and Levelt [183] suggest that roughness is maximum when the difference frequency is one quarter of the critical bandwidth. These relations are contested, however, by Miśkiewicz et al. [157]. Instead of the Bark scale, they used the more up-to-date Cam scale. They concluded: "In contrast with


the conclusions of previous studies, the frequency interval between two tones that yields maximum roughness at various centre frequencies is not a constant fraction of the bandwidth of the auditory filter, nor is the frequency interval at which roughness disappears equal to the auditory filter bandwidth" (p. 331).

The demo played in Fig. 6.1, as well as the various demos to come in this section, shows that roughness can be present or absent, and that one sound can be rougher or less rough than another. The question is how to express the amount of roughness in a sound. First, there must be a standard. Terhardt [254] introduced the asper, the Latin word for rough, as the perceptual unit of roughness. He defined 1 asper as the roughness of a 40 dB, 1000-Hz pure tone, modulated in amplitude with a modulation frequency (MF) of 70 Hz and a modulation depth (MD) of 1. The roughness of other sounds can then be found by dedicated scaling experiments. In this way, Terhardt [254] found, e.g., that roughness was proportional to the square of the MD of a SAM tone.

A result of such an experiment on roughness scaling is presented in Fig. 6.2 for a 1000 Hz SAM tone with an intensity of 60 dB and an MD of 1 (data from Daniel and Weber [45, p. 118]). The smooth thin lines running from the very left to the very right of the figure are the frequencies of the three partials that constitute the SAM tone; the ordinate showing their frequencies is on the right side. The thick horizontal line represents the carrier; the curved thinner lines the sidebands. The roughness is represented by the bell-shaped curve with the roughness ordinate on the left, a graph that unmistakably has the nature of a band-pass filter. This is shown for a number of other CFs in Fig. 6.3, which presents roughness as a function of MF for 60 dB SAM tones with seven different CFs: 125, 250, 500, 1000, 2000, 4000, and 8000 Hz (data redrawn from Fig. 3 of Daniel and Weber [45]). As in Fig. 6.1, one can see that the roughness range is somewhat lower in frequency for lower CFs than for higher CFs. Furthermore, the highest roughness is attained for the 1000 Hz tone at 70 Hz MF for these SAM tones of 60 dB with an MD of 1.
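To make the stimulus concrete, the following minimal Matlab sketch (not one of the scripts accompanying this book) synthesizes such a SAM tone with the parameters just mentioned: a 1000-Hz carrier modulated at 70 Hz with an MD of 1. The calibration of the presentation level to 60 dB is omitted.

fs  = 44100;                          % sampling frequency (Hz)
dur = 1.0;                            % duration (s)
t   = (0:round(dur*fs)-1)'/fs;        % time samples
fc  = 1000;                           % carrier frequency (Hz)
fm  = 70;                             % modulation frequency (Hz)
md  = 1;                              % modulation depth
s   = (1 + md*sin(2*pi*fm*t)) .* sin(2*pi*fc*t);   % SAM tone
s   = s/max(abs(s));                  % normalize; no level calibration
% soundsc(s, fs);                     % uncomment to listen

Lowering fm to below about 20 Hz turns the roughness into audible beats, whereas raising it to above roughly 180 Hz yields a steady three-component tone, in line with the existence region shown in Fig. 6.2.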

Fig. 6.2 Roughness of a sinusoidally amplitude-modulated (SAM) 1000 Hz, 60 dB tone as a function of the modulation frequency (MF). The modulation depth (MD) is 1. The MF varies from 2 to 800 Hz. The frequencies of the three partials are presented as smooth lines running from the left to the right with the ordinate on the right side. Roughness is shown as the bell-shaped line. The percept of roughness is restricted to a range starting at about 20 Hz at about 7 s, is maximum at 70 Hz at about 12 s, and ends at about 180 Hz at about 15 s. Based on Daniel and Weber [45, p. 118, Fig. 3]. (Matlab) (demo)


Fig. 6.3 Roughness of SAM tones as a function of MF. The parameter distinguishing the curves is the carrier frequency of the tones. Note the band-pass character of the curves. (Based on data from Daniel and Weber [45, p. 118, Fig. 3]). (Matlab)

The perception of the roughness of SFM tones is quite comparable with that of SAM tones. Terhardt [251] remarks that the power spectrum of SFM tones with modulation indices smaller than 1 is not much different from the power spectrum of SAM tones, since the second-order sidebands are relatively small (see Sect. 1.7), which can explain this result. The same conclusion is drawn by Kemp [116]: "The results indicate that the roughness of frequency modulated tones behaves similarly to that of amplitude modulated tones. Such a similarity suggests that the magnitude of the roughness sensation can be derived from temporal fluctuations in excitation, irrespective of the type of fluctuating sound used" (p. 126).

It is clear that the graphs presented in Fig. 6.3 resemble the frequency characteristics of relatively wide band-pass filters. Indeed, models of roughness perception assume that roughness is based on temporal fluctuations in the envelopes of the band-pass-filtered outputs of the auditory filters. The first to develop such a model were Aures [16] and Daniel and Weber [45]. With this model, the roughness of two-tone complexes, SAM tones, SFM tones, and SAM noise can be estimated relatively well. An example is presented in Fig. 6.4, showing the same curves for 250, 1000, and 4000 Hz as in Fig. 6.3, now combined with the roughness estimates calculated with Daniel and Weber's model. One can see that the simulation is not too bad, though there still are discrepancies. The reader who is interested in the details of this roughness model is referred to the original literature and the annotated software accompanying Fig. 6.4. Here, only a summary description will be presented; a simplified sketch of the processing chain is also given at the end of this section.

Indeed, in order to estimate the roughness of a sound at a certain moment, the model by Daniel and Weber [45] starts from a sound segment gated out by a 200 ms analysis window. While retaining the phase information, the excitation pattern of this segment is calculated. This is done on a Bark scale and not on a Cam scale in order to enable a comparison with the earlier model developed by Aures [16], which was based on the Bark scale. From this excitation pattern, the specific loudness is calculated for 47 equidistant locations


Fig. 6.4 Estimated roughness of SAM tones. The solid lines connect the perceptually measured data points shown as circles and are the same as in Fig. 6.3. The dotted lines represent the estimated roughness calculated with the model by Daniel and Weber [45]. (Matlab)

separated by 0.5 Bark on the Bark scale. Remember that the Bark scale runs from 0 to 24 Bark, so that these 47 locations cover the complete Bark scale. Using the original phases of the signal at these locations, an inverse Fourier transform is then used to estimate the temporal envelopes of the outputs of the auditory filters. The band-pass character of roughness is then taken into account by band-pass filtering these envelopes, so that only fluctuations that contribute to roughness perception are retained. From these filtered envelopes, the MDs of the auditory-filter outputs are estimated.

In the last step of this model, the correlation coefficients of the filtered temporal envelopes 1 Bark apart are calculated. This last step is necessary in order to take into account that the roughness of some sounds, e.g., noise, can be quite low although the temporal fluctuations in the auditory-filter outputs can be considerable [16, 195]. This is because, for sounds such as noise, the correlation between the outputs of the auditory filters is low. Apparently, the fluctuations of the envelopes of the auditory-filter outputs contribute more to roughness when they are in phase. By including the correlation between auditory filters 1 Bark apart in the calculations, the model succeeds in describing the low roughness of such noisy sounds [16]. Finally, in Fig. 6.3, it can be seen that, for the same MD, roughness is lower for low carrier frequencies, increases up to 1 kHz where it is maximum, and then diminishes again for higher carrier frequencies. This is taken into account by choosing appropriate weighting factors. For each of the 47 frequency channels, the contribution to the total roughness is then calculated by multiplying this weighting factor, the square of the correlation coefficient, and the square of the MD, resulting in what is called the specific roughness. Roughness is then found by summation of the specific roughness over all 47 auditory channels. The result gives the roughness estimate of the sound.

The model presented by Aures [16] includes auditory band-pass filtering applied to the envelopes of the outputs of the auditory filters. Hence, this model suggests the existence of an additional, more central processing stage applied to the envelopes of


the auditory-filter outputs. Such a filter is called a modulation transfer function (MTF). The presence of such a bank of MTFs, or modulation filterbank (MFB), has further been assumed on various grounds by, e.g., Dau, Kollmeier, and Kohlrausch [46, 47], Houtgast [106], Jepsen, Ewert, and Dau [114], and Lemańska, Sęk, and Skrodzka [132]. The description of these MTFs covers not only the roughness range but also the rhythm range, so it is likely that they are not specifically involved in roughness perception but play a more general role in the processing of the envelopes of the auditory-filter outputs. At the level of the central nervous system, such filters are discussed by Joris, Schreiner, and Rees [115] and Xiang, Poeppel, and Simon [275].

Recapitulating, roughness perception is based on integration of information over all auditory filters that contribute information to the perceived auditory unit. This information is derived from fluctuations in the envelopes of the auditory-filter outputs. These fluctuations must be correlated, and only fluctuations within the roughness range play a role. These aspects are all indispensable parts of models of roughness perception.

Various adaptations of the computational roughness model by Daniel and Weber [45] have been proposed, e.g., by Sontacchi [231]. Moreover, Leman [131] developed a model based on a gammatone filterbank and a synchronization index for the envelopes of the auditory-filter outputs. Hoeldrich and Pflueger [102], Sontacchi et al. [232], and Wang et al. [270] further developed Daniel and Weber's model for an application measuring the quality of car noise. These roughness models are quite successful for temporally symmetric, not too complex sounds. They all have one important limitation, however, which is that they are insensitive to temporal asymmetries in the envelopes of a sound. After all, the model developed by Daniel and Weber [45] is based on estimates of the MD of the outputs of the auditory filters. This MD is the difference between the maxima and the minima of the envelope and, hence, ignores temporal asymmetries in the auditory-filter outputs. The same holds for the synchronization indices in the model developed by Leman [131]. Consequently, asymmetries in the temporal envelopes of sounds are ignored, and these models predict the same roughness for a given sound as for its temporal reversal. This appears not to be correct. Pressnitzer and McAdams [196] showed that, in general, amplitude-modulated sounds with temporally asymmetrical envelopes are rougher when they are damped, i.e., when the rises in the envelopes are more rapid than the falls, than when they are ramped, i.e., when the rises are slower than the falls. This is demonstrated in Fig. 6.5 for two 1000 Hz tones triangularly modulated in amplitude at 40 Hz. The triangular envelope in the upper panel rises in 22.5 ms and falls in 2.5 ms, whereas in the lower panel this is reversed in time. Comparing the two sounds shows that the sound played first, depicted in the upper panel, sounds less rough than the sound played second, depicted in the lower panel. This was confirmed by Yasui and Miura [277].

Timbre changes for sinusoids and noise with asymmetrically modulated envelopes have also been described by Patterson [171, 172] and Akeroyd and Patterson [2], not only for modulations in the roughness range, but also for higher modulation frequencies. Patterson [171, 172] shows that sinusoids with ramped envelopes sound


Fig. 6.5 Temporal asymmetry of roughness in triangularly amplitude-modulated sounds. The carrier frequency is 1000 Hz, the MF is 40 Hz. In the upper panel, the amplitude has a slow rise of 22.5 ms and a rapid fall of 2.5 ms, which is reversed for the sound shown in the lower panel. The dashed lines represent the triangular envelopes of the sound. The sound played first, shown in the upper panel, sounds less rough than the sound played next, shown in the lower panel. (Matlab) (demo)

more like a pure tone than sinusoids with damped envelopes. Similarly, Akeroyd and Patterson [2] show that noise with ramped envelopes sounds more like a hiss than noise with damped envelopes. Temporal asymmetries in the envelopes of sounds appear to be important aspects of sounds such as rubbing sounds, scratching sounds, sounds of alarm clocks, or sounds of objects rolling over irregular surfaces. These sounds are in general much less rough when played backwards. The model by Daniel and Weber [45], being temporally symmetric, does not capture this. This model and some of its extensions, therefore, have their limitations.

Various adaptations have been proposed to take the temporal asymmetries of the auditory system into account in a model of roughness perception. Kohlrausch, Hermes, and Duisters [119] included an auditory filterbank of gammachirps and a model of peripheral adaptation, which could partly explain the temporal asymmetry of roughness perception. Vencovský [264] developed a model based on a simulation of the modulations of the basilar membrane, the "hydrodynamics of the cochlea". Besides dealing relatively well with the temporal asymmetries, this model appears to predict well the differences in roughness of intervals between complex tones tuned either in equal temperament or in just temperament [265]. (For a short description of the differences between equal and just temperament, see Sect. 1.4.)

Temporal asymmetries appear to play a role not only in roughness perception, but also in loudness perception. Indeed, sounds with slowly rising and rapidly falling envelopes are perceived as louder than their temporal inverses [233]. Temporal asymmetries in the auditory system are, however, difficult to model [110]. This is most


likely due to the fact that asymmetrical temporal processing of sound does not only play a role in the peripheral auditory system, as modelled by Kohlrausch, Hermes, and Duisters [119] and Vencovský [264], but also more centrally. Apparently, every level of processing has its own time constants. This leads to complex perceptual results, which are hard to model. This issue is discussed in depth in Chap. 5, "Temporal Processing in the Auditory System", of Moore [159].

Interferences between partials of different tones can result not only in audible fluctuations in the roughness range, but also in fluctuations with frequencies lower than about 20 Hz. When interferences result in audible beats, two disturbing phenomena occur, one concerning the harmonic structure of the partials, and the other the rhythm of the melody. Indeed, when the partials of two different tones have a frequency difference in the rhythm range, the pitch frequency of the combined partials is intermediate between the frequencies of the separate partials. Hence, this pitch does not fit into the harmonic pattern of either tone. Furthermore, the beats and, hence, their associated auditory units, will in general have no rhythmic relation with the intended rhythm of the music. Both phenomena are in general not appreciated by musicians and their listeners. The combination of these two phenomena induces the percept of dissonance, the opposite of consonance, to be discussed in Sect. 10.12.5.

In conclusion, generally speaking, roughness is not considered a positive property of sound. It was first described by Helmholtz [99] in 1895 in the context of the tuning of musical instruments as a property that should be reduced to a minimum in order to realize consonance and harmony. Helmholtz [99] correctly described roughness as the result of interference, at the perceptual level, between the partials of the sounds produced by musical instruments. More specifically, in music, the concept of roughness has been associated with the concept of dissonance [99, 266]. Indeed, when musicians tune their instruments, they in general try to avoid audible interferences between harmonics as much as possible, especially between lower harmonics. Roughness, however, can be part of the timbre of percussive sounds. It can also be used for expressive purposes when it is present in a short segment of a tone, e.g., at its beginning or its end. A discussion of the use of roughness for expressive purposes in music is presented by Vassilakis [263]. Moreover, roughness plays an important role in the concepts of consonance and dissonance.

In speech, too, roughness is not considered a good property of the voice, but there it is associated not with the interference of multiple sound sources, as in music, but with the quality of a single voice. For instance, Eddins, Kopf, and Shrivastav [53] synthesized a number of vowels modelled after real disordered voices. Thus, they could apply amplitude modulation to these voices and vary its MF and MD in a controlled way. They found that roughness judgments of these synthesized voices varied in the same way with MF and MD as reported for SAM tones by Fastl and Zwicker [65]. In spite of this, in medical research on rough voices, roughness models are based only on "acoustic markers" [19], and timbre studies concerned with roughness are ignored. Finally, Arnal et al. [9] showed that roughness is a perceptually important attribute of screams, which stimulate brain areas associated with danger [23].
Similarly, it is an important attribute of "terrifying film music" [257, p. EL540]. Furthermore, roughness also plays an important role in the design of product sounds [168]. This does not mean, however, that roughness is always avoided in the design of product sounds. Actually, roughness is an important attribute of the sound of alarm clocks and some other warning sounds.
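To summarize the processing chain described in this section, the following Matlab sketch gives a deliberately simplified, uncalibrated version of a Daniel-and-Weber-style roughness estimate. It is not the published implementation: the excitation-pattern and specific-loudness stages are replaced by rectangular 1-Bark-wide bands, the channel-dependent weighting and the calibration to asper are omitted, and the function name roughnessSketch, the channel spacing, and the modulation-depth estimate are illustrative choices only.

function R = roughnessSketch(x, fs)
% Simplified, uncalibrated sketch of a Daniel-and-Weber-style roughness
% estimate. Auditory filters are approximated by rectangular 1-Bark-wide
% bands centred 0.5 Bark apart; channel weighting and calibration to asper
% are omitted, so R is in arbitrary units.
x = x(:);
N = length(x);
f = (0:N-1)'*(fs/N);                     % frequency of each FFT bin
fpos = min(f, fs - f);                   % map negative-frequency bins
z = 13*atan(0.00076*fpos) + 3.5*atan((fpos/7500).^2);   % Bark per bin
zc = 0.5:0.5:23.5;                       % 47 channel centres (Bark)
nCh = numel(zc);
X = fft(x);
m = zeros(nCh, 1);                       % generalized modulation depths
bpEnv = zeros(N, nCh);                   % band-passed channel envelopes
for k = 1:nCh
    band = abs(z - zc(k)) <= 0.5;        % rectangular 1-Bark-wide band
    band(1) = false;                     % exclude the DC bin
    yk = real(ifft(X .* band));          % channel output
    env = abs(hilbertFFT(yk));           % envelope via the analytic signal
    E = fft(env);
    E(~(fpos >= 20 & fpos <= 300)) = 0;  % keep roughness-range fluctuations
    bpEnv(:, k) = real(ifft(E));
    if mean(env) > 0                     % modulation-depth estimate
        m(k) = min(1, sqrt(2)*std(bpEnv(:, k))/mean(env));
    end
end
cc = ones(nCh, 1);                       % envelope correlations 1 Bark apart
for k = 3:nCh
    c = corrcoef(bpEnv(:, k-2), bpEnv(:, k));
    cc(k) = max(0, c(1, 2));
end
R = sum(cc.^2 .* m.^2);                  % sum of specific roughness
end

function a = hilbertFFT(y)
% Analytic signal computed via the FFT.
N = length(y);
Y = fft(y);
H = zeros(N, 1);
H(1) = 1;
if mod(N, 2) == 0
    H(N/2 + 1) = 1;
    H(2:N/2) = 2;
else
    H(2:(N+1)/2) = 2;
end
a = ifft(Y .* H);
end

Like the published model, this sketch relies only on modulation depths and envelope correlations, so it assigns virtually the same value to a sound and to its temporal reversal; the perceptual asymmetry between damped and ramped envelopes discussed above is therefore not captured.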

6.3 Breathiness

In the preceding section, roughness was shown to be an important auditory attribute not only of a modulated sound, but also of a disordered voice. Other important attributes of a voice are its hoarseness and its breathiness [279]. Hoarseness is generally considered a combination of roughness and breathiness [117]. Now breathiness will be discussed. Intuitively, breathiness is the amount of noise in a sound relative to the presence of tonal components; in other words, the more noise, the breathier the sound. Very often, the tonal components will consist of the harmonics of the sound, which is why breathiness is often associated with the harmonics-to-noise ratio (HNR), the ratio of the energy of the harmonics to the remaining energy in the signal.

Various methods have been developed to estimate the HNR. For instance, based on the fact that the summation of the harmonics of a sound results in a periodic signal, Yumoto, Gould, and Baer [279] and Ferrer et al. [69] estimated the periodic part of the signal by averaging a number of successive pitch periods in a sustained vowel. The energy of this average is then an estimate of the harmonic energy E_H. The remaining energy is the energy of the noise part, E_N. The HNR can then be found in dB by calculating 10 log10(E_H/E_N). (A minimal sketch of this period-averaging procedure is given at the end of this section.) This procedure requires a stretch of relatively constant speech lest the estimate of the noisy part include systematic changes in amplitude and frequency of the harmonics unrelated to the HNR. Consequently, this method can only be used for sustained phonemes, mostly vowels. Another method was developed by De Krom [49]. This method was based on the cepstrum, which is the inverse Fourier transform of the logarithm of the power spectrum of the signal, often used for pitch estimation [163]. For this method, only a few pitch periods are needed, and it can be applied to speech segments at different locations within the syllable. De Krom [50] compared the HNR within various frequency bands of the speech signal with expert judgments of the breathiness of the speech and concluded that the HNR within these bands could indeed explain 75–80% of the variance in breathiness judgments. A third method was developed by Yegnanarayana, d'Alessandro, and Darsinos [278]. They used linear predictive coding (LPC), briefly discussed below in Sect. 6.5.1, to separate the speech source from the filter and used the amount of noise in the source signal as an estimate for the HNR of the speech signal. The details are very technical and the interested reader is referred to Yegnanarayana, d'Alessandro, and Darsinos [278].

It is concluded that the relative amount of noise is an important determinant of the breathiness of a voice. Other factors, however, may also play a role in breathiness perception. Shrivastav and Sapienza [218] investigated the role of various such


factors, such as amplitude relations between harmonics and formants, the pitch, or some properties derived from the cepstrum. Moreover, they did not use the HNR as a measure for the amount of noise in the speech signal, but the ratio between the estimated loudness of the aspiration noise in the signal, NL, and the estimated partial loudness of the harmonic part of the signal, HL. How NL and HL can be measured will be discussed in detail in Chap. 7. The advantage of using NL and HL is that, in calculating their values, the distribution of information over the tonotopic array and the nonlinearities of the peripheral hearing system are taken into account. This makes it more likely that the information used in the calculation of the estimated breathiness gets a weight that matches its weight in the auditory process of breathiness perception.

In order to test their model, Shrivastav and Sapienza [218] chose a set of female breathy voices from a database of disordered voices, selected part of a stationary vowel /a/, and equated the intensity of the segments. They concluded that the best predictor of the breathiness of the vowels was NL, so the estimated loudness of the aspiration noise. In a follow-up study, Shrivastav and Camacho [217] synthesized the vowel /a/ for five different male and five different female voices modelled from the same database of disordered voices. They varied the amount of aspiration noise in the synthesized vowels within a "natural range" and asked listeners in a magnitude-estimation task to rate the breathiness of the more or less breathy vowels. Again, the authors selected a number of acoustic and auditory measures and studied their relation with the breathiness ratings. From this set of measures, they found the best results for a power-law relation between breathiness and the ratio of NL and HL, NL/HL, with an exponent that depended on the pitch of the voice. Shrivastav et al. [219] tested this model on a new, independent set of synthetic and natural voices and found a high correlation between estimated breathiness and breathiness judgments, explaining 60% of the variance.

The concept of breathiness refers to the perceptual proportion of noise in the timbre of a sound. The opposite of breathiness is the proportion of tonal components, which should have an inverse relation with breathiness. This attribute of timbre is referred to as tonalness [14, 15] or tonality [66]. A review of these concepts is presented by Hansen, Verhey, and Weber [96]. Aures [14] presents a computational model to estimate the tonalness of a sound based on selecting tonal peaks in the power spectrum, derived from a pitch-estimation algorithm developed by Terhardt, Stoll, and Seewann [253]. In this algorithm, tonal components are selected insofar as they contribute to pitch, and Aures [14] proposes this as a measure of their contribution to the tonalness of the sound. Eddins et al. [54] found an inverse relation of breathiness, not with tonalness, but with a related concept they called "pitch strength", to be discussed in the section of that name in Chap. 8.

A different approach was followed by Barsties v. Latoszek et al. [18]. They defined 28 "acoustic markers", measures derived from the spectrum, the cepstrum, temporal envelopes, etc., of the speech signal. These measures were calculated for 970 disordered and 88 normal voice samples consisting of the sustained vowel /a/. They asked four experts to rate the breathiness of these samples.
By means of a stepwise multiple regression, they defined a measure, the Acoustic Breathiness Index (ABI), based


on nine out of the 28 acoustic parameters. With this index, 72% of the variance of the breathiness ratings could be explained. Based on such results, one may argue that the ABI is a better measure for breathiness than, e.g., the power-law equation by Shrivastav and Camacho [217]. One should realize, however, that the latter model has only two parameters, whereas the ABI has nine. Moreover, since the power-law equation between breathiness and NL/HL is based on knowledge about information processing in the human hearing system, it can in turn contribute to the understanding of the processes that may underlie breathiness perception.

In order to understand the role of breathiness in communication, one not only needs to understand the perception of breathiness but also its production [120, 123]. Breathiness of a voice does not only arise from aspiration noise, but also from jitter and shimmer of the voice source. Jitter and shimmer do not only contribute to breathiness but also to roughness and related attributes of the voice, such as hoarseness, creakiness, and vocal fry. In order to separate the effects of these different noise sources on the various perceptual attributes of a voice, a large number of acoustic measures have been developed. For a review, the reader is referred to Maryn et al. [148]. Very few of these studies take perceptual processes into account.

There is one important problem in breathiness perception that is not considered in any of the studies on breathiness discussed so far. In all these studies, it is assumed that the periodic part and the noisy part of the speech signal are perceptually integrated into one auditory unit. That this happens in breathy voices is not self-evident. When noise is added to recorded speech or music, one generally hears two auditory streams: One is the noise, the other the speech or the music. The speech or the music may be masked somewhat, but it does not sound breathier than when no noise is added. Apparently, the noise in music and speech that makes them sound breathy has to fulfil specific requirements in order to integrate perceptually with that speech or music. Hermes [100] showed that, for LPC-synthesized vowels with fundamental frequencies lower than about 300 Hz, high-frequency noise did integrate perceptually with the vowel when the source signal consisted of low-pass filtered pitch pulses combined with high-pass filtered noise bursts; the lower the cut-off frequency, the breathier the resulting vowels. Moreover, the amount of integration depended on the relative phase of the pitch pulses and the noise bursts, and was largest when they ran in synchrony. The perceptual integration consisted of both a reduction of the loudness of the separate noise stream and a timbre change in the breathy vowel. This is important, because one may think that the decrease in loudness is due to partial masking of the noise by the pitch pulses. This, however, can have played only a minor role, because the reduction in loudness of the noise stream was coupled with a change in timbre of the synthesized vowel. This timbre change showed that the noise bursts integrated perceptually with the vowel, making it sound breathier. Since the auditory information from the noise is divided over the perceived noise and the perceived vowel, this is an example of information trading, discussed above in Sect. 4.8.3.

Above, it has been suggested that breathiness is a negative property of a pathological voice.
This can be the case, of course, but aspiration noise is also a natural component of normal speech, in particular when one speaks at a low level and,


certainly, when one whispers. For LPC-synthesized speech, it has been shown that adding noise bursts to the source signal in synchrony with the pitch pulses leads to more natural, less buzzy-sounding speech [242]. Furthermore, the breathiness of a voice can also contribute to more complex perceptual attributes of a speaker. For instance, the presence of noise in speech does not only increase the breathiness or the perceived naturalness of speech, but also makes it sound more feminine [259].

The model used in the discussion of roughness perception is based on the temporal course of the outputs of the auditory filters in response to the sound signal. Thus, knowledge about information processing in the peripheral hearing system is included in that model of roughness perception. In their models of breathiness perception, Shrivastav and Sapienza [218], Shrivastav and Camacho [217], and Shrivastav et al. [219] also included knowledge about auditory information processing. Above, it was argued that incorporating such knowledge improves the explanatory quality of the models and increases the understanding of the perceptual processes involved. Hence, this will also be done in the discussion of the next timbre attribute of sound, called brightness or sharpness.
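Returning to the harmonics-to-noise ratio introduced at the beginning of this section, the following minimal Matlab sketch (not one of the scripts accompanying the book) implements the period-averaging estimate in the spirit of Yumoto, Gould, and Baer [279]. It assumes a sustained, voiced segment with a known and constant pitch period of T0 samples; pitch estimation itself is not part of the sketch, and amplitude or frequency drift is ignored.

function hnr = hnrByPeriodAveraging(x, T0)
% Estimate the harmonics-to-noise ratio (in dB) of a sustained voiced
% segment x by averaging successive pitch periods of T0 samples each.
x = x(:);
nPer = floor(length(x)/T0);               % number of complete pitch periods
P = reshape(x(1:nPer*T0), T0, nPer);      % one pitch period per column
harm = mean(P, 2);                        % average period: harmonic estimate
noise = P - repmat(harm, 1, nPer);        % deviations from the average: noise
EH = nPer * sum(harm.^2);                 % harmonic energy over all periods
EN = sum(noise(:).^2);                    % noise energy over all periods
hnr = 10*log10(EH/EN);                    % HNR in dB
end

As noted above, such an estimate is only meaningful for stretches of relatively constant, sustained phonation, mostly vowels.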

6.4 Brightness or Sharpness

The timbre attribute of roughness discussed above is essentially a temporal attribute. The next auditory attribute, brightness, is spectral in nature. Based on an analysis of verbal attributes for complex tones, Lichte [135, p. 457] writes in 1941: "Preliminary observations indicated that a seemingly important type of qualitative change, tentatively labelled dullness-brightness, could be obtained by means of a series of tones in which the midpoint of the energy distribution was gradually shifted upward along the frequency continuum." He found that the change in timbre associated with the transition from sounds with a low spectral centroid to sounds with a high spectral centroid was well described by the transition from the verbal attribute dull to the verbal attribute bright. Hence, the timbre attribute associated with the spectral centroid is often referred to as brightness.

An illustration of a transition from a duller to a brighter sound as studied by Lichte [135] is presented in Fig. 6.6. It shows the amplitudes in dB of the harmonics of seven complex tones, each consisting of six harmonics of 220 Hz. The harmonics of one tone are connected by straight dotted lines. In the first stimulus, the amplitudes of the six harmonics decrease in equal steps from +15 to −15 dB. That tone sounds dullest. Its spectral centroid is at 0.42 kHz, indicated at the bottom of Fig. 6.6. In the last tone, the amplitude rises from −15 to +15 dB. That tone sounds brightest. Its spectral centroid is at 1.12 kHz. Lichte [135, p. 472] suggested: "brightness is a function of the location on the frequency continuum of the midpoint of the energy distribution."

In the study by Lichte [135], listeners scaled the sounds on a scale from dull to bright. Von Bismarck [268] used a larger set of pairs of verbal attributes with opposite meanings, referred to as semantic differentials [167], among which the scale from dark to bright and the scale from dull to sharp. It appeared that these two continua were closely related, which is


Fig. 6.6 From dull to bright. A series of seven complex tones is presented varying in spectral slope. The dotted lines connect the points that represent the amplitudes of the harmonics of one tone. The fundamental frequency of the tones is 220 Hz. In the first tone, the spectral slope is negative, the amplitudes decreasing from +15 to −15 dB; this tone sounds relatively dull. The spectral slope is then increased until, in the last tone, it is positive, the amplitudes increasing from −15 to +15 dB, and the tone sounds much brighter. The spectral centroids of the successive tones are presented in kHz at the bottom of the figure. (Matlab) (demo)

why brightness and sharpness are often used interchangeably. In a follow-up study, Von Bismarck [267] studied the effect of the lowest frequency value, the highest frequency value, and the slope of the spectrum on the sharpness ratings by listeners. Von Bismarck [268] synthesized 35 tonal or noisy sounds with spectral envelopes based on those of music and speech sounds. Just as Lichte [135], Von Bismarck [267, p. 169] concluded: "sharpness appears in a first approximation to be describable in terms of a measure related to the position of the spectral energy concentration." Based on this, he followed a more perception-oriented approach than his predecessor did: He did not calculate the centroid of the spectral energy distribution as an estimate for sharpness; instead, he calculated a weighted centroid of the specific-loudness distribution N'(z), the distribution that was discussed in Sect. 3.8.5. Based on the results of perception experiments, frequencies higher than 3.25 kHz were given more weight in this estimation of sharpness than lower frequencies. In Sect. 3.8.5, the specific-loudness distribution was calculated on a Cam scale, but that scale had not yet been invented at the time. Von Bismarck [267] based his model on the earlier Bark scale. Indeed, he proposed as an estimate for sharpness the following equation:

\frac{\sigma}{\sigma_0} = c \cdot \frac{\int_0^{24\,\mathrm{Bark}} N'(z)\, g(z)\, dz}{\int_0^{24\,\mathrm{Bark}} N'(z)\, dz}    (6.3)

In this equation, the denominator on the right is the integral over the specific loudness and, hence, represents the loudness of the sound. Furthermore, σ0 is the sharpness of a reference sound, c is a proportionality constant, and g(z) is the weighting function mentioned above, proportional to the sharpness of a narrow band of noise of constant loudness. In Von Bismarck [267], this weighting function is presented graphically. It is an almost linear function of z up to about 16 Bark, or 3.25 kHz, after which it rises more rapidly than linearly. This implies that components


with frequencies higher than 3.25 kHz contribute more to the sharpness estimate than the lower-frequency components.

At about the same time, a similar, and simpler, model was presented by Gordon and Grey [87]. They had performed a multidimensional scaling (MDS) analysis on similarity ratings of a number of musical sounds and found a timbre space of three dimensions, the first of which was correlated with the centroid of the specific-loudness distribution of the sound. In order to illustrate their model, Gordon and Grey [87] present the specific-loudness distribution of a clarinet sound, calculated according to the model of loudness summation as presented by Zwicker and Scharf [281]. The authors indicated the centroid of this specific-loudness distribution as the correlate of the main dimension that came out of the MDS analysis. So, without mentioning the work of Von Bismarck [268], Gordon and Grey [87] came to a similar conclusion, i.e., that the centroid of the specific-loudness distribution can be associated with the perceptually most significant attribute of the sounds under consideration. The only difference with Von Bismarck [267] was that Gordon and Grey [87] did not include the weighting function g(z) of Eq. 6.3.

This approach was further formalized by Aures [14]. As a standard unit for sharpness, he introduced the acum, the sharpness of a narrow band of noise, narrower than 1 Bark, centred at 1 kHz. For g(z), Aures [14] chose an exponential function of z, g(z) = 0.0165 e^{0.173 z}. This was further elaborated by Fastl and Zwicker [66, pp. 239–246]. They introduced some minor modifications that led to the following equation for sharpness S:

S = 0.11 \cdot \frac{\int_0^{24\,\mathrm{Bark}} N'(z)\, g(z)\, z\, dz}{\int_0^{24\,\mathrm{Bark}} N'(z)\, dz}\ \text{acum}    (6.4)

In this equation, the weighting function of the previous equation is now decomposed into two factors, g(z) and z. The weighting function g(z) just enhances frequencies higher than 16 Bark, or 3.25 kHz: It is 1 up to 16 Bark and then increases to 4 at 24 Bark. As g(z) = 1 for frequencies lower than 3.25 kHz, this actually means that, for sounds that do not have such high-frequency components, sharpness is directly proportional to the centroid of the specific-loudness distribution. (A sketch of this computation is given at the end of this section.) This is illustrated in Fig. 6.7 for the sounds played in Fig. 6.6. Besides the specific-loudness distributions of the seven sounds, Fig. 6.7 gives their centroids in Cam, and the corresponding frequencies in kHz. These correspond to the abscissae of the small circles presented under the values of the centroids. As one can see, the tones with the lower spectral slopes have lower sharpness than those with the higher spectral slopes. Another example is given in Fig. 6.8 for seven complex tones of six successive harmonics of equal amplitude. The rank of the lowest harmonic is varied from 1 to 7, so that the brightness of the tones increases, but the slope of the spectrum is not varied.

In the data for the estimated brightness presented in Figs. 6.7 and 6.8, only the centroid of the specific-loudness distribution is presented; the weighting factor g(z) has been neglected. According to the Aures [14] model, this only plays a role of


Fig. 6.7 Specific-loudness distributions of the seven tones played in Fig. 6.6. The position of the centroid of each distribution is indicated in the upper part of the figure by a small circle, with its value given in kHz and in Cam. (Matlab) (demo)

Fig. 6.8 Specific-loudness functions of seven complex tones consisting of six successive harmonics. The fundamental frequency is 220 Hz. The rank of the lowest harmonic is varied from 1 to 7. The abscissae of the small circles, with their values presented in kHz and in Cam, indicate the centroids of these distributions, which correspond to the brightness of the sounds. (Matlab) (demo)

significance for frequencies higher than about 15 Bark, 2700 Hz, or 23.7 Cam. How the Bark-based model by Aures [14] and Fastl and Zwicker [66] can be converted into a Cam-based model is discussed by Swift and Gee [249, 250].

Lichte [135] associated the term brightness with the centroid of the spectrum of the sound, whereas Von Bismarck [267] used the term sharpness for probably the same perceptual attribute. Accordingly, the terms sharpness and brightness may have been used for the same perceptual attribute, but intuitively their associations may not be the same. The relation between the perceptual attributes of sharpness and brightness was further investigated by Ilkowska and Miśkiewicz [108]. They used both musical and noisy sounds, and amplified these sounds within a certain frequency band by 6 or by 12 dB. The centre frequency of the amplified bands was varied between 50 and 16000 Hz, and the bandwidth of the amplified bands was varied between one


third, three thirds, five thirds, and seven thirds of an octave. They had listeners rate the brightness and the sharpness of these sounds on an absolute magnitude scale and compared the judgments for brightness and sharpness. No major differences were found between the judgments of brightness and those of sharpness. Only increments in bands with frequencies higher than 6.3 kHz induced a larger increase in brightness judgments than in sharpness judgments, but only for musical sounds, and not for the noise. Furthermore, they compared the judgments by the listeners with the estimates of sharpness obtained with the equation above as presented by Von Bismarck [267]. In this, Ilkowska and Miśkiewicz [108] used the weighting function g(z) = 0.0165 e^{0.173 z} derived from Aures [14]. They found that the judgments by the listeners increased more with frequency than predicted by these estimates. Larger discrepancies were found by Almeida et al. [5]. They used stimuli consisting of the first six harmonics of 500 Hz, of which they varied the spectral slope. They had four standard tones with spectral centroids at 500, 720, 940, and 1160 Hz and asked listeners to adjust the spectral centroid of a second tone so that it sounded twice as bright as one of the standard tones. They found discrepancies between the results of these brightness adjustments and the predictions based on Aures [14]. This may have to do with the different stimuli used by the two groups: Ilkowska and Miśkiewicz [108] used complex sounds varying in spectral peaks, the formants, whereas Schubert and Wolfe [214] and Almeida et al. [5] used stimuli varying in spectral slope.

All this may raise various questions. First, Von Bismarck's Eq. 6.3 may not fully cover the associations listeners have with the verbal attributes of brightness or sharpness. Second, it indicates that listeners do not have unequivocal ideas about the difference in meaning of brightness and sharpness as auditory attributes. Third, it may be that brightness and sharpness represent different perceptual attributes, as concluded by Almeida et al. [5]. But there may be more to it. The studies discussed until now only mentioned the centroid of the power spectrum or of the specific-loudness distribution as the correlate of the auditory attribute of sharpness or brightness. Statistically, the centroid of a distribution represents the mean of the distribution. To understand the perception of sharpness better, Marui and Marten [147] did not only look at the mean of this spectral power distribution, but also at its variance, skewness, and kurtosis. For some impulsive noise sounds with different power spectra, they showed that not only the mean of the specific-loudness distribution played a role in the sharpness ratings by the listeners, but also its variance; the skewness and the kurtosis did not contribute significantly. As the mean of a distribution represents its centroid, the variance of a distribution represents its spread. So, these studies lead to the conclusion that the distribution of acoustic energy over the tonotopic array is a relevant determinant not only of loudness, but also of perceptual attributes that listeners associate with verbal attributes such as bright and sharp, and not only with bright and sharp, but also with high or brilliant. Both the centroid of the specific-loudness distribution and its spread over the tonotopic array play a role. The exact role of these two factors and their interaction remains to be investigated.
A possibility is that a larger spread over the tonotopic array may be associated with the perceived effort with which a sound is produced. This will be discussed below in Sect. 6.7.3. Whether


skewness and kurtosis, or even higher-order moments of the specific-loudness distribution, play a role also needs further study.
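The following minimal Matlab sketch (not one of the scripts accompanying the book) computes the sharpness of Eq. 6.4 from a given specific-loudness distribution, together with the centroid and spread just discussed. The specific-loudness distribution itself would have to come from a loudness model such as that of Sect. 3.8.5; the function name and the linear shape assumed for the rise of g(z) above 16 Bark are illustrative choices.

function [S, centroid, spread] = sharpnessSketch(Nprime, z)
% Sharpness in acum according to Eq. 6.4, plus the centroid (mean) and
% spread (standard deviation) of the specific-loudness distribution Nprime,
% given at the critical-band rates z (in Bark).
z = z(:);
Nprime = Nprime(:);
g = ones(size(z));                        % weighting: 1 up to 16 Bark ...
hi = z > 16;
g(hi) = 1 + 3*(z(hi) - 16)/8;             % ... rising to 4 at 24 Bark (assumed linear)
total = trapz(z, Nprime);                 % overall loudness
S = 0.11 * trapz(z, Nprime .* g .* z) / total;                  % Eq. 6.4, in acum
centroid = trapz(z, Nprime .* z) / total;                       % mean of the distribution
spread = sqrt(trapz(z, Nprime .* (z - centroid).^2) / total);   % its spread
end

For sounds without energy above 3.25 kHz, g(z) = 1 everywhere, and S is then simply proportional to the centroid, in line with the discussion of Figs. 6.7 and 6.8.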

6.5 Dimensional Analysis

In the previous sections, the auditory attributes of roughness, breathiness, and brightness have been discussed. In doing so, it is implicitly assumed that these three timbre attributes can be considered independently of the other perceptual attributes of an auditory unit. Below, it will appear that, in general, this is not so self-evident. Since timbre is defined as the auditory attribute of sound that is not loudness, not pitch, and not duration, it is clear that no simple acoustic correlate of timbre can be defined. Moreover, if one adds the hundreds of words used to describe sound mentioned by Pedersen [176], it becomes evident that timbre space is not one-dimensional or even few-dimensional. One may even ask whether it is multidimensional at all, since a space with dimensions assumes a metric, which in timbre spaces is defined in terms of similarity. In view of the non-linearities, the hysteresis, and the context effects that will be discussed, it is unlikely that the general timbre space encompassing all possible timbres has a metric at all. Fundamental aspects of the mathematical structure of perceptual spaces are discussed by Zaidi et al. [280]. On the other hand, subspaces of certain classes of sounds may be approximated as spaces with a limited number of dimensions.

In order to derive such spaces from perceptual results, computationally intensive procedures are required. One of the tools that made this line of research possible is the spectrogram, introduced in 1945 [190]. Other techniques became available at the end of the 1960s and 1970s. Besides the spectrogram, the main techniques with which timbre spaces have been investigated are multidimensional scaling (MDS) and the analysis of verbal attributes of sounds [52, 94]. This has been done for a number of different kinds of sounds, e.g., for the class of vowel sounds and for the class of sounds produced by musical instruments. These two timbre subspaces will now be discussed. Overviews of earlier studies on timbre perception are presented in Plomp [187] and Reuter and Siddiq [200].
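Since MDS recurs throughout the remainder of this chapter, a minimal Matlab sketch of the classical (metric) variant may help to make the technique concrete. The timbre studies discussed below generally use more elaborate, often non-metric, MDS algorithms; the function name classicalMDS is chosen here for illustration only.

function Y = classicalMDS(D, nDim)
% Classical (metric) multidimensional scaling: given an n-by-n matrix D of
% pairwise dissimilarities, return n points in nDim dimensions whose mutual
% distances approximate the entries of D.
n = size(D, 1);
J = eye(n) - ones(n)/n;                    % centring matrix
B = -0.5 * J * (D.^2) * J;                 % double-centred squared distances
[V, L] = eig((B + B')/2);                  % symmetrize against rounding errors
[lambda, idx] = sort(diag(L), 'descend');  % order by explained variance
V = V(:, idx);
lambda = max(lambda(1:nDim), 0);           % keep the nDim largest eigenvalues
Y = V(:, 1:nDim) .* repmat(sqrt(lambda)', n, 1);   % resulting configuration
end

Applied to a matrix of dissimilarity judgments of, say, a set of vowels or instrument tones, the first two columns of Y give the kind of two-dimensional configuration discussed in the following subsections.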

6.5.1 Timbre Space of Vowel Sounds

The spectrotemporal analysis of spoken vowels became possible with the advent of the spectrogram [190–192]. Spectrograms clearly show the changing formants, i.e., the resonances of the vocal tract, as "dark bands weaving through this pattern" [191, p. 528]. These patterns can be seen in the spectrograms shown in Figs. 1.13 to 1.16 of the introduction of this book, where the wide-band and the narrow-band spectrogram of speech have been introduced for the utterance "All lines from London are engaged".
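The distinction between wide-band and narrow-band spectrograms comes down to the length of the analysis window of the underlying short-time Fourier transform. The following minimal Matlab sketch (not one of the scripts accompanying the book; the function name and parameter values are illustrative) computes a magnitude spectrogram and makes this trade-off explicit.

function [S, tFrame, fBin] = spectrogramSketch(x, fs, winDur)
% Magnitude spectrogram via a short-time Fourier transform. A short window
% (e.g., winDur = 0.005 s) yields a wide-band spectrogram that resolves the
% individual pitch periods and shows the formants as broad bands; a long
% window (e.g., winDur = 0.04 s) yields a narrow-band spectrogram that
% resolves the individual harmonics.
x = x(:);
winLen = round(winDur*fs);                  % window length in samples
hop = max(1, round(winLen/4));              % 75% overlap between frames
win = 0.5 - 0.5*cos(2*pi*(0:winLen-1)'/(winLen-1));   % Hann window
nFrames = floor((length(x) - winLen)/hop) + 1;
nBins = floor(winLen/2) + 1;
S = zeros(nBins, nFrames);
for k = 1:nFrames
    idx = (k-1)*hop + (1:winLen)';          % sample indices of this frame
    X = fft(x(idx) .* win);                 % windowed short-time spectrum
    S(:, k) = abs(X(1:nBins));              % keep the positive frequencies
end
tFrame = ((0:nFrames-1)*hop + winLen/2)/fs; % frame centres (s)
fBin = (0:nBins-1)'*fs/winLen;              % frequencies of the FFT bins (Hz)
% imagesc(tFrame, fBin, 20*log10(S + eps)); axis xy;   % display in dB
end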


In general, formants are numbered from low to high as F1, F2, F3, etc. They can be described by modelling the human vocal tract as a half-open pipe. If the cross-section of the half-open pipe is equal everywhere, its resonance frequencies can be represented as (2k − 1)·c/(4L), k = 1, 2, 3, ..., in which c is the speed of sound and L the length of the pipe. Since the length of the vocal tract is about 17 cm for men and about 13 cm for women, one ends up with odd multiples of 500 Hz for men and of 660 Hz for women. When the cross-sections have different areas, the resonance frequencies change accordingly. In human speech, this happens with the changing positions of the articulators such as the jaw, the tongue, and the lips, resulting in formant frequencies fluctuating around their averages.

An early systematic study of English vowels from 1952 based on these spectrograms was carried out by Peterson and Barney [178]. They measured the centre frequencies of the first three formants of ten English vowels spoken in the context of the well-articulated words /h[V]d/, in which [V] represents one of the ten vowels. These frequencies are presented for men and women in Table 6.1. It will appear that, for carefully articulated vowels, especially the first two formant frequencies are important for the identification of the vowel.

As the importance of formant frequencies became apparent, Stevens and House [234] in 1961 developed a linear acoustic model of vowel production, which has been extremely influential in the study of speech by analysis-by-synthesis [24]. This has led to the development of linear predictive coding (LPC) of speech [12], a technique that has been, and still is, a powerful tool in speech research. In this model, speech production is divided into a filter part and a source part. The source consists of the periodic or noisy signal generated at the vocal cords or at other constrictions in the oral cavity. The filter is the oral cavity. An assumption of LPC is that this source signal has a white spectrum. This source signal is then filtered by the resonating oral cavity, resulting in a coloured spectrum with various resonances, the formants, characterized by their frequencies and bandwidths. The analysis part of LPC consists of estimating the frequencies and bandwidths of the formants from the power spectrum or, rather, from its inverse Fourier transform, the autocorrelation function. In this way, the filter properties of the oral cavity can be reconstructed. The synthesis part of LPC consists of filtering a white

Table 6.1 Formant frequencies of English vowels. Reproduced from Peterson and Barney [178, p. 183, Table II], with the permission of the Acoustical Society of America

Vowel         /i/   /ɪ/   /ε/   /æ/   /ɑ/   /ɔ/   /ʊ/   /u/   /ʌ/   /ɝ/
F1 of men     270   390   530   660   730   570   440   300   640   490
F1 of women   310   430   610   860   850   590   470   370   760   500
F2 of men    2290  1990  1840  1720  1090   840  1020   870  1190  1350
F2 of women  2790  2480  2330  2050  1220   920  1160   950  1400  1640
F3 of men    3010  2550  2480  2410  2440  2410  2240  2240  2390  1690
F3 of women  3310  3070  2990  2850  2810  2710  2680  2670  2780  1960


Fig. 6.9 Spectra of ten synthetic male vowels. The vertical lines represent the harmonics of the vowel; their spacing is 150 Hz, the F 0 of the vowels. The curved, dotted line is the spectral envelope. The first three formant frequencies are from Table 6.1. The fourth and fifth formants have fixed frequencies of 3.5 and 4.5 kHz, respectively. (Matlab) (demo)

The synthesis part of LPC consists of filtering a white source signal, a pulse series for the voiced parts of the speech or a white noise signal for its unvoiced parts, with this reconstructed filter. For a detailed treatment of this technique, the reader is referred to, e.g., the third chapter of Rabiner and Schafer [197]. For a summary of the relevance and the history of LPC, and its more recent extensions, the reader is referred to Atal [11]. The spectra of the ten male vowels studied by Peterson and Barney [178], resynthesized by LPC with the first three formant frequencies as given in Table 6.1, are presented in Fig. 6.9. These vowels are synthesized with an F 0 of 150 Hz. The fourth and the fifth formants have fixed frequencies of 3.5 and 4.5 kHz, respectively. These vowels are well identifiable but sound buzzy, as if produced by a robot (cf. the demo of Fig. 4.12). The natural question to ask now is to what extent these formant frequencies specify the vowels. Vowel space as a dimensional space was first studied by Plomp, Pols, and Van de Geer [184]. They took 15 Dutch vowels spoken by ten male speakers in the context /h[V]t/, in which [V] is one of these vowels, measured the intensity at the output of 18 third-octave filters, and carried out a principal component analysis on the resulting 150 18-point vowel spectra. This resulted in four independent dimensions, all linear combinations of the original 18 dimensions, explaining 37.2%, 31.2%, 9.0%, and 6.7% of the variance, respectively. Hence, the 18-dimensional space could be reduced to a four-dimensional space explaining 84% of the variance in total. When this procedure was carried out on the spectra averaged across the ten speakers, as much as 96% of the variance could be explained by the first four principal components. Moreover, they observed that a plot of the projections of the positions of the vowels on the plane spanned by the first two principal components quite well reproduced plots of the second formant frequency against the first. Hence, they concluded that the first two principal components could be associated with the first two formant frequencies.
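As an illustration of this kind of dimensional reduction, the sketch below performs a principal component analysis on a matrix of band-filter levels. The data matrix is a random placeholder standing in for the 150 measured 18-point spectra, so only the procedure, not the numerical outcome, corresponds to the study.

% Sketch of a principal component analysis of vowel spectra (placeholder data).
X  = randn(150, 18);              % 150 vowel tokens x 18 third-octave band levels (dB)
Xc = X - mean(X, 1);              % centre each band on its mean
[~, S, V] = svd(Xc, 'econ');      % singular value decomposition of the centred data
varExplained = 100 * diag(S).^2 / sum(diag(S).^2);
scores = Xc * V(:, 1:2);          % projections on the first two principal components
fprintf('Variance explained by the first four components: %.1f%%\n', sum(varExplained(1:4)));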


In summary, the frequencies of the first two formants largely specify the identity of a vowel in the context of the word /h[V]t/. This basically describes the vowel space as a two-dimensional space spanned by the frequencies of these two formants. This two-dimensional presentation of vowel space is not new but already resulted from nineteenth-century studies concerned with vowel production. An overview of these studies is presented by Plomp [187]. Significantly, the description of vowel space as a two-dimensional space does not only result from measurements of formant frequencies in production studies; these results match the results of perception studies. Indeed, Pols, Van der Kamp, and Plomp [189] obtained similarity ratings in a triadic-comparison experiment with eleven Dutch vowels and subjected them to an MDS analysis. The dimensions resulting from this MDS could be well matched with the dimensions that came out of the production study. Moreover, the results lent themselves well to the automatic classification of vowels [118]. Applying these methods to 12 Dutch vowels, again spoken by male speakers in the context /h[V]t/, Pols, Tromp, and Plomp [188] obtained a correct-classification score of 71% over all speakers, and of 78% when speaker-dependent corrections were applied. Similar results were obtained for vowels spoken by female speakers [261] and for sung vowels [28]. Pols, Tromp, and Plomp [188, p. 1093] conclude: “Statistical analysis of these formant variables confirmed that F 1 and F 2 are the most appropriate two distinctive parameters for describing the spectral differences among the vowel sounds.” Since these studies have demonstrated the importance of the first two formant frequencies in the classification of vowels, it is usual to present the position of the vowels of a language, the vowel space, in plots of F2 against F1 on linear, logarithmic, or mel-scale frequency axes. An example is presented in Fig. 6.10 for the values of the formant frequencies of the ten vowels described by Peterson and Barney [178] presented in Table 6.1. Besides these ten English vowels, the Dutch vowel /a/ is included in Fig. 6.10, because its position in the F1–F2 space is extreme and its inclusion better represents the triangular shape of what is called the vowel triangle.

Fig. 6.10 The vowel triangle. The frequency of the second formant is plotted against the frequency of the first formant for the American English vowels as described by Peterson and Barney [178], and for the Dutch vowel /a/. All data are from adult male speakers. Most data points are within the triangle defined by the positions of the /a/, the /i/, and the /u/. (Matlab)
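A plot such as the one in Fig. 6.10 can be made directly from the male formant values in Table 6.1. The sketch below does so on logarithmic axes; it includes only the ten English vowels, not the Dutch /a/ shown in the figure.

% Sketch of the vowel triangle: F2 against F1 for the male data of Table 6.1.
vowels = {'i','I','ε','æ','α','ɔ','ʊ','u','∧','ɝ'};
F1 = [270 390 530 660 730 570 440 300 640 490];            % Hz
F2 = [2290 1990 1840 1720 1090 840 1020 870 1190 1350];    % Hz
figure;
plot(F1, F2, 'o');
text(F1 + 10, F2, vowels);                                  % label each data point
set(gca, 'XScale', 'log', 'YScale', 'log');                 % logarithmic frequency axes
xlabel('F1 (Hz)'); ylabel('F2 (Hz)');
title('Vowel triangle, adult male speakers');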


Fig. 6.11 The spectral undersampling effect. Spectra of the same vowels as shown in Fig. 6.9, now synthesized with an F 0 of 600 Hz. They are presented in random order. For this high F 0 , the spectra are seriously undersampled by the harmonics. Can you still identify the vowels? (Matlab) (demo)

Incidentally, the vowel triangle was already described in the 18th century by Hellwag [98]. An interesting problem arises when vowels are produced with a very high pitch, as happens, e.g., when female opera singers produce very high notes. In that case, the spacing between harmonics becomes so large that the spectrum gets undersampled. This spectral undersampling effect is demonstrated in Fig. 6.11. It shows, in random order, the spectra of the same synthetic vowels as presented in Fig. 6.9, now synthesized with an F 0 of 600 Hz. Listening to the synthesized vowels makes clear that it is now much more difficult to identify the vowels than those played in Fig. 6.9 with an F 0 of 150 Hz. The actual order in which the vowels are played can be found by comparing the spectral envelopes in the two figures. This spectral undersampling effect was investigated by, e.g., De Cheveigné and Kawahara [48] and Deme [51]. The latter author found that, “at 698 Hz, sung vowels completely lost their intended perceptual quality and received a new one, which appeared to be highly uncertain or in an in-between perceptual category. [...] Furthermore, at the musical note B5 (988 Hz), all the intended vowel qualities appeared to receive a new perceptual quality again, which was most similar to the maximally open vowel /a:/ and /6/ irrespective of the vowels’ intended quality” (p. e8). The previous paragraphs showed that, for lower fundamental frequencies, the frequencies of the first and second formant are very important in vowel perception. One has to realize, however, that this conclusion is drawn based on research carried out with vowels produced in the context of well-articulated syllables. For English, these syllables were of the form /h[V]d/ and, for Dutch, of the form /h[V]t/. It appears that, for automatic identification of these vowels, not only the first two dimensions representing F1 and F2 play a significant role in vowel space; the third and the fourth dimensions are also significant [188]. Moreover, in fluent speech, formant frequencies are much more variable and affected by preceding and following speech


sounds. Consequently, plots of F2 against F1 measured in fluent speech are much less distinctive than the plots obtained with well-articulated vowels. The result is that the automatic identification of vowels based only on F1 and F2 can drop from about 90% in well-articulated speech to about 33% in fluent speech. This shows that, in fluent speech, vowels are incompletely specified by F1 and F2. Apparently, other factors such as coarticulation [113, 166] and context are essential for vowel perception (for a review see Chap. 5 of Plomp [186]). Context effects will be discussed more generally in Sect. 6.8. Up to now, only studies have been discussed concerned with those attributes of vowel timbre that determine vowel identity. These attributes apparently span a two- or at most four-dimensional space. It is important to recognize, however, that these dimensions span only a limited part of the general timbre space of spoken vowels. The demonstrations in this section only contained synthetic vowels, which, though perhaps well identifiable, sounded synthetic and robotic. When spoken by people, the timbre of vowels can vary in many different aspects. They can, e.g., be produced in isolation or well articulated in the context of a syllable, such as above in the context of /h[V]d/ or /h[V]t/. Even in this context, vowels can be produced by men, women, or children, inducing clearly audible differences. In other contexts, e.g., in the context of fluent speech, they can be spoken with different emotions, with emphasis, without emphasis, with more or less effort, whispered or voiced. Moreover, when spoken by a familiar speaker, one can often recognize the speaker, showing that speaker characteristics, too, in one way or another express themselves in the timbre of spoken vowels. One sees that, besides possible changes in pitch, loudness, or duration, all these factors have audible effects on the way vowels are realized and, therefore, have an effect on the timbre of the vowel. The problem is even more complex than this, because the contexts in which vowels are spoken have a strong effect on how they are perceived. Three of these context effects will briefly be discussed: the spectral enhancement effect, the spectral contrast effect, and the temporal contrast effect. Spectral enhancement effects occur when, at a certain moment, spectral components of relatively low intensity are increased up to the level of the already present spectral components; spectral dips are filled in, so to say. The result is that these increased spectral components contribute more to the timbre of the sound than they would without the preceding dips. Consequently, the timbre resembles that of a sound with peaks at the frequencies where first there were dips. In this way, Summerfield et al. [244] managed to produce vowel-like sounds which did not have peaks at the normal formant frequencies of these vowels, but were preceded by sounds that had dips at those frequencies. In fact, the demos of Figs. 4.3 and 4.5 are extreme cases of the spectral enhancement effect. In these demos, the inserted spectral component pops out completely and forms a separate auditory unit. Another context effect is the spectral contrast effect, demonstrated in a classical experiment by Ladefoged and Broadbent [125]. These authors presented listeners with a synthetic introductory phrase: “Please say what this word is.” This was followed by a target syllable of the form /b[V]t/, in which [V] represents a vowel.
This target syllable could be understood as one of the words bit /bIt/, bet /bεt/, bat


/bæt/, or but /b∧t/. The formant frequencies F1 and F2 of the introductory phrase were systematically varied between relatively low and relatively high values. It appeared that the vowel that listeners reported hearing in the target word depended on the average values of F1 and F2 in the introductory phrase. When a formant frequency in the introductory phrase was higher, the vowel in the target word was perceived as if it had a lower formant frequency, and vice versa. This effect was considerable. For instance, one specific syllable was perceived as bit /bIt/ in 87% of all presentations when F1 in the introductory phrase was relatively high, but as bet /bεt/ in 90% of all presentations when F1 in the introductory phrase was relatively low. As can be seen in Table 6.1, the canonical F1 for /I/ is 390 Hz, whereas it is 530 Hz for /ε/. It is concluded that what is a high formant frequency for one speaker may be a low formant frequency for another speaker, depending on the natural range of the speaker. These context effects ensure that the perceived identity of a vowel is adapted to the range of formant frequencies that characterize the speaker of the vowel. For instance, women have, on average, formant frequencies that are 10% higher than those of men. One of the consequences is that, when a vowel is spoken by a woman, its formant frequencies are interpreted relative to the range of formant frequencies that are characteristic for that woman and, when a vowel is spoken by a man, its formant frequencies are interpreted relative to the characteristic range of formant frequencies of that man. Indeed, Slawson [229] showed for synthetic vowels that, if the F 0 of a vowel was doubled and the frequencies of the first two formants were increased by 10%, the vowel quality remained the same. In this way, the formant frequencies are normalized to the characteristics of the speaker. A more extreme consequence of the spectral contrast effect is heard when one listens to helium speech. When a speaker inhales helium, the formant frequencies rise by 15% to 55% because the speed of sound in helium is considerably higher than that in air [22]. The effect is that helium speech sounds like “Donald Duck” speech [142]. Up to moderate concentrations of helium, the speech remains well intelligible, as cartoon fans can attest. Apparently, what constitutes high and low formant frequencies is determined not so much by their absolute frequency values but by their relative position within the range of the speaker. They are thus normalized to the range that characterizes the speaker. The constants of this normalization process, i.e., the mean and the variance of the distribution of formant frequencies, specify the speaker. If the formant frequencies are relatively low, they specify a man; when they are relatively high, a woman; and when they are unnaturally high, they specify Donald Duck. The result of this process is that, in spite of the different frequency ranges of its formants, the interpretation of a vowel remains constant for different speakers. This is an example of perceptual constancy. Spectral contrast effects as reported by Ladefoged and Broadbent [125] have been studied for many other kinds of contexts and contrasts. Besides spectral contrast effects in speech perception, there are also temporal contrast effects. Temporal contrast effects occur when the duration of a vowel is one of its distinctive properties, which means that a word with a short vowel has a different meaning than the same word with a long vowel.
What is long or short depends on the speaking rate of the speaker. When a speaker speaks slowly, the average duration of a short vowel


can be longer than the average duration of the long vowel when the speaker speaks faster. More examples of contrast effects outside the field of vowel perception will be discussed in the forthcoming Sect. 6.8.

6.5.2 Timbre Space of Musical Sounds

Another well-investigated timbre space is the space of the sounds of musical instruments. One of the first to do so was, again, Plomp [187], who also gives an overview of the little research on timbre perception done earlier. Plomp [187] recorded the sounds of nine different musical instruments played on the same pitch of 319 Hz. A single pitch period of each of these sounds was selected, and the stimuli were produced by playing this pitch period repeatedly. These stimuli were presented in triads of different sounds, and listeners were asked to indicate which two of the three sounds were most similar and which two most dissimilar. In this way, dissimilarity scores could be obtained for all pairs of the nine musical-instrument sounds. Next, for all these sounds, Plomp [187] measured the sound pressure levels (SPLs) in dB of 15 successive third-octave bands equidistant on a logarithmic scale between 315 and 8000 Hz. Various distance measures were then calculated based on the differences between these spectra. These distance measures had the form

$D_{ij} = \left( \sum_{k=1}^{15} \left| L_{ik} - L_{jk} \right|^{p} \right)^{1/p}$,

in which $L_{ik}$ is the SPL of the $k$th third-octave band in the spectrum of stimulus $i$. These distance measures were then correlated with the dissimilarity scores previously obtained. Correlations higher than 0.8 were found; the highest correlations were found for values of p of about 1 or 2, though the effect of p appeared not to be so relevant for the stimulus set used. Similar results were obtained when the same experiment was carried out with the sounds of ten different registers of organ pipes played at 263 Hz. Plomp [187] concluded that the dissimilarity scores could largely be explained based on the differences in the spectra of the sounds. As was done for vowel sounds by Plomp, Pols, and Van de Geer [184], a principal-component analysis on the power spectra made it possible to reduce the 15-dimensional space of power spectra to a space spanned by three principal components explaining 90% of the variance of the spectra [185, p. 96]. Moreover, it appeared possible to match this space of spectra closely with the timbre space resulting from the dissimilarity ratings by the listeners. Plomp argues that the main one of these three dimensions corresponds to the centre of gravity of the third-octave power spectrum he measured. He also remarks that, for the stimuli he used, this third-octave power spectrum is not much different from the specific-loudness distribution used in the calculation of the loudness of a sound, which was discussed as an estimate of the brightness or sharpness of a sound. So, Plomp [187] argued that the most significant perceptual dimension of the timbre spaces of the musical sounds corresponded to the brightness or sharpness of the sound: “The main attribute of sharpness appears to be related primarily to the centre of gravity of loudness on a frequency scale in which critical bandwidths have equal lengths” (p. 110). Besides the dimension corresponding to


sharpness or brightness, two other dimensions are needed to characterize the timbre of the musical sounds used in these analyses, but these dimensions were not further specified in terms of verbal attributes. Sounds more complex than the single pitch period used by Plomp [185] were included in the experiments by Grey [90] and Gordon and Grey [87]. They synthesized sixteen different musical sounds based on the spectrograms of recorded real musical instruments. They did so in order to be able to get stimuli that were equal in pitch, loudness, and duration. These stimuli were “for all practical purposes, perceptually identical to the original recorded tones” [87, p. 24]. In a first experiment, Grey [90] asked listeners to judge the similarity of all possible pairs of stimuli selected from these sixteen musical sounds, and carried out a multidimensional scaling (MDS) analysis on the results. Grey [90] found an optimum solution consisting of a perceptual space of three dimensions. The first dimension was related to the distribution of energy over the spectrum: “At one extreme are tones with narrow bandwidths and energy concentrated in the lower harmonics. At the other end, the instrument tones have wide bandwidths with significant energy in the higher harmonics, suggestive of formant regions” [87, p. 24]. In a more detailed analysis, Gordon and Grey [87] correlated the coordinates of the different stimuli along this dimension with the centroid of the specific-loudness distribution of the stimuli and found correlations of 0.92. Gordon and Grey [87] used neither the term brightness nor sharpness, but it is remarkable that this is the same correlate of brightness or sharpness as proposed by Plomp [185]. The other two dimensions could be associated with the dynamic pattern of change in the course of the stimuli. One of these dimensions was related to the presence of dynamic changes in the spectrum, spectral fluctuation or spectral flux, the other to the rise time of the tone. It will appear that these three sources of information, spectral centroid, spectral flux, and rise time, in one way or another return in all following discussions on the timbre of musical instruments. The stimuli used in these studies were quite different. Plomp [185] made his stimuli by repeatedly playing the same pitch period sliced out of a recorded musical note, thus removing all dynamic changes from the stimuli, whereas Gordon and Grey [87] included the temporal changes within the recorded notes in resynthesizing their stimuli from the recorded tones. Plomp [185] needed three dimensions to describe the stationary spectrum of the space of musical instruments, of which the main dimension was related to the spectral centroid. Gordon and Grey [87] also found three dimensions, one of which largely corresponded to the spectral centroid. The other two were related to dynamic sources of information, the spectral flux and the rise time of the tones. Gordon and Grey [87] ascribe this difference to the fact that Plomp [185] had stationary stimuli of a duration longer than one second, whereas their stimuli were much shorter, only 350 ms, and did contain the onset and the offset of the tones. Apparently, in making dissimilarity judgments, listeners attribute much weight to the timbre differences corresponding to the dynamic changes in the spectrum, spectral fluctuations, and the temporal envelope of the stimuli, thus obscuring the information in the stationary part of the spectrum other than its centre of gravity.
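A crude sketch of such a spectral “centre of gravity” measure is given below. It computes the centroid of the magnitude spectrum of a synthetic test tone; the studies cited above used the centroid of the specific-loudness distribution, which requires a loudness model and is not reproduced here.

% Sketch of a spectral-centroid estimate, a simple acoustic correlate of brightness.
fs = 44100;                          % sampling rate (Hz)
t  = (0:fs-1)'/fs;                   % one second of signal
x  = zeros(size(t));
for n = 1:20                         % test tone: 20 harmonics of 311 Hz with 1/n amplitudes
    x = x + sin(2*pi*311*n*t)/n;
end
X    = abs(fft(x));
f    = (0:length(x)-1)'*fs/length(x);
half = 1:floor(length(x)/2);                       % positive frequencies only
centroid = sum(f(half).*X(half)) / sum(X(half));   % spectral centre of gravity (Hz)
fprintf('Spectral centroid: %.0f Hz\n', centroid);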


It is concluded that one of the important dimensions of the timbre space of musical-instrument sounds is related to the spectral centroid of the sounds. Some authors proposed the term brightness for this perceptual attribute, e.g., Lichte [135], others sharpness, e.g., Von Bismarck [267]. In addition, there are many other verbal attributes of sounds. The question then is: What verbal attribute corresponds best to a certain dimension in a perceptual space? This was studied by Von Bismarck [268]. He selected 30 pairs of semantic differentials [167], such as dark-bright, smooth-rough, rounded-angular, and weak-strong. Next, he synthesized 35 tonal or noisy sounds with spectral envelopes based on those of music and speech sounds. He then asked listeners to scale these sounds on the 30 continua defined by the semantic differentials. The results of these scaling experiments were subjected to a factor analysis, which yielded four orthogonal factors, explaining 90% of the variance. The main factor, explaining 44% of the variance, was mainly represented by the semantic differential dull-sharp, which is why Von Bismarck [268] called this factor “sharpness”. The ratings on the dull-sharp continuum were strongly correlated with those on the dark-bright continuum and with those on the low-high continuum. This shows that the verbal attributes sharp, bright, and high are closely related. Other verbal attributes correlating with sharp were hard, loud, angular, tense, obtrusive, and unpleasant. The other three factors that resulted from the factor analysis were less easy to describe: The second factor, explaining 26% of the variance, was associated with the verbal attributes compact, boring, narrow, closed, and dead; the third factor, explaining 9% of the variance, with full; and the fourth factor, explaining 2% of the variance, with colorless. At the same time, but independently of Von Bismarck [267, 268], Pratt and Doak [193] also conducted experiments to examine the relationship between verbal attributes and timbre. They synthesized the sounds of four categories of musical instruments, and asked listeners to select from a list of nineteen verbal attributes the six they judged most useful for describing the sounds. On somewhat subjective grounds, Pratt and Doak [193] then selected three pairs of verbal attributes with opposite meaning and used these as semantic differentials in another experiment. These pairs were dull-brilliant, cold-warm, and pure-rich. Next, they asked subjects to scale six synthesized sounds along the three continua defined by these semantic differentials. Among the three pairs, the dull-brilliant pair appeared to yield the most reliable scores. Pratt and Doak [193] indicated that this result is consistent with the results by Von Bismarck [268]. It can thus be concluded that the continuum dull-brilliant represents the same perceptual timbre dimension as the continua dull-sharp and dark-bright found by Von Bismarck [268]. It was just mentioned that Von Bismarck [268] found that the continuum dull-sharp was strongly correlated not only with dark-bright but also with low-high. As to the low-high continuum, one may ask whether listeners confound this timbre-related low-high continuum with the low-high continuum associated with the percept of pitch. This was checked by Schubert and Wolfe [214]. They independently varied the fundamental frequency and the spectral centroid of synthetic musical tones and asked listeners to judge the brightness of these sounds.
They showed that brightness judgments correlated strongly with the centroid of the frequency spectrum and not with the pitch frequency. This shows that listeners deal with the


high-low continuum associated with brightness or sharpness mostly independently of the high-low continuum associated with pitch. This independence of pitch and brightness will be discussed in more detail in Sect. 8.16. It is concluded that the most relevant dimension in the timbre space of musical sounds is best described by the continuum between the semantic differentials dark-bright or dull-sharp. Returning to the perceptual space of musical instruments as studied by means of multidimensional scaling: advanced MDS techniques were applied by McAdams et al. [150]. Their stimulus set consisted of eighteen musical sounds synthesized by Wessel, Bristow, and Settel [274], twelve of which simulated the sounds of real musical instruments: French horn, trumpet, trombone, harp, vibraphone, harpsichord, English horn, bassoon, clarinet, guitar, a bowed string, and a piano; six other sounds were synthesized with parameters in between those of two other instruments: trumpet and guitar, oboe and celesta, bowed string and piano, vibraphone and trombone, oboe and harpsichord, and guitar and clarinet. The MDS model used by McAdams et al. [150] was more complex than that of previous studies. The authors included the presence of latent classes, i.e., groups of listeners attributing different weights to different dimensions of the stimuli, and specificities, i.e., attributes of sounds that do not fit into the Euclidean structure of the perceptual space assumed to underlie the judgments of the listeners. In addition to a six-dimensional solution without specificities that was difficult to interpret, McAdams et al. [150] found a three-dimensional solution with specificities. The analysis of the latent classes resulted in five different groups of listeners. For all these groups, the three dimensions of the MDS solutions could be associated with the logarithm of the rise time of the sound, with the centre of gravity of its spectrum, and with the degree of spectral variation. Remarkably, the finding that these different groups attributed different weights to the different dimensions could not be related to differences in musical training. These results confirm the conclusion by Grey [90] and Gordon and Grey [87] that three sources of information are involved in the perception of the timbre of musical tones: the spectral centroid of the tones, their rise time, and their spectral flux. Similar results were obtained by Krimphoff, McAdams, and Winsberg [124], except that the third dimension could not so much be associated with spectral flux but rather with what they called spectral fine structure, a measure indicating how much the amplitude of individual harmonics deviates from the envelope of the spectrum. This means that this measure is high for sounds with successive harmonics that differ greatly in amplitude. An example of such a sound is the tone of a clarinet, in which even harmonics have low amplitudes, resulting in a high odd/even ratio. The stimulus set used by McAdams et al. [150] did not contain any percussive instrument sounds. Such sounds were included in a study by Lakatos [126]. He used 35 recorded musical sounds. Seventeen of these were the usual harmonic instrument sounds such as those of a clarinet, a trumpet, a piano, or a harp. The other eighteen were percussive sounds, seven of which had clear pitches, such as the celesta and the marimba, whereas eleven others were “weakly-pitched”, e.g., castanets, bongo drum, or tam-tam. All harmonic sounds were played on E♭4, or 311.1 Hz.
Listeners were then asked to rate the similarity of pairs of these stimuli on a continuous scale ranging from very similar to very different, and the results were subjected to an MDS


analysis. This analysis also yielded a perceptual space spanned by three dimensions, two of which corresponded to dimensions previously found: spectral centroid and the logarithm of rise time. The third dimension, however, did not correlate well with spectral flux or with spectral fine structure, the dimensions found in the earlier studies. It could be associated with “richness” for the set of percussive sounds. Moreover, just as McAdams et al. [150], Lakatos [126] did not find any significant effect of musical training on the judgments by the listeners. In the discussion section, Lakatos [126] considers the adequacy of MDS as an analysis tool for perceptual spaces. He remarks that MDS almost always results in solutions of two or three dimensions: “From an intuitive perspective, it may seem unsatisfying to accept that two orthogonal dimensions capture most of the variance inherent in our rich acoustic environment, much as it would seem reductionistic to characterize the wide range of visual objects in our environment exclusively by length, width, and height.” He suggests: “a purely dimensional interpretation of timbre perception may mask other noncontinuous or categorical factors” (p. 1437). In the studies discussed until now, the stimuli consisted of recorded musical sounds or sounds that were synthesized with parameters directly derived from recorded sounds. Consequently, the perceptually relevant acoustic parameters of these sounds may not have been evenly distributed over the corresponding dimensions. In addition, they may have co-varied in systematic ways with each other. In order to test the effect of this, Caclin et al. [36] synthesized sounds consisting of 20 harmonics in which they independently varied three parameters. In a first experiment, they systematically varied the rise time, the spectral centroid, and the spectral flux of the stimuli. They confirmed the perceptual relevance of rise time and spectral centroid, but found only weak effects of spectral flux. In a second experiment, they synthesized stimuli in a similar way, but instead of spectral flux they varied the spectral fine structure by changing the relative amplitude of the odd harmonics. Now they did find three dimensions corresponding to the three parameters varied in the synthesis of the stimuli: rise time, spectral centroid, and spectral fine structure. This indicates that the apparent significance of spectral flux found earlier may have been a consequence of its correlation with spectral fine structure in the stimulus sets used. In studies of the timbre space of musical instruments, two sources of information have so far been found to be consistently perceptually significant: spectral centroid and rise time. Other sources of information appear to be more elusive. It has just been mentioned that this could be because they are mutually correlated in the stimulus sets used in the studies. There can also be other reasons, however. First, MDS analysis assumes a Euclidean structure that may not match well with the actual structure of perceptual space [35]. For instance, perceptual space might not have a well-defined metric. Another reason may be that the complexity of the acoustic information used in timbre perception is considerable and does not vary systematically in the stimulus sets used.
For instance, sources of information used in timbre perception may essentially consist of a combination of the spectrotemporal features of sounds found in the natural environment of an organism, as argued by Singh and Theunissen [226]. They proposed the modulation power spectrum (MPS) of a sound as a spectrotemporal representation of perceptually relevant acoustic features. The MPS


is a “temporospectral” representation of the stimulus derived from the amplitude envelopes of the outputs of a bank of narrow-band Gaussian filters. It is assumed to separate perceptually significant features that in the spectrogram would overlap. For details, the reader is referred to Singh and Theunissen [226]. The perceptual relevance of features represented by the MPS of the sounds of musical instruments was investigated by Elliott, Hamilton, and Theunissen [60]. They asked listeners to rate the similarity of 42 recordings of non-percussive real musical instruments played on E♭4, or 311.1 Hz, equalized in loudness and perceived duration. For some instruments, recordings of tones played both with and without vibrato were used, and for some other instruments recordings of both muted and unmuted tones. The similarity ratings were subjected to an MDS analysis, resulting in a space with as many as five dimensions. In addition to using dissimilarity ratings, Elliott, Hamilton, and Theunissen [60] also used verbal attributes to study the timbre space of musical sounds. Listeners were asked to rate stimuli on continua defined by semantic differentials. The results of the ratings were subjected to a principal component analysis from which a perceptual space was reconstructed. This method, in which Elliott, Hamilton, and Theunissen [60] used 16 semantic differentials, also resulted in a perceptual space of five dimensions. By appropriate rotations, the dimensions of this space could be brought in line with those found in the MDS analysis. By correlating these dimensions with the principal components of the MPS, the authors show that only one of these dimensions, the third, is purely spectral and can be associated with the spectral centroid of the sound. The other factors are essentially spectrotemporal in nature. As to rise time, the authors note that this may be associated with the second dimension, not directly, but because the second dimension represents the presence of ongoing temporal fluctuations and the sounds in the stimulus set characterized by those fluctuations tend to have longer rise times. As mentioned, the stimulus set used by Elliott, Hamilton, and Theunissen [60] did not contain any sounds with impulsive excitations such as those of a guitar, a harp, or a piano. The results obtained by Elliott, Hamilton, and Theunissen [60] corroborated those found in a neurophysiological study carried out by Patil et al. [170]. For a computational model of timbre perception, these authors used a neurocomputational framework based on the spectrotemporal receptive fields (STRFs) of neurons measured in the cortex of ferrets. These STRFs represent the spectrotemporal properties of the sounds to which the neurons are responsive. Patil et al. [170] used both actually measured and simulated STRFs to classify the sounds of eleven musical instruments. When using the simulated STRFs, they obtained a correct-identification score as high as 98.7%! Moreover, they showed that this high score cannot be obtained when the STRFs are decomposed into separate spectral and temporal dimensions: “The study demonstrates that joint spectro-temporal features, such as those observed in the mammalian primary auditory cortex, are critical to provide the rich-enough representation necessary to account for perceptual judgments of timbre by human listeners, as well as recognition of musical instruments” (p. 1). A review of these issues is presented by Elhilali [58]. The findings by Patil et al.
[170] were largely confirmed and extended by Allen et al. [3], who used a larger set of 42 musical sounds, which, however, did not include percussive instrument sounds.
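Before summarizing these findings, it may be useful to recall the procedure shared by most of the studies above: classical MDS applied to a matrix of dissimilarity ratings. The sketch below shows the classical (metric) variant of that procedure on a placeholder dissimilarity matrix; the actual ratings of the studies discussed are, of course, not reproduced here.

% Sketch of classical (metric) MDS on a placeholder dissimilarity matrix.
n = 10;                                   % number of 'sounds'
config = rand(n, 3);                      % hypothetical underlying 3-D configuration
D = zeros(n);
for i = 1:n
    for j = 1:n
        D(i, j) = norm(config(i, :) - config(j, :));   % stand-in for dissimilarity ratings
    end
end
J = eye(n) - ones(n)/n;                   % centring matrix
B = -0.5 * J * (D.^2) * J;                % double-centred squared dissimilarities
[V, E] = eig(B);
[evals, order] = sort(diag(E), 'descend');
dims = 3;                                 % number of dimensions retained
coords = V(:, order(1:dims)) .* sqrt(evals(1:dims))';   % recovered coordinates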


It is concluded that the acoustic sources of information (SOIs) contributing to the perception of the timbre of the sound of a musical instrument are multiple. In all dimensional analyses described above, the spectral centroid of the sounds turned out to be a perceptually relevant SOI. The rise time of a sound and its spectral regularity or spectral flux also often appeared as significant SOIs, but their exact spectrotemporal nature is not quite clear. The weight attributed to these and other SOIs appears to depend on the size of the set of musical-instrument sounds used in the study, the number of different classes of musical instruments involved, and whether the sounds are recorded or synthesized. There are, however, more complications. For instance, the outcome of similarity ratings appears to depend on whether the set of stimuli contains transformed stimuli that are less familiar to the listeners than recorded orchestral musical sounds [223]. Apparently, in the absence of unfamiliar sounds, listeners use the familiar categories in order to judge whether the two sounds of a pair belong to the same category or not, which then affects their similarity ratings. To show the impact of such categorical judgments on the ratings by the listeners, a hypothetical extreme case will be presented of a timbre space that is based on categories and nothing else. In such a space, the sounds are completely characterized by their category. This implies that, on a scale from zero to one, the similarity of two sounds that belong to different categories is always rated as zero, while the similarity of two sounds belonging to the same category is always rated as one. Consequently, the metric of the perceptual space derived from such data would be the discrete metric of the different categories, which means that the distance between two sounds from the same category is always 0, while it is always 1 between two sounds from different categories. This gives a space with as many dimensions as there are categories. One can imagine that the interference of these kinds of categorical judgments will obscure the low-dimensional structure of a low-dimensional timbre space. This awareness has led to the introduction of the specificities mentioned above. The role of these factors in the dimensional analysis of the timbre of complex sounds is discussed by Donnadieu [52]. McAdams [149, pp. 50–55] discusses these issues for the MDS space of musical sounds. Another issue is that, in the studies discussed so far, the stimuli were equalized as to pitch and loudness (or F 0 and intensity). This implies that some instruments were played high in their registers, e.g., a cello, whereas other instruments, e.g., a flute, were played low in their registers [200]. Although the timbre of a musical instrument is said to be more or less independent of the pitch on which it is played, the difference in pitch should not be too large, not much more than one octave [145, 146]. If the fundamental frequency is not kept constant over the set of stimuli used in an MDS analysis, the fundamental frequency can have a significant effect and, hence, can obscure the effect of other variables. Indeed, in one experiment, Miller and Carterette [154] systematically varied the fundamental frequency, the temporal envelope, and the relative amplitudes of the harmonics of a set of musical sounds. Listeners rated the similarity of pairs of such stimuli, after which the results were subjected to an MDS analysis.
This resulted in a one-dimensional space, spanned by the fundamental frequency. In a second experiment, the authors kept the fundamental frequency constant and varied three other stimulus properties. Now the


perceptual space resulting from the MDS analysis of the similarity ratings was three-dimensional. Apparently, in the first experiment, in which the pitch of the stimuli was varied, the differences in pitch dominated the ratings by the listeners, thus obscuring the effect of the other two variables. Comparable results were obtained by Handel and Erickson [95], who studied this problem in more detail. They noted that, in previous MDS experiments based on similarity judgments, generally no dimension was found that distinguished between instrument groups such as wood and brass instruments, or wind and string instruments. They suggest that this could be because the stimulus sets of these earlier studies only contained one tone per instrument played on one specified pitch. This may have induced the listeners to pay attention to “low-level acoustic differences”, thus preventing them from paying attention to the category differences between instrument groups. To further test this, Handel and Erickson [95] carried out an MDS experiment in which tones played by different instruments on different pitches were used. This also resulted in a “timbre” space of which the first dimension strongly correlated with pitch, while the second dimension correlated with spectral centroid and distinguished between woodwind and brass instruments. In summary, in discussing the results of the dimensional analyses, most authors emphasize the roles of dimensions such as brightness, attack time, or spectral flux. Remarkably, the clustering of sounds produced by instruments of the same family in perceptual spaces is only incidentally discussed. In various other studies, a division into percussive and sustained instruments is found, e.g., by Iverson and Krumhansl [111] and Lakatos [126], a distinction that naturally correlates with the different attack times of the sounds of the different instrument families [150]. So, the relation between sounds produced by the same instrument family and their positions in the perceptual space found by MDS is not always clearly outlined. In order to study this more closely, Giordano and McAdams [84] and McAdams [85] reanalysed 23 data sets of 17 published studies, and concentrated on the role played not only by instrument family but also by the way in which the sound is mechanically produced, the “excitation type”. They analysed both identification studies and studies of dissimilarity ratings. As to the identification studies, they conclude that: “with the majority of the datasets, instruments from the same family were confused with each other significantly more frequently than were instruments from different families. Notably, with the majority of the studies within-family confusions were as frequent as would be expected if participants were guessing. As such, similarities in the mechanics of the sound source were associated with increased identification confusions” [84, pp. 164-165]. As to the reanalysis of the dissimilarity ratings, they conclude that “Overall, tones generated by the same type of excitation or tones generated by instruments from the same family were more similar to each other and occupied independent regions of the MDS spaces. As such, significant mechanical-perceptual associations emerged even when the task did not explicitly require participants to focus on acoustical differences in the sound source” (p. 165). A final problem with the use of MDS and verbal attributes as analysis methods appears when different sets of stimuli are mixed.
Reuter and Siddiq [200] mixed three sets of musical sounds. The first set was the set of synthesized sounds used


by Grey [90], the second was the set of synthesized sounds used by McAdams et al. [150], and the third set consisted of recorded sounds from the Vienna Symphonic Library. The space resulting from the MDS analysis of the similarity ratings between pairs of sounds from these three different sound libraries turned out to be four-dimensional. More significantly, sounds from the stimulus set used by Grey [90] were very well separated from those of the stimulus set used by McAdams et al. [150], much better than sounds played on the same instrument were clustered together. The recorded sounds from the Vienna Symphonic Library were spread all over the space. The authors attributed these results to the different ways in which the stimulus sets had been compiled. They concluded that these differences, and not the timbre differences supposed to be specific for the musical instruments, dominated the dissimilarity judgments by the listeners. The conclusion is that the timbre space of musical instruments is difficult to reduce to a small number of unequivocal dimensions, even for a relatively small set of musical sounds of equal loudness played on the same pitch. Reuter and Siddiq [200] emphasize that the fundamental frequency of musical sounds and also the effort with which they are played have a significant effect on the timbre of the musical sounds. Including them as independent variables in the stimulus set will only increase the complexity of the problem. In the approaches discussed so far, it is assumed that the timbre space of musical instruments is spanned by a limited number of dimensions. Another approach is followed by Peeters et al. [177]. They wanted not so much to describe the timbre space of musical sounds, but rather to automatically identify the instruments in recorded music. In order to realize this, Peeters et al. [177] composed the Timbre Toolbox, a large collection of audio descriptors: sources of acoustic information that are or may be used for the automatic classification of the instruments playing in recorded music. Some of these audio descriptors are purely acoustic, e.g., zero crossings or autocorrelation coefficients, but others are inspired by what is known about auditory processing, such as the output of a gammatone filterbank or the centroid of the specific-loudness distribution. Siedenburg, Fujinaga, and McAdams [222, p. 30] mention that as many as 164 audio descriptors are included in the study by Peeters et al. [177]. These audio descriptors were applied to a very large database of more than 6000 recorded musical-instrument sounds [222, p. 30]. Naturally, many of these audio descriptors are correlated, and it appears that the correlations can be used to cluster the audio descriptors into ten more or less independent classes. This result suggests that the number of audio descriptors necessary to characterize and automatically recognize the timbre of musical instruments can be reduced to ten. A relevant aspect of this approach is that the research goal is to identify the musical instrument correctly, independently of whether the audio descriptors used for the identification have any perceptual relevance. It is quite possible that an audio descriptor discriminates well between musical instruments, but that the acoustic information it describes is not used by the human auditory processor.
On the other hand, the results of these kinds of experiments may very well be used to select those audio descriptors that differentiate between musical instruments, and then to investigate to what extent that information may contribute to human timbre perception. This aspect


is discussed in detail by Aucouturier and Bigand [13] and Siedenburg, Fujinaga, and McAdams [222], and the interested reader is further referred to these authors. As a final remark, it is noted that, in the discussions of the timbre space of vowel sounds and that of musical sounds, the perception of breathiness did not play any role, while the perception of roughness was only sparsely mentioned. In their lists of verbal continua, both Von Bismarck [268] and Pratt and Doak [193] included the smooth-rough continuum, but neither found this continuum to be significant. Von Bismarck [268, p. 17] concluded that roughness was unsuitable as a perceptual attribute because “most sounds were rated as being either rough or smooth.” McAdams et al. [150] mention roughness as a property of one of their sounds with specificity greater than 2, in this case trumpar, a combination of a trumpet and a guitar. The only other exception is that Burgoyne and McAdams [35] suggest roughness as the second dimension in their analysis of the sounds used by Grey [90]. Besides these instances, breathiness and roughness usually do not emerge as significant factors in the description of the timbre spaces of musical sounds. The main reason for this is probably that they are generally not considered desirable characteristics of musical sounds. Hence, rough or breathy sounds may simply not have survived the selection process of the experimental stimuli. As to roughness, another possibility is that, as far as it was part of the timbre of the experimental stimuli, it may have expressed itself in the dimension corresponding to spectral flux.

6.6 The Role of Onsets and Transients

The timbre of a sound is often described as the auditory correlate of the spectrum. For instance, Pratt and Doak [193, pp. 319–320] conclude, “the timbre of a note depends largely on the harmonic structure.” They add, however, “both the amplitude and frequency of the harmonics vary with time, and it is the precise nature of these fluctuations which is of critical importance for determining the timbre of a note.” Similarly, in a note added to the definition of timbre, the American National Standards Institute [7] states: “Timbre depends primarily upon the frequency spectrum”, but this is directly followed by “although it also depends upon the sound pressure and the temporal characteristics of the sound.” Moreover, above, the perceptual attribute of brightness was discussed, which corresponded to one of the main dimensions of the perceptual space that resulted from the MDS analyses of the timbre space of musical-instrument sounds. This gives the impression that the spectrum dominates timbre perception and that temporal fluctuations are secondary. This, however, appears to be too simple. As early as the 1950s and 1960s, sound-processing equipment such as tape recorders made it easy to present listeners reproducibly with well-defined segments of recorded musical tones. One of the first questions asked was where in the sound signal the information that is perceptually important for its identification is located: in its onset or in its steady part. It was often concluded that the onset or attack plays an important role in the identification of a musical note. For instance, Richardson [201, p. 962] emphasized the importance of transients at the onsets of


the tones: “Few wind instruments have transients which are exact copies of their steady states; either ‘underblown tones’ or ‘reed partials’ occur in the first tenth of a second until overlaid by the principal tones of the column in its steady state. In spite of their evanescent nature, the view is now held that it is these transients which enable the listener to distinguish the sounds of different musical instruments or between two of the same class. [...] The transient is indeed part of the ‘formant’ of an instrument, and ought to be exhibited as a characteristic alongside the steady-state.” Similar conclusions were drawn by Clark Jr et al. [43], Elliott [59], and Saldanha and Corso [212, p. 2021], who found for ten different orchestral instruments that “the best identification is made for stimuli consisting of initial transients and a short steady state.” Berger [25] showed that removing the rise and decay portions of a number of brass-instrument sounds reduced the number of correct identifications from 59% to 35%. Finally, in resynthesizing tones of musical instruments, Grey and Moorer [91, p. 454] present data that “pointed out the importance of certain small details existing in the attack segments of tones.” All these examples show that the onset of a tone in general contains much information used by listeners in identifying the instrument on which the tone is played. This suggests that acoustic information at the onset of a sound plays a role in sound perception that can hardly be overestimated. This conclusion corresponds well to the outcome of the dimensional analyses discussed above, which yielded the rise time or attack time of a sound as one of the three main dimensions of the timbre space of musical-instrument sounds. The perceptual relevance of the rise time of tones also becomes apparent when the tone of a musical instrument is played backwards. George [79, p. 225] already found that “If the ‘steady state’ (as observed on an oscilloscope) of any sound be reversed, its sound is unchanged but, in general, the original source of the sound cannot be identified with certainty even though its ios [= instantaneous overtone structure] is that given in the literature as characteristic of that sound.” Something similar was reported by Hajda [94], who found that the identification of sustained continuous tones was not affected by playing those tones backwards. In contrast, playing impulsive sounds backwards always did affect their identification. This is, e.g., demonstrated by playing piano tones backwards. In general, such tones are not recognized as played on a piano. This is demonstrated in Fig. 6.12. In the first half of the demo, a usual rising diatonic scale is played on C. In the second half of the demo, the same diatonic scale is recorded from high to low and then played backwards in time, so that the same rising scale is heard but with the separate piano notes played backwards. Note the loss of identity of the piano notes in this second half of the demo. The tones no longer sound like those of a piano. For a similar demo comprising a Bach chorale, one can listen to tracks 54–56 of the CD by Houtsma, Rossing, and Wagenaars [107]. This shows that the order of the acoustic details at the onset of a tone is an important determinant of the identity of impulsive tones. These acoustic details do not necessarily have to be very complex. As demonstrated in Fig. 1.70, which plays various ADSR tones with varying rise times, the tones with shorter rise times have an impulsive character that vanishes as the attack time gets longer. In fact, there appears to be a categorical distinction between tones with shorter rise times and tones with longer rise times, which introduces the subject of categorical perception.


Fig. 6.12 Diatonic scale on a piano, first played from low to high as usual, then the scale is played from high to low, but inverted in time, so that the same upward scale is heard as in the first half of the demo. The timbre of the tones in the second part of the demo has completely changed and no longer sounds like that of a piano. (Matlab) (demo)
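The manipulation of Fig. 6.12 amounts to reversing each recorded note in time. A minimal sketch is given below; the file name is hypothetical, and any recording of a piano note will do.

% Sketch of the time-reversal manipulation of Fig. 6.12 for a single note.
[x, fs] = audioread('piano_note.wav');   % hypothetical recording of a piano note
x = x(:, 1);                             % use the first channel only
sound(x, fs);                            % the original note
pause(length(x)/fs + 0.5);
sound(flipud(x), fs);                    % the same note played backwards in time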

Perception is categorical when “identification can be predicted from discrimination and vice versa”. This means that, when two stimuli are from different categories, they will be more easily discriminated than when they are from the same category [56, 199]. In other words, when two stimuli are from different categories, they are more easily discriminated than two physically equally different stimuli from the same category are. This means that just-noticeable differences are smallest at category boundaries and largest at the centre of the category. Categorical perception of timbre is reviewed by Donnadieu [52, pp. 297–312]. In all that has been discussed above, the rise time of a tone has only been investigated as to its relevance for the timbre of the sound or for the identification of the instrument that plays the tone. It is clear that the rise time of a tone also plays an important role in defining the perceptual attribute of beat clarity, an attribute discussed in Sect. 5.6 of the previous chapter. The clarity of the beat of an auditory unit defines how precisely that beat is perceptually defined in time, and the information that comes at the onset of a sound, in particular its rise time, is paramount in this. Another dimension that often emerges from such studies is the dimension corresponding to what was called spectral flux, indicating the presence of spectrotemporal fluctuations also in the sustained part of the musical tones. These fluctuations in the components of a sound are often referred to as transients. Transients are important parts of the beginning of a sound, but in almost all natural musical tones, they extend well into the sustain, where they will also appear to play an important role. For instance, Iverson and Krumhansl [111] showed for a number of musical sounds that both the onset and the “remainder” of a tone play a role in similarity ratings, which was confirmed by Hajda [94] for continuous tones played on non-impulsive instruments. In general, however, it appears difficult to compare the relative contribution of the information at the onset of a tone with that in the sustain. That it plays a role is clear, but it is difficult to specify this more precisely. There are various reasons for this. One of the reasons is that it is not self-evident how to indicate what exactly the onset of a tone is: Where does the onset start? And where does it end? In the studies mentioned above, various authors have used different ways to measure the rise time of a tone (for an overview see Hajda [94]). Another reason is that the relative contributions of onset and steady state can depend on the way a tone is played. Indeed,
Wedin and Goude [272] found that three out of nine instruments remained well recognizable when the attack and the release were removed. These, however, were the three instruments that were played with vibrato, whereas the other six were not. Apparently, vibrato adds dynamic information to the steady state of a sound that a listener can use for the identification of the instrument that plays the sound. Similar results were obtained for trumpet sounds by Risset [202] and for clarinet sounds by Pratt and Doak [193]. The latter found that, in the absence of micro modulation, the synthesized clarinet sounds did not sound like a clarinet but like an organ pipe. This can be compared with the effect of coherent frequency modulations on the timbre of a synthetic vowel as demonstrated in Fig. 4.12. In that demo, one hears that a vowel synthesized with micro modulations has a much richer voice-like timbre than the vowel synthesized with a constant fundamental frequency. This shows that the coherent dynamics of the harmonics introduced by vibrato or micro modulation play an important role, not only in auditory-unit formation, as argued in Chap. 4, but also in enriching the timbre of the sounds. A third reason why it is difficult to indicate the relative contributions of onset and sustain of a tone is that removal of the onset actually introduces a new envelope with a rise time that can affect the timbre of the edited tone but is no property of the original tone. Indeed, Hajda [94] found that removal of the onset of some short non-impulsive tones changed them into tones that sounded as if produced by an impulsive instrument such as a guitar, a marimba, a piano, or a violin played pizzicato. Moreover, Hajda [94] found that removal of the onset of some sustained tones of longer duration actually improved the identification of the musical instruments that played the tones. A fourth problem is that it is not self-evident how to separate transient information from quasi-stationary, i.e., more slowly changing, information in a musical tone. Siedenburg [220, p. 1078] defines transients as “short-lived and chaotic bursts of acoustical energy”, and mentions that transient and quasi-stationary components of sounds overlap in time and in frequency. This implies that the one cannot simply be separated from the other by gating out segments in time. To handle this problem, Siedenburg and Doclo [221] developed an iterative procedure to separate the transient from the quasi-stationary parts of a sound, based on the assumption that quasi-stationary parts are “sparse in frequency and persistent in time, and vice versa for transients” (p. 228). Employing this procedure, they could vary the proportion of transient and quasi-stationary parts of musical sounds. Siedenburg [220] applied this to ten different musical-instrument sounds: Piano, guitar, harp, vibraphone, marimba, trumpet, clarinet, flute, violin, and cello, played at twelve pitch levels. After a training procedure, he presented listeners with the first 64 ms of such a tone or with the segment between 128 and 192 ms after its onset and asked them to identify the instrument playing the sound. For the set of ten musical instruments, the results of these experiments confirm the role of the onset in musical-instrument identification, but qualify the role of transient information at this onset.
Siedenburg [220] concludes: “Taken together, these findings confirm the prominent status of onsets in musical instrument identification suggested by the literature, but specify that rapidly varying transients (which often but not exclusively occur at sound onsets) have a relatively
limited diagnostic value for the identification of harmonic musical instruments. In conclusion, fairly slowly varying signal components during onsets, likely the characteristic buildup of sinusoidal components in particular, provide the most valuable bundle of acoustic features for perceptual instrument identification” (p. 1086). It is concluded that the onset of an auditory unit plays a major role in timbre perception. It is not clear yet, however, whether this is because there is simply more auditory information available at the onset, or because there is a perceptual process that attributes more weight to information at the onset of an auditory unit [225].

6.7 Composite Timbre Attributes

So far, three timbre attributes have been discussed: Roughness, breathiness, and brightness. For these auditory attributes, tentative computational models have been developed, which have been selected because they are based on knowledge about information processing in the peripheral and low-level central nervous system. The parameters of these models are based on the results of physiological studies or perceptual experiments. This makes it possible to quantitatively check the role of these parameters in timbre perception. In addition, the role of the various processing stages in the emergence of the timbre percept can be verified. In the corresponding chapters, this will be done for the auditory attributes of loudness, pitch, and perceived location. Before doing so, three other timbre attributes will be discussed that are not described in terms of just one auditory attribute, but as composite timbre attributes, i.e., attributes composed of timbre attributes already described above.

6.7.1 Sensory Pleasantness and Annoyance

One of the most commonly discussed topics about sound is its annoyance, and one of the perceptual attributes most often mentioned in relation to annoyance is its loudness. It is common experience, however, that loud noises such as the sounds of a moving train or the pounding of a boat’s engine do not prevent people from sleeping peacefully, while the soft sounds of a dripping water tap can drive people crazy. Hence, there is more to say about the annoyance of a sound than its loudness. Indeed, Aures [14, 15] did not study the annoyance of sound but its opposite, which he called euphony. In later publications, the term euphony was replaced by sensory pleasantness, or briefly pleasantness, the term that will also be used here. Aures [15] proposed that pleasantness could be decomposed into four components: Sharpness, roughness, tonalness, and loudness. In the previous sections, it has already briefly been discussed how sharpness S, roughness R, and tonalness T, the opposite of breathiness, can be quantitatively estimated. In Chap. 3, it was described how the loudness N of a sound can be estimated; the perceptual unit of loudness is the sone. So, the four auditory
attributes that contribute to pleasantness are known, and the pleasantness P of a sound can be estimated by the following equation derived by Fastl and Zwicker [66]:

$$P = e^{-0.7R}\, e^{-1.08S} \left(1.24 - e^{-2.43T}\right) e^{-(0.023N)^2} \qquad (6.5)$$

in which R is in asper, S in acum, T is relative to the tonalness of a pure tone, and N is in sone. This equation shows that pleasantness is negatively related to roughness, sharpness, and loudness, but positively related to tonalness. For comparisons of these model-based estimates with the perceived pleasantness of various sounds, the reader is referred to Fastl and Zwicker [66]. In view of the small coefficient of the loudness N in Eq. 6.5, the contribution of loudness to sensory pleasantness is relatively small, although both roughness and sharpness also increase with sound intensity. Indeed, Fastl and Zwicker [66] mention that, in the experiments they report, loudness only starts to affect pleasantness when it is higher than 20 sone, about the loudness of stationary white noise of 64 dB SPL. They summarize that loudness “influences sensory pleasantness only for values that are larger than the normal loudness of communication between two people in quiet” (p. 245). Not much research has been published as to the usefulness of the concept of sensory pleasantness. This is different for its counterpart, annoyance, but most research on annoyance has focussed on intensity levels, which, as has just been shown, only play a role if they are well above normal communication levels. Besides annoyance, research on the design of warning sounds has concentrated on subjects such as audibility and perceived urgency [10, 55].
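To make Eq. 6.5 concrete, the following Matlab fragment evaluates it for one set of attribute values. The fragment is only a sketch; the values of R, S, T, and N are made up for the purpose of illustration and would in practice come from the roughness, sharpness, tonalness, and loudness models discussed earlier.

```matlab
% Sensory pleasantness according to Eq. 6.5 (after Fastl and Zwicker).
% The attribute values below are illustrative, not measured data.
R = 0.3;    % roughness in asper
S = 1.5;    % sharpness in acum
T = 0.4;    % tonalness relative to that of a pure tone
N = 15;     % loudness in sone

P = exp(-0.7*R) * exp(-1.08*S) * (1.24 - exp(-2.43*T)) * exp(-(0.023*N)^2);
fprintf('Relative sensory pleasantness P = %.3f\n', P);
```

With these values, the loudness factor exp(-(0.023*N)^2) is about 0.89, illustrating that, at moderate loudness, this factor reduces the estimated pleasantness only modestly.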

6.7.2 Voice Quality

In the previous sections, roughness and breathiness were shown to play an important part in the perception of voice quality. In general, the relation is negative; rough or breathy voices are associated with bad voice quality. The same applies to related attributes such as harshness, hoarseness, nasality, or creakiness. Intuitively, one may conclude that it should not be too difficult to scale the quality of a voice on the scales of a number of these verbal attributes. In practice, matters are more complicated, however. In general, it appears that ratings of voice quality by human listeners are not very reliable. Even among experts, the agreement about the quality of disordered voices can be quite low. Indeed, Kreiman and Gerratt [121, p. 1598] conclude: “Results do not support the continued assumption that traditional rating procedures produce useful indices of listeners’ perceptions. Listeners agreed very poorly in the midrange of scales for breathiness and roughness, and mean ratings in the midrange of such scales did not represent the extent to which a voice possesses a quality, but served only to indicate that listeners disagreed.” Another way to determine voice quality would be to use acoustic measures derived from the recorded speech signal. Reviews of how to measure voice quality are
presented by Buder [34] and Hillenbrand [101]. The latter concludes: “A great deal of effort has gone into the development of acoustic measures that can be used to quantify various aspects of vocal function. These efforts have been more successful in some areas than others. In particular, efforts to infer underlying pathological conditions based on the acoustic signal have been limited in part by our incomplete understanding of some fundamental relationships between physiology and acoustics, and in part by the inherently complex, one-to-many relationships that exist between acoustic features and underlying pathological conditions” (p. 14). In a systematic review of articles dedicated to the clinical assessment of voice disorders, Roy et al. [210] come to the conclusion: “Overall, the results of this systematic review [...] provide evidence that selected acoustic, laryngeal imaging-based, auditory-perceptual, functional, and aerodynamic measures have the potential to be used as effective components in a clinical voice evaluation. The review, however, did not produce sufficient evidence on which to recommend a comprehensive set of methods for a standard clinical voice evaluation. There is clearly a pressing need for high-quality research that is specifically designed to expand the evidence base for clinical voice assessment” (pp. 220–221). Apparently, voice quality is difficult to measure quantitatively, whether physiologically, perceptually, or acoustically. It may have seemed easy to define a few verbal attributes such as roughness, creakiness, and breathiness, and to have some experts rate a voice along the scales defined by those attributes. This appears not to yield clear-cut results, however. The interested reader is referred to Kreiman and Sidtis [122], who present an overview of the problems playing a role in voice-quality research.

6.7.3 Perceived Effort

In many everyday circumstances, one not only perceives auditory attributes of a sound such as pitch, loudness, and timbre, but also infers the action causing the sound [130]. One can hear something rolling, scraping, colliding, or breaking. Another important aspect of a sound-generating event is the effort with which it is produced. Almost independently of the intensity with which speech or music arrives at our ears, one has an impression of how “loud” someone is speaking or of the dynamics with which a piece of music is played, e.g., piano (p) or forte (f). Relatively little research has been done regarding the acoustic sources of information involved in making such judgments; most observations are anecdotal and lead to the conclusion that the relative proportion of low- and high-frequency components correlates with the perceived effort. First, the situation for speech will be discussed; the effort that people put into speaking will be indicated with vocal effort. Then the situation for music will be discussed; the effort put into playing a note will be indicated with dynamic effort. As to speech, in an early study of 1969, Brandt, Ruder, and Shipp Jr [30] studied the relation between judgments of the loudness of speech utterances and of the vocal effort with which they are produced. They mention that the “bandwidth” of
the stimulus is the most likely source of information involved in judging the vocal effort with which speech is produced. Remarkably, this is not explicitly mentioned in later studies on vocal effort. Almost all authors mention that, with higher effort, the balance between high- and low-frequency components shifts towards the higher frequencies, e.g., Eriksson and Traunmüller [61], Granström and Nord [88], Licklider, Hawley, and Walkling [136], Sundberg [246], and Sundberg and Nordenberg [247]. This indicates that an increase in vocal effort is associated with an increase in brightness. Other factors, however, also play a role. Granström and Nord [88] report that, when reading aloud, F0 also increases with more effort, and Liénard and Di Benedetto [137] and Eriksson and Traunmüller [61] mention that, when speakers adapt their vocal effort to a longer distance from a listener, both F0 and the frequency of the first formant F1 of spoken vowels also increase. Traunmüller and Eriksson [256] elicited speech spoken with varying effort by asking speakers to direct a spoken utterance to a listener positioned in the open field at varying distances from the speaker. These distances varied over five logarithmically equidistant locations between 0.3 and 187.5 m. The recordings were then presented to other listeners who were asked to estimate the distance between speaker and listener for each recorded utterance. The presentation levels were varied randomly around an equalized level, so that the listeners could not base their judgments on level alone. The correlation coefficient between estimated and actual distance was high, 0.9, but the distances were systematically underestimated, except for the shortest. Moreover, Traunmüller and Eriksson [256] found that creaky voice, often occurring at the end of the utterance, decreased with vocal effort, while the duration of pauses increased. McKenna and Stepp [151] mention that an increase in vocal effort is also coupled with a reduction in the harmonics-to-noise ratio. They attribute this to an increase in high-frequency noise, but argue that the relation with breathiness is not clear. All in all, this indicates that, in addition to more brightness, perceived vocal effort is associated with a higher pitch, an increased vowel height, less creak, and longer pausing. Regarding music production, too, an increase in brightness is generally recognized as being associated with higher dynamic effort. For instance, Risset [202, p. 912] mentions that the harmonic content “becomes richer in high-frequency harmonic when the loudness increases.” The literature about the perception of dynamic effort is not very rich, although it is well known that musicians are well able to convey the dynamics with which they play a part of music, e.g., piano, forte, crescendo or decrescendo, to their listeners [63, 161]. The literature about the effect of these dynamics on the timbre of musical tones is not very extensive either. McAdams [149] remarks: “Similarly to pitch, changes in dynamics also produce changes in timbre for a given instrument, particularly, but not exclusively, as concerns spectral properties. Sounds produced with greater playing effort (e.g., fortissimo vs. pianissimo) not only have greater energy at the frequencies present in the softer sound, but the spectrum spreads toward higher frequencies, creating a higher spectral centroid, a greater spectral spread, and a lower spectral slope.
No studies to date of which we are aware have examined the effect of change in dynamic level on timbre perception, but some work has looked at the role of timbre in the perception of dynamic level independently
of the physical level of the signal” (p. 46). The observation that greater spectral spread is associated with an increase in dynamic effort in music corroborates the aforementioned claim for speech by Brandt, Ruder, and Shipp Jr [30] that bandwidth is the most likely source of information available when assessing the vocal effort of speech. Though it did not include perceptual experiments, the most comprehensive study of acoustic features that may affect the perception of the dynamic effort put into the playing of musical notes is reported by Weinzierl et al. [273]. They asked professional musicians to play all possible notes from low to high in semitone intervals as softly as they could, pianissimo (pp), and as loudly as they could, fortissimo (ff). These musically intended levels were indicated with “dynamic strength”. This was done for 40 orchestral instruments, which resulted in 1764 recordings of low dynamic strength, the pp-tones, and 1718 recordings of high dynamic strength, the ff-tones. For all these recordings, the authors calculated a large number of audio descriptors associated with timbre features defined by the Timbre Toolbox of Peeters et al. [177], briefly described above in Sect. 6.5.2, in order to find out which audio descriptors could best be associated with dynamic strength. By means of a linear discriminant analysis, they found that “sound power”, “spectral skewness”, and “decrease slope” were the best predictors of dynamic strength, resulting in 92% correct classification of the recordings. The meaning of “sound power” is evident. “Spectral skewness” represents the asymmetry of the energy distribution over the spectrum and was calculated on the Cam scale. Here it indicates that this distribution is skewed to higher frequencies for higher dynamic strength. “Decrease slope” represents the slope of the energy distribution over the spectrum and indicates that the spectral balance shifts to higher frequencies with higher dynamic strength. Interestingly, the same analysis was carried out with the exclusion of “sound power” as a predictor. This yielded “spectral skewness”, “spectral flatness”, and “attack slope” as the best predictors of intended dynamic strength, resulting in 85% correct classification. Spectral flatness is a measure of the degree to which the tonal components protrude above the noisy components of the sound; a high dynamic strength is associated with a low flatness, indicating the high level of the tonal components protruding above the noise level. It may be associated with the tonalness discussed above in Sect. 6.3. Attack slope is a temporal audio descriptor comparable with attack time. It will be clear that the attack slope is steeper for sounds of higher dynamic strength, so that the clarity of the beat increases. As mentioned, these results were not coupled with a perceptual study to find out how much each of these audio descriptors contributes to the perceived dynamic effort of a musical tone. For music played on a single clarinet, Barthet et al. [21, p. 316] write: “loud tones are brighter than soft tones.” Moreover, by copying the variations in spectral centroid from an actual expressive performance to a synthetic MIDI-based version of a melody, Barthet, Kronland-Martinet, and Ystad [20] showed that the music regained its perceived expressiveness and naturalness. Though Barthet, Kronland-Martinet, and Ystad [20] and Barthet et al.
[21] discuss their manipulations in the context of musical expressiveness and do not mention perceived dynamics, all these observations
indicate that brightness contributes significantly to the perceived effort with which a sound is produced. As explained, brightness is mostly estimated by calculating the centroid of the specific-loudness distribution. As suggested by Brandt, Ruder, and Shipp Jr [30] and McAdams [149, p. 46], spectral spread or bandwidth may also play a role in the perception of effort, but this remains to be confirmed. In spite of this lack of systematic perceptual studies, all these observations imply that an increase in the effort with which speech or music is produced is associated with spectral changes that increase the brightness and decrease the breathiness, and with temporal changes that increase the clarity of the beat. The role of spectral slope in the perception of vocal effort also becomes apparent from results of Nordstrom, Tzanetakis, and Driessen [164]. They wanted to transform a recording of a voice singing with great effort in such a way that it sounded like a soft, breathy voice. To do so, they used an adapted version of linear predictive coding (LPC). In standard LPC, the glottal source is modelled by what is called a pre-emphasis filter, the spectrum of which has the character of a slowly decaying envelope. In conventional LPC, this pre-emphasis filter is kept constant. Nordstrom, Tzanetakis, and Driessen [164] argue that, owing to this assumption, deviations from the correct formant frequencies and bandwidths become audible. To avoid this, they start by continuously estimating this pre-emphasis filter with low-order LPC. Based on this, they can more accurately estimate the frequencies and bandwidths of the formants. The authors called this adaptive pre-emphasis linear prediction (APLP). This made it possible to vary the spectral slope of the resynthesized speech continuously in a controlled way. Besides controlling the pre-emphasis, they also controlled the amount of noise added to the voice source, as specified by Hermes [100], in order to model that less vocal effort is associated with more breathiness. In this way, Nordstrom, Tzanetakis, and Driessen [164] found that “APLP’s spectral emphasis filter can be used to transform high-effort voices into breathy voices. This resulted in breathy voices that sounded more relaxed and exhibited fewer artifacts than the corresponding transformation using constant pre-emphasis LP” (p. 1087). Results like these demonstrate the relevance of factors such as spectral slope and noise content for the perception of vocal effort.
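The core step of such manipulations, estimating a smooth spectral-slope or pre-emphasis filter per frame with low-order LPC, can be sketched in a few lines of Matlab. This is not the APLP algorithm of Nordstrom, Tzanetakis, and Driessen [164] itself, only an illustration of the underlying idea; it assumes the Signal Processing Toolbox and a speech frame x with sampling rate fs, both of which are hypothetical here.

```matlab
% Illustrative estimate of the gross spectral slope of one speech frame
% with low-order LPC; x is a speech frame, fs the sampling rate.
order = 2;                                   % a low order captures only the overall tilt
a = lpc(x(:) .* hamming(numel(x)), order);   % low-order all-pole fit to the frame
[H, f] = freqz(1, a, 512, fs);               % smooth spectral envelope of the frame

% Inverse filtering with this low-order fit flattens the gross slope;
% re-imposing a steeper or shallower tilt changes the apparent vocal effort.
xFlat = filter(a, 1, x(:));
```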
What has been said may give the impression that it would be easy to manipulate the perceived effort of a musical tone or a spoken word. This would actually mean that the percept of perceived effort is completely understood. There are some problems, however. Changes in spectral slope affect not only the perceived effort, but also the perceived naturalness of a sound. Indeed, Moore and Tan [160] estimated perceived naturalness as a function of various distortions of music and speech, among which was the effect of applying positive and negative slopes to the spectra of the sounds. They showed that applying a spectral slope as small as 0.1 dB/Cam already significantly degraded the naturalness of speech and music. Moreover, Pietrowicz, Hasegawa-Johnson, and Karahalios [181] present a list of possible acoustic correlates of effort level for acted speech. They find that spectral slope alone is a poor indicator for distinguishing whispered, breathy, modal, and resonant speech spoken by trained actors. So, the situation is not straightforward. All this indicates that perceived effort is determined by multiple sources of acoustic information. These sources of
information are involved in the emergence of brightness, breathiness, and beat clarity, but it remains to be sorted out how they interact and determine not only the perceived effort but also the identity and the naturalness of a speech sound or a musical note. Two kinds of speech distinguish themselves in some respects from other kinds of speech in which people raise their voice. The first is shouted speech, the second is Lombard speech. Shouted speech will be discussed first. The distinguishing feature of shouted speech is that it is produced in the upper end of the speaker’s dynamic range [180]. Rostolland [208] reports a 28-dB increase in intensity. This increase in intensity is greater for higher frequencies, resulting in a less steep spectral slope and, hence, a higher brightness. Other characteristics of shouted speech described by Rostolland [208] are an increase in F0, a substantial increase in the duration of the vowels of 67%, and a smaller reduction in the duration of the consonants of 20%. Shouted speech can perhaps be considered an extreme case of speech spoken with much effort, but there are some distinctions. The increase in F0 is so high that it approaches the upper limit of the speaker’s range, thus reducing the F0 range of shouted speech. The range of variation of the formant frequencies is also reduced [198]. Remarkably, whereas speaking with more effort generally does not reduce the intelligibility of the speech [180], the changes induced by shouting do lead to reduced intelligibility [180, 209]. Moreover, it appears to be more difficult to distinguish between different speakers when they shout [208]. Xue et al. [276] studied how these factors can be used to synthesize shouted vowels. The second kind of speech is Lombard speech [138], named after the one who first described it in 1911. Not only do people speak louder in the presence of ambient noise, but their speech also changes in a number of other ways. These changes together are called the Lombard effect. Speakers are usually not aware of these changes. In addition to increases in intensity, Lombard speech is characterized by a higher F0, a reduction in speech rate, resulting in speech segments of longer duration, and a less steep spectral slope [245]. The extent to which this happens depends on the spectral and temporal properties of the noise present [44, 139]. Moreover, Cooke and Lu [44] show, e.g., that, when the noise consists of speech, speakers actively plan their speech to avoid temporal overlap with that speech. The Lombard effect is not unique to human speakers, but also occurs in many species of mammals and birds. There is still some debate about whether frogs exhibit the Lombard effect. This is reviewed by Brumm and Slabbekoorn [32], who argue that the features of the Lombard effect enhance the detection and discrimination of communication sounds in a noisy environment. The Lombard effect does not only consist of acoustic components; speakers also use visual means of communication such as gestures and visible articulation [76, 258]. A review of research on the Lombard effect in humans and in animals up to 2011, 100 years after its first description by Étienne Lombard, is presented by Brumm and Zollinger [33]. Its neural underpinnings are presented by Luo, Hage, and Moss [140].

6.8 Context Effects

In Sect. 6.5.1, context was shown to play an important role in vowel perception. Context effects, however, express themselves not only in vowel perception but also in, e.g., consonant perception, intonation, or the sounds of musical instruments. Indeed, Holt [103] shows that the perceived consonant in the syllables /da/ or /ga/ not only depends on the frequency content of preceding speech, but can also be influenced by the average frequency of a preceding tone sequence. A context effect in intonation is described by Francis et al. [71] for the tone language Cantonese Chinese. In a tone language, the meaning of a syllable depends on the pitch contour realized on the syllable. Mandarin Chinese has four such lexical tones; Cantonese has six. Francis et al. [71] show that the perceived level of lexical tones in Cantonese depends on the average level of preceding tones. Moreover, Sjerps, Zhang, and Peng [227] showed that, in contrast with vowel perception, tone perception in a Cantonese syllable depends not only on what comes before but also on what comes after the syllable. All these examples show that listeners adapt their range as to what is low and what is high to the preceding context and, in the case of lexical tones, also to the following context. Moreover, Frazier, Assgari, and Stilp [72] and Piazza et al. [179] show that contrast effects also affect musical sounds. In the section on the timbre space of vowel sounds, the spectral enhancement effect was discussed. This effect occurs when spectral dips are, at a certain moment, filled in such a way that the spectrum becomes flatter. In this way, spectral components of relatively low intensity are raised up to the level of the adjacent spectral components that were already present. Consequently, the perceptual weight of the raised spectral components is increased. When these dips are at the positions of the formant frequencies of some vowel, this can result in perceiving that vowel even though there are no spectral peaks at the formant frequencies of that vowel. In order to present an explanatory description of spectral enhancement, Summerfield et al. [244] pointed to the role of peripheral adaptation. The idea is that initially the frequency channels in the dips are not adapted, while the other frequency channels are. At the moment the dips are filled in, the information in the un-adapted channels contributes more to the timbre of the sound than the information in the adapted channels, thus inducing the timbre of a sound that does have peaks at those parts of the spectrum. The role of adaptation in general will be clear, but the extent to which, besides peripheral adaptation, more central adaptation processes come into play is less certain. This role of peripheral adaptation in enhancement and contrast effects is further discussed by Chambers et al. [40], Feng and Oxenham [67, 68], Stilp [236, 237], Stilp and Anderson [238], and Stilp and Assgari [239]. They conclude that peripheral adaptation certainly plays a role, but that more central mechanisms must also be involved. In conclusion, it is not only the formant frequencies of a vowel that determine which vowel is perceived. The context in which a vowel is produced can have a considerable impact. Similar conclusions can be drawn for the perception of consonants and of lexical tones. Apparently, context determines what is high and low in frequency, or long and short in duration. Spoken by a man speaking high in his pitch register, the pitch
of a vowel can be perceived as higher than the pitch of a vowel spoken by a woman speaking low in her register. Similarly, a canonically short vowel spoken in a slow context can be perceived as longer in duration than a canonically long vowel spoken in a fast context. In that sense, the contrast effects result in a normalization to the characteristics and speaking style of the speaker [228]. This explanation describes a normalization process that is controlled by contextual information. Another possibility is that other sources of information are available for normalization. Irino and Patterson [109] showed that it is quite possible to estimate the size of a vocal tract, not only based on the resonance frequencies of the vocal tract, the formants, but also based on the duration of the impulse response of the vocal tract. Using a specially developed vocoder with which, based on this model, speech could be synthesized as if produced by speakers of different sizes, Ives, Smith, and Patterson [112] and Smith et al. [230] showed that this information plays a role in estimating the size of the speaker, also beyond the normal range of speaker sizes. Moreover, these authors present evidence that this information is obtained at an early stage of neural processing, though it will be clear that it does not depend on peripheral adaptation alone. This implies that listeners can first estimate the size of the speaker and then use this information to normalize the formant frequencies. Incidentally, similar studies show that listeners can hear the size of musical instruments [173, 174, 182, 260]. A review of size perception of both speakers and musical instruments is presented by Smith et al. [175]. The question of how people recognize a speaker is a complex and not well understood problem. A review of this problem and of speaker recognition by machines is presented by Hansen and Hasan [97]. In summary, auditory information used in speech perception is normalized in various respects. On the one hand, this process of normalization guarantees that vowels, consonants, or lexical tones spoken by different speakers in different contexts retain their identity. On the other hand, it brings about the information that characterizes the speaker and the context in which the speech is produced. In this way, the perceptual constancy of both the speaker and the context is guaranteed. An extensive review of context effects is presented by Stilp [235]. He distinguishes contrast effects based on whether they are forward or backward and whether they are proximal or distant. The latter distinction relates to whether the contrast occurs between adjacent sound components, as in the spectral enhancement effect, or whether it extends over intermediate sound segments. Stilp [235] reviews these kinds of contrast effects in a systematic way, both for spectral and for temporal contrast effects. Moreover, he also discusses spectrotemporal context effects. Spectrotemporal contrast effects occur not only for speech but also for other sounds produced in rooms with different amounts of reverberation. Due to the different reverberant and reflective properties of a room for the various frequency components of a sound, the sound is spectrally coloured and spread out over time as it arrives at the ears of the listener. In this case, too, the perceived sounds are normalized as to the reverberation of the listening room, so that, on the one hand, the auditory attributes of the sound remain
constant and, on the other hand, the normalization constants specify the room. The interested reader is further referred to this review by Stilp [235] and the literature cited therein.

6.9 Environmental Sounds

A considerable part of what has been said about timbre perception was related to music or speech sounds. Many of the sounds we hear, however, are neither music nor speech. Natural sounds that are neither speech nor music are called environmental sounds [93, p. 839]. It is not easy to present a general framework within which one can systematically describe them as a coherent whole. The first studies on the perception of environmental sounds, e.g., Li, Logan, and Pastore [134], Vanderveer [262], and Warren and Verbrugge [271], started from the “ecological approach” as established for the visual system by Gibson [81, 82]. This ecological psychology advances the theory of direct perception, which states that perception is “unaided by inference, memories, or representations” [153, p. 2]. To make this more concrete, one may compare this with a properly trained neural network, which also is a medium that recognizes and categorizes objects and events without inference or memory. An important aspect of this is that perception is based on sensory information that is highly complex, but structured and rich in information. Another important aspect of ecological psychology is that the richness of the incoming information is enormously expanded by moving around in the environment. The information flow over the senses is considered to provide an essential source of information for the organism. The organism actively scans the environment, thus making perception an active process. Hence, the observer and the environment are an inseparable pair, and perception and action cannot be separated from each other. An example may make this concrete. In a soccer match, the players have to scan the situation almost instantaneously and make decisions in fractions of a second, decisions that are mostly based on visual information but that can also be based on sound [37]. Ecological psychology places a strong emphasis on the wealth of information used by a perceptual system. In this book, it will be shown that, in general, many sources of information are involved in the emergence of auditory attributes, and that they enter the perceptual system over a large part of the sensory array, in this case the tonotopic array. A characteristic example in quite a different context is presented by Giordano, Rocchesso, and McAdams [85], who studied the sources of information involved in judging the hardness of two objects producing an impact sound. The two objects were a hammer and a suspended square plate, both of the same material. The area of the suspended plate was varied between 225, 450, and 900 cm². The hardness was varied by choosing materials ranging from pine wood to hard steel. The judgments by the listeners were related to a number of mechanical descriptors of the hammer and the plate, such as their elasticity, density, and size. Based on the results, the authors conclude: “Consistently with previous studies on sound source perception, results from all experiments confirm that perceptual judgments integrate information from
multiple acoustical features. [...] Overall, the results of this study point toward a joint influence of information accuracy and exploitability on the structure of the perceptual criteria. Thus, accurate information is generally more relevant perceptually, although accurate but not easily exploited information is perceptually secondary at best. When generalized to the perception of environmental sounds at large, the results of this study imply that the perceptual weight of the acoustical features can be fully predicted from two sets of measurements: firstly, task-dependent measures of the accuracy of the acoustical information within the environment in which source-perception criteria are acquired; secondly, task-independent measures of the ability of a listener to exploit the information carried by the acoustical features. A theory of source perception will benefit from further empirical tests of these predictions” (p. 475). In this book, the term source of information (SOI) is used. In the literature, usually the term cue is used instead. This term is avoided here because it is ambiguous [62, p. 163]. It is often used both for something observers use in perception and for a physical aspect of a stimulus. But observers cannot use physical information directly. They only have conscious access to perceptual attributes such as pitch and loudness. So, it is not correct to speak about listeners using the frequency cue or the intensity cue, but one may speak about listeners using the pitch cue or the loudness cue. In the rest of this book, the term “cue” will be avoided and the term “source of information” will be used instead. The reader interested in the ecological psychology of hearing is further referred to the overviews of the ecological account by, e.g., Macpherson [143, pp. 24–39] or Neuhoff [162]. Here, only some comments will be presented on the first systematic and elaborate study based on the ecological psychology of hearing, presented by Gaver [77, 78]. In the first of these, Gaver [78] asks the question: “What in the world do we hear?” He starts by distinguishing between musical listening and everyday listening. In musical listening, “the perceptual dimensions and attributes of concern have to do with the sound itself, and are those used in the creation of music” (p. 1). The attributes explicitly mentioned by the author are pitch, loudness, roughness, and brightness. Musical listening is contrasted with everyday listening, in which listeners listen to the events and objects that produce the sounds, where these events take place, and in what kind of environment they take place. Hence, musical listening is concerned with the perception of auditory attributes, whereas everyday listening is concerned with hearing what happens where and in what environment. In the second part of this first study, Gaver [78] presents a taxonomy of environmental sounds. At the top level of this hierarchy, he distinguishes sounds produced by liquids, sounds produced by vibrating solids, and sounds produced by gases. In accordance with the principles of categorization [207], basic level events are distinguished within each of these categories. For instance, within the category of liquid sounds, there are four basic level events: dripping, pouring, splashing, and rippling. These basic level events are then further subdivided, and so on.
For each division, hybrid sounds can be defined, consisting of events in which sounds from more than one of the top-level categories are mixed, e.g., the sound of rain, a liquid, on a roof, a solid. Using different methods, various authors have further refined and extended this categorization [29, 93, 105, 144]. The methodology of the categorization of
environmental sounds is discussed by Susini, Lemaitre, and McAdams [248]; a recent review is presented by Guastavino [92]. In a follow-up study, Gaver [77] asks the question: “How do we hear in the world?” An important method discussed is that of analysis-by-synthesis. The method proposes a cycle consisting of a repeated alternation of an analysis phase with a synthesis phase. It starts with an analysis of the sound, mostly based on its spectrogram, and, based on the result of that analysis, a sound is synthesized that is expected to sound similar to the original. The result of this synthesis is then compared with the original sound. The differences are analysed and, based on the results of that analysis, the synthesis procedure is adapted, after which the result is again compared with the original sound, and so on. The method of analysis-by-synthesis was described for speech by, e.g., Bell et al. [24] and for the timbre of musical sounds by, e.g., Risset and Wessel [203]. In this cycle of analysis and synthesis, one hopes that, with each cycle, the perceptually relevant aspects of the original sound are better captured by the synthesized sound. There are various criteria for stopping the cycle. The strictest criterion is that the cycle stops when the synthesized sound is identical to the original. This criterion will, however, seldom be met. A less strict criterion is to stop when the spectrograms of the original and the synthesized sound are the same. This, in fact, implies that phase differences between the original and the synthesized sound are tolerated. Most often, however, one will already be satisfied with the result when one can hear no difference between the original and the synthesized sound. In that case, the original and the synthesized sound are called perceptually identical. One can also say that the synthesis is transparent. For many practical purposes, however, one will be satisfied when an even less strict criterion is met, that of perceptual equivalence. Perceptual equivalence has been defined for speech sounds by Best, Morrongiello, and Robson [26], for whom it means that the same phonemes are perceived in the original and the synthesized sound. In the case of environmental sounds, perceptual equivalence means that differences between original and synthesis may be audible but do not affect the kind of sound event or object that is heard. The concept of perceptual equivalence is linked to the hierarchical level at which it is defined. For instance, two speech sounds can be equivalent at the phoneme level, the syllable level, or the word level. As an example of analysis-by-synthesis, Gaver [77] uses frequency modulation (FM) for the synthesis of machine sounds. The technique of FM synthesis has been extensively used for the synthesis of musical tones [41] and implemented in music synthesizers [42]. In Sect. 1.7, it was discussed how FM synthesis can be used to synthesize sounds with rather complex spectra. One can add another level of complexity by sinusoidally modulating not only the frequency of a sinusoid but also the modulation frequency itself. By setting the modulation frequencies in the roughness range, one can thus synthesize quite harsh and rattling sounds. Gaver [77] uses this method to synthesize machine-like sounds. It will be clear, however, that the equations with which these sounds are generated do not bear much resemblance, if any, to the acoustic laws that govern the sound-production process in real machines.
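To give an impression of what such nested frequency modulation looks like in practice, the fragment below is a minimal Matlab sketch in the spirit of Sect. 1.7; the parameter values are chosen freely for illustration and are not Gaver's.

```matlab
% Nested FM: a carrier whose frequency is modulated by a modulator whose
% frequency is itself modulated. All parameter values are illustrative.
fs = 44100;                     % sampling rate (Hz)
t  = (0:1/fs:2).';              % 2 s time axis
fc  = 300;                      % carrier frequency (Hz)
fm  = 70;  Im  = 8;             % modulation rate in the roughness range
fm2 = 6;   Im2 = 3;             % slower modulation of the modulator itself

y = sin(2*pi*fc*t + Im*sin(2*pi*fm*t + Im2*sin(2*pi*fm2*t)));
soundsc(y, fs);                 % a harsh, rattling, machine-like sound
```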

Fig. 6.13 The cycle of measuring, modelling, and perceiving to study the process of perception. The phase ‘model’ represents the laws of physics, the phase ‘measure’ represents reality as we measure it with physical equipment, and the phase ‘perceive’ represents reality as we perceive it. Only by closing this cycle can we really understand the process of the perception of objects and events. Modified after Li, Logan, and Pastore [134] and Gaver [77]. (Matlab)

The disadvantage of this is that one can only turn the knobs representing the parameters of the FM synthesis until the result sounds more or less as desired. To remedy this, Gaver [77] extended this method of analysis-by-synthesis by including a phase representing the mechanics of the sound-producing system. In this respect, Gaver [77] followed Li, Logan, and Pastore [134], who had proposed this method in a study of the sound of walking. A schematic representation of this research cycle is presented in Fig. 6.13. Farnell [64, p. 7] calls the three parts of this research paradigm the “three pillars of sound design”: the physical, the mathematical, and the psychological. Gaver [77] used this paradigm to synthesize simple impact sounds, scraping sounds, and dripping sounds. In addition, he discusses the synthesis of temporally more complex sounds such as breaking, bouncing, and spilling. This approach will be illustrated by two simple examples: the synthesis of a single impact sound, and of the sound of bouncing. First, the synthesis of simple impact sounds will be discussed [77]. It starts with a physical model of a solid object that is hit by another hard object. The impacted object has various modal frequencies that are excited by the impacting object and then decay exponentially. This results in a model in which an impact sound consists of a sum of decaying exponentials:

$$\sum_{k=1}^{N} a_k\, e^{-t/\tau_k} \sin\!\left(2\pi f_k t - \varphi_k\right) \qquad (6.6)$$

In this equation, N is the number of modal frequencies, a_k are the amplitudes, τ_k are the damping constants, f_k are the modal frequencies, and ϕ_k are the phase shifts. Ideally, the values of these parameters are derived from a physical model in which the shape and the material of the impacted object are specified. This equation can then be implemented. An example of a sequence of seven impact sounds synthesized as proposed by Gaver [77] is presented in the demo of Fig. 6.14, in which N = 3, a_k = 1/3, τ_k = 80 ms, and ϕ_k = 0 for all seven impact sounds. What is varied from impact to impact is the lowest partial, which starts at 110 Hz in the first impact and then increases by one

Fig. 6.14 Impact sounds consisting of three partials with varying average frequency. When k is the order of the notes, the lowest partial is 110 · 2^(k−1) Hz, k = 1, 2, ..., 7. The ratio between two adjacent partials is 1.2. Note the shift in the perceived size of the impacted object and the change in its perceived material. (Matlab) (demo)

octave with each subsequent impact. The ratio between adjacent partials is 1.2 for each impact sound. Listening to the demo shows that every sound is immediately recognized as an impact sound, “direct perception”. Moreover, in the course of the seven impact sounds, one perceives a change in the material of the impacted object and a decrease in its size, whereas only the frequencies of the partials change. This provides a handle to investigate the role of the parameters of the model in defining the timbre of the sound. Gaver [77, p. 296] summarizes that the amplitude parameters a_k represent the force exerted by the impact and the hardness of the impacting object, the damping parameters τ_k represent the material, and the frequency parameters the material and the size of the impacted object. An essential aspect of this paradigm is that the values of the parameters of the model are derived from a physical analysis of a model of the sound-producing objects, and that their values can be compared with measurements of real impact sounds. If there is a systematic discrepancy between the values that result from the model and those that result from the measurements, the model must be adapted. When the match is satisfactory, one can systematically vary the parameters of the model, also outside the range they can have in physical reality. This makes it possible to measure the perceptual effect of every single parameter of the model on the perception of the corresponding objects and events, independently of the other parameters. Such perception experiments must then reveal to what extent perceived reality agrees with physical reality and, if it does, the accuracy with which the physical reality can be perceived.
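As an illustration of Eq. 6.6, the following Matlab sketch synthesizes one impact sound with the parameter values mentioned for the demo of Fig. 6.14: three partials with amplitudes 1/3, damping constants of 80 ms, the lowest partial at 110 Hz, and a ratio of 1.2 between adjacent partials. It is a sketch of the equation, not the script that generated the demo.

```matlab
% One impact sound as a sum of exponentially decaying partials (Eq. 6.6).
fs  = 44100;                    % sampling rate (Hz)
t   = (0:1/fs:0.5).';           % half a second of sound
N   = 3;                        % number of partials
a   = ones(N,1)/3;              % amplitudes a_k
tau = 0.080*ones(N,1);          % damping constants tau_k (s)
phi = zeros(N,1);               % phase shifts phi_k
f   = 110 * 1.2.^(0:N-1).';     % partial frequencies f_k (Hz)

y = zeros(size(t));
for k = 1:N
    y = y + a(k) * exp(-t/tau(k)) .* sin(2*pi*f(k)*t - phi(k));
end
soundsc(y, fs);
```

Multiplying f by 2 from impact to impact reproduces the octave steps of Fig. 6.14.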

Fig. 6.15 Example of a bouncing sound. Every impact consists of eight partials. The lowest partial has a frequency of 440 Hz, and the ratio of the frequencies of two adjacent partials is 1.2. The damping constant is 2 ms, the interval between the first and the second bounce is 0.2 s, and the restitution coefficient is 0.9. (Matlab) (demo)

The next demo, shown in Fig. 6.15, presents a simulation of bouncing. The impact sound consists of eight partials, the frequency of the lowest partial is 440 Hz, the ratio of the frequencies of adjacent partials is 1.2, and the decay parameter is 2 ms. The simulation presumes a perfectly spherical object bouncing on a perfectly flat surface. Moreover, with every bounce, the bouncing object loses a fixed part of its energy. Let p be the proportion of energy that the bouncing object retains after each bounce. It can then be shown that the interval between successive bounces decreases with every bounce by a factor √p, which is called the restitution coefficient r_c of the bouncing system. Some straightforward calculations then show that, when I_0 is the time interval between the first and the second bounce, the next intervals between the successive bounces can be represented as I_n = r_c^n I_0. Summation over these intervals then yields the moments T_n at which the nth impact occurs:

$$T_n = \sum_{k=0}^{n-2} I_k = \sum_{k=0}^{n-2} r_c^{k}\, I_0 = \frac{1 - r_c^{\,n-1}}{1 - r_c}\, I_0 \qquad (6.7)$$

In this equation, there are only two parameters, I_0 and r_c. The parameter I_0 is determined by the height from which the falling object is dropped, and the restitution coefficient r_c by the elasticity of the material of the bouncing object and of the plate on which it bounces. The sound played in the demo of Fig. 6.15 will immediately be recognized as the sound of a bouncing object, “direct perception”. Actually, when I first synthesized such a sound, my roommate who sat beside me looked at the floor and asked what I had dropped, whereas the sound came from small speakers on my desk and there was carpet on the floor. So, one sees that perception and action cannot be separated. Another instance of “ecological” hearing occurred when a student of mine and I were synthesizing dripping sounds. My colleagues complained that they had to go to the bathroom so often. If this is not ecological hearing, what is?
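The bounce times of Eq. 6.7 can be combined with an impact sound like the one of the previous sketch into a minimal bouncing simulation in Matlab. The fragment below assumes an impact waveform y and sampling rate fs as above (for the demo of Fig. 6.15 one would use eight partials, a lowest partial of 440 Hz, and a damping constant of 2 ms) and uses the I_0 and r_c values given in the caption of Fig. 6.15; it is again an illustration, not the original demo script.

```matlab
% Bouncing: add copies of an impact sound y at the impact moments of Eq. 6.7.
I0 = 0.2;                        % interval between first and second bounce (s)
rc = 0.9;                        % restitution coefficient
nBounce = 40;                    % number of bounces to synthesize

I  = I0 * rc.^(0:nBounce-2);     % successive intervals I_n = rc^n * I0
Tn = [0, cumsum(I)];             % impact moments, first impact at t = 0

out = zeros(round(Tn(end)*fs) + numel(y), 1);
for n = 1:nBounce
    i0 = round(Tn(n)*fs) + 1;    % sample index of the nth impact
    out(i0:i0+numel(y)-1) = out(i0:i0+numel(y)-1) + y(:);
end
soundsc(out, fs);
```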

The study of the perception of environmental sounds has progressed considerably. Rocchesso and Fontana [206] present an extensive overview of studies on the design of environmental sounds, including an “annotated bibliography” of earlier research [83]. More recent reviews are presented by Farnell [64], Lemaitre, Grimault, and Suied [127], and Lutfi [141]. A toolkit for the design of environmental sounds has been developed by Baldan, Delle Monache, and Rocchesso [17], which provides modules for the synthesis of “basic solid interactions”, “compound solid interactions”, “liquid sounds”, “continuous turbulences”, “explosions”, and “machines”. One of the problems with these approaches is that, for quite a few interesting sounds such as those of rolling, scraping, and rubbing, the physics is very complex, and analytic solutions of the mostly non-linear, higher-order partial differential equations are not available. The models are therefore based mostly on numerical solutions or approximations. In spite of that, the results of these simulations can be quite realistic. In our institute, a number of studies on the perception of the sounds of rolling and bouncing balls have been carried out, e.g., Houben [104] and Stoelinga [240]. Studies like these showed that listeners could discriminate between slower and faster rolling balls, between smaller and larger balls, and between balls rolling over thinner or thicker plates. Indeed, there was a monotonic relation between the actual magnitudes of these variables and the estimates by the listeners. In absolute judgment tasks, however, these variables could be systematically over- or underestimated. By systematic manipulation of spectral and temporal features of the rolling sound, it could be established that listeners can use both spectral and temporal information in these tasks. Studies like these sometimes reveal remarkable phenomena. For instance, a study by Stoelinga et al. [241] describes a previously unknown interference pattern consisting of ripples over the spectrum in the sound of a ball rolling over a table. The spacing between the ripples depends on the distance of the ball from the edge of the table: the closer to the table edge, the smaller the spacing. Consequently, one can hear a change in the timbre of the sound as the ball approaches or moves away from the table edge. It appears that the interference pattern results from interference between the direct rolling sound generated at the contact point between ball and table, and that sound after it has been reflected at the edge of the table. Discrimination experiments revealed that, on the one hand, naive listeners could clearly hear the difference between the sound of a ball rolling towards the table edge and that of a ball rolling away from it. On the other hand, identification experiments showed that they could not identify the direction of the rolling, i.e., whether it was towards or away from the table edge. The conclusion of these experiments is that the perception of environmental sounds has its limitations. In general, judgments involving temporal properties of sounds are more accurate than judgments involving spectral differences. For instance, substantiated by neurophysiological arguments, Lemaitre et al. [130] showed that listeners are better at identifying the kind of action producing a sound, mostly expressed in the temporal structure of the sound, than the materials of the interacting objects, mostly expressed in the spectral structure. They thus confirmed an older result by Warren and Verbrugge [271], who investigated the sources of information (SOIs) involved in distinguishing the sounds of breaking objects from those of bouncing objects. Warren and Verbrugge [271] also found that temporal SOIs played a larger role than spectral SOIs. Sometimes, results are contradictory. For instance, a brighter timbre is generally associated with a smaller object.
Regarding the impact sounds of plates, this is correct as to the length and the width of the plates. The spectra of the impact sounds of longer
or wider plates will generally have a lower spectral centroid, so that the timbre will be less bright. As to thickness, however, the situation is different. Because a thicker plate has a higher stiffness, the spectrum has a higher spectral centroid, resulting in a brighter impact sound, which listeners associate with a smaller plate. Another contradictory phenomenon was found by Eitan et al. [57]. Using pure tones of constant frequency, they confirmed that higher tone frequencies are generally associated with smaller objects and lower frequencies with larger objects. For pure tones rising or falling in frequency, however, they found that “ascending pitch is congruent with growing size and descending pitch with shrinking size” (p. 273).

6.10 Concluding Remarks

In the previous section, it was shown that auditory judgments of sound events have their limitations. On the other hand, people's capacities to identify sounds based on timbre are sometimes almost incredible. Indeed, I have heard claims that some people can distinguish the brand of a car by the sound of its closing doors. As to music recordings, it is well known that some people can recognize, above chance level, the composer of the music, its performer, and sometimes even the concert hall in which it was recorded. Another anecdote I heard on the radio was about a recording engineer who was actively involved in producing compact discs with classical piano music. It appeared that he was also able to identify the person who had tuned the piano used for a recording. Indeed, when a record with piano music unknown to him was played, he correctly identified the tuner. These examples suggest that the capacities of the human hearing system to identify sounds are quite extraordinary.

Anecdotal evidence of this kind, however, has to be interpreted with great care, as shown by two quite dissimilar examples, the first concerning the quality of old Italian violins, the second wine tasting. It is “well-known” that certain kinds of old violins, e.g., those made by Stradivari or Guarneri, have exceptional tonal qualities that today's violins are far from being able to match. In blind studies with renowned soloists, including sessions in a small salon [74] and in a concert hall [75], however, the following results were obtained: “These results [...] present a striking challenge to near-canonical beliefs about Old Italian violins. The current study, the second of its kind, again shows that first-rate soloists tend to prefer new instruments and are unable to distinguish old from new at better than chance levels” [75, p. 7224]. Readers interested in these issues are referred to Levitin [133]. He argues that the main factor determining the listeners' quality judgments of the violins is the price they expect the instrument to have. Similar results were obtained for judgments regarding the quality of the back wood of acoustic guitars [38]. Moreover, Levitin [133] discusses similar results for wine tasting. Indeed, the judgments by wine tasters are largely determined by the expectations they derive from the price of the wine. For instance, Goldstein et al. [86] found that, for non-expert wine drinkers, there was a small but significantly negative (!) correlation between the judgments of the wine tasters and the price of the wine. This shows that they actually preferred the lower-priced wines, probably
because those were the wines they were used to drinking. For expert wine drinkers, this correlation, though positive, could explain only a small part of the variance of the results. As to Stradivari violins, the situation just sketched appears somewhat more complicated, however. A recent study [211] showed that at least one Stradivari violin does have exceptional qualities. In a double-blind experiment, the authors asked 70 violin makers of the Cremona area to judge the quality of six violins: four modern violins and two Stradivari. It appeared that these expert listeners preferred one of the Stradivari violins over the other five. Rozzi et al. [211] argue that the exceptional quality of a Stradivari like this may have boosted the judgments of Stradivari violins in general. They also discuss the role of the fact that listeners tend to prefer “louder violins” and that the intensity of music played on older violins is generally lower than that of music played on newer ones. Nevertheless, all these results show that the expectations listeners have about a stimulus can dominate their judgments. Research concerned with the quality of sound must therefore take strict care to control for this expectation bias. A review of how to do so for musical instruments is presented by Fritz and Dubois [73]. Similar requirements should be imposed on studies regarding the quality of recording systems, loudspeakers, concert halls, etc.

Regarding the characterization of product sounds, Özcan, Van Egmond, and Jacobs [169] distinguish nine different concepts underlying their mental representation: sound source, action, location, sound type, onomatopoeias, psychoacoustics, temporal descriptions, emotions, and abstract meaning. The timbre attributes discussed so far, roughness, breathiness, and brightness, fall into the category of psychoacoustics, just as do loudness and pitch, the auditory attributes that will be discussed in the next two chapters. Another aspect, the perceived location of a sound, will be discussed in Chap. 9. Besides the ones discussed, there are, of course, many more attributes of timbre, to mention just a few: nasality, impulsiveness, complexity, richness, or harmonicity. As mentioned, Pedersen [176] gives a list of hundreds of English and Danish words used in describing sounds. Such a vocabulary helps in communicating about sound. Nevertheless, in spite of its size, it will in many cases be impossible to describe a sound in such terms in a way that the listener understands unambiguously what kind of sound is meant. In practice, many sounds are described not by the timbre attributes discussed so far, but by indicating the objects or events that produce them [77, 78]. Another way to communicate about sounds is by vocal imitations or onomatopoeias [129], an effective method especially for sounds that are difficult to identify [128]. A more general approach is taken by Carron et al. [39]. In order to communicate about sounds in the design process, they propose 35 adjectives, of which seven are presented separately, while the remaining 28 constitute 14 pairs of semantic differentials. These are divided into the following categories: basic features, temporal features, and timbre features. The basic features are: high/low, loud/weak, noisy/tonal, short/long, dynamic, natural/artificial, and near/far; the temporal features are: continuous/discontinuous, constant/fluctuating, ascending/descending, crescendo/decrescendo, and slow attack/fast attack;
and the timbre features are: matte/resonant, rough/smooth, bright/dull, nasal, rich, round, warm, metallic, and strident.

At the start of this chapter, it was mentioned that, officially, a sound is characterized by four perceptual attributes: pitch, loudness, duration, and timbre. Perceived location is often added as a fifth auditory attribute of a sound. In the previous chapter, it was shown that, among these four attributes, duration is in many cases perceptually ill-defined. In the next two chapters, pitch and loudness will appear to be well-defined auditory attributes. Although much about pitch and loudness perception remains to be studied in more detail, a general framework will be presented for each within which many phenomena related to their perception can be understood. Thanks to well-developed computational models of loudness and pitch perception, many of these phenomena can be described quantitatively, while others can be understood in qualitative terms. As to timbre, the situation is much more complex. Even for relatively small sets of sounds, such as the sounds of musical instruments or the vowels of speech, the experimental results appear to be limited, and the acoustic correlates of the dimensions found do not lend themselves to the reliable recognition of the different sounds. In the discussions of the perception of brightness, roughness, and breathiness, models were presented that are based exclusively on relatively low levels of auditory processing. For some classes of sounds, these models could be applied quite reliably in predicting the corresponding perceived timbre attribute. Other aspects of timbre, such as sensory pleasantness, voice quality, or perceived effort, were presented as composite attributes. These are defined by at least three or four different, supposedly more basic timbre attributes: sensory pleasantness is defined by brightness, roughness, breathiness, and loudness; voice quality by at least roughness, breathiness, and nasality; and perceived effort by brightness, breathiness, and beat clarity. As to sensory pleasantness, the authors could show that the predictions correlated well with human judgments for the class of sounds for which the model was tested. Up to now, it has not appeared possible to arrive at an unequivocal measure in which to express either voice quality or perceived effort.

In Sect. 6.5.2, the essentially joint spectrotemporal features that characterize the sounds of musical instruments were mentioned [3, 170]. An interesting extension to a larger set of sounds is presented by Mlynarski and McDermott [158]. These authors do not directly take the spectrotemporal receptive fields (STRFs) measured in the auditory cortex as their point of departure. Instead, they derive a set of spectrotemporal feature patterns by applying a sparse-coding technique to a large database of speech and environmental sounds. The authors argue that these spectrotemporal feature patterns have properties similar to the STRFs of cortical neurons. This suggests that the spectrotemporal sensitivities of such cortical auditory neurons are established by an unsupervised learning process tuned to the spectrotemporal properties that characterize environmental sounds (see also Gervain and Geffen [80]). These ideas have been extended to speech sounds and environmental sounds by Sheikh et al. [216].
All these authors emphasize the essentially joint spectrotemporal character of timbre features and their nonlinear processing. Reviews of neural processing of timbre are presented by Town and Bizley [255] and Alluri and Kadiri [4].
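As a simple illustration of what a joint spectrotemporal representation may look like, the following Matlab sketch computes the two-dimensional Fourier transform of a log-magnitude spectrogram, often referred to as a modulation power spectrum. It is only a rough stand-in for the representations discussed above, not the sparse-coding model of Mlynarski and McDermott [158] or a cortical STRF model, and the file name and analysis parameters are arbitrary placeholders.

% Minimal sketch: joint spectrotemporal modulations as the 2-D Fourier
% transform of a log-magnitude spectrogram. File name is a placeholder.
[x, fs] = audioread('example.wav');     % hypothetical input file
x = mean(x, 2);                         % mix down to mono

nwin = 512; hop = 128;                  % window length and hop size (samples)
w = 0.5 - 0.5*cos(2*pi*(0:nwin-1)'/(nwin-1));   % Hann window
nframes = floor((numel(x) - nwin)/hop) + 1;
S = zeros(nwin/2, nframes);             % magnitude spectrogram (frequency x time)
for k = 1:nframes
    seg = x((k-1)*hop + (1:nwin));
    X = abs(fft(seg(:) .* w));
    S(:, k) = X(1:nwin/2);
end

M = fftshift(abs(fft2(log(S + eps))));  % joint spectral and temporal modulations
imagesc(M); axis xy
xlabel('Temporal modulation (arbitrary units)')
ylabel('Spectral modulation (arbitrary units)')

In such a plot, slow temporal modulations lie near the centre along the horizontal axis, while faster modulations, such as those underlying roughness or rapid onsets, lie further out.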

It is concluded that the timbre of an auditory unit is not just specified by a limited number of basic features or dimensions, but should be described at various distinct levels. This is evident for sounds such as speech sounds, where, on the one hand, the timbre of one utterance can differ from that of another in, e.g., breathiness or perceived effort, while, on the other hand, timbre can change at a level lower than that of a phoneme, e.g., due to coarticulation. Moreover, the task the listener carries out can also have a strong effect. For instance, in a categorization task in which listeners had to respond as quickly as possible to sung vowels, percussion, and strings, but not to other musical sounds such as those of wind instruments, Agus et al. [1] showed that reaction times to sung vowels were significantly shorter than those to the sounds of percussive instruments and strings (see Bigand et al. [27] and Suied et al. [243] for a discussion of methodological issues). This indicates that sung vowels are processed differently by the central nervous system than the sounds of percussive instruments or strings. Furthermore, Formisano et al. [70] showed for speech that different brain structures are activated for speaker identification and for speech recognition. This, in turn, indicates that the auditory attribute the listener pays attention to depends on the task, in this case speaker identification or speech recognition. Based on phenomena like these, Pressnitzer, Agus, and Suied [194] conclude: “A fundamental reason that makes timbre so elusive may therefore be that timbre recognition is a profoundly adaptive mechanism, able to create and use opportunistic strategies that depend on the sounds and task at hand” (p. 132).

References

1. Agus TR et al (2012) Fast recognition of musical sounds based on timbre. J Acoust Soc Am 131:4124–4133. https://doi.org/10.1121/1.3701865
2. Akeroyd MA, Patterson RD (1995) Discrimination of wideband noises modulated by a temporally asymmetric function. J Acoust Soc Am 98:2466–2474. https://doi.org/10.1121/1.414462
3. Allen EJ et al (2018) Encoding of natural timbre dimensions in human auditory cortex. Neuroimage 166:60–70. https://doi.org/10.1016/j.neuroimage.2017.10.050
4. Alluri V, Kadiri SR (2019) Neural correlates of timbre processing. In: Siedenburg K (ed) Timbre: acoustics, perception, and cognition, Chap 6. Springer International Publishing, Cham, Switzerland, pp 151–172. https://doi.org/10.1007/978-3-030-14832-4_6
5. Almeida A et al (2017) Brightness scaling of periodic tones. Atten Percept Psychophys 79:1892–1896. https://doi.org/10.3758/s13414-017-1394-6
6. ANSI (1994) ANSI S1.1-1994. American National Standard Acoustical Terminology. New York, NY
7. ANSI (1995) ANSI S3.20-1995. American National Standard bioacoustical terminology. New York, NY
8. ANSI (1960) USA Standard, Acoustical terminology (including mechanical shock and vibration). New York, NY
9. Arnal LH et al (2015) Human screams occupy a privileged niche in the communication soundscape. Curr Biol 25:2051–2056. https://doi.org/10.1016/j.cub.2015.06.043
10. Arrabito GR, Mondor TA, Kent KJ (2004) Judging the urgency of non-verbal auditory alarms: a case study. Ergonomics 47:821–840. https://doi.org/10.1080/0014013042000193282

11. Atal BS (2006) The history of linear prediction. IEEE Signal Process Mag 23:154–161. https:// doi.org/10.1109/MSP.2006.1598091 12. Atal BS, Hanauer SL (1971) Speech analysis and synthesis by linear prediction of the speech wave. J Acoust Soc Am 50:637–655. https://doi.org/10.1121/1.1912679 13. Aucouturier JJ, Bigand E (2013) Seven problems that keep MIR from attracting the interest of cognition and neuroscience. J Intell Inf Syst 41:483–497. https://doi.org/10.1007/s10844013-0251-x 14. Aures W (1985) Berechnungsverfahren für den sensorischen Wohlklang beliebiger Schallsignale. Acustica 59:130–141 15. Aures W (1985) Der sensorische Wohlklang als Funktion psychoakustischer Empfindungsgrössen. Acustica 58:282–290 16. Aures W (1985) Ein berechnungsverfahren der Rauhigkeit. Acustica 58:268–281 17. Baldan S, Delle Monache S, Rocchesso D (2017) The sound design toolkit. SoftwareX 6:255– 260. https://doi.org/10.1016/j.softx.2017.06.003 18. Barsties V, Latoszek B et al (2017) The acoustic breathiness index (ABI): a multivariate acoustic model for breathiness. J Voice 31:511.e1-511.e27. https://doi.org/10.1016/j.jvoice. 2016.11.017 19. Barsties V, Latoszek B et al (2017) The exploration of an objective model for roughness with several acoustic markers. J Voice 32:140–161. https://doi.org/10.1016/j.jvoice.2017.04.017 20. Barthet M, Kronland-Martinet R, Ystad S (2008) Improving musical expressiveness by timevarying brightness shaping. In: Kronland-Martinet R, Ystad S, Jensen K (eds) Computer music modeling and retrieval: sense of sounds. Springer, Berlin, pp 313–336. https://doi.org/ 10.1007/978-3-540-85035-9_22 21. Barthet M et al (2011) Analysis-by-synthesis of timbre, timing, and dynamics in expressive clarinet performance. Music Percept: Interdiscip J 28:265–278. https://doi.org/10.1525/mp. 2011.28.3.265 22. Beil RG (1962) Frequency analysis of vowels produced in a helium-rich atmosphere. J Acoust Soc Am 34:347–349. https://doi.org/10.1121/1.1928124 23. Belin P, Zatorre RJ (2015) Neurobiology: sounding the alarm. Curr Biol 25:R805–R806. https://doi.org/10.1016/j.cub.2015.07.027 24. Bell CG et al (1961) Reduction of speech spectra by analysis-by-synthesis techniques. J Acoust Soc Am 33:1725–1736 25. Berger KW (1964) Some factors in the recognition of timbre. J Acoust Soc Am 36:1888–1891. https://doi.org/10.1121/1.1919287 26. Best CT, Morrongiello B, Robson R (1981) Perceptual equivalence of acoustic cues in speech and nonspeech perception. Percept Psychophys 29:191–211. https://doi.org/10.3758/ BF03207286 27. Bigand E et al (2011) Categorization of extremely brief auditory stimuli: domain-specific or domain-general processes? PLoS ONE 6:e27024. https://doi.org/10.1371/journal.pone. 0027024. 6 p 28. Bloothooft G, Plomp R (1988) The timbre of sung vowels. J Acoust Soc Am 84:847–860. https://doi.org/10.1121/1.396654 29. Bones O, Cox TJ, Davies WJ (2018) Sound categories: category formation and evidencebased taxonomies. Front Psychol 9. https://doi.org/10.3389/fpsyg.2018.01277. Article 1277, 17 p 30. Brandt JF, Ruder KF, Shipp T Jr (1969) Vocal loudness and effort in continuous speech. J Acoust Soc Am 46:1543–1548. https://doi.org/10.1121/1.1911899 31. Bregman AS (1990) Auditory scene analysis: the perceptual organization of sound. MIT Press, Cambridge, MA 32. Brumm H, Slabbekoorn H (2005) Acoustic communication in noise. Adv Study Behav 35:151–209. https://doi.org/10.1016/S0065-3454(05)35004-2 33. 
Brumm H, Zollinger SA (2011) The evolution of the Lombard effect: 100 years of psychoacoustic research. Behaviour 148:1173–1198. https://doi.org/10.1163/000579511X605759

34. Buder EH (2000) Acoustic analysis of voice quality: a tabulation of algorithms 1902–1990. In: Kent RD, Ball MJ (eds) Voice quality measurement, Chap 9. Singular Publishing, San Diego, CA, pp 119–244 35. Burgoyne JA, McAdams S (2008) A meta-analysis of timbre perception using nonlinear extensions to CLASCAL. In: Kronland-Martinet R, Ystad S, Jensen K (eds) Computer music modeling and retrieval: sense of sounds. Springer, Berlin, pp 181–202. https://doi.org/10. 1007/978-3-540-85035-9_12 36. Caclin A et al (2005) Acoustic correlates of timbre space dimensions: a confirmatory study using synthetic tones. J Acoust Soc Am 118:471–482. https://doi.org/10.1121/1.1929229 37. Camponogara I et al (2017) Expert players accurately detect an opponent’s movement intentions through sound alone. J Exp Psychol Hum Percept Perform 43:348–359. https://doi.org/ 10.1037/xhp0000316 38. Carcagno S et al (2018) Effect of back wood choice on the perceived quality of steel-string acoustic guitars. J Acoust Soc Am 144:3533–3547. https://doi.org/10.1121/1.5084735 39. Carron M et al (2017) Speaking about sounds: a tool for communication on sound features. J Des Res 15:85–109. https://doi.org/10.1504/JDR.2017.086749 40. Chambers C et al (2017) Prior context in audition informs binding and shapes simple features. Nat Commun 8:15027. https://doi.org/10.1038/ncomms15027. 11 p 41. Chowning JM (1973) The synthesis of complex audio spectra by means of frequency modulation. J Audio Eng Soc 21:526–534 http://www.aes.org/e-lib/browse.cfm?elib=1954 42. Chowning JM, Bristow D (1986) FM theory & applications: by musicians for musicians. Yamaha Music Foundation, Tokyo, Japan. http://www.dxsysex.com/images/FM-SynthesisTheory-Applicationsextract.pdf 43. Clark Jr M et al (1963) Preliminary experiments on the aural significance of parts of tones of orchestral instruments and on choral tones. J Audio Eng Soc 11:45–54. http://www.aes.org/ e-lib/browse.cfm?elib=821 44. Cooke M, Lu Y (2010) Spectral and temporal changes to speech produced in the presence of energetic and informational maskers. J Acoust Soc Am 128:2059–2069. https://doi.org/10. 1121/1.3478775 45. Daniel P, Weber R (1997) Psychoacoustical roughness: Implementation of an optimized model. Acustica 83:113–123 46. Dau T, Kollmeier B, Kohlrausch A (1997) Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. J Acoust Soc Am 102(5):2892– 2905. https://doi.org/10.1121/1.420344 47. Dau T, Kollmeier B, Kohlrausch A (1997) Modeling auditory processing of amplitude modulation. II. Spectral and temporal integration. J Acoust Soc Am 102(5):2906– 2919. https:// doi.org/10.1121/1.420345 48. De Cheveigné A, Kawahara H (1999) Missing-data model of vowel identification. J Acoust Soc Am 105(6):3497–3508. https://doi.org/10.1121/1.424675 49. De Krom G (1993) A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals. J Speech, Lang Hear Res 36(2):254–266. https://doi.org/10.1044/jshr.3602. 254 50. De Krom G (1995) Some spectral correlates of pathological breathy and rough voice quality for different types of vowel fragments. J Speech Lang Hear Res 38(4):794–811. https://doi. org/10.1044/jshr.3804.794 51. Deme A (2017) The identification of high-pitched sung vowels in sense and nonsense words by professional singers and untrained listeners. J Voice 31(2):252.e1–252.e14. https://doi.org/ 10.1016/j.jvoice.2016.07.008 52. Donnadieu S (2007) Mental representation of the timbre of complex sounds. 
In: Beauchamps JW (ed) Analysis, synthesis, and perception of musical sounds: the sound of music, Chap 8. Springer Science+Business Media Inc., New York, NY, pp 272–319. https://doi.org/10.1007/ 978-0-387-32576-7_8 53. Eddins DA, Kopf LM, Shrivastav R (2015) The psychophysics of roughness applied to dysphonic voice. J Acoust Soc Am 138(5):3820–3825. https://doi.org/10.1121/1.4937753

54. Eddins DA et al (2016) Modeling of breathy voice quality using pitch-strength estimates. J Voice 30(6):774.e1–774.e7. https://doi.org/10.1016/j.jvoice.2015.11.016 55. Edworthy J, Loxley SL, Dennis ID (1991) Improving auditory warning design: relationship between warning sound parameters and perceived urgency. Hum Factors 33(2):205–231. https://doi.org/10.1177/001872089103300206 56. Eimas PD (1963) The relation between identification and discrimination along speech and nonspeech continua. Lang Speech 6(4):206–217. https://doi.org/10.1177/002383096300600403 57. Eitan Z et al (2014) Lower pitch is larger, yet falling pitches shrink: interaction of pitch change and size change in speeded discrimination. Exp Psychol 61(4):273–284. https://doi.org/10. 1027/1618-3169/a000246 58. Elhilali M (2019) Modulation representations for speech and music. In: Siedenburg K et al (ed) Timbre: acoustics, perception, and cognition, Chap 12. Springer International Publishing, Cham, Switzerland, pp 335–359. https://doi.org/10.1007/978-3-030-14832-4_12 59. Elliott CA (1975) Attacks and releases as factors in instrument identification. J Res Music Educ 23(1):35–40 (1975). https://doi.org/10.2307/3345201 60. Elliott TM, Hamilton LS, Theunissen FE (2013) Acoustic structure of the five perceptual dimensions of timbre in orchestral instrument tones. J Acoust Soc Am 133(1):389–404. https:// doi.org/10.1121/1.4770244 61. Eriksson A, Traunmüller H (2002) Perception of vocal effort and distance from the speaker on the basis of vowel utterances. Percept Psychophys 64(1):131–139. https://doi.org/10.3758/ BF03194562 62. Ernst MO, Bülthoff HH (2004) Merging the senses into a robust percepty. Trends Cognit Sci 8(4):162–169. https://doi.org/10.1016/j.tics.2004.02.002 63. Fabiani M, Friberg A (2011) Influence of pitch, loudness, and timbre on the perception of instrument dynamics. J Acoust Soc Am 130(4):EL193–EL199. https://doi.org/10.1121/1. 3633687 64. Farnell A (2010) Designing sound. The MIT Press, Cambridge, MA 65. Fastl H, Zwicker E (2007) Roughness. Psychoacoustics: facts and models, 3rd edn. Springer GmbH, Berlin, Heidelberg, pp 257–264 66. Fastl H, Zwicker E (2007) Sharpness and sensory pleasantness. Psychoacoustics: facts and models, 3rd edn. Springer GmbH, Berlin, Heidelberg, pp 239–246 67. Feng L, Oxenham AJ (2015) New perspectives on the measurement and time course of auditory enhancement. J Exp Psychol: Hum Percept Perform 41(6):1696– 1708. https://doi.org/10. 1037/xhp0000115 68. Feng L, Oxenham AJ (2018) Spectral contrast effects produced by competing speech contexts. J Exp Psychol: Hum Percept Perform 44(9):1447–1457. https://doi.org/10.1037/xhp0000546 69. Ferrer CA et al (2005) Correcting the use of ensemble averages in the calculation of harmonics to noise ratios in voice signals. J Acoust Soc Am 118(2):605–607. https://doi.org/10.1121/1. 1940450 70. Formisano E et al (2008) Who’ is saying ‘what’? Brain-based decoding of human voice and speech. Science 322(5903):970–973. https://doi.org/10.1126/science.1164318 71. Francis AL et al (2006) Extrinsic context affects perceptual normalization of lexical tone. J Acoust Soc Am 119(3):1712–1726. https://doi.org/10.1121/1.2149768 72. Frazier JM, Assgari AA, Stilp CE (2019) Musical instrument categorization is highly sensitive to spectral properties of earlier sounds. Attent Percept Psychophys 81(4):1119–1126. https:// doi.org/10.3758/s13414-019-01675-x 73. Fritz C, Dubois D (2015) Perceptual evaluation of musical instruments: state of the art and methodology. 
Acta Acustica united with Acustica 101(2):369–38. https://doi.org/10.3813/ AAA.918833 74. Fritz C et al (2012) Player preferences among new and old violins. Proc Natl Acad Sci 109(3):760–763. https://doi.org/10.1073/pnas.1114999109 75. Fritz C et al (2015) Soloist evaluations of six old Italian and six new violins. Proc Natl Acad Sci 1111(20):7224–7229. https://doi.org/10.1073/pnas.1323367111

76. Garnier M, Ménard L, Alexandre B (2018) Hyper-articulation in Lombard speech: an active communicative strategy to enhance visible speech cues? J Acoust Soc Am 144(2):1059–1074. https://doi.org/10.1121/1.5051321 77. Gaver WW (1993) How do we hear in the world? Explorations in ecological acoustics. Ecol Psychol 5(4):285–313. https://doi.org/10.1207/s15326969eco0504_2 78. Gaver WW (1993) What in the world do we hear? An ecological approach to auditory source perception. Ecol Psychol 5(1):1–29. https://doi.org/10.1207/s15326969eco0501_1 79. George WH (1954) A sound reversal technique applied to the study of tone quality. Acustica 4(1):224–225 80. Gervain J, Geffen MN (2019) Efficient neural coding in auditory and speech perception. Trends Neurosci 42(1):56–65. https://doi.org/10.1016/j.tins.2018.09.004 81. Gibson JJ (1979) The ecological approach to visual perception. Houghton Mifflin, Boston, MA 82. Gibson JJ (1966) The senses considered as perceptual systems. Houghton Mifflin, MA 83. Giordano BL (2003) Everyday listening: an annotated bibliography. The sounding object, Chap 1. Editioni di Mondo Estremo, pp 1–16. http://www.soundobject.org 84. Giordano BL, McAdams S (2010) Sound source mechanics and musical timbre perception: evidence from previous studies. Music Percept: Interdiscip J 28(2):155–168. https://doi.org/ 10.1525/mp.2010.28.2.155 85. Giordano BL, Rocchesso D, McAdams S (2010) Integration of acoustical information in the perception of impacted sound sources: the role of information accuracy and exploitability. J Exp Psychol Hum Percept Perform 36(2):462–476. https://doi.org/10.1037/a0018388 86. Goldstein R et al (2008) Do more expensive wines taste better? Evidence from a large sample of blind tastings. J Wine Econ 3(1):1–9. https://doi.org/10.1017/S1931436100000523 87. Gordon JW, Grey JM (1978) Perception of spectral modifications on orchestral instrument tones. Comput Music J 2(1):24–31. https://doi.org/10.2307/3680135 88. Granström B, Nord L (1992) Neglected dimensions in speech synthesis. Speech Commun 11(4):459–462. https://doi.org/10.1016/0167-6393(92)90051-8 89. Gray GW (1942) Phonemic microtomy, The minimum duration of perceptible speech sounds. Commun Monogr 9(1):75–90. https://doi.org/10.1080/03637754209390064 90. Grey JM (1977) Multidimensional perceptual scaling of musical timbres. J Acoust Soc Am 61(5):1270–1277. https://doi.org/10.1121/1.381428 91. Grey JM, Moorer JA (1977) Perceptual evaluations of synthesized musical instrument tones. J Acoust Soc Am 62(2):454–462. https://doi.org/10.1121/1.381508 92. Guastavino C (2018) Everyday sound categorization. In: Virtanen T, Plumbley MD, Ellis D (ed) Computational analysis of sound scenes and events, Chap 7. Springer International Publishing, Cham, Switzerland, pp 183–213. https://doi.org/10.1007/978-3-319-63450-0_7 93. Gygi B, Kidd GR, Watson CS (2007) Similarity and categorization of environmental sounds. Percept Psychophys 69(6):839–855. https://doi.org/10.3758/BF03193921 94. Hajda JM (2007) The effect of dynamic acoustical features on musical timbre. In: Beauchamps J (ed) Analysis, Synthesis, and perception of musical sounds: the sound of music, Chap 7. Springer Science+Business Media Inc., New York, NY, pp 250–271. https://doi.org/10.1007/ 978-0-387-32576-7_7 95. Handel S, Erickson ML (2004) Sound source identification: the possible role of timbre transformations. Music Percept: Interdiscip J 21(4):587–610. https://doi.org/10.1525/mp.2004.21. 4.587 96. 
Hansen H, Verhey JL, Weber R (2011) The magnitude of tonal content: A review. Acta Acust Acust 97(3):355–363. https://doi.org/10.3813/AAA.918416 97. Hansen JH, Hasan T (2015) Speaker recognition by machines and humans: a tutorial review. IEEE Signal Process Mag 32(6):74–99. https://doi.org/10.1109/MSP.2015.2462851 98. Hellwag CF (1967) Dissertatio Inauguralis Physiologico-medica de Formatione Loquelae [Inaugural PhysiologicalMedical Dissertation of Speech Formation]. Translation into Dutch by G. L. Meinsma, and Hendrik Mol, edition by Instituut voor Fonetische Wetenschappen [van de] Universiteit van Amsterdam. Tübingen, 1781, pp 1–38

99. Helmholtz HLF (1895) On the sensations of tone as a physiological basis for the theory of music. Trans. by Ellis AJ 2nd edn. Longmans, Green, and Co., London, UK, pp i–xix, 1–576. https://archive.org/stream/onsensationsofto00helmrich/onsensationsofto00helmrich 100. Hermes DJ (1991) Synthesis of breathy vowels: some research methods. Speech Commun 109(5–6):497–502. https://doi.org/10.1016/0167-6393(91)90053-V 101. Hillenbrand JM (2011) Acoustic analysis of voice: a tutorial. SIG 5 Perspect Speech Sci Orofacial Disorders 21(2):31–43. https://doi.org/10.1044/ssod21.2.31 102. Hoeldrich R, Pflueger M (1999) A generalized psychoacoustical model of modulation parameters (roughness) for objective vehicle noise quality evaluation. In: Proceedings of the 1999 SAE noise & vibration conference & exposition 17-20 May 1998, Traverse City, MI. Society of Automotive Engineers Inc, Warrendale, PA, 4 p 103. Holt LL (2005) Temporally nonadjacent nonlinguistic sounds affect speech categorization. Psychol Sci 16(4):305–312. https://doi.org/10.1111/j.0956-7976.2005.01532.x 104. Houben MMJ (2002) The sound of rolling objects: perception of size and speed. Department of Industrial Engineering & Innovation Sciences. https://research.tue.nl/nl/publications/thesound-of-rollingobjects-perception-of-size-and-speed 105. Houix O et al (2012) A lexical analysis of environmental sound categories. J Exp Psychol Appl 18(1):52–80. https://doi.org/10.1037/a0026240 106. Houtgast T (1989) Frequency selectivity in amplitude-modulation detection. The J Acoust Soc Am 85(4):1676–1680. https://doi.org/10.1121/1.397956 107. Houtsma AJ, Rossing TD, Wagenaars WM (1987) Auditory Demonstrations. Eindhoven, The Netherlands: Institute for Perception Research (IPO), Northern Illinois University, Acoustical Society of America. https://research.tue.nl/nl/publications/auditory-demonstrations 108. Ilkowska M, Miskiewicz A (2006) Sharpness versus brightness: a comparison of magnitude estimates. Acta Acustica united with Acustica 92(5):812–819 109. Irino T, Patterson RD (2002) Segregating information about the size and shape of the vocal tract using a time-domain auditory model: the stabilised Wavelet-Mellin transform. Speech Commun 36(3–4):181–203. https://doi.org/10.1016/S0167-6393(00)00085-6 110. Irino T, Patterson RD (1996) Temporal asymmetry in the auditory system. J Acoust Soc Am 99(4):2316–2331. https://doi.org/10.1121/1.415419 111. Iverson P, Krumhansl CL (1993) Isolating the dynamic attributes of musical timbre. J Acoust Soc Am 94(5):2595–2603. https://doi.org/10.1121/1.407371 112. Ives DT, Smith DRR, Patterson RD (2005) Discrimination of speaker size from syllable phrases. J Acoust Soc Am 118(6):3816–3822. https://doi.org/10.1121/1.2118427 113. Jenkins JJ, Strange W, Edman TR (1983) Identification of vowels in ‘vowelless’ syllables. Percept Psychophys 34(5):441–450. https://doi.org/10.3758/BF03203059 114. Jepsen ML, Ewert SD, Dau T (2008) A computational model of human auditory signal processing and perception. J Acoust Soc Am124(1):422–438. https://doi.org/10.1121/1.2924135 115. Joris PX, Schreiner CE, Rees A (2004) Neural processing of amplitude-modulated sounds. Physiol Rev 84(2):541–577. https://doi.org/10.1152/physrev.00029.2003 116. Kemp S (1982) Roughness of frequency-modulated tones. Acta Acustica united with Acustica 50(2):126–133 117. Kempster GB et al (2009) Consensus auditory-perceptual evaluation of voice: development of a standardized clinical protocol. Am J Speech Lang Pathol 18(2):124–132. 
https://doi.org/ 10.1044/1058-0360(2008/08-0017) 118. Klein W, Plomp R, Pols L (1970) Vowel spectra, vowel spaces, and vowel identification. J Acoust Soc Am 48(4B):999–1009. https://doi.org/10.1121/1.1912239 119. Kohlrausch A, Hermes DJ, Duisters R (2005) Modeling roughness perception for sounds with ramped and damped temporal envelopes. In: Forum Acusticum, the 4th European Congress on Acoustics 29 August - 2 September 2005, Budapest, Hongary, pp 1719–1724. http://www. conforg.fr/acoustics2008/cdrom/data/fa2005-budapest/paper/574-0.pdf 120. Kreiman J, Gerratt BR (2012) Perceptual interaction of the harmonic source and noise in voice. J Acoust Soc Am 131(1):492–500. https://doi.org/10.1121/1.3665997

121. Kreiman J, Gerratt BR (1998) Validity of rating scale measures of voice quality. J Acoust Soc Am 104(3):1598–1616. https://doi.org/10.1121/1.424372 122. Kreiman J, Sidtis D (2011) Voices and listeners: Toward a model of voice perception. Acoust Today 7(4):17–15 (2011). https://acousticstoday.org/wp-content/uploads/2017/09/Article_ 1of4_from_ATCODK_7_4.pdf 123. Kreiman J et al (2014) Toward a unified theory of voice production and perception. Loquens 1(1):e009, 10 p. https://doi.org/10.3989/loquens.2014.009 124. Krimphoff J, McAdams S, Winsberg S (1994) Caractérisation du timbre des sons complexes. II. Analyses acoustiques et quantification psychophysique. Le Journal de Physique IV (C5 1994), pp 625–628. https://doi.org/10.1051/jp4:19945134. https://hal.archives-ouvertes.fr/ jpa-00252811 125. Ladefoged P, Broadbent DE (1957) Information conveyed by vowels. J Acoust Soc Am 29(1):98–104. https://doi.org/10.1121/1.1908694 126. Lakatos S (2000) A common perceptual space for harmonic and percussive timbres. Percept Psychophys 62(7):1426–1439. https://doi.org/10.3758/BF03212144 127. Lemaitre G, Grimault N, Suied C (2018) Acoustics and psychoacoustics of sound scenes and events. In: Virtanen T, Plumbley MD, Ellis D (eds) Computational analysis of sound scenes and events. Springer International Publishing AG, Cham, Switzerland, pp 41–67. https://doi. org/10.1007/978-3-319-63450-0_3 128. Lemaitre G, Rocchesso D (2014) On the effectiveness of vocal imitations and verbal descriptions of sounds. J Acoust Soc Am 135(2):862–873. https://doi.org/10.1121/1.4861245 129. Lemaitre G et al (2011) Vocal imitations and the identification of sound events. Ecol Psychol 23(4):267–307. https://doi.org/10.1080/10407413.2011.617225 130. Lemaitre G et al (2018) Who’s that knocking at my door? Neural bases of sound source identification. Cereb Cortex 28(3):805–818. https://doi.org/10.1093/cercor/bhw397 131. Leman M (2000) Visualization and calculation of the roughness of acoustical musical signals using the synchronization index model (SIM). In: Proceedings of the COST G-6 conference on digital audio effects (DAFX-00) (Verona, Italy), 6 p 132. Lemanska J, Sek AP, Skrodzka EB (2002) Discrimination of the amplitude modulation rate. Arch Acoust 27(1):3–21 133. Levitin DJ (2014) Expert violinists can’t tell old from new. Proc Natl Acad Sci 111(20):7168– 7169. https://doi.org/10.1073/pnas.1405851111 134. Li X, Logan RJ, Pastore RE (1988) Perception of acoustic source characteristics: walking sounds. J Acoust Soc Am 90:3036–3049. https://doi.org/10.1121/1.401778 135. Lichte WH (1941) Attributes of complex tones. J Exp Psychol 28(6):455–480. https://doi. org/10.1037/h0053526 136. Licklider J, Hawley ME, Walkling RA (1955) Influences of variations in speech intensity and other factors upon the speech spectrum. J Acoust Soc Am 27(1):207. https://doi.org/10.1121/ 1.1917901 137. Liénard J-S, Di Benedetto M-G (1999) Effect of vocal effort on spectral properties of vowels. J Acoust Soc Am 106(1):411–422. https://doi.org/10.1121/1.428140 138. Lombard E (1911) Le signe de l’élévation de la voix. Annales des Maladies de l’Oreille et du Larynx 37:101–119 139. Lu Y, Cooke M (2009) Speech production modifications produced in the presence of low-pass and highpass filtered noise. J Acoust Soc Am 126(3):1495–1499. https://doi.org/10.1121/1. 3179668 140. Luo J, Hage SR (2018) The Lombard effect: from acoustics to neural mechanisms. Trends Neurosci 41(12):938–949. https://doi.org/10.1016/j.tins.2018.07.011 141. 
Lutfi RA (2007) Human sound source identification. In: Yost WA, Popper AN (eds) Auditory Perception of sound sources, Chap 2. Springer Science+Business Media, New York, NY, pp 13–42. https://doi.org/10.1007/978-0-387-71305-2_2 142. MacLean DJ (1966) Analysis of speech in a helium-oxygen mixture under pressure. J Acoust Soc Am 40(3):625–627. https://doi.org/10.1121/1.1910128

143. Macpherson EA (1995) A review of auditory perceptual theories and the prospects for an ecological account. Madison, WI, pp i–ii, 1–49. http://citeseerx.ist.psu.edu/viewdoc/download? doi=10.1.1.199.909&rep=rep1&type=pdf 144. Marcell MM et al (2000) Confrontation naming of environmental sounds. J Clin Exp Neuropsychol 22(6):830–864. https://doi.org/10.1076/jcen.22.6.830.949 145. Marozeau J, De Cheveigné A (2007) The effect of fundamental frequency on the brightness dimension of timbre. J Acoust Soc Am 121(1):383–387. https://doi.org/10.1121/1.2384910 146. Marozeau J et al (2003) The dependency of timbre on fundamental frequency. J Acoust Soc Am 144(5):2946–2957. https://doi.org/10.1121/1.1618239 147. Marui A, Martens WL (2006) Predicting perceived sharpness of broadband noise from multiple moments of the specific loudness distribution. J Acoust Soc Am 119(2):EL7–EL13. https://doi.org/10.1121/1.2152294 148. Maryn Y et al (2009) Acoustic measurement of overall voice quality: a meta-analysis. J Acoust Soc Am 126(5):2619–2634. https://doi.org/10.1121/1.3224706 149. McAdams S (2013) Musical timbre perception. In: Deutsch D (ed) The psychology of music, Chap 2. Elsevier, Amsterdam, pp 35–67. https://doi.org/10.1016/B978-0-12-3814609.00002-X 150. McAdams S et al (1995) Perceptual scaling of synthesized musical timbres: common dimensions, specificities, and latent subject classes. Psychol Res 58(3):177–192. https://doi.org/10. 1007/BF00419633 151. McKenna VS, Stepp CE (2018) The relationship between acoustical and perceptual measures of vocal effort. J Acoust Soc Am 144(3):1643–1658. https://doi.org/10.1121/1.5055234 152. McKeown JD, Patterson RD (1995) The time course of auditory segregation: concurrent vowels that vary in duration. J Acoust Soc Am 98(4):1866–1877. https://doi.org/10.1121/1. 413373 153. Michaels CF, Carello C (1981) Direct perception. Prentice-Hall. Inc., Englewood Cliffs, NJ. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.138.1523&rep=rep1&type=pdf 154. Miller JR, Carterette EC (1975) Perceptual space for musical structures. J Acoust Soc Am 58(3):711–720. https://doi.org/10.1121/1.380719 155. Miskiewicz A (2004) Roughness of low-frequency pure tones. In: Proceedings of the PolishGerman OSA/DAGA meeting (Gdansk), 3 p 156. Miskiewicz A, Majer J (2014) Roughness of low-frequency pure tones and harmonic complex tones. In: 7th Forum Acusticum (Krakow), pp 1–4 157. Miskiewicz A, Rakowsky A, Rosciszewska T (2006) Perceived roughness of two simultaneous pure tones. Acta Acustica united with Acustica 92(2):331–336 158. Mlynarski W, McDermott JH (2018) Learning midlevel auditory codes from natural sound statistics. Neural Comput 30(3):631–669. https://doi.org/10.1162/neco_a_01048 159. Moore BC (2012) An introduction to the psychology of hearing, 6th edn. Emerald Group Publishing Limited, Bingley, UK 160. Moore BC, Tan C-T (2003) Perceived naturalness of spectrally distorted speech and music. J Acoust Soc Am 114(1):408–419. https://doi.org/10.1121/1.1577552 161. Nakamura T (1987) The communication of dynamics between musicians and listeners through musical performance. Percept Psychophys 41(6):525–533. https://doi.org/10.3758/ BF03210487 162. Neuhoff JG (2004) Ecological psychoacoustics. Elsevier Academic Press, San Diego, CA 163. Noll AM (1967) Cepstrum pitch determination. J Acoust Soc Am 41(2):293–309. https://doi. org/10.1121/1.1910339 164. 
Nordstrom KI, Tzanetakis G, Driessen PF (2008) Transforming perceived vocal effort and breathiness using adaptive pre-emphasis linear prediction. IEEE Trans Audio Speech Lang Process 16(6):1087–1096. https://doi.org/10.1109/TASL.2008.2001105 165. Ogg M, Slevc LR, Idsardi WJ (2017) The time course of sound category identification: insights from acoustic features. J Acoust Soc Am 142(6):3459–3473. https://doi.org/10.1121/ 1.5014057

166. Öhman SEG (1966) Coarticulation in VCV utterances Spectrographic measurements. J Acoust Soc Am 39(1):151–168. https://doi.org/10.1121/1.1909864 167. Osgood CE (1952) The nature and measurement of meaning. Psychol Bull 49(3):197–237. https://doi.org/10.1037/h0055737 168. Özcan E, Van Egmond R (2012) Basic semantics of product sounds. Int J Des 6(2):41–54. https://search.proquest.com/docview/1270361442?accountid=27128 169. Özcan E, Van Egmond R, Jacobs J (2014) Product sounds: basic concepts and categories. Int J Des 8(3):97–111. https://search.proquest.com/docview/1646398348?accountid=27128 170. Patil K et al (2012) Music in our ears: the biological bases of musical timbre perception. PLoS Comput Biol 8(11):e1002759, 16 p. https://doi.org/10.1371/journal.pcbi.1002759 171. Patterson RD (1994) The sound of a sinusoid: Spectral models. J Acoust Soc Am 96(3):1409– 1418. https://doi.org/10.1121/1.410285 172. Patterson RD (1994) The sound of a sinusoid: time-interval models. J Acoust Soc Am 96(3):1419–1428. https://doi.org/10.1121/1.410286 173. Patterson RD, Gaudrain E, Walters TC (2010) The perception of family and register in musical tones. In: Jones MR, Fay R, Popper AN (eds) Music perception, Chap 2. Springer Science+Business Media, New York, NY, pp 13–50. https://doi.org/10.1007/978-1-4419-61143_2 174. Patterson RD, Irino T (2014) Size matters in hearing: How the auditory system normalizes the sounds of speech and music for source size. In: Popper AN, Fay RR (eds) Perspectives on auditory research, Chap 23. Springer Science+Business Media, New York, NY, pp 417–440. https://doi.org/10.1007/978-1-4614-9102-6_23 175. Patterson RD et al (2008) Size information in the production and perception of communication sounds. In: Yost WA, Popper AN, Fay RR (eds) Auditory perception of sound sources, Chap 3. Springer Science+Business Media, New York, NY, pp 43–75. https://doi.org/10.1007/9780-387-71305-2_3 176. Pedersen TH (2008) The semantic space of sound: lexicon of sound-describing words – Version 1. 99 p. https://www.researchgate.net/profile/Torben-Holm-Pedersen/publication/ 263964081_The_Semantic_Space_of_Sounds/links/53db8ab40cf2cfac9928ee98/TheSemantic-Space-of-Sounds.pdf 177. Peeters G et al (2011) The timbre toolbox: extracting audio descriptors form musical signals. J Acoust Soc Am 130(5):2902–2916. https://doi.org/10.1121/1.3642604 178. Peterson GE, Barney HL (1952) Control methods used in a study of the vowels. J Acoust Soc Am 24(2):175–184. https://doi.org/10.1121/1.1906875 179. Piazza EA et al (2018) Rapid adaptation to the timbre of natural sounds. Sci Rep 8:13826, 9p. https://doi.org/10.1038/s41598-018-32018-9 180. Pickett JM (1956) Effects of vocal force on the intelligibility of speech sounds. J Acoust Soc Am 28(5):902–905. https://doi.org/10.1121/1.1908510 181. Pietrowicz M, Hasegawa-Johnson M, Karahalios KG (2017) Acoustic correlates for perceived effort levels in male and female acted voices. J Acoust Soc Am 142(2):792– 811. https://doi. org/10.1121/1.4997189 182. Plazak J, McAdams S (2017) Perceiving changes of sound-source size within musical tone pairs. Psychomusicol: Music, Mind, Brain 27(1):1–13. https://doi.org/10.1037/pmu0000172 183. Plomp R, Levelt W (1965) Tonal consonance and critical bandwidth. J Acoust Soc Am 38(4):548–560. https://doi.org/10.1121/1.1909741 184. Plomp R, Pols L, Van de Geer JP (1967) Dimensional analysis of vowel spectra. J Acoust Soc Am 41(3):707–712. https://doi.org/10.1121/1.1910398 185. 
Plomp R (1976) Aspects of tone sensation: a psychophysical study. Academic, London, UK 186. Plomp R (2002) The intelligent ear: on the nature of sound perception. Lawrence Erlbaum Associates, Publishers, Mahwah, NJ 187. Plomp R (1970) Timbre as a multidimensional attribute of complex tones. In: Plomp R, Smoorenburg G (eds) Frequency analysis and periodicity detection in hearing. Seithoff, Leiden, pp 397–414

188. Pols L, Tromp H, Plomp R (1973) Frequency analysis of Dutch vowels from 50 male speakers. J Acoust Soc Am 53(4):1093–1101. https://doi.org/10.1121/1.1913429 189. Pols L, Van der Kamp LJ, Plomp R (1969) Perceptual and physical space of vowel sounds. J Acoust Soc Am 46(2B):458–467. https://doi.org/10.1121/1.1911711 190. Potter RK (1945) Visible patterns of sound. Science 102(2654):463–470 191. Potter RK, Peterson GE (1948) The representation of vowels and their movements. J Acoust Soc Am 20(4):528–535. https://doi.org/10.1121/1.1906406 192. Potter RK, Kopp GA, Kopp HG (1948) Visible speech. D. Van Nostrand Co., New York, NY 193. Pratt RL, Doak PE (1976) A subjective rating scale for timbre. J Sound Vib 45(3):317–328. https://doi.org/10.1016/0022-460X(76)90391-6 194. Pressnitzer D, Agus TR, Suied C (2015) Acoustic timbre recognition. In: Jaeger D, Jung R (eds) Encyclopedia of computational neuroscience. Springer Science+Business Media Inc, New York, NY, pp 128–133 195. Pressnitzer D, McAdams S (1999) An effect of the coherence between envelopes across frequency regions on the perception of roughness. In: Dau T, Hohmann V, Kollmeier B (eds) Psychophysics, physiology and models of hearing. World Scientific, Singapore, pp 105–108 196. Pressnitzer D, McAdams S (1999) Two phase effects in roughness perception. J Acoust Soc Am 105(5):2773–2782. https://doi.org/10.1121/1.426894 197. Rabiner LR, Schafer RW (1978) Digital processing of speech signals. Prentice Hall Inc, Englewood Cliffs, NJ 198. Raitio T et al (2013) Analysis and synthesis of shouted speech. In: Proceedings of interspeech 2013 25-29 August 2013, Lyon, France, pp 1544–1548. https://www.isca-speech.org/archive_ v0/archive_papers/interspeech_2013/i13_1544.pdf 199. Repp BH (1984) Categorical perception: Issues, methods, findings. In: Lass NJ (ed) Speech and language: advances in basic research and practice. Academic Inc, Orlando, FL, pp 243– 335. https://doi.org/10.1016/B978-0-12-608610-2.50012-1 200. Reuter C, Siddiq S (2017) The colourful life of timbre spaces: timbre concepts from early ideas to metatimbre space and beyond. In: Wöllner C (ed) Body, Sound and space in music and beyond: multimodal explorations, Chap 9. Routledge, Oxfordshire, UK, pp 150–167 201. Richardson EG (1954) The transient tones of wind instruments. J Acoust Soc Am 26(6):960– 962. https://doi.org/10.1121/1.1907460 202. Risset J-C (1965) Computer study of trumpet tones. J Acoust Soc Am 38(5):912–912. https:// doi.org/10.1121/1.1939648 203. Risset J-C, Wessel DL (1999) Exploration of timbre by analysis and synthesis. In: Deutsch D (ed) The psychology of music, Chap 5, 2nd edn. Academic, New York, NY, pp 113–169. https://doi.org/10.1016/B978-012213564-4/50006-8 204. Robinson K, Patterson RD (1995) The duration required to identify the instrument, the octave, or the pitch chroma of a musical note. Music Percept: Interdiscip J 15(1):1–15. https://doi. org/10.2307/40285682 205. Robinson K, Patterson RD (1995) The stimulus duration required to identify vowels, their octave, and their pitch chroma. J Acoust Soc Am 98(4):1858–1865. https://doi.org/10.1121/ 1.414405 206. Rocchesso D, Fontana F (eds) (2003) The Sounding Object. Editioni di Mondo Estremo. http://www.soundobject.org 207. Rosch E (1978) Principles of categorization. In: Rosch E, Lloyd BB (eds) Cognition and categorization, Chap 2. Lawrence Erlbaum Associates, Mahwah, NJ, pp 27–48. https://doi. org/10.1016/B978-1-4832-1446-7.50028-5 208. Rostolland D (1982) Acoustic features of shouted voice. 
Acustica 50(2):118–125 209. Rostolland D (1985) Intelligibility of shouted voice. Acustica 57(3):103–121 210. Roy N et al (2013) Evidence-based clinical voice assessment: a systematic review. Am J Speech Lang Pathol 22(2):212–226. https://doi.org/10.1044/1058-0360(2012/12-0014) 211. Rozzi CA et al (2022) A listening experiment comparing the timbre of two Stradivari with other violins. J Acoust Soc Am 151(1):443–450. https://doi.org/10.1121/10.0009320

212. Saldanha EL, Corso JF (1964) Timbre cues and the identification of musical instruments. J Acoust Soc Am 36(11):2021–2026. https://doi.org/10.1121/1.1919317 213. Sankiewicz M, Budzynski G (2007) Reflections on sound timbre definitions. Arch Acoust 32(3):591–602 214. Schubert E, Wolfe J (2006) Does timbral brightness scale with frequency and spectral centroid. Acta Acustica united with Acustica 92(5):820–825 215. Sethares WA (2005) Tuning, timbre, spectrum, scale, 2nd edn. Springer, London, UK, pp i–xviii, 1–426. https://doi.org/10.1007/b138848 216. Sheikh A-S et al (2019) STRFs in primary auditory cortex emerge from masking-based statistics of natural sounds. PLoS Comput Biol 15(1):e1006595 23 p. https://doi.org/10.1371/ journal.pcbi.1006595 217. Shrivastav R, Camacho A (2010) A computational model to predict changes in breathiness resulting from variations in aspiration noise level. J Voice 24(4):395–405. https://doi.org/10. 1016/j.jvoice.2008.12.001 218. Shrivastav R, Sapienza CM (2003) Objective measures of breathy voice quality obtained using an auditory model. J Acoust Soc Am 114(4):2217–2224. https://doi.org/10.1121/1.1605414 219. Shrivastav R et al (2011) A model for the prediction of breathiness in vowels. J Acoust Soc Am 129(3):1605–1615. https://doi.org/10.1121/1.3543993 220. Siedenburg K Specifying the perceptual relevance of onset transients for musical instrument identification. J Acoust Soc Am 145(2):1078–1087. https://doi.org/10.1121/1.5091778 221. Siedenburg K, Doclo S (2017) Iterative structured shrinkage algorithms for stationary/transient audio separation. In: Proceedings of the 20th international conference on digital audio effects (DAFx-17) 5–9 September 2017, Edinburgh, UK, pp 283–290. http://dafx17. eca.ed.ac.uk/papers/DAFx17_paper_61.pdf 222. Siedenburg K, Fujinaga I, McAdams S (2016) A comparison of approaches to timbre descriptors in music information retrieval and music psychology. J New Music Res 45(1):27–41. https://doi.org/10.1080/09298215.2015.1132737 223. Siedenburg K, Jones-Mollerup K, McAdams S (2016) Acoustic and categorical dissimilarity of musical timbre: evidence from asymmetries between acoustic and chimeric sounds. Front Psychol 6, Article 1977, 17 p. https://doi.org/10.3389/fpsyg.2015.01977 224. Siedenburg K, McAdams S (2017) Four distinctions for the auditory ‘wastebasket’ of timbre. Front Psychol 8, Article 1747, 4 p. https://doi.org/10.3389/fpsyg.2017.01747 225. Siedenburg K, Schädler MR, Hülsmeier D (2019) Modeling the onset advantage in musical instrument recognition. J Acoust Soc Am 146(6):EL523-EL529. https://doi.org/10.1121/1. 5141369 226. Singh NC, Theunissen FE (2003) Modulation spectra of natural sounds and ethological theories of auditory processing. J Acoust Soc Am 114(6):3394–3411. https://doi.org/10.1121/1. 1624067 227. Sjerps MJ, Zhang C, Peng G (2018) Lexical tone is perceived relative to locally surrounding context, vowel quality to preceding context. J Exp Psychol Hum Percept Perform 44(6):914– 924. https://doi.org/10.1037/xhp0000504 228. Sjerps MJ et al (2019) Speaker-normalized sound representations in the human auditory cortex. Nat Commun 10(1):2465, 9 p. https://doi.org/10.1038/s41467-019-10365-z 229. Slawson AW (1968) Vowel quality and musical timbre as functions of spectrum envelope and fundamental frequency. J Acoust Soc Am 43(1):87–101. https://doi.org/10.1121/1.1910769 230. Smith DR et al (2005) The processing and perception of size information in speech sounds. J Acoust Soc Am 117(1):305–318. 
https://doi.org/10.1121/1.1828637 231. Sontacchi A (1998) Entwicklung eines Modulkonzeptes für die psychoakustische Gerüuschanalyse unter Matlab. Graz 232. Sontacchi A et al (2012) Predicted roughness perception for simulated vehicle interior noise. SAE Int J Eng 5(3):1524–1532. https://doi.org/10.4271/2012-01-1561 233. Stecker GC, Hafter ER (2000) An effect of temporal asymmetry on loudness. J Acoust Soc Am 107(6):358–3368. https://doi.org/10.1121/1.429407


Chapter 7

Loudness Perception

Loudness is defined as "that attribute of auditory sensation in terms of which sounds can be ordered on a scale extending from quiet to loud" [3], or "Loudness. That attribute of auditory sensation in terms of which sounds may be ordered on a scale extending from soft to loud" [2]. It is often said that loudness is the perceptual correlate of sound intensity or of sound level and, indeed, if the intensity of one or more components that are perceptually integrated into one auditory unit increases, the loudness of that auditory unit will also increase. If, however, components that are not integrated with the auditory unit attended to increase in intensity, those components may partly or completely mask the auditory unit attended to, so that its loudness is reduced or even vanishes. When, e.g., in a duet of a violin and a piano, the frequency components of the piano are increased in intensity, only the loudness of the piano will increase. If the increase is large enough, and the frequency components of the piano overlap with those of the violin, the loudness of the violin will decrease, possibly to the extent that it is hardly audible anymore. In this situation, there is no frequency channel in which the intensity of the incoming sound decreases. So, in spite of the absence of any decrease in intensity in any of the inputs of the auditory filters, the loudness of one of the perceived auditory units, in this case the violin, decreases.

A similar situation often arises when one listens to music in a car or a train. Imagine that one gets into a train and wants to listen to a music player. The music player is switched on and its volume is adjusted to a level that is neither too soft nor too loud. This is often referred to as a comfortable loudness level, hopefully not much higher than 80 dB SPL. When the train starts moving, the noise produced by the train will increase in intensity. Consequently, the intensity of the inputs to all filters of the auditory system will increase; in no auditory filter will the intensity of the input decrease. In spite of this, the loudness of the music may be reduced, even to a level at which it becomes difficult to listen to; one may miss certain passages, especially in the "softer" parts of the music. This, of course, is an example of masking. Due to this masking of the music by the train noise, the loudness of the music will be reduced, and the listener will increase the volume of the music player to a level at which the music is clearly audible again. Even more remarkable is what happens when


the train stops. In that case, the intensity of the train noise will decrease and, hence, the intensity of the input to all auditory filters will decrease. There is, therefore, no auditory filter the input of which increases in intensity. In spite of this, the loudness of the music will increase, sometimes even up to a level at which it is no longer comfortable to listen to; it has become too loud. Consequently, in many cases, the listener will reduce the volume of the music player again, down to a level about equal to the volume it had before the train started moving.

The conclusion is that, certainly when the listener perceives more than one sound, there is no simple relation between the intensity of the incoming frequency components of a sound and the loudness of the various auditory units as perceived by the listener. In the course of this chapter, a model will be presented that can explain a number of aspects of these phenomena quite reliably, to some extent even at a quantitative level. First, some more elementary situations will be discussed, situations in which only one auditory unit is perceived and in which an increase in intensity is in general associated with an increase in the loudness of that auditory unit.

7.1 Sound Pressure Level (SPL) and Sound Intensity Level (SIL)

The level of a sound is generally expressed on the logarithmic intensity scale of decibels (dB). The major reason for this is that, in simple situations and as a first approximation, equal increments in dB are perceived as equal increments in loudness. So, the fact that SPL is most often expressed on the logarithmic dB scale is a first example of a "concession" made by the "engineers", who put physics in a central position, towards the "psychologists", who put perception in a central position. Several other cases will be described where the sound level measurement is adjusted so that the result better reflects the magnitude of the perceptual attribute loudness. Later in this chapter, it will also be shown, however, that estimating the loudness of a sound even as simple as a two-tone combination requires knowledge about the auditory-filter bandwidth and the tonotopic frequency array.

First, the dB scale will be discussed as it is used to express the sound pressure level (SPL) of a sound. Sound pressure level is defined with respect to an officially defined reference sound pressure $p_{ref}$ of 20 µPa. When $p_{rms}$ is the root mean square of the sound pressure within a certain time window at that location, the sound pressure level (SPL) $L_p$ in dB is defined as:

$$L_p = 20 \log_{10} \left( \frac{p_{rms}}{p_{ref}} \right) \textrm{ dB} \qquad (7.1)$$

This means that sounds of 20 µPa have an SPL of $20 \log_{10} (p_{ref}/p_{ref}) = 20 \log_{10} 1 = 0$ dB. This SPL of 0 dB is about the lowest sound level human listeners can detect. In order to hear sounds of less than 0 or of only a few dB, one needs a very silent


environment, and complete silence is very rare. Due to wind, human activity, or traffic noise, there is almost always a baseline of background noise considerably higher than just a few dB. Even usually "quiet" environments such as libraries or churches have background-noise levels of 20 to 30 dB. So, 0 dB is about the lowest sound pressure level audible environmental sounds can have. Very loud sounds have sound pressure levels up to about 120 dB or even higher, which is very loud indeed and can cause serious and permanent hearing loss.

In situations in which the SPLs of sound sources are compared, there is no need for a reference level. For, if the two sounds have pressures $p_1$ and $p_2$, the difference in dB SPL is:

$$20 \log_{10} \left( \frac{p_1}{p_{ref}} \right) - 20 \log_{10} \left( \frac{p_2}{p_{ref}} \right) = 20 \log_{10} \left( \frac{p_1}{p_{ref}} \cdot \frac{p_{ref}}{p_2} \right) = 20 \log_{10} \left( \frac{p_1}{p_2} \right). \qquad (7.2)$$

So, the value of $p_{ref}$ vanishes from the expression for the difference in dB between two sounds.

Sound pressure level and sound intensity level (SIL) are very often used interchangeably. Acoustically, however, sound pressure and sound intensity are very different quantities. Sound pressure is a scalar quantity, whereas sound intensity is a vector indicating the power of the sound field exerted on a unit surface. So, in contrast with sound pressure, sound intensity depends on the direction from which a sound field hits a surface. The unit of sound pressure is newton per square meter, or pascal, and that of sound intensity is watt per square meter (W/m²). Sound intensity, too, is defined relative to a reference level. This is chosen in such a way that, in practical situations, SPL and SIL are identical. This leads to a reference level for sound intensity $I_0$ of $0.964 \cdot 10^{-12}$, most often simplified to $10^{-12}$ W/m². When $I$ is sound intensity, sound intensity level $L_I$ is defined as

$$L_I = 10 \log_{10} \left( I/I_0 \right) \textrm{ dB}. \qquad (7.3)$$

Sound intensity is proportional to the square of sound pressure and, hence,

$$L_I = 10 \log_{10} \left( \frac{I}{I_0} \right) = 10 \log_{10} \left( \frac{p^2}{p_{ref}^2} \right) = 10 \log_{10} \left( \frac{p}{p_{ref}} \right)^2 = 20 \log_{10} \left( \frac{p}{p_{ref}} \right) = L_p. \qquad (7.4)$$

The conclusion is that "in dB, SPL and SIL are the same". In most practical cases, and also in this book, I hope, this confusion does not lead to any problems, but one has to realize that sound pressure and sound intensity are different physical quantities.

As said, the main reason for expressing sound pressure in dB is that, as a first approximation, equal differences in dB SPL are perceived as equal differences in loudness. This certainly holds for levels from about 40 to 90 dB.

Fig. 7.1 Spectrograms of sequences of 25 noise bursts with a fixed level difference, indicated in the panels, between successive bursts. In all four presentations, the level is increased twelve times and decreased twelve times in random order. (Matlab) (demo)

In order to give an impression of the order of magnitude of one or two dB SPL, the demo of Fig. 7.1 presents four sequences of 25 noise bursts. In the first sequence,

the level difference between successive bursts is 0.5 dB, in the second it is 1 dB, in the third 3 dB, and in the fourth 6 dB. The bandwidth of the noise bursts is two octaves, between 880 and 3520 Hz. They last 200 ms and their inter-onset interval is 250 ms. In every sequence, the level increases and decreases are presented in random order. There are twelve increases and twelve decreases, which guarantees that the levels of the first and the last burst are the same.

In the first sequence of noise bursts, the level difference between successive bursts is 0.5 dB, corresponding to a factor $10^{0.5/20} = 1.059$ in pressure. This is about the smallest difference human listeners can usually detect in successive bursts of wide-band noise [48]. The listener will probably hear some level differences due to random successions of more than one increment or decrement. In the second sequence of noise bursts in Fig. 7.1, the level difference between successive bursts is 1 dB. A level difference of 1 dB corresponds to a factor $10^{1/20} = 1.122$, or about 12%, in pressure. It appears that two sounds that differ by 1 dB in level are still difficult to distinguish. But, as for the 0.5-dB increments and decrements, the listener will probably hear some clear changes in level. A level difference of –3 dB corresponds to a factor $10^{-3/20} = 0.708$, i.e., 70.8% in pressure expressed in pascal. Since intensity is proportional to $p^2$, a level difference of –3 dB corresponds to a factor $0.708^2 = 0.501$, or about 50%, in intensity. This factor is very often used in electrical engineering, e.g., in indicating the cut-off frequencies of filters. The bandwidth of a band-pass filter then is the distance in Hz between the lower and the higher cut-off frequency, and these


are the frequencies where the frequency response curve of the filter has decreased by 3 dB from its maximum. Finally, a level difference of –6 dB corresponds to a factor $10^{-6/20} = 0.501$, about 50%, in sound pressure, or 25% in intensity. A decrease of 6 dB is the decrease in sound pressure that occurs when, in free field, the distance from a spherical sound source is doubled. This only holds in free field, since otherwise reflections and reverberation of the sound add to the total pressure.
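To make these conversions concrete, the following lines give a minimal sketch, in the spirit of the book's Matlab demos but not taken from them, of the pressure and intensity ratios corresponding to the level differences used in the demo of Fig. 7.1; the variable names are mine.

```matlab
% Pressure and intensity ratios for the level differences of the demo of Fig. 7.1
dL = [0.5 1 3 6];                       % level differences in dB
pressureRatio  = 10.^(dL/20);           % factor in sound pressure
intensityRatio = 10.^(dL/10);           % factor in sound intensity
% A difference of 3 dB about doubles the intensity; 6 dB about doubles the pressure.
fprintf('%4.1f dB: pressure x %5.3f, intensity x %5.3f\n', ...
        [dL; pressureRatio; intensityRatio]);
```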

7.1.1 Measurement of Sound Pressure Level

The measurement of sound pressure level (SPL) is in general carried out in three stages. In the first stage, the sound signal is recorded with a high-quality microphone; in the second stage, it is filtered with a frequency characteristic that roughly represents human sensitivity to sound; and in the third stage, the root mean square of the filtered signal is calculated within a certain temporal window.

Fig. 7.2 Transfer functions of filters commonly used for measuring sound pressure level. The abscissa indicates frequency. The overall gain of the filters is such that they are 0 dB at 1000 Hz. Retrieved from https://commons.wikimedia.org/w/index.php?curid=207706 under (CC BY SA 3.0)

First, the filter will be described that is used during the recording of the sound signal. Four such filters are officially defined, the dBA, dBB, dBC, and dBD filters, which are used in different conditions. First, the most common filter, the dBA filter, will be described. The characteristic of this filter is derived from human frequency sensitivity. Its shape is more or less parabolic with a maximum at 3000 Hz, and at 20 Hz the filter characteristic is 50 dB lower than at 1000 Hz, where the gain is 0 dB. This filter is presented as the A-weighting curve in Fig. 7.2. It can be viewed as a first approximation of a combination of three mechanisms. The first two are filters that have already been discussed: the characteristic of the outer ear is presented in Fig. 2.2, that of the middle ear in Fig. 2.4. Combining these two filters, however, cannot explain the decrease in sensitivity


at frequencies lower than 500 Hz. Hence, there must be a third mechanism, and that mechanism is associated with the less efficient amplification of these lower-frequency components by the cochlear amplifier, as has been explained in Sect. 3.8. The cochlear amplifier dominates the response of the basilar membrane at low sound levels. At higher levels, the passive response of the basilar membrane becomes more important and the relative contribution of the amplification by the cochlear amplifier gets less weight, which results in a transfer characteristic that more closely resembles that of the combined transfer by the outer and the middle ear. This results in the B weighting, also shown in Fig. 7.2, advised for moderate sound levels, say around 70 dB, i.e., for normal speaking levels or levels of music in usual living-room conditions. At very high intensities, the stapedius reflex and the inhibitory function of the efferent olivocochlear bundle start to play a role, flattening the frequency response even more. This is represented by the C weighting curve of Fig. 7.2. The C weighting is defined for very high sound intensities, around 100 dB.

So, there are different weightings for low, moderate, and high sound levels. In spite of this, the A weighting is very often used, also for sound levels higher than some 40 dB. One should realize that, when the A weighting is used for levels much higher than about 40 or 50 dB, the contribution of lower-frequency components to the measurement is less than what these lower frequencies contribute to the percept of loudness. So, one sees that, first, sound pressure level is expressed in dB because, as a first approximation, the auditory system processes sound intensities on a logarithmic scale. Next, in estimating the contribution of a sound to human hearing, one has to take the transfer by the outer ear, the middle ear, and the operation of the cochlear amplifier into account, resulting in the A weighting. At higher intensities, the contribution of the cochlear amplifier diminishes, which results in the B weighting. For very high intensities, the attenuation by the activity of the stapedius reflex and the olivocochlear bundle is considered, resulting in the C weighting.

Moreover, Fig. 7.2 also shows a pink line, indicated with D. This line shows that a higher weight is given to frequencies higher than 10 kHz. This is done because sound components with such frequencies contribute more to the experienced annoyance than lower frequencies. This applies in particular to sound produced by airplanes, traffic, etc. The D weighting, therefore, gives a boost to frequency components in this frequency range, so that the measurement better represents the annoyance caused by the sound. Note that this approach to perceived annoyance differs completely from the perceptually motivated measures for sensory annoyance described in Sect. 6.7.1.

Finally, the temporal window will be discussed within which sound pressure level is measured. In the standard for measuring sound level, three such time windows have been defined: slow, fast, and impulse. They consist of leaky integrators, i.e., filters with a decaying exponential as impulse response. The time constant of this decaying exponential determines whether the time window is slow, fast, or impulse. This time constant is 1000 ms for slow, 125 ms for fast, and 35 ms for impulse. The three impulse responses are plotted as inserts in Fig. 7.3.
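As an aside, the A-weighting curve of Fig. 7.2 can be approximated by a standard closed-form expression; the sketch below uses the formula given in IEC 61672-1, which is not taken from this book, and reproduces the roughly 50-dB attenuation at 20 Hz mentioned above.

```matlab
% Closed-form approximation of the A-weighting curve (formula from IEC 61672-1,
% included here only as an illustration; it is not part of the book's scripts).
f  = logspace(log10(20), log10(20000), 500);   % frequency axis in Hz
f2 = f.^2;
RA = (12194^2 .* f2.^2) ./ ...
     ((f2 + 20.6^2) .* sqrt((f2 + 107.7^2) .* (f2 + 737.9^2)) .* (f2 + 12194^2));
A  = 20*log10(RA) + 2.00;                      % gain in dB, about 0 dB at 1 kHz

semilogx(f, A); grid on
xlabel('frequency (Hz)'); ylabel('gain (dB)'); title('A weighting (approximation)');
```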


Fig. 7.3 Sound pressure level of a speech signal measured with the three standard time constants τ : slow, fast, and impulse. The bottom panel shows the sound signal, the second panel from below, the squared signal, and the three highest panels show the course of the measured sound pressure level (SPL) for these three different time constants. The impulse responses of the three time windows, with their time constants, are shown as an insert. (Matlab) (demo)

In order to measure the sound pressure level at a certain moment, the root mean square of the dB(A)-, dB(B)-, dB(C)-, or dB(D)-filtered sound within one of these three time windows is then calculated. Examples are shown for the three standard windows in the upper three panels of Fig. 7.3. The longer the time constant of the leaky integrator, the more the measured intensity is spread out over time. The course of the measured SPL values, therefore, looks smoother for longer time constants. The other side of this coin is that, for the shorter time constants, the measured extrema of the SPL lie further apart, and the maxima are highest for the smallest time constant, that of impulse. In fact, this very short time constant was included in order to get a more reliable estimate of the sound pressure level of sounds with very abrupt onsets, such as shot noise and percussion sounds. If, at the onset of a sound, the sound intensity rises very rapidly to intensities higher than 80–120 dB, the stapedius reflex and the suppressive contribution of the efferent olivocochlear bundle have not yet had the time to become effective, which is why these very intense sounds are so harmful. When one measures sound pressure with


this impulse time window, the higher levels at the onsets of sounds are indeed better represented. It appears, however, that this is not enough to get a reliable measure of the damage done by percussive sounds such as shot noise. Even if one uses the result of measurements with this short time window, the damage inflicted is often seriously underestimated. This is why the impulse time constant is not used very much anymore. In Fig. 7.3 one sees that, for the setting of impulse, the syllabic structure of the speech utterance becomes clear. For the setting of fast, this is already less so. For the setting of slow, the sound pressure level is integrated over a longer time span, so that the measurement better represents the overall sound level of the sentence. Recommendations for measuring the sound pressure level of speech are presented by Švec and Granqvist [85]. This problem will be discussed thoroughly in Sect. 7.7.2.
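The third stage of the measurement, the running root mean square within an exponential time window, can be sketched as follows; this is my own minimal illustration, not one of the book's demo scripts, and it assumes that the weighting filter of the second stage has already been applied to the signal x.

```matlab
% Sketch of SPL measurement with a leaky integrator (exponential time window).
% x is assumed to be an already weighted sound pressure signal in Pa, fs in Hz.
fs   = 44100;
x    = 0.02 * randn(1, 2*fs);            % stand-in signal; replace with a recording
pref = 20e-6;                            % reference pressure of 20 micropascal

tau   = [1.000 0.125 0.035];             % time constants: slow, fast, impulse (s)
names = {'slow', 'fast', 'impulse'};
for k = 1:3
    a   = exp(-1/(tau(k)*fs));           % per-sample decay of the leaky integrator
    msq = filter(1-a, [1 -a], x.^2);     % running mean square of the signal
    SPL = 10*log10(msq / pref^2);        % equals 20*log10(rms/pref)
    fprintf('%-7s: maximum SPL = %5.1f dB\n', names{k}, max(SPL));
end
```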

7.1.2 "Loudness" Normalization

It has been shown that expressing sound level in decibels and applying A-, B-, or C-weighting to the measurement results in a better correspondence of the measurement to the subjective experience of loudness by the listener. The use of A-, B-, or C-weighting adapts the measurement to the frequency sensitivity of human hearing at different sound levels; using decibels aims to ensure that equal differences in the measured levels correspond to equal differences in loudness. These ways of measuring sound pressure level are widely used in order to obtain objective measures that can be used to define what levels are or are not acceptable. This has resulted in a large number of rules, regulations, and guidelines for the highest noise level permitted in all kinds of different industrial and environmental settings.

There is another application domain where the aim is not so much to define acceptable sound levels. In this application domain, one wants the "loudness" of the sound not to vary too much over time. These situations arise when one, e.g., listens to the radio, watches television, listens to a music player, or watches movies in the cinema. In these situations, a number of recordings, often made under very different circumstances, are played one after the other. Likewise, music players play large numbers of someone's favourite songs recorded under many diverse conditions. In such cases, it would be annoying if the listener had to adjust the player's level every time a new track starts. Watching a documentary or a movie, the average experienced sound level should not fluctuate so much from episode to episode that it prompts the viewer to adjust the volume with each new episode. For these reasons, sound engineers have developed procedures aimed at equalizing the average experienced loudness level from recording to recording. The process in which the sound level is adjusted from episode to episode or from song to song is called loudness normalization. So, loudness normalization is not carried out for one tone or one syllable, but for relatively long stretches of sound, so for one complete song or one movie scene.

In older equipment, loudness normalization was carried out by adjusting the peak level of the sound signals, where the peak


level was based on the sound pressure levels within about 10-ms intervals. This was based on the idea that a recording is most annoying or disturbing when it reaches its peak levels. These peak levels, therefore, should be limited. The problem is that some recordings have a small dynamic range, i.e., their level is more or less stable, resulting in small differences between the levels at the peaks and the dips. Consequently, the difference between the peak level and the average level of the recording will also be small. Other recordings have a large dynamic range, so that the level at the peaks is much higher than the average level and the softest level of the recording. The result is that, after peak-based normalization, recordings in which the peak level is close to the average level sound louder on average than recordings in which the peak level is much higher than the average level. This was intentionally exploited by sound engineers who applied dynamic-range compression to make certain sound tracks, e.g., commercials, sound louder than the preceding and the following tracks. Dynamic-range compression attenuates the more intense parts of the recording and amplifies the less intense parts, thus reducing the dynamic range of the recording. Pop songs, e.g., heavy metal songs, were also notorious in this respect. This has resulted in what is called the loudness war [7, 87]. Haghbayan, Coomes, and Curran [34] showed that the average level of pop songs increased by more than 7 dB between 1950 and 2010. Hove, Vuust, and Stupacher [38] showed that this increase was largely due to an increase in intensity of the lower frequencies, making the music not only louder and more compressed, but also more "bass-heavy" (p. 2249).

In order to overcome this problem, new standards have been defined, not based on the peak levels of a recording but on what is supposed to be a better measure for the experienced sound level of the recording as a whole, its "loudness". This has resulted in the recommendation "Algorithms to measure audio programme loudness and true-peak audio level" of the International Telecommunication Union (ITU) [39]. This procedure to measure "loudness" was developed for a five-channel sound-reproduction system: a left channel, a centre channel, a right channel, a left-surround field, and a right-surround field. (For simpler systems, one can simply reduce the number of input channels.) The five inputs are first filtered by what is called a K-filter. The transfer characteristic of a K-filter is described in Fig. 7.4. It consists of two stages: the first stage, shown in the top panel of Fig. 7.4, "accounts for the acoustic effects of the head, where the head is modelled as a rigid sphere" [39, p. 3]; the second stage, shown in the middle panel of Fig. 7.4, is a simple high-pass filter attenuating frequencies lower than 100 Hz. The bottom panel shows the transfer characteristic of the combination of the two filters. The power of the outputs of these filters is then measured within overlapping windows of 400 ms, and the results are added with weighting factors of 1 for the left, the centre, and the right channel, and of 1.4 for the two surround channels. The result is expressed in logarithmic units indicated with LKFS, where the L is for loudness, K for K-filtering, and FS for full scale.

The "loudness" of the song or episode is then calculated by averaging only the "relatively loud" measurements; relatively "soft" measurements are excluded to prevent long silent or very soft parts of a recording from dominating the result. Whether a measurement is "relatively loud" is determined in two stages. First, it must be higher than –70 dB with respect to the maximum possible output of one channel. The measurements fulfilling this criterion are averaged, resulting in a first estimate


Fig. 7.4 The two K-filters and their combination used in K-filtering. (Matlab)

of the "loudness" of the episode. The final "loudness" is determined by averaging all measurements that are no more than 10 dB below this first estimate. The system is calibrated in such a way that a 1000-Hz sinusoid played at maximum amplitude measures –3.010 LKFS, and a difference of 1 dB in the input channels results in a difference of 1 LKFS. For further details the reader is referred to Recommendation BS.1770-3 [39].

The European Broadcasting Union (EBU) has adopted this recommendation and elaborated on it [17, 19]. The EBU does not use LKFS units but Loudness Units (LU), where 1 LU corresponds to 1 dB. For broadcasting purposes the Loudness Unit Full Scale (LUFS) is defined, and a "loudness" of –23.0 LUFS is recommended, which means that the "loudness" is 23 LU lower than that of a signal comprising the full scale of the digital equipment. The EBU also defines a loudness range (LRA), i.e., the difference in LU between the estimates of the 10th and the 95th percentiles of the loudness distribution over the whole episode [18, 19]. The 10th percentile is chosen to prevent very soft intervals, such as after fadings, from dominating the estimate of the loudness range; the 95th percentile is chosen to prevent short but very loud sounds, such as gunshots, from affecting the "loudness" setting. A specific loudness range is not recommended by the EBU; it is defined in order to allow the sound engineer to gauge whether it may be worthwhile to adjust the dynamic range of the recording. Depending on the nature of the recording, its LRA may be increased so that it sounds "richer" in dynamics. On the other hand, the LRA of recordings with short but disturbing, intense parts may be decreased so that the dynamic range is reduced without affecting the average listening level. In this way, one wants to ensure that listening, e.g., to the sound track


of a movie consisting of a number of successive recordings is not negatively affected by too loud or too soft episodes.

In summary, a number of procedures were discussed that adapt the measurement of sound intensity in such a way that it corresponds better to the loudness experienced by the listener. The dB scale accounts for the logarithmic nature of loudness perception; the dBA, dBB, etc., or K-weightings adapt the measurement in various ways to our sensitivity to the frequency and level of the sound components; and loudness normalization is used to ensure that the listener experiences successive recordings as about equally "loud". In spite of all these adaptations aimed at adjusting the measurement of sound level to the properties of the human auditory system, it will be shown that loudness perception exhibits a number of phenomena that cannot be explained by these procedures alone. This is why loudness was written between quotation marks in this section. Later in this chapter, a computational model of loudness estimation will be presented that makes it possible to, e.g., accurately predict "Why are commercials so loud?" [51]. First, some basics.
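Before turning to those basics, the two-stage gating described above can be summarized in a short sketch. This is my own mono simplification, not part of the book or of the Recommendation's reference code; the K-filtering itself is omitted, and the –0.691 dB offset and the choice of block overlap follow my reading of ITU-R BS.1770.

```matlab
% Rough mono sketch of the gated "loudness" measurement of ITU-R BS.1770.
% y is assumed to be a K-filtered signal scaled relative to digital full scale.
fs  = 48000;
y   = 0.1 * randn(1, 30*fs);                 % stand-in signal of 30 s
blk = round(0.400*fs);                       % 400-ms measurement blocks
hop = round(0.100*fs);                       % overlapping blocks
nB  = floor((length(y) - blk)/hop) + 1;

z = zeros(1, nB);                            % mean square per block
for k = 1:nB
    seg  = y((k-1)*hop + (1:blk));
    z(k) = mean(seg.^2);
end
Lblk = -0.691 + 10*log10(z);                 % block "loudness" in LKFS/LUFS

keep1 = Lblk > -70;                          % stage 1: absolute gate at -70
Lrel  = -0.691 + 10*log10(mean(z(keep1)));   % first estimate of the "loudness"
keep2 = keep1 & (Lblk > Lrel - 10);          % stage 2: relative gate 10 dB below
Lprog = -0.691 + 10*log10(mean(z(keep2)));   % programme "loudness"
fprintf('programme loudness: %5.1f LUFS\n', Lprog);
```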

7.2 The dB Scale and Stevens' Power Law

The use of the dB scale is based on the assumption that equal differences in dB SPL represent equal differences in loudness. So, the difference in loudness between two sounds, e.g., one with an SPL of 40 and the other of 50 dB, is perceived as equal to the difference in loudness between two tones with SPLs of, e.g., 70 and 80 dB. As has already been indicated, this is indeed the case, at least as a good first approximation, since this is the very reason why the dB scale is used. Later it will be shown that, relatively speaking, for levels lower than 30 or 40 dB and higher than 100 or 120 dB, equal level differences result in larger differences in loudness than at intermediate levels. But first the focus will be on the range within which equal differences in level are indeed perceived as equal differences in loudness.

This brings us to Stevens' power law [82], or simply Stevens' law. Stevens' power law is a very general psychophysical law which states that, when a perceptual attribute can be assigned a magnitude $N'$, this perceived magnitude is proportional to a power of the corresponding physical magnitude. So, when the physical magnitude is intensity $I$, one gets:

$$N' = c I^{\alpha} \textrm{ sone}. \qquad (7.5)$$

Stevens' power law has been verified for quite a number of perceptual attributes [83], including loudness, for which an exponent $\alpha$ of 0.3 was proposed. An easy calculation shows that an increase in $I$ of 10 dB corresponds to an increase in loudness of a factor 2. Indeed, since $10 \log_{10}(I_2/I_1) = 10$ means that $I_2 = I_1 \cdot 10^{10/10} = I_1 \cdot 10^1 = I_1 \cdot 10$, an increase of 10 dB SIL corresponds to a factor 10 in intensity. When Stevens' law is applied to $N_2'$, one gets

$$N_2' = c I_2^{\alpha} = c \cdot (10 \cdot I_1)^{\alpha} = 10^{\alpha} \cdot c \cdot I_1^{\alpha} = 10^{\alpha} \cdot N_1' \qquad (7.6)$$

With $\alpha = 0.3$, this gives

$$N_2' = 10^{0.3} \cdot N_1' \approx 2 \cdot N_1'. \qquad (7.7)$$

Besides the exponent $\alpha$, Stevens' law contains the constant $c$, which is about 250. This is chosen in such a way that the loudness of a 1-kHz pure tone of 40 dB is 1 sone, the perceptual unit of loudness [81]. These constants only apply to a 1-kHz pure tone and can vary from listener to listener. The loudness of pure tones of different frequencies and of more complex sounds will be discussed later in this chapter. Loudness $N'$ can also be expressed as a function of sound pressure level $L = 10 \log_{10}(I/I_0)$. This results in another version of Stevens' law for sound:

$$N' = 2^{(L-40)/10} \textrm{ sone}. \qquad (7.8)$$

One can now easily check that a 40-dB 1-kHz tone has a loudness of 1 sone. Indeed, if $L = 40$, one gets $N' = 2^{(L-40)/10} = 2^{(40-40)/10} = 2^0 = 1$ sone. This equation also agrees with the rule of thumb that every increase of 10 dB should correspond to a doubling in loudness. For, when $L_2 = L_1 + 10$ dB, this gives $N_2' = 2^{(L_2-40)/10} = 2^{(L_1+10-40)/10} = 2^{10/10} \cdot 2^{(L_1-40)/10} = 2 \cdot N_1'$. So, the equation $N' = 2^{(L-40)/10}$ is a version of Stevens' law for which the exponent is 0.3 and for which 40 dB is assigned a standard value of 1 sone. An easy calculation shows that, according to this law, sound pressure levels of 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 dB correspond to 1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8, 16, 32, and 64 sone, respectively. As mentioned, from 30 or 40 dB up to 100 or 120 dB this will appear to be quite accurate.

Following Stevens' original proposal, the exponent $\alpha$ is generally set to 0.3. Besides large individual differences, it appears, however, that there are various complicating factors that can affect the value of $\alpha$ in experimental conditions. Some have argued that 0.3 is too low a value and that it should be set to 0.5 [90], which will be discussed in Sect. 9.7.4. In a simple experiment described in the next section, $\alpha$ turns out to be lower than 0.3. For now, $\alpha$ will be set to the standard value of 0.3.

One of the consequences of Stevens' law is that equal increments in intensity lead to much larger increments in loudness at low sound levels than at high levels. Indeed, 10 dB SPL corresponds to an intensity of about $10^{-11}$ W/m². Increasing this by $9 \cdot 10^{-11}$ W/m² to $10^{-10}$ W/m² results in 20 dB SPL, a tenfold increase in intensity. According to Stevens' law, this corresponds to a doubling of the loudness. An identical increase in intensity at 70 dB SPL, or $10^{-5}$ W/m², results in an increase in intensity from $10^{-5}$ W/m² to $10^{-5} + 9 \cdot 10^{-11}$ W/m², which is 1.000009 times the intensity corresponding to 70 dB SPL. Obviously, this is perceptually negligible. This phenomenon, that an increment in intensity causes a much larger increase in loudness at lower sound intensities than at higher intensities, is sometimes called Stevens'-law-like behaviour.
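The two formulations of Stevens' law, Eqs. 7.5 and 7.8, are easily checked numerically. The sketch below (my own illustration, not one of the book's scripts) reproduces the sone values listed above for levels from 0 to 100 dB; the small discrepancy between the two versions stems from $10^{0.3} \approx 1.995$ rather than exactly 2.

```matlab
% Stevens' power law for loudness in its two (nearly) equivalent forms
L = 0:10:100;                            % sound pressure level of a 1-kHz tone in dB
N = 2.^((L - 40)/10);                    % loudness in sone according to Eq. 7.8

alpha = 0.3;                             % exponent of Stevens' law, Eq. 7.5
I  = 1e-12 * 10.^(L/10);                 % intensity in W/m^2, with I0 = 1e-12 W/m^2
c  = 1 / (1e-12 * 10^(40/10))^alpha;     % about 250, so that 40 dB gives 1 sone
N2 = c * I.^alpha;                       % loudness in sone according to Eq. 7.5

disp([L; N; N2]);                        % 0 dB -> 1/16 sone, ..., 100 dB -> 64 sone
```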


7.3 Stevens' Law of a Pure Tone and a Noise Burst

As mentioned, Stevens' law appears to be quite accurate for intensities between about 40 and about 90 dB SPL. For lower intensities, there are important deviations. This is shown in Fig. 7.5, based on data from Figure 20 of Fletcher [22]. This figure shows the loudness of a 1-kHz tone, as perceived by listeners, as a function of the intensity level of that tone. Fletcher [22] used results obtained with a variety of methods to measure how differences in intensity, at various average levels, were perceived as differences in loudness. Fletcher's data are represented in Fig. 7.5 by the thick curve. The graph is about linear above 40 dB, which shows that, in this range, equal proportions in intensity are indeed perceived as equal proportions in loudness. Below 40 dB, however, the deviations are quite considerable. The smaller the average intensity level, the steeper the curve. Hence, at levels lower than 40 dB SPL, the same differences in level are perceived as larger differences in loudness than at levels higher than 40 dB SPL. At 1 kHz, the hearing threshold is about 0 dB SPL. Hence, Fig. 7.5 basically shows that, per dB intensity level, the loudness first grows quite rapidly but stabilizes for intensities above 40 dB.

In order to illustrate all this, an informal experiment will be described aimed at verifying the assumptions that a listener judges equal rises in dB SPL as equal rises in loudness, and a rise in intensity of 10 dB as a doubling in loudness. The first assumption will be confirmed; as to the second assumption, it will be concluded that, at least for moderate intensities, a doubling in loudness requires somewhat more than 10 dB. Finally, the range of intensities for which Stevens' law applies will be discussed in more detail.

Fig. 7.5 Loudness of a 1-kHz pure tone as a function of its intensity level, based on data from [22, Fig. 20]. The dashed line represents Stevens' law with an exponent of 0.3. Note the considerable deviation from Stevens' law at intensities lower than 40 dB. (Matlab)


The experiment that will be described uses loudness rating, also referred to as loudness scaling. In such an experiment, listeners are presented with pairs of sounds and are asked to indicate how much softer or louder the second sound of the pair is relative to the first sound. In the experiment described below, listeners are presented with pairs of noise bursts of which the first, the standard, has a fixed SPL, while the second has a variable SPL, which can be lower than, higher than, or the same as that of the standard. (The reader of the E-book can listen to such pairs by clicking on "demo" at the end of the caption of Fig. 7.6.) In the experiment, listeners are asked to indicate on a percentage scale how much louder or softer than the standard they hear the second sound. So, when they hear the second sound as twice as loud as the first, standard sound, they should indicate 200%; when the second is perceived as only one third as loud, they should indicate 33%. In Fig. 7.6, informal results are presented of an experiment using track 20 of the CD by [37]. In that track, the SPL of the second sound was –20, –15, –10, –5, 0, 5, 10, 15, or 20 dB higher than that of the first presented standard. The order of presentation was random. The results presented in Fig. 7.6 represent the responses of 674 listeners, students who participated in my lectures from 2000 to 2017. They were obtained in a wide variety of lecture rooms of different sizes and with different amounts of reverberation. The results are presented in about the same way as those presented by Hartmann [35, p. 6, Fig. 4.3], who also presented the results of such an experiment with the same stimuli.

Fig. 7.6 Informal results of a loudness rating experiment according to tracks 19–20 of the CD by [37], presented in much the same way as those presented by [35, Fig. 4.3]. All graphs present the same data, but the abscissa and the ordinate can be linear or logarithmic. The data points present the 0.1, 0.5, and 0.9 quantiles of the responses. The 0.5 quantiles, i.e., the medians, are connected. The circled data points are the points for the pair of bursts of equal intensity. The regression line in the bottom panel is drawn through the medians of the data points for intensity differences of –5 to 20 dB. Explanation in the text. (Matlab) (demo)

Figure 7.6 shows three panels. In each of these panels, the same results are plotted, but the scales of the ordinate and the abscissa are different: in the upper panel, both ordinate and abscissa are linear; in the middle panel, the ordinate is linear, while the abscissa is logarithmic; in the lower panel, both abscissa and ordinate are logarithmic. The ordinate presents the loudness rating as a percentage; the abscissa presents the intensity of the second noise burst relative to that of the first, the standard. The responses to the pair of bursts in which the standard and the comparison burst had equal intensities are circled. For these data points the spread is very small, obviously because listeners found it easy to hear when both sounds of a pair had equal intensities. Moreover, the larger the intensity difference between the two bursts of a pair, the larger the spread in the responses by the listeners. Apparently, listeners find the task more difficult for larger differences, or they differ in the weight they attribute to a difference in intensity. It can be seen that neither in the top panel nor in the middle panel are the data on a straight line. Furthermore, on the linear ordinate the spread is considerably smaller for the negative differences in dB SPL than for the positive differences. This indicates that equal differences in intensity expressed in W/m² are perceived as smaller at lower intensities than at higher intensities.

Most relevant are the data as presented in the bottom panel of Fig. 7.6. The dashed line is the regression line drawn through the medians of the data points for intensity differences ranging from –5 to 20 dB. It can be concluded that the fit is quite good; the correlation coefficient is 0.9980. But the question is, of course: why are the data points for the larger negative intensity differences not included? It can be seen that, for these points, the spread is larger than for the other points. The explanation for this can be found in the conditions in which the tests were carried out. The overall range of intensities presented to the students could be very different. This was

due to the various room sizes, which meant that some listeners were much farther away from the loudspeakers than others, and to the presence of noise that raised the hearing threshold for the stimuli. It will be explained later that, at low intensities just above threshold, intensity differences in dB are perceived as larger differences in loudness than at intensities far above threshold. This is an informal explanation for the data as presented in the bottom panel of Fig. 7.6. Hartmann [35] presents the results of the same experiment for 64 listeners and, when presented on a logarithmic ordinate and a logarithmic abscissa, these results are linear over the whole range of intensity differences. All this illustrates that, for intensities of about 40 dB to about 80 dB above threshold, listeners experience equal differences in dB as equal proportions in loudness. The exponent obtained in Fig. 7.6 is 0.20, which is lower than the value of 0.3 it officially has in Stevens' power law for sound. This is found


more generally in this kind of experiment. For instance, Hartmann [35] finds an exponent of 0.22.
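How such an exponent follows from rating data can be sketched as a simple linear regression of the logarithm of the median ratings on the level difference; the numbers below are hypothetical values, consistent with an exponent of about 0.2, and are not the medians of Fig. 7.6.

```matlab
% Sketch: estimating the exponent of Stevens' law from loudness ratings.
% dL: level differences (dB) re the standard; R: hypothetical median ratings in
% percent (100% = as loud as the standard) -- illustrative values only.
dL = [-5 0 5 10 15 20];
R  = [79 100 126 159 200 251];

% Stevens' law predicts log10(R/100) = (alpha/10) * dL, so alpha follows from the slope.
p     = polyfit(dL, log10(R/100), 1);
alpha = 10 * p(1);
fprintf('estimated exponent alpha = %4.2f\n', alpha);   % about 0.20 for these values
```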

7.4 Loudness of Pure Tones

Before discussing the loudness of more complex sounds, the loudness of the most elementary of sounds, the pure tone, will be discussed. In the previous section, the loudness of a 1-kHz pure tone was discussed. But what about pure tones of other frequencies? The first phenomenon that must be taken into account is that tones of the same SPL but with different frequencies are generally not perceived as equally loud. As mentioned in the discussion on how to measure SPL, our sensitivity to sound diminishes quite considerably below 500 Hz and, depending on age, also above 10–18 kHz. In order to take this into account, first the same standard tone is chosen as in the previous section, a 1-kHz 40-dB pure tone. Then, for a large number of test tones of different intensities and frequencies, one can ask listeners to adjust the intensity of the standard 1-kHz reference tone in such a way that it sounds just as loud as the test tone. When the participants have done this, this adjusted level, expressed in dB SPL, is called the loudness level of the test tone. Actually, one can do this not only for pure tones. One can present any sound to a listener and ask them to adjust the level of the 1-kHz standard tone in such a way that it sounds just as loud as the test sound. This adjusted level in dB SPL is, by definition, the loudness level of the test sound, and is expressed in phon.

7.4.1 Equal-Loudness Contours

Since the loudness level of a pure tone is the level a 1-kHz tone must have in order to sound just as loud, one can also reverse this procedure. One can take a 1-kHz tone of a fixed intensity, say 10, 20, 30 dB SPL, etc., and ask listeners to adjust the intensities of a large number of pure tones of different frequencies to a level at which they sound just as loud as the fixed 1-kHz tone. In this way, one can determine, for tones over a large range of frequencies, what their intensity must be in order to sound just as loud as a 1000-Hz tone of, e.g., 10, 20, 30, etc., up to 100 dB SPL. The result of such a procedure is shown in Fig. 7.7, with the intensities of the tones on the ordinate and their frequencies on the abscissa. The lowest, dashed curve gives the hearing threshold as a function of frequency. For a number of SPLs, starting from 10 dB and then spaced 10 dB apart, curves are drawn representing the SPLs of tones that sound just as loud as the 1000-Hz tone on that curve [24]. Hence, all the tones on such a curve sound equally loud. These curves are therefore called equal-loudness contours. Other terms are isophone curves or, after those who first published them [23], Fletcher-Munson curves.

Fig. 7.7 Equal-loudness contours, also called isophone curves or, after those who first published them, Fletcher-Munson curves. The level of the 1-kHz tone on such a curve represents the loudness level of all pure tones on that curve. Data derived from ISO226. (Matlab)

The level of the 1-kHz tone on such a curve

represents the loudness level of all pure tones on that curve. In this way one can establish the loudness level of all pure tones of different frequencies and intensities.

One important aspect of the loudness perception of pure tones is not represented in the equal-loudness contours of Fig. 7.7. They do not show what is shown in Fig. 7.5, viz., that the loudness increases more rapidly at intensities lower than 40 dB. So, the ratio between the loudness of tones on the 10-dB and the 20-dB equal-loudness contours is larger than that between tones on the 30-dB and the 40-dB contours, which, in turn, is larger than that between tones on the 40-dB and the 50-dB contours. A similar phenomenon occurs at loudness levels higher than 90 dB. In other words, according to Eq. 7.8, the loudness of tones on adjacent equal-loudness contours of Fig. 7.7 differs by about a factor of 2. This is only a good approximation for loudness levels between 40 and 90 phon. For loudness levels lower than 40 or higher than 90 phon, this factor is larger than 2.

In fact, an equal-loudness contour represents the level that a frequency-modulated (FM) pure tone must have in order not to change in loudness. The extent to which this is correct can be heard in the demo of Fig. 7.8.

Fig. 7.8 Two glides are presented with frequencies starting at 60 Hz and ending at 15 kHz. The upper panel shows the estimated loudness of the two tones, the lower panel their sound pressure levels. In the first presentation, shown as the continuous line, the amplitude is varied according to the 60-dB equal-loudness contour so that its loudness should be more or less constant. In the second presentation, shown as dotted lines, the amplitude is fixed at the average amplitude of the presentation, so that the loudness varies with the frequency sensitivity of our hearing system. (Matlab) (demo)

In this demo, two FM tones are played. In the first tone, the amplitude is modulated in such a way that the loudness level of the tone is constant. In other words, listening to this tone, its loudness should remain the same. In the calculations described below, the loudness level of this tone was set to 60 phon, corresponding to a loudness of 4 sone. This loudness is presented in the upper panel of Fig. 7.8 as the continuous, horizontal line. In the second tone, the

SPL is fixed at 60 dB. In this case, the loudness changes with our sensitivity to the frequency of the tone, as shown by the dotted line in the upper panel of Fig. 7.8. The reader can check all this by listening to the demo. The lower panel of Fig. 7.8 presents the sound pressure levels of the two tones. The continuous line in this panel represents the equal-loudness contour for a loudness level of 60 phon.

Up to now, only the loudness of pure tones has been discussed, but in principle one can, for all different kinds of sounds, ask people to adjust the level of a 1-kHz tone so that both sounds have the same loudness. This then yields the loudness level of these sounds, and by using Stevens' law, one can then calculate their loudness. In principle, this indeed defines the loudness level, i.e., the number of phons, and the loudness, i.e., the number of sones, of these sounds. Remarkably, the 1-kHz pure tone used as a reference for the phon and the sone scale may actually not be the best possible reference sound in loudness-matching tasks. It appears that, when


listeners are asked to adjust the loudness of one sound to that of a reference sound, the variability of the adjustments is high when the reference sound is a pure tone, higher than for other reference sounds. Indeed, Skovenborg, Quesnel, and Nielsen [79] asked subjects to match the loudness of a target sound to that of another sound by adjusting the intensity of the target sound in such a way that both sounds were perceived as equally loud. They studied the variability of the matching for all kinds of different pairs, so pairs including music and speech sounds, music and noise, noise and pure tones, etc. It appeared that the variability of the matching was largest when one of the sounds of the pair was a 1000-Hz pure tone. This shows that listeners experience difficulty in comparing the loudness of a pure tone with the loudness of another sound, certainly when the two are very different in timbre. This brings us to the loudness of sounds more complex than pure tones.
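The relation between loudness level in phon and loudness in sone used above follows directly from Eq. 7.8; the two-line sketch below (my own illustration) also reproduces the value of 4 sone quoted for the 60-phon glide of Fig. 7.8.

```matlab
% Converting loudness level (phon) to loudness (sone) and back, following Eq. 7.8
phon2sone = @(LN) 2.^((LN - 40)/10);
sone2phon = @(N)  40 + 10*log2(N);

phon2sone(60)    % 4 sone, the loudness of the constant-loudness glide of Fig. 7.8
sone2phon(4)     % 60 phon
```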

7.5 Loudness of Steady Complex Sounds

In Chap. 3, a model was discussed to calculate the excitation pattern and the specific-loudness distribution of a sound based on its spectrum. This model consisted of four stages: filtering by the outer and middle ear, calculation of the excitation pattern induced by the sound, transformation of this excitation pattern into the specific-loudness distribution, and integration of the specific-loudness distribution over the complete tonotopic array. In Sect. 3.8.6, this was illustrated in Fig. 3.20 for pure tones of varying intensities, and in Fig. 3.21 for a complex tone of two harmonics and a complex tone of seven harmonics. Moreover, in the demo of Fig. 3.5 of Sect. 3.2, an experiment was described in which the bandwidth of a noise burst was varied while keeping its intensity constant. It appeared that the loudness of the noise was constant as long as its bandwidth was smaller than the critical bandwidth, but increased when the noise was spread out over a larger range of the tonotopic array. The larger the number of auditory filters excited by the noise, the louder the sound. The reader is advised to listen to that demo again.

First, the situation will be discussed in which the noise only covers a very narrow frequency band. In that case, the frequency components of the noise are so close together that they are processed by the auditory system as a single auditory unit consisting of one frequency component that fluctuates in time. And that is also how a very narrow band of noise is perceived. Examples of noise bursts 1/6th of an octave wide are played in the demo of Fig. 1.68. Such a burst has the character of a kind of whistle, a somewhat fluctuating, pure tonal sound. The perceived fluctuations may concern the pitch and the loudness of the sound, but it is hard to distinguish between the two. The frequency of the pitch is close to the centre frequency of the noise. As the noise band gets wider, the sound becomes less tonal and noisier. As long as the bandwidth of the noise is less than one critical bandwidth, the excitation pattern on the tonotopic array is that of one single frequency component with the intensity of the noise.


Fig. 7.9 Specific-loudness distributions of four complexes of two bursts of narrow-band noise with a varying distance between their centre frequencies. These centre frequencies are shown in the upper left corner of the panels. The intensity of the separate bursts is 50 dB, their combination 53 dB. Their bandwidth is 1/6 of an octave. The estimated loudness is presented in the upper right corner of the panels. The intensities of the stimuli do not change but their loudnesses do. (Matlab) (demo)

Now, the loudness of two concurrent narrow bands of noise will be discussed in more detail. A demo is presented in Fig. 7.9 for four different distances between the centre frequencies of the noise bands. Their intensity is set to 50 dB SPL and their bandwidth is 1/6th of an octave. In the first pair, the centre frequencies of the two bursts are equal, 1000 Hz, but the noise of the two bursts is independent. The centre frequencies are 990 and 1010 Hz for the second pair, 800 and 1200 Hz for the third, and 500 and 1500 Hz for the fourth pair of noise bursts. The figure shows the specific-loudness distributions and gives the estimated loudness in the upper right corner of the panels.

First, the upper panel of Fig. 7.9 will be discussed. It shows the specific-loudness distribution of the sum of two narrow noise bands with the same centre frequency. For clarity, the two noise bursts in the demo are generated independently, so that the intensity of the sum of the two is twice the intensity of one noise burst or, equivalently, 3 dB higher in SPL. If two identical noise bursts were added, their pressure waveforms would simply double, resulting in an intensity that is four times that of the single noise burst, giving an SPL that is 6 dB higher. But, as said, two independent noise bursts are added in the demo of Fig. 7.9. Since the intensity of a single noise burst is 50 dB SPL, this gives an intensity for the sum of the two noise bursts of 53 dB SPL. The upper panel then gives a loudness estimate of 2.54 sone, the result of integrating the specific-loudness

As the centre frequency of the noise is 1 kHz, this can be checked with the simple version of Stevens’ power law for sound, $N = 2^{(L-40)/10}$. As the intensity of one noise burst is 50 dB SPL, this gives a loudness of $2^{(53-40)/10} \approx 2.46$ sone, close to the more elaborate result of 2.54 sone obtained by calculating the specific-loudness distribution and integrating it over the tonotopic array. In the second panel, the centre frequencies of the two noise bursts are 990 and 1010 Hz. At 1 kHz, the ERB is 133 Hz, so that these two noise bursts are well within each other’s critical bandwidths. This results in a specific-loudness distribution that differs very little from that shown in the first panel. As a consequence, the estimated loudness of 2.55 sone is very close to that of the first pair of noise bursts. In the third panel, the centre frequencies of the noise bands are 800 and 1200 Hz, a difference of 400 Hz, which well exceeds the ERB at 1 kHz of 133 Hz. In spite of that, the figure shows that there is still some overlap of the specific-loudness distributions of the separate noise bands. The estimated loudness is 3.86 sone. Finally, in the bottom panel, the centre frequencies are 500 and 1500 Hz. The specific-loudness distributions of the separate noise bands no longer overlap, and the loudness of the combination of the two noise bands can simply be found by adding the loudnesses of the two separate noise bands. Integration over the specific-loudness distribution gives 4.13 sone, which is close to 4 sone, the sum of the loudnesses of two separate noise bursts of 2 sone each. This leads to the following rule of thumb for loudness perception: When the frequencies of a complex sound are within one critical bandwidth, the intensities add up, but when they are separated by more than one critical bandwidth their loudnesses add up. Figure 7.9 presents the specific-loudness distributions of complexes consisting of two bursts of noise with a narrow bandwidth. A similar demo is presented in Fig. 7.10 for a complex of seven such narrow noise bands. The centre frequencies of the seven noise bands are equally spaced on a linear frequency scale. The centre frequency of the lowest noise band is always 500 Hz. In the course of the demo, the spacing between the centre frequencies is varied over 0, 50, 100, 250, 500, and 1000 Hz. In the first sound of the demo, the spacing between the centre frequencies is 0 Hz. Hence, in fact, seven independent noise bursts are added up, which means that the intensity increases by $10 \log_{10} 7 = 8.45$ dB. At 500 Hz, 50 dB is on the equal-loudness contour of 47 dB. Stevens’ simple law for sound, Eq. 7.8, then gives a loudness of $2^{(47+8.45-40)/10} \approx 2.92$ sone, which differs by only 0.12 sone from the 3.04 sone obtained by integrating the specific-loudness distribution. In the second noise burst of the demo, the spacing between the centre frequencies is 50 Hz. At 500 Hz, the ERB is 79 Hz, which shows that the bandwidths of the noise bursts strongly overlap so that, in fact, they form one burst of noise with a bandwidth of more than 300 Hz, which is considerably more than the ERB of 79 Hz. Consequently, the estimated loudness increases to 4.94 sone. In the following noise bursts, the specific-loudness distribution spreads out more and more over the tonotopic array. This results in a loudness that increases up to 13.62 sone in the last burst, in which the centre frequencies differ by 1000 Hz.
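These back-of-the-envelope checks with Stevens' power law can be reproduced in a few lines. The snippet below only illustrates the rule of thumb just stated; it is not the loudness model itself.

% Stevens' simple power law for a 1-kHz tone: N = 2^((L-40)/10) sone
stevens = @(L) 2.^((L - 40)/10);

% Two independent 50-dB noise bands within one critical band: intensities add
Lsum = 10*log10(2*10^(50/10));      % 53 dB
Nwithin = stevens(Lsum)             % about 2.46 sone

% Two 50-dB noise bands far apart on the tonotopic array: loudnesses add
Napart = 2*stevens(50)              % about 4 sone

% Seven coinciding independent 50-dB bands at 500 Hz (equal-loudness level about 47 dB)
Nseven = stevens(47 + 10*log10(7))  % about 2.9 sone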

Fig. 7.10 Specific-loudness distributions of five noisy stimuli consisting of seven narrow noise bands. The centre frequencies of the noise bands are shown in the upper left corner of the panels. The intensity of the separate bursts is 50 dB. The estimated loudness of the complexes is shown in the upper right corner of the panels. The loudness increases with increasing distance between the centre frequencies. (Matlab) (demo)

In all six noise bursts played in the demo of Fig. 7.10, the intensity stays at a level of 58.45 dB, but the loudness increases from 4.94 to 13.62 sone. This illustrates that loudness increases when the spectral power is spread out over a larger part of the tonotopic array. Not only the loudness increases, however, but also the brightness. Furthermore, especially in the last bursts of the demo, it seems as if the sound is produced with more power, corresponding to what has been said about this in Sect. 6.7.3 of the previous chapter on timbre. In principle, the loudness of any steady sound can now be estimated when its power spectrum is known. Before discussing the limitations of the model, two checks of the adequacy of the model will be presented. In the first, the estimated loudness of a 1000-Hz pure tone is presented as a function of its intensity level.

Fig. 7.11 Estimated loudness, thick line, of a 1000-Hz pure tone as a function of the intensity level of the tone. The dashed line shows Stevens’ law for sound intensity with an exponent of 0.3. Adapted from [55]. The thin lines show the standard loudness of 1 sone for a 40-dB 1000-Hz tone. (Matlab)

The result is presented in Fig. 7.11. The continuous thick line presents the estimations; the dashed line gives the predictions based on Stevens’ law with an exponent of 0.3. Between 30 and 100 dB, Stevens’ law is quite well met. For lower and higher intensities, however, the loudness grows more rapidly with intensity. Note that this figure closely matches the loudness measurements of a 1-kHz pure tone as presented by Fletcher [22] and shown in Fig. 7.5. This clearly shows that, just above threshold, loudness increases quite rapidly, but that the rate at which this happens slows down and stabilizes above about 40 dB up to about 100 dB. Above 100 dB, the growth of loudness increases again, as shown in Fig. 7.11, but those levels are only 10–20 dB below the pain threshold, so that other factors such as annoyance, irritation, or pain may then increase the loudness estimates of the listeners. The reader is strongly advised to stay away from situations with such high sound levels, certainly when pure tones are involved. As shown above, pure tones are not perceived as loud as sounds of the same SPL whose power is distributed over more frequency bands, while, for pure tones, all power is concentrated on one small region of the tonotopic array, where it can induce maximum damage. This illustrates that the human hearing system is poorly protected against high sound levels, especially for pure tones. A second check of the adequacy of this loudness model consists of comparing real equal-loudness contours with equal-loudness contours as estimated by this model. Such estimated equal-loudness contours are presented in Fig. 7.12. One sees that the result quite well resembles the equal-loudness contours presented in Fig. 7.7. This should not come as a surprise, however, as the parameter setting of the model is based on this kind of data. And, it is mentioned again that the difference in loudness between tones on the equal-loudness contour of 10 dB and tones on the equal-loudness contour of 20 dB is larger than the difference in loudness between tones on the equal-loudness contour of 60 dB and tones on the equal-loudness contour of 70 dB, as can be seen for a 1000-Hz tone in Fig. 7.11.

Fig. 7.12 Estimated equal-loudness contours. The loudness levels corresponding to the equal-loudness contours are indicated in dB. Adapted from [55]. (Matlab)

So, this loudness model appears to be well able to estimate the loudness of quite a number of sound signals, certainly sounds that last longer than a few hundred milliseconds and do not change too much in time. Moreover, the model correctly predicts that, when sounds have equal intensities but different bandwidths, the sounds with the wider bandwidths are louder than the sounds with the narrower bandwidths. Here, bandwidth is defined in terms of distance along the tonotopic array.

7.5.1 Limitations of the Loudness Model

The loudness model discussed above appears to be quite accurate when the sound does not change too rapidly and lasts longer than one or two hundred milliseconds. The spectrum of the sound is the only input and, except for an onset and an offset, the sound signal is assumed to be stationary. Temporal fluctuations are not taken into account. Actually, the model is no longer accurate as soon as temporal fluctuations can be heard, a situation which, in everyday listening, occurs more often than not, e.g., when listening to speech or to music. Temporal aspects are also important in other situations, e.g., when studying the effect of the duration of a sound on its loudness. When a sound lasts less than, say, 100 ms, its loudness is considerably less than when it lasts longer [58].

These problems will be discussed in more detail in Sect. 7.7. Another issue is that the model is just based on the power spectrum of the sound at the entrance of the ear canal. Hence, phase relations are not taken into account, although [30] found that phase relations between harmonics affect loudness. These authors asked listeners to adjust tones with harmonics added in cosine phase in loudness to tones with the same harmonics of equal amplitude but added in random phase. It appeared that the tones with harmonics added in cosine phase were adjusted a few dB lower in intensity than the tones with harmonics added in random phase. Correlations between the sounds entering the left and the right ear can also influence loudness [13]. Models that aim at taking these effects into account cannot be based on the power spectrum of the sound at the entrance of the ear canal alone, but should include a realistic auditory filterbank as front end [54]. Furthermore, as mentioned in Sect. 3.8.5, things at threshold are somewhat more complex than described above. For instance, while each component of a complex signal may be below threshold, the complex signal as a whole may be above threshold. In addition, different methods for measuring thresholds may lead to different estimates of the loudness [55]. Next, individual differences are not taken into account. The model will certainly not apply to people with hearing loss. Also, when people get older, their sensitivity decreases, especially for high frequencies. A newborn starts with hearing frequencies up to 20 kHz. With every ten years of age, about 1 kHz is lost at the high-frequency side of hearing, so that, at the age of about 100 years, components with frequencies higher than 10 kHz no longer play a role in hearing. Evidently, there is much individual variation in this respect. Finally, the model does not take into account that more than one sound may be heard. In many situations, multiple sounds are heard, each with its own perceptual attributes, among which loudness is only one. Up to now, it has been assumed that all frequency components are perceptually integrated into one auditory unit, and that all frequency components contribute to the loudness of that unit, which is very often not the case. Loudness is an attribute of one auditory unit, and the contribution of a sound component to loudness perception is probably based on the extent to which it contributes to the formation of that auditory unit or, as formulated by Allen [1, p. 1832]: “It may be that loudness additivity only holds for a single auditory stream.” This will be discussed in more detail in the next section, in which an extended model will be described that does take the presence of other sounds into account. These other sounds are considered as noise, and it is assumed that the components of this noise are not integrated perceptually into the sound of which one wants to know the loudness.

7.6 Partial Loudness of Complex Sounds

It is well known that the loudness of a sound can diminish in the presence of another sound; one sound can partially or completely mask another sound. This reduced loudness of a sound in the presence of another sound is referred to as its partial loudness. In order to present equations with which partial loudness can be estimated, the following is a continuation of the description of the computational model of loudness perception presented in Sect. 3.8.5. Indeed, if one sound is the signal and the other sound is the noise with specific loudness $N'_n$, the partial specific loudness $N'_s$ of the signal can, as a first-order approximation, be modelled by assuming that, when $N'_{tot}$ is the specific loudness of the combination of the two sounds, the partial specific loudness equals the difference between $N'_{tot}$ and $N'_n$, or $N'_s = N'_{tot} - N'_n$. Using Eq. 3.19, $N'_{tot}$ can then be represented as $N'_{tot} = C\{(G(E_s + E_n) + A)^\alpha - A^\alpha\}$, in which $E_s$ is the excitation by the signal, $E_n$ is the excitation by the noise, the constant $A$ represents the excitation produced by internal noise, and $G$ deals with the lower efficiency of the cochlear amplifier at frequencies lower than 500 Hz. Moreover, as $N'_n = C\{(G E_n + A)^\alpha - A^\alpha\}$, one gets:

$$N'_s = C\{(G(E_s + E_n) + A)^\alpha - A^\alpha\} - C\{(G E_n + A)^\alpha - A^\alpha\} \quad (7.9)$$

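As an illustration of how Eq. 7.9 behaves, the sketch below evaluates it for a single auditory filter. The parameter values C, α, G, and A are placeholders chosen here for illustration; the actual values in [55] vary with centre frequency and are not reproduced here.

% First-order approximation of partial specific loudness (Eq. 7.9),
% evaluated for one auditory filter. Parameter values are illustrative only.
C = 0.047; alpha = 0.2; G = 1; A = 4;            % placeholder constants
specLoud = @(E) C*((G*E + A).^alpha - A.^alpha); % specific loudness of an excitation E
Es = 10.^(60/10);                                % excitation by the signal (here simply 10^(L/10))
En = 10.^((0:15:60)/10);                         % excitations by noise of increasing level
NsPartial = specLoud(Es + En) - specLoud(En);    % Eq. 7.9: N'_s = N'_tot - N'_n
disp(NsPartial)   % decreases as the noise excitation grows, as in Fig. 7.14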

In practice, however, the distribution of the specific loudness over the signal and the background noise appears to depend considerably on the relative excitation by the signal and the background noise. Based on four conditions, Moore, Glasberg, and Baer [55] have developed a more accurate model for calculating partial loudness. These conditions are:

1. When the intensity of the signal is well below its masked threshold, the partial specific loudness should approach zero;
2. When the intensity of the noise is well below threshold, the partial specific loudness should approach the loudness of the signal in quiet;
3. When the signal is at its masked threshold, the partial specific loudness is equal to the specific loudness of the signal at absolute threshold;
4. When the intensity of the signal is well above its threshold in a narrow band of noise, the partial specific loudness should approach its unmasked value.

These four conditions lead to quite complex equations, for which the reader is referred to the original literature [55]. Some examples will be presented to illustrate the most important properties of the model.

7.6.1 Some Examples of Partial-Loudness Estimation

Masking is often demonstrated by switching on a constant narrow band of noise and playing a series of tone bursts of which the frequency starts low, passes through

the frequency band of the masker, and ends high. When the tone is much lower in frequency than the frequency of the noise band, one can hear the tone with the same loudness as when played in isolation. When the frequency approaches the frequency of the masker, however, the loudness of the tone is affected by the masker. First, its loudness diminishes, then the tone may become inaudible and, when the frequency of the tone is higher than the centre frequency of the noise, its loudness increases again until it is as loud as it would be in isolation. This is illustrated in Fig. 7.13, in which the partial loudness of a tone in the presence of another narrow-band sound is estimated by applying the partial-loudness model mentioned in the previous section [55]. In this example, both tone and masker have an intensity of 60 dB. The masker is fixed at 1000 Hz, while the tone starts at 50 Hz, a frequency which is then gradually increased up to 20 kHz. You may compare this figure with the dotted line in the upper panel of Fig. 7.8 showing the loudness of a 60-dB pure tone in silence. In Fig. 7.13, it is reproduced as the dotted line. Note that it very much resembles an inverted version of the threshold curve as, e.g., presented in Fig. 7.12. This is because the loudness of the pure tone is monotonically related to its level above threshold. As to the partial loudness of the tone in the presence of a narrow-band 1000 Hz masker, one sees that there is only a difference in the neighborhood of the masker. This shows that masking is only effective in a narrow frequency band around the frequency of the masker. At frequencies further away from the frequency of the masker, the loudness of the tone is the same as in silence. In summary, for low frequencies, the loudness of the tone is low due to our low sensitivity for low frequency sounds. As the frequency of the tone rises, its loudness increases until it gets affected by the narrow-band masking sound. There the loudness is suppressed and becomes partial, is minimum at 1000 Hz, and then rises again, until it has the loudness it would have in silence. At about 3000–4000 Hz, where our sensitivity for sound is maximum, the loudness of the tone

Fig. 7.13 Estimated partial loudness of a pure tone as a function of its frequency in the presence of a narrow-band noise centred around 1 kHz. The intensities of both sounds are 60 dB. The loudness of the tone in the absence of the masker is presented as a dotted line. Note the decrease in the partial loudness of the target tone around 1000 Hz, the centre frequency of the masker. (Matlab) (demo)

Fig. 7.14 Calculation of the partial loudness of a 1000-Hz 60 dB pure tone in the presence of masking noise. The noise consists of 257 sinusoids with random phase and frequencies between 500 and 2000 Hz, equidistant on a logarithmic frequency scale. The level of the frequency components of the noise was varied in steps of 15 dB from 0–60 dB. The panels on the left give the specific loudness of the signal plus the noise, the panels in the middle give the specific loudness of the noise, and the right panels give the partial specific loudness of the tone. Note the decrease in partial loudness of the tone as the noise increases in intensity. (Matlab) (demo)

is maximum, and finally decreases again in the same way as it would in silence, as shown by the dotted line in the upper panel of Fig. 7.8. The next example, shown in Fig. 7.14, illustrates the situation in which the intensity of a noise sound is increased in the presence of a 1-kHz pure tone with a constant intensity of 60 dB SPL. The noise consists of a pink-noise band with a lower cut-off frequency of 250 Hz and a higher cut-off frequency of 4000 Hz. Pink noise is chosen instead of white noise, because the excitation induced by pink noise is more or less equal for frequencies up to 3–5 kHz. In the demo of Fig. 7.14, the intensity of the noise masker is increased in steps of 15 dB from 0 to 60 dB. The specific-loudness distributions of the pure tone and the noise taken together are presented in the left panels. Just to their right, the specific-loudness distributions of the separate noise bands are presented. The estimated loudness of these signals is presented in the upper left corner of the panels. The right panels show the partial specific loudness of the tone, which is, approximately, the difference between the specific loudness of the combination of tone and noise and that of the masking noise (Eq. 7.9). The partial loudness of the tone, obtained by integrating the partial specific loudness of the tone, is shown in the upper left corner of the panels. Note the decreasing partial loudness of the tone in the presence of the increasing noise levels.

Fig. 7.15 Estimated partial loudness of a 1000-Hz pure tone as a function of the intensity of the tone for various levels of the masker. The masker consisted of 65 frequency components between 500 and 2000 Hz, equidistant on a logarithmic frequency scale. The level of the masker is indicated in dB. (Matlab)

The fact that the loudness of the tone diminishes in the presence of the increasing noise level may seem a natural phenomenon, as one experiences it daily. But one should realize that the intensity does not diminish in any filter of the tonotopic array. In fact, the level of the input of all auditory filters increases or at best remains the same. So, as can be derived from the left panels of Fig. 7.14, representing the specific loudness of the combination of the signal and the noise, the excitation does not diminish at any location on the tonotopic array in the course of the rising noise level; it only increases. So, also in the auditory filter centred at the frequency of the tone, 1000 Hz, the excitation increases. In spite of that, the loudness of the 1000-Hz tone diminishes until masking is complete and it is no longer audible. In the previous example, it was shown that a tone can be more and more masked by noise with frequency components close to the frequency of the tone. The next figure, Fig. 7.15, presents the partial loudness of a 1000-Hz pure tone as a function of its intensity in the presence of masking noise. The noise level is varied in steps of 10 dB from 0 to 70 dB. For a noise level of 0 dB, the partial loudness of the tone is very much the same as its loudness in silence as presented in Fig. 7.11. The more noise that is added, the steeper the curves. Actually, it appears that, just above threshold, the partial loudness increases very rapidly with intensity. At intensities higher than the threshold, the loudness grows less and less rapidly with intensity. The consequence of this is that, as soon as the sound is 15–25 dB above masked threshold, its loudness is almost completely restored up to the value it has in silence. Another way of saying this is that, 15–25 dB above threshold, the masker no longer masks the signal. The curves shown in Fig. 7.15 illustrate that, close to threshold, the loudness system is very unstable. At threshold and just above it, the loudness of a pure tone grows very rapidly with intensity. This all shows that, in situations in which the noise significantly affects the audibility of sounds such as music and speech, the loudness of this music or speech can change quite rapidly.

In fact, this may be a familiar phenomenon. For many people, listening to music in a noisy environment is experienced as much less pleasant than listening to music in quiet. This situation occurs, e.g., when one listens to music in a car or in a train. Usually, people will switch on the radio in the car or the music player in the train when the vehicle is not moving yet. The level of the music player will then be adjusted to a comfortable loudness level. When the vehicle starts moving, the noise produced by the vehicle will increase with the speed. At a certain level, the noise will start masking part of the frequency components of the music, and the contribution of these components to the loudness of the music will rapidly decrease. As a consequence, the music will be partially masked and, at a certain level, only the most intense frequency components of the music may contribute to hearing the music. Due to the steepness of the loudness-intensity functions just above threshold, small changes around threshold can rapidly change the contribution of these components to the loudness and the timbre of the music. This can make listening to music quite annoying. To compensate for this, the listeners will in many cases increase the level of the music so that at least the larger part of the frequency components of the music are 10 or more dB above their thresholds. The result is then a new, hopefully more comfortable, loudness level. But in that situation, many of the sound components will be partially masked. Now consider what happens when the car or the train comes to a halt. In this situation, the intensity of all frequency components that arrive at the ear goes down. As a consequence, in all auditory filters along the tonotopic array the excitation diminishes. Nonetheless, the contribution of the completely or partially masked frequency components returns, now that they are no longer masked. If the volume was turned up by 10 or 15 dB when the train was moving, this means that, after the train stops, the volume of the music will be 10–15 dB above the comfortable loudness level, which can be uncomfortably loud. So, in spite of the fact that in all auditory frequency channels the excitation has gone down, the loudness of the music is now too high, and the listener will decrease the level back to around the comfortable loudness level it originally had in quiet. This shows that the presence of background noise can make listening to music quite annoying. This is, in fact, what many people with hearing loss experience in everyday listening situations. Especially when their cochlear amplifier does not operate properly due to damage of the outer hair cells, their threshold can be raised considerably. Just above threshold, loudness grows very rapidly with intensity whereas, at high intensities, loudness growth is normal. This is called loudness recruitment. The effects of loudness recruitment can partly be undone by dynamic-range compression [49, pp. 236–257]. Dynamic-range compression has briefly been described in Sect. 7.1.2. Simply speaking, dynamic-range compression means that sound components with a low intensity level are amplified, while components with a high intensity level are attenuated.
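As a simple illustration of this idea, the sketch below applies a static broadband compressor to a signal: above an arbitrary knee level, the output level grows more slowly than the input level, so that the level differences between soft and loud parts are reduced. The knee, the 2:1 ratio, and the envelope smoothing are arbitrary choices made here for illustration; they are not the compression schemes actually used in hearing aids.

% Minimal static dynamic-range compression: levels above a knee are attenuated;
% adding make-up gain afterwards would raise the soft parts relative to the loud ones.
fs = 44100;
x = randn(fs,1) .* linspace(0.01, 1, fs)';       % test signal rising in level
a1 = exp(-1/(0.010*fs));                         % 10-ms envelope follower coefficient
env = filter(1-a1, [1 -a1], abs(x));             % smoothed envelope
L = 20*log10(env + eps);                         % envelope level in dB
knee = -20; ratio = 2;                           % knee and compression ratio (arbitrary)
Lout = min(L, knee + (L - knee)/ratio);          % compress levels above the knee
gain = 10.^((Lout - L)/20);                      % time-varying gain
y = gain .* x;                                   % compressed signal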
So, it is concluded that this model for the perception of partial loudness is quite well capable of explaining some of the features of loudness perception in everyday situations. But there are still a number of limitations, which will be discussed now.

7.6.2 Limitations of the Partial-Loudness Model

The limitations indicated for the loudness model discussed above in Sect. 7.5.1 also apply to the partial-loudness model. At threshold, certainly at masked threshold, things are somewhat more complex than described above. For instance, the outcome of the model depends on the way thresholds are determined. Another limitation is that temporal fluctuations are not taken into account, which will be discussed in Sect. 7.7. In the partial-loudness model just described, it is taken into account that one may not hear one sound but two sounds, the masker and the signal. The problem is that it must be known in advance which component is perceptually part of the signal and which component is part of the masker. And even when this is known, the mutual temporal relations between the frequency components can be important. This is illustrated in the next demo of Fig. 7.16. It is an elaboration of the demo of Fig. 4.4. The demo of Fig. 7.16 is derived from phenomena described by Warren [91], who called this homophonic induction. The loudness calculations follow the ideas presented in McAdams, Botte, and Drake [46]. A 1000-Hz pure tone is synthesized with a complex temporal envelope, the intensity of which is shown in dB in the top panel of Fig. 7.16. It consists of a combination of two envelopes: the envelope of a sequence of tones and that of a tone slowly increasing in intensity, the “ramp”. The maximum of these two envelopes is the envelope of the stimulus and is displayed in the top panel of Fig. 7.16. The intensity of the slow ramp starts at 50 dB and rises slowly up to 60 dB at the end. The sequence of tones consists of 200-ms pure tones, also of 1000 Hz, with 200-ms inter-onset intervals. These tones are faded in and out in 20 ms up to the maximum intensity of these tones, 60 dB, the intensity of the ramp at its end. Under the assumption that such a sound is perceived as one auditory unit, the estimated loudness of this signal, as calculated with the model described above in Sect. 7.6, is presented in the second panel of Fig. 7.16. First, the tone has an intensity of 50 dB, which corresponds to an estimated loudness of about $2^{(50-40)/10} = 2$ sone; the calculations of the model give 2.10 sone. As the intensity is increased to 60 dB, the estimated loudness increases to about $2^{(60-40)/10} = 4$ sone; the calculations of the model give 4.17 sone. But this does not correspond to what one hears. One does not hear a single pure tone. Actually, two concurrent sounds are heard, one corresponding to the slow ramp, and the other corresponding to an intermittent sequence of tones, each tone starting at the moment where the level rises from the ramp to the 60-dB maximum and stopping when this level falls back to the level of the ramp. Hence, the intensity contour is partitioned into two contours, the ramp and the tones. These contours are shown in dB in the third and the fifth panel of Fig. 7.16, so panel c and panel e. Panel d presents both the estimated loudness and the partial loudness of the tonal part of the sound signal. The estimated loudness is the loudness as estimated when a tonal signal with the envelope as presented in panel c is played; the estimation of the partial loudness assumes the presence of the ramp as masker. It appears that the partial loudness, shown as the continuous line, best approximates what one hears, a sequence of tones first slowly and, at the end, quite suddenly decreasing in loudness.

Fig. 7.16 Example of homophonic induction. Panel a shows in dB the envelope of a pure 1000-Hz tone. But we do not hear a single pure tone of 1000 Hz fluctuating in loudness in correspondence to this envelope. In fact, the percept is partitioned into, first, a sequence of tones decreasing in loudness and, second, a continuous tone gradually increasing in loudness. The presumed envelopes of these two sounds are presented in dB in panels c and e, respectively. The estimated loudness of the tones and their partial loudness in the presence of the ramp are presented in panel d. The estimated loudness of the ramp and its partial loudness in the presence of the tones are presented in panel f. The thick continuous lines in panels d and f indicate the loudness contours of the tones and the ramp, respectively, best corresponding to what is heard. (Matlab) (demo)

Looking at the estimated loudness and partial loudness of the ramp part of the sound, presented in the bottom panel f, one sees that not the partial loudness but the loudness, shown as the continuous line, corresponds to what is heard. The ramp is not heard as interrupted by the tones but, instead, as a continuous tone gradually increasing in loudness. The loudness of this ramp at the start, about 2 sone, corresponds to the intensity of the ramp at the start, 50 dB, and the loudness at the end, about 4 sone, corresponds to the intensity at the end, 60 dB. This phenomenon will be discussed further in the context of the continuity illusion in Sect. 10.9.7.
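For readers who want to reconstruct a stimulus of this kind, the sketch below builds the composite envelope as the maximum of a slow ramp and a tone sequence, in dB, and applies it to a 1000-Hz carrier. It is not the script used for Fig. 7.16; the sampling rate, the tone spacing, the omission of the 20-ms fades, and the full-scale reference are assumptions made here for illustration.

% Sketch of the homophonic-induction stimulus: max of two dB envelopes on a 1000-Hz tone
fs = 44100; dur = 4; t = (0:1/fs:dur-1/fs)';
rampdB = 50 + 10*t/dur;                       % slow ramp from 50 to 60 dB
tonesdB = -inf(size(t));                      % tone sequence: 200-ms tones, assumed 400-ms spacing
for onset = 0:0.4:dur-0.2
    idx = t >= onset & t < onset + 0.2;
    tonesdB(idx) = 60;                        % 60-dB tone segments (fades omitted for brevity)
end
envdB = max(rampdB, tonesdB);                 % composite envelope in dB
x = 10.^((envdB-94)/20) .* sin(2*pi*1000*t);  % carrier scaled re an assumed 94-dB full scale
sound(x, fs)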

It can be concluded that the tones, though higher in intensity, do not mask the ramp, but the ramp partially masks the tones. Apparently, the ramp, as it starts 200 ms earlier than the tone, captures auditory information generated by the tone. As a consequence, the ramp is perceived as continuous. The part of the auditory information allocated to the ramp does not contribute to the loudness of the intermittent tones, which decrease in loudness as a result. This model is confirmed perceptually by the experiments carried out by McAdams, Botte, and Drake [46]. They conclude that the formation of the two auditory units precedes the neural loudness processing. In this description, the temporal aspects of loudness perception are omitted. These temporal aspects were investigated by Drake and McAdams [15] and appeared to be significant. One of the main findings was that the ramp was only perceived as continuous when the relative duration of the intermittent tones was short. When their duration was more than twice as long as the intervals between them, the perception of continuity of the low-level ramp disappeared and the intermittent tones regained the loudness they would have in isolation. The results obtained by Drake and McAdams [15] were very robust, as they were not affected by the instructions given to the listeners or by their level of experience. These experiments illustrate that the loudness of an auditory unit is computed based on auditory information allocated to that auditory unit [33]. What must be emphasized for the moment is that this example of Fig. 7.16 is concerned with a narrow-band signal, a pure tone modulated in amplitude. The tones were faded in and out in 20 ms, enough to prevent audible clicks at the onsets or offsets. Hence, the information within not more than one critical band is involved in processing the auditory information. It is concluded that auditory information from within one critical band can be redistributed over more than one auditory unit. Finally, the above example is presented to show one of the limitations of the partial loudness model. It is an attempt to quantitatively describe the complex relations between masking, sequential integration or fusion, and what Warren et al. [91] called homophonic induction. It is applied to the situation in which both ramp and tone sequence have identical frequencies and phase relations. In the original paper on this subject, Warren et al. [91] describe more complex situations in which the continuously perceived sound may not only lead to the described reduction in loudness of the discontinuous sound sequence but also to a change of its timbre. This will be further discussed in Sect. 10.11. The relation between masking, fusion, and induction is also discussed by Bregman [5, pp. 314–320].

7.7 Loudness of Time-Varying Sounds

The model of loudness perception discussed above applies quite well to sound signals with not too abrupt onsets, a stationary part of at least a few hundred milliseconds, and not too abrupt offsets. For synthetic sounds, the signal is in general exactly known, and this knowledge can be used to calculate the exact spectrum of the sound. For noisy sounds, and certainly for well-defined noises such as white noise, pink noise, or brown noise, the long-term spectrum can be estimated and, based on this, a discrete spectrum with sufficiently close frequency components can be derived.

So, for both stationary synthetic sounds and well-defined noises, it is well possible to obtain the long-term spectrum, which can then be used to calculate the long-term excitation pattern, the long-term specific loudness and, finally, the loudness. For many other stationary sounds, the spectral components can quite reliably be estimated by averaging the spectra over a sufficiently long analysis interval within which the signal can be assumed to be stationary. This spectrum can then be used to reliably estimate the loudness of the signal. Most sounds heard, however, are very variable, and quite a few last less than a few hundred milliseconds. The loudness of these very short sounds is now discussed first.

7.7.1 Loudness of Very Short Sounds

Already in the first part of the last century, it turned out that the loudness of a short tone, say shorter than 100 ms, is considerably less than when it lasts longer. The results from such an early experiment in 1947 by Munson [58] are reproduced in Fig. 7.17. They show that, when the duration of a 70-dB tone is reduced from 1 s to 5 ms, its loudness level is reduced by about 30–40 dB. This corresponds to a loudness reduction from 8 to 1 sone. Inspecting Fig. 7.17, it seems that, after the onset of the tone, the loudness gradually builds up until it reaches a constant value after about 200–500 ms. This suggests an integration mechanism with a time constant of about 100 ms. An integration mechanism of loudness with only one time constant appears not to be enough, however. Based on early measurements of series of action potentials from auditory-nerve fibres in response to tone pips [25], Munson [58] already in 1947 proposed an integration mechanism with two time constants, a shorter and a longer one. This was confirmed by, e.g., Poulsen [71], who proposed a time constant of 5–10 ms for the shorter integration mechanism. For the longer integration mechanism, a time constant of 200 ms was proposed for sounds near threshold, decreasing to 100 ms at higher levels. In order to account for the results of experiments concerned with the loudness of impact sounds, Kumagai, Suzuki, and Sone [42] also propose two integration mechanisms, a shorter one with a time constant of 5 ms and a longer one with a time constant of 125 ms. They applied these integrators in parallel, with a relative attenuation of 16 dB for the shorter integrator. Viemeister and Wakefield [88] discuss the possibility that the hearing system does not integrate information over time, but combines loudness information from successive instants of the sound, referred to as multiple looks. In their stimulus set, they included pairs of very short tone pulses, and determined the threshold of these pulse pairs. Viemeister and Wakefield [88] concluded that the loudness of these short pulse pairs could not be explained based on an integration mechanism, but suggested a relatively dense sampling of the output of the auditory filters, a few hundred looks per second. The information obtained from these looks is then combined in a near-optimum fashion.
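A minimal way to picture such a two-time-constant integration is to run two leaky integrators in parallel on an intensity-like input and combine their outputs. The sketch below does just that; the 5-ms and 125-ms constants follow the values quoted above from Kumagai, Suzuki, and Sone [42], but the simple one-pole filters and the way the branches are combined are assumptions made here for illustration, not a published model implementation.

% Two leaky integrators in parallel, as a crude picture of short- and long-term
% loudness integration. Illustrative only.
fs = 44100;
I = [zeros(round(0.1*fs),1); ones(round(0.005*fs),1); zeros(round(0.3*fs),1)]; % 5-ms burst
leak = @(tau) filter(1-exp(-1/(tau*fs)), [1 -exp(-1/(tau*fs))], I);
shortInt = leak(0.005);                       % 5-ms integrator
longInt  = leak(0.125);                       % 125-ms integrator
combined = 10^(-16/20)*shortInt + longInt;    % 16-dB relative attenuation of the short branch
plot((0:numel(I)-1)/fs, combined)             % a brief burst builds up far less than a long one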

Fig. 7.17 The loudness level, upper panel, and the loudness, lower panel, of a short 1-kHz pure tone as a function of its duration. The data for a 1-kHz pure tone are based on [58, p. 585, Table I], and the thick lines reproduce Figs. 3 and 4 of [58, p. 586]. The upper thin line presents the data for 125 Hz, the dotted line those for 5650 Hz. The crosses in the upper right of the panels are the loudness level and the loudness of a 70-dB 1000-Hz tone. The sound demo first plays a 1000-Hz tone of 1 s, and then the tones with the durations indicated in the figure, from right to left, 200, 100, 40, 10, and 5 ms. (Matlab) (demo)

The situation is complicated by experimental factors [76]. For instance, in experiments in which two sounds are compared in loudness, order effects appear to play a role. Buus, Florentine, and Poulsen [6] remark that these order effects complicate the interpretation of earlier experiments, e.g., those reported by Munson [58]. Buus, Florentine, and Poulsen [6] report adjustment experiments with two tones in which listeners were asked to adjust the level of one tone to that of the other. Different results were obtained depending on whether the first of the two sounds was fixed in level and the task was to adjust the level of the second sound to make it equal in loudness to the first sound, or whether the second sound was fixed in level and the first was varied by the participant. Consistent results were only obtained when the results of these two conditions were combined. Another complicating factor is that the loudness of a sound depends on its temporal envelope. Sounds that slowly rise in intensity and suddenly fall, called ramped sounds, are perceived as louder than their temporal inversions, which have sudden onsets and slow offsets, called damped sounds [80]. Moreover, as shown in Fig. 5.1, the beat of ramped sounds is later with respect to the acoustical beginning of the sound than that of damped sounds. Furthermore, ramped sounds have perceived durations that are longer than those of their time-inverted, damped counterparts [31, 32, 77]. This will be discussed in more detail and in a wider perspective in Sect. 9.9.3.

7.7.2 Loudness of Longer Time-Varying Sounds

Most sounds heard by human listeners, such as speech and music, contain essentially time-varying elements, not only at onsets and offsets, but also at transitions between the elements that constitute the sound. These non-stationary parts, in particular the onsets, play perceptually a very important and diverse role, a role discussed in Chaps. 5 and 6. For such sounds, it is no longer adequate to choose a shorter or longer analysis interval and, based on the spectrum of this sound segment, to do the loudness calculations. Already in the 1930s, Fletcher and Munson [23] indicated the necessity to adapt the loudness estimations for variable signals such as music and speech: “In addition, it is necessary to determine experimentally to what extent the proposed computation method may be applicable to the types of fluctuating complex tones encountered in speech, music and noise” (p. 65). For sounds such as music and speech, listeners on the one hand clearly observe temporal fluctuations in loudness, but on the other hand are often quite consistent in comparing the relative loudness of longer stretches of speech and music, stretches comprising several words in speech or several tones in music. One instrument can play louder than another, or the dynamic effort with which it is played can be increased. Similarly, one speaker can speak louder than another or raise the level of his or her voice. This indicates that listeners also have a concept of loudness for intervals of speech of more than a few syllables and for music of more than a few tones. As to speech, Fastl [20] calls this “the loudness of running speech.” Others have used various attributes to distinguish the percept of loudness of longer stretches of sound. Some use the term overall loudness [43, 78], others use the term long-term loudness [26]. The term global loudness is also used, but this term is used for one single sound, e.g., a pure tone of a relatively long duration, say longer than 1 s, with a changing level [84]. Global loudness is then defined as the “overall loudness of a sound over its entire duration” [70, p. 1083]. Here, the oldest term will be used, overall loudness. In the 1970s, various studies on the loudness of speech and speech-like sounds were carried out. The question asked was whether the overall loudness of the speech sounds was determined by the time average of the loudness contour, its peak values, or some other measure. Fastl [20] asked listeners to compare the overall loudness of speech with that of noise filtered in such a way that its spectrum resembled the long-term spectrum of speech. He found that listeners perceived this speech-simulating noise as equally loud as the speech when its level equalled the maximum sound level of the speech. In other words, the speech is perceived as loud as its maxima, which will generally be the loudness the speech attains during the syllabic nuclei, so mostly the vowels. The conclusion that the loudest parts of a stretch of sound determine the overall loudness was also drawn for a variety of environmental sounds, such as those of trains, shopping arcades, helicopters, or blowing hammers [78]. Zwicker [92] published a “loudness meter” not only for calculating the loudness of running speech, but also of time-varying sounds in general, such as the very short tones discussed in the previous paragraph, amplitude-modulated tones, frequency-modulated tones, and noise of varying bandwidths.

Based on the output of 24 adjacent critical-band filters covering the whole Bark scale, the device starts by calculating instantaneous specific-loudness patterns. The resulting time series are then subjected to a simulation of forward and backward masking. When applied to speech, the outcome reflects the syllable structure of the speech signal, with loudness maxima at the vowels, and minima in between. When used for estimating the loudness of speech in real-time situations, Zwicker incorporated the finding by Fastl [20] that running speech is perceived as loud as its maxima. So, the loudness of speech was estimated based on the successive maxima measured by the loudness meter. This aspect was studied in more detail by Fastl [20]. He asked listeners to adjust the loudness of noise with the same long-term spectrum as speech to the loudness perceived in running speech. He found that the loudness that is exceeded 7% of the time is the loudness of the running speech. This confirms that the loudest parts of the speech signal, generally the vowels, determine the overall loudness of speech [20, 92]. The authors, Zwicker and Fastl, do not mention loudness as an auditory attribute of a syllable, but these findings can be interpreted as indicating that the syllable nucleus, mostly its vowel, determines its loudness, and that the loudness of the syllables determines the loudness of the speech. The pioneering work by Zwicker and Fastl was carried out in Munich, Germany, where the tonotopic frequency scale was expressed in Bark. The “loudness meter” by Zwicker [92] for time-varying sounds was expanded considerably, e.g., by including a simulation of forward and backward masking, by Chalupper and Fastl [9], who additionally adapted it in such a way that it could be applied to loudness perception by people with cochlear hearing loss. This, in turn, was expanded and refined by Rennies et al. [74]. In Cambridge, England, the tonotopic array was redefined based on the equivalent rectangular bandwidth (ERB) of the auditory filter as measured with notched noise. Distances along this array were expressed in numbers of ERBs, later called Cams. Based on this Cam scale, a revision of Zwicker’s model of loudness perception was published by Moore and Glasberg [53]. Moreover, “Cambridge” also presented a model for the estimation of partial loudness [55]. A model extending this model of loudness and partial-loudness perception to time-varying sounds was presented by Glasberg and Moore [26]. In order to account for the time-varying properties of the sound signals, the loudness computation is no longer based on the Fourier spectrum of one fixed stimulus interval. Glasberg and Moore [26] start from intervals of six different durations, 2, 4, 8, 16, 32, and 64 ms, sampled at 32 kHz. These six intervals are Hanning-windowed and zero-padded to 2048 sample points. The 2-ms interval is used to calculate the spectrum between 4050 and 15000 Hz, the 4-ms interval for frequencies between 2530 and 4050 Hz, the 8-ms interval for frequencies between 1250 and 2530 Hz, the 16-ms interval for frequencies between 500 and 1250 Hz, the 32-ms interval for frequencies between 80 and 500 Hz, and the 64-ms interval for frequencies between 20 and 80 Hz. The thus obtained frequency components are used to calculate the excitation patterns at four points per ERB, with which the specific loudness and the total loudness are calculated.
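The sketch below illustrates the idea of such a multi-resolution analysis: longer windows for lower frequencies, shorter windows for higher frequencies, all combined into one spectrum per analysis instant. It follows the durations and band edges quoted above, but the implementation details (test signal, analysis instant, windowing code) are assumptions made here and do not reproduce Glasberg and Moore's actual code.

% Multi-resolution spectrum around one analysis instant (illustrative sketch)
fs = 32000; nfft = 2048;
durMs  = [64 32 16 8 4 2];                        % window durations in ms
bandLo = [20 80 500 1250 2530 4050];              % lower band edges in Hz
bandHi = [80 500 1250 2530 4050 15000];           % upper band edges in Hz
x = randn(fs,1); centre = fs/2;                   % some signal and an analysis instant
f = (0:nfft-1)'*fs/nfft;                          % FFT bin frequencies
spectrum = zeros(nfft,1);
for k = 1:6
    n = round(durMs(k)*1e-3*fs);                  % window length in samples
    seg = x(centre-n+1:centre) .* hanning(n);     % Hanning-windowed segment
    X = abs(fft(seg, nfft)).^2;                   % zero-padded power spectrum
    inBand = f >= bandLo(k) & f < bandHi(k);      % keep only this window's band
    spectrum(inBand) = X(inBand);
end
% 'spectrum' now combines fine spectral detail at low frequencies with fine temporal
% detail at high frequencies; it would feed the excitation-pattern stage.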
This is updated every 1 ms, yielding what is called the instantaneous loudness, an abstract time series of loudness estimations that does not reflect a really perceived attribute of the sound.

In order to account for the dynamic effects of loudness perception, the Cambridge tradition did not include forward and backward masking, as the Munich tradition did. Instead, Moore et al. [57] included a gain control with relatively short time constants to calculate short-term loudness and a gain control with relatively long time constants to calculate long-term loudness. They specify: “Short-term loudness is meant to represent the loudness of a short segment of sound, such as a word or a single musical note, whereas long-term loudness is meant to represent the overall loudness of a longer segment of sound, such as a sentence or a musical phrase” (p. 1504). Both in the Munich and the Cambridge tradition, the models of loudness perception have been refined, adapted, and extended in various ways. It was just mentioned that, in the Munich tradition, Chalupper and Fastl [9] presented a model of loudness perception by people with cochlear hearing loss. Similarly, a model for estimating the loudness of time-varying sounds is presented for listeners with cochlear hearing loss by Moore and Glasberg [52]. The application of these models for estimating loudness perception and simulating cochlear hearing loss is comprehensively reviewed in Moore [49]. Glasberg and Moore [28] extended their 2002 model of loudness perception of time-varying sounds [26] with a model of loudness perception in the presence of background sounds, in other words with a model of partial-loudness perception. Another important extension of the Cambridge model is presented by Chen et al. [12]. These authors replace the single roex filter described by Glasberg and Moore [27] by a double-roex filter, one representing the passive response of the basilar membrane, the other the cochlear amplifier. The strength of this model is that the transformation from excitation pattern to specific-loudness distribution is no longer necessary. Loudness can be calculated directly by integrating the excitation pattern across the tonotopic array. This model has been extended with a model for the perception of partial loudness [11] and for application to time-varying sounds [10, 11]. A final aspect that has been neglected up to now is that, in auditory-unit formation, information from the two ears is combined. In the model for loudness perception of steady sounds [55] and that for time-varying sounds [26], it is assumed that the information from both ears is combined in such a way that the loudnesses as calculated for each ear individually can simply be added. This is called binaural loudness summation. This appears not to be correct. What is more, it can be assumed that strong information from one ear can inhibit weaker information from the other ear. As a result of this contralateral binaural inhibition, the loudness of binaurally presented sounds is generally less than twice the loudness of monaurally presented sounds. In order to account for this, the Cambridge model for loudness perception has been modified [29, 54]. This model, in turn, has been extended for application to time-varying sounds [56]. Some refinements are presented by Moore et al. [57]. The performance of the Munich and the Cambridge models for loudness perception has been tested and evaluated for a number of different signals, e.g., by Rennies, Verhey, and Fastl [73] and Rennies, Holube, and Verhey [72].
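The distinction between instantaneous, short-term, and long-term loudness can be pictured with attack-release smoothers applied to the instantaneous-loudness series. The sketch below is only a caricature of that idea; the attack and release time constants are invented here for illustration and are not the values published in [26] or [57].

% Attack-release smoothing of an instantaneous-loudness series (illustrative only)
fs = 1000;                                            % one instantaneous-loudness value per ms
inst = [2*ones(1,300) 6*ones(1,50) 2*ones(1,650)];    % a 50-ms loud event, in sone
shortTerm = arSmooth(inst, 0.025, 0.050, fs);         % fast attack and release: follows the event
longTerm  = arSmooth(inst, 0.100, 2.000, fs);         % slow attack and release: overall loudness
plot((0:999)/fs, [inst; shortTerm; longTerm]')

function y = arSmooth(x, tauA, tauR, fs)
% One-pole smoother with separate attack and release time constants.
aA = exp(-1/(tauA*fs)); aR = exp(-1/(tauR*fs));
y = zeros(size(x)); y(1) = x(1);
for n = 2:numel(x)
    if x(n) > y(n-1), a = aA; else, a = aR; end       % attack when rising, release when falling
    y(n) = a*y(n-1) + (1-a)*x(n);
end
end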
The model by Glasberg and Moore [26] was tested by Swift and Gee [86] for very intense sounds from jet aircraft. Due to non-linear propagation properties, such sounds are characterized by very abrupt increases in sound pressure.

The rate of these sudden increases is in the range of roughness perception. Swift and Gee [86] conclude that the short-term loudness estimation by Glasberg and Moore [26] does not yield good estimates of the loudness of this kind of noise sound. They suggest including instantaneous loudness in the estimations, but indicate the need for additional research and for including other perceptual attributes such as roughness and sharpness in the considerations. Rennies et al. [75] tested both the Munich and the Cambridge models on a number of technical sounds, such as those of a hammer, a snare drum, a diesel engine, or a helicopter. They concluded that the accuracy of the models depended strongly on the type of sound. The loudness of steady sounds was generally well estimated, but no model could correctly estimate the loudness of all tested sounds, especially those with spectrally and temporally strongly varying characteristics. A comprehensive review is presented by Ward [89]. Up to now, all models presented were based on measurements of the spectrum of the sound. This spectrum was used to calculate the excitation pattern and, except in the models by Chen et al. [11, 12] and Chen and Hu [10] that can omit this step, the specific-loudness distribution. In a review of the Cambridge model for loudness perception, Moore [50, p. 23] writes that, because of this, “A time-domain auditory filter bank is needed to produce versions of the models applicable to time-varying sounds.” The first to do so were Pieper et al. [66, 67]. They calculated the excitation pattern along the basilar membrane, not with gammatone or gammachirp filters, nor with single- or double-roex filters, but based on a transmission-line simulation of the mechanics of the basilar membrane (for a monograph, see Duifhuis [16]). In this approach, the transformation from excitation to specific loudness is also unnecessary. Moreover, they use the ERBN-number or Cam scale for distances along the tonotopic array. For time-varying sounds, the loudness estimates are integrated with a time constant of 25 ms and, in agreement with the finding that the loudness of a longer time-varying sound is determined by its loudest parts, the maximum estimated loudness in an interval is chosen as the loudness estimate of the whole interval. This model has been extended for application to the loudness perception of listeners with cochlear hearing loss by Pieper et al. [67]. For this, the authors introduce a central component in the model, dubbed “post gain”, which “amplifies the signal parts above the internal threshold and can better account for individual variations in the overall steepness of loudness functions and for variations in the uncomfortable level which are independent of the hearing loss. The post gain can be interpreted as a central gain occurring at higher stages as a result of peripheral deafferentation” (p. 917). It is concluded that modelling loudness perception has resulted in quite a few successful models, both for people with and without cochlear hearing loss. Cochlear hearing loss is mostly due to malfunctioning of the inner and outer hair cells, and, apparently, our understanding of their operation is at a level at which it appears possible to model their effect on loudness perception also when they do not function properly.
This applies in particular to the models in which the operation of the inner and the outer hair cells is modelled by separate filters, a passive filter for the inner hair cells and an active filter for the outer hair cells, such as the models developed by Chen et al. [12] and Pieper et al. [66].

Only for very impulsive sounds and sounds with complex, changing spectrotemporal characteristics do considerable discrepancies between loudness judgments by listeners and model estimations remain.

7.8 Concluding Remarks

Based on the models for loudness perception discussed so far, it is well possible to describe various aspects of loudness perception quite accurately. These estimations are expressed in sones, the units of a perceptual scale on which the loudness of sounds can be arranged from soft to loud. The model is based on the excitation pattern induced by the frequency components of the sound. Quite accurate loudness estimations can be obtained for stationary sounds that are played in isolation and are perceived as a single auditory unit. In the mid-range of intensity levels, this loudness scale quite closely follows Stevens’ law. Moreover, this model specifies the simple rule of thumb for loudness perception: “Within a critical band, intensities add up; outside a critical band, loudnesses add up”, or, when two sounds are played with the same intensity level but the excitation pattern of one of these sounds is distributed over a larger part of the tonotopic array than the excitation pattern of the other, the sound with the more widely distributed excitation pattern sounds louder. Furthermore, at low and very high intensity levels, in particular just above threshold, loudness increases more rapidly with intensity than at mid-range levels. Another important aspect of the presented model for the perception of loudness and partial loudness is that it can quite accurately estimate masking phenomena. If the frequency components of two sounds are close together, the loudness of one sound is reduced in the presence of the other, close together in the sense that the excitation patterns of the two sounds overlap. If the excitation pattern of one sound completely covers that of the other, masking is complete and the masked sound is below hearing threshold. When the intensity of the masked sound is subsequently increased up to above the threshold, the loudness increases rapidly immediately after crossing the threshold, much faster than in isolation. At 15 to 20 dB above threshold, the loudness of the sound is almost completely restored to the level it would have in silence. Except for one, all models of loudness perception discussed are based on information from the amplitude spectrum of the incoming sounds. Only the model by Pieper et al. [66] uses the temporal envelope of the output of the auditory filters as input. None of the models discussed uses temporal fine structure for estimating the loudness of a sound. In the study of loudness perception of time-varying sounds, listeners are often asked to judge the overall loudness but hear something that is changing in loudness. For instance, in their presentation of the loudness model of time-varying sounds, Glasberg and Moore [26] mention that the long-term loudness of amplitude-modulated sounds is, for modulation frequencies lower than 10 Hz, not equal to the maxima of the short-term loudness, but turns out to be somewhat lower. Moreover, they remark that, for those frequencies, “The listeners complain that it is difficult to make a judgment of overall loudness because the loudness is continually changing” (p. 338).

Another important aspect of the presented model for the perception of loudness and partial loudness is that it can quite accurately estimate masking phenomena. If the frequency components of two sounds are close together, in the sense that the excitation patterns of the two sounds overlap, the loudness of one sound is reduced in the presence of the other. If the excitation pattern of one sound completely covers that of the other, masking is complete and the masked sound is below hearing threshold. When the intensity of the masked sound is subsequently increased to above the threshold, the loudness increases rapidly immediately after crossing the threshold, much faster than in isolation. At 15 to 20 dB above threshold, the loudness of the sound is almost completely restored to the level it would have in silence.

Except for one, all models of loudness perception discussed are based on information from the amplitude spectrum of the incoming sounds. Only the model by Pieper et al. [66] uses the temporal envelope of the output of the auditory filters as input. None of the models discussed uses temporal fine structure for estimating the loudness of a sound.

In the study of loudness perception of time-varying sounds, listeners are often asked to judge the overall loudness, but hear something that is changing in loudness. For instance, in the presentation of their loudness model for time-varying sounds, Glasberg and Moore [26] mention that the long-term loudness of amplitude-modulated sounds is, for modulation frequencies lower than 10 Hz, not equal to the maxima of the short-term loudness, but turns out to be somewhat lower. Moreover, they remark that, for those frequencies, "The listeners complain that it is difficult to make a judgment of overall loudness because the loudness is continually changing" (p. 338).

According to Glasberg and Moore's model, these long-term loudness fluctuations amount to only 0.5 phon at 10 Hz, but apparently they are perceptually significant. Fluctuation rates of 10 Hz or lower are in the rhythm range. This means that listeners are actually asked to judge the overall loudness of a sequence of auditory units in a stream.

Similar problems occur when listeners are asked to judge the loudness of sounds slowly rising or falling in intensity. Above, it was mentioned that Stecker and Hafter [80] found that ramped sounds are perceived as louder than their time-reversed counterparts, damped sounds. The sounds used by these authors were short 330-Hz pure-tone pips or noise bursts with an effective duration of not more than one or two hundred milliseconds, so they are perceived as short sounds with a well-defined loudness. This changes when longer sounds are used. Loudness perception for such longer sounds has been studied with stimuli whose amplitudes varied systematically in time. By asking listeners to indicate which of two sounds is louder, one can try to find out how much weight listeners attribute to the various parts of the sounds. It turned out that, in most of these experiments, listeners attribute more weight to the start of the stimulus than to its end. This is called the primacy effect [59, 65]. The primacy effect was studied more systematically by Oberfeld, Hots, and Verhey [62]. Using wide-band noise with varying intensities, they showed that "The primacy effect can be described by an exponential decay function with a time constant of about 200 ms" (p. 952). Besides the primacy effect, there is also the recency effect, which means that listeners base their judgments on what they heard last [69]. In general, the recency effect is weaker and more variable than the primacy effect [62] and only expresses itself more systematically for stimulus durations of more than a few seconds [63]. These effects were confirmed for loudness judgments of sounds varying in intensity in the presence of noise by Fischenich et al. [21]. They show that listeners can differ in the strategies they use in making their judgments, resulting in large individual differences [68].
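The exponential weighting reported by Oberfeld, Hots, and Verhey [62] can be visualized with the short sketch below. The division of the stimulus into ten 100-ms segments is an assumption made only to obtain a concrete weight profile; it is not part of the published analysis.

```matlab
% Sketch of a primacy-effect weight profile: weights decaying exponentially
% with a time constant of about 200 ms, as reported in [62].
% The division into ten 100-ms segments is an illustrative assumption.
tau   = 0.2;                     % time constant in seconds
t_seg = (0:9) * 0.1;             % onsets of ten 100-ms segments
w     = exp(-t_seg / tau);       % raw primacy weights
w     = w / sum(w);              % normalize so that the weights sum to 1
stem(t_seg, w);
xlabel('Segment onset (s)');
ylabel('Normalized temporal weight');
```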
Another phenomenon of sounds of varying intensity expresses itself in experiments in which listeners are not asked to judge the loudness of a sound globally, but are asked to continuously judge the loudness of sounds increasing or decreasing in intensity. For instance, Canévet and Scharf [8] asked listeners to continuously judge the loudness of a sound slowly rising or falling in intensity. The level range varied between 40 and 80 dB. It appeared that listeners judged the loudness at the end of the falling sound much lower than at the start of the rising sound, though their intensities were the same. This phenomenon is called decruitment, not to be confused with loudness recruitment, briefly mentioned in Sect. 7.6.

To complicate matters further, a perhaps contradictory phenomenon is described by Neuhoff [61]. He also presented listeners with sounds, synthetic vowels in this case, increasing or decreasing in intensity by the same amount. He, however, did not ask listeners to judge the loudness of these sounds continuously but, instead, asked them to judge the loudness change on a scale from "no change" to "large change". If the results of the experiments on decruitment could be extrapolated to these experiments, one would expect that the loudness change would be perceived as larger for the sounds falling in intensity than for the sounds rising in intensity.


The opposite, however, appears to be the case: Changes in loudness are perceived as larger for sounds rising in intensity than for sounds falling in intensity by the same amount. Neuhoff [60] explained this effect by assuming that listeners associate sounds of increasing intensity with looming sounds, hence with sounds that approach the listener and, therefore, are more urgent and may require immediate action. Apparently, the perception of the loudness of a sound source may not be independent of the perception of its location. The impatient reader is referred to Olsen [64], who presents a review of these issues. For the moment, it is concluded that increases in intensity can be associated with increases in loudness, in perceived effort, in perceived urgency, or with decreases in the perceived distance between listener and sound source. These associations will be discussed in more detail in Sect. 9.9.3.

It has been shown that many phenomena in loudness perception can be explained based on the assumptions underlying the various models of loudness perception that have been discussed. Some discrepancies remain, which can have different backgrounds. For instance, it appears that, for wide-band sounds, the edge bands of the spectrum contribute more to loudness than predicted by the models, an effect that is especially strong for the upper edge band, probably owing to upward spread of masking [41]. Another issue is that tonal sounds are often judged louder than noisy sounds with similar spectra. The issue is complex, e.g., because tonal sounds are often judged more annoying than noisy sounds [44]. The reader is referred to Jesteadt, Wróblewski, and High [40] for a recent discussion.

In all models of loudness perception presented in this chapter, it is assumed that the sound components that contribute to the loudness of an auditory unit are known in advance. This situation arises, e.g., when a masker is played continuously while a target sound is switched on, played for one or two hundred milliseconds, and then switched off again. In this case, one knows precisely which frequency components belong to the masker and which to the target sound. Loudness estimations in mixtures of sounds in everyday circumstances are much more complicated, however. This applies, e.g., to situations in which various speakers speak together or musical instruments play simultaneously. In the models for loudness and partial loudness, it is assumed to be known which components belong to the masker and which belong to the signal. Except in laboratory conditions, this will generally not be the case. Moreover, the demo of Fig. 7.16 illustrated that information from a single sound component that is processed within the bandwidth of one auditory filter can be distributed over more than one auditory unit or stream. In this example, the proportion of the signal energy contributing to one auditory unit and the proportion contributing to the other could be estimated based on the way the stimulus was synthesized. When this is not possible, models of loudness perception have to include a front end that simulates the "auditory organization processes that group the bits of acoustic information together" [45, p. 56] into auditory units. Such processes have been discussed in Chap. 4 and will further be discussed in Chap. 10.
Before models of loudness perception can operate satisfactorily in more general situations, models of such processes have to be included.


As far as could be ascertained, only Bradter and Hobohm [4] have emphasized the need to include the formation of auditory units into models of loudness perception.

It is concluded that various problems become manifest in discussions of experiments in which the sound stimuli persist in time. It has already been shown that a distinction is made between short-term loudness and long-term loudness. The only specification for this is by Moore et al. [57, p. 1504]: "Short-term loudness is meant to represent the loudness of a short segment of sound, such as a word or a single musical note, whereas long-term loudness is meant to represent the overall loudness of a longer segment of sound, such as a sentence or a musical phrase." In the context of auditory scene analysis, it is natural to associate short-term loudness with an auditory unit, such as a syllable in speech or a single tone in a musical melody, and to associate long-term loudness with an auditory stream, such as a speech utterance or a musical phrase. For speech, one can then associate the loudness of a syllable with the loudness of the syllable nucleus, as suggested above. The loudness of an utterance can then be considered as a loudness contour connecting the short-term loudnesses of the separate syllables. This is analogous to the method of the "Prosogram" used in speech for the analysis of intonation contours [14, 47].

In speech, not only the loudness but also the pitch is often perceptually not well-defined outside the syllable nucleus. Moreover, when the pitch of a speech segment outside the syllable nucleus is well-defined, it appears hardly, if at all, to contribute to the pitch contour of the utterance as it is perceived. The pitch of a syllable is perceived as constant when the vowel is short; when the vowel lasts long, a change in pitch can be perceived as a rise, a fall, or a combination of the two, which then has important communicative value. A review is presented by Hermes [36]. Similarly, one can consider the loudness contour of a speech utterance as a contour connecting the reliable short-term loudnesses of the separate syllable nuclei. If the nucleus is short, the loudness is perceived as constant but, when the nucleus lasts longer, rises and falls in loudness, and even successions of them, will start to have communicative value. An analogous procedure can be followed for the analysis of the loudness contours of musical melodies.

Note, however, that up to now the importance of rapid increments in intensity has been ignored, the rapid increments to which Chap. 5 was dedicated. These beats induce the rhythm in auditory streams, a subject that still has to be discussed. This will be done in Sects. 10.5 and 10.12.3.3.
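The distinction between a frame-by-frame loudness trace and short-term loudness quoted above is, in models such as [26, 57], obtained by smoothing with different attack and release times. The sketch below shows such a generic smoother; the frame rate, the time constants, and the toy input are placeholders chosen for illustration, not the values published with those models.

```matlab
% Generic attack/release smoother for deriving a short-term loudness trace
% from a frame-by-frame loudness trace. All parameter values are placeholders.
fr    = 500;                            % frame rate in Hz (2-ms frames)
tau_a = 0.05;  tau_r = 0.5;             % attack and release time constants (s)
a_a   = exp(-1/(tau_a*fr));             % per-frame attack coefficient
a_r   = exp(-1/(tau_r*fr));             % per-frame release coefficient

N_inst = [zeros(1,100) 8*ones(1,200) zeros(1,300)];   % toy loudness trace (sone)
N_st   = zeros(size(N_inst));                         % short-term loudness
for n = 2:numel(N_inst)
    if N_inst(n) > N_st(n-1), a = a_a; else, a = a_r; end
    N_st(n) = a*N_st(n-1) + (1-a)*N_inst(n);          % one-pole smoothing
end
plot((0:numel(N_st)-1)/fr, N_st);
xlabel('Time (s)'); ylabel('Short-term loudness (sone)');
```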

References

1. Allen JB (1996) Harvey Fletcher's role in the creation of communication acoustics. J Acoust Soc Am 99(4):1825–1839. https://doi.org/10.1121/1.415364
2. ANSI (1994) ANSI S1.1-1994. American national standard acoustical terminology. New York
3. ASA (1973) American national psychoacoustical terminology
4. Bradter C, Hobohm K (2008) Loudness calculation for individual acoustical objects within complex temporally variable sounds. Audio Eng Soc Conv 124 (17–20 May 2008, Amsterdam, Netherlands), pp 1720–1731. http://www.aes.org/e-lib/browse.cfm?elib=14624


5. Bregman AS (1990) Auditory scene analysis: the perceptual organization of sound. MIT Press, Cambridge, MA 6. Buus S, Florentine M, Poulsen T (1997) Temporal integration of loudness, loudness discrimination, and the form of the loudness function. J Acoust Soc Am 101(2):669–680. https://doi. org/10.1121/1.417959. 7. Camerer F (2010) On the way to loudness nirvana – audio levelling with EBU R 128. EBU technical review 2010 (June 2010), p 7. 8. Canévet G, Scharf B (1990) The loudness of sounds that increase and decrease continuously in level. J Acoust Soc Am 88(5):2136–2142. https://doi.org/10.1121/1.400110. 9. Chalupper J, Fastl H (2002) Dynamic loudness model (DLM) for normal and hearing-impaired listeners. Acta Acust United Acust 88:378–386 10. Chen Z, Hu G (2012) A revised method of calculating auditory exciation patterns and loudness for timevarying sounds. In: Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012). Kyoto, Japan, pp 157–160. https:// doi.org/10.1109/ICASSP.2012.6287841. 11. Chen Z et al (2011) A new method of calculating auditory excitation patterns and loudness for steady sounds. Hear Res 282:204–215. https://doi.org/10.1016/j.heares.2011.08.001 12. Chen Z et al (2011) A new model for calculating auditory excitation patterns and loudness for cases of cochlear hearing loss. Hear Res 282:69–80. https://doi.org/10.1016/j.heares.2011.09. 007 13. Culling JF, Edmonds BA (2007) Interaural correlation and loudness. In: Kollmeier B et al (eds) Hearing: from sensory processing to perception. Springer, Berlin, Heidelberg, Chap. 39, pp 359–368. https://doi.org/10.1007/978-3-540-73009-5_39. 14. d’Alessandro C, Mertens P (1995) Automatic pitch contour stylization using a model of tonal perception. Comput Speech Lang 9(3):257–288. https://doi.org/10.1006/csla.1995.0013. 15. Drake C, McAdams S (1999) The auditory continuity phenomenon: role of temporal sequence structure. J Acoust Soc Am 106(6):3529–3538. https://doi.org/10.1121/1.428206. 16. Duifhuis H (2012) Cochlear mechanics: introduction to a time domain analysis of the nonlinear cochlea. Springer Science & Business Media, New York. https://doi.org/10.1007/978-1-44196117-4. 17. EBU.UER (2010) EBU – recommendation R 128: loudness normalisation and permitted maximum level of audio signal. Geneva, Switzerland, p 5. 18. EBU.UER (2011) EBU – TECH 3342: loudness range: a measure to supplement loudness normalisation in accordance with EBU R 128. Geneva, Switzerland, p 9. 19. EBU.UER (2011) EBU – TECH 3343: practical guidelines for production and implementation in accordance with EBU R 128. Geneva, Switzerland, p 44. 20. Fastl H (1977) Zur Lautheit flieSSender Sprache. Zeitschrift für Gehörgeräte-Akustik 16:2–13 21. Fischenich A et al (2019) Temporal weights in loudness: investigation of the effects of background noise and sound level. PLoS ONE 14(11), e0223075, p 19. https://doi.org/10.1371/ journal.pone.0223075. 22. Fletcher H (1940) Auditory patterns. Rev Mod Phys 12(1):47–55. https://doi.org/10.1103/ RevModPhys.12.47. 23. Fletcher H, Munson WA ( 1933) Loudness of a complex tone, its definition, measurement and calculation [Abstract]. J Acoust Soc Am 5(1):65. https://doi.org/10.1121/1.1915633. 24. Fletcher H, Munson WA (1933) Loudness, its definition, measurement and calculation. Bell Labs Tech J 12(4):377–430. https://doi.org/10.1002/j.1538-7305.1933.tb00403.x. 25. Galambos R, Davis H (1943) The response of single auditory-nerve fibers to acoustic stimulation. 
J Neurophysiol 6(1):39–57. https://doi.org/10.1152/jn.1943.6.1.39. 26. Glasberg BR, Moore BC (2002) A model of loudness applicable to time-varying sounds. J Audio Eng Soc 50(5):331–342. http://www.aes.org/e-lib/browse.cfm?elib=11081 27. Glasberg BR, Moore BC (1990) Derivation of auditory filter shapes from notched-noise data. Hear Res 47(1-2):103–138. https://doi.org/10.1016/0378-5955(90)90170-T. 28. Glasberg BR, Moore BC (2005) Development and evaluation of a model for predicting the audibility of time-varying sounds in the presence of background sounds. J Audio Eng Soc 53(10):906–918. http://www.aes.org/e-lib/browse.cfm?elib=13391


29. Glasberg BR, Moore BC (2010) The loudness of sounds whose spectra differ at the two ears. J Acoust Soc Am 127(4):2433–2440. https://doi.org/10.1121/1.3336775. 30. Gockel HE, Moore BC, Patterson RD (2002) Influence of component phase on the loudness of complex tones. Acta Acust United Acust 88(3):369–377 31. Grassi M, Darwin, CJ (2006) The subjective duration of ramped and damped sounds. Percept & Psychophys 68(8):1382–1392. https://doi.org/10.3758/BF03193737. 32. Grassi M, Mioni G (2020) Why are damped sounds perceived as shorter than ramped sounds?. Atten Percept & Psychophys 82(6):2775–2784. https://doi.org/10.3758/s13414-020-02059-2. 33. Grimault N, McAdams S, Allen JB (2007) Auditory scene analysis: a prerequisite for loudness perception. In: Kollmeier B et al (eds) Hearing: from sensory processing to perception. Springe, Berlin, Heidelberg, Chap. 32, pp 295–302. https://doi.org/10.1007/978-3-540-73009-5_32. 34. Haghbayan H, Coomes EA, Curran D (2020) Temporal trends in the loudness of popular music over six decades. J Gen Intern Med 35(1):394–395. https://doi.org/10.1007/s11606019-05210-4. 35. Hartmann WM (1993) Auditory demonstrations on compact disk for large N. J Acoust Soc Am 93(1):1–16. https://doi.org/10.1121/1.405645. 36. Hermes DJ (2006) Stylization of pitch contours. In: Sudhoff S et al (eds) Methods in empirical prosody research. Walter De Gruyter, Berlin, pp 29–62. https://doi.org/10.1515/ 9783110914641.29. 37. Houtsma AJ, Rossing TD, Wagenaars WM (1987) Auditory demonstrations. Institute for Perception Research (IPO), Northern Illinois University, Acoustical Society of America, Eindhoven, Netherlands. https://research.tue.nl/nl/publications/auditory-demonstrations 38. Hove MJ, Vuust P, Stupacher J (2019) Increased levels of bass in popular music recordings 1955–2016 and their relation to loudness. J Acoust Soc Am 145(4):2247–2253. https://doi. org/10.1121/1.5097587. 39. ITU-R (2012) Recommendation ITU-R BS.1770-3: algorithms to measure audio programme loudness and truepeak audio level. Geneva, Switzerland, pp i–ii, 1–22 40. Jesteadt W, Wróblewski M, High R (2019) Contribution of frequency bands to the loudness of broadband sounds: tonal and noise stimul. J Acoust Soc Am 145(6):3586–3594. https://doi. org/10.1121/1.5111751. 41. Jesteadt W et al (2017) Relative contributions of specific frequency bands to the loudness of broadband sounds. J Acoust Soc Am 142(3):1597–1610. https://doi.org/10.1121/1.5003778. 42. Kumagai M, Suzuki Y, Sone T (1984) A study on the time constant for an impulse sound level meter (A study on the loudness of impact sound. V). J Acoust Soc Jpn (E) 5:31–36.https://doi. org/10.1250/ast.5.31. 43. Kuwano S, Namba S (1985) Continuous judgment of level-fluctuating sounds and the relationship between overall loudness and instantaneous loudness. Psychol Res 47(1):27–37. https:// doi.org/10.1007/BF00309216 44. Lee J, Wang LM (2018) Development of a model to predict the likelihood of complaints due to assorted tone-in-noise combinations. J Acoust Soc Am 143(5):2697–2707. https://doi.org/ 10.1121/1.5036731. 45. McAdams S (2013) Musical timbre perception. In: Deutsch D (ed) The psychology of music. Elsevier, Amsterdam, Chap. 2, pp 35–67. https://doi.org/10.1016/B978-0-12-3814609.00002-X. 46. McAdams S, Botte M-C, Drake C (1998) Auditory continuity and loudness computation. J Acoust Soc Am 103(3):1580–1591. https://doi.org/10.1121/1.421293. 47. Mertens P (2004) The prosogram: semi-automatic transcription of prosody based on. 
In: Proceedings of the International Conference on Speech Prosody (23–26 March 2004, Nara, Japan). p 4. https://www.isca-speech.org/archive_open/sp2004/sp04_549.pdf 48. Miller GA (1947) Sensitivity to changes in the intensity of white noise and its relation to masking and loudness. J Acoust Soc Am 19(4):609–619. https://doi.org/10.1121/1.1916528. 49. Moore BC (2007) Cochlear hearing loss: physiological, psychological and technical issues, 2nd edn. John Wiley and Sons Ltd, Cambridge, UK


50. Moore BC (2014) Development and current status of the ‘Cambridge’ loudness models. Trends hear 18, 2331216514550620, p 29. https://doi.org/10.1177/2331216514550620. 51. Moore BC (2005) Why are commercials so loud?. Noise & Vib Worldw 36(8):11–15. https:// doi.org/10.1260/095745605774851421. 52. Moore BC, Glasberg BR (2004) A revised model of loudness perception applied to cochlear hearing loss. Hear Res 188:70–88. https://doi.org/10.1016/S0378-5955(03)00347-2 53. Moore BC, Glasberg BR (1996) A revision of zwicker’s loudness model. Acust 82(2):335–345 54. Moore BC, Glasberg BR (2007) Modeling binaural loudness. J Acoust Soc Am 121(3):1604– 1612. https://doi.org/10.1121/1.2431331. 55. Moore BC, Glasberg BR, Baer T (1997) A model for the prediction of thresholds, loudness, and partial loudness. J Audio Eng Soc 45(4):224–240. http://www.aes.org/elib/browse.cfm? elib=10272 56. Moore BC et al (2016) A loudness model for time-varying sounds incorporating binaural inhibition. Trends hear 20, 2331216516682698, p 16. https://doi.org/10.1177/2331216516682698. 57. Moore BC et al (2018) Testing and refining a loudness model for time-varying sounds incorporating binaural inhibition. J Acoust Soc Am 143(3):1504–1513. https://doi.org/10.1121/1. 5027246 58. Munson WA (1947) The growth of auditory sensation. J Acoust Soc Am 19(4):584–591. https:// doi.org/10.1121/1.1916525 59. Namba S, Kuwano S, Kato T (1976) The loudness of sound with intensity increment. Jpn Psychol Res 18(2):63–72. https://doi.org/10.4992/psycholres1954.18.63. 60. Neuhoff JG (2001) An adaptive bias in the perception of looming auditory motion. Ecol Psychol 13(2):97–110. https://doi.org/10.1207/S15326969ECO1302_2. 61. Neuhoff JG (1998) Perceptual bias for rising tones. Nat 395(6698):123–124. https://doi.org/ 10.1038/25862. 62. Oberfeld D, Hots J, Verhey JL (2018) Temporal weights in the perception of sound intensity: effects of sound duration and number of temporal segments. J Acoust Soc Am 143(2):943–953. https://doi.org/10.1121/1.5023686. 63. Oberfeld D et al (2018) Evaluation of a model of temporal weights in loudness judgments. J Acoust Soc Am 144(2):EL119–EL124. https://doi.org/10.1121/1.5049895. 64. Olsen KN (2014) Intensity dynamics and loudness change: a review of methods and perceptual processes. Acoust Aust 42(3):159–165 65. Pedersen B, Ellermeier W (2008) Temporal weights in the level discrimination of time-varying sounds. J Acoust Soc Am 123(2):963–972. https://doi.org/10.1121/1.2822883. 66. Pieper I et al (2016) Physiological motivated transmission-lines as front end for loudness models. J Acoust Soc Am 139(5):2896–2910. https://doi.org/10.1121/1.4949540. 67. Pieper I et al (2018) Physiologically motivated individual loudness model for normal hearing and hearing impaired listeners. J Acoust Soc Am 144(2):917–930. https://doi.org/10.1121/1. 5050518 68. Ponsot E, Susini P, Meunie S (2017) Global loudness of rising-and falling-intensity tones: how temporal profile characteristics shape overall judgments. J Acoust Soc Am 142(1):256–267. https://doi.org/10.1121/1.4991901. 69. Ponsot E, Susini P, Oberfeld D (2016) Temporal weighting of loudness: comparison between two different psychophysical tasks. J Acoust Soc Am 139(1):406–417. https://doi.org/10.1121/ 1.4939959. 70. Ponsot E et al (2015) Are rising sounds always louder? influences of spectral structure and intensity-region on loudness sensitivity to intensity-change direction. Acta Acust United Acust 101(6):1083–1093.https://doi.org/10.3813/AAA.918902. 71. 
Poulsen T (1981) Loudness of tone pulses in a free field. J Acoust Soc Am 69(6):1786–1790. https://doi.org/10.1121/1.385915. 72. Rennies J, Holube I, Verhey JL (2013) Loudness of speech and speech-like signals. Acta Acust United Acust 99(2):268–282. https://doi.org/10.3813/AAA.918609. 73. Rennies J, Verhey JL, Fastl H (2010) Comparison of loudness models for time-varying sounds. Acta Acust United Acust 96(2):1112–1122. https://doi.org/10.3813/AAA.918287.


74. Rennies J et al (2009) Modeling temporal effects of spectral loudness summation. Acta Acust United Acust 95(6):1112–1122. https://doi.org/10.3813/AAA.918243. 75. Rennies J et al (2015) Spectro-temporal characteristics affecting the loudness of technical sounds: data and model predictions. Acta Acust United Acust 101(6):1145–1156. https://doi. org/10.3813/AAA.918907. 76. Scharf B (1978) “Loudness”. In: Carterette EC, Friedman MP (eds) Handbook of perception, vol 4, Hearing. Academic Press, Inc., New York, pp 187–242 77. Schlauch RS, Ries DT, DiGiovanni JJ (2001) Duration discrimination and subjective duration for ramped and damped sounds. J Acoust Soc Am 109(6):2880–2887. https://doi.org/10.1121/ 1.1372913. 78. Schlittenlacher J et al (2017) Overall judgment of loudness of time-varying sounds. J Acoust Soc Am 142(4):1841–1847. https://doi.org/10.1121/1.5003797. 79. Skovenborg E, Quesnel R, Nielsen SH (2004) Loudness assessment of music and speech. In: Proceedings of the 116th Convention of the Audio Engineering Society (AES) (8–11 May 2004, Berlin, Germany). pp 1–25. http://www.aes.org/e-lib/browse.cfm?elib=12770 80. Stecker GC, Hafter ER (2000) An effect of temporal asymmetry on loudness. J Acoust Soc Am 107(6):3358–3368. https://doi.org/10.1121/1.429407. 81. Stevens SS (1936) A scale for the measurement of a psychological magnitude: loudness. Psychol Rev 43(5):405–416. https://doi.org/10.1037/h0058773. 82. Stevens SS (1957) On the psychophysical law. Psychol Rev 64(3):153–181. https://doi.org/10. 1037/h0046162 83. Stevens SS, Galanter EH (1957) Ratio scales and category scales for a dozen perceptual continua. J Exp Psychol 54(6):377–411. https://doi.org/10.1037/h0043680. 84. Susini P, McAdams S, Smith BK (2002) Global and continuous loudness estimation of timevarying levels. Acta Acust United Acust 93(4):623–631 85. Švec JG, Granqvist S (2018) Tutorial and guidelines on measurement of sound pressure level in voice and speech. J Speech Lang Hear Res 61(3):441–461. https://doi.org/10.1044/ 2017_JSLHR-S-17-0095. 86. Swift SH, Gee KL (2011) Examining the use of a time-varying loudness algorithm for quantifying characteristics of nonlinearly propagated noise. J Acoust Soc Am 129(5):2753–2756. https://doi.org/10.1121/1.3569710. 87. Vickers E (2010) The loudness war: background, speculation, and recommendations. In: Proceedings of the Audio Engineering Society Convention 129 (San Francisco, CA). p 27. http:// www.aes.org/e-lib/browse.cfm?elib=15598 88. Viemeister NF, Wakefield GH (1991) Temporal integration and multiple looks. J Acoust Soc Am 90(2):858–865. https://doi.org/10.1121/1.401953. 89. Ward D (2017) Application of loudness models in audio engineering. Faculty of Computing, Engineering, the Built Environment, The School of Computing, and Digital Technology, Birmingham, UK. https://core.ac.uk/download/pdf/189365180.pdf 90. Warren RM (1977) Subjective loudness and its physical correlate. Acta Acust United Acust 37(5):334–346 91. Warren RM et al (1994) Auditory induction: reciprocal changes in alternating sounds. Percept & Psychophys 55(3):313–322. https://doi.org/10.3758/BF03207602. 92. Zwicker E (1977) Procedure for calculating loudness of temporally variable sounds. J Acoust Soc Am 62(3):675–682. https://doi.org/10.1121/1.381580.

Chapter 8

Pitch Perception

Auditory units are characterized by a number of perceptual attributes, of which timbre and loudness have been discussed in the preceding chapters. Now it is the turn of pitch. The emergence of a perceptual attribute such as pitch is a fine example of a property of a complex sound that cannot be explained based on a property of one of its components alone. It is the harmonic relation between the partials that defines the pitch of the complex, not the frequency of any of its components separately. Pitch plays an important role in music and in speech: In speech, the pitch contour of an utterance defines the intonation of this utterance; in music, it defines the melodies played. Indeed, if one can hear a melody in a succession of tones, one can be sure that pitch plays a role. Actually, pitch has been defined as that attribute of sounds that creates melody [112, p. 381]. Later in this chapter, in Sect. 8.1, this and other definitions of pitch will be discussed further.

In Sect. 4.8.2, it was argued that regularity and harmonicity play different roles in sound perception. Spectral regularity means that the spectral components of a sound have frequencies that are equidistant on a linear frequency scale. It plays an important role in the formation of auditory units. Harmonicity, on the other hand, means that spectral components have frequencies that are integer multiples of a fundamental frequency, and it plays a role in the emergence of pitch. Brunstrom and Roberts [20] showed that the integration of spectral components into an auditory unit and the processing of pitch are, indeed, distinct processes. Harmonicity rather than regularity is the primary source of information used in pitch perception [35, 140]. In other words, pitch is, roughly speaking, a perceptual attribute of harmonic auditory units and, since many sounds are not harmonic, this indicates that many sounds do not have pitch, which is correct. For instance, most noisy signals do not have pitch. Tonal sounds with partials that are not harmonic often have perceptually ill-defined pitches or sound "inharmonic", "dissonant", or "out of tune". Partials that deviate significantly from a regular pattern are in general perceived as separate auditory units, as illustrated in the demos of Figs. 4.13 and 4.14.


Table 8.1 Calculation of the fundamental frequency for a complex tone of 1400, 1600, and 1800 Hz. The first column shows the rank of the successive subharmonics; the other columns show the rounded frequencies of the subharmonics of 1400, 1600, and 1800 Hz

Rank   1400   1600   1800
  1    1400   1600   1800
  2     700    800    900
  3     467    533    600
  4     350    400    450
  5     280    320    360
  6     233    267    300
  7     200    229    257
  8     175    200    225
  9     156    178    200
 10     140    160    180
 11     127    145    164
 12     117    133    150

Since pitch is a perceptual attribute, it is not easy to give a conclusive definition of it. In the same way as loudness is often associated with the intensity of a sound, pitch is often associated with the frequency of a sound. Indeed, in general one can ask a listener to adjust the frequency of a pure tone in such a way that its pitch is equal to that of a test sound. This frequency can be defined as the pitch of the test sound, and will be indicated with pitch frequency. If there is any signal attribute that corresponds to the emergence of pitch, however, it is not frequency but the "repetitiveness" of the signal, and the best example of a signal that repeats itself is a periodic signal. As a usually good first approximation, the frequency of the pitch then corresponds to the inverse of the signal period.

Fourier theory shows that periodic signals can be described as sums of harmonics, the first of which is called the fundamental and its frequency the fundamental frequency. Hence, the pitch frequency of periodic sounds generally equals the fundamental frequency. Finding the fundamental frequency amounts to calculating the greatest common divisor of the frequencies of the harmonics. One, not so efficient, way to do so is to list the subharmonics of each frequency component of the complex and to look for the largest subharmonic these components have in common. This is illustrated in Table 8.1, where the greatest common divisor of 1400, 1600, and 1800 Hz is calculated. Each column shows the first twelve subharmonics of each of these frequencies, respectively. The first common subharmonic, in this case 200 Hz, is the greatest common divisor and, hence, the fundamental frequency of this three-tone complex. This yields the rule of thumb for pitch perception: The pitch frequency of a harmonic tone can be found by calculating the greatest common divisor of the frequencies of the harmonics.

In the example of Table 8.1, the fundamental frequency of 200 Hz is not part of the complex of 1400, 1600, and 1800 Hz, and one may ask if the pitch of this complex is indeed 200 Hz.


This appears to be the case; the fundamental need not be part of the harmonic complex [148, 150]. When the fundamental is absent, the pitch is called virtual pitch or residue pitch. Virtual pitch will be discussed in detail in Sect. 8.4.4.

This rule of thumb for pitch perception is simple, indeed, but it is only adequate for harmonic tones with precisely defined frequencies. In most everyday situations, tonal sounds are not exactly periodic, but vary in waveform and duration from period to period, and may contain some amount of noise. When associating pitch with the periodicity of a sound, one should realize that exactly periodic sounds are man-made or very rare. Synthesized buzzes or beeps from electronic devices may closely approximate signals consisting of series of identical periods, but even those will contain a minimum amount of noise. Periods of natural sounds will also show some variation. Besides random fluctuations, there are also systematic deviations from strict harmonicity. A well-known example is that of octave stretching. It appears that, when people are asked to adjust the frequency of one tone one octave higher than that of another tone, they will on average adjust it somewhat higher than one octave. A recent discussion of the implications for the tuning of orchestral instruments is presented by Jaatinen, Pätynen, and Alho [71]. Octave stretching is part of a more general phenomenon: There are both random and systematic deviations from strict harmonicity; harmonics are not strictly harmonic, and intervals do not strictly conform to simple mathematical ratios. Because of this, it would be better to speak about pseudo-periodic or pseudo-harmonic sounds. For the sake of simplicity, however, as long as the deviations are small, the terms periodic and harmonic will be used. For elaborate discussions of these issues the reader is referred to Sethares [153] and Parncutt and Hair [121].
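The subharmonic procedure of Table 8.1 is easily sketched in a few lines of code. The version below includes a small tolerance so that it also works for slightly mistuned, pseudo-harmonic partials of the kind just discussed; the 1% tolerance, the twelve ranks, and the rounding of the candidates are arbitrary illustrative choices, not a model of pitch perception.

```matlab
% Sketch of the subharmonic procedure of Table 8.1 (illustrative only).
% A 1% tolerance accommodates small deviations from strict harmonicity.
freqs = [1400 1600 1800];          % frequencies of the partials in Hz
ranks = 1:12;                      % subharmonic ranks considered
tol   = 0.01;                      % allowed relative deviation

% Matrix of subharmonics: one row per partial, one column per rank
sub = freqs(:) ./ ranks;

% Take the highest candidate that is (nearly) a subharmonic of every partial
candidates = sort(unique(round(sub(:))), 'descend');
for f0 = candidates'
    if all(any(abs(sub - f0) / f0 < tol, 2))
        fprintf('Estimated fundamental frequency: %d Hz\n', f0);   % 200 Hz
        break
    end
end
```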

8.1 Definitions of Pitch

The pitch of pure sinusoids and of synthesized, simple harmonic complexes is usually quite well-defined, at least when their fundamental frequency is in the pitch range. The situation becomes more complex for sounds such as speech, music, and many environmental sounds. First, some official definitions of pitch will be discussed. The American Standards Association (ASA) defines pitch as "that attribute of auditory sensation in terms of which sounds may be ordered on a musical scale" [5]. This definition associates pitch with music. In speech, however, and in many other sounds such as sounds used in animal communication, harmonic and melodic structures are not defined in terms of a musical key, such as A flat or C sharp, or a musical scale, major or minor, within which a limited set of pitches is defined. Therefore, the definition by the ASA from 1960 may cover the role played by pitch in music, but it is not complete. It is copied by Moore [112], who adds: "In other words, variations in pitch create a sense of melody. A sound that evokes a pitch is often called a 'tone', especially when the pitch has a clear musical quality" (p. 381). This definition links pitch to melody, a concept not only used in music but sometimes also in speech, where the melody of an utterance is often used to indicate the intonation contour of that utterance.


When it comes to speech, the term intonation contour will preferably be used in this book.

Another definition [4] describes pitch as "that attribute of auditory sensation in terms of which sounds may be ordered on a scale extending from low to high. Pitch depends primarily on the frequency content of the sound stimulus, but it also depends on the sound pressure and the waveform of the stimulus." Indeed, asking listeners to compare the pitches of two sounds will, at least when the sounds have well-defined pitches, usually result in a clear answer as to the relative height of the pitches: The pitch of one sound can be higher or lower than that of the other, or both can be equal. The problem with this definition is that pitch is not the only perceptual attribute of sound that can be ordered from low to high [124, pp. 37–47]. In Sect. 6.4, the mean of the specific-loudness distribution over the tonotopic array was presented as a computational estimate of brightness. Hence, brightness is described on the tonotopic array and, as such, can also be ordered from low to high. So, there must be other essential differences between pitch and brightness. It will indeed be shown that pitch is a perceptual attribute associated not so much with the frequency content of the sound, but with its periodicity or, better still, with its repetition rate. In fact, a computational model of pitch perception will be presented that attempts to detect common repetitions in the sound signal over the tonotopic array. If there is evidence for such repetitiveness, albeit imperfect, this repetition may be associated with a pitch. Naturally, the estimate of the pitch frequency then corresponds to the repetition rate. In the demos of Figs. 8.23 and 8.24, some non-periodic, but repetitive, sounds with well-defined pitches will be presented.

For the moment, it is more convenient to use a practical definition of pitch. A note to that effect is added to the definition of ANSI [4]: "Note: The pitch of a sound may be described by the frequency or frequency level of that simple tone having a specified sound pressure level that is judged by listeners to produce the same pitch." As also mentioned earlier, this note indicates that the frequency of the pitch can be found by an adjustment procedure in which the frequency of a pure tone is matched to that of the test sound. A similar practical definition is given by Hartmann [51, p. 283]: "A sound can be said to have a certain pitch if it can be reliably matched by adjusting the frequency of a pure tone of arbitrary amplitude." This frequency is then the pitch frequency. In many practical situations, this definition can easily be applied, in particular for relatively short and stable tones. For rapidly varying sounds such as speech sounds, this definition may not be directly applicable, for the pitch of these sounds can change very rapidly, even within one syllable. In such cases, one may gate out a short interval of the sound containing only a few pitch periods and apply the adjustment procedure to this short interval played in isolation. In this way, one can determine the pitch of the speech sound within that interval. This can be repeated for all intervals covering the utterance, thus establishing the pitch contour of the whole utterance.
It appears that, for a wide class of well-recorded single sounds, one can experimentally determine in this way what frequency listeners associate with the pitch of these sounds, in other words, what the pitch frequency of these sounds is.


If one has done this a few times for well-recorded clean speech or single musical melodies and has compared the results with the waveforms of the signals, one will rapidly conclude that the pitch of such signals corresponds in almost all cases, to a very good approximation, to the (pseudo-)periodicity of these sounds. One such period will be indicated with pitch period. The duration of one such period is, of course, the inverse of the pitch frequency. However, it is important to realize that pitch frequency and pitch period are perceptual attributes of a sound: They are defined as the frequency and the period of a sinusoid with an identical pitch, and these need not be identical to the physical frequency and period of the sound itself. This is well illustrated by diplacusis, the phenomenon that the pitch frequency of a pure tone can be somewhat different for the left and the right ear. Diplacusis can be a notorious problem in people with hearing impairment but, also in normal hearing, the difference in pitch frequency between the two ears can be as high as 2.5%, about ten times the just-noticeable difference for frequency [168]. Even the pitch frequency of a pure tone of fixed frequency is not strictly fixed, since the pitch of a pure tone varies somewhat with its intensity, not very much, but it does. In general, pure tones with frequencies lower than 1–2 kHz go down in pitch as the intensity increases, while pure tones with frequencies higher than 1–2 kHz go up in pitch as the intensity increases [167, 183]. This effect is not large, however, and Terhardt [170] showed that it vanishes almost completely for complex sounds, but it indicates that the frequency of a sound and the frequency of its pitch are not the same thing, not even for pure tones. For all kinds of other sounds, the duration of the pitch period is not exactly equal to the duration of one signal period, either. This applies to tones of very short duration [41], short tones with rapidly decaying envelopes [52], and sinusoids partially masked by other sounds [171].

Many pseudo-periodic sounds, e.g., many sounds used in human or animal communication, show gradual changes from period to period. Moreover, these periods can get shorter and longer, inducing corresponding rises and falls in pitch, which often are of great communicative significance. They cannot go too fast, however. Abrupt changes in periodicity in general lead to a discontinuity of the perceived sound source, i.e., when such a break in pitch occurs, the listener stops hearing the same sound source and starts hearing another sound source coming in [19, 37]. This does not mean that the pitch of a sound must be continuous. Periodic sound segments with pitch can be interrupted by non-periodic sounds, e.g., noise sounds. This happens frequently in speech at the transition from a voiced to an unvoiced segment. During unvoiced speech segments such as the /s/ and the /f/, the periodicity vanishes altogether. In spite of this, no discontinuity in the sound source is perceived. In this situation, however, the durations of the pitch periods before and after the unvoiced segment should not deviate too much from each other, either. If they do, a discontinuity in the perceived sound source will again be heard. This will be further discussed in Chap. 10.
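As a simple illustration of the correspondence between pitch and (pseudo-)periodicity, the sketch below estimates the period of a short harmonic test segment from its autocorrelation. The test signal, the 40-ms window, and the 50–500 Hz search range are arbitrary choices made for illustration; this is not the pitch-perception model presented later in this chapter.

```matlab
% Estimate the period of a short quasi-periodic segment from its
% autocorrelation. All parameter values are illustrative choices.
fs = 44100;                                   % sampling rate in Hz
t  = (0:round(0.04*fs)-1)'/fs;                % 40-ms segment
x  = cos(2*pi*200*t) + 0.5*cos(2*pi*400*t) + 0.3*cos(2*pi*600*t);

minlag = round(fs/500);                       % search range: 50-500 Hz
maxlag = round(fs/50);
r = zeros(maxlag, 1);
for k = minlag:maxlag
    r(k) = sum(x(1:end-k) .* x(1+k:end));     % unnormalized autocorrelation
end
[~, lag] = max(r);                            % lag of the highest peak
fprintf('Estimated period: %.2f ms (%.1f Hz)\n', 1000*lag/fs, fs/lag);
```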


8.2 Pitch Height and Pitch Chroma

In music, the frequency of tones can be specified in Hz, but almost no musician will know which note to play when asked to play a note of, e.g., 349.2 Hz. If in the possession of a calculator, it is possible to find out that an F4 is meant, but many musicians will not know or have forgotten how to do the calculations. As described in Sect. 1.4, the frequency of a musical note is indicated by a symbol representing the position of that note within an octave. The reference tone is the 440-Hz note in the middle of the piano keyboard, indicated with A, as are the other notes that differ from it by an integer number of octaves. The other white keys of the keyboard are indicated with B, C, D, E, F, or G. Furthermore, the black keys are also indicated with A, B, C, D, E, F, or G, but now, depending on the harmony of the music, provided with a sharp sign ♯ or a flat sign ♭. So, symbols in music such as B, F♯, C, or E do not indicate the frequency of the notes but rather their position within a circular scale of octaves.

Apparently, in music, not only the absolute frequency of a tone is of importance, but also its position within the octave. Based on this, Bachem [7, 8] proposed to distinguish two pitch classes. For one, he proposed the term "height". Pitch height can be identified with the position of the pitch frequency on a scale from low to high, as proposed by ANSI [4] and discussed in Sect. 8.1. For the second pitch class, Bachem [7] coined the new term "chroma". Pitch chroma indicates the position of the note within a musical scale [8]. Thus, two notes that differ in pitch frequency by one octave have the same chroma. One of the reasons underlying the distinction between pitch height and pitch chroma stems from the harmonic fact that two tones that differ by one octave are more similar to each other than two tones that differ by any other interval. Indeed, when a male voice and a female voice sing the same melody together, the pitch of the female voice is mostly one octave higher than that of the male voice. Also, when two people play a tune, one by singing and the other by whistling, the pitches are usually one or two octaves apart. Since this often goes unnoticed, it has been called "octave deafness" [137], but more often the term octave equivalence is used, e.g., by Hoeschele, Weisman, and Sturdy [60].

This has led to suggestions for a perceptual pitch space consisting of two dimensions [144], a linear dimension representing pitch height and a circular dimension representing chroma. The best-known example, presented by Shepard [157], is shown in Fig. 8.1. Pitch space is situated on a helix of which the vertical dimension represents pitch height, and the circular dimension pitch chroma. One rotation of 360° along this circular dimension corresponds to one octave. Each semitone then corresponds to 360°/12 = 30°. Consequently, notes with the same musical notation, e.g., all Cs, all F♯s, or all B♭s, are positioned right above each other. This represents their close harmonic relationship.

The perceptual space suggested by Fig. 8.1 is not a usual Euclidean space in which a meaning can be attributed to the distance between two points in a 3-dimensional space. The only distance between two points to which a perceptual meaning can be attributed is their distance along the helix. The distance between two adjacent points represents one semitone.


Fig. 8.1 Pitch height and pitch chroma. The vertical dimension represents pitch height, the property of pitch that can be ordered from low to high. The circular dimension represents pitch chroma, the circularity of pitch. After Shepard [157, Fig. 1, p. 308] (Matlab)

The distance between two arbitrary musical points on the helix is then always an integer multiple of one semitone. The other essential property of this representation is that two notes that differ in pitch frequency by one or more octaves have the same circular coordinate. This indicates that they have the same chroma, which corresponds to the special role of the octave in music. The interval between two notes is an octave when the ratio between their frequencies is 2. All other harmonic relations between two successive musical tones are also described by the "intervals" between them, i.e., the ratios of the pitch frequencies of the two tones. As a ratio scale is equivalent to a logarithmic frequency scale, pitch chroma can best be represented on a logarithmic frequency axis. The circularity of pitch as depicted in Fig. 8.1 will be demonstrated auditorily in the demos of Figs. 8.15 and 8.16.

In summary, the helix shown in Fig. 8.1 expresses two different aspects of pitch perception, pitch height and pitch chroma. The question is now on what frequency scale the pitch of pitch chroma and that of pitch height can best be represented: the musical, logarithmic scale, or the tonotopic scale? This will be worked out in Sect. 10.12.4. There, the conclusion by Van Noorden [181] will be endorsed: "Tone height is related to the tonotopical organization along the basilar membrane, and the chroma dimension to the temporal organization of spikes in the auditory nerves" (p. 266). Hence, pitch chroma can best be represented on a musical, i.e., logarithmic, scale, whereas pitch height can best be represented on a tonotopic scale.
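As a small numerical companion to Fig. 8.1, the sketch below maps a frequency onto the two coordinates of the helix. Placing A4 = 440 Hz at a chroma angle of 0° and representing height simply as the logarithm of frequency are assumptions made for illustration; as argued above, pitch height may better be represented on a tonotopic scale.

```matlab
% Sketch: helix coordinates (chroma angle, height) for a given frequency.
% Assumptions: A4 = 440 Hz at chroma angle 0 degrees; height = log2(frequency).
f      = 349.2;                       % frequency in Hz
semis  = 12 * log2(f / 440);          % signed distance from A4 in semitones
chroma = mod(semis, 12) * 30;         % chroma angle in degrees (30 deg per semitone)
height = log2(f);                     % pitch height in octaves (arbitrary origin)

% Cartesian position on a unit-radius helix as in Fig. 8.1
x = cosd(chroma);  y = sind(chroma);  z = height;
fprintf('chroma = %5.1f deg, height = %5.2f octaves\n', chroma, height);
```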


8.3 The Range of Pitch Perception

One of the first questions that arises concerns the existence region of pitch [136], i.e., what is called the pitch range in this book. In other words, what are the lowest and the highest pitch frequencies we can hear? Two different pitch ranges will be distinguished: a range in which listeners can distinguish between higher and lower pitches, and a range in which listeners can recognize musical melodies. The first range is associated with pitch height and covers the whole tonotopic array. The other, somewhat narrower, range is associated with pitch chroma and requires a sense of harmony.

First, the frequency range of pitch chroma will be discussed. As to the upper limit of pitch chroma, the general opinion is that the percept of musical pitch vanishes for frequencies higher than about 5 kHz. Indeed, when a listener is asked to adjust the pitch of one of two pure tones so that it differs by one octave from the other, this appears difficult when one of the two tones has a frequency higher than 5000 Hz [6, 186]. This was confirmed by Semal and Demany [152] based on the results of an experiment in which listeners were presented with a sequence of two pure tones and asked to transpose them up to a level at which the higher tone of the sequence was "just above the upper limit of musical pitch" (p. 167). This upper limit of 5000 Hz is usually associated with the vanishing of phase locking in the auditory-nerve fibres above this frequency [113] (see, however, Verschooten et al. [182]).

More difficult to interpret are the results of experiments carried out to determine the lower limit of the pitch range. Ritsma [136] presented listeners with harmonic amplitude-modulated (AM) tones of varying centre frequency (CF) and modulation frequency (MF). He asked subjects to adjust the modulation depth of these tones so that they could just hear the virtual pitch generated by these harmonic sounds. He ended up with an existence region starting at about 40 Hz and ending at about 5000 Hz. The problem with Ritsma's results is, however, that listeners may not have listened to the pitch induced by the AM tones, but to combination tones [139].

Some other studies, too, are hard to interpret at first sight. In a first experiment, Guttman and Pruzansky [47] presented listeners with a pulse train with a variable rate. In order to hear how such pulse trains sound, listeners can listen to the demo of Fig. 1.54. It will be clear that, when varying the rate in the rhythm range, listeners will hear the pulse train going slower or faster. At higher rates, besides entering the roughness range, one also enters the pitch range, and varying the rate of a pulse train then results in it sounding lower and higher in pitch. In conclusion, when varying the pulse rate below the pitch range, listeners hear the pulse train go slower and faster; within the pitch range, it gets higher or lower. Guttman and Pruzansky [47] asked listeners to adjust the frequency of a pulse train precisely at this transition, i.e., at that rate where the high-low distinction is absent for lower pulse rates, but present for higher pulse rates. The authors found that, on average, listeners adjusted the pulse rate at 19.0 Hz. This shows that, above 19 Hz, listeners start hearing the pitch go up when the pulse rate is increased, and go down when it is decreased. The authors called this the "lower pitch limit". This limit corresponds almost perfectly to the lower limit of human auditory sensitivity.
It was recently confirmed by Jurado, Larrea, and Moore [73], not only for pure tones, but also for sinusoidally amplitude-modulated 125-Hz tones and for very short 125-Hz tone pips that do not contain any energy at 19 Hz. This indicates that this low pitch is derived from periodicity information within the spike trains of the auditory-nerve fibres excited by the 125-Hz tones.


The second experiment by Guttman and Pruzansky [47] was carried out in a musical context. They presented listeners with two pulse trains. One was played when the listener set a key in position A; the other was played when the key was set in position B. In all trials, the pulse rate of one of the two pulse trains, the standard, was fixed at a rate between 20.6 and 131 Hz, while the rate of the other was under control of the listener. In a first set of control trials, the authors asked listeners simply to adjust the rate of the variable pulse train to that of the standard pulse train. For this control set, performance was good for all rates, showing that listeners can well distinguish higher from lower pulse rates. In a second set of trials, listeners were asked to adjust the rate of the variable pulse train in such a way that its pitch was one octave lower than the pitch of the fixed pulse train; and in a third set of trials, so that its pitch was one octave higher than that of the standard pulse train. Guttman and Pruzansky [47] found that the performance of the listeners was good when the pulse rates of both pulse trains were higher than about 60 Hz. When, however, the pulse rate of one of the pulse trains was below 60 Hz, the performance of the listeners deteriorated. Generally, listeners then adjusted the rate of the variable pulse train in such a way that the difference in rate was larger than one octave. The authors called this 60-Hz limit the "lower musical pitch limit". So, one sees that listeners can quite well do a task only involving distinguishing a higher from a lower pitch above rates of about 20 Hz, whereas a task involving the matching of an octave needs frequencies higher than about 60 Hz.

In order to find the "lower limit of pitch", Krumbholz, Patterson, and Pressnitzer [78] measured the rate-discrimination threshold for bandpass-filtered pulse trains. The lower cut-off frequency was varied between 0.2 and 6.4 kHz; the equivalent rectangular bandwidth of the filter was fixed at 1.2 kHz. Based on the results, the authors conclude that the lower limit of pitch perception is 30 Hz. Using a different paradigm, the same conclusion was drawn by Pressnitzer, Patterson, and Krumbholz [132]. These authors presented listeners with two short four-tone melodies consisting of tones with fundamental frequencies randomly selected from five different equidistant values spaced by one semitone. One of the four tones of the second melody was lowered or raised by one semitone, and the task of the listener was to determine which of the four tones had changed in pitch. For this experiment, the authors again conclude that the "lower limit of melodic pitch" is 30 Hz. Note, however, that the only task of the listener was to hear which of the four tones had changed. So, the listeners did not have to make judgments as to the size of the change in terms of musical intervals, such as the octave, the major third, or the minor third.

Finally, Biasutti [13] wanted to know for what frequencies listeners could hear differences between major and minor chords. These chords consisted of triads, i.e., three simultaneous tones in a chord progression.
He found that listeners could only do so when the mean of the frequencies of the three tones was between 120 and 3000 Hz.


Biasutti [13] does not mention the frequency of the lowest tone of the triads but, since all three tones of a triad have to be in the range for which the listeners can do the task, it can be assumed that all three tones must have frequencies higher than one major third below 120 Hz, so higher than about 90–100 Hz. This is still considerably higher than the "lower musical pitch limit" of 60 Hz determined by Guttman and Pruzansky [47], who used octaves. This discrepancy may be because recognition of octaves is easier than recognition of minor and major thirds. Actually, only one third of the population is able to distinguish between major and minor chords, a property that appears to be innate [1]. This may also explain why Biasutti [13] found a rather low upper limit of musical pitch of 3000 Hz, instead of the 5000 Hz found with octaves [6, 186].

It is concluded that the range of pitch height corresponds to the full tonotopic array. The range of pitch chroma or musical pitch is more complex. Identification of an octave requires tones with pitch frequencies between about 60 and 5000 Hz. Identification of other musical intervals is an ability of a minority of people, and requires musical training. For tasks requiring distinguishing between such intervals, the pitch of the lower tone must be higher than about 90–100 Hz and the pitch of the higher tone must be lower than about 3000 Hz. The difference between the two "pitch classes", pitch height and pitch chroma, will be further discussed in Sect. 10.12.4.

8.4 The Pitch of Some Synthesized Sounds

8.4.1 The Pitch of Pure Tones

The pitch of a pure tone has been well documented. One will be inclined to say that the pitch frequency of a pure tone is the frequency of the sinusoid of which the tone consists. Roughly, this is correct, but not exactly. One of the complications has already been mentioned: Pure tones with frequencies much lower than 1–2 kHz go down in pitch a little bit as the intensity increases, while pure tones with frequencies much higher than 1–2 kHz go up in pitch as the intensity increases [167, 183, 193]. Another complication arises when one listens to pure tones with very low or very high frequencies. The problems for frequencies lower than about 100 Hz have been discussed in the previous Sect. 8.3. In addition, the thresholds of hearing are very high for these low frequencies, and the perception of roughness may come into play [108, 109]. As to the higher limit of melodic pitch, above some 3–5 kHz it becomes more and more difficult to create musical melodies based on pure tones. Based on their brightness, it remains possible to hear whether one pure tone is higher in frequency than another. For pure tones, these two sources of information coincide, and one has to carry out dedicated experiments to find out whether listeners base their judgments on the pitch or on the brightness of the pure tones. Later in this chapter, e.g., in the demo of Fig. 8.4, it will be shown that, for complex tones, pitch and brightness can be varied independently.


goes up while the timbre of the tones goes down. In order to do so, the phenomenon of “virtual pitch” must be discussed. This will be done in Sect. 8.4.4.

8.4.2 The Duration of a Sound and its Pitch

At various instances, the pitch of a sound has been linked to its periodicity. One may ask how many periods are necessary to obtain a clear percept of pitch. This question, however, is not easy to answer. Metters and Williams [106] reported that the certainty with which a listener can estimate the pitch of a tone rapidly diminishes when fewer than four complete periods are played; they used pure tones and tones consisting of the fifth, sixth, and seventh harmonic of 280 Hz. Patterson, Peters, and Milroy [125], who used a musical task on a diatonic scale between 84 and 125 Hz consisting of pure tones or tones with five harmonics, concluded that about 40 ms were necessary for a reliable pitch percept to arise. Pollack [131] found that pulse pairs, a rapid succession of two pulses, could generate a pitch with a frequency that is the inverse of the time interval between the two pulses. This is illustrated in Fig. 8.2. A sequence of pulse pairs is played in which the time intervals between the two pulses of a pair are varied such that the frequencies corresponding to their inverse form a diatonic scale on the A of 440 Hz. The inserts in the figure show 8-ms segments around every

Fig. 8.2 Repetition pitch of pulse pairs or double clicks. Eight 8-ms intervals are depicted, each containing a pulse pair. Every 0.2 s, another pair is played. The interval between the two pulses of a pair decreases in such a way that the repetition pitch induced by the sequence of pulse pairs forms an increasing diatonic scale on A4. For comparison, another diatonic scale consisting of short tones is played after the sequence of pulse pairs (Matlab) (demo)


pulse pair stretched by a factor of 20 relative to the abscissa. When listening to this, a weak diatonic scale on the A of 440 Hz can be heard, which shows that one indeed perceives pitch. The minimum duration a tonal sound must have in order to induce a percept of pitch depends on the fundamental frequency and on the number of harmonics. For most periodic sounds, three to four complete periods will be sufficient. In the extreme case of a pulse pair, only two “periods” are enough. A pulse pair can be considered as one pulse followed by a delayed version of itself. The pitch generated in this way is indicated with repetition pitch. One may wonder what happens when one takes another signal, delays it, and adds it to the original signal. Will it also generate a pitch with a frequency that is the inverse of the delay time? Indeed, if one takes a wide-band sound, e.g., a noise burst, delays it, and adds it to its undelayed original, the result will in general produce a repetition pitch. This will be demonstrated in Fig. 8.23 of Sect. 8.9.1. This is why it was stated in the introduction that, when one wants to mention one signal property that can be considered as the acoustic correlate of the pitch of a sound, repetitiveness is the best candidate. These demos illustrate this. On the other hand, the pitches of the pulse pairs are quite weak. Naturally, when not just two pulses but three, four, or more pulses with equal time intervals are played, the salience of the pitch will increase. This will be demonstrated for the repetition pitch of white noise in Figs. 8.23 and 8.24 of Sect. 8.9.1.
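To make this concrete, a minimal Matlab sketch of such a pulse-pair sequence along the lines of Fig. 8.2 could look as follows. It is not the script behind the actual demo; the sampling rate, the 0.2-s inter-onset interval, and the use of a major diatonic scale on A4 are assumptions for illustration only.

fs = 44100;                                        % assumed sampling rate
scale = 440 * 2.^([0 2 4 5 7 9 11 12] / 12);       % diatonic scale on A4 = 440 Hz
ioi = round(0.2 * fs);                             % 0.2 s between successive pulse pairs
x = zeros(1, numel(scale) * ioi);
for n = 1:numel(scale)
    onset = (n - 1) * ioi + 1;
    delay = round(fs / scale(n));                  % inter-pulse interval equals the pitch period
    x(onset) = 1;                                  % first pulse
    x(onset + delay) = 1;                          % second pulse, one pitch period later
end
soundsc(x, fs);                                    % a weak, rising diatonic scale should be audible

Listening to the output of such a sketch, a faint rising scale should be audible, even though each “tone” consists of no more than two clicks.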

8.4.3 Periodic Sounds and their Pitch

In the beginning of this chapter, it was demonstrated that the pitch frequency of a harmonic complex is in very many cases almost perfectly identical to the fundamental frequency of the complex. This applies quite well to harmonic complexes with fundamental frequencies in the pitch range and with a sufficient number of harmonics. What is sufficient in this case depends on the harmonic rank of the harmonics of the complex. When there is only one harmonic, there is only a pure tone with a well-defined pitch. For complex tones with harmonics of low harmonic rank, say below 6–8, just two harmonics are enough to induce a pitch at the frequency of the fundamental, even when this fundamental is absent and the pitch is virtual [62]. Remarkably, the salience of this virtual pitch differs from listener to listener [64]. For harmonics of higher rank, a larger number of harmonics is required to induce a well-defined pitch at the fundamental frequency. This will be illustrated below in Sect. 8.8. Now one of the simplest of all harmonic sounds will be discussed: a series of pulses. As illustrated in Fig. 1.53 and according to Eq. 1.15, a pulse train can be modelled as the sum of cosine waves of equal amplitude with frequencies that are all multiples of the fundamental frequency f0:


$$p(t) = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} \cos(2\pi k f_0 t). \tag{8.1}$$

Listening to such a sound is not always very pleasant; it is a penetrating buzz. Moreover, below an f 0 of 150 Hz, pulse trains sound rough, which makes them an ideal sound for alarm clocks. Anyway, the pitch of a pulse train is, at least in the pitch range from about 50 to about 5000 Hz, one of the best-defined pitches a sound can have, which is very likely due to the large number of harmonics, covering a large part of the tonotopic array. Tonal sounds with harmonic partials are generally perceived as one auditory unit. This applies certainly to sounds which are not only harmonic but also regular. Pulse trains are about the best examples of harmonic and spectrally regular tones. Hearing their pitch as a whole generally comes naturally. In spite of this, trained persons may be able to hear out the lower harmonics of pulse trains. More naive listeners can hear out the separate harmonics by switching them on and off as demonstrated in the demo of Fig. 4.3. In that case, the harmonic pops out as a separate auditory unit and is no longer integrated into the buzz formed by the other harmonics. In this way, also harmonics of higher rank can be heard as separate auditory units. A demo of this is presented on track 1 of the CD by Houtsma, Rossing, and Wagenaars [66]. The role of harmonicity in pitch perception has also been illustrated in the demo of Fig. 4.13. The frequencies of the partials in that demo deviated more and more from harmonic positions in a random way. The result was that the sounds sounded more and more inharmonic and that individual harmonics started to pop out as separate tones. It is concluded that pitch is a property of harmonic auditory units. As a first, very good approximation, the frequency of this pitch corresponds to the fundamental frequency of the harmonics.
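By way of illustration, Eq. 8.1 can be approximated directly by summing a finite number of equal-amplitude cosine harmonics below the Nyquist frequency. The following Matlab sketch is an assumed implementation of this additive synthesis, not the script used for the actual demos; the sampling rate, fundamental frequency, and duration are arbitrary choices.

fs = 44100; f0 = 200; dur = 1.0;                   % assumed sampling rate, fundamental, duration
t  = (0:round(dur*fs) - 1) / fs;
N  = floor((fs/2) / f0);                           % keep all harmonics below the Nyquist frequency
p  = zeros(size(t));
for k = 1:N
    p = p + cos(2*pi*k*f0*t);                      % k-th harmonic, amplitude 1
end
p = p / N;                                         % normalization as in Eq. 8.1
soundsc(p, fs);                                    % a penetrating buzz with a pitch of 200 Hz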

8.4.4 Virtual Pitch

It has already been mentioned various times that harmonic tones can have pitches at the frequency of the fundamental, also in the absence of any acoustic energy at that frequency. As long as there is a sufficient number of harmonics, the pitch frequency remains well-defined at the fundamental frequency [148, 150]. This is referred to as the phenomenon of the missing fundamental; the pitch produced in this condition is called virtual pitch, e.g., Terhardt [173], or residue pitch, e.g., Moore [110, 148] and Schouten [114]. Virtual pitch is not a phenomenon acquired in the course of growing up. Lau and Werner [85, 86] and Lau et al. [87] showed that infants as young as three to four months perceived virtual pitches. Moreover, also mammals perceive virtual pitches as has been shown, e.g., for cats [54] and for rhesus monkeys [178].


Fig. 8.3 Virtual pitch. The lower harmonics of 330 Hz are successively removed from the complex, first the first harmonic, then the second, the third, etc., up to the eighth harmonic. The timbre of the tones changes, but the pitch remains the same. This virtual pitch of 330 Hz is indicated by the thin dashed lines (Matlab) (demo)

In Fig. 8.3, eight tones are played synthesized by adding harmonics of 330 Hz. In the first tone, all twelve lowest harmonics are added. Every time another tone is played, the lowest harmonic is removed, so that the second tone consists of the sum of the second up to the twelfth harmonic, the third tone of the sum of the third up to the twelfth, and so on. Since there are eight tones, the last tone consists of the eighth up to the twelfth harmonic. Listening to this tone sequence, the pitch remains the same but the brightness increases. One may question whether, for the sound of Fig. 8.3, it is, indeed, the timbre that changes from tone to tone and not the pitch. The answer is that what goes up does not define a melody, since one does not hear a rising melody. This will be checked more elaborately in the next demos by creating melodies of tones with virtual pitches that are varied independently of their timbre. The pitch frequency can go up, while the timbre goes down, and vice versa. It will be shown that the melody will go up and down with the pitch, and not with the timbre. This, by definition, shows that the perceived melody is determined by the pitches and not by the timbre. The melody for which this will be shown is the familiar diatonic scale on 330 Hz going up and down: Do, Re, Mi, Fa, Sol, La, Si, Do, Si, La, Sol, Fa, Mi, Re, Do. This is illustrated in Fig. 8.4. All tones are synthesized by three successive harmonics. For the first and the last Do, the rank of the lowest harmonic is 8, so these tones consist of three harmonics, with frequencies of 8, 9, and 10 times 330, which is 2640, 2970, and 3300 Hz, respectively. For the following tones of the scale, the rank of the harmonics is lowered by 1. So, the Re’s of the scale, the second and the penultimate tone, consist of three harmonics of rank 7, 8, and 9. Hence, they have frequencies of 7, 8, and 9 times the frequency of the Re, which is 330 · 2^(1/6) ≈ 370.4 Hz, so that the frequencies of these harmonics of Re are 7 · 370.4 = 2593, 8 · 370.4 = 2963, and 9 · 370.4 = 3334 Hz, respectively. This procedure is repeated up to the eighth note of the scale, the Do again, but one octave higher, at 660 Hz, than at the first and the last note. At this eighth note, the lowest harmonic is the fundamental. After that, the rank of the harmonics is increased with every note again, while the melody returns to


Fig. 8.4 An ascending and descending diatonic scale on 330 Hz consisting, except for the highest, middle tone, of virtual pitches. While the pitch goes up and down, the brightness of the tones goes down and up. The F 0 of the tones is represented by the horizontal dashed lines. Adapted from track 7 of the CD by Plomp [130] (Matlab) (demo)

Fig. 8.5 Same as Fig. 8.4 but with low-pass noise added with a cut-off frequency of 2000 Hz. The virtual pitches remain well audible as long as the harmonics are higher in frequency than the upper cut-off frequency of the noise (Matlab) (demo)

the same Do as at the beginning. Figure 8.4 clearly shows that, up to the high middle Do of 660 Hz, the average frequency of the harmonics of the tones goes down while the fundamental frequency goes up; after this high Do, the average frequency of the harmonics rises again, while the fundamental frequency goes down again to the low Do of 330 Hz. The fact that one clearly hears a diatonic scale which goes up and down shows that it is indeed the pitch of the tones which go up and down, and not their timbre. Nevertheless, one also hears something go down and then go up, but this does not induce the percept of a melody. What first goes down and then up must be another perceptual attribute of the tone, its brightness. An important property of virtual pitch is that it cannot be masked by sounds with components that are equal or close in frequency to the frequency of the virtual pitch [89, 175]. This is demonstrated in Fig. 8.5 which is the same as Fig. 8.4, except that low-pass noise is added. The cut-off frequency of the noise is 2000 Hz. As can be heard, this noise is so loud that it masks the harmonics of the tones with frequencies lower than this 2000 Hz, reducing the loudness of these tones. As long as the frequencies of the harmonics are higher than 2000 Hz, however, the loudness


Fig. 8.6 Virtual “pitch” of a tone with an inharmonic but regular spectrum. The thin lines between about 200 and 500 Hz are the “pitches” estimated with the autocorrelation method described in Chap. 3. These estimated “virtual pitches” can deviate from what is actually heard (Matlab) (demo)

of the tones with their virtual pitches is the same as it is without the noise, and the noise does not affect the virtual pitch. It has been shown that it is not so much harmonicity but rather spectral regularity that is one of the factors playing a role in auditory-unit formation. The next demo of Fig. 8.6 is the same as that of Fig. 8.5 except that the tones are made inharmonic while maintaining their spectral regularity. This is done by keeping the distance between the partials the same as in Fig. 8.5 but shifting their frequencies upwards by F0 times the golden ratio, (√5 − 1)/2 ≈ 0.618. This guarantees that the tones are maximally inharmonic [184]. These tones with shifted harmonics are spectrally regular but not harmonic. Their inharmonic virtual “pitches” are indicated by the horizontal lines. They are estimated based on the summary autocovariance function (SACVF) obtained with the autocorrelation method described in Chap. 3 of this book. An example of such an SACVF will be presented in Fig. 8.20 for the third tone of the demo in Fig. 8.6. It shows multiple peaks, which may explain the ambiguity of the “pitches” of these tones and, hence, their quite erratic course. Just as for the virtual pitch of harmonic tones, the virtual “pitches” of these regular but not harmonic tones cannot be masked by noise as long as the partials remain audible, though the “melody” may be unclear due to the ambiguity of the “pitches”. So, one of the best tests to check whether an auditory attribute of a sound is pitch consists of showing that this attribute can produce a musical melody. And that virtual pitch is indeed pitch is further demonstrated in the next demo for the simple melody, also presented in Fig. 1.23. The pitch frequencies of the tones comprise the A4 of 440 Hz, the C♯5 of 554 Hz, and the E5 of 659 Hz. The tones in Fig. 1.23 are produced by adding the lowest three harmonics of the tones. There it was shown that, indeed, the frequency of the first harmonic, the fundamental frequency, corresponded to the pitch frequency of the tones, and not to any other possible set of harmonics. In the next demo, illustrated in Fig. 8.7, the bottom panel is the same as in Fig. 1.23. In the sounds depicted in the higher panels, the pitches are virtual, since they are induced by three harmonics far higher in frequency than the fundamental frequency. The virtual pitches are indicated by the dashed lines. In the second panel from below, the pitches of the tones are induced by the lowest three harmonics that have frequencies


Fig. 8.7 Same melody as presented in Fig. 1.23 synthesized with three harmonics. The tones consist of the three lowest harmonics just above a specific frequency level indicated by the dotted lines, from down to up, 400, 1000, 1500 and 2700 Hz. In the bottom panel, the tones consist of the lowest three harmonics, so that the pitch is not virtual. In the higher panels, however, the pitches are virtual and the harmonics may not follow the course of the melody (Matlab) (demo)

higher than 1000 Hz. In the third panel from below, the pitches are induced by the lowest three harmonics higher than 1500 Hz, while in the top panel, the pitches are induced by the lowest three harmonics with frequencies higher than 2700 Hz. The course of the harmonics no longer follows the melody, but can go up when the pitch goes down and vice versa. Listening to the sounds of Fig. 8.7, one hears the same melody played four times, showing that the pitches of the tones remain the same. What changes from melody to melody is the brightness of the tones, which does not follow the melody. The next demos will show that also harmonics of very high rank can induce a pitch. In the demo of Fig. 8.8, a virtual diatonic scale is played by six high harmonics. In the lower panel, the harmonics are chosen in such a way that the lowest is just higher than 3 kHz, indicated by the thin, dotted horizontal line just below the tones. In the


Fig. 8.8 Ascending and descending diatonic scale of virtual pitches induced by six harmonics higher than 3 kHz, lower panel, and 4.5 kHz, upper panel (Matlab) (demo)

upper panel all six harmonics are just higher than 4.5 kHz. In both cases, a diatonic scale can be heard, though it will be less salient in the upper panel. It is concluded that, as long as enough harmonics are present, a virtual pitch can arise, even when all harmonics have very high ranks. The salience or strength of this virtual pitch will, however, diminish as the rank of the harmonics gets higher [67].
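As an illustration of how stimuli of this kind can be put together, the following Matlab sketch synthesizes an ascending and descending diatonic scale in the spirit of Fig. 8.4: each tone consists of three successive harmonics, and the rank of the lowest harmonic decreases from 8 to 1 and back while the fundamental frequency rises and falls. The sampling rate, tone duration, and ramp length are assumptions, and no attempt is made to reproduce the exact levels of the original demo.

fs = 44100; dur = 0.3;                             % assumed sampling rate and tone duration
t  = (0:round(dur*fs) - 1) / fs;
env   = min(1, min(t, dur - t) / 0.01);            % 10-ms on/off ramps against clicks
semis = [0 2 4 5 7 9 11 12 11 9 7 5 4 2 0];        % ascending and descending diatonic scale
ranks = [8 7 6 5 4 3 2 1 2 3 4 5 6 7 8];           % rank of the lowest of the three harmonics
y = [];
for n = 1:numel(semis)
    f0   = 330 * 2^(semis(n)/12);                  % fundamental frequency, i.e., the (virtual) pitch
    tone = zeros(size(t));
    for k = ranks(n):ranks(n) + 2                  % three successive harmonics
        tone = tone + sin(2*pi*k*f0*t);
    end
    y = [y, (tone/3) .* env];                      %#ok<AGROW>
end
soundsc(y, fs);                                    % the pitch rises and falls while the brightness does the opposite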

8.4.5 Analytic Versus Synthetic Listening

In the previous demos, sequences of tones were played with at least three harmonics of low rank inducing clear pitches, resulting in easily identifiable melodies. After this, the rank of the harmonics was increased. This decreased the salience of the pitches and, thus, the identifiability of the melodies. When the number of harmonics is decreased and the rank of the harmonics is increased, there is a moment at which a virtual pitch is no longer perceived. This is illustrated in the next demo of Fig. 8.9. A sequence of two-tone complexes is played of which the F0s follow a rising and descending diatonic scale on an A of 440 Hz. In the first part of the demo, shown in the lower panel of Fig. 8.9, the frequency of the lowest harmonic is just above 3 kHz; in the second part of the demo, shown in the upper panel of Fig. 8.9, it is just above 4.5 kHz. Many listeners will hear a rising and falling diatonic scale in the first part of the demo, but that becomes doubtful in the second part. The moment at which this loss of virtual pitch happens appears to differ considerably from listener to listener [64, 133, 163]. Terhardt [173] distinguishes two ways of listening: listening in analytic mode and listening in synthetic mode. Listening in the analytic mode corresponds to focusing on the frequencies of the partials; listening in the synthetic mode corresponds to listening to the virtual pitch. The two


Fig. 8.9 Ascending and descending diatonic scale on 440 Hz of virtual pitches induced by two harmonics higher than 3 kHz, lower panel, and 4.5 kHz, upper panel (Matlab) (demo)

modes are illustrated in Fig. 8.10, derived from Smoorenburg [163]. Two two-tone complexes are successively presented. The first tone consists of the 9th and 10th harmonic of 200 Hz; the second of the 7th and the 8th harmonic of 250 Hz. Thus, the two lower harmonics of the complexes with frequencies of 1800 and 1750 Hz, respectively, fall in frequency, while the fundamental frequencies of the complexes rise from 200 to 250 Hz. It appears that listeners generally either hear two tones of which the second is lower in pitch than the first, or two tones of which the second is higher in pitch than the first. Moreover, the listeners do not switch from one way of listening to the other [163]. Based on this, listeners can be divided into two groups, analytic listeners on the one hand, and synthetic listeners on the other. Analytic listening, also called spectral listening [80, 147], occurs when listeners hear out the partials. Hence, in the demo of Fig. 8.10, they will perceive the fall in pitch of the lower partials of the two complexes. Synthetic listeners will hear the rise of a major third in the virtual pitches of the two complexes from 200 to 250 Hz. Hence, they will perceive an upward jump in pitch. Since these listeners combine harmonics, this is called synthetic listening. Other terms used are holistic listening [146], or F0 listening [80, 147]. The proportion of analytic and synthetic listeners depends on the rank of the harmonics [64, 133]. Duration appears to be of little effect for tones longer than about 100 ms,[80], but Beerends [12] found that very short tones, shorter than 50–100 ms, tend to be perceived more analytically than longer tones. Ladd et al. [80] studied the difference between analytic and synthetic listeners in more detail. Besides analytic and synthetic listeners, they distinguish an intermediate group of about 25% of all listeners who show no preference for analytic or synthetic listening. Remarkably, the proportion of analytic listeners is higher in left-handed people than in right-handed people [81]. Moreover, it appears that the differences between


Fig. 8.10 Schematic representation of two harmonic two-tone complexes. The first consists of the 9th and 10th harmonic of 200 Hz, with frequencies 1800 and 2000 Hz, respectively. The second consists of the 7th and 8th harmonic of 250 Hz, with frequencies 1750 and 2000 Hz, respectively. So, the first tone has a fundamental frequency of 200 Hz and the other of 250 Hz. If you hear an upward jump in pitch of a major third, you are a synthetic listener. This represents the melody of the low virtual pitches of the two-tone complexes. If you hear a fall at high frequencies, you are an analytic listener (Example from Smoorenburg [163]) (Matlab) (demo)

analytic and synthetic listeners can be traced back to differences in the auditory cortex of listeners. Heschl’s Gyrus on the left-hand side is involved in synthetic listening, whereas Heschl’s Gyrus on the right-hand side is involved in analytic listening. In analytic listeners, Heschl’s Gyrus is more developed on the left-hand side, in synthetic listeners on the right-hand side [146]. Moreover, the difference between analytic and synthetic listeners is not directly related with musical training, but people with musical training have a more developed Heschl’s Gyrus in general [147].
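For readers who want to try this on themselves, a two-tone sequence like the one sketched in Fig. 8.10 can be generated with a few lines of Matlab. The durations, levels, and ramps below are assumptions; only the harmonic numbers and fundamental frequencies are taken from the description above.

fs = 44100; dur = 0.4;                                  % assumed sampling rate and tone duration
t  = (0:round(dur*fs) - 1) / fs;
ramp  = min(1, min(t, dur - t) / 0.01);                 % 10-ms on/off ramps against clicks
tone1 = (sin(2*pi*9*200*t) + sin(2*pi*10*200*t)) / 2;   % 9th + 10th harmonic of 200 Hz: 1800 and 2000 Hz
tone2 = (sin(2*pi*7*250*t) + sin(2*pi*8*250*t)) / 2;    % 7th + 8th harmonic of 250 Hz: 1750 and 2000 Hz
gap   = zeros(1, round(0.2*fs));                        % 0.2-s silent gap between the two complexes
soundsc([tone1 .* ramp, gap, tone2 .* ramp], fs);
% Synthetic (F0) listeners tend to hear the pitch jump up, from 200 to 250 Hz;
% analytic (spectral) listeners tend to hear the lower partial fall, from 1800 to 1750 Hz.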

8.4.6 Some Conclusions

It is concluded that the pitch frequency of a sound does not necessarily correspond to the frequency of any of its components. In fact, this is a very familiar phenomenon. In telephone communication, all frequencies below 300 Hz (and above 3000 Hz) are filtered out. In spite of this, the pitch of a low voice, e.g., a male voice, remains unaffected. Likewise, in many sound-producing toys, the loudspeakers are so small


that they cannot reproduce frequencies lower than a few hundred hertz. In spite of that, the melodies or utterances played by these toys have well-defined pitches. For harmonics of low rank, only two harmonics are enough to induce a virtual pitch. The higher the rank of two adjacent harmonics, the lower will be the salience of the virtual pitch. This salience is increased by taking more than two harmonics. In general, most communicative sounds such as music and speech have many more than two or three harmonics. The pitch of such sounds is generally very salient.

8.5 Pitch of Complex Sounds

The pitch and the changes in pitch of an utterance define the intonation contour of this utterance and, hence, are of great communicative significance. The periods of voiced speech are coupled with the cyclical opening and closing of the vocal chords. The duration of each cycle can become shorter and longer, inducing rises and falls in pitch. In addition, articulatory gestures cause different kinds of changes in the oral cavity. These articulatory changes have the following aspects. First, articulation is coupled with changes in the resonance frequencies of the oral cavity. Second, articulatory gestures can induce the generation of noise at various locations in the oral cavity such as the lips, the teeth, and the glottis. Third, articulation can induce temporary closure of the oral cavity at these locations. Due to such changes, the waveform of the speech signal changes continuously from glottal cycle to glottal cycle. In fact, changes in waveform due to articulation are often faster than changes in the durations of the glottal cycles. This is illustrated in Fig. 8.11, showing the narrow-band and the wide-band spectrogram of the utterance “Please, turn on your recorder”. A detail of these spectrograms from the first word “Please” is presented in Fig. 8.12. How such spectrograms are calculated has been shown in Figs. 1.13 to 1.16. In the narrow-band spectrograms, shown in the top panels of Figs. 8.11 and 8.12, the pseudo-periodicity of the sound signal expresses itself as a pattern of parallel, approximately horizontal lines. The distance between the lines is the fundamental frequency. In the wide-band spectrogram, shown in the middle panels of Figs. 8.11 and 8.12, the periodicity of the sound signal expresses itself as a pattern of vertical lines, each line corresponding to one pitch period. This pattern of vertical lines in the wide-band spectrogram and of horizontal lines in the narrow-band spectrogram represents the pseudo-harmonic structure of these sounds and, as such, defines the course of the pitch of the utterance. It comprises various phonemes, syllables, and words and is only interrupted during the unvoiced segments, indicated in the bottom panel of Fig. 8.11. A detail of Fig. 8.11 from the vowel /i/ in the word “Please” is presented in Fig. 8.12. In the top panel, the narrow-band spectrogram is presented. Due to the fine spectral resolution, the harmonic pattern of the narrow-band spectrogram can be seen as the closely spaced, almost horizontal lines, each line indicating one harmonic of this speech segment. In this segment, the spacing between the lines is about 107 Hz, the fundamental frequency. The articulatory changes are visible as the dark bands representing the


Fig. 8.11 Waveform, narrow-band and wide-band spectrogram of the utterance “Please, turn on your recorder”. The periodic structure of the speech signal can be seen in the waveform shown in the lower panel. The horizontal line segments above “uv” indicate the unvoiced segments. The periodicity of the voiced segments expresses itself very well in the wide-band spectrogram shown in the middle panel. The associated harmonic structure of the voiced segments of the speech signal is clearly visible in the top panel (Matlab) (demo)

formants of the vowel. In the vowel /i/, the first and second formants are quite far apart. In Fig. 8.12, the first formant covers the first few harmonics; the second starts somewhat below 2 kHz and ends a bit above 2 kHz. The third formant is close to the second. The fourth and the fifth formant are also close together and form the bands above 3 kHz in the narrow-band spectrogram. The middle panel of Fig. 8.12 shows the wide-band spectrogram. Due to the fine temporal resolution, changes within one pitch period can be distinguished. Especially for the higher formants, most acoustic energy is produced at specific moments within one pitch period. For voiced speech signals, this is the moment that the vocal chords collapse, thus producing an acoustic impulse. The fine spectral resolution of the narrow-band spectrogram is now lost, however.
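The difference between the two types of spectrogram comes down to the length of the analysis window. The sketch below shows how narrow-band and wide-band spectrograms such as those of Figs. 8.11 and 8.12 might be computed in Matlab; it assumes the Signal Processing Toolbox, and 'please.wav' is a placeholder for a recording of the utterance, not a file provided with the book. The window lengths of about 40 ms and 5 ms are common choices, not necessarily the ones used for the figures.

[x, fs] = audioread('please.wav');                 % placeholder file name
x = mean(x, 2);                                    % mix down to mono if necessary

winNB = round(0.040 * fs);                         % ~40-ms window: fine spectral resolution,
subplot(2, 1, 1);                                  % so the harmonics show up as horizontal lines
spectrogram(x, hann(winNB), round(0.75*winNB), 2048, fs, 'yaxis');
title('Narrow-band spectrogram');

winWB = round(0.005 * fs);                         % ~5-ms window: fine temporal resolution,
subplot(2, 1, 2);                                  % so the glottal cycles show up as vertical striations
spectrogram(x, hann(winWB), round(0.75*winWB), 2048, fs, 'yaxis');
title('Wide-band spectrogram');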


Fig. 8.12 Detail of Fig. 8.11 from the vowel /i/ in the word “Please” (Matlab) (demo)

The spectrograms show many details of the signal with information about the production of the sound, speech in this case. It is important to realize that these details are not separately audible as details; neither can we hear any single closure of the vocal chords, nor can we hear any of the harmonics separately. Listening to the segment shown in Fig. 8.12, an /i/ with a pitch of about 107 Hz is heard, spoken by a male speaker. The discussion of the spectrograms presented in Figs. 8.11 and 8.12 may give the impression that it is easy to calculate the pitch of an utterance from its recording. Indeed, many algorithms for the extraction of pitch have been proposed. An overview for speech signals up to 1983 has been presented by Hess [57] and, since then, many other pitch-determination algorithms have been proposed. As mentioned before, within one utterance, the pitch after an unvoiced part cannot differ too much from the pitch before the unvoiced part. This indicates that this harmonic pattern is much more stable than the articulatory changes caused by the articulatory movements of the speaker. In fact, it seems that this harmonic structure forms a stable framework or grid within which the rapidly changing speech segments


Fig. 8.13 Waveform, wide-band and narrow-band spectrogram of two simultaneous utterances “Please, turn on your recorder” and “Please, insert your banking card”. Though it is not easy to find out which pitch period, which partial, or which formant trace belongs to which speaker, the listener can hear two speakers and understand what they say (Matlab) (demo)

are kept together. One can imagine that the harmonic patterns of two simultaneous, different utterances will be different and will not fit together. So, when produced simultaneously, these harmonic patterns contain information as to what speech segments belong to one utterance and what to another. An interfering sound fragment from one speaker will have a different harmonic pattern that does not fit into the harmonic pattern of another utterance and, therefore, cannot come from the same speaker. This simple description may give rise to the expectation that it will not be too difficult to separate the components of one speech signal from those of another. Unfortunately, there are many complications, especially when there are two utterances spoken simultaneously. This is illustrated in the next two figures, Figs. 8.13 and 8.14, showing the waveform, the narrow-band and the wide-band spectrogram of two simultaneous utterances. One is the same utterance as shown in Figs. 8.11 and 8.12, “Please, turn on your recorder”; the other is “Please, insert your banking card”. When one listens to these simultaneous utterances, one can clearly hear two speakers


Fig. 8.14 Detail of Fig. 8.13 from the vowel /i/ in the word “Please”. The listener can hear two speakers, one with a higher pitch than the other. Both say the vowel /i/ (Matlab) (demo)

and, though one may have to listen more than once, both utterances can be understood. Both utterances start with the same word “Please”, a detail of which is shown in Fig. 8.14. Listening to this short segment, one can clearly hear two speakers saying an /i/. Looking at the waveform or the spectrogram of this combined speech sound it appears to be very difficult, however, if not impossible, to find out which detail belongs to which speaker. The periodic structure of the waveform, so abundantly clear in the clean single speech signal, is no longer obvious. In the spectrograms, the harmonic structure of vertical or horizontal lines can still be distinguished, but the problem now is to separate the lines of one harmonic pattern from those of the other. In fact, this problem is not easy to solve based on these spectrograms. Nevertheless, listening to this speech fragment, two sounds are heard, each with its own pitch. This shows that the human pitch processor manages to solve this problem almost continuously in situations with more than one simultaneous speaker. In other situations, environmental sounds can interfere with the speech of someone one tries to understand, thus disrupting the periodic structure of the speaker’s utterances. One


of the most remarkable achievements of the human sound processor is that it very often succeeds in giving meaning to such mixtures of sound sources. The problem of measuring simultaneous pitches will be discussed further in Sect. 8.12.

8.6 Shepard and Risset Tones

In the next demo, it will be shown that octave ambiguities can lead to the strange percept of a seemingly infinitely falling or rising pitch. This was first demonstrated by Shepard [156], which is why they are called Shepard tones. These sounds consist of partials that are all exactly one octave apart. Hence, the spectrum of a Shepard tone consists of peaks at 2^k f_s Hz, k = …, −2, −1, 0, 1, 2, …. A consequence of this is that no unambiguous fundamental frequency can be distinguished, since for every partial with frequency f_s, there is a partial with frequency f_s/2, so that there is no lowest harmonic. Their signal s(t) can be written as:

$$s(t) = \sum_{k=-\infty}^{\infty} a_k \sin\!\left(2\pi\, 2^k f_0 t\right). \tag{8.2}$$

In the demo of Fig. 8.15, an ascending series of Shepard tones can be heard. Only partials with frequencies in the range from 20 to 18000 Hz are included. Their amplitudes are equal, so that, at moderate intensities, their contribution will approximate a dBB curve. The first tone of the series consists of partials with an octave relation with 440 Hz, so consists of the partials with frequencies 27.5, 55, 110, 220, 440, 880 Hz, etc. With every new tone, these frequencies are increased by half a semitone. A continuously rising sequence of tones is heard without any downwards jump in pitch. Nevertheless, since the inter-onset-interval between the tones is 0.2 s, the series exactly repeats itself every 4.8 s.
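A sketch of how such a sequence might be generated in Matlab is given below. It follows the description above (octave-spaced, equal-amplitude partials between 20 and 18000 Hz, shifted by half a semitone per tone), but the sampling rate, the simple on/off ramps, and the flat amplitude weighting are assumptions; the dBB-like weighting mentioned above is not reproduced here.

fs = 44100; dur = 0.2; nTones = 24;                % 24 steps of half a semitone = one octave
t  = (0:round(dur*fs) - 1) / fs;
env = min(1, min(t, dur - t) / 0.01);              % 10-ms on/off ramps
y = [];
for n = 0:nTones - 1
    fbase = 440 * 2^(n * 0.5/12);                  % shift the pattern by half a semitone per tone
    f = fbase * 2.^(-10:10);                       % octave-spaced partials
    f = f(f >= 20 & f <= 18000);                   % keep only partials between 20 and 18000 Hz
    tone = sum(sin(2*pi*f(:)*t), 1);               % equal-amplitude partials
    y = [y, (tone / numel(f)) .* env];             %#ok<AGROW>
end
soundsc(y, fs);                                    % the pitch seems to keep rising, yet the sequence repeats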

Fig. 8.15 Schematic spectrogram of Shepard tones. Note the logarithmic frequency axis, and that the distance between the harmonics is exactly one octave. In the course of the stimulus, the harmonic pattern rises over two octaves, but the last tone is exactly the same as the first tone (Matlab) (demo)


Fig. 8.16 Schematic spectrogram of Risset tones. The distance between the harmonics is exactly one octave. In this demo the harmonic pattern drops over ten octaves, but the signal at the beginning is the same as at the end (Matlab) (demo)

In the next demo, a continuous version of the Shepard tones is played. After the author who first described them, they are called Risset tones [33, 135]. A schematic time-frequency representation is shown in Fig. 8.16. The thickness of the components indicates their amplitudes. Note that the amplitude of the partials first increases and then decreases. In this demo, one tone is heard continuously falling in pitch. Analytic listeners will hear partials decreasing in frequency, but the remarkable thing is that, though partials may be heard coming in and going out, no jumps in pitch are perceived, in spite of the fact that the sound is identical to itself after 5 s. These examples of Shepard and Risset tones show the circular character of pitch perception as indicated in Fig. 8.1. The idea of circularity is much older than Shepard and Risset, however, and various composers have experimented with the circular structure of pitch. An historical overview of circularity in music is presented by Braus [18]. The Shepard tones played in the demo of Fig. 8.15 and the Risset tones played in the demo of Fig. 8.16 are tones with successive harmonics that differ by exactly one octave. Hence, their frequencies are all positioned right above each other in the pitch space presented in Fig. 8.1. Such tones, in fact, slowly ascend or descend the helix on which this pitch space is situated. This exemplifies the phenomenon of octave equivalence. Besides the octave, also other musical intervals such as the fifth and the third are harmonically related. Shepard also developed geometric figures of pitch spaces modelling these more complex harmonic relations. These figures include double helices and toruses. The interested reader is referred to Shepard [157].

8.7 The Autocorrelation Model

In Sect. 3.9, the information present in the temporal structure of the series of action potentials that run along the auditory-nerve fibres to the central nervous system was discussed. There it was argued that the interval distribution of these spike trains presents information regarding the frequency of the sound components that excite the


basilar membrane. The peaks in the autocorrelograms, or rather the autocovariance functions (ACVFs), of the outputs of the auditory filters were presented as a rough and first approximation of the peaks of the interspike-interval distributions of the auditory-nerve fibres. The autocorrelation functions themselves were not chosen, as they consist of correlation coefficients, which are normalized to a value of 1 at a delay of τ = 0. This normalization is omitted for ACVFs, so that the ACVF at τ = 0 represents a rough approximation of the excitation induced by the stimulus at that frequency. In this way, the contribution of the auditory channels is weighted by this excitation. Pitch estimation according to this model then consists of finding the most common periodicity in these interval distributions, which is simply done by adding the calculated ACVFs over the complete tonotopic array and finding the peak in this summary autocovariance function (SACVF). All this was illustrated for a 200-Hz pulse train in Fig. 3.26. If necessary, the reader is recommended to reread those sections. In the next sections, this will be illustrated for quite a number of other sounds. It will appear that the autocorrelation model of pitch perception unifies various kinds of pitch distinguished in the past, such as “residue pitch”, “low pitch”, “periodicity pitch”, “repetition pitch”, or “virtual pitch” [100]. These terms will be used here, not to indicate essential differences in the process underlying the perceptual generation of these “different” kinds of pitches, but only to indicate certain properties of the sound signals that generate the pitch. For instance, the term virtual pitch will be used for the pitch of complex sounds that have no acoustic energy at the frequency at which the pitch is perceived. But it will be shown that the autocorrelation model of pitch perception does not assume that virtual pitch is different from, e.g., repetition pitch. It will be shown that both can be estimated by the same model.
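To make the procedure more tangible, the following Matlab function sketches the chain in a strongly simplified form: a bank of band-pass filters as a crude stand-in for the tonotopic array, half-wave rectification followed by low-pass smoothing as a stand-in for neural transduction, an autocovariance function per channel, and a summary autocovariance function whose highest peak beyond a minimum lag is taken as the pitch period. This is emphatically not the model of Chap. 3: the filter shapes, bandwidths, channel spacing, and smoothing cut-off are all assumptions, and the function name is hypothetical. It assumes the Signal Processing Toolbox for butter and xcorr.

function f0 = sacvf_pitch_sketch(x, fs)
% Much-simplified SACVF pitch estimate (sketch; not the model of Chap. 3).
cf = 125 * 2.^(0:0.25:6);                   % crude tonotopic array: centre frequencies 125-8000 Hz
maxLag = round(fs / 50);                    % consider pitch candidates down to 50 Hz
sacvf = zeros(1, maxLag + 1);               % summary ACVF over lags 0 .. maxLag
[bl, al] = butter(2, 3000 / (fs/2));        % low-pass smoothing, mimicking the loss of phase lock
for c = cf
    bw = 0.25 * c;                          % crude, roughly proportional bandwidth
    [b, a] = butter(2, [max(c - bw/2, 20), min(c + bw/2, 0.95*fs/2)] / (fs/2));
    y = filter(b, a, x(:).');               % one "auditory" channel
    y = max(y, 0);                          % half-wave rectification
    y = filter(bl, al, y);                  % smoothed "inner-hair-cell" response
    y = y - mean(y);
    acvf = xcorr(y, maxLag, 'biased');      % autocovariance (unnormalized autocorrelation)
    sacvf = sacvf + acvf(maxLag + 1:end);   % keep lags 0 .. maxLag and sum over channels
end
s = round(fs / 1000);                       % skip lags below ~1 ms, i.e., the peak near the origin
[~, iPk] = max(sacvf(s + 1:end));           % sacvf(k+1) corresponds to a lag of k samples
f0 = fs / (s + iPk - 1);                    % pitch estimate from the highest remaining peak
end

Applied to, e.g., a complex of 800, 1000, and 1200 Hz (cf. Fig. 8.17 below), such a sketch should return a value close to 200 Hz; it is meant only to make the idea of summing ACVFs over the tonotopic array concrete, not to reproduce the results reported in this chapter.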

8.8 The Missing Fundamental

One of the first questions one may ask is whether the autocorrelation model of pitch perception can correctly estimate the virtual pitch of harmonic tone complexes that miss the first, and possibly more, lower harmonics. This will first be shown for tones consisting of three successive harmonics. Let f0 be the fundamental frequency of the three harmonics and n − 1, n, and n + 1 their rank. The sum of these three tones s(t) in its spectral representation is:

$$s(t) = \sin(2\pi (n-1) f_0 t) + \sin(2\pi n f_0 t) + \sin(2\pi (n+1) f_0 t), \tag{8.3}$$

which can be rewritten into its temporal representation by taking the first and the last term together. This gives:

$$s(t) = \sin(2\pi n f_0 t) + 2\cos(2\pi f_0 t)\sin(2\pi n f_0 t) = \left[1 + 2\cos(2\pi f_0 t)\right]\sin(2\pi n f_0 t). \tag{8.4}$$


Hence, a complex tone of three successive harmonics of f 0 with rank n − 1, n, and n + 1 can be rewritten in its temporal representation as a sinusoid of frequency n f 0 , the average frequency of the three tones, modulated by an amplitude of 1 + 2 cos (2π f 0 t). The factor between square brackets on the right-hand side of this equation represents the amplitude of the carrier. This amplitude and its inverse are the dashed lines in the top panels of Figs. 8.17, 8.18 and 8.19. First, the pitch of three successive harmonics of a low rank will be estimated, so that all harmonics are resolved. Then, the rank will be increased up to a level that a clear pitch is no longer perceived. Next, the number of successive harmonics will be increased from 3 to 7.
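The equivalence of the spectral representation of Eq. 8.3 and the temporal representation of Eq. 8.4 is easily verified numerically, for instance with a few lines of Matlab; the choice of f0 = 200 Hz and n = 5 below is just an example.

fs = 44100; f0 = 200; n = 5;                       % harmonics 4, 5, and 6 of 200 Hz
t  = (0:fs - 1) / fs;                              % one second of signal
sSpec = sin(2*pi*(n-1)*f0*t) + sin(2*pi*n*f0*t) + sin(2*pi*(n+1)*f0*t);
sTemp = (1 + 2*cos(2*pi*f0*t)) .* sin(2*pi*n*f0*t);
disp(max(abs(sSpec - sTemp)))                      % of the order of machine precision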

Fig. 8.17 Waveform, ACVFs and SACVF of a tone complex of 800, 1000, and 1200 Hz. Adapted from Meddis and Hewitt [100, p. 2871, Fig. 6] (Matlab) (demo)


Fig. 8.18 Waveform, ACVFs and SACVF of a tone complex of 1800, 2000, and 2200 Hz. Adapted from Meddis and Hewitt [100, p. 2871, Fig. 7] (Matlab) (demo)

8.8.1 Three Adjacent Resolved Harmonics

The first three-tone complex considered consists of three resolved harmonics, namely the fourth, fifth and sixth harmonic of 200 Hz added in sine phase. The signal itself, the ACVFs of the inner-hair-cell responses, and the SACVF are shown in Fig. 8.17 in the same way as for the pulse series in Fig. 3.26. The amplitude of the stimulus as presented in Eq. 8.4 and its inverse are shown by the dashed lines. The periodicity, 5 ms, corresponds to 200 Hz. Hence, one should expect an estimated pitch frequency of 200 Hz. The surface plot of ACVFs shows the three bands corresponding to the three harmonics. The lowest band has peaks at 1.25 ms and its multiples, the middle band has peaks at 1 ms and its multiples, and the highest band at 0.833 ms and its multiples. All ACVFs have a common maximum at 5 ms, resulting in a clear maximum in the SACVF at 5 ms, which is, as expected, the period of the, in


Fig. 8.19 Waveform, ACVFs and SACVF of a tone complex of 2800, 3000, and 3200 Hz. Adapted from Meddis and Hewitt [100, p. 2871, Fig. 7] (Matlab) (demo)

this case missing, fundamental. Apparently, this procedure operates quite well for these simple tone complexes with missing fundamentals. Besides the peak at 5 ms, the SACVF shows various secondary maxima, not only at subharmonic positions, but also at higher harmonic positions. First, the maximum just to the right of the origin will be discussed. This peak is positioned at a delay of about 0.952 ms corresponding to a frequency of 1050 Hz. This approximates 1000 Hz, the average frequency of the three harmonics and the carrier frequency of the sound when considered in its temporal representation as an amplitude-modulated sinusoid. Moreover, this frequency corresponds to the spectral centroid of the signal, which, as known, is a first approximation of the brightness of the sound.


8.8.2 Three Adjacent Unresolved Harmonics

The same procedure as in the previous example will now be demonstrated for a tone complex of three successive harmonics of rank 9, 10, and 11, hence unresolved harmonics. The waveform, ACVFs, and SACVF are shown in Fig. 8.18. The highest peak in the SACVF at 5 ms correctly corresponds to the virtual pitch of 200 Hz of the three-tone complex, but one can now see that the secondary peaks become higher and higher relative to the main peak. To explain this, realize that the human pitch processor has no a priori knowledge as to the harmonic rank of the three partials of the tone complex. It will be shown that the side peaks of the main peak at 5 ms can be explained by assuming that the frequencies of the three partials also fit more or less into the harmonic pattern with a fundamental frequency corresponding to these side peaks. First, consider the peak just to the right of the main peak at about 5.5 ms corresponding to about 182 Hz. The frequencies of the three partials, 1800, 2000, and 2200 Hz, are close to the frequencies of the 10th, the 11th, and the 12th harmonic of 182 Hz, which are 1820, 2002, and 2184 Hz, respectively. Similar reasoning can be applied to the peak at 4.5 ms corresponding to about 222 Hz. Indeed, the frequencies of the partials of 1800, 2000, and 2200 Hz are close to the 8th, the 9th, and the 10th harmonic of 222 Hz, which are 1776, 1998, and 2220 Hz, respectively. Therefore, the peaks to the left and right of the main peak can be considered pitch estimates, based on different assumptions about the fundamental frequency of the harmonic pattern to which the partials fit. Next, consider the first peak to the right of the origin. Measuring it gives a delay of 0.4988 ms, corresponding to about 2005 Hz, very close to the average frequency of the three partials. In other words, this peak corresponds closely with the centre frequency of a 2000-Hz tone modulated in amplitude according to the temporal representation of this sound presented in Eq. 8.4. This frequency corresponds closely to the brightness of the sound.

8.8.3 Three Adjacent Unresolved Harmonics of High Rank

In the next example, the rank of the three successive unresolved harmonics is even higher than in the previous example, namely 14, 15, and 16, so these harmonics have frequencies of 2800, 3000, and 3200 Hz. The waveform, ACVFs, and SACVF are shown in Fig. 8.19. The peak in the SACVF at 5 ms is no longer the highest. The highest peak is now the peak close to the origin at 0.3374 ms. As in the previous demos, this corresponds closely to the average frequency of the three harmonics. If the listeners hear this as pitch, it can be concluded that their pitch processor considers this stimulus as a 3000-Hz tone, modulated by the amplitude given by Eq. 8.4. Note, again, that this pitch also corresponds to the brightness of the sound. It has been shown that, according to this autocorrelation model of pitch perception, the virtual pitch of a tone consisting of three successive harmonics gets less salient


as the rank of the harmonics gets higher. For, the peak in the SACVF corresponding to f 0 gets smaller and smaller. Peaks close to the virtual-pitch period correspond to pitch estimates based on harmonic patterns with different fundamental frequencies. The peak close to the origin corresponds to the percept of a sinusoid with the average frequency of the tone complex, modulated in amplitude as given by Eq. 8.4. In the latter case, one may wonder whether people indeed hear a sound with the timbre of a tone and a pitch of 3 kHz. The way to test this is to check whether musical intervals or melodies can be heard in this range for sounds synthesized in this way and, if so, whether they correspond to the frequencies of the virtual pitches or the average frequencies of the harmonics.

8.8.4 Three Adjacent Shifted Unresolved Harmonics of High Rank

Another issue is the virtual “pitch” of regular but not harmonic tones as demonstrated in Fig. 8.6. In the demo of that figure, it was shown that these shifted harmonics elicited an ambiguous pitch that could not be masked as long as the regular partials were not masked. The next figure, Fig. 8.20, shows the waveform, ACVFs, and SACVF of the sixth tone of that demo consisting of partials with frequencies 898, 1453, 2008 Hz, so that the distance between adjacent partials is 555 Hz. The highest peak not at the origin, usually the estimated pitch period, is at 3.51 ms, corresponding to a pitch frequency of 285 Hz. But besides this main peak there are various other peaks without an apparent harmonic relation with the main peak, which may explain the ambiguity of the pitch.

8.8.5 Seven Adjacent Unresolved Harmonics of High Rank

In the three previous examples, it was shown that the virtual pitch of a tone complex of three successive harmonics becomes less well defined, less salient, when the harmonic rank of the complex gets higher. This corresponds quite well, at least qualitatively, to the appearance of higher secondary peaks in the SACVF. In the demos of Figs. 8.8 and 8.9, one could hear that pitch also gets more salient with more harmonics. It will now be shown how the autocorrelation method described here performs in this respect. So, in the next demos the same stimuli are synthesized as in Fig. 8.19, except that there are not three successive harmonics but seven. Let f0 be the fundamental frequency of the seven harmonics of rank n − 3 to n + 3 and fc = n f0 the frequency of the central component, the carrier. The sum of these seven tones s(t) in its spectral representation is:


Fig. 8.20 Waveform, ACVFs, and SACVF of the third tone played in the demo of Fig. 8.6, suggesting a “pitch” of 285 Hz. But there are various other peaks not at harmonic positions indicating that the pitch will be ambiguous (Matlab) (demo)

$$s(t) = \sum_{k=-3}^{3} \sin(2\pi (n+k) f_0 t). \tag{8.5}$$

Taking the components of rank n − k together with those of rank n + k for k = 1, 2, and 3 yields the temporal representation:

$$s(t) = \sin(2\pi f_c t) + 2 \sum_{k=1}^{3} \cos(2\pi k f_0 t) \sin(2\pi f_c t) = \left[ 1 + 2 \sum_{k=1}^{3} \cos(2\pi k f_0 t) \right] \sin(2\pi f_c t). \tag{8.6}$$


Fig. 8.21 Waveform, ACVFs and SACVF of a tone complex of 2400, 2600, 2800, 3000, 3200, 3400 and 3600 Hz (Matlab) (demo)

The factor between square brackets on the right-hand side of this equation represents the amplitude of the carrier. This amplitude and its negative inverse are the dashed lines in the top panels of Figs. 8.21 and 8.22. The waveform, ACVFs, and SACVF of the seven harmonics of 200 Hz centred around 3000 Hz are presented in Fig. 8.21. The main peak at a delay of 5 ms is now expected to be more significant, which, indeed, appears to be the case. The peak in the SACVF at about 0.33 ms, corresponding to the centre frequency of the complex, is still significant due to phase lock, but smaller than the peak at 5 ms. So, also in this respect, the results of this autocorrelation method correspond at least qualitatively with what is found perceptually. The first peak to the right of the origin is significant, again, corresponding to the carrier frequency of the seven-tone complex considered as an amplitude-modulated sinusoid.
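As with the three-harmonic complex, the equivalence of Eqs. 8.5 and 8.6 can be checked numerically; the values below (f0 = 200 Hz, n = 15, i.e., harmonics 12 to 18) are again only an example.

fs = 44100; f0 = 200; n = 15;                      % harmonics 12 to 18 of 200 Hz, fc = 3000 Hz
t  = (0:fs - 1) / fs;
sSpec = zeros(size(t));
for k = -3:3
    sSpec = sSpec + sin(2*pi*(n + k)*f0*t);
end
envlp = 1 + 2*(cos(2*pi*f0*t) + cos(2*pi*2*f0*t) + cos(2*pi*3*f0*t));
sTemp = envlp .* sin(2*pi*n*f0*t);
disp(max(abs(sSpec - sTemp)))                      % again of the order of machine precision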


Fig. 8.22 Waveform, ACVFs and SACVF of a tone complex of 7400, 7600, 7800, 8000, 8200, 8400, and 8600 Hz (Matlab) (demo)

8.8.6 Seven Adjacent Unresolved Harmonics in the Absence of Phase Lock

In the final example of this series, the autocorrelation procedure is tested for seven successive harmonics in the region where there is no longer phase lock. The result is shown in Fig. 8.22 for seven harmonics of 200 Hz centred around 8 kHz. Due to the lack of phase lock, the peaks intermediate between the main peaks are now gone, and the SACVF looks quite smooth. The ACVFs and the SACVF show clear peaks at about 5 ms and its multiples corresponding to the F0 of 200 Hz. The question here, of course, is whether listeners will indeed hear a pitch at 200 Hz. The brightness of the sound is so high that confusions between brightness and pitch are likely to occur. This was studied by Oxenham et al. [118]. They synthesized four-tone melodies with F0s selected from a diatonic scale between 1 and 2 kHz and harmonics


with frequencies higher than 6 kHz, so well above the phase-lock limit. Listeners were presented with two such melodies that could either be the same or different. In the latter case, the F0 of the second or the third tone of the melody was increased or decreased by one scale step. The task of the listeners was to indicate whether the melodies were the same or not. It appears that listeners were well able to do this task. Hence, it appears possible to make melodies with these kinds of sounds, which demonstrates the presence of pitch. As no identification of musical intervals is involved, it remains unclear, however, whether these results are based on pitch height or on pitch chroma. The same applies to the results by Macherey and Carlyon [91]. Their experiments used quite complex stimuli and the interested reader is referred to this literature.

8.8.7 Conclusions

The autocorrelation method, in general, yields plausible estimates of virtual pitch, certainly when the number of harmonics of low rank is significant. As the number of harmonics gets smaller and the rank of the harmonics gets higher, other peaks in the SACVF get higher. Two kinds of peaks have been described: First, peaks close to the main peak that correspond to alternative judgments as to harmonic rank of the components of the complex; second, the first peak close to the origin of the SACVF that corresponds to the carrier frequency of the tone complex when considered in its temporal representation, i.e., an amplitude-modulated sinusoid. This centre frequency is also a good estimate of the brightness of the tones. So, qualitatively, these two kinds of peaks correspond to well-known auditory attributes to which a frequency can be attributed, pitch and brightness. This autocorrelation model is completely based on rough approximations of the interval distributions of the spike trains in the auditory-nerve fibres. These demos show, therefore, that, at a qualitative level, the pitch and brightness of the presented sounds as they are perceived can be derived from these interval distributions. Quantitatively, however, the presented model falls short in various aspects. For instance, besides the main peak in the SACVF that was discussed, there are many other peaks, to which no significance has been attributed. The most notable of these are peaks at multiple delays of the main peak. In Figs. 3.26, 8.17, 8.18 and 8.19, showing the signal, the ACVFs, and the SACVF of a 200-Hz pulse train and of harmonic three-tone complexes with a (missing) 200-Hz fundamental, there are, besides the main peak at 5 ms that corresponds to the estimated pitch period, significant peaks at multiples of this pitch period, 10, 15, 20, 25 ms, etc. These peaks correspond to subharmonic frequencies of 200 Hz, and may be identified with second-, third-, or higher order intervals in the spike trains of the auditory-nerve fibres. They may represent harmonic ambiguities, but generally, they do not represent the pitch or brightness of a tone. At other delays, too, there are peaks that represent harmonic relations between the partials of the sound, but also in this case, no auditory attribute can be associated with the frequencies corresponding to these peaks. Moreover, pitch can be more or


less salient, and the question is to what extent the peaks in the SACVF provide a consistent picture in this respect. This will further be discussed in the upcoming Sect. 8.14 of this chapter. Finally, it is unclear what the significance is of the peaks in the SACVF of regular but not harmonic tones. These tones are not periodic which brings us to the next topic.

8.9 Pitch of Non-periodic Sounds

A few times it was mentioned that also non-periodic sounds can produce a pitch. This has already been shown for narrow noise bands in the demo of Fig. 1.68, of which the centre frequencies defined a clearly audible melody. Two other examples will be discussed in more detail: the pitch of repetition noise and the pitch of pulse pairs. It will be explained how the autocorrelation model of pitch perception also provides plausible estimates for these kinds of pitches. There are various other kinds of noisy sounds that induce the percept of pitch, but they will not be discussed here. The interested reader is referred to the literature. Examples of such non-periodic sounds with pitches are amplitude-modulated noise [21, 22], the pitch at the cut-off frequencies of sharply filtered noise, called edge pitch [36, 53, 76, 185], and the pitch of Zwicker tones [44], a faint pitch that can be heard shortly after a burst of band-stop noise at the band-stop frequency. Here, only the pitch of repetition noise and the pitch of pulse pairs will be demonstrated and discussed.

8.9.1 Repetition Noise

Besides narrow-band noise, another class of noise can also elicit a well-defined pitch, and that pitch arises when noise is added to a delayed version of itself. As mentioned in the discussion on the pitch of pulse pairs, the pitch induced by adding a delayed version of a sound to its undelayed original is called repetition pitch. Indeed, when T0 is the delay time, one can hear a pitch of 1/T0. This is illustrated for white noise in the demo of Fig. 8.23, showing the spectrogram of eight realizations of repetition noise. The delay of the first burst of the series is chosen so that 1/T0 = 220 Hz. The delay of the following noise bursts is chosen so that the resulting pitches form a diatonic scale on the A of 220 Hz. One of the first aspects of Fig. 8.23 that stands out is the pattern of horizontal bands laid over the spectrograms. The physical basis of this phenomenon becomes clear when the spectrum of repetition noise is calculated. Indeed, when a signal x(t) has a spectrum F(ω), the spectrum of a version delayed by τ can be written as F(ω)e^{−iωτ}. The spectrum of the repetition noise s(t) = x(t) + x(t − τ) is then equal to F(ω) + F(ω)e^{−iωτ} = F(ω)(1 + e^{−iωτ}). This gives a power spectrum that equals twice the square of the original spectrum multiplied by a raised cosine:


Fig. 8.23 Repetition pitch. Spectrograms are shown of white noise delayed and added to itself. This is done for a sequence of eight noise bursts in such a way that the successive delays equal the pitch periods corresponding to a rising diatonic scale on 220 Hz. Note the “ripples” over the spectrograms corresponding to the harmonics of the pitch frequencies of this diatonic scale (Matlab) (demo)

|F (ω) |2 |1 + e−iωτ |2 = 2|F (ω) |2 [1 + cos (ωτ )]

(8.7)

This shows that the power spectrum of the sum of two sounds, one of which is a delayed version of the other, equals twice the power spectrum of the original signal multiplied by a raised cosine with a periodicity in the spectral domain of $1/\tau$. In spectral measurements, this periodicity is observed as a ripple. Such ripples are clearly visible in Fig. 8.23 as the structure of horizontal bands. The distance between these horizontal bands increases in the course of the diatonic scale in accordance with the frequency of the perceived pitch. Because of this ripple, this noise is also referred to as ripple noise [190, 191]. Here, the term repetition noise will be used, and, as mentioned, the pitches of these kinds of sounds will be referred to as repetition pitch. The process of delaying and adding sequences of noise can be iterated, resulting in what is called iterated rippled noise. An example is presented in Fig. 8.24 with ten iterations. Again, a diatonic scale is played. Note the sharper ripples in the spectra of these sounds and the more salient pitches of the diatonic scale. Patterson et al. [126, 127] showed that the timbre of these sounds becomes more tonal with every iteration. In other words, the breathiness of such sounds decreases with the number of iterations. The authors showed that the reduction in the perceptual tone/noise ratio, or equivalently, the increase in breathiness, could be explained based on the height of the main peak of a somewhat adapted autocorrelation model. The question now is what pitch estimate is obtained when the autocorrelation procedure described in this chapter is applied to this kind of repetition noise. The result is shown in Fig. 8.25 for repetition noise with a delay of 5 ms, corresponding to a repetition pitch of 200 Hz. Figure 8.25 shows the waveform, the ACVFs, and the SACVF of this noise. No periodicity of 5 ms can be detected in the waveform of the signal, but in the SACVF a clear single peak can be seen close to the expected delay of 5 ms. Listening to this sound may, however, not yield a very salient pitch. It appears that playing repetition noise in the context of a melody, in the demos of Figs. 8.23 and 8.24 a diatonic scale, makes the melody, and hence the pitches of its notes, pop out clearly.

Fig. 8.24 Iterated rippled noise. Same as Fig. 8.23, but now ten successively delayed versions of white noise are added to the original noise. Note the sharper ripples and the more salient pitches than in Fig. 8.23 (Matlab) (demo)
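To make the delay-and-add construction and the autocorrelation-based estimate of the repetition-pitch period more concrete, here is a minimal Matlab sketch. It is not the book's own demo script; the sampling rate, durations, and the 2-ms lower bound on the searched delays are arbitrary choices made for this illustration.

```matlab
fs   = 44100;                  % sampling rate (Hz), arbitrary choice
dur  = 0.5;                    % duration of the noise burst (s)
tau  = 0.005;                  % 5-ms delay -> repetition pitch of 200 Hz
x    = randn(round(dur*fs), 1);          % white noise
d    = round(tau*fs);                    % delay in samples
y    = x + [zeros(d, 1); x(1:end-d)];    % noise plus its delayed copy

% Autocorrelation via the power spectrum (Wiener-Khinchin), cf. Eq. (8.7)
nfft = 2^nextpow2(2*length(y));
acf  = real(ifft(abs(fft(y, nfft)).^2));
acf  = acf(1:round(0.02*fs));            % keep delays up to 20 ms
acf  = acf / acf(1);                     % normalize so the zero-delay value is 1

% Find the highest peak beyond a 2-ms lower bound (to skip the origin)
lo        = round(0.002*fs);
[~, imax] = max(acf(lo:end));
T0        = (lo + imax - 2) / fs;        % estimated repetition delay (s)
fprintf('Estimated delay %.2f ms, repetition pitch %.0f Hz\n', 1000*T0, 1/T0);
```

Playing y with soundsc(y, fs) gives the faint 200-Hz repetition pitch discussed above; repeating the delay-and-add step a number of times approximates the iterated rippled noise of Fig. 8.24 and sharpens both the spectral ripple and the autocorrelation peak.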

8.9.2 Pulse Pairs

Similar results are obtained for pulse pairs. In Fig. 8.2 it has already been shown that a pulse pair has a pitch, albeit a weak one, with a pitch period equal to the time interval between the two pulses. The waveform, ACVFs, and SACVF of a 5-ms pulse pair are presented in Fig. 8.26. Here, too, a clear peak can be seen in the SACVF at a delay of 5 ms or very close to that. In summary, the previous demos show that non-periodic sounds, too, can have a pitch. This applies not only to narrow bands of noise, as demonstrated in Fig. 1.68, but also to a sound added to delayed repetitions of itself. A single repetition is enough to induce pitch, both for wide-band noise and for single pulses. It is concluded that the SACVF procedure lends itself well to the estimation of this repetition pitch.

Fig. 8.25 Waveform, ACVFs, and SACVF of repetition noise with a 5-ms delay. A clear peak can be distinguished corresponding to the perceived pitch (Matlab) (demo)

8.10 Pitch of Time-Varying Sounds

Except perhaps when synthesized, a sound is rarely strictly stationary. In the demos described above, the estimation of the pitch at a certain moment was based on segments of about 60 ms. When applied to, e.g., speech utterances, this leads to estimated pitch contours with much more detail than is perceptually relevant. Apparently, the pitch contours as we perceive them are based on the integration of information distributed over a longer time interval. Based on results with FM tones, Carlyon et al. [29] conclude that this integration window has a duration of about 110 ms. Moreover, sounds such as speech utterances and musical melodies can change very rapidly in intensity and spectral content. In order to deal with the variability of these time-varying sounds, Gockel, Moore, and Carlyon [45] formulated the weighted average period (WAP) model, in which relatively stable parts of the signal contribute more to the perceived pitch contour than rapidly changing parts.

Fig. 8.26 Pitch estimation of a pair of pulses with a 5-ms interval between them (Matlab) (demo)

Another issue is vibrato. In music performance, tones are often provided with pitch fluctuations around the musically prescribed pitch. As demonstrated for a synthetic vowel in the demo of Fig. 4.12, musical tones without such vibrato are often perceived as machine-like. The excursion size of the instantaneous pitch frequency during vibrato usually far exceeds the tolerance range of what is in tune for a tone with a stationary pitch frequency. Nevertheless, listeners appear to be well able to attribute a well-defined pitch, called its principal pitch, to a vibrato tone [42]. Shonle and Horan [159] found that the frequency of this principal pitch corresponds better to the geometric mean of the course of the instantaneous pitch frequency of the tones than to its arithmetic mean. They did not consider the Cam scale. A review of six different methods to deal with these issues is presented by Etchemendy, Eguia, and Mesz [42]. They concluded that none of these six methods could fully predict all
perceptual results obtained in the study of this issue. The most robust predictions were, however, given by the mentioned WAP model by Gockel, Moore, and Carlyon [45] and a model presented by Mesz and Eguia [105]. These two models distinguish themselves from the others by incorporating a “stability sensitive” mechanism that gives more weight to sound segments in which the pitch frequency is relatively stable. It appears that it is difficult to determine “the” integration window involved in human pitch perception. In fact, Wiegrebe [187] showed that pitch extraction cannot be based on a single integration window, but that the integration window depends on the frequency of the pitch itself. Lower pitches require a longer integration window than higher pitches. This indicates a recursive processing of pitch information. This, and similar results, led Balaguer-Ballester et al. [10] to develop a hierarchical model of pitch processing including feed-back stages in which ascending information is compared with descending predicted information. If the deviations are small, the integration windows are relatively long, so that the estimation of pitch is based on a relatively long, stable sound segment, thus optimizing the accuracy of the estimate. If, on the other hand, the ascending information deviates much from what is expected, the integration window is shortened so that the processing is based on a short segment containing the most recent information that can then optimally be used to update the prediction process.
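As a rough illustration of the idea of stability-sensitive weighting, and explicitly not an implementation of the actual WAP model of Gockel, Moore, and Carlyon [45] or of the hierarchical model of Balaguer-Ballester et al. [10], the following Matlab sketch combines frame-wise period estimates into a single pitch value with weights that shrink when the local period changes rapidly. The example contour and the weighting function are assumptions made only for this illustration.

```matlab
% p: frame-wise period estimates (s), e.g., one value per 10-ms frame.
% Hypothetical example: a pitch around 150 Hz with a brief upward excursion.
p  = 1 ./ [150 150 151 149 150 185 230 180 150 150 151 150];

% Stability weight per frame: large when the period hardly changes locally.
dp = [0, abs(diff(p))];               % frame-to-frame change of the period
w  = 1 ./ (1 + (dp ./ (0.02*p)).^2);  % ad-hoc weighting, an assumption

% Weighted average period and the corresponding pitch frequency
Twap = sum(w .* p) / sum(w);
fprintf('Weighted average period %.2f ms, pitch about %.1f Hz\n', ...
        1000*Twap, 1/Twap);
```

The rapidly changing frames around the excursion receive small weights, so the outcome stays close to the period of the stable 150-Hz portions, which is the qualitative behaviour described above.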

8.11 Pitch Estimation of Speech Sounds

Up to now, mainly mathematically well-defined, relatively simple signals have been considered, which were either tonal or noisy. The human pitch processor, on the other hand, operates predominantly on much more complicated sounds, of which speech and music are the most complex examples. There are various reasons why the pitch of a speech signal can be difficult to determine. First, speech consists of rapid successions of voiced segments, unvoiced segments, and segments that are voiced but very noisy, such as the /v/ and the /z/. Second, speech can change very rapidly, especially at the transitions between phonemes. Third, a speaker can vary the amount of voicing. At one extreme, speech can be completely unvoiced, e.g., when the speaker is whispering. This speech is generally quite soft and cannot carry very far. At the other extreme, there is speech produced to carry as far as possible. To achieve this, speakers increase the effort with which they speak or even shout, as has been discussed in Sect. 6.7.3. In between these extremes there is a wide variety of more or less noisy speech in which the amount of noise can vary considerably. When the share of noise in the speech signal is high, this speech will in general be perceived as breathy. A fourth factor that can make it difficult to define the pitch of a speech sound is that voicing is not necessarily periodic or pseudo-periodic. In what is called creaky voice, the vocal cords can vibrate at subharmonic periods or chaotically [176, 177]. If one then tries to measure the pitch contour of a speech utterance based on independent pitch estimations, all these factors lead to irregular contours containing many perceptually irrelevant details. In fact, not every segment of the speech signal
contributes equally to the perceived pitch contour. How to measure the pitch contour of a speech utterance in such a way that it only contains perceptually relevant details is discussed by Hermes [56]. Here, only the estimation of pitch at a certain moment in the utterance will be discussed. The demos are based on the same utterances as shown in the figures of Sect. 3.9. Figure 8.27 presents the waveform, the ACVFs, and the SACVF of a segment of the phoneme /i/ in the word "please" from the utterance "Please turn on your recorder". The waveform of the signal is pseudo-periodic, especially for the lower-frequency components, and the estimated pitch period is expected to correspond closely to that. The SACVF looks somewhat more irregular than the SACVFs of previous sections. This is due to the more random-like fluctuations of the signal, especially at frequencies around 2 kHz. Taking this into account, the result does not deviate very much from what is expected. The estimated pitch period of 9.06 ms corresponds well with the periodicity of the signal; the corresponding pitch has a frequency of 110.4 Hz. The same can be said about the estimated pitch shown in Fig. 8.28, also based on a segment of the phoneme /i/ in the word "please", but now from the utterance "Please insert your banking card" spoken by a different speaker. The waveform of this signal, too, is pseudo-periodic, and the estimated pitch period is again expected to correspond closely to this periodicity. Indeed, the estimated pitch period of 6.73 ms corresponds well to the periodicity of the signal; the corresponding pitch has a frequency of 148.5 Hz. Comparing the male voices of Figs. 8.27 and 8.28 thus shows a lower-pitched voice in the first and a higher-pitched voice in the latter figure, which corresponds to what can be expected. The situation becomes more complicated in the next demo, that of Fig. 8.29, in which the two speech signals are combined, which brings us to the next issue: pitch estimation of concurrent sounds.

Fig. 8.27 Pitch estimation for a segment from the vowel /i/ in the word "please" for one speaker from the utterance "Please turn on your recorder". The pitch is estimated at 1000/9.06 = 110 Hz (Matlab) (demo)

Fig. 8.28 Pitch estimation for a segment of the vowel /i/ in the word "please" from the utterance "Please insert your banking card" spoken by a speaker different from the speaker of Fig. 8.27. The pitch is estimated at 1000/6.73 = 148.5 Hz (Matlab) (demo)
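For readers who want to experiment with this, a very reduced frame-based variant can be written in a few lines of Matlab. This is not the procedure used for the figures, which operates on the outputs of the auditory filters; it is a plain waveform autocorrelation per 60-ms frame, with the peak search restricted to an assumed speech range of 60–400 Hz and a crude, arbitrarily chosen voicing threshold.

```matlab
% x: speech signal as a vector, fs: sampling rate in Hz; both assumed given.
function f0 = simple_pitch_track(x, fs)
  frame = round(0.060*fs);                 % 60-ms frames, as in the demos above
  hop   = round(0.010*fs);                 % 10-ms hop, an arbitrary choice
  lo    = round(fs/400);                   % shortest period considered (400 Hz)
  hi    = round(fs/60);                    % longest period considered (60 Hz)
  nfr   = floor((length(x) - frame)/hop) + 1;
  f0    = zeros(nfr, 1);                   % 0 will mean "unvoiced"
  for k = 1:nfr
    seg  = x((k-1)*hop + (1:frame));
    seg  = seg - mean(seg);
    nfft = 2^nextpow2(2*frame);
    acf  = real(ifft(abs(fft(seg, nfft)).^2));   % autocorrelation via FFT
    [pk, ip] = max(acf(lo+1:hi+1));              % best period between lo and hi
    if pk > 0.3*acf(1)                           % crude voicing decision
      f0(k) = fs/(lo + ip - 1);
    end
  end
end
```

Applied to the two utterances above, the voiced frames of the first speaker should come out near 110 Hz and those of the second near 150 Hz; turning such a raw contour into a perceptually relevant one is the topic discussed by Hermes [56].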

Fig. 8.29 Pitch estimation for a segment from the vowel /i/ in the word “please” from the two utterances of Figs. 8.27 and 8.28, now played simultaneously. The algorithm gives a pitch estimate of 1000/6.73 = 149 Hz, the pitch frequency of the second speaker. There is also a peak at 9.06 ms indicated by the shorter dashed line corresponding to the pitch period of the first speaker (Matlab) (demo)

8.12 Estimation of Multiple Pitches

Figures 8.13 and 8.14 showed the spectrogram and the waveform of the speech of two simultaneous speakers. It was shown that it was no longer easy to distinguish two periodic speech sounds. In Figs. 8.27 and 8.28, the autocorrelation model was applied to a short segment of the vowel /i/ in the first word "please" of each of these utterances. In Fig. 8.29, the autocorrelation method is applied to the sum of these two speech signals. The waveform of this combined speech signal is presented in the top panel of Fig. 8.29. It shows that the signal is no longer pseudo-periodic, and it seems unlikely that, without more information, one can see that this signal is the sum of two pseudo-periodic speech signals. When listening to the sound, however, two speakers are
heard, each producing an /i/-like vowel sound with its own pitch. Applying the pitch-estimation procedure to this signal may give some indications. Indeed, the SACVF shows clear peaks close to the peaks in the SACVFs of the separate signals. The largest peak, at 6.73 ms, corresponds to the higher pitch in Fig. 8.28. But a peak can also be seen at the pitch period of the first signal, that of Fig. 8.27. In the model presented here, the ACVFs are calculated based on the sum of the two speech signals. The SACVF is then scanned for possible peaks. In this example, this results in two peaks at delays that closely correspond to the periods of the two pitches. The question now is whether the human central pitch processor carries out this task in a comparable way. After all, not only the two peaks corresponding to the two pitch periods of the two speech signals can be distinguished; other peaks of comparable heights can also be seen. Furthermore, in this demo two voiced vowel sounds of about equal intensity are added. Hence, the peaks at the pitch periods in the SACVF have about equal heights. If one of the two speech signals is much weaker, it is unlikely that the pitch period of the weaker signal will be clearly represented in the SACVF, let alone that its main peak can unambiguously be detected. The situation will be even more complex when more than two speakers speak simultaneously, when a number of musicians play together, or when there is noise in the background. So, the question now is how the auditory system solves this problem. In this book, Chap. 4 on auditory-unit formation precedes the chapters on the auditory attributes of a sound. In that chapter, it was argued that, also in auditory processing, auditory-unit formation precedes the emergence of perceptual attributes such as pitch, loudness, and timbre. This implies that, in the auditory system, the estimation of a perceptual attribute of an auditory unit is based only on information that has contributed to the formation of that auditory unit. A big advantage of this approach is that the problem of estimating simultaneous pitches is avoided. Nevertheless, much effort has been put into developing algorithms to estimate simultaneous pitches. This has often been done for practical purposes such as automatic speech separation or automatic speech recognition in noisy environments. Reviews of these methods are presented in, e.g., Christensen and Jakobsson [34], De Cheveigné [38], and Klapuri [74]. Yeh, Roebel, and Rodet [188] presented an overview of multiple-pitch extraction algorithms for music sounds.
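As a rough sketch of what reading two pitch candidates from a single SACVF could look like, the following Matlab fragment picks the largest peak in a plausible delay range and then the largest remaining value that does not lie near a multiple of the first candidate. The delay range, the 0.5-ms tolerance, and the whole strategy are assumptions for illustration; the multi-pitch algorithms reviewed in the references above are considerably more sophisticated.

```matlab
% sacvf: summary autocovariance function (one value per sample of delay);
% fs: sampling rate in Hz. Both are assumed to come from the earlier stages.
function periods = two_pitch_candidates(sacvf, fs)
  lo  = round(0.002*fs);                      % ignore delays below 2 ms
  hi  = min(length(sacvf), round(0.020*fs));  % and above 20 ms
  seg = sacvf(lo:hi);  seg = seg(:);
  del = ((lo:hi)' - 1) / fs;                  % delay of each sample (s)

  % First candidate: the overall maximum in the search range
  [~, i1] = max(seg);
  T1 = del(i1);

  % Second candidate: the maximum after excluding delays near multiples of T1
  near = false(size(del));
  for m = 1:4
    near = near | abs(del - m*T1) < 0.0005;   % 0.5-ms tolerance, an assumption
  end
  seg(near) = -Inf;
  [~, i2] = max(seg);
  T2 = del(i2);

  periods = [T1, T2];
end
```

For the two-speaker demo of Fig. 8.29 this would be expected to return periods near 6.73 and 9.06 ms but, as discussed above, such a simple procedure quickly breaks down when the two voices differ in level.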

8.13 Central Processing of Pitch

In early studies on pitch perception, it was assumed that pitch may arise from distortion products produced within the cochlea on the basilar membrane, e.g., Schouten [148]. Houtsma and Goldstein [65], however, showed that the perception of pitch is a central process. They synthesized pure-tone dyads, i.e., two-tone complexes consisting of two pure tones, which had virtual pitches. They presented musically trained listeners with pairs of these pure-tone dyads, as, e.g., in Fig. 8.10, and asked them to identify the musical interval between the virtual pitches. It appeared that these listeners could not only do this task when both harmonics of
the dyads were presented to both ears, but also when one of the harmonics of the dyad was presented to one ear and the other to the other ear. In this way, they prevented the two harmonics from interacting on the basilar membrane and, hence, from generating distortion products. This shows that the percept of pitch is generated in the central auditory system. The autocorrelation model as presented above is based on the assumption that the information on which pitch perception is based is derived from the interval distributions of the trains of action potentials running over the auditory-nerve fibres to the central nervous system. This involves not only the temporal fine structure (TFS; see Sect. 3.9) but also the phase lock to the envelope of the output of the auditory filters. One may wonder whether this assumption is correct. If so, the information present in these interval distributions must be enough to explain the phenomena related to pitch perception. Indeed, Cariani and Delgutte [26, 27] showed that a whole series of pitch-perception phenomena can be explained by assuming that pitch is associated with the dominant interval in the spike trains of auditory-nerve fibres, among them pitch salience (see also Trainor et al. [179]), pitch ambiguity, phase invariance, and pitch circularity. A comprehensive review of the role of TFS in pitch perception is also presented in Chap. 3 of Moore [111]. Various authors have argued that resolved harmonics and unresolved harmonics are processed by different mechanisms [28, 67]. They show that the pitch discrimination of complex tones with several harmonics is much better for complex tones consisting only of resolved harmonics than for complex tones consisting only of unresolved harmonics. Houtsma and Smurzynski [67] already mention that their results may also be explained based on the interspike-interval distributions in the auditory-nerve fibres. This was confirmed by Meddis and O'Mard [102], who argued that the lock to the envelope of unresolved harmonics is less precise than the lock to the phase of the resolved harmonics. This can be checked by comparing the SACVFs of Figs. 8.21 and 8.22. Another argument in favour of the assumption that pitch estimation is based on the interval distributions of the spike trains in the auditory-nerve fibres is provided by the results of modelling studies concerned with pitch perception. In fact, the first sketch of an autocorrelation model of pitch perception was presented by Licklider [88]. This was implemented by Slaney and Lyon [161]. A detailed implementation including a model of the inner-hair-cell function was presented by Meddis and Hewitt [100], who showed that a variety of pitch-perception phenomena were well described by their model, e.g., virtual pitch, ambiguous pitch, and repetition pitch. Moreover, they could also explain some phase phenomena that are difficult to reconcile with pitch-perception theories based only on excitation patterns [101, 102]. The integration of peripheral models as described above with models of the central processing of pitch will continue to develop. Indeed, Balaguer-Ballester, Denham, and Meddis [9, p. 2195] conclude: "Both peripheral models and methods for analyzing their output are continuing to evolve and we must expect increasingly sophisticated accounts of pitch perception to emerge as a consequence. [...] Nevertheless, we conclude that, for the present, Licklider’s (1951) view that pitch perception can be
understood in terms of a periodicity analysis of the activity of the AN [auditory nerve] remains intact.” The models described up to now just performed an autocorrelation on the outputs of the auditory filters without incorporating detailed knowledge about auditory processing in the more central parts of the auditory nervous system. Later, these models were extended with models of auditory processing within the cochlear nucleus [50, 59] and the inferior colliculus [40, 58, 75, 103]. Regarding their model including the inferior colliculus, Meddis and O’Mard [103, p. 3861] concluded: “This physiological model is similar in many respects to autocorrelation models of pitch and the success of the evaluations suggests that autocorrelation models may, after all, be physiologically plausible.” McLachlan [95, 96], however, mentions some limitations of the autocorrelation models, not so much with respect to the estimation of the pitch frequency as with respect to the estimation of pitch salience. Including the auditory cortex in his model, McLachlan [95, 96] proposes a model of human pitch estimation based on a general two-phase model of auditory perception presented by Näätänen and Winkler [116] and discussed in Sect. 4.9.2. In the first, prerepresentational phase, feature traces are formed of the various sources of information used in the perception of auditory attributes. In the second, representational phase, “unitary auditory events”, in other words auditory units, are formed that are accessible to conscious perception and attention. As to pitch perception, McLachlan [95] suggests that, in the prerepresentational phase, a pitch trace of periodicity information is formed with a rather coarsely defined pitch frequency. Spectral regularity and common onsets are likely candidates to play a role in this, since they are involved in auditory-unit formation. Based on this trace, neurons in the auditory cortex carry out a much finer analysis based on the fine timing information represented in periodicity traces in the inferior colliculus. These neurons are identified with neurons in the cortex that are very finely tuned, much finer than the auditory-filter bandwidth [15]. These cortical neurons inhibit rival neurons representing other pitches. This can explain why the discrimination of pitch frequency is much more accurate than suggested by the auditory-filter bandwidth. Another finding corroborating this model is that the just-noticeable difference for pitch frequency is based on the perceived pitch and not on that of the individual harmonics. This follows from the finding that the just-noticeable difference is smaller for harmonic complexes than for inharmonic complexes [107], showing that indeed the combined information from the various harmonics determines the perceived pitch. Another aspect of this model is that listeners can only accurately attend to one sound source at a time [97, 98]. The authors argue that recognition of musical chords is not based on the simultaneous estimation of the pitches of the individual notes of the chords, but on learned familiarity with the chords. One of the consequences of this is that listeners cannot count simultaneously sounding tones even if the actual number is as low as two or three [174]. This sharply contrasts with the counting of visual objects, which is almost perfect for up to four simultaneous objects. Of course, counting the auditory units of a stream is as good as counting visual objects, at least when the auditory units are separated in time by more than 125 ms [134].
This is another argument for the idea that auditory units are perceptually represented primarily in time [116]. This will be further discussed in Sects. 10.3 and 10.4.

8.14 Pitch Salience or Pitch Strength

Listening to the demos on pitch perception in this chapter, the listener will have noticed that the pitch of one sound can be perceptually much better defined than that of another. At one extreme, there is the "weak" repetition pitch of pulse pairs as demonstrated in Fig. 8.2 or of repetition noise as demonstrated in Fig. 8.23; at the other extreme, there is the "strong" pitch of complex tones with a considerable number of low-rank harmonics. Fastl and Stoll [43] studied whether listeners could consistently scale the pitch of a sound on a continuum from weak to strong. They synthesized a number of different kinds of sounds, e.g., pure tones, complex tones with lower harmonics, bandpass-filtered complex tones without lower harmonics, amplitude-modulated tones, repetition noise, and bandpass-filtered noise. These sounds were equalized in loudness and duration; their pitch frequencies were varied over 125, 250, and 500 Hz. Listeners were asked to scale the pitch strength of these sounds relative to that of a standard 500-Hz pure tone, the strength of which was set to 100%. Fastl and Stoll [43] found that listeners could, indeed, do this task consistently. For instance, the pitch of complex tones with lower harmonics was stronger than the pitch of complex tones with virtual pitches, and complex tones with virtual pitches had stronger pitches than the noisy sounds. Results like these have been confirmed by, e.g., Houtsma and Smurzynski [67], Moore and Rosen [115], and Ritsma [138]. They show that, in general, listeners can order the pitches of different sounds well from weak to strong. This attribute is called pitch strength or pitch salience [26, 63, 172]. Some other terms are also in use. For instance, Patterson et al. [126, 127] use the term perceptual tone/noise ratio. Another term is tonalness, discussed as the opposite of breathiness in Sect. 6.3. There, it was mentioned that an estimate of tonalness is computed by first estimating the pitch of the sound and then calculating the contribution of harmonic peaks to its spectrum. Pitch strength has been studied for various kinds of sounds, e.g., for bandpass noise. That bandpass noise can elicit pitch, at least if the band is narrow enough, has been shown in the demos of Figs. 1.67 and 1.68. In the first demo, the bandwidth of the noise is one octave, and only an increase in brightness is perceived; in the second, the bandwidth of the noise is one sixth of an octave, and a clear diatonic scale can be heard. It will be clear that the pitch strength increases as the bandwidth gets smaller. The pitch strength of bandpass noise was studied by Horbach, Verhey, and Hots [61]. When the slopes of the bandpass filters were not too steep, they found, as expected, that the pitch strength decreased as the bandwidth of the noise increased, and the pitch vanished when the bandwidth exceeded the critical bandwidth. When the slopes of the bandpass filters were very steep, however, the results were complicated by edge pitches. Edge pitch arises at the cut-off frequencies of very sharply filtered broadband sounds [185, p. 596], [53, 76]. Horbach, Verhey, and Hots [61] found that
the edge pitches of narrow noise bands fuse with the pitch at the centre frequency, thus increasing the pitch strength of these sharply filtered noise bursts. Another example of pitch elicited by noise sounds is repetition pitch, examples of which have been presented in the demos of Figs. 8.23 and 8.24. In Fig. 8.23, noise is delayed once and added to itself; this is done for eight noise bursts with delays that decrease in time in such a way that the repetition pitches of the successive noise bursts form a rising diatonic scale. The same is done in Fig. 8.24, except that not one but ten iterations of the delayed noise are added. It will be clear that the pitch in the scale with ten iterations is stronger than that in the scale with just one. The pitch strength of iterated rippled noise has been investigated by Patterson et al. [126, 127] and Yost [189]. They wondered whether the height of the main peak in the summary autocorrelation function was a good measure of the strength of the repetition pitch, and concluded that this was indeed the case. Shofner and Selas [158] tested this for a larger variety of sounds. Besides wide-band noise, which has no pitch, the authors included harmonic tones with harmonics added in cosine phase and in random phase, and iterated rippled noise. The pitches of these sounds were varied over 62.5, 125, 250, and 500 Hz. The authors found, again, a monotonic relation between the height of the first main peak in the summary autocorrelogram and the pitch strength as scaled by the listeners. This relation corresponded to a somewhat modified version of Stevens’ power law with an exponent somewhat larger than 1. In summary, pitch strength may be associated with the height of the main peak of the summary autocorrelogram. In the autocorrelation model, the main peak is thought to represent the sum of all intervals in the spike trains of the auditory-nerve fibres that correspond to the pitch period. This, however, appeared not to work out well for some other kinds of sounds, such as pulse trains with alternating inter-click intervals, alternating, e.g., between 4 and 6 ms [30, 31, 192]. Based on this, Balaguer-Ballester, Denham, and Meddis [9, p. 2193] proposed to include not only the main peak of the summary autocorrelation function but also the peaks at multiples of the delay of the main peak, corresponding to subharmonic positions of the main peak. Within the autocorrelation model of pitch perception, this amounts to saying that not only the first-order intervals in the spike trains of the auditory-nerve fibres are included in the estimate of pitch strength, but also the intervals of higher order. These intervals correspond to subharmonics of the frequency exciting the nerve fibre. For the summary autocovariance function (SACVF) of a 200-Hz pulse train shown in Fig. 3.26, this main peak is at 5 ms. The peaks representing the subharmonic intervals are at 10, 15, 20 ms, etc. So, Balaguer-Ballester, Denham, and Meddis [9] propose to include these subharmonic peaks in the estimate of pitch strength. The same applies to the SACVFs of the harmonic 200-Hz tones with virtual pitches shown in Figs. 8.17, 8.18, 8.19, 8.20, 8.21 and 8.22. Only for the demos of repetition pitch presented in Fig. 8.25 for noise and in Fig. 8.26 for pulse pairs can just one single peak be distinguished. A similar approach was followed by McLachlan [95], who defines pitch strength as the "certainty of pitch classification" (p. 27).
He did not take an autocorrelation model as a starting point, but a model of the central processing of pitch [99] that produces a "tonotopic activation pattern." This tonotopic activation pattern is scanned
with a large number of harmonic patterns with varying distances between the harmonics. The fundamental frequency of the pattern with the best match gives the estimate of the pitch frequency. In estimating pitch strength, McLachlan [95] then not only includes activation at the fundamental frequency but also at subharmonic positions, just as proposed for the summary autocorrelogram by Balaguer-Ballester, Denham, and Meddis [9]. A somewhat different model of pitch strength, based on the auditory image model [16], is presented by Ives and Patterson [70]. These authors also show that, within their model, the pitch strength of a harmonic tone without lower harmonics can be derived from interval distributions of the spike trains in the auditory-nerve fibres. They do not, however, explicitly include subharmonics in their model.
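A minimal sketch of a pitch-strength measure along the lines proposed by Balaguer-Ballester, Denham, and Meddis [9], reading the normalized SACVF not only at the estimated pitch period but also at a few subharmonic delays, could look as follows in Matlab. The number of subharmonic peaks, the normalization, and the simple averaging are assumptions made for this illustration, not part of the published model.

```matlab
% sacvf: summary autocovariance function, normalized so that its value at
% zero delay, sacvf(1), equals 1; fs: sampling rate; T0: estimated pitch
% period in seconds (e.g., 0.005 for the 200-Hz examples above).
function s = pitch_strength(sacvf, fs, T0)
  s = 0;
  n = 0;
  for m = 1:4                        % main peak plus three subharmonic peaks
    idx = round(m*T0*fs) + 1;        % sample index of delay m*T0
    if idx <= length(sacvf)
      s = s + sacvf(idx);            % read the SACVF at that delay
      n = n + 1;
    end
  end
  s = s / n;                         % average height, roughly between 0 and 1
end
```

For a strongly periodic sound such as the 200-Hz pulse train of Fig. 3.26, the values at 5, 10, 15, and 20 ms are all high, whereas for the repetition noise of Fig. 8.25 only the 5-ms peak contributes, which qualitatively matches the weaker pitch of the latter.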

8.15 Pitch Ambiguity

In the previous section, models were presented with which, at least in a qualitative way, some phenomena relating to pitch strength can more or less successfully be explained. Various aspects, however, have been ignored. For instance, the SACVFs have many more peaks than their main peak and its subharmonics. The significance of these secondary peaks has been discussed, e.g., in the description of the SACVFs of tones with virtual pitches in Figs. 8.17, 8.18, 8.19, 8.20, 8.21 and 8.22. They show that the pitch of a sound is not always well defined but can be ambiguous. Such ambiguities have already been reported by Schouten, Ritsma, and Lopes Cardozo [149]. They presented listeners with a complex of three tones with frequencies of 1791, 1990, and 2189 Hz, which are the 9th, the 10th, and the 11th harmonic of 199 Hz. The listeners were asked to adjust a number of possible pitches to these tones. This resulted in adjustments clustered around 153, 166, 181, 199, and 221 Hz. These figures closely correspond to 1990 divided by 13, 12, 11, 10, and 9, respectively. This indicates that the frequencies of the three tones not only match the harmonic pattern of 199 Hz but, to some extent, also those of 153, 166, 181, and 221 Hz. Indeed, starting from 181 Hz, the 10th, 11th, and 12th harmonics are 1810, 1991, and 2172 Hz, not very different from the actual frequencies of 1791, 1990, and 2189 Hz. Similar calculations can be done for 153, 166, and 221 Hz. It can thus be concluded that the adjustments by the listeners corresponded to different estimates as to the rank of the harmonics of the three-tone complex, as illustrated by the side peaks of the SACVFs shown in Figs. 8.18 and 8.19. Another issue is the presence of different kinds of listeners. Besides showing large differences in musicality and musical training, people can be divided into analytic and synthetic listeners, as discussed in Sect. 8.4.5. Individual differences like these were studied in more detail by Renken et al. [133]. They synthesized tones of four successive harmonics with fundamental frequencies between 100 and 250 Hz, and varied the rank of the lowest harmonic. In addition, they added the four harmonics either in cosine phase or in Schroeder phase, as described in Sect. 1.8.3. In general, pitch strength is independent of phase for resolved harmonics. For unresolved
harmonics, however, the envelope of the output of the auditory filters depends on the phase relations between the harmonics. As can be derived from Fig. 1.55, the crest factor of the tone with harmonics added in cosine phase is much higher than that of the tones with harmonics added in Schroeder phase. Consequently, when unresolved harmonics are added in cosine phase the virtual pitch will be stronger than when added in Schroeder phase. In this way, Renken et al. [133] could vary the relative strength of the virtual pitch by varying the rank of the lowest harmonic and by adding the harmonics in cosine phase or Schroeder phase. Listeners were then presented with pairs of such tones in which the fundamental frequency of one was lower than that of the other, while the frequencies of the harmonics were higher than in the other. The listeners had to decide whether they heard the “pitch” go up or go down. In this, the authors did not make a distinction between pitch and brightness. Actually, what is called brightness in this book was called “spectral pitch” by them. Renken et al. [133] indeed found that, when all four harmonics were resolved, listeners generally responded according to whether the virtual pitch went up or down. The higher the harmonic rank of the tones, the more listeners responded according to brightness. In addition, when the harmonics were added in cosine phase, more listeners responded according to the virtual pitch than when the harmonics were added in Schroeder phase. Seither-Preisler et al. [151] found, however, that there were considerable individual differences, and that differences like these can at least partly be ascribed to differences in musical training. These authors also used pairs of tones of which the virtual pitch and the brightness changed in different directions. They showed that musicians were much more inclined to respond in accordance with the virtual pitch, whereas non-musicians responded more according to brightness.
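The phase manipulation used by Renken et al. [133] can be illustrated with a small Matlab sketch comparing the crest factors of a harmonic complex with components added in cosine phase and in Schroeder phase. One common convention for the Schroeder phases, $\theta_n = -\pi n(n-1)/N$, is used here; the exact convention, the fundamental frequency, and the number of harmonics are assumptions that differ between studies.

```matlab
fs  = 44100;                      % sampling rate (Hz), arbitrary choice
f0  = 200;                        % fundamental frequency (Hz)
N   = 20;                         % number of equal-amplitude harmonics
t   = (0:round(0.1*fs)-1)'/fs;    % 100 ms of signal

xcos = zeros(size(t));            % harmonics added in cosine phase
xsch = zeros(size(t));            % harmonics added in Schroeder phase
for n = 1:N
  xcos = xcos + cos(2*pi*n*f0*t);
  phi  = -pi*n*(n-1)/N;           % one common form of the Schroeder phases
  xsch = xsch + cos(2*pi*n*f0*t + phi);
end

crest = @(x) max(abs(x)) / sqrt(mean(x.^2));   % crest factor
fprintf('Crest factor: cosine phase %.1f, Schroeder phase %.1f\n', ...
        crest(xcos), crest(xsch));
```

The cosine-phase complex has a much peakier waveform, and hence a higher crest factor, than the Schroeder-phase complex with the same power spectrum; when the harmonics are unresolved, this difference in envelope is preserved at the output of the auditory filters, which is the basis of the phase effect on virtual pitch described above.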

8.16 Independence of Timbre and Pitch

In general, pitch and timbre are considered independent auditory attributes of a sound [79, 93, 129]. There are, however, some phenomena, which have already been alluded to a few times, that may suggest some interdependencies. This concerns in particular the relation between pitch and the timbre attribute of brightness. Brightness, like pitch, can be ordered from low to high. Indeed, Singh and Hirsh [160] synthesized tones consisting of four successive harmonics of which F0 and the rank of the lowest harmonic were varied independently. They presented listeners with pairs of such tones and asked them to indicate whether they heard the same pitch, an increase in pitch, or a decrease in pitch, and to indicate whether they heard "something else" change or not. That "something else" referred to the brightness of the two tones. It appeared that listeners could well distinguish changes in F0 from changes in spectral content, as long as the change in F0 was more than 2%. When the change in F0 was less, listeners often reported hearing a change in "something else", although the spectral content remained constant. Allen and Oxenham [2] studied the effect of differences in brightness on the accuracy with which listeners could discriminate different pitches, and the other
way round. They synthesized harmonic tones with all harmonics below 10 kHz. The spectrum of the tones was bell-shaped, i.e., the amplitudes of the harmonics first increased with their rank up to a peak, after which the amplitudes decreased until it was 0, again, at 10 kHz. The brightness was varied by changing the position of the peak in the spectrum; the pitch was varied by changing F 0 . Listeners were presented with pairs of such sounds and asked to indicate whether the pitch went up or down, and whether the brightness went up or down. There were two classes of tone pairs: One class consisted of congruent pairs in which both pitch and brightness increase or decrease, and the other of incongruent pairs in which one changes in the opposite direction to the other. Listeners were provided with immediate feedback on the correctness of their response. The authors carried out a similar experiment in which listeners were asked explicitly whether they heard changes in pitch or in brightness. Listeners were given feedback, again, on the correctness of their response. Allen and Oxenham [2] find that there is indeed a symmetric interference between pitch and brightness. They argue that this is probably not due to the fact that the two attributes are not separable, but that “changes in F0 and spectral centroid elicit changes in pitch and timbre, respectively, but that subjects sometimes confuse the two, and therefore respond to the inappropriate dimension” (p. 1378). This conclusion is the more justified by the results of a similar experiment carried out by Caruso and Balaban [32]. They also found an interference between pitch and a timbre attribute, but the latter was not brightness but noisiness. They conclude: “In summary, a consistent interference between pitch and timbre can be identified when timbre and pitch are appropriately quantified and parametrically varied. The interference does not abolish the distinction between pitch and timbre, but variations along the un-attended dimension make listeners less certain about whether they heard variation in the attended dimension” (p. 6). Congruent and incongruent differences in pitch and brightness can also influence the perceived size of musical intervals. Russo and Thompson [145] presented musically untrained listeners with pairs of tones that differed by an interval of six or seven semitones. Their brightness was varied by either increasing or decreasing the amplitudes of the harmonics with their ranks. The tones of which the amplitudes of the harmonics increase with harmonic rank are, therefore, brighter than the tones of which the amplitudes of the harmonics decrease with harmonic rank, as can be heard in the demo of Fig. 6.6. The authors asked the listeners to rate the size of the interval between the two tones of a pair. It appeared that the intervals of congruent pairs, i.e., for pairs in which the lower tone was duller and the higher tone was brighter, were rated larger than the intervals of incongruent pairs. Apparently, these untrained listeners combined the differences in pitch with the differences in timbre in their judgments. Russo and Thompson [145] repeated this experiment with musically trained listeners. It appeared that these listeners attributed much less weight to the differences in brightness, but remarkably, only for ascending intervals. For descending intervals, the ratings were similar to those of the untrained listeners. 
Apparently, it is more difficult to focus attention on pitch and ignore brightness for descending intervals than for ascending intervals [90]. The problem of comparing the pitch of
successive sounds with very different timbres is also discussed by Borchert, Micheyl, and Oxenham [17]. This may give the impression that pitch and timbre are not completely independent. It appears, however, that pitch does not only show congruency effects with brightness, but also with loudness [104]. In fact, there is more to it than this. Pitch height can be associated with many other perceptual attributes, also with attributes from other sensory modalities. For instance, Spence [165, p. 979] mentions crossmodal correspondences of pitch to the visual attributes of brightness, lightness, shape/angularity, size, spatial frequency, and direction of movement. This shows that the presence of congruency effects may be due to systematic monotonic mappings between the continuum of one perceptual attribute to the continuum of another attribute. Parise, Knorre, and Ernst [120] argue that this correspondence may be based on naturally occurring mappings in the environment. For reviews of such cross-modal correspondences the reader is referred to Spence [165] and Parise [119]. A somewhat different example of an effect of pitch differences on “timbre” perception is described by Handel and Erickson [49]. For six different wind instruments, clarinet, English horn, French horn, oboe, trombone, and trumpet, they recorded all notes played from G3 , 196.0 Hz, to C6 , 1047 Hz. They presented untrained listeners with short melodies consisting of two such tones, a low tone A and a high tone B. These were played in an ABABA sequence. Listeners were asked to judge whether the A and the B tones were played on the same or on different instruments. It appeared that listener could only do so correctly when the interval between the two tones was less than about one octave. These results were replicated also with untrained listeners by Steele and Williams [166] for two wind instruments, the French horn and the bassoon. These authors, however, repeated this experiment also with musically trained listeners. It appeared that these listeners were well able to do the task correctly for more than 80% of the trials up to musical intervals of 2.5 octaves. This result makes it clear that the timbre of a tone played on a musical instrument changes with the frequency of the note played. This makes it impossible for musically untrained listeners to identify a musical instrument for all different notes that can be played on it. Musicians, who are familiar with these musical instruments, can do so much better. Finally, two other examples of the interaction between pitch and timbre will be discussed. First, when the difference in pitch between two tonal sounds becomes larger than about one octave, the brightness of a sound can change. It appears that, up to a certain level, this change in brightness can be undone. Indeed, Slawson [162] showed that the quality of a synthetic vowel changes somewhat when its pitch is doubled. This could be undone by coupling a doubling of F 0 with an increase of the frequencies of the first two formants by 10%. A similar correction was proposed for musical sounds. A somewhat more complex correction, carried out on the Cam scale, has been proposed by Marozeau and De Cheveigné [92]. Second, another kind of confusion of timbre and pitch can occur when two tones with the same F 0 s but with very different spectral centroids are compared. It appears that especially untrained listeners tend to judge the pitch of the brighter tone one octave higher than that of the duller tone [141]. 
Such harmonic confusions may become apparent in the demos
of Figs. 8.17, 8.18, 8.19, 8.20 and 8.21. As explained there, they have to do with harmonic ambiguities in pitch perception. From a processing point of view, too, there is evidence that pitch and timbre are processed independently. In the introduction to Chap. 6, it was argued that the auditory system processes timbre faster than pitch, and also faster than loudness. Indeed, pitch and loudness have integration times of more than ten milliseconds, whereas timbre only needs ten milliseconds or less [94, 142, 143, 169]. In general, the time course of the induction process is quite different for the percept of pitch and for that of timbre. In fact, spectral flux, i.e., the amount of temporal change in the spectrum of a sound, is an important attribute of timbre, showing that the spectrum of a sound can change rapidly without affecting the perceptual unity of the sound. Pitch perception, on the other hand, is a much slower process. When the pitch of an utterance changes too rapidly, it seems as if another speaker takes over [19, 37]. Pitch appears to be a rather inert perceptual attribute that cannot change too rapidly without disturbing the perceived continuity of the sound. Moreover, Hukin and Darwin [69] showed that the effect of onset asynchronies has a very different time course for timbre than for pitch. Onset asynchronies of about 40 ms are enough to remove the contribution of a partial to the timbre of a complex tone, in this case the vowel quality, while onset asynchronies of about 300 ms are necessary to remove the contribution of a partial to the pitch of a harmonic complex. These differences between the processing of pitch and timbre are also found in the neurophysiology of these auditory attributes. Langner et al. [82] and Langner, Dinse, and Godde [83] found orthogonal representations of the "tonotopic", i.e., brightness, array and the "periodotopic", i.e., pitch, array in the auditory cortex of cats. In humans, however, the situation is more complex [3, 164]. A perhaps even more convincing argument that pitch and timbre are independent auditory attributes is supplied by a recent result by Lau, Oxenham, and Werner [84]. They trained three- and seven-month-old infants to respond to small changes in F0 in the presence of variations in spectral content and vice versa. The surprising result was that they were able to do so as well as musically trained listeners and even better than musically untrained listeners! The authors conclude that there is "high fidelity of F0 and spectral-envelope coding in infants, implying that fully mature cortical processing is not necessary for accurate discrimination of these features" (p. 693).
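Returning to the stimuli of Allen and Oxenham [2] discussed earlier in this section, the independent manipulation of pitch and brightness can be sketched in Matlab as follows: a harmonic complex with all harmonics below 10 kHz is generated, and a bell-shaped spectral envelope whose peak position sets the spectral centroid is imposed on it. The Gaussian-on-log-frequency envelope and its width are assumptions made for this illustration and are not the exact envelope used in that study.

```matlab
% Harmonic complex with fundamental f0 (Hz), all harmonics below 10 kHz,
% and a bell-shaped spectral envelope peaking at fpeak (Hz).
function x = bell_tone(f0, fpeak, dur, fs)
  t = (0:round(dur*fs)-1)'/fs;
  x = zeros(size(t));
  for n = 1:floor(10000/f0)
    fn = n*f0;
    a  = exp(-0.5*((log2(fn) - log2(fpeak))/0.5)^2);  % assumed envelope shape
    x  = x + a*sin(2*pi*fn*t);
  end
  x = x / max(abs(x));                                % normalize the level
end
```

A congruent pair can then be made by raising both F0 and the envelope peak, e.g., bell_tone(200, 1000, 0.4, 44100) followed by bell_tone(220, 1200, 0.4, 44100), and an incongruent pair by raising F0 while lowering the envelope peak.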

8.17 Pitch Constancy

Perceptual constancy is an important property of perceptual attributes, not only of auditory attributes but also of attributes in the visual and other sensory modalities. The presumed function of perceptual constancy is to ensure that perceived objects are not perceived as changing under changing environmental conditions. In this book, this has been discussed for the beat of an auditory unit and for its timbre. Later in this book, in Sect. 9.7.4, more will be said about loudness constancy. As to pitch constancy, one may also wonder whether, indeed, the percept of pitch is robust to
environmental changes. It has been argued that pitch perception can be associated with repetitions or periodicities of the sound signal. In other words, the question is whether, indeed, repetitions and periodicities are robust to environmental changes. It is in fact hard to find a property of a sound that is as robust to environmental changes as the periodicity of the signal. Moreover, in speech, music, and animal communication, many sounds are produced by periodically vibrating structures such as vocal cords, reeds, wings, or syrinxes. These sound sources are then filtered by resonating structures that modify the spectrum of the sound but do not significantly affect its periodic structure. The environment in which the sounds are produced does not affect the periodic structure either. The environment may not affect the repetitiveness or the periodicity of a sound signal as produced by the sound source, but, on the other hand, it can introduce repetitiveness and periodicities into the sound signal, e.g., by adding reflections from a hard surface such as a floor, a wall, or a ceiling. The sound demo of Fig. 8.2 on repetition pitch demonstrates that this can lead to the perception of pitch even if there is only one reflection. A pitch percept can also be induced when the reflections are periodic due to the presence of equidistant hard surfaces in the environment, such as a staircase, a phenomenon already known in the 17th century to Huygens (1693), cited after Bilsen and Ritsma [14] (see Fig. 8.30). In most cases, however, the reflections from the floor, the walls, the ceiling, and other objects in a room will not form a periodic pattern. In Chap. 9, it will be shown that these reflections play a crucial role in auditory distance perception. In addition, the presence of reflections and reverberation plays an important role in, e.g., the perception of the size of the room and of what kind of room the sound is produced in [23, 24, 48, 77]. In summary, the robustness of the periodicity of a signal to environmental changes makes periodicity one of the main signal properties that are appropriate for the auditory system to use in the processing of sound. Pitch is an example of an attribute of a perceptual object that is robust to environmental variations. Moreover, in the spectrograms of speech signals shown in Figs. 8.11, 8.12, 8.13 and 8.14, it has been shown that the periodicity information is distributed over a large spectrotemporal array. If one region in the spectrotemporal domain is masked by noise, pitch information can be derived from other regions in the spectrotemporal domain. This makes the information robust to the presence of noise. In addition, it will appear that the auditory system avails itself of a remarkable system for recovering the information masked by the noise, so that the listener cannot hear what information is restored and what is not. This will be discussed extensively in Sect. 10.11.

Fig. 8.30 Repetition pitch illustrated by Huygens (1693). Retrieved in the public domain from Bilsen and Ritsma [14, p. 64, Fig. 1]

8.18 Concluding Remarks

The autocorrelation model of pitch perception described here is based on the assumption that pitch is derived from the interval distributions of the spike trains in the auditory-nerve fibres. In the presence of phase lock, these intervals on average represent the frequency of the sound that stimulates the auditory filter and its subharmonics; when there is no phase lock, the intervals represent the periodicity of the envelope of the sound that stimulates the auditory filter and its subharmonics [111]. It is mostly assumed that the spike trains in the auditory-nerve fibres can code frequencies up to about 5000 Hz. There is, however, no universal agreement about this. It is not possible to directly measure spike trains from auditory-nerve fibres in hearing human participants. Estimates of this limit of phase locking in humans, therefore, have to be made based on indirect measurements. An overview of the various points of view on this issue is presented in Verschooten et al. [182]. The consequences for theories and models of pitch perception are reviewed by Oxenham [117, pp. 31–39] and Moore [114]. This issue is also studied by Carcagno, Lakhani, and Plack [25] and Gockel, Moore, and Carlyon [46]. They showed that tones with fundamental frequencies below about 3 kHz but with harmonics higher than 6 or 7 kHz also had virtual pitches with which musical intervals could be distinguished. The reader interested in this complex issue is referred to the literature. Whatever the exact limit of the temporal code for frequency in the peripheral auditory system, it is clear that, due to random fluctuations in neural processing, the temporal precision present in the spike trains of the auditory-nerve fibres cannot be maintained in higher centres of auditory processing in the central nervous system. It is, therefore, generally assumed that, somewhere in the central nervous system, the temporal code is replaced by a rate-place code. This issue is reviewed by Plack, Barker, and Hall [128], who conclude that, up to the superior olivary complex, the auditory code is indeed temporal but is converted into a rate-based code in the brainstem: “Taken together, the results suggest that the initial temporal pitch code in the auditory periphery is converted to a code based on neural firing rate in the brainstem. In the upper brainstem or auditory cortex, the information from the individual
harmonics of complex tones is combined to form a general representation of pitch” (p. 53). In the autocorrelation model of pitch perception presented in this book, common periodicities in the spike trains of the auditory-nerve fibres were found by summation of autocovariance functions. Various alternatives have been proposed. For instance, Huang and Rinzel [68] propose a method based on “coincidence detectors”, Joris [72] proposes “entrained phase-locking”, Hehrmann [55] proposes Bayesian inference. The autocorrelation model of human pitch perception is just one model in a long range of pitch-perception models published in the last century. It stands out from others, first, by the simplicity of its main assumption, which is that pitch perception arises from the most common periodicity in the trains of action potentials that enter the auditory central nervous system and its subharmonics. Second, even in its simple implementation as presented here, it appears to be able to correctly predict, albeit qualitatively, a wide range of pitch-perception phenomena. This does not mean that other approaches are not conceivable. On the contrary, alternative approaches have been presented by, e.g., Barzelay, Furst, and Barak [11], Patterson [122, 123], Shamma and Dutta [155], and Shamma and Klein [154]. Actually, pitch theories have been formulated since the time of Pythagoras. A review of pitch-perception theories up to this century is presented by De Cheveigné [39]. Finally, as also mentioned for loudness perception, the presented models generally assume that the sound components that contribute to the pitch of an auditory unit are known in advance. Except in laboratory conditions, however, this will generally not be the case. In Sect. 8.12, the problems playing a role in determining simultaneous pitches have briefly been discussed. In an early study, Van Noorden [180] already concludes: “Although the origin of fusion proper is not clear, evidence has been found that it should be localised between the basilar membrane and the final pitch extractors” (p. 12). Nevertheless, the need for a model of auditory-unit formation that precedes the attribution of auditory attributes is only rarely explicitly mentioned. But there are exceptions, e.g., Borchert, Micheyl, and Oxenham [17]: “With rare exceptions, existing models of pitch perception do not include perceptual organization processes. They compute the pitch of incoming sounds without regard for whether or not these sounds are perceived as a single auditory object or source. Such models may require substantial extension or revision to account for the present findings” (p. 11).

References

1. Adler SA et al (2020) Sensitivity to major versus minor musical modes is bimodally distributed in young infants. J Acoust Soc Amer 147(6):3758–3764. https://doi.org/10.1121/10.0001349
2. Allen EJ, Oxenham AJ (2014) Symmetric interactions and interference between pitch and timbre. J Acoust Soc Amer 135(3):1371–1379. https://doi.org/10.1121/1.4863269
3. Allen EJ et al (2017) Representations of pitch and timbre variation in human auditory cortex. J Neurosci 37(5):1284–1293. https://doi.org/10.1523/JNEUROSCI.2336-16.2016
4. ANSI. ANSI S1.1-1994 (1994) American National Standard Acoustical Terminology. New York, NY
5. ASA (1960) Acoustical terminology. SI, 1-1960. New York, NY
6. Attneave F, Olson RK (1971) Pitch as a medium: a new approach to psychophysical scaling. Amer J Psychol 84(2):147–166. https://doi.org/10.2307/1421351
7. Bachem A (1950) Tone height and tone chroma as two different pitch qualities. Acta Psychol 7:80–88. https://doi.org/10.1016/0001-6918(50)90004-7
8. Bachem A (1937) Various types of absolute pitch. J Acoust Soc Amer 9(2):146–151. https://doi.org/10.1121/1.1915919
9. Balaguer-Ballester E, Denham SL, Meddis R (2008) A cascade autocorrelation model of pitch perception. J Acoust Soc Amer 124(4):2186–2195. https://doi.org/10.1121/1.2967829
10. Balaguer-Ballester E et al (2009) Understanding pitch perception as a hierarchical process with top-down modulation. PLoS Comput Biol 5(3):e1000301, 15 pages. https://doi.org/10.1371/journal.pcbi.1000301
11. Barzelay O, Furst M, Barak O (2017) A new approach to model pitch perception using sparse coding. PLoS Comput Biol 13(1):e1005338, 36 pages. https://doi.org/10.1371/journal.pcbi.1005338
12. Beerends JG (1989) The influence of duration on the perception of pitch in single and simultaneous complex tones. J Acoust Soc Amer 86(5):1835–1844. https://doi.org/10.1121/1.398562
13. Biasutti M (1997) Sharp low- and high-frequency limits on musical chord recognition. Hear Res 105(1):77–84. https://doi.org/10.1016/S0378-5955(96)00205-5
14. Bilsen FA, Ritsma RJ (1969) Repetition pitch and its implication for hearing theory. Acustica 22(2):63–73
15. Bitterman Y et al (2008) Ultra-fine frequency tuning revealed in single neurons of human auditory cortex. Nature 451(7175):197–202. https://doi.org/10.1038/nature06476
16. Bleeck S, Ives DT, Patterson RD (2004) Aim-mat: the auditory image model in MATLAB. Acta Acust Unit Acust 90(4):781–787
17. Borchert EMO, Micheyl C, Oxenham AJ (2010) Perceptual grouping affects pitch judgments across time and frequency. J Exp Psychol Hum Percept Perform 37(1):257–269. https://doi.org/10.1037/a0020670
18. Braus I (1995) Retracing one’s steps: an overview of pitch circularity and Shepard tones in European music, 1550–1990. Music Percept: Interdiscip J 12(3):323–351. https://doi.org/10.2307/40286187
19. Brokx JPL (1979) Waargenomen continuiteit in spraak: het belang van toonhoogte. Eindhoven, pp 1–124. https://doi.org/10.6100/IR171313
20. Brunstrom JM, Roberts B (2000) Separate mechanisms govern the selection of spectral components for perceptual fusion and for the computation of global pitch. J Acoust Soc Amer 107(3):1566–1577. https://doi.org/10.1121/1.428441
21. Burns EM, Viemeister NF (1976) Nonspectral pitch. J Acoust Soc Amer 60(4):863–869. https://doi.org/10.1121/1.381166
22. Burns EM, Viemeister NF (1981) Played-again SAM: further observations on the pitch of amplitude modulated noise. J Acoust Soc Amer 70(6):1655–1660. https://doi.org/10.1121/1.387220
23. Cabrera D et al (2005) Auditory room size perception for modeled and measured rooms. In: Proceedings of internoise, the 2005 congress and exposition on noise control engineering (Rio de Janeiro, Brazil), 10 pages. https://www.researchgate.net/profile/Densil_Cabrera/publication/228372572_Auditory_room0fcfd51354253e898d000000.pdf
24. Calcagno ER et al (2012) The role of vision in auditory distance perception. Perception 41(2):175–192. https://doi.org/10.1068/p7153
25. Carcagno S, Lakhani S, Plack CJ (2019) Consonance perception beyond the traditional existence region of pitch. J Acoust Soc Amer 146(4):2279–2290. https://doi.org/10.1121/1.5127845
26. Cariani PA, Delgutte B (1996) Neural correlates of the pitch of complex tones. I. Pitch and pitch salience. J Neurophysiol 76(3):1698–1716. https://doi.org/10.1152/jn.1996.76.3.1698 27. Cariani PA, Delgutte B (1996) Neural correlates of the pitch of complex tones. II. Pitch shift, pitch ambiguity, phase invariance, pitch circularity, rate pitch, and the dominance region for pitch. J Neurophysiol 76(3):1717–1734. https://doi.org/10.1152/jn.1996.76.3.1717 28. Carlyon RP, Shackleton TM (1994) Comparing the fundamental frequencies of resolved and unresolved harmonics: Evidence for two pitch mechanisms? J Acoust Soc Amer 95(6):3541– 3554. https://doi.org/10.1121/1.409971 29. Carlyon RP et al (2004) Auditory processing of real and illusory changes in frequency modulation (FM) phase. J Acoust Soc Amer 116(6):3629–3639. https://doi.org/10.1121/1.1811474 30. Carlyon RP et al (2008) Behavioral and physiological correlates of temporal pitch perception in electric and acoustic hearing. J Acoust Soc Amer 123(2):973–985. https://doi.org/10.1121/ 1.2821986 31. Carlyon RP et al (2002) Temporal pitch mechanisms in acoustic and electric hearing. J Acoust Soc Amer 112(2):621–633. https://doi.org/10.1121/1.1488660 32. Caruso VC, Balaban E (2014) Pitch and timbre interfere when both are parametrically varied. PLoS ONE 9(1):e87065, 7 pages. https://doi.org/10.1371/journal.pone.0087065 33. Charbonneau G, Risset J-C (1973) Circularité de jugements de hauteur sonore. Comptes Rendus de l’Academie des Sciences Paris, Serie B 73:623 34. Christensen MG, Jakobsson A (2009) Multi-pitch estimation. Synth Lect Speech Audio Proc 5(1):1–160 (2009). https://doi.org/10.2200/S00178ED1V01Y200903SAP005 35. Ciocca V (1999) Evidence against an effect of grouping by spectral regularity on the perception of virtual pitch. J Acoust Soc Amer 106(5):2746–2751. https://doi.org/10.1121/1.428102 36. Dahlbom DA, Braasch J (2020) How to pick a peak: pitch and peak shifting in temporal models of pitch perception. J Acoust Soc Amer 147(4):2713–2727. https://doi.org/10.1121/ 10.0001134 37. Darwin CJ, Bethell-Fox CE (1977) Pitch continuity and speech source attribution. J Exp Psychol Hum Percept Perform 3(4):665–672. https://doi.org/10.1037/0096-1523.3.4.665 38. De Cheveigné A (2006) Multiple F0 estimation (Chap 2). In: Wang D, Brown GJ (eds) Computational auditory scene analysis, algorithms and applications. Wiley - IEEE Press, pp 45–79 39. De Cheveigné A (2005) Pitch perception models (Chap 6). In: Plack CJ et al (eds) Pitch, neural coding and perception. Springer Science+Business Media, Inc., New York, pp 169– 233. https://doi.org/10.1007/0-387-28958-5_6 40. Dicke U et al (2007) A neural circuit transforming temporal periodicity information into a ratebased representation in the mammalian auditory system. J Acoust Soc Amer 121(1):310–326. https://doi.org/10.1121/1.2400670 41. Doughty JM, Garner WR (1948) Pitch characteristics of short tones. II. Pitch as a function of tonal duration. J Exp Psychol 38(4):478–494. https://doi.org/10.1037/h0057850 42. Etchemendy PE, Eguia MC, Mesz B (2014) Principal pitch of frequency-modulated tones with asymmetrical modulation waveform: a comparison of models. J Acoust Soc Amer 135(3):1344–1355. https://doi.org/10.1121/1.4863649 43. Fastl H, Stoll G (1979) Scaling of pitch strength. Hear Res 1(4):293–301. https://doi.org/10. 1016/0378-5955(79)90002-9 44. Gockel HE, Carlyon RP (2016) On Zwicker tones and musical pitch in the likely absence of phase locking corresponding to the pitch. 
J Acoust Soc Amer 140(4):2257–2273. https://doi. org/10.1121/1.4963865 45. Gockel HE, Moore BC, Carlyon RP (2001) Influence of rate of change of frequency on the overall pitch of frequency-modulated tones. J Acoust Soc Amer 109(2):701–712. https://doi. org/10.1121/1.1342073 46. Gockel HE, Moore BC, Carlyon RP (2020) Pitch perception at very high frequencies: on psychometric functions and integration of frequency information. J Acoust Soc Amer 148(5):3322–3333. https://doi.org/10.1121/10.0002668

442

8 Pitch Perception

47. Guttman N, Pruzansky S (1962) Lower limits of pitch and musical pitch. J Speech Hear Res 5(3):207–214. https://doi.org/10.1044/jshr.0503.207 48. Hameed S et al (2004) Psychoacoustic cues in room size perception. In: Proceedings of the convention of the audio engineering society 116 (Berlin), p 7. http://legacy.spa.aalto.fi/ research/cat/psychoac/papers/hameedaes116.pdf 49. Handel S, Erickson ML (2001) A rule of thumb: the bandwidth for timbre invariance is one octave. Music Percept: Interdiscip J 19(1):121–126. https://doi.org/10.1525/mp.2001.19.1. 121 50. Harczos T, Klefenz FM (2018) Modeling pitch perception with an active auditory model extended by octopus cells. Front Neurosci 12:12, Article 660. https://doi.org/10.3389/fnins. 2018.00660 51. Hartmann WM (1998) Signals, Sound, and Sensation. Springer Science+Business Media Inc, New York 52. Hartmann WM (1978) The effect of amplitude envelope on the pitch of sine wave tones. J Acoust Soc Amer 63(4):1105–1113. https://doi.org/10.1121/1.381818 53. Hartmann WM, Cariani PA, Colburn HS (2019) Noise edge pitch and models of pitch perception. J Acoust Soc Amer 145(4):1993–2008. https://doi.org/10.1121/1.5093546 54. Heffner H, Whitfield IC (1976) Perception of the missing fundamental by cats. J Acoust Soc Amer 59(4):915–919. https://doi.org/10.1121/1.380951 55. Hehrmann P (2018) Pitch perception as probabilistic inference. University College London, London, UK, pp 1–207. https://discovery.ucl.ac.uk/id/eprint/10062222 56. Hermes DJ (2006) Stylization of pitch contours. In: Sudhoff S et al (eds) Methods in empirical prosody research. Walter De Gruyter, Berlin, pp 29–62. https://doi.org/10.1515/ 9783110914641.29 57. Hess W (1983) Pitch determination of speech signals: algorithms and devices. Springer, Berlin 58. Hewitt MJ, Meddis R (1994) A computer model of amplitude-modulation sensitivity of single units in the inferior colliculus. J Acoust Soc Amer 95(4):2145–2159. https://doi.org/10.1121/ 1.408676 59. Hewitt MJ, Meddis R, Shackleton TM (1992) A computer model of a cochlear nucleus stellate cell: responses to amplitude modulated and pure tone stimuli. J Acoust Soc Amer 91(4):2096– 2109. https://doi.org/10.1121/1.403696 60. Hoeschele M, Weisman RG, Sturdy CB (2012) Pitch chroma discrimination, generalization, and transfer tests of octave equivalence in humans. Att. Percept. & Psychophys. 74(8):1742– 1760. https://doi.org/10.3758/s13414-012-0364-2 61. Horbach M, Verhey JL, Hots J (2018) On the pitch strength of bandpass noise in normalhearing and hearing-impaired listeners. Trends Hear 22:2331216518787067, 14 pages. https:// doi.org/10.1177/2331216518787067 62. Houtsma AJ (1979) Musical pitch of two-tone complexes and predictions by modern pitch theories. J Acoust Soc Amer 66(1):87–99. https://doi.org/10.1121/1.382943 63. Houtsma AJ (1984) Pitch salience of various complex sounds. Music Percept.: Interdiscipl. J. 1(3):296–307. https://doi.org/10.2307/40285262 64. Houtsma AJ, Fleuren J (1991) Analytic and synthetic pitch of two-tone complexes. J Acoust Soc Amer 90(3):1674–1676. https://doi.org/10.1121/1.401911 65. Houtsma AJ, Goldstein JL (1972) The central origin of the pitch of complex tones: evidence from musical interval recognition. J Acoust Soc Amer 51(2):520–529. https://doi.org/10. 1121/1.1912873 66. Houtsma AJ, Rossing TD, Wagenaars WM (1987) Auditory demonstrations. Eindhoven, The Netherlands: Institute for Perception Research (IPO), Northern Illinois University, Acoustical Society of America. 
https://research.tue.nl/nl/publications/auditory-demonstrations 67. Houtsma AJ, Smurzynski J (1990) Pitch identification and discrimination for complex tones with many harmonics. J Acoust Soc Amer 87(1):304–310. https://doi.org/10.1121/1.399297 68. Huang C, Rinzel J (2016) A neuronal network model for pitch selectivity and representation. Front Comput Neurosci 10:17, Article 57. https://doi.org/10.3389/fncom.2016.00057

References

443

69. Hukin RW, Darwin CJ (1995) Comparison of the effect of onset asynchrony on auditory grouping in pitch matching and vowel identification. Percept & Psychophys 57(2):191–196. https://doi.org/10.3758/BF03206505 70. Ives DT, Patterson RD (2008) Pitch strength decreases as F0 and harmonic resolution increase in complex tones composed exclusively of high harmonics. J Acoust Soc Amer 123(5):2670– 2679. https://doi.org/10.1121/1.2890737 71. Jaatinen J, Pätynen J, Alho K (2019) Octave stretching phenomenon with complex tones of orchestral instruments. J Acoust Soc Amer 146(5):3203–3214. https://doi.org/10.1121/1. 5131244 72. Joris PX (2016) Entracking as a brain stem code for pitch: The butte hypothesis. Physiology, Psychoacoustics and Cognition in Normal and Impaired Hearing. Ed. by Dijk, P. van et al. Cham, Switzerland: Springer International Publishing AG, pp 347–354. https://doi.org/10. 1007/978-3-319-25474-6_36 73. Jurado C, Larrea M, Moore BC (2021) The lower limit of pitch perception for pure tones. Arch. Acoust. 46(3):459–469 (2021). https://doi.org/10.24425/aoa.2021.138138 74. Klapuri A (2008) Multipitch analysis of polyphonic music and speech signals using an auditory model. IEEE Trans Audio Speech Lang Process 16(2):255–266. https://doi.org/10.1109/ TASL.2007.908129 75. Klefenz F, Harczos T (2020) Periodicity pitch perception. Front Neurosci 14:15, Article 486 (2020). https://doi.org/10.3389/fnins.2020.00486 76. Kohlrausch A, Houtsma AJ (1992) Pitch related to spectral edges of broadband signals. Philosoph Trans R Soc Lond. Ser B: Biol Sci 336(1278):375–382 (1992). https://doi.org/10. 1098/rstb.1992.0071 77. Kolarik AJ et al (2021) Factors affecting auditory estimates of virtual room size: effects of stimulus, level, and reverberation. Perception 50(7):646–663. https://doi.org/10.1177/ 03010066211020598 78. Krumbholz K, Patterson RD, Pressnitzer D (2000) The lower limit of pitch as determined by rate discrimination. J Acoust Soc Amer 108(3):1170–1180. https://doi.org/10.1121/1. 1287843 79. Krumhansl CL, Iverson P (1992) Perceptual interaction between musical pitch and timbre. J Exp Psychol Hum Percept Perform 18(3):739–751. https://doi.org/10.1037/0096-1523.18.3. 739 80. Ladd DR et al (2013) Patterns of individual differences in the perception of missingfundamental tones. J Exp Psychol Hum Percept Perform 39(5):1386–1397. https://doi.org/ 10.1037/a0031261 81. Laguitton V et al (1998) Pitch perception: a difference between right-and left-handed listeners. Neuropsychologia 36(3):201–207. https://doi.org/10.1016/S0028-3932(97)00122-X 82. Langner G et al (1997) Frequency and periodicity are represented in orthogonal maps in the human auditory cortex: evidence from magnetoencephalography. J Comp Physiol A Neuroethol Sens Neural Behav Physiol 181(6):665–676. https://doi.org/10.1007/s003590050148 83. Langner G, Dinse HR, Godde B (2009) A map of periodicity orthogonal to frequency representation in the cat auditory cortex. Front Integr Neurosci 3(27):334–341. https://doi.org/10. 3389/neuro.07.027.2009 84. Lau BK, Oxenham AJ, Werner LA (2021) Infant pitch and timbre discrimination in the presence of variation in the other dimension. J Assoc Res Otolaryngol 22(6):693–702. https:// doi.org/10.1007/s10162-021-00807-1 85. Lau BK, Werner LA (2012) Perception of missing fundamental pitch by 3- and 4-month-old human infants. J Acoust Soc Amer 132(6):3874–3882. https://doi.org/10.1121/1.4763991 86. 
Lau BK, Werner LA (2014) Perception of the pitch of unresolved harmonics by 3- and 7month-old human infants. J Acoust Soc Amer 136(2):760–767. https://doi.org/10.1121/1. 4887464 87. Lau BK et al (2017) Infant pitch perception: missing fundamental melody discrimination. J Acoust Soc Amer 141(1):65–72. https://doi.org/10.1121/1.4973412

444

8 Pitch Perception

88. Licklider J (1951) A duplex theory of pitch perception. Experientia 7(4):128–134. https://doi. org/10.1007/BF02156143 89. Licklider J (1956) Auditory frequency analysis. In: Cherry C (ed) Information theory. Butterworth, London, UK, pp 253–268 90. Luo X, Masterson ME, Wu C-C (2014) Melodic interval perception by normal-hearing listeners and cochlear implant users. J Acoust Soc Amer 136(4):1831–1844. https://doi.org/10. 1121/1.4894738 91. Macherey O, Carlyon RP (2014) Re-examining the upper limit of temporal pitch. J Acoust Soc Amer 136(6):3186–3199. https://doi.org/10.1121/1.4900917 92. Marozeau J, De Cheveigné A (2007) The effect of fundamental frequency on the brightness dimension of timbre. J Acoust Soc Amer 121(1):383–387. https://doi.org/10.1121/1.2384910 93. Marozeau J et al (2003) The dependency of timbre on fundamental frequency. J Acoust Soc Amer 144(5):2946–2957. https://doi.org/10.1121/1.1618239 94. McKeown JD, Patterson RD (1995) The time course of auditory segregation: concurrent vowels that vary in duration. J Acoust Soc Amer 98(4):1866–1877. https://doi.org/10.1121/ 1.413373 95. McLachlan NM (2009) A computational model of human pitch strength and height judgments. Hear Res 249:23–35. https://doi.org/10.1016/j.heares.2009.01.003 96. McLachlan NM (2011) A neurocognitive model of recognition and pitch segregation. J Acoust Soc Amer 130(5):2845–2854. https://doi.org/10.1121/1.3643082 97. McLachlan NM, Marco DJT, Wilson SJ (2013) Pitch and plasticity: insights from the pitch matching of chords by musicians with absolute and relative pitch. Brain Sci 3(4):1615–1634. https://doi.org/10.3390/brainsci3041615 98. McLachlan NM, Marco DJT, Wilson SJ (2012) Pitch enumeration: failure to subitize in audition. PLoS ONE 7(4):e33661, 5 pages (2012). https://doi.org/10.1371/journal.pone.0033661 99. McLachlan NM, Wilson S (2010) The central role of recognition in auditory perception: a neurobiological model. Psychol Rev 117(1):175–196. https://doi.org/10.1037/a0018063 100. Meddis R, Hewitt MJ (1991) Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification. J Acoust Soc Amer 89(6):2866–2882. https://doi. org/10.1121/1.400725 101. Meddis R, Hewitt MJ (1991) Virtual pitch and phase sensitivity of a computer model of the auditory periphery. II: Phase sensitivity. J Acoust Soc Amer 89(6):2883–2894. https://doi. org/10.1121/1.400726 102. Meddis R, O’Mard L (1997) A unitary model of pitch perception. J Acoust Soc Amer 102(3):1811–1820. https://doi.org/10.1121/1.420088 103. Meddis R, O’Mard LP (2006) Virtual pitch in a computational physiological model. J Acoust Soc Amer 120(6):3861–3869. https://doi.org/10.1121/1.2372595 104. Melara RD, Marks LE (1990) Interaction among auditory dimensions: timbre, pitch, and loudness. Att Percept & Psychophys 48(2):169–178. https://doi.org/10.3758/BF03207084 105. Mesz BA, Eguia MC (2009) The pitch of vibrato tones. Ann N Y Acad Sci 1169(1):126–130. https://doi.org/10.1111/j.1749-6632.2009.04767.x 106. Metters PJ, Williams RP (1973) Experiments on tonal residues of short duration. J Sound Vib 26(3):432–436. https://doi.org/10.1016/S0022-460X(73)80198-1 107. Micheyl C, Ryan CM, Oxenham AJ (2012) Further evidence that fundamental-frequency difference limens measure pitch discrimination. J Acoust Soc Amer 131(5):3989–4001. https:// doi.org/10.1121/1.3699253 108. Miskiewicz A (2004) Roughness of low-frequency pure tones. In: Proceedings of the PolishGerman OSA/DAGA Meeting (Gdansk), 3 pages 109. 
Miskiewicz A, Majer J (2014) Roughness of low-frequency pure tones and harmonic complex tones. In: 7th Forum Acusticum (Krakow), pp 1–4 110. Moore BC (2012) An introduction to the psychology of hearing, 6th edn. Emerald Group Publishing Limited, Bingley, UK 111. Moore BC (2014) Auditory processing of temporal fine structure: effects of age and hearing loss. World Scientific, Singapore

References

445

112. Moore BC (2005) Basic auditory processes (Chap 12). In: Gold-stein EB (ed) Blackwell handbook of sensation and perception. Blackwell Publishing Ltd, Oxford, UK. pp 379–407 113. Moore BC (1973) Some experiments relating to the perception of complex tones. Q J Exp Psychol 25(4):451–475. https://doi.org/10.1080/14640747308400369 114. Moore BC (2019) The roles of temporal envelope and fine structure information in auditory perception. Acoust Sci Technol 40(2):61–83. https://doi.org/10.1250/ast.40.61 115. Moore BC, Rosen SM (1979) Tune recognition with reduced pitch and interval information. Q J Exp Psychol 31(2):229–240. https://doi.org/10.1080/14640747908400722 116. Näätänen R, Winkler I (1999) The concept of auditory stimulus representation in cognitive neuroscience. Psychol Bull 126(6):826–859. https://doi.org/10.1037/0033-2909.125.6.826 117. Oxenham AJ (2018) How we hear: The perception and neural coding of sound. Ann Rev Psychol 69:27–50. https://doi.org/10.1146/annurev-psych-122216-011635 118. Oxenham AJ et al (2011) Pitch perception beyond the traditional existence region of pitch. Proc Natl Acad Sci 108(18):7629–7634. https://doi.org/10.1073/pnas.1015291108 119. Parise CV (2016) Crossmodal correspondences: standing issues and experimental guidelines. Multisen Res 29(1–3):7–28. https://doi.org/10.1163/22134808-00002502 120. Parise CV, Knorre K, Ernst MO (2014) Natural auditory scene statistics shapes human spatial hearing. Proc Natl Acad Sci 111(16):6104–6108. https://doi.org/10.1073/pnas.1322705111 121. Parncutt R, Hair G (2018) A psychocultural theory of musical interval: bye bye Pythagoras. Music Percept: Interdiscipl J 35(4):475–501. https://doi.org/10.1525/mp.2018.35.4.475 122. Patterson RD (1987) A pulse-ribbon model of monaural phase perception. J Acoust Soc Amer 82(5):1560–1586. https://doi.org/10.1121/1.395146 123. Patterson RD (1986) Spiral detection of periodicity and the spiral form of musical scales. Psychol Music 14(1):44–61. https://doi.org/10.1177/0305735686141004 124. Patterson RD, Gaudrain E, Walters TC (2010) The perception of family and register in musical tones (Chap 2). In: Jones MR, Fay R, Popper AN (eds) Music perception. Springer Science+Business Media, New York, NY, pp 13–50. https://doi.org/10.1007/978-1-4419-61143_2 125. Patterson RD, Peters RW, Milroy R (1983) Threshold duration for melodic pitch. In: Klinke R, Hartmann R (eds) Hearing – physiological bases and psychophysics, Proceedings of the 6th international symposium on hearing (5–9 April 1983, Bad Nauheim, Germany). Springer, Berlin, pp 321–326. https://doi.org/10.1007/978-3-642-69257-4_47 126. Patterson RD et al (2000) The perceptual tone/noise ratio of merged iterated rippled noises. J Acoust Soc Amer 107(3):1578–1588. https://doi.org/10.1121/1.428442 127. Patterson RD et al (1996) The relative strength of the tone and noise components in iterated rippled noise. J Acoust Soc Amer 100(5):3286–3294. https://doi.org/10.1121/1.417212 128. Plack CJ, Barker D, Hall DA (2014) Pitch coding and pitch processing in the human brain. Hear Res 307:53–64. https://doi.org/10.1016/j.heares.2013.07.020 129. Plomp R, Steeneken H (1971) Pitch versus timbre. In: Proceedings of the seventh international congress on acoustics (1971, Budapest), pp 387–390 130. Plomp R (1998) Hoe wij Horen: Over de Toon die de Muziek Maakt 131. Pollack I (1967) Number of pulses required for minimal pitch. J Acoust Soc Amer 42(4): 895. https://doi.org/10.1121/1.1910663 132. 
Pressnitzer D, Patterson RD, Krumbholz K (2001) The lower limit of melodic pitch. J Acoust Soc Amer 109(5):2074–2085. https://doi.org/10.1121/1.1359797 133. Renken R et al (2004) Dominance of missing fundamental versus spectrally cued pitch: individual differences for complex tones with unresolved harmonics. J Acoust Soc Amer 115(5):2257–2263. https://doi.org/10.1121/1.1690076 134. Repp BH (2007) Perceiving the numerosity of rapidly occurring auditory events in metrical and nonmetrical contexts. Percept & Psychophys 69(4):529–543 (2007). https://doi.org/10. 3758/BF03193910 135. Risset JC (1971) Paradoxes de hauteur: Le concept de hauteur sonore n’est pas le même pour tout le monde. In: Proceedings of the seventh international congress on acoustics (ICA 7) (Budapest), vol 3, pp 613–616. https://www.icacommission.org/Proceedings/ ICA1971Budapest/ICA07

446

8 Pitch Perception

136. Ritsma RJ (1962) Existence region of the tonal residue. I. J Acoust Soc Amer 34(9A):1224– 1229. https://doi.org/10.1121/1.1918307 137. Ritsma RJ (1963) The ‘octave deafness’ of the human ear. IPO Ann Progress Report 1:15–17 138. Ritsma RJ (1967) Frequencies dominant in the perception of the pitch of complex sounds. J Acoust Soc Amer 42(1):191–198. https://doi.org/10.1121/1.1910550 139. Ritsma RJ (1970) Periodicity detection. In: Plomp R, Smoorenburg GF (eds) The proceedings of the international symposium on frequency analysis and periodicity detection in hearing (Driebergen). Sijthof, pp 10–18 140. Roberts B, Bailey PJ (1996) Spectral regularity as a factor distinct from harmonic relations in auditory grouping. J Exp Psychol Hum Percept Perform 22(3):604–614. https://doi.org/10. 1037/0096-1523.22.3.604 141. Robinson K (1993) Brightness and octave position: are changes in spectral envelope and in tone height perceptually equivalent? Contemp Music Rev 9(1–2):83–95. https://doi.org/10. 1080/07494469300640361 142. Robinson K, Patterson RD (1995) The duration required to identify the instrument, the octave, or the pitch chroma of a musical note. Music Percept: Interdiscipl J 15(1):1–15. https://doi. org/10.2307/40285682 143. Robinson K, Patterson RD (1995) The stimulus duration required to identify vowels, their octave, and their pitch chroma. J Acoust Soc Amer 98(4):1858–1865. https://doi.org/10.1121/ 1.414405 144. Ruckmick CA (1929) A new classification of tonal qualities. Psychol Rev 36(2):172–180. https://doi.org/10.1037/h0073050 145. Russo FA, Thompson WF (2005) An interval size illusion: the influence of timbre on the perceived size of melodic intervals. Percept & Psychophys 67(4):559–568. https://doi.org/ 10.3758/BF03193514 146. Schneider P, Wengenroth M (2009) The neural basis of individual holistic and spectral sound perception. Contemp Music Rev 28(3):315–328. https://doi.org/10.1080/ 07494460903404402 147. Schneider P et al (2005) Structural and functional asymmetry of lateral Heschl’s gyrus reflects pitch perception preference. Nat Neurosci 8(9):1241–1247. https://doi.org/10.1038/nn1530 148. Schouten JF (1938) The perception of subjective tones. Proc K Ned Akad Wet 41:1086–1092 149. Schouten JF, Ritsma RJ, Lopes Cardozo B (1962) Pitch of the residue. J Acoust Soc Amer 34(9B):1418–1424. https://doi.org/10.1121/1.1918360 150. Seebeck A (1841) Beobachtungen über einige Bedingungen der Entstehung von Tönen. Annalen der Physik und Chemie 53(7):417–436. https://doi.org/10.1002/andp.18411290702 151. Seither-Preisler A et al (2007) Tone sequences with conflicting fundamental pitch and timbre changes are heard differently by musicians and nonmusicians. J Exp Psychol Hum Percept Perform 33(3):743–751. https://doi.org/10.1037/0096-1523.33.3.743 152. Semal C, Demany L (1990) The upper limit of ‘musical’ pitch. Music Percept: Interdiscip J 8(2):165–175. https://doi.org/10.2307/40285494 153. Sethares WA (2005) Tuning, timbre, spectrum, scale, 2nd edn. Springer, London, pp i–xviii, 1–426. https://doi.org/10.1007/b138848 154. Shamma SA, Dutta K (2019) Spectro-temporal templates unify the pitch percepts of resolved and unresolved harmonics. J Acoust Soc Amer 145(2):615–629. https://doi.org/10.1121/1. 5088504 155. Shamma SA, Klein D (2000) The case of the missing pitch templates: how harmonic templates emerge in the early auditory system. J Acoust Soc Amer 107(5):2631–2644. https://doi.org/ 10.1121/1.428649 156. Shepard RN (1964) Circularity in judgments of relative pitch. 
J Acoust Soc Amer 36(12):2346–2353. https://doi.org/10.1121/1.1919362 157. Shepard RN (1982) Geometrical approximations to the structure of musical pitch. Psychol Rev 89(4):305–333. https://doi.org/10.1037/0033-295X.89.4.305 158. Shofner WP, Selas G (2002) Pitch strength and Stevens’ power law. Percept & Psychophys 64(3):437–450. https://doi.org/10.3758/BF03194716

References

447

159. Shonle JI, Horan KE (1980) The pitch of vibrato tones. J Acoust Soc Amer 67(1):246–252. https://doi.org/10.1121/1.383733 160. Singh PG, Hirsh IJ (1992) Influence of spectral locus and F0 changes on the pitch and timbre of complex tones. J Acoust Soc Amer 92(5):2650–2661. https://doi.org/10.1121/1.404381 161. Slaney M, Lyon RF (1990) A perceptual pitch detector. In: Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP-90) (Albuquerque), pp 357– 360. https://doi.org/10.1109/ICASSP.1990.115684 162. Slawson AW (1968) Vowel quality and musical timbre as functions of spectrum envelope and fundamental frequency. J Acoust Soc Amer 43(1):87–101. https://doi.org/10.1121/1.1910769 163. Smoorenburg GF (1970) Pitch perception of two-frequency stimuli. J Acoust Soc Amer 48(4B):924–942. https://doi.org/10.1121/1.1912232 164. Sohoglu E et al (2020) Multivoxel codes for representing and integrating acoustic features in human cortex. NeuroImage 217:116661, 13 pages. https://doi.org/10.1016/j.neuroimage. 2020.116661 165. Spence C (2011) Crossmodal correspondences: a tutorial review. Att Percept & Psychophys 73(4):971–995. https://doi.org/10.3758/s13414-010-0073-7 166. Steele KM, Williams AK (2006) Is the bandwidth for timbre invariance only one octave? Music Percept: Interdiscip J 23(3):215–220. https://doi.org/10.1525/mp.2006.23.3.215. www.jstor. org/stable/10.1525/mp.2006.23.3.215 167. Stevens SS (1935) The relation of pitch to intensity. J Acoust Soc Amer 6(3):150–154. https:// doi.org/10.1121/1.1915715 168. Stevens SS, Egan JP (1941) Diplacusis in ‘normal’ ears. Psychol Bull 38(7):548 169. Suied C et al (2014) Auditory gist: recognition of very short sounds from timbre cues. J Acoust Soc Amer 135(3):1380–1391. https://doi.org/10.1121/1.4863659 170. Terhardt E (1975) Influence of intensity on the pitch of complex tones. Acustica 33(5):344– 348 171. Terhardt E, Fastl H (1971) Zum Einfluß von Störtönen und Störgeräuschen auf die Tonhöhe von Sinustönen. Acustica 25(1):53–61 172. Terhardt E, Stoll G, Seewann M (1982) Algorithm for extraction of pitch and pitch salience from complex tonal signals. J Acoust Soc Amer 71(3):679–688. https://doi.org/10.1121/1. 387544 173. Terhardt E (1974) Pitch, consonance, and harmony. J Acoust Soc Amer 55(5):1061–1069. https://doi.org/10.1121/1.1914648 174. Thurlow WR, Rawlings IL (1959) Discrimination of number of simultaneously sounding tones. J Acoust Soc Amer 31(10):1332–1336. https://doi.org/10.1121/1.1907630 175. Thurlow WR, Small AM Jr (1955) Pitch perception for certain periodic auditory stimuli. J Acoust Soc Amer 27(1):132–137. https://doi.org/10.1121/1.1907473 176. Titze IR (2008) Nonlinear source-filter coupling in phonation: theory. J Acoust Soc Amer 123(5):2733–2749. https://doi.org/10.1121/1.2832337 177. Titze IR, Riede T, Popolo P (2008) Nonlinear source-filter coupling in phonation: vocal exercises. J Acoust Soc Amer 123(4):1902–1915. https://doi.org/10.1121/1.2832339 178. Tomlinson RWW, Schwarz DWF (1988) Perception of the missing fundamental in nonhuman primates. J Acoust Soc Amer 84(2):560–665. https://doi.org/10.1121/1.396833 179. Trainor LJ et al (2014) Explaining the high voice superiority effect in polyphonic music: evidence from cortical evoked potentials and peripheral auditory models. Hear Res 308:60– 70. https://doi.org/10.1016/j.heares.2013.07.014 180. Van Noorden LPAS (1971) Rhythmic fission as a function of tone rate. Inst Percept Res, 9–12 181. 
Van Noorden LPAS (1982) Two channel pitch perception (Chap 13). In: Clynes M (ed) Music mind brain: neuropsychol music. Plenum Press, London, UK, pp 251–269. https://doi.org/ 10.1007/978-1-4684-8917-0_13 182. Verschooten E et al (2019) The upper frequency limit for the use of phase locking to code temporal fine structure in humans: a compilation of viewpoints. Hear Res 377:109–121. https://doi.org/10.1016/j.heares.2019.03.011

448

8 Pitch Perception

183. Verschuure J, Van Meeteren AA (1975) The effect of intensity on pitch. Acta Acust Unit Acust 32(1):33–44 184. Verwulgen S et al (2020) On the perception of disharmony. In: Ahram T et al (ed) Integrating people and intelligent systems: proceedings of the 3rd international conference on intelligent human systems integration (IHSI 2020). Springer Nature Switzerland AG, Cham, Switzer land, pp 195–200. https://doi.org/10.1007/978-3-030-39512-4_31 185. Von Békésy G (1963) Hearing theories and complex sounds. J Acoust Soc Amer 35(4):588– 601. https://doi.org/10.1121/1.1918543 186. Ward WD (1954) Subjective musical pitch. J Acoust Soc Amer 26(3):369–380. https://doi. org/10.1121/1.1907344 187. Wiegrebe L (2001) Searching for the time constant of neural pitch extraction. J Acoust Soc Amer 109(3):1082–1092. https://doi.org/10.1121/1.1348005 188. Yeh C, Roebel A, Rodet X (2010) Multiple fundamental frequency estimation and polyphony inference of polyphonic music signals. IEEE Trans Audio Speech Lang Process 18(6):1116– 1126. https://doi.org/10.1109/TASL.2009.2030006 189. Yost WA (1996) Pitch strength of iterated rippled noise. J Acoust Soc Amer 100(5):3329– 3335. https://doi.org/10.1121/1.416973 190. Yost WA, Hill R (1979) Models of the pitch and pitch strength of ripple noise. J Acoust Soc Amer 66(2):400–410. https://doi.org/10.1121/1.382942 191. Yost WA, Hill R (1978) Strength of the pitches associated with ripple noise. J Acoust Soc Amer 64(2):485–492. https://doi.org/10.1121/1.382021 192. Yost WA et al (2005) Pitch strength of regular-interval click trains with different length ‘runs’ of regular intervals. J Acoust Soc Amer 117(5):3054–3068. https://doi.org/10.1121/1. 1863712 193. Zheng Y, Brette R (2017) On the relation between pitch and level. Hear Res 348:63–69. https://doi.org/10.1016/j.heares.2017.02.014

Chapter 9

Perceived Location

Sound consists of longitudinal waves which, at 20 °C at sea level, propagate through the air with a speed of about 343 m/s, or roughly 1230 km/h. These sound waves originate at mechanical events representing interactions between solid objects, fluids, gases, or combinations of them. If there are no obstacles between the sound source and the listener, a small proportion of the sound will arrive directly at the listener's ears. This is called the direct sound. The longer the distance between the sound source and the listener, the smaller the proportion of direct sound. According to the inverse-square law, which applies to a spherical sound source, the level decreases by 6 dB for every doubling of the distance.

In general, by far the largest proportion of the sound produced by a source will not arrive at the listener but at other objects, where it can be absorbed, reflected, diffracted, or scattered. The proportion of absorbed, reflected, diffracted, and scattered sound depends on the acoustic properties of these objects and on the frequency of the sound. These objects comprise walls, floors, ceilings, furniture, and other objects, including the perceivers themselves. A small proportion of these reflected, diffracted, and scattered waves will also reach the listener, but the largest part will again be absorbed, reflected, diffracted, or scattered by other objects, and so on.

In summary, after a sound is produced, the first part of the acoustic waves that reaches the listener is the direct sound, which travels straight from the sound source to the listener. This is followed by a series of reflections which, if the reflecting surfaces are relatively large planes, reach the listener at relatively well-defined times. If, however, the surfaces are curved or irregular, the arrival times of these reflections lose definition. Moreover, these indirect waves arrive later than the direct waves and, depending on the trajectories they follow on their way to the listener, they arrive from different directions and with different phases, delays, and intensities. This loss of definition increases every time the sound is reflected or scattered, resulting in a sound field in which the direction from which the sound reaches the listener's ears becomes ill-defined. This is called the diffuse field. This diffuse field, together with the early reflections and the scattered sound, is called the reverberant sound or, briefly, the reverberation in the room.
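As a small aside on the inverse-square law just mentioned, the following Matlab sketch computes the decrease in level of the direct sound with distance. The reference distance and the distances evaluated are arbitrary example values, not taken from the text.

% Level decrease of the direct sound under the inverse-square law:
% every doubling of the distance lowers the level by about 6 dB.
dRef = 1;                        % reference distance in m (assumed)
d = [1 2 4 8 16];                % distances in m (assumed examples)
levelDrop = -20*log10(d/dRef);   % level re the level at dRef, in dB
disp([d', levelDrop'])           % columns: distance (m), level change (dB)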


Since part of the acoustic energy is absorbed with every reflection or scattering, the energy in the reverberant sound will decrease after the sound production has stopped. The rate of this decrease depends on the absorption characteristics of the materials of the objects in the room, and determines what is called the reverberation time of the room. When the decrease is slow, one speaks of a reverberant space; when it is faster, of a sound-attenuated space; and, when the reverberation does not play any perceptually significant role, of an anechoic or dead space. In anechoic spaces, virtually all sound is absorbed by the walls, the ceiling, and the floor.

The amount of reverberation in a space is quantified by the reverberation time of the space, indicated with RT60. It is the time it takes for an impulsive sound to decrease in intensity to 60 dB below its original intensity. It is generally measured for sound in the frequency range of 20 to 20,000 Hz, but one should realize that it is frequency dependent, since the absorption of sound depends on frequency. Roughly speaking, in most rooms, components with a higher frequency are absorbed more strongly than components with a lower frequency. Hence, a more accurate specification of the reverberation properties of a room requires the reverberation time to be specified as a function of frequency.

Another quantity in which the reverberation characteristics of a room are expressed is the reverberation radius r_h of a room, also called critical distance or reverberation distance. It is the distance from the sound source at which the energy in the direct sound field equals that in the indirect sound field. It can be calculated from the reverberation time RT60 according to

r_h = 0.1 √(G V / (π RT60)),    (9.1)

in which V is the volume of the room and G is the directivity factor of the sound source [178, p. 137]. For a spherical sound source, G is 1.
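As a quick numerical illustration, the following Matlab sketch evaluates Eq. (9.1). The room volume and reverberation time are assumed example values, and SI units (V in m³, RT60 in s, r_h in m) are assumed throughout; neither the room nor the units are specified in the text itself.

% Reverberation radius (critical distance) according to Eq. (9.1).
V    = 100;      % room volume in m^3 (assumed example)
RT60 = 0.5;      % reverberation time in s (assumed example)
G    = 1;        % directivity factor of a spherical sound source
rh   = 0.1 * sqrt(G * V / (pi * RT60));   % reverberation radius in m
fprintf('Reverberation radius: %.2f m\n', rh);   % about 0.8 m for these values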

The main property of direct sound is that it arrives earlier at the listener's ears than the reverberant sound. But that is not what is perceived. In general, the reverberant sound is not perceived as a separate sound source, and no separate perceptual attributes can be attributed to it [125]. The reverberant sound fuses perceptually with the direct sound and contributes to the perceptual attributes of the perceived sound, such as its loudness and its timbre. The reverberant sound does not, however, contribute to the perceived location of the sound source; only the direct sound does. Cremer [72, p. 126] called this the “Gesetz der ersten Wellenfront” or, in English, the law of the first wavefront.

One manifestation of this law is the precedence effect, described in 1949 by Wallach, Newman, and Rosenzweig [361]. These authors played series of clicks over two loudspeakers in front of the listener. They varied the distance between the listener and one of the loudspeakers, and found that the click series was perceived as coming from the nearest loudspeaker. Interestingly, listeners also perceived a single series of clicks when the distance between the listener and the two loudspeakers was the same, but the clicks were played a few ms earlier over one loudspeaker than over the other. In that case, the clicks are heard as originating from the “leading” loudspeaker, i.e., the loudspeaker from which the clicks first arrive at the listener. Moreover, the location of the other loudspeaker, the “lagging” loudspeaker, had a negligible influence on the perceived location, even when the intensity of the lagging sound was a few dB higher than that of the leading sound. This effect has also been described for speech by Haas [125] and has been called the Haas effect.

Another manifestation of the law of the first wavefront is the Franssen effect [95, 96]. There are two loudspeakers. The first plays a tone with an abrupt onset, after which its intensity decreases slowly. At the same time, the second loudspeaker plays another tone that compensates for the decaying intensity of the first tone: it starts at zero intensity and then gradually increases in intensity up to the level at which the first tone started. In other words, the pure tone is faded from the first loudspeaker to the second. In this case, listeners hear the sound at the location of the first loudspeaker for the whole duration of the tone, although at the end of the tone all power is produced by the second loudspeaker. An interpretation of these effects is that the sound arriving later at the listener is perceptually interpreted as reverberant sound, to which no separate location is attributed. Hence, the location of the loudspeaker playing the sound that arrives earliest at the listener determines the perceived location of the sound source. The situation is more complex, however, as the effect does not occur for noisy sounds [135]. This is shown in the demo of Fig. 9.1.

Fig. 9.1 The Franssen effect. In the left channel, a short 20-ms 880-Hz pulse-like sound is played. In the right channel, the sound lasts much longer but starts 10 ms after the short sound. This is done for an 880-Hz pure tone and for white noise. The pure tone seems to come from the left channel. The Franssen effect does not occur with the noise. (Matlab) (demo)
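The following Matlab sketch builds a Franssen-type stereo stimulus of the kind just described: a tone that is cross-faded from one channel to the other while the summed envelope remains roughly constant. It is only a rough sketch with assumed parameters (sampling rate, duration, cross-fade time constant), not the book's own demo script; substituting white noise for the tone should make the cross-fade audible, in line with the observation that the effect does not occur for noisy sounds.

% Rough sketch of a Franssen-type stimulus (assumed parameters; not the book's demo).
fs  = 44100;                      % sampling rate in Hz (assumed)
f   = 880;                        % tone frequency in Hz, as in Fig. 9.1
dur = 1.5;                        % total duration in s (assumed)
t   = (0:round(dur*fs)-1)'/fs;
tone = sin(2*pi*f*t);
tau  = 0.2;                       % cross-fade time constant in s (assumed)
envLeft  = exp(-t/tau);           % abrupt onset in the left channel, slow decay
envRight = 1 - envLeft;           % complementary slow rise in the right channel
stim = [envLeft.*tone, envRight.*tone];   % stereo signal: [left, right]
% soundsc(stim, fs);              % uncomment to play over two loudspeakers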


Thorough reviews of the precedence effect are presented by Blauert [28], Brown and Stecker [44], and Litovsky et al. [198]. The precedence effect is another instance of the dominant role played by onsets in auditory perception. Their role in sound localization is also discussed by, e.g., Diedesch and Stecker [75], Hafter and Dye [127], Hafter et al. [128], and Houtgast and Plomp [148].

It has been said that reverberant sound often fuses perceptually with the direct sound. This fusion can break down, however, e.g., when the delay between direct and indirect sound is too long, longer than about 30 to 40 ms. In that case, one hears the later arriving sound as a separate sound source with a separate location; in other words, one hears an echo. An echo can occur when there is a large reflecting surface, such as a wall, separated from the sound source by more than about 10 m. In contrast with early reflections, an echo is perceived as a separate sound source with a perceived location at the other side of the reflecting surface. Its auditory attributes of pitch and timbre are similar to those of the real sound source, its loudness is less, and its beat comes later than that of the direct sound. For the rest, no attention will be paid to echoes.

So, the acoustic waves generated by a sound source will in general arrive at the listener from multiple directions and with multiple phases. Waves that come from the median plane arrive at both ears at about the same time with approximately equal intensities. But when the waves come more from the left or from the right side of the head, they arrive at the two ears at different times and with different intensities, which will appear to play an important role in human sound localization, especially in perceiving the angle between the direction of the sound source and the central axis between the two ears. After the direct sound has arrived, the reflected sound, the diffracted sound, and the reverberant sound arrive at the pinnae, mostly from very different directions. The pinnae are more or less saucer-shaped antennae, the main function of which is to collect the acoustic energy of the sound waves and to guide it into the ear canal. The shape of the pinnae is remarkably irregular and, since the pinnae also reflect and diffract sound waves, the lengths of the trajectories followed by the various sound waves coming from different directions vary slightly, so that these waves arrive at the entry of the ear canal with different phases. As a result, they interfere with each other and, depending on their frequency, sometimes amplify and sometimes attenuate each other. Which frequencies are attenuated and which amplified depends on the direction the sound waves come from. Hence, these resonances and anti-resonances specify the direction the sound comes from. They will appear to play a central role in perceiving the direction of a sound source. In this way, pressure waves arise in the ear canal and set the eardrums in motion, which is the beginning of a process in which the acoustic energy is transformed into neural information used by the central nervous system to interpret what happens around the listener.

In this description, the two eardrums are in fact two point sensors that transform the acoustic information coming from sound sources in the three-dimensional space around the listener into just two time signals. The function of the hearing system is to reconstruct from these two time signals where the acoustic events have taken place and what these events have been. The complexity of this task can perhaps be envisioned by imagining a lake with two sensors at its border that sense the vertical movements of the water surface (see Fig. 9.2). The two signals coming from these sensors must be used to reconstruct the identity and location of everything on the lake that generates waves: boats, swimmers, the wind, objects falling on the water, etc. What this so-called Bregman Lake illustrates in two dimensions is in fact what the two ears have to accomplish in three dimensions (adapted from a somewhat different metaphor by Bregman [32, pp. 5–6]).

Fig. 9.2 The Bregman Lake. For explanation, see text. Retrieved from https://ccrma.stanford.edu/~cc/deck.js/surroundSnd2019/#slide%2027 under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0)

This shows that, in general, the auditory system attributes a perceived location to an auditory stream, a process referred to as auditory sound localization. One of the first questions one may ask is how well the auditory system performs this task.

To this question, however, there is no unambiguous answer, because some sounds can be localized much more accurately than others. The accuracy with which a sound source can be localized auditorily depends on its physical properties. For instance, Von Békésy [354] writes: “The existence of a number of special physical attributes determining the perceived distance of a sound becomes obvious on listening to a sharp click and a continuous tone of 3000 cps in a moderately damped room. The sound image of the click seems small and sharply localized, and when the head is moved the source seems to remain stationary. On the other hand, the image of the tone is diffuse, and when listened to with one ear it seems to move exactly as the head moves. For a continuous tone of 3000 cps there is no distance localization that is related to the actual distance” (p. 21) [Translation by Wever in [351, pp. 301–302]].

In order to specify the position of a sound source, three coordinates are necessary. These coordinates are specified by three planes: the horizontal plane, the median plane, and the frontal plane. The horizontal plane is the plane through the interaural axis parallel to the ground. The frontal plane is the plane through the interaural axis perpendicular to the ground, hence the plane dividing the body into a front and a back half. The median plane is the plane dividing the body into a left and a right half; hence, it is the plane through the middle of the interaural axis and perpendicular to it. The origin is the point where these three planes cross, i.e., the middle of the interaural axis inside the head.

In hearing research, a spherical coordinate system is generally used. The first coordinate is the distance between the origin of this coordinate system and the sound source, indicated by egocentric distance. The second coordinate is the azimuth, which is the angle between the line running from the origin to the projection of the sound source onto the horizontal plane and the line running forward in the median plane. It is 0° for points in the median plane in front of the listener. The azimuth is negative for points on the left side of the listener, so it runs from 0° in front of the listener, through −90° exactly on the left, to −180° at the back. Similarly, the azimuth is positive on the right side of the listener. The third coordinate, the elevation, is the angle between the direction of the sound source and the horizontal plane. These three coordinates, egocentric distance, azimuth, and elevation, specify the position of the sound source relative to the listener.
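As a compact way to pin down this convention, the Matlab sketch below converts (azimuth, elevation, egocentric distance) into Cartesian coordinates. The choice of Cartesian axes (x forward, y to the right, z upward) and the example values are assumptions made here for illustration; the book itself only defines the spherical coordinates.

% Spherical-to-Cartesian conversion under the convention described above:
% azimuth 0 deg straight ahead, negative to the left, positive to the right;
% elevation 0 deg in the horizontal plane, positive upward.
% The Cartesian axes (x forward, y rightward, z upward) are an assumed choice.
azimuth   = -90;    % degrees; example: a source exactly to the left
elevation =   0;    % degrees
distance  =   2;    % egocentric distance in m (assumed)
x = distance * cosd(elevation) * cosd(azimuth);   % forward
y = distance * cosd(elevation) * sind(azimuth);   % rightward (negative = left)
z = distance * sind(elevation);                   % upward
fprintf('x = %.2f m, y = %.2f m, z = %.2f m\n', x, y, z);   % here: 0, -2, 0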


The perceived location of a sound source can be different from its actual location. The coordinates of this perceived location will be indicated with perceived distance, perceived azimuth, and perceived elevation, respectively. As mentioned, some sounds have very diffuse perceived locations, while those of others appear to be quite precise. The fact that the perceived location of a sound is well defined does not necessarily mean that it corresponds to the actual location of the sound source. Imagine, e.g., two situations in which one listens to the sound coming from two loudspeakers in front. First, two different voices are played; one voice is played over one loudspeaker and the other voice over the other. In this case, two different voices are naturally heard, each localized at the loudspeaker from which it is actually played. This is different for the situation in which the same voice is played over both loudspeakers. Then one does not hear two separate voices, one coming from each loudspeaker, as is acoustically the case, but one hears one and the same voice coming from somewhere between the loudspeakers, where acoustically there is no sound source. Hence, the two sounds coming out of the two different loudspeakers are perceptually integrated. The occurrence of such a phantom source is another example of a situation in which the auditory system merges different acoustic sources into one perceived auditory stream.

The occurrence of phantom sources is, in fact, a very familiar phenomenon, as it occurs almost always when listening to sound over headphones. In that case, the sound is most often perceived inside the head, at a location somewhere between the ears. This phenomenon, already observed in early studies, e.g., in 1902 by Rostosky [296, p. 570] and in 1917 by Stewart [326], is called internalization. Internalization also occurs in most stereophonic recordings when listened to over headphones. One then hears some instruments closer to the left ear and some others closer to the right ear. This phenomenon is called lateralization. Listening over headphones does not necessarily imply, however, that one hears the sound inside one's head. It is now possible to reproduce sound over headphones in such a way that listeners hear the sound at locations outside their heads, which is called externalization. There may arise some confusion as to the meaning of the terms externalization and lateralization. Especially in the literature before the 1990s, what is defined here as internalization is indicated with lateralization, and what is defined here as externalization with localization, e.g., by Jeffress and Taylor [158], Plenge [274], and Toole [343].
In this book, the concepts will be kept separated and used as defined above.


Internalization does not only occur when sound is played over headphones; it can also occur for externally produced sound or for sounds played over loudspeakers. One of the simplest ways to experience this is to fold the pinnae over the entrances of the ear canals against the head ([285], as cited by Plenge [275]). Environmental sounds are then often heard inside the head. Another situation in which an external sound is perceived internally arises when narrow-band sounds with frequencies below about 1500 Hz are played in counterphase over two loudspeakers at symmetrical positions with respect to the median plane, e.g., one at an azimuth of −40° and the other at +40°. In that situation, listeners are unable to assign a direction to the perceived location of the sound source and hear the sound “in and around the head” [133, 301].

Besides over headphones, there is another way to create a situation in which people can hear a number of phantom sources in such a way that it seems as if they are really there. This can be done by reconstructing the complete sound field with a large number of loudspeakers. This technique, called wave-field synthesis, has been developed by Berkhout [21] and Berkhout, De Vries, and Vogel [22]. How well listeners can localize sounds in acoustic environments generated by wave-field synthesis is discussed by Wierstorf, Raake, and Spors [369]. Since this is a purely technical issue, it is not discussed here; the reader is referred to the relevant literature.

9.1 Information Used in Auditory Localization

It will appear that the auditory system uses many different kinds of information in order to attribute a location to a sound. First, only the direct sound comes from the direction of the sound source, while reverberant sound does not. Hence, the reverberant sound does not directly contain reliable information as to the direction of the sound source with respect to the listener. As shown in the demo of Fig. 9.1, the auditory system appears to have incorporated this into its sound-localization process and, in the perception of direction, attributes most weight to the onset of a sound, a phenomenon introduced above as the law of the first wavefront. This does not mean that reverberation plays no role in human sound localization. On the contrary, it will be seen that its relationship with the direct sound plays an important role in perceiving egocentric distance.

Before discussing each source of information (SOI) involved in auditory localization, it is emphasized that, in everyday situations, these SOIs interact in a complex yet coherent way. This is very different from the experimental conditions in which one wants to investigate the effect of each of these variables separately, as well as their interactions. In that case, each variable has to be varied independently of the others. It will be shown that the number of SOIs involved is quite considerable, higher than ten. If one wants to conduct an experiment in which ten variables are independently varied over only two values each, one ends up with 2^10 = 1024 conditions, which goes beyond what is practically feasible. Hence, one has to select two, three, or perhaps four variables that can be varied independently, while the other variables are kept constant.


Varying only a limited set of variables leads to situations in which the change of one experimental variable is not coupled with the naturally occurring, coherent changes of the other variables, leading to unnatural stimulus conditions. This will appear to have remarkable perceptual consequences, one of which has already been mentioned: internalization, the phenomenon that the perceived location of a sound can be inside the head. What follows is a more systematic list of the SOIs used in human sound localization. A technical account of many of these SOIs is presented by Van Opstal [349].

9.1.1 Interaural Time Differences

Only when the source of a sound is located in the median plane does the produced sound reach both ears at the same time. When a sound source is not in the median plane, the sound arrives earlier at the ipsilateral ear than at the contralateral ear. This difference in time of arrival is called the interaural time difference (ITD). The relation between the ITD and the actual location of the sound source is quite complex, however. It is schematically illustrated in Fig. 9.3 for a spherical head with diametrically opposed ears. The distance between the two ears is 2r. The left panel shows that the difference in path length from a remote sound source at azimuth θ is 2r sin θ. It also shows the hyperbola consisting of the points that have the same difference in distance to the two ears. Actually, this figure shows the situation in only two dimensions. In three dimensions, the set of points that have the same difference in distance to the two ears is a hyperbolic cone around the interaural axis. This is called the cone of confusion, depicted in Fig. 9.5 and discussed later in this chapter. This shows that ITDs provide little information about the distance to the sound source.

Fig. 9.3 Interaural time differences. When the azimuth θ is larger than zero, sound arrives earlier at the ipsilateral ear than at the contralateral ear. The left panel shows the simple situation of two small point receivers without a head in between. The difference in path length is then 2r sin θ. For a spherical head, the longer distance travelled by sound from a remote source with azimuth θ is rθ + r sin θ, which is Woodworth's formula [379]. (Matlab)

Actually, the real situation is even more complex, as sound cannot travel through the head but must go around it. This is illustrated in the right panel of Fig. 9.3. One can see that the difference in path length d to the two ears is equal to the sum of d1 and d2. With an azimuth of θ in radians, this gives d = d1 + d2 = rθ + r sin θ [325, 379]. Although Stevens and Newman already used this equation in their 1936 paper, it is referred to as Woodworth's formula and the corresponding model as Woodworth's model. For a head with a diameter of 20 cm, an azimuth of 45°, and a speed of sound of 343 m/s, Woodworth's formula gives an ITD of 299 µs. The maximum ITD is attained for an azimuth of 90°, giving an ITD of 514 µs in this model, considerably less than 1 ms.
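The Matlab sketch below evaluates Woodworth's formula as a function of azimuth. The head radius and the speed of sound are assumed example values; the resulting microsecond values scale directly with the radius chosen, so they are illustrative only and will not exactly match the figures quoted above.

% Woodworth's formula: path-length difference d = r*(theta + sin(theta)), ITD = d/c.
r = 0.0875;                          % head radius in m (assumed example value)
c = 343;                             % speed of sound in m/s
azimuthDeg = 0:15:90;                % azimuths in degrees
theta = deg2rad(azimuthDeg);         % azimuths in radians
itd = r * (theta + sin(theta)) / c;  % interaural time difference in s
disp([azimuthDeg', 1e6*itd'])        % columns: azimuth (deg), ITD (microseconds)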

One may wonder whether the auditory system is able to process such small time differences, but this indeed appears to be the case: the human sound-localization system is sensitive to ITDs as short as 10 to 20 µs, corresponding to a difference in azimuth of 1° to 2° [126, 168, 233, 399]. This order of magnitude can be demonstrated in a simple experiment with a 1-m long garden hose. The hose is placed behind a listener while one end is pressed against the left ear and the other against the right ear. If one softly taps against the hose a few centimetres left of the middle of the hose, the listener will in general hear the sound as coming from the left; when one repeats this at an equal distance right of the middle of the hose, the listener will in general hear the tapping sound as coming from the right. One, or at most two, centimetres are enough to hear this difference. Assuming that the sound propagates through air, it takes the sound only about 0.02/343 s ≈ 60 µs to travel this distance of two centimetres. This may look like a very small number, but training and dedicated experiments can even reduce this quantity to 10 to 20 µs. One may object that the listener's judgement can be based on the sound travelling through the hose itself. In solid material, however, the speed of sound is generally much higher than in air, so that, in that case, the interaural time difference would be even shorter. It is thus concluded that listeners are indeed sensitive to ITDs of only one or two tens of µs.

A simple physiological mechanism that makes the discrimination of these short intervals possible was already proposed in 1948 by Jeffress [157]. His model assumes that neurons in the medial superior olive, a nucleus in the medulla oblongata, the lowest part of the brain, are sensitive to small timing differences between action potentials arriving from the left and the right ear. The validity of this model was reviewed in 1998 by Joris, Smith, and Yin [160]. So, it can be assumed that the auditory system indeed uses ITDs in the process of sound localization.
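As an illustration of the kind of computation Jeffress's coincidence-detection idea amounts to, the Matlab sketch below estimates the ITD of a broadband signal by finding the internal delay at which the left and right signals agree best, i.e., by cross-correlation. This is only a schematic stand-in for the neural mechanism, with assumed parameters; it is not a model taken from the book.

% Minimal sketch of ITD estimation by coincidence detection (cross-correlation),
% in the spirit of Jeffress's delay-line idea. All parameters are assumed.
fs = 48000;                            % sampling rate in Hz (assumed)
itdTrue = 300e-6;                      % imposed ITD in s (assumed)
nDelay  = round(itdTrue*fs);           % imposed delay in samples
s = randn(round(0.05*fs), 1);          % 50 ms of broadband noise as source signal
left  = s;
right = [zeros(nDelay,1); s(1:end-nDelay)];   % right ear receives a delayed copy
maxLag = round(1e-3*fs);               % search internal delays up to +/- 1 ms
[c, lags] = xcorr(left, right, maxLag);       % 'coincidence counts' per internal delay
[~, iBest] = max(c);                   % delay with the most coincidences
fprintf('Estimated ITD: %.0f microseconds\n', 1e6*abs(lags(iBest))/fs);
% The estimate comes out close to the imposed 300 microseconds, limited by the
% sampling period. (xcorr is part of the Signal Processing Toolbox.)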

A more complex model, which includes shorter egocentric distances and ear positions that are not exactly diametrically opposed, is presented by Aaronson and Hartmann [1].

Woodworth’s model was developed for clicks or high-frequency tones with wavelengths much shorter than the diameter of the head. For pure tones of lower frequencies, the situation appears to be more complex. Sound passes the head in three dimensions, along the chin, the nose, the top or the back of the head, etc., all with somewhat different path lengths. The presence of the head diffracts the passing sound and thus influences the phase and the intensity of the passing sound waves. The behavior of waves around an obstacle such as the human head is different for wavelengths longer than the diameter of the head than for shorter wavelengths, because of which the ITDs are frequency dependent [2]. Model simulations for spherical heads show that the ITDs are 50% greater for frequencies below 500 Hz than for frequencies above 2000 Hz [176]. The ITDs for frequencies between 500 and 2000 Hz depend not only on the shape of the head but also on the presence and shape of the torso, and even on whether the torso is clothed or bare [176] and on the presence of hair [346]. Some authors find that this frequency dependence of ITDs is perceptually not very relevant [134], though others find that it might be [19]. The interested reader is referred to those papers. Here it will be ignored. For tones of constant amplitude, interaural time differences can only be used for frequencies with wavelengths of the order of the diameter of the head or longer. Licklider and Webster [196] mention an upper limit of 1400 Hz.

Fig. 9.4 Measured and calculated interaural time difference as a function of azimuth. Reproduced from Feddersen et al. [85, p. 989, Fig. 1], with the permission of the Acoustical Society of America

One can imagine that, for these low frequencies, the ITD is relatively short compared with the period of the sinusoid, so that the first maximum in the pressure wave at the contralateral ear after a maximum in the pressure wave at the ipsilateral ear indeed corresponds to the ITD. For shorter wavelengths, the maxima in the pressure wave follow each other so rapidly that the time interval between two successive maxima at one ear is shorter than the ITD. As a consequence, the auditory system cannot match the maxima in the pressure wave at the contralateral ear with the corresponding maxima at the ipsilateral ear and, hence, has no information as to the ITDs at these higher frequencies. This reasoning applies to frequency components of more or less constant amplitude. Woodworth’s model, however, can also be applied to the envelopes of higher-frequency components, if these envelopes have frequencies corresponding to wavelengths longer than the diameter of the head [79, 83, 141, 219, 230]. The results show, indeed, that the auditory system can also use the ITDs of the envelopes of higher-frequency components, though the frequency of the envelopes, the modulation frequency, should not be much higher than 600 Hz [79, 141]. This is considerably lower than the 1400 Hz mentioned above for the frequency of unmodulated pure tones.

As indicated in Fig. 9.3, an ITD on its own does not unambiguously specify the location where a sound comes from. The set of points from which the sound arrives at both ears with equal delays is approximately a hyperbolic cone centred on the interaural axis [359, 360]. This cone is called the cone of confusion [230, 232], and is schematically depicted in Fig. 9.5. As one can see, the intersection of this cone of confusion with a sagittal plane, i.e., a plane parallel to the median plane, is a circle, at least when the distance of this plane to the median plane is larger than that of the ear. One of the consequences of this is that, based on ITD alone, it is not possible to decide whether a sound comes from the front or the back, or from above or below. In other words, listeners make front-back confusions [283, 325], also called front-back reversals, or up-down confusions, also called up-down reversals. For sounds with a limited number of frequency components, such confusions, especially front-back confusions, are indeed quite common. Up-down confusions are less common. These confusions will be discussed at various instances later in this chapter.

In addition to explaining the regular occurrence of front-back and up-down confusions, there is another remarkable aspect of the cone of confusion. Sakamoto, Gotoh, and Kimura [299] argued that, if the auditory system does not have sufficient information about the distance of the sound source from the listener, the perceived distance of a sound is zero or small and the sound is, hence, placed within the head. As illustrated in Fig. 9.3, this perceived location corresponds to the location where the cone of confusion crosses the interaural axis. The larger the azimuth, the more laterally the sound is heard. This phenomenon, discussed in the introduction of this chapter, is called lateralization. So, the cone of confusion can play a role in explaining where inside the head a sound is perceived when it is not externalized.

Fig. 9.5 The cone of confusion. Sounds arriving from the points on the cone arrive at the ear with about the same ITDs and ILDs. Reproduced from Moore [238, p. 262, Fig. 7.8], with the permission of Brill; permission conveyed through Copyright Clearance Center, Inc.

9.1.2 Interaural Level Differences

In contrast with frequency components with wavelengths longer than the diameter of the human head, higher-frequency components are more or less obstructed by the head. The shorter the wavelength, the greater the proportion of acoustic waves reflected by the head. Consequently, the intensity level with which these longitudinal waves arrive at the contralateral ear is lower than at the ipsilateral ear. This difference is called the interaural level difference (ILD). The size of the ILDs is shown as a function of azimuth for a number of frequencies [85] in Fig. 9.6. Data were obtained for 1-sec pure tones with rise and decay times of 150 ms. As Fig. 9.6 shows, for tones with a frequency of 200 Hz, the ILDs are about zero for all azimuths. For higher frequencies, the ILD is naturally zero when the azimuth is 0° or 180°. Between 0° and 180°, the ILDs first grow with increasing azimuth, then fluctuate around a more or less constant value, and return to zero, again, at 180°. For 500 Hz, when the wavelength is about 70 cm, the maximum ILD is a few dB.

At 1800 Hz, or a wavelength of 19 cm, it is already 10 dB and, for frequencies higher than 4000 Hz, or wavelengths of less than 10 cm, it becomes about 20 dB. The measured ILDs do not rise monotonically from 0 dB at 0° up to a maximum at 90° and then fall monotonically back to 0 at 180°, but show some fluctuations. These fluctuations can be attributed to the asymmetrical and irregular shape of the head and the pinna. Just as for the ITDs, the set of points from which the sound arrives with the same ILDs at both ears is a cone centred on the interaural axis. Its shape may be more irregular due to the asymmetrical and irregular shape of the head and the pinnae, but the discrepancy between the cone of confusion for ITDs and that for ILDs will in general be small. Furthermore, both provide an explanation for the phenomenon that listeners often make front-back confusions or up-down confusions or, in the absence of distance information, internalize the sound [158].
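
From a binaural recording, a broadband ILD, and, for comparison with the previous section, an ITD, can be estimated in a few lines. The sketch below uses a synthetic stereo pair in which the left channel is simply an attenuated, delayed copy of the right channel; with real ear signals the same code would apply.

    % Estimating a broadband ILD and ITD from left- and right-ear signals
    fs = 44100;
    xr = randn(round(0.1*fs), 1);                 % stand-in for the right-ear signal
    xl = 0.5*[zeros(13,1); xr(1:end-13)];         % left ear: attenuated, delayed by 13 samples
    ild = 10*log10(sum(xr.^2)/sum(xl.^2));        % ILD in dB (right re left)
    maxLag = round(0.001*fs);                     % search ITDs up to 1 ms
    lags = -maxLag:maxLag;
    cc = zeros(size(lags));
    for k = 1:numel(lags)                         % cross-correlation by shifting
        d = lags(k);
        if d >= 0
            cc(k) = sum(xl(1+d:end).*xr(1:end-d));
        else
            cc(k) = sum(xl(1:end+d).*xr(1-d:end));
        end
    end
    [~, iMax] = max(cc);
    itd = lags(iMax)/fs;                          % positive: the left ear lags the right
    fprintf('ILD = %.1f dB, ITD = %.0f us\n', ild, itd*1e6)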

Fig. 9.6 Interaural level differences for a number of frequencies as a function of azimuth. Reproduced from Feddersen et al. [85, p. 989, Fig. 2], with the permission of the Acoustical Society of America

The reasoning above is based on the situation in which the sound source is relatively far away from the listener and low-frequency sounds are not presented over occluding headphones. When low-frequency sounds are presented over occluding headphones, they can be presented with arbitrary intensity differences. In that case, they appear to induce lateralization in the same way as high-frequency sounds do [81, 129, 387]. Another situation in which ILDs of low-frequency sounds do play a role is when the sound source is close to the listener, less than, say, one or two metres. Then the acoustic wave, as it reaches the listener’s head, can no longer be considered a plane wave, but must be considered a spherical wave. The sound intensity will, therefore, diminish with the distance travelled, and thus be significantly higher at the ipsilateral ear than at the contralateral ear. In those situations, too, intensity differences of low-frequency sounds appear to play a role in human sound localization [47].

The situation is further complicated by the complex shape of the head, the neck, the shoulders, and the torso. These reflect the incoming sound, and these reflections interfere with the direct sound. Since high frequencies have short wavelengths, the ILDs vary rapidly with azimuth, elevation, and distance. The acoustics of this problem are discussed for a spherical model of the head by Duda and Martens [80], and for a more general model of the head, the torso, and the shoulders by Xie [382]. The perceptual consequences for human sound localization are studied by Macaulay, Hartmann, and Rakerd [205] and by Pöntynen and Salminen [279]. Moreover, due to the variability of the ILD with azimuth, elevation, and distance, the location of the sound source with respect to the listener is specified by the interaural coherence of the ILD fluctuations during listening to time-varying sounds such as speech. This appears to play a significant role in the externalization of sound [61].

It was mentioned that the time and intensity differences with which sound reaches both ears are sources of information (SOIs) used by the auditory system to localize a sound source. Indeed, as early as 1907, Rayleigh [283] showed that these two measures contribute to human sound localization. This theory, which holds that auditory localization in the horizontal plane is based on interaural time and intensity differences, became known as the duplex theory of sound localization. It will be shown, however, that the auditory system uses many more SOIs than just ITDs and ILDs in sound localization.
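
The inverse-distance contribution to the ILD for a nearby source mentioned above can be illustrated with a small calculation that ignores head shadow and diffraction; the distances used are assumed values for the illustration only.

    % Level difference from spherical spreading alone (no head shadow)
    dIpsi = 0.3;                              % assumed distance to the ipsilateral ear (m)
    dContra = dIpsi + 0.18;                   % assumed extra path to the contralateral ear (m)
    fprintf('near source: %.1f dB\n', 20*log10(dContra/dIpsi))   % about 4 dB
    fprintf('far source:  %.1f dB\n', 20*log10(3.18/3.0))        % about 0.5 dB for the same extra path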

9.1.3 Filtering by the Outer Ears

The idea that the pinnae may play an important role in human sound localization was already formulated in the nineteenth century by Mach [206] and Thompson [340]. Thompson [340] argued “the perception of direction of sound arose from the operation of the pinnae of the ears as resonators for the higher tones to be found in the compound sounds to which the ear is usually accustomed; their action as resonators should be more or less effective according to the position of the pinnae with respect to the direction of the sound-waves; and by this reinforcing in different position with unequal intensity some one or more of the higher tones of a compound sound should affect the quality or timbre of the perceived sound, producing a difference between the sounds heard in the two ears in all positions, save when the source of sound lay in the median plane of the head” (p. 410) [the word “quality” emphasized in original]. Thompson [340] first notes that, due to the direction-dependent resonances of the pinnae, the high-frequency part of sounds with a wide-band spectrum, “the higher tones of a compound sound” in the quote, varies as a function of the direction of the sound source. Then, in the last part of the quote, he suggests that outside the median plane the differences between the high-frequency parts of the spectra at the left ear and the right ear specify the direction of the sound source. This, however, appears not to be correct [134]. If the auditory system used the difference between the spectra at the right and the left ear, we would need two ears for accurate sound localization. We can, however, also localize sounds with one ear, some listeners even almost as well as with both ears [8, 90, 137, 317]. This shows that the auditory system in general uses the information in the spectra at each ear separately. For a comprehensive discussion of this subject, see Carlile et al. [59].

The role of the pinnae in sound localization can easily be demonstrated in the following informal experiment. First, a participant is asked to close the eyes and point in the direction of a rich sound source, e.g., a rattling key ring positioned at various locations with different azimuths and elevations. In general, the participant will be well able to do this task. Then, the pinnae of the listener are folded forwards by taping them to the cheek. It appears that the participant will then make many errors in localizing the sound source [50]. More specifically, Gardner and Gardner [106] showed that filling the cavities of the pinnae with soft rubber severely affected the ability of the participants to perceive the elevation of a high-frequency wide-band noise. Musicant and Butler [243] found that the number of front-back confusions also increased considerably when the pinnae were occluded, which demonstrates the role of the pinnae in horizontal sound localization.

How the pinnae operate as sensors for sound-source direction is schematically illustrated in Fig. 9.7. It appears that sound, when it has arrived at the ear, is reflected by the different parts of the pinna. The distance travelled up to the entry of the ear canal then varies with the place on the pinna where the sound is reflected. These differences in the distance travelled are of the order of a few centimetres, resulting in resonances and antiresonances with frequencies of a few thousand hertz and, importantly, their frequencies depend on the direction where the sound comes from. A mathematical model of this process has been developed by Batteau [15].

Fig. 9.7 Reflection of sound by the pinna. Reproduced from Wright et al. [380, p. 958, Fig. 1], with the permission of the Acoustical Society of America

So, it can be concluded that the outer ear operates as a direction-dependent filter. The distribution of the peaks and notches of the frequency characteristic of this filter specifies the direction where the sound comes from. This implies that, if wide-band noise is played from a loudspeaker with a certain azimuth and elevation, it arrives at the entry of the ear canal with spectral peaks and notches corresponding to that azimuth and elevation. The transmission from the entrance of the ear canal to the eardrum, though different from listener to listener, does not depend on the direction of the sound source [131, 311, 368]. As a consequence, the filter properties of the outer ear can be measured at the entrance of the ear canal as well as at the end of the ear canal close to the eardrum.

One may now wonder what would happen if the path between sound source and eardrum were circumvented by filtering a wide-band sound in such a way that it has the same peaks and notches as when filtered by the pinnae and then playing it close to the eardrum. This was actually tested by Blauert [27] with one loudspeaker in front of the listener and another behind the listener. He played a wide-band sound from one of the loudspeakers and measured the spectrum of the sound as recorded at the entrance of the ear canal. These two spectra had peaks and notches at different locations. He then filtered the original wide-band sound in such a way that, when played over the two loudspeakers, one in front of the listener and the other behind the listener, it showed the same peaks and notches at the entrance of the ear canal as recorded for one of these two directions. He found that, though the sound was played over two loudspeakers, the listener perceived the sound as coming from in front or from behind according to the position of the peaks and notches in the spectrum that corresponded to these two directions. For these two directions, this shows that the distribution of the peaks and notches in the spectrum, as can be measured at the entrance of the ear canal, specifies whether the sound is perceived as coming from in front of the listener or from behind, not its actual location.

The finding by Blauert [27], just described for only two directions, can be generalized to all directions. In principle, the frequencies of the resonances and antiresonances of the pinna filter can be determined for any specific direction in the following way. A listener is put in an anechoic room with a probe microphone in the ear canal. Then a sound impulse is generated at a point located in that specific direction, and one measures the pressure wave generated by that impulse. This pressure wave is called a head-related impulse response (HRIR). Examples of HRIRs of the left and the right ear of a participant are shown in Fig. 9.8 for two directions. The upper panels show the HRIRs for an azimuth of 30°; the lower panels those for an azimuth of 90°. These impulse responses also show a difference in intensity and in time of arrival at the two ears. Therefore, HRIRs also contain information about ITDs and ILDs. As for any linear filter, the transfer function can be found by calculating the power spectrum of this impulse response. This spectrum is called the head-related transfer function (HRTF) of the ear for that direction. The procedure of measuring HRTFs is described by Møller et al. [236]. The HRTFs corresponding to the HRIRs of Fig. 9.8 are shown in Fig. 9.9. They show indeed that the HRTFs are different for the two azimuths. Since only the magnitude spectrum is presented, the information present in the ITDs is lost.
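
The step from an HRIR to the corresponding HRTF amounts to taking a magnitude spectrum. A minimal sketch of this step, with two crude synthetic stand-ins instead of measured HRIRs:

    % From a head-related impulse response to the corresponding HRTF
    fs = 44100;
    hrirL = [zeros(1,8) 1 0.5 -0.3 zeros(1,245)];     % crude stand-in for a left-ear HRIR
    hrirR = [zeros(1,20) 0.6 0.3 -0.2 zeros(1,233)];  % crude stand-in for a right-ear HRIR
    nfft = 512;
    f = (0:nfft/2)*fs/nfft;                           % frequency axis in Hz
    HL = 20*log10(abs(fft(hrirL, nfft)));             % magnitude spectrum in dB = HRTF
    HR = 20*log10(abs(fft(hrirR, nfft)));
    plot(f, HL(1:nfft/2+1), f, HR(1:nfft/2+1))
    xlabel('frequency (Hz)'); ylabel('magnitude (dB)')
    % The 12-sample interaural delay visible in the HRIRs is lost in these
    % magnitude spectra; the interaural level difference is retained.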

Fig. 9.8 Head-related impulse response for two positions in the horizontal plane measured at the left and the right ear of a listener. The parameter θ indicates the azimuth. Reproduced from Zhong and Xie [398, p. 107, Fig. 6] under CC BY 3.0 (https://creativecommons.org/licenses/by/3.0)

In addition, Fig. 9.9 shows once more that the ILDs are quite small at low frequencies but become larger at higher frequencies, as was shown in a different way in Fig. 9.6.

Head-related transfer functions show not only the peaks and notches that correspond to the direction-dependent resonances and antiresonances of the pinnae, but also more gradual fluctuations due to the presence of the head and the torso. In order to compensate for this, the HRTFs are often normalized by dividing them by the average of the HRTFs over all directions [230]. This gives the directional transfer function (DTF). After this normalization, only the direction-specific part of the resonant properties of the pinna should remain.

Figure 9.9 only shows HRTFs for two different azimuths. The dependence on elevation is shown in Fig. 9.10, which presents the DTFs for varying elevations for two different listeners. The azimuth is held constant at 90° in this figure; the elevation is varied from −50° to 50°. Figure 9.10 not only demonstrates that the DTFs depend on the elevation of the sound source; it also shows that they are different for the two listeners. This is shown more explicitly in Fig. 9.11 for one azimuth, 90°, and one elevation, 0°. For this direction, it shows the DTFs of the right ear for ten different listeners. The DTFs appear to vary considerably among listeners, especially for frequencies higher than about 6 kHz. It shows, e.g., that the frequency of the main notch in the DTF can vary between 7 and 14 kHz from the ear of one listener to that of another. This dependence on the shape of the ear appears to be a general feature of HRTFs. And since no ear is identical to another, an HRTF specifies the direction of a sound source only for the ear for which it is measured.

One of the main practical consequences of this is that HRTFs measured from one listener are not representative of the HRTFs of someone else. The problem becomes acute when one wants to present sound over headphones in such a way that the listener not only externalizes the perceived location of the sound source but also attributes a specific direction to that location.

Fig. 9.9 Four HRTFs corresponding to the impulse responses of Fig. 9.8. Reproduced from Zhong and Xie [398, p. 108, Fig. 7] under CC BY 3.0 (https://creativecommons.org/licenses/by/3.0)

Butler and Belendiuk [51] placed microphones inside the ears of listeners who listened to noise bursts played from loudspeakers located in front of them at various elevations, and recorded the noise bursts within the ears of the listeners. Then they played the recordings back over headphones to these listeners and found that the perceived location of the noise bursts agreed almost perfectly with the original location of the loudspeaker from which the noise bursts were played during the recording, at least if the listener during the recording and during the playback was the same. If the recordings were played over headphones to another listener, significant localization errors occurred. However, all listeners externalized the sounds. Apparently, the resonance frequencies as recorded with the ears of one listener specify different directions for other listeners.

The results by Butler and Belendiuk [51] were obtained for sounds coming from the median plane in front of the listener at different elevations, but they can be generalized to all directions. Indeed, for eight participants, Wightman and Kistler [371, 372] measured the left and the right HRTFs for 144 different directions. These HRTFs were used to filter wide-band sound that was played over headphones to the same listener from whom the HRTFs were measured. They found that, in general, the sound filtered in this way was perceived as coming from the directions corresponding to the HRTFs with which it was filtered. There were still somewhat more front-back confusions and some more discrepancies than in free-field stimulation, but later experiments showed that these could be attributed to minor inaccuracies resulting from the insertion of the probe microphone into the ear canal during the measurement of the HRTFs [39].

With two sets of a large number of HRTFs, one for the left and one for the right ear, it is natural to ask whether the data necessary to represent them can be reduced.

Fig. 9.10 Directional transfer functions (DTFs) of two listeners for an azimuth of 90◦ as a function of elevation. Reproduced from Wightman and Kistler [370, p. 5, Fig. 2] with the permission of Taylor & Francis Group; permission conveyed through Copyright Clearance Center, Inc.

Fig. 9.11 Left-ear DTFs of ten different listeners for a sound source coming exactly from the left. Reproduced from Zhong and Xie [398, p. 108, Fig. 8] under CC BY 3.0 (https://creativecommons.org/licenses/by/3.0)

Kistler and Wightman [165] studied how many principal components are needed to describe an HRTF without affecting the accuracy with which a sound can be given a virtual location. They recorded the HRTFs of both ears of ten subjects at 265 positions and subjected these data to a principal component analysis. They showed that five principal components were sufficient to explain 90% of the variance of the 2 · 10 · 265 HRTFs. Breebaart, Nater, and Kohlrausch [31] studied the frequency resolution necessary to represent an HRTF. They concluded that one data point per critical band is enough for each HRTF.

Another question is: For how many directions must the HRTFs be determined? One way to answer this is to measure the HRTFs for many directions, then reduce the number of HRTFs and check how far one can go before virtual localization accuracy is affected. In this way, Breebaart, Nater, and Kohlrausch [31] found that a resolution of 5° is enough; a finer sampling in azimuth resulted in only a marginal increase in localization accuracy. Finally, by decomposing the HRTFs of 20 listeners into 35 basis functions, Xie [384] found that 73 directions per ear were enough, and that the HRTFs for other directions could be obtained by interpolation of these 73 HRTFs. Defining distortion as the root-mean-square deviation of the interpolated HRTFs from the original ones, this leads to a mean signal-to-distortion ratio of 19 dB.

Even when the reductions in the number of measurements just mentioned are applied, measuring individualized HRTFs remains quite laborious. Each listener has to go to an anechoic room where a large number of precise measurements have to be made. Various ways have been proposed to reduce the workload of this process. The simplest way is to use non-individualized HRTFs and take the inaccuracies for granted [366]. Another approach is to measure the shape of the ear and to develop an acoustic model from which the HRTFs can be derived. It appears that part of the individual differences is due to differences in the maximum interaural delay, the size of the external ears, and the width of the head [227].

Fig. 9.12 The ten anthropometric parameters used by Iida, Ishii, and Nishioka [151] to derive the HRTFs for locations in the median plane. Reproduced from Iida, Ishii, and Nishioka [151, p. 320, Fig. 3], with the permission of the Acoustical Society of America

By scaling the dimensions of an arbitrary HRTF to those of the listener according to these size differences, Middlebrooks [229] could increase the localization accuracy of virtual sound sources by a factor of two. Iida, Ishii, and Nishioka [151] measured the listeners’ ears in more detail. They measured the ten anthropometric parameters of an ear indicated in Fig. 9.12 and used these to derive its HRTFs for different locations in the median plane. They claim that localization was as good as with individualized HRTFs for positions in front of and behind the listener, but that accuracy decreased with elevation. They argue that perceived azimuth does not require accurate HRTFs but can better be adjusted by binaural time and intensity differences. Finally, Fels and Vorländer [88] checked six anthropometric measures of the pinnae and six such measures of the torso and the head for their effect on the HRTFs of children and adults. Their results were not based on perceptual experiments but on CAD methods. For the measures of torso and head, they found that the distance between the ear and the shoulder, the width of the head, and the back vertex have a considerable effect on the shape of the HRTFs. For those of the pinnae, the depth of the concha and its size appeared to be most significant.
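
The principal-component description of HRTFs by Kistler and Wightman [165] mentioned earlier in this section can be illustrated with a toy calculation. The data below are simulated stand-ins for measured log-magnitude HRTFs, and the choice of five components simply follows their result.

    % Toy sketch of a principal-component decomposition of HRTFs (simulated data)
    nDir = 200; nFreq = 128;
    shapes = randn(5, nFreq);                            % five underlying spectral shapes
    H = randn(nDir, 5)*shapes + 0.1*randn(nDir, nFreq);  % stand-in log-magnitude HRTFs
    Hmean = mean(H, 1);
    Hc = H - repmat(Hmean, nDir, 1);                     % remove the mean HRTF
    [~, S, V] = svd(Hc, 'econ');                         % columns of V are the components
    varExplained = cumsum(diag(S).^2)/sum(diag(S).^2);
    fprintf('variance explained by 5 components: %.0f%%\n', 100*varExplained(5));
    weights = Hc*V(:, 1:5);                              % five weights describe each HRTF
    Happrox = weights*V(:, 1:5)' + repmat(Hmean, nDir, 1);   % reconstruction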

9.1.4 Reverberation

So far, ITDs, ILDs, and the direction-dependent filtering properties of the pinna have been discussed. For wide-band sounds, these three sources of information (SOIs) specify the direction where the sound comes from, but they do not specify egocentric distance, i.e., the distance between the sound source and the listener. Consequently, it occurs very often that, in the absence of any other information, a sound is internalized when presented over headphones. Hence, additional information is necessary that specifies egocentric distance. One of the most relevant SOIs in this respect is supplied by the reverberation in the listening environment. If the listener is close to the sound source, the intensity of the direct sound will be high compared with that of the reverberant sound, and it will decrease as the distance between sound source and listener gets longer. For a spherical sound source, this decrease is 6 dB per doubling of the distance.

The intensity of the reverberant sound, on the other hand, depends much less on the distance between sound source and listener, as it is the sum of reflected, diffracted, and scattered sound in the room; it depends much more on the size and the absorption characteristics of the room [394]. This implies that the ratio of the energy of the direct and the reverberant sound that arrives at the listener will decrease as a function of the distance between the sound source and the listener [214, 324, 351, 354]. This ratio is called the direct-to-reverberant ratio (DRR). For every room, location of the listener, and location of the sound source, this DRR specifies the distance between this listener and this sound source. This was confirmed by Mershon and King [224] and Mershon and Bowers [223], who showed that the DRR could indeed perceptually serve as an SOI for absolute distance.

If the DRR provides distance information, it is natural to infer that, in the absence of such information, the perceived distance is zero or very small and the sound is, hence, heard inside the head [299]. This implies that the DRR may play an important role in externalization. Indeed, Plenge [274, 275] used a dummy head, i.e., an artificial head with realistically shaped outer ears and with microphones at the location of the eardrums. The author showed that, when sound was recorded in a normal reverberant room with such a dummy head and presented to listeners over headphones, the listeners did hear the sound as coming from outside. This indicates that the filtering properties of the outer ears play an important role in externalization. That was not the only relevant result. Plenge [275] also found that, when the listeners were placed in a darkened room, so that they did not know the dimensions or the reverberant properties of the room, externalization did not occur immediately, but only a few seconds after the sound was turned on. Hence, the listener must first have an impression of the acoustic characteristics of the room where the sound is recorded before externalization occurs. Plenge [275] concluded that the mere presence of reverberant sound is not enough for externalization; it must also correspond to the acoustic characteristics of the room the listener has in mind. The role played in externalization by reverberation was further investigated by Sakamoto et al. [299]. They recorded sound with two dummy heads in two separate rooms, an anechoic room and a reverberant room. By mixing the two recordings in different proportions, the DRR could be varied independently. In this way, the authors could establish that the DRR indeed plays a very important role in the externalization of sound, as also shown by Hartmann and Wittenberg [134]. Externalization will be discussed more extensively in Sect. 9.5.

The reverberant properties of a room are generally characterized by its impulse response, the room impulse response (RIR). It can be measured by producing a short impulsive sound at a certain location in the room and recording the response at the position of the listener. A handclap is frequently used as an impulsive sound [262]. This impulse response contains the direct sound, all early reflections, late reflections, and the diffuse field. In principle, the impulse response at a specific location in a room depends on both the location of the sound source and the recording position. As a consequence, for every room, a large number of RIRs have to be determined.
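
From a measured RIR, the DRR can be estimated by splitting the energy of the impulse response into a direct part and a reverberant part. The following is a rough sketch; the RIR is simulated here, and the 2.5-ms window for the direct sound is an assumption, not a standard value.

    % Estimating the direct-to-reverberant ratio (DRR) from a room impulse response
    fs = 44100;
    rir = [zeros(1,200) 1 zeros(1,799) 0.3 zeros(1,999) 0.1*randn(1,16000)];
    rir = rir .* exp(-(0:numel(rir)-1)/8000);         % crude stand-in for a measured RIR
    [~, iPeak] = max(abs(rir));                       % arrival of the direct sound
    iEnd = min(iPeak + round(0.0025*fs), numel(rir)); % direct part: 2.5 ms after the peak
    eDirect = sum(rir(iPeak:iEnd).^2);                % energy of the direct sound
    eReverb = sum(rir(iEnd+1:end).^2);                % energy of reflections and diffuse field
    drr = 10*log10(eDirect/eReverb);                  % DRR in dB
    fprintf('DRR = %.1f dB\n', drr)
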
One way to avoid having to measure a large number of RIRs is to use an acoustic simulation of the room with which the impulse response can be calculated for every position of the listener and of the sound source, a technique that has been under development since the early 1960s [306].

The pros and cons of these methods are described by Christensen [64]. A recent overview of how an RIR can be measured is presented by James [156].

A final remark: it is well known that bats use the early reflections of self-generated sounds for navigation and for localizing prey. Human listeners, too, especially blind listeners, appear to be able to use echolocation during navigation. To do this, they use sounds produced by themselves, such as tongue clicks or finger snaps. A short review of human echolocation is presented by Neuhoff [247, pp. 103–106]. Milne et al. [234] showed that blind listeners who use echolocation can even distinguish triangular, square, and rectangular shapes positioned at 80 cm from their ears, but only when they could make head movements. A comprehensive review is presented by Kolarik et al. [172].

9.2 The Generation of Virtual Sound Sources

It has been shown that recorded sound is almost always internalized when played over headphones. Nowadays, however, it is possible to play externalized sounds over headphones. When a sound is presented over headphones and its source is perceived outside the head, this is called a virtual source. It has taken a lot of research effort to master the technique of externalizing sound played over headphones, but now this can be done. The generation of virtual sound sources appears to be a powerful tool in the study of human sound localization. This field of research is often referred to as the study of spatial hearing.

The reason it took so long to get a grip on the question of why sound played over headphones is mostly internalized lies in the reductionist approach that has proven so fruitful in many other areas of hearing research. Indeed, what is more obvious than to start from the simplest sounds, pure tones, and, since one wants full control over the stimuli and the stimulus conditions, to eliminate all environmental factors by presenting the sounds over headphones? The problem is, first, that pure tones are difficult to localize in general, especially their elevation and distance. Second, when sound is played over headphones, the variables that play a role in sound localization usually have values that do not correspond to the values they have when these sounds are produced at their usual external positions. Hence, they do not have coherent values. In this situation, as discussed in the previous Sect. 9.1.4, sound is almost never externalized; in fact, without environmental information the perceptual system that estimates distance collapses and the sound is internalized.

The fact that sound is generally internalized when played over headphones may seem natural, because the actual sound is not produced in the environment of the listener but over headphones. Even when sound is produced in the environment, however, the percept of distance must be based on acoustic information entering the hearing system through the ear canals. And when the sound played over headphones is the same as the sound at the entrance of the ear canal coming from an external sound source, the perceived location of the sound source should be the same.

The first to check this, in the 1970s, was Laws [185], who carried out his experiments in a dark, anechoic room and used wide-band noise as the stimulus. He played the stimulus from one of an array of loudspeakers positioned in front of the listener, recorded the sound within the ear canal of the listener, and measured its spectrum. Then he processed the sound in such a way that its spectrum was equal to that recorded spectrum. Next, he played this processed sound over headphones and asked the listeners to indicate which loudspeaker they thought had produced the sound. He compared their responses with those given when the sounds were actually played over one of the loudspeakers. It appeared that, indeed, though not necessarily correct, the estimates of the distance of the sound source were the same as when the sounds were actually played over the loudspeakers. In these experiments, the loudspeakers were placed in a linear array in front of the listeners, whose heads were fixed so that they could not make any head movements. Listeners were seated in a dark, anechoic room, and 12 out of the 20 subjects were not familiar with the room in which the experiments were carried out. In that condition, listeners often made front-back confusions or reported not to externalize the sound. In spite of these limitations, Laws [185] showed that the spectral transformations of the sound by the body and the pinnae on its course from source to eardrum are essential for the veridical reproduction of the perceived distance of a sound source, and that, by applying these transformations to the sound signal and presenting the result over headphones, the same perception of distance could be realized as when the unprocessed sound was played over loudspeakers.

The veridical reproduction of sound in three dimensions over headphones was shown to be possible in the late 1990s by Wightman and Kistler [371, 372]. This requires, first, that the filtering by the pinnae is realistically simulated. Second, it appears that the information used by the auditory system in auditory distance perception comprises the complex spectrotemporal acoustic patterns arising from the combination of the direct sound and the reverberant sound at the location of the ear canal [365]. This is necessary because, as shown in the last section, the reverberant properties of the room perceptually specify the distance of a sound source. For that reason, a faithful spatial reproduction of the acoustic environment over headphones can only be realized by accurately reproducing these complex spectrotemporal patterns at the entrance of the ear canal. As has been shown above in Sect. 9.1.3, this can be realized by combining the head-related impulse responses (HRIRs) of the listener with the room impulse responses (RIRs) of the room. In many applications, the RIRs and the HRIRs are not measured separately, but are combined by directly measuring the impulse responses of a room at the entrance of the ear canal or at the eardrum [394]. This impulse response is called the binaural room impulse response (BRIR). The absolute value of its Fourier transform is the binaural room transfer function (BRTF). In principle, a BRTF contains both the filtering properties of the listener’s outer ears and the filtering by the acoustics of the room and, hence, its reverberation; the BRIR also includes the phase characteristic of the filter. By convolving a sound signal as produced at its source with this BRIR and presenting the result over headphones, one can reproduce the sound signal at the eardrum of the listener as if it were produced by the sound source.
When this is done accurately, a virtual sound source reproduced in this way cannot be distinguished from a real sound source.

Indeed, it now appears possible to create quite realistic and high-quality auditory virtual environments with these techniques. For instance, a recent publication by Schoeffler et al. [305], in which virtual auditory environments created with BRIRs were compared with real environments, showed that the difference in quality between the different spaces is small. In spite of this positive evaluation, they conclude that “future experiments must investigate other dependent perceptional variables such as sound quality, reverberation, loudness, and distance” (p. 198).

An important disadvantage of these techniques is that, for every different room and every different listener, different sets of HRIRs and RIRs have to be measured. If this procedure could be shortened, it would save a lot of time. In the previous Sect. 9.1.3, it was discussed how this may be done for the pinnae. As to the BRIRs, Lindau et al. [197] studied this for some music and speech sounds in a studio with an RT60 of 0.7 s and a reverberation radius of 1.4 m, and for a lecture room with an RT60 of 2.1 s and a reverberation radius of 4 m. They found that a resolution better than 3° in both the azimuth and the elevation direction was enough to avoid resolution artifacts. For music in a reverberant environment, they recommend a resolution of 5° or better. They concluded that these results could be well explained by assuming that the auditory analysis is based on the output of a third-octave filterbank. In practice, BRIRs are often measured with a resolution of 15°, e.g., by Shinn-Cunningham et al. [313].

Many publications are now available on auditory sound localization in which detailed information about the generation of virtual sound sources and the creation of virtual auditory displays can be found: for instance, Begault and Trejo [17], Blauert [28, 29], Gilkey and Anderson [114], Sunder et al. [332], Suzuki et al. [334], Vorländer and Shinn-Cunningham [356], Xie [383], and Zhong and Xie [398]. The technique is used in quite a few of the experiments described in this chapter.
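
The rendering step itself, convolving a dry source signal with a left- and a right-ear BRIR and presenting the result over headphones, can be sketched in a few lines; the BRIRs below are crude synthetic stand-ins, not measured ones.

    % Binaural synthesis: convolve a dry signal with left- and right-ear BRIRs
    fs = 44100;
    x = randn(1, 2*fs);                                      % stand-in for a dry source signal
    brirL = [zeros(1,10) 1 zeros(1,50) 0.2*randn(1,fs/2)];   % crude stand-in BRIRs; in practice
    brirR = [zeros(1,25) 0.7 zeros(1,50) 0.2*randn(1,fs/2)]; % these are measured at the ears
    brirL = brirL .* exp(-(0:numel(brirL)-1)/4000);          % rough exponential decay
    brirR = brirR .* exp(-(0:numel(brirR)-1)/4000);
    yL = conv(x, brirL);                                     % left headphone channel
    yR = conv(x, brirR);                                     % right headphone channel
    n = max(numel(yL), numel(yR));
    y = [[yL zeros(1, n-numel(yL))]', [yR zeros(1, n-numel(yR))]'];
    y = y/max(abs(y(:)));                                    % avoid clipping
    % sound(y, fs)                                           % play over headphones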

9.3 More Information Used in Sound Localization

Above it has been shown that ITDs, ILDs, the spectral filtering by the outer ear, and reverberation in principle contain all the information listeners may use to correctly estimate the three-dimensional location of a sound source. In a way this is correct for temporally and spectrally rich sounds. For narrow-band signals, or in situations with little reverberation such as the free field, however, locating a sound source can be, and often is, tricky. A variety of other SOIs can then additionally be used, which will now be discussed.

9.3.1 Movements of the Listener

In situations in which sound is not produced over headphones, movements by the listener induce coherent patterns of change in the intensity level, the ITDs and ILDs, and the peaks and notches of the sound spectra at the eardrum.

This situation is completely different when sound is presented over headphones. In that situation, the ITDs, the ILDs, and the spectra at the eardrum do not change with head movements. Consequently, the perceived location of a sound is stationary with respect to the listener’s head. This may be a factor that prevents the externalization of the sound source [365].

That head movements play an important role in human sound localization can already be concluded from the results of an early experiment carried out in 1950 by Koenig [170]. He used two microphones attached to an artificial head. The signals of the microphones were fed to a listener. Then he had a speaker walk around in front of the artificial head. The artificial head was first kept stationary. In that situation, the speech from the speaker sounded as coming from a place somewhere on a semicircle behind the listener. However, this artificial head could also move in synchrony with the head of the listener. In that situation, the speech sounded as coming from in front of the listener when the speaker was walking in front of the artificial head. The role of head rotations was further investigated by Brimijoin et al. [36]. They carried out experiments either with real sources or with virtual sources that were stationary relative to the outside world or stationary relative to the listeners’ heads. Their results confirmed that, in addition to the spectral filtering of the pinnae and the presence of realistic reverberation, head rotations play an important role in the externalization of sound.

Another issue is the occurrence of confusions. At various instances, it has been mentioned that front-back confusions are common in laboratory situations. The ITDs and ILDs with which a sound arrives at both ears do not completely specify the direction from which the sound arrives at the ears, but constrain the possible locations to the cone of confusion shown in Fig. 9.5 and introduced above in Sects. 9.1.1 and 9.1.2. The concept of the cone of confusion explains why front-back confusions are quite common in studies on sound localization. Up-down confusions also occur, but are less common. In the first half of the last century, Rayleigh [283], Young [393], and Wallach [359, 360] already showed that head rotations can indeed play an important role in disambiguating front-back confusions. Rotating one’s head changes the time and intensity differences with which a sound source reaches our ears. First, the situation will be examined in which the sound source is exactly within the median plane, so exactly in front of or behind the listener. When the source is in front, turning one’s head to the left makes the time of arrival earlier and the intensity higher at the right ear than at the left ear. When the sound source is behind the listener, turning one’s head to the left makes the time of arrival earlier and the intensity higher at the left ear than at the right ear; mutatis mutandis, when we turn our head to the right. Hence, it is not the time and intensity differences themselves that provide the relevant information to disambiguate front-back confusions, but the direction in which they change. Now the situation will be discussed where the sound source is not exactly in front of or exactly behind the listener. In that case, one can turn one’s head in the direction of the sound source or away from it.
If one turns the head in the direction of the sound source, the time and intensity differences decrease when the source is in front of the listener, while they increase when the source is behind the listener.

In other words, it is, again, not the time and intensity differences themselves but the direction of their change that can disambiguate whether the sound source is located in front of or behind the listener. Of course, head rotations, and other head movements such as nodding and tilting, do not induce changes in ITDs and ILDs only. For sounds with wide-band high-frequency components, the peaks and notches in the spectra of the sounds at the entrance of the ear change concurrently, due to the direction-dependent filter properties of the pinnae. Here, too, the patterns of change in peaks and notches will be specific for the course of the azimuth and the elevation of the sound source. Moreover, not only front-back confusions can thus be disambiguated by head movements. Confusions in other dimensions, such as up-down confusions, may also be disambiguated in this way [208, 341, 342, 350, 351, 358]. In conclusion, head movements play an important role in the externalization of sound and in the disambiguation of front-back confusions.

The question as to what kind of information about the location of a sound source can be obtained from the patterns of change induced by head rotations and translations was investigated for a very simple system developed by Kneip and Baumann [169]. They used a system even simpler than the Woodworth model of Fig. 9.3: just two diametrically opposed, omnidirectional microphones, mounted about the usual interaural distance apart. This system could rotate about three perpendicular axes and translate in three perpendicular directions. The authors studied this system for application as a sound-localization system for a robot. There was no acoustic barrier between the two microphones, so that “interaural” intensity differences could not play a significant role. Kneip and Baumann [169] only studied the situation in which the sound source was relatively far away. They calculated how the location of the sound source was specified by the dynamically changing pattern of ITDs when the two microphones were rotated and translated, and derived equations for the azimuth and the elevation of a wide-band sound source based on the ITDs, the change in ITDs, and the rotational speed of the head. They showed that these three variables specify not only the azimuth but also the elevation of the sound source. Apparently, the patterns of change in ITDs are different for different elevations. McAnally and Martin [215] showed that rotations of less than 4° were enough to disambiguate front-back confusions, but that rotations of at least 16° were necessary to increase the accuracy of elevation judgments.

The natural question is now: Are these patterns of change involved in sound localization? Thurlow et al. [341] and Morikawa et al. [240] showed that listeners do indeed rotate their heads if allowed to do so. Moreover, the head movements occurred less for easy-to-localize white noise than for more difficult-to-localize sounds such as 500-Hz low-pass filtered noise or 12-kHz high-pass filtered noise. This indicates that the head movements were indeed made in order to localize the sound source more accurately. Moreover, Fisher and Freedman [91] showed that, if the listeners could move their heads, localization in the horizontal plane could be very accurate even when the filter function of the pinnae was made ineffective by conducting the stimuli into the ears through 10-cm metal tubes and covering the rest of the pinnae with sound-attenuating earmuffs.
This was confirmed by Perrett and Noble [268], who showed that front-back confusions virtually disappeared when the listeners were allowed to rotate their heads.

By using low-pass noise with a cut-off frequency of 2 kHz, they showed that changes in ITDs do play a role in performing these tasks. Remarkably, head rotations did not improve localization in the horizontal plane when high-pass noise was used. Apparently, dynamically changing ILDs do not play a role in disambiguating front-back confusions. Another finding was that, for the same low-pass noise, horizontal head rotations not only resolve front-back confusions, but can also improve the auditory estimation of elevation, especially for positive elevations [269]. This corresponds to the results by Kneip and Baumann [169] mentioned above, that the patterns of change coupled with head rotations are different for different elevations and, hence, can specify elevation.

All this shows that head rotations play an important role in auditory localization. Further evidence for this is provided by the following experiment [132]. One loudspeaker was placed 140 cm in front of a listener and another 60 cm behind the listener, so that the sounds coming from them were equally loud. The stimulus was pink noise with a bandwidth between 800 and 1200 Hz, so that the pinna could not play a role in sound localization. First, the sound was played over the loudspeaker at the back of the listener, who was asked to make head movements to ensure that the sound was unequivocally perceived as coming from behind. Then the listener was asked not to move his or her head again. The sound was switched from the speaker at the back to the speaker in front. It appeared that the listener did not notice this; the sound continued to be perceived as coming from behind. Even more strikingly, when the front speaker, from which the sound was still played, was moved to the left and to the right, the listener perceived this as a sound moving at the back from left to right. Next, the position of the front speaker was fixed again, and the listener was asked to indicate the location of the sound source. Still the loudspeaker at the back was indicated as the sound source. Finally, the listener was permitted to move his or her head again. Only then did the perceived location jump to the front, where the sound actually came from. One sees here that rotations of the sound source around the listener do not resolve the front-back confusion, but head rotations performed under control of the listener do.

In their experiments with the two coupled microphones mentioned above, Kneip and Baumann [169] not only showed that the patterns of change induced by head rotations specify the azimuth and the elevation of a sound source; they also showed that these patterns do not specify egocentric distance. This corroborates results of experiments by Throop and Simpson [66] and Simpson and Stanton [316], who showed that head rotations, indeed, do not contribute to auditory distance perception. Later in this chapter, in Sect. 9.7.5, this will be qualified for distances close to the listener.

When a listener moves around so that the egocentric distance changes, Speigle and Loomis [320] indicated two dynamic sources of information (SOIs) that specify egocentric distance but that are not present when the listener does not move. The first one is motion parallax. This is the distance-dependent rate of change of the azimuth as the listener moves around.
The motion parallax is zero when the listener walks straight in the direction of the sound source but, for other directions, the rate of change of the azimuth depends monotonically on the egocentric distance and, hence, specifies it.

The other SOI is a dynamic variable indicated with tau or τ. Tau specifies time-to-contact, i.e., the time it takes for an observer to come into contact with an object. The quantity τ was introduced into vision research as a perceptual invariant in 1976 by Lee [187]. In the visual domain, optical tau, τo, is given by τo = 2A(t)/A′(t), in which A(t) represents the area of the projection of an approaching object on the retina and A′(t) its time derivative, hence the speed with which this area expands over the retina. (A similar expression holds when a linear spatial measure is used instead, such as the distance between the projections of two points of the object on the retina, in which case the factor 2 is dropped.) In the auditory domain, acoustic tau, τa, can be given by τa = 2I(t)/I′(t), where I(t) is the intensity of the sound at the position of the listener and I′(t) its time derivative [312]. This shows that, by moving around, listeners may use motion parallax and acoustic tau in estimating the distance of a sound source. Speigle and Loomis [320] found, indeed, that this was the case, though the effect on the accuracy with which listeners could estimate the distance of a sound source was not large compared to when only static SOIs were present.

Their results corroborate those of various other perceptual studies in which the listeners could walk toward the sound source. For instance, Ashmead, Davis, and Northington [10] carried out experiments in the open field with little reverberation. They asked blindfolded listeners to walk to the location where they heard the sounds, 1.5-sec white-noise bursts in this study. In one condition, they asked listeners to walk to the sound source after the sound had stopped; in another condition, the sound was played while the listeners were walking towards the sound source. They found that localization was more accurate when listeners heard the sounds while walking. This improvement in performance in the presence of auditory information was only found for distances shorter than some 20 m. Ashmead, Davis, and Northington [10] attributed this improvement to the motion-related changes in intensity induced by the movements of the listener. They concluded that, for distances longer than about 20 m, motion-related changes in intensity were too weak to be picked up.

A systematic discussion of the advantages of head movements and of moving around is presented by Cooke et al. [70]. They focus on the strategies listeners may use to optimize the SOIs for locating the sound source and for increasing the signal-to-noise ratio. They mention two strategies that are particularly effective. First, by turning their heads towards the sound source, listeners increase the accuracy of the perceived azimuth, since azimuth perception is best in front of and worst to the side of the listener. Second, once the sound source is localized, it is beneficial to turn one ear to the sound source to use the head as a barrier against environmental noise, thus improving the signal-to-noise ratio by about 4 dB [43, 119]. This will be discussed further in Sect. 9.10.
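
A small numerical sketch of acoustic tau, assuming a listener approaching a steady source at constant speed so that the intensity at the listener grows with the inverse square of the distance; all values below are assumptions for the illustration.

    % Acoustic tau: time-to-contact estimated as 2*I(t)/I'(t)
    fs = 20;                          % update rate of the intensity estimate (Hz)
    v = 2; d0 = 20;                   % assumed approach speed (m/s) and initial distance (m)
    t = 0:1/fs:5;
    d = d0 - v*t;                     % distance between listener and source over time
    I = 1./d.^2;                      % relative intensity at the listener
    dIdt = gradient(I, 1/fs);         % numerical time derivative of the intensity
    tauA = 2*I./dIdt;                 % acoustic tau
    plot(t, tauA, t, d/v, '--')       % compare with the actual time-to-contact d/v
    xlabel('time (s)'); ylabel('time-to-contact (s)')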

9.3.2 Rotations of the Sound Source Around the Listener

The generation of sounds such as rubbing, rolling, and scraping is coupled with moving objects. Since it takes time for an object to be transferred from one position to another, it is evident that the possible locations where a sound is heard are severely constrained by where the same sound was heard previously. In the preceding Sect. 9.3.1, it was argued that a significant role in sound localization is played by the dynamic patterns of change in ITDs induced by movements of the listener. Reasoning in an analogous way, one may argue that movements of the object may also play such a role, e.g., in disambiguating front-back confusions. There is, however, an important difference between the situation in which the listener moves and the situation in which the sound source moves. Since head movements are sensed by the vestibular system, the auditory system is informed about the direction in which the changes in azimuth, elevation, and distance take place when listeners move their heads. If the head is stationary and the sound source moves, the direction in which the ITD and ILD change does not disambiguate front-back confusions [209]. For instance, based on ITD and ILD alone, a clockwise movement in the horizontal plane from an azimuth of 20° to 30° cannot be distinguished from a counter-clockwise movement in the horizontal plane from 160° to 150°. This is further exemplified by the system consisting of two coupled omnidirectional microphones of Kneip and Baumann [169] discussed above. In that system, the rotational speed of the microphones was an essential variable in calculating the azimuth and elevation of the sound source, in addition to the ITDs and the changes in ITDs. This information is absent when the listener is stationary and the sound source moves. By the way, listeners can derive the direction of the rotations of their heads not only from vestibular sources but also from proprioceptive and visual information. Macpherson [209] and Macpherson and Kim [210] showed, however, that, in human sound localization, only vestibular information plays a role in determining the direction of the head rotation.

But there is more to it. It appears that the patterns of change associated with the movements of a sound source can indeed be used when the listener is in control of the movements of the sound-producing object. Studying the role of head movements in disambiguating front-back confusions, Wightman and Kistler [373] did not only consider the situation in which the listeners could move their heads, thus changing the relative position of the listeners and the sound source; they also studied the situation in which the sound source was moved. This movement was either under the control of the listeners or under the control of the experimenters. When the experimenter controlled the movements of the virtual sound source, the listeners could not know in what direction the change would take place. If the listeners controlled the movements, which they did by pressing the cursor keys on a keyboard, they did know in which direction the change in position would take place. Wightman and Kistler [373] found that the number of front-back confusions decreased only if the movement of the sound source was under the control of the listeners, and this decrease was also found when the listeners did not move their heads.
which direction the change would take place, this decrease in front-back confusions did not occur. This confirms the finding by Han [132] discussed in the previous section that, once a sound is perceived at a specific location, the patterns of change induced by movements of the objects are usually not interpreted in such a way that the perceived location jumps abruptly from a location in front of the listener to a location behind the listener or vice versa. Wightman and Kistler [373] conducted their experiments with wide-band noise in anechoic conditions with virtual sound sources created with individualized HRTFs. In principle, these factors may be used to disambiguate front-back confusions. This information, however, appears not to have enough perceptual weight to elicit the percept of such a jump from front to back. Information arising from head rotations, on the other hand, does have this perceptual weight. It is concluded that only information arising from patterns of change that match expectations based on planned head rotations or other motor actions allows for such corrections of front-back confusions.

9.3.3 Movements of the Sound Source Towards and from the Listener

In the previous section, the role of head rotations in human sound localization was examined. An ecologically different situation arises when the sound source moves in the direction of the listener. In that case, the sound source and the listener may come into contact, possibly requiring a response by the listener. It appears that the increasing or decreasing sound intensity associated with the approaching or receding sound source plays a significant role in these situations, and that is the subject of this section.

Except perhaps in highly reverberant conditions, the level of the sound at the position of the listener will in general increase when the distance between listener and sound source gets shorter, and decrease when this distance gets longer, at least when the power of the sound source remains the same. This makes sound level a significant factor for perceived distance. Indeed, as early as 1909, Gamble [101] carried out an experiment in which distance and power of a sound source were varied independently. She concluded that listeners based their estimates of distance mainly on the level of the sound at the position of the listener. Her experiments took place in a normal room and she simply asked subjects “not to look” (p. 419). In the 1970s, her results were confirmed by Laws [185] in more controlled conditions, viz., a dark, anechoic room with a loudspeaker array in front of listeners whose heads were fixed. He found that wide-band noise produced with less power was perceived as coming from farther away than noise with more power and vice versa, independently of the actual distance of the sound source. So, sounds of higher intensity are generally perceived as produced closer to the listener than lower-intensity sounds. The size of this effect strongly depends on the kind of stimulus and the reverberation in the room, however. For sounds for which other sources of distance information are available, such as clicks, the effect is much
less strong than for narrow-band sounds such as tones. Indeed, Von Békésy [354] writes about such an experiment carried out in a “moderately damped room”: “A change in the loudness of the click has little effect upon its apparent distance. For the tone, however, despite the fact that an actual spatial localization is absent, an increase in loudness produces a clear reduction in the distance of the diffuse image. Thus it seems that loudness has an effect upon the perceived distance only in the absence of other more determinate physical cues” (p. 21) [translation by Wever in Von Békésy [351, p. 300]]. Apparently, when a sound arrives at our ears, there are more sources of information available as to its origin when it is a click than when it is a pure tone.

The effect of level on auditory localization is more complex for sounds changing in intensity. The importance of increases and decreases in intensity in the perception of approaching and receding sounds is demonstrated by an interesting quadristable illusion described by Bainbridge et al. [13]. Quadristable means that, although the stimulus does not change, it is perceived in four different ways. Bainbridge et al. [13] presented 392-Hz tones with triangular waveforms over two loudspeakers at equal distances from the listeners, one in front of them, the other behind them. The authors simulated the situation in which a sound source is moved from in front of the listeners to behind them and the other way round. A movement of a sound source starting in front of the listener was simulated by linearly decreasing the intensity level of the sound of the front speaker from a high to a low value, while simultaneously linearly increasing the intensity level of the sound from the back speaker over the same range from the low value to the high value. As a consequence, the sound intensity level coming from the front speaker decreased linearly while, simultaneously, the sound intensity level coming from the back speaker increased. One would expect that in this case the sound is perceived as moving from the front speaker to the back speaker. This is indeed what listeners frequently heard but, remarkably, due to front-back confusions, the sounds were also perceived as moving in the reverse direction. Even more remarkable is that the sound source could also be perceived as starting in front of the listener, moving towards the listener, and midway returning to the front speaker. Likewise, the sound source could be perceived as starting behind the listener, moving towards the listener, and midway reversing direction back to the speaker behind the listener. Moreover, the same results were obtained for the situation in which the intensities of the two speakers were adjusted in such a way that they simulated a movement of the sound source from the back to the front of the listeners. In conclusion, it did not matter whether the intensity of the stimulus started high at the front speaker or high at the back speaker. In all conditions, four different movements were perceived in a quadristable way: a movement from the front speaker to the back speaker; a movement from the back speaker to the front speaker; a movement from the front speaker towards the listeners and then bouncing back; or a movement from the back speaker approaching the listeners, and then bouncing back.
These four percepts have in common that the sound first approaches the listeners and then departs again but, due to front-back confusions, both the approaching and the departing stimulus can be perceived either in front of the listeners or behind them. But this is not where the story ends. Even more remarkably, the same four kinds of movements were perceived in a
condition in which the percepts of approaching and departing sounds were simulated by presenting the sound with equal intensities at both speakers. Indeed, Bainbridge et al. [13] started the stimuli at a low intensity level at both speakers, linearly increased it, and then decreased it again to its original low value. In this situation the listeners had the same quadristable percept and, in fact, could not hear the difference between the various conditions. This shows the important role played by decreases and increases in intensity in auditory distance perception; a minimal synthesis sketch of such level trajectories is given at the end of this section.

Above, in Sect. 9.3.1, the experiments by Ashmead et al. [10] were discussed, in which listeners were asked to walk to the location of a 1.5-s noise burst in one of two conditions. In the first condition, the blindfolded listeners heard the sound and were asked to walk to the location of the sound source after the burst had stopped; in the second condition, the listeners were already walking while the noise was played. In the second condition, performance was better than in the first. Since the experiments took place in the open air with little or no reverberation, Ashmead et al. [10] argued that the improvement in performance was based on the listeners’ ability to pick up information based on motion-related changes in intensity. Acoustic tau is then a likely candidate for this information, since it specifies time-to-contact. The concept of time-to-contact or tau was introduced above, also in Sect. 9.3.1. There, it was explained that time-to-contact can be derived both from optical information, in this case from optical tau τo = 2A(t)/A′(t), in which A(t) is the area of the projection on the retina of an approaching object, and from acoustic information, in this case from acoustic tau τa = 2I(t)/I′(t), in which I(t) is the intensity of the approaching sound.

The relation between the two was studied by Schiff and Oldak [303]. They used recorded sounds and films of passing vehicles recorded along roadsides and of speaking persons approaching the film camera. These stimuli were presented to participants seated before a large screen on which the film was projected. The sounds were played through a loudspeaker just under the screen. The stimuli stopped just before the sound source would actually pass the participant, the “occlusion time”. The listeners were asked to indicate at what moment they expected the distance between them and the stimulus to be minimum. The authors showed that observers can indeed estimate time-to-contact based on sound recordings alone though, except for blind observers, less accurately than based on the films. They found that observers generally underestimated the time-to-contact. The function of this may be to keep listeners on the safe side of coming into contact. Moreover, the underestimations were larger for women than for men. This suggests that women, on average, take fewer risks than men, which will be confirmed in the forthcoming Sect. 9.9.3. A remarkable aspect of these experiments by Schiff and Oldak [303] was that the participants based their judgments on films projected on a screen and sounds played from a loudspeaker underneath the screen. So, physically, the stimuli were stationary, and the motion was purely illusory. Apparently, observers are able to imagine themselves at the locations where the recordings were made and judge time-to-contact as if they were standing along the road, in spite of the contradictory distance information.
From these results, it can be concluded that observers do use sound in estimating time-to-contact. But the specific role of acoustic tau τa in this respect is questioned by Guski [124], who argues that the auditory system functions much more as a warning
system that draws attention to possibly important environmental events. The decision to initiate motor behaviour to avoid or approach such events can then be based on the much more accurate information obtained from optical sources such as optical tau τo.
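As announced above, here is a minimal Matlab sketch of the kind of level cross-fade used to simulate an approaching and receding sound source, in the spirit of the stimuli of Bainbridge et al. [13]. The 392-Hz triangular waveform follows the description above; the duration, sampling rate, and 30-dB level range are hypothetical example values, not the published stimulus parameters.

% Level cross-fade between a front and a back loudspeaker channel: a 392-Hz
% triangle wave whose level falls linearly in the front channel while it
% rises in the back channel. Duration, sampling rate, and level range are
% hypothetical example values.
fs  = 44100; dur = 2; f0 = 392;
t   = (0:1/fs:dur-1/fs)';
x   = 2*abs(2*(f0*t - floor(f0*t + 0.5))) - 1;   % triangular waveform
L_front = linspace(  0, -30, numel(t))';         % front level in dB: high to low
L_back  = linspace(-30,   0, numel(t))';         % back level in dB: low to high
front   = 10.^(L_front/20) .* x;                 % signal for the front loudspeaker
back    = 10.^(L_back/20)  .* x;                 % signal for the loudspeaker behind the listener
soundsc([front back], fs);                       % play the two channels

Swapping L_front and L_back gives the condition starting high at the back speaker, and replacing both by the same rising-then-falling level trajectory gives the equal-intensity condition that produced the same quadristable percept.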

9.3.4 Doppler Effect

The Doppler effect is the shift in frequency that occurs when a sound source moves towards a listener, in which case the shift is positive, or away from the listener, in which case the shift is negative. The Doppler effect is illustrated in Fig. 9.13 for a sound source producing a sound with a pitch frequency of 440 Hz, moving with a speed of 100 m/s, and passing the listener at a distance of 2 m. This situation is sketched in the upper panel of Fig. 9.13. The five circles on the left are the wave fronts of the sound at isochronous intervals of one pitch period. As one can see, the periods are shorter in front of the moving sound source than behind it, which demonstrates the Doppler shift. In the middle panel, the course of the intensity at the position of the listener is presented; in the lower panel, the thick line is the frequency corresponding to the periodicity of the stimulus at the position of the listener. This is a monotonically decreasing function with a positive Doppler shift before the sound source passes the listener, and a negative shift after it has passed the listener. The equation for the Doppler shift is:

f_obs = f_src · v_snd / (v_snd − v_src cos θ)   (9.2)

in which f_obs is the frequency of the sound at the position of the observer, θ the angle between the path of the sound source and the line connecting the sound source and the observer, f_src the frequency of the sound produced by the sound source, v_snd the speed of sound, and v_src the speed of the sound source. When listeners are asked to describe the course of the pitch they hear, they describe a pitch that first remains more or less constant or slowly increases, then, just before the sound source passes, rapidly increases, and after passage decreases, first very rapidly, then more slowly. The perceived increase in pitch as the sound source approaches the listener is an illusion coupled with the increase in intensity at the position of the listener [216, 251]. This is depicted in Fig. 9.13 by the dotted line. This is called the Doppler illusion.

The question is now whether Doppler shifts play a role in human sound localization. That this is quite possible was shown by Jenison [159]. He derived equations that describe how, in the horizontal plane, the motion of a moving sound source can be calculated from the patterns of change not only in ITD and intensity, but also in Doppler shift. For each ITD, these equations have two solutions corresponding to the two locations in the horizontal plane on the cone of confusion for that ITD. This corroborates the regular occurrence of front-back confusions, which, as shown in Sect. 9.3.1, can easily be resolved by head rotation.

Fig. 9.13 The Doppler illusion. A sound passes a listener with a speed of 100 m/s or 360 km/h. The smallest distance between sound source and listener is 2 m. This situation is sketched in the top panel, in which the small circle at t = 0 shows the position of the listener and the thick horizontal line shows the path of the sound source. The middle panel gives the level of the stimulus at the position of the listener. In the bottom panel, the thick line gives the fundamental frequency of a 440-Hz sound at the position of the listener. The dotted line represents the course of the perceived pitch of the sound as described by the listener. The phenomenon that listeners report hearing the pitch of the sound go up just before the sound source passes is called the Doppler illusion. The sound played in this example consists of the sum of the first ten harmonics of a 440-Hz saw tooth. Adapted from Neuhoff and McBeath [251]. (Matlab) (demo)

That Doppler shifts—and the other two sources of information (SOIs), ITD and intensity—do, indeed, play a role in human sound localization was shown by Rosenblum et al. [292]. In an experiment with virtual sound sources simulating the passing of an ambulance at 13.4 m/s, they asked listeners to indicate the moment of closest approach. The sounds were presented over headphones in a sound-attenuated booth. In the sound of the ambulance siren, the three SOIs, ITD, intensity, and Doppler effect, were varied independently of each other. The results showed that each of the three experimental variables plays a role: The perceptual weight of the changing intensities is higher than that of the ITDs, which, in turn, is higher than that of the Doppler effect. These results were confirmed and complemented by Lutfi and Wang [204] who, additionally, studied the effect of the velocity of the passing sound source. They studied discrimination of displacement, velocity, and acceleration and found that ITDs influence the discrimination of displacement most for moderate velocities of 10 m/s. Discrimination of velocity and acceleration, however, was mostly based on the Doppler effect. For high velocities of 50 m/s, all three discrimination tasks were mostly based on the Doppler effect. In these experiments, only the ILDs, the ITDs, and the Doppler shifts were used in synthesizing the virtual sounds. The filter responses of the pinnae were ignored.

The experiments by Rosenblum et al. [292] and Lutfi and Wang [204] also show that Doppler shifts play a role in the localization of sound sources moving relative to the listener. Rosenblum et al. [292] simulated sources moving with a speed of 13.4 m/s or 48.28 km/h; the lowest speed studied by Lutfi and Wang [204] was 10 m/s or 36 km/h. These velocities are quite fast when compared with a normal walking speed of about 5 km/h. The Doppler shift is very small at such velocities; it is at most about 1.5% at a speed of 5 m/s. At 10 m/s, it is 3%, which is enough to play a role, as shown by Lutfi and Wang [204], who also showed that the role of the Doppler shift in motion detection is the major factor at high speeds of 50 m/s or 180 km/h.

The Doppler illusion suggests a close interaction between pitch and loudness, especially when the sound sources are moving rapidly. Scharine and McBeath [302] investigated the relation between intensity and fundamental frequency in actual speech and music performances. For both speech utterances and performed musical melodies, they found significant positive correlations between fundamental frequency and intensity. Moreover, thresholds were found to be lower for sounds in which F0 and intensity both increased or both decreased than for sounds in which F0 and intensity changed in opposite directions. This shows that listeners are more sensitive to combined changes in F0 and intensity when F0 and intensity change congruently, i.e., in accordance with this positive correlation. Scharine and McBeath [302] conclude: “the auditory perceptual system has internalized a bias to favor hearing changes in f0 and intensity that co-occur in a manner that matches the natural statistical pattern” (p. 224). This interaction between the course of pitch and that of loudness will be further discussed in Sect. 10.12.4.4.
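A minimal Matlab sketch of Eq. 9.2 for the situation of Fig. 9.13 (a 440-Hz source passing at 100 m/s with a closest distance of 2 m) is given below. The speed of sound of 343 m/s is an assumed round value; the last two lines also check the maximal relative shifts mentioned above for source speeds of 5 and 10 m/s.

% Instantaneous frequency at the listener according to Eq. 9.2 for the
% situation of Fig. 9.13: f_src = 440 Hz, v_src = 100 m/s, closest distance
% 2 m. The speed of sound of 343 m/s is an assumed round value.
f_src = 440;  v_src = 100;  v_snd = 343;  d_min = 2;
t     = (-2:0.001:2)';            % time in s, with t = 0 at the moment of closest approach
x     = v_src*t;                  % position of the source along its straight path in m
theta = atan2(d_min, -x);         % angle between the source's path and the source-listener line
f_obs = f_src * v_snd ./ (v_snd - v_src*cos(theta));
plot(t, f_obs); xlabel('Time (s)'); ylabel('Frequency at the listener (Hz)');
% Maximal relative shift for a source heading straight at the listener (theta = 0):
v = [5 10];  max_shift = 100 * v ./ (v_snd - v);   % about 1.5% at 5 m/s and 3% at 10 m/s
fprintf('Maximal Doppler shift at %2d m/s: %.1f%%\n', [v; max_shift]);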


9.3.5 Ratio of Low-Frequency and High-Frequency Energy in the Sound Signal

Air operates as a low-pass filter [152, 355]. Consequently, the longer a sound travels, the less its high-frequency content, and hence the lower the proportion of high-frequency relative to low-frequency energy. The decrease in high-frequency content is quite slow, however. For 50% relative humidity and at 20 ◦C, Coleman [68] gives a decrease of 0.2 dB per 100 feet at 1000 Hz, of 0.3 dB at 2000 Hz, of 1.0 dB at 4000 Hz, of 3.3 dB at 8000 Hz, and of 4.7 dB at 10000 Hz. Hence, this measure can only serve as a distance measure for sound that has travelled over a considerable distance. In general, the ratio of low- and high-frequency energy is assumed to start playing a role in distance perception beyond 15 m [28]. Indeed, Coleman [68] showed that click-like sounds low-pass filtered at 7680 Hz were perceived as coming from farther away than such sounds low-pass filtered at 10500 Hz. This was confirmed for a larger set of stimuli by Little, Mershon, and Cox [199]. These authors additionally found that the ratio of low- and high-frequency energy can be used as a relative measure for distance. Hence, when two similar sounds differ somewhat in the ratio of low- and high-frequency energy, the sound with the lower proportion of high-frequency energy will be perceived as coming from farther away. It is concluded that the ratio of low- and high-frequency energy will especially be a source of distance information in the free field, where sound can travel uninterruptedly over long distances, and reflections and reverberation are in general insignificant.
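To give an impression of the magnitudes involved, the following minimal Matlab sketch converts the per-100-feet values from Coleman [68] quoted above into the extra attenuation, on top of the spreading loss, accumulated over a given travel distance; the travel distance of 200 m is a hypothetical example value.

% Extra attenuation by air absorption, on top of the spreading loss, using
% the per-100-feet values quoted above (50% relative humidity, 20 degrees C).
f_Hz         = [1000 2000 4000 8000 10000];   % frequency in Hz
att_per100ft = [0.2  0.3  1.0  3.3  4.7];     % absorption in dB per 100 feet
dist_m       = 200;                           % travel distance in m (hypothetical example)
dist_ft      = dist_m / 0.3048;               % the same distance in feet
att_dB       = att_per100ft * dist_ft / 100;  % accumulated absorption in dB
disp([f_Hz' att_dB']);                        % about 1.3 dB at 1 kHz but 21.7 dB at 8 kHz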

9.3.6 Information About the Room

In Sect. 9.1.4, experiments carried out by Plenge [274, 275] were discussed in which he showed that it is possible to play recorded sounds through headphones in such a way that they are heard as if they are being produced at that time by real acoustic events in the listener’s environment. He did so by recording the sounds with a dummy head, i.e., an artificial head with naturally shaped outer ears, ear canals, and microphones at the position where human listeners have eardrums. By playing these recordings over headphones, Plenge [275] could ensure that, when the sound arrives at the entrance of the ear canal, it is much the same as when it is produced on site, including the reverberation of the recording room. The result was that the listeners externalized the sounds. These experiments were done not only with recorded speech but also with recordings of orchestral music. The music was recorded in a specially built anechoic room, played in a concert hall over loudspeakers, and then recorded with a dummy head. The recordings were then played to the listeners over headphones in a darkened room. The room was darkened in order to ensure that the listeners had no idea about the acoustic properties of the environment. The results showed that the listeners indeed externalized the sounds and heard them as if they were at the recording site, so in the recording room for the speech and in the concert hall for the music. This externalization did not happen
immediately, however, but took a few seconds: “When listening to a medial sound source via a dummy-head transmission, first the sound image is located intracranially and, after a short time of exposure, the sound image is located extracranially” [274, p. 960]. Plenge [274] concluded that it takes some time for listeners to build up a correct image of the room they are in and that only then can the information in the recorded sound be used to localize the sound source correctly. If listeners do not have a correct image of the room they are in, the sound remains internalized. Moreover, Calcagno et al. [55] showed that, when listeners could inspect the test room before the start of an experiment, their auditory distance estimates were much more accurate, even when the experiment was performed in the dark.

9.3.7 Information About the Location of Possible Sound Sources

It appears that not only information about the room in which the listener is located has an effect on sound localization; information about the possible locations of sound sources can also be a significant factor. One phenomenon that shows this is the proximity image effect [105]. This effect occurs in a situation where a sound is played from one of a linear array of loudspeakers in front of the listener, of which the listener can see only the closest one. Hence, the first loudspeaker of the array visually covers the other loudspeakers. In this situation, the listener always hears the sound as coming from the first, so the closest, loudspeaker, also when the sound is actually played from a loudspeaker behind this closest loudspeaker. The experiments by Gardner [105] were carried out in an anechoic room, but they were confirmed for a more reverberant room by Mershon et al. [226] and Anderson and Zahorik [7]. Mershon et al. [226] also tested the situation in which the actual sound source, hidden by cloth, was positioned in front of the visible dummy loudspeaker. In that case, too, listeners located the sound source at the location of the visible loudspeaker. Hence, it is not so much the proximity of a sound source that determines the perceived distance, but its visibility. Mershon et al. [226] thus showed that the “proximity” image effect is not so much an example of proximity but rather of visual capture, as in the ventriloquist effect discussed in the next section. Furthermore, they showed that the effect has its limits; the discrepancy between the visual stimulus and the auditory stimulus should not be too large.

The visibility of possible sound sources not only affects the estimation of distance but also that of elevation. Perrett and Noble [267] placed seven loudspeakers at 15◦ intervals between 0◦ and 90◦ elevation, all with an azimuth of −90◦ or 90◦. They showed that listeners could correctly identify the elevation of low-frequency noise when the response choices were limited to these seven loudspeakers. When some additional loudspeakers were placed in the horizontal plane at different azimuths, listeners got confused, resulting in responses biased towards the horizontal plane.
Apparently, the interaural time differences in this condition only play a role if the response choices are constrained to positions where they are distinctive.

9.3.8 Visual Information

It has already been shown that information about the acoustic room properties and about the location of sound sources can affect human sound localization. Under usual circumstances, this information is obtained visually by perceivers when they enter a room and look around to inspect what is in it. But visual information can also influence sound localization in other ways. A famous example is the ventriloquist effect. When one sees something moving and the movements are in synchrony with variations of a sound, the sound is very often perceived as produced by the moving object, even when the sound source does not have the same location as the observed object. The reader will be familiar with this effect from ventriloquists, who create the illusion that their puppets are speaking and not they themselves [155]. In that case, the audience localizes the sound source at the mouth of the puppet, which moves in synchrony with the speech sounds actually produced, almost invisibly, by the ventriloquist. That is why it is called the ventriloquist effect.

One of the earliest experiments on the ventriloquist effect was carried out in 1952 by Witkin et al. [377]. They had listeners look through a mirror at someone reading Aesop’s fables aloud. The mirror and the speaker were positioned in such a way that the image of the speaker was exactly in front of the participants. The listeners could not hear the speaker directly, but the speech was presented to both ears over tubes that were variable in length. In this way the experimenters could vary the difference in time of arrival, the ITD, of the speech at both ears. An experiment started with speech arriving simultaneously at both ears, so that the speech was perceived in the middle. Then the ITD was slowly increased and the participants were asked to indicate when the perceived location of the speech was no longer in the middle at the position of the speaker. Then the experiment was repeated while the listeners were asked to close their eyes so that they could not see the speaker. Again, the ITD was slowly increased and listeners were asked to indicate when they heard the sound as no longer coming from the middle. It appeared that in the latter situation the ITD at which the listeners indicated that the speech no longer came from the middle was much smaller than in the first situation. In other words, while seeing the speaker just in front of them, the ITD had to be larger before the listeners perceived the speech and the speaker as positioned at different locations than when they did not see the speaker. This showed that, to a certain extent, the visual image of the speaker captured the sound of the speech, another example of visual capture. It has been shown that the ventriloquist effect can occur up to differences in azimuth of 30◦ [154].

The ventriloquist effect has two components. The one described above is the spatial ventriloquist effect, which has to do with sound localization. It is “the phenomenon that a visual stimulus such as a flash can attract the perceived location of
a spatially discordant but temporally synchronous sound” [62, p. 790]. This is an example of visual dominance. Besides the spatial ventriloquist effect, there is the temporal ventriloquist effect, “where temporal aspects of a visual event, such as its onset, frequency, or duration, can be biased by a slightly asynchronous sound” [62, p. 790], which is an example of auditory dominance.

Another example of auditory dominance is presented by Burr, Banks, and Morrone [49]. They presented listeners with audiovisual stimuli consisting of a combination of a short tone pip and a short light flash in the form of a small circular disk. These stimuli were perceived as one single event, although the authors inserted a very short time interval between the sound and the light. The authors presented a sequence of two of these stimuli with an interval of 800 ms. Moreover, in between these two bimodal stimuli, the authors presented another short test stimulus, which could be auditory, visual, or audiovisual. The timing of this test stimulus was varied. They asked listeners to judge whether this test stimulus was closer in time to the first or to the second stimulus. It appeared that participants perceived the test stimulus as equally close to the first and the second stimulus when it was equally close to the auditory parts of the audiovisual stimuli. This shows that the timing of the auditory part of the audiovisual stimuli determines their perceptual moment of occurrence. Such results were confirmed in another set of experiments by Ortega et al. [260], who conclude: “auditory processing dominates over visual processing in the perception of simple durations up to about a second even when the visual component is clearly more salient, the visual component is selectively attended, and visual temporal discriminability is no worse than auditory temporal discriminability” (p. 1500).

Another situation in which visual information has been shown to play a role in sound-source localization is the situation in which front-back confusions make it difficult to identify the direction of a sound moving in circular motion around the listener in the horizontal plane. Indeed, in the horizontal plane, Lakatos [181] constructed a circular array of loudspeakers around the listener and half such an array of light sources in the frontal field of the observer. With these arrays, Lakatos [181] could generate auditory, visual, or combined circular trajectories around the listener. These motions were perceived as continuous. Observers were asked to indicate whether the stimulus moved clockwise or counterclockwise or, alternatively, whether it moved to the right or to the left. When only sounds were played, front-back confusions occurred frequently, especially for trajectories played in the frontal field. The remarkable thing is that these confusions can be resolved not only by presenting a visual stimulus on the same trajectory as that of the auditory stimulus, but also by presenting only one stationary light on that trajectory.

9.3.9 Background Noise

Very little research has been carried out on the influence of background noise on auditory distance perception, in spite of the fact that, outside the laboratory, noise is almost always present. The only significant study in this area up to now has been
carried out by Mershon et al. [225]. They had listeners estimate the distance of a sound source at two noise levels differing by 20 dB. They found that, in the presence of the higher noise level, distances were estimated as shorter than in the presence of the lower noise level. They could attribute this to the stronger masking effect of the noise on the reverberation in the room. After all, since reverberation is generally spread out over a much longer time interval than the direct sound, it is lower in intensity, so that it is more masked than the direct sound. Consequently, the effective direct-to-reverberant ratio (DRR) increases. Apparently, the auditory system does not compensate for this, resulting in the perception of a shorter distance. A similar effect was found under less well controlled outdoor conditions by Fluitt et al. [93]. In their experiments, the background noise mostly consisted of calls by insects such as crickets, cicadas, and grasshoppers.

The influence of the presence of other sounds on the accuracy of the perception of azimuth and elevation has received more attention [117, 183, 201]. The results are quite diverse, however, since many variables can be varied, such as the locations of the maskers and the targets, their levels, their frequency contents, and their temporal properties. A general conclusion that may be drawn is that localization based on ITDs and ILDs is less affected by the presence of other sounds than localization based on spectral information resulting from the filtering by the pinnae [117]. The result of an experiment by Lorenzi et al. [201] was consistent with this conclusion. They found that maskers presented at the side of the listeners have a larger negative effect on localization accuracy than maskers presented in front of the listeners [201]. They conclude that, when both low- and high-frequency information is available, listeners base their decision on the information that best specifies the direction of the sound source. Hence, estimation of azimuth is based on low-frequency information (ITDs), while estimation of elevation is based on high-frequency information (HRTFs).

Another situation in which localization plays an important role is the situation in which a listener wants to pay attention to just one sound source in the presence of many other sound sources and reverberation. A typical example is that of a cocktail party at which a music band is playing. It is well known that, in such situations, listeners are quite well able to understand just one speaker. The perceptual question of how this is done is called the cocktail-party problem [63]. The phenomenon itself is called the cocktail-party phenomenon or the cocktail-party effect. In the cocktail-party effect, sound localization and auditory scene analysis can no longer be discussed as separate issues. Much research has been carried out on this phenomenon, but many problems remain. Reviews are presented by Bronkhorst [40, 41], Haykin and Chen [136], and McDermott [218]. It will be discussed in more detail in Sect. 10.14.


9.3.10 Atmospheric Conditions: Temperature, Humidity, and the Wind

In Sect. 9.3.5, it was shown that air operates as a low-pass filter on sound [152], a property that becomes perceptually relevant for distances longer than about 15 to 20 m [28]. Some other atmospheric conditions may be relevant at such distances: temperature, vertical temperature gradients, humidity, wind, and wind gustiness. Ingård [152] states that temperature, humidity, and wind are relatively insignificant factors compared to the major factors of temperature gradient and gustiness of the wind. It is well known that sound does not carry far on hot days when the temperature at the ground is very high. Due to the temperature gradient, horizontally propagating sound is then bent upwards. On the other hand, at night, when the ground is relatively cold, the situation is reversed, and sound can carry very far. Some exploratory perceptual experiments have been reported by Fluitt et al. [94]. They found that the higher the humidity and the lower the temperature, the shorter the estimated distances. Individual differences were large, however, and the conditions could not be controlled independently. Whether the effect of these atmospheric variables is due to their influence on the low-pass characteristics of the air or to some other effects remains to be investigated.

9.3.11 Familiarity with the Sound Source

For many sounds, listeners have an idea of where they can and where they cannot be produced. For instance, a large proportion of sound-producing events involves animate or inanimate objects positioned on the ground. Certainly when the ground consists of floors or streets, this means that the sounds produced by such objects will usually originate in a more or less horizontal plane one or two meters below the listeners’ ears. Examples are the sounds of footsteps, bicycles, cars, etc. Other sounds, such as singing birds, airplanes, or the wind in the trees, will more often come from above the listener. The knowledge derived from these experiences constrains the plausible sound-source locations, and can bias the perceived locations as indicated by the listeners.

Other characteristics of a sound also give experience-based information as to the location of a sound source. For instance, some sounds are inherently very intense. Philbeck and Mershon [272] give the example of the siren of a fire engine, which will be perceived as coming from far away when the intensity at the position of the listener is low. In many other cases, e.g., for live speech or music, listeners can hear quite accurately with how much effort it is produced. Based on this experience and the intensity with which the sound reaches their ears, they can quite accurately judge the distance of the sound source, also in anechoic space [48, 66, 103]. This, however, only applies to situations in which the speech or the music is produced live.
When speech or music is played over loudspeakers, the production effort no longer determines the level of the sound at the position of the listener. In that situation, it appears that the perceived distance of the sound is no longer accurate [103, 104]. This was systematically investigated with virtual speech sounds by Brungart and Scott [48]. They independently varied the presentation level and the production level of speech, and showed that both influence perceived distance, but the effects were quite complex. For instance, the effects were different for whispered speech, voiced speech produced at a low level, i.e., a level lower than 66 dB at 1 m from the speaker, and voiced speech produced at a high level, i.e., a level higher than 66 dB at 1 m from the speaker. The interested reader is referred to Brungart and Scott [48].

9.4 Multiple Sources of Information

In the previous section, more than a dozen acoustic sources of information (SOIs) have been discussed that are involved in the auditory localization of a sound source. Some SOIs are relevant for the estimation of the azimuth of a sound source, some for the estimation of its elevation, some for the estimation of its distance, and some for the estimation of its velocity. In spite of these many SOIs, some sounds, such as stationary pure tones, are generally difficult to localize. Apparently, none of the available SOIs is sufficient for the correct localization of such sounds. This happens, e.g., in a reverberant room in which a continuous pure tone is played. Listeners then report that the sound comes from all sides, or that the perceived location of the sound source is diffuse [273].

Before discussing the accuracy of the auditory system, it is noted that some sounds, e.g., steady high-frequency pure tones, are much more difficult to localize than other sounds, e.g., clicks, short wide-band noise bursts [325], or detonations [354]. So, the accuracy with which the location of a sound is perceived can be very different from sound to sound. Moreover, the way in which listeners experience a perceived location of a sound source can be very different for different listeners. An extreme example is described by Féron et al. [89]. They had an experimental set-up in which a sound source rotated around a listener. They wanted to know up to what rotational frequency this was perceived as a sound source spinning around the head. They found that, above a frequency of about 2 to 3 Hz, the listeners no longer heard a rotating sound source. Instead, they reported: “[...] the common impression was that of an alternation between the left and the right sides (or along the diagonal for one participant). ‘Shaking’ and ‘surrounding’ sensations were also often reported in this case. Moreover, it was noticed that the alternation did not change even when the listener moved his/her head (despite being instructed not to). In case of very high velocities, the sound appeared to most participants as a superposition of pulses at different spatial points. For some participants the sound was fractioned giving rise to rhythmic structures and roughness when speed increases. For others, the sound apparently came from everywhere at the same time, resulting in a ubiquitous sensation” (p. 3711).

In other situations, it can happen that listeners do systematically attribute a location to a sound, but that the perceived location has no relation with the actual location of
the sound source. The presence of phantom sources and virtual sources has already been mentioned. Here are two other non-trivial examples. The first example has to do with distance perception. Mershon and King [224, p. 414] found that, in the absence of reliable distance information such as reverberation, bursts of noise were often perceived as coming from a distance of about 2 m. This will be referred to as the default perceived distance. A second example is provided by Pratt’s effect, a phenomenon named after the author who first described it. Listeners often attribute an elevation to a narrow-band signal such as a pure tone, but their judgments do not depend on the actual elevation of the tones, but on their frequencies [281]. Pratt [281] found that narrow-band sounds with a higher centre frequency were by default perceived as coming from sources with a higher elevation than narrow-band sounds with a lower centre frequency, independently of their actual elevation. Apparently, in the absence of reliable information, a default elevation is attributed to narrow-band sounds depending on the centre frequencies of these sounds. This result was later confirmed by, e.g., Trimble [347] and Roffler and Butler [289]. This elevation that is perceived independently of the actual position of the sound source will be referred to as the default perceived elevation. Pratt’s effect will be discussed in more detail in Sect. 9.8.2. Something similar has been found in the horizontal plane by Elendiuk and Butler [18] and Butler and Flannery [52]. This azimuth will be referred to as the default perceived azimuth.

9.5 Externalization or Internalization

It has already been mentioned a few times that listeners often hear a sound source inside their heads, a phenomenon that is called internalization. This problem manifests itself in particular when listeners listen to sound over headphones. Most sounds recorded with two free microphones are perceived as coming from within the head when played over headphones. The auditory system places the origin of these sounds inside the head somewhere between the ears. Apparently, the information the auditory system uses in estimating the distance of a sound source is not available. In the absence of this information, the perceived distance is set to zero or almost zero as a kind of default perceived distance. Certainly when the sound is a mono recording, internalization is often experienced as uncomfortable. When listening to stereo recordings over headphones, the various sound sources are generally not perceived at the same location but at positions between the two ears, spread over the interaural axis. This prevents some of the annoyance coupled with continuously hearing a recording at the same spot exactly between the ears but, generally, listeners prefer to hear sound outside their heads.

The converse of internalization is called externalization, which means that a sound is perceived as coming from a location outside the listener’s head. Until the early 1970s, it was a persistent problem how sound could be played over headphones in such a way that it was externalized. Failures to realize this have been attributed to a variety of causes, e.g., a difference between the acoustic impedance for headphone presentation and for free-field presentation,
or to failures in the transmission from the recording to the reproduction of sound [304]. Sone et al. [318] suggested that internalization during earphone listening was a consequence of the improper balance between air conduction and bone conduction due to the contact between earphone and head. In hindsight, all these failures to externalize sound during headphone listening can be attributed to the lack of technical know-how about how to reproduce sound over headphones in such a way that it is identical to the sound at the entrance to the ear canal when played by a real external sound source. As explained in Sect. 9.2, these problems have now been overcome.

One of the most direct ways to realize that a recorded sound is externalized when played over headphones is to make a recording with a dummy head with realistically shaped ears and microphones at the position of the eardrums. In that way, two of the factors that play a role in the externalization of sound mentioned in the list of SOIs are taken into account: filtering by the pinnae, and the presence of reverberation. Indeed, sound recorded in this way is often externalized. Externalization can also be realized by recording sound with a pair of in-ear microphones. The effect can be amazing. If one listens to such a recording made with the microphones in one’s own ears, it seems as if one is back in the room where the recording was made and the sound is actually being produced while one is listening. In this way, it is simple to reproduce the atmosphere of complex environments such as noisy restaurants, coffeehouses, or busy market places. For sounds produced beside the listener, above the listener, or at the back of the listener, the recordings can also be astonishingly realistic, certainly if recordings are made of snipping scissors, wasps, or whispering voices. However, certainly if the recording is made by someone else, one will often hear that the reproduction is not perfect, especially for sounds supposed to come from directions in front of the listener. Apparently, the auditory system is quite sensitive to small discrepancies in spectral content for sounds produced just in front. They will often be perceived as coming from a higher elevation than that at which they were recorded or, again, from within the head. Another remarkable observation is that, when one walks or turns around during an in-ear recording, one will not have the impression that the sounds being recorded are moving or turning around. Instead, one will experience the sound sources as stationary. When one listens to that recording afterwards, however, one will not perceive the sound sources as stationary, as during the recording, but as moving or rotating [390]. This does not alter the fact that, once one has listened to such in-ear recordings, one realizes how imperfect the usual “hifi” stereo recordings are when played over two loudspeakers. The reader is strongly advised to play with this. One only needs a sound recorder and a pair of in-ear microphones.

One sees that recordings made with a dummy head with realistically shaped outer ears, or with a real head with in-ear microphones, include the filtering by the outer ears and the reverberation of the recording environment. Both factors contribute positively to the externalization of sound when listened to over headphones. Certainly for less rich sounds, however, it appears quite often that sound recorded in this way is still not externalized.
One likely reason is that, during headphone listening, the information in the recording is stationary with respect to the head; consequently, the perceived
location is not stationary with respect to the outside world, but moves along with the movements of the head [365]. Significant in this respect is an observation by Plenge [275]. He describes circumstances where headphone listeners who turned their heads heard a sound coming from above, which is the only direction, besides the direction directly below them, where the sound source remains in the same place relative to the turned head. The role of head rotations was further studied by Brimijoin et al. [36] by means of virtual sources. They presented sounds over headphones and measured the head movements of the listeners. They created virtual sources in two ways: either the locations of the virtual sources remained fixed with respect to the outside world, also when listeners moved their heads, or the virtual sound sources remained fixed with respect to the heads of the listeners. Brimijoin et al. [36] found a significant increase in the externalization of the sound sources when the virtual sources remained fixed with respect to the outside world. When the virtual sound source moved along with the head rotations, no such increase was found. Hendrickx et al. [140] showed that, when virtual sound sources were presented with non-individualized HRTFs, externalization could be improved by including the effect of head rotations in the simulations. All this confirms the suggestion by Wenzel [365] that head rotations play a significant role in the externalization of the perceived sound location.

Above, in Sect. 9.1.4, it was argued that the reverberation of a room plays an important role in the externalization of a sound [134, 275, 299]. Moreover, Plenge [275] found that the reverberant sound must correspond to the acoustic characteristics of the room the listener has in mind. By having listeners listen, in a darkened room, to recordings of music made with a dummy head, he showed that it could take a few seconds before the listeners externalized the music to which they listened. A similar interesting situation occurs when listeners listen over loudspeakers to sound recorded in a room with acoustics that are completely different from those of the room in which they actually are. In that situation, too, it can happen that the sound is not externalized. This phenomenon is called the room-divergence effect [113, 367]. The room-divergence effect can be reduced by training the listener to the new listening situation [167]. A review of this phenomenon aimed at highlighting listener experience in audio reproduction is presented by Brandenburg et al. [30].

In Sect. 9.1.2, it was briefly mentioned that, due to the short wavelengths of high-frequency components and the complex shape of the human head, torso, and shoulders, ILDs vary quite considerably with azimuth, elevation, and distance. Due to this, the interaural coherence of the ILD during presentation of time-varying sounds such as speech specifies the location of the sound source relative to the listener. Catic et al. [61] showed that these ILD fluctuations play an important role in the externalization of the sound, in addition to the presence of reverberation that corresponds to the listening room [60].

In summary, at least the following sources of information are involved in the externalization of sound: the filtering by the outer ear, the presence of reverberation, familiarity with the acoustic properties of the room, the changing patterns of binaural information coupled with head rotations, and ILD fluctuations and their interaural coherence for time-varying sounds.
If this information is absent or ineffective, the
perceived location is diffuse, not well defined, or at its default location, i.e., within the head. Moreover, the role of several of these factors depends on the position of the sound source in relation to the listener. Indeed, Li et al. [194] showed that reverberation at the contralateral ear has more effect on externalization than reverberation at the ipsilateral ear. In addition, the same authors showed that the spectral detail at the entrance of the ipsilateral ear also plays an important role [195]. In combination, this explains why sounds close to the median plane are less well externalized than more lateral sounds [36, 186]. A recent comprehensive review of all these factors involved in externalization is presented by Best et al. [25], who include a succinct discussion of the role played by externalization in virtual-reality systems and of the problems of externalization, or rather the lack of it, for people with hearing aids.

9.6 Measuring Human-Sound-Localization Accuracy

In this section, the situation will be considered in which every single sound stimulus is produced by one real single source. So, the sound source is assumed to be neither a phantom sound source nor a virtual sound source since, if it is, it is evident that the perceived location and the actual location of the sound source will be different. The accuracy of the human localization system can be expressed by two essentially different kinds of measures, absolute measures and relative measures. An absolute measure for sound-localization accuracy gives the difference between the actual location of the sound source that produces the stimulus and its perceived location. It can be divided into two components, a systematic component and a random component. The systematic component is expressed by the systematic error. This is the difference between the actual location and the average of the perceived locations. The random component is expressed by the random error. This is the average spread of the perceived locations around their average. This may seem evident, but various complications will appear, e.g., because azimuth and elevation are circular measures, and because perceived locations can have bimodal distributions, as in the case of front-back confusions.

First, some systematic errors will be discussed. Indeed, various experiments have been carried out in which listeners were asked to point in the direction of a sound source or to indicate its egocentric distance. After such experiments, one ends up with a number of perceived locations, each corresponding more or less accurately to the actual location of the sound source. Systematic deviations can have various origins. For instance, sound sources close to the listener are often perceived as located farther away than they really are, while sounds produced by sources farther away from the listener are perceived as coming from closer by [394]. A second example of a systematic deviation occurs when a sound source with a certain elevation is systematically perceived at a location closer to the horizontal plane [256]. A third kind of systematic deviation that occurs regularly is the aforementioned front-back confusion, i.e., sounds that actually come from in front of
the listeners are perceived as coming from behind and vice versa [283, 325]. Next to these systematic deviations, the perceived locations will show random deviations. The spread of these random fluctuations is mostly different for perceived azimuth, perceived elevation, and perceived distance. In the presence of these different kinds of deviations, it is not always easy to come up with a single valid measure of localization accuracy. As to azimuth and elevation, spherical statistics [92] must be used. Another issue is how to deal with front-back confusions resulting in a bimodal distribution of the perceived locations. Neither the mean nor the standard deviation of this distribution is informative as to the problem at hand. This problem of how to express localization accuracy quantitatively is discussed by Wightman and Kistler [372, Appendix], Leong and Carlile [188], and Letowski and Letowski [189].

In addition to absolute measures, there are relative measures. A relative measure gives the difference between two locations of a sound source that the listener can just discriminate. For instance, one can often hear whether one sound comes from farther away than another without being very well able to estimate the absolute distance of the two sound sources. For relative measures, the concept of just-noticeable difference (JND), or difference limen, will be used. The JND can be measured in a situation in which a participant is presented with two stimuli that may or may not differ in a physical quantity. The task of the participant is to indicate whether the two stimuli are different or not; in the case of sound localization, whether the two sounds come from the same location or not. If the difference between the two values of the physical quantity of the two stimuli is much smaller than the JND, the difference will be virtually impossible to detect, and the proportion of correct responses will not significantly exceed chance level; if, on the other hand, the difference between the values of that physical variable of the two stimuli is much larger than the JND, the difference will be easy to detect and the proportion of correct discriminations will be close to 1. To be more precise, the JND can be defined as the difference between two values of a physical quantity that is detectable in 50% of all stimulus presentations. This means that, when the chance level of a correct response is 50%, the percentage of correct responses will be 75% at the JND.

A relative measure that is often used to express localization accuracy for azimuth and elevation is the minimum audible angle (MAA) [233]. The MAA is measured by placing two sound sources, e.g., two loudspeakers, at equal distances from the listener while varying the distance between the two loudspeakers. A sound is then played twice in succession, either over the same loudspeaker or over different loudspeakers. One can imagine that, if the distance between the loudspeakers is large enough, the listener can easily hear whether the sound is played over the same loudspeaker or not. If the loudspeakers are moved closer to each other, the two situations will be more and more difficult to discriminate. In the horizontal plane, the MAA then is the JND in azimuth of the two loudspeakers, so the difference in azimuth for which it is just possible to judge whether the sounds are played over the same loudspeaker or not. In the vertical direction, it is the JND in elevation of the two loudspeakers for which this is just possible.
So, if the difference in azimuth or elevation is much smaller than the MAA, listeners have great difficulty in judging whether the two sounds are played over the same loudspeaker or not, whereas they can easily do so when this difference is much larger. The concept of MAA is used for expressing the accuracy with which azimuth and elevation can be discriminated. To express the accuracy of auditory distance estimation, a similar concept is used, the minimum audible distance (MAD) [352] or minimum audible depth [23]. The MAD is defined as the JND of the distance between two sound sources of identical azimuth and elevation. Both MAA and MAD are relative measures. Hence, they only express the difference in position that can be perceived, but do not indicate whether a sound source is perceived at its actual location. As a consequence, these measures ignore systematic errors. This is very relevant since, e.g., the MAA is in general at least five times smaller than the spread measured in absolute-location experiments [58]. This indicates that small differences in the azimuth or elevation of a sound source can induce discriminable acoustic differences without being perceptually interpreted as changes in sound-source position. This also applies to the MAD. When the distance of a sound source from the listener is varied, and the listener is asked to indicate its position, the judgments may be based on the variations in intensity that induce variations in loudness, and not on variations in perceived distance. Effects such as these make the MAA and the MAD measures of accuracy that must be interpreted with care.

One may wonder how well auditory localization compares to visual localization. One can generally say that auditory localization is much less accurate than visual localization. The minimum visible angle is about 1 minute of arc, while the lowest MAA for sounds is not much less than 1°. One must realize, on the other hand, that visual localization is only accurate at the centre of the visual field. Away from the centre of the visual field, localization accuracy decreases rapidly. Outside the visual field, or when acoustic events are visually obscured by other objects, one has to rely on hearing to find out where something happens, and not only that, but also what happens.

9.7 Auditory Distance Perception

The first systematic overview of sources of information (SOIs) involved in the auditory perception of distance is given by Coleman [67]. In his discussion of these SOIs, he limits himself to those that play a role in the free field: level, frequency spectrum at near distances, frequency spectrum at far distances, and binaural information. For conditions that are more reverberant, he mentions in passing reverberation, binaural differences in frequency spectrum, and pinna effects as significant. In their review of auditory distance perception, Kolarik et al. [173] added dynamic information, stimulus familiarity, and visual information.


9.7.1 Accuracy of Distance Perception

The minimum audible distance or minimum audible depth (MAD) has been introduced as a relative measure to express the accuracy of distance perception. The outcome of experiments to determine the MAD has been quite variable, not only between experiments but also between listeners. The main reason for this variability is that the SOIs used by the perceptual system to estimate the distance of a sound source are quite diverse; they include not only acoustic SOIs such as the level of the sound and the amount of reverberation in the room, but also visual information about the room in which the listener is located and knowledge about the positions of the possible sound sources, e.g., loudspeakers, in the room. All these variables interact in a complex way. A second aspect of the problem is that distance estimation is virtually impossible for narrow-band sounds such as pure tones. In non-reverberant conditions, the listener can only move around and try to find out where the tone gets softer or louder; in reverberant conditions, the tones will form a complex system of nodes and anti-nodes obscuring all information as to the direction of the direct sound field. Wider-band sounds can generally be localized more accurately, a fact that was already studied at the end of the 19th century.

In rooms with moderate reverberation, there appears to be a systematic relation between the actual distance of a sound source and its perceived distance. In 1898, Shutt [315, p. 1] described experiments "in a large room" with blindfolded participants who judged the varying distances of a rich sound, a "telegraph snapper", and a less rich sound, a "pitch pipe." Distances close to the listeners were overestimated and distances farther away were underestimated. Shutt [315] also found that the distance of rich wide-band sounds was judged more accurately than that of narrow-band sounds. Moreover, judgments were more accurate beside the listeners than in front of or behind them. Similar results were obtained "in an open field" for "a loudspeaker source emitting bursts of thermal noise" [145, p. 1584]. This contrasts with direction estimation which, as one will see, is in general less accurate beside the listeners than in front.

Now, imagine a situation in which participants listen to two identical sounds played in free field over two loudspeakers located one behind the other at different distances from the listener. In that case, the main SOI listeners have in estimating which sound comes from farther away will be the difference in sound level between the two presentations. If this is correct, the MAD should be equal to the distance between the loudspeakers that gives a level difference equal to the just-noticeable difference for intensity [101]. This hypothesis is called the loudness-discrimination hypothesis, though intensity-discrimination hypothesis would be a more appropriate name. This hypothesis was tested by Strybel and Perrott [329]. In order to exclude the role of reverberation, they carried out the experiment in an open field. In free field, a doubling of the distance between sound source and listener corresponds to a decrease of 6 dB at the position of the listener. The just-noticeable difference for wide-band noise is about 0.3 to 0.5 dB. A decrease of 0.3 or 0.5 dB corresponds to an increase in distance of $10^{0.3/20} = 1.035$, or 3.5%, and $10^{0.5/20} = 1.059$, or 5.9%, respectively. For
longer distances, longer than about 4 to 5 m, Strybel and Perrott [329] indeed found an MAD of a few percent of the actual distance, which corresponds quite well with the loudness-discrimination hypothesis. For shorter distances, however, the MAD got larger, increasing up to about 20% at a distance of about 50 cm. So, sound level can play an important role in auditory distance perception. The experiments conducted by Strybel and Perrott [329] were discrimination experiments, in which subjects were only asked to indicate which of two successive sounds came from the nearest or the farthest loudspeaker, and not how far away they heard the sounds as coming from.

In order to reliably estimate the distance of a sound source based only on level, listeners must have a good idea about the power with which such sounds are usually produced. This will in general be the case for rich and familiar sounds such as speech [48] and music [84, 364], because the listener can hear with how much effort these sounds are produced [124]. The loudness with which they perceive these sounds can then be a reliable measure of the distance to these sound sources. This will, however, only be the case for naturally produced speech and music. When played over loudspeakers, the power with which they are reproduced has no direct relation with the effort the speaker or the musician has put into the production of the sounds. Indeed, Gardner [103] found that, in an anechoic room, distance estimation of speech played over loudspeakers in front of the listener only depended on level, while distance estimation of live speech was much more accurate. (See also the discussion in Sect. 6.7.3 and the forthcoming Sect. 9.9.3.)

For many sounds, certainly simple sounds, it is not possible to hear with how much power they are produced. In that case, the task of distance estimation based on level is precarious and the results will appear to be more variable. This was indeed concluded by Coleman [69]. He played 1-s noise bursts from an array of loudspeakers in front of the listeners and asked listeners to indicate from which loudspeaker they perceived the sound as coming. He divided the participants into four groups for which the distance to the first loudspeaker of the array was different. In order to minimize the effect of reflections, "the experiments were conducted outdoors on a frozen lake with a snow covering" (p. 346). He found that, at the beginning of an experiment, listeners based their distance estimations mainly on intensity level. As the experiment proceeded, however, the distance estimations improved. A similar improvement in distance estimations in the course of an experiment was found by Zahorik [396] in a normal reverberant room. Zahorik [396] recorded speech spoken by a female speaker in an anechoic room and played it over one of an array of loudspeakers in front of the listener, as Coleman [69] had done. Zahorik [396] found a reduction in the variance of the distance estimates during the first 10 trials for blindfolded listeners. Such an improvement did not occur with sighted listeners. So, familiarity with the sound stimulus, e.g., its range in level, and with the experimental conditions can improve absolute distance judgments. It is concluded that sound level can only be used as a SOI in distance perception if the listener can judge with how much power the sound is produced; in other situations, the sound level at the position of the listener can only be a relative measure for auditory localization.
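As a check on the arithmetic behind the loudness-discrimination hypothesis, the following minimal Matlab sketch converts an intensity JND in dB into the corresponding just-discriminable relative change in distance under the free-field inverse-square law; the 0.3 and 0.5 dB values are the ones cited above.

% Loudness-discrimination hypothesis: in the free field, the level at the
% listener drops by 6 dB per doubling of distance (inverse-square law), so
% a level JND of dL dB corresponds to a relative distance change of
% 10^(dL/20) - 1.
dL = [0.3 0.5];                  % intensity JNDs for wide-band noise (dB)
relChange = 10.^(dL/20) - 1;     % fractional change in distance

for k = 1:numel(dL)
    fprintf('Level JND of %.1f dB -> MAD of %.1f%% of the distance\n', ...
        dL(k), 100*relChange(k));
end

% Check: a doubling of distance corresponds to 20*log10(2) = 6.02 dB.
fprintf('Doubling of distance corresponds to %.2f dB\n', 20*log10(2));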
Another very important SOI for absolute sound localization appears to be the reverberant field in the room in which the listener is located, mostly expressed as the
direct-to-reverberant ratio (DRR). There is some discussion as to the quantitative contribution of this effect. Zahorik [395] showed that perceptually the DRR specifies distance only grossly; he found that thresholds for DRR differences could be as large as 6 dB, corresponding to a factor of 2 in distance. This makes the DRR only a rough absolute measure for distance. Larsen et al. [184], however, found thresholds of 2 to 3 dB for DRRs of 0 and 10 dB. For lower DRRs of −10 dB and higher DRRs of 20 dB, they found thresholds of 6 to 8 dB, similar to those of Zahorik [395]. Moreover, Kolarik, Cirstea, and Pardhan [171] measured the contribution of sound level and DRR to the accuracy of distance perception in simulated rooms with no, moderate, or high levels of reverberation. The sounds consisted of a male and a female speech utterance. Naturally, only level played a role in the simulated room with no reverberation. More significantly, the contribution of the DRR grew with longer reverberation times. In very reverberant rooms, and when the distance between listener and simulated source was large, information based on the DRR contributed about as much to the accuracy of distance perception as information based on sound level. It is concluded that the DRR can serve as a SOI in the auditory estimation of absolute distance [223, 224, 253].

In rooms with moderate reverberation and for wide-band sounds, there appears to be a systematic relation between the actual distance of a sound source and its perceived distance. For sound sources within the peripersonal space of the listener, i.e., closer than about 1.5 m, distance is systematically overestimated; for sources in the extrapersonal space, so farther away than about 1.5 m, distances are systematically underestimated. This means that there is a cross-over point where the perceived distance of a sound source is equal to the actual distance. This relation was first described in 1898 by Shutt [315] and in 1973 by Laws [185], and it has been confirmed by many other studies since (for a review, see Kolarik et al. [173]). The relation between the perceived distance of a sound and its actual distance can be captured by a power law with an exponent that is smaller than 1 [7, 394] up to what is called the auditory horizon [42, 184]. Distance estimation beyond this auditory horizon runs asymptotically to a constant level. This will be specified later on in this section.

One now knows that reverberation plays an important role in the externalization of sound. Even when a sound is externalized, however, this does not necessarily mean that there is enough information to localize the sound at its true location. It appears that listeners often systematically localize a sound at a well-specified distance, but this distance can deviate significantly from its actual distance. Relevant in this context is the concept of specific distance tendency (SDT). Gogel [116] introduced this concept for the visual system; Mershon and King [224] introduced it for the auditory system and defined it as "the tendency for objects presented under conditions of reduced information for egocentric distance to appear at a relatively near distance of approximately 2 m" (p. 414). Anderson and Zahorik [7] define it as "the perceived distance of an object reported by participants under conditions with minimal distance cues" (p. 9). Here, it will be referred to as the default perceived distance, the distance from which a listener by default hears a sound as coming in the
absence of more specific SOIs. Mershon and King [224] suggested that this default perceived distance might correspond to the cross-over point mentioned in the previous paragraph, where the perceived distance of a sound source is equal to the actual distance. This is discussed in detail by Anderson and Zahorik [7]. As to reverberation, they argue that "Reverberation level [...] may provide the context necessary for sound sources to appear displaced from the SDT" (p. 9).

A graphical impression of the relation between actual distance and perceived distance is presented in Fig. 9.14. The dashed line shows the relation that would hold if actual and perceived distance were equal. The thick line follows a more realistic pattern. Up to the auditory horizon of 15 m, it follows a power function as just suggested above. The constant of the power function is set to 1.32 and the exponent to 0.39, values proposed for an average listener by Zahorik [394]. The cross-over point is at about 1.6 m; for shorter distances, distance is overestimated, and for longer distances, it is underestimated. In order to give an impression of the accuracy of auditory distance perception, the narrow shaded area of Fig. 9.14 shows the variation in the distance estimates by a listener when based on the MAD, a relative measure of localization accuracy. For distances shorter than 5 m, it is derived from the data for the MAD presented by Strybel and Perrott [329]; for longer distances, it is set to 6%, a value corresponding to the loudness-discrimination hypothesis. This gives a rather favourable impression of the spread of the distance estimates as may be found under ideal listening conditions, so for wide-band sounds, in a moderately reverberant room, and in the presence of other auditory or visual reference points. In many other situations, distance estimation will not be so accurate. As mentioned, Zahorik [395] measured that the just-noticeable difference (JND) for the DRR could be as high as 5 to 6 dB, corresponding to a difference in distance of a factor of 2.
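The thick line of Fig. 9.14 can be reproduced with a few lines of Matlab. The sketch below uses the constant 1.32 and exponent 0.39 proposed by Zahorik [394] and the auditory horizon of 15 m mentioned above; for simplicity it clamps perceived distance beyond the horizon, a cruder assumption than the asymptotic behaviour described in the text.

% Power-law model of perceived distance (Fig. 9.14), with the constant and
% exponent proposed for an average listener by Zahorik [394].
A = 1.32;            % constant of the power function
a = 0.39;            % exponent of the power function
horizon = 15;        % auditory horizon in m (hard clamp, for simplicity)

d = logspace(log10(0.3), log10(50), 200);   % actual distance in m
dPerceived = A * min(d, horizon).^a;        % perceived distance in m

loglog(d, d, '--', d, dPerceived, 'k');
xlabel('Actual distance (m)');
ylabel('Perceived distance (m)');
legend('Perceived = actual', 'Power-law model', 'Location', 'northwest');

% Cross-over point where perceived and actual distance are equal:
% A * d^a = d  =>  d = A^(1/(1-a)), about 1.6 m for these parameter values.
fprintf('Cross-over point: %.2f m\n', A^(1/(1 - a)));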

Fig. 9.14 Model of perceived distance as a function of actual distance. The narrow shaded area represents the accuracy of the perceived distance based on discrimination experiments by Strybel and Perrott [329]. Many distance-estimation experiments will show a much larger variability as illustrated by the wide dotted area based on threshold measurements for the DRR by Zahorik [395]. The dashed line shows the line on which perceived distance is equal to actual distance. (Matlab)


Larsen et al. [184] found even larger JNDs for conditions with a relatively low DRR of −10 dB and a relatively high DRR of 20 dB. The broad dotted area in Fig. 9.14 represents this variation in distance estimates for the case in which the JND for the DRR is 6 dB. This indicates that distance estimation based on the DRR, an absolute measure, can be rather coarse. In practice, this source of information will be supplemented with other sources of information.

Finally, listeners can use motion parallax in estimating relative distance, as was briefly discussed above in Sect. 9.3.1. It is the rate of change of the azimuth as the listener moves around. It arises when an observer moves laterally with respect to an object. In that situation, the azimuth of an object changes more slowly when it is farther away from the observer [386]. Recently, Genzel et al. [108] showed that motion parallax indeed plays a role in distance perception in peripersonal space. In an anechoic room, they positioned two loudspeakers at different distances in front of a listener. The largest distance between listener and loudspeaker was 98 cm. Each loudspeaker produced a sequence of harmonic tones, one with an F0 of 210 Hz, the other with an F0 of 440 Hz. The task of the listener was to judge whether the lower-pitched tones or the higher-pitched tones came from farther away. It appeared that listeners could do this task quite well when the distance between the loudspeakers was large enough, larger than about 20 cm, and when the listeners were allowed to actively move their heads laterally by about 20 to 25 cm. Remarkably, performance deteriorated when the heads of the listeners were moved passively, and even more so when not the listeners but the loudspeakers were moved. In order to remove confounding factors associated with the experimental set-up with moving listeners or sound sources, the authors repeated these experiments with virtual sound sources. They guaranteed that the stimuli as they reached the ears were the same when the head of the listener was moved actively, when it was moved passively, or when not the listener but the virtual sound sources moved. For these virtual sound sources, too, the listeners performed best when they could actively move their heads. Apparently, planning active motion allows listeners to compare the incoming binaural information with the information predicted from the planned movement. This then results in a more accurate estimate of the relative distance of the two sources.

One sees that a number of quite different SOIs contribute to auditory distance perception. Some have an acoustic origin, e.g., sound level and reverberation; others do not, e.g., familiarity with the sound and the visually perceived locations of possible sound sources. Cognitive factors can also play a role. The number of studies into the role played by these factors and their interaction has grown rapidly in the last few decades. The results have proven to be quite variable, however, due to large differences not only between subjects but also within subjects [7]. Apparently, the weight attributed to the various SOIs differs from listener to listener, and the results depend on the context in which the experiments are carried out. For instance, the results of auditory distance-perception experiments can be quite different for experiments in which the listeners can or cannot see, or for experiments in which the listeners do or do not know the locations of the possible sound sources.
Moreover, the outcomes of experiments carried out in an anechoic room can be quite different from those of experiments carried out in reverberant rooms or in the open field.


9.7.2 Direct-to-Reverberant Ratio

Above it has been argued that it is the first wave front that contains the information used by the auditory system to estimate the direction a sound comes from. This does not mean that the reverberation is of no importance. On the contrary, reverberation appears to play an important role in auditory distance estimation. In the introduction of this chapter, the complex paths have been sketched that a sound can follow before it reaches the ears of the listener. Besides the direct sound, there is the reverberation, consisting of the reflected, diffracted, or scattered sound waves arriving later. In general, the reverberation arrives a few milliseconds later than the first wave front. It was shown above that the ratio between the energy in the direct sound and that in the reverberant sound, the direct-to-reverberant ratio (DRR), decreases monotonically with increasing egocentric distance, and can thus be an absolute measure for perceived distance. This DRR can also be influenced by other objects in the listening room, which appears to affect the perceived distance of a sound source. Indeed, as early as 1923, Von Hornbostel [355, p. 111] describes how the rattling sound of a box of matches in front of a listener sounds as coming from farther away than it actually is when a wooden shelf is placed between the listener and the sound source. When the shelf is placed behind the sound source, it sounds as coming from less far.

Bronkhorst and Houtgast [42] presented a computational model for the way in which the DRR can contribute to the perceived distance of a sound source. This model comprises not only the DRR but also the reverberant properties of the room in the shape of the reverberation radius r_h of the room, the distance from the sound source at which the energy of the reverberant sound equals that of the direct sound. According to this model, the perceived distance d_s is directly proportional to r_h and has an inverse power-law relation with the DRR, the ratio between the energy of the direct sound E_d and that of the reverberant sound E_r, i.e., E_d/E_r. For the estimated perceived distance this gives:

$$ d_s = A\, r_h \left( \frac{E_r}{E_d} \right)^{j} \qquad (9.3) $$

The reverberation radius r_h depends on the volume V and the reverberation time T of the room, and on the directivity factor G of the sound source. It can be calculated using Eq. 9.1. The question that remains is how to obtain E_d and E_r. In the model by Bronkhorst and Houtgast [42], the direct sound is the sound that arrives at our ears within about 6 ms after the first wave front. What comes later is the indirect sound. To be more precise, the energy of the direct sound is the energy arriving at the ears within the integration window W(t) of Eq. 9.4, specified by the parameter t_w = 6.1 ms and the slope s = 400 s⁻¹.


Fig. 9.15 Integration window for calculating the energy of the direct sound, continuous line, and that of the indirect sound, dashed line. (Matlab)

$$ W(t) = \begin{cases} 1, & \text{if } 0 \le t \le t_w - \pi/(2s), \\ 0.5 - 0.5\,\sin\left[ s\,(t - t_w) \right], & \text{if } t_w - \pi/(2s) < t \le t_w + \pi/(2s), \\ 0, & \text{if } t > t_w + \pi/(2s). \end{cases} \qquad (9.4) $$

The shape of this integration window is shown in Fig. 9.15 as the thick, continuous line. The parameters of this model were found by an iterative procedure: A = 1.9 ± 0.1, j = 0.49 ± 0.03, t_w = 6.1 ± 0.3 ms, and s = 400 ± 40 s⁻¹. The dashed line in Fig. 9.15 shows the integration window for the indirect sound, which is 1 − W(t).

Bronkhorst and Houtgast [42] applied this model to simulate auditory distance perception within a weakly reverberant room with a reverberation time of 0.1 s, and a more typically reverberant room with a reverberation time of 0.5 s. Bursts of pink noise were used as stimulus. Virtual sound sources were modelled by filtering, in addition to the direct sound, 800 simulated reflections with appropriate HRTFs. The results for the two different rooms are presented in Fig. 9.16. Data from three other studies are included in the inset. The individual points are the judgments by the listeners and the continuous lines the outcomes of the model. The correspondence between the real data and the simulations is remarkable. This model not only includes the DRR but, since it includes the reverberation radius, also the reverberation properties of the room.
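A minimal Matlab sketch of Eqs. 9.3 and 9.4 is given below. It assumes that a room impulse response h, sampled at a rate fs and starting at the first wave front, and the reverberation radius rh of the room are already available; the function name and variable names are arbitrary, and the parameter values are the ones reported above.

function dEst = perceivedDistanceBH(h, fs, rh)
% Sketch of the Bronkhorst-Houtgast model: split an impulse response h
% (sampled at fs, first wave front at t = 0) into direct and indirect
% energy with the integration window of Eq. 9.4 and estimate the
% perceived distance with Eq. 9.3.
A  = 1.9;            % model constant
j  = 0.49;           % model exponent
tw = 6.1e-3;         % window parameter t_w in s
s  = 400;            % window slope in 1/s

t = (0:numel(h)-1).' / fs;                   % time axis in s
W = zeros(size(t));                          % integration window W(t)
W(t <= tw - pi/(2*s)) = 1;
mid = t > tw - pi/(2*s) & t <= tw + pi/(2*s);
W(mid) = 0.5 - 0.5*sin(s*(t(mid) - tw));     % W is 0 beyond tw + pi/(2*s)

Ed = sum(W .* h(:).^2);                      % energy of the direct sound
Er = sum((1 - W) .* h(:).^2);                % energy of the reverberant sound

dEst = A * rh * (Er/Ed)^j;                   % Eq. 9.3
end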


Fig. 9.16 Perceived distances compared with actual distances. Reproduced from Bronkhorst and Houtgast [42, p. 519, Fig. 3], with the permission of Springer Nature BV; permission conveyed through Copyright Clearance Center, Inc.

The model of auditory distance estimation is based on Eq. 9.3, which presumes that the auditory system is able to separate direct sound from reverberant sound. Traer and McDermott [344] showed that this is indeed the case, but only for reverberation with natural statistical properties. The authors recorded the impulse responses of 271 different locations in the Boston metropolitan area, and selected four properties these impulse responses must have in order to be perceived as natural. These properties are the presence of the early reflections in relation to the reverberant tail, the exponential decay of the reverberant tail, the frequency dependence of this decay, and the degree of variation of this decay with frequency. In perception experiments, the authors manipulated these properties and found that, if they deviate from their natural statistics, the addition of such unnatural reverberation affects not only the perception of the environment but also that of the sound source. Traer and McDermott [344] conclude "that human listeners can estimate the contributions of the source and the environment from reverberant sound, but that they depend critically on whether environmental acoustics conform to the observed statistical regularities. The results suggest a separation process constrained by knowledge of environmental acoustics that is internalized over development or evolution" (p. E7856).

The model presented above does not exactly reproduce the power-law relation between perceived and actual distance as suggested by the various studies discussed in Kolarik et al. [173] and sketched in Fig. 9.14, including its auditory horizon. The calculations by Bronkhorst and Houtgast [42] show that this auditory horizon is a consequence of the relatively small change of the DRR at longer distances, i.e., distances much longer than the reverberation radius. This was further confirmed by Larsen et al. [184]. In a model in which the DRR was estimated by an equalization-cancellation method, Lu and Cooke
[202] found a linear relationship between egocentric distance and log(1/DRR) up to about 2.5 m, after which the decrease of the DRR with distance became less. All these models show that the relation between perceived and actual distance based on the DRR becomes less steep at longer distances because the DRR changes relatively slowly for those distances.

As a final remark, Larsen et al. [184] and Kopčo and Shinn-Cunningham [175] investigated whether it is really the DRR that is used as a source of information in auditory distance estimation, and not another acoustic variable that correlates with the DRR. Various such variables have been proposed, such as the interaural cross-correlation, the frequency-by-frequency variation in the power spectrum, and the spectral envelope [184]. The authors confirm the significant role played by the DRR in distance perception and show that these acoustic variables, since they do not change significantly beyond a certain distance, can also explain the existence of an auditory horizon, just as the DRR does. The other variables may, however, play a role for very low or very high DRRs. A review of the relations between such different variables that co-vary with the DRR is provided by Georganti et al. [109]. These authors also present an alternative model of auditory distance estimation based on the statistical properties of binaural signals [110].

9.7.3 Dynamic Information

Above, in Sect. 9.3.1, it appeared that head rotations in general do not contribute to auditory distance estimation [169, 316]. On the other hand, Kneip and Baumann [169] showed that head translations can in principle provide information about the distance to the sound source. This was systematically investigated by Lu and Cooke [203] by means of simulated listeners with virtual sources in an anechoic and a reverberant room. They estimated the DRR based on the idea that the direct sound arrives at our ears with an ITD that is equal for all frequencies, whereas the reverberant sound arrives from different azimuths and, hence, with different ITDs. By means of an elaborate interaural cross-correlation technique, they could thus estimate the DRR. Furthermore, they defined eight motion strategies and asked themselves which motion strategy resulted in the most accurate localization. Lu and Cooke [203] found that all motion strategies performed better than standstill. Allowing head rotations but no translations, however, resulted in only minor increments in localization accuracy. More significantly, one of their conclusions was: "Of the purposeful motion strategies, those which resulted in a reduction of distance to the estimated source location led to the best overall performance, probably due to an improvement in the ratio of direct to reverberant energy" (p. 639).

In Sect. 9.7.1, it was shown that, up to the auditory horizon, there is a power-law relation between intensity and perceived distance. This applies to stationary sound sources. If this is generalized to moving sound sources, it implies that equal intensity changes in dB at the position of the listener are perceived as equal proportions between the distances at the start and the end of the movement. This, however, appears to be too simple. Indeed, Neuhoff [249] showed that sound sources
increasing in intensity level by a certain amount are perceived as increasing more in loudness than sound sources decreasing in intensity level by the same amount are perceived as decreasing in loudness. This means that, for the same amount of change in intensity level, increases induce a larger change in loudness than decreases do. In addition, Neuhoff [246] showed that, in the free field, sound sources approaching the listener at a certain speed are perceived as closer to the listener than sound sources moving away from the listener at the same speed. Both phenomena illustrate the higher weight attributed perceptually to rising intensities than to falling intensities. This will be discussed in more detail in Sect. 9.9.3. Another issue is decruitment [56], a phenomenon briefly described in Sect. 7.7. In studies of decruitment, listeners are not asked to first listen to the stimulus and then judge globally the loudness change of a sound increasing or decreasing in intensity level; instead, they are asked to continuously monitor the loudness of sounds slowly changing in intensity level. The authors found that, for sounds with the same difference in dB between the intensity at the beginning and the end of the stimulus, the decrease in loudness for the decreasing sounds was faster than the increase in loudness was for the increasing sounds. The judged loudness was equal for the two sounds where their intensity level was highest, but at the end of the decreasing sound, the judged loudness was lower than at the start of the rising tone, although the intensities were equal there. Whether this plays a role in distance perception remains unclear. A review of these complicating issues is presented by Olsen [257].

9.7.4 Perceived Distance, Loudness, and Perceived Effort

In Sect. 9.3.11, it was argued that the perceived effort of familiar signals such as speech and music can be used in estimating the distance to such a sound source. In Sect. 6.7.3, it was shown that, in addition to intensity level, there are several other acoustic sources of information involved in estimating vocal effort. In conditions in which speech is spoken live, these sources of information can be used in estimating the distance of the speaker, which indeed appears to be the case for normal reverberant spaces [66, 104]. After all, if a sound produced with little effort is loud, it must be near; when a sound produced with great effort is soft, it must be far away. When played over loudspeakers, however, the intensity with which a sound reaches the listener no longer depends only on the production level of the sound but also on its amplification or attenuation by the audio system. Consequently, studies on auditory distance perception in which sounds are played over loudspeakers show much more variable results. Allen [5] showed that judgments as to the loudness and the vocal effort of monosyllables played over loudspeakers at varying distances in a "normal, i.e., not specially sound-treated, classroom" (p. 1834) are highly correlated. This was further studied by Brungart and Scott [48], who used virtual sound sources presented over headphones. As already mentioned a few times and illustrated above in Fig. 9.14, a general result is that distances of sound sources in the peripersonal space of the
listener are overestimated, while those in the extrapersonal space are underestimated (for a review, see Kolarik et al. [173]).

Another approach was followed by Warren et al. [362]. They formulated the physical-correlate theory, which holds that "estimates of sensory intensity are based upon experience with the manner in which the levels of sensory excitation are correlated with some physical attribute of the stimulus" (p. 700). To test this hypothesis, they asked listeners in one experiment to judge whether a test sound was more or less than twice as far away as a comparison sound, and in another experiment whether a test sound was twice as soft as a comparison sound. In this example, egocentric distance is the physical attribute, while loudness is the auditory attribute derived from sensory excitation. The experiments took place in a moderately sized conference room with normal reverberation. The participants were blindfolded. In the first experiment, Warren et al. [362] found that a decrease of, on average, 6 dB corresponded to a halving of the loudness. In the second experiment, they found that this decrease of about 6 dB corresponded to a doubling of the distance. This result, which corroborates the physical-correlate theory, was obtained both for a speech sound and for a 1000-Hz pure tone. The decrease of 6 dB with halving the loudness corresponds to an exponent of 0.5 in Stevens' power law for loudness perception (see Sect. 7.2). Usually, Stevens' law for loudness perception is presented with an exponent of 0.3, which implies that halving the loudness should correspond to a decrease of 10 dB; indeed, if loudness is proportional to intensity raised to an exponent θ, halving the loudness corresponds to a level change of 10·log₁₀(0.5)/θ dB, about −6 dB for θ = 0.5 and about −10 dB for θ = 0.3. Warren et al. [362] attribute this discrepancy to the fact that they obtained their results in reverberant conditions, while Stevens' law was based on data obtained in the absence of reverberation. On the other hand, it has been shown that the relation between actual distance and perceived distance can in general be approximated by a power law, but the individual differences can be large. For instance, for a virtual speech signal at 0° elevation and 90° azimuth, Zahorik [394] found exponents varying between 0.16 and 0.70 for different listeners, with an average of 0.39. In summary, this exponent is, on average, lower than that predicted by the physical-correlate theory, where it is 0.5.

Another relevant phenomenon in this context is that of loudness constancy. Loudness constancy is the phenomenon that the loudness of a sound in a room is perceived as relatively constant whatever the position of the source and the listener. Mohrmann [235] asked listeners to judge the loudness of different kinds of sounds, in this case speech, music, metronome clicks, and pure tones, which were varied in effort and played over loudspeakers at varying distances from the listener. The experiments took place in a moderately sized recording studio with little reverberation. Moreover, they were carried out in the dark, and with blindfolded or fully seeing listeners. He found that the loudness judgments by the listeners varied much less with distance than would follow from the intensity of the sound at the ear of the listeners. Similar results were found by Zahorik and Wightman [397], who carried out their experiments with virtually presented noise bursts processed with binaural room impulse responses measured in a moderately sized auditorium with a reverberation time T60 of 0.7 s.
The listeners were seated in a sound booth and were, remarkably, “carefully instructed to make their judgments based on the sound source power — so called ‘objective’ instructions” (p. 83).


Loudness constancy as a function of the amount of reverberation was studied by Altmann et al. [6]. They carried out experiments in a large room of which the reverberation properties could be varied. In that room, they made dummy-head recordings of sounds played with different intensities from loudspeakers at different distances from the dummy head. The reverberation time T60 was set to 0.14 s for the weak-reverberation condition and to 1.03 s for the strong-reverberation condition. Altmann et al. [6] postulated that loudness constancy would be much stronger in the reverberant condition than in the reverberation-poor condition. They instructed the listeners to "report the perceived loudness of a sound source in terms of sound power invariant to source distance following a magnitude estimation procedure" (p. 3212). They concluded: "In sum, we observed loudness constancy for strong but not weak reverberating conditions. This was dissociated from distance perception, strengthening the claim that loudness constancy directly depends on reverberating energy rather than a computational process that compensates for object distance" (p. 3219).

It is concluded that, in the experiments on loudness constancy described above, listeners were not judging loudness based on the intensity of the sound at the listener's position. Rather, their instructions were to ignore the loudness with which they heard the sound and to estimate the power of the sound source. Apparently, listeners are in general well able to estimate the effort with which a familiar complex sound such as speech or music is produced. This is confirmed by Honda, Yasukouchi, and Sugita [146], who placed listeners and a model music player in a large room at distances varied over 2, 8, and 32 m. Listeners were explicitly asked to play a musical instrument "as loudly as a model player" (p. 438). These authors also reported almost perfect loudness constancy. One may wonder what these experiments would have yielded if listeners had been instructed to base their judgments only on loudness and to ignore the perceived production level of the sound source. Most importantly, one may wonder whether listeners are really capable of ignoring the perceived production level of the sound source, as Allen [5] found high correlations between judgments of loudness and vocal effort. Given these results, these judgments are unlikely to be independent.

In summary, there is a complex interaction between a number of acoustic sources of information (SOIs) involved in estimating a number of perceptual attributes. On the one hand, there are the production level of the sound source, its playback level, and the distance between listener and sound source. On the other hand, listeners can be asked to estimate the loudness of the sound source, the production level of the sound source, and the distance to the sound source. The relation between one of those physical SOIs and the judgments of the listeners can be approximated by a power law, at least for distances shorter than the auditory horizon. However, all studies found significant individual differences. This suggests that different listeners interpret the experimental instructions differently or weight the acoustic SOIs differently.


9.7.5 Distance Perception in Peripersonal Space

Most of what has been said until now applies to studies of distance perception in extrapersonal space. It appears that auditory distance perception in peripersonal space, hence close to the listener, is complicated by a number of additional factors that have up to now mostly been ignored. One of the important factors is that the wave front of a sound source close to the listener can no longer be approximated by a plane wave. When a sound source is far away from the listener, it can be assumed that the difference in intensity at the contralateral and the ipsilateral ear is mainly due to the shadowing function of the head, and not to the difference in travelling distance to the contralateral and the ipsilateral ear. Since the decrease in intensity is inversely proportional to the square of the distance, the inverse-square law, this can no longer be ignored for sound sources close to the listener, neither for low- nor for high-frequency sounds. Indeed, for lateral sound sources close to the listener, ILDs become significant also for lower frequencies. A simple calculation shows that, when an external sound source is positioned on the interaural axis at a distance from the ipsilateral ear equal to the distance between the two ears, the intensity at the ipsilateral ear will be at least 6 dB higher than at the contralateral ear: the contralateral ear is then twice as far from the source, and the inverse-square law alone gives 20·log₁₀(2) ≈ 6 dB. This 6 dB is based on the assumption that there is no head between the sound source and the contralateral ear. The shadowing effect of the head adds to this. Indeed, Brungart and Rabinowitz [47] showed that, for such lateral sound sources, ILDs could be as large as 20 dB for distances of 0.12 m. Moreover, for low-frequency components, these ILDs appear not only to affect but even to dominate distance perception for lateral sound sources [45]. As to head-related transfer functions (HRTFs), the effect of the head and the shoulder changes at close distance, so that the HRTFs, too, change with distance, an effect that appears to be mainly significant for different azimuths. There appears to be no interaction effect of elevation and HRTF on distance perception [45]. These effects are especially significant for sound sources at the side of the listener, and in this way, they contribute to a higher accuracy of distance estimation for positions at the side of the listener than for positions closer to the median plane [46].

One of the consequences of the wave front of nearby sound sources being spherical rather than planar is that the set of locations with similar ILDs can no longer be approximated by a cone, the cone of confusion. Instead, the set of points with equal ILDs becomes a sphere with a centre on the extended interaural axis [314]. The intersection of this sphere with the cone of confusion defined by the ITDs forms a circle parallel to the median plane and with a centre on the extension of the interaural axis, a torus of confusion. The distance between this torus of confusion and the median plane provides an absolute measure of distance. Moreover, the pinnae have hardly any effect on intensity at low frequencies, but the presence of the torso does. At low frequencies, the torus of confusion is distorted in the vertical direction by the presence of the torso, while the symmetry with respect to the frontal plane is largely unaffected. This asymmetry provides information as to the vertical position of the sound, and may explain why up-down confusions are relatively rare compared with front-back confusions.
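The 6-dB figure from the simple calculation above follows directly from the inverse-square law. The short Matlab sketch below computes this purely geometric part of the ILD for a source on the interaural axis, ignoring head shadowing and diffraction entirely; the interaural distance of 0.18 m is an assumed round value, so the numbers are a lower bound rather than realistic ILDs.

% Geometric part of the near-field ILD for a source on the interaural axis,
% based only on the inverse-square law (no head shadow, no diffraction).
earDistance = 0.18;                 % assumed distance between the ears in m
dIpsi = [0.18 0.5 1 2 5];           % distance from source to ipsilateral ear (m)
dContra = dIpsi + earDistance;      % the contralateral ear is farther away

ILD = 20*log10(dContra ./ dIpsi);   % level difference in dB

for k = 1:numel(dIpsi)
    fprintf('Source at %.2f m: geometric ILD = %.1f dB\n', dIpsi(k), ILD(k));
end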


Another difference between sources close to the listener and sources farther away is the acoustic parallax, i.e., the angle between the directions from the left and the right ear to the sound source. This angle increases as the egocentric distance decreases. This acoustic parallax is especially relevant for sound sources in front of the listener, where ITDs and ILDs are close to zero. Due to this acoustic parallax, the directions specified by the HRTFs of the left and right ear are different. Kim et al. [163] simulated this for virtual sources and found that, in the absence of level and DRR information, so based on acoustic parallax alone, listeners can reliably estimate distance up to 1 to 1.5 m.

A final source of information that is used for distance perception of sounds in peripersonal space is motion parallax, not to be confused with the acoustic parallax just discussed. Motion parallax arises when an observer moves laterally with respect to an object. In that situation, the azimuth of an object changes more slowly when it is farther away from the observer [386]. As discussed in Sect. 9.7.1, Genzel et al. [108] showed that motion parallax indeed plays a role in distance perception in peripersonal space, but only when the participants actively move their heads.

It is concluded that distance perception of sound sources in peripersonal space depends not only on SOIs that are involved in distance perception in extrapersonal space, such as reverberation, dynamic information, familiarity with the sound source, level, and production effort, but also on SOIs such as ITDs, ILDs, and HRTFs. This list is not exhaustive and comprises only SOIs for which the role they play in auditory distance perception has been demonstrated in perception experiments. It is very likely that other SOIs cannot be ignored. One may think, e.g., of the role of head rotations and translations, visual information, or background noise. Some of these SOIs, such as intensity level, ITDs, or ILDs, are acoustically simple in nature, since they can easily be measured. Others, such as HRTFs and reverberation, may require complex and computationally demanding measurements, but the way they operate is straightforward. More difficult are SOIs such as familiarity with the sound source, familiarity with the listening room, and knowledge as to the location of possible sound sources. There is still a long way to go before the relative weights of all these factors in auditory distance estimation are known for different circumstances and different listeners.

9.8 Auditory Perception of Direction

Up to now, sources of acoustic information have been discussed that are involved in the externalization of sound and in the auditory perception of distance. What remains is the perception of the direction a sound comes from. Historically, this is actually not the right order. Most early research was concerned with the role ITDs and ILDs played in auditory localization. This resulted in the duplex theory of auditory localization, first clearly formulated in 1907 by Rayleigh [283]. Rayleigh concluded
that high tones are localized based on intensity differences and low tones on phase differences. But he realized very well that this could not be the whole story, since he warns: "A judgement that the signal is to the right or left may usually be trusted, but a judgement that it comes from in front or behind is emphatically to be distrusted" (p. 232). But he also notes that slight head rotations can disambiguate front-back confusions, and he suggests a role for the pinnae at higher frequencies: "If, as seems the only possible explanation, the discrimination of front and back depends upon an alteration of quality due to the external ears, it was to be expected that it would be concerned with the higher elements of the sound" (p. 231). The experiments carried out by Rayleigh [283] only involved sounds played in the horizontal plane, hence at an elevation of 0°. Here, too, up to Sect. 9.8.2, it will be assumed that the elevation is 0°.

9.8.1 Accuracy of Azimuth Perception

The first data on the accuracy of azimuth perception in the horizontal plane were obtained in the early fifties by Mills [233]. He measured the minimum audible angles (MAAs) of pure tones of varying frequencies for different azimuths. The tones were 1-s pure tones with rise and fall times of 70 ms. His results are summarized in Fig. 9.17, showing the MAAs as a function of frequency for azimuths of 0°, 30°, 45°, 60°, and 75°. For all azimuths, the MAAs for pure tones vary strongly as a function of frequency. It is clear that the smallest MAAs are found for the smallest azimuth, 0°. In that case, so for two loudspeakers just in front of the listener, the MAA is less than 2° for frequencies between 250 and 1000 Hz. Above 800 Hz, it rises, reaches a maximum at about 1800 Hz, then falls below 2° again, and starts to rise again at 6000 Hz. A frequency of 1800 Hz corresponds to a wavelength of about 20 cm, which is of the same order of magnitude as the diameter of the head. Indeed, it was shown that the difference in time with which a pure tone reaches both ears plays an important role when its wavelength is larger than the diameter of the head. For pure tones with wavelengths smaller than the size of the head, the difference in intensity with which the tone reaches both ears plays an important role. For pure tones with intermediate wavelengths, i.e., wavelengths of the same order of magnitude as the head, neither the time nor the intensity difference can be used effectively. This is even more true for tones that come more from the side, since the MAA is then larger. As one can see in Fig. 9.17, at an azimuth of 75°, the MAA gets larger than 30° between 1000 and 3000 Hz. This also applies to tones coming from the side with frequencies higher than 5000 Hz. These results give an impression of the accuracy with which the auditory system can localize pure tones and other narrow-band sounds in the horizontal plane. The results can well be explained by assuming that the judgments by the listener are based on binaural information, for frequency components higher than 1500 to 2500 Hz on ILDs, and for components with lower frequencies on ITDs.
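The frequency region in which neither ITDs nor ILDs are effective can be located with a one-line wavelength calculation. The sketch below assumes a speed of sound of 343 m/s and a head diameter of about 0.2 m; it shows that the wavelength equals the head diameter at roughly 1700 Hz, close to the frequency at which the MAA in Fig. 9.17 is largest.

% Wavelengths of pure tones compared with the diameter of the head.
c = 343;                     % speed of sound in m/s (assumed value)
headDiameter = 0.2;          % approximate head diameter in m (assumed value)

f = [250 500 1000 1800 3000 6000];   % frequency in Hz
lambda = c ./ f;                     % wavelength in m

for k = 1:numel(f)
    fprintf('%5d Hz: wavelength %.3f m\n', f(k), lambda(k));
end

% Frequency at which the wavelength equals the head diameter:
fprintf('Wavelength equals head diameter at about %.0f Hz\n', c/headDiameter);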


Fig. 9.17 Minimum audible angle in the horizontal plane for pure tones as a function of their frequency. The parameter θ represents the azimuth of the reference tone. Reproduced from Mills [233, p. 240, Fig. 5], with the permission of the Acoustical Society of America

The results presented in Fig. 9.17 were obtained with tonal stimuli. One may wonder whether wide-band sounds can be localized more accurately. This has been systematically studied by Yost and Zhong [389], not for a relative measure, as the MAA is, but for an absolute measure. They placed six loudspeakers in the right frontal field of the listener at equidistant azimuths varied between 0° and 75°, and asked listeners to indicate from which of these six loudspeakers they heard a sound coming. The root-mean-square error of the perceived locations was presented as a measure of accuracy. They found that the accuracy with which listeners localized noise bursts was better than for pure tones, even for noise with a bandwidth as small as 1/20 of an octave. Furthermore, when the bandwidth was wider than one octave, accuracy was independent of the centre frequency of the noise, and noise bursts of a wider bandwidth were more accurately localized than noise bursts of narrower bandwidths. For noise bursts with bandwidths narrower than one octave, the accuracy did depend on centre frequency and varied in the same way as for pure tones. The accuracy was best around 250 Hz, worst around 2000 Hz, and then improved to intermediate values for
higher frequencies as found for pure tones. As found by Mills [233], localization accuracy was best in front of the listener and decreased with azimuth. Moreover, using an absolute measure, Yost and Zhong [389] showed that azimuths were generally overestimated, a bias that increased with azimuth. This was also reported by Odegaard et al. [254] and Garcia et al. [102]. In the experiments described in this section, no sounds were presented from behind the listeners. Consequently, front-back confusions did not come into play. Front-back confusions will be discussed in more detail below.

9.8.2 Accuracy of Elevation Perception

It may seem logical now to present the MAAs for pure tones in the vertical plane, so for different elevations. This appears not to make any sense, however, since the perceived elevation of a sound source producing a pure tone does not depend on its actual elevation but on its frequency [281]. This phenomenon has been described as Pratt's effect in Sect. 9.4. The conditions under which the correct elevation of a sound can be perceived were studied by Roffler and Butler [288]. They formulated three conditions: first, the listener must have pinnae; second, the sound must be wide-band; and third, the bandwidth should cover the high-frequency range, higher than 7 kHz. This last estimate appears to be too high, however, because the authors only took elevations between −13° and 20° into account [138]. Hebrank and Wright [138] included positions with higher elevations. They took nine positions in the median plane, starting at an elevation of −30° in front of the listener and continuing in steps of 30° along the upper half of the median plane up to 210°, a position at the back of the listener. They conclude that frequencies in the range between 3.8 and 17 kHz play a role in elevation perception.

Roffler and Butler [288] found, for sounds in front of the listener satisfying the three conditions they had specified, a random error in the vertical direction of about 4°. Hence, the random error in the vertical direction is more than twice as large as in the horizontal direction, where it can be less than 2°. This lower accuracy in the vertical direction is confirmed by Saberi and Perrott [298], who also measured the MAA in oblique directions. Changing the direction from horizontal to vertical, they found that the accuracy first decreased quite slowly. It was only when the direction was nearly vertical that the accuracy decreased rapidly. Grantham, Hornsby, and Erpenbeck [120] argued that this could well be explained by the fact that binaural information, specifying azimuth, leads to much more accurate estimations of the MAA than the pinna-based spectral information, specifying elevation. Only in very steep directions, when the horizontal component becomes much smaller than the vertical component, does the specification of azimuth by binaural information become insignificant, so that the specification of elevation by pinna-based information dominates.

Pratt's effect expresses itself also in complex tones. Cabrera and Morimoto [54] showed, for tones consisting of the five lowest harmonics, that tones with higher fundamental frequencies were perceived at higher locations than tones with lower
fundamental frequencies. The effect was insignificant for square waves. The authors attribute this to the presence of high-frequency components, for which the filtering by the pinnae yields information as to the elevation of the sound source. Systematic studies of human auditory-localization accuracy of wide-band sounds as a function of azimuth and elevation are presented by Oldfield and Parker [255] and Makous and Middlebrooks [212]. In general, localization accuracy of such sounds diminishes with azimuth and elevation up to a random error of about 20°. Makous and Middlebrooks [212] confirm that these results can be well explained by assuming that azimuth perception is based on binaural information, while elevation perception is based on pinna-based spectral information.

It has been shown that the elevation of a narrow-band sound cannot be perceived accurately. This, however, does not mean that listeners do not attribute an elevation to such a sound. In fact, they often do so consistently, and this elevation has been called the default perceived elevation. In discussing Pratt's effect, it was shown that the default perceived elevation of narrow-band sounds has no relation with the actual elevation of the sound source. Roughly speaking, high-frequency sounds are perceived at high locations, and low-frequency sounds are perceived at low locations, although one will see below that the matter is more complicated. The association of high-frequency sounds with high positions and low-frequency sounds with low positions has long been noticed. It is evident in music notation and in all cultures [330, pp. 189–199]. Walker et al. [357] showed that it occurs in people of all ages over three to four months, and is founded in language. This association of high-frequency sounds with high positions and low-frequency sounds with low positions is often described as a cross-modal correspondence, in this case between pitch height and spatial height. There are correspondences between a multitude of pairs of perceptual attributes. In addition to the correspondence between pitch height and perceived elevation, there is also the correspondence between pitch height and perceived weight, a low pitch corresponding to heavy objects and a high pitch corresponding to light objects. A cross-modal correspondence between two non-auditory attributes is, e.g., the correspondence between perceived colour and perceived temperature, red being warm and blue being cold. For reviews on cross-modal correspondences, the reader is referred to Spence [321] and Parise [264].

This all indicates that there is a cognitive association between pitch height and elevation, suggesting that perceptual processes play an insignificant role. It appears, however, not to be as simple as this. For a narrow-band sound presented in the median plane, Blauert [27] gives a more complex description of the relation between the centre frequency of the noise and its default perceived location: narrow-band sounds are generally heard in front of the listener when the centre frequency is between 125 and 500 Hz, or between 2000 and 6000 Hz. Between 500 and 2000 Hz, and above 10 kHz, the perceived location is mostly behind the listener, and between 6000 and 12,000 Hz, the sound is perceived above the listener. Blauert [27] and Blauert [28, pp. 105–116] present explanations of these phenomena based on the resonance frequencies of the pinnae.
He argues that a narrow-band sound is perceived at the elevation for which the head-related transfer function (HRTF) has a peak at the centre frequency of that narrow-band sound. In this way, the centre frequencies of narrow-band
sounds are associated with corresponding elevations. Blauert [27] dubbed these frequency ranges directional bands. In this approach, the mapping between frequency and elevation is a consequence of the elevation-dependent frequency transmission by the outer ears [228]. Individual differences in directional bands are explained by individual differences in HRTFs [153]. To make this more concrete, one may inspect Fig. 9.10, showing DTFs of two listeners as a function of elevation. The significant resonances in these DTFs form ridges on the surface plot, the frequency of which changes with elevation. Moreover, these ridges differ between the two listeners, which explains why these listeners have different default perceived elevations.

More detailed results were obtained by Langendijk and Bronkhorst [182]. They used a “virtual acoustic pointer”, i.e., a 100-ms 210-Hz buzz that could be positioned virtually in space using individualized HRTFs. They asked listeners to position this pointer at the perceived locations of 200-ms noise bursts band-pass filtered between 200 Hz and 16 kHz. Based on the results, the authors conclude that up-down positions are mainly specified by the 6–12 kHz band, while front-back positions are specified by the 8–16 kHz band. They also found considerable individual differences, e.g., there were two “poor localizers”, and argued that different listeners may use different sources of information. As this experiment is very technical and complex, the interested reader is referred to the original article for details.

Parise, Knorre, and Ernst [265] approached this phenomenon from another perspective. They wondered whether Pratt’s effect may be a reflection of the correspondence between the spectrum of environmental sounds and the average elevation of their sources, as also suggested by Cabrera et al. [53] as cited by Cabrera and Morimoto [54]. In other words, high-frequency sounds may on average come from higher elevations than low-frequency sounds. Hence, based on this experience, higher-frequency sounds may by default be perceived as coming from higher elevations than lower-frequency sounds, because that corresponds best with the direction from which they usually come. To check this, Parise, Knorre, and Ernst [265] measured the spectra of 50,000 1-s sounds recorded in everyday conditions and related them to the elevations of their sources. And sure enough, they could establish a mapping between the frequency content of the sounds and the elevation from which they on average arrived at the listener’s ears. Higher-frequency sounds come on average from higher elevations than lower-frequency sounds. This is especially clear in the range from 1 to 6 kHz, a range similar to the range of 2 to 8 kHz in which Blauert’s directional bands correspond to Pratt’s effect.

Parise, Knorre, and Ernst [265] call the mapping between elevation and the average frequency of environmental sounds a distal frequency-elevation mapping, since it is derived from regularities in the environment of the listener; it is world-centred. The mapping corresponding to Blauert’s directional bands is called a proximal mapping, since it is derived from regularities in the listener’s body; it is body-centred. Parise, Knorre, and Ernst [265] now suggest that the correspondence between the distal and the proximal frequency-elevation mappings is the result of a process of adaptation of the outer ears to this frequency distribution of sounds in natural environments.
In other words, the ear is shaped in such a way that higher-frequency sounds are most amplified when they come from higher elevations, because they usually
come from higher elevations; lower-frequency sounds are most amplified when they come from lower elevations, because they usually come from lower elevations.

In conclusion, both proximal and distal frequency-elevation mappings introduce a bias in elevation judgments, usually in the same direction, but not always. Indeed, the biases introduced by the proximal and the distal mapping are only in the same direction when the listener’s head is in an upright position. When the head is tilted sideward, they no longer agree; when listeners lie on their sides, they are even independent. This was used by Parise, Knorre, and Ernst [265] to find out to what extent each mapping contributes to the elevation judgments of narrow-band sounds. They carried out localization experiments with listeners not only in upright positions, but also in half-upright and lying positions, so that their bodies, including their heads, were tilted sideward by 45° and 90°, respectively. As expected, the perceived locations of the narrow-band noises were, again, independent of the actual elevations of the sound sources and completely determined by the centre frequency of the noise band. Both the distal and the proximal mapping appeared to play a role; neither mapping dominated the localization judgments. Parise, Knorre, and Ernst [265] showed that the systematic biases could be well explained based on Bayesian decision theory in which the two mappings act as priors, the distal map as a world-centred prior, and the proximal map as a body-centred prior; a minimal numerical sketch of this idea follows below. Parise [264] suggests that cross-modal correspondences may in general be the result of environmental regularities to which perceptual systems adapt.
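To make the Bayesian account just mentioned somewhat more concrete, the following minimal sketch combines a world-centred prior, a body-centred prior, and a broad sensory likelihood, all modelled as Gaussians over elevation. The numbers, the Gaussian assumption, and the variable names are illustrative only and are not taken from Parise, Knorre, and Ernst [265]; the script merely shows how two priors and a nearly uninformative spectral estimate can be merged into one biased elevation judgment.

  % Illustrative prior parameters in degrees of elevation (invented values).
  muDistal   = 40; sdDistal   = 20;   % world-centred prior for a given centre frequency
  muProximal = 10; sdProximal = 15;   % body-centred prior; differs from the distal one
                                      % when the head is tilted sideward
  muSensory  = 0;  sdSensory  = 60;   % broad likelihood: a narrow-band sound carries
                                      % little spectral information about elevation

  % For Gaussian sources of information, the posterior mean is the
  % precision-weighted average of the individual means.
  w  = [1/sdDistal^2, 1/sdProximal^2, 1/sdSensory^2];
  mu = [muDistal, muProximal, muSensory];
  posteriorMean = sum(w .* mu) / sum(w);
  posteriorSD   = sqrt(1 / sum(w));
  fprintf('Biased elevation judgment: %.1f deg (SD %.1f deg)\n', ...
          posteriorMean, posteriorSD)

With the head upright, the two prior means would largely coincide and the bias would simply pull towards that common value; with the head or body tilted, as in the experiment, the judgment is pulled towards a compromise between the two priors.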

9.8.3 Computational Model

Above, it has been shown that auditory perception of azimuth is predominantly based on binaural information, while spectral filtering by the pinnae is dominant in auditory perception of elevation. There are indeed various arguments indicating that the perception of azimuth and the perception of elevation are based on different processes. Frens and Van Opstal [99] investigated saccades evoked by sounds. Saccades are rapid eye movements from one fixation point to another. They can be observed when someone is reading or looking out of the window of a moving vehicle. More will be said about saccades in Sect. 9.10.1. Frens and Van Opstal [99] showed that such saccades consist of two overlapping eye movements, one controlled by binaural information and the other by spectral information derived from the filter properties of the pinnae. Moreover, Hofman and Van Opstal [142] measured the accuracy of azimuth and elevation perception based on saccadic eye movements towards wide-band noise stimuli varying in azimuth and elevation. Since the head was stabilized, head movements were not possible. Hence, azimuth and elevation were only varied over the range for which eye movements towards the stimulus location could be realized, i.e., about 35° in all directions. By varying the spectrotemporal properties of the noise, they could show that perception of azimuth is based on binaural information that can be derived from a stimulus interval of not more than 3 ms. On the other hand, though intervals as short as 5 ms appeared to contribute significantly to elevation estimation, a stable and
accurate estimate of elevation required an interval of wide-band noise of at least 80 ms. This shows that the processing of spectral information is computationally much more intensive than the processing of ITDs and ILDs.

Computational models of the two processes will now be sketched. First, a concise sketch will be presented of a model often used to estimate the perceived azimuth of a sound. For each ear, a gammatone filterbank is used to process the HRTF-filtered sound input of that ear. The outputs of corresponding frequency channels of the two filterbanks are then cross-correlated, and the cross-correlations are used to estimate the ITD and, with that, the azimuth from which the sound originates (e.g., Roman et al. [291]). A minimal numerical sketch of such a model is given at the end of this section.

Second, the processing of elevation will be sketched, which is more complex. Hofman and Van Opstal [142] calculated the correlation between HRTFs corresponding to different elevations. They found that, with the exception of very high elevations close to the zenith of the listener, correlations were high for HRTFs that were close in elevation and low for HRTFs that differed considerably in elevation. This shows that an HRTF indeed specifies the elevation of a sound source. This specification is correct for sound stimuli with a flat spectrum, hence wide-band, and of sufficient duration. If the noise is coloured in such a way that the peaks and notches correspond to an HRTF of a different elevation, the perceived location has this different elevation. This corroborates the results on directional bands found by Blauert [27], and the results by Middlebrooks [228] described above in Sect. 9.8.2. Based on this, Hofman and Van Opstal [142] present a computational model for the perception of elevation. This model operated well for noise that was coloured according to the real HRTFs of the participants. The authors, however, also used stimuli with spectra that did not correspond well to any real HRTF. Hofman and Van Opstal [142] then find that “in the absence of sufficient spectral processing the auditory localization system stays close to its default initial estimate of elevation, typically near the horizontal plane” (p. 2647), a result similar to results found by Oldfield and Parker [256] and Wenzel et al. [366]. This may seem quite natural, but it challenges the cone-of-confusion model, which predicts that, in the absence of the spectral pinna information, localization will be based on interaural time and intensity differences. If, e.g., a sound source with an azimuth of 90° and an elevation of 60° were presented to a listener with mould-filled pinnae, its perceived location would then be expected on the cone of confusion corresponding to an azimuth of 90° and an elevation of 60°. The two points in the horizontal plane on this cone of confusion have azimuths of 30° and 150°. This, however, is not the direction from which the sound source is perceived. Its perceived location has approximately the original azimuth and an elevation that is closer to or at the horizontal plane [256, 267, 366].

For a recent review of modelling auditory sound localization in the context of auditory scene analysis, the reader is referred to Sutojo et al. [333]. In these models, no use is made of information from motions by the listener or by the sound source, which will be discussed in the next section.
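As announced above, the following minimal sketch illustrates the ITD-based azimuth model: it estimates the ITD by cross-correlating the two ear signals and converts it to an azimuth. It is a sketch under simplifying assumptions: the per-channel gammatone filterbank and the HRTF filtering are omitted, the ear signals are simulated as a pure interaural delay, and the head radius, sampling rate, sign convention (positive azimuth to the left), and the spherical-head (“Woodworth”) conversion from ITD to azimuth are choices made for this example, not part of the model as described above.

  fs = 44100;                     % sampling rate in Hz
  c  = 343;                       % speed of sound in m/s
  r  = 0.09;                      % head radius in m (assumed)

  % Simulate a noise source at 30 deg azimuth as a pure interaural delay.
  azTrue  = 30;                                            % degrees, to the left
  itdTrue = (r/c) * (azTrue*pi/180 + sin(azTrue*pi/180));  % spherical-head approximation
  lag     = round(itdTrue * fs);                           % delay in samples
  x     = randn(4410, 1);                                  % 100 ms of noise
  left  = [x; zeros(lag, 1)];                              % left ear leads
  right = [zeros(lag, 1); x];                              % right ear lags

  % Cross-correlate over physiologically plausible lags (about +/- 1 ms).
  maxLag = round(1e-3 * fs);
  lags   = -maxLag:maxLag;
  cc     = zeros(size(lags));
  for k = 1:numel(lags)
      L = lags(k);
      if L >= 0
          cc(k) = sum(left(1:end-L) .* right(1+L:end));
      else
          cc(k) = sum(left(1-L:end) .* right(1:end+L));
      end
  end
  [~, iBest] = max(cc);
  itdEst = lags(iBest) / fs;      % positive when the left ear leads

  % Invert the spherical-head approximation numerically to recover the azimuth.
  azGrid  = 0:0.1:90;                                      % degrees
  itdGrid = (r/c) * (azGrid*pi/180 + sin(azGrid*pi/180));
  [~, iAz] = min(abs(itdGrid - abs(itdEst)));
  fprintf('Estimated ITD: %.0f us, azimuth: about %.0f deg\n', ...
          itdEst*1e6, sign(itdEst)*azGrid(iAz))

In a full model, this cross-correlation would be carried out separately in each gammatone channel, after which the channel-wise estimates would be combined.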
9.9 Auditory Perception of Motion

Not only the location of a sound source but also its motion around a listener is mostly described in spherical coordinates, with the midpoint between the listener’s ears as centre. Motion can then be divided into a rotational part, involving changes in the two angular coordinates, and a radial part, involving changes in egocentric distance. The accuracy of auditory motion perception in situations in which both rotational and radial velocity are varied has not yet been systematically studied. Only when an object, such as a mosquito or the scissors of a hairdresser, spirals around the listener are the two parts of the motion combined. In experimental studies of auditory motion perception, either sensitivity to changes in azimuth or elevation has been studied, or sensitivity to changes in distance. Hence, rotational motion and radial motion will be discussed separately, starting with human sensitivity to rotational motion. The decomposition into a rotational and a radial component is illustrated by the sketch below.
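The sketch takes an invented spiralling trajectory, expresses it in spherical coordinates, and separates the radial velocity from the two rotational velocity components. The trajectory, the coordinate convention (x forward, y to the left, z up), and all numbers are assumptions made for this example only.

  fs = 100;                              % samples per second of the trajectory
  t  = (0:1/fs:5)';                      % 5 s of motion
  x  = (1 + 0.2*t) .* cos(2*pi*0.5*t);   % a source spiralling around the head
  y  = (1 + 0.2*t) .* sin(2*pi*0.5*t);   % while slowly receding and rising;
  z  = 0.1 * t;                          % coordinates in metres

  dist    = sqrt(x.^2 + y.^2 + z.^2);    % egocentric distance
  azimRad = atan2(y, x);                 % azimuth in radians
  elevDeg = asin(z ./ dist) * 180/pi;    % elevation in degrees

  radialVel = gradient(dist, 1/fs);                     % radial part, in m/s
  azimVel   = gradient(unwrap(azimRad), 1/fs) * 180/pi; % rotational part, in deg/s
  elevVel   = gradient(elevDeg, 1/fs);                  % rotational part, in deg/s
  % Here the azimuth changes at about 180 deg/s while the distance grows
  % at roughly 0.2 m/s: a combined rotational and radial motion.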

9.9.1 Accuracy of Rotational-Motion Perception

Rotational motion can be divided into a horizontal and a vertical component. The horizontal component represents the changes in azimuth, the vertical component the changes in elevation. As to the sensitivity to changes in azimuth or elevation, an important concept is that of the minimum audible movement angle (MAMA) [271], “the minimum arc swept by a moving source which could be detected as moving” [emphasis by authors] (p. 289). Strybel, Manligas, and Perrott [328] give a somewhat different definition of the MAMA: “the minimum angle of travel required for detection of the direction of sound movement” (p. 267), in which only movement in the horizontal plane was considered.

The first hypothesis one may formulate is that the MAMA is based on the perceived difference in azimuth and elevation between two successively perceived locations. If this is true, the MAMA cannot be smaller than the MAA. This appears to be correct. Perrott and Marlborough [270] find MAMAs of less than 1° for noise moving at 20°/s in the horizontal plane in front of the listener. Strybel et al. [328] measured MAMAs of noise sounds, also at a speed of 20°/s, for various azimuths and elevations. They find that the MAMA varied between about 1° and 2° for azimuths between −40° and 40° and elevations lower than 80°. Outside this range, the MAMAs increase up to 10°. Saberi and Perrott [298] studied the dependence of the MAMA on velocity and on the direction of motion relative to the horizontal plane. Listeners were seated at a distance of 716 cm from a loudspeaker array in a large sound-attenuated room.