English Pages 248 [258] Year 1996
NEUROSCIENCE INTELLIGENCE UNIT
VIRTUAL AUDITORY SPACE: GENERATION AND APPLICATIONS Simon Carlile
VIRTUAL AUDITORY SPACE: GENERATION AND APPLICATIONS Simon Carlile, Ph.D. University of Sydney Sydney, Australia
LANDES BIOSCIENCE AUSTIN
LANDES BIOSCIENCE Austin, Texas, U.S.A. Copyright ©1996 All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Printed in the U.S.A. Please address all inquiries to the Publishers: Landes Bioscience, 810 S. Church Street, Georgetown, Texas, U.S.A. 78626 Phone: 512/ 863 7762; FAX: 512/ 863 0081
While the authors, editors and publisher believe that drug selection and dosage and the specifications and usage of equipment and devices, as set forth in this book, are in accord with current recommendations and practice at the time of publication, they make no warranty, expressed or implied, with respect to material described in this book. In view of the ongoing research, equipment development, changes in governmental regulations and the rapid accumulation of information relating to the biomedical sciences, the reader is urged to carefully review and evaluate the information provided herein.
Library of Congress Cataloging-in-Publication Data
Carlile, Simon, 1957-
Virtual auditory space: generation and applications / Simon Carlile [editor].
p. cm. -- (Neuroscience intelligence unit)
Includes bibliographical references and index.
ISBN 1-57059-341-8 (alk. paper). -- ISBN 0-412-10481-4 (alk. paper). -- ISBN 3-40-60887-7 (alk. paper)
1. Directional hearing--Computer simulation. 2. Auditory perception. 3. Virtual reality. 4. Signal processing--Digital techniques. I. Carlile, Simon, 1957- . II. Title. III. Series.
QP469.C37 1996 612.8'5'0113--dc20 96-14117 CIP
DEDICATION In memory of John J. Mackin who planted this seed.
CONTENTS

Foreword
Nathanial Durlach

1. Auditory Space
Simon Carlile
   1. Perceiving Real and Virtual Sound Fields
   2. Sound Localization by Human Listeners

2. The Physical and Psychophysical Basis of Sound Localization
Simon Carlile
   1. Physical Cues to a Sound’s Location
   2. Psychophysical Sensitivity to Acoustic Cues to a Sound’s Location

3. Digital Signal Processing for the Auditory Scientist: A Tutorial Introduction
Philip Leong, Tim Tucker and Simon Carlile
   1. Introduction
   2. The Nature of Sound
   3. Discrete Time Systems
   4. Frequency Domain Digital Signal Processing
   5. Time Domain Analysis
   6. Filter Design

4. Generation and Validation of Virtual Auditory Space
Danièle Pralong and Simon Carlile
   1. Introduction
   2. Recording the Head Related Transfer Functions (HRTFs)
   3. Filtering Signals for Presentation
   4. Delivery Systems: Recording Headphones Transfer Functions (HpTFs)
   5. Performance Measures of Fidelity
   6. Individualized versus Nonindividualized HRTFs and HpTFs

5. An Implementation of Virtual Acoustic Space for Neurophysiological Studies of Directional Hearing
Richard A. Reale, Jiashu Chen, Joseph E. Hind and John F. Brugge
   1. Introduction
   2. Free-field to Eardrum Transfer Function (FETF) in the Cat
   3. Earphone Delivery of VAS Stimuli
   4. Mathematical Model of VAS
   5. Responses of Single Cortical Neurons to Directional Sounds

6. Recent Developments in Virtual Auditory Space
Barbara Shinn-Cunningham and Abhijit Kulkarni
   1. Introduction
   2. Effects of HRTF Encoding on VAS Applications
   3. Applications
   4. Discussion

Index
EDITORS

Simon Carlile, Ph.D.
Department of Physiology
University of Sydney
Sydney, Australia
Chapters 1, 2, 3, 4

CONTRIBUTORS

John F. Brugge, Ph.D.
Department of Neurophysiology and Waisman Center on Mental Retardation and Human Development
University of Wisconsin-Madison
Madison, Wisconsin, U.S.A.
Chapter 5

Jiashu Chen, Ph.D.
Nicolet Instrument Company
Madison, Wisconsin, U.S.A.
Chapter 5

Joseph E. Hind, Ph.D.
Department of Neurophysiology
University of Wisconsin-Madison
Madison, Wisconsin, U.S.A.
Chapter 5

Abhijit Kulkarni, M.S.
Department of Biomedical Engineering
Boston University
Boston, Massachusetts, U.S.A.
Chapter 6

Philip Leong, Ph.D.
Department of Electrical Engineering
University of Sydney
Sydney, Australia
Chapter 3

Danièle Pralong, Ph.D.
Department of Physiology
University of Sydney
Sydney, Australia
Chapter 4

Richard A. Reale, Ph.D.
Department of Neurophysiology and Waisman Center on Mental Retardation and Human Development
University of Wisconsin-Madison
Madison, Wisconsin, U.S.A.
Chapter 5

Barbara Shinn-Cunningham, Ph.D.
Research Laboratory of Electronics
Massachusetts Institute of Technology
Cambridge, Massachusetts, U.S.A.
Chapter 6

Tim Tucker, B.E.E.
Tucker-Davis Technology
Gainesville, Florida, U.S.A.
Chapter 3
Foreword
The material presented in this book constitutes a significant contribution to the emerging area of virtual auditory space. In the following paragraphs, we put this material in a larger context concerned with human-machine interfaces and multimodal synthetic environments. Consistent with the terminology used in the recent report by the National Academy of Sciences in the U.S.,1 the term “synthetic environments” is intended to include not only virtual environments generated by computers, but also the environments that result when a human operator interacts with the real world by means of sensors and actuators mounted on a telerobot. The term also includes the many types of environments that are now being created by combining information generated by computers with information derived from the real world either directly or by means of telerobots, often referred to under the heading of “augmented reality.”

Within the general field of synthetic environments, the material in this book can be characterized by its focus on: (1) the auditory channel; (2) the construction and use of virtual auditory environments; (3) the spatial properties of these virtual auditory environments; and (4) the manner in which and extent to which the spatial properties of these virtual auditory environments are realistic. Comments on each of these constraints are contained in the following paragraphs.

1. THE AUDITORY CHANNEL

Inclusion of auditory input in virtual environments (VEs) is as important as it is in real environments. Although the attention of most computer scientists has been focused primarily on the visual channel and graphics, it seems clear that the use of the auditory channel for intraspecies communication by speech signals and for alerting the human to actions in the surrounding environment makes hearing a truly vital function. The relative importance of hearing in the real world can be demonstrated empirically by comparing the effects of real deafness to those of real blindness.
Apparently, deafness constitutes a more serious obstacle than blindness to participation in the main culture (or cultures) of the society. Of particular interest in this connection are the political battles now taking place in some countries among different groups of deaf individuals (and the hearing parents of deaf children) over the virtues and vices of converting deaf individuals to partially hearing individuals by means of cochlear implants. Quite apart from issues related to the cost-effectiveness of such implants, questions are being raised by certain members of the deaf sign-language community about the basic desirability of restoring the ability to hear. In fact, some of the more extreme statements concerned with this question have referred to hearing restoration by means of implants as a form of genocide. The existence of this movement, like that of a number of other social/political movements around the world, again demonstrates the depth of one’s ties to one’s native language. We seriously doubt that an equivalent movement would arise in connection with implants that restore vision.

It should also be noted when comparing the auditory and visual channels in virtual environments that the basic interface devices available for displaying sounds in the virtual world are much more cost-effective than those available for displaying sights. In particular, assuming that both types of devices are head mounted (an issue discussed further below), the cost-effectiveness of earphones far surpasses that of head-mounted visual displays. Unlike the visual channel, for which the limitations on fidelity of the image generally result from both inadequate display technology and inadequate computer synthesis of the visual images to be displayed, in the auditory channel the limitations on fidelity arise primarily from inadequate methods for computer synthesis of the sounds to be displayed.

2. VIRTUAL AUDITORY ENVIRONMENTS

The assumption made above that the interface device for the auditory channel in virtual environments consists of a set of earphones rather than a set of loudspeakers relates to an important underlying distinction between virtual environment systems and traditional simulators. In traditional simulators (e.g., for training aircraft pilots), only the far field, that is, the field beyond reach of the user, is generated by computer; the near field is generally created by means of a physical mock-up. Thus, most simulators tend to be both expensive and space consuming. In contrast, in virtual environments, the near field as well as the far field is virtualized by means of software.
An important reflection of this tendency is the extent to which virtual environment equipment constitutes a form of electronic clothing (head-mounted displays, gloves, body suits, etc.). It is only a matter of time before research on auditory virtual environments begins to focus on reproduction (using earphones) of sounds generated very close to the head.

Also worth noting are the similarities and differences in the research issues that arise between virtual environments and teleoperation. These issues are essentially identical in the two cases when attention is focused solely on the interface device used for presenting the acoustic signals. However, when one considers the way in which the signals are generated, important differences arise. In the VE case, the main challenges center around problems of spatialization and sound synthesis. In the teleoperator case, on the other hand, they center around problems related to the design of microphone arrays to achieve adequate spatial resolution at the site of the telerobot, and, if the telerobot is not anthropomorphic (e.g., the interaural spacing or the number of ears is different from that of the operator), to the signal processing required to “match” the output of the microphone array to the human listener.

3. SPATIALIZATION IN VIRTUAL AUDITORY ENVIRONMENTS

In virtual auditory environments, the two main problems are the generation of the sounds to be displayed and the spatialization of these sounds. Although sound synthesis constitutes a major R & D topic in the domains of speech and music, relatively little has been accomplished in the domain of environmental sounds. Although some basic physical modeling of sound generation has been performed, and there have been extensive efforts in the entertainment industry to record a wide range of sounds (and to process these recordings in various ways and to varying degrees) to achieve a wide range of “sound effects,” much work remains to be done to achieve a sound synthesis system that is truly useful for virtual environments. The other substantial problem area in the creation of virtual auditory environments, spatialization, is, of course, the main focus of this book.

4. REALISM

Within the domain of auditory virtual-environment spatialization, the main emphasis in this book is on the achievement of realism. This concern with the creation of realistic spatial properties obviously has led, and is continuing to lead, to increased knowledge about auditory perception and to improved instrumentation. It is important to note, however, that there are numerous applications in the VE area in which realism should probably not be the goal. For example, and as discussed briefly in this book, it may be important to emphasize or magnify certain components of spatialization in order to enhance performance in certain specified tasks. Also, such distortions may prove useful in probing the nature of auditory processing or sensorimotor adaptation to perceptual alterations.
In general, it is not only important to take account of human perceptual abilities in designing VE systems, but once such systems are available, they can provide a powerful tool for studying these abilities. In addition, it is important to note that auditory spatialization in VEs can serve as a means for increasing the effectiveness of VEs in the area of scientific visualization or, more generally, information understanding. Whether the information is concerned with fluid flow, the internal states of a parallel-processing computer, the physiological condition of a hospital patient, or stock market behavior, the use of auditory spatialization in presenting this information may prove very useful.

Finally, it should be noted that once the constraint of realism is relaxed and one begins to consider the use of new perceptual cue systems to specify auditory spatial properties and one assumes that one can adapt to these new systems, one is faced with the need for defining what one means by a spatialization system. One possible answer is the following: a perceptual cue field is a spatialization system to the extent that (1) the perceptions change in an orderly way when the listener moves his or her ears; and (2) the perceived characteristics of the resulting sound streams that are invariant to movements of the ears can be associated in a stable and meaningful manner with spatial properties.

The chapters included in this book provide the reader with a comprehensive, up-to-date discussion of virtual auditory space and should be of use to readers with a wide range of interests and backgrounds. Also, the seriousness with which the topics are considered should help provide a much needed counter-weight to the vast amount of superficial “hype” that has been associated during the past few years with the general field of “virtual reality.” It is only through the type of work evidenced in this book that truly significant advances are likely to be made.

Nathanial Durlach
REFERENCES 1. Durlach NI, Mavor A. Virtual Reality: Scientific and technical challenges. Washington D.C.: National Academy of Sciences, 1994.
Preface
A virtual environment, put very simply, is just an interface between a human and a complex data set. The data set may relate to the real world as we normally perceive it or to a totally synthetic world. However, the power of the interface is determined solely by how well the data can be mapped onto the human senses. When we consider the amount of data that needs to be processed to enable a person to safely cross a busy street, the astounding information processing capability of the human nervous system becomes apparent. A virtual environment provides the opportunity to map data onto this massively parallel system in a way which would allow humans to interact with these data in ways not previously possible.

Mapping effectively from the data domain to the perceptual domain is itself a huge interdisciplinary task involving engineers of almost every flavor, computer scientists, neuroscientists, psychophysicists, human factors researchers and communications specialists. One of the major challenges for this large and heterogeneous group of researchers is to develop a common vocabulary and an intellectual appreciation of the research implications of different findings across the whole field. This book is an attempt to make a contribution to this interdisciplinary communication. Accordingly, it has been aimed at a wider audience than the more traditional collection of scholarly articles. The discussions have been necessarily limited to just one data channel in the nervous system, namely the auditory channel.

The first half of this book is an information-rich introduction to the psychophysics of our perceptions of auditory space, together with a review of the acoustic cues utilized by the auditory system in generating this perception. Care has been taken to make this material as accessible as possible to the range of disciplines involved in this research area.
The second part of the book examines particular aspects of the implementation of high fidelity virtual auditory space and looks in more detail at a number of current and developing applications of this technology. The first chapter introduces what is meant by the term auditory space and reviews much of the literature examining our perception of auditory space. The engineering challenge in generating virtual auditory space is to produce sounds over headphones that contain the same sets of physical cues that the nervous system uses in generating our percept of auditory space. The second chapter reviews what is known about these cues with a view to guiding the necessary engineering compromises along lines which are informed by the perceptual relevance of different physical characteristics of the sounds at each ear. The third chapter is a tutorial chapter on the digital signal processing techniques commonly
employed in generating virtual auditory space. This chapter is directed to those readers without a computational or engineering background and contains many illustrations of the way in which these techniques can be employed in auditory research. The fourth chapter considers, in some detail, the acoustic techniques involved in recording the so-called head related transfer function (HRTF). This chapter also considers the issues of how the fidelity of the resulting virtual auditory space can be measured. The fifth chapter looks at how these techniques have been employed in neurophysiological research to examine how the auditory system processes the physical cues to a sound's location and, to some extent, how the nervous system represents auditory space. The final chapter examines issues of efficiency in encoding HRTFs and implementing virtual auditory space as well as reviewing some of the most recent applications of this technology in research and development. Each chapter represents an up to date and accessible review of many of the principal issues in this rapidly expanding field and is written by researchers who are at the forefront of this research and development. The Auditory Neuroscience group at the University of Sydney is a multi-disciplinary group examining bioacoustic, perceptual and neurophysiological aspects of how the nervous system encodes auditory space. The first four chapters are contributions from this group with the third chapter written in conjunction with Timothy Tucker, of Tucker-Davis Technology, Inc., one of the foremost designers and suppliers of auditory spatial displays. The last two chapters represent contributions from the members of the Neurophysiology group at the University of Wisconsin at Madison and the Laboratory of Electronics at the Massachusetts Institute of Technology. Both groups have been pioneers in the applications of this technology to their respective research areas. 
It is hoped that this book will be useful across the range of disciplines involved in the development of virtual auditory space by providing an accessible bridge between these disciplines. While the book could be read cover to cover, as each chapter contains new and interesting research reviews and results, different groups are likely to gain most from different sections of the book. For instance, the physiologist and psychophysicist with some experience in auditory research would gain most from chapters 2, 3 and 6, whereas the engineer would gain most from chapters 1, 2, 4 and 6. For those outside the field who wish to gain insight into the principal research questions, chapters 1, 2, 5 and 6 are most likely to be of interest. For those auditory neuroscientists wishing to move into this rapidly growing field, chapters 3 and 4 cover many of the detailed implementation issues.
CHAPTER 1
AUDITORY SPACE Simon Carlile
1. PERCEIVING REAL AND VIRTUAL SOUND FIELDS

1.1. PERCEIVING THE WORLD

One of the greatest and most enduring of intellectual quests is that of self understanding. What we understand and the intellectual models that we manipulate in the process of applying that understanding are intimately related to what we perceive of the world. Our perceptions are in turn related to the structure of our sense organs and to the brain itself. The neurosciences represent a rapidly growing body of knowledge and ideas about the marvelous machinery of the brain1 and are making an increasingly important contribution to this process. There is now a considerable understanding of the basic operation of the five senses of extero-reception: vision, hearing, touch, taste and smell. Our perception of our environment necessarily involves these five senses together with the senses of balance and body position (proprioception). The richness of our perception is clearly heightened by the complex combinations of these senses. For example, the successful restaurant generates a sensual experience that goes well beyond the simple satiation of hunger. The lighting and furnishings generate a mood that is relaxed and comfortable, the smells relate to the food and the conversation of other diners is muted and combines with the background music to generate a sense of communion and yet privacy.

In this book we are interested principally in the mechanisms by which the perception of an illusory or phantom space can be generated; in particular, the generation of virtual auditory space. In most cases this is achieved by presenting over headphones sounds that appear to come from locations in space that are distant from the listener. On the face of it, this might not appear too daunting a task.
Virtual Auditory Space: Generation and Applications, edited by Simon Carlile. © 1996 Landes Bioscience.
An engineer might argue, quite reasonably, that by simply ensuring that the pattern of sound waves delivered over headphones to the ear drum was the same as when the individual was listening to a sound in free space, then the auditory experience should be identical. Indeed this is the very basis of the generation of virtual auditory space (Fig. 1.1). However, as we shall see, this is beset by a number of nontrivial problems that result in compromises in design and implementation. As a consequence, this becomes an issue where engineering solutions need to be guided by our understanding of the processes of hearing that lead to our perception of sounds. Due to a number of biological and evolutionary constraints, many of the operations of the auditory nervous system are quite nonlinear. Therefore, the challenge is to build efficient devices which result in this illusion of auditory space by matching up the necessary engineering compromises and biological constraints.a This kind of challenge can only be effectively met when there is a close association between auditory neuroscientists, psychophysicists and engineers. It is hoped that this book may make something of a contribution to this association.
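The engineer's argument above can be made concrete with a minimal sketch, offered as an illustration only and not as the method developed later in the book. It assumes that the acoustic path from a source to each eardrum can be summarized by a short impulse response per ear, and it filters a single-channel signal through each response to obtain the two headphone signals. The function names and the impulse response values (`toy_hrir_left`, `toy_hrir_right`) are invented for the example; they are not measured data.

```python
# Illustrative sketch only: the impulse responses below are invented toy
# values, NOT measured responses of a real ear.

def convolve(signal, impulse_response):
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(mono, ir_left, ir_right):
    """Filter one source signal through a per-ear impulse response.

    Returns the (left, right) signals that would be sent to the headphones.
    """
    return convolve(mono, ir_left), convolve(mono, ir_right)


# A short test signal and two hypothetical per-ear impulse responses.
# The right-ear response is delayed by two samples and attenuated, as it
# might be for a source on the listener's left.
mono = [1.0, 0.5, -0.25, 0.0]
toy_hrir_left = [0.9, 0.3, 0.1, 0.0]
toy_hrir_right = [0.0, 0.0, 0.5, 0.2]

left, right = render_binaural(mono, toy_hrir_left, toy_hrir_right)
```

In this toy case the right-ear signal starts two samples later and is smaller than the left-ear signal. Direct convolution is used here only because it is easy to read; it treats the ear as a linear filter, which is exactly the simplification that the text goes on to qualify.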
a This is a point sometimes missed in the design of high fidelity audio systems where the drive for system linearity can result in over-engineering when compared to resolution of the final receiver, the human ear.

1.2. DIMENSIONS OF THE PERCEPTION OF AUDITORY SPACE

Under normal listening conditions, the perception generated in a listener by a sound emitted from a single source is generally that of a particular auditory object.2 It has been argued that the auditory system has evolved for the detection of objects which generally correspond to sources of sounds.3 Consequently, an auditory object is mapped onto the physical attributes of its source. A talker is a person (or possibly an electronic sound source), a bark is a dog, a snap can be a twig breaking etc. However, when we as observers and commentators begin to classify the perceptual qualities of an experience we begin to indulge in a theory-dependent exercise. That is, we necessarily make assumptions about the nature of the stimulus and begin to map perceptual quality onto presumed physical attributes. The interpretation that we place on our perception then is inextricably linked to our expectations about the world. For instance, we can easily attribute to the source a spatial position with respect to the listener. In localizing a sound source we assign a two dimensional direction to the sound source and we estimate how far away the source is.

Things get a little more complicated when we consider factors such as the extent of the source. The idea that a sound has a spatial extent could possibly be mapped onto some notion of the size of the object emitting the sound. However, there is a body of psychophysical work which indicates that extent tells us something about the environment within which we are listening to the source.4 The term ‘spaciousness’ has been coined, particularly by architectural acousticians, to describe this perceptual quality (see Blauert5 for discussion). The great concert halls of the world are designed with the physical attributes necessary to generate this quality. In a large concert hall the sense of spaciousness results primarily from an interaction between the primary incident wavefront, generated by the performer, and the lateral reflections combined with the reverberation. When a sound is ‘spacious’ the listener feels surrounded by or immersed in the sound and, at least for music, this tends to increase the emotional impact of the sound. The amount of reverberance in an environment determines to some extent the ability to localize a single source.b Therefore, in some sense, the spaciousness of a sound is at one end of a perceptual dimension where accurate localization of a discrete source is at the other.c

Fig. 1.1. (a) When we listen to sounds over headphones the source of the sound is generally perceived to be inside the head. If we vary the signals at each headphone so that, as in the case illustrated in the figure, the signal is of greater amplitude in the left ear and arrives earlier in the left ear, the apparent source of the sound will appear closer to the left ear. (b) If a sound source is located off the midline in free space, in this case close to the left ear of the listener, the sound will be of greater amplitude in the left ear and arrive earlier in the left ear. In contrast to the figure above, under normal listening conditions, the sound is also filtered by the outer ear before it is encoded by the auditory nervous system. These effects are illustrated by the time/pressure graphs on each side of the head and represent the pressure waves generated by a particular sound source. In this case we perceive the sound to be located in free space away from the head. (c) To generate the illusion of a sound in free space, the pattern of sound waves that would have been produced by a sound in free space is generated over headphones. This is achieved by taking into account the normal filtering effects of the outer ear. In this case, the illusion is generated of a sound source at a particular location outside the head. Reprinted with permission from Carlile S and King AJ, Cur Biol 1993; 3:446-448.

The foregoing discussion serves to underline the important notion that sound alone is not necessarily sufficient for the generation of our perception of our auditory world. Certainly, sounds generate in us certain sensations but the perception that results from these sensations can be dependent on other factors. These can include the expectations we have about the nature of the sound sources and the environment within which we are listening to these sources. These are sometimes referred to as ‘cognitive’ factors or ‘top down’ elements of perception. However, as we shall see later, auditory perception is also dependent on other factors which are not necessarily ‘cognitive’ in origin.

With these cautions in mind we can start to draw some preliminary conclusions about the dimensionality of our perception of auditory space. If we initially restrict our considerations to a single sound source in an anechoic field, then the two obvious perceptual dimensions are direction and distance of the source relative to the head. The direction can be indicated using a familiar coordinate system such as azimuth angle with respect to the frontal midline and elevation angle with respect to the audio-visual horizon.d The perception of distance is relative to our egocentric center.
b A highly reverberant environment results in an increase in the incoherence of the sound waves at each ear, thereby degrading the acoustic cues used by the auditory system in determining spatial position (see chapter 2, section 1).

c The emotional impact of music in such situations may be related to our restricted ability to localize the source. In an evolutionary context, accurate localization of a particular source might have very important survival consequences. In fact the principal evolutionary pressure on hearing may well be the ability to localize a presumed predator (or prey). In the musical context the inability to localize the source and indeed the notion of being immersed or consumed by the source may add some emotional frisson to the experience.

d This coordinate system relies on a single pole system like that used to describe global location on the planet. Other coordinate systems are sometimes employed to describe sound location and these are described in greater detail in chapter 2, section 1.4.2.
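As a small worked illustration of the single pole coordinate system described above, the sketch below converts a direction given as azimuth and elevation, together with an egocentric distance, into Cartesian coordinates centered on the head, and back again. The axis convention is an assumption made for the example and is not taken from the text: x points to the listener's right, y points straight ahead along the frontal midline, z points upward, and positive azimuth is toward the right. The function names are likewise hypothetical.

```python
import math


def direction_to_cartesian(azimuth_deg, elevation_deg, distance=1.0):
    """Single pole (latitude/longitude style) angles to head-centered x, y, z.

    Assumed convention (not from the source text): azimuth 0 is straight
    ahead and increases toward the right; elevation 0 is the horizon and
    increases upward.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance * math.cos(el) * math.sin(az)  # toward the listener's right
    y = distance * math.cos(el) * math.cos(az)  # straight ahead
    z = distance * math.sin(el)                 # upward
    return x, y, z


def cartesian_to_direction(x, y, z):
    """Inverse mapping: returns (azimuth_deg, elevation_deg, distance)."""
    distance = math.sqrt(x * x + y * y + z * z)
    if distance == 0.0:
        raise ValueError("direction is undefined at the egocentric center")
    azimuth = math.degrees(math.atan2(x, y))
    # Clamp guards against rounding pushing the ratio just outside [-1, 1].
    ratio = max(-1.0, min(1.0, z / distance))
    elevation = math.degrees(math.asin(ratio))
    return azimuth, elevation, distance


# A source 30 degrees to the right, 20 degrees above the horizon, 1.5 m away.
x, y, z = direction_to_cartesian(30.0, 20.0, 1.5)
az, el, dist = cartesian_to_direction(x, y, z)
```

As with longitude at the geographic poles, azimuth becomes ill-defined directly above or below the head in this system, which is one reason other coordinate systems are sometimes preferred.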
1.3. THE NATURE OF THE AUDITORY STIMULUS
Much of the previous research effort in auditory neuroscience has followed a formal reductionist line and concentrated on how the auditory system encodes simple sounds. The stimuli used in such experiments are typically short bursts of white noise or pure tones, sometimes modulated in frequency or amplitude. These are generally presented using headphones or closed field sound systems sealed into the external canal. The underlying methodological philosophy assumes that if the system acts linearly and superposition applies, then the encoding of more complex sounds could be understood in terms of the encoding of ‘elementary’ sounds. Over the last 100 years or so this approach has provided a considerable amount of important information about auditory processing of simple sounds but there is not yet a complete picture. However, what is beginning to become clearer is that this simple ‘bottom up’ approach may not be able to provide the necessary intellectual and methodological tools to examine the processing of ecologically appropriate stimuli.

Recent work has reinforced the notion that the principle of superposition does not necessarily apply to the analysis of many combinations of sounds. The auditory system tends to analyze sounds differently depending on various parameters of the sound; for instance, when we are dealing with very short duration sounds the binaural auditory system tends to analyze sounds synthetically rather than analytically.6 In the latter type of processing, sound is broken up into various components (frequency, level, time of arrival) and then parsed into different potential auditory objects. In contrast, synthetic processing tends to result in a single auditory object whose characteristics are dependent on some type of vector sum of the components. These different modes of processing may well have an ecological rationale.
If we accept that very short or transient sounds are unlikely to have resulted from a combination of sources (the inadvertent sounds made by a predator may well fit into this category, for instance), the efficiency of processing the location of this sound may be of paramount importance. Synthetic processing of a sound may be the result of such a strategy. On the other hand, analytic processing is likely to be more computationally expensive and therefore time consuming. Such a strategy may be reserved for longer duration sounds such as communication sounds which require discrimination along other dimensions. A further limitation of a simple reductionist approach is that the elementary stimuli are unlike most sounds that are likely to be encountered in the ‘real world.’ One point of view is that the auditory system never evolved to detect and analyze such sounds.3 Following from this, it would seem questionable as to whether probing the system with such sounds will lead to a clear picture of its normal processing. There is no doubt that the system can encode such sounds but the question is whether the responses one elicits with such stimuli
bear much relationship to the kinds of processing of more ecologically valid stimuli. Certainly, the perceptual experience generated by such stimuli is clearly impoverished. Sounds presented over headphones are generally perceived as coming from a phantom source within the head rather than outside; that is, they have zero egocentric distance. By varying a number of characteristics of the sounds at each ear, the apparent source of the sound can be made to move closer to one ear or the other but still lacks any 3D directional quality. Such sounds are said to be lateralized within the head rather than localized in external space. There are very few natural listening experiences that result in such an auditory illusion. So what are the advantages of using a headphone stimulus system? The short answer is one of stimulus control. By delivering sounds over headphones it is possible to carefully control the characteristics of the sound delivered to each ear. This makes possible a highly reproducible stimulus and greater rigor in experimental design. This kind of stimulus control also makes possible a whole class of experiments which would be impossible using a sound presented from a loudspeaker in the free field. As we shall consider in some detail below, the mammalian auditory system has two ears, each sampling the sound field under slightly different conditions. The differences in the inputs to each ear are used by the auditory system in a variety of tasks; for instance, determining the location of a sound source or separating out a sound of interest from background noise. Using headphones, the differences in the sounds at each ear can be manipulated in ways which would be very difficult using sound sources placed away from ears in the free field. So, despite its obvious perceptual limitations, the closed field or headphone presentation of stimuli still provides a powerful experimental tool.
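The kind of stimulus control described above can be illustrated with a minimal sketch that builds a two-channel noise burst with an imposed interaural time difference and interaural level difference. The function name `dichotic_noise`, its parameters and its sign conventions (positive values favor the right ear) are assumptions made for illustration only; the delay is rounded to a whole number of samples, which a real experimental system would probably refine.

```python
import numpy as np

def dichotic_noise(duration_s, fs, itd_s, ild_db, seed=0):
    """Return an (n, 2) array [left, right] of broadband noise.

    Assumed conventions (illustrative only):
      itd_s  > 0 : the right ear leads the left ear by itd_s seconds
      ild_db > 0 : the right ear is ild_db decibels more intense than the left
    """
    rng = np.random.default_rng(seed)
    n = int(round(duration_s * fs))
    noise = rng.standard_normal(n)

    # Interaural time difference, rounded to a whole number of samples.
    lag = int(round(abs(itd_s) * fs))
    lagging = np.concatenate([np.zeros(lag), noise])   # delayed copy
    leading = np.concatenate([noise, np.zeros(lag)])   # undelayed copy, same length
    left, right = (lagging, leading) if itd_s >= 0 else (leading, lagging)

    # Interaural level difference, split symmetrically between the two ears.
    g = 10.0 ** (ild_db / 40.0)
    return np.column_stack([left / g, right * g])

# Example: 50 ms burst, right ear leading by 500 microseconds and 6 dB more intense.
stimulus = dichotic_noise(0.05, 48000, 0.0005, 6.0)
```

Played over headphones, a stimulus of this kind would be expected to be lateralized towards the leading, more intense ear rather than localized in external space, which is the limitation discussed above.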
1.4. A VIRTUAL SOUND FIELD
1.4.1. Generation and utility
If we use headphones to generate a sound field at a listener’s eardrums that is identical to the sound field that is normally generated by a sound source in the free field, then the listener should perceive the sound source as existing in the free field; that is, in virtual auditory space (VAS; Fig. 1.1). In contrast to the stimulus methodology described in the previous section, a complex sound presented in virtual auditory space is a highly ecological stimulus. Under properly controlled conditions, the percept generated in the listener is of a sound emanating from a source located away from the head at a particular location in space. Clearly, this is also an illusion, but in this case the illusion is one which better approximates the normal listening experience. From a research point of view, stimuli presented in virtual auditory space promise to provide a most powerful tool for investigating many important and outstanding questions. Such a presentation method
combines the stimulus control offered by headphones together with the ecological validity of a real free field sound source. Additionally, as these signals are usually generated using digital signal processing techniques and fast digital-to-analog converters, it is a relatively simple task to perform complex manipulations of the signals before presentation (chapter 3). The usefulness of this technique for auditory research is almost entirely dependent on how well the illusory sound field corresponds to the real free field. Clearly, any experiment which relies on manipulation of the virtual sound field to expose auditory processing strategies will be confounded if the original virtual field is a poor approximation to a real free field. Chapter 4 canvasses some of the difficult acoustical issues involved in generating high fidelity VAS which result in acoustical compromises in its implementation. Therefore, the question of fidelity of a virtual sound field is principally a perceptual issue rather than an acoustical issue. As such, it becomes operationally defined and based on some behavioral or psychophysical test. In the remainder of this section we will consider what kind of psychophysical tests might be used to determine the fidelity of a virtual sound field.
1.4.2. Tests of fidelity
One of the most clearly understood aspects of auditory behavior relating to a sound field is the capacity of a subject to localize the source of a sound within that field. Thus, the fidelity of VAS could be determined by comparing the ability of a subject to localize an auditory target within VAS with that in the free field. However, there are a number of important factors that need to be considered if such an approach is to be useful. For instance, it is well known that there are differences between individuals in their accuracy of sound localization7-10 (chapter 1, section 2), therefore this factor should also be taken into account when assessing VAS fidelity.
The type of localization task used in such a test is also important. Clearly, the power of any test is related to the specificity of the question that is asked and, in the context of auditory localization, the mechanisms that are tested are intimately related to the kind of stimulus that is employed. The simplest form of localization relies on a homing strategy. In this case the sound detector need only be able to code stimulus level and its output integrated with movement of the detector throughout the sound field. The only requirement for the target stimulus is that it be continuous or at least very repetitive. Scanning the sound field is a second and slightly more sophisticated localization strategy. In this case the sound receiver has to be directional but it need only be rotated in the sound field. Thus, scanning is not dependent on translocation of the receiver with respect to the source. Again, sound level is encoded and integrated with rotation of the receiver to provide the directional information.
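The scanning strategy can be sketched in a few lines. The example below assumes a hypothetical receiver with a cardioid directivity pattern that is rotated in fixed steps while its output level is recorded; the heading giving the largest output is taken as the estimate of source direction. The directivity pattern, the step size and the function names are illustrative assumptions, not a model of any particular animal or device.

```python
import numpy as np

def cardioid_gain(angle_off_axis_rad):
    """Hypothetical directional sensitivity: 1 on axis, 0 directly behind."""
    return 0.5 * (1.0 + np.cos(angle_off_axis_rad))

def scan_for_source(source_az_deg, step_deg=5.0):
    """Rotate the receiver through 360 degrees and return the heading of maximum level.

    Only the output level at each heading is used, so the source must remain
    on (or repeat) for the whole duration of the scan.
    """
    headings = np.arange(0.0, 360.0, step_deg)
    off_axis = np.deg2rad(headings - source_az_deg)
    level = cardioid_gain(off_axis)
    return headings[np.argmax(level)]

# Example: a continuous source at 42 degrees azimuth, scanned in 5 degree steps.
estimate = scan_for_source(42.0, step_deg=5.0)
```

The resolution of the estimate is limited by the step size of the scan, and the whole procedure presupposes a stimulus that outlasts the scan.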
If the duration of the stimulus is very short and nonrepetitive these two localization strategies will fail. The capability to localize the source of a transient stimulus represents a much more sophisticated capability than that of homing or scanning. Biologically, this is achieved by using two receivers which sample the sound field under slightly different conditions; in the case of the mammal, the two ears are generally found on each side of the head. The inputs to the two ears are compared by the auditory system (binaural processing: see chapter 2) to extract a variety of cues to the location of the sound source. A whole range of auditory stimuli can be localized by such a mechanism but its particular utility is in the localization of sounds which are so brief that homing and scanning strategies are not possible. However, for some stimuli, such as pure tones with no amplitude or frequency modulations, even this localization mechanism is not perfect and can lead to large errors. It is not surprising that narrow frequency band sounds are often exploited as warning signals by groups of animals.11 There is a clear evolutionary advantage for a group of animals in being made aware of the presence of danger such as a predator. However, there is clearly no individual advantage if such a warning signal can be easily localized and the hapless sentry exposes himself to attack! So the choice of warning signals which are particularly difficult to localize represents the evolutionary compromise. Transient stimuli also represent a special class of stimuli which are likely to have a high ecological significance. The inadvertent sounds of approach, particularly in an environment with plenty of vegetation, are most likely to be generated by snapping twigs or rustling of leaves. Both are short duration sounds containing a wide range of frequencies.
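That brief sounds span a wide range of frequencies can be checked numerically. The sketch below uses an idealized rectangular click (an assumption; natural transients such as a snapping twig are far more complex) and estimates the frequency at which its magnitude spectrum first falls 3 dB below its value at 0 Hz. The function name, sampling rate and FFT length are illustrative choices.

```python
import numpy as np

def bandwidth_3db(duration_s, fs=192000, n_fft=1 << 16):
    """Frequency (Hz) at which the spectrum of a rectangular click first drops by 3 dB.

    If the spectrum never drops by 3 dB (a single-sample click), the Nyquist
    frequency is returned.
    """
    n = max(1, int(round(duration_s * fs)))
    click = np.zeros(n_fft)
    click[:n] = 1.0
    mag = np.abs(np.fft.rfft(click))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    below = np.nonzero(mag < mag[0] / np.sqrt(2.0))[0]
    return freqs[below[0]] if below.size else freqs[-1]

# Shorter clicks have broader spectra.
for duration in (1e-3, 1e-4, 1e-5):
    print(f"{duration * 1e6:7.0f} us click -> {bandwidth_3db(duration):9.0f} Hz")
```

For a rectangular pulse the 3 dB point lies near 0.44 divided by the duration, so each tenfold shortening of the click widens the spectrum roughly tenfold.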
Indeed, the shorter the duration of the transient, the closer it approximates a delta function and the broader the range of frequencies that it contains (see chapter 3). Such sounds might result from inefficient stalking by a predator and are thus highly significant in terms of survival. In survival terms the most important attribute of such a sound is its location. The foregoing discussion suggests that the clearest test of the fidelity of a particular virtual sound field would be the capacity of a subject to localize a transient stimulus. Such a stimulus places the greatest processing demands on the auditory system and is dependent upon the widest range of acoustic cues to source location. If a particular virtual sound field fails to provide these cues, presumably because of the compromises made in its implementation, then there should be greater localization error in the virtual field compared to localization by the same subject in the free field. In the following chapters the methods by which localization ability can be assessed will be reviewed and questions of the spatial resolution of these methods will also be considered. Obviously, the methodology employed in such a test of fidelity must be sufficiently sensitive to be capable of detecting
perceptually relevant differences in the virtual and free sound fields. Clearly, if VAS is to be used in advanced auditory research or in mission critical applications, it is insufficient for a designer or engineer to simply listen to the virtual sound field and decide that it satisfies the necessary criteria because the sounds appear to come from outside the head and from roughly the correct locations.
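The comparison proposed in this section, localization error in the virtual field against localization error by the same subject in the free field, can be sketched as follows. The sketch assumes that target and response directions are given as azimuth and elevation in degrees in a single pole coordinate system, and uses the great-circle angle between them as the error measure; the function names and the choice of error measure are illustrative assumptions rather than the procedure of any particular study.

```python
import numpy as np

def unit_vector(az_deg, el_deg):
    """Convert azimuth/elevation in degrees to Cartesian unit vectors."""
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    return np.stack([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)], axis=-1)

def angular_error_deg(target, response):
    """Great-circle angle between targets and responses.

    Both arguments are (n, 2) arrays of (azimuth, elevation) in degrees.
    """
    t = unit_vector(target[:, 0], target[:, 1])
    r = unit_vector(response[:, 0], response[:, 1])
    cos_angle = np.clip(np.sum(t * r, axis=1), -1.0, 1.0)
    return np.rad2deg(np.arccos(cos_angle))

def fidelity_difference(target, free_field_response, vas_response):
    """Mean error in VAS minus mean error in the free field for one subject (degrees)."""
    vas = angular_error_deg(target, vas_response).mean()
    free = angular_error_deg(target, free_field_response).mean()
    return vas - free
```

A difference near zero for each subject would be consistent with a high fidelity virtual field. Because individuals differ in localization accuracy, the comparison is made within subjects rather than against a group average.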
1.5. THE REPRESENTATION OF AUDITORY SPACE IN THE CENTRAL NERVOUS SYSTEM
In the central nervous system the way in which auditory space is coded is very different from the other sensory representations of external space, particularly those of visual space or of the body surface. This has important implications for the way in which we might expect the auditory system to process information and for the specific characteristics of a sound that are important in generating our perception of the auditory world. These fundamental differences in processing also flag a caution about using analogies imported from different sensory systems in our attempts to understand processing by the auditory system. The fundamental differences between these systems have their origin in how the sensory information itself is encoded. In the visual system, light from in front of the eye enters through the pupil and strikes the light sensitive receptors in the retina at the back of the eye. Thus, the resulting pattern of neural activity in the retina corresponds to the spatial pattern of light entering the eye. Broadly speaking the visual system is working like a camera and takes a picture of the outside world. That is, the visual field is mapped directly onto the retina which then makes connections with the brain in an ordered and topographic manner. Thus, visual representations are said to be topographic in that there is a direct correspondence between the location of activity in the neural array and the spatial location of the visual stimulus. In other words, the spatial patterns of neural activity that occur in the visual cortex correspond directly to the patterns of activity in the retina which in turn correspond to the pattern of light entering the eye.e The primary sensory coding by the auditory system is very different from the visual system. Sound is converted from mechanical energy to neural signals in the inner ear. The inner ear, however, breaks down the sounds as running spectra and encodes the amplitude and phase of each frequency component.
e This topographic pattern of activity is preserved across a large number of cortical fields but as processing becomes more advanced from neural field to neural field, the topographical pattern tends to become increasingly blurred as this topographic map is sacrificed for the extraction of other important visual features such as motion, form or colour (see refs. 74 and 75).
Due to a number of biological
limitations, the ability to encode phase decreases as a function of increasing frequency. What is most different with this encoding scheme compared to the visual system is that the spatial pattern of neural activity across the auditory receptors (and subsequently the auditory nuclei in the central nervous system), reflects the frequency content of the sound and not the spatial location of the source. Therefore, the processes that give rise to neural representations of auditory space and indeed our perception of auditory space must be based on other information that is extracted from the auditory inputs to one or both ears. That is to say, space perception is based upon a highly computational neuronal process. In the visual system, the sensory ‘primitive’ is the location of the origin of a ray of light and the emergent perceptual components are, say form or motion. By contrast, for the auditory system the sensory primitive is sound frequency and space is one emergent component. Thus, while the auditory nervous system clearly generates some kind of a representation of auditory space, the mechanisms by which this arises are very different to how space is encoded in the other senses that deal with the place of objects in the external world.
1.6. AN OVERVIEW OF THE FOLLOWING REVIEW
The foregoing introduction has been necessarily eclectic and discursive in an attempt to illustrate the range of issues that should be considered when the proper implementation and applications of virtual auditory space are considered. In the following sections of this chapter and in the following chapter we shall consider, in a more systematic and comprehensive manner, many of the issues that have been touched on above. Auditory localization of single sound sources under anechoic conditions is probably the best understood process involving a free sound field. As is suggested above, performance testing based on this process probably provides the best test of the fidelity of a virtual sound field. For this reason, in the second section of this chapter we will review the current state of knowledge of human sound localization abilities. Humans seem to localize a sound source quite accurately although some nocturnal predators, notably the owl, do somewhat better. It may be that such performance differences result from differences in processing strategies by the auditory nervous systems of these animals. However, while there are known to be structural differences in the auditory nervous systems of the owl compared to that of the human, it is not clear whether these differences simply reflect a different evolutionary heritage or real differences in the processing strategies. Another very important difference underlying the variations in localization performance between species is likely to be in the quality of the acoustic cues to spatial location that are generated at the outer ears. For instance, there are major differences in the structures of the ears of owls and humans and the acoustics of these structures are known
in some detail. In chapter 2 we will consider the physical cues to sound location that are generated at the auditory periphery of the human. The major theories of how we perceive sounds in auditory space have been built upon an understanding of the physical cues to sound location and the capacity of the auditory system to encode those cues. There are a number of important biological limitations to the processes by which these cues are encoded by the central nervous system. Therefore, the fact that a possible cue is present at the auditory periphery by no means indicates that this cue is utilized by the auditory nervous system. In the second section of chapter 2 some of the experiments that have examined the sensitivity of the auditory system to the physical cues to a sound’s location are examined.
2. SOUND LOCALIZATION BY HUMAN LISTENERS
2.1. ACCURACY AND RESOLUTION IN AUDITORY LOCALIZATION
2.1.1. Introductory observations
There are two main approaches to assessing the capacity of the auditory system to localize a sound source: (i) assessing absolute localization accuracy;9,12 or (ii) determining the minimum audible change in the location of a stimulus, the so-called minimum audible angle (MAA).13,14 An important distinction between these two approaches is that the first examines localization ability per se, while the second examines the capacity of the auditory system to detect changes in any or all of the cues to a sound’s location. That is, in an MAA experiment, two stimuli may be distinguished as being different (by virtue of the small differences in spatial location) but the subject may still be incapable of accurately localizing either of the stimuli or even assessing the magnitude or direction of the vector of difference. While the assessment of MAA can provide important information about the quantum of information in auditory processing, it does not necessarily relate to the processes that lead to our perception of auditory space. On the other hand, the detection of small differences associated with slight variations in the locations of sources may provide insights into other auditory processes that rely on differences in the signals arriving at each ear. For the remainder of this section we will concentrate on experiments that have examined absolute localization accuracy rather than MAA. There have been a number of studies examining sound localization accuracy and several excellent recent reviews.15-18 Rather than going systematically through this large literature I will concentrate on some of the general issues that have importance for the generation, validation and applications of virtual auditory space.
One general observation is that, to date, most localization experiments have been conducted under quite controlled acoustic conditions. Clearly the motivation for such an experimental approach is the desire to move from the simple to the complex in experimental design. In these experiments the testing environment is generally anechoic, the stimuli are generally broadband and of short duration and presented from a fixed number of source locations to a subject whose position is also generally fixed in space. As a consequence, after the first few stimulus presentations, the subject will have considerable knowledge about the stimulus spectrum and the acoustic environment. This is of course a very unnatural listening situation in that most sounds of interest are likely to have time variant spectra and the listening conditions are also likely to be constantly changing with head movements, variations in the number and locations of other sound sources and the variation in the geometry of reflecting surfaces as one moves about the environment. Thus while current work may provide insights into the limits of our sensory coding of auditory space, we should remain cautious about what the current state of knowledge can tell us about sound localization in a real world situation.
2.1.2. Methodological issues
There are two main methodological issues that need to be considered: (i) how the position of a sound source is varied; and (ii) how the subject indicates where the sound source is perceived to be. These issues are discussed in some detail in chapter 4 (section 5.1) and are only briefly considered here. Varying the location of a test stimulus has been achieved by using either a static array of possible sources or by using a single moveable sound source placed at a number of locations about the subject. In the first case it is often possible for the subject to simply indicate a number identifying which speaker a stimulus was perceived to have come from.
In the second case, as there is only a single target, the localization experiments are usually carried out in complete darkness and the subject is required to indicate the location of the source by pointing or noting the location coordinates in some way. A number of studies have shown that localization performance can be influenced by foreknowledge of the potential target locations as would be the case when a subject is faced with an array of speakers from which it is known that the target will come. Under normal conditions localization is a continuous spatial process so that constraining or quantizing the subject’s responses places artificial bounds on the subject’s behavior and may also bias our analyses of this behavior (chapter 4, section 5.1). For these and other reasons discussed later we will be principally concerned here with those studies that have used continuous variations in the location of the target.9,10,12,19-23 These studies have used a variety of techniques to indicate the sound location including pointing with a hand held gun, pointing the face towards the target and tracking the position of the head, or simply having the subject call out the coordinates of the apparent location of the source. We have found that, with appropriate training, pointing the head towards the target and tracking the head location is a highly efficient and reliable method for indicating perceived target location10 (Fig. 1.2).
2.1.3. Two types of errors in absolute localization
Using brief bursts of broadband noise, two different types of localization errors can be demonstrated:
(i) large localization errors associated with a front-to-back or back-to-front reversal of the apparent target location; that is, the location of the target is indicated correctly with respect to the median plane but the front-back hemisphere is confused.
Fig. 1.2. The figure shows a subject inside the anechoic chamber at the Auditory Neuroscience Laboratory (University of Sydney). The subject stands on a raised platform in the center of the chamber. The robot arm, which carries the sound source, is suspended from the ceiling such that rotation of the vertical frame varies the azimuth location of the source. The inner hoop, which actually carries the speaker, is driven by small stepper motors on either side of the main frame; one such motor and its gearing can be seen to the left of the picture. The task of the subject in a localization experiment is to turn and point her nose at the speaker at each test location (the experiments are carried out in complete darkness). The small cap on the head of the subject carries a 6 degrees of freedom tracking receiver which indicates the location of the head and the direction towards which the subject is pointing. The subject indicates completion of each task by pressing the button seen in this subject’s left hand.
(ii) variations in the perceived location relatively close to the actual target. The difference in the character of these errors implies the failure of different localization processes. When broadband stimuli are used, front-back localization error is found in between 6%f,9 and 3%23 (see ref. 12, 4%; also Parker, personal communication; ref. 7, 5.6%) of localization judgments.g This indicates that, under these listening conditions, there is some ambiguity in the perceived location of the sound source. It is important to note that these kinds of errors are likely to be strongly affected by the foreknowledge that subjects may have about the configuration of potential stimulus locations, particularly where these are limited in number or limited to a particular spatial plane.24 Such foreknowledge may help the subject in resolving the perceptual ambiguity of some stimuli so that their performance no longer represents the simple perceptual aspects of the task. Regardless of the experimental approach, a general observation for broadband stimuli is that the accuracy of localization varies as a function of the location of the target (Fig. 1.3). In general, human subjects demonstrate the smallest localization errors and the smallest minimum audible angles for targets located about the frontal midline at around the level of the audio-visual horizon (the plane containing the eyes and the interaural axis). In studies using continuous variation of sound locations, the absolute accuracy of localization varies across studies, presumably reflecting methodological differences such as the spectral content of the stimulus and the method of indicating the location of the stimulus. We have found,10 using the head pointing technique described above, that for sound locations on the anterior midline and ±20° about the audio-visual (AV) horizon the variation of the horizontal component of localization is between 2° and 3° with the variation in the vertical estimates between 4° and 9°.
f This may represent an underestimate of the number of front-back confusions in this study as Makous and Middlebrooks did not test locations directly behind the subject. g There is no generally accepted definition of what constitutes a front-back confusion for locations close to the interaural axis. For instance if a sound is located 10° behind the interaural axis but is perceived to be located in front of the interaural axis, does this represent a front-back confusion? The small differences in the front-back confusion rate may well reflect variations in the criteria of identification between studies. The main point from these data, however, is that the front-back confusion rate is relatively low across all of the studies.
However, for locations around the interaural axis the horizontal variation increases to between 8.5° and 13° and variation in the estimates of the vertical locations is between 6° and 9°. For posterior locations close to the AV horizon the variation in
the estimates for horizontal components ranges between 8° and 12° and for the vertical components between 7° and 10.5°. In general the errors in localization increase towards the extremes of elevation. Makous and Middlebrooks9 report similar variations in localization accuracy to those we have found, although in that study the errors reported for the posterior locations were generally larger. Although there are a number of differences in the stimuli between these and previous studies, the
Fig. 1.3. The mean localization accuracy from 9 subjects is shown together with an estimate of the variance of the location estimates. Each sphere represents the hemisphere of space surrounding the subject as indicated on each plot. The filled circle indicates the location directly in front of the subject (Azimuth 0° Elevation 0°). The actual location of the target is indicated by the small cross at the origin of each ray. The center of each ellipse indicates the mean location (azimuth and elevation) of six localization trials for each subject. The variance of the azimuth and elevation components estimate is indicated by the extent of the ellipse. The distributions of the localization estimates for each target position are described by a Kent distribution.72 Data from Carlile et al.10
smaller localization errors found in these two studies compared to previous studies probably reflect differences in the methods by which the subjects indicated the perceived location of the target (see chapter 4, section 5.1). Furthermore, it is also not entirely clear to what extent the spatial variations in the localization acuity can be attributed to sensory limitations or to the methods employed by the subjects to indicate the perceived location (however see refs. 9, 10). Nevertheless, the fact that a consistent general pattern of the spatial variation of localization accuracy is seen across all studies using very different methodologies supports the notion that these differences are, in large part, attributable to sensory effects.
2.1.4. Localization accuracy is dependent on the stimulus characteristics
Another general finding that emerges from numerous different studies is that the ambiguity of a sound’s location increases when the bandwidth of the stimulus is restricted. This is manifest as an increase in the number of front-back confusions.4,22-24,26-28 Decreasing stimulus bandwidth also results in a general decrease in localization accuracy. Butler28 found that, following correction of the front-back confusions, there was a progressive increase in localization error as the bandwidth of a noise centered on 8 kHz was decreased from 8 kHz to 2 kHz. These observations indicate that, for accurate localization, spectral information across a wide range of frequencies is required. The complex folded structure of the outer ear has also been shown to play a very important role in this process (Fig. 1.4). Increases in the number of front-back confusion errors have also been reported when the concavities of the outer ear were filled with plasticine but the auditory canal was left patent.19,29,30 This further demonstrates that the interactions between a broadband stimulus and the structures of the outer ear also provide important localization cues.
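The text does not specify how front-back confusions were identified or corrected in any of the studies cited. One simple possibility, offered here only as a sketch, is to flag a response that falls in the opposite front-back hemisphere from the target and to reflect it about the interaural axis. The azimuth convention (0° directly ahead, ±90° on the interaural axis, ±180° directly behind), the function names and the `guard_deg` exclusion zone around the interaural axis are all assumptions; the last reflects the lack of an agreed criterion for locations close to that axis.

```python
def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def is_front(az_deg):
    """True if the azimuth lies in the frontal hemisphere."""
    return abs(wrap_deg(az_deg)) < 90.0

def mirror_front_back(az_deg):
    """Reflect an azimuth about the interaural axis (front <-> back)."""
    return wrap_deg(180.0 - az_deg)

def resolve_front_back(target_az, response_az, guard_deg=0.0):
    """Return (corrected_response_az, was_confusion).

    A response counts as a front-back confusion when it lies in the opposite
    hemisphere from the target, unless either lies within guard_deg of the
    interaural axis, where the classification is ambiguous.
    """
    t, r = wrap_deg(target_az), wrap_deg(response_az)
    near_axis = (abs(abs(t) - 90.0) <= guard_deg or
                 abs(abs(r) - 90.0) <= guard_deg)
    confused = (not near_axis) and (is_front(t) != is_front(r))
    return (mirror_front_back(r) if confused else r), confused

# Example: target 20 degrees right of the midline, response reported behind the subject.
corrected, confused = resolve_front_back(20.0, 165.0)
```

Changing `guard_deg` changes the reported confusion rate, which is one way in which differing criteria could produce small differences in confusion rates between studies.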
As is discussed in detail in chapter 2, the outer ear filters the sound across a wide range of frequencies. The exact characteristics of this filtering vary as a function of the location of the sound source, so providing the so-called ‘spectral cues’ to sound location. The link between the spectral characteristics of a sound and its location has been examined extensively in the context of sound locations on the median vertical plane. On the basis that the head and ears are symmetrically arranged, it has been generally argued that interaural differences are uniformly zero for median plane locations; thus the elevation of a sound’s location on this plane must be indicated by variations in the spectral content of the signal produced by pinna filtering. However, from an analysis of localization data using a decision theory approach31 and careful acoustical recording from each ear32-34 it is clear that, at least for the subjects examined, there are often marked acoustical asymmetries that lead to significant interaural level differences for sounds on the median plane. Notwithstanding this
Fig. 1.4. A simple line drawing showing the main features of the complexly convoluted structure of the outer ear. The main functional components are (a) the pinna flange comprising helix, anti helix and lobule, (b) the concha including the cymba and the cavum and (c) the ear canal connecting to the floor of the concha (not shown). Adapted with permission from Shaw EAG. In: Keidel WD et al, Handbook of Sensory Physiology. Berlin: Springer-Verlag, 1974:455-490.
problem, a number of studies have indicated that the apparent location of a sound source on the median plane can be varied by manipulating the stimulus spectrum rather than the actual location of the source.4,35,36 The perception of the vertical location of sounds presented over headphones is associated with the spectral ripple produced by comb filtering using a delay and add procedure.37 Such an approach was suggested by the work of Batteau38 who argued that sound locations could be coded by multiple delays provided by the complex sound paths of the outer ear. Although he suggested a time domain analysis of the input signal it seems more likely that the auditory system analyzes the resulting comb filtered inputs in the frequency domain39 (see chapter 2 section 1.8.2 and chapter 6 section 2.3.2). Consistent with the role of the outer ear in providing these spectral cues, manipulation of the outer ear by filling the concavities of the pinna has also been found to reduce localization accuracy for sounds on the median plane.19,35,40,41 However, some care must be taken in interpreting many of these data as most studies have employed a small number of visible sound sources and thus constrained the subject’s response choices (see section 2.1.2).
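The delay and add procedure mentioned above is simple to sketch: a delayed copy of the signal is added to the original, which produces a rippled (comb filtered) spectrum with notches at odd multiples of 1/(2τ), where τ is the delay. The function names, the unit gain on the delayed path and the example delay are illustrative assumptions and are not taken from the studies cited.

```python
import numpy as np

def delay_and_add(x, delay_samples, gain=1.0):
    """Comb filter: y[n] = x[n] + gain * x[n - delay_samples].

    Assumes 0 <= delay_samples < len(x).
    """
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[delay_samples:] += gain * x[:len(x) - delay_samples]
    return y

def notch_frequencies(delay_s, f_max):
    """Centre frequencies (Hz) of the spectral notches up to f_max, for positive gain."""
    k = np.arange(int(f_max * delay_s + 0.5) + 1)
    f = (2 * k + 1) / (2.0 * delay_s)
    return f[f <= f_max]

# Example: a delay of 4 samples at 48 kHz (about 83 microseconds).
fs = 48000
impulse = np.zeros(4800)
impulse[0] = 1.0
response = delay_and_add(impulse, 4)
spectrum = np.abs(np.fft.rfft(response))      # rippled magnitude spectrum
notches = notch_frequencies(4 / fs, fs / 2)   # notch centres below the Nyquist frequency
```

Changing the delay shifts the notches along the frequency axis, which is the manipulation associated above with changes in perceived vertical location. A frequency domain analysis of the input, as favored in the text, would need only the positions of these notches and peaks.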
2.2. LOCALIZING SOUND SOURCES WITH ONE EAR
So far we have considered sound localization using two ears, but it has been known for some time that individuals with only one functional ear can also localize sounds with reasonable accuracy.42 Many studies using normal hearing subjects but with one ear plugged have also demonstrated reasonable localization accuracy for targets in both horizontal30,43,44 and vertical planes.20,45 In subjects who were artificially deafened in one ear, sound locations along the horizontal plane tended to be displaced towards the functional ear so that localization accuracy was good for locations about the interaural axis on the unblocked side20,43,46,47 and increasingly less accurate for locations displaced from these locations. Where subject responses were unconstrained, vertical localization does not seem to be as affected as the perception of horizontal location.28,48 Monaural localization has also been found to be dependent on the spectral content of the stimulus and is very inaccurate for stimuli lowpassed at 5 kHz.44 Monaural localization is also disrupted by manipulation of the pinna.49 For the monaural subject, the apparent locations of narrow band stimuli seemed to be determined by their center frequency rather than their actual location.30 However, both practice effects46,50,51 and context effects52 have also been shown to influence the subject’s responses. There is evidence that some subjects with a long history of unilateral deafness perform better than subjects with one ear blocked, particularly with respect to displacement of apparent sound locations towards the hearing ear.48 Mixed localization accuracy was also reported for a group of 44 unilaterally impaired children when compared to 40 normally hearing subjects49 with the hearing impaired children showing a greater range of localization errors for a noise highpassed at 3 kHz. 
Whatever the basis of these differences in accuracy, the principal finding is that monaural subjects can localize reasonably well and must do so on the basis of the filtering effects of the outer ear.
2.3. DYNAMIC CUES TO THE SOURCE OF A SOUND
2.3.1. Head motion as a cue to a sound’s location
There have been a number of studies examining the contribution that head movements or movement of the sound source make to localization accuracy. The basic idea is that multiple sequential sampling of the sound field with ears in different locations with respect to the source would provide systematic variation in the cues to a sound’s location. For instance, the pattern of the variations in the binaural cues could be used to help resolve the front-back confusion.53,54 The availability of such cues is, of course, dependent on a relatively sustained or repetitive stimulus to allow multiple samples. When naive subjects are attempting to localize long duration stimuli they do tend to make spontaneous head movements of the order of ±10°, particularly when the sound is narrow band.55 However, somewhat surprisingly, there is little evidence that, for a binaurally hearing individual, head movements contribute significantly to localization accuracy under normal listening conditions. Some improvements are seen where the cues to a sound’s location are impoverished in some way. Head movements have been shown to substantially increase monaural localization accuracy of a 3.7 kHz pure tone.56 Induced head movement (in contrast to self induced movements) showed some increase in localization accuracy where the noise or click stimuli were high- or low-pass filtered.57 Fisher and Freedman29 showed that self induced head movement produced no improvement in the localization of a small number of fixed sound sources. The elevation estimates of low-pass noise were reported to be very poor24,58 and, despite expectations to the contrary, allowing subjects to move their heads during the presentation of a long duration stimulus did not result in any improvement.58 Pollack and Rose59 confirmed the finding that small head movements have no effect on localization accuracy but found that when a subject turned to face the source of the sound there was an increase in localization accuracy. This last result may have more to do with the location dependent variations in localization accuracy discussed above rather than a contribution of a head motion cue to source location per se. Thus, despite strong theoretical expectations to the contrary, there is almost no evidence that head movements are useful in localizing a free field sound source unless the bandwidth of the sound is narrow or the spectral cues to location are degraded in some other way. This suggests that, at least under the experimental conditions examined so far, the auditory system does not re-sample the sound field as a cue to location.
This may be related to the fact that most subjects already have two simultaneous samples of the sound field (one from each ear). Furthermore, the system is only likely to gain more information by re-sampling if the characteristics of the stimulus are stationary. In contrast to the kinds of sounds used in these experiments, natural sounds are highly nonstationary in both their temporal and spectral characteristics. Under such conditions, variations in the subsequent samples of the sounds that result from rotation of the head could be confounded by the variation in the characteristics of the source. Thus, in the ecological context in which this system has evolved, re-sampling probably represents a computationally expensive and yet largely redundant strategy.
2.3.2. Perception of the motion of a sound source
In contrast to what is known about localization of static sources, considerably less effort has been expended examining issues of auditory motion. Furthermore, many previous studies of auditory motion are limited by a number of technical and theoretical problems outlined
below. There is also considerable disagreement as to the mechanisms of motion analysis, some of which may be traced to differences in methodology. Many studies have employed simulated auditory motion using headphones because of the technical difficulties associated with silently moving a physical sound source in the free field.60,61 Variations in the binaural stimulus parameters result in variations in the lateralized sound image within the head and therefore this experimental paradigm suffers from the limitation that sound image has no externalized 3D spatial location. As with studies of auditory localization, the generalizability of such experiments to free field listening conditions is questionable. Other methods of simulating motion using static sources rely on stereo-balancing between two widely spaced free field speakers62,63 or rapid switching between relatively closely spaced speakers.64,65 These methods generate a more compelling percept of auditory motion in that the sound image occupies extra-personal space. However, as discussed above, the generation of the percept does not necessarily demonstrate that a particular simulation method generates all of the relevant cues to auditory motion. Stereo-balancing involves a reciprocal variation in the level of the stimulus at each speaker. This results in a continuous variation in the loudness of the sounds in each ear. However, this method will not produce appropriate variation of the location dependent filtering effects of the outer ears that results during an actual variation in the location of a source. Accurate localization processing of a static stimulus requires the conjunction of appropriate binaural and monaural cues22,66 so that the cue mismatches produced by stereo balancing might also disrupt some aspects of motion processing. Cue mismatch may not be a problem for movement simulations relying on rapid switching between closely spaced speakers. 
This technique assumes that the distance between speakers is within discriminable limits and that the rate of switching is within the ‘sampling period’ of the auditory system. The first assumption can be confirmed by studies of localization of static signals discussed above but the second has yet to be experimentally demonstrated. A problem that can arise with rapid switching is the “ringing” that this produces in each speaker at signal onset and offset. This will produce spectral splatter resulting in significant side lobes in the spectra of narrow band sounds or smearing of the spectra of complex sounds. This can be avoided to a large extent by appropriate time domain windowing of the signal. However, both windowing and the offset ringing in individual speakers will result in a sound ‘source’ that has a much larger spatial extent than a real moving point source. There have been a number of studies employing a moving loudspeaker. The speaker was attached to the end of a boom anchored above the head which could be rotated around the subject. These studies have necessarily been limited in the trajectories of movement and the
range of velocities that could be examined (e.g., ref. 67) or were restricted to pure tone stimuli (e.g., refs. 68, 69). Pure tones are not a particularly ecological stimulus as most natural sounds are rich in spectral features. More importantly, their use may have negated an important potential cue for auditory motion, namely the location-dependent variations in the peaks and notches in the filter functions of the outer ear. These spectral features are only available with complex, broadband stimuli. Only recently have broadband sounds been employed with actual moving sources70 or simulated movements using multiple sources.64,65 It is noteworthy that, where comparison between studies is possible, the minimum audible movement angle (MAMA) is considerably less with broadband stimuli than with pure tones. In this chapter we have seen that our perception of auditory space is dependent on a range of auditory and non auditory factors. The generation of virtual auditory space promises to provide a very powerful research tool for the study of this important perceptual ability. A key determinant of the utility of VAS is its fidelity. While the generation of VAS is simple in conception, its implementation involves a number of acoustic compromises. It has been proposed here that a behavioral measurement of the localization ability of subjects listening to short duration noise stimuli presented in VAS represents an appropriate measure of that fidelity. A review of the literature examining such localization behavior reveals that there are individual differences in ability and that nonauditory factors can play an important role in localization performance. Therefore, adequate tests of VAS fidelity using auditory localization tasks need to take these factors into account. In the next chapter we will consider the physical cues to a sound’s location that are available to the auditory system. 
However, demonstrating the presence of a particular physical cue does not necessarily imply that the auditory system utilizes this cue. Some physiological models of the encoding of these cues will be described and psychophysical tests examining the sensitivity of subjects to particular physical cues will also be considered. Such studies also provide insights into the limits of sensory coding of these cues and provide important benchmarks for the acoustic precision with which VAS needs to be generated.
ACKNOWLEDGMENTS
I would like to acknowledge Drs. Martin, Morey, Parker, Pralong and Professor Irvine for comments on a previous version of this chapter. The recent auditory localization work from the Auditory Neuroscience Laboratory reported in this chapter was supported by the National Health and Medical Research Council (Australia), the Australian Research Council and the University of Sydney. The Auditory Neuroscience Laboratory maintains a Web page outlining the laboratory facilities and current research work at http://www.physiol.usyd.edu.au/simonc.
REFERENCES
1. Gazzaniga MS. The cognitive neurosciences. Cambridge, Mass.: MIT Press, 1994. 2. Yost WA. Auditory image perception and analysis: the basis for hearing. Hear Res 1991; 56:8-18. 3. Masterton RB. Role of the central auditory system in hearing: the new direction. TINS 1992; 15:280-285. 4. Blauert J. Spatial Hearing: The psychophysics of human sound localization. Cambridge, Mass.: MIT Press, 1983. 5. Blauert J, Lindemann W. Auditory spaciousness: Some further psychoacoustic analyses. J Acoust Soc Am 1986; 80:533-542. 6. Dye RH, Yost WA, Stellmack MA et al. Stimulus classification procedure for assessing the extent to which binaural processing is spectrally analytic or synthetic. J Acoust Soc Am 1994; 96:2720-2730. 7. Wightman FL, Kistler DJ. Headphone simulation of free field listening. II: Psychophysical validation. J Acoust Soc Am 1989; 85:868-878. 8. Wenzel EM, Arruda M, Kistler DJ et al. Localization using nonindividualized head-related transfer functions. J Acoust Soc Am 1993; 94:111-123. 9. Makous J, Middlebrooks JC. Two-dimensional sound localization by human listeners. J Acoust Soc Am 1990; 87:2188-2200. 10. Carlile S, Leong P, Hyams S et al. Distribution of errors in auditory localization. Proceedings of the Australian Neuroscience Society 1996; 7:225. 11. Erulkar SD. Comparative aspects of spatial localization of sound. Physiol Rev 1972; 52:238-360. 12. Oldfield SR, Parker SPA. Acuity of sound localization: a topography of auditory space I. Normal hearing conditions. Percept 1984; 13:581-600. 13. Mills AW. On the minimum audible angle. J Acoust Soc Am 1958; 30:237-246. 14. Hartmann WM, Rakerd B. On the minimum audible angle–A decision theory approach. J Acoust Soc Am 1989; 85:2031-2041. 15. Middlebrooks JC, Green DM. Sound localization by human listeners. Annu Rev Psychol 1991; 42:135-159. 16. Wightman FL, Kistler DJ. Sound localization. In: Yost WA, Popper AN, Fay RR, ed. Human psychophysics. New York: Springer-Verlag, 1993:155-192. 17.
Yost WA, Gourevitch G. Directional hearing. New York: Springer-Verlag, 1987. 18. Blauert J. Binaural localization. Scand Audiol 1982; Suppl.15:7-26. 19. Oldfield SR, Parker SPA. Acuity of sound localization: a topography of auditory space II: Pinna cues absent. Percep 1984; 13:601-617. 20. Oldfield SR, Parker SPA. Acuity of sound localization: a topography of auditory space. III Monaural hearing conditions. Percep 1986; 15:67-81. 21. Wightman FL, Kistler DJ, Perkins ME. A new approach to the study of human sound localization. In: Yost WA, Gourevitch G, ed. Directional Hearing. New York: Academic, 1987:26-48.
22. Middlebrooks JC. Narrow-band sound localization related to external ear acoustics. J Acoust Soc Am 1992; 92:2607-2624. 23. Carlile S, Pralong D. Validation of high-fidelity virtual auditory space. Br J Audiology 1996; (abstract in press). 24. Perrett S, Noble W. Available response choices affect localization of sound. Percept and Psychophys 1995; 57:150-158. 25. Makous JC, O’Neill WE. Directional sensitivity of the auditory midbrain in the mustached bat to free-field tones. Hear Res 1986; 24:73-88. 26. Burger JF. Front-back discrimination of the hearing system. Acustica 1958; 8:301-302. 27. Stevens SS, Newman EB. The localization of actual sources of sound. Amer J Psychol 1936; 48:297-306. 28. Butler RA. The bandwidth effect on monaural and binaural localization. Hear Res 1986; 21:67-73. 29. Fisher HG, Freedman SJ. The role of the pinna in auditory localization. J Auditory Res 1968; 8:15-26. 30. Musicant AD, Butler RA. The psychophysical basis of monaural localization. Hear Res 1984; 14:185-190. 31. Searle CL, Braida LD, Davis M F et al. Model for auditory localization. J Acoust Soc Am 1976; 60:1164-1175. 32. Searle CL, Braida LD, Cuddy D R et al. Binaural pinna disparity: another auditory localization cue. J Acoust Soc Am 1975; 57:448-455. 33. Middlebrooks JC, Makous JC, Green DM. Directional sensitivity of soundpressure levels in the human ear canal. J Acoust Soc Am 1989; 86:89-108. 34. Pralong D, Carlile S. Measuring the human head-related transfer functions: A novel method for the construction and calibration of a miniature “in-ear” recording system. J Acoust Soc Am 1994; 95:3435-3444. 35. Roffler SK, Butler RA. Factors that influence the localization of sound in the vertical plane. J Acoust Soc Am 1968; 43:1255-1259. 36. Blauert J. Sound localization in the median plane. Acustica 1969-70; 22:205-213. 37. Watkins AJ. Psychoacoustic aspects of synthesized vertical locale cues. J Acoust Soc Am 1978; 63:1152-1165. 38. Batteau DW. 
The role of the pinna in human localization. Proc Royal Soc B 1967; 158:158-180. 39. Hebrank J, Wright D. Spectral cues used in the localization of sound sources on the median plane. J Acoust Soc Am 1974; 56:1829-1834. 40. Gardner MB, Gardner RS. Problems of localization in the median plane: effect of pinnae cavity occlusion. J Acoust Soc Am 1973; 53:400-408. 41. Gardner MB. Some monaural and binaural facets of median plane localization. J Acoust Soc Am 1973; 54:1489-1495. 42. Angell JR, Fite W. The monaural localization of sound. Psychol Rev 1901; 8:225-243. 43. Butler RA, Naunton RF. The effect of stimulus sensation level on the directional hearing of unilaterally deafened persons. J Aud Res 1967; 7:15-23.
44. Belendiuk K, Butler RD. Monaural location of low-pass noise bands in the horizontal plane. J Acoust Soc Am 1975; 58:701-705. 45. Humanski RA, Butler RA. The contribution of the near and far ear toward localization of sound in the sagittal plane. J Acoust Soc Am 1988; 83:2300-2310. 46. Butler RA. An analysis of the monaural displacement of sound in space. Percept and Psychophys 1987; 41:1-7. 47. Butler RA, Humanski RA, Musicant AD. Binaural and monaural localization of sound in two-dimensional space. Percept 1990; 19:241-256. 48. Slattery WH, Middlebrooks JC. Monaural sound localization: acute versus chronic unilateral impairment. Hear Res 1994; 75:38-46. 49. Newton VE. Sound localisation in children with a severe unilateral hearing loss. Audiol 1983; 22:189-198. 50. Musicant AD, Butler RA. Monaural localization: An analysis of practice effects. Percept and Psychophys 1980; 28:236-240. 51. Musicant AD, Butler RA. Monaural localization following exposure to different segments of acoustic space. Percept and Psychophys 1982; 31:353-357. 52. Butler RL, Humanski RA. Localization of sound in the vertical plane with and without high-frequency spectral cues. Percept and Psychophys 1992; 51:182-186. 53. Wallach H. The role of head movements and vestibular and visual cues in sound localization. J Exp Psych 1940; 27:339-368. 54. Lambert RM. Dynamic theory of sound-source localization. J Acoust Soc Am 1974; 56:165-171. 55. Thurlow WR, Mangels JW, Runge PS. Head movements during sound localization. J Acoust Soc Am 1967; 42:489-493. 56. Perrott DR, Ambarsoom H, Tucker J. Changes in head position as a measure of auditory localization performance: Auditory psychomotor coordination under monaural and binaural listening conditions. J Acoust Soc Am 1987; 82:1637-1645. 57. Thurlow WR, Runge PS. Effect of induced head movements on localization of direction of sounds. J Acoust Soc Am 1967; 42:480-488. 58. Thurlow WR, Mergener JR. 
Effect of stimulus duration on localization of direction of noise stimuli. J Speech and Hear Res 1970; 13:826-838. 59. Pollack I, Rose M. Effect of head movement on the localization of sounds in the equatorial plane. Percept and Psychophys 1967; 2:591-596. 60. Altman JA, Viskov OV. Discrimination of perceived movement velocity for fused auditory image in dichotic stimulation. J Acoust Soc Am 1977; 61:816-819. 61. Grantham DW, Wightman FL. Auditory motion aftereffects. Percept and Psychophys 1979; 26:403-408. 62. Grantham DW. Detection and discrimination of simulated motion of auditory targets in the horizontal plane. J Acoust Soc Am 1986; 79:1939-1949. 63. Grantham DW. Motion aftereffects with horizontally moving sound sources in the free field. Percept and Psychophys 1989; 45:129-136.
64. Saberi K, Perrott DR. Minimum audible movement angles as a function of sound source trajectory. J Acoust Soc Am 1990; 88:2639-2644. 65. Perrott DR, Costantino B, Ball J. Discrimination of moving events which accelerate or decelerate over the listening interval. J Acoust Soc Am 1993; 93:1053-1057. 66. Wightman FL, Kistler DJ. The dominant role of low-frequency interaural time differences in sound localization. J Acoust Soc Am 1992; 91: 1648-1661. 67. Harris JD, Sergeant RL. Monaural/binaural minimum audible angles for a moving sound source. J Speech and Hear Res 1971; 14:618-629. 68. Perrott DR, Musicant AD. Minimum audible movement angle: Binaural localization of moving sound sources. J Acoust Soc Am 1977; 62: 1463-1466. 69. Perrott DR, Tucker J. Minimum audible movement angle as a function of signal frequency and the velocity of the source. J Acoust Soc Am 1988; 83:1522-1527. 70. Perrott DR, Marlborough K. Minimum audible movement angle: Marking the end points of the path traveled by a moving sound source. J Acoust Soc Am 1989; 85:1773-1775. 71. Carlile S, King AJ. From outer ear to virtual space. Cur Biol 1993; 3:446-448. 72. Fisher NI, Lewis T, Embleton BJJ. Statistical analysis of spherical data. Cambridge: Cambridge University Press, 1987. 73. Shaw EAG. The external ear. In: Keidel WD, Neff WD, ed. Handbook of Sensory physiology. Berlin: Springer-Verlag, 1974:455-490. 74. Barlow HB. Why have multiple cortical areas? Vision Research 1986; 26:81-90. 75. Blakemore C. Understanding images in the brain. In: Barlow H, Blakemore C, Weston-Smith M, eds. Images and Understanding. Cambridge: Cambridge University Press, 1990:257-283.
CHAPTER 2
THE PHYSICAL AND PSYCHOPHYSICAL BASIS OF SOUND LOCALIZATION Simon Carlile
1. PHYSICAL CUES TO A SOUND’S LOCATION
1.1. THE DUPLEX THEORY OF AUDITORY LOCALIZATION
Traditionally, the principal cues to a sound’s location are identified as the differences between the sound field at each ear. The obvious fact that we have two ears sampling the sound field under slightly different conditions makes these binaural cues self-evident. A slightly more subtle concept underlying traditional thinking is that the differences between the ears are analyzed on a frequency by frequency basis. This idea has as its basis the notion that the inner ear encodes sound in terms of its spectral characteristics as opposed to its time domain characteristics. As a result, complex spectra are thought to be encoded within the nervous system as varying levels of activity across a wide range of auditory channels, each channel corresponding to a different segment of the frequency range. While there is much merit and an enormous amount of data supporting these ideas, they have tended to dominate research efforts to the exclusion of a number of other important features of processing. In contrast to these traditional views, there is a growing body of evidence that: (i) illustrates the important role of information available at each ear alone (monaural cues to sound location);
Virtual Auditory Space: Generation and Applications, edited by Simon Carlile. © 1996 Landes Bioscience.
(ii) suggests that processing across frequency is an important feature of those mechanisms analyzing cues to sound location (monaural and binaural spectral cues); (iii) suggests that the time (rather than frequency) domain characteristics of the sound may also play an important role in sound localization processing. The principal theoretical statement of the basis of sound localization has become known as the “duplex theory” of sound localization and has its roots in the work of Lord Rayleigh at the turn of the century. It is based on the fact that “the main difference between the two ears is that they are not in the same place.”1 Early formulations were based on a number of fairly rudimentary physical and psychophysical observations. Models of the behavior of sound waves around the head were made with simplifying approximations of the head as a sphere and the ears as two symmetrically placed point receivers (Fig. 2.1).2 Despite these simplifications the resulting models had great explanatory and predictive power and have tended to dominate the research program for most of this century. The fact that we have two ears separated by a relatively large head means that, for sounds off the mid-line, there are differences in the path lengths from the sound source to each ear. This results in a difference in the time of arrival of the sound at each ear; this is referred to as the interaural time difference (ITD). This ITD manifests as a difference in the onset of sound at each ear and, for more continuous sounds, results in an interaural difference in the phase of the sounds at each ear (interaural phase difference: IPD). There are important frequency limitations to the encoding of phase information.
The auditory nervous system is known to encode the phase of a pure tone stimulus at the level of the auditory receptors only for relatively low frequencies.3 Psychophysically, we also seem to be insensitive to differences in interaural phase for frequencies above about 1.5 kHz.4,5 For these reasons, the duplex theory holds that the encoding of interaural time differences (in the form of interaural phase differences) is restricted to low frequency sounds. As the head is a relatively dense medium it will tend to reflect and refract sound waves. This only becomes a significant effect when the wavelengths of the sound are of the same order or smaller than the head. For a sound located off the midline, the head casts an acoustic shadow for the far ear and generates an interaural difference in the sound level at each ear (interaural level difference: ILD). At low frequencies of hearing this effect is negligible because of the relatively long wavelengths involved, but for frequencies above about 3 kHz the magnitude of the effect rises sharply. The amount of shadowing of the far ear will depend on the location of the source (section 1.3) so that this effect provides powerful cues to a sound’s location. There are also changes in the level of the sound at the ear nearer to the sound
Fig. 2.1. The coordinate system used for calculating the interaural time differences in a simple path length model and the interaural level difference model. In these models the head is approximated as a hard sphere with two point receivers (the ears). Reprinted with permission from Shaw EAG. In: Keidel W D, Neff W D, ed. Handbook of Sensory physiology. Berlin: Springer-Verlag, 1974:455-490.
source that are dependent on the location of the source. The latter variations result from two distinct effects: firstly, the so-called obstacle or baffle effect (section 1.3) and secondly, the filtering effects of the outer ear (section 1.5 and chapter 6, section 2.2). The head shadow and near ear effects can result in interaural level differences of 40 dB or more at higher frequencies. The magnitudes of these effects and the frequencies at which they occur are dependent on the precise morphology of the head and ears and thus can show marked differences between individuals. The duplex theory is, however, incomplete in that there are a number of observations that cannot be explained by reference to the theory and a number of observations that contradict the basic premises of the theory. For instance, there is a growing body of evidence that the human auditory system is sensitive to the interaural time differences in the envelopes of high frequency carriers (see review by Trahiotis6). There are a number of experiments that suggest that this information is not dependent on the low frequency channels of the auditory system.7,8 In the absence of a spectral explanation of the phenomena, this
suggests a role for some form of time domain code operating at higher frequencies. Furthermore, recent work suggests that coding the interaural differences in both amplitude and frequency modulated signals is dependent on rapid amplitude fluctuations in individual frequency channels which are then compared binaurally.9 The incompleteness of the duplex theory is also illustrated by the fact that listeners deafened in one ear can localize a sound with a fair degree of accuracy (chapter 1, section 2.2). This behavior must be based upon cues other than those specified by the duplex theory which is principally focused on binaural processing of differences between the ears. A second problem with the theory is that because of the geometrical arrangement of the ears a single interaural difference in time or level is not associated with a single spatial location. That is, a
Fig. 2.2. The interaural time and level binaural cues to a sound’s location are ambiguous if considered within frequencies because a single interaural interval specifies more than one location in space. Because of the symmetry of the two receivers on each side of the head, a single binaural interval specifies the locations in space which can be described by the surface of a cone directed out from the ear, the so-called “cone of confusion.” For interaural time differences, the cone is centered on the interaural axis. The case is slightly more complicated for interaural level differences as, for some frequencies, the axis of the cone is a function of the frequency. Reprinted with permission from Moore BCJ. An Introduction to the Psychology of Hearing. London: Academic Press, 1989.
particular interaural difference will specify the surface of an imaginary cone centered on the interaural axis (Fig. 2.2). The solid angle of the cone will be associated with the magnitude of the interval; for example the cone becomes the median plane for zero interaural time difference and becomes the interaural axis for a maximum interaural time difference. Therefore, interaural time differences less than the maximum possible ITD will be ambiguous for sound location. These have been referred to as the “cones of confusion.”1 Similar arguments exist for interaural level differences although, as we shall see, the cones of confusion for these cues are slightly more complex. The kind of front-back confusions seen in a percentage of localization trials is consistent with the descriptions of the binaural cues and indicative of the utilization of these cues (chapter 1, section 2.1.3). However, the fact that front-back confusions only occur in a small fraction of localization judgments suggests that some other cues are available to resolve the ambiguity in the binaural cues. These ambiguities in the binaural cues were recognized in the earliest statements of the duplex theory and it was suggested that the filtering properties of the outer ear might play a role in resolving these ambiguities. However, in contrast to the highly quantitative statements of the binaural characteristics and the predictive models of processing in these early formulations, the invocation of the outer ear was more of an ad hoc adjustment of the theory to accommodate a “minor” difficulty. It was not until the latter half of this century that more quantitative models of pinna function began to appear10 and indeed it has been only recently that quantitative and predictive formulations of auditory localization processing have begun to integrate the role of the outer ear11 (but see Searle et al12).
In the following sections we will look in detail at what is known about the acoustics of the binaural cues and also the so-called monaural cues to a sound’s location. We will then look at the role of different structures of the auditory periphery in generating these location cues and some of the more quantitative models of the functional contribution of different components of the auditory periphery such as the pinna, head, shoulder and torso.
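The cone of confusion described in this section can be checked numerically. The sketch below is an illustration under simplifying assumptions, not a model taken from the chapter: it keeps the two symmetrically placed point receivers but omits the head between them, and the receiver spacing, the source distance and the function names are hypothetical. Sources at a fixed distance that share the same angle from the median plane, but lie in front of, above or behind the listener, then produce the same path length difference and hence the same ITD.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s, the value used in this chapter
EAR_OFFSET = 0.0875     # m, assumed distance of each receiver from the head centre


def source_position(lateral_deg, polar_deg, distance_m):
    """Cartesian position from a lateral angle (measured from the median plane)
    and a polar angle (rotation about the interaural axis: 0 = front,
    90 = above, 180 = behind)."""
    lat = math.radians(lateral_deg)
    pol = math.radians(polar_deg)
    x = distance_m * math.cos(lat) * math.cos(pol)  # front-back axis
    y = distance_m * math.sin(lat)                  # interaural axis
    z = distance_m * math.cos(lat) * math.sin(pol)  # up-down axis
    return (x, y, z)


def itd_two_receivers(source):
    """ITD in seconds for two point receivers with no head between them."""
    ear_pos = (0.0, EAR_OFFSET, 0.0)   # receiver on the positive y side
    ear_neg = (0.0, -EAR_OFFSET, 0.0)  # receiver on the negative y side
    path_difference = math.dist(source, ear_neg) - math.dist(source, ear_pos)
    return path_difference / SPEED_OF_SOUND


if __name__ == "__main__":
    # Same lateral angle, different positions around the interaural axis
    for polar in (0.0, 45.0, 90.0, 135.0, 180.0):
        itd = itd_two_receivers(source_position(30.0, polar, 2.0))
        print(f"polar angle {polar:5.1f} deg: ITD = {itd * 1e6:.1f} microseconds")
```

Because the geometry is symmetric about the interaural axis, the ITD alone cannot separate these positions, which is why additional cues such as the filtering of the outer ear are needed to resolve front-back ambiguity.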
1.2. CUES THAT ARISE AS A RESULT OF THE PATH LENGTH DIFFERENCE
The path length differences depend on the distance and the angular location of the source with respect to the head (Fig. 2.1).1,13 Variation in the ITD with distance is appreciable only for source distances between a and 3a, where a is the radius of a sphere approximating the head. At distances greater than 3a the wave front is effectively planar. The ITDs produced by the path length differences for a plane sound wave can be calculated from
D = r(θ + sin(θ))    (1)
where D = path length difference in meters, r = radius of the head in meters, θ = angle of the sound source from the median plane in radians (Fig. 2.1).1 The timing difference produced by this path length difference is14
t = D/c    (2)
where t = time in seconds, c = speed of sound in air (340 m s⁻¹). The interaural phase difference (IPD) produced for a relatively continuous periodic signal is then given by Kuhn15 as
IPD = tω    (3)
where ω = radian frequency. For a continuous sound, the differences in the phase of the sound waves at each ear will provide two phase angles: a° and (360° – a°). If these are continuous signals there is no a priori indication of which ear is leading. This information must come from the frequency of the sound wave and the distance between the two ears. Assuming the maximum phase difference occurs on the interaural axis, the only unambiguous phase differences will occur for frequencies whose wavelengths (λ) are greater than twice the interaural distance. At these frequencies the IPD will always be less than 180° and hence the cue is unambiguous. Physical measurements of the interaural time differences produced using click stimuli are in good agreement with predictions from the simple “path length” model described above.14-16 This model breaks down, however, when relatively continuous tonal stimuli are used (Fig. 2.3).14,15,17,18 In general, the measured ITDs for continuous tones are larger than those predicted. Furthermore, the ITDs become smaller and more variable as a function of frequency and azimuth location when the frequency exceeds a limit that is related to head size. The failure of the simple models to predict the observed variations in ITDs results from the assumption that the velocity of the sound wave is independent of frequency. Three different velocities can be ascribed to a signal, namely the phase, group and signal velocities.15,18,19 The rate of propagation of elements of the amplitude envelope is represented by the group ITD, while the phase velocity of the carrier is
a. The fact that a signal can have a number of different velocities is not intuitively obvious to many. Brillouin19 likens the phase and group velocities to the ripples caused by a stone cast into a pond. He points out for instance that if the group velocity of the ripple is greater than the phase velocity one sees wavelets appearing at the advancing edge of the ripple, slipping backwards through the packet of wavelets that make up the ripple and disappearing at the trailing edge.
The Physical and Psychophysical Basis of Sound Localization
Fig. 2.3. Measurements of the interaural time differences using a dummy head reveal that this is a function of both frequency and the type of sound. The points plot data obtained from the measurement of on-going phase of a tone at a number of angles of incidence (15°, 30°, 45°, 60°, 75° and 90° referenced to the median plane). The solid lines to the left show the predictions based on the phase velocity of the wave (eq. 5) and can be seen to be a good match for the data only for the lowest frequencies. The boxed points show the solutions for integer ka for the complete model from which equation (5) was derived (i.e., without the simplifying assumption that ka < 1; see text). On the right y-axis, the dashed lines show the predictions of the simple path length model (eq. 2) and the arrows show measurements from the leading edge of a tone burst. Reprinted with permission from Kuhn GF, J Acoust Soc Am 1977; 62:157-167.
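As a rough numerical illustration of the two predictions compared in Fig. 2.3, the sketch below evaluates the simple path length model (eqs. 1 and 2) and the low-frequency phase-velocity approximation (eq. 5, ITD ≈ 3(a/c)sin(ainc)) at the angles of incidence used in the figure. The function names are illustrative only, and the head radius of 8.75 cm is an assumed value for a sphere approximating the human head; the speed of sound is the 340 m s⁻¹ used in the text.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s, value used in the text
HEAD_RADIUS = 0.0875    # m, assumed radius of a sphere approximating the head


def path_length_itd(theta, r=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Eqs. (1) and (2): ITD in seconds for a plane wave at angle theta (radians)."""
    path_difference = r * (theta + math.sin(theta))
    return path_difference / c


def low_frequency_itd(theta, a=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Eq. (5): ITD in seconds from the phase-velocity model (low frequencies only)."""
    return 3.0 * (a / c) * math.sin(theta)


def ipd(itd_seconds, frequency_hz):
    """Eq. (3): interaural phase difference in radians for a given ITD."""
    return itd_seconds * 2.0 * math.pi * frequency_hz


# Angles of incidence from the median plane, as listed in the caption of Fig. 2.3.
for degrees in (15, 30, 45, 60, 75, 90):
    theta = math.radians(degrees)
    simple = path_length_itd(theta) * 1e6
    low_f = low_frequency_itd(theta) * 1e6
    print(f"{degrees:2d} deg: path length {simple:6.1f} us, low-frequency {low_f:6.1f} us")
```

Under these assumptions the low-frequency estimate exceeds the path length estimate at every angle listed, which is in the same direction as the discrepancy reported for continuous tones.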
best ascribed to what was previously thought of as the steady state ITD.a Over the frequency range of auditory sensitivity, the group and signal velocities are probably identical.18 When phase velocity is constant, phase and group velocities will be equal, regardless of wavelength. However, because the phase velocity of sound waves is dependent on wavelength (particularly at high frequencies), relatively large differences can occur between the phase and group velocities.19 In addition, as a wave encounters a solid object, it is diffracted such that the wavefront at the surface of the object is a combination of the incident and reflected waves. Under these circumstances the phase velocity at the surface of the object becomes frequency-dependent in a manner characteristic of the object.18 The interaural phase differences based on phase velocity, for frequencies in the range 0.25 kHz to 8.0 kHz, have been calculated using a sphere approximating the human head (Fig. 2.3):
IPD ≈ 3ka sin(ainc)    (4)
where k = acoustic wave numberb (2π/λ), a = radius of the sphere, ainc = angle of incidence of the plane sound wave (see Kuhn15 for derivation). The interaural time difference is calculated using equation (3):
ITD ≈ 3(a/c)sin(ainc)    (5)
where c = speed of sound in air. According to equation 5, ITD is constant as a function of frequency, however this relation15 holds only where (ka)2 13 kHz) and/or a low frequency (< 1.5 kHz) peak in the median plane transformation.
1.6. CONTRIBUTION OF DIFFERENT COMPONENTS OF THE AUDITORY PERIPHERY TO THE HRTF
In considering the spectral transfer functions recorded at either end of the ear canal, it is important to keep in mind that structures other than the pinna will contribute to these functions.10,56 Figure 2.8 shows the relative contribution of various components of the auditory periphery calculated for a sound located at 45° azimuth. These measures are very much a first approximation calculated by Shaw,10 but serve to illustrate the point that the characteristics of the HRTF are dependent on a number of different physical structures. The gain due to the head, calculated from the Rayleigh-Stewart description of the sound pressure distribution around a sphere,10,21,22 increases with increasing frequency to an asymptote of 6 dB. The rate of this increase, as a function of frequency, is determined by the radius of the sphere. In humans this corresponds to a radius of 8.75 cm and the midpoint to asymptote occurs at 630 Hz (see Fig. 2.4). The contribution of the torso and neck is small and restricted primarily to low frequencies. These pressure changes probably result from the interactions of the scattered sound waves at the ear and are effective primarily for low frequencies. The contribution of the pinna flap is small at 45° azimuth but probably exerts a greater influence on the resulting total for sounds presented behind the interaural axis48 (see also section 1.7). The largest contributions are attributable to the concha and the ear canal/eardrum complex. An important feature of these contributions is the complementarity of the conchal and ear canal components which act together to produce a substantial gain over a broad range of frequencies.
However, an important distinction between the two is that the contribution of the ear canal is insensitive to the location of the stimulus, while the gain due to the concha and the pinna flange is clearly dependent on stimulus direction.10,24,48,57 That is to say, the HRTF is clearly composed of both location-dependent and location-independent components.
Fig. 2.8. Relative contributions of the different components of the human auditory periphery calculated by Shaw (1974). The source is located at 45° from the median plane. At this location the transformation is clearly dominated by the gains due to the concha and the ear canal. An important distinction between these components is that the gain due to the concha is highly dependent on the location of the source in space, while the gain of the canal remains unaffected by location. Reprinted with permission from Shaw EAG, in: Keidel WD, Neff WD, ed. Handbook of Sensory physiology. Berlin: Springer-Verlag, 1974:455-490.
1.7. MODELS OF PINNA FUNCTION
There are three main functional models of pinna function. The pinna is a structure convoluted in three dimensions (see Fig. 1.4) and all theories of pinna function refer, in some way, to the interactions of sound waves, either within restricted cavities of the pinna, or as a result of the reflections or distortions of the sound field by the pinna or the pinna flap. These models and other numerical models of the filtering effects of the outer ear are also considered in chapter 6.
1.7.1. A resonator model of pinna function
The frequency transformations of the pinna have been attributed to the filtering of the sound by a directionally-dependent multi-modal resonator.10,24,48,57-59 This model has been realized using two similar analytical techniques. Firstly, precise probe tube measurements have been made of the sound pressures generated in different portions of life-like models of the human pinna57 and real human pinnae. Between five and seven basic modes have been described as contributing to the frequency transfer function of the outer ear. The first mode (M1) at around 2.9 kHz is attributed to the resonance of the ear canal.
The canal can be modeled as a simple tube which is closed at one end. An “end correction” of 50% of the actual length of the canal is necessary in matching the predicted resonance with the measured frequency response. This correction is attributed to the tight folding of the tragus and the crus helicas around the opening of the canal entrance57 (see Fig. 1.4). The second resonant mode (M2) centered around 4.3 kHz is attributed to the quarter wavelength depth resonance of the concha. Again, the match between the predicted and the measured values requires an “end correction” of 50%. The large opening of the concha and the uniform pressure across the opening suggests that the concha is acting as a “reservoir of acoustic energy” which acts to maintain a high eardrum pressure across a wide bandwidth.57 For frequencies above 7 kHz, precise probe tube measurements using the human ear model suggest that transverse wave motion within the concha begins to dominate.24 The higher frequency modes (7.1 kHz, 9.6 kHz, 12.1 kHz, 14.4 kHz and 16.7 kHz) result from complex multipole distributions of pressure within the concha resulting from transverse wave motion within the concha. An important result from these measurements is that the gain of the higher frequency modes was found to be dependent on the incident angle of the plane sound wave. The second approach was to construct simple acoustic models of the ear and make precise measurements of the pressure distributions within these models.24,48,49,58 By progressively adding components to the models, the functions of analogous morphological components of the human external ear could be inferred (Fig. 2.9). The pinna flange was found to play an important role in producing location-dependent changes in the gain of the lower conchal modes 3 kHz to 9 kHz.48 The model flange represents the helix, anti-helix and lobule of the human pinna (Fig. 2.9). 
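The closed-tube description of modes M1 and M2 can be turned into a short calculation. The sketch below assumes the standard quarter-wavelength formula f = c/(4L) for a tube closed at one end, and it interprets the 50% “end correction” as an effective acoustic length of 1.5 times the physical length; both the interpretation and the function names are assumptions made for illustration, not values taken from the measurements cited.

```python
SPEED_OF_SOUND = 340.0  # m/s


def quarter_wave_frequency(physical_length_m, end_correction=0.5, c=SPEED_OF_SOUND):
    """Resonant frequency (Hz) of a tube closed at one end.

    Assumption: the end correction lengthens the tube acoustically,
    so the effective length is physical_length * (1 + end_correction).
    """
    effective_length = physical_length_m * (1.0 + end_correction)
    return c / (4.0 * effective_length)


def implied_physical_length(resonance_hz, end_correction=0.5, c=SPEED_OF_SOUND):
    """Physical length (m) implied by a measured quarter-wave resonance."""
    effective_length = c / (4.0 * resonance_hz)
    return effective_length / (1.0 + end_correction)


# Mode frequencies quoted in the text for the ear canal (M1) and concha depth (M2).
for label, frequency in (("M1 (ear canal)", 2900.0), ("M2 (concha depth)", 4300.0)):
    length_mm = 1000.0 * implied_physical_length(frequency)
    print(f"{label}: {frequency:.0f} Hz implies a physical length of about {length_mm:.1f} mm")
```

With these assumptions the 2.9 kHz and 4.3 kHz modes correspond to physical lengths of roughly 20 mm and 13 mm respectively.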
There is an improved coupling of the first conchal mode to the sound field for sound in front of the ear, but this gain is greatly reduced when the sound source is toward the rear. This has been attributed to the interference between the direct and scattered waves from the edge of the pinna.49 By varying the shape of the conchal component of the models, the match between the frequency transforms for higher frequencies measured from the simple models and that measured from life-like models of human ears can be improved. The fossa of the helix and the crus helicas both seem to be very important in producing the directional changes for the higher modes.24,49,58 In summary, while the primary features of the transformations seem to be adequately accounted for by the acoustic models, the agreement between the theoretical predictions of the modal frequencies and the measured modes is not as good. The size of the ad hoc “end corrections” for the predictions based on simple tube resonance suggests that while simple tube resonance models may provide a reasonable first
Fig. 2.9. Pressure transformation relative to the SPL at the reflecting plane for three simple acoustic models of the outer ear. The variations in the gain of the model for variation in the elevation of the progressive wave source are indicated in each panel (0° indicates a location in front of the ear and 90° above the ear). The dimensions of the models are in mm and the small filled circle in each model illustrates the position of the recording microphone. Panels A and B indicate the blocked meatus response of the models with increasingly complex models of the concha, while panel C shows the response of the system with a tubular canal and an approximation to the normal terminating impedance of the ear drum. Note the large gain at around 2.6 kHz that appears with the introduction of the canal component of the model. Reprinted with permission from Shaw EAG. In: Studebaker GA, Hochberg I. Acoustical factors affecting hearing aid performance. Baltimore: University Park Press, 1980:109-125.
approximation, a more accurate description may require more sophisticated modeling techniques.
1.7.2. Sound reflections as a basis of pinna function
In 1967, Dwight Batteau60 suggested that the pinna may transform the incoming signal by adding a time-delayed copy of the signal to the direct signal. Sound from different locations in space could be reflected off different features of the pinna so that the magnitude of the time delay between the direct and reflected waves would be characteristic of the location of the sound source. The identity of the function providing the inverse transformation of the composite signal would provide the information as to the source location. Using a model of the human pinna, a monotonic change in the reflected delay of between 0 and 100 µs was produced by varying the location of a sound on the ipsilateral azimuth. Sounds located close to the midline, where the greatest spatial acuity has been demonstrated, produced the largest delays, and those located behind the interaural axis produced the smallest. Changing the elevation of the sound source produced a systematic variation in the delay of between 100 µs and 300 µs.60 Hiranaka and Yamasaki61 have examined these properties using real human ears. Using miniature microphones they recorded the response of the ear to a very short transient produced by an electric spark. These data, collected using a very large number of stimulus locations, clearly demonstrate systematic changes in the number of reflected components and in the time delays between the reflected components as a function of the stimulus location. Stimuli located in front of the subjects produced two or more reflected components, while “rear-ward” locations produced only one, and “above” locations produced no reflected components. The systematic changes in the delays all occurred in the first 350 µs from the arrival of the direct signal.
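The delay-and-add transformation proposed by Batteau can be explored numerically. The sketch below treats the pinna as a single reflection, so that the signal at the ear is the direct sound plus one copy scaled by an amplitude ratio and delayed by a fixed time; the single-reflection simplification, the function names and the example ratio of 0.67 are illustrative assumptions. The magnitude response of such a system has nulls at odd multiples of 1/(2 × delay) and peaks at multiples of 1/delay.

```python
import math


def comb_gain_db(frequency_hz, delay_s, ratio=0.67):
    """Gain (dB) of a direct signal plus one delayed copy scaled by `ratio`.

    |H(f)|^2 = 1 + ratio^2 + 2 * ratio * cos(2 * pi * f * delay)
    """
    phase = 2.0 * math.pi * frequency_hz * delay_s
    magnitude_squared = 1.0 + ratio ** 2 + 2.0 * ratio * math.cos(phase)
    # Floor the value so that a perfect null (ratio = 1) does not raise an error.
    return 10.0 * math.log10(max(magnitude_squared, 1e-12))


def notch_frequencies(delay_s, f_max_hz):
    """Frequencies (Hz) up to f_max_hz at which the two components cancel."""
    notches = []
    n = 0
    while True:
        frequency = (2 * n + 1) / (2.0 * delay_s)
        if frequency > f_max_hz:
            return notches
        notches.append(frequency)
        n += 1


# Delays spanning the range reported for changes in source elevation.
for delay_us in (100, 200, 300):
    delay = delay_us * 1e-6
    first_notches = [round(f) for f in notch_frequencies(delay, 20000.0)]
    print(f"delay {delay_us} us: nulls near {first_notches} Hz")
```

For the delays of 100 µs to 300 µs reported for changes in elevation, the first null under this simplification moves from 5 kHz down to about 1.7 kHz.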
Furthermore, the amplitude of the reflected components was similar to that of the direct components. An obvious objection to a localization mechanism based on reflected delays is that the time differences involved in the delayed signals are very small relative to the neural events which would presumably be involved in determining the identity of the inverse transformation. However, Wright et al62 demonstrated that when the amplitude ratio between a direct and a time-added component exceeds 0.67, human subjects can easily discriminate 20 µs delays using white noise stimuli presented over headphones. These authors point out that the final spectrum of a spectrally dense signal will depend on the interval between the initial and time-added components. The interaction of different frequency components will depend on the phase relationship between the direct and reflected waves which is, of course, dependent on the time interval between the direct and reflected components. Such a mechanism would be expected to produce a “comb filtering” of the
input signal and produce sharp peaks and nulls in the frequency transformation at frequencies where reinforcing and canceling phase interactions have occurred.62-65 Thus reflected echoes are likely to produce changes in the input signal that can be analyzed in the frequency domain rather than the time domain as suggested by Batteau.60 The perceived location of monaurally presented noise stimuli changes as a function of the magnitude of the delay between the direct and time-added components. Variations in the apparent elevation of the stimulus occur for delays of between 160 µs and 260 µs,63 while systematic variations in the apparent azimuth are produced by 0 to 100 µs changes in the delay.64 Watkins63 derives a quantitative model for vertical localization of sound based on spectral pattern recognition as described by an auto-correlation procedure. The quantitative predictions of median plane localization of low-, high- and band-pass noise are in reasonable agreement with the psychophysical results of Hebrank and Wright.36 There are two main criticisms of the “time delay” models of pinna function. As was pointed out above, analysis in the time domain seems unlikely because of the very small time intervals produced. The acoustic basis of the frequency transformations produced by time-added delays has also been criticized. Invoking an observation by Mach, Shaw58 has argued that the wavelengths of the frequencies of interest are too large to be differentially reflected by the folds of the pinna. Therefore, the variations in the frequency transfer functions, as a function of source location, may not be due to phase interactive effects of the direct and reflected waves. Furthermore, the psychophysical discriminations of time-added delay stimuli may simply be based on the spectral changes necessarily produced by the analog addition of the two electrical signals; the implication of an underlying acoustic analogy in the ear is simply a theoretical expectation.
1.7.3. A diffractive model of pinna function
The third model of pinna function arises from work examining the frequency dependence of the directional amplification of the pinna of nonhuman mammals (cat,66 wallaby,67 guinea pig,68 bat,69 ferret44). There is an increase in the directionality of the ear as a function of stimulus frequency. This has been attributed to the diffractive effects of the aperture of the pinna flap and has been modeled using an optical analogy. For the cat, bat and wallaby there is a good fit between the predicted directionality of a circular aperture, which approximates the pinna aperture, and the directional data obtained from pressures measured within the canal or estimated from cochlear microphonic recordings. For the ferret44 and the guinea pig68 the model provides a good fit to the data only at higher frequencies. These differences undoubtedly reflect the different morphologies of the outer ears of these two groups of animals: The mobile pinnae of the cat, bat
and wallaby resemble a horn collector whereas the pinnae of the guinea pig and ferret more closely resemble the outer ear of the human where the pinna flange is a less dominant structure. As an exclusive model of pinna function, the optical analogy of the diffractive effect does not lend itself easily to an explanation of the directionally-dependent changes in the frequency transfer functions at the primary conchal resonance for both the ferret and the human pinna. The diffractive effect, which is due to the outer dimensions of the pinna flap, probably acts in conjunction with other acoustic effects produced by the convolutions around the ear canal (see Carlile44 for further discussion). In summary, the pinna is probably best described as a directionally sensitive multimodal resonator for narrow-band and relatively continuous signals. For transient sounds, the pinna may act as a reflector, where the time delay between the direct and reflected sound is a unique function of the source location. In this case the signal analysis is unlikely to be in the time domain as was first suggested by Batteau60 but more likely represents a frequency domain analysis of the resulting comb-filtered input. In animals such as the cat where the pinna flap is a dominant feature, the monopole directionality of the ear, particularly for high frequencies, is probably best described by the diffractive effects of the pinna flap.
1.8. JUDGMENTS OF THE DISTANCE OF A SOUND SOURCE
1.8.1. Psychophysical Performance
Human subjects are relatively poor at judging the distance of the sound source under controlled conditions70 (also see Blauert40 for an excellent discussion). As was discussed above, the externalization of a sound heard over headphones could be considered a component of the egocentric distance of the source. However, with headphone listening, the rather unnatural perception of a sound within the head probably results from the failure to provide the auditory system with an adequate set of cues for the generation of an externalized sound source. More conventionally, the perception of the distance of a sound source has been found to be related to the familiarity of the subject with the source of the sound. Gardner71 found that distance estimates were excellent over a 1 m to 9 m range for normal speech but that whispering and shouting led to underestimates and overestimates of distance respectively. Haustein (reported in Blauert40) found that good estimates of distance over much the same range of distances could be obtained for clicks if the subject was pre-exposed to the stimuli over the test range. With unfamiliar sounds, distance estimates are initially poor although subjects also seem to be able to decrease the error of their estimate with increased exposure to the sound, even in the absence of feedback.70
1.8.2. Acoustic cues for the judgment of distance
There are a number of potential physical cues to source distance that the auditory system could utilize (see Coleman72 for review). The simplest cue is one of stimulus level; if the source is located in an anechoic environment, then as the distance of a source is increased the pressure level decreases by 6 dB for each doubling of the distance. If the source level is known or the distance between the receiver and a level invariant source is changed, then pressure level may act as a distance cue. Indeed, it has been shown that the estimated distance of the source of speech is related to the presentation level for distances up to 9 m71,73 although the situation seems to be more complex for impulsive sounds40 and sounds presented in virtual auditory space.74 Certainly, human listeners are sensitive to relatively small variations in pressure. However, as discussed below, there are also distance effects on the spectrum of the sound, which is in turn an important factor in the perception of the loudness. It is unclear how these factors might interact in the perception of sound source distance. Hirsch has argued that if the azimuth direction of a sound source is known from, say, ITD cues, then ILD cues might be useful in determining distance.75 Molino76 tested Hirsch’s original predictions (slightly modified to account for a number of theoretical problems) but found that, for pure tones at least, subjects seem to be unable to utilize this potential cue. There is some evidence that the rank order of distance estimates of noise sound sources placed along the interaural axis was better than for other relative head locations77 although absolute distance judgments did not seem to benefit when there were very few stimulus presentations in the test environment.78 This suggests that, under some rather restricted conditions, binaural information may provide some additional cue to source distance.
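The level cue described at the start of this section follows from spherical spreading in an anechoic field, where pressure falls in inverse proportion to distance. The sketch below, with illustrative function names, converts a change in distance into a change in level and, for a source of known and constant output, inverts the relation to recover distance from a measured level change.

```python
import math


def level_change_db(reference_distance_m, new_distance_m):
    """Change in pressure level (dB) when a source moves from the reference distance."""
    return -20.0 * math.log10(new_distance_m / reference_distance_m)


def distance_from_level_change(reference_distance_m, change_db):
    """Distance (m) implied by a level change, assuming a level-invariant source."""
    return reference_distance_m * 10.0 ** (-change_db / 20.0)


# Each doubling of distance lowers the level by about 6 dB.
for distance in (1.0, 2.0, 4.0, 8.0):
    print(f"{distance:.0f} m: {level_change_db(1.0, distance):6.2f} dB re 1 m")
```

The inversion is only meaningful when the source level is known or fixed, which is the condition the text places on level as a distance cue.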
When the source is close to the head, say less than 2 m to 3 m, the wavefront from a point source will be curved (the radius of curvature being related directly to the distance from the source). As we have seen above, the transfer functions of the outer ears vary as a function of the angle of incidence of the source, so that we would expect that variation in the distance of the source over this range would result in variation in the transfer functions. As well as varying the monaural spectral cues as a function of distance, this could also result in variations in the binaural cues; if the sound is located off the midline then the “apparent angle” of the source will also vary for each ear as a function of distance. Certainly the distance-related variations in the path lengths from the source to each ear produce variation in the interaural time differences that fall into the perceptually detectable range. In this light, it may be important that normal human verbal communication is generally carried out within the range of distances that are relevant to this discussion. There would also be considerable evolutionary pressure for accurate distance estimations within this range, particularly for predators.
At distances greater than 15 m the attenuation of high frequencies by transmission through the air becomes greater than that for low frequencies. This is likely to be increasingly important for frequencies greater than 2 kHz and such attenuation is also affected by meteorological conditions.79 In a leafy biosphere this attenuation is likely to be even more marked. This will lead to lowpass filtering of a sound with a roll-off that will vary with the distance from the source. Butler et al80 demonstrated that the apparent distance of sound sources recorded under both echoic and anechoic conditions varies as a function of the relative differences in the high and low frequency components of the spectra played back over headphones. More recent findings have provided evidence that the spectral content can provide a relative distance cue but this requires repeated exposure to the sound.81 These findings are consistent with the everyday perceptual experiences of the low rumbling sounds of distant thunder or a distant train or aircraft. The discussion above largely ignores the situation where the listening environment is reverberant, as is certainly the case for most human listening experiences. The pattern of reflected sounds reaching a listener in a reverberant environment will vary as a function of the relative geometry of the environment. In general, the relative sound energy levels of the direct and reverberant sound will vary as a function of distance. At locations close to the listener, the input signal will be dominated by the direct signal but at increasing distances the relative level of the reverberant contribution will increase. This effect of reverberation on the perception of the distance of the source was first demonstrated by von Bekesy82 and for many years the recording industry has used the effects of reverberation to increase the perception of both the distance and the spaciousness of a sound.
Mershon and King83 make an important distinction between requiring the subjects to make an estimate of the distance and to report the apparent distance of the source; the former case implies that more than apparent distance should be taken into account and may well confound experiments concerned with the metrics of perceptual variation. Likewise, they argue that nonauditory cues may play an important role in distance estimation, particularly where the observer is required to select from a number of potential targets. In a very simple experiment, these authors assessed the relative contributions of sound level and reverberation to the perception of the apparent distance of a white noise source. Subjects estimated the distance of two sources (2.7 m and 5.5 m) presented at one of two sound levels under either anechoic or reverberant conditions. In summary, they found that while sound level might provide powerful cues to variation in the distance of a source, there was no evidence that it was used as an absolute cue to source distance. Under reverberant conditions, the apparent distance of the source was farther than for sources presented at the same distance in anechoic conditions, thus confirming von Bekesy’s82 earlier observations. Probably the most important finding of this study was
that reverberation provided a cue to the absolute distance of the source as well as providing information about the change in relative distance. However, this study used only a very limited number of conditions to determine the relative contribution of these two distance cues. More recently Mershon et al84 have reported that in conditions of low reflectance distance is systematically underestimated. In a reverberant environment target distance was also underestimated in the presence of background noise, presumably because of the disruption of reverberant cues to distance. There is a considerable amount of work still to be done on the accuracy of distance perception, the effects of different reverberant environments and the interaction of the other potential cues to distance perception. In this context, the use of VAS stimuli combined with simple room acoustical models will be able to make an important and rapid contribution to our understanding of the influence of different factors on distance perception.
2. PSYCHOPHYSICAL SENSITIVITY TO ACOUSTIC CUES TO A SOUND’S LOCATION
2.1. INTRODUCTION
This section examines some of the data on the sensitivity of the human auditory system to the physical cues identified in section 1. It is important to recognize, however, that demonstrating sensitivity to a particular cue does not necessarily demonstrate its involvement in auditory localization. While there is value in such correlative arguments, they are not necessarily demonstrative of a causative connection between the physical parameter and the perception. There are a number of physical parameters which co-vary with the location of a sound source that may be involved in other auditory processes, but do not directly relate to the perception of location itself; for instance, the separation of foreground sounds of interest from background masking sounds (e.g., ref. 82, 85). Likewise, the judgment of just noticeable difference between two stimuli may also be important but determination of the absolute location of a sound source is more likely to also require coding of the magnitude and vector of the differences between two stimuli (chapter 1, section 2.1). Studies examining the sensitivity to a single cue to a sound’s location have been carried out using closed field (headphone) stimulation. This is often referred to as dichotic stimulus presentation and involves the independent stimulation of each ear by headphones or by sound systems sealed into the ears. This system facilitates the precise control of the timing and intensities of the stimuli delivered to each ear. The principal advantage here of course is that a single parameter associated with a sound’s location can be varied in a way which would be impossible by changing the location of the sound in the free field where
all of the parameters co-vary. The stimuli generally employed in these kinds of experiments are not intended to result in the perception of an externally localized image of the sound source (although this is possible86,87), but rather, have been designed to observe the psychophysical or physiological effects of varying or co-varying the binaural cues of time and intensity. We should also bear in mind, however, the recent work suggesting that there may be a number of important interdependencies between different aspects of a stimulus88 (chapter 1, section 1.3). Under such conditions, superposition in system behavior may not be a valid assumption. Despite these limitations, a considerable understanding about the range and limits of stimulus coding by the auditory system can be gained from these studies.
2.2. SENSITIVITY TO INTERAURAL TIME DIFFERENCES
2.2.1. Psychophysical measures of sensitivity to interaural time differences
The auditory nervous system is exquisitely sensitive to changes in the interaural phase or timing of stimuli presented dichotically. The smallest detectable interaural time differences are for noise containing low frequencies, where just noticeable differences can be as low as 6 µs.89 For a 1 kHz tone, the just noticeable variation in interaural phase is between 3° and 4° phase angle. Above this frequency the threshold rises very rapidly, so that phase difference becomes undetectable for frequencies above 1.5 kHz.4,5 These findings are consistent with predictions of the upper frequency limit for unambiguous phase information; i.e., for 1 kHz the wavelength (λ) equals 34 cm and, as the average human interaural distance equals 17.5 cm, the upper limit for precise interaural phase discrimination is almost exactly λ/2. The upper frequency limit for interaural phase sensitivity is also consistent with the physiological limits imposed by the fidelity of phase encoding by the auditory nervous system3 (section 2.2.1). The correspondence between the predicted upper frequency limit for unambiguous phase information and the upper limit of phase discrimination demonstrated by the dichotic experiments has lent considerable support to the duplex theory of sound localization. However, more recent psychophysical experiments have provided evidence that the simple division of time and intensity localization cues on the basis of frequency may be naive6 (see Hafter90 for discussion). Amplitude modulated stimuli with high frequency carriers can be lateralized on the basis of time differences in their envelopes7,91,92 (see also Boerger, quoted in Blauert93).
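The arithmetic behind the λ/2 argument can be checked with a few lines of code. The following is a minimal sketch, assuming a speed of sound of 340 m/s (chosen because it reproduces the 34 cm wavelength quoted for 1 kHz); the function names are illustrative only and do not come from any published toolbox.

```python
# Assumed constant: 340 m/s reproduces the 34 cm wavelength quoted for 1 kHz.
SPEED_OF_SOUND_M_S = 340.0
# Average human interaural distance quoted in the text.
INTERAURAL_DISTANCE_M = 0.175


def wavelength_m(freq_hz):
    """Wavelength in metres of a tone of the given frequency."""
    return SPEED_OF_SOUND_M_S / freq_hz


def half_wavelength_limit_hz(distance_m=INTERAURAL_DISTANCE_M):
    """Frequency at which half a wavelength equals the interaural distance."""
    return SPEED_OF_SOUND_M_S / (2.0 * distance_m)


def phase_to_time_us(phase_deg, freq_hz):
    """Time difference in microseconds equivalent to an interaural phase difference."""
    return phase_deg / 360.0 / freq_hz * 1e6


print(round(wavelength_m(1000.0), 3))            # wavelength at 1 kHz, metres
print(round(half_wavelength_limit_hz()))         # lambda/2 limit, Hz
print(round(phase_to_time_us(3.0, 1000.0), 1))   # 3 degrees at 1 kHz, microseconds
print(round(phase_to_time_us(4.0, 1000.0), 1))   # 4 degrees at 1 kHz, microseconds
```

Under these assumptions the half-wavelength limit falls at roughly 970 Hz, and the 3° to 4° phase threshold at 1 kHz corresponds to a time difference of roughly 8 µs to 11 µs, the same order of magnitude as the 6 µs threshold quoted for low frequency noise.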
Lateralization of these stimuli is not affected by the presence of a low-pass masker7,8 but declines with a reduction in the correlation between the carriers.8 Furthermore, lateralization does not occur for an AM signal matched against a pure tone at the carrier frequency.8 The interaural signal disparity, expressed as the amount of
interaural spectral overlap, was found to affect the sensitivity to interaural time differences for both short tone bursts and clicks.94 These data suggest that the information necessary for the lateralization of amplitude modulated high frequency stimuli is carried in the high frequency channels and that the auditory system is sensitive to the envelope of the time domain signal. It has been known for some time that the lateralization of dichotically presented spectrally dense stimuli is affected by the duration of the stimulus.95,96 Mean ITD thresholds for spectrally dense stimuli decrease with increasing stimulus duration and asymptote around 6 µs for stimulus durations greater than 700 ms.96 Furthermore, the efficacy of onset and ongoing disparities in producing a lateralized image was found to vary as a function of the duration of the stimulus. Onset disparity was completely ineffective in producing lateralization of dichotic white noise stimuli with no ongoing disparities when the stimulus duration was greater than 150 ms.96 Furthermore, onset disparity was dominant only for stimulus durations of less than 2 ms to 4 ms.95 Yost97 measured the ITD thresholds for pulsed sinusoids with onset, ongoing and offset disparities. With the addition of these amplitude envelopes, the ITD thresholds were significantly lower for high frequencies when compared to the data of Zwislocki and Feldman5 and Klump and Eady4 who varied only the interaural phase. As discussed earlier (chapter 1, section 1.3) the recent work of Dye et al88 has led to suggestions that the auditory processing strategy might be determined by the duration of the signal to be analyzed. The perception generated by a short duration sound is dependent on the ‘mean’ of the different cues from which it is composed (so-called synthetic listening) while longer duration stimuli tend to be parsed or streamed into different auditory objects. In this context, the trigger seems to be the duration of the signal.
2.2.2. Multiple signals and the “precedence effect”
The discussion above suggests that the auditory system may be dealing with the onset and ongoing time disparity cues in different ways. A related consideration is the effects of multiple real or apparent sources. In this case, the auditory system must make an important distinction between inputs relating to distinct auditory objects and inputs relating to reflections from nearby surfaces. The interactions between the incident and reflected sounds may therefore affect the way in which the auditory system weights the onset and ongoing components of a sound differently. The experiments discussed above have all relied on dichotically presented time-varying stimuli; however, they are analogous to the so-called precedence (or Haas) effect which relies on free field stimuli40,98,99 (see also introductory review in Moore100). When two similar stimuli are presented from different locations in the free field, the perceived location of the auditory event is dependent on the precise times of arrival of the two events. Stimuli arriving within a
millisecond of each other will produce a fused image that is located at some point between the two sound sources (see footnote d); this has been referred to as summing localization.40 For greater temporal disparities, the later sound is masked until the arrival time disparity is of the order of 5 ms to 40 ms.99 The actual period over which this masking is effective is very dependent on the spectral and temporal characteristics of the sound. Although the later sound is masked (in that it is not perceived as relating to a separate auditory object), it may still have some effect on the perception of the final image. This is particularly the case if there is a large level disparity between the first and subsequent sound. For instance, if the second sound is much louder it can completely override the precedence effect. Wallach et al98 also noted that the transient components in the signal were important in inducing the effect. If the sound levels of two narrow band sources are varied inversely over a period of seconds, the initial perception of the location of the source remains unchanged despite the fact that the level cues are reversed (the Franssen effect).101 However, this effect fails in an anechoic environment or if a broadband sound is used. Our understanding of the mechanisms of the precedence effect provides important insights into the psychoacoustic effects of reverberant room environments102 particularly in the case where multiple sources are concerned.99 The precedence effect is clearly a binaural phenomenon: in a reverberant environment our perception of the reflected sound rarely interferes with our perception of the primary sound. However, the simple expedient of blocking one ear can result in a large increase in our perception of the reverberation to the extent that localization and discrimination of the source can be made very difficult.
The emphasis on the initial components of the signals arriving at each ear has obvious advantages for sound localization in reverberant environments. In a relatively close environment, the first reflected echoes would probably arrive over the 40 ms period following the arrival of the incident sound, so that the ongoing ITDs would be significantly affected over this time course. Consistent with this, Roth et al18 found that ongoing ITDs estimated from cats in an acoustically “live” environment over the first 10 ms to 100 ms exhibited large variations as a function of frequency and location. These authors argued that the large
Footnote d: This is often thought to be the basis of hi-fi stereo; however, unless specific channel delays have been inserted into the program at mix down, the stereophonic effect of most pop music is dependent on the level of the sound in each channel. In cases where two microphones have been used to sample a complex source such as an orchestra, the recordings of the individual instruments in each channel will obviously contain delays related to their relative proximity to each microphone, so that the stereophony will depend on both levels and delays.
variations in ITD would render this cue virtually useless for localization as the interval/position relationships would be at best very complex, or at worst, arbitrary (section 1.2). Hartmann103 compared the localization of broadband noise, low frequency tone impulses and relatively continuous low frequency tones. Consistent with other results, the broadband noise was the most accurately localized of all of the stimuli; however, accuracy was also degraded in highly reverberant conditions. The localization of the low frequency pulse with sharp onset and offset transients was unaffected by room echoes unless these echoes occurred within about 2.5 ms. The same stimulus was virtually unlocalizable when presented without onset transients (6 to 10 seconds rise time). These results indicate the relative importance of the initial components of a signal for accurate localization.
2.2.3. Neurophysiological measures of ITD sensitivity
By recording the electrophysiological activity from the auditory systems of nonhuman mammals we have gained considerable insight into how sound is encoded within the nervous system and, most importantly, the fidelity of these neural codes. As discussed earlier, the ITDs will be manifest as differences in the arrival time of the sound to each ear (onset time differences) and as differences in the phases of the on-going components of the sounds in each ear. An important prerequisite for interaural sensitivity to the phase of a sound is the ability of each ear to encode the phase monaurally. This will have associated with it certain encoding limits. Following the initial encoding, a second process will be required for the binaural comparison of the input phase at each ear. This is likely to involve different neural mechanisms which have, in turn, their specific encoding limitations.
With respect to the monaural encoding of the phase characteristics of a sound, it is clear that phase is only encoded for the middle to low frequency components and that this limit is most likely determined by the auditory receptors in the inner ear.3 When the responses of individual fibers in the auditory nerve are recorded, the ability of the auditory system to lock to the phase of an input sine wave decreases with an increase in frequency. There is still considerable debate about the actual cut-off frequency but most researchers would expect little phase information to be encoded in the mammalian auditory system for frequencies above 4 kHz. In fact, the capacity of these nerve fibers to “lock” to the phase of the stimulus falls rapidly for frequencies greater than 1.5 kHz.104 Recording the responses of auditory neurons at other locations within the auditory nervous system of mammals provides evidence that is broadly consistent with these findings (see Yin and Chan105 and Irvine106 for review). At these lower frequencies the auditory system has been shown to be exquisitely sensitive to interaural phase. Psychophysical experiments
using humans4,5 and other primates (see Houben, 1977, quoted in Brown107) demonstrate threshold discrimination differences of 3° and 11° in interaural phase, respectively. Sensitivity to the interaural disparities in the time of arrival and the ongoing phase of dichotic stimuli has been demonstrated in auditory nuclei of a variety of animals (for recent reviews see Irvine106,108). The convergence of phase sensitive inputs allows the comparison of interaural phase differences. Single neurons recorded in the medial superior olivary nucleus, the first relay nucleus where the inputs from the two ears converge, demonstrate phase-locked responses to stimulation of either ear. Studies of the inputs from each ear suggest that the convergent information is both excitatory and inhibitory and that the timing relationship of the inputs from each ear is critical in determining the output activity of the neurons in the medial superior olive.109,110 For low frequency stimuli, the maximum response of a neuron occurs for a specific interaural delay, regardless of the stimulation frequency. While the total period of the response variation is equal to the period of the stimulus frequency, the maximum response occurs only at an interaural delay that is characteristic of the unit under study. This “characteristic delay” may reflect a neural delay introduced into the signal by neural path length differences from each ear. Thus, individual cells could act as neural coincidence detectors sensitive to particular interaural delays. Such a mechanism for the detection of small interaural time differences was first proposed by Jeffress111 (Fig. 2.10, see also refs. 112-114). The coding of interaural time disparities in the envelopes of amplitude modulated high frequency stimuli115 and for modulated noise bands in the presence of low-pass maskers116 has been demonstrated in the inferior colliculi of the cat and guinea pig respectively.
This has also been confirmed in recent recordings in the lateral superior olive in the cat.117 These findings may represent a neurophysiological correlate for the psychophysical observations that amplitude modulated high frequency stimuli can be lateralized on the basis of interaural time differences.7,91
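The idea of an array of coincidence detectors, each with its own characteristic delay, can be illustrated with a toy place code. The following is a minimal sketch only: the seven characteristic delays, the sign convention for the ITD, the triangular response shape and the 50 µs coincidence window are all assumptions made for illustration and are not taken from the physiological data.

```python
def detector_responses(itd_us, characteristic_delays_us, window_us=50.0):
    """Response of each coincidence detector to a stimulus ITD.

    A detector responds maximally (1.0) when its characteristic delay
    exactly matches the ITD, and its response falls linearly to zero as
    the mismatch approaches window_us (an assumed coincidence window).
    """
    responses = []
    for delay_us in characteristic_delays_us:
        mismatch_us = abs(itd_us - delay_us)
        responses.append(max(0.0, 1.0 - mismatch_us / window_us))
    return responses


def most_active_detector(itd_us, characteristic_delays_us):
    """Index of the detector with the largest response (the 'place' of the code)."""
    responses = detector_responses(itd_us, characteristic_delays_us)
    return max(range(len(responses)), key=responses.__getitem__)


# Hypothetical array of seven detectors spanning -300 to +300 microseconds.
delays_us = [-300.0, -200.0, -100.0, 0.0, 100.0, 200.0, 300.0]

print(most_active_detector(0.0, delays_us))     # midline source
print(most_active_detector(200.0, delays_us))   # ITD matching the sixth detector
print(detector_responses(90.0, delays_us))      # ITD between two characteristic delays
```

In this sketch the ITD is read out from which detector is most active rather than from the firing rate of any single cell. Note that when no detector lies within the coincidence window all responses are zero and the index returned is arbitrary.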
2.3. SENSITIVITY TO INTERAURAL LEVEL DIFFERENCES
A number of investigators have examined ILD cues by making sound pressure measurements at, or in, the ears of experimental subjects or life-like models (sections 1.3, 1.5 and 1.6.3). As we have seen, the pinna and other structures of the auditory periphery act to produce a directionally selective receiver which is frequency dependent.
2.3.1. Psychophysical measures of within frequency sensitivity to ILD
The threshold interaural intensity differences for sounds presented dichotically vary as a function of frequency.118 Thresholds for frequencies below 1 kHz are around 1 dB sound pressure level (SPL re 20 µPa)
Fig. 2.10. The Jeffress model of ITD coding. When a sound is located off the midline there is a difference in the path-lengths from the source to each ear. This results in a difference in the arrival time to each ear. Jeffress proposed that an anatomical difference in path-lengths could be used by the auditory nervous system to encode the interaural time differences. Essentially, information from each ear would converge in the nervous system along neuronal paths that differed in length. For instance, in the nucleus depicted in the figure there is an array of neurones receiving information from both the ipsilateral and contralateral ears. The neurone labeled 1 has the shortest path-length from the contralateral ear and the longest path-length from the ipsilateral ear. If this neurone only responded when the signals from both ears arrived coincidentally (i.e., acts as a coincidence detector) then neurone 1 will be selective for sounds with a large interaural delay favoring the ipsilateral ear. That is, if the sound arrives first at the ipsilateral ear it will have a longer pathway to travel to neurone 1 than the later arriving sound at the contralateral ear. In this way, very small interaural time differences could be converted into a neural place code where each of the neurones in this array (1 to 7) codes for a particular interaural time difference. The resolution of such a system is dependent on the conduction velocities of the fibers conveying the information to each coincidence detector and synaptic ‘jitter’ of the detector; e.g., the excitatory rise time and the reliability of the action potential initiation. Reprinted with permission from Goldberg JM, Brown PB, J Neurophysiol 1969; 32:613-636.
and decrease to about 0.5 dB SPL for 2 kHz to 10 kHz. These changes are in close agreement with those predicted from the measurements of minimum audible angles,119 particularly over the 2 kHz to 5 kHz frequency range. The sensitivity of the auditory system to ILDs at low frequencies has also been investigated indirectly using the so-called ‘time-intensity trading ratio’. In these experiments, the ILDs of a dichotic stimulus with a fixed ITD were varied to determine what ILD offset was required to compensate for a particular ITD; the ratio is expressed as µs/dB (see discussions by Moore100 and Hafter90). Over moderate interaural differences, measurements of this relation vary from 1.7 µs/dB for pure tones to 100 µs/dB for trains of clicks. A number of studies have also suggested that the processing of these cues may not be that complementary; the variance of measurements trading time against a fixed level is different from that obtained by trading level against a fixed time. Furthermore, some studies found that subjects can report two images (one based on the ITD and the other based on the ILD) and not one image as is assumed in the previous studies. This may well reflect an analytic as opposed to synthetic processing strategy. However, regardless of the basis of the time-intensity trading, the important point in this context is that the auditory system is sensitive to ILDs of the order of 1 dB to 2 dB over the low frequency ranges.
2.3.2. Neurophysiological measures of ILD sensitivity
There are two main strategies that could be utilized by the auditory system in the analysis of interaural level differences. The traditional thinking on this matter reflects the way in which the inner ear is thought to encode complex sounds: i.e., principally as a spectral code which is manifest within the central nervous system as a topographical code of frequency.
The analysis of ILD information is thought to occur when these two topographic codes of frequency intersect in the superior olivary nuclei where the analysis is carried out on a frequency by frequency basis. There is a considerable literature documenting the physiological responses of auditory neurones to variation in interaural level differences in a number of different species and auditory nuclei (for recent reviews see Irvine,106,108 Fig. 2.11). There is considerable heterogeneity in the response patterns of individual neurones with a spectrum of response types coding a range of interaural levels. One important neurophysiological finding is that in some nuclei the sensitivity of individual neurones to variations in ILDs using pure tone stimuli can vary considerably with both the frequency of the tones presented and the overall (or average) binaural level at which these stimuli are presented.120-124 This wide variability in response presents some difficulties in interpreting how these neurones might be selective for specific ILDs. However, in at least one nucleus (the superior colliculus), when
broadband stimuli are utilized a consistent pattern of responses to ILDs can be found across a wide range of average binaural levels. That is, the inclusion of a broad range of frequencies in the stimulus results in a response which is selective for ILD per se rather than one confounded by variations in the overall sound level. This leads to consideration of the second strategy by which the auditory system might analyze ILD cues, namely binaural analysis across frequency rather than within frequency. The monaural filter functions
Fig. 2.11. Using a closed field stimulus system the sound level can be varied in each ear about a mean binaural stimulus level or alternatively the level in one ear can be varied while keeping the level in the other ear constant.136 Solid line shows the variation in the firing rate of a binaurally sensitive neurone in the inferior colliculus by varying the contralateral stimulus level. Note that the activity of this neurone is completely inhibited when the interaural stimulus level slightly favors the ipsilateral ear. The dashed line shows the response to stimulation of the contralateral ear alone. There is evidence that the interaural level differences at which neurones are inhibited varies considerably across the population of neurones. This variation is consistent with the idea that ILD may also be encoded by neural place. Reprinted with permission from Irvine DRF et al, Hear Res 1995; 85:127-141.
of each ear combine to produce a complex spectral pattern of binaural differences (section 1.5.1, see also Fig. 2.14). In the traditional view of the ascending auditory system, frequency is thought to be preserved by the place code, so that different channels carry different frequencies. If spectral patterns in either the individual monaural cues or in their binaural comparisons are to be analyzed across frequency, then frequency information must converge rather than remain segregated. There are a number of subcortical nuclei in the auditory system where this kind of convergence appears to occur (for instance the dorsal cochlear nucleus, the dorsal nuclei of the lateral lemniscus, the external nucleus of the inferior colliculus and the deep layers of the superior colliculus). Single neurones in these structures can be shown to have wide and complex patterns of frequency tuning, indicating that frequency information has converged, presumably as a result of the neural selection for some other emergent property of the sound. There have been virtually no physiological or psychophysical studies of the coding of binaural spectral profiles; however, some psychophysical work has been carried out on the sensitivity to monaural (or diotically presented; see footnote e) spectral profiles. The lack of any systematic study of interaural spectral differences (ISDs) is surprising on a number of counts. Firstly, there should be considerable advantage to the auditory system in analyzing ISDs as, unlike monaural spectral cues, the ISD is independent of the spectrum of the sound. The auditory system must make certain assumptions about the spectrum of a signal if monaural spectral cues are to provide location information. By contrast, as long as the overall bandwidth of a sound is sufficiently broad, the ISD is dependent only on the differences between the filtering of the sound by each ear.
As both of the filter functions are very likely to be known by the auditory system, this cue is then unambiguous. However, when the ISDs are calculated for a sound on the audio-visual horizon (e.g., see Fig. 2.14) there is very little asymmetry in the shape of the binaural spectral profile for locations about the interaural axis. In this case, the utility of ISD cues for resolving the front-back ambiguity in the within-frequency binaural cues of ITD and ILD is considerably compromised. However, that these cues play some role in the localization was indicated by a decision theory model of auditory localization behaviour.125 Searle et al125 argued that on the basis of an analysis of a large number of previous localization studies it was likely that the pinna disparity cues provided about the same amount of localization information as the monaural pinna cue.
Footnote e: In this context a dichotic stimulus refers to a stimulus which is presented binaurally and contains some binaural differences. On the other hand a diotic stimulus is presented binaurally but is identical at each ear.
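The claim that the ISD is independent of the source spectrum follows directly from working in decibels: the level at each ear is the source level plus the gain of that ear's filter, so the source term cancels in the interaural difference. The following is a minimal sketch; the four frequency bands and all of the gain values are made-up numbers for illustration, not measured HRTF data.

```python
def spectrum_at_ear_db(source_db, ear_filter_db):
    """Level in each band at one ear: source level plus that ear's filter gain (dB)."""
    return [s + h for s, h in zip(source_db, ear_filter_db)]


def interaural_spectral_difference_db(left_db, right_db):
    """Band-by-band difference between the spectra at the two ears (dB)."""
    return [l - r for l, r in zip(left_db, right_db)]


# Hypothetical filter gains (dB) of each ear for one source location.
left_filter_db = [1.0, 4.0, 10.0, -2.0]
right_filter_db = [3.0, 1.0, 1.0, 2.0]

# Two different source spectra: one flat, one sloping downward with frequency.
flat_source_db = [60.0, 60.0, 60.0, 60.0]
sloped_source_db = [70.0, 64.0, 58.0, 52.0]

for source_db in (flat_source_db, sloped_source_db):
    left_db = spectrum_at_ear_db(source_db, left_filter_db)
    right_db = spectrum_at_ear_db(source_db, right_filter_db)
    # The monaural spectra differ between sources; the ISD does not.
    print(left_db, interaural_spectral_difference_db(left_db, right_db))
```

The monaural spectrum at the left ear changes with the source, whereas the ISD is the same for both sources. The cancellation only holds in bands where the source actually contains energy, which is why the text requires the overall bandwidth of the sound to be sufficiently broad.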
2.4. SENSITIVITY TO CHANGES IN THE SPECTRAL PROFILE OF SOUNDS
The human head related transfer functions (HRTFs) indicate that the spectral profile of a sound varies considerably as a function of its location in space. These variations occur principally over the middle to high frequency range of human hearing where the wavelengths of the sounds are smaller than the principal structures of the auditory periphery. Although there is considerable evidence that the “spectral” cues furnished by the outer ear are important for accurate localization, considerably less is known about how the auditory system encodes these variations in the spectral profile. There is a considerable body of experimental work examining the effects of spectral profile for the low to mid range of frequencies. Green and his colleagues have coined the phrase spectral “profile analysis” to describe the ability of subjects to discriminate variations in the spectral shape of a complex stimulus126 (see also Moore et al127). In these studies, spectrally complex stimuli have generally been constructed by summing a number of sinusoids covering a range of frequencies and determining the sensitivity of subjects to changes in the amplitude of an individual component of this complex (see Green128,129 for review). Randomizing the phase of the components, which varies the time domain waveforms of the stimuli, seems to have no significant effect on this task. This has been interpreted as indicating that the detection of variations is most likely based on variations in the amplitude spectrum rather than waveform. The number of components in the complex, the frequency interval between components and the frequency range of components all affect the detectability of variations in the amplitude of a single component. In particular, detection is best when the components cover a wide range of frequencies and individual components are spaced about 2 critical bands apart.
The frequency of the component which is varied within the complex also has a moderate effect on the sensitivity of detection. However, these variations do not easily discriminate among the various models of auditory function. For our purposes it is sufficient to summarize this work as indicating that the auditory system is indeed sensitive to quite small changes in the spectral profile of a stimulus and, with some limitations, this sensitivity is generally related to the bandwidth of the sound. It seems likely that any analysis of the monaural spectral cues and possibly binaural spectral differences, will involve some kind of process that analyzes information across frequency in a manner similar to that illustrated by the kind of profile analysis experiments carried out by Green and colleagues.
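The distinction between the time domain waveform and the amplitude spectrum of a profile analysis stimulus can be made concrete with a small synthesis example. The following is a minimal sketch using only the Python standard library; the component frequencies, the amplitude increment on the third component, the sample rate and the duration are arbitrary choices for illustration and do not reproduce the stimuli of any particular study.

```python
import math
import random


def tone_complex(freqs_hz, amps, phases_rad, sample_rate_hz, n_samples):
    """Sum of sinusoids with the given amplitudes and starting phases."""
    return [
        sum(a * math.sin(2.0 * math.pi * f * n / sample_rate_hz + p)
            for f, a, p in zip(freqs_hz, amps, phases_rad))
        for n in range(n_samples)
    ]


def component_amplitude(signal, freq_hz, sample_rate_hz):
    """Amplitude of one component, from a single-frequency Fourier projection."""
    w = 2.0 * math.pi * freq_hz / sample_rate_hz
    re = sum(x * math.cos(w * n) for n, x in enumerate(signal))
    im = sum(x * math.sin(w * n) for n, x in enumerate(signal))
    return 2.0 * math.hypot(re, im) / len(signal)


SAMPLE_RATE_HZ = 8000
N_SAMPLES = 800                    # 0.1 s, so every component spans whole cycles
FREQS_HZ = [200, 400, 800, 1600]   # illustrative, logarithmically spaced
AMPS = [1.0, 1.0, 1.5, 1.0]        # third component carries the level increment

rng = random.Random(1)             # fixed seed so the example is repeatable
phases_a = [0.0] * len(FREQS_HZ)
phases_b = [rng.uniform(0.0, 2.0 * math.pi) for _ in FREQS_HZ]

wave_a = tone_complex(FREQS_HZ, AMPS, phases_a, SAMPLE_RATE_HZ, N_SAMPLES)
wave_b = tone_complex(FREQS_HZ, AMPS, phases_b, SAMPLE_RATE_HZ, N_SAMPLES)

measured_a = [component_amplitude(wave_a, f, SAMPLE_RATE_HZ) for f in FREQS_HZ]
measured_b = [component_amplitude(wave_b, f, SAMPLE_RATE_HZ) for f in FREQS_HZ]
print([round(m, 6) for m in measured_a])
print([round(m, 6) for m in measured_b])
```

The two waveforms differ sample by sample, yet the measured component amplitudes are the same for both, which is the sense in which a listener insensitive to phase randomization must be relying on the amplitude spectrum.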
2.5. MODELS OF THE ENCODING OF SPECTRAL PATTERNS BY THE AUDITORY SYSTEM
2.5.1. Models of auditory encoding
In identifying which parameters of the HRTF might be important in the localization of a sound source, it is important to consider ways in which this information is encoded by the auditory nervous system. The HRTF illustrates the way in which the sound arriving at the eardrum is modified by the auditory periphery. There are at least two other important transmission steps before the sound is represented as a sensory signal within the auditory nervous system. These are (1) the band pass transfer function of the middle ear and (2) the transfer function of the inner ear. There are essentially two ways in which these processes might be taken into account (see footnote f). The first is to computationally model what is known about the physical characteristics of these different stages and then combine the HRTFs with such a model. While there have been huge advances in our understanding of the mechanisms of the inner ear over the last decade, it is probably fair to say that such a functional model of the middle and inner ear is still likely to be incomplete. The second approach is to use models of auditory processing that have been derived from psychophysical data. Such models treat the auditory system as a black box and simply try to relate input to perceptual output. An underlying assumption of the spectral process models is that the auditory system can be modeled as a bank of band pass filters, e.g., Glasberg and Moore130 (but see also Patterson131).
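To give a feel for what a bank of band pass filters implies, the sketch below uses the widely cited equivalent rectangular bandwidth (ERB) approximation of Glasberg and Moore (1990), ERB = 24.7(4.37F + 1) Hz with F in kHz. Whether this is exactly the formulation used in the model cited here is an assumption, and the rectangular averaging of decibel values is a deliberately crude stand-in for a real auditory filter shape; the rippled test spectrum is invented for illustration.

```python
def erb_hz(center_freq_hz):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter at a centre frequency.

    Uses the Glasberg and Moore (1990) approximation, valid at moderate levels.
    """
    return 24.7 * (4.37 * center_freq_hz / 1000.0 + 1.0)


def smooth_with_erb(freqs_hz, levels_db):
    """Average the levels falling within one ERB centred on each frequency.

    This rectangular average of dB values is a crude illustration only; it is
    not the rounded filter shape or power summation of a real excitation model.
    """
    smoothed_db = []
    for fc in freqs_hz:
        half_width_hz = erb_hz(fc) / 2.0
        in_band = [lv for f, lv in zip(freqs_hz, levels_db)
                   if abs(f - fc) <= half_width_hz]
        smoothed_db.append(sum(in_band) / len(in_band))
    return smoothed_db


# Invented test spectrum: a +/-5 dB ripple alternating every 100 Hz.
freqs_hz = list(range(500, 10001, 100))
ripple_db = [5.0 if i % 2 == 0 else -5.0 for i in range(len(freqs_hz))]
smoothed_db = smooth_with_erb(freqs_hz, ripple_db)

for fc in (1000, 8000):
    i = freqs_hz.index(fc)
    print(fc, round(erb_hz(fc), 1), ripple_db[i], round(smoothed_db[i], 2))
```

Because the filter bandwidth grows with centre frequency, the same 100 Hz ripple passes through unchanged near 1 kHz but is largely averaged out near 8 kHz in this sketch.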
Such models attempt to take into account (i) the transfer function of the middle ear; (ii) the variation in the frequency filter bandwidth as a function of frequency; and (iii) the variation in the filter shape as a function of level.132-134 The outputs of such models are generally expressed in terms of “neural excitation patterns.” That is, the outputs make predictions about how a sound might appear to be coded across an array of frequency sensitive fibers within the central nervous system (chapter 1, section 1.5). We have combined one model130,135 with the measured HRTFs to estimate the excitation patterns generated by a spectrally flat noise at a nominal spectrum level of 30 dB (see Carlile and Pralong31 for
Footnote f: The most obvious way of taking these processes into account would be to measure the neurophysiological responses after the encoding of the sound by the inner ear. However, this is also highly problematic. Meaningful measurement of the neural responses in humans is not technically possible (animal models can be used in this kind of experiment, which requires surgery to gain access to the neural structures to be recorded from). Additionally, it is not clear that there is as yet a sufficiently good understanding of the nature of the neural code to unequivocally interpret the results of any such measurements.
computational details). The components of the HRTFs which are likely to be most perceptually salient can be determined from such an analysis.
2.5.2. Perceptual saliency of the horizon transfer function
Figure 2.12 shows the calculated excitation patterns generated by a flat noise source at different azimuth locations along the ipsilateral audio-visual horizon. This is referred to as the horizon excitation function and shows the dramatic effects of passing the HRTF filtered noise through these auditory filter models. There are significant differences between the excitation patterns and the HRTF obtained using acoustical measures (cf. Figs. 2.6 and 2.12). One of the most marked effects is the smoothing of much of the fine detail in the HRTF by the auditory filters, particularly at high frequencies. Additionally there are overall changes in the shape of the horizon transfer function; the within-frequency variations are very similar between the horizon transfer function and the horizon excitation function, as the auditory filter is not sensitive to the location of the stimulus. However, the transfer function of the middle ear dramatically changes the shape of the functions across frequency. For frequencies between 2.5 kHz and 8 kHz there is a large increase in the gain of the excitation patterns for anterior locations. By contrast, there is a marked reduction in the excitation pattern for frequencies between 3 kHz and 6 kHz for locations behind the interaural axis (-90° azimuth). Furthermore, there is an increase in the frequency of peak excitation as the source is moved from the anterior to posterior space; this ranges from 0.5 to 1.0 octaves for the eight subjects for whom this was calculated.31
2.5.3. Perceptual saliency of the meridian transfer function
The effect of combining the meridian transfer function with the auditory filter model is shown in Figure 2.13 for sounds located on the anterior median plane.
There is an appreciable reduction in the amount of spectral detail at the higher frequencies. The frequency of the peak of the principal gain feature in the middle frequency range varies as a function of elevation. There is also a trough in the excitation pattern that varies from 5 kHz to 10 kHz as the elevation of the source is increased from 45° below the audio-visual horizon to directly above the head. There are a considerable number of studies that provide evidence for the involvement of spectral cues in median plane localization (chapter 1, section 2.1.4); therefore it is likely that the changes evident in the meridian excitation pattern (Fig. 2.13) are both perceptually salient and perceptually relevant.
2.5.4. Variation in the interaural level differences
The HRTFs can also be used to estimate the pattern of interaural level differences. It is important to note, however, that these differences
Fig. 2.12. The horizon excitation function for the ipsilateral audio-visual horizon has been calculated from the data shown in Fig. 2.6 (see text and Carlile and Pralong31 for method). The gain indicated by the color of the contour is in dB excitation. This calculation takes into account the frequency dependent sensitivity and the spectral filtering properties of the human auditory system and provides a measure of the likely perceptual saliency of different components of the HRTF. Reprinted with permission from Carlile S and Pralong D, J Acoust Soc Am 1994; 95:3445-3459.
will only be calculated by the auditory nervous system following the encoding of the sound by each ear. That is, the interaural level difference is a binaural cue obtained by comparing the neurally encoded outputs of each ear. For this reason we have estimated the neurally encoded ILDs from the excitation patterns that are calculated for each ear by subtracting the excitation pattern obtained at the contralateral ear from that at the ipsilateral ear. The horizon ILDs calculated for the left and right ears are shown in Figure 2.14. The pattern of changes in ILD within any one frequency band demonstrates a non-monotonic change with location, thereby illustrating the “cone of confusion” type ambiguity. Not surprisingly, because of the complex filtering properties of the outer ear, the patterns of within-frequency changes in ILD are often more complex than predicted by a simple spherical head model. In general, however, ILDs peak around the interaural axis, particularly for the mid to high frequencies where the changes are sufficiently large to be acoustically and perceptually relevant.
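The subtraction described above can be sketched in a few lines. This is only an illustration, assuming that the excitation patterns are already available in dB as arrays indexed by frequency band and source location; the function name and the numerical values are hypothetical and are not taken from the measured data.

```python
import numpy as np

def ild_from_excitation(ipsi_db, contra_db):
    """Estimate ILDs (dB) as ipsilateral minus contralateral excitation."""
    ipsi_db = np.asarray(ipsi_db, dtype=float)
    contra_db = np.asarray(contra_db, dtype=float)
    if ipsi_db.shape != contra_db.shape:
        raise ValueError("excitation patterns must have the same shape")
    return ipsi_db - contra_db

# Hypothetical excitation patterns in dB (illustrative values only):
# rows are frequency bands, columns are three azimuths from front to back.
ipsi = np.array([[10.0, 14.0, 12.0],
                 [8.0, 18.0, 15.0]])
contra = np.array([[9.0, 6.0, 8.0],
                   [7.0, 2.0, 5.0]])

ild = ild_from_excitation(ipsi, contra)
print(ild)
```

In this made-up example the ILD in each band is largest at the middle location and smaller on either side, which is the kind of non-monotonic change with location that gives rise to the ambiguity discussed in the text.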
Fig. 2.13. The meridian excitation functions have been calculated from the data shown in Fig. 2.7. All other details as for Fig. 2.12. Reprinted with permission from Carlile S, Pralong D, J Acoust Soc Am 1994; 95:3445-3459.
Fig. 2.14. The horizon transfer function provides a measure of how one outer ear modifies the sound as a function of its location. These data can also be used to estimate how the binaural spectral cues might vary as a function of location. In this figure the binaural spectral cues generated for locations on the audio-visual horizon of the (a) left ear and (b) right ear are shown. The level in dB of the binaural difference is indicated by the color of the contour. Note that the differences between these plots result from differences in the filtering properties of the left and right ears respectively, indicating a significant departure from biological symmetry in these structures. Reprinted with permission from Carlile S, Pralong D, J Acoust Soc Am 1994; 95:3445-3459.
There are considerable differences evident in the horizon ILDs calculated for each ear for this subject. Similar kinds of inter-ear differences were seen in seven other subjects; however those shown in Figure 2.14 represent the largest differences observed. These data demonstrate that there is sufficient asymmetry in the ears to produce what are very likely to be perceptually salient differences in the patterns of ILDs at each ear (for sounds ipsilateral to that ear). Clearly, estimates of ILD that are based on recordings from one ear and assume that the ears are symmetrical should be treated with some caution. These interaural variations were also sufficient to produce ILDs of greater than 9 dB for a sound located on the anterior median plane.31
2.6. PINNA FILTERING AND THE CONSTANCY OF THE PERCEPTION OF AUDITORY OBJECTS
One of the most striking effects related to the spectral profile of the sound can be obtained by simply playing noise filtered by different HRTFs through a loudspeaker in the free field. The changes in the spectral profiles from HRTF to HRTF are discernible as significant changes in the timbre of the sound. Clearly in this demonstration the noise is being filtered a second time by the listener’s ears but the point here is that the differences between sounds filtered by different HRTFs are very obvious. However, the most remarkable feature of this demonstration is that when these noise stimuli are heard over headphones appropriately filtered by the left and right HRTFs, the large timbral differences are not apparent but rather these differences have been mapped onto spatial location. That is, there seems to be some kind of perceptual constancy regarding the auditory object so that the changes in the input spectra are perceived as variation in the spatial location of the object rather than in the spectral characteristics of the sound. In the visual system there is an analogous example to do with color constancy. The color we perceive an object to be is related to the light reflected from that object. The problem is that under different types of illumination, the wavelengths of the light that is actually reflected can vary considerably. Despite this we still perceive a leaf to be green regardless of whether we are looking at the leaf at dawn, midday or dusk, or even under a fluorescent light. The solution to this problem is that the visual system is sensitive to the ratio of light reflected at different wavelengths. Whether there is some kind of similar auditory mechanism relying on, for instance, the comparison between the frequency bands or between the sounds at each ear is, of course, unknown at this time.
Such a process might also indicate a close interaction between the analytical system responsible for localization processing and pattern recognition.
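The headphone demonstration described in this section amounts to filtering one noise signal with a left-ear and a right-ear filter. A minimal sketch follows, assuming that the HRTFs are available as time-domain impulse responses of equal length. The two short impulse responses used here are placeholders (a unit impulse, and a delayed, attenuated impulse), not measured HRTFs, and the function name is hypothetical.

```python
import numpy as np

def render_binaural(signal, hrir_left, hrir_right):
    """Filter a mono signal with a left/right impulse response pair."""
    if len(hrir_left) != len(hrir_right):
        raise ValueError("this sketch assumes equal-length impulse responses")
    left = np.convolve(signal, hrir_left)
    right = np.convolve(signal, hrir_right)
    return np.stack([left, right], axis=0)  # shape: (2, n_samples)

rng = np.random.default_rng(0)
noise = rng.standard_normal(1000)  # broadband noise stimulus

# Placeholder impulse responses (assumption, for illustration only):
# the right ear receives the sound 3 samples later and at half amplitude.
hrir_l = np.array([1.0, 0.0, 0.0, 0.0])
hrir_r = np.array([0.0, 0.0, 0.0, 0.5])

stereo = render_binaural(noise, hrir_l, hrir_r)
print(stereo.shape)
```

Presenting the two rows to the left and right headphone channels corresponds to the headphone condition; playing either row through a single loudspeaker corresponds to the free field condition in which the timbral differences are heard.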
3. CONCLUDING REMARKS
In this chapter we have canvassed a wide range of cues to the location of a sound in space. We have also considered the sensitivity
of the auditory system to these cues as indicated by psychophysical experiments on humans and by directly measuring the neural responses in animals. Both the acoustic and perceptual studies give us some insight into the kinds of signals that a VAS display must transduce. Although the sensitivity studies have generally examined each localization cue in isolation, they provide data as to the quantum of each cue that is detectable by the auditory system. Although there is some evidence of complex non-linearities in the system and of complex interdependencies between the cues, these studies provide, at least to a first approximation, an idea of the resolution which we would expect a high fidelity VAS display to achieve.
ACKNOWLEDGMENTS
Drs. Martin, Morey, Parker, Pralong and Professor Irvine are warmly acknowledged for comments on a previous version of this chapter. Some of the recent bioacoustic and excitation modeling work reported in this chapter was supported by the National Health and Medical Research Council (Australia), the Australian Research Council and the University of Sydney. The Auditory Neuroscience Laboratory maintains a Web page outlining the laboratory facilities and current research work at http://www.physiol.usyd.edu.au/simonc.
REFERENCES
1. Woodworth RS, Schlosberg H. Experimental Psychology. New York: Holt, Rinehart and Winston, 1962. 2. Woodworth RS. Experimental Psychology. New York: Holt, 1938. 3. Palmer AR, Russell IJ. Phase-locking in the cochlear nerve of the guinea-pig and its relation to the receptor potential of inner hair-cells. Hear Res 1986; 24:1-15. 4. Klump RG, Eady HR. Some measurements of interaural time difference thresholds. J Acoust Soc Am 1956; 28:859-860. 5. Zwislocki J, Feldman RS. Just noticeable differences in dichotic phase. J Acoust Soc Am 1956; 28:860-864. 6. Trahiotis C, Robinson DE. Auditory psychophysics. Ann Rev Psychol 1979; 30:31-61. 7. Henning GB. Detectability of interaural delay in high-frequency complex waveforms. J Acoust Soc Am 1974; 55:84-90. 8. Nuetzel JM, Hafter ER. Lateralization of complex waveforms: effects of fine structure, amplitude, and duration. J Acoust Soc Am 1976; 60:1339-1346. 9. Saberi K, Hafter ER. A common neural code for frequency- and amplitude-modulated sounds. Nature 1995; 374:537-539. 10. Shaw EAG. The external ear. In: Keidel WD, Neff WD, ed. Handbook of Sensory physiology. Berlin: Springer-Verlag, 1974:455-490. 11. Middlebrooks JC, Makous JC, Green DM. Directional sensitivity of sound-pressure levels in the human ear canal. J Acoust Soc Am 1989; 86:89-108.
12. Searle CL, Braida LD, Cuddy DR et al. Binaural pinna disparity: another auditory localization cue. J Acoust Soc Am 1975; 57:448-455. 13. Hartley RVL, Fry TC. The binaural location of pure tones. Physics Rev 1921; 18:431-42. 14. Nordlund B. Physical factors in angular localization. Acta Otolaryngol 1962; 54:75-93. 15. Kuhn GF. Model for the interaural time differences in the horizontal plane. J Acoust Soc Am 1977; 62:157-167. 16. Feddersen WE, Sandel TT, Teas DC et al. Localization of high-frequency tones. J Acoust Soc Am 1957; 29:988-991. 17. Abbagnaro LA, Bauer BB, Torick EL. Measurements of diffraction and interaural delay of a progressive sound wave caused by the human head. II. J Acoust Soc Am 1975; 58:693-700. 18. Roth GL, Kochhar RK, Hind JE. Interaural time differences: Implications regarding the neurophysiology of sound localization. J Acoust Soc Am 1980; 68:1643-1651. 19. Brillouin L. Wave propagation and group velocity. New York: Academic, 1960. 20. Gaunaurd GC, Kuhn GF. Phase- and group-velocities of acoustic waves around a sphere simulating the human head. J Acoust Soc Am 1980; Suppl. 1:57. 21. Ballantine S. Effect of diffraction around the microphone in sound measurements. Phys Rev 1928; 32:988-992. 22. Kinsler LE, Frey AR. Fundamentals of acoustics. New York: John Wiley and Sons, 1962. 23. Shaw EAG. Transformation of sound pressure level from the free field to the eardrum in the horizontal plane. J Acoust Soc Am 1974; 56: 1848-1861. 24. Shaw EAG. The acoustics of the external ear. In: Studebaker GA, Hochberg I, ed. Acoustical factors affecting hearing aid performance. Baltimore: University Park Press, 1980:109-125. 25. Djupesland G, Zwislocki JJ. Sound pressure distribution in the outer ear. Acta Otolaryng 1973; 75:350-352. 26. Kuhn GF. Some effects of microphone location, signal bandwidth, and incident wave field on the hearing aid input signal. In: Studebaker GA, Hochberg I, ed. Acoustical factors affecting hearing aid performance. 
Baltimore: University Park Press, 1980:55-80. 27. Pralong D, Carlile S. Measuring the human head-related transfer functions: A novel method for the construction and calibration of a miniature “in-ear” recording system. J Acoust Soc Am 1994; 95:3435-3444. 28. Wightman FL, Kistler DJ, Perkins ME. A new approach to the study of human sound localization. In: Yost WA, Gourevitch G, ed. Directional Hearing. New York: Academic, 1987:26-48. 29. Wightman FL, Kistler DJ. Headphone simulation of free field listening. I: Stimulus synthesis. J Acoust Soc Am 1989; 85:858-867.
30. Hellstrom P, Axelsson A. Miniature microphone probe tube measurements in the external auditory canal. J Acoust Soc Am 1993; 93:907-919. 31. Carlile S, Pralong D. The location-dependent nature of perceptually salient features of the human head-related transfer function. J Acoust Soc Am 1994; 95:3445-3459. 32. Belendiuk K, Butler RD. Monaural location of low-pass noise bands in the horizontal plane. J Acoust Soc Am 1975; 58:701-705. 33. Butler RA, Belendiuk K. Spectral cues utilized in the localization of sound in the median sagittal plane. J Acoust Soc Am 1977; 61:1264-1269. 34. Flannery R, Butler RA. Spectral cues provided by the pinna for monaural localization in the horizontal plane. Percept and Psychophys 1981; 29:438-444. 35. Movchan EV. Participation of the auditory centers of Rhinolophus ferrumequinum in echolocational tracking of a moving target. [Russian]. Neirofiziologiya 1984; 16:737-745. 36. Hebrank J, Wright D. Spectral cues used in the localization of sound sources on the median plane. J Acoust Soc Am 1974; 56:1829-1834. 37. Hammershoi D, Moller H, Sorensen MF. Head-related transfer functions: measurements on 24 human subjects. Presented at Audio Engineering Society. Amsterdam: 1992:1-30. 38. Moller H, Sorensen MF, Hammershoi D et al. Head-related transfer functions of human subjects. J Audio Eng Soc 1995; 43:300-321. 39. Bloom PJ. Determination of monaural sensitivity changes due to the pinna by use of minimum-audible-field measurements in the lateral vertical plane. J Acoust Soc Am 1977; 61:820-828. 40. Blauert J. Spatial Hearing: The psychophysics of human sound localization. Cambridge, Mass.: MIT Press, 1983. 41. Shaw EAG. Earcanal pressure generated by a free sound field. J Acoust Soc Am 1966; 39:465-470. 42. Mehrgardt S, Mellert V. Transformation characteristics of the external human ear. J Acoust Soc Am 1977; 61:1567-1576. 43. Rabbitt RD, Friedrich MT. Ear canal cross-sectional pressure distributions: mathematical analysis and computation.
J Acoust Soc Am 1991; 89:2379-2390. 44. Carlile S. The auditory periphery of the ferret. I: Directional response properties and the pattern of interaural level differences. J Acoust Soc Am 1990; 88:2180-2195. 45. Khanna SM, Stinson MR. Specification of the acoustical input to the ear at high frequencies. J Acoust Soc Am 1985; 77:577-589. 46. Stinson MR, Khanna SM. Spatial distribution of sound pressure and energy flow in the ear canals of cats. J Acoust Soc Am 1994; 96:170-181. 47. Chan JCK, Geisler CD. Estimation of eardrum acoustic pressure and of ear canal length from remote points in the canal. J Acoust Soc Am 1990; 87:1237-1247. 48. Teranishi R, Shaw EAG. External-ear acoustic models with simple geometry. J Acoust Soc Am 1968; 44:257-263.
49. Shaw EAG. The external ear: new knowledge. In: Dalsgaard SC, ed. Earmolds and associated problems-Proceedings of the seventh Danavox Symposium. 1975:24-50. 50. Knudsen EI, Konishi M, Pettigrew JD. Receptive fields of auditory neurons in the owl. Science (Washington, DC) 1977; 198:1278-1280. 51. Middlebrooks JC, Green DM. Sound localization by human listeners. Annu Rev Psychol 1991; 42:135-159. 52. Middlebrooks JC. Narrow-band sound localization related to external ear acoustics. J Acoust Soc Am 1992; 92:2607-2624. 53. Price GR. Transformation function of the external ear in response to impulsive stimulation. J Acoust Soc Am 1974; 56:190-194. 54. Carlile S. The auditory periphery of the ferret. II: The spectral transformations of the external ear and their implications for sound localization. J Acoust Soc Am 1990; 88:2196-2204. 55. Asano F, Suzuki Y, Sone T. Role of spectral cues in median plane localization. J Acoust Soc Am 1990; 88:159-168. 56. Kuhn GF, Guernsey RM. Sound pressure distribution about the human head and torso. J Acoust Soc Am 1983; 73:95-105. 57. Shaw EAG, Teranishi R. Sound pressure generated in an external-ear replica and real human ears by a nearby point source. J Acoust Soc Am 1968; 44:240-249. 58. Shaw EAG. 1979 Rayleigh Medal lecture: the elusive connection. In: Gatehouse RW, ed. Localisation of sound: theory and application. Connecticut: Amphora Press, 1982:13-27. 59. Shaw EAG. External ear response and sound localization. In: Gatehouse RW, ed. Localisation of sound: theory and application. Connecticut: Amphora Press, 1982:30-41. 60. Batteau DW. The role of the pinna in human localization. Proc Royal Soc B 1967; 158:158-180. 61. Hiranaka Y, Yamasaki H. Envelope representations of pinna impulse responses relating to three-dimensional localization of sound sources. J Acoust Soc Am 1983; 73:291-296. 62. Wright D, Hebrank JH, Wilson B. Pinna reflections as cues for localization. J Acoust Soc Am 1974; 56:957-962. 63. Watkins AJ.
Psychoacoustic aspects of synthesized vertical locale cues. J Acoust Soc Am 1978; 63:1152-1165. 64. Watkins AJ. The monaural perception of azimuth: a synthesis approach. In: Gatehouse RW, ed. Localisation of sound: theory and application. Connecticut: Amphora Press, 1982:194-206. 65. Rogers CAP. Pinna transformations and sound reproduction. J Audio Eng Soc 1981; 29:226-234. 66. Calford MB, Pettigrew JD. Frequency dependence of directional amplification at the cat’s pinna. Hearing Res 1984; 14:13-19. 67. Coles RB, Guppy A. Biophysical aspects of directional hearing in the Tammar wallaby, Macropus eugenii. J Exp Biol 1986; 121:371-394. 68. Carlile S, Pettigrew AG. Directional properties of the auditory periphery
in the guinea pig. Hear Res 1987; 31:111-122. 69. Guppy A, Coles RB. Acoustical and neural aspects of hearing in the Australian gleaning bats, Macroderma gigas and Nyctophilus gouldi. J Comp Physiol A 1988; 162:653-668. 70. Coleman PD. Failure to localize the source distance of an unfamiliar sound. J Acoust Soc Am 1962; 34:345-346. 71. Gardner MB. Distance estimation of 0° or apparent 0°-orientated speech signals in anechoic space. J Acoust Soc Am 1969; 45:47-53. 72. Coleman PD. An analysis of cues to auditory depth perception in free space. Psychol Bull 1963; 60:302-315. 73. Ashmead DH, LeRoy D, Odom RD. Perception of the relative distances of nearby sound sources. Percept and Psychophys 1990; 47:326-331. 74. Begault D. Preferred sound intensity increase for sensations of half distance. Perceptual and Motor Skills 1991; 72:1019-1029. 75. Hirsch HR. Perception of the range of a sound source of unknown strength. J Acoust Soc Am 1968; 43:373-374. 76. Molino J. Perceiving the range of a sound source when the direction is known. J Acoust Soc Am 1973; 53:1301-1304. 77. Holt RE, Thurlow WR. Subject orientation and judgement of distance of a sound source. J Acoust Soc Am 1969; 46:1584-1585. 78. Mershon DH, Bowers JN. Absolute and relative cues for the auditory perception of egocentric distance. Perception 1979; 8:311-322. 79. Ingard U. A review of the influence of meteorological conditions on sound propagation. J Acoust Soc Am 1953; 25:405-411. 80. Butler RA, Levy ET, Neff WD. Apparent distance of sound recorded in echoic and anechoic chambers. J Exp Psychol: Hum Percept Perform 1980; 6:745-50. 81. Little AD, Mershon DH, Cox PH. Spectral content as a cue to perceived auditory distance. Perception 1992; 21:405-416. 82. Bekesy GV. Experiments in hearing. USA: McGraw-Hill Book Company, 1960. 83. Mershon DH, King LE. Intensity and reverberation as factors in the auditory perception of egocentric distance. Percept Psychophys 1975; 18:409-415. 84.
Mershon DH, Ballenger WL, Little AD et al. Effects of room reflectance and background noise on perceived auditory distance. Perception 1989; 18:403-416. 85. Saberi K, Perrott DR. Minimum audible movement angles as a function of sound source trajectory. J Acoust Soc Am 1990; 88:2639-2644. 86. Plenge G. On the differences between localization and lateralization. J Acoust Soc Am 1974; 56:944-951. 87. Sayers BM, Cherry EC. Mechanism of binaural fusion in the hearing of speech. J Acoust Soc Am 1957; 29:973-987. 88. Dye RH, Yost WA, Stellmack MA et al. Stimulus classification procedure for assessing the extent to which binaural processing is spectrally analytic or synthetic. J Acoust Soc Am 1994; 96:2720-2730.
89. Zerlin S. Interaural time and intensity difference and the MLD. J Acoust Soc Am 1966; 39:134-137. 90. Hafter ER. Spatial hearing and the duplex theory: how viable is the model? New York: John Wiley and Sons, 1984. 91. McFadden D, Pasanen EG. Lateralization at high frequencies based on interaural time differences. J Acoust Soc Am 1976; 59:634-639. 92. McFadden D, Moffitt CM. Acoustic integration for lateralization at high frequencies. J Acoust Soc Am 1977; 61:1604-1608. 93. Blauert J. Binaural localization. Scand Audiol 1982; Suppl.15:7-26. 94. Poon PWF, Hwang JC, Yu WY et al. Detection of interaural time difference for clicks and tone pips: effects of interaural signal disparity. Hear Res 1984; 15:179-185. 95. Tobias JV, Schubert ED. Effective onset duration of auditory stimuli. J Acoust Soc Am 1959; 31:1595-1605. 96. Tobias JV, Zerlin S. Lateralization threshold as a function of stimulus duration. J Acoust Soc Am 1959; 31:1591-1594. 97. Yost WA. Lateralization of pulsed sinusoids based on interaural onset, ongoing, and offset temporal differences. J Acoust Soc Am 1977; 61:190-194. 98. Wallach H, Newman EB, Rosenzweig MR. The precedence effect in sound localization. Am J Psych 1949; 62:315-337. 99. Zurek PM. The precedence effect. In: Yost WA, Gourevitch G, ed. Directional hearing. New York: Springer-Verlag, 1987:85-105. 100. Moore BCJ. An introduction to the psychology of hearing. 3rd Edition. London: Academic Press, 1989. 101. Hartmann WM, Rakerd B. On the minimum audible angle–A decision theory approach. J Acoust Soc Am 1989; 85:2031-2041. 102. Berkley DA. Hearing in rooms. In: Yost WA, Gourevitch G, ed. Directional hearing. New York: Springer-Verlag, 1987:249-260. 103. Hartmann WM. Localization of sound in rooms. J Acoust Soc Am 1983; 74:1380-1391. 104. Johnson DH. The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones. J Acoust Soc Am 1980; 68:1115-1122. 105. Yin TCT, Chan JCK. 
Neural mechanisms underlying interaural time sensitivity to tones and noise. In: Edelman GM, Gall WE, Cowan WM, ed. Auditory Function. Neurobiological basis of hearing. New York: John Wiley and Sons, 1988:385-430. 106. Irvine DRF. Physiology of the auditory brainstem. In: AN Popper, RR Fay, ed. The mammalian Auditory pathway: Neurophysiology. New York: Springer-Verlag, 1992:153-231. 107. Brown CH, Beecher MD, Moody DB et al. Localization of pure tones by old world monkeys. J Acoust Soc Am 1978; 63:1484-1492. 108. Irvine DRF. The auditory brainstem. Berlin: Springer-Verlag, 1986. 109. Sanes D. An in vitro analysis of sound localization mechanisms in the gerbil lateral superior olive. J Neurosci 1990; 10:3494-3506.
110. Wu SH, Kelly JB. Binaural interaction in the lateral superior olive: Time difference sensitivity studied in mouse brain slice. J Neurophysiol 1992; 68:1151-1159. 111. Jeffress LA. A place theory of sound localization. J Comp Physiol Psychol 1948; 41:35-39. 112. Goldberg JM, Brown PB. Response of binaural neurons of dog superior olivary complex to dichotic tonal stimuli: some physiological mechanisms of sound localization. J Neurophysiol 1969; 32:613-636. 113. Yin TCT, Chan JCK, Carney LH. Effects of interaural time delays of noise stimuli on low-frequency cells in the cat’s inferior colliculus. III Evidence for cross-correlation. J Neurophysiol 1987; 58:562-583. 114. Smith PH, Joris PX, Yin TC. Projections of physiologically characterized spherical bushy cell axons from the cochlear nucleus of the cat: evidence for delay lines to the medial superior olive. J Comp Neurol 1993; 331:245-260. 115. Yin HS, Mackler SA, Selzer ME. Directional specificity in the regeneration of Lamprey spinal axons. Science 1984; 224:894-895. 116. Crow G, Langford TL, Moushegian G. Coding of interaural time differences by some high-frequency neurons of the inferior colliculus: Responses to noise bands and two-tone complexes. Hear Res 1980; 3:147-153. 117. Joris P, Yin TCT. Envelope coding in the lateral superior olive. I. Sensitivity to interaural time differences. J Neurophysiol 1995; 73:1043-1062. 118. Mills AW. Lateralization of high-frequency tones. J Acoust Soc Am 1960; 32:132-134. 119. Mills AW. On the minimum audible angle. J Acoust Soc Am 1958; 30:237-246. 120. Hirsch JA, Chan JCK, Yin TCT. Responses of neurons in the cat’s superior colliculus to acoustic stimuli. I. Monaural and binaural response properties. J Neurophysiol 1985; 53:726-745. 121. Wise LZ, Irvine DRF. Interaural intensity difference sensitivity based on facilitatory binaural interaction in cat superior colliculus. Hear Res 1984; 16:181-188. 122. Semple MN, Kitzes LM.
Binaural processing of sound pressure level in cat primary auditory cortex: Evidence for a representation based on absolute levels rather than the interaural level difference. J Neurophysiol 1993; 69. 123. Irvine DRF, Rajan R, Aitkin LM. Sensitivity to interaural intensity differences of neurones in primary auditory cortex of the cat: I types of sensitivity and effects of variations in sound pressure level. J Neurophysiol 1996; (in press). 124. Irvine DRF, Gago G. Binaural interaction in high frequency neurones in inferior colliculus in the cat: Effects on the variation in sound pressure level on sensitivity to interaural intensity differences. J Neurophysiol 1990; 63. 125. Searle CL, Braida LD, Davis MF et al. Model for auditory localization. J Acoust Soc Am 1976; 60:1164-1175.
126. Spiegel MF, Green DM. Signal and masker uncertainty with noise maskers of varying duration, bandwidth, and center frequency. J Acoust Soc Am 1982; 71:1204-1211. 127. Moore BC, Oldfield SR, Dooley GJ. Detection and discrimination of spectral peaks and notches at 1 and 8 kHz. J Acoust Soc Am 1989; 85:820-836. 128. Green DM. Auditory profile analysis: some experiments on spectral shape discrimination. In: Edelman GM, Gall WE, Cowan WM, ed. Auditory function: Neurobiological basis of hearing. New York: John Wiley and Sons, 1988:609-622. 129. Green DM. Profile analysis: Auditory intensity discrimination. New York: Oxford University Press, 1988. 130. Glasberg BR, Moore BC. Derivation of auditory filter shapes from notched-noise data. Hear Res 1990; 47:103-138. 131. Patterson RD. The sound of a sinusoid: time-interval models. J Acoust Soc Am 1994; 96:1419-1428. 132. Patterson RD. Auditory filter shapes derived with noise stimuli. J Acoust Soc Am 1976; 59:640-654. 133. Rosen S, Baker RJ. Characterising auditory filter nonlinearity. Hear Res 1994; 73:231-243. 134. Patterson RD, Moore BCJ. Auditory filters and excitation patterns as representations of frequency resolution. In: Moore BCJ, ed. Frequency selectivity in hearing. London: Academic Press, 1986:123-177. 135. Moore BCJ, Glasberg BR. Formulae describing frequency selectivity as a function of frequency and level, and their use in calculating excitation patterns. Hear Res 1987; 28:209-225. 136. Irvine DRF, Park VN, Mattingly JB. Responses of neurones in the inferior colliculus of the rat to interaural time and intensity differences in transient stimuli: implications for the latency hypothesis. Hear Res 1995; 85:127-141.
CHAPTER 3
DIGITAL SIGNAL PROCESSING FOR THE AUDITORY SCIENTIST: A TUTORIAL INTRODUCTION Philip Leong, Tim Tucker and Simon Carlile
1. INTRODUCTION
1.1. OBJECTIVES OF THIS TUTORIAL
This chapter is an introduction to digital signal processing (DSP) concepts used in the generation and analysis of auditory signals. It is not a mathematical exposition on DSP, but rather a qualitative description of the DSP techniques commonly used in the processing of audio signals. Readers should refer to the annotated bibliography for a more detailed coverage of specific DSP techniques.
1.2. SIGNAL PROCESSING IN AUDITORY RESEARCH
The microprocessor has revolutionized signal processing by shifting the emphasis away from analog techniques. DSP enables the auditory scientist to generate, record and analyze audio signals with greater accuracy, convenience and resolution. Entirely new signal processing techniques such as the generation of complex stimuli, linear phase filtering and Fourier transforms have become invaluable tools in most fields of science and engineering. An advantage of the digital approach is that system control is performed primarily through software, and thus changes to the system involve changes in software rather than hardware. Other advantages include improved tolerance to noise, arbitrarily high signal to noise ratios, no changes in performance due to component variation, ambient temperature or aging, and ease of data storage. Although general purpose microprocessors such as the Intel Pentium can be used to perform digital signal processing, it is more common to use special purpose digital signal processors for real-time applications. Digital signal processors such as the AT&T DSP32C and the Texas Instruments TMS320C40 are microprocessors which are optimized to perform common digital signal processing operations such as Fourier transforms and filtering at maximum speed. Optimized software libraries for implementing most of the common techniques described in this chapter are available, allowing users to perform DSP without concern for the intricate details of DSP programming.
2. THE NATURE OF SOUND
2.1. SOUND AND MEASUREMENT
Sound is created by a source generating a mechanical wave in the surrounding medium. Assuming the usual case of the medium being air, air molecules conduct the vibrations, propagating the sound in all directions away from the source. A listener can detect the sound because air pressure changes cause mechanical vibrations in the listener’s eardrums which then pass through the middle ear to be transduced into neural signals in the inner ear. Sound sources create changes in pressure by causing a mechanical particle displacement. For example, striking a drum causes it to vibrate, and the air around it is alternately compressed and rarefied as the skin moves backwards and forwards. Note that the sound waves are longitudinal, and the backwards and forwards particle displacement is in the same direction as the propagation of the sound wave. Other sound sources include vibrating strings (e.g., violin, piano), vibrating air (e.g., organ, flute, human vocal tract) and vibrating solids (e.g., drums, loudspeaker, door knocker). Although sounds are longitudinal waves, they can be drawn as transverse waves (displacement being at right angles to the direction of propagation) like the sinusoidal waveform shown in Figure 3.1. Sinusoids are of particular importance since we will later see that all practical waveforms can be constructed from a sum of sinusoids. A sinusoid can be completely characterized by three parameters: frequency (or its reciprocal, the period), amplitude and phase. Frequency is related to the pitch of the sound and amplitude is related to its loudness. The phase is only important when we are dealing with more than one sinusoid, and it determines the relative starting times of the sinusoids. The human ear can detect sounds over a wide range of frequencies, from 20 Hz to about 20 kHz in the case of a healthy, young adult, and can detect pressure ratios of approximately $1 \times 10^{7}$.
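The three parameters that characterize a sinusoid can be made concrete with a short sketch. It assumes a discrete-time representation with a sampling rate `fs_hz`, a concept that belongs to the later discussion of digitization; the function name and the chosen values are illustrative only.

```python
import numpy as np

def sinusoid(freq_hz, amplitude, phase_rad, duration_s, fs_hz):
    """Return sample times and a sinusoid with the given parameters."""
    n = int(round(duration_s * fs_hz))
    t = np.arange(n) / fs_hz
    return t, amplitude * np.sin(2 * np.pi * freq_hz * t + phase_rad)

fs = 8000              # assumed sampling rate in Hz
freq = 100.0           # frequency in Hz
period_s = 1.0 / freq  # the period is the reciprocal of the frequency

t, a = sinusoid(freq, 1.0, 0.0, 0.01, fs)    # reference sinusoid
_, b = sinusoid(freq, 1.0, np.pi, 0.01, fs)  # same tone, half a cycle later

# Phase only matters in relation to another sinusoid: these two cancel.
total = a + b
print(period_s, np.max(np.abs(total)))
```

The two components have identical frequency and amplitude, so on their own they would be indistinguishable; it is only their relative phase that determines that their sum is (numerically) zero.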
Fig. 3.1. Sine wave showing frequency, amplitude, phase and period.
Sound intensity is proportional to the square of the pressure, and is defined as the sound energy transmitted per second through unit area. The magnitude of a sound can be measured as the ratio between two levels, with one Bel corresponding to a tenfold ratio between two intensities. A more convenient unit is the decibel (dB), which is one tenth of this amount. Thus we can express the difference between two sounds with intensities $I_0$ and $I_1$ (or pressures $P_0$ and $P_1$) as

$$\text{decibels} = 10 \log_{10}(I_1 / I_0) = 20 \log_{10}(P_1 / P_0).$$

Since the decibel scale is logarithmic, a positive value means that the ratio is greater than one, and this corresponds to amplification ($I_1$ is greater than $I_0$). When it is negative, $I_1$ is said to be attenuated relative to $I_0$. A gain of +3 dB corresponds to (approximately) a doubling in the intensity of the sound, whereas -3 dB is a halving in intensity. A decibel measure is not an absolute measure of amplitude, but always describes a difference in level between two signals, i.e., it is based on a ratio. Often, one would like to refer to a sound by an absolute number instead of a difference, so the sound must be compared with a reference value; levels expressed in this fashion are called sound pressure levels (SPL). The usual reference is the softest 1000 Hz tone the average human can detect, which corresponds to an intensity of about $10^{-12}$ W m$^{-2}$ (watts per square meter). Thus
20 dB SPL refers to an intensity 100 times greater than the reference value, namely $10^{-10}$ W m$^{-2}$. According to Vernon et al.,1 the application of the dB scale can be traced back to early developments in telephone engineering. The engineers were interested in measuring how much power was lost along a telephone line. Because the losses were significant, they found that a logarithmic ratio system worked well to describe the measurements. Coincidentally, this logarithmic system also works well when describing hearing sensitivity. Psychophysically, the auditory system has an approximately logarithmic sensitivity quantized to about one dB.

In a DSP system, the sound pressure is usually converted into an electrical voltage by a microphone. In a dynamic microphone, the change in air pressure created by the sound causes a coil to move in a magnetic field, generating a voltage. The voltage is then amplified and converted to a digital signal using an analog to digital converter (see section 3).
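The decibel relations of this section can be checked numerically. The following is a minimal Python sketch and not part of the original text; the function names and the constant `I_REF` are illustrative assumptions, with the reference intensity taken from the value quoted above.

```python
import math

I_REF = 1e-12  # W/m^2, reference intensity for dB SPL as quoted in the text

def db_from_intensity(i1, i0):
    """Level difference in dB between two intensities."""
    return 10.0 * math.log10(i1 / i0)

def db_from_pressure(p1, p0):
    """Level difference in dB between two pressures (intensity goes as pressure squared)."""
    return 20.0 * math.log10(p1 / p0)

def intensity_from_spl(spl_db):
    """Intensity in W/m^2 corresponding to a level in dB SPL."""
    return I_REF * 10.0 ** (spl_db / 10.0)

doubling_db = db_from_intensity(2.0, 1.0)   # doubling the intensity: about +3 dB
range_db = db_from_pressure(1e7, 1.0)       # a pressure ratio of 1e7 spans 140 dB
i_20 = intensity_from_spl(20.0)             # 20 dB SPL: 100 times the reference

print(doubling_db, range_db, i_20)
```

Under these assumptions, the pressure ratio of $10^7$ mentioned earlier corresponds to a dynamic range of 140 dB.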
2.2. NOISE
Unwanted noise is present in all practical signals and must also be considered in DSP systems. External sources of noise include electrical interference from nearby electronic equipment and background audio noise. There can also be internal noise sources such as quantization noise (see section 3.3). Noise is often used as a sound stimulus in a DSP system since it contains energy across the whole frequency range of interest. White noise, for instance, contains uniform amounts of energy at all frequencies. The signal to noise ratio (SNR) is the measurement used to describe the ratio between a signal and the noise present in the system. The SNR is defined as
$$\mathrm{SNR} = 10 \log_{10} \frac{v_s^2}{v_n^2}$$

where $v_s$ is the signal voltage and $v_n$ is the noise voltage.[b] For example, the minimum SNR for good speech recognition is approximately 6 dB, and an audio CD player has an SNR of over 90 dB.
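As a hedged illustration of the SNR definition, the sketch below computes the ratio from root mean squared values, as the accompanying footnote suggests. The test signal, the noise level and the helper names `rms` and `snr_db` are assumptions made for this example only.

```python
import numpy as np

def rms(x):
    """Root mean squared value of a sampled signal."""
    return np.sqrt(np.mean(np.square(x)))

def snr_db(signal, noise):
    """SNR in dB from the RMS signal and noise voltages."""
    return 10.0 * np.log10(rms(signal) ** 2 / rms(noise) ** 2)

rng = np.random.default_rng(0)                 # fixed seed so the run is repeatable
t = np.arange(10_000) / 10_000.0               # one second sampled at 10 kHz
signal = np.sin(2 * np.pi * 100 * t)           # 100 Hz tone, RMS value 1/sqrt(2)
noise = 0.1 * rng.standard_normal(t.size)      # white noise, RMS value about 0.1

print(snr_db(signal, noise))                   # close to 10*log10(0.5/0.01), i.e. about 17 dB
```

Because the noise is random, the printed value varies slightly with the seed and the record length, which is why noise is normally measured over many readings.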
[b] Actually, these values are all root mean squared values since we are dealing with AC signals. Also, since noise is a random process, we usually measure it over several readings rather than a single reading.

3. DISCRETE TIME SYSTEMS
3.1. INTRODUCTION
In the real world, all signals are analog in value and continuous in time. A DSP system operates in discrete time, where signals are represented by a series of digital numbers. These digital values correspond to the signal’s amplitude at uniformly spaced intervals of time. Although the theories corresponding to discrete and continuous time signals are not identical, they have many parallels. Fortunately, with care, these differences do not seriously limit our ability to process signals in the digital domain.

A block diagram of a generic DSP system is shown in Figure 3.2. A continuous time signal is first low pass filtered (the necessity of filtering is explained below) by the anti-aliasing filter and converted to a digital signal via an analog to digital converter (ADC). An ADC is an electrical circuit that measures the analog signal[c] at its input and converts it to a digital number at its output. This is done at discrete, instantaneous points in time. The analog signal has now been converted to a digital representation, so it can be processed in the digital domain. After processing, the digital samples are converted back to an analog signal via a digital to analog converter (DAC) and low pass filter (called the reconstruction filter). Some systems, such as a speech synthesizer, start with a digital representation and generate an analog signal, so analog to digital conversion is not required. Similarly, some systems designed only to analyze signals do not require digital to analog conversion.

All DSP systems have a structure similar to Figure 3.2. Since all of the processing of the system is performed in the digital domain, the functionality of the DSP system is usually derived solely through software. Thus, the same DSP hardware can be used for many different applications. Because software controlled systems are easier to develop and debug than hardware systems, DSP systems can be made to perform ever more complex signal processing tasks.
Fig. 3.2. Block diagram of a generic DSP system.

[c] These can be voltages or currents, but in this chapter we will assume that the input of the ADC is a voltage.

3.2. SAMPLING OF CONTINUOUS TIME SIGNALS
In Figure 3.2, it can be seen that the first operation in a DSP system is to low pass filter and convert the analog signal into a digital one. This conversion is performed by the ADC using a process called discrete time sampling. Discrete time sampling converts a continuous time signal into a discrete time signal. This process is illustrated in Figure 3.3. Voltage measurements are taken at regular intervals determined by the sampling rate of the ADC. In the figure, our sampling period is T, corresponding to a sampling frequency of $f_s = 1/T$. Note that the digitized version of the waveform is not an exact copy of the signal since it only changes value every sampling period. Two types of artifacts occur in the sampled record as a result of this process and are referred to as quantization error and aliasing. A number of precautions need to be taken in DSP to ensure that these effects do not distort the final signal. These precautions are discussed in this and the following section.

All practical signals contain energy only up to a maximum frequency and thus are said to be bandlimited. The Nyquist sampling theorem states that under certain conditions, a bandlimited continuous time signal can be exactly reconstructed from the discrete samples. This remarkable theorem is the basis of all DSP systems and is the reason why we can process signals using a discrete time representation of a continuous time system. If the maximum frequency contained within a signal is given by $f_{max}$, the Nyquist sampling theorem states that the original signal can be reconstructed from the sampled version if the sampling rate $f_s$ is greater than $2f_{max}$, which is called the Nyquist rate, i.e.,
$$f_s > 2 f_{max}$$
Fig. 3.3. Sampling of a continuous time analog signal to make it into a discrete time signal. The sampling frequency is fs = 1/T Hz.

Fig. 3.4. Frequency domain representation of a signal sampled with a sampling period T. The original signal contains frequencies extending up to fmax. Note that copies of the spectra occur at intervals of 1/T and if the sampling rate is less than the Nyquist rate, overlapping will occur causing aliasing [as shown in (b)]. Although only the positive x-axis is shown, copies extend into the negative frequency axis as well [as discussed in Oppenheim and Schafer (1975), see Annotated Bibliography, or any other DSP text].

If a signal is sampled below its Nyquist rate, the signal will become distorted and this effect is known as aliasing. There are a number of nonintuitive consequences of the digitization process. For instance, if we examine a sampled signal using a frequency representation (see section 4), identical copies of the signal spectra occur at intervals of $f_s$ as illustrated in Figure 3.4. If the signal only contains frequency components which are less than $f_s/2$, the copies of the signal at the different frequency bands will not overlap. However, if the maximum frequency is larger than $f_s/2$, the higher frequencies will “fold over” onto the lower band and corrupt the lower frequency components (Fig. 3.4b). The sampling theorem can be rigorously demonstrated mathematically, but for our purposes is best understood intuitively by considering what happens if we change the frame rate of a movie camera as
we record a person at work. The action is normally sampled quickly so that when it is played back, we can make a faithful reproduction. As we slow the frame rate of the movie camera, a point is reached when we are unable to follow what the person is doing because the movie camera undersamples the signal. The frame rate at which this occurs depends on the maximum frequency at which the person moves; beyond this point, we cannot reconstruct the original signal and in fact fast repetitive movements may appear as smooth slow movements (strobing).

In a DSP system, an anti-aliasing filter like the one shown in Figure 3.2 is used. The anti-aliasing filter removes high frequencies from the signal, ensuring that it is bandlimited to less than half the sampling frequency. The anti-aliasing filter should not distort the amplitude or phase of the signals of interest, but should remove high frequency noise or unwanted parts of the signal to avoid aliasing. Anti-aliasing filters and reconstruction filters must be analog continuous time filters constructed using analog techniques. Unfortunately, we cannot build a perfect anti-aliasing filter, and anti-aliasing filters are subject to tradeoffs between cost, steepness, sensitivity to component variations and phase characteristics. Most real-world DSP systems sample much faster than the Nyquist rate to reduce the requirements of the anti-aliasing filter. This technique, known as oversampling, allows the slope of the low pass filter to be more gradual, making the filter easier to implement. Refer to section 6 for more details on filter design.

In practice, the speed at which we can perform our processing (including the time it takes to perform ADC and DAC) will limit the maximum sampling rate of the DSP system. In addition, if the sampled data must be stored, larger amounts of data will be generated for higher sampling rates and extra storage is required.
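The folding described in this section can be illustrated with a short Python sketch. It is an assumption-laden example rather than part of the original text: the helper `alias_frequency` and the chosen frequencies are hypothetical, and the signal is taken to be a real sinusoid sampled without any anti-aliasing filter.

```python
import numpy as np

def alias_frequency(f, fs):
    """Apparent frequency (Hz) of a real sinusoid of frequency f after sampling at fs."""
    f_mod = f % fs
    return min(f_mod, fs - f_mod)   # fold into the band 0 .. fs/2

fs = 10_000.0                       # sampling rate in Hz
n = np.arange(50)                   # sample indices

# A 7 kHz cosine lies above fs/2 = 5 kHz, so it folds down to 3 kHz.
x_high = np.cos(2 * np.pi * 7000.0 * n / fs)
x_low = np.cos(2 * np.pi * 3000.0 * n / fs)

print(alias_frequency(7000.0, fs))  # 3000.0
print(np.allclose(x_high, x_low))   # True: the two sampled records are indistinguishable
```

Once the samples have been taken, nothing distinguishes the 7 kHz tone from a 3 kHz tone, which is why the high frequencies must be removed by an analog filter before sampling.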
3.3. QUANTIZATION
When a continuous signal is digitized, it must be changed from a value represented by a real number to a value which is represented by a finite number of levels (Fig. 3.3). This process is known as quantization. Quantization is analogous to using a ruler to measure something: we only have a resolution of 1/2 of the finest graduation on the ruler, and the total length of the ruler determines the maximum length which we can easily measure. Internally, in a DSP system, an N bit number (which can represent $2^N$ different values) is used to represent the signal, and N describes the resolution of the converter. Quantization errors are introduced by the finite resolution of the ADC; the error is limited to the maximum value which the ADC can measure divided by $2^{N+1}$. The resolution of an ADC is usually in the range of 8 to 18 bits. Generally speaking, as the resolution increases, the cost of the ADC increases and the speed decreases. To obtain the high conversion rates that are required, video signals are usually processed with 8 bit resolution. For speech signals, 12 bits is normally sufficient, whereas for high quality audio reproduction, 16 bits or more are used.

Converters are usually either bipolar (they can take both positive and negative inputs) or unipolar (they take strictly positive inputs). For example, a unipolar 8 bit converter might take inputs from 0 to 5 V, converting them to values from 0 to 255, whereas a bipolar 8 bit converter with the same range would take inputs from -2.5 to +2.5 V and convert them to values of -128 to 127.

When implementing analog to digital conversion, care must be taken to match the expected range of the analog input signal to the specified input range for the ADC (signal range and converter range respectively). Ideally, these two ranges will be the same and the conversion will use the full scale of the ADC. If the signal range is smaller than the converter range, we will not be fully utilizing the resolution of the converter. For example, if we use a 12 bit converter with input range 0 to 5 V to convert an analog input signal with range 0 to 1.25 V, only 1/4 of the conversion range is used, with 2 bits of the resolution lost, resulting in 10 bits of effective resolution (see Fig. 3.5). On the other hand, if the signal range is larger than the converter range, clipping can occur. This causes large amounts of distortion and is therefore most undesirable. To circumvent this problem, such a signal should be attenuated and perhaps have its DC level shifted so that its range is within that of the converter.
Fig. 3.5. Illustration of how resolution is lost if the signal is reduced in range. If a 5V signal is measured, it uses the entire 12 bits of the converter. If the signal is limited to 1.25 V in amplitude, only 10 bits are used.
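The quantization examples of this section can be reproduced with a minimal sketch of an idealized unipolar converter. The functions `quantize` and `effective_bits` are hypothetical helpers written for this illustration; a real ADC will differ in its rounding behaviour and error characteristics.

```python
import numpy as np

def quantize(v, n_bits, v_min, v_max):
    """Idealized unipolar ADC: map voltages to integer codes 0 .. 2**n_bits - 1."""
    levels = 2 ** n_bits
    step = (v_max - v_min) / levels                        # voltage per code
    codes = np.floor((np.asarray(v) - v_min) / step).astype(int)
    return np.clip(codes, 0, levels - 1)                   # out-of-range inputs are clipped

def effective_bits(n_bits, signal_range, converter_range):
    """Bits actually exercised when the signal spans only part of the converter range."""
    return n_bits - np.log2(converter_range / signal_range)

# 8 bit converter with a 0 to 5 V input range: codes run from 0 to 255.
print(quantize([0.0, 2.5, 5.0], 8, 0.0, 5.0))   # [  0 128 255]

# 12 bit converter, 0 to 5 V, driven by a 0 to 1.25 V signal: 2 bits are lost.
print(effective_bits(12, 1.25, 5.0))            # 10.0
```

The clipping step models what happens when the signal range exceeds the converter range: every value beyond full scale is mapped to the same code, which is the source of the distortion described above.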
3.4. SIGNAL GENERATION
In order to generate an analog signal, discrete time samples are first computed (either mathematically or by sampling a desired analog signal) and stored in the memory of the DSP device. By sending successive samples to a digital-to-analog converter and low pass filtering, a continuous time analog signal can be generated (see Figs. 3.2 and 3.3). The process of signal generation is the exact reverse of sampling analog signals (described in section 3.2). A reconstruction filter is required at the output to remove high frequency copies of the signal caused by sampling. As with sampling of signals, to avoid aliasing, the signal that is being generated must not contain frequencies higher than half the sampling rate. Data generated in this manner can be easily controlled in software and be arbitrarily complicated, thus offering advantages over analog signal generators, which are restricted to generating much simpler signals.
4. FREQUENCY DOMAIN DIGITAL SIGNAL PROCESSING
4.1. REVIEW OF COMPLEX ARITHMETIC
Since complex arithmetic is required when a frequency domain representation of a signal is used, we need to first provide a brief review of complex arithmetic. For a more complete description of complex numbers the reader is referred to any first year undergraduate mathematics text. The problem with real numbers is that polynomial equations such as $x^2 + 1 = 0$ cannot be solved. If we define the imaginary number $j = \sqrt{-1}$, we can solve any polynomial equation, and much of the algebra is greatly simplified. Note that most scientists and mathematicians use i to represent the imaginary number but electrical engineers like to reserve this symbol for electrical current (we will always use j). In general, a complex number is broken into two parts and expressed in the form $a + jb$, where a is the real part and b is the imaginary part. Since this is two numbers, we can locate a complex number on a graph of the real and imaginary parts as shown in Figure 3.6.

Fig. 3.6. A complex number a+jb can be plotted on a real-imaginary graph. It can be represented either in cartesian or polar form (r,θ). In polar form, it is expressed as a magnitude and a phase value and in DSP systems, this is the preferred representation.

As is also shown in Figure 3.6, a complex number $a + jb$ can be described using polar coordinates $(r, \theta)$, where r corresponds to the length of the ray extending from the origin of the graph and θ corresponds to the angle the ray makes with the real axis of the graph. These are related to the complex number so that the magnitude is given by $r = |a + jb|$ or $r = (a^2 + b^2)^{1/2}$ and the phase by $\theta = \arg(a + jb)$ or $\theta = \tan^{-1}(b/a)$ (with the quadrant chosen according to the signs of a and b). Note that the second expression in both cases can be derived using simple trigonometry from the graph of the real and imaginary parts of the complex number (Fig. 3.6). The complex conjugate of a complex number turns out to have very useful properties for signal processing and is computed by simply changing the sign of the imaginary part of the number; for example, the complex conjugate of $1 - 8j$ is $1 + 8j$. As we shall see below, complex exponentials are also very important in signal processing and these are related to sinusoids in the following fashion:
$$e^{j\theta} = \cos\theta + j\sin\theta$$

4.2. PERIODIC SIGNALS AND THE FOURIER SERIES
A signal is said to be periodic if it repeats after some time. Sinusoids are very important in digital signal processing since any practical periodic waveform can be expressed as a Fourier series: an infinite sum[d] of sinusoids whose frequencies are multiples of a “fundamental” frequency $f_0$. The Fourier series x(t) is given by:
$$x(t) = \sum_{k=-\infty}^{\infty} X_k e^{jk\omega_0 t}$$
where $j = \sqrt{-1}$, $\omega_0 = 2\pi f_0$ (frequency in radians per second) and the $X_k$ are known as the Fourier coefficients and represent the sinusoidal component of the signal at the frequency corresponding to index k. These coefficients are complex so that they can encode both amplitude and phase relationships of the sinusoids being summed. If only real coefficients were summed, we would not be able to represent arbitrary signals. Just as light can be split into its component wavelengths by using a prism or diffraction grating, electrical signals can also be separated into component frequencies. As an example, Figure 3.7 shows a square wave being constructed from contributions made from its component sinusoids. The overshoot at the corners of the square wave which is apparent in the plot with the 11 summed sinusoids shows an example of the Gibbs effect, which will be explained in section 6.1.1.

[d] In fact, all practical signals can be represented with a finite sum of sinusoids.
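The construction of a square wave from its harmonics can be sketched in a few lines of Python. This is an illustrative example under stated assumptions: it uses the real sine form of the series for a unit-amplitude square wave (odd harmonics with amplitudes 4/(πk)), and the function name `square_wave_partial_sum` and the numbers of terms are choices made here, not taken from Figure 3.7.

```python
import numpy as np

def square_wave_partial_sum(t, f0, n_terms):
    """Sum the first n_terms odd harmonics of a unit-amplitude square wave."""
    k = 2 * np.arange(n_terms) + 1                     # harmonic numbers 1, 3, 5, ...
    harmonics = np.sin(2 * np.pi * f0 * np.outer(k, t)) / k[:, None]
    return (4.0 / np.pi) * harmonics.sum(axis=0)

t = np.linspace(0.0, 1.0, 20001)                       # one period of a 1 Hz square wave
peaks = {}
for n_terms in (6, 100):
    peaks[n_terms] = square_wave_partial_sum(t, 1.0, n_terms).max()
    print(n_terms, peaks[n_terms])                     # both peaks stay near 1.18
```

Adding more terms makes the waveform flatter and the edges steeper, but the peak stays roughly 18% above the nominal level of 1; the overshoot only becomes narrower, which is the behaviour referred to as the Gibbs effect.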
4.3. FOURIER ANALYSIS AND THE DISCRETE FOURIER TRANSFORM
In the previous section, we saw that periodic signals could be represented as a sum of complex exponentials (e raised to the power of a complex number). However, this formulation is not of practical use in a DSP system since the Fourier series represents a continuous time signal as an infinite sum of sinusoids of different frequencies whereas our DSP systems can only deal with discrete time systems and finite sums. Fortunately, for a discrete time signal x[n] of finite duration N, the discrete Fourier transform (DFT) can be used to calculate the Fourier series coefficients X[k] (we normally use capital letters to represent frequency domain descriptions). The DFT, also known as the analysis equation, determines the Fourier coefficients from the sample values and can be expressed mathematically as:
$$X[k] = \frac{1}{N}\sum_{n=0}^{N-1} x[n]\, e^{-jk2\pi n/N}, \quad k = 0, 1, \ldots, N-1$$

Using this formulation, the X[k] values that are computed represent the contribution[e] of each of the component sinusoids required to approximate the input signal x[n]. The number of coefficients that are computed is given by N. If N were infinite, we would be able to exactly reconstruct our signal from the Fourier coefficients. Unfortunately, as we increase N, our computational and storage requirements increase, so a tradeoff must be made between an accurate representation of the input and our computation time. Thus N determines the resolution (in frequency) of the time domain to frequency domain transform. As can be seen in the formula, the n value loops through the input, summing the contribution of each sample to the Fourier coefficient for the frequency represented by k.

The inverse DFT (IDFT), or synthesis equation, performs the reverse computation, computing the time domain response from the Fourier coefficients. This is immensely useful in signal generation as it means that we can specify any signal in terms of its amplitude and phase characteristics and generate the analog waveform using the IDFT. With the 1/N scaling placed on the analysis equation as above, the IDFT is applied using the following equation:

$$x[n] = \sum_{k=0}^{N-1} X[k]\, e^{jk2\pi n/N}, \quad n = 0, 1, \ldots, N-1$$

[e] Since it is a complex number, both the magnitude and phase are represented.
Fig. 3.7. Plot showing terms of the Fourier series for a square wave being summed. Although the waveform approaches a square wave, the amplitude of the overshoot does not decrease with increasing N (Gibbs effect).
Thus we have a way to take our N point frequency domain representation and reconstruct our original input x[n] from it, or to synthesize new analog signals. In fact, the IDFT generates a waveform which is exactly equal to x[n] for 0 ≤ n < N from the Fourier coefficients X[k]. The Fourier coefficients computed by the DFT describe the frequency spectrum of the signal. We can then think of the DFT as transforming our signal x[n] from a representation in the time domain to a complex valued frequency representation X[k]. In a DSP context, a polar representation is normally used to separately describe the magnitude and phase of each frequency component that makes up the signal.

A very common computation in DSP is to analyze the frequency components of a signal. The procedure is shown in Figure 3.8, where the DFT of the signal is computed and then the amplitude and phase of each of the Fourier coefficients are plotted as a function of frequency. Figure 3.9 shows an example of a time domain input signal composed of two sine waves (1 kHz and 3 kHz) plus a small amount of noise, all digitized at 10 kHz. The corresponding amplitude and phase responses have been computed using the DFT and are also plotted in the figure. Note that, as described in section 3.2, there are identical copies of the frequency components of the signal reflected around the frequency corresponding to half the sampling frequency (see also Fig. 3.4).

Since the DFT quantizes the frequency components of the signal into N discrete values, there is a tradeoff between resolution and range. After we apply the DFT, if the sampling period is T, each Fourier coefficient X[k] represents the magnitude and phase of the signal at a single frequency:

$$f_k = \frac{k}{NT}, \quad k = 0, 1, \ldots, N/2$$
Fig. 3.8. Frequency domain signal analysis using the DFT (or FFT).

Fig. 3.9. The top plot shows a signal composed from two sine waves plus noise. The bottom plots show the magnitude and phase of the 100 point (i.e., N = 100) DFT of this signal. The two peaks in the DFT occur at the frequencies of the sine waves. Note that the magnitude response is symmetric about N/2 and the phase is antisymmetric about N/2. This is one of the effects of sampling and the values of frequency and phase up to N/2 are sufficient to fully describe the system (see also Fig. 3.4).

Thus the DFT can only provide information about the signal at frequencies which are multiples of the fundamental frequency 1/NT. The separation in frequencies 1/NT, given by the DFT, is called the binwidth. The range of frequencies covered can be improved by reducing T (increasing the sampling rate) but this increases the binwidth and therefore reduces the frequency resolution. On the other hand, resolution can be improved by increasing N (this does not affect the frequency range), which serves to sample more periods of the signal. For instance, if a 1024 point DFT is applied to a signal digitized at 40 kHz, the binwidth is 39 Hz and we can analyze signals up to 20 kHz. If we wanted better resolution, we could do a 2048 point DFT, which would make our Fourier coefficients 19.5 Hz apart. Furthermore, if we decided we only need to look at frequencies up to 10 kHz, the sampling rate could be reduced to 20 kHz.[f]
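The relationship between DFT bins and frequencies can be explored with the sketch below, loosely modelled on the Figure 3.9 example (two sine waves at 1 kHz and 3 kHz, sampled at 10 kHz, N = 100). The amplitudes are assumptions and the noise is omitted so that the result is deterministic. NumPy's `np.fft.fft` applies no scaling on the forward transform, so the 1/N factor of the analysis equation is applied explicitly here.

```python
import numpy as np

fs = 10_000.0                      # sampling rate in Hz
N = 100                            # number of points in the DFT
T = 1.0 / fs                       # sampling period
n = np.arange(N)

# Two sine waves at 1 kHz and 3 kHz; amplitudes 1 and 0.5 are assumed values.
x = np.sin(2 * np.pi * 1000 * n * T) + 0.5 * np.sin(2 * np.pi * 3000 * n * T)

X = np.fft.fft(x) / N              # 1/N scaling to match the analysis equation
binwidth = 1.0 / (N * T)           # separation between coefficients: 100 Hz
bin_freqs = np.arange(N) * binwidth

magnitude = np.abs(X)
lower_half = magnitude[: N // 2 + 1]                # bins 0 .. N/2 describe the signal
peak_bins = np.sort(np.argsort(lower_half)[-2:])    # the two largest coefficients
peak_freqs = bin_freqs[peak_bins]

x_back = np.fft.ifft(X * N).real   # np.fft.ifft carries its own 1/N, so undo ours first

print(binwidth, peak_bins, peak_freqs)
print(40_000 / 1024, 40_000 / 2048)  # binwidths for the 40 kHz example in the text
```

In this sketch the two peaks fall in bins 10 and 30, the magnitudes are mirrored about N/2, and the inverse transform returns the original samples.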
4.4. FAST FOURIER TRANSFORM
For reasons of computational efficiency, the Fourier coefficients are not often calculated directly from the DFT equations given in section 4.3. The fast Fourier transform (FFT), as its name suggests, is an extremely efficient algorithm for calculating the DFT with the restriction, in its common radix-2 form, that N be a power of 2. Whereas the DFT has a computational complexity of the order of $N^2$ multiplications, the FFT only requires of the order of $N\log_2(N)$ multiplications. In addition, the architecture of modern DSP chips is optimized to perform FFTs at maximum speed.
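To make the comparison concrete, the following sketch evaluates the analysis equation literally, as a matrix of complex exponentials, and checks it against NumPy's FFT. The function `dft_direct` is a hypothetical helper; the multiplication counts are the order-of-magnitude figures from the text, not measured operation counts.

```python
import numpy as np

def dft_direct(x):
    """Analysis equation evaluated literally: N multiplications per coefficient."""
    x = np.asarray(x, dtype=complex)
    n_points = x.size
    idx = np.arange(n_points)
    W = np.exp(-2j * np.pi * np.outer(idx, idx) / n_points)   # N x N exponentials
    return (W @ x) / n_points                                  # 1/N as in section 4.3

N = 256                                        # a power of 2
rng = np.random.default_rng(1)
x = rng.standard_normal(N)

X_direct = dft_direct(x)
X_fft = np.fft.fft(x) / N                      # same coefficients via the FFT
max_difference = np.max(np.abs(X_direct - X_fft))

mults_direct = N ** 2                          # order N^2
mults_fft = int(N * np.log2(N))                # order N log2(N)
print(max_difference, mults_direct, mults_fft)
```

For N = 256 the two counts are 65536 and 2048, and the gap widens rapidly as N grows, while the two methods agree to within rounding error.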
5. TIME DOMAIN ANALYSIS
5.1. IMPULSE RESPONSE
An impulse is a short duration signal pulse. It contains equal energy over all frequencies and has zero starting phase, which makes it a valuable test signal. If we wish to analyze how a system transforms a signal (such as the filtering processes of the outer ear), we can apply an impulse as the input and record the output (using a microphone inside the ear canal, for example). The resulting impulse response that we measure enables us to determine the transfer function of the system. This type of measurement is critical in generating high fidelity virtual auditory space. The ear acts as a filter which varies as a function of the location in space from which the input signal arrives. By applying an impulse input from a particular point in space, the head related transfer function (HRTF) for that location in space can be measured (see chapter 2, section 1.4), and the response of the ear to any other input signal (such as music) can be computed from the measured impulse response information. An impulse function δ[n] is defined as:
$$\delta[n] = \begin{cases} 1, & n = 0 \\ 0, & \text{otherwise} \end{cases}$$

A practical example is striking a bell with a hammer. The hammer blow is an impulse input, which causes the bell to ring at certain frequencies determined by the geometry and materials of the bell. The bell can be considered as our system which filters the input signal to produce an output. This also illustrates a very important property of the impulse, namely that it contains a very wide range of frequencies. This can be demonstrated by summing together a large number of harmonically related sine waves all with the same starting phase. At the end of this operation one is left with an impulse whose duration is related to the highest frequency added into the complex. When the bell is struck it is essentially being presented with this very wide range of frequencies but only transmits those frequencies to which it is resonantly tuned.

Impulse responses are very important in DSP since they completely characterize a linear time-invariant (LTI) system.[g] From the impulse response we can determine the system’s response to arbitrary inputs. The frequency response of a system is simply the discrete Fourier transform of the impulse response.

[f] Note that in this case we would need to adjust our anti-aliasing filters to ensure that aliasing would not occur if we lowered the sampling rate.
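The properties of the impulse described in this section can be verified with a small sketch. Everything named here is an assumption for illustration: the harmonics are summed as cosines (so that they all peak at n = 0), and the hypothetical `system` is a 3 point moving average standing in for an unknown LTI system.

```python
import numpy as np

N = 64
delta = np.zeros(N)
delta[0] = 1.0                                  # unit impulse

D = np.fft.fft(delta)                           # equal magnitude, zero phase at every bin

# Averaging N harmonically related cosines with the same starting phase gives an impulse.
n = np.arange(N)
k = np.arange(N)
summed = np.cos(2 * np.pi * np.outer(k, n) / N).sum(axis=0) / N

def system(x):
    """Hypothetical LTI system used for illustration: 3 point moving average."""
    return np.convolve(x, np.ones(3) / 3.0)[: len(x)]

h = system(delta)                               # measured impulse response
H = np.fft.fft(h)                               # frequency response of the system

# The impulse response predicts the output for any other input.
rng = np.random.default_rng(2)
x = rng.standard_normal(N)
y_measured = system(x)
y_predicted = np.convolve(x, h)[:N]

print(np.allclose(np.abs(D), 1.0), np.allclose(summed, delta), np.allclose(y_measured, y_predicted))
```

The last comparison is the point of the measurement: once h is known, the output for an arbitrary input follows without presenting that input to the system.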
5.2. IMPULSE RESPONSE MEASUREMENT AND GOLAY CODES
Impulse responses can be measured by simply applying an impulse to the input of a system and measuring the output. Although the impulse contains a very wide range of frequencies, its overall energy level is quite low. Noise in the measurement process can therefore cause poor signal to noise ratios, so this is not an entirely satisfactory method of measuring the impulse response. In recent years, Golay codes have become popular for accurately measuring the impulse response of various systems. In particular this technique has been used for measuring HRTFs (e.g., refs. 2-4) and provides a significant improvement in the signal to noise ratio. This process relies on the presentation of a pair of specially constructed codes, and postprocessing the resulting output to obtain the impulse response. The initial pair of Golay codes is $a_1 = \{+1,+1\}$ and $b_1 = \{+1,-1\}$. The next pair of Golay codes is obtained by appending $b_1$ to $a_1$ so that $a_2 = \{+1,+1,+1,-1\}$, and by appending $-b_1$ to $a_1$ to get $b_2 = \{+1,+1,-1,+1\}$. This is repeated recursively to obtain the Nth code pair, whose codes are of length $L = 2^N$.
[g] A system is said to be linear if it obeys the property of superposition. That is, if the response of a system to signal $x_1(t)$ is $y_1(t)$ and the response of the system to signal $x_2(t)$ is $y_2(t)$, then for any constants a and b, the response of the system to $ax_1(t) + bx_2(t)$ is $ay_1(t) + by_2(t)$. Even nonlinear systems can be approximated as linear ones if the excursions of the inputs and outputs are kept small. A system is time-invariant if a time shift in the input causes the same time shift in the output. Mathematically, if the response of the system to x(t) is y(t), then the response to $x(t - t_0)$ (i.e., x(t) time-shifted by $t_0$) is $y(t - t_0)$. Linear time-invariant (LTI) systems are very important since these conditions represent a large class of systems, and are necessary for much of the theory behind digital signal processing.
In order to measure the impulse response of a system h(t), we apply the Golay codes aN and bN to the system and measure the frequency response HA[f] and HB[f] (obtained using the DFT of the response to each code aN and bN). These responses can then be processed by multiplying each response by the complex conjugate of the DFT of each code aN and bN. If we then take the IDFT, the impulse response is obtained. Mathematically, this can be expressed:
$$h(t) = \frac{1}{2L}\,\mathrm{IDFT}\left\{H_A[f]\,\mathrm{DFT}^*(a_N) + H_B[f]\,\mathrm{DFT}^*(b_N)\right\}$$

where DFT* denotes the complex conjugate of the discrete Fourier transform (see sections 4.1 and 4.3). This method of computing the impulse response has a signal to noise ratio which is $10\log_{10}(2L)$ dB higher than if it were computed directly using an impulse signal (since there is much more energy present in the input signal). In practice, typical lengths used are L = 512 (ref. 2) and L = 1024 (ref. 3), which correspond to improvements in signal to noise ratio of 30.1 dB and 33.1 dB respectively. For a further discussion of this important processing method and some implementation examples, see Zhou et al.2
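The Golay procedure can be sketched as follows. This is a minimal, noise-free illustration and not the implementation used in the cited studies: the functions `golay_pair` and `golay_impulse_response`, the simulated system and the FFT length are all assumptions. NumPy's forward FFT is unscaled and its inverse carries the 1/N factor, which is the convention under which the 1/(2L) factor recovers h directly.

```python
import numpy as np

def golay_pair(n):
    """Return the nth Golay code pair (each of length 2**n) by the recursion in the text."""
    a = np.array([1.0, 1.0])
    b = np.array([1.0, -1.0])
    for _ in range(n - 1):
        a, b = np.concatenate([a, b]), np.concatenate([a, -b])
    return a, b

def golay_impulse_response(system, n, n_fft):
    """Estimate an impulse response from the system's responses to a Golay pair."""
    a, b = golay_pair(n)
    L = len(a)
    A = np.fft.fft(a, n_fft)                    # zero padded DFTs of the codes
    B = np.fft.fft(b, n_fft)
    HA = np.fft.fft(system(a), n_fft)           # DFTs of the measured responses
    HB = np.fft.fft(system(b), n_fft)
    h = np.fft.ifft(HA * np.conj(A) + HB * np.conj(B)) / (2 * L)
    return h.real

# Hypothetical system under test: convolution with a short, known impulse response.
h_true = np.array([1.0, 0.5, -0.25, 0.125])
system = lambda x: np.convolve(x, h_true)

a2, b2 = golay_pair(2)
h_est = golay_impulse_response(system, n=5, n_fft=64)   # n_fft must cover L + len(h) - 1

print(a2, b2)
print(np.round(h_est[:6], 6))
print(10 * np.log10(2 * 512), 10 * np.log10(2 * 1024))  # SNR gains quoted in the text
```

The method works because the squared magnitudes of the two code spectra sum to 2L at every frequency, so the combination inside the IDFT is simply 2L times the transfer function.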
5.3. CONVOLUTION
In DSP two equivalent methods, in either the frequency or time domain, can be used to filter an input signal with an impulse response. In the frequency domain, direct multiplication of the DFT of the time domain input signal with the DFT of the impulse response of the system (its transfer function), followed by an IDFT, will produce the output (see Fig. 3.10). The time domain equivalent to multiplication in the frequency domain is to convolve the input signal x[n] with the impulse response h[n] of the system in order to produce the output y[n] of that system (see Fig. 3.10). Convolution is the infinite sum of time shifted values, and the operation is commutative.[h] We use the ‘*’ operator to denote convolution so:
y[n] =
∞
∑ x[k]h[n − k]
k =−∞
=x*h =h*n A most useful property of the DFT is that convolutions in the time domain correspond to multiplication in the frequency domain. h
If ‘*’ is a commutative operator, x * h = h * x. For example, addition is commutative but division is not.
Digital Signal Processing for the Auditory Scientist: A Tutorial Introduction
Fig. 3.10. Multiplication in the frequency domain is equivalent to convolution in the time domain.
This property can be exploited to make efficient implementations of FIR filters as discussed in section 5.4.1.
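The equivalence between time domain convolution and frequency domain multiplication can be checked with a short sketch. The function names and test signals below are illustrative assumptions. One detail worth noting is that multiplying DFTs corresponds to circular convolution, so both sequences are zero-padded to the full output length to obtain the ordinary (linear) convolution.

```python
import numpy as np

def convolve_direct(x, h):
    """Evaluate y[n] = sum_k x[k] h[n-k] directly for finite-length sequences."""
    y = np.zeros(len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y

def convolve_fft(x, h):
    """Same result via DFT, multiplication, and inverse DFT (zero-padded)."""
    n_out = len(x) + len(h) - 1
    X = np.fft.rfft(x, n_out)
    H = np.fft.rfft(h, n_out)
    return np.fft.irfft(X * H, n_out)

x = np.array([1.0, 2.0, 3.0, 4.0])    # arbitrary input signal
h = np.array([0.5, 0.25, 0.125])      # arbitrary impulse response

y_time = convolve_direct(x, h)
y_freq = convolve_fft(x, h)
print(np.allclose(y_time, y_freq))                 # both routes agree
print(np.allclose(convolve_direct(h, x), y_time))  # commutativity
```

For long signals the FFT route needs far fewer multiplications than the double loop, which is the saving referred to in the text.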
6. FILTER DESIGN
As its name suggests, filtering involves modification of certain signal components. This could involve either amplifying or attenuating portions of the signal. A decibel scale is normally used to describe these factors, with positive gain corresponding to amplification (the signal is larger than the original) and negative gain corresponding to attenuation. Digital filters have significant advantages over their analog counterparts since they can be made more accurately and have steeper edges. Since they are often implemented in software, the same hardware can be used to apply many different filtering functions.
The desired characteristics of a filter are usually described in the frequency domain, selecting and/or rejecting different frequencies (standard filters are lowpass, highpass, bandpass and bandstop) as illustrated in Figure 3.11. For example, a lowpass filter might be used to remove high frequency noise from a signal in an antialiasing application; a highpass filter can be used to remove the DC level from a signal and shift it into range for an ADC; a bandpass filter could extract a signal which is known to be in a certain frequency range; and a bandstop filter could be used to remove 50 or 60 Hz power line noise from a recording.
Fig. 3.11. Ideal filter magnitude response characteristics.
Filters are normally designed only with regard to the amplitude of the frequency response |H[k]| and then the resulting phase response arg(H[k]) is analyzed to see if it is satisfactory (see also section 4.1). A signal is said to be undistorted if the filter only amplifies and/or time delays the signal. Amplification corresponds to a frequency independent scaling of the magnitude. The time delay (or group delay) is given by:
$$\tau[k] = -\frac{d\,\arg(H[k])}{dk}$$

and for this to be constant, the phase response must be linear with frequency (see also chapter 2, section 1.2). Mathematically, if c is a constant:

$$\arg(H[k]) = -ck$$

If the phase response is nonlinear, the time delay of the filter will be a function of the frequency and hence the output will be distorted. This is usually not a problem as long as the phase response is reasonably smooth. However, linear phase is desirable when high fidelity is required.
The specification of a filter is the first step in filter design. An ideal lowpass filter has unit gain in the passband, no transition region, and infinite attenuation in the stopband (this terminology is defined in Fig. 3.12). In practice, arbitrarily close approximations can be made to this ideal as long as we have enough computing power to implement our filter. For a given amount of computation, changing any parameter will affect the others. In fact, most of the time spent in filter design involves dealing with tradeoffs between magnitude response, phase response and computational requirements. This is why there is such a wide choice of filter approximations.
Linear, time invariant systems can be divided into those with finite impulse response (FIR), and those with infinite impulse response (IIR). FIR filters can have precisely linear phase and guaranteed stability, whereas IIR filters are implemented recursively and have steeper characteristics than a FIR filter (given the same amount of computing time), but are more sensitive to quantization effects which can make the filters unstable, and hence oscillate. Quantization effects in filters are beyond the scope of this introduction but interested readers can refer to Oppenheim and Schafer5 for a good description of these effects.
Fig. 3.12. Low pass filter specifications. A perfect low pass filter has no passband ripple and infinite attenuation in the stopband. Although practical filters such as the one shown cannot achieve this, they can make arbitrarily close approximations to this ideal.
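The group delay expression and the linear phase property of FIR filters discussed in this section can be illustrated numerically. In the sketch below the delay is estimated as the negative slope of the unwrapped phase with respect to angular frequency (in radians per sample), so the result is in samples. The two impulse responses are arbitrary choices made for this example: one is symmetric (and therefore linear phase), the other is not.

```python
import numpy as np

def group_delay_samples(h, n_fft=512):
    """Estimate -d(arg H)/d(omega), in samples, on n_fft//2 + 1 frequencies."""
    H = np.fft.rfft(h, n_fft)
    omega = 2.0 * np.pi * np.arange(len(H)) / n_fft  # rad/sample, 0..pi
    phase = np.unwrap(np.angle(H))                   # remove 2*pi jumps
    return -np.gradient(phase, omega)

h_symmetric = np.array([1.0, 2.0, 4.0, 2.0, 1.0])  # symmetric about its middle tap
h_asymmetric = np.array([1.0, 0.5, 0.25])          # no symmetry

gd_sym = group_delay_samples(h_symmetric)
gd_asym = group_delay_samples(h_asymmetric)

# Symmetric taps: constant delay of (N-1)/2 = 2 samples at every frequency.
print(gd_sym.min(), gd_sym.max())
# Asymmetric taps: the delay varies with frequency.
print(gd_asym.min(), gd_asym.max())
```

The symmetric example was chosen so that its magnitude response has no zeros on the frequency axis, which keeps the phase well defined at every point where it is differentiated.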
6.1. FIR FILTERS
FIR filters are usually implemented by convolving (see section 5.3) the filter’s impulse response with the input signal x[n]. Since the impulse response is finite, the sum for the convolution becomes finite. If the impulse response is of length N, the computation is given by:

$$y[n] = \sum_{k=0}^{N-1} h[k]\,x[n-k]$$
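The finite sum above can be sketched as a tapped delay line with one multiply and accumulate per tap. This is an illustrative Python version under the assumption that the impulse response has N taps h[0] to h[N-1] and that the input is zero before the first sample; a DSP chip would perform the same operations with dedicated hardware. The function name and test data are hypothetical.

```python
import numpy as np

def fir_filter(x, h):
    """Filter x with taps h using an explicit delay line."""
    delay_line = np.zeros(len(h))   # holds x[n], x[n-1], ..., x[n-N+1]
    y = np.zeros(len(x))
    for n, sample in enumerate(x):
        delay_line = np.roll(delay_line, 1)  # shift every stored sample by one tap
        delay_line[0] = sample               # newest sample enters the line
        y[n] = np.dot(h, delay_line)         # multiply and accumulate over the taps
    return y

rng = np.random.default_rng(1)
x = rng.standard_normal(32)             # arbitrary test input
h = np.array([0.25, 0.5, 0.25])         # arbitrary 3-tap smoothing filter
y = fir_filter(x, h)
print(np.allclose(y, np.convolve(x, h)[:len(x)]))
```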
As can be seen from the equation, the computation requires the N most recent values of the input x, and these values can be considered as taps on a delay line (see Fig. 3.13). The inputs x[n] are multiplied and accumulated with the coefficients of the filter represented by h[k] to produce the output signal y[n]. DSP chips are optimized so that they can compute the multiply and accumulate function (known as MAC) at maximum speed. Larger values of N allow steeper filters to be implemented, but they are computationally more expensive.
Very complicated transfer functions can be easily implemented using FIR filters. For example, head related transfer functions of the outer ear can be simulated in DSP for virtual reality applications by first measuring the impulse response h of the filter (see section 5.2), and then convolving the input with the impulse response to obtain a filtered output (see Fig. 1.1 and chapter 4). For large values of N, it is possible to reduce the amount of computation required by using the fact that convolution in the time domain is equivalent to multiplication in the frequency domain (see section 5.3). In this case, the FFT of the input signal is computed, multiplied by the FFT of the impulse response, and then the inverse FFT is computed to generate the output.
Fig. 3.13. Implementation of a FIR filter. The FIR filter can be modeled as the weighted sum of the outputs of a delay line.
6.1.1. Windowing functions
In section 4.2, we saw that any signal can be reconstructed from its Fourier series which is an infinite sum. We cannot, in general, produce arbitrary signals using a finite sum, but it is possible to make an arbitrarily close approximation by summing many terms of the Fourier series. In this section, we describe windowing functions which are used to reduce the length of a long (or even infinite) impulse response to produce a finite impulse response of length N. The most obvious technique would be to simply truncate the impulse response. This rectangular windowing function can be described mathematically as:
$$w_R(n) = \begin{cases} 1 & 0 \le n \le N-1 \\ 0 & \text{otherwise} \end{cases}$$

The impulse response h[n] is then multiplied by the windowing function to give hnew[n]:

$$h_{new}[n] = w_R(n)\,h[n], \qquad n = 0, \ldots, N-1$$
The problem with this technique is that if it is applied to a discontinuous function (such as desired in filters), it leads to overshoot or “ringing” in the time domain response. This problem arises because we are trying to approximate the discontinuity using a finite number of Fourier series terms (when we really need an infinite number of terms), with the resulting effect known as the Gibbs phenomenon. For any finite value of N, a truncated Fourier series will, in general, exhibit overshoot if the rectangular window is applied. An example of this effect can be seen in Figure 3.7. The amplitude of the overshoot does not decrease with increasing N. Although a large value of N will make the energy in the ripples negligible, by applying other windowing functions we can overcome this problem. For this reason, the rectangular window is not used in practice. Windowing functions such as the Hanning, Hamming and Kaiser windows can be used to progressively reduce the Fourier coefficients instead of the discontinuous weighting applied by the rectangular window. The Hamming window is a raised cosine function given by:
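$$w_H(n) = 0.54 + 0.46\cos\left(\frac{2\pi n}{N-1}\right)$$

In this form the index n is measured from the center of the window, so that n runs from -(N-1)/2 to (N-1)/2; when n runs from 0 to N-1 the equivalent expression is 0.54 - 0.46 cos(2πn/(N-1)). As can be seen in Figure 3.14, the Hamming window tapers gradually toward the end of the window at N-1, avoiding the discontinuity which would be caused by a rectangular windowing function.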
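The effect of the window choice can be seen by truncating an ideal lowpass impulse response with a rectangular window and with a Hamming window, and comparing the largest stopband ripple of each. This is a sketch under assumed values (51 taps, cutoff at 0.2 cycles per sample, stopband measured from 0.25 cycles per sample); `np.hamming` uses the 0 to N-1 indexing convention with the minus sign.

```python
import numpy as np

N = 51                      # assumed filter length
fc = 0.2                    # assumed cutoff, cycles per sample
n = np.arange(N)
center = (N - 1) / 2

# Ideal (infinitely long) lowpass response, sampled on N points about its center
h_ideal = 2 * fc * np.sinc(2 * fc * (n - center))

h_rect = h_ideal                    # rectangular window: plain truncation
h_hamming = h_ideal * np.hamming(N) # tapered truncation

def stopband_peak_db(h, f_stop, n_fft=4096):
    """Largest magnitude at or above f_stop, in dB relative to the DC gain."""
    H = np.abs(np.fft.rfft(h, n_fft))
    f = np.arange(len(H)) / n_fft
    return 20 * np.log10(np.max(H[f >= f_stop]) / H[0])

rect_db = stopband_peak_db(h_rect, 0.25)
hamming_db = stopband_peak_db(h_hamming, 0.25)
print(f"rectangular: {rect_db:.1f} dB, Hamming: {hamming_db:.1f} dB")
```

The price of the lower ripple is a wider transition region, which is one of the tradeoffs in filter design mentioned earlier in this section.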
6.1.2. Frequency sampling design
A FIR filter can be designed simply in the frequency domain by applying the IDFT to N uniformly spaced samples of the desired frequency response, thus obtaining the impulse response. After having made a filter design using frequency sampling, the DFT of the impulse response should be calculated and plotted to check the design.
Fig. 3.14. The Hamming window function. The tapered response of the function causes a gradual truncation of the impulse response.
The larger the value of N, the closer the filter will be to the desired response. Care must be taken when specifying the amplitude and phase values before performing the IDFT. When using a frequency sampling design the amplitude and phase values must be arranged symmetrically around fs/2, as is illustrated in Figure 3.15. In order to achieve this symmetry the frequency samples supplied to the IDFT must occur in complex conjugate pairs (see footnote i) and obey the following equations. For N odd,

$$H(N-k) = H^{*}(k), \qquad k = 1, 2, \ldots, (N-1)/2$$
Fig. 3.15. Example of a frequency sampled design of a complex filter. Plots (a) and (b) show the desired magnitude and phase response of the filter. Our frequency sampled design with N = 50 is computed by taking an IDFT to get the impulse response shown in (c). The resulting frequency response is shown in (d) which passes through all of the sampled points of (a).
Footnote i: We use the * superscript to denote the complex conjugate of a number.
and for an even N,

$$H(N-k) = H^{*}(k), \qquad k = 1, 2, \ldots, (N/2)-1$$

$$H(N/2) = 0$$
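The conjugate symmetry conditions can be sketched in code. The function below fills the upper half of the spectrum from the lower half, sets H(N/2) = 0 when N is even, and applies the IDFT. The function name and the desired response (a lowpass magnitude with linear phase) are assumptions made for this illustration; the sketch also assumes that H(0) is real, which is needed for a real impulse response.

```python
import numpy as np

def frequency_sampling_fir(desired, N):
    """Build a length-N real impulse response from desired H[k], k = 0..N//2."""
    H = np.zeros(N, dtype=complex)
    H[0] = desired[0].real                  # assumption: DC sample must be real
    last = (N - 1) // 2                     # (N-1)/2 for odd N, N/2 - 1 for even N
    for k in range(1, last + 1):
        H[k] = desired[k]
        H[N - k] = np.conj(desired[k])      # H(N-k) = H*(k)
    if N % 2 == 0:
        H[N // 2] = 0.0                     # H(N/2) = 0 for even N
    h = np.fft.ifft(H)
    return h.real, np.max(np.abs(h.imag)), H

N = 50
k = np.arange(N // 2 + 1)
magnitude = np.where(k <= 8, 1.0, 0.0)            # hypothetical lowpass target
phase = -np.pi * k * (N - 1) / N                  # linear phase, delay (N-1)/2
desired = magnitude * np.exp(1j * phase)

h, max_imag, H_samples = frequency_sampling_fir(desired, N)
print(max_imag)                                   # negligible imaginary residue
print(np.allclose(np.fft.fft(h), H_samples))      # response passes through samples
```

Between the sampled frequencies the response is not constrained, which is why the text recommends plotting the DFT of the result to check the design.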
Figure 3.15 shows an example of a frequency sampling design. The frequency response that we wish to achieve is that of the HRTF shown in the top two plots of the figure and is represented by the 150 point magnitude and phase response. The symmetry of the magnitude and phase response ensures that the above equations are satisfied. For an impulse response of length 50, the figure shows the impulse response, and resulting frequency response, of a filter generated using the frequency sampling approach.
6.1.3. Optimal FIR filters
As an example of the trade-offs in filter design, a filter that is allowed to have ripples in the passband and stopband can be made to have a steeper transition region than one which is constrained to be monotonic. In practice, ripples in the passband and stopband (see Fig. 3.16) are acceptable, with the allowable ripple specified in the filter design. The most widely used FIR design algorithm was developed by Parks and McClellan.6 This design is optimal in the sense that for a given length (number of taps), it has the smallest maximum error in the stopband and passband. Designing such filters is a computationally expensive task which involves optimization to determine the coefficients. Such filters are designed using computer aided design (CAD) programs which take the specifications as inputs, and produce the impulse response of the filter as the output. More sophisticated programs can generate DSP specific subroutines which can then be called from an applications program. Figure 3.16 shows an example of a filter design using the MATLAB Signal Processing Toolbox from Mathworks Inc.
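As an illustration of this kind of CAD design, the sketch below uses the Parks-McClellan implementation in SciPy (`scipy.signal.remez`) rather than the MATLAB toolbox mentioned above. The band edges follow the 500 Hz and 600 Hz values of Figure 3.16, but the sampling rate (8 kHz) and the number of taps (101) are assumptions chosen for this example and are not taken from the figure.

```python
import numpy as np
from scipy import signal

fs = 8000.0        # assumed sampling rate in Hz
numtaps = 101      # assumed filter length

# Passband 0-500 Hz with gain 1, stopband 600 Hz to fs/2 with gain 0
taps = signal.remez(numtaps, [0, 500, 600, fs / 2], [1, 0], fs=fs)

# Evaluate the resulting magnitude response on a dense frequency grid
freqs, response = signal.freqz(taps, worN=4096, fs=fs)
gain = np.abs(response)

passband_error = np.max(np.abs(gain[freqs <= 500] - 1.0))
stopband_peak = np.max(gain[freqs >= 600])
print(f"max passband deviation: {passband_error:.3f}")
print(f"max stopband gain:      {stopband_peak:.3f}")
print("symmetric taps (linear phase):", np.allclose(taps, taps[::-1]))
```

With equal weighting of the two bands, the passband and stopband ripples come out with similar size; a `weight` argument can be supplied to trade one against the other.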
6.2. IIR FILTERS
An infinite impulse response (IIR) filter has the following formula:

$$y[n] = \sum_{k=0}^{N} a[k]\,x[n-k] - \sum_{j=1}^{M} b[j]\,y[n-j]$$

where x[n] are the inputs, y[n] the outputs and a[k], b[j] are the filter coefficients. Compared with the FIR filter, which is a function of the previous N inputs, the IIR filter is also a function of the previous M outputs, and it is this recursive formulation that makes the impulse response infinite in length.
For nearly all applications, the filter coefficients a[k] and b[j] are derived from one of the four standard filter types (Butterworth, Chebyshev, inverse Chebyshev or elliptic). Each type is optimal in some fashion, and the different types represent tradeoffs in steepness, ripple and phase response. In order to illustrate the differences in these four types of filters, Figure 3.17 shows an 8th order bandpass filter design using the Butterworth, Chebyshev and inverse Chebyshev methods. Butterworth filters are maximally flat and their transfer function changes monotonically with frequency. Chebyshev filters allow ripples in the passband, enabling them to obtain a steeper filter (at the expense of more nonlinearity in the phase response). The inverse Chebyshev filter is maximally flat in the passband and has ripples in the stopband. The elliptic filter (not shown) has the steepest transfer characteristic, with ripples in both passband and stopband. Unfortunately, the elliptic filter also has the worst phase response.
Fig. 3.16. Example showing the design of a 500 Hz lowpass filter using the MATLAB DSP toolkit. The transition region is set between 500 Hz (Fpass) and 600 Hz (Fstop), the passband ripple is 3 dB (Rpass) and stopband attenuation is 50 dB (Rstop). The filter design software uses the Parks-McClellan algorithm to determine the filter order (27) and compute the filter coefficients. Only the magnitude response is shown, but this filter has linear phase.
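The recursive formulation can be sketched directly from the difference equation, keeping the sign convention used in this section (feedforward terms added, feedback terms subtracted). The sketch assumes that the feedback sum starts at j = 1, since the current output cannot depend on itself. The function name and the one-pole example are illustrative choices, not coefficients from any of the standard designs.

```python
import numpy as np

def iir_filter(x, a, b_feedback):
    """y[n] = sum_k a[k] x[n-k] - sum_j b[j] y[n-j], with b_feedback = [b[1], ..., b[M]]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = 0.0
        for k, a_k in enumerate(a):                     # previous inputs
            if n - k >= 0:
                acc += a_k * x[n - k]
        for j, b_j in enumerate(b_feedback, start=1):   # previous outputs
            if n - j >= 0:
                acc -= b_j * y[n - j]
        y[n] = acc
    return y

# One-pole lowpass: y[n] = (1 - p) x[n] + p y[n-1], i.e. a[0] = 1 - p, b[1] = -p
p = 0.9
impulse = np.zeros(50)
impulse[0] = 1.0
h = iir_filter(impulse, a=[1 - p], b_feedback=[-p])

# The impulse response decays geometrically but never becomes exactly zero
print(h[:4])
```

A single feedback coefficient is enough to make the response infinitely long, whereas an FIR filter with one coefficient would stop after one sample.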
Fig. 3.17. Comparison of Butterworth, Chebyshev and inverse Chebyshev lowpass filter types (magnitude and phase). All 3 filters are of order 8 and hence have the same computational requirements.
The simple IIR filters described above can be made as analog hardware continuous time filters, and in a DSP context this is standard for anti-aliasing and reconstruction filters. IIR filters are designed by first designing the continuous time filter, and then translating the design to a discrete time IIR filter using the bilinear transform (see annotated bibliography for example). As with FIR filters, these are usually designed with the aid of CAD software. Quantization effects and decreased stability for IIR filters mean that special attention must be paid to the implementation of IIR filters. For this reason, there are different structures for implementing IIR filters. Among these structures are direct form, cascade, parallel and lattice, which provide varying degrees of tradeoff between computational complexity and sensitivity to quantization (for examples see refs. 5 or 7). The lattice structure is the least sensitive to quantization effects, but requires more multiplications to produce the same result as the direct or cascade forms.
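The translation from a continuous time design to a discrete time IIR filter can be illustrated with the simplest case. The sketch below applies the bilinear transform to a first order analog lowpass H(s) = wc/(s + wc), with the analog cutoff prewarped so that the digital filter has its -3 dB point at the requested frequency. The sampling rate, cutoff and function names are assumptions for this example; practical designs of higher order are normally produced by CAD software.

```python
import numpy as np

def first_order_lowpass(fc, fs):
    """Bilinear transform of H(s) = wc/(s + wc) with frequency prewarping.

    Returns (a0, a1, b1) for y[n] = a0 x[n] + a1 x[n-1] - b1 y[n-1].
    """
    K = np.tan(np.pi * fc / fs)     # prewarped analog cutoff divided by 2*fs
    a0 = K / (1.0 + K)
    a1 = K / (1.0 + K)
    b1 = (K - 1.0) / (1.0 + K)
    return a0, a1, b1

def gain_at(f, fs, a0, a1, b1):
    """Magnitude of the discrete time frequency response at frequency f."""
    z_inv = np.exp(-2j * np.pi * f / fs)
    return abs((a0 + a1 * z_inv) / (1.0 + b1 * z_inv))

fs = 48000.0    # assumed sampling rate in Hz
fc = 1000.0     # assumed cutoff frequency in Hz
a0, a1, b1 = first_order_lowpass(fc, fs)

print(gain_at(0.0, fs, a0, a1, b1))       # unit gain at DC
print(gain_at(fc, fs, a0, a1, b1))        # 1/sqrt(2) at the cutoff
print(gain_at(fs / 2, fs, a0, a1, b1))    # zero at half the sampling rate
```

The bilinear transform maps the whole analog frequency axis onto the range from 0 to half the sampling rate, which is why the response reaches zero exactly at fs/2 and why prewarping is needed to place the cutoff correctly.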
ACKNOWLEDGMENTS
The authors would like to thank Dr. Mark Hedley and Dr. Russell Martin for kindly reviewing this chapter.
ANNOTATED BIBLIOGRAPHY
R. Chassaing, “Digital Signal Processing with C and the TMS320C30,” Wiley, 1992. This book deals with many practical issues in implementing DSP systems. Source code for implementing commonly used DSP functions such as FFT and filtering is included as well as examples on interfacing to ADCs and DACs. The Texas Instruments TMS320C30 DSP chip is used for examples in this book.
J.R. Johnson, “Introduction to Digital Signal Processing,” Prentice-Hall, 1989. A good, easy to understand, introduction to DSP for undergraduates.
A.V. Oppenheim and R.W. Schafer, “Digital Signal Processing,” Prentice-Hall, 1975. One of the standard DSP textbooks for undergraduate Electrical Engineering students. This book covers all of the basic ideas of DSP theory.
T.W. Parks and J.H. McClellan, “Chebyshev Approximation for Nonrecursive Digital Filters with Linear Phase,” IEEE Trans. Circuit Theory, Vol. CT-19, March 1972, pp. 189-194. Paper on the most commonly used FIR filter design method.
D. Pralong and S. Carlile, “Measuring the human head-related transfer functions: A novel method for the construction and calibration of a miniature “in-ear” recording system,” J. Acoust. Soc. Am., 1994, pp. 3435-3444. Paper which describes procedures used to measure the human head-related transfer function.
REFERENCES
1. Vernon JA, Katz B, Meikle MB. Sound measurements and calibration of instruments. In: Smith CA, Vernon JA, ed. Handbook of auditory and vestibular research methods. Springfield, Illinois: Thomas Books, 1976:307-356.
2. Zhou B, Green DM, Middlebrooks JC. Characterization of external ear impulse responses using Golay codes. J Acoust Soc Am 1992; 92:1169-1171.
3. Carlile S, Pralong D. The location-dependent nature of perceptually salient features of the human head-related transfer function. J Acoust Soc Am 1994; 95:3445-3459.
4. Pralong D, Carlile S. Measuring the human head-related transfer functions: A novel method for the construction and calibration of a miniature “in-ear” recording system. J Acoust Soc Am 1994; 95:3435-3444.
5. Oppenheim AV, Schafer RW. Digital signal processing. New York: Prentice-Hall, 1975.
6. Parks TW, McClellan JH. Chebyshev Approximation for Nonrecursive Digital Filters with Linear Phase. IEEE Trans Circuit Theory, 1972; CT-19:189-194.
7. Chassaing R. Digital Signal Processing with C and the TMS320C30. New York: Wiley, 1992.
CHAPTER 4
GENERATION AND VALIDATION OF VIRTUAL AUDITORY SPACE
Danièle Pralong and Simon Carlile
1. INTRODUCTION
The aim in generating virtual auditory space (VAS) is to create the illusion of a natural free-field sound using a closed-field sound system (see Fig. 1.1, chapter 1) or, in another free-field environment, using a loudspeaker system. This technique relies on the general assumption that identical stimuli at a listener’s eardrum will be perceived identically, independent of their physical mode of delivery. However, it is important to note that the context of an auditory stimulus also plays a role in the percept generated in the listeners (chapter 1, section 2.1.2) and that auditory and nonauditory factors contribute to this.
Stereophony is generated by recording sounds on two separate channels and reproducing these sounds using two loudspeakers placed some distance apart in front of a listener. If there are differences in the program on each channel, the listening experience is enhanced by the generation of a spatial quality to the sound. This could be considered as a relatively simple method for simulating auditory space. The easiest way to simulate a particular auditory space is to extend this technique to the use of multiple loudspeakers placed in a room at the locations of the sound sources to be simulated. The usefulness of this technique is limited by the room’s size and acoustics, by the number of speakers to be used, and by the restricted area of space where the simulation is valid for the listener (for review see Gierlich1).
It is now generally accepted that the simulation of acoustical space is best achieved using closed-field systems, since headphones allow complete control over the signal delivered to the listener’s ears independently of a given room size or acoustical properties. The disadvantage of this technique is that it requires compensation of the transfer function of the sound delivery system itself, that is, the headphones. As we will see in sections 4 and 6 of this chapter, this problem is not trivial, for both acoustical and practical reasons, and is a potential source of variation in the fidelity of VAS.
1.1. GENERATION OF VAS USING A CLOSED-FIELD SYSTEM
One technique for the simulation of auditory space using closed-field systems involves binaural recording of free-field sound using microphones placed in the ears of a mannequin2 or the ear canals of a human subject.3 These recordings, when played back over headphones, result in a compelling recreation of the three-dimensional aspects of the original sound field, when there is proper compensation for the headphones’ transfer characteristics.2-4 Such a stimulus should contain all the binaural and monaural localization cues available to the original listener, with the exception of a dynamic link to head motion. However, this technique is rather inflexible in that it does not permit the simulation of arbitrary signals at any spatial location. Still, the binaural technique remains extremely popular in the architectural acoustics and music recording environments, using artificial heads such as KEMAR by Knowles Electronics, Inc.,5 or the one developed by Gierlich and Genuit6 (for recent reviews see Blauert4 and Møller7; see also footnote a).
The principles underlying the generation of VAS are closely related to the binaural recording technique. The fact that binaural recordings generate a compelling illusion of auditory space when played back over headphones validates the basic approach of simulating auditory space using a closed-field sound system. The more general approach to generating VAS is based on simple linear filtering principles (see chapter 3, section 5.1; also Wightman and Kistler9). Rather than record program material filtered by the outer ear, the head-related transfer functions (HRTFs) for a particular set of ears are measured for numerous positions in space and used as filters through which any stimulus to be spatialized is passed. The following set of equations has been adapted from Wightman and Kistler9 and describes, in the frequency domain, how the appropriate filter can be generated.
The series of equations described applies only to one ear, and a pair of filters, one for each ear, must be constructed for each position to be simulated.
Footnote a: Hammershøi and Sandvad8 recently generalized the term “binaural technique” by introducing the expression “binaural auralization” which is equivalent to the VAS technique described below.
Let X1 be a signal delivered by a free-field system, the spatial properties of which have to be reproduced under closed-field conditions. The signal picked up at the subject’s eardrum in the free-field, or Y1, can be described by the following combination of filters:

$$Y_1 = X_1\,L\,F\,M \tag{1}$$

where L represents the loudspeaker transfer function, F represents the free-field to eardrum transfer function, i.e., the HRTF for a given spatial location, and M the recording system transfer function. Similarly, if a signal X2 is driving a closed-field system, the signal picked up at the eardrum, or Y2, can be described as:

$$Y_2 = X_2\,H\,M \tag{2}$$

where H represents the headphone-to-eardrum transfer function, which includes both the headphone’s transducing properties and the transfer function from the headphone’s output to the eardrum. As the aim is to replicate Y1 at the eardrum with the closed-field system, one has:

$$Y_2 = Y_1 \tag{3}$$

Then:

$$X_2\,H\,M = X_1\,L\,F\,M \tag{4}$$

Solving for X2:

$$X_2 = X_1\,L\,F / H \tag{5}$$

Therefore, if the signal X1 is passed through the filter LF/H and played by the headphones, the transfer function of the headphones is canceled and the same signal will be produced at the eardrum as in free-field conditions. It can be seen from Equation 5 that with this approach the loudspeaker transfer function L is not eliminated from the filter used in the closed-field delivery. L can be obtained by first calibrating the loudspeaker against a microphone with flat transducing properties. Alternatively, L can be more easily eliminated if the closed-field delivery system employed gives a flat signal at the eardrum, like the in-ear tube phones made by Etymotic Research. In this case the filter used in closed-field is F only, which can be obtained by correcting for the loudspeaker and the microphone transfer functions (L M); this is illustrated in Figure 4.1. VAS localization results obtained using this latter method are reported in section 5.2 of this chapter.
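The algebra of Equations 1 to 5 can be checked with a short numerical sketch. The transfer functions below are arbitrary nonzero complex values per frequency bin, generated only to exercise the equations; they are not measured responses, and the variable names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 256

def hypothetical_tf():
    """Arbitrary complex response per bin with magnitude between 0.7 and 1.3."""
    return 1.0 + 0.3 * np.exp(2j * np.pi * rng.random(n_bins))

L_spk = hypothetical_tf()     # loudspeaker, L
F_hrtf = hypothetical_tf()    # free-field to eardrum (HRTF), F
H_phone = hypothetical_tf()   # headphone to eardrum, H
M_mic = hypothetical_tf()     # recording system, M

X1 = np.fft.fft(rng.standard_normal(n_bins))   # spectrum of the source signal

Y1 = X1 * L_spk * F_hrtf * M_mic               # Equation 1: free-field recording

vas_filter = L_spk * F_hrtf / H_phone          # filter LF/H from Equation 5
X2 = X1 * vas_filter                           # signal sent to the headphones
Y2 = X2 * H_phone * M_mic                      # Equation 2: closed-field recording

print(np.allclose(Y2, Y1))                     # Equation 3 is satisfied
```

The division by H is only well behaved where the headphone response is not close to zero; in the sketch this is guaranteed by construction, whereas with measured data deep notches in H would need separate care.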
1.2. ACOUSTICAL AND PSYCHOPHYSICAL VALIDATIONS OF VAS
As the perceptual significance of the fine structure of the HRTF is still poorly understood, the transfer functions recorded for VAS simulations should specify, as far as possible, the original, nondistorted signal input to the eardrum (however, see chapter 6, section 2.2). This is particularly important if VAS is to be applied in fundamental physiological and psychophysical investigations. This section is concerned with introducing and clearly differentiating the experimental levels at which the procedures involved in the generation of VAS can be verified.
Fig. 4.1. Dotted line: transducing system transfer function (loudspeaker, probe tube and microphone, L M; see equation (1), section 1.1). Broken line: transfer function recorded in a subject’s ear canal with the above system. Solid line: subject’s free-field-to-eardrum transfer function after the transducing system’s transfer function is removed from the overall transfer function (F). The measurements were made at an azimuth and an elevation of 0°.
The identification of the correct transfer functions represents the first challenge in the generation of high fidelity VAS. Firstly, the measured HRTFs will depend on the point in the outer ear where the recordings are taken (see section 2.1). Secondly, recordings are complicated by the fact that the measuring system itself might interfere with the measured transfer function. Section 2.1.2 in the present chapter deals with the best choice of the point for measurement of the HRTF. The question of the perturbation produced by the recording system itself is reminiscent of Heisenberg’s uncertainty principle in particle physics. We have examined this previously using a mannequin head fitted with a model pinna and a simplified ear canal.11 As schematized in Figure 4.2A, the termination of the model canal was
fitted with a probe microphone facing out into the ear canal (internal recording system). Transfer functions could also be recorded close to the termination of the canal by an external probe microphone held in place by a small metal insert seated at the entrance of the ear (external recording system); this system is described in more detail in sections 2.1.1 and 2.1.3. Figure 4.2B shows that the spatial location transfer functions computed from the two different microphone outputs were virtually identical. Thus recordings of transfer functions by the internal microphone in the absence and then in the presence of the external recording system could provide a measure of the effects introduced by the latter on the recorded transfer function. Figure 4.3A shows that the main changes in the HRTFs produced by the recording system were direction-independent small attenuations at 3.5 kHz (-1.5 dB) and at 12.5 kHz (-2 dB). The low standard error of the mean of the amplitude data pooled from three experiments (Fig. 4.3B) indicates that these effects were highly reproducible. Effects on phase were between -0.02 and 0.05 radians for frequencies below 2.5 kHz. The likely perceptual significance of the perturbations introduced by our recording system was also estimated by comparing the HRTFs recorded with and without the external recording system after they had been passed through a cochlear filter model12 (see chapter 2, section 2.5.1). In this case the mean attenuation produced by the recording system reached a maximum of 1.4 dB over the frequency range of interest.11 It was concluded that the perturbation produced by the external recording system was unlikely to be perceptually significant. Hellstrom and Axelsson13 also described the influence of the presence of a probe tube in a simple model of the ear canal, consisting of a 30 mm long tube fitted at the end with a 1/4-inch B&K microphone. The effect was greatest (-1 dB attenuation) at 3.15 kHz, the resonance maximum. However, this measure takes no account of the remainder of the external ear or of the possible direction-dependent effects of their recording system, and the system described was not compatible with the use of headphones.
Fig. 4.2. (A) Schematic drawing of the external recording system used to measure the HRTFs placed in a model ear showing the internal microphone probe 2 mm from the eardrum. (B) Transfer functions recorded in the model ear for a stimulus located at azimuth -90° and elevation 0°. Recordings from the internal microphone (solid line) and the external recording system (broken line) (dB re free-field stimulus level in the absence of the dummy head). Adapted with permission from Pralong D et al, J Acoust Soc Am 1994; 95:3435-3444.
Fig. 4.3. Mean of the differences in pressure level between measurements obtained in the presence of the external recording system and measurements obtained in the absence of the external recording system across 343 speaker positions as a function of frequency. Shown are the average of the mean for three experiments (A), and the standard error of the mean (B), incorporating a total of 1029 recordings. Reprinted with permission from Pralong D et al, J Acoust Soc Am 1994; 95:3435-3444.
The second challenge in the generation of VAS is to ensure that the digital procedures involved in the computation of the filters do not cause artifacts. Wightman and Kistler9 have presented an acoustical verification procedure for the reconstruction of HRTFs using headphones. This involves a comparison of the transfer functions recorded for free-field sources and for those reconstructed and delivered by headphones. The closed-field duplicated stimuli were within a few dB of amplitude and a few degrees of phase of the originals, and the errors were not dependent on the location of the sound source.
This approach validates the computational procedures involved in the simulation, as well as the overall stability of the system, but provides no information about the possible perturbing effects of the recording device. Therefore, acoustical verifications of both the recording procedure (as described above) and the computational procedures are necessary and complementary. Finally, a psychophysical verification, showing that the synthesized stimuli are perceptually equivalent to the percept generated by free-field stimuli, constitutes the ultimate and necessary demonstration of the fidelity of VAS. In this type of experiment, the ability of a group of listeners to localize sound sources simulated in VAS is rigorously compared to their ability to localize the same sources in the free-field. To be able to demonstrate any subtle differences in the percepts generated, the localization test must be one which truly challenges sound localization abilities (see chapter 1, section 1.4.2). Wightman and Kistler14 have presented a careful psychoacoustical validation of their HRTF recording and simulation technique. These results are reviewed in section 5.2 together with results obtained recently in our laboratory.
2. RECORDING THE HRTFS
2.1. ACOUSTICAL ISSUES
The description of an acoustical stimulus at the eardrum, and hence the recording of HRTFs, is complicated by the physics of sound transmission through the ear canal. In order to capture the transfer function of the outer ear, the recording point should be as close as possible to the eardrum. However, this is complicated by the fact that in the vicinity of the eardrum the sound field is complex. Over the central portion of the ear canal wave motion is basically planar over the frequency range of interest (< 18 kHz;15 see also Rabbitt and Friedrich16). Because of the finite impedance of the middle ear, the eardrum also acts as a reflective surface, particularly at high frequencies. The sound reflected at the eardrum interacts with the incoming sound and results in standing waves within the canal.17,18 Closer to its termination at the eardrum, the morphology of the canal can vary in a way which makes the prediction of pressure distribution in the canal for frequencies above 8 kHz more complex.19 Furthermore, close to the eardrum the distribution of pressure is further complicated by the fact that eardrum motion is also coupled to the sound field.20 The locations of the pressure peaks and nulls resulting from the longitudinal standing waves along the canal are dependent on frequency. These are manifest principally as a notch in the transfer function from the free-field to a point along the canal that moves up in frequency as the recording location approaches the eardrum.17,21 This is illustrated in Figure 4.4.
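As a rough numerical illustration of how such a notch moves with recording position, a quarter-wavelength estimate f = c/(4d) can be evaluated for a few distances d between the recording point and the eardrum. The speed of sound (343 m/s) and the distances below are assumed values chosen for the example; they are not taken from Figure 4.4, whose probe depths are measured into the canal rather than from the eardrum.

```python
SPEED_OF_SOUND_M_PER_S = 343.0   # assumed value for air

def quarter_wave_notch_hz(distance_m, c=SPEED_OF_SOUND_M_PER_S):
    """First notch frequency for a recording point distance_m from a reflecting end."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return c / (4.0 * distance_m)

# Hypothetical distances from the eardrum, in millimetres
distances_mm = [15, 10, 6, 4]
notches_hz = {d: quarter_wave_notch_hz(d / 1000.0) for d in distances_mm}

for d in distances_mm:
    print(f"{d:>2} mm from eardrum: notch near {notches_hz[d] / 1000.0:.1f} kHz")
```

Under these assumptions the estimated notch rises as the distance shrinks, in line with the upward movement of the notch described above.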
To a first approximation, the frequency of the notch can be related to the distance of the recording point from the eardrum using a 1/4 wave length approximation.21 From these considerations it is clear that placement of the microphone within the auditory canal records not only the signal of interest, namely the longitudinal mode of the incoming wave, but also a number of other components which for the purposes of recording the HRTF are epiphenomenal. A final point worth mentioning is that the input impedance of the measuring system should also be sufficiently high to avoid perturbing the acoustical environment sampled.16,22 The measurement of HRTFs for VAS is further constrained by the requirement that the recording system employed must also be capable of measuring the headphone-to-eardrum transfer function (see section 4). HRTFs can be obtained either from anthropometric model heads fitted with internal microphones or from human subjects. An alternative approach is to use pinnae cast from real human subjects and placed on an acoustical mannequin. Acoustical and psychophysical work is currently being undertaken in our laboratory to test whether this type of recording is equivalent to the HRTFs directly measured at the subject’s
Fig. 4.4. The frequency response spectrum of the replica of a human ear canal with the recording probe at depths of 4, 11, 16 and 17 mm into the canal. Reprinted with permission from Chan JCK et al, J Acoust Soc Am 1990; 87:1237-1247.
eardrum. There are a number of implementation issues in the recording of HRTFs from real human ears and these are discussed in the following sections.
2.1.1. Microphone type
A general problem with most microphones is that, due to their size, they will perturb the sound field over the wavelengths of interest. The choice of microphones suited for HRTF measurement is therefore limited by having to satisfy criteria both of reduced dimensions and of sufficient sensitivity and frequency response to provide a good signal to noise ratio over the frequency range of interest. Chapter 2 (section 1.4.1) gives a list of studies where direction-dependent changes in spectrum at the eardrum have been measured. Recordings were obtained using either probe microphones or miniature microphones small enough to be placed in the ear canal. Both types have a relatively low sensitivity (10-50 mV/Pa) and probe microphones also suffer the further disadvantage of a nonflat frequency response due to the resonance of the probe tube. Furthermore, because of their relative size, a miniature microphone introduced in the
ear canal will also perturb the sound field.23 Therefore HRTFs recorded in this way are not readily suitable for the generation of VAS.b Although the microphone’s own transfer function can be compensated for at various stages of the HRTF recording or VAS generation procedures (cf. sections 1.1 and 2.3.2), a frequency response as flat as possible should be preferred. The HRTFs are characterized by a relatively wide dynamic range, and if the frequency response varies considerably, then the spectral notches in the recorded HRTFs could disappear into the noise floor if they coincide with a frequency range where the system’s sensitivity is poor. The microphone should also be sufficiently sensitive to record the HRTFs using test stimuli at sound pressure levels which will not trigger the stapedial reflex. The stapedial reflex results in a stiffening of the ossicular chain which will cause a frequency-dependent variation in the input impedance to the middle ear (see section 2.3.3). Ideally, the microphone should also have a bandpass covering the human hearing frequency range, or at the least the range within which the cues to a sound’s location are found (see chapter 2). Two main types of microphones have been employed for recording HRTFs in humans: condenser probe microphones (e.g., Shaw and Teranishi;24 Shaw;25 Butler and Belendiuk; 3 Blauert;4 Møller et al26), and small electret microphones (Knowles EA-1934;23 Etymotic;9 Knowles EA-1954; 27 Sennheiser KE 4-211-228). Metal probe tubes are inadequate due to the curved geometry of the ear canal and are therefore fitted with a plastic extension (e.g., Møller et al28). 
Alternatively, a plastic probe can be directly fitted to an electret microphone.9,11 Probes have been chosen with a plastic soft enough so that it bends along the ear canal, but hard enough not to buckle when introduced into the canal.21 Furthermore, a compromise has to be found between the overall diameter of the probe and its potential interaction with the sound field in the ear canal. In our previous study 11 the probe was thick enough so that transmission across the probe tube wall was negligible compared to the level of the signal which is picked up by the probe’s tip (50 dB down on the signal transmitted through the tip from 0.2 to 16 kHz) and with an internal diameter still large enough not to significantly impair the sensitivity of the recording system.
b Møller7 has described how the disturbance should affect the final transfer function; it is argued that the error introduced in the HRTF will cancel if headphones of an “open” type are used for the generation of VAS, i.e., headphones which do not disturb the sound waves coming out of the ear canal, so that the ear canal acoustical conditions will be identical to those in the free-field.
2.1.2. Point of measurement: occluded canal recordings versus deep canal recordings
The location within the outer ear to which the HRTFs are to be measured is an important acoustical factor. Deciding on a specific recording location is also a prerequisite for a reproducible placement of probes at similar acoustical positions for different subjects and for the left and right ears of a given subject. Several studies have demonstrated that in the frequency range where only one mode of wave propagation is present (the longitudinal mode), sound transmission along the ear canal is independent of the direction of the incoming sound source.23,24,29,30 Consequently, the HRTFs measured in the ear canal can be accounted for by a direction-dependent component and a direction-independent component. The directional components are due to the distal structures of the auditory periphery such as the pinna, concha and head. The direction-independent components are due principally to the proximal component of the outer ear, the ear canal. From this point of view, measurements made at any point along the ear canal will contain all the directional information provided by the outer ear and should represent an accurate description of the directional components of the HRTFs. Thus, in theory, HRTFs measured at any point along the ear canal could be used for the generation of VAS but only if the appropriate correction is made for the direction-independent component so that the absolute sound pressure level at the eardrum is replicated.7 HRTFs described in the literature have been measured at three main positions along the ear canal: deep in the canal (i.e., close to the eardrum), in the middle of the canal, and at the entrance of the canal. Transfer functions have also been measured at the entrance of an occluded ear canal. For the latter technique, the microphone is embedded in an earplug introduced into the canal (see for example refs. 7, 28, 31, 32).
The plug’s soft material expands and completely fills the outer end of the ear canal, leaving the microphone sitting firmly and flush with the entrance of the ear canal. This method offers the advantage of eliminating the contribution of the outgoing sound reflected at the eardrum. Occluded ear canal measurements represent an obviously attractive option as the method is relatively noninvasive and is potentially less uncomfortable for the subjects than deep canal measurements (see below). However, there are a number of potentially complicating issues that need to be considered. Firstly, the main question with this technique is the definition of the point in the ear canal at which sound transmission ceases to be direction-dependent. Hammershøi et al (1991; quoted in ref. 7) have shown that one-dimensional sound transmission starts at a point located a few mm outside the ear canal. This is in contradiction with previous results by Shaw and Teranishi24 and Mehrgardt and Mellert30 demonstrating direction-dependent transmission
for locations external to a point 2 mm inside the ear canal (see however Middlebrooks and Green33). Secondly, the description of the canal contribution to the HRTFs as being nondirectional does not necessarily mean that blocking the canal will only affect the nondirectional component of the response. Shaw32 has argued that canal block increases the excitation of various conchal modes. This results simply from the increase in acoustical energy reflected back into the concha as a result of the high terminating impedance of the ear canal entrance. Finally, the HRTFs measured using the occluded canal technique will only be a faithful representation of the HRTFs to the eardrum if the transmission characteristics of the canal itself are accurately accounted for. Occluded canal recordings have been carefully modeled by Møller and coauthors.7,28 However what is still lacking is an acoustical demonstration that these types of measurements result in a replication of the HRTF at the eardrum of human subjects. Preliminary reports of the psychophysical validation of VAS generated using this method do not seem as robust as those obtained using deep canal recordings8 (see also section 5.2). It may be, however, that refinement of this technique, particularly with respect to precise recording location, may result in a convenient and a relatively noninvasive method for obtaining high fidelity HRTFs. An alternative position for recording the HRTFs is deep within the canal in the vicinity of the eardrum. Although deep ear canal measurements represent a more delicate approach to the recording of HRTFs, they have to date provided the most robust and accurate VAS14 (also see section 5.2). In this case, the measuring probe should be placed deep enough in the ear canal so that the recorded HRTFs are not affected by the longitudinal standing waves over the frequencies of interest. 
This will ensure that measurements in the frequency range of interest do not suffer a poor signal to noise ratio as a result of the pressure nulls within the canal. On the other hand, the probe should be distant enough from the termination of the canal so as to avoid the immediate vicinity of the eardrum where the pressure distribution becomes complex at high frequencies. Wightman and Kistler9 have reported that their recording probes were located 1-2 mm from the eardrum. We have chosen to place our probe microphones at a distance of 6 mm from the effective reflecting surface of the eardrum so that the 1/4 wavelength notch due to the canal standing waves occurs above 14 kHz27 (see below). Obviously, this position represents a compromise between the upper limit of the frequency range of interest and the subjects’ comfort and safety. Such transfer functions will provide an excellent estimate of the eardrum pressure for frequencies up to 6 kHz.21 However, as the frequency is increased from 6 kHz to 14 kHz, the pressures measured by the probe will represent a progressively greater underestimate of the effective eardrum pressure. As the auditory canal
is not directionally selective, this spectral tilt is unaffected by the location of the sound in space. Additionally, placing the probe at the same acoustical position when measuring the headphone transfer function will ensure that this effect will be canceled when the HRTFs are appropriately convolved with the inverse of the headphone transfer function (this will apply if headphones of an “open” type are employed; see section 2.1.1).
2.1.3. Microphone probe holders
Microphone probes have to be held in place in the ear canal by a system satisfying the following three criteria: to allow a safe and precise control of the probe’s position in the ear canal, to be of minimum dimensions in order to avoid occluding the ear canal or causing any disturbance of the sound field in the frequency range of interest, and to be suitable for measuring the headphone transfer function. Wightman and Kistler9 introduced the use of custom-made, thin bored-out shells seated at the entrance of the ear canal and fitted with a guide tube. We have used an approach which is largely based on this method.27 While keeping the idea of a customized holder, the original, positive shape of the distal portion of the ear canal and proximal part of the concha is recovered by first molding the shape of the external ear and then plating the surface of the mold. The thickness of the holder can be reduced to a minimum (less than 0.25 mm) using metal electroplating. The metal shell is slightly oversized compared to the original ear canal and concha and forms an interference fit when pressed into the canal. Figure 4.5 shows a photograph of the metal insert, probe tube and microphone assembly in place in a subject’s ear. The acoustical effects of this recording system (including probe and microphone) on the measured HRTFs for a large number of spatial locations of a sound source were determined using a model head fitted with a probe microphone11 (see section 1.2).
The results demonstrate that this system does not significantly perturb the sound field over the frequency range of interest (0.2 to 14 kHz).
2.1.4. Methods for locating the probe within the ear canal
For deep canal recordings considerable precautions have to be taken to avoid the possibility of damaging the subject’s eardrum. Also, the tissue surrounding the ear canal becomes increasingly sensitive with canal depth and should be avoided as well. It is thus crucial that the probe should be introduced to its final location under careful monitoring and by well-trained personnel. The subject’s ears should also be free of accumulated cerumen which could occlude the probe or be pushed further in and touch the eardrum. The use of probe tubes is also complicated by the fact that, due to the slightly curved geometry of the ear canal, in some subjects the probe ceases to be visible a few mm after it has entered the meatus. Two different methods have been
Fig. 4.5. Photograph of a metal insert, probe tube and microphone assembly used for head-related transfer functions recordings in place in a subject’s ear. Reprinted with permission from Pralong D et al, J Acoust Soc Am 1994; 95:3435-3444.
employed for controlling the placement and the final position of in-ear probes: “touch” and acoustical methods. An example of the former is best described in Lawton and Stinton,34 where a recording probe was placed at a distance of 2 mm from the eardrum in order to measure acoustical energy reflectance. The probe tube was fitted with a tuft of soft nylon fibers extending 2 mm beyond the probe tip. The fibers touching the eardrum would result in a bumping or scraping noise detected by the subject. The fibers’ deflection was also monitored under an operating microscope (see also Khanna and Stinton17). Wightman and Kistler9 employed a slightly modified version of this method, where a human hair was introduced into the ear canal until the subject reported that it had touched the eardrum. The hair was marked and then measured to determine the precise length for the probe. The probe’s final placement was also monitored under an operating microscope (Wightman and Kistler, personal communication).
Acoustical methods represent an alternative to the approaches presented above. Chan and Geisler21 have described a technique for measuring the length of human ear canals which we have applied to the placement of a probe for the measurement of HRTFs.27 Chan and Geisler showed that the position of a probe tube relative to the eardrum can be determined using the frequency of the standing wave notches and a 1/4 wavelength approximation. In our recording procedure the probe tube is first introduced so that it protrudes into the auditory canal about 10 mm from the distal end. A transfer function is then measured for a sound source located along the ipsilateral interaural axis, where the HRTF is fairly flat and the position of the standing wave notch is best detected. The frequency of the spectral notch is then used to calculate the distance from the probe to the reflecting surface. The frequency of the standing wave notch is monitored until the probe reaches the position of 6 mm from the eardrum, which leaves the standing wave notch just above 14 kHz. Both placement methods allow control of the position of the probe with a precision of about 1 mm.
2.1.5. Preservation of phase information
Precise control over the phase of the recorded signals is crucial for the generation of high fidelity VAS. In the case where a recording of absolute HRTFs is desired (i.e., a true free-field to eardrum transfer function devoid of the loudspeaker characteristics), the recording system (probe and microphone) has to be calibrated against the loudspeaker in the free-field (see above). It is essential for the two microphones or microphone probes to be placed at precisely the same location with respect to the calibration speaker for this measurement, which also matches the center of the subject’s head during recordings. Indeed, any mismatch will superimpose an interaural phase difference on the signals convolved for VAS.
This will create a conflict between time, level and spectral cues and therefore reduce the fidelity of VAS. For instance, a difference of as little as 10 mm in position would result in a time difference of 29 µs, corresponding to a shift of approximately 2° towards the leading ear.
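The two pieces of arithmetic used in sections 2.1.4 and 2.1.5 (the 1/4 wavelength relation between notch frequency and probe distance, and the delay produced by a placement mismatch) can be checked with a short sketch. The speed of sound of 343 m/s and the function names are assumptions made for illustration; they are not taken from the studies cited.

```python
# Assumed speed of sound in air at about 20 degrees C (not stated in the text).
SPEED_OF_SOUND_M_S = 343.0


def notch_frequency_hz(distance_m, c=SPEED_OF_SOUND_M_S):
    """First standing wave notch for a probe `distance_m` from the
    reflecting surface, using the 1/4 wavelength approximation."""
    return c / (4.0 * distance_m)


def distance_from_notch_m(notch_hz, c=SPEED_OF_SOUND_M_S):
    """Inverse relation: probe-to-reflector distance from a measured notch."""
    return c / (4.0 * notch_hz)


def path_delay_us(offset_m, c=SPEED_OF_SOUND_M_S):
    """Time difference (microseconds) caused by a path length mismatch."""
    return offset_m / c * 1e6


if __name__ == "__main__":
    print(f"notch for 6 mm: {notch_frequency_hz(0.006):.0f} Hz")
    print(f"distance for a 14 kHz notch: {distance_from_notch_m(14000) * 1e3:.2f} mm")
    print(f"delay for 10 mm mismatch: {path_delay_us(0.010):.1f} us")
```

Under these assumptions a probe 6 mm from the reflecting surface gives a notch at about 14.3 kHz, consistent with the notch lying just above 14 kHz, and a 10 mm mismatch gives a delay of about 29 µs.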
2.2. DIGITIZATION ISSUES
There are a number of important issues to be considered when selecting the digitization parameters for measuring the HRTFs, recording the program material and generating virtual auditory space. The background to these issues has been considered in some detail in chapter 3 and is only briefly considered here. The frequency bandwidth of the signal and its temporal resolution are determined by the digitization rate. However, there is an interesting anomaly in auditory system processing that makes the selection of a digitization rate a less than straightforward matter. Although the precise upper limit of human hearing is
dependent on age, for most practical purposes it can be considered as around 16 kHz. In this case it could be argued that, assuming very steep anti-aliasing filters, all the information necessary for the reconstruction of the signal should be captured with a digitization rate of 32 kHz. However, we also know that the auditory system is capable of detecting interaural time differences of as little as 6 µs (chapter 2, section 2.2.1). To preserve this level of temporal resolution, digitization rates would need to be as high as 167 kHz, some 5-fold higher than that required to preserve the frequency bandwidth. The digitization rate chosen has a serious impact on both the computational overheads and mass storage requirements. The time domain description of the HRTF (its impulse response) is of the order of a few milliseconds. The length of the filter used to filter the program material will be greater with higher digitization rates. The longer the filter, the greater the computational overheads. At the higher digitization rates most of the current convolution engines and AD/DA converters commonly used are capable of this kind of throughput and so can be used for generating VAS in real time. However, in practice this is only really possible where the length of the filter is less than 150 taps or so (see chapter 3, sections 5.3 and 6). In many implementations of VAS, the program material is continuously filtered in real time. In the real world, when we move our head, the relative directions between the ears and the various sound sources also vary. To achieve the same effect in VAS the location of the head is monitored (chapter 6) and the characteristics of the filters are varied rapidly. 
With current technology, if VAS is to be generated in real time and be responsive to changes in head orientation this is likely to require a processor for each channel and, assuming no preprocessing of the digital filter and the higher digitization rates, this would be limited to generating only one target in virtual space. The obvious alternative to this rather limited and clumsy implementation of VAS is to reduce digitization rate, thereby reducing the computational overhead. However, there are as yet no systematic studies examining digitization rate and the fidelity of VAS. This is somewhat surprising, particularly given the apparent disparity between frequency and time domain sensitivities in the auditory system. Selecting a digitization rate which simply satisfies the frequency bandwidth criteria will limit the temporal resolution to steps of 31 µs. Whether this reduction in temporal resolution leads to any degradation in the fidelity of VAS is unknown. On the other hand, selecting the digitization rate to satisfy the time domain requirements would result in a frequency bandwidth of around 80 kHz. Naturally, the corner frequencies of the anti-aliasing and reconstruction filters would also need to be set accordingly so that the time domain fidelity of the output signal is not compromised (chapter 3, section 3.2).
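The trade-off described above can be made concrete with a few lines of arithmetic. This is only a sketch: the 16 kHz bandwidth and the 6 µs interaural time difference come from the text, whereas the 3 ms impulse response duration is an assumed stand-in for "a few milliseconds", and the function names are hypothetical.

```python
import math


def nyquist_rate_hz(bandwidth_hz):
    """Lowest digitization rate that preserves the given bandwidth."""
    return 2 * bandwidth_hz


def sample_period_us(rate_hz):
    """Temporal step between successive samples, in microseconds."""
    return 1e6 / rate_hz


def rate_for_time_step_hz(step_us):
    """Digitization rate whose sample period equals `step_us`."""
    return 1e6 / step_us


def fir_taps(duration_us, rate_hz):
    """Number of filter taps needed to hold an impulse response."""
    return math.ceil(duration_us * rate_hz / 1e6)


if __name__ == "__main__":
    bandwidth_rate = nyquist_rate_hz(16_000)   # frequency-domain criterion
    timing_rate = rate_for_time_step_hz(6)     # time-domain criterion
    print(bandwidth_rate, sample_period_us(bandwidth_rate))
    print(round(timing_rate), timing_rate / bandwidth_rate)
    # Filter length for an assumed 3 ms impulse response at each rate.
    print(fir_taps(3000, bandwidth_rate), fir_taps(3000, 167_000))
```

With the assumed 3 ms response, the filter grows from 96 taps at 32 kHz to about 500 taps at 167 kHz, which illustrates why the higher rate is difficult to reconcile with a real time limit of roughly 150 taps.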
The choice of the dynamic range of the DA/AD converters to use in such a system is determined by both the dynamic range of the HRTFs and the dynamic range of the program material to be presented in VAS. While HRTFs vary from individual to individual, the maximum dynamic range across the frequency range of interest is unlikely to exceed 50 dB. This is determined by the highest gain component in the low to mid frequency range for locations in the ipsilateral anterior space and the deepest notches in the transfer functions in the mid to high frequency range (chapter 1, section 1.5). The gain of the program material can be controlled by either varying the gain after the DA conversion using a variable gain amplifier or variable attenuator, or alternatively by varying the gain of the digital record prior to conversion. Every electronic device (including DA and AD converters) has a noise floor. To obtain a high fidelity signal, where the signal is relatively uncontaminated by the inherent noise in the device (chapter 3, section 2.2), it is necessary to have as large a difference as possible between the gain of the signal to be output and the noise floor of the device. Where the gain of signal is controlled post conversion and assuming that the output signal fills the full range of DA converter (chapter 3, section 3.3), the signal to noise ratio of the final signal will be optimal. This will not be the case if the gain of the signal is varied digitally prior to conversion as the signal will be somewhat closer to the noise floor of the device. Often it is not practical to control the rapidly changing level of a signal program using post conversion controllers so a small sacrifice in fidelity needs to be tolerated. However, in practice, the overall average level of a signal can still be controlled by a post conversion controller. 
In this case a 16 bit device providing around 93 dB dynamic range (chapter 3, section 3.3) is quite adequate to the needs of everything but the most demanding of applications.
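The relation between converter word length and dynamic range, and the cost of attenuating the signal digitally before conversion, can be sketched as follows. The formula is the ideal ratio between full scale and one quantization step; it gives an upper bound of about 96 dB for 16 bits, and the sketch does not attempt to reproduce the figure of around 93 dB quoted in the text. The function names and the 20 dB attenuation are illustrative assumptions.

```python
import math


def ideal_dynamic_range_db(bits):
    """Ratio of full scale to one quantization step, in dB (ideal converter)."""
    return 20.0 * math.log10(2 ** bits)


def range_after_digital_attenuation_db(bits, attenuation_db):
    """Usable range left when the signal is attenuated digitally before DA
    conversion: the signal drops but the converter noise floor does not."""
    return ideal_dynamic_range_db(bits) - attenuation_db


if __name__ == "__main__":
    full = ideal_dynamic_range_db(16)
    print(f"ideal 16 bit range: {full:.1f} dB")
    # Margin left above an HRTF dynamic range of 50 dB (figure from the text).
    print(f"margin over 50 dB of HRTF range: {full - 50:.1f} dB")
    # Assumed example: 20 dB of digital attenuation before conversion.
    print(f"after 20 dB digital attenuation: "
          f"{range_after_digital_attenuation_db(16, 20):.1f} dB")
```

Controlling the average level after conversion leaves the full range available, whereas every dB of digital attenuation is subtracted directly from it.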
2.3. TEST SIGNAL AND RECORDING ENVIRONMENT
2.3.1. Recording environment
Head-related transfer functions are best recorded in a nonreverberant, anechoic room. Although the HRTFs could be recorded in any environment, HRTFs measured in a reverberant space will include both the free-field to eardrum transfer function and potentially direction-dependent effects of the room. The impulse-response function of the auditory periphery is relatively short (on the order of a few milliseconds). Therefore, if the room is of sufficiently large dimensions, reverberant effects due to the walls and ceiling could be removed by windowing the impulse response function, although floor reflections may still present a problem. Even though reverberation will improve the externalization of synthesized sound sources (Durlach et al73), the maximum flexibility for simulation will be achieved by separating room
acoustical effects from the effects of the subject’s head and body. This is particularly important in the case where HRTFs are to be used in the study of fundamental auditory processes. Sound sources can then be simulated for different acoustical environments using a combination of HRTFs and acoustical room models (e.g., Begault35).
2.3.2. Test signals for measuring the HRTFs
The first measurements of the frequency response of the outer ear were obtained in a painstaking and time-consuming way using pure tones and a phase meter.24,29,36 More recently, the impulse-response technique (see chapter 3, section 5.1) has been routinely applied to measure HRTFs. The total measurement time has to be kept as short as possible to minimize subject discomfort and ensure that the recording system remains stable throughout the measurement session. The main problem with the impulse response technique is the high peak factor in the test signal. That is, there is a high ratio of peak-to-RMS pressure resulting from the ordered phase relations of each spectral component in the signal. Where the dynamic range of the recording system is limited this can result in a poor signal to noise ratio in the recordings. Several digital techniques involving the precomputation of the phase spectrum have been developed to minimize the peak factor (see for instance Schroeder37). The algorithm developed by Schroeder has been applied by Wightman and Kistler9 to the generation of a broadband signal used for the recording of HRTFs. Signal to noise ratio can also be increased by using so-called maximum length sequence signals, a technique involving binary two-level pseudo random sequences.26 Recently, a similar technique according to Golay has been applied successfully to the measuring of HRTFs11,27,38 (see chapter 3, section 5.2). Several other steps can be used to increase the signal to noise ratio in the recording of HRTFs. Energy should be concentrated in the frequency band of interest.
Also, averaging across responses will provide a significant improvement, even when Golay codes are employed. Finally, the signal delivered can be shaped using the inverse filter functions of elements of the transducing system so that energy is equalized across the spectrum (see for instance Middlebrooks et al23). We have generated Golay codes shaped using the inverse of the frequency response of the whole transducing system to compensate for the loudspeaker and microphone high-frequency roll-off and the probe tube resonance (Fig. 4.6, broken line).11 The resulting response of the microphone was flat (±1 dB above 1 kHz) and at least 50 dB above noise over its entire bandwidth (0.2-16 kHz) (Fig. 4.6, solid line). Signal to noise ratio can also be improved at the level of sampling by adjusting the level to optimize analog to digital conversion (see section 2.2).
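The principle behind the Golay code technique can be illustrated with a minimal sketch: the responses to a complementary pair of binary sequences are cross-correlated with the sequences and summed, and the side lobes cancel to leave the impulse response. This is a generic textbook construction and is not claimed to be the procedure used in the studies cited. The function names are hypothetical, the system under test is replaced by an assumed four-tap impulse response, and circular convolution stands in for playback through a real loudspeaker, ear and microphone chain.

```python
import numpy as np


def golay_pair(order):
    """Complementary Golay pair of length 2**order (values +1 / -1)."""
    a = np.array([1.0])
    b = np.array([1.0])
    for _ in range(order):
        a, b = np.concatenate([a, b]), np.concatenate([a, -b])
    return a, b


def circular_response(system_ir, stimulus):
    """Simulated recording: circular convolution of stimulus and system."""
    n = len(stimulus)
    spectrum = np.fft.fft(stimulus) * np.fft.fft(system_ir, n)
    return np.real(np.fft.ifft(spectrum))


def golay_impulse_response(resp_a, resp_b, a, b):
    """Cross-correlate each response with its code and sum. Because
    |A(f)|^2 + |B(f)|^2 = 2N at every frequency, the sum is 2N * H(f)."""
    n = len(a)
    spectrum = (np.fft.fft(resp_a) * np.conj(np.fft.fft(a))
                + np.fft.fft(resp_b) * np.conj(np.fft.fft(b)))
    return np.real(np.fft.ifft(spectrum)) / (2 * n)


if __name__ == "__main__":
    a, b = golay_pair(6)                       # two 64-point codes
    true_ir = np.array([1.0, 0.5, -0.25, 0.1])  # assumed toy system
    estimate = golay_impulse_response(
        circular_response(true_ir, a), circular_response(true_ir, b), a, b)
    print(np.round(estimate[:6], 6))
```

Because the full energy of both sequences contributes to the estimate, the peak factor of the stimulus is low compared with a single impulse of the same peak level. Averaging repeated responses before this step would further reduce uncorrelated noise.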
Fig. 4.6. Broken line: transducing system (loudspeaker, probe tube and microphone) transfer function determined using Golay codes (dB re A/D 1 bit). Solid line: amplitude spectrum of the Golay codes stimulus used for recording head-related transfer functions as redigitized from the microphone’s output, as a result of shaping with the inverse of the system’s frequency response. Adapted with permission from Pralong D et al, J Acoust Soc Am 1994; 95:3435-3444.
2.3.3. Sound pressure level
Absolute sound pressure level is an important parameter in recording both the HRTFs and the headphone transfer functions. On the one hand, the level should be high so that signal to noise ratio is optimized. On the other hand, sound pressure level should be sufficiently moderate so that the middle ear stapedial reflex is not triggered. Wightman and Kistler9 presented a control experiment where HRTFs recorded at a free-field SPL of 70 dB were used to filter headphone stimuli delivered at 90 dB SPL. The HRTFs re-recorded in this condition differed significantly from the original HRTFs, in particular in the 1-10 kHz frequency range, reflecting the change in acoustical impedance at the eardrum. This clearly indicates that the SPL of the measuring stimulus should be matched to the level at which VAS is to be generated, or at least kept within a range where the measurement is linear. We have also measured HRTFs and headphone transfer functions at 70 dB SPL.11
2.4. STABILIZING THE HEAD FOR RECORDINGS
The subject’s head can be stabilized during the recording of the HRTFs by various devices such as a clamp, a bite bar or a chin rest. A head clamp is not advisable because of the possibility of perturbing the sound field. Also, the contraction of the masseter muscles
produced by a bite bar results in a noticeable change in the morphology of the distal portion of the auditory canal which may affect the positioning of the probe tube. Therefore a small chin rest seems to represent a good compromise between stability and interference with the recordings. We estimated the variation introduced into the data by any head movement during the recording session using multiple (9 to 13) measurements of the HRTF made directly in front of 8 subjects (azimuth 0° and elevation 0°).27 The variation in the HRTFs was pooled across all subjects, and the pooled standard deviation was obtained by pooling the squared deviations of each individual’s HRTF from their respective means. To gain insight into the perceptual significance of these acoustical variations, the HRTFs were filtered using an auditory filter model to account for the spectral smoothing effects of the auditory filters12 (see section 1.2, and section 2.5.1 in chapter 2). This resulted in pooled standard deviations around 1 dB except for frequencies around 8 kHz or higher, where the deviation reached a maximum of about 2 dB. Middlebrooks et al23 introduced the use of a head-tracking device (see section 5.1) worn by the subject during the HRTF measurements, providing a measurement of the head’s position throughout the session. We have recently refined this technique by providing subjects with feedback from the head-tracker, allowing them to keep their head within less than 1.5 degrees of azimuth, elevation and roll compared to the original position.
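The pooled standard deviation described above can be sketched as follows. The text does not state the divisor that was used; this sketch assumes the conventional one, the total degrees of freedom (the sum over subjects of the number of repeats minus one). The function name and the toy data are hypothetical.

```python
import numpy as np


def pooled_std_db(measurements):
    """Pooled standard deviation per frequency bin.

    `measurements` is a list with one array per subject, each of shape
    (n_repeats, n_frequencies) holding HRTF magnitudes in dB. Squared
    deviations are taken about each subject's own mean and then pooled.
    """
    sum_sq = sum(((m - m.mean(axis=0)) ** 2).sum(axis=0) for m in measurements)
    dof = sum(m.shape[0] - 1 for m in measurements)  # assumed divisor
    return np.sqrt(sum_sq / dof)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: 8 subjects, 9 to 13 repeats each, 4 frequency bins, with a
    # different mean level per subject and 1 dB of repeat-to-repeat scatter.
    data = [rng.normal(loc=rng.uniform(-10, 10), scale=1.0, size=(n, 4))
            for n in (9, 10, 11, 12, 13, 9, 10, 11)]
    print(pooled_std_db(data))
```

Taking deviations about each subject's own mean keeps the differences between subjects out of the estimate, so that it reflects only the variation between repeated measurements.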
3. FILTERING SIGNALS FOR PRESENTATION
There is a huge literature examining the efficient filtering of signals using digital methods and it would not be possible to give adequate coverage here. Instead, we will focus on a number of key issues and look at ways in which they are currently being implemented in VAS display. Digital filtering is a very computationally intensive process and there are a number of efficient, purpose-designed devices available. As discussed above (section 2.2), these devices are, on the whole, being pushed close to their operating limits with the implementation of even fairly simple VAS displays involving a small number of auditory objects and very rudimentary acoustical rendering. Where there are a large number of virtual sources and/or a complex acoustical environment involving reverberation and therefore long impulse response functions, the computational overheads become sufficiently high that real time generation of VAS is not possible. There are a number of research groups currently examining ways in which the efficiency of the implementation of VAS can be improved (see chapter 6). The two main approaches to filtering a signal are either in the time domain or the frequency domain. In the first case, digital filters,
generally in the form of the impulse response of the HRTF and the inverse of the headphone transfer function, are convolved with the program material (chapter 3, section 5.3). The number of multiply and accumulate operations required for each digitized point in the program material is directly related to the length of the FIR filter. Each filter can represent a single HRTF or the combination of a number of filter functions such as the HRTF plus the inverse of the headphone transfer function plus some filter describing the acoustics of the environment in which the auditory object is being presented (e.g., the reverberant characteristics of the environment). In the latter case the filter is likely to be many taps long and require considerable processing. In addition to the computational time taken in executing a filter there may also be a delay associated with each filter which adds to the total delay added to the program material. In the situation where the user of a VAS display is interacting with the virtual environment (see chapter 6), the selection of the components of the filters will be dependent on the relative location and directions of the head and the different acoustical objects in the VAS being generated (this will include sources as well as reflecting surfaces). There may also be a further small delay added as the appropriate libraries of HRTFs are accessed and preprocessed. In a dynamic system, the total delays are often perceptually relevant and appear as a lag between some action by the operator (e.g., turning the head) and the appropriate response of the VAS display. In such cases the VAS is not perceived to be stable in a way that the real world is stable when we move around. The perceptual anomalies produced by such a poor rendering of VAS can lead to simulator sickness, a condition not unlike sea sickness. 
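As a rough illustration of the time domain approach described above, the following sketch convolves program material with an FIR filter and estimates the multiply and accumulate load. The function names, the 256-tap filter length and the 44.1 kHz sampling rate are assumptions chosen for the example and are not taken from any particular VAS display.

```python
import numpy as np

def fir_filter(signal, taps):
    """Direct-form FIR convolution: each input sample contributes
    len(taps) multiply-accumulate operations to the output."""
    signal = np.asarray(signal, dtype=float)
    taps = np.asarray(taps, dtype=float)
    out = np.zeros(len(signal) + len(taps) - 1)
    for n, x in enumerate(signal):
        out[n:n + len(taps)] += x * taps
    return out

def cascade(*filters):
    """Combine several impulse responses (e.g., HRTF, inverse headphone
    filter, room response) into a single, longer FIR filter."""
    combined = np.array([1.0])
    for h in filters:
        combined = np.convolve(combined, h)
    return combined

def macs_per_second(n_taps, sample_rate_hz, n_channels=2):
    """Multiply-accumulate operations per second for real-time filtering."""
    return n_taps * sample_rate_hz * n_channels

# Hypothetical load: one source, two ears, 256-tap filters at 44.1 kHz.
print(macs_per_second(256, 44100))
```

The combined filter returned by `cascade` has a length equal to the sum of the component lengths minus one per join, which is why adding an environmental response to the HRTF quickly increases the per-sample cost.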
Thus, computational efficiency is an issue of considerable importance not just in terms of the fidelity of a real time display but in the general utility of the display. In a time domain approach, care must be taken when switching filters with a continuous program to avoid discontinuities due to ringing in the filters and large jumps in the characteristics of the filters. If the head is moved slowly in the free-field, a continuous free-field sound source will undergo a continuous variation in filtering. However, by definition sampling of the HRTF must be done at discrete intervals. If VAS is based on a relatively coarse spatial sampling of the HRTFs and there is no interpolation between the measured HRTFs, there may be significant jumps in the characteristics of the filters. A number of interpolation methods have been employed to better capture the continuous nature of space and, therefore, the variations in the HRTFs (see chapters 5 and 6). As yet there is no systematic study of the effects on the fidelity of the generated VAS of using relatively large steps in the HRTF samples. In any case, the seamless switching of filters requires convolution methods that account for the initial conditions of the system. Similarly, if the filters start with a step change
from zero then they are also likely to ring and produce some spectral splatter in the final signal. Fortunately, this problem is simply dealt with by time windowing to ensure that the onset and offset of the filters make smooth transitions from zero. The second approach to filtering is in the frequency domain. This approach becomes increasingly efficient as the impulse response of the combined filters becomes longer. In this approach the program material is broken up into overlapping chunks and shifted into the frequency domain by an FFT. This is then multiplied by the frequency domain description of the filter and the result is shifted back into the time domain by an inverse FFT. Each chunk of the now filtered program material is time windowed and reconstructed by overlapping and adding together the sequential chunks. The efficiency of the current implementations of the FFT makes this the most practical approach when the filters are relatively long, as occurs when complex or reverberant acoustical environments are being rendered. There are also new, very efficient methods and hardware for computing FFTs.
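The frequency domain approach can be sketched as follows. Note that this is the standard overlap-add scheme, in which non-overlapping input blocks are zero-padded and the filtered blocks overlap on output; it is a swapped-in variant, not the windowed, overlapping-input scheme described above, although both rest on the same FFT, multiply and inverse FFT cycle. The function name `overlap_add` and the block length are assumptions made for illustration.

```python
import numpy as np

def overlap_add(signal, taps, block_len=256):
    """Filter `signal` with FIR `taps` by block-wise FFT multiplication."""
    signal = np.asarray(signal, dtype=float)
    taps = np.asarray(taps, dtype=float)
    # FFT size must hold one block plus the filter tail to avoid
    # circular wrap-around; round up to a power of two.
    n_fft = 1
    while n_fft < block_len + len(taps) - 1:
        n_fft *= 2
    filt_spectrum = np.fft.rfft(taps, n_fft)  # computed once, reused
    out = np.zeros(len(signal) + len(taps) - 1)
    for start in range(0, len(signal), block_len):
        block = signal[start:start + block_len]
        filtered = np.fft.irfft(np.fft.rfft(block, n_fft) * filt_spectrum, n_fft)
        n_valid = len(block) + len(taps) - 1
        # The tail of each filtered block overlaps the next block's head.
        out[start:start + n_valid] += filtered[:n_valid]
    return out

# Check against direct convolution on hypothetical random data.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
h = rng.standard_normal(64)
print(np.allclose(overlap_add(x, h, block_len=128), np.convolve(x, h)))
```

The cost per block is dominated by two FFTs whose size grows only logarithmically per sample with filter length, which is why this route tends to win over direct convolution once the combined impulse response is long.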
4. DELIVERY SYSTEMS: RECORDING HEADPHONE TRANSFER FUNCTIONS (HPTFS)
Many points discussed in the previous sections of this chapter for the recording of HRTFs also apply to the recording of HpTFs. The calibration of headphones presents some specific problems, some of which could contribute significantly to the spatial quality of signals resynthesized under VAS. Choosing headphones with good transducing qualities is an obvious factor. As shown further in this section, the headphone capsule type has to be taken into consideration as well. In particular, the type of headphones employed should leave the recording system undisturbed so that the probe’s position in the ear canal is not altered when the HpTFs are being measured. There has been very little published on headphone transfer functions measured in real ears. Calibration is generally performed on artificial ear systems because of the difficulties associated with using real ears. While such systems may be useful for controlling the quality of headphones from the manufacturer’s point of view, they have limited relevance to the headphones’ performance on real human ears. Moreover, results are generally expressed as differences between the sound pressure level produced by the headphones at a given point in the ear canal and the sound pressure level produced at the same point by a free-field or diffuse field sound source.39,40 These measurements give information about the ability of headphones to replicate free-field or diffuse field sound reproduction. In the case of the generation of VAS the acoustical objective is to reconstitute, using headphones, the sound field at the eardrum that would normally occur for a free-field stimulus. The HRTFs for free-field sources are measured to a specific reference
in the ear canal and the HpTFs must be measured to the same reference point if they are to be properly accounted for (see section 1.1). Figure 4.7 shows a number of measurements of HpTFs recorded for two different types of headphones from the internal probe microphone at the eardrum of our model head (see section 3).c These transfer functions were obtained as follows. The transfer function of the probe-microphone system was first obtained by comparing its free-field response to a loudspeaker with that of a flat transducer (B&K 4133 microphone) to the same speaker. The HpTF was then obtained by deconvolving the probe-microphone response from the response recorded in the ear canal with the stimulus delivered through headphones. This transfer function describes both the headphones’ transducing properties and the headphone-to-eardrum transfer function. Figure 4.7A shows a number of measurements of HpTFs recorded for circum-aural headphones (Sennheiser Linear 250). The HpTF is characterized by gains of up to 18 dB for frequencies between 2 kHz and 6 kHz and between 8 kHz and 12 kHz, separated by a spectral notch around 7.5 kHz. These spectral features are similar to the main features of the free-field-to-eardrum transfer functions obtained for positions close to the interaural axis (see chapter 2, section 1.5). With circum-aural headphones the ear is entirely surrounded by the external part of the phone’s cushion. Therefore, it is not surprising that this type of headphone captures many of the filtering effects of the outer ear. The standard deviation for six different placements of the headphones was worst at 8 kHz (around 2 dB) and below 1 dB for the other regions of the spectrum below 12 kHz (Fig. 4.7B). This demonstrates that the placement of this type of headphones on the model head is highly reproducible.
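The two-step procedure described above (calibrating the probe-microphone system against a flat reference, then deconvolving it from the ear canal recording) might be sketched in the frequency domain as follows. The sketch assumes that all inputs are impulse responses with the stimulus already removed, and it uses a small regularization constant to avoid division by near-zero bins; the function names and the regularization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def spectral_divide(numerator, denominator, eps=1e-12):
    """Regularized complex division, used here as deconvolution."""
    return numerator * np.conj(denominator) / (np.abs(denominator) ** 2 + eps)

def probe_transfer_function(probe_response, reference_response, n_fft):
    """Probe system response relative to a flat reference microphone,
    both recorded in the free field from the same loudspeaker."""
    return spectral_divide(np.fft.rfft(probe_response, n_fft),
                           np.fft.rfft(reference_response, n_fft))

def headphone_transfer_function(ear_canal_response, probe_tf, n_fft):
    """Remove the probe system from the in-ear headphone recording."""
    return spectral_divide(np.fft.rfft(ear_canal_response, n_fft), probe_tf)

# Synthetic check with hypothetical impulse responses.
speaker = np.array([1.0, 0.5])
probe = np.array([1.0, -0.3, 0.1])
headphone = np.array([0.8, 0.2, 0.0, -0.1])
n_fft = 16
probe_tf = probe_transfer_function(np.convolve(speaker, probe), speaker, n_fft)
hptf = headphone_transfer_function(np.convolve(headphone, probe), probe_tf, n_fft)
gain_db = 20 * np.log10(np.abs(hptf))  # magnitude in dB, as plotted in such figures
print(np.allclose(hptf, np.fft.rfft(headphone, n_fft), atol=1e-8))
```

In the synthetic check the loudspeaker response cancels in the first division and the probe response cancels in the second, leaving only the headphone-to-eardrum response.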
In contrast, the reproducibility of recordings obtained for repeated placements of headphones of the supra-aural type (Realistic Nova 17) was less satisfying (Fig. 4.7C). Although these transfer functions were characterized by a generally flatter profile, there was considerable variation between replicate fittings and the standard deviation for six different placements reached 8 to 9 dB around 8 and 12 kHz (Fig. 4.7D). Indeed, this type of headphone flattens the outer ear convolutions in a way which is difficult to control from placement to placement. Although favored for their flat response when tested on artificial ear systems, this type of headphone is clearly not suitable for VAS experiments which rely on a controlled broadband signal input to the auditory system. Transfer functions for the Sennheiser 250 Linear circum-aural headphones recorded from a human subject are shown in Figure 4.7E. In this case measurements were made with the in-ear recording system as
Footnote c: This measures only the reproducibility of positioning, not the disturbance of the external recording system used on real human ears.
described above (section 1.2; see Fig. 4.2). These transfer functions show moderate gains for frequencies between 2 and 4 kHz and between 4 and 6 kHz (up to 10 and 5 dB respectively) and a second 10 dB gain at 12 kHz, preceded by a -25 dB notch at 9 kHz. Although these features are qualitatively similar to those obtained from the model head with the same headphones (Fig. 4.7A), there are significant differences in the amplitude and frequency of spectral transformations. They can probably be attributed to differences between the human and model pinna shape and size, ear canal length, acoustical reflectance at the eardrum as well as the probe’s position. These transfer functions compare well with data reported for similar types of headphones by Wightman and Kistler9 and Møller et al.28 The standard deviation for six different placements of the headphones was worst around 14 kHz (about 4 dB), about 2 dB for frequencies between 8 and 12 kHz and below 1 dB for frequencies below 7 kHz (Fig. 4.7F). Part of the variability observed for high frequencies could be caused by slight movements of the probe due to the pressure applied by the headphones onto the external recording system. Thus, the standard deviation reported here probably represents an overestimate of the variability between different placements of circum-aural headphones on a free human ear. Circum-aural headphones therefore seem to be best suited for the generation of VAS, and have been used in all the VAS studies published so far (e.g., Sennheiser HD-340;9 Sennheiser HD-43041). Some models of circum-aural headphones also have capsules of an ovoid rather than a round shape, which fit better around the pinna and leave enough space for microphone cartridges to be placed below the ear without being disturbed.
Fig. 4.7. (A) Headphone-to-eardrum transfer functions for Sennheiser 250 Linear circum-aural headphones measured using the model head, (C) Realistic Nova 17 supra-aural headphones measured using the model head, and (E) Sennheiser 250 Linear circum-aural headphones measured from the left ear of a human subject. Six recordings were made in each condition with the headphones taken off and repositioned between each measurement. (B), (D), (F): standard deviations for (A), (C), (E), respectively.
Møller et al28 have presented an extensive study describing transfer functions from 14 different types of headphones to the occluded ear canals of 40 different subjects, including many circum-aural headphones. Control over the precise position of headphones around the outer ear remains a critical issue in the generation of high fidelity VAS. In the absence of any acoustical measurement device, the correspondence between subsequent placements of the headphones and that used in calibrating the HpTFs cannot be determined. Mispositioning circum-aural headphones slightly above or below the main axis of the pinna can result in significant changes in the measured HpTF, with notches in the transfer function shifting in frequency in ways similar to changes
observed in HRTFs when a free-field sound source moves around the ear (Pralong and Carlile, unpublished observations). This problem can be eliminated, at least in the research environment, by the use of in-ear tube phones such as the Etymotic Research ER-2. These tube phones fit tightly in the ear canal through an expanding foam plug and are calibrated to give a flat response at the eardrum. Psychophysical validation of VAS using Etymotic ER-2 tube phones is presented in section 5.2. Another possible advantage of in-ear tube phones over headphones is that headphones might cause a cognitive impairment in the externalization of a simulated sound source due to their noticeable presence on the head. On the other hand, the auditory isolation provided by in-ear tube phones could be a source of problems in itself. Finally, HpTFs vary from individual to individual, thus requiring individualized calibration for the generation of high fidelity VAS. This issue is addressed in section 6 of this chapter together with the general problem of the individualization of HRTFs.
5. PERFORMANCE MEASURES OF FIDELITY
Section 4.2 in chapter 1 gives a detailed account of the different methods available for the measurement of sound localization performance in humans, and it is concluded that a most demanding test for the fidelity of VAS is to compare the absolute localization of transient stimuli in VAS with that in the free-field. The following sections briefly review the methods by which localization of transient stimuli can be assessed as well as studies where localization in VAS has been systematically compared to localization in the free-field. We also illustrate this discussion with some recent results from our own laboratory.
5.1. METHODOLOGICAL ISSUES
There are three main methodological issues that need to be considered in the assessment of sound localization by humans: (i) how the position of a sound source is varied; (ii) how the subject indicates where the sound source is perceived to be; and (iii) the statistical methods by which localization accuracy is assessed. In most experimental paradigms the first two problems are linked and the solutions that have been employed offer different advantages and disadvantages. One way in which the sound source location can be varied is by using a fixed number of matched speakers arranged at a number of locations in space around the subject. Following the presentation of a stimulus the subject is required to identify which speaker in the array generated the sound. This procedure has been referred to as “categorization,”42 as subjects allocate their judgment of an apparent sound location to one response alternative in a set of available positions (see for instance Butler;43 Blauert;44 Hebrank and Wright;45,46 Belendiuk and Butler;47 Hartman;48 Musicant and Butler49). The set of possible responses, or categories, is generally small (below 20). The obvious
technical advantage is that changing the location of the source is simply a matter of routing the signal to a different speaker. One technical difficulty with such a system is that the speakers in the array have to be well matched in terms of transduction performance. This has generally been achieved by testing a large number of speakers and simply selecting those which are best matched. With the advent of digital signal processing (see chapter 3) it has been possible to calibrate each speaker and then modify the spectrum of the signal to compensate for any variations between the sources (see Makous and Middlebrooks50). In this kind of experimental setup there are a discrete number of locations and the subject’s foreknowledge of the potential locations constrains their response to a relatively small number of discrete locations. This is particularly a problem when a sound location is ambiguous or appears to come from a location not associated with any of the potential sources. Indeed, differences in the arrangement of the loudspeakers also seem to guide the subject’s responses.51-53 For instance, the accuracy of the localization of a lowpass sound above the head was found to be dependent on whether other sources used in the localization experiments were restricted to either the frontal or lateral vertical plane.53 Thus, as these authors point out, the context in which the stimulus was presented had influenced the localization judgments. A similar pattern of results was reported by Perrett and Noble51 where sound localization under conditions where the subject’s choices were constrained was compared to conditions where they were unconstrained. This clearly demonstrated the powerful influence of stimulus context and foreknowledge of the potential source locations.
Butler et al52 also showed that prior knowledge of the azimuth location of the vertical plane in which a stimulus was presented had a significant effect on the accuracy of localization in that vertical plane. However, a similar interaction was not shown for foreknowledge of the vertical position of the horizontal plane from which a stimulus could be presented (see Oldfield and Parker54 and Wightman and Kistler42 for further discussion of these important points). A variant of this procedure is to use a limited array of sources to generate the stimuli but to conceal the number and location of these sources from the subject. An acoustically transparent cloth can be placed between the subject and the speaker array and a large number of potential speaker locations can be indicated on the cloth (e.g., Perrett and Noble51). Although there may be only a few of these potential locations actually associated with a sound source, the subject’s response is not constrained to the small number of sources. Additionally, the cognitive components associated with the foreknowledge of the disposition of the potential sources are also eliminated. The second approach to varying the target location is to vary the source location using a movable loudspeaker so that the choice of locations is a continuous function of space. If this procedure is carried
out in the dark, or with the subject blindfolded, the subject has no foreknowledge as to potential target locations. The method by which subjects report their judgments in these studies has to be carefully considered, as some of the variance in the results could be accounted for by the subject’s ability to perform the task of indicating the perceived location and not be related to his/her perceptual abilities. In the series of studies by Oldfield and Parker54-56 subjects reported the position of sound by pointing to the target using a hand-held “gun,” the orientation of which could be photographed. A similar procedure was used by Makous and Middlebrooks50 where subjects pointed their nose towards the perceived sound location and the orientation of their head was registered with an electromagnetic tracking device (see also Fig. 1.2 and section 2.1.2 of chapter 1).d Head-pointing is currently the most popular method for indicating the location of the stimulus. Precautions have to be taken so that the variability associated with the subjects’ ability to orient their head is less than the error in their localization of the sound source (see below). Also, the resolution of the tracker should be better than human sound localization accuracy. One of the most commonly used head-tracking systems has an angular accuracy of about 0.5 degrees (e.g., Polhemus IsoTrack), which is well below the best possible human performance. The use of head-tracking devices is further discussed in chapter 6. Wightman and Kistler14,59 (see also Kistler and Wightman60) used a third type of judgment method where subjects indicate the perceived position of a sound source by calling out numerical estimates of apparent azimuth and elevation, using spherical coordinates. In this type of experiment the results would also depend on the ability of the subject to map perceived locations to the coordinate system rather than motor ability.
In all cases, subjects would first have to learn the task until they produce steady and reliable judgments in the free-field before any comparison can be made with localization performance in VAS. Also, multiple measurements should be made for each position in order to describe all the biological variance and to obtain a good statistical estimate of localization accuracy. All three procedures described above produced similar results with respect to localization errors and their regional dependency, with the difference that the coordinate estimation procedure seemed to produce more variance in the data.42 We have recently measured a number of the error components associated with pointing the head towards an unconstrained target in the dark.61 Turning to face an auditory target is clearly a highly ecological
Footnote d: Another method of estimation has been described recently, where subjects report the presumed location of a target by pointing onto a spherical model of the coordinate system.57,58 Also, Hammershøi and Sandvad8 have used a similar approach where the subjects indicated a perceived position from a limited number of locations using a digital notepad.
behavior and tracking the position of the head using a head mounted device provides an objective measure of the perceived location of the sound source. One problem associated with “head pointing” is that orienting the head is only one component of the behavior that results in visual capture of the source. The other major component is the movement of the eyes. For sound locations where head movement is mechanically unrestricted (around the frontal field and relatively close to the audio-visual horizon) it is very likely that the eyes will be centered in the head and the position of the head will provide a good indicator of the perceived position of the sound. However, for sound locations which would require more mechanically extreme movements of the head it is more likely that the eyes will deviate from their centered position so that there will be an error between the perceived location of the source and the position indicated by the head. Recent results from our laboratory verify this prediction and we have developed training procedures based on head position feedback which act to minimize these errors.61
5.2. EXPERIMENTAL RESULTS
The study by Wightman and Kistler14 provided the first rigorous assessment of the perceptual validity of VAS. The stimulus employed in their localization tests was a train of eight 250 ms bursts of noise. In order to prevent listeners from becoming familiar with a specific stimulus or transducer characteristics, the spectrum of each stimulus was divided into critical bands and a random intensity was assigned to the noise within each critical band (spectral scrambling). In both free-field and VAS, the same eight subjects were asked to indicate the apparent spatial position of 72 different target locations by calling out numerical estimates of azimuth and elevation (see above for details). VAS was synthesized using individualized HRTFs, i.e., recorded from their own ear canal.9 Subjects were blindfolded in both conditions and no feedback regarding the accuracy of their judgment was given. The data analysis for this type of experiment is complicated by the fact that targets and responses are distributed on the surface of a sphere. Therefore, spherical statistical techniques have been used in analyzing the data.14,62,63 Two useful parameters provided by these statistics are the centroid judgment, which is reported as azimuth and elevation and describes the average direction of a set of response vectors for a given location, and the two-dimensional correlation coefficient, which describes the overall goodness of fit between centroids and targets. Another problem in this type of data comes from the cone of confusion type of errors, i.e., front-to-back or back-to-front reversals (see chapter 1, section 2.1.3). Cone of confusion errors need to be treated as a particular case, as they represent large angular deviations from their targets and their inclusion in the overall spherical statistics
would lead to a large overestimate of the average angular error. Two types of analytical strategies have been proposed to deal with this type of error. Makous and Middlebrooks50 remove confusions from the pool of data and treat them separately. Wightman and Kistler14 have chosen to count and resolve cone of confusion errors; that is, the response is reversed towards the hemisphere where the target was located, and the rate of reversal is computed as a separate statistic. Therefore, responses where a confusion occurred still contribute to the computation of the overall localization accuracy while the true perception of the subject is reflected in the statistics. Table 4.1 provides a general summary of the results obtained by Wightman and Kistler.14 It can be seen that the individual overall goodness of fit ranges from 0.93 to 0.99 in the free-field, and from 0.89 to 0.98 in VAS. Although no significance was provided for these coefficients, at least it can be said that the free-field and VAS performances are in good agreement. A separate analysis of the azimuth and elevation component of the centroids of the pooled localizations for each location shows that, while the azimuthal component of the stimulus location is rendered almost perfectly, the simulation of elevation is less satisfying, as reflected by a slightly lower correlation between the responses and target elevations. Also, the cone of confusion error rate, which is between 3 and 12% in the free-field, increases to between 6 and 20% in VAS.
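The quantities discussed above (the centroid of a set of responses, the angular error between two directions, and the resolution of front-back reversals) can be sketched as follows. The coordinate convention (azimuth 0° straight ahead, ±90° on the interaural axis, elevation 0° on the horizon) and the simple hemisphere test used to detect a reversal are assumptions made for this illustration; they are not necessarily the exact criteria used in the studies cited.

```python
import numpy as np

def to_unit_vector(az_deg, el_deg):
    """Direction as a unit vector: x points front, y to the side, z up."""
    az, el = np.radians(az_deg), np.radians(el_deg)
    return np.stack([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)], axis=-1)

def centroid(az_deg, el_deg):
    """Mean direction of a set of responses, as (azimuth, elevation)."""
    v = to_unit_vector(np.asarray(az_deg, float), np.asarray(el_deg, float)).mean(axis=0)
    v = v / np.linalg.norm(v)
    return (float(np.degrees(np.arctan2(v[1], v[0]))),
            float(np.degrees(np.arcsin(np.clip(v[2], -1.0, 1.0)))))

def angular_error(az1, el1, az2, el2):
    """Great-circle angle in degrees between two directions."""
    dot = np.sum(to_unit_vector(az1, el1) * to_unit_vector(az2, el2), axis=-1)
    return float(np.degrees(np.arccos(np.clip(dot, -1.0, 1.0))))

def resolve_front_back(target_az, response_az):
    """Mirror the response about the interaural axis if it lies in the
    opposite front/back hemisphere to the target (assumed criterion).
    Returns (resolved azimuth, whether a reversal was counted)."""
    reversed_ = np.cos(np.radians(target_az)) * np.cos(np.radians(response_az)) < 0
    if reversed_:
        response_az = ((180.0 - response_az + 180.0) % 360.0) - 180.0
    return float(response_az), bool(reversed_)

# Hypothetical trial: target at 20 deg azimuth, response heard behind at 170 deg.
az, was_reversed = resolve_front_back(20.0, 170.0)
print(az, was_reversed, angular_error(20.0, 0.0, az, 0.0))
```

Counting the reversal separately and then scoring the mirrored response keeps a single confusion from contributing an error of well over 100° to the average, which is the overestimate the text warns about.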
The increase in the front-back confusion rate and the decrease in elevation accuracy suggest that in these experiments there may have been some mismatch between the sound field generated at the eardrum in VAS and the sound field at the eardrum in free-field listening (see chapter 2; also see section 1.2 of this chapter).e This issue was recently examined in our laboratory using high-fidelity individualized HRTFs acoustically validated as described in section 1.2.64 Free-field localization accuracy was determined in a dark anechoic chamber using a movable sound source (see chapter 1). Broadband noise bursts (0.2-14 kHz, 150 ms) were presented from 76 different spatial locations and 9 subjects indicated the location by turning to face the source. Head position was tracked using an electromagnetic device (Polhemus IsoTrack). Cone of confusion errors were removed
Footnote e: Hammershøi and Sandvad8 recently presented a preliminary report of a psychophysical validation of VAS generated using occluded ear canal HRTF recordings. The localization test consisted of the identification of 17 possible sound sources reported using a digital notepad, and two different types of stimuli were tested: a 3-second speech signal and 3 seconds of white noise. They show a doubling of the total number of errors in VAS compared to free-field and a particularly large increase in the number of cone of confusion errors. It is not clear yet whether these poor results are related to any procedural problem in the generation of VAS or to the use of occluded ear recordings.
Table 4.1. Spherical statistics and cone of confusion errors for measures of free-field sound localization (first value in each cell) and measures of VAS sound localization (in brackets), shown individually for 8 subjects.

ID    Goodness of fit   Azimuth correlation   Elevation correlation   % reversal
SDE   0.93 (0.89)       0.983 (0.973)         0.68 (0.43)             12 (20)
SDH   0.95 (0.95)       0.965 (0.950)         0.92 (0.83)             5 (13)
SDL   0.97 (0.95)       0.982 (0.976)         0.89 (0.85)             7 (14)
SDM   0.98 (0.98)       0.985 (0.985)         0.94 (0.93)             5 (9)
SDO   0.96 (0.96)       0.987 (0.986)         0.94 (0.92)             4 (11)
SDP   0.99 (0.98)       0.994 (0.990)         0.96 (0.88)             3 (6)
SED   0.96 (0.95)       0.972 (0.986)         0.93 (0.82)             4 (6)
SER   0.96 (0.97)       0.986 (0.990)         0.96 (0.94)             5 (8)

Adapted with permission from Wightman FL et al, J Acoust Soc Am 1989; 85: 868-878.
from the data set. Figure 4.8A shows that the mean localization accuracy across all subjects varies with source location, being most accurate for frontal locations. The centroids of the localization judgments for each location were highly correlated with the actual location of the source (spherical correlation: 0.987) and the cone of confusion error rate was 3.1%. Our VAS stimuli were generated by filtering noise with the appropriate HRTFs and delivered using in-ear tube phones (Etymotic Research ER-2; see section 4). Sound localization was measured as for free-field so that each subject was tested with at least 5 consecutive blocks of each of 76 locations. Subjects acclimatized to the VAS task after a short training period without feedback, and localization accuracy was compared with accuracy in the free-field using the mean angular error for each test location. Although the spatial variation of the mean angular errors in VAS is very similar to that seen in the free-field, they were on average larger by about 1.5° (paired t-test: t = 3.44, df = 75, p = 0.0009). The cone of confusion error rate is 6.4%, a fraction of which can be accounted for by the slight increase in the average angular error in VAS around the interaural axis compared to the free-field. The correlation between localization in VAS and localization in the free-field is very high (spherical correlation: 0.983), indicating that the performance between the two conditions is very well matched (Fig. 4.8B). The difficulty of discriminating front and back in simulated auditory space has been known since the binaural technique was first developed.7 Frontal localization in particular seems to be a problem. Sound sources in the frontal hemisphere are often perceived as appearing
Fig. 4.8. (A) Free-field localization accuracy shown for 9 subjects (see text). The centroids of the localization judgments for each location (+) are very highly correlated with the actual location of the source (spherical correlation: 0.987). The horizontal and vertical axes of the ellipses represent ± 1 s.d. about the mean response for each location. (B) VAS localization accuracy shown for 5 subjects (see text). The centroids of the localization judgments (+) are very highly correlated with the HRTF recording location (spherical correlation: 0.974). *: indicates position with coordinates 0°, 0°.
from behind and are poorly externalized, although the problem in binaural recordings could be related to the use of material recorded from nonindividualized, artificial ear canals (see section 6). The resolution of frontal and posterior hemispheres seems to be the most fragile aspect of sound localization when tested using an auditory environment where cues are reduced to a minimum. This can be exemplified by the fact that front-back resolution is the first aspect of performance to be degraded when sound localization is investigated in a noisy background.57 These results have been discussed by Good and Gilkey57 in relation to the saliency of cues to a sound’s location. It appears quite likely that this aspect of localization could also deteriorate as a result of higher level factors, auditory or cognitive. Indeed, identical stimuli might be perceived identically only when they are presented in identical environments. Some experimental factors, like prior training with feedback in the same environment, still differed between the free-field and VAS localization experiments described above. Another of these factors is the absence of a dynamic link to head motion in VAS. In the light of the discussion presented in chapter 1 (section 2.3.1), it appears rather unlikely that head movements could contribute to the localization of transients in the experiments described in Fig. 4.8. Wightman and colleagues recently reported that dynamically coupling a stimulus presented in VAS to small head movements decreased the number of confusions for those subjects who performed particularly badly in the virtual environment.65 If the stimulus consisted of a train of 8 bursts of 250 ms as in their previous work9 it is then possible that the subjects used a scanning strategy to resolve ambiguous spectral cues in the signal, a process different to that involving the localization of transients.
In conclusion, experimental data obtained so far by Wightman and Kistler and collaborators as well as in our laboratory indicate that the simulation of free-field listening in VAS is largely satisfactory, as indicated by the very high correlations between localization accuracy in both conditions. Efforts remain to be made towards improving elevation accuracy and decreasing the number of cone of confusion errors. It appears that the evaluation of the fidelity of VAS using the localization of transient, static stimuli presented in anechoic conditions and uncoupled to head movements is delicate, and the unusual context and sparsity of cues of such stimuli could render the listener’s performance more susceptible to the influence of higher level factors. It should also be borne in mind here that the psychophysical validation of VAS relies on the primary assumption that the correct HRTFs have been measured, which remains difficult to establish empirically (see section 1.2 of this chapter). Progress in the understanding of the outer ear acoustics and in the control over environmental factors should enable sound localization in VAS with an accuracy matching that in the free-field. It is also expected that in more multi-modal and dynamic virtual environments, the contribution of higher level factors that can impair localization performance could become less important with more continuous stimuli when disambiguating information is obtained from correlated head movements and visual cues.
6. INDIVIDUALIZED VERSUS NONINDIVIDUALIZED HRTFS AND HPTFS
The question of whether individualized HRTFs have to be used to generate high-fidelity VAS is of considerable practical and theoretical interest. So far, the discussion in this chapter has assumed that individual HRTFs are recorded for each subject for whom VAS is generated. Indeed, due to differences in the sizes and shapes of the outer ears there are large differences in the measured HRTFs, particularly at high frequencies, which would seem to justify the use of personalized recordings.10,23,27 Furthermore, when the measured HRTFs are transformed using an auditory filter model, which accounts for the frequency dependence of auditory sensitivity and the frequency and level dependent characteristics of cochlear filters, the individual differences in the HRTFs are preserved, suggesting that the perceptually salient features in the HRTFs are likely to differ from subject to subject27 (see chapter 2, section 2.5). Preliminary psychoacoustical studies confirmed the importance of individualized HRTFs and showed that the best simulation of auditory space is achieved when the listener’s own HRTFs are used.66,67 It is clear from the previous sections of this chapter that measuring high-fidelity HRTFs is a delicate and time-consuming process. Measurements have to be carried out in a sophisticated laboratory environment, which is probably not achievable for all potential users of VAS displays. Wenzel et al63,66 have suggested that any listener might be able to make use of nonindividualized HRTFs if they have been recorded from a subject whose perceptual abilities in both free-field and close-field simulated sound localization are accurate. In a recent study, Wenzel et al41 asked inexperienced listeners to report the spatial location of headphone stimuli synthesized using HRTFs and HpTFs obtained from a subject characterized by Wightman and Kistler14 who was found to be an accurate localizer.
These results show that using nonindividualized HRTFs, listeners were substantially impaired in elevation judgement and demonstrated a high number of cone of confusion errors. Begault and Wenzel68 reported comparable results in similar experiments where speech sounds rather than broadband noise were used as stimuli. Unfortunately, the acoustical basis of the variability observed in the results by Wenzel et al69 is not known, as the waveform at the subject’s eardrum had not been recorded in this study. That is, as stated by the authors themselves, “to the extent that each subject’s headphone-to-eardrum transfer function differs from SDO’s [the accurate localizer], a less faithful reproduction would result.” We have illustrated in section 4 that the HpTFs for circum-aural
headphones do capture some of the outer ear filtering effects (see Fig. 4.7). Data from our laboratory show that consequently, like the free-field-to-eardrum transfer functions, the headphone-to-eardrum transfer functions for circum-aural headphones can indeed differ significantly from one subject to another.70 It can be seen in Figure 4.9 that the variability in the transfer functions is considerable for frequencies above 6 kHz, with an inter-subject standard deviation peaking at up to 17 dB for frequencies around 9 kHz for right ears. The frequency and depth of the first spectral notch varied between 7.5 and 11 kHz and -15 and -40 dB respectively. There are also considerable individual differences in the amplitude and the center frequency of the high frequency gain features. The intersubject differences in the HpTFs shown here are similar to those shown previously for circum-aural headphones.28 As previously described for HRTFs in the frontal midline,9,26,27,71 interaural asymmetries in the HpTFs are also evident, especially for frequencies above 8 kHz. These data demonstrate that when generating VAS using circum-aural headphones, the headphone transfer function will differ from subject to subject. It is therefore likely that the subjects in the experiments described above by Wenzel et al69 listened to different signals. This is illustrated by the data presented in Figure 4.10. We chose to examine the effects of using nonindividualized HpTFs by considering in detail the transfer functions of two subjects, A and B, who are amongst those described in Figure 4.9. Figures 4.10A and B show that there is a 2 kHz mismatch in the mid-frequency notch of the HpTFs for the two subjects for both left and right ears. Differences can also be observed at higher frequencies (above 10 kHz), as well as in the 2 kHz to 7 kHz region where the second subject is characterized by lower amplitude levels and a shift of the main gain towards low frequencies.
The spectral profile which would have been obtained if subject A’s HpTFs had been used to deconvolve one of subject A’s HRTFs when reconstituted in subject B’s ear canal was calculated (Fig. 4.10C and D). In this particular example the simulated location was at the level of the interaural axis and facing the left ear. In this condition subject A’s inverse headphone transfer functions were removed from the resynthesized stimulus whereas subject B’s headphone transfer functions were actually imposed on the delivered stimulus. It can be seen that due to the higher amplitude level of subject A’s HpTFs in the mid-frequency region, the resynthesized HRTF lacks up to 10 dB in gain between 3 kHz and 7 kHz. Furthermore, due to the mismatch of mid-frequency notches, the notch in the resynthesized HRTF is shifted down to 8 kHz and a sharp peak is created at 9 kHz where the notch should have appeared. Minor differences also appear above 10 kHz. In order to examine how much of the changes introduced by the nonindividualized headphone deconvolution might be encoded by the auditory system, the resynthesized HRTFs were passed through the auditory filter model previously described (see chapter 2, section 1.6.1,
Fig. 4.9. Top: headphone-to-eardrum transfer functions measured for a pair of Sennheiser 250 Linear headphones, for the left and right ears of 10 different human subjects. The transfer functions are displaced by 40 dB on the Y axis to facilitate comparison. Bottom: the mean (solid line) and standard deviation (thick dotted line) of the headphone-to-eardrum transfer function for each ear.
Fig. 4.10. Effects of using nonindividualized headphone transfer functions on the resynthesis of HRTFs, for left and right ears. (A) and (B): transfer functions for Sennheiser 250 Linear circum-aural headphones, for subject A (solid line) and B (dotted line). (C) and (D): head-related transfer functions for a spatial location on the left interaural axis (azimuth -90°, elevation 0°), as originally recorded for subject A (solid line), and as deconvolved with subject A’s HpTFs but imposed on subject B’s HpTFs (dotted line). Convolution and deconvolution were modeled in the frequency domain using Matlab (The MathWorks Inc.). (E) and (F): data displayed in (C) and (D) after being passed through an auditory filter model (see text). To simulate the effect of overall stimulus level, an offset of 30 dB was added to the input of each auditory filter.
and also section 1.2 of this chapter). The output of the cochlear filter model indicates that all of the differences described above are likely to be encoded in the auditory nerve and may well be perceptually relevant (Fig. 4.10E and F). The distortion of spectral cues provided by nonindividualized HpTFs may well be the basis for the disruption of sound localization accuracy previously observed with inexperienced listeners.69 Firstly, the effects on frequencies between 3 kHz and 7 kHz would disrupt a potential cue for the resolution of binaural front-back ambiguities.27 Secondly, the peak in transmission around 9 kHz illustrated in Figure 4.10 results in a filter function which is uncharacteristic of any set of HRTFs previously reported.9,27 Distortion effects similar to those described above have been reported by Morimoto and Ando.10 These authors used a free-field, symmetrical two-channel loudspeaker system to simulate HRTFs in an anechoic chamber and compare the localization accuracy of three subjects with different sizes of pinna. They described, in the frequency domain, the sound pressure level at the subject’s ear when a different subject’s HRTFs were reconstituted using their system. They observed that when listening to nonindividualized HRTFs the subject’s own free-field to eardrum transfer functions were being imposed on the signal. The different HRTFs associated with each ear will also lead to interaural spectral differences which vary continuously as a function of sound location. Computing the interaural spectral profile for subject B using subject A’s HRTFs with both nonindividualized and individualized HpTFs shows that, in contrast to the effects seen in the monaural spectral profile, there are only relatively small differences between binaural spectral differences generated using individualized and nonindividualized HpTFs (data not shown). 
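The cross-listening calculation described above can be sketched in a few lines. In the frequency domain, equalizing subject A's HRTF with A's HpTF and then delivering it through subject B's ear multiplies the spectrum by HpTF_B / HpTF_A, which is an addition of the difference when magnitudes are expressed in dB. The following is a minimal illustration only: the function names and all magnitude values are hypothetical (they are not read from Fig. 4.9 or Fig. 4.10), and the sketch assumes, for simplicity, the same HpTF mismatch at both ears.

```python
# All magnitudes are in dB at a handful of frequencies. The numbers are
# invented for illustration and are NOT taken from Fig. 4.9 or Fig. 4.10.

def resynthesized_db(hrtf_a_db, hptf_a_db, hptf_b_db):
    """Spectrum at listener B's eardrum when A's HRTF is equalized with
    A's HpTF (division = subtraction in dB) and then filtered by B's own
    HpTF (multiplication = addition in dB)."""
    return [h - ha + hb for h, ha, hb in zip(hrtf_a_db, hptf_a_db, hptf_b_db)]

def interaural_db(left_db, right_db):
    """Interaural spectral difference (left minus right) in dB."""
    return [l - r for l, r in zip(left_db, right_db)]

freqs_khz = [2, 4, 6, 8, 10]
hrtf_a_left = [5.0, 12.0, 8.0, -3.0, -20.0]     # hypothetical HRTF, left ear
hrtf_a_right = [-2.0, 1.0, -4.0, -10.0, -25.0]  # hypothetical HRTF, right ear
hptf_a = [0.0, 10.0, 6.0, -5.0, -25.0]          # subject A (assumed same for both ears)
hptf_b = [0.0, 4.0, -2.0, -30.0, -8.0]          # subject B (assumed same for both ears)

left_b = resynthesized_db(hrtf_a_left, hptf_a, hptf_b)
right_b = resynthesized_db(hrtf_a_right, hptf_a, hptf_b)

# Monaural distortion: what B receives minus what A's HRTF should deliver.
monaural_error = [got - want for got, want in zip(left_b, hrtf_a_left)]

# Binaural distortion: change in the interaural spectral difference.
ild_error = [got - want for got, want in zip(
    interaural_db(left_b, right_b),
    interaural_db(hrtf_a_left, hrtf_a_right))]

for f, m, b in zip(freqs_khz, monaural_error, ild_error):
    print(f"{f:>2} kHz  monaural error {m:+6.1f} dB  interaural error {b:+5.1f} dB")
```

Under the stated assumption of an identical mismatch at the two ears, the monaural error equals the HpTF difference at each frequency while the interaural error cancels exactly. With real, slightly asymmetric HpTFs the cancellation would only be approximate.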
This might be expected from a close inspection of the HpTFs, which shows that for any particular subject the intersubject differences are generally similar for both ears. Thus the use of nonindividualized HpTFs is likely to result in a greater disruption of monaural spectral cues compared to binaural spectral differences. Interaural level differences are known to contribute to the judgment of the horizontal angle of a sound source (for review, see Middlebrooks and Green72). It is not surprising then that when front-back confusions were resolved, most subjects from the Wenzel et al69 study retained the ability to correctly identify the azimuthal angle of sound sources. In view of the large variations in the individual HpTFs reported here, it is clear that these transfer functions must be accounted for in any controlled “cross-listening” experiment. Such experiments are important because they provide insights into how new listeners might perform with nonindividualized HRTFs and adapt to these over time. This is an important acoustical step in examining the link between the HRTFs and sound localization performance and the adaptation of
listeners to new sets of cues present in nonindividualized HRTFs. It remains at this point unclear whether performance using nonindividualized HRTFs will be improved if appropriate training is used and subjects are given the opportunity to adapt to new HRTFs. There are a number of alternative approaches to matching the user of a VAS display to the HRTFs used in generating this display. One could be to provide a large library of HRTFs and simply allow subjects to select from this library the HRTFs which result in the best sound localization performance. The success of such a system will depend on the number of dimensions over which HRTFs vary in the population and the development of practical means by which the selection of the appropriate library can be achieved. Another method might involve the identification of the dimension of variation of HRTFs within the population and the development of mechanisms for modifying a standard HRTF library to match individual listeners. Both of these approaches are currently under investigation in our laboratory.
ACKNOWLEDGMENTS
The new psychoacoustical work from the Auditory Neuroscience Laboratory presented in this chapter was supported by the National Health and Medical Research Council (Australia), the Australian Research Council and the University of Sydney. The assistance of Stephanie Hyams in collecting some of the data is gratefully acknowledged. We also wish to thank the subjects who took part in these experiments. The Auditory Neuroscience Laboratory maintains a Web page outlining the laboratory facilities and current research work at http://www.physiol.usyd.edu.au/simonc.
REFERENCES
1. Gierlich HW. The application of binaural technology. Appl Acoust 1992; 36:219-243.
2. Plenge G. On the differences between localization and lateralization. J Acoust Soc Am 1974; 56:944-951.
3. Butler RA, Belendiuk K. Spectral cues utilized in the localization of sound in the median sagittal plane. J Acoust Soc Am 1977; 61:1264-1269.
4. Blauert J. Spatial Hearing: The psychophysics of human sound localization. Cambridge, Mass.: MIT Press, 1983.
5. Burkhardt MD, Sachs RM. Anthropometric manikin for acoustic research. J Acoust Soc Am 1975; 58:214-222.
6. Gierlich HW, Genuit K. Processing artificial-head recordings. J Audio Eng Soc 1989; 37:34-39.
7. Møller H. Fundamentals of binaural technology. Appl Acoust 1992; 36:172-218.
8. Hammershøi D, Sandvad J. Binaural auralization. Simulating free field conditions by headphones. Presented at Audio Engineering Society. Amsterdam: 1994. 1-19.
9. Wightman FL, Kistler DJ. Headphone simulation of free field listening. I: Stimulus synthesis. J Acoust Soc Am 1989; 85:858-867.
10. Morimoto M, Ando Y. On the simulation of sound localization. J Acoust Soc Jpn 1980; 1:167-174.
11. Pralong D, Carlile S. Measuring the human head-related transfer functions: A novel method for the construction and calibration of a miniature “in-ear” recording system. J Acoust Soc Am 1994; 95:3435-3444.
12. Glasberg BR, Moore BC. Derivation of auditory filter shapes from notched-noise data. Hearing Res 1990; 47:103-138.
13. Hellstrom P, Axelsson A. Miniature microphone probe tube measurements in the external auditory canal. J Acoust Soc Am 1993; 93:907-919.
14. Wightman FL, Kistler DJ. Headphone simulation of free field listening. II: Psychophysical validation. J Acoust Soc Am 1989; 85:868-878.
15. Shaw EAG. The acoustics of the external ear. In: Studebaker GA, Hochberg I, ed. Acoustical factors affecting hearing aid performance. Baltimore: University Park Press, 1980:109-125.
16. Rabbitt RD, Friedrich MT. Ear canal cross-sectional pressure distributions: mathematical analysis and computation. J Acoust Soc Am 1991; 89:2379-2390.
17. Khanna SM, Stinson MR. Specification of the acoustical input to the ear at high frequencies. J Acoust Soc Am 1985; 77:577-589.
18. Stinson MR, Khanna SM. Spatial distribution of sound pressure and energy flow in the ear canals of cats. J Acoust Soc Am 1994; 96:170-181.
19. Stinson MR, Lawton BW. Specification of the geometry of the human ear canal for the prediction of sound-pressure level distribution. J Acoust Soc Am 1989; 85:2492-2503.
20. Rabbitt RD, Holmes MH. Three dimensional acoustic waves in the ear canal and their interaction with the tympanic membrane. J Acoust Soc Am 1988; 83:1064-1080.
21. Chan JCK, Geisler CD. Estimation of eardrum acoustic pressure and of ear canal length from remote points in the canal. J Acoust Soc Am 1990; 87:1237-1247.
22. Gilman S, Dirks DD. Acoustics of ear canal measurements of eardrum SPL in simulators. J Acoust Soc Am 1986; 80:783-793.
23. Middlebrooks JC, Makous JC, Green DM. Directional sensitivity of sound-pressure levels in the human ear canal. J Acoust Soc Am 1989; 86:89-108.
24. Shaw EAG, Teranishi R. Sound pressure generated in an external-ear replica and real human ears by a nearby point source. J Acoust Soc Am 1968; 44:240-249.
25. Shaw EAG. Transformation of sound pressure level from the free field to the eardrum in the horizontal plane. J Acoust Soc Am 1974; 56:1848-1861.
26. Møller H, Sørensen MF, Hammershøi D et al. Head-related transfer functions of human subjects. J Audio Eng Soc 1995; 43:300-321.
27. Carlile S, Pralong D. The location-dependent nature of perceptually salient features of the human head-related transfer function. J Acoust Soc Am 1994; 95:3445-3459.
28. Møller H, Hammershøi D, Jensen CB et al. Transfer characteristics of headphones measured on human ears. J Audio Eng Soc 1995; 43:203-217.
29. Wiener FM, Ross DA. The pressure distribution in the auditory canal in a progressive sound field. J Acoust Soc Am 1946; 18:401-408.
30. Mehrgardt S, Mellert V. Transformation characteristics of the human external ear. J Acoust Soc Am 1977; 61:1567-1576.
31. Shaw EAG. The external ear: new knowledge. In: Dalsgaard SC, ed. Earmolds and associated problems. Proceedings of the Seventh Danavox Symposium. 1975:24-50.
32. Shaw EAG. 1979 Rayleigh Medal lecture: the elusive connection. In: Gatehouse RW, ed. Localisation of sound: theory and application. Connecticut: Amphora Press, 1982:13-27.
33. Middlebrooks JC, Green DM. Directional dependence of interaural envelope delays. J Acoust Soc Am 1990; 87:2149-2162.
34. Lawton BW, Stinson MR. Standing wave patterns in the human ear canal used for estimation of acoustic energy reflectance at the eardrum. J Acoust Soc Am 1986; 79:1003-1009.
35. Begault DR. Perceptual effects of synthetic reverberation on three-dimensional audio systems. J Audio Eng Soc 1992; 40:895-904.
36. Shaw EAG. The external ear. In: Keidel WD, Neff WD, ed. Handbook of Sensory physiology. Berlin: Springer-Verlag, 1974:455-490.
37. Schroeder MR. Synthesis of low-peak-factor signals and binary sequences with low autocorrelation. IEEE Trans Inform Theory 1970; IT-16:85-89.
38. Zhou B, Green DM, Middlebrooks JC. Characterization of external ear impulse responses using Golay codes. J Acoust Soc Am 1992; 92:1169-1171.
39. Theile G. On the standardization of the frequency response of high-quality studio headphones. J Audio Eng Soc 1986; 34:956-969.
40. Villchur E. Free-field calibration of earphones. J Acoust Soc Am 1969; 46:1526-1534.
41. Wenzel EM, Arruda M, Kistler DJ et al. Localization using nonindividualized head-related transfer functions. J Acoust Soc Am 1993; 94:111-123.
42. Wightman FL, Kistler DJ. Sound localization. In: Yost WA, Popper AN, Fay RR, ed. Human psychophysics. New York: Springer-Verlag, 1993:155-192.
43. Butler RA. Monaural and binaural localization of noise bursts vertically in the median sagittal plane. J Audit Res 1969; 3:230-235.
44. Blauert J. Sound localization in the median plane. Acustica 1969-70; 22:205-213.
45. Hebrank J, Wright D. Are the two ears necessary for localization of sound sources on the median plane? J Acoust Soc Am 1974; 56:935-938.
46. Hebrank J, Wright D. Spectral cues used in the localization of sound sources on the median plane. J Acoust Soc Am 1974; 56:1829-1834.
47. Belendiuk K, Butler RA. Monaural location of low-pass noise bands in the horizontal plane. J Acoust Soc Am 1975; 58:701-705.
48. Hartmann WM. Localization of sound in rooms. J Acoust Soc Am 1983; 74:1380-1391.
49. Musicant AD, Butler RA. Influence of monaural spectral cues on binaural localization. J Acoust Soc Am 1985; 77:202-208.
50. Makous JC, Middlebrooks JC. Two-dimensional sound localization by human listeners. J Acoust Soc Am 1990; 87:2188-2200.
51. Perrett S, Noble W. Available response choices affect localization of sound. Perception and Psychophysics 1995; 57:150-158.
52. Butler RA, Humanski RA, Musicant AD. Binaural and monaural localization of sound in two-dimensional space. Perception 1990; 19:241-256.
53. Butler RA, Humanski RA. Localization of sound in the vertical plane with and without high-frequency spectral cues. Perception and Psychophysics 1992; 51:182-186.
54. Oldfield SR, Parker SPA. Acuity of sound localization: a topography of auditory space. I. Normal hearing conditions. Perception 1984; 13:581-600.
55. Oldfield SR, Parker SPA. Acuity of sound localization: a topography of auditory space. II. Pinna cues absent. Perception 1984; 13:601-617.
56. Oldfield SR, Parker SPA. Acuity of sound localization: a topography of auditory space. III. Monaural hearing conditions. Perception 1986; 15:67-81.
57. Good MD, Gilkey RH. Auditory localization in noise. I. The effects of signal to noise ratio. J Acoust Soc Am 1994; 95:2896.
58. Good MD, Gilkey RH. Auditory localization in noise. II. The effects of masker location. J Acoust Soc Am 1994; 95:2896.
59. Wightman FL, Kistler DJ. The dominant role of low-frequency interaural time differences in sound localization. J Acoust Soc Am 1992; 91:1648-1661.
60. Kistler DJ, Wightman FL. A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction. J Acoust Soc Am 1992; 91:1637-1647.
61. Carlile S, Leong P, Hyams S et al. Distribution of errors in auditory localisation. Proceedings of the Australian Neuroscience Society, 1996; (in press).
62. Fisher NI, Lewis T, Embleton BJJ. Statistical analysis of spherical data. Cambridge: Cambridge University Press, 1987.
63. Wenzel EM. Localization in virtual acoustic displays. Presence 1992; 1:80-105.
64. Carlile S, Pralong D. Validation of high-fidelity virtual auditory space. Br J Audiol, (in press).
65. Wightman F, Kistler D, Andersen K. Reassessment of the role of head movements in human sound localization. J Acoust Soc Am 1994; 95:3003-3004.
66. Wenzel E, Wightman F, Kistler D. Acoustic origins of individual differences in sound localization behavior. J Acoust Soc Am Suppl 1 1988; 84:S79.
67. Wenzel EM. Issues in the development of virtual acoustic environments. J Acoust Soc Am 1991; 92:2332.
68. Begault DR, Wenzel EM. Headphone localization of speech. Human Factors 1993; 35:361-376.
69. Wenzel EM, Arruda M, Kistler DJ et al. Localization using nonindividualized head-related transfer functions. J Acoust Soc Am 1993; 94:111-123.
70. Pralong D, Carlile S. The role of individualized headphone calibration for the generation of high fidelity virtual auditory space. Proc Australian Neurosci Soc 1995; 6:209.
71. Searle CL, Braida LD, Cuddy DR et al. Binaural pinna disparity: another auditory localization cue. J Acoust Soc Am 1975; 57:448-455.
72. Middlebrooks JC, Green DM. Sound localization by human listeners. Annu Rev Psychol 1991; 42:135-159.
73. Durlach N, Rigopulos A, Pang XD, Woods WS, Kulkarni A, Colburn HS, Wenzel EM. On the externalization of auditory images. Presence 1992; 1:251-257.
CHAPTER 5
AN IMPLEMENTATION OF VIRTUAL ACOUSTIC SPACE FOR NEUROPHYSIOLOGICAL STUDIES OF DIRECTIONAL HEARING
Richard A. Reale, Jiashu Chen, Joseph E. Hind and John F. Brugge
1. INTRODUCTION
Virtual Auditory Space: Generation and Applications, edited by Simon Carlile. © 1996 R.G. Landes Company.
Sound produced by a free-field source and recorded near the cat’s eardrum has been transformed by a direction-dependent ‘Free-field-to-Eardrum Transfer Function’ (FETF) or, in the parlance of human psychophysics, a ‘Head-Related-Transfer-Function’ (HRTF). We prefer to use here the term FETF since, for the cat at least, the function includes significant filtering by structures in addition to the head. This function preserves direction-dependent spectral features of the incident sound that, together with interaural time and interaural level differences, are believed to provide the important cues used by a listener in localizing the source of a sound in space. The set of FETFs representing acoustic space for one subject is referred to as a ‘Virtual Acoustic Space’ (VAS). This term applies because these functions can be used to synthesize accurate replications of the signals near the eardrums for any sound-source direction contained in the set.1-6 The combination of VAS and earphone delivery of synthesized signals is proving to be a powerful tool to study parametrically the mechanisms of directional hearing. This approach enables the experimenter to control dichotically with earphones each of the important acoustic cues resulting from a free-field sound source while obviating the physical problems associated with a moveable loudspeaker or an array of speakers. The number of FETFs required to represent most, if not all, of auditory space at a high spatial resolution is prohibitively large when a measurement of the acoustic waveform is made for each sound direction. This limitation has been overcome by the development of a mathematical model that calculates FETFs from a linear combination of separate functions of frequency and direction.7,8 The model is a low-dimensional representation of a subject’s VAS that provides for interpolation while maintaining a high degree of fidelity with empirically measured FETFs. The use of this realistic and quantitative model for VAS, coupled with an interactive high-speed computer graphical interface and a spectrally-compensated earphone delivery system, provides for simulation of an unlimited repertoire of sound sources and their positions and movements in virtual acoustic space, including many that could not be presented easily in the usual free-field laboratory. In this paper we summarize the techniques we have devised for creating a VAS, and illustrate how this approach can be used to study physiological mechanisms of directional hearing at the level of auditory cortex in experimental animals. Further details can be found in original papers on this subject.
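The idea of expressing FETFs as a linear combination of separate functions of frequency and direction can be illustrated with a generic low-rank factorization. The sketch below is not the model of references 7 and 8. It uses a truncated singular value decomposition as a stand-in, applied to a synthetic matrix that is constructed to have exactly five components, and every name and dimension in it is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_freq, n_dir, rank = 128, 200, 5   # hypothetical sizes

# Synthetic stand-in for a measured data set: one column per direction,
# one row per frequency bin, built to be exactly of rank 5.
fetf = rng.standard_normal((n_freq, rank)) @ rng.standard_normal((rank, n_dir))

# Truncated SVD: fetf[f, d] ~= sum_k basis[f, k] * weights[k, d]
U, s, Vt = np.linalg.svd(fetf, full_matrices=False)
basis = U[:, :rank] * s[:rank]      # functions of frequency only
weights = Vt[:rank, :]              # functions of direction only

approx = basis @ weights
rel_err = np.linalg.norm(fetf - approx) / np.linalg.norm(fetf)
print(f"stored values: {basis.size + weights.size} instead of {fetf.size}")
print(f"relative reconstruction error: {rel_err:.2e}")
```

In a representation of this kind, only the direction-dependent weights need to be interpolated to obtain a transfer function for an unmeasured direction, which is one reason a low-dimensional model is attractive.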
2. FETF IN THE CAT
2.1. FETF ESTIMATION FROM FREE-FIELD RECORDINGS
Calculation of the FETF requires two recordings from a free-field source, one near the animal’s eardrum for each source location, and the other in the free-field without the animal being present. The techniques for making these measurements in the cat involved varying the direction of a loudspeaker in a spherical coordinate system covering 360° in azimuth and 126° in elevation. Typically, a left- and right-ear recording was made for each of 1800 or more positions at which the loudspeaker was located.9 Figure 5.1 shows schematically the elements of a recording system used to derive an FETF. Because the cat hears sounds with frequencies at least as high as 40 kHz, a rectangular digital pulse (10 µs in width) is used as a broadband input test signal, d(n). The outputs of the system recorded near the eardrum, and in the absence of the animal, are designated y(n) and u(n), respectively. An empirical estimation of an FETF is determined simply by dividing the discrete Fourier transform (DFT) of y(n) by the DFT of u(n) for each sound-source direction.10 The validity of this empirical estimation depends upon having recordings with an adequate signal-to-noise ratio (SNR) in the relevant frequency band and a denominator without zeros in its spectrum. In our free-field data, both recordings had low SNR at high (> 30 kHz) and low (< 1.5 kHz) frequencies because the output of the loudspeaker diminished in these frequency ranges. Previously we alleviated this problem by subjectively post-processing recordings in the frequency domain to restrict signal bandwidth and circumvent the attendant problem of complex division.5 In practice, only a minority of the tens of hundreds of free-field measurements for a given subject will suffer from low SNR. Data from these problematic sample directions could always be appropriately filtered through individualized post hoc processing. We now employ, however, a more objective technique that accurately estimates FETFs without introducing artifacts into frequency regions where SNR is low. This technique employs finite-impulse-response (FIR) filters.11 Under ideal conditions, estimating the impulse response of the FETF, h(n), becomes the following deconvolution problem:
y(n) = d(n) ∗ h(n)    (Eq. 5.1)
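The empirical estimate described above, the ratio of the DFT of y(n) to the DFT of u(n), can be sketched as follows. This is a minimal illustration under idealized, noise-free conditions: the function name `empirical_fetf`, the impulse response and the random stand-in for the free-field recording are all hypothetical and are not taken from the measurements reported in this chapter.

```python
import numpy as np

def empirical_fetf(y, u, n_fft=None):
    """Empirical estimate: DFT of the eardrum recording y(n) divided by
    the DFT of the free-field recording u(n), bin by bin."""
    n_fft = n_fft or max(len(y), len(u))
    Y = np.fft.rfft(y, n_fft)
    U = np.fft.rfft(u, n_fft)
    return Y / U   # breaks down wherever |U| is zero or buried in noise

rng = np.random.default_rng(0)
h_true = np.array([1.0, 0.5, -0.3, 0.1])   # hypothetical impulse response
u = rng.standard_normal(64)                # stand-in for the free-field recording
y = np.convolve(u, h_true)                 # noise-free "eardrum" recording

# Use an FFT length covering the full linear convolution so that the
# spectral ratio is exact in the absence of noise.
n_fft = len(y)
H_est = empirical_fetf(y, u, n_fft)
H_true = np.fft.rfft(h_true, n_fft)
max_err = np.max(np.abs(H_est - H_true))
print(f"largest deviation from the true response: {max_err:.2e}")
```

The division is the weak point of this approach: any frequency bin in which the denominator is small relative to the noise produces a large and unreliable ratio, which is the difficulty with low-SNR bands noted in the text.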
In our current technique, h(n) is modeled as an FIR filter with coefficients determined objectively using a least-squares error criterion. The FIR filter is computed entirely in the time domain based on the principle of linear prediction. Thus, it redresses problems encountered previously with our subjective use of empirical estimation. Figure 5.2 compares directly the results obtained with the empirical and FIR filter estimation techniques. To simulate conditions of varying SNR for the purpose of this comparison, we added random noise of different amplitudes to a fixed pair of free-field recordings. For each SNR tested, the difference between the known FETF and that obtained by empirical DFT estimation or the FIR technique was expressed as percent relative error. At high SNR (59 dB) both techniques yield excellent estimates of the FETF. At low SNR (24 dB), however, the FIR method is clearly the superior of the two. Having derived a reliable set of FETFs, the question arises as to the relative merits of different modes of data display in facilitating visualization of the spatial distribution of these functions. Perhaps the simplest scheme is to show successive FETFs on the same plot. A typical sequence of four successive FETFs for 9° steps in elevation with azimuth fixed at 0° is shown in Figure 5.3A. The corresponding plot for four successive 9° steps in azimuth with elevation maintained at 0° is presented in part (D) of that figure. Such basic displays do provide information on the directional properties of the transformation but only a few discrete directions are represented, and it is difficult to visualize how these transformations fit into the total spatial domain. Figure 5.3B is a three-dimensional surface plot of FETFs for 15 values of elevation from -36° (bottom) to +90° (top) in steps of 9° and with azimuth fixed at 0°. This surface contains the four FETFs in (A) which are identified by arrows at the left edge of the 3-D surface.
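A time-domain least-squares estimate of FIR coefficients, of the general kind referred to above, can be sketched as follows. This is a generic formulation chosen for illustration and is not claimed to reproduce the authors' linear-prediction implementation. The function name `fir_least_squares`, the signals, the noise level and the filter length are all assumptions.

```python
import numpy as np

def fir_least_squares(x, y, n_taps):
    """Return FIR coefficients h that minimize ||y - x * h||^2, where *
    denotes convolution, working entirely in the time domain."""
    n = len(y)
    # Convolution matrix: column k holds the input delayed by k samples,
    # so that X @ h equals the first n samples of np.convolve(x, h).
    X = np.zeros((n, n_taps))
    for k in range(n_taps):
        m = min(len(x), n - k)
        X[k:k + m, k] = x[:m]
    h, *_ = np.linalg.lstsq(X, y, rcond=None)
    return h

rng = np.random.default_rng(1)
h_true = np.array([0.8, -0.4, 0.2, 0.1, -0.05])   # hypothetical response
x = rng.standard_normal(256)                      # stand-in input signal
y_clean = np.convolve(x, h_true)
noise = 0.05 * rng.standard_normal(len(y_clean))  # additive recording noise

h_est = fir_least_squares(x, y_clean + noise, n_taps=len(h_true))
rel_err = np.linalg.norm(h_est - h_true) / np.linalg.norm(h_true)
print(f"relative error of the estimated coefficients: {100 * rel_err:.2f}%")
```

Because the fit is constrained to a short filter and uses all samples at once, noise is averaged rather than amplified bin by bin, which is the qualitative behaviour one would expect to favour such a method over spectral division at low SNR.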
Fig. 5.1. Schematic diagram illustrating the factors that act upon an input signal d(n) to a loudspeaker and result in the recording of free-field signals u(n) and y(n). In the absence of the animal, the signal u(n) is recorded by a probe tube microphone with impulse response m(n). The acoustic delay term, f(n), represents the travel time from loudspeaker to probe tube microphone. With the animal present, the signal y(n) is recorded near the eardrum with the same microphone. In practice the acoustic delay with the cat present is made equal to f(n).
Fig. 5.2. FETF magnitude spectra derived from free-field recordings using the empirical estimation method (A, B) or the least-squares FIR filter method (C, D). The derivations are obtained at two (59 dB and 24 dB) signal-to-noise ratios (SNR). Comparison of Percent Relative Error (right panel) shows the FIR estimation is superior at low SNR.
In similar fashion, Figure 5.3E shows a 3-D surface plot consisting of FETFs for 15 values of azimuth, ranging from 0° (bottom) to -126° (top) in steps of 9° and with elevation fixed at 0°. This surface includes the four FETFs in (D) which are marked by arrows at the left of the surface. Surface plots facilitate appreciation of the interplay between frequency, azimuth and elevation in generating the fine structure of the FETFs and call attention to systematic shifts of some of the features with frequency and direction. However, none of the foregoing plots directly displays the angles of azimuth and elevation. One graphical method that does provide angular displays is a technique long used by engineers and physicists in studying the directional properties of devices such as radio antennas, loudspeakers and microphones, namely polar plots of “gain” vs angular direction. In such plots the magnitude of the gain, commonly in dB, is represented by the length of a radius vector whose angle, measured with respect to a reference direction, illustrates the direction in space for that particular measurement. Since the gain of a system is typically represented as a complex quantity, a second polar plot showing phase or other measure of timing may also prove informative. In the present application, the length of the radius vector is made proportional to the log magnitude of the FETF (in dB) for a specified frequency while the angle of the vector denotes either azimuth (at a specified elevation) or elevation (at a specified azimuth). If desired, a second set of plots can illustrate the variation with spatial direction of FETF phase or onset time difference. In Figure 5.3C the variation of FETF magnitude with elevation is plotted in polar format for four values of azimuth and a fixed frequency of 10.9 kHz. Similarly, Figure 5.3F shows the variation of FETF magnitude with azimuth for four values of elevation and the same frequency of 10.9 kHz.
While rectangular plots can depict the same information, the polar plot seems, intuitively, to assist in visualizing the basic directional attributes of a sound receiver, namely the angles that describe its azimuth and elevation.
Fig. 5.3. Three modes of graphical representation to illustrate variation in magnitude of the FETF with frequency (FREQ), azimuth (AZ) and elevation (EL). All data are for the left ear. AZ specifies direction in the horizontal plane with AZ = 0° directly in front and AZ = 180° directly behind the cat. Elevation specifies direction in the vertical plane with EL = 0° coincident with the interaural axis and EL = 90° directly above the cat. (A) FETF magnitude vs FREQ for four values of EL with AZ = 0°. (B) Three-dimensional surface plot of FETF for 15 values of EL from +90° (top) to -36° (bottom) in steps of 9° and AZ = 0°. This surface includes the four FETF functions in (A) that are identified by arrows at left edge of surface. (C) Polar plot of FETF vs EL with fixed FREQ of 10.9 kHz and for four values of AZ: 0°, -9°, -18° and -27°. Magnitude of FETF in dB is plotted along radius vector. Note that EL varies from +90° (directly overhead) to -36°. (D) FETF magnitude vs FREQ for four values of AZ with EL = 0°. (E) Same as (B) except surface represents 15 values of AZ from -126° (top) to 0° (bottom) in steps of 9° and EL = 0°. This surface includes the four FETFs in (D) which are identified by arrows at left edge of surface. (F) Polar plot of FETF vs AZ with fixed FREQ of 10.9 kHz and for four values of EL: -18°, -9°, 0° and +9°.
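As a small illustration of the polar format, the sketch below converts pairs of direction and gain in dB into the Cartesian end points of the radius vectors. Because a gain in dB can be negative while a radius cannot, the sketch measures the radius from a floor value; that floor (`floor_db`) and the function name are assumptions introduced here, not details taken from the chapter.

```python
import math

def polar_points(angles_deg, gains_db, floor_db=-40.0):
    """Return (x, y) end points of radius vectors whose length is the gain above floor_db."""
    points = []
    for angle, gain in zip(angles_deg, gains_db):
        radius = max(gain - floor_db, 0.0)       # gains at or below the floor collapse to the origin
        theta = math.radians(angle)              # direction measured from the reference axis
        points.append((radius * math.cos(theta), radius * math.sin(theta)))
    return points

# Example: a 0 dB gain straight ahead and a gain at the floor at 90 degrees.
example = polar_points([0.0, 90.0], [0.0, -40.0])
```

The same helper can serve for azimuth at a fixed elevation or for elevation at a fixed azimuth, since only the meaning of the angle changes.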
3. EARPHONE DELIVERY OF VAS STIMULI
The success of the VAS technique also depends upon the design of an earphone sound delivery system that can maintain the fidelity of VAS-synthesized waveforms. Typically, a sealed earphone delivery and measurement system introduces undesirable spectral transformations into the stimulus presented to the ear. To help overcome this problem, we use a specially designed insert earphone for the cat that incorporates a condenser microphone with attached probe tube for sound measurements near the eardrum.4 The frequency response of the insert earphone system measured in vivo is characterized by a relatively flat response (typically less than ±15 dB from 312 to 36000 Hz), with no sharp excursions in that frequency range. Ideally, the frequency response of both sound systems would be characterized by a flat magnitude and a linear phase spectrum. In practice, neither the earphone nor the measuring probe microphone used in our studies has such ideal characteristics. Thus, we currently employ least-squares FIR filters to compensate for these nonideal transducers. These filters are designed in the same manner as those employed for FETF estimation because they are solutions to essentially the same problem; namely, minimizing the error between a transforming system’s output and a reference signal.11

In order to judge the suitability of least-squares FIR filters for compensating our insert earphone system, comparisons were made between VAS-synthesized signals and the waveforms actually produced by the earphone sound-delivery system. Figure 5.4 illustrates data from one cat that are representative of signal fidelity under conditions of earphone stimulus delivery. The upper panel presents a sample VAS signal for a particular sound-source direction that would be recorded near the cat’s eardrum in the free field. The middle panel shows the same signal as delivered by the earphone, compensated by the least-squares FIR filter technique. To the eye they appear nearly identical; the correlation coefficient is greater than 0.99. The same comparison was made for 3632 directions in this cat’s VAS and the distribution of correlation coefficients is displayed as a histogram (lower panel). The agreement is seen to be very good, indicating accurate compensation of these wideband stimuli throughout the cat’s VAS. As a rule, our FIR filters perform admirably, with comparisons typically resulting in correlation coefficients exceeding 0.99.
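A compensation filter of this kind can be sketched as a least-squares inverse: find the FIR filter that, when cascaded with the transducer, best approximates a reference signal. The sketch below assumes the reference is a unit impulse delayed by a few samples and uses a toy three-tap earphone response; the delay, the tap count, the toy response and the function names are all hypothetical, and the authors' actual design procedure may differ.

```python
import numpy as np

def compensation_filter(g, n_taps, delay):
    """Least-squares FIR c such that g convolved with c approximates a delayed unit impulse."""
    g = np.asarray(g, dtype=float)
    n_out = len(g) + n_taps - 1
    G = np.zeros((n_out, n_taps))
    for k in range(n_taps):
        G[k:k + len(g), k] = g           # column k is g shifted by k samples
    target = np.zeros(n_out)
    target[delay] = 1.0                  # reference signal: impulse delayed by `delay` samples
    c, *_ = np.linalg.lstsq(G, target, rcond=None)
    return c

def correlation_coefficient(a, b):
    """Pearson correlation between two equal-length waveforms."""
    return float(np.corrcoef(a, b)[0, 1])

g = np.array([1.0, 0.6, 0.2])            # toy earphone impulse response (illustrative only)
delay = 8
c = compensation_filter(g, n_taps=64, delay=delay)

rng = np.random.default_rng(1)
desired = rng.standard_normal(500)       # stand-in for a VAS-synthesized waveform
delivered = np.convolve(np.convolve(desired, c), g)   # compensated signal passed through the earphone
aligned = delivered[delay:delay + len(desired)]       # remove the design delay before comparing
r = correlation_coefficient(desired, aligned)
```

Comparing `desired` with `aligned` mirrors the waveform comparison summarized by the correlation coefficients in Figure 5.4, although here both signals are simulated.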
4. MATHEMATICAL MODEL OF VAS
4.1. SPATIAL FEATURE EXTRACTION AND REGULARIZATION (SFER) MODEL FOR FETFS IN THE FREQUENCY DOMAIN
Empirically measured FETFs represent discrete directions in the acoustic space for any individual.9 There is, however, no a priori method for interpolation of FETFs at intermediate directions. Therefore, in the absence of a parametric model, the number of discrete measurements increases in direct proportion to the desired resolution of the sampled space. For example, defining the space surrounding the cat at a resolution of 4.5° requires approximately 3200 unique directions for each ear.
Fig. 5.4. Top panel: Time waveform of an ear canal signal recorded by probe tube microphone. Middle panel: Time waveform of corresponding signal recorded in an ear canal simulator when delivered through insert earphone system that was compensated by a least-squares FIR filter. Lower panel: Distribution of correlation coefficients in one cat for 3632 directions obtained between the time waveform of free-field signal and the corresponding waveform delivered through compensated insert earphone system.
This situation is undesirable both in terms of the time required for complete data collection and because physical constraints preclude measurements at so many required directions. To address these limitations, a functional model of the FETF was devised and validated based on empirical measurements from a cat and a KEMAR model.7,8 Implementation of this parametric model means that FETFs can be synthesized for any direction and at any resolution in the sample space. Such a need arises, for example, in the simulation of moving sound sources or reverberant environments. Additionally, the use of a functional model should aid in the analysis of a multidimensional acoustical FETF database, especially when cross-subject or cross-species differences are studied.

In this model, a Karhunen-Loeve expansion is used to represent each FETF as a weighted sum of eigen functions. These eigen functions are obtained by applying eigen decomposition to the data covariance matrix that is formed from a limited number (several hundreds) of empirically measured FETFs. Only a small number (a few dozen) of the eigen functions are theoretically necessary to account for at least 99.9% of the covariance in the empirically measured FETFs. Therefore, the expansion has resulted in a low-dimensional representation of FETF space; the eigen functions are termed eigen transfer functions (ETF) because they are functions only of frequency. An FETF for any given spatial direction is synthesized as a weighted sum of the ETFs. Importantly, the weights are functions of spatial direction (azimuth and elevation) only and, thus, are termed spatial characteristic functions (SCF). Sample SCFs, with data points restricted to coordinates at the measured azimuth and elevations, are obtained by back-projecting the empirically measured FETFs onto the ETFs. Spatially continuous SCFs are then obtained by fitting these SCF samples with two-dimensional splines. The fitting process is often termed ‘regularization’. The complete paradigm is therefore termed a ‘spatial feature extraction and regularization’ (SFER) model.

Detailed acoustical validation of the SFER model was performed on one cat in which 593 measured FETFs were used to construct the covariance matrix. Twenty of the ETFs were found to be significant.8 The model is accurate: comparison of more than 1800 synthesized vs empirically measured FETFs indicates nearly identical functions. Errors between modeled and measured FETFs are generally less than one percent for any direction in which the measured data have adequate signal-to-noise characteristics. Importantly, the model’s performance is similarly high when locations are interpolated between the 593 input directions; these modeled FETFs are compared with measured data that were not used to form the covariance matrix.
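The regularization step, turning SCF samples at measured directions into a function that can be evaluated anywhere, can be illustrated with a deliberately simplified stand-in. The sketch below does not use two-dimensional splines; it swaps in Gaussian-kernel ridge regression, treats azimuth and elevation as flat planar coordinates (ignoring spherical geometry and wrap-around), and uses a made-up smooth surface in place of real SCF samples. The kernel width, the smoothing constant and the function names are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, width_deg):
    """Kernel matrix between two sets of (azimuth, elevation) points given in degrees."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * width_deg ** 2))

def fit_scf(directions, samples, width_deg=15.0, smoothing=1e-3):
    """Fit a smooth surface to discrete SCF samples; returns a function of direction."""
    directions = np.asarray(directions, dtype=float)
    K = gaussian_kernel(directions, directions, width_deg)
    # The smoothing term trades exact interpolation for a smoother, better-conditioned fit.
    coeffs = np.linalg.solve(K + smoothing * np.eye(len(directions)), samples)

    def scf(query):
        query = np.atleast_2d(np.asarray(query, dtype=float))
        return gaussian_kernel(query, directions, width_deg) @ coeffs

    return scf

# Toy SCF samples on a 9-degree grid (a made-up smooth surface, not measured data).
az = np.arange(-45, 46, 9)
el = np.arange(-36, 37, 9)
grid = np.array([(a, e) for a in az for e in el], dtype=float)
samples = np.cos(np.radians(grid[:, 0])) * np.cos(np.radians(grid[:, 1]))

scf = fit_scf(grid, samples)
value_between = scf([4.5, 4.5])[0]       # direction midway between measured grid points
```

The point of the sketch is only the workflow: fit once to the measured directions, then evaluate the resulting function at any intermediate direction.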
4.2. TIME DOMAIN AND BINAURAL EXTENSION TO THE SFER MODEL
The SFER model for complex-valued FETFs was originally designed as a monaural representation, and was implemented entirely in the frequency domain. In order to study the neural mechanisms of binaural directional hearing using an on-line interactive VAS paradigm, the SFER model was implemented in the time domain and extended to provide binaural information. Figure 5.5 shows the components of the complete model in block-diagram form. There are several advantages to these extensions. First, an interaural time difference is a natural consequence of the time-domain approach; this parameter was lacking in the previous frequency-domain model. Second, the time domain does not involve complex-number calculations. This becomes especially advantageous when synthesizing moving sound sources or reverberant environments on-line in response to changing experimental conditions. Lastly, most digital signal processors are optimized for doing filtering in the time domain, viz. convolution. Eventually, if a real-time implementation of VAS is achieved using these processors, a time-domain approach is clearly preferable.
Fig. 5.5. Block diagram illustrating how an input set of measured FETF impulse responses from one cat are processed by the binaural SFER model in the time domain to yield a left- and right-ear impulse response for any direction on the partially sampled sphere. See text for description of model components.
4.2.1. Data Covariance Matrix
The SFER model produces a lower dimensional representation of the FETF space (see section 4.1) by applying an eigen decomposition to the covariance matrix of the input set of FETFs. In the time domain approach, the input data for each modeled cat consist of the impulse responses from a subset (number chosen = P) of measured FETFs obtained at 9° increments on a spherical coordinate system spanning 360° in azimuth and 126° in elevation, and centered on the interaural axis. The impulse response ($h_j$) of a measured FETF at each (j = 1, 2, ..., P) direction is estimated by the least-squares FIR method (see section 2.1) using the free-field recordings y(n) and u(n) shown in Figure 5.1. The FETF impulse response covariance matrix (R(h)) is then defined as:
\[
R(h) = \sum_{j=1}^{P} \left[\Lambda_j h_j - e_0\right] \times \left[\Lambda_j h_j - e_0\right]^{T} \tag{5.2}
\]
where the operation denoted by $[\;]^{T}$ is the matrix transpose, and
\[
e_0 = \frac{1}{P} \sum_{j=1}^{P} \Lambda_j \cdot h_j
\]
and
\[
\Lambda_j = 1 - \sin(\mathrm{EL}_j)
\]
Here, $e_0$ is the average value of the P measured FETF impulse responses, and $\Lambda_j$ is a simple function of elevation at that direction used to compensate for the different separations (measured in arc length) between measured FETFs obtained at different elevations on the sphere.
4.2.2. Eigen impulse response (EIR)
In actual use, a measured FETF impulse response is a 256-point vector and the covariance matrix is, therefore, real and symmetric with dimensions 256-by-256. Therefore, the output of the eigen decomposition defined by Eq. 5.2 is an eigen matrix of dimension 256-by-256 whose columns are 256-point eigen vectors. These eigen vectors constitute the new basis functions that represent the FETF impulse-response space. Accordingly, an eigen vector is termed an eigen impulse response (EIR) and is analogous to the eigen transfer function (ETF) notation employed in the frequency domain. Figure 5.6 shows the first- through fourth-order EIRs and their corresponding ETFs for one representative cat. In the frequency domain, the features of ETFs resemble the major characteristics that define measured FETFs (see section 2.1). The relative importance of an EIR is related to its eigen value because the eigen value represents the energy of all measured FETF impulse responses projected onto that particular EIR basis vector. The sum of all 256 eigen values will indicate that 100% of the energy in the measured FETF impulse response space is accounted for by the 256 EIRs.
Fig. 5.6. The first- through fourth-order EIR and their corresponding ETF for one representative cat using the SFER model. Note the general similarity between ETFs and measured FETFs.
The number (M) of EIRs (i.e., the dimensionality) is dramatically lower if only somewhat less than 100% of the energy is to be represented by the expansion. The mean-squared-error (MSE) associated with use of M EIRs to represent the P measured FETF impulse responses is given by:
\[
\mathrm{MSE} = \sum_{i=M+1}^{256} \lambda_i \tag{5.3}
\]
where the eigen values have been arranged in descending order, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{256}$. A value of M = 20 was obtained for each of three modeled cats when the MSE was set to 0.1%. Thus, the SFER model with a dimension of only 20 EIRs will represent a subspace accounting for about 99.9% of the energy in all the measured FETF impulse responses for each representative cat. In terms of the model, any of the P measured FETF impulse responses (h) is synthesized from the EIRs by the following relationship:
\[
h(j) = \sum_{i=1}^{20} \hat{w}_i \cdot \mathrm{EIR}_i + e_0 \tag{5.4}
\]
Here j is the index that chooses any one of the P spatial directions in the input measurement set, and $\hat{w}_i$, a weighting function, is the projection of h(j) onto the i-th EIR. Formally:
\[
w_i = h(j) \cdot \left[\mathrm{EIR}_i\right]^{T}
\]
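The chain from Eq. 5.2 to Eq. 5.4 can be traced numerically with a minimal sketch on synthetic data. Several choices here are assumptions rather than statements of the authors' procedure: the impulse responses are random low-rank vectors of 64 points instead of 256-point measurements, the criterion "MSE set to 0.1%" is read as the discarded eigen values summing to at most 0.1% of the total, and the projection is applied to the weighted, mean-removed vectors (Λ_j h_j − e_0) that enter Eq. 5.2. All function and variable names are hypothetical.

```python
import numpy as np

def eir_decomposition(h, el_deg):
    """h: (P, N) impulse responses, one row per direction; el_deg: (P,) elevations in degrees."""
    lam = 1.0 - np.sin(np.radians(np.asarray(el_deg, dtype=float)))  # Lambda_j as printed
    weighted = lam[:, None] * h                  # Lambda_j * h_j
    e0 = weighted.mean(axis=0)                   # e_0 = (1/P) * sum_j Lambda_j * h_j
    x = weighted - e0                            # Lambda_j * h_j - e_0
    R = x.T @ x                                  # Eq. 5.2 written as a sum of outer products
    evals, evecs = np.linalg.eigh(R)             # eigh returns ascending eigen values
    order = np.argsort(evals)[::-1]              # reorder to descending, as in Eq. 5.3
    return np.clip(evals[order], 0.0, None), evecs[:, order], e0, x

def choose_dimension(evals, max_fraction=0.001):
    """Smallest M whose discarded eigen values sum to at most max_fraction of the total."""
    residual = evals.sum() - np.cumsum(evals)    # residual[M-1] = sum of eigen values beyond M
    return int(np.argmax(residual <= max_fraction * evals.sum())) + 1

# Synthetic stand-in for a measured data set (illustrative values only).
rng = np.random.default_rng(2)
P, N, rank = 210, 64, 5
el = np.tile(np.arange(-36, 91, 9), P // 15)     # 15 elevations from -36 to +90 degrees
basis = rng.standard_normal((rank, N))
h = rng.standard_normal((P, rank)) @ basis + 1e-3 * rng.standard_normal((P, N))

evals, eirs, e0, x = eir_decomposition(h, el)
M = choose_dimension(evals)
w = x @ eirs[:, :M]                  # discrete weights: projections onto the first M EIRs
x_hat = w @ eirs[:, :M].T            # M-term expansion of the weighted, mean-removed responses
squared_error = np.sum((x - x_hat) ** 2)
```

With the covariance matrix formed as an unnormalized sum, the total squared error of the M-term expansion equals the sum of the discarded eigen values, which is the relationship expressed by Eq. 5.3.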
Equation 5.4 is often termed a Karhunen-Loeve expansion. It is important to realize that this expansion effects a separation of time coordinates from spatial coordinates. That is, the elements of the $\mathrm{EIR}_i$ column vectors correspond only to the time dimension, while the $w_i$ are functions only of spatial coordinates (i.e., azimuth and elevation).
4.2.3. Spatial feature extraction and regularization
The weighting functions $w_i$ are termed spatial characteristic functions (SCF), since by virtue of Equation 5.4 they are functions only of azimuth and elevation. These SCFs are defined only for the discrete coordinate-pairs in the input measurement set. A continuous function can be determined by applying regression to these discrete samples, in a framework of regularization.12 In our application, a software package, RK-pack, is used to regularize the discrete spatial samples into SCFs that are continuous functions, $\hat{w}_i$, of spatial coordinates.13 The first- through fourth-order SCFs from one representative modeled cat are shown as mesh plots in Figure 5.7. Previous work indicated that features of SCFs of different order are closely related to the head and external ear geometry of cats and humans.7,8 For example, the major peak in the first-order SCF is related to the direction of the pinna opening in this modeled cat. More detailed comparisons and speculations concerning the use of SCFs to describe the physical characteristics of the external ear are presented elsewhere.7 In our current implementation, only 20 EIRs and their corresponding SCFs are necessary to simulate free-field signals with a high degree of fidelity for the cat. Any direction, d, on the partially sampled sphere, including those intermediate to the original P measurement locations, can be represented by the impulse response,
\[
h(d) = \sum_{i=1}^{20} \hat{w}_i(d) \cdot \mathrm{EIR}_i + e_0 \tag{5.5}
\]
Thus, the SFER model reduces to a particularly simple implementation that is a linear combination of EIRs weighted by SCFs.
Fig. 5.7. Mesh plots of the first- through fourth-order regularized SCF for the same cat whose EIR are shown in Figure 5.6. All plots are magnitude only.
4.2.4. Regularization of ITD samples and binaural output
In order to extend the SFER model for binaural studies, ITD information needs to be extracted from the free-field data at the P measurement sites used as input for each modeled cat. These data consist of the free-field recordings y(n) and u(n), shown in Figure 5.1, from which a position dependent time delay can be calculated by comparing the signal onset of y(n) with the signal onset of u(n). The onset of a recording is defined as the 10% height of the first rising edge of the signal with respect to its global absolute maximum. In practice, this monaural delay-versus-location function is estimated for only one ear. The delay function for the opposite ear is then derived by assuming a symmetrical head. The ITD function for these discrete locations is simply the algebraic difference, using the direction (AZ = 0°, EL = 0°) as the zero reference. An ITD function, continuous in azimuth and elevation, can be derived for the partial sphere by applying the same regularization paradigm as used to regularize SCFs (see section 4.2.3). Figure 5.5 depicts the ITD extraction and regularization modules and their incorporation to produce a VAS signal appropriate for the left and right ear.
4.2.5. Acoustic validation of the implementation
The time domain binaural SFER model for VAS was validated by comparing the model’s output, as either an FETF or its impulse response, to the empirically-measured data for each modeled cat. In general, two comparisons are made: (1) as a fidelity check, at the P measurement locations that were used to determine both EIRs and SCFs; and (2) as a predictability or interpolation check, at a large number of other locations that were not used to determine the model parameters. The degree to which the model reproduced the measurement was similarly exact regardless of the comparison employed. Figure 5.8 depicts a representative comparison between data for a modeled and measured direction, both in the time- and frequency-domains. The error is generally so small that it cannot be detected by visual inspection. Therefore, a quantitative comparison is made by calculating the percent mean-squared error between the modeled and the measured data for all directions in the comparison. Figure 5.8 illustrates this error distribution for one modeled cat. The shape and mean (0.94%) of the distribution are comparable to the results obtained from the frequency domain SFER model.
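The onset rule of section 4.2.4 and the error measure of section 4.2.5 can be sketched as follows. The sketch makes several assumptions: the onset is taken as the first sample whose absolute value reaches 10% of the global absolute maximum, the opposite-ear delay at (AZ, EL) is taken to be the measured-ear delay at (-AZ, EL), the ITD is signed as measured ear minus opposite ear, and the percent mean-squared error is computed as 100 times the summed squared difference divided by the summed squared measurement. None of these conventions, nor the function names, is spelled out in the text.

```python
import numpy as np

def onset_index(x, fraction=0.1):
    """First sample at which |x| reaches `fraction` of the global absolute maximum."""
    magnitude = np.abs(np.asarray(x, dtype=float))
    return int(np.argmax(magnitude >= fraction * magnitude.max()))

def monaural_delay(y, u, fs):
    """Delay in seconds of the onset of y(n) relative to the onset of u(n)."""
    return (onset_index(y) - onset_index(u)) / fs

def itd_from_one_ear(delay):
    """delay: {(az, el): seconds} for one ear. Mirror in azimuth to stand in for the other ear."""
    itd = {}
    for (az, el), d in delay.items():
        mirrored = (-az, el)                      # symmetrical-head assumption
        if mirrored in delay:
            itd[(az, el)] = d - delay[mirrored]
    reference = itd.get((0, 0), 0.0)              # (AZ = 0, EL = 0) is the zero reference
    return {direction: value - reference for direction, value in itd.items()}

def percent_mse(modeled, measured):
    """Assumed definition: 100 * sum((modeled - measured)^2) / sum(measured^2)."""
    modeled = np.asarray(modeled, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return 100.0 * np.sum((modeled - measured) ** 2) / np.sum(measured ** 2)

# Toy recordings (illustrative only): u has its onset at sample 20, y at sample 26.
fs = 100000.0
u = np.zeros(64)
u[20] = 1.0
y = np.zeros(64)
y[25:28] = [0.05, 0.5, 1.0]
delay_seconds = monaural_delay(y, u, fs)

toy_delays = {(-9, 0): 1.0e-4, (0, 0): 1.5e-4, (9, 0): 2.0e-4}
toy_itd = itd_from_one_ear(toy_delays)
```

In the toy data, the sample at index 25 falls below the 10% threshold, so the onset of y is placed at index 26; this shows how the rule ignores low-level activity that precedes the main rising edge.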
4.3. QUASI-REAL-TIME OPERATION
The implementation of a mathematical representation (SFER model) for VAS is computationally demanding since even with a low-dimensional model dozens of EIRs and their weighting functions (SCFs) must be combined to compute the FETF for every sound-source direction. To make matters worse, we have chosen to minimize the number of off-line calculations that provide static initialization data to the model, and instead require that the software calculate direction-dependent stimuli from unprocessed input variables. This computation provides greater flexibility of stimulus control during an actual experiment by permitting a large number of stimulus parameters to be changed interactively to best suit the current experimental conditions. The resolution of the VAS is one major parameter that is not fixed by an initialization procedure, but is computed ‘on-demand’, and it can be set as small as 0.1 degree. The transfer characteristics of the earphone sound delivery system may have to be measured at several times during the course of a long experiment to ensure stability. Therefore, earphone compensation is not statically linked to VAS stimulus synthesis, but rather is accomplished dynamically as part of the ‘on-demand’ computation.
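The ‘on-demand’ idea can be pictured as a single function that is called for each requested direction and that receives the current earphone compensation filter as an argument rather than having it baked in. The sketch below is only a schematic of that design choice: the function name, the argument layout and the toy values are hypothetical, and the SCFs are represented by an arbitrary callable standing in for the regularized functions.

```python
import numpy as np

def synthesize_on_demand(stimulus, az, el, eirs, e0, scf, compensation):
    """Compute one ear's earphone signal for one direction at request time.

    eirs: (N, M) array whose columns are EIRs; e0: (N,) average term;
    scf: callable (az, el) -> M weights; compensation: current earphone FIR filter.
    """
    weights = np.asarray(scf(az, el), dtype=float)
    h = eirs @ weights + e0                        # Eq. 5.5 for the requested direction
    ear_signal = np.convolve(stimulus, h)          # signal as it would occur near the eardrum
    return np.convolve(ear_signal, compensation)   # apply the most recent earphone correction

# Toy call (illustrative values only).
toy_eirs = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
toy_e0 = np.zeros(4)
toy_scf = lambda az, el: [2.0, 3.0]
output = synthesize_on_demand([1.0, 0.0], 9.0, 0.0, toy_eirs, toy_e0, toy_scf, [1.0])
```

Passing `compensation` on every call means that a filter re-measured partway through an experiment takes effect on the next stimulus without recomputing anything else.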
Fig. 5.8. Comparison between modeled and measured impulse response of FETF. Left column: SFER model of impulse response and corresponding FETF for a representative sound-source direction. Middle column: impulse response and FETF measured at the corresponding direction using the techniques discussed in section 2. Right column: Distribution of Percent Relative Error between modeled and measured impulse responses for 3632 directions in one cat.
Fig. 5.9. Block diagram illustrating the components and interactions employed to implement a VAS suitable to the on-line demands of neurophysiological experiments in the cat.
Fortunately, in recent years the processing power of desktop workstations and personal computers has increased dramatically while the cost of these instruments has actually dropped. We employ a combination of laboratory and desktop workstations as independent components that are connected over Ethernet hardware (Fig. 5.9). We refer to this combination as providing a ‘quasi-real-time’ implementation because the computation and delivery of a VAS-synthesized stimulus is not instantaneous but rather occurs as distinctly separate events that require a fixed duration for completion. The interaction with the end user has been handled by a window-based user interface on a MicroVax-II/GPX computer. This VAS interface program controls stimulus setup and delivery, data acquisition, and graphical display of results. Upon receipt of setup variables, VAS stimuli are synthesized on a VAX-3000 Alpha workstation and the resulting stimulus waveforms transferred back to program-control memory for subsequent earphone delivery. All parameters of stimulus delivery, together with stimulus waveforms for the Left- and Right-Ear channels, are passed to a Digital Stimulus System that produces independent signals through 16-bit D/A converters and provides synchronization pulses for data acquisition procedures initiated by the VAS control program.14,15
5. RESPONSES OF SINGLE CORTICAL NEURONS TO DIRECTIONAL SOUNDS
5.1. VIRTUAL SPACE RECEPTIVE FIELD (VSRF)
Virtual acoustic space is a propitious technique for studying, in a systematic way, a neuron’s sensitivity to the direction of simulated free-field sounds.5,16 The synthesized right- and left-ear stimulus waveforms for each direction surrounding an individual will mimic all the significant acoustic features contained in sound from a free-field source. The deterministic nature of virtual space means that any arbitrary signal can be transformed for any selected combination and sequence of sound-source directions. Furthermore, when virtual space stimuli are delivered through compensated earphones, the traditional advantages of dichotic delivery become accessible. In order to achieve our principal aim of understanding cortical mechanisms of directional hearing, we have been exploring in detail the sensitivity of auditory cortical neurons to the important and independently-controllable parameters involved in detecting sound-source direction: intensity, timing, and spectrum. Here we present examples of data from these studies to illustrate how the advantages of the VAS approach may be brought to bear on studies of mechanisms of directional hearing.

For our studies, the direction of a VAS stimulus is referenced to an imaginary sphere centered on the cat’s head (Fig. 5.10). The sphere is bisected into front and rear hemispheres in order to show the relationship to a virtual space receptive field (VSRF) that is plotted on a representation of the interior surface of the sphere. The VSRF is composed of those directions in the sampled space from which a sound evokes a response from the cell. In order to evaluate the extent of a VSRF, a stimulus set consisting of approximately 1650 regularly spaced spherical directions covering 360° in azimuth (horizontal direction) and 126° in elevation (vertical direction) is generally employed. The maximum level of the stimulus set is fixed for each individual cat according to the earphone acoustic calibration. Intensity is then expressed as decibels of attenuation (dB ATTN) from this maximum and can be varied over a range of 127 dB. Usually each virtual space direction in the set is tested once, using a random order of selection. About 12 minutes is required to present the entire virtual space stimulus set at one intensity level, and to collect and display single-neuron data on-line.
Fig. 5.10. Coordinates of virtual acoustic space and their relationship to position of the cat. (A) The direction of a sound source is referenced to a spherical coordinate system with the animal’s interaural axis centered within an imaginary sphere. (B) The sphere has been bisected into a FRONT and REAR hemisphere and the REAR hemisphere rotated 180° as if hinged to the FRONT. (C) Experimental data representing the spatial receptive field of a single neuron are plotted on orthographic projections of the interior surface of the imaginary sphere.
Typically, cortical neurons respond to a sound from any effective direction with a burst of 1-4 action potentials time-locked to stimulus onset. The representative VSRF shown in Figure 5.10C depicts the directions and surface area (interpolating to the nearest sampled locus) at which a burst was elicited from a single cell located in the left primary (AI) auditory cortex. Thus, with respect to the midline of the cat, the cell responded almost exclusively to directions in the contralateral (right side of midline) hemifield. Using this sampling paradigm, the majority of cells (~60%) recorded in the cat’s primary (AI) auditory field showed this contralateral preference.5 Smaller populations of neurons showed a preference for the ipsilateral hemifield (~10%), or the frontal hemisphere (~7%). Some receptive fields showed little or no evidence for direction selectivity, and are termed omnidirectional (~16%). The remainder exhibited ‘complex’ VSRFs.
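The sampling paradigm can be summarized in a few lines of code: present every direction once in random order, record the spike count, and keep the directions that evoked a response. The sketch below uses a plain azimuth-by-elevation grid at 9° steps and a toy neuron, not the approximately 1650-direction stimulus set or real recordings; the sign convention (positive azimuth to the right of the midline), the function names and the hemifield summary are assumptions, and the chapter does not state the criteria used to assign cells to the categories above.

```python
import random

def measure_vsrf(directions, spike_count, seed=0):
    """Test each direction once in random order; return the set of effective directions."""
    order = list(directions)
    random.Random(seed).shuffle(order)            # random order of selection
    counts = {direction: spike_count(direction) for direction in order}
    return {direction for direction, n in counts.items() if n > 0}

def right_hemifield_fraction(vsrf):
    """Fraction of effective directions lying to the right of the midline (0 < AZ < 180)."""
    if not vsrf:
        return 0.0
    right = sum(1 for az, _ in vsrf if 0 < az < 180)
    return right / len(vsrf)

# Illustrative grid: 360 degrees of azimuth and 126 degrees of elevation in 9-degree steps.
grid = [(az, el) for az in range(-180, 180, 9) for el in range(-36, 91, 9)]

def toy_neuron(direction):
    """Hypothetical cell that fires a 2-spike burst only for directions right of the midline."""
    az, _ = direction
    return 2 if 0 < az < 180 else 0

vsrf = measure_vsrf(grid, toy_neuron)
fraction_right = right_hemifield_fraction(vsrf)
```

Because the VSRF is stored as an all-or-none set, a direction near threshold either belongs to the field or does not, which is worth keeping in mind when comparing receptive fields across intensity levels.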
5.2. RELATIONSHIPS OF INTENSITY TO THE VSRF
Receptive field properties at any intensity level are influenced by both monaural and binaural cues. Which cue dominates in determining the size, shape and location of the VSRF depends, in part, on the overall stimulus intensity. Generally, at threshold intensity the VSRF is made up of relatively few effective sound source directions, and these typically fall on or near the acoustic axis, above the interaural plane and either to the left or to the right of the midline. Raising intensity by 10 to 20 dB above threshold invariably results in new effective directions being recruited into the receptive field. Further increases in intensity result in further expansion of the VSRF in some cells but not in others. This is illustrated by the data of Figures 5.11 and 5.12, collected from two cells that had similar characteristic frequencies and similar thresholds to the same virtual space stimulus set. In analyzing these data, we took advantage of the fact that with the VAS paradigm it is possible to deliver stimuli to each ear independently. In both cases, presenting VAS signals to the contralateral ear alone at an intensity (70 dB ATTN) within 10-20 dB of threshold resulted in VSRFs confined largely to the upper part of the contralateral frontal quadrant of VAS, on or near the acoustic axis. Response to stimulation of the ipsilateral ear alone was either very weak or nonexistent (not shown) in both cases. When stimulus intensity was raised by 20 dB (to 50 dB ATTN), the responses of both neurons spread to include essentially all of VAS. These spatial response patterns could only have been formed using the direction-dependent spectral features associated with the contralateral ear. Under binaural listening conditions the size, shape and locations of VSRFs obtained at 70 dB ATTN were essentially the same as those observed under monaural conditions at that intensity. This is interpreted to mean that at low intensity the receptive field was dominated by the input from sounds in contralateral space. At 50 dB ATTN, however, the situation was quite different for the two neurons. For the neuron illustrated in Figure 5.12 responses spread, but remained concentrated in contralateral acoustic space. We interpret this to mean that sound coming from ipsilateral (left) frontal directions suppressed the discharge of the cell, since at 50 dB ATTN the stimuli from these directions were quite capable of exciting the neuron if the ipsilateral ear was not engaged. For the cell illustrated in Figure 5.11 there was no such restriction in the VSRF at 50 dB ATTN; the sound in the ipsilateral hemifield had little or no influence on the cell’s discharge at any direction.

Taking advantage of our earphone-delivery system, it was possible to probe the mechanism(s) that underlie the formation of these VSRFs by studying quantitatively responses to changing parameters of monaural and binaural stimuli. Spike count-vs-intensity functions obtained under tone-burst conditions (lower left panels) showed that both neurons responded to high-frequency sounds; characteristic frequencies were 15 and 15.5 kHz, respectively. Because interaural intensity difference (IID) is known to operate as a sound-localization cue at high frequency, we proceeded to study this parameter in isolation as well (lower right panels). For the neuron in Figure 5.12, the spike count-vs-IID function shows a steep slope for IIDs between -20 and +30 dB, which is the range of IID achieved by a normal cat at this frequency. These data provide good evidence that the strong inhibitory input from the ipsilateral ear revealed by these functions accounts for the restriction of this cell’s VSRF to the contralateral hemifield. The spike count-vs-IID function illustrated in Figure 5.11 suggests why the VSRF of this neuron was not restricted to the contralateral hemifield: ipsilateral inhibition was engaged only at IIDs greater than 30 dB, which is beyond the range of IIDs available to the cat at these high frequencies. Thus, the VSRF for this cell was dominated by excitation evoked by sounds arriving from all virtual space directions.
Fig. 5.11. Virtual Space Receptive Field (VSRF) of an AI neuron at two intensities showing the results of stimulation of the contralateral ear alone (top row) and of the two ears together (middle row). Spike count-vs-intensity functions obtained with tone-burst stimuli delivered to the contralateral ear alone (bottom row, left) or to two ears together (bottom row, right).
Fig. 5.12. Virtual Space Receptive Field (VSRF) of an AI neuron at two intensities showing the results of stimulation of the contralateral ear alone (top row) and of the two ears together (middle row). Spike count-vs-intensity functions obtained with tone-burst stimuli delivered to the contralateral ear alone (bottom row, left) or to two ears together (bottom row, right).
5.3. COMPARISONS OF VSRFS OBTAINED USING THE VAS FROM DIFFERENT CATS
The general pattern of location-dependent spectral features is very similar among the individual cats that were studied in the free field.9 For the same sound-source direction, however, there can be significant differences among individuals in the absolute values of the spectral transformation, in both cats and humans.9,17,18 Our models of virtual acoustic space mimic these general patterns as well as individual differences, and thereby provide an opportunity to study the sensitivities of AI neurons to these individualized realizations of a VAS. The VAS for each of three different cats was selected to obtain three sequential estimates of a single neuron’s VSRF. The comparisons are shown in Figure 5.13 for several intensity levels. Differences in the VSRFs among cats are most noticeable at low intensity levels where the VSRF is smallest and attributable mainly to monaural input. Under this condition, the intensity for many directions in a cell’s receptive field is near threshold level, and differences among the individualized VASs in absolute intensity at a fixed direction are accentuated by the all-or-none depiction of the VSRF. These results are typical of neurons that possess a large receptive field that grows with intensity to span most of an acoustic hemifield. At higher intensity levels, most directions are well above their threshold level where binaural interactions restrict the receptive field to the contralateral hemifield. Thus, while the VAS differs from one cat to the next, the neuronal mechanisms that must operate upon monaural and interaural intensity are sufficiently general to produce VSRFs that resemble one another in both extent and laterality.
Fig. 5.13. VSRFs of an AI neuron obtained with three different realizations of VAS. Each VAS realization was obtained from the SFER model using an input set of measured FETF impulse responses from three different cats (see section 4). Rows illustrate VSRFs obtained at different intensities for each VAS.
5.4. TEMPORAL RELATIONSHIPS OF THE VSRF
The characterization of a cortical neuron’s spatial receptive field has, so far, been confined to responses to a single sound source of varying direction delivered in the absence of any other intentional stimulation (i.e., in anechoic space). Of course, this is a highly artificial situation since the natural acoustic environment of the cat contains multiple sound sources emanating from different directions and with different temporal separations. The use of VAS allows for the simulation of multiple sound sources, parametrically varied in their temporal separation and incident directions. Results of experiments using one such paradigm give us some of the first insights into the directional selectivity of AI neurons under conditions of competing sounds. There is good physiological evidence that the output of an auditory cortical neuron in response to a given sound can be critically influenced by its response to a preceding sound, measured on a time scale of tens or hundreds of milliseconds. We have studied this phenomenon under VAS conditions. The responses of the cell in Figure 5.14
are representative of the interactions observed. The VSRF occupied most of acoustic space at the intensity used in this example. The topmost dot raster displays the time-locked discharge of the neuron to repetitive presentation of a sound from one particularly effective direction (AZ = 27°, EL = 18°) in the VSRF. The remaining dot rasters show the responses to a pair of sounds that arrive at different times, but from the same direction. The delay between sounds was varied from 50 to 300 milliseconds. The response to the lagging sound was robust and essentially identical to the response to the leading sound for delays of 125 msec and greater. At delays shorter than 125 msec, the effect of the leading sound was to suppress the response to the lagging one, and this suppression was complete when the temporal separation was reduced to 50 msec. In some cells, the joint response to the two sounds could be facilitative at critical temporal separations below about 20 msec (not shown). The directional selectivity of most cortical neurons recorded appeared to be influenced by pairs of sounds with temporal separations similar to those illustrated in Figure 5.14.
There is also evidence from previous dichotic earphone experiments that a cortical cell’s output to competing sounds has a spectral basis as well as a temporal one.19-21 We have studied the interaction of spectral and temporal dependencies by presenting one sound from a fixed direction within the cell’s VSRF and a second sound that was varied systematically in direction but at a fixed delay with respect to the first. This means that the spectrum of the delayed sound changed from direction to direction while that of the leading sound remained constant. A VSRF was then obtained in response to the lagging sound at several different temporal separations. Figure 5.15 shows the progressive reduction in size of the VSRF to the lagging sound that was associated with shortening of the delay between the lead and lag sounds from 150 to 50 milliseconds. Note that this cell is the same one whose delay data were illustrated in Figure 5.14. In such cells, the directional selectivity under competing sound conditions appeared to be altered dynamically and systematically by factors related to both temporal separation and sound source direction.
Fig. 5.14. Response of an AI neuron to a one- or two-sound stimulus located within the cell’s VSRF. Presentation of one sound from an effective direction (e.g., AZ = 27°, EL = 18°; white-filled circle) in the VSRF results in the time-locked discharge of the neuron (topmost dot raster). Remaining dot rasters show the response to two sounds that arrive at different times, but from the same direction. The response to the lagging sound is systematically suppressed as the delay between sounds is reduced from 300 to 50 milliseconds.
Fig. 5.15. VSRF of an AI neuron derived from the response to the lagging sound of a two-sound stimulus. The two sounds arrived at different times and from different directions. The leading sound was fixed at an effective direction (e.g., AZ = 27°, EL = 18°) in the VSRF. Progressive shortening of the delay to the lagging sound from 150 to 50 milliseconds resulted in a concurrent reduction in the VSRF size.
5.5. INTERNAL STRUCTURE OF THE VSRF
The mapping of VSRFs shown in previous illustrations is only meant to depict the areal extent over which the cell’s output can be influenced. This may be considered the neuron’s spatial tuning curve. However, both the discharge rate and response latency of a cortical neuron can vary within its spatial receptive field.22 In some cells, a gradient was observed in the receptive field consisting of a small region in the frontal hemifield where sounds evoked a high discharge rate and short response latency surrounded by a larger region where sounds evoked relatively lower rates and longer latencies. One paradigm used to reveal this variation in response metrics is to stimulate repetitively at closely spaced directions spanning the azimuthal dimension.23,24 In this approach, the VSRF is used to guide the selection of relevant azimuthal locations. This analysis was performed on the cell illustrated in Figure 5.16. The histograms show the mean first-spike latency and summed spike-count to 40 stimulus repetitions as a function of azimuth in the frontal hemifield. The azimuthal functions were determined at three different elevations. Regardless of elevation, response latency was always shortest and spike output always greatest in the contralateral frontal quadrant (0° to +90°) of the VSRF. The degree of modulation of the output was, however, dependent on elevation. The strongest effects of varying azimuthal position were near the interaural line (EL = 0°) and the weakest effects were at the highest elevation. For other cortical cells, both the VSRF and its internal structure were found to contain more complicated patterns.
Fig. 5.16. Spatial variation in response latency and strength across a VSRF of an AI neuron. Azimuth functions (response-vs-azimuth) were obtained at three fixed elevations in the frontal hemifield. Histograms show the mean first-spike latency and summed spike-count to 40 stimulus repetitions at each direction along the azimuth.
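The two response metrics plotted in the azimuth functions can be made concrete with a short sketch. The Python below is illustrative only: the function name, the data layout (a list of spike-time lists per azimuth) and the toy spike times are assumptions, not the authors' analysis software, and the choice to average latency only over repetitions that produced at least one spike is likewise an assumption.

```python
def azimuth_function(trials_by_azimuth):
    """Return {azimuth: (mean first-spike latency in ms, summed spike count)}.

    trials_by_azimuth maps an azimuth (degrees) to a list of repetitions;
    each repetition is a list of spike times in ms after stimulus onset.
    Latency is averaged only over repetitions containing a spike
    (an assumption; the mean is None if no repetition evoked a spike).
    """
    result = {}
    for azimuth, trials in trials_by_azimuth.items():
        first_spikes = [min(spikes) for spikes in trials if spikes]
        spike_count = sum(len(spikes) for spikes in trials)
        if first_spikes:
            mean_latency = sum(first_spikes) / len(first_spikes)
        else:
            mean_latency = None
        result[azimuth] = (mean_latency, spike_count)
    return result


# Made-up data: three repetitions per direction instead of the 40 used
# in the experiments described in the text.
toy_trials = {
    45: [[12.0, 15.5], [11.0], [13.0, 20.0, 31.0]],
    -45: [[25.0], [], []],
}
metrics = azimuth_function(toy_trials)
```

With these toy values the contralateral direction (+45°) yields the shorter mean latency and the larger spike count, which is the qualitative pattern described for the cell in Figure 5.16.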
5.6. SUMMARY
Free-field to eardrum transfer functions (FETFs), derived from measurements made of free-field sounds at the tympanic membrane, can be used to obtain a virtual acoustic space (VAS). Both the calculation of the FETF and transducer compensation are aided by the use of FIR filters. The VAS derived from empirical measurements can be modeled accurately in the frequency or time domains by a Spatial Feature Extraction and Regularization (SFER) model. This model is used to synthesize, in quasi-real time, stimuli that, when delivered via sealed and calibrated insert earphones, mimic in their time waveforms and spectra sounds originating in the free field. Advantages of the model include interpolation between directions in VAS and ease of computation of stimulus waveforms. This paradigm is being applied successfully to studies of directional sensitivity of single neurons of auditory cortex.
ACKNOWLEDGMENTS
The authors wish to acknowledge the participation of Joseph C.K. Chan, Paul W.F. Poon, Alan D. Musicant, and Mark Zrull in many of the experiments related to virtual space receptive fields that are presented here. Jiashu Chen and Zhenyang Wu played major roles in the development of mathematical models and Ravi Kochhar was responsible for developing the software that implemented virtual acoustic space. Richard Olson, Dan Yee and Bruce Anderson were responsible for the instrumentation. The work was supported by NIH grants DC00116, DC00398 and HD03352.
REFERENCES
1. Wightman FL, Kistler DJ. Headphone simulation of free-field listening. I: Stimulus synthesis. J Acoust Soc Am 1989; 85:858-867.
2. Wightman FL, Kistler DJ. Headphone simulation of free-field listening. II: Psychophysical validation. J Acoust Soc Am 1989; 85:868-878.
3. Carlile S. The auditory periphery of the ferret. II: The spectral transformations of the external ear and their implications for sound localization. J Acoust Soc Am 1990; 88:2196-2204.
4. Chan JCK, Musicant AD, Hind JE. An insert earphone system for delivery of spectrally shaped signals for physiological studies. J Acoust Soc Am 1993; 93:1496-1501.
5. Brugge JF, Reale RA, Hind JE et al. Simulation of free-field sound sources and its application to studies of cortical mechanisms of sound localization in the cat. Hear Res 1994; 73:67-84.
6. Pralong D, Carlile S. Measuring the human head-related transfer functions: construction and calibration of a miniature “in ear” recording system. J Acoust Soc Am 1994; 95:3435-3444.
7. Chen J. Auditory space modeling and virtual auditory environment simulation. Ph.D thesis, Madison: University of Wisconsin, 1992.
8. Chen J, Van Veen BD, Hecox KE. A spatial feature extraction and regularization model for the head-related transfer function. J Acoust Soc Am 1995; 1:439-452.
9. Musicant AD, Chan JCK, Hind JE. Direction-dependent spectral properties of cat external ear: New data and cross-species comparisons. J Acoust Soc Am 1990; 87:757-781.
10. Ljung L. System Identification: Theory for the User. Englewood Cliffs: Prentice-Hall Inc., 1987.
11. Chen J, Wu Z, Reale RA. Applications of least-squares FIR filters to virtual acoustic space. Hear Res 1994; 80:153-166.
12. Wahba G. Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics, 1990.
13. Gu C. Rkpack and its applications: Fitting smoothing spline models. Tech. Rep. 857. University of Wisconsin-Madison: Dept. of Statistics, 1989.
14. Rhode WS. A digital system for auditory neurophysiology research. In: Brown P, ed. Current computer technology in neurobiology. 256th ed. Washington, D.C.: Hemisphere, 1976:543-567.
15. Olson RE, Yee D, Rhode WS. Digital Stimulus System: Version 2. University of Wisconsin, Madison: Medical Electronics Laboratory and Dept. Neurophysiology, 1985.
16. Brugge JF, Reale RA, Hind JE. Auditory Cortex and Spatial Hearing. In: Gilkey R, Anderson T, eds. Binaural and Spatial Hearing in Real and Virtual Environments. New York: Erlbaum, 1995.
17. Rice JJ, May BJ, Spirou GA et al. Pinna-based spectral cues for sound localization in cat. Hear Res 1992; 58:132-152.
18. Wenzel EM, Arruda M, Kistler DJ et al. Localization using nonindividualized head-related transfer functions. J Acoust Soc Am 1993; 94:111-123.
19. Calford MB, Rajan R, Irvine DR. Rapid changes in the frequency tuning of neurons in cat auditory cortex resulting from pure-tone-induced temporary threshold shift. Neuroscience 1993; 55:953-964.
20. Phillips DP, Hall SE, Hollett JL. Repetition rate and signal level effects on neuronal responses to brief tone pulses in cat auditory cortex. J Acoust Soc Am 1989; 85:2537-2549.
21. Eggermont JJ. Differential effects of age on click-rate and amplitude modulation-frequency coding in primary auditory cortex of the cat. Hear Res 1993; 65:175-192.
22. Brugge JF, Reale RA, Hind JE. The structure of spatial receptive fields of neurons in primary auditory cortex of the cat. J Neurosci 1996; (submitted).
23. Imig TJ, Irons WA, Samson FR. Single-unit selectivity to azimuthal direction and sound pressure level of noise bursts in cat high-frequency primary auditory cortex. J Neurophysiol 1990; 63:1448-1466.
24. Rajan R, Aitkin LM, Irvine DRF et al. Azimuthal sensitivity of neurons in primary auditory cortex of cats. I. Types of sensitivity and the effects of variations in stimulus parameters. J Neurophysiol 1990; 64:872-887.
CHAPTER 6
RECENT DEVELOPMENTS IN VIRTUAL AUDITORY SPACE
Barbara Shinn-Cunningham and Abhijit Kulkarni
Virtual Auditory Space: Generation and Applications, edited by Simon Carlile. © 1996 Landes Bioscience.
1. INTRODUCTION
Virtual Auditory Space technology is being applied to a broad range of applications in a wide variety of fields. The flexibility and fine degree of stimulus control enabled by VAS techniques make them perfectly suited for studying sound localization and spatial auditory perception. Researchers can easily separate out the relative importance of different spatial cues, manipulate stimulus “sources” with ease and speed, and generate stimuli which are difficult or impossible to generate using free-field techniques. This same level of flexibility and control makes VAS displays useful in providing spatial auditory information to a human operator in many real-world applications. In the past, visual displays have been employed almost exclusively to present spatial information to air traffic controllers, fighter pilots, and other human operators. However, there is a growing need to increase the amount of information received by such operators. VAS techniques provide the opportunity to exploit the auditory channel without directly compromising the information already available via other modalities. VAS displays are also integral parts of multimodal “virtual environment” (VE) systems (proposed for use in training and entertainment) and “teleoperator” systems (designed to display information from remote or dangerous environments). In developing VAS displays for these varied applications, many practical issues arise. For instance, in any complex display system, the architecture of the computer software and hardware often determines the speed and accuracy of the information displayed. In systems which
display information to more than one sensory modality, synchronization of the information between these modalities is also affected (for a discussion of system architecture issues, see the work by Lehnert, reviewed in Shinn-Cunningham et al1). How a particular VAS display is implemented affects nearly all aspects of system performance, including whether the display can compute stimuli in real time, whether an interactive display can be realized, and whether auditory information can be closely coordinated with information displayed to other modalities. For each application, different performance issues may be critical. For one application, the update rate of the display may be critical; while for another the most important aspect of the display might be the ability to present auditory and visual stimuli that are synchronized in time. In theory, the ability to produce acoustic stimuli at the two ears equivalent to the stimuli that would be received in free-field is limited only by the accuracy with which HRTFs can be measured. However, in practice there are many (sometimes subtle) problems which arise when a VAS display system is realized. Between the difficulties in measuring HRTFs accurately and limits in memory and computational power, compromises must be made in today’s systems. One choice which has received considerable attention is how best to represent and encode HRTFs. Choices in the representation of the HRTF affect important system attributes including the spatial and temporal resolution achievable in the system and the ease with which one can simulate room reverberation. As with architectural issues, the choice of how to represent HRTFs ultimately depends upon the application for which the VAS will be used. Despite the many differences between ideal and achieved performance in available VAS systems, use of the systems is growing for a wide variety of fields. 
This chapter begins by reviewing some of the research issues in the encoding and implementation of HRTFs and the ways in which the choice of HRTF encoding affects the abilities of the resulting VAS system. The chapter then reviews some of the different ways in which VAS displays are currently being used for scientific study and for displaying spatial information to a human operator.
2. EFFECTS OF HRTF ENCODING ON VAS APPLICATIONS
The specific issues that might arise in the design of auditory interfaces differ depending on whether one presents an isomorphic representation of the acoustic environment with natural sounds (and using natural HRTFs) or a nonisomorphic mapping from fundamentally different inputs. Even in the completely isomorphic case, the problem of how to implement a VAS system is conceptually straightforward but very difficult to realize. Although one simply has to recreate appropriate acoustical stimuli at the ears of the human listener, there are substantial engineering challenges in reproducing or recreating the appropriate sounds. In addition to the obvious requirement for excellent fidelity in the acoustical system, other factors are also important in creating realistic VAS. For instance, in order to create a realistic environment, reverberation effects must be synthesized and the motion of the listener must be coupled to the outputs of the interface. With recent advances in the field of digital electronics, VAS implementations can make use of accurate HRTF measurements to realize directional acoustics. However, limited understanding of sound localization makes the implementation of a realistic and efficient VAS system challenging. Given that it is often too difficult or too costly to create a perfect VAS simulation, the tradeoffs that are made in the implementation of a VAS system need to be informed by the application.
2.1. GENERAL ISSUES
The psychoacoustics of sound localization have been extensively studied for over a century.2,3 Despite improved experimental methods in the last few decades, we still have a limited understanding of how sound localization cues are combined, the frequency regions in which they are important, and the regions of acoustic space over which they compete and/or dominate. For example, the cues responsible for sound externalization and distance perception are only partially understood (e.g., see Plenge,4 Rigapulos,5 or Mershon6,7). Often, when subjects listen over headphones, stimuli are perceived as inside the head even though they have interaural temporal and intensity differences appropriate for some external source. Research indicates that this lack of realism stems in part from inadequate representation of pinna spectral cues.8 However, the exact attributes of the HRTF which contribute to the perception of an externalized sound image are unclear. As a result of this lack of understanding, all possible spatial cues must be recreated in a VAS to ensure a realistic simulation. In order to recreate all possible acoustic cues, VAS implementations rely on filtering a single, monaural signal through carefully measured HRTF filters; however, this approach is “brute-force” in that not all cues may be necessary for a good simulation. With this approach, cues that are perceptually insignificant are created with the same fidelity as cues that are extremely important for a veridical perception. In addition, practical limitations in implementation keep even this approach from being entirely successful. Much of the current research in VAS is examining different aspects of the available spatial acoustic cues in order to identify their perceptual importance. These efforts aim to discover how each cue affects perception in order to inform the design of future systems. 
Once the ramifications of removing or omitting specific cues are understood, it will be possible to create more efficient systems.
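The brute-force approach described above, filtering one monaural signal through a measured HRTF pair, amounts to two convolutions per source. The following Python sketch illustrates the idea under stated assumptions: the function names are hypothetical, the impulse responses are made-up toy values rather than measured HRTFs, and a practical system would use much longer filters and a fast (FFT-based) convolution.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two sequences (direct-form FIR filtering)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter one monaural signal through a left/right impulse-response pair."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy impulse responses (assumed values): the right ear receives the sound
# two samples later and at half amplitude, as for a source on the left.
hrir_left = [1.0, 0.0, 0.0]
hrir_right = [0.0, 0.0, 0.5]
left, right = render_binaural([1.0, -1.0], hrir_left, hrir_right)
```

Even in this toy case the two ear signals differ in both arrival time and level, so the interaural cues are carried entirely by the filter pair; every cue present in the measured filters is reproduced whether or not it matters perceptually.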
For many applications, creating a realistic VAS is of primary importance; however, for many other applications, realism may be less important than conveying spatial information to a listener. Because the application determines what aspects of the displayed information are important, design of an efficient system depends on knowledge of many different things: it is crucial to understand the specific needs of the application, the perceptual relevance of various acoustic cues, and how incorporating different cues affects the technical requirements for a VAS system. For a small number of sources in an anechoic environment, using a brute-force approach (and including all possible acoustic cues) may offer a workable solution. However, in the following sections, we will enumerate several problems that arise when such a direct approach is realized.
2.1.1. Measurement of individualized HRTFs
Significant interindividual variability (IIV) in head shape, pinna shape and perhaps several other physical variables exists. As a result, the frequency response of an individual’s HRTF may be as unique as his fingerprint. To date, there is limited understanding of the role this IIV has on the formation and perception of the final auditory event. Therefore, in order to ensure realistic VAS, the synthesis algorithms must be individually tailored to the acoustical characteristics of each listener’s head and ears. This can be achieved by making individualized HRTF measurements for each listener. The resources required to make HRTF measurements and the expertise to make accurate measurements are found only in a few research groups. Traditional approaches entail tedious and time-consuming measurements made in an anechoic chamber while the listener remains motionless. More recently, efforts have focused on developing rapid and robust measurement systems that can be used in nearly any acoustic environment with reasonable success. 
A commercial, portable HRTF measurement system called SNAPSHOT (developed at Crystal River Engineering) measures the HRTFs at the ear canal entrance, as opposed to traditional measurements deep in the ear canal with probe tubes. A recent study9 has reported that, although discrepancies between HRTFs measured using SNAPSHOT and traditional techniques using probe tubes were found for a few subjects, SNAPSHOT, in general, provides a fairly high-fidelity HRTF characterization. As such, SNAPSHOT (or some similar system) may provide a reasonable alternative to measuring high-quality, nonindividualized HRTFs. Even so, it is not yet feasible or affordable to include facilities like the SNAPSHOT with every VAS simulation system. In fact, for many VAS applications, the time cost associated with making HRTF measurements for each operator (even with a measurement system like SNAPSHOT) may not be justified.
A different approach to solving the problem of IIV in HRTFs is being pursued by Tucker-Davis Technologies of Gainesville, Florida. Tucker-Davis Technologies (TDT, who make the Power-Dac high performance convolver for virtual audio applications) plan on providing a library of HRTFs from a large number of listeners with their equipment. Future plans include developing a technique by which listeners can search through this library of HRTFs to obtain HRTFs that approximate their own. An alternative to either of these approaches might entail making a small number of measurements on each subject (perhaps of specific physical characteristics of the subject, like the head circumference and the dimensions of the pinnae) that capture the important features of that particular listener’s HRTFs. With a small number of such parameter values, HRTFs could be interpolated among a set of empirically measured HRTFs or extrapolated from a known set of HRTFs. Such an approach is obviously a long way from being realizable. To be practical, the important physical features of the listener’s head must first be identified (and turn out to be easily measurable). Also, an HRTF encoding scheme must be realized in which the physical parameters measured on each subject correspond to parameters in the model so that synthesis of a set of HRTFs for each individual is possible. Regardless of how individualized HRTFs are obtained, individualized HRTFs are important for some applications. Individual differences in HRTFs have been identified as contributing to the perception of elevation cues and to the resolution of front-back confusions.10 Significant subject-to-subject differences in the transfer function from headphone to ear canal (a transformation that must be compensated for in VAS simulation) have also been demonstrated.11 Many listeners find listening through individualized HRTFs to be much more compelling, providing better externalization and a more realistic simulation. 
Although individualized HRTFs are important for some applications, for others they may be unimportant. In particular, for applications in which the listener is trying to extract azimuthal information (and for which realism is not crucial), nonindividualized HRTFs will probably suffice. For applications in which the listener will receive significant training, it may be possible to overcome the problem of using nonindividualized HRTFs, even when realism is important and source elevation is crucial. No long-term studies have been performed to see whether subjects can adapt to HRTFs different from their own.12,13 In any event, until HRTFs are tailored to individual users, VAS systems will continue to be only partially successful in creating realistic, externalized sound sources for a naive listener. Thus, creation of extremely realistic VAS will not be possible until efficient ways for measuring, approximating, or synthesizing individual HRTFs are perfected. There is a clear need to investigate IIV in HRTFs in order
to better define its role in the veridical perception of VAS, and to better understand what information is lost when using nonindividualized HRTFs. Several research groups (including groups at the University of Wisconsin, NASA Ames Research Center, MIT, Boston University and the University of Sydney) are actively pursuing this type of research.
2.1.2. HRTF sampling
Even when individualized HRTFs are made available, the characterization of acoustic space is imperfect. A major limitation of HRTFs is that only a discrete representation of acoustic space can be realized. Consider, for example, a simple, anechoic environment. For a fixed source distance, the acoustical space around the listener may be modeled as a sphere with the listener at its center. Theoretically, there are an infinite number of possible source positions on the sphere’s surface. A concomitant, infinite number of HRTFs are needed to represent all possible source directions (even if one ignores any possible dependence of the HRTFs on source distance). When a listener can move around and interactively explore a virtual environment, transfer functions must change smoothly and nearly continuously, again pointing to the need to characterize HRTF space fully. In practice, only a finite number of HRTFs are needed: if HRTFs for adjacent positions produce stimuli that are perceptually indistinguishable, the sampling is adequate. However, measuring HRTFs at a fine enough resolution to provide the necessary spatial resolution for both fixed and moving sources is currently unreasonable. First of all, measuring HRTFs at fine enough resolution is extremely time intensive and difficult, and measurement accuracy limits the ability to obtain a dense sampling of HRTF space. Secondly, memory limitations on any practical device restrict the storage of a large number of HRTFs. 
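A rough calculation shows how quickly the number of measurements and the storage requirement grow as the sampling becomes finer. The Python sketch below is a back-of-envelope estimate only; the filter length (256 taps), the sample width (2 bytes) and the assumption of roughly uniform angular spacing are illustrative choices, not values taken from any particular system.

```python
import math


def directions_on_sphere(spacing_deg):
    """Approximate number of directions at a uniform angular spacing.

    The whole sphere covers 4*pi steradians, about 41253 square degrees;
    each direction is assumed to occupy spacing_deg**2 square degrees.
    """
    sphere_sq_deg = 4 * math.pi * (180 / math.pi) ** 2
    return math.ceil(sphere_sq_deg / spacing_deg ** 2)


def storage_bytes(spacing_deg, taps=256, bytes_per_tap=2, ears=2):
    """Bytes needed to store one impulse response per ear per direction."""
    return directions_on_sphere(spacing_deg) * ears * taps * bytes_per_tap


coarse = directions_on_sphere(10)   # 10-degree spacing
fine = directions_on_sphere(1)      # 1-degree spacing
fine_storage = storage_bytes(1)
```

Under these assumptions a 10° grid needs about 413 directions, whereas a 1° grid needs about 41,253 directions and roughly 42 million bytes for a single listener at a single source distance, and each of those directions would also have to be measured.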
Because of the difficulties in obtaining and storing finely sampled HRTFs, most current VAS implementations rely on interpolation methods to create HRTFs for positions intermediate to measured positions. In the absence of a suitable model, current strategies use first-principle approximation methods like linear interpolation, which may lead to erroneous results.14 Simulation errors of this sort may be acceptable for applications that use auditory spatial cues in conjunction with cues from other modalities (as in a virtual environment) or when auditory spatial resolution is not critical (as in entertainment applications). However, errors of any sort are intolerable for other applications, such as psychoacoustical or physiological research. Current hardware trends suggest that storage problems associated with using a large number of HRTFs will be overcome in the future. However, the practical concerns of memory and computational power become even more formidable when one considers the simulation of reverberant acoustical environments or the creation of dynamic VAS simulations, both of which are discussed below.
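One way to see why simple linear interpolation can mislead is to apply it to two impulse responses that differ only in arrival time. The sketch below is a minimal illustration, assuming sample-by-sample blending in the time domain; the function name and the toy responses are hypothetical, and actual systems may interpolate other representations (for example magnitude spectra with separately handled delays).

```python
def interpolate_hrir(hrir_a, hrir_b, weight_b):
    """Sample-by-sample linear blend of two equal-length impulse responses.

    weight_b = 0 returns hrir_a, weight_b = 1 returns hrir_b.
    """
    if len(hrir_a) != len(hrir_b):
        raise ValueError("impulse responses must have equal length")
    if not 0.0 <= weight_b <= 1.0:
        raise ValueError("weight_b must lie between 0 and 1")
    return [(1.0 - weight_b) * a + weight_b * b
            for a, b in zip(hrir_a, hrir_b)]


# Toy responses: the same unit pulse arriving one sample apart, as might
# happen at two neighbouring measured directions.
earlier = [0.0, 1.0, 0.0, 0.0]
later = [0.0, 0.0, 1.0, 0.0]
midway = interpolate_hrir(earlier, later, 0.5)
```

The blended response contains two half-height pulses rather than one pulse at an intermediate delay, so the result is a response with a different spectrum from either neighbour and not that of a source lying between them. This is one example of the kind of error that a suitable model of the HRTF would be expected to avoid.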
2.1.3. Issues with reverberant VAS
The need to incorporate reflections in a VAS simulation has been reported in previously published studies (e.g., see Durlach15). It has been established in several demonstrations (e.g., Rigapulos,5 Kulkarni et al16) that the inclusion of reflections provides a more realistic simulation of auditory space. There are two ways in which reflections can be incorporated in a VAS implementation. One method is to measure free-field to eardrum transformations in the reflective environment. The resulting impulse response contains both the room characteristics and the directional characteristics of the head. Assuming that a long enough impulse response is measured (so that the impulse response fully captures all the echoes), the acoustical features associated with the environment are embodied in this empirical measurement. Note that each such measurement depends on both the absolute orientation and absolute position of the listener relative to the reverberant room as well as the position of the source in the room. Thus, unlike anechoic space where the relative angle between the listener and the source specifies a unique HRTF, in a reflective environment we need to characterize the acoustical space for each absolute orientation and position of the listener in the environment, for every position of the source. The total number of such room transfer functions that would have to be measured and stored is unreasonable, especially when one considers that the transfer functions are much longer than anechoic transfer functions. Moreover, the computational requirement for processing sounds through the long impulse responses characterizing both source position and room properties is quite formidable. The alternative method of implementing reverberant VAS is room modeling.
In this method, the echoic source is modeled as a combination of a single primary source and a set of discrete, secondary sound sources.17-19 The primary source is specified by the anechoic HRTF for the relative position of source to listener. Each of the secondary sources is first filtered to simulate the effects of the reflective medium and then filtered by the appropriate HRTF, given the direction of the reflection relative to the listener. A VAS display using this strategy needs to be capable of processing a large number of sources, one effective source for the direct path and one for each reflection. It should also have appropriate synchronization mechanisms to facilitate the parallel processing of the primary and secondary paths for each sound source and for rendering the composite signal over headphones. Although reverberation adds to the realism of VAS, there are times when reverberation may be detrimental. In particular, although reverberation usually has a relatively small effect on the ability of subjects to localize the primary sound source, localization accuracy is decreased.20,21 If a VAS is being used in order to convey information about source location (rather than to try to recreate a realistic environment or to
augment the feeling of presence in a multimodal virtual environment system), the effects of reverberation must be carefully considered to ensure that spatial information is not lost in order to improve unnecessary, subjective features of the VAS.
2.1.4. Issues with dynamic VAS
In the simulation of dynamic VAS where movement of either the sound source or human listener is desired, each primary sound source and its reflections need to be processed through a stream of continuously changing HRTFs. Unfortunately, several basic issues regarding the perception of a moving source are still poorly understood. Currently, there are two competing theories of how auditory source movement is perceived. It is not yet clear whether the auditory system perceives motion by performing a “snap-shot” analysis22 of the source trajectory (where the auditory system only notes the position of the source at distinct points in time, similar to how source position is perceived by the visual system during saccadic eye movements), or whether the moving source is tracked as a continuous stream23 (where the changing spatial cues are perceived continuously, similar to how sources are tracked by the visual system during smooth pursuit). This issue greatly affects the requirements of a VAS display because it determines the limit on latencies that the auditory system can tolerate for a moving source. Consider, for example, employing long impulse responses that characterize room reverberation. The inherent delay associated with such filters (which is proportional to their length) limits the update rate of directional information in the device. Recent work on the perception of auditory motion by Perrott and his colleagues22 suggests that update rates in devices that use anechoic HRTFs may be acceptable for moderate source velocities (e.g., see Shinn-Cunningham1); however, the update rates are probably not sufficient for long transfer functions like those that incorporate reflection information.
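The scale of this problem can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that the directional filter can be changed only once per filter length; the sample rate, filter lengths and source velocity are hypothetical values and are not taken from the studies cited.

```python
def filter_duration_ms(n_taps, sample_rate_hz):
    """Duration of an n_taps-long impulse response, in milliseconds."""
    return 1000.0 * n_taps / sample_rate_hz


def degrees_per_update(velocity_deg_per_s, n_taps, sample_rate_hz):
    """Angle swept by a moving source between filter updates.

    Assumption (illustrative only): one filter update per filter length.
    """
    return velocity_deg_per_s * n_taps / sample_rate_hz


SAMPLE_RATE_HZ = 44100.0   # assumed sample rate
VELOCITY = 90.0            # assumed source velocity in degrees per second

for label, n_taps in (("short anechoic filter", 256), ("long room response", 22050)):
    duration = filter_duration_ms(n_taps, SAMPLE_RATE_HZ)
    step = degrees_per_update(VELOCITY, n_taps, SAMPLE_RATE_HZ)
    print(f"{label}: {duration:.1f} ms long, {step:.2f} deg swept per update")
```

Under these assumptions the short filter lets the source position be refreshed in steps well under a degree, while the long room response would let the source travel tens of degrees between updates.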
In any case, it is clear that further investigations must be undertaken to address the question of auditory motion perception more directly. For instance, there have been no reports of studies investigating the delays that the auditory system might tolerate when subjects actively move their head. In addition to questions about what latencies can be tolerated in a dynamic VAS, the question of what spatial resolution is required for moving sources has yet to be directly addressed. HRTFs provide a discrete representation of acoustic space. The resolution of the stored HRTFs is bounded both by storage constraints (which could possibly be overcome) and by the sheer inability to make precision HRTF measurements at finer resolutions. Since little is known about resolution of moving sources, one is forced to rely on resolution measures of static source localization in order to estimate the necessary spatial resolution of HRTFs in a dynamic VAS. It is known that the minimum audible angle (also known as the MAA, the smallest angular displacement of an acoustic source that humans can detect about a
fixed reference point) is a couple of degrees for positions directly in front of listeners.24 A spatialization system developed by McKinley and his associates in the Bioacoustics and Biocommunications Branch of the Armstrong Laboratory at Wright Patterson Air Force Base utilizes HRTFs measured from the mannequin KEMAR at a resolution of one degree in azimuth.82 This sampling is roughly the resolution required given the MAA measure of spatial sensitivity. If the saccadic theory of auditory motion perception is correct, static MAA measurements can be used to estimate the necessary HRTF resolution for a dynamic VAS. In this case, the spatial separation of sampled HRTFs must be smaller than the static MAA to ensure a veridical representation of the location of the source at the end of a saccade. On the other hand, if the smooth pursuit model is accurate, it is difficult to predict exactly what HRTF resolution may be required for moving sources. In either case, more research is necessary to determine exactly how moving sources are perceived and what resolution in HRTF sampling is needed for moving sources.
In theory, creation of a realistic VAS is relatively easy and straightforward; however, practical problems currently limit the realism of many VAS implementations. Computational and storage resources are limited, constraining the size and number of HRTFs that can be used as well as the complexity of the acoustic environment that can be simulated. Additionally, measuring HRTFs on individual subjects is time-consuming and difficult, making it necessary to use nonindividualized HRTFs for most applications. Only with further psychophysical study will it be possible to make informed decisions about the design tradeoffs that are necessary in creating VAS.
2.2. HRTF MODELING
The development of efficient HRTF modeling schemes depends upon finding low-order parametric HRTF representations that capture all perceptually salient, spatial information in the HRTFs. The ideal HRTF encoding scheme would address all of the problems already reviewed. It would reduce the requirements for HRTF storage, allow a computationally efficient method for rendering spatialized sources, make it relatively easy to interpolate HRTF positions, and allow individualized HRTFs to be synthesized from already-measured HRTF sets. Today, these disparate goals cannot all be attained with any single HRTF encoding scheme. In practice, memory and computational limits are the biggest hurdles in current VAS implementations. Several research groups (e.g., University of Wisconsin, NASA Ames, Boston University, University of Michigan, etc.) are developing HRTF representations that improve the efficiency of HRTF displays. Bearing in mind the general problems in implementing VAS systems discussed above, the following sections describe general modeling issues and review the broad range of HRTF modeling efforts underway.
2.2.1. General issues
Two major ideologies can be identified in the area of HRTF modeling. One body of work has been directed towards determining the physical phenomena that give rise to spatial information in HRTFs, while the other effort has been motivated by the need to improve the efficiency of VAS displays. Other recent work has tried to integrate the two philosophies in an attempt to guide research in one direction via improved understanding of the other. Models that explicitly try to characterize the physical phenomena that underlie HRTFs have only been partially successful. This is because the physical acoustics of sound propagation from source position to the eardrum is a highly complex phenomenon. The effects resulting from reflection and diffraction by the torso, head and pinna are complicated functions which vary with frequency as well as with the azimuth, elevation and range of the sound source.2 Models motivated by a desire to increase the efficiency of VAS implementations have been hampered for two reasons: (1) many perceptual issues affecting VAS systems are not well understood, and (2) the quality of a VAS is difficult to evaluate. Both of these factors result from the lack of a deterministic psychophysical criterion that can be used to evaluate HRTFs. Only when a valid method for rating VAS systems has been developed will it be possible to meaningfully evaluate different HRTF models. Most current methods for evaluating HRTF models use a metric that weights errors in all frequency regions equally. Such approaches fail to take into account many factors. For instance, some frequency regions are perceptually more important than others. Also, because of the physical structure of the hearing apparatus (namely the physical structure of eardrum, middle ear, and cochlea), some frequencies are transmitted more efficiently than others, further affecting the relative importance of the different spectral regions.
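The difference between an equally weighted metric and a frequency-weighted one can be sketched as follows. This is only an illustration: the weighting vector is a hypothetical placeholder and does not represent a validated psychophysical criterion, which, as noted above, does not yet exist.

```python
import numpy as np


def weighted_spectral_error_db(h_ref, h_model, weights=None):
    """RMS difference between two magnitude spectra in dB, optionally weighted per bin.

    weights=None reproduces the usual metric that treats every frequency bin equally.
    Any other weighting is a hypothetical stand-in for perceptual importance.
    """
    ref_db = 20.0 * np.log10(np.abs(h_ref))
    model_db = 20.0 * np.log10(np.abs(h_model))
    squared_error = (ref_db - model_db) ** 2
    if weights is None:
        weights = np.ones_like(squared_error)
    weights = np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(weights * squared_error) / np.sum(weights)))


# Toy spectra: the model is 6 dB off in the upper half of the band only.
h_ref = np.ones(8)
h_model = np.ones(8)
h_model[4:] *= 10.0 ** (6.0 / 20.0)

equal = weighted_spectral_error_db(h_ref, h_model)
low_band_only = weighted_spectral_error_db(h_ref, h_model, weights=[1, 1, 1, 1, 0, 0, 0, 0])
print(f"equal weighting: {equal:.2f} dB, low-band weighting: {low_band_only:.2f} dB")
```

The same model fit is scored very differently depending on which spectral regions the metric is told to care about, which is the reason a perceptually grounded weighting matters for comparing HRTF models.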
By taking into account what spectral regions are more critical than others, it may be possible to substantially reduce the complexity of the HRTF and allow the model to fit the features most important to sound localization. In the absence of such a criterion, models will invariably fail to produce the most veridical simulation possible within given technical constraints. Ideally, a VAS system would produce a completely veridical simulation. However, creating the perfect simulation is not yet possible. Even if it were possible, the cost associated with such a seamless implementation may not be justifiable for some applications. For example, when used for certain entertainment applications, providing fairly primitive directional information may be adequate. On the other hand, when the virtual display is being used to perform some controlled psychophysical experiments, it may be imperative that the perceptual experience of the observers in VAS match their experience with natural auditory stimuli. Because different applications have different requirements,
it is important that the HRTF model is the appropriate choice for a given application. Each model may be better suited for certain manipulations than others. Thus, an ideal VAS system would allow the implementation of more than one HRTF model in order to meet the different needs of different applications.
2.2.2. Perceptual significance of HRTF features
The HRTF is a complex-valued function that can be broken down into a magnitude spectrum and a phase spectrum. There are complicated features in both the HRTF magnitude and phase spectra that contain spatial information. As yet, the perceptual relevance of many of these features is not known. Much of the effort expended in modeling HRTFs is spent trying to capture all of the details in both HRTF phase and magnitude. Knowledge of human sensitivity to HRTF attributes can ensure that approximations of the HRTF are as effective as is possible. The goal of such an approach is to determine from a complex-valued HRTF vector the features that have perceptual importance; however, this approach can only be validated by psychophysical testing. In this section we discuss some recent studies of human sensitivity to HRTF phase spectra and the simplifications in modeling HRTFs that are suggested by these results. Studies of human sensitivity to magnitude spectra, currently being pursued in several labs, are also discussed.
Wightman and Kistler25 recently demonstrated that it is the interaural time differences (ITDs) of low-frequency components that dominate perception of source direction. Interaural timing information in high-frequency components is much less salient for directional information, and is basically ignored by the auditory system in favor of low-frequency ITDs when such low-frequency information is present in the stimulus. Kulkarni et al26 showed that listeners are not sensitive to the detailed structure of the HRTF phase spectra.
This study examined minimum-phase HRTFs in which the measured magnitude spectra determine the details of the phase spectra. A simple, overall interaural delay (obtained from the low-frequency components of the empirical HRTF measurements) was then added to the signals presented to the listener. Such minimum-phase HRTFs do not replicate all details in the HRTF phase spectrum; however, the study showed that these details were not perceptually significant. The results of this study are consistent with the earlier work by Wightman and Kistler,25 in that the detailed phase information lost in a minimum-phase HRTF representation is mainly in the high-frequency part of the phase spectrum. On the basis of these earlier results, one might predict that the loss of phase detail is not of great perceptual importance, especially since low-frequency phase is accurately represented in the minimum-phase HRTF.
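One way to build a minimum-phase filter from a measured magnitude spectrum is the real-cepstrum (homomorphic) method sketched below. This is a standard signal-processing construction offered as an assumption; the procedure actually used in the study is not described here, and the function name is hypothetical.

```python
import numpy as np


def minimum_phase_from_magnitude(magnitude):
    """Return a minimum-phase impulse response whose FFT magnitude matches `magnitude`.

    `magnitude` is the full, even-length FFT magnitude of a real impulse response,
    for example np.abs(np.fft.fft(h)).
    """
    magnitude = np.asarray(magnitude, dtype=float)
    n = magnitude.size
    if n % 2:
        raise ValueError("this sketch expects an even FFT length")
    # Real cepstrum of the log-magnitude spectrum (floor avoids log(0)).
    cepstrum = np.fft.ifft(np.log(np.maximum(magnitude, 1e-12))).real
    # Fold the cepstrum onto positive quefrencies; this imposes the
    # Hilbert-transform relation between log-magnitude and phase.
    window = np.zeros(n)
    window[0] = 1.0
    window[1:n // 2] = 2.0
    window[n // 2] = 1.0
    min_phase_spectrum = np.exp(np.fft.fft(window * cepstrum))
    return np.fft.ifft(min_phase_spectrum).real


# A two-tap filter with its energy at the end (not minimum phase), zero-padded to 256.
h = np.zeros(256)
h[:2] = [0.5, 1.0]
h_min = minimum_phase_from_magnitude(np.abs(np.fft.fft(h)))
print(np.round(h_min[:4], 6))
```

The magnitude spectrum is preserved while the energy is moved to the start of the response. The overall interaural delay would then be reintroduced separately, for instance by delaying the signal at the far ear.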
With a minimum-phase HRTF (described by Mehrgardt and Mellert27 and first used in a model by Kistler and Wightman28), the log-magnitude and phase components of the HRTF are uniquely related by the Hilbert transform.29 This result has important implications for HRTF modeling and VAS implementations:
• Adequate models need only fit the magnitude spectra of the HRTFs and the overall, frequency-independent ITD. This reduces the modeling problem to fitting a real-valued function rather than fitting a complex-valued HRTF. Moreover, a minimum-phase assumption ensures that the transfer functions will always be stable. (A stable, causal transfer function requires all of its poles to lie within the unit circle in the complex z-plane. A minimum-phase system has all of its poles and zeros within the unit circle, and is therefore guaranteed to have this property.)29
• Some of the poor results associated with linear interpolation techniques are mitigated when minimum-phase HRTFs are used. Analysis has shown that interpolating HRTFs using a weighted average results in an output signal that is effectively comb filtered.14 This occurs because each HRTF used in the interpolation gives rise to a separate filtered output which occurs at a different absolute time. Adding multiple such outputs results in a quasi-periodic signal, thus modulating the magnitude spectrum of the output sound. Minimum-phase HRTFs remove the major delay components in the filters prior to any interpolation. As a result, when interpolating using minimum-phase HRTFs, the multiple, filtered outputs occur at roughly the same time and any comb-filtering effect is negligible. The appropriate overall ITD for the interpolated, minimum-phase filter can be computed separately by interpolating the overall ITDs of the neighboring HRTFs.
There are a few studies that shed light on the perceptual significance of features in the magnitude spectra of HRTFs. Complicated peaks and notches in the magnitude spectra vary with source elevation and source azimuth, especially for high frequencies.
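The comb-filtering effect of interpolating delayed filters, and the way removing the delays avoids it, can be seen in a toy example. The sketch below is illustrative only: each "HRTF" is reduced to a pure delayed impulse, delays are whole samples, and np.roll stands in for the delay removal that a minimum-phase representation provides.

```python
import numpy as np


def delayed_impulse(delay_samples, length=64):
    """Toy impulse response: a unit impulse delayed by a whole number of samples."""
    h = np.zeros(length)
    h[delay_samples] = 1.0
    return h


def interpolate_direct(h_a, h_b, alpha):
    """Weighted average of two impulse responses, delays left in place."""
    return (1.0 - alpha) * h_a + alpha * h_b


def interpolate_aligned(h_a, delay_a, h_b, delay_b, alpha):
    """Strip each filter's delay, average the aligned filters, and
    interpolate the overall delay separately."""
    aligned = (1.0 - alpha) * np.roll(h_a, -delay_a) + alpha * np.roll(h_b, -delay_b)
    delay = (1.0 - alpha) * delay_a + alpha * delay_b
    return aligned, delay


h_a, h_b = delayed_impulse(4), delayed_impulse(12)

direct = interpolate_direct(h_a, h_b, 0.5)
aligned, delay = interpolate_aligned(h_a, 4, h_b, 12, 0.5)

print("direct  min |H|:", np.abs(np.fft.fft(direct)).min())   # deep comb notches
print("aligned min |H|:", np.abs(np.fft.fft(aligned)).min())  # flat spectrum
print("interpolated delay (samples):", delay)
```

Averaging the two delayed impulses yields two half-height impulses rather than one impulse at the intermediate delay, so the magnitude spectrum acquires periodic notches. Aligning first keeps the spectrum flat and carries the timing information in a single interpolated delay.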
Both the spectral peaks2 and notches30 may carry salient information about the location of the source. These high-frequency features of the HRTF magnitude spectra have repeatedly been shown to be important in determining source elevation and for resolving front/back confusions (where sound locations in the front hemifield are confused with sound locations in the rear hemifield, or vice-versa).25,31 Recent work has begun to answer the question of exactly how the magnitude spectra affect the perceived location of a stimulus. Studies have shown that the apparent source location for narrow band sources depends upon the center frequency of the stimulus.2,32-34 In particular, the apparent location of a narrow band stimulus is directly related to spectral peaks in the subject’s HRTFs: the HRTF corresponding to the perceived location is the HRTF that has the largest peak (maximizes the output energy) at the frequencies present in the stimulus.2,32,33 The apparent elevation and front/back position of the source is strongly influenced by the center frequency of the narrow band stimuli; however, azimuthal position is generally determined by interaural cues.2,34 Data also suggest that azimuth is well preserved in simulations using nonindividualized HRTFs, although subjects are poorer at making elevation judgments and there is an increase in the number of front-back reversals.10 These studies show that details in the high-frequency magnitude spectra determine the perceived elevation and the front/back position of a source, and that individual differences are important in these perceptions.
Although these studies show that features in the HRTF magnitude spectrum are perceptually important, they do not address the question of what spectral resolution is necessary for veridical simulation. In particular, there are several sources of peripheral smoothing that would reduce the need for spectral detail in the HRTF (e.g., the mechanical properties of the middle ear and cochlea). Moreover, there is poorer spectral resolution at high frequencies: the transduction from acoustic stimulus to electrical signal in the auditory nerve effectively smooths out much of the high-frequency spectral detail that many people argue conveys spatial information. Some analysis of the perceptual salience of cues after such smoothing has been carried out by Carlile and Pralong.35 Better understanding of the perceptual salience of the HRTF magnitude spectra is likely to show that there is little need for an exact spectral representation at high frequencies, thereby reducing the dimensionality of the HRTF representation.
Some studies investigating human sensitivity to details in the magnitude spectra are currently underway (e.g., at Boston University and the University of Wisconsin). One of the aims of these studies (e.g., see Kulkarni et al26) is to suggest a psychophysical metric that can be used to evaluate the fit of HRTF magnitude spectra.
2.3. EXAMPLES OF MODELS AND IMPLEMENTATION SCHEMES
In this section we shall review some HRTF models reported in the literature. Models from each of the two basic approaches mentioned above (those that are based on physical descriptions of the listener and those that use various abstract mathematical approaches to encode HRTFs) are reviewed. The task of modeling HRTFs based on knowledge of the underlying physical phenomena is daunting for several reasons: (1) The dimensions of the human pinna are comparable to acoustic wavelengths of interest. Therefore, the ear cannot be modeled as a lumped parameter system. (2) Since the geometry of the external ear is complex, an appropriate distributed system description of the ear may be impossible.
(3) Individual differences in the physical structure of the pinnae require customization of a physical model for specific ears. In practice, physical parameters characterizing the subtle variations in these geometrical features may be impossible to specify. To the extent that physical models are successful, they provide a natural framework in which to study the source of IIV in HRTF measurements, and may enable individualized HRTFs to be readily synthesized (instead of being measured for each subject).
In contrast to physically-based models, the structure of abstract mathematical models of HRTFs does not directly relate to any physical properties of the listener. Models based on generalized system identification techniques depend upon the availability of reliable empirical HRTF measurements and the development of computationally efficient algorithms for representing HRTFs and for simulating sources with the HRTFs. These models use a problem-oriented approach to modeling the HRTF, using system identification techniques to obtain “best” fits to measured HRTF spectra. The goal of these models is to reduce the empirical data into a more efficient representation, thus reducing the dimensionality of the HRTF characterization. To the extent that the models are successful, the directional dependence of the external ear transformation is completely characterized by the model parameters; the model structure itself is not constrained by considerations of actual physical phenomena.
In general, these mathematical models start by assuming that HRTFs vary with the direction of the sound source with respect to the head. The set of HRTFs from all possible directions is treated as a random process in which the sample waveforms (i.e., the individual HRTFs) are functions of frequency and the underlying sample space is the set of directions from source to the listener’s head.
The random process may differ from individual to individual (because of differences in head, pinna geometry, etc.) and also depends upon the acoustic environment. Note that, because HRTF measurements are usually carried out in anechoic environments, current models are restricted to anechoic conditions; however, the techniques employed by models in this category can be applied to any environment as long as the sample space is well defined (e.g., one could use the same techniques while considering a listener in a fixed position in an echoic space with fixed boundaries). Statistical models differ in their description of the physical system underlying the random process. One class of models relies on deriving basis functions which are uncorrelated over the sample space of directions and expressing the HRTF as a weighted combination of members of this orthogonal set. A different class of models utilizes techniques of rational-function approximation to describe the random process. Still other characterizations may choose a class of basis functions and then choose the optimal members of that class to efficiently encode the HRTFs.
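The first class of models can be illustrated with a principal components decomposition, which is one common way to obtain an orthogonal basis whose weights are uncorrelated across directions. The sketch below is an assumption-laden illustration: it uses synthetic data in place of measured HRTFs, and the function names are hypothetical.

```python
import numpy as np


def fit_basis(log_mag, n_components):
    """Derive basis functions from a set of log-magnitude HRTFs.

    log_mag : (n_directions, n_freqs) array, one row per source direction.
    Returns the mean spectrum, the orthonormal basis functions (rows), and
    the per-direction weights.
    """
    mean = log_mag.mean(axis=0)
    centered = log_mag - mean
    # Rows of vt are orthonormal functions of frequency, ordered by variance explained.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    weights = centered @ basis.T
    return mean, basis, weights


def reconstruct(mean, basis, weights):
    """Rebuild each HRTF as the mean plus a weighted combination of basis functions."""
    return mean + weights @ basis


# Synthetic stand-in for measured data: 50 directions, 32 frequency bins,
# generated from two underlying spectral shapes.
rng = np.random.default_rng(0)
data = 3.0 + rng.standard_normal((50, 2)) @ rng.standard_normal((2, 32))

mean, basis, weights = fit_basis(data, n_components=2)
error = np.max(np.abs(reconstruct(mean, basis, weights) - data))
print(f"stored {weights.shape[1]} weights per direction, max error {error:.2e}")
```

Each direction is then represented by a handful of weights instead of a full spectrum, which is the dimensionality reduction these models aim for. With real HRTFs the number of components needed would have to be chosen empirically.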
2.3.1. Wave-equation models
Theoretically, it is possible to specify the pressure at the eardrum for a source from any location simply by solving the wave equation with appropriate boundary conditions for the torso, head and pinna. Needless to say, this problem is analytically and computationally intractable. A much simpler problem assumes that the human head is perfectly spherical. Under this assumption, the pressure waveform for an arbitrary angle of incidence can be derived relatively easily. Lord Rayleigh36 found an accurate low-frequency solution for the spherical-head model which has since been tested in several studies. For example, the model predictions are consistent with the frequency-dependent ITD and IID measurements taken from human heads.37 Specifically, the spherical-head model can predict the head-shadow effect and the sinusoidal dependence of ITD on source azimuth. However, the exact solution for a plane wave diffracting off a rigid sphere is obtainable only at low frequencies. In addition, there are many important features in real HRTFs that are not predicted by a simple spherical model; for instance, the model cannot account for any elevation dependence in HRTFs. As has already been discussed, details in the high-frequency spectrum are important for resolution of spatial ambiguity and for creating a realistic VAS, another feature that the low-frequency, spherical-head solution cannot predict. Thus, the simple spherical model of the human head is not very promising for most VAS applications.
Genuit38 has proposed a detailed description of the directional filtering of sound waves by the body. The model describes the head and pinna geometry as a distributed system and comprises subsystems which are dependent on the direction of sound incidence (e.g., the head, torso, and pinna) and subsystems which are independent of source direction (e.g., the ear canal).
Each of the sixteen subsystems is implemented as a channel with some transfer characteristic. The model makes use of Kirchhoff’s diffraction integral to approximate the disturbed sound field reaching the external ear boundary. The solution of the integral depends upon boundary conditions specified by parameters dependent upon an individual’s outer ear geometry. These integrals are solved for each model subsystem to find the individualized transfer characteristics for that portion of the model. An individualized HRTF is constructed from the combination of 16 filter channels and delay elements. Note that the filter coefficients and delays are explicitly dependent upon both the external ear geometry and the position of the sound source. Genuit also made repeated empirical measurements of HRTFs on one of the subjects whose HRTFs he had modeled. For a single source direction for that subject, the amplitude spectrum predicted by the model fell within the maximum and minimum values spanned by the repeated empirical HRTF measurements. This model is the first true parametric description of external ear geometry.
While models like Genuit’s (which are based on physical structure and acoustic principles) are intuitive and appealing, their effectiveness in binaural synthesis has yet to be demonstrated. Additionally, as pointed out already, solving the wave equation for complex structures like the human ear is very difficult; even finding approximate solutions is a computationally intensive process. In practice, finding efficient methods to realize models of this sort will require additional research and evaluation.
2.3.2. Models of pinna structure
The models discussed in this section do not rely on solving the acoustic wave equation, but rather try to model the effects of the human pinnae on the HRTF. We shall discuss models described by Blauert2 as time-domain39 and frequency-domain40,41 models of pinnae effects, as well as a more recent model by Chen et al.42
Batteau39 was one of the first people to demonstrate the role of the pinnae in determining source elevation. He proposed a time-domain model to describe how each pinna may give rise to the spectral cues for source elevation. Batteau first built an enlarged pinna replica. From studying this pinna, Batteau conjectured that the two ridges of the outer ear produce reflections of any direct source reaching the ear, and that the amplitudes and timing of these reflections are dependent on the sound source position (a block diagram of this model is shown in Fig. 6.1). One echo (having a latency of 0-80 µs) varies with the azimuthal position of the sound source, while the other (having a latency of 100-300 µs) varies with source elevation. When the direct sound and the two reflections are combined, spectral notches arise whose depth and frequency depend upon the relative amplitudes and timing of the two reflections. This is most easily seen by considering the following.
If the time delays characterizing the two reflective paths are denoted by τ1 and τ2 and the associated reflection coefficients by λ1 and λ2 respectively, the input-output relationship of the model is described by:
\[ y(t) = x(t) + \lambda_1 x(t - \tau_1) + \lambda_2 x(t - \tau_2). \]
The model HRTF (normalized so that the maximum magnitude of the transfer function equals 1.0) can then be obtained by computing the frequency response of the above equation:
\[ H(j\omega) = \frac{1 + \lambda_1 e^{-j\omega\tau_1} + \lambda_2 e^{-j\omega\tau_2}}{1 + \lambda_1 + \lambda_2} \]
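The frequency response above can be evaluated numerically to see where the spectral notches fall. The sketch below is illustrative: the delays and reflection coefficients are hypothetical values chosen within the latency ranges quoted for the two echoes, not parameters fitted to any ear.

```python
import numpy as np


def batteau_hrtf(freq_hz, tau1, tau2, lam1, lam2):
    """Two-reflection pinna model: direct path plus two scaled, delayed echoes.

    freq_hz    : frequency or array of frequencies in Hz
    tau1, tau2 : echo delays in seconds
    lam1, lam2 : reflection coefficients
    """
    omega = 2.0 * np.pi * np.asarray(freq_hz, dtype=float)
    numerator = 1.0 + lam1 * np.exp(-1j * omega * tau1) + lam2 * np.exp(-1j * omega * tau2)
    return numerator / (1.0 + lam1 + lam2)


freqs = np.linspace(0.0, 20000.0, 2001)   # 10 Hz steps
h = batteau_hrtf(freqs, tau1=50e-6, tau2=200e-6, lam1=0.5, lam2=0.5)
mag_db = 20.0 * np.log10(np.maximum(np.abs(h), 1e-12))
print(f"deepest notch near {freqs[np.argmin(mag_db)]:.0f} Hz")

# With a single echo of equal strength, the first notch sits at 1 / (2 * tau).
single = batteau_hrtf(5000.0, tau1=100e-6, tau2=0.0, lam1=1.0, lam2=0.0)
print(f"|H| at 5 kHz for a single 100 microsecond echo: {abs(single):.2e}")
```

Changing either delay moves the notches along the frequency axis, which is how the model encodes source position in the magnitude spectrum.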
Systematic variation of τi and λi causes a corresponding change in the magnitude spectrum of the model HRTF. In Batteau’s model, this
Fig. 6.1. Schematic version of the time-domain pinna model proposed by Batteau.39 The position dependent delays associated with the acoustic paths give rise to spectral notches characteristic of pinna filtering.
systematic variation comes about from varying the direction of the incoming waveform. Studies by Watkins43 and Wright et al44 have shown that the above equation provides a good approximation to frequency response measurements from human listeners in the lateral vertical plane. Wright and Hebrank 44 have supported this description of pinna filtering. Moreover, it has been shown that HRTFs measured from human listeners in the median plane show a systematic movement of a spectral notch with changing source elevation, consistent with the model.45 The advantage of Batteau’s model is its simplicity. It contains only four parameters and explicitly codes for source location. However, even though this simple description is important for helping to understand pinna function, it is an oversimplification of an otherwise complex process. Because the reflecting surfaces of the ears are small in comparison with the wavelengths of most audible frequencies, it is likely that both dispersion and reflection occur at the pinnae. It is likely that there are more than the two echoes modeled by Batteau, and that the reflections are not simply scaled, delayed versions of the impinging waveform. These second-order effects would have to be incorporated into the model in order to allow the model to predict all of the
fine details of the HRTFs that are likely to be of some perceptual significance. The frequency-domain model by Shaw40,41 arose from empirical studies using replicas of human pinnae. Shaw determined the frequency response of pinna replicas and identified eight resonance frequencies, ranging from 2.6 to 15.7 kHz. These resonances were determined to vary with the incident angle of the sound source. Although this study provided great insight into the physics of pinna filtering, no analytical expression was formulated to replicate the measured effects. Hence, the model cannot be used for VAS synthesis. A recent study by Chen et al42 used a beam forming approach to generate a functional model of pinnae effects. Although this study does not explicitly model the physics of the external ear (unlike, for example, the model by Genuit38), it does rely on a general description of the ear’s physical structure. This model assumes that the acoustical wavefront reaching the ear is reflected off multiple ridges of the pinna. Each individual reflection is further assumed to undergo different frequency-dependent acoustical effects (such as absorption) that are independent of the direction of incidence of the sound wave, as well as a delay which varies with source direction. These individual reflections (which are delayed, filtered versions of the incident waveform) are then superimposed to create the total waveform reaching the ear drum. A block diagram of this model is shown in Figure 6.2. This model treats the ear as an acoustic antenna composed of an array of sensors (corresponding to the individual delayed and filtered reflections). The output of each sensor is assumed to be a weighted sum of N outputs off a tapped delay line (for simplicity, successive taps are assumed to sample the input signal one time unit later) combined with an overall propagation delay for that sensor. 
With this formulation, each sensor has a response (in the form of a finite-impulse-response filter) that is independent of the source direction and an overall delay that varies with source position. If the overall propagation delay from the first to the ith sensor is denoted by τi, then the transfer function of the FIR filter corresponding to the ith sensor, Hi(ω, θ), may be expressed as:

$$H_i(\omega, \theta) = \sum_{n=0}^{N-1} w_i[n]\, e^{j\omega(n+\tau_i)}$$
where N denotes the number of samples and wi[n] corresponds to the impulse response of the FIR filter. Note that the intersensor delay τi depends upon the sensor geometry chosen for the model and varies with the location of the source. The output of the beamformer model is the summed output of each sensor and for an M sensor array may be expressed as:
$$H(\omega, \theta) = \sum_{i=1}^{M} H_i(\omega, \theta) = \sum_{i=1}^{M} \sum_{n=0}^{N-1} w_i[n]\, e^{j\omega(n+\tau_i)}$$

Fig. 6.2. Schematic representation of the beamforming model of the pinna proposed by Chen et al.42 Note that as an extension of the Batteau39 model, each acoustic path is modeled by a linear filter in addition to a simple delay.
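The summed-sensor expression can be illustrated numerically. The following is only a minimal sketch, not the implementation of Chen et al: the weights are made-up values, and the function `example_delays` assumes a hypothetical linear sensor geometry in which the delay grows with the sine of the azimuth. The sign convention of the exponent follows the equation as printed.

```python
import cmath
import math

def sensor_response(weights, tau, omega):
    """H_i(omega) = sum_n w_i[n] * exp(j*omega*(n + tau_i)), as printed in the text."""
    return sum(w * cmath.exp(1j * omega * (n + tau)) for n, w in enumerate(weights))

def beamformer_response(all_weights, taus, omega):
    """H(omega) = sum over the M sensors of H_i(omega)."""
    return sum(sensor_response(w, t, omega) for w, t in zip(all_weights, taus))

def example_delays(theta_deg, spacing=2.0, n_sensors=3):
    """Hypothetical geometry (an assumption): delay of sensor i is i * spacing * sin(azimuth)."""
    s = math.sin(math.radians(theta_deg))
    return [i * spacing * s for i in range(n_sensors)]

# Made-up direction-independent FIR weights: M = 3 sensors, N = 2 taps each.
weights = [[1.0, 0.0], [0.5, 0.1], [0.25, 0.05]]

for theta in (0.0, 45.0, 90.0):
    taus = example_delays(theta)
    mags = [abs(beamformer_response(weights, taus, w)) for w in (0.5, 1.5, 2.5)]
    print(theta, [round(m, 3) for m in mags])
```

Only the delays change with direction in this sketch; the weights stay fixed, which is the property that distinguishes the beamformer description from a table of independent filters.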
In this model, variations in the HRTF with source azimuth arise from changes in the time delays τi for the individual sensors. Comparing Figures 6.1 and 6.2, you can see that the model by Batteau39 shown in Figure 6.1 can be viewed as a simple three-channel beam former model in which N = 0, since the output is composed of simple delayed versions of the input (where the delays vary with source position). Conceptually, individual reflections in the beam former model can be related to physical features of the pinnae; however, in practice, there is not a simple way to derive the model parameters from a physical description of the ear. In fitting the model parameters, Chen and colleagues chose various geometries for the sensor array, thus constraining N and M and determining the values {τi}. The model weights wi[n] were subsequently found by minimizing the error between model HRTFs
and a discrete set of measured HRTFs across frequency and source position (the HRTFs fit in this manner corresponded to cat HRTFs; however, the same approach can easily be applied to HRTFs for other species, including humans). Comparison between measured cat HRTFs and model results shows that the beamformer model yields very good results for sources between 0 and +90 degrees azimuth using 400 weights (8 sensors of 50 taps each). Although the beamformer model is conceptually straightforward, a great deal of work remains before such a model description can be derived directly from physical parameters.

The beamformer model provides an explicit mathematical characterization of the directional dependence of the pressure wave at the ear drum. The characterization is a continuous function of spatial direction even though discrete directions are used to determine model parameters. It may be generalized to echoic environments and can be customized to individual pinnae. Although no psychophysical studies have been conducted with this model as yet, such a formulation may have great potential for efficient binaural synthesis.

2.3.3. Eigenfunction models
A few researchers have used eigenvalue decompositions to encode HRTFs efficiently. In these approaches, an optimal set of orthogonal eigenfunctions is found to describe the HRTFs. Each HRTF can be represented as a weighted sum of the eigenfunctions, and is fully described by a series of coefficients (or weights) wi corresponding to the chosen basis functions. In these models, the eigenfunctions are used to capture the variability in the HRTFs (functions of frequency) with changes in source position. Thus, the HRTF for any given direction h(ω, θ, φ) is given by the equation:
$$h(\omega, \theta, \phi) = \sum_{i=1}^{M} q_i(\omega)\, w_i + q_0(\omega)$$
where qi(ω) denotes the Eigenfunctions (or the Principal Components, PCs), wi denotes the associated weights, q0(ω) represents a component common to all the HRTFs in the set (and may be thought of as the average, direction-independent transfer function, encoding direction-independent effects like canal resonance, etc.), and M represents the number of Eigenfunctions used in the model. The value of M can be reduced in this approach to reduce both computational and storage requirements; however, reducing M does reduce the veridicality of the HRTF approximation. Generally, the basis functions are ordered such that qi(ω) captures more of the variability in the output filters than does qj(ω) when i < j. These Eigenfunction models are derived from pure, abstract mathematical approaches designed to reduce the dimensionality of the HRTFs; as such, a physical interpretation of the model parameters is difficult
at best. In these models, the HRTFs (which are functions of frequency) are transformed into a new coordinate system in which the basis functions form the new orthogonal axes in the transformed space. The weights wi correspond to coordinates in the transformed space, describing how the HRTF functions change with source position in the transformed frequency space. Any individual differences in HRTFs are captured by differences in the set of weights {wi}.

In many ways, this approach is analogous to Fourier analysis. However, in Fourier analysis, sinusoidal basis functions are used to decompose the signals, and no particular encoding efficiency is gained by the decomposition. In Principal Components (PC) analysis (also known as Karhunen-Loeve or KL decomposition), the chosen basis functions are optimal for the given data so that most of the variability in the constituent HRTFs is captured using fewer weighting coefficients and fewer basis functions.

Two studies have used the above description to compute PC basis functions and PC weights for HRTF magnitude functions. One study by Martens46 computed the basis functions for HRTFs filtered through critical bands for 36 source positions in the horizontal plane. The study noted a systematic variation of the weights as source azimuth varied from left to right and front to back. A more recent study by Kistler and Wightman28 used PC analysis to compute a set of basis functions for HRTF magnitude spectra measured from 10 listeners at 265 source positions in both the horizontal plane and the vertical plane. Eigenfunctions were computed for the log-magnitude directional-transfer-functions (DTFs), obtained by subtracting the mean log-magnitude function from each HRTF. The study reported a systematic variation of the first principal component [q1(ω)] as source position was moved from one side of the head to the other. The higher order PCs were, however, less amenable to interpretation.
In both of the studies described above, only HRTF magnitude functions were modeled so that h(ω, θ, φ), qi(ω), and wi in the above equation were real-valued. Kistler and Wightman also tested the perceptual validity of the HRTFs constructed from their PCA model. Subjects were presented with binaural stimuli synthesized through the model. Because the model fit only HRTF magnitude functions, the phase functions for the models were obtained by assuming that the HRTFs were minimum-phase functions. A constant, frequency-independent, position-dependent ITD was measured for each source position. This delay was then introduced in the model HRTF for the lagging ear to make the overall model ITD consistent with empirical measures. With this approach, listener judgments of the apparent directions for stimuli synthesized in the model were similar to judgments for free-field stimuli, even when only 5 basis functions were used in the synthesis.
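The decomposition of log-magnitude functions into a mean function plus weighted basis functions can be sketched on toy data. The example below is an illustration only and makes several assumptions: the "spectra" are invented rather than measured, only the first principal component is extracted, and power iteration is used as a simple stand-in for a full eigenvalue decomposition. It is not the procedure of either study described above.

```python
import math

def mean_spectrum(spectra):
    """Average over source positions: an estimate of q0, the common component."""
    return [sum(col) / len(spectra) for col in zip(*spectra)]

def first_component(centered, iterations=50):
    """Leading eigenvector of X^T X by power iteration (one centered spectrum per row of X)."""
    n_freq = len(centered[0])
    q = [1.0 / math.sqrt(n_freq)] * n_freq
    for _ in range(iterations):
        # Project every spectrum onto q, then accumulate X^T (X q).
        proj = [sum(x * qk for x, qk in zip(row, q)) for row in centered]
        new_q = [sum(p * row[k] for p, row in zip(proj, centered)) for k in range(n_freq)]
        norm = math.sqrt(sum(v * v for v in new_q))
        q = [v / norm for v in new_q]
    return q

# Toy data (hypothetical): four "positions", four frequency bins, values in dB.
q0_true = [0.0, 3.0, -2.0, 1.0]
shape = [0.5, 0.5, -0.5, 0.5]            # single direction-dependent spectral shape
position_weights = [-4.0, -1.0, 2.0, 5.0]
spectra = [[m + w * s for m, s in zip(q0_true, shape)] for w in position_weights]

q0_est = mean_spectrum(spectra)
centered = [[x - m for x, m in zip(row, q0_est)] for row in spectra]   # the "DTFs"
q1 = first_component(centered)
w1 = [sum(x * qk for x, qk in zip(row, q1)) for row in centered]      # one weight per position
recon = [[m + w * qk for m, qk in zip(q0_est, q1)] for w in w1]
max_err = max(abs(a - b) for ra, rb in zip(spectra, recon) for a, b in zip(ra, rb))

print("first component:", [round(v, 3) for v in q1])
print("max reconstruction error:", max_err)
```

Because the toy data vary along a single spectral shape, one basis function reconstructs them essentially exactly; measured HRTFs need more components, and the number retained trades storage against fidelity.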
Chen47,48 applied the same analysis to the complex-valued HRTF (hence providing fits to both empirical magnitude and phase spectra). In addition, Chen fit a two-dimensional, thin-plate spline to the Eigenfunction weights wi (which they termed spatial characteristic functions or SCFs). This approach makes implicit use of the assumption that the weights change smoothly with changes in source position to allow prediction of HRTFs for positions other than those in the original data set. If wi (θ, φ) represents the SCFs expressed as a continuous two-dimensional function of source position, the HRTF may be modeled as:
$$h(\omega, \theta, \phi) = \sum_{i=1}^{M} q_i(\omega)\, w_i(\theta, \phi) + q_0(\omega)$$
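To show how position-dependent weights yield an HRTF at an unmeasured direction, the sketch below evaluates the expression above with weights interpolated over a small grid. It is a simplified illustration under stated assumptions: bilinear interpolation is swapped in for the thin-plate spline, the basis functions and weights are invented real numbers (the original analysis is complex-valued), and the names `bilinear` and `model_hrtf` are hypothetical.

```python
def bilinear(grid, az_step, el_step, az, el):
    """Bilinearly interpolate grid[i][j], sampled at az = i*az_step and el = j*el_step.

    Assumes az and el are non-negative and lie within the sampled range.
    """
    i = min(int(az // az_step), len(grid) - 2)
    j = min(int(el // el_step), len(grid[0]) - 2)
    fa = az / az_step - i
    fe = el / el_step - j
    return ((1 - fa) * (1 - fe) * grid[i][j] + fa * (1 - fe) * grid[i + 1][j]
            + (1 - fa) * fe * grid[i][j + 1] + fa * fe * grid[i + 1][j + 1])

def model_hrtf(q0, basis, weight_grids, az_step, el_step, az, el):
    """h(omega, az, el) = q0(omega) + sum_i q_i(omega) * w_i(az, el)."""
    w = [bilinear(g, az_step, el_step, az, el) for g in weight_grids]
    return [q0k + sum(wi * qi[k] for wi, qi in zip(w, basis)) for k, q0k in enumerate(q0)]

# Invented example: three frequency bins, M = 2 basis functions,
# weights known on a 2 x 2 grid at azimuths 0 and 30 and elevations 0 and 30 degrees.
q0 = [1.0, 2.0, 3.0]
basis = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
weight_grids = [
    [[0.0, 2.0], [4.0, 6.0]],   # w_1 sampled over (azimuth, elevation)
    [[1.0, 1.0], [3.0, 3.0]],   # w_2 sampled over (azimuth, elevation)
]

print(model_hrtf(q0, basis, weight_grids, 30.0, 30.0, 15.0, 15.0))
print(model_hrtf(q0, basis, weight_grids, 30.0, 30.0, 30.0, 0.0))
```

The storage cost is M weight surfaces plus M + 1 frequency functions, independent of how finely the direction is later sampled; a smoother interpolant such as a spline changes only the `bilinear` step.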
This formulation provides a description of the HRTF at any arbitrary elevation and azimuth (θ, φ) and is therefore capable of HRTF interpolation. The fidelity of model reconstructions using this method (with 12 Eigenfunctions) is reported to be very good.47 A real-time implementation of the model49 demonstrates the utility of this approach for VAS displays.

As noted above, the weights capture the dependence of HRTFs on source position and thus indirectly reflect differences in how an acoustic waveform is transformed by the head and pinnae of a listener for sources from different directions. Additional studies like those of Chen47 and Kistler and Wightman,28 which show how the model weights (and thus the relative importance of the different basis functions) depend on source position and on subject, are important for building intuition into how the model parameters relate to the physical acoustics from which the HRTFs derive. Ultimately, such models may make it easy to spatially interpolate HRTFs by predicting how the weights change with position, and to approximate individualized HRTFs by choosing the weights based upon physical characteristics of the listener. However, much more work will be necessary before the relationship between physical acoustics and model parameters is understood.

2.3.4. Neural-network model
The PC/KL approach is to find an optimal set of orthogonal basis functions to encode a multivariate function (the HRTF). Another standard approach is to choose a family of basis functions and then to optimally choose a fixed number of the members of this family to encode the multivariate function.50 With this approach, the HRTF corresponding to a specific source position in space is represented by a set of weighting parameters, similar to the PC/KL approach. The HRTF for each position is generated by multiplying each basis function by the appropriate weight and summing the results.
The family of basis functions is generally chosen to try to reduce the dimensionality
of the HRTFs while keeping the computation of the basis functions relatively simple and straightforward. Jenison and Fissell51 at the University of Wisconsin used this approach with two different families of radially-symmetric basis functions: Gaussian and von Mises-Fisher functions (for a review of these techniques, see Haykin52). A neural network was used to learn the input-output mapping from source position to weighting parameters. Results with this technique were promising. Similar to PC/KL studies, encoding HRTFs with radially-symmetric basis functions proved to be both computationally and storage efficient. In addition, the neural network was capable of estimating basis function weights for sources at positions in between the measured positions with reasonable success.

As with the Eigenfunction approach, the advantages of this modeling effort are that: (1) the HRTFs can be represented by a small number of weights for each source location; (2) spatial interpolation can be performed by interpolating the weights; (3) ultimately, individualized HRTFs could be approximated by appropriate choices of the weighting functions. However, as with the Eigenfunction approach, the model is derived from purely mathematical constraints. As such, a great deal of work remains before it will be possible to predict how physical parameters relate to the model parameters.

2.3.5. Rational-function models
Rational-function models constitute a class of solutions yielding parametric models of linear random processes. In their most general formulation, an input driving function u[n] and an output sequence x[n] are related by the linear difference equation:

$$x[n] = -\sum_{k=1}^{p} a[k]\, x[n-k] + \sum_{k=0}^{q} b[k]\, u[n-k]$$
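The difference equation translates directly into a short recursive filter. The following is a minimal sketch under the usual convention that a[0] = 1 and that samples before n = 0 are zero; the function name `arma_filter` and the coefficient values are illustrative assumptions.

```python
def arma_filter(a, b, u):
    """x[n] = -sum_{k=1..p} a[k] x[n-k] + sum_{k=0..q} b[k] u[n-k], with a[0] = 1.

    a has p + 1 entries, b has q + 1 entries; samples before n = 0 are taken as zero.
    """
    x = []
    for n in range(len(u)):
        acc = sum(b[k] * u[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * x[n - k] for k in range(1, len(a)) if n - k >= 0)
        x.append(acc)
    return x

impulse = [1.0, 0.0, 0.0, 0.0, 0.0]

# All-pole (AR) example: one feedback coefficient gives a decaying, infinitely long response.
print(arma_filter([1.0, -0.5], [1.0], impulse))

# All-zero (MA) example: the impulse response is just the b[k] coefficients.
print(arma_filter([1.0], [1.0, 2.0, 3.0], impulse))
```

The second call shows why a measured impulse response is already an MA description: its samples are the b[k] themselves.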
This description [called the Auto-regressive Moving Average or ARMA (p,q) model in statistical literature and equivalently described as a pole-zero model in signal processing parlance] is the most general form for a linear model. The driving noise of the model u[n] is an innate part of the model, giving rise to the random nature of the process x[n], and should not be confused with additive observation noise commonly encountered in signal processing applications (any observation noise needs to be modeled into the ARMA process by modification of its parameters). The studies described in this section use the variation of the HRTF with spatial position as the source of “randomness” in the model, although in fact these variations are deterministic. The transfer function H(z), describing the input-output function of the model, is given by the rational function:
$$H(z) = \frac{B(z)}{A(z)}$$

where A(z) and B(z) are equal to the z-transforms of the coefficients a[n] and b[n], respectively, i.e.,

$$A(z) = \sum_{k=0}^{p} a[k]\, z^{-k} \quad \text{and} \quad B(z) = \sum_{k=0}^{q} b[k]\, z^{-k}$$
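As a rough numerical illustration of the rational transfer function, the sketch below evaluates B(z)/A(z) on the unit circle for one conjugate pole pair and one conjugate zero pair. The pole and zero positions (radius 0.95 at angles π/4 and 3π/4) are arbitrary choices for the example, and the helper names are hypothetical.

```python
import cmath
import math

def poly_eval(coeffs, omega):
    """Evaluate sum_k c[k] z^{-k} at z = exp(j*omega)."""
    return sum(c * cmath.exp(-1j * omega * k) for k, c in enumerate(coeffs))

def freq_response(a, b, omega):
    """H(e^{j omega}) = B(e^{j omega}) / A(e^{j omega})."""
    return poly_eval(b, omega) / poly_eval(a, omega)

def second_order(radius, angle):
    """Coefficients of (1 - r e^{j angle} z^-1)(1 - r e^{-j angle} z^-1)."""
    return [1.0, -2.0 * radius * math.cos(angle), radius * radius]

# Arbitrary example: poles near omega = pi/4, zeros near omega = 3*pi/4.
a_coeffs = second_order(0.95, math.pi / 4)       # denominator A(z), roots are the poles
b_coeffs = second_order(0.95, 3 * math.pi / 4)   # numerator B(z), roots are the zeros

for omega in (math.pi / 8, math.pi / 4, math.pi / 2, 3 * math.pi / 4):
    gain_db = 20 * math.log10(abs(freq_response(a_coeffs, b_coeffs, omega)))
    print(round(omega, 3), round(gain_db, 1))
```

The gain is largest where the unit circle passes closest to the poles and smallest where it passes closest to the zeros, so four real coefficients are enough here to place one spectral peak and one notch.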
In these formulations, the transfer function H(z) is a ratio of two polynomial functions of the complex variable z. Evaluating H(z) along the unit circle z = e^{jω} yields the phase and magnitude of the HRTF for the frequency ω. The roots of the denominator A(z) and the numerator B(z) are respectively called the poles and zeros of the system. When the denominator coefficients a[n] are set to zero (except for a[0] = 1), the resulting process is called a strictly Moving-Average (MA) process of order q and represents an all-zero model. When numerator coefficients b[n] are set to zero (except b[0] = 1), the resulting process is called a strictly auto-regressive (AR) process of order p and represents an all-pole model.

The poles and zeroes of rational-function models can be directly related to spectral features in the HRTFs. When the value of z approaches a pole, the transfer function H(z) approaches infinity; when the value of z approaches a zero, H(z) approaches zero. Thus, a peak occurs at frequency ω whenever the value of e^{jω} is close to a pole of H(z); a notch occurs in the HRTF whenever the value e^{jω} is near a zero of H(z). It is also possible to relate the poles and zeroes to acoustical phenomena: poles correspond to acoustical resonances and zeroes correspond to acoustical nulls, both of which are caused by specific physical features of the listener’s head and pinnae. For instance, as discussed in section 2.3.2, spectral peaks and notches (poles and zeroes) can arise from comb-filtering effects due to reflections of the impinging waveform off the pinnae of the listener. Despite these intuitive interpretations of the poles and zeroes in rational-function models, these models are derived from purely mathematical techniques; as yet, it is impossible to relate all of the individual poles and zeroes to specific physical features of a listener.

The choice of which specific rational-function model (MA, AR, or ARMA) to use to describe the data is not always obvious.
The MA (all-zero) model is an efficient choice for representing HRTFs with deep valleys, but not those with sharp peaks. Conversely, an AR (all-pole) model is appropriate for spectra with sharp peaks, but not for spectra with deep valleys. A more general ARMA model can represent both these extremes. Empirical characterizations of HRTFs are usually in the form of MA (or all-zero) models, as they are generally derived
from time-domain impulse responses. Many researchers have investigated encoding these all-zero processes using either ARMA (e.g., Asano et al,53 Sandvad and Hammershøi,54 Bloomer and Wakefield,55 Bloomer et al56 and Kulkarni and Colburn57) or reduced-order MA models.58 Some of these approaches are reviewed here.

2.3.5.1. Pole-zero (ARMA) models
Pole-zero models of HRTFs require estimates of the a[k] and b[k] coefficients, described in the equation above, to obtain best fits for the HRTFs. The canonical least-squares error approach to obtain these coefficients involves the minimization of the error ε, given by:

$$\varepsilon = \int_{-\pi}^{\pi} \left| H(j\omega) - \frac{B(j\omega)}{A(j\omega)} \right|^2 d\omega$$
where H(jω) represents the measured HRTF, and A(jω) and B(jω) are the Fourier transforms of the coefficients a[n] and b[n] over which the minimization is to take place. This is a difficult problem to solve for a variety of reasons, including:
1. The solution is not linear.
2. The solution filter may be unstable, even though H(jω) is stable.
3. In order to minimize the total error across frequency, the solution filter preferentially fits peaks in the HRTF (regions of high spectral energy) more accurately than valleys (regions with low spectral energy). Since information in both the peaks and valleys may be important for sound localization, the obtained fit may not be the optimal solution, perceptually, even though it is optimal in the least-square-error sense.

The study by Asano et al53 used a linear modification of the least-squares problem proposed by Kalman59 to obtain the ARMA process coefficients. They used ARMA(40,40) filters to demonstrate an adequate fit to the HRTFs. Recent studies by Bloomer et al56 and Kulkarni and Colburn57 have reported new techniques to obtain filter coefficients. These studies found reasonable fits with fewer than half the coefficients reported by Asano et al.53 A major difference in these studies is that a logarithmic error measure is used. The model filter coefficients are obtained by the minimization of:

$$\sigma = \int_{-\pi}^{\pi} \left| \log H(j\omega) - \log \frac{B(j\omega)}{A(j\omega)} \right|^2 d\omega$$
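The practical difference between the two error measures can be seen with a two-bin toy spectrum. This sketch approximates each integral by a sum over frequency samples and assumes, for simplicity, that the logarithmic measure compares magnitudes; the numbers are invented and the function names are hypothetical.

```python
import math

def linear_error(h_meas, h_model, d_omega=1.0):
    """Discrete approximation of the integral of |H - B/A|^2."""
    return sum(abs(hm - hx) ** 2 for hm, hx in zip(h_meas, h_model)) * d_omega

def log_error(h_meas, h_model, d_omega=1.0):
    """Discrete approximation of the integral of (log|H| - log|B/A|)^2 (magnitudes assumed)."""
    return sum((math.log(abs(hm)) - math.log(abs(hx))) ** 2
               for hm, hx in zip(h_meas, h_model)) * d_omega

# Invented two-bin "HRTF": one spectral peak (gain 10) and one notch (gain 0.1).
measured = [10.0, 0.1]
misses_peak = [5.0, 0.1]     # peak wrong by a factor of two, notch exact
misses_notch = [10.0, 0.2]   # notch wrong by a factor of two, peak exact

for name, model in (("misses peak", misses_peak), ("misses notch", misses_notch)):
    print(name, linear_error(measured, model), log_error(measured, model))
```

On the linear scale the notch error is thousands of times smaller than the peak error and would barely influence a fit, whereas on the logarithmic scale the two factor-of-two errors count equally.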
Both Bloomer et al56 and Kulkarni and Colburn57 make use of the minimum-phase assumption for HRTF impulse responses. As a result,
the best-fit filter is restricted to be stable. Whereas Bloomer et al introduced an efficient gradient search algorithm to minimize the error, Kulkarni and Colburn proposed a modified weighted-least-squares solution to obtain comparable results. An ARMA(6,6) model of the directional transfer function (DTF) was found with this approach. Psychophysical testing of this low-order ARMA model of the DTF (used in conjunction with the empirically-determined omnidirectional transfer function) showed that ARMA models can be extremely efficient in storage and computation without compromising perceptual results.

Bloomer et al56 and Kulkarni and Colburn57 also used the ARMA descriptor of HRTFs to provide insight into the physical processes underlying HRTF generation. The trajectories of the poles and zeros of the model HRTF [which correspond to the roots of the denominator and numerator polynomials A(z) and B(z), respectively] can be studied as source location is varied. The resulting pole-zero loci may be used to interpret the physical filtering performed by the pinna. Bloomer et al are pursuing methods for using the trajectories of these poles and zeroes in order to perform HRTF interpolation. Along similar lines, Kulkarni and Colburn have suggested the decomposition of the ARMA equation into parallel second-order sections. This provides a description of the pinna as a distributed system of 2nd order directional resonators, the outputs of which combine constructively or destructively to cause the characteristic spectral peaks and notches of HRTFs. This approach allows the possibility of associating specific anatomical features with the resonators. The resonances are mathematically described by simple 2nd order system equations.
Moreover, from an implementation standpoint, each resonator has a very short intrinsic latency (being composed of only two delay elements), making this parallel architecture attractive for use in dynamic VAS simulations where fast filter update rates are required.

2.3.5.2. Reduced-order, all-zero (MA) models
As noted previously, the empirical HRTF is an all-zero filter corresponding to a strictly MA process. Relying on the fact that not all the information in the HRTF magnitude spectrum is perceptually important, Kulkarni and Colburn have studied two model-order reduction schemes for MA HRTF models.58 Both schemes use the minimum-phase HRTF. The first method utilizes the minimum-energy-delay property of minimum-phase sequences. It can be shown that the partial energy E in the first p taps of an n-tap filter, given by:

$$E = \sum_{k=0}^{p-1} h[k]^2,$$
is greatest when h[k] is the minimum-phase sequence, compared to all other sequences with the same magnitude spectrum. Thus, truncating the minimum-phase sequence to the first p taps is equivalent to finding
the optimal (i.e., minimizing the error in overall energy) order p model of the measured HRTF. The second scheme proposed by Kulkarni and Colburn58 begins by finding the Fourier series representation of the HRTF magnitude spectrum. This series can then be low-pass filtered, smoothing the magnitude spectrum. The resulting smoothed fit of the original HRTF can be shown to be optimal in the least-squares sense.29 The smoothing causes a reduction in the order of the output FIR filter that is inversely proportional to the extent of the low-pass filter’s window. Both of these techniques have been used to approximate 512-tap HRTFs using 64-tap impulse responses. Psychophysical tests of these (significantly reduced) HRTF representations suggest that they provide adequate simulations for many applications.

2.3.6. Interaural-spectrum model
A modeling approach by Duda60 examines variations in the interaural magnitude spectrum (the ratio of the left and right ear magnitude spectra) with variations in source position. The interaural log-magnitude spectrum has a periodic variation with source azimuth and elevation (due to simple geometric constraints) which is fit by a Fourier series. Using only two of the terms from the Fourier series, a good fit to the macroscopic details of the interaural magnitude spectrum was obtained. This model implies that interaural differences alone can be used to resolve the cone of confusion. The idea that interaural difference cues across frequency can provide an unambiguous estimate of source location has also been proposed by Peissig et al61 and Martin.62 This work suggests a more sophisticated role for the binaural system in source localization; the model, however, does not provide a method for binaural synthesis.
2.4. PSYCHOPHYSICAL EVALUATION OF HRTF IMPLEMENTATIONS
The process of model development consists of two main stages. The first stage consists of deriving the mathematical structure of the model and performing any required signal analysis. The second stage involves the psychophysical evaluation of the model. Unfortunately, many models are never evaluated in this way. As we have noted before, abstract, signal-analysis type measures of a model’s “goodness-of-fit” are inappropriate; a low dimensional model could provide an excellent simulation despite failing to meet some arbitrary fitting criteria. Quantitative measures of psychophysical performance, as opposed to abstract error measures and subjective reports of perceived sound quality, provide a robust metric that can be used to compare different models and their shortcomings. Although the need for a good, psychophysically-based metric is obvious, quantifying the various aspects of perceptual experience is challenging. Psychophysical dimensions may not be orthogonal and
application-dependent tradeoffs may be necessary. In this section, we discuss experiments that begin to address the question of how psychophysical performance of subjects depends upon the HRTF model employed.

One of the most fundamental questions that can be asked about an HRTF encoding scheme is how closely the apparent location of a synthesized source matches its intended location. Two paradigms for examining this question have been reported in the literature. The first compares how subjects localize synthesized and natural, free-field sources, usually in an absolute identification paradigm. The second approach measures the discriminability of model HRTFs versus empirically measured HRTFs by requiring subjects to compare the perceived quality of virtual stimuli processed through both.

Asano et al53 studied median-plane sound localization using ARMA models of HRTFs. In their experiments, details in the transfer-function were smoothed by different amounts by parametrically varying the order of the ARMA model. In this way, the investigators examined the importance of microscopic and macroscopic patterns in the HRTF for median-plane localization. The study reported that front-back judgment information was derived from the microscopic details in the low-frequency regions (below 2 kHz) and macroscopic details of the high-frequency regions of the HRTF. Macroscopic details in the high frequency region (above 5 kHz) of the HRTF appeared to encode elevation information.

Kistler and Wightman28 evaluated their PC model of HRTFs by using an absolute identification paradigm. Subjects reported the perceived location of sounds simulated through model HRTFs that were constructed from five Eigenfunctions. Performance was comparable for the VAS stimuli and free-field sources.
In an evaluation of sensitivity to HRTF phase spectra, Kulkarni et al26 required subjects to discriminate between sounds synthesized using empirical HRTFs and sounds synthesized with HRTFs that had identical magnitude spectra but simplified phase spectra. Discrimination performance was at chance for the HRTFs, suggesting that details of the phase spectra are not perceptually important. The same discrimination paradigm was also used by Kulkarni and Colburn in their ARMA and MA modeling studies.57,58 These studies roved the overall level of test stimuli to preclude the use of absolute intensity cues that may be present in the stimuli due to disparities between the model HRTFs and empirical HRTFs (roving the overall level does not prevent the use of any directional information in the encoded HRTFs). Subjects were unable to discriminate between the reduced-order HRTFs and the empirical HRTFs. The model orders tested included a 6-pole, 6-zero ARMA model of the directional transfer function (in series with the nondirectional transfer-function)57 and two 64-tap MA models of the HRTF.58
An early study by Watkins43 was designed to test Batteau’s “two-delay-and-add” physically-based model of the pinnae. Systematic measurements of perceived location were taken from a number of observers. The study showed that white noise stimuli passed through the two-delay-and-add system (and presented over headphones) were perceived at appropriate elevations when one of the delays was varied. These results were consistent with the idea that subjects are sensitive to spectral notches that arise from the spectral effects of the two reflections in the model. This experiment demonstrates that a relatively simple model can convey some spatial information, even if it fails to create completely natural sounding stimuli.

A similar experimental approach is now being undertaken both at the University of Wisconsin and at Boston University to test the effectiveness of VAS simulations. These experiments involve performing a task in which subjects are asked to report the direction of a stimulus played from either a pair of open-air headphones worn by the subject or a speaker located in the free-field. If the headphone stimulus is filtered with the HRTFs appropriate for the location of the free-field speaker, the perceived locations should be identical to the perceived locations of the free-field stimuli. Systematic manipulation of the headphone stimulus can then help determine the attributes of the signal that contribute to the appropriate localization of the sound. This type of experimental approach may prove to be an extremely effective way of validating VAS displays.

A study by Wenzel et al10 has explored the use of nonindividualized HRTFs and the resulting perceptual deficiencies. In that study, inexperienced listeners judged the apparent direction (in both azimuth and elevation) of noise bursts presented either in the free-field or over headphones.
Comparison of the two conditions suggests that the horizontal location of sound was preserved robustly in the headphone stimuli, while vertical location was not perceived correctly by all subjects. The study also reported an increase in the number of front-back confusions in the headphone presented stimuli. These experiments demonstrate that there are many ways in which HRTFs can be encoded and simplified while still retaining much of their spatial information. Significantly reducing the dimensionality of the HRTF representation in different ways may result in only negligible decrements in performance. However, it should be noted that care must be taken when trying to select an HRTF encoding scheme. Even though performance on one psychophysical test (such as determining the direction of a source) may be more than adequate for a given application, performance in some other dimension may be adversely affected (for instance, externalization may be disrupted). All of the different aspects of auditory space must be considered in evaluation of an HRTF model. Since different applications require that different details of the VAS be veridical, it is necessary to first identify what
performance characteristics are most important for a given application. These performance characteristics must then be tested directly in order to verify that the chosen model is appropriate for the task at hand.
2.5. FUTURE WORK
A number of interesting models have been proposed to improve VAS implementation efficiency, many of which we discussed in the sections above. Nearly all of the proposed models reduce the storage requirements of HRTFs and most can be implemented using computationally efficient algorithms. However, little effort has been expended in trying to develop models that can not only be stored efficiently, but which can be interpolated to synthesize HRTFs at intermediate locations. Of all the models reviewed, that of Chen et al,47 which uses the method of thin-plate spline interpolation, appears to be most promising in this area. It is important to pursue efforts to develop adequate interpolation schemes in order to ensure accurate, memory-efficient VAS displays that can display sources moving smoothly through space.

Another area which needs to be addressed more systematically and thoroughly is the incorporation of reverberation in VAS simulations. The use of room transfer functions (that incorporate reverberant effects in the measured impulse responses) is an expensive solution in terms of storage, computational power, and system update rate. Room-acoustics modeling, as discussed by Kendall and Martens17 and Lehnert and Blauert,63 is a promising alternative approach to the development of reverberant VAS. In addition, there is some evidence that many aspects of echoes are poorly perceived (cf. the precedence effect64-69). We must obtain more knowledge about how reflections are perceived in natural environments in order to simplify room models. Historically, models of this sort have been developed for the field of architectural acoustics. As such, these models have not been constrained to meet real-time computational requirements. As computational power continues to grow, it will be possible to implement existing, complex models of rooms in real time.
However, given the current state of the art, efforts to find computational short cuts for realizing these complex (computationally expensive) models are crucial to the development of realistic, real-time models of reverberant environments. Additional effort is needed to incorporate other natural acoustical effects like scattering and dispersion to create more realistic models in the future. The current methods of VAS synthesis use time-domain convolution to render virtual acoustical stimuli. Given the speed of available processors, this technique is robust and fast but may become quite expensive as the complexity of the simulation is increased. An alternative approach would be to implement filtering in the frequency domain, using FFT algorithms to convert the incoming and outgoing time-
domain waveforms into a discrete Fourier transform representation. While this approach can significantly reduce the number of mathematical operations needed to filter the input source, the technique requires a fairly large number of time-domain samples to be processed simultaneously. This requirement translates into a significant delay in the system. Current processing speeds are only now becoming fast enough to allow frequency-domain processing with acceptably small latencies. Other issues must also be resolved in trying to implement a frequency-domain technique. For example, mechanisms for simulating smoothly moving sources must be developed and issues of how to interpolate between frequency-domain filters must be addressed. Much work remains in trying to develop a frequency-domain processing algorithm for VAS; however, there are many reasons that such an effort should result in a more efficient and cost-effective system. It is obvious that understanding human perception is central in trying to address many of the problems stated thus far. In particular, our ignorance of the basic auditory abilities of the perceiver places limits on our ability to engineer perceptually-adequate sound synthesizers. There are probably numerous other approximations and simplifications that can be made which will reduce the computational burden on acoustic displays without any perceptible loss in performance. Experiments designed specifically to examine what cues are salient for different aspects of sound localization are underway in many laboratories around the world. These experiments include examinations of distance perception, sound externalization, perception of nonindividualized HRTFs, adaptation to unnatural HRTFs, and a whole range of other topics. Studies investigating dynamic aspects of VAS are relatively rare (e.g., where either simulated source or listener move). 
As more applications are developed that require interactive VAS displays, it will be important to set out the basic psychophysical guidelines to assist in the design of such systems. There is now a substantial body of literature on the perception of moving sources.70-72 However, further studies must be performed, since much of this work was not designed to answer the basic questions about how to implement a dynamic VAS. Finally, as VAS displays become more popular, it is becoming increasingly important to develop some method for tailoring HRTFs to the individual. This may mean developing a universal library of HRTFs from which listeners can choose HRTFs that are close to their own or perfecting algorithms for synthesizing individual HRTFs given a small number of measurements.
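The trade-off between time-domain convolution and FFT-based filtering described in this section can be illustrated with a minimal sketch. It is not taken from any system discussed in the chapter: the function names (fft, direct_convolve, overlap_add, block_latency_ms) are hypothetical, the filter is an arbitrary short impulse response rather than a measured HRTF, and overlap-add is only one of several standard block-filtering schemes. A production system would normally call an optimized FFT library rather than the small recursive transform written out here.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def direct_convolve(x, h):
    """Time-domain convolution: len(x) * len(h) multiply-adds."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def overlap_add(x, h, block):
    """Filter x with h one block at a time in the frequency domain."""
    n_fft = 1
    while n_fft < block + len(h) - 1:   # room for the full linear convolution
        n_fft *= 2
    H = fft(list(h) + [0.0] * (n_fft - len(h)))
    y = [0.0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = list(x[start:start + block])
        X = fft(seg + [0.0] * (n_fft - len(seg)))
        y_seg = fft([a * b for a, b in zip(X, H)], inverse=True)
        # Add this block's output, including the tail that overlaps the next block.
        for k in range(min(n_fft, len(y) - start)):
            y[start + k] += y_seg[k].real / n_fft
    return y

def block_latency_ms(block, sample_rate_hz):
    """Minimum buffering delay: a whole block must arrive before it is filtered."""
    return 1000.0 * block / sample_rate_hz

# Usage: both routes give the same output up to rounding error.
x = [1.0, 0.5, -0.25, 0.0, 2.0, -1.0, 0.75, 0.1, -0.6, 0.3]
h = [0.5, 0.25, -0.125]
max_error = max(abs(a - b) for a, b in
                zip(direct_convolve(x, h), overlap_add(x, h, block=4)))
```

The block length is the design parameter that couples efficiency to delay: larger blocks amortize the FFT cost over more samples, but the buffering delay returned by block_latency_ms grows in proportion.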
3. APPLICATIONS
The most distinctive aspect of VAS displays is their inherent flexibility. This flexibility guarantees that VAS displays will be useful in a
wide variety of applications, both for scientific study and for prototyping of next-generation display systems for many complex tasks. Unfortunately, technical limitations restrict the utility of the displays for some applications: for instance, whenever realism is crucial or whenever update rates must be extremely fast. On a more positive note, most such problems are likely to be solved, or at least ameliorated, in the near future. The following survey of application areas is not intended to be an all-inclusive list, but rather gives a general overview of the variety of ways in which VAS displays are currently being used. This review does not examine issues of how the source carrier signal is selected or generated (e.g., how to select between different source sounds such as speech, music, tones, noise, or how to generate different acoustic waveforms), but rather focuses on how the spatial information carried in the signal can be utilized for a variety of tasks. Applications discussed below were chosen to illustrate the ways in which VAS techniques provide a unique solution for a given task. The unique attributes of VAS displays which are addressed are their ability to manipulate auditory spatial information in ways previously impossible and their ability to present arbitrary information to a human observer by encoding it as acoustic spatial information. The first issue (of flexibility in the control of auditory spatial cues) makes VAS powerful and useful for studies of spatial auditory perception, and in the study of how auditory spatial information is integrated with information from other sensory modalities. Both issues (of control flexibility and of the ability to encode any information in spatial auditory cues) make VAS systems useful for a wide variety of real-world spatial tasks by allowing the presentation of nonacoustic information, of information from a simulated environment, or of remote information via spatial auditory cues.
3.1. PSYCHOPHYSICAL AND PHYSIOLOGICAL STUDY
A number of investigators have used VAS to study aspects of auditory spatial perception. Many such studies examine phenomena not easily studied using more traditional methods (either using free-field stimuli or stimuli with simple interaural time and/or intensity differences). Some of this research has actually been driven by a desire to design more effective VAS displays. These studies help to improve our understanding of the importance of spatial auditory information and the relative importance of different spatial auditory cues.
3.1.1. Auditory spatial perception
The use of VAS displays in psychophysical and physiological study is becoming more and more common. The advantages of these systems for performing physiological study have already been discussed extensively in the previous chapters. Many examples of the use of VAS systems for psychophysical study were reviewed in section 2 in this chapter, since these studies were designed to validate the very VAS
displays they employed. For instance, there is much work examining how HRTFs encode spatial information and what aspects of HRTFs contain the most important and salient spatial information. Examples of this type of research include the study of the importance of individualized HRTFs,10 work on the effects of spatial interpolation of HRTFs,14,73 various studies comparing free-field perception with perception of stimuli simulated by VAS systems,74-77 and work examining psychophysical sensitivity to details in HRTF phase information.26 Other studies have used VAS systems to manipulate acoustic spatial cues in new ways simply to discover more about normal spatial auditory perception. For instance, work by Wightman et al25 has examined the relative importance of interaural delay information in low- and high-frequency carrier signals. In this work, monaural and binaural spectral cues were chosen to be consistent with one source location while the interaural delays were set consistent with some different location. Subjects tended to base location judgments on the localization information conveyed by the interaural delay information, provided low-frequency energy was present in the signal. For signals containing only high-frequency energy, localization judgments were based primarily on spectral cues. This study demonstrated that low-frequency interaural delay information is more salient than are interaural and monaural spectral cues. VAS techniques made it possible to separately control spectral cues and interaural timing cues in this study. Generation of stimuli that contained spectral cues consistent with one location and interaural timing information consistent with a different location would be extremely difficult with more traditional psychophysical methods. Wightman and his colleagues performed a number of other investigations using VAS displays to control auditory spatial cues.
In one study,78 they demonstrated that monaural localization is extremely poor when performed using a VAS display that does not incorporate changes in cues with listener head movement. Subjects with one ear occluded in a free-field control condition performed much better on the same task even when they were asked to hold their heads still. Two possible factors were postulated as contributing to this difference for the free-field, monaural condition: (1) it is possible that subjects made small head movements, even though they were instructed to remain still, and (2) the blocked ear may still have received some salient acoustic information from acoustic leakage. In a later study, Wightman and colleagues demonstrated that incorporating head movement cues in a VAS display reduced localization errors, particularly for front-back confusions (whereby a source in front of the listener is mistaken as coming from a position behind the listener, or vice versa).79 Beth Wenzel and her colleagues at NASA Ames Research Center in California have performed numerous studies of human spatial perception using VAS displays.10,73,74,80 A recent example of their efforts investigated the relative importance of ITDs and ILDs in conjunction
with cues from head motion.81 In this study, subjects were asked to localize sound sources under six different cue conditions. In a given trial, a source was presented with: (1) normal ITDs and ILDs; (2) normal ITDs, but ILDs for a source at (azimuth, elevation) = (0, 0); or (3) normal ILDs, but ITDs for a source at position (0, 0). Each of these conditions was presented (1) without any head-motion cues or (2) with head-motion cues controlling the normal cues (i.e., either ITD and ILD, ITD only, or ILD only, changed with head motion). Results from this study imply that correlation between head motion and either ILD or ITD can help to resolve source position. As with the study of Wightman et al,25 the ability to separate the ITD and ILD cues (and to separate the effects of head motion from static localization cues) is only possible through the use of VAS techniques. Other examples of perceptual studies using VAS displays are found in the work at Wright-Patterson Air Force Base in Ohio,82-84 and Boston University.14,16,26 A number of investigators85-90 have used VAS techniques to study how spatial cues affect the perception of a signal in a noisy background. Normal listeners show a great benefit in perceiving a signal in the presence of noise if the signal and noise arise from different locations in space (or, using traditional headphone techniques, if the signal and noise have different interaural differences). This binaural gain can be measured both in detection experiments (where it is referred to as the binaural masking level difference or BMLD) and in speech discrimination tasks (where it is known as the binaural intelligibility level difference or BILD).2 For normal listeners, this binaural advantage is extremely useful in everyday situations where there are multiple sound sources present (e.g., in a restaurant, in a car or train, etc.).
For many hearing-impaired listeners, the most troubling aspect of their hearing loss is a decrease in the BILD and the corresponding difficulties in understanding speech in noisy environments. Because of the practical importance of the BMLD and BILD, it is crucial to determine what spatial cues give rise to the binaural advantage. As with other psychophysical studies, VAS techniques allow researchers studying the BMLD and BILD to control the cues that are presented to subjects, systematically measuring the relative importance of every possible cue both in isolation and when combined with other cues. Bronkhorst and Plomp performed a number of studies to determine exactly what cues give rise to the BILD. In these studies, interaural level and timing differences were estimated from binaural recordings taken with a KEMAR mannequin. These differences were then used to synthesize binaural stimuli containing only ITDs, only ILDs, or both cues, both for normal86 and for impaired listeners.85 This technique allowed the researchers to examine the relative importance of ITDs and ILDs for the reception of speech in a noisy background, and to determine how the ITD and ILD cues combine under normal listening conditions.
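The idea of presenting ITDs alone, ILDs alone, or both together can be sketched in a few lines. This is a deliberately simplified illustration, not the procedure used in the studies cited above: those studies derived their interaural differences from KEMAR recordings, whereas the sketch applies a single broadband delay (in whole samples) and a single broadband level difference. The function name lateralize and its sign conventions are assumptions made for the example.

```python
def lateralize(mono, itd_samples=0, ild_db=0.0):
    """Return (left, right) channels built from one mono signal.

    Assumed conventions: positive itd_samples means the right ear leads;
    positive ild_db means the right ear is louder. The two cues are set
    independently, so they may agree, conflict, or be presented alone.
    """
    lag = abs(itd_samples)
    lead = list(mono) + [0.0] * lag      # ear that receives the sound first
    lagged = [0.0] * lag + list(mono)    # ear that receives it `lag` samples later
    left, right = (lagged, lead) if itd_samples >= 0 else (lead, lagged)

    gain = 10.0 ** (-abs(ild_db) / 20.0)  # attenuation applied to the quieter ear
    if ild_db >= 0:
        left = [gain * s for s in left]
    else:
        right = [gain * s for s in right]
    return left, right

# Usage: the three kinds of stimuli described in the text.
click = [1.0, 0.0, 0.0]
itd_only = lateralize(click, itd_samples=2)
ild_only = lateralize(click, ild_db=6.0)
both_cues = lateralize(click, itd_samples=2, ild_db=6.0)
```

Because the two arguments are independent, the same function can also produce conflicting cues (for example, an ITD favoring one side and an ILD favoring the other), which is the kind of manipulation that is hard to achieve with free-field loudspeakers.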
Similar work by Carlile and Wardman87 examined how increasing the realism of the spatial cues presented to a subject affected the BMLD, both for relatively low- and high-frequency sources. The use of VAS techniques enabled these researchers to separate from other effects the detection gain due to differences in the signal-to-noise ratio that arise due to head shadow. In addition, the same techniques made it possible to demonstrate that the BMLD arises from information combined across multiple critical bands for high-frequency signals, but that it depends only on information within a critical band for low-frequency signals. Finally, the intelligibility advantage of spatializing sound sources has been investigated for real-world applications by Begault and his colleagues at NASA Ames Research Center.88-90 These studies have demonstrated the advantage of using VAS displays to help speech intelligibility for low-pass speech, such as is encountered in ordinary telecommunications,90 as well as for speech tokens used in communications at the Kennedy Space Center.88-90 VAS displays enable localization studies which are nearly impossible using traditional psychophysical methods. For instance, separating the effects of listener movement from the effects of other localization cues could only be performed with cumbersome physical setups in the past (e.g., the study by Wallach91 which employed an array of speakers that were selectively activated by electrical switches). In addition, some earlier methods for trying to separate the effects of various localization cues are imperfect. As postulated by Wightman et al,78 it is possible that subjects in previous studies of monaural localization received attenuated information from their physically occluded ear as well as from the unblocked ear. 
Similarly, it is possible that even small head movements made by subjects who were instructed to hold their heads still affected results in free-field studies that purported to remove dynamic cues.78 VAS techniques make it possible to control the types of cues available to the listener with a much finer degree of accuracy than has been possible with other techniques, thereby enabling studies that have been impractical before and/or providing a check on results from earlier studies.
3.1.2. Adaptation to distorted spatial cues
A related area of psychological study is the study of adaptation to spatial cues which are inconsistent across modalities. Studies of sensorimotor adaptation have examined the question of what occurs when visual, proprioceptive and auditory spatial cues give different information about where an object is located (for example, see the study by Canon92). Typically, a physical device distorts cues from one modality so that the spatial information from that modality gives erroneous information, while sensory information from other modalities remains undistorted.
Studies of adaptation to all types of intermodal discrepancies are reviewed in Welch,93 while a review concentrating on adaptation to intermodal discrepancies between audition and other modalities is found in Shinn-Cunningham et al.1 Such studies find the relative perceptual weight that each modality carries for an observer, as well as examine how quickly and completely subjects can overcome errors in localization caused by the erroneous spatial information. These studies add to our basic understanding of spatial perception and perceptual plasticity and inform how to design displays for spatial tasks. Earlier studies of sensorimotor adaptation relied on cumbersome physical devices to distort spatial cues. For instance, visual cues were distorted by the use of prism goggles (e.g., the study by McLaughlin and Rifkin94) while auditory cues were distorted through the use of a “pseudophone” (a device consisting of stereo microphones which are displaced relative to the ears; e.g., see the study by Held95). These same studies can now be undertaken by using VAS displays to distort auditory cues or using other VE displays (such as head-mounted visual displays) to distort visual or proprioceptive cues. These VE technologies enable researchers to arbitrarily distort the spatial cues received by subjects and provide a powerful tool for future work on sensorimotor adaptation. Virtual environment technology is being applied to the study of sensorimotor adaptation at the Research Laboratory of Electronics at the Massachusetts Institute of Technology. In one study, hand-eye discrepancies are introduced by computer-generated visual and proprioceptive cues. 
More relevant to the focus of the current book, adaptation to distorted auditory spatial cues is also being examined.13,96 Both studies are motivated in part by the realization that VE technologies are imperfect and will introduce intermodal discrepancies (because of temporal inconsistencies between different display devices, imperfections in the displays’ resolutions, and other unavoidable technical problems). Because of these inevitable discrepancies, it is important to learn more about the effects of the discrepancies if one hopes to use such systems for training, displaying important spatial information, or for other tasks. In this way, these studies are inspired by the desire to design effective virtual displays. While one of the goals of the auditory adaptation study at MIT is to understand how well subjects can overcome intermodal discrepancies in general, another goal is to see if subjects can achieve better-than-normal localization when acoustic spatial cues are emphasized. This aspect of the study is motivated by the observation that in a VAS display, the mapping between physical cue and spatial location can be set however you desire. For instance, HRTFs from a person with a head and pinnae twice the normal size could be used in the display just as easily as can “normal” HRTFs. With such “supernormal” HRTFs, two spatial locations that normally give rise to physical
cues which can barely be discriminated should be much easier to discriminate. The idea of generating supernormal cues is not new; previous attempts have used pseudophones where the intermicrophone distance is larger than the inter-ear distance (for early examples of this, see Wenzel,74 also see the work of Wien97). The ability of VAS technology to easily create supernormal cues is one aspect of the displays that makes them unique compared to more traditional psychophysical techniques. Supernormal cues could conceivably allow subjects to perform more accurately on localization tasks than is possible with normal cues. For many of the proposed applications of VAS displays, increasing the resolution achievable on spatial tasks may be extremely useful. However, changing the mapping between physical spatial cues and the corresponding source position not only affects discriminability of different source positions, but the perception of absolute position of the source. For example, a source which is slightly right of center will result in interaural timing differences that are larger than the normal ITDs for that position, and subjects are likely to mislocalize the source as farther to the right than it is. Therefore, the supernormal localization study at MIT is examining how emphasizing acoustic spatial cues affects resolution on auditory localization tasks (that is, whether supernormal performance is possible) and whether absolute errors in localization can be overcome as subjects adapt to the supernormal cues (similar to the goals in traditional sensorimotor adaptation studies). The MIT study has focused on adaptation to emphasized azimuth cues only (for details about how supernormal cues were generated see Durlach et al13). Subjects adapt to the supernormal auditory localization cues, but adaptation is incomplete. As hoped, resolution on auditory localization tasks is better with the supernormal cues. 
However, as subjects adapt to overcome the errors in their absolute judgments of auditory source position, resolution decreases. These results were explained by a preliminary psychophysical model of adaptation.96 In the model, the decrease of resolution with time occurs as a result of the adaptation process. As subjects adapt to the supernormal cues, they must attend to a larger range of physical cues. As in most psychophysical tasks, resolution decreases when the range of stimuli increases. For instance, subjects can easily discriminate a source at zero degrees azimuth from one at five degrees azimuth in a task where only those two positions are presented. However, the same subjects will often confuse these two positions if sources can come from one of many positions around the listener (say, from one of thirteen positions ranging from -30 degrees to +30 degrees in azimuth). This dependence on range is usually explained as arising from high-level factors like memory limitations.98,99 In the supernormal localization model, the range monitored by subjects increases as they adapt, so resolution is predicted to decrease with adaptation, consistent with the experimental results.
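The two consequences of supernormal cues discussed above (larger cue differences between neighboring positions, and a shift in apparent absolute position before adaptation) can be illustrated with a rough numerical sketch. It does not reproduce the transformation used in the MIT study. Instead it assumes the doubled-head example from the text, a rigid spherical head, and the classical Woodworth approximation for the ITD; the head radius, speed of sound, and all function names are illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s (assumed)
HEAD_RADIUS = 0.0875     # m, a commonly quoted average (assumed)

def woodworth_itd(azimuth_deg, radius=HEAD_RADIUS):
    """Woodworth spherical-head ITD in seconds, for azimuths within +/-90 degrees."""
    theta = abs(math.radians(azimuth_deg))
    itd = (radius / SPEED_OF_SOUND) * (theta + math.sin(theta))
    return math.copysign(itd, azimuth_deg)

def azimuth_from_itd(itd, radius=HEAD_RADIUS):
    """Invert woodworth_itd by bisection; saturates at +/-90 degrees."""
    target = min(abs(itd), woodworth_itd(90.0, radius))
    lo, hi = 0.0, 90.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if woodworth_itd(mid, radius) < target:
            lo = mid
        else:
            hi = mid
    return math.copysign(0.5 * (lo + hi), itd)

def apparent_azimuth(true_azimuth_deg, head_scale=2.0):
    """Azimuth an unadapted listener would infer from an ITD generated
    with a head `head_scale` times the normal size."""
    enlarged_itd = woodworth_itd(true_azimuth_deg, head_scale * HEAD_RADIUS)
    return azimuth_from_itd(enlarged_itd)

# Usage: a source 10 degrees right of center is displaced farther to the right,
# and the ITD difference between 0 and 5 degrees is doubled.
shifted = apparent_azimuth(10.0)
cue_gain = woodworth_itd(5.0, 2.0 * HEAD_RADIUS) / woodworth_itd(5.0)
```

Under these assumptions the enlarged ITDs exceed the largest normal ITD well before the source reaches 90 degrees, so the inverse mapping saturates. The sketch models only the physical cue; it says nothing about the range effects and memory limitations that, according to the model described above, ultimately limit resolution.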
The supernormal adaptation study illustrates a number of interesting points about applications involving VAS displays. Designers of VAS systems have great freedom in how they encode information to be presented in the display and can try to make information extremely easy to extract. However, there are many, many factors that impact whether information that is theoretically available in the physical cues presented to a listener is actually perceivable by that listener. In the supernormal localization study, physical cues may be larger than normal; however, high-level cognitive factors ultimately limit localization performance, not the size of the physical cues.
3.1.3. Providing better cues for distance
As has been discussed already, normal cues for distance are not very salient. Many of the physical distance cues are ambiguous, and the “unambiguous” cues often are not easily perceived by the subjects.7,100,101 In fact, most VAS systems do not encode distance except by altering the overall level of a source. A few attempts have been made to model atmospheric absorption, which attenuates high frequencies more than low frequencies as distance increases.7 However, all current systems assume that sources are relatively far from the head (in the acoustic far field). With this assumption, HRTFs depend only upon the azimuth and elevation from listener to source, except for an overall level dependence and some spectral effect that is equivalent at the two ears. Both of these possible cues (overall level and overall spectral content) can be affected by the source signal’s level and spectrum as well as its distance, making these cues ambiguous even when they are properly represented by a display. A few systems are capable of simulating sources in echoic spaces and providing some distance information through the ratio of the direct to reflected energy of the source.
However, this cue is not perceptually reliable under many circumstances.101,102 As a result of these factors, distance is ambiguously represented in a VAS system that does not include reverberation, and poorly represented in a system that simulates echoic environments. As was pointed out in the discussion of supernormal localization, you can arbitrarily manipulate cues in a VAS system in order to make them easier to perceive. Instead of relying on complex geometric room modeling or on manipulations of overall level and spectrum, one can create new, reliable distance cues relatively simply with a VAS display. Preliminary work at MIT has begun to address how to encode distance cues in VAS. Brungart103 investigated the ability of listeners to perceive information encoded as the strength and delay of a single echo of the source. This distance encoding was chosen both because it is simple to incorporate into a simulation and because it is likely to affect the perceived distance of a source. Although the work focused on how much information listeners can extract when distance-like cues are presented rather than on the perception of distance per se, it is a
first step toward developing simple but reliable distance cues in a VAS system. Brungart showed that information transfer for echo strength and delay varies from subject to subject. In addition, the amount of information subjects were able to extract from one stimulus parameter (either echo strength or delay) decreased for experiments in which the other parameter was varied, relative to the information transfer achieved when the other parameter is held constant (e.g., the stimulus parameters are not perceptually separable104). These results again emphasize the need to examine how performance can be limited by the way the human processes available information. Intersubject variability can be very large, so that some subjects can perform very well while others perform poorly. In addition, different information dimensions that are theoretically separate in the stimulus may interact perceptually, further limiting achievable performance. Great care and insight is needed in designing VAS cues in order to achieve the best possible performance.
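A single-echo cue of the kind described in this section is simple to generate, which is part of its appeal. The sketch below is only an illustration of the signal manipulation: the function names are hypothetical, the delay is restricted to whole samples, and no claim is made about which delays or strengths were used in the study cited above or about how they map onto perceived distance. The inverse-distance level rule is included for comparison because overall level is the distance cue most displays already use.

```python
import math

def add_single_echo(source, delay_samples, strength):
    """Direct sound plus one delayed copy scaled by `strength` (0 to 1)."""
    out = list(source) + [0.0] * delay_samples
    for i, s in enumerate(source):
        out[i + delay_samples] += strength * s
    return out

def distance_gain(distance_m, reference_m=1.0):
    """Free-field pressure gain relative to the reference distance (1/r law)."""
    return reference_m / distance_m

def gain_in_db(gain):
    """Convert a linear pressure gain to decibels."""
    return 20.0 * math.log10(gain)

# Usage: one stimulus from a two-parameter family (echo delay, echo strength).
click = [1.0, 0.0]
stimulus = add_single_echo(click, delay_samples=3, strength=0.5)
level_change_db = gain_in_db(distance_gain(2.0))   # doubling the distance
```

Because the echo delay and strength are set independently of the source waveform, they are not confounded with the source's own level or spectrum in the way that overall level is. Whether listeners can use the two parameters independently is a separate, perceptual question, as the results above indicate.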
3.2. PRESENTING NONACOUSTIC INFORMATION VIA ACOUSTIC SPATIAL CUES
VAS displays allow almost any information to be presented to a listener as acoustic spatial information. In addition, there is a growing need to find new ways to convey complex information to a human operator for many real-world tasks. Since VAS displays were designed to present spatial acoustic cues, the most promising applications for their use are those that involve inherently spatial tasks. Using acoustic cues to represent spatial information allows the user to explore the information using all his normal spatial hearing abilities: the listener can form mental images of the spatial information presented just as he would with normal acoustic events in the real world. Because they use natural localization cues, VAS displays should require less training than displays which present spatial information to users in some other manner. For a taste of how VAS displays can be used to present nonacoustic information, we focus here on one of the most promising application areas: VAS displays to augment information displays for pilots. VAS displays are uniquely suited for this application area. First of all, pilots often suffer from visual overload, making the use of additional visual cues of small benefit. Secondly, pilots must perform complex spatial tasks; use of auditory spatial cues is a natural way to present additional spatial information. Finally, pilots already wear earphones, so that providing spatialized cues to the listener over headphones adds no further physical constraint on the pilots.
3.2.1. Orientation cueing for pilots
Pilots of high-performance jet aircraft must perform very complex maneuvers while maintaining a clear sense of their location relative to the external world. In order to monitor the various instruments inside
their craft, pilots must often ignore the visual field outside their craft for periods of time. Because of the large accelerations they experience, these pilots get distorted vestibular cues about their orientation relative to the world. Since they cannot visually monitor the outside world and because their inaccurate vestibular cues affect their sense of orientation, it is common for pilots to become confused about their attitude and position relative to Earth. Small errors in the perceived attitude of plane and pilot may build over time to cause large misregistration with Earth. Even small misperceptions of orientation often lead to disastrous results, with a resulting loss of life and equipment. Visual displays of orientation have been employed to try to alleviate some of these problems, but with only limited success. Researchers have postulated that visual displays are only partially effective because the visual channel is already burdened by a large number of dials, gauges, and other displays.105,106 Researchers at Brandeis University are now undertaking studies to determine how acceleration cues affect auditory localization. The goal of this work is to determine whether it will be feasible to present orientation cues to pilots via a spatial auditory display, perhaps in conjunction with somatosensory cues (other researchers have also proposed investigating the use of VAS displays to address pilot disorientation, e.g., see the work of Perrott106 and McKinley105). Depending on how vestibular, somatosensory, and auditory localization cues interact, it may be possible to create auditory beacons which help maintain a pilot’s sense of orientation relative to the external world. Investigators will examine how angular and linear accelerations affect the apparent location of auditory sources. 
Once these interactions are better understood, it may be possible to simulate auditory sources that are perceptually stable relative to the external world, even under situations where distorted vestibular cues affect localization judgments. With this approach, the presented auditory localization cues would take into account misperceptions caused by the accelerations experienced by the pilots. Auditory cues may provide salient orientation information without adding to the already heavy visual load of the pilot. Because the spatial cues presented by a VAS system can be programmed to take into account the attitude of the plane as well as the perceptual effects of the acceleration of the plane, they can present a source that is perceived at a stationary position relative to Earth. Although there is great promise in using VAS displays for cueing orientation, the utility of this approach ultimately depends upon whether auditory cues (or auditory and somatosensory cues in combination) will be a strong enough percept to override the conflicting vestibular and visual cues already experienced by pilots. This application demonstrates both the inherent power and flexibility of VAS displays as well as the need to take into account the human receiver when manipulating cues presented to the listener.
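The geometric part of an Earth-stable auditory beacon can be sketched as follows. The example computes the direction, relative to the airframe, at which a world-fixed beacon would have to be rendered for a given aircraft attitude. It is a sketch under stated assumptions: a north-east-down world frame, the usual aerospace yaw-pitch-roll Euler sequence, and a pilot whose head is aligned with the airframe (no head tracking). The function name is hypothetical, and the perceptual effects of acceleration discussed above, which are still under investigation, are not modeled at all.

```python
import math

def beacon_direction(beacon_ned, yaw_deg, pitch_deg, roll_deg):
    """Return (azimuth_deg, elevation_deg) of a world-fixed direction as seen
    from the airframe.

    beacon_ned: direction vector in north-east-down coordinates.
    Azimuth is positive to the right of the nose; elevation is positive upward.
    """
    psi, theta, phi = (math.radians(a) for a in (yaw_deg, pitch_deg, roll_deg))
    x, y, z = beacon_ned
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm

    # Rotate the world vector into body axes: yaw, then pitch, then roll.
    x1 = math.cos(psi) * x + math.sin(psi) * y
    y1 = -math.sin(psi) * x + math.cos(psi) * y
    z1 = z
    x2 = math.cos(theta) * x1 - math.sin(theta) * z1
    y2 = y1
    z2 = math.sin(theta) * x1 + math.cos(theta) * z1
    x3 = x2
    y3 = math.cos(phi) * y2 + math.sin(phi) * z2
    z3 = -math.sin(phi) * y2 + math.cos(phi) * z2

    azimuth = math.degrees(math.atan2(y3, x3))
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, -z3))))  # body z points down
    return azimuth, elevation

# Usage: a beacon fixed at north on the horizon, aircraft heading east in level flight.
az, el = beacon_direction((1.0, 0.0, 0.0), yaw_deg=90.0, pitch_deg=0.0, roll_deg=0.0)
```

The resulting azimuth and elevation would select the HRTF pair used to render the beacon at each update, so the source stays put relative to Earth while the aircraft maneuvers. Compensating for acceleration-induced mislocalization would require an additional, empirically determined correction that this sketch does not attempt.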
3.2.2. Other benefits for pilots
Work at NASA Ames Research Center107-109 and Wright-Patterson Air Force Base82,105 has demonstrated that VAS displays can provide useful spatial information to pilots for other tasks as well. Work at NASA Ames Research Center has investigated the use of auditory spatial displays to help pilots avoid ground collisions107,108 and to aid in ground navigation.109 In these studies, it was shown that crew members using a VAS to augment a standard traffic alert and collision avoidance system (TCAS) acquired possible collision targets faster than did crew members without the spatial auditory display;107-109 however, there was no significant decrease in the time needed to complete taxi routes when a VAS display was used to present ground navigation information to a flight crew.109 At Wright-Patterson Air Force Base,82,105 McKinley and his colleagues showed that spatialized speech was perceived more clearly in the presence of noise than nonspatialized speech. Since pilots must respond to verbal instructions under conditions that are extremely noisy, this work points to an important application of VAS displays for pilots. By spatializing speech received from air traffic controllers and from other airmen, speech reception may be improved. McKinley also explored the use of VAS displays to present spatial information to aid in target acquisition. For this work, a special VAS display was constructed for real in-flight tests of target acquisition. Because of the difficulty of setting up controlled, objective tests for pilots in flight, results of these flight tests consisted of subjective reports. In general, pilots found the acoustic spatial cueing to be of use, particularly for providing azimuthal information (note that in these tests, nonindividualized HRTFs were employed, decreasing the reliability of the available elevation cues).
The improvement in reception of spatialized speech seen in the laboratory was also reported by pilots in the in-flight tests. Finally, some pilots reported that target acquisition was more rapid with the addition of auditory spatial cues, and that workload did not increase with the additional cues. These results indicate that VAS displays may be of great benefit to pilots as a means of improving speech reception, increasing situational awareness, and improving target acquisition without increasing workload. Although we have focused here on the use of auditory spatial cues for pilots, this is but one example of how nonacoustic spatial information can be useful for one specific set of users. The same principles that make acoustic spatial cues promising for presenting information to pilots make them promising for presenting information to a variety of other human operators. Other applications for which auditory spatial cues have been proposed range from displaying real-time information for air traffic control,110,111 to aids for the blind,112-115 to the presentation of medical116-119 and financial data120 (for a review of auditory displays in general, see Kramer121).
3.3. VIRTUAL ENVIRONMENTS Virtual display technologies allow people to explore arbitrary spatial information through immersive, interactive displays. When such information is contained in a computer model (rather than derived from real-world sensors), the resulting world is usually described as a virtual environment (VE). Many designers of virtual environment systems come from the fields of computer graphics or computer vision. As a result, historically, development of virtual environment displays has been focused on the creation of immersive, stereoscopic visual displays. However, most virtual environments applications depend upon creating realistic environments that give users a feeling of being present in the created location and thus include auditory displays, and perhaps even haptic displays, as well. In general, most of the applications below employ VE displays in multiple modalities. Because of this, the auditory channel is relatively less important for these applications than for many of the applications already discussed. However, the multimodal nature of VEs is one of their distinguishing features; even though the VAS display is often only one part of the total system, it is often extremely important for creating a realistic and compelling display. Virtual environments are being used in new applications every day. The list of major application areas given below is designed to show some of the very disparate fields in which VEs are being used, rather than to list all possible applications. A more comprehensive overview of the many different uses of VEs can be found in Durlach and Mavor.122 The pervasiveness of the use of VEs is due in part to their ability to simulate varied situations effectively; however, another reason for the growth of VEs is their emotional appeal. The ability to simulate realistic environments and to create fantastic or unusual situations is a compelling feature of VEs. 
This whimsical factor appeals to the creative instincts of many people and may be as strong a motivation in the development of VEs as is their flexibility and cost-effectiveness. 3.3.1. Entertainment Commercial production of virtual environment displays has been driven almost exclusively by the entertainment industry. At the low end of the entertainment market are home-computer game systems like those produced by Nintendo and Sega. To date, such home systems have avoided the use of head-mounted visual or auditory displays in order to keep costs affordable and to avoid encumbering the users of such systems with head-mounted devices. Mid-range entertainment systems that include head-mounted displays with stereo visual and auditory stimuli and joy-sticks or other haptic input devices can now be found in most commercial video arcades. At the highest end, large entertainment conglomerates like the Disney Company are developing virtual display technologies for use in theme parks and theaters. In all
cases, the advantages to using virtual environment displays (including VAS displays) for entertainment are clear: with such displays, it is possible to create effects that are impossible with other approaches for reasons of safety, cost, or the laws of physics. Nearly all entertainment VEs include some auditory component. However, for many systems, the included sound is not spatialized into a true VAS. Instead, one, or possibly two, speakers are used to generate the auditory stimuli in the system. As VAS technology becomes less expensive, more and more systems are beginning to include spatialized sound, at least in some rudimentary form. Although it is easy to belittle the contributions made to the VE field by the entertainment industry, their importance in driving the development of affordable technology should not be ignored. Although the goals of the entertainment industry are to develop systems that provide reasonable simulations for the minimum cost rather than to develop maximally controlled simulations for any reasonable cost, the economic power of the entertainment industry helped to drive forward the state of the art for VAS. In addition, this driving force will remain a powerful one in the foreseeable future. 3.3.2. Task training and education VEs are being explored for use in training for a variety of circumstances. The use of VEs for training is usually driven by the desire for a training system that is flexible and reprogrammable, the wish to train users in a cost-effective manner, and/or the obvious need to train users who operate in dangerous or remote locations. Virtual environments can also be useful for training tasks in which small operator errors can be extremely costly. Because the same physical system can be programmed to simulate many different situations, the same system can be used to train many different tasks. 
In contrast, more traditional simulators are generally built to simulate only a single task, making them less cost-effective in the long term. In addition, trainees in a virtual environment can be exposed to a wide variety of physical situations without ever leaving the VE. As a result, they do not have to be exposed to a physically threatening environment. In addition, the results of their inexpert operation are felt only in the computer model, not in the real world. These factors make training in a VE both convenient and cost effective. Examples of VEs used for training are quite varied.123 A few of the many military uses include training aircraft piloting, submarine navigation, hand-to-hand combat, and battle planning.124,125 Training for tasks which can be dangerous for the operator include training fire fighters, astronauts, and undersea vehicle operators.122 A prime example of a training application in which operator error can be extremely costly is the area of medical training. For instance, surgical simulators are being developed for different surgical tasks with the hope of reducing
patient risk while availing would-be surgeons of invaluable experience.126-129 Finally, training with VEs is not limited to training specific job skills: VEs are also being used for basic educational purposes. Allowing students to interact with environments that they are studying can be stimulating and exciting, simulating a “hands-on” learning experience that may be prohibitively costly or physically impossible using traditional methods (e.g., see Moshell and Hughes,130 Osbert131 and Bricken and Byrne132). In many of the training tasks mentioned, the goal of the VE is to accustom the user to experiences in the simulated environment so that they can react quickly and confidently in the real world when confronted with similar experiences. It is clear that when auditory spatial cues are central to performing the specific task at hand, the inclusion of auditory spatial cues in training will be helpful and useful. However, the benefits of including auditory spatial cues can be less obvious as well. For instance, in some applications (such as in military navigation tasks), auditory cues may be of secondary importance (say, compared to visual cues), but can reduce reaction times for the task when they are present.133 If subjects are trained in a VE that excludes such cues, subjects may not benefit from the auditory cues present in the real world. The way in which multi-sensory cues are combined perceptually is understood only in a rudimentary way at this time and certainly depends upon the specific task being performed. However, until these intersensory effects are better understood, VEs used for training should be designed with care. Otherwise, VE training systems may create unrealistic expectations for trainees that actually hurt their performance in the real-world task. Many organizations are already pursuing the use of virtual environments for training because of the benefits already listed. 
The economic advantages and convenience of use make the lure of virtual environments difficult to resist. However, it should be noted that the usefulness of virtual environments for many training applications has yet to be demonstrated. In particular, few studies have proven that training in current state-of-the-art VEs is a compelling way to learn real-world tasks.134 As has already been mentioned, negative training effects may occur for a badly designed training system. However, despite these caveats, it is likely that VEs will prove useful for training a wide variety of tasks in the future. 3.3.3. Therapy Virtual environments are also being employed for helping phobic patients overcome their irrational fears. Phobias are traditionally treated by desensitizing the patients through repeated exposure to the situations causing them anxiety, or through visualization of such situations.135 Use of virtual environments for treating phobias is promising for many of the same reasons that VEs are being developed for training specific
tasks. Under many circumstances, immersion in a virtual environment is more cost effective and more convenient than is taking patients to the real locations that cause them anxiety. Also, because the realism of the exposure period can be completely controlled in a VE (e.g., by including only some modalities in the simulation, altering the resolution of the cues depicted, etc.), treatment can be tailored to each patient individually. Virtual environments have been shown to be an effective tool for treating acrophobia in a study conducted at the Georgia Institute of Technology.136,137 In this study, subjects exhibited the same physiological signs of anxiety while in the virtual environment simulation as they did in real-world situations of which they were fearful. Whereas treatment of phobias usually entails desensitization, psychological problems related to social interactions are often treated by engaging in role-playing. These types of disabilities may also be addressed with VEs. Virtual environments enable realistic social interactions in role-playing, allowing patients to interact with and confront people and social situations that are troubling. The ability to both monitor and control these interactions makes VEs a promising tool for use in treating volatile emotional issues as well as in the treatment of phobias. One major benefit to using virtual environments for therapy is that the same display technology can be used to treat patients with a wide variety of problems, simply by reprogramming the computer models used to drive the displays. Although initial costs for purchasing a virtual environment display may be large, the same display can be used to treat fear of flying and fear of intimacy. Also, although the initial investment in a VE system may be substantial, over time the use of such a system should prove cost effective, especially when compared to the costs associated with conducting therapy outside the therapist’s office. 
The inclusion of auditory cues in VEs used for therapy provides benefits similar to the benefits of providing spatial auditory cues when using VEs for training. Such cues can create a more realistic, immersive experience, increasing the therapeutic strength of a virtual experience. As with training tasks, including auditory spatial cues can have a subtle effect on listeners that is not readily apparent. For instance, in the case of desensitization, including realistic auditory cueing may be crucial in recreating all aspects of an experience that may cause anxiety in the real world. 3.3.3. Architectural design and architectural acoustics The ability of VEs to allow people to explore places that do not exist makes them a perfect tool for architects. After all, the main goal of the architectural profession is to design spaces that are pleasant and functional when used on a daily basis. While one of the architect’s
skills is his ability to visualize how different buildings can be used even before they are built, this visualization skill is uncommon. In addition, even though an experienced architect may be capable of visualizing the effects of his design decisions, he must also convey these ideas to his client in some way. Virtual environments are perfectly suited to this visualization problem. In a virtual environment, clients and architects alike can explore and interact with a proposed building before it exists anywhere except as a computer model. Decisions about where to put doorways, walls, and furniture can be made with more assurance and a clearer understanding of the consequences. Because of these obvious benefits, virtual environments are being developed to enable architectural walk-throughs of everything from houses to factory designs.122,138-140 While VEs are being used to help visualize the effectiveness of architectural designs in the most general sense, VAS displays are being applied specifically to help design the acoustical environments of buildings.141,142 Historically, VAS techniques for room simulation involved developing detailed mathematical models of room acoustics. Because of the complexity of this problem, most room simulators are not real-time; only recently have realistic room simulations started to become interactive. Since mathematical models of room acoustics can be extremely complicated, an alternative approach is sometimes taken. Small-scale models of architectural designs are built, and a correspondingly scaled dummy head is placed inside the model. Binaural recordings made at the scaled dummy head can then be frequency shifted to approximate how the acoustics of the environment affect the subjective experience of listeners in the proposed space. This empirical approach is more cumbersome than developing computer models; however, for complex architectural designs, empirical measurements can be more robust than traditional modeling approaches.
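The frequency shift involved in scale-model measurements can be illustrated with a little arithmetic. The sketch below assumes the usual scaling argument (in a 1:N model all lengths, and hence all wavelengths, shrink by a factor of N, so frequencies rise by N and durations shrink by N) and assumes the shift is done by slowing playback; the text does not specify the method, and the function names are hypothetical. Effects that do not scale this simply, such as air absorption at high frequencies, are ignored.

```python
def model_frequency(full_scale_hz, scale):
    """Frequency to use in a 1:scale model to represent full_scale_hz."""
    return full_scale_hz * scale

def full_scale_time(model_seconds, scale):
    """Duration in the real room corresponding to a duration measured in the model."""
    return model_seconds * scale

def playback_rate(record_rate_hz, scale):
    """Sample rate at which to replay a model recording so that it sounds full scale."""
    return record_rate_hz / scale

# Example for an assumed 1:10 model.
scale = 10
print(model_frequency(1000.0, scale))   # test-signal frequency standing in for 1 kHz
print(full_scale_time(0.15, scale))     # a 0.15 s decay in the model, at full scale
print(playback_rate(480000, scale))     # replay rate for a 480 kHz recording
```

Replaying the recording at the reduced rate lowers every frequency by the scale factor and stretches every delay by the same factor, which is the sense in which the recording is "frequency shifted" here.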
The advantages to applying VAS to the problem of architectural acoustics are similar to the benefits for more general architectural design problems. Auralization of the synthesized spaces enables the architect to explore the consequences of different design decisions in an inexpensive, straightforward manner. This ability to test the final acoustical design prior to investing money in building the actual space makes the use of VAS very appealing. 3.3.4. Product prototyping As noted above, VEs can be used to test architectural spaces before they are built. A closely related application for VEs is to test consumer products prior to investing time and energy in building expensive prototypes. The use of virtual environments for product prototyping is growing in a number of fields. Some industries currently investigating the use of VEs for product prototyping include the aerospace industry (in the design of new aircraft) and the automotive industry (in the design of new automobiles).122 As with the design of buildings, the appeal in using VEs for product prototyping is the fact that for a relatively small investment, the VE allows a product designer to explore aspects of the product which may be hard to visualize in any other way. Although much of the prototyping done in a VE examines visual aspects of a design, there are important acoustic effects that can be tested in VEs as well. For example, a large amount of money is spent by the automobile industry in trying to control the level and quality of noise in the passenger compartments of their vehicles. VE models of prototype vehicles can help to isolate mechanical resonances in the designs that can cause extremely annoying acoustic noise. Simple human factors tests in a virtual vehicle can save time and money compared to similar tests in a physical prototype.
3.4. TELEOPERATION VEs present computer-modeled information to a human operator. In contrast, teleoperators present information from a remotely-sensed environment to a human operator. Whereas VEs typically developed out of the field of computer graphics, teleoperator systems developed out of the field of robotics. From a technological display viewpoint, the differences between VEs and teleoperator systems are minor. In both cases, virtual environment displays are employed to present information to the human operator. However, in the case of teleoperation, the information to be displayed is not contained solely in a computer model, but rather is derived from external sensors at some real location. Another distinction between the two types of systems is that human operators in teleoperator systems can often affect the remote environment that is being sensed. Although many teleoperator systems employ remote actuators like telerobots, not all teleoperator systems involve such mechanical manipulators. While these distinctions between VEs and teleoperator systems may not impact the display technology being employed, they can have some practical implications for the way a teleoperator system is implemented. For instance, in a virtual environment, the information to be presented to a user is derived from mathematical models. As such, limitations on the information available to the user arise from computational constraints and limits on the display technology. In a teleoperator system, the information to be displayed will often be limited by the remote sensors and the ability to extract salient information from the information retrieved by the sensors. As a result, even though the VE display technology allows great flexibility in how available information is encoded, the amount of information that can be encoded in a teleoperator system may be limited by the type and quality of the sensors at the remote location.
As with VEs, the uses of teleoperator systems are quite varied. The list of applications of teleoperator systems presented here is not intended to list all current uses of teleoperation, but to demonstrate some of the kinds of applications of such systems. The wide variety of uses of teleoperator systems are reviewed more completely in Durlach and Mavor122 and Sheridan.143 3.4.1. Teleconferencing One of the major applications for an acoustic teleoperator system is in the area of teleconferencing.141,142 Teleconferencing applications differ from many other teleoperator applications in that the remote environment that is sensed is usually not acted on, except by delivering auditory/visual stimuli to listeners in that remote environment. As such, teleconferencing is not entirely representative of many teleoperation applications; however, it is an extremely important application when it comes to using VAS in teleoperation. Although there are many similarities, the technical challenges in the development of good teleconferencing systems are distinct from the challenges of designing an efficient VAS display. First of all, the remote sensors in a teleconferencing application may pick up acoustic sources from multiple locations. Thus, the first technical challenge in trying to render spatialized sound for the listener is to separate out the different acoustic signals from the different locations around the sensors. If this can be accomplished, then the problem of rendering spatialized sources for the listener becomes equivalent to the problem in a typical VAS. The problem of having to segregate acoustic sources received at the remote sensors is common to most teleoperator applications. In general, this problem can be circumvented by using remote sensors that are isomorphic to the listener’s own hearing apparatus; that is, by using two remote microphones separated by approximately a human head width. 
Of course, in order to realize the most realistic acoustic cues possible, the remote microphones would be located on a remote dummy head. In this way, interaural level and timing information is at least roughly what the listener is accustomed to hearing. However, unless the remote dummy head incorporates user-specific models of the listener’s pinnae, elevation cues are likely to be poorly perceived by the listener. Spatialization of acoustic sources is important in teleconferencing applications for a number of reasons. First of all, if sources are spatially distinct, interference from simultaneous, competing sources is lessened, improving speech reception. Since verbal communication is the primary goal of teleconferencing applications, this is an extremely important factor when realizing a teleconferencing system. Secondly, although voice timbre is likely the primary cue for determining who is speaking at different times during a teleconference, spatial location can help listeners keep the various talkers perceptually separated during a multi-person teleconference (e.g., see Bregman144).
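As a rough illustration of why two microphones spaced about a head width apart already deliver plausible interaural timing, the sketch below computes the arrival-time difference for a distant source at a given azimuth. It is a simplified free-field model with no head between the microphones; the spacing of 0.18 m, the speed of sound of 343 m/s and the function name are assumptions chosen for illustration. A real or dummy head diffracts sound and produces somewhat larger, frequency-dependent delays.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, assumed value for air at room temperature
MIC_SPACING = 0.18       # m, assumed approximate head width

def itd_seconds(azimuth_deg, spacing=MIC_SPACING, c=SPEED_OF_SOUND):
    """Arrival-time difference between two spaced microphones.

    Assumes a far-field source in the horizontal plane and no head
    between the microphones. Positive azimuth is to the right, and a
    positive result means the right microphone receives the sound first.
    """
    return spacing * math.sin(math.radians(azimuth_deg)) / c

for azimuth in (0, 30, 60, 90):
    print(azimuth, round(itd_seconds(azimuth) * 1e6), "microseconds")
```

With these assumed values the largest delay, for a source directly to one side, is roughly half a millisecond, which is of the same order as the interaural delays listeners normally experience.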
The use of VAS for teleconferencing applications is becoming more widespread. This is due in large part to economic pressures on business, which are pushing businesses to become geographically larger entities at the same time that the expenses of travel are becoming more prohibitive. As a result of these pressures, it is likely that the development of reasonable, inexpensive teleconferencing systems will be one of the major commercial growth areas for VAS systems in the near term. 3.4.2. Remote exploration Examples of teleoperation that enable users to physically explore and/or act on remote sites include any number of applications. Some of the more interesting applications include remote surgery, hazardous waste removal, and space and undersea exploration (again, these and other applications are discussed more fully in Durlach and Mavor122). These applications use VE display technology to allow a user to immerse himself in information from a remote location. In general, such systems employ displays of information in multiple modalities including vision, audition, and haptics. In remote surgical applications, the information from the “remote” location may actually be physically close to the user but unavailable through normal sensory systems. For instance, many new laparoscopic surgery techniques exemplify how relatively simple teleoperation systems are already employed in the medical community. Such surgical techniques entail the use of sensors (tiny cameras) and actuators (tiny scalpels or other such devices) that go into locations that the surgeon’s senses cannot reach: inside the patient’s body. Future medical teleoperator systems will probably entail more complex sensors and actuators that provide the surgeon with more complete information from sensors that are isomorphic to his normal sensory organs and give him finer control at the remote site within the patient’s body.
The continuing push to utilize new technology for medical applications may some day make it possible to perform remote diagnosis and procedures in addition to making possible more complex surgical techniques. Other common applications of teleoperator systems include their use in hazardous environments. Remotely controlled vehicles have been used to explore everything from the Chernobyl nuclear powerplant to deep sea sites. NASA is interested in teleoperator systems both for exploring planets as well as for working on satellites and other assets in Earth’s orbit. Although remotely-sensed acoustic information is not very useful for most space applications, acoustic information can be extremely important in other exploration tasks. In particular, because sound from all directions is sensed by microphones (regardless of the direction of gaze of remote cameras) spatial sound can provide omnidirectional monitoring of a remote location just as the sense of sound does in our normal, everyday lives. In addition, when exploring some environments
(for instance, in a smoke- and fire-filled warehouse, or in a cloudy undersea location), visibility may be poor, making acoustic information relatively more reliable than visual cues.
4. DISCUSSION The usefulness of VAS displays arises from a number of factors. These displays offer psychophysicists and physiologists control over nearly every aspect of spatial sound, allowing researchers to discover perceptual constraints on how spatial auditory information is processed and perceived by a listener. Because it is possible to control exactly how spatial information is encoded and what that spatial information represents, these same displays can be used to present information that is not ordinarily available to a human listener. VAS displays can also be used to emphasize information that might not be easily perceived by a normal listener, creating “supernormal” localization displays. Other uses of VAS displays allow users to explore virtual and remote environments, useful for everything from designing products to monitoring dangerous environments. Although VAS displays offer great flexibility in how information is encoded for the listener, under many circumstances human perceptual factors limit the utility of the display. Information from multiple spatial sources can interfere with the perception of each source’s location. High-level perceptual effects such as memory constraints can limit performance. Just because information is present at the periphery of the human perceiver does not mean that the information will be perceivable. For this reason, design of an effective VAS display must take into account not only whether spatial auditory information is faithfully recreated at the periphery, but how that information is processed by the listener once it reaches his ears.145 For each application, different aspects of spatial auditory information may be important. For example, in presenting information to a pilot or to a surgeon, one of the most important features of the display is that it have an extremely fast update rate and short latency. For psychophysical or physiological experiments, display resolution may be of primary importance. 
For use in entertainment applications, the subjective realism of the display may be more important than the accuracy or resolution of the display. For use in architectural acoustics, the display may not need to be a real-time system, but it must be able to recreate all aspects of an acoustic environment with great fidelity. For applications like teleconferencing, spatial cues may be of secondary importance; the main goal of the display is to maximize speech reception for the listener. Given the disparate requirements of the different applications reviewed in this chapter, it is not surprising that the design of a VAS depends upon the application for which the display is intended. A specific algorithm or technique for generating spatial cues may be more
appropriate for one application than another. For instance, use of individual HRTFs is obviously crucial for tasks involving elevation cues, while the use of simplified HRTFs (that can be implemented in an extremely efficient computational algorithm) may be appropriate when update rate is crucial, as in a teleoperator system. As the field of VAS matures and as technology advances, many of the implementation issues raised here will become insignificant. Computational power will increase, making more complex models of virtual acoustic space more feasible. Memory constraints on VAS systems will be less significant, allowing the storage of longer-length and more finely sampled HRTFs. Finally, our knowledge of how spatial auditory information is processed will continue to grow, allowing more efficient and cost-effective approaches to implementing VAS.
REFERENCES
1. Shinn-Cunningham BG, Lehnert H, Kramer G et al. Auditory Displays. In: Gilkey R, Anderson T, Eds. Spatial and Binaural Hearing. New York: Erlbaum, 1996: in press.
2. Blauert J. Spatial Hearing. Cambridge, MA: MIT Press, 1983.
3. Middlebrooks JC, Green DM. Sound localization by human listeners. Annual Review of Psychology 1991; 42:135-159.
4. Plenge G. On the differences between localization and lateralization. J Acoust Soc Am 1974; 56:944-951.
5. Rigopulos A. The role of reverberation in the localization of real and simulated auditory targets. Massachusetts Institute of Technology, 1990.
6. Mershon DH, Desaulniers DH, Amerson J, Thomas L. Visual capture in auditory distance perception: Proximity image effect reconsidered. J Aud Res 1980; 20:129-136.
7. Little AD, Mershon DH, Cox PH. Spectral content as a cue to perceived auditory distance. Perception 1992; 21:405-416.
8. Wenzel EM, Wightman FL, Foster SH. A virtual display system for conveying 3-dimensional acoustic information. Proceedings of 32nd Annual Meeting of the Human Factors Society, 1988:86-90.
9. Wightman FL, Kistler DJ, Foster SH et al. A comparison of head-related transfer functions measured deep in the ear canal and at the ear canal entrance. Proceedings of 18th ARO Midwinter Meeting. St. Petersburg, Florida, 1995:61.
10. Wenzel EM, Arruda M, Kistler DJ et al. Localization using nonindividualized head-related transfer functions. J Acoust Soc Am 1993; 94:111-123.
11. Pralong D, Carlile S. The role of individualized headphone calibration for the generation of high fidelity virtual auditory space. J Acoust Soc Am 1996; (submitted).
12. Durlach NI, Held RM, Shinn-Cunningham BG. Super Auditory Localization Displays. Society for Information Displays International Symposium: Digest of Technical Papers 1992; XXIII: 98-101.
13. Durlach NI, Shinn-Cunningham BG, Held RM. Super normal auditory localization. I. General background. Presence 1993; 2(2):89-103.
14. Kulkarni A. Auditory Imaging in a Virtual Acoustic Environment. M.S. Thesis in the Department of Biomedical Engineering: Boston University, 1993.
15. Durlach NI, Rigopulos A, Pang XD et al. On the externalization of auditory images. Presence 1992; 1:251-257.
16. Kulkarni A, Woods WS, Colburn HS. Binaural recordings from KEMAR mannequin in several acoustical environments. J Acoust Soc Am 1992; 92:2376.
17. Kendall GS, Martens WL. Simulating the cues of spatial hearing in natural environments. Proceedings of 1984 International Computer Music Conference. Paris, France, 1984.
18. Foster SH, Wenzel EM, Taylor RM. Real-time synthesis of complex acoustic environments. Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz, New York, 1991.
19. Lehnert H, Blauert J. Principles of binaural room simulation. App Acoust 1992; 36:259-291.
20. Rakerd B, Hartmann WM. Localization of sound in rooms. II. The effects of a single reflecting surface. J Acoust Soc Am 1985; 78:524-533.
21. Rakerd B, Hartmann WM. Localization of sound in rooms. III. Onset and duration effects. J Acoust Soc Am 1986; 80:1695-1706.
22. Perrott DR. Auditory Motion. In: Gilkey R, Anderson T, Eds. Spatial and Binaural Hearing. New York: Erlbaum, 1996: in press.
23. Grantham W. Auditory Motion. In: Gilkey R, Anderson T, Eds. Spatial and Binaural Hearing. New York: Erlbaum, 1996: in press.
24. Mills AW. On the minimum audible angle. J Acoust Soc Am 1958; 30:237-246.
25. Wightman FL, Kistler DJ. The dominant role of low-frequency interaural time differences in sound localization. J Acoust Soc Am 1992; 91:1648-1661.
26. Kulkarni A, Isabelle SK, Colburn HS. Human sensitivity to HRTF phase spectra. Proceedings of 18th ARO Midwinter Meeting. St. Petersburg, Florida, 1995:62.
27. Mehrgardt S, Mellert V. Transformation characteristics of the external human ear. J Acoust Soc Am 1977; 61:1567-1576.
28. Kistler DJ, Wightman FL. A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction. J Acoust Soc Am 1992; 91:1637-1647.
29. Oppenheim AV, Schafer RW. Digital Signal Processing. Englewood Cliffs, New Jersey: Prentice-Hall, Inc., 1975.
30. Musicant AD, Butler RA. Influence of monaural spectral cues on binaural localization. J Acoust Soc Am 1985; 77:202-208.
31. Butler RA, Humanski RA. Localization of sound in the vertical plane with and without high-frequency spectral cues. Perc Psychophys 1992; 51:182-86.
32. Butler RA. Spatial referents of stimulus frequencies: Their role in sound localization. In: Gilkey R, Anderson TR, Eds. Binaural and Spatial Hearing. Hillsdale, NJ: Erlbaum, 1996: in press.
33. Musicant AD. The relationship between tone frequency and perceived elevation under headphone listening conditions. J Acoust Soc Am 1995; 97(5):3279.
34. Middlebrooks JC. Narrow-band sound localization related to external ear acoustics. J Acoust Soc Am 1992; 92(5):2607-2624.
35. Carlile S, Pralong D. The location-dependent nature of perceptually salient features of the human head-related transfer functions. J Acoust Soc Am 1994; 95(6):3445-3459.
36. Rayleigh JWS. The Theory of Sound. London: Macmillan (second edition published by Dover Publications, New York, 1945), 1877.
37. Kuhn GF. Model for the interaural time differences in the azimuthal plane. J Acoust Soc Am 1977; 62:157-167.
38. Genuit K. A description of the human outer ear transfer function by elements of communication theory (Paper B6-8). Proceedings of 12th International Congress on Acoustics. Toronto, Canada, 1986.
39. Batteau DW. The role of the pinna in human localization. Proceedings of the Royal Society of London 1967; 168(B):158-180.
40. Shaw EAG. Transformation of sound pressure level from the free field to the eardrum in the horizontal plane. J Acoust Soc Am 1974; 56:1848-1861.
41. Shaw EAG. The elusive connection: 1979 Rayleigh medal lecture. Proceedings of Annual Meeting of the Institute of Acoustics. United Kingdom, 1979.
42. Chen J, Van Veen BD, Hecox KE. External ear transfer function modeling: A beamforming approach. J Acoust Soc Am 1992; 92:1933-1945.
43. Watkins AJ. Psychoacoustical aspects of synthesized vertical locale cues. J Acoust Soc Am 1978; 63:1152-1165.
44. Wright D, Hebrank JH, Wilson B. Pinna reflections as cues for localization. J Acoust Soc Am 1974; 56:957-962.
45. Butler RA, Belendiuk K. Spectral cues utilized in the localization of sound in the median sagittal plane. J Acoust Soc Am 1977; 61:1264-1269.
46. Martens WL. Principal components analysis and resynthesis of spectral cues to perceived location. In: Tiepei S, Beauchamps J, Eds. Proceedings of 1987 International Computer Music Conference, 1987.
47. Chen J, Van Veen BD, Hecox KE. A spatial feature extraction and regularization model for the head-related transfer-function. J Acoust Soc Am 1995; 97:439-952.
48. Chen J, Van Veen BD, Hecox KE. Auditory space modeling and simulation via orthogonal expansion and generalized spline model. J Acoust Soc Am 1992; 92(4):2333.
49. Chen J, Wu Z, Reale RA. A quasi-real-time implementation of virtual acoustic space (VAS) based on a spatial feature extraction and regularization model (SFER). Proceedings of 18th ARO Midwinter Meeting. St. Petersburg, Florida, 1995:57.
50. Cunningham RK, Cunningham RN. Neural Network Overview, personal communication, 1995.
51. Jenison RL, Fissell K. Radial basis function neural network for modeling auditory space. J Acoust Soc Am 1994; 95(5):2898.
52. Haykin S. Neural Networks: A Comprehensive Foundation. New York: Macmillan College Publishing Company, 1994.
53. Asano F, Suzuki Y, Sone T. Role of spectral cues in median plane localization. J Acoust Soc Am 1990; 88(1):159-168.
54. Sandvad J, Hammershøi D. Binaural auralization: Comparison of FIR and IIR filter representation of HIRs. Proceedings of 96th AES Convention. Amsterdam, Netherlands, 1994.
55. Bloomer MA, Wakefield GH. On the design of pole-zero approximations using a logarithmic error measure. IEEE Transactions on Signal Processing 1994; 42:3245-3248.
56. Bloomer MA, Runkle PR, Wakefield GH. Pole-zero models of head-related and directional transfer functions. Proceedings of 18th ARO Midwinter Meeting. St. Petersburg, Florida, 1995:62.
57. Kulkarni A, Colburn HS. Infinite-impulse-response models of the head-related transfer function. J Acoust Soc Am 1995; 97:3278.
58. Kulkarni A, Colburn HS. Efficient finite-impulse-response models of the head-related transfer function. J Acoust Soc Am 1995; 97:3278.
59. Kalman RE. Design of a self-optimizing control system. Transactions of the ASME 1958; 80:468-478.
60. Duda R. Modeling Interaural Differences. In: Gilkey R, Anderson T, Eds. Spatial and Binaural Hearing. New York: Erlbaum, 1996: in press.
61. Peissig J, Albani S, Kollmeier B. A real-time model of binaural sound source localization resolving spatial ambiguities. J Acoust Soc Am 1994; 95:3004.
62. Martin KD. Estimating azimuth and elevation from interaural differences. Proceedings of IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz, New York, 1995.
63. Lehnert H, Blauert J. A concept for binaural room simulation. Proceedings of IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz, NY, 1989:207-221.
64. Zurek PM. The Precedence Effect. In: Yost WA, Gourevitch G, Eds. Directional Hearing. New York: Springer-Verlag, 1987:85-105.
65. Zurek PM. Measurements of binaural echo suppression. J Acoust Soc Am 1979; 66:1750-1757.
66. Shinn-Cunningham BG, Zurek PM, Clifton RK et al. Cross-frequency interactions in the precedence effect. J Acoust Soc Am 1995; 98(1):164-171.
67. Shinn-Cunningham BG, Zurek PM, Durlach NI. Adjustment and discrimination measurements of the precedence effect. J Acoust Soc Am 1993; 93:2923-2932.
68. Clifton RK, Morrongiello BA, Dowd JM. A developmental look at an auditory illusion: The precedence effect. Developmental Psychobiology 1984; 17:519-536.
69. Bech S. Audibility of individual reflections in a complete sound field II. J Acoust Soc Am 1995; 97:3320.
70. Perrott DR. Studies in the perception of auditory motion. In: Gatehouse RW, Ed. Localization of Sound. Groton, CT: Amphora Press, 1982:169-193.
71. Perrott DR. Concurrent minimum audible angle: A re-examination of the concept of auditory spatial acuity. J Acoust Soc Am 1984; 75:1201-1206.
72. Grantham DW. Adaptation to auditory motion in the horizontal plane: Effect of prior exposure to motion on motion detectability. Perc Psychophys 1992; 52:144-150.
73. Wenzel EM, Foster SH. Perceptual consequences of interpolating head-related transfer functions during spatial synthesis. Proceedings of IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz, New York, 1993.
74. Wenzel EM. Localization in virtual acoustic displays. Presence 1992; 1(1):80-107.
75. Wenzel EM, Wightman FL, Kistler DJ et al. The convolvotron: Real time synthesis of out-of-head localization. Proceedings of Joint Meeting of the Acoustical Society of America and the Acoustical Society of Japan, 1988.
76. Wightman FL, Kistler DJ. Headphone simulation of free-field listening. II. Psychophysical validation. J Acoust Soc Am 1989; 85:868-878.
77. Kulkarni A. Sound localization in natural and virtual acoustical environments. Ph.D. Thesis in the Department of Biomedical Engineering: Boston University, 1996.
78. Wightman F, Kistler D, Arruda M. Monaural localization, revisited. J Acoust Soc Am 1991; 89(4):1995.
79. Wightman F, Kistler D, Andersen K. Reassessment of the role of head movements in human sound localization. J Acoust Soc Am 1994; 95(2):3003-3004.
80. Begault DR. Perceptual similarity of measured and synthetic HRTF filtered speech stimuli. J Acoust Soc Am 1992; 92(4):2334.
81. Wenzel EM. The relative contribution of interaural time and magnitude cues to dynamic sound localization. Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 15-18, 1995. New Paltz, New York, 1995.
82. McKinley RL, Ericson MA. Minimum audible angles for synthesized localization cues presented over headphones. J Acoust Soc Am 1992; 92(4):2297.
83. Ericson MA, McKinley RL. Experiments involving auditory localization over headphones using synthesized cues. J Acoust Soc Am 1992; 92(4):2296.
84. Ericson MA. A comparison of maskers on spatially separated competing messages. J Acoust Soc Am 1993; 93(4):2317.
85. Bronkhorst AW, Plomp R. Binaural speech intelligibility in noise for hearing-impaired listeners. J Acoust Soc Am 1989; 86:1374-1383.
86. Bronkhorst AW, Plomp R. The effect of head-induced interaural time and level differences on speech intelligibility in noise. J Acoust Soc Am 1988; 83:1508-1516.
87. Carlile S, Wardman D. Masking produced by broadband noise presented in virtual auditory space. J Acoust Soc Am 1996; (submitted).
88. Begault D. Call sign intelligibility improvement using a spatial auditory display (104014). NASA Ames Research Center, 1993.
89. Begault DR, Erbe T. Multi-channel spatial auditory display for speech communication. Proceedings of 95th Convention of the Audio Engineering Society, October 7-10, 1993. New York, 1993.
90. Begault DR. Virtual acoustic displays for teleconferencing: Intelligibility advantage for “telephone grade” audio. Proceedings of 95th Convention of the Audio Engineering Society, February 25-28, 1995. Paris, France, 1995.
91. Wallach H. The role of head movements and vestibular and visual cues in sound localization. J Exp Psych 1940; 27:339-368.
92. Canon LK. Intermodality inconsistency of input and directed attention as determinants of the nature of adaptation. J Exp Psych 1970; 84:141-147.
93. Welch R. Adaptation of Space Perception. In: Boff KR, Kaufman L, Thomas JP, Eds. Handbook of Perception and Human Performance, Vol. I. New York: John Wiley and Sons, Inc., 1986:24.1-24.45.
94. McLaughlin SC, Rifkin KI. Change in straight ahead during adaptation to prism. Psychonomic Sci 1965; 2:107-108.
95. Held RM. Shifts in binaural localization after prolonged exposure to atypical combinations of stimuli. Am J Psych 1955; 68:526-548.
96. Shinn-Cunningham BG. Adaptation to Supernormal Auditory Localization Cues in an Auditory Virtual Environment. Ph.D. Thesis in the Department of Electrical Engineering and Computer Science: Massachusetts Institute of Technology, 1994.
97. Wien GE. A preliminary investigation of the effect of head width on binaural hearing. M.S. Thesis in the Department of Electrical Engineering and Computer Science: Massachusetts Institute of Technology, 1964.
98. Durlach NI, Braida LD. Intensity perception. I. Preliminary theory of intensity resolution. J Acoust Soc Am 1969; 46(2):372-383.
99. Braida LD, Durlach NI. Intensity perception. II. Resolution in one-interval paradigms. J Acoust Soc Am 1972; 51(2):483-502.
100. Mershon DH, Bowers JN. Absolute and relative cues for the auditory perception of egocentric distance. Perception 1979; 8:311-322.
101. Mershon DH, King LE. Intensity and reverberation as factors in auditory perception of egocentric distance. Perc Psychophys 1975; 18:409-415.
102. Mershon DH, Ballenger WL, Little AD et al. Effects of room reflectance and background noise on perceived auditory distance. Perception 1989; 18:403-416.
103. Brungart DS. Distance information transmission using first order reflections. M.S. Thesis in the Department of Electrical Engineering and Computer Science: Massachusetts Institute of Technology, 1994.
104. Durlach NI, Tan HZ, Macmillan NA et al. Resolution in one dimension with random variations in background dimensions. Perc Psychophys 1989; 46:293-296.
105. McKinley RL, Ericson MA, D’Angelo WR. 3-Dimensional auditory displays: Development, applications, and performance. Aviation, Space, and Environmental Medicine 1994; May:A31-A38.
106. Perrott D, McKinley RL, Chelette TL. Investigations in interactions of auditory, visual, and vestibular perception in real and synthetic environments, 1995.
107. Begault D. Head-up auditory displays for traffic collision avoidance system advisories: A preliminary investigation. Human Factors 1993; 35(4):707-717.
108. Begault D, Pittman MT. 3-D audio versus head down TCAS displays (177636). NASA Ames Research Center, 1994.
109. Begault DR, Wenzel EM, Miller J et al. Preliminary investigation of spatial audio cues for use during aircraft taxi under low visibility conditions. NASA Ames Research Center, 1995.
110. Begault DR, Wenzel EM. Techniques and applications for binaural sound manipulation in man-machine interfaces. International Journal of Aviation Psychology 1992; 2:1-22.
111. Wenzel EM. Spatial sound and sonification. In: Kramer G, Ed. Auditory Display: Sonification, Audification, and Auditory Interface, Vol. XVIII, SFI Studies in the Science of Complexity. Santa Fe, New Mexico: Addison-Wesley, 1994.
112. Loomis JM, Hebert C, Cincinelli JG. Active localization of virtual sounds. J Acoust Soc Am 1990; 88:1757-1764.
113. Edwards ADN. Soundtrack: An auditory interface for blind users. Human-Computer Interaction 1989; 4:45-66.
114. Scadden LA. Annual report of progress. Rehabilitation Engineering Center of the Smith-Kettlewell Institute of Visual Sciences, San Francisco, California, 1978.
115. Lunney D, Morrison R. High technology laboratory aids for visually handicapped chemistry students. Journal of Chemical Education 1981; 58:228.
116. Witten M. Increasing our understanding of biological models through visual and sonic representation: A cortical case study. International Journal of Supercomputer Applications 1992; 6:257-280.
117. Smith S. An auditory display for exploratory visualization of multidimensional data. In: Grinstein G, Encarnacao J, Eds. Workstations for Experiment. Berlin: Springer-Verlag, 1991.
118. Fitch T, Kramer G. Sonifying the body electric: Superiority of an auditory over a visual display in a complex, multi-variate system. In: Kramer G, Ed. Auditory Display: Sonification, Audification, and Auditory Interface, Vol. SFI Studies in the Science of Complexity, Proceedings XVIII. Santa Fe, New Mexico: Addison-Wesley, 1994.
119. Kramer G. Some organizing principles for auditory display. In: Kramer G, Ed. Auditory Display: Sonification, Audification, and Auditory Interface, Vol. SFI Studies in the Science of Complexity, Proceedings XVIII. Santa Fe, New Mexico: Addison-Wesley, 1994.
120. Mezrich JJ, Frysinger SP, Slivjanovski R. Dynamic representation of multivariate time-series data. Journal of the American Statistical Association 1984; 79:34-40.
121. Kramer G. Auditory Display: Sonification, Audification, and Auditory Interface. Santa Fe, New Mexico: Addison-Wesley, 1994.
122. Durlach NI, Mavor A. Virtual Reality: Scientific and Technical Challenges. Washington, D.C.: National Academy of Sciences, 1994.
123. Proceedings of NASA Conference on Intelligent Computer-Aided Training and Virtual Environment Technology. Houston, TX, 1993.
124. Moshell M. Three views of virtual reality: Virtual environments in the US military. IEEE Computer 1993; 26(2):81-82.
125. Pausch P, Crea T, Conway M. A literature survey for virtual environments: Military flight simulator visual systems and simulator sickness. Presence: Teleoperators and Virtual Environments 1992; 1(3):344-363.
126. Aligned Management Associates. Proceedings of Medicine Meets Virtual Reality: Discovering Applications for 3-D Multi-Media Interactive Technology in the Health Sciences. San Diego, California, 1992.
127. Aligned Management Associates. Proceedings of Medicine Meets Virtual Reality II: Interactive Technology and Healthcare: Visionary Applications for Simulation, Visualization, Robotics. San Diego, CA, 1994.
128. Bailey RW, Imbmbo AL, Zucker KA. Establishment of a laparoscopic cholecystectomy training program. American Surgeon 1991; 57(4):231-236.
129. Satava R. Virtual reality surgical simulator: The first steps. Surgical Endoscopy 1993; 7:203-205.
130. Moshell JM, Hughes CE. The virtual academy: Networked simulation and the future of education. Proceedings of IMAGINA Conference. Monte Carlo, Monaco, 1994.
131. Osbert KM. Virtual Reality and Education: A Look at Both Sides of the Sword (R-93-6). Human Interface Technology Laboratory of the Washington Technology Center, University of Washington, 1992.
132. Bricken M, Byrne CM. Summer Students in Virtual Reality: A Pilot Study on Educational Applications of Virtual Reality Technology. Human Interface Technology Laboratory of the Washington Technology Center, University of Washington, 1992.
133. Welch R, Warren DH. Intersensory interactions. In: Boff KR, Kaufman L, Thomas JP et al., Eds. Handbook of Perception and Human Performance, Vol. I: John Wiley and Sons, Inc., 1986:25.1-25.36.
134. Kozak JJ, Hancock PA, Arthur E et al. Transfer of training from virtual reality. Ergonomics 1993; 36:777-784.
135. Hodges M. Facing real fears in virtual worlds. Technology Review 1995; May/June:16-17.
136. Rothbaum BO, Hodges LF, Kooper R et al. Effectiveness of computer-generated (virtual-reality) graded exposure in the treatment of acrophobia. American Journal of Psychiatry 1995; 152(4):626-628.
137. Hodges LF, Rothbaum BO, Kooper R et al. Applying virtual reality to the treatment of psychological disorders. IEEE Computer 1995; May 1995.
138. Airey JM, Rohlf JH, Brooks Jr FP. Towards image realism with interactive update rates in complex virtual building environments. Computer Graphics 1990; 24(2):41.
139. Emhardt J, Semmler J, Strothotte T. Hyper-navigation in virtual buildings. Proceedings of IEEE 1993 Virtual Reality Annual International Symposium, VRAIS ’93. Piscataway, NJ: IEEE Service Center, 1993:342-348.
140. Henry D. Spatial Perception in Virtual Environments: Evaluating an Architectural Application. M.S. Thesis in the Department of Inter-Engineering: University of Washington, 1992.
141. Special Issue on Computer Modelling and Auralization of Sound Fields in Rooms. App Acoust 1993; 38(2-4).
142. Special Issue on Auditory Virtual Environments and Telepresence. App Acoust 1992; 36(3-4).
143. Sheridan TB. Telerobotics, Automation, and Human Supervisory Control. Cambridge, MA: MIT Press, 1992.
144. Bregman AS. Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press, 1990.
145. Shinn-Cunningham BG, Durlach NI. Defining and redefining limits on human performance in auditory spatial displays. In: Kramer G, Smith S, Eds. Proceedings of Second International Conference on Auditory Display. Santa Fe, NM: Santa Fe Institute, 1995:67-78.
Index
Numbers in italics indicate figures (f) and tables (t).
A Ando Y, 146 Asano F, 209, 212 Axelsson A, 115
B Bat, 50 Batteau DW, 17, 49-51, 200-203, 213 Begault DR, 126, 142, 219 Belendiuk K, 118 Binwidth. See Digital signaling processing (DSP)/ frequency domain processing/Fourier analysis of periodic signals/discrete Fourier transform. Blauert J, 3, 43, 45, 51, 55, 110, 118, 134, 200, 214 Bloomer MA, 209-210 Boerger, 55 Bregman AS, 232 Bricken M, 228 Brillouin L, 32 Bronkhorst AW, 218 Brown CH, 59 Brungart DS, 222-223 Butler RA, 16, 53, 118 Byrne CM, 228
C Canon LK, 219 Carlile S, 39, 41, 51, 65, 134, 197, 219 Cat, 50-51, 59, 153-160 Central nervous system directional sound response, 170-182, 172f, 174-175f, 177-178f, 180f ILD sensitivity, 61-63, 62f ITD sensitivity, 58-59, 60f representation of auditory space, 9-10 Chan JCK, 58, 123 Chen J, 200, 202-203, 206, 214 Colburn HS, 209-211 Coleman PD, 52 Computer aided design (CAD) programs, 104, 106 Cone of confusion, 30-31, 30f, 67, 139t, 137-139
D Digital signaling processing (DSP), 79-106, 83f discrete time systems, 82-88 amplitude as digital numbers, 83 analog to digital converter (ADC), 83, 86-87, 87f
anti-aliasing filter, 83, 86, 106 digital to analog converter (DAC), 83, 88 discrete time sampling, 83-84, 84f, 88 Nyquist rate, 84-86, 85f quantization, 86-87 reconstruction filter, 83, 88, 106 filter design, 97-106, 98f, 128-134 bandpass, 105, 106f finite impulse response (FIR), 99-104, 100f, 129, 157f, 160, 161f, 202 frequency sampling, 102-104, 103f Parks-McClellan algorithm, 104, 105f windowing function, 101, 102f infinite impulse response (IIR), 99, 104-106 Butterworth filter, 104-105, 106f Chebyshev filter, 104-105, 106f elliptic filter, 105 inverse Chebyshev filter, 104-105, 106f low pass, 99, 99f, 105f frequency domain processing, 88-94, 130 complex numbers, 88-89, 88f, 103-104 Fourier analysis of periodic signals, 89-94, 91f, 101, 211 discrete Fourier transform (DFT), 90, 92, 93f, 94-96, 154, 215 binwidth, 92, 94 fast Fourier transform (FFT), 93f, 94, 130 inverse discrete Fourier transform (IDFT), 90, 92, 96 microprocessors, 79-80 time domain analysis, 94-97, 129-130 binaural FETF, 162-168, 163f convolution, 96-97, 97f, 100, 145f, 214-215 impulse response, 94-95 Golay codes, 95-96, 126, 127f Duda R, 211 Duplex theory. See Perception of auditory space/ localization/binaural. Durlach NI, 125, 191, 221, 226, 232 Dye RH, 56
E Eady HR, 56 Ear canal/eardrum complex, 45-47, 46f, 48f, 65, 112-113, 115, 119-123 Eigen functions, 162-168, 204-206, 212 eigen impulse response (EIR), 164-166 eigen transfer functions (ETF), 162, 164
F Feldman RS, 56 Ferret, 50-51 Fisher HG, 19 Fissell K, 207 Fourier analysis. See Digital signaling processing (DSP)/frequency domain processing. Free field to eardrum transfer function (FETF), 153-166, 156-158f. See also Eigen function. binaural expansion, 162-168, 163f spatial characteristic function (SCF), 162, 166-168, 167f spatial feature extraction and regularization (SFER) model, 160-170, 163f Freedman SJ, 19 Frequency limits. See also Digital signaling processing (DSP)/filter design. horizon transfer function, 66 human detection, 80 interaural intensity differences, 59, 61-64 interaural phase difference, 55-56, 58, 64 meridian transfer function, 66 practical signal bandlimit, 84, 85f, 86 Friedrich MT, 116
G Gardner MB, 51 Gaunard GC, 34 Geisler CD, 123 Genuit K, 110, 199-200, 202 Gierlich HW, 110 Gilkey RH, 141 Glasberg BR, 65 Golay codes. See Digital signaling processing (DSP)/ time domain analysis/impulse response. Good MD, 141 Green DM, 64, 120, 146 Guinea pig, 50-51, 59
H Haas effect. See Perception of auditory space/ localization/cue sensitivity/precedence effect. Hafter ER, 55, 61 Hammershøi D, 110, 119, 136, 138, 209 Hartmann WM, 58, 134 Haustein, 51 Haykin S, 207 Head as acoustic obstacle, 35-37, 45 movement effects, 127-128, 136 Head related transfer function (HRTF), 38-45, 94, 110, 145f, 217, 220. See also Free field to eardrum transfer function (FETF).
measurement for VAS, 111-134, 186-215 digitization, 123-125 environment, 125-127 individualized, 142-147 interindividual variability (IIV), 188-189, 198 recording, 116-123 SNAPSHOT, 188 modeling, 193-211 eigen function, 204-206 interaural-spectrum, 211 neural-network, 206-207 pinna structure, 200-204, 201f, 203f rational function, 207-211 pole-zero (ARMA), 208-210, 212 reduced-order, all-zero (MA), 208, 210-211 wave-equation, 199-200 variation with azimuth, 41-43, 42f variation with elevation, 43-45, 44f Headphones, 3f, 6, 109-110 cat earphones, 160 circum-aural, 131, 132f, 133, 143, 145f sound field generation, 6, 54-55, 213 supra-aural, 131, 132f transfer characteristics, 110 transfer functions (HpTFs), 130-134, 132f, 144-145f individualized, 142-147 Hebrank JH, 43, 45, 50, 134, 201 Held RM, 220 Hellstrom P, 115 Hiranaka Y, 49 Hirsch HR, 52 Horizon transfer (excitation) function, 66, 67f, 69f Houben, 59 Hughes CE, 228 Hyams, 39
I Inner ear, 65, 115, 146 Interaural intensity difference (IID). See Interaural level difference (ILD). Interaural level difference (ILD), 28-29, 30f, 35, 59, 67, 174-175, 174-175f, 199, 217-218 sensitivity to, 55-59 Interaural phase difference (IPD), 32-33 Interaural spectral difference (ISD), 63 Interaural time difference (ITD), 28, 29f, 30f, 31, 33f, 34, 167, 199, 217-218, 221 path length difference, 31-35 sensitivity to, 59-63, 60f Interindividual variability (IIV). See Head related transfer function (HRTF)/measurement for VAS/ individualized. Irvine DR, 58-59, 61
J Jeffress LA, 59, 60f Jenison RL, 207
K Kalman RE, 209 Karhunen-Loeve expansion. See Eigen functions. Kendall GS, 214 Khanna SM, 122 King LE, 53 Kistler DJ, 43, 110, 115, 120-122, 126-127, 133, 135-138, 141-142, 195-196, 205-206, 212 Klump RG, 56 Knudsen EI, 40 Kramer G, 225 Kuhn GF, 34 Kulkarni A, 191, 195, 197, 209-212
L Lawton BW, 122 Lehnert H, 186, 214 Linear time-invariant (LTI) system, 95, 99 Localization of sound. See Perception of auditory space; Stimulus, auditory; Virtual auditory space.
M Mach B, 50 Makous JC, 14-15, 135-136, 138 Martens WL, 205, 214 Martin KD, 211 Mavor A, 226, 232 McClellan JH, 104 McKinley RL, 193, 224-225 McLaughlin SC, 220 Mehrgardt S, 43, 45, 119, 195 Mellert V, 43, 45, 119, 195 Meridian transfer (excitation) function, 66, 68f Mershon DH, 53-54, 187 Microphone, 82 probe, 38, 41, 94, 110, 113, 113f, 160, 161f recording HpTFs, 131 recording HRTFs, 117-118, 121-123, 122f Middlebrooks JC, 14-15, 120, 126, 128, 135-136, 138, 146 Minimum audible angle (MAA) of sound source displacement, 11 Minimum audible movement angle (MAMA), 21 Molino J, 52 Moore BCJ, 56, 61, 64-65 Morimoto M, 146 Moshell JM, 228 Møller H, 110, 118, 120, 133 Musicant AD, 134
N Noble W, 135 Nyquist rate. See Digital signaling processing (DSP)/ discrete time sampling.
O Oldfield SR, 135-136 Oppenheim AV, 99 Osbert KM, 228 Outer ear structure, 16, 17f, 29, 37-38, 45-51, 46f, 48f, 112, 115, 119-120, 131, 187, 198, 200-204, 201f, 203f diffractive model, 50-51 resonator model, 46-49
P Parker SPA, 135-136 Parks TW, 104 Patterson RD, 65 Peissig J, 211 Perception of auditory space, 1-6, 3f, 109, 187 constancy of auditory object, 70 dimensions, 2-4 environment, 3-4, 12, 52-54 expectation, 2, 51, 135 extent of source, 2-4 ecological rationale, 5, 52-53, 136-137 localization, 7-8, 10-11 cue sensitivity, 54-71, 188, 219-224 precedence effect, 56-58 distance, 51-54 human listeners, 11-21, 13f, 134-137 binaural/duplex theory, 27-35 frequency restrictions, 28-29, 30f, 34-35, 45 dynamic cues, 18-21 errors. See also Cone of confusion. front/back reversal, 13-14, 141 near-miss localization, 13-16, 15f, 140f monaural, 18, 30, 63 methodology, 37-41, 134-137 coordinate systems, 39-41, 40f Perrett S, 135 Perrott DR, 192, 224 Plenge G, 187 Plomp R, 218 Pollack I, 19 Pralong D, 65, 134, 197 Primates, 59
R Rabbitt RD, 38, 116 Rayleigh JWS, 28, 36, 199 Rifkin KI, 220
Rigopulos A, 187, 191 Rose M, 19 Roth GL, 34, 57
S Sandvad J, 110, 136, 138, 209 Schafer RW, 99 Schroeder MR, 126 Searle CL, 31, 63 Shaw EAG, 37, 41, 45, 46f, 50, 118-120, 202 Sheridan TB, 232 Shinn-Cunningham BG, 186, 192, 220 Signal to noise ratio (SNR), 82, 95 SNAPSHOT. See Head related transfer function (HRTF)/measurement for VAS. Sound characteristics, 80-82 intensity, 81-82 noise, 82 wave, 81f Sound pressure level (SPL), 36, 36f, 48f, 81-82, 127 Stewart, 36 Stimulus, auditory, 5-6 amplitude modulated, 55-56 continuous tones, 32 dichotic, 61, 63 impulse, 94-95, 126 localization, 7-8, 12 comb filtering, 17, 49 methodology, 12-16, 13f, 134-135 narrowing band-width, 16 moving, 19-21, 192-193 path length, 31-35 spectrally dense, 56 VAS signals, 170 velocities, 32-35 group, 33-35 phase, 33-35 Stinton MR, 122
T Teranishi R, 118-119 Trahiotis C, 29
V Vernon JA, 82 Virtual auditory space (VAS), 1, 6-7, 109-147. See also Headphones; Digital signal processing. dynamic, 192-193 fidelity, 7-9, 134-142, 139t localization, 7-8, 21 generation, 109-111, 112f HRTF/FETF recording, 116-128, 154-159, 156-158f modeling, 54, 185-186 adaptation to distorted spatial cues, 219-222
auditory spatial perception, 216-219 directional hearing studies, 153-182, 170f virtual space receptive field (VSRF), 170-181, 172f, 174-175f, 177-178f, 180f intensity relationships, 173-175 internal structure, 179-181 temporal relationships, 176-179 distance perception, 222-223 environments, 226-234 architectural design and acoustics, 229-230 entertainment, 226-227 product prototyping, 230-231 remote exploration, 233-234 teleconferencing, 232-233 teleoperation, 231-232 therapy, 228-229 training and education, 227-228 HRTF, 193-214 eigen function, 204-206 interaural-spectrum, 211 neural-network, 206-207 pinna structure, 200-204, 201f, 203f rational function, 207-211 wave-equation, 199-200 orientation (pilots), 223-224 verbal signal sorting (pilots), 225 reverberant, 191-192, 214 signal filtering, 128-134 validation, 111-116, 113-114f, 137, 139t, 168, 169f, 194-195, 211-214 von Bekesy G, 53
W Wakefield GH, 209 Wallaby, 50-51 Wallach H, 57, 219 Wardman D, 219 Watkins AJ, 50, 201, 213 Welch R, 220 Wenzel EM, 142-143, 146, 213, 217, 221 Wien GE, 221 Wightman FL, 43, 110, 115, 120-122, 126-127, 133, 135-138, 141-142, 195-196, 205-206, 212, 217-219 Wright D, 43, 45, 49, 50, 201
Y Yamasaki H, 49 Yin TCT, 58 Yost WA, 56
Z Zhou B, 96 Zwislocki J, 56