Human Information Processing in Speech Quality Assessment (T-Labs Series in Telecommunication Services) 3030713881, 9783030713881

Table of contents :
Preface
Acknowledgments
Contents
Acronyms
1 Introduction
References
2 Speech Quality Fundamentals
2.1 Conceptual Approaches
2.1.1 Quality of Service (QoS)
2.1.2 Quality of Experience (QoE)
2.1.2.1 Quality Elements Versus Quality Features
2.1.2.2 Perceived Quality Versus Judged Quality
2.1.2.3 Speech Quality: Form Versus Content
2.2 Explanatory Approaches
2.2.1 Psychophysical Models
2.2.2 Functional Process Models
References
3 Speech Quality Assessment
3.1 Levels of Analysis
3.1.1 Subjective and Behavioral Levels
3.1.2 Primer: Event-Related Brain Potential (ERP) Technique
3.1.2.1 Definitions
3.1.2.2 ERP Processing and Analysis
3.1.2.3 ERP Components
3.1.2.4 Oddball Paradigm
3.1.3 Neurophysiological Level
3.2 Measuring Criteria
3.2.1 Objectivity, Reliability, and Validity
3.2.2 Sensitivity, Diagnosticity, and Intrusiveness
References
4 Functional Model of Quality Perception (Research Questions)
References
5 Discrimination of Speech Quality Change Along Perceptual Dimensions (Study I)
5.1 Motivation
5.2 Pre-test
5.2.1 Methods
5.2.1.1 Participants
5.2.1.2 Speech Stimuli
5.2.1.3 Experimental Procedure
5.2.1.4 Data Collection
5.2.1.5 Data Analysis
5.2.2 Results
5.2.3 Discussion
5.3 Main Experiment
5.3.1 Methods
5.3.1.1 Participants
5.3.1.2 Speech Stimuli
5.3.1.3 Stimulus Sets
5.3.1.4 Experimental Procedure
5.3.1.5 Data Collection
5.3.1.6 ERP Processing
5.3.1.7 Parameter Extraction
5.3.1.8 Data Analyses
5.3.2 Hypotheses
5.3.3 Results
5.3.3.1 Subjective Analysis
5.3.3.2 Behavioral Analysis
5.3.3.3 Electrophysiological Analysis
5.3.4 Discussion
5.3.4.1 Subjective Results
5.3.4.2 Behavioral Results
5.3.4.3 Electrophysiological Results
5.4 Conclusion
Appendix
References
6 Discrimination of Speech Quality Change Under Varying Semantic Content (Study II)
6.1 Motivation
6.2 Pre-test
6.2.1 Methods
6.2.1.1 Participants
6.2.1.2 Speech Stimuli
6.2.1.3 Experimental Procedure
6.2.1.4 Data Collection
6.2.1.5 Data Analysis
6.2.2 Results
6.2.3 Discussion
6.3 Main Experiment
6.3.1 Methods
6.3.1.1 Participants
6.3.1.2 Speech Stimuli
6.3.1.3 Stimulus Sets
6.3.1.4 Experimental Procedure
6.3.1.5 Data Collection
6.3.1.6 ERP Processing
6.3.1.7 Parameter Extraction
6.3.1.8 Data Analyses
6.3.2 Hypotheses
6.3.3 Results
6.3.3.1 Subjective Analysis
6.3.3.2 Behavioral Analysis
6.3.3.3 Electrophysiological Analysis
6.3.4 Discussion
6.3.4.1 Subjective Results
6.3.4.2 Behavioral Results
6.3.4.3 Electrophysiological Results
6.4 Conclusion
Appendix
References
7 Talker Identification Under Varying Speech Quality and Spatialization (Study III)
7.1 Motivation
7.2 Main Experiment
7.2.1 Methods
7.2.1.1 Participants
7.2.1.2 Speech Stimuli
7.2.1.3 Speech Reproduction
7.2.1.4 Experimental Procedure
7.2.1.5 Data Analyses
7.2.2 Hypotheses
7.2.3 Results
7.2.3.1 Subjective Analysis
7.2.3.2 Behavioral Analysis
7.2.4 Discussion
7.2.4.1 Subjective Results
7.2.4.2 Behavioral Results
7.3 Post-test
7.3.1 Methods
7.3.1.1 Participants
7.3.1.2 Speech Stimuli
7.3.1.3 Speech Reproduction
7.3.1.4 Experimental Procedure
7.3.1.5 Data Analyses
7.3.2 Hypotheses
7.3.3 Results
7.3.3.1 Subjective Analysis
7.3.3.2 Behavioral Analysis
7.3.4 Discussion
7.3.4.1 Subjective Results
7.3.4.2 Behavioral Results
7.4 Conclusion
References
8 General Discussion
8.1 Attentional Resource Allocation in Quality Perception
8.2 Internal Processes Underlying Talker Identification
References
9 General Conclusion and Outlook
References
Index

T-Labs Series in Telecommunication Services

Stefan Uhrig

Human Information Processing in Speech Quality Assessment

T-Labs Series in Telecommunication Services

Series Editors:
Sebastian Möller, Quality and Usability Lab, Technische Universität Berlin, Berlin, Germany
Axel Küpper, Telekom Innovation Laboratories, Technische Universität Berlin, Berlin, Germany
Alexander Raake, Audiovisual Technology Group, Technische Universität Ilmenau, Ilmenau, Germany

It is the aim of the Springer Series in Telecommunication Services to foster an interdisciplinary exchange of knowledge addressing all topics which are essential for developing high-quality and highly usable telecommunication services. This includes basic concepts of underlying technologies, distribution networks, architectures and platforms for service design, deployment and adaptation, as well as the users’ perception of telecommunication services. By taking a vertical perspective over all these steps, we aim to provide the scientific bases for the development and continuous evaluation of innovative services which provide a better value for their users. In fact, the human-centric design of high-quality telecommunication services – the so-called “quality engineering” – forms an essential topic of this series, as it will ultimately lead to better user experience and acceptance. The series is directed towards both scientists and practitioners from all related disciplines and industries. Books in this series are indexed in Scopus.

More information about this series at http://www.springer.com/series/10013

Stefan Uhrig

Human Information Processing in Speech Quality Assessment

Stefan Uhrig
Quality and Usability Lab
Technische Universität Berlin
Berlin, Germany

ISSN 2192-2810    ISSN 2192-2829 (electronic)
T-Labs Series in Telecommunication Services
ISBN 978-3-030-71388-1    ISBN 978-3-030-71389-8 (eBook)
https://doi.org/10.1007/978-3-030-71389-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Dedicated to my parents and my brother

Preface

The present book, Human Information Processing in Speech Quality Assessment, covers the main outcomes of the author’s doctoral research project, carried out between 2017 and 2020. This monograph is a revised and extended version of the original dissertation approved by the doctoral committee at Technische Universität (TU) Berlin in December 2020. Two-thirds of the research work was conducted at the Quality and Usability Lab of TU Berlin; the final third accrued during a one-year research stay at the Department of Electronic Systems of the Norwegian University of Science and Technology (NTNU) in 2019 and early 2020. Across these project phases, the topic of human perceptual and cognitive processing of transmitted speech had to be approached from multiple angles, crossing disciplinary borders between quality engineering, acoustics, and psychology.

Previous publications in the T-Labs Series in Telecommunication Services have addressed subjective quality assessment and prediction, as well as the identification of neural correlates of speech and audio-visual quality perception. The present book contributes to this existing knowledge base in two ways: First, it closes a theoretical gap by specifying the functional structure of human information processing for listening-only test scenarios and tasks. Second, it advances a new multi-method, “process-oriented” approach towards speech quality assessment, which allows effects of varying transmission quality on specific internal processes to be analyzed systematically. This approach is exemplified by three experimental studies demonstrating interactions between speech quality, stimulus context, and semantic speech content. Besides documenting the three studies (Chaps. 5, 6, and 7), this book opens with an introduction to fundamental concepts and methodologies (Chaps. 2 and 3), then describes a functional model of quality perception (Chap. 4) before finally integrating the studies’ empirical results theoretically (Chap. 8).

The conceptual and methodological ideas, together with the empirical findings elaborated in this book, should be of special interest to researchers working in the fields of quality and audio engineering, psychoacoustics, audiology, and psychophysiology.

Berlin, Germany
January 2021

Stefan Uhrig

Acknowledgments

First of all, I am thankful to Prof. Dr.-Ing. Sebastian Möller, head of the Quality and Usability (QU) Lab, Technische Universität (TU) Berlin, for his supervision, detailed feedback, and impetus that greatly shaped the research work presented in this book. Also, I am grateful to Prof. Dr. Andrew Perkis, Department of Electronic Systems, Norwegian University of Science and Technology (NTNU), for his guidance and supervision, which provided the foundation for pursuing my research goals during a one-year stay at NTNU from March 2019 to February 2020. I want to give special thanks to Prof. Dr. Dawn Behne and Prof. Dr. Peter Svensson, Department of Psychology and Department of Electronic Systems, NTNU, for their trust and commitment during this one-year research visit abroad. I would like to thank Prof. Dr. Dietrich Manzey, Department of Psychology and Ergonomics, TU Berlin, for reviewing the present research work.

Many thanks go to Irene Hube-Achter, Yasmin Hillebrenner, and Tobias Jettkowski for their administrative and technical support during my years at the QU Lab. I am reminded of numerous conversations and collaborations with my colleagues at the QU Lab. Thank you for the memorable times we spent together! In particular, I want to mention Gabriel Mittag, Falk Schiffner, Steven Schmidt, and Saman Zadtootaghaj, who were always on hand with help and advice. Likewise, I want to thank my colleagues at NTNU—including members of three research groups: the Signal Processing and Acoustics groups in the Department of Electronic Systems, as well as the Speech, Cognition and Language Research (SCaLa) group in the Department of Psychology—for their valuable feedback in various colloquium talks. Especially with Asim Hameed and Marzieh Sorati, I shared many insightful discussions during my stay in Trondheim and beyond.
Last but not least, I wish to express my gratitude to both TU Berlin—namely Senta Maltschew and Evelina Skurski, from the International Scientific Cooperation Team of the International Affairs Office—and NTNU, whose strategic partnership program funded the scholarship that enabled this research work in the first place.


Contents

1 Introduction ....... 1
  References ....... 4
2 Speech Quality Fundamentals ....... 5
  2.1 Conceptual Approaches ....... 5
    2.1.1 Quality of Service (QoS) ....... 5
    2.1.2 Quality of Experience (QoE) ....... 6
  2.2 Explanatory Approaches ....... 12
    2.2.1 Psychophysical Models ....... 13
    2.2.2 Functional Process Models ....... 15
  References ....... 18
3 Speech Quality Assessment ....... 21
  3.1 Levels of Analysis ....... 21
    3.1.1 Subjective and Behavioral Levels ....... 22
    3.1.2 Primer: Event-Related Brain Potential (ERP) Technique ....... 26
    3.1.3 Neurophysiological Level ....... 34
  3.2 Measuring Criteria ....... 37
    3.2.1 Objectivity, Reliability, and Validity ....... 38
    3.2.2 Sensitivity, Diagnosticity, and Intrusiveness ....... 40
  References ....... 41
4 Functional Model of Quality Perception (Research Questions) ....... 47
  References ....... 51
5 Discrimination of Speech Quality Change Along Perceptual Dimensions (Study I) ....... 55
  5.1 Motivation ....... 55
  5.2 Pre-test ....... 57
    5.2.1 Methods ....... 58
    5.2.2 Results ....... 60
    5.2.3 Discussion ....... 60
  5.3 Main Experiment ....... 61
    5.3.1 Methods ....... 61
    5.3.2 Hypotheses ....... 68
    5.3.3 Results ....... 69
    5.3.4 Discussion ....... 71
  5.4 Conclusion ....... 76
  Appendix ....... 78
  References ....... 86
6 Discrimination of Speech Quality Change Under Varying Semantic Content (Study II) ....... 89
  6.1 Motivation ....... 89
  6.2 Pre-test ....... 91
    6.2.1 Methods ....... 92
    6.2.2 Results ....... 93
    6.2.3 Discussion ....... 93
  6.3 Main Experiment ....... 93
    6.3.1 Methods ....... 93
    6.3.2 Hypotheses ....... 98
    6.3.3 Results ....... 100
    6.3.4 Discussion ....... 102
  6.4 Conclusion ....... 107
  Appendix ....... 109
  References ....... 118
7 Talker Identification Under Varying Speech Quality and Spatialization (Study III) ....... 121
  7.1 Motivation ....... 121
  7.2 Main Experiment ....... 123
    7.2.1 Methods ....... 124
    7.2.2 Hypotheses ....... 130
    7.2.3 Results ....... 131
    7.2.4 Discussion ....... 133
  7.3 Post-test ....... 138
    7.3.1 Methods ....... 138
    7.3.2 Hypotheses ....... 140
    7.3.3 Results ....... 140
    7.3.4 Discussion ....... 142
  7.4 Conclusion ....... 144
  References ....... 145
8 General Discussion ....... 149
  8.1 Attentional Resource Allocation in Quality Perception ....... 149
  8.2 Internal Processes Underlying Talker Identification ....... 153
  References ....... 157
9 General Conclusion and Outlook ....... 159
  References ....... 164
Index ....... 167

Acronyms

ACR  Absolute category rating
AEP  Auditory evoked potential
AMR-WB  Adaptive multi-rate wideband
ANOVA  Analysis of variance
ANS  Autonomic nervous system
ASL  Active speech level
BAEP  Brainstem auditory evoked potential
BAQ  Basic audio quality
CAEP  Cortical auditory evoked potential
CCR  Comparison category rating
CMOS  Comparison mean opinion score
CNS  Central nervous system
Col  Coloration-impaired
cRT  Correct response time
DCR  Degradation category rating
Dis  Discontinuity-impaired
DMOS  Degradation mean opinion score
ECG  Electrocardiography
EDA  Electrodermal activity
EEG  Electroencephalography
EMG  Electromyography
ERP  Event-related brain potential
FA  Factor analysis
FER  Frame erasure rate
FIR  Finite impulse response
fMRI  Functional magnetic resonance imaging
HQ  High-quality
HRTF  Head-related transfer function
ICA  Independent component analysis
IFCN  International Federation of Clinical Neurophysiology
IP  Internet protocol
ITU  International Telecommunication Union
ITU-T  International Telecommunication Union—Telecommunication Standardization Sector
LMEM  Linear mixed-effects model
LPC  Late positive component
LPP  Late positive potential
LQ  Low-quality
MCN  Modified combinatorial nomenclature
MDS  Multidimensional scaling
MEG  Magnetoencephalography
MMN  Mismatch negativity
MNRU  Modulated noise reference unit
MOS  Mean opinion score
Noi  Noisiness-impaired
OLE  Overall listening experience
PCA  Principal component analysis
PS  Pairwise similarity
QoE  Quality of experience
QoS  Quality of service
RFE  Random frame erasure
RIR  Room impulse response
RT  Reaction time
S-R  Stimulus-response
SD  Semantic differential
SNR  Signal-to-noise ratio
TI  Talker identification
TTS  Text-to-speech
WI  Word intelligibility
WMA  World Medical Association

Chapter 1

Introduction

An ever-growing portion of modern-day life is mediated by information and communication technologies. As a consequence, influencing factors that impair multimedia signal transmission increasingly determine human perception and behavioral interaction in diverse contexts of use. Practical assurance of “quality” for those technologies has traditionally been limited to the fulfillment of functional requirements put forward by different stakeholders. In recent years, this traditional approach has been revised to accentuate the subjective experience of users, who are either passively exposed to or actively engaged with multimedia applications, systems, and services [1]. Accordingly, transmitted multimedia signals (e.g., audio/speech, image, video) leave an immediate subjective impression of “quality” in human perceivers, based on which they may derive cognitive judgments—for instance, by rating their “preference” toward and deemed “usefulness” of systems and services—most explicitly in the context of empirical user tests.

Testing and model-based prediction of users’ experienced quality, which constitute iteratively traversed steps within the usability engineering life cycle, are regarded to be crucial prerequisites for optimizing user satisfaction and acceptability [2]. Initial research in the application domain of telecommunications centered around quality perception and evaluation of transmitted speech, both in passive listening-only and interactive conversation test scenarios [3–5].
This endeavor has led to the identification of various influencing factors on the technical system side, called “quality elements” (e.g., circuit noise, talker echo), including their physical foundation (e.g., physical properties of the network or terminal device) [4]; furthermore, methodologies for speech quality assessment and prediction models based on instrumentally measurable system parameters were developed and documented as standardized guidelines by the International Telecommunication Union (ITU) [6–8]. Later investigations specified relationships between quality elements and different aspects of subjective listening experience contributing to quality judgments, termed “quality features” [4]. In practice, knowledge about the relationships between quality elements and quality features supports diagnostic evaluation of speech communication systems [2], since degradations in specific quality features can be traced back to the associated quality elements most probably responsible for the occurrent speech quality impairments.

Assessment of experienced quality commonly relies on subjective opinion metrics, which are grounded in conscious introspection, cognitive reflection, and judgment, for instance, when using category rating scales or other psychophysical techniques (e.g., pair comparisons) [3, 4]. However, the cognitive evaluation processes underlying such subjective methods are well known to be susceptible to bias [9]. Besides, even the mere fact that participants reflect upon and judge the quality of a given test stimulus may alter what was to be measured in the first place, that is, their immediate sense of perceived quality (understood as an evaluative aspect of the percept; see Sect. 2.1.2.1). One way to circumvent this fundamental “observer problem” is to conduct subjective measurements only after the quality experience has already ceased, which in turn introduces new biases due to reliance on memory.

Motivated by the aim to counter these limitations of subjective methods, measurement techniques and experimental paradigms borrowed from psychophysiology [10–12] have more recently been adopted for multimedia quality assessment [13, 14]. In particular, noninvasive electrophysiological recording of brain activity (by means of electroencephalography, EEG) promises to deliver precise temporal correlates of quality experience, as biopotentials (voltage changes) are continuously sampled during exposure to multimedia signals serving as test stimuli, with a temporal resolution in the milliseconds range.
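To make the notion of a subjective opinion metric concrete: with an absolute category rating (ACR) scale, listeners rate each stimulus on an integer scale from 1 (bad) to 5 (excellent), and the mean opinion score (MOS) of a test condition is the arithmetic mean of those ratings, usually reported with a confidence interval. A minimal sketch in Python (the ratings shown are hypothetical, and the normal-approximation interval is illustrative rather than taken from any ITU-T tooling):

```python
from statistics import mean, stdev
from math import sqrt

def mos(ratings):
    """Mean opinion score: arithmetic mean of ACR ratings (1 = bad ... 5 = excellent)."""
    return mean(ratings)

def ci95(ratings):
    """Approximate 95% confidence-interval half-width (normal approximation)."""
    return 1.96 * stdev(ratings) / sqrt(len(ratings))

# Hypothetical ratings of one test condition by eight listeners.
ratings = [4, 3, 5, 4, 4, 2, 3, 4]
print(mos(ratings))  # 3.625
```

Comparative metrics such as DMOS and CMOS follow the same averaging logic, applied to degradation or comparison category ratings instead of absolute ones.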
However, even such neurophysiological measurements remain somewhat obtrusive, due to the necessary attachment of biosensors (electrodes) on the body surface of participants [14]; also, stimulus presentation must be tightly controlled and participants’ motor activity restrained in order to ensure that identified neurophysiological effects can be interpreted as indicating variation in quality experience. When a task requires participants to execute goal-directed motor behaviors (actions), a third class of behavioral measures is applicable to quantify task performance [15, 16], complementary to subjective and physiological measures. In test scenarios designed to be more realistic, which hence preclude the deployment of physiological methods, behavioral responses can still be captured implicitly if they form an integral or natural part of the task (e.g., speeded behavioral responding in time- or safety-critical contexts of use).

Past research has related objective physiological measures and subjective self-report measures—serving as indicators of perceived and judged quality, respectively—to varying types and magnitudes of quality impairment for transmitted speech, visual, and audio-visual signals [14, 17, 18]. Unlike subjective methods, behavioral and physiological methods enable theoretical inferences about the structure and dynamics of human information processing. The timing of internal processes is most directly reflected by neurophysiological measures of electrical brain activity and, on the behavioral side, by the speed and accuracy of task-appropriate actions. Accordingly, the following types of research questions can be formulated: Which internal processes and representations, at which stages
within the information processing hierarchy (sensory, perceptual, cognitive, action-related) are involved in a particular task? Whether, and how, are those processes influenced by properties of the technical system, as well as by contextual and content-related factors?

The present book seeks to establish a theoretical basis for systematic examinations of the information processing chain inside human perceivers who passively listen to transmitted speech varying in perceived quality. A functional model of (speech) quality perception is put forward which postulates a hierarchy of internal processes and representations along different (auditory) processing stages. The presented model can more specifically be classified as “psychophysiological,” as it describes functional relations between psychological constructs and neurophysiological indicators of those constructs. It offers a theoretical integration of various earlier, purely “psychological” models of internal quality formation (i.e., models which contain psychological constructs but do not specify any physiological indicators) with models from basic psychophysiology that explain the functional significance of the underlying neurophysiological phenomena utilized as indicators.

The proposed model is assumed to have practical utility for the quality assessment of speech-based information and communication technologies. On the one hand, insights into structural and dynamic aspects of human information processing should improve the validity of theoretical assumptions in traditional subjective assessment and prediction of speech transmission quality. On the other hand, it should enable a systematic examination of the internal processes required for effective and efficient completion of cognitive and behavioral tasks. Such a process-oriented approach is considered relevant since quality impairments may exert very selective effects at different information processing stages, depending on contextual and content-related influencing factors.

The model further promotes a multi-method approach toward quality assessment, which allows testing for convergence of observed result patterns across multiple levels of analysis (subjective, behavioral, neurophysiological); also, based on the model, behavioral and neurophysiological measures can be chosen in advance to indicate specific internal processes. This book presents three experimental studies on the topic of speech transmission quality, tackling perceptual and cognitive processes along the auditory processing hierarchy, which provide a first empirical validation of the proposed model.

In Chap. 2, conceptual and explanatory approaches surrounding the subjective phenomenon of “quality” are elaborated. Chapter 3 introduces concepts and methods for practical assessment of speech transmission quality, including a short primer on the event-related brain potential (ERP) technique. The functional model of quality perception and the research questions are described in Chap. 4. Chapters 5, 6, and 7 present the results of Studies I–III, which test more detailed hypotheses derived from these research questions. More general theoretical interpretations of the reported empirical result patterns are laid out in Chap. 8. Lastly, Chap. 9 concludes with a summary of the main results, answers the research questions, and provides an outlook on future research directions.
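As a flavor of the ERP technique primed above: single-trial EEG recordings are far too noisy to reveal event-related potentials, so many epochs time-locked to stimulus onset are averaged sample by sample, letting activity uncorrelated with the stimulus cancel out. A self-contained sketch with synthetic data (no EEG toolbox is assumed; the waveform and noise level are invented purely for illustration):

```python
import random

def average_erp(epochs):
    """Average time-locked epochs sample-by-sample; noise that is uncorrelated
    with stimulus onset tends toward zero, leaving the event-related potential."""
    n = len(epochs)
    length = len(epochs[0])
    return [sum(ep[t] for ep in epochs) / n for t in range(length)]

# Synthetic demonstration: a fixed "ERP" waveform buried in random noise.
random.seed(0)
erp = [0.0, 1.0, 3.0, 1.0, 0.0]                    # hypothetical component peaking at sample 2
epochs = [[s + random.gauss(0, 2.0) for s in erp]  # single trials: signal plus heavy noise
          for _ in range(500)]

avg = average_erp(epochs)
peak = max(range(len(avg)), key=lambda t: avg[t])
print(peak)  # with enough trials, the peak is recovered at sample 2
```

With 500 trials the standard deviation of the averaged noise shrinks by a factor of sqrt(500), which is why the buried peak becomes clearly visible; real ERP studies apply the same principle after filtering and artifact rejection.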

1 Introduction

References

1. P. Le Callet, S. Möller, A. Perkis (eds.), Qualinet white paper on definitions of quality of experience, in European Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003), Version 1.2 (2013)
2. S. Möller, Quality Engineering: Qualität kommunikationstechnischer Systeme (Springer, Heidelberg, 2010)
3. S. Möller, Assessment and Prediction of Speech Quality in Telecommunications (Springer, Boston, 2000)
4. U. Jekosch, Voice and Speech Quality Perception: Assessment and Evaluation, ser. Signals and Communication Technology (Springer, Berlin, 2005)
5. A. Raake, Speech Quality of VoIP: Assessment and Prediction (Wiley, Chichester, 2006)
6. ITU-T Recommendation P.800, Methods for Subjective Determination of Transmission Quality (International Telecommunication Union (ITU), Geneva, 1996)
7. ITU-T Recommendation P.805, Subjective Evaluation of Conversational Quality (International Telecommunication Union (ITU), Geneva, 2007)
8. ITU-T Recommendation G.107, The E-Model: A Computational Model for Use in Transmission Planning (International Telecommunication Union (ITU), Geneva, 2015)
9. S. Zieliński, F. Rumsey, S. Bech, On some biases encountered in modern audio quality listening tests–a review. J. Audio Eng. Soc. 56(6), 427–451 (2008)
10. K. Hugdahl, Psychophysiology: The Mind-Body Perspective, ser. Perspectives in Cognitive Neuroscience (Harvard University Press, Cambridge, 1995)
11. J.L. Andreassi, Psychophysiology: Human Behavior and Physiological Response, 5th edn. (Psychology Press, New York, 2007)
12. J.T. Cacioppo, L.G. Tassinary, G.G. Berntson (eds.), Handbook of Psychophysiology, 4th edn. (Cambridge University Press, Cambridge, 2016)
13. S. Arndt, K. Brunnström, E. Cheng, U. Engelke, S. Möller, J.-N. Antons, Review on using physiology in quality of experience. Electron. Imaging 2016(16), 1–9 (2016)
14. U. Engelke, D.P. Darcy, G.H. Mulliken, S. Bosse, M.G. Martini, S. Arndt, J.-N. Antons, K.Y. Chan, N. Ramzan, K. Brunnström, Psychophysiology-based QoE assessment: a survey. IEEE J. Sel. Top. Signal Process. 11(1), 6–21 (2017)
15. A.F. Sanders, Elements of Human Performance: Reaction Processes and Attention in Human Skill (Lawrence Erlbaum Associates, Mahwah, 1998)
16. R.W. Proctor, T. Van Zandt, Human Factors in Simple and Complex Systems, 3rd edn. (CRC Press, Boca Raton, 2018)
17. J.-N. Antons, Neural Correlates of Quality Perception for Complex Speech Signals, ser. T-Labs Series in Telecommunication Services (Springer, Cham, 2015)
18. S. Arndt, Neural Correlates of Quality During Perception of Audiovisual Stimuli, ser. T-Labs Series in Telecommunication Services (Springer, Singapore, 2016)

Chapter 2

Speech Quality Fundamentals

2.1 Conceptual Approaches

In traditional quality management, the eponymous concept of quality has been defined as the “degree to which a set of inherent characteristics fulfills requirements” (p. 18) [1], with requirements referring to the “totality of features and characteristics of a product or service that bear on its ability to satisfy stated or implied needs” (p. 14) [2]. Accordingly, quality presents a key criterion for the evaluation of technical products (i.e., applications, systems) and services in terms of their functional features, which are derived from requirement specifications by different stakeholders (including system designers, service providers, customers, end users). However, this classic notion of quality has further evolved with the rise of new multimedia information and communication technologies and the increasingly interdisciplinary character of quality engineering, which has adopted concepts and methods from telecommunications, linguistics, psychology, and psychophysics, among others.

2.1.1 Quality of Service (QoS)

The approach of quality of service (QoS) transfers the classic notion of quality to the telecommunications domain, rephrasing it as “the totality of characteristics of a telecommunications service that bear on its ability to satisfy stated and implied needs of the user of the service” (p. 26) [3]. QoS exclusively addresses objective aspects of telecommunication networks and, more generally, of multimedia products and services [4]: on the one hand, it may denote the instrumental assessment of the performance of a particular system (e.g., with respect to data throughput, jitter,

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 S. Uhrig, Human Information Processing in Speech Quality Assessment, T-Labs Series in Telecommunication Services, https://doi.org/10.1007/978-3-030-71389-8_2

delay, or packet loss in networks based on the Internet protocol [IP] [5]). Resulting system performance measures (QoS metrics) depend directly on physical properties and the interplay of relevant system components. On the other hand, QoS may also refer to the implementation of mechanisms (QoS architectures) that ensure a high level of measured system performance (e.g., differentiated services, DiffServ; integrated services, IntServ).
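As a minimal illustration of such QoS metrics (the helper names are hypothetical, not from the book), the following sketch computes a smoothed interarrival jitter in the spirit of RFC 3550 and a simple packet-loss rate:

```python
from typing import Sequence

def interarrival_jitter(transit_times_ms: Sequence[float]) -> float:
    """Smoothed interarrival jitter estimate in ms (RFC 3550-style),
    computed from per-packet transit times."""
    j = 0.0
    for prev, cur in zip(transit_times_ms, transit_times_ms[1:]):
        d = abs(cur - prev)   # transit-time difference between packets
        j += (d - j) / 16.0   # exponential smoothing with gain 1/16
    return j

def packet_loss_rate(sent: int, received: int) -> float:
    """Fraction of packets lost between sender and receiver."""
    return (sent - received) / sent if sent else 0.0
```

Constant transit times yield zero jitter; increasingly irregular packet arrival drives the estimate up, which is why jitter serves as a QoS performance metric for IP-based speech transmission.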

2.1.2 Quality of Experience (QoE)

Recently, an alternative approach to QoS has emerged under the name of quality of experience (QoE), which adopts the point of view of the individual human user who perceives and interacts with technical products. In line with this perspective change, a much broader construct of overall quality (dubbed “QoE”) was formulated as the “degree of delight or annoyance of a person whose experiencing involves an application, service, or system, [which results] from the person’s evaluation of the fulfillment of his or her expectations and needs with respect to the utility and/or enjoyment in the light of the person’s context, personality and current state” (p. 22) [6]. In quality evaluation of audio devices and systems [7], overall quality corresponds to overall listening experience (OLE), another compound construct including different evaluative and affective aspects of listeners’ experience, such as “pleasantness,” “enjoyment,” and “likability” during extended listening episodes [8]. This new conceptualization acknowledges interactions between a wide variety of influencing factors during the formation of experienced quality. A common taxonomy differentiates between three classes of influencing factors [9, 10]: human (e.g., gender, age, expertise and abilities, personality traits), system (e.g., network parameters/signals, impairment factors during transmission), and reciprocally connected layers of context (including spatiotemporal, environmental/stimulus, interactional and task-related, sociocultural). Sometimes, a fourth class of influencing factors related to content is distinguished (e.g., bottom-up saliency or perceptual salience [11–13], defining “a salient auditory event as one that deviates from the feature regularities in the sounds preceding it”, p. 2, [14]; semantic meaning [5, 15]).
It becomes apparent from the QoE viewpoint that experienced quality is fundamentally subjective and relative in nature, due to immanent and transient variability in human, contextual, and content-related influencing factors. As a consequence, valid assessment and evaluation of experienced quality will always necessitate direct testing of human participants [16], who perceive, judge, and describe their subjective impressions of (multisensory) test stimuli or behavioral interactions with a technical product, while being situated in a certain context of use [17]. QoE metrics have been classified into subjective opinion metrics (e.g., for perceptual quality dimensions, overall quality, acceptance) and objective metrics derived from behavioral and physiological measurements quantifying the human-system interaction [18].

2.1.2.1 Quality Elements Versus Quality Features

According to an early definition by Jekosch, experienced quality (or a “quality event”) presents the outcome of a cognitive “judgment of the perceived composition of an entity with respect to its [expected or] desired composition” (p. 15) [19]. The abstract term entity refers to a material or immaterial object or event under observation, its composition to the realized “totality of the features [...] of an entity” (p. 11). Features denote “recognizable and nameable characteristic[s] of an entity” (p. 16); orthogonal or independent features are termed dimensions. All features or dimensions are realized by means of concrete feature or dimension values. In the objective realm of the physical world, an entity is a physical event (e.g., a sound event) that can be fully described by a composition of physical features (sound pressure level, fundamental frequency, etc.). Multiple physical dimensions span a physical space (sound space), which includes all physical events. On the contrary, the subjective realm of human perception involves conscious experience in the form of perceptual events or percepts (e.g., an auditory event); each perceptual event is characterized by a perceived composition of numerous perceptual features (pitch, timbre, etc.) with singular feature values. The realized “totality of features of individual expectations and/or relevant demands and/or social requirements” (ibid.), termed expected (or desired) composition, enables, when pertaining to perceptual features, one-to-one matches with feature values of a complementary perceived composition. Jekosch also stressed the important distinction between quality elements and quality features: a quality element refers to a “contribution to the quality of a material or immaterial product as the result of an action/activity or a process in one of the planning, execution, or usage phases” (p. 16) [19].
In short, quality elements are objective, instrumentally measurable parameters of a system or properties of a communication channel (quantifiable as QoS metrics) that act as influencing factors on experienced quality (e.g., signal-to-noise ratio, bit rate). A quality feature denotes “a recognized and designated characteristic of an entity that is relevant to the entity’s quality” (p. 17) [19]. Thus, being understood as constituent parts of subjective experience, quality features are subsets of perceptual, conceptual, and affective features that contribute to experienced quality. Orthogonal quality features are referred to as quality dimensions. With regard to human-human interactions over telecommunication networks, Möller has put forward a QoS taxonomy that categorizes and interrelates relevant quality elements and quality features [16, 20, 21]. On the two opposing sides of technical system and human user, quality elements and quality features are grouped into quality factors and quality aspects, which in turn integrate into QoS and QoE (overall quality), respectively. One category of quality factors, speech communication factors, comprises “all factors which can be directly associated with the communication between the human partners over the speech-transmission system” (p. 170) [21], and is again divided into three subcategories (one-way voice [or speech] transmission quality, ease of communication, conversation effectiveness), all of which determine communication efficiency. In particular, those quality

elements put under the first subcategory of speech transmission quality (e.g., packet/frame loss rate, signal-to-noise ratio, transmission bandwidth) exhibit most direct relations to perceptual quality features (e.g., perceived “discontinuity,” “noisiness,” “coloration”; as identified by Wältermann et al. [22, 23]) that are combined to perceptual quality aspects on the user (listener) side. Here, speech transmission quality exclusively concerns one-way transmission to passive listeners (i.e., who fulfill the communicative role of receivers), whereas the other two subcategories of quality factors necessitate two-way communication between alternately listening and talking interlocutors during interactive conversations (i.e., in switching roles as senders and receivers).

2.1.2.2 Perceived Quality Versus Judged Quality

Both Jekosch’s initial definition and later QoE-centered reformulations identify experienced quality with “the outcome of an individual [person]’s comparison and judgment process” that comprises “perception, reflection about the perception, and the description of the outcome” (p. 4) [9]. Accordingly, the quality of a stimulus is only experienced after it has been cognitively evaluated. The present book further expands this traditional account by proposing a conceptual distinction between perceived quality and judged quality. In their daily environments, both natural and technologically mediated, humans are confronted with physical stimuli of varying “degree of quality.” Examples are walking outdoors in the misty dark or near a street with loud traffic noise, which greatly reduces the sensory detail of visual or auditory objects, and likewise, in a context of multimedia technology use, watching a video or audio stream with low resolution and bit rate. A global sense of “degradation” also arises under conditions of sensory impairment, for instance, as experienced by ametropic persons or persons with mild to moderate hearing loss after removing their vision or hearing aid. If such degradations are of very high intensity, they may immediately invoke negative affect: from an evolutionary viewpoint, lack of clear vision and audition in natural environments has proven to be potentially dangerous, facilitating emotional response dispositions [24]. Lowered video and audio quality during business telemeetings, as caused by unstable network connections, often leads to loss of task-relevant information and social cues, which in turn increases stress and frustration in participants. Beyond the abovementioned examples involving significant negative affective experience, even more subtle degradation intensities may already possess a somewhat negative tone.1

1 A similar notion of “micro-valence” or perceived valence is discussed by Lebrecht and colleagues [25]: “Everyday objects automatically evoke some perception of valence[, which] can be considered a higher-level object property that connects vision [and audition] to behavior [...]. Thus, valence is not a label or judgment applied to the object postrecognition, but rather an integral component of mental object representations” (p. 1). In contrast, affective valence would only be associated with fully fledged, high-arousal emotional and motivational states or, e.g.,

It is claimed here that a perceptual event’s perceived composition already contains a perceptual feature of perceived quality (or perceived degradation), which is evaluative (also: “emotive” [26]) rather than descriptive in character. Hence, it already carries with it an evaluative tone instead of merely signifying the neutral presence of a certain perceptual feature value. As a higher-order, evaluative perceptual feature, perceived quality integrates lower-order, descriptive quality features (such as “noisiness,” “loudness”) yet is immediately experienced, without necessitating subsequent cognitive reflection and judgment [27]. This idea resonates with the notion of experiencing, defined as “the individual stream of perceptions (of feelings, sensory percepts and concepts) that occurs in a particular situation of reference,” although the authors again regard “quality [. . . ] to be the result of additional cognitive processes on top of experiencing” (p. 13) [6]. In view of more basic sound quality, Blauert and Jekosch proposed another related concept of “sound-quality-as-such” (later termed auditive quality in [28]) to denote “the quality of the auditory event as such, i.e. in its pure, non-interpreted form” (p. 3) [29]. Taken together, perceived quality, due to constituting a part of the percept, is immediately experienced and does not necessarily require higher-cognitive processing but rather presents a more abstract, evaluative perceptual feature resulting from the integration of more concrete, descriptive perceptual features. Nonetheless, the perceived quality of a stimulus can be intentionally reflected and judged upon in order to arrive at a quality judgment. The experience associated with this evaluation outcome is then termed the judged quality of the stimulus. As cognitive evaluation processes are usually activating a wide variety of complex associations (including “denotations, connotations and interpretations”, p. 
200, [30]), judged quality presents more of an amalgam of perceptual, conceptual, and affective content. Figure 2.1 illustrates the theorized organization of the percept, being composed of many perceptual features that form a descriptive-evaluative continuum. Examples for evaluative perceptual features include “valence” and “pleasantness,” both of which have been put in relation to “quality” before [23]. The QoE definition emphasizes the “delight or annoyance of the user” (p. 6) and his/her “enjoyment” with a system, service, or application as some of its central facets [9, 31]. OLE, the equivalent of QoE in the audio technology domain (see prior Sect. 2.1.2), is likewise clearly characterized as evaluative, possibly even affective: to operationalize OLE, “participants have been asked to rate the stimuli according to how much they like, enjoy, or feel pleased when listening to the stimuli. Thereby, participants are allowed to involve affective aspects” (p. 78) [8].

with startle responses triggered by more intense stimulatory change. Either valence dimension describes a continuum from “negative/bad” or “unpleasant” to “positive/good” or “pleasant” [24], characterizing the evaluative percept (perceived valence) or affective experience (affective valence).

Fig. 2.1 Organization of the percept, embedded within the totality of experience. A hierarchical order of perceptual features is assumed, the character of which ranges from descriptive to evaluative. An extreme example for a descriptive (geometric) feature is the “perceived location” of an object within the perceptual scene [32]; an extreme example for an evaluative feature is the “perceived valence” (positive-negative) of an object [33]. Constituting a subset of perceptual features, perceptual quality features (depicted in blue) are more descriptive in character relative to the evaluative feature of “perceived quality” (depicted in purple). Experiential outcomes of cognitive evaluation (broad arrow), being based on the percept, may belong to conceptual, imaginary, and affective parts (outside the percept) of holistic experience

OLE extends the narrower construct of basic audio quality (BAQ) commonly used in the audio engineering field for evaluating monophonic, stereophonic, and multi-channel audio systems. Two definitions of BAQ have been brought forward: in the first definition, BAQ is understood to be a “global” perceptual feature integrating a number of concrete perceptual features (e.g., reverberance, echo, background noise) [7, 34], akin to the idea of perceived quality elaborated above. The second definition regards BAQ as being the “fidelity with which a signal is transmitted or rendered by a system” (p. 100) [35], that is, “any and all detected differences between the reference and the object” (p. 7) [36]; here, the terms “reference” and “object” describe the functional significance of subsequent auditory events in the task context of a subjective listening test, during which participants explicitly compare object stimuli with reference stimuli, to judge perceptual (quality) feature values of the former relative to the latter.

2.1.2.3 Speech Quality: Form Versus Content

Natural spoken language or speech has evolved as the primary means for communication, that is, exchange of information between humans. An acoustic speech signal consists of a temporal sequence of minimally distinguishable speech sounds or phones—termed phonemes if swapping them alters the conveyed semantic meaning (see below)—that can be concatenated to syllables, words, and sentences [5]. Speech communication systems enable technological mediation of speech signals through communication channels (e.g., via traditional wireline networks in plain

old telephone service [“POTS”] or via IP-based networks used in modern voice-over-IP services [5]). This comprises one-way transmission to single end users and two-way communication between multiple users (e.g., interlocutors during a conversation). Several quality impairment factors, like properties and parameters of the transmission path and terminal equipment [16, 20, 37], but also ambient noise in the listening environment may affect the perceived quality of transmitted speech. It is vital to distinguish between the acoustic (surface) form and the (semantic) content of speech signals. Application of semiotics, the study of signs and their meaning, to phenomena of sound quality in general [19, 30] and speech quality in particular [5, 20] has offered a precise terminology for describing speech quality perception: in semiotic terms, speech constitutes a sign system; the speech signal with its acoustic form represents a sign carrier; human central information processing of the sensorially transduced speech signal leads up to its perceived form, that is, a perceptual event with a perceived composition of perceptual (quality) features, including an evaluative feature of perceived quality; the perceived form functions as a sign if it triggers associated (semantic) objects, encompassing concrete imagined events (images) retrieved from perceptual memory, motivational and emotional states (affects), and more abstract ideas retrieved from semantic memory (concepts); the totality of imaginary, affective, and conceptual objects constitutes (semantic) meaning. Typically, the meaning of speech signs (e.g., words) is determined by linguistic convention, except for auditory icons, where meaning is based on perceptual similarity to acoustic sources [30]. This semiotic terminology, which will be used throughout the following sections and chapters, is illustrated in Fig. 2.2.
The acoustic form of speech signals (and quality impairments thereof) always produces a perceived (degraded) form, irrespective of any existent or non-existent semantic content. However, it is known that content factors can influence quality judgments of the perceived form, for instance, when participants are directing their attention preferentially to task-relevant content features (dimensions) rather than to perceptual quality features (dimensions) [16, 20]. But also the reverse direction of

Fig. 2.2 Listening situation, in which a human listener receives a transmitted speech signal occurrent in the physical environment, possessing a particular acoustic form. Quality impairment factors may degrade the acoustic form. In the listener’s subjective experience, the perceived form of the speech signal (percept) triggers associated semantic content (including concepts, images, affects). Perceived quality refers to an evaluative perceptual feature characterizing the perceived form, that is closely associated with (negative) affect. Corresponding semiotic terms are presented in brackets

quality factors influencing content may occur, namely, when quality impairments are imposed onto or modify portions of the speech signal surface form that carry meaningful information; corresponding perceived degradations would then mask or alter the experienced semantic meaning. As a general rule of thumb, change in speech transmission quality implies variation in perceived form, but not necessarily in semantic meaning, while change in speech content entails variation in both perceived form and meaning. Users of speech communication systems and services judge (overall) quality quite differently depending on whether they only passively receive transmitted speech signals (one-way transmission) or actively engage in conversation and have to exchange semantic information with interlocutors, thus fulfilling both sender and receiver roles (two-way communication) [16]. During listening-only test scenarios, the influence of content may be minor if participants are not explicitly attending to it; during interactive conversation test scenarios, content-related influencing factors are becoming more important, as users must extract task-relevant semantic information to be able to follow the course of the exchange and ultimately accomplish their task goals [16, 20]. Moreover, mental effort might be strained by listening-only or conversation task demands in quite specific ways depending on the nature of the underlying internal information processing (e.g., as indicated by subjectively judged listening effort, concentration effort, talker recognition effort, topic comprehension effort; see [38]). The present book focuses on quality perception of one-way transmitted speech in listening-only task contexts. Selected quality impairment factors are systematically manipulated to degrade the acoustic form of speech signals in order to affect perceived quality (features) on the human listener side.

2.2 Explanatory Approaches

Taking a utilitarian stance (see upcoming Sect. 3.1.1), the QoE approach ultimately aims to predict overall quality under naturalistic test conditions, ideally in real-life contexts of technology use. For the purpose of either instrumental or perception-based quality estimation, analytical quality prediction models have been built in the past [21, 23, 39]: using multiple regression models, either integral quality, a perceptual (quality) dimension, or another subjective construct (e.g., speech intelligibility, listening effort) may serve as criterion variable; different quality elements (as quantified by signal-based measures or QoS performance metrics) or perceptual (quality) dimensions serve as predictor variables. The term integral quality refers to a single higher-order feature, which integrates multiple lower-order dimensions [23] (see upcoming Sect. 2.2.1). Therefore, as was described in Sect. 2.1.2.2, integral quality would be equivalent to perceived quality when experienced after perception and equivalent to judged quality when experienced after additional cognitive evaluation. Practically useful quality prediction models promise a more efficient evaluation of systems and services in terms of integral

quality, since costly and time-consuming empirical user tests could be greatly reduced in scope or even completely avoided. Another goal of the QoE approach lies in achieving a better theoretical understanding of information processing inside human perceivers and evaluators of experienced quality. Functional “boxes-and-arrows” models [40] have been devised to specify relationships between internal processes and representations, sensory signals, and behavioral responses in terms of their functional roles—that is, processes producing inputs for other processes; sensory signals, representations, and behavioral responses presenting inputs to and outputs of processes—hereby detailing the flow of information through different phases or stages of processing. Eventually, at some point along this internal processing chain, a subjective sense of quality is formed. With respect to the first and foremost “psychological” nature of these internal processes, Berntson and Cacioppo believed that “useful theories will be based on a higher level of functional [i.e. psychological] description, constrained and calibrated by knowledge of the underlying [neuro-]physiology” (p. 370) [41]. To date, two principal approaches exist to model perceived and judged quality, the first one being characterized as “psychophysical” and the second one pertaining to a functional description of human information processing. Representative models derived from either approach will be explicated in the following subsections.
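The regression-based prediction described above can be sketched with ordinary least squares; all ratings below are invented for illustration, with perceptual quality dimensions as predictors and integral quality as the criterion:

```python
import numpy as np

# Hypothetical data: rows = stimuli, columns = perceptual quality
# dimensions (e.g., discontinuity, noisiness, coloration).
X = np.array([
    [0.1, 0.2, 0.1],
    [0.8, 0.1, 0.2],
    [0.2, 0.7, 0.3],
    [0.6, 0.6, 0.5],
    [0.1, 0.1, 0.9],
])
y = np.array([4.5, 2.8, 3.0, 2.2, 3.4])  # integral quality (criterion)

# Augment with an intercept column and solve y ≈ X_aug @ beta.
X_aug = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

y_hat = X_aug @ beta  # predicted integral quality per stimulus
```

The same scaffold covers the other model variants mentioned in the text: swapping the criterion for a single perceptual dimension, or the predictors for signal-based QoS metrics, changes only the contents of `y` and `X`.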

2.2.1 Psychophysical Models

Psychophysical models establish formal relationships between physical features of test stimuli, which are defined as physical events (see Sect. 2.1.2.1), and perceptual features of their associated perceptual events. Multidimensional analysis decomposes experienced quality into multiple perceptual quality dimensions [5, 23, 42, 43]. To this end, subjective tests need to be carried out involving human participants who describe aspects of their experience of test stimuli on sets of rating scales. Afterwards, the collected quantitative rating data are analyzed by means of dimensionality reduction techniques to identify a minimum number of perceptual dimensions underlying the behavioral descriptions. Several authors have extracted different kinds and numbers of perceptual quality dimensions for speech transmission quality, employing different stimulus materials and different combinations of psychoacoustic paradigms for data collection and statistical techniques for dimensionality reduction [23]. Two general approaches can be distinguished [16, 23]: in the first approach, a pairwise similarity (PS) paradigm is employed, wherein participants have the task of comparing pairs of stimuli on a continuous, bipolar similarity rating scale (labeled with “very similar” and “not similar at all” at the two scale ends). Following data collection, multidimensional scaling (MDS) is carried out on the gathered PS ratings. In MDS, meaningful labels for the extracted dimensions are only derived a posteriori, through subjective interpretation by the assessor. The second approach relies on the semantic differential (SD) paradigm, during which participants rate

the test stimuli on a set of continuous, bipolar scales, each one being labeled with antonyms at the scale ends that describe extreme values of perceptual features (e.g., loudness: quiet-loud). Based on these SD ratings, a principal component analysis (PCA) or factor analysis (FA) is calculated. The orthogonal principal components or factors extracted in this way are identifiable as the initially chosen perceptual features. In either approach, an empirical estimate of integral quality is also collected, typically in the form of subjective quality metrics obtained with category rating paradigms (see upcoming Sect. 3.1.1). Both PS and SD paradigms require large sets of well-defined test stimuli or stimulus classes with varying types and intensities of induced quality impairments. Only then does dimensionality reduction by means of MDS or PCA/FA enable a valid minimization of the number of dimensions that determine judged quality in a certain measurement (including stimulus, task) context. Moreover, having provided the basis for cognitive reflection and judgment, immediate perceived quality (being a part of the percept; see Sect. 2.1.2.2) could now be understood as an evaluative integration of more descriptive perceptual quality dimensions. Single dimensions may be monotonically positively or negatively related to integral quality (e.g., expressed in words as “the lower the degree of noisiness, the higher the integral quality”) or possess ideal points corresponding to highest integral quality (“a certain degree of noisiness is associated with optimal integral quality”) [5, 23, 43].
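The PCA step of the SD approach can be sketched with a singular value decomposition of centered rating data; the ratings and scale labels below are invented for illustration:

```python
import numpy as np

# Hypothetical SD ratings: rows = stimuli, columns = bipolar scales
# (e.g., quiet-loud, clear-muffled, continuous-interrupted, dull-bright).
R = np.array([
    [0.9, 0.2, 0.8, 0.1],
    [0.1, 0.8, 0.2, 0.9],
    [0.8, 0.3, 0.7, 0.2],
    [0.2, 0.9, 0.1, 0.8],
    [0.5, 0.5, 0.5, 0.5],
])

Rc = R - R.mean(axis=0)              # center each rating scale
U, s, Vt = np.linalg.svd(Rc, full_matrices=False)
explained = s**2 / np.sum(s**2)      # variance explained per component
scores = Rc @ Vt.T                   # stimulus coordinates on components

# Retain the components explaining most variance, e.g., the first two:
reduced = scores[:, :2]
```

The rows of `Vt` (the component loadings) show how strongly each bipolar scale contributes to each extracted dimension, which is what makes the components identifiable as the initially chosen perceptual features.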
The particular kinds and numbers of identified perceptual quality dimensions critically depend on a multitude of human, system, contextual, and content-related influencing factors [10], including the sampled population of participants (e.g., hearing ability, quality rating expertise), test equipment (type of headphones or loudspeakers), test environment (room acoustics), stimulus material, and combination of experimental paradigm and dimensionality reduction technique (PS and MDS vs. SD and PCA/FA). In the vector model, a set of a priori selected perceptual dimensions spans a Euclidean perceptual space.2 Test stimuli are represented within this space as points described by Cartesian coordinates (i.e., quantifying their combined values on all perceptual dimensions) [5, 23, 43]. The space also contains a quality vector for the process of external preference mapping: accordingly, integral quality is “monotonically related to the [orthogonal] projection of a point onto this vector,” the vector being directed “towards optimum quality” (p. 52) [23]. Figure 2.3 shows an example of a vector model for speech transmission quality, assuming a two-dimensional perceptual space. Regarding speech transmission quality, a variety of analytical, multidimensional solutions exists (for a comprehensive overview, see [23]). By combining the two approaches mentioned above, Wältermann and colleagues have identified three perceptual quality dimensions, “discontinuity,” “noisiness,” and “coloration,” under the restriction that “loudness” is kept constant—the reason for this being that other authors like McDermott have previously considered “loudness” to be a perceptual

2 The

vector model actually presents a special case of an ideal point model, as elaborated in [23].

2.2 Explanatory Approaches

15

Fig. 2.3 Vector model for speech transmission quality. Test stimuli (HQ, LQ-Noi, LQ-Col) are represented as black points in a perceptual space spanned by two dimensions (dim.), with coordinates (d1 , d2 )T quantifying combined dimension values for each stimulus (blue). In external preference mapping, points are orthogonally projected onto the quality vector, with projections q quantifying feature values for integral quality (purple). This figure is a modified version of Fig. 2.5 in [5] (p. 38), copyright 2006, with permission from John Wiley & Sons

quality dimension on its own [44]. Those three dimensions are supposed to possess a mostly descriptive character (in contrast to the clearly evaluative character of perceived quality). Lastly, each of the identified dimensions is further decomposable into a number of subdimensions (e.g., two subdimensions of “coloration” called “directness” and “brightness” [45]; other subdimensions are listed in [46]).
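The external preference mapping described above reduces to an orthogonal projection of stimulus points onto the quality vector. A minimal numeric illustration follows; the stimulus coordinates and vector direction are invented, loosely following the labels in Fig. 2.3:

```python
import numpy as np

# Hypothetical 2-D perceptual space: dim. 1 = "noisiness", dim. 2 = "coloration".
# Stimulus coordinates (d1, d2); labels follow Fig. 2.3 (HQ, LQ-Noi, LQ-Col).
points = {
    "HQ":     np.array([0.1, 0.1]),
    "LQ-Noi": np.array([0.9, 0.2]),
    "LQ-Col": np.array([0.2, 0.8]),
}

# Quality vector directed toward optimum quality (here assumed to lie at
# low noisiness and low coloration, so it points toward the origin region).
v = np.array([-1.0, -1.0])
v_unit = v / np.linalg.norm(v)

# External preference mapping: integral quality is monotonically related to
# the orthogonal projection q of each point onto the quality vector.
q = {name: float(p @ v_unit) for name, p in points.items()}
print(q)  # larger q -> closer to optimum quality
```

With these made-up coordinates, the high-quality stimulus HQ receives the largest projection q, i.e., the highest predicted integral quality.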

2.2.2 Functional Process Models

Over the past years, several authors have introduced and successively refined functionalistic models of human information processing in quality assessment contexts [5, 6, 9, 19, 47]. Such instances of functional (process) models can be further characterized as “psychological” or “psychophysiological,” depending on whether they incorporate solely psychological constructs or also physiological constructs. The conceptual model of the quality formation process by Raake and Egger offers the most recent integration of earlier model versions or variants [6]. The present section introduces this model as visualized in Fig. 2.4, describing a person who is exposed to physical signals and, beyond uni- or multimodal perception, also wants to judge and describe his/her individual sense of quality (e.g., in the task context of a category rating paradigm). It further assumes that the person remains a passive signal receiver, hence omitting the additional internal information processing required for human-human (or human-machine) interactions; neither does the model include a system part, specifying relevant properties and parameters (i.e., quality elements) of the system or service at hand and its exerted influences on the human part (i.e., relations between quality elements and internal representations associated with perceptual quality features).

Fig. 2.4 Conceptual model of the quality formation process proposed by Raake and Egger [6]. For a detailed description, see Sect. 2.2.2. This figure is identical to Fig. 2.4 in [6] (p. 27), copyright 2014, with permission from Springer Nature

In general, the model subdivides internal quality formation into three phases: perception, cognitive quality evaluation, and encoding. Within each phase, internal
processes and representations, their interrelations, and their experiential correlates are specified. Furthermore, the person’s (emotional and motivational) state and expectations or assumptions derived from prior knowledge stored in long-term memory (including conceptual quality references, attitudes, preferences, task-related information) might interact with processes at each phase. During perception, signals from sensory modalities (most dominantly vision and audition), together with contextual information, enter central sensory processing, which produces a multidimensional topographic representation of sensory features (time, space, frequency activity) [48]. Immediately afterward, steps of perceptual organization (pre-segmentation, grouping into perceptual objects [48–50]) lead to the formation of a perceptual event (percept). In an anticipation and matching process, the perceptual event is compared against perceptual references and top-down expectations, enabling object recognition. A resultant mismatch can lead to heightened quality awareness, meaning that the person consciously notices changes in perceived quality, possibly triggering exploratory motor responses (e.g., orienting movements of eyes and head); besides, quality awareness might also be raised as a consequence of the perceptual event itself (e.g., if it contains conspicuous degradations in perceptual quality features) or by conceptual assumptions (e.g., if the current task explicitly requests quality ratings). Cognitive quality evaluation beyond pure perception might be instigated by quality awareness and experiencing (see direct quotation in Sect. 2.1.2.2), the latter being the experiential outcome of perceptual event formation. The evaluation phase starts with a reflection of perceived and expected (or desired) quality features; in an additional attribution process, the degraded character of the perceptual event is subjectively ascribed to external (technical) or internal (perceptual) causes.
After a comparison between the perceived and expected compositions, a judgment process operating on the internally represented expectation deviation produces a quality event as output, also taking into account different weightings of individual quality features.3 During a final phase of encoding, which usually happens only in the task contexts of quality tests, the quality event is explicitly assigned to an appropriate category within a chosen sign system (e.g., a numeric value or semantic label on a category rating scale). This encoded quality event may physically manifest through motor action, resulting in a behavioral description (e.g., a voiced opinion about one’s experienced quality; a quality rating on a category rating scale, selected via keypress).

3 Again, the “quality event” (or “quality” in general; see Fig. 2.4) is clearly associated with an outcome of higher-cognitive processing, namely, the quality judgment. In the terminology developed in Sect. 2.1.2.2, this way of experiencing quality would be equivalent to “judged quality.”

During interactive conversational situations involving multiple persons, additional perceptual and cognitive processes take place. Skowronek and Raake have proposed an adaptation of the above-described quality formation process model to the use context of multiparty teleconferencing services [51]. Therein, a cognitive process of quality-attention focus denotes the focusing of selective “attention to

a limited number of quality features that a user is actually using for the quality judgment” (p. 2) [51]. Quality-attention focus is assumed to critically depend on details of the test protocol like the stimulus material, the exact quality manipulations (kinds and intensities of quality impairments), task instructions, and the employed measuring instruments (questions and labels attached to category rating scales) [51]. It follows that for quality descriptions to be viewed as valid in a particular measurement context, factors influencing participants’ selective attention must always be considered. Even if human perceivers of multimedia signals have no (task-driven) intention to evaluate and describe their experienced quality, circumstances might arise such that attention is focused on perceptual quality features. For instance, this is likely to happen after sudden, unexpected, and salient changes in perceived quality, which automatically capture attention and elicit orienting responses, yet do not necessarily entail any higher-cognitive processing (including introspection, reflection, judgment) or any associated changes in judged quality. While the above-elaborated model by Raake and Egger is entirely “psychological” in terms of its incorporated constructs, first attempts have been made to add (neuro-)physiological phenomena as indicators of assumed internal processes and representations [52, 53]. The present book puts forward a detailed “psychophysiological” model of quality perception as an explanatory basis; research questions will be derived from this model and later on tested in three experimental studies (see Chaps. 5–7); the model itself will be presented in Chap. 4. It is argued here that process-oriented speech quality assessment, aiming at experimentally isolating (effects on) internal processes at temporally separate processing stages, profits from the comparative examination of three levels of analysis: subjective, behavioral, and neurophysiological.
Only a combination of methods, each targeting a different level of analysis, should make it possible to triangulate effects of quality manipulations on human information processing, which will be the topic of the following Chap. 3.

References

1. ISO 9000:2005, Quality Management Systems – Fundamentals and Vocabulary (International Organization for Standardization (ISO), Geneva, 2005)
2. ISO 8402:1994, Quality Management and Quality Assurance – Vocabulary (International Organization for Standardization (ISO), Geneva, 1994)
3. ITU-T Recommendation P.10/G.100, Vocabulary for Performance, Quality of Service and Quality of Experience (International Telecommunication Union (ITU), Geneva, 2017)
4. M. Varela, L. Skorin-Kapov, T. Ebrahimi, Quality of service versus quality of experience, in Quality of Experience, ed. by S. Möller, A. Raake (Springer, Cham, 2014), pp. 85–96
5. A. Raake, Speech Quality of VoIP: Assessment and Prediction (John Wiley & Sons, Ltd, Chichester, 2006)
6. A. Raake, S. Egger, Quality and quality of experience, in Quality of Experience, ed. by S. Möller, A. Raake (Springer, Cham, 2014), pp. 11–33
7. S. Bech, N. Zacharov, Perceptual Audio Evaluation – Theory, Method and Application (John Wiley & Sons, Ltd, Chichester, 2006)
8. M. Schoeffler, A. Silzle, J. Herre, Evaluation of spatial/3D audio: basic audio quality versus quality of experience. IEEE J. Sel. Top. Sign. Proces. 11(1), 75–88 (2017)
9. P. Le Callet, S. Möller, A. Perkis (eds.), Qualinet White Paper on Definitions of Quality of Experience, European Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003), Version 1.2 (2013)
10. U. Reiter, K. Brunnström, K. De Moor, M.-C. Larabi, M. Pereira, A. Pinheiro, J. You, A. Zgank, Factors influencing quality of experience, in Quality of Experience, ed. by S. Möller, A. Raake (Springer, Cham, 2014), pp. 55–72
11. U. Engelke, H. Kaprykowsky, H.-J. Zepernick, P. Ndjiki-Nya, Visual attention in quality assessment. IEEE Signal Process. Mag. 28(6), 50–59 (2011)
12. U. Engelke, D.P. Darcy, G.H. Mulliken, S. Bosse, M.G. Martini, S. Arndt, J.-N. Antons, K.Y. Chan, N. Ramzan, K. Brunnström, Psychophysiology-based QoE assessment: a survey. IEEE J. Sel. Top. Sign. Proces. 11(1), 6–21 (2017)
13. S. Uhrig, G. Mittag, S. Möller, J.-N. Voigt-Antons, P300 indicates context-dependent change in speech quality beyond phonological change. J. Neural Eng. 16(6), 066008 (2019)
14. E.M. Kaya, M. Elhilali, Investigating bottom-up auditory attention. Front. Hum. Neurosci. 8 (2014)
15. A. Raake, Does the content of speech influence its perceived sound quality? Sign 1, 1170–1176 (2002)
16. S. Möller, Quality Engineering: Qualität kommunikationstechnischer Systeme (Springer, Heidelberg, 2010)
17. M. Maguire, Context of use within usability activities. Int. J. Hum. Comput. Stud. 55(4), 453–483 (2001)
18. T. Hoßfeld, P.E. Heegaard, M. Varela, S. Möller, QoE beyond the MOS: an in-depth look at QoE via better metrics and their relation to MOS. Qual. User Exp. 1(1), 2 (2016)
19. U. Jekosch, Voice and Speech Quality Perception: Assessment and Evaluation. Signals and Communication Technology (Springer, Berlin, 2005)
20. S. Möller, Assessment and Prediction of Speech Quality in Telecommunications (Springer US, Boston, 2000)
21. S. Möller, Quality of transmitted speech for humans and machines, in Communication Acoustics, ed. by J. Blauert (Springer, Heidelberg, 2005), pp. 163–192
22. M. Wältermann, A. Raake, S. Möller, Quality dimensions of narrowband and wideband speech transmission. Acta Acust. United Acust. 96(6), 1090–1103 (2010)
23. M. Wältermann, Dimension-Based Quality Modeling of Transmitted Speech. T-Labs Series in Telecommunication Services (Springer, Heidelberg, 2013)
24. M.M. Bradley, P.J. Lang, Emotion and Motivation, ed. by J.T. Cacioppo, L.G. Tassinary, G. Berntson (Cambridge University Press, Cambridge, 2007), pp. 581–607
25. S. Lebrecht, M. Bar, L.F. Barrett, M.J. Tarr, Micro-valences: perceiving affective valence in everyday objects. Front. Psychol. 3 (2012)
26. F. Rumsey, Spatial quality evaluation for reproduced sound: terminology, meaning, and a scene-based paradigm. J. Audio Eng. Soc. 50(9), 651–666 (2002)
27. S. Uhrig, G. Mittag, S. Möller, J.-N. Voigt-Antons, Neural correlates of speech quality dimensions analyzed using electroencephalography (EEG). J. Neural Eng. 16(3), 036009 (2019)
28. J. Blauert, U. Jekosch, A layer model of sound quality, in Proceedings of the 3rd International Workshop on Perceptual Quality of Systems (PQS 2010) (2010), pp. 18–23
29. J. Blauert, U. Jekosch, Auditory quality of performance spaces for music – the problem of the references, in Proceedings of the 19th International Congress on Acoustics (ICA 2007) (2007), pp. 1205–1210
30. U. Jekosch, Assigning meaning to sounds – semiotics in the context of product-sound design, in Communication Acoustics, ed. by J. Blauert (Springer, Heidelberg, 2005), pp. 193–221
31. M. Barreda-Ángeles, R. Pépion, E. Bosc, P. Le Callet, A. Pereda-Baños, Exploring the effects of 3D visual discomfort on viewers’ emotions, in 2014 IEEE International Conference on Image Processing (ICIP) (IEEE, Paris, 2014), pp. 753–757
32. S. Uhrig, S. Möller, D.M. Behne, U.P. Svensson, A. Perkis, Testing a quality of experience (QoE) model of loudspeaker-based spatial speech reproduction, in 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX) (IEEE, Athlone, 2020), pp. 1–6
33. J.T. Cacioppo, S.L. Crites, W.L. Gardner, G.G. Berntson, Bioelectrical echoes from evaluative categorizations: I. A late positive brain potential that varies as a function of trait negativity and extremity. J. Pers. Soc. Psychol. 67(1), 115–125 (1994)
34. ITU-R Recommendation BS.1284-2, General Methods for the Subjective Assessment of Sound Quality (International Telecommunication Union (ITU), Geneva, 2019)
35. R. Nicol, L. Gros, C. Colomes, M. Noisternig, O. Warusfel, H. Bahu, B.F.G. Katz, A roadmap for assessing the quality of experience of 3D audio binaural rendering, in EAA Joint Symposium on Auralization and Ambisonics, Berlin (2014), pp. 100–106
36. ITU-R Recommendation BS.1116-3, Methods for the Subjective Assessment of Small Impairments in Audio Systems (International Telecommunication Union (ITU), Geneva, 2015)
37. ITU-T Recommendation G.107, The E-model: A Computational Model for Use in Transmission Planning (International Telecommunication Union (ITU), Geneva, 2015)
38. J. Skowronek, A. Raake, Assessment of cognitive load, speech communication quality and quality of experience for spatial and non-spatial audio conferencing calls. Speech Commun. 66, 154–175 (2015)
39. S. Möller, W.-Y. Chan, N. Côté, T. Falk, A. Raake, M. Wältermann, Speech quality estimation: models and trends. IEEE Signal Process. Mag. 28(6), 18–28 (2011)
40. E. Datteri, F. Laudisa, Box-and-arrow explanations need not be more abstract than neuroscientific mechanism descriptions. Front. Psychol. 5 (2014)
41. G.G. Berntson, J.T. Cacioppo, Reductionism, in Paradigms in Theory Construction, ed. by L. L’Abate (Springer, New York, 2012), pp. 365–374
42. S. Möller, R. Heusdens, Objective estimation of speech quality for communication systems. Proc. IEEE 101(9), 1955–1967 (2013)
43. S. Möller, M. Wältermann, M.-N. Garcia, Features of quality of experience, in Quality of Experience, ed. by S. Möller, A. Raake (Springer, Cham, 2014), pp. 73–84
44. B.J. McDermott, Multidimensional analyses of circuit quality judgments. J. Acoust. Soc. Am. 45(3), 774–781 (1969)
45. M. Wältermann, A. Raake, S. Möller, The sound character space of spectrally distorted telephone speech and its impact on quality, in Audio Engineering Society Convention 124 (2008)
46. M. Wältermann, A. Raake, S. Möller, Modeling of integral quality based on perceptual dimensions – a framework for a new instrumental speech-quality measure, in ITG Conference on Voice Communication [8. ITG-Fachtagung] (2008), pp. 1–4
47. N. Côté, Integral and Diagnostic Intrusive Prediction of Speech Quality. T-Labs Series in Telecommunication Services (Springer, Heidelberg, 2011)
48. A. Raake, J. Blauert, Comprehensive modeling of the formation process of sound-quality, in 2013 Fifth International Workshop on Quality of Multimedia Experience (QoMEX) (IEEE, Klagenfurt am Wörthersee, 2013), pp. 76–81
49. A.S. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound (MIT Press, Cambridge, 1990)
50. B.C.J. Moore, An Introduction to the Psychology of Hearing, 6th edn. (Brill, Leiden, 2013)
51. J. Skowronek, A. Raake, Conceptual model of multiparty conferencing and telemeeting quality, in 2015 Seventh International Workshop on Quality of Multimedia Experience (QoMEX) (2015), pp. 1–6
52. J.-N. Antons, Neural Correlates of Quality Perception for Complex Speech Signals. T-Labs Series in Telecommunication Services (Springer, Cham, 2015)
53. S. Arndt, Neural Correlates of Quality During Perception of Audiovisual Stimuli. T-Labs Series in Telecommunication Services (Springer, Singapore, 2016)

Chapter 3

Speech Quality Assessment

From a QoE perspective, the assessment, evaluation, and prediction of the experienced quality of multimedia signals on the sole basis of instrumentally measurable system parameters and network properties (QoS metrics) must be considered insufficient. Due to its inherently subjective and relative nature (see Sect. 2.1.2), experienced quality can ultimately only be estimated with sufficient validity by testing human participants and empirically deriving QoE metrics [1]. Different participant groups may have interacted with a technical system extensively (experts) or never before (novices), and may differ in demographic variables (e.g., gender, age, native language), sensory abilities (hearing, vision), as well as motives, attitudes, and personality traits. Besides controlling for such human influencing factors, test paradigms are needed that resemble common use scenarios of speech communication systems and services (listening-only vs. conversation) to examine the impact of quality impairment factors. The upcoming Sects. 3.1.1 and 3.1.3 introduce methods for speech quality assessment that probe participants on multiple levels of analysis: subjective, behavioral, or neurophysiological. The relative advantages and disadvantages of those methods will be elaborated in view of major measuring criteria in Sect. 3.2.

3.1 Levels of Analysis

The majority of empirical research conducted in quality and usability engineering has relied on subjective methods. However, a process-oriented approach toward speech quality assessment requires the deployment of measuring instruments and test paradigms that make it possible to analyze effects of quality manipulations on subjective, behavioral, and neurophysiological levels (for analogous multi-method assessment of the construct of “mental workload,” see [2–6]; for related “three-system” emotion assessment, see [7]). In past research, methodologies borrowed from psychophysics and psychophysiology have been adopted for the purpose of speech quality assessment.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. S. Uhrig, Human Information Processing in Speech Quality Assessment, T-Labs Series in Telecommunication Services, https://doi.org/10.1007/978-3-030-71389-8_3

3.1.1 Subjective and Behavioral Levels

Subjective methods like category rating scales and other psychophysical (e.g., psychoacoustic) detection, discrimination, and identification paradigms can be used to derive perception-based metrics, psychometric functions, and thresholds for varying classes and physical intensities of test stimuli [8–13]. Moreover, such methods make it possible to establish relationships between features of physical (acoustic) events and features of their associated perceptual (auditory) events (see Sect. 2.1.2.1). Physical events are directly observable and can be objectively quantified using physical measuring instruments. By contrast, perceptual events form a part of the subjective experience of individual test participants and are only indirectly accessible through their behavioral descriptions; therefore, the participants themselves simultaneously fulfill the roles of measurands and measuring organs [9]. Test protocols for subjective speech quality assessment are laid out in recommendations by the International Telecommunication Union (ITU) [14–16]. During listening-only tests, participants cognitively evaluate the “quality” of test stimuli, either absolutely or in comparison to previously presented reference stimuli. Participants encode their opinions within the sign systems of (category) rating scales and generate quality descriptions by behaviorally selecting appropriate scale labels or positions [10, 17]. Subjective rating tasks are mainly employed to investigate perceptible effects of alterations in the surface form of transmitted speech signals, irrespective of any conveyed semantic meaning (see Sect. 2.1.2.3). This is achieved either by holding stimulus content constant or by varying it systematically. Besides the influence of content, stimulus duration and phonetic complexity must also be carefully controlled when selecting test stimuli.
Three category rating paradigms for subjective assessment of speech transmission quality have been documented in ITU-T Recommendation P.800 [14], demanding that test participants be native speakers and that speech signals be received in (emulated) standard telephony contexts:

• In the absolute category rating (ACR) paradigm, stimuli are serially presented as either high- or low-quality versions. After each stimulus presentation, participants judge their perceived quality on a unipolar, five-point category rating scale with numeric values 1, 2, 3, 4, and 5 and respective category labels “bad,” “poor,” “fair,” “good,” and “excellent” (German: “schlecht,” “dürftig,” “ordentlich,” “gut,” “ausgezeichnet”). An absolute quality metric called the mean opinion score (MOS) results from simply averaging the ratings for each stimulus across all participants. Oftentimes, the standard deviation of opinion scores (SOS) is reported as a measure of dispersion, which can be analyzed to gain additional information about user rating diversity [18].


• In the comparison category rating (CCR) paradigm, high- and low-quality test stimuli are presented as pairwise combinations (with repetition). Following each stimulus pair, participants judge the quality of the second test stimulus relative to the first reference stimulus on a bipolar, seven-point category rating scale with numeric values −3, −2, −1, 0, 1, 2, and 3 and corresponding category labels “much worse,” “worse,” “slightly worse,” “about the same,” “slightly better,” “better,” and “much better” (German: “viel schlechter,” “schlechter,” “etwas schlechter,” “etwa gleich,” “etwas besser,” “besser,” “viel besser”). A relative quality metric, the comparison mean opinion score (CMOS), is calculated for every test stimulus based on all comparisons containing this stimulus.

• In the degradation category rating (DCR) paradigm, pairs of stimuli are serially presented: the first stimulus serves as a high-quality reference, while the second test stimulus is of low quality. After each stimulus pair, participants judge their annoyance at the quality impairment of the test stimulus in comparison to the reference stimulus on a unipolar, five-point category rating scale with numeric values 1, 2, 3, 4, and 5 and respective category labels “very annoying,” “annoying,” “slightly annoying,” “perceptible but not annoying,” and “imperceptible” (German: “sehr störend,” “störend,” “leicht störend,” “wahrnehmbar, aber nicht störend,” “nicht wahrnehmbar”). Averaging the collected ratings per test stimulus yields another relative metric called the degradation mean opinion score (DMOS).

All three category rating paradigms involve cognitive evaluation of the perceived quality associated with presented test stimuli. The resulting MOS, CMOS, and DMOS quality metrics quantify variation in judged quality (see Sect. 2.1.2.2), as determined within the measurement contexts of the respective category rating tasks.
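As a minimal sketch with invented ratings, the MOS and SOS of an ACR test reduce to a per-stimulus mean and standard deviation:

```python
import numpy as np

# Hypothetical ACR ratings: rows = participants, columns = test stimuli,
# values on the 5-point scale (1 = "bad" ... 5 = "excellent").
ratings = np.array([
    [4, 2, 5, 1],
    [5, 3, 4, 2],
    [4, 2, 4, 1],
    [3, 3, 5, 2],
])

mos = ratings.mean(axis=0)         # mean opinion score per stimulus
sos = ratings.std(axis=0, ddof=1)  # standard deviation of opinion scores

print(mos)  # MOS per stimulus: 4.0, 2.5, 4.5, 1.5
print(sos)  # rating dispersion per stimulus
```

A CMOS or DMOS would be computed analogously, by averaging the relative ratings over all pairs that contain a given test stimulus.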
Only CCR and DCR tasks acknowledge the relative nature of (quality) perception by explicitly offering participants a reference stimulus. Nevertheless, when participants are instructed to rate the absolute quality of a single stimulus in an ACR task, the internally represented stimulus context (i.e., range and order of stimuli [8]) as well as long-term perceptual and conceptual references still implicitly affect participants’ quality judgments. In general, two types of subjective tests can be differentiated by their intended goals and purposes [10, 12, 19]: the abovementioned ITU-T-standardized test protocols are representative of a utilitarian type of testing. Such subjective tests utilize single-valued QoE opinion metrics of overall quality like MOS, CMOS, and DMOS for the practical evaluation of systems and services. By contrast, the analytical type of testing aims at dissecting participants’ overall evaluative impression into multiple perceptual quality dimensions, each of which contributes with a certain weight to integral quality. Analytical testing helps to create QoS taxonomies that interrelate quality elements and perceptual quality features for a particular system or service (e.g., Möller’s taxonomy for human-human interaction over a telecommunication network [8]; see Sect. 2.1.2.1). QoS taxonomies offer system designers and service providers a diagnostic means to estimate integral quality from perceptual feature values associated with system (component) settings
and infer probable technological causes of lowered perceived quality. Different combinations of experimental paradigms and dimensionality reduction techniques that are commonly employed to identify perceptual quality dimensions have been described in the prior Sect. 2.2.1. Because subjective descriptions are usually collected only post hoc, that is, after stimulus exposure has already ended, validly linking perceived and judged quality, or any other subjective constructs (e.g., speech intelligibility, listening effort), to specific internal processes as their functional manifestations is not straightforward. Cognitive (quality) judgments may be severely biased by the design of the measuring instrument (category rating scale) and by contextual and content-related factors (e.g., task instructions, range of presented stimuli, semantic meaning) [8, 20]. In addition, smaller alterations in internal information processing (e.g., slightly delayed sensory-perceptual processes and behavioral responding, subtle emotional state changes) may not be easily detectable by participants, making subjective self-report methods less sensitive in these cases [21, 22] (see the discussion on the measurement criterion of sensitivity in Sect. 3.2.2). Hence, objective methods are needed that can be continuously applied in passive listening-only or interactive conversation tasks to gauge effects of speech quality manipulations on internal information processing. In tasks demanding fast execution of predefined actions, behavioral responses are quantifiable by response time and, given that criteria exist for correct or erroneous action, by error rates (e.g., response classifications as “hits,” “misses,” “false alarms,” and “correct rejections,” in the terminology of signal detection theory) [23–25].
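The signal detection terminology above can be condensed into a sensitivity index d′. A minimal sketch follows, with invented trial counts and using only Python’s standard library:

```python
from statistics import NormalDist

# Hypothetical response counts from a quality-impairment detection task:
# "signal" trials contain an impairment, "noise" trials do not.
hits, misses = 40, 10                      # responses on signal trials
false_alarms, correct_rejections = 5, 45   # responses on noise trials

hit_rate = hits / (hits + misses)                             # 0.8
fa_rate = false_alarms / (false_alarms + correct_rejections)  # 0.1

# Sensitivity index: d' = z(hit rate) - z(false-alarm rate),
# where z is the inverse of the standard normal CDF.
z = NormalDist().inv_cdf
d_prime = z(hit_rate) - z(fa_rate)
print(round(d_prime, 2))  # -> 2.12
```

Larger d′ indicates better discriminability of impaired from unimpaired stimuli, independently of the participant’s response bias.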
The reaction time (RT) paradigm used in experimental studies requires participants to respond to stimuli as fast as possible (speeded responses, e.g., via quick keypresses) and has traditionally been implemented in three variants [25, 26]: simple RT tasks involve a single response to any presented stimulus; choice RT tasks involve multiple stimuli and multiple response alternatives, each response alternative being associated with a different stimulus (class); Go/No-Go RT tasks involve multiple stimuli and a single response that is associated with one particular stimulus (class). An RT task is forced if participants must execute a response (alternative) in order to proceed to the next trial; if trials have a time-out, after which the next trial starts automatically, this fact might be exploited by participants as an additional no-response alternative.1

1 If category rating paradigms were implemented as forced-choice RT tasks, by giving participants the additional instruction to answer as quickly as possible, behavioral descriptions of judged quality could also be analyzed with regard to behavioral response time.

Psychophysical paradigms can be constructed as simple (detection), Go/No-Go, or choice (discrimination, identification) RT tasks [23, 25, 27]: Detection refers to an internal process that signifies the presence or absence of a stimulus (i.e., its sensory signal) or of a change in a perceptual feature value, while detectability denotes the difficulty with which detection is achieved under varying physical stimulation conditions (e.g., depending on stimulus intensity, background noise level); a discrimination process discerns two stimuli based on their values on selected perceptual features,
with discriminability describing the difficulty of successfully discriminating certain perceptual features; a recognition or identification process assigns perceived objects to abstract event classes or categories already represented in long-term memory, with identifiability describing the ease of doing so. Effects of experimental manipulations on RT task performance are describable by the speed and accuracy of participants’ behavioral responses, averaged across the trials of individual test conditions. The methodological approach of measuring behavioral response times (as well as neural response latencies; see the next Sect. 3.1.2) to infer dynamic changes in internal information processing is known as mental chronometry [25, 26, 28–30]. Its measuring logic builds upon the analysis of systematic differences in response times, either due to varying functional requirements between RT tasks or to different influencing factors within the same RT task (e.g., stimulus detectability and discriminability, stimulus-response compatibility) [24, 25]: In the former case (“subtractive logic”), the presence or absence of a specific internal process (or processing stage) can be inferred when subtracting the total response time for a task not requiring the process from the total response time for another task requiring the process leaves a significant difference. This subtraction method was originally applied by the nineteenth-century Dutch physiologist Franciscus Cornelis Donders [31] to estimate the exact duration of internal processes. However, its validity rests on several simplifying assumptions, namely, that the kinds, number, and duration of processes or processing stages, which may in principle operate independently from each other and are traversed in ideally strict series, remain more or less unchanged by the insertion or deletion of the targeted internal process.
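Donders’ subtractive logic can be sketched numerically; the per-participant mean response times below are fabricated for illustration:

```python
import statistics

# Hypothetical mean response times (ms) per participant in two RT tasks:
# a simple RT task (detection only) and a choice RT task (detection +
# discrimination + response selection).
simple_rt = [212, 198, 225, 207, 219]
choice_rt = [348, 322, 365, 331, 350]

# Subtractive logic: the mean difference estimates the combined duration
# of the processing stages inserted by the choice requirement.
diffs = [c - s for c, s in zip(choice_rt, simple_rt)]
stage_duration = statistics.mean(diffs)
print(stage_duration)  # estimated added processing time in ms -> 131
```

In a real analysis, the difference would of course be tested for statistical significance rather than merely averaged; the sketch only illustrates the arithmetic of the subtraction method.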
In the latter case (“additive-factors logic”), two fully combined experimental factors would be interpreted as influencing different internal processes (or processing stages) when each factor has an additive main effect on response time (i.e., entailing an equal response time increment or decrement across all levels of the other factor); likewise, the two factors would be thought of as influencing the same internal process when they demonstrate a significant interaction [26]. These systematic effect patterns follow directly from Sternberg’s additive-factors method [32], which revived the chronometric approach during the early rise of cognitive psychology [26]. Again, the method assumes strict seriality of internal processing stages, but also constant processing outputs at each stage as well as selectivity of factors that may influence only specific processes. Even though daily technology usage more and more often requires perceptual and cognitive processes to operate in parallel (e.g., during multi-tasking [33]), instances of serial human information processing remain most common and can be emulated by appropriately designed test scenarios [30]. Sometimes, additional psychophysical techniques are employed to analyze how gradually varying stimulus intensity (e.g., magnitude of quality impairment, like signal-to-noise ratio) maps onto subjective measures (MOS value [34]) or behavioral measures (hit rate of impairment detection [35, 36]). For example, a psychometric function may be fitted to participants’ responses that describe the relationship between the subjective or behavioral measure and the intensity of a
physical stimulus feature, typically taking the form of a sigmoid (“S-shaped”) curve [25]. Based on this function, absolute thresholds can be determined that describe the lowest stimulus intensity that is still detectable above chance by a perceiver. Section 3.1.2 offers a concise primer on the event-related brain potential (ERP), a special class of neuro-electric phenomena and a measurement technique in electroencephalography (EEG). Afterward, in Sect. 3.1.3, the neurophysiological level of analysis will be addressed.
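Such a psychometric-function fit can be sketched as follows. The hit rates and stimulus levels below are hypothetical, and a coarse grid search stands in for a proper maximum-likelihood fit so the sketch stays dependency-free; the threshold is the 50% point of the logistic curve.

```python
import math

# Hedged sketch: fit a logistic psychometric function
#   p(x) = 1 / (1 + exp(-k * (x - x0)))
# to hypothetical detection hit rates, where x0 is the absolute threshold
# (the 50% point) and k the slope of the sigmoid.
intensity = [1, 2, 3, 4, 5, 6, 7]                  # hypothetical stimulus levels
hit_rate = [0.05, 0.12, 0.30, 0.55, 0.78, 0.93, 0.98]

def logistic(x, x0, k):
    return 1.0 / (1.0 + math.exp(-k * (x - x0)))

# Coarse grid search for the least-squares fit (good enough for a sketch).
best = None
for x0 in [i / 20 for i in range(20, 141)]:        # candidate thresholds 1.0-7.0
    for k in [j / 20 for j in range(2, 61)]:       # candidate slopes 0.1-3.0
        err = sum((logistic(x, x0, k) - p) ** 2
                  for x, p in zip(intensity, hit_rate))
        if best is None or err < best[0]:
            best = (err, x0, k)

_, threshold, slope = best
print(f"estimated threshold: {threshold:.2f}, slope: {slope:.2f}")
```

For real data one would typically fit by maximum likelihood on a per-trial basis rather than least squares on aggregated hit rates.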

3.1.2 Primer: Event-Related Brain Potential (ERP) Technique

Electroencephalography (EEG) describes the noninvasive recording of electrical voltage fluctuations on the scalp surface caused by electrical activity of the brain, resulting in a multi-channel time-series called electroencephalogram [37–39].2 In comparison to functional neuroimaging techniques based on hemodynamic responses (e.g., functional magnetic resonance imaging, fMRI [42]), EEG possesses a very high temporal resolution (ms range) but only a low spatial resolution (cm range; depending on the number of electrodes or electrode density). Neuroelectric activity is attenuated and spatially smeared as electrical current has to propagate through multiple head tissues (brain, skull, scalp), which vary in material properties (e.g., resistivity, conductivity) and shape [43]. Consequently, the distinct neuroanatomical (primarily cortical) regions that contribute to the EEG signal, its neural generators, can only be estimated by applying source localization techniques, which require simplifying mathematical assumptions. A technical EEG recording setup consists of a multi-channel amplifier and surface electrodes, which are attached to an EEG cap or net. The exact electrode placement has been standardized in the (extended) international 10–20 system, relying on percentage distances between anatomical landmarks on individual scalps (distance between nasion “NZ” and inion “IZ”; distance between the left and right preauricular points, “A1” and “A2”) [44, 45]. Its nomenclature designates electrode positions by their proximity to subjacent cortical regions (frontal, “F”; central, “C”; temporal, “T”; parietal, “P”; occipital, “O”) and their lateral position from the participant’s point of view (left hemisphere, odd number; midline, “z”/“Z”; right hemisphere, even number). Electrode placement following the 10–20 system is depicted in Fig. 3.1.
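The 10–20 naming scheme just described is mechanical enough to sketch in code. The following toy decoder (my own simplification, not part of any EEG library) handles the basic single-letter labels; extended-system labels such as “Fp1” or “FC5” combine two region letters and are deliberately left out.

```python
# Simplified sketch: decode a basic 10-20 electrode label into cortical
# region and hemisphere, following the nomenclature described above.
REGIONS = {"F": "frontal", "C": "central", "T": "temporal",
           "P": "parietal", "O": "occipital"}

def decode(label: str):
    region = REGIONS.get(label[0].upper(), "other")
    suffix = label[1:]
    if suffix.lower() == "z":
        side = "midline"                              # z/Z marks the midline
    elif suffix.isdigit():
        # Odd numbers lie over the left hemisphere, even over the right.
        side = "left" if int(suffix) % 2 == 1 else "right"
    else:
        side = "unknown"                              # extended labels not handled
    return region, side

print(decode("Cz"))   # ('central', 'midline')
print(decode("F3"))   # ('frontal', 'left')
print(decode("P4"))   # ('parietal', 'right')
```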
In reference recording, EEG signals from different scalp electrodes are registered together with the signal from a common reference electrode, placed on a position on the body surface that is electrically neutral or does not contain relevant physiological activity (e.g., tip of the nose, earlobes, mastoid bones). The reference channel is then subtracted from each channel of the electroencephalogram. In addition to electrical brain activity, the electroencephalogram might contain several artifacts of physiological (e.g., ocular artifacts due to eye blinks and movements) and non-physiological (e.g., 50/60 Hz mains hum) origin. To eliminate such artifacts, raw EEG signals pass through various pre-processing steps (including bandpass filtering and artifact correction) prior to any condition-related processing and feature extraction.

Fig. 3.1 International 10–20 system for standardized electroencephalographic (EEG) electrode placement, showing the modified combinatorial nomenclature (MCN). The figure file (author: Brylie Christopher Oxley, date: 11 July 2017) has been made available under the Creative Commons CC0 1.0 Universal Public Domain Dedication: https://commons.wikimedia.org/wiki/File:International_10-20_system_for_EEG-MCN.svg

2 The historical origins of the EEG method can be traced back to studies by Richard Caton (1842–1926), who registered cortical activity via surface electrodes placed onto the scalp of test animals [40]. During the 1920s, the German neurologist Hans Berger (1873–1941) carried out the first human EEG recordings [41].
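The reference subtraction itself is a simple per-sample operation. The sketch below uses a hypothetical three-channel, five-sample recording (arbitrary units); real pipelines operate on large arrays and dedicated EEG toolboxes, but the arithmetic is the same.

```python
# Sketch of reference subtraction: subtract the reference channel from every
# scalp channel, sample by sample (hypothetical 3-channel, 5-sample recording).
eeg = {
    "Fz": [1.2, 0.8, -0.5, 0.3, 1.1],
    "Cz": [0.9, 1.4, -0.2, 0.0, 0.7],
    "Pz": [0.4, 0.6, -0.9, 0.2, 0.5],
}
reference = [0.3, 0.2, -0.1, 0.1, 0.2]  # e.g., a nose-tip reference electrode

# Re-referenced signal: each channel minus the common reference, per sample.
referenced = {ch: [s - r for s, r in zip(samples, reference)]
              for ch, samples in eeg.items()}
```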

3.1.2.1 Definitions

The portion of the EEG signal that reflects slow, long-term (tonic) changes in electrical brain activity, unrelated to any clearly defined events, is referred to as spontaneous EEG [37, 38]. Its neurophysiological basis lies in the rhythmic oscillatory activity of various cortical generators [39]. In the frequency domain of the EEG signal, a number of EEG rhythms or EEG frequency bands have been distinguished based on their typical frequency range, amplitude range, and
topographical distribution over the scalp, correlating with specific internal processes and states: delta (1–4 Hz), theta (4–8 Hz), alpha (8–13 Hz), beta (13–30 Hz), and gamma (36–44 Hz). For practical application purposes, spontaneous EEG is often monitored over longer time periods (e.g., sleep and epilepsy monitoring in clinical practice) [37].

Fast, short-term (phasic) changes of electrical brain activity, time-locked to the occurrence or absence of defined external or internal events (i.e., inside the environment or organism), manifest in the EEG time domain as an event-related brain potential (ERP) [38, 46–48].3 ERPs consist of stereotypic sequences of positive and negative voltage changes, “peaks” and “troughs,” forming characteristic waveforms. ERP components denote voltage changes that can be traced back to the activity of distinct neural generators associated with specific psychological (sensory, perceptual, cognitive, response-related) functions. ERP waveforms result from a superposition of simultaneous activity by multiple neural generators, which is why single voltage peaks and troughs normally cannot be definitively identified as individual components in the strict sense; nevertheless, these deflections are frequently called “components,” albeit in a looser, descriptive sense. Early ERP components reflecting physical stimulation in different sensory modalities (auditory, visual, somatosensory, gustatory, olfactory) are also known as sensory evoked potentials (SEPs), like the auditory evoked potential (AEP). An idealized ERP waveform is shown in Fig. 3.2. In the raw EEG signal, ERPs are normally invisible due to their relatively small amplitudes (

LQ-Noi, LQ-Col) due to varying temporal degradation patterns in “discontinuously” versus “continuously” degraded stimuli.
(ii) Effects of reference quality on P3 amplitude (HQ > LQ-Dis, LQ-Noi, LQ-Col) due to varying oddball-standard detectability (for both oddball types, i.e., targets and distractors) and target-standard discriminability.
(iii) Effects of reference quality on P3 latency (HQ < LQ-Dis, LQ-Noi, LQ-Col) due to varying oddball-standard detectability and target-standard discriminability.

5.3.3 Results

5.3.3.1 Subjective Analysis

Subjective analysis showed a significant effect of quality on ACR rating (F[3, 81] = 36.160, εHF = 0.658, p < 0.001, η²G = 0.413). MOS values for all four stimuli are depicted in Fig. 5.5. Post hoc pairwise comparisons were significant between each low-quality stimulus and HQ (all p < 0.001), but not among the low-quality stimuli themselves (LQ-Dis vs. LQ-Noi, p = 0.91; LQ-Dis vs. LQ-Col, p = 0.70; LQ-Noi vs. LQ-Col, p = 0.70).
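The MOS values behind such an analysis are simply per-condition means of the ACR ratings. As a minimal sketch with invented ratings (5-point ACR scale, 1 = bad to 5 = excellent), the MOS and a normal-approximation 95% confidence interval can be computed as:

```python
from statistics import mean, stdev

# Hypothetical ACR ratings for one stimulus condition, one per participant.
ratings = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]

mos = mean(ratings)
# Normal-approximation CI half-width: 1.96 * s / sqrt(n); for small samples a
# t-quantile would be more appropriate, but the sketch keeps it simple.
half_width = 1.96 * stdev(ratings) / len(ratings) ** 0.5
print(f"MOS = {mos:.2f} (95% CI +/- {half_width:.2f})")
```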

5.3.3.2 Behavioral Analysis

Behavioral data analysis demonstrated significant effects of target quality on hit response time (F[2, 54] = 52.811, p < 0.001, η²G = 0.662) and hit rate (F[2, 54] = 5.706, p = 0.006, η²G = 0.174). Figure 5.6, (a) and (b), shows average hit response time and hit rate as a function of target quality. Post hoc pairwise comparisons turned out significant between all target quality levels for hit response time (all p < 0.001) and between discontinuity- and coloration-impaired stimuli for hit rate (LQ-Dis vs. LQ-Col, p = 0.012).

Fig. 5.5 Effects of quality (HQ, LQ-Dis, LQ-Noi, LQ-Col) on ACR rating. Error bars represent 95% confidence intervals. The original version of this figure is published in [2], copyright 2019, with permission from IOP Publishing

Fig. 5.6 Effects of target quality (LQ-Dis, LQ-Noi, LQ-Col) on hit response time (a) and hit rate (b). Error bars represent 95% confidence intervals

5.3.3.3 Electrophysiological Analysis

Statistically significant effects of oddball quality and reference quality on peak amplitude and peak latency are summarized in Tables 5.2 and 5.3. Figure 5.7 contains grand average oddball-standard difference waveforms for all factor level combinations of oddball type (target, distractor) and oddball quality (LQ-Dis, LQ-Noi, LQ-Col) at electrode position Cz, given HQ as constant reference. Figure 5.8 shows grand average oddball-standard difference waveforms for all combinations of oddball type and reference quality (HQ, LQ-Noi, LQ-Col) at Cz, in case of LQ-Dis occurring as oddball.
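An oddball-standard difference waveform is obtained by averaging the time-locked epochs within each condition and subtracting the averages point by point. The sketch below uses tiny hypothetical epochs (arbitrary µV, one electrode); real analyses average hundreds of trials per participant before forming grand averages.

```python
# Hypothetical trial epochs (rows = trials, columns = time samples) for
# oddball and standard stimuli at one electrode, in arbitrary microvolts.
oddball_trials = [
    [0.1, 0.4, 1.2, 2.5, 1.0],
    [0.0, 0.5, 1.4, 2.3, 0.8],
]
standard_trials = [
    [0.1, 0.2, 0.4, 0.5, 0.3],
    [0.0, 0.3, 0.5, 0.4, 0.2],
]

def average(trials):
    # Point-by-point average across trials (the condition-wise ERP).
    return [sum(samples) / len(samples) for samples in zip(*trials)]

# Oddball-minus-standard difference waveform; a P3 would appear as a
# positive deflection in this trace.
difference = [o - s for o, s in zip(average(oddball_trials),
                                    average(standard_trials))]
```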


Table 5.2 Effects of channel (Fz, Cz, Pz), oddball type (target, distractor), and oddball quality ({LQ-Dis, LQ-Noi, LQ-Col}; {LQ-Noi, LQ-Col}; {LQ-Dis, LQ-Col}; {LQ-Dis, LQ-Noi}) on ERP parameters for constant reference quality (HQ; LQ-Dis; LQ-Noi; LQ-Col). All listed effects are statistically significant (p < αSID, αSID = 0.0036). Effect indices point to Appendix Table A.3 with post hoc analysis results

#   Oddball quality
1   LQ-Noi vs. LQ-Col
2   LQ-Dis vs. LQ-Col
3   LQ-Dis vs. LQ-Col
4   LQ-Dis vs. LQ-Noi vs. LQ-Col
5   LQ-Dis vs. LQ-Noi vs. LQ-Col
6   LQ-Dis vs. LQ-Noi vs. LQ-Col
7   LQ-Noi vs. LQ-Col
8   LQ-Dis vs. LQ-Col
9   LQ-Dis vs. LQ-Col
10  LQ-Dis vs. LQ-Noi

#   Reference quality   ERP peak parameter   Effect         dfn   dfd      F        p
1   LQ-Dis              Amplitude            Oddball type   1     8789.1   25.549