Deep Learning Based Speech Quality Prediction (ISBN 303091478X, 9783030914783)

This book presents how to apply recent machine learning (deep learning) methods to the task of speech quality prediction.



T-Labs Series in Telecommunication Services

Gabriel Mittag

Deep Learning Based Speech Quality Prediction

T-Labs Series in Telecommunication Services

Series Editors:
Sebastian Möller, Quality and Usability Lab, Technische Universität Berlin, Berlin, Germany
Axel Küpper, Telekom Innovation Laboratories, Technische Universität Berlin, Berlin, Germany
Alexander Raake, Audiovisual Technology Group, Technische Universität Ilmenau, Ilmenau, Germany

It is the aim of the Springer Series in Telecommunication Services to foster an interdisciplinary exchange of knowledge addressing all topics which are essential for developing high-quality and highly usable telecommunication services. This includes basic concepts of underlying technologies, distribution networks, architectures and platforms for service design, deployment and adaptation, as well as the users' perception of telecommunication services. By taking a vertical perspective over all these steps, we aim to provide the scientific bases for the development and continuous evaluation of innovative services which provide a better value for their users. In fact, the human-centric design of high-quality telecommunication services – the so-called "quality engineering" – forms an essential topic of this series, as it will ultimately lead to better user experience and acceptance. The series is directed towards both scientists and practitioners from all related disciplines and industries. Indexing: Books in this series are indexed in Scopus.

More information about this series at https://link.springer.com/bookseries/10013

Gabriel Mittag

Deep Learning Based Speech Quality Prediction

Gabriel Mittag, Technische Universität Berlin, Berlin, Germany

ISSN 2192-2810 ISSN 2192-2829 (electronic) T-Labs Series in Telecommunication Services ISBN 978-3-030-91478-3 ISBN 978-3-030-91479-0 (eBook) https://doi.org/10.1007/978-3-030-91479-0 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Instrumental speech quality prediction is a long-studied field in which many models have been presented. However, in particular, the single-ended prediction without the use of a clean reference signal remains challenging. This book studies how recent developments in machine learning can be leveraged to improve the quality prediction of transmitted speech and additionally provide diagnostic information through the prediction of speech quality dimensions.

In particular, different deep learning architectures were analyzed towards their suitability to predict speech quality. To this end, a large dataset with distorted speech files and crowdsourced subjective ratings was created. A number of deep learning architectures, such as CNNs, LSTM networks, and Transformer/self-attention networks, were combined and compared. It was found that a network with CNN, self-attention, and a proposed attention-pooling delivers the best single-ended speech quality predictions on the considered dataset. Furthermore, a double-ended speech quality prediction model based on a Siamese neural network is presented. However, it could be shown that, in contrast to traditional models, deep learning models only slightly benefit from including the clean reference signal.

For the prediction of perceptual speech quality dimensions, a multi-task learning based model is presented that predicts the overall speech quality and the quality dimensions noisiness, coloration, discontinuity, and loudness in parallel, where most of the neural network layers are shared between the individual tasks.

Finally, the single-ended speech quality prediction model NISQA is presented that was trained on a large variety of 59 different datasets. Because the training datasets come from a variety of sources and contain different quality ranges, they are exposed to subjective biases. Therefore, the same speech distortions can lead to very different quality ratings in two datasets. To prevent a negative influence of this effect, a bias-aware loss function is proposed that estimates and considers the biases during the training of the neural network weights. The final model was tested on a live-talking test set with real recorded phone calls, on which it achieved a Pearson's correlation of 0.90 for the overall speech quality prediction.

Berlin, Germany

Gabriel Mittag


Acknowledgments

I am very grateful to the many supporters who have made this work possible. During the last years, I had the pleasure to meet and get to know many interesting people at the Quality and Usability Lab, but also at several academic conferences, workshops, and ITU meetings.

First, I would like to thank my thesis supervisor Prof. Dr. Sebastian Möller for his support, his scientific expertise, and his advice that greatly helped me to write and complete this book. My special thanks also go to Dr. Friedemann Köster, who introduced me to the exciting field of speech quality estimation and without whom I probably would not have started my doctoral studies. I would like to thank my student assistant Louis Liedtke for his support and also all the students I had the pleasure to supervise during their bachelor's or master's theses preparation, in particular Assmaa Chehadi for her work on one of the datasets used in this book and Huahua Maier for his work on the Android recording app. I also want to thank Prof. Dr. Gerhard Schmidt and Tobias Hübschen from the University of Kiel and Dr. Jens Berger for the great collaboration during the DFG project. I would like to thank Prof. Tiago H. Falk and, a second time, Prof. Dr. Gerhard Schmidt for reviewing this book and for serving on my doctoral committee.

Many thanks go to Irene Hube-Achter, Yasmin Hillebrenner, and Tobias Jettkowski for their excellent administrative and technical support. Thanks to all my former and current colleagues at the Quality and Usability Lab for the numerous discussions, exchange of research ideas, and for keeping me company during my coffee breaks and making sure that it never got too boring at the lab, including Steven Schmidt, Saman Zadtootaghaj, Sai Sirisha Rallabandi, Thilo Michael, Tanja Kojic, Dr. Babak Naderi, Dr. Laura Fernández Gallardo, Dr. Patrick Ehrenbrink, Dr. Dennis Guse, Dr. Maija Poikela, Dr. Falk Schiffner, and Dr. Stefan Uhrig, and many more. Thank you all for a great time!


Contents

1 Introduction
  1.1 Motivation
  1.2 Thesis Objectives and Research Questions
  1.3 Outline
2 Quality Assessment of Transmitted Speech
  2.1 Speech Communication Networks
  2.2 Speech Quality and Speech Quality Dimensions
  2.3 Subjective Assessment
  2.4 Subjective Assessment via Crowdsourcing
  2.5 Traditional Instrumental Methods
    2.5.1 Parametric Models
    2.5.2 Double-Ended Signal-Based Models
    2.5.3 Single-Ended Signal-Based Models
  2.6 Machine Learning Based Instrumental Methods
    2.6.1 Non-Deep Learning Machine Learning Approaches
    2.6.2 Deep Learning Architectures
    2.6.3 Deep Learning Based Speech Quality Models
  2.7 Summary
3 Neural Network Architectures for Speech Quality Prediction
  3.1 Dataset
    3.1.1 Source Files
    3.1.2 Simulated Distortions
    3.1.3 Live Distortions
    3.1.4 Listening Experiment
  3.2 Overview of Neural Network Model
  3.3 Mel-Spec Segmentation
  3.4 Framewise Model
    3.4.1 CNN
    3.4.2 Feedforward Network
  3.5 Time-Dependency Modelling
    3.5.1 LSTM
    3.5.2 Transformer/Self-Attention
  3.6 Time Pooling
    3.6.1 Average-/Max-Pooling
    3.6.2 Last-Step-Pooling
    3.6.3 Attention-Pooling
  3.7 Experiments and Results
    3.7.1 Training and Evaluation Metric
    3.7.2 Framewise Model
    3.7.3 Time-Dependency Model
    3.7.4 Pooling Model
  3.8 Summary
4 Double-Ended Speech Quality Prediction Using Siamese Networks
  4.1 Introduction
  4.2 Method
    4.2.1 Siamese Neural Network
    4.2.2 Reference Alignment
    4.2.3 Feature Fusion
  4.3 Results
    4.3.1 LSTM vs Self-Attention
    4.3.2 Alignment
    4.3.3 Feature Fusion
    4.3.4 Double-Ended vs Single-Ended
  4.4 Summary
5 Prediction of Speech Quality Dimensions with Multi-Task Learning
  5.1 Introduction
  5.2 Multi-Task Models
    5.2.1 Fully Connected (MTL-FC)
    5.2.2 Fully Connected + Pooling (MTL-POOL)
    5.2.3 Fully Connected + Pooling + Time-Dependency (MTL-TD)
    5.2.4 Fully Connected + Pooling + Time-Dependency + CNN (MTL-CNN)
  5.3 Results
    5.3.1 Per-Task Evaluation
    5.3.2 All-Tasks Evaluation
    5.3.3 Comparing Dimension
    5.3.4 Degradation Decomposition
  5.4 Summary
6 Bias-Aware Loss for Training from Multiple Datasets
  6.1 Method
    6.1.1 Learning with Bias-Aware Loss
    6.1.2 Anchoring Predictions
  6.2 Experiments and Results
    6.2.1 Synthetic Data
    6.2.2 Minimum Accuracy rth
    6.2.3 Training Examples with and Without Anchoring
    6.2.4 Configuration Comparisons
    6.2.5 Speech Quality Dataset
  6.3 Summary
7 NISQA: A Single-Ended Speech Quality Model
  7.1 Datasets
    7.1.1 POLQA Pool
    7.1.2 ITU-T P Suppl. 23
    7.1.3 Other Datasets
    7.1.4 Live-Talking Test Set
  7.2 Model and Training
    7.2.1 Model
    7.2.2 Bias-Aware Loss
    7.2.3 Handling Missing Dimension Ratings
    7.2.4 Training
  7.3 Results
    7.3.1 Evaluation Metrics
    7.3.2 Validation Set Results: Overall Quality
    7.3.3 Validation Set Results: Quality Dimensions
    7.3.4 Test Set Results
    7.3.5 Impairment Level vs Quality Prediction
  7.4 Summary
8 Conclusions
A Dataset Condition Tables
B Train and Validation Dataset Dimension Histograms
References
Index

Acronyms

ACR       Absolute Category Rating
AP        Attention Pooling
BiLSTM    Bidirectional Long Short-Term Memory network
CCR       Comparison Category Rating
CNN       Convolutional Neural Network
COL       Coloration
DCR       Degradation Category Rating
DE        Double-Ended
DIS       Discontinuity
FB        Fullband
FC        Fully Connected layer
FER       Frame Error Rate
FFT       Fast Fourier Transform
GMM       Gaussian Mixture Models
ITU-T     International Telecommunication Union – Telecommunication Standardization Sector
LOUD      Loudness
LPC       Linear Predictive Coding
LSTM      Long Short-Term Memory Network
MARS      Multivariate Adaptive Regression Spline
Mel-Spec  Mel-Spectrogram
MFCC      Mel-Frequency Cepstrum Coefficients
MNRU      Modulated Noise Reference Unit
MOS       Mean Opinion Score
MSE       Mean Squared Error
MTD       Mean Temporal Distance
MTL       Multi-Task Learning
NB        Narrowband
NISQA     Non-Intrusive Speech Quality Assessment
NLP       Natural Language Processing
NOI       Noisiness
OTT       Over-the-Top
P.AMD     Perceptual approaches for multi-dimensional analysis (ITU-T SG12 work item)
P.SAMD    Single-ended perceptual approaches for multi-dimensional analysis (ITU-T SG12 work item)
PCA       Principal Component Analysis
PCC       Pearson's Correlation Coefficient
PESQ      Perceptual Evaluation of Speech Quality (ITU-T Rec. P.862)
PLP       Perceptual Linear Prediction
POLQA     Perceptual Objective Listening Quality Analysis (ITU-T Rec. P.863)
POTS      Plain Old Telephone Service
PSTN      Public Switched Telephone Network
RELU      Rectified Linear Unit
RMSE      Root-Mean-Square Error
RMSE*     Epsilon-Insensitive RMSE
RNN       Recurrent Neural Network
SA        Self-Attention
SE        Single-Ended
SLSP      Sequential Least Squares Programming
SNR       Signal-to-Noise Ratio
ST        Single-Task
SWB       Super-Wideband
TD        Time-Dependency
VAD       Voice Activity Detection
VoIP      Voice over IP
VoLTE     Voice over LTE
VoWiFi    Voice over WiFi
WB        Wideband

Chapter 1

Introduction

1.1 Motivation

Although commercial telecommunication services have been available for more than a century, the telecommunication sector is still an ever-growing area. Through the introduction of smartphones and the almost constant availability of mobile internet, new opportunities have opened up in the field. These days we are used to making phone calls at any time, from anywhere, to any place in the world. Mobile plans that come with a large amount of data for an affordable price have removed the barrier of costly long-distance calls that are billed by the minute. Because of increasing globalisation, family and friends are spread out more than ever before, and phone and video calls are often the only way to stay in touch. Aside from the consumer market, the new developments in the field are also becoming more and more important for businesses. Especially since the start of the COVID-19 pandemic, teleconferencing providers have experienced a dramatic surge in demand: with many people working from home, it is crucial for businesses to hold online meetings on a daily basis. Consequently, it is all the more essential for speech communication providers to monitor their networks to ensure a satisfying experience for their customers.

Several key performance indicators are measured when speech communication networks are evaluated. For example, in benchmark tests (Zafaco GmbH 2020) that compare different providers, the speech quality is one of the main indicators of the overall performance, besides factors such as call setup duration, speech delay, and call failure ratio. In these benchmark tests, a prerecorded reference speech sample of high quality is sent through the network. At the receiving side, the speech signal is recorded. An algorithm then uses both signals to estimate the speech quality. While the intelligibility of transmitted speech is usually not an issue these days, the speech quality can still be significantly degraded, in particular when a call is routed through multiple network providers, where the speech signal may be encoded and decoded multiple times. Furthermore, although most people are used to having mobile connectivity everywhere, there are still many remote spots with inferior reception. For example, Germany is notorious for having poor coverage of mobile Internet, which still leads to lower speech quality when travelling in a train or outside of areas with a high population density.

The current state of the art and standard of the International Telecommunication Union (ITU-T) for objective speech quality prediction is POLQA (ITU-T Rec. P.863 2018). POLQA uses the degraded output signal of a transmission channel and compares it to the clean input signal in order to estimate the speech quality. The ground truth for speech quality prediction models, however, results from auditory listening experiments that are conducted according to ITU-T Rec. P.800 (1996). In these experiments, participants listen to speech samples and give their opinion on the quality. The average rating across all test participants then gives the so-called mean opinion score (MOS). POLQA and other traditional speech quality models estimate the speech quality by calculating distance features between the reference and degraded signals and mapping them to the MOS values. These double-ended models, which use the reference signal and the degraded speech signal, are suitable for active monitoring of networks and give reliable results for this use case. However, their test scope is limited to a restricted number of end points that need to send and record the test signals. Furthermore, these tests are somewhat artificial, as no real phone calls of actual service users are conducted; instead, perfectly balanced sentences are used as test signals. Another drawback of the POLQA model is that it only gives information about the overall speech quality, without any indication of the cause of the quality degradation. To overcome this problem, Côté (2011) developed the DIAL model, which, in addition to the overall quality, also predicts four quality dimensions: Noisiness, Coloration, Discontinuity, and Loudness. The advantage of using perceptual quality dimensions for degradation diagnosis is that the dimensions can be linked to technical causes in the transmission system but at the same time are independent of the technology used. Therefore, the prediction model remains useful even when new technologies, such as codecs or concealment algorithms, are introduced to communication systems.

Models that are based on network parameters, such as packet loss, codec, or bitrate, can be used to monitor the speech quality of phone calls passively. However, these models only take into account the digital transmission path and neglect end point devices. Moreover, when the speech signal is sent through multiple networks, the network providers usually only have access to the parameters of their own network. Apart from the network providers, it is also important for so-called over-the-top (OTT) providers to monitor their services. These providers offer calls via the Internet and Voice over IP (VoIP) without providing the network themselves. For OTT providers, it can be more difficult to monitor the speech quality, as they do not have direct access to all network parameters. If the network parameters are unknown, the transmission path is basically a black box, and only the degraded output speech signal is available to estimate the speech quality.



Models that estimate the speech quality based on the degraded speech signal, without the need for a clean reference, are also referred to as single-ended models. The current recommendation by the ITU-T for single-ended speech quality estimation is ITU-T Rec. P.563. However, it is known to give unreliable results for VoIP networks (Hines et al. 2015a) and for conversational speech. Furthermore, it can only predict speech quality for so-called narrowband communication networks. Since this model was developed, the trend in speech networks has moved towards "high definition" voice transmission with higher audio bandwidth, so-called wideband or super-wideband networks.

For a long time, it has been very challenging to predict speech quality without the availability of a reference signal. In particular, interruptions caused by lost packets in the transmission path can be difficult to detect. However, with the rise of deep learning methods in recent years, new possibilities have opened up for predicting speech quality with a machine learning approach. In 2012, the first deep learning model won the annually held image classification challenge ImageNet (Krizhevsky et al. 2017). Since then, deep learning models have kept outperforming traditional methods in a multitude of applications, from image classification to self-driving cars and language translation. In the field of speech technology, most work on deep learning methods was done for speech recognition and speech synthesis, whereas comparatively little work has been conducted on speech quality estimation.

In this work, a single-ended speech quality model based on deep learning is presented. The model is trained end-to-end with subjective MOS scores to predict the overall quality and four speech quality dimensions in a multi-task approach. Thus, during training, the model not only learns to compute features that are suitable for predicting overall quality but also learns how to distinguish between certain degradation types. Within the multi-task approach, a different number of network layers can be shared across the prediction tasks. One question that opens up in this context is whether the ability of the neural network to distinguish between quality dimensions improves the internal feature representation in the network, and therefore also helps to improve the overall quality prediction, or whether the sharing instead hurts the prediction performance.

To obtain a model that gives reliable predictions for unknown data, the model is trained on a large variety of datasets that come from different sources. Many of these datasets were annotated in a narrowband context, while others contain speech samples with wideband, super-wideband, or fullband speech. However, in an experiment that only contains narrowband speech signals, a clean narrowband sample will usually be rated with a relatively high score, while the same sample in a super-wideband context will be rated lower. These different ratings for the same conditions can negatively influence the neural network prediction performance. To overcome this problem of database-specific biases, a bias-aware loss function is presented in this work.
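To illustrate the idea behind such a bias-aware loss, the following minimal sketch estimates a first-order (offset and slope) bias per dataset from the current predictions and computes the error only after this bias has been compensated. This is only one possible formulation under simplifying assumptions; the loss actually developed in this book is introduced in Chapter 6 and may differ in its details (e.g. polynomial order, anchoring, and update schedule).

```python
# Minimal sketch of a bias-aware MSE loss (a hypothetical simplification of the
# approach developed in Chapter 6): before computing the error, a first-order
# bias (offset + slope) is estimated per dataset with least squares and applied
# to the model predictions, so that constant dataset-specific rating shifts do
# not contribute to the training loss.
import torch

def bias_aware_mse(y_pred, y_true, db_idx):
    """y_pred, y_true: 1-D float tensors of predicted/subjective MOS.
    db_idx: 1-D integer tensor assigning each sample to a dataset."""
    loss = 0.0
    datasets = db_idx.unique()
    for db in datasets:
        m = db_idx == db
        p, t = y_pred[m], y_true[m]
        # Least-squares fit of t ~ b0 + b1 * p (the per-dataset bias); detached
        # so that the bias estimate itself is not optimised by backpropagation.
        A = torch.stack([torch.ones_like(p), p], dim=1).detach()
        coef = torch.linalg.lstsq(A, t.unsqueeze(1).detach()).solution.squeeze(1)
        p_aligned = coef[0] + coef[1] * p
        loss = loss + torch.mean((p_aligned - t) ** 2)
    return loss / len(datasets)
```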

1.2 Thesis Objectives and Research Questions

So far, no single-ended model that predicts the four perceptual speech quality dimensions Noisiness, Coloration, Discontinuity, and Loudness is available. Furthermore, no single-ended model is available that predicts the overall speech quality in a super-wideband/fullband context reliably on unknown, conversational data. Based on these facts, the main aim of this work is to develop a single-ended diagnostic speech quality model. The development of this model with recently emerged deep learning methods comes along with a number of research questions. Therefore, the aims and research questions can be summarised as follows:

Objective 1: Develop a single-ended speech quality model that predicts the speech quality of real phone calls in a fullband context reliably on unknown data.

Objective 2: Develop a single-ended model that, in addition to the overall quality, also reliably predicts the four speech quality dimensions Noisiness, Coloration, Discontinuity, and Loudness.

Research Question 1: Which neural network architectures give the best performance for speech quality prediction, and how can they be combined?

Research Question 2: How can the clean reference signal be incorporated into neural networks for speech quality prediction to build a double-ended speech quality model? And does including the reference improve the results?

Research Question 3: Does multi-task learning of speech quality and speech quality dimensions improve the results when compared to individually trained models? How much of the neural network should be shared across the different tasks?

Research Question 4: How can machine learning based speech quality models be trained from multiple datasets that are exposed to dataset-specific biases?

1.3 Outline

Chapter 2 of this thesis gives an overview of the current state of the art of quality assessment of transmitted speech. It introduces subjective speech quality methods and also describes how listening experiments can be conducted via crowdsourcing. Afterwards, a brief literature review of traditional and machine learning based speech quality models is presented.

In Chap. 3, different neural network architectures are analysed towards their suitability for speech quality prediction. At first, a new dataset with various distortions and subjective speech quality ratings is presented. The presented deep learning based speech quality models are divided into three neural network stages, for each of which different architectures are compared. It is then shown through an ablation study, in which the individual stages are removed or replaced, that the CNN-SA-AP (Convolutional Neural Network–Self-Attention–Attention-Pooling) combination achieves the best results.
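To make the three-stage structure concrete, the sketch below wires a framewise CNN, a self-attention (Transformer encoder) time-dependency stage, and an attention-pooling layer into a single PyTorch module with five outputs (overall MOS plus four dimensions). All layer sizes, the input segmentation, and the class name are illustrative assumptions and do not reproduce the exact model described in Chapters 3–7.

```python
import torch
import torch.nn as nn

class QualityModelSketch(nn.Module):
    """Illustrative CNN -> self-attention -> attention-pooling quality model.
    Input: mel-spectrogram segments of shape (batch, time, 1, mel, width)."""
    def __init__(self, feat_dim=128, n_heads=4, n_outputs=5):
        super().__init__()
        # Framewise model: a small CNN applied to each mel-spec segment.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(), nn.Linear(16 * 4 * 4, feat_dim), nn.ReLU(),
        )
        # Time-dependency model: Transformer/self-attention over the frame sequence.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.time_dependency = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Attention-pooling: learn a weight per time step and average accordingly.
        self.att = nn.Linear(feat_dim, 1)
        # Output head: overall MOS plus four quality dimensions (multi-task).
        self.head = nn.Linear(feat_dim, n_outputs)

    def forward(self, x):
        b, t = x.shape[:2]
        frames = self.cnn(x.flatten(0, 1)).view(b, t, -1)  # (b, t, feat_dim)
        frames = self.time_dependency(frames)              # (b, t, feat_dim)
        w = torch.softmax(self.att(frames), dim=1)         # (b, t, 1)
        pooled = (w * frames).sum(dim=1)                    # (b, feat_dim)
        return self.head(pooled)                            # (b, n_outputs)

# Example call with made-up dimensions: 8 files, 20 segments of 48x15 mel bins.
scores = QualityModelSketch()(torch.randn(8, 20, 1, 48, 15))
```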



In Chap. 4, it is analysed whether applying the clean reference to deep learning based speech quality models improves the results. To this end, a new double-ended model with different configurations is presented, trained, and evaluated.

In Chap. 5, the single-ended deep learning model from Chap. 3 is extended to predict perceptual speech quality dimensions. To this end, a multi-task approach is used and analysed. Four models that each share a different number of layers between the tasks are presented and compared.

Chapter 6 presents a new method to train speech quality prediction models from multiple datasets that are exposed to subjective biases. The presented bias-aware loss estimates the biases during training and considers them when optimising the network weights. The method is first analysed with a synthetic dataset and then with real datasets.

Finally, in Chap. 7, a single-ended machine learning based diagnostic speech quality model is presented that was trained on a large set of different datasets. At first, the different datasets are presented and described. The model is then trained on 59 datasets and validated on a live-talking dataset that contains real telephone calls conducted in various environments.

Chapter 2

Quality Assessment of Transmitted Speech

This chapter first gives an overview of different speech communication networks and common quality impairments that occur during speech transmission. Then the terms "speech quality" and "speech quality dimensions" are introduced, and subjective speech quality assessment methods are discussed. Afterwards, a review of instrumental speech quality prediction models is given: at first, traditional speech quality models are presented; then machine learning based models from the literature that are not based on deep learning are described; finally, a brief overview of deep learning architectures and deep learning based speech quality models is given.

2.1 Speech Communication Networks

Speech or voice services can be divided into the following three classes:

• Landline networks
• Mobile networks
• Over-the-top VoIP applications

The landline network is the oldest of the three services and has been actively running since the late 1800s. It used to transmit speech via analogue signal transmission over underground copper wires. This type of analogue telephone service is also called plain old telephone service (POTS). However, these days, almost all of the analogue networks have been replaced with digital technology. One of the most commonly used codecs in landline networks is ITU-T Rec. G.711 (1988), which applies a non-uniform quantization and passes the speech signal in the range of 300–3400 Hz. This audio bandwidth is also referred to as narrowband (NB) and corresponds to the same bandwidth as analogue telephony, which leads to the typically muffled sound known from telephone calls. Today, many providers offer wideband (WB) networks (sometimes marketed as "HD voice") that allow for a higher audio bandwidth of 100–7000 Hz. One commonly used wideband codec in landline networks is ITU-T Rec. G.722 (2012). However, if a phone call is made from a WB to an NB network, the connection will cut down to an NB call. Even when a phone call between different WB providers is conducted, the connection may cut down to NB as well.

The mobile or cellular network allows for phone calls to and from mobile phones, which are connected to the network via cellular radio towers. More and more people use their mobile phone as the standard way to conduct a phone call. For example, the percentage of households in the U.S. that own a landline phone went from more than 90% in 2004 to less than 40% in 2019 (CDC 2020). However, while there are hardly any transmission issues in landline networks, the advantage of mobility in mobile networks comes with possible interference on the radio frequencies that leads to transmission errors. Also, when users change their location, the phone may switch from one antenna to another (a so-called handover), which results in brief interruptions. There are different systems for radio transmission in the mobile network; common ones in Europe are GSM (2G), UMTS (3G), LTE (4G), and the upcoming 5G standard. The most common codec in mobile networks is AMR-NB (3GPP TS 26.071 1999), which is a hybrid codec that transmits both speech parameters and a waveform signal and offers speech at narrowband bandwidth. These days, an increasing number of providers also support wideband speech with the AMR-WB (3GPP TS 26.171 2001; ITU-T Rec. G.722.2 2003) codec, in particular through UMTS and LTE. More recently, some providers also support super-wideband (SWB) speech (in Germany marketed as "HD Plus" or "Crystal Clear") via VoLTE (Voice over LTE) and VoWiFi (Voice over Wi-Fi) and the more recent codec EVS (3GPP TS 26.441 2014). In super-wideband telephony, speech is transmitted with a bandwidth of 50–14,000 Hz.

Another type of communication service are the so-called over-the-top (OTT) applications, which provide a voice service via the Internet with VoIP (Voice over IP). These applications offer the service as a third party and are independent of the Internet network provider. There are a large number of applications available, such as dedicated voice or video call services (e.g. Viber, Skype, FaceTime), messenger or social media apps that include voice communication (e.g. WhatsApp, Line, Telegram, Facebook, Instagram), or videoconferencing tools (e.g. Zoom, Teams, Meet, Webex). One codec that is often used for VoIP applications is Opus (RFC 6716 2012). It is open source and royalty free and supports different audio bandwidths up to fullband (FB, 20–20,000 Hz), depending on the bitrate that is used. Whereas landline and mobile phone networks are directly connected to the public switched telephone network (PSTN), this is mostly not the case for OTT services.

A landline or mobile phone can be connected with either a circuit-switched or a packet-switched network. Circuit-switched networks are the classic type of network, where a dedicated connection between two devices is established and the communication channel with the full available bandwidth is maintained during the call. In packet-switched networks, speech between two devices is transferred through the division of the signal data into small packets. Each data packet can
take a different route through the network to reach its final destination, where the individual packets are merged to recreate the original signal. At the receiving side of a packet-switched network, a jitter buffer handles packets that arrive in the wrong order, duplicates, or delays. For example, in case a packet arrives late, the jitter buffer playout mechanism may stretch the speech signal contained in the previously received packet to avoid interruptions in the speech signal. In case a packet arrives corrupted or too late, the jitter buffer drops the packet. This then leads to so-called packet loss, although it would more accurately be termed "packet discard". Most modern codecs contain packet-loss concealment algorithms (Lecomte et al. 2015) that try to conceal the interruption in the speech signal by synthesising a prediction of the lost speech signal on the basis of previously received packets. However, when the algorithm fails to synthesise a signal that is similar to the original signal of the lost packet, it often results in interruptions, robotic voices, or artificial sounds. While there are no data packets in circuit-switched networks, the codec still divides the speech signal into individual frames. Therefore, if too many bit errors occur in the bitstream, the decoder can flag a frame as lost and apply concealment on this frame as well. In case the codec does not have a concealment algorithm, the frames are simply replaced by zeros.

Because nowadays most of the speech transmission is digital, frame/packet loss is the main impairment caused in the transmission channel. Additional impairments can be caused by the devices and their signal-processing algorithms, low-bitrate coding, and transcoding that occurs when multiple networks are involved. An overview of some prominent technical factors of speech communication networks that can influence the perceived quality is given in the following:

Audio Bandwidth: An overview of the different bandwidths is given in Table 2.1. Restricted audio bandwidth can, for example, lead to speech that sounds muffled or thin.

Codec: The codec compresses the speech signal before transmitting it through the communication channel; on the receiving side, the signal is reconstructed from the encoded bitstream. However, the compression is usually not lossless and can lead to various speech degradations.

Transcoding: When a call is routed through multiple networks, the speech signal may be encoded and decoded several times, potentially with different codecs.

Packet Loss: Frame or packet loss can lead to interruptions in the signal. If packet-loss concealment is applied, it can also cause a variety of different degradations, such as robotic voices and artificial sounds.

Table 2.1 Audio bandwidths of speech networks according to ITU-T Rec. P.10 (2017)

  Name             Abbr.  Frequency range [Hz]
  Fullband         FB     20–20,000
  Super-wideband   SWB    50–14,000
  Wideband         WB     100–7000
  Narrowband       NB     300–3400



Jitter Management: In packet-switched networks, the jitter buffer may stretch or shrink the speech signal to reduce the influence of delay.

Delay: Delays can occur at different places in the communication system, for example, during the transmission or caused by the coding. Delays can cause difficulties in conversation but do not affect listening speech quality directly.

Echo: Speakers can be confronted with an echo of their own voice, for example, when the receiver is using a loudspeaker. Echo impacts the speaking abilities of a speaker.

Echo Cancellation: Echo cancellation algorithms try to estimate the echo signal and subtract it from the incoming signal. However, if the algorithm fails, it can also introduce new degradations to the signal.

Voice Activity Detection: Voice activity detection (VAD) is used by telecommunication systems to detect low volume levels and, instead of sending the speech signal, to send a flag that indicates silence. Thus, bandwidth can be saved in the digital transmission. If the algorithm fails, it may cut off speech, in particular at the start or end of an utterance.

Ambient Noise: Ambient noises are all sounds from the environment around a speaker and listener, such as car, street, or shopping centre sounds. If the ambient noise is too loud, it can be difficult to understand the voice of the speaker.

Circuit Noise: Circuit noise was mostly caused by analogue technology, which is largely not present in modern speech networks. However, similar noises may be introduced by low-bitrate codecs or on purpose by so-called comfort noise that fills artificial silences caused by voice activity detection.

Noise Reduction: Noise reduction algorithms try to remove background noises from the speech signal. However, they can sometimes introduce new degradations to the speech signal instead.

Speech Level: A non-optimal speech level can be caused at different points in the communication system or when a talker is too far from the input device.

Amplitude Clipping: If an input signal is too loud, the maximum amplitude capacity may be reached, and as a result the signal is clipped at the maximum possible level. Amplitude clipping leads to a distorted speech signal.

Active Gain Control: To avoid amplitude clipping and too quiet signals, most devices contain active gain control algorithms that level-equalise the speech signal.
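Two of the impairments listed above are simple to reproduce on a waveform for test purposes: frame loss without concealment (frames replaced by zeros, cf. the temporal clipping anchor condition in Sect. 2.3) and amplitude clipping. The frame length, loss rate, and clipping level below are illustrative values only.

```python
# Toy simulation of two impairments from the list above: frame loss without
# concealment (frames replaced by zeros) and amplitude clipping.
import numpy as np

def simulate_frame_loss(x, fs, loss_rate=0.02, frame_ms=20, seed=0):
    rng = np.random.default_rng(seed)
    frame_len = int(fs * frame_ms / 1000)
    y = x.copy()
    n_frames = len(x) // frame_len
    lost = rng.random(n_frames) < loss_rate
    for i in np.where(lost)[0]:
        y[i * frame_len:(i + 1) * frame_len] = 0.0   # no concealment: zero fill
    return y

def simulate_clipping(x, max_amplitude=0.3):
    # Amplitude clipping: everything above the maximum level is cut off.
    return np.clip(x, -max_amplitude, max_amplitude)

fs = 16000
speech = np.random.randn(3 * fs).astype(np.float32) * 0.1   # placeholder signal
degraded = simulate_clipping(simulate_frame_loss(speech, fs, loss_rate=0.2))
```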

2.2 Speech Quality and Speech Quality Dimensions

When speech communication networks are evaluated, one of the main performance indicators is the perceived quality of the transmitted speech. Jekosch (2005) defined speech quality as follows:

Speech quality: The result of assessing all the recognised and nameable features and feature values of a speech sample under examination, in terms of suitability to fulfil the expectation of all the recognised and nameable features and feature values of individual expectations and/or social demands and/or demands.

Thus, the perceived speech quality depends mostly on the user's expectations. When a user assesses the quality of a speech sample, the assessment process is triggered by the physical speech signal. The signal is perceived and reflected by the listener, at which point also the perceived quality features are created within the listener. In parallel to the assessment of the speech sample under examination, a similar process is executed on the listener's experiences that creates the desired quality features. A comparison and judgement between the desired and perceived quality features then leads to the perceived speech quality.

According to Wältermann (2012), the perceived quality is a multidimensional value and can be thought of as a point in a multidimensional perceptual space. If the coordinate system of this perceptual space is Cartesian and each of the auditory features is orthogonal to each other, the features are referred to as perceptual dimensions. The orthogonality also implies that the values of different perceptual dimensions are not correlated. In this multidimensional space, the listener's desired composition (internal reference) is defined by a vector p and the perceived composition (speech signal under study) as q such that

p = (p_1, \ldots, p_{N_{\mathrm{dim}}}), \qquad q = (q_1, \ldots, q_{N_{\mathrm{dim}}}).    (2.1)

The overall or integral quality Q can then be defined by the Euclidean distance between p and q,

Q = d(p, q) = \sqrt{ \sum_{i=1}^{N_{\mathrm{dim}}} \alpha_i \, (p_i - q_i)^2 },    (2.2)

where N_dim is the number of orthogonal dimensions and the weighting coefficients \alpha_i determine the influence of each dimension.

Wältermann et al. (2010) derived these perceptual quality dimensions for modern speech communication networks. To this end, 14 different NB test conditions and 14 different mixed NB/WB test conditions were processed. Then a pairwise similarity and subsequent multidimensional scaling, as well as semantic differential and subsequent principal component analysis experiments, were conducted. The analysis revealed that the perceptual quality space can be spanned with three orthogonal perceptual dimensions: Coloration, Discontinuity, and Noisiness. However, in these studies, the listening level of all speech files was normalised to a preferred listening level. Later, Côté et al. (2007) found that the loudness is indeed an important speech quality feature and should therefore be included in the perceptual quality space as a Loudness dimension. It should be noted that the Loudness can be correlated with other dimensions, in particular the Coloration. To summarise, the perceptual speech quality space is spanned by the following four dimensions:

• Noisiness: Refers to degradations such as background, circuit, or coding noise.
• Coloration: Refers to degradations caused by frequency response distortions, e.g. introduced by bandwidth limitations, low-bitrate codecs, or packet-loss concealment.
• Discontinuity: Refers to isolated or non-stationary distortions, e.g. introduced by packet loss or clipping.
• Loudness: Refers to non-ideal loudness, i.e. too loud or too quiet signals or loudness variation.
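As a worked example of Eq. (2.2), the following snippet computes the integral quality distance between a desired and a perceived dimension composition; the dimension values and weighting coefficients are invented for illustration.

```python
# Worked example of Eq. (2.2): integral quality as a weighted Euclidean
# distance between desired (p) and perceived (q) dimension compositions.
# The dimension values and weights below are invented for illustration.
import numpy as np

p = np.array([5.0, 5.0, 5.0, 5.0])        # desired: Noisiness, Coloration, Discontinuity, Loudness
q = np.array([3.5, 4.2, 2.8, 4.5])        # perceived composition of the sample under study
alpha = np.array([1.0, 0.8, 1.2, 0.5])    # per-dimension weighting coefficients

Q = np.sqrt(np.sum(alpha * (p - q) ** 2))  # Eq. (2.2)
print(f"Integral quality distance Q = {Q:.2f}")
```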

2.3 Subjective Assessment

The quality of a speech sample is traditionally derived through auditory listening experiments according to ITU-T Rec. P.800 (1996), in which test participants rate the quality of a speech sample on a 5-point Absolute Category Rating (ACR) scale. The labels of this scale can be seen in Table 2.2. The score can either be displayed to the participants in addition to the labels, or only the labels are presented to the participants and the score is allocated later by the experimenter. The average rating across all test participants then gives for each speech sample the so-called mean opinion score (MOS). Usually, each distortion type is processed multiple times for different source speech files to avoid biases caused by the source speech sample. The resulting per-file MOS values are then averaged across each condition to give the per-condition MOS. Two other types of commonly used speech quality tests are Degradation Category Rating (DCR) and Comparison Category Rating (CCR), which are described in ITU-T Rec. P.800 (1996) as well.

Table 2.2 Listening quality 5-point ACR scale recommended by ITU-T Rec. P.800 (1996)

  Label      Score
  Excellent  5
  Good       4
  Fair       3
  Poor       2
  Bad        1

The assessment of speech quality for diagnosis purposes through multiple rating scales is described in ITU-T Rec. P.806 (2014).
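To make the two averaging steps concrete, the following sketch computes per-file MOS values from individual listener ratings and per-condition MOS values from the per-file scores; the column names and the small rating table are hypothetical.

```python
# Per-file MOS = mean rating over listeners; per-condition MOS = mean of the
# per-file MOS values of all files belonging to the same condition.
import pandas as pd

ratings = pd.DataFrame({
    "condition": ["clean", "clean", "plc_20", "plc_20", "plc_20", "plc_20"],
    "file":      ["f1",    "f1",    "f2",     "f2",     "f3",     "f3"],
    "rating":    [5,       4,       2,        3,        2,        1],
})

per_file_mos = ratings.groupby(["condition", "file"])["rating"].mean()
per_condition_mos = per_file_mos.groupby("condition").mean()
print(per_condition_mos)
```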

However, in this Recommendation, the perceptual quality dimensions derived by Sen and Lu (2012) are used. Although, so far, no ITU-T Recommendation for the four speech quality dimensions used in this work is available, they are contained in ITU-T Rec. P.804 (2017), which describes subjective test methods for conversational speech analysis. The scales in this Recommendation are labelled with antonym pairs describing the corresponding dimension and are shown in Fig. 2.1. The extended continuous scale is used because it has been shown to be more sensitive.

[Fig. 2.1 Scales for the subjective assessment of speech quality dimensions]

A function to map the ratings from the extended continuous scale to ACR MOS values in the range of 1–5 is given by Köster et al. (2015) as follows:

\mathrm{MOS}_{\mathrm{ACR}} = 1 + 0.1907 \cdot \mathrm{MOS}_{\mathrm{EC}} + 0.2368 \cdot \mathrm{MOS}_{\mathrm{EC}}^2 - 0.0262 \cdot \mathrm{MOS}_{\mathrm{EC}}^3,    (2.3)

where MOS_EC describes the rating on the extended continuous scale shown in Fig. 2.1 and MOS_ACR the MOS on an ACR scale. It should be noted that the scale in this work is inverted, so that a signal that is rated "noisy" will be assigned a low Noisiness MOS. Therefore, a high rating corresponds to a high quality for the overall MOS as well as for the speech quality dimensions.
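Eq. (2.3) translates directly into code; the input value in the example is an arbitrary rating on the extended continuous scale.

```python
# Mapping from the extended continuous scale (EC) to ACR MOS, Eq. (2.3).
def ec_to_acr_mos(mos_ec: float) -> float:
    return 1 + 0.1907 * mos_ec + 0.2368 * mos_ec**2 - 0.0262 * mos_ec**3

print(ec_to_acr_mos(3.0))  # arbitrary example rating on the extended scale
```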

Speech Material: Traditionally, according to ITU-T Rec. P.800 (1996), the speech material used for listening-only tests should consist of simple, meaningful short sentences that fit into a time slot of 2–3 s. A group of 2–5 sentences is then merged together to constitute one speech sample. The silent pause between the sentences, during which circuit or background noises are heard and in which adaptive processes settle into their new states, is also important and is usually set to 1–3 s. The talker should pronounce the sentences fluently but not dramatically. It is also important to use male and female talkers, as they are often affected differently by speech transmission. Furthermore, to reduce the dependency of the resulting MOS values on peculiarities of the chosen voices, it is essential to use more than one male/female speaker in a balanced design. The active speech level of the speech signals should be set to −26 dB relative to the peak overload level of the recording system (also referred to as dBov), unless the speech level is supposed to be degraded on purpose. The level is calculated according to ITU-T Rec. P.56 (2011), where at first a VAD is applied that divides the signal into silent segments and segments that contain speech. The speech level is then only calculated over the segments that contain speech.
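The following sketch illustrates the idea of active-speech-level normalisation to −26 dBov: a crude energy-threshold VAD selects the active frames, the level is computed over those frames only, and the signal is scaled to the target. It is a simplification for illustration and does not implement the full ITU-T Rec. P.56 algorithm.

```python
# Simplified active-speech-level normalisation to -26 dBov. A crude
# energy-threshold VAD stands in for the real P.56 method, which is more
# elaborate; values assume a float signal with full scale at +/-1.0.
import numpy as np

def normalise_active_level(x, fs, target_dbov=-26.0, frame_ms=10, thr_db=-50.0):
    frame_len = int(fs * frame_ms / 1000)
    n = (len(x) // frame_len) * frame_len
    frames = x[:n].reshape(-1, frame_len)
    frame_db = 10 * np.log10(np.mean(frames**2, axis=1) + 1e-12)
    active = frames[frame_db > thr_db]          # crude VAD: energy threshold
    if len(active) == 0:                        # no speech detected: leave unchanged
        return x
    active_level_db = 10 * np.log10(np.mean(active**2) + 1e-12)
    gain = 10 ** ((target_dbov - active_level_db) / 20)
    return x * gain
```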

Anchor Conditions: Every experiment should contain anchor conditions to make a comparison possible between experiments in different laboratories or carried out at different times in the same laboratory. These anchor conditions are sometimes also referred to as reference conditions; however, they are referred to as anchors in this work to avoid confusion with a clean speech signal. The following anchor conditions were used for the datasets of the POLQA competition (ITU-T Rec. P.863 2018, see also Sect. 2.5.2).

• Reference: The clean signal without any degradation. The audio bandwidth of the signal should be adjusted to the scale of the experiment (super-wideband in the case of the POLQA datasets).
• MNRU: One common anchor condition is the Modulated Noise Reference Unit (MNRU) according to ITU-T Rec. P.810 (1996). It is a simulated signal-correlated distortion that is used to introduce controlled degradation to speech signals. As such, it has been used extensively in subjective performance evaluations of speech communication networks. The speech-plus-modulated-noise output can be expressed as

y(i) = x(i) \left[ 1 + 10^{-Q/20} \, N(i) \right],    (2.4)

where N represents random noise and Q the ratio of speech power to modulated-noise power in dB. In the POLQA datasets, it is included with Q values of 10 and 20 dB.

where N represents random noise and Q the ratio of speech power to the modulated noise power in dB. In the POLQA datasets, it is included with Q values of 10 and 20 dB. Background Noise In addition to MNRU noise, additive background noise is usually used. The POLQA anchor conditions contained Hoth and babble noise with an SNR of 12 dB and 20 dB, respectively. SNR is the signal-to-noise ratio expressed in decibels. Hoth noise is a random noise with a given power density spectrum between 100 and 8000 Hz published by Hoth (1941). Babble noise is produced by adding spoken sentences by multiple speakers on top of each other, yielding an incomprehensible “babble” sound. Another noise that is often used is additive white Gaussian noise that has uniform power across all frequency bands. Active Speech Level Deviation from the standard P.56 speech level with −10 and −20 dB from the nominal level (−26 dBov). Linear Filtering Bandpass filtering to reduce the audio bandwidth of the speech signal. For POLQA datasets, a narrowband IRS (ITU-T Rec. P.48 1988) send and receive filter with 500 to 2500 Hz and 100 to 5000 Hz was used. Temporal Clipping 2% and 20% packet loss. Packet loss is simulated by dividing the speech signal into 20 ms frames. The given percentage of frames is then randomly set to zero. Listeners The subjects that participate in the listening test should be chosen at random from a normal telephone-using population. Furthermore, they should be naïve participants,
meaning that they have not been involved in work on performance assessment of telephone circuits or speech coding. Ideally, they should also not have participated in any subjective test for at least six months and should not have heard the sentences in the speech samples before. Biases Between Experiments There are many factors that can influence how users perceive speech quality during a listening experiment. According to Jekosch (2005), they can be classified into three groups as follows: The scaling-effect Induced by the use of the measurement scale as an interface with the subject. The subject-effect Generated by the use of human listeners as an instrument of measurement. The context-effect Due to the relationship between the context and the use of speech as an object of measurement. The influencing factors and biases are also described in detail by Möller (2000) and Zielinski et al. (2008). As described in ITU-T Rec. P.1401 (2020), these bias-inducing factors make a comparison between the MOS values from different experiments difficult, even when the same scale is used. The main factors are: Rating noise The score assigned by a listener is not always the same, even if an experiment is repeated with the same samples and the same presentation order. Short-term context dependency/order effect Subjects are influenced by the short-term history of the samples they previously rated. For example, after one or two poor samples, participants tend to rate a mediocre sample higher. In contrast, if a mediocre sample follows high-quality samples, there is a tendency to score the mediocre sample lower. Because of this effect, the presentation order for each subject is usually randomised in speech quality experiments. Medium- and long-term context effects/corpus effect The largest influence is given by effects associated with the average quality, the distribution of quality, and the occurrence of individual distortions. Test participants tend to use the entire set of scores offered in an experiment. Because of this, in an experiment that contains mainly low-quality samples, the subjects will overall rate them higher, introducing a constant bias. Despite verbal category labels, subjects adapt the scale to the qualities presented in each experiment. Furthermore, individual distortions that are presented less often are rated lower, as compared to experiments in which samples are presented more often, and people become more familiar with them. For example, Möller et al. (2006) showed that a clean narrowband signal obtains a higher MOS in a corpus with only narrowband conditions than in a mixed-band corpus that includes wideband conditions as well. Long-term dependencies These effects reflect the general cultural behaviour of the subjects as to the exact interpretation of the category labels, the cultural attitude to quality, and language dependencies. Also, the daily experiences with telecommunication or media are important. Quality experience, and therefore

expectation, may change over time as the subjects become more familiar with high-definition codecs and VoIP distortions.

2.4 Subjective Assessment via Crowdsourcing Recently, the subjective assessment of speech quality through crowdsourcing on platforms such as Amazon Mechanical Turk has attracted increased interest. Although a controlled laboratory setup comes with higher measurement reliability, it lacks realism, as the listening equipment and the test environment do not reflect the typical usage situation. Furthermore, laboratory experiments are very time-consuming for the experimenters and need to be carefully planned in advance. Micro-task crowdsourcing, on the other hand, offers a fast, low-cost, and scalable approach to collect subjective ratings from a geographically distributed pool of demographically diverse participants who use a wide range of listening devices. In crowdsourcing subjective tests, the assessment task is provided to the crowdworkers over the Internet. The participating crowdworkers are usually remunerated for their work. Recently, an ITU-T Recommendation for conducting speech quality tests via crowdsourcing has been published as ITU-T Rec. P.808 (2018). The Recommendation describes the experiment design, test procedure, and data analyses for an ACR listening-only speech quality test. Furthermore, it contains methods to evaluate the participants' test eligibility (i.e. hearing test), environment/listening system suitability, and quality control mechanisms. A recent study by Naderi et al. (2020) shows that valid and reliable results for speech quality assessment in crowdsourcing are achieved if the tests are conducted according to ITU-T Rec. P.808 (2018). The following points should be considered when tests in the crowd are conducted: Source material The preparation of the source material is the same as for a lab test according to ITU-T Rec. P.808. Test duration As typical crowdsourcing micro-tasks only take a few minutes to complete, it is recommended to split an experiment into a chain of these smaller tasks, where one task should contain 5–15 stimuli. However, a crowdworker may only perform one of these tasks and as a result does not rate the entire set of stimuli available. This lack of context can lead to an error variance in the ratings caused by the corpus effect. To overcome this problem, individual tasks with 5–15 samples can be prepared to contain a good variety of conditions. Also, the crowdworkers can be motivated to complete more tasks by offering an extra reward (i.e. bonus) when a certain minimum number of tasks are completed. Test procedure It is recommended to create three jobs overall for the crowdsourcing assessment: qualification job, training job, and rating job. Qualification job In the qualification job, the purpose of the study is explained to the crowdworkers, and it is checked whether they are eligible to participate. For
example, it could be asked if they suffer from hearing loss. Also, to guarantee a balanced participant distribution, the participants can be asked for their gender and age. Based on the response, a randomly sampled group of crowdworkers (who satisfied the prerequisites) can be invited to participate in the experiment. Instead of (or in addition to) a qualification job, built-in platform features can also be used, such as only inviting crowdworkers with a task approval rate of at least 98% and a minimum of 500 approved tasks. Training job In the training job, the test instructions, accompanied by a preliminary list of training stimuli, should be presented to the crowdworkers. In the selection of the samples, attention should be paid to approximately covering the range from worst to best quality to be expected in the test. However, no suggestion should be made to the crowdworkers that the samples include the best or worst condition in the range to be covered. After the crowdworkers have finished the training job, they should get access to the actual rating job. Additionally, if more than 24 hours have passed, the crowdworkers should lose access to the rating job and repeat the training job. Rating job In the rating job, at first, the listening environment, the listening system, and the playback level should be checked. Then the crowdworkers should listen to the stimuli and give their opinion. Appropriate methods should be used to make sure that the crowdworkers listen to the stimuli before giving their rating and that they provide answers to all questions before submitting their response. Listening environment The crowdworkers should be instructed to perform their task in a quiet, non-distracting environment. Additionally, the crowdworkers should compare a pair of speech samples and give their opinion on which one has better quality. When the crowdworker correctly selects the stimulus with the better quality in the majority of questions, it can be inferred that their surrounding environment and listening system are suitable for participating in the study. Listening system The crowdworkers should be instructed to wear two-eared headphones. This can be validated at the beginning of each rating job through a short maths exercise with digits panning between left and right in stereo. Listening level The crowdworker should set the volume of their listening device to a comfortable level. Afterwards, the crowdworkers must not change the listening level when rating the speech samples. Gold standard question A gold standard question is a question for which the answer is known to the experimenter. Naderi et al. (2015) showed that including these questions makes crowdsourcing assessment more reliable. The questions should be asked in a way that makes it easy to answer the question correctly if the crowdworker consciously follows the test instructions. For example, the clean reference signal can be used as a gold standard question, for which it is known that the quality should be rated with a high rating. Furthermore, it is recommended to add a second trapping question, which appears to the participants to be a normal speech sample. However, in the middle of the speech sample, the playback is interrupted by a message that prompts the crowdworkers to select a specific item from the opinion score (e.g. poor or excellent). These
gold standard questions should be randomly positioned between the quality assessment questions. Data screening The submitted responses of one rating job should be discarded if (1) one or more gold standard questions are answered incorrectly, (2) the listening system was not set up correctly, or (3) the listening environment was not suitable. Furthermore, the submitted responses should be evaluated in terms of unexpected patterns in ratings (e.g. no variance in ratings) and unexpected user behaviour in a session. Also, outlier detection can be applied. The experimenter may discard responses that contain significant outliers or show unexpected patterns or behaviour. It should be noted that a downside of the data screening is that it results in different numbers of votes for the speech samples.
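As an illustration of such a screening step, the following Python sketch filters submitted rating-job sessions according to the criteria listed above. The session fields and the simple variance check are assumptions made for this example and do not correspond to a specific P.808 implementation.

```python
def keep_session(session):
    """Decide whether a submitted rating-job session passes data screening (sketch)."""
    # (1) all gold standard / trapping questions must be answered correctly
    gold_ok = all(q["answer"] == q["expected"] for q in session["gold_questions"])
    # (2) and (3): listening system and environment checks must have been passed
    setup_ok = session["headphone_check_passed"] and session["environment_check_passed"]
    # reject sessions with suspicious rating patterns, e.g. no variance at all
    ratings_ok = len(set(session["ratings"])) > 1
    return gold_ok and setup_ok and ratings_ok

# Example usage with two hypothetical sessions
sessions = [
    {"gold_questions": [{"answer": 5, "expected": 5}],
     "headphone_check_passed": True, "environment_check_passed": True,
     "ratings": [4, 3, 5, 2, 4]},
    {"gold_questions": [{"answer": 2, "expected": 5}],
     "headphone_check_passed": True, "environment_check_passed": True,
     "ratings": [3, 3, 3, 3, 3]},
]
screened = [s for s in sessions if keep_session(s)]   # keeps only the first session
```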

2.5 Traditional Instrumental Methods Because subjective methods are time-consuming and costly, so-called instrumental or objective methods have been developed that can predict speech quality on the basis of an algorithm. Furthermore, for network providers, it is important to monitor their networks in regard to speech quality to ensure that their customers are satisfied with the service. Another application of instrumental models is the optimisation of speech processing systems and network planning. However, instrumental models are always based on mean opinion scores derived from subjective auditory experiments. Typically, the development of an instrumental speech quality model consists of four major steps (Wolf et al. 1991): (1) the design and conduct of a first auditory test, (2) the development of a candidate instrumental measure trained on the first auditory tests, (3) the design of a second validation auditory test, and (4) the validation of the instrumental method on the second auditory test. Instrumental models can be divided into three classes, which are described in the following:
• Parametric models
• Double-ended signal-based models
• Single-ended signal-based models

2.5.1 Parametric Models Parametric models (or Opinion models or Network planning models) use parameters that characterise different quality elements of a transmission system to estimate the expected overall quality. When network providers plan a new network, they can tune these parameters to ensure that their service users will be satisfied with the quality. For example, a network provider could estimate which maximum delay the transmission system should allow to still achieve a high overall conversation quality.

The first parametric models were already developed in the 1970s, when telecommunication companies started to develop algorithms that estimate the user's opinion. These models include the Bellcore TR model (Cavanaugh et al. 1976) and the OPINE model (Osaka and Kakehi 1986) that were both developed to predict the quality of POTS networks. E-model In 1997, results from different opinion models were merged into a new model called E-model (Johannesson 1997), which is also the transmission planning tool recommended by the ITU-T (ITU-T Rec. P.107 2015). It is applied to plan future voice networks by using network parameters that describe specific impairments (e.g. delay, low-bitrate codecs, or packet loss) and computes an overall rating of the expected conversational quality as the transmission rating factor R. The model is based on the impairment factor principle, which assumes that different types of degradations are additive in terms of the perceptual impairment they cause. The rating can be calculated with the following basic formula: R = R_0 − I_s − I_d − I_e,eff + A.

(2.5)

Essentially, the model assumes a maximum rating R_0, from which impairment factors are subtracted to calculate an overall quality R. R_0 describes the basic signal-to-noise ratio (e.g. caused by circuit noise). I_s represents simultaneous impairments caused by non-optimum loudness or signal-correlated noise, I_d stands for impairment caused by delay, and I_e,eff is the effective equipment impairment factor, which represents impairment caused by speech coding and packet loss. A can be used to compensate for impairments through other quality advantage factors. The R-value is linked to the MOS by an S-shaped curve. Through this relationship, new codec impairment factors can be derived by conducting auditory tests and transforming the MOS obtained from the test participants to the R-scale. Initially, the E-model was only designed for NB telecommunication systems, operated with handset telephones, and was later extended by Möller et al. (2006) and Raake et al. (2010) to the Wideband E-Model (ITU-T Rec. P.107.1 2019) and by Möller et al. (2019b) and Mittag et al. (2018) to the Fullband E-model (ITU-T Rec. P.107.2 2019).
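As a small worked example of the impairment factor principle, the following Python sketch combines Eq. (2.5) with the commonly cited S-shaped R-to-MOS conversion of the E-model. The exact mapping coefficients should be verified against the current ITU-T Recommendation, and the numeric values in the example call are purely illustrative and not taken from the text.

```python
def e_model_r(r0, i_s, i_d, ie_eff, a=0.0):
    """Basic E-model formula (Eq. 2.5): impairment factors are subtracted from R0."""
    return r0 - i_s - i_d - ie_eff + a

def r_to_mos(r):
    """Commonly used S-shaped mapping from the rating factor R to MOS."""
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

# Illustrative values only: a basic rating of 93, small simultaneous and delay
# impairments, and an equipment impairment of 30 (e.g. a low-bitrate codec with
# packet loss) give R = 61 and a predicted MOS of about 3.15.
r = e_model_r(r0=93.0, i_s=1.5, i_d=0.5, ie_eff=30.0)
print(round(r_to_mos(r), 2))
```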

2.5.2 Double-Ended Signal-Based Models Double-ended (also called full-reference or intrusive) signal-based models estimate the speech quality through a comparison of the clean reference input signal with the degraded output signal. Because of the ongoing advancements in speech coding and the complexity of human quality perception, the development of such models is a continuous process. In the following, the five models PESQ, POLQA, VISQOL, DIAL, and the P.AMD candidate are presented.

PESQ The ITU-T Rec. P.862 (2001) (Rix et al. 2001) was released in 2001 and is based on an integration of the PAMS (Rix and Hollier 2000) model and an enhanced version of the PSQM (Beerends and Stemerdink 1994) model. It is still one of the most popular speech quality models as the code is publicly available, and it has been proven to give reliable results in a narrowband context. The model firstly level-aligns both signals to a standard listening level. They are then filtered to model a standard telephone handset. The signals are time-aligned and then processed through an auditory transform that involves equalising for linear filtering in the system and for gain variation. Two distortion parameters are extracted from the difference between those transforms. Finally, they are aggregated in frequency and time and mapped to subjective MOS values. The model was later extended to predict wideband speech as PESQ-WB (ITU-T Rec. P.862.2 2007). POLQA The current state-of-the-art speech quality model POLQA is the successor of PESQ and supports evaluation of super-wideband speech signals. It was standardised as ITU-T Rec. P.863 (2018) in 2011 and consists of two major blocks, the time-alignment (Beerends et al. 2013a) and the perceptual model (Beerends et al. 2013b). In contrast to PESQ, the POLQA time-alignment algorithm is more complex and was updated to allow assessing VoIP networks, which can introduce sudden alignment jumps or slowly changing time scaling. The time-alignment consists of five computation blocks: filtering, pre-alignment, coarse alignment, fine alignment, and section combination. The perceptual block is similar to the approach used in PESQ; that is, the reference and degraded signals are mapped onto an internal representation using a model of human perception. The distance between the internal representations is then used to predict the speech quality of the degraded signal. New approaches in POLQA include the idealisation method that removes low levels of noise in the reference signal and optimises the timbre. Also, the impact of the playback level on the perceived quality is modelled, and low- and high-level distortions are considered separately. The final MOS is estimated based on a disturbance and an added disturbance density and three different indicators: FREQ for frequency response distortions, NOISE for additive noise, and REVERB for room reverberations. The disturbance densities are based on the distance between the perceptual based representations and are aggregated in frequency and time as a final step. POLQA was recently updated to predict quality of fullband speech (Beerends et al. 2020). VISQOL In 2015, the double-ended speech quality model VISQOL (Hines et al. 2015b) was presented. Similarly to POLQA, it can predict the quality of super-wideband speech signals and is based on a spectro-temporal measure of similarity between the reference and degraded speech signals. It consists of five major processing stages: preprocessing, time-alignment, predicting warp, similarity comparison, and a postprocess mapping similarity to objective quality. It was shown to outperform PESQ and to
give prediction results similar to those of POLQA. However, it was not evaluated on the same number of datasets as POLQA. DIAL The first work towards prediction models for the speech quality dimensions Noisiness, Coloration, and Discontinuity was presented by Scholz (2008) and was then further developed by Huo (2015). These studies were extended by Côté (2011) and led to the DIAL model, which also includes the dimension Loudness. In addition to the four speech quality dimensions, it also predicts the overall speech quality. The DIAL model consists of a Core model that is based on the speech quality model TOSQA (Berger 1998). Whereas the Core model assesses nonlinear degradation introduced by a speech transmission network such as low-bitrate codecs, the dimension estimators quantify the degradations on the four perceptual dimensions. Then, an aggregation of all degradations simulates the cognitive process employed by humans during the quality judgement process. In contrast to POLQA, the estimators of the DIAL model are based on individual indicators that model certain physical distortions in the speech signal rather than a mapping of time–frequency representations to MOS values. P.AMD Candidate The work on DIAL led to a new ITU-T work item P.AMD (ITU-T SG12 TD.137 2017) that aims to provide a standardised diagnostic speech quality model for the four perceptual dimensions Noisiness, Coloration, Discontinuity, and Loudness (set A), as well as for the six dimensions proposed by Sen (2004) (set B). The candidate model for set A is currently going into the validation phase. The prediction of the dimensions Coloration, Discontinuity, and Loudness is mostly based on a linear combination of POLQA internal indicators. The estimation of the Noisiness is based on two different indicators. Whereas the indicator for the estimation of background noise during silent parts of speech is based on DIAL, the estimation of noise during speech is based on a spectral density distance presented by Mittag and Möller (2019a).

2.5.3 Single-Ended Signal-Based Models Single-ended (also called no-reference or non-intrusive) models allow for prediction of speech quality without the need for a clean reference signal. Thus, they allow for passive monitoring during which the reference is usually not available. However, they generally tend to give less reliable results than double-ended models. So far, no single-ended diagnostic model exists that estimates the four speech quality dimensions derived by Wältermann (2012) and Côté et al. (2007). However, individual estimators for the dimensions Noisiness (Köster et al. 2016b), Coloration (Mittag et al. 2016), and Loudness (Köster et al. 2016a) have been presented. In the following, the two single-ended speech quality models P.563 and ANIQUE+ are described.

P.563 Since 2004, ITU-T Rec. P.563 (2004) has been the ITU-T Recommendation for single-ended speech quality estimation, which was developed by Malfait et al. (2006). It was the winning model of a competition performed by the ITU-T with two submissions, where the other submission was the ANIQUE model presented by Kim (2005). The model consists of three stages: the preprocessing stage, the distortion estimation stage, and the perceptual mapping stage, where the distortion estimation stage combines three basic principles for evaluating distortions. The first principle focuses on the human voice production system, modelling the vocal tract as a series of tubes, where variations of the tubes' sections are considered as a degradation. The second principle is to reconstruct a clean reference signal in order to apply a double-ended model and to assess distortions unmasked during the reconstruction. The third principle is to identify and to estimate specific distortions encountered in speech communication networks, such as temporal clipping, robotising, and noise. The speech quality is then estimated from the calculated parameters, applying a distortion-dependent weighting. However, it should be noted that P.563 is limited to narrowband speech signals and was shown to perform unreliably on VoIP conditions (Hines et al. 2015a). ANIQUE+ ANIQUE+ was presented by Kim and Tarraf (2007) and is an improved version of the ANIQUE model. Similarly to P.563, the ANIQUE+ model is limited to narrowband speech signals. The model contains three major blocks. The frame distortion block decomposes the incoming speech signal into successive time frames and then predicts the perceptual distortion for each frame in order to derive the overall frame distortion. The mute distortion model detects unnatural mutes in the signals and predicts their impact on perceived quality degradation. The non-speech distortion block detects the presence of outstanding non-speech distortions and models their impact. Finally, the combined distortion is mapped linearly onto the subjective MOS values.

2.6 Machine Learning Based Instrumental Methods While the mapping of the aforementioned traditional speech quality models from indicators or perceptual signal representations to subjective MOS values is a form of machine learning, the models are largely based on prior knowledge from psychoacoustics, human auditory perception, and human speech perception. They contain mostly handcrafted signal-processing based features that are dependent on empirically determined thresholds and parameters. In contrast to these models, in this section, speech quality models based on machine learning techniques are presented. At first, machine learning based models that do not include deep neural networks are discussed. Then an overview of common deep learning architectures

and a review of deep learning based speech quality models are given. The main difference between shallow neural networks and deep neural networks lies in the number of layers that are used. It should be noted that only signal-based models are considered.

2.6.1 Non-Deep Learning Machine Learning Approaches Although the use of neural networks for speech quality prediction only recently became more popular, mostly due to advancements of neural network architectures in other fields and the availability of faster hardware, the idea of applying neural networks to predict speech quality is not new. Fu et al. (2000) used MFCC (Mel-Frequency Cepstrum Coefficients) squared errors between the reference and degraded signals as input to a multilayer perceptron neural network with three layers. They achieved high correlations with subjective MOS on a relatively small dataset. Chen and Parsa (2005) presented a single-ended speech quality model that is based on an adaptive neuro-fuzzy inference system, which is a type of artificial neural network. As input, they applied features extracted from a perceptual spectral density distribution of the degraded speech signal. Guo Chen and Parsa (2005) also proposed an algorithm that is based on perceptual spectral features, Bayesian inference and Gaussian mixture density hidden Markov models to predict speech quality. Falk and Chan (2006a) developed a model in which the degree of consistency between PLP (Perceptual Linear Prediction) features of the received speech signal and Gaussian mixture probability models (GMMs) of normative clean speech behaviour is used as an indicator of speech quality. They also presented a machine learning based single-ended speech quality model that was extended to a more robust double-ended model (Falk and Chan 2006b). The model firstly extracts perceptual features from the degraded speech signal on a frame basis. A time segmentation module then labels the frames as either active-voiced, active-unvoiced, or inactive. Then five different indicators are calculated. Multiplicative noise is estimated with a linear regression of PLP mean cepstral deviations. The consistency computation stage is based on a combination of a GMM whose outputs are mapped to MOS with a MARS (multivariate adaptive regression spline) regression function. A decision tree is used to detect multiplicative noise and a support vector classifier to detect temporal discontinuities. Finally, the MOS is estimated based on the output of the five modules. The presented double-ended extension of their model is suitable for signals that were processed by speech enhancement algorithms as it does not require the reference signal to be undistorted. Grancharov et al. (2006) presented a low-complexity speech quality model that was also based on GMMs with common speech coding parameters, such as the spectral flatness, spectral dynamics, or spectral centroid as input. Narwaria et al. (2010) trained a support vector regression model with MFCCs as input to build a single-ended speech quality model. Later, Narwaria et al. (2012) directly applied Mel filter
bank energies instead of MFCCs. Dubey and Kumar (2013) also used a GMM to predict speech quality but applied combinations of several auditory perception and speech production features, such as principal components of Lyon's auditory model, MFCCs, line spectral frequencies, and their first and second differences as input. Li et al. (2014) used spectro-temporal features that were extracted by a Gabor filter bank. The reduced feature set after a PCA (Principal Component Analysis) was then used to train a support vector regression model. Sharma et al. (2016) proposed a classification and regression tree model that was trained on PESQ predicted MOS values with a number of different input features, for example, based on LPC (linear predictive coding) coefficients, the Hilbert envelope, and the long-term average speech magnitude spectrum. Avila et al. (2019a) presented a double-ended speech quality model based on i-vectors. The i-vectors of the reference signal are compared to the i-vectors of the degraded signal with a cosine similarity score. Furthermore, they showed that the i-vector cosine similarity outperforms PESQ and POLQA on the considered datasets.
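Most of these non-deep learning approaches follow the same recipe: a fixed-length set of handcrafted features is extracted per file, and a shallow regressor is fitted to subjective MOS. The following Python sketch illustrates this recipe with MFCC statistics and a support vector regression; the feature settings, file lists, and labels are placeholders and do not reproduce any of the cited models.

```python
import numpy as np
import librosa
from sklearn.svm import SVR

def mfcc_features(path, sr=16000, n_mfcc=20):
    """Represent a speech file by the mean and std of its MFCCs over time."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# train_files and train_mos are placeholders: a list of degraded speech files
# and their subjective MOS labels from a listening test.
X = np.stack([mfcc_features(f) for f in train_files])
y = np.array(train_mos)
svr = SVR(kernel="rbf", C=1.0).fit(X, y)

# Predict the quality of a new degraded file (single-ended, no reference needed)
mos_hat = svr.predict(mfcc_features("degraded_sample.wav")[None, :])[0]
```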

2.6.2 Deep Learning Architectures Before presenting deep learning based speech quality models, this subsection gives a brief overview of deep learning architectures and introduces some related terminology. Deep Feedforward Networks Feedforward networks or multilayer perceptrons are the quintessential deep learning models (Goodfellow et al. 2016). The aim of a feedforward network is to approximate some function f*. The network results in a mapping y = f(x; θ) with the learned weights θ that gives the best function approximation. They are called feedforward because information flows from x to y without any feedback connections in which the outputs of the model are fed back into itself. Because they typically represent a composition of many different functions, they are called networks. For example, the network may consist of three functions f^(1), f^(2), and f^(3) that are connected in a chain to form f(x) = f^(3)(f^(2)(f^(1)(x))) with f^(1) being the first layer, f^(2) the second layer, and so on. The overall length of the chain gives the depth of the model, hence the terminology “deep learning”. The final layer is also referred to as output layer. During training, the model weights θ are optimised in a way that the output layer gives an estimate of the target value y. Because the behaviour of the other layers is not directly specified, they are also called hidden layers. The outputs of a hidden layer, which can be seen as internally calculated features, are called hidden units and are used as input for the next layer. In feedforward networks, the basic mapping function corresponds to linear regression models with bias terms that can be written as f^(1)(x; W, c) = x^T W + c. These layers are also referred to as fully connected layers (FC) as they directly connect all input units with the output units. However, in order for
the neural network to model non-linear functions, a non-linear activation function has to be applied. Typically in modern neural networks, the so-called rectified linear unit or ReLU (Jarrett et al. 2009) that sets all negative values to zero is used. It can be written as g(z) = max{0, z}. A basic feedforward network could thus be f(x) = f^(3)(g(f^(2)(g(f^(1)(x))))) (or [FC–ReLU–FC–ReLU–FC]), where activation functions are applied to the hidden layers and the output layer gives the estimated target value y. Convolutional Networks Convolutional neural networks (LeCun et al. 1989), ConvNets, or CNNs are a specialised kind of feedforward network that contain the so-called convolutional layers and were originally developed for image classification tasks. These convolutional layers are composed of filters, also called kernels, with a relatively small size, for example, 3 × 3 pixels. During the forward pass, each kernel slides across the width and height of the input image (e.g. 200 × 300 pixels) and computes a dot product between the kernel weights and the input at any position. The output is then a 2D map that gives the response of that filter at every spatial position. Typically, multiple kernels are contained within one layer (e.g. 16 kernels). The outputs of each kernel are then stacked together to give the output of the convolutional layer (e.g. 200×300×16). In order to preserve the size of the input image, typically the sliding step size (also called stride) is set to 1, and zero padding is applied to the borders of the image. The kernel weights are learnable parameters; therefore, intuitively the network will learn kernels that activate when they see visual features that are useful for the given classification task, for example, an edge of some orientation on the first layer, or eventually, for example, face-like patterns on higher layers of the network. Another important layer in CNNs is the so-called pooling layer that is inserted in between successive convolutional layers. Its function is to progressively downsample the spatial output size and to reduce the number of weights, thus also avoiding overfitting. The most common pooling layer is max-pooling, which is, for example, applied with the dimensions 2 × 2. The pooling layer slides across the width and height of the convolutional output with a stride of 2 and gives for each 2 × 2 input window the maximum value as output. In this way, the output size is reduced by half, and only the hidden units with the highest activation within their surrounding are passed on to the next layer. A simple CNN could, for example, be designed as follows: [Conv–ReLU–Max Pool–FC]. Recurrent Networks Recurrent neural networks or RNNs are a type of neural network for processing sequential data. In contrast to feedforward networks, RNNs allow previous outputs to be used as inputs. In this way, they are able to remember past inputs from previous time steps and make decisions based on the past and the present. A basic RNN layer (omitting bias terms) with activation function g(z) can be written as h_i = g(W_x x_i + W_h h_{i−1}), where x_i is the new input of the current time step at time i and h_{i−1} is the hidden state of the previous time step. Both inputs are multiplied with two different learnable weight matrices W_x and W_h and then give the new output h_i. In theory, RNNs can be applied to arbitrarily long sequences;
however, in practice, it is difficult for RNNs to capture long-term dependencies because of the vanishing gradient problem. Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber 1997) are a class of RNNs that overcome this problem. One key advancement of LSTMs is the included cell state that stores information from previous and current time steps but is also able to forget information. The amount of old and new information that is stored in the cell state and information that is forgotten is controlled by gates with learnable weights. Through this technique, LSTMs are able to model long-term dependency problems and are often used for Natural Language Processing (NLP) tasks, such as neural machine translation. One variant of LSTMs is the bidirectional LSTM, which is in fact two LSTMs that run the input sequence in two different ways, one forward and the other one backward. The outputs of both directions are then concatenated to give the output of the BiLSTM. Attention Mechanism The attention mechanism was originally introduced for machine translation problems by Bahdanau et al. (2015), where an input sequence of words is translated to an output sequence of words in a different language. The input sequence is processed by an encoder RNN and produces a sequence of hidden states h_j for each original word at time step j, while the decoder RNN computes an output sequence of hidden states s_i at each time step i, which are used to calculate the translated words y_i. The decoder hidden states can be calculated with a recursive formula of the form s_i = f(s_{i−1}, y_{i−1}, c_i),

(2.6)

where s_{i−1} is the previous hidden vector, y_{i−1} is the generated word at the previous time step, and c_i is a context vector that captures the context from the original sentence. In contrast to traditional sequence to sequence translation models, this context vector is calculated with an attention mechanism that focuses on particular words in the original sentence, which are relevant for the currently translated word. The context vector is computed as a weighted average across the original words' hidden states of the encoder as follows: c_i = Σ_{j=1}^{N} α_{i,j} h_j ,

(2.7)

where α_{i,j} gives more weight to important words of the original sentence. Therefore, the network puts more attention to particular words that help the model to translate at the current time step. For example, the network may translate a word at the end of the sentence; however, some information from the start of the original sentence may be important for a proper translation. The weights α can be calculated with any function a, for instance, a fully connected network as follows: e_{i,j} = a(s_{i−1}, h_j),

(2.8)

where a softmax function is applied to normalise the attention scores e_{i,j}: α_{i,j} = exp(e_{i,j}) / Σ_{k=1}^{n} exp(e_{i,k}) .

(2.9)

The attention scores at one decoder time step are thus calculated from the hidden states of the previously translated word s_{i−1} and all original words (h_1, . . . , h_N). Since the attention mechanism was introduced, it has been applied to a variety of different applications in many different ways. The main idea is always to let the network put attention to particular time steps or positions and to let the network learn the attention function automatically through learnable parameters. Transformer A more recent architecture for neural machine translation is the Transformer model presented by Vaswani et al. (2017), which is used by systems such as BERT (Devlin et al. 2019) and GPT-3 (Brown et al. 2020). As the title of their paper “Attention Is All You Need” implies, it mainly relies on attention and feedforward networks without the need for recurrent sequences. It is a deep model with a sequence of many self-attention based Transformer blocks. Self-attention is different from the previously described attention mechanism in that it is only applied to the input sequence instead of to the input and output sequences. In the case of machine translation, it gives the network the possibility to include information from other words into the one that is currently processed. In the Transformer block, this is implemented by firstly calculating three different vectors from the input sequence (x_1, x_2, . . .): the query, key, and value as follows: q_i = W_q x_i ,  k_i = W_k x_i ,  v_i = W_v x_i ,

(2.10)

where W_q, W_k, and W_v are three different learnable weight matrices that transform the input sequence at time step i to new vectors q_i, k_i, and v_i. The query and key vectors are then compared to each other to yield an attention weight matrix α, as it was also done in the attention mechanism previously described (see Eq. (2.9)). The weights are then applied to the value vectors to give the output sequence z of the self-attention layer. The Transformer block contains multiple of these self-attention layers in parallel, where the same calculation is performed with different weight matrices that consequently give different outputs (z_1, z_2, . . .). This concept is referred to as multi-head self-attention. Each of these multi-head self-attention layers is then followed by a feedforward network with two layers and ReLU. The described Transformer block is the basic and most important building block of the Transformer model that makes it possible to directly model the relationships between all words in a sentence, regardless of their respective position. Training Deep Networks Neural networks are trained by optimising the model weights θ to minimise a loss function. In the case of a regression task, the loss function is usually the mean squared error (MSE) between the target values and the predicted outputs of the
network, L = (1/N) Σ_{i=1}^{N} (x_i − y_i)^2. The network is trained by updating the weights in the opposite direction of the gradient of the loss function with respect to the weights, θ = θ − η ∇_θ L(θ). This optimisation method is referred to as gradient descent. The learning rate η determines the size of the steps that are taken to reach a (local) minimum. A small learning rate can lead to very slow convergence, while with a too-large learning rate it is possible to “overshoot” the optimisation and miss the minimum. Adam (Adaptive Moment Estimation), presented by Kingma and Ba (2015), is another optimisation method that computes adaptive learning rates for each weight and additionally stores an exponentially decaying average of past gradients. The method is often used as the default optimisation algorithm as it usually converges fast and works on many different problems. The gradients are computed through the so-called backpropagation algorithm (Rumelhart et al. 1986). It works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time. This efficient calculation of the gradient is essential to make the training of deep neural networks possible. The optimisation is usually performed on a smaller subset of samples, taken from the overall training dataset. These small subsets are referred to as mini-batches. The optimisation is then performed on each mini-batch until the network has seen all available training samples. One of these runs through the entire training set is called an epoch. Typically, at the start of each epoch, the mini-batches are randomly sampled to increase the variety of data the model sees at each individual optimisation step. A layer that is often included in CNNs is the Batch Normalisation layer introduced by Ioffe and Szegedy (2015). It normalises the inputs of the layers by recentring and re-scaling the values of the hidden units. The normalisation operation is performed on each mini-batch and makes training of neural networks faster and more stable. Another form of normalisation is the Layer Normalisation layer (Ba et al. 2016), for example, used in the Transformer model. Instead of normalising the hidden units across a mini-batch, the units are normalised across each layer. Dropout (Srivastava et al. 2014) is a layer that is often used to avoid overfitting to the training data. It works by randomly setting a certain percentage of hidden units of a layer to zero. Therefore, these units are ignored during that optimisation step. At the next step, another random set of outputs is set to zero. After training, the dropout is deactivated to apply the model to test data.
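To tie the above building blocks together, the following PyTorch sketch defines a toy MOS regressor with a small CNN, a bidirectional LSTM, and average pooling over time, and performs one optimisation step with MSE loss and the Adam optimiser. All layer sizes and the random input data are illustrative assumptions and do not correspond to the model developed in this work.

```python
import torch
import torch.nn as nn

class ToyQualityNet(nn.Module):
    def __init__(self, n_mels=48):
        super().__init__()
        # CNN block [Conv-ReLU-MaxPool] applied to a Mel-spectrogram "image"
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),            # downsample frequency, keep time
        )
        # Bidirectional LSTM models the time dependencies of the CNN features
        self.lstm = nn.LSTM(16 * (n_mels // 2), 64, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * 64, 1)       # output layer -> MOS estimate

    def forward(self, x):                    # x: (batch, 1, n_mels, time)
        feat = self.cnn(x)                   # (batch, 16, n_mels/2, time)
        feat = feat.flatten(1, 2).transpose(1, 2)     # (batch, time, features)
        out, _ = self.lstm(feat)
        return self.fc(out.mean(dim=1)).squeeze(-1)   # average-pool over time

model = ToyQualityNet()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One optimisation step on a random mini-batch (placeholder data)
mel_batch = torch.randn(8, 1, 48, 100)       # 8 files, 48 Mel bands, 100 frames
mos_batch = torch.rand(8) * 4 + 1            # fake MOS targets in [1, 5]
loss = loss_fn(model(mel_batch), mos_batch)
optimiser.zero_grad()
loss.backward()                               # backpropagation
optimiser.step()                              # Adam weight update
```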

2.6.3 Deep Learning Based Speech Quality Models In this subsection, a review of deep learning based speech quality models is given. Soni and Patil (2016) presented a single-ended speech quality model that uses a deep autoencoder to extract features. These features are mapped to subjective MOS values of the NOIZEUS database. The autoencoder is based on a feedforward network with a bottleneck structure. The network firstly reduces the dimension of

the hidden units and then increases the dimensions again to estimate the original inputs. The low-dimension hidden units are then used as compressed features for the speech quality model. Their model was shown to outperform P.563 in a cross-validation evaluation. Their approach with deep autoencoder features also outperformed the same model with Mel-filtered spectrograms as inputs. Ooster et al. (2018) presented a single-ended speech quality model that is based on the outputs of an automatic speech recognition model. The automatic speech recognition model itself combines a deep neural network with a hidden Markov model and produces phoneme probabilities, which are used to calculate a mean temporal distance (MTD) from the phoneme probabilities. The correlation between the MTD values and the subjective MOS values for different distortion classes of the TCD-VoIP dataset is then used for the evaluation of the approach. The method was later improved by including a voice activity detection (Ooster and Meyer 2019). Fu et al. (2018) presented Quality-Net, which is a single-ended speech quality model based on a BiLSTM network. They trained and evaluated the model on PESQ MOS predictions of noisy and enhanced speech files, where the source speech files are taken from the TIMIT dataset. The inputs to the model are spectrograms. These are processed by the BiLSTM that outputs a sequence of per-frame quality predictions for each time step. The per-frame quality predictions are then averaged to yield the overall MOS prediction. As loss function during training, the model uses a combination of per-frame MSE and overall MOS MSE that optimises the model to predict per-frame quality and overall quality. The per-frame target values for training are based on the overall MOS target values. It was shown to outperform the previously described autoencoder model by Soni and Patil (2016) on the used dataset. Shan et al. (2019) presented a single-ended speech quality model that predicts a pseudo-reference signal from the degraded signal through a deep neural network. The difference between the MFCC features of the degraded signal and the pseudo-reference is then used to predict the overall quality, based on a deep belief neural network. Their model was also shown to outperform the autoencoder model by Soni and Patil (2016) on a Chinese dataset, annotated with subjective MOS ratings. The author of this work presented a single-ended CNN-LSTM speech quality model for super-wideband communication networks (Mittag and Möller 2019d). The model is based on a CNN with Mel-spectrograms as input that firstly estimates a per-frame quality. The CNN per-frame quality, together with MFCC features, is then used as input to a BiLSTM network, where the last time step of the LSTM is used to predict the final overall speech quality. The per-frame quality predictions are trained with POLQA per-frame similarities as target values. The overall MOS was trained and evaluated mostly on datasets from the POLQA pool. The model outperformed P.563 and ANIQUE+ but was outperformed by POLQA. It should be noted that the model was not trained end-to-end but rather in two stages. Also, it was not evaluated on conversational or spontaneous speech. The same approach was later also used to predict the speech quality dimensions Noisiness, Coloration, and Discontinuity (Mittag and Möller 2019b). Due to the lack of subjective ratings for the speech quality dimensions, the training dataset was mostly based on expert scores.

Avila et al. (2019b) presented four different deep learning architectures to predict speech quality without a reference signal. The models were trained and evaluated on a crowdsourced dataset with 10,000 speech files. The first model is a CNN with a constant Q spectrogram as input. The second model is a feedforward network with i-vectors as input. The third model is a feedforward network that uses MFCCs combined with a pitch estimate, where only the active frames determined by a VAD are used. The fourth model is equivalent to the third model but additionally applies an extreme learning machine (ELM). They showed that their fourth model gives the best results and also outperforms P.563 and PESQ. Cauchi et al. (2019) presented an LSTM based speech quality model that uses modulation energies as inputs. The LSTM network consists of two layers, where the last time step of the second layer is used to predict the overall quality. Their model was shown to outperform P.563 and POLQA on a dataset with noisy and enhanced speech files that were annotated with subjective quality ratings. Catellier and Voran (2020) presented a single-ended speech quality model with a 1D CNN model that is directly applied to the raw waveform of the speech signal. It is trained and evaluated on a dataset of 100,681 files with POLQA, PESQ, PEMO, and STOI predicted values. The speech files are impaired with wideband conditions, such as different noises and codecs. The model is referred to as WEnets in Chap. 7. The author of this work also presented a double-ended and single-ended CNN-LSTM model that is trained end-to-end (Mittag and Möller 2020b). Because of the lack of subjective data, the model was pretrained with POLQA predicted MOS (these earlier models do not correspond to the models presented in this work, which are completely trained on subjective data). After pretraining, the model is finetuned on datasets from the POLQA pool. The double-ended model is applied through a Siamese neural network and an attention based time-alignment. The model architecture is an earlier version of the presented model in this work; however, it does not contain self-attention layers and is not trained on conversational/spontaneous speech. Dong and Williamson (2020a) presented a CNN-LSTM speech quality model with an additional self-attention layer. They use the speech spectrogram as input to a CNN with four convolutional layers. The output of the CNN is then used by a BiLSTM layer, followed by a self-attention layer. They train the model to predict PESQ, ESTOI, HASQI, and SDR objective values in a multi-task learning approach, where only the self-attention blocks are task-specific, and the CNN-LSTM block is shared across tasks. In addition to the MSE loss for the regression task, they introduce a second loss for a classification task for which the target values are divided into classes. On their own dataset with noisy conditions, their model was shown to outperform the models by Mittag and Möller (2019d), Avila et al. (2019b), and Fu et al. (2018), where Mittag and Möller (2019d) ranked second on the real-world dataset. Their network architecture is similar to the model presented in this work. However, they do not apply attention-pooling. Also, the model in this work omits the LSTM network, and instead of applying a basic self-attention
mechanism, a Transformer block is used as self-attention layer. Furthermore, in this work, the Mel-spectrogram is divided into segments before it is applied to the CNN network. In contrast, Dong and Williamson (2020a) applied the CNN to the complete spectrogram. Therefore, their model cannot directly be applied to signals with different duration. Also, the pooling layer within their CNN downsamples the outputs in the time domain before they reach the LSTM network. Therefore, some time-dependency modelling and pooling is already done by the CNN before the LSTM network. The model in this work retains the time dimension after the CNN block. Dong and Williamson (2020b) recently presented a further single-ended speech quality model that is based on pyramid BiLSTM networks. In this network, after each BiLSTM layer, the time resolution of the sequence is reduced before it is passed to the next layer. Overall, they use one BiLSTM and three pyramid BiLSTM with the following self-attention layer. They trained and evaluated the model on noisy speech datasets with crowdsourced quality ratings. Again they outperform the other models from Mittag and Möller (2019d), Avila et al. (2019b), and Fu et al. (2018), where Mittag and Möller (2019d) ranked second.

2.7 Summary In this chapter, at first, an overview of modern speech communication networks was given, where the transmission is mostly digital, and telephony with wideband bandwidth has now become widespread. Some service providers also offer even higher quality with super-wideband speech transmission. A number of typical impairments that occur in speech networks were then presented. Because of the digital transmission, the main factors that influence the speech quality these days are packet loss, transcoding, signal-processing algorithms of the terminal devices, and background noises. In Sect. 2.2, the term speech quality and the four speech quality dimensions Noisiness, Coloration, Discontinuity, and Loudness were presented and described. Section 2.3 then described how speech quality can be assessed through auditory experiments in the laboratory and which type of subjective biases can be introduced within an experiment. Recently, it has become popular to conduct speech quality experiments via crowdsourcing instead of in the laboratory. However, to obtain consistent results in the crowd, it is important to follow a set of rules that are described in Sect. 2.4. Section 2.5 presented traditional speech quality models that are not mainly developed with machine learning but rather through handcrafted features and quality perception models. Section 2.6.1 then gives an overview of machine learning based speech quality models that are not based on deep learning. After a short overview of deep learning architectures in Sect. 2.6.2, deep learning based speech quality models are presented in Sect. 2.6.3.

The first machine learning based and purely data-driven speech quality models can be found in the literature from the 2000s. These models were mostly trained with shallow neural networks, Gaussian mixture models, and support vector regression. From 2018 onwards, the trend shifted towards end-to-end trained deep neural networks that apply CNNs and/or LSTM networks. Due to the lack of common datasets in the community, it is difficult to compare machine learning based models in the literature as they are mostly trained and evaluated on different datasets. Furthermore, because of the tendency of machine learning based models to overfit to training data, it is relatively easy to outperform traditional speech quality models on training and validation sets that come from the same data distribution. For these reasons, the traditional model POLQA still remains the state-of-the-art speech quality assessment model, as it has been shown to be reliable on a wide variety of different datasets; however, it requires the availability of a clean reference signal.

Chapter 3

Neural Network Architectures for Speech Quality Prediction

In this chapter, different neural network architectures are investigated for their suitability to predict speech quality. The neural network architectures are divided into three different stages. At first, a frame-based neural network calculates features for each time step. The resulting feature sequence is then modelled by a time-dependency neural network. Finally, a pooling stage aggregates the sequence of features over time to estimate the overall speech quality. It will be shown that the combination of a CNN for per-frame modelling, a self-attention network for time-dependency modelling, and an attention-pooling network for pooling yields the best overall performance. To train and evaluate the different models, a large dataset with crowdsourced MOS values was created. To this end, reference speech files from four different source datasets were collected and then processed with a wide variety of different distortions. Furthermore, a smaller dataset with live call conditions was also created.
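The framewise stage operates on short segments of the Mel-spectrogram, as described later in this chapter. The following Python sketch shows one way such segments could be computed with librosa; the FFT, hop, and segment sizes, as well as the file name, are assumed example values and not the settings used by the model in this work.

```python
import numpy as np
import librosa

def mel_spec_segments(path, sr=48000, n_mels=48, hop=0.01, seg_frames=15):
    """Compute a log Mel-spectrogram and cut it into overlapping segments (sketch)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=int(0.02 * sr), hop_length=int(hop * sr), n_mels=n_mels)
    mel_db = librosa.power_to_db(mel)                # log-scaled Mel-spectrogram
    half = seg_frames // 2
    mel_db = np.pad(mel_db, ((0, 0), (half, half)))  # pad so every frame has context
    # one segment of `seg_frames` Mel frames centred on each original frame
    return np.stack([mel_db[:, i:i + seg_frames]
                     for i in range(mel_db.shape[1] - seg_frames + 1)])

segments = mel_spec_segments("speech_clip.wav")      # shape: (time, n_mels, seg_frames)
```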

3.1 Dataset A new dataset was created for the training and evaluation of the proposed model in this work. On the one hand, this dataset aims to contain the most common distortions in speech communication networks. On the other hand, it should contain a wide variety of different speakers and sentences that reflect a realistic telephone conversation scenario. To achieve a high diversity in the recordings, speech signals from four different speech corpora were used as reference files for this dataset.

3.1.1 Source Files AusTalk: An Audio-Visual Corpus of Australian English AusTalk (Burnham et al. 2011) is a large speech dataset that contains 861 adult speakers, whose ages range between 18 and 83 years. The samples were recorded in 15 different locations in Australia to cover the regional and social diversity and linguistic variation of Australian English. Each speaker was recorded for one hour in a range of scripted and spontaneous speech situations. The dataset is open source and was produced by a group of eleven Australian universities. In this work, the spontaneous speech recordings from the interview task are used, as this is closest to a telephone conversation. This task aimed to capture spontaneous, engaged, narrative talk, which is achieved by empathetic interviewers who suggest different topics to the participants. Often they talk, for example, about their profession, hobbies, or a holiday trip. The samples are recorded with a sample rate of 48 kHz, which allows for processing of fullband conditions. The quality of the samples is generally very high; however, occasionally low-level background noises are present. Also, in some cases, the interviewer can be heard asking questions to the participants. The AusTalk interview speech files are between 3 s and 34 minutes long. To cut the files into shorter segments, all files with a duration longer than 10 s were used. Each file was then segmented into clips with a random size between 6 and 12 s. However, the first 30 s were excluded as they often contain questions by the interviewer. TSP Speech Database The TSP dataset (Kabal 2002) contains over 1400 utterances spoken by 24 speakers. The text for the utterances is taken from the Harvard sentences (IEEE 1969). The files were recorded in an anechoic room with a sample rate of 48 kHz. Therefore, these files are also suitable for processing of fullband conditions. Most of the speakers were native (Canadian) English speakers, while a small number of participants were non-native speakers. The speech samples are generally of very high quality with occasional pops within the recordings. Also, the files contain a low-level background noise caused by the microphone preamplifier. The dataset is open source and was created by the TSP lab of McGill University. Each file is around 3 s long and contains one sentence that was read out by the participants. To create typical ITU-T Rec. P.800 double sentences, two sentences were merged together into one clip. The pause before the first sentence, in the middle of both sentences, and after the last sentence was randomly chosen between 1 and 2 s. The resulting source speech clips of this dataset are therefore similar to the ones used for double-ended speech quality prediction models such as POLQA. UK-Ireland Dataset This is an open-source high-quality corpus of British and Irish English accents created by Google Research (Demirsahin et al. 2020). It contains over 31 hours of speech from 120 volunteers who identify as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish, and Irish varieties of English. The

The script lines that were read out by the participants are retrieved from a variety of sources, for example, public domain texts, such as Wikipedia, or virtual assistant dialogues. The total number of used text lines is 2050, and the overall number of files in this dataset is 17,877. Most of the speech files were recorded in a sound-insulated recording room, while others were recorded in a quiet, empty room. The high-quality audio was saved with a sample rate of 48 kHz. The files of this dataset are between 2 and 20 s long and contain one sentence. All files with a duration higher than 12 s were then cut at the end of the speech sample to a random duration of 6 to 12 s. Files with a duration shorter than 6 s were discarded.

DNS-Challenge (Librivox)
The DNS-Challenge dataset (Reddy et al. 2020) was created by Microsoft for the Deep Noise Suppression challenge, in which different speech enhancement methods are trained and evaluated on the same open-source dataset. The clean speech files are taken from the public audiobook dataset LibriVox (LibriVox 2020) that contains recordings of over 10,000 audiobooks. Because some of the audiobooks were recorded with poor quality, Reddy et al. (2020) performed a P.808 listening test to sort the book chapters by subjective quality. Then the top 25% of audio chapters, with MOS values from 4.3 to 5, were extracted for the DNS-Challenge dataset. The resulting clean files comprise 500 hours of speech from 2150 speakers. Unfortunately, the LibriVox audiobooks are recorded with a sample rate of 16 kHz, which only allows for processing of wideband conditions but not super-wideband or fullband. For this work, the DNS clean files, which have a duration of 30 s, were cut to clips with a randomly selected length of 6 to 12 s.

3.1.2 Simulated Distortions

These source speech files were then processed with a large variety of different distortions to produce a training and a validation set. At first, reference speech files were randomly drawn from the large pool of created 6–12 s audio clips. In order to obtain a large variety of different speakers, the same number of clips was sampled for each speaker. Also, different speakers were used for the training and validation sets to avoid using the same speakers or sentences for the model evaluation as for model training. Table 3.1 shows the number of sentences and speakers that were used for the training and validation datasets. Most files were taken from the AusTalk and DNS datasets because they contain a large variety of speakers and because the contained speech is spoken more spontaneously than in the datasets UK-IRL and TSP. Overall, 10,000 sentences from 2322 speakers were selected for training, and 2500 files from 938 speakers were selected for validation. The TSP speech samples were mainly included so that the model also learns to predict typical P.800 double sentences with a silent pause in between.


However, because the main aim of the presented model is the prediction of spontaneous speech and because of the low numbers of speakers in this dataset, the TSP source files were not included in the validation dataset.

Table 3.1 Simulated datasets: the number of speakers and sentences from different source datasets in the simulated training and validation datasets. The number of sentences is equal to the number of contained files because only unique sentences have been used

Source dataset   Type              Speakers (Train)   Speakers (Val)   Sentences (Train)   Sentences (Val)
AusTalk          Interviews        600                77               4500                1000
DNS              Audiobooks        1599               341              4000                1000
UK-IRL           Single sentence   100                20               1000                500
TSP              Double sentence   23                 0                500                 0
Total                              2322               938              10,000              2500

These files were then processed with a large number of distortion conditions, such as background noise, packet loss, or audio bandwidth limitations. Each of these conditions was processed with different values of the corresponding distortion parameter (e.g. SNR for background noise) to cover the whole range of possible distortion levels. The application of a wide range of levels during training should avoid a non-monotonic prediction behaviour of the model, that is, the model should on average (over multiple files) rate a speech signal with a higher SNR higher than a file with a lower SNR. The model could otherwise behave unexpectedly during test time if a certain range of SNR values is left out during training, for example, extremely loud or quiet noise. The number of files equals the number of conditions in the datasets because each file was processed with a different condition. To simulate real background noises, the noise clips from the DNS-Challenge dataset were used, which in turn are taken from the Audioset (Gemmeke et al. 2017), Freesound (2020), and DEMAND (Thiemann et al. 2013) corpora. Audioset is a collection of almost 2 million human-labelled 10-s noise clips drawn from YouTube videos that belong to 600 individual audio events. The DNS-Challenge dataset contains 60,000 noise clips from 150 noise classes. They were sampled in a way that there are at least 500 clips from each class. Additionally, 10,000 noise files from Freesound and the DEMAND dataset were added. The chosen noise types are particularly relevant to VoIP applications. The following conditions were included in the dataset:

• White noise: Additive white Gaussian noise with SNR values from 0 to 50 dB.
• MNRU noise: Signal-correlated noise according to the Modulated Noise Reference Unit of ITU-T Rec. P.810 (1996) with Q values from −30 dB to 30 dB applied to the clean fullband signal.
• Recorded background noise: Randomly sampled noise clip taken from the DNS-Challenge dataset, added with SNRs between −10 dB and 50 dB.


• Temporal clipping: Randomly selected frames with a duration of 20 ms are deleted and replaced with zeros, which leads to audible interruptions. The burstiness, or the maximum number of consecutive deleted frames, was randomly chosen between 1 and 10. The number of overall deleted frames was set between 1% and 50%.
• Highpass filter: A minimum-order filter with a stopband attenuation of 60 dB. The cut-off frequency is varied between 70 Hz and 2.2 kHz. The selected cut-off frequencies were chosen in Mel, and thus the lower frequencies are chosen more often.
• Lowpass filter: A minimum-order filter with a stopband attenuation of 60 dB. The cut-off frequency is varied between 290 Hz and 16 kHz. Again, the cut-off frequencies were selected in Mel.
• Bandpass filter: A minimum-order filter with a stopband attenuation of 60 dB. The low and high cut-off frequencies are varied between 70 Hz and 16 kHz, where the frequencies were set in Mel steps and the passband width was at least one Mel.
• Arbitrary filter: Single-band FIR filter with an arbitrary magnitude response, where for each random frequency in a frequency vector of length between 1 and 99, a random magnitude is selected. The randomly selected magnitudes depend on the previous magnitude in order to avoid large magnitude jumps.
• Amplitude clipping: To simulate an overloaded signal, the signal amplitude is clipped to values between 0.01 and 0.5, where the maximum signal amplitude is normalised to 1.
• Active speech level: The active speech level, according to ITU-T Rec. P.56 (2011), was varied between −10 dBov and −70 dBov.
• Codecs: Overall, six codecs were available. Each bitrate mode of each codec was then applied.
  – G.711 (ITU-T Rec. G.711 1988): Widely used in narrowband VoIP communication. Only one bitrate is available for this codec. No packet-loss concealment is applied. The signal is resampled to 8 kHz sample rate prior to applying the codec.
  – AMR-NB (3GPP TS 26.071 1999): Used, for example, in narrowband mobile networks. There are nine bitrate modes available. It contains a concealment algorithm for lost frames. The signal is resampled to 8 kHz sample rate prior to applying the codec.
  – G.722 (ITU-T Rec. G.722 2012): Used in wideband VoIP communication. There are three different bitrates available. No packet-loss concealment is used. The signal is resampled to 16 kHz sample rate prior to applying the codec.
  – AMR-WB (3GPP TS 26.171 2001; ITU-T Rec. G.722.2 2003): Used in wideband mobile networks (e.g. in UMTS networks). There are nine different bitrate modes available. It contains a concealment algorithm for lost frames. The signal is resampled to 16 kHz sample rate prior to applying the codec.


  – EVS (3GPP TS 26.441 2014): Used, for example, for super-wideband mobile networks through VoLTE. There are 12 different bitrate modes available. It also supports a narrowband, wideband, and fullband mode. However, in this dataset, only the super-wideband mode is used, except for the lower bitrates, where the wideband mode is selected automatically. It contains a state-of-the-art packet-loss concealment algorithm. The signal is resampled to 32 kHz sample rate prior to applying the codec.
  – Opus (RFC 6716 2012): An open-source codec that is often used by over-the-top VoIP services (e.g. messenger apps). Because Opus does not have predefined bitrate modes, 12 fixed bitrates between 6 and 48 kbit/s were selected. It also contains a modern packet-loss concealment algorithm.
• Codec Tandem: For all codecs and bitrate modes, the speech files were also processed as tandem and triple tandem, where the speech files are processed with the same codec twice or three times in a sequence.
• Packet Loss: All codecs were processed with packet-loss conditions. The error patterns were generated with the "gen-pat" function of the ITU-T STL Toolbox (ITU-T Rec. G.191 2019). The patterns are then applied directly to the encoded G.192 bitstream (ITU-T Rec. G.192 1996). After applying the error pattern with the "eid-xor" function, the bitstream with the missing frames was decoded. Most of the codecs apply a packet-loss concealment algorithm. These algorithms synthesise a speech signal by using information from previously successfully received packets (Lecomte et al. 2015). If the original speech signal is unaltered, for example, in the middle of a vowel with constant pitch, the concealment usually works very well, and the lost packet may not be perceived. However, when the algorithm fails to synthesise a signal that is similar to the original signal of the lost frame, it often results in interruptions, robotic voices, or artificial sounds. The official Opus implementation does not create a bitstream in the G.192 format; however, it allows simulating random packet loss. To apply bursty packet loss, the Opus implementation was modified to be able to read the generated error patterns of the "gen-pat" function. Each codec and bitrate mode was processed with random packet loss from 0.01 to 0.50 frame error rate and also with bursty packet loss from 0.01 to 0.3 frame error rate. The "gen-pat" function uses the Gilbert model with γ = 0 to generate random patterns and the Bellcore model (Varma 1993) for generating the bursty error patterns.
• Combinations:
  – White noise/codec
  – White noise/codec/packet loss
  – Recorded background noise/codec
  – Recorded background noise/codec/packet loss
  – Amplitude clipping/codec/packet loss
  – Arbitrary filter/codec/packet loss
  – Active speech level/codec/packet loss
  – Random combinations: Each distortion condition was added with a likelihood of p = 0.1, except for the likelihood of processing the file with a codec, which was set to p = 0.8. The likelihood of adding a second or third codec was also set to p = 0.1. For 10% of the files, the distortion was added after encoding and decoding the file with a codec, and in the other cases, the distortions were added before applying the codec.


Overall, 12,500 conditions were processed in this way. Because each file was processed with an individual condition, the overall number of files is also 12,500.
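To make the simulation procedure concrete, the following minimal Python sketch shows how a random combination of distortions could be applied to a clean clip with the probabilities described above. It is not the processing chain used for the dataset; the function names, the subset of distortions, and all implementation details are illustrative assumptions.

```python
# Illustrative sketch of the random-combination idea (hypothetical helpers,
# not the author's original pipeline).
import numpy as np

rng = np.random.default_rng(0)

def add_white_noise(x, snr_db):
    """Add white Gaussian noise at the given SNR in dB (relative to signal power)."""
    p_sig = np.mean(x ** 2)
    p_noise = p_sig / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(p_noise), size=x.shape)

def clip_amplitude(x, threshold):
    """Simulate an overloaded signal by clipping the normalised amplitude."""
    x = x / np.max(np.abs(x))
    return np.clip(x, -threshold, threshold)

def simulate_condition(x):
    """Apply each distortion with p = 0.1; a codec would be applied with p = 0.8."""
    if rng.random() < 0.1:
        x = add_white_noise(x, snr_db=rng.uniform(0, 50))
    if rng.random() < 0.1:
        x = clip_amplitude(x, threshold=rng.uniform(0.01, 0.5))
    # ... temporal clipping, filters, codec (p = 0.8), packet loss, etc.
    return x

clean = rng.normal(0, 0.1, size=48000 * 8)   # placeholder for an 8 s clip at 48 kHz
degraded = simulate_condition(clean)
```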

3.1.3 Live Distortions

Besides the simulated distortions, a smaller dataset with live conditions was also created. To this end, live telephone and Skype calls were conducted, in which clean speech files from the DNS-Challenge dataset were played back via a loudspeaker. The speech signal was played back from a laptop on a "Fostex PM0.4n studio monitor". Two types of calls were conducted: a fixed-line to mobile phone call and a Skype call (laptop to laptop). For the first type, a call from a fixed-line VoIP phone (Cisco IP Phone 9790) within the Q&U Lab to a state-of-the-art smartphone (Google Pixel 3) was conducted. The VoIP handset was placed in front of the monitor to capture the speech signal acoustically. The received signal was then stored directly on the rooted Google Pixel 3 via a specifically developed "answer machine app" (Maier 2019). The Skype call was conducted between two laptops, where the sending laptop was placed next to the monitor to capture the played-back speech signal. The transmitted speech signal was then stored on the receiving laptop with the application "Audacity". The following conditions were created during the recordings:

• Moving the monitor closer to and further away from the handset/laptop
• Changing the angle of the monitor
• Changing the volume of the monitor
• Typing on the laptop's keyboard
• Open window (street and construction noise)
• Poor mobile phone reception (leading to interruptions)
• Different kinds of noises: moving glasses, pouring water into a glass, crumpling a plastic bag, opening/closing a door, coughing, etc.

The resulting speech files were then split into a training and a validation set. The same speakers that were used in the simulated training dataset were again used for the live training dataset, and vice versa for the validation set. However, new sentences of these speakers that are not contained in the training dataset were used. Because many of the recorded speech samples were of the same quality, POLQA-predicted MOS values were used to sample a subset of the recorded files. Table 3.2 gives an overview of the live dataset, where the training set contains 1020 files and the validation set 200 files. For technical reasons, in contrast to the simulated dataset, some of the files were used multiple times in the live dataset. Therefore, the number of sentences is lower than the overall number of files in the table.


Table 3.2 Live dataset: the number of speakers, sentences, and files of the live training and validation dataset

Source dataset   Type         Speakers (Train/Val)   Sentences (Train/Val)   Files (Train/Val)
DNS              Audiobooks   486 / 102              495 / 102               1020 / 200

3.1.4 Listening Experiment

The datasets were then annotated in a subjective listening experiment via crowdsourcing according to ITU-T Rec. P.808 (2018), using the Amazon Mechanical Turk implementation presented by Naderi and Cutler (2020). However, in addition to the overall quality, the four speech quality dimensions Noisiness, Coloration, Discontinuity, and Loudness were also rated by the crowdworkers. While the overall quality was rated on a five-point ACR scale, as in P.800, the quality dimensions were rated on an extended continuous scale (see Sect. 2.3). The ratings on the continuous scale were then transformed to MOS values as described by Köster et al. (2015). Mittag et al. (2020) showed that three ratings per file can already be enough to train a machine learning based speech quality prediction model. For example, training with 5000 files with 4 ratings each led to better performance than training with 2500 files and 8 ratings. These results indicate that a larger variety of speakers, sentences, and distortion conditions is more helpful for training than a low confidence interval of the MOS values. Because of this, in this work, 5 ratings per file were used, in order to allow for a larger dataset of overall 13,720 files. While the language proficiency of the participants was not explicitly tested, all of the workers were based in the USA. Before rating the files, each participating crowdworker listened to the same 11 training files with an accompanying explanation of the quality dimensions. These training files were selected to contain the whole spectrum of different speech quality, to anchor the ratings of the participants. This training section had to be performed once a day by the crowdworkers. The worker then rated 12 speech files on the 5 different scales. After submitting their answers, the worker could rate another batch of 12 files if wanted. To filter the crowdworkers' answers for spam, trapping questions were implemented, in which an interruption message is played back in the middle of a speech file. In this interruption message, the participants are asked to select the left, the middle, or the right side of the scales. Additionally, a gold standard question was included. The purpose of this question was to check whether the participants understood and followed the description of the quality dimensions. To this end, speech files that were only distorted in the Coloration dimension through a lowpass filter were created, which, as a consequence, sound extremely muffled. This assumed perception of the Coloration dimension was confirmed by a smaller pretest in the crowd.


The gold standard question was then used to filter the submissions. Only the answers of participants who rated the gold standard file's Coloration MOS at least one MOS point lower than its Discontinuity MOS and gave a Coloration MOS of at most 3 were accepted. Files that remained with fewer than 5 ratings after the filtering were rated again in a second round to ensure that every file was rated by enough subjects. Figure 3.1 presents the distribution of the resulting MOS values in the datasets. It shows that the quality of the simulated datasets (Fig. 3.1a, b) is distributed relatively evenly from poor to excellent quality. Furthermore, the distributions of the training and validation sets are similar. The distribution of the live datasets (Fig. 3.1c, d) shows that the highest quality in these datasets is around a MOS of 3.5. This is expected since the speech signals were transmitted through a telephone/VoIP network and therefore, even under perfect network conditions, are at least processed with a codec. In fact, a spectrogram analysis of the recorded speech files revealed that the "fixed-line to mobile phone" recordings are limited to narrowband, and the Skype calls are limited to wideband audio. The distributions of the dimension ratings can be found in Figs. B.1, B.2, B.3, and B.4 in the Appendix.

Fig. 3.1 MOS histogram of the simulated and live training and validation datasets. (a) Simulated training set: NISQA_TRAIN_SIM. (b) Simulated validation set: NISQA_VAL_SIM. (c) Live training set: NISQA_TRAIN_LIVE. (d) Live validation set: NISQA_VAL_LIVE


3.2 Overview of Neural Network Model

In this section, different neural network architectures for speech quality prediction are outlined. The deep learning based speech quality models presented in this chapter can generally be divided into four stages:

1. Mel-Spec Segmentation
2. Framewise Model
3. Time-Dependency Model
4. Pooling Model

An overview of this architecture can be seen in Fig. 3.2. In a preprocessing stage, Mel-spectrograms (Mel-specs) are calculated from the input signal and then divided into overlapping segments. In the second stage, a framewise neural network with the Mel-spec segments as inputs is used to automatically learn features that are suitable for speech quality prediction. These features are calculated on a frame basis and therefore result in a sequence of framewise features. In this work, feedforward neural networks and CNNs are used for framewise modelling. In a third stage, the time dependencies of the feature sequence are modelled. Two different approaches for the time-dependency modelling of speech quality features are analysed: LSTM networks and self-attention networks.

Finally, the features are aggregated over time in a pooling layer. The aggregated features are then used to estimate a single MOS value. For the last stage, classic average- and max-pooling layers, as well as an attention-pooling layer, were analysed.

Fig. 3.2 Overview of the general deep learning based speech quality model architecture
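The four-stage structure can be summarised in a short PyTorch-style sketch. The module interfaces and names below are assumptions for illustration and do not reproduce the author's implementation; the framewise, time-dependency, and pooling stages are plugged in as interchangeable sub-modules, mirroring the ablation study in Sect. 3.7. Padding and masking of variable-length sequences are omitted here for brevity.

```python
# Minimal sketch of the four-stage speech quality model (hypothetical interfaces).
import torch.nn as nn

class SpeechQualityModel(nn.Module):
    def __init__(self, framewise, time_dependency, pooling):
        super().__init__()
        self.framewise = framewise              # e.g. the CNN of Sect. 3.4.1
        self.time_dependency = time_dependency  # e.g. self-attention (Sect. 3.5.2)
        self.pooling = pooling                  # e.g. attention-pooling (Sect. 3.6.3)

    def forward(self, mel_segments):
        # mel_segments: (batch, seq_len, 1, 48, 15) Mel-spec segments
        b, l = mel_segments.shape[:2]
        x = self.framewise(mel_segments.flatten(0, 1))  # (batch*seq_len, d)
        x = x.view(b, l, -1)                            # framewise feature sequence
        x = self.time_dependency(x)                     # updated feature sequence
        return self.pooling(x)                          # one MOS value per signal
```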

3.3 Mel-Spec Segmentation

The inputs to the speech quality models are Mel-spec segments with 48 Mel-bands. The FFT window length is 20 ms with a hop size of 10 ms. The maximum frequency in the Mel-specs was chosen to be 20 kHz to be able to predict speech quality for up to fullband. These Mel-spectrograms are divided into segments with a width of 15 (i.e. 150 ms) and a height of 48 (corresponding to the 48 Mel-bands). The hop size between the segments is 4 (40 ms), which leads to a segment overlap of 73% and overall 250 segments for a 10-second speech signal. The idea is that the framewise network will automatically learn to focus on the centre of the Mel-spec segment. Instead of only giving one single Mel-spec bin as input, the framewise network is provided with a wider segment of 150 ms to give the network some contextual awareness. The short-term and long-term temporal modelling, however, follows in the third model stage. This means that every 40 ms the framewise network receives a Mel-spec of 150 ms width.
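A minimal sketch of this segmentation step is given below, assuming librosa for the Mel-spectrogram computation. Apart from the parameters stated above (20 ms window, 10 ms hop, 48 Mel-bands, 20 kHz maximum frequency, 15-frame segments with a hop of 4), all details are assumptions, and signals are assumed to be longer than one 150 ms segment.

```python
# Sketch of Mel-spec segmentation (illustrative, not the author's code).
import librosa
import numpy as np

def mel_spec_segments(wav_path, sr=48000, n_mels=48, seg_width=15, seg_hop=4):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=int(sr * 0.02), hop_length=int(sr * 0.01),
        n_mels=n_mels, fmax=20000)
    mel_db = librosa.power_to_db(mel)                      # (48, n_frames)
    # slide a 15-frame (150 ms) window with a hop of 4 frames (40 ms)
    n_frames = mel_db.shape[1]
    starts = range(0, n_frames - seg_width + 1, seg_hop)
    segs = np.stack([mel_db[:, s:s + seg_width] for s in starts])
    return segs                                            # (n_segments, 48, 15)
```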

3.4 Framewise Model

This section describes the two neural network architectures that are used as a framewise model. The first model is a CNN with six convolutional layers. As a baseline model, a feedforward neural network is presented.

3.4.1 CNN

CNNs are most commonly used in the field of image classification and have the ability to automatically learn a suitable set of features for a given regression or classification task. This feature set learning is done in the so-called convolutional layers that give the neural network its name (see Sect. 2.6.2). Recently, they have increasingly been used for recognition or detection tasks in the context of music and speech (Amiriparian et al. 2017; Pons and Serra 2017; Schlüter and Böck 2014; Tóth 2015; Zhang et al. 2015). Furthermore, Mittag and Möller (2018) showed that CNNs can successfully be used to detect lost packets from speech spectrograms.


Within the proposed CNN, three max-pooling layers are used to downsample the feature map in time and frequency. The downsampling procedure helps to speed up computation and also avoids overfitting of the model by reducing the number of parameters in the network. To further reduce overfitting, dropout layers are applied. To make the training more stable and faster, batch normalisation layers are integrated. ReLU layers are implemented to allow the model to learn non-linear relationships. The individual blocks and output sizes of the proposed CNN are shown in Table 3.3 and visualised in Fig. 3.3. While an RGB image has three channels (one for each colour), the Mel-spec input has only one channel, representing the spectrogram's amplitude. The convolutional layers have a kernel size of 3 × 3. This kernel slides over the Mel-spec to calculate features for each 3 × 3 frame of the 48 × 15 Mel-spec, where the borders are padded with zeros in order to return the same overall size. During training, the kernel weights are learned automatically by the neural network through backpropagation. The first convolutional layer contains 16 kernels; therefore, the height and width of the input remain the same, but the channel size increases to 16.

Table 3.3 CNN design. The kernel size is 3 × 3

Layer                       Channel   Height   Width
Input                       1         48       15
Conv 1                      16        48       15
Batch normalisation
ReLU
Adaptive max-pool           16        24       7
Conv 2                      16        24       7
Batch normalisation
ReLU
Adaptive max-pool           16        12       5
Dropout (20%)
Batch normalisation
ReLU
Conv 3                      32        12       5
Batch normalisation
ReLU
Conv 4                      32        12       5
Batch normalisation
ReLU
Adaptive max-pool           32        6        3
Dropout (20%)
Conv 5                      64        6        3
Batch normalisation
ReLU
Dropout (20%)
Conv 6 (no width padding)   64        6        1
Batch normalisation
ReLU
Output (flatten)            384

Fig. 3.3 CNN architecture for a single Mel-spec segment as input

After a batch normalisation and ReLU layer, the feature matrix is downsampled by approximately a factor of 2, from 48 × 15 to 24 × 7. In the next blocks, the size of the feature matrix is further reduced, while the channel size, and therefore the number of resulting features, is increased. Consequently, kernels at the start of the network concentrate on low-level features, as each kernel only sees a small frame of the total Mel-spec. In contrast, in the later blocks with more complexity (i.e. more kernels), the CNN computes more high-level features, as the kernels see more context. In the last convolutional layer, no zero padding is applied in order to downsample the feature size to a single value in the time domain with 6 bins in the frequency domain. The CNN output of one 150 ms Mel-spec segment is thus 64 features, which are calculated for 6 different frequency ranges. Finally, this 64 × 6 × 1 feature matrix is flattened to give a feature vector of size 384. The output of a 10-second speech signal would then result in a sequence of length 250 (corresponding to 10 s) with 384 features each (a 250 × 384 matrix).
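The following PyTorch sketch follows the channel counts and adaptive max-pool target sizes of Table 3.3. It is a minimal re-implementation for illustration only; details such as the exact placement of the dropout and batch normalisation layers may differ from the original model.

```python
# Sketch of the framewise CNN (illustrative, assumptions noted above).
import torch
import torch.nn as nn

def conv_block(c_in, c_out, pad=(1, 1)):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=pad),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
    )

class FramewiseCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(1, 16), nn.AdaptiveMaxPool2d((24, 7)),
            conv_block(16, 16), nn.AdaptiveMaxPool2d((12, 5)),
            nn.Dropout(0.2),
            conv_block(16, 32),
            conv_block(32, 32), nn.AdaptiveMaxPool2d((6, 3)),
            nn.Dropout(0.2),
            conv_block(32, 64),
            nn.Dropout(0.2),
            conv_block(64, 64, pad=(1, 0)),   # no width padding: width 3 -> 1
            nn.Flatten(),                     # 64 x 6 x 1 -> 384 features
        )

    def forward(self, x):                     # x: (batch, 1, 48, 15)
        return self.net(x)

feats = FramewiseCNN()(torch.randn(8, 1, 48, 15))   # -> (8, 384)
```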

3.4.2 Feedforward Network

As a baseline model, a basic deep feedforward network with a depth of four layers is implemented. The Mel-spec matrices are first flattened from a size of 48 × 15 to a vector of size 720. Four fully connected layers with 2048 hidden units each then follow.

ReLU layers are used as activation functions, and dropout layers are added before the last three fully connected layers to avoid overfitting. The output of a 10-second speech signal results in a sequence of length 250 with 2048 features each (a 250 × 2048 matrix); the network design is shown in Table 3.4.

Table 3.4 Deep feedforward neural network design

Layer                 Channel   Height   Width
Input                 1         48       15
Batch normalisation
Flatten               720
Fully connected 1     2048
Batch normalisation
ReLU
Dropout (20%)
Fully connected 2     2048
Batch normalisation
ReLU
Dropout (20%)
Fully connected 3     2048
Batch normalisation
ReLU
Dropout (20%)
Fully connected 4     2048
Batch normalisation
ReLU
Output                2048
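A corresponding sketch of the baseline feedforward network of Table 3.4 is given below, again as a minimal illustration rather than the original code; the batch-norm placement is simplified.

```python
# Sketch of the framewise feedforward baseline (illustrative only).
import torch
import torch.nn as nn

def fc_block(d_in, d_out, dropout=0.2):
    return nn.Sequential(nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                         nn.ReLU(), nn.Dropout(dropout))

framewise_ffn = nn.Sequential(
    nn.Flatten(),                    # 1 x 48 x 15 Mel-spec segment -> 720
    fc_block(720, 2048),             # dropout placed before the last three layers
    fc_block(2048, 2048),
    fc_block(2048, 2048),
    nn.Linear(2048, 2048), nn.BatchNorm1d(2048), nn.ReLU(),
)

out = framewise_ffn(torch.randn(8, 1, 48, 15))   # -> (8, 2048)
```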

3.5 Time-Dependency Modelling

In the next stage, the time dependencies between the sequence of framewise features are modelled. So far, the model has only seen a small 150 ms segment of the speech signal. In this time-dependency stage, the individual feature vectors of the sequence can interact with each other to improve the prediction performance. For example, while an overall quiet or slightly too loud signal may not be perceived as a strong quality degradation, variation of loudness can be perceived as annoying by a speech communication service user. During the time-dependency modelling, the neural network can compare the loudness of the individual segments and update the features accordingly. Consequently, the output of this stage is another sequence of features. Three different approaches are analysed: LSTM, Transformer, and self-attention neural networks.


3.5.1 LSTM

Recurrent neural networks are a class of neural networks that allow previous outputs to be used as current inputs. They can use their internal state memory to remember historical sequence information and therefore make decisions considering previous inputs (see Sect. 2.6.2). One advantage of RNNs is that they allow time sequences with different lengths as input. LSTM networks are a type of RNN that are able to remember their inputs over a longer period of time and thus are able to model long-term dependency problems. These kinds of networks are often used to forecast time series. A more specific variation is the BiLSTM layer, which learns dependencies between time steps in both directions. In this work, a single BiLSTM layer with 128 hidden units in each direction is used. As a consequence, for a 10-second signal (250 × 384 input matrix), the output of the LSTM layer is 250 × 256. However, the model in this work can be applied to variable-length signals, and thus, the output sequence length depends on the input signal duration. A top-level diagram of the LSTM network for a sequence length of four time steps is presented in Fig. 3.4. It shows the framewise features xi that are used as input for the LSTM+ network in forward direction and the LSTM− network in backward direction. At each time step, the LSTM network receives as input the framewise features of the current time step and the LSTM output of the previous time step.

Consequently, at the end of the sequence, the network "saw" all of the individual time steps. The outputs of the forward and backward LSTM networks at each time step are concatenated to yield the updated output feature sequence (y1, y2, ...).

Fig. 3.4 LSTM time-dependency model
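A minimal PyTorch sketch of this time-dependency stage is shown below. The handling of zero-padded sequences (e.g. via packing) is omitted for brevity and is an implementation detail not taken from the book.

```python
# Sketch of the BiLSTM time-dependency model (illustrative only).
import torch
import torch.nn as nn

class LSTMTimeDependency(nn.Module):
    def __init__(self, d_in=384, d_hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(d_in, d_hidden, num_layers=1,
                             batch_first=True, bidirectional=True)

    def forward(self, x):             # x: (batch, seq_len, 384) framewise features
        y, _ = self.blstm(x)          # y: (batch, seq_len, 256) updated features
        return y

y = LSTMTimeDependency()(torch.randn(2, 250, 384))   # -> (2, 250, 256)
```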

3.5.2 Transformer/Self-Attention

Recently, Transformer networks have been shown to outperform LSTM networks in many tasks, especially in Natural Language Processing. The Transformer model consists of an encoder network that processes an input sequence and a decoder network that generates an output sequence. Because the output of a speech quality model is a single value and not a sequence, only the Transformer encoder is applied in this work. The encoder mainly consists of a multi-head self-attention layer with a subsequent feedforward layer. This Transformer block is repeated multiple times to create a deep network. The self-attention mechanism allows the input of a time step to attend to other time steps, rendering an interaction between different time steps possible (see also Sect. 2.6.2).

Fig. 3.5 Depiction of the self-attention mechanism. Left: Transformer block (framed). Middle: multi-head attention. Right: dot-product attention

The Transformer block as presented by Vaswani et al. (2017) is shown within the frame on the left-hand side of Fig. 3.5; the multi-head attention mechanism, which is included in the Transformer block, is depicted in the middle; and on the right-hand side, the dot-product attention is shown, which in turn is included within the multi-head attention mechanism.

Transformer Block
The input to the Transformer block is the sequence of framewise features (x1, x2, ...) produced by the framewise model (e.g. the CNN).1 The input sequence is processed by the multi-head self-attention sub-layer, followed by the feedforward sub-layer. Around each sub-layer, a residual connection (He et al. 2016) is employed, followed by layer normalisation. Residual or skip connections allow the gradients to flow directly through the network by adding the input and the output of a layer together. Thus, they help to train the model by avoiding vanishing or exploding gradients. As a result, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To allow for the addition of inputs and outputs, all sub-layers in the model must produce outputs of the same dimension dtf. As a consequence, at first, a fully connected (FC) network is applied to the input sequence (x1, x2, ...) (with dimension dcnn = 384 in the case of a CNN framewise model) to yield a sequence (x′1, x′2, ...) with vectors of length dtf at each time step. The feedforward sub-layer consists of two fully connected layers with a ReLU activation and dropout in between and contains dtf,ff hidden units. The Transformer block (framed on the left-hand side of Fig. 3.5) is stacked multiple times to yield a deep neural network. The output feature sequence is then (y1, y2, ...), where each time step vector is updated with relevant information from other time steps.

Multi-Head Self-Attention
The multi-head self-attention mechanism is the core part of the Transformer block. Based on every time step of the input sequence (x′1, x′2, ...), a query, a key, and a value are calculated through a matrix multiplication with a set of three learnable weight matrices, implemented as fully connected layers. The output after each FC layer yields a Q, K, and V matrix that contains the query, key, and value vector of length dtf at each time step. Instead of performing a single attention function with dtf-dimensional queries, keys, and values, the attention function is performed in parallel with multiple "heads". The multi-head mechanism allows the model to jointly attend to information from different representation subspaces at different positions. The outputs of each head are concatenated and multiplied with a weight matrix (implemented as an FC layer) to output a sequence of updated features with dimension dtf.

1 In order for the model to make use of the order of the sequence, Vaswani et al. (2017) add a positional encoding to the input sequence that injects information about the relative or absolute position of the time steps in the sequence. While intuitively this would be beneficial for the task of speech quality estimation as well, it did not improve the results and is therefore left out in this work.


Dot-Product Self-Attention
The attention mechanism can be described as mapping a query and a set of key–value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. In the case of a self-attention mechanism, the sequences of queries, keys, and values are produced by three different FC layers that all use the same input sequence x′. The resulting matrices Q, K, and V are the input for the attention mechanism, where the queries and keys are compared to each other through the dot product. To obtain the dot product for all time steps at once, a matrix multiplication (MatMul) between Q and K is calculated. Thus, the query of each time step is compared to the keys of all time steps in the sequence, yielding an L × L attention score matrix, where L is the sequence length. Because for large dimensions dtf the gradient can become extremely small, the score matrix is divided by √dtf (i.e. scaling). A softmax function is then applied to normalise the attention scores and to obtain the weights on the values. The resulting weights are finally applied to the value matrix V through another matrix multiplication. Thus, the output is computed as

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V.    (3.1)

As a result, each time step in the sequence is made up of a weighted average of all sequence steps. In this way, each time step can add information from other time steps in the sequence. Because in this work speech signals of variable length are used as inputs, the feature sequences need to be zero-padded to the length L of the longest sequence before passing them to the self-attention network. To prevent the network from attending to these zero-padded time steps, a mask is applied to the L × L attention score matrix. All values in the input of the softmax function that correspond to zero-padded time steps are masked out by setting them to −∞.

Self-Attention Time-Dependency Model
In this work, the Transformer is applied as a time-dependency model. Because the Transformer block only models temporal dependencies of framewise features that are already computed by another deep network, a relatively low complexity of the Transformer network is sufficient. Moreover, in practice, it was noted that the multi-head mechanism did not improve the results. Therefore, the block is implemented with a single head, a depth of 2 blocks, a model dimensionality of dtf = 64, and a feedforward network with dtf,ff = 64 hidden units. This model based on the Transformer block is denoted as the self-attention (SA) time-dependency model. As a baseline for comparison, the model with the original hyper-parameters by Vaswani et al. (2017), with 8 heads, a depth of 6 consecutive Transformer blocks, a model dimensionality of dtf = 512, and a feedforward network with dtf,ff = 2048 hidden units, is included as well. This model is denoted as Transformer (TF) (although it only contains the encoder part of the original Transformer model).
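The masked, scaled dot-product self-attention of Eq. (3.1) can be sketched as follows for a single head. This is an illustrative minimal version, not the author's implementation; the three linear layers correspond to the query, key, and value projections, and the mask marks zero-padded time steps.

```python
# Sketch of masked scaled dot-product self-attention (single head, illustrative).
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.d_model = d_model

    def forward(self, x, pad_mask):
        # x: (batch, L, d_model); pad_mask: (batch, L), True at zero-padded steps
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(1, 2) / self.d_model ** 0.5      # (batch, L, L)
        scores = scores.masked_fill(pad_mask.unsqueeze(1), float('-inf'))
        return torch.softmax(scores, dim=-1) @ v                  # (batch, L, d_model)

x = torch.randn(2, 250, 64)
mask = torch.zeros(2, 250, dtype=torch.bool)
mask[1, 200:] = True          # second signal is shorter; last 50 steps are padding
out = SelfAttention()(x, mask)
```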


3.6 Time Pooling

The quality of a telephone call can generally not be predicted accurately by simply taking the average quality across time. As was shown by Berger et al. (2008), the "recency effect" and the out-weighting of poor-quality segments in a call have to be considered adequately. For example, short interruptions have been proven to sound more annoying than steady background noise (Wältermann 2012). Also, low quality during a silent segment of the original speech signal degrades the overall speech quality less than during active speech segments, where the impairment may even disturb the intelligibility of the transmitted speech. For this reason, four different approaches to aggregate the neural network features over time are analysed: average-pooling, max-pooling, last-step-pooling, and attention-pooling.

3.6.1 Average-/Max-Pooling

The simplest form of pooling is taking the average or maximum value across all input values. Because the speech signals in this work are of variable size, the sequences need to be padded for certain operations. Therefore, for average-pooling, the output sequence is zero-padded. Then, for each feature, the sum across the padded sequence length L is calculated. After that, the sum is divided by the original sequence length M as follows:

z(k) = \frac{1}{M} \sum_{i=1}^{L} y_i(k),    (3.2)

where k ∈ (1, 2, ..., d) is the index of the feature vector y_i(k) of length d, and (y1, y2, ..., yL) is the sequence of updated feature vectors after time-dependency modelling. Because the output values of the neural network can be negative, for max-pooling, the sequence is padded with negative infinity values. After that, the maximum of each feature is taken across the sequence length. Finally, a fully connected layer is applied to estimate the overall MOS value from the pooled feature vector z.
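A short sketch of both pooling variants over a zero-padded sequence is shown below, following Eq. (3.2) for the average (sum over the padded length L, division by the original length M) and negative-infinity padding for the maximum. It is an illustration, not the original code.

```python
# Sketch of average- and max-pooling over zero-padded feature sequences.
import torch

def avg_pool(y, lengths):
    # y: (batch, L, d) zero-padded sequence; lengths: (batch,) original lengths M
    return y.sum(dim=1) / lengths.unsqueeze(1)

def max_pool(y, pad_mask):
    # pad_mask: (batch, L), True at padded steps; pad with -inf before the max
    y = y.masked_fill(pad_mask.unsqueeze(-1), float('-inf'))
    return y.max(dim=1).values

y = torch.randn(2, 250, 256)
y[1, 200:] = 0.0                                   # second item padded after 200 steps
z = avg_pool(y, torch.tensor([250.0, 200.0]))      # -> (2, 256)
```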

3.6.2 Last-Step-Pooling

RNNs, such as the LSTM network, process the input sequence in a recurrent fashion and memorise information from previous time steps.


Thus, the network has seen the entire feature sequence at the last time step, and instead of an average or maximum across all time steps, the final time step output can also be used as an estimate of the overall quality. In the case of an LSTM with multiple layers, only the outputs of the last layer are used. When a BiLSTM is applied, the first time step of the sequence of the backward LSTM network (i.e. the final output of the network in backward direction) needs to be concatenated with the last time step of the network in forward direction as follows:

z = \mathrm{concat}(y_M^{+}, y_1^{-}),    (3.3)

where (y_1^+, y_2^+, ..., y_M^+) are the outputs of the last layer of the LSTM network in forward direction, (y_1^-, y_2^-, ..., y_M^-) are the outputs of the last layer of the LSTM network in backward direction, and M denotes the original sequence length. Again, the output vector z is passed through a final fully connected layer with output size 1 to estimate the overall MOS.
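For a BiLSTM output, last-step-pooling according to Eq. (3.3) can be sketched as follows (the forward half of the feature vector is indexed at the last valid step, the backward half at the first step); the indexing scheme is an illustrative assumption.

```python
# Sketch of last-step-pooling for a BiLSTM output sequence.
import torch

def last_step_pool(y, lengths, d_hidden=128):
    # y: (batch, L, 2*d_hidden) BiLSTM output; lengths: (batch,) original lengths M
    idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, d_hidden)
    y_fwd_last = y[..., :d_hidden].gather(1, idx).squeeze(1)   # y_M^+ per item
    y_bwd_first = y[:, 0, d_hidden:]                           # y_1^-
    return torch.cat([y_fwd_last, y_bwd_first], dim=-1)        # (batch, 2*d_hidden)

z = last_step_pool(torch.randn(2, 250, 256), torch.tensor([250, 180]))
```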

3.6.3 Attention-Pooling

In order to apply an attention mechanism for the purpose of time pooling, an attention-pooling (AP) mechanism is proposed in this work. Typically, attention mechanisms are applied by comparing two sequences with each other, and then for each time step, a weight vector is computed. Consequently, the output will be another sequence of vectors. However, in the case of pooling, the aim is to obtain only a single feature vector. In theory, every time step could be compared to a single vector, such as the mean or maximum over all time steps. This would then lead to a vector of weights that could be applied to the individual time steps to produce a pooled feature vector. However, the comparison to an average or maximum did not prove to be meaningful. Therefore, instead of comparing the time steps to another vector, it is proposed to use a feedforward network to predict an attention score for each time step directly. An overview of the attention-pooling block is shown in Fig. 3.6, where y is the output dtd × L matrix of the time-dependency model. The matrix contains a zero-padded sequence of length L with feature vectors of dimension dtd. The feedforward network with an output size of 1 is applied to each time step separately and identically. The feedforward network parameters that are used for computing the attention score are learned automatically during the model training. As a result, the network produces exactly one score for each time step. The feedforward network is shown in Table 3.5. It consists of two fully connected layers with ReLU activation and dropout between them. The number of hidden units dap is set to 128, where the input dimension dtd depends on the time-dependency model (e.g. dtd = dsa = 64). The attention scores computed by the feedforward network are then masked at the zero-padded time steps and applied to a softmax function to yield the normalised attention weights.

These weights are applied to the input matrix y with a matrix multiplication operation. The weighted average feature vector z is then finally passed through a fully connected layer to estimate the overall speech quality.

Fig. 3.6 Attention-pooling block

Table 3.5 Feedforward network of attention-pooling mechanism

Layer               Features   Length
Input               dtd        L
Fully connected 1   dap        L
ReLU
Dropout (10%)
Fully connected 2   1          L
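The attention-pooling block of Fig. 3.6 and Table 3.5 can be sketched as follows. The dimensions follow the text (dap = 128, dropout of 10%), while the remaining details are illustrative assumptions rather than the author's implementation.

```python
# Sketch of the attention-pooling block (illustrative only).
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, d_td=64, d_ap=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d_td, d_ap), nn.ReLU(),
                                   nn.Dropout(0.1), nn.Linear(d_ap, 1))
        self.to_mos = nn.Linear(d_td, 1)

    def forward(self, y, pad_mask):
        # y: (batch, L, d_td); pad_mask: (batch, L), True at zero-padded steps
        scores = self.score(y).squeeze(-1)                  # one score per time step
        scores = scores.masked_fill(pad_mask, float('-inf'))
        w = torch.softmax(scores, dim=-1).unsqueeze(1)      # (batch, 1, L) weights
        z = (w @ y).squeeze(1)                              # weighted average vector
        return self.to_mos(z).squeeze(-1)                   # MOS estimate

mos = AttentionPool()(torch.randn(2, 250, 64),
                      torch.zeros(2, 250, dtype=torch.bool))
```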

3.7 Experiments and Results

In this section, it will be shown that a combination of a CNN as a framewise model, a self-attention network as time-dependency model, and an attention-pooling network as pooling model (CNN-SA-AP) gives the best results on the considered dataset. To this end, an ablation study is performed, in which one of the three neural network model stages is either removed or replaced with another network type.

3.7.1 Training and Evaluation Metric

The presented models are trained end-to-end with the Adam optimiser, a batch size of 60, and a learning rate of 0.001.


The training was run with an early stopping of 15 epochs, where the PCC (Pearson's correlation coefficient) and the RMSE (root-mean-square error) were checked on the validation set. When the performance on the validation set in terms of PCC or RMSE did not improve for more than 15 epochs, the training was stopped. For training and evaluation, the datasets from Sect. 3.1 are applied. For training, the simulated and live training sets NISQA_TRAIN_SIM and NISQA_TRAIN_LIVE were used. The model is evaluated on the validation datasets NISQA_VAL_SIM and NISQA_VAL_LIVE. Since the training and the validation set come from the same distribution, and thus, in theory, there is no subjective bias present between the datasets, no polynomial mapping was applied for the evaluation. Furthermore, because the MOS values are distributed relatively evenly between the minimum and maximum values in the datasets, only the PCC is used as an evaluation metric in this section. The results on the live dataset are of particular interest since they are more "realistic" and thus give a better indication of how accurately the model would perform in the real world. However, the live validation set is much smaller than the simulated dataset. Because of this, the PCC is calculated separately for the simulated and the live dataset. Afterwards, the average over both datasets is calculated and used for the model evaluation as follows:

r = \frac{r_{\mathrm{val\_sim}} + r_{\mathrm{val\_live}}}{2}.    (3.4)
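The metric of Eq. (3.4) can be computed, for example, with NumPy; the variable names below are placeholders and the arrays are dummy values for illustration.

```python
# Sketch of the evaluation metric: PCC per validation set, then averaged.
import numpy as np

def eval_metric(mos_sim, pred_sim, mos_live, pred_live):
    r_val_sim = np.corrcoef(mos_sim, pred_sim)[0, 1]
    r_val_live = np.corrcoef(mos_live, pred_live)[0, 1]
    return (r_val_sim + r_val_live) / 2

r = eval_metric(np.array([1.2, 3.4, 4.5]), np.array([1.5, 3.1, 4.2]),
                np.array([2.0, 3.0, 3.5]), np.array([2.2, 2.8, 3.4]))
```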

During one training run of the model, the performance on the validation set is calculated after each loop through the training set. When the validation performance is calculated, the model weights are stored, and the next epoch is started. After the model training is finished, the model of the epoch with the best performance on the validation set is saved for this particular run. Because the weights of the model are initialised randomly before each training run, the results for the same configuration can strongly vary. Furthermore, dropout layers, which are commonly implemented in neural networks as a regularisation method, zero out a fixed percentage of weights randomly during training. Also, the training samples that are presented to the model in each mini-batch are randomly sorted. Therefore, to be able to compare different network configurations, the training is run 12 times in this section and then presented as box-plots. It should further be noted that the hyper-parameters of each network architecture have been empirically tuned to give the best results for the overall model.

3.7.2 Framewise Model

In this subsection, the model training is run for different framewise models that compute features for each individual Mel-spec segment.

On the left-hand side of Fig. 3.7, the results of the speech quality model with a self-attention time-dependency model and an attention-pooling model, but three different framewise models, are presented. The CNN-SA-AP on the very left side of the box-plot uses a CNN for framewise modelling, the FFN-SA-AP model uses a feedforward network for framewise modelling, and the Skip-SA-AP model applies the Mel-spec segments directly to the self-attention network without framewise modelling. It can be seen that the CNN model with an average PCC of around 0.87 clearly outperforms the other two models, where the model without framewise modelling only obtains a PCC of around 0.77. The feedforward network helps to improve the performance but cannot achieve results similar to the CNN. It could be argued that the relatively low-complexity SA model does not contain enough parameters to model the framewise features and the time dependencies at the same time. For this reason, in Fig. 3.7b, the CNN model is compared to larger time-dependency models. TF-AP represents the model with a Transformer network and attention-pooling, while the LSTM-AP model uses a larger BiLSTM network with a depth of 3 layers that contain 1024 hidden units in each direction. The Mel-spec segment width for these two models was reduced from 15 to 5 bins (i.e. 50 ms), as no framewise model downsamples the segment in time before passing it to the TF/LSTM model. Interestingly, it can be noted that the Transformer model obtains approximately the same result as the self-attention model with preceding feedforward network (FFN-SA-AP). The deep LSTM model performs better than the low-complexity SA model but is still outperformed by the Transformer model. Overall, it can be concluded that CNNs outperform Transformer/self-attention and LSTM networks as framewise models by a relatively large margin.

Fig. 3.7 Results of different framewise models on the validation dataset over 12 training runs. (a) Comparing framewise models. (b) Using larger TD models


3.7.3 Time-Dependency Model

In this subsection, different time-dependency models are compared. Figure 3.8a shows the results for the model with CNN framewise modelling and attention-pooling and three different time-dependency models. The model on the very left-hand side is again the CNN-SA-AP model with self-attention for time-dependency modelling, the CNN-LSTM-AP model uses a BiLSTM network with one layer and 128 hidden units, and the CNN-Skip-AP model skips the time-dependency stage and directly pools the CNN outputs. It can be seen that the difference between the different time-dependency models is not as large as for the framewise models. That being said, it can be noted that the inclusion of a time-dependency model improves the overall performance, where the self-attention network outperforms the LSTM network. The PCC of the self-attention network is on average around 0.15 higher than the PCC of the model without time dependency. The different time-dependency models can also be combined by applying an LSTM network followed by a self-attention network or vice versa. The results of these combinations are shown in Fig. 3.8b, where it can be seen that both combinations perform on average worse than using only an LSTM or SA network. Overall, it can be concluded that it is beneficial for a speech quality model to include a time-dependency stage. The difference between using a self-attention or LSTM network for this purpose is only small, but the SA network gave the best results on average.

Fig. 3.8 Results of different time-dependency models on the validation dataset over 12 training runs. (a) Comparing time-dependency models. (b) Using a mix of two different TD models


3.7.4 Pooling Model

In this subsection, different pooling models are compared, where a CNN is applied for framewise modelling. The different pooling mechanisms are average-pooling (Avg), max-pooling (Max), and attention-pooling. In the case of the recurrent LSTM network, the output of the last time step can also be used for pooling (Last). Figure 3.9b in the middle compares the proposed CNN-SA-AP model to CNN-SA models with max- or average-pooling. Although the performance difference is relatively small, it can be seen that attention-pooling outperforms the other two methods, where average-pooling and max-pooling achieve approximately the same results. In the case of using an LSTM for time-dependency modelling in Fig. 3.9c, attention-pooling again outperforms the other mechanisms, where average-pooling achieves similar results. Last-step-pooling and max-pooling are outperformed by the other methods. In contrast to that, in Fig. 3.9a, where the output of the CNN is directly pooled without a TD model, max-pooling outperforms average-pooling. Overall, attention-pooling again achieves the best results. Interestingly, in the case that no TD model is applied, max-pooling gives better results than average-pooling, which is in line with the quality perception of speech communication users, who tend to give more weight to segments in the speech file with poor quality. However, this out-weighting of poor-quality segments seems to be modelled by the time-dependency stage, and therefore average-pooling achieves similar or better results than max-pooling if a TD stage is included. Overall, it can be concluded that the choice of the pooling stage only has a small influence on the performance if self-attention is applied and that attention-pooling achieves on average the best result for all tested cases.

Fig. 3.9 Results of different pooling models on the validation dataset over 12 training runs. (a) Without TD model. (b) Using SA as TD model. (c) Using LSTM as TD model


3.8 Summary

In this chapter, different neural network architectures were presented and compared for the task of single-ended overall speech quality prediction. At first, a large dataset with 12,500 simulated conditions, each containing a different sentence, from a total of 3260 different speakers was presented. Additionally, a dataset with live conditions and overall 1220 files was created. The datasets were divided into a training and a validation set and rated regarding their quality via crowdsourcing, where over 68,000 ratings were collected. Then a general overview of the deep learning speech quality models was given. They were divided into four stages: Mel-spec segmentation, framewise model, time-dependency model, and pooling model. The individual neural networks for each stage were described in Sects. 3.4–3.6. In Sect. 3.7, the models were trained and evaluated on the presented datasets. Through an ablation study, in which one of the three neural network model stages was either removed or replaced with another network type, it was shown that the CNN-SA-AP combination achieves the best results. It was found that the framewise CNN model has the largest influence on the correlation between the predicted and the subjective MOS. The addition of a time-dependency model further improves the performance, where the self-attention network PCC is around 0.1 higher than the resulting PCC of an LSTM network. The influence of the pooling stage is rather small when a self-attention network is applied for time-dependency modelling. However, it could be shown that the attention-pooling mechanism outperforms average-, max-, and last-step-pooling.

Chapter 4

Double-Ended Speech Quality Prediction Using Siamese Networks

In this chapter, a double-ended speech quality prediction model with a deep learning approach is presented. The architecture is based on the single-ended model of Chap. 3 but calculates a feature representation of the reference and the degraded signal through a Siamese CNN-TD network that shares the weights between both signals. The resulting features are then used to align the signals with an attention mechanism and are combined to estimate the overall speech quality. The presented network architecture represents a solution for the time-alignment problem that occurs for speech signals transmitted through VoIP networks and shows how the clean reference signal can be incorporated into speech quality models that are based on end-to-end trained neural networks.

4.1 Introduction

According to Jekosch (2005), when humans judge the quality of a speech sample, they compare the perceived characteristics of the speech signal with their own desired characteristics of a speech signal (i.e. individual expectations, relevant demands, or social requirements). In the context of instrumental speech quality prediction models, this internal reference is often equated with the clean reference speech signal. The clean reference represents the highest possible quality within a subjective listening experiment as it is not affected by any distortions. Because of this, most signal-based speech quality models are based on a comparison between the reference signal and the considered degraded speech signal. For example, the current state-of-the-art model for speech quality estimation, POLQA (ITU-T Rec. P.863 2018), calculates distance features from a time–frequency representation of both signals and then maps the resulting features to the overall quality. Before the use of deep learning models for speech quality prediction, double-ended models were generally outperforming single-ended models by a large margin.


The current ITU-T Recommendation P.563 for single-ended speech quality prediction, for example, is known for being unreliable for packet loss or live-talking conditions (Hines et al. 2015a). Recent deep learning based models (see Sect. 2.6.3), however, only use the degraded signal as input. In this chapter, a method for incorporating the clean reference into deep learning based models is presented. The presented model first calculates a feature representation of the speech signals through a CNN-TD network. Both the clean reference and the degraded signal are sent through this network in parallel. The network shares the weights and thus optimises them for both inputs. This type of architecture is also called a Siamese neural network and allows for calculating features that are comparable for different input vectors. Siamese networks have, for example, proven to be useful for image quality prediction (Bosse et al. 2018) and for the alignment of speaker verification tasks (Zhang et al. 2019). One of the main challenges in double-ended speech quality prediction is the alignment of the signals. When a speech signal is sent through a communication network, it is exposed to a channel latency. In VoIP networks, the jitter buffer tries to compensate for delayed packets in the network by stretching and shrinking the speech signal in time. The jitter buffer can also drop packets when the delay reaches a certain threshold to reduce the delay. This introduces a variable delay across the speech signal that is difficult to determine. POLQA, for example, uses a complex time-alignment algorithm (Beerends et al. 2013a) that consists of five major blocks: filtering, pre-alignment, coarse alignment, fine alignment, and section combination. The model presented in this chapter uses an attention mechanism on the calculated features of the Siamese network to align the reference to the degraded signal. The features of both signals are then combined in a fusion layer and sent through another time-dependency network. Finally, the features are aggregated in a pooling layer to estimate the overall quality. In the next section, the proposed double-ended model and its individual stages are presented. In the following Sect. 4.3, the model is trained, and the results are discussed. Finally, the presented method and the results are summarised in the last section of this chapter.

4.2 Method

The architecture of the proposed Siamese neural network model is shown in Fig. 4.1. The blue line represents the feature flow of the reference signal, while the orange line represents the degraded speech signal. Both signals are firstly transformed into the time–frequency domain and then processed by the same CNN and TD network. The feature vector output after each neural network block is depicted as 2D bars, of which each represents one time step of the four input Mel-spec segments. These outputs are then used as input for the next neural network block. After the first TD block, the reference signal is aligned to the degraded signal and subsequently fused. After the fusion layer, both signals' features are stored within one feature vector, which is illustrated as a blue/orange dashed arrow. Also, the size of the 2D blocks that represent the output feature vectors is increased due to the fusion. Finally, a second TD and a pooling block are applied. In the next subsections, the individual blocks of the double-ended model are described in more detail.

Fig. 4.1 Double-ended speech quality prediction model with Siamese CNN-TD network. Blue line: feature flow of reference signal. Orange line: feature flow of degraded signal. The 2D blocks represent the output features of each neural network block

4.2.1 Siamese Neural Network

A Siamese neural network is a neural network architecture that contains two or more identical subnetworks. In the case of the presented double-ended model, these two identical models are the CNN and the first TD network, whose network weights are therefore optimised on the basis of both inputs. Because the CNN-TD network calculates the same features for the reference and the degraded signal and is trained to estimate the overall quality, it should, in theory, result in features that are optimised to find perceptual differences between both signals. The input Mel-spectrograms and CNN network are the same as for the single-ended model of Chap. 3. The time-dependency block is placed just before the time-alignment block to allow the model to put more emphasis on certain time steps that could be more relevant for the subsequent alignment.
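To make the weight sharing concrete, the following is a minimal PyTorch sketch of a Siamese CNN-TD encoder. The layer sizes, module names, and the use of a standard transformer encoder layer are illustrative assumptions and not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class SiameseCnnTd(nn.Module):
    """Minimal sketch of a Siamese CNN + time-dependency (TD) stage.

    The same weights process reference and degraded Mel-spec segments,
    so the resulting per-time-step features are directly comparable.
    Layer sizes are illustrative, not the book's exact configuration.
    """

    def __init__(self, n_mels=48, seg_width=15, feat_dim=64):
        super().__init__()
        # Framewise CNN applied to each Mel-spec segment (1 x n_mels x seg_width)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.AdaptiveMaxPool2d((24, 7)),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveMaxPool2d((6, 3)),
            nn.Flatten(),
            nn.Linear(32 * 6 * 3, feat_dim),
        )
        # First time-dependency block (self-attention encoder layers)
        self.td = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                       dim_feedforward=128, batch_first=True),
            num_layers=2,
        )

    def encode(self, segments):
        # segments: (batch, n_segments, 1, n_mels, seg_width)
        b, n, c, h, w = segments.shape
        feats = self.cnn(segments.view(b * n, c, h, w)).view(b, n, -1)
        return self.td(feats)          # (batch, n_segments, feat_dim)

    def forward(self, deg_segments, ref_segments):
        # Both branches share the same weights (Siamese property).
        return self.encode(deg_segments), self.encode(ref_segments)

# Usage: two inputs, one set of weights
model = SiameseCnnTd()
deg = torch.randn(2, 10, 1, 48, 15)   # degraded Mel-spec segments
ref = torch.randn(2, 12, 1, 48, 15)   # reference may have a different length
u, v = model(deg, ref)
print(u.shape, v.shape)               # torch.Size([2, 10, 64]) torch.Size([2, 12, 64])
```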

4.2.2 Reference Alignment

The alignment is based on the notion that reference features of a certain time step will be similar to the corresponding time step features of the degraded signal since both have been processed by the same Siamese neural network. Because the aim is to predict the perceived quality of the degraded signal, the degraded features are used as a basis for the prediction. For each degraded time step, the corresponding time step of the reference signal is then found. In this way, the model can compare the degraded signal to the original undistorted reference for each individual time step. To determine the similarity between two sequences of feature vectors, many different score functions have been proposed. In this work, the four following commonly used alignment scores are analysed for their suitability to align degraded and reference feature vectors of a Siamese neural network.

Dot: The dot product is a basic alignment score function that is often used with Siamese networks or for alignment in NLP tasks (Luong et al. 2015). It is calculated between the feature vectors of the individual time steps as follows:

s = u · v.    (4.1)

The resulting alignment score can be seen as a correlation measure between features of the degraded and the reference signal.

Cosine: Another function that is often used is the cosine similarity (Graves et al. 2014), which can be seen as a variation of the dot product that neglects the magnitudes of the vectors. For normalised vectors, the cosine similarity delivers the same results as the dot product. It is calculated as follows:

s = (u · v) / max(‖u‖ · ‖v‖, ε),    (4.2)

where ε is set to 1e-08 to avoid division by zero.

Luong: This alignment score function is similar to the dot product; however, before calculating the dot product between the two feature vectors, the reference feature vector is multiplied with a trainable weight matrix according to Luong et al. (2015) as follows:

s = u · Wv,    (4.3)

where W is a trainable weight matrix, u the degraded feature vector, and v the reference feature vector.

Bahdanau: This alignment score function according to Bahdanau et al. (2015) is sometimes also referred to as additive attention score and is calculated as follows:

s = tanh(W_u u + W_v v),    (4.4)

where W_u and W_v are two different trainable weight matrices. After the addition of both vectors, a tanh activation function is applied, thus limiting the output value range to [−1, 1].

The alignment scores s for each possible combination of time steps are then calculated, resulting in an N × M matrix, where N is the number of time steps of the degraded signal and M the number of time steps of the reference signal. Usually, attention scores are normalised across the sequence length and then used to calculate a weighted average across all feature vectors, which is also referred to as soft-attention. However, for the alignment of both signals, we want to find exactly one matching reference time step for each degraded time step. Because of this, a hard-attention approach is applied, where the network only focuses on the reference time step with the highest similarity. To this end, for each time step of the degraded signal, the index in the reference signal with the highest alignment score s is found, and the feature values at this index are saved as the new aligned reference feature vector.
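The cosine-based hard attention described above can be sketched as follows; the function names and tensor shapes are assumptions chosen for illustration, not the implementation used in this work. Note that the argmax selection is not differentiable, so gradients only flow through the gathered reference features.

```python
import torch

def cosine_alignment_scores(u, v, eps=1e-8):
    """Cosine similarity between every degraded and reference time step.

    u: degraded features  (batch, N, d)
    v: reference features (batch, M, d)
    returns scores of shape (batch, N, M), cf. Eq. (4.2).
    """
    u_n = u / u.norm(dim=-1, keepdim=True).clamp(min=eps)
    v_n = v / v.norm(dim=-1, keepdim=True).clamp(min=eps)
    return torch.bmm(u_n, v_n.transpose(1, 2))

def hard_align_reference(u, v):
    """Hard attention: for each degraded time step, pick the reference
    time step with the highest alignment score and gather its features."""
    scores = cosine_alignment_scores(u, v)          # (batch, N, M)
    best = scores.argmax(dim=-1)                    # (batch, N)
    idx = best.unsqueeze(-1).expand(-1, -1, v.size(-1))
    return torch.gather(v, 1, idx)                  # aligned reference (batch, N, d)

# Example with random features
u = torch.randn(2, 10, 64)   # degraded
v = torch.randn(2, 12, 64)   # reference
v_aligned = hard_align_reference(u, v)
print(v_aligned.shape)       # torch.Size([2, 10, 64])
```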


4.2.3 Feature Fusion

As shown in Fig. 4.1, the Siamese CNN-TD output features of the degraded signal are then combined with the aligned output features of the reference signal. Bosse et al. (2018) showed that, although the network should be able to find this automatically, it can be helpful for the regression task to explicitly include the difference between the features. Besides a simple subtraction-based distance measure, other commonly used fusion approaches, inspired by Yu et al. (2020), are also analysed in this work. The length of the feature vectors u and v is denoted as L.

uv: Simple element-wise multiplication of both feature vectors, which preserves the dimensions of the degraded input feature vector. It is calculated as follows:

f = u ◦ v,    (4.5)

where f is the resulting fused output feature vector of length L.

u/v/u − v: Concatenation of the degraded and reference feature vectors with an additional distance feature vector, calculated as a subtraction, as follows:

f = concat(u, v, u − v),    (4.6)

where f is the resulting fused output feature vector of length 3L.

u + v/u − v: Concatenation of the addition and subtraction of the reference and degraded feature vectors as follows:

f = concat(u + v, u − v),    (4.7)

where f is the resulting fused output feature vector of length 2L.

u + v/u − v/uv: Concatenation of the addition, subtraction, and multiplication of the reference and degraded feature vectors as follows:

f = concat(u + v, u − v, u ◦ v),    (4.8)

where f is the resulting fused output feature vector of length 3L.

u² + v²/(u − v)²/uv: Concatenation of the addition of the squared feature vectors, the squared subtraction, and the multiplication of the reference and degraded feature vectors as follows:

f = concat(u² + v², (u − v)², u ◦ v),    (4.9)

where f is the resulting fused output feature vector of length 3L.
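As a rough illustration of the fusion variants above, the following sketch concatenates or multiplies the degraded and (aligned) reference features along the feature dimension; the mode strings are hypothetical labels chosen here for readability, not names from this work.

```python
import torch

def fuse_features(u, v, mode="u/v/u-v"):
    """Feature fusion of degraded (u) and aligned reference (v) features.

    u, v: tensors of shape (batch, N, L). This is a sketch of the fusion
    variants of Eqs. (4.5)-(4.9), not the book's exact code.
    """
    if mode == "uv":
        return u * v                                        # (batch, N, L)
    if mode == "u/v/u-v":
        return torch.cat((u, v, u - v), dim=-1)             # (batch, N, 3L)
    if mode == "u+v/u-v":
        return torch.cat((u + v, u - v), dim=-1)            # (batch, N, 2L)
    if mode == "u+v/u-v/uv":
        return torch.cat((u + v, u - v, u * v), dim=-1)     # (batch, N, 3L)
    if mode == "u2+v2/(u-v)2/uv":
        return torch.cat((u**2 + v**2, (u - v)**2, u * v), dim=-1)
    raise ValueError(f"unknown fusion mode: {mode}")

f = fuse_features(torch.randn(2, 10, 64), torch.randn(2, 10, 64))
print(f.shape)   # torch.Size([2, 10, 192])
```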


4.3 Results

The double-ended model was then trained and validated on the datasets presented in Sect. 3.1: NISQA_TRAIN_SIM, NISQA_TRAIN_LIVE, NISQA_VAL_SIM, and NISQA_VAL_LIVE. The training parameters were set to the same values as in Sect. 3.7 for the single-ended model, and also the same CNN model was applied (see Sect. 3.4.1). As performance indicator, again the average of the Pearson's correlation coefficients on the live and simulated validation datasets was used. In the first experiment, two different time-dependency models were compared to each other for their suitability towards double-ended speech quality prediction. In the second experiment, different alignment layers are compared. In the third experiment, different fusion approaches are analysed, and lastly, the best performing model is compared to the single-ended model to analyse the performance increase that can be achieved by including the reference signal.

4.3.1 LSTM vs Self-Attention

In this experiment, two different neural network architectures are used as time-dependency model: (1) an LSTM network and (2) a self-attention network. For both types of time-dependency modelling, a layer depth of two was chosen. The reference and degraded features were fused with the basic "u/v/u − v" fusion layer. The results after ten training runs are presented in Fig. 4.2. The first and the third boxplots, from left to right, "LSTM-SE" and "SA-SE", represent the single-ended models from Chap. 3 without time-alignment, fusion, and second time-dependency block. The boxplots of "LSTM-DE" and "SA-DE" represent the double-ended models without time-alignment blocks.

The input signals of the used datasets are already aligned. In the case of the simulated datasets, the only conditions that introduce delays are the codecs and filters. These known fixed delays were removed after processing the signals to yield aligned signals. In the case of the live datasets, the individual speech files were merged into one single file with more than one hour of speech. After recording, the file was aligned through a cross-correlation between both signals and then subsequently cut back into smaller speech samples of the original size. Therefore, omitting the alignment block should lead to perfectly aligned feature vectors. Whereas this is guaranteed for the simulated datasets, delays within the transmission of the live speech samples are possible. That being said, a spot check of several files from the live dataset did not reveal any alignment issues.

In the figure, it can be seen that the self-attention time-dependency block generally performs better than the LSTM network. The gap between the single-ended and double-ended model in both cases is around a PCC of 0.1. It can be concluded that LSTM and self-attention are both suitable for double-ended models; however, self-attention overall outperforms the LSTM network on the dataset under study. Therefore, for the following experiments, only self-attention is used as time-dependency model.

Fig. 4.2 Comparison of single-ended (SE) and double-ended (DE) models with LSTM and with self-attention (SA) as time-dependency block. The double-ended models did not contain an alignment block (the input speech signals were already aligned; therefore, omitting the alignment block yields aligned time steps). Results after ten training runs

4.3.2 Alignment

In this experiment, the double-ended models were trained with different alignment score functions, while the "u/v/u − v" fusion was used as fusion layer for all models. For both time-dependency blocks within the double-ended model, a self-attention network with two layers each was used. In addition to the four different alignment score functions that are compared, a model without alignment block is also included. As in the previous experiment of Sect. 4.3.1, this will lead to perfectly aligned features since the input signals of the datasets are already aligned. However, it should be noted that the time-alignment block cannot benefit from the already aligned signals. In the alignment network, each time step of the degraded signal is compared to each time step of the reference signal. Therefore, no knowledge of the absolute or relative position of the individual steps is included in the algorithm.

The results after ten training runs are presented in Fig. 4.3, where the "Aligned" boxplot represents the model without alignment block. It can be seen that the "Cosine" alignment achieves the best results, which are on average equal to the model with aligned inputs. The training of the model with aligned inputs leads overall to more variety in the results. The second best alignment score function is the one proposed by Bahdanau, followed by "Dot" and "Luong", which achieve similar results.

Fig. 4.3 Comparison of different alignment score functions for the double-ended model with self-attention. Results after ten training runs

The experiments show that by using the "Cosine" alignment, the same results as with perfectly aligned signals can be achieved, proving the suitability of this alignment score function for the alignment of the reference to the degraded signal. The downside of this attention based approach is that in some cases, the most similar reference feature vector will not be the original aligned time step. For example, in the case of frame loss with zero insertion, the degraded signal will be silent in the affected time step. In this case, the algorithm will likely find a silent reference time step (e.g. at the start or the end of the reference signal) and therefore overestimate the speech quality. Also, in case speech frames are dropped during the speech transmission because of delays, the model will have no knowledge of this since it only considers the degraded signal with its aligned reference time steps. As was shown in Fig. 4.1, the input to the model is always the degraded signal and its aligned reference time steps. Time steps of the reference signal that are not included in the degraded signal are thus not used as input and never seen by the model. However, the considered datasets contained no conditions with jitter-buffer management.

4.3.3 Feature Fusion

Based on the results of the previous section, in this experiment, the double-ended model with self-attention and cosine alignment was trained with five different fusion mechanisms. The results after ten training runs are presented in Fig. 4.4. It can be seen that the simplest fusion "uv" with element-wise multiplication yields the lowest performance. On the other hand, all other fusion mechanisms obtain similar results, showing that the concrete implementation of the fusion layer is not crucial, as long as the information of both signals and some distance or similarity measure is contained in the resulting fused vector.

Fig. 4.4 Comparison of different fusion mechanisms for the double-ended model with self-attention and cosine alignment. Results after ten training runs

4.3.4 Double-Ended vs Single-Ended

Finally, the best double-ended model is compared to the single-ended model with the same hyper-parameters but without alignment, fusion, and second time-dependency block. As time-dependency model, a self-attention block with two layers was applied; as alignment score function, the cosine similarity was used; and for the feature fusion, the "u/v/u − v" mechanism was applied. The results after ten training runs can be seen in Fig. 4.5. As previously seen, when the single-ended model was compared to the double-ended model without an alignment block, the performance increase of the double-ended model, compared to the single-ended model, is about a PCC of 0.012. This relatively small increase could be explained with the ability of the single-ended neural network to learn an internal reference. This general reference representation on a feature level is likely to be learned from the training data and stored within the network weights. Another explanation could be that the possible prediction accuracy that can be achieved is limited by the noise contained within the ground truth MOS values, rendering a further performance increase impossible.

Fig. 4.5 Comparison of single-ended and double-ended model with self-attention, cosine alignment, and "u/v/u − v" fusion. Results after ten training runs

Figure 4.6 presents the correlation diagrams of the MOS predictions of the live and the simulated dataset from the single-ended and the double-ended model. The two figures (a) and (b) at the top represent the predictions of the single-ended model, and the two figures at the bottom, (c) and (d), represent the double-ended predictions. One assumption for the double-ended model could be that it predicts samples with higher quality with more accuracy than the single-ended model since the distance between the degraded and reference features is very small in this case. The single-ended model, on the other hand, could judge a speech sample of good quality also with a low prediction since it has no reference available. However, the figure shows that both models predict speech samples with a high subjective MOS equally well. In fact, it seems that the double-ended model predicts samples with a medium quality between 2 and 4 slightly better than the single-ended model, as fewer outliers can be observed in the correlation diagrams. Therefore, in contrast to the initial assumption, it could be argued that the double-ended model is not as likely to judge medium quality samples with a very high quality around a MOS of 5 since it knows that the distance between the features would be very small in this case, whereas the single-ended model has no access to this information.

Fig. 4.6 Correlation diagrams of single-ended and double-ended predictions. (a) SE prediction of NISQA_VAL_SIM. (b) SE prediction of NISQA_VAL_LIVE. (c) DE prediction of NISQA_VAL_SIM. (d) DE prediction of NISQA_VAL_LIVE

4.4 Summary

In this chapter, it was shown how the clean reference can be incorporated into deep learning based speech quality prediction models to improve the prediction performance. The reference and the degraded Mel-spec segments are at first sent through a Siamese CNN-SA network that calculates comparable features of both signals. After that, the reference features are aligned to the degraded features with a cosine similarity function as alignment score and a hard-attention mechanism. The features are subsequently fused by concatenating the reference feature vector, the degraded feature vector, and a distance feature vector. The fused features are then sent through a second self-attention layer that is followed by a final time-pooling layer.


It was shown that, by applying the cosine similarity function, the implemented time-alignment can achieve the same results as with perfectly aligned signals. The double-ended model outperforms the single-ended model by an average PCC increase of around 0.012. One of the advantages of incorporating the reference is that the model is less likely to falsely judge medium quality speech samples with a very high quality. However, considering the high effort that is involved with an intrusive speech communication monitoring setup, the performance increase seems to be too small to justify the use of a double-ended model over a single-ended model that achieves similar results. Therefore, in the rest of this work, only single-ended models are considered.

Chapter 5

Prediction of Speech Quality Dimensions with Multi-Task Learning

In this chapter, machine learning based methods for single-ended prediction of the overall quality and, additionally, the four following speech quality dimensions are presented:

• Noisiness
• Coloration
• Discontinuity
• Loudness

The resulting dimension scores serve as degradation decomposition and help to understand the underlying reason for a low MOS score. The subjective ground truth values of these scores are the perceptual quality dimensions presented in Sect. 2.2. Because the model aims to predict the overall MOS and additionally dimension scores from the same speech signal, the prediction can be seen as a Multi-Task Learning (MTL) problem. The overall quality is here denoted as MOS, while the Noisiness, Coloration, Discontinuity, and Loudness are denoted as NOI, COL, DIS, and LOUD, respectively (however, the subjective ground truth values of the dimensions are also averaged opinion scores on their respective scales).

5.1 Introduction

While it would be possible to train individual models with the same architecture to predict the different tasks (i.e. quality dimensions), there are mainly two reasons why multi-task models are often preferred: (1) faster computation time and (2) better results through regularisation. Faster computation is achieved because most of the model's components are shared across the different tasks; thus, the model's outputs only need to be computed once rather than one time per task. The better regularisation can be understood if multi-task learning is viewed as a form
of inductive transfer. By introducing an inductive bias from one task to another, the model prefers to learn hypotheses that explain more than one task and thus reduces the risk of overfitting (Caruana 1997). Also, if the training data for one task is noisy, it can be difficult for the model to know which of the input features are relevant. By learning multiple tasks in parallel, other tasks can provide additional evidence for the relevance of features. Although in the case of this work, all dimensions and the overall MOS are rated by the same test participants, they may still have rated one dimension more consistently than others. While inconsistent or noisy ratings of one dimension will lead to overfitting to noise, during multi-task learning, the model learns a more general representation through averaging the noise patterns (Ruder 2017). In case additional tasks are only learned to improve the results of the main task without actually using the resulting tasks' outputs, these additional tasks are referred to as auxiliary tasks.

In the particular case of this work, a third reason to prefer a multi-task model is the underlying degradation decomposition approach. The overall predicted quality of a speech signal should be explainable by the individual predicted dimension scores. If the predictions of the individual models are completely independent, it is more likely that the scores contradict each other. For example, the overall MOS prediction may be low, but all of the dimension scores are rated high. In contrast, if the model's individual predictions are based on the same feature calculation, the features that caused a low overall MOS will most likely cause the responsible dimensions to be scored low as well. In this way, the dimension scores can help to understand why the overall MOS was predicted low by the model. For example, if the overall MOS and Discontinuity are predicted with a low and the other three dimensions with a high score, it can be assumed that the reason for the overall low score is that the model found interruptions in the speech signal. While it is important that the predicted scores are consistent, the dimension scores should still be orthogonal to each other. In theory, the model will inherit this orthogonal characteristic of the individual tasks through end-to-end learning from the subjective ratings.

Two commonly used methods for applying MTL to deep learning are either hard or soft parameter sharing of hidden layers (Ruder 2017). In the case of hard parameter sharing, most hidden layers are shared between all tasks, except for the task-specific output layers. In soft parameter sharing, on the other hand, each task has its own hidden layers, and the distances between the layers' parameters are then regularised to encourage the parameters to be similar. In other words, individual models are trained for each task, but during parameter optimisation, it is ensured that the model weights are not too different. In this work, MTL models with hard parameter sharing are analysed for the task of speech quality dimension prediction since the computation time is much faster when compared to soft parameter sharing. Also, soft parameter sharing introduces yet another hyper-parameter (the regularisation parameter) that needs to be tuned. While hard parameter sharing is useful in many MTL problems, it can lead to lower overall performance if the tasks are not closely related or require reasoning on different levels.
This kind of cross-task interference that leads to lower performance of an MTL model compared to individually trained models is also referred to as negative transfer. Another reason for negative transfer can be different learning rates of the tasks. In this case, the model may already be overfitting for one of the tasks, while it is still learning a general representation for another one. Because of these challenges, one of the main questions when designing MTL models is how much of the model should be shared. In general, the more similar the tasks are, the more of the model can be shared; at the same time, sharing as much of the model as possible regularises the optimisation and helps to avoid overfitting. Since the overall quality can be obtained by an integration of the perceptual quality dimensions, it can be assumed that the different tasks in this work are closely related. While the task of predicting the overall MOS needs to learn to predict quality for all types of distortions, the dimension tasks need to learn to predict the impact on the perceived quality for one specific type of distortion while other types of distortions are ignored. By learning the overall quality task, the model thus already learns a feature representation for the distortions of the individual dimension task. However, with adding the dimension tasks, the model also needs to learn to differentiate between different types of distortions and assign and weight them accordingly. Therefore, one question that is analysed in this chapter is whether the property of the model to differentiate between different types of distortions helps to improve the overall quality prediction. In contrast, it is analysed if the performance of a specific dimension can be improved by knowledge about other types of distortions. Furthermore, single-task models and multi-task models that share a different number of layers across the tasks are compared in terms of overall performance to answer the question of how much of the model should be shared.

In the next section, the different MTL models are presented. In the following Sect. 5.3, the models are trained, and the results are analysed and discussed. Finally, the conclusions of this chapter are summarised in the last section.
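For contrast with the hard parameter sharing used in this work, the soft parameter sharing variant mentioned above can be sketched as a simple regulariser on the distance between corresponding task weights. This is only an illustrative example; the regularisation weight and the model layout are assumptions, not part of this work.

```python
import torch
import torch.nn as nn

def soft_sharing_penalty(task_models, reg_weight=1e-3):
    """L2 distance between corresponding parameters of per-task models.

    Illustrates soft parameter sharing: each task keeps its own layers,
    but their weights are encouraged to stay close. The penalty is added
    to the sum of the task losses; reg_weight is the extra hyper-parameter
    mentioned in the text.
    """
    penalty = 0.0
    params = [list(m.parameters()) for m in task_models]
    for layer_params in zip(*params):
        ref = layer_params[0]
        for p in layer_params[1:]:
            penalty = penalty + torch.sum((p - ref) ** 2)
    return reg_weight * penalty

# Example: two identical task networks whose weights are softly tied
task_a = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
task_b = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
print(soft_sharing_penalty([task_a, task_b]).item())
```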

5.2 Multi-Task Models

In this section, four types of single-ended MTL speech quality models, based on the models from Chap. 3, are presented. In particular, because it achieved the best results for predicting the overall MOS in Chap. 3, a model with a CNN for per-frame feature calculation, a self-attention block for time-dependency modelling, and an attention-pooling layer for time aggregation is used (CNN-SA-AP model). In the presented MTL models, the location of the transition from shared layers to individual non-shared, task-specific output layers is varied. In the first presented model, merely the output size of the last fully connected layer is changed from one to five, to predict the four dimensions and the overall quality. The next models then incrementally share less and less of the layers, where the last model only shares parts of the CNN across tasks.


5.2.1 Fully Connected (MTL-FC)

This MTL model presents the maximum possible amount of shared layers by only changing the output size of the model from one to five. The last layer of the pooling block is a fully connected layer which receives the pooled features. Because the pooling layer only aggregates the features without changing the dimensions, the input size to the final fully connected layer is the same as the output size of the time-dependency model (i.e. the value dimension in the case of a self-attention block). The final fully connected layer then maps these pooled features to the target value. Figure 5.1 shows the last fully connected layer for the single-task model (ST) from Chap. 3, where the pooled features are mapped to the overall MOS value. Figure 5.2 shows the last fully connected layer with the additional outputs for the prediction of the speech quality dimension scores; in this case, the final fully connected layer consists of five different linear combinations of the pooled features for each predicted dimension and the overall MOS.

Figure 5.3 gives an overview of the MTL-FC model. The dashed lines represent the feature flow for each of the depicted Mel-spec segments. Each Mel-spec segment is firstly processed by the same CNN. The outputs of the CNN are then processed by the time-dependency block, where the CNN output of each Mel-spec represents one of the time steps. Finally, the features are aggregated over time by the pooling block. The CNN, time-dependency, and pooling layers are all shared by the five tasks. Only the output size of the model is changed to five. This model is denoted as MTL-FC.

Fig. 5.1 Final fully connected layer of the ST model without dimension prediction

Fig. 5.2 Final fully connected layer of the MTL-FC model with dimension prediction

Fig. 5.3 MTL-FC model. All layers are shared between the tasks, where the output size is set to five. The dashed lines represent the feature flow for one of the six depicted Mel-spec segments/time steps
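A minimal PyTorch sketch of the MTL-FC idea could look as follows; the shared trunk is a stand-in placeholder (not the CNN-SA-AP model), and only the widened final fully connected layer reflects the described change.

```python
import torch
import torch.nn as nn

class MtlFcHead(nn.Module):
    """Sketch of MTL-FC: one shared network and a final fully connected
    layer whose output size is five (MOS, NOI, COL, DIS, LOUD) instead of
    one. The shared trunk here is illustrative only."""

    TASKS = ("MOS", "NOI", "COL", "DIS", "LOUD")

    def __init__(self, pooled_dim=64):
        super().__init__()
        self.shared_trunk = nn.Sequential(              # placeholder for CNN + SA + pooling
            nn.Linear(pooled_dim, pooled_dim), nn.ReLU(),
        )
        # Single-task model would use nn.Linear(pooled_dim, 1) instead.
        self.out = nn.Linear(pooled_dim, len(self.TASKS))

    def forward(self, pooled_features):
        h = self.shared_trunk(pooled_features)
        return self.out(h)                              # (batch, 5)

model = MtlFcHead()
scores = model(torch.randn(8, 64))
print(dict(zip(model.TASKS, scores[0].tolist())))
```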

5.2.2 Fully Connected + Pooling (MTL-POOL)

In this model, the entire pooling block is calculated separately for each dimension. As can be seen in Fig. 5.4, the Mel-spec features are calculated by the same CNN and time-dependency block for each dimension. The outputs of each time-dependency time step are then the input for five individual pooling blocks that predict the overall MOS and the dimension scores (only two of the five outputs are displayed for the sake of clarity). Intuitively, it makes sense to separate the pooling layers for the dimension calculation since the temporal perception is different for each quality dimension. For example, a low coloration quality for a short amount of time may not be perceived as annoying as a short interruption in the middle of a sentence. Therefore, short interruptions may be weighted higher by the model than other distortions. This model with individual pooling blocks is denoted as MTL-POOL.

Fig. 5.4 MTL-POOL model. All layers except for the pooling block are shared between the tasks. Only two of the five outputs (dimensions + overall MOS) are shown. The dashed lines represent the feature flow for one of the six depicted Mel-spec segments/time steps
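The MTL-POOL structure can be sketched with one pooling head per task on top of shared time-dependency features; the attention-pooling module below is a simplified stand-in for the pooling block of Chap. 3, and all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Simple attention-pooling over time steps (a stand-in for the
    attention-pooling block of Chap. 3)."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.out = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (batch, time, dim)
        w = torch.softmax(self.score(x), dim=1)  # attention weights over time
        pooled = (w * x).sum(dim=1)              # (batch, dim)
        return self.out(pooled).squeeze(-1)      # one score per sample

class MtlPoolHeads(nn.Module):
    """MTL-POOL sketch: the CNN and time-dependency outputs (here: random
    stand-in features) are shared, while each task gets its own pooling
    block and output layer."""

    def __init__(self, dim=64, tasks=("MOS", "NOI", "COL", "DIS", "LOUD")):
        super().__init__()
        self.pools = nn.ModuleDict({t: AttentionPool(dim) for t in tasks})

    def forward(self, td_features):              # shared TD output: (batch, time, dim)
        return {t: pool(td_features) for t, pool in self.pools.items()}

heads = MtlPoolHeads()
preds = heads(torch.randn(8, 10, 64))
print({t: v.shape for t, v in preds.items()})
```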

5.2.3 Fully Connected + Pooling + Time-Dependency (MTL-TD)

In this model, only the CNN is shared across the tasks, while there are five individual task-specific time-dependency and pooling blocks. An overview of the model is shown in Fig. 5.5, where only two of the five outputs are displayed. It can be seen that all of the Mel-spec feature vectors are processed by the same CNN. The CNN output of each Mel-spec segment is then routed in parallel through five different time-dependency blocks. Again, intuitively it makes sense to apply different time-dependency models for the quality dimensions because some of the dimensions may interact with other time steps differently. For example, a sudden distortion in the speech signal could be perceived as discontinuity. However, if this distortion is continually present in the signal, it may be perceived as background noise. The individual time steps can only learn this context through the time-dependency block. This model with individual time-dependency and pooling blocks is denoted as MTL-TD.


Fig. 5.5 MTL-TD model. Only the CNN layers are shared between the tasks. Only two of the five outputs (dimensions + overall MOS) are shown. The dashed lines represent the feature flow for one of the six depicted Mel-spec segments/time steps

5.2.4 Fully Connected + Pooling + Time-Dependency + CNN (MTL-CNN)

The last MTL model shares only parts of the CNN, while all of the other layers are task-specific. In particular, only the first four of six convolutional layers are shared amongst the tasks. Table 5.1 shows the layer design of the used CNN (see also Sect. 3.4.1) with the split between shared and non-shared layers. Whereas in the last two layers, most of the speech quality-related features are calculated, in the first layers, low-level features are calculated. The last two layers also contain more kernels (i.e. 64 channels) to capture the higher complexity of the top-level speech quality features. An overview of the model in Fig. 5.6 shows that only the first two-thirds of the model are shared between the tasks; the CNN then splits up with individual convolution layers for each task.

5.3 Results

The four MTL models were trained and validated on the four datasets presented in Sect. 3.1: NISQA_TRAIN_SIM, NISQA_TRAIN_LIVE, NISQA_VAL_SIM, and NISQA_VAL_LIVE. In addition to the MTL models, each dimension and the overall MOS were also trained individually with a single-task (ST) model that shares none of the layers with other tasks. The training parameters were again set to the same values as in Sect. 3.7 for the single-ended model, and also the same CNN model was applied (see Sect. 3.4.1). As performance indicator, the average Pearson's correlation coefficient over the live and simulated validation datasets was used.


Table 5.1 CNN design of the MTL-CNN model with the split between shared and non-shared layers (the kernel size is 3×3)

Shared layers:
Layer                        Channel  Height  Width
Input                        1        48      15
Conv 1                       16       48      15
Batch normalisation
ReLU
Adaptive Max-pool            16       24      7
Conv 2                       16       24      7
Batch normalisation
ReLU
Adaptive Max-pool            16       12      5
Dropout
Batch normalisation
ReLU
Conv 3                       32       12      5
Batch normalisation
ReLU
Conv 4                       32       12      5
Batch normalisation
ReLU
Adaptive Max-pool            32       6       3

Non-shared, task-specific layers:
Dropout
Conv 5                       64       6       3
Batch normalisation
ReLU
Dropout
Conv 6 (no width padding)    64       6       1
Batch normalisation
ReLU
Output (flatten)             384

Fig. 5.6 MTL-CNN model. Only the first four CNN layers are shared between the tasks. Only two of the five outputs (dimensions + overall MOS) are shown. The dashed lines represent the feature flow for one of the six depicted Mel-spec segments/time steps

5.3.1 Per-Task Evaluation

Overall MOS: Figure 5.7 shows the results of the four different MTL models and the ST model for the prediction of the overall MOS only. For each training run, the epoch with the best result for predicting the overall MOS was used as a result. The individual 10 training runs are then presented as boxplots. It can be seen that the prediction performance is very similar between the different models, where the model MTL-TD with task-specific time-dependency and pooling blocks achieved slightly better results than the single-task model or the MTL models that share most of the layers. However, based on the figure it can be concluded that using the quality dimensions as auxiliary tasks does not notably improve the prediction of the overall MOS. One reason for this could be that the signal distortions of the individual quality dimensions are already covered by the overall quality. Therefore, the model does not seem to learn new information from predicting quality dimensions that it could leverage to improve the prediction of the overall MOS. Also, the regularisation effect of the MTL model appears to be small, possibly because the noise of the ratings is correlated between the dimensions and the overall MOS since they are rated by the same test participants.

Fig. 5.7 Comparison of the results of the overall MOS prediction of different multi-task models. Results on the validation set after 10 training runs presented as boxplots


Fig. 5.8 Comparison of the results for the dimension prediction with different multi-task models. Results on the validation set after 10 training runs presented as boxplots. (a) Noisiness. (b) Coloration. (c) Discontinuity. (d) Loudness

Quality Dimensions: Figure 5.8 shows the results for predicting the individual quality dimensions. Again, for each training run, the epoch with the best result for predicting the corresponding dimension was used as a result. The results of 10 training runs are then presented as boxplots. In addition to the MTL and ST models, in this experiment an ST model that is pretrained on the overall MOS is also included. In this transfer-learning approach, the model was first trained to predict the overall MOS. After that, it was fine-tuned to predict one of the quality dimensions. This overall MOS pretrained model is denoted as ST-PT.

In the four plots of Fig. 5.8, the most variation in results can be seen for the dimension Coloration. The Coloration ST model with a median PCC of 0.69 is outperformed by the MTL models with medians around 0.71. This shows that the model can benefit from the additional information that it receives through the other tasks. In fact, the ST-PT model that is pretrained to predict the overall MOS obtains the best median PCC. Therefore, it can be concluded that the model mostly benefits from the additional information gained from the overall MOS prediction rather than from the other quality dimensions. The MTL models in the plot are sorted, from left to right, from sharing all of the layers to only sharing parts of the CNN. The Coloration MTL model with the best results is MTL-POOL; from there, the performance notably decreases with each stage at which less of the network is shared across tasks. However, MTL-FC, which shares all of the layers, achieves a slightly lower PCC, showing that in this case it is better to not completely share the network layers between all tasks.

For the other quality dimensions, the results of the different models are closer together, and only small differences between the ST and the MTL models can be observed. However, there is a clear trend that the models MTL-POOL and MTL-TD achieve a slightly higher PCC than the ST models for the dimensions Noisiness, Discontinuity, and Loudness, whereas there is a notable improvement for the Coloration dimension. This shows that the prediction of the dimension scores can indeed benefit from the additional knowledge that is learned by considering other types of distortions as well.

5.3.2 All-Tasks Evaluation

In this experiment, the different MTL models are compared on their overall performance across all four dimensions and the overall MOS. To this end, the average PCC r_ALL is calculated for each epoch of a training run; then the epoch with the best average PCC on the validation set is used as result. The average PCC is calculated as follows:

r_ALL = (r_MOS + r_NOI + r_COL + r_DIS + r_LOUD) / 5.    (5.1)
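A small example of how r_ALL from Eq. (5.1) could be computed with SciPy; the task names and data layout are assumptions made for the example.

```python
import numpy as np
from scipy.stats import pearsonr

def average_pcc(y_true, y_pred):
    """r_ALL from Eq. (5.1): mean Pearson correlation over the overall MOS
    and the four dimensions. y_true/y_pred: dicts of per-task value arrays."""
    tasks = ("MOS", "NOI", "COL", "DIS", "LOUD")
    return np.mean([pearsonr(y_true[t], y_pred[t])[0] for t in tasks])

# Example with random ratings for 100 files
rng = np.random.default_rng(0)
y_true = {t: rng.uniform(1, 5, 100) for t in ("MOS", "NOI", "COL", "DIS", "LOUD")}
y_pred = {t: y_true[t] + rng.normal(0, 0.3, 100) for t in y_true}
print(round(average_pcc(y_true, y_pred), 3))
```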

The results are shown in Fig. 5.9, where MTL-POOL obtains the best PCC. Again, the differences between the models are only marginal; however, sharing some of the layers gives on average a slightly better performance than sharing all of the layers or sharing only a few layers. It can be concluded that the different tasks that the model learns are indeed quite similar to each other. Still, there is some difference between them, so that the model benefits from having at least its own task-specific pooling layer for each of the quality dimensions and the overall quality. In case parts of the CNN are shared as well, the model performance decreases again, showing that sharing only the low-level CNN features is not enough to leverage the possible (but small) MTL performance increase.

Fig. 5.9 Overview of the general speech quality prediction model neural network architecture

5.3.3 Comparing Dimension

In this experiment, the prediction performance of the different quality dimensions and the overall MOS are compared to analyse which dimensions are the most difficult to predict and possible reasons for the differences in prediction accuracy. As a baseline for the possible prediction accuracy, a single subjective rating of one test participant is correlated with the average rating across all test participants. In the datasets, there are five ratings per file available, where each participant rated a minimum of ten files. One of the five ratings is sampled randomly for each of the files and then compared to the average rating (i.e. overall MOS or dimension MOS). This process is repeated ten times and presented as boxplots on the left-hand side of Fig. 5.10. It can be seen that a single subjective overall MOS rating correlates much higher with the average rating than a single dimension rating correlates with the average dimension rating. This effect can also be observed for the prediction models; however, the PCC of the ST and MTL-POOL prediction models is higher than the PCC of one single human rating.

Fig. 5.10 Results on validation dataset over 5 runs for each segment step size. (a) Human. (b) ST. (c) MTL-POOL

When the prediction performance of the models for different dimensions is compared, it can be noted that Coloration and Discontinuity are the most difficult to predict, with much lower PCCs than the overall MOS. While the apparent lower consistency in the human ratings is one possible explanation for this, the two dimensions Noisiness and Loudness obtained much higher prediction accuracy while the consistency in the human ratings is low as well. This shows that the prediction of Coloration and Discontinuity is particularly difficult for the model, likely because of signal-related issues. Discontinuities are generally more difficult for a single-ended model to predict since no reference is available and therefore no information about the original content that could be interrupted or missed is available. Often, packet-loss conditions with packet-loss concealment algorithms are also rated with a lower Coloration by test participants (Möller et al. 2019a), which could be one reason for the lower prediction accuracy. However, compared to the PCC of a single human rating, the prediction accuracy of the quality dimensions is overall good.

5.3.4 Degradation Decomposition

One of the goals of the quality dimension approach is the degradation decomposition. That means that, in addition to the information about the overall quality of a speech sample, the composition of the overall quality in terms of perceptual quality dimensions is also known. This information can then be used to determine the technical root-cause in the transmission system. Therefore, it is essential that the dimension scores are consistent with the overall MOS scores; that is, the individual quality scores span the entire quality space and comprise the overall quality. To analyse this, a linear regression between the individual predicted dimension scores and the overall predicted MOS of the validation datasets was fitted and compared for the four different MTL models and the ST model. Additionally, the results for the subjective quality ratings were calculated. The results in terms of PCC and R² can be seen in Table 5.2, where for each model type the model with the best overall results was picked for this experiment.


Table 5.2 Results of linear regressions between the individual dimensions and the overall MOS in terms of Pearson's correlation coefficient r and coefficient of determination R²

Model        r      R²
MTL-FC       0.98   0.97
MTL-POOL     0.99   0.97
MTL-TD       0.98   0.95
MTL-CNN      0.98   0.96
ST           0.95   0.91
Subjective   0.94   0.88

The single-task model only achieves an R² of 0.91, while the MTL models obtain R² values between 0.95 and 0.97. This shows that the consistency between the dimension scores and the overall quality is notably higher when a multi-task model is used. However, the amount of shared layers amongst the dimensions does not seem to be a decisive factor, as all of the MTL models obtain similar results. The correlation of the MTL models is generally very high, with a PCC of 0.98 to 0.99, and thus also higher than the correlation between the subjective dimensions and the subjective overall quality with a PCC of 0.94.

Another aspect in regard to the degradation decomposition approach that should be considered is the orthogonality between the quality dimension scores. To analyse how independent the individual dimensions are from each other, the correlation between the dimension scores was calculated for the subjective ratings, the MTL-POOL model, and the single-task models. The results are presented in Tables 5.3, 5.4, and 5.5. Table 5.3 shows the correlations of the subjective ratings, where it can be seen that the dimensions Discontinuity and Noisiness correlate the least with each other, with a PCC of 0.36. On the other hand, the dimensions Loudness and Coloration show a high correlation of 0.75. When these values are compared to Table 5.4 with the multi-task model predictions, it can be noted that the correlations are very similar to the subjective ratings. Only for the combinations of LOUD/NOI and LOUD/COL, the correlations are 0.10 higher than the ground truth subjective correlations. The correlations of the single-task models in Table 5.5 are overall much lower than the subjective correlations. For example, the combination DIS/NOI correlates with a PCC of 0.22, whereas the subjective ratings correlate with a PCC of 0.36.

Overall, it can be concluded that the single-task models lead to an "unnatural" low correlation between the dimensions. In contrast, the predictions of the multi-task model tend to correlate higher with each other than the subjective ratings. However, the multi-task prediction correlations are generally in the same range as the subjective ratings, in particular when the dimension Loudness, which is known to be not completely orthogonal to the other dimensions (Côté et al. 2007), is not considered.

Table 5.3 PCC between subjective dimension ratings

       NOI    COL    DIS    LOUD
NOI    -      0.45   0.36   0.55
COL    0.45   -      0.61   0.75
DIS    0.36   0.61   -      0.48
LOUD   0.55   0.75   0.48   -

Table 5.4 PCC between MTL-POOL predictions

       NOI    COL    DIS    LOUD
NOI    -      0.50   0.34   0.65
COL    0.50   -      0.62   0.85
DIS    0.34   0.62   -      0.48
LOUD   0.65   0.85   0.48   -

Table 5.5 PCC between ST predictions

       NOI    COL    DIS    LOUD
NOI    -      0.32   0.22   0.60
COL    0.32   -      0.64   0.72
DIS    0.22   0.64   -      0.41
LOUD   0.60   0.72   0.41   -
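The consistency check reported in Table 5.2 (a linear regression from the four predicted dimension scores to the predicted overall MOS, evaluated with r and R²) could be reproduced along the following lines; the data in the example is synthetic and the helper name is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr

def decomposition_consistency(dim_scores, mos):
    """Fit MOS = f(NOI, COL, DIS, LOUD) with a linear regression and return
    Pearson's r and R² between fitted and given MOS, as in Table 5.2.
    dim_scores: (n_files, 4) array, mos: (n_files,) array."""
    reg = LinearRegression().fit(dim_scores, mos)
    mos_hat = reg.predict(dim_scores)
    r = pearsonr(mos, mos_hat)[0]
    r2 = reg.score(dim_scores, mos)
    return r, r2

# Example with synthetic, roughly consistent scores
rng = np.random.default_rng(1)
dims = rng.uniform(1, 5, size=(200, 4))
mos = dims.mean(axis=1) + rng.normal(0, 0.2, 200)
print([round(x, 2) for x in decomposition_consistency(dims, mos)])
```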

5.4 Summary

In this chapter, four different multi-task models were presented that share different amounts of layers between the tasks. The prediction accuracy for all quality dimensions was higher than that of a single, randomly selected human rating. The dimensions that are the most difficult to predict are Coloration and Discontinuity with a PCC of around 0.70. Loudness and Noisiness could be predicted with a PCC of around 0.79 and 0.81, respectively, while the overall MOS can be predicted with a PCC of 0.87. Whereas the overall lower prediction accuracy of the quality dimensions compared to the overall quality can be explained with the lower consistency of the subjective ratings, it is not entirely clear why the Loudness and Noisiness accuracy is notably higher than the human baseline even for the single-task models. It can, however, be assumed that this effect is caused by the technical impairments speech level and background noise, which are relatively easy to predict.

It was further shown that the models MTL-POOL and MTL-TD that share some parts of the layers amongst tasks achieved the best overall results. In contrast, the model MTL-FC that shares all of the layers and the model MTL-CNN that shares only the low-level layers of the CNN obtained a lower overall PCC. However, the difference between the models is only marginal. Overall, all dimensions and the overall MOS benefited from the multi-task approach and achieved better results when compared to the single-task models. In particular, the accuracy of the dimension Coloration could be raised from a PCC of around 0.69 to 0.71. It was further shown that the achieved improvements with the MTL approach can generally also be obtained by pretraining a single-task model with the overall quality. While the results could be somewhat improved with the multi-task approach, the computation time is generally drastically reduced when compared to multiple single-task models. However, the exact computation time is largely dependent on the particular system and is not of main interest in this work and therefore not analysed in detail.

Further, it was shown that the decomposition approach can successfully be implemented with the MTL approach that delivers consistent predictions between the overall MOS and the dimension scores. One interesting idea to further improve the prediction performance in a multi-task approach would be to use technical parameters that were used for the generation of the simulated dataset as auxiliary tasks. For example, codecs could be predicted as a classification task and noise levels as a regression task. These auxiliary tasks could then potentially help the model to regularise better for the prediction of the overall MOS and dimensions.

Chapter 6

Bias-Aware Loss for Training from Multiple Datasets

Although many different deep learning architectures have been proposed for speech quality prediction, they do not consider the uncertainty in the ground truth MOS values that are the targets of the supervised learning approach. In particular, it is common practice to use multiple datasets for training and validation, as subjective data is usually sparse due to the costs that experiments involve. However, these datasets often come from different labs with different test participants, and often, many years lie in between them. Test participants usually adapt to the context of an experiment and rate the lowest occurring distortion with the lowest possible rating. If two experiments differ in their quality range, the same distortion levels will likely be rated very differently in both experiments. Also, the quality expectation changes over time. While participants may have rated a narrowband speech signal with a high score ten years ago, service users are now more used to high definition speech (see also Sect. 2.3). Because this is a widely known problem, the ITU-T recommends in ITU-T Rec. P.1401 (2020) to apply a third-order, monotonically increasing mapping of the predicted values before calculating the root-mean-square error (RMSE) for the evaluation of objective quality prediction models.

In this chapter, a new bias-aware loss is presented that considers the biases between different datasets by learning the biases automatically during training. The calculated loss is adjusted in a way that errors, which only occur due to dataset-specific biases, are not considered when optimising the weights of the quality prediction model.


6.1 Method

The basic idea is that the dataset-specific biases are learned and accounted for during the training of the neural network. After each epoch, the predicted MOS values of the training data are used to estimate the bias in each dataset by mapping their values to the ground truth subjective MOS values. The assumption is that the predicted MOS values of the model are objective in the sense that they will average out the biases of the different datasets while training. The biases are approximated with a polynomial function. In particular, in this work, first-order and third-order polynomial functions are examined. In theory, a first-order function can model constant offsets and linear biases in a dataset, while with a third-order function also saturation effects can be modelled. The coefficients b of the third-order function for one specific dataset are estimated as follows:

min_b  1/N Σ_{i=1}^{N} [ y_i − (b_0 + b_1 ŷ_i + b_2 ŷ_i² + b_3 ŷ_i³) ]²    (6.1)

subject to

∂/∂t (b_0 + b_1 t + b_2 t² + b_3 t³) ≥ 0,  t ∈ [1, 5],    (6.2)

where y_i are the ground truth subjective MOS values, ŷ_i are the predicted MOS values, and N is the number of samples in the dataset. Additionally, the function is constrained to be monotonically increasing for all possible input values (the MOS values in this work range between 1 and 5). The first-order polynomial function is estimated in the same way, but with the higher-order coefficients set to zero: b_2 = b_3 = 0. For the first-order function, the constraint can be applied with a bound on the second coefficient, that is b_1 ≥ 0.

After the biases are estimated for each dataset, they can be used in the next epoch to calculate the proposed bias-aware loss. Firstly, the predicted MOS values are mapped with the predetermined bias coefficients. After that, the mean square error between the bias-mapped predicted MOS and the subjective MOS values gives the loss as follows:

l = L(y, ŷ, b) = 1/N Σ_{i=1}^{N} [ y_i − (b_{0,i} + b_{1,i} ŷ_i + b_{2,i} ŷ_i² + b_{3,i} ŷ_i³) ]²,    (6.3)

where i is the sample index and N the overall number of samples. The bias that is applied to each predicted MOS depends on the dataset that this particular sample i belongs to.


Because the predicted values are mapped according to the bias of each dataset, errors between the predicted and subjective values that only occur due to the dataset-specific bias are neglected. The model thus learns to predict quality rather than non-relevant biases.
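A possible implementation of the bias estimation of Eqs. (6.1) and (6.2) and of the bias-aware loss of Eq. (6.3) is sketched below. The use of SciPy's SLSQP solver with the monotonicity constraint evaluated on a grid of t values is one option among several and not necessarily the approach used in this work.

```python
import numpy as np
import torch
from scipy.optimize import minimize

def estimate_bias(y_true, y_pred, order=3, grid=np.linspace(1, 5, 41)):
    """Estimate bias coefficients b of Eq. (6.1) for one dataset, with the
    monotonicity constraint of Eq. (6.2) enforced on a grid of t in [1, 5].
    Returns b = [b0, b1, b2, b3]."""
    def poly(b, t):
        return b[0] + b[1] * t + b[2] * t**2 + b[3] * t**3

    def objective(b):
        return np.mean((y_true - poly(b, y_pred)) ** 2)

    def monotonic(b):          # derivative of the mapping, must be >= 0
        return b[1] + 2 * b[2] * grid + 3 * b[3] * grid**2

    b0 = np.array([0.0, 1.0, 0.0, 0.0])       # identity mapping as start value
    if order == 1:
        bounds = [(None, None), (0.0, None), (0.0, 0.0), (0.0, 0.0)]
        res = minimize(objective, b0, bounds=bounds, method="SLSQP")
    else:
        res = minimize(objective, b0, method="SLSQP",
                       constraints=[{"type": "ineq", "fun": monotonic}])
    return res.x

def bias_aware_loss(y_true, y_pred, b):
    """Eq. (6.3) for one mini-batch: MSE between subjective MOS and the
    bias-mapped predictions. b holds one coefficient row per sample."""
    mapped = b[:, 0] + b[:, 1] * y_pred + b[:, 2] * y_pred**2 + b[:, 3] * y_pred**3
    return torch.mean((y_true - mapped) ** 2)

# Example: a dataset whose ratings are offset by -0.5 relative to the model
rng = np.random.default_rng(0)
y_pred = rng.uniform(1, 5, 500)
y_true = np.clip(y_pred - 0.5 + rng.normal(0, 0.2, 500), 1, 5)
b = estimate_bias(y_true, y_pred)
print(np.round(b, 2))
loss = bias_aware_loss(torch.tensor(y_true), torch.tensor(y_pred),
                       torch.tensor(np.tile(b, (500, 1))))
print(float(loss))
```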

6.1.1 Learning with Bias-Aware Loss

The complete algorithm of the bias-aware loss is depicted in Algorithm 1. The inputs to the algorithm are the input features x_i (i.e. Mel-spec segments) and the subjective MOS values y_i, which are the desired output values of the supervised learning process, for all samples i ∈ N_Samples. Further, a list that contains the dataset name db_i of each sample is needed to assign the individual samples to their datasets/bias coefficients. Before the training starts, the bias coefficients b_i of all samples are initialised with an identity function, with which Eq. (6.3) corresponds to a vanilla MSE (mean square error) loss function. Because the model will typically not give a meaningful prediction output that could be used to estimate a bias after the first few epochs, the update_bias flag is set to False until a predefined model accuracy r_th is achieved. As long as this threshold is not reached, the bias coefficients are not updated, and therefore a vanilla MSE loss is used for calculating the loss.

Then, at the start of each epoch, the mini-batch indices idx_k are randomly shuffled to increase the variety of different data the model is trained on. It is necessary to preserve these indices in order to assign the samples to their corresponding datasets. After each epoch, the model is used to predict the MOS values ŷ of all samples. Only once the accuracy threshold r_th is met, the bias is estimated and updated. It should be noted that, to speed up the training, instead of predicting the MOS values for the complete training data, the predictions of each mini-batch can be saved directly: ŷ[idx_k] = ŷ_b. However, in this case, the predicted bias of the first mini-batches may be different compared to the later mini-batches that already underwent more model weight optimisation steps.

Once the model accuracy r_th is reached, the update_bias flag is set to True and the bias will be updated after every epoch. To update the bias coefficients, the algorithm loops through all datasets individually, where idx_j represents the indices of all samples that belong to the j-th dataset. At line 21 of the algorithm, the subjective and predicted MOS values of the j-th dataset are loaded. Then they are applied to Eq. (6.1) to estimate the bias coefficients b_db, which are subsequently saved in the overall bias matrix b with the help of the indices idx_j. In the next training epoch, the biases are then extracted at line 7 according to the randomly shuffled mini-batch indices idx_k that may contain a random number of different datasets, and therefore each sample may also be subject to a different bias. By using the bias-aware loss, these biases are considered when calculating the error between the predicted and subjective MOS values. After each epoch, the bias coefficients are then updated to be in line with the updated model predictions.


Algorithm 1 Training with bias-aware loss function
Input: x_i: input features (e.g. Mel-spec), y_i: subjective MOS values, db_i: list of dataset names for each sample i
Parameter: rth: minimum prediction accuracy to update the bias coefficients
Output: model weights
 1: Initialise bias: b_i = [0, 1, 0, 0], i ∈ N_Samples
 2: update_bias = False
 3: while not converged do
 4:   Shuffle mini-batch indices idx_k
 5:   k = 0
 6:   for all mini-batches do
 7:     Get mini-batch: x_b = x[idx_k], y_b = y[idx_k], b_b = b[idx_k]
 8:     Feed forward: ŷ_b = model(x_b)
 9:     Calculate bias-aware loss with Eq. (6.3): l = L(y_b, ŷ_b, b_b)
10:     Backpropagate & optimise weights
11:     k = k + 1
12:   end for
13:   Predict MOS: ŷ = model(x)
14:   Calculate Pearson's correlation r = PCC(y, ŷ)
15:   if r > rth or update_bias then
16:     update_bias = True
17:     j = 0
18:     Find unique dataset entries: db_u = unique(db)
19:     for all datasets do
20:       Find dataset indices: idx_j = find(db == db_u,j)
21:       Get dataset: y_db = y[idx_j], ŷ_db = ŷ[idx_j]
22:       Estimate bias b_db with Eq. (6.1)
23:       Update bias: b[idx_j] = b_db
24:       j = j + 1
25:     end for
26:   end if
27: end while
28: return model weights
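As a hedged illustration of the per-dataset bias update in line 22, the sketch below estimates the first-order bias of one dataset with the bound b1 ≥ 0 using SciPy's bounded least squares. Eq. (6.1) and the constrained third-order variant are defined earlier in this chapter; the function name and the return format are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import lsq_linear

def estimate_bias_first_order(y_db, y_hat_db):
    """First-order bias estimate for one dataset (cf. line 22 of Algorithm 1).

    Fits y ≈ b0 + b1 * y_hat with the monotonicity bound b1 >= 0 and
    returns the coefficients in the layout [b0, b1, b2, b3].
    """
    A = np.column_stack([np.ones_like(y_hat_db), y_hat_db])
    res = lsq_linear(A, y_db, bounds=([-np.inf, 0.0], [np.inf, np.inf]))
    b0, b1 = res.x
    return np.array([b0, b1, 0.0, 0.0])

# example: predictions that are systematically offset by -0.5 MOS
y_hat_db = np.array([2.0, 3.0, 4.0])
y_db = y_hat_db + 0.5
print(estimate_bias_first_order(y_db, y_hat_db))  # approx. [0.5, 1.0, 0.0, 0.0]
```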

6.1.2 Anchoring Predictions

When the bias-aware loss is applied, the MOS predictions are not anchored to the absolute subjective MOS values and can therefore wander off. While the predictions will still rank the samples in the best possible way, there may be a large offset between predictions and subjective values on the validation set. This effect leads to a higher root-mean-square error, while the Pearson's correlation coefficient (PCC) is usually not affected. To overcome this problem, the predicted MOS values of all samples can be mapped to the subjective MOS after the training.


This mapping can then be applied when making new predictions on validation data. Alternatively, two methods are proposed to anchor the predictions directly during training.

Anchoring with Anchor Dataset
Instead of estimating the bias for all datasets, the predictions are anchored to one specific training dataset. This approach can be particularly useful if there is one dataset of which it is known that the conditions are similar to the conditions that the model should be applied to later. For example, a new dataset is usually created with new conditions and then split into a training and validation set. To increase the training data size and to improve the model accuracy, older datasets are also included in the training set. It is likely that there will be a bias between the new dataset and the older datasets, for example, because the highest quality in the new dataset will be higher than the one in the older datasets. The predictions can be anchored to the new dataset by omitting the bias update for the new dataset only (skipping lines 20–23 in Algorithm 1).

Anchoring with Weighted MSE Loss
Another approach to prevent the predictions from wandering off is to add a weighted vanilla MSE loss that punishes predictions that are too far from the original subjective MOS values. The loss is then calculated as follows:

l = \lambda \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 + \frac{1}{N} \sum_{i=1}^{N} \left[ y_i - \left( b_{0,i} + b_{1,i}\hat{y}_i + b_{2,i}\hat{y}_i^2 + b_{3,i}\hat{y}_i^3 \right) \right]^2     (6.4)

where λ is the MSE loss weight. If λ is set too high, the bias-aware loss function’s influence will be too low to increase the model accuracy. In this work, the weight was set to a fixed value of λ = 0.01.
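Under the same assumptions as the earlier loss sketch (precomputed per-sample bias coefficients b), Eq. (6.4) could look roughly as follows; the default λ = 0.01 mirrors the fixed weight used in this work, while the function name is illustrative only.

```python
import torch

def anchored_bias_aware_loss(y, y_hat, b, lam=0.01):
    """Bias-aware loss with a weighted vanilla MSE anchor term (Eq. 6.4)."""
    mse_anchor = torch.mean((y - y_hat) ** 2)           # punishes drifting predictions
    y_hat_mapped = (b[:, 0] + b[:, 1] * y_hat
                    + b[:, 2] * y_hat ** 2 + b[:, 3] * y_hat ** 3)
    bias_aware = torch.mean((y - y_hat_mapped) ** 2)    # Eq. (6.3) term
    return lam * mse_anchor + bias_aware
```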

6.2 Experiments and Results

In this section, the performance of the proposed method is analysed by training the model on different datasets. In the first experiment, a synthetic speech quality dataset is generated for which the biases are known. After that, an experiment with six different speech quality datasets is conducted.


6.2.1 Synthetic Data

Firstly, a synthetic speech quality dataset is generated to which artificial biases are applied. As source files, the 2–3 s reference speech files from the TSP dataset (Kabal 2002) are used (see also Sect. 3.1.1 for a description of the source files). Overall, 320 speech files are used for the four training datasets and 80 speech files for the validation set. The speech signals were processed with white Gaussian noise to create conditions of different distortion levels. It was found that when the SNR range of the added noise is too wide, it is too easy for the model to predict the quality, and when it is too narrow, the prediction becomes too challenging. Therefore, the speech files were processed with noise at SNR values between 20 and 25 dB, which proved to be a good compromise.

To simulate subjective MOS values, an S-shaped mapping between the technical impairment factor (i.e. SNR) and MOS, taken from ITU-T Rec. P.107 (2015), is used. The relationship is shown in Fig. 6.1 and maps the SNR values to a MOS between 1 and 4.5. It should be noted that this is an artificial mapping for simulation purposes and does not reflect the real relationship between SNR and MOS.

Fig. 6.1 Mapping between SNR of white noise and speech quality MOS

The training data is divided into four different subsets, and a different bias is applied to each of them. These four artificial biases are shown in Fig. 6.2. To better analyse the influence of the bias-aware loss, extreme biases are used in this experiment. There is no bias applied to the first simulated database (blue line). The second and the third simulated databases have linear biases applied, while the fourth database is exposed to a bias modelled with a third-order polynomial function. Each of the training datasets contains 80 files; the validation set also contains 80 files. The synthetic experiments are run with the CNN-SA-AP model presented in Chap. 3.
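As a rough illustration of how such a synthetic condition could be generated, the sketch below adds white Gaussian noise at a target SNR and maps the SNR to a simulated MOS with a generic logistic S-curve. The curve parameters (midpoint, slope) and the random placeholder signal are assumptions for illustration and not the mapping from ITU-T Rec. P.107 used in this work.

```python
import numpy as np

def add_noise_at_snr(speech, snr_db):
    """Add white Gaussian noise to a speech signal at a given SNR (dB)."""
    p_speech = np.mean(speech ** 2)
    p_noise = p_speech / (10 ** (snr_db / 10))
    noise = np.random.randn(len(speech)) * np.sqrt(p_noise)
    return speech + noise

def snr_to_mos(snr_db, mos_min=1.0, mos_max=4.5, midpoint=22.5, slope=0.8):
    """Illustrative S-shaped SNR-to-MOS mapping (placeholder, not the ITU curve)."""
    return mos_min + (mos_max - mos_min) / (1 + np.exp(-slope * (snr_db - midpoint)))

# simulate one condition of the synthetic dataset
speech = np.random.randn(48000)      # placeholder for a 2-3 s reference file
snr = np.random.uniform(20, 25)      # SNR range used in this experiment
degraded = add_noise_at_snr(speech, snr)
mos = snr_to_mos(snr)                # simulated "subjective" MOS label
```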


Fig. 6.2 Four artificial biases introduced to the four simulated training datasets

Fig. 6.3 Validation results in terms of Pearson’s correlation over 15 training runs with anchoring for different rth thresholds

6.2.2 Minimum Accuracy rth

In the first experiment, the influence of the minimum PCC that must be achieved on the training set before activating the bias update in the bias-aware loss algorithm (line 15 in Algorithm 1) is analysed. To this end, the experiment is run with 11 different thresholds rth between 0 and 0.95. An early stop on the validation PCC of 20 epochs is used, and the best epoch of each run is saved as the result. The training run of each of these 11 configurations is repeated 15 times. The results, together with the mean results and their 95% confidence intervals, can be seen in Fig. 6.4 for training without anchoring and in Fig. 6.3 with anchoring. In the case of anchoring, the bias-aware loss is not used on the first dataset train_1 but only on the other three datasets.

The correlation of the results varies strongly when an anchor dataset is used (Fig. 6.3). The highest correlation can be achieved for thresholds between 0.5 and 0.7. When no anchoring is applied, the exact threshold does not seem to be as crucial, as long as it is somewhere between 0.1 and 0.8. However, the PCC remains overall lower than the higher PCCs that can be achieved with an anchor dataset. For thresholds higher than 0.9, the accuracy notably drops. It can be assumed that at this point, the model weights are already optimised too far towards the vanilla MSE loss and cannot always profit from the late-activated bias-aware loss.

Fig. 6.4 Validation results in terms of Pearson's correlation over 15 training runs without anchoring for different rth thresholds

6.2.3 Training Examples with and Without Anchoring

Figure 6.5 shows an example of four different training runs with different anchoring configurations. The figures show the epoch with the best results on the validation dataset in terms of PCC. Each row presents one training run, and each column presents the results on the four different training datasets and on the validation dataset. The artificial bias that was applied to the datasets is shown as a green line (see also Fig. 6.2). The estimated bias used by the bias-aware loss is depicted as an orange line. As estimation function, a first-order polynomial was used for the first three examples and a third-order polynomial for the last example at the bottom.

The top row presents the results without anchoring and shows how the prediction results can drift away from the original values. While the predictions are extremely biased in this case, the achieved PCC remains high. The second row shows a run with a weighted MSE loss of λ = 0.01, where the predicted MOS stays closer to the original values. In the third and the fourth row, the results for anchoring with an anchor dataset are shown. During the training, the bias of the first training dataset train_1 was not estimated but fixed to an identity function. It can be seen that the prediction results on the validation set are less biased in this case. Furthermore, it can be seen that the model successfully learns the different biases when the estimated orange line is compared to the original green line.


Fig. 6.5 Four example training runs with bias-aware loss on the validation dataset. GREEN LINE: Introduced bias. ORANGE LINE: Estimated bias. 1st Row: Without anchoring and first-order estimation. 2nd Row: Weighted MSE loss anchoring and first-order estimation. 3rd Row: Anchored with dataset and first-order estimation. 4th Row: Anchored with dataset and third-order estimation

Table 6.1 Training configurations of the synthesised speech quality experiments

Config  Biased data  Bias function  Anchored with dataset  MSE loss weight
1       No           No             –                       –
2       Yes          No             –                       –
3       Yes          1st            No                      0
4       Yes          1st            No                      0.01
5       Yes          1st            Yes                     0
6       Yes          3rd            No                      0
7       Yes          3rd            No                      0.01
8       Yes          3rd            Yes                     0

6.2.4 Configuration Comparisons

As a next step, the validation set accuracy is compared in terms of PCC for different training parameters. The considered configurations are shown in Table 6.1. The first configuration represents the results when no artificial bias is applied to the training datasets and therefore represents the best achievable results. The second configuration represents the results when artificial biases are applied, but no bias-aware loss is used.

Fig. 6.6 Boxplots of the results of 15 training runs on the synthesised speech quality validation set in terms of Pearson's correlation on the validation dataset (configurations in Table 6.1)

The best results on the validation dataset of 15 runs per configuration are presented as boxplots in Fig. 6.6. It can be seen that the model accuracy decreases significantly when the training datasets are exposed to biases (Config 1 vs Config 2). The best results on the biased data could be achieved with a bias-aware loss that is anchored to the first dataset, where the first-order (Config 5) and third-order (Config 8) bias estimation functions obtain similar results. Although the fourth dataset was exposed to a bias modelled with a third-order polynomial, the results show that a third-order estimation function does not notably improve the results. This presumably happens because the degrees of freedom are too high, and thus not only biases are modelled but also prediction errors are compensated. When these prediction errors do not occur in the calculated loss, they cannot be optimised for in the neural network training and therefore lead to lower prediction accuracy. Further, it can be noted that anchoring with the MSE loss weight leads to a higher variation in the accuracy of the individual runs. The bias-aware loss without anchoring or MSE loss weight obtains similar results as the weighted MSE loss configurations. Overall, the experiments show the efficiency of the proposed bias-aware loss when applied to the synthetic datasets, where the correlation can be increased from on average 0.77 (Config 2) to 0.92 (Config 5/7), which is almost as high as the average result without bias with a PCC of 0.94 (Config 1).


6.2.5 Speech Quality Dataset

To analyse the efficiency on real datasets, the publicly available ITU-T P Suppl. 23 (2008) speech quality datasets with subjective quality ratings are used (see also Sect. 7.1). It should be noted that the datasets come from different sources and are therefore likely to be exposed to subjective biases. Six of the datasets that were rated on an ACR scale are used: EXP1a, EXP1d, EXP1o, EXP3a, EXP3c, and EXP3d. The proposed algorithm is then analysed with a leave-one-dataset-out cross-validation, where the model is trained on the five remaining datasets. For each training run, the model of the epoch on which the best results in terms of PCC were achieved on the held-out validation dataset is saved. As prediction model, again the proposed CNN-SA-AP model from Chap. 3 is applied. The training was run 15 times with an early stop of 20 epochs on the validation PCC, a learning rate of 0.001, and a mini-batch size of 60. The bias estimation was anchored to a randomly chosen dataset. A sketch of this cross-validation split is shown below.
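The following minimal sketch only illustrates the leave-one-dataset-out folds over the six experiments; the training routine itself (learning rate, batch size, early stopping, anchoring) is summarised in the comments and not implemented here.

```python
datasets = ["EXP1a", "EXP1d", "EXP1o", "EXP3a", "EXP3c", "EXP3d"]

# one fold per held-out dataset: train on the remaining five, validate on the held-out one
folds = [(sorted(set(datasets) - {val_db}), val_db) for val_db in datasets]

for train_dbs, val_db in folds:
    print(f"validate on {val_db} | train on {', '.join(train_dbs)}")
    # each fold is trained 15 times with lr=0.001, mini-batch size 60,
    # early stopping (20 epochs) on the validation PCC, and the bias
    # estimation anchored to one randomly chosen training dataset
```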

The results of the training runs on the validation datasets are presented as boxplots in Fig. 6.7. The training was run for different rth threshold values from 0.5 to 0.8. The boxplot with a threshold of rth = 1 corresponds to training with MSE loss, as the bias-aware loss is not activated in this case.

Fig. 6.7 Boxplots of the per-condition results of 15 training runs on the leave-one-dataset-out validation in terms of Pearson's correlation. The boxplots with rth = 1 correspond to MSE loss, as the bias-aware loss is not activated during training

It can be seen that for some datasets a lower, and for others a higher, threshold gives better results. The average results for each dataset over all runs are also shown in Table 6.2, where the last column represents the improvement of the best bias-aware loss result compared to the MSE loss in terms of average PCC.

Table 6.2 Average per-condition PCC of the leave-one-dataset-out cross-validation after 15 training runs. The last column shows the improvement between the best bias-aware loss result and the MSE loss result

Dataset  Bias-aware rth=0.5  rth=0.6  rth=0.7  rth=0.8  MSE loss  Improvement
EXP1a    0.957               0.958    0.962    0.960    0.949     0.013
EXP1d    0.963               0.963    0.965    0.964    0.945     0.020
EXP1o    0.935               0.943    0.950    0.941    0.959     –
EXP3a    0.921               0.910    0.910    0.894    0.872     0.049
EXP3c    0.884               0.888    0.882    0.900    0.890     0.010
EXP3d    0.924               0.915    0.920    0.923    0.886     0.038

It can be noted that the efficiency of the proposed algorithm depends on the datasets it has been trained and evaluated on. For five of the six datasets, the proposed loss outperforms the vanilla MSE loss function, whereas for dataset EXP1o, the vanilla MSE loss performs better. On the EXP3a dataset, a performance increase of about 0.05 in terms of PCC can be observed, showing that the speech quality prediction can notably be improved by applying the proposed bias-aware loss.

6.3 Summary

In this chapter, a bias-aware loss function was presented that accounts for the biases that occur when subjective quality experiments are conducted. The proposed algorithm learns the unknown biases between different datasets automatically during model training. The bias estimation is updated after every epoch and is then used to improve the loss of the neural network training. By considering the biases, the loss function does not punish prediction errors that solely occur due to the subjective bias, while the rank order within the dataset may have been predicted correctly.

The proposed method outperformed the vanilla MSE loss on the synthesised data with a relative improvement of 19% by increasing the correlation from 0.77 to 0.92 on average. While the proposed algorithm clearly outperforms the vanilla MSE loss on synthesised data and learns the introduced biases successfully, the performance on real data largely depends on the datasets that it is applied to. In cases where no biases are present between datasets, the model gives similar results as a vanilla MSE loss. However, it was shown that the proposed algorithm can notably improve the performance in most cases. Because the loss function can be applied without any additional data or computational costs, the proposed method can be an easy and helpful tool to improve the accuracy of speech quality prediction models and could also be applied to other tasks, such as image or video quality prediction.

Chapter 7

NISQA: A Single-Ended Speech Quality Model

In this chapter, the single-ended speech quality prediction model NISQA is presented. It is based on the neural network architectures described in Chaps. 3 and 5 and trained with bias-aware loss, as described in Chap. 6. Overall, the model is trained and evaluated on a wide variety of 78 different datasets. In Chaps. 3–5, the datasets presented in Sect. 3.1 were used to analyse different network architectures and optimise hyper-parameters on a training and validation set that come from the same data distribution. However, the model presented in this chapter aims to deliver robust speech quality estimation for unknown speech samples. To achieve this aim, it is important to use speech samples that are highly diverse and come from different sources (i.e. data distributions). Because of this, in addition to the four datasets presented in Sect. 3.1, also speech datasets from the POLQA pool, the ITU-T P Suppl. 23 pool, and further internal datasets are used. The model is then finally evaluated on a live-talking test dataset that contains recordings of real phone calls. In the following section, these datasets are presented, and in the second section, the neural network architecture of the final model and the training process are outlined. In the third section, the model is evaluated, and lastly, this chapter is summarised in the fourth section.

7.1 Datasets

In this section, the available datasets are presented, where a large part of the datasets comes from the ITU-T POLQA pool. Another set of narrowband datasets is taken from the ITU-T P Suppl. 23 pool. Additionally, a collection of other, mostly internal, datasets that were created for different projects is presented. Finally, a new test set with live-talking conditions is presented.


7.1.1 POLQA Pool

The POLQA pool was set up for the ITU-T POLQA competition. In this competition, a successor model for the ITU-T standard for speech quality prediction PESQ was developed and selected. In contrast to PESQ, POLQA was developed to estimate speech quality in a super-wideband context. Consequently, the pool contains a wide variety of super-wideband datasets from different sources that were created for training and validation of the POLQA model. It also contains older narrowband and wideband datasets that were used for the PESQ and the P.563 models. This dataset pool is not publicly available.

An overview of the 55 datasets can be seen in Table 7.1. The datasets 101–603 at the top of the table were specifically created for the POLQA competition, where the datasets ending with x01 and x02 were used as a training set, and the datasets ending with x03 and x04 were used as a test set. A general description of these datasets can be found in Appendix II of ITU-T Rec. P.863 (2018). Overall, the pool contains eight different languages: English, German, French, Swedish, Dutch, Czech, Chinese, and Japanese. The pool contains most of the common codecs used in telecommunication networks. However, it does not contain the more recent super-wideband codecs Opus and EVS. Among other conditions, the pool contains packet loss, amplitude clipping, different background noises, and live recordings.

Only for the three super-wideband Swissqual datasets are dimension speech quality ratings available, which is indicated by the Dim column in the table. These dimension ratings were added later at the Quality and Usability Lab for the ITU-T work item P.AMD. In these experiments, the same number of listeners participated as for the overall quality.

7.1.2 ITU-T P Suppl. 23

ITU-T P Suppl. 23 (2008) is an older pool of datasets, containing speech samples used in the characterisation tests of the ITU-T 8 kbit/s codec G.729. It is one of the few publicly available speech quality dataset pools and is therefore often used in the literature. Overall, the pool contains ten datasets, of which three sets are rated on a DCR scale. In this work, only the seven datasets that were rated on an ACR scale are used. An overview of these datasets can be seen in Table 7.2. They contain four different languages: English, French, Italian, and Japanese. All of the datasets are rated in a narrowband context and contain coded speech with frame loss and noise conditions.


Table 7.1 Overview over POLQA pool datasets

Dataset             Scale  Con  Files/con  Votes/con  Votes/file  Lang  Dim
101_ERICSSON        SWB    57   12         119        10          se    No
102_ERICSSON        WB     54   12         159        13          se    No
103_ERICSSON        SWB    54   12         95         8           se    No
104_ERICSSON        NB     55   12         95         8           se    No
201_FT_DT           SWB    48   4          96         24          fr    No
202_FT_DT           SWB    49   4          96         24          fr    No
203_FT_DT           SWB    54   4          96         24          fr    No
204_FT_DT           WB     53   4          96         24          fr    No
301_OPTICOM         SWB    50   4          96         24          cz    No
302_OPTICOM         SWB    44   4          96         24          en    No
303_OPTICOM         SWB    54   4          96         24          en    No
401_PSYTECHNICS     SWB    48   24         192        8           en    No
402_PSYTECHNICS     WB     48   24         192        8           en    No
403_PSYTECHNICS     SWB    48   24         192        8           en    No
404_PSYTECHNICS     NB     48   24         192        8           en    No
501_SWISSQUAL       SWB    50   4          96         24          de    Yes
502_SWISSQUAL       SWB    50   4          96         24          de    Yes
503_SWISSQUAL       SWB    54   4          96         24          de    Yes
504_SWISSQUAL       NB     49   4          96         24          de    No
601_TNO             SWB    50   4          96         24          nl    No
602_TNO             SWB    50   4          96         24          nl    No
603_TNO             SWB    48   4          96         24          nl    No
ATT_iLBC            NB     45   29         233        8           en    No
BT_P862_BGN_ENG     NB     49   4          96         24          en    No
BT_P862_PROP        NB     50   24         96         4           en    No
DT_P862_1st         NB     50   4          96         24          de    No
DT_P862_BGN_GER     NB     49   4          112        28          de    No
DT_P862_Share       NB     50   4          96         24          de    No
ERIC_AMR_4B         NB     36   12         112        9           se    No
ERIC_FIELD_GSM_EU   NB     234  1          N/A        N/A         se    No
ERIC_FIELD_GSM_US   NB     372  1          N/A        N/A         en    No
ERIC_P862_NW_MEAS   NB     46   4          116        29          se    No
FT_P563_PROP        NB     50   16         96         6           fr    No
GIPS_EXP1           NB     38   48         288        6           en    No
GIPS_Exp3           WB     36   48         288        6           en    No
GIPS_Exp4           SWB    36   4          100        25          en    No
HUAWEI_1            NB     24   24         192        8           zh    No
HUAWEI_2            NB     24   24         192        8           zh    No
LUC_P563_PROP       NB     50   4          96         24          en    No
NTT_PTEST_1         NB     90   16         160        10          ja    No
NTT_PTEST_2         WB     80   14         160        12          ja    No
OPT_P563_PROP       NB     50   12         288        24          de    No
PSY_P563_PROP       NB     54   12         96         8           en    No
QUALCOMM_EXP1b      NB     32   64         256        4           en    No
QUALCOMM_EXP1w      WB     40   64         256        4           en    No
QUALCOMM_EXP2b      NB     32   64         256        4           en    No
QUALCOMM_EXP3w      NB     32   64         256        4           en    No
QUALCOMM_EXP4       NB     32   64         256        4           en    No
QUALCOMM_EXP5       WB     56   64         256        4           en    No
QUALCOMM_EXP6a      NB     48   48         192        4           en    No
QUALCOMM_EXP6b      NB     48   48         192        4           en    No
SQ_P563_PROP        NB     50   12         96         8           de    No
TNO_P862_KPN_KIT97  NB     59   1          60         60          nl    No
TNO_P862_NW_EMU     NB     50   4          96         24          nl    No
TNO_P862_NW_MEAS    NB     46   4          96         24          nl    No

Table 7.2 Overview over the ITU-T P Suppl. 23 datasets

Dataset            Scale  Con  Files/con  Votes/con  Votes/file  Lang  Dim
ITU_SUPPL23_EXP1a  NB     44   4          96         24          fr    No
ITU_SUPPL23_EXP1d  NB     44   4          96         24          ja    No
ITU_SUPPL23_EXP1o  NB     44   4          96         24          en    No
ITU_SUPPL23_EXP3a  NB     50   4          96         24          fr    No
ITU_SUPPL23_EXP3c  NB     50   4          96         24          it    No
ITU_SUPPL23_EXP3d  NB     50   4          96         24          ja    No
ITU_SUPPL23_EXP3o  NB     50   4          96         24          en    No

7.1.3 Other Datasets

Besides these two large pools and the new test set, a set of other datasets from various projects is used for training the model. An overview of these datasets can be seen in Table 7.3. The three DT datasets are older internal German datasets with codec and noise conditions. The listening experiments were conducted at the Telekom Innovation Laboratories in Berlin. The datasets NISQA_TRAIN_LIVE, NISQA_TRAIN_SIM, NISQA_VAL_LIVE, and NISQA_VAL_SIM are described in Sect. 3.1. The other datasets are described in more detail in the following.

Table 7.3 Overview over other speech quality datasets

Dataset              Scale  Con    Files/con  Votes/con  Votes/file  Lang  Dim
PAMD_DTAG_1          WB     66     12         40         3           de    Yes
PAMD_DTAG_2          WB     76     12         48         4           de    Yes
PAMD_Orange_1        SWB    56     4          72         18          fr    Yes
TCD                  SWB    76     4          96         24          en    No
DT_1                 WB     125    151        260        N/A         de    No
DT_2                 WB     26     4          N/A        N/A         de    No
DT_3                 SWB    29     42         84         N/A         de    No
TUB_AUS1             FB     50     12         89         7           en    Yes
TUB_DIS              SWB    20     2          82         41          de    Yes
TUB_LIK              SWB    8      12         240        20          de    No
TUB_VUPL             SWB    15     4          144        36          de    No
NISQA_TRAIN_SIM      FB     10000  1          5          5           en    Yes
NISQA_TRAIN_LIVE     FB     1020   1          5          5           en    Yes
NISQA_VAL_SIM        FB     2500   1          5          5           en    Yes
NISQA_VAL_LIVE       FB     200    1          5          5           en    Yes
NISQA_TEST_LIVETALK  FB     58     4          96         24          de    Yes

PAMD
The first three datasets come from the ITU-T P.AMD pool and also contain speech quality dimension scores. The German datasets PAMD_DTAG_1 and PAMD_DTAG_2 were created by Wältermann (2012) for a degradation decomposition analysis, where they are called "Dim-Scaling 1" and "Dim-Scaling 2" and are described in detail. The datasets contain ratings for the overall quality, Noisiness, Coloration, and Discontinuity. However, no Loudness ratings are available. The dataset PAMD_Orange_1 was created by Orange for the training phase of the ITU-T work item P.AMD. It contains speech samples from the POLQA pool datasets 201_FT_DT and 202_FT_DT and is described in ITU-T SG12 C.287 (2015). This dataset only contains speech quality dimension ratings but no overall speech quality ratings.

TCD
The TCD-VoIP dataset is a publicly available dataset created for assessing quality in VoIP applications and is described in detail by Harte et al. (2015). It contains samples with common VoIP degradations and a set of subjective opinion scores from 24 listeners. Overall, there are six degradation types: chopped speech, clipped speech, competing speaker, echo, noise, and MNRU. The samples were annotated in two different tests, where the echo conditions were rated by different participants than the other conditions. Because of this, the echo conditions were removed for this work, since they are usually not within the scope of a listening-only test.

TUB_DIS
Dataset TUB_DIS was created as part of a student project. It consists of only 20 conditions but was also annotated with speech quality dimensions. The reference files of this dataset are taken from the POLQA pool dataset 501_SWISSQUAL. The listening experiments were conducted at the Quality and Usability Lab. The dataset focuses on discontinuity conditions and contains, besides anchor conditions, packet-loss conditions for the G.722 and the Opus codec.


TUB_LIK
This dataset was created to investigate the influence of voice likeability on speech quality perception and is described in the work by Gallardo et al. (2018). It contains only 8 conditions that are common in speech quality datasets (codecs, noise, packet loss). However, the reference files are extracted from interviews of the Nautilus Speaker Characterization Corpus (Fernández Gallardo and Weiss 2018). Therefore, the speech samples contain spontaneous/conversational speech that is more realistic for a non-intrusive speech quality assessment scenario than the commonly used double sentences. The listening experiments were conducted at the Quality and Usability Lab with 20 participants. For this work, the "likeable" and "non-likeable" voices were merged together, and the overall MOS of all speakers was calculated as the per-condition MOS.

TUB_VUPL
This dataset was created to investigate the influence of the position and voicing of concealed packet loss on speech quality and is described by Mittag et al. (2019). The source files for this dataset are the German ITU-T Rec. P.501 (2020) Annex C speech files. The samples are processed with packet loss and the EVS codec at the highest bitrate. Therefore, this dataset only contains degradations caused by packet loss but not by the almost lossless coding. The location of the lost frames was manually set to be at the start, middle, or end of an utterance and at an unvoiced or a voiced part of the speech signal. As anchor conditions, it contains the clean speech samples and MNRU noise. The listening experiments were conducted at the Quality and Usability Lab with 36 test participants.

TUB_AUS1
This is a new dataset, created for the validation of the model presented in this work and described in ITU-T SG12 C.440 (2019). It is similar to the NISQA training and validation datasets presented in Sect. 3.1, in that it also contains reference samples from the AusTalk dataset (Burnham et al. 2011) as source files and was also annotated in the crowd according to ITU-T Rec. P.808 (2018). It also contains ratings for the overall speech quality and the speech quality dimensions. In contrast to the datasets NISQA_TRAIN_LIVE / NISQA_VAL_LIVE and NISQA_TRAIN_SIM / NISQA_VAL_SIM, there are 12 files per condition available, which makes a per-condition analysis possible. Furthermore, there are more subjective ratings available (7 per file instead of 5 per file and 89 per condition instead of 5 per condition). While the speech segments of the NISQA datasets were automatically cut, the reference files of the TUB_AUS1 dataset were manually cut to contain a meaningful part of the sentence and also to include natural speech pauses. Some of the source files of the AusTalk dataset contain recording noise, or the interviewer can be heard asking questions. These files with lower speech quality were manually filtered out for this dataset through listening. Because the reference files are cleaner, more ratings are available, and several files per condition are available, this dataset is more suitable for double-ended models, such as POLQA, and allows for a fairer comparison.


The speakers were selected to have an equal age distribution from 19 to 75+. The dataset contains overall 600 files generated from 400 different reference files from 80 talkers with 5 sentences each. The conditions were selected to give a good quality distribution for each of the four quality dimensions. Besides P.SAMD anchor conditions, the dataset contains simulated conditions with different codecs (G.711, G.722, AMR-NB, AMR-WB, AMR-SWB) and live VoIP conditions (Skype, WhatsApp, Facebook, Line) with additional simulated distortions. The condition list is available in Table A.1 of the Appendix.

7.1.4 Live-Talking Test Set

In this live-talking database, the talkers spoke directly into the terminal device (i.e. a smartphone or a laptop). The test participants were instructed to talk loudly, quietly, with a loudspeaker, or with music in the background to obtain different test scenarios and speech quality distortions. Depending on the condition, the talkers were located in different environments, such as in a café, inside a car on the highway, inside a building with poor reception, in an elevator, a shopping centre, a subway/metro station, or on a busy street. Most of the talkers used their mobile phone to call either through the mobile network or with a VoIP service (Skype/Facebook). The calls were recorded on a laptop for the VoIP calls and on a Google Pixel 3 for the mobile phone calls. The conversations were either spontaneous or based on scenarios taken from ITU-T Rec. P.805 (2007). Then 6–12 s segments were extracted from the conversations and rated regarding their overall quality and speech quality dimensions. The final database consists of 58 different conditions (see Table A.2 in the Appendix) with 4 different files each, resulting in 232 files overall. The speech files were recorded from 8 different talkers (4 males and 4 females) in German, where for each condition 2 male and 2 female talkers were selected. The listening experiment was conducted with the same questionnaire as for the training and validation sets described in Sect. 3.1; however, instead of in the crowd, the test was performed in a soundproof room at the Quality and Usability Lab. Each file was rated by 24 test participants, resulting in 96 votes per condition. The dataset is described in detail by Chehadi (2020).

It should be noted that this test dataset is highly independent of the training and validation sets. It is the only dataset in which real phone calls were conducted, and the speech was acoustically transmitted from the talker's mouth to the microphone. While the live training and validation sets NISQA_TRAIN_LIVE / NISQA_VAL_LIVE also contained some real background noise distortions (e.g. street noise coming from an open window), in this dataset the test participants went to a number of different places with different ambient sounds. Also, each participant used their own device for the recordings, which is not contained in the training set. However, it should also be mentioned that many of the background sounds were filtered out by the smartphones' noise reduction processing. The eight talkers in this dataset are also unknown to the trained model, as they are not contained in the other datasets.

Fig. 7.1 Histograms of per-condition quality ratings of live-talking dataset

Figure 7.1 shows the histograms of the quality ratings per condition for the overall quality and the speech quality dimensions. Since, except for the anchor conditions, no simulated conditions are contained in this dataset, it is more difficult to guarantee an even MOS distribution. In the figure, it can be seen that the overall quality is relatively evenly distributed, considering that a MOS of 1.5 is usually the lowest score in a speech quality test and a MOS of 4.5 is usually the highest score. However, the distribution of the quality dimension scores is more uneven, especially for the dimension Loudness, which appears to be Gaussian distributed with a mean of 3.3. This can be explained by the fact that the talkers were instructed to speak very quietly or loudly during the phone calls; however, the automatic gain control of the sending and receiving devices normalises the playback volume. It should be noted that this kind of distribution, with values skewed towards the centre, can negatively influence the resulting Pearson correlation of a prediction model (Mittag et al. 2020).

7.2 Model and Training

In this section, the model and the training procedure are described. Because no dimension ratings are available for most datasets, a loss function that ignores the samples with missing ratings is also presented.

7.2.1 Model

The NISQA model is a CNN-SA-AP neural network with a CNN for framewise feature calculation, a self-attention network with two layers for time-dependency modelling, and a final network with attention-pooling. The CNN and time-dependency stages are shared between the overall quality and the speech quality dimension tasks, while there is a separate pooling network for each quality dimension and the overall quality. A top-level diagram of this model can be seen in Fig. 5.4 of Chap. 5.
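As a rough, non-authoritative sketch of this multi-task layout, the PyTorch snippet below wires a small framewise CNN, a two-layer self-attention encoder, and per-target attention-pooling heads. All layer sizes, kernel sizes, head counts, and the input segment shape are placeholder assumptions; the actual NISQA architecture is the one defined in Chaps. 3 and 5.

```python
import torch
import torch.nn as nn

class CnnSaAp(nn.Module):
    """Rough sketch of the CNN-SA-AP multi-task layout (all sizes are assumptions)."""

    def __init__(self, cnn_feat=384, d_model=256, n_targets=5):
        super().__init__()
        # framewise CNN applied to each Mel-spec segment (shared across tasks)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveMaxPool2d((6, 4)),
            nn.Flatten(), nn.Linear(16 * 6 * 4, cnn_feat), nn.ReLU(),
        )
        # self-attention time-dependency model with two layers (shared)
        self.proj = nn.Linear(cnn_feat, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.td = nn.TransformerEncoder(enc_layer, num_layers=2)
        # one attention-pooling head per target (overall MOS + 4 dimensions)
        self.att = nn.ModuleList([nn.Linear(d_model, 1) for _ in range(n_targets)])
        self.out = nn.ModuleList([nn.Linear(d_model, 1) for _ in range(n_targets)])

    def forward(self, x):                                  # x: (batch, time, 1, mel, width)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)   # framewise features
        h = self.td(self.proj(feats))                      # time-dependency modelling
        preds = []
        for att, out in zip(self.att, self.out):
            w = torch.softmax(att(h), dim=1)               # attention-pooling weights
            preds.append(out((w * h).sum(dim=1)))          # pooled features -> scalar rating
        return torch.cat(preds, dim=1)                     # (batch, 5)

model = CnnSaAp()
y = model(torch.randn(2, 10, 1, 48, 15))   # 10 Mel-spec segments per file (assumed shape)
print(y.shape)                             # torch.Size([2, 5])
```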

7.2.2 Bias-Aware Loss

Because the aim is to train a model that gives robust results even for unknown speech samples from different sources, many different datasets are applied for training and validation. The corresponding listening experiments were conducted in completely different surroundings, with different participants, languages, equipment, and quality ranges of the contained files. In fact, many of the datasets are narrowband datasets. That means that the highest possible quality (a clean narrowband signal) in these datasets would only be rated with a MOS of around 3.8 in a fullband scale listening experiment (ITU-T Rec. P.863.1 2019). To be able to increase the model's performance by adding these datasets with different subjective experiment biases, the bias-aware loss presented in Chap. 6 is applied. The model is anchored to the fullband dataset NISQA_TRAIN_SIM as described in Sect. 6.1.2 and therefore predicts speech quality in a fullband context.

7.2.3 Handling Missing Dimension Ratings

For most of the datasets, only overall speech quality ratings are available. This means these datasets cannot be used to train the dimension predictions. One solution to overcome this problem is to first train only the overall MOS part of the model with the datasets that only have overall MOS ratings available and then afterwards fine-tune the whole model with datasets that have overall MOS and dimension ratings available. However, the downside of this approach is the effect of catastrophic forgetting (French 1999) in neural networks. When neural networks are trained in sequence on different tasks or data, they tend to forget most of the initially learned features and mostly concentrate on the last learned task/data. Because of this, in this work, the model is trained with all available data, while the missing ratings are ignored when calculating the loss. To this end, instead of the MSE (mean square error), a NaN-MSE loss function L_NaN is used that ignores missing values, which are filled with "Not a Number" (NaN). It is implemented in PyTorch as shown in Algorithm 2. At first, the error between the bias-corrected prediction and the target value is calculated. When the target value of a sample is NaN, the error will also yield NaN. The mean square is then only calculated over a slice of the error vector for which the error is not NaN.

Algorithm 2 NaN-MSE loss L_NaN
Output: MSE omitting NaN values
1: error = y − b(ŷ)
2: idx = NOT( isnan(error) )
3: nan_mse = mean( error[idx]^2 )
4: return nan_mse

This bias-aware NaN-MSE loss is calculated for each dimension and the MOS to yield the overall loss l as follows:

l = L_NaN(y_MOS, ŷ_MOS, b_MOS) + L_NaN(y_NOI, ŷ_NOI, b_NOI) + L_NaN(y_COL, ŷ_COL, b_COL) + L_NaN(y_DIS, ŷ_DIS, b_DIS) + L_NaN(y_LOUD, ŷ_LOUD, b_LOUD)     (7.1)

During training, the loss is calculated on random mini-batches of size 160. That means that in every optimisation iteration, 160 samples are randomly chosen. For these samples, the predicted values are calculated (forward pass). Based on these predictions, the model weights are optimised through backpropagation. Because the files are sampled randomly, a varying number of dimension ratings may be available in each mini-batch, as some files do not have dimension ratings. It should be noted that this can lead to an unbalanced weighting of certain speech samples. For example, it could happen that only one speech sample with dimension ratings is contained within one mini-batch. The weights of the individual pooling layers are then only optimised on the basis of this one sample in that iteration. Also, the shared layers are then heavily influenced by this particular sample because the MOS loss, for which all samples are available, makes up only 1/5 of the overall loss. However, in practice, the influence of this effect appeared to be small and did not lead to negative results.

7.2.4 Training

The datasets are divided into a training, validation, and test set, where 59 datasets are used for training, 18 as a validation set, and one as an independent test set. The POLQA pool datasets that were used as a test set in the POLQA competition are used in the validation set. From the ITU-T P Suppl. 23 datasets, four are used for training and three for validation. The model was then trained with a batch size of 160, a learning rate of 0.001, and the Adam optimiser. After each epoch, the model weights were stored, and the results on the training and validation set were calculated as the average PCC across all datasets. The training was stopped after the validation PCC did not increase for more than 10 epochs. The model weights of the epoch with the best performance on the validation set were then selected as the final model. In the case of the final NISQA model presented in this section, this was the 74th epoch.

The training datasets and the per-file PCC for the overall quality and the quality dimensions are presented in Table 7.4. It can be seen that dimension ratings are available for only 8 datasets (and only 6 with Loudness ratings). However, the NISQA_TRAIN datasets, which have dimension ratings available, contain a large number of files with a wide variety of conversational speech from different talkers. The overall MOS PCC is on average 0.87 across all training datasets with a worst-case PCC of 0.76 and a best-case of 0.97.

Table 7.4 Training datasets with resulting per-file PCC r

Dataset             rMOS  rNOI  rCOL  rDIS  rLOUD  Scale  Lang  Con    Files  Votes/file
101_ERICSSON        0.81  N/A   N/A   N/A   N/A    SWB    se    57     684    10
102_ERICSSON        0.86  N/A   N/A   N/A   N/A    WB     se    54     648    13
201_FT_DT           0.93  N/A   N/A   N/A   N/A    SWB    fr    48     192    24
202_FT_DT           0.88  N/A   N/A   N/A   N/A    SWB    fr    49     196    24
204_FT_DT           0.87  N/A   N/A   N/A   N/A    WB     fr    53     212    24
301_OPTICOM         0.87  N/A   N/A   N/A   N/A    SWB    cz    50     200    24
302_OPTICOM         0.85  N/A   N/A   N/A   N/A    SWB    en    44     176    24
401_PSYTECHNICS     0.88  N/A   N/A   N/A   N/A    SWB    en    48     1152   8
402_PSYTECHNICS     0.91  N/A   N/A   N/A   N/A    WB     en    48     1152   8
501_SWISSQUAL       0.91  0.98  0.95  0.91  0.94   SWB    de    50     200    24
502_SWISSQUAL       0.90  0.97  0.96  0.88  0.92   SWB    de    50     200    24
601_TNO             0.90  N/A   N/A   N/A   N/A    SWB    nl    50     200    24
602_TNO             0.89  N/A   N/A   N/A   N/A    SWB    nl    50     200    24
ATT_iLBC            0.87  N/A   N/A   N/A   N/A    NB     en    45     1312   8
BT_P862_BGN_ENG     0.94  N/A   N/A   N/A   N/A    NB     en    49     196    24
BT_P862_PROP        0.81  N/A   N/A   N/A   N/A    NB     en    50     1200   4
DT_1                0.93  N/A   N/A   N/A   N/A    WB     de    125    18875  N/A
DT_2                0.80  N/A   N/A   N/A   N/A    WB     de    26     112    N/A
DT_3                0.93  N/A   N/A   N/A   N/A    SWB    de    29     1220   N/A
DT_P862_1st         0.85  N/A   N/A   N/A   N/A    NB     de    50     200    24
DT_P862_BGN_GER     0.94  N/A   N/A   N/A   N/A    NB     de    49     196    28
DT_P862_Share       0.84  N/A   N/A   N/A   N/A    NB     de    50     200    24
ERIC_AMR_4B         0.95  N/A   N/A   N/A   N/A    NB     se    36     1008   9
ERIC_FIELD_GSM_EU   0.87  N/A   N/A   N/A   N/A    NB     se    234    234    N/A
ERIC_P862_NW_MEAS   0.82  N/A   N/A   N/A   N/A    NB     se    46     184    29
FT_P563_PROP        0.84  N/A   N/A   N/A   N/A    NB     fr    50     799    6
GIPS_EXP1           0.97  N/A   N/A   N/A   N/A    NB     en    38     1824   6
GIPS_Exp3           0.96  N/A   N/A   N/A   N/A    WB     en    36     1728   6
GIPS_Exp4           0.84  N/A   N/A   N/A   N/A    SWB    en    36     144    25
HUAWEI_1            0.90  N/A   N/A   N/A   N/A    NB     zh    24     576    8
ITU_SUPPL23_EXP1a   0.86  N/A   N/A   N/A   N/A    NB     fr    44     176    24
ITU_SUPPL23_EXP1d   0.84  N/A   N/A   N/A   N/A    NB     ja    44     176    24
ITU_SUPPL23_EXP3a   0.84  N/A   N/A   N/A   N/A    NB     fr    50     200    24
ITU_SUPPL23_EXP3c   0.84  N/A   N/A   N/A   N/A    NB     it    50     200    24
LUC_P563_PROP       0.91  N/A   N/A   N/A   N/A    NB     en    50     200    24
NISQA_TRAIN_LIVE    0.87  0.81  0.69  0.74  0.84   FB     en    1020   1020   5
NISQA_TRAIN_SIM     0.94  0.91  0.91  0.92  0.89   FB     en    10000  10000  5
NTT_PTEST_1         0.84  N/A   N/A   N/A   N/A    NB     ja    90     1416   10
NTT_PTEST_2         0.83  N/A   N/A   N/A   N/A    WB     ja    80     1104   12
OPT_P563_PROP       0.90  N/A   N/A   N/A   N/A    NB     de    50     600    24
PAMD_DTAG_1         0.88  0.88  0.83  0.88  N/A    WB     de    66     792    3
PAMD_DTAG_2         0.86  0.83  0.92  0.89  N/A    WB     de    76     912    4
PAMD_Orange_1       N/A   0.96  0.93  0.96  0.84   SWB    fr    56     224    18
PSY_P563_PROP       0.91  N/A   N/A   N/A   N/A    NB     en    54     648    8
QUALCOMM_EXP1b      0.82  N/A   N/A   N/A   N/A    NB     en    32     2048   4
QUALCOMM_EXP1w      0.84  N/A   N/A   N/A   N/A    WB     en    40     2560   4
QUALCOMM_EXP2b      0.80  N/A   N/A   N/A   N/A    NB     en    32     2048   4
QUALCOMM_EXP3w      0.80  N/A   N/A   N/A   N/A    NB     en    32     2048   4
QUALCOMM_EXP4       0.78  N/A   N/A   N/A   N/A    NB     en    32     2048   4
QUALCOMM_EXP5       0.76  N/A   N/A   N/A   N/A    WB     en    56     3584   4
QUALCOMM_EXP6a      0.91  N/A   N/A   N/A   N/A    NB     en    48     2304   4
QUALCOMM_EXP6b      0.83  N/A   N/A   N/A   N/A    NB     en    48     2304   4
SQ_P563_PROP        0.82  N/A   N/A   N/A   N/A    NB     de    50     600    8
TCD                 0.88  N/A   N/A   N/A   N/A    SWB    en    76     304    24
TNO_P862_KPN_KIT97  0.92  N/A   N/A   N/A   N/A    NB     nl    59     59     60
TNO_P862_NW_EMU     0.82  N/A   N/A   N/A   N/A    NB     nl    50     200    24
TNO_P862_NW_MEAS    0.80  N/A   N/A   N/A   N/A    NB     nl    46     184    24
TUBDIS              0.93  0.93  0.91  0.98  0.89   SWB    de    20     40     41
TUBVUPL             0.84  N/A   N/A   N/A   N/A    SWB    de    15     60     36
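The early-stopping criterion described above (average PCC across the validation datasets, patience of 10 epochs) could be implemented roughly as in the sketch below; the helper names and the bookkeeping are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import pearsonr

def validation_metric(per_dataset_results):
    """Average PCC across validation datasets, as used for early stopping.

    per_dataset_results: dict mapping dataset name -> (subjective MOS, predicted MOS)
    """
    pccs = [pearsonr(y, y_hat)[0] for y, y_hat in per_dataset_results.values()]
    return float(np.mean(pccs))

# sketch of the early-stopping bookkeeping (patience of 10 epochs)
best_pcc, best_epoch, patience = -1.0, 0, 10
# for epoch in range(max_epochs):
#     ...train one epoch, predict all validation datasets...
#     pcc = validation_metric(val_results)
#     if pcc > best_pcc:
#         best_pcc, best_epoch = pcc, epoch
#         save_checkpoint(model, epoch)      # hypothetical helper
#     elif epoch - best_epoch >= patience:
#         break
```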

7.3 Results

In this section, the results on the validation and the test set are presented. Afterwards, the prediction behaviour of the model for different impairment levels of common speech distortions is analysed. At first, the evaluation metrics used in this section are presented.


7.3.1 Evaluation Metrics

The results are evaluated in terms of three metrics:

• Epsilon-insensitive RMSE after third-order polynomial mapping, denoted as RMSE*
• RMSE (root-mean-square error) after first-order polynomial mapping, denoted as RMSE
• Per-condition Pearson's correlation coefficient (PCC) without mapping, denoted as r

The epsilon-insensitive RMSE (RMSE*) after a third-order monotonic polynomial mapping is the ITU-T recommended metric for the evaluation of objective quality prediction models according to ITU-T Rec. P.1401 (2020). The RMSE* is similar to the traditional root-mean-square error (RMSE) but considers the confidence interval of the individual MOS scores. The mapping compensates for offsets, different biases, and other shifts between scores from the individual experiments, without changing the rank order. To calculate the RMSE*, at first, the predicted values of each dataset are mapped to the subjective values with a third-order monotonic polynomial. In this work, the mapping is applied with SciPy's Sequential Least Squares Programming (SLSQP) implementation, where a constraint is set on the first derivative to achieve a monotonic function. The mapped predicted values are then used to calculate the per-condition predicted values by averaging across the conditions of each dataset. After that, the prediction error Perror is calculated, which considers the per-condition confidence intervals as follows:

Perror(i) = \max\left( 0, \; |MOS_{sub}(i) - MOS_{obj}(i)| - ci_{95}(i) \right)     (7.2)

The confidence interval ci_95 is calculated with the following formula:

ci_{95} = t(0.05, M) \frac{\sigma}{\sqrt{M}},     (7.3)

where M denotes the number of votes per condition. The per-condition standard deviation is calculated with

\sigma_j = \sqrt{ \frac{K-1}{N-1} \sum_{l=1}^{L} \sigma_{j,l}^2 },     (7.4)

where K indicates the number of ratings per file and N the number of ratings per condition, and j represents the condition number and l the file number. The RMSE* is then finally calculated as

RMSE^* = \sqrt{ \frac{1}{J-d} \sum_{i=1}^{J} Perror(i)^2 },     (7.5)

where J denotes the number of conditions and d = 4 denotes the degrees of freedom of a third-order mapping function.

While the RMSE* is a helpful metric because it does not punish prediction errors for samples with high uncertainty in the subjective ratings, it is only meaningful when enough ratings per condition are available. In the datasets NISQA_VAL_SIM and NISQA_VAL_LIVE, only one file per condition with 5 ratings is available, which leads to a high confidence interval and therefore to a very small RMSE*. Furthermore, the monotonic third-order polynomial mapping is not trivial and can lead to different solutions, rendering a fair comparison between different models more difficult. However, since many of the validation datasets are annotated in a narrowband context but predicted by NISQA in a fullband context, the normal RMSE cannot be applied directly. Because of this, the standard RMSE after a straightforward first-order polynomial mapping on the per-file ratings is used as a second metric. Additionally, the commonly used PCC is used as a third metric. However, it should be noted that the PCC is dependent on the subjective MOS value distribution within each dataset and can thus, for example, lead to very low values while the prediction error may still be small, or vice versa.

For the ITU-T work item P.SAMD, the main and single performance criterion is the per-condition RMSE* after third-order mapping. The required objectives for unknown datasets are given as follows:

• Overall performance: The averaged RMSE* across all experiments and dimensions/overall quality shall be smaller than 0.35
• Worst-case performance: No single RMSE* per experiment and dimension/overall quality shall be above 0.5
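A minimal sketch of this metric is given below; it assumes that per-condition subjective MOS, mapped objective MOS, and ci95 values are already available, and it enforces the monotonicity constraint on a fixed MOS grid, which is a simplification of the exact ITU-T P.1401 procedure.

```python
import numpy as np
from scipy.optimize import minimize

def fit_monotonic_third_order(y_hat, y_sub, grid=np.linspace(1, 5, 50)):
    """Fit y_sub ≈ p(y_hat) with a third-order polynomial that is constrained
    to be monotonically increasing on the MOS grid (SLSQP, constraint on p')."""
    def poly(b, x):
        return b[0] + b[1] * x + b[2] * x ** 2 + b[3] * x ** 3

    def objective(b):
        return np.mean((y_sub - poly(b, y_hat)) ** 2)

    constraints = {"type": "ineq",
                   "fun": lambda b: b[1] + 2 * b[2] * grid + 3 * b[3] * grid ** 2}
    res = minimize(objective, x0=[0.0, 1.0, 0.0, 0.0],
                   method="SLSQP", constraints=constraints)
    return lambda x: poly(res.x, x)

def rmse_star(mos_sub_con, mos_obj_con, ci95_con, d=4):
    """Epsilon-insensitive RMSE* over per-condition values (Eqs. 7.2 and 7.5)."""
    perr = np.maximum(0.0, np.abs(mos_sub_con - mos_obj_con) - ci95_con)
    J = len(mos_sub_con)
    return np.sqrt(np.sum(perr ** 2) / (J - d))
```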

7.3.2 Validation Set Results: Overall Quality

The results on the validation set are firstly compared to double-ended models and then to other single-ended models.

Compared to Double-Ended Models
The prediction results are compared to the double-ended speech quality prediction models POLQA and DIAL, which require a clean reference for prediction. (The publicly available double-ended model VISQOL is excluded because it led to time-alignment errors without outputting a prediction for several files.) The results are shown in Table 7.5.

Table 7.5 Per-condition validation results for the overall quality compared to double-ended speech quality models

                                               | NISQA              | POLQA              | DIAL
Dataset            Scale  Lang  Con   Files    | r     RMSE  RMSE*  | r     RMSE  RMSE*  | r      RMSE  RMSE*
103_ERICSSON       SWB    se    54    648      | 0.85  0.38  0.29   | 0.87  0.34  0.24   | 0.78   0.45  0.35
104_ERICSSON       NB     se    55    660      | 0.77  0.47  0.36   | 0.91  0.31  0.19   | 0.76   0.49  0.38
203_FT_DT          SWB    fr    54    216      | 0.92  0.36  0.26   | 0.91  0.38  0.29   | 0.79   0.57  0.48
303_OPTICOM        SWB    en    54    216      | 0.92  0.33  0.20   | 0.93  0.31  0.16   | 0.71   0.59  0.43
403_PSYTECHNICS    SWB    en    48    1152     | 0.91  0.36  0.28   | 0.96  0.24  0.16   | 0.92   0.34  0.26
404_PSYTECHNICS    NB     en    48    1151     | 0.77  0.39  0.31   | 0.86  0.31  0.24   | 0.67   0.46  0.40
503_SWISSQUAL      SWB    de    54    216      | 0.92  0.34  0.24   | 0.94  0.29  0.18   | 0.85   0.46  0.35
504_SWISSQUAL      NB     de    49    196      | 0.92  0.37  0.25   | 0.87  0.45  0.34   | 0.73   0.63  0.50
603_TNO            SWB    nl    48    192      | 0.89  0.44  0.33   | 0.95  0.29  0.16   | 0.86   0.48  0.39
ERIC_FIELD_GSM_US  NB     en    372   372      | 0.79  0.36  N/A    | 0.75  0.39  N/A    | 0.71   0.42  N/A
HUAWEI_2           NB     zh    24    576      | 0.98  0.21  0.13   | 0.94  0.32  0.19   | 0.89   0.44  0.34
ITU_SUPPL23_EXP1o  NB     en    44    176      | 0.92  0.31  0.21   | 0.91  0.32  0.18   | 0.91   0.33  0.22
ITU_SUPPL23_EXP3d  NB     ja    50    200      | 0.92  0.27  0.14   | 0.85  0.36  0.23   | 0.84   0.36  0.23
ITU_SUPPL23_EXP3o  NB     en    50    200      | 0.91  0.30  0.15   | 0.88  0.35  0.22   | 0.87   0.36  0.22
NISQA_VAL_LIVE     FB     en    200   200      | 0.82  0.40  0.12   | 0.67  0.52  0.21   | -0.22  0.69  0.36
NISQA_VAL_SIM      FB     en    2500  2500     | 0.90  0.48  0.15   | 0.86  0.56  0.21   | 0.36   1.03  0.61
TUB_AUS1           FB     en    50    600      | 0.91  0.21  0.08   | 0.88  0.24  0.10   | 0.73   0.35  0.22
TUB_LIK            SWB    de    8     96       | 0.98  0.25  0.19   | 0.99  0.16  0.10   | 0.89   0.53  0.58

It should be noted that POLQA and DIAL were applied in fullband/super-wideband mode, also on the narrowband datasets, to allow a fairer comparison to the proposed NISQA model, which only predicts in a fullband context. Therefore, the results on the narrowband datasets should be considered with caution. It can be seen that POLQA outperforms NISQA on most of the POLQA pool datasets. However, the NISQA results on these datasets are still well below the required average RMSE* of 0.35 for each dataset; only for the narrowband dataset 104_ERICSSON is the RMSE* higher, at 0.36, which is also the worst-case performance and still well below the required maximum worst-case performance of 0.5. There does not seem to be a language influence on the prediction accuracy, since even for non-European languages such as Chinese and Japanese, for which fewer training files were available, a high prediction accuracy is achieved.

NISQA outperforms POLQA and DIAL on most of the datasets that contain conversational/spontaneous speech, such as NISQA_VAL_SIM, NISQA_VAL_LIVE, and TUB_AUS1. Because there is only one file per condition for the NISQA datasets, the RMSE rather than the RMSE* should be considered (the metrics on these datasets are consequently calculated per file). Considering this fact, the error on the NISQA_VAL_LIVE dataset with live conditions is still relatively low with an RMSE of 0.40, compared to an RMSE of 0.52 for POLQA and 0.69 for DIAL, which obtains a negative correlation, showing how challenging the prediction of the conditions in this dataset is.

Compared to Single-Ended Models
Table 7.6 shows the prediction results for the overall MOS on the validation set compared to the single-ended speech quality prediction models P.563, ANIQUE+, and WEnets. It should be noted that P.563 and ANIQUE+ are only available for narrowband signals; because of this, the speech samples were downsampled to 8 kHz before applying these models. WEnets is a publicly available deep learning based speech quality model that is trained on PESQ predictions. It can be seen that NISQA outperforms these three models notably on most datasets, except for the Suppl. 23 and the 404_PSYTECHNICS datasets. However, at least the Suppl. 23 datasets were available for the development of the P.563 and ANIQUE+ models. When datasets with conversational/spontaneous speech, such as NISQA_VAL_SIM, NISQA_VAL_LIVE, and TUB_AUS1, are considered, it can be seen that NISQA notably outperforms P.563, ANIQUE+, and WEnets, which only obtain correlations of 0.30–0.70 compared to NISQA with correlations of 0.82–0.91.

Table 7.6 Per-condition validation results for the overall quality compared to single-ended speech quality models

                          | NISQA              | P.563              | ANIQUE+            | WEnets
Dataset            Scale  | r     RMSE  RMSE*  | r     RMSE  RMSE*  | r     RMSE  RMSE*  | r     RMSE  RMSE*
103_ERICSSON       SWB    | 0.85  0.38  0.29   | 0.36  0.66  0.55   | 0.54  0.60  0.47   | 0.28  0.68  0.57
104_ERICSSON       NB     | 0.77  0.47  0.36   | 0.64  0.57  0.45   | 0.68  0.55  0.44   | 0.13  0.74  0.64
203_FT_DT          SWB    | 0.92  0.36  0.26   | 0.68  0.69  0.58   | 0.47  0.82  0.72   | 0.64  0.72  0.63
303_OPTICOM        SWB    | 0.92  0.33  0.20   | 0.85  0.44  0.33   | 0.71  0.59  0.47   | 0.43  0.76  0.63
403_PSYTECHNICS    SWB    | 0.91  0.36  0.28   | 0.81  0.50  0.41   | 0.77  0.54  0.40   | 0.78  0.53  0.45
404_PSYTECHNICS    NB     | 0.77  0.39  0.31   | 0.82  0.35  0.28   | 0.74  0.41  0.34   | 0.14  0.61  0.53
503_SWISSQUAL      SWB    | 0.92  0.34  0.24   | 0.71  0.62  0.51   | 0.61  0.70  0.58   | 0.59  0.71  0.59
504_SWISSQUAL      NB     | 0.92  0.37  0.25   | 0.83  0.50  0.37   | 0.79  0.56  0.45   | 0.54  0.77  0.67
603_TNO            SWB    | 0.89  0.44  0.33   | 0.83  0.53  0.42   | 0.69  0.69  0.58   | 0.59  0.77  0.64
ERIC_FIELD_GSM_US  NB     | 0.79  0.36  N/A    | 0.42  0.54  N/A    | 0.17  0.58  N/A    | 0.60  0.47  N/A
HUAWEI_2           NB     | 0.98  0.21  0.13   | 0.93  0.35  0.26   | 0.79  0.59  0.40   | 0.63  0.75  0.71
ITU_SUPPL23_EXP1o  NB     | 0.92  0.31  0.21   | 0.90  0.34  0.21   | 0.98  0.15  0.04   | 0.73  0.53  0.44
ITU_SUPPL23_EXP3d  NB     | 0.92  0.27  0.14   | 0.93  0.26  0.15   | 0.97  0.17  0.09   | 0.68  0.50  0.38
ITU_SUPPL23_EXP3o  NB     | 0.91  0.30  0.15   | 0.91  0.30  0.17   | 0.98  0.15  0.05   | 0.79  0.45  0.32
NISQA_VAL_LIVE     FB     | 0.82  0.40  0.12   | 0.42  0.64  0.31   | 0.51  0.61  0.28   | 0.36  0.66  0.30
NISQA_VAL_SIM      FB     | 0.90  0.48  0.15   | 0.45  0.99  0.56   | 0.54  0.93  0.50   | 0.30  1.05  0.59
TUB_AUS1           FB     | 0.91  0.21  0.08   | 0.62  0.40  0.26   | 0.65  0.39  0.21   | 0.70  0.36  0.23
TUB_LIK            SWB    | 0.98  0.25  0.19   | 0.85  0.60  0.56   | 0.85  0.61  0.72   | 0.59  0.93  1.09

7.3.3 Validation Set Results: Quality Dimensions

In this subsection, the validation results for the speech quality dimensions are presented. So far, there are only two double-ended models for the prediction of speech quality dimensions available: the P.AMD candidate model and DIAL. The results for the individual dimensions are shown in Tables 7.7, 7.8, 7.9, and 7.10.

Table 7.7 Validation results for noisiness dimension

                                              | NISQA              | P.AMD              | DIAL
Dataset         Scale  Lang  Con   Files      | r     RMSE  RMSE*  | r     RMSE  RMSE*  | r     RMSE  RMSE*
503_SWISSQUAL   SWB    de    54    216        | 0.94  0.26  0.14   | 0.88  0.35  0.24   | 0.84  0.39  0.25
TUB_AUS1        FB     en    50    600        | 0.97  0.16  0.05   | 0.81  0.35  0.21   | 0.88  0.29  0.17
NISQA_VAL_LIVE  FB     en    200   200        | 0.73  0.49  0.12   | 0.34  0.68  0.27   | 0.31  0.69  0.30
NISQA_VAL_SIM   FB     en    2500  2500       | 0.86  0.48  0.11   | 0.55  0.79  0.41   | 0.40  0.87  0.48

Table 7.8 Validation results for coloration dimension

                                              | NISQA              | P.AMD               | DIAL
Dataset         Scale  Lang  Con   Files      | r     RMSE  RMSE*  | r      RMSE  RMSE*  | r      RMSE  RMSE*
503_SWISSQUAL   SWB    de    54    216        | 0.84  0.39  0.27   | 0.90   0.31  0.18   | 0.88   0.34  0.21
TUB_AUS1        FB     en    50    600        | 0.84  0.28  0.13   | 0.81   0.29  0.14   | 0.81   0.30  0.16
NISQA_VAL_LIVE  FB     en    200   200        | 0.57  0.43  0.05   | -0.15  0.51  0.14   | -0.11  0.51  0.14
NISQA_VAL_SIM   FB     en    2500  2500       | 0.84  0.50  0.14   | 0.62   0.73  0.32   | 0.25   0.90  0.44


Table 7.9 Validation results for discontinuity dimension

                                              | NISQA              | P.AMD               | DIAL
Dataset         Scale  Lang  Con   Files      | r     RMSE  RMSE*  | r      RMSE  RMSE*  | r     RMSE  RMSE*
503_SWISSQUAL   SWB    de    54    216        | 0.86  0.31  0.20   | 0.87   0.31  0.18   | 0.74  0.42  0.33
TUB_AUS1        FB     en    50    600        | 0.92  0.23  0.11   | 0.77   0.37  0.23   | 0.61  0.46  0.31
NISQA_VAL_LIVE  FB     en    200   200        | 0.55  0.56  0.16   | -0.02  0.67  0.25   | 0.10  0.67  0.25
NISQA_VAL_SIM   FB     en    2500  2500       | 0.84  0.54  0.19   | 0.66   0.75  0.36   | 0.23  0.97  0.56

Table 7.10 Validation results for loudness dimension

                                              | NISQA              | P.AMD              | DIAL
Dataset         Scale  Lang  Con   Files      | r     RMSE  RMSE*  | r     RMSE  RMSE*  | r     RMSE  RMSE*
503_SWISSQUAL   SWB    de    54    216        | 0.91  0.29  0.17   | 0.90  0.31  0.17   | 0.94  0.23  0.11
TUB_AUS1        FB     en    50    600        | 0.74  0.32  0.21   | 0.66  0.36  0.21   | 0.62  0.38  0.25
NISQA_VAL_LIVE  FB     en    200   200        | 0.73  0.47  0.10   | 0.60  0.54  0.11   | 0.54  0.58  0.17
NISQA_VAL_SIM   FB     en    2500  2500       | 0.81  0.48  0.11   | 0.45  0.74  0.32   | 0.38  0.76  0.33


It can be seen that NISQA outperforms the P.AMD candidate and DIAL on all four datasets for the Noisiness dimension. On the other three dimensions, NISQA is outperformed by either P.AMD or DIAL on the 503_SWISSQUAL dataset, which contains typical P.800 double sentences. It should be noted that the 503_SWISSQUAL dataset was part of the training set of the P.AMD model. However, NISQA outperforms both double-ended models on all four dimensions and the overall speech quality for the three datasets that contain conversational/spontaneous speech. The most challenging dimension appears to be Discontinuity, for which on the NISQA_VAL_LIVE dataset only an RMSE of 0.56 can be achieved, which is still better than the performance of the DIAL and P.AMD models. Overall, the prediction accuracy for the quality dimensions is lower than the accuracy for the overall quality. That being said, the obtained errors are still well below the required RMSE* of 0.35.

7.3.4 Test Set Results

Finally, the model is evaluated on the test set, which was not used during the training or selection of the model and which contains live-talking conditions that are independent of the conditions and talkers contained in the other datasets. Since this dataset consists of recordings of real phone calls, no clean reference files are available. Thus, the prediction accuracy can only be compared to single-ended models, and only for the overall speech quality, since no other single-ended models for predicting the speech quality dimensions are available. The test set contains 58 conditions with 4 files each, yielding 232 files overall. The sentences are spoken by 8 different German talkers, and each file was rated by 24 listening test participants.

The results on the overall speech quality compared to the models P.563, ANIQUE+, and WEnets are presented in Table 7.11 (best model highlighted in bold). It can be seen that NISQA obtains excellent results with a PCC of 0.9 and an RMSE* of 0.24. While P.563 still achieves an acceptable correlation of 0.7, its RMSE* of 0.48 is very high and almost at the worst-case objective of P.SAMD. ANIQUE+ and the deep learning based WEnets obtain even higher RMSE* values. The results for the prediction of the quality dimensions are presented in Table 7.12. As mentioned before, no comparison is possible for these results; however, the worst-case RMSE* of 0.25 for the Discontinuity dimension is well below the P.SAMD objective of 0.35. The correlation results vary between 0.71 for the Loudness and 0.87 for the Coloration dimension. While the correlation of the Loudness prediction is rather low, the RMSE* is relatively small.

The correlation diagrams of the test set results are presented in Fig. 7.2, where the red line represents the third-order polynomial mapping used for the RMSE* calculation, and the light-blue error bars represent the per-condition 95% confidence intervals. In the diagram of the Loudness dimension, it can be seen that most values are clustered around a MOS of 3, which explains the relatively low correlation but small RMSE*.

Table 7.11 Test set results for overall quality

Dataset | Scale | NISQA r/RMSE/RMSE* | P.563 r/RMSE/RMSE* | ANIQUE+ r/RMSE/RMSE* | WEnets r/RMSE/RMSE*
NISQA_TEST_LIVETALK | FB | 0.90/0.35/0.24 | 0.70/0.58/0.48 | 0.56/0.68/0.53 | 0.66/0.61/0.50

Table 7.12 Test set results for speech quality dimensions

Dataset | Scale | NOI r/RMSE/RMSE* | COL r/RMSE/RMSE* | DIS r/RMSE/RMSE* | LOUD r/RMSE/RMSE*
NISQA_TEST_LIVETALK | FB | 0.76/0.47/0.20 | 0.87/0.31/0.17 | 0.83/0.40/0.25 | 0.71/0.36/0.17


Fig. 7.2 Test set correlation diagrams of subjective vs predicted scores


It can also be seen that for the Noisiness dimension, the values are scattered more widely around the polynomial line than for the overall quality. However, because of the wider confidence intervals, the Noisiness RMSE* is actually lower than the overall MOS RMSE*. Based on a visual inspection of the correlation diagrams, the overall MOS and the Coloration predictions appear to fit the identity line best, with only a few outliers. In the case of the Discontinuity prediction, the overall trend also follows the identity line; still, there are a number of outliers that are overestimated by the model, presumably interruptions that the model could not detect as such. A similar effect can be observed for the Noisiness dimension, for which a cluster of overestimated outliers can be seen. In contrast to that, for the Loudness estimation, most of the samples are underestimated. However, the Loudness ratings of this dataset are not ideally distributed for a final evaluation of the Loudness prediction, as too few samples obtained a low subjective Loudness score.
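The RMSE* used throughout this chapter denotes an epsilon-insensitive, per-condition RMSE that is computed after mapping the predictions onto the subjective scale with a third-order polynomial, in the sense of ITU-T Rec. P.1401. The following NumPy sketch illustrates the general idea only; the function name and the example numbers are made up, and the unconstrained polynomial fit omits the monotonicity constraint of the official procedure, so it does not reproduce the exact evaluation code behind the reported values.

```python
import numpy as np

def rmse_star(mos, pred, ci95, degree=3):
    """Epsilon-insensitive RMSE after a third-order polynomial mapping
    (sketch of a P.1401-style RMSE*; monotonicity constraint omitted)."""
    # 1) Map the predictions onto the subjective scale.
    coeffs = np.polyfit(pred, mos, degree)
    pred_mapped = np.polyval(coeffs, pred)
    # 2) Errors inside the per-condition 95% confidence interval count as zero.
    err = np.maximum(np.abs(mos - pred_mapped) - ci95, 0.0)
    # 3) RMSE with the degrees of freedom reduced by the mapping parameters.
    dof = max(len(mos) - (degree + 1), 1)
    return np.sqrt(np.sum(err ** 2) / dof)

# Illustrative per-condition values (not taken from the datasets):
mos  = np.array([4.3, 3.8, 3.1, 2.6, 2.0, 1.6])
pred = np.array([4.1, 3.9, 3.3, 2.4, 2.2, 1.7])
ci95 = np.array([0.15, 0.12, 0.14, 0.18, 0.16, 0.20])
print(round(float(rmse_star(mos, pred, ci95)), 3))
```

Because errors within the confidence interval are ignored, conditions with wide confidence intervals (such as the Noisiness ratings discussed above) can yield a lower RMSE* even when the points scatter more widely around the mapping curve.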

7.3.5 Impairment Level vs Quality Prediction

Finally, the prediction behaviour of the model is analysed in relation to varying impairment levels of different distortion types. The analysed distortions were also included in the NISQA training and validation datasets, and therefore subjective ratings are available for comparison. This analysis is conducted to check whether any unexpected predictions occur for certain impairment levels. Furthermore, the prediction should have a monotonic relation to the impairment level; for example, if the SNR of a noisy signal is decreased, the predicted quality should also decrease until a certain saturation point is reached. In particular, the following nine distortion types are analysed:

• Temporal clipping
• Bursty packet loss with concealment
• Highpass
• Lowpass
• Active speech level
• Amplitude clipping
• MNRU noise
• White background noise
• Pub background noise

For this analysis, twelve reference speech files from the NISQA_VAL_SIM dataset were randomly selected, with one female and one male speaker each from the AusTalk, UK-Ireland, and DNS-Challenge datasets (see Sect. 3.1.1 for the source datasets). All twelve files were then processed with each of the nine distortions at different levels.



Fig. 7.3 Predicted scores vs temporal clipping with different frame error rates. (a) MOS (individual files). (b) MOS. (c) NOI. (d) COL. (e) DIS. (f) LOUD

Temporal Clipping

The first analysed distortion is temporal clipping, where 20 ms speech frames are randomly set to zero and therefore cause interruptions in the speech signal. The prediction results in relation to the frame error rate (FER) are presented in Fig. 7.3. Subfigure (a) on the top left-hand side shows the overall quality predictions of the twelve individual files, visualised with different colours. Subfigure (b) to its right shows the average MOS prediction over these twelve files along with the 95% confidence interval. The other four Subfigures (c)–(f) show the average quality dimension predictions across the twelve files (the individual files are only displayed for the overall MOS). It can be seen that the overall MOS shows an exponentially decreasing relationship to the frame error rate, as it is also known from the ITU-T E-model (ITU-T Rec. P.107 2015). The curve is overall monotonic, with only a few spikes that can be explained by the random selection of the deleted frames: if the deleted frames fall within a silent segment of the speech file, they hardly evoke any quality degradation. It can also be seen that the diagram of the Discontinuity dimension looks similar to the overall MOS, as expected. In fact, the predicted quality decreases even faster for the Discontinuity than for the overall MOS. The other three dimensions also seem to be influenced by the temporal clipping, as they decrease as well, but much more slowly. While this seems to be an unwanted behaviour, the subjective ratings in Fig. 7.4 show a similar trend.
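For illustration, a minimal NumPy sketch of such a temporal-clipping condition is given below: 20 ms frames are randomly zeroed at a given frame error rate. The sampling rate, function name, and frame handling are assumptions for the example and may differ from the processing that was actually used to generate the dataset conditions.

```python
import numpy as np

def temporal_clipping(speech, fer, sr=48000, frame_ms=20, seed=0):
    """Randomly set 20 ms frames of a speech signal to zero at a given
    frame error rate (sketch of the temporal clipping condition)."""
    rng = np.random.default_rng(seed)
    frame_len = int(sr * frame_ms / 1000)
    out = speech.copy()
    n_frames = len(out) // frame_len
    for i in range(n_frames):
        if rng.random() < fer:                       # drop this frame
            out[i * frame_len:(i + 1) * frame_len] = 0.0
    return out

# Example: 2 s of a placeholder signal with 10% of the frames deleted
x = np.random.randn(2 * 48000).astype(np.float32)
y = temporal_clipping(x, fer=0.10)
```

Because the frames are selected at random, dropped frames that fall into speech pauses barely change the signal, which is one way to explain the small spikes in the otherwise monotonic prediction curve.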



Fig. 7.4 Subjective ratings vs temporal clipping with different frame error rates. (a) MOS. (b) NOI. (c) COL. (d) DIS. (e) LOUD

In Fig. 7.4, a scatter plot of the subjective MOS ratings vs the frame error rate is presented. These are the files used in the NISQA_TRAIN_SIM and NISQA_VAL_SIM datasets, each rated by five test participants. It should be noted that each dot represents a speech sample from a different reference file (and is not one of the twelve files used for the predicted scores). Because the ratings are quite noisy, a third-order regression line is displayed as well, alongside the 95% confidence interval. The figure shows that the test participants also rated the Noisiness, Coloration, and Loudness of files with a high FER lower. Overall, the NISQA model did not give any unexpected prediction results in relation to the frame error rate of the temporal clipping condition.

Bursty Concealed Packet Loss

As the next distortion, bursty packet loss with an applied concealment algorithm is analysed. To this end, the speech samples were processed with the EVS codec in 16.4 kbit/s mode, and then bursty packet loss with different frame error rates was applied. The NISQA prediction results can be seen in Fig. 7.5 and the subjective ratings are presented in Fig. 7.6. In Fig. 7.5b, it can be seen that the predicted MOS decreases more linearly when compared to the temporal clipping condition. This can be explained by the concealed lost frames, which do not degrade the perceived quality as strongly as unconcealed zero insertion. Also, the relationship between the MOS prediction and the frame error rate is less monotonic and shows more spikes. However, the effect of the concealment algorithm largely depends on the location of the lost frames in the speech signal (e.g. a voiced or unvoiced part of the speech), so that a completely monotonic curve cannot be expected.



Fig. 7.5 Predicted scores vs EVS codec with different bursty frame error rates. (a) MOS (individual files). (b) MOS. (c) NOI. (d) COL. (e) DIS. (f) LOUD


Fig. 7.6 Subjective ratings vs EVS codec with different bursty frame error rates. (a) MOS. (b) NOI. (c) COL. (d) DIS. (e) LOUD



Fig. 7.7 Predicted scores vs highpass with different cut-off frequencies. (a) MOS (individual files). (b) MOS. (c) NOI. (d) COL. (e) DIS. (f) LOUD

Again, the other dimensions are also affected by the packet loss, especially the Coloration dimension. This can be explained by speech distortions that appear when the concealment algorithm fails: in this case, the algorithm can produce robotic voices or artificial sounds that test participants judge as a Coloration degradation. It can also be seen that the Discontinuity estimation is almost equal to the overall MOS prediction. When the predicted values are compared to the subjective ratings of Fig. 7.6, no unexpected deviations can be observed.

Highpass

The next analysed degradation condition is a highpass filter with different cut-off frequencies. The prediction results are presented in Fig. 7.7 and the corresponding subjective ratings in Fig. 7.8. For this analysis, extreme values were chosen that were not contained in the training set. In Fig. 7.7b, an overall monotonic behaviour up to around 5000 Hz can be observed. There are distinctive spikes visible around 600, 1100, and 2500 Hz; although the spikes are not large, it is unclear where they originate from. Interestingly, the prediction values first go below 1 at around 2500 Hz and then start rising again at around 5000 Hz. In fact, with a highpass at this cut-off frequency, no speech can be perceived; therefore, the scores in this range correspond to the prediction of an empty file.



Fig. 7.8 Subjective ratings vs highpass with different cut-off frequencies. (a) MOS. (b) NOI. (c) COL. (d) DIS. (e) LOUD

When the predictions are compared with the subjective ratings of Fig. 7.8, no strong deviations can be observed. However, the predicted Noisiness and Discontinuity scores show a surprising non-monotonic relation for smaller cut-off frequencies from 600 to 2500 Hz. It is unclear where this unwanted behaviour stems from, as it cannot directly be observed in the subjective ratings. The non-monotonic behaviour for cut-off frequencies higher than 5 kHz is of a more theoretical nature, as such empty signals would usually not be applied to a speech quality prediction model.

Lowpass

In Fig. 7.9, the predicted scores of the lowpass filter analysis are presented; Fig. 7.10 shows the corresponding subjective values. It can be seen that, as expected, the Coloration dimension is affected most by the lowpass condition. Additionally, the Loudness dimension is largely affected as well, which makes sense, as the filtering removes energy from the speech signal. Overall, the relationship between the cut-off frequency and the predictions is fairly monotonic. No major unexpected behaviour can be observed when the subjective ratings of Fig. 7.10 are compared to the predicted values. However, there is an interesting, relatively steep decrease of the predicted Noisiness scores for extremely low cut-off frequencies.
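As an illustration of how such filter conditions can be generated, the following SciPy sketch applies a Butterworth high- or lowpass to a signal. The filter type, order, and function name are assumptions made for the example; the stimuli in the datasets may have been created with different filters.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def filter_speech(speech, cutoff_hz, sr=48000, kind="highpass", order=6):
    """Apply a Butterworth high- or lowpass filter to a speech signal
    (illustrative stand-in for the filter conditions analysed above)."""
    sos = butter(order, cutoff_hz, btype=kind, fs=sr, output="sos")
    return sosfilt(sos, speech)

x = np.random.randn(48000).astype(np.float32)   # placeholder signal
hp = filter_speech(x, 2500, kind="highpass")    # highpass at 2.5 kHz
lp = filter_speech(x, 1000, kind="lowpass")     # lowpass at 1 kHz
```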



Fig. 7.9 Predicted scores vs lowpass with different cut-off frequencies. (a) MOS (individual files). (b) MOS. (c) NOI. (d) COL. (e) DIS. (f) LOUD


Fig. 7.10 Subjective ratings vs lowpass with different cut-off frequencies. (a) MOS. (b) NOI. (c) COL. (d) DIS. (e) LOUD


Fig. 7.11 Predicted scores vs active speech level. (a) MOS (individual files). (b) MOS. (c) NOI. (d) COL. (e) DIS. (f) LOUD


Fig. 7.12 Subjective ratings vs active speech level. (a) MOS. (b) NOI. (c) COL. (d) DIS. (e) LOUD


Active Speech Level

As the next condition, the active speech level (ASL) is analysed. The predicted scores are presented in Fig. 7.11 and the corresponding subjective ratings in Fig. 7.12. The ASL is usually set to −26 dBov for subjective speech quality experiments, which can also be observed in the predicted values, as the highest quality is obtained for a speech level of around −26 dBov. Again, extreme values that were not contained in the training set have been used for the analysis. It can be seen that the predicted quality decreases when the speech level is increased or decreased from −26 dBov, as can also be observed in the subjective ratings. The effect of a changing speech level is even larger on the Loudness dimension, which shows a steeper slope. Interestingly, the subjective ratings in Fig. 7.12c show that speech signals with a lower or a higher level are perceived as noisier by the test participants, which can also be observed in the predicted values. Also, it can be seen that the prediction results increase again for very low speech levels. While this effect seems unwanted, at these levels there is no perceivable speech left within the signal, and therefore, similar to the highpass condition, the predicted values correspond to the prediction of an empty signal. Apart from this effect, the prediction behaviour is overall monotonic on both sides of the peak value and shows no notable deviation from the subjective values.

Amplitude Clipping

The next analysed condition is amplitude clipping. The clipping threshold denotes the amplitude at which the signal is clipped, where the highest possible signal level is normalised to 1.

Fig. 7.13 Predicted scores vs amplitude clipping at different thresholds. (a) MOS (individual files). (b) MOS. (c) NOI. (d) COL. (e) DIS. (f) LOUD



Fig. 7.14 Subjective ratings vs amplitude clipping at different thresholds. (a) MOS. (b) NOI. (c) COL. (d) DIS. (e) LOUD

The NISQA predicted values are presented in Fig. 7.13 and the corresponding subjective values in Fig. 7.14. The overall MOS prediction behaviour for this distortion is very monotonic, as can be seen in Fig. 7.13b. Interestingly, the most affected quality dimension is the Coloration, which can also be observed in the subjective ratings in Fig. 7.14d. The clipping of the amplitude leads to a distorted voice, which is perceived by the listening test participants as a Coloration distortion rather than as noise. This effect could also be observed for the packet-loss concealment condition, which can lead to robotic voices that are also perceived as a Coloration degradation. When the predicted scores and the subjective ratings are compared, it can be noted that the prediction of the Loudness scores seems to be slightly more pessimistic than the subjective Loudness ratings; otherwise, there are no major deviations.

MNRU Noise

Figure 7.15 presents the NISQA predicted scores over different Q levels of MNRU noise. Similar to amplitude clipping, MNRU is a signal-correlated noise that leads to a distorted voice. It can be seen that MNRU strongly affects all four quality dimensions. The most affected dimension is Coloration, followed by Noisiness and Loudness. A mostly monotonic behaviour between Q and the predictions can be seen. Only for the Coloration and Discontinuity dimensions are there small non-monotonic segments for extremely distorted signals with a Q value around 0. Interestingly, for extremely distorted signals, the overall MOS prediction goes below 1. These trends can also be observed for the subjective scores in Fig. 7.16, where ratings for Q values between 15 dB and 30 dB are available.


Fig. 7.15 Predicted scores vs MNRU noise with different levels. (a) MOS (individual files). (b) MOS. (c) NOI. (d) COL. (e) DIS. (f) LOUD


Fig. 7.16 Subjective ratings vs MNRU noise with different levels. (a) MOS. (b) NOI. (c) COL. (d) DIS. (e) LOUD


Fig. 7.17 Predicted scores vs white background noise at different levels. (a) MOS (individual files). (b) MOS. (c) NOI. (d) COL. (e) DIS. (f) LOUD

Overall, no major deviations between the objective scores and the subjective ratings can be noted.

White Background Noise

The next analysed degradation is white background noise. The predicted scores in relation to the SNR are presented in Fig. 7.17 and the corresponding subjective ratings are shown in Fig. 7.18. As expected, the most affected dimension in the predicted and subjective scores is the Noisiness, whose scores decrease faster for small SNRs than the overall MOS. Whereas the Loudness ratings also decrease for small SNRs, the other dimensions remain mostly unaffected and only decrease for extreme SNR values at which almost no speech is perceivable. The predicted values for the overall MOS and the quality dimensions are overall monotonic, and no unexpected behaviour can be observed when the predictions are compared to the subjective values.

Pub Background Noise

Finally, the predicted scores are analysed in relation to "pub" background noise at different levels. To this end, the speech signals were mixed with pub background noise that was not contained in the training or validation set. The predicted values can be seen in Fig. 7.19 and are compared to the subjective ratings presented in Fig. 7.20. It should be noted that the background noise files of the samples with subjective ratings were chosen randomly from a large noise dataset.
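A minimal sketch of how background noise can be mixed into a speech signal at a target SNR is shown below. It uses a simple RMS-based SNR definition, whereas SNRs in speech quality testing are usually defined relative to the active speech level, so the exact gains used for the dataset conditions may differ.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise signal into speech at a given SNR (RMS-based sketch)."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

x = np.random.randn(48000).astype(np.float32)       # placeholder speech
white = np.random.randn(48000).astype(np.float32)   # white noise
y = add_noise(x, white, snr_db=20)                  # e.g. white noise at 20 dB SNR
```

The same function can be used with a recorded noise file (e.g. pub noise) in place of the white noise array.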



Fig. 7.18 Subjective ratings vs white background noise at different levels. (a) MOS. (b) NOI. (c) COL. (d) DIS. (e) LOUD

Fig. 7.19 Predicted scores vs pub background noise at different levels. (a) MOS (individual files). (b) MOS. (c) NOI. (d) COL. (e) DIS. (f) LOUD



Fig. 7.20 Subjective ratings vs various background noises at different levels. (a) MOS. (b) NOI. (c) COL. (d) DIS. (e) LOUD

The curves presented in both figures are almost identical to the diagrams of the white background noise condition. However, the confidence intervals of the pub noise MOS predictions are slightly wider than for white noise. Overall, when the subjective and predicted scores are compared, no unexpected behaviour can be observed.

7.4 Summary

In this chapter, the single-ended speech quality model NISQA was presented. Besides the overall speech quality, it also predicts speech quality dimensions that can be used for speech degradation diagnosis. The model was trained on a large collection of 59 training datasets and 18 validation datasets. It was tested on an independent live-talking dataset that contains recordings of real phone calls in different environments, with speakers and equipment that are not contained in the training or validation sets. The model predicts the overall speech quality of the test set with a PCC of 0.9 and an RMSE* of 0.24. The speech quality dimensions are predicted with a PCC of 0.71–0.87 and an RMSE* of 0.17–0.25, which is well below the required P.SAMD objective RMSE* of 0.35.


Finally, the predicted scores were analysed in relation to certain speech impairments. It could be noted that a highpass-filtered signal can lead to unexpected Noisiness and Discontinuity predictions. Apart from this effect, the model was shown to give consistent speech quality predictions for different levels of the nine analysed impairments.

Chapter 8

Conclusions

In this thesis, a new deep learning based speech quality model was presented that, besides the overall quality, also predicts the four perceptual quality dimensions Noisiness, Coloration, Discontinuity, and Loudness. During the development, a number of different neural network architectures were analysed with regard to their suitability for speech quality prediction in Chap. 3.

Research Question 1: Which neural network architectures give the best performance for speech quality prediction, and how can they be combined?
It could be shown that a framewise CNN in combination with a self-attention time-dependency model and a final attention-pooling network (CNN-SA-AP) gives the best results when compared to the other considered neural network architectures.
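As a rough illustration of this architecture family, the following PyTorch sketch combines a framewise CNN over mel-spectrogram segments, a Transformer self-attention block for time-dependency modelling, and an attention-pooling layer. All layer sizes, the segment dimensions, and the class name are illustrative assumptions and do not reproduce the exact CNN-SA-AP configuration of Chap. 3.

```python
import torch
import torch.nn as nn

class CNNSelfAttentionAP(nn.Module):
    """Sketch of the CNN-SA-AP idea: framewise CNN, self-attention
    time-dependency block, attention pooling, and a MOS output layer."""
    def __init__(self, n_mels=48, seg_width=15, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(                   # framewise CNN per segment
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveMaxPool2d((12, 4)),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveMaxPool2d((6, 2)),
            nn.Flatten(), nn.Linear(32 * 6 * 2, feat_dim), nn.ReLU(),
        )
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                               dim_feedforward=256, batch_first=True)
        self.time_dependency = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.att_pool = nn.Linear(feat_dim, 1)      # attention-pooling weights
        self.head = nn.Linear(feat_dim, 1)          # MOS output

    def forward(self, segs):                        # segs: (batch, time, mels, width)
        b, t, m, w = segs.shape
        feats = self.cnn(segs.reshape(b * t, 1, m, w)).reshape(b, t, -1)
        feats = self.time_dependency(feats)
        w_att = torch.softmax(self.att_pool(feats), dim=1)
        pooled = (w_att * feats).sum(dim=1)         # weighted time pooling
        return self.head(pooled).squeeze(-1)

model = CNNSelfAttentionAP()
mos = model(torch.randn(2, 100, 48, 15))            # two dummy utterances
```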

In Chap. 4, it was shown how the clean reference can be incorporated into deep learning based speech quality prediction models. The reference and the degraded signals are sent through a Siamese CNN followed by a self-attention network that calculates comparable features for both signals. After that, the reference features are aligned to the degraded features, using a cosine similarity function as alignment score and a hard-attention mechanism. The features of both signals are subsequently combined in a fusion layer. The fused features are then sent through a second self-attention layer, followed by a final time-pooling layer. It was further shown that, by applying the cosine similarity function, the implemented time-alignment can achieve the same results as with perfectly aligned signals.
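The alignment step can be illustrated with a short PyTorch sketch: for each degraded frame feature, the reference frame with the highest cosine similarity is selected (hard attention) and combined with the degraded feature. The function name and the concatenation-based fusion are assumptions made for the example and stand in for the fusion layer discussed in Chap. 4.

```python
import torch

def align_reference(deg_feats, ref_feats):
    """Hard-attention alignment sketch: pick, for every degraded frame,
    the reference frame with the highest cosine similarity and fuse it
    with the degraded frame feature by concatenation."""
    deg_n = torch.nn.functional.normalize(deg_feats, dim=-1)   # (T_deg, D)
    ref_n = torch.nn.functional.normalize(ref_feats, dim=-1)   # (T_ref, D)
    sim = deg_n @ ref_n.T                                      # cosine similarity matrix
    idx = sim.argmax(dim=-1)                                   # hard attention
    aligned_ref = ref_feats[idx]                               # (T_deg, D)
    return torch.cat([deg_feats, aligned_ref], dim=-1)         # fused features

fused = align_reference(torch.randn(120, 128), torch.randn(150, 128))
```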


Research Question 2: How can the clean reference signal be incorporated into neural networks for speech quality prediction to build a full-reference speech quality model? And does including the reference improve the results?
By applying a Siamese neural network with an additional attention-like time-alignment and a subsequent fusion layer, the clean reference can be incorporated into deep learning based speech quality models. However, the double-ended model outperforms the single-ended model by a PCC difference of only 0.012. Considering the effort involved in an intrusive speech communication monitoring setup, this performance increase appears to be too small to justify the use of a double-ended model over a single-ended model that achieves similar results.

In Chap. 5, four different multi-task models were presented that share different numbers of layers between the quality dimension tasks. It was shown that the perceptual decomposition approach can successfully be implemented with a multi-task learning (MTL) approach that delivers consistent predictions between the overall MOS and the dimension scores. The prediction accuracy for all quality dimensions was higher than that of a single human rating. The dimensions that are the most difficult to predict are Coloration and Discontinuity, with a PCC of around 0.70. Loudness and Noisiness could be predicted with a PCC of around 0.79 and 0.81, respectively, while the overall MOS can be predicted with a PCC of 0.87. The overall lower prediction accuracy of the quality dimensions compared to the overall quality can be explained by the lower consistency of the subjective dimension ratings.

Research Question 3: Does multi-task learning of speech quality and speech quality dimensions improve the results when compared to individually trained models? How much of the neural network should be shared across the different tasks?
All dimensions and the overall MOS benefited from the multi-task approach and achieved better results when compared to the single-task models. In particular, the accuracy of the Coloration dimension could be raised from a PCC of around 0.69 to 0.71. It was shown that the models MTL-POOL and MTL-TD, which share only the pooling block or the pooling block and time-dependency block amongst tasks, achieved the best overall results. While the results could only be slightly improved with the multi-task approach, the computation time is drastically reduced when compared to running individual single-task models.
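A generic way to realise such parameter sharing is a common trunk on top of the pooled per-utterance features feeding several small task-specific heads, as in the PyTorch sketch below. The split between shared and task-specific blocks shown here is illustrative only and does not correspond exactly to any of the MTL-FC/POOL/TD/CNN variants.

```python
import torch
import torch.nn as nn

class MultiTaskQualityHeads(nn.Module):
    """Shared trunk with one regression head per quality task (sketch)."""
    def __init__(self, feat_dim=128, tasks=("mos", "noi", "col", "dis", "loud")):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.heads = nn.ModuleDict({t: nn.Linear(128, 1) for t in tasks})

    def forward(self, pooled):
        h = self.shared(pooled)
        return {t: head(h).squeeze(-1) for t, head in self.heads.items()}

model = MultiTaskQualityHeads()
preds = model(torch.randn(4, 128))                    # pooled features of 4 files
targets = {t: torch.rand(4) * 4 + 1 for t in preds}   # dummy MOS targets in [1, 5]
loss = sum(torch.mean((preds[t] - targets[t]) ** 2) for t in preds)
```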

In Chap. 6, a bias-aware loss function was presented that accounts for the biases that occur when subjective quality experiments are conducted.


The proposed algorithm learns the unknown biases between different datasets automatically during model training. The bias estimation is updated after every epoch and is then used in the loss of the neural network training. By considering the biases, the loss function does not punish prediction errors that occur solely because of a subjective bias while the rank order within the dataset is predicted correctly. While the bias-aware loss clearly outperforms the vanilla MSE loss on synthesised data and learns the introduced biases successfully, the performance on real data largely depends on the datasets to which it is applied. However, the bias-aware loss was shown to notably improve the prediction performance on most of the speech quality datasets.

Research Question 4: How can machine learning based speech quality models be trained from multiple datasets that are exposed to dataset-specific biases?
The models can be trained from multiple datasets by estimating the individual biases during training with a first-order polynomial function. The estimated biases can then be applied to the predicted values of the model to achieve almost the same prediction results as if the datasets were not exposed to any biases.
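A minimal sketch of this idea is given below: after each epoch, a first-order polynomial is fitted per dataset between the current predictions and the subjective MOS, and the fitted mapping is applied to the predictions inside the loss. The function names are assumptions, and the anchoring mechanism and minimum-accuracy threshold r_th of Chap. 6 are omitted.

```python
import numpy as np
import torch

def update_biases(pred, mos, dataset_ids):
    """Fit a first-order polynomial (slope, offset) per dataset that maps the
    current predictions onto the (biased) subjective MOS of that dataset."""
    biases = {}
    for d in set(dataset_ids):
        m = np.asarray(dataset_ids) == d
        slope, offset = np.polyfit(np.asarray(pred)[m], np.asarray(mos)[m], 1)
        biases[d] = (slope, offset)
    return biases

def bias_aware_mse(pred, mos, dataset_ids, biases):
    """MSE between bias-mapped predictions and MOS, so that errors caused
    purely by a dataset-specific bias are not punished."""
    slope = torch.tensor([biases[d][0] for d in dataset_ids], dtype=pred.dtype)
    offset = torch.tensor([biases[d][1] for d in dataset_ids], dtype=pred.dtype)
    return torch.mean((slope * pred + offset - mos) ** 2)

# Toy example: two datasets, the second one with an offset bias of -0.5 MOS.
pred = torch.tensor([3.0, 4.0, 3.0, 4.0])
mos = torch.tensor([3.0, 4.0, 2.5, 3.5])
ds = ["A", "A", "B", "B"]
biases = update_biases(pred.numpy(), mos.numpy(), ds)
loss = bias_aware_mse(pred, mos, ds, biases)   # close to 0 despite the bias in B
```

In a training loop, update_biases would be called with the predictions collected over one epoch, and the resulting mapping would then be reused in the loss of the following epoch.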

Finally, the single-ended and diagnostic speech quality model NISQA was presented. Besides the overall speech quality, it also predicts speech quality dimensions that can be used for speech degradation diagnosis. The model was trained on a large collection of 59 training datasets and 18 validation datasets. Furthermore, it was tested on an independent live-talking dataset that contains recordings of real phone calls in different environments, with speakers and equipment that are not contained in the training or validation sets. The predicted scores were also analysed in relation to nine technical speech impairments. It could be noted that a highpass-filtered signal can lead to unexpected Noisiness and Discontinuity predictions. Apart from this effect, the model was shown to give consistent speech quality predictions for different levels of the analysed impairments.

Objective 1: Develop a single-ended speech quality model that reliably predicts the speech quality of real phone calls in a fullband context on unknown data.
NISQA predicts the overall speech quality of the test set with a PCC of 0.9 and an RMSE* of 0.24. It outperforms the single-ended speech quality models P.563, ANIQUE+, and WEnets. NISQA is outperformed by the double-ended model POLQA on datasets from the POLQA pool. However, on datasets that contain spontaneous/conversational speech, NISQA outperforms POLQA.


Objective 2: Develop a single-ended model that, in addition to the overall quality, also reliably predicts the four speech quality dimensions Noisiness, Coloration, Discontinuity, and Loudness.
On the live-talking test set, NISQA predicts the speech quality dimensions Noisiness, Coloration, Discontinuity, and Loudness with a PCC of 0.76, 0.87, 0.83, and 0.71, respectively, and an RMSE* of 0.20, 0.17, 0.25, and 0.17. NISQA outperforms the double-ended diagnostic models P.AMD candidate and DIAL on the spontaneous/conversational datasets and obtains similar results on a dataset from the POLQA pool.

While NISQA was shown to predict speech quality reliably on all analysed datasets, it is interesting that it is still outperformed by POLQA on the POLQA pool datasets, although the POLQA pool training sets with the same reference files were used for training NISQA as well. One approach for increasing the accuracy on these datasets could be to identify the outliers in these datasets and then include more of the affected conditions in the training set.

In fact, the complexity of the proposed NISQA model is relatively low when compared to other deep learning models. Therefore, it would be interesting to investigate whether the complexity and accuracy can be increased when the training set size is increased. In this case, for example, the number of filters and layers in the CNN could be increased. However, the corpus size would need to be increased by at least an order of magnitude, and new kinds of conditions should be added as well. A further approach could be to add more subjective ratings to the existing NISQA datasets, which are only annotated with 5 crowdsourcing ratings.

Another interesting idea to improve the prediction performance in a multi-task approach would be to use technical parameters as auxiliary tasks. For the simulated NISQA datasets with speech quality ratings, these technical parameters were used during the generation of the speech samples and are stored together with the wave files. Thus, for example, codecs could be predicted as a classification task and frame error rates as a regression task. These auxiliary tasks could potentially help the model to regularise better for the prediction of the overall MOS and the speech quality dimensions. In this way, the model could also be extended to give an additional technical root-cause analysis output, similar to the models presented by Mittag and Möller (2018, 2019c).

The NISQA model could also be used as a basis for fine-tuning towards similar tasks. For example, Mittag and Möller (2020a) showed that the model can be used to predict the naturalness of synthetic speech and that the accuracy improved when it was pretrained on speech quality datasets. Therefore, the model could be fine-tuned to predict the quality of speech that was processed with noise suppression algorithms or to predict speech intelligibility for hearing aids.

Appendix A

Dataset Condition Tables

See Table A.1 for the condition list of the TUB_AUS dataset and Table A.2 for the condition list of the NISQA_TEST_LIVETALK dataset.

Table A.1 TUB_AUS condition list

Condition | Description | Device
1 | Anchor: FB clean | Simulated
2 | Anchor: FB P50MNRU 25 dB | Simulated
3 | Anchor: FB Noise 12 dB | Simulated
4 | Anchor: FB Level −20 dB | Simulated
5 | Anchor: FB BP 500–2500 Hz | Simulated
6 | Anchor: FB BP 100–5000 Hz + Level −10 dB | Simulated
7 | Anchor: FB time clipping 2% | Simulated
8 | Anchor: FB time clipping 20% | Simulated
9 | G.711 + Noise 20 dB + Level −13 dB | Simulated
10 | G.722 + Noise 25 dB + Level −18 dB | Simulated
11 | WhatsApp good connection | Phone
12 | WhatsApp good connection amplitude clipping | Phone
13 | WhatsApp good connection +10 dB noise (Traffic) | Phone
14 | WhatsApp bad connection | Phone
15 | WhatsApp good connection gain ramp | Phone
16 | Skype good connection | Laptop
17 | Skype good connection +10 dB noise (Mensa) | Laptop
18 | Skype bad connection | Laptop
19 | Skype bad connection +10 dB noise (Cafeteria) | Laptop
20 | Skype good connection amplitude clipping | Laptop
21 | Facebook good connection | Laptop
22 | Facebook good connection +10 dB noise (Pub) | Laptop


Table A.1 (continued)

Condition | Description | Device
23 | Facebook bad connection | Laptop
24 | Facebook good connection gain ramp | Laptop
25 | Facebook good connection gain ramp +10 dB noise (Traffic) | Laptop
26 | Line good connection | Phone
27 | Line good connection amplitude clipping | Phone
28 | Line good connection +10 dB noise (inside train) | Phone
29 | Line bad connection | Phone
30 | Line good connection gain ramp | Phone
31 | SWB | Simulated
32 | WB | Simulated
33 | NB | Simulated
34 | [email protected] | Simulated
35 | [email protected] | Simulated
36 | [email protected] | Simulated
37 | [email protected] | Simulated
38 | [email protected] + level 5 dB | Simulated
39 | [email protected] + Noise 20 dB (call center) | Simulated
40 | [email protected] | Simulated
41 | [email protected] PL 5% bursty | Simulated
42 | [email protected] PL 10% bursty | Simulated
43 | [email protected] PL 15% bursty | Simulated
44 | [email protected] PL 15% random | Simulated
45 | [email protected] PL 3% - selftandem x3 + noise 0 dB (Shopping Center) | Simulated
46 | [email protected] PL 3% - selftandem x3 + noise 15 dB | Simulated
47 | [email protected] PL 3% - selftandem x3 + level −10 dB | Simulated
48 | [email protected] PL 3% x G.711 x [email protected] PL 3% | Simulated
49 | [email protected] PL 3% x G.722 x [email protected] PL 3% | Simulated
50 | [email protected] PL 3% x G.722 x [email protected] PL 3% + Noise 15 dB (car2 130 kmh) | Simulated


Table A.2 NISQA_TEST_LIVETALK condition list

Nr | Condition
1 | Anchor: FB clean
2 | Anchor: FB P50MNRU 25 dB
3 | Anchor: FB noise 12 dB
4 | Anchor: FB level −20 dB
5 | Anchor: FB BP 500–2500 Hz
6 | Anchor: FB BP 100–5000 Hz + level −10 dB
7 | Anchor: FB time clipping 2%
8 | Anchor: FB time clipping 20%
9 | Mobile phone
10 | Skype/speaking loudly
11 | Skype/speaking quietly
12 | Skype/talker distant from microphone
13 | Skype/talker distant from microphone/speaking loudly
14 | Skype/talker distant from microphone/speaking quietly
15 | Skype/loudspeaker
16 | Skype/loudness variation
17 | Facebook
18 | Skype
19 | Facebook/speaking loudly
20 | Mobile phone/speaking loudly
21 | Mobile phone/speaking quietly
22 | Mobile phone/loudspeaker
23 | Fixed line/speaking quietly
24 | Facebook/loudspeaker
25 | Inside building (bad reception)/mobile phone
26 | Inside building (bad reception)/mobile phone/speaking loudly
27 | Inside building (bad reception)/mobile phone/loudspeaker
28 | Inside building (bad reception)/Skype
29 | Inside building (bad reception)/Facebook/loudspeaker
30 | Inside building (bad reception)/Skype/speaking quietly
31 | Environmental noise (e.g. shopping centre)/mobile phone
32 | Environmental noise (e.g. shopping centre)/mobile phone/speaking quietly
33 | Environmental noise (e.g. shopping centre)/mobile phone/talker distant from microphone
34 | Environmental noise (e.g. shopping centre)/Skype
35 | Subway station noise (bad reception)/mobile phone
36 | Subway station noise (bad reception)/mobile phone/speaking quietly
37 | Subway station noise (bad reception)/mobile phone/talker distant from microphone
38 | Subway station noise (bad reception)/Facebook
39 | Subway station noise (bad reception)/Skype
40 | Subway station noise (bad reception)/Facebook/speaking loudly
41 | Subway station noise (bad reception)/Facebook/speaking quietly


Table A.2 (continued)

Nr | Condition
42 | Café noise/mobile phone
43 | Café noise/mobile phone/speaking quietly
44 | Café noise/mobile phone/talker distant from microphone
45 | Car on highway noise (bad reception)/mobile phone
46 | Car on highway noise (bad reception)/mobile phone/distant from microphone
47 | Car on highway noise (bad reception)/Facebook
48 | Car on highway noise (bad reception)/Facebook/speaking loudly/background music
49 | Car on highway noise (bad reception)/Skype/speaking quietly/background music
50 | Elevator (bad reception)/mobile phone
51 | Street noise/mobile phone
52 | Street noise/mobile phone/speaking loudly
53 | Street noise/mobile phone/talker distant from microphone
54 | Street noise/Facebook
55 | Metro station/mobile phone
56 | Metro station/mobile phone/speaking quietly
57 | Metro station/Facebook
58 | Metro station/Skype

Appendix B

Train and Validation Dataset Dimension Histograms

Figures B.1, B.2, B.3, and B.4 show the subjective MOS histograms of the training and validation datasets for the four quality dimensions.


Fig. B.1 Noisiness histogram of the simulated and live training and validation datasets. (a) Simulated training set: NISQA_TRAIN_SIM. (b) Simulated validation set: NISQA_VAL_SIM. (c) Live training set: NISQA_TRAIN_LIVE. (d) Live validation set: NISQA_VAL_LIVE



Fig. B.2 Coloration histogram of the simulated and live training and validation datasets. (a) Simulated training set: NISQA_TRAIN_SIM. (b) Simulated validation set: NISQA_VAL_SIM. (c) Live training set: NISQA_TRAIN_LIVE. (d) Live validation set: NISQA_VAL_LIVE

Fig. B.3 Discontinuity histogram of the simulated and live training and validation datasets. (a) Simulated training set: NISQA_TRAIN_SIM. (b) Simulated validation set: NISQA_VAL_SIM. (c) Live training set: NISQA_TRAIN_LIVE. (d) Live validation set: NISQA_VAL_LIVE


Fig. B.4 Loudness histogram of the simulated and live training and validation datasets. (a) Simulated training set: NISQA_TRAIN_SIM. (b) Simulated validation set: NISQA_VAL_SIM. (c) Live training set: NISQA_TRAIN_LIVE. (d) Live validation set: NISQA_VAL_LIVE

References

3GPP TS 26.071, Mandatory speech codec speech processing functions; AMR speech Codec; General description. 3GPP, Sophia Antipolis Valbonne (1999) 3GPP TS 26.171, Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; General description. 3GPP, Sophia Antipolis Valbonne (2001) 3GPP TS 26.441, Codec for enhanced voice services (EVS); General overview. 3GPP, Sophia Antipolis Valbonne (2014) S. Amiriparian, M. Gerczuk, S. Ottl, N. Cummins, M. Freitag, S. Pugachevskiy, A. Baird, B. Schuller, Snore sound classification using image-based deep spectrum features, in Proceedings of Interspeech 2017, Stockholm (2017) A.R. Avila, J. Alam, D. O’Shaughnessy, T.H. Falk, Intrusive quality measurement of noisy and enhanced speech based on i-vector similarity, in Proceedings of 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), Berlin (2019a) A.R. Avila, H. Gamper, C. Reddy, R. Cutler, I. Tashev, J. Gehrke, Non-intrusive speech quality assessment using neural networks, in Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019b) J. Ba, J. Kiros, G.E. Hinton, Layer normalization. ArXiv, abs/1607.06450 (2016) D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in Proceedings of 2015 International Conference on Learning Representations (ICLR), San Diego (2015) J.G. Beerends, J.A. Stemerdink, A perceptual speech-quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42(3), 115–123 (1994) J.G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullmann, J. Pomy, M. Keyhl, Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for endto-end speech quality measurement part I – Temporal alignment. J. Audio Eng. Soc. 61(6), 366–384 (2013a) J.G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullmann, J. Pomy, M. Keyhl, Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for endto-end speech quality measurement part II – Perceptual model. J. Audio Eng. Soc. 61(6), 385– 402 (2013b) J.G. Beerends, N.M.P. Neumann, E.L. van den Broek, A. Llagostera Casanovas, J.T. Menendez, C. Schmidmer, J. Berger, Subjective and objective assessment of full bandwidth speech quality. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 440–449 (2020) J. Berger, Instrumentelle Verfahren zur Sprachqualitätsschätzung: Modelle Auditiver Tests (Shaker, Düren, 1998)


J. Berger, A. Hellenbart, R. Ullmann, B. Weiss, S. Möller, J. Gustafsson, G. Heikkilä, Estimation of ‘quality per call’ in modelled telephone conversations, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Las Vegas (2008) S. Bosse, D. Maniry, K. Müller, T. Wiegand, W. Samek, Deep neural networks for no-reference and full-reference image quality assessment, IEEE Trans. Image Process. 27(1), 206–219 (2018) T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krüger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners. ArXiv, abs/2005.14165 (2020) D. Burnham, D. Estival, S. Fazio, J. Viethen, F. Cox, R. Dale, S. Cassidy, J. Epps, R. Togneri, M. Wagner, Y. Kinoshita, R. Göcke, J. Arciuli, M. Onslow, T.W. Lewis, A. Butcher, J. Hajek, Building an audio-visual corpus of Australian English: large corpus collection with an economical portable and replicable black box, in Proceedings of Interspeech 2011, Florence (2011) R. Caruana, Multitask learning. Mach. Learn. 28(1), 41–75 (1997) A.A. Catellier, S.D. Voran, Wawenets: a no-reference convolutional waveform-based approach to estimating narrowband and wideband speech quality, in Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona (2020) B. Cauchi, K. Siedenburg, J.F. Santos, T.H. Falk, S. Doclo, S. Goetze, Non-intrusive speech quality prediction using modulation energies and lstm-network. IEEE/ACM Trans. Audio Speech Lang. Process. 27(7), 1151–1163 (2019) J.R. Cavanaugh, R.W. Hatch, J.L. Sullivan, Models for the subjective effects of loss, noise, and talker echo on telephone connections. Bell Syst. Tech. J. 55(9), 1319–1371 (1976) CDC, Landline phones are a dying breed (2020). https://www.statista.com/chart/2072/landlinephones-in-the-united-states/. Accessed 15 Nov 2020 A. Chehadi, Subjektive Qualitätserfassung übertragener Sprache echter Telefonate. Bachelor’s Thesis, TU-Berlin (2020) G. Chen, V. Parsa, Nonintrusive speech quality evaluation using an adaptive neurofuzzy inference system. IEEE Signal Process. Lett. 12(5), 403–406 (2005) G. Chen, V. Parsa, Bayesian model based non-intrusive speech quality evaluation, in Proceedings of 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia (2005) N. Côté, Integral and Diagnostic Intrusive Prediction of Speech Quality (Springer, Berlin, 2011) N. Côté, V. Gautier-Turbin, S. Möller, Influence of loudness level on the overall quality of transmitted speech, in Proceedings of 123rd Audio Engineering Society Convention, New York (2007) I. Demirsahin, O. Kjartansson, A. Gutkin, C. Rivera, Open-source multi-speaker corpora of the english accents in the british isles, in Proceedings of 12th Language Resources and Evaluation Conference (LREC), Marseille (2020) J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: pre-training of deep bidirectional transformers for language understanding, in Proceedings of 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis (2019) X. Dong, D.S. 
Williamson, An attention enhanced multi-task model for objective speech assessment in real-world environments, in Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona (2020a) X. Dong, D.S. Williamson, A pyramid recurrent network for predicting crowdsourced speechquality ratings of real-world signals, in Proceedings of Interspeech 2020, Shanghai (2020b) R.K. Dubey, A. Kumar, Non-intrusive speech quality assessment using several combinations of auditory features. Int. J. Speech Technol. 16(1), 89–101 (2013) T.H. Falk, W.-Y. Chan, Nonintrusive speech quality estimation using gaussian mixture models. IEEE Signal Process. Lett. 13(2), 108–111 (2006a)


T.H. Falk, W.-Y. Chan, Single-ended speech quality measurement using machine learning methods. IEEE Trans. Audio Speech Lang. Process. 14(6), 1935–1947 (2006b) L. Fernández Gallardo, B. Weiss, The nautilus speaker characterization corpus: speech recordings and labels of speaker characteristics and voice descriptions, in Proceedings of International Conference on Language Resources and Evaluation (LREC), Miyazaki (2018) Freesound (2020). https://freesound.org/ R.M. French, Catastrophic forgetting in connectionist networks. Trends Cognit. Sci. 3(4), 128–135 (1999) Q. Fu, K. Yi, M. Sun, Speech quality objective assessment using neural network, in Proceedings of 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Istanbul (2000) S.W. Fu, Y. Tsao, H.T. Hwang, H.M. Wang, Quality-net: an end-to-end non-intrusive speech quality assessment model based on BLSTM, in Proceedings of Interspeech 2018, Hyderabad (2018) L.F. Gallardo, G. Mittag, S. Möller, J. Beerends, Variable voice likability affecting subjective speech quality assessments, in Proceedings of Tenth International Conference on Quality of Multimedia Experience (QoMEX), Sardinia (2018) J.F. Gemmeke, D.P.W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R.C. Moore, M. Plakal, M. Ritter, Audio set: an ontology and human-labeled dataset for audio events, in Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans (2017) I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, Cambridge, 2016). http://www. deeplearningbook.org V. Grancharov, D.Y. Zhao, J. Lindblom, W.B. Kleijn, Low-complexity, nonintrusive speech quality assessment. IEEE Trans. Audio Speech Lang. Process. 14(6), 1948–1956 (2006) A. Graves, G. Wayne, I. Danihelka, Neural turing machines (2014, preprint). arXiv:1410.5401 N. Harte, E. Gillen, A. Hines, TCD-VoIP, a research database of degraded speech for assessing quality in VoIP applications, in Proceedings of 2015 Seventh International Workshop on Quality of Multimedia Experience (QoMEX), Costa Navarino, Messinia (2015), pp. 1–6 K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas (2016) A. Hines, E. Gillen, N. Harte, Measuring and monitoring speech quality for voice over IP with POLQA, viSQOL and P.563, in Proceedings of Interspeech 2015, Dresden (2015a) A. Hines, J. Skoglund, A.C. Kokaram, N. Harte, ViSQOL: an objective speech quality model. EURASIP J. Audio Speech Music Process. 2015(13), (2015b) S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) D.F. Hoth, Room noise spectra at subscribers’ telephone locations. J. Acoust. Soc. Am. 12(4), 499–504 (1941) L. Huo, Attribute-based Speech Quality Assessment:-Narrowband and Wideband (Shaker, 2015) IEEE, IEEE recommended practice for speech quality measurements. IEEE Trans. Audio Electroacoust. 17(3), 225–246 (1969) S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in International Conference on Machine Learning (ICML 2015), Lille (2015) ITU-T P Suppl. 23, Perceptual objective listening quality assessment. ITU-T, Geneva (2008) ITU-T Rec. G.191, Software tools for speech and audio coding standardization. ITU-T, Geneva (2019) ITU-T Rec. G.192, A common digital parallel interface for speech standardization activities. ITUGeneva (1996) ITU-T Rec. 
G.711, Pulse code modulation (PCM) of voice frequencies. ITU-T, Geneva (1988) ITU-T Rec. G.722, 7 kHz audio-coding within 64 kbit/s. ITU-T, Geneva (2012) ITU-T Rec. G.722.2, Wideband coding of speech at around 16 kbit/s using adaptive multi-rate wideband (AMR-WB). ITU-T, Geneva (2003) ITU-T Rec. P.10, Vocabulary for performance, quality of service and quality of experience. ITU-T, Geneva (2017)


ITU-T Rec. P.107, The E-model: a computational model for use in transmission planning. ITU-T, Geneva (2015) ITU-T Rec. P.107.1, Wideband E-model. ITU-T, Geneva (2019) ITU-T Rec. P.107.2, Fullband E-model. ITU-T, Geneva (2019) ITU-T Rec. P.1401, Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models. ITU-T, Geneva (2020) ITU-T Rec. P.48, Specification for an intermediate reference system. ITU-T, Geneva (1988) ITU-T Rec. P.501, Test signals for use in telephony and other speech-based application. ITUGeneva (2020) ITU-T Rec. P.56, Objective measurement of active speech level. ITU-T, Geneva (2011) ITU-T Rec. P.563, Single-ended method for objective speech quality assessment in narrow-band telephony applications. ITU-T, Geneva (2004) ITU-T Rec. P.800, Methods for subjective determination of transmission quality. ITU-T, Geneva (1996) ITU-T Rec. P.804, Subjective diagnostic test method for conversational speech quality analysis. ITU-T, Geneva (2017) ITU-T Rec. P.805, Subjective evaluation of conversational quality. ITU-T, Geneva (2007) ITU-T Rec. P.806, A subjective quality test methodology using multiple rating scales. ITU-T, Geneva (2014) ITU-T Rec. P.808, Subjective evaluation of speech quality with a crowdsourcing approach. ITU-T, Geneva (2018) ITU-T Rec. P.810, Modulated noise reference unit (MNRU). ITU-Geneva (1996) ITU-T Rec. P.862, Perceptual evaluation of speech quality (PESQ): an objective method for end-toend speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T, Geneva (2001) ITU-T Rec. P.862.2, Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs. ITU-T, Geneva (2007) ITU-T Rec. P.863, Perceptual objective listening quality assessment. ITU-T, Geneva (2018) ITU-T Rec. P.863.1, Application guide for recommendation ITU-T P.863. ITU-Geneva (2019) ITU-T SG12 C.287, Subjective test from Orange for the P.AMD project set A. ITU-T, Geneva (2015). Source: Orange, Study Period 2013–2016 ITU-T SG12 C.440, P.SAMD updated results and validation plan. ITU-T, Geneva (2019). Source: TU-Berlin, Study Period 2017–2020 ITU-T SG12 TD.137, Technical requirement specification P.AMD and P.SAMD. ITU-T, Geneva (2017). Source: Rapporteur Q9/12, Study Period 2017–2020 K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun, What is the best multi-stage architecture for object recognition? in Proceedings of 2009 IEEE 12th International Conference on Computer Vision (ICCV), Kyoto (2009) U. Jekosch, Voice and Speech Quality Perception: Assessment and Evaluation (Springer, Berlin, 2005) N.O. Johannesson, The ETSI computation model: a tool for transmission planning of telephone networks. IEEE Commun. Mag. 35(1), 70–79 (1997) P. Kabal, TSP speech database. McGill University, Quebec, Canada, Tech. Rep. Database Version 1.0 (2002) D. Kim, ANIQUE: an auditory model for single-ended speech quality estimation. IEEE Trans. Speech Audio Process. 13(5), 821–831 (2005) D. Kim, A. Tarraf, ANIQUE+: a new american national standard for non-intrusive estimation of narrowband speech quality. Bell Labs Tech. J. 12(1), 221–236 (2007) D.P. Kingma, J. Ba, Adam: a method for stochastic optimization. CoRR, abs/1412.6980 (2015) F. Köster, D. Guse, M. Wältermann, S. Möller, Comparison between the discrete ACR scale and an extended continuous scale for the quality assessment of transmitted speech, in Proceedings of 41. Jahrestagung für Akustik (DAGA 2015), Nürnberg (2015)


F. Köster, V. Cercos-llombart, G. Mittag, S. Möller, Non-intrusive estimation model for the speechquality dimension loudness, in Proceedings of 12. ITG-Fachtagung Sprachkommunikation, Paderborn (2016a) F. Köster, G. Mittag, T. Polzehl, S. Möller, Non-intrusive estimation of noisiness as a perceptual quality dimension of transmitted speech, in Proceedings of 2016 5th ISCA/DEGA Workshop on Perceptual Quality of Systems (PQS), Berlin (2016b), pp. 74–78 A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017) J. Lecomte, T. Vaillancourt, S. Bruhn, H. Sung, K. Peng, K. Kikuiri, B. Wang, S. Subasingha, J. Faure, Packet-loss concealment technology advances in EVS, in Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane (2015) Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989) Q. Li, Y. Fang, W. Lin, D. Thalmann, Non-intrusive quality assessment for enhanced speech signals based on spectro-temporal features, in Proceedings of 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), Chengdu (2014), pp. 1–6 LibriVox, Free public domain audiobooks (2020). https://librivox.org/ M.-T. Luong, H. Pham, C.D. Manning, Effective approaches to attention-based neural machine translation, in Proceedings of 2015 Conference on Empirical Methods on Natural Language Processing (EMNLP), Lisbon (2015) H. Maier, Android Qualitätsmonitor für mobile Sprachnetze. Master’s thesis, TU Berlin (2019) L. Malfait, J. Berger, M. Kastner, P.563 – The ITU-T standard for single-ended speech quality assessment. IEEE Trans. Audio Speech Lang. Process. 14(6), 1924–1934 (2006) G. Mittag, S. Möller, Non-intrusive estimation of packet loss rates in speech communication systems using convolutional neural networks, in Proceedings of IEEE International Symposium on Multimedia (ISM), Taipei (2018) G. Mittag, S. Möller, Quality estimation of noisy speech using spectral entropy distance, in Proceedings of 2019 26th International Conference on Telecommunications (ICT), Hanoi (2019a) G. Mittag, S. Möller, Quality degradation diagnosis for voice networks – estimating the perceived noisiness, coloration, and discontinuity of transmitted speech, in Proceedings of Interspeech 2019, Graz (2019b) G. Mittag, S. Möller, Semantic labeling of quality impairments in speech spectrograms with deep convolutional networks, in Proceedings of 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), Berlin (2019c) G. Mittag, S. Möller, Non-intrusive speech quality assessment for super-wideband speech communication networks, in Proceedings of 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton (2019d) G. Mittag, S. Möller, Deep learning based assessment of synthetic speech naturalness, in Proceedings of Interspeech 2020, Barcelona (2020a) G. Mittag, S. Möller, Full-reference speech quality estimation with attentional siamese neural networks, in Proceedings of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona (2020b) G. Mittag, K. Friedemann, M. Sebastian, F. Köster, S. Möller, Non-intrusive estimation of the perceptual dimension coloration, in Proceedings of 42. Jahrestagung für Akustik (DAGA), Aachen (2016) G. Mittag, S. Möller, V. Barriac, S. 
Ragot, Quantifying quality degradation of the EVS superwideband speech codec, in Proceedings of 2018 10th International Conference on Quality of Multimedia Experience (QoMEX), Sardinia (2018) G. Mittag, L. Liedtke, N. Iskender, B. Nader, T. Hübschen, G. Schmidt, S. Möller, Einfluss der Position und Stimmhaftigkeit von verdeckten Paketverlusten auf die Sprachqualität, in 45. Deutsche Jahrestagung für Akustik (DAGA), Rostock (2019)


G. Mittag, R. Cutler, Y. Hosseinkashi, M. Revow, S. Srinivasan, N. Chande, R. Aichner, DNN no-reference PSTN speech quality prediction, in Proceedings of Interspeech 2020, Shanghai (2020) S. Möller, Assessment and Prediction of Speech Quality in Telecommunications (Kluwer, Dordrecht, 2000) S. Möller, A. Raake, N. Kitawaki, A. Takahashi, M. Waltermann, Impairment factor framework for wide-band speech codecs. IEEE Trans. Audio Speech Lang. Process. 14(6), 1969–1976 (2006) S. Möller, T. Hübschen, G. Mittag, G. Schmidt, Zusammenhang zwischen perzeptiven Dimensionen und Störungsursachen bei super-breitbandiger Sprachübertragung, in Proceedings of 45. Jahrestagung für Akustik (DAGA 2019), Rostock (2019a) S. Möller, G. Mittag, T. Michael, V. Barriac, H. Aoki, Extending the E-Model towards superwideband and fullband speech communication scenarios, in Proceedings of Interspeech 2019, Graz (2019b) B. Naderi, R. Cutler, An open source implementation of ITU-T Recommendation P.808 with validation, in Proceedings of Interspeech 2020, Shanghai (2020) B. Naderi, T. Polzehl, I. Wechsung, F. Köster, S. Möller, Effect of trapping questions on the reliability of speech quality judgments in a crowdsourcing paradigm, in Proceedings of Interspeech 2015, Dresden (2015) B. Naderi, T. Hoßfeld, M. Hirth, F. Metzger, S. Möller, R.Z. Jiménez, Impact of the number of votes on the reliability and validity of subjective speech quality assessment in the crowdsourcing approach, in Proceedings of 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX), Athlone (2020), pp. 1–6 M. Narwaria, W. Lin, I.V. McLoughlin, S. Emmanuel, C.L. Tien, Non-intrusive speech quality assessment with support vector regression, in Proceedings of 2010 Advances in Multimedia Modeling (MMM), Chongqing (2010) M. Narwaria, W. Lin, I.V. McLoughlin, S. Emmanuel, L. Chia, Nonintrusive quality assessment of noise suppressed speech with mel-filtered energies and support vector regression. IEEE Trans. Audio Speech Lang. Process. 20(4), 1217–1232 (2012) J. Ooster, B.T. Meyer, Improving deep models of speech quality prediction through voice activity detection and entropy-based measures, in Proceedings of 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton (2019) J. Ooster, R. Huber, B.T. Meyer, Prediction of perceived speech quality using deep machine listening, in Proceedings of Interspeech 2018, Hyderabad (2018) N. Osaka, K. Kakehi, Objective evaluation model of telephone transmission performance for fundamental transmission factors. Electron. Commun. Japan Part I-Commun. 69, 18–27 (1986) J. Pons, X. Serra, Designing efficient architectures for modeling temporal features with convolutional neural networks, in Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans (2017) A. Raake, S. Möller, M. Wältermann, N. Côté, J.P. Ramirez, Parameter-based prediction of speech quality in listening context, in Proceedings of 2010 Second International Workshop on Quality of Multimedia Experience (QoMEX), Trondheim (2010) C.K. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, et al., The Interspeech 2020 deep noise suppression challenge: datasets, subjective testing framework, and challenge results, in Proceedings of Interspeech 2020, Shanghai (2020) RFC 6716, Definition of the opus audio codec. Internet Engineering Task Force (IETF), Fremont (2012) A.W. Rix, M.P. 
Hollier, The perceptual analysis measurement system for robust end-to-end speech quality assessment, in Proceedings of 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Istanbul (2000) A.W. Rix, J.G. Beerends, M.P. Hollier, A.P. Hekstra, Perceptual evaluation of speech quality (PESQ)-A new method for speech quality assessment of telephone networks and codecs, in Proceedings 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Salt Lake City (2001)

References

159

S. Ruder, An overview of multi-task learning in deep neural networks. arXiv:1706.05098 (2017) D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986) J. Schlüter, S. Böck, Improved musical onset detection with convolutional neural networks, in Proceedings of 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence (2014) K. Scholz, Instrumentelle Qualitätsbeurteilung von Telefonbandsprache beruhend auf Qualitätsattributen (Shaker, Düren, 2008) D. Sen, Predicting foreground SH, SL and BNH DAM scores for multidimensional objective measure of speech quality, in Proceeding of 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Montreal (2004) D. Sen, W. Lu, Objective evaluation of speech signal quality by the prediction of multiple foreground diagnostic acceptability measure attributes. J. Acoust. Soc. Am. 131(5), 4087–4103 (2012) Y. Shan, J. Wang, X. Xie, L. Meng, J. Kuang, Non-intrusive speech quality assessment using deep belief network and backpropagation neural network, in Proceedings of 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei (2019) D. Sharma, Y. Wang, P.A. Naylor, M. Brookes, A data-driven non-intrusive measure of speech quality and intelligibility. Speech Commun. 80, 84–94 (2016) M. Soni, H. Patil, Novel deep autoencoder features for non-intrusive speech quality assessment, in Proceedings of 2016 24th European Signal Processing Conference (EUSIPCO) (2016) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014) J. Thiemann, N. Ito, E. Vincent, The diverse environments multi-channel acoustic noise database (DEMAND): a database of multichannel environmental noise recordings. J. Acoust. Soc. Am. 133, 3591–3591 (2013) L. Tóth, Phone recognition with hierarchical convolutional deep maxout networks. EURASIP J. Audio Speech Music Process. 2015(1), 1–13 (2015) V.K. Varma, Testing speech coders for usage in wireless communications systems, in Proceedings of IEEE Workshop on Speech Coding for Telecommunications, Quebec (1993) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L.U. Kaiser, I. Polosukhin, Attention is all you need, in Proceedings of 2017 Thirty-First Conference on Neural Information Processing Systems (NeurIPS), Long Beach (2017) M. Wältermann, Dimension-Based Quality Modeling of Transmitted Speech (Springer, Berlin, 2012) M. Wältermann, A. Raake, S. Möller, Quality dimensions of narrowband and wideband speech transmission. Acta Acust. United Acust. 96(6), 1090–1103 (2010) S. Wolf, C.A. Dvorak, R.F. Kubichek, C.R. South, R.A. Schaphorst, S.D. Voran, Future work relating objective and subjective telecommunications system performance, in Proceeings of IEEE Global Telecommunications Conference (GLOBECOM), Phoenix (1991) J. Yu, M. Li, X. Hao, G. Xie, Deep fusion siamese network for automatic kinship verification (2020, preprint). arXiv:2006.00143 Zafaco GmbH. Kyago whitepaper: die multi play und benchmarking plattform (2020) H. Zhang, I. McLoughlin, Y. Song, Robust sound event recognition using convolutional neural networks, in Proceedings of 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane (2015) Y. Zhang, M. Yu, N. Li, C. Yu, J. Cui, D. 
Yu, Seq2Seq attentional siamese neural networks for text-dependent speaker verification, in Proceedings of ICASSP 2019, Brighton (2019) S. Zielinski, F. Rumsey, S. Bech, On some biases encountered in modern audio quality listening tests-A review. J. Audio Eng. Soc. 56(6), 427–451 (2008)

Index

A
Absolute Category Rating (ACR) scale, 12, 99, 104
Active gain control, 10
Active speech level, 14, 37
Additive attention score, 62
Amazon Mechanical Turk, 40
Ambient noises, 10
Amplitude clipping, 10, 37
AMR-NB, 37
AMR-WB, 8, 37
ANIQUE+, 21, 22, 118, 122
Arbitrary filter, 37, 38
Attention mechanism, 26–27
Attention-pooling (AP), 52–53
Audio bandwidth, 3, 9, 36
AusTalk, 34, 108
Australian English, 34
Autoencoder, 28, 29
Average-/max-pooling, 51

B
Background noise, 13, 14, 34, 36, 51, 87, 136–138
Backpropagation algorithm, 28
Bahdanau, alignment score function, 26, 62
Bandpass filter, 14, 37
Batch Normalisation layer, 28, 44
Bellcore TR mode, 19
Bias-aware loss
    anchoring predictions
        with anchor dataset, 93
        MOS values, 92
        with weighted MSE loss, 93
    configuration comparisons, 98–99
    first-order polynomial function, 90
    function, 140
    learning with, 91
    mean-square error, 90
    minimum accuracy rth, 95–96
    polynomial function, 90
    speech quality dataset, 99–100
    synthetic data, 94
    on validation dataset, 97
Bias estimation, 99, 100, 141
BiLSTM network, 29, 31, 55, 56

C
Circuit noise, 10, 19
Circuit-switched networks, 8, 9
CNN-TD network, 60, 61
Codecs, 9, 37–38
Codec Tandem, 38
Coloration, 11–12, 73, 139, 140
    dimension, 83
    histogram, 148
Comparison Category Rating (CCR), 12
Context-effect, 15
Convolutional neural network (CNN)
    framewise model, 43–45
    image classification tasks, 25
    max-pooling, 25
    pooling layer, 25
    zero padding, 25
Convolutional Neural Network–Self-Attention–Attention-Pooling (CNN-SA-AP), 4
Cosine, alignment score function, 62, 66
Crowdsourcing
    micro-task crowdsourcing, 16
    subjective assessment via, 16–18

D
Data-driven speech quality models, 32
Data screening, 18
Dataset condition tables
    NISQA_TEST_LIVETALK condition list, 145–146
    TUB_AUS condition list, 143–144
Deep feedforward networks, 24–25
Deep learning, 5
    convolutional networks, 25
    deep feedforward networks, 24–25
    deep learning based speech quality models, 28–31
Degradation Category Rating (DCR), 12
Degradation decomposition approach, 86
Delay, 1, 9, 10, 60, 67
DIAL model, 2, 21, 35, 118, 122, 142
Discontinuity, 11–12, 73, 83, 84, 139, 140
    histogram, 148
    prediction, 138, 141
DNS-Challenge (Librivox), 35
Dot product, 62
Dot-product attention, 48–49
Dot-product self-attention, 50
Double-ended signal-based models, 19–21
Double-ended speech quality prediction, 59–60
    alignment, 66–67
    challenges in, 60
    feature fusion, 67–68
    LSTM vs. self-attention, 65–66
    method, 60–62
        feature fusion, 64
        reference alignment, 62–64
        Siamese neural network, 62
    vs. single-ended, 68–70
    state-of-the-art model, 59
Dropout, 28

E
Echo, 10
Echo cancellation, 10
E-model, 19
Epoch, 28
EVS, 37–38
Extreme learning machine (ELM), 30

F
Feedforward network, 24, 45–46
Framewise model, 42, 43, 54–55
    CNN, 43–45
    feedforward network, 45–46
Fullband (FB), 8
Fully connected layer (FC), 24

G
G.711, 37
G.722, 37
Gaussian mixture probability models (GMMs), 23
Gold standard question, 17–18

H
Handover, 8
Hard-attention approach, 63
Highpass filter, 37

I
ImageNet, 3
International Telecommunication Union (ITU-T), 2, 3, 8, 22, 60
ITU-T Rec. G.711, 7

J
Jitter management, 10

L
Landline, 8
Landline network, 7
Last-step-pooling, 51–52
Layer Normalisation layer, 28
Linear filtering, 14
Listeners, 14–15
Listening environment, 17
Listening level, 17
Listening system, 17
Long Short-Term Memory (LSTM), 26, 30, 31
Long short-term memory network (LSTM)
    vs. self-attention, 65–66
    time-dependency modelling, 47–48
Long-term dependencies, 15–16
Loudness, 11–12, 73, 83, 139, 140
    histogram, 149
Lowpass filter, 37
Luong, alignment score function, 62

M
Max-pooling, 25
Mean opinion score (MOS), 2, 12
Mean squared error (MSE), 27, 90
Mean temporal distance (MTD), 28
Medium- and long-term context effects/corpus effect, 15
Mel-Frequency Cepstrum Coefficients (MFCC), 23, 24
Mel-spec segment, 42, 43, 78
Micro-task crowdsourcing, 16
Mini-batch, 28
MNRU noise, 36
Mobile/cellular network, 8
Mobile phone, 8
Modulated Noise Reference Unit (MNRU), 14
Multi-head attention, 48–49
Multi-head self-attention, 27, 49
Multilayer perceptrons, 24
Multiplicative noise, 23
Multi-task approach, 3
Multi-Task-Learning (MTL) problem, 73

N
Narrowband (NB), 7
Natural Language Processing (NLP) tasks, 26
Network planning models, 18–19
Neural network architectures, 4, 33
    dataset, 33
        listening experiment, 40–41
        live distortions, 39–40
        simulated distortions, 35–39
        source files, 34–35
    framewise model, 43, 54–55
        CNN, 43–45
        feedforward network, 45–46
    Mel-Spec segmentation, 43
    neural network model, 42–43
    pooling model, 57
    time-dependency model, 46, 56
        LSTM, 47–48
        transformer/self-attention, 48–50
    time pooling, 51
        attention-pooling, 52–53
        average-/max-pooling, 51
        last-step-pooling, 51–52
    training and evaluation metric, 53–54
Neural network model, 42–43
NISQA, 141, 142
    bias-aware loss, 111
    datasets
        PAMD, 106–107
        TCD, 107
        TUB_AUS1, 108–109
        TUB_DIS, 107
        TUB_LIK, 108
        TUB_VUPL, 108
    evaluation metrics, 114
    impairment level vs quality prediction
        active speech level, 130–133
        amplitude clipping, 133–134
        bursty concealed packet loss, 127–129
        highpass, 129–130
        lowpass, 130
        MNRU noise, 134–136
        pub background noise, 136–138
        temporal clipping, 125–127
        white background noise, 136
    ITU-T P Suppl. 23, 104–106
    live-talking test set, 109–110
    missing dimension ratings, 111–112
    POLQA pool, 104
    test set results, 122–125
    training, 112–114
    validation set results
        overall quality, 114, 116–118
        quality dimensions, 118–122
Noise reduction algorithms, 10
Noisiness, 11–12, 73, 83, 138, 139, 140, 141
    histogram, 147
Non-deep learning machine learning approaches, 23–24
Non-intrusive models, 21
Non-optimal speech level, 10
No-reference models, 21

O
Opinion models, 18–19
Opus, 38
Over-the-top (OTT), 2, 8

P
P.563, 22
Packet loss, 9, 38
Packet-loss concealment, 8
Packet-switched networks, 8
Parametric models, 18–19
Pearson’s correlation coefficient (PCC), 54
Perceptual approaches for multi-dimensional (P.AMD) analysis, 21
Perceptual dimensions, 11
Perceptual Evaluation of Speech Quality (PESQ), 20
Perceptual Linear Prediction (PLP), 23
Perceptual Objective Listening Quality Analysis (POLQA), 20, 39, 59
Plain old telephone service (POTS), 7
POLQA, 2, 142
Polynomial function, 90
Pooling layer, 25
Pooling model, 42, 57
Prediction accuracy, 140
Public switched telephone network (PSTN), 8

Q
Qualification job, 16–17
Quality-Net, 28

R
Rating job, 17
Rating noise, 15
Recorded background noise, 36
Recurrent neural networks (RNNs), 25–26
Reference, 14
Root-mean-square error (RMSE), 54, 89

S
Scaling-effect, 15
Self-attention, 48–50
    LSTM vs., 65–66
Self-attention time-dependency model, 50
Short-term context dependency/order effect, 15
Siamese CNN-TD network, 59
Siamese neural network, 62, 140
Signal-to-noise ratio (SNR), 36
Single-ended CNN-LSTM speech quality model, 29
Single-ended machine learning based diagnostic speech quality model, 5
Single-ended models, 3
Single-ended signal-based models, 21–22
Single-ended speech quality model, 3, 4
    double-ended vs., 68–70
Single-task (ST) model, 79
Smartphones, 1
Soft-attention, 63
Source material, 16
Speech communication, 1, 7–10
Speech material, 13
Speech quality, 7, 10–12
Speech quality dataset, 99–100
Speech quality dimensions, 7, 10–12
    all-tasks evaluation, 83–84
    degradation composition approach, 74
    degradation decomposition, 85–87
    dimension, 84–85
    faster computation time, 73
    multi-task learning, 74
    multi-task models, 75
        fully connected (MTL-FC), 76–77
        fully connected + pooling (MTL-POOL), 77–78
        fully connected + pooling + time-dependency (MTL-TD), 78–79
        fully connected + pooling + time-dependency + CNN (MTL-CNN), 79–80
    negative transfer, 75
    per-task evaluation, 80–83
    single-task models, 75
    soft parameter sharing, 74
Speech quality prediction, 5, 33
    dataset, 33
        listening experiment, 40–41
        live distortions, 39–40
        simulated distortions, 35–39
        source files, 34–35
    framewise model, 43, 54–55
        CNN, 43–45
        feedforward network, 45–46
    Mel-Spec segmentation, 43
    neural network model, 42–43
    pooling model, 57
    time-dependency model, 56
    time-dependency modelling, 46
        LSTM, 47–48
        transformer/self-attention, 48–50
    time pooling, 51
        attention-pooling, 52–53
        average-/max-pooling, 51
        last-step-pooling, 51–52
    training and evaluation metric, 53–54
Subject-effect, 15
Super-wideband (SWB) speech, 8
Synthetic data, 94

T
Telecommunication services, 1
Temporal clipping, 14, 36–37
Test duration, 16
Test procedure, 16
Time-dependency modelling, 42, 46, 52, 56
    LSTM, 47–48
    transformer/self-attention, 48–50
Time–frequency domain, 60
Time pooling, 51
    attention-pooling, 52–53
    average-/max-pooling, 51
    last-step-pooling, 51–52
Traditional speech quality models, 31
Training deep networks, 27–28
Training job, 17
Transcoding, 9
Transformer, 27
Transformer block, 48, 49
Transformer/self-attention, 48–50
Transmitted speech, quality assessment of
    crowdsourcing, subjective assessment via, 16–18
    machine learning based instrumental methods, 22–23
        deep learning architectures, 24–28
        non-deep learning machine learning approaches, 23–24
    speech communication networks, 7–10
    speech quality and speech quality dimensions, 7, 10–12
    subjective assessment, 12–16
    traditional instrumental methods, 18
        double-ended signal-based models, 19–21
        parametric models, 18–19
        single-ended signal-based models, 21–22
TSP speech database, 34

U
UK-Ireland dataset, 34–35

V
VISQOL, 20–21
Voice activity detection (VAD), 10
Voice over IP (VoIP), 2
Voice over LTE (VoLTE), 8
Voice over Wi-Fi (VoWiFi), 8

W
White noise, 36
Wideband (WB) networks, 7

Z
Zero padding, 25