Proceedings of the 9th Conference on Sound and Music Technology: Revised Selected Papers from CMST (Lecture Notes in Electrical Engineering, 923) 9811947023, 9789811947025

The book presents selected papers from the 9th Conference on Sound and Music Technology (CSMT), held virtually in June 2022.


Table of contents:
Preface
Contents
Computational Musicology
A Multitask Learning Approach for Chinese National Instruments Recognition and Timbre Space Regression
1 Introduction
2 Related Work
3 Dataset
4 Methodology
4.1 Timbre Space
4.2 Model and Multitask Learning Approach
5 Experiments
5.1 Experiment Setup
5.2 Instrument Recognition Results and Analysis
5.3 Timbre Space Regression Results and Analysis
6 Conclusion
References
Design of Blowing Instrument Controller Based on Air Pressure Sensor
1 Introduction
1.1 Existing Products
1.2 Goals
2 Related Work
2.1 The Sound Structure of a Free Reed Instrument
2.2 Amplitude Feature Extraction of Musical Instrument Audio
3 Mouthpiece Design Based on Air Pressure Sensor
3.1 Straight Pipe Design
3.2 Tee Pipe Design
3.3 Tee Pipe with Unequal Sectional Areas
4 Mapping
4.1 The Dataset
4.2 Data Normalization
4.3 Neural Network Design
5 Implementation of the Blowing Instrument Controller
5.1 Hardware Communication Based on Wi-Fi Module
5.2 Preprocessing of Audio Source
5.3 User Interface Design Based on Unity
5.4 Project Archive and Demonstration Video
6 Conclusion and Future Discussion
6.1 Play the Instrument
6.2 Future Work
6.3 Conclusion
References
SepMLP: An All-MLP Architecture for Music Source Separation
1 Introduction
2 Related Work
2.1 Music Source Separation Methods
2.2 Deep Learning
2.3 MLP-Like Architecture
3 Method
3.1 TF Domain Methods for Music Source Separation
3.2 Proposed MLP-like Architecture
3.3 Fully-Connected Decoder for Source Separation
4 Experiments
4.1 Settings
4.2 Implementation Details
4.3 Results
5 Conclusion
References
Improving Automatic Piano Transcription by Refined Feature Fusion and Weighted Loss
1 Introduction
2 Onsets and Frames Transcription and MTL
2.1 Onsets and Frames Model
2.2 MTL-Based Improvement
3 Model Configuration
3.1 Input
3.2 Velocity Stack
3.3 Onset and Offset Stack
3.4 Connection Block
3.5 Frame Stack
3.6 Loss Function
4 Experiments
4.1 Dataset
4.2 Metrics
4.3 Experiments Details
4.4 Results
5 Conclusion
References
Application Design of Sports Music Mobile App Based on Rhythm Following
1 Introduction
1.1 Background
1.2 The Necessity of Music Rhythm Movement Following
1.3 The Significance of the Research
1.4 Implementation Mode
2 Step Detection of Running Based on Logistic Regression
2.1 Collection of Sensor Data
2.2 Classification Algorithm Based on Logistic Regression
2.3 Experimental Result
3 Music Rhythm Synchronization Algorithm
3.1 Tempo Based Rhythm Synchronization Algorithm
3.2 Some Details About Music Rhythm Synchronization Algorithm
3.3 Experiment Result
4 Conclusion
References
A Study on Monophonic Tremolo Recordings of Chinese Traditional Instrument Pipa Using Spectral Slope Curve
1 Introduction
1.1 Brief Description of Tremolo Analysis
1.2 Related Works
1.3 Problem Formulation and Paper Organization
2 Methods
3 Experiments and Analysis
3.1 Model Setting, Performance Metrics and Dataset
3.2 Experimental Results
3.3 Explanation of Non-tremolo Notes and Specificity of Spectral Slopes
4 Conclusion
References
Investigation of Singing Voice Separation for Singing Voice Detection in Polyphonic Music
1 Introduction
2 Proposed Method
2.1 U-Net for Singing Voice Separation
2.2 Feature Extraction
2.3 LRCN for Classification
2.4 Post Smoothing
3 Evaluation
3.1 Experiment Settings
3.2 Dataset
3.3 Experiments and Results
4 Conclusion
References
General Audio Signal Processing
Learning Optimal Time-Frequency Representations for Heart Sound: A Comparative Study
1 Introduction
2 Methods
2.1 Image Extraction
2.2 Pre-trained Model
3 Experiments
3.1 Database
3.2 Experimental Setup
3.3 Evaluation Criteria
3.4 Results
4 Conclusions
References
Improving Pathological Voice Detection: A Weakly Supervised Learning Method
1 Introduction
2 Methodology
2.1 CNN Architecture
2.2 Weakly Supervised Learning
2.3 Learning Fine-Grained Labels
2.4 Learning Sample Weights
2.5 Data Augmentation
3 Experiments
3.1 Datasets
3.2 Experimental Setup
3.3 Evaluate Metrics
3.4 Experimental Results
4 Conclusions
References
Articulatory Analysis and Classification of Pathological Speech
1 Introduction
2 Torgo Dataset
3 Data Filtering
4 Position Space
5 Range Skewness
6 Time Comparison
6.1 QQ Plot (Henry Line) of Times
7 Classification
7.1 Data Preparation
7.2 Proposed Method
7.3 Classification Result
8 Discussion
9 Conclusion
References
Human Ear Modelling
Channel-Vocoder-Centric Modelling of Cochlear Implants: Strengths and Limitations
1 Introduction
2 CI Signal Processing Strategies: Interleavedly Sampling the Temporal Envelopes
3 Channel Vocoders: The Algorithms and Applications
3.1 Algorithms of the Channel Vocoders
3.2 Frequency Allocation
3.3 Spectral Channel Number
3.4 Current Spread
3.5 Temporal Envelope
3.6 Intensity and Dynamic Range
3.7 Carrier Waveform
3.8 Short Summary
4 Channel-Vocoder Simulation vs. Actual CI Hearing
5 Sound Quality and Music Perception with Vocoded Sounds
6 How to Simulate New Experimental Strategies?
7 Conclusion
References

Lecture Notes in Electrical Engineering 923

Xi Shao · Kun Qian · Xin Wang · Kejun Zhang, Editors

Proceedings of the 9th Conference on Sound and Music Technology Revised Selected Papers from CMST

Lecture Notes in Electrical Engineering Volume 923

Series Editors Leopoldo Angrisani, Department of Electrical and Information Technologies Engineering, University of Napoli Federico II, Naples, Italy Marco Arteaga, Departament de Control y Robótica, Universidad Nacional Autónoma de México, Coyoacán, Mexico Bijaya Ketan Panigrahi, Electrical Engineering, Indian Institute of Technology Delhi, New Delhi, Delhi, India Samarjit Chakraborty, Fakultät für Elektrotechnik und Informationstechnik, TU München, Munich, Germany Jiming Chen, Zhejiang University, Hangzhou, Zhejiang, China Shanben Chen, Materials Science and Engineering, Shanghai Jiao Tong University, Shanghai, China Tan Kay Chen, Department of Electrical and Computer Engineering, National University of Singapore, Singapore, Singapore Rüdiger Dillmann, Humanoids and Intelligent Systems Laboratory, Karlsruhe Institute for Technology, Karlsruhe, Germany Haibin Duan, Beijing University of Aeronautics and Astronautics, Beijing, China Gianluigi Ferrari, Università di Parma, Parma, Italy Manuel Ferre, Centre for Automation and Robotics CAR (UPM-CSIC), Universidad Politécnica de Madrid, Madrid, Spain Sandra Hirche, Department of Electrical Engineering and Information Science, Technische Universität München, Munich, Germany Faryar Jabbari, Department of Mechanical and Aerospace Engineering, University of California, Irvine, CA, USA Limin Jia, State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing, China Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Alaa Khamis, German University in Egypt El Tagamoa El Khames, New Cairo City, Egypt Torsten Kroeger, Stanford University, Stanford, CA, USA Yong Li, Hunan University, Changsha, Hunan, China Qilian Liang, Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX, USA Ferran Martín, Departament d’Enginyeria Electrònica, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, Spain Tan Cher Ming, College of Engineering, Nanyang Technological University, Singapore, Singapore Wolfgang Minker, Institute of Information Technology, University of Ulm, Ulm, Germany Pradeep Misra, Department of Electrical Engineering, Wright State University, Dayton, OH, USA Sebastian Möller, Quality and Usability Laboratory, TU Berlin, Berlin, Germany Subhas Mukhopadhyay, School of Engineering & Advanced Technology, Massey University, Palmerston North, Manawatu-Wanganui, New Zealand Cun-Zheng Ning, Electrical Engineering, Arizona State University, Tempe, AZ, USA Toyoaki Nishida, Graduate School of Informatics, Kyoto University, Kyoto, Japan Luca Oneto, Department of Informatics, Bioengineering., Robotics, University of Genova, Genova, Genova, Italy Federica Pascucci, Dipartimento di Ingegneria, Università degli Studi “Roma Tre”, Rome, Italy Yong Qin, State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing, China Gan Woon Seng, School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore, Singapore Joachim Speidel, Institute of Telecommunications, Universität Stuttgart, Stuttgart, Germany Germano Veiga, Campus da FEUP, INESC Porto, Porto, Portugal Haitao Wu, Academy of Opto-electronics, Chinese Academy of Sciences, Beijing, China Walter Zamboni, DIEM - Università degli studi di Salerno, Fisciano, Salerno, Italy Junjie James Zhang, Charlotte, NC, USA

The book series Lecture Notes in Electrical Engineering (LNEE) publishes the latest developments in Electrical Engineering - quickly, informally and in high quality. While original research reported in proceedings and monographs has traditionally formed the core of LNEE, we also encourage authors to submit books devoted to supporting student education and professional training in the various fields and applications areas of electrical engineering. The series covers classical and emerging topics concerning:

• Communication Engineering, Information Theory and Networks
• Electronics Engineering and Microelectronics
• Signal, Image and Speech Processing
• Wireless and Mobile Communication
• Circuits and Systems
• Energy Systems, Power Electronics and Electrical Machines
• Electro-optical Engineering
• Instrumentation Engineering
• Avionics Engineering
• Control Systems
• Internet-of-Things and Cybersecurity
• Biomedical Devices, MEMS and NEMS

For general information about this book series, comments or suggestions, please contact [email protected]. To submit a proposal or request further information, please contact the Publishing Editor in your country: China Jasmine Dou, Editor ([email protected]) India, Japan, Rest of Asia Swati Meherishi, Editorial Director ([email protected]) Southeast Asia, Australia, New Zealand Ramesh Nath Premnath, Editor ([email protected]) USA, Canada Michael Luby, Senior Editor ([email protected]) All other Countries Leontina Di Cecco, Senior Editor ([email protected]) ** This series is indexed by EI Compendex and Scopus databases. ** More information about this series at https://link.springer.com/bookseries/7818

Xi Shao · Kun Qian · Xin Wang · Kejun Zhang Editors

Proceedings of the 9th Conference on Sound and Music Technology Revised Selected Papers from CMST

Editors Xi Shao Nanjing University of Posts and Telecommunications Nanjing, Jiangsu, China Xin Wang Communication University of China Beijing, China

Kun Qian Beijing Institute of Technology Beijing, China Kejun Zhang Zhejiang University Hangzhou, Zhejiang, China

ISSN 1876-1100 ISSN 1876-1119 (electronic) Lecture Notes in Electrical Engineering ISBN 978-981-19-4702-5 ISBN 978-981-19-4703-2 (eBook) https://doi.org/10.1007/978-981-19-4703-2 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

After nine years of development, with the outstanding effort of the organisation committees, the Conference on Sound and Music Technology (CSMT) has become a leading conference in the area of computational audition in China, with a growing reputation among sound and music engineers, artists, researchers and scientists. The first event was held at Fudan University on 14 December 2013, under a slightly different name, the China Conference on Sound and Music Computing Workshop (CSMCW), and with the proceedings published in Chinese.

This event, CSMT 2021, the ninth in the series, is hosted by Zhejiang University and Zhejiang Conservatory of Music, based in Hangzhou, one of the most beautiful cities in China, with two World Cultural Heritage sites: the West Lake and the Grand Canal. The Conservatory is a fantastic place to host this exciting event, not only providing an excellent venue for the technical sessions but also offering a great opportunity for engaging and educating the next generation of music artists, engineers and practitioners.

For the past four years, an annual English proceedings volume of selected English papers has been published alongside the Chinese proceedings. This has clearly lifted the conference's impact among the international audience of sound and music technology and improved its profile worldwide, which may prove crucial for its development into a long-lasting and notable event on the global stage in the areas of computer audition, acoustic engineering and music technology.

This year saw 11 out of 19 English submissions accepted (a 57.89% acceptance rate) and 21 out of 40 Chinese submissions (52.5%), giving an overall acceptance rate of 54.24%. This demonstrates the high standard set by the CSMT organisation committee and the competitive process of selecting high-quality papers. The accepted papers cover a range of topics, including music instrument recognition, automatic piano transcription, music source separation, design of a blowing instrument controller, sports music, singing voice separation and detection, music emotion recognition, audio tagging, multimodal scene classification, heart sound analysis, recordings of the Chinese traditional instrument Pipa, classification of pathological speech, sound denoising, sound mixing and music generation. This shows the broad range of interests emerging from the community, the widespread impact on scientific development in this area and the multidisciplinary nature of this field.

vi

Preface

Nowadays, with the rapid development of artificial intelligence, sound and music technology is also moving forward at a fast pace. The CSMT conference provides an excellent platform for facilitating the exchange of ideas among researchers and practitioners, stimulating interest from participants and the wider community, and nurturing novel ideas and technological advancement in this exciting area. It is envisaged that the publication of this edition will help expose the new developments of this field in China to the international community, bridge the communication between researchers in China and those around the world, and add valuable material to the reading lists of libraries worldwide.

Guildford, UK

Wenwu Wang

Contents

Computational Musicology

A Multitask Learning Approach for Chinese National Instruments Recognition and Timbre Space Regression . . . 3
Shenyang Xu, Yiliang Jiang, Zijin Li, Xiaoheng Sun, and Wei Li

Design of Blowing Instrument Controller Based on Air Pressure Sensor . . . 15
Rongfeng Li and Yiran Yuan

SepMLP: An All-MLP Architecture for Music Source Separation . . . 31
Jiale Qian, Yongwei Gao, Weixing Wei, Jiahao Zhao, and Wei Li

Improving Automatic Piano Transcription by Refined Feature Fusion and Weighted Loss . . . 43
Jiahao Zhao, Yulun Wu, Liang Wen, Lianhang Ma, Linping Ruan, Wantao Wang, and Wei Li

Application Design of Sports Music Mobile App Based on Rhythm Following . . . 55
Rongfeng Li and Yu Liu

A Study on Monophonic Tremolo Recordings of Chinese Traditional Instrument Pipa Using Spectral Slope Curve . . . 69
Yuancheng Wang, Hanqin Dai, Yuyang Jing, Wei Wei, Dorian Cazau, Olivier Adam, and Qiao Wang

Investigation of Singing Voice Separation for Singing Voice Detection in Polyphonic Music . . . 79
Yifu Sun, Xulong Zhang, Xi Chen, Yi Yu, and Wei Li

General Audio Signal Processing

Learning Optimal Time-Frequency Representations for Heart Sound: A Comparative Study . . . 93
Zhihua Wang, Zhihao Bao, Kun Qian, Bin Hu, Björn W. Schuller, and Yoshiharu Yamamoto

Improving Pathological Voice Detection: A Weakly Supervised Learning Method . . . 105
Weixing Wei, Liang Wen, Jiale Qian, Yufei Shan, Jun Wang, and Wei Li

Articulatory Analysis and Classification of Pathological Speech . . . 117
Shufei Duan, Camille Dingam, Xueying Zhang, and Haifeng Li

Human Ear Modelling

Channel-Vocoder-Centric Modelling of Cochlear Implants: Strengths and Limitations . . . 137
Fanhui Kong, Yefei Mo, Huali Zhou, Qinglin Meng, and Nengheng Zheng

Computational Musicology

A Multitask Learning Approach for Chinese National Instruments Recognition and Timbre Space Regression

Shenyang Xu, Yiliang Jiang, Zijin Li, Xiaoheng Sun, and Wei Li

Abstract Musical instrument recognition is an essential task in the domain of music information retrieval. So far, most existing research has focused on Western instruments. In this research, we turn to Chinese national instruments recognition. First, a dataset containing 30 Chinese national instruments is created. Then, a well-designed end-to-end Convolutional Recurrent Neural Network is proposed. Moreover, we combine instrument recognition with instrument timbre space regression using a multitask learning approach to improve the performance of both tasks. We conduct experiments in instrument recognition and timbre space regression to evaluate our model and the multitask learning approach. Experimental results show that our proposed model outperforms previous algorithms, and the multitask approach can further improve the results.

Keywords Instrument recognition · Timbre space · Chinese national instruments · Multitask learning

1 Introduction

As the amount of online music data continues to grow with the development of digital technology, the demand for music data analysis and retrieval also increases. During the recent two decades, attempts in the domain of Music Information Retrieval (MIR) have been made to solve many related problems. As an important sub-task of MIR,

S. Xu · Z. Li Central Conservatory of Music, Beijing 100031, China Y. Jiang · X. Sun · W. Li (B) School of Computer Science and Technology, Fudan University, Shanghai 200438, China e-mail: [email protected] W. Li Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, China © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 X. Shao et al. (eds.), Proceedings of the 9th Conference on Sound and Music Technology, Lecture Notes in Electrical Engineering 923, https://doi.org/10.1007/978-981-19-4703-2_1


automatic musical instrument recognition is beneficial for many real-world applications, for instance, music education, instrument database construction and instrument audio retrieval. Previous research, either on monophonic or polyphonic instrument recognition, mainly focuses on Western orchestral instruments. Although some works have investigated folk instrument recognition [1, 2], studies on Chinese national instruments are rare. In this paper, we conduct research on monophonic instrument recognition and timbre space regression specifically for Chinese national instruments. Human beings recognize each instrument by its timbre; however, the nature of timbre is rather hard to grasp. Intuitively, we see potential in combining instrument recognition with another timbre-related task, such as timbre space modeling, in a multitask learning paradigm, to increase the performance of instrument recognition and vice versa. A timbre space can be modeled by first selecting subjective timbre evaluation terms and having them annotated by professional musicians, then using mathematical techniques to project these annotated instrument timbre terms into a space, so that the timbre of each instrument can be visualized and analyzed. Moreover, new instrument audio material can be projected onto the space, which can be considered a regression task. In this paper, we create a Chinese instrument solo note dataset by first gathering audio data from the database created in [3] and then cutting it into isolated notes. Then, an end-to-end Convolutional Recurrent Neural Network (CRNN), which is highly effective for sequence data like audio, is proposed. Finally, we propose a multitask learning approach that combines the two tasks, i.e., instrument recognition and timbre space regression, to improve the performance of both. Our experimental results show that the proposed model performs better than other single-task baseline algorithms, and the multitask learning approach further improves the performance.

The rest of the paper is organized as follows. Section 2 discusses previous studies in instrument timbre space modeling and instrument recognition. Section 3 introduces the dataset we created. Section 4 presents the details of the methodology. Section 5 describes the experiment setup and reports experimental results. Finally, in Sect. 6, we draw conclusions.

2 Related Work

For timbre space modeling, [4] is one of the earliest works in this area, in which the timbre of 35 voiced and unvoiced speech sounds and musical sounds is studied using 30 pairs of subjective evaluation terms. Subsequent research in this field has become more and more mature, but only for Western instruments, while Chinese national instruments were not studied until recently. Jiang et al. [5] constructed a timbre space for up to 48 Chinese national instruments (orchestral and minority instruments) and 24 Western orchestral instruments. 16 subjective timbre evaluation terms were


selected, such as bright, dark, raspy, mellow and so on. Several subjective experiments were conducted, and a 3-D timbre perception feature model was constructed using the multidimensional scaling technique. Following the trajectory of this research, [6] clustered the same 16 subjective timbre evaluation terms into 4 categories and proposed an instrument recognition model based on the timbre of each instrument. The author conducted further research in [7], where a 2D timbre perception space is constructed using Kernel Principal Component Analysis (kPCA) and the two axes are determined using an SVM. Moreover, a timbre space regression experiment was conducted; it is reported that using 98 timbre features together with k-Nearest Neighbors (KNN) as the regression algorithm produces the best result. In Sect. 4, we will propose our multitask learning approach based on [7].

For monophonic instrument recognition, numerous features and approaches have been used and proposed. Among all the timbre-related features, MFCC is the most frequently used one. In [8], MFCC and several spectral features are sent to an SVM. The performances of different features are compared separately; MFCC outperformed the other features and achieved an accuracy of 87.23%. It is notable that this is one of the few studies that included Chinese instruments: its dataset contains 13 Chinese national instruments and 13 Western instruments. MFCC combined with higher-order spectral features was used in [9] and sent to a Counter Propagation Neural Network to identify 19 instruments and their families. Also, [10] used a Hidden Markov Model (HMM) with MFCC features to classify 4 instruments, and an accuracy of 90.5% was achieved. Theoretically, an HMM should be suitable for processing audio data since it can deal with data of various lengths, which distinguishes it from previous research that compresses features into a fixed-length feature vector and thus loses some information. A recent study [11] used MFCC features and a 6-layer Fully Connected Neural Network to classify 20 instruments from the London Philharmonic Orchestra dataset and achieved an accuracy of 97%. An interesting approach is that of [12], where the authors used sparse filtering, an unsupervised feature learning method, for feature extraction; a Support Vector Machine (SVM) is then used as the classifier. In addition, [13] used different parts of the instrument audio, e.g., the attack of the audio, the initial 100 Hz of the frequency spectrum and so on, to identify 8 instruments.

Many of these previous approaches have limitations. Since they use hand-crafted features, they not only require a large amount of preprocessing work but also find it difficult to fully capture the complexity of timbre. Some of them use very few instruments, such as 4 or 8 [10, 13]. Moreover, though widely used in polyphonic instrument recognition [14–16], the end-to-end deep learning approach lacks due attention in monophonic instrument recognition. The reason could be that existing algorithms have already achieved almost perfect results on some of the publicly available Western instrument datasets [17]. Since our research focuses on Chinese national instruments, whose timbre is more diverse [5], together with a new dataset, some improvements can still be made using an end-to-end deep learning method.


3 Dataset

First, we gathered audio data of 30 Chinese national instruments from the Chinese instrument database constructed in [3]. Since the original data is not solo note audio, we cut the audio into isolated notes using both automatic and manual methods. All sound files are down-sampled to a 22.05 kHz sampling rate with 16 bits per sample and mixed down to a single-channel waveform. As a result, the dataset contains normal solo notes over the whole pitch range of each instrument played at a forte dynamic. Playing techniques are also included for most instruments (e.g., staccato, glissando, etc. for the Erhu). This resulted in 2638 audio files with a total duration of 1.25 h. The duration of each audio clip ranges from 0.07 to 13 s. The content of the dataset is shown in Table 1; for each instrument we give its English name together with its Chinese name.

Table 1 Dataset content: number of samples and total duration (s) for each of the 30 instruments, namely Gaohu (高胡), Erhu (二胡), Zhonghu (中胡), Banhu (板胡), Bangdi (梆笛), Qudi (曲笛), Xindi (新笛), Xiao (箫), Xun (埙), Bawu (巴乌), Soprano Sheng (高音笙), Tenor Sheng (中音笙), Bass Sheng (低音笙), Soprano Suona (高音唢呐), Tenor Suona (中音唢呐), Bass Suona (低音唢呐), Soprano Guan (高音管), Tenor Guan (中音管), Bass Guan (低音管), Pipa (琵琶), Yangqin (扬琴), Alto Ruan (中阮), Bass Ruan (大阮), Liuqin (柳琴), Sanxian (三弦), Guzheng (古筝), Guqin (古琴), Konghou (箜篌), Yunluo (云锣) and Bell chimes (编钟)


4 Methodology

In this section, we first describe the timbre space research that we utilize, then present the proposed model and multitask learning approach.

4.1 Timbre Space

The timbre space regression part of our multitask learning approach is based on the research of [7]. In that research, 16 subjective timbre evaluation terms are annotated by professional musicians after listening to a 3 s long audio clip of each instrument. Then, kPCA is used to project the timbre terms onto a 2D space where each quadrant corresponds to a timbre cluster obtained by the K-means algorithm. The two axes of the 2D space are determined by the decision boundaries of an SVM. The resulting 2D timbre space is shown in Fig. 1. By constructing such a 2D timbre space, the audio of a new instrument can be projected onto the space using its timbre features with a regression algorithm, thus facilitating instrument timbre analysis and visualization. A timbre space regression experiment was then conducted in that research: 98 timbre features were sent to different regression algorithms and their performances were compared, and KNN was reported to achieve the best result.

Put simply, we are in fact using timbre information from the 16 subjective timbre terms, which is reduced to 2 dimensions in the end. These terms, annotated by professional musicians, can be considered human-intuitive

Fig. 1 2-D Timbre space constructed in [7]. Erhu is used as an example to display the timbre space regression result


timbre features. Our hypothesis is that these human-intuitive timbre features are high-level features that provide more information about the timbre of an instrument than human-designed features or features learned by a neural network. Combining them with the instrument recognition task, which also utilizes timbre information, could therefore increase the performance of both tasks.
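As an illustration of the baseline pipeline summarised above (a minimal sketch, not the code of [7]), the following Python fragment projects 16-term annotations into a 2-D space with kernel PCA and fits a KNN regressor from hand-crafted audio features to the 2-D coordinates. The array shapes, kernel choice and neighbour count are assumptions.

```python
# Sketch of the timbre-space baseline: 16 subjective term ratings -> 2-D space via kernel PCA,
# then a KNN regressor maps audio features of new recordings onto that space.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
term_ratings = rng.random((72, 16))    # placeholder: 72 instruments x 16 timbre terms
audio_features = rng.random((72, 98))  # placeholder: 98 hand-crafted timbre features

# Project the 16-dimensional term ratings onto a 2-D perceptual space.
kpca = KernelPCA(n_components=2, kernel="rbf")
space_2d = kpca.fit_transform(term_ratings)

# Learn a mapping from audio features to the 2-D coordinates (timbre space regression).
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(audio_features, space_2d)

new_features = rng.random((1, 98))     # features of a new recording
print(knn.predict(new_features))       # predicted (X, Y) position in the timbre space
```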

4.2 Model and Multitask Learning Approach

CRNNs are used in many audio-related and MIR tasks [18–20]. The convolutional part can automatically and effectively extract features from the input spectrogram, and the recurrent part can handle data of various lengths and send fixed-length features to fully connected layers for classification and regression. Here we design a simple end-to-end CRNN framework, since our dataset is small. The model consists of 3 convolutional blocks, 2 GRU layers, 2 fully connected layers and 2 output fully connected layers for the two tasks. Each convolutional block contains a 3 × 3 convolutional layer, a batch normalization layer and a max-pooling layer. The structure of our model is illustrated in Fig. 2. The two tasks share the same structure; the only difference is the size of the last output fully connected layer, with 30 output channels for the instrument recognition task and 2 for timbre regression. Keeping one of the output layers and discarding the other yields the corresponding single-task ablated version of the model. For the input, we use a Mel-scale spectrogram with 128 Mel bins. The losses for instrument recognition and timbre space regression are the cross-entropy loss and the mean-squared error (MSE) loss respectively. For the second task, the loss is the sum of two MSE losses corresponding to the two axes of the 2D timbre space. The multitask learning approach is achieved by summing the two losses, as shown in Eq. (1):

Fig. 2 Illustration of our model


Loss_{multitask} = \lambda \, Loss_{CE} + \mu \, (Loss_{MSE-X} + Loss_{MSE-Y})    (1)

where Loss_{CE} is the cross-entropy loss, while Loss_{MSE-X} and Loss_{MSE-Y} are the MSE losses for the X axis and the Y axis of the timbre space respectively. λ and μ are two weight factors; we set λ to 1 and μ to 2.
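A minimal PyTorch sketch of the CRNN described above and of the combined loss in Eq. (1) is given below. Only the overall structure (three 3 × 3 convolutional blocks with batch normalization and max-pooling, two GRU layers, two shared fully connected layers, and the 30-way and 2-way output heads) follows the description; the layer widths, GRU hidden size and input length are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CRNNMultitask(nn.Module):
    def __init__(self, n_mels=128, n_classes=30):
        super().__init__()
        def block(cin, cout):  # 3x3 conv + batch norm + max pooling
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU(), nn.MaxPool2d(2))
        self.conv = nn.Sequential(block(1, 16), block(16, 32), block(32, 64))
        self.gru = nn.GRU(64 * (n_mels // 8), 128, num_layers=2, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                                nn.Linear(64, 64), nn.ReLU())
        self.head_cls = nn.Linear(64, n_classes)  # instrument recognition head
        self.head_reg = nn.Linear(64, 2)          # (X, Y) timbre space regression head

    def forward(self, mel):                       # mel: (batch, 1, n_mels, time)
        h = self.conv(mel)                        # -> (batch, 64, n_mels/8, time/8)
        h = h.permute(0, 3, 1, 2).flatten(2)      # -> (batch, time/8, 64 * n_mels/8)
        _, h_n = self.gru(h)                      # last hidden state summarises the sequence
        z = self.fc(h_n[-1])
        return self.head_cls(z), self.head_reg(z)

model = CRNNMultitask()
logits, coords = model(torch.randn(4, 1, 128, 256))    # dummy mel-spectrogram batch
lam, mu = 1.0, 2.0                                      # weight factors of Eq. (1)
loss = (lam * nn.CrossEntropyLoss()(logits, torch.randint(0, 30, (4,)))
        + mu * (nn.MSELoss()(coords[:, 0], torch.rand(4))
                + nn.MSELoss()(coords[:, 1], torch.rand(4))))
loss.backward()
```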

5 Experiments

For both tasks, we conduct a multitask experiment and a single-task ablation study. In the instrument recognition study, we compare with 3 existing algorithms [8, 10, 11], whose methods are reviewed in Sect. 2. For [8], we concatenate all the features, including MFCC and several spectral features, and send them to the classifier together rather than separately, as this produces a better result. For the timbre space regression task, we choose [7], whose method is described in Sect. 4.1, as the baseline algorithm. All the algorithms above are reproduced in Python.

5.1 Experiment Setup

The framework of our model is implemented in PyTorch. The model is trained on an Nvidia 1050 Ti GPU. We set the batch size to 16, and the Adam optimizer is applied as an adaptive optimizer with a weight decay of 0.0001 and a learning rate of 0.001. Five-fold cross-validation is utilized to make the results more reliable. The metrics are accuracy and F-measure for instrument recognition, and MSE and R2 score for timbre space regression.
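The sketch below, reusing the CRNNMultitask class from the previous block, shows one way the stated settings (batch size 16, Adam with a learning rate of 0.001 and weight decay of 0.0001, five-fold cross-validation) could be wired together; the dummy tensors and the number of epochs are placeholders.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import KFold
from torch.utils.data import DataLoader, Subset, TensorDataset

# Dummy stand-in data: 100 mel-spectrogram excerpts, instrument labels and (X, Y) coordinates.
data = TensorDataset(torch.randn(100, 1, 128, 256),
                     torch.randint(0, 30, (100,)),
                     torch.rand(100, 2))

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # five-fold cross-validation
for train_idx, test_idx in kf.split(np.arange(len(data))):
    model = CRNNMultitask()                            # model sketched in Sect. 4.2
    loader = DataLoader(Subset(data, list(train_idx)), batch_size=16, shuffle=True)
    optim = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    for epoch in range(50):                            # number of epochs is an assumption
        for mel, label, coords in loader:
            optim.zero_grad()
            logits, pred = model(mel)
            loss = (nn.CrossEntropyLoss()(logits, label)
                    + 2.0 * (nn.MSELoss()(pred[:, 0], coords[:, 0])
                             + nn.MSELoss()(pred[:, 1], coords[:, 1])))
            loss.backward()
            optim.step()
    # evaluate accuracy / F-measure (classification) and MSE / R2 (regression)
    # on Subset(data, list(test_idx)) for this fold
```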

5.2 Instrument Recognition Results and Analysis

5.2.1 Results and Analysis

The results of instrument recognition are shown in Table 2. It can be seen that our single-task model already performs better than the 3 baseline algorithms, with an accuracy of 93.85% and an F-measure of 0.900. Among the baseline algorithms, [8] produces the best result, which shows the effectiveness of using several features rather than one as input. Though theoretically suitable for processing data of various lengths, [10] does not perform as expected and produces the worst result. This demonstrates that the CNN part of our model can learn features better than hand-crafted ones, while the RNN part is better at dealing with data of various lengths.


Table 2 Results for instrument recognition experiment, with mean and standard deviation of 5 folds in brackets

Method | Accuracy | F-measure
[8] | 92.45% (0.0130) | 0.884 (0.0359)
[10] | 80.16% (0.0119) | 0.752 (0.0223)
[11] | 84.89% (0.0154) | 0.798 (0.0386)
Proposed model (single-task version) | 93.85% (0.0139) | 0.900 (0.0097)
Proposed model (multitask version) | 94.66% (0.0102) | 0.920 (0.0124)

The result of our model is further improved by the multitask approach, achieving 94.66% accuracy and an F-measure of 0.920. This indicates that the multitask learning approach can make use of the timbre information carried by the subjective timbre evaluation terms in the timbre space regression task.

5.2.2 Confusion Matrix Analysis

We give the confusion matrix for each instrument and instrument family in one of the cross-validation experiments in Fig. 3. Instrument families are assigned according to the Sachs-Hornbostel system, resulting in 5 families, which are shown in the confusion matrix in 5 colors: Bowed string instrument (Red), Edge wind instrument (Orange), Reed wind instrument (Blue), Plucked and struck string instrument (Green), Percussion instrument (Black). It can be seen from Fig. 3 that the Konghou is classified with the maximum error; only 38% of its labels are correctly classified. The second largest errors are observed for the Guzheng and the Yunluo. At the family level, a number of classification errors occur in the Plucked and struck string family, which barely exists among Western orchestral instruments, and in the Chinese percussion family, compared to the other families. This shows that the timbres of Chinese plucked and struck string instruments are so similar yet complicated that the algorithm finds them hard to distinguish, especially when numerous playing techniques are considered. This could also explain why the baseline algorithms achieve worse results on Chinese national instruments than on Western orchestral instruments.
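As a small illustration of this family-level analysis (a sketch with placeholder labels rather than the full 30-class setup), per-instrument predictions can be mapped to their Sachs-Hornbostel-style families and a row-normalized confusion matrix computed with scikit-learn:

```python
from sklearn.metrics import confusion_matrix

family_of = {                      # hypothetical partial mapping of instruments to families
    "Erhu": "Bowed string", "Banhu": "Bowed string",
    "Xiao": "Edge wind", "Bawu": "Reed wind",
    "Pipa": "Plucked and struck string", "Konghou": "Plucked and struck string",
    "Yunluo": "Percussion",
}

y_true = ["Erhu", "Pipa", "Konghou", "Xiao", "Yunluo", "Bawu"]   # placeholder labels
y_pred = ["Erhu", "Konghou", "Pipa", "Xiao", "Konghou", "Bawu"]  # placeholder predictions

families = sorted(set(family_of.values()))
cm = confusion_matrix([family_of[t] for t in y_true],
                      [family_of[p] for p in y_pred],
                      labels=families, normalize="true")
print(families)
print(cm.round(2))
```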

5.3 Timbre Space Regression Results and Analysis

Timbre space regression results are shown in Table 3. From the results we can see that our single-task model surpasses the baseline algorithm in MSE and R2 score, achieving 0.0058 and 0.8922 for the X-axis, and 0.0065 and 0.7970 for the Y-axis, while the multitask version further outperforms the single-task version. The


Fig. 3 Normalized confusion matrix of each instrument and each instrument family. The color for each instrument family is: Bowed string instrument (Red), Edge wind instrument (Orange), Reed wind instrument (Blue), Plucked and struck string instrument (Green), Percussion instrument (Black)

model achieves an MSE of 0.0051 and an R2 score of 0.9046 for the X-axis, and 0.0056 and 0.8245 for the Y-axis. This indicates that more timbre information is learned through the instrument recognition task than from hand-crafted features or from features learned solely within the timbre space regression task, ultimately improving the performance of timbre space regression.


Table 3 Results for timbre space regression experiment, with mean and standard deviation in brackets

Method | MSE (X axis) | R2 Score (X axis) | MSE (Y axis) | R2 Score (Y axis)
[7] | 0.0068 (0.0005) | 0.8755 (0.0097) | 0.0078 (0.0005) | 0.7564 (0.0189)
Proposed model (single-task version) | 0.0058 (0.0003) | 0.8922 (0.0046) | 0.0065 (0.0009) | 0.7970 (0.0223)
Proposed model (multitask version) | 0.0051 (0.0006) | 0.9046 (0.0104) | 0.0056 (0.0011) | 0.8245 (0.0270)

6 Conclusion

In this paper, a CRNN model is proposed that not only acts as an effective feature extractor but can also deal with audio data of various lengths. A multitask learning approach is proposed that combines the instrument recognition task and the instrument timbre space regression task to improve the performance of both. A Chinese instrument solo note dataset is created for the two tasks. Experimental results show that our proposed model outperforms existing methods in both tasks, and the multitask learning approach further improves the results. This indicates the effectiveness of using two timbre-related tasks to improve both performances. Also, a number of classification errors are observed in the plucked and struck string instrument family, which demonstrates, to some extent, the timbre diversity and classification difficulty of Chinese national instruments.

Acknowledgements This work was supported by National Key R&D Program of China (2019YFC1711800), NSFC (62171138).

References

1. Sankaye SR, Mehrotra SC, Tandon US (2015) Indian musical instrument recognition using modified LPC features. Int J Comput Appl 122(13):6–10
2. Ibrahim R, Senan N (2012) Soft set theory for automatic classification of traditional Pakistani musical instruments sounds. In: International conference on computer & information science, vol 1. IEEE, pp 94–99
3. Liang X, Li Z, Liu J, Li W, Zhu J, Han B (2019) Constructing a multimedia Chinese musical instrument database. In: Proceedings of the 6th conference on sound and music technology (CSMT). Springer Singapore, pp 53–60
4. Bismarck GV (1974) Timbre of steady sounds: a factorial investigation of its verbal attributes. Acta Acust Acust 30(3):146–159
5. Jiang W, Liu J, Zhang X et al (2020) Analysis and modeling of timbre perception features in musical sounds. Appl Sci 10(3):789
6. Jiang Y, Sun X, Liang X et al. Analysis of Chinese instrument timbre based on objective features. J Fudan Univ (Nat Sci) 59(3):346–353


7. Jiang Y (2019) The application of computer audition on automatic phonation modes classification and perceptual timbre space construction. Fudan University, Shanghai
8. Liu J, Xie L (2010) SVM-based automatic classification of musical instruments. In: 2010 international conference on intelligent computation technology and automation, vol 3. IEEE, pp 669–673
9. Bhalke DG, Rao CR, Bormane D (2016) Hybridisation of mel frequency cepstral coefficient and higher order spectral features for musical instruments classification. Archives of Acoustics 41(3):427–436
10. Jeyalakshmi C, Murugeshwari B, Karthick M (2018) HMM and K-NN based automatic musical instrument recognition. In: 2018 2nd international conference on I-SMAC (IoT in social, mobile, analytics and cloud). IEEE, pp 350–355
11. Mahanta SK, Khilji A, Pakray P (2021) Deep neural network for musical instrument recognition using MFCCs. Computación y Sistemas 25(2):351–360
12. Han Y, Lee S, Nam J et al (2016) Sparse feature learning for instrument identification: effects of sampling and pooling methods. J Acoust Soc Am 139(5):2290–2298
13. Toghiani-Rizi B, Windmark M (2017) Musical instrument recognition using their distinctive characteristics in artificial neural networks. arXiv preprint arXiv:1705.04971
14. Li P, Qian J, Wang T (2015) Automatic instrument recognition in polyphonic music using convolutional neural networks. arXiv preprint arXiv:1511.05520
15. Han Y, Kim J, Lee K (2017) Deep convolutional neural networks for predominant instrument recognition in polyphonic music. IEEE/ACM Trans Audio Speech Lang Process 25(1):208–221
16. Hung YN, Chen YA, Yang YH (2019) Multitask learning for frame-level instrument recognition. In: 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 381–385
17. Lostanlen V, Andén J, Lagrange M (2018) Extended playing techniques: the next milestone in musical instrument recognition. In: Proceedings of the 5th international conference on digital libraries for musicology, pp 1–10
18. Sang J, Park S, Lee J (2018) Convolutional recurrent neural networks for urban sound classification using raw waveforms. In: 26th European signal processing conference (EUSIPCO), pp 2444–2448
19. Phan H, Koch P, Katzberg F et al (2017) Audio scene classification with deep recurrent neural networks. In: Interspeech 2017, pp 3043–3047
20. Chen MT, Li BJ, Chi TS (2019) CNN based two-stage multi-resolution end-to-end model for singing melody extraction. In: 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 1005–1009

Design of Blowing Instrument Controller Based on Air Pressure Sensor

Rongfeng Li and Yiran Yuan

Abstract This paper presents the design of a wind instrument controller for Chinese folk music. Since the birth of MIDI, the electronic musical instrument industry has become increasingly mature through constant iteration. This paper aims to reproduce the air pressure-amplitude mapping of real musical instruments and designs a wind instrument controller based on the Sheng, making a preliminary attempt to fill the gap in the market for national wind instrument controllers.

Keywords Air pressure · Blowing instrument controller · ESP32

1 Introduction

1.1 Existing Products

For Western music, various wind instrument controllers with timbre and performance experience comparable to real instruments have been published. Beginning with the Lyricon [1] in the early 1970s, electronic wind instruments with different technologies and functions, such as the WX [2] series from Yamaha and the Aerophone [3] series from Roland, have been created. With Birl [4], Eolos [5] and other experimental instruments exploring machine learning and cost control, the market has kept improving. However, up to now, the amplitude control of wind instrument controllers has come directly from transferring air sensor data to the amplitude envelope, without paying attention to the relationship between blowing intensity and the instrument's response during real playing. Therefore, there are still shortcomings in reproducing the wind instrument playing experience.

R. Li (B) Beijing Key Laboratory of Network System and Network Culture, Beijing University of Posts and Telecommunications, Beijing 100876, China e-mail: [email protected] Y. Yuan Beijing University of Posts and Telecommunications, Beijing 100876, China © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 X. Shao et al. (eds.), Proceedings of the 9th Conference on Sound and Music Technology, Lecture Notes in Electrical Engineering 923, https://doi.org/10.1007/978-981-19-4703-2_2


On the other hand, due to the uniqueness of Chinese national wind instruments in material, structure and playing technique, electronic instruments designed for Western music are incompatible with them. At present, the common solution is to focus on developing folk music sound libraries and applying them to existing wind instrument controllers or keyboards, so that folk music is played in the Western way; this has spawned excellent folk music sound libraries such as Qin Engine [6] and Silk [7]. For instruments with a relatively simple sound structure, such as the Xiao and the Xun, mobile applications have also attempted digitization, but it is difficult to reproduce performance feedback owing to the lack of supporting hardware. Therefore, taking the folk instrument Sheng as the research object, this paper aims to improve the playing experience of electronic wind instruments by collecting actual air pressure and amplitude data and simulating, through machine learning, how a real instrument responds to blowing force.

1.2 Goals

The purpose of this study is to use a sensor together with a mobile device to present digital wind instrument performance that is essentially the same as that of a real wind instrument. The controller should therefore simulate a real instrument in both playing style and sound, and allow the precise control required by most performance techniques. There are three main issues to be addressed: (1) design and manufacture a detection device based on an air pressure sensor to simulate the real blowing experience; (2) build a pressure-amplitude mapping through machine learning to simulate the real loudness of the instrument; (3) design the digital instrument playing interface to reproduce real playing techniques and timbres. Figure 1 shows the overall flow diagram of the program.


Fig. 1 Flow diagram of the program

2 Related Work

As mentioned above, the market for electronic national wind instruments addressed in this paper is still in its infancy, and there is no specific project or product to use as a direct reference. The related work below is therefore organized according to the individual components of the project.

2.1 The Sound Structure of a Free Reed Instrument

In order to simulate the performance of the Sheng, it is necessary to understand its structure and the way it produces sound. The Sheng is a typical free reed wind instrument [8]: air flowing through a free reed drives the air column in the corresponding pipe into resonance. Each pipe sounds independently, and a pipe sounds only when its sound hole is blocked. Different performance techniques can therefore be quantified as changes in breath intensity. At the same time, the Sheng is a discrete-pitch instrument, with no glides, pitch bends or similar techniques. For a given pitch, the only variable during playing is the breath intensity, which makes the Sheng suitable for modeling.


2.2 Amplitude Feature Extraction of Musical Instrument Audio

Theoretically, the amplitude envelope is the line connecting the crests of the waveform in the time-domain image of the audio. However, because of overtones, simply connecting the local maxima of the curve does not yield an ideal envelope. Instead, the audio needs to be divided into frames: an appropriate window function, window length and hop size are chosen to obtain all the frame signals, the peak within each frame is selected, and the resulting line approximates the amplitude envelope of the audio.
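A minimal Python sketch of this frame-peak envelope extraction is given below; the frame length, hop size and file name are assumptions, not the values used in this work.

```python
import numpy as np
import librosa

def amplitude_envelope(path, frame_length=1024, hop_length=512):
    y, sr = librosa.load(path, sr=44100, mono=True)
    # Split the signal into overlapping frames and keep the absolute peak of each frame.
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    envelope = np.abs(frames).max(axis=0)
    times = librosa.frames_to_time(np.arange(len(envelope)), sr=sr, hop_length=hop_length)
    return times, envelope

# times, env = amplitude_envelope("sheng_note.wav")  # hypothetical recording
```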

3 Mouthpiece Design Based on Air Pressure Sensor

The hardware part of the wind instrument controller is used to detect the change in air pressure when blowing. When building the pressure-amplitude mapping, the hardware part is connected to the real instrument to collect real-time pressure data during performance. When playing with the digital wind instrument controller, the hardware part acts as the mouthpiece of the instrument, and the amplitude envelope of the sound source is controlled by the reading of the pressure sensor. Over several iterations, this paper tried a number of different design schemes.

3.1 Straight Pipe Design

In the original design (Fig. 2), a slot is cut in a straight pipe and a BMP280 [9] pressure sensor is inserted and sealed, with the measuring element fully inside the pipe and covering half of its cross-section. However, the sensor is in direct contact with the blown air, so the measurement results contain errors caused by temperature and humidity. Moreover, as the playing time increases, water vapour cannot be discharged immediately and accumulates in the pipe, which further increases the systematic error.

Fig. 2 Schematic diagram of straight pipe mouthpiece


Fig. 3 A model of the mouthpiece from Eolos [5]

In addition, the sensor is placed directly in the pipe in a semi-open state, and the airflow has to bypass the sensor as it passes the sensor panel, creating small eddies near the sensor. These eddies disturb the measurement results to some extent, making the data less stable.

3.2 Tee Pipe Design

To solve this problem, Juan Mariano Ramos designed the Eolos [5] in 2019 using a three-way (tee) pipe, with a branch extending from the main pipeline for the sensor measurement. Figure 3 shows its theoretical model. This design successfully solves the systematic error of the straight pipe mouthpiece: by adding the branch, the sensor is no longer in direct contact with the blown air, which greatly reduces the influence of the water vapour and temperature of the exhaled breath on the sensor. After testing, the air pressure in the static state can basically be stabilized at a constant value, with little drift as the playing time increases. The disadvantage of this scheme is that it cannot distinguish between blowing and sucking. Since the sectional area of the blowing end is the same as that of the instrument end, whenever air flows in the main pipe, regardless of the flow direction, the air pressure in the main pipe is lower than that in the branch, so the "static air" in the branch moves towards the main pipe and the pressure sensor detects a decrease.

3.3 Tee Pipe with Unequal Sectional Areas

To improve this situation, it is necessary to increase the resistance to air flowing through the main pipeline in one direction and reduce it in the opposite direction. This can be achieved with a tee pipe of unequal sectional areas. The mouthpiece comprises three openings: the blowing end, the end towards the instrument and the end towards the sensor. Here, "unequal" means that the sectional areas of the blowing end and the instrument end are not equal; specifically,


the sectional area of the blowing end is equal to the sum of the areas of the instrument end and the sensor end. For the whole pipeline, the volume of gas flowing in and out per unit time is equal, and this gas volume can be expressed as the product of sectional area and flow velocity. According to Bernoulli's principle in fluid mechanics, the greater the flow velocity, the smaller the pressure, and the smaller the flow velocity, the greater the pressure. Since the sectional area of the blowing end is equal to the sum of the sectional areas of the instrument end and the sensor end, the inflow and outflow velocities are equal and the pressures are equivalent, without causing an additional pressure increase or decrease.

After confirming the feasibility of the design idea, the appearance of the mouthpiece was further optimized. There are various types of Sheng, and different types have differently shaped mouthpieces. For example, smaller Sheng, such as the holding Sheng, have a mouthpiece with a thick edge that the player's lips must press tightly against, while the mouthpiece of larger Sheng is instead held in the mouth. Since the player needs to operate the keyboard of the mobile app by hand, the mouthpiece must stay fixed in the mouth while blowing. Referring to the mouthpiece design of the larger row Sheng, the cross-section is therefore changed from round to oval, and two depressions are added on the upper and lower sides for the incisors to bite on, which increases friction and prevents the mouthpiece from slipping out. A window is opened on the side of the branch used for sensor detection so that the sensor can be placed inside, making sealing easier and reducing the possibility of leakage. The mouthpiece was modeled and 3D printed (Fig. 4), and the final version is shown in Fig. 5.

Fig. 4 Standard model diagram of the mouthpiece (in millimeter)

Fig. 5 Mouthpiece model


4 Mapping

4.1 The Dataset

The instrument audio is saved as WAV files with a sampling rate of 44,100 Hz, which cannot be used directly for linear regression; its amplitude envelope needs to be extracted first. For linear regression, the closer the amplitude envelope is to reality, the better the fitting result, so dividing the audio into frames and taking the amplitude peak in each frame as the amplitude envelope is the best choice (Fig. 6). When collecting data, pressure and audio are recorded separately; the detection frequency of the pressure sensor is much lower than the audio sampling rate, and it is difficult to synchronize the start and end points of both. Therefore, to reduce the deviation as much as possible, we shorten each recording and average over multiple recordings, collecting performance data in different environments and at different pitches. The dataset is thus a collection of 5 performance recordings, each lasting about 30 s, resulting in about 100,000 data points. Figure 7 shows the audio and pressure data collected during one of the performance recordings.

Fig. 6 The amplitude envelope is obtained by extracting the frame peak


Fig. 7 Data record of the performance: (a) audio record of the performance; (b) pressure record of the performance

4.2 Data Normalization

Due to hardware limitations, the sampling rate of the pressure values, at about 1000 Hz, is much lower than that of the audio, and the number of audio samples changes again after the amplitude envelope is computed. Using the interpolation methods provided by Python's SciPy library, the pressure data are upsampled to meet the basic requirement of data fitting that the independent-variable and dependent-variable datasets have the same size. During data acquisition, the air pressure and audio streams also start at slightly different times because of the acquisition method, the experimental environment and other factors, so the starting positions of the two are aligned through endpoint detection.


Fig. 8 Endpoint detection result

This paper achieves this alignment by calculating the RMS; the result is shown in Fig. 8.
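The sketch below illustrates this normalization step with SciPy interpolation and a simple RMS threshold for endpoint detection; the frame size and threshold value are assumptions.

```python
import numpy as np
from scipy.interpolate import interp1d

def upsample(pressure, target_len):
    # Stretch the ~1 kHz pressure stream to the length of the amplitude envelope.
    x_old = np.linspace(0.0, 1.0, num=len(pressure))
    x_new = np.linspace(0.0, 1.0, num=target_len)
    return interp1d(x_old, pressure, kind="linear")(x_new)

def onset_index(signal, frame=256, threshold=0.05):
    # The first frame whose RMS rises above the threshold marks the starting point.
    rms = np.array([np.sqrt(np.mean(signal[i:i + frame] ** 2))
                    for i in range(0, len(signal) - frame, frame)])
    return int(np.argmax(rms > threshold) * frame)

# p = upsample(pressure, len(envelope))
# p, env = p[onset_index(p - np.median(p)):], envelope[onset_index(envelope):]
```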

4.3 Neural Network Design

The pressure-amplitude mapping is a single-variable curve, and general experience tells us that the harder the player blows, the louder the instrument sounds; that is, the change in pressure is positively correlated with the amplitude. In the actual fitting process, the prediction function and the corresponding fitting method need to be selected first. Plotting the amplitude envelope against the pressure data, with the pressure change on the abscissa, an S-shaped curve can be observed, as in Fig. 9. The curve shows that the pressure-amplitude mapping has an activation threshold and an amplitude limit during blowing, while in between the amplitude increases almost linearly with the pressure. The prediction function can therefore be set as a first-order equation, and linear regression is used to fit the pressure-amplitude mapping.


Fig. 9 Mapping prediction

Let x be the independent variable (pressure), y the true value of the dependent variable (amplitude), z the predicted value, and i the index of an arbitrary data point. The linear prediction function is:

$$z_i = x_i w + b \tag{1}$$

The mean square error is generally used to measure the loss of the current fit: the prediction function is applied to the data to obtain the predicted values, the squared difference between the predicted value and the true value is computed for each data point, and the average of these squared losses is taken. For a single data point the formula is:

$$loss(w, b) = \frac{1}{2}(z_i - y_i)^2 \tag{2}$$

After obtaining the current error, the fit is improved according to that error. The model is updated iteratively with mini-batch gradient descent. When m samples participate in the calculation, the loss function becomes:

$$J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}(z_i - y_i)^2 \tag{3}$$

The gradient descent method requires the partial derivatives of J with respect to w and b, as follows:


$$\frac{\partial J}{\partial w} = \sum_{i=1}^{m}\frac{\partial J}{\partial z_i}\frac{\partial z_i}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}(z_i - y_i)\,x_i = \frac{1}{m}X^{T}\cdot(Z - Y) \tag{4a}$$

$$\frac{\partial J}{\partial b} = \sum_{i=1}^{m}\frac{\partial J}{\partial z_i}\frac{\partial z_i}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(z_i - y_i) = \frac{1}{m}\sum(Z - Y) \tag{4b}$$
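For illustration, a minimal NumPy sketch of the batch gradient-descent update described by Eqs. (1)–(4); this is not the TensorFlow implementation used in the paper, and the names are ours:

```python
import numpy as np

def fit_linear(x, y, lr=0.01, iters=5000):
    """Fit y ≈ x*w + b by gradient descent on the MSE loss of Eq. (3)."""
    w, b = 0.0, 0.0
    m = len(x)
    for _ in range(iters):
        z = x * w + b                 # Eq. (1): predictions
        dw = (x * (z - y)).sum() / m  # Eq. (4a)
        db = (z - y).sum() / m        # Eq. (4b)
        w -= lr * dw
        b -= lr * db
    return w, b

# w, b = fit_linear(pressure_up, envelope)   # pressure -> amplitude mapping
```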

After determining the fitting method, TensorFlow was selected to construct the regression model. After repeated attempts, the fitting results shown in Fig. 10 were obtained with stochastic gradient descent at a learning rate of 0.01. Judging from the plot, the slope and intercept of the fitted line are basically in line with the pre-analysis, although the real audio amplitude never reaches the maximum value of 1, so the fitted line introduces some distortion when reconstructing the amplitude. The fitted amplitude envelope is obtained by substituting the original pressure values into the fitted mapping formula, as shown in Fig. 11. The fitted envelope basically coincides with the original amplitude and restores the real situation; due to procedural errors (starting point, amplitude peaks, etc.) the two curves cannot coincide completely, but the deviation is still within an acceptable range.

Fig. 10 Mapping result


Fig. 11 Mapping formula application

5 Implementation of the Blowing Instrument Controller

5.1 Hardware Communication Based on Wi-Fi Module

The wind instrument controller is divided into two modules: hardware and software. Playing the instrument requires controlling the pitch and the amplitude at the same time, so a reliable and fast communication channel is needed between the hardware controlling the amplitude and the software controlling the pitch. By connection mode, the options can be roughly divided into wired and wireless. In the implementation, wireless transmission based on the ESP32 [10] Wi-Fi module is preferred, for the following reasons: (1) Wireless transmission gives full play to the portability of an electronic instrument. (2) Wireless transmission has lower requirements on the mobile terminal and none on data cables and interfaces, so no adaptation problem arises when the mobile device is changed. (3) The delay introduced by wireless transmission is about 50 ms, which is within the acceptable range.
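To illustrate the communication pattern only (the paper's receiver is implemented with Unity's Socket module, not Python), here is a hedged Python sketch of a TCP client reading pressure values sent by the ESP32; the host, port and line-based message format are our assumptions:

```python
import socket

ESP32_HOST = "192.168.4.1"   # assumed: default IP of the ESP32 access point
ESP32_PORT = 8080            # assumed port

def read_pressure_stream():
    """Connect to the mouthpiece and yield pressure readings line by line."""
    with socket.create_connection((ESP32_HOST, ESP32_PORT)) as sock:
        buffer = b""
        while True:
            buffer += sock.recv(1024)
            while b"\n" in buffer:
                line, buffer = buffer.split(b"\n", 1)
                yield float(line)      # one pressure value per line (assumed)

# for pressure in read_pressure_stream():
#     amplitude = max(0.0, min(1.0, pressure * w + b))  # apply the fitted mapping
```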

5.2 Preprocessing of Audio Source

Generally, there are two ways to create the sound source of an electronic musical instrument. One is to reproduce the sound of the traditional instrument through


waveform synthesis, and the other is to reprocess recorded fragments of traditional instrument performances into standard sound sources. This paper adopts the latter approach. The Multimedia Chinese Musical Instrument Database [11] contains recordings of scale performances on the treble keyed Sheng. By splitting the audio based on endpoint detection, single-tone audio covering the full range is obtained, which serves as the basis for the subsequent amplitude envelope control. The audio itself has a naturally decaying amplitude envelope that must be removed before it can be used as a sound source. First, the amplitude envelope is calculated from the frame peaks as described above. Then the envelope is interpolated so that it matches the original audio sample points. Finally, the original waveform is multiplied by the reciprocal of the envelope to obtain the source waveform with the amplitude envelope removed.
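A minimal sketch of this de-enveloping step in Python/NumPy, reusing the frame-peak helper sketched earlier (names are ours; the small constant guarding against division by very small envelope values is our own addition):

```python
import numpy as np

def remove_envelope(audio, envelope_per_frame, eps=1e-4):
    """Flatten a note's natural decay by dividing out its amplitude envelope."""
    # Interpolate the per-frame envelope up to one value per audio sample.
    x_old = np.linspace(0.0, 1.0, num=len(envelope_per_frame))
    x_new = np.linspace(0.0, 1.0, num=len(audio))
    envelope = np.interp(x_new, x_old, envelope_per_frame)
    # Multiply by the reciprocal of the envelope (i.e. divide), avoiding blow-ups.
    return audio / np.maximum(envelope, eps)

# flat_tone = remove_envelope(tone, frame_peak_env)  # later scaled by live pressure
```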

5.3 User Interface Design Based on Unity

Audio playback is implemented with Unity's audio control component, Audio Source; each Audio Source acts as one sound source. The Socket module provided by Unity is used to open a TCP connection thread that receives the pressure values detected by the mouthpiece. These are converted to amplitude values with the pressure-amplitude mapping formula obtained above, which realizes real-time amplitude envelope control of the sound source. The pitch is controlled through the interactive interface of the mobile app, which is divided into two parts, the keyboard area and the functional area, as shown in Fig. 12. The keyboard area is based on the playing keyboard of the Pai Sheng; each key corresponds to a pitch, simulating the hand movements of Sheng playing. In the implementation, every key in the keyboard area is set up as its own Audio Source, ensuring that different pitches can sound simultaneously under multi-touch.

Fig. 12 User interface design


The drop-down box in the functional area is used to switch the pitch and the sound source, and the volume slider, implemented through the Audio Mixer, controls the overall volume of the system.

5.4 Project Archive and Demonstration Video

The project archive is available at: https://github.com/RebYUAN/BlowingInstrumentController
A demonstration video is available at: https://www.bilibili.com/video/BV1hq4y1j79o/

6 Conclusion and Future Discussion

6.1 Play the Instrument

The hardware part of the system is powered by USB. Once it is turned on, search for and connect to the hardware's Wi-Fi on the phone, then open the software; a symbol indicating a successful connection appears in the upper right of the interface. Blow into the mouthpiece while pressing the buttons on the screen, and the instrument can be played like a traditional one. In real-device tests, the instrument was able to play grade-two Sheng pieces, restores the timbre and dynamics fairly faithfully, and adapts well to harmony and other playing techniques (Fig. 13).

Fig. 13 Real machine test


6.2 Future Work

Although the functions and milestones planned in the early stage have basically been completed, neither the experimental conditions nor the relevant professional expertise were fully mature, and there is still much room for improvement. In future optimization, waveform synthesis could be used to improve the restoration of the timbre, a native Android architecture could reduce system delay and improve multi-threaded operation, and professional players could be invited to collect and fit data on a larger scale, so that the finished product can be tested and improved more scientifically and objectively.

6.3 Conclusion

With the continuous development of global integration, cultures in various fields are colliding and merging. This research is an attempt within that trend: using the methods of modern wind instrument controllers to produce an electronic Sheng that can stand in for the traditional instrument and fill a gap in the market. It is also hoped that more portable electronic instruments can play a role in promoting traditional instruments and let more people appreciate the charm of national instruments.

Acknowledgements Supported by MOE (Ministry of Education in China) Youth Project of Humanities and Social Sciences, No. 19YJCZH084.

References 1. O'Brien D (2021) LYRICON: today's most expressive electronic wind instrument. https://www.lyricon.com/lyricon-history/. Accessed 23 Apr 2021 2. Darter T (1987) WX7: an introduction to Yamaha's new MIDI wind controller. AfterTouch 3(9):10–11 3. Aerophone go owner's manual (2018) Roland Corporation 4. Snyder J, Ryan D (2014) The birl: an electronic wind instrument based on an artificial neural network parameter mapping structure. In: NIME, pp 585–588 5. Ramos JM (2019) Eolos: a wireless MIDI wind controller. In: NIME, pp 303–306 6. Qin-powered instrument user manual (2013) Kong Audio Software Technology 7. East West Sounds (2009) Quantum leap silk virtual instrument. http://media.soundsonline.com/manuals/EW-Silk-User-Manual.pdf 8. Wentao M (1983) Chinese sheng and western reed instruments. Chinese Music (01):72–73 9. Bosch Sensortec (2015) BMP280: data sheet. Bosch Sensortec GmbH, Reutlingen 10. Espressif Systems. ESP32-WROOM-32 datasheet. https://www.espressif.com 11. Liang X, Li Z, Liu J, et al (2019) Constructing a multimedia Chinese musical instrument database. In: Proceedings of the 6th conference on sound and music technology. Springer, Heidelberg, pp 53–60

SepMLP: An All-MLP Architecture for Music Source Separation Jiale Qian, Yongwei Gao, Weixing Wei, Jiahao Zhao, and Wei Li

Abstract Most previous deep learning based methods use convolutional neural networks (CNNs) or recurrent neural networks (RNNs) to model the separation process in music signals. In this paper, we propose an MLP-like encoder-decoder architecture in which per-location features and spatial information in music signals are handled exclusively by multi-layer perceptrons (MLPs). Additionally, we introduce a novel fully-connected decoder for feature aggregation without using skip-connections. Experimental results on the benchmark dataset MUSDB18 show that the proposed model achieves comparable performance to previous CNN and RNN based methods, which demonstrates that the MLP serves as a promising backbone for data-driven music source separation methods. Keywords MLP · Music source separation · Music information retrieval · Deep learning

J. Qian · Y. Gao · W. Wei · J. Zhao · W. Li (B) School of Computer Science and Technology, Fudan University, Shanghai 200438, China e-mail: [email protected] W. Li Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, China © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 X. Shao et al. (eds.), Proceedings of the 9th Conference on Sound and Music Technology, Lecture Notes in Electrical Engineering 923, https://doi.org/10.1007/978-981-19-4703-2_3

1 Introduction

Music source separation (MSS) is an increasingly important topic in music information retrieval (MIR), with the aim of isolating each music signal from a given mixture of multiple sources. MSS is a challenging problem since the only given information is the mixture signal, which is composed of multiple musical instruments and voices combined in non-linear ways. Moreover, reverb, filtering, and other non-linear signal processing techniques may interfere with the separation process [1]. At the same time, it has valuable applications, e.g., music remixing and accompaniment isolation for karaoke systems [2]. Moreover, multi-instrument



source separation has been demonstrated to be useful for other MIR research, such as automatic singer identification [3] and automatic music genre classification [4]. CNNs and RNNs have been the mainstream models for music source separation and other MIR tasks since the rise of deep learning [5–8]. In this paper, we introduce an MLP-like network that handles the MSS problem without using CNNs or RNNs. The proposed method takes patch-level embeddings of the mixture magnitude spectrogram as input instead of a pixel-level representation. In the encoding process, we use multiple MLP layers to extract features from multi-resolution musical representations, adopting CycleMLP [9]. Additionally, we propose a novel decoder architecture for MSS that simply consists of fully-connected layers to aggregate features from multiple encoding stages. Experimental results on MUSDB18 show that our proposed method performs comparably to baseline models, which indicates that MLP-like networks are a promising approach for music source separation and can generalize to other speech and music tasks.

2 Related Work

2.1 Music Source Separation Methods

Previous MSS methods can be grouped into two categories: time-frequency (TF) domain and time-domain methods. Time-frequency domain methods [5, 10, 11] use a mask generation model to estimate masks of the query sources on TF representations of the mixture, which are obtained from the short-time Fourier transform (STFT). The waveforms of the estimated sources can then be reconstructed from the estimated masks by the inverse short-time Fourier transform (iSTFT). In these methods, however, the phase information of the estimated sources is replaced by that of the mixture in the reconstruction stage, which may affect separation results. In contrast, time-domain methods [6, 12, 13] use end-to-end models in which a neural network encoder replaces the signal processing front-end and achieves comparable performance. However, due to the high computational cost and the difficulty of modeling raw waveforms, end-to-end MSS remains a challenging task. Our proposed method belongs to the first category and applies the architecture described in Fig. 1.

Fig. 1 The common architecture of TF domain model for music source separation


2.2 Deep Learning

In recent years, deep learning has developed rapidly and has been widely used in music source separation research. Convolutional neural networks (CNNs) [14, 15], with their shared receptive fields, have powerful feature extraction abilities; they outperform earlier handcrafted-feature based methods and remain the go-to models in deep learning. Recurrent neural networks (RNNs) [16], such as long short-term memory networks, handle long-term dependencies in time series prediction and are widely used in sequence-related tasks involving text, speech [17] and music [18]. However, recent research has demonstrated that neither is strictly necessary, and both can be replaced by a carefully orchestrated design of multi-layer perceptrons (MLPs) [19].

2.3 MLP-Like Architecture

MLP-like models [20, 21] consist of two types of layers, spatial MLPs and channel MLPs, which deal with spatial and channel information respectively. Layer normalization and skip-connections are adopted in the MLP layers to maintain model performance as the network depth increases. To better capture local and global features and avoid the high computational cost of pixel-level spatial MLPs, each MLP module takes patch-level embeddings as input and outputs higher-level features of the same size. The computation of a typical MLP module can be represented as

$$h = z + cMLP(LN(z)) \tag{1}$$

$$\hat{z} = h + sMLP(LN(h)) \tag{2}$$

in which LN denotes layer normalization, cMLP denotes the channel MLP and sMLP denotes the spatial MLP, respectively; $z$ and $\hat{z}$ denote the input and the output of the MLP module.
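As a sketch only (not the authors' implementation; the layer sizes and the plain token-mixing MLP standing in for the spatial MLP are our assumptions), Eqs. (1)–(2) can be written in PyTorch roughly as follows:

```python
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """One MLP module: channel MLP then spatial (token-mixing) MLP, Eqs. (1)-(2)."""
    def __init__(self, num_tokens, dim, expansion=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(          # cMLP: mixes along the channel axis
            nn.Linear(dim, dim * expansion), nn.GELU(), nn.Linear(dim * expansion, dim))
        self.norm2 = nn.LayerNorm(dim)
        self.spatial_mlp = nn.Sequential(          # sMLP: mixes along the token axis
            nn.Linear(num_tokens, num_tokens), nn.GELU(), nn.Linear(num_tokens, num_tokens))

    def forward(self, z):                          # z: (batch, tokens, dim)
        h = z + self.channel_mlp(self.norm1(z))                          # Eq. (1)
        s = self.spatial_mlp(self.norm2(h).transpose(1, 2)).transpose(1, 2)
        return h + s                                                      # Eq. (2)
```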

3 Method

In this section, we first summarize the overall workflow of our method in the TF domain. Then we focus on the design of the MLP block, in which multi-resolution features of music signals are extracted efficiently. Finally, we introduce the decoder, which simply consists of FC layers and upsampling layers.


3.1 TF Domain Methods for Music Source Separation

Time-frequency (TF) domain methods for music source separation use the STFT to obtain time-frequency representations of the mixture from time-domain signals. The TF representations are then fed into the model to get the estimated masks and the magnitude spectrograms as

$$m_{i,ft} = F_i\left(\left|X_{i,ft}\right|\right), \quad i = 1, 2, \ldots, C \tag{3}$$

$$\left|\hat{S}_{i,ft}\right| = \left|X_{i,ft}\right| \odot m_{i,ft} \tag{4}$$

in which C denotes the number of query sources, $m_{i,ft}$ denotes the estimated mask, $|X_{i,ft}|$ denotes the input magnitude spectrogram, $F_i$ denotes the masking model, and $|\hat{S}_{i,ft}|$ denotes the estimated magnitude spectrogram of the query source. Finally, we apply the inverse STFT to obtain the estimated waveforms $\hat{s}_1(t), \hat{s}_2(t), \ldots, \hat{s}_C(t)$ from the estimated magnitude spectrograms and the mixture phase, which can be represented as

$$\hat{s}_i(t) = iSTFT(X_i, Y) \tag{5}$$

where Y denotes the phase information of the mixture.
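A hedged sketch of this masking workflow with librosa (the mask-prediction model is a placeholder; the hop length and other parameters are our assumptions, not the paper's settings):

```python
import numpy as np
import librosa

def separate_one_source(mixture, model, n_fft=1024, hop=256):
    """Mask-based TF-domain separation: STFT -> predicted mask -> iSTFT."""
    X = librosa.stft(mixture, n_fft=n_fft, hop_length=hop)   # complex spectrogram
    magnitude, phase = np.abs(X), np.angle(X)
    mask = model(magnitude)              # Eq. (3): any model returning values in [0, 1]
    est_mag = magnitude * mask           # Eq. (4): element-wise masking
    est_complex = est_mag * np.exp(1j * phase)                # reuse the mixture phase
    return librosa.istft(est_complex, hop_length=hop)         # Eq. (5)

# vocals = separate_one_source(mix_audio, vocal_mask_model)
```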

3.2 Proposed MLP-like Architecture

As shown in the workflow in Fig. 2, the overall method employs an encoder-decoder architecture on TF representations. Several transition layers and MLP-like modules comprise the entire encoder, in which multi-resolution representations are processed in different stages. The encoder takes overlapped patch-level embeddings of the mixture as input, which are transformed from the input magnitude spectrogram and then processed by the following MLP modules. The MLP module in the encoder consists of two kinds of MLP layers, CycleMLP and channel-MLP. Specifically, we use CycleMLP to handle spatial relationships with a specific receptive field. CycleMLP is a recently proposed MLP network [9] that can deal with various image sizes and reduces the computational cost significantly compared with previous spatial MLPs. The key idea of CycleMLP is to perform channel information extraction over a projection area with a specific kernel size instead of at a single pixel-level position, as presented in Fig. 3. In this way, the size of the input image can be flexible when extracting spatial information, which makes it feasible to deal with intermediate features during the downsampling process.


Fig. 2 The architecture of the proposed encoder-decoder model. The patch-level embeddings of the mixture magnitude spectrogram are fed into the encoder which consists of series of transition layers and MLP blocks. The multi-resolution features extracted from the encoder are then passed into the decoder to output the magnitude spectrograms of estimated sources. The right part of the figure is the detailed design of the MLP-like network

Fig. 3 Illustration of CycleMLP with a pseudo-kernel size of $K_h \times K_w = 1 \times 3$


The detailed design of CycleMLP is not discussed here and can be further explored in [9]. Additionally, a channel-MLP is used to further aggregate the features learned in the previous layers. Layer normalization and dropout are added before and after each MLP layer, and skip-connections are adopted, which further improves the performance of the MLP module. Max-pooling layers perform the downsampling. The outputs of the encoder at the different stages are then passed into the decoder.

3.3 Fully-Connected Decoder for Source Separation

Inspired by the design of MLP-like networks [19, 21], we propose a novel MLP decoder for music source separation that simply consists of fully-connected (FC) layers, without the skip-connections widely used in previous methods [10, 22]. As described in Fig. 2, the outputs of the encoder are fed into multiple FC layers, with each high-level representation passed into its corresponding sub-decoder in parallel. Each sub-decoder consists of a single MLP layer and an upsampling layer that decode the features into representations with the original image size:

$$d_k = US(FC(x_k)), \quad k = 1, 2, \ldots, N \tag{6}$$

in which $x_k$ represents the output of the encoder at stage k and $d_k$ represents the output at the original resolution. US and FC represent the upsampling layer and the FC layer, respectively, and N represents the number of sub-decoders for the different stages. The decoded features are then passed into a further FC layer and the output layer: the FC layer aggregates the features and the output layer returns the estimated masks for waveform reconstruction, as shown in (7) and (8).

$$z = Agg(concat(d_1, d_2, \ldots, d_N)) \tag{7}$$

$$\hat{m} = Head(z) \tag{8}$$
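A possible PyTorch rendering of Eqs. (6)–(8), sketched under our own assumptions about layer sizes and the use of bilinear upsampling (see Table 1); this is illustrative rather than the authors' code:

```python
import torch
import torch.nn as nn

class FCDecoder(nn.Module):
    """Aggregate multi-stage encoder outputs into source masks, Eqs. (6)-(8)."""
    def __init__(self, stage_dims=(64, 128, 256, 512), hidden=256, n_sources=4):
        super().__init__()
        self.sub_decoders = nn.ModuleList(
            [nn.Linear(d, hidden) for d in stage_dims])          # FC in Eq. (6)
        self.agg = nn.Linear(hidden * len(stage_dims), hidden)   # Agg in Eq. (7)
        self.head = nn.Linear(hidden, n_sources)                 # Head in Eq. (8)

    def forward(self, features, out_size):
        # features: list of tensors (batch, C_k, H_k, W_k) from the encoder stages
        decoded = []
        for f, fc in zip(features, self.sub_decoders):
            d = fc(f.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)    # FC over channels
            d = nn.functional.interpolate(d, size=out_size, mode="bilinear",
                                          align_corners=False)  # US in Eq. (6)
            decoded.append(d)
        z = self.agg(torch.cat(decoded, dim=1).permute(0, 2, 3, 1))    # Eq. (7)
        return torch.sigmoid(self.head(z)).permute(0, 3, 1, 2)         # Eq. (8): masks
```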

4 Experiments

4.1 Settings

Dataset. We use the public dataset MUSDB18 to evaluate our proposed model. MUSDB18 [23] is a professional multi-track dataset comprising a total of 150 music works from MedleyDB and DSD100, covering various genres such as jazz and electro. Each mixture track comprises 4 sources including bass, drums, vocals and


other. We use the fixed split of 100 works for training, 14 of which are used for validation, and the remaining 50 works for testing.

Evaluation. We use SDR (source-to-distortion ratio), SIR (source-to-interference ratio) and SAR (source-to-artifact ratio) as the metrics to evaluate the performance of our proposed model. The main objective, SDR, is defined as

$$SDR = 10 \log_{10} \frac{\left\| s_{target} \right\|^2}{\left\| e_{interf} + e_{noise} + e_{artif} \right\|^2} \tag{9}$$

where $s_{target}$ is the true source, and $e_{interf}$, $e_{noise}$, and $e_{artif}$ are error terms for interference, noise, and added artifacts, respectively. The mir_eval toolbox [24] is used to compute these metrics.

4.2 Implementation Details

Pre-processing. We first convert the stereo tracks to mono and downsample them to 16000 Hz. We then apply the short-time Fourier transform to obtain the time-frequency representation of the mixture. Specifically, the STFT is windowed at 1024 samples with 75% overlap to obtain 512 × 512 magnitude spectrograms. For data augmentation, we use random scaling with uniform amplitudes in [0.75, 1.25] and random remixing (incoherent rate of 50%), similar to [25].

Model Configuration and Training. We first apply multiple filters with a size of 7 × 7 and a stride of 2 to obtain the overlapped patch-level embeddings. The all-MLP encoder consists of 4 stages, each comprised of multiple MLP layers with an expansion ratio of 4. Specifically, the numbers of layers and filters in each stage are {2, 2, 4, 2} and {64, 128, 256, 512}, respectively. With an input size of 512 × 512, the output resolutions of the stages are {256 × 256, 128 × 128, 64 × 64, 32 × 32}. The hidden dimension of the fully-connected layer in the MLP-like decoder is set to 256. The detailed configuration is shown in Table 1 and the code is available at https://github.com/jlqian98/SepMLP. For training, we use the Adam optimizer with a learning rate of 0.0001, halved every 20k iterations, and a batch size of 4. The training loss is a balance between the mean square error (MSE) on the TF domain and the scale-invariant SDR (SI-SDR) on the time domain, which can be represented as

$$L_{total} = L_{mse} + w L_{si\text{-}sdr} \tag{10}$$
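A minimal sketch of the pre-processing step with librosa (the exact cropping of the 513 STFT bins and of the frame axis down to a 512 × 512 patch is our assumption; the paper does not spell this out):

```python
import numpy as np
import librosa

def mixture_to_patch(path, sr=16000, n_fft=1024, hop=256, size=512):
    """Load a track, downmix to mono at 16 kHz and return a 512x512 magnitude patch."""
    y, _ = librosa.load(path, sr=sr, mono=True)                   # resample and downmix
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))   # (513, n_frames)
    spec = spec[:size, :]                               # drop the highest bin (assumed)
    start = np.random.randint(0, max(1, spec.shape[1] - size))
    return spec[:, start:start + size]                  # random 512-frame excerpt
```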


Table 1 The detailed model configuration. h, k, s, p, n, l represent hidden dimension, kernel size, stride, padding size, the expansion ratio of the MLP layer and the number of MLP layers. Scale denotes the resolution of the input features compared with the input magnitude spectrogram. SepMLP-light and SepMLP-large adopt different MLP layers and hidden dimensions

| Layer | Scale | SepMLP-light | SepMLP-large |
|---|---|---|---|
| Patch embedding (h, k, s, p) | 1 | (32, 7, 2, 3) | (64, 7, 2, 3) |
| MLP 1 (h, n, l) | 1/2 | (32, 4, 2) | (64, 4, 2) |
| Downsampling (k, s) | 1/2 | max pool (2, 2) | max pool (2, 2) |
| MLP 2 (h, n, l) | 1/4 | (64, 4, 2) | (128, 4, 2) |
| Downsampling (k, s) | 1/4 | max pool (2, 2) | max pool (2, 2) |
| MLP 3 (h, n, l) | 1/8 | (128, 4, 2) | (256, 4, 4) |
| Downsampling (k, s) | 1/8 | max pool (2, 2) | max pool (2, 2) |
| MLP 4 (h, n, l) | 1/16 | (256, 4, 2) | (512, 4, 2) |
| FC decoders (h) | 1/2, 1/4, 1/8, 1/16 | 256 | 256 |
| Upsampling (k, s) | 1/2, 1/4, 1/8, 1/16 | (1, 1), (2, 2), (4, 4), (8, 8) | (1, 1), (2, 2), (4, 4), (8, 8) |
| Agg FC (h) | 1/2 | 256 | 256 |
| Upsampling (k, s) | 1/2 | bilinear (2, 2) | bilinear (2, 2) |
| Separation head (h) | 1 | 4 | 4 |

Table 2 SDR comparison with other TF domain based methods on the MUSDB18 dataset

| Method | #Params (M) | Vocals | Drums | Bass | Other |
|---|---|---|---|---|---|
| BLSTM [25] | 30.03 | 3.43 | 5.28 | 3.99 | 4.06 |
| DC/GMM [5] | – | 4.49 | 4.23 | 2.73 | 2.51 |
| Dedicated U-Nets [10] | 496 | 5.77 | 4.60 | 3.19 | 2.23 |
| SepMLP (proposed) | 13.5 | 6.14 | 4.63 | 2.25 | 2.67 |

4.3 Results

As shown in Table 2, one RNN-based model and two CNN-based models are listed as baseline methods. Our proposed SepMLP achieves performance comparable to that of the baselines. Specifically, its vocal separation score is the highest among all methods, which indicates a remarkable capability for separating the singing voice, while the scores for the remaining sources are somewhat lower than the BLSTM method and comparable to the two CNN-based methods. We consider that the proposed model performs relatively weakly on drums and bass because the backbone focuses on local features and cannot handle well the long-term dependencies in the spectrograms of these sources. Additionally, the model size of the proposed method


is significantly reduced compared to the existing methods. Metrics reported in this paper are the median over songs of each song's average value. To further explore the effects of the model configuration and other training strategies, we also carry out some additional experiments, presented in Fig. 4. Data augmentation clearly and significantly improves the performance of the proposed model, and the comparison between SepMLP-1 and SepMLP-3 shows that the proposed decoder provides better results than a convolutional decoder with skip-connections. Finally, increasing the model size slightly boosts the performance, especially for vocal separation. We regard SDR as a more indicative metric of separation quality than SIR and SAR. The SIR results are close to the SDR results, while the SAR values are comparable across configurations, with SepMLP-3 performing slightly higher.

Fig. 4 Comparison of different experimental configurations on multiple sources. SepMLP-1, SepMLP-2 and SepMLP-3 use the light model with fewer MLP layers and filters, while SepMLP-4 uses the large model with more MLP layers and filters. SepMLP-2 does not apply data augmentation strategies, and SepMLP-3 uses a convolutional decoder with skip-connections instead of our proposed decoder


5 Conclusion

In this paper, we present a novel MLP-like encoder-decoder model for music source separation, in which the role played by CNNs and RNNs in previous methods is taken over by multi-layer perceptrons (MLPs) for extracting music spectrogram features. Additionally, we propose a simple decoder that processes multi-resolution features efficiently with multiple MLP layers. The experimental results show that the proposed method achieves performance comparable to that of baseline models. In future work, we intend to further explore the network design of MLPs and focus on time-domain strategies.

Acknowledgements This work was supported by the National Key R&D Program of China (2019YFC1711800) and NSFC (62171138).

References 1. Cano E, FitzGerald D, Liutkus A et al (2018) Musical source separation: an introduction. IEEE Signal Process Mag 36(1):31–40 2. Woodruff JF, Pardo B, Dannenberg RB (2006) Remixing stereo music with score-informed source separation. In: ISMIR, pp 314–319 3. Sharma B, Das RK, Li H (2019) On the importance of audio-source separation for singer identification in polyphonic music. In: Interspeech, pp 2020–2024 4. Rosner A, Kostek B (2018) Automatic music genre classification based on musical instrument track separation. J Intell Inf Syst 50(2):363–384 5. Seetharaman P, Wichern G, Venkataramani S, et al (2019) Class-conditional embeddings for music source separation. In: ICASSP 2019-2019 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 301–305 6. Défossez A, Usunier N, Bottou L, et al (2019) Demucs: deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174 7. Schreiber H, Müller M (2018) A single-step approach to musical tempo estimation using a convolutional neural network. In: Ismir, pp 98–105 8. Su L (2018) Vocal melody extraction using patch-based CNN. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 371–375 9. Chen S, Xie E, Ge C, et al (2021) Cyclemlp: a mlp-like architecture for dense prediction. arXiv preprint arXiv:2107.10224 10. Kadandale VS, Montesinos JF, Haro G, et al (2020) Multi-channel u-net for music source separation. In: 2020 IEEE 22nd international workshop on multimedia signal processing (MMSP), pp 1–6 11. Takahashi N, Mitsufuji Y (2017) Multi-scale multi-band densenets for audio source separation. In: IEEE workshop on applications of signal processing to audio and acoustics (WASPAA), pp 21–25 12. Lluís F, Pons J, Serra X (2018) End-to-end music source separation: is it possible in the waveform domain?. arXiv preprint arXiv:1810.12187 13. Samuel D, Ganeshan A, Naradowsky J (2020) Meta-learning extractors for music source separation. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 816–820 14. He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778


15. Huang G, Liu Z, Van Der Maaten L, et al (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700–4708 16. Mikolov T, Karafiát M, Burget L, et al (2010) Recurrent neural network based language model. In: Eleventh annual conference of the international speech communication association 17. Tan K, Wang D (2018) A convolutional recurrent neural network for real-time speech enhancement. In: Interspeech, pp 3229–3233 18. Luo Y, Chen Z, Hershey JR, et al (2017) Deep clustering and conventional networks for music separation: stronger together. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 61–65 19. Tolstikhin I, Houlsby N, Kolesnikov A, et al (2021) Mlp-mixer: an all-mlp architecture for vision. arXiv preprint arXiv:2105.01601 20. Touvron H, Bojanowski P, Caron M, et al (2021) RESMLP: feedforward networks for image classification with data-efficient training. arXiv preprint arXiv:2105.03404 21. Lian D, Yu Z, Sun X, et al (2021) As-mlp: an axial shifted mlp architecture for vision. arXiv preprint arXiv:2107.08391 22. Cohen-Hadria A, Roebel A, Peeters G (2019) Improving singing voice separation using deep u-net and wave-u-net with data augmentation. In: 2019 27th European signal processing conference (EUSIPCO), pp 1–5 23. Rafii Z, Liutkus A, Stöter FR, et al (2017) Musdb18-a corpus for music separation 24. Raffel C, McFee B, Humphrey E J, et al (2014) mir_eval: a transparent implementation of common mir metrics. In Proceedings of the 15th international society for music information retrieval conference, ISMIR 25. Uhlich S, Porcu M, Giron F, et al (2017) Improving music source separation based on deep neural networks through data augmentation and network blending. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 261–265

Improving Automatic Piano Transcription by Refined Feature Fusion and Weighted Loss Jiahao Zhao, Yulun Wu, Liang Wen, Lianhang Ma, Linping Ruan, Wantao Wang, and Wei Li

Abstract Automatic piano transcription converts raw audio files into annotated piano rolls. Recent studies commonly estimate the pitch, onset, offset, and velocity of each note jointly. The previous state-of-the-art “Onsets and Frames” model concatenates the output of the onset and offset sub-tasks with the extracted features to improve frame-wise pitch detection, which is, in our opinion, inefficient. In this paper, we propose an improved piano transcription model based on feature fusion and loss weighting. Our proposed model outperforms the baselines by a large margin. It also shows comparable performance with the state-of-the-art “High-Res PT” model in note metrics and outperforms it in frame metrics with an F1 score of 90.27%. Keywords Piano transcription · Multi task learning · Multi pitch estimation

J. Zhao · Y. Wu · W. Li (B) School of Computer Science and Technology, Fudan University, Shanghai 200438, China e-mail: [email protected] L. Wen · L. Ma · L. Ruan · W. Wang CETHIK Group Co., Ltd, Hangzhou 311199, China W. Li Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, China © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 X. Shao et al. (eds.), Proceedings of the 9th Conference on Sound and Music Technology, Lecture Notes in Electrical Engineering 923, https://doi.org/10.1007/978-981-19-4703-2_4

1 Introduction

Automatic Music Transcription (AMT) transcribes raw audio into symbolic music annotations, such as piano rolls or Musical Instrument Digital Interface (MIDI) files [1]. AMT is an important task in MIR because it is helpful to higher-level MIR tasks [2], such as AI music tutoring, music indexing and searching. AMT also helps artists in music production: transcribed MIDI files can be used for further editing or for visualizing musical content. AMT involves a number of sub-tasks, including multi-pitch estimation (MPE), onset and offset detection, source separation, instrument recognition, beat tracking, etc. Additional information is also involved in



particular instruments, such as sustain pedal information for the piano [3, 4] and playing-technique information for the Chinese zither. The automatic piano transcription addressed in this paper obtains note events containing onset, offset, pitch and velocity jointly. Piano transcription is complicated and difficult even for human experts, especially when high annotation accuracy is required. The difficulty shows in many aspects [5]. For example, several notes may sound simultaneously, causing interference and overlap between their harmonics, which makes it difficult to estimate each note accurately. In addition, octave errors often occur during transcription due to the integer-multiple relation between the fundamental frequency and its harmonics. Early AMT research mainly aimed at frame-level transcription, which generally refers to pitch estimation or pitch detection [6]. Traditional signal processing methods were widely used for this task, such as the cepstrum transform [7], the wavelet transform [8] and auto-correlation methods [9], most of which can only deal with monophonic music. While simple and fast, traditional signal processing methods show weaker performance than methods based on non-negative matrix factorization (NMF) or neural networks (NN). In [10], a piano transcription method based on NMF was proposed. In [11], a group sparse NMF method based on β-divergence was proposed and showed state-of-the-art performance among NMF-based AMT models. Since NNs and deep neural networks (DNNs) have shown great performance in many fields, NN-based AMT methods are now widely used. As the two-dimensional time-frequency representation of an audio signal resembles an image in many aspects, convolutional neural networks (CNNs) are particularly popular for feature extraction in AMT. In [12], a CNN is used as the acoustic model to extract audio features. Meanwhile, recurrent neural networks (RNNs) have proven effective in natural language processing; as a semantic sequence of symbolic representations, notes can be dealt with in the same way as natural language [13]. In [14], Sigtia et al. use a CNN as the acoustic model and an RNN as the music language model, achieving state-of-the-art performance on automatic piano transcription. The Onsets and Frames model [15], based on [14], will be introduced in Sect. 2. This paper is organized as follows. Section 1 is a brief introduction to piano transcription. Section 2 introduces Onsets and Frames transcription and the feasibility of improving it as a multi-task learning (MTL) task. Section 3 introduces our proposed method. Section 4 presents our experiments and evaluation results, and Sect. 5 concludes this paper.


Fig. 1 Brief structure of the onsets and frames model

2 Onsets and Frames Transcription and MTL

2.1 Onsets and Frames Model

Building on the method proposed in [14], Hawthorne et al. [15] greatly improved its performance by focusing on onsets. In the proposed Onsets and Frames model, a CNN-based acoustic model is used to extract audio features, and a music language model based on a bidirectional long short-term memory network (Bi-LSTM) is used to model note sequences. In a later publication, the model was further improved by adding an offset head and using the augmented MAESTRO dataset [16]. In this improved version of Onsets and Frames, the onset detector and offset detector share the same structure, a linear stack of a CNN and a Bi-LSTM. Onsets and offsets are first estimated and trained independently as supervised tasks. In the frame model, the audio signal is first processed by the CNN acoustic model and then concatenated with the output predictions of the onset and offset detectors. After passing through the Bi-LSTM and a fully connected sigmoid layer, the frame-wise prediction is obtained. The velocity detector has a structure similar to the other stacks, and the velocity corresponding to each onset is estimated independently. This stack is not connected with the other stacks, so its performance does not affect them. A brief description of the model is shown in Fig. 1.

2.2 MTL-Based Improvement

Multi-task learning (MTL), in simple terms, refers to processing multiple tasks simultaneously. These tasks usually have similar or identical inputs and are highly related, such as instance segmentation and semantic segmentation [17]. The information and details of the related tasks are used to optimize the performance of the main task; essentially, whenever multiple loss functions are optimized at the same time, multi-task learning is being performed [18]. Since all the tasks are highly related, parameter sharing is widely used in MTL, in several ways. Hard parameter


sharing [19] means that several tasks directly share some bottom layers; this saves computing resources but shows weaker performance. In soft parameter sharing, each task can obtain information from the other tasks and thereby improve itself. In this process, a dedicated selection module is applied, which determines how efficiently information is exchanged. As mentioned above, Onsets and Frames based methods [3, 16, 20] estimate pitch, offset, onset and velocity jointly and optimize their loss functions respectively; therefore, we can improve the Onsets and Frames model by treating it as an MTL task. In Onsets and Frames, the way information is exchanged is simple: the output predictions of the other stacks are directly concatenated, which loses information such as the pitch contour. We therefore apply feature fusion to obtain more information and design a Connection Block to refine the features. We also adjust the weight of each task by adding a weight factor to each loss function. The specific model configuration is introduced in Sect. 3.

3 Model Configuration

Our proposed model can be divided into two parts: the network stacks performing each task and the Connection Block between those stacks. Among the sub-tasks of piano transcription, onset detection, offset detection and velocity detection can be considered three independent sub-tasks. Because detecting onsets and offsets is relatively simple and is not the bottleneck of the entire model, we did not use the information from the velocity detection task as in [3]. As the most difficult task, frame activation detection not only uses its own network structure but also fuses features, obtaining information from the other sub-tasks through the Connection Block. The overall structure is shown in Fig. 2, and the structure of each stack is described below. The stacks share many identical structures but do not share parameters. The CNN module for feature extraction is called the Conv Stack in this paper; it has a structure similar to the Conv stack in [15], consisting of three convolutional layers, each followed by a batch normalization (BN) layer and a ReLU (rectified linear unit) activation layer. The RNN module for modeling the note sequence is called the LSTM Stack in this paper, consisting of a Bi-LSTM with a hidden size of 384.


Fig. 2 Structure of our proposed model

3.1 Input

In our proposed model, every task shares the same input. The original audio signal is first transformed with the short-time Fourier transform (STFT), with the FFT window set to 2048 and the hop length set to 512. It is then converted to a log Mel spectrogram with 229 frequency bins. This log Mel spectrogram is the common input shared by each stack.
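A minimal sketch of this input computation with librosa (the 16 kHz sample rate is taken from Sect. 4.1; the small offset added before the logarithm is our own choice):

```python
import numpy as np
import librosa

def log_mel_input(audio, sr=16000, n_fft=2048, hop=512, n_mels=229):
    """Compute the log Mel spectrogram used as the shared input of all stacks."""
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    return np.log(mel + 1e-6)   # log compression; the epsilon avoids log(0)

# features = log_mel_input(waveform)   # shape: (229, n_frames)
```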

3.2 Velocity Stack

As a task that does not involve high-level semantic features, velocity detection is the simplest sub-task in our model. Based on this, we decided not to use the LSTM Stack: the Velocity Stack consists of the Conv Stack followed by a fully connected layer, and this simple structure performs well. Unlike the other stacks, the loss function optimized in the Velocity Stack is shown in Eq. 1:

$$L_{velocity} = \frac{1}{N_{onset}} \cdot \sum_{p=p_{min}}^{p_{max}} \sum_{t=0}^{T} I_{onset}(p,t) \cdot \left(v_{label}^{p,t} - v_{predict}^{p,t}\right)^2 \tag{1}$$

where $p$ represents the pitch, $p_{max}$ and $p_{min}$ are the maximum and minimum MIDI pitches during transcription, $t$ represents the time frame, and $T$ is the total number of frames in the current sample. $I_{onset}(p,t)$ indicates that a ground-truth onset exists at pitch $p$ and frame $t$,


and equals 1 in that case. $v_{predict}$ and $v_{label}$ represent the predicted velocity and the ground truth respectively, and $N_{onset}$ represents the total number of onsets in the current sample. The labeled ground-truth velocities are normalized to [0, 1] in pre-processing.

3.3 Onset and Offset Stack

Because the tasks of these two stacks are similar, the same structure is adopted. The input is first processed by a Conv Stack for feature extraction and then divided into two branches: one branch connects to the Connection Block, and the other outputs the predicted probability after passing through the LSTM Stack and a fully connected sigmoid layer. In the Onset Stack and Offset Stack, we optimize the following two loss functions, given in Eqs. 2 and 3 respectively:

$$L_{onset} = \sum_{p=p_{min}}^{p_{max}} \sum_{t=0}^{T} WBCE\left(\beta, G_{onset}(p,t), P_{onset}(p,t)\right) \tag{2}$$

$$L_{offset} = \sum_{p=p_{min}}^{p_{max}} \sum_{t=0}^{T} WBCE\left(\beta, G_{offset}(p,t), P_{offset}(p,t)\right) \tag{3}$$

where $P(p,t)$ represents the probability of the output prediction, $G(p,t)$ represents the ground truth, and $WBCE$ is the weighted binary cross-entropy loss function, which can be defined as:

$$WBCE(\beta, G, P) = -\beta \cdot G \cdot \log P - (1 - \beta) \cdot (1 - G) \cdot \log(1 - P) \tag{4}$$

where β represents the weight factor of the positive samples. We noticed that recent models [3, 16, 20] show much higher precision than recall, which indicates that false negatives (FNs) occur much more often than false positives (FPs), so we decided to increase the weight of the positive samples to balance the performance of the model. β is set to 0.6 by a coarse hyper-parameter search.
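A PyTorch sketch of the weighted binary cross-entropy of Eq. (4), written by us for illustration (the clamp on the predictions is our own numerical safeguard):

```python
import torch

def wbce(beta, target, pred, eps=1e-7):
    """Weighted binary cross-entropy of Eq. (4), summed over pitches and frames."""
    pred = pred.clamp(eps, 1.0 - eps)   # avoid log(0)
    loss = -beta * target * torch.log(pred) \
           - (1.0 - beta) * (1.0 - target) * torch.log(1.0 - pred)
    return loss.sum()

# onset_loss = wbce(0.6, onset_labels, onset_probs)     # Eq. (2)
# offset_loss = wbce(0.6, offset_labels, offset_probs)  # Eq. (3)
```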

3.4 Connection Block

In recent Onsets and Frames based models [3, 16], the connection between different tasks is accomplished by concatenating the output predictions of the sub-tasks, which not only requires high-quality alignment but also loses information such as the note contour. We therefore designed this stack to perform feature fusion between the Frame Stack and the other sub-tasks. Its structure is shown in Fig. 3.


Fig. 3 Structure of the connection block

The input of the Connection Block is the output of the Conv Stacks in the Onset Stack, Offset Stack and Velocity Stack, i.e., the extracted features of each task. We designed an attention module that learns which parts of the input features are helpful for the main task. The features pass through a simple channel attention module and a spatial attention module, followed by a ReLU activation layer, a flatten layer and a fully connected layer that adjusts the output size. It is worth noting that the gradient does not flow back towards the input; only the parameters of the Connection Block itself are updated, because a frame-wise error does not always indicate a sub-task error. Moreover, because the output features of the Connection Block are more weakly supervised than onset predictions, the model is less sensitive to alignment and therefore generalizes better.
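For illustration only, a hedged PyTorch sketch of a channel-plus-spatial attention module of the kind described here (the exact layer sizes, pooling choices and reduction ratio are our assumptions; the paper does not specify them):

```python
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    """Channel attention followed by spatial attention over a (B, C, T, F) feature map."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):
        # Channel attention: weight each channel by a gate from its global average.
        gate_c = self.channel_fc(x.mean(dim=(2, 3)))              # (B, C)
        x = x * gate_c.unsqueeze(-1).unsqueeze(-1)
        # Spatial attention: weight each time-frequency position.
        gate_s = self.spatial_conv(x.mean(dim=1, keepdim=True))   # (B, 1, T, F)
        return x * gate_s
```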

3.5 Frame Stack

The Frame Stack can be considered the main task; it is also the most difficult and involves the most high-level semantics. The input log Mel spectrogram is first processed by a Conv Stack and then concatenated with the sub-task features produced by the Connection Block to perform feature fusion. The fused features pass through an LSTM Stack, and the predicted probability is then output through a fully connected sigmoid layer. The frame-wise loss is defined as:

$$L_{frame} = \sum_{p=p_{min}}^{p_{max}} \sum_{t=0}^{T} WBCE\left(\beta, G_{frame}(p,t), P_{frame}(p,t)\right) \tag{5}$$


3.6 Loss Function

Recently proposed models generally add the loss functions of each stack directly to form the total loss. Because the tasks differ in difficulty, we weight the loss of each part. The total loss $L_{total}$ is calculated as:

$$L_{total} = \frac{4}{\alpha + \gamma + \delta + \epsilon}\left(\alpha L_{onset} + \gamma L_{offset} + \delta L_{velocity} + \epsilon L_{frame}\right) \tag{6}$$

where $\alpha$, $\gamma$, $\delta$, $\epsilon$ denote the weights of $L_{onset}$, $L_{offset}$, $L_{velocity}$, $L_{frame}$ respectively. By weighting the losses, we hope to make the stacks converge at closer rates and to improve the difficult task by giving it a higher weight. As we do not want to change the weights drastically, we set $\alpha$, $\gamma$, $\delta$, $\epsilon$ to 1.5, 1.5, 1 and 2 respectively.

4 Experiments

4.1 Dataset

We use MAESTRO (“MIDI and Audio Edited for Synchronous TRacks and Organization”) [16] as the dataset, which contains over 200 h of paired audio and MIDI recordings from nine years of International Piano-e-Competition events. Each recording is performed by virtuoso artists on a YAMAHA Disklavier grand piano, guaranteeing concert-quality audio and high-precision annotations; the audio and MIDI are aligned to within about 3 ms. Besides the pitches, onsets, offsets and velocities, each audio file is annotated with its composer, file name and year of performance. For pre-processing, we first resample each audio file to 16000 Hz and downmix it to one channel, as in [3, 16]. Each audio file is then segmented into excerpts of about 10 s (163,840 sample points at a 16,000 Hz sample rate). Finally, the dataset is separated into a training set of 954 audio files, a validation set of 105 audio files and a test set of 125 audio files.

4.2 Metrics

We use both note-level and frame-level metrics to evaluate our proposed model, including precision, recall and F1 score. The metrics are computed with the mir_eval library. The onset tolerance and pitch tolerance are set to the defaults of 50 ms and 50 cents respectively, and the velocity tolerance is set to 0.1 manually. Besides onset and pitch, offset and velocity also play important roles in human auditory perception: the former decides the duration of the note, and the latter decides


the dynamics of the current note. Transcriptions with precise offsets and velocities sound more natural to human listeners, so we also evaluate the note-with-offsets-and-velocity (note w/ offset&vel) metrics.
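As an illustration of the note-level evaluation (a hedged sketch with toy data; pitches are given in Hz, as this mir_eval function expects):

```python
import numpy as np
import mir_eval

# Toy reference and estimate: [onset, offset] intervals in seconds and pitches in Hz.
ref_intervals = np.array([[0.10, 0.60], [0.50, 1.00]])
ref_pitches = np.array([440.0, 523.25])
est_intervals = np.array([[0.12, 0.58], [0.55, 0.95]])
est_pitches = np.array([440.0, 523.25])

precision, recall, f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05,   # 50 ms onset tolerance
    pitch_tolerance=50.0,   # 50 cents pitch tolerance
    offset_ratio=None)      # ignore offsets for the plain "Note" metric
print(precision, recall, f1)
```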

4.3 Experiments Details

We trained our proposed model and the other controlled experiments on the same training set; all models are implemented in PyTorch. The segment length is set to 163,840 samples as mentioned above, the batch size is set to 8 due to limited computing resources, and the learning rate is set to 0.0006 and reduced by a factor of 0.95 every 10,000 iterations. The other hyper-parameter choices are as described in Sect. 3. Our model is trained for 200,000 iterations, which takes approximately 20 h on an RTX 2080Ti GPU.

4.4 Results

We first evaluate our proposed model and our baseline [16] on the test set of the MAESTRO dataset. As mentioned above, the baseline shows quite different precision and recall, so we ran a series of experiments to find its best-balanced performance. Table 1 shows the overall performance of our proposed model, the baseline proposed in [16], our reproduced baseline (Baseline_re), and the state-of-the-art High-Resolution Piano Transcription model (High-Res PT) [3]. As shown in the table, our model outperforms the baseline on every metric considered, which shows that our improvements work. Compared with High-Res PT, our model shows comparable performance on the Note and Note with offset&vel metrics and outperforms it on the Frame metrics. Since our model's hyper-parameters are less refined, this result suggests great potential.

Table 1 Transcriptions evaluated on the MAESTRO test set

| Method | Note P(%) | Note R(%) | Note F1(%) | Frame P(%) | Frame R(%) | Frame F1(%) | Note w/ offset&vel P(%) | Note w/ offset&vel R(%) | Note w/ offset&vel F1(%) |
|---|---|---|---|---|---|---|---|---|---|
| Our model | 96.45 | 94.27 | 95.35 | 90.15 | 90.39 | 90.27 | 79.63 | 77.87 | 78.74 |
| Baseline [4] | 97.42 | 92.37 | 94.80 | 93.10 | 85.76 | 89.19 | 78.11 | 74.13 | 76.04 |
| Baseline_re | 97.86 | 90.53 | 94.05 | 92.07 | 86.75 | 89.33 | 80.97 | 73.04 | 76.80 |
| High-Res PT [6] | 98.17 | 95.35 | 96.72 | 88.71 | 90.73 | 89.62 | 82.10 | 79.80 | 80.92 |


Table 2 Transcriptions trained on MAESTRO, evaluated on MAPS

| Method | Note P(%) | Note R(%) | Note F1(%) | Frame P(%) | Frame R(%) | Frame F1(%) | Note w/ offset&vel P(%) | Note w/ offset&vel R(%) | Note w/ offset&vel F1(%) |
|---|---|---|---|---|---|---|---|---|---|
| Our Model | 80.61 | 79.34 | 79.97 | 69.89 | 73.72 | 71.75 | 38.62 | 38.19 | 38.40 |
| Baseline_re | 82.46 | 75.01 | 78.56 | 67.62 | 70.45 | 69.01 | 39.25 | 34.64 | 36.80 |

Table 3 Ablation studies evaluated on the MAESTRO test set

| Method | Note P(%) | Note R(%) | Note F1(%) | Frame P(%) | Frame R(%) | Frame F1(%) | Note w/ offset&vel P(%) | Note w/ offset&vel R(%) | Note w/ offset&vel F1(%) |
|---|---|---|---|---|---|---|---|---|---|
| Our Model | 96.45 | 94.27 | 95.35 | 90.15 | 90.39 | 90.27 | 79.63 | 77.87 | 78.74 |
| Without_CB | 94.66 | 93.54 | 94.09 | 85.81 | 92.05 | 88.82 | 71.91 | 71.13 | 71.51 |
| Pred_Connect | 98.17 | 91.34 | 94.63 | 91.25 | 87.51 | 89.34 | 81.36 | 75.92 | 78.54 |

To evaluate generalization ability, we designed the experiments shown in Table 2, which compare our model and our reproduced baseline. We train both models on the MAESTRO dataset and test them directly on the MAPS dataset without fine-tuning. The results show that our model outperforms the reproduced baseline by a large margin on the Note, Frame, and Note with offset&vel metrics, supporting our deduction that weaker supervision leads to better generalization. Finally, we designed a series of ablation studies, shown in Table 3. In the second row, Without_CB denotes that the features from the sub-tasks are sent directly to the Frame Stack without being processed by the Connection Block. As our model outperforms Without_CB by a large margin, especially on the Note with offset&vel metrics, it is clear that the Connection Block contributes substantially to the performance improvement. In the third row, Pred_Connect denotes that the other three sub-tasks are connected to the Frame Stack in the same way as in [16]; to compensate for the missing attention module, we add the same attention module as in the Connection Block after each Conv Stack. The results show that our way of feature fusion is better than directly concatenating the output predictions, and this change contributes greatly to the performance improvement.

5 Conclusion

In this paper we proposed an improved version of the Onsets and Frames model [16] by changing the way features are fused and by weighting each part of the loss function. Our model shows strong performance on both note and frame metrics, is less sensitive to alignment and therefore generalizes better. The experiments show that our refined mid-level features help the Frame Stack more than the output predictions do. Our model outperforms the baseline by a


large margin and achieves a 90.27% F1 score on the frame metrics, better than the state-of-the-art model [3]. Our further study will focus on finding a better Connection Block structure and on using adaptive weighting parameters.

Acknowledgements This work was supported by the National Key R&D Program of China (2019YFC1711800) and NSFC (62171138).

References 1. Raphael C (2002) Automatic transcription of piano music. In: ISMIR 2002, 3rd international conference on music information retrieval, Paris, France, 13–17 October 2002, Proceedings 2. Benetos E, Dixon S, Giannoulis D et al (2013) Automatic music transcription: challenges and future directions. J Intell Inf Syst 41(3):407–434 3. Kong Q, Li B, Song X, et al (2020) High-resolution piano transcription with pedals by regressing onsets and offsets times 4. Li B, Duan Z (2016) An approach to score following for piano performances with the sustained effect. IEEE/ACM Trans Audio Speech Lang Process 21:2425–2438 5. Benetos E, Dixon S, Duan Z et al (2019) Automatic music transcription: an overview. IEEE Signal Process Mag 36:20–30 6. Klapuri AP (2003) Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. IEEE Trans Speech Audio Process 11(6):804–816 7. Rabiner L, Cheng M et al (1976) A comparative performance study of several pitch detection algorithms. IEEE Trans Acoust Speech Signal Process 24:399–418 8. Kadambe S, Boudreaux-Bartels GF (1992) Application of the wavelet transform for pitch detection of speech signals. IEEE Trans Inf Theory 38(2):917–924 9. Rabiner LR (1977) On the use of autocorrelation analysis for pitch detection. IEEE Trans Acoust Speech Signal Process 25(1):24–33 10. Smaragdis P, Brown JC (2003) Non-negative matrix factorization for polyphonic music transcription. In: 2003 IEEE workshop on applications of signal processing to audio and acoustics 11. O’Hanlon K, Plumbley M (2014) Polyphonic piano transcription using non-negative matrix factorisation with group sparsity. In: IEEE international conference on acoustics 12. Kelz R, Dorfer M, Korzeniowski F, et al (2016) On the potential of simple framewise approaches to piano transcription. In: International society for music information retrieveal conference (ISMIR) 13. Boulanger-Lewandowski N, Bengio Y, Vincent P (2012) Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription[J]. Chem A Eur J 18(13):3981–3991 14. Sigtia S, Benetos E, Dixon S (2017) An end-to-end neural network for polyphonic music transcription. IEEE/ACM Trans Audio Speech Lang Process 24(5):927–939 15. Hawthorne C, Elsen E, Song J, et al (2017) Onsets and frames: dual-objective piano transcription 16. Hawthorne C, Stasyuk A, Roberts A, et al (2018) Enabling factorized piano music modeling and generation with the maestro dataset. arXiv e-prints 17. Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39:2481–2495 18. Ruder S (2017) An overview of multi-task learning in deep neural networks 19. Kokkinos I (2017) Ubernet: Training a ‘universal’ convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In: 30th IEEE/CVF conference on computer vision and pattern recognition (CVPR) 20. Kelz R, Bck S, Widmer G (2019) Deep polyphonic ADSR piano note transcription. In: ICASSP 2019 - 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP)

Application Design of Sports Music Mobile App Based on Rhythm Following Rongfeng Li and Yu Liu

Abstract With the rise of various running apps, running has become a national sport. Relevant studies have found that music can provide energetic, psychological and psychophysiological benefits in physical activities, especially when the movement and the music are synchronized. However, most running apps on the market simply integrate a music playing function and do not focus on synchronizing movement and music, so studying how music can synchronously follow running is both innovative and meaningful. This paper introduces the research process and implementation of rhythm following in detail. First, step detection based on sensor data (accelerometer and gyroscope) and logistic regression analysis achieves an accuracy of 92.4% and a recall of 98.2%. Then, for the music rhythm synchronization algorithm, the paper describes in detail how Musical Instrument Digital Interface (MIDI) music is adjusted in real time according to the foot-landing points, so that the music rhythm and the running rhythm are synchronized. Finally, an Android application is developed that integrates the above results into a rhythm-following mobile app. Keywords Step tracking · Rhythm following · MIDI

1 Introduction 1.1 Background With the progress of science and technology and the development of the Internet industry, more and more running apps have come into people's view. Among them are some excellent sports apps, such as Keep and Yuepaoquan, which can help us record running tracks, mileage, speed, etc. More and more people fall in love with running, and we often post our running news in our circle of friends to share the joy with others. Many people like listening to rhythmic music while running, but it is troublesome to open a music player and search for suitable music. A better way is to integrate music into the running app and let the app play suitable music for us. This is also a development direction of current running apps.

R. Li (B) Beijing Key Laboratory of Network System and Network Culture, Beijing University of Posts and Telecommunications, Beijing 100876, China e-mail: [email protected] Y. Liu Beijing University of Posts and Telecommunications, Beijing 100876, China © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 X. Shao et al. (eds.), Proceedings of the 9th Conference on Sound and Music Technology, Lecture Notes in Electrical Engineering 923, https://doi.org/10.1007/978-981-19-4703-2_5

1.2 The Necessity of Music Rhythm Movement Following The synchronization of music rhythm and pace is very meaningful and plays a positive role in a range of sports such as running. Terry [1] discussed in the literature that music can provide energetic, psychological and psychophysiological benefits in physical activities, especially when movements and music are carried out simultaneously; at the same time, Simpson [2] pointed out that synchronous music can be applied to the training of non-elite athletes and has quite positive effects. This is also reflected in daily life. For example, when we are studying and thinking, it is suitable to listen to soothing music; when we exercise in the gym, it is suitable to listen to more energetic music, so that our muscles contract and relax with the rhythm. The same is true of running: the rhythm of the music affects the rhythm of our running, and the speed of the music affects the speed of our running. We easily and unconsciously synchronize our running state with the music, which is why some music is suitable for running and some is not.

1.3 The Significance of the Research To some extent, this study solves the problems of music selection and of the mismatch between music rhythm and running rhythm. Runners no longer need to choose suitable music in advance according to their running situation; they only need to choose a piece with a stable beat and rhythm, and the app automatically synchronizes the rhythm with the runner's state. Runners also no longer need to worry about landing points and beat points falling out of step: even if the rhythm of the music equals the rhythm of the running, there is no guarantee during the run that the beat points of the music coincide with the landing points, which produces a feeling of disorder; this is the focus of this work. The research also provides a new idea for the further development of running apps. Almost all existing running apps are built on the premise that the chosen music remains unchanged, which brings certain limitations: although rhythm synchronization is the expectation, it cannot be achieved this way. This study brings a new idea, namely that the music itself can be changed. Without affecting the overall quality, relevant adjustments can be made to the music, which is a new way to solve the problem. More importantly, it brings runners a new, intelligent rhythm-following music experience: we no longer follow the music, the music follows us.

1.4 Implementation Mode Nowadays, smart phones are equipped with a wealth of sensors. The acceleration sensor (Fig. 1) reflects the acceleration of the phone in all directions, and the gyroscope reflects its rotation. If a runner carries a phone while running, the phone can indirectly reflect the runner's movement state through its sensor data, enabling real-time monitoring of the run. By collecting sensor data during actual running and marking the moments when the foot strikes the ground, we can find the general relationship between data changes and landing points, and with a suitable machine learning algorithm we can accurately judge the landing points of running from the sensor data. Java provides a complete native MIDI library, which allows real-time monitoring and control of MIDI music. On the premise that the landing points are accurately determined from the sensor data, the beat speed of the MIDI music is continuously fine-tuned so that the music beat points finally coincide with each running step, producing the effect of the music stepping on the beat and following the runner.

Fig. 1 Acceleration sensor and gyroscope


2 Step Detection of Running Based on Logistic Regression 2.1 Collection of Sensor Data Based on the relevant APIs in android.hardware and by setting the sensor response speed in SensorManager, about 10 min of running data were collected at 50 groups per second. Each group of data includes the three-way acceleration and the three-way gyroscope readings. Table 1 shows part of the sensor data actually collected; each group of data has been marked with the landing point, where 0 represents a non-landing point and 1 represents a landing point. Logistic regression is often used to predict diseases, for example whether a tumor test is negative or positive. With tumor size, location, gender, age and other indicators as independent variables and whether the tumor is positive as the dependent variable (1 = yes, 0 = no), the influence of the independent variables on the dependent variable is explored. Similarly, we take the X-axis, Y-axis and Z-axis acceleration and the X-axis, Y-axis and Z-axis angular velocity as independent variables, and whether the current moment is a landing point as the dependent variable (1 = yes, 0 = no), carry out the regression analysis, and then predict whether any moment during running is a landing point. Training data for pathological models generally comes from real hospital cases, but the training data for the landing-point prediction model requires the landing points (0 or 1) to be labeled manually. To ensure labeling accuracy, a labeling program was written to automatically mark the landing point for each group of sensor data collected during running. Figure 2 is the flow chart of the automatic labeling program for landing points.

Table 1 Part of the sensor data actually collected

| # | X-axis acceleration | Y-axis acceleration | Z-axis acceleration | X-axis angular velocity | Y-axis angular velocity | Z-axis angular velocity | Landing point |
|---|---------------------|---------------------|---------------------|-------------------------|-------------------------|-------------------------|---------------|
| 1 | 27.545 | 9.085 | 3.196 | −2.124 | −0.921 | 1.051 | 0 |
| 2 | 24.570 | 10.322 | 0.248 | −2.719 | 0.136 | 0.385 | 0 |
| 3 | 17.506 | 9.032 | 0.840 | −2.069 | −0.181 | 0.018 | 0 |
| 4 | 9.471 | 6.422 | 2.104 | −1.189 | −0.705 | 0.002 | 0 |
| 5 | 3.694 | 4.316 | 2.223 | −0.651 | −1.190 | −0.006 | 1 |


Fig. 2 Automatic labeling process of landing point

2.2 Classification Algorithm Based on Logistic Regression Figure 3 shows how the three-way resultant acceleration (the sum of squares of the acceleration components in the three directions) changes with time during an actual run. The blue points are landing points and the red points are non-landing points. On the whole, whether a moment is a landing point has little relationship with the current value of a single group of sensor data: the landing points are regularly distributed in the falling segment from the highest point to the lowest point. That is, when determining a landing point we should take several neighboring points into account to ensure prediction accuracy. Because the prediction has to run in real time, future points cannot be used for landing-point prediction. To balance accuracy and real-time performance, the logistic regression is therefore trained to predict the current landing point from historical sensor data. Specifically, we predict whether the current moment is a landing point from the five most recent groups of historical data, 30 dimensions in total. Formula (1) is the formula used for logistic regression training, where x is the input, i.e., the five groups of historical sensor data with 30 dimensions in total, x = (x_1, x_2, …, x_30), and θ is the parameter vector to be solved, θ = (θ_1, θ_2, …, θ_30). When θ is transposed, each of its terms is multiplied by the corresponding sensor value in x and the products are summed. The gradient descent algorithm is used to solve for the parameter vector θ that minimizes the cost function.

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} = \frac{1}{1 + e^{-(\theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_{30} x_{30})}} \quad (1)$$
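To make the training step concrete, the sketch below implements the logistic regression above with plain NumPy gradient descent: each sample stacks the five most recent groups of six sensor readings into a 30-dimensional vector, matching Eq. (1). The function names, learning rate, iteration count and the random stand-in data are illustrative assumptions, not values from the paper.

```python
import numpy as np

def build_windows(sensor_data, labels, history=5):
    """Stack the last `history` sensor groups (6 values each) into one
    30-dimensional feature vector per frame."""
    X, y = [], []
    for t in range(history - 1, len(sensor_data)):
        X.append(sensor_data[t - history + 1 : t + 1].reshape(-1))
        y.append(labels[t])
    return np.array(X), np.array(y)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.01, n_iter=5000):
    """Gradient descent on the cross-entropy cost for the model in Eq. (1)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        h = sigmoid(X @ theta)           # h_theta(x) for every sample
        grad = X.T @ (h - y) / len(y)    # gradient of the cost function
        theta -= lr * grad
    return theta

# Toy usage with random data standing in for the collected sensor groups.
rng = np.random.default_rng(0)
sensor_data = rng.normal(size=(3000, 6))    # 6 channels: 3 accel + 3 gyro
labels = (rng.random(3000) < 0.04).astype(float)
X, y = build_windows(sensor_data, labels)
theta = train_logistic_regression(X, y)
pred = sigmoid(X @ theta) > 0.5             # frame-wise landing predictions
```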

The direct prediction result of the logistic regression consists of clusters of consecutive 1s rather than isolated points. However, comparison with the actual data shows that the clusters and the landing points correspond: the tail of each cluster of 1s basically corresponds to one and only one actual landing point. Although the logistic regression prediction is not a discrete point, this correspondence makes the prediction usable. By using this


Fig. 3 The change rule chart of three-way combined acceleration with time during running

corresponding relationship, the predicted points can be filtered through a series of filtering functions, and the discrete landing points can finally be obtained. Figure 4 is the flow chart for turning the raw logistic regression predictions into the final prediction data: after a series of operations on each cluster of 1s, a single discrete point is obtained, so that the usable scattered landing points result. Filter operation: because the prediction data comes in clusters while what we really need is discrete landing points, the prediction results must be discretized by selecting one point from each cluster. To ensure real-time performance, we keep the first point of each cluster. In practice there are also discontinuous clusters, i.e., a point in the middle of a cluster is suddenly predicted as 0; if the filter simply kept the first point of every cluster, this case would produce many spurious detections. Therefore, during filtering, the distance between the last kept point and the current candidate point is calculated, and if the distance is less than a certain threshold, the current point is judged to be part of the previous cluster and is discarded as invalid. Figure 5 is the logic diagram of the filter operation. Offset operation: because the actual landing points are concentrated at the end of each cluster of 1s, while the filter keeps the first point of the cluster, it is necessary to

Fig. 4 Processing flow of forecast data


Fig. 5 Filter processing flow

Fig. 6 Processing method of pre judgment based on landing point in Android

shift all the kept points later in time. In other words, the landing point is predicted in advance. In Android programming, once a landing point is predicted, thread sleep or a delayed thread pool can be used to trigger the corresponding landing logic after a delay. Predicting the landing point in advance also has a practical advantage: program execution and data calculation take time, so if we only decided at the current moment whether it is a landing point, the subsequent processing logic would introduce a delay and ultimately an audible lag in the music rhythm synchronization, and the effect would not be particularly good. Predicting in advance gives the program enough time to process. Figure 6 shows how this advance prediction of the landing point is handled in the Android program. After the filter and offset processing, we obtain the final prediction data, from which the accuracy and recall are then calculated.
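A minimal sketch of the filter and offset operations described above, assuming a 20 ms frame period (50 groups per second); the gap threshold and offset value are illustrative assumptions, not the ones used in the app.

```python
import numpy as np

def filter_clusters(pred, frame_ms=20, min_gap_ms=200):
    """Keep only the first frame of each predicted cluster of 1s; a candidate
    closer than `min_gap_ms` to the last kept point is treated as part of the
    same cluster and dropped (handles clusters broken by a stray 0)."""
    kept, last_t = [], -np.inf
    for t, p in enumerate(pred):
        if p and (t * frame_ms - last_t) >= min_gap_ms:
            kept.append(t)
            last_t = t * frame_ms
    return kept

def offset_points(points, frame_ms=20, offset_ms=100):
    """Shift every kept point later in time toward the actual landing moment;
    in the app this delay is realized with a delayed task."""
    return [t + offset_ms / frame_ms for t in points]

pred = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0])
landing_frames = offset_points(filter_clusters(pred))
```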

2.3 Experimental Result Each group of sensor data used for training is fed into the logistic regression equation to predict the landing points, and the predictions are processed through the flow in Fig. 6 to obtain the discrete landing points. Comparing them with the real data gives the prediction accuracy and recall shown in Table 2. The regression equation predicts landing points in real time during running, and the subsequent music rhythm synchronization is based on these predicted points, so the on-the-beat feeling depends on the predictions being accurate. Hearing tolerates a certain error, so a certain time difference is allowed between


Table 2 The accuracy and recall of the final forecast data

| Error range | 200 ms | 100 ms | 60 ms | 20 ms |
|-------------|--------|--------|-------|-------|
| Accuracy | 93.8% | 92.4% | 85.7% | 53.3% |
| Recall | 99.8% | 98.2% | 91.1% | 56.7% |

the predicted point and the actual landing point. The error range in Table 2 indicates the error between the predicted point and the actual landing point. With 100 ms selected as the maximum acceptable auditory error, the final prediction achieves 92.4% accuracy and 98.2% recall. The Android program was then used to test the effect: whenever a landing point was detected during running, a specified sound effect was played, which basically matched one step at a time, with few missed detections and few false detections. In summary, the prediction algorithm achieves high accuracy and recall, and the practical test works well.

3 Music Rhythm Synchronization Algorithm 3.1 Tempo Based Rhythm Synchronization Algorithm The first step of rhythm synchronization is tempo synchronization, that is, the time of a step should be equal to the time of a beat. We need to obtain the step frequency in the process of running in real time, convert the step frequency into tempo of music, and adjust the speed of music in real time. Figure 7 is the algorithm flow chart of generating tempo through the monitoring of landing point and adjusting the music in real time. Every time the landing point is detected, the time difference between the current landing point and the previous landing point will be generated, and the value will be pushed into the time difference queue. If the queue is full, the head element of the queue will be out of the queue first. According to the first in first out principle of the queue, the time difference of the most recent landing points will be stored in the queue, which ensures the real-time calculation of step frequency. At the same time, the program will calculate the range in the queue in real time. If the range of all data in the queue is within 60 ms, the tempo of music will be updated. The significance of range determination is to ensure the stability of the time difference of the nearest landing points in the queue, that is, only when the step frequency is relatively stable, the tempo of the music will be updated, which also ensures the stability of the whole music. Figure 8 shows the process of the beat points and landing points from being out of sync to being in sync. The Lilliputian represents the landing point during running, and the music note represents the beat point during music playing, then draw them on the same timeline. The first blue arrow starts the synchronization algorithm at


Fig. 7 Real time generation of tempo logic based on landing point monitoring

Fig. 8 Music rhythm and pace synchronization algorithm

the landing point, while the music is about halfway through a beat. The synchronization algorithm ends at the landing point of the second blue arrow; at this moment the landing point matches the beat point, and the subsequent landing points and beat points continue to match. The whole rhythm synchronization takes place between the two blue arrows, that is, each execution of the synchronization algorithm takes place between two landing points. In Fig. 8, x represents the amount of data left in the current beat at the beginning of the synchronization algorithm, y represents the amount of data in a whole beat, and t represents the time of one step. To make the next landing point coincide with a beat point, it is easy to calculate the music speed that should be set during time t as V1 = (x + y)/t, which is equivalent to playing the amount of data x + y in time t. Another option is to play only the amount of data x in time t, which also brings the beat point and the landing point into coincidence at the next landing point; the speed to be set during t is then V2 = x/t. The purpose of this speed adjustment is to make the landing point of the second blue arrow coincide with a music beat point. After the coincidence, i.e., at the landing point of the second blue arrow, another speed adjustment is needed. The purpose of the second speed adjustment is to make the duration of one music beat equal to the time of one step, so that the subsequent beat points continue to coincide with the landing points.


To reduce the audible change, we should choose, between V1 and V2, the one whose speed differs less from the post-synchronization speed, which is also easy to obtain: V3 = y/t.
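The sketch below illustrates the queue-based tempo generation of Sect. 3.1 and the choice between V1 and V2 just described; the queue capacity and the use of the mean interval as the stable step period are illustrative assumptions.

```python
from collections import deque

class TempoFollower:
    """Maintain the most recent landing-point intervals in a fixed-capacity
    FIFO queue and report a stable step period once the range of the queue
    is small enough; the queue length (4) is an illustrative choice."""
    def __init__(self, size=4, max_range_ms=60):
        self.intervals = deque(maxlen=size)
        self.max_range_ms = max_range_ms
        self.last_landing_ms = None

    def on_landing(self, t_ms):
        if self.last_landing_ms is not None:
            self.intervals.append(t_ms - self.last_landing_ms)
        self.last_landing_ms = t_ms
        if len(self.intervals) == self.intervals.maxlen:
            if max(self.intervals) - min(self.intervals) <= self.max_range_ms:
                # stable: return the average step period t (in ms)
                return sum(self.intervals) / len(self.intervals)
        return None  # step rate not yet stable, keep the current tempo

def first_adjustment_speed(x, y, t):
    """Pick, between V1 = (x + y)/t and V2 = x/t, the candidate closer to the
    post-synchronization speed V3 = y/t (smaller audible change)."""
    v1, v2, v3 = (x + y) / t, x / t, y / t
    return v1 if abs(v1 - v3) <= abs(v2 - v3) else v2
```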

3.2 Some Details About Music Rhythm Synchronization Algorithm (1) Determination of the beat position of the music at a certain time: The rhythm synchronization algorithm above adjusts the speed twice in total. The purpose of the first speed adjustment is to make a beat point coincide with the next landing point, and this adjustment needs to know the size of x in Fig. 8, which is equivalent to knowing the current beat position of the music. Formula (2) gives the beat position, where pos is the number of beats already played in the current music (a floating-point number), beat is the total number of beats of the MIDI file, curtick is the current playing position of the music sequence (in MIDI ticks), and tick is the total length of the music sequence (in MIDI ticks).

$$pos = beat \times (curtick / tick) \quad (2)$$

(2) Timing of the two speed adjustments of the music synchronization algorithm: The first speed adjustment takes place at the landing point of the first blue arrow in Fig. 8 and is maintained until the landing point of the second blue arrow, so as to make a beat point coincide with the landing point of the second blue arrow. The second speed adjustment takes place at the landing point of the second blue arrow and is maintained until the next execution of the synchronization algorithm; its purpose is to make the time of one music beat equal to one step, so that the subsequent beat points keep matching the landing points. (3) Calculation of the two speed adjustments of the music synchronization algorithm: First, consider the BPM that should be set for the second speed adjustment, because it is easier to understand. In Fig. 8, t represents the current stable pace, i.e., the time of one step in seconds. The premise of the second speed adjustment is that at the current moment the beat point and the landing point already coincide, and its purpose is to make one beat last exactly one step. Formula (3) is the adjustment formula for the second music speed: the BPM of the music equals the number of steps the runner takes in one minute.

$$bpm = 60 \div t \quad (3)$$

Next is the speed calculation of the first adjustment. See the following two formulas in combination with Fig. 8, where pos is the number of current playing


beats solved by Eq. (2), which is a floating-point number, x is the remaining playing amount of the current beat, y is the playing amount of a complete beat, and bpm is the value of the second speed adjustment calculated in Eq. (3). Depending on which side of the midpoint between the two beat points the landing point at the first blue arrow falls, that is, whether x ≤ y − x or x ≥ y − x, Eq. (4) or Eq. (5) gives the first speed adjustment; they are equivalent to playing the amount of data x + y or x, respectively, between the two blue-arrow landing points. In other words, different calculation formulas are chosen according to the beat position at the first blue arrow, so that the value of the first speed adjustment is as close as possible to the value of the second speed adjustment. This makes the music more fluent and sound as if only one speed adjustment happened.

$$bpm_{x \le y-x} = ([pos + 1] - pos + 1) \times bpm \quad (4)$$

$$bpm_{x \ge y-x} = ([pos + 1] - pos) \times bpm \quad (5)$$
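A small sketch of Eqs. (2)-(5), assuming the bracket [pos + 1] denotes rounding down; the example values are taken from Table 4 only to show the order of magnitude of the result, and the function names are mine.

```python
import math

def beat_position(total_beats, cur_tick, total_ticks):
    """Eq. (2): fractional number of beats already played."""
    return total_beats * (cur_tick / total_ticks)

def second_adjustment_bpm(step_period_s):
    """Eq. (3): one beat per step."""
    return 60.0 / step_period_s

def first_adjustment_bpm(pos, bpm):
    """Eqs. (4)/(5): compute both candidates and choose the one closer to the
    target bpm; frac_remaining = [pos + 1] - pos, with [.] read as floor."""
    frac_remaining = math.floor(pos) + 1 - pos       # x / y in Fig. 8
    bpm_play_x_plus_y = (frac_remaining + 1) * bpm   # Eq. (4)
    bpm_play_x = frac_remaining * bpm                # Eq. (5)
    return min((bpm_play_x_plus_y, bpm_play_x), key=lambda b: abs(b - bpm))

bpm = second_adjustment_bpm(0.5)          # 120 BPM for a 0.5 s step
print(first_adjustment_bpm(19.933, bpm))  # roughly 128, cf. Table 4
```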

(4) Timing of the synchronization algorithm: The synchronization algorithm always starts at a landing point. As time passes after a synchronization, the music beat points and the running landing points may drift apart again, so in order to keep the synchronization effect at all times, a synchronization adjustment has to be made periodically. The method used here is to run the synchronization algorithm once every four steps, which ensures that each short section of music has at least one beat point coinciding with a landing point (Table 3). (5) Java API involved in the algorithm: see Table 3.

Table 3 Java API involved in the algorithm

| API | Function |
|-----|----------|
| getMicrosecondLength | Get the total length of the music sequence, expressed in microseconds |
| getTempoInMPQ | Get the length of a quarter note in microseconds |
| getTickLength | Get the total length of the music sequence, expressed in MIDI ticks |
| getTickPosition | Get the playing position of the music sequence at the current time |
| getTempoInBPM | Get the tempo of the music sequence in beats per minute |
| setTempoInBPM | Set the speed of the MIDI music |


Table 4 Test data of coincidence effect between landing points and beat points

| Number of step | Number of beats | Speed adjustment | BPM adjusted for the first time | BPM adjusted for the second time |
|----------------|-----------------|------------------|---------------------------------|----------------------------------|
| 19 | 18.948 | No | None | None |
| 20 | 19.933 | Yes | 127.999 | None |
| 21 | 20.996 | Yes | None | 120.000 |
| 22 | 21.981 | No | None | None |
| 23 | 22.967 | No | None | None |
| 24 | 24.000 | Yes | 120.000 | None |
| 25 | 25.027 | Yes | None | 120.000 |

3.3 Experiment Result The following is the actual data during running, showing the coincidence effect of landing points and beat points. It can be seen from the data that the landing points and beat points are in good agreement, and basically one step corresponds to one beat (Table 4). You can also visit the following website to watch the music following effect during my run: "https://www.bilibili.com/video/BV1FK4y1V7An?share_source=copy_web". The synchronization effect of music rhythm and running rhythm is shown at the end of the video.

4 Conclusion Based on mobile phone sensors and machine learning, this paper studies and designs the implementation of music rhythm following during running from the following three aspects. The first is step detection based on the phone sensors, i.e., the real-time determination of landing points during running from the sensor data. Training data are prepared by collecting accelerometer and gyroscope data during actual running and automatically labeling the landing points; then a logistic regression model is trained with the last five groups of sensor data as independent variables and whether the current moment is a landing point (0 or 1) as the dependent variable; finally, a series of filtering operations are applied to the predictions to obtain the required discrete landing points. Within the acceptable auditory error, an accuracy of 92.4% and a recall of 98.2% are achieved. An actual running test was also carried out with the Android program: a sound is triggered each time a landing point is detected by the prediction function, which basically achieves one trigger per step, with few false and missed detections. The second aspect is the music rhythm synchronization algorithm, which synchronizes the music rhythm with the running rhythm on the basis of the monitored landing points. MIDI music is chosen because the Java language provides an API for MIDI files, and the synchronization algorithm mainly adjusts the speed of the MIDI music. Each rhythm synchronization adjusts the speed twice: after the runner's stride frequency becomes stable, the music speed is adjusted for the first time at a landing point so that the next landing point coincides with a music beat point; after they match, a second speed adjustment sets the music beat speed to the current pace so that the subsequent beat points and landing points keep matching. A landing-point-based beat speed generation algorithm is also designed, i.e., the algorithm that produces the music speed used in the second adjustment: with the help of the first-in-first-out property of a fixed-capacity queue, the time difference between each landing point and the previous one is pushed into the queue during running, so the queue always stores the latest few landing intervals; the range of the queue is computed in real time, and when it is below a certain threshold the current step rate is judged stable, an average step period is generated and finally converted into a beat speed. The third aspect is the design of the Android program, which requires actual software development and testing. The step detection algorithm and the music rhythm synchronization algorithm are integrated in a program developed with Android Studio. In actual running tests, the music can be adjusted in real time according to the running state, achieving the effect of one beat per step. We have open-sourced the logistic regression code and the Android program for follow-up research, and the app can also be downloaded for a hands-on experience. The GitHub address is "https://github.com/liuyubupt/RhythmFollowing". Acknowledgements Supported by MOE (Ministry of Education in China) Youth Project of Humanities and Social Sciences, No. 19YJCZH084.

References 1. Terry PC, D’Auriac S, Saha AM (2011) Effects of synchronous music on treadmill running among elite triathletes. Accessed 30 July 2011 2. Simpson SD, Karageorghis SC (2005) The effects of synchronous music on 400-m sprint performance. School of Sport and Education, Brunel University, West London, Uxbridge, UK. Accessed 23 Oct 2005 3. Succi GP, Clapp D, Gampert R, Prado G (2001) Footstep detection and tracking. In: Proceedings of SPIE 4393, unattended ground sensor technologies and applications III. Accessed 27 Sept 2001, https://doi.org/10.1117/12.441277


4. Richman MS, Deadrick DS, Nation RJ, Whitney S (2001) Personnel tracking using seismic sensors. In: Proceedings SPIE 4393, unattended ground sensor technologies and applications III. Accessed 27 Sept 2001, https://doi.org/10.1117/12.441276 5. Ozcan K, Mahabalagiri A, Velipasalar S (2015) Autonomous tracking and counting of footsteps by mobile phone cameras. In: 2015 49th asilomar conference on signals, systems and computers. IEEE 6. Yantao L, Velipasalar S (2017) autonomous footstep counting and traveled distance calculation by mobile devices incorporating camera and accelerometer data. IEEE Sensors J 17(21):7157– 7166 7. Animesh S, et al (2015) Step-by-step detection of personally collocated mobile devices. In: Proceedings of the 16th international workshop on mobile computing systems and applications 8. Zhao N (2010) Full-featured pedometer design realized with 3-axis digital accelerometer. Analog Dial 44(06):1–5

A Study on Monophonic Tremolo Recordings of Chinese Traditional Instrument Pipa Using Spectral Slope Curve Yuancheng Wang, Hanqin Dai, Yuyang Jing, Wei Wei, Dorian Cazau, Olivier Adam, and Qiao Wang

Abstract Involving rapid intensity modulation, the tremolo of plucked string instruments, particularly the Chinese traditional instrument pipa, greatly enriches musical perception and local styles. In this paper, we approach the detection of pipa tremolo onsets from a new angle and find that the spectral slope curve, as a single-parameter frame-wise feature, is effective in dealing with the spurious amplitude peaks produced by the attack noise of the fake nails. Evaluated on our toy dataset, the STFT-based spectral slope curve demonstrates its value for tremolo onset detection in monophonic pipa clips. The tremolo onsets analyzed here could be used for more detailed parameter estimation afterwards. Keywords Tremolo analysis · Onset detection · Fake nails · Attack noise · Spectral slope · Monophonic pipa recordings

Y. Wang · H. Dai · Q. Wang (B) School of Information Science and Engineering, Southeast University, Nanjing, China e-mail: [email protected] Y. Jing Department of Recording Art, Nanjing University of the Arts, Nanjing, China W. Wei Conservatory of Music, XiaoZhuang University, Nanjing, China D. Cazau Institute of Mines-Télécom Atlantique, Lab-STICC, UMR 6285, CNRS, Brest, France O. Adam Sorbonne Université, Institut Jean Le Rond d'Alembert, UMR7190, CNRS, Paris, France © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 X. Shao et al. (eds.), Proceedings of the 9th Conference on Sound and Music Technology, Lecture Notes in Electrical Engineering 923, https://doi.org/10.1007/978-981-19-4703-2_6


1 Introduction 1.1 Brief Description of Tremolo Analysis As a sound effect physically modulated by musicians, tremolo reflects amplitude variation of which the morphology essentially depends on the instrumentation and the playing styles. In the wind music, the sine-like envelope is produced by periodic expiratory intensity. The tremolo for plucked string instruments like guitar and pipa, constituted by a succession of rapid plucks on a single tone, brings a more complex envelope. Categorized into wheel (Lun, 轮), rolling (Gun, 滚) and shaking (Yao, 摇) performed on different strings and plucking fingerings, pipa tremolo, also accompanied with diverse timbres, is the interest of the paper.

1.2 Related Works Playing technique analysis often consists of a detection task and a parameter estimation task: the former identifies the expression intervals for score-level applications, while the latter extracts parameters within those intervals, such as the tremolo rate and dynamic trend/uniformity, for further performance analysis. Tremolo exists widely in singing and instrumental music and serves to ornament the music and express the performer's emotion, but most existing studies focus on non-plucked string instruments. Regnier [1] utilizes a harmonic model on the interpolated spectrum of vocal recordings. Barbancho [2] investigates violin tremolo characterized by attack time, spectral peak bandwidth and envelope fluctuation within a recording segment of the same pitch. Wang [3] and Liu [4] extend the scope to the Chinese instruments bamboo flute and pipa, proposing tremolo detection approaches that apply SVM classifiers to second-order scattering networks and to grid diagrams trimmed by pitch height. The above approaches face the following issues. First, most of those instruments exhibit tremolo behavior different from the pipa, due to the instrumental acoustics and playing style. Second, parameter estimation is not taken into account in the classifier-based works. Third, the pitch-based methods assume a single feature pattern within the trimmed time interval; in fact, note segmentation in tremolo scenarios concerns not only the pitch but also the plucking points, and pitch-shift techniques may lead to poor tremolo detection. Meanwhile, the assumptions of periodicity and intensity uniformity may not meet practical needs, particularly for expressive tremolo. As a prerequisite step of parameter estimation, the energy-based onset detection used to assess musicians' performance on the acoustic guitar [5] is discussed in our study. Bello [6] clarifies the definition of onset, attack and transient in recordings and summarizes the early works on onset detection. Reboursiere [7] evaluates 4


onset detectors under playing techniques in guitar music and the well-known Spectral Flux (SpecFlux) achieves the best result. Böck [8] proposes SuperFlux which suppresses the vibrato effect by filtering out the pitch shift within two adjacent frames.

1.3 Problem Formulation and Paper Organization Specially, due to the inevitable attack noise of fake nails in pipa performance, the plucks usually produce crisp sound and acoustic phenomena in which a spurious amplitude peak often occurs before the natural peak in a single tone. A transient/steady separation using MMSE filter [9], learnt from the excerpt within the red steady interval shown in the upper plot of Fig. 1, is carried out on the whole segment of raw signal. This effectively reduces the harmonic energy for a clip of uniform tremolo on the same pitch and makes the filtered audio in lower plot as the mixture of body resonance and crash sound between the fake nail and string. Examples of attack peaks and natural transients, in which the previous serves to note segmentation and the latter to the true intensity extraction of a tone, are pointed out in the lower plot in Fig. 1. In Fig. 2, the envelopes obtained by filtered RMS energy, SpecFlux and SuperFlux often have two neighbouring peaks in a single tone. The motivation of the paper aims to identify attacks and natural transients so that the follow-up task like parameter estimation can be effectively achieved in monophonic pipa recordings.

Fig. 1 A clip of uniform pipa tremolo recordings and preprocessed audio via sustain signal in maroon interval


Fig. 2 Onset envelopes on uniform tremolo. Red lines and green lines respectively denote the estimated natural transient positions and the annotated attacks

The remainder of the paper is organized as follows. Two types of spectral slopes [10] are presented in Sect. 2 to identify the two types of peaks. In Sect. 3, we evaluate them on our toy dataset to demonstrate that the spectral slope is an effective descriptor for pipa tremolo onset detection. Finally, we conclude the paper in Sect. 4 and discuss future work in the field of expressive analysis.

2 Methods Pink noise and instrumental sounds usually exhibit a linear trend along the frequency axis, while the dragging of the string by the fake nail physically produces a steep amplitude rise at the attack point and the crash sound contributes extra high-frequency energy; as a result, the spectral slope at an attack point approaches 0 compared with that at the nearby natural transient. The advantage of the spectral slope curve over amplitude-based or SpecFlux-based methods is that it reduces the effect of strength fluctuation thanks to its theoretical robustness to amplitude: an arbitrary energy gain does not change the slope of the regression on the log-energy spectrum. Unlike SpecFlux, which requires comparing the spectra of two adjacent frames, this method is also little affected by pitch-shift techniques such as vibrato. Given the short-time log-amplitude spectrum S(k, t) at bin k and frame t, the spectral slope (SS) SS(t), computed by a first-order regression with a bias b(t) at frame t, is formulated as:


Fig. 3 Spectrum and trends displayed for an attack frame and its following natural transient frame

$$S(1{:}K, t) = SS(t)\begin{bmatrix} 1 \\ 2 \\ \vdots \\ K \end{bmatrix} + b(t) + \epsilon \quad (1)$$

where $S(1{:}K, t)$ denotes the spectral column vector at frame t, K is the total number of frequency bins, and $\epsilon$ is the fitting noise. Figure 3 shows the log-amplitude spectrum and the spectral trends illustrated by the regression lines. Compared with a regression on the Mel spectrum, the one fitted on the STFT spectrum puts more weight on the high-frequency register, so the STFT SS curve is more sensitive to high-frequency changes. As with other envelopes, the spectral slope needs to be normalized before peak picking:

$$SS'(t) = \max(-SS(t), 0) \quad (2)$$

$$SS'(t) = \frac{SS'(t) - \min_t(SS')}{\max_t(SS') - \min_t(SS')} \quad (3)$$

The attack and natural transient positions are identified by peak picking on $SS'$ and $1 - SS'$, respectively. In what follows, the slope curve computed directly from the Short Time Fourier Transform (STFT) is referred to as the STFT-spectral slope curve, and the slope curve of the short-time Mel spectrum, which corresponds to a Bode plot of each frame, is referred to as the Mel-spectral slope curve. The lowest two plots in Fig. 2 show the two types of SS curves and the natural transient points selected by the same peak picker used for the other curves. The selected points are uniformly separated from each other and roughly consistent with the content of the recording.
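A minimal sketch of the STFT spectral slope curve and the peak picking described above, using NumPy's per-frame least-squares fit and SciPy's find_peaks in place of the peak picker of [8]; the window and hop follow the settings reported in Sect. 3.1, the thresholds loosely follow the 0.4 value mentioned in Sect. 3.3.1, and the example clip is only a stand-in for a pipa recording.

```python
import numpy as np
import librosa
from scipy.signal import find_peaks

def spectral_slope_curve(y, sr, n_fft=2048, hop_s=0.005):
    """Frame-wise spectral slope: first-order fit of the log-magnitude
    spectrum against the frequency-bin index, as in Eq. (1)."""
    hop = int(sr * hop_s)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    logS = np.log(S + 1e-10)
    bins = np.arange(1, logS.shape[0] + 1)
    slopes = np.polyfit(bins, logS, deg=1)[0]   # slope of each frame's fit
    ss = np.maximum(-slopes, 0.0)                         # Eq. (2)
    ss = (ss - ss.min()) / (ss.max() - ss.min() + 1e-10)  # Eq. (3)
    return ss

def pick_onsets(ss, hop_s=0.005, min_gap_s=0.03):
    """Natural transients from peaks of SS', attacks from peaks of 1 - SS'."""
    dist = max(1, int(min_gap_s / hop_s))
    transients, _ = find_peaks(ss, distance=dist, height=0.4)
    attacks, _ = find_peaks(1.0 - ss, distance=dist, height=0.6)
    return attacks * hop_s, transients * hop_s

y, sr = librosa.load(librosa.ex('trumpet'))  # stand-in for a pipa recording
attacks, transients = pick_onsets(spectral_slope_curve(y, sr))
```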


Fig. 4 Attack points and natural transients obtained by STFT SS curve and Mel SS curve for uniform tremolo

Furthermore, Fig. 4 exhibits the two types of spectral slope curves in the time domain, with their troughs approximately aligned with the attacks and their peaks with the natural transients.

3 Experiments and Analysis 3.1 Model Setting, Performance Metrics and Dataset The smoothness of the envelope increases with the window size, but a too large window may make the gap between adjacent attacks disappear [7]. In this paper, we follow [5] and use a window size of 1024 samples for the energy envelope, and follow [8] with a window of ±23 ms, i.e., 2048 samples at a 44100 Hz sampling rate, for SpecFlux, SuperFlux and our proposed models. The hop size is configured as 5 ms throughout. All time-related parameters of the peak picker from [8] are reduced to 30 ms to better capture the peaks within tremolo intervals. Onset positions are evaluated with the F-score (F), Precision (P) and Recall (R); detected onsets within a tolerance of 25 ms of the ground truth are counted as correct. The evaluation is performed with the mir_eval toolbox [11]. Our dataset¹ consists of real-world and synthesized audio signals: the former is performed on a customized pipa (Model A803) and a student pipa (Model 561 [12]), and the latter is synthesized with the Ample China Pipa (ACP [13]) Virtual Studio Technology instrument (VSTi). The synthesized audio tracks contain:

• 4 open strings and 12 frets of each string played as the playing pitches.
• Fingerings (wheel performed on the 1st and 4th strings, rolling and shaking on the 2nd and 3rd strings).
• 4 tremolo types (non-expression, expression, triple and fade-out, implemented by the tremolo switch).
• The fake nail noise activated to simulate the real recordings.
• Two models with different timbres (Expert and Master pipa).

On the other hand, the real-world recordings contain:

• 2 tremolo clips: uniform tremolo (shown in Fig. 1) and speed-up tremolo (from single notes to high-speed tremolo).
• 1 piece of monophonic Chinese folk music with normal plucks and the tremolo technique: Jasmine Flower.

The dataset is annotated by music experts on the waveform filtered with a 10th-order Butterworth high-pass filter. Note that there may be manual errors such as small time deviations and a very small number of uncertain peaks. Most of the recordings contain relatively uniform tremolo at different rates.

¹ We are very grateful to the contribution of Yimin Xu (B.S. from Southeast University of China (SEU)) and Qianwen Shen (B.S. from SEU) as pipa performers with more than 10-year experience.

3.2 Experimental Results 3.2.1 Benchmark Among Algorithms

Tables 1 and 2 show the benchmarks for the detection of attack and natural transient points on different music clips. The F-scores of STFT SS demonstrate that onset detection can benefit from the attack noise, particularly in tremolo-only material, i.e., the uniform and synthesized audio; a high F-score ensures the accuracy of parameter estimation in tremolo regions. In the Speed-up and Jasmine Flower clips, we found that false positives of the STFT SS method often occur at the long tails of single notes or in unvoiced regions, so its precision is relatively low. In the Jasmine Flower recording, the noise from fingerings and string friction of other playing techniques is another disturbing factor. While such false positives can be eliminated by post-processing, the Mel SS based method, whose recall is clearly lower than that of the STFT SS method, misses peaks that cannot be recovered. Figure 6 displays the peak picking on the triple tremolo cases and shows locally that the poor results of the log-energy envelope, SpecFlux and SuperFlux stem from their inherent inability to distinguish attack bursts from natural transients.


Table 1 Synthesized music in unit of percentage (%) (F/P/R)

| Methods | Attacks, Master | Attacks, Expert | Natural transients, Master | Natural transients, Expert |
|---------|-----------------|-----------------|----------------------------|----------------------------|
| Log-Energy | 91/87/95 | 93/90/96 | 90/86/93 | 85/82/88 |
| SpecFlux | 89/87/90 | 89/83/96 | 87/86/88 | 83/77/90 |
| SuperFlux | 85/84/86 | 85/79/92 | 84/83/85 | 79/73/86 |
| Mel-SS | 80/97/68 | 87/98/77 | 82/99/70 | 89/97/81 |
| STFT-SS | 96/99/94 | 97/98/97 | 98/98/97 | 98/97/99 |
| STFT-SS-post | – | – | 98/98/97 | 99/99/99 |

Table 2 Real world in unit of percentage (%) (F/P/R)

| Methods | Attacks, Uniform | Attacks, Speed-up | Attacks, Jasmine Flower | Natural transients, Uniform | Natural transients, Speed-up | Natural transients, Jasmine Flower |
|---------|------------------|-------------------|-------------------------|-----------------------------|------------------------------|------------------------------------|
| Log-Energy | 87/77/100 | 82/71/96 | 75/65/88 | 89/81/100 | 83/72/97 | 81/70/94 |
| SpecFlux | 90/85/96 | 86/79/94 | 77/73/80 | 96/93/100 | 86/80/95 | 86/82/90 |
| SuperFlux | 87/79/96 | 88/82/93 | 73/75/71 | 93/86/100 | 88/83/94 | 85/88/82 |
| Mel-SS | 92/88/96 | 83/89/77 | 81/88/76 | 96/96/96 | 85/83/86 | 86/92/81 |
| STFT-SS | 100/100/100 | 81/73/91 | 81/71/94 | 96/93/100 | 85/79/94 | 86/78/95 |
| STFT-SS-post | – | – | – | 100/100/100 | 87/82/94 | 87/79/95 |

Post-processing: False Positive Removal Using Thresholding

Since the STFT-based spectral slope is irrelevant to the amplitude and weighted on high frequency components, the false positives from unvoiced or low energy regions

A Study on Monophonic Tremolo Recordings of Chinese Traditional ...

77

Fig. 5 Distributions of true positives and false positives for attack peaks and natural transients

Fig. 6 Onset envelopes on triple tremolo part in synthesized audio. Red lines and green lines respectively denote the estimated natural transients and the annotated attack points

at the tails of single notes may be introduced by peak-picking. The distributions of TP and FP for natural transients and attacks are shown in Fig. 5 and we propose to remove the estimated natural transients with SS values less than 0.4. The postprocessed results have a slight increase on precision as well as F-score.

78

Y. Wang et al.

4 Conclusion In this paper, we investigate spectral slope curve that effectively identifies the spurious peaks produced by attack noise and the natural transients of Chinese traditional instrument pipa tremolo sound. Evaluated on the real-world and synthesized pipa materials, STFT based spectral slope curve reaches the state-of-the-arts performance on onset detection in tremolo-only regions. In the future, more massive datasets and the feature fusion could be explored. The extension to the other instruments performed with fake nails, pick, plectrum like guitar, guqin, zhongruan and yueqin is a promising direction. Finally, the studies and real-world datasets annotated with abundant features dedicated to Chinese instruments remains scarce and will contribute to the development of ethonomusicology.

References 1. Regnier L, Peeters G (2009) Singing voice detection in music tracks using direct voice vibrato detection. In: 2009 IEEE international conference on acoustics, speech and signal processing, pp 1685–1688 2. Barbancho I, de la Bandera C, Barbanche AM et al (2009) Transcription and expressiveness detection system for violin music. In: 2009 IEEE international conference on acoustics, speech and signal processing, pp 189–192 3. Wang C, Benetos E, Lostanlen V et al (2019) Adaptive time-frequency scattering for periodic modulation recognition in music signals. In: International society for music information retrieval conference 4. Liu Y, Zhang J, Xiao Z (2019) Grid diagram features for automatic pipa fingering technique classification. In: 12th international symposium on computational intelligence and design (ISCID) 5. Freire S, Nézio L (2013) Study of the tremolo technique on the acoustic guitar: Experimental setup and preliminary results on regularity. In: Proceedings of International Conference on Sound and Music Computing, Stockholm, pp 329–334 6. Bello JP, Daudet L, Abdallah S et al (2005) A tutorial on onset detection in music signals. IEEE Trans Speech Audio Process 13(5):1035–1047 7. Reboursiere L, Lähdeoja O, Drugman T et al (2012) Left and right-hand guitar playing techniques detection 8. Böck S, Widmer G (2013) Maximum filter vibrato suppression for onset detection. In: Proceedings of the 16th international conference on digital audio effects (DAFx). Maynooth, Ireland, vol 7 9. Ephraim Y, Malah D (1985) Speech enhancement using a minmum meansquare error logspectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 33:443–445 10. Chakraborty S (2013) An introduction to audio content analysis: applications in signal processing and music informatics. Comput Rev 54(8):469–470 11. Raffel C, McFee B, Humphrey EJ et al (2014) mir_eval: a transparent implementation of common mir metrics. In: Proceedings of the 15th international society for music information retrieval conference 12. No. 561 product of the Shanghai No.1 National Musical instrument factory. Accessed Aug 2021. http://shop.dunhuangguoyue.com/product-503.html 13. Ample China Pipa Software. Accessed Aug 2021. http://www.amplesound.net/en/pro-pd.asp? id=30

Investigation of Singing Voice Separation for Singing Voice Detection in Polyphonic Music Yifu Sun, Xulong Zhang, Xi Chen, Yi Yu, and Wei Li

Abstract Singing voice detection (SVD), to recognize vocal parts in the song, is an essential task in music information retrieval (MIR). The task remains challenging since singing voice varies and intertwines with the accompaniment music, especially for some complicated polyphonic music such as choral music recordings. To address this problem, we investigate singing voice detection while discarding the interference from the accompaniment. The proposed SVD has two steps: i. The singing voice separation (SVS) technique is first utilized to filter out the singing voice’s potential part coarsely. ii. Upon the continuity of vocal in the time domain, Long-term Recurrent Convolutional Networks (LRCN) is used to learn compositional features. Moreover, to eliminate the outliers, we choose to use a median filter for time-domain smoothing. Experimental results show that the proposed method outperforms the existing state-of-the-art works on two public datasets, the Jamendo Corpus and the RWC pop dataset. Keywords Singing voice detection · Vocal detection · Singing voice separation · Music information retrieval

Y. Sun · X. Chen · W. Li School of Computer Science and Technology, Fudan University, Shanghai 200438, China X. Zhang Ping An Technology (Shenzhen) Co., Ltd., Shenzhen 518000, China Y. Yu Digital Content and Media Sciences Research Division, National Institute of Informatics, Chiyoda, Tokyo 163-8001, Japan W. Li (B) Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, China e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 X. Shao et al. (eds.), Proceedings of the 9th Conference on Sound and Music Technology, Lecture Notes in Electrical Engineering 923, https://doi.org/10.1007/978-981-19-4703-2_7

79

80

Y. Sun et al.

1 Introduction The purpose of singing voice detection task is to identify the part of the vocals in a given song. In the domain of music information retrieval (MIR), singing voice detection is often regarded as a pre-processing method to identify the vocal segments in the original mixed audio signal which can be exploited in many research topics, such as singer identification [1, 2], melody transcription [3], query by humming [4], lyrics transcription [5], etc. In the early years, researches had been focused on feature engineering. In [6], a feature set consisting of fluctogram, spectral flatness, spectral contraction and vocal variance, is chosen in order to distinguish the singing voice from highly harmonic instruments. But the identification result based on traditional techniques can hardly be considered ideal. As deep learning is successfully applied in feature representation and classification, better performance can be achieved. In [7], Kum et al. used the multi-task learning method to detect and classify the singing voice jointly. In [8], better performance is achieved by transferring knowledge learned from a speech activity detection dataset. Since singing voice varies through time and intertwines with the background accompaniment music, these factors lead to the difficulty of detecting singing voice activities in polyphonic music. The SVD task is similar to vocal detection (VD) in speech domain [9]. However, unlike random background noise, the accompaniment is highly correlated with the singing voice [10]. The singing voice is a fluctuating sound, not as stationary as harmonic instruments like piano or guitar, but much more than percussive ones, like drums. It thus lies between harmonic and percussive components [11]. According to the results of studying several SVS methods such as [12–17] devised for analyzing non-stationary harmonic sound sources such as the singing voice, we propose to use U-Net [14] to mitigate the interference of accompaniment. With the joint effort of the LRCN classifier [18] and the median filter as the post smoothing component [19], the idea of our SVD system containing two steps: coarse-grained pre-processing and fine-grained detection is proposed, as shown in Fig. 1. In our work, the four major procedures of vocal detection were compared, including vocal separation, feature selection, classifier selection, and filter smoothing. Finally we choose the process with the optimal performance in each step as our purposed method.

Fig. 1 Idea of the SVD system with two steps

Investigation of Singing Voice Separation ...

81

2 Proposed Method Singing voice intertwining with the signal of background accompaniment makes SVD a challenging task. Figure 2 illustrates a detail comparison. The long-window spectrogram is used to concentrate more on the frequency domain for analysis. Since the singing voice does not have a relatively stable pitch like harmonic instruments, the energy spreads out a range of frequency channels (see Fig. 2(c)). This makes the singing voice acts more like a percussive instrument in Fig. 2(b), rather than a harmonic instrument in Fig. 2(a); under high time resolution, analysing using shortwindow spectrogram, pitch fluctuates of the vocal can be ignored, making energy

Fig. 2 Long-window (186 ms) spectrograms of (a), (b) and (c) corresponding to a piece of piano, drum and the vocal; Short-window (11 ms) spectrograms of (d), (e) and (f) corresponding to a piece of piano, drum and the vocal. Above audio clips are in 44.1k Hz sample rate. For long-window spectrograms, 8192 samples length, corresponding to 186 ms, is used as the window size; For short-window, 512 samples length, corresponding to 11 ms, is used as the window size.

82

Y. Sun et al.

concentrate on a set of frequency channels and smooth in the time direction. In this situation, the singing voice (as seen Fig. 2(f)) is more like a harmonic instrument (as seen Fig. 2(d)). In the audio processing field, it is typical to represent an audio clip in the timefrequency domain. Once using the mixture as input, no matter hand-crafted features or data-driven methods, the interference from time and frequency domain is hard to eliminate. This motivates us to propose using the SVS technique to remove the accompaniment before fine-grained singing voice detection.

2.1 U-Net for Singing Voice Separation The typical SVS method in the coarse-grained pre-processing step aims to generate the mask applied on the spectrogram to filter out the potential targeted part of the singing voice. How to model the target and generate the mask is the most challenging part. Over the years, SVS methods can be classified into three classes. The first assumpts that the singing voice is mostly harmonic [10] and tries to model the singing voice by finding the fundamental frequency. According to the fact that the accompaniment is highly structed and tends to lie in a small range of frequencies [10], the second kind of method tries to model the accompaniment and get the singing voice by subtracting the accompaniment from the mixture. The third kind of method is the data-driven method which learns the model by large and representative datasets. To recreate the fine, low-level detail required for high-quality audio reproduction [20], the data-driven method, U-Net, is chosen as the SVS component for coarsegrained pre-processing. The flow chart of SVS using U-Net as the separator is shown in Fig. 3. SVS’s audio representation is the magnitude spectrogram obtained by shorttime Fourier Transform (STFT). U-Net’s target is to generate the mask used to extract the target spectrogram out [21]. In other words, the mask determines whether to keep constant or attenuated a certain frequency bin [10]. Letting  denote the element wise manipulation, the masking operation is formulated as: yvocal ˆ = ymi x  M,

(1)

where ymi x , yvocal ˆ and M represent the spectrogram of the mixture, the estimated vocal spectrogram and the mask separately. Finally, the estimated vocal audio can be rebuilt using the phase information from STFT and inverse short-time Fourier Transform (ISTFT) operation. As illustrated in the dashed box in Fig. 3, the encoder, the decoder and the skip connection build up the U-Net. A stack of convolutional operations composes the encoder. Each layer halves the feature map size in both dimensions simultaneously and doubles the number of channel, trying to encode smaller and deeper representations. In contrast, the decoder goes exactly the opposite way. The skip connections between the same hierarchical level allow low-level information to flow directly from the high-resolution input to the high-resolution output to recreate much more detailed information [14].

Investigation of Singing Voice Separation ...

83

Fig. 3 Singing voice separation method in the green dashed block and the U-Net model in the blue dashed block.

2.2 Feature Extraction There are many features proposed for the singing voice detection task. Among the features, most commonly used Mel-frequency Cepstral coefficients (MFCC) [6], Linear Predictive Cepstral Coefficients (LPCC) [22] and Perceptual Linear Predictive Coefficients (PLP) [6] are chosen to examine the performance. MFCC has been widely used in a large number of speech and audio recognition tasks [6], and MFCC can represent the audio signal’s timbre features. It is thus used as the feature in the proposed model. The most popular approach for modelling human voice production is Linear Prediction Coefficients (LPC) which performs well in a clean environment but not so good in a noisy one. LPCC are calculated by introducing the cepstrum coefficients in the LPC. Assume that LPCC, governed by the shape of vocal tract, is the nature of the sound. PLP is originally proposed by Hynek Hermansky as a way of warping spectra to minimize the differences between speakers while preserving important speech information.


The audio signal is first segmented into overlapping frames. On each frame, a fast Fourier transform (FFT) is computed with a Hamming window. Most of the features are selected for their ability to discriminate voice from music [6]. Because MFCC achieves the best performance compared with the other two features and their combinations, MFCC is chosen as the audio descriptor in our proposed singing voice detection method.
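As a rough illustration of this front end, the sketch below computes frame-wise MFCCs with a Hamming analysis window using librosa; the number of coefficients and the frame parameters are assumptions, not the settings used in the paper.

```python
import librosa

def extract_mfcc(audio_path, n_mfcc=20, n_fft=2048, hop_length=512):
    """Sketch: frame-wise MFCCs with a Hamming analysis window.
    n_mfcc, n_fft and hop_length are illustrative values only."""
    y, sr = librosa.load(audio_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length,
                                window="hamming")
    return mfcc.T  # shape: (num_frames, n_mfcc)
```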

2.3 LRCN for Classification

Considering LRCN's capability of learning compositional acoustic representations in both the feature and time domains, it is chosen as the classifier in the fine-grained singing voice detection step. The LRCN architecture is a combination of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network, as shown in Fig. 4. The key equations of the LSTM structure are given in Eqs. (2)-(6):

$$i(t) = \sigma(W_i \cdot [F_c(X(t)), H(t-1), C(t-1)] + b_i) \quad (2)$$

$$f(t) = \sigma(W_f \cdot [F_c(X(t)), H(t-1), C(t-1)] + b_f) \quad (3)$$

$$C(t) = f(t) \cdot C(t-1) + i(t) \cdot \tanh(W_c \cdot [F_c(X(t)), H(t-1)] + b_c) \quad (4)$$

$$o(t) = \sigma(W_o \cdot [F_c(X(t)), H(t-1), C(t)] + b_o) \quad (5)$$

$$H(t) = o(t) \cdot \tanh(C(t)) \quad (6)$$

Fig. 4 The LRCN model used for block-wise singing/non-singing classification.

where '·' represents the element-wise product and $F_c$ represents the convolution operator; $\sigma$ is the sigmoid function, $W$ is a weight matrix, and $b$ is a bias vector. The input gate $i(t)$, forget gate $f(t)$ and output gate $o(t)$ of the LRCN are given in Eqs. (2), (3) and (5), respectively. $C(t)$ in Eq. (4) is the LRCN cell state, and $H(t)$ in Eq. (6) is the output of the LRCN cell. Since MFCC provides abundant information for SVD, the CNN block in the LRCN serves as the feature extractor for singing/non-singing information, while the LSTM block learns the long-range dependencies between different frames. Because of drawbacks in the LSTM block of the LRCN, the proposed system takes block-wise audio series as the input. Firstly, although LSTM can alleviate the gradient vanishing problem to a certain extent, for long series (such as more than 1000 steps, which is common in audio processing) the problem is hard to eliminate. Secondly, LSTM cannot process information in parallel, making the training process time-consuming. Thirdly, the singing voice duration is often at word or sentence level, so the dependency between frames is not that long. Therefore, block-wise data is processed as a unit in the LRCN classifier. Finally, with a flatten layer and three fully connected layers following the LRCN, the block-wise output is obtained.
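The following Keras sketch illustrates a block-wise LRCN of the kind described here, with the kernel size, block length and dense-layer sizes taken from the experiment settings in Sect. 3.1; the number of convolution filters, the LSTM width and the dense activations are assumptions, since they are not specified.

```python
from tensorflow.keras import layers, models

def build_lrcn(block_frames=29, n_mfcc=20):
    """Sketch of a block-wise LRCN: a small CNN feature extractor followed
    by an LSTM, a flatten layer and three fully connected layers."""
    inputs = layers.Input(shape=(block_frames, n_mfcc, 1))
    x = layers.Conv2D(8, kernel_size=(1, 4), activation="relu")(inputs)
    # collapse the feature/channel axes so each frame becomes one time step
    x = layers.Reshape((block_frames, -1))(x)
    x = layers.LSTM(64, return_sequences=True,
                    recurrent_activation="hard_sigmoid")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(200, activation="relu")(x)
    x = layers.Dense(50, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # block-wise label
    return models.Model(inputs, outputs)
```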

2.4 Post Smoothing

Frame-wise classifiers tend to produce noisy results, leading to over-segmentation (short segments), whereas human annotation tends to produce under-segmentation (long segments that ignore instrumental breaks or singer breathing). For these reasons, post-processing is usually applied to the estimated segmentation. Accumulating the segment likelihood over a more extended period makes the decision more reliable. A median filter is applied as the post-smoothing step: the label in each time frame is recalculated as the median value over a fixed-size window along the time dimension.


3 Evaluation

3.1 Experiment Settings

The architecture of the pre-processing model, U-Net, follows that in [14]. In the encoder, each layer comprises a 2D 5 × 5 kernel convolution with stride 2, batch normalization, and leaky rectified linear units (leaky ReLU) with a leakiness of 0.2. In the decoder, each layer comprises a 2D 5 × 5 kernel deconvolution with stride 2, batch normalization, and ReLU activation; dropout with a rate of 50% is used in the first three decoder layers. The acoustic features of the successive frames in a fixed-duration block are used as the input to the classifier. The block duration is set to 600 ms, which equals 29 audio frames. The LRCN comprises a 2D 1 × 4 kernel convolutional layer with ReLU activation and an LSTM block with hard sigmoid activation. After flattening each LSTM output, three fully connected layers with output sizes of 200, 50 and 1, respectively, are added to obtain the final block-wise label.
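To make the U-Net settings concrete, the sketch below shows one encoder layer and one decoder layer of such a network in Keras; the padding choice and the position of the skip concatenation relative to the dropout are assumptions.

```python
from tensorflow.keras import layers

def unet_encoder_block(x, filters):
    """One encoder layer as described above:
    5x5 strided convolution, batch norm, leaky ReLU (leakiness 0.2)."""
    x = layers.Conv2D(filters, kernel_size=5, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(0.2)(x)

def unet_decoder_block(x, skip, filters, use_dropout=False):
    """One decoder layer: 5x5 strided deconvolution, batch norm, ReLU,
    optional 50% dropout (first three layers), then skip concatenation."""
    x = layers.Conv2DTranspose(filters, kernel_size=5, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    if use_dropout:
        x = layers.Dropout(0.5)(x)
    return layers.Concatenate()([x, skip])
```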

3.2 Dataset

Two public datasets are used for a fair comparison of our approach with others. The Jamendo Corpus, retrieved from the Jamendo website, contains 93 copyright-free songs with singing voice activity annotations. The database was built and published along with [23]. The corpus is divided into three sets: a training set containing 61 songs, and a validation set and a test set containing 16 songs each. The RWC music dataset consists of 100 pop songs released by Goto et al. [24], with singing voice annotations provided by Mauch et al. [9].

3.3 Experiments and Results

3.3.1 Singing Voice Separation as a Pre-processing Step

To demonstrate the importance of the singing voice separation method, especially the U-Net, as a pre-processing step for SVD, experiments are conducted on the Jamendo Corpus in this section. We choose five methods as the pre-processing component: the REpeating Pattern Extraction Technique (REPET) [12], Kernel Additive Modelling (KAM) [16], Robust Principal Component Analysis (RPCA) [15] and multi-stage non-negative matrix factorization (NMF) [17] from the assumption-based methods, and the U-Net [18]. The experimental results can be seen in Fig. 5. The measurements used are accuracy, precision, recall and F1.


Fig. 5 Comparison results of different SVS methods as the pre-processing step. "raw" indicates the baseline method without the pre-processing step.

Except that the accuracy score of RPCA is slightly lower than that of the original method without pre-processing, the three methods (REPET, U-Net and RPCA) are better than the original method on the other measurements. The recall and F1 of KAM are lower than those of the raw method. This might be because KAM not only reduces the accompaniment but also affects the singing voice to some extent. Moreover, U-Net performs best as the pre-processing component, because data-driven methods do not carry the assumptions that set a ceiling on performance. The rationality of using U-Net as the pre-processing step is thus verified.

3.3.2 Post Process of Smoothing

In this work, we evaluated two approaches within the same framework as the post-processing step of the vocal detection system. The first was a simple median filter with a fixed window length of 87 frames (3.48 s), which was found to give the best trade-off between complexity and accuracy. The second was a Hidden Markov Model (HMM) based method [23], a temporal smoothing of the posterior probabilities that helps adapt the segmentation sequence to the manual annotation. Compared with the raw system framework without post-processing, using the median filter for temporal smoothing improves the F1 value, accuracy, and precision. Based on these comparison results, we chose the median filter for post-processing.
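A minimal sketch of this median-filter smoothing with scipy, using the 87-frame window reported above; the binarisation threshold is an assumption.

```python
import numpy as np
from scipy.signal import medfilt

def smooth_predictions(frame_probs, window=87, threshold=0.5):
    """Median-filter post-smoothing over a fixed window (must be odd),
    followed by thresholding into singing/non-singing labels."""
    smoothed = medfilt(np.asarray(frame_probs, dtype=float), kernel_size=window)
    return (smoothed >= threshold).astype(int)
```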

3.3.3 Comparison Results on the Public Datasets

Finally, the proposed singing voice detection system is compared with existing state-of-the-art works, i.e. Ramona [23], Schlüter [25], Lehner-1 [26], Lehner-2 [19], Leglaive [11] and JDCs [7], on the Jamendo corpus, and with Mauch [9], Schlüter [25], Lehner-1 [26] and Lehner-2 [19] on the RWC pop dataset. The comparison results are shown in Table 1 and Table 2. Our proposed system is named U-Net-LRCN.

Table 1 Comparison results on Jamendo Corpus

Methods          Accuracy  Precision  Recall  F1
Ramona [23]      0.822     –          –       0.831
Schlüter [25]    0.923     –          0.903   –
Lehner-1 [26]    0.882     0.880      0.862   0.871
Lehner-2 [19]    0.848     –          –       0.846
Leglaive [11]    0.915     0.895      0.926   0.910
JDCs [7]         0.800     0.791      0.802   0.792
U-Net-LRCN       0.888     0.865      0.920   0.892

Table 2 Comparison results on RWC pop dataset

Methods          Accuracy  Precision  Recall  F1
Schlüter [25]    0.927     –          0.935   –
Mauch [9]        0.872     0.887      0.921   0.904
Lehner-1 [26]    0.875     0.875      0.926   0.900
Lehner-2 [19]    0.868     0.879      0.906   0.892
U-Net-LRCN       0.916     0.926      0.934   0.930

Table 1 shows the comparison results on the Jamendo Corpus. The U-Net-LRCN outperforms the shallow models. The method of Leglaive [11] uses a bi-LSTM model, which considers both past and future information, and achieves an F1 of 0.910, outperforming the other state-of-the-art works. On F1, the U-Net-LRCN achieves 0.892, 0.018 lower than Leglaive [11], which is on par with the state of the art. As seen in Table 2, on the RWC dataset the method of Schlüter [25] performs best. It uses a data augmentation method, so the dataset used is not the original one; besides, its precision and F1 scores are not given. Except for Schlüter [25], the U-Net-LRCN attains an F1 of 0.930, an improvement of 0.026 over the state-of-the-art method of Mauch [9]. In summary, the proposed U-Net-LRCN produces relatively better results than state-of-the-art methods on the two public datasets. Using the SVS method as a pre-processing step has demonstrated its effectiveness in eliminating the accompaniment's interference and improving SVD performance. Like the compared recurrent approaches, our U-Net-LRCN relies on an LSTM to learn the temporal context, but it additionally learns spatial relations through its convolutional layers and therefore performs better than an LSTM alone.


4 Conclusion

As the singing voice changes and intertwines with the background accompaniment signal in the time domain, the difficulty of SVD in polyphonic music increases. Therefore, we propose to use the SVS method U-Net as a pre-processing step to eliminate the interference of the background accompaniment. Together with the LRCN classifier and the median post-smoothing method, the proposed SVD system performs relatively better than current state-of-the-art works. Future work will explore more light-weight methods to eliminate the interference from the background accompaniment. Furthermore, applying the proposed singing voice detection system to specific use cases such as singer identification will be attempted.

Acknowledgement This work was supported by National Key R&D Program of China (2019YFC1711800), NSFC (62171138).

References 1. Berenzweig A, Ellis DP, Lawrence S (2002) Using voice segments to improve artist classification of music. In: Proceedings of the AES 22nd international conference. [S.l.] 2. Zhang X, Qian J, Yu Y, et al (2021) Singer identification using deep timbre feature learning with KNN-net. In: ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 3380–3384. [S.l.]. IEEE 3. Salamon J, Gómez E, Ellis DP et al (2014) Melody extraction from polyphonic music signals: approaches, applications, and challenges. IEEE Sig Process Mag 31(2):118–134 4. Hsu CL, Wang D, Jang JSR et al (2012) A tandem algorithm for singing pitch extraction and voice separation from music accompaniment. IEEE Trans Audio Speech Lang Process 20(5):1482–1491 5. Mesaros A (2013) Singing voice identification and lyrics transcription for music information retrieval invited paper. In: 7th conference on speech technology and human-computer dialogue (SpeD), pp 1–10. [S.l.]: IEEE 6. Rocamora M, Herrera P (2007) Comparing audio descriptors for singing voice detection in music audio files. In: Brazilian symposium on computer music, vol 26, p 27. [S.l.] 7. Kum S, Nam J (2019) Joint detection and classification of singing voice melody using convolutional recurrent neural networks. Appl Sci 9(7):1324 8. Hou Y, Soong FK, Luan J, et al (2020) Transfer learning for improving singing-voice detection in polyphonic instrumental music. In: Meng H, Xu B, Zheng TF (eds) 21st Annual conference of the international speech communication association, Virtual Event, Shanghai, China, 25–29 October 2020: ISCA, pp 1236–1240. https://doi.org/10.21437/Interspeech.2020-1806 9. Mauch M, Fujihara H, Yoshii K, et al (2011) Timbre and melody features for the recognition of vocal activity and instrumental solos in polyphonic music. In: Proceedings of the 12th international society for music information retrieval conference, pp 233–238. [S.l.] 10. Rafii Z, Liutkus A, Stöter FR et al (2018) An overview of lead and accompaniment separation in music. IEEE/ACM Trans Audio Speech Lang Process 26(8):1307–1335 11. Leglaive S, Hennequin R, Badeau R (2015) Singing voice detection with deep recurrent neural networks. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 121–125. [S.l.]: IEEE 12. Rafii Z, Pardo B (2012) Repeating pattern extraction technique (REPET): a simple method for music/voice separation. IEEE Trans Audio Speech Lang Process 21(1):73–84


13. Yang YH (2012) On sparse and low-rank matrix decomposition for singing voice separation. In: Proceedings of the 20th ACM international conference on multimedia, pp 757–760. [S.l.] 14. Jansson A, Humphrey E, Montecchio N, et al (2017) Singing voice separation with deep U-Net convolutional networks. In: 18th international society for music information retrieval conference. [S.l.] 15. Huang PS, Chen SD, Smaragdis P, et al (2012) Singing-voice separation from monaural recordings using robust principal component analysis. In: 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 57–60. [S.l.]: IEEE 16. Liutkus A, Rafii Z, Pardo B, et al (2014) Kernel spectrogram models for source separation. In: 4th joint workshop on hands-free speech communication and microphone arrays (HSCMA), pp 6–10. [S.l.]: IEEE 17. Zhu B, Li W, Li R et al (2013) Multi-stage non-negative matrix factorization for monaural singing voice separation. IEEE Trans Audio Speech Lang Process 21(10):2096–2107 18. Donahue J, Anne Hendricks L, Guadarrama S, et al (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2625–2634. [S.l.] 19. Lehner B, Sonnleitner R, Widmer G (2013) Towards light-weight, real-time-capable singing voice detection. In: Proceedings of the 14th international society for music information retrieval conference, pp 53–58. [S.l.] 20. Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, Frangi A (eds) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. MICCAI 2015. LNCS, vol 9351, pp 234–241. Springer, Cham. https://doi.org/10.1007/978-3-319-24574-4_28 21. Fan ZC, Lai YL, Jang JSR (2018) SVSGAN: singing voice separation via generative adversarial network. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 726–730. [S.l.]: IEEE 22. You SD, Wu YC, Peng SH (2016) Comparative study of singing voice detection methods. Multimedia Tools Appl 75(23):15509–15524 23. Ramona M, Richard G, David B (2008) Vocal detection in music with support vector machines. In: 2008 IEEE international conference on acoustics, speech and signal processing, pp 1885– 1888. [S.l.]: IEEE 24. Goto M, Hashiguchi H, Nishimura T, et al (2002) RWC music database: popular, classical and jazz music databases. In: Proceedings of the 3rd international conference on music information retrieval: volume 2, pp 287–288. [S.l.] 25. Schlüter J, Grill T (2015) Exploring data augmentation for improved singing voice detection with neural networks. In: Proceedings of the 16th international society for music information retrieval conference, pp 121–126. [S.l.] 26. Lehner B, Widmer G, Sonnleitner R (2014) On the reduction of false positives in singing voice detection. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 7480–7484. [S.l.]: IEEE

General Audio Signal Processing

Learning Optimal Time-Frequency Representations for Heart Sound: A Comparative Study Zhihua Wang, Zhihao Bao, Kun Qian, Bin Hu, Björn W. Schuller, and Yoshiharu Yamamoto

Abstract Computer audition based methods have increasingly attracted efforts among the community of digital health. In particular, heart sound analysis can provide a non-invasive, real-time, and convenient (anywhere and anytime) solution for preliminary diagnosis and/or long-term monitoring of patients who are suffering from cardiovascular diseases. Nevertheless, extracting excellent time-frequency features from the heart sound is not an easy task. On the one hand, heart sound belongs to audio signals, which may be suitable to be analysed by classic audio/speech techniques. On the other hand, this kind of sound generated by our human body should contain some characteristics of physiological signals. To this end, we propose a comprehensive investigation on time-frequency methods for analysing the heart sound, i.e., short-time Fourier transformation, wavelet transformation, Hilbert-Huang transformation, and Log-Mel transformation. The time-frequency representations will be automatically learnt via pre-trained deep convolutional neural networks. Experimental results show that all the investigated methods can reach a mean accuracy higher than 60.0%. Moreover, we find that wavelet transformation can beat other methods by reaching the highest mean accuracy of 75.1% in recognising normal or abnormal heart sounds.

Keywords Computer audition · Digital health · Heart sound · Time-frequency analysis · Deep learning · Transfer learning

This work was partially supported by the BIT Teli Young Fellow Program from the Beijing Institute of Technology, China, the China Scholarship Council (No. 202106420019), China, the JSPS Postdoctoral Fellowship for Research in Japan (ID No. P19081) from the Japan Society for the Promotion of Science (JSPS), Japan, and the Grants-in-Aid for Scientific Research (No. 19F19081 and No. 20H00569) from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan.

Z. Wang · Z. Bao · K. Qian (B) · B. Hu (B)
School of Medical Technology, Beijing Institute of Technology, Beijing, China
e-mail: [email protected]
B. Hu, e-mail: [email protected]
Z. Wang, e-mail: [email protected]
Z. Bao, e-mail: [email protected]
Z. Wang
School of Mechatronic Engineering, China University of Mining and Technology, Xuzhou, China
B. W. Schuller
GLAM – Group on Language, Audio, and Music, Imperial College London, London, UK
e-mail: [email protected]
Z. Wang · Y. Yamamoto
Educational Physiology Laboratory, The University of Tokyo, Bunkyo, Tokyo, Japan
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
X. Shao et al. (eds.), Proceedings of the 9th Conference on Sound and Music Technology, Lecture Notes in Electrical Engineering 923, https://doi.org/10.1007/978-981-19-4703-2_8

1 Introduction

Statistically, more than 17 million people worldwide die from Cardiovascular Diseases (CVDs) every year, making them the leading cause of death [29]. In this context, timely and accurate heart sound classification methods can greatly reduce CVD mortality [23]. Auscultation is the simplest and most direct way to diagnose CVDs [7, 19]. However, it mainly depends on the doctor's subjective experience to judge the disease [5, 31]. Computer audition (CA) has been demonstrated to be effective in plenty of healthcare applications [15, 16, 18]. In particular, heart sound classification has been increasingly studied in recent years [4, 10, 11]. CA based methods overcome the defects of auscultation to a certain extent. However, the classic machine learning (ML) paradigm relies on features extracted by experts, which is exhausting work and greatly impacts the final result [2, 26]. In addition, these features are mostly extracted from long recordings containing different cardiac cycles [14, 34]. Recordings with different time lengths undermine the fairness of result comparison, and long recordings are not conducive to the development of rapid heart sound diagnostic equipment. Due to the success of convolutional neural networks (CNNs) in recent years, a growing number of researchers have started to utilise CNNs for various heart sound processing applications [3, 21, 22]. The inherent mechanism of CNNs' automatic feature extraction eliminates the influence of 'artificial' or suboptimal features on the results. However, one of the main difficulties of employing CNNs for heart sound classification is the lack of a sufficient variety of pre-trained models based on time domain signals [8, 9]. Accordingly, one has to train one's models from scratch, which can be a difficult and computationally expensive task. In recent years, driven by competitive challenges, a large number of pre-trained models in the field of image processing have been made publicly available, such as VGG [24] and ResNet [6]. This inspires us to utilise pre-trained CNNs from the image processing domain for heart sound analysis in combination with sound-to-image transformation techniques.


Short-Time Fourier Transformation (STFT) [32], Wavelet Transformation (WT) [17], Hilbert-Huang Transformation (HHT) [25], and Log-Mel Transformation (LogMT) [13] are powerful time-frequency analysis methods and have achieved remarkable results in heart sound classification and other acoustic analysis and processing tasks. However, to the best of the authors' knowledge, most heart sound researchers use time-frequency analysis methods for pre-processing [1, 27] and 'artificial' (i.e., non-data-driven) feature extraction [28, 30]. There are few reports on heart sound classification based on the images extracted by these four time-frequency analysis methods. For this reason, we set out to classify heart sounds using the images extracted by STFT, WT, HHT, and LogMT, and then find the optimal time-frequency representation for heart sound classification. Herein, we utilise the pre-trained VGG16 model for heart sound classification with the help of time-frequency analysis methods and transfer learning techniques. Specifically, the images are extracted by STFT, WT, HHT, and LogMT. Further, we compare the performance of these four time-frequency analysis methods on the heart sound classification task. The remainder of the paper is organised as follows: the proposed approach is first described in Sect. 2; then, we describe the database, experimental setup, evaluation criteria, and results in Sect. 3; finally, conclusions and future work plans are given in Sect. 4.

2 Methods

In this section, we describe the main technologies involved in the proposed method for heart sound classification. These mainly include the image extraction methods based on STFT, WT, HHT, and LogMT, and a pre-trained VGG16 model used as the basis for fine-tuning for normal/abnormal heart sound classification.

2.1 Image Extraction

The CNN model produces different classification results depending on the images used as input. In this regard, we explore four sound-to-image transformation methods that have achieved remarkable success in other acoustic classification tasks. To keep the comparison straightforward, the four transformations are applied to single cardiac cycle signals on the MATLAB R2020a platform. Following the work presented in [33], we select 1.5 s as the length of a single cardiac cycle.

STFT. We utilise MATLAB's short-time Fourier analysis function spectrogram to extract the STFT spectrograms at a 2 kHz sampling frequency. The relevant parameters are set as follows: the window function is a Hamming window of length 128, and the overlap ratio of the window function is 50%.


Fig. 1 The spectrograms extracted from an abnormal heart sound (a, a0001.wav) and a normal heart sound (b, a0007.wav) using STFT

When generating the images, we represent time on the horizontal axis and frequency on the vertical axis. The Jet colour map, which varies from blue (low value) through green (mid value) to red (high value), is used to map the power spectral density. After removing the axes and margin markings, the images are saved with the MATLAB function imwrite. The STFT spectrograms of a normal heart sound (a0007.wav; see Sect. 3.1 for details on the database) and an abnormal heart sound (a0001.wav) are given in Fig. 1. Finally, the extracted images are scaled to 224 × 224 × 3 for compatibility with the VGG16.

WT. In this study, we utilise the continuous wavelet transform function of MATLAB, cwt, to obtain the wavelet coefficient values of heart sounds. The wavelet basis function is cgau3. The images are generated by the MATLAB drawing function imagesc; the change of colour in the images represents the wavelet coefficient values. The remaining settings are consistent with the STFT case above. Figure 2 shows the scalogram images of a normal heart sound (a0007.wav) and an abnormal heart sound (a0001.wav). It can even be observed by the human eye that there are some clear distinctions between the two images.

HHT. The MATLAB empirical mode decomposition toolkit (http://perso.ens-lyon.fr/patrick.flandrin/emd.html) is used to extract the HHT spectrograms. The intrinsic mode functions of the heart sounds are obtained by the function emd in the toolkit, and the images are generated by the functions hhspectrum and toimage. The remaining settings are consistent with the STFT case above. The HHT spectrograms of a normal heart sound (a0007.wav) and an abnormal heart sound (a0001.wav) are given in Fig. 3.

LogMT. We utilise the Python audio processing library librosa (https://librosa.org/doc/latest/index.html) to generate the LogMT spectrograms. The function melspectrogram in the library is used to compute a Mel-scaled spectrogram, and the function power_to_db is then applied to convert the power spectrogram to dB. Based on repeated trials, we choose 32 Mel bands. The remaining settings are consistent with the STFT case above.


Fig. 2 The scalogram images extracted from an abnormal heart sound (a, a0001.wav) and a normal heart sound (b, a0007.wav) using WT

Fig. 3 The spectrograms extracted from an abnormal heart sound (a, a0001.wav) and a normal heart sound (b, a0007.wav) using HHT

The LogMT spectrograms of a normal heart sound (a0007.wav) and an abnormal heart sound (a0001.wav) are shown in Fig. 4.
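As an illustration of the LogMT pipeline in Python (the only one of the four that the paper itself implements with librosa), the sketch below converts a single 1.5 s cardiac cycle into a 32-band log-Mel image; the figure size, colour map and the use of matplotlib for saving are assumptions.

```python
import librosa
import matplotlib.pyplot as plt

def save_logmel_image(wav_path, out_png, sr=2000, seconds=1.5, n_mels=32):
    """Sketch: single 1.5 s cardiac cycle -> 32-band log-Mel spectrogram,
    plotted without axes and saved as an image for the CNN."""
    y, _ = librosa.load(wav_path, sr=sr, duration=seconds)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    plt.figure(figsize=(2.24, 2.24), dpi=100)   # roughly 224 x 224 pixels
    plt.axis("off")
    plt.imshow(log_mel, aspect="auto", origin="lower", cmap="jet")
    plt.savefig(out_png, bbox_inches="tight", pad_inches=0)
    plt.close()
```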

2.2 Pre-trained Model

In this section, we describe the architecture of the pre-trained model after fine-tuning to combine it with the extracted images. As the work reported in [20] shows that it is appropriate for heart sound classification, we use a VGG16 pre-trained on ImageNet to process the images. VGG16 consists of thirteen convolution layers, five max pooling layers, three fully connected layers, and a softmax layer for the classification of 1,000 image categories. The input image size is 224 × 224 × 3. More details and the training procedure for VGG16 are described in [24]. As shown in Fig. 5, we replace all layers after the last max pooling layer with a flatten layer, two fully connected layers with 128 neurons each, and a softmax layer with 2 labels.


Fig. 4 The spectrograms extracted from an abnormal heart sound (a, a0001.wav) and a normal heart sound (b, a0007.wav) using LogMT

Table 1 An overview of the dataset used in the paper

Dataset         Folder of the database               Recordings  Normal  Abnormal
Training set    training-a: a0001.wav—a0368.wav      368         108     260
                training-b: b0001.wav—b0441.wav      441         348     93
                training-c: c0001.wav—c0028.wav      28          6       22
                training-d: d0001.wav—d0050.wav      50          24      26
                training-e: e00001.wav—e01927.wav    1927        1767    160
                Total                                2814        2253    561
Validation set  training-a: a0369.wav—a0409.wav      41          9       32
                training-b: b0442.wav—b0490.wav      49          38      11
                training-c: c0029.wav—c0031.wav      3           1       2
                training-d: d0051.wav—d0055.wav      5           3       2
                training-e: e01928.wav—e02141.wav    214         191     23
                Total                                312         242     70
Testing set     validation                           301         150     151
                Total                                301         150     151

While re-training the pre-trained model on the extracted images, the parameters of the convolution layers are frozen, and only the parameters of the remaining layers are trained. The pre-trained VGG16 model is obtained from Keras (https://keras.io/).

3 Experiments

3.1 Database

We use the database of the PhysioNet/CinC Challenge 2016 (https://www.physionet.org/content/challenge-2016/1.0.0/) to evaluate the proposed approaches [12]. The database contains only normal and abnormal heart sound recordings, sampled at 2 kHz.


Fig. 5 Diagram of a modified pre-trained VGG16 model that was fine-tuned for heart sound classification

As the testing set of the database is not publicly available, we selected 312 samples from the training set of the database as the verification set, and took the validation set of the database as the testing set. A detailed overview of the dataset is given in Table 1: the training set and verification set are drawn from the five training folders of the database, and the testing set from the validation folder.

3.2 Experimental Setup

To train our model, one-hot encoding is used for the sample labels. Categorical cross-entropy is applied as the loss function, and the Adadelta optimisation algorithm is used as the optimiser with an initial learning rate of 0.05 and a weight decay factor of 0.01. The network evaluation metric is accuracy. A mini-batch size of 32 and 50 epochs are used in all experiments.
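A sketch of this fine-tuning setup in Keras is given below: the VGG16 convolutional base is frozen and the new head and optimiser follow Sects. 2.2 and 3.2. The dense-layer activations are assumptions, and the weight decay factor of 0.01 is omitted because its exact form depends on the Keras version.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_finetuned_vgg16():
    """Frozen ImageNet VGG16 base + flatten + two 128-unit dense layers
    + 2-way softmax, trained with Adadelta at learning rate 0.05."""
    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False                      # freeze the convolution layers
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=0.05),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```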

3.3 Evaluation Criteria

According to the official scoring mechanism of the PhysioNet/CinC Challenge 2016 [12], our approach is evaluated by both Sensitivity (Se) and Specificity (Sp). For the classification of normal and abnormal heart sounds, Se and Sp are defined as:

$$Se = \frac{TP}{TP + FN} \quad (1)$$

Fig. 6 Training and validation loss curves of the fine-tuned VGG16 model based on the four time-frequency analysis methods: (a) STFT, (b) HHT, (c) LogMT, (d) WT

$$Sp = \frac{TN}{TN + FP} \quad (2)$$

where TP is the number of true positive normal heart sound samples, FN is the number of false negative normal heart sound samples, TN is the number of true negative abnormal heart sound samples, and FP is the number of false positive abnormal heart sound samples. Finally, the Mean Accuracy (MAcc) is calculated from Se and Sp to represent the overall score of the classification results, which is defined as:

$$MAcc = \frac{Se + Sp}{2} \quad (3)$$
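The three scores can be computed directly from the confusion matrix counts, as in this small helper:

```python
def evaluation_scores(tp, fn, tn, fp):
    """Compute Se, Sp and MAcc as in Eqs. (1)-(3)."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    macc = (se + sp) / 2
    return se, sp, macc
```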

Fig. 7 Normalised confusion matrices on the testing set (in %): (a) STFT, (b) HHT, (c) LogMT, (d) WT

3.4 Results

As shown in Fig. 6, the training and validation loss decrease as the number of epochs increases. When the number of epochs reaches 50, the training and validation loss tend to be stable, indicating that the models have converged. The normalised confusion matrices on the testing set are illustrated in Fig. 7. We can observe that, compared with the other three time-frequency analysis methods, WT has not only a high precision (71.8% = 100% × 82.7/(82.7 + 32.5)) for normal heart sounds, but also the highest precision (79.6% = 100% × 67.5/(67.5 + 17.3)) for abnormal heart sounds.


Fig. 8 Performance comparison of the four transformation methods on (a) the verification set and (b) the testing set

Figure 8 shows the experimental results of the four time-frequency representations on the verification set and the testing set, respectively. The MAccs are above 60% on both the verification set and the testing set. This indicates that the proposed heart sound classification method, which combines time-frequency representations extracted from single cardiac cycle signals with transfer learning, is feasible and effective. It is worth noting that, among the four time-frequency analysis methods, WT achieves the best performance, as reflected by the MAcc on both the verification set and the testing set. Especially on the testing set, WT has the highest MAcc of 75.1%, which, with a 12.6% increase, is significantly higher than STFT (p < 0.05 by a one-tailed z-test). In addition, we note that the Se of the four methods is greater than the Sp. We infer that one possible reason is that the number of normal heart sound samples in the training set is much larger than that of abnormal heart sound samples.

4 Conclusions

We investigated the application of four time-frequency representations and transfer learning technology to heart sound classification. The key point was to compare the performance differences among them. The main conclusions are as follows: without any denoising or pre-processing, methods that combine the time-frequency images extracted from single cardiac cycle heart sounds with transfer learning are feasible and effective for normal/abnormal heart sound classification; among the four time-frequency analysis methods, the performance of WT is significantly better than that of the other three, and STFT comes last.


In future work, one should consider more pre-trained models to compare the performance differences of different time-frequency analysis methods. Further, one also needs to introduce an explainable model to further explain the performance differences between these methods.

References 1. Ali N, El-Dahshan ES, Yahia A (2017) Denoising of heart sound signals using discrete wavelet transform. Circ Syst Sig Process 36(11):4482–4497 2. Arora V, Leekha R, Singh R, Chana I (2019) Heart sound classification using machine learning and phonocardiogram. Mod Phys Lett B 33(26):1–24 3. Deng M, Meng T, Cao J, Wang S, Zhang J, Fan H (2020) Heart sound classification based on improved MFCC features and convolutional recurrent neural networks. Neural Networks 130:22–32 4. Dong F et al (2020) Machine listening for heart status monitoring: Introducing and benchmarking HSS-the heart sounds Shenzhen corpus. IEEE J Biomed Health Inform 24(7):2082–2092 5. Gardezi S et al (2018) Cardiac auscultation poorly predicts the presence of valvular heart disease in asymptomatic primary care patients. Heart 104(22):1832–1835 6. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings CVPR, pp 770–778. IEEE, Las Vegas, Nevada 7. Karnath B, Thornton W (2002) Auscultation of the heart. Hosp Phys 38(9):39–45 8. Koike T, Qian K, Kong Q, Plumbley MD, Schuller BW, Yamamoto Y (2020) Audio for audio is better? An investigation on transfer learning models for heart sound classification. In: Proceedings EMBC, pp 74–77. IEEE, Montreal, ´ Canada 9. Li F et al (2019) Feature extraction and classification of heart sound using 1D convolutional neural networks. EURASIP J Adv Sig Process 2019(1):1–11 10. Li J, Li K, Du Q, Ding X, Chen X, Wang D (2019) Heart sound signal classification algorithm: a combination of wavelet scattering transform and twin support vector machine. IEEE Access 7:179339–179348 11. Li S, Li F, Tang S, Xiong W (2020) A review of computer-aided heart sound detection techniques. BioMed Res Int 2020:1–10 12. Liu C et al (2016) An open access database for the evaluation of heart sound algorithms. Physiol Meas 37(12):2181–2213 13. Meng H, Yan T, Yuan F, Wei H (2019) Speech emotion recognition from 3D Log-Mel spectrograms with deep learning network. IEEE Access 7:125868–125881 14. Noman F, Ting CM, Salleh SH, Ombao H (2019) Short-segment heart sound classification using an ensemble of deep convolutional neural networks. In: Proceedings ICASSP, pp 1318–1322. IEEE, Brighton, UK 15. Qian K et al (2021) Can machine learning assist locating the excitation of snore sound? A review. IEEE J Biomed Health Inform 25(4):1233–1246 16. Qian K et al (2020) Computer audition for healthcare: opportunities and challenges. Front Digit Health 2:1–4 17. Qian K, Ren Z, Dong F, Lai W, Schuller B, Yamamoto Y (2019) Deep wavelets for heart sound classification. In: Proceedings ISPACS, pp 1–2. IEEE, Taiwan, China 18. Qian K et al (2021) Computer audition for fighting the SARS-CoV-2 corona crisis – introducing the multi-task speech corpus for COVID-19. IEEE Internet Things J 1–12 (in press) 19. Ren H, Jin H, Chen C, Ghayvat H, Chen W (2018) A novel cardiac auscultation monitoring system based on wireless sensing for healthcare. IEEE J Transl Eng Health Med 6:1–12 20. Ren Z, Cummins N, Pandit V, Han J, Qian K, Schuller B (2018) Learning image-based representations for heart sound classification. In: Proceedings DHA, pp 143–147. ACM, New York, USA


21. Renna F, Oliveira J, Coimbra MT (2019) Deep convolutional neural networks for heart sound segmentation. IEEE J Biomed Health Inform 23(6):2435–2445 22. Ryu H, Park J, Shin H (2016) Classification of heart sound recordings using convolution neural network. In: Proceedings CinC, pp 1153–1156. IEEE, Vancouver, Canada 23. Safdar S, Zafar S, Zafar N, Khan F (2018) Machine learning based decision support systems (DSS) for heart disease diagnosis: a review. Artif Intell Rev 50(4):597–623 24. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 25. Sun J, Kang L, Wang W (2017) Heart sound signals based on CNN classification research. In: Proceedings ICBBS, pp 44–48. ACM, Singapore 26. Tschannen M, Kramer T, Marti G, Heinzmann M, Wiatowski T (2016) Heart sound classification using deep structured features. In: Proceedings CinC, pp 565–568. IEEE, Vancouver, Canada 27. Tseng YL, Ko PY, Jaw FS (2012) Detection of the third and fourth heart sounds using HilbertHuang transform. Biomed Eng Online 11(1):1–13 28. U˘guz H (2012) Adaptive neuro-fuzzy inference system for diagnosis of the heart valve diseases using wavelet transform with entropy. Neural Comput Appl 21(7):1617–1628 29. Upretee P, Yüksel ME (2021) 13 - accurate classification of heart sounds for disease diagnosis by using spectral analysis and deep learning methods. Data Anal Biomed Eng Healthcare 215–232 (in press) 30. Wang Y, Li W, Zhou J, Li X, Pu Y (2014) Identification of the normal and abnormal heart sounds using wavelet-time entropy features based on OMS-WPD. Future Gener Comput Syst 37:488–495 31. Son GY, Kwon S (2018) Classification of heart sound signal using multiple features. Appl Sci 8(12):1–14 32. Yuan Y, Xun G, Jia K, Zhang A (2017) A multi-view deep learning method for epileptic seizure detection using short-time Fourier transform. In: Proceedings ACM BCB, pp 213–222. ACM, Boston, Massachusetts 33. Zhang W, Han J, Deng S (2017) Heart sound classification based on scaled spectrogram and partial least squares regression. Biomed Sig Process Control 32:20–28 34. Zhang W, Hana J, Deng S (2020) Analysis of heart sound anomalies using ensemble learning. Biomed Sig Process Control 62:1–14

Improving Pathological Voice Detection: A Weakly Supervised Learning Method Weixing Wei, Liang Wen, Jiale Qian, Yufei Shan, Jun Wang, and Wei Li

Abstract Deep learning methods are data-driven, but for pathological voice detection it is difficult to obtain high-quality labeled data. In this work, a weakly supervised learning method is presented to improve the quality of existing datasets by learning sample weights and fine-grained labels. First, a convolutional neural network (CNN) is devised as the basic architecture to detect the pathological voice. Then, a proposed self-training algorithm is run iteratively to automatically learn the sample weights and fine-grained labels. These learned sample weights and fine-grained labels are used to train the CNN model from scratch. The experimental results on the Saarbruecken Voice Database show that the diagnosis accuracy improves from 75.7 to 82.5%, a 6.8% improvement in accuracy over the CNN models trained with the original dataset. This work demonstrates that the weakly supervised learning method can significantly improve the classification performance in distinguishing pathological voice from healthy voice.

Keywords Weakly supervised learning · Pathological voice · Acoustic analysis · Deep learning · Convolutional neural network

W. Wei · J. Qian · W. Li
School of Computer Science and Technology, Fudan University, Shanghai 200438, China
W. Li (B)
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, China
e-mail: [email protected]
L. Wen · Y. Shan · J. Wang
CETHIK Group Co., Ltd., Hangzhou 311100, China

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
X. Shao et al. (eds.), Proceedings of the 9th Conference on Sound and Music Technology, Lecture Notes in Electrical Engineering 923, https://doi.org/10.1007/978-981-19-4703-2_9

1 Introduction

Voice plays an important role in people's social life. The health condition of the human voice can significantly affect the communication activities of an individual. Voice disorders can be caused by pathologies such as structural lesions, neoplasms, and neurogenic disorders [1]. In clinical diagnosis, a variety of methods are generally used

for a comprehensive evaluation, including patient interviews and laryngeal imaging [2]. But these methods mostly rely on the doctor's experience and are subjective. Pathological voice detection methods based on computerized analysis of acoustic signals are non-invasive and objective. Such methods can even identify vocal pathologies that are inaudible to the human ear [3]. The process includes collecting voice recordings, extracting features, and using the features for classification. Many related research works have been carried out in recent decades. Usually, researchers study pathological voice detection from three aspects: the first is to design new features based on domain-related knowledge; the second is to use new classifiers; the third is to improve the size and quality of datasets. Exploring new features requires domain-related knowledge. There are three kinds of common features for pathological voice classification: (1) features originally used for automatic speech recognition, such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC) and energy [4]; (2) features that are used to assess the quality of the speech, such as pitch, jitter, shimmer and harmonic-to-noise ratio (HNR) [5]; (3) measures based on nonlinear dynamics, such as correlation dimension, Rényi entropies, and Shannon entropy [6]. Some studies have tried to explore new features. Muhammad et al. [7] propose a method of pathological voice detection based on MPEG-7 audio low-level features and SVM, reaching 99.9% accuracy on the MEEI dataset. Panek et al. [8] established a vector composed of 31 parameters to analyze speech signals, and analyzed the applicability of the parameters used to evaluate voice pathology.

Improving Pathological Voice Detection ...

107

using the MEEI or SVD datasets, They find that characteristics of voice in AfricanAmerican adult males are different from those of white adult males. Harar et al. [15] attempted to build robust classifiers for pathological voice detection by combining different datasets (MEEI, PVD, AVPD, and PDA), but the experiment results are unsatisfactory. In our point of view, there are some problems with pathological voice datasets, such as a small amount of data and insufficiently detailed annotation. Generally, a voice sample is marked as healthy or pathological only, but the severity of the disease is not clear (it is difficult to accurately mark it), which belongs to coarse-grained label as described in [16]. It is unreasonable to treat the samples with mild voice diseases and severe voice diseases equally, which will cause some interference to the classifier and affect its performance. To solve this problem, this work intends to learn finegrained labels through a weakly supervised learning method. First, a simple CNN is proposed as the binary classification model to detect the pathological voice. Second, a new self-training algorithm is used to iteratively run and automatically learn the sample weights and fine-grained labels. Finally, these learned sample weights and fine-grained labels are used to train new CNN models. This paper is organized as follows: Sect. 2 introduces our basic pathological voice detection CNN architecture and the proposed weakly supervised learning method to learn fine-grained labels; Sect. 3 introduces our experiment details and results; Sect. 4 concludes this work.

2 Methodology In this work, a simple CNN network is used as a basic classifier to classify healthy voice and pathological voice, and a weakly supervised learning method is applied to retrain the model and improve the classification results. Since the dataset is small, data augmentation is also a necessary means to promote the model performance and robustness.

2.1 CNN Architecture The CNN architecture is shown in Fig. 1. The original voice is first resampled to 25kHz to boost the training process. Then Short-time Fourier Transform(STFT) is applied on the raw audio with a Hamming window of size 2048 and a hop size of 512. Finally, the spectrogram is converted to mel scale with bins size of 128 and fed into the model. The model applies three convolutional layers to convert the input to hidden feature representation. Each layer is followed by a max-pooling with a filter size of 2 × 2. Rectified Linear Unit(RELU) is the activation function. The dropout rate is set to 0.1. The first convolutional layer was convolved with 64 filters with the size of 3 × 3

108

W. Wei et al. Dense layers

Specgram (128x80) 64 filters

128 filters

RELU Max Pooling Dropout

RELU Max Pooling Dropout

256 filters

RELU Max Pooling Dropout

full connected

Healthy Pathological

Fig. 1 CNN architecture

and stride of 1. The second and third convolutional layers are almost the same as the first one except that 128 filters and 256 filters were used to convolve respectively. Then the feature map was fed into three dense layers with the size of 64, 32, and 2. Finally, a softmax function is used as the activation function to output classification probabilities.

2.2 Weakly Supervised Learning In the medical field, due to the high cost of data annotation and privacy problems, many tasks are difficult to obtain strong supervision labels, and it is also difficult to obtain enough annotation data. Therefore, when the number of data samples is insufficient, and the annotation is incomplete, it is desirable to use weakly supervised machine learning technology. Typically, there are three types of weak supervision: incomplete supervision; inexact supervision; and inaccurate supervision [16]. Most pathological voice datasets are exactly the second type, they are coarse-grained labeled. Voice samples are only labeled as healthy or pathological. Therefore, we propose a weakly supervised learning method to learn fine-grained labels and sample weights.

2.3 Learning Fine-Grained Labels The learning process is shown in Fig. 2. In the original dataset, all the healthy voice samples are labeled as 0 and all the pathological voice samples are labeled as 1. First, split the dataset into n folds, select 1 fold as test set and others as training set, the CNN model is trained with the training set. Then the predicted values (between 0 and 1) on the test set are used to update the labels of test set. We use exponential moving average to update labels for a smoother learning process. The learning process is as Eq. (1). In our work, the smooth factor β1 is set to 0.75. A cross entropy loss is used as the loss function shown in Eq. (2)

Improving Pathological Voice Detection ...

109

Training Set Coarse-grained Labels n-fold cross-validation

train CNN Model

Test Set predict Coarse-grained Labels

Predicted Data

update

Fig. 2 Learning process of fine-grained labels

yt+1 = β1 ∗ yt + (1 − β1 ) ∗ f t (x)

(1)

where y denotes the value of the sample label, f (x) denotes the output of the trained model, x denotes the input spectrogram, t denotes the t-th epoch (n-fold cross validation), β1 denotes the smooth factor. Losst+1 = −

n 1 i i i [y ln yˆt+1 + (1 − yti ) ln(1 − yˆt+1 )] n i t

(2)

where n denotes the number of samples, yti denotes the label of the i-th sample i denotes the predicted value of the i-th sample in epoch t + 1. learned in epoch t, yˆt+1 After the label is updated on the test set, use another fold as test set and the rest as training set. Repeat the previous process n times, then the samples of the whole dataset are updated, and an epoch of learning process is finished. During a n-fold cross validation, the average accuracy of n-fold cross validation is calculated. The cross validations are repeated and the labels are learned iteratively using the previously updated labels until the average accuracy reaches the maximum.

2.4 Learning Sample Weights We noticed that the reliability and quality of samples with coarse-grained labels may different, and setting different sample weights to different samples may be reasonable. The process of learning sample weights is similar to that of learning fine-grained labels, and the process for updating sample weights is shown in Eq. (3). In this work, the Accbase is set to 0.5, the smooth factor β2 is set to 0.75. Sample weights are used to calculate the cross entropy loss of every sample respectively as described in Eq. (4).

110

W. Wei et al.

wt+1 = β2 ∗ wt + (1 − β2 ) ∗

Accsample (t) Accbase

(3)

where w denotes the sample weight, Accsample (t) denotes the classification accuracy of a sample in the previous t epochs learning, Accbase denotes a base accuracy, β2 denotes the smooth factor. Losst+1 = −

n 1 i i i [y ln yˆt+1 + (1 − y i ) ln(1 − yˆt+1 )] ∗ wti n i

(4)

i where y i denotes the label of the i-th sample, yˆt+1 denotes the predicted value of the i i-th sample in epoch t+1, wt denotes the sample weight of the i-th sample in epoch t.

2.5 Data Augmentation Data enhancement is a necessary means to enhance the robustness of the deep learning models. Common audio data Augmentation methods include time clipping, timeshifting, time-stretching, pitch shifting, noise adding, and so on [17]. We use 3 data augmentation methods: (1) time clipping, clip voice recordings into four 0.8second long segments equidistantly; (2) pitch shifting, offset the pitch by 4 values (in semitones): −2, −1, 1, 2; (3) Adding random Gaussian noise 4 times. As a result, the dataset is augmented 32 times.

3 Experiments 3.1 Datasets The Saarbruecken voice database[18] is used in the experiments. All data in the dataset was collected from more than two thousand people containing 71 different pathologies. Three different vowels including /a/, /i/, and /u/ are recorded for each sample. Each vowel recording sustained from 1 to 4 s. In our work, we used the recordings of the vowel /a/ produced at normal pitch. In previous studies for pathological voice detection, results vary greatly. The reason mainly comes from the differences in selected samples that were used for the experiment. For better comparison, we refer to Wu’s method [12], select 6 pathologies, which are all organic dysphonia and are caused by structural changes in the vocal cord. They are classified into the pathological group. We also select 687 healthy samples from the dataset. Then we get a set of samples, which is divided into two categories, healthy and pathological. The detailed information is listed in Table 1. We

Improving Pathological Voice Detection ...

111

Table 1 Details of the dataset Total 687

Healthy Pathological

Laryngitis Leukoplakia Reinke’s edema Recurrent laryngeal nerve paralysis Vocal fold carcinoma Vocal fold polyps

140 41 68 213

482

22 45

split the dataset into training, validation, and test set, with 60, 20, and 20% samples respectively.

3.2 Experimental Setup The models are developed in Python using Tensorflow. A GPU of NVidia Tesla P100 is used for higher training speed. An Adam optimizer [19] is used as the training optimizer with a learning rate of 0.0001. Categorical cross entropy is used as training loss function. L2 regularization is used to improve the robustness of the model. The L2 regularization weight is set to 0.005. The dropout rate is set to 0.1. The maximum training epochs are set to 100.

3.3 Evaluate Metrics The performance was evaluated through 5-fold cross-validation. The classification performance is evaluated in terms of the mean accuracy, precision, recall, and F1 on the test set across all folds. The average value of 10 times 5-fold cross-validation is taken as the final result for a more credible comparison.

3.4 Experimental Results 3.4.1

Performance of Fine-Grained Labels and Sample Weights

The learning process of the labels can be seen in Fig. 3. Each subgraph represents the sample distribution histogram of every 5-fold cross-validation. The abscissa indicates the abnormal degree of a voice(from healthy to pathological), the ordinate represents

112

W. Wei et al.

Fig. 3 The distribution of fine-grained labels in different epochs

Fig. 4 The distribution of sample weights in different epochs

the number of samples. At the beginning (epoch 0), voice samples are labeled as either 0 (healthy) or 1 (pathological). This is obviously not in line with the actual situation. After learning, from epoch 1 to epoch 9, the distribution of labels gradually dispersed to the region between 0 and 1, and gradually tended to normal distribution. According to our knowledge, people’s voice condition should obey the normal distribution, but this law has been destroyed when sampling because the sampling is not randomly selected, but selected from the individuals who are in the hospital. Obviously, more pathological samples will be collected. When learning sample weights, they are limited to the range from 0 to 2. As shown in Fig. 4, at the beginning, all samples’ weights are set to 1. From epoch 1 to 9, sample weights are scattered to the area between 0 and 2 according to the sample accuracy. Because most of the sample accuracy is close to 100%, most of the sample weights learned are also close to the maximum value of 2. From Fig. 5 we can see that, for the learning process of fine-grained labels, the average accuracy of 5-fold cross-validation quickly rise to the maximum, and then slowly decline. Finally, the accuracy reaches a maximum of 0.819 at epoch 5. That is, the learned fine-grained labels obtained at the 5-th epoch are the optimal. The learning process of sample weights is not as good as the former, getting an optimal result at epoch 8. However, if both of them are learned simultaneously, the best


Fig. 5 The accuracy of different epochs when learning labels

Table 2 Results of different labels

Labels          Data augmentation  Accuracy  Healthy Precision  Healthy Recall  Healthy F1  Pathological Precision  Pathological Recall  Pathological F1
original        No                 0.757     0.770              0.881           0.818       0.748                   0.551                0.622
original        Yes                0.767     0.793              0.854           0.821       0.721                   0.623                0.665
sample weights  Yes                0.791     0.805              0.880           0.840       0.765                   0.641                0.694
fine-grained    Yes                0.814     0.827              0.889           0.856       0.788                   0.687                0.732
both used       Yes                0.825     0.823              0.917           0.867       0.830                   0.669                0.738

result is obtained at epoch 6, with an accuracy of 0.832. These fine-grained labels and sample weights are used for further training. The baseline model is a CNN trained and tested directly with the original labels and without data augmentation. The proposed model is the same CNN but trained with the previously learned fine-grained labels and sample weights. The results of the baseline and proposed methods are shown in Table 2. As can be seen, data augmentation gives a slight improvement in performance, while sample weights and fine-grained labels bring significant improvements. The best performance comes from the method using both fine-grained labels and sample weights.


Table 3 Performance of different models

Model          Labels               Dataset size  Accuracy  Precision  Recall  F1
CNN+LSTM [20]  -                    687 + 1353    0.714     0.720      0.720   -
CNN [12]       -                    482 + 482     0.770     0.710      0.720   0.714
Proposed CNN   sample weights       687 + 482     0.791     0.792      0.791   0.801
Proposed CNN   fine-grained labels  687 + 482     0.814     0.814      0.811   0.814
Proposed CNN   both used            687 + 482     0.825     0.827      0.820   0.824

3.4.2 Comparison with Other Works

The results are compared with those published in [12] and [20], as shown in Table 3. These two articles also use the SVD dataset and take the normal pronunciation of the /a/ vowel as input, and the sample sizes are almost the same, so the comparison with the performance of the methods in these papers is credible. The CNN trained with both learned fine-grained labels and sample weights obtains the highest accuracy of 82.5%, exceeding the CNN in [12] and the CNN+LSTM in [20]. The results for precision, recall, and F1 score are similar to those for accuracy. Therefore, learning sample weights and fine-grained labels to improve sample quality works effectively for pathological voice classification.

4 Conclusions

In this paper, we proposed two weakly supervised learning methods to improve the quality of existing datasets, learning sample weights and learning fine-grained labels, and thereby boost the performance of pathological voice detection. The experimental results show that these two methods improve the classification accuracy from 75.7% to 79.1% and 81.4%, respectively. If both methods are used, the accuracy can be further improved to 82.5%. In addition, they are not only effective for the CNN model based on deep learning but may also be effective for SVM, k-NN, GMM, and other machine-learning classifiers, which can be explored in the future. Since the proposed weakly supervised learning methods are only verified with a simple CNN, it is conceivable that they will perform even better in future investigations.

Acknowledgements This work was supported by National Key R&D Program of China (2019YFC1711800), NSFC (62171138).


References

1. Stemple JC, Roy N, Klaben BK (2018) Clinical voice pathology: theory and management. Plural Publishing, San Diego
2. Dejonckere PH, Bradley P, Clemente P et al (2001) A basic protocol for functional assessment of voice pathology, especially for investigating the efficacy of (phonosurgical) treatments and evaluating new assessment techniques. Eur Arch Oto-rhino-laryngology 258(2):77–82
3. Mekyska J, Janousova E, Gomez-Vilda P et al (2015) Robust and complex approach of pathological speech signal analysis. Neurocomputing 167:94–111
4. Rabiner L (1993) Fundamentals of speech recognition
5. Al-Nasheri A, Muhammad G, Alsulaiman M et al (2017) An investigation of multidimensional voice program parameters in three different databases for voice pathology detection and classification. J Voice 31(1):113-e9
6. Henríquez P, Alonso JB, Ferrer MA et al (2009) Characterization of healthy and pathological voice through measures based on nonlinear dynamics. IEEE Trans Audio Speech Lang Process 17(6):1186–1195
7. Muhammad G, Melhem M (2014) Pathological voice detection and binary classification using MPEG-7 audio features. Biomed Sig Process Control 11:1–9
8. Panek D, Skalski A, Gajda J (2014) Quantification of linear and non-linear acoustic analysis applied to voice pathology detection. In: Piętka E, Kawa J, Wieclawek W (eds) Information Technologies in Biomedicine, Volume 4. AISC, vol 284, pp 355–364. Springer, Cham. https://doi.org/10.1007/978-3-319-06596-0_33
9. Hegde S, Shetty S, Rai S et al (2019) A survey on machine learning approaches for automatic detection of voice disorders. J Voice 33(6):947-e11
10. Cordeiro H, Fonseca J, Guimarães I et al (2017) Hierarchical classification and system combination for automatically identifying physiological and neuromuscular laryngeal pathologies. J Voice 31(3):384-e9
11. Hemmerling D (2017) Voice pathology distinction using auto associative neural networks. In: 2017 25th European signal processing conference (EUSIPCO), pp 1844–1847. IEEE
12. Wu H, Soraghan J, Lowit A et al (2018) A deep learning method for pathological voice detection using convolutional deep belief networks. In: Proceedings Interspeech 2018, pp 446–450. https://doi.org/10.21437/Interspeech.2018-1351
13. Chen L, Chen J (2020) Deep neural network for automatic classification of pathological voice signals. J Voice S0892–1997
14. Mesallam TA, Farahat M, Malki KH et al (2017) Development of the Arabic voice pathology database and its evaluation by using speech features and machine learning algorithms. J Healthcare Eng 2017:1–13
15. Harar P, Galaz Z, Alonso-Hernandez JB et al (2020) Towards robust voice pathology detection. Neural Comput Appl 32(20):15747–15757
16. Zhou Z (2018) A brief introduction to weakly supervised learning. Nat Sci Rev 1:1
17. Jiang Y, Zhang X, Deng J et al (2019) Data augmentation based convolutional neural network for auscultation. J Fudan Univ (Natural Sci) 328–333
18. Woldert-Jokisz B (2007) Saarbruecken voice database. http://stimmdb.coli.uni-saarland.de/
19. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. ICLR (Poster)
20. Harar P, Alonso-Hernandezy JB, Mekyska J et al (2017) Voice pathology detection using deep learning: a preliminary study. In: International conference and workshop on bioinspired intelligence (IWOBI), pp 1–4. https://doi.org/10.1109/IWOBI.2017.7985525

Articulatory Analysis and Classification of Pathological Speech Shufei Duan, Camille Dingam, Xueying Zhang, and Haifeng Li

Abstract Unlike clinical practices, data analysis and speech technology (signal processing and machine learning) have shown many advantages for diagnosing pathological speech. For this reason, this work presents different methods of data analysis to compare dysarthria severities with healthy speech, followed by classification. The comparisons are based on the articulatory part of TORGO, a publicly available dysarthric speech dataset. This study mainly compares the three-dimensional position space, the skewness of the tongue pronunciation range, and the articulatory pronunciation time between dysarthric speech of different severities and control speech. In machine learning, the classification of pathological databases is important for diagnosing the disease. However, traditional algorithms suffer from high misclassification on data with more than two classes, and the problem is worse when the sample numbers between classes are imbalanced. To address this problem, a method is proposed in this paper that combines principal component analysis (PCA) and cost-sensitive learning (CS) with the J48 decision tree to reduce misclassification and handle imbalanced data. The proposed method outperformed the original J48 in both overall accuracy and recall. It is also compared against two other imbalanced-classification methods, where it showed the best accuracy. Keywords Dysarthria · Pronunciation position · Time analysis · PCA · Cost-sensitive

1 Introduction

Dysarthria describes a group of speech disorders characterized by disturbances of both articulation and movement due to damage of the central nervous system [1]. Many characteristics of dysarthria have been described in [2]. The evolution of signal processing engineering and machine learning has inspired many types of research for diagnosing and treating dysarthria with good effect, unlike clinical practices [3, 4]. S. Duan · C. Dingam · X. Zhang (B) · H. Li College of Information and Computer, Taiyuan University of Technology, Taiyuan 030000, China e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 X. Shao et al. (eds.), Proceedings of the 9th Conference on Sound and Music Technology, Lecture Notes in Electrical Engineering 923, https://doi.org/10.1007/978-981-19-4703-2_10


The existing works on the TORGO and other dysarthric databases have presented many benefits for diagnosing dysarthria. The study of Kent characterized the dysarthria in the TORGO dataset as motor distortion of the speech articulators, resulting mostly in unintelligible speech. Rudzicz found that the word error rate due to unintelligibility on a modern device reached 97.5% for dysarthric subjects against 15.5% for other populations [5]. A previous study of the spatial articulatory characteristics of dysarthria revealed that the articulatory characteristics of severe patients cannot be extrapolated from those of mild and moderate patients [6]. In the articulatory kinematics studied by Kuruvilla, the dysarthric patient's tongue movement appears limited while the jaw movement shows the opposite tendency [6, 7]. Yunusova found slower temporal values for Amyotrophic Lateral Sclerosis (ALS) [8]. Recently, Joshy and Rajan used different deep learning algorithms to classify the dysarthria speech severity of the TORGO dataset [9]. In 2020, Siddhartha Prakash applied a Convolutional Neural Network on the TORGO database and obtained an accuracy of 68% [10]. Abner Hernandez and his collaborators used rhythm metrics to detect dysarthria on the QoLT Korean and TORGO databases [11]. In 2020, Narendra and Paavo Alku adopted both raw speech signals and raw glottal flow waveforms on the TORGO and UA-Speech databases to detect the severity of dysarthria [12].

Meanwhile, pathological intelligibility assessment is more complex than the assessment of healthy speech. This problem is worse in machine learning when we have to deal with multi-class imbalanced data. In this work, we address this problem by combining principal component analysis (PCA) and cost-sensitive learning with the J48 decision tree. The results show a reduction of the misclassification by J48 and outperform other well-known imbalanced-classification algorithms.

Before addressing the imbalanced data, this article first visualizes, in three-dimensional space, the average positions of all spoken words for the three dysarthria classes (mild, moderate, and severe) and normal speakers, and compares the scatter plots between them. Secondly, the study compares the degree of skewness of the tongue back, tongue middle, and tongue tip pronunciation motion. The third study compares the articulatory pronunciation time between each subject of the three dysarthria classes and the normal speakers. These data analyses provide significant discrimination between dysarthria severities and healthy speakers in the TORGO dataset.

The tongue is one of the articulators whose motor behaviour differs between normal and dysarthric speakers [6, 13]. The movement of the tongue can be done in all directions [14], and the lips in front of the mouth play a major role in speech articulation [15]. The existence of 3D Electromagnetic Articulography (EMA) has brought many scientists to analyze the position and motion of the articulatory organs in dysarthria studies. The alternating magnetic field is an EMA technique to track the movement of miniature receiver coils affixed to the articulators [16]. EMA data values are very useful in speech-disorder treatment and have been described as more interpretable than values from acoustic and perceptual levels of analysis [17]. Briefly, this paper aims to evaluate different methods of data analysis to discriminate dysarthria severity and to propose a method to solve imbalanced data as described above.


Fig. 1 a 3D electromagnetic articulatory (EMA) AG500. b Sensors are set on each articulatory organ

2 Torgo Dataset

The TORGO database comprises aligned acoustics and measured 3D articulatory features from speakers who suffer from dysarthria, mainly people with either cerebral palsy or amyotrophic lateral sclerosis. The TORGO dataset originates from the University of Toronto and the Holland-Bloorview Kids Rehab hospital in Toronto. It contains features of speakers with cerebral palsy and amyotrophic lateral sclerosis, which are caused by disruptions in the neuromotor interface. The data collection took place in Toronto between 2008 and 2010 and includes eight dysarthric subjects (5 males and 3 females) and 7 normal speakers (4 males and 3 females) who were between 16 and 50 years old. All participants read English text from a 19-in. LCD screen, and the collected speech includes a wide range of articulatory contrasts [18]. For the articulatory data, the XYZ coordinates were calculated from the electromagnetic position data and each sensor position was displayed. The X-axis is the forward and backward direction, the Y-axis is the left and right direction, and the Z-axis is the up and down direction. Sensors were set on the Tongue Tip (TT), Tongue Middle (TM), Tongue Back (TB), Left Mouth (LM), Right Mouth (RM), Upper Lip (UL), Lower Lip (LL), and jaw, as shown in Fig. 1(b).

3 Data Filtering

In this section, we select the dysarthric speech data that share the same spoken words with the normal speakers. After analyzing the degree of abnormality (mild, moderate, and severe), 4 dysarthric speakers (2 females and 2 males) and 2 normal


Table 1 Dataset preparation

Category    Gender  Participants  Disease degree  Word spoken
Dysarthria  Female  F03           Moderate        98
Dysarthria  Female  F04           Mild            98
Dysarthria  Male    M02           Severe          71
Dysarthria  Male    M03           Mild            120
Normal      Female  FC03          No              98
Normal      Male    MC04          No              120

speakers (1 male and 1 female) were extracted. The three females each spoke the same 98 words and the three males the same 120 words, of which the 71 words of M02 were also spoken by the two other males. The filtering result with the degree of disease is shown in Table 1.

4 Position Space

In this section, we plot the average pronunciation positions of the TB, TM, and TT of the subjects in three-dimensional space. In Fig. 2, the topologies of the scatter plots for the Tongue Back, Tongue Middle, and Tongue Tip are similar, and the same holds in Fig. 3. The scatter plots for the females (Fig. 2) show that the positions of the dysarthria patients in the tongue back, tongue middle, and tongue tip are generally more backward, to the left, and lower compared with normal. So patients with dysarthria need to train their tongues forward, to the right, and upward when they speak. In Fig. 3 (males), the dysarthria data points are more forward and to the left than those of the normal speaker. Comparing the up and down direction, the figures show that the dysarthria points are sometimes higher and sometimes lower than those of the normal speaker. In the tongue middle and tongue tip, some dysarthria points are slightly higher than normal and are mixed with the normal data points. The dysarthria points in the male graphs are more scattered and less stable; this reveals that dysarthric patients cannot hold their tongue steady when they pronounce a word.

5 Range Skewness

This section extracts the tongue range in all directions, left to right (Y), forward and backward (X), and up and down (Z), for the tongue back, tongue middle, and tongue tip. The steps of the range extraction are the same in all directions and are given here for the left-to-right direction:


Fig. 2 The 3D dimension of females average position Tongue Body, Tongue Middle, and Tongue Tip

1st step: calculate the maximum and minimum positions of each pronunciation data sequence:

$X_{\max,j} = \max(X_j), \quad j = 1, 2, \ldots, n$    (1)

$X_{\min,j} = \min(X_j), \quad j = 1, 2, \ldots, n$    (2)

where the elements $(X_{\min,1}, X_{\min,2}, \ldots, X_{\min,n})^T$ of the vector $X_{\min}$ are the minimum displacements of TB, TM, and TT in the left-to-right direction, and the elements $(X_{\max,1}, X_{\max,2}, \ldots, X_{\max,n})^T$ of the vector $X_{\max}$ represent their maximum displacements.

2nd step: compute the range $I$ between the positions of the vectors $X_{\max}$ and $X_{\min}$:

$I_j = X_{\max,j} - X_{\min,j}, \quad j = 1, 2, \ldots, n$    (3)

Skewness is mathematically formulated from the second and third moments [19] around the mean; the expressions are given below:


Fig. 3 The 3D dimension of males average position Tongue Body, Tongue Middle, and Tongue Tip

$k_2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$    (4)

$k_3 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3$    (5)

where $\bar{x}$ is the mean and $n$ is the number of data points. A skewness of zero corresponds to a symmetrical distribution [20]. Besides this, there are two categories of skewness: positive (characterized by a long right tail of the distribution) and negative (characterized by a long left tail of the distribution). When the skewness values are between −0.5 and 0.5, the data are said to be fairly symmetrical; otherwise they are moderately or highly skewed [21]. Skewness is used in this work to judge the asymmetry of the tongue range curves of the male and female participants without drawing the curves, and this gives a better result in terms of diagnosing the levels of dysarthria.
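A minimal sketch of the range and skewness computation of Eqs. (1)-(5) is given below; the final normalisation into a single skewness coefficient (k3 / k2^1.5) is the standard definition and is an assumption here, since the text lists only the moments themselves.

```python
import numpy as np

def directional_ranges(trajectories):
    # Eqs. (1)-(3): per-word maximum, minimum, and range of displacement
    # along one direction (e.g., left-right) for TB, TM, or TT.
    return np.array([t.max() - t.min() for t in trajectories])

def skewness(ranges):
    # Eqs. (4)-(5): second and third central moments around the mean,
    # combined into the usual skewness coefficient g1 = k3 / k2**1.5.
    r = np.asarray(ranges, dtype=float)
    m = r.mean()
    k2 = np.mean((r - m) ** 2)
    k3 = np.mean((r - m) ** 3)
    return k3 / k2 ** 1.5
```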


In Table 2, MC04 generally has skewness values nearer to zero than M03 and M02; only in TM_Y skewness are the values of the two dysarthria patients (M03 and M02) slightly smaller than the normal speaker's, and in TT_X and TT_Z, M03 is slightly smaller than MC04 (with respective differences of 0.052 and 0.18). Comparing M03 and M02, the skewness values of M03 are generally nearer to zero than those of M02; only in TM_Y and TT_Y does M02 have smaller skewness values. Many positions of M03 and M02 have skewness values greater than 0.5, worst in M02 where all the values are greater than 0.5; both M03 and M02 are therefore not fairly symmetrical but asymmetrical in many directions (55.56% of directions for M03 and 100% for M02). We can conclude that the pronunciation motion of the tongue (TT, TM, and TB) is more symmetrical in the normal speaker than in the dysarthric speakers, and that many directions of the dysarthric speakers' tongues have skewness values beyond the symmetrical range. This indicates irregular frequencies and mean values of dysarthric tongue motion.

In Table 3, except for the TB_X, TB_Y, and TB_Z skewness of FC03, all remaining skewness values of FC03 are much smaller than the values of F04 and F03. This shows that the skewness values of the normal speaker (FC03) are much closer to 0 and 0.5 than those of F04 and F03, and illustrates the same pattern as for the males (Table 2), where the pronunciation motions of normal speakers are more symmetrical than those of dysarthric speakers.

6 Time Comparison

Figures 4 and 5 show that, for both males and females, the maximum values of start time and duration are longer for the dysarthric speakers. The values of all time types for the normal speakers are concentrated in the histograms, while for the dysarthric speakers the values are spread very widely. In Fig. 4, the start times of F03 and F04 largely exceed 2 s, while the start times of FC03 stay near 2 s; one value of F03 is even more than 6 s. Comparing F03 and F04, F03 clearly exceeds 2 s more than F04 does. For duration, the values of F03 exceed 1.25 s, and the durations of F04 exceed 1 s more than the values of FC03. In Fig. 5, we see a similar pattern to the females' case. Although M02 has fewer data than M03 and MC04, it reaches the longest maximum values for all time types. For the start time, the values of M02 and M03 exceed 2 s while the values of MC04 stay within 2 s. Comparing M02 and M03, the values of M02 exceed 2 s far more than those of M03; one value of M02 is even higher than 8 s, as shown in the histogram. For duration, many values of M02 exceed 1 s, and one value of M03 is even more than 2 s. All the above comparisons lead us to conclude that the maximum values of all time types for the dysarthric subjects are much longer than those of the normal participants. In terms of sickness degree, F03 has the highest maximum values compared with F04, and for the males, M02 is higher than M03. The analysis shows

Table 2 Skewness values of tongue back, tongue middle, and tongue tip for males (Types: MC04, M03, M02; columns: Tongue Back skewness along the X, Y, and Z directions (TB_X, TB_Y, TB_Z), Tongue Middle skewness along X, Y, and Z (TM_X, TM_Y, TM_Z), and Tongue Tip skewness along X, Y, and Z (TT_X, TT_Y, TT_Z))

Table 3 Skewness values of tongue back, tongue middle, and tongue tip for females (Types: FC03, F04, F03; columns: Tongue Back skewness along the X, Y, and Z directions (TB_X, TB_Y, TB_Z), Tongue Middle skewness along X, Y, and Z (TM_X, TM_Y, TM_Z), and Tongue Tip skewness along X, Y, and Z (TT_X, TT_Y, TT_Z))


Fig. 4 Start time and duration for females

Fig. 5 Start time and duration for males

the abnormality of all dysarthria cases in the time domain; because of their lack of speech control, they sometimes take a long time to finish pronouncing a word or to start it.

6.1 QQ Plot (Henry Line) of Times

To check the Gaussian distribution of all time types, we adopt the Henry line. The Henry line helps us compare the normality of all time types (start time, duration, and average time) between mild, moderate, severe, and normal speakers, which supports dysarthria diagnosis in the time domain. In Figs. 6 and 7, there is a confidence band bounded by two curves (one above and one below) and a line, called the Henry line, between the two curves. If the data points are close to the Henry line and within the confidence band, we can conclude that normality holds [22]. The probability distributions on the Henry line (Figs. 6 and 7) show that the dysarthric start times follow the normal distribution less closely than those of normal speakers. In the female case (Fig. 6), the data points of FC03 lie better on the line than the F04 and


Fig. 6 Females start time on Henry Line

Fig. 7 Males start time on Henry line

F03 data points. The confidence bands of all dysarthric time points are very wide, and the time points are far from the line and outside the confidence band. For FC03, the time points are very close to the line and the confidence band is so narrow that it almost merges with the line. Comparing F04 and F03, F03 is even worse, which is consistent with the mild and moderate conditions of F04 and F03, respectively. In Fig. 7, the M02 points are farther from the line and the confidence band than the M03 points, and the M02 confidence band is wider than that of M03. The comparisons of Figs. 6 and 7 hold similarly for duration and average time. These differences in the normality of the time distributions can be seen further in Table 4, where we calculated each standard deviation.
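For illustration, a QQ plot of this kind can be produced as sketched below (scipy's probplot against a normal distribution; the fitted line plays the role of the Henry line, and the confidence bands drawn in Figs. 6 and 7 are not reproduced here). The variable names are hypothetical.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def henry_line_plot(start_times, label):
    # Ordered start times against theoretical normal quantiles;
    # points near the fitted (Henry) line indicate approximate normality.
    (osm, osr), (slope, intercept, _r) = stats.probplot(start_times, dist="norm")
    osm = np.asarray(osm)
    plt.plot(osm, osr, "o", label=f"{label} start times")
    plt.plot(osm, slope * osm + intercept, label=f"{label} Henry line")
    plt.xlabel("Theoretical quantiles")
    plt.ylabel("Ordered start times (s)")
    plt.legend()

# henry_line_plot(f03_start_times, "F03"); plt.show()
```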


Table 4 Standard deviation of all types of time

Standard deviation                 F03    F04    FC03
Standard deviation (start time)    0.86   0.75   0.20
Standard deviation (time)          0.21   0.13   0.12

Standard deviation                 M02    M03    MC04
Standard deviation (start time)    0.95   0.31   0.16
Standard deviation (time)          0.25   0.16   0.093
Standard deviation (average time)  0.95   0.29   0.16

In Table 4, the standard deviations of all time types (start time, duration, and average time) are higher for dysarthria and increase with the degree of sickness (F03 is greater than F04, and both are greater than FC03; likewise, M02 is greater than M03, and both are greater than MC04). This analysis shows that the time values are more variable in all dysarthria cases. The degree of abnormality is consistent with the moderate, mild, and normal conditions for the females and the severe, mild, and normal conditions for the males, which is why they do not obey the Henry line.

7 Classification

7.1 Data Preparation

Our last contribution concerns the proposed classification method. A total of 3696 short words are used for classification, and the composition of the selected subjects and data is given in Table 5. In this work, the classifications are performed on the combination of range and skewness along the X, Y, and Z directions of TT, TM, UL, and the lower lip LL. Before classification, the data are split into a 70% training set and a 30% testing set independently, meaning that no value is repeated between training and testing. Weka has been used for classification.

Table 5 Dataset composition

Severity and control  TORGO participants                         Samples
Severe                M01, M02                                   129
Moderate              F03                                        194
Mild                  F04, M03                                   445
Normal                FC01, FC02, FC03, MC01, MC02, MC03, MC04   2928
Total                 All participants                           3696


7.2 Proposed Method

This section describes the proposed method used to better classify the TORGO dataset. The proposed method is the combination of Principal Component Analysis (PCA) [23] and Cost-Sensitive learning (CS) [24]. In this work, PCA is used to transform the original features into a linear space where the newly obtained features are ordered by importance. Cost-sensitive learning is widely used in machine learning and is applied when the problem of misclassifying the first class as the second, or the second class as the first, occurs (binary classification) [25, 26]. The same misclassification problem can happen between the different classes of a multi-class dataset. Because of its ability to address misclassification between different classes, cost-sensitive learning is very important when a dataset is imbalanced. In our work, the sample numbers of the different dysarthria severities and the normal speakers are imbalanced (see Table 5). We therefore introduce cost-sensitive learning, because when the sample numbers are imbalanced, ordinary classifiers such as the J48 decision tree tend to misclassify classes with fewer samples as classes with many samples. Our work is on the classification of a four-class dataset, so in Weka the default cost matrix is:

$\begin{bmatrix} 0.0 & 1.0 & 1.0 & 1.0 \\ 1.0 & 0.0 & 1.0 & 1.0 \\ 1.0 & 1.0 & 0.0 & 1.0 \\ 1.0 & 1.0 & 1.0 & 0.0 \end{bmatrix}$    (6)

We changed the settings to the following cost matrix by applying a penalty on incorrect classification, as used in [27], to address misclassification and imbalance:

$\begin{bmatrix} 0.0 & 2.0 & 1.0 & 1.0 \\ 1.0 & 0.0 & 1.0 & 42.0 \\ 1.0 & 1.0 & 0.0 & 13.0 \\ 1.0 & 1.0 & 11.0 & 0.0 \end{bmatrix}$    (7)

We compare the result against J48, PCA with J48, SMOTE with J48, and Class balance with J48 to demonstrate the efficiency of the proposed method in solving imbalance and enhancing the ordinary J48. The Synthetic Minority Oversampling Technique (SMOTE) is one of the most commonly used oversampling techniques for imbalanced data and is known for its good performance [28]. Class balance is an undersampling technique for imbalanced data [29]. The data for SMOTE and class balance are split into 70% training and 30% testing, as for our proposed method. SMOTE and class balance are applied on the training set and evaluated on the testing set. We applied SMOTE by increasing the sample numbers of the minority classes until they reach the sample number of the majority class, and the nearest neighbor number


Table 6 Total classification comparison

Classifiers        Accuracy
J48                77.70
PCA.J48            78.79
SMOTE.J48          68.45
Class balance.J48  68.18
CS_PCA.J48         81.05

Table 7 Classification result of the recall

Classifiers        Recall (Severe)  Recall (Moderate)  Recall (Mild)  Recall (Normal)  Weighted Avg
J48                89.50            31.50              23.00          88.50            77.70
PCA.J48            94.70            18.50              23.70          90.30            78.80
SMOTE.J48          92.10            55.60              51.10          70.90            68.40
Class balance.J48  89.50            27.80              50.40          72.50            68.20
CS_PCA.J48         94.70            27.80              23.70          92.60            81.10

is set to 5. In our work, the majority class is the "Normal" class with 2928 samples (see Table 5). Class balance is used to equalize the weights of the classes [29]. The weight is the average number of samples over all the classes; in our work, the weight is 924, meaning the samples of all four classes are re-weighted to 924 to solve the imbalance by class balance. The final result shows that our method outperformed the others.
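The experiments themselves were run in Weka (J48 with cost-sensitive learning and PCA). Purely as an illustrative analogue, the same idea can be sketched in Python with scikit-learn, using a generic decision tree in place of J48 and a minimum-expected-cost decision rule based on the cost matrix of Eq. (7). The class order, the row/column semantics (rows = true class), and the PCA variance threshold are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Cost matrix of Eq. (7); class order assumed to be
# [severe, moderate, mild, normal], rows = true class, columns = predicted.
COST = np.array([[0.0, 2.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0, 42.0],
                 [1.0, 1.0, 0.0, 13.0],
                 [1.0, 1.0, 11.0, 0.0]])

def fit_pca_tree(X_train, y_train):
    # PCA feature transformation followed by a CART decision tree
    # (an approximation of Weka's J48, which implements C4.5).
    model = make_pipeline(PCA(n_components=0.95),
                          DecisionTreeClassifier(random_state=0))
    model.fit(X_train, y_train)
    return model

def predict_min_expected_cost(model, X_test):
    # Cost-sensitive decision: pick the class minimising the expected
    # misclassification cost under the tree's class probabilities.
    proba = model.predict_proba(X_test)   # shape (n_samples, 4)
    expected_cost = proba @ COST          # shape (n_samples, 4)
    return expected_cost.argmin(axis=1)
```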

7.3 Classification Result

Table 6 shows the total accuracy and demonstrates better results for Cost-Sensitive and PCA with J48 (CS_PCA.J48) than for PCA with J48 (PCA.J48) and J48. As can be seen, CS_PCA.J48 outperformed the ordinary classifier (J48) and PCA with J48 (PCA.J48). Specifically, the classification accuracy reached its highest value of 81.05 with CS_PCA.J48, which is about 3.35% and 2.26% higher than J48 and PCA.J48, respectively. Our method also clearly outperformed the two common imbalance approaches, SMOTE.J48 and Class balance.J48, with differences of 12.6 and 12.87, respectively. We can notice that these two approaches, compared with our method, actually reduced the performance of the model. This shows the robustness of the proposed method in classifying the imbalanced classes of the TORGO dataset. Table 7 shows the per-class classification results in terms of recall. CS_PCA.J48 achieved generally the best recall, with a weighted average of 81.10, while J48 and PCA.J48 obtained 77.70 and 78.80, respectively. CS_PCA.J48 has generally


well outperformed the two common imbalance approaches, SMOTE.J48 and Class balance.J48, with differences of 12.7 and 12.9, respectively, in the weighted average of recall.

8 Discussion

This article investigated articulatory data analysis and classification on the TORGO database. The different statistical methods applied to the TORGO articulatory data, namely three-dimensional position space, range skewness, histograms, and the Henry line, have shown clear discrimination between dysarthria severities and control speech. Section 6 compared the reaction time before pronunciation starts and the duration for mild, moderate, severe, and control speech by using histograms, the Henry line, and the standard deviation (Figs. 4, 5, 6 and 7 and Table 4). All the methods confirmed that the start time and duration of patients with dysarthria differ greatly from those of normal people, consistently with the moderate, mild, and normal conditions for the females and the severe, mild, and normal conditions for the males. This supports the acoustic work on vowels by Frank Rudzicz [5] on the TORGO dataset to distinguish dysarthria from normal speech (binary classification); that research found that the acoustic vowels showed lower duration in dysarthria compared with control speech [5]. Besides, a method is proposed to classify dysarthric speech with more than two classes where the sample numbers between those classes are imbalanced. This method improved the accuracy of the original classifier by reducing misclassification. Two imbalanced-data approaches, SMOTE and Class balance, were compared against the proposed method, which showed better accuracy with differences of 12.6% and 12.87%, respectively.

9 Conclusion

This article has mainly presented the articulatory analysis of position, range, and time for three dysarthria levels, followed by a classification method to recognize dysarthria severity. However, this work is a basic study and has some limitations, such as the scarcity of large dysarthric speech datasets. In the future, we encourage the study of the correlation between the auditory and articulatory features of the TORGO dataset.

Acknowledgements This project was supported by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 12004275), in part by the Natural Science Foundation of Shanxi Province, China, under Grant 20191D111095, in part by the Taiyuan University of Technology Foundation, China, under Grant tyut-rc201405b, and in part by the Shanxi Scholarship Council of China (2020-042).


References

1. Doyle PC, Leeper HA (1997) Dysarthric speech: a comparison of computerized speech recognition and listener intelligibility. J Rehabil Res Dev 34:309
2. Darley FL, Aronson AE, Brown JR (1969) Differential diagnostic patterns of dysarthria. J Speech Hear Res 12:246–269
3. Narendra NP, Alku P (2021) Automatic assessment of intelligibility in speakers with dysarthria from coded telephone speech using glottal features. Comput Speech Lang 65:101117
4. Chen Y, Zhu G, Liu D, Liu Y, Yuan T, Zhang X, Jiang Y, Du T, Zhang J (2020) Brain morphological changes in hypokinetic dysarthria of Parkinson's disease and use of machine learning to predict severity. CNS Neurosci Ther 26:711–719
5. Rudzicz F, Namasivayam AK, Wolff T (2012) The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Lang Resour Eval 46:523–541
6. Jimin L, Michael B, Zachary S (2018) Articulatory kinematic characteristics across the dysarthria severity spectrum in individuals with amyotrophic lateral sclerosis. Am J Speech Lang Pathol 71:258–269
7. Kuruvilla Mili S, Green Jordan R, Yana Y, Kathy H (2012) Spatiotemporal coupling of the tongue in amyotrophic lateral sclerosis. J Speech Lang Hear Res JSLHR
8. Yana Y, Green Jordan R, Lindstrom Mary J, Ball Laura J, Pattee Gary L, Lorne Z (2010) Kinematics of disease progression in bulbar ALS. J Commun Disorders 43:6–20
9. Joshy AA, Rajan R (2021) Automated dysarthria severity classification using deep learning frameworks. In: 28th European signal processing conference (EUSIPCO). IEEE
10. Korzekwa D, Barra-Chicote R, Kostek B, Drugman T, Prakash S (2020) Deep learning-based detection of dysarthric speech disability
11. Hernandez A, Yeo EJ, Kim S et al (2020) Dysarthria detection and severity assessment using rhythm-based metrics. In: Proceedings of the annual conference of the international speech communication association (INTERSPEECH), Shanghai, China, pp 25–29
12. Narendra NP, Alku P (2020) Glottal source information for pathological voice detection. IEEE Access 8:67745–67755
13. Green Y, Kuruvilla W, Pattee S, Zinman B (2013) Bulbar and speech motor assessment in ALS: challenges and future directions. Amyotrophic Lateral Sclerosis Frontotemporal Degener 2013(14):494–500
14. Zhixiang C (2010) Study on tongue movement and mouth expression of virtual human. Univ Sci Technol China
15. Shengli L (2010) Speech therapy. Huaxia Press
16. Goozee JV, Murdoch BE, Theodoros DG, Stokes PD (2000) Kinematic analysis of tongue movements in dysarthria following traumatic brain injury using electromagnetic articulography. Brain Inj 14:153–174
17. Jeff B, Andrew K, James S, Johnson Michael T (2017) Jaw rotation in dysarthria measured with a single electromagnetic articulography sensor. Am J Speech-Lang Pathol 26:596–610
18. The University of Toronto, Department of Computer Science (2012). http://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html
19. Doane DP, Seward LE (2011) Measuring skewness: a forgotten statistic. J Stat Educ 19
20. Rayner JCW, Best DJ, Mathews KL (1995) Interpreting the skewness coefficient. Commun Stat Theory Methods 24:593–600
21. Dugar D (2018) Skew and kurtosis: 2 important statistics terms you need to know in data science, 23 August 2018. https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa. Accessed 19 Aug 2020
22. Soetewey A (2020) Descriptive statistics in R. Stat and R, 22 January 2020. https://www.statsandr.com/blog/descriptive-statistics-in-r/. Accessed 19 Aug 2020
23. Mahmoudi MR, Heydari MH, Qasem SN, Mosavi A, Band SS (2020) Principal component analysis to study the relations between the spread rates of COVID-19 in high risks countries. Alex Eng J 60:457–464


24. Soetewey A (2020) Descriptive statistics in R. Stat and R, 22 January 2020. https://www.statsandr.com/blog/descriptive-statistics-in-r/. Accessed 19 Aug 2020
25. Zadrozny B, Elkan C (2001) Learning and making decisions when costs and probabilities are both unknown. In: Proceedings of the seventh international conference on knowledge discovery and data mining. ACM Press, pp 204–213
26. Zadrozny B, Langford J, Abe N (2003) Cost-sensitive learning by cost-proportionate example weighting. In: Proceedings of the third IEEE international conference on data mining, ICDM. IEEE, pp 435–442
27. Khan A, Khan F, Khan S et al (2018) Cost sensitive learning and SMOTE methods for imbalanced data. J Appl Emerg Sci 8:32–38
28. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority oversampling technique. J Artif Intell Res 16:321–357
29. Ruangthong P, Songsangyos P, Kankaew S (2016) Solving imbalanced problem of multiclass data set with class balancer and synthetic minority over-sampling technique. Int J Appl Comput Technol Inf Syst 6:87–90

Human Ear Modelling

Channel-Vocoder-Centric Modelling of Cochlear Implants: Strengths and Limitations Fanhui Kong, Yefei Mo, Huali Zhou, Qinglin Meng, and Nengheng Zheng

Abstract Modern cochlear implants (CIs) generate pulsatile electric current stimuli from real-time incoming sounds to stimulate the residual auditory nerves of deaf ears. In this unique way, deaf people can (re)gain a sense of hearing and consequent speech communication abilities. The electric hearing mimics normal acoustic hearing (NH), but with a different physical interface to the neural system, which limits the performance of CI devices. Simulating the electric hearing process of CI users with NH listeners is an important step in CI research and development. Many acoustic modelling methods have been developed for simulation purposes, e.g., to predict the performance of a novel sound coding strategy. Channel vocoders with noise or sine-wave carriers are the most popular among these methods. The simulation works have accelerated the procedures of re-engineering and understanding the electric hearing. This paper presents an overview of the literature on channel-vocoder simulation methods. Strengths, limitations, applications, and future works of acoustic vocoder simulation methods are introduced and discussed. Keywords Cochlear implant · Auditory prosthesis · Speech perception · Hearing research · Pitch · Vocoder

F. Kong · H. Zhou · N. Zheng (B) Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen 518060, Guangdong, People’s Republic of China e-mail: [email protected] Y. Mo · H. Zhou · Q. Meng (B) Acoustic Laboratory, School of Physics and Optoelectronics, South China University of Technology, Guangzhou 510610, Guangdong, People’s Republic of China e-mail: [email protected] Y. Mo School of Medicine, South China University of Technology, Guangzhou 510006, Guangdong, People’s Republic of China © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 X. Shao et al. (eds.), Proceedings of the 9th Conference on Sound and Music Technology, Lecture Notes in Electrical Engineering 923, https://doi.org/10.1007/978-981-19-4703-2_11


1 Introduction

Auditory perception relies on the normally functioning peripheral auditory coding within the inner ear (or cochlea). Thousands of hair cells, which lie between the basilar membrane and the tectorial membrane, convert vibrations of sounds coming through the oval window into action potentials according to the physical properties of the sound pressure waveform. The auditory nerve fibers synaptically connected to the hair cells are then electrically excited, and the sound signals are further encoded in both place and temporal patterns of spikes up to the brain [1]. Unfortunately, the cochleae may lose their function for reasons including genetic defects, noise exposure, drug toxicity, and aging. If an ear loses normal acoustic hearing (NH) severely and cannot transmit enough speech information to the brain, a cochlear implant may be highly suggested by an audiologist. Cochlear implants (CIs), by bypassing the cochlear hair cells and directly stimulating the auditory nerve fibers, have successfully helped more than 1 million deafened people all over the world to (re)gain hearing. A CI converts incoming sounds captured by one or multiple microphones into electric pulses and then stimulates the residual auditory nerve fibers through an electrode array which is surgically implanted in the scala tympani of the cochlea [2]. Most CI users can understand speech in favorable conditions. However, their sound perception abilities are still abnormal compared with their NH peers in many aspects, e.g., speech recognition in noise [3] and reverberation [4], music appreciation [5], articulatory features (F0 and formants) [6, 7], sound localization [8], and atypical pronunciation [9, 10]. Among the individuals of any CI cohort, large performance variances are observed. What's more, the performance of a CI recipient cannot be precisely predicted before surgery, and low-performance listeners usually cannot be trained to gain large improvements [11]. Technically, the main limitation comes from the electrode-to-neuron interface, i.e., the CI electrode array and its stimulation of the neurons. Information of different frequencies is coded at different stimulation places, i.e., different electrodes. A modern CI typically has 12 to 24 electrodes [12], which are far fewer in number and larger in size than the hair cells in a healthy cochlea [13]. Compared with the natural frequency tuning of normal auditory nerve fibers, the tuning of CI electrode stimulation is much broader. The broad frequency tuning and the limited number of electrodes lead to a poor resolution of frequency information with electric hearing. The electrode array mostly cannot reach the apical turn of the cochlea, and a large mismatch exists in the mapping of all CI recipients. Furthermore, the temporal characteristics of the auditory nerve fibers responding to artificial electric pulse trains and natural acoustic sounds are significantly different [14], which may explain the temporal limits of electric hearing observed in many psychophysical studies. Healthy cochleae show phase-locking features to sinusoidal components with frequencies up to about 4 to 5 kHz. In contrast, the upper limit of temporal sensitivity of electric hearing is only a few hundred hertz (see [15]).


The sound processing strategies also play an important role in the performance of CI users. Most modern clinically available CI signal processing strategies are based on temporal envelopes. Even though the physical resolutions of the CI interface are very coarse, there are many electric parameters (e.g., frequency mapping, electrode channel number, current spread, dynamic range, temporal envelope cut-off frequencies, and pulse shape) that can be permuted to optimize the electric hearing performance. To this end, numerous efforts have been made to improve CI strategies. To accelerate the procedure of the (re-)engineering of CI strategies and the fundamental research on CI understanding, several acoustic models have been proposed to simulate the electric hearing, e.g., in [16]. In that historical paper, it was stated that "The main reason for formulating this acoustic model is to facilitate the development of speech coding schemes for use with multiple-channel cochlear implants. The model allows a normally hearing listener to gain insights into the capability of the implant and to assess alternative speech coding schemes first hand." Acoustic CI models analyze a sound according to the core steps in a CI strategy and re-synthesize it into a new sound, which can be presented to NH subjects through headphones or speakers. The differences observed in NH listeners using acoustic CI models with different parameters may be used to predict the performance of CI listeners using strategies with the corresponding different parameters. This is an ideal hypothesis, and the models are helpful research tools since CI listeners are much less accessible and more variable than NH listeners. Simulation experiments in NH listeners usually result in low-variance data and sometimes overestimate the mean actual CI performance. This is also not surprising, given that there are many physical differences between the neuron interfaces of electric and acoustic hearing. Listening tests with human subjects are usually time-consuming, so reasonable parameters should be carefully selected to get valuable results. However, in practice, because of the complicated multidisciplinary concepts, the large variance of CI subjects, and the diverse research purposes, it is usually difficult to make the appropriate choice during model design, especially for new researchers in the field. This paper aims to review the strengths and limitations of the most conventional acoustic CI simulation methods, i.e., channel vocoders, and to look ahead to the future development of CI simulations. In other words, this work aims to provide an overview of the history, current status, and future directions of channel-vocoder-centric CI simulation studies. Detailed arrangements are as follows. In Sect. 2, current CI strategies are briefly introduced. In Sect. 3, the channel vocoders and the physical parameters simulated in them are discussed.

2 CI Signal Processing Strategies: Interleavedly Sampling the Temporal Envelopes

A standard CI strategy framework is shown in Fig. 1(A). It was proposed and evaluated in the early 1990s by Wilson and colleagues and named the continuous


interleaved sampling (CIS) strategy [17]. The key features of CIS are 1) extracting temporal envelopes from multiple frequency channels, 2) stimulating each electrode place using a biphasic pulse train whose amplitude is modulated by the envelope extracted in feature 1, and 3) stimulating only one electrode place at a time, i.e., there is no time overlap between stimulations on multiple electrodes. Detailed implementations of each stage may be slightly different according to engineering choices. The CIS strategy is integrated into almost all CI products, with some new features over the naive CIS. For example, not all the electrodes are stimulated in a processing frame in the advanced combination (ACE) strategy in Cochlear devices. Typically, only eight electrodes with maximum amplitudes (maxima) are dynamically selected from the 22 electrodes for stimulation in each processing frame. It is a typical example of the n-of-m strategies [2]. In the fine structure processing (FSP) strategy in MED-EL devices, the zero-crossing points are used for the timing of electric pulses in the lowest 2-4 channels, which is assumed to introduce some temporal fine structure. In the HiRes120 strategy in Advanced Bionics devices, a current steering (or virtual channel) technique, i.e., using simultaneously firing neighboring electrodes to induce

Fig. 1 Block diagrams of a typical CI strategy, i.e., continuous interleaved sampling (CIS) processing strategy (A), and of a channel vocoder modelling algorithm (B). A: The input speech sound is filtered through a bandpass filter bank. For the output from each channel, the following processing stages include rectification, envelope smoothing, amplitude compression, and pulse modulation. Then the amplitude-modulated pulse train is transmitted to the corresponding electrode in the cochlea. B: The channel vocoder can use the same procedure as CIS to get the temporal envelopes. Then the envelopes are used to amplitude-modulate sine-wave or noise carriers. The modulated carriers from all bands are superimposed to generate a sound signal to be presented to normal hearing ear(s) through a loudspeaker or headphones


intermediate pitch perception between physical electrodes, was used. 120 virtual channels are assumed to be created among the 16 physical electrodes [18]. Even though different variations (including ACE, FSP, and HiRes120) of the strategies have been advocated by different companies, they share the key features (i.e., temporal envelope-based interleaved sampling bi-phasic pulses) with CIS. There is no consistent evidence to support the significant advantages of any strategy over the others.

3 Channel Vocoders: The Algorithms and Applications

3.1 Algorithms of the Channel Vocoders

It is impossible to replicate, via acoustic stimulation in a normally hearing ear, the stimulation pattern delivered to the auditory nerve fibers by CIs. Still, it can nevertheless be helpful for research and demonstration purposes to simulate CI processing using acoustic models. Since the 1990s, the most popular simulation method has been the channel vocoder, as shown in Fig. 1(B). The analysis band-pass filters and the envelope extraction steps can be the same as in actual CI strategies. To re-synthesize a sound stimulus for NH listeners, the temporal envelopes are used to amplitude-modulate either band-limited noise carriers [19] or sine-wave carriers [20]. The amplitude-modulated results are summed into one sound. In some cases, band-limiting filters and power normalization are inserted between modulation and summing. In the algorithm, all parameters of all stages can be manipulated to simulate different aspects of the CI strategy. The vocoder-centric simulation has produced a cottage industry of auditory research [21]. The key physical parameters of CIs and their simulation in the channel vocoders are discussed below. Before going into the detailed parameters, it should be noted that brain plasticity should always be taken care of in vocoded speech perception experiments. CI listeners, at the time of experiments, usually have at least months of experience in listening to the artificial sound [22], whereas NH subjects are usually naive listeners to vocoded speech. The hearing experience may influence their familiarity with the stimuli and thus affect their performance.
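A minimal sketch of such a channel vocoder is given below, following the analysis/synthesis chain of Fig. 1(B). The specific choices (logarithmically spaced band edges between 100 Hz and 7 kHz, fourth-order Butterworth analysis filters, a 160 Hz envelope cut-off, sine carriers at the geometric band centre frequencies) are illustrative assumptions, not values prescribed by any particular device or study.

```python
import numpy as np
from scipy.signal import butter, lfilter

def channel_vocoder(x, fs, n_channels=8, carrier="sine",
                    f_lo=100.0, f_hi=7000.0, env_cutoff=160.0):
    """Noise/sine channel vocoder: band-pass analysis, rectification and
    low-pass envelope smoothing, carrier modulation, and summation."""
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)   # log-spaced band edges
    b_env, a_env = butter(2, env_cutoff / (fs / 2), btype="low")
    t = np.arange(len(x)) / fs
    out = np.zeros(len(x))
    rng = np.random.default_rng(0)
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = lfilter(b, a, x)
        env = lfilter(b_env, a_env, np.abs(band))       # rectify + low-pass
        env = np.maximum(env, 0.0)
        if carrier == "sine":
            c = np.sin(2 * np.pi * np.sqrt(lo * hi) * t)     # sine at band centre
        else:
            c = lfilter(b, a, rng.standard_normal(len(x)))   # band-limited noise
        out += env * c
    return out / (np.max(np.abs(out)) + 1e-12)
```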

3.2 Frequency Allocation

The CI frequency allocation (i.e., which frequency range is allocated to which electrode) is based on the tonotopic organization of the cochlea, i.e., more basal electrodes correspond to higher frequency bands. However, the detailed allocations in CI products are quite arbitrary. For example, the frequency allocation of the ACE strategy


seems to be partially determined by the fast-Fourier-transform bins [23] and does not precisely follow the physiological Greenwood frequency-place coding function [24]. Nonetheless, the CI frequency allocation in the analysis filters as in Fig. 1(A) can be perfectly simulated in the analysis filter design of the channel vocoders as in Fig. 1(B). The analysis filter frequency allocation cannot guarantee either place-matching or harmonic relationships in the synthesis parts. The (center) frequencies of sine-wave carriers or band-limited noise carriers can be adjusted to match the physiological characteristic frequencies of the auditory neurons being stimulated by the corresponding electrodes. The degree of matching or mismatching can be directly simulated by changing the (center) frequencies, e.g., for electrode insertion depth simulation [25] and bilateral mismatching simulation [26].

3.3 Spectral Channel Number

One spectral channel (band) usually corresponds to one electrode. The number of band-pass filters can be used to generate different numbers of spectral channels for both actual CIs and channel vocoders. Figure 2 shows examples of four- and 16-channel vocoder-simulated sentence speech with both sine-wave and noise carriers. While the channel number in actual CIs is limited by the physical electrode number, the channel number can be set over a much larger range in simulations. Shannon et al. (1995) [19] showed that temporal envelopes from four spectral channels can provide enough information for speech intelligibility. Dorman et al. (1998) [27] confirmed this finding and showed that more channels are needed for recognition of speech in noisy conditions than in quiet conditions. This need for more channels was confirmed by recent actual CI data [28]. [19] and [27] are the beginnings of the channel vocoder used as a tool for CI simulation. Friesen et al. (2001) [29] found that a vocoder with about eight channels can simulate the speech recognition performance of the best CI results, so many works used vocoders with no more than eight channels to simulate CI performance even though the majority of CI devices have more than eight electrode channels.

Fig. 2 Examples of spectrograms of channel-vocoded speech


3.4 Current Spread

The maximum electrode number is limited by the current spread of the electric stimulation on individual electrodes. Because of the large electrode size and the long neuron-electrode distance, the electrode current spread is too broad to provide a fine spectral resolution [13]. Different degrees of current spread can be simulated by manipulating the bandwidth and slopes of the frequency responses of the band-pass filters. For example, in [30] carrier filter slopes of −24 or −6 dB/octave were used, and it was found that a steeper slope, representing fewer channel interactions, may provide better speech intelligibility in noisy conditions. Another way to simulate the current spread is to add weighted contributions from other bands' envelopes to the current band's envelope, e.g., in [31]. Sometimes, band-limiting filters can be inserted between the carrier modulation and the output summing in Fig. 1(B) to control the interaction to some extent.
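A sketch of the envelope-weighting idea (the second approach above) is shown below; the attenuation value per channel of separation is illustrative and is not taken from [31].

```python
import numpy as np

def spread_envelopes(envelopes, atten_db_per_channel=6.0):
    # envelopes: array of shape (n_channels, n_samples).
    # Each channel receives contributions from its neighbours, attenuated by
    # a fixed number of dB per channel of separation; a smaller attenuation
    # per channel corresponds to broader current spread (more interaction).
    n = envelopes.shape[0]
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    weights = 10.0 ** (-atten_db_per_channel * dist / 20.0)
    return weights @ envelopes
```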

3.5 Temporal Envelope

In [32], Rosen proposed a framework of temporal envelope, periodicity, and fine structure to analyze the temporal structure of speech. In CI strategies, temporal envelopes from multiple channels are extracted. The cut-off frequencies of the low-pass filters (in Fig. 1) can be used to control the fluctuation rate of the envelopes for both actual and simulated CI stimuli. Channel vocoders use continuous noise carriers or sine-wave carriers, while CIs use biphasic pulse train carriers. Noise carriers have been shown to introduce more intrinsic temporal fluctuation, which may interfere with the speech temporal envelope to be transmitted [33]. Higher cut-off frequencies (e.g., 500 Hz) can include more periodicity information in the envelopes. With channel vocoders, the periodicity cues have been used in many studies to simulate CI pitch perception tasks, such as lexical tone recognition [34], voice gender discrimination [35], and stream segregation [36, 37]. However, the different physical characteristics of the carriers may prevent the vocoders from accurately modelling periodicity-based CI pitch perception. What's more, a trade-off phenomenon in phoneme recognition was reported between the temporal fluctuation rate and the channel number [38]. The carrier effects should be carefully handled in both experimental design and discussions about implications for actual CIs. Another method of envelope extraction is the Hilbert transform, i.e., calculating the magnitude of the analytic signal of the input band signal. The difference between the rectification-and-low-pass-filtering method and the Hilbert method is subtle but can be tricky to handle. The former is more physically meaningful, especially when we want to control the periodicity cues for our applications.
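The two envelope extraction options mentioned above can be contrasted in a few lines; the 160 Hz cut-off here is an arbitrary example value.

```python
import numpy as np
from scipy.signal import butter, lfilter, hilbert

def envelope_rectify_lowpass(band, fs, cutoff=160.0):
    # Full-wave rectification followed by low-pass smoothing; the cut-off
    # controls how much periodicity (e.g., F0-related fluctuation) is kept.
    b, a = butter(2, cutoff / (fs / 2), btype="low")
    return lfilter(b, a, np.abs(band))

def envelope_hilbert(band):
    # Magnitude of the analytic signal (Hilbert envelope).
    return np.abs(hilbert(band))
```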

3.6 Intensity and Dynamic Range

NH listeners can hear sounds over a large intensity range of up to 120 dB SPL, with a resolution of 1–2 dB, whereas CI users have a much narrower dynamic range. The dynamic range of a CI user is defined as the difference between the current levels at which a sound is just audible (the threshold) and at which it is perceived as uncomfortably loud. For example, in Cochlear's implants, eight bits (i.e., 255 current steps) are used to quantize the current intensity [23], and most recipients are fitted with a range covering only a portion of those 255 steps. This means that an acoustic range of tens of dB SPL (e.g., 35 to 85 dB SPL) is compressed into an electric range of only several dB (see [39] for more information on CI compression). The rationale was supported by Loizou et al. [40], who showed that eight bits were sufficient for vowel and consonant recognition with six-channel actual and simulated CIs. However, to the best of our knowledge, there has been no more detailed research on this intensity quantization in electric hearing. It is unknown whether a higher intensity resolution could improve the intelligibility and quality of sounds in various conditions; this is a reasonable question, especially for listeners who, like headphone enthusiasts, care about sound quality. Moreover, in most CI simulation studies, the original quantization resolution (e.g., 16 or 32 bits) of the input audio is kept unchanged throughout the vocoding procedure, and it is also unknown whether this quantization resolution influences the perception of the vocoded sounds. Dynamic range compression can be simulated by inserting a compression stage between the modulation and the summing stages in Fig. 1(B); it has been demonstrated to affect speech recognition negatively (e.g., in [41]), but it is often omitted in CI simulation studies.
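
A compression stage of this kind could be sketched as follows. The logarithmic mapping of a 50 dB acoustic input range onto 255 quantization steps is an illustrative assumption and is not taken from any manufacturer's fitting law; the function operates on one non-negative channel envelope and returns a normalized, quantized envelope.

```python
import numpy as np

def compress_envelope(env, input_range_db=50.0, n_steps=255):
    """Map a channel envelope onto a narrow, quantized output range
    (illustrative log-compression, not a manufacturer's fitting law)."""
    ref = np.max(env) + 1e-12
    level_db = 20.0 * np.log10(np.maximum(env, 1e-12) / ref)  # dB re. envelope peak
    level_db = np.clip(level_db, -input_range_db, 0.0)        # limit to the input range
    mapped = 1.0 + level_db / input_range_db                  # linear in dB, 0..1
    return np.round(mapped * n_steps) / n_steps               # e.g. 8-bit quantization
```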

3.7 Carrier Waveform

As mentioned above, the most conventional carriers are sine-wave and noise signals, both of which are continuous. One essential shortcoming of conventional channel vocoders is the lack of pulsatile characteristics, even though pulsatile stimulation is known to be one of the key reasons for the success of the CIS strategy. The continuous carrier waveforms of channel vocoders cannot simulate the electric pulse trains, a limitation that has been almost completely ignored in hundreds of CI simulation papers. One possible reason is that most current CI strategies use a high stimulation rate (≥ 900 pulses per second, pps), which is assumed to be well above the upper temporal pitch limit of electric hearing; the pulsatile characteristics of the electric pulse trains can therefore be regarded as too fast to be perceptually resolved. Nevertheless, this shortcoming is worth further investigation in order to provide a better simulation model for signal processing and for phenomenological comparisons.
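
For completeness, the sketch below generates the three carrier types discussed here: a sine wave, band-limited noise, and a crude biphasic-like pulse train. Conventional channel vocoders use only the first two; the 900 pps default rate and the single-sample pulse phases are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def sine_carrier(fc, n, fs):
    """Continuous sine carrier at the band centre frequency fc."""
    return np.sin(2 * np.pi * fc * np.arange(n) / fs)

def noise_carrier(f_lo, f_hi, n, fs):
    """Continuous noise carrier band-limited to [f_lo, f_hi]."""
    sos = butter(4, [f_lo, f_hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, np.random.randn(n))

def pulse_train_carrier(rate_pps, n, fs, rate_default=900):
    """Crude biphasic-like pulse train: one positive and one negative
    sample per stimulation period (not used by conventional vocoders)."""
    rate = rate_pps or rate_default
    c = np.zeros(n)
    period = max(2, int(round(fs / rate)))
    c[::period] = 1.0
    c[1::period] = -1.0
    return c
```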

3.8 Short Summary

The frequency allocation, spectral channel number, current spread, and temporal envelope of CI stimuli can be conveniently simulated by manipulating the corresponding physical parameters of channel vocoders. However, the intensity and dynamic range have been simulated in only a few studies, and the pulsatile features of CI stimuli have been almost completely ignored [54]. Researchers have carried out many experimental studies to verify these simulations and to gain insights from them. For a channel-vocoded sound, all of these physical consistencies and disparities relative to actual CI stimuli can affect how the perceptual results are interpreted.

4 Channel-Vocoder Simulation vs. Actual CI Hearing

To examine the power of CI simulations, it is helpful to directly compare vocoded speech perception in normal-hearing (NH) subjects with that of actual CI patients. Among the earliest publications on direct vocoder-vs-CI comparisons, Fu et al. (1998) [42] showed that four-channel noise-carrier vocoder simulations yielded performance similar to that of CI patients using a CIS strategy with the same number of electrode channels. Friesen et al. (2001) [29] showed that with eight channels, speech-in-noise recognition was similar between the NH and CI groups, whereas increasing the channel number beyond eight improved the performance of the vocoder simulations but not that of the actual CI patients. Stickney et al. (2004) [43] compared 4-channel and 8-channel noise vocoders with five Nucleus CI users in a task of speech recognition in a speech masker; the 8-channel vocoder results were better than those of the CI subjects using 8-of-22 (or 8-of-24) ACE strategies. Fu et al. (2004) [35] showed that eight-channel sine-wave vocoders yielded better voice gender discrimination results than CI subjects for vowel stimuli, and that the envelope cut-off frequency had no significant effect on the simulation results. Fu et al. (2005) [44] found that 4-channel noise-band vocoders with −6 dB/oct band-pass filter slopes led to speech-in-gated-noise recognition similar to that of CI users. Iverson et al. (2006) [45] found that actual CI subjects could use formant movement for vowel recognition only about as well as NH listeners with roughly 6-channel noise vocoders. In many simulation-only studies (without actual CI tests), fewer than eight channels are used, which is far fewer than the channel numbers of current CIs (12–24). The small channel numbers are chosen mainly because vocoders with such low channel numbers yield performance similar to that of actual CI listeners in the tasks under study, for example in [46–48]. However, this frequently used assumption is not fully justified, and it is a major limitation of channel vocoders. The intrinsic aim of vocoder modelling of CIs is not merely to produce a vocoder that yields scores similar to those of CIs in one or several tasks; researchers would rather have a model that captures many phenomena across many perception tasks and most physical parameters of actual CI stimuli. To make vocoders with higher channel numbers (closer to those of actual CIs) yield performance similar to CI listeners, channel interaction, frequency shifting, and dynamic range compression can be manipulated in the channel vocoders to some extent. In some works, alternative vocoder methods have been designed to address this problem and to better mimic the electric stimulation or the physiological processing, e.g., in [54] and [55].

5 Sound Quality and Music Perception with Vocoded Sounds

When the keywords "((cochlear implant) AND (vocoder))" were searched on PubMed.gov on May 11, 2020, there were 306 results; when "Speech" was added to the keywords, 293 results remained. This indicates that most studies on CI vocoder simulation deal with speech sounds. The parameters of channel vocoders can be systematically varied to examine their effects on simulated speech perception; the experiments and results are too diverse to be summarized comprehensively here. Instead, we re-emphasize that channel-vocoder simulations should be interpreted carefully when drawing implications for actual CIs. For example, in studies including both vocoded-NH and actual-CI cohorts with the same channel number, the NH subjects' results were usually much better than the CI results, e.g., in [10, 29, 45]. When the keywords "((cochlear implant) AND (vocoder)) AND (Music)" were searched on PubMed.gov, there were only 22 results. This suggests that music appreciation cannot be easily simulated by channel vocoders: the quality of vocoded music is too far from that of the original music for NH listeners to appreciate it, whereas CI subjects have no alternative and have long-term listening experience. Another valuable group of CI subjects, i.e., single-sided-deafness CI implantees, provides opportunities for direct comparison of the sound quality of vocoders and CIs. For example, in [49] and [50], different subjects preferred different vocoder configurations, so it remains unclear which vocoder method best predicts the sound quality of CI hearing.

6 How to Simulate New Experimental Strategies?

So far, we have mainly discussed the vocoder simulation of conventional CIS-like envelope-based strategies. Engineering researchers have also proposed temporal fine structure enhancement strategies, for example in [51–53]. To simulate such strategies, conventional channel vocoders must be modified to include temporal fine structure cues. It should be noted that actual CI experiments will always be the gold standard for evaluating a new signal processing strategy, although simulation experiments may provide useful insights.

7 Conclusion

The most popular acoustic modelling methods for CI hearing are channel vocoders using sine-wave and noise carriers. Most critical physical parameters, including channel number, frequency allocation, current spread, temporal envelope fluctuation rate, and intensity and dynamic range, can be simulated by channel vocoders, making them powerful tools for simulating many aspects of cochlear implant hearing. However, their limitations are also significant. Many recent studies still use channel vocoders with no more than eight channels to simulate actual CI hearing with 12 to 24 channels, simply because their performance is similar in some speech tasks. To make vocoded hearing and CI hearing with the same channel number comparable in performance, current spread and frequency shifting can be manipulated. In addition, alternative methods using novel carrier signals may provide a more direct and quantitatively practical simulation of cochlear implants. Channel-vocoder-centric modelling of cochlear implants is a useful method, but researchers should interpret the vocoded results with care, especially for sound quality judgments and for evaluations of novel temporal fine structure strategies.

Acknowledgements This work is jointly supported by the National Natural Science Foundation of China (11704129 and 61771320), the Guangdong Basic and Applied Basic Research Foundation (2020A1515010386), and the Science and Technology Program of Guangzhou (202102020944).

References

1. Moore BCJ (2013) An introduction to the psychology of hearing, 6th edn. Brill, Leiden
2. Zeng FG, Rebscher S, Harrison W, Sun X, Feng H (2008) Cochlear implants: system design, integration, and evaluation. IEEE Rev Biomed Eng 1:115–142
3. Zhou H, Wang N, Zheng N, Yu G, Meng Q (2020) A new approach for noise suppression in cochlear implants: a single-channel noise reduction algorithm. Front Neurosci 14:301
4. Kressner AA, Westermann A, Buchholz JM (2018) The impact of reverberation on speech intelligibility in cochlear implant recipients. J Acoust Soc Am 144:1113–1122
5. Nogueira W, Nagathil A, Martin R (2019) Making music more accessible for cochlear implant listeners: recent developments. IEEE Signal Process Mag 36:115–127
6. Meng Q, Zheng N, Mishra AP, Luo JD, Schnupp JW (2018) Weighting pitch contour and loudness contour in mandarin tone perception in cochlear implant listeners. In: Interspeech
7. Gaudrain E, Baskent D (2018) Discrimination of voice pitch and vocal-tract length in cochlear implant users. Ear Hear 39:226–237
8. Jones H, Kan A, Litovsky RY (2014) Comparing sound localization deficits in bilateral cochlear-implant users and vocoder simulations with normal-hearing listeners. Trends Hear
9. Li Y, Zhang G, Kang HY, Liu S, Han D, Fu QJ (2011) Effects of speaking style on speech intelligibility for Mandarin-speaking cochlear implant users. J Acoust Soc Am 129:EL242–EL247
10. Meng Q et al (2019) Time-compression thresholds for Mandarin sentences in normal-hearing and cochlear implant listeners. Hear Res 374:58–68
11. Faulkner KF, Pisoni DB (2013) Some observations about cochlear implants: challenges and future directions. Neurosci Disc 1:9
12. Zeng FG et al (2015) Development and evaluation of the Nurotron 26-electrode cochlear implant system. Hear Res 322:188–199
13. Zeng FG (2017) Challenges in improving cochlear implant performance and accessibility. IEEE Trans Biomed Eng 64:1662–1664
14. Boulet J, White M, Bruce IC (2016) Temporal considerations for stimulating spiral ganglion neurons with cochlear implants. J Assoc Res Otolaryngol 17:1–17
15. Cosentino S, Carlyon RP, Deeks JM, Parkinson W, Bierer JA (2016) Rate discrimination, gap detection and ranking of temporal pitch in cochlear implant users. J Assoc Res Otolaryngol 17:371–382
16. Blamey PJ, Dowell RC, Tong YC, Clark GM (1984) An acoustic model of a multiple-channel cochlear implant. J Acoust Soc Am 76:97–103
17. Wilson BS (2015) Getting a decent (but sparse) signal to the brain for users of cochlear implants. Hear Res 322:34–38
18. Wilson BS, Finley CC, Lawson DT, Wolford RD, Eddington DK, Rabinowitz WM (1991) Better speech recognition with cochlear implants. Nature 352:236–238
19. Shannon RV, Zeng FG, Kamath V, Wygonski J, Ekelid M (1995) Speech recognition with primarily temporal cues. Science 270:303–304
20. Dorman MF, Loizou PC, Rainey D (1997) Speech intelligibility as a function of the number of channels of stimulation for signal processors using sine-wave and noise-band outputs. J Acoust Soc Am 102:2403–2411
21. Goupell MJ (2015) Pushing the envelope of auditory research with cochlear implants. Acoust Today 11:26–33
22. Reiss LA, Gantz BJ, Turner CW (2008) Cochlear implant speech processor frequency allocations may influence pitch perception. Otol Neurotol 29:160
23. Vandali AE, Whitford LA, Plant KL, Clark GM (2000) Speech perception as a function of electrical stimulation rate: using the Nucleus 24 cochlear implant system. Ear Hear 21:608–624
24. Greenwood DD (1990) A cochlear frequency-position function for several species—29 years later. J Acoust Soc Am 87:2592–2605
25. Faulkner A, Rosen S, Stanton D (2003) Simulations of tonotopically mapped speech processors for cochlear implant electrodes varying in insertion depth. J Acoust Soc Am 113:1073–1080
26. Xu K, Willis S, Gopen Q, Fu QJ (2020) Effects of spectral resolution and frequency mismatch on speech understanding and spatial release from masking in simulated bilateral cochlear implants. Ear Hear 41:1362
27. Dorman MF, Loizou PC, Fitzke J, Tu Z (1998) The recognition of sentences in noise by normal-hearing listeners using simulations of cochlear-implant signal processors with 6–20 channels. J Acoust Soc Am 104:3583–3585
28. Croghan NBH, Duran SI, Smith ZM (2017) Re-examining the relationship between number of cochlear implant channels and maximal speech intelligibility. J Acoust Soc Am 142:EL537–EL543
29. Friesen LM, Shannon RV, Baskent D, Wang X (2001) Speech recognition in noise as a function of the number of spectral channels: comparison of acoustic hearing and cochlear implants. J Acoust Soc Am 110:1150–1163
30. Fu QJ, Nogaki G (2005) Noise susceptibility of cochlear implant users: the role of spectral resolution and smearing. J Assoc Res Otolaryngol 6:19–27
31. Oxenham AJ, Kreft HA (2014) Speech perception in tones and noise via cochlear implants reveals influence of spectral resolution on temporal processing. Trends Hear 18
32. Rosen S (1992) Temporal information in speech: acoustic, auditory and linguistic aspects. Phil Trans R Soc Lond Ser B Biol Sci 336:367–373
33. Whitmal NA, Poissant SF, Freyman RL, Helfer KS (2007) Speech intelligibility in cochlear implant simulations: effects of carrier type, interfering noise, and subject experience. J Acoust Soc Am 122:2376–2388
34. Yuan M, Lee T, Yuen KC, Soli SD, van Hasselt CA, Tong MC (2009) Cantonese tone recognition with enhanced temporal periodicity cues. J Acoust Soc Am 126:327–337
35. Fu QJ, Chinchilla S, Galvin JJ (2004) The role of spectral and temporal cues in voice gender discrimination by normal-hearing listeners and cochlear implant users. J Assoc Res Otolaryngol 5:253–260
36. Gaudrain E, Grimault N, Healy EW, Bera JC (2008) Streaming of vowel sequences based on fundamental frequency in a cochlear-implant simulation. J Acoust Soc Am 124:3076–3087
37. Steinmetzger K, Rosen S (2018) The role of envelope periodicity in the perception of masked speech with simulated and real cochlear implants. J Acoust Soc Am 144:885–896
38. Xu L, Thompson CS, Pfingst BE (2005) Relative contributions of spectral and temporal cues for phoneme recognition. J Acoust Soc Am 117:3255–3267
39. Zeng F-G (2004) Compression and cochlear implants. Springer, Heidelberg
40. Loizou PC, Dorman M, Poroy O, Spahr T (2000) Speech recognition by normal-hearing and cochlear implant listeners as a function of intensity resolution. J Acoust Soc Am 108:2377–2387
41. Chen F, Zheng D, Tsao Y (2017) Effects of noise suppression and envelope dynamic range compression on the intelligibility of vocoded sentences for a tonal language. J Acoust Soc Am 142:1157–1166
42. Fu QJ, Shannon RV, Wang X (1998) Effects of noise and spectral resolution on vowel and consonant recognition: acoustic and electric hearing. J Acoust Soc Am 104:3586–3596
43. Stickney GS, Zeng FG, Litovsky R, Assmann P (2004) Cochlear implant speech recognition with speech maskers. J Acoust Soc Am 116:1081–1091
44. Fu QJ, Nogaki G (2005) Noise susceptibility of cochlear implant users: the role of spectral resolution and smearing. J Assoc Res Otolaryngol 6:19–27
45. Iverson P, Smith CA, Evans BG (2006) Vowel recognition via cochlear implants and noise vocoders: effects of formant movement and duration. J Acoust Soc Am 120:3998–4006
46. Luo X, Fu QJ (2004) Enhancing Chinese tone recognition by manipulating amplitude envelope: implications for cochlear implants. J Acoust Soc Am 116:3659–3667
47. Zaltz Y, Goldsworthy RL, Kishon-Rabin L, Eisenberg LS (2018) Voice discrimination by adults with cochlear implants: the benefits of early implantation for vocal-tract length perception. J Assoc Res Otolaryngol 19:193–209
48. Zaltz Y, Goldsworthy RL, Eisenberg LS, Kishon-Rabin L (2020) Children with normal hearing are efficient users of fundamental frequency and vocal tract length cues for voice discrimination. Ear Hear 4:182–193
49. Peters JP et al (2018) The sound of a cochlear implant investigated in patients with single-sided deafness and a cochlear implant
50. Dorman MF et al (2017) The sound quality of cochlear implants: studies with single-sided deaf patients
51. Nie K, Stickney G, Zeng FG (2005) Encoding frequency modulation to improve cochlear implant performance in noise. IEEE Trans Biomed Eng 52:64–73
52. Meng Q, Zheng N, Li X (2016) Mandarin speech-in-noise and tone recognition using vocoder simulations of the temporal limits encoder for cochlear implants. J Acoust Soc Am 139:301–310
53. Goldsworthy RL (2019) Temporal envelope cues and simulations of cochlear implant signal processing. Speech Commun
54. Meng Q, Zhou H, Lu T, Zeng FG (2022) Gaussian-Enveloped Tones (GET): a vocoder that can simulate pulsatile stimulation in cochlear implants. medRxiv
55. Brochier T, Schlittenlacher J, Roberts I, Goehring T, Jiang C, Vickers D, Bance M (2022) From microphone to phoneme: an end-to-end computational neural model for predicting speech perception with cochlear implants. IEEE Trans Biomed Eng