The Stationary Bionic Wavelet Transform and its Applications for ECG and Speech Processing (Signals and Communication Technology)

This book first details a proposed Stationary Bionic Wavelet Transform (SBWT) for use in speech processing.




Signals and Communication Technology

Talbi Mourad

The Stationary Bionic Wavelet Transform and its Applications for ECG and Speech Processing

Signals and Communication Technology

Series Editors
Emre Celebi, Department of Computer Science, University of Central Arkansas, Conway, AR, USA
Jingdong Chen, Northwestern Polytechnical University, Xi'an, China
E. S. Gopi, Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli, Tamil Nadu, India
Amy Neustein, Linguistic Technology Systems, Fort Lee, NJ, USA
H. Vincent Poor, Department of Electrical Engineering, Princeton University, Princeton, NJ, USA

This series is devoted to fundamentals and applications of modern methods of signal processing and cutting-edge communication technologies. The main topics are information and signal theory, acoustical signal processing, image processing and multimedia systems, mobile and wireless communications, and computer and communication networks. Volumes in the series address researchers in academia and industrial R&D departments. The series is application-oriented. The level of presentation of each individual volume, however, depends on the subject and can range from practical to scientific. Indexing: All books in “Signals and Communication Technology” are indexed by Scopus and zbMATH. For general information about this book series, comments or suggestions, please contact Mary James at [email protected] or Ramesh Nath Premnath at [email protected]. More information about this series at https://link.springer.com/bookseries/4748

Talbi Mourad

The Stationary Bionic Wavelet Transform and its Applications for ECG and Speech Processing

Talbi Mourad Laboratory of Nano-materials and Systems for Renewable Energies Center of Researches and Technologies of Energy of Borj Cedria Tunis, Tunisia

ISSN 1860-4862          ISSN 1860-4870 (electronic)
Signals and Communication Technology
ISBN 978-3-030-93404-0          ISBN 978-3-030-93405-7 (eBook)
https://doi.org/10.1007/978-3-030-93405-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Book Summary

This book details our speech and ECG processing techniques previously proposed in the literature. The first technique, detailed in Chap. 1, is a speech enhancement technique based on the stationary bionic wavelet transform (SBWT) and the maximum a posteriori estimator of magnitude-squared spectrum. The experiments were conducted for different sorts of noise and for many speech signals. This proposed technique is evaluated and compared to other popular speech enhancement techniques such as Wiener filtering and MSS-MAP estimation in the frequency domain. This evaluation is performed through the computation of the signal-to-noise ratio (SNR), the segmental SNR, the Itakura–Saito distance (ISd), and the perceptual evaluation of speech quality (PESQ). The results obtained from these computations demonstrated the performance of the proposed speech enhancement technique, which provided sufficient noise reduction and good intelligibility without causing considerable signal distortion or musical background noise. Our second technique, detailed in Chap. 2, is an ECG denoising technique based on the 1-D double-density complex DWT and the SBWT. This proposed technique is evaluated and compared to the 1-D double-density complex DWT denoising technique, the denoising technique based on wavelets and hidden Markov models, the technique based on non-local means, and our previously proposed approach based on the BWT and FWT_TI (forward wavelet transform translation invariant). The simulation results obtained from the computation of the signal-to-noise ratio (SNR), mean absolute error (MAE), peak SNR (PSNR), and cross-correlation (CC) show that this proposed technique outperforms the other ones employed for our evaluation. Our third technique, detailed in Chap. 3, is a speech enhancement technique based on the SBWT and the MMSE estimate of spectral amplitude. The performance of this approach is proved by the results obtained from the computation of the signal-to-noise ratio (SNR), the segmental SNR (SSNR), and the perceptual evaluation of speech quality (PESQ). Our fourth technique, detailed in Chap. 4, is a speech recognition technique based on the SBWT and MFCC, using a multilayer perceptron for voice control. A simulation program employed for testing the performance of the proposed approach showed a classification rate equal to 98%.

Introduction

In this book we deal with the stationary bionic wavelet transform (SBWT) and its applications in different domains of signal processing. It has been applied for speech enhancement, for speech recognition, and also for ECG denoising. This book comprises four chapters. The first and the third detail our previous speech enhancement techniques proposed in the literature; the second details our ECG denoising approach proposed in the literature; and the fourth deals with speech recognition, elaborating upon the speech recognition technique proposed in the literature. Our first speech enhancement technique is based on the stationary bionic wavelet transform and the maximum a posteriori estimator of magnitude-squared spectrum. Our second speech enhancement technique is based on the SBWT and the MMSE estimate of spectral amplitude. Our proposed ECG denoising technique, detailed in Chap. 2, is based on the 1-D double-density complex DWT and the SBWT. Our fourth proposed technique is based on the stationary bionic wavelet transform and MFCC, using a multi-layer perceptron for voice control.

For evaluating the performance of the first speech enhancement technique (Chap. 1), we compare it to a number of speech enhancement techniques: the technique based on MSS-MAP estimation, Wiener filtering, and the speech enhancement technique based on the discrete Fourier transform (DFT). This evaluation of the proposed speech enhancement technique and the others is performed by computing the signal-to-noise ratio (SNR), the segmental SNR (SSNR), the Itakura–Saito distance (ISd), and the perceptual evaluation of speech quality (PESQ). For evaluating the second speech enhancement technique (Chap. 3), we compare it to unsupervised speech denoising via perceptually motivated robust principal component analysis, the speech enhancement technique based on MSS-SMPO, and the denoising technique based on the MMSE estimate of spectral amplitude. This evaluation of the second proposed speech enhancement technique and the others is performed by computing the SNR, the SSNR, and the PESQ. For evaluating the proposed ECG denoising technique (Chap. 2), we compare it to four other ECG denoising techniques: the 1-D double-density complex DWT denoising method, the technique based on wavelets and hidden Markov models, the technique based on non-local means, and our previous technique based on the BWT and FWT_TI with hard thresholding, proposed in the literature. This evaluation is performed by computing the SNR, the mean square error (MSE), the peak SNR (PSNR), the mean absolute error (MAE), and the cross-correlation (CC). For evaluating our previous speech recognition technique (Chap. 4), proposed in the literature, we compare it to a number of speech recognition techniques proposed in the literature, which are as follows:

• The feature extraction technique based on MFCC (Mel Frequency Cepstral Coefficients)
• The feature extraction technique based on MFCC using the second-order differential of MFCC, ∆∆MFCC
• The feature extraction technique based on SBWT (Stationary Bionic Wavelet Transform)
• The feature extraction technique based on SBWT with MFCC
• The feature extraction technique based on SBWT with ∆∆MFCC
• The feature extraction technique based on CWT with ∆∆MFCC
• The feature extraction technique based on BWT with MFCC
• The feature extraction technique based on BWT with ∆∆MFCC

Contents

1 Speech Enhancement Based on Stationary Bionic Wavelet Transform and Maximum A Posteriori Estimator of Magnitude-Squared Spectrum
  1.1 Introduction
  1.2 The Proposed Technique
    1.2.1 Background
  1.3 Application of the Maximum A Posteriori Estimator of Magnitude-Squared Spectrum in SBWT Domain
  1.4 The Evaluation Metrics
    1.4.1 Signal-to-Noise Ratio
    1.4.2 Segmental Signal-to-Noise Ratio
    1.4.3 Itakura–Saito Distance
    1.4.4 Perceptual Evaluation of Speech Quality (PESQ)
  1.5 Results and Discussions
  1.6 Conclusion
  References
2 ECG Denoising Based on 1-D Double-Density Complex DWT and SBWT
  2.1 Introduction
  2.2 Materials
    2.2.1 The BWT Optimization for ECG Analysis
    2.2.2 1-D Double-Density Complex DWT
    2.2.3 Denoising Technique Based on Wavelets and Hidden Markov Models
    2.2.4 The Denoising Approach Based on Non-local Means
    2.2.5 The ECG Denoising Approach Based on BWT and FWT_TI
    2.2.6 The Proposed ECG Denoising Approach
  2.3 Results and Discussion
  2.4 Conclusion
  References
3 Speech Enhancement Based on SBWT and MMSE Estimate of Spectral Amplitude
  3.1 Introduction
  3.2 The MMSE Estimate of Spectral Amplitude
    3.2.1 Signal Model
  3.3 The Proposed Speech Enhancement Approach
  3.4 Minimum Mean Square Error (MMSE) Estimate of Spectral Amplitude in the SBWT Domain
  3.5 Unsupervised Speech Denoising Via Perceptually Motivated Robust Principal Component Analysis
  3.6 The Speech Enhancement Technique Based on MSS–SMPO
  3.7 Results and Discussion
  3.8 Conclusion
  References
4 Arabic Speech Recognition by Stationary Bionic Wavelet Transform and MFCC Using a Multi-layer Perceptron for Voice Control
  4.1 Introduction
  4.2 The Feature Extraction
    4.2.1 MFCC Extraction
  4.3 Pre-emphasis
  4.4 Frame Blocking and Windowing
  4.5 DFT Spectrum
  4.6 Mel Spectrum
  4.7 Discrete Cosine Transform (DCT)
  4.8 Dynamic MFCC Features
  4.9 Classifiers
  4.10 The Proposed Speech Recognition Technique
  4.11 Experiments and Results
  4.12 Conclusion
  References
Index

About the Author

Talbi Mourad is an Assistant Professor of Electrical Engineering (Signal Processing) at the Center of Researches and Technologies of Energy of Borj Cedria, Tunis, Tunisia. He obtained his master's degree in automatics and signal processing from the National Engineering School of Tunis in 2004, his PhD in electronics from the Faculty of Sciences of Tunis, Tunis El-Manar University, and his HDR in electronics from the Faculty of Sciences of Tunis.


Acronyms

ADMM  Alternating Direction Technique of Multipliers
ANN  Artificial Neural Network
ASMF  Adaptive Switching Mean Filter
ASR  Automatic Speech Recognition
BWT  Bionic Wavelet Transform
BWT−1  Inverse of BWT
CB-UWP  Critical Bands–Undecimated Wavelet Package
CC  Cross-Correlation
CWT  Continuous Wavelet Transform
DCT  Discrete Cosine Transform
DDAE  Deep Denoising Autoencoder
DFT  Discrete Fourier Transform
DWT  Discrete Wavelet Transform
ECG  Electrocardiogram
EMD  Empirical Mode Decomposition
EMG  Electromyogram
Err  Error
FFT  Fast Fourier Transform
FIR  Finite Impulse Response
FT  Fourier Transform
FWT_TI  Forward Wavelet Transform Translation Invariant
GA  Genetic Algorithm
GFCC  Gammatone-Frequency Cepstral Coefficient
GFCCs  Gammatone Filter Cepstral Coefficients
GWN  Gaussian White Noise
HMMs  Hidden Markov Models
IMFs  Intrinsic Mode Functions
ISd  Itakura–Saito Distance
IWT_TI  Inverse of FWT_TI
LLR  Log-Likelihood Ratio
LM  Local Means
LPC  Linear Prediction Cepstrum
LPCC  Linear Prediction Cepstrum Coefficient
LWT  Lifting Wavelet Transform
MAE  Mean Absolute Error
MFCCs  Mel Frequency Cepstral Coefficients
∆MFCC  First Order Differential of MFCC
∆∆MFCC  Second Order Differential of MFCC
MLP  Multi-Layer Perceptron
MMI  Maximum Mutual Information
MMSE  Minimum Mean Square Error
MPE  Minimum Phone Error
MSE  Mean Square Error
MSS-MAP  Maximum a Posteriori Estimator of Magnitude-Squared Spectrum
NLM  Non-Local Means
NMF  Nonnegative Matrix Factorization
PCA  Principal Component Analysis
PESQ  Perceptual Evaluation of Speech Quality
PLP  Perceptual Linear Predictive
PSNR  Peak SNR
PSO  Particle Swarm Optimization
Qeq  Time-Varying Quality Factor
QMF  Quadrature Mirror Filter
SBWT  Stationary Bionic Wavelet Transform
SBWT−1  Inverse of SBWT
SE  Speech Enhancement
SNR  Signal-to-Noise Ratio
SPP  Speech Presence Probability
SS  Spectral Subtraction
SSNR  Segmental SNR
STOI  Short-Time Objective Intelligibility
STSA  Short-Time Spectral Amplitude Estimation
logSTSA  Short-Time Log-Spectral Amplitude Estimation
SWT  Stationary Wavelet Transform
SWT−1  Inverse of SWT
T(a, τ)  Time-Varying Linear Factor
TQWT  Tunable Q-factor-based Wavelet Transform
VAD  Voice Activity Detector
VLPs  Ventricular Late Potentials
WPT  Wavelet Packet Transform
WSS  Weighted Spectral Slope
WT  Wavelet Transform
WTD  Wavelet Thresholding Denoising

Chapter 1

Speech Enhancement Based on Stationary Bionic Wavelet Transform and Maximum A Posteriori Estimator of Magnitude-Squared Spectrum

1.1  Introduction

Speech enhancement is strongly required in the speech processing domain. Speech quality refers to a speech sample being pleasant, clear, and compatible with other speech processing techniques [1]. The principal idea behind speech enhancement consists in cancelling the background noise [2]. Echo cancellation is another key issue that must be addressed during speech enhancement [3, 4]. Speech recorded in a natural environment includes background noise as well as echo [4]. Echoless speech, by contrast, is captured in a special anechoic room and sounds dry and dull to the human ear [4]. Echo suppression is also needed for speech samples collected in large houses or halls: when the distance between the microphone and the speaker is large, the speech can pick up some echo [4]. Speech enhancement is also needed for telephone conversations, where it requires real-time processing [5] to obtain quality sound from the speaker. In current telephone networks, speech is band-limited to 300–3400 Hz [4]. With recent developments, the band limit can be increased up to 7500 Hz or even higher, so that people will be able to hold a telephone conversation even when standing farther from the telephone [4]. In those cases, speech enhancement will be required as well: the situation is quite similar to that of a speaker and a microphone separated by a long distance, and the likelihood of background noise and echo is higher than expected. The cancellation of background noise [6, 7] is required when people hold a telephone conversation in a noisy environment or in the street [8], and also [9] when the pilot of an airplane sends speech signals from the cockpit to the ground or cabin. Speech enhancement is further needed for hearing aids [4], for which many research works on speech enhancement have been performed. The applications of speech enhancement go beyond [10] these real-life examples: during criminal investigations, for instance, speech enhancement plays a vital


role in suppressing background noise from a speech sample and identifying or classifying the target speech. Speech enhancement is also needed in some speech recognition systems, which enhance the speech quality prior to feature extraction, selection, and classification [11]. Overall, speech enhancement has a large range of applications and is necessary for almost all devices, systems, and techniques related to speech signals.

There are two principal conceptual criteria for assessing the performance of speech enhancement systems. The first is the quality of the enhanced speech signal, which reflects its clarity, the extent of remaining distortion, and the residual noise level; quality is a subjective assessment of how satisfied the listener is with the enhanced speech signal. The second criterion examines the intelligibility of the enhanced signal; this is an objective assessment giving the percentage of words that listeners [12–14] will correctly identify. A speech signal may combine good quality with poor intelligibility, and vice versa. A great number of speech enhancement systems increase signal quality to the detriment of intelligibility [4]. By listening carefully, listeners typically gain more details from the noisy speech signal [4, 15] than from the enhanced signal; this is apparent from the data processing theorem of information theory [4]. Nevertheless, listeners experience tiredness during lengthy listening sessions, which makes the degraded message less intelligible [4]. When such situations occur, the intelligibility of the enhanced signal may be higher than that of the noisy signal. The listener usually needs less effort to decode those parts of the enhanced speech signal that correspond to high signal-to-noise-ratio segments of the noisy signal [4].

Real-world noise signals are non-stationary and are a mixture of more than one non-stationary noise signal [16]. Most classical speech enhancement techniques concentrate primarily on speech corrupted by a single noise, which is far from real-world environments. In [16], Samba Raju Chiluveru and Manoj Tripathy discussed speech enhancement in real-world environments with a new speech feature. The novelty of [16] was threefold: (1) the proposed model was analyzed in real-world environments [16]; (2) the model used discrete wavelet transform (DWT) coefficients as input features; (3) the proposed deep denoising autoencoder (DDAE) was designed experimentally [16]. The proposed feature was compared with conventional speech features such as FFT amplitude, log magnitude, Mel-frequency cepstral coefficients (MFCCs), and Gammatone filter cepstral coefficients (GFCCs), and the performance of the speech enhancement technique proposed in [16] was compared to classical speech enhancement. The enhanced signal was evaluated with speech quality measures such as perceptual evaluation of speech quality (PESQ), weighted spectral slope (WSS), and log-likelihood ratio (LLR); speech intelligibility was measured with short-time objective intelligibility (STOI). The results show that the proposed speech enhancement model with the DWT feature improves quality and intelligibility in all real-world environmental signal-to-noise ratio (SNR) conditions.


The tunable Q-factor-based wavelet transform (TQWT) is a new technique used for the speech enhancement (SE) task [17]. In the TQWT, the controlling parameters, the Q factor and the decomposition level J, are kept constant for diverse noise conditions, which degrades the overall SE performance. In general, the performance of SE is computed in terms of quality and intelligibility, although it has been reported that these two evaluation parameters do not usually correlate with each other due to the degradations introduced by SE algorithms. These two important issues were addressed in [17], and satisfactory solutions were provided by using a multi-objective formulation to find the optimal values of the Q and J of the TQWT algorithm at diverse noise levels. Moreover, for correctly estimating the appropriate values of Q and J from the noisy speech, a low-complexity functional link artificial neural network-based model was developed in [17]. To assess the performance of the hybrid approach proposed in [17], objective and subjective evaluation tests were carried out employing three standard noisy speech data sets, and the results were compared with six recently reported SE techniques. In both the objective and the subjective evaluation tests, the hybrid approach proposed in [17] outperformed the other six SE techniques.

In this chapter, we will detail our speech enhancement technique previously proposed in [16]. This technique is based on the stationary bionic wavelet transform (SBWT) [16] and the maximum a posteriori estimator of magnitude-squared spectrum (MSS-MAP) [18]. The rest of this chapter is organized as follows: in Sect. 1.2, we detail the proposed speech enhancement technique. In Sect. 1.3, we describe the application of the MSS-MAP estimator in the SBWT domain. In Sect. 1.4, we deal with the evaluation metrics. In Sect. 1.5, we present results and discussion, and in Sect. 1.6, we conclude.

1.2  The Proposed Technique

In this section, we detail our speech enhancement technique proposed in [16]. It is based on the stationary bionic wavelet transform (SBWT) and the maximum a posteriori estimator of magnitude-squared spectrum (MSS-MAP). The SBWT was introduced to solve the problem of perfect reconstruction associated with the bionic wavelet transform (BWT) [18–20]. The MSS-MAP estimation is employed for speech estimation in the SBWT domain. The block diagram of the proposed technique is presented in Fig. 1.1. According to this figure, the SBWT is first applied to the noisy speech signal to obtain eight noisy stationary bionic wavelet coefficient sub-bands, wb_i, 1 ≤ i ≤ 8. Each of those sub-bands is then denoised by applying the MSS-MAP estimation technique, yielding eight denoised sub-bands, wd_i, 1 ≤ i ≤ 8, to which the inverse of the SBWT (SBWT−1) is applied to obtain the enhanced speech signal. In the following sub-section, we deal with the background of the proposed technique, including the bionic wavelet transform (BWT).



Fig. 1.1  The flowchart of the proposed speech enhancement technique [16]

1.2.1  Background

1.2.1.1  Wavelet Analysis [21, 22]

The continuous wavelet transform (CWT) of a signal x(t) is expressed as follows:



$$X_{\mathrm{CWT}}(a,\tau)=\left\langle x,\varphi_{a,\tau}\right\rangle=\frac{1}{\sqrt{a}}\int x(t)\,\varphi^{*}\!\left(\frac{t-\tau}{a}\right)dt \qquad (1.1)$$

with φ representing the mother wavelet selected for the wavelet transform, and the variables a and τ the scale and time shift, respectively. If the mother wavelet satisfies the admissibility criteria [23, 24], then the inverse of the wavelet transform exists. The idea of wavelets originated with the Gabor transform [25], a windowed Fourier transform conceived so that the duration of the time localization window varies with frequency. A wavelet representation provides advantages over conventional Fourier analysis in that the time support of the wavelet employed to perform the correlation in Eq. (1.1) varies as a function of scale, so that the length of the analysis window matches the frequency of interest, trading off frequency and time resolutions [24]. The variables a and τ of the CWT can be discretized and, in a great number of cases, still provide a complete representation of the underlying signal, provided that the mother wavelet meets certain requirements. This can be considered a sort of multiresolution analysis where, at each scale, the signal is represented at a different level of detail. When the discretization of the scale and time variables is dyadic, so that τ = n·2^m and a = 2^m, an effective implementation can be obtained through a quadrature mirror filter (QMF) decomposition at each level, where matching high-pass and low-pass filter bank coefficients characterize the convolution with the mother wavelet, and downsampling by 2 at each level equates to the doubling of the time interval according to scale [24]. This implementation of a multiresolution filter bank is referred to as the discrete wavelet transform (DWT) and exists provided that the family of wavelets generated by dyadic scaling and translation forms an orthonormal basis set. The bionic wavelet transform (BWT) employed in [24] is based on the Morlet mother wavelet, for which the DWT representation is not possible, so a CWT coupled with fast numerical integration methods is employed instead to generate a set of discretized wavelet coefficients [24]. A further generalization of the DWT is the wavelet packet transform (WPT), also based on a filter bank decomposition method. In this case, the filtering process is iterated on both the low- and high-frequency components, rather than continuing only on the low-frequency terms as with the DWT. In Fig. 1.2, the decomposition tree of the DWT (Fig. 1.2a) and that of the WPT (Fig. 1.2b) are illustrated. The depth of the wavelet packet tree illustrated in Fig. 1.2b can be varied over the available frequency range, resulting in a configurable filter bank decomposition [24]. This idea was employed to create customized wavelet packet transforms whose filter banks match a perceptual auditory scale, such as the Bark scale, for use in speech representation, enhancement, and coding [25–30]. The employment of the Bark-scaled WPT for speech enhancement has so far indicated a small but significant gain in overall speech enhancement quality thanks to this perceptual specialization [24]. This perceptual WPT, employing auditory critical band scaling following Cohen's research work [27] as illustrated in Fig. 1.3, is implemented in [24].

Fig. 1.2  (a) The decomposition tree (decomposition level 4) associated with the DWT and (b) the decomposition tree (decomposition level 4) associated with the WPT [24]


Fig. 1.3  Perceptually scaled WPT, with leaf-node center frequencies following an approximately critical band scaling
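To make the two tree structures concrete, here is a minimal sketch (assuming the PyWavelets package, pywt, as an illustrative dependency) that builds both decompositions on a random signal; it is a demonstration only, not code from the book:

```python
import numpy as np
import pywt

x = np.random.randn(1024)  # stand-in for a speech frame

# DWT (Fig. 1.2a): only the low-pass branch is split again at each level,
# yielding one approximation band plus one detail band per level.
dwt_coeffs = pywt.wavedec(x, 'db10', level=4)
print([len(c) for c in dwt_coeffs])        # 5 coefficient arrays

# WPT (Fig. 1.2b): both branches are split at every level, yielding
# 2**4 = 16 leaf nodes at depth 4.
wp = pywt.WaveletPacket(data=x, wavelet='db10', maxlevel=4)
print(len(wp.get_level(4, order='freq')))  # 16 frequency-ordered sub-bands
```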

comparing them to a certain threshold. With a good choice of representation, this thresholding will suppress noise while conserving signal properties. To address the fact that many sorts of signals have substantial non-stationarity and cannot be well represented by a single fixed set of parameters, it is possible to make the wavelet transform adaptive, such that characteristics of the transform change over time as a function of the underlying signal characteristics [24]. There are many possible techniques to adaptive wavelet enhancement, including adaptation of the wavelet basis, adaptation of the wavelet packet configuration, direct adaptation of the time and scale variables, or adaptation of the thresholds or thresholding approaches employed. Of these, the most common approach is to employ a time-varying threshold or gain function based on an a priori energy or SNR measure [25–30]. The bionic wavelet transform (BWT) decomposition employed in [19, 20, 24] is both perceptually scaled and adaptive. The initial perceptual aspect of the transform comes from the logarithmic spacing of the baseline scale variables, which are designed to match basilar membrane spacing [24]. Consequently, two adaptation factors control the time support employed at each scale, based on a nonlinear perceptual model of the auditory system, as detailed in the following sub-section. 1.2.1.2  The Bionic Wavelet Transform The BWT was introduced by Yao et al. [19, 20] as an adaptive wavelet transform conceived specially for modelling the human auditory system. The basis for this transform is the Giguere–Woodland nonlinear transmission line model of the auditory system [31, 32], an active feedback electro-acoustic model incorporating the auditory canal, middle ear, and cochlea. The model leads estimates of the time-­ varying acoustic compliance and resistance along the displaced basilar membrane, as a function of the physiological acoustic mass, cochlear frequency-position mapping, and feedback factors representing the active mechanisms of the outer hair cells

1.2  The Proposed Technique

7

[24]. The net result can be viewed as a technique for estimating the time-varying quality factor Qeq of the cochlear filter banks as a function of the input sound waveform [24]. See [19, 20, 31, 32] for complete details on this model. The adaptive nature of the BWT is insured by a time-varying linear factor T(a, τ) representing the scaling of the cochlear filter bank quality factor Qeq at each scale over time. Incorporating this directly into the scale factor of a Morlet wavelet, we have X BWT ( a,τ ) =

$$X_{\mathrm{BWT}}(a,\tau)=\frac{1}{\sqrt{T(a,\tau)}}\,\frac{1}{\sqrt{a}}\int x(t)\,\tilde{\varphi}^{*}\!\left(\frac{t-\tau}{a\cdot T(a,\tau)}\right)e^{-jw_{0}\left(\frac{t-\tau}{a}\right)}dt \qquad (1.2)$$

where

$$\tilde{\varphi}(t)=e^{-\left(\frac{t}{T_{0}}\right)^{2}} \qquad (1.3)$$

is the amplitude envelope of the Morlet wavelet, T0 is the initial time support, and w0 is the base fundamental frequency of the unscaled mother wavelet, here taken as w0 = 15,165.4 Hz for the human auditory system, per Yao and Zhang's original work [19]. The discretization of the scale variable a is accomplished employing pre-determined logarithmic spacing across the desired frequency range, so that the center frequency at each scale is given by the formula x_m = x0/(1.1623)^m, m = 0, 1, 2, .... For this implementation, based on Yao and Zhang's original work for cochlear implant coding [20], coefficients at 22 scales, m = 7, ..., 28, are computed employing numerical integration of the CWT. Those 22 scales correspond to center frequencies logarithmically spaced from 225 Hz to 5300 Hz. The BWT adaptation factor T(a, τ) for each scale and time is calculated using the following update equation:

$$T(a,\tau+\Delta\tau)=\left[1-G_{1}\,\frac{C_{s}}{C_{s}+\left|X_{\mathrm{BWT}}(a,\tau)\right|}\right]\cdot\frac{1}{1+G_{2}\left|\frac{\partial}{\partial t}X_{\mathrm{BWT}}(a,\tau)\right|} \qquad (1.4)$$

where G1 is the active gain factor representing the outer hair cell active resistance function, G2 is the active gain factor representing the time-varying compliance of the basilar membrane, and Cs = 0.8 is a constant representing nonlinear saturation effects in the cochlear model [19]. In practice, the partial derivative in Eq. (1.4) is approximated employing the first difference of the previous points of the BWT at that scale. According to Eq. (1.2), the adaptation factor T(a, τ) affects the duration of the amplitude envelope of the wavelet but does not affect the frequency of the associated complex exponential. Consequently, one useful way to think of the BWT is as a mechanism for adapting the time support of the underlying wavelet according to the quality factor Qeq of the corresponding cochlear filter model at each scale. The key parameters G1, G2, and T0 are discussed in detail in [24].
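As an illustration of how these pieces fit together (a sketch, not the authors' implementation), the snippet below computes the 22 logarithmically spaced center frequencies and performs one step of the Eq. (1.4) update, with the partial derivative replaced by a first difference; the G1 and G2 values here are placeholders, since their actual values are discussed in [24]:

```python
import numpy as np

# Center frequencies x_m = x0 / 1.1623**m for m = 7..28 (22 scales).
x0 = 15165.4
center_freqs = x0 / 1.1623 ** np.arange(7, 29)
print(center_freqs[0], center_freqs[-1])   # ~5292 Hz down to ~225 Hz

def update_T(X_curr, X_prev, G1=0.5, G2=0.5, Cs=0.8, dt=1.0):
    """One step of the Eq. (1.4) update at a single scale.

    X_curr and X_prev are the BWT coefficient magnitudes at the current
    and previous time instants; G1 and G2 are placeholder gain values."""
    saturation = 1.0 - G1 * Cs / (Cs + abs(X_curr))
    compliance = 1.0 / (1.0 + G2 * abs(X_curr - X_prev) / dt)
    return saturation * compliance
```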


It can be shown [20] that the resulting BWT coefficients X_BWT(a, τ) can be calculated as the product of the original WT coefficients, X_WT(a, τ), and a multiplying constant K(a, τ), which is a function of the adaptation factor T(a, τ). For the Morlet wavelet, this adaptive multiplying factor can be expressed as follows:

$$X_{\mathrm{BWT}}(a,\tau)=K(a,\tau)\cdot X_{\mathrm{WT}}(a,\tau) \qquad (1.5)$$

$$K(a,\tau)=\sqrt{\frac{\pi}{C}}\,\frac{T_{0}}{\sqrt{1+T^{2}(a,\tau)}} \qquad (1.6)$$



where C is a normalizing constant calculated from the integral of the squared mother wavelet. This representation yields an effective computational technique for calculating the BWT coefficients directly from those of the WT, without performing the numerical integration of Eq. (1.2) at each scale and time [24]. There are several key differences between the discretized CWT employing the Morlet wavelet, used for the BWT, and a filter bank-based WPT employing an orthonormal wavelet such as the Daubechies family, used for the comparative baseline technique. One is that the WPT is perfectly reconstructable, whereas the discretized CWT is an approximation whose exactness depends on the number and placement of the selected frequency bands. Another difference, related to this idea, is that the Morlet mother wavelet consists of a single frequency with an exponentially decaying time support, whereas the frequency support of the orthonormal wavelet families employed for DWTs and WPTs covers a broader bandwidth. Consequently, the Morlet wavelet is more "frequency focused" along each scale, which is what permits the direct adaptation of the time support with minimal impact on the frequency support, the central mechanism of the BWT adaptation [24].

1.2.1.3  The Stationary Bionic Wavelet Transform (SBWT)

As previously mentioned, in our previous work [16] we applied the SBWT for speech enhancement. This transform is obtained by replacing the discretized CWT employed in the BWT application by the stationary wavelet transform (SWT). Figure 1.4 shows the different steps of the SBWT and its inverse, SBWT−1. According to this figure, the stationary bionic wavelet coefficients are obtained by multiplying the stationary wavelet coefficients by the K factor (Eq. (1.5)); those stationary wavelet coefficients are obtained by applying the SWT to the input signal. The steps of the SBWT application are the same as those followed in the BWT application, except that the SWT is applied to the input signal instead of the discretized CWT. The reconstructed signal is finally obtained by first multiplying the stationary bionic wavelet coefficients by 1/K and then applying the SWT−1 to the resulting coefficients. In the SWT implementation, we employed the Daubechies mother wavelet with ten vanishing moments.

Fig. 1.4  The stationary bionic wavelet transform (SBWT) and its inverse (SBWT−1)
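A minimal sketch of this forward/inverse pair, assuming PyWavelets for the SWT and treating K as a fixed per-level factor for illustration (in the full SBWT, K(a, τ) varies with time through T(a, τ)):

```python
import numpy as np
import pywt

def sbwt(x, K, wavelet='db10'):
    # SWT in place of the discretized CWT; one K factor per level.
    coeffs = pywt.swt(x, wavelet, level=len(K))   # [(cA_i, cD_i), ...]
    return [(cA * k, cD * k) for (cA, cD), k in zip(coeffs, K)]

def isbwt(bionic, K, wavelet='db10'):
    # Undo the K scaling, then invert the SWT.
    coeffs = [(cA / k, cD / k) for (cA, cD), k in zip(bionic, K)]
    return pywt.iswt(coeffs, wavelet)

x = np.random.randn(1024)       # length must be divisible by 2**level
K = [0.9, 1.1, 1.3]             # hypothetical per-level K factors
err = np.max(np.abs(x - isbwt(sbwt(x, K), K)))
print(err)                      # maximum reconstruction error: near machine precision
```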

Fig. 1.5  Filter bank implementation of SWT


The Stationary Wavelet Transform (SWT)

In both the WPT and the DWT, the coefficients are downsampled after filtering, which removes redundancy and allows the same pair of filters to be employed at different levels. As a result, these transforms suffer from a lack of shift invariance: small shifts in the input signal can cause major variations in the distribution of energy between coefficients at different levels and can cause errors in reconstruction [33]. This problem is addressed in the SWT by eliminating the downsampling steps after filtering at each level. Without downsampling, the number of coefficients at each level is the same as the length of the original signal. Figure 1.5 illustrates a signal decomposition up to two levels obtained by applying the SWT. When the downsampling operators are eliminated from a filter bank decomposition, the high-pass and low-pass filters must be modified for the next level of decomposition: the filters at each level are upsampled by inserting zeros between the filter coefficients of the previous level, a procedure known as the à trous algorithm [34]. Denoising by thresholding in the SWT domain follows the same three steps as denoising by thresholding in the DWT domain [33].

Perfect Reconstruction of SBWT

For checking the perfect reconstruction of the SBWT, we compared the reconstructed and original signals for the two transforms, BWT and SBWT. This comparison is in terms of an error (Err) computed as follows:

$$\mathrm{Err}=\max\left(\left|x-y\right|\right) \qquad (1.7)$$



where x and y are the clean and the reconstructed signals, respectively. For computing this error (Eq. (1.7)), we first enhanced the speech signal with the MSS-MAP-based technique [18]. The application of this technique [18], or of another one, is necessary because the clean speech signal is generally not available; only the noisy speech signal is. Consequently, in order to calculate the error between the original and the reconstructed signals, we should first eliminate the noise degrading the original signal. In Fig. 1.6, the noisy speech signal is obtained by degrading the clean speech signal with noise, selected here to be car noise with SNR = 10 dB. For testing the perfect reconstruction of the SBWT, we used 20 speech signals: ten speech sentences pronounced by a male voice in the Arabic language and ten others pronounced by a female voice, also in Arabic. Those 20 signals are listed in Table 1.1. In Tables 1.2 and 1.3, the values of the error (Eq. (1.7)) between the original speech signal, x, and the reconstructed one, y, are listed. The latter is obtained by the application of the BWT and its inverse, BWT−1, to x, or by the application of the SBWT and its inverse, SBWT−1, to x. According to the values listed in Tables 1.2 and 1.3, the SBWT yields lower error values between the original signal x and the reconstructed signal y than those obtained when applying the BWT.

Fig. 1.6  The procedure of verifying the perfect reconstruction of the wavelet transforms (BWT or SBWT)



Table 1.1  The used speech signals

Table 1.2  Case of female voice: Err = max(|x − y|) by type of wavelet transform

Speech signal   BWT with 22 scales   BWT with 30 scales   SBWT with 8 scales
Signal 1        0.0694               0.0676               7.0342e-06
Signal 2        0.1428               0.1429               9.72901e-06
Signal 3        0.0700               0.1877               1.5658e-05
Signal 4        0.2062               0.0705               1.4170e-05
Signal 5        0.0527               0.0418               1.4137e-05
Signal 6        0.1633               0.1614               1.1788e-05
Signal 7        0.2305               0.2294               1.4955e-05
Signal 8        0.1629               0.0636               1.0856e-05
Signal 9        0.1585               1.1014               1.2150e-05
Signal 10       0.0677               0.0623               2.1509e-09

The BWT introduces some distortions on the reconstructed speech signals compared to the original ones, especially when the number of scales is N = 22. Also, for the BWT, the error between the original signal, x, and the reconstructed signal, y (Tables 1.2 and 1.3), is reduced when using N = 30 instead of N = 22 [16].

1.3  Application of the Maximum A Posteriori Estimator of Magnitude-Squared Spectrum in SBWT Domain

In general, classical speech enhancement approaches based on thresholding in the wavelet domain can cause some distortions of the original speech signal and loss of information. This occurs especially for unvoiced sounds.


Table 1.3  Case of male voice: Err = max(|x − y|) by type of wavelet transform

Speech signal   BWT with 22 scales   BWT with 30 scales   SBWT with 8 scales
Signal 1        0.1897               0.0667               1.7974e-05
Signal 2        0.2449               0.1523               1.4011e-05
Signal 3        0.1983               0.1205               1.1984e-05
Signal 4        0.1893               0.0430               1.4847e-05
Signal 5        0.3015               0.0730               1.1492e-05
Signal 6        0.2495               0.1389               0.0068
Signal 7        0.2730               0.1255               1.7819e-05
Signal 8        0.1897               0.1340               1.4949e-05
Signal 9        0.1550               0.0713               1.4087e-05
Signal 10       0.1743               0.0875               1.2989e-05

Consequently, diverse speech enhancement systems based on wavelets employ other tools such as spectral subtraction (SS), Wiener filtering, and MMSE-STSA estimation [35, 36]. This estimation is applied with the undecimated wavelet packet perceptual filter banks in the speech enhancement system proposed by Tasmaz and Ercelebi [35]. In this system [35], the distorted speech signal is decomposed by applying the undecimated wavelet packet perceptual transform, i.e., the perceptual filter bank (CB-UWP: critical bands–undecimated wavelet packet) decomposition. Seventeen critical sub-bands are obtained from this decomposition, by reference to a psychoacoustic model [35]. Each of these critical sub-bands is denoised by employing the speech enhancement technique proposed by Ephraim and Malah [36]. The estimate of the clean speech signal is then obtained by the CB-UWP reconstruction from the denoised sub-band signals. This speech enhancement principle is illustrated in Fig. 1.7. According to Fig. 1.7, the perceptual filter bank (CB-UWP) decomposition of the distorted speech signal is first performed in order to obtain 17 noisy sub-bands, y_i, 1 ≤ i ≤ 17. Then, the speech enhancement algorithm proposed in [36] is applied to each of these sub-bands to obtain 17 denoised sub-bands, x̂_i, 1 ≤ i ≤ 17. Finally, the estimate of the clean speech signal is obtained by the CB-UWP reconstruction from the denoised sub-band signals, x̂_i, 1 ≤ i ≤ 17.

In our speech enhancement system proposed in [16], the CB-UWP decomposition is replaced by the SBWT decomposition and the MMSE-STSA estimation is replaced by the MSS-MAP one, as illustrated in Fig. 1.1. As previously mentioned, the SBWT is introduced to solve the problem of perfect reconstruction associated with the BWT. Moreover, the SBWT, among all wavelet transforms [37, 38], tends to decorrelate the data [39] and makes noise cancellation easier. Furthermore, applying the MSS-MAP in the SBWT domain (Fig. 1.1) to denoise the noisy sub-bands, wb_i, 1 ≤ i ≤ 8, permits a better adaptation for noise and speech estimations compared to applying the MSS-MAP to the entire noisy speech signal.


Fig. 1.7  Block diagram of the speech enhancement technique proposed by Tasmaz and Ercelebi [35]

All those facts motivated us to propose this speech enhancement technique (SBWT/MSS-MAP) [16].
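Under the same assumptions as before, the pipeline of Fig. 1.1 can be prototyped as below, reusing the sbwt/isbwt helpers from the sketch above. The mss_map_denoise function is only a stand-in (a simple universal-threshold shrinkage); the actual MSS-MAP estimator of [18] operates on the magnitude-squared spectrum of each sub-band:

```python
import numpy as np

def mss_map_denoise(band):
    # Placeholder for the MSS-MAP estimator of [18]: soft thresholding
    # with the universal threshold, applied to one sub-band signal.
    sigma = np.median(np.abs(band)) / 0.6745
    t = sigma * np.sqrt(2.0 * np.log(band.size))
    return np.sign(band) * np.maximum(np.abs(band) - t, 0.0)

def enhance(noisy, K):
    bands = sbwt(noisy, K)                              # noisy wb_i
    denoised = [(mss_map_denoise(cA), mss_map_denoise(cD))
                for cA, cD in bands]                    # denoised wd_i
    return isbwt(denoised, K)                           # SBWT^-1
```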

1.4  The Evaluation Metrics

To test the performance of the proposed speech enhancement technique, the objective quality measures SNR, segmental signal-to-noise ratio (SSNR), Itakura–Saito distance, and perceptual evaluation of speech quality (PESQ) were employed.

1.4.1  Signal-to-Noise Ratio

The following formula was employed to compute the SNR of the enhanced speech signals:



$$\mathrm{SNR}\,(\mathrm{dB})=10\cdot\log_{10}\left(\frac{\sum_{n=0}^{N-1}x^{2}(n)}{\sum_{n=0}^{N-1}\left(x(n)-\hat{x}(n)\right)^{2}}\right) \qquad (1.8)$$


where $x(n)$ and $\hat{x}(n)$ are the original and the enhanced signals, respectively, and $N$ is the number of samples in the original signal.

1.4.2  Segmental Signal-to-Noise Ratio

The frame-based segmental SNR is an objective measure of speech quality. It is calculated by averaging frame-level estimates as follows:



$$\mathrm{SSNR}\,(\mathrm{dB})=\frac{1}{M}\sum_{m=0}^{M-1}10\cdot\log_{10}\left(\frac{\sum_{n=N_{m}}^{N_{m}+N-1}x^{2}(n)}{\sum_{n=N_{m}}^{N_{m}+N-1}\left(x(n)-\hat{x}(n)\right)^{2}}\right) \qquad (1.9)$$

where $x(n)$ and $\hat{x}(n)$ represent the original and the enhanced signals, respectively, $M$ is the number of frames, $N$ is the number of samples in each short-time frame, and $N_m$ is the beginning of the $m$-th frame. Since the SNR can become very small and negative during silence periods, the frame-level SNR values are limited to the range of −10 to 35 dB.
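Both measures translate directly into code; below is a sketch assuming time-aligned NumPy arrays and an arbitrary frame length of 256 samples (the text does not fix one):

```python
import numpy as np

def snr_db(x, x_hat):
    # Eq. (1.8)
    return 10.0 * np.log10(np.sum(x**2) / np.sum((x - x_hat)**2))

def ssnr_db(x, x_hat, frame_len=256):
    # Eq. (1.9): average of per-frame SNRs, each clamped to [-10, 35] dB.
    starts = range(0, len(x) - frame_len + 1, frame_len)
    vals = [np.clip(snr_db(x[s:s + frame_len], x_hat[s:s + frame_len]),
                    -10.0, 35.0) for s in starts]
    return float(np.mean(vals))
```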

1.4.3  Itakura–Saito Distance

The Itakura–Saito distance measure, based on the dissimilarity between the clean and the enhanced speech, is computed between sets of linear prediction coefficients (LPC) estimated over synchronous frames. This measure is strongly influenced by spectral dissimilarity due to mismatch in formant locations, with little contribution from errors in matching spectral valleys. Such behavior is desirable, since the auditory system is more sensitive to errors in formant position and bandwidth than to the spectral valleys between peaks. In this work, the average Itakura–Saito measure (as defined by Eq. (1.10)) across all speech frames of a given sentence was computed for the evaluation of the speech enhancement technique.

$$\mathrm{ISd}(a,b)=\frac{(a-b)^{T}R\,(a-b)}{a^{T}R\,a} \qquad (1.10)$$

where $a$ and $b$ are the LPC vectors of the clean speech signal $x(n)$ and of the enhanced speech signal $\hat{x}(n)$, respectively, $R$ is the autocorrelation matrix, and the superscript $T$ denotes transposition.
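For a single pair of synchronous frames, Eq. (1.10) can be sketched as follows, with the LPC vectors obtained by the autocorrelation method via SciPy; the LPC order of 10 is an assumption, not a value taken from the text:

```python
import numpy as np
from scipy.linalg import toeplitz, solve_toeplitz

def lpc_autocorr(frame, order=10):
    # Autocorrelation lags r[0..order]; Levinson-Durbin via solve_toeplitz.
    r = np.correlate(frame, frame, mode='full')[frame.size - 1:][:order + 1]
    coef = solve_toeplitz((r[:-1], r[:-1]), -r[1:])    # solve R a = -r
    return np.concatenate(([1.0], coef)), toeplitz(r)  # LPC vector, R matrix

def isd(clean_frame, enhanced_frame, order=10):
    a, R = lpc_autocorr(clean_frame, order)    # R comes from the clean frame
    b, _ = lpc_autocorr(enhanced_frame, order)
    d = a - b
    return float(d @ R @ d) / float(a @ R @ a)
```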


1.4.4  Perceptual Evaluation of Speech Quality (PESQ)

The PESQ algorithm is an objective quality measure approved as ITU-T recommendation P.862 [40]. It is an objective measurement tool introduced to predict the results of a subjective mean opinion score (MOS) test. It was shown [41, 42] that the PESQ correlates better with MOS than the traditional objective speech measures.
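PESQ scores are usually obtained from an existing P.862 implementation rather than re-derived; one option (an assumption about the toolchain, not something stated in the book) is the pesq package on PyPI:

```python
import numpy as np
from pesq import pesq  # pip install pesq (assumed available)

fs = 16000
clean = np.random.randn(2 * fs).astype(np.float32)     # stand-in signals
enhanced = clean + 0.01 * np.random.randn(2 * fs).astype(np.float32)
print(pesq(fs, clean, enhanced, 'wb'))  # 'wb': wideband mode (P.862.2)
```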

1.5  Results and Discussions

In this section, ten Arabic speech sentences produced by a female speaker and ten others produced by a male speaker are used. These sentences are artificially corrupted in an additive manner with different noise types (car, F16 cockpit, pink, tank, and white noises) at different values of SNR. Those noises are taken from the AURORA database [43]. The employed Arabic sentences (Table 1.1) are phonetically balanced material sampled at 16 kHz. The noisy speech signals were enhanced by employing the proposed approach (SBWT/MSS-MAP) [16], the technique based on MSS-MAP estimation [18], Wiener filtering [17, 45, 46], and the speech enhancement technique based on the discrete Fourier transform (DFT) proposed in [44]. Figures 1.8, 1.9, 1.10, and 1.11 show the curves obtained from the SNR, SSNR, Itakura–Saito distance (ISd), and PESQ computations for the different techniques: the technique based on MSS-MAP estimation [18], the proposed technique (SBWT/MSS-MAP) [16], Wiener filtering [45, 46], and DFT-domain based single-microphone noise reduction [44].

The different curves illustrated in Fig. 1.8 show that all the speech enhancement techniques improve the SNR (SNRf > SNRi). Furthermore, the proposed technique (SBWT/MSS-MAP) [16] outperforms all the other techniques employed for our evaluation. These curves also show that DFT-domain based single-microphone noise reduction [44] outperforms the two other techniques, MSS-MAP [18] and Wiener filtering [17, 45, 46].


Fig. 1.8  Speech signal corrupted by volvo noise (SNRf vs SNRi)



The different curves illustrated in Fig. 1.9 show that all the speech enhancement techniques improve the SSNR (SSNRf > SSNRi). Moreover, the proposed technique outperforms all the techniques applied for our evaluation. Also, according to Fig. 1.9, DFT-domain based single-microphone noise reduction [44] outperforms the two other techniques, MSS-MAP [18] and Wiener filtering [17, 45, 46]. According to the curves illustrated in Fig. 1.10, the proposed speech enhancement technique (SBWT/MSS-MAP) [16] gives the lowest values of ISd compared to the other techniques. Therefore, in terms of ISd, the proposed technique (SBWT/MSS-MAP) [16] outperforms the three other techniques: MSS-MAP [18], Wiener filtering [17, 45, 46], and DFT-domain based single-microphone noise reduction [44]. According to the curves illustrated in Fig. 1.11, the technique (SBWT/MSS-MAP) proposed in [16] and DFT-domain based single-microphone noise reduction [44] outperform the two other techniques, Wiener filtering [17, 45, 46] and MSS-MAP [18]. For the higher values of SNRi, the values of the PESQ after enhancement (PESQf) obtained from the application of the proposed technique (SBWT/MSS-MAP) are almost the same as those obtained from the application of DFT-domain based single-microphone noise reduction [44], whereas for the lower values of SNRi, DFT-domain based single-microphone noise reduction [44] outperforms the proposed technique (SBWT/MSS-MAP) [16]. Figure 1.12 illustrates an example of speech enhancement using the proposed technique; it shows clearly that the proposed technique efficiently reduces the noise while preserving the quality of the original speech signal.


Fig. 1.9  Speech signal corrupted by volvo noise (SSNRf vs SSNRi)

17

Fig. 1.10  Speech signal corrupted by volvo noise (ISdf vs ISdi). [Plot: ISdf versus ISdi curves for MSS-MAP, SBWT/MSS-MAP, Wiener, and DFT-domain based single-microphone noise reduction]

Fig. 1.11  Speech signal corrupted by volvo noise (PESQf vs PESQi). [Plot: PESQf versus PESQi curves for MSS-MAP, SBWT/MSS-MAP, Wiener, and DFT-domain based single-microphone noise reduction]

The evaluation of the different techniques, SBWT/MSS-MAP [16], MSS-MAP estimation [18], and DFT-domain-based single-microphone noise reduction [44], is also performed on a speech sentence taken from the TIMIT database.

Fig. 1.12  Example of speech enhancement applying the speech enhancement technique (SBWT/MSS-MAP) proposed in [16]: (a) clean speech signal, (b) noisy speech signal, (c) enhanced speech signal. [Waveform plots over about 2 s]

This speech sentence is "She had your dark suit in greasy wash water all year," pronounced by a female voice. The sentence is corrupted by car noise with different values of SNR. In Tables 1.4, 1.5, 1.6, and 1.7, the results obtained from the computation of the SNR, the SSNR, the ISd, and the PESQ for the case of volvo noise are listed. The results obtained from the computation of SNR, SSNR, and ISd (Tables 1.4, 1.5, and 1.6) show that the proposed technique (SBWT/MSS-MAP) [16] outperforms the two techniques MSS-MAP [18] and DFT-domain-based single-microphone noise reduction [44]. The results obtained from the PESQ calculation (Table 1.7) show that the DFT-domain-based single-microphone noise reduction [44] outperforms the two other techniques: the proposed technique (SBWT/MSS-MAP) [16] and the MSS-MAP estimation technique [18].
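The segmental SNR and Itakura–Saito values reported in these tables follow standard definitions; the sketch below implements common textbook forms of the two measures (the frame length, clamping limits, and spectral model are illustrative assumptions, and the exact variants used in the book's experiments may differ):

```python
import numpy as np

def segmental_snr(clean, enhanced, frame=256, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs (dB); frame values are clamped to [lo, hi], as is common."""
    n = min(len(clean), len(enhanced)) // frame * frame
    s = clean[:n].reshape(-1, frame)
    e = (clean[:n] - enhanced[:n]).reshape(-1, frame)
    snr = 10.0 * np.log10(np.sum(s**2, axis=1) / (np.sum(e**2, axis=1) + 1e-12))
    return float(np.mean(np.clip(snr, lo, hi)))

def itakura_saito(p_ref, p_test):
    """Itakura–Saito divergence between two power spectra."""
    r = (p_ref + 1e-12) / (p_test + 1e-12)
    return float(np.mean(r - np.log(r) - 1.0))
```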


Table 1.4  SNR computation (case of volvo noise)

SNRi (dB) | SNRf (dB), MSS-MAP [18] | SNRf (dB), SBWT/MSS-MAP [16] | SNRf (dB), DFT-domain based single-microphone noise reduction [44]
−5 | 9.1904 | 12.16981 | 6.7524
0 | 13.7894 | 16.4734 | 11.5116
5 | 18.3689 | 18.8809 | 15.8715
10 | 22.6764 | 23.1859 | 21.4899
15 | 26.2160 | 26.8156 | 26.3789

Table 1.5  SSNR computation (case of volvo noise)

SSNRi (dB) | SSNRf (dB), MSS-MAP [18] | SSNRf (dB), SBWT/MSS-MAP [16] | SSNRf (dB), DFT-domain based single-microphone noise reduction [44]
−6.3572 | 7.1139 | 7.2774 | 4.3178
−3.2400 | 11.0879 | 11.2451 | 8.3123
0.4822 | 15.1413 | 15.4914 | 12.0685
4.8450 | 18.7229 | 19.1495 | 17.7213
9.1621 | 21.8488 | 22.3828 | 21.9582

Table 1.6  ISd computation (case of volvo noise)

ISdi | ISdf, MSS-MAP [18] | ISdf, SBWT/MSS-MAP [16] | ISdf, DFT-domain based single-microphone noise reduction [44]
0.1009 | 0.0171 | 0.0026 | 0.0397
0.0855 | 0.0031 | 2.5812e-04 | 0.0103
0.0572 | 4.1817e-04 | 3.6254e-04 | 9.0442e-04
0.0231 | 1.3195e-04 | 9.3145e-05 | 1.1662e-04
0.0050 | 3.3776e-05 | 1.6663e-05 | 8.4648e-06

Table 1.7  PESQ computation (case of volvo noise)

PESQi | PESQf, MSS-MAP [18] | PESQf, SBWT/MSS-MAP [16] | PESQf, DFT-domain based single-microphone noise reduction [44]
2.7811 | 3.1591 | 3.2993 | 3.4530
3.1403 | 3.4478 | 3.5466 | 3.7164
3.5639 | 3.7728 | 3.8505 | 3.9573
3.8282 | 3.9647 | 3.9998 | 4.1910
4.2065 | 4.0719 | 4.0517 | 4.2520


We have also employed other speech signals and another denoising technique for our evaluation. This technique is the supervised and online nonnegative matrix factorization (NMF)-based noise reduction proposed in [47, 48]. Figures 1.14, 1.15, 1.16, and 1.17 show the different curves obtained from the computation of the SNR, the SSNR, the ISd, and the PESQ for the different values of SNR before speech enhancement. Those results are obtained from the application of the proposed technique (SBWT/MSS-MAP) [16] and the other three techniques, the DFT-domain-based single-microphone noise reduction [44], MSS-MAP estimation [18], and supervised and online NMF-based noise reduction [47, 48], to a speech signal (Fig. 1.13) corrupted by different sorts of noise. This speech signal is sampled at 16 kHz and pronounced in the English language by a male voice. According to the curves in Fig. 1.14 (curves of SNR variation), when the SNR before denoising is higher, the proposed technique (SBWT/MSS-MAP) [16] outperforms the other denoising techniques. However, when the SNRi is lower, the best approach is the supervised and online NMF-based noise reduction technique [47, 48]. According to the curves in Fig. 1.15 (curves of SSNR variation), the proposed technique (SBWT/MSS-MAP) [16] outperforms the other denoising techniques employed in our evaluation.

Fig. 1.13  An example of speech signal (a) degraded by volvo noise (b) and employed for the next evaluation. [Waveform plots over about 3.5 s]


Fig. 1.14  Signal-to-noise ratio after denoising (SNRf) versus signal-to-noise ratio before denoising (SNRi): case of a speech signal (Fig. 1.13) corrupted by volvo noise. [Curves: SBWT/MSS-MAP, MSS-MAP, DFT-domain based single-microphone noise reduction, supervised and online NMF-based noise reduction]

Fig. 1.15  Segmental signal-to-noise ratio after denoising (SSNRf) versus segmental signal-to-noise ratio before denoising (SSNRi): case of a speech signal (Fig. 1.13) corrupted by volvo noise. [Curves: SBWT/MSS-MAP, MSS-MAP, DFT-domain based single-microphone noise reduction, supervised and online NMF-based noise reduction]

Fig. 1.16  Itakura–Saito distance after denoising (ISdf) versus Itakura–Saito distance before denoising (ISdi): case of a speech signal (Fig. 1.13) corrupted by volvo noise. [Curves: SBWT/MSS-MAP, MSS-MAP, DFT-domain based single-microphone noise reduction, supervised and online NMF-based noise reduction]

According to the curves in Fig. 1.16 (curves of ISd variation), the proposed technique (SBWT/MSS-MAP) [16] and the MSS-MAP estimation-based one [18] outperform the other denoising techniques applied for our evaluation. According to the curves in Fig. 1.17 (curves of PESQ variation), when the PESQ before denoising (PESQi) is higher, the DFT-domain-based single-microphone noise reduction technique [44] outperforms the other denoising techniques applied for our evaluation. However, when the PESQi is lower, the supervised and online NMF-based noise reduction technique [47, 48] outperforms the other techniques. When PESQi is higher, the proposed technique is better than the two techniques MSS-MAP estimation [18] and supervised and online NMF-based noise reduction [47, 48]. Figures 1.18, 1.19, 1.20, and 1.21 show other examples of speech enhancement employing the proposed technique (SBWT/MSS-MAP) [16]. Figures 1.22 and 1.23 illustrate another example of speech denoising using the proposed technique (SBWT/MSS-MAP) [16], wherein the spectrograms of the clean speech signal (a), the noisy speech signal (b), and the enhanced speech signal (c) are illustrated. According to Fig. 1.23, the spectrogram (b) shows that the type of noise corrupting the speech signal is localized in low-frequency regions. The spectrogram (c) shows that the noise, which is a car noise, is eliminated efficiently by employing the proposed technique (SBWT/MSS-MAP [16]).
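Spectrograms such as those of Fig. 1.23 can be produced with standard tools; a minimal sketch using SciPy follows (the window length and overlap are illustrative choices, not the settings used for the figure):

```python
import numpy as np
from scipy.signal import spectrogram

def log_spectrogram(x, fs=16000, nperseg=512, noverlap=384):
    """Log-magnitude spectrogram (dB) of a speech signal sampled at fs Hz."""
    f, t, sxx = spectrogram(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return f, t, 10.0 * np.log10(sxx + 1e-12)
```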


Fig. 1.17  Perceptual evaluation of speech quality after denoising (PESQf) versus perceptual evaluation of speech quality before denoising (PESQi): case of a speech signal (Fig. 1.13) corrupted by volvo noise. [Curves: SBWT/MSS-MAP, MSS-MAP, DFT-domain based single-microphone noise reduction, supervised and online NMF-based noise reduction]

Fig. 1.18  A speech signal taken from the TIMIT database (a), corrupted by tank noise (b), and enhanced (c) by the proposed technique (SNRi = 10 dB, SNRf = 16.7383 dB, SSNRi = 1.7965 dB, SSNRf = 7.6179 dB, ISdi = 0.0182, ISdf = 3.7397e-04, PESQi = 2.6675, PESQf = 3.1143). [Waveform plots over about 2 s]

Fig. 1.19  A speech signal taken from the TIMIT database (a), corrupted by pink noise (b), and enhanced (c) by the proposed technique (SNRi = 10 dB, SNRf = 15.0956 dB, SSNRi = 1.5896 dB, SSNRf = 6.2249 dB, ISdi = 0.0768, ISdf = 0.0495, PESQi = 2.2660, PESQf = 2.7800). [Waveform plots over about 2 s]

Fig. 1.20  A speech signal taken from the TIMIT database (a), corrupted by white noise (b), and enhanced (c) by the proposed technique (SNRi = 10 dB, SNRf = 14.5035 dB, SSNRi = 1.4850 dB, SSNRf = 6.0776 dB, ISdi = 0.5621, ISdf = 0.0495, PESQi = 2.0519, PESQf = 2.7304). [Waveform plots over about 2 s]

Fig. 1.21  A speech signal taken from the TIMIT database (a), corrupted by F16 noise (b), and enhanced (c) by the proposed technique (SNRi = 5 dB, SNRf = 11.4539 dB, SSNRi = 1.7233 dB, SSNRf = 3.2526 dB, ISdi = 0.4625, ISdf = 0.4826, PESQi = 1.8480, PESQf = 2.4521). [Waveform plots over about 2 s]

1.6  Conclusion

In this chapter, we detailed a speech enhancement technique integrating the stationary bionic wavelet transform (SBWT) and the MSS-MAP. The SBWT is introduced for solving the problem of perfect reconstruction existing with the BWT. The MSS-MAP estimation was employed for estimating the speech in the SBWT domain. The proposed technique (SBWT/MSS-MAP) was compared to the technique based on MSS-MAP estimation, the Wiener filtering, the speech enhancement technique based on the DFT, and the supervised and online NMF-based noise reduction technique. The evaluation of the different techniques was performed using four objective metrics: SNR, SSNR, ISd, and PESQ. We have also used in our evaluation a number of speech signals (ten sentences pronounced in the Arabic language by a male voice and ten others pronounced by a female voice, also in Arabic) and other speech sentences taken from the TIMIT database. We have also employed diverse sorts of noises: car, F16, tank, white, and pink noises. The results obtained from the computations of SNR, SSNR, ISd, and PESQ show that the proposed approach (SBWT/MSS-MAP) outperforms the technique based on MSS-MAP estimation and the Wiener filtering. When compared with the supervised and online NMF-based noise reduction technique, the proposed approach (SBWT/MSS-MAP) is better when the SNR is higher, and the opposite holds when the SNR is lower.


Fig. 1.22  An example of speech enhancement using the proposed technique (SBWT/MSS-MAP) [16]: (a) clean speech signal, (b) clean speech signal (Fig. 1.22a) corrupted by volvo noise with SNR = 10 dB, (c) enhanced speech signal. [Waveform plots over about 3 s]

Fig. 1.23  (a) Spectrogram of the clean speech signal, (b) spectrogram of the noisy speech signal, (c) spectrogram of the enhanced speech signal. [Panels titled "Clean Speech Signal," "Noisy Speech Signal," and "Enhanced Speech Signal"; axes: Freq (kHz) from 0 to 8, Time (sec) from 0 to about 2.5]


References 1. Paliwal, K.K.: Usefulness of phase in speech processing. In Proceedings IPSJ Spoken Language Processing Workshop, pp. 1–6 (2003) 2. Giacobello, D., Christensen, M.G., Dahl, J., Jensen, S., Moonen, M.: Sparse linear predictors for speech processing. In Proceedings of the International Conference on Spoken Language Processing, 2008, pp. 4–7 (2005) 3. Faúndez-Zanuy, M.M., Esposito, S., Hussain, A., Schoentgen, J., Kubin, G., Kleijn, W.B., et  al.: Nonlinear speech processing: overview & applications. Control. Intell. Syst. 30(1), 1–9 (2002) 4. Das, N., Chakraborty, S., Chaki, J., Padhy, N., Dey, N.: Fundamentals, present and future perspectives of speech enhancement. Int. J. Speech Technol. (2020). https://doi.org/10.1007/ s10772-­020-­09674-­2 5. Krishnamoorthy, P., Mahadeva Prasanna, S.R.: Temporal & spectral processing of degraded speech. In 16th International Conference on Advanced Computing & Communications, pp. 9–14 (2008) 6. Christiansen, T.U., Dau, T., Greenberg, S.: Spectro-temporal processing of speech  – An information-­theoretic framework. In: Kollmeier, B., et al. (eds.) Hearing – From sensory processing to perception, pp. 59–523. Springer, Berlin, Heidelberg (2007) 7. Vijayan, K. Xiaoxue, G. Li, H.: Analysis of speech & singing signals for temporal alignment. In Conference: Asia-Pacific Signal & Information Processing Association Annual Summit & Conference, pp. 1–5 (2018) 8. Santos, E., Khosravy, M., Lima, M.A., Cerqueira, A.S., Duque, C.A., Yona, A.: High accuracy power quality evaluation under a colored noisy condition by filter bank ESPRIT. Electronics. 8(11), 1259 (2019) 9. Deshmukh, O.D., Espy-Wilson, C.Y.: Speech enhancement using the modified phase-­ opponency model. J. Acoust. Soc. Am. 121(6), 3886–3898 (2007) 10. Mustière, F., Bouchard M. & Bolić, M. (2010). Bandwidth extension for speech enhancement. In 2010 IEEE 23rd Canadian Conference on Electrical and Computer Engineering – CCECE (AB Canada Calgary 2010 May 2 – 2010 May 5) (pp. 76–84) 11. Baumgarten, M., Mulvenna, M.D., Rooney, N., Reid, J.: Keyword-based sentiment mining using twitter. Int. J. Ambient Comput. Intell. 5(2), 56–69 (2013) 12. Sen, S., Dutta, A., Dey, N.: Audio indexing. In: Audio Processing and Speech Recognition. SpringerBriefs in Applied Sciences and Technology, pp. 1–11. Springer, Singapore (2019) 13. Sen, S., Dutta, A., Dey, N.: Speech processing and recognition system. In: Audio Processing and Speech Recognition. SpringerBriefs in Applied Sciences and Technology, pp.  13–43. Springer, Singapore (2019) 14. Sen, S., Dutta, A., Dey, N.: Audio classification. In: Audio Processing and Speech Recognition. SpringerBriefs in Applied Sciences and Technology, pp. 67–93. Springer, Singapore (2019) 15. Santosh, K.C., Borra, S., Joshi, A., Dey, N.: Advances in speech, music and audio signal processing. Int. J. Speech Technol. 22(2), 293–296 (2019) 16. Chiluveru, S.R., Tripathy, M.: A real-world noise removal with wavelet speech feature. Int. J. Speech Technol. 23(3), 683–693 (2020); Talbi, M.: Speech enhancement based on stationary bionic wavelet transform and maximum a posterior estimator of magnitude-squared spectrum. Int. J. Speech Technol. (2016). https://doi.org/10.1007/s10772-­016-­9388-­7 17. Dash, T.K., Solanki, S.S., Panda, G.: Multi-objective approach to speech enhancement using tunable Q-factor-based wavelet transform and ANN techniques. Circuits Syst. Signal Process. (2021). https://doi.org/10.1007/s00034-­021-­01753-­2; Loizou, P.C.: Speech Enhancement Theory and Practice. 
Taylor & Francis, Abingdon (2007)


18. Yang, L., Loizou, P.C.: Estimators of the magnitude squared spectrum and methods for incorporating SNR uncertainty. IEEE Trans. Audio Speech Lang. Process. 19(5), 1123–1137 (2011) 19. Yao, J., Zhang, Y.T.: Bionic wavelet transform: a new time-frequency method based on an auditory model. IEEE Trans. Biomed. Eng. 48(8), 856–863 (2001) 20. Yao, J., Zhang, Y.T.: The application of bionic wavelet transform to speech signal processing in cochlear implants using neural network simulations. IEEE Trans. Biomed. Eng. 49(11), 1299–1309 (2002) 21. Debnath, L.: Wavelet Transforms and their Applications. Birkhauser, Boston (2002) 22. Jaffard, S., Meyer, Y., Ryan, R.D.: Wavelets: Tools for Science and Technology. Society for Industrial and Applied Mathematics, Philadelphia (2001) 23. Daubechies, I.: Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, Philadelphia (1992) 24. Johnson, M.T., Yuan, X., Ren, Y.: Speech signal enhancement through adaptive wavelet thresholding. Speech Comm. 49, 12 (2007) 25. Bahoura, M., Rouat, J.: Wavelet speech enhancement based on the teager energy operator. IEEE Signal Process. Lett. 8(1), 10–12 (2001) 26. Chen, S.-H., Chau, S.Y., Wang, J.-F.: Speech enhancement using perceptual wavelet packet decomposition and teager energy operator. J. VLSI Signal Process. Systems. 36(2–3), 125–139 (2004) 27. Cohen, I.: Enhancement of speech using bark-scaled wavelet packet decomposition. Paper presented at the Eurospeech 2001, Denmark, 2001 28. Fu, Q., Wan, E.A.: Perceptual Wavelet Adaptive Denoising of Speech. Paper presented at the Eurospeech, Geneva (2003) 29. Hu, Y., Loizou, P.C.: Speech enhancement based on wavelet thresholding the multitaper spectrum. IEEE Trans. Speech Audio Process. 12(1), 59–67 (2004) 30. Lu, C.-T., Wang, H.-C.: Enhancement of single channel speech based on masking property and wavelet transform. Speech Comm. 41(2–3), 409–427 (2003) 31. Giguere, C.: Speech processing using a wave digital filter model of the auditory periphery. Ph.D., University of Cambridge, Cambridge, UK (1993) 32. Giguere, C., Woodland, P.C.: A computational model of the auditory periphery for speech and hearing research. J. Acoust. Soc. Amer. 95(1), 331–342 (1994) 33. Mortazavi, S.H., Shahrtash, S.M.: Comparing Denoising Performance of DWT, WPT, SWT and DT-CWT for Partial Discharge Signals. In Proceedings of the 43rd International Universities Power Engineering Conference (UPEC'08), pp. 1–6. Padova, Italy (2008) 34. Shensa, M.J.: The discrete wavelet transform: wedding the à trous and Mallat algorithms. IEEE Trans. Signal Process. 40(10), 2464–2482 (1992) 35. Tasmaz, H., Ercelebi, E.: Speech enhancement based on undecimated wavelet packet-perceptual filterbanks and MMSE–STSA estimation in various noise environments. Digit. Signal Process. 18(5), 797–812 (2008) 36. Ephraim, Y., Malah, D.: Speech enhancement using a minimum mean square error short time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 32, 1109–1121 (1984) 37. Biswas, A., Sahu, P.K., Bhowmick, A., Chandra, M.: Feature extraction technique using ERB like wavelet sub-band periodic and aperiodic decomposition for TIMIT phoneme recognition. Int. J. Speech Technol. 17(4), 389–399 (2014) 38. Singh, S., Mutawa, A.M.: A wavelet-based transform method for quality improvement in noisy speech patterns of Arabic language. Int. J. Speech Technol., 1–9 (2016) 39. Bahoura, M., Rouat, J.: Wavelet speech enhancement based on time-scale adaptation. Speech Comm. 48(12), 1620–1637 (2006) 40. Rix, A.W., Beerends, J.G., Hollier, M.P., Hekstra, A.P.: Perceptual evaluation of speech quality (PESQ) – A new method for speech quality assessment of telephone networks and codecs. In Proceedings of ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 2, pp. 749–752 (2001)


41. Zavarehei, E., Vaseghi, S., Yan, Q.: Inter-frame modeling of DFT trajectories of speech and noise for speech enhancement using Kalman filters. Speech Comm. 48(11), 1545–1555 (2006) 42. Hu, Y., Loizou, P.C.: Evaluation of objective measures for speech enhancement. IEEE Trans. Speech Audio Process. 16(1), 229–238 (2008) 43. Hirsch, H., Pearce, D.: The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In ISCA Tutorial and Research Workshop ASR2000, Paris, France (2000) 44. Hendriks, R.C., Gerkmann, T., Jensen, J.: DFT-domain based single-microphone noise reduction for speech enhancement: a survey of the state of the art. Synth. Lect. Speech Audio Process. 9(1), 1–80 (2013) 45. Deller, J.R., Hansen, J.H.L., Proakis, J.G.: Discrete-Time Processing of Speech Signals, 2nd edn. IEEE Press, New York (2000) 46. Haykin, S.: Adaptive Filter Theory, 3rd edn. Prentice Hall, Upper Saddle River, NJ (1996) 47. Mohammadiha, N., Smaragdis, P., Leijon, A.: Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Trans. Audio Speech Lang. Process. 21(10), 2140–2215 (2013) 48. Girish, K.V., Ramakrishnan, A.G., Ananthapadmanabha, T.V.: Adaptive dictionary based approach for background noise and speaker classification and subsequent source separation. J. Latex Class Files. 14(8) (2015)

Chapter 2

ECG Denoising Based on 1-D Double-Density Complex DWT and SBWT

2.1  Introduction

Signal denoising has grown considerably in recent years, as it has become a domain of interest [1–4]. In the biomedical domain, ECG signal processing has been broadly studied by many researchers, particularly its denoising, in order to obtain a reliable diagnosis [5–7]. There are different sorts of noise corrupting ECG signals, among them the powerline interference, electrode contact noise, motion artifacts, baseline wander, muscular contraction, and instrumentation noise [8]:

–– Powerline Interference
This is the interference caused by the 60/50 Hz power supply to which the machine is connected. Its magnitude can typically be as high as 50% of the peak ECG amplitude. Some of the common causes of this are:

• Stray effects caused by the alternating current fields.
• Inappropriate grounding of the ECG machine or the patient.
• Electrode disconnect.
• Electromagnetic interference due to the power supply.
• Heavy electrical equipment such as elevators and X-ray units, which draw a large current from the power supply and can induce 50/60 Hz signals (and their harmonics) in the circuitry of the ECG machine.

–– Electrode Contact Noise
This is caused by a faulty connection between the measuring system and the patient. Lack of adhesive jelly, dislocation of electrodes, etc. are the prime factors causing this artifact.


–– Motion Artifact
Transient changes are induced in the baseline by the varying skin-electrode impedance caused by patient movement while recording the ECG.

–– Muscle Contractions
This is also known as EMG noise, induced by the gross potentials picked up from the body surface by the ECG electrodes. Erratic patient body movement or vibrations are responsible for this. The standard deviation of this noise is approximated at up to 10% of the peak ECG amplitude, and its frequency content ranges from dc (0 Hz) to 10 kHz.

–– Baseline Wander
This is caused by heavy respirational activity or movement of the thoracic cavity, which creates problems in the precise detection of peaks. Because of this, low-amplitude peaks such as T-waves may appear exaggerated and might be mistaken for the R peak, which is in general of the highest amplitude.

In the signal processing literature, many denoising techniques have been proposed. Van Alste and Schilder [9] proposed classical filtering based on finite impulse response (FIR) filters. In [10], least squares-based adaptive filtering is employed in order to cancel the electrical interference from the ECG signal. In the same context, Vullings et al. [11] employed an adaptive Kalman filter [12, 13] in order to improve the ECG signal quality. Chang and Liu [14], for their part, employed the Wiener filter for suppressing Gaussian white noise in the ECG signal, while Oliveira et al. [15] explored the removal of powerline interference by zeroing some wavelet scales. El-Dahshan [16] proposed a hybrid linearization technique integrating an extended genetic algorithm and a discrete wavelet transform (DWT) for eliminating ECG signal noise. Ling et al. [17] suggested a fuzzy rule-based multiwavelet ECG signal denoising. It has also been suggested to employ transform methods, such as the wavelet transform, for denoising ECG signals [18, 19]. Sharma et al. [20] developed an ECG signal denoising using higher-order statistics in wavelet sub-bands. Mehmet et al. [21] proposed a weak ECG signal denoising method based on fuzzy thresholding and wavelet packet analysis. Firstly, the weak ECG signal is decomposed into various levels by the wavelet packet transform. Then, the threshold value is determined using the fuzzy s-function. The reconstruction of the ECG signal from the retained coefficients is achieved by applying the inverse wavelet packet transform. Over the past two decades, an important tool named empirical mode decomposition (EMD) was introduced for analyzing nonlinear and non-stationary signals. This technique aroused the interest of researchers and has therefore been applied to the ECG by Chang and Liu [14], Blanco-Velasco et al. [22], Bouny et al. [23], and Nguyen and Kim [24]. Furthermore, ECG signal denoising in time-frequency transformation domains, based particularly on the employment of wavelets and the EMD technique, was explored by Kopsinis and Laughlin [25] and Kabir and Shahnaz [26]. Manas R. and Susmita D. [27] proposed an efficient ECG denoising method applying empirical mode decomposition (EMD) and an adaptive switching mean filter (ASMF). The advantages of both the EMD and ASMF methods are exploited for reducing the noise in ECG signals with minimum distortion. Unlike conventional EMD-based methods, which reject the initial intrinsic mode functions (IMFs) or utilize a window-based approach for reducing high-frequency noises, in [27] a wavelet-based soft thresholding scheme is adopted for the reduction of high-frequency noises while preserving the QRS complexes. Chunqiang et al. [28] employed local means in order to denoise the ECG signal. They presented a simple technique for calculating the standard deviation of the additive Gaussian white noise (GWN) corrupting the ECG signal and, after that, a fast ECG denoising technique, the local means (LM) approach, which is the "local" version of the NLM approach. The LM technique has a computational cost about two orders of magnitude lower than the NLM method thanks to its "local search." In a low-SNR condition, the SNR improvement obtained by the LM technique is 21% higher than that obtained by the NLM technique [29]. Many techniques have been devoted to fractional calculus in signal processing [4, 30–33]. To the best of our knowledge, very few researchers have explored ECG signal processing by means of fractional wavelets [34–38]. In [4], a fractional wavelet technique was proposed for cancelling Gaussian white noise and powerline interference. Unlike the classical wavelet, the main advantage of the fractional wavelet is its flexibility in terms of modifying parameters to reach different bandwidths. In this context, the approach proposed in [4] relies on the application of fractional wavelets for obtaining a better-quality signal employing threshold approaches. Fractional wavelets were compared, by means of hard and soft threshold techniques, with other classical wavelets to prove their efficiency. In this chapter, we will detail the ECG denoising technique proposed in [29]. This technique is based on the 1-D double-density complex DWT and the SBWT. The rest of this chapter is organized as follows: Sect. 2.2 is devoted to the materials, including the proposed ECG denoising technique; Sect. 2.3 presents the results and discussion; and Sect. 2.4 concludes the chapter.

2.2  Materials

2.2.1  The BWT Optimization for ECG Analysis

According to the definition of the BWT, there is a major difference in the resolution of the time-frequency span of the analyzing windows. In fact, in the wavelet transform (WT), for a fixed mother function, all the windows in a certain scale along the t-axis are fixed, and the window size of the WT varies only with the analyzing frequency, whereas in the BWT both the time and the frequency resolutions can differ even within a certain scale. The adjustment of the BWT resolution in the same scale is controlled by the T-function. This function is related to the signal's instantaneous amplitude and its first-order differential [39]. Figure 2.1 illustrates the two time-frequency representations of an ECG signal (106.dat) corresponding to the WT and the BWT.

Fig. 2.1  (a) ECG signal taken from MIT-BIH: 106.dat, (b) time-frequency representation corresponding to the BWT, (c) time-frequency representation corresponding to the WT. [Panel (a): amplitude (mV) over 1600 samples; panels (b) and (c): scale (about 1-30) versus samples]

Notice the smoothing in the BWT representation, which is the direct result of the window changes over certain scales. It remains to set the parameters of the BWT efficiently so that it is capable of decomposing the signal into a finite number of scales and, afterwards, to determine the most energetic ones and select a global or local threshold [40]. For optimizing the BWT parameters, Omid and Mohammad [40] applied a semi-optimal technique which considers both analytic and morphological aspects of the analyzed signal. Since in this work we are interested in an ECG signal, we have to be aware of its variability. Seemingly the most important feature of an ECG signal is the frequency range in which its main components occur. Though there are some other components such as ventricular late potentials (VLPs), we have restricted our interest in this work to the waves P, Q, R, S, and T, as in [40]. The resulting frequency range extends up to 100 Hz. An ECG signal does not require as high a frequency w0 as is required for speech signal analysis (w0 = 15165.4 Hz (Eq. 1.2)). Consequently, Omid and Mohammad [40] optimized it simply by running the program for different values of w0 and then minimizing the gradient of the error variance by comparing the results with each other. This comparison is performed numerically and morphologically. It has been found that when w0 lies in the range of 360–500 Hz, there is not much degradation of the analyzed ECG signal [40]. This choice is based on the fact that, to avoid aliasing, it is preferable to choose the center frequency of the first scale (w0) above the sampling frequency of the ECG signal. As in [40], in our previous work [29] we chose w0 = 400 Hz because it yields satisfactory results. Unlike in [39], in the technique of Omid et al. [40] the constant 1.1623 (Eq. 2.1) was replaced by a parameter q > 1 taking, for each signal and decomposition scale, a fixed value which should obey an adaptation procedure.

wm = w0 / 1.1623^m,   m = 0, 1, 2, …   (2.1)

Besides, for every m, that is, in each distinct scale, q is adapted for the different time-frequency windows. More explanation of how this parameter q is determined according to each analyzing window is given in [40]. The other parameters employed in the BWT formula (Eq. 1.2) are the same ones employed in [41, 42] for speech enhancement and denoising. These constants are G1 = 0.87, G2 = 45, and CS = 0.8. Finally, the calculation step is determined by the sampling frequency: if fs is the sampling frequency, then the step is Δτ = 1/fs.
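As a small illustration of Eq. (2.1), the sketch below computes the center frequency of each scale for w0 = 400 Hz, the value adopted in [29, 40]; note that the factor is kept fixed at 1.1623 here, whereas in [40] it is replaced by the adapted parameter q:

```python
import numpy as np

def bwt_center_frequencies(w0=400.0, n_scales=30, q=1.1623):
    """Per-scale center frequencies of the BWT: w_m = w0 / q**m (Eq. 2.1)."""
    m = np.arange(n_scales)
    return w0 / q**m
```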

2.2.2  1-D Double-Density Complex DWT

The input signal x(n) is processed by two parallel iterated filter banks gi(n) and hi(n) with i = 0, 1, 2 [43]. The real part of the complex wavelet transform is produced by the sub-band signals of the upper DWT, and the imaginary part is produced by the lower DWT, as illustrated in Fig. 2.2. The implementation process for the 1-D double-density complex DWT is illustrated as a block diagram in Fig. 2.2 [43–45]. The 1-D double-density complex DWT denoising approach can be summarized by the block diagram illustrated in Fig. 2.3. As shown in this figure, the different steps of this denoising approach [43–45] are as follows (a sketch is given after this list):

• First step: Apply the 1-D double-density complex DWT to the noisy signal.
• Second step: Apply soft thresholding, with a certain threshold T, to the sub-bands obtained in the first step.
• Third step: Apply the inverse of the 1-D double-density complex DWT to the denoised sub-bands obtained in the second step, obtaining the denoised signal.
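These three steps can be prototyped with any wavelet toolbox. The sketch below uses an ordinary decimated DWT from PyWavelets as a stand-in, since the double-density complex filter banks gi, hi of Fig. 2.2 are not available in that library; only the structure of the procedure of Fig. 2.3 is illustrated:

```python
import numpy as np
import pywt

def soft(x, t):
    """Soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def wavelet_soft_denoise(x, threshold, wavelet="db6", level=4):
    """Decompose, soft-threshold the detail sub-bands, and reconstruct."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    coeffs = [coeffs[0]] + [soft(c, threshold) for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)
```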

Fig. 2.2  Filter bank diagram of the 1-D double-density complex DWT. [Two parallel iterated analysis trees applied to x(n): the upper tree iterates the filters h0, h1, h2 and the lower tree the filters g0, g1, g2, each filter followed by downsampling by 2]

Fig. 2.3  Signal denoising by applying the 1-D double-density complex DWT denoising approach. [Block diagram: noisy signal → 1-D double-density complex DWT → soft thresholding of the obtained sub-bands (threshold T) → inverse of the 1-D double-density complex DWT → denoised signal]

2.2.3  Denoising Technique Based on Wavelets and Hidden Markov Models

Wavelet-based statistical signal processing approaches such as detection and denoising typically model the wavelet coefficients as independent or jointly Gaussian [46]. Those models are unrealistic for numerous real-world signals. In [46], Crouse et al. developed a framework for statistical signal processing based on wavelet-domain hidden Markov models (HMMs), which concisely models the statistical dependencies and non-Gaussian statistics encountered in real-world signals. Wavelet-domain HMMs were conceived with the intrinsic properties of the wavelet transform in mind and provide powerful, yet tractable, probabilistic signal models. Efficient expectation-maximization algorithms were developed for fitting the HMMs to observational signal data. This framework [46] is appropriate for an extensive range of applications, including signal estimation, detection, classification, prediction, and even synthesis. To show the utility of wavelet-domain HMMs, the authors of [46] introduced an algorithm for signal denoising.

2.2.4  The Denoising Approach Based on Non-local Means

In [47], non-local means was applied to ECG denoising. Non-local means addresses the problem of recovering the clean signal s from the noisy signal x [48]:

x = s + n   (2.2)

where n is an additive noise. For a given sample k, the estimate ŝ is a weighted sum of the values at the other samples j belonging to some search neighborhood N(k) [48]:

ŝ(k) = (1/Z(k)) · Σ_{j∈N(k)} w(k, j) v(j)   (2.3)

with Z(k) = Σ_j w(k, j), and the weights are formulated as follows [48]:

w(k, j) = exp( − Σ_{δ∈Δ} (x(k + δ) − x(j + δ))² / (2 L_Δ λ²) )   (2.4)

w(k, j) = exp( − d²(k, j) / (2 L_Δ λ²) )   (2.5)

where λ is a bandwidth parameter and Δ is a local patch of samples surrounding k, containing L_Δ samples; a patch having the same shape also surrounds j. In [48], d² is the summed squared point-by-point difference between the samples in the patches centered on the samples k and j. In [48], each patch is averaged with itself with weight w(k, k) = 1. For achieving a smoother result, a center patch correction is frequently applied [48]:


w(k, k) = max_{j∈N(k), j≠k} w(k, j)   (2.6)



For the application of the denoising approach based on non-local means [47, 48], we need an estimate of the noise level σ1. Consequently, in our previous work [29], we applied the DWT to the noisy ECG signal for estimating σ1, using the following formula:

σ1 = MAD(cD1) / 0.6745   (2.7)

where cD1 denotes the detail coefficients obtained from the application of the DWT to the noisy ECG signal. Then σ1 is multiplied by 0.6, and the obtained value is employed for the application of this approach [47, 48].
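A minimal sketch of Eqs. (2.3)-(2.7) is given below (Python with NumPy and PyWavelets); the quadratic-cost loop is for illustration only, and the signal borders are left unprocessed for brevity. Following the text above, the bandwidth would be set to lam = 0.6 * estimate_sigma(x):

```python
import numpy as np
import pywt

def estimate_sigma(x, wavelet="db6"):
    """Noise level from the finest detail coefficients (Eq. 2.7)."""
    _, cD1 = pywt.dwt(x, wavelet)
    return np.median(np.abs(cD1)) / 0.6745

def nlm_denoise(x, lam, patch=5, search=50):
    """Plain non-local means following Eqs. (2.3)-(2.6)."""
    half, n = patch // 2, len(x)
    y = x.astype(float).copy()
    for k in range(half, n - half):
        pk = x[k - half:k + half + 1]
        j0, j1 = max(half, k - search), min(n - half, k + search)
        w = np.empty(j1 - j0)
        for idx, j in enumerate(range(j0, j1)):
            d2 = np.sum((pk - x[j - half:j + half + 1])**2)
            w[idx] = np.exp(-d2 / (2.0 * patch * lam**2))   # Eqs. (2.4)-(2.5)
        c = k - j0
        w[c] = w[np.arange(len(w)) != c].max()              # center correction, Eq. (2.6)
        y[k] = np.dot(w, x[j0:j1]) / np.sum(w)              # Eq. (2.3)
    return y
```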

2.2.5  The ECG Denoising Approach Based on BWT and FWT_TI [49]

This ECG denoising approach can be summarized by the block diagram illustrated in Fig. 2.4. As shown in this figure, this approach consists, in the first step, of applying the BWT to the noisy ECG signal in order to obtain 30 noisy bionic wavelet coefficients, w1, w2, …, w30. Each of those coefficients is then considered as a noisy signal and denoised employing thresholding in the FWT_TI domain, and we obtain 30 denoised bionic wavelet coefficients, ŵ1, ŵ2, …, ŵ30. To ŵ1, ŵ2, …, ŵ30 we apply the inverse of the BWT (BWT−1) for finally obtaining the denoised ECG signal. The FWT_TI is the forward translation-invariant wavelet transform.

2.2.6  The Proposed ECG Denoising Approach [29]

This proposed ECG denoising approach can be summarized by the block diagram illustrated in Fig. 2.5. According to Fig. 2.5, this approach consists, in the first step, of applying the SBWT to the noisy ECG signal in order to obtain one detail coefficient, wtb1, and one approximation coefficient, wtb2. From wtb1, the level of the noise corrupting the original ECG signal is estimated. In [29], the ECG signals are corrupted by an additive Gaussian white noise whose level σ can be estimated employing the following formula:

σ = MAD(wtb1) / 0.6745   (2.8)


Fig. 2.4  The block diagram of the ECG denoising technique based on BWT and FWT_TI [49]. [Block diagram: the noisy ECG signal → BWT → coefficients W1, W2, …, W30 → FWT_TI → thresholding → IWT_TI → denoised coefficients Ŵ1, Ŵ2, …, Ŵ30 → inverse BWT (BWT−1) → the denoised ECG signal]

After that, the threshold thr is computed and used for the soft thresholding of the coefficient wtb1, yielding a denoised coefficient wtd1. The computation of thr is performed employing the following formula:

thr = σ · √(2 · log(N))   (2.9)

where N is the number of samples of the coefficient wtb1. The approximation coefficient wtb2 is denoised by applying the 1-D double-density complex DWT denoising technique [43–45], yielding the denoised coefficient wtd2. The denoised ECG signal is finally obtained from the application of the SBWT inverse, SBWT−1, to the coefficients wtd1 and wtd2. For the application of the 1-D double-density complex DWT denoising technique, we have used 0.6 × σ as its noise level.
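A minimal sketch of this pipeline follows. Since the SBWT is not available in standard libraries, a one-level stationary wavelet transform (pywt.swt, with db6 as an assumed mother wavelet) stands in for it, and the approximation band is passed through unchanged where the actual method would apply the double-density denoiser with 0.6 · σ as its noise level:

```python
import numpy as np
import pywt

def soft_threshold_detail(wtb1):
    """Estimate sigma (Eq. 2.8) and soft-threshold the detail band with thr (Eq. 2.9)."""
    sigma = np.median(np.abs(wtb1)) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(wtb1)))
    wtd1 = np.sign(wtb1) * np.maximum(np.abs(wtb1) - thr, 0.0)
    return wtd1, sigma

def sbwt_like_denoise(x, wavelet="db6"):
    # len(x) must be even for a one-level stationary transform.
    wtb2, wtb1 = pywt.swt(x, wavelet, level=1)[0]
    wtd1, sigma = soft_threshold_detail(wtb1)
    wtd2 = wtb2  # placeholder for the double-density denoising with 0.6 * sigma
    return pywt.iswt([(wtd2, wtd1)], wavelet)
```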


Fig. 2.5  The block diagram of the proposed ECG denoising technique [29]. [Block diagram: noisy ECG signal → SBWT → detail coefficient wtb1 and approximation coefficient wtb2; noise level estimation σ = MAD(|wtb1|)/0.6745; threshold calculation thr = σ·√(2·log(N)), with N the number of samples in wtb1; wtb1 → denoising by soft thresholding → wtd1; wtb2 → 1-D double-density complex DWT denoising method → wtd2; (wtd1, wtd2) → inverse of SBWT (SBWT−1) → the denoised ECG signal]


Fig. 2.6  First example of ECG denoising using the proposed technique: (a) clean ECG signal (100.dat), (b) noisy ECG signal (SNRi (dB) = 10.0683), (c) denoised ECG signal (SNRf (dB) = 17.7079). [Waveform plots over 4500 samples]

2.3  Results and Discussion

In this section, we evaluate the proposed ECG denoising approach [29] (Fig. 2.5) by comparing it to the four denoising techniques [43–49] previously mentioned. Figures 2.6, 2.7, 2.8, 2.9, 2.10, and 2.11 illustrate six examples of ECG denoising applying the proposed ECG denoising technique [29]. For each of those examples, the noisy ECG signal is obtained by degrading the clean signal with an additive Gaussian white noise (GWN) at an initial value of SNR, SNRi. Each clean ECG signal was taken from the MIT-BIH database. According to these figures, we can remark that the proposed ECG denoising approach [29] considerably reduces the noise while the different P-QRS-T waves of the clean ECG signal are practically conserved. Tables 2.1 and 2.2 list the values of SNR and MSE obtained from the application of the proposed technique and the four other ones, which are the 1-D double-density complex DWT denoising method [43–45], the technique based on wavelets and hidden Markov models [46], the technique based on non-local means [47, 48], and our previously proposed technique based on BWT and FWT_TI with hard thresholding [49]. Those different results (Tables 2.1 and 2.2) obtained from the SNR and MSE computations show that the proposed ECG denoising technique outperforms the other denoising approaches used in our comparative study.

Fig. 2.7  Second example of ECG denoising using the proposed technique: (a) clean ECG signal (101.dat), (b) noisy ECG signal (SNRi (dB) = 7.8013), (c) denoised ECG signal (SNRf (dB) = 16.1197). [Waveform plots over 4500 samples]

Fig. 2.8  Third example of ECG denoising using the proposed technique: (a) clean ECG signal (105.dat), (b) noisy ECG signal (SNRi (dB) = 10.3621), (c) denoised ECG signal (SNRf (dB) = 18.5830). [Waveform plots over 4500 samples]

Fig. 2.9  Fourth example of ECG denoising using the proposed technique: (a) clean ECG signal (113.dat), (b) noisy ECG signal (SNRi (dB) = 15.7914), (c) denoised ECG signal (SNRf (dB) = 22.6416). [Waveform plots over 4500 samples]

Fig. 2.10  Fifth example of ECG denoising using the proposed technique: (a) clean ECG signal (102.dat), (b) noisy ECG signal (SNRi (dB) = 7.7277), (c) denoised ECG signal (SNRf (dB) = 16.1197). [Waveform plots over 4500 samples]

Fig. 2.11  Sixth example of ECG denoising using the proposed technique: (a) clean ECG signal (104.dat), (b) noisy ECG signal (SNRi (dB) = 2.6969), (c) denoised ECG signal (SNRf (dB) = 11.0800). [Waveform plots over 4500 samples]

Table 2.1  Comparative study in terms of signal-to-noise ratio (SNR): results obtained from the computation of the mean of seven values of SNR corresponding to seven noisy ECG signals 100–106.dat corrupted by Gaussian white noise with different values of SNRi before denoising (varying from −5 dB to 15 dB)

Technique | SNRi = −5 dB | SNRi = 0 dB | SNRi = 5 dB | SNRi = 10 dB | SNRi = 15 dB
The proposed ECG denoising technique | 5.2528 | 9.7188 | 14.0847 | 18.0943 | 21.6649
The 1-D double-density complex DWT denoising method [43–45] | 3.6616 | 8.4115 | 12.9800 | 17.1080 | 20.8751
The technique based on wavelets and hidden Markov models [46] | 4.3536 | 8.9189 | 13.1256 | 17.6421 | 20.1387
The technique based on non-local means [47, 48] | 4.2528 | 8.2003 | 12.0646 | 16.2780 | 20.1311
The proposed technique based on BWT and FWT_TI with hard thresholding [49] | 3.2998 | 8.9095 | 13.4812 | 17.6433 | 21.2030


Table 2.2  Comparative study in terms of mean square error (MSE): results obtained from the computation of the mean of seven values of MSE corresponding to seven noisy ECG signals 100–106.dat corrupted by Gaussian white noise with different values of SNRi before denoising (varying from −5 dB to 15 dB)

Technique | SNRi = −5 dB | SNRi = 0 dB | SNRi = 5 dB | SNRi = 10 dB | SNRi = 15 dB
The proposed ECG denoising technique | 0.0071 | 0.0026 | 9.4286e-04 | 3.7143e-04 | 1.5714e-04
The 1-D double-density complex DWT denoising method [43–45] | 0.0103 | 0.0034 | 0.0012 | 4.7143e-04 | 2.0000e-04
The technique based on wavelets and hidden Markov models [46] | 0.0087 | 0.0035 | 0.0012 | 4.0000e-04 | 2.2857e-04
The technique based on non-local means [47, 48] | 0.0092 | 0.0037 | 0.0015 | 4.2857e-04 | 2.2755e-04
The proposed technique based on BWT and FWT_TI with hard thresholding [49] | 0.0122 | 0.0031 | 0.0011 | 4.1429e-04 | 1.8571e-04

We have also made a comparative study between the proposed ECG denoising technique and the other ECG denoising techniques (the proposed technique based on BWT and FWT_TI with hard thresholding [49], the technique based on non-local means [47, 48], the technique based on wavelets and hidden Markov models [46], and the 1-D double-density complex DWT denoising method [43–45]). This comparative study is in terms of cross-correlation (CC), mean absolute error (MAE), and peak signal-to-noise ratio (PSNR). In Table 2.3, the results obtained from the computation of CC, MAE, and PSNR are listed. According to Table 2.3, the highest values of CC and PSNR and the lowest values of MAE are obtained by the proposed technique. Also, the values obtained by the proposed technique using the threshold thr = 0.6 × σ with db6 or db7 as the mother wavelet are better than those obtained by the other denoising techniques employed in our evaluation. In summary, in terms of CC, PSNR, and MAE, the proposed denoising technique outperforms the other denoising techniques used in our evaluation. Also, according to Table 2.3 and in the majority of cases, the proposed technique using thr = σ with db6 or db7 is better than the proposed technique using thr = 0.6 × σ with db6 or db7. Compared with the values obtained using thr = σ or thr = 0.6 × σ with db6 or db7, the proposed technique gives its worst values when using thr = σ·√(2·log(N)).
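The three measures can be computed as follows (a minimal sketch; the PSNR peak convention, taken here as max|x| of the clean signal, is an assumption, since several conventions exist):

```python
import numpy as np

def mae(clean, denoised):
    """Mean absolute error between the clean and denoised signals."""
    return float(np.mean(np.abs(clean - denoised)))

def psnr_db(clean, denoised):
    """Peak signal-to-noise ratio (dB), peak taken as max|clean|."""
    mse = np.mean((clean - denoised)**2)
    return float(10.0 * np.log10(np.max(np.abs(clean))**2 / (mse + 1e-12)))

def cross_correlation(clean, denoised):
    """Normalized cross-correlation coefficient at zero lag."""
    return float(np.corrcoef(clean, denoised)[0, 1])
```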


Table 2.3  Comparative study in terms of cross-correlation (CC), mean absolute error (MAE), and peak signal-to-noise ratio (PSNR): results obtained from the computations of the mean of seven values of MAE, the mean of seven values of PSNR, and the mean of seven values of CC, corresponding to seven noisy ECG signals 100–106.dat corrupted by Gaussian white noise with different values of SNRi before denoising (varying from −5 dB to 15 dB)

Technique | Metric | SNRi = −5 dB | SNRi = 0 dB | SNRi = 5 dB | SNRi = 10 dB | SNRi = 15 dB
The proposed technique using thr = σ and db6 | MAE | 0.0546 | 0.0324 | 0.0202 | 0.0134 | 0.0094
 | PSNR (dB) | 22.7372 | 27.1931 | 31.3291 | 34.9012 | 38.0214
 | CC | 0.8898 | 0.9606 | 0.9848 | 0.9934 | 0.9968
The proposed technique using thr = σ and db7 | MAE | 0.0545 | 0.0320 | 0.0200 | 0.0133 | 0.0093
 | PSNR (dB) | 22.7402 | 27.2582 | 31.3616 | 34.9058 | 38.0688
 | CC | 0.8894 | 0.9603 | 0.9847 | 0.9934 | 0.9967
The proposed technique using thr = 0.6 × σ and db6 | MAE | 0.0615 | 0.0370 | 0.0226 | 0.0147 | 0.0095
 | PSNR (dB) | 21.6817 | 26.1476 | 30.5136 | 34.5232 | 38.0938
 | CC | 0.8544 | 0.9475 | 0.9810 | 0.9926 | 0.9968
The proposed technique using thr = 0.6 × σ and db7 | MAE | 0.0613 | 0.0367 | 0.0225 | 0.0143 | 0.0094
 | PSNR (dB) | 21.6894 | 26.1785 | 30.4986 | 34.4883 | 38.1097
 | CC | 0.8535 | 0.9471 | 0.9808 | 0.9925 | 0.9967
The proposed technique using thr = σ·√(2·log(N)) and db6 | MAE | 0.1560 | 0.0916 | 0.0514 | 0.0279 | 0.0176
 | PSNR (dB) | 14.3855 | 18.3641 | 23.3862 | 28.5837 | 32.5828
 | CC | 0.7232 | 0.8408 | 0.9374 | 0.9766 | 0.9898
The proposed technique using thr = σ·√(2·log(N)) and db7 | MAE | 0.1494 | 0.0878 | 0.0499 | 0.0272 | 0.0168
 | PSNR (dB) | 14.4195 | 18.6865 | 23.6800 | 28.7078 | 32.6570
 | CC | 0.7247 | 0.8453 | 0.9376 | 0.9764 | 0.9897
The proposed technique based on BWT and FWT_TI with hard thresholding [49] | MAE | 0.0755 | 0.0391 | 0.0233 | 0.0147 | 0.0098
 | PSNR (dB) | 19.7287 | 25.3384 | 29.9101 | 34.0722 | 37.6319
 | CC | 0.7830 | 0.9377 | 0.9788 | 0.9920 | 0.9963
The 1-D double-density complex DWT denoising method [43–45] | MAE | 0.0687 | 0.0393 | 0.0233 | 0.0146 | 0.0098
 | PSNR (dB) | 20.3866 | 25.2202 | 29.8219 | 34.0422 | 37.6908
 | CC | 0.8010 | 0.9294 | 0.9755 | 0.9908 | 0.9962
The technique based on non-local means [47, 48] | MAE | 0.0639 | 0.0438 | 0.0282 | 0.0173 | 0.0113
 | PSNR (dB) | 21.0811 | 24.5830 | 28.4559 | 32.6392 | 36.4479
 | CC | 0.8240 | 0.9291 | 0.9716 | 0.9887 | 0.9952
The technique based on wavelets and hidden Markov models [46] | MAE | 0.0668 | 0.0477 | 0.0258 | 0.0148 | 0.0101
 | PSNR (dB) | 20.4772 | 23.6498 | 27.9770 | 33.8433 | 37.5430
 | CC | 0.8294 | 0.9168 | 0.9685 | 0.9914 | 0.9963

2.4  Conclusion

In this work, we proposed a new ECG denoising technique based on the application of the 1-D double-density complex DWT denoising method in the stationary bionic wavelet transform (SBWT) domain. This approach consists, in the first step, of applying the SBWT to the noisy ECG signal in order to obtain two noisy coefficients, wtb1 and wtb2. The coefficient wtb1 is a detail stationary bionic coefficient, and the coefficient wtb2 is an approximation one. To estimate the level of the white noise corrupting the original ECG signal, we use the first coefficient wtb1, which is then thresholded using the soft thresholding function. The noisy approximation wtb2 is denoised by using the 1-D double-density complex DWT denoising method. The latter requires determining the noise level, which is estimated from wtb1 as previously mentioned. The results obtained from the SNR, MSE, MAE, PSNR, and CC computations show the performance of the proposed technique. In fact, the noise was considerably reduced, and the P-QRS-T waves are practically conserved. Moreover, this proposed technique outperformed the four other denoising approaches applied in this work for our comparative study, and it was also compared to several existing ECG denoising approaches in order to study its effectiveness.

References 1. Chen, S., Dong, X., Xiong, Y., Peng, Z., Zhang, W.: Nonstationary signal denoising using an envelope-tracking filter. IEEE/ASME Trans. Mechatron. 23(4), 2004–2015 (2018) 2. Ignjatović, A., Wijenayake, C., Keller, G.: Chromatic derivatives and approximations in practice—part II: nonuniform sampling, zero-crossings reconstruction, and denoising. IEEE Trans. Signal Process. 66(6), 1513–1525 (2018) 3. Muduli, P.R., Mandal, A.K., Mukherjee, A.: An antinoise-folding algorithm for the recovery of biomedical signals from noisy measurements. IEEE Trans. Instrum. Meas. 66(11), 2909–2916 (2017)


4. Houamed, I., Saidi, L., Srairi, F.: ECG signal denoising by fractional wavelet transform thresholding. Res. Biomed. Eng. 36, 349–360 (2020). https://doi.org/10.1007/s42600-­020-­00075-­7 5. Vargas, V.A.C.P.: Electrocardiogram signal denoising by clustering and soft thresholding Regis Nunes. IET Signal Process. 12(9), 1165–1171 (2018) 6. Hesar, H.D., Mohebbi, M.: An adaptive particle weighting strategy for ECG denoising using marginalized particle extended Kalman filter: an evaluation in arrhythmia contexts. IEEE J. Biomed. Health Inform. 21(6), 1581–1592 (2017) 7. Pham, D.H., Meignen, S., Dia, N., Jallon, J.F., Rivet, B.: Phonocardiogram signal denoising based on nonnegative matrix factorization and adaptive contour representation computation. IEEE Signal Process. Lett. 25(10), 1475–1479 (2018) 8. Shubhranshu, S.: Denoising and Artifacts Removal in ECG Signals. PhD thesis. National Institute of Technology, Rourkela (India) (2015). 9. Van Alste, J.A., Schilder, T.S.: Removal of base-line wander and power-line interference from the ECG by an efficient FIR filter with a reduced number of taps. In: IEEE Transactions on Biomedical Engineering. BME 32(12): 1052–1060 (1985) 10. Maniruzzaman, M., Kazi, M., Billah, S., Biswas, U., Gain, B.: Least-mean square algorithm based adaptive filters for removing power line interference from ECG signal. In: IEEE International Conference on Informatics, Electronics & Vision (ICIEV’12), pp. 737–740 (2012) 11. Vullings, R., Vries, B., Bergmans, J.W.M.: An adaptive Kalman filter for ECG signal enhancement. I.E.E.E. Trans. Biomed. Eng. 58(4), 1094–1103 (2011) 12. Sayadi, O., Shamsollahi, M.B.: ECG denoising and compression using a modified extended Kalman filter structure. I.E.E.E.  Trans. Biomed. Eng. 55(9), 2240–2248 (2008). https://doi. org/10.1109/TBME.2008.921150 13. Lu, G., Brittain, J.S., Holland, P., Yianni, J., Green, A.L., Stein, J.F., Aziz, T.Z., Wang, S.: Removing ECG noise from surface EMG signals using adaptive filtering. Neurosci. Lett. 462(1), 14–19 (2009). https://doi.org/10.1016/j.neulet.2009.06.063 14. Chang, K.M., Liu, S.H.: Gaussian noise filtering from ECG by wiener filter and ensemble empirical mode decomposition. J. Signal Process. Syst. 64(2), 249–264 (2011) 15. Oliveira, B.R., Duarte, M.A.Q., Abreu, C.C.E., Vieira, F.J.: A wavelet-based method for power-line interference removal in ECG signals. Res. Biomed. Eng. 34(1), 73–86 (2018) 16. El-Dahshan, E.-S.A.: Genetic algorithm and wavelet hybrid scheme for ECG signal denoising. Telecommun. Syst. 46(3), 209–215 (2011) 17. Ling, B.W.-K., Ho, C.Y.-F., Lam, H.-K., Wong, T.P.-L., Chan, A.Y.-P., Tam, P.K.S.: Fuzzy rule based multiwavelet ECG signal denoising. In: IEEE International Conference on FUZZY Systems: (FUZZY 2008); Hong Kong, China (2008) 18. Sharma, L.N., Dandapat, S., Mahanta, A.: ECG signal denoising using higher order statistics in wavelet subbands. Biomed. Signal Process. Cont. 5, 214–222 (2010) 19. Ercelebi, E.: Electrocardiogram signals de-noising using lifting-based discrete wavelet transform. Comput. Biol. Med. 34(6), 479–493 (2004) 20. Sharma, L.N., Dandapat, S., Mahanta, A.: ECG signal denoising using higher order statistics in wavelet subbands. Biomed. Signal Process. Cont. 5(3), 214–222 (2010). https://doi. org/10.1016/j.bspc.2010.03.003 21. Mehmet, U., Muammer, G., Abdulkadir, S., Fikret, A.: Denoising of weak ECG signals by using wavelet analysis and fuzzy thresholding. Net. Mod. Anal. Heal. Inform. Bioinforma. 1(4), 135–140 (2012). 
https://doi.org/10.1007/s13721-­012-­0015-­5 22. Blanco-Velasco, M., Weng, B., Barner, K.: ECG signal denoising and baseline wander correction based on the empirical mode decomposition. Comput. Biol. Med. 38(1), 1–13 (2008) 23. Bouny, L., Khalil, M., Adib, A.: ECG signal denoising based on ensemble EMD thresholding and higher order statistics. In: IEEE International Conference on Advanced Technologies for Signal and Image Processing (ATSIP’2017), Morocco (2017) 24. Nguyen, P., Kim, J.M.: Adaptive ECG denoising using genetic algorithm based thresholding and ensemble empirical mode decomposition. Inf. Sci. 373, 499–511 (2016)


25. Kopsinis, Y., McLaughlin, S.: Development of EMD-based denoising methods inspired by wavelet thresholding. IEEE Trans. Signal Process. 57(4), 1351–1362 (2009)
26. Kabir, M.A., Shahnaz, C.: An ECG signal denoising method based on enhancement algorithms in EMD and wavelet domains. In: IEEE Region 10 Conference TENCON, pp. 284–287 (2011)
27. Manas, R., Susmita, D.: An efficient ECG denoising methodology using empirical mode decomposition and adaptive switching mean filter. Biomed. Signal Process. Cont. 40, 140–148 (2018)
28. Chunqiang, Q., Honghong, S., Helong, Y.: Local means denoising of ECG signal. Biomed. Signal Process. Cont. 53 (2019)
29. Mourad, T.: New approach of ECG denoising based on 1-D double-density complex DWT and SBWT. Comput. Methods Biomech. Biomed. Eng. Imaging Vis. (2020). https://doi.org/10.1080/21681163.2020.1763203
30. Shen, H., Chen, Y.Q., Qiu, T.S.: Fractional Processes and Fractional Order Signal Processing. Springer, Berlin (2012)
31. Jianhong, W., Yongqiang, Y., Xiang, P., Xudong, G.: Parallel-type fractional zero-phase filtering for ECG signal denoising. Biomed. Signal Process. Cont. 18, 36–41 (2015)
32. Tseng, C.C., Lee, S.L.: Design of linear phase FIR filters using fractional derivative constraints. Signal Process. 92, 1317–1327 (2012)
33. Tseng, C.C.: Design of fractional order digital FIR differentiators. IEEE Signal Process. Lett. 8(3), 77–79 (2001)
34. Benmalek, M., Charef, A.: Digital fractional order operators for R-wave detection in electrocardiogram signal. IET Signal Process. 3(5), 381–391 (2009)
35. Abdelliche, F., Charef, A., Ladaci, S.: Complex fractional and complex Morlet wavelets for QRS complex detection. In: ICFDA'14 International Conference on Fractional Differentiation and Its Applications, Catania, Italy (2014)
36. Abdelliche, F., Charef, A., Talbi, M.L., Benmalek, M.: A fractional wavelet for QRS detection. In: IEEE International Conference on Information & Communication Technologies, pp. 1186–1189 (2006)
37. Abdelliche, F., Charef, A.: Fractional wavelet for R-wave detection in ECG signal. Crit. Rev. Biomed. Eng. 36(2), 79–91 (2008)
38. Abdelliche, F., Charef, A.: R-peak detection using a complex fractional wavelet. In: IEEE International Conference on Electrical and Electronics Engineering (ELECO 2009), pp. 267–270 (2009)
39. Yao, J., Zhang, Y.T.: Bionic wavelet transform: a new time-frequency method based on an auditory model. IEEE Trans. Biomed. Eng. 48(8), 856–863 (2001)
40. Omid, S., Mohammad, B.S.: Multiadaptive bionic wavelet transform: application to ECG denoising and baseline wandering reduction. EURASIP J. Adv. Signal Process., 1–11 (2007)
41. Yao, J., Zhang, Y.T.: The application of bionic wavelet transform to speech signal processing in cochlear implants using neural network simulations. IEEE Trans. Biomed. Eng. 49(11), 1299–1309 (2002)
42. Yuan, X.: Auditory model-based bionic wavelet transform for speech enhancement. M.S. Thesis, Speech and Signal Processing Laboratory, Marquette University, Milwaukee, WI, USA (2003)
43. Ivan, W.S.: The double-density dual-tree DWT. IEEE Trans. Signal Process. 52(5), 1304–1314 (2004). https://doi.org/10.1109/TSP.2004.826174
44. Haslaile, A., Dean, C.: Double density wavelet for EEG signal denoising. In: Second International Conference on Machine Learning and Computer Science (IMLCS'2013), Kuala Lumpur, Malaysia, pp. 51–53 (2013)
45. Vimala, C., Aruna, P.P.: Double density dual tree discrete wavelet transform implementation for degraded image enhancement. In: National Conference on Mathematical Techniques and its Applications (NCMTA 18), Kattankulathur, India. Journal of Physics: Conference Series, vol. 1000 (2018)


46. Crouse, M., Nowak, R., Baraniuk, R.: Wavelet-based statistical signal processing using hidden Markov models. IEEE Trans. Signal Process. 46, 886–902 (1998)
47. Brian, H.T., Eric, L.M.: Nonlocal means denoising of ECG signals. IEEE Trans. Biomed. Eng. 59(9), 2383–2386 (2012). https://doi.org/10.1109/TBME.2012.2208964
48. Ambuj, D., Hasnine, M.: Two-stage nonlocal means denoising of ECG signals. Int. J. Advan. Res. Comput. Sci. 5, 114–118 (2014)
49. Mourad, T.: Electrocardiogram de-noising based on forward wavelet transform translation invariant application in bionic wavelet domain. Sadhana J. 39(4), 921–937 (2014). https://doi.org/10.1007/s12046-014-0247-4

Chapter 3

Speech Enhancement Based on SBWT and MMSE Estimate of Spectral Amplitude

3.1  Introduction

In many speech-related applications, the input speech signal frequently suffers from environmental noise and needs further processing by a speech enhancement technique to improve its quality before use [1]. Generally, speech enhancement techniques can be classified into two groups, supervised and unsupervised. Unsupervised techniques include spectral subtraction (SS) [2–4], Wiener filtering [5, 6], short-time spectral amplitude (STSA) estimation [7], and short-time log-spectral amplitude estimation (logSTSA) [8]. Supervised speech enhancement approaches use a training set for learning separate models for the noise and the clean speech signals, with examples including codebook-based approaches [9] and hidden Markov model (HMM)-based ones [10].

Conventional speech enhancement approaches often process a noisy utterance in a frame-wise way, i.e., they enhance each short-time period of the utterance nearly independently. However, some studies have shown that considering the inter-frame variation over a relatively long span of time can contribute to superior performance in speech enhancement [1]. Well-known techniques along this direction include modulation domain spectral subtraction [11], Kalman filtering, and modulation domain Wiener filtering [12, 13]. Furthermore, compared to the Fourier transform (FT), in which only the frequency content is taken into account, the DWT [14] takes care of both the time and frequency aspects of the analyzed signal and has become popular in speech analysis. In wavelet thresholding denoising (WTD) [15], the wavelet transform is applied to split the time-domain signal into sub-bands, and then thresholding is performed. In [16], the DWT [17, 18] was applied to the plain speech feature time series, keeping only the obtained approximation portion, which simultaneously attains data compression and noise robustness in recognition. In [1], the DWT was used to analyze the spectrogram of a noisy utterance along the temporal axis and then attenuate the resulting


detail portion, with the aim of reducing the noise effect and promoting speech quality [1]. Despite the simplicity of its implementation, preliminary evaluation results indicate that the technique proposed in [1] can provide enhanced signals with better perceptual quality, and it was shown that it can be paired with many well-known speech enhancement approaches to achieve even better performance [1].

In this chapter, we detail our speech enhancement technique proposed in [19], which is based on the SBWT [19–21] and on the minimum mean square error (MMSE) estimate of spectral amplitude [22]. The evaluation of this technique [19] is performed by comparing it to four speech enhancement techniques, which are as follows:

–– Unsupervised speech denoising via perceptually motivated robust principal component analysis (PCA) [23]
–– The speech enhancement technique based on MSS-SMPO [24, 25]
–– The denoising technique based on the MMSE estimate of spectral amplitude [22]
–– Our previous speech enhancement technique based on the LWT and an artificial neural network (ANN) and using the MMSE estimate of spectral amplitude [26]

The fourth technique, based on the LWT and an ANN [27–29] and using the MMSE estimate of spectral amplitude [26], consists, in a first step, in applying the LWT to the noisy speech signal to obtain two noisy detail coefficients, cD1 and cD2, and one approximation, cA2. Then, cD1 and cD2 are denoised by soft thresholding; this requires suitable thresholds, thrj, 1 ≤ j ≤ 2, which are determined by an artificial neural network (ANN). This thresholding of cD1 and cD2 yields two denoised coefficients, cDd1 and cDd2. Also, the denoising technique based on the MMSE estimate of spectral amplitude [22] is applied to cA2 to obtain a denoised coefficient, cAd2. Finally, the enhanced speech signal is obtained by applying the inverse LWT, LWT−1, to cDd1, cDd2, and cAd2.

In Sect. 3.2 of this chapter, we deal with the MMSE estimate of spectral amplitude. In Sect. 3.3, we detail our speech enhancement approach proposed in [19]. In Sect. 3.7, we present the results and discussion, and we conclude in Sect. 3.8.

3.2  The MMSE Estimate of Spectral Amplitude

In the literature, it was proposed to estimate the noise power spectral density by MMSE optimal estimation [22]. It was shown that the resulting estimator can be interpreted as a voice activity detector (VAD)-based noise power estimator, in which the noise power is updated only when speech absence is detected, combined with a required bias compensation [22]. It was also shown that the bias compensation becomes unnecessary when the VAD is replaced by a soft speech presence probability (SPP) with fixed priors [22]. Choosing fixed priors has the benefit of decoupling the noise power estimator from the subsequent steps of a speech enhancement algorithm, such as the estimation of the speech power and of the clean speech [22]. Timo Gerkmann et al. [22] showed that the proposed SPP approach maintains the quick noise tracking performance of the bias-compensated MMSE-based method while exhibiting less over-estimation of the spectral noise power and an even lower computational complexity.

3.2.1  Signal Model

In [22], Timo Gerkmann et al. considered a frame-by-frame processing of time-domain signals, where the discrete Fourier transform (DFT) is applied to these frames. Let the complex spectral noise and speech coefficients be given respectively by N_k(l) and S_k(l), where l is the time frame index and k the frequency bin index [22]. In [22], the noise and speech signals were assumed to be additive in the short-time Fourier domain. Consequently, the complex spectral noisy observation is formulated as follows:

Y_k(l) = S_k(l) + N_k(l)    (3.1)

In [22], it was assumed that the noise and speech signals have zero mean and are independent, so that:

E(|Y|²) = E(|S|²) + E(|N|²)    (3.2)

with E(·) as the statistical expectation operator. The spectral noise and speech powers are formulated as follows:

E(|N|²) = σ_N²    (3.3)

E(|S|²) = σ_S²    (3.4)

Then, the a posteriori SNR and the a priori SNR are formulated as follows:

γ = |Y|² / σ_N²    (a posteriori SNR)    (3.5)

ξ = σ_S² / σ_N²    (a priori SNR)    (3.6)

All details about the MMSE-based noise power estimation are given in [22].
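As a toy illustration of Eqs. (3.5) and (3.6), the following Python snippet evaluates both SNRs bin by bin for one frame. The numeric values of Y, σ_N², and σ_S² are made up for the example; in practice, Y would come from the DFT of a frame and the two power terms from a noise/speech power tracker such as the estimator of [22].

```python
import numpy as np

# Made-up spectral values for one frame (three frequency bins).
Y = np.array([0.8 + 0.3j, 1.5 - 0.2j, 0.1 + 0.05j])  # noisy DFT coefficients Y_k(l)
sigma_N2 = np.array([0.20, 0.30, 0.10])              # spectral noise power, Eq. (3.3)
sigma_S2 = np.array([0.50, 1.90, 0.00])              # spectral speech power, Eq. (3.4)

gamma = np.abs(Y) ** 2 / sigma_N2   # a posteriori SNR, Eq. (3.5)
xi = sigma_S2 / sigma_N2            # a priori SNR, Eq. (3.6)
print(gamma, xi)
```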


3.3  The Proposed Speech Enhancement Approach [19]

Our speech enhancement approach proposed in [19] is based on the SBWT [19–21] and on the MMSE estimate of spectral amplitude [22]. It consists in applying the speech enhancement method based on the MMSE estimate of spectral amplitude [1, 22] in the SBWT domain. In fact, this method [22] is applied to each noisy stationary bionic wavelet coefficient in order to denoise it. Those noisy coefficients are obtained by applying the SBWT to the noisy speech signal. After that, the inverse SBWT (SBWT−1) is applied to the denoised coefficients in order to finally obtain the enhanced speech signal. The block diagram of our speech enhancement approach proposed in [19] is illustrated in Fig. 3.1.

Fig. 3.1  The flowchart of the proposed speech enhancement approach: the SBWT decomposes the noisy speech signal into the noisy stationary bionic wavelet coefficients wbi, 1 ≤ i ≤ 8; each coefficient is denoised by the speech enhancement technique based on the MMSE estimate of spectral amplitude [22]; the inverse SBWT (SBWT−1) of the denoised coefficients yields the enhanced speech signal

According to Fig. 3.1, the proposed approach [19] first applies the SBWT to the noisy speech signal to obtain eight noisy stationary bionic wavelet coefficients. Those coefficients are named wbi, 1 ≤ i ≤ 8, and each of them is denoised by applying the speech enhancement method based on the MMSE estimate of spectral amplitude [1, 22]. We therefore obtain eight denoised coefficients, named wdi, 1 ≤ i ≤ 8 (Fig. 3.1). To those coefficients wdi, 1 ≤ i ≤ 8, the inverse SBWT (SBWT−1) is applied, finally giving the enhanced speech signal.
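The following sketch outlines this pipeline in Python under two stated simplifications: PyWavelets' ordinary stationary wavelet transform (pywt.swt) stands in for the SBWT front end, and a simple per-coefficient Wiener-style subband gain stands in for the full MMSE spectral-amplitude estimator of [22]. Both substitutions are ours, so this is an illustration of the analysis/denoise/synthesis structure rather than the exact method of [19].

```python
import numpy as np
import pywt  # PyWavelets: its SWT is used here as a stand-in for the SBWT

def wiener_like_gain(c, noise_var, floor=1e-3):
    # Placeholder for the MMSE spectral-amplitude estimator of [22]:
    # a per-coefficient Wiener gain xi / (1 + xi), with xi >= floor.
    xi = np.maximum(c ** 2 / noise_var - 1.0, floor)
    return xi / (1.0 + xi)

def enhance(noisy, wavelet="db4", level=3):
    pad = (-len(noisy)) % (2 ** level)          # swt needs len divisible by 2**level
    x = np.pad(noisy, (0, pad))
    bands = pywt.swt(x, wavelet, level=level)   # analysis: [(cA_i, cD_i), ...]
    denoised = []
    for cA, cD in bands:
        noise_var = (np.median(np.abs(cD)) / 0.6745) ** 2  # robust MAD estimate
        denoised.append((cA, cD * wiener_like_gain(cD, noise_var)))
    return pywt.iswt(denoised, wavelet)[: len(noisy)]      # synthesis (inverse)
```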

3.4  Minimum Mean Square Error (MMSE) Estimate of Spectral Amplitude in the SBWT Domain

Generally, conventional speech enhancement techniques based on thresholding in the wavelet domain can introduce some degradations into the original speech signal. This occurs precisely for the unvoiced sounds in a speech signal. Therefore, many speech enhancement algorithms based on a wavelet transform apply other tools such as spectral subtraction, Wiener filtering, and MMSE-STSA estimation [30, 31]. For this reason, in our speech enhancement system proposed in [19], we applied the minimum mean square error (MMSE) estimate of spectral amplitude in the SBWT domain. The application of the SBWT solves the perfect reconstruction problem encountered with the BWT [19]. Moreover, among the wavelet transforms [32, 33], the SBWT tends to decorrelate the data [34] and makes noise cancellation easier. The fact that the MMSE estimate of spectral amplitude [22] is applied to each noisy stationary bionic coefficient, wbi, 1 ≤ i ≤ 8 (Fig. 3.1), ensures a better adaptation of the speech and noise estimations compared to applying this technique [22] to the whole noisy speech signal.

3.5  Unsupervised Speech Denoising Via Perceptually Motivated Robust Principal Component Analysis [23]

To overcome the shortcomings of existing sparse and low-rank speech denoising techniques, namely that the auditory perceptual properties are not fully exploited and that the speech degradation is easily perceived, a perceptually motivated robust principal component analysis (ISNRPCA) technique was presented. To reflect the nonlinear frequency perception of the basilar membrane, the cochleagram is employed as the input of ISNRPCA. The latter uses the perceptually meaningful Itakura-Saito measure as its optimization objective function. Furthermore, non-negative constraints are imposed to regularize the decomposed terms with respect to their physical meaning [23]. In [23], G. Min et al. proposed an alternating direction method of multipliers (ADMM) to solve the optimization problem of ISNRPCA. The latter is completely unsupervised: neither a noise model nor a speech model needs to be trained beforehand. Experimental results under diverse noise types and different SNRs show that ISNRPCA gives promising results for speech denoising [23].


3.6  The Speech Enhancement Technique Based on MSS–SMPO [25]

In [25], a two-step enhancement technique based on spectral subtraction and phase spectrum compensation was presented for speech corrupted in diverse environments involving non-stationary noise and medium to low SNR levels. In the first step of the technique proposed in [25], the magnitude of the noisy speech spectrum is modified by a spectral subtraction technique, where a noise estimation approach based on the low-frequency information of the noisy speech was introduced. This noise estimation technique is able to estimate non-stationary noise precisely. In the second step, the phase spectrum of the noisy speech is modified by phase spectrum compensation, where an SNR-dependent technique is incorporated to determine the amount of compensation to be imposed on the phase spectrum [25]. A modified complex spectrum is obtained by combining the magnitude from the spectral subtraction step with the modified phase spectrum from the phase compensation step, which is found to be a better representation of the enhanced speech spectrum.

3.7  Results and Discussion

In this work, the evaluation of the proposed approach is performed by applying it to ten Arabic speech sentences pronounced by a male speaker and ten others pronounced by a female speaker (Table 3.1). These 20 speech signals, sampled at 16 kHz, are artificially degraded by additive noise at different values of the input SNR, SNRi (before denoising). To corrupt these speech signals (Table 3.1), we chose five sorts of noise: white Gaussian, car (Volvo), F16, tank, and factory noises.

Table 3.1  The list of the used Arabic speech sentences

Female speaker:
Signal 1: ‫ﺃﺣﻔﻆ ﻣﻦ ﺍﻷﺭﺽ‬
Signal 2: ‫ﺃﻳﻦ ﺍﻟﻤﺴﺎ ﻓﺮﻳﻦ‬
Signal 3: ‫ﻻ ﻟﻢ ﻳﺴﺘﻤﺘﻊ ﺑﺜﻤﺮﻫﺎ‬
Signal 4: ‫ﺳﻴﺆﺫﻳﻬﻢ ﺯﻣﺎﻧﻨﺎ‬
Signal 5: ‫ﻛﻨﺖ ﻗﺪﻭﺓ ﻟﻬﻢ‬
Signal 6: ‫ﺍﺯﺍﺭ ﺻﺎﺋﻤﺎ‬
Signal 7: ‫ﻛﺎﻝ ﻭ ﻏﺒﻂ ﺍﻟﻜﺒﺶ‬
Signal 8: ‫ﻫﻞ ﻟﺬﻋﺘﻪ ﺑﻘﻮﻝ‬
Signal 9: ‫ﻋﺮﻑ ﻭﺍﻟﻴﺎ ﻭ ﻗﺎﺋﺪﺍ‬
Signal 10: ‫ﺧﺎﻻ ﺑﺎﻟﻨﺎ ﻣﻨﻜﻤﺎ‬

Male speaker:
Signal 1: ‫ﻻ ﻟﻦ ﻳﺬﻳﻊ ﺍﻟﺨﺒﺮ‬
Signal 2: ‫ﺃﻛﻤﻞ ﺑﺎﻹﺳﻼﻡ ﺭﺳﺎﻟﺘﻚ‬
Signal 3: ‫ﺳﻘﻄﺖ ﺇﺑﺮﺓ‬
Signal 4: ‫ﻣﻦ ﻟﻢ ﻳﻨﺘﻔﻊ‬
Signal 5: ‫ﻏﻔﻞ ﻋﻦ ﺿﺤﻜﺎﺗﻬﺎ‬
Signal 6: ‫ﻭ ﻟﻤﺎﺫﺍ ﻧﺸﻒ ﻣﺎﻟﻬﻢ‬
Signal 7: ‫ﺃﻳﻦ ﺯﻭﺍﻳﺎﻧﺎ ﻭ ﻗﺎﻧﻮﻧﻨﺎ‬
Signal 8: ‫ﺻﺎﺩ ﺍﻟﻤﻮﺭﻭﺙ ﻣﺪﻟﻌﺎ‬
Signal 9: ‫ﻧﺒﻪ ﺁﺑﺎﺋﻜﻢ‬
Signal 10: ‫ﺃﻅﻬﺮﻩ ﻭ ﻗﻢ‬

Also, for evaluating the proposed technique, it is compared to three other speech enhancement approaches, which are as follows:

–– The denoising technique based on the MMSE estimate of spectral amplitude [22]
–– The unsupervised speech denoising technique via perceptually motivated robust principal component analysis [23]
–– The speech enhancement approach based on MSS-SMPO [24]

This evaluation is performed through the computation of the SNR (signal-to-noise ratio), the segmental SNR (SSNR), and the PESQ (perceptual evaluation of speech quality); a sketch of how the first two metrics can be computed is given below, after the figure discussion. The results obtained from those computations are presented in Tables 3.2–3.16. According to those tables, the best results are almost always obtained from the application of the proposed technique; this technique therefore outperforms the other speech enhancement approaches [22–25] applied in this evaluation.

In Fig. 3.2, an example of speech enhancement is illustrated, in which the proposed technique is applied to a clean speech signal (Fig. 3.2a) corrupted in an additive manner by car noise (Volvo) with SNR = 0 dB (Fig. 3.2b). According to this figure, the technique considerably reduces the noise and yields an enhanced speech signal (Fig. 3.2c) with little distortion despite the low SNR value (0 dB). In Fig. 3.3, the spectrograms of the clean (Fig. 3.3a), the noisy (Fig. 3.3b), and the enhanced (Fig. 3.3c) speech signals are illustrated. Spectrogram (b) shows that this type of noise is localized in the low-frequency regions, and spectrogram (c) shows that the car noise is considerably reduced by the proposed speech enhancement technique. Moreover, the enhanced speech signal exhibits low distortion compared to the clean speech signal (Fig. 3.3a).
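As announced above, here is a minimal sketch of the two waveform-domain metrics. The frame length and the [−10, 35] dB clamping range of the segmental SNR are common conventions that we assume here; PESQ (ITU-T P.862) requires a dedicated implementation and is therefore not sketched.

```python
import numpy as np

def snr_db(clean, enhanced):
    # Global SNR: clean-signal energy over residual-error energy, in dB.
    err = clean - enhanced
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(err ** 2))

def ssnr_db(clean, enhanced, frame=256, lo=-10.0, hi=35.0):
    # Segmental SNR: per-frame SNRs clamped to [lo, hi] dB, then averaged.
    vals = []
    for i in range(0, len(clean) - frame + 1, frame):
        c = clean[i:i + frame]
        e = c - enhanced[i:i + frame]
        vals.append(10.0 * np.log10(np.sum(c ** 2) / (np.sum(e ** 2) + 1e-12)))
    return float(np.mean(np.clip(vals, lo, hi)))
```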

Table 3.2  Results in terms of SNR (Signal 7 (female voice) corrupted by Gaussian white noise); entries are SNRf (dB). Columns: RPCA [23] = unsupervised speech denoising via perceptually motivated robust principal component analysis; MSS–SMPO [24, 25]; Proposed = the proposed speech enhancement technique; MMSE [22] = the denoising technique based on the MMSE estimate of spectral amplitude. The same column legend applies to Tables 3.3–3.16.

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | 2.7870 | 8.1682 | 8.4026 | 6.3331
0 | 6.9200 | 11.5437 | 12.4447 | 10.4737
5 | 11.0291 | 14.7845 | 15.9887 | 14.2200
10 | 14.1329 | 18.2255 | 19.3911 | 17.6035
15 | 16.7798 | 21.3456 | 22.4836 | 20.8019


Table 3.3  Results in terms of SSNR (Signal 7 (female voice) corrupted by Gaussian white noise); entries are SSNR (dB)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | −2.4187 | 0.8504 | 1.3313 | 1.2531
0 | −2.4610e-04 | 3.1099 | 3.9607 | 2.7508
5 | 2.7044 | 5.4567 | 6.2536 | 5.1757
10 | 5.0003 | 8.6660 | 8.9350 | 7.4637
15 | 7.7230 | 11.7670 | 12.1193 | 10.4470

Table 3.4  Results in terms of PESQ (Signal 7 (female voice) corrupted by Gaussian white noise)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | 1.0636 | 1.4235 | 1.4884 | 1.2531
0 | 1.4606 | 1.8776 | 1.9593 | 1.7421
5 | 1.9352 | 2.2374 | 2.3747 | 2.1773
10 | 2.3212 | 2.5908 | 2.7116 | 2.5503
15 | 2.7461 | 2.9835 | 3.0695 | 2.8944

Table 3.5  Results in terms of SNR (Signal 5 (male voice) corrupted by F16 noise); entries are SNRf (dB)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | 1.2722 | 2.5071 | 3.7283 | 2.3837
0 | 5.1589 | 6.2402 | 8.2589 | 7.1434
5 | 8.9032 | 9.5351 | 11.9891 | 11.1435
10 | 12.9517 | 14.0785 | 15.5901 | 14.6596
15 | 15.9880 | 17.9726 | 19.6030 | 18.3975


Table 3.6  Results in terms of SSNR (Signal 5 (male voice) corrupted by F16 noise); entries are SSNR (dB)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | −4.0510 | −3.7447 | −3.0533 | −3.6560
0 | −2.2335 | −1.9010 | −0.7847 | −1.3701
5 | −0.4451 | −0.1214 | 1.0887 | 0.6802
10 | 1.7251 | 2.7250 | 3.1337 | 2.6493
15 | 3.7474 | 6.0977 | 5.9488 | 5.0968

Table 3.7  Results in terms of PESQ (Signal 5 (male voice) corrupted by F16 noise)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | 1.2831 | 1.2227 | 1.2951 | 1.2593
0 | 1.6378 | 1.5769 | 1.8125 | 1.7017
5 | 2.0750 | 2.1242 | 2.2982 | 2.1889
10 | 2.4313 | 2.5673 | 2.7291 | 2.6444
15 | 2.8416 | 3.0733 | 3.1164 | 3.0182

Table 3.8  Results in terms of SNR (Signal 3 (male voice) corrupted by tank noise); entries are SNRf (dB)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | 1.8657 | 2.8506 | 4.5261 | 3.0084
0 | 5.2569 | 6.0328 | 8.2533 | 6.6513
5 | 8.8912 | 9.7367 | 12.5241 | 10.7510
10 | 12.5670 | 13.6433 | 16.6318 | 14.7634
15 | 15.9158 | 18.2296 | 21.0643 | 19.0401


Table 3.9  Results in terms of SSNR (Signal 3 (male voice) corrupted by tank noise); entries are SSNR (dB)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | −3.0742 | −4.1830 | −3.6883 | −4.2203
0 | −1.2104 | −2.2805 | −1.7476 | −2.6546
5 | 1.0990 | −0.0078 | 0.9546 | −0.1952
10 | 3.6778 | 2.6210 | 3.8260 | 2.5252
15 | 6.2864 | 6.0870 | 7.2508 | 5.7392

Table 3.10  Results in terms of PESQ (Signal 3 (male voice) corrupted by tank noise)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | 0.9941 | 1.3792 | 1.5538 | 1.3848
0 | 1.3492 | 1.8875 | 2.0085 | 1.8400
5 | 1.7720 | 2.3089 | 2.3781 | 2.2503
10 | 2.2027 | 2.5906 | 2.6361 | 2.5094
15 | 2.6084 | 2.7785 | 2.8201 | 2.7240

Table 3.11  Results in terms of SNR (Signal 8 (female voice) corrupted by factory noise); entries are SNRf (dB)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | 3.0176 | 3.7705 | 4.9624 | 3.4404
0 | 6.6901 | 7.8983 | 9.0631 | 7.2378
5 | 10.4536 | 11.7369 | 12.5889 | 11.2022
10 | 13.5863 | 14.9194 | 15.7971 | 14.4791
15 | 16.0742 | 19.2111 | 20.1601 | 18.6555


Table 3.12  Results in terms of SSNR (Signal 8 (female voice) corrupted by factory noise); entries are SSNR (dB)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | −2.6512 | −2.3588 | −1.4115 | −2.1701
0 | −0.5142 | 0.0580 | 0.8899 | −0.0307
5 | 1.6661 | 2.4913 | 2.9617 | 2.1659
10 | 3.5242 | 5.0128 | 5.3739 | 4.4255
15 | 5.5958 | 8.6024 | 8.9245 | 7.7369

Table 3.13  Results in terms of PESQ (Signal 8 (female voice) corrupted by factory noise)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | 0.6882 | 0.5311 | 0.8818 | 0.7038
0 | 0.9916 | 0.9724 | 1.1770 | 1.0327
5 | 1.4558 | 1.6078 | 1.7891 | 1.5358
10 | 1.9024 | 2.1297 | 2.3493 | 2.1484
15 | 2.4498 | 2.6051 | 2.7664 | 2.6077

Table 3.14  Results in terms of SNR (Signal 2 (male voice) corrupted by Volvo noise); entries are SNRf (dB)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | 4.5435 | 3.3737 | 5.4782 | 4.2192
0 | 8.3623 | 7.4176 | 9.9016 | 8.3451
5 | 12.4591 | 12.8324 | 14.4917 | 12.6024
10 | 16.1163 | 17.5726 | 18.9803 | 17.4120
15 | 18.1761 | 20.6149 | 22.9204 | 21.4578


Table 3.15  Results in terms of SSNR (Signal 2 (male voice) corrupted by Volvo noise); entries are SSNR (dB)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | −1.1159 | −1.2951 | −0.1793 | −1.1347
0 | 1.2097 | 1.3262 | 2.9410 | 1.7861
5 | 4.0138 | 5.1028 | 5.9979 | 4.7166
10 | 6.8442 | 8.3857 | 9.1966 | 7.8228
15 | 9.5287 | 10.9227 | 12.4662 | 10.9850

Table 3.16  Results in terms of PESQ (Signal 2 (male voice) corrupted by Volvo noise)

SNRi (dB) | RPCA [23] | MSS–SMPO [24, 25] | Proposed | MMSE [22]
−5 | 2.3027 | 2.2235 | 2.5406 | 2.4021
0 | 2.6867 | 2.5464 | 2.8408 | 2.7163
5 | 3.0848 | 2.8339 | 3.1719 | 3.0184
10 | 3.3970 | 2.9781 | 3.4203 | 3.2461
15 | 3.5439 | 3.0602 | 3.6435 | 3.4789

In the following, we make a comparative study between the proposed technique and our previous speech enhancement approach based on the LWT and an ANN and using the MMSE estimate of spectral amplitude [26]. The first difference between the technique proposed in this chapter and our previous approach is that they use two completely different wavelet transforms: the SBWT for the technique proposed in this chapter and the LWT for our previous approach proposed in [26]. The second difference is that, in the technique proposed in this chapter, the denoising approach based on the MMSE estimate of spectral amplitude [22] is applied to all the stationary bionic wavelet coefficients, whereas in our previous speech enhancement technique [26] this approach [22] is applied only to the approximation coefficient. The latter technique also uses an artificial neural network (ANN), which further differentiates it [26] from the technique proposed in this chapter. The comparison of these two techniques is again in terms of SNR, SSNR, and PESQ. The two techniques are applied to a speech signal corrupted by car noise at different values of the SNR before denoising (SNRi). The results obtained from the computation of the SNR, SSNR, and PESQ for the two techniques are presented in Tables 3.17, 3.18 and 3.19. According to those tables, the best results are mostly obtained from the application of the proposed technique; this technique therefore outperforms the speech enhancement approach proposed in [26].


Fig. 3.2  An example of speech enhancement applying the proposed speech enhancement technique: (a) clean speech signal (male voice, Signal 4), (b) noisy speech signal (clean signal corrupted by additive car noise with SNRi = 0 dB), (c) enhanced speech signal with SNRf = 13.2999 dB, SSNR = 4.3802 dB, and PESQ = 3.0888

3.8  Conclusion

In this chapter, we propose a new speech enhancement technique based on the SBWT and the MMSE estimate of spectral amplitude. This technique consists, in a first step, in applying the SBWT to the noisy speech signal to obtain eight noisy stationary bionic wavelet coefficients. Each of those coefficients is denoised through the application of the denoising technique based on the MMSE estimate of spectral amplitude and, finally, the inverse SBWT (SBWT−1) is applied to the obtained denoised stationary wavelet coefficients. An evaluation of this technique is performed by comparing it to four other speech enhancement approaches. The first approach is the denoising technique based on the MMSE estimate of spectral amplitude. The second one is the speech enhancement technique based on MSS-SMPO. The third one is the unsupervised speech denoising approach via perceptually motivated robust principal component analysis. The fourth one is the speech enhancement technique based on the LWT and an ANN and using the MMSE estimate of spectral amplitude.


Fig. 3.3 (a) The spectrogram of the clean speech signal (Fig. 3.2a), (b) the spectrogram of the noisy speech signal (Fig. 3.2b), (c) the spectrogram of the enhanced speech signal (Fig. 3.2c)


Table 3.17  Results in terms of SNR (Signal 2 (male voice) corrupted by Volvo noise); entries are SNRf (dB)

SNRi (dB) | LWT-ANN with MMSE [26] | Proposed
−5 | 5.8737 | 5.4782
0 | 9.8414 | 9.9016
5 | 14.1647 | 14.4917
10 | 18.5308 | 18.9803
15 | 22.5102 | 22.9204

Table 3.18  Results in terms of SSNR (Signal 2 (male voice) corrupted by Volvo noise); entries are SSNR (dB)

SNRi (dB) | LWT-ANN with MMSE [26] | Proposed
−5 | 0.2145 | −0.1793
0 | 2.7478 | 2.9410
5 | 5.6644 | 5.9979
10 | 8.8942 | 9.1966
15 | 11.9663 | 12.4662

Table 3.19  Results in terms of PESQ (Signal 2 (male voice) corrupted by Volvo noise)

SNRi (dB) | LWT-ANN with MMSE [26] | Proposed
−5 | 2.2837 | 2.5406
0 | 2.5999 | 2.8408
5 | 2.8709 | 3.1719
10 | 3.1190 | 3.4203
15 | 3.3590 | 3.6435

This evaluation is performed through the computation of the signal-to-noise ratio (SNR), the segmental SNR (SSNR), and the perceptual evaluation of speech quality (PESQ). The results obtained from those computations show that the proposed technique outperforms the previously mentioned approaches. Furthermore, this technique considerably reduces the noise corrupting the clean speech signal and yields an enhanced speech signal with good perceptual quality.


References

1. Lee, S.-K., Wang, S.-S., Tsao, Y., Hung, J.-W.: Speech enhancement based on reducing the detail portion of speech spectrograms in modulation domain via discrete wavelet transform. arXiv:1811.03486v1 [eess.AS] (2018)
2. Boll, S.: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Audio Speech Lang. Process. 27(2), 113–120 (1979)
3. Berouti, R., Schwartz, J.M.: Enhancement of speech corrupted by acoustic noise. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 208–211 (1979)
4. Kamath, S., Loizou, P.: A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (2002)
5. Plapous, C., Marro, C., Scalart, P.: Improved signal-to-noise ratio estimation for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 14(6), 2098–2108 (2006)
6. Scalart, P., Filho, J.V.: Speech enhancement based on a priori signal to noise estimation. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 629–632 (1996)
7. Ephraim, Y., Malah, D.: Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Audio Speech Lang. Process. 32(6), 1109–1121 (1984)
8. Ephraim, Y., Malah, D.: Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Audio Speech Lang. Process. (1985)
9. Srinivasan, S., Samuelsson, J., Kleijn, W.: Codebook driven short-term predictor parameter estimation for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 14(1), 163–176 (2006)
10. Zhao, D.Y., Kleijn, W.B.: HMM-based gain modeling for enhancement of speech in noise. IEEE Trans. Audio Speech Lang. Process. 15(3), 882–892 (2007)
11. Paliwal, K.K., Wojcicki, K.K., Schwerin, B.: Single-channel speech enhancement using spectral subtraction in the short-time modulation domain. Speech Comm. 52(5), 450–475 (2010)
12. Hsu, C.-C., Cheong, K.-M., Chien, J.-T., Chi, T.-S.: Modulation Wiener filter for improving speech intelligibility. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 370–374 (2015)
13. So, S., Paliwal, K.K.: Modulation-domain Kalman filtering for single-channel speech enhancement. Speech Comm. 53, 818–829 (2011)
14. Rioul, O., Vetterli, M.: Wavelets and signal processing. IEEE Signal Process. Mag. (1991)
15. Chang, S.G., Yu, B., Vetterli, M.: Adaptive wavelet thresholding for image denoising and compression. IEEE Trans. Image Process. 9, 1532–1546 (2000)
16. Wang, S.-S., Lin, P., Tsao, Y., Hung, J.-W., Su, B.: Suppression by selecting wavelets for feature compression in distributed speech recognition. IEEE Trans. Audio Speech Lang. Process. (2018)
17. Huang, D., Ke, L., Mi, B., Wei, G., Wang, J., Wan, S.: A cooperative denoising algorithm with interactive dynamic adjustment function for security of stacker in industrial Internet of Things. Secur. Comm. Netw. 2019, 4049765 (2019). https://doi.org/10.1155/2019/4049765
18. Nematollahi, M.A., Vorakulpipat, C., Rosales, H.G.: Optimization of a blind speech watermarking technique against amplitude scaling. Secur. Comm. Netw. 2017, 5454768 (2017). https://doi.org/10.1155/2017/5454768
19. Talbi, M., Bouhlel, M.S.: A novel approach of speech enhancement based on SBWT and MMSE estimate of spectral amplitude. In: 4th International Conference on Advanced Systems and Emergent Technologies (IC_ASET) (2020)
20. Talbi, M.: Speech enhancement based on stationary bionic wavelet transform and maximum a posterior estimator of magnitude-squared spectrum. Int. J. Speech Tech. 20, 75–88 (2017). https://doi.org/10.1007/s10772-016-9388-7


21. Talbi, M.: New approach of ECG denoising based on 1-D double-density complex DWT and SBWT. Comp. Methods Biomech. Biomed. Engin. Imaging Vis. (2020). https://doi.org/10.1080/21681163.2020.1763203
22. Gerkmann, T., Hendriks, R.C.: Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Trans. Audio Speech Lang. Process. 20(4), 1383–1393 (2012)
23. Min, G., Zou, X., Han, W., Zhang, X., Tan, W.: Unsupervised speech denoising via perceptually motivated robust principal component analysis. Shengxue Xuebao/Acta Acustica. 42(2), 246–256 (2017)
24. Lu, Y., Loizou, P.: Estimators of the magnitude-squared spectrum and methods for incorporating SNR uncertainty. IEEE Trans. Audio Speech Lang. Process. 19(5), 1123–1137 (2011)
25. Islam, M.T., Asaduzzaman, Shahnaz, C., Zhu, W.P., Ahmad, M.O.: Speech enhancement in adverse environments based on non-stationary noise-driven spectral subtraction and SNR-dependent phase compensation. arXiv preprint arXiv:1803.00396 (2018)
26. Talbi, M., Baazaoui, R., Bouhlel, M.S.: Speech enhancement based on LWT and artificial neural network and using MMSE estimate of spectral amplitude [online first]. IntechOpen (2021). https://doi.org/10.5772/intechopen.96365. Available from https://www.intechopen.com/online-first/speech-enhancement-based-on-lwt-and-artificial-neural-network-and-using-mmse-estimate-of-spectral-am
27. Chen, T., Kapron, N., Chen, J.C.-Y.: Using evolving ANN-based algorithm models for accurate meteorological forecasting applications in Vietnam. Math. Probl. Eng. 2020, 1–8 (2020). https://doi.org/10.1155/2020/8179652
28. Vilavicencio-Arcadia, E., Navarro, S.G., Corral, L.J., Martinez, C.A., Nigoche, A., Kemp, S.N., Ramos-Larios, G.: Application of artificial neural networks for the automatic spectral classification. Math. Probl. Eng. 2020, 1–15 (2020). https://doi.org/10.1155/2020/1751932
29. Yang, K.-C., Yang, C., Chao, P.-Y., Shih, P.-H.: Artificial neural network to predict semiconductor machine outliers. Math. Probl. Eng. 2013, 1–10 (2013). https://doi.org/10.1155/2013/210740
30. Tasmaz, H., Erçelebi, E.: Speech enhancement based on undecimated wavelet packet-perceptual filterbanks and MMSE–STSA estimation in various noise environments. Dig. Signal Process. 18(5), 797–812 (2008)
31. Ephraim, Y., Malah, D.: Speech enhancement using a minimum mean square error short time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 32, 1109–1121 (1984)
32. Biswas, A., Sahu, P.K., Bhowmick, A., Chandra, M.: Feature extraction technique using ERB like wavelet sub-band periodic and aperiodic decomposition for TIMIT phoneme recognition. Int. J. Speech Technol. 17(4), 389–399 (2014)
33. Singh, S., Mutawa, A.M.: A wavelet-based transform method for quality improvement in noisy speech patterns of Arabic language. Int. J. Speech Technol., 1–9 (2016)
34. Bahoura, M., Rouat, J.: Wavelet speech enhancement based on time-scale adaptation. Speech Comm. 48(12), 1620–1637 (2006)

Chapter 4

Arabic Speech Recognition by Stationary Bionic Wavelet Transform and MFCC Using a Multi-layer Perceptron for Voice Control

4.1  Introduction

Speech recognition is the process of recognizing speech uttered by a speaker and has been an active research domain for more than five decades [1]. It is a vital and emerging technology with great potential, and its significance lies in its simplicity. This simplicity, together with the ease of operating a device by voice, offers many advantages. It can be employed in a great number of applications such as household appliances, cellular phones, security devices, and voice command, which is the subject of this chapter.

With the progress of automated systems, the complexity of the integration and recognition problem keeps growing, and the problem becomes even harder when processing randomly varying analog signals such as speech. Numerous techniques have been proposed for the effective extraction of speech parameters for recognition; among them, the MFCC technique [2] is the most widely employed. Research and development on speaker recognition have been undertaken for well over four decades and continue to form an attractive and active area. Approaches have spanned from human auditory [3] and spectrogram comparisons [3], to simple template matching, to dynamic time-warping approaches, to more modern statistical pattern recognition [3] such as neural networks.

The development of an automatic Arabic speech recognition system (Arabic ASR) has become an attractive research domain. Numerous efforts have been made to construct Arabic ASR systems, with promising results [4]; however, the majority of those works employ a reduced vocabulary. Multi-layer perceptron (MLP) classifiers are extensively employed for acoustic modeling in automatic speech recognition (ASR) [5]. The MLP is trained on acoustic features such as perceptual linear predictive (PLP) cepstral coefficients, with its output classes representing the sub-word units of speech such as phonemes.

In [6], the influence of the feature extraction technique on the classification of paved and unpaved roads was studied. The results obtained in [6] show that frequency-based feature extraction, i.e., MFCC and PLP, obtained better performance than statistics-based feature extraction.

Voice is a fine biometric feature for both investigation and authentication, with biological as well as behavioral components; the acoustic features are related to the voice. A speaker recognition system is conceived for the automatic authentication of a speaker's identity [7], which is based on the human voice. The MFCC and the linear prediction cepstrum coefficient (LPCC) are employed for feature extraction from the provided voice sample [7]. In [8], the performance of conventional and hybrid speech feature extraction algorithms (MFCC, LPCC, PLP, and RASTA-PLP) in noisy conditions was investigated employing a multivariate hidden Markov model (HMM) classifier [8, 9]. The behavior of the proposed system was evaluated on the TIDIGIT human voice corpus, recorded from 208 different adult speakers, in both training and testing. The theoretical basis for the speech processing and classifier procedures was presented, and the recognition results were reported in terms of word recognition rate [8].

The classical technique for building an automatic speech recognition (ASR) system employs diverse feature extraction methods at the front-end and various parameter classification methods at the back-end. MFCC and PLP have been the classical feature extraction techniques for many years, and the HMM was the most obvious selection for feature classification. However, the performance of MFCC-HMM and PLP-HMM-based ASR systems degrades in real-time environments. The research in [10] discusses the implementation of a discriminatively trained Hindi ASR system employing noise-robust integrated features and a refined HMM model. It sequentially combines MFCC with PLP and MFCC with gammatone-frequency cepstral coefficients (GFCC) to obtain MF-PLP and MF-GFCC integrated feature vectors, respectively. The HMM parameters are refined using a genetic algorithm (GA) and particle swarm optimization (PSO) [10]. Discriminative training of the acoustic model employing MMI (maximum mutual information) and MPE (minimum phone error) was performed to enhance the accuracy of the system proposed in [10]. The results show that discriminative training using MPE with the MF-GFCC integrated feature vector and PSO-HMM parameter refinement gives significantly better results than the other implemented methods [10].

In [11], a speech recognition system was conceived and the relevant hardware environment built on the Zynq FPGA AX7020 platform. In this system, the feature extraction of speech signals is based on MFCC (Mel Frequency Cepstral Coefficients). The algorithm includes the pre-emphasis, framing and windowing, FFT, MFCC parameter calculation, VED endpoint detection, and DTW operations commonly used in speech system design. The speech recognition algorithm was implemented on the Zynq FPGA platform and also simulated in Matlab, while the audio module AN831 was adopted in the hardware part to realize sound acquisition. Comparing the experimental data, the average recognition rate in the Matlab simulation is 86.67%, while on the Zynq FPGA platform it is 80.67%; using the Zynq FPGA AX7020 as a platform, the recognition rate is higher than with traditional speech recognition technology.


In this chapter, we will detail our approach to Arabic speech recognition with a single voice and a small vocabulary, introduced in [12]. The first step of this approach consists in employing our own speech database containing Arabic words recorded by a single speaker for a voice command application. The second step consists in extracting features from those recorded words, and the third step in classifying those extracted features. The feature extraction is performed by first applying the stationary bionic wavelet transform (SBWT) to each recorded word and then calculating the Mel Frequency Cepstral Coefficients (MFCCs) [13–26] from the vector obtained by concatenating the stationary bionic wavelet coefficients. The obtained MFCCs are then concatenated to construct one input of an MLP [27–37] employed for feature classification.

4.2  The Feature Extraction

This stage is very important in a robust speaker identification system, because the quality of pattern matching and speaker modeling strongly depends on the quality of the feature extraction technique. Different speech feature extraction techniques [38, 39], such as PLP, LPC, LPCC, RCC, MFCC, ∆MFCC, ∆∆MFCC, and wavelets [12, 40, 41], have been applied to extract features from the speech signal. In this work, we extract the Mel Frequency Cepstral Coefficients (MFCCs) from the vector obtained by concatenating the different stationary bionic wavelet coefficients. These coefficients are obtained by applying the SBWT to the used words. Then, all the obtained MFCCs are concatenated to be used as one input of the MLP.

4.2.1  MFCC Extraction

The MFCC [42] feature extraction technique basically includes windowing the signal, applying the DFT, taking the log of the magnitude, warping the frequencies on a Mel scale, and then applying the inverse DCT. A detailed description of the various steps involved in MFCC feature extraction is given below.

4.3  Pre-emphasis

Pre-emphasis refers to filtering that emphasizes the higher frequencies. Its aim is to balance the spectrum of voiced sounds, which have a steep roll-off in the high-frequency region. For voiced sounds, the glottal source has an approximately −12 dB/octave slope [43, 44], while the radiation of acoustic energy from the lips causes a roughly +6 dB/octave boost to the spectrum. As a result, a speech signal recorded with a microphone from a distance has approximately a −6 dB/octave downward slope compared to the true spectrum of the vocal tract. Consequently, pre-emphasis cancels some of the glottal effects from the vocal tract parameters. The most frequently used pre-emphasis filter is given by the transfer function [43]:

H(z) = 1 − b·z⁻¹    (4.1)

where the parameter b controls the slope of the filter and is commonly between 0.4 and 1.0 [44].
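As a small illustration, Eq. (4.1) amounts to the time-domain difference y[n] = x[n] − b·x[n − 1]. A minimal sketch in Python (NumPy), with b = 0.97 as an assumed, typical choice:

```python
import numpy as np

def pre_emphasis(x, b=0.97):
    # H(z) = 1 - b*z^-1 applied in the time domain: y[n] = x[n] - b*x[n-1].
    return np.append(x[0], x[1:] - b * x[:-1])
```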

4.4  Frame Blocking and Windowing

The speech signal is a slowly time-varying, quasi-stationary signal. For stable acoustic characteristics, speech needs to be examined over a sufficiently short period of time; therefore, speech analysis is always carried out on short segments across which the speech signal is assumed to be stationary. Short-term spectral measurements are typically carried out over 20 ms windows advanced every 10 ms [45, 46]. Advancing the time window every 10 ms enables the temporal characteristics of individual speech sounds to be tracked, while the 20 ms analysis window is usually sufficient to provide good spectral resolution of these sounds and, at the same time, is short enough to resolve significant temporal characteristics. The purpose of the overlapping analysis is that each speech sound of the input sequence is approximately centered at some frame. On each frame, a window is applied to taper the signal towards the frame boundaries; generally, Hamming or Hanning windows are employed [44]. This is done to enhance the harmonics, smooth the edges, and reduce the edge effect when applying the DFT to the signal.
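A sketch of the 20 ms / 10 ms framing with a Hamming taper described above; the frame and hop lengths are derived from the sampling frequency fs, and the parameter defaults are our assumptions:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20, hop_ms=10):
    # Slice the signal into 20 ms frames advanced every 10 ms and apply
    # a Hamming window to taper each frame towards its boundaries.
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    window = np.hamming(frame_len)
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])
```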

4.5  DFT Spectrum

Each windowed frame is converted into a magnitude spectrum by applying the DFT:

X(k) = Σ_{n=0}^{N−1} x(n) e^{−j2πnk/N},  0 ≤ k ≤ N − 1    (4.2)

where N is the number of points used to compute the DFT.
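In practice Eq. (4.2) is computed with the FFT, and for real frames only the first N/2 + 1 bins are needed; a one-function sketch (n_fft = 512 is an assumed choice):

```python
import numpy as np

def magnitude_spectrum(frames, n_fft=512):
    # |X(k)| of Eq. (4.2) for every windowed frame, via the real FFT.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
```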


4.6  Mel Spectrum

The Mel spectrum is computed by passing the Fourier-transformed signal through a set of band-pass filters known as a Mel filter bank. A Mel is a unit of measure based on the human ear's perceived frequency. It does not correspond linearly to the physical frequency of the tone, as the human auditory system apparently does not perceive pitch linearly. The Mel scale has an approximately linear frequency spacing below 1 kHz and a logarithmic spacing above 1 kHz [47]. The approximation of Mel from physical frequency can be expressed as follows:

f_Mel = 2595 log₁₀(1 + f/700)    (4.3)

where f designates the physical frequency in Hz and f_Mel the perceived frequency [45].

Filter banks can be implemented in both the time and frequency domains. For MFCC calculation, filter banks are commonly implemented in the frequency domain. The center frequencies of the filters are typically evenly spaced on the frequency axis; to mimic the human ear's perception, the axis is warped according to the nonlinear function given in Eq. (4.3). The most commonly employed filter shape is triangular, and in some cases the Hanning filter can be found [44]. The triangular filter bank with Mel frequency warping is illustrated in Fig. 4.1. The Mel spectrum of the magnitude spectrum X(k) is calculated by multiplying the magnitude spectrum by each of the triangular Mel weighting filters:

s(m) = Σ_{k=0}^{N−1} [ |X(k)|² H_m(k) ],  0 ≤ m ≤ M − 1    (4.4)

where M is the total number of triangular Mel weighting filters [48, 49] and H_m(k) designates the weight given to the kth energy spectrum bin contributing to the mth output band, given by the following expression:

H_m(k) = 0                                          for k < f(m − 1)
H_m(k) = 2(k − f(m − 1)) / (f(m) − f(m − 1))        for f(m − 1) ≤ k ≤ f(m)
H_m(k) = 2(f(m + 1) − k) / (f(m + 1) − f(m))        for f(m) < k ≤ f(m + 1)
H_m(k) = 0                                          for k > f(m + 1)
    (4.5)

where m ranges from 0 to M − 1.
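A sketch of Eqs. (4.3)–(4.5): center frequencies evenly spaced on the Mel axis are mapped back to FFT bins, and triangular weights H_m(k) are built between adjacent edges. We use the unit-height variant of the triangles; the factor 2 in Eq. (4.5) only rescales each band.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)      # Eq. (4.3)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)    # inverse of Eq. (4.3)

def mel_filter_bank(n_filters, n_fft, fs):
    # M + 2 edges evenly spaced on the Mel axis, mapped to FFT bin indices.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            H[m - 1, k] = (k - lo) / max(mid - lo, 1)      # rising edge
        for k in range(mid, hi):
            H[m - 1, k] = (hi - k) / max(hi - mid, 1)      # falling edge
    return H

# Mel spectrum of Eq. (4.4): the squared magnitude spectrum weighted by each
# filter, e.g. mel_spec = magnitude_spectrum(frames) ** 2 @ mel_filter_bank(26, 512, fs).T
```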


Fig. 4.1  Mel filter bank

4.7  Discrete Cosine Transform (DCT)

Since the vocal tract is smooth, the energy levels in adjacent bands tend to be correlated. The DCT is applied to the transformed Mel frequency coefficients to produce a set of cepstral coefficients. Prior to applying the DCT, the Mel spectrum is usually represented on a log scale. This results in a signal in the cepstral domain with a quefrency peak corresponding to the pitch of the signal and a number of formants representing low-quefrency peaks. Since most of the signal information is carried by the first few MFCC coefficients, the system can be made robust by extracting only those coefficients and truncating or ignoring the higher-order DCT components [43]. Finally, the MFCCs are calculated as follows [43, 44]:

c(n) = Σ_{m=0}^{M−1} log₁₀(s(m)) cos(πn(m + 0.5)/M),  n = 0, 1, 2, …, C − 1    (4.6)

where c(n) denotes the cepstral coefficients and C the number of MFCCs. Classical MFCC systems employ only 8–13 cepstral coefficients. The zeroth coefficient is frequently excluded, since it represents the average log-energy of the input signal, which carries only little speaker-specific information [43].
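Eq. (4.6) is a type-II DCT of the log Mel spectrum; a sketch with SciPy, keeping C = 13 coefficients as a typical choice, with an orthonormal DCT scaling and a small offset to avoid taking the log of zero (both of these are our assumptions):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_mel(mel_spec, n_ceps=13):
    # Eq. (4.6): DCT-II of log10(s(m)); keep the first C cepstral coefficients.
    return dct(np.log10(mel_spec + 1e-10), type=2, axis=1, norm="ortho")[:, :n_ceps]
```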


4.8  Dynamic MFCC Features

The cepstral coefficients are usually referred to as static features, since they only contain information from a given frame. Extra information about the temporal dynamics of the signal is obtained by calculating the first and second derivatives of the cepstral coefficients [50–52]. The first-order derivative is called the delta coefficients, and the second-order derivative the delta–delta coefficients. Delta coefficients tell about the speech rate, and delta–delta coefficients provide information similar to the acceleration of speech. A commonly used definition for computing the dynamic parameters is [50]:

Δc_m(n) = ( Σ_{i=−T}^{T} k_i c_m(n + i) ) / ( Σ_{i=−T}^{T} |i| )    (4.7)

where c_m(n) designates the mth feature for the nth time frame, k_i is the ith weight, and T is the number of successive frames used for the computation. Generally, T is taken as 2. The delta–delta coefficients are computed by taking the first-order derivative of the delta coefficients.

4.9  Classifiers

For classification, we employed in our previous work [12] the multi-layer perceptron (MLP) [53], the most popular network architecture in use today, due originally to Rumelhart and McClelland [54]. Each unit performs a biased weighted sum of its inputs and passes this activation level through a transfer function to produce its output; the units are arranged in a layered feedforward topology. The network thus has a simple interpretation as a form of input-output model, with the weights and thresholds (biases) as the free parameters of the model. Such networks can model functions of almost arbitrary complexity, with the number of layers and the number of units in each layer determining the function complexity.

Important issues in MLP design include the specification of the number of hidden layers and the number of units in these layers. The number of input and output units is defined by the problem (there may be some uncertainty about precisely which inputs to use; however, for the moment, we will assume that the input variables are intuitively selected and are all meaningful). The number of hidden units to use is far from clear. As good a starting point as any is to use one hidden layer with a varying number of units: in this work, we vary the number of hidden units from 20 to 200. Figure 4.2 illustrates the employed MLP, which we trained with the backpropagation algorithm.


Fig. 4.2  The architecture of the used MLP: an input layer of 70 units fed by the parameter vector, one hidden layer, and an output layer of 10 units (word no. 1 to word no. 10)
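To make the classifier stage concrete, here is a sketch with scikit-learn's MLPClassifier on synthetic stand-in features; the 70-dimensional vectors, 10 classes, and 25 repetitions mirror the setup described above, the real inputs would be the concatenated MFCC vectors, and scikit-learn's gradient-based training stands in for plain backpropagation:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 70))        # stand-in for 10 words x 25 repetitions
y = np.repeat(np.arange(10), 25)      # word labels

# One hidden layer; its size would be varied from 20 to 200 as in the text.
clf = MLPClassifier(hidden_layer_sizes=(100,), activation="logistic",
                    max_iter=1000, random_state=0)
clf.fit(X[::2], y[::2])               # part of the occurrences for learning
print("recognition rate:", clf.score(X[1::2], y[1::2]))
```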

4.10  The Proposed Speech Recognition Technique [12]

This technique consists, in the first step, in using our own speech database containing Arabic words recorded by a single voice for a voice command application. The second step consists in extracting features from those recorded words, and the third step in classifying those extracted features. The feature extraction is performed by first applying the stationary bionic wavelet transform (SBWT) to each recorded word; the Mel Frequency Cepstral Coefficients (MFCCs) are then calculated from the vector obtained by concatenating the stationary bionic wavelet coefficients. The obtained MFCCs are concatenated to construct one input vector of a multi-layer perceptron (MLP) used for the feature classification. In the MLP learning and test phases, we used ten Arabic words, each of them repeated 25 times by the same voice. In Fig. 4.3, the different steps of the speech recognition approach proposed in [12] are illustrated.
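Combining the sketches of Sects. 4.3–4.7 with the SWT stand-in for the SBWT used in the Chap. 3 sketch, the feature extraction of this technique could be prototyped as follows. This reuses the helper functions defined in the earlier blocks (pre_emphasis, frame_signal, magnitude_spectrum, mel_filter_bank, mfcc_from_mel), and all names and parameter values are our assumptions, not the exact setup of [12]:

```python
import numpy as np
import pywt  # SWT as a stand-in for the SBWT, as in the Chap. 3 sketch

def sbwt_mfcc_features(signal, fs, wavelet="db4", level=3):
    # 1. Stationary wavelet analysis of the recorded word.
    pad = (-len(signal)) % (2 ** level)
    bands = pywt.swt(np.pad(signal, (0, pad)), wavelet, level=level)
    # 2. Concatenate all subband coefficients into one vector.
    concat = np.concatenate([c for pair in bands for c in pair])
    # 3. MFCCs of the concatenated vector, flattened into one MLP input.
    frames = frame_signal(pre_emphasis(concat), fs)
    mel_spec = magnitude_spectrum(frames) ** 2 @ mel_filter_bank(26, 512, fs).T
    return mfcc_from_mel(mel_spec).ravel()
```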

4.11  Experiments and Results

To evaluate the proposed technique, we tested it on the Arabic words reported in Table 4.1. Each of these words was recorded 25 times by the same voice in order to be employed for learning and testing the MLP: ten occurrences for learning and the rest for testing. For recording these words, we employed the Microsoft Windows Sound Recorder. Each element of the constructed vocabulary is stored


Fig. 4.3  The general architecture of the proposed system: the speech signal passes through feature extraction (SBWT followed by MFCC computation) and then classification by the MLP

Table 4.1  The used vocabulary

Pronunciation | Arabic writing
Khalfa | خلف
Amam | أمام
Asraa | أسرع
Sir | سر
Istader | إستدر
Takadem | تقدم
Trajaa | تراجع
Tawakaf | توقف
Yamin | يمين
Yassar | يسار

and labeled with the corresponding word. For evaluating the speech recognition technique detailed in this chapter and proposed in [12], we used the following other techniques:

• The feature extraction technique based on MFCC.
• The feature extraction technique based on the second-order differential of MFCC (∆∆MFCC).
• The feature extraction technique based on the BWT, which uses the bionic wavelet transform alone.
• The technique BWT with MFCC, which first applies the BWT to the recorded words and then computes the MFCCs.
• The technique BWT with ∆∆MFCC, which first applies the BWT to the recorded words and then computes the ∆∆MFCC.
• The feature extraction technique CWT with MFCC, which first applies the CWT to the used words and then computes the MFCCs.

In Table 4.2, the results obtained with the different techniques are listed.


Table 4.2  Recognition rates obtained for eight different techniques

Feature extraction | Recognition rate
MFCC: Mel frequency cepstral coefficients | 94%
∆∆MFCC | 96.66%
SBWT: Stationary bionic wavelet transform | 09.09%
SBWT with MFCC | 89.09%
SBWT with ∆∆MFCC | 98%
CWT with ∆∆MFCC | 30%
BWT with MFCC | 51.33%
BWT with ∆∆MFCC | 60%

The recognition rates in Table 4.2 show clearly that the proposed technique outperforms the other techniques used in our evaluation.
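The evaluation protocol itself is simple; the following sketch makes it explicit. The `features` mapping, which returns the feature vector of a given (word, repetition) pair, is a hypothetical placeholder, assumed to be filled by the extraction pipeline of Sect. 4.10.

```python
# Sketch of the evaluation protocol: per word, 10 of the 25 recordings
# train the MLP and the remaining 15 test it; the recognition rate is the
# fraction of correctly classified test words. `features` is hypothetical.
import numpy as np

n_words, n_reps, n_train = 10, 25, 10
X_train, y_train, X_test, y_test = [], [], [], []
for w in range(n_words):
    for r in range(n_reps):
        vec = features[(w, r)]              # hypothetical lookup of a feature vector
        if r < n_train:
            X_train.append(vec); y_train.append(w)
        else:
            X_test.append(vec); y_test.append(w)

# After training the MLP on (X_train, y_train):
# predictions = mlp.predict(np.asarray(X_test))
# recognition_rate = np.mean(predictions == np.asarray(y_test))  # e.g. 0.98
```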

4.12  Conclusion

In this chapter, we have detailed our technique for Arabic speech recognition with a single speaker and a reduced vocabulary. This technique was previously proposed in [12]. Its first step consists in employing our own speech database, containing Arabic words recorded by a single speaker for a voice command. Its second step consists in extracting features from those recorded words, and its third step consists in classifying the extracted features. The feature extraction is performed by first applying the stationary bionic wavelet transform (SBWT) to each recorded word; the Mel frequency cepstral coefficients (MFCCs) are then calculated from the vector obtained by concatenating the resulting stationary bionic wavelet coefficients. The obtained MFCCs are in turn concatenated to construct one input vector for a multi-layer perceptron (MLP) used for the feature classification. The recognition rates obtained show clearly that the proposed technique, with a recognition rate of 98%, outperforms the other speech recognition methods applied in our evaluation.

References

1. Benkhellat, Z., Belmehd, A.: Utilisation des Algorithmes Génétiques pour la Reconnaissance de la Parole. SETIT (2009)
2. Maouche, F., Benmohamed, M.: Automatic recognition of Arabic words by genetic algorithm and MFCC modeling. Faculty of Informatics, Mentouri University, Constantine, Algeria
3. Patel, I., Rao, Y.S.: Speech recognition using HMM with MFCC – an analysis using frequency spectral decomposition technique. Signal Image Process. Int. J. 1(2) (2010)


4. Alghamdi, M., Elshafie, M., Al-Muhtaseb, H.: Arabic broadcast news transcription system. J. Speech Technol. (2009)
5. Park, J., Diehl, F., Gales, M., Tomalin, M., Woodland, P.: Training and adapting MLP features for Arabic speech recognition. Proc. IEEE Conf. Acoust. Speech Signal Process. (2009)
6. Cabral, F.S., Fukai, H., Tamura, S.: Feature extraction methods proposed for speech recognition are effective on road condition monitoring using smartphone inertial sensors. Sensors 19, 3481 (2019). https://doi.org/10.3390/s19163481
7. Jain, S., Kishore, B.: Comparative study of voice print based acoustic features: MFCC and LPCC. Int. J. Adv. Eng. Manag. Sci. 3(4), 313–315 (2017)
8. Këpuska, V.Z., Elharati, H.A.: Robust speech recognition system using conventional and hybrid features of MFCC, LPCC, PLP, RASTA-PLP and hidden Markov model classifier in noisy conditions. J. Comp. Comm. 3, 1–9 (2015). https://doi.org/10.4236/jcc.2015.36001
9. Elharati, H.: Performance evaluation of speech recognition system using conventional and hybrid features and hidden Markov model classifier. PhD Thesis, College of Engineering and Science, Florida Institute of Technology (2019)
10. Dua, M., Aggarwal, R.K., Biswas, M.: Discriminative training using noise robust integrated features and refined HMM modeling. J. Intell. Syst. 29(1), 327–344 (2020). https://doi.org/10.1515/jisys-2017-0618
11. Liu, W.: Voice control system based on Zynq FPGA. J. Phys. Conf. Ser. 1631, 012177 (2020). https://doi.org/10.1088/1742-6596/1631/1/012177
12. Talbi, M., Nasr, M.B., Cherif, A.: Arabic speech recognition by stationary bionic wavelet transform and MFCC using a multi layer perceptron for voice control. In: The International Conference on Information Processing and Wireless Systems (IP-WiS), Sousse (2012)
13. Shi, T., Zhen, J.: Optimization of MFCC algorithm for embedded voice system. In: Liang, Q., Wang, W., Liu, X., Na, Z., Li, X., Zhang, B. (eds.) Communications, Signal Processing, and Systems. CSPS 2020. Lecture Notes in Electrical Engineering, vol. 654. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-8411-4_88
14. Kakade, M.N., Salunke, D.B.: An automatic real time speech-speaker recognition system: a real time approach. In: Kumar, A., Mozar, S. (eds.) ICCCE 2019. Lecture Notes in Electrical Engineering, vol. 570. Springer, Singapore (2020). https://doi.org/10.1007/978-981-13-8715-9_19
15. Singh, L., Chetty, G.: A comparative study of recognition of speech using improved MFCC algorithms and Rasta filters. In: Dua, S., Gangopadhyay, A., Thulasiraman, P., Straccia, U., Shepherd, M., Stein, B. (eds.) Information Systems, Technology and Management. ICISTM 2012. Communications in Computer and Information Science, vol. 285. Springer, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-29166-1_27
16. Linh, L.H., Hai, N.T., Van Thuyen, N., Mai, T.T., Van Toi, V.: MFCC-DTW algorithm for speech recognition in an intelligent wheelchair. In: Toi, V., Lien Phuong, T. (eds.) 5th International Conference on Biomedical Engineering in Vietnam. IFMBE Proceedings, vol. 46. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-11776-8_102
17. Sood, M., Jain, S.: Speech recognition employing MFCC and dynamic time warping algorithm. In: Innovations in Information and Communication Technologies (IICT-2020), Proceedings of International Conference on ICRIHE – 2020, Delhi, India (2020)
18. Fahad, M.S., Deepak, A., Pradhan, G., Yadav, J.: DNN-HMM-based speaker-adaptive emotion recognition using MFCC and epoch-based features. Circ. Syst. Signal Process. 40(3) (2021). https://doi.org/10.1007/s00034-020-01486-8
19. Birch, B., Griffiths, C.A., Morgan, A.: Environmental effects on reliability and accuracy of MFCC based voice recognition for industrial human-robot-interaction. Proc. IMechE Part B: J. Eng. Manuf. 235(12), 1939–1948 (2021)
20. Shareef, S.R., Irhayim, Y.F.: A review: isolated Arabic words recognition using artificial intelligent techniques. J. Phys. Conf. Ser. 1897, 012026 (2021). https://doi.org/10.1088/1742-6596/1897/1/012026


21. Araujo, F.A., Riou, M., Torrejon, J., Tsunegi, S., Querlioz, D., Yakushiji, K., Fukushima, A., Kubota, H., Yuasa, S., Stiles, M.D., Grollier, J.: Role of non-linear data processing on speech recognition task in the framework of reservoir computing. Sci. Rep. 10, 328 (2020). https://doi.org/10.1038/s41598-019-56991-x
22. Rajesh, S., Nalini, N.J.: Combined evidence of MFCC and CRP features using machine learning algorithms for singer identification. Int. J. Pattern Recognit. Artif. Intell. 35(1), 2158001 (2021). https://doi.org/10.1142/S0218001421580015
23. Mahmood, A., Köse, U.: Speech recognition based on convolutional neural networks and MFCC algorithm. Adv. Art. Intell. Res. 1(1), 6–12 (2021)
24. Dua, M., Aggarwal, R.K., Biswas, M.: Optimizing integrated features for Hindi automatic speech recognition system. J. Intell. Syst. 29(1), 959–976 (2020). https://orcid.org/0000-0001-7071-8323
25. Naing, H.M.S., Hidayat, R., Hartanto, R., Miyanaga, Y.: Discrete wavelet denoising into MFCC for noise suppressive in automatic speech recognition system. Int. J. Intell. Eng. Syst. 13(2) (2020). https://doi.org/10.22266/ijies2020.0430.08
26. Arjun, K.N., Karthik, S., Kamalnath, D., Chanda, P., Tripathi, S.: Automatic correction of stutter in disfluent speech. In: Third International Conference on Computing and Network Communications (CoCoNet'19), Procedia Computer Science 171, pp. 1363–1370 (2020)
27. Bourlard, H.A., Morgan, N.: Feature extraction by MLP. In: Connectionist Speech Recognition. The Springer International Series in Engineering and Computer Science (VLSI, Computer Architecture and Digital Signal Processing), vol. 247. Springer, Boston, MA (1994). https://doi.org/10.1007/978-1-4615-3210-1_14
28. Manaswi, N.K.: Deep Learning with Applications Using Python. Apress (2018)
29. Joy, J., Kannan, A., Ram, S., Rama, S.: Speech emotion recognition using neural network and MLP classifier. Int. J. Eng. Sci. Comp. 10(4) (2020)
30. Kaur, J., Kumar, A.: Speech emotion recognition using CNN, k-NN, MLP and random forest. In: Computer Networks and Inventive Communication Technologies. Proceedings of Third ICCNCT. Springer, Singapore (2020)
31. Berg, A., O'Connor, M., Cruz, M.T.: Keyword transformer: a self-attention model for keyword spotting. arXiv:2104.00769v3 [eess.AS], 15 June 2021
32. Cai, C., Xu, Y., Ke, D., Su, K.: A fast learning method for multilayer perceptrons in automatic speech recognition systems. J. Robot. 797083, 1–7 (2015). https://doi.org/10.1155/2015/797083
33. Sidi Yakoub, M., Selouani, S.A., Zaidi, B.F., et al.: Improving dysarthric speech recognition using empirical mode decomposition and convolutional neural network. J. Audio Speech Music Process. 2020, 1 (2020). https://doi.org/10.1186/s13636-019-0169-5
34. Wang, Y., Zhang, M., Wu, R.M., Gao, H., Yang, M., Luo, Z., Li, G.: Silent speech decoding using spectrogram features based on neuromuscular activities. Brain Sci. 10, 442 (2020). https://doi.org/10.3390/brainsci10070442
35. Mustafa, M.K., Allen, T., Appiah, K.: A comparative review of dynamic neural networks and hidden Markov model methods for mobile on-device speech recognition. Neural Comput. Applic. 31(Suppl 2), S891–S899 (2019)
36. Eddine, K.S., Fathallah, K., Atouf, I., Mohamed, B.: Parallel implementation of NIOS II multiprocessors, cepstral coefficients of Mel frequency and MLP architecture in FPGA: the application of speech recognition. WSEAS Trans. Signal Process. 16, 146–154 (2020). https://doi.org/10.37394/232014.2020.16.16
37. Park, J., Diehl, F., Gales, M., Tomalin, M., Woodland, P.: Training and adapting MLP features for Arabic speech recognition. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2009)
38. O'Shaughnessy, D.: Speech Communication: Human and Machine. Addison Wesley, Reading, MA (1987)
39. Islam, M.R., Rahmant, M.F., Khant, M.A.G.: Improvement of speech enhancement techniques for robust speaker identification in noise. In: Proceedings of 2009 12th International Conference on Computer and Information Technology (ICCIT 2009), 21–23 December, Dhaka, Bangladesh (2009)
40. Anusuya, M.A., Katti, S.K.: Comparison of different speech feature extraction techniques with and without wavelet transform to Kannada speech recognition. Int. J. Comput. Appl. 26(4), 19–24 (2011)
41. Nasr, M.B., Talbi, M., Adnane, C.: Arabic speech recognition by bionic wavelet transform and MFCC using a multi layer perceptron. In: IEEE Conference Publications, pp. 803–808 (2012). https://doi.org/10.1109/SETIT.2012.6482017
42. Zabidi, A., et al.: Mel-frequency cepstrum coefficient analysis of infant cry with hypothyroidism. Presented at the 2009 5th International Colloquium on Signal Processing & Its Applications, Kuala Lumpur, Malaysia (2009)
43. Rao, K.S., Manjunath, K.E.: Speech Recognition Using Articulatory and Excitation Source Features. SpringerBriefs in Speech Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-49220-9
44. Picone, J.W.: Signal modeling techniques in speech recognition. Proc. IEEE 81, 1215–1247 (1993)
45. Deller, J.R., Hansen, J.H., Proakis, J.G.: Discrete Time Processing of Speech Signals. Wiley, Prentice Hall, NJ (1993)
46. Benesty, J., Sondhi, M.M., Huang, Y.A.: Handbook of Speech Processing. Springer, New York (2008)
47. Volkmann, J., Stevens, S., Newman, E.: A scale for the measurement of the psychological magnitude pitch. J. Acoust. Soc. Am. 8, 185–190 (1937)
48. Fang, Z., Guoliang, Z., Zhanjiang, S.: Comparison of different implementations of MFCC. J. Comput. Sci. Technol. 16, 582–589 (2000)
49. Ganchev, G.K.T., Fakotakis, N.: Comparative evaluation of various MFCC implementations on the speaker verification task. In: Proceedings of International Conference on Speech and Computer (SPECOM), pp. 191–194 (2005)
50. Rabiner, L., Juang, B.-H., Yegnanarayana, B.: Fundamentals of Speech Recognition. Pearson Education, London (2008)
51. Furui, S.: Comparison of speaker recognition methods using statistical features and dynamic features. IEEE Trans. Acoust. Speech Signal Process. 29, 342–350 (1981)
52. Mason, J.S., Zhang, X.: Velocity and acceleration features in speaker recognition. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 3673–3676 (1991)
53. Zabidi, A., Mansor, W., Khuan, L.Y., Yassin, I.M., Sahak, R.: The effect of F-ratio in the classification of asphyxiated infant cries using multilayer perceptron neural network. In: IEEE EMBS Conference on Biomedical Engineering & Sciences (IECBES 2010), Kuala Lumpur, Malaysia, 30 November – 2 December (2010)
54. Rumelhart, D., McClelland, J.L., The PDP Research Group (eds.): Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge (1986)

Index

A
Alternating direction technique of multipliers (ADMM), 55
Arabic speech recognition system, 69
  MFCC and PLP, 70
  MFCC technique, 69
  MLP, 69
  mono-voice and vocabulary, 71
  PSO-HMM parameter, 70
Arabic speech sentences, 15, 56

B
Bionic wavelet transform (BWT), 3, 6
  adaptation factor, 7
  coefficients, 8
  definition, 33
  DWTs and WPT, 8
  Morlet wavelet, 7
  net result, 7
  smoothing, 34

C
Cohen's research work, 5
Continuous wavelet transform (CWT), 4
  advantages, 4
Cross correlation (CC) and PSNR, 45

D
Deep denoising autoencoder (DDAE), 2
Delta–delta coefficients, 75
Denoising approach, 60
Denoising technique, 57
Discrete cosine transform (DCT)
  delta–delta coefficients, 75
  MFCC coefficients, 74
  static features, 75
Discrete Fourier transform (DFT), 53
Discrete wavelet transform (DWT), 5

E
ECG denoising approach, 38
ECG denoising technique, 41
Echo cancellation, 1
Echo suppression, 1
Electrocardiogram (ECG) denoising approach
  application, 38
  BWT, 33
  conventional, 33
  DWT, 36
  DWT and SBWT, 33
  electrical interference, 32
  electrodes, 31
  feature, 34
  fractional wavelets, 33
  frequency content, 32
  HMMs, 37
  interference, 31
  multiwavelet, 32
  NLM technique, 33
  noisy, 45
  non-local means, 37
  1-D double-density complex, 39
  parameters, 35
  R peak, 32
  signals, 31
  SNR and MSE, 41
  time-frequency transformation domains, 32
  transient changes, 32
Evaluation metrics
  signal to noise ratio, 13

F
Feature extraction technique, 71
Fourier transform, 4
Fractional wavelets, 33

G
Gammatone-frequency cepstral coefficient (GFCC), 70
Gaussian white noise (GWN), 33, 41

H
Hidden Markov model (HMM), 37, 51, 70

I
ISNRPCA, 55
Itakura–Saito distance (ISd), 14, 15
Itakura–Saito measure, 55

M
Mean opinion score (MOS) test, 15
Mel frequency cepstral coefficients (MFCC), 71
  analysis window, 72
  calculation, 73
  DFT application, 72
  feature extraction, 71
  pre-emphasis, 71
  speech signal, 72
Mel spectrum, 73
Mel weighting filters, 73
Minimum mean square error (MMSE), 52
  noise and speech signals, 53
  SBWT, 55
  spectral amplitude, 52
  spectral density, 52
Multi-layer perceptron (MLP), 75
  architecture, 76
  design, 75

P
Perceptual evaluation of speech quality (PESQ), 13, 15

S
Short-time spectral amplitude (STSA), 51
Signal denoising, 31
Signal-to-noise ratio (SNR), 44
Spectral amplitude, 52
Spectrogram, 22
Speech enhancement, 1, 2, 13, 54, 56, 63
  classical, 11
  conventional, 51
  denoising technique, 52
  DWT, 51
  flowchart, 4
  listening, 2
  LWT and ANN, 52
  spectral subtraction and phase spectrum compensation, 56
  speech-related applications, 51
  telephone networks speech, 1
  TQWT, 3
Speech recognition, 69
Speech signal, 10, 20, 27
Stationary bionic wavelet transform (SBWT), 3
  application, 8
  BWT, 10
  reconstruction, 10
  speech enhancement, 8
Stationary wavelet transform (SWT)
  filter bank implementation, 9
  WPT and DWT, 9

T
Telephone conversation, 1
Tunable Q-factor-based wavelet transform (TQWT), 3

W
Wavelet-based statistical signal processing approaches, 36
Wavelet packet transform (WPT), 5
  decomposition tree, 5
  perceptual, 5, 6
  signal enhancement, 5
Wavelet thresholding denoising (WTD), 51