Volume 40, Number 7, November 2023 IEEE Signal Processing Magazine



English Pages 104 Year 2023


Call for Papers IEEE Signal Processing Magazine Special Issue on Near-Field Signal Processing: Communications, Sensing and Imaging

Signal processing technologies are moving toward using small and densely packed sensors to create large aperture arrays. This allows for higher angular resolution and beamforming gain. However, with the extended aperture and small wavelength, when the receiver is in the near-field, i.e., it is closer to the transmitter than the Fraunhofer distance, the signal wavefront is no longer planar. Therefore, a spherical wavefront must be considered since the system's performance depends on both the propagation distance and the direction of the signal of interest. As a result, near-field signal processing has recently become an essential technique for both radar sensing and wireless communications to achieve spatial multiplexing with increased degrees of freedom and high-resolution with a range-dependent, very narrow beamwidth. Although near-field processing is a relatively new concept in wireless communications and sensing, it has already been extensively studied in other fields, mainly computational imaging, where the propagation distance is short, e.g., microscopy, holography, and optics, where the far-field assumption fails. The aim of this special issue is to provide a venue for a wide and diverse audience of researchers from academia, government, and industry to survey the recent research advances in major near-field applications such as wireless communications, sensing and imaging. 
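To make the near-field condition above concrete, here is a minimal sketch that evaluates the standard textbook expression for the Fraunhofer distance, d_F = 2D²/λ, where D is the array aperture and λ the wavelength. The function names and the example values (a 0.5 m aperture at 28 GHz) are assumptions chosen for illustration; they do not come from the call.

```python
# Illustrative sketch only: function names and example values are assumptions.
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def fraunhofer_distance(aperture_m: float, frequency_hz: float) -> float:
    """Return the Fraunhofer distance 2 * D**2 / wavelength, in metres."""
    wavelength_m = SPEED_OF_LIGHT / frequency_hz
    return 2.0 * aperture_m ** 2 / wavelength_m


def is_near_field(range_m: float, aperture_m: float, frequency_hz: float) -> bool:
    """True when the receiver is closer than the Fraunhofer distance."""
    return range_m < fraunhofer_distance(aperture_m, frequency_hz)


if __name__ == "__main__":
    d_f = fraunhofer_distance(aperture_m=0.5, frequency_hz=28e9)
    print(f"Fraunhofer distance: {d_f:.1f} m")
    print("10 m link in near field:", is_near_field(10.0, 0.5, 28e9))
```

With these assumed values the boundary lies just under 47 m, so a receiver a few tens of metres away is still in the near field and the planar-wavefront model no longer applies.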
Topics of interest include but are not limited to:
▪ Reactive and radiative near-field signal processing
▪ Near-field spatially varying channels for massive MIMO communications
▪ Short-range THz communications
▪ Active/passive near-field beamforming techniques
▪ Near-field integrated sensing and communications
▪ Near-field sensor array/reflecting surface processing
▪ Near-field automotive radar sensing and imaging
▪ Signal processing for near-field acoustics
▪ Signal processing on the spherical manifolds
▪ Signal processing for mixed near-field and far-field observations
▪ Antenna array calibration for near-field applications
▪ Electromagnetics/physics of near-field beamforming
▪ [...], e.g., Rydberg sensors, holographic surfaces, and frequency diverse arrays
▪ Information metasurfaces for near-field imaging, signal processing, and wireless communications
▪ Signal processing for near-field localization, direction-of-arrival estimation and sensing
▪ Near-field synthetic aperture sounding and radar imaging
▪ Machine learning techniques to enable near-field systems
▪ Near-field wireless power transfer for 5G and IoT
▪ Recent advances in mid- and near-field optics via coded diffraction patterns
▪ Modeling and prototyping in microscopy, holography, Raman spectroscopy, crystallography and optics

Submission Guidelines: White papers are required, and full articles will be invited based on the review of white papers. The white paper format is up to 4 pages in length, including the proposed article title, motivation and significance of the topic, an outline of the proposed paper, and representative references. An author list with contact information and short bios should also be included. Submitted articles must be of tutorial/overview/survey nature, accessible to a broad audience, with significant relevance to the scope of the special issue.

Authors are invited to submit their contributions by following the detailed instructions given at: https://signalprocessingsociety.org/publications-resources/ieee-signal-processing-magazine/information-authors-spm. Manuscripts should be submitted online via http://mc.manuscriptcentral.com/spmag-ieee

Important Dates:
▪ White paper due: 1 March 2024
▪ Invitation notification: 1 April 2024
▪ Full manuscripts due: 15 June 2024
▪ First review to authors: 1 September 2024
▪ Revision due: 1 November 2024
▪ Second review completed: 1 January 2025
▪ Final manuscript due: 1 February 2025
▪ Publication: May 2025

Guest Editors:
▪ Ahmet M. Elbir (Lead), University of Luxembourg, Luxembourg ([email protected])
▪ Ana Isabel Perez-Neira, Centre Tecnológic de Telecomunicaciones de Catalunya, Spain ([email protected])
▪ Henry Arguello, Universidad Industrial de Santander, Colombia ([email protected])
▪ Martin Haardt, Ilmenau University of Technology, Ilmenau, Germany ([email protected])
▪ Moeness G. Amin, Villanova University, USA ([email protected])
▪ Tie Jun Cui, Southeast University, China ([email protected])

Digital Object Identifier 10.1109/MSP.2023.3313848

Contents

Volume 40 | Number 7 | November 2023

FEATURES

POLYNOMIAL EIGENVALUE DECOMPOSITION FOR MULTICHANNEL BROADBAND SIGNAL PROCESSING
Vincent W. Neo, Soydan Redif, John G. McWhirter, Jennifer Pestana, Ian K. Proudler, Stephan Weiss, and Patrick A. Naylor

A SIGNAL PROCESSING INTERPRETATION OF NOISE-REDUCTION CONVOLUTIONAL NEURAL NETWORKS
Luis Albert Zavala-Mondragón, Peter H.N. de With, and Fons van der Sommen

COLUMNS

DSP History
Fourier and the Early Days of Sound Analysis
Patrick Flandrin

Tips & Tricks
Tricks for Designing a Cascade of Infinite Impulse Response Filters With an Almost Linear Phase Response
David Shiung, Jeng-Ji Huang, and Ya-Yin Yang
Super-Resolving a Frequency Band
Ruiming Guo and Thierry Blu
Implementing Moving Average Filters Using Recursion
Shlomo Engelberg
Sub-Nyquist Coherent Imaging Using an Optimizing Multiplexed Sampling Scheme
Yeonwoo Jeong, Behnam Tayebi, and Jae-Ho Han

SP Education
Data Science Education: The Signal Processing Perspective
Sharon Gannot, Zheng-Hua Tan, Martin Haardt, Nancy F. Chen, Hoi-To Wai, Ivan Tashev, Walter Kellermann, and Justin Dauwels

SP Competitions
Synthetic Image Detection
Davide Cozzolino, Koki Nagano, Lucas Thomaz, Angshul Majumdar, and Luisa Verdoliva

ON THE COVER
This issue discusses several topics including Fourier and the early days of sound analysis, multichannel broadband SP, and noise-reduction with CNN.
COVER IMAGE: BACKGROUND, ©SHUTTERSTOCK.COM/IOAT; FOURIER IMAGE, WIKIMEDIA.ORG

[Figure thumbnails from the two feature articles; axes: Angle of Arrival ϕ/[°], Normalized Angular Frequency Ω/π, Gain/[dB]]

IEEE SIGNAL PROCESSING MAGAZINE  (ISSN 1053-5888) (ISPREG) is published bimonthly by the Institute of Electrical and Electronics Engineers, Inc., 3 Park Avenue, 17th Floor, New York, NY 10016-5997 USA (+1 212 419 7900). Responsibility for the contents rests upon the authors and not the IEEE, the Society, or its members. Annual member subscriptions included in Society fee. Nonmember subscriptions available upon request. Individual copies: IEEE Members US$20.00 (first copy only), nonmembers US$248 per copy. Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limits of U.S. Copyright Law for private use of patrons: 1) those post-1977 articles that carry a code at the bottom of the first page, provided the per-copy fee is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923 USA; 2) pre-1978 articles without fee. Instructors are permitted to photocopy isolated articles for noncommercial classroom use without fee. For all other copying, reprint, or republication permission, write to IEEE Service Center, 445 Hoes Lane, Piscataway, NJ 08854 USA. Copyright © 2023 by the Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Periodicals postage paid at New York, NY, and at additional mailing offices. Printed in the U.S.A. Postmaster: Send address changes to IEEE Signal Processing Magazine, IEEE, 445 Hoes Lane, Piscataway, NJ 08854 USA. Canadian GST #125634188 

Digital Object Identifier 10.1109/MSP.2023.3317580

IEEE SIGNAL PROCESSING MAGAZINE | November 2023

DEPARTMENTS

From the Editor
SPS Members, You Are All Heirs of Fourier!
Christian Jutten

President’s Message
Reflections on the Poland Chapter Celebration
Athina Petropulu

Dates Ahead
The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024) will be held in Seoul, Korea, 14–19 April 2024.
©SHUTTERSTOCK.COM/SAYAN URANAN

IEEE Signal Processing Magazine

EDITOR-IN-CHIEF
Christian Jutten, Université Grenoble Alpes, France

AREA EDITORS
Feature Articles
Laure Blanc-Féraud, Université Côte d’Azur, France
Special Issues
Xiaoxiang Zhu, German Aerospace Center, Germany
Columns and Forum
Rodrigo Capobianco Guido, São Paulo State University (UNESP), Brazil
H. Vicky Zhao, Tsinghua University, R.P. China
e-Newsletter
Hamid Palangi, Microsoft Research Lab (AI), USA
Social Media and Outreach
Emil Björnson, KTH Royal Institute of Technology, Sweden

EDITORIAL BOARD
Massoud Babaie-Zadeh, Sharif University of Technology, Iran
Waheed U. Bajwa, Rutgers University, USA
Caroline Chaux, French Center of National Research, France
Mark Coates, McGill University, Canada
Laura Cottatellucci, Friedrich-Alexander University of Erlangen-Nuremberg, Germany
Davide Dardari, University of Bologna, Italy
Mario Figueiredo, Instituto Superior Técnico, University of Lisbon, Portugal
Sharon Gannot, Bar-Ilan University, Israel
Yifan Gong, Microsoft Corporation, USA
Rémi Gribonval, Inria Lyon, France
Joseph Guerci, Information Systems Laboratories, Inc., USA
Ian Jermyn, Durham University, U.K.
Ulugbek S. Kamilov, Washington University, USA
Patrick Le Callet, University of Nantes, France
Sanghoon Lee, Yonsei University, Korea
Danilo Mandic, Imperial College London, U.K.
Michalis Matthaiou, Queen’s University Belfast, U.K.
Phillip A. Regalia, U.S. National Science Foundation, USA
Gaël Richard, Télécom Paris, Institut Polytechnique de Paris, France
Reza Sameni, Emory University, USA
Ervin Sejdic, University of Pittsburgh, USA
Dimitri Van De Ville, Ecole Polytechnique Fédérale de Lausanne, Switzerland
Henk Wymeersch, Chalmers University of Technology, Sweden

ASSOCIATE EDITORS, COLUMNS AND FORUM
Ulisses Braga-Neto, Texas A&M University, USA
Cagatay Candan, Middle East Technical University, Turkey
Wei Hu, Peking University, China
Andres Kwasinski, Rochester Institute of Technology, USA
Xingyu Li, University of Alberta, Edmonton, Alberta, Canada
Xin Liao, Hunan University, China
Piya Pal, University of California San Diego, USA
Hemant Patil, Dhirubhai Ambani Institute of Information and Communication Technology, India
Christian Ritz, University of Wollongong, Australia

ASSOCIATE EDITORS, e-NEWSLETTER
Abhishek Appaji, College of Engineering, India
Subhro Das, MIT-IBM Watson AI Lab, IBM Research, USA
Behnaz Ghoraani, Florida Atlantic University, USA
Panagiotis Markopoulos, The University of Texas at San Antonio, USA

IEEE SIGNAL PROCESSING SOCIETY
Athina Petropulu, President
Min Wu, President-Elect
Ana Isabel Pérez-Neira, Vice President, Conferences
Roxana Saint-Nom, VP Education
Kenneth K.M. Lam, Vice President, Membership
Marc Moonen, Vice President, Publications
Alle-Jan van der Veen, Vice President, Technical Directions

IEEE SIGNAL PROCESSING SOCIETY STAFF
Richard J. Baseil, Society Executive Director
William Colacchio, Senior Manager, Publications and Education Strategy and Services
Rebecca Wollman, Publications Administrator

IEEE PUBLISHING OPERATIONS
Sharon M. Turk, Journals Production Manager
Katie Sullivan, Senior Manager, Journals Production
Gail A. Schnitzer, Associate Art Director
Theresa L. Smith, Production Coordinator
Mark David, Director, Business Development Media & Advertising
Felicia Spagnoli, Advertising Production Manager
Peter M. Tuohy, Director, Production Services
Kevin Lisankie, Director, Editorial Services
Dawn M. Melley, Senior Director, Publishing Operations

Digital Object Identifier 10.1109/MSP.2023.3317582

SCOPE: IEEE Signal Processing Magazine publishes tutorial-style articles on signal processing research and applications as well as columns and forums on issues of interest. Its coverage ranges from fundamental principles to practical implementation, reflecting the multidimensional facets of interests and concerns of the community. Its mission is to bring up-to-date, emerging, and active technical developments, issues, and events to the research, educational, and professional communities. It is also the main Society communication platform addressing important issues concerning all members.

IEEE prohibits discrimination, harassment, and bullying. For more information, visit http://www.ieee.org/web/aboutus/whatis/policies/p9-26.html.

2024 IEEE Conference on Computational Imaging Using Synthetic Apertures (CISA) Advances in Theory, Engineering Practice, and Standardization

National Institute of Standards and Technology | Boulder | Colorado | 20-23 May 2024

Call for Papers

The IEEE Signal Processing Society, the IEEE Synthetic Aperture Standards Committee, and the IEEE Synthetic Aperture Technical Working Group enthusiastically invite you to the scenic campus of the National Institute of Standards and Technology (NIST) in Boulder, Colorado to a unique gathering of researchers and engineers engaged in cutting-edge research on computational imaging and sensing using synthetic apertures (SAs). The term SA refers generically to a discrete measurement scheme together with an inverse problem solution that yields imaging or sensing performance better than the hardware system is inherently capable of, e.g., wider field-of-view, higher angular resolution. An SA may sample a propagating wavefield or environmental parameters in the signal domain via linear motion of an antenna or transducer, as in synthetic aperture radar (SAR), sonar (SAS), or channel sounding. Alternatively, an SA may sample in the k-space domain via different look angles around an object or scene, as in computed tomography, spotlight SAR or Fourier ptychography. Lastly, an SA may be constructed from a sparse array of sensors as in radiometry, seismology, or radio astronomy. The front end of an SA may be a conventional antenna, acoustic transducer, or a quantum sensor, such as a Rydberg probe, in advanced implementations. CISA will highlight advances in the theoretical development, engineering practice, and standardization of all aspects of SA imaging and sensing. Suggested topics for CISA are listed below.

Important Dates
Special session proposals due: October 20, 2023
Initial 2-page abstract submissions due: January 19, 2024
Tutorial proposals due: January 26, 2024
Acceptance notification: March 1, 2024
Camera-ready 4+1-page paper due: April 5, 2024

Radar: Automotive SAR, mmWave and THz SAR, polarimetric SAR, ISAR, 3-D imaging, High-dimensional feature processing using tensors
Sonar: Micronavigation and position uncertainty, Bathymetry, Wideband regimes
Optics: Phase retrieval, Ptychography, Holography, Coded diffraction imaging, Coded aperture imaging, Wirtinger flow, Deep learning techniques
5G: Channel sounding, Over-the-air calibration, MIMO antenna testbeds, Intelligent reflecting surfaces, Near-field beam focusing
Seismology: Wave migration and localization techniques
Inverse problems: Deconvolution and hardware deembedding, Neuromorphic computing methods
Data-driven signal processing: SAR focusing techniques
Magnetic resonance imaging: Image reconstruction from under-sampled measurements
Ultrasound: Flow and velocity estimation
Distributed sensors: Networked coherent radars, sonars
Power beaming: Wireless power transfer to UAVs
Radiometry and remote sensing: 5G signal interference
Quantum receivers: Rydberg probes, Lithium-niobate piezoelectric sensors
Integrated sensing and communications: Coherent UAV swarms
Radio astronomy: Low-noise receivers, Satellite interference mitigation
Point cloud processing: LiDAR, 4D mmWave radar in robotics, autonomous driving
Model-based image reconstruction: Regularization

Prospective authors should visit https://2024.ieeecisa.org/ for more details and to submit manuscripts. All manuscripts must adhere to IEEE formatting guidelines and accepted papers will appear in IEEE Xplore. The 2024 CISA conference will be an in-person event and authors must attend to present their papers live at NIST. For additional questions, please contact the co-chairs, Alexandra Artusio-Glimpse ([email protected]), Paritosh Manurkar ([email protected]), Samuel Berweger ([email protected]), Peter Vouras ([email protected]), or Kumar Vijay Mishra ([email protected]). Digital Object Identifier 10.1109/MSP.2023.3316949

FROM THE EDITOR Christian Jutten

| Editor-in-Chief | [email protected]

SPS Members, You Are All Heirs of Fourier!

My three years of service as the editor-in-chief (EIC) of Signal Processing Magazine (SPM) are now coming to a close. During the past three years, many of us were deeply affected by serious political, social, and environmental events such as the war in Ukraine; protests for freedom in Iran; coups d’état in Africa; the COVID-19 pandemic; earthquakes in Turkey, Syria, and Morocco; huge floods in Libya and India; gigantic fires in North America and Southern Europe; and an avalanche of stones in the Alps, to name a few. In such a context, I believe that the IEEE slogan, “Advancing Technology for Humanity,” is incredibly relevant and timely. It also must be viewed in a wider sense, including the preservation of Earth and sustainable development. In point of fact, what would become of humanity without Earth? I believe that we must always have this in mind when contemplating our future projects, asking for funding, and teaching.

The year 2023 also marks the 75th anniversary of the IEEE Signal Processing Society (SPS), and this too offers us an opportunity to think about the signal processing domain and ponder its roots and its dazzling evolution. It is also interesting to think about the early contributions that are of the highest importance in our domain and became its pillars. During ICASSP 2023 in Rhodes, Alan Oppenheim, Ron Schafer, and Tony Constantinides recounted the adventure of digital signal processing in the 1970s.

Digital Object Identifier 10.1109/MSP.2023.3318848 Date of current version: 3 November 2023


Such ideas and the book Digital Signal Processing [1] were revolutionary at a time when computers were in their infancy. In fact, the concept of digital signal processing was met with mixed reviews and skepticism. But long before this came the contributions of Jean-Baptiste Joseph Fourier, whose work was aimed at understanding the propagation of heat. His most famous book [2], published 201 years ago, in 1822, contains the basics of the Fourier series and transform and their ability to represent a large range of signals. Fourier’s ideas were also “out of the box,” and they too were received with reservations by eminent scientists who could not understand how and why a sum of continuous functions could approximate noncontinuous functions. Later, in 1829, Dirichlet presented the theoretical results concerning the convergence of Fourier series [A1]. Fourier’s life reads like a novel, which the curious reader can discover in a well-documented and fun work (unfortunately, only in French) [3].

In this issue

During this year in which we celebrated the 75th anniversary of the SPS, it was mandatory to recall Fourier, and I warmly thank Patrick Flandrin for his article [A1], which gives many historical details on some of Fourier’s contributions and their impact on sound analysis and recordings. The article also highlights tricks for implementing computation before the computer era with amazing machines. As obvious proof of the importance of Fourier in signal and image processing (SIP), note that all the articles in this issue explicitly mention Fourier’s legacy.

You all know what an eigenvalue decomposition (EVD) is and some of its uses, but do you know what a polynomial EVD (PEVD) is? In feature article [A2], you will learn about PEVD and its application in many problems involving multichannel broadband signals. Denoising is an essential task in SIP. Currently, many methods for image denoising use convolutional neural networks (CNNs). Feature article [A3] proposes an in-depth understanding of encoding-decoding CNN architectures (convolution, down/upsampling structures, activation functions, etc.) following signal processing principles.

This issue contains four “Tips and Tricks” columns. In [A4], the authors propose two tricks for approaching a perfect filter with reduced complexity. In [A5], the authors present a trick for robust estimation of the frequency of a single complex exponential using the magnitude of only two samples of its discrete-time Fourier transform. In [A6], the author shows that coding numbers as integers rather than as floating points can avoid rounding errors in the implementation of moving average filters. Note that a copy of the code for this trick can be obtained by sending an e-mail to the author. Finally, [A7] presents a trick for realizing sub-Nyquist coherent imaging based on an optimized multiplexing hologram scheme.

In SP Education column [A8], the authors reflect on education in data science (DS), including signal processing and machine learning, with the objective that their ideas and guidelines can inspire educators to develop new teaching programs. I appreciated that in these new programs, the authors highlight the consideration of ethical aspects and the sustainability of our global environment. I believe that, in the evaluation of DS methods, educators must also propose metrics that include both performance and complexity terms. Finally, [A9] reports on the 2022 VIP Cup, which took place in October 2022 at ICIP. The aim of the competition was the detection of synthetic images, i.e., telling the difference between real and fake images, which is a very important task in combating fraudulent design and the use and diffusion of fake images.
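As a rough illustration of the idea summarized above for [A6], the sketch below computes a length-N moving average recursively: each new output reuses the previous running sum, adding the newest sample and subtracting the oldest. When the samples are integers, the running sum is exact, whereas a floating-point accumulator updated this way can drift through accumulated rounding. This is not the author's code (which, as noted, is available from the author by e-mail); the function name and the example data are assumptions made for illustration.

```python
from collections import deque
from typing import Iterable, Iterator


def recursive_moving_average(samples: Iterable[int], window: int) -> Iterator[float]:
    """Yield the moving average over `window` integer samples.

    The running sum is kept as a Python int, so the add/subtract
    recursion never accumulates rounding error; the only floating-point
    operation is the final division for each output.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    history = deque()   # the samples currently inside the window
    running_sum = 0     # exact integer accumulator
    for sample in samples:
        history.append(sample)
        running_sum += sample
        if len(history) > window:
            running_sum -= history.popleft()  # drop the oldest sample
        if len(history) == window:
            yield running_sum / window


if __name__ == "__main__":
    print(list(recursive_moving_average([1, 2, 3, 4, 5], window=3)))
    # prints [2.0, 3.0, 4.0]
```

In a fixed-point implementation the same structure applies, with the accumulator sized so that the sum of N samples cannot overflow.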

Many thanks…

These three years as the EIC of SPM were a very enriching experience, requiring a lot of work but giving me great pleasure. Of course, the EIC is just a link in the chain. It has been my great luck and pleasure to work with a very nice and efficient Editorial Board. In the first circle, I warmly thank the area editors: Laure Blanc-Féraud for feature articles (FAs), Xiaoxiang Zhu for special issues (SIs), Rodrigo Guido and Vicky Zhao for columns and forums (C&Fs), Emil Björnson for social media and outreach, and Behnaz Ghoraani and Hamid Palangi for the e-newsletter. I appreciated their friendly interactions and their hard work during these three years. They were always involved in promoting high-quality articles with the specific tutorial nature that is the signature of SPM, and whose target audience covers all SPS members and beyond. I also thank the great team of associate editors of the C&F articles and the e-newsletter. Their role is essential for managing the different categories of articles but is not limited to handling reviews, as in the transactions, since they are also in charge of developing content.

The team of senior members is of the highest importance. In fact, since SPM fully covers SIP, and due to the tutorial style of SPM articles, the expertise of the team of senior members must be very broad. Each proposal for a Special Issue or a Feature Article white paper must be reviewed by a large set of scientists, not all experts in the domain, to represent the SPM target audience. Usually, the decision on proposals is based on at least 10 reviews, which are required in less than three weeks. I thank you all for your service to SPM. All the members of the Editorial Board are ambassadors of SPM, and in addition to their reviewing tasks, their roles include the detection, stimulation, and invitation of potential scientists to submit articles or Special Issues to SPM. This everyday task is essential for providing compelling and attractive content.

SPM is a fully edited journal. This means that all the articles, after acceptance, are edited, laid out, and illustrated by the IEEE editorial team. For each issue, the cover is also created by the design team after exchanges between the EIC and the journal production managers, Jessica Welsh (up to the end of 2021) and Sharon Turk. I warmly thank the IEEE editorial team, who contributed to the quality and the attractiveness of SPM, and especially Sharon and Jessica for their leadership and the quality of our interactions, both friendly and professional. The reviewing process is based in ScholarOne, and I warmly thank Rebecca Wollman for the valuable, efficient, and timely help she provided to the authors, teams of guest editors (GEs), and members of the Editorial Board. I won’t forget Rupal Blatt, webpage manager, for her responsiveness in updating the SPM webpages, adding templates, adding calls for articles for Special Issues, etc.

I would like to thank all of the authors who contributed feature articles and columns and forums. And finally, I warmly thank the guest editors who proposed exciting Special Issues and thank them for their efforts in managing the reviews from white papers to full articles and for providing the final manuscripts in due time. SPM needs high-level tutorial-like contributions covering SIP methods and applications, following trends in DS and machine learning but always under the SIP umbrella. Keynote speakers and organizers of tutorials and special sessions in conferences and workshops: you are all potential candidates for SPM articles. Don’t hesitate to contact the area editors to refine and concretize your draft article or idea for a Special Issue.

Following the ideas of previous EICs, I added the covers of the 19 SPM issues published during the last three years (Figure 1), illustrating the diversity of articles and Special Issues and also the quality of the work done by the design and editorial teams.

FIGURE 1. A montage of the 19 SPM issues published while Christian Jutten served as EIC.

Professor Tulay Adali, from the University of Maryland, Baltimore County, will be taking over as EIC on 1 January 2024. You will be able to read about her vision for the magazine in her editorial in the January 2024 issue. I know her very well; she is a great scientist, and she has also served the SPS in different positions for many years. She is now inviting scientists to join her as area editors, and she will present herself and her team in more detail in her first editorials. With her as EIC, I know that SPM is in good hands.

Appendix: Related Articles

[A1] P. Flandrin, “Fourier and the early days of sound analysis,” IEEE Signal Process. Mag., vol. 40, no. 7, pp. 11–16, Nov. 2023, doi: 10.1109/MSP.2023.3297313.
[A2] V. W. Neo, S. Redif, J. G. McWhirter, J. Pestana, I. K. Proudler, S. Weiss, and P. A. Naylor, “Polynomial eigenvalue decomposition for multichannel broadband signal processing,” IEEE Signal Process. Mag., vol. 40, no. 7, pp. 18–37, Nov. 2023, doi: 10.1109/MSP.2023.3269200.
[A3] L. A. Zavala-Mondragón, P. H. N. de With, and F. van der Sommen, “A signal processing interpretation of noise-reduction convolutional neural networks,” IEEE Signal Process. Mag., vol. 40, no. 7, pp. 38–63, Nov. 2023, doi: 10.1109/MSP.2023.3300100.
[A4] D. Shiung, J.-J. Huang, and Y.-Y. Yang, “Tricks for designing a cascade of infinite impulse response filters with an almost linear phase response,” IEEE Signal Process. Mag., vol. 40, no. 7, pp. 64–73, Nov. 2023, doi: 10.1109/MSP.2023.3290772.
[A5] R. Guo and T. Blu, “Super-resolving a frequency band,” IEEE Signal Process. Mag., vol. 40, no. 7, pp. 73–77, Nov. 2023, doi: 10.1109/MSP.2023.3311592.
[A6] S. Engelberg, “Implementing moving average filters using recursion,” IEEE Signal Process. Mag., vol. 40, no. 7, pp. 78–80, Nov. 2023, doi: 10.1109/MSP.2023.3294721.
[A7] Y. Jeong, B. Tayebi, and J.-H. Han, “Sub-Nyquist coherent imaging using an optimizing multiplexed sampling scheme,” IEEE Signal Process. Mag., vol. 40, no. 7, pp. 81–88, Nov. 2023, doi: 10.1109/MSP.2023.3310710.
[A8] S. Gannot, Z.-H. Tan, M. Haardt, N. F. Chen, H.-T. Wai, I. Tashev, W. Kellermann, and J. Dauwels, “Data science education: The signal processing perspective,” IEEE Signal Process. Mag., vol. 40, no. 7, pp. 89–93, Nov. 2023, doi: 10.1109/MSP.2023.3294709.
[A9] D. Cozzolino, K. Nagano, L. Thomaz, A. Majumdar, and L. Verdoliva, “Synthetic image detection: Highlights from the IEEE video and image processing cup 2022 student competition,” IEEE Signal Process. Mag., vol. 40, no. 7, pp. 94–100, Nov. 2023, doi: 10.1109/MSP.2023.3294720.

References

[1] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing. Englewood Cliffs, NJ, USA: Prentice-Hall, 1975.
[2] J. Fourier, Théorie Analytique de la Chaleur. Paris, France: Firmin Didot, 1822. [Online]. Available: https://gallica.bnf.fr/ark:/12148/bpt6k1045508v.texteImage
[3] E. Marie and E. Cerisier, Les Oscillations de Joseph Fourier. Nantes, France: Editions Petit à Petit, 2018.


PRESIDENT’S MESSAGE  

Athina Petropulu

| IEEE Signal Processing Society President | [email protected]

Reflections on the Poland Chapter Celebration

My term as IEEE Signal Processing Society (SPS) president is fast approaching its end. It has been an incredible experience that has provided me with so many opportunities to engage with our members around the globe, forge relationships with other IEEE Societies, and meet a diverse range of people that I hope will become active members of our Society in the future. It has been a great privilege to be at the helm of a Society that garners such a high level of worldwide respect and recognition. It has also provided me with the chance to learn, identify the challenges we still face, anticipate future challenges, and work to find solutions that will make our Society, and the world, a better place.

The SPS has a unique dual role. We strive to grow and advance technological innovation and problem-solve at the scientific level, from the bench to the applications of these technologies in the real world. We need to be mindful that our scientific pursuits don’t exist in a vacuum, that they have many social, political, and ethical implications, and that their very existence is often shaped by an uneven playing field: for scientists who are isolated within their research silos or by geopolitical events, for women and ethnic minorities, for citizens of so-called low-income countries, and for young people with economic or cultural restraints.

Digital Object Identifier 10.1109/MSP.2023.3322050
Date of current version: 3 November 2023

Our Society has made many strides to level that playing field by providing many initiatives to grow and diversify our membership. I’ve discussed these initiatives in my past messages, and I’ll detail some recent programs below, but there are still many questions that require novel solutions.

War and peace

This past September, I had the privilege of visiting Poland to commemorate the 20th anniversary of the IEEE SPS Poland Chapter. During this event, I presented the history of the SPS and its impactful role in signal processing. Additionally, I had the opportunity to learn about the journey of the Poland Chapter and its various activities. The anniversary celebration coincided with the Signal Processing Symposium (SPSympo). Since its inception in 2003, SPSympo has consistently attracted attendees from Poland and neighboring countries, particularly Ukraine. Unfortunately, due to geopolitical events, researchers from Ukraine were unable to travel abroad. This raises the question: Can we find innovative methods to help our members and nonmembers in countries in the midst of conflict, warfare, and humanitarian crises? Perhaps we could implement humanitarian programs or grants that provide online conference attendance, open access to postconference transcripts, and other options to help them overcome these barriers.

During discussions at the Poland Chapter meeting, some wondered why one should join the SPS in an era with abundant freely available information, numerous platforms for sharing scientific work, and a multitude of conference options. Some argued for the significance of in-person networking and professional development, underlining the importance of providing conference discounts and travel grants. Ultimately, the paramount value that the SPS provides is the assurance of high quality—in publications, conferences, technical activities, educational offerings, and high ethical standards and guidelines.

Attracting young minds

There was also a discussion on how to engage students and young professionals in SPS initiatives. While older generations viewed Society membership as their only option to be connected with the outside scientific community, today’s students may need special encouragement to become members. The SPS is providing several events designed to spark the interest of younger generations and foster visibility and growth. These include the Signal Processing Cup, the Video and Image Processing Cup, the 5-Minute Video Clip Contest, hackathons, Society-level awards such as the Best PhD Dissertation Award and the Young Author Best Paper Award, and conference-level best student paper awards, all of which bring recognition to students and generate a lot of excitement.

There are also opportunities designed to empower students and enhance their professional growth. Students benefit from job opportunities through events like the Student Job Fair and Luncheon, creating a bridge between academia and industry. The low-cost membership fee of just US$1 for IEEE Student and Graduate Student Members further opens the door to a wealth of resources. Networking takes center stage through events at SPS conferences, fostering connections with industry professionals. Additionally, students gain access to webinars, career and soft skills training, and mentorship opportunities, contributing to their holistic development. The SPS also provides travel grants, awarded on a competitive basis and based on need, to support students in developing countries in traveling to ICASSP and ICIP.

Also important are the SPS Scholarship and Seasonal Schools Programs. The Scholarship Program offers financial assistance to undergraduate and graduate students who are dedicated to pursuing education and careers in signal processing. Eligibility is open to students with a minimum B grade point average (or international equivalent). Over a span of three years, recipients have the opportunity to receive a total prize of up to US$7,000. This initiative aims to support and encourage students with a strong academic commitment to signal processing, providing a financial boost to their educational journey. The primary objectives of the Seasonal Schools are to develop students interested in signal processing, to organize opportunities to network with professors and established practitioners, and to engage students in hands-on tutorials in signal processing.

The SPS Academy

Another interesting topic discussed at the Poland Chapter meeting concerned the difficult mathematical concepts that arise in signal processing and ways to convey them in an easily digestible form. This is indeed a very important issue, and it was the motivation behind looking into expanding the educational offerings of the SPS. A couple of years ago, the SPS Education Board conceived the idea of the SPS Academy. The SPS would deliver education-oriented short courses, providing deep understanding of critical topics in the field. Unlike traditional tutorials, the SPS’s education-oriented short courses delve into subjects with more depth, starting from the basics and providing a comprehensive and multisided perspective on each topic. The courses are already being offered at ICASSP and ICIP and have proved very popular. They consist of parallel tracks of 10-h sessions conducted in three segments, offering participants an immersive learning experience. Upon successful completion of the course and quiz, participants are awarded professional development hours and continuing education unit certificates, recognizing their commitment to continuous learning and growth in their respective fields. The SPS Education Board is taking additional steps to make those short courses available to wider audiences, and it is working with a professional company to enhance these educational courses. The Society also offers free access to the SPS Resource Center for SPS members; this is an online library of tutorials, lectures, presentations, and more, and it spans the breadth of the signal processing field.

The gender gap

The low number of women among students and faculty at SPSympo was quite evident, as was the feeling of isolation and hopelessness among the women attendees. In speaking with women faculty and students, there was a consensus that women shoulder more caregiving responsibilities than their male counterparts, impacting their career choices. It came as a shock to me when I asked a bright young graduate student in signal processing if she was interested in an academic position after graduation, and she answered that women are too emotional and cannot pursue high-responsibility careers. When I asked her if she really believed such a bias, she said she did not, but society does, and who was she to go against society? The student was not aware of the SPS initiatives that disavow such perceptions, and I spoke to her about our efforts to reach out to women students, with several empowering opportunities. That conversation was a stark reminder that the SPS still has a long way to go in our continuing efforts to foster diversity and inclusivity.

I am now more confident than ever that our initiatives are addressing critical problems. Women in Signal Processing (WISP) provides mentoring opportunities and networking events, where women from around the world can share experiences and strategies in balancing family and career. As in Poland, the presence of women and minorities on the faculties of universities around the world is very small, and this deprives students of diverse role models and also limits the diversity of perspectives in academic research. Role models play a crucial role in inspiring students, instilling confidence in their abilities, and demonstrating the potential for success in their chosen fields. The SPS Promoting Diversity in Signal Processing (PROGRESS) Workshop recognizes the significance of diverse representation and seeks to bridge this gap through its empowering initiatives. Since its inception in 2020, the PROGRESS Workshop has gained substantial momentum. It is now part of the ICASSP and ICIP conferences. The seventh PROGRESS Workshop took place at ICIP 2023 in Kuala Lumpur, Malaysia, and was successfully led by Dr. Zaid Omar, of the Universiti Teknologi Malaysia. The PROGRESS Workshop offers an online participation option, recognizing that travel expenses may pose economic challenges for some students. The SPS provides financial support to students who choose to attend in person.
For instance, at the 2023 PROGRESS meeting during ICASSP, the SPS granted eight travel awards of US$1,000 each on a reimbursement basis. Similarly, the 2023 PROGRESS at ICIP offered 20 travel grants of US$500 each. These funds do not mandate SPS membership in an effort to reach out to students who do not traditionally attend SPS conferences.

Other diversity and inclusion initiatives include the Mentoring Experiences for Underrepresented Young Researchers Program (ME-UYR) [1] and K-12 Outreach Initiatives [2]. ME-UYR provides mentoring experiences in the form of a nine-month collaboration for young researchers from underrepresented groups together with an established researcher in signal processing from a different institute, and typically another country. The K-12 Outreach Initiatives Program strives to increase the visibility of the SPS and the signal processing discipline to K-12 students worldwide by developing exciting, impactful educational programs that utilize tools and applications with hands-on signal processing experiences. The program is intended to bring awareness of signal processing to students who belong to groups that are underrepresented in STEM fields regionally and/or globally.

Inter-Society initiatives

The world is facing complex problems whose solutions require cross-disciplinary approaches, and strengthening inter-Society initiatives is another key goal of the SPS. During SPSympo, I had the opportunity to meet with Mark Davis, the president of the IEEE Aerospace and Electronic Systems Society (AESS), who was also in attendance. We both delivered plenary talks on the topic of integrated sensing and communication (ISAC) systems. We had the opportunity to discuss the need to enhance and integrate opportunities for inter-Society engagement. ISAC is a naturally cross-disciplinary topic, encompassing technologies that combine sensing and communication systems to utilize wireless resources efficiently, realize wide area environment sensing, and even pursue mutual benefits. Realizing the great potential for research developments and standardization opportunities, the ISAC Technical Working Group (TWG) has been established to bring together academic and industrial researchers in the SPS and related Societies to educate members and jointly address technical challenges.

The SPS is forging strategic partnerships among multiple IEEE Societies in the ISAC area. Initial activities include the first 2023 Summer School on ISAC, sponsored jointly by the SPS, the AESS, and the European Association for Signal Processing, which took place in June 2023, in Baiona, Spain. The event was led by Nuria Gonzalez Prelcic, of North Carolina State University, attracted 50 students, and was also supported by Qualcomm, Remcom, and Gradient. Another was the 2023 SPS–IEEE Communications Society (ComSoc) Summer School on ISAC, which was held in Shenzhen, China, and led by the SPS ISAC TWG in cooperation with the ComSoc. This was organized by Tsung-Hui Chang, Feng Yin, and Jie Xu, from the Chinese University of Hong Kong–Shenzhen; Fan Liu, from the Southern University of Science and Technology; and Xiao Han, from Huawei Technologies. The event attracted 180 students and researchers from mainland China, while the online streaming of the event reached 10,000 viewers.

Another topic that cuts across multiple areas is brain research. The SPS is one of four core member societies of the IEEE Brain Technical Community (TC), which is an IEEE-wide effort that unites engineering and computing expertise across IEEE Societies and Councils relevant to neuroscience. IEEE Brain facilitates cross-disciplinary collaboration and coordination to advance research, standardization, and development of engineering and technology to improve understanding of the brain in order to treat diseases and improve the human condition. As a core member, the SPS is responsible for chairing the TC along with other core members, and this year Tulay Adali, former SPS Technical Activities Vice President, is the chair. This year, at ICASSP, a satellite workshop was organized on the topic of Unravelling the Brain, which was very well attended and introduced new blood in the area to ICASSP.
The SPS partnered with IEEE EMBS to offer the 2023 IEEE EMBS-SPS ISBI Summer School on Biomedical Imaging, which covered applications of deep learning and AI in medical imaging. This event was organized by Jean-Christophe Olivo-Marin, Elsa Angelini, and Arrate Munoz Barrutia in Cartagena, Colombia, this past April.

In Poland, I also participated in a panel discussion, which posed another interesting question related to the contrast between model-based signal processing and data-driven machine learning (ML), which seems to be at the center of discussions at signal processing events. Despite ML’s success in various applications, it falls short in offering performance guarantees and lacks transparency in revealing how solutions are derived. This limitation has hindered its application in several key areas, including medical diagnosis. Research showcased in SPS venues concentrates on leveraging models and domain knowledge to design ML algorithms that are both reliable and explainable. Thus, greater synergies between the SPS and EMBS have the potential to unlock more dependable applications of ML and AI in the field of medicine.

Developing novel technologies: Synthetic apertures

Diversity is also key to scientific innovation and progress, and our Society is continually forging expertise in various new and evolving technological fields. On that note, I would like to share my excitement about the developments in synthetic apertures (SAs) led by the SA TWG, establishing the SPS as the pole of attraction for research in the critical SA area. SAs work by moving an antenna along a predetermined path via mechanical means. As the antenna moves, it measures the strength and direction of signals. This information helps reconstruct various properties of the scattered electromagnetic waves, like power, arrival directions, delays, and polarization. SAs can measure signals over a very wide frequency bandwidth and with an almost arbitrarily large aperture size, enabling high angular and delay resolution to resolve closely spaced scatterers. Further, SAs are cost-effective compared to digital multichannel phased arrays while delivering comparable estimation performance.

The two-pronged goal of the SA TWG is to support theoretical and empirical techniques that underpin the estimation of parameters of propagating waves through various media using SAs and also to identify novel applications for SAs that are enabled by the precise measurement and estimation of environmental parameters. The SA TWG, under the leadership of Dr. Peter Vouras, is working on developing IEEE standards on SAs, establishing a shared repository for data and algorithms, delivering webinars on the topic, organizing special issues in journals, and providing challenges and competitions that promote the adoption of SAs in engineering school curricula as well as job training for graduating students. In an exciting development this year, the SA TWG, working with the IEEE Synthetic Aperture Standards Committee, will offer the inaugural NIST–IEEE Conference on Computational Imaging Using Synthetic Apertures. The conference will be held 20–23 May 2024, at the scenic campus of the National Institute of Standards and Technology, in Boulder, CO, USA.

SPS President Athina Petropulu joining the Poland SPS Chapter chair, Konrad Jędrzejewski, past chair Piotr Augustyniak, IEEE AESS President Mark E. Davis, and other Chapter members in marking the 20th Anniversary of the Poland SPS Chapter.

Ethical standards

In my tenure as the SPS president, it has been amazing to experience firsthand that people around the world have such high levels of respect and recognition for the SPS. Going forward, we need to make extra efforts to safeguard that quality and the climate that holds the SPS to such high standards. Alongside the need to continually grow inclusivity, diversity, and interconnectedness in our membership and in our scientific pursuits, we must continually adapt and promote high ethical standards and guidelines with both our technological innovations and within the SPS leadership. As innovation advances at an exponential rate, ethical concerns surrounding current and emerging technologies grow proportionally. It is crucial to confront the impact of technology on privacy, security, and the environment. The urgency to cultivate researchers and engineers with strong ethical foundations has never been greater; they may serve as a crucial line of defense in navigating these complex challenges.

Another challenge involves leadership. In an effort to energize the members, our Society has embraced a member-driven election for the president-elect. Yet the strength and agility of this approach require a continued effort to increase membership diversity by growing our global appeal. We should also put policies into place to safeguard the election process and help prevent negative electioneering campaigns that increase internal divisiveness. If unchecked, electioneering can lead to the same behaviors observed in the political climate of the United States and some other countries, which features negative ads and disinformation.

Despite the many challenges, SPS membership growth has been strong throughout 2023. In October 2023, the membership of the SPS soared past our goal of 20,000, reaching the highest point in SPS history. This underscores the enduring relevance and value that the SPS has to offer. With my term as SPS president concluding this year, I’m optimistic that the SPS will strive to turn challenges into opportunities, so that we can grow and diversify our membership, and provide even more value to our members and our communities.

Acknowledgment

I would like to thank Theresa Argiropoulos and Rich Baseil for their help with this article.

References

[1] “Mentoring Experiences for Underrepresented Young Researchers (ME-UYR) Program,” in IEEE Signal Process. Soc., 2023. [Online]. Available: https://signalprocessingsociety.org/community-involvement/me-uyr-mentoring-experiences-underrepresented-young-researchers-program
[2] “K-12 Outreach Initiatives,” in IEEE Signal Process. Soc., May 2022. [Online]. Available: https://signalprocessingsociety.org/community-involvement/k-12-outreach-initiatives




DSP HISTORY

Patrick Flandrin

Fourier and the Early Days of Sound Analysis

Joseph Fourier’s methods (and their variants) are omnipresent in audio signal processing. However, it turns out that the underlying ideas took some time to penetrate the field of sound analysis and that different paths were first followed in the period immediately following Fourier’s pioneering work, with or without reference to him. This illustrates the interplay between mathematics and physics as well as the key role played by instrumentation, with notable inventions by outsiders to academia, such as Rudolph Koenig and Édouard-Léon Scott de Martinville.

Introduction

Fourier analysis, Fourier series, (fast) Fourier transform … Today, “Fourier” has become something of a common noun. If his presence is now ubiquitous in almost all fields of science and technology, the name of Fourier is especially unavoidable for all those interested in the theory and practice of signal processing. In particular, the methods he developed—and the attached fundamental concepts, such as that of spectral representation—are the cornerstone of audio signal processing (speech, music, and so on). This might suggest that they were developed in connection with the idea of analyzing and/or synthesizing sounds or at least that such an application was envisaged from the outset. This turned out not to be the case, the whole project of Fourier being devoted to a different physical problem, namely, the theory of heat, and to mathematical developments attached to it. Whereas many attempts had been made before Fourier (by Bernoulli, d’Alembert, Euler, Lagrange, and others) to solve the problem of vibrating strings and express solutions by means of sine/cosine expansions, Fourier himself seemed to have developed almost no interest in applying his results in this direction. Indeed, while his 1822 treatise on the analytical theory of heat is more than 600 pages long, there is only one sentence evoking such a possibility: “If we apply those principles to the question of the motion of vibrating strings, we shall overcome the difficulties first encountered in Daniel Bernoulli’s analysis.”

It was only 20 years later that Fourier’s ideas explicitly entered the field of acoustics, thanks to Georg Simon Ohm (most famous for his law of electrical conductivity, established in 1827). This was, however, not a fully shared recognition, and, between theory and experiments, the following years witnessed a number of developments aimed at analyzing sounds, with or without a reference to Fourier. This is what this text is about. In complement to the immediate post-Fourier influences in acoustics discussed here, a comprehensive study of the (pre-Fourier) acoustics origins of harmonic analysis can be found in [1].

Digital Object Identifier 10.1109/MSP.2023.3297313
Date of current version: 3 November 2023

Fourier theory, from heat to sound

Fourier

As mentioned in the preceding, the scientific work of Fourier culminated in his Théorie Analytique de la Chaleur (Analytical Theory of Heat) [2] that was eventually published in its final form in 1822, i.e., 11 years after having been first presented as a memoir to the French Academy of Sciences. Although its value was recognized at that time by awarding Fourier a prize, this contribution was received by the examiners (including Lagrange) with some reservations concerning rigor, raising convergence issues that were eventually resolved in full generality by Dirichlet and others. Fourier’s seminal work established, nevertheless, the foundations of modern harmonic analysis, a branch of mathematics that flourished in the 19th and 20th centuries and proved to be of utmost importance in numerous applications. Starting from a problem in physics and considering that [2, p. 13] “the profound study of nature is the most fertile source of mathematical discovery,” Fourier is generally considered the creator of mathematical physics [3]. Eager to solve physics problems by giving solutions based on firm mathematical grounds, Fourier was also deeply concerned with effective calculations, claiming explicitly that [2, p. 11] “the [proposed] method does not leave anything vague and indefinite in its solutions; it drives them to their ultimate numerical applications, necessary condition for any research, and without which we would only end up with useless transformations.” This focus on what we would now call algorithmic efficiency also makes Fourier a true father of signal processing.

Jean-Baptiste Joseph Fourier (1768–1830) was a French mathematician, physicist, and political figure who has had more than one life (Figure 1). Orphaned at the age of nine and spotted for his intellectual abilities, he was taken in charge by a religious educational institution, where he developed a particular interest in mathematics. He thus became a teacher in various domains and finally in mathematics. After having taken an active part in the French Revolution, for which he was imprisoned twice, he was selected as one of the first students of the newly created École Normale, where he quickly became an assistant professor before succeeding Joseph-Louis Lagrange as a professor at École Polytechnique, in 1797. One year later, he was designated to join the Egyptian expedition of Napoléon Bonaparte and became secretary of the Institut d’Égypte, conducting there scientific and political activities until the British victory. Back in France, in 1801, he thought of resuming his academic position but was appointed governor of Isère by Napoléon. While supervising various road and sewerage works, it was during this period that he began his masterwork on the analytical theory of heat. He also played a key role in the creation of the University of Grenoble and became a mentor and close friend of Jean-François Champollion, whom he encouraged in his research to decipher hieroglyphs. Subjected to the vicissitudes of Napoléon’s abdication in 1814 and attempt to return to power in 1815, Fourier was reassigned as governor from Grenoble to Lyon, but he resigned before the battle of Waterloo and went to Paris in June 1815, having no position at all. Being eventually elected a member of the Académie des sciences, in 1817 (and secrétaire perpétuel, in 1822), he devoted the final period of his life entirely to his scientific activities. (An authoritative presentation of the life and works of Fourier can be found in, e.g., [3].)

FIGURE 1. Joseph Fourier.

Whereas Fourier theory is now central in acoustics, speech, and signal processing, it seems that its first explicit use in sound studies was due to Georg Simon Ohm, who claimed, in his 1843 seminal paper aimed at defining what a “tone” is, that he used [4, p. 519] “Fourier’s theorem, which has become famous through its multiple and important applications.” Ohm’s paper was devoted to specific sound systems, namely, sirens, whose physical construction clearly departed from more classical vibrating strings (for which sine/cosine descriptions were well accepted) and whose understanding was an open question. Sirens had been previously investigated by August Seebeck, who conducted a number of experiments, ending up with puzzling questions (combination tones, missing fundamental, and so on). Ohm proposed to interpret Seebeck’s findings in Fourier terms, but Seebeck raised objections, and a controversy followed [5]. Ohm’s approach was essentially mathematical and disconnected from hearing issues (Ohm even claimed to have an “unmusical ear” [6]). Seebeck, on the contrary, noted contradictions between Ohm’s predictions and actual perceptions by a trained ear.
After Ohm lost interest in those questions, the controversy stopped, in 1849, when Seebeck passed away, and the fundamental question of confronting mathematical descriptions with physical realities resurfaced, in 1856, with Hermann von Helmholtz [7], who made acoustics fully enter experimental physics while taking into account physiological considerations. Helmholtz gave credit to Ohm for his introduction of Fourier methods in acoustics, and he followed him in proposing to consider the inner ear as a Fourier analyzer sensitive to the intensities of Fourier components, or proper modes (what he referred to as “Ohm’s law”). As a side note, it is worth mentioning that the question of whether Fourier proper modes (or their variations) are a physical reality or a mathematical construct has been recurrent since then. One can quote, in this respect, Louis de Broglie, who once claimed that [8] “if we consider a quantity that can be represented in the manner of Fourier, by a superposition of monochromatic components, it is the superposition that has a physical meaning, and not the Fourier components considered in isolation,” or refer to [9] for data-driven versus model-based approaches to beating phenomena.

In parallel, Helmholtz conducted experiments with high-quality instruments—precise tuning forks for the production of well-controlled “pure” tones and resonators made of cavities of different sizes for identifying frequency components within complex sounds. Helmholtz’s contributions have been of primary importance in the development of modern acoustics and psychoacoustics. His approach was also emblematic of the key role played by instruments in addressing scientific questions and challenging theories (as once said by the philosopher Gaston Bachelard, “Instruments are reified theories”). To this end, he was in close contact with a gifted instrument maker settled in Paris: Rudolph Koenig (Figure 2).

Koenig

Born in Königsberg (Prussia), Karl Rudolph Koenig settled in Paris in 1852 and died there in 1901. While developing a special interest in acoustics, Koenig was not part of any academic institution, but he was a prolific inventor—with 272 items in his 1889 catalog [10]—and a successful businessman who manufactured and sold his own products all around the world. He was especially famous for the quality of his tuning forks, and he contributed, with his experiments, to the debates and disputes about beats and combination tones [11]. Koenig happened to be, for a long time, the main maker of Helmholtz’s instruments, and his workshop in Paris was a busy meeting place, where the ideas of Helmholtz were popularized and spread in Parisian scientific circles, maintaining a vivid relationship with his native Germany [12].

FIGURE 2. Rudolph Koenig.

In particular, exploiting the potentialities of Helmholtz resonators and combining them with his own invention of “manometric flames,” he designed a “sound analyzer” [10], [13] that allowed for a visualization of the frequency content of a sound. To visualize sound waves, he first designed, in 1862, an apparatus—the so-called manometric flame—that consists of a flexible membrane encapsulated in a chamber. When exposed to a sound, the vibration of the membrane modulates the flow of a flammable gas passed to a Bunsen burner, and the size of the flame is, in turn, modulated accordingly. The final visualization is made possible in a stroboscopic way, thanks to a four-faceted rotating mirror. Koenig later designed, in 1867, a more complete “sound analyzer” by plugging such capsules at the output of a family of Helmholtz resonators (i.e., cavities tuned to specific frequencies) playing the role of a filter bank (Figure 3). The overall system permits, therefore, a Fourier-like frequency analysis and, in cases where the impinging sound is time varying, a time-frequency analysis.

FIGURE 3. Koenig’s sound analyzer [10].

One can remark that Koenig’s apparatus has very much the flavor of an electromechanical system that would appear almost one century later: the so-called sound spectrograph [14]. This instrument—which, ironically, would be due to another Koenig—shares with the “sound analyzer” the idea of visualizing the intensity of frequency components. In the sound analyzer, such intensities are evaluated acoustically and in parallel (with all resonators acting simultaneously for selecting frequencies), and they are visualized by the modulations of the manometric flames. In the sound spectrograph, the acoustic signal is first recorded on a magnetic tape, and the frequency intensities are evaluated electrically and sequentially, thanks to a heterodyne filtering that acts in a synchronous manner with the rotation of the disk on which the tape is fixed and that of a drum on which the intensity is reported by a stylus on an electrically sensitive paper (Figure 4).

Today, these acoustic or electromechanical devices have been replaced by computers to perform time-frequency analysis of digitized data, with routine techniques, such as short-time Fourier analysis, operating in time rather than frequency. The basic principles of these modern approaches are nonetheless similar, the difference being essentially one of implementation.

Whereas Helmholtz elaborated on the findings of Ohm, who himself referred to Fourier, no explicit reference to Fourier can be found in the written productions of Koenig nor in the description of his sound analyzer [10]. Indeed, the motivation of Koenig was elsewhere, far from grounding his instruments in mathematical foundations. It turns out that this was an attitude shared by most physicists of the immediate post-Fourier period (say, 1822–1850). Fourier is now perceived as the pioneer of modern mathematical physics, but in the years that followed its publication, Fourier’s treatise attracted mostly the attention of mathematicians, who substantially contributed to consolidating and extending Fourier’s seminal work, and, with the notable exception of Ohm, its importance for physics seems to have escaped physicists [15]. As an acoustician and an instrument-maker, Koenig was, in fact, primarily interested in visualizing sounds and in achieving this program through a more experimental and empirical approach. This leads to another milestone on the way to sound analysis. It involved Koenig too but preceded the invention of his sound analyzer.
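As an aside for readers who want to try it, the short-time Fourier analysis mentioned above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch, not a model of Koenig’s apparatus or of the sound spectrograph: the function name `stft_magnitude`, the Hann window, the frame and hop lengths, and the two-tone test signal are all assumptions chosen for the example. Each windowed frame acts as a snapshot in time, and the bins of its discrete Fourier transform loosely play the role of the bank of resonators.

```python
import numpy as np

def stft_magnitude(x, fs, frame_len=1024, hop=256):
    """Short-time Fourier magnitudes of x (hypothetical helper, for illustration).

    Returns frame-center times (s), bin frequencies (Hz), and a
    (n_frames, n_bins) array of magnitudes.
    """
    window = np.hanning(frame_len)                 # taper each frame to limit leakage
    n_frames = 1 + (len(x) - frame_len) // hop     # only complete frames are kept
    frames = np.stack([x[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    mags = np.abs(np.fft.rfft(frames, axis=1))     # one spectrum per time frame
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + frame_len / 2) / fs
    return times, freqs, mags

# Assumed test signal: a 440 Hz tone followed by an 880 Hz tone, 1 s at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
x = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t))

times, freqs, mags = stft_magnitude(x, fs)
dominant = freqs[np.argmax(mags, axis=1)]          # strongest frequency bin per frame
print(dominant[0], dominant[-1])                   # near 440 Hz, then near 880 Hz
```

With these assumed settings the bins are spaced fs/frame_len ≈ 7.8 Hz apart, so the dominant frequency is recovered only to within one bin. Longer frames sharpen the frequency reading at the cost of time localization, which is the usual trade-off in time-frequency analysis.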

Scott and “the problem of speech writing itself”

If we think of visualizing sounds, there are at least two possibilities. The first one, which corresponds to Koenig’s approach with his sound analyzer, is indirect in the sense that what is shown are features resulting from some transformation of the waveform (namely, intensities of Fourier modes). Another, more direct, approach could, however, be imagined, which would deliver a graphical representation of the waveform itself. Such a path was indeed followed in the middle of the 19th century by another outsider of academia: Édouard-Léon Scott de Martinville (Figure 5).

Scott

Édouard-Léon Scott de Martinville (1817–1879) was a French inventor who, by profession, was a typographer. In the early 1850s, he conceived the idea of drastically improving upon stenography for keeping track of spoken words and other sounds by developing a system that would solve, in his own words, “the problem of speech writing itself.” This question obsessed Scott until his final days [16], and it is worth quoting his agenda, as reported in the sealed manuscript he sent to the French Academy of Sciences, in 1857 [17] (English translation by P. Feaster [18]):

“Is there a possibility of reaching in the case of sound a result analogous to that attained at present for light by photographic processes? Can one hope that the day is near when the musical phrase, escaped from the singer’s lips, will be written by itself and as if without the musician’s knowledge on a docile paper and leave an imperishable trace of those fugitive melodies which the memory no longer finds when it seeks them? Will one be able to have placed between two men brought together in a silent room an automatic stenographer that preserves the discussion in its minutest details while adapting to the speed of the conversation? Will one be able to preserve for the future generation some features of the diction of one of those eminent actors, those grand artists who die without leaving behind them the faintest trace of their genius?”

To achieve this ambitious goal, Scott took his inspiration from the hearing process and proposed to make use of a membrane (mimicking the eardrum) at the output of a horn designed to collect and concentrate the sounds to be analyzed. The vibrations of the membrane were transmitted to a stylus attached to it, whose movements were inscribed as tracings on a sliding lampblacked glass plate. Scott named his invention the “phonautograph” and began experimenting in 1853–1854, waiting until 1857 to submit it to the French Academy of Sciences [17] and to patent it [19]. Looking at those first attempts (with either speech or guitar sounds [17]), one must admit that the tracings he recorded are extremely erratic and unlikely to be interpretable. Scott’s first phonautograph was the work of an amateur, and to hope that it could be turned into a reliable instrument required the professionalism of an expert. The best expert one could think of in this regard at that time was Koenig, and it was only natural that Scott approached him to perfect his device.

Their collaboration resulted in a second-generation phonautograph (Figure 6) with far better performance than the initial prototype. Koenig replaced the sliding plate with a rotating cylinder, allowing for much longer recordings. He also supplemented the sound recording with the reference trace of a tuning fork, thus making it easier to account for the unavoidable irregularities in the rotation of the hand-cranked cylinder. In 1860, reasonably neat tracings were obtained this way, leading to the fundamental issue: How to interpret them?

The very purpose of Scott was to consider phonautograms as graphical representations of sounds and to uncover from their reading the content of what had been recorded. He was, thus, interested in finding specific features in the tracings, which could be considered elementary recognizable components of speech or natural sounds. This quest is not without echoes in some recent advances in signal processing, with representations built upon “waveform dictionaries” [20]. If we consider, for instance, the commented hand-drawn tracings of Figure 7, we see that Scott tried to identify elementary sounds from their graphical representation, which we would today refer to as tones with low/high frequency (“la voix grave/la voix aiguë”), downgoing/upgoing chirps (“une voix aiguë descendant au grave/une voix grave montant à l’aigu”), different amplitudes (“une voix intense/moyenne/faible”), and a plosive (“l’explosion de la voix”).

FIGURE 4. Koenig’s sound spectrograph [14].

FIGURE 5. Édouard-Léon Scott de Martinville.

Scott’s phonautograph versus Edison’s phonograph

Reading tracings from complicated sounds, such as speech or songs, proved, however, to be tricky, and, short of support and encouragement, Scott abandoned his project in the early 1860s, becoming a librarian after having granted Koenig an exclusive license to manufacture and sell the phonautograph. Far from Scott’s dream of “speech writing itself,” Koenig advocated the use of the phonautograph as a less ambitious scientific instrument aimed mostly at analyzing tuning forks or organ pipes, before leaving it and turning to his sound analyzer.

Scott’s interest in his phonautograph was rekindled, however, in 1878, when Edison’s phonograph was demonstrated at a memorable session of the French Academy of Sciences. Having heard of it, Scott could not help but find elements in the phonograph (such as the system of recording by means of a membrane, a stylus, and a rotating drum) that seemed to him to be directly inspired by his phonautograph, without any reference being made to it. Bitter about this lack of recognition as well as the contrast between the enthusiastic reception of Edison’s invention and the poor interest that his own invention had received 20 years earlier, Scott self-published a long plea for his rights and his vision of speech analysis just before his death the following year [16].

Of course, one of the main reasons why the phonograph attracted so much more attention than the phonautograph was that the former allowed for the replay of recorded sounds, which the latter did not. As Scott had said many times, reproducing sound was not part of his program at all, his only goal being to decipher phonautograms. One can imagine that he would have found little interest in regenerating real sounds from his phonautograms, and yet this is what happened … in 2008.

Hearing Scott

Indeed, before deciding not to pursue his project any further, Scott deposited a number of annotated phonautograms with the French Academy of Sciences, in 1861 [21]. These recordings were properly archived and preserved, yet forgotten until 2007, when David Giovannoni, a historian specializing in old recordings who had learned of their existence, had the idea of transforming them into truly audible sounds. This was made possible, within the First Sounds project [22] and thanks, in particular, to Patrick Feaster, by obtaining high-quality scans of the tracings and transforming them into digital files through modern signal processing techniques. This is how the folk song “Au Clair de la Lune” (“By the Light of the Moon”) (Figure 8), recorded by Scott himself on 9 April 1860, can now be heard [23]: the first recording of a human voice, 17 years before Edison.

FIGURE 6. Scott’s phonautograph [10].

FIGURE 7. Scott’s “waveform dictionary” [19].

FIRSTSOUNDS.ORG.

Conclusions

The purpose of this text was not to discuss Fourier’s achievements per se: this can be found in many textbooks, from different perspectives (see, e.g., [24] for a classical introduction, [25] for a more mathematically oriented treatise, or [26] for a modern treatment, including recent variations). What was at stake was to see how Fourier’s ideas, which today seem indissociable from sound analysis, were not immediately adopted in this context and that parallel pathways, based on different approaches, have been followed. It is striking to observe that some of the options considered then are still relevant today. For instance, an approach à la Koenig is based, in a first step, on the extraction of some features (in his case, the Fourier modes, even if not named as such) upon which the analysis is performed in a second step, whereas an approach à la Scott bypasses such a preprocessing and relies directly on the raw data, as can be the case in modern end-to-end recognition systems. Another important issue is the quest for interpretability when confronting experimental results with formal descriptions, i.e., physics with mathematics (this was at the heart of the Ohm–Seebeck dispute). Yet, under different forms extended to algorithms and computational issues, such a question of understanding is today of paramount importance, e.g., in deep neural networks when one wants to go beyond a black box.

We choose to close the piece of history that has been outlined here in 1878, when Edison opened a new chapter. Many things happened in the following years, with progressive and more and more pervasive use of Fourier techniques in many domains. In the same year, 1878, Lord Kelvin constructed his harmonic analyzer [27], which proved instrumental for decades for analyzing and predicting tides. Other mechanical [28], electromechanical [14], and, later, electronic [29] systems followed, eventually giving Fourier the full credit he deserves in the information era, but this is another story.

FIGURE 8. Scott’s phonautogram of the folk song “Au Clair de la Lune” [19]. (a) The complete recording and (b) an enlargement showing (on three successive revolutions of the drum) plots of the recorded voice and the reference oscillation given by the tuning fork.

FIRSTSOUNDS.ORG.

Author

Patrick Flandrin ([email protected]) received his Ph.D. degree from INP Grenoble, France, in 1982. He is currently a CNRS emeritus research director, working within the Department of Physics, ENS de Lyon, Lyon, France, since 1991. His research interests include nonstationary signal processing, time-frequency/wavelet methods, scaling stochastic processes, and complex systems. He was awarded the SPIE Wavelet Pioneer Award (2001), the CNRS Silver Medal (2010), and a Technical Achievement Award from the IEEE Signal Processing Society (2017) and European Association for Signal Processing (EURASIP) (2023). He was elected to the French Academy of Sciences in 2010 and served as its president in 2021–2022. He is a Fellow of IEEE (2002) and EURASIP (2009).

References

[1] O. Darrigol, “The acoustics origins of harmonic analysis,” Arch. Hist. Exact Sci., vol. 61, no. 4, pp. 343–424, Jul. 2007, doi: 10.1007/s00407-007-0003-9.

[2] J. Fourier, Théorie Analytique de la Chaleur. Paris, France: Firmin Didot, 1822. [Online]. Available: https://gallica.bnf.fr/ark:/12148/bpt6k1045508v.texteImage

[3] J. Dhombres and J.-B. Robert, Fourier, Créateur De la Physique Mathématique. Paris, France: Belin, 1998.

[4] G. S. Ohm, “Über die Definition des Tones, nebst daran geknüpfter Theorie der Sirene und ähnlicher tonbildender Vorrichtungen,” Ann. Phys. Chem., vol. 135, no. 8, pp. 513–565, 1843, doi: 10.1002/andp.18431350802.

[5] R. S. Turner, “The Ohm-Seebeck dispute, Hermann von Helmholtz, and the origins of physiological acoustics,” Brit. J. Hist. Sci., vol. 10, no. 1, pp. 1–24, Mar. 1977, doi: 10.1017/S0007087400015089.

[6] M. J. Kromhout, “The unmusical ear: Georg Simon Ohm and the mathematical analysis of sound,” Isis, vol. 111, no. 3, pp. 471–492, Sep. 2020, doi: 10.1086/710318.

[7] H. von Helmholtz, “Über combinationstöne,” Ann. Phys. Chem., vol. 175, no. 12, pp. 497–540, 1856, doi: 10.1002/andp.18561751202.

[8] L. de Broglie, Certitudes et Incertitudes de la Science. Paris, France: Albin Michel, 1966.

[9] G. Rilling and P. Flandrin, “One or two frequencies? The empirical mode decomposition answers,” IEEE Trans. Signal Process., vol. 56, no. 1, pp. 85–95, Jan. 2008, doi: 10.1109/TSP.2007.906771.

[10] R. Koenig, Catalogue des Appareils d’Acoustique construits par Rudolph Koenig. Paris, France: Chez l’auteur, 1889. [Online]. Available: https://soundandscience.de/text/catalogue-des-appareils-dacoustique-construits-par-rudolph-koenig

[11] R. Koenig, Quelques Expériences d’acoustique. Paris, France: Chez l’auteur, 1882. [Online]. Available: https://gallica.bnf.fr/ark:/12148/bpt6k5688601m.texteImage

[12] D. Pantalony, “Rudolf Koenig’s workshop of sound: Instruments, theories, and the debate over combination tones,” Ann. Sci., vol. 62, no. 1, pp. 57–82, 2005, doi: 10.1080/00033790410001712183.

(continued on page 88)

On 2 June 1948, the Professional Group on Audio of the IRE was formed, establishing what would become the IEEE society structure we know today. 75 years later, this group — now the IEEE Signal Processing Society — is the technical home to nearly 20,000 passionate, dedicated professionals and a bastion of innovation, collaboration, and leadership.

Celebrate with us: Digital Object Identifier 10.1109/MSP.2023.3322949

Vincent W. Neo, Soydan Redif, John G. McWhirter, Jennifer Pestana, Ian K. Proudler, Stephan Weiss, and Patrick A. Naylor

Polynomial Eigenvalue Decomposition for Multichannel Broadband Signal Processing


A mathematical technique offering new insights and solutions

This article is devoted to the polynomial eigenvalue decomposition (PEVD) and its applications in broadband multichannel signal processing, motivated by the optimum solutions provided by the eigenvalue decomposition (EVD) for the narrowband case [1], [2]. In general, we would like to extend the utility of the EVD to also address broadband problems. Multichannel broadband signals arise at the core of many essential commercial applications, such as telecommunications, speech processing, health-care monitoring, astronomy and seismic surveillance, and military technologies, including radar, sonar, and communications [3]. The success of these applications often depends on the performance of signal processing tasks, including data compression [4], source localization [5], channel coding [6], signal enhancement [7], beamforming [8], and source separation [9]. In most cases and for narrowband signals, performing an EVD is the key to the signal processing algorithm. Therefore, this article aims to introduce the PEVD as a novel mathematical technique suitable for many broadband signal processing applications.

Digital Object Identifier 10.1109/MSP.2023.3269200 Date of current version: 3 November 2023


1053-5888/23©2023IEEE

Introduction

Motivations and significance

In many narrowband signal processing applications, such as beamforming [8], signal enhancement [7], subband coding [6], and source separation [9], the processing is performed based on the covariance matrix. The instantaneous spatial covariance matrix, computed using the outer product of the multichannel data vector, can capture the phase shifts among narrowband signals arriving at different sensors. In the narrowband case, diagonalization of the spatial covariance matrix often leads to optimum solutions. For example, the multiple signal classification algorithm uses an EVD of the instantaneous spatial covariance matrix to perform super-resolution direction finding [5], [10].

The defining feature of a narrowband problem is the fact that a time-delayed version of a signal can be approximated by the undelayed signal multiplied by a phase shift. The success of narrowband processing therefore depends on the accuracy of this approximation, which varies from problem to problem. It is well known that as this approximation degrades, various issues start to occur when using narrowband algorithms. In array processing problems, this is often because some quantity in the algorithm that is related to direction of arrival (DOA) starts to depend on the frequency of the signal. For example, in DOA algorithms, a wideband source can appear to be spatially distributed. Another issue is that of multipath. Reflections can cause problems, as different multipath signals are derived from a single source but arrive at the sensors at different times. This leads to various issues, which can be advantageous or disadvantageous depending on one’s point of view. In beamforming and DOA estimation, this causes a problem, as the bearing to the source is clearly not well defined.
However, in signal recovery problems, such as speech enhancement and communication systems, multipath is advantageous, as the signals can be combined to improve the signal-to-noise ratio (SNR). This is, however, possible only if the multipath signals are coherent. With narrowband processing, multipath signals appear to decorrelate as the delay increases. Multipath signals can also cause frequency-dependent fading, whereas narrowband processing can deal only with flat fading. Hence, for some problems, it is desirable to depart from narrowband processing and introduce some form of frequency-dependent processing.

For the broadband case, one common approach is to divide each broadband signal into multiple narrowband signals. While these narrowband signals are often processed independently by well-established and optimal narrowband techniques that are typically based on the EVD, splitting the broadband signal into independent frequency bins neglects spectral coherence and thus ignores correlations among different discrete Fourier transform (DFT) bins [11], [12]. As a result, optimal narrowband solutions applied in independent DFT bins give rise to suboptimal approaches to the overall broadband problem [13]. Broadband optimal solutions in the DFT domain need to consider the cross-coupling among DFT bins via cross terms, but the number of terms depends on the SNR and cannot be determined in advance [14], [15]. Another approach uses tapped delay line (TDL) processing [16], [17], [18], but the performance depends on the filter length, which is challenging to determine in practice. These approaches highlight the lack of generic tools to solve broadband problems directly.

Polynomial matrices are widely used in control theory and signal processing. In the control domain, these matrices are used to describe multivariable transfer functions for multiple-input multiple-output (MIMO) systems [19]. Control systems are usually designed for continuous-time systems and are analyzed in the Laplace domain. There, factorizations, such as the Smith and Smith–McMillan decompositions, of matrices in the Laplace variable s target unimodularity, which is critical in the control context for invertibility, and spectral factorizations with minimum phase components to minimize time delays [20]. More recently, within digital signal processing (DSP), multirate DSP exploits polynomial matrices to describe lossless filter bank systems using polyphase notation [6], [20].

In multichannel broadband arrays and convolutively mixed signals, the array signals are generally correlated in time across different sensors. Therefore, the time delays for broadband signals cannot be represented only by phase shifts but need to be explicitly modeled. The relative time shifts are captured using the space-time covariance polynomial matrix, where decorrelation over a range of time shifts can be achieved using a PEVD [21]. While the initial work on the PEVD was a numerical algorithm [21], the existence of the decomposition of an analytic positive semidefinite para-Hermitian matrix, such as the space-time covariance matrix, has only recently been proven [22], [23], [24]. In most cases, unique para-Hermitian eigenvalues and paraunitary eigenvectors for a para-Hermitian matrix EVD exist but are of infinite length.
However, being analytic, they permit good approximations by finite length factors, which are still helpful in many practical applications, such as beamformers [25], MIMO communications [26], source coding [27], signal enhancement [28], [29], source separation [30], source identification [31], and DOA estimation [32].
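The narrowband approximation discussed in this section, in which a delay is replaced by a phase shift, can be checked numerically. The following is a minimal sketch rather than anything taken from the cited work: it assumes a single complex tone offset by `df` from a hypothetical center frequency `f0` and compares the exact effect of a delay `tau` with the phase shift that is valid at `f0` only. The function name and all numerical values are illustrative assumptions.

```python
import cmath
import math

def narrowband_error(f0, df, tau):
    """Magnitude of the error made when a delay tau applied to a tone at
    frequency f0 + df is replaced by the phase shift valid at f0 only."""
    exact = cmath.exp(-2j * math.pi * (f0 + df) * tau)   # true effect of the delay
    approx = cmath.exp(-2j * math.pi * f0 * tau)         # narrowband phase shift
    return abs(exact - approx)

f0 = 1000.0      # hypothetical center frequency in Hz
tau = 1e-3       # hypothetical inter-sensor delay in seconds
for df in (0.0, 1.0, 10.0, 100.0):
    print(f"offset {df:6.1f} Hz -> error {narrowband_error(f0, df, tau):.4f}")
```

For a single tone the error equals 2|sin(π·df·tau)|, so it grows with the product of frequency offset and delay; this is one way to see why the approximation degrades as the bandwidth or the array aperture increases.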

Outline of the article

This article is organized as follows. The “Mathematical Background” section provides a primer on the relevant mathematical concepts. The “Preliminaries: Representing Broadband Signals” section introduces the notations and gives a background on multichannel array processing, including the use of spatial and space-time covariance matrices and the inadequacies of two common approaches. The “Polynomial Matrix EVD” section first introduces the PEVD, whose analytic eigenvalues and eigenvectors are described before their approximations by numerical algorithms are presented. The “Example Applications Using PEVD” section demonstrates the use of the PEVD for some multichannel broadband applications, namely, adaptive beamforming, subband coding, and speech enhancement. Concluding remarks and future perspectives are provided in the “Conclusions and Future Perspectives” section.

Mathematical background

Analytic functions

In the time domain, the key to describing the propagation of a broadband signal through a linear time-invariant (LTI) system is the difference equation, where the system output $y[n]$ depends on a weighted average of the input $x[n]$ and past values of both $y[n]$ and $x[n]$. This difference equation

$$y[n] = \sum_{\nu \ge 0} b[\nu]\, x[n-\nu] + \sum_{\mu > 0} a[\mu]\, y[n-\mu] \tag{1}$$

is straightforward to implement but does not lend itself to simple algebraic manipulations. For example, the difference equation for the concatenation of two LTI systems is not easily expressed in terms of the difference equations for the two component systems. For this reason, the z transform $x(z) = \sum_n x[n] z^{-n}$, with $z \in \mathbb{C}$, or for short, $x(z) \mathrel{\bullet\!\!-\!\!\circ} x[n]$, can

Algebra of Functions

We are interested in matrices whose entries are more general than complex numbers; specifically, entries that are analytic functions. Such matrices, and their algebraic manipulation, may at first seem a little exotic, but many operations for real and complex numbers carry over to this setting.

There are several different classes of functions, depending on what properties they have. For example, there are discontinuous functions, continuous but nondifferentiable functions, functions that are continuous and differentiable up to a certain order, and functions that are continuous and differentiable for all orders. Analytic functions are, by definition, those that have locally convergent power series. Consequently, they are infinitely differentiable and easier to work with than other types of functions. These series might have a finite number of terms, but in general, there are infinitely many. The truncation of these series results in polynomial approximations of the underlying analytic function.

Analytic functions can be algebraic or transcendental. An algebraic function $f(x)$ is a function that is a root of a polynomial equation. More specifically, $f$ is algebraic if it satisfies $p(x, f(x)) = 0$ for some irreducible polynomial $p(x, y)$ with coefficients in some field. Examples of algebraic functions include rational functions and nth roots of polynomials. Note that the inverse function of an algebraic function (if it exists) is also algebraic. An analytic function that is not algebraic is called a transcendental function. Examples include $e^x$, $\sin(x)$, and $\cos(x)$. Such functions have power series representations with an infinite number of terms.

Let us first consider analytic functions on their own. A function $f(z)$ is (complex) analytic in a domain (an open set) $D \subset \mathbb{C}$ if at each point it can be written as a locally convergent Taylor series. (Note that this means that such a function is infinitely differentiable.) The set $D$ is known as the domain of analyticity of $f(z)$. We note that two different analytic functions $f(z)$ and $g(z)$ may have different domains of analyticity, say, $D_f$ and $D_g$. When we operate on these functions, we assume that $D_f$ and $D_g$ overlap, i.e., that they have a nontrivial intersection, and restrict $f(z)$ and $g(z)$ to this common domain $D = D_f \cap D_g$. Then, we can perform certain fundamental operations on analytic functions, and the result will also be an analytic function with the same domain of analyticity $D$. In particular, if $f(z)$ and $g(z)$ are analytic on a domain $D \subset \mathbb{C}$, then $f(z) + g(z)$ is analytic; that is, it can be expressed as a locally convergent power series for any $z \in D$. Similarly, $f(z) - g(z)$ and $f(z) \cdot g(z)$ are analytic. Things become a little more complicated when we consider quotients of the form $f(z)/g(z)$, but the result is analytic everywhere except at zeros of $g(z)$, as might be expected. This “closure” is important since it means that as we manipulate analytic functions, we do not need to worry if the result is also analytic. Note, as well, that if the product $f(z) \cdot g(z) \equiv 0$ on $D$, then $f(z) \equiv 0$ on $D$ or $g(z) \equiv 0$ on $D$.

If we now restrict our attention to polynomials in $z$, which are analytic everywhere, and Laurent polynomials, which are analytic everywhere except $z = 0$, then we can say something more. Indeed, if $f(z)$ and $g(z)$ are (Laurent) polynomials, then $f(z) + g(z)$, $f(z) - g(z)$, and $f(z) \cdot g(z)$ are not just analytic but are also (Laurent) polynomials. Now, however, we must exercise some care when considering quotients $f(z)/g(z)$ since the result will be analytic in $D$ [except at the zeros of $g(z)$] but not be a (Laurent) polynomial in general. However, $f(z)/g(z)$, and, indeed, any analytic function, can be arbitrarily well approximated by polynomials, as discussed in the preceding.

Let us now consider matrices $R(z)$ whose entries are analytic functions in $D \subset \mathbb{C}$. We start by noting that for any fixed $z_0 \in \mathbb{C}$, the matrix $R(z_0)$ is simply a matrix of complex numbers that can be manipulated in the usual ways. For example, we can multiply $R(z_0)$ by another (conformable) matrix or vector, and we can compute the eigenvalue decomposition (EVD) of $R(z_0)$. When we instead allow $z$ to vary, it is still possible to form, say, matrix–matrix and matrix–vector products with $R(z)$. Indeed, using the arguments in the previous paragraphs, if $R(z)$ has analytic (polynomial) entries, then the resulting matrix or vector will also have analytic (polynomial) entries. However, it is not immediately obvious that we can write down a single z-dependent EVD of $R(z)$ that holds for all values of $z \in D$. That this is true in certain circumstances is proved in a remarkable result from Rellich [S1].

Reference

[S1] F. Rellich, “Störungstheorie der Spektralzerlegung. I. Mitteilung. Analytische Störung der isolierten Punkteigenwerte eines beschränkten Operators,” Mathematische Annalen, vol. 113, pp. 600–619, 1937, doi: 10.1007/BF01571652. [Online]. Available: https://eudml.org/doc/159886
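The remark that a matrix of analytic functions becomes an ordinary complex matrix once z is fixed can be made concrete with a small numerical sketch. Everything below is an illustrative assumption and not part of the original sidebar: the 2×2 polynomial matrix `A0 + A1 z^-1`, the Hermitian matrix formed from it at each point, and the closed-form eigenvalue formula for 2×2 Hermitian matrices that stands in for a general EVD routine.

```python
import cmath
import math

# Hypothetical 2x2 polynomial matrix A(z) = A0 + A1 z^-1 (illustrative values).
A0 = [[1.0, 0.0], [0.5, 1.0]]
A1 = [[0.0, 0.5], [0.0, -0.3]]

def eval_A(z0):
    """Evaluate A(z) at a fixed complex number z0."""
    return [[A0[i][k] + A1[i][k] / z0 for k in range(2)] for i in range(2)]

def eval_R(z0):
    """R(z0) = A(z0) A(z0)^H: a Hermitian positive semidefinite matrix of
    plain complex numbers."""
    A = eval_A(z0)
    return [[sum(A[i][k] * A[m][k].conjugate() for k in range(2))
             for m in range(2)] for i in range(2)]

def eig_hermitian_2x2(R):
    """Closed-form eigenvalues of a 2x2 Hermitian matrix, largest first."""
    a, d, b = R[0][0].real, R[1][1].real, R[0][1]
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + abs(b) ** 2)
    return mean + radius, mean - radius

# Sample the unit circle and compute the eigenvalues separately at each point.
for k in range(4):
    z0 = cmath.exp(2j * math.pi * k / 4)
    lam1, lam2 = eig_hermitian_2x2(eval_R(z0))
    print(f"Omega = {k}*pi/2: eigenvalues {lam1:.4f}, {lam2:.4f}")
```

Such pointwise decompositions are always available, but they do not by themselves show that the eigenvalues and eigenvectors fit together into a single z-dependent EVD; that is the question addressed by the result of Rellich cited above.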


be used to turn the time-domain convolution into the multiplicative expression $y(z) = h(z) \cdot x(z)$, which is easy to manipulate [33], [34]. The z transform exists as long as the time-domain quantities are absolutely summable; i.e., for $x(z)$, we require $\sum_n |x[n]| < \infty$. Values of $z$ for which the z transform is finite define the region of convergence (ROC), which, therefore, must include at least the unit circle since $\left|\sum_n x[n] e^{-j\Omega n}\right| \le \sum_n |x[n]| < \infty$, where $j = \sqrt{-1}$ is the imaginary unit. For values of $z$ within this ROC, the function $x(z)$ is complex analytic, which has profound consequences. Analytic functions mathematically belong to a ring such that any addition, subtraction, and multiplication will produce an analytic result. These operations potentially reduce the ROC. Dividing by an analytic function also results in an analytic function as long as the divisor does not have spectral zeros; again, this operation may shrink the ROC. For example, with $b(z)$ and $a(z)$ analytic and the latter without any zeros on the unit circle, then $h(z) = b(z)/a(z)$ is also guaranteed to be analytic. Note that the same cannot be said for nonanalytic functions. This is important since nonanalytic functions can be difficult to approximate optimally in practice (see the following; for more on the algebra of analytic functions, see “Algebra of Functions”).
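To illustrate why the z transform is convenient, the sketch below implements the difference equation (1) directly and checks that cascading two finite-impulse response systems in the time domain matches a single system whose coefficients are the product of the two z transforms. This is a minimal example under stated assumptions: the helper names `difference_equation` and `poly_mul`, the filter coefficients, and the input sequence are all hypothetical.

```python
def difference_equation(b, a, x):
    """Run y[n] = sum_{v>=0} b[v] x[n-v] + sum_{m>0} a[m] y[n-m].
    b holds b[0], b[1], ...; a holds the feedback taps a[1], a[2], ..."""
    y = []
    for n in range(len(x)):
        acc = sum(b[v] * x[n - v] for v in range(len(b)) if n - v >= 0)
        acc += sum(a[m - 1] * y[n - m] for m in range(1, len(a) + 1) if n - m >= 0)
        y.append(acc)
    return y

def poly_mul(p, q):
    """Multiply two polynomials in z^-1 given by their coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for k, qk in enumerate(q):
            out[i + k] += pi * qk
    return out

h1 = [1.0, 0.5]          # h1(z) = 1 + 0.5 z^-1
h2 = [1.0, -0.25, 0.1]   # h2(z) = 1 - 0.25 z^-1 + 0.1 z^-2
x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 0.0, 0.0]

# Cascade in the time domain: filter with h1, then with h2.
y_cascade = difference_equation(h2, [], difference_equation(h1, [], x))
# Single system whose z transform is the product h1(z) h2(z).
y_product = difference_equation(poly_mul(h1, h2), [], x)

print(y_cascade)
print(y_product)
```

The two printed sequences agree up to rounding, which reflects the multiplicative expression y(z) = h(z) · x(z) applied twice.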

Laurent series, power series, and polynomials

Throughout this article, we often represent z transforms by series, i.e., by expressions of the form

$$h(z) = \sum_{n=N_1}^{N_2} h[n]\, z^{-n}. \tag{2}$$

This is motivated by the fact that analytic functions can be represented by a Taylor (or, equivalently, power) series within the ROC. More generally, we are interested in Laurent series, power series, Laurent polynomials, and polynomials, which we distinguish in the following. For finite $N_1$ and $N_2$ in (2), $h(z)$ is a Laurent polynomial if $N_1$ and $N_2$ have opposing signs. If $N_1$ and $N_2$ share the same sign, i.e., if $h(z)$ is purely an expression in powers of either $z^{-1}$ or $z$, it is a polynomial. Typically, by a polynomial, we refer to an expression that contains powers in $z^{-1}$. If interpreted as a transfer function, a polynomial $h(z)$ in $z^{-1}$ refers to a causal finite-impulse response filter. If it possesses finite coefficients, then a polynomial or Laurent polynomial $h(z)$ will always be absolutely summable and hence be analytic.

A Laurent series is characterized by $N_1 \to -\infty$ and $N_2 \to \infty$, while for a power series, $h(z)$ strictly contains only powers in either $z^{-1}$ (for $N_1 \ge 0$ and $N_2 \to \infty$) or $z$ (for $N_1 \to -\infty$ and $N_2 \le 0$). Both Laurent and power series possess a generally infinite coefficient sequence $\{h[n]\}$. Such sequences can be used to represent rational functions, where $h(z) = b(z)/a(z)$ is a ratio of two polynomials; with respect to (1), such a power series can describe an infinite-impulse response filter. Further and more generally, Laurent and power series can also represent transcendental functions, which are absolutely convergent but may not be representable by a finite number of algebraic operations, such as a ratio.

In signal processing, polynomials and convergent power series can represent quantities, such as finite- and infinite-impulse responses, of either causal or anticausal stable systems. In contrast, Laurent series and Laurent polynomials appear as a result of correlation operations. First, assume that a zero-mean unit variance uncorrelated random signal $x[n]$ excites a system with impulse response $h[n]$. With the input autocorrelation sequence $r_x[\tau] = E\{x[n]\, x^*[n-\tau]\} = \delta[\tau]$, the output autocorrelation is $r_y[\tau] = \sum_n h^*[-n]\, h[\tau - n]$ [35], where $E\{\cdot\}$ and $[\cdot]^*$ are the expectation and complex conjugate operators, respectively. Then, its z transform, the power spectral density $r_y(z) \mathrel{\bullet\!\!-\!\!\circ} r_y[\tau]$, i.e., $r_y(z) = h(z)\, h^*(1/z^*)$, will be a Laurent series if $h(z)$ is a power series and a Laurent polynomial if $h(z)$ is a polynomial.
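The way a correlation turns a polynomial into a Laurent polynomial can be seen in a few lines. The sketch below evaluates the output autocorrelation for a short, hypothetical impulse response `h`, using the equivalent form r_y[τ] = Σ_m h[m + τ] h*[m], obtained from the expression above by substituting m = −n; the function name and coefficient values are illustrative assumptions.

```python
def output_autocorrelation(h):
    """Return lags and r_y[tau] = sum_m h[m + tau] * conj(h[m]) for an FIR
    impulse response h excited by unit-variance white noise."""
    L = len(h)
    lags = list(range(-(L - 1), L))
    r = []
    for tau in lags:
        acc = 0j
        for m in range(L):
            if 0 <= m + tau < L:
                acc += h[m + tau] * complex(h[m]).conjugate()
        r.append(acc)
    return lags, r

h = [1.0, 0.5j, -0.25]   # hypothetical impulse response: a polynomial in z^-1
lags, r_y = output_autocorrelation(h)
for tau, value in zip(lags, r_y):
    print(tau, value)
```

Although `h` only has coefficients at nonnegative lags, `r_y` has support at both negative and positive lags, i.e., powers of both z and z^-1, and satisfies r_y[−τ] = r_y*[τ].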

Polynomial approximation and polynomial arithmetic

By the Weierstrass theorem, any continuous function on a closed, bounded interval can be arbitrarily well approximated by a polynomial of sufficient degree, but, in general, it can be nontrivial to construct the approximating polynomials. However, for analytic functions, such as Laurent and power series $h(z)$, this approximation can be easily obtained by truncating $h[n] \mathrel{\circ\!\!-\!\!\bullet} h(z)$ to the required order. If the result is a Laurent polynomial as in (2), describing, e.g., the impulse response of a noncausal system, then a polynomial (or causal system) can be obtained by a delay by $N_1$ sampling periods. Thus, by delay and truncation, all the preceding expressions describing analytic functions (Laurent series, power series, and Laurent polynomials) can be arbitrarily closely approximated by polynomials.

Key Statement While operations on analytic functions tend to yield analytic functions, the same is not true for polynomials; e.g., the ratio of polynomials generally yields a rational function but not a polynomial. Nonetheless, since the resulting function is analytic, it can be approximated arbitrarily closely by a polynomial via appropriate delay and truncation operations.
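As an illustration of approximation by truncation, the sketch below assumes a hypothetical first-order rational function $h(z) = 1/(1 - a z^{-1})$ with $|a| < 1$, whose power series is $\sum_{n \ge 0} a^n z^{-n}$. The pole value and the orders are arbitrary choices; the point is only that the worst-case error on the unit circle shrinks as the polynomial order grows.

```python
import numpy as np

a = 0.8                                    # hypothetical pole, |a| < 1 (stable, causal)
omega = np.linspace(0.0, 2.0 * np.pi, 512, endpoint=False)
z_inv = np.exp(-1j * omega)                # z^-1 evaluated on the unit circle
h_exact = 1.0 / (1.0 - a * z_inv)          # rational function 1 / (1 - a z^-1)

def truncated_response(order):
    """Evaluate the polynomial sum_{n=0}^{order} a^n z^-n on the unit circle."""
    coeffs = a ** np.arange(order + 1)
    return np.polynomial.polynomial.polyval(z_inv, coeffs)

# Worst-case approximation error over frequency for increasing polynomial order.
errors = {N: np.max(np.abs(h_exact - truncated_response(N))) for N in (5, 10, 20, 40)}
```

For this particular example, the discarded tail is itself a geometric series, so the maximum error equals $a^{N+1}/(1-a)$ and decays geometrically with the order $N$.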

Matrices of analytic functions and polynomial matrices

In this article, we consider matrices whose entries are analytic functions in general and their close approximation by polynomial matrices in particular. The mathematical theory of polynomial matrices that depend on a real parameter has been studied in, e.g., [36]. This has found application, for example, in the control domain [37]. Within signal processing, polynomial matrices have been used in filter bank theory. Specifically, polyphase notation [38] has been utilized to allow efficient implementation. Here, polynomial matrices in the form of polyphase analysis and synthesis matrices describe networks of filters operating on demultiplexed single-channel data. More generally, polynomial matrices have been used to define space-time covariance matrices on demultiplexed data streams [20] and directly for multichannel data [21].

Polynomial matrix factorizations A number of polynomial matrix factorizations have been introduced in the past. Since we are particularly interested in diagonalizations of matrices, these prominently include the Smith and Smith–McMillan forms for matrices of polynomials and rational functions, respectively [20]. Popular in the control domain, these allow a decomposition into a diagonal term and two outer factors that are invertible but generally nonorthogonal polynomial matrices. Further, spectral factorizations [37], [39] involve the decomposition of a matrix into a product of a causal stable matrix and its time-reversed, complex-conjugated, and transposed version. These are matrix-valued extensions of Wiener’s factorization of a power spectral density into minimum and maximum phase components, and they are supported by numerical tools, such as PolyX [40]. In control theory, minimizing the delay of a system is a critical design issue, and hence, many of the existing matrix decompositions, such as spectral factorization, emphasize the minimum phase equivalent of the resulting matrix factors. In signal processing, the delay is often a secondary issue, while, e.g., energy preservation (or unitarity) of a transform is crucial. Therefore, in the following, we explore the diagonalization of an analytic matrix by means of energy-preserving transformations.

Preliminaries: Representing broadband signals

Signal model

The received signal at the $m$th of a total of $M$ sensors for the discrete-time index $n$ is

$$x_m[n] = \sum_{\ell=1}^{L} \sum_{\tau=0}^{T} a_{m,\ell}[\tau]\, s_\ell[n-\tau] + v_m[n], \qquad m = 1, \dots, M, \tag{3}$$

where $a_{m,\ell}[n]$ models the channel from the $\ell$th source signal $s_\ell[n]$ to the $m$th sensor and is an element of $\mathbf{A}[n] \in \mathbb{C}^{M \times L \times (T+1)}$, $v_m[n]$ is the additive noise at the $m$th sensor assumed to be uncorrelated with the $L$ source signals, and $T$ is the maximum order of any of the channel impulse responses. The received data vector for $M$ sensors is $\mathbf{x}[n] = [x_1[n], \dots, x_M[n]]^T \in \mathbb{C}^{M}$, and each element has a mean of $E\{x_m[n]\} = 0\ \forall m$, where $[\cdot]^T$ represents the transpose operator. Similarly, the source and noise data vectors are $\mathbf{s}[n] \in \mathbb{C}^{L}$ and $\mathbf{v}[n] \in \mathbb{C}^{M}$, respectively.

FIGURE 1. The multichannel system model for L spectral-shaped source signals and M sensors. Uncorrelated noise signals v[n], not drawn in the figure, are optionally added to each sensor based on (3).

Broadband source signals, which naturally arise in, for example, audio, speech, communications, sonar, and radar, are directly reflected by (3). This is also applicable for narrowband systems. Here, the source signal is often described by a complex exponential, $e^{j\Omega n}$, where $\Omega$ is the normalized angular frequency. This means that (3) can be simplified by setting $T = 0$. As an alternative to (3), as shown in Figure 1, the $L$ source signals, $s_\ell[n]$, $\ell = 1, \dots, L$, could be generated using spectral-shaped noise obtained by filtering uncorrelated zero-mean unit variance complex Gaussian random variables, $u_\ell[n] \in \mathcal{N}(0, 1)$, through some innovation filters, $f_\ell[n]$ [35]. The channel model in (3) can describe systems in diverse scenarios, for example, instantaneous and convolutive mixtures, near-field and far-field sources, and anechoic and reverberant environments. The signal model in (3) is often simplified by taking the z transform. However, care is needed, as the z transform of a random signal does not exist. Nonetheless, in the case of deterministic absolutely summable signals, the z transform of (3) may be written in matrix-vector form as

$$\mathbf{x}(z) = \mathbf{A}(z)\,\mathbf{s}(z) + \mathbf{v}(z), \tag{4}$$

where $\mathbf{A}[n] \circ\!-\!\bullet\; \mathbf{A}(z) \in \mathbb{C}^{M \times L}$, $\mathbf{s}[n] \circ\!-\!\bullet\; \mathbf{s}(z) \in \mathbb{C}^{L}$, and $\mathbf{v}[n] \circ\!-\!\bullet\; \mathbf{v}(z) \in \mathbb{C}^{M}$ are z-transform pairs of the channel matrix, source, and noise vectors, respectively. The well-known equivalence of convolution in the time domain and multiplication in the z domain [33] is expressed in (3) and (4).

Covariance matrices

The covariance matrix used in many narrowband subspace-based approaches [5], [8], [10] is described by

$$\mathbf{R} = E\{\mathbf{x}[n]\,\mathbf{x}^H[n]\} \tag{5}$$

using the data vector obtained from (3). The $(m, \ell)$th element of $\mathbf{R}$ is $r_{m,\ell} = E\{x_m[n]\,x_\ell^*[n]\}$, and the expectation operation is performed over $n$. In practice, the expectation is approximated using the sample mean, where the inner product between the received signals at the $m$th and $\ell$th sensor is computed before normalizing by the total number of samples $N$. Because the inner product is calculated sample-wise, the covariance matrix instantaneously captures the spatial relationship among different sensors. This article calls it the instantaneous (or spatial) covariance matrix. When the system involves convolutive mixing and broadband signals, time delays among signals at different sensors need to be modeled. This spatiotemporal relationship is explicitly captured by the space-time covariance matrix parameterized by the discrete-time lag parameter $\tau \in \mathbb{Z}$, defined as [21]

$$\mathbf{R}[\tau] = E\{\mathbf{x}[n]\,\mathbf{x}^H[n-\tau]\}. \tag{6}$$



The $(m, \ell)$th element of $\mathbf{R}[\tau]$, arising from sensors with a fixed geometry, is $r_{m,\ell}[\tau] = E\{x_m[n]\,x_\ell^*[n-\tau]\}$, and, again, the expectation operation is performed over $n$, where wide-sense temporal stationarity is assumed. The autocorrelation and cross-correlation sequences are obtained when $m = \ell$ and $m \ne \ell$, respectively. Furthermore, (5) can be seen as a special case of (6) when only the instantaneous lag is considered; i.e., $\mathbf{R}[0]$ is the coefficient of $z^0$ when $\tau = 0$, as demonstrated in Figure 2. The z transform of the space-time covariance matrix in (6),

$$\mathbf{R}(z) = \sum_{\tau=-\infty}^{\infty} \mathbf{R}[\tau]\, z^{-\tau}, \tag{7}$$

known as the cross spectral density (CSD) matrix, is a para-Hermitian polynomial matrix satisfying the property $\mathbf{R}^P(z) = \mathbf{R}(z)$. (The symbol $[\cdot]^P$ denotes the para-Hermitian operator, $\mathbf{R}^P(z) = \mathbf{R}^H(1/z^*)$, which involves a Hermitian transpose followed by a time reversal operation [20], where $[\cdot]^H$ denotes the Hermitian transpose operator.) The polynomial matrix can be interpreted as a matrix of polynomials (functions) as well as a polynomial with matrix coefficients; i.e., $\mathbf{R}[\tau]$ is the matrix coefficient of $z^{-\tau}$. This is visualized in Figure 2(b), which describes the temporal evolution of the spatial relationship across the entire array. Equivalently, the same polynomial matrix can also be interpreted as a matrix with polynomial elements, representing the temporal correlation in the z domain between sensor pairs, for example, element $r_{3,1}(z)$ for sensors 3 and 1 in Figure 2(c).

Key Statement The space-time covariance matrix completely captures the second-order statistics of multichannel broadband signals via auto- and cross-correlation functions. Its z transform has the useful property of being para-Hermitian.
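A minimal sketch of how the space-time covariance matrix in (6) might be estimated from data follows. The function name `space_time_covariance`, the two-sensor toy signal, and the sample size are assumptions made for illustration; the expectation is replaced by a sample mean, and negative lags are filled in using the para-Hermitian symmetry $\mathbf{R}[-\tau] = \mathbf{R}^H[\tau]$.

```python
import numpy as np

def space_time_covariance(x, max_lag):
    """Estimate R[tau] = E{x[n] x^H[n - tau]} for tau = -max_lag, ..., max_lag.

    x has shape (M, N): M sensors, N samples. The result has shape
    (2*max_lag + 1, M, M), with lag tau stored at index max_lag + tau.
    """
    M, N = x.shape
    R = np.zeros((2 * max_lag + 1, M, M), dtype=complex)
    for tau in range(max_lag + 1):
        # Pair x[n] with x[n - tau] for n = tau, ..., N - 1 (sample mean over n).
        R[max_lag + tau] = x[:, tau:] @ x[:, :N - tau].conj().T / N
        # Negative lags follow from R[-tau] = R^H[tau].
        R[max_lag - tau] = R[max_lag + tau].conj().T
    return R

# Toy data (hypothetical): one white source, second sensor delayed by two samples.
rng = np.random.default_rng(0)
N = 5000
s = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
x = np.vstack([s, np.concatenate([np.zeros(2), s[:-2]])])

max_lag = 4
R = space_time_covariance(x, max_lag)
peak_lag = int(np.argmax(np.abs(R[:, 1, 0]))) - max_lag   # lag of the largest |r_21[tau]|
```

In this toy case, the cross-correlation between sensors 2 and 1 peaks at the lag equal to the inter-sensor delay, which is exactly the temporal information that the zero-lag matrix $\mathbf{R}[0]$ alone would miss.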

Comparison with other broadband signal representations

The multichannel signal model introduced in the “Signal Model” section is compared against two signal representations commonly encountered in array processing. They are the TDL and short-time Fourier transform (STFT) approaches.

TDL processing

The relative delays with which broadband signals arrive at different sensors cannot be sufficiently modeled by phase shifts because they can be accurate only at a single frequency. Therefore, these delays need to be implemented by filters that possess, at the very least, frequency-dependent phase shifts. Such filters must rely on processing a temporal window of the signals; the access to this window can, in the finite-impulse response case, be provided by TDLs that are attached to each of the $M$ array signals. The length $T$ of these TDLs will determine the accuracy with which such delays, often of a fractional nature [18], are realized. Based on the array signal vector $\mathbf{x}[n]$, a $T$-element TDL provides data that can be represented by a concatenated vector $\boldsymbol{\chi}[n] = [\mathbf{x}^T[n], \dots, \mathbf{x}^T[n-T+1]]^T \in \mathbb{C}^{MT}$, which holds both spatial and temporal samples. For the covariance matrix of $\boldsymbol{\chi}[n]$, $\mathbf{R}_\chi = E\{\boldsymbol{\chi}[n]\,\boldsymbol{\chi}^H[n]\}$, we have

$$\mathbf{R}_\chi = \begin{bmatrix} \mathbf{R}[0] & \cdots & \mathbf{R}[T-1] \\ \vdots & \ddots & \vdots \\ \mathbf{R}[-T+1] & \cdots & \mathbf{R}[0] \end{bmatrix}. \tag{8}$$

Although of different dimensions, the covariance $\mathbf{R}_\chi$ contains, as submatrices, the same terms that also make up the space-time covariance matrix $\mathbf{R}[\tau]$. However, it is not necessarily clear prior to processing how large or small $T$ should be selected. Apart from its impact on the accuracy of a delay implementation, if $T$ is selected smaller than the coherence time of the signal, then some temporal correlations for lags $\tau \ge T$ in the signals are missed, leading to a potentially insufficient characterization of the signals’ second-order statistics. If $T$ is set too large, then no extra correlation information is included, but additional noise may be added. The EVD of the covariance matrix $\mathbf{R}_\chi = \mathbf{Q}_\chi \boldsymbol{\Lambda}_\chi \mathbf{Q}_\chi^H$ gives access to $MT$ eigenvalues in $\boldsymbol{\Lambda}_\chi$. In inspecting these eigenvalues, there no longer is any separation between space and time, and, for example, a single broadband source that is captured by the array in its data vector $\mathbf{x}[n]$ can generate, depending on its bandwidth, anything between one and $T + D$ nonzero eigenvalues, where $D$ is the maximum propagation delay across the array that any source can experience. Hence, tasks such as source enumeration can become challenging. Furthermore, in narrowband processing, a common procedure is to project the received signals $\mathbf{x}$ onto the so-called signal subspace, as this suppresses some of the noise [7]. The signal subspace is defined by partitioning the eigenvalues by magnitude and selecting the subset of eigenvectors corresponding to the larger eigenvalues. Mimicking this in the broadband case would mean partitioning $\mathbf{Q}_\chi = [\mathbf{Q}_s\ \mathbf{Q}_n]$, where $\mathbf{Q}_s$ corresponds to the larger eigenvalues. In the narrowband case, it is well known that, in general, the projected signals $\mathbf{y}[n] = \mathbf{Q}_s^H \boldsymbol{\chi}[n]$ are not the source signals but merely span the same space. However, if only one source signal is present, then the projected signal is the source signal. In the broadband case, not even this is true since, as noted in the preceding, we might have more than one eigenvalue per signal.

FIGURE 2. (a) A typical spatial covariance matrix for zero lag, i.e., $\tau = 0$, is the matrix coefficient of $z^0$. (b) In general, each matrix slice corresponds to a coefficient of the polynomial. (c) The same polynomial matrix is also a matrix consisting of polynomial elements represented by tubes in the same cube.
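The loss of space-time separation in the TDL covariance can be illustrated with a small sketch. It assumes a hypothetical noise-free scenario with two sensors, one white unit-variance source, and a propagation delay of $D = 2$ samples between the sensors; the helper `tdl_covariance` (an illustrative name) assembles the block-Toeplitz matrix from the lags $\mathbf{R}[\tau]$ so that block $(i, j)$ equals $\mathbf{R}[j-i]$.

```python
import numpy as np

def tdl_covariance(R, T):
    """Assemble the MT x MT TDL covariance from lags R[tau].

    R has shape (2K + 1, M, M) with lag tau stored at index K + tau, K >= T - 1.
    Block (i, j) is E{x[n - i] x^H[n - j]} = R[j - i].
    """
    K, M = R.shape[0] // 2, R.shape[1]
    R_chi = np.zeros((M * T, M * T), dtype=complex)
    for i in range(T):
        for j in range(T):
            R_chi[i * M:(i + 1) * M, j * M:(j + 1) * M] = R[K + (j - i)]
    return R_chi

# Hypothetical exact lags: x_1[n] = s[n], x_2[n] = s[n - 2], s[n] white with unit variance.
T, D, K = 4, 2, 3
R = np.zeros((2 * K + 1, 2, 2), dtype=complex)
R[K] = np.eye(2)                       # R[0]
R[K + D] = [[0, 0], [1, 0]]            # r_21[2] = E{x_2[n] x_1*[n - 2]} = 1
R[K - D] = R[K + D].conj().T           # R[-2] = R^H[2]

R_chi = tdl_covariance(R, T)
eigenvalues = np.linalg.eigvalsh(R_chi)
num_nonzero = int(np.sum(eigenvalues > 1e-9))
```

Under these assumptions, the single full-band source occupies $T + D = 6$ of the $MT = 8$ eigenvalues, so counting significant eigenvalues of the TDL covariance does not reveal the number of sources.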

STFT

If we take a $T$-point DFT $\mathbf{W}$ of each of the TDLs in $\boldsymbol{\chi}[n]$, we evaluate $\boldsymbol{\xi}[n] = (\mathbf{W} \otimes \mathbf{I}_M)\,\boldsymbol{\chi}[n]$, with $\otimes$ being the Kronecker product. The DFT domain covariance matrix $\mathbf{R}_\xi = E\{\boldsymbol{\xi}[n]\,\boldsymbol{\xi}^H[n]\} = (\mathbf{W} \otimes \mathbf{I}_M)\,\mathbf{R}_\chi\,(\mathbf{W} \otimes \mathbf{I}_M)^H$ is generally nonsparse due to cross-coupling between DFT bins. This cross-coupling does not subside even as $T$ is increased. For bin-wise processing, i.e., processing each of the frequency bins across the array independently of other frequency bins, many of the terms in $\mathbf{R}_\xi$ are neglected, leading to processing that can be very low cost but generally is suboptimal. To achieve optimality, time-domain criteria must be embedded in the processing, which generally leads to cross terms between bins [14], [41]. The generally dense nature of $\mathbf{R}_\xi$ can be relaxed when employing more frequency-selective subband methods over DFT processing, but cross terms, at least between adjacent subbands, still remain [42]. Together with the increased computational cost of such filters over the DFT, this negates the low-complexity aspiration of this approach.

Key Statement Broadband processing requires accurately representing fractional time delays. Previous approaches do not lead to proper generalizations of the narrowband algorithms and are often suboptimal.

Polynomial matrix EVD

As discussed in the “Comparison With Other Broadband Signal Representations” section, conventional approaches to processing broadband signals have some shortcomings. Arguably, this is because the incorrect signal representation is used. Specifically, the use of a TDL, with either time-domain or frequency-domain processing, mixes up the spatial and temporal dimensions. This section builds on the signal model in the “Preliminaries: Representing Broadband Signals” section, representing the broadband system using z transforms and “polynomials.” Guided by the successful use of linear algebra, i.e., EVD, in narrowband systems, this section focuses on the decomposition of para-Hermitian polynomial matrices, such as the space-time covariance matrix. That is, given a para-Hermitian polynomial matrix $\mathbf{R}(z) = \mathbf{R}^P(z)$, does a decomposition $\mathbf{R}(z) = \mathbf{Q}(z)\,\boldsymbol{\Lambda}(z)\,\mathbf{Q}^P(z)$ exist where $\boldsymbol{\Lambda}(z)$ is diagonal and $\mathbf{Q}(z)$ is paraunitary? Note that the EVD can diagonalize a para-Hermitian matrix $\mathbf{R}[\tau]$ for only one specific lag $\tau = \tau_0$ and, alternatively, $\mathbf{R}(z) \bullet\!-\!\circ\; \mathbf{R}[\tau]$ for one specific value $z = z_0$. The unitary matrices that accomplish the diagonalization at that value are unlikely to diagonalize the matrix at other values $\tau \ne \tau_0$ and $z \ne z_0$. We therefore require a decomposition that diagonalizes $\mathbf{R}[\tau]$ for all values of $\tau$ and $\mathbf{R}(z)$ for all values of $z$ within the ROC. We address the existence of such a decomposition via the analytic EVD in the “Analytic EVD” section and provide some comments on numerical algorithms in the “PEVD Algorithms” section.

Analytic EVD

The key to a more general EVD is the work by Rellich [43], who, in the context of quantum mechanics, investigated a matrix-valued function $\mathbf{A}(t)$ that is self-adjoint, i.e., $\mathbf{A}(t) = \mathbf{A}^H(t)$, and analytic in $t$ on some real interval. Matrix-valued functions of this type admit a decomposition $\mathbf{A}(t) = \mathbf{U}(t)\,\boldsymbol{\Gamma}(t)\,\mathbf{U}^H(t)$ with matrix-valued functions $\mathbf{U}(t)$ and $\boldsymbol{\Gamma}(t)$ that are also analytic in $t$ and where $\boldsymbol{\Gamma}(t)$ is diagonal and $\mathbf{U}(t_0)$ is unitary for any specific value $t = t_0$. These results were obtained through perturbation analysis [44], where for the EVD of a matrix $\mathbf{A}(t_0) = \mathbf{U}(t_0)\,\boldsymbol{\Gamma}(t_0)\,\mathbf{U}^H(t_0)$, a change of $\mathbf{A}(t_0)$ by some small Hermitian matrix results in only a limited perturbation of both the eigenvalues and eigenvectors. There is no such guarantee if $\mathbf{A}(t)$ is not analytic in $t$; even infinite differentiability is not sufficient [44].

To decompose a matrix $\mathbf{R}(z)$ that is analytic in the complex-valued parameter $z$, it suffices to investigate $\mathbf{R}(z)$ on the unit circle for $z = e^{j\Omega}$. This is due to the uniqueness theorem for analytic functions, which guarantees that if two functions are identical on some part of their ROC (here, the unit circle, which must always be included), they must be identical across the entire ROC. Although $\Omega \in \mathbb{R}$, Rellich’s results do not directly apply, as they do not imply a $2\pi$ periodicity. Without such periodicity, it is not possible to reparameterize the EVD factors by replacing $e^{j\Omega}$ with $z$ and hence produce an EVD that is analytic in $z$. However, it has recently been shown that Rellich’s result admits $2\pi N$-periodic eigenvalue functions, and, furthermore, $N = 1$ unless the data generating $\mathbf{R}(z)$ emerge from $N$-fold multiplexing or block filtering [23], [24]. Analytic eigenvector functions, then, exist with the same periodicity as the eigenvalues [45]. Therefore, an analytic EVD for this $N$-fold multiplexed system

$$\mathbf{R}(z^N) = \mathbf{Q}(z)\,\boldsymbol{\Lambda}(z)\,\mathbf{Q}^P(z) \tag{9}$$

exists with analytic factors such that $\boldsymbol{\Lambda}(z)$ is diagonal. The matrix $\mathbf{Q}(z)$ contains the eigenvector functions and for $z = e^{j\Omega_0}$ is unitary. For a general $z$, $\mathbf{Q}(z)$ is paraunitary such that $\mathbf{Q}(z)\,\mathbf{Q}^P(z) = \mathbf{Q}^P(z)\,\mathbf{Q}(z) = \mathbf{I}$. Paraunitarity is an extension of the orthonormal and unitary properties of matrices from the real- and complex-valued cases to matrices that are functions in a complex variable [20]. For ease of exposition in the following, we talk of “analytic matrices $\mathbf{X}(z)$,” with the understanding that $\mathbf{X}(z)$ is a matrix-valued analytic function.

In the analytic EVD of (9), the eigenvalue function is $\boldsymbol{\Lambda}(z) = \mathrm{diag}\{\lambda_1(z), \dots, \lambda_M(z)\}$, where $\mathrm{diag}(\cdot)$ forms a diagonal matrix from its argument. When evaluated on the unit circle, the eigenvalues $\lambda_m(e^{j\Omega})$, $m = 1, \dots, M$, are real valued and unique up to a permutation. If there are $M$ distinct eigenvalues, i.e., $\lambda_m(e^{j\Omega}) = \lambda_n(e^{j\Omega})$ only for $m = n$ for any $m, n = 1, \dots, M$, then the corresponding eigenvectors $\mathbf{q}_m(z)$ in $\mathbf{Q}(z) = [\mathbf{q}_1(z), \dots, \mathbf{q}_M(z)]$ are unique up to an arbitrary all-pass function; i.e., $\mathbf{q}'_m(z) = \psi_m(z)\,\mathbf{q}_m(z)$ is also a valid analytic eigenvector of $\mathbf{R}(z^N)$, where $\psi_m(z)$ is all-pass. As an example, consider a system from [23],

$$\mathbf{R}(z) = \begin{bmatrix} \dfrac{1-j}{2}z + 3 + \dfrac{1+j}{2}z^{-1} & \dfrac{1+j}{2}z^{2} + \dfrac{1-j}{2} \\[2ex] \dfrac{1+j}{2} + \dfrac{1-j}{2}z^{-2} & \dfrac{1-j}{2}z + 3 + \dfrac{1+j}{2}z^{-1} \end{bmatrix}, \tag{10}$$

which is constructed from eigenvalues $\boldsymbol{\Lambda}(z) = \mathrm{diag}\{z + 3 + z^{-1},\ -jz + 3 + jz^{-1}\}$ and their corresponding eigenvectors $\mathbf{q}_{1,2}(z) = [1, \pm z^{-1}]^T/\sqrt{2}$. The evaluation of the eigenvalues on the unit circle, $\lambda_{1,2}(e^{j\Omega})$, is presented in Figure 3(a). Figure 3(b) shows the Hermitian angle of the eigenvectors, $\varphi_m(e^{j\Omega}) = \arccos\big(|\mathbf{q}_1^H(e^{j0})\,\mathbf{q}_m(e^{j\Omega})|\big)$. Note that due to the analyticity of the EVD factors, all these quantities evolve smoothly with the normalized angular frequency $\Omega$. An all-pass modification of the eigenvectors might be as simple as imposing a delay; while this will not affect $\varphi_m(e^{j\Omega})$, it can increase the support of $\mathbf{Q}[n] \circ\!-\!\bullet\; \mathbf{Q}(z)$.

While in the preceding example, the factorization yields polynomial factors, this does not have to be the case: they could be Laurent and power series. For example, modifying the previous eigenvectors by arbitrary all-pass functions does not invalidate the decomposition, but it may change the order of $\mathbf{q}_m(z)$ to $\infty$, i.e., a power series. More generally, Laurent polynomial matrices $\mathbf{R}(z)$ are likely to lead to algebraic and even transcendental functions as EVD factors [23], [24]. Nonetheless, recall from the “Introduction” section that analyticity implies absolute convergence in the time domain. Therefore, the best least-squares approximation is achieved by truncation. Further, as the approximation order increases, the approximation error can be made arbitrarily small.

The components of the analytic PEVD have some useful properties. The matrix of eigenvectors $\mathbf{Q}[n]$ can be viewed as a lossless filter bank. Clearly, it transforms the input time series into another set of time series. However, being paraunitary, the energy in the output signals is the same as that of the input signals. Furthermore, the output signals are strongly decorrelated. That is, any two signals have zero cross-correlation coefficients at all lags. Significantly, the signals are not temporally whitened; i.e., they do not have an impulse as their autocorrelation function. Note that the order of a z transform is connected to the time-domain support of the corresponding time series. Thus, the computational cost of implementing such a filter bank is related to the order of $\mathbf{Q}(z)$. In general, the eigenvalues of a narrowband covariance matrix have differing magnitudes, with the presence of small values indicating approximate linear dependency among the input signals. Similarly, the eigenvalues on the diagonal of $\boldsymbol{\Lambda}[n]$ can show linear dependence but in a frequency-dependent manner.

Key Statement The pioneering work of Rellich showed that an analytic EVD exists for a matrix function. Applying this to a space-time covariance matrix on the unit circle introduces some additional constraints but results in the existence of an analytic PEVD.

FIGURE 3. (a) Analytic eigenvalues on the unit circle and (b) Hermitian angles of the corresponding analytic eigenvectors, measured against a reference vector [23].

PEVD algorithms

The first attempt at producing a PEVD algorithm began with the second-order sequential best rotation (SBR2) [21], which was motivated by Jacobi’s method for numerically computing the EVD [2]. The PEVD of $\mathbf{R}(z)$, i.e., (7), as given by (9) for $N = 1$ and established in the “Analytic EVD” section, can be approximated using an iterative algorithm and is expressed as [21], [46]

$$\mathbf{R}(z) \approx \mathbf{U}(z)\,\boldsymbol{\Lambda}(z)\,\mathbf{U}^P(z), \tag{11}$$

where the columns of the polynomial matrix, $\mathbf{U}(z) \in \mathbb{C}^{M \times M}$, correspond to the eigenvectors with their associated eigenvalues on the diagonal polynomial matrix, $\boldsymbol{\Lambda}(z) \in \mathbb{C}^{M \times M}$. The Laurent polynomial matrix factors $\mathbf{U}(z)$ and $\boldsymbol{\Lambda}(z)$ are necessarily analytic, being of finite order. However, under certain circumstances, the theoretical factors in the PEVD might not be analytic [23], [24], in which case, the Laurent polynomial matrix factors $\mathbf{U}(z)$ and $\boldsymbol{\Lambda}(z)$ are only approximations of the true factors, hence the approximation in (11). Rewriting (11) as

$$\boldsymbol{\Lambda}(z) \approx \mathbf{U}^P(z)\,\mathbf{R}(z)\,\mathbf{U}(z), \tag{12}$$

the diagonalization of $\mathbf{R}(z)$ can be achieved by generalized similarity transformations, with $\mathbf{U}(z)$ satisfying the paraunitary, or lossless, condition [20]

$$\mathbf{U}^P(z)\,\mathbf{U}(z) = \mathbf{U}(z)\,\mathbf{U}^P(z) = \mathbf{I}, \tag{13}$$

where $\mathbf{I}$ is the identity matrix. The similarity transform $\mathbf{U}(z)$ may be calculated via an iterative algorithm, such as SBR2 or sequential matrix diagonalization (SMD) [46]. Here, a sequence of elementary paraunitary transformations $\mathbf{G}_i(z)$ $(i = 1, \dots)$ is applied to $\mathbf{R}(z)$ until the polynomial matrix becomes approximately diagonal; i.e., starting from $\tilde{\mathbf{R}}_0(z) = \mathbf{R}(z)$, the expression

$$\tilde{\mathbf{R}}_i(z) = \mathbf{G}_i^P(z)\,\tilde{\mathbf{R}}_{i-1}(z)\,\mathbf{G}_i(z) \tag{14}$$

is iterated until $\tilde{\mathbf{R}}_{N_I}(z)$ is approximately diagonal for some $N_I$.

FIGURE 4. Each PEVD iteration involves the following four steps. (a) The polynomial matrix is first searched for the maximum off-diagonal element across all lags. (b) The second delay step brings the largest element to the principal $z^0$-plane. (c) The third is the zeroing step, which transfers energy from the off-diagonal elements to the diagonal. (d) The final trimming step discards negligibly small coefficients in the outer matrix slices.

An elementary paraunitary transformation takes the form of the product of a unitary transformation and a polynomial delay matrix, $\mathrm{diag}\{1, \dots, 1, z^{n}, 1, \dots, 1\}$. Figure 4 gives the steps involved during every iteration of the SBR2. At each iteration, the algorithm searches for the off-diagonal element with the largest magnitude across all z-planes, as marked in red in Figure 4(a). If the magnitude exceeds
a predefined threshold, a delay polynomial matrix is applied to bring the element to the principal $z^0$-plane, as shown in Figure 4(b). A unitary matrix, designed to zero out two elements on the zero-lag plane, is applied to the entire polynomial matrix. Note that applying one elementary paraunitary transformation may make some previously small off-diagonal elements larger, but overall, the algorithm converges to a diagonal matrix. As observed in Figure 4, the delay step can increase the polynomial order and make it unnecessarily large. Therefore, a trimming procedure [21] is used to control the growth of the polynomial order by discarding negligibly small coefficients in the outer planes, e.g., $z^{-4}$ and $z^{4}$ in Figure 4. Furthermore, the similarity transformations in (12) affect a pair of dominant elements so that the search space can be halved due to the preservation of symmetry. The algorithm terminates when the magnitudes of all off-diagonal elements fall below the preset threshold or when a user-defined maximum number of iterations is reached. This has led to a family of time-domain algorithms based on the SBR2 [21] and SMD [46]. The computational complexity of these numerical algorithms is at least $O(M^3 T)$ due to matrix multiplication applied to every lag [47]. The additional complexity incurred over the EVD approach is essential for the temporal decoupling of broadband signals. Furthermore, some promising efforts using parallelizable hardware [48] and numerical tricks [49] have been proposed, and the decomposition can be computed in a fraction of a second. These algorithms are also guaranteed to produce polynomial paraunitary eigenvectors but tend to generate spectrally majorized eigenvalues, which may not be analytic. Two functions $f_1(z)$ and $f_2(z)$ are said to be spectrally majorized if, on the unit circle, one function’s magnitude is always greater than the other’s.

FIGURE 5. The results of using SMD to decompose the matrix in Figure 3: (a) eigenvalues on the unit circle and (b) Hermitian angles of the corresponding analytic eigenvectors, measured against a reference vector [23].
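The search, delay, and zeroing steps of one iteration can be sketched as follows. This is a simplified illustration in the spirit of SBR2, not the reference implementation from [21]: the names `shift`, `sbr2_step`, and `sbr2` are hypothetical, trimming is omitted, the zeroing rotation is taken from a 2 × 2 EVD at lag zero, and the test matrix is generated from a random (assumed) 3 × 3 FIR system so that it is para-Hermitian by construction.

```python
import numpy as np

def shift(a, s):
    """Return b with b[n] = a[n - s] along axis 0, zero-filled (no wrap-around)."""
    b = np.zeros_like(a)
    if s > 0:
        b[s:] = a[:-s]
    elif s < 0:
        b[:s] = a[-s:]
    else:
        b[:] = a
    return b

def max_off_diagonal(R):
    """Largest off-diagonal magnitude over all lags."""
    return np.max(np.abs(R) * (1.0 - np.eye(R.shape[1])))

def sbr2_step(R):
    """One search/delay/zeroing step; R has shape (2K+1, M, M), lag tau at index K + tau."""
    K, M = R.shape[0] // 2, R.shape[1]
    # (a) Search for the dominant off-diagonal element r_jk[tau0].
    off = np.abs(R) * (1.0 - np.eye(M))
    t, j, k = np.unravel_index(np.argmax(off), off.shape)
    tau0 = int(t) - K
    # (b) Delay: pad the lag axis, then advance column k and delay row k by tau0
    #     so that r_jk[tau0] lands on the zero-lag plane; r_kk is left unchanged.
    pad = abs(tau0)
    R = np.pad(R, ((pad, pad), (0, 0), (0, 0)))
    col, row, r_kk = shift(R[:, :, k], -tau0), shift(R[:, k, :], tau0), R[:, k, k].copy()
    R[:, :, k], R[:, k, :] = col, row
    R[:, k, k] = r_kk
    # (c) Zeroing: unitary rotation in the (j, k) plane from the 2x2 EVD at lag zero,
    #     applied to every lag; the larger eigenvalue is placed first.
    idx = np.ix_([j, k], [j, k])
    _, V = np.linalg.eigh(R[K + pad][idx])
    Q = np.eye(M, dtype=complex)
    Q[idx] = V[:, ::-1]
    return Q.conj().T @ R @ Q

def sbr2(R, threshold=1e-2, max_iter=100):
    """Iterate until the off-diagonal maximum is below threshold or max_iter is reached."""
    for _ in range(max_iter):
        if max_off_diagonal(R) <= threshold:
            break
        R = sbr2_step(R)
    return R

# Hypothetical test matrix: R[tau] = sum_n H[n] H^H[n - tau] for a random 3-tap system.
rng = np.random.default_rng(0)
H = rng.standard_normal((3, 3, 3)) + 1j * rng.standard_normal((3, 3, 3))
N = H.shape[0]
R0 = np.zeros((2 * N - 1, 3, 3), dtype=complex)
for tau in range(-(N - 1), N):
    for n in range(max(0, tau), min(N, N + tau)):
        R0[N - 1 + tau] += H[n] @ H[n - tau].conj().T

R1 = sbr2_step(R0)     # a single elementary paraunitary step
R_diag = sbr2(R0)      # repeated steps (no trimming, so the lag axis grows)
```

Each step preserves the total energy of the coefficients and the para-Hermitian symmetry, while moving the energy of the dominant off-diagonal pair onto the zero-lag diagonal; this monotonic growth of the zero-lag diagonal energy is the mechanism behind the convergence noted above. A practical implementation would add the trimming step of Figure 4(d) to keep the order bounded.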
Figure 5 presents the results of using the SMD algorithm to process the matrix used to generate Figure 3. In Figure 5, the eigenvalue in blue is always greater than the one in red. In contrast, the (analytic) eigenvalues in Figure 3 intersect and are not spectrally majorized. As described in Figure 5(b), forcing a spectrally majorized solution for the eigenvalues leads to the eigenvectors having discontinuities that are difficult to approximate with polynomials. To get an accurate result, high-order polynomials are required. This, in turn, has consequences for the implementation cost of any signal processing based on the output of these algorithms. Note, however, that spectral majorization can be advantageous in some situations; see the “PEVD-Based Subband Coding” section.
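The difference between analytic and spectrally majorized eigenvalues can be reproduced with a small numerical sketch. It assumes a toy 2 × 2 para-Hermitian matrix built from two smooth eigenvalue functions that cross on the unit circle, $3 + 2\cos\Omega$ and $3 + 2\sin\Omega$, with eigenvectors $[1, \pm e^{-j\Omega}]^T/\sqrt{2}$; these specific functions are illustrative assumptions. A bin-wise EVD with sorted eigenvalues stands in for what a majorizing algorithm converges toward.

```python
import numpy as np

omega = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
lam1 = 3.0 + 2.0 * np.cos(omega)   # assumed analytic eigenvalue 1 on the unit circle
lam2 = 3.0 + 2.0 * np.sin(omega)   # assumed analytic eigenvalue 2 on the unit circle

majorized = np.empty((len(omega), 2))
for i, w in enumerate(omega):
    # Assumed analytic eigenvectors, orthonormal at every frequency.
    q1 = np.array([1.0, np.exp(-1j * w)]) / np.sqrt(2)
    q2 = np.array([1.0, -np.exp(-1j * w)]) / np.sqrt(2)
    R_w = lam1[i] * np.outer(q1, q1.conj()) + lam2[i] * np.outer(q2, q2.conj())
    # Bin-wise EVD; eigvalsh returns ascending order, so reverse to get descending.
    majorized[i] = np.linalg.eigvalsh(R_w)[::-1]
```

The sorted curves equal the pointwise maximum and minimum of the two smooth functions: they never intersect, but they have kinks wherever the analytic eigenvalues cross, and the associated eigenvectors switch abruptly at those frequencies. This is why a spectrally majorized solution needs high polynomial orders to approximate.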

Unlike in the case of the SBR2 algorithm [21], there is no proof that the SMD algorithm will always produce spectrally majorized eigenvalues, although evidence from the use of this algorithm strongly supports this conjecture. Given the issues that spectral majorization produces in terms of exploiting Q (z) as a filter bank and for identifying subspaces, recent work has been directed at designing an algorithm that can produce a PEVD whose components are guaranteed to be analytic. One such approach [50] involves working in the frequency domain and taking steps to ensure that the spectral coherence is not lost. A number of algorithms have been designed for decompositions of fixed order and without proven convergence. This includes the approximate EVD (AEVD) algorithm [51], which applies a fixed number of elementary paraunitary operations in the time domain in an attempt to diagonalize R (z) . In the DFT domain, [52] aims to extract maximally smooth eigenvalues and eigenvectors, which can target the extraction of the analytic solution.

Key Statement Approximating analytic functions by polynomials allows the development of PEVD algorithms based on an elementary paraunitary operator. The resulting algorithms are guaranteed to produce polynomial paraunitary eigenvectors but tend to generate spectrally majorized eigenvalues. This property has benefits as well as drawbacks.

Spatial filtering and steering vector Spatial filtering uses the fact that wavefronts arriving from different sources have a different delay profile when arriving at the sensors. If there are L spatially separated sources, then for the ,th source, , = 1, f, L, and let this delay profile be {x ,, 1, f, x ,,M}, where x ,,m is the delay at the mth sensor with respect to some common reference point. We further define a vector of transfer functions a , (z) = [d x ,,1 (z), f, d x ,,M (z)] T containing fractional delay filters, where d x [n] %s d x (z) implements a delay by x ! R samples [18]. We refer to a , (z) as a broadband steering vector since, when evaluated at a fixed frequency X ,, the ,th source can be regarded as a narrowband signal with center frequency X ,, in which case this vector of functions reduces to the well-known steering vector a , (z); z =e jX = a , (e jX,) ! C M . The latter contains the phase shifts that each sensor experiences with respect to the ,th source. If at least two sensors satisfy the spatial sampling theorem, and for a particular frequency X , = X 0, this steering vector is unique with respect to the DOA of the ,th source. We want to process the array data x [n] ! C M by a vector of filters w [n] %s w (z), with w P(z) = [w 1 (z), f, w M (z)] and w m (z) :V w m [n] is the filter processing the mth sensor signal such that the array output y [n] is the sum of the filtering operations, y [n] = R v w H [- o] x [n - o] = R m, v w m [o] x m [n - o] . The definition of the filter vector w [n], with its time-reversed and conjugated weights, may seem cumbersome, but it follows similar conventions for complex-valued data [53] and will later simplify the z-transform notation. ,

Narrowband beamforming Example applications using PEVD This section highlights three application cases that demonstrate key examples where PEVD-based approaches can offer advantages over state-of-the-art processing. In the “PEVDBased Adaptive Beamforming” section, we demonstrate how, for adaptive beamforming, the computational complexity is decoupled from the TDL length that otherwise determines the cost of a broadband adaptive beamformer. The “PEVD-Based Subband Coding” section shows how, in subband coding, the PEVD can generate a system with optimized coding gain and helps to formulate optimum compaction filter banks that previously could be stated only for the two-channel case. Finally, the “Polynomial Subspace Speech Enhancement” section addresses how the preservation of spectral coherence can provide perceptually superior results over DFT-based speech enhancement algorithms.

PEVD-based adaptive beamforming To explore PEVD-based beamforming, we first recall some aspects of narrowband beamforming before defining a linearly constrained minimum variance (LCMV) beamformer using both TDL- and PEVD-based formulations. We work with an arbitrary geometry of M sensors but, for simplicity, assume free-space propagation and that the array is sufficiently far field to neglect any loss in amplitude across its sensors.

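In the narrowband case, the delay filters can be replaced by complex coefficients in a vector $\mathbf{w}^{\mathrm{H}} = [w_1, \ldots, w_M] \in \mathbb{C}^M$ that implement phase shifts. To generate a different gain $f_\ell$, $\ell = 1, \ldots, L$, with respect to each of the $L$ sources, the beamformer defined by $\mathbf{w}$ must satisfy the constraint equation

$$\underbrace{\begin{bmatrix} \mathbf{a}_1^{\mathrm{H}}(e^{j\Omega_1}) \\ \vdots \\ \mathbf{a}_L^{\mathrm{H}}(e^{j\Omega_L}) \end{bmatrix}}_{\mathbf{C}} \mathbf{w} = \underbrace{\begin{bmatrix} f_1 \\ \vdots \\ f_L \end{bmatrix}}_{\mathbf{f}}, \tag{15}$$

where $\mathbf{C} \in \mathbb{C}^{L \times M}$ and $\mathbf{f} \in \mathbb{C}^{L}$ are the constraint matrix and associated gain vector for the $L$ constraints. In the presence of spatially white noise, the minimum mean-square error (MMSE) solution is the quiescent beamformer $\mathbf{w}_q = \mathbf{C}^{\dagger} \mathbf{f}$ [53], where $\mathbf{C}^{\dagger}$ is the pseudo-inverse of $\mathbf{C}$. If the noise is spatially correlated, then the LCMV formulation

$$\min_{\mathbf{w}} \; E\{|y[n]|^2\} \quad \text{s.t.} \quad \mathbf{C}\mathbf{w} = \mathbf{f} \tag{16}$$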

provides the MMSE solution, now constrained by (15). Solutions to (16) include, for example, the Capon beamformer, $\mathbf{w}_{\mathrm{opt}} = (\mathbf{R}[0])^{-1} \mathbf{C}^{\mathrm{H}} [\mathbf{C} (\mathbf{R}[0])^{-1} \mathbf{C}^{\mathrm{H}}]^{-1} \mathbf{f}$, and the generalized sidelobe canceller (GSC). For the GSC, a “quiescent beamformer” $\mathbf{w}_q$ implements the constraints in $\mathbf{C}$, and a “blocking matrix” $\mathbf{B}$ is constructed such that $\mathbf{C}\mathbf{B} = \mathbf{0}$. When operating on the array data $\mathbf{x}[n]$, the blocking matrix output is free of any desired signal components protected by the constraints. All that remains now is to suppress any undesired signal components in the quiescent beamformer output that correlate with the blocking matrix output. This unconstrained optimization problem for the vector $\mathbf{w}_a$ in Figure 6(a) can be addressed by adaptive filtering algorithms via a noise cancellation architecture [53]. The overall response of the adapted GSC is $\mathbf{w} = \mathbf{w}_q - \mathbf{B}\mathbf{w}_a$.
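The relation between the Capon solution and the GSC decomposition can be checked numerically. The following is a minimal sketch under stated assumptions: the constraint matrix, gain vector, and covariance are random placeholders rather than quantities from the article, the blocking matrix is taken from the SVD null space of $\mathbf{C}$ (one possible construction), and the adaptive filter is replaced by its closed-form minimum-variance solution instead of an adaptive algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
M, L = 6, 2

# Hypothetical constraint matrix (rows play the role of a_l^H) and gain vector f.
C = rng.standard_normal((L, M)) + 1j * rng.standard_normal((L, M))
f = np.array([1.0, 0.0], dtype=complex)

# Hypothetical Hermitian positive-definite covariance standing in for R[0].
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = A @ A.conj().T + np.eye(M)

# Capon / LCMV closed form: w = R^{-1} C^H (C R^{-1} C^H)^{-1} f.
Ri_CH = np.linalg.solve(R, C.conj().T)
w_opt = Ri_CH @ np.linalg.solve(C @ Ri_CH, f)

# GSC: quiescent beamformer, blocking matrix with C B = 0, unconstrained part w_a.
w_q = np.linalg.pinv(C) @ f
_, _, Vh = np.linalg.svd(C)           # full SVD: the last M - L rows of Vh span null(C)
B = Vh[L:].conj().T                   # shape (M, M - L)
w_a = np.linalg.solve(B.conj().T @ R @ B, B.conj().T @ R @ w_q)
w_gsc = w_q - B @ w_a

print(np.allclose(C @ w_opt, f), np.allclose(w_gsc, w_opt))
```

Because both forms minimize the same quadratic cost over the same constraint set, the two weight vectors agree up to numerical precision.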

TDL-based GSC In the broadband case, each sensor is followed by a TDL of length $T$ to implement a finite-impulse response filter that can resolve explicit time delays [54]. This leads to the concatenated data vector $\boldsymbol{\chi}[n] = [\mathbf{x}^{\mathrm{H}}[n], \ldots, \mathbf{x}^{\mathrm{H}}[n - T + 1]]^{\mathrm{H}} \in \mathbb{C}^{MT}$ presented in the “TDL Processing” section. The weight vector $\mathbf{v} \in \mathbb{C}^{MT}$ now performs a linear combination across this spatiotemporal window such that the beamformer output becomes $y[n] = \mathbf{v}^{\mathrm{H}} \boldsymbol{\chi}[n]$. Analogous to (15), a constraint equation $\mathbf{C}_b \mathbf{v} = \mathbf{f}_b$ defines the frequency responses in a number of directions. The constraint formulation for a linear array with a look direction toward broadside is as straightforward as in the narrowband case [55]. For linear arrays with off-broadside constraints and for arbitrary arrays, the formulation of constraints becomes trickier and can be based on stacked narrowband constraints across a number of DFT bins, akin to the single-frequency formulation that leads to (15). Since it may not be clear how many such constraints should be stacked, robust approaches start with a large number, which is then trimmed to a reduced set of linearly independent constraints by using, e.g., a QR decomposition [56]. Overall, with respect to the narrowband case, the dimensions of the constraint matrix and constraining vector will increase approximately $T$-fold such that $\mathbf{C}_b \in \mathbb{C}^{TL \times TM}$ and $\mathbf{f}_b \in \mathbb{C}^{TL}$. For the broadband GSC [57], a quiescent beamformer $\mathbf{v}_q = \mathbf{C}_b^{\dagger} \mathbf{f}_b \in \mathbb{C}^{TM}$ will generate an output that still contains any structured interference that is not addressed by the constraint equation. A signal vector correlated with this remaining interference is produced by the blocking matrix $\mathbf{B}_b \in \mathbb{C}^{TM \times T(M - L)}$, whose columns, akin to the narrowband case, must span the null-space of $\mathbf{C}_b$ such that $\mathbf{C}_b \mathbf{B}_b = \mathbf{0}$. Its output is then linearly combined by an adaptive filter $\mathbf{v}_a \in \mathbb{C}^{T(M - L)}$ such that the overall beamformer output in Figure 6(b) is minimized in the MSE sense. Note that the TDL length determines the dimensions of all GSC components, with the overall adapted response of the beamformer, with respect to the input $\mathbf{x}[n]$ extended to the TDL representation in $\boldsymbol{\chi}[n]$, being $\mathbf{v} = \mathbf{v}_q - \mathbf{B}_b \mathbf{v}_a$.

FIGURE 6. The GSC for (a) the narrowband (black quantities in boxes) and PEVD-based cases (blue quantities in boxes) and (b) the TDL-based case.
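The TDL representation can be made concrete with a short sketch. It stacks the current and the $T - 1$ previous snapshots into one vector of length $MT$ and confirms that $\mathbf{v}^{\mathrm{H}}\boldsymbol{\chi}[n]$ equals the sum of per-tap inner products. The helper name `tdl_vector`, the dimensions, and the random data are assumptions for illustration.

```python
import numpy as np

def tdl_vector(x, n, T):
    """Stack x[n], x[n-1], ..., x[n-T+1] (each of length M) into a vector of length M*T."""
    return np.concatenate([x[:, n - t] for t in range(T)])

rng = np.random.default_rng(1)
M, T, N = 3, 4, 20
x = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))   # placeholder array data
v = rng.standard_normal(M * T) + 1j * rng.standard_normal(M * T)     # placeholder weights

n = 10                                # any time index with n >= T - 1
chi = tdl_vector(x, n, T)
y = np.vdot(v, chi)                   # v^H chi[n]

# The same output written as a sum over the T taps.
V = v.reshape(T, M)                   # row t holds the weights applied to x[n - t]
y_taps = sum(np.vdot(V[t], x[:, n - t]) for t in range(T))

print(chi.shape, np.isclose(y, y_taps))
```

The vector length grows with $T$, which is why every TDL-based GSC component above scales with the TDL length.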

PEVD-based GSC In the PEVD-based approach, we replace narrowband quantities in the narrowband formulation by their polynomial equivalents to address the broadband case. This includes substituting the Hermitian transpose $\{\cdot\}^{\mathrm{H}}$ by a para-Hermitian transposition $\{\cdot\}^{\mathrm{P}}$. Thus, the constraint equation becomes

$$\underbrace{\begin{bmatrix} \mathbf{a}_1^{\mathrm{P}}(z) \\ \vdots \\ \mathbf{a}_L^{\mathrm{P}}(z) \end{bmatrix}}_{\mathbf{C}(z)} \mathbf{w}(z) = \underbrace{\begin{bmatrix} f_1(z) \\ \vdots \\ f_L(z) \end{bmatrix}}_{\mathbf{f}(z)}. \tag{17}$$

The constraint matrix $\mathbf{C}(z)$ is therefore made up of broadband steering vectors, and the gain vector $\mathbf{f}(z)$ contains the transfer functions $f_\ell(z)$, $\ell = 1, \ldots, L$, that should be imposed on the $L$ sources at the beamformer output. Both quantities are of the same dimensions as in the narrowband case but are now functions of the complex variable $z$. Writing the beamformer output as $y[n] = \sum_{\nu} \mathbf{w}^{\mathrm{H}}[-\nu]\, \mathbf{x}[n - \nu]$ allows the broadband LCMV problem to be formulated as [25]

$$\min_{\mathbf{w}(z)} \oint_{|z| = 1} \mathbf{w}^{\mathrm{P}}(z)\, \mathbf{R}(z)\, \mathbf{w}(z)\, \frac{dz}{z} \quad \text{s.t.} \quad \mathbf{C}(z)\, \mathbf{w}(z) = \mathbf{f}(z), \tag{18}$$

where $\mathbf{R}(z)$ is the CSD matrix of $\mathbf{x}[n]$. The evaluation of (18) at a single frequency $\Omega_0$ leads back to the narrowband formulation via the substitution $z = e^{j\Omega_0}$. The solution to the broadband LCMV problem can be found as the equivalent of the Capon beamformer, $\mathbf{w}_{\mathrm{opt}}(z) = \mathbf{R}^{-1}(z)\, \mathbf{C}^{\mathrm{P}}(z) \{\mathbf{C}(z)\, \mathbf{R}^{-1}(z)\, \mathbf{C}^{\mathrm{P}}(z)\}^{-1} \mathbf{f}(z)$, which is a direct extension of the narrowband formulation. To access this solution, the inversion of the para-Hermitian matrices $\mathbf{R}(z)$ and, subsequently, $\mathbf{C}(z)\, \mathbf{R}^{-1}(z)\, \mathbf{C}^{\mathrm{P}}(z)$ can be accomplished via PEVDs [58]. Once factorized, the resulting paraunitary matrices are straightforward to invert, and it remains to invert the individual eigenvalues; for this, recall the comment on analytic functions as divisors in the “Mathematical Background” section. Alternatively, to avoid the nested matrix inversions of the Capon beamformer and to exploit iterative schemes for their general numerical robustness, an iterative unconstrained optimization can be performed via a broadband PEVD-based GSC, whereby, with respect to Figure 6(a), the quiescent beamformer is $\mathbf{w}_q(z) = \mathbf{C}^{\mathrm{P}}(z) \{\mathbf{C}(z)\, \mathbf{C}^{\mathrm{P}}(z)\}^{-1} \mathbf{f}(z)$. The pseudo-inverse of a polynomial matrix for the quiescent solution can again be obtained via a PEVD of the para-Hermitian term $\mathbf{C}(z)\, \mathbf{C}^{\mathrm{P}}(z)$ [58]. Furthermore, its subspace decomposition also reveals the null-space of $\mathbf{C}(z)$ that can be used to define the columns of the blocking matrix $\mathbf{B}(z)$ such that $\mathbf{C}(z)\, \mathbf{B}(z) = \mathbf{0}$. It remains only to operate a vector $\mathbf{w}_a(z)$ of $(M - L)$ adaptive filters on the output of the blocking matrix to complete the optimization of this PEVD-based GSC. Note that the overall response of the beamformer is $\mathbf{w}(z) = \mathbf{w}_q(z) - \mathbf{B}(z)\, \mathbf{w}_a(z)$.

Compared to the narrowband GSC in Figure 6(a), all quantities have retained their dimensions but are now functions of $z$. It now remains to set the polynomial orders of the different GSC components for an implementation. The quiescent vector $\mathbf{w}_q(z)$ depends on the constraint formulation, and its order $J_1$ determines the accuracy of the fractional delay filters. The order $J_2$ of the blocking matrix $\mathbf{B}(z)$ needs to be sufficiently high such that no source signal components covered by the constraint equation leak into the adaptive part $\mathbf{w}_a(z)$. The order $J_3$ of the latter has to be sufficient to minimize the power of the output, $y[n]$. Thus, unlike in the TDL-based broadband beamformer case, the orders of the components are somewhat decoupled. If the optimization of the adaptive part is addressed by inexpensive least-mean-squares type algorithms, the computational cost in both the TDL- and PEVD-based approaches is governed by the blocking matrix. In the TDL-based case, it requires $T^2 (M^2 - ML)$ multiplications and additions, while the PEVD-based blocking matrix expends only $J_2 (M^2 - ML)$ such operations. With typically $J_2 \approx T$, the PEVD-based realization is less expensive by a factor of approximately the length of the TDL, $T$.

Key Statement Using polynomial matrix notations and the PEVD, narrowband approaches, such as the Capon beamformer and the GSC, can be directly extended to the broadband case.

Numerical example A linear array with $M = 8$ elements spaced by half the wavelength of the highest-frequency component has a look direction toward $\vartheta = 30°$, which is protected by a constraint. Three “unknown” interferers with directions $\vartheta = \{-40°, -10°, 80°\}$ are active over a frequency range of $0.2 \le \Omega/\pi \le 0.9$ at a signal-to-interference-plus-noise ratio of −20 dB and need to be adaptively suppressed. The data are further corrupted by spatially and temporally white additive noise 50 dB below the signal levels of the interferers. A TDL-based GSC operates with a TDL length of $T = 175$. For a PEVD-based GSC, the adaptive filter uses the same temporal dimension $J_3 = T$, but to match the MSE performance of the TDL-based version, a length of $J_1 = 51$ for the fractional delay filters in the quiescent beamformer and a temporal dimension of $J_2 = 168$ for the blocking matrix suffice. The adaptive filter is adjusted by a normalized least-mean-squares algorithm [53]. Note that $J_2 < T$. Overall, per iteration, the PEVD-based GSC takes 12.3 kMACs, while the TDL-based GSC requires 3.46 MMACs, which is indeed more than a factor of $T$ higher.

To evaluate the beamformer performance, we determine its gain response, or directivity pattern, by probing the adapted overall response with a broadband steering vector $\mathbf{a}_i(z)$ swept across a set of angles $\{\varphi_i\}$, each with a corresponding delay profile. For the directivity pattern, the angle-dependent transfer function $G(z, \varphi_i) = \mathbf{w}^{\mathrm{P}}(z)\, \mathbf{a}_i(z)$ can be evaluated on the unit circle. For the PEVD-based GSC, this directivity pattern is displayed in Figure 7(a); the response (not displayed) for the TDL-based GSC is very similar. A difference can, however, be noted in the look direction, which, in the case of the TDL-based GSC, is protected by a number of point constraints along the frequency axis, as highlighted in Figure 7(b). The gain response satisfies these point constraints, but it shows significant deviations from the ideal flat response between the constrained frequencies. In contrast, the PEVD-based beamformer is based on a single broadband constraint equation, which shows a significantly lower deviation from the desired look direction gain. This is due to the formulation in the time domain, which preserves spectral coherence. There are downsides: the gain response will break down closer to $\Omega = \pi$ due to the imperfections that are inherent in fractional delay filters operating at close to half the sampling rate [18].

FIGURE 7. The (a) directivity pattern for the adapted PEVD-based GSC and (b) gain response in the look direction ($\varphi = 30°$) for the PEVD- and TDL-based GSC.

Key Statement The PEVD-based GSC can implement the constraint equation more easily and precisely than the TDL-based version and possesses a significantly lower complexity when addressing nontrivial constraints.
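The blocking-matrix operation counts quoted for the two realizations can be evaluated for the numerical example as a quick arithmetic check. This sketch assumes $L = 1$, reading the single protected look direction as one constraint, and it covers the blocking matrix only, so it is not expected to reproduce the overall per-iteration totals reported in the text.

```python
# Parameters from the numerical example; L = 1 is an assumption (one look-direction constraint).
M, L = 8, 1
T, J2 = 175, 168

tdl_ops = T**2 * (M**2 - M * L)     # TDL-based blocking matrix: T^2 (M^2 - ML)
pevd_ops = J2 * (M**2 - M * L)      # PEVD-based blocking matrix: J2 (M^2 - ML)
ratio = tdl_ops / pevd_ops          # equals T^2 / J2

print(tdl_ops, pevd_ops, round(ratio, 1))
```

Under these assumptions the ratio is $T^2 / J_2$, which exceeds $T$ whenever $J_2 < T$.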

PEVD-based subband coding

Data representation and task Although this article addresses techniques for array signals, in many circumstances, multichannel signal representations are derived from a single-channel signal by demultiplexing [20], [59], [60], [61]. Let $x[\nu]$ be such a single-channel signal. Demultiplexing by $M$ and an implicit decimation operation by the same factor, or serial-to-parallel conversion, is performed to obtain a data vector $\mathbf{x}[n] = [x[nM], x[nM - 1], \ldots, x[nM - M + 1]]^{\mathrm{T}}$. This demultiplexed vector $\mathbf{x}[n]$ possesses the same form as the data vectors considered in the “Signal Model” section. While the number and type of samples that are held in $\mathbf{x}[n]$ remain unaltered from those in $x[\nu]$, the representation in $\mathbf{x}[n]$ allows clever data reduction and coding schemes through filter bank-based processing, for which we ultimately exploit the PEVD.

Principal component filter banks and subband coding Generally, we want to process the data $\mathbf{x}[n]$ through a transformation such that $\mathbf{y}[n] = \sum_{\nu} \mathbf{Q}^{\mathrm{H}}[-\nu]\, \mathbf{x}[n - \nu]$. Specifically, we wish this transformation to be lossless, i.e., for $\mathbf{Q}(z) \;\bullet\!\!-\!\!\circ\; \mathbf{Q}[n]$ to be paraunitary, such that a perfect reconstruction via $\mathbf{x}[n] = \sum_{\nu} \mathbf{Q}[\nu]\, \mathbf{y}[n - \nu]$ is possible. With respect to the original single-channel signal, the transformation $\mathbf{Q}^{\mathrm{H}}$ represents the analysis filter bank, whereas the transformation $\mathbf{Q}$ implements the synthesis (reconstruction) filter bank [20]. The matrices $\mathbf{Q}^{\mathrm{P}}(z)$ and $\mathbf{Q}(z)$ are known as the analysis polyphase matrix and synthesis polyphase matrix, respectively, and the paraunitarity of $\mathbf{Q}(z)$ guarantees perfect reconstruction of the overall filter bank system when operating back-to-back. The polyphase matrix $\mathbf{Q}(z)$ can be designed to implement a series of low-pass, bandpass, and high-pass filters to split the signal $x[\nu]$ into signal components with different spectral content. However, the filter bank $\mathbf{Q}(z)$ can also be signal dependent. Chief among such systems are principal component, or optimum compaction, filter banks (PCFBs), which aim to assign as much power of $x[\nu]$ into as few successive components of $\mathbf{y}[n]$ as possible. The purpose of this is to discard some components of $\mathbf{y}[n]$, thus producing a lower-dimensional representation of the data. A closely related task is subband coding, where a quantization is performed on $\mathbf{y}[n]$ rather than on $x[\nu]$. Thus, a higher bit resolution is dedicated to those subbands of $\mathbf{y}[n]$ that possess higher power. By not increasing the overall number of bits with respect to $x[\nu]$, judicious distribution of the coding effort results in an increase in the coding gain measure: the ratio between the arithmetic and geometric means of the variances of the subband signals in $\mathbf{y}[n]$ [59]. A coding gain greater than one can be exploited as an increased signal-to-quantization-noise ratio under constant word length or in terms of a reduction in the number of bits required for quantization while retaining the same quality for the quantized signals.

Optimum coding gain and PEVD To maximize the coding gain under the constraint of the paraunitarity of $\mathbf{Q}(z)$, two necessary and sufficient conditions on $\mathbf{y}[n]$ have been identified [59]: 1) the subband signals in $\mathbf{y}[n]$ must be strongly decorrelated such that $\mathbf{R}_y(z)$ is diagonal, and 2) they must be spectrally majorized such that for the elements $S_m(z)$ along the diagonal, on the unit circle, we have $S_m(e^{j\Omega}) \ge S_{m+1}(e^{j\Omega}) \; \forall \Omega$ and $m = 1, \ldots, (M - 1)$. Due to Parseval’s theorem, this implies that the powers of the subband signals are also ordered in a descending fashion. While, under the constraint of paraunitarity, this does not change the arithmetic mean, it minimizes the geometric mean of the subband variances and thus maximizes the coding gain. Optimum subband coders $\mathbf{Q}(z)$ have been derived for the case of a zeroth-order filter bank, where they reduce to the Karhunen-Loève transform (KLT), and for the infinite-order filter bank case [59], [61]. Executing the PEVD (described in the “PEVD Algorithms” section) on $\mathbf{R}[\tau]$ leads to $\mathbf{R}_y[\tau] = \boldsymbol{\Lambda}[\tau]$ that directly satisfies the preceding conditions and thus provides a solution to a subband coder of finite order [27] whose theoretical evaluation otherwise eluded the research community except for the case of $M = 2$ [60]. As discussed in the “PEVD Algorithms” section, at each SBR2 algorithm iteration, the parameters of the elementary paraunitary operator are selected such that the most dominant cross-covariance term of the input space-time covariance matrix is zeroed. There are two problems with this “greedy” optimization approach: 1) cross-correlation energies spread among subbands of the weakest power can end up being ignored, which limits the extent to which spectral majorization is performed, and 2) there is a stronger tendency to annihilate cross-correlations due to noise in powerful subbands rather than true cross terms related to weak subbands, which causes a degradation in strong decorrelation performance. The coding gain variant of the SBR2, namely, SBR2C [27], alleviates these problems because it uses a cost function based on the coding gain measure, which is proportionately equally receptive to cross-correlations among any of the subbands.
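The demultiplexing step and the coding gain measure can be sketched for the zeroth-order case, where the optimum coder reduces to the KLT. The helper names `demultiplex` and `coding_gain`, the moving-average test signal, and the sample-covariance estimate are assumptions for this illustration; the sketch does not implement SBR2C, the SMD, or any PEVD.

```python
import numpy as np

def demultiplex(x, M):
    """Return the vectors [x[nM], x[nM-1], ..., x[nM-M+1]]^T as columns, for n = 1, 2, ..."""
    n_max = (len(x) - 1) // M
    return np.stack([x[M - m : n_max * M - m + 1 : M] for m in range(M)])

def coding_gain(variances):
    """Ratio of the arithmetic to the geometric mean of the subband variances."""
    v = np.asarray(variances, dtype=float)
    return v.mean() / np.exp(np.mean(np.log(v)))

rng = np.random.default_rng(2)
# Hypothetical correlated test signal: white noise through a short moving-average filter.
x = np.convolve(rng.standard_normal(4000), [1.0, 0.9, 0.5], mode="valid")

X = demultiplex(x, 4)                         # shape (4, number of vectors)
R0 = X @ X.conj().T / X.shape[1]              # sample estimate of the lag-zero covariance
g_before = coding_gain(np.real(np.diag(R0)))  # gain of the demultiplexed channels as they are
g_klt = coding_gain(np.linalg.eigvalsh(R0))   # gain after the KLT (ordinary EVD of R0)

print(g_before, g_klt)
```

The KLT leaves the arithmetic mean (the trace) unchanged and can only lower the geometric mean, so `g_klt` is never smaller than `g_before` for the same covariance estimate.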

Numerical example Consider a signal $x[\nu]$ described by a fourth-order autoregressive model [27], [35]; its PSD $S(e^{j\Omega})$ appears in Figure 8(a). Demultiplexing by $M = 4$ produces a pseudocirculant matrix $\mathbf{R}(z)$ whose analytic eigenvalues $S_m(\Omega) = S(e^{j(\Omega - 2\pi(m - 1))/M})$, $m = 1, \ldots, M$, are $8\pi$-periodic modulated versions of this PSD [20]. For $0 \le \Omega \le 2\pi$, they are depicted as gray-highlighted curves in Figure 8(b). Although the $8\pi$ periodicity of these functions means that $\mathbf{R}(z)$ has no analytic eigenvalues, an estimated CSD matrix $\hat{\mathbf{R}}(z)$, here based on $10^4$ samples of $x[\nu]$, does possess an analytic EVD due to the perturbation by the estimation error [24]. Applying the SMD [27] to $\hat{\mathbf{R}}(z)$ will generate a strongly decorrelated signal vector $\mathbf{y}[n]$ via a paraunitary operation $\mathbf{Q}(z)$. The eigenvalues $\hat{\lambda}_m(e^{j\Omega})$ extracted by the SMD algorithm are also in Figure 8(b). These closely match the folded PSD of $x[\nu]$ highlighted in gray but are spectrally majorized. Interpreting $\mathbf{Q}(z)$ as a polyphase analysis matrix, the associated four-channel filter bank is characterized in Figure 9. The theoretically optimum infinite-order PCFB [59] is also shown. Its filters are obtained by assigning every demultiplexed frequency component of $x[\nu]$ to one of four filters, in descending magnitude. This yields a binary mask in the Fourier domain, which would require the implementation of infinite sinc functions in the time domain. In contrast, the finite-order filters computed by the SMD algorithm, each derived from an eigenvector in $\mathbf{Q}(z)$ corresponding to the eigenvalues in Figure 8, very closely approximate the PCFB except where the input PSD is small and arguably unimportant.

FIGURE 8. The (a) PSD of input $x[\nu]$ and (b) eigenvalues extracted by the SMD algorithm for the subband coding problem; the $M = 4$-times-folded PSD of the input signal is shown in gray.

FIGURE 9. The magnitude responses $|H_m(e^{j\Omega})|$, $m = 1, \ldots, 4$, of the $M = 4$-channel filter bank equivalent to the polyphase analysis matrix $\mathbf{Q}(z)$, with the theoretical PCFB of infinite order shown in gray.

Ensemble results To demonstrate the wider benefit of the proposed subband coder design, a randomly generated ensemble of 100 moving average processes of order 14 [MA(14)] produces signals $x[\nu]$ that are demultiplexed by $M = 4$. For each ensemble probe, the space-time covariance matrix is estimated from $2^{11}$ samples of $x[\nu]$ as a basis for the subband coder design. To average the subband coding results across this ensemble, we normalize the coding gain obtained for each instance in the ensemble by the maximum coding gain of the infinite-order PCFB; the latter can be derived from each of the MA(14) processes [27]. This ensemble-averaged normalized coding gain versus the order of the polynomial matrix $\mathbf{Q}(z)$ is detailed in Figure 10. The figure shows results for the KLT, the AEVD algorithm in [51], and the SBR2C and SMD. The KLT is the optimum zeroth-order subband coder. The AEVD algorithm is a fixed-order technique that aims to generate a PEVD but without proven convergence. Note that, like the AEVD algorithm, the SMD algorithm for zeroth-order systems (i.e., length-one polynomials) reduces to an ordinary EVD that is equivalent to the KLT and optimum for narrowband source signals, as shown in the figure. Both the SBR2C and SMD converge toward the optimum performance of an infinite-order PCFB as the polynomial order of $\mathbf{Q}(z)$ increases. This is indeed what would be expected since the PEVD is effectively the broadband generalization of the KLT. Due to its specific targeting of the coding gain and the resulting enhanced spectral majorization, the SBR2C outperforms the SMD here and thus provides a highly useful trade-off between polynomial order and coding gain.

FIGURE 10. The averaged normalized coding gain in its dependence on the length (order plus one) of $\mathbf{Q}(z)$ for an ensemble of random MA(14) processes and the case of demultiplexing with $M = 4$.

Key Statement Hitherto, algorithms for $M > 2$-channel paraunitary filter banks for subband coding were suboptimal. PEVD-designed $M$-channel filter banks now closely approximate the ideal system.

Polynomial subspace speech enhancement Speech enhancement is important for applications involving human-to-human communications, such as hearing aids and telecommunications, and human-to-machine interactions, including robot audition, voice-controlled systems, and automatic speech recognition. These speech signals are often captured by multiple microphones, commonly found in many devices today, which provide opportunities for spatial processing. Moreover, speech signals captured by different microphones naturally exhibit temporal correlations, especially in reverberant acoustic environments. This section shows that it is advantageous to use PEVD algorithms to capture and process these spatiotemporal correlations, thus preserving spectral coherence. A more comprehensive treatment with listening examples and code is available in [28].

Multichannel reverberant signal model Consider a scenario where there is a single speaker $s[n]$, an array of microphones, and uncorrelated background noise $\mathbf{v}[n]$. The speech propagates from the source to each microphone $m$ through the channels with acoustic impulse responses (AIRs), $a_{\ell,m}[n]$, that are assumed to be time invariant. The AIR models the direct path propagation of the speech signal from the speaker to the microphone as well as reverberation due to multipath reflections from objects and walls in enclosed rooms. Background noise is then added to each microphone. The signal model in the “Signal Model” section, with $\ell = 1$, can describe this situation. Across $M$ microphones, the signal vector $\mathbf{x}[n] \in \mathbb{C}^M$ is used to compute the space-time covariance matrix $\mathbf{R}_x[\tau]$ in (6) and its z transform $\mathbf{R}_x(z)$ in (7). Exploiting the reverberation model in [62], the early reflections in the AIR represent closely spaced distinct echoes that perceptually reinforce the direct path component and may improve speech intelligibility in certain conditions. On the other hand, the late reflections in the AIR consist of randomly distributed small amplitude components, and the associated late reverberant signal components are commonly assumed to be mutually uncorrelated with the direct path and early signal components [28], [62]. Thus, (7) can be written as

$$\mathbf{R}_x(z) = \tilde{\mathbf{a}}(z)\, \tilde{\mathbf{a}}^{\mathrm{P}}(z)\, r_s(z) + \mathbf{R}_l(z) + \mathbf{R}_v(z) \tag{19}$$

$$\phantom{\mathbf{R}_x(z)} = \mathbf{R}_{\tilde{s}}(z) + \mathbf{R}_{\tilde{v}}(z), \tag{20}$$

where $r_s(z) \;\bullet\!\!-\!\!\circ\; r_s[\tau]$ and $r_s[\tau]$ is the autocorrelation sequence of the source. The space-time covariance matrix of the late reverberation, $\mathbf{R}_l(z)$, is modeled as a spatially diffuse field. This is combined with the space-time covariance of the ambient noise $\mathbf{R}_v(z)$ to form $\mathbf{R}_{\tilde{v}}(z)$. The channel polynomial vector is $\tilde{\mathbf{a}}(z) = [\tilde{a}_1(z), \ldots, \tilde{a}_M(z)]^{\mathrm{T}} \in \mathbb{C}^M$, where $\tilde{a}_m(z)$ is a polynomial obtained by taking the z transform of the direct path and early reflections in the AIR from the source to the $m$th microphone, i.e., $\tilde{a}_m(z) = \sum_{i=0}^{I} \tilde{a}_m[i]\, z^{-i}$, dropping $\ell$ for brevity.

PEVD-based speech enhancement The PEVD of (20) decomposes the polynomial matrix into

$$\mathbf{R}_x(z) = \begin{bmatrix} \mathbf{U}_{\tilde{s}}(z) & \mathbf{U}_{\tilde{v}}(z) \end{bmatrix} \begin{bmatrix} \boldsymbol{\Lambda}_{\tilde{s}}(z) & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Lambda}_{\tilde{v}}(z) \end{bmatrix} \begin{bmatrix} \mathbf{U}_{\tilde{s}}^{\mathrm{P}}(z) \\ \mathbf{U}_{\tilde{v}}^{\mathrm{P}}(z) \end{bmatrix}, \tag{21}$$

where $\{\cdot\}_{\tilde{s}}$ and $\{\cdot\}_{\tilde{v}}$ are associated with the signal-plus-noise (or, simply, signal) and noise-only (or, simply, noise) subspaces, respectively. Unlike some speech enhancement approaches, the proposed method does not use any noise or relative transfer function (RTF) estimation algorithms since the strong decorrelation property of the PEVD implicitly orthogonalizes the subspaces across all time lags in the range of $\tau$. Consequently, speech enhancement can be achieved by combining components in the signal subspace while nulling components residing in the noise subspace. The paraunitary $\mathbf{U}(z)$ is a lossless filter bank. This implies that $\mathbf{U}(z)$ can distribute spectral power only among channels and not change the total signal and noise power over all subspaces. The eigenvector filter bank is used to process the microphone signals, using

$$\mathbf{y}[n] = \sum_{\nu} \mathbf{U}^{\mathrm{H}}[-\nu]\, \mathbf{x}[n - \nu], \tag{22}$$

where $\mathbf{U}[n] \;\circ\!\!-\!\!\bullet\; \mathbf{U}(z) \in \mathbb{C}^{M \times M}$. Since the polynomial eigenvector matrix $\mathbf{U}(z)$ is constructed from a series of delay and unitary matrices, each vector $\mathbf{u}_m[n]$ has a filter-and-sum structure. For a single source, the signal subspace has a dimension of one. Therefore, the enhanced signal can be extracted from the first channel of the processed outputs $\mathbf{y}[n]$. The enhanced output $y_1[n]$, associated with the signal subspace, includes mainly speech components originally distributed over all microphones but now summed coherently. In contrast, the noise subspace is dominated by ambient noise and the late reverberation in the acoustic channels. The orthogonality between subspaces is a result of strong decorrelation, expressed as $\mathbf{R}_y(z) = \boldsymbol{\Lambda}(z)$, where $\mathbf{R}_y(z) \;\bullet\!\!-\!\!\circ\; \mathbf{R}_y[\tau]$ is computed from $\mathbf{R}_y[\tau] = E\{\mathbf{y}[n]\, \mathbf{y}^{\mathrm{H}}[n - \tau]\}$. In practice, assuming quasi stationarity, the speech signals are processed frame by frame such that $\mathbf{R}_x[\tau]$ in (6) can be recursively estimated. Additionally, the two-sided z transform of $\mathbf{R}_x[\tau]$ in (7) can be approximated by some truncation window $W$, which determines the extent of the supported temporal correlation of the speech signal. The time-domain PEVD algorithms, such as the SBR2 and SMD, are used to compute (21) because they preserve spectral coherence of the speech signals and do not introduce audible artifacts. The proposed algorithm can also cope with noise-only and reverberation-only scenarios, as explored in [28]. Experimental results are next presented to show these general principles applied for a specific case.
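The filter bank operation in (22) and its lossless nature can be illustrated with a small sketch. It assumes a causal polynomial matrix stored as an array `U[k]` of coefficient matrices for lags $k = 0, \ldots, J - 1$, zero-padding at the data edges, and a hand-built first-order paraunitary example (a unitary matrix followed by a one-sample delay in the second channel). None of this comes from a PEVD algorithm; the function names `analysis` and `synthesis` are hypothetical.

```python
import numpy as np

def analysis(U, x):
    """y[n] = sum_k U[k]^H x[n+k], i.e., (22) for a causal U with lags k = 0..J-1."""
    J, M, _ = U.shape
    N = x.shape[1]
    y = np.zeros((M, N), dtype=complex)
    for k in range(J):
        y[:, :N - k] += U[k].conj().T @ x[:, k:]
    return y

def synthesis(U, y):
    """x[n] = sum_k U[k] y[n-k], the matching reconstruction."""
    J, M, _ = U.shape
    N = y.shape[1]
    x = np.zeros((M, N), dtype=complex)
    for k in range(J):
        x[:, k:] += U[k] @ y[:, :N - k]
    return x

rng = np.random.default_rng(3)
# Hypothetical paraunitary matrix U(z) = Q diag(1, z^{-1}) with Q unitary.
Q, _ = np.linalg.qr(rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2)))
U = np.stack([Q @ np.diag([1.0, 0.0]), Q @ np.diag([0.0, 1.0])])

x = rng.standard_normal((2, 50)) + 1j * rng.standard_normal((2, 50))
y = analysis(U, x)
x_rec = synthesis(U, y)

# Perfect reconstruction holds away from the first sample, which is affected by zero-padding.
print(np.allclose(x_rec[:, 1:], x[:, 1:]))
```

In the enhancement setting described above, the first row of `y` would play the role of the enhanced output $y_1[n]$ once `U` holds the eigenvectors obtained from a PEVD.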

Experimental setup Anechoic speech signals, which were sampled at 16 kHz, were taken from the TIMIT corpus [63]. AIR measurements and babble noise recordings for the M = 3-channel “mobile” array were taken from the ACE corpus [64]. ACE lecture room 2 has a reverberation time T60 of 1.22 s. For each Monte Carlo simulation, 50 trials were conducted. In each trial, sentences from a randomly selected speaker were concatenated to have an 8- to 10-s duration. The anechoic speech signals were then convolved with the AIRs for each microphone before being corrupted by additive noise. The SNRs ranged from −10 to 20 dB.
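The data generation step, convolving anechoic speech with each microphone's AIR and then adding noise at a prescribed SNR, can be sketched as follows. This is an illustration under stated assumptions: the speech, AIRs, and noise are synthetic random placeholders rather than TIMIT or ACE data, the SNR is defined here as the ratio of reverberant-speech power to noise power averaged over all channels, and the function name `reverberant_mixture` is hypothetical.

```python
import numpy as np

def reverberant_mixture(s, airs, noise, snr_db):
    """Convolve s with each AIR, then add noise scaled to the requested SNR (in dB)."""
    clean = np.stack([np.convolve(s, h) for h in airs])   # shape (M, len(s) + len(h) - 1)
    noise = noise[:, :clean.shape[1]]
    scale = np.sqrt(np.mean(clean**2) / (np.mean(noise**2) * 10 ** (snr_db / 10)))
    return clean + scale * noise, clean, scale * noise

rng = np.random.default_rng(4)
fs, M, air_len = 16000, 3, 800
s = rng.standard_normal(fs)                               # placeholder for 1 s of speech
decay = np.exp(-np.arange(air_len) / 200.0)               # synthetic decaying envelope
airs = [decay * rng.standard_normal(air_len) for _ in range(M)]
noise = rng.standard_normal((M, fs + air_len - 1))        # placeholder for babble noise

x, clean, scaled_noise = reverberant_mixture(s, airs, noise, snr_db=5.0)
snr_check = 10 * np.log10(np.mean(clean**2) / np.mean(scaled_noise**2))
print(x.shape, round(snr_check, 6))
```

Other SNR conventions, for example one referenced to a single microphone, would change only the power terms inside `scale`.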

Comparative algorithms PEVD-based enhancement can be compared against other algorithms, such as the oracle multichannel Wiener filter (OMWF), weighted power minimum distortionless response (WPD), and two subspace approaches, multichannel subspace (MCSUB) [65] and colored subspace (COLSUB) [66], which use an EVD and a generalized EVD (GEVD), respectively. Furthermore, unlike the PEVD approach, noise estimation is required for the GEVD. The OMWF is based on the concatenation of a minimum variance distortionless response beamformer followed by a single-channel Wiener filter. The OMWF provides an ideal performance upper bound since it uses complete prior knowledge of the clean speech signal, based on [67], where the filter length is 80. Practical multichannel Wiener filters, which rely on the RTF and noise estimation algorithms, do not perform as well as the OMWF, and comparative results can be found in [28]. The WPD is an integrated method for noise reduction and dereverberation [68]. The ground truth DOA is provided to compute the steering vector for the WPD to avoid signal direction mismatch errors. The PEVD does not use any knowledge of the speech, DOA, and array geometry. Experiments presented here to illustrate comparative performance use PEVD parameters, chosen following [28], including $\delta = N_1/3 \times 10^{-2}$ denoting the threshold of the dominant off-diagonal column norm, where $N_1$ is the square of the trace norm of $\mathbf{R}_{xx}(0)$; trim factor $\mu = 10^{-3}$; and $L = 500$ iterations. In all experiments, the frame size $T$ and window $W$ are set to 1,600. With this parameter selection, correlations within 100 ms, which are assumed to include the direct path and early reflection components, are captured and used by the algorithm. The source corresponding to these experiments is available in [28].

Evaluation measures The frequency-weighted segmental SNR (FwSegSNR) can be used to evaluate the noise reduction and normalized signal-to-reverberant ratio (NSRR) and Bark spectral distortion (BSD) for dereverberation [28]. To measure speech intelligibility and to account for processing artifacts, short-time objective intelligibility (STOI) can be used. These measures are computed for the signals before and after enhancement by using the proposed and benchmark algorithms. The improvement Δ is reported. Positive Δ values show improvements in all measures except ΔBSD, for which a negative value indicates a reduction in spectral distortions.

Results and discussions An illustrative example based on clean speech $s[n]$ corrupted by 5-dB babble noise in the reverberant ACE lecture room 2 is presented in Figure 11. The spectrogram of the first microphone signal $x_1[n]$ shows temporal smearing due to reverberation and the addition of babble noise. Comparing the plots for $x_1[n]$ with the processed signals $y_1[n]$, the dotted cyan boxes in Figure 11 qualitatively show the attenuation and some suppression of the babble noise and reverberation for the PEVD and COLSUB. This is supported by Table 1, which shows that the PEVD significantly improves the STOI and NSRR while coming second in the FwSegSNR and BSD, after the COLSUB. Although the COLSUB makes the most significant improvement in the FwSegSNR, the solid white boxes highlight the speech structures in $s[n]$, which are lost after the processing of $x_1[n]$ to generate $y_1[n]$, as evident between 3 and 3.3 s and 4.2 and 4.7 s in Figure 11. This has resulted in artifacts in the listening examples and the lowest improvement in the STOI. The OMWF, which uses complete knowledge of the clean speech signal, is the second best in the STOI and slightly improves other metrics, similar to the WPD, which uses the ground truth steering vector. The MCSUB offers limited improvement. Listening examples also highlight that the PEVD does not introduce audible processing artifacts into the enhanced signal [28]. Results for the Monte Carlo simulation involving 50 speakers in lecture room 2 and corrupted by −10- to 20-dB babble noise are available in Figure 12. For SNR ≤ 10 dB, the COLSUB outperforms other algorithms in ΔFwSegSNR but gives the worst ΔSTOI. On the other hand, the OMWF, designed to minimize speech distortion by using knowledge of clean speech, performs the best in ΔSTOI but not in ΔFwSegSNR. This also reflects the fact that speech intelligibility may not necessarily be affected by noise levels, up to some limit.
Despite not being given any information on the target speech, the PEVD performs comparably to the OMWF and ranks first in TNSRR and second in TFwSegSNR and TSTOI. At a 20-dB SNR, algorithms targeting reverberation, such as the WPD, perform better than noise reduction approaches. Similar to generalized weighted prediction error in the reverberation-only case in [28], the WPD processes the reverberant signals aggressively by removing most early reflections but not the direct path and late reflections, as observed in the listening examples. Furthermore, the WPD uses the ground truth DOA to compute the ideal steering vector, leading to the best improvement in TBSD and TSTOI. Listening examples for the PEVD indicate that the direct path and early


[Figure 11, panels (a)-(d): spectrograms with time (s) on the horizontal axis, frequency (kHz) on the vertical axis, and a Power/Decade (dB) color scale.]

FIGURE 11. Spectrograms, with corresponding time-domain signals, for the processing of a noisy reverberant speech example in ACE lecture room 2 and 5-dB babble noise. Dotted cyan boxes highlight noise- and reverberation-suppressed regions as a result of processing. Solid white boxes highlight regions where speech structures are lost using the COLSUB but not PEVD processing. Listening examples are available in [28]. The (a) clean speech signal s[n], (b) noisy speech signal x_1[n], (c) COLSUB-enhanced signal y_1[n], and (d) PEVD-enhanced signal y_1[n].

Table 1. The enhancement of a single reverberant speech sample in lecture room 2 and 5-dB ACE babble noise.

Algorithm   FwSegSNR    STOI    NSRR        BSD
Noisy       −10.9 dB    0.664   −7.57 dB    0.69 dB
OMWF        −11.1 dB    0.747   −7.42 dB    0.6 dB
MCSUB       −11.7 dB    0.711   −11.5 dB    0.93 dB
COLSUB      −6.6 dB     0.678   −7.9 dB     0.35 dB
PEVD        −8.21 dB    0.75    −6.13 dB    0.4 dB
WPD         −8.9 dB     0.723   −6.27 dB    0.45 dB

reflections are retained in the enhanced signal in the first channel. The late reverberations, absent in the enhanced signal, are observed in the second and third channels because of orthogonality [28]. Even without additional information, the PEVD performs comparably to the WPD and ranks second in ΔNSRR and ΔSTOI. Despite not being given knowledge of the DOA, target speech, and array geometry, the PEVD consistently ranks first for ΔNSRR and second in ΔSTOI and ΔFwSegSNR over the range of scenarios. Comprehensive results with listening examples and code for the noise-only, reverberation-only, and noisier reverberant scenarios are available in [28].
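The single-sample rankings quoted from Table 1 can be rechecked with a few lines of Python. This is only a sketch: the numbers are transcribed from Table 1, while the helper names (`improvement`, `ranking`) and two conventions are assumptions made here, namely that an improvement is the processed value minus the noisy value, and that BSD is a distortion measure for which lower is better.

```python
# Values transcribed from Table 1 (single speech sample, 5-dB babble noise).
table = {
    "Noisy":  {"FwSegSNR": -10.9, "STOI": 0.664, "NSRR": -7.57, "BSD": 0.69},
    "OMWF":   {"FwSegSNR": -11.1, "STOI": 0.747, "NSRR": -7.42, "BSD": 0.60},
    "MCSUB":  {"FwSegSNR": -11.7, "STOI": 0.711, "NSRR": -11.5, "BSD": 0.93},
    "COLSUB": {"FwSegSNR": -6.6,  "STOI": 0.678, "NSRR": -7.9,  "BSD": 0.35},
    "PEVD":   {"FwSegSNR": -8.21, "STOI": 0.750, "NSRR": -6.13, "BSD": 0.40},
    "WPD":    {"FwSegSNR": -8.9,  "STOI": 0.723, "NSRR": -6.27, "BSD": 0.45},
}

# Assumption: BSD measures distortion (lower is better); the rest are higher-is-better.
HIGHER_IS_BETTER = {"FwSegSNR": True, "STOI": True, "NSRR": True, "BSD": False}


def improvement(algorithm, metric):
    """Change relative to the noisy signal, signed so that positive means better."""
    delta = table[algorithm][metric] - table["Noisy"][metric]
    return delta if HIGHER_IS_BETTER[metric] else -delta


def ranking(metric):
    """Algorithms ordered from largest to smallest improvement for one metric."""
    algorithms = [name for name in table if name != "Noisy"]
    return sorted(algorithms, key=lambda name: improvement(name, metric), reverse=True)


for metric in HIGHER_IS_BETTER:
    print(metric, ranking(metric))
```

Under these assumptions, the PEVD comes first for the STOI and NSRR and second, after the COLSUB, for the FwSegSNR and BSD, which matches the discussion of Table 1.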


[Figure 12, panels (a)-(c): ΔFwSegSNR (dB), ΔSTOI, and ΔNSRR (dB) plotted against SNR (dB) from −10 to 20 dB for the OMWF, MCSUB, COLSUB, PEVD, and WPD.]

FIGURE 12. A comparison of speech enhancement performance for recorded AIR and babble noise in ACE lecture room 2, with a 1.22-s reverberation time: the (a) ΔFwSegSNR (higher is better), (b) ΔSTOI (higher is better), and (c) ΔNSRR (higher is better).

Key Statement

PEVD-based speech enhancement consistently improves noise reduction metrics, speech intelligibility scores, and dereverberation measures over a wide range of acoustic scenarios. This blind and unsupervised algorithm requires no knowledge of the array geometry and does not use any channel or noise estimation algorithms, yet it performs comparably to an oracle algorithm. More notably, due to the preservation of the spectral coherence using time-domain PEVD algorithms, the proposed algorithm does not introduce noticeable processing artifacts into the enhanced signal. Code and listening examples are provided in [28].

Conclusions and future perspectives

This article has demonstrated the use of polynomial matrices to model broadband multichannel signals and the use of the PEVD to process them. Previous approaches using TDLs and STFTs do not lead to proper generalization of narrowband algorithms and are suboptimal. Instead of considering only the instantaneous covariance matrix, the space-time covariance matrix has been proposed to completely capture the second-order statistics of multichannel broadband signals. Motivated by the optimum processing of narrowband signals using the EVD, i.e., for a single lag, the PEVD has been proposed to process broadband signals across a range of time lags. In most cases, an analytic PEVD exists and can be approximated by polynomials using numerical algorithms, which tend to generate spectrally majorized eigenvalues and paraunitary eigenvectors. PEVD-based processing for three example applications has been presented and is advantageous over state-of-the-art processing. The PEVD approach can implement the constraints more easily and precisely for adaptive broadband beamforming while achieving a lower complexity than the TDL-based approach. For multichannel subband coding, the PEVD design approximates the ideal optimal data encoding system and overcomes the previous issues with the more-than-two-channel case. The PEVD-based algorithm, which uses only microphone signals, can consistently enhance speech signals, without introducing any audible artifacts, and performs comparably to an oracle algorithm, as observed in the listening examples. In addition to the applications presented in this article, the PEVD is also successfully used for blind source separation [30], MIMO system design [26], source identification [31], and broadband DOA estimation [32].
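The difference between the instantaneous covariance and the space-time covariance can be illustrated with a short NumPy sketch. It assumes the usual definition R[τ] = E{x[n] x^H[n − τ]} and a simple biased sample estimate; the function name `space_time_covariance` and the two-channel toy signal are hypothetical and are not taken from the toolboxes in [28] or [69].

```python
import numpy as np


def space_time_covariance(x, max_lag):
    """Estimate R[tau] = E{x[n] x^H[n - tau]} for tau = -max_lag, ..., max_lag.

    x has shape (num_channels, num_samples). Returns (lags, R), where R[i] is
    the num_channels x num_channels estimate at lag lags[i].
    """
    num_channels, num_samples = x.shape
    lags = np.arange(-max_lag, max_lag + 1)
    R = np.zeros((lags.size, num_channels, num_channels), dtype=complex)
    for i, tau in enumerate(lags):
        if tau >= 0:
            current, delayed = x[:, tau:], x[:, :num_samples - tau]
        else:
            current, delayed = x[:, :num_samples + tau], x[:, -tau:]
        # Biased estimate: always normalize by the total number of samples.
        R[i] = current @ delayed.conj().T / num_samples
    return lags, R


# Toy data (hypothetical): channel 1 is channel 0 delayed by three samples.
rng = np.random.default_rng(0)
s = rng.standard_normal(20_000)
delay = 3
x = np.vstack([s[delay:], s[:-delay]])

lags, R = space_time_covariance(x, max_lag=5)
zero_lag = R[lags == 0][0]          # instantaneous covariance matrix
cross = np.abs(R[:, 1, 0])          # cross-correlation between the channels

print("|R[0]| off-diagonal:", abs(zero_lag[1, 0]))
print("lag of strongest cross-correlation:", lags[np.argmax(cross)])
print("para-Hermitian:", np.allclose(R[::-1], R.conj().transpose(0, 2, 1)))
```

In this toy case, the lag-zero matrix is close to diagonal even though the two channels carry the same source, whereas the coupling shows up at a lag equal to the delay. The last line checks the para-Hermitian symmetry R[−τ] = R^H[τ] of the estimate.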

Future work

Similar extensions from the EVD to an analytic EVD or PEVD can be undertaken for other linear algebraic operations, e.g., the non-para-Hermitian EVD, singular value decomposition (SVD), QR decomposition, and generalized SVD. Algorithms for the SVD and QR decomposition have appeared but lack a theoretical foundation with respect to their existence. Powerful narrowband techniques, such as independent component analysis, may find their polynomial equivalents. While a number of low-cost implementations have already emerged, algorithmic scalability is an area of active investigation. We hope that these theoretical and algorithmic developments will motivate the signal processing community to experiment with polynomial techniques and take these beyond the successful application areas showcased in this article. Resources including code and demonstration pages are available in [28] and [69].

Acknowledgment

The work of Stephan Weiss was supported by the U.K. Engineering and Physical Sciences Research Council (EPSRC), under grant EP/S000631/1, and the MoD University Defense Research Collaboration in Signal Processing. The work of Patrick A. Naylor was funded through EPSRC grant EP/S035842/1 and the European Union’s Horizon 2020 research and innovation program, under Marie Skłodowska-Curie grant 956369.

Authors

Vincent W. Neo received his Ph.D. degree in electrical and electronic engineering in 2022 from Imperial College London. He is currently a principal engineer in the Singapore Defence Science and Technology Agency, working on speech technology, and a visiting postdoctoral researcher with the Department of Electrical and Electronic Engineering, Imperial College London, SW7 2AZ London, U.K. His research interests include multichannel signal processing and polynomial matrix decomposition, with applications to speech, audio, and acoustics. He is a Member of IEEE.

Soydan Redif received his Ph.D. degree in electronics and electrical engineering from the University of Southampton, U.K. He is currently an associate professor at the College of Engineering and Technology, American University of the Middle East, Dasman 15453, Kuwait. His research interests include adaptive and array signal processing applied to source separation, communications, power, biomedical, and wearable systems. He is a Senior Member of IEEE.

John G. McWhirter received his Ph.D. degree in theoretical physics from Queen’s University of Belfast, U.K., in 1973. He is currently an emeritus professor at the University of Cardiff, CF24 3AA Cardiff, U.K. His research interests include independent component analysis for blind signal separation and polynomial matrix algorithms for broadband sensor array signal processing. He was elected as a fellow of the Royal Academy of Engineering in 1996 and the Royal Society in 1999. His work has attracted various awards, including the European Association for Signal Processing Group Technical Achievement Award in 2003.

Jennifer Pestana received her D.Phil. degree in numerical analysis from the University of Oxford, U.K., in 2012. She is currently a lecturer in the Department of Mathematics and Statistics, University of Strathclyde, G1 1XH Glasgow, U.K. Her research interests include numerical linear algebra and matrix analysis and their application to problems in science and engineering.

Ian K. Proudler received his Ph.D. degree in digital signal processing from the University of Cambridge, U.K., in 1984. He is currently a visiting professor at the University of Strathclyde, G1 1XW Glasgow, U.K. He was an honorary editor for IEE Proceedings: Radar, Sonar, and Navigation and the 2002 recipient of the Institution of Electrical Engineers J.J. Thomson Medal. His research interests include adaptive filtering, adaptive beamforming, multichannel signal processing, and blind signal separation.

Stephan Weiss received his Ph.D. degree in electronic and electrical engineering from the University of Strathclyde, G1 1XW Glasgow, U.K., in 1998, where he is currently a professor of signal processing. His research interests include adaptive, multirate, and array signal processing, with applications in acoustics, communications, audio, and biomedical signal processing. He is a Senior Member of IEEE.

Patrick A. Naylor received his Ph.D. degree from Imperial College London, SW7 2AZ London, U.K., where he is currently a professor of speech and acoustic signal processing. He is a member of the Board of Governors of the IEEE Signal Processing Society and a past president of the European Association for Signal Processing. His research interests include microphone array signal processing, speaker diarization, and multichannel speech enhancement for applications, including binaural hearing aids and augmented reality. He is a Fellow of IEEE.

References

[1] G. Strang, Linear Algebra and Its Application, 2nd ed. New York, NY, USA: Academic, 1980.
[2] G. H. Golub and C. F. van Loan, Matrix Computations, 3rd ed. Baltimore, MD, USA: The Johns Hopkins Univ. Press, 1996.
[3] S. Haykin and K. J. R. Liu, Eds., Handbook on Array Processing and Sensor Networks. Hoboken, NJ, USA: Wiley, 2010.
[4] N. S. Jayant and P. Noll, Digital Coding of Waveforms Principles and Applications to Speech and Video. Englewood Cliffs, NJ, USA: Prentice-Hall, 1984.
[5] R. O. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Trans. Antennas Propag., vol. 34, no. 3, pp. 276–280, Mar. 1986, doi: 10.1109/TAP.1986.1143830.
[6] M. Vetterli and J. Kovačević, Wavelets and Subband Coding. Upper Saddle River, NJ, USA: Prentice-Hall, 1995.
[7] M. Moonen and B. De Moor, SVD and Signal Processing, III: Algorithms, Architectures and Applications. New York, NY, USA: Elsevier, 1995.
[8] H. L. Van Trees, Optimal Array Processing. Part IV of Detection, Estimation, and Modulation Theory. New York, NY, USA: Wiley, 2002.
[9] P. Comon and C. Jutten, Handbook of Blind Source Separation: Independent Component Analysis and Applications, 1st ed. New York, NY, USA: Academic, 2010.
[10] T. K. Moon and W. C. Stirling, Mathematical Methods and Algorithms for Signal Processing, 1st ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 2000.
[11] R. Klemm, Space-Time Adaptive Processing: Principles and Applications. London, U.K.: Inst. Elect. Eng., 1998.
[12] A. Rao and R. Kumaresan, “On decomposing speech into modulated components,” IEEE Trans. Speech Audio Process., vol. 8, no. 3, pp. 240–254, May 2000, doi: 10.1109/89.841207.
[13] S. Weiss and I. K. Proudler, “Comparing efficient broadband beamforming architectures and their performance trade-offs,” in Proc. IEEE Int. Conf. Digit. Signal Process. (DSP), Jul. 2002, pp. 417–424, doi: 10.1109/ICDSP.2002.1027910.
[14] W. Kellermann and H. Buchner, “Wideband algorithms versus narrowband algorithms for adaptive filtering in the DFT domain,” in Proc. Asilomar Conf. Signals, Syst. Comput., 2003, pp. 1–5, doi: 10.1109/ACSSC.2003.1292194.
[15] Y. Avargel and I. Cohen, “System identification in the short-time Fourier transform domain with crossband filtering,” IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 4, pp. 1305–1319, May 2007, doi: 10.1109/TASL.2006.889720.
[16] B. Widrow, P. Mantey, L. Griffiths, and B. Goode, “Adaptive antenna systems,” Proc. IEEE, vol. 55, no. 12, pp. 2143–2159, Dec. 1967, doi: 10.1109/PROC.1967.6092.
[17] R. T. Compton Jr., “The relationship between tapped delay-line and FFT processing in adaptive arrays,” IEEE Trans. Antennas Propag., vol. 36, no. 1, pp. 15–26, Jan. 1988, doi: 10.1109/8.1070.
[18] T. I. Laakso, V. Valimaki, M. Karjalainen, and U. K. Laine, “Splitting the unit delay [FIR/All Pass Filters Design],” IEEE Signal Process. Mag., vol. 13, no. 1, pp. 30–60, Jan. 1996, doi: 10.1109/79.482137.
[19] T. Kailath, Linear Systems, 1st ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 1980.


[20] P. P. Vaidyanathan, Multirate Systems and Filterbanks, 1st ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 1993.
[21] J. G. McWhirter, P. D. Baxter, T. Cooper, S. Redif, and J. Foster, “An EVD algorithm for para-Hermitian polynomial matrices,” IEEE Trans. Signal Process., vol. 55, no. 5, pp. 2158–2169, May 2007, doi: 10.1109/TSP.2007.893222.
[22] S. Icart and P. Comon, “Some properties of Laurent polynomial matrices,” in Proc. IMA Int. Conf. Math. Signal Process., Dec. 2012, pp. 1–4.
[23] S. Weiss, J. Pestana, and I. K. Proudler, “On the existence and uniqueness of the eigenvalue decomposition of a para-Hermitian matrix,” IEEE Trans. Signal Process., vol. 66, no. 10, pp. 2659–2672, May 2018, doi: 10.1109/TSP.2018.2812747.
[24] S. Weiss, J. Pestana, I. K. Proudler, and F. K. Coutts, “Corrections to ‘On the existence and uniqueness of the eigenvalue decomposition of a para-Hermitian matrix’,” IEEE Trans. Signal Process., vol. 66, no. 23, pp. 6325–6327, Dec. 2018, doi: 10.1109/TSP.2018.2877142.
[25] S. Weiss, S. Bendoukha, A. Alzin, F. K. Coutts, I. K. Proudler, and J. Chambers, “MVDR broadband beamforming using polynomial matrix techniques,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), 2015, pp. 839–843, doi: 10.1109/EUSIPCO.2015.7362501.
[26] R. Brandt and M. Bengtsson, “Wideband MIMO channel diagonalization in the time domain,” in Proc. Int. Symp. Pers., Indoor Mobile Radio Commun., 2011, pp. 1958–1962, doi: 10.1109/PIMRC.2011.6139853.
[27] S. Redif, J. G. McWhirter, and S. Weiss, “Design of FIR paraunitary filter banks for subband coding using a polynomial eigenvalue decomposition,” IEEE Trans. Signal Process., vol. 59, no. 11, pp. 5253–5264, Nov. 2011, doi: 10.1109/TSP.2011.2163065.
[28] V. W. Neo, C. Evers, and P. A. Naylor, “Enhancement of noisy reverberant speech using polynomial matrix eigenvalue decomposition,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 29, pp. 3255–3266, Oct. 2021, doi: 10.1109/TASLP.2021.3120630.
[29] J. Corr, J. Pestana, S. Weiss, I. K. Proudler, S. Redif, and M. Moonen, “Investigation of a polynomial matrix generalised EVD for multi-channel Wiener filtering,” in Proc. Asilomar Conf. Signals, Syst. Comput., 2016, pp. 1354–1358, doi: 10.1109/ACSSC.2016.7869596.
[30] S. Redif, S. Weiss, and J. G. McWhirter, “Relevance of polynomial matrix decompositions to broadband blind signal separation,” Signal Process., vol. 134, pp. 76–86, May 2017, doi: 10.1016/j.sigpro.2016.11.019.
[31] S. Weiss, N. J. Goddard, S. Somasundaram, I. K. Proudler, and P. A. Naylor, “Identification of broadband source-array responses from sensor second order statistics,” in Proc. Sens. Signal Process. Defence Conf. (SSPD), 2017, pp. 1–5, doi: 10.1109/SSPD.2017.8233237.
[32] W. Coventry, C. Clemente, and J. Soraghan, “Enhancing polynomial MUSIC algorithm for coherent broadband sources through spatial smoothing,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), 2017, pp. 2448–2452.
[33] A. Oppenheim and R. W. Schafer, Digital Signal Processing, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 1993.
[34] B. Girod, R. Rabenstein, and A. Stenger, Signals and Systems. New York, NY, USA: Wiley, 2001.
[35] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 3rd ed. New York, NY, USA: McGraw-Hill, 1991.
[36] I. Gohberg, P. Lancaster, and L. Rodman, Matrix Polynomials, 2nd ed. Philadelphia, PA, USA: SIAM, 2009.
[37] V. Kučera, Analysis and Design of Discrete Linear Control Systems. Englewood Cliffs, NJ, USA: Prentice-Hall, 1991.
[38] R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing. Englewood Cliffs, NJ, USA: Prentice-Hall, 1983.
[39] M. Davis, “Factoring the spectral matrix,” IEEE Trans. Autom. Control, vol. 8, no. 4, pp. 296–305, Oct. 1963, doi: 10.1109/TAC.1963.1105614.
[40] “The polynomial toolbox.” Polyx. Accessed: Apr. 30, 2023. [Online]. Available: http://www.polyx.com
[41] J. J. Shynk, “Frequency-domain and multirate adaptive filtering,” IEEE Signal Process. Mag., vol. 9, no. 1, pp. 14–37, Jan. 1992, doi: 10.1109/79.109205.
[42] A. Gilloire and M. Vetterli, “Adaptive filtering in subbands with critical sampling: Analysis, experiments, and application to acoustic echo cancellation,” IEEE Trans. Signal Process., vol. 40, no. 8, pp. 1862–1875, Aug. 1992, doi: 10.1109/78.149989.
[43] F. Rellich, Perturbation Theory of Eigenvalue Problems. New York, NY, USA: Gordon & Breach, 1969.
[44] T. Kato, Perturbation Theory for Linear Operators. Singapore: Springer, 1980.
[45] G. Barbarino and V. Noferini, “On the Rellich eigendecomposition of para-Hermitian matrices and the sign characteristics of *-palindromic matrix polynomials,” Linear Algebra Appl., vol. 672, pp. 1–27, Sep. 2023, doi: 10.1016/j.laa.2023.04.022.
[46] S. Redif, S. Weiss, and J. G. McWhirter, “Sequential matrix diagonalisation algorithms for polynomial EVD of para-Hermitian matrices,” IEEE Trans. Signal Process., vol. 63, no. 1, pp. 81–89, Jan. 2015, doi: 10.1109/TSP.2014.2367460.

[47] F. K. Coutts, J. Corr, K. Thompson, S. Weiss, I. K. Proudler, and J. G. McWhirter, “Memory and complexity reduction in parahermitian matrix manipulations of PEVD algorithms,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), 2016, pp. 1633–1637, doi: 10.1109/EUSIPCO.2016.7760525.
[48] S. Kasap and S. Redif, “Novel field-programmable gate array architecture for computing the eigenvalue decomposition of para-Hermitian polynomial matrices,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 3, pp. 522–536, Mar. 2014, doi: 10.1109/TVLSI.2013.2248069.
[49] F. K. Coutts, I. K. Proudler, and S. Weiss, “Efficient implementation of iterative polynomial matrix EVD algorithms exploiting structural redundancy and parallelisation,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 12, pp. 4753–4766, Dec. 2019, doi: 10.1109/TCSI.2019.2937006.
[50] S. Weiss, I. K. Proudler, and F. K. Coutts, “Eigenvalue decomposition of a parahermitian matrix: Extraction of analytic eigenvalues,” IEEE Trans. Signal Process., vol. 69, pp. 722–737, Jan. 2021, doi: 10.1109/TSP.2021.3049962.
[51] A. Tkacenko, “Approximate eigenvalue decomposition of para-Hermitian systems through successive FIR paraunitary transformations,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2010, pp. 4074–4077, doi: 10.1109/ICASSP.2010.5495751.
[52] M. Tohidian, H. Amindavar, and A. M. Reza, “A DFT-based approximate eigenvalue and singular value decomposition of polynomial matrices,” EURASIP J. Appl. Signal Process., vol. 1, no. 93, pp. 1–16, Dec. 2013, doi: 10.1186/1687-6180-2013-93.
[53] S. Haykin, Adaptive Filter Theory, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 1991.
[54] K. M. Buckley, “Spatial/spectral filtering with linear constrained minimum variance beamformers,” IEEE Trans. Acoust., Speech, Signal Process., vol. 35, no. 3, pp. 249–266, Mar. 1987, doi: 10.1109/TASSP.1987.1165142.
[55] W. Liu and S. Weiss, Wideband Beamforming: Concepts and Techniques. Hoboken, NJ, USA: Wiley, 2010.
[56] R. G. Lorenz and S. P. Boyd, “Robust minimum variance beamforming,” IEEE Trans. Signal Process., vol. 53, no. 5, pp. 1684–1696, May 2005, doi: 10.1109/TSP.2005.845436.
[57] K. M. Buckley, “Broad-band beamforming and the generalized sidelobe canceller,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-34, pp. 1322–1323, Oct. 1986, doi: 10.1109/TASSP.1986.1164927.
[58] S. Weiss, A. P. Millar, and R. W. Stewart, “Inversion of parahermitian matrices,” in Proc. Eur. Signal Process. Conf. (EUSIPCO), Aug. 2010, pp. 447–451.
[59] P. P. Vaidyanathan, “Theory of optimal orthonormal subband coders,” IEEE Trans. Signal Process., vol. 46, no. 6, pp. 1528–1543, Jun. 1998, doi: 10.1109/78.678466.
[60] B. Xuan and R. Bamberger, “FIR principal component filter banks,” IEEE Trans. Signal Process., vol. 46, no. 4, pp. 930–940, Apr. 1998, doi: 10.1109/78.668547.
[61] A. Kirac and P. P. Vaidyanathan, “Theory and design of optimum FIR compaction filters,” IEEE Trans. Signal Process., vol. 46, no. 4, pp. 903–919, Apr. 1998, doi: 10.1109/78.668545.
[62] P. A. Naylor and N. D. Gaubitch, Eds., Speech Dereverberation. London, U.K.: Springer-Verlag, 2010.
[63] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, “TIMIT acoustic-phonetic continuous speech corpus,” Linguistic Data Consortium, Philadelphia, PA, USA, Corpus LDC93S1, 1993.
[64] J. Eaton, N. D. Gaubitch, A. H. Moore, and P. A. Naylor, “Estimation of room acoustic parameters: The ACE challenge,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 10, pp. 1681–1693, Oct. 2016, doi: 10.1109/TASLP.2016.2577502.
[65] F. Jabloun and B. Champagne, “A multi-microphone signal subspace approach for speech enhancement,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2001, pp. 205–208, doi: 10.1109/ICASSP.2001.940803.
[66] Y. Hu and P. C. Loizou, “A subspace approach for enhancing speech corrupted by colored noise,” IEEE Signal Process. Lett., vol. 9, no. 7, pp. 204–206, Jul. 2002, doi: 10.1109/LSP.2002.801721.
[67] S. Doclo and M. Moonen, “GSVD-based optimal filtering for single and multimicrophone speech enhancement,” IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2230–2244, Sep. 2002, doi: 10.1109/TSP.2002.801937.
[68] T. Nakatani and K. Kinoshita, “A unified convolutional beamformer for simultaneous denoising and dereverberation,” IEEE Signal Process. Lett., vol. 26, no. 6, pp. 903–907, Jun. 2019, doi: 10.1109/LSP.2019.2911179.
[69] S. Weiss, J. Corr, K. Thompson, J. G. McWhirter, and I. K. Proudler. “Polynomial EVD toolbox.” Polynomial Eigenvalue Decomposition. Accessed: Apr. 30, 2023. [Online]. Available: http://pevd-toolbox.eee.strath.ac.uk


Luis Albert Zavala-Mondragón, Peter H.N. de With, and Fons van der Sommen

A Signal Processing Interpretation of Noise-Reduction Convolutional Neural Networks


Exploring the mathematical formulation of encoding-decoding CNNs.

Encoding-decoding convolutional neural networks (CNNs) play a central role in data-driven noise reduction and can be found within numerous deep learning algorithms. However, the development of these CNN architectures is often done in an ad hoc fashion, and theoretical underpinnings for important design choices are generally lacking. Several existing works have striven to explain the internal operation of these CNNs. Still, these ideas are either scattered or require significant expertise to be accessible to a broader audience. To open up this exciting field, this article builds intuition on the theory of deep convolutional framelets (TDCFs) and explains diverse encoding-decoding (ED) CNN architectures in a unified theoretical framework. By connecting basic principles from signal processing to the field of deep learning, this self-contained material offers significant guidance for designing robust and efficient novel CNN architectures.

Digital Object Identifier 10.1109/MSP.2023.3300100
Date of current version: 3 November 2023


1053-5888/23©2023IEEE

Introduction

A well-known image processing application is noise/artifact reduction of images, which consists of estimating a noise/artifact-free signal from a noisy observation. To achieve this, conventional signal processing algorithms often employ explicit assumptions on the signal and noise characteristics, which has resulted in well-known algorithms such as wavelet shrinkage [1], sparse dictionaries [2], total-variation minimization [3], and low-rank approximation [4]. With the advent of deep learning techniques, signal processing algorithms applied to image denoising have been regularly outperformed and increasingly replaced by encoding-decoding CNNs.

In this article, rather than conventional signal processing algorithms, we focus on so-called encoding-decoding CNNs. These models contain an encoder, which maps the input to multichannel/redundant representations, and a decoder, which maps the encoded signal back to the original domain. In both the encoder and decoder, sparsifying nonlinearities, which suppress parts of the signal, are applied. In contrast to conventional signal processing algorithms, encoding-decoding CNNs are often presented as a solution that does not make explicit assumptions on the signal and noise. For example, in supervised algorithms, an encoding-decoding CNN learns the optimal parameters to filter the signal from a set of paired examples of noise/artifact-free images and images contaminated with noise/artifacts [5], [6], [7]. This greatly simplifies the solution of noise-reduction problems, as it circumvents explicit modeling of the signal and noise. Furthermore, the good performance and simple use of encoder-decoder CNNs have enabled additional data-driven noise-reduction algorithms, where CNNs are embedded as part of a larger system. Examples of such approaches are unsupervised noise reduction [8] and denoising based on generative adversarial networks [9]. Besides this, smoothness in signals can also be obtained by advanced regularization using CNNs, e.g., by exploiting data-driven model-based iterative reconstruction [10].

Despite the impressive noise-reduction performance and flexibility of encoding-decoding CNNs, these models also have downsides that should be considered. First, the complexity and heuristic nature of such designs often offer restricted understanding of the internal operation of such architectures [11]. Second, training and deployment of CNNs require specialized hardware and significant computational resources. Third and finally, restricted understanding of signal modeling in encoding-decoding CNNs does not clearly reveal the limitations of such models and, consequently, it is not obvious how to overcome these problems.

To overcome the limitations of encoding-decoding CNNs, new research has tackled the lack of explainability of these models by acknowledging the similarity between the building blocks of encoding-decoding CNNs applied to image noise reduction and the elements of well-known signal processing algorithms, such as wavelet decomposition, low-rank approximation [12], [13], [14], variational methods [15], lower-dimensional manifolds [8], and convolutional sparse coding [16]. Furthermore, practical works based on shrinkage-based CNNs, inspired by well-established wavelet shrinkage algorithms, have further deepened the connections between signal processing and CNNs [17], [18]. This unified treatment of signal processing-inspired CNNs has resulted in more explainable [6], [8], better performing [6], and more memory-efficient designs [19].

This article has three main objectives. The first is to summarize the diverse explanations of the components of encoding-decoding CNNs applied to image noise reduction based on the concept of deep convolutional framelets [12] and on elementary signal processing concepts. Both aspects are considered with the aim of achieving an in-depth understanding of the internal operation of encoding-decoding CNNs and of showing that the design choices carry implicit assumptions about the signal behavior inside the CNN. A second objective is to offer practitioners tools for optimizing their CNN designs with signal processing concepts. Third and finally, the aim is to show practical use cases where existing CNNs are analyzed in a unified framework, thereby enabling a better comparison of different designs by making their internal operation explicitly visible. Our analysis builds on existing works [6], [12], [20], whose authors analyzed CNNs while ignoring the nonlinearities. In this article, we overcome this limitation and present a complete analysis including the nonlinear activations, which reveals important assumptions implicit in the analyzed models.

The structure of this article is as follows. The “Notation” section introduces the notation used in this text. The “Encoding-Decoding CNNs” section describes the signal model and the architecture of encoding-decoding networks. Afterward, the “Signal Processing Fundamentals” section addresses fundamental aspects of signal processing, such as singular value decomposition (SVD), low-rank approximation, and framelets as well as estimation of signals in the framelet domain. All the concepts of the “Encoding-Decoding CNNs” and “Signal Processing Fundamentals” sections converge in the “Bridging the Gap Between Signal Processing and CNNs: Deep Convolutional Framelets and Shrinkage-Based CNNs” section, where the encoding-decoding CNNs are interpreted in terms of a data-driven low-rank approximation and of wavelet shrinkage. Afterward, based on the learnings from the “Bridging the Gap Between Signal Processing and CNNs: Deep Convolutional Framelets and Shrinkage-Based CNNs” section, the “Mathematical Analysis of Relevant Designs” section shows the analysis of diverse architectures from a signal processing perspective and under a set of explicit assumptions. The “What Happens in Trained Models?” section then explores whether some of the theoretical properties exposed here are related to trained models. Based on the diverse described models and theoretical operation of CNNs, the “Which Network Fits My Problem?” section addresses a design criterion that can be used to design or choose new models and briefly describes the state of the art for noise reduction with CNNs. Finally, the “Conclusions and Future Outlook” section elaborates on concluding remarks and discusses the diverse elements that have not yet been (widely) explored by current CNN designs.

Notation CNNs are composed by basic elements such as convolution, activation, and down-/upsampling layers. To achieve better clarity

IEEE SIGNAL PROCESSING MAGAZINE

|

November 2023

|

39

in the explanations given in this article, we define the mathematical notation to represent the basic operations of CNNs. Some of the definitions presented here are based on the work of ZavalaMondragón et al. [19]. In the following, a scalar is represented by a lower-case letter (e.g., a), while a vector is represented by an underlined lowercase letter (e.g., a). Furthermore, a matrix, such as an image or convolution mask, is represented by a boldface lowercase letter (e.g., variables x and y). Finally, a tensor is defined by a boldface uppercase letter. For example, the two arbitrary tensors A and Q are defined by A=f



a 00 h

q0 f a 0N C - 1 j h p, Q = f h p.(1) f a NN RC -- 11 q NR - 1

a 0N R - 1

Here, entries $\mathbf{a}_r^c$ and $\mathbf{q}^r$ represent 2D arrays (matrices). As the defined tensors are used in the context of CNNs, matrices $\mathbf{a}_r^c$ and $\mathbf{q}^r$ are learned filters, which have dimensions of $(N_V \times N_H)$, where $N_V$ and $N_H$ denote the filter dimensions in the vertical and horizontal directions, respectively. Finally, we define the total tensor dimension of $\mathbf{A}$ and $\mathbf{Q}$ by $(N_C \times N_R \times N_V \times N_H)$ and $(N_R \times 1 \times N_V \times N_H)$, where $N_R$ and $N_C$ are the number of row and column entries, respectively. If the tensor $\mathbf{A}$ contains the convolution weights in a CNN, the row-entry dimensions represent the input number of channels to a layer, while the number of column elements denotes the number of output channels.

Having defined the notation for the variables, we focus on a few relevant operators. First, the transpose of a tensor is denoted by $(\cdot)^\top$ and, for $\mathbf{Q}$, is expressed by

$$\mathbf{Q}^\top = \begin{pmatrix} \mathbf{q}^0 & \cdots & \mathbf{q}^{N_R-1} \end{pmatrix}. \tag{2}$$

Table 1. Relevant symbols used in this article.
Symbol | Meaning
$f_{(2\downarrow)}(\cdot)$ | Downsampling operation
$f_{(2\uparrow)}(\cdot)$ | Upsampling operation
$\mathbf{I}$ | Convolution identity
$\mathbf{K}$ | Encoding convolution kernel
$\tilde{\mathbf{K}}$ | Decoding convolution kernel
$\mathbf{W}$ | Filters for the forward discrete wavelet transform
$\tilde{\mathbf{W}}$ | Filters for the inverse discrete wavelet transform
$\mathbf{W}_H$ | High-pass filters of the forward discrete wavelet transform
$\mathbf{W}_L$ | Low-pass filter of the forward discrete wavelet transform
$\mathbf{x}$ | Noiseless image
$\mathbf{y}$ | Noisy image
$\boldsymbol{\eta}$ | Additive noise
$\underline{b}$ | Bias vector
$t$ | Threshold level
$*$ | Image convolution
$\mathbf{K}\mathbf{x}$ | Tensor convolution between tensor $\mathbf{K}$ and signal $\mathbf{x}$
$(\cdot)^\top$ | Transpose of a tensor
$(\cdot)_+$ | ReLU activation
$\tau_{(\cdot)}(\cdot)$ | Generic thresholding/shrinkage operation
$C_{(\cdot)}(\cdot)$ | Generic clipping operation



Furthermore, the convolution of two tensors is written as $\mathbf{A}\mathbf{Q}$ and specified by

$$\mathbf{A}\mathbf{Q} = \begin{pmatrix} \sum_{r=0}^{N_R-1} \mathbf{a}_r^0 * \mathbf{q}^r \\ \vdots \\ \sum_{r=0}^{N_R-1} \mathbf{a}_r^{N_C-1} * \mathbf{q}^r \end{pmatrix}. \tag{3}$$

Here, the symbol $*$ defines the convolution between two matrices (images). In this article, images that are 2D arrays (matrices) are often convolved with 4D tensors. When this operation is performed, images are considered to have dimensions of $(1 \times 1 \times N_V \times N_H)$. In addition, in this article, matrix $\mathbf{I}$ is the identity signal for the convolution operator, which, for a 2D image, is the Kronecker delta/discrete impulse (an image with a single nonzero pixel with unity amplitude at the center of the image). Furthermore, variables in the decoding path of a CNN are distinguished with a tilde (e.g., $\tilde{\mathbf{K}}$, $\tilde{\underline{b}}$).

Additional symbols that are used throughout the article are the down- and upsampling operations by a factor $s$, which are denoted by $f_{(s\downarrow)}(\cdot)$ and $f_{(s\uparrow)}(\cdot)$, respectively. In this article, both operations are defined in the same way as in multirate filter banks. For example, consider the signal

$$\underline{x} = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10). \tag{4}$$



If we apply the downsampling operator to $\underline{x}$ by a factor of two, it results in

$$\underline{z} = f_{(2\downarrow)}(\underline{x}) = (1, 3, 5, 7, 9) \tag{5}$$

where $\underline{z}$ is the downsampled version of $\underline{x}$. Conversely, applying the upsampling operator $f_{(2\uparrow)}(\cdot)$ gives the result

$$f_{(2\uparrow)}(\underline{z}) = (1, 0, 3, 0, 5, 0, 7, 0, 9, 0). \tag{6}$$

Additional operators used in the article are rectified linear units (ReLUs), shrinkage/thresholding, and clipping, which are represented by $(\cdot)_+$, $\tau_{(\cdot)}(\cdot)$, and $C_{(\cdot)}(\cdot)$, respectively. For better clarity, the most important symbols used in this article are listed in Table 1. In addition, graphical representations of some of the symbols that are used to describe CNNs are shown in Figure 1.

FIGURE 1. Symbols used for the schematic representations of the CNNs addressed in this article. [Symbols shown: tensor convolution, sum, ReLU, shrinkage, clipping, downsampling, and upsampling.]
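The down- and upsampling operators of (5) and (6) can be illustrated with a minimal NumPy sketch. The function names `downsample` and `upsample` are hypothetical and chosen only for this illustration; the sketch assumes 1D signals and integer factors.

```python
import numpy as np

def downsample(x, s=2):
    """Keep every s-th sample, starting with the first one."""
    return np.asarray(x)[::s]

def upsample(z, s=2):
    """Insert s - 1 zeros after every sample."""
    z = np.asarray(z)
    out = np.zeros(len(z) * s, dtype=z.dtype)
    out[::s] = z
    return out

x = np.arange(1, 11)   # the signal of (4)
z = downsample(x)      # the downsampled signal of (5)
x_up = upsample(z)     # the upsampled signal of (6)
```

Note that upsampling after downsampling does not restore the discarded samples; it only restores the original length.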


Encoding-decoding CNNs
Signal model and noise-reduction configurations
In noise-reduction applications, the common additive signal model is defined by

$$\mathbf{y} = \mathbf{x} + \boldsymbol{\eta} \tag{7}$$

where the observed signal $\mathbf{y}$ is the result of contaminating a noiseless image $\mathbf{x}$ with additive noise $\boldsymbol{\eta}$. Assume that the noiseless signal $\mathbf{x}$ is to be estimated from the noisy observation $\mathbf{y}$. In deep learning applications, this is often achieved by models with the form

$$\hat{\mathbf{x}} = G(\mathbf{y}). \tag{8}$$

Here, $G(\cdot)$ is a generic encoding-decoding CNN. We refer to this form of noise reduction as nonresidual. Alternatively, it is possible to find $\hat{\mathbf{x}}$ by training $G(\cdot)$ to estimate the noise component $\hat{\boldsymbol{\eta}}$ and subtracting it from the noisy image $\mathbf{y}$ to estimate the noiseless image $\hat{\mathbf{x}}$, or equivalently,

$$\hat{\mathbf{x}} = \mathbf{y} - G(\mathbf{y}). \tag{9}$$

This model is referred to as residual [5], [7], [21] because the output of the network is subtracted from its input. For reference, Figure 2 portrays the difference in the placement of the encoding-decoding structure in residual and nonresidual configurations.
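The two configurations of (8) and (9) can be contrasted with a toy sketch. It assumes a 1D signal and replaces the trained CNN with a hypothetical stand-in, a moving-average filter (`box_smooth`); the names `G_signal` and `G_noise` are illustrative only and do not correspond to any model in the article.

```python
import numpy as np

def box_smooth(y, width=5):
    """Hypothetical stand-in for a trained network: a moving average."""
    kernel = np.ones(width) / width
    return np.convolve(y, kernel, mode="same")

def G_signal(y):
    """Toy 'network' that outputs the noiseless estimate, as in (8)."""
    return box_smooth(y)

def G_noise(y):
    """Toy 'network' that outputs the noise estimate, as in (9)."""
    return y - box_smooth(y)

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0.0, 2.0 * np.pi, 200))   # noiseless signal
y = x + 0.2 * rng.standard_normal(x.shape)       # additive model of (7)

x_hat_nonresidual = G_signal(y)       # nonresidual configuration (8)
x_hat_residual = y - G_noise(y)       # residual configuration (9)

mse_noisy = np.mean((y - x) ** 2)
mse_estimate = np.mean((x_hat_residual - x) ** 2)
```

In this toy setting both configurations return the same estimate by construction; with trained networks, the two configurations differ in what the network has to learn, not in the form of the final output.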

Encoding-decoding CNNs
Encoding-decoding (convolutional) neural networks are rooted in techniques for data-dimensionality reduction and unsupervised feature extraction, where a given signal is mapped to an alternative space via a nonlinear transformation. This space should have properties that are attractive for the considered task. For example, for dimensionality reduction, the alternative space should be lower dimensional than the original input. In this article, we are interested in models that are useful for noise-reduction applications. Specifically, this manuscript addresses models that are referred to as encoding-decoding CNNs, such as the model by Ranzato et al. [22], in which the encoder uses convolution filters to produce multichannel/redundant representations to which sparsifying nonlinearities are applied. The sparsified signal is later mapped back to the original representation. It should be noted that although the origins of encoding-decoding CNNs are linked to feature extraction, this type of architecture was quickly shown to be useful for other applications such as noise reduction, which is the topic of this article. For the rest of this manuscript, whenever we mention an encoding-decoding CNN, we are referring to a design that follows the same basic principles as Ranzato’s design.

Encoding-decoding CNNs consist of three main parts. The first is the encoder, which maps the incoming image to a representation with more image channels by means of a convolution layer. Every channel of the resulting redundant representation contains a fraction of the content of the original signal. It should be noted that the encoder often (but not necessarily) decreases the resolution of the higher-dimensional representation to enable multiresolution processing and to decrease the memory requirements of the design. The second main part is the decoder, which maps the multichannel representation back to the original space. The third main part is the nonlinearities, which suppress specific parts of the signal. In summary, the most basic encoding-decoding step in a CNN $G(\cdot)$ is expressed by

$$G(\mathbf{y}) = G_{\text{dec}}(G_{\text{enc}}(\mathbf{y})) \tag{10}$$



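where $G_{\text{enc}}(\cdot)$ is the encoder, which is generally defined by

$$\begin{aligned} \mathbf{C}_0 &= E_0(\mathbf{y}), \\ \mathbf{C}_1 &= E_1(\mathbf{C}_0), \\ \mathbf{C}_2 &= E_2(\mathbf{C}_1), \\ &\;\;\vdots \\ \mathbf{C}_{N-1} &= E_{N-1}(\mathbf{C}_{N-2}), \\ G_{\text{enc}}(\mathbf{y}) &= \mathbf{C}_{N-1}. \end{aligned} \tag{11}$$

Here, $\mathbf{C}_n$ represents the code generated by the $n$th encoding $E_n(\cdot)$, which can be expressed by

$$\mathbf{C}_n = E_n(\mathbf{C}_{n-1}) = f_{(s\downarrow)}\big(A_{(\underline{b}_{n-1})}(\mathbf{K}_{n-1}\mathbf{C}_{n-1})\big). \tag{12}$$

Here, the function $A_{(\cdot)}(\cdot)$ is a generic activation used in the encoder, and $f_{(s\downarrow)}(\cdot)$ is a downsampling function by factor $s$. Complementary to the encoder, the decoder network maps the multichannel sparse signal back to the original domain. Here, we define the decoder by

$$\begin{aligned} \tilde{\mathbf{C}}_{N-2} &= D_{N-1}(\mathbf{C}_{N-1}), \\ &\;\;\vdots \\ \tilde{\mathbf{C}}_1 &= D_2(\tilde{\mathbf{C}}_2), \\ \tilde{\mathbf{C}}_0 &= D_1(\tilde{\mathbf{C}}_1), \\ G(\mathbf{y}) &= D_0(\tilde{\mathbf{C}}_0) \end{aligned} \tag{13}$$

where $\tilde{\mathbf{C}}_n$ is the $n$th decoded signal, which is produced by the $n$th decoder layer, yielding the general expression

$$\tilde{\mathbf{C}}_{n-1} = D_n(\tilde{\mathbf{C}}_n) = \tilde{A}_{(\tilde{\underline{b}})}\big(\tilde{\mathbf{K}}_n^\top f_{(s\uparrow)}(\tilde{\mathbf{C}}_n)\big). \tag{14}$$

In (14), $\tilde{A}_{(\cdot)}(\cdot)$ is the activation function used in the decoder, and $f_{(s\uparrow)}(\cdot)$ is an upsampling function of factor $s$.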

FIGURE 2. Residual and nonresidual network configurations. Note that the main difference between both designs is the global skip connection occurring in the residual structure. Still, the network $G(\cdot)$ may contain skip connections internally. [Panels: in the nonresidual configuration, the noisy input is mapped by the encoding-decoding network $G(\cdot)$ to the noiseless estimate; in the residual configuration, $G(\cdot)$ produces the estimated noise, which is subtracted from the noisy input to obtain the noiseless estimate.]


An important remark is that the encoder-decoder CNN does not always contain down-/upsampling layers, in which case the decimation factor $s$ is unity, so that $f_{(1\uparrow)}(\mathbf{x}) = f_{(1\downarrow)}(\mathbf{x}) = \mathbf{x}$ for any matrix $\mathbf{x}$. Furthermore, we assume that the number of channels of the code $\mathbf{C}_N$ is always larger than that of the previous one, $\mathbf{C}_{N-1}$. It should also be noted that a single encoder layer $E_n(\cdot)$ and its corresponding decoder layer $D_n(\cdot)$ can be considered a single-layer encoder-decoder network/pair. For this article, the encoding convolution filter for a given layer $\mathbf{K}$ has dimensions of $(N_o \times N_i \times N_h \times N_v)$, where $N_i$ and $N_o$ are the number of input and output channels for a convolution layer, respectively. Similarly, $N_h$ and $N_v$ are the number of elements in the horizontal and vertical directions, respectively. Note that the encoder increases the number of channels of the signal (e.g., $N_o > N_i$), akin to Ranzato’s design [22]. Furthermore, it is assumed that the decoder is symmetric in the number of channels to the encoder; therefore, the dimensions of the decoding convolution kernel $\tilde{\mathbf{K}}^\top$ are $(N_i \times N_o \times N_h \times N_v)$. The motivation for this symmetry is to emphasize the similarity between the signal processing and the CNN elements.

Signal processing fundamentals
As shown by Ye et al. [12], within encoding-decoding CNNs, the signal is treated akin to well-known sparse representations, where the coefficients used for the transformation are directly learned from the training data. Prior to addressing this important concept in more detail, relevant supporting concepts such as sparsity, sparse transformations, and nonlinear signal estimation in the wavelet domain are explained.

Sparsity
A sparse image is a signal where most of the coefficients are small and the relatively few large coefficients capture most of the information [23]. This characteristic makes it possible to discard low-amplitude components with relatively small perceptual changes. Hence, the use of sparse signals is attractive for applications such as image compression, denoising, and suppression of artifacts. Despite the convenient characteristics of sparse signals, natural images are often nonsparse. Still, there are numerous transformations that allow for mapping the signal to a sparse domain and that are analogous to the internal operations of CNNs. For example, the singular value decomposition (SVD) factorizes the image in terms of two sets of orthogonal bases, of which few basis pairs contain most of the energy of the image. An alternative transformation is based on framelets, where an image is decomposed into a multichannel representation, whereby each resulting channel contains a fragment of the Fourier spectrum. In the following sections, we address these representations in more detail.

Sparse signal representations
SVD and low-rank approximation

Assume that an image (patch) is represented by a matrix $\mathbf{y}$ with dimensions of $(N_r \times N_c)$, where $N_r$ and $N_c$ are the number of rows and columns, respectively. Then, the SVD factorizes $\mathbf{y}$ as

$$\mathbf{y} = \sum_{n=0}^{N_{SV}-1} (\underline{u}_n \underline{v}_n^\top) \cdot \underline{\sigma}[n] \tag{15}$$

in which $N_{SV}$ is the number of singular values, $n$ is a scalar index, while $\underline{u}_n$ and $\underline{v}_n$ are the $n$th left and right singular vectors, respectively. Furthermore, vector $\underline{\sigma}$ contains the singular values, and each of its entries $\underline{\sigma}[n]$ is the weight assigned to every basis pair $\underline{u}_n, \underline{v}_n$. This means that the product $(\underline{u}_n \underline{v}_n^\top)$ contributes more to the image content for higher values of $\underline{\sigma}[n]$. It is customary for the singular values to be ranked in descending order and for the amplitudes of the singular values $\underline{\sigma}$ to be sparse; therefore, $\underline{\sigma}[0] \gg \underline{\sigma}[N_{SV}-1]$. The reason for this sparsity is that image patches intrinsically have high correlation. For example, many images contain repetitive patterns (e.g., a wall with bricks, a fence, rooftop tiles, or a zebra’s stripes) or uniform regions (for example, the sky or a person’s skin). This means that an image patch may contain only a few linearly independent vectors that describe most of the image’s content. Consequently, a higher weight is assigned to such image bases. Given that the amplitudes of the singular values of $\mathbf{y}$ in the SVD are sparse, it is possible to approximate $\mathbf{y}$ by $\hat{\mathbf{y}}$ with only a few bases $(\underline{u}_n \underline{v}_n^\top)$. Note that this procedure reduces the rank of signal $\mathbf{y}$, and hence it is known as low-rank approximation. This process is equivalent to

$$\hat{\mathbf{y}} = \sum_{n=0}^{N_{LR}-1} (\underline{u}_n \underline{v}_n^\top) \cdot \underline{\sigma}[n] \tag{16}$$

where $N_{SV} > N_{LR}$. Note that this effectively cancels the products $(\underline{u}_n \underline{v}_n^\top)$ for which the weight given by $\underline{\sigma}[n]$ is low. Alternatively, it is possible to assign a weight of zero to the product $(\underline{u}_n \underline{v}_n^\top)$ for $n \geq N_{LR}$. The low-rank representation of a matrix is desirable for diverse applications, among which is image denoising. The motivation for using low-rank approximation for this application results from the fact that, as mentioned earlier, natural images are considered low rank due to the strong spatial correlation between pixels, whereas noise is high rank (it is spatially uncorrelated). As a consequence, reducing the rank/number of singular values decreases the presence of noise while still providing a good approximation of the noise-free signal, as exemplified in Figure 3.
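The truncation of (16) can be sketched in a few lines of NumPy. The example below is only an illustration under stated assumptions: the test patch is a synthetic rank-2 pattern (not an image from the article), the noise is white and Gaussian, and the function name `low_rank_approximation` is hypothetical.

```python
import numpy as np

def low_rank_approximation(y, n_lr):
    """Keep only the n_lr largest singular values of y, as in (16)."""
    u, sigma, vt = np.linalg.svd(y, full_matrices=False)
    # Scale the first n_lr left singular vectors by their singular values
    # and recombine them with the matching right singular vectors.
    return (u[:, :n_lr] * sigma[:n_lr]) @ vt[:n_lr, :]

rng = np.random.default_rng(0)

# Assumed test patch: a smooth, strongly correlated pattern of rank 2.
rows = np.linspace(0.0, 1.0, 64)
clean = (np.outer(np.sin(2 * np.pi * rows), np.cos(2 * np.pi * rows))
         + np.outer(rows, rows))
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

denoised = low_rank_approximation(noisy, n_lr=2)

mse_noisy = np.mean((noisy - clean) ** 2)
mse_denoised = np.mean((denoised - clean) ** 2)
```

In this synthetic setting the clean patch is exactly low rank, so the truncation removes most of the noise energy. For natural images the rank must be chosen as a compromise between noise suppression and loss of detail.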

Framelets
Just as with SVD, framelets are also commonly used for image processing. In a nutshell, a framelet transform is a signal representation that factorizes/decomposes an arbitrary signal into multiple bands/channels. Each of these channels contains a segment of the energy of the original signal. In image and signal processing, the framelet bands are the result of convolving the analyzed signal with a group of discrete filters that have finite length/support. In this article, the most important characteristic that the filters of the framelet transform should comply with is that the bands they generate capture all the energy contained in the input to the decomposition. This is important to avoid the loss of information of the decomposed signal. In this text, we refer to framelets that comply with the previous characteristics as tight framelets, and the following paragraphs describe this property in more detail. In its decimated version, the framelet decomposition for tight frames is represented by

$$\mathbf{Y}_{\text{fram}} = f_{(2\downarrow)}(\mathbf{F}\mathbf{y}) \tag{17}$$

in which $\mathbf{Y}_{\text{fram}}$ is the decomposed signal, and $\mathbf{F}$ is the framelet basis (tensor). Note that the signal $\mathbf{Y}_{\text{fram}}$ has more channels than $\mathbf{y}$. Furthermore, the original signal $\mathbf{y}$ is recovered from $\mathbf{Y}_{\text{fram}}$ by

$$\mathbf{y} = \tilde{\mathbf{F}}^\top f_{(2\uparrow)}(\mathbf{Y}_{\text{fram}}) \cdot c. \tag{18}$$

Here, $\tilde{\mathbf{F}}$ is the filter of the inverse framelet transform and $c$ denotes an arbitrary constant. If $c = 1$, the framelet is normalized. Finally, note that the framelet transform can also be undecimated. This means that in undecimated representations, the downsampling and upsampling layers, $f_{(2\downarrow)}(\cdot)$ and $f_{(2\uparrow)}(\cdot)$, are not used. An important property of the undecimated representation is that it is less prone to aliasing than its decimated counterpart, but it is more computationally expensive. Therefore, for efficiency reasons, the decimated framelet decomposition is often preferred over the undecimated representation. In summary, the decomposition and synthesis of the decimated framelet decomposition is represented by

$$\mathbf{y} = \tilde{\mathbf{F}}^\top f_{(2\uparrow)}(f_{(2\downarrow)}(\mathbf{F}\mathbf{y})) \cdot c \tag{19}$$

while for the undecimated framelet it holds that

$$\mathbf{y} = \tilde{\mathbf{F}}^\top (\mathbf{F}\mathbf{y}) \cdot c. \tag{20}$$

A notable normalized framelet is the discrete wavelet transform (DWT), where variables $\mathbf{F}$ and $\tilde{\mathbf{F}}$ are replaced by tensors $\mathbf{W} = (\mathbf{w}_{LL}, \mathbf{w}_{LH}, \mathbf{w}_{HL}, \mathbf{w}_{HH})$ and $\tilde{\mathbf{W}} = (\tilde{\mathbf{w}}_{LL}, \tilde{\mathbf{w}}_{LH}, \tilde{\mathbf{w}}_{HL}, \tilde{\mathbf{w}}_{HH})$, respectively. Here, $\mathbf{w}_{LL}$ is the filter for the low-frequency band, while $\mathbf{w}_{LH}$, $\mathbf{w}_{HL}$, and $\mathbf{w}_{HH}$ are the filters used to extract the detail in the horizontal, vertical, and diagonal directions, respectively. Finally, $\tilde{\mathbf{w}}_{LL}$, $\tilde{\mathbf{w}}_{LH}$, $\tilde{\mathbf{w}}_{HL}$, and $\tilde{\mathbf{w}}_{HH}$ are the filters of the inverse decimated DWT. To understand the DWT more intuitively, Figure 4 shows the decimated framelet decomposition using the filters of the DWT. Note that the convolution $\mathbf{W}\mathbf{y}$ results in a four-channel signal, where each channel contains only a fraction of the spectrum of image $\mathbf{y}$. This allows for downsampling of each channel with minimal aliasing. Furthermore, to recover the original signal, each individual channel is upsampled, thereby introducing aliasing, which is then removed by the filters of the inverse transform. Finally, all the channels are added and the original signal is recovered. Analogous to the low-rank approximation, in framelets, the reduction of noise is achieved by setting the noisy components to zero. These components are typically assumed to have low amplitude when compared to the amplitude of the sparse signal, as expressed by

$$\tilde{\mathbf{y}} = \tilde{\mathbf{F}}^\top f_{(2\uparrow)}\big(\tau_{(t)}(f_{(2\downarrow)}(\mathbf{F}\mathbf{y}))\big) \cdot c \tag{21}$$

where $\tau_{(t)}(\cdot)$ is a generic thresholding/shrinkage function, which sets each of the pixels in $f_{(2\downarrow)}(\mathbf{F}\mathbf{y})$ to zero when their values are lower than the threshold level $t$.
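As an illustration of (19) and (21), the following sketch uses the one-level 1D Haar wavelet, which is one simple example of a normalized tight framelet ($c = 1$). It is a minimal sketch under stated assumptions: the signal is 1D with even length, the test signal is synthetic and piecewise constant, soft shrinkage is used as the thresholding function, and only the detail band is thresholded (a common design choice, not something prescribed by (21)). All function names are hypothetical.

```python
import numpy as np

def haar_analysis(y):
    """Decimated 1D Haar analysis: filter and keep every second sample."""
    even, odd = y[0::2], y[1::2]
    low = (even + odd) / np.sqrt(2)    # low-pass band
    high = (even - odd) / np.sqrt(2)   # high-pass (detail) band
    return low, high

def haar_synthesis(low, high):
    """Inverse of haar_analysis: upsample, filter, and add both bands."""
    y = np.empty(2 * len(low))
    y[0::2] = (low + high) / np.sqrt(2)
    y[1::2] = (low - high) / np.sqrt(2)
    return y

def soft_threshold(z, t):
    """Set coefficients with magnitude below t to zero and shrink the rest."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(0)
x = np.repeat([0.0, 1.0, -0.5, 2.0], 16)        # assumed piecewise-constant signal
y = x + 0.1 * rng.standard_normal(x.shape)      # noisy observation

low, high = haar_analysis(y)
y_rec = haar_synthesis(low, high)                          # (19) with c = 1
y_den = haar_synthesis(low, soft_threshold(high, t=0.3))   # (21), detail band only

mse_noisy = np.mean((y - x) ** 2)
mse_denoised = np.mean((y_den - x) ** 2)
```

Because the Haar transform is orthonormal, the two bands together carry exactly the energy of the input, which is the tight-framelet property described above.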

Nonlinear signal estimation in the framelet domain
As mentioned in the “Framelets” section, framelets decompose a given image $\mathbf{y}$ by convolving it with a tensor $\mathbf{F}$. Note that many of the filters that compose $\mathbf{F}$ have a high-pass nature. Images often contain approximately uniform regions in which the variation is low; therefore, convolving a signal $\mathbf{y}$ with a high-pass filter $\mathbf{f}_h$, where $\mathbf{f}_h \in \mathbf{F}$, produces the sparse detail band $\mathbf{d} = \mathbf{f}_h * \mathbf{y}$, in which uniform regions have low amplitudes, while transitions, i.e., edges, contain most of the energy of the bands. Assume a model in which a single pixel $d \in \mathbf{d}$ is observed, which is contaminated with additive noise $\eta$. Then, the resulting observed pixel $z$ is defined by

$$z = d + \eta. \tag{22}$$



To recover the noiseless pixel $d$ from observation $z$, it is possible to use the point-maximum a posteriori (MAP) estimate [1], [24], defined by the maximization problem

$$\hat{d} = \arg\max_{d} \left[\ln P(d \mid z)\right]. \tag{23}$$

Here, the log-posterior $\ln P(d \mid z)$ is defined, up to an additive term that does not depend on $d$, by

$$\ln P(d \mid z) = \ln P(z \mid d) + \ln P(d) \tag{24}$$

where the conditional probability density function (PDF) $P(z \mid d)$ expresses the noise distribution, which is often assumed Gaussian and defined by

$$P(z \mid d) \propto \exp\left(-\frac{(z-d)^2}{2\sigma_\eta^2}\right). \tag{25}$$

FIGURE 3. SVD reconstruction of clean and corrupted images with a different number of singular values. Note that reconstruction of the clean image with eight or 32 singular values ($N_{SV} = 8$ or $N_{SV} = 32$, respectively) yields reconstructions indistinguishable from the original image. This contrasts with their noisy counterparts, where $N_{SV} = 8$ reconstructs a smoother image in which the noise is attenuated, while $N_{SV} = 32$ reconstructs the noise texture perfectly. [Panels: clean; clean, $N_{SV} = 8$; clean, $N_{SV} = 32$; noisy; noisy, $N_{SV} = 8$; noisy, $N_{SV} = 32$.]

FIGURE 4. 2D spectrum analysis of the decimated discrete framelet decomposition and reconstruction. In the figure, function $\mathcal{F}\{\cdot\}$ stands for the amplitude Fourier spectrum of the input argument. The yellow squares indicate a region in the low-frequency area of the Fourier spectrum, while the orange, purple, and blue squares indicate the high-pass/detail bands. For these images, ideal orthogonal bases are assumed. Note that the forward transform is composed of two steps. First, the signal is convolved with the wavelet basis $(\mathbf{W}\mathbf{y})$. Afterward, downsampling is applied to the signal $(f_{(2\downarrow)}(\mathbf{W}\mathbf{y}))$. During the inverse transformation, the signal is upsampled by inserting zeros between each sample $(f_{(2\uparrow)}(f_{(2\downarrow)}(\mathbf{W}\mathbf{y})))$, which causes spatial aliasing (dashed blocks). Finally, the spatial aliasing is removed by the inverse transform filter $\tilde{\mathbf{W}}$ and all the channels are added $(\tilde{\mathbf{W}}^\top f_{(2\uparrow)}(f_{(2\downarrow)}(\mathbf{W}\mathbf{y})))$. [Stages shown: input; forward transform (convolutional framelet basis, downsampling); inverse transform (upsampling, convolutional inverse framelet transformation).]

Here, $\sigma_\eta^2$ is the noise variance. Furthermore, as prior probability, it is assumed that the distribution of $P(d)$ corresponds to a Laplacian distribution, which has been used in wavelet-based denoising [1]. Therefore, $P(d)$ is mathematically described by

$$P(d) \propto \exp\left(-\frac{|d|}{\sigma_d}\right), \tag{26}$$

where $\sigma_d$ is the dispersion measure of the Laplace distribution. For reference, Figure 5 portrays an example of both a Gaussian and a Laplacian PDF. Note that the Laplacian distribution has a higher probability of zero elements occurring than the Gaussian distribution for the same standard deviation. Finally, substituting (25) and (26) in (24) results in

$$\ln P(d \mid z) \propto -\frac{(z-d)^2}{2\sigma_\eta^2} - \frac{|d|}{\sigma_d}. \tag{27}$$

In (27), maximizing $\ln P(d \mid z)$ with respect to $d$ using the first-derivative criterion, in a constrained or unconstrained way, leads to two common activations in noise-reduction CNNs: the ReLU and the soft-shrinkage function. Furthermore, the solution also can be used to derive the so-called clipping function, which is useful in residual networks.

For reference and further understanding, Figure 6 portrays the elements composing the noise model of (22), the signal-transfer characteristics of the ReLU, soft-shrinkage, and clipping functions, and the effect that these functions have on the observed noisy detail band $z$.
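The first-derivative step can be written out as a short worked derivation. It is a sketch that uses only the symbols of (27) and treats the two signs of $d$ separately, because $|d|$ is not differentiable at $d = 0$.

```latex
% Case d > 0, where |d| = d:
\frac{\partial}{\partial d}\left[-\frac{(z-d)^2}{2\sigma_\eta^2} - \frac{d}{\sigma_d}\right]
  = \frac{z-d}{\sigma_\eta^2} - \frac{1}{\sigma_d} = 0
  \quad\Longrightarrow\quad
  d = z - \frac{\sigma_\eta^2}{\sigma_d}.

% Case d < 0, where |d| = -d:
\frac{\partial}{\partial d}\left[-\frac{(z-d)^2}{2\sigma_\eta^2} + \frac{d}{\sigma_d}\right]
  = \frac{z-d}{\sigma_\eta^2} + \frac{1}{\sigma_d} = 0
  \quad\Longrightarrow\quad
  d = z + \frac{\sigma_\eta^2}{\sigma_d}.

% Each solution is valid only if it keeps the sign assumed for d.
% If neither does, i.e., |z| \le \sigma_\eta^2 / \sigma_d, the maximum is at d = 0.
```

The ratio $\sigma_\eta^2 / \sigma_d$ therefore acts as a threshold on the observation $z$: observations with a smaller magnitude are mapped to zero, and larger ones are shrunk toward zero by that amount.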

ReLU
If (27) is solved for $d$ while constraining the estimator to be positive, the noiseless estimate $\hat{d}$ becomes

$$\hat{d} = (z - t)_+ \tag{28}$$



FIGURE 5. The probability density function for (a) Gaussian and (b) Laplacian distributions.

FIGURE 6. Signals involved in the additive noise model, input/output transfer characteristics of activation layers, and estimates produced by the activation layers when applied to the noise-contaminated signal. (a) The signals involved in the additive noise model: noise $\eta_n$, noiseless signal $d_n$, and contaminated signal $z_n$ (amplitude versus sample index). (b) The output amplitude of the activation functions $(\cdot - t)_+$, $\tau^{\text{Soft}}_{(t)}(\cdot)$, and $C_{(t)}(\cdot)$ with respect to the input amplitude $z_n$. (c) The application of the activation functions to the noisy observation $z$.

The estimate of (28) is also expressed by

$$(z - t)_+ = \begin{cases} z - t, & \text{if } z \geq t, \\ 0, & \text{if } t > z. \end{cases} \tag{29}$$

Here, the threshold level is defined by

$$t = \frac{\sigma_\eta^2}{\sigma_d}. \tag{30}$$

Note that this estimator cancels the negative elements of $d$ as well as the low-amplitude elements that are lower than the threshold level $t$. For example, if the signal content on the feature map is low, then $\sigma_d \to 0$. In such a case, $t \to +\infty$ and, consequently, $\hat{d} \to 0$. This means that the channel is suppressed. Alternatively, if the feature map has strong signal presence, i.e., $\sigma_d \to \infty$, then $t \to 0$ and $\hat{d} \to (z)_+$. A final remark is made on the modeling of functions of a CNN. It should be noted that the estimator of (28) is analogous to the activation function of a CNN known as an ReLU. However, in a CNN, the value of $t$ would be the bias $b$ learned from the training data.

Soft shrinkage/thresholding
If (27) is maximized in an unconstrained way, the estimate $\hat{d}$ is

$$\hat{d} = \tau^{\text{Soft}}_{(t)}(z) = (z - t)_+ - (-z - t)_+. \tag{31}$$

Here, $\tau^{\text{Soft}}_{(t)}(\cdot)$ denotes the soft-shrinkage/-thresholding function, which is often also written in the form

$$\tau^{\text{Soft}}_{(t)}(z) = \begin{cases} z - t, & \text{if } z \geq t, \\ 0, & \text{if } t > z \geq -t, \\ z + t, & \text{if } -t > z. \end{cases} \tag{32}$$

It can be observed that the soft threshold forces the low-amplitude components whose magnitude is lower than the threshold level $t$ to zero. In this case, $t$ is also defined by (30). It should be noted that the soft-shrinkage estimator can also be obtained from a variational perspective [25]. Finally, it can be observed that soft shrinkage is the superposition of two ReLU functions, which has been pointed out by Fan et al. [18].

Soft clipping
In the “ReLU” and “Soft Shrinkage/Thresholding” sections, the estimate $\hat{d}$ is obtained directly from the noisy observation $z$. Alternatively, it is possible to estimate the noise $\eta$ and subtract it from $z$, akin to the residual CNNs represented by (9). This can be achieved by solving the model

$$\hat{\eta} = z - \hat{d} = z - \tau^{\text{Soft}}_{(t)}(z) \tag{33}$$

which is equivalent to

$$\hat{\eta} = C^{\text{Soft}}_{(t)}(z) = z - \big((z - t)_+ - (-z - t)_+\big) \tag{34}$$

where $C^{\text{Soft}}_{(t)}(\cdot)$ is the soft-clipping function. Note that this function also can be expressed by

$$C^{\text{Soft}}_{(t)}(z) = \begin{cases} t, & \text{if } z \geq t, \\ z, & \text{if } t \geq z > -t, \\ -t, & \text{if } -t \geq z. \end{cases} \tag{35}$$

Other thresholding layers
One of the main drawbacks of the soft-threshold activation is that it is a biased estimator. This limitation has been addressed by the hard and semihard thresholds, which are (asymptotically) unbiased estimators for large input values. In this section, we focus solely on the semihard threshold and avoid the hard variant because it is discontinuous and therefore not suited to models that rely on gradient-based optimization, such as CNNs. Among the semihard thresholds, two notable examples are the garrote shrink and the shrinkage functions generated by derivatives of Gaussians (DoGs) [19], [26]. The garrote shrink function $\tau^{\text{Gar}}_{(\cdot)}(\cdot)$ is defined by

$$\tau^{\text{Gar}}_{(t)}(z) = \frac{(z^2 - t^2)_+}{z}. \tag{36}$$

Furthermore, an example of a shrinkage function based on the DoG is given by

$$\tau^{\text{DoG}}_{(t)}(z) = z - C^{\text{DoG}}_{(t)}(z) \tag{37}$$

where the semihard clipping function with the DoG $C^{\text{DoG}}_{(\cdot)}(\cdot)$ is given by

$$C^{\text{DoG}}_{(t)}(z) = z \cdot \exp\left(-\frac{z^p}{t^p}\right) \tag{38}$$

in which $p$ is an even number. The garrote and semihard DoG shrinkage functions are shown in Figure 7 as well as their clipping counterparts. Note that the shrinkage functions approximate unity gain for $|z| \gg t$; therefore, they are asymptotically unbiased for large signal values. The final thresholding function addressed in this section is the linear expansion of thresholds (LETs) proposed by Blu and Luisier [26]. This technique combines multiple thresholding functions to improve performance and is defined by

$$\tau^{\text{LET}}_{(\underline{t})}(z) = \sum_{n=0}^{N_T-1} a_n \cdot \tau_{(t_n)}(z) \tag{39}$$

where $a_n$ is the weighting factor assigned to each threshold, and all weighting factors should add up to unity.
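The pointwise functions of (28), (31), (34), and (36) through (38) can be written compactly in NumPy. The sketch below is illustrative only: the function names are hypothetical, the garrote output at $z = 0$ is defined as zero to avoid a division by zero (an assumption of this sketch), and $p = 2$ is used as a default even exponent.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_threshold(z, t):
    """Estimator of (28): (z - t)_+."""
    return relu(z - t)

def soft_shrink(z, t):
    """Soft shrinkage of (31): superposition of two ReLUs."""
    return relu(z - t) - relu(-z - t)

def soft_clip(z, t):
    """Soft clipping of (34): the part that soft shrinkage removes."""
    return z - soft_shrink(z, t)

def garrote_shrink(z, t):
    """Garrote shrinkage of (36); the output is set to 0 at z = 0."""
    z = np.asarray(z, dtype=float)
    safe_z = np.where(z == 0.0, 1.0, z)   # avoid division by zero
    return relu(z ** 2 - t ** 2) / safe_z

def dog_clip(z, t, p=2):
    """Semihard clipping of (38) with an even exponent p."""
    return z * np.exp(-(z ** p) / (t ** p))

def dog_shrink(z, t, p=2):
    """Semihard shrinkage of (37)."""
    return z - dog_clip(z, t, p)

z = np.linspace(-1.0, 1.0, 9)   # -1, -0.75, ..., 1
t = 0.25
```

Evaluating `soft_shrink(z, t)` and `soft_clip(z, t)` on the same grid shows that the two always add up to the input, which is the relation between (31) and (34) that residual networks exploit.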

Bridging the gap between signal processing and CNNs: Deep convolutional framelets and shrinkage-based CNNs
The next sections address the theoretical operation of noise-reduction CNNs based on ReLUs and shrinkage/thresholding functions. The first part describes the TDCFs [12], which is the most extensive study on the operation of encoding-decoding ReLU-based CNNs up to now. Afterward, we focus on the operation of networks that use shrinkage functions instead of ReLUs [17], [18], [19], with the aim of mimicking well-established denoising algorithms [1]. The final part addresses the connections between both methods and additional links between CNNs and signal processing.

FIGURE 7. Transfer characteristics of the semihard thresholds based on the difference of Gaussians and of the garrote shrink as well as their clipping counterparts. Note that in contrast with the soft-shrinkage and clipping functions shown in Figure 6, the semihard thresholds tend to unity for large values, while the semihard clipping functions tend to zero for large signal intensities. [Panels: transfer of $\tau^{\text{DoG}}_{(t)}(\cdot)$, $C^{\text{DoG}}_{(t)}(\cdot)$, $\tau^{\text{Gar}}_{(t)}(\cdot)$, and $C^{\text{Gar}}_{(t)}(\cdot)$; output amplitude versus input amplitude.]

TDCFs
The TDCFs [12] describes the operation of encoding-decoding ReLU-based CNNs. Its most relevant contributions are 1) to establish the equivalence of framelets and the convolutional layers of CNNs, 2) to provide conditions that preserve the signal integrity within an ReLU CNN, and 3) to explain how ReLUs and convolution layers reduce noise within an encoding-decoding CNN. The similarity between framelets and the encoding and decoding convolutional filters can be observed when comparing (12) and (14) with (17) and (18), where it becomes visible that the convolution structure of encoding-decoding CNNs is analogous to the forward and inverse framelet decomposition. Regarding the signal reconstruction characteristics, the TDCFs [12] states the following. First, to be able to recover an arbitrary signal $\mathbf{y} \in \mathbb{R}^N$, the number of output channels of a convolution layer with ReLU activation should at least double the number of input channels. Second, the encoding convolution kernel $\mathbf{K}$ should be composed of pairs of filters with opposite phases. These two requirements ensure that any negative and positive values propagate through the network. Under these conditions, the encoding and decoding convolution filters, $\mathbf{K}$ and $\tilde{\mathbf{K}}$, respectively, should comply with

$$\mathbf{y} = \tilde{\mathbf{K}}^\top (\mathbf{K}\mathbf{y})_+ \cdot c. \tag{40}$$

It can be noted that (40) is an extension of (20), which describes the reconstruction characteristics of tight framelets. From this point on, we refer to convolutional kernels compliant with (40) as phase-complementary tight framelets. As a final remark, it should be noted that a common practice in CNN designs is to also use ReLU nonlinearities in the decoder. In such a case, the phase-complementary tight-framelet condition can still be met as long as the pixels $y \in \mathbf{y}$ comply with $y \geq 0$, which is equivalent to

$$\mathbf{y} = (\mathbf{y})_+ = \big(\tilde{\mathbf{K}}^\top (\mathbf{K}\mathbf{y})_+ \cdot c\big)_+. \tag{41}$$

It can be observed that the relevance of the properties defined in (40) and (41) is that they ensure that a CNN can propagate any arbitrary signal, which is important to avoid any distortions (such as image blur) in the processed images. An additional element of the TDCFs regarding reconstruction of the signal is to show that conventional pooling layers (e.g., average pooling) can discard high-frequency information of the signal, which effectively blurs the processed signals. Furthermore, Ye et al. [12] have demonstrated that this can be fixed by replacing the conventional up-/downsampling layers by reversible operations, such as the DWT. To exemplify this property, we refer to Figure 4. If only an average pooling layer followed by an upsampling stage were to be applied, the treatment of the signal would be equivalent to the low-frequency branch of the DWT. Consequently, only the low-frequency spectrum of the signal would be recovered and the images processed with that structure would become blurred. In contrast, if the full forward and inverse wavelet transform of Figure 4 is used for up- and downsampling, it is possible to reconstruct any signal, irrespective of its frequency content.

The ultimate key contribution of the TDCFs is its explanation of the operation of ReLU-based noise-reduction CNNs. For a nonresidual configuration, ReLU CNNs perform the following operations. 1) The convolution filters decompose the incoming signal into a sparse multichannel representation. 2) The feature maps that are uncorrelated with the signal contain mainly noise. In this case, the bias and the ReLU activation cancel the noisy feature maps in a process analogous to the MAP estimate shown in the “ReLU” section. 3) The decoder reconstructs the filtered image. Note that this process is analogous to the low-rank decomposition described in the “SVD and Low-Rank Approximation” section. In the case of residual networks, the CNN learns to estimate the noise, which means that in that configuration the ReLU nonlinearities suppress the channels with high activation. A visual example of low-rank approximation in ReLU CNNs is shown in Figure 8, illustrating the operation of an idealized single-layer encoding-decoding ReLU CNN operating in both a residual and nonresidual way. It can be noted that the ReLU activation suppresses specific channels in the sparse decomposition provided by the encoder, thereby preserving the low-rank structures in the nonresidual network.
Alternatively, in the residual example, the ReLUs eliminate the feature maps with high activation, which results in a noise estimate that is subtracted from the input to estimate the noiseless signal.

Shrinkage and clipping-based CNNs

As in ReLU networks, the encoder of shrinkage networks [17], [18], [19] separates the input signal into a multichannel representation. As a second processing stage, shrinkage networks estimate the noiseless encoded signal by canceling the low-amplitude pixels in the feature maps in a process akin to the MAP estimate of the “Soft Shrinkage/Thresholding” section. As a final step, the decoder reconstructs the estimated noiseless image. Note that shrinkage networks require fewer channels than their ReLU counterparts to achieve perfect signal reconstruction because the shrinkage activation preserves positive and negative values, while ReLUs preserve only the positive part of the signal.

As shown in the “Signal Model and Noise-Reduction Configurations” section, in residual learning, a given encoding-decoding network estimates the noise signal $\eta$ so that it can be subtracted from the noisy observation $y$ to generate the noiseless estimate $\hat{x}$. As shown in the “Soft Clipping” section, in the framelet domain, this is achieved by preserving the low-amplitude values of the feature maps by clipping the signal. Therefore, in residual networks, the shrinkage functions can be explicitly replaced by clipping activations. Visual examples of the operation of single-layer shrinkage and clipping networks are presented in Figure 9, where it can be noted that the operation of shrinkage and clipping networks is analogous to that of their ReLU counterparts, with the main difference being that shrinkage and clipping networks do not require phase complements in the encoding and decoding layers, as ReLU-based CNNs do.

Shrinkage and clipping in ReLU networks

As addressed in the “Nonlinear Signal Estimation in the Framelet Domain” section, the soft-threshold function is the superposition of two ReLU activations. As a consequence, it is feasible that in ReLU CNNs, shrinkage behavior could arise in addition to the low-rankness enforcement mentioned in the “TDCFs” section. It should be noted that this can only happen if the number of channels of the encoder and decoder complies with the redundancy constraints of the TDCFs, and if the decoder is linear. To prove this, (31) is reparameterized as

$$\hat{d} = \tilde{K}^\top (Ky + b)_+ \qquad (42)$$

where convolution filters $K$ and $\tilde{K}^\top$ are defined by $K = (I \;\; {-I})^\top$ and $\tilde{K}^\top = (I \;\; {-I})$, respectively, and $b = ({-\tau} \;\; {-\tau})^\top$, where $\tau$ represents the threshold value. In addition to the soft-shrinkage function, note that the clipping function described by (34) also can be expressed by (42) if $K = (I \;\; {-I} \;\; I \;\; {-I})^\top$, $\tilde{K}^\top = (I \;\; {-I} \;\; {-I} \;\; I)$, and $b = (0 \;\; 0 \;\; {-\tau} \;\; {-\tau})^\top$. It can be noted that representing the clipping function in convolutional form requires four times more channels than the original input signal.

The ability of ReLUs to approximate other signals has also been observed by Daubechies et al. [29], who have proven that deep ReLU CNNs are universal function approximators. In addition, Ye and Sung [13] have demonstrated that the ReLU function is the main source of the high-approximation power of CNNs.
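The two parameterizations of (42) can be checked numerically on scalars. The following is a minimal sketch, assuming that the identity $I$ reduces to the scalar 1 and that the reference soft-threshold and clipping functions take their usual textbook forms; the function names are illustrative and do not come from the article.

```python
def relu(v):
    """Rectified linear unit for a scalar."""
    return max(v, 0.0)


def soft_threshold_from_relu(y, tau):
    # Encoder K = (1, -1)^T, bias b = (-tau, -tau)^T, linear decoder (1, -1).
    return relu(y - tau) - relu(-y - tau)


def clipping_from_relu(y, tau):
    # Encoder K = (1, -1, 1, -1)^T, bias b = (0, 0, -tau, -tau)^T,
    # linear decoder (1, -1, -1, 1): four channels for one input channel.
    return relu(y) - relu(-y) - relu(y - tau) + relu(-y - tau)


def soft_threshold_reference(y, tau):
    # Textbook soft threshold (assumed form, used only for comparison).
    if y > tau:
        return y - tau
    if y < -tau:
        return y + tau
    return 0.0


def clipping_reference(y, tau):
    # Textbook clipping to the interval [-tau, tau] (assumed form).
    return max(-tau, min(tau, y))


if __name__ == "__main__":
    tau = 0.5
    for y in (-2.0, -0.3, 0.0, 0.4, 1.7):
        print(y, soft_threshold_from_relu(y, tau), clipping_from_relu(y, tau))
```

Note that the clipping output equals the input minus the soft-threshold output, which is why the clipping variant needs the two extra bias-free channels.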

Additional links between encoding-decoding CNNs and existing signal processing techniques

Up to now, it has been assumed that operation of the encoding and decoding convolution filters is limited to mapping the input image to a multichannel representation and to reconstructing it (i.e., $K$ and $\tilde{K}^\top$ comply with $\tilde{K}^\top (K)_+ = I \cdot c$). Still, it is possible that, in addition to performing decomposition and synthesis tasks, the encoding-decoding structure also filters/colors the signal in a way that improves image estimates. It should be noted that this implies that the perfect reconstruction encoding-decoding structure is no longer preserved. For example, consider the following linear encoding-decoding structure

$$\hat{x} = \tilde{K}^\top (Ky) \qquad (43)$$

which can be reduced to

$$\hat{x} = k \ast y. \qquad (44)$$
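The reduction from (43) to (44) follows from the associativity of convolution. The sketch below illustrates it in 1D, assuming full (zero-padded) convolutions and one decoder filter per encoder channel; the filter values and function names are hypothetical and serve only to show that the cascade collapses to a single equivalent filter.

```python
def conv_full(a, b):
    """Full 1D convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def add(u, v):
    """Element-wise sum of two equally long sequences."""
    return [p + q for p, q in zip(u, v)]


# Hypothetical encoder filters (one per channel) and matching decoder filters.
ENCODER = [[0.5, 0.5], [0.5, -0.5], [1.0, 0.0]]
DECODER = [[0.5, 0.5], [-0.5, 0.5], [0.2, 0.1]]


def encode_decode(y):
    """Linear structure of (43): filter per channel, decode, sum the channels."""
    total = [0.0] * (len(y) + 2)
    for k_enc, k_dec in zip(ENCODER, DECODER):
        total = add(total, conv_full(conv_full(y, k_enc), k_dec))
    return total


def equivalent_filter():
    """Single filter k of (44): sum over channels of decoder * encoder filters."""
    k = [0.0] * 3
    for k_enc, k_dec in zip(ENCODER, DECODER):
        k = add(k, conv_full(k_enc, k_dec))
    return k


if __name__ == "__main__":
    y = [1.0, -2.0, 0.5, 3.0]
    print(encode_decode(y))
    print(conv_full(y, equivalent_filter()))
```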


[Figure 8 graphic: two panels, “Nonresidual Network” and “Residual Network.” Each panel shows the input $y$; the encoder, formed by the encoding convolution $K$ (filters $\pm w_{LL}$, $\pm w_{LH}$, $\pm w_{HL}$, $\pm w_{HH}$), the feature maps $Ky$, the bias $b[n]$ plotted against the channel index $n$, and the ReLU activation $(Ky + b)_+$; and the decoder, formed by the decoding convolution $\tilde{K}$.]

FIGURE 8. Operation of a simplified denoising (non)residual ReLU CNN according to the TDCFs. In the figure, the noisy observation $y$ is composed of two vertical bars plus uncorrelated Gaussian noise. Furthermore, for this example, the encoding and decoding convolution filters ($K$ and $\tilde{K}$, respectively) are the Haar basis of the 2D DWT and its phase-inverted counterparts. Given the content of the image, the image in the decomposed domain $Ky$ produces only a weak activation for the vertical and diagonal filters ($w_{LH}$ and $w_{HH}$, respectively), and those feature maps contain mainly noise. In the case of the nonresidual network, the ReLUs and biases suppress the channels with low activation [see the $(Ky + b)_+$ column], which is akin to the low-rank approximation. In contrast, in the residual example, the channels with image content are suppressed while preserving the uncorrelated noise. Finally, the decoding section reconstructs the noise-free estimate $\hat{x}$ for the nonresidual network or the noise estimate $\hat{\eta}$ for the residual example, where it is subtracted from $y$ to compute the noiseless estimate $\hat{x}$.


[Figure 9 graphic: two panels, “Nonresidual Network” and “Residual Network.” Each panel shows the input $y$; the encoder, formed by the encoding convolution $K$ (filters $+w_{LL}$, $-w_{LH}$, $+w_{HL}$, $-w_{HH}$), the feature maps $Ky$, and the bias $b$ plotted against the channel index; the shrinkage activation $\tau^{DoG}_{(b)}(Ky)$ in the nonresidual panel or the clipping activation in the residual panel; and the decoder, formed by the decoding convolution $\tilde{K}$.]

FIGURE 9. Operation of denoising in shrinkage and clipping networks. In the nonresidual configuration, the noisy signal $y$ is decomposed by a set of convolution filters, which, for this example, are the 2D Haar basis functions of the DWT ($Ky$). As a second step, the semihard shrinkage produces an MAP estimate of the noiseless detail bands/feature maps $(\tau^{DoG}_{(b)}(Ky))$. As a third and final step, the decoder maps the estimated noiseless encoded signal to the original image domain. In the residual network, the behavior is similar, but the activation layer is a clipping function that performs an MAP estimate of the noise in the feature maps, which is reconstructed by the decoder to generate the noise estimate $\hat{\eta}$. After reconstruction, the noise estimate is subtracted from the noisy observation $y$ to generate the noise-free estimate $\hat{x}$.

Here, $k = \tilde{K}^\top K$ is optimized to reduce the distance between $y$ and the ground truth $x$. Consequently, the equivalent filter $k$ can be considered a Wiener filter. This article is not the first to address the potential Wiener-like behavior of a CNN. For example, Mohan et al. [14] suggested that by eliminating the bias of the convolution layers, the CNN could behave more akin to the Wiener filter and be able to generalize better to unseen noise levels. By doing so, the CNN can also behave akin to the switching behavior described by the TDCFs, which can be described by the equation

$$(z)_+ = \begin{cases} z, & \text{if } z \geq 0 \\ 0, & \text{if } z < 0 \end{cases} \qquad (45)$$

where $z$ is a pixel that belongs to the signal $z = k \ast x$. It can be observed that, in contrast with the low-rank behavior described in the “TDCFs” section, in this case the switching behavior depends only on the correlation between signal $x$ and filter $k$. If the value of $z$ is positive, its value is preserved. On the contrary, if the correlation between $x$ and $k$ is negative, then the value of $z$ is canceled. Consequently, the noise reduction becomes invariant to the noise level. This effect can be considered a nonlinear extension of signal annihilation filters [30].

Aside from the low-rank approximation interpretation of ReLU-based CNNs, additional links to other techniques can be derived. For example, the decomposition and synthesis provided by the encoding-decoding structure is also akin to the nonnegative matrix factorization [31], in which a signal is factorized as a weighted sum of positive bases. In this conception, feature maps are the bases, which are constrained to be positive by the ReLU function. Furthermore, an additional interpretation of encoding-decoding CNNs can be obtained by analyzing them from a low-dimensional manifold representation perspective [8]. Here, the convolution layers of CNNs are interpreted as two operations. On one hand, they can provide a Hankel representation, and on the other, they can provide a bottleneck that reduces the dimensionality of the manifold of image patches. The Hankel-like structure attributed to the convolution layers of CNNs has also been noted by the TDCFs [12]. Two final connections between signal processing and CNNs are the variational formulation combined with kernel-based methods [15] and the convolutional sparse coding interpretation of CNNs [16].
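The noise-level invariance of the bias-free case can be illustrated with a toy single-layer network, since the ReLU in (45) commutes with positive scaling. This is a minimal sketch under the assumption that the convolutions are replaced by small dense matrices; the weights, the helper names, and the scale factor are hypothetical.

```python
def matvec(matrix, vector):
    """Matrix-vector product with plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def relu_vec(vector):
    """Element-wise ReLU, i.e., the switching behavior of (45)."""
    return [max(v, 0.0) for v in vector]


def single_layer_net(y, enc, dec, bias):
    """Toy encoder-decoder: dec @ relu(enc @ y + bias)."""
    features = [f + b for f, b in zip(matvec(enc, y), bias)]
    return matvec(dec, relu_vec(features))


# Hypothetical weights standing in for the encoder K and the decoder.
ENC = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.3]]
DEC = [[1.0, 0.0, -1.0], [0.5, 1.0, 0.2]]
ZERO_BIAS = [0.0, 0.0, 0.0]

if __name__ == "__main__":
    y = [0.4, -0.2]
    alpha = 3.0  # positive rescaling of the input (signal and noise alike)
    scaled_in = single_layer_net([alpha * v for v in y], ENC, DEC, ZERO_BIAS)
    scaled_out = [alpha * v for v in single_layer_net(y, ENC, DEC, ZERO_BIAS)]
    print(scaled_in, scaled_out)  # the two agree when the bias is zero
```

With a nonzero bias the two results generally differ, because the bias fixes an absolute threshold that does not scale with the input.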

Mathematical analysis of relevant designs

To demonstrate an application of the principles summarized in the “Signal Processing Fundamentals” and “Bridging the Gap Between Signal Processing and CNNS: Deep Convolutional Framelets and Shrinkage-Based CNNs” sections, this section analyzes relevant designs of ReLU and shrinkage CNNs. The analyses focus on three main aspects: 1) overall descriptions of the network architecture, 2) the signal reconstruction characteristics provided by the convolutional layers of the encoder and decoder subnetworks, and 3) the number of operations $O(\cdot)$ executed by the trainable parts of the network, as this will give insight into the computational requirements needed to execute each network and its overall complexity.

Signal reconstruction analysis provides a theoretical indication that a given CNN design can propagate any arbitrary signal when considering the use of ideal filters (i.e., they provide perfect reconstruction and are maximally sparse). In other words, for a fixed network architecture, there exists a selection of parameters (weights and biases) that makes the neural network equal to the identity function. This result is important because a design that cannot propagate arbitrary signals under ideal conditions will potentially distort the signals that propagate through it by design. Consequently, this cannot be fixed by training with large datasets and/or with the application of any special loss term. To better understand the signal reconstruction analysis, we provide a brief example in which a nonresidual CNN $G(\cdot)$ processes a noiseless signal $x$ contaminated with noise $\eta$, so that

$$x \approx G(x + \eta). \qquad (46)$$

Here, an ideal CNN allows us to propagate any $x$ while canceling the noise component $\eta$, irrespective of the content of $x$. If we switch our focus to an ideal residual CNN $R(\cdot)$, it is possible to observe that

$$\hat{x} \approx R(y) = y - G(y). \qquad (47)$$

Here, $G(\cdot)$ is the encoding-decoding section of the residual network $R(\cdot)$. Consequently, it is desirable that the network $G(\cdot)$ is able to propagate the noise $\eta$ while suppressing the noiseless signal $x$, which is equivalent to

$$\eta \approx G(x + \eta). \qquad (48)$$



It should be noted that in both residual and nonresidual cases, there are two behaviors. On one hand, there is a signal that the network decomposes and reconstructs (almost) perfectly, and on the other, a signal is suppressed. Signal reconstruction analysis focuses on the signals that the network can propagate or reconstruct, rather than on the signal cancelation behavior. Consequently, we focus on the linear part of $G(\cdot)$ (i.e., its convolution structure), which, according to the “TDCFs” section, we assume handles the decomposition and reconstruction of the signal within the CNN. The idealized model assumed here is only considered for analysis purposes, as practical implementations do not guarantee that this exact behavior is obtained. For more information, see the “Additional Links Between Encoding-Decoding CNNs and Existing Signal Processing Techniques” section and “Fitting Low-Rank Approximation in Rectified Linear Unit Convolutional Neural Networks” and “Network Depth.”

To test the perfect reconstruction in nonresidual CNNs, we propose the following procedure. 1) We assume an idealized model $G(\cdot)$, where its convolution filters, $K_n$ and $\tilde{K}_n$, comply with the phase-complementary tight-framelet condition, and where the biases and nonlinearities suppress low-amplitude (and negative for ReLU activations) samples from the feature maps. 2) The biases/thresholds of ReLU/shrinkage CNNs are set to zero (or to infinity for clipping activations). It can be observed that this condition prevents low-rank (or high-rank for residual models) approximation behavior of the idealized CNN. Under this circumstance, it should be possible to prove that the analyzed CNN can perfectly reconstruct any signal. 3) The last step involves simplifying the mathematical description of the model resulting from the previous point. The mathematical simplification of the model should lead to the identity function if the model complies with the perfect reconstruction property.

To conclude the explanation on the perfect reconstruction analysis, we provide two relevant considerations. First, it can be claimed that a residual network, such as the model $R(y) = y - G(y)$ discussed in (47), is able to reconstruct any

Fitting Low-Rank Approximation in Rectified Linear Unit Convolutional Neural Networks

To further understand the analogy between convolutional neural networks (CNNs) and low-rank approximation established by the theory of deep convolutional framelets, we can use as a starting point the definition of singular value decomposition, which is expressed in (15) by

$$y = \sum_{n=0}^{N_{SV}-1} (u_n v_n^\top) \cdot \sigma[n].$$

Given that each pair of left and right singular vectors $u_n v_n^\top$ generates an image $D[n]$, (15) can be rewritten as

$$y = \sum_{n=0}^{N_{SV}-1} D[n] \cdot \sigma[n] \qquad (S1)$$

where tensor $D = ((u_0 v_0^\top) \; \cdots \; (u_{N_{SV}-1} v_{N_{SV}-1}^\top))^\top$ contains the products of the left and right singular vectors and has dimensions of $(N_{SV} \times 1 \times M \times N)$. Furthermore, the equation can be further reformulated to

$$y = \tilde{K}^\top D \qquad (S2)$$

in which $\tilde{K}^\top = ((\sigma[0]) \; \cdots \; (\sigma[N_{SV}-1]))$, where the brackets of the $(1 \times 1)$ filters have been excluded for simplicity. In addition, it is now assumed that it is desirable to perform a low-rank approximation of signal $y$ based on the reformulation of (S2). If we assume that the entries of $D$ are nonnegative, then the low-rank approximation can be expressed by

$$\hat{y} = \tilde{K}^\top (D + b)_+ \qquad (S3)$$

in which the values of $b$ are set to zero for the channels of $D$ that have high contributions to the image content. Conversely, the channels $D[n]$ with less perceptual relevance are canceled by assigning large negative values to the corresponding entries of $b$. As a final reformulation, we can assume that the basis images $D$ are the result of decomposing the input image $y$ with a set of convolution filters, i.e., $D = Ky$; this transforms (S3) into

$$\hat{x} = \tilde{K}^\top (Ky + b)_+ . \qquad (S4)$$

Here, it is visible that (S4) is analogous to the encoding-decoding architecture defined in (10)–(14), and the encoder and decoder filters are akin to the framelet formulation presented in the “Framelets” section. Note that (S4) assumes that the entries of $D = Ky$ are positive, which may not always be true. In this situation, tensor $D$ requires redundant channels in which the respective phases are inverted to avoid signal loss. Furthermore, it should also be noted that in a CNN, the bias/threshold level is not inferred from the statistics of the feature maps but learned from the data presented to the network during training.

Multilayer designs

It should be noted that CNNs contain multiple layers, which recursively decompose/reconstruct the signal. This may pose an advantage with respect to conventional low-rank approximation algorithms for a few reasons. First, the data-driven nature of CNNs allows us to learn the basis functions, which optimally decompose and suppress noise in the signal. Second, as networks are deep, the incoming signal is recursively decomposed and sparsified. This multidecomposition scheme is very similar to the designs used in noise-reduction algorithms based on framelets. It can be noted that, in the past, recursive sparsifying principles have been observed in methods such as the (learned) iterative soft-thresholding algorithm [27], [28] as well as convolutional sparse coding. In fact, the convolutional sparse-coding approach has been used for interpreting the operation of CNNs [16].

What about practical implementations?

When training a CNN, the parameters of the model (i.e., $K$, $\tilde{K}^\top$, and $b$) are updated to reduce the loss between the processed noisy signal and the ground truth, which does not guarantee that the numerical values of the convolution filters and biases of the trained model comply with the assumptions made here. This is because CNNs do not have mechanisms to enforce that filters have properties such as sparsity or perfect reconstruction, or that the biases are negative. Consequently, CNNs may not necessarily perform a low-rank approximation of the signal, although the mathematical formulations of the low-rank approximation and of the single-layer encoding-decoding network are similar. Hence, the analysis presented here should be treated as insight into the mathematical formulation and/or potential properties that can be enforced for specific applications, and not as a literal description of what trained models do.
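The low-rank reading of (S3) can be reproduced numerically. The sketch below is an illustration under several assumptions: it uses NumPy, it takes the basis images directly from the SVD rather than from learned filters, and it adds phase-inverted copies of the basis images so that the ReLU does not discard their negative entries. The function name and the bias magnitude `big` are hypothetical.

```python
import numpy as np


def lowrank_via_relu(y, rank, big=10.0):
    """Rank-`rank` approximation of y written as decoder(ReLU(D + b))."""
    u, s, vt = np.linalg.svd(y, full_matrices=False)
    # Basis images D[n] = u_n v_n^T; their entries lie in [-1, 1].
    basis = np.einsum("in,nj->nij", u, vt)
    # Bias: zero for the retained channels, large and negative otherwise.
    keep = np.arange(len(s)) < rank
    bias = np.where(keep, 0.0, -big)[:, None, None]
    # Phase-complementary channels preserve negative entries of D[n].
    positive = np.maximum(basis + bias, 0.0)
    negative = np.maximum(-basis + bias, 0.0)
    # Decoder: 1x1 filters holding the singular values, with sign +/-.
    return np.einsum("n,nij->ij", s, positive - negative)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.standard_normal((6, 5))
    approx = lowrank_via_relu(image, rank=2)
    print(np.linalg.matrix_rank(approx))
```

Because the singular vectors have unit norm, any bias below $-1$ is enough to silence a channel; in a trained CNN the biases are learned, so this exact behavior is not guaranteed.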


signal when $G(y) = 0$ for any $y = x + \eta$. Still, this does not convey information about the behavior of the encoding-decoding network $G(\cdot)$, which should be able to perform a perfect decomposition and reconstruction of the noise signal $\eta$, as discussed in (48). To avoid this trivial solution, instead of analyzing the network $R(\cdot)$, the analysis described for nonresidual models is applied to the encoding-decoding structure $G(\cdot)$, which means that the residual connection is excluded from the analysis. The second concluding remark is that, to distinguish the equations of the perfect signal reconstruction analysis from other models, we mark the analyzed designs in which the low-rank approximation behavior is avoided by setting the bias values to zero with a special operator $P\{\cdot\}$.

For the analyses regarding the total number of operations of the trainable parameters, it is assumed that the tensors $K_0$, $\tilde{K}_0^\top$, $(\tilde{K}_0^u)^\top$, $(\tilde{K}_0^d)^\top$, $K_1$, and $\tilde{K}_1^\top$, shown in Figure 10, have dimensions of $(C_0 \times 1 \times N_f \times N_f)$, $(1 \times C_0 \times N_f \times N_f)$, $(1 \times C_0 \times N_f \times N_f)$, $(1 \times C_0 \times N_f \times N_f)$, $(C_1 \times C_0 \times N_f \times N_f)$, and $(C_0 \times C_1 \times N_f \times N_f)$, respectively. Here, $C_0$ and $C_1$

Network Depth

It should be noted that one of the key elements of convolutional neural networks (CNNs) is their network depth, which we address in this section. To illustrate the effect of network depth, assume an arbitrary N-layer encoding-decoding CNN, in which the encoding layers are defined by

$$\begin{aligned} E_0 &= (K_0 y + b_0)_+, \\ E_1 &= (K_1 E_0 + b_1)_+, \\ E_2 &= (K_2 E_1 + b_2)_+, \\ &\;\;\vdots \\ E_{N-1} &= (K_{N-1} E_{N-2} + b_{N-1})_+ \end{aligned} \qquad (S5)$$

or, compactly,

$$E_n = (K_n E_{n-1} + b_n)_+ . \qquad (S6)$$

Here, $E_n$ represents the encoded signal at the nth decomposition level, while $K_n$, $b_n$ are the convolution weights and biases for the nth encoding layer, respectively. As addressed in the “ReLU” and “TDCFs” sections, the role of the rectified linear unit activations is to enforce sparsity and nonnegativity, which can be interpreted as the process of suppressing noninformative bases in the low-rank approximation algorithm. Consequently, every encoded signal $E_n$ is an encoded sparsified version of the signal $E_{n-1}$. To recover the signal, we apply the decoder part of the CNN, given by

$$\begin{aligned} \tilde{E}_{N-1} &= (\tilde{K}_{N-1}^\top E_{N-1} + \tilde{b}_{N-1})_+, \\ &\;\;\vdots \\ \tilde{E}_1 &= (\tilde{K}_2^\top \tilde{E}_2 + \tilde{b}_2)_+, \\ \tilde{E}_0 &= (\tilde{K}_1^\top \tilde{E}_1 + \tilde{b}_1)_+, \\ \hat{x} &= (\tilde{K}_0^\top \tilde{E}_0 + \tilde{b}_0)_+ \end{aligned} \qquad (S7)$$

or, compactly,

$$\tilde{E}_{n-1} = (\tilde{K}_n^\top \tilde{E}_n + \tilde{b}_n)_+ . \qquad (S8)$$

Here, $\hat{x}$ is the low-rank estimate/denoised version of the input signal $y$, while $\tilde{E}_n$, $\tilde{K}_n^\top$, $\tilde{b}_n$ are the decoded signal components at the nth composition level and the decoder convolution weights and biases for the nth layer, respectively. In (S8), every decoded signal $\tilde{E}_n$ is the low-rank estimate of the encoded layer $E_{(n-1)}$. It should be noted that the activation of each of the decoder layers $(\cdot + \tilde{b}_n)_+$ can further enforce sparsity on the low-rank estimates $\tilde{E}_{(n-1)}$.

Summary

In conclusion, the mathematical formulation of deep networks is analogous to a recursive data-driven low-rank approximation, where the input to the successive encoding-decoding pairs is the low-rank approximated encoded signal generated by the encoder of the previous level. Still, as mentioned in “Fitting Low-Rank Approximation in Rectified Linear Unit Convolutional Neural Networks,” low-rank approximation algorithms and CNNs are similar in terms of mathematical formulation, but we cannot ensure that the values obtained during training for the encoding and decoding filters and their biases have the properties needed to ensure that a CNN is an exact recursive data-driven low-rank approximation. For example, it is possible that the filters of the encoder and decoder do not reconstruct the signal perfectly because this may not be necessary to reduce the loss function used to optimize the network.

Is it possible to impose a tighter relationship between low-rank approximation and CNNs?

In specific applications where signal preservation and interpretability are required (e.g., medical imaging), it is desirable that the operation of CNNs is closer to the low-rank approximation description. To achieve this, the CNNs embedded in frameworks such as the convolutional analysis operator [S1] and Fast Iterative Soft Thresholding Algorithm Network [S2] explicitly train the filters $K_n$ and $\tilde{K}_n$ to have properties such as perfect signal reconstruction and sparsity. By enforcing these characteristics, the mathematical descriptions of low-rank behavior and of CNNs are more similar, and the models become inherently more interpretable and predictable in their operation.

References

[S1] I. Y. Chun, Z. Huang, H. Lim, and J. Fessler, “Momentum-net: Fast and convergent iterative neural network for inverse problems,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 4, pp. 4915–4931, Apr. 2023, doi: 10.1109/TPAMI.2020.3012955.

[S2] J. Xiang, Y. Dong, and Y. Yang, “FISTA-net: Learning a fast iterative shrinkage thresholding network for inverse problems in imaging,” IEEE Trans. Med. Imag., vol. 40, no. 5, pp. 1329–1339, May 2021, doi: 10.1109/TMI.2021.3054167.


represent the number of channels after the first and second convolution layers, respectively, and all the convolution filters are assumed to be square with $N_f \times N_f$ pixels. Furthermore, the input signal $x$ has dimensions of $(1 \times 1 \times N_r \times N_c)$, where $N_r$ and $N_c$ denote the number of rows and columns, respectively.

The analyses shown for the different networks in this article have the following limitations. 1) The analyzed networks have only enough decomposition levels and convolution layers to understand their basic operation. The motivation for this simplification is to keep the analyses short and clear. Moreover, the same principles can be extended to deeper networks because the same single-decomposition CNNs would be recursively embedded within the given architectures. 2) The normalization layers are not considered because they are linear operators that provide mean shifts and amplitude scaling. Consequently, for analysis purposes, it can be assumed that they are embedded in the convolution weights. 3) For every encoder convolution kernel, it is assumed that there is at least one decoder filter. 4) No coadaptation between the filters of the encoder and decoder layers is considered.

The remainder of this section shows analyses of a few representative designs. Specifically, the chosen designs are the U-Net [32] and its residual counterpart, the filtered backprojection network [21]. (Matlab implementation by their authors available at https://github.com/panakino/FBPConvNet.) Additional designs analyzed here are the residual encoder-decoder CNN [5] (PyTorch implementation by their authors available at https://github.com/SSinyu/RED-CNN) as well as the learned wavelet frame-shrinkage network (LWFSN) (PyTorch implementation available at https://github.com/LuisAlbertZM/demo LWFSN TMI and interactive demo available at IEEE’s Code Ocean https://codeocean.com/capsule/9027829/tree/v1. The demo also includes, as reference, PyTorch implementations of FBPConvNet and the tight-frame U-Net.). For reference, all the designs are portrayed in Figure 10.

[Figure 10 graphic: three block diagrams. “U-Net/Filtered Backprojection Convolutional Network” $U(y)$: input $y$, an undecimated path and a decimated path with encoding (downsample, $W_L$, $K_0$, $b_0$, $K_1$, $b_1$) and decoding (upsample, concatenate + convolution) blocks, signal $Z$, and output $\hat{x}$. “Residual Encoder-Decoder CNN (RED)” $R(y)$: input $y$, encoder $R_0^{enc}(y)$, inner network $R_1(Q)$, decoder $R_0^{dec}(\hat{Q})$, and output $\hat{x}$. “LWFSN” $L(y)$: input $y$, forward DWT ($W_L$, $W_H$), encoding $K_0$, shrinkage with threshold $t_0$, decoding, inverse DWT, and output $\hat{x}$.]

FIGURE 10. Simplified structure of encoding-decoding ReLU CNNs. The displayed networks are the U-Net/filtered backprojection network, the encoder-decoder residual CNN (RED), and finally, the learned wavelet-frame shrinkage network (LWFSN). Note that for all the designs, the encoding-decoding structures are indicated by dashed blocks. It should be kept in mind that the drawings are simplified: they do not contain normalization layers, they are shallow, and commonly appearing dual convolutions are drawn as one layer.


U-Net/filtered backprojection network

U-Net—overview of the design

The first networks analyzed are the U-Net and filtered backprojection networks, both of which share the encoding-decoding structure $U(\cdot)$. However, they differ in that the U-Net is nonresidual, while the filtered backprojection network operates in a residual configuration. Therefore, an estimate of the noiseless signal $\hat{x}$ from the noisy observation $y$ in the conventional U-Net is achieved by

$$\hat{x} = U(y) \qquad (49)$$

whereas in the filtered backprojection network, $U(\cdot)$ is used in a residual configuration, which is equivalent to

$$\hat{x} = y - U(y). \qquad (50)$$

If we now switch our focus to the encoding-decoding structure of the U-Net $U(y)$, it can be shown that it is described by

$$U(y) = U^u(y) + U^d(y) \qquad (51)$$

where $U^u(y)$ corresponds to the undecimated path and is defined by

$$U^u(y) = (\tilde{K}_0^u)^\top (K_0 y + b_0)_+ \qquad (52)$$

while the decimated path is

$$U^d(y) = (\tilde{K}_0^d)^\top \tilde{W}_L^\top f_{(2\uparrow)}\big( (\tilde{K}_1^\top (K_1 Z + b_1)_+ + \tilde{b}_1)_+ \big). \qquad (53)$$

Here, signal $Z$ is defined by

$$Z = f_{(2\downarrow)}\big( W_L (K_0 y + b_0)_+ \big). \qquad (54)$$

Note that the decimated path contains two nested encoding-decoding architectures, as observed by Jin et al. [21], who acknowledged that the nested filtering structure is akin to the (learned) iterative soft-thresholding algorithm [27], [28].

U-Net—signal reconstruction analysis

To test whether the U-Net can perfectly reconstruct any signal, we assume that the biases are equal to zero; under this condition, the network $P\{U\}(y)$ is defined by

$$P\{U\}(y) = P\{U^u\}(y) + P\{U^d\}(y) \qquad (55)$$

where subnetwork $P\{U^u\}(\cdot)$ is defined by

$$P\{U^u\}(y) = (\tilde{K}_0^u)^\top (K_0 y)_+ . \qquad (56)$$

Assuming that $(K_0, \tilde{K}_0^u)$ is a complementary-phase tight-framelet pair, $P\{U^u\}(y)$ is simplified to

$$P\{U^u\}(y) = y \cdot c_0. \qquad (57)$$

Furthermore, the low-frequency path is

$$P\{U^d\}(y) = (\tilde{K}_0^d)^\top \tilde{W}_L^\top f_{(2\uparrow)}\big( (\tilde{K}_1^\top (K_1 P\{Z\})_+)_+ \big) \qquad (58)$$

where $P\{Z\}$ is defined by

$$P\{Z\} = f_{(2\downarrow)}\big( W_L (K_0 y)_+ \big). \qquad (59)$$

If $K_1$ is a phase-complementary tight frame, we know that $\tilde{K}_1^\top (K_1 Z)_+ = Z \cdot c_1$. Consequently, (58) becomes

$$P\{U^d\}(y) = (\tilde{K}_0^d)^\top \tilde{W}_L^\top f_{(2\uparrow)}\big( f_{(2\downarrow)}( W_L (K_0 y)_+ ) \big) \cdot c_1. \qquad (60)$$

Here, it can be noted that if $K_0$ is a phase-complementary tight framelet, then $P\{U^d\}(y)$ approximates a low-pass version of $y$, or equivalently,

$$P\{U^d\}(y) \approx W_L y \cdot c_1 \qquad (61)$$

where $W_L$ is a low-pass filter. Finally, substituting (57) and (61) in (55) results in

$$P\{U\}(y) \approx (I \cdot c_0 + W_L \cdot c_1)\, y. \qquad (62)$$

This result proves that the design of the U-Net cannot evenly reconstruct all the frequencies of $y$ unless $c_1 = 0$, in which case the whole low-frequency branch of the network is ignored. Note that this limitation is inherent to its design and cannot be circumvented by training with large datasets and/or with any loss function.
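The uneven frequency response implied by (62) can be seen in a small 1D experiment. The following sketch assumes, purely for illustration, that the low-pass branch $W_L$ with down- and upsampling acts as pairwise averaging followed by sample repetition, and that $c_0$ and $c_1$ are known gains; the function names are hypothetical.

```python
def average_pool_2(signal):
    """Downsample by two with pairwise averaging (assumed low-pass branch)."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]


def upsample_2(signal):
    """Upsample by two by repeating every sample."""
    out = []
    for value in signal:
        out.extend([value, value])
    return out


def idealized_unet(signal, c0, c1):
    """Linear part of (62): c0 * y plus c1 times a low-pass version of y."""
    low = upsample_2(average_pool_2(signal))
    return [c0 * a + c1 * b for a, b in zip(signal, low)]


if __name__ == "__main__":
    constant = [2.0, 2.0, 2.0, 2.0]        # lowest frequency
    alternating = [1.0, -1.0, 1.0, -1.0]   # highest frequency
    print(idealized_unet(constant, c0=1.0, c1=0.5))     # gain c0 + c1
    print(idealized_unet(alternating, c0=1.0, c1=0.5))  # gain c0 only
```

The constant input is amplified by $c_0 + c_1$, whereas the alternating input only sees $c_0$, which is the imbalance the analysis describes.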

U-Net—number of operations

It can be noted that encoding filter $K_0$ convolves $x$ at its original resolution and maps it to a tensor with $C_0$ channels. Therefore, the number of operations $O(\cdot)$ for kernel $K_0$ is $O(K_0) = C_0 \cdot N_r \cdot N_c \cdot N_f^2$ floating-point operations (FLOPs). Likewise, due to the symmetry between encoder and decoder filters, $O(\tilde{K}_0^u) = O(\tilde{K}_0^d) = O(K_0)$. Furthermore, for this design, filter $K_1$ processes the signal encoded by $K_0$, which is downsampled by a factor of one half, and maps it from $C_0$ to $C_1$ channels. This results in the estimated operation cost $O(K_1) = O(\tilde{K}_1) = C_0 \cdot C_1 \cdot N_r \cdot N_c \cdot N_f^2 \cdot (2)^{-2}$ [FLOPs]. Finally, adding the contributions of filters $K_0$, $\tilde{K}_0^u$, $\tilde{K}_0^d$, $K_1$, and $\tilde{K}_1$ results in

$$O(U) = (3 + 2^{-1} \cdot C_1) \cdot C_0 \cdot N_r \cdot N_c \cdot N_f^2 \;\; \text{[FLOPs]}. \qquad (63)$$
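As a quick sanity check of (63), the per-filter costs can be summed and compared with the closed form. This is a minimal sketch; the function names and the example sizes are hypothetical and are not taken from the article.

```python
def unet_flops_per_filter(c0, c1, n_r, n_c, n_f):
    """Sum the costs of K0, the two full-resolution decoders, K1 and its decoder."""
    per_channel = n_r * n_c * n_f ** 2
    full_resolution = 3 * c0 * per_channel          # K0 plus two decoder filters
    half_resolution = 2 * c0 * c1 * per_channel / 4  # K1 and its decoder, 2x decimated
    return full_resolution + half_resolution


def unet_flops_closed_form(c0, c1, n_r, n_c, n_f):
    """Closed-form expression of (63)."""
    return (3 + c1 / 2) * c0 * n_r * n_c * n_f ** 2


if __name__ == "__main__":
    # Hypothetical sizes: 64 and 128 channels, 512 x 512 image, 3 x 3 filters.
    print(unet_flops_per_filter(64, 128, 512, 512, 3))
    print(unet_flops_closed_form(64, 128, 512, 512, 3))
```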

U-Net—concluding remarks

The U-Net/FBPConvNet is a flexible multiresolution architecture. Still, as has been shown, the pooling structure of this CNN may be suboptimal for noise-reduction applications because its configuration does not allow the frequency content of the signal to be recovered evenly. This has been noted and fixed by Han and Ye [6], who introduced the tight-frame U-Net, in which the down-/upsampling structure is replaced by the DWT and its inverse. This simple modification overcomes the limitations of the U-Net and improves its performance for artifact removal in compressed sensing imaging.


Residual encoder-decoder CNN

Residual encoder-decoder CNN—overview of the design

The residual encoder-decoder CNN shown in Figure 10 consists of nested single-layer residual encoding-decoding networks. For example, in the network showcased in Figure 10, we can see that network $R_1(\cdot)$ is nested into $R_0(\cdot)$. Furthermore, for this case, the image estimate is given by

$$\hat{x} = (y + R_0(y) + \tilde{b}_0)_+ \qquad (64)$$

in which $R_0(\cdot)$ is the outer residual network and $\tilde{b}_0$ is the bias for the output layer. Note that the ReLU placed at the output layer intrinsically assumes that the estimated signal $\hat{x}$ is positive. From (64), the output of the subnetwork $R_0(\cdot)$ is defined by

$$Z = R_0^{\mathrm{dec}}(\hat{Q}). \qquad (65)$$

Here, the decoder $R_0^{\mathrm{dec}}(\cdot)$ is defined by

$$R_0^{\mathrm{dec}}(\hat{Q}) = \ldots \qquad (66)$$

$$\hat{Q} = (Q + R_1(Q) + \tilde{b}_1)_+ \qquad (67)$$

where the network $R_1(\cdot)$ is

$$R_1(Q) = \tilde{K}_1^\top (K_1 Q + b_1)_+ . \qquad (68)$$

Furthermore, $Q$ represents the signal encoded by $R_0(\cdot)$, or equivalently,

$$Q = R_0^{\mathrm{enc}}(y) \qquad (69)$$

where $R_0^{\mathrm{enc}}(\cdot)$ is defined by

$$R_0^{\mathrm{enc}}(y) = K_0 y. \qquad (70)$$

Residual encoder-decoder CNN—signal reconstruction analysis

u 1 (K 1 Q) + .(71) P " R 1 , (Q) = K R

Under complementary-phase tight-frame assumptions for the u 1), (71) reduces to pair (K 1, K

P " R 1 , (Q) = Q (72)

which shows that the encoder and decoder R 1 ($) can approximately reconstruct any signal. Now, switching to R 0, it can be observed that the linear part is 56

Just as with R 1 ($), it is assumed that the convolution kernels are tight framelets. Therefore, (73) becomes P " R 0 , (y) = y.(74)



Consequently, R 0 ($) and R 1 ($) can reconstruct any arbitrary signal under complementary-phase tight-frame assumptions.

Residual encoder-decoder CNN—number of operations In this case, all the convolution layers operate at the original resolution of image x. Therefore, the number of operations O ($) u 0 is O (K 0) = O (K u 0) = C 0 $ N r $ N c $ N 2f for kernel K 0 and K u 1 require O (K 1) = O (K u 1) = [FLOPs], while K 1 and K 2 C 0 $ C 1 $ N r $ N c $ N f [FLOPs]. By adding the contributions of both encoding-­decoding pairs, the total operations for the residual encoder-decoder becomes O (R) = 2 $ (1 + C 1) $ C 0 $ N r $ N c $ N 2f [FLOPS] .(75)

Residual encoder-decoder CNN—concluding remarks The residual encoder-decoder network consists of a set of nested single-resolution residual encoding-decoding CNNs. The singleresolution design increases its computation cost with respect to multiresolution designs, such as the U-Net. In addition, it should be noted that the use of an ReLU as the output layer of the encoder-decoder residual network forces the signal estimates to be positive, but this is not always convenient. For example, in computerized tomography imaging, it is common that images contain positive and negative values.

LWFSN LWFSN—description of the architecture The LWFSN network is a multiresolution architecture in which the DWT is used for down-/upsampling and also as part of the decomposition where shrinkage is applied. In this CNN, the noiseless estimates are produced by xt = L (y) (76)



As mentioned earlier, the residual encoder-decoder CNN is composed by nested residual blocks, which are independently analyzed to study the reconstruction characteristics of this network. First, block R 1 ($), is given by

R



t .(66) u R0 Q K

t is the noiseless estimate of the intermediate signal Q In (66), Q and is defined by

u 0 (K 0 y) + .(73) P " R 0 , ( y) = K



where L ($) represents the encoding-decoding structure of the LWFSN, and the encoding-decoding network L ($) is L (y) = L L (y) + L H (y) .(77)



Here, the high-frequency path is given by u 0W u H f(2 -) ^x (LET L H (y) = K t 0) ^ f(2 .) ^ W H K 0 y hhh . (78) R



R

Note that in this design, the encoder leverages the filter W H to generate a sparse signal prior to the shrinkage stage, i.e., LET x ( t 0) ^ f(2 .) ^ W H K 0 y hh . Meanwhile, the low-frequency path L L ($) is u 0W u L f(2 -) ^ f(2 .) ^W L K 0 y hh .(79) L L ( y) = K R
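The two paths in (77)-(79) can be illustrated numerically. The following sketch is not the LWFSN implementation: it assumes a 1D signal, takes $K_0$ and $\tilde{K}_0$ as the identity, uses a single-level orthonormal Haar DWT for $(W_L, W_H)$, and swaps in plain soft thresholding for the LET shrinkage. All function names are hypothetical.

```python
import numpy as np


def haar_analysis(y):
    """Filter with the Haar pair and decimate by two (even-length 1D input)."""
    even, odd = y[0::2], y[1::2]
    low = (even + odd) / np.sqrt(2.0)    # W_L followed by f_(2 down)
    high = (even - odd) / np.sqrt(2.0)   # W_H followed by f_(2 down)
    return low, high


def haar_synthesis(low, high):
    """Upsample by two and apply the transposed Haar filters."""
    y = np.empty(2 * low.size)
    y[0::2] = (low + high) / np.sqrt(2.0)
    y[1::2] = (low - high) / np.sqrt(2.0)
    return y


def soft_threshold(x, t):
    """Soft shrinkage, used here as a stand-in for the LET shrinkage."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def lwfsn_like(y, t0):
    """Sum of a linear low-frequency path and a shrunk high-frequency path."""
    low, high = haar_analysis(y)
    low_path = haar_synthesis(low, np.zeros_like(high))                         # cf. (79)
    high_path = haar_synthesis(np.zeros_like(low), soft_threshold(high, t0))    # cf. (78)
    return low_path + high_path                                                 # cf. (77)


rng = np.random.default_rng(0)
y = rng.normal(size=64)
print(np.max(np.abs(lwfsn_like(y, t0=0.0) - y)))   # reconstruction error with t0 = 0
```

With `t0 = 0`, the sum of both paths returns the input up to rounding error, whereas a very large threshold leaves only the low-frequency path, i.e., pairwise averages of the input.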




LWFSN—signal reconstruction analysis

When analyzing signal propagation of the LWFSN, we set the threshold level $t_0 = 0$. This turns (77) into

$P\{L\}(y) = P\{L_L\}(y) + P\{L_H\}(y).$ (80)

Here, $P\{L_H\}(\cdot)$ is defined by

$P\{L_H\}(y) = \tilde{K}_0^\top \tilde{W}_H^\top f_{(2\uparrow)}\big(f_{(2\downarrow)}(W_H K_0 y)\big)$ (81)

while the low-frequency path $P\{L_L\}(\cdot)$ is mathematically described by

$P\{L_L\}(y) = \tilde{K}_0^\top \tilde{W}_L^\top f_{(2\uparrow)}\big(f_{(2\downarrow)}(W_L K_0 y)\big).$ (82)

Substituting (81) and (82) in (80) results in

$P\{L\}(y) = \tilde{K}_0^\top \tilde{W}^\top f_{(2\uparrow)}\big(f_{(2\downarrow)}(W K_0 y)\big).$ (83)

For the DWT, it holds that $Q = \tilde{W}^\top f_{(2\uparrow)}\big(f_{(2\downarrow)}(W Q)\big)$. Consequently, (83) is simplified to

$P\{L\}(y) = \tilde{K}_0^\top K_0 y.$ (84)

Assuming that $K_0$ is a tight framelet, i.e., $\tilde{K}_0^\top K_0 = I \cdot c$, with $c = 1$, then

$P\{L\}(y) = y.$ (85)

This proves that the encoding-decoding section of the LWFSN allows for perfect signal reconstruction.

LWFSN—number of operations

The LWFSN contains a simpler convolution structure than the networks reviewed up to now. Therefore, for a single-level decomposition architecture, the total number of operations is

$O(L) = 2 \cdot C_0 \cdot N_r \cdot N_c \cdot N_f^2$ [FLOPs]. (86)

LWFSN—residual variant

To illustrate the use of clipping activations in residual noise reduction, the residual version of the LWFSN is included. Note that there are two main differences with the conventional LWFSN. First, the shrinkage functions are replaced by clipping activations. Second, the low-frequency signal is suppressed. This is performed because the original design of the LWFSN does not have any nonlinearities in that section. This is akin to the low-frequency nulling proposed by Kwon and Ye [33]. The modified LWFSN is shown in Figure 11. It can be observed that by setting the low-frequency branch of the design to zero, the model inherently assumes that the noise is high pass.

FIGURE 11. The residual version of the LWFSN. It can be noticed that the low-frequency branch of the network is nulled. In deeper networks, it would be further decomposed, and the nulling would be activated at the deepest level (lowest resolution).

(Residual) LWFSN—concluding remarks

The (residual) LWFSN, or (r)LWFSN, is a design that explicitly mimics wavelet-shrinkage algorithms. It can be observed that the (r)LWFSN inherently assumes that noise is high frequency and explicitly avoids nonlinear processing in the low-frequency band. Follow-up experiments also included nonlinearities in the low-frequency band of the LWFSN [34] and obtained results similar to the original design.

What happens in trained models?

Properties of convolution kernels and low-rank approximation

The assumption that the convolution filters of a CNN behave as (complementary-phase) tight framelets is useful for analyzing the theoretical ability of a CNN to propagate signals. However, it is difficult to prove that trained models comply with this assumption because there are diverse elements affecting the optimization of the model, e.g., initialization of the network, the data presented to the model, and the optimization algorithm as well as its parameters. In addition, in real CNNs, there may be coadaptation between diverse CNN layers, which may prevent the individual filters of the CNN from behaving as tight framelets, as the decomposition and filtering performed by one layer is not independent from the rest [35]. To test whether the behavior of the filters of trained CNNs can converge to complementary-phase tight framelets, at least in a simplified environment, we propose training a toy model, as displayed in Figure 12. If the trained filters of an encoder-decoder pair of the toy model $(K_l, \tilde{K}_l)$ (where $l$ denotes one of the decomposition levels) behave as a complementary-phase tight framelet, then the pair $(K_l, \tilde{K}_l)$ approximately complies with the condition presented in (40), which, for identity input $I$, simplifies to

$\tilde{K}_n^\top (K_n)_+ = I \cdot c_n$ (87)

in which $c_n$ is an arbitrary constant.

FIGURE 12. The toy model used for the experiment on the properties of the filters of a trained CNN. The dimensions for tensors $K_0$, $K_1$, and $K_2$ are (6 × 1 × 3 × 3), (12 × 6 × 3 × 3), and (24 × 12 × 3 × 3), respectively. The network is symmetric, and the filter dimensions for the decoder kernels $\tilde{K}_n$ are the same as those of their corresponding encoding convolution kernels $K_n$.

The toy model is trained on images that contain multiple randomly generated overlapping triangles. All the images were scaled to the range of [0,1]. For this experiment, the input to the model is the noise-contaminated image, and the objective/desired output is the noiseless image. For training the CNNs, normally distributed noise with a standard deviation of 0.1 was added to the ground truth. For every epoch, a batch of 192 training images was generated. For validation and test images, we used the “astronaut” and “cameraman” images included in the Scipy software package. The model was optimized with Adam for 25 epochs with a linearly decreasing learning rate. The initial learning rate for the optimizer was set to $10^{-3}$, and the batch size was set to one sample. The convolution kernels were initialized with Xavier initialization using a uniform distribution (see Glorot and Bengio [36]). The code is available at IEEE’s Code Ocean at https://codeocean.com/capsule/7845737/tree.

Using the described settings, we trained the toy model and tested whether the phase-complementary tight-framelet property holds for the filters of the deepest level: $l = 2$. The results for the operation $\tilde{K}_2^\top (K_2)_+$ are displayed in Figure 13(a), which shows that when the weights of the encoder and decoder have different initial values, the kernel pair $(K_2, \tilde{K}_2)$ is not a complementary-phase tight framelet. We have observed that the forward and inverse filters of wavelets/framelets are often the same or at least very similar. Based on this reasoning, we initialized the toy model with the same initial values for the kernel pair $(K_n, \tilde{K}_n)$. As shown in Figure 13(b), with the proposed initialization, the filters of the CNN converge to tensors with properties reminiscent of complementary-phase tight framelets. This suggests that the initialization of the CNN has an important influence on the convergence of the model to a specific solution.

Figure 14 displays test images processed with two toy models, one trained with different and one trained with the same initial values for the encoding-decoding pairs. It can be observed that there are no significant differences between the images produced by the two models. In Figure 14(e) and (f), we set the bias of both networks to zero. In this case, it is expected that the networks will reconstruct the noisy input, as confirmed by the figure, where both CNNs partly reconstruct the original noisy signal. This result suggests that the ReLU plus bias pairs operate akin to the low-rank approximation mechanism proposed by the TDCFs.
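The criterion in (87) can be made concrete with a small numerical sketch. It is not the authors' test code: dense matrices stand in for the convolution tensors, the complementary-phase pair is built by hand from an orthonormal matrix $B$ as $K = [B; -B]$ with $\tilde{K} = K$, and the helper names are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def framelet_product(k_enc, k_dec):
    """Left-hand side of (87): decoder^T applied to the rectified encoder output
    for an identity input."""
    identity = np.eye(k_enc.shape[1])
    return k_dec.T @ relu(k_enc @ identity)


def identity_mismatch(m):
    """Relative distance between m and its closest scaled identity c * I."""
    c = np.trace(m) / m.shape[0]
    return np.linalg.norm(m - c * np.eye(m.shape[0])) / np.linalg.norm(m)


rng = np.random.default_rng(0)
n = 8
basis, _ = np.linalg.qr(rng.normal(size=(n, n)))   # orthonormal matrix B

# Complementary-phase pair with shared encoder/decoder values.
k_paired = np.vstack([basis, -basis])
matched = framelet_product(k_paired, k_paired)

# Independently drawn encoder and decoder, as a contrasting case.
k_enc_random = rng.normal(size=(2 * n, n))
k_dec_random = rng.normal(size=(2 * n, n))
unmatched = framelet_product(k_enc_random, k_dec_random)

print("paired mismatch:  ", identity_mismatch(matched))
print("unpaired mismatch:", identity_mismatch(unmatched))
```

For the hand-built pair, $B^\top (B)_+ - B^\top (-B)_+ = B^\top B = I$, so the product is the identity with $c = 1$; for the independent pair, there is no reason for the product to resemble a scaled identity.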

FIGURE 13. The phase-complementary tight-framelet test for the trained toy network, initialized with random weights (color scale from –1 to 1). (a) The product $\tilde{K}_2^\top (K_2)_+$, where the initializations of $K_2$ and $\tilde{K}_2$ are different. It can be seen that the pair $(K_2, \tilde{K}_2)$ does not comply with the complementary-phase framelet criterion of (87). This contrasts with (b), which displays the result of the product $\tilde{K}_2^\top (K_2)_+$ for the same CNN but where the initial values of $\tilde{K}_2$ and $K_2$ are identical. For this initialization, the filters approximate the complementary-phase tight-framelet criterion.

The following conclusions can be drawn from this experiment. First, the filters of the CNN may not necessarily converge to complementary-phase tight framelets. This is possibly caused by initialization of the network and/or the interaction/coadaptation between the multiple encoder/decoder layers. Second, we confirm that for our experimental setting, the low-rank approximation behavior in the CNN can be observed. For example, when setting the biases and thresholds to zero, part of the noise texture (high-rank information) is recovered. Third, it is possible that linear filtering happens in the network as well, which may explain why noise texture is not fully recovered when setting the biases to zero. Fourth and finally, we observed that the behavior of the trained models changes drastically depending on factors such as the learning rate and initialization values of the model. For this reason, we consider this experiment and its outcome more as a proof of concept, where further investigation is needed.

FIGURE 14. The processed “cameraman” image for (in)dependently sampled initialization of the encoding and decoding filters. (a) The noise-contaminated input ($\sigma_\eta = 0.1$, SNR = 15.33 dB) and (d) the noiseless reference. (b) and (e) The noisy image processed with the toy model trained with different initializations for its encoding and decoding filters ($K_n \neq \tilde{K}_n$), while (c) and (f) are images processed with the model where the same initial values are used for the encoding and decoding filters ($K_n = \tilde{K}_n$). (b) and (c) are nearly identical in terms of quality and signal-to-noise ratio (SNR = 23.04 dB and 23.09 dB, respectively), which indicates that the initialization has little effect on the output quality. (e) and (f) The same models that produced (b) and (c) but with their biases set to zero ($b_n = 0$; SNR = 19.43 dB and 19.02 dB, respectively). As expected, the noise is partly reconstructed.

Generalization

From the explanations in the “Nonlinear Signal Estimation in the Framelet Domain” section, it can be noted that the bias/threshold used in CNNs can modulate how much of the signal is suppressed by the nonlinearities. In addition, the “Additional Links Between Encoding-Decoding CNNs and Existing Signal Processing Techniques” section established that there are additional mechanisms for noise reduction within the CNN, such as the Wiener-like behavior observed by Mohan et al. [14]. This raises the question of how robust conventional CNNs are to noise levels that differ from the level at which the model has been trained. To answer this question, we trained two variants of the toy model. The first variant was inspired by the multiscale sparse coding network of Mentl et al. [17], where the biases of each of the nonlinearities (in this case, an ReLU) are multiplied by an estimate of the standard deviation of the noise. In the design of this example, the noise estimate is $\hat{\sigma}_\eta$, which, in accordance with Chang et al. [1], is defined by

$\hat{\sigma}_\eta = 1.4826 \cdot \mathrm{median}(|f_{HH} * x|).$ (88)

Here, variable $f_{HH}$ is the diagonal convolution filter of the DWT with Haar basis. For comparison purposes, we refer to this model as the adaptive toy model. The second variant of the toy model that was tested examines the case where the convolution layers of the model do not add bias to the signal. This model is based on the bias-free CNNs proposed by Mohan et al. [14], in which the bias of every convolution filter is set to zero during training. The purpose of this setting is to achieve better generalization of the model, as it is claimed that this modification causes the model to behave independently of the noise level. We trained the described variants of the toy model with the same settings as the experiment in the “Properties of Convolution Kernels and Low-Rank Approximation” section. The three models are evaluated on the test image with varying noise levels: $\sigma_\eta \in [0.1, 0.15, 0.175, 0.2, 0.225]$. The result of this evaluation is displayed in Figure 15. These results confirm that the performance of the original toy model degrades for higher noise levels. In contrast, the adaptive and bias-free toy models perform better than the original toy model for most of the noise levels.

The results of this experiment confirm the diverse noise-reduction mechanisms within a CNN and show that CNNs have certain modeling limitations, for example, a lack of noise invariance. Such limitations can be addressed by further incorporating prior knowledge into the model, as in the adaptive model, or by forcing the model to have a more Wiener-like behavior, as in the bias-free model. In the case of the bias-free model, note that, theoretically, it should be possible to obtain exactly the same behavior with the original toy model if the biases of the model had converged to zero. This reasoning suggests that the large number of free parameters and the nonlinear behavior of the model can potentially prevent finding the optimal/robust solution, in which case the incorporation of prior knowledge can help improve the model.

Which network fits my problem?

Design elements

When choosing or designing a CNN for a specific noise-reduction application, multiple choices and design elements should be considered, for example, the required performance, the memory required to train/deploy models, whether certain signal preservation characteristics are required, the target execution time for the model, the characteristics of the images being processed, and so on. Based on these requirements, diverse design elements of CNNs can be more or less desirable, for example, the activation functions, the use of single/multiresolution models, the need for skip connections, and so forth. The following sections briefly discuss such elements by focusing on their impact in terms of performance and potential computational cost. A summary of the main conclusions on these elements is included in Table 2.
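To put the computational-cost aspect in numbers, the closed-form operation counts derived for the three reviewed designs, (63), (75), and (86), can be evaluated side by side. The sketch below only restates those formulas; the function names and the example sizes (6 and 12 channels, a 256 × 256 image, 3 × 3 filters) are assumptions made here for illustration.

```python
def flops_unet(c0, c1, n_r, n_c, n_f):
    """Single-level U-Net, (63)."""
    return (3 + c1 / 2) * c0 * n_r * n_c * n_f ** 2


def flops_residual(c0, c1, n_r, n_c, n_f):
    """Residual encoder-decoder CNN, (75): every layer runs at full resolution."""
    return 2 * (1 + c1) * c0 * n_r * n_c * n_f ** 2


def flops_lwfsn(c0, n_r, n_c, n_f):
    """Single-level LWFSN, (86)."""
    return 2 * c0 * n_r * n_c * n_f ** 2


# Hypothetical sizes, chosen only for illustration.
sizes = dict(n_r=256, n_c=256, n_f=3)
costs = {
    "U-Net": flops_unet(6, 12, **sizes),
    "Residual encoder-decoder": flops_residual(6, 12, **sizes),
    "LWFSN": flops_lwfsn(6, **sizes),
}
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name:26s} {cost / 1e6:8.2f} MFLOPs")
```

For these example sizes, the single-resolution residual design is the most expensive and the LWFSN the cheapest, in line with the remarks made for each architecture; the ordering between the U-Net and the residual design holds whenever $C_1 > 2/3$.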

SNR values (dB) reported in the panels of Figure 15:

Panel | $\sigma_\eta = 0.1$ | $\sigma_\eta = 0.15$ | $\sigma_\eta = 0.175$ | $\sigma_\eta = 0.2$ | $\sigma_\eta = 0.225$
(a) Input | 15.31 | 11.79 | 10.45 | 9.27 | 8.25
(b) Original | 23.09 | 21.24 | 19.90 | 18.49 | 17.16
(c) Adaptive | 23.09 | 21.10 | 20.23 | 19.45 | 18.89
(d) Bias free | 23.14 | 21.57 | 20.82 | 20.04 | 19.44

FIGURE 15. A comparison of the baseline (original) toy model against its adaptive and bias-free variants. The models are evaluated on the cameraman picture with increasing noise levels. (a) The noisy input. (b) The images processed with the original toy model. (c) The results of the adaptive toy model. (d) The results of the bias-free model. It can be observed that the performance of the original toy model degrades as the noise level increases, while the performance of the adaptive and bias-free models degrades less with increased noise levels, resulting in pictures with lower noise levels.
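The adaptive model compared in Figure 15 scales its biases by the noise estimate of (88). A minimal sketch of that estimator is given below, assuming a unit-norm 2 × 2 Haar diagonal filter applied without decimation; the function names and the synthetic test image are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def haar_hh(x):
    """Diagonal (HH) Haar detail of a 2D image using the unit-norm 2 x 2 filter
    0.5 * [[1, -1], [-1, 1]], evaluated at every valid position."""
    return 0.5 * (x[:-1, :-1] - x[:-1, 1:] - x[1:, :-1] + x[1:, 1:])


def estimate_noise_std(x):
    """Median-absolute-deviation noise estimate, following (88)."""
    return 1.4826 * np.median(np.abs(haar_hh(x)))


# Synthetic example: a smooth ramp image plus white Gaussian noise.
rng = np.random.default_rng(0)
clean = np.outer(np.linspace(0.0, 1.0, 256), np.linspace(0.0, 1.0, 256))
noisy = clean + rng.normal(scale=0.1, size=clean.shape)

sigma_hat = estimate_noise_std(noisy)
print(f"estimated noise standard deviation: {sigma_hat:.3f}")
```

Because the filter has unit norm, white Gaussian noise keeps its standard deviation after filtering, while smooth image content is largely cancelled; the factor 1.4826 converts the median absolute value of a zero-mean Gaussian into its standard deviation. In an adaptive layer, a bias $b$ would then be applied as $b \cdot \hat{\sigma}_\eta$.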


Nonlinearity

In the literature, the most common activation function in CNNs is the ReLU. There are two main advantages of an ReLU with respect to other activations. First, ReLUs potentially enforce more sparsity in the feature maps than, for example, soft shrinkage, because ReLUs cancel not only the small values of feature maps, as shrinkage functions do, but also all the negative values. The second advantage of an ReLU is its capacity to approximate other functions (see the “Shrinkage and Clipping in ReLU Networks” section). Note that the high capacity of the ReLU to represent other functions [13], [29] (often referred to as expressivity) may also be one of the reasons why these models are prone to overfitting.

The better expressivity of ReLU CNNs may be the reason why, at the time of this writing, ReLU-based CNNs perform marginally better than shrinkage-based models in terms of metrics such as the signal-to-noise ratio or the structural similarity index metric [19], [37], [38]. Despite this small benefit, the visual characteristics of estimates produced by ReLU-based and shrinkage-based networks are very similar. Furthermore, the computational cost of ReLU-based designs is potentially higher than that of designs with shrinkage functions because ReLUs require more feature maps to preserve signal integrity. For example, the LWFSN shown in the “LWFSN” section achieves a performance very close to that of the FBPConvNet and the tight-frame U-Net for noise reduction in computerized tomography, but with only a small fraction of the total trainable parameters, which allows for a faster and computationally cheaper model [19].

As a concluding remark, it can be noted that regardless of the expressivity of the ReLU activation, it is not entirely clear whether ReLU activations outperform other functions, such as the soft threshold, in general. We were unable to find articles that specifically focus on comparing the performance of ReLU- and shrinkage-based models. In spite of this, there are some works that compare shrinkage-based CNNs with other (architecturally different) models based on ReLUs and that indicate that the compared ReLU-based designs slightly outperform shrinkage-based ones. For example, Herbreteau and Kervrann [38] proposed the DCT2-Net, a shrinkage-based CNN, which, despite its good performance, is still outperformed by the ReLU-based denoising CNN (DnCNN) [7]. Similar behavior was observed by Zavala-Mondragón et al. [19], where the shrinkage-based LWFSN could outperform neither the ReLU-based FBPConvNet [21] nor the tight-frame U-Net [6]. Another similar case is the deep K-singular-value-decomposition network [37], which achieves a performance close to, but slightly below, that of the ReLU-based DnCNN. Among the few counterexamples we found is the work of Fan et al. [18], who compared variants of the soft autoencoder and found that the shrinkage-based model outperformed the ReLU variant.
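Both properties attributed to the ReLU above, its ability to express shrinkage and clipping and its stronger sparsification, can be checked with a few lines. The sketch uses standard identities (soft shrinkage as the difference of two shifted ReLUs, clipping as the input minus its shrunk version); the function names and the Gaussian test data are assumptions made here for illustration.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def soft_shrink(x, t):
    """Reference soft-shrinkage (soft-threshold) function."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def soft_shrink_from_relus(x, t):
    """Soft shrinkage written as the difference of two biased ReLUs."""
    return relu(x - t) - relu(-x - t)


def clip_from_relus(x, t):
    """Clipping to [-t, t], obtained as the residual of the shrinkage."""
    return x - soft_shrink_from_relus(x, t)


rng = np.random.default_rng(0)
features = rng.normal(size=10_000)   # zero-mean stand-in for a feature map
t = 0.5

# Fraction of exactly-zero outputs for a single biased ReLU and for soft shrinkage.
zeros_relu = np.mean(relu(features - t) == 0.0)
zeros_soft = np.mean(soft_shrink(features, t) == 0.0)
print(f"zeroed by ReLU: {zeros_relu:.2f}, zeroed by soft shrinkage: {zeros_soft:.2f}")
```

A single biased ReLU zeroes every value below the threshold, including all negative ones, whereas soft shrinkage only zeroes values whose magnitude is below the threshold; this is the sparsity difference described in the text, and it is also why a pair of ReLU channels is needed to retain both signs of the signal.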

Single/multiscale designs

The advantage of single-scale models is that they avoid aliasing because no down-/upsampling layers are used. Still, this comes at the expense of more computations and more memory. Furthermore, single-scale models may require larger filters and/or deeper networks to achieve the same receptive field as multiscale models, which may further increase their computation cost. In the case of multiscale models, the main consideration is that the down-/upsampling structure should allow perfect signal reconstruction to avoid introducing aliasing and/or distortion into the image estimates (e.g., the DWT in the tight-frame U-Net and in the LWFSN).

(Non)residual models

Residual noise-reduction CNNs often perform better than their nonresidual counterparts (e.g., the U-Net versus FBPConvNet and the LWFSN versus the rLWFSN). This may be because the trained models have more freedom to learn the filters: the design does not need to learn to reconstruct the noiseless signal; it need only estimate the noise [12]. Also, it can be observed that nonresidual models potentially need more parameters than residual networks because the propagation/reconstruction of the noiseless signal is also dependent on the number of channels of the network.

Table 2. Design elements and their impact on performance and computation cost.

Design element | Expressivity | Performance | Number of parameters | Receptive field per layer
Activation: ReLU | High | Best | High | Not applicable (N/A)
Activation: Shrinkage | Low | Good | Medium | N/A
Activation: Clipping | Low | Good | Medium | N/A
Scale: Single scale | High | Good | High | Big
Scale: Multiscale | High | Good | Medium/high | Small
Topology: Nonresidual | High | Good | Higher | N/A
Topology: Residual | High | Best | Lower | N/A

State of the art

Defining the state of the art in image denoising with CNNs is challenging for diverse reasons. First, there is a wide variety of available CNNs, which are not often compared to each other. Second, the suitability of a CNN for a given task may depend on image and noise characteristics, such as noise distribution and (non)stationarity. Third, the large number of variables in terms of, e.g., optimization, data, and data augmentation, adds reproducibility issues, which further complicate making a fair comparison among all the available models [11]. In addition, it should be noted that for many of the existing models, the performance gap between state-of-the-art models and other CNNs is often small.

Despite the aforementioned challenges, we have found some models that could be regarded as the state of the art. The first is the denoising residual U-Net (DRU-Net) [39], which is a bias-free model [14] that incorporates a U-Net architecture with residual blocks. In addition, the DRU-Net uses an additional input to indicate the noise intensity to the network, which increases its generalization to different noise levels. An additional state-of-the-art model is DnCNN [7]. This network is residual and single-scale and uses ReLU activations. Another state-of-the-art model is the multilevel-wavelet CNN [40], which has a design very similar to that of the tight-frame U-Net [6]. Both of these models are based on the original U-Net design [32] but are deployed in a residual configuration, and their down-/upsampling structure is based on the DWT. Furthermore, in addition to being used as standalone encoding-decoding networks, CNNs have been used as proximal operators within model-based methods [39], [41], which further improves the denoising power of nonmodel-based encoding-decoding CNNs.

Conclusions and future outlook

In this article, the widely used encoding-decoding CNN architecture was analyzed from several signal processing principles. This analysis revealed the following conclusions.

1) Multiple signal processing concepts converge in the mathematical formulation of encoding-decoding CNN models. For example, the convolution and down-/upsampling structure of the encoder-decoder is akin to the framelet decomposition, and the activation functions are rooted in classical signal estimators. In addition, linear filtering may also happen within the model.
2) The activations implicitly assume noise and signal characteristics of the feature maps.
3) There are still many signal processing developments that can be integrated with current CNNs, further improving their performance in terms of accuracy, efficiency, or robustness.

Despite the signal processing nature of encoding-decoding CNNs, at the time of this writing, the integration of CNNs and existing signal processing algorithms is at an early stage. A clear example of the signal modeling limitations of current CNN denoisers is the activation functions, where the estimators provided by current activation layers neglect the spatial correlation of the feature maps. Possible alternatives for solving this limitation could be activation functions inspired by denoisers working on principles such as Markov random fields [42], locally spatial indicators [43], and multiscale shrinkage [24]. Further ideas are provided by the extensive survey on denoising algorithms by Pižurica and Philips [44]. Additional approaches that can be further explored are nonlocal [45] and collaborative filtering [46]. Both techniques exploit the redundancy in natural images, and only a few models are exploring these properties [47], [48]. Finally, we encourage the reader to actively consider the properties of the signals processed, the design requirements, and existing signal processing algorithms when designing new CNNs. By doing so, we expect that the next generation of CNN denoisers will not only perform better but also be more interpretable and reliable.

Acknowledgment

We thank Dr. Ulugbek Kamilov and the anonymous reviewers for their valuable feedback and suggestions for this article.

Authors

Luis Albert Zavala-Mondragón ([email protected]) received his M.Sc. degree in electrical engineering from Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands, where he is currently a Ph.D. candidate in signal processing. He has experience in the field of hardware emulation (Intel, Mexico) and computer vision (Thirona, The Netherlands). His research interests include the development of efficient and explainable computer vision pipelines for health-care applications. He is a Student Member of IEEE.

Peter H.N. de With ([email protected]) received his Ph.D. degree in computer vision from Delft University of Technology. He is a full professor and leads the Video Coding and Architectures research group at Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands. He was a researcher at Philips Research Labs, full professor at the University of Mannheim, and VP of video technology at CycloMedia. He is the coauthor of more than 70 refereed book chapters and journal articles, 500 conference publications, and 40 international patents. He was a Technical Committee member of the IEEE Consumer Electronics Society, ICIP, and the Society of Photo-Optical Instrumentation Engineers, and he is a co-recipient of multiple paper awards. He is a Fellow of IEEE and a member of the Royal Holland Society of Sciences and Humanities.

Fons van der Sommen ([email protected]) received his Ph.D. degree in computer vision. He is an associate professor at Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands. As head of the health-care cluster at the Video Coding and Architectures research group, he has worked on a variety of image processing and computer vision applications, mainly in the medical domain. His research interests include signal processing and information theory, and he strives to exploit methods from these fields to improve the robustness, efficiency, and interpretability of modern-day artificial intelligence architectures, such as convolutional neural networks. He is a Member of IEEE.

References

[1] S. G. Chang, B. Yu, and M. Vetterli, “Adaptive wavelet thresholding for image denoising and compression,” IEEE Trans. Image Process., vol. 9, no. 9, pp. 1532– 1546, Sep. 2000, doi: 10.1109/83.862633. [2] M. Elad and M. Aharon, “Image denoising via learned dictionaries and sparse representation,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit. (CVPR), Piscataway, NJ, USA: IEEE Press, 2006, vol. 1, pp. 895–900, doi: 10.1109/ CVPR.2006.142. [3] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Phys. D, Nonlinear Phenomena, vol. 60, nos. 1–4, pp. 259–268, Nov. 1992, doi: 10.1016/0167-2789(92)90242-F.

IEEE SIGNAL PROCESSING MAGAZINE

|

November 2023

|

[4] K. H. Jin and J. C. Ye, “Annihilating filter-based low-rank Hankel matrix approach for image inpainting,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3498–3511, Nov. 2015, doi: 10.1109/TIP.2015.2446943.

[26] T. Blu and F. Luisier, “The sure-let approach to image denoising,” IEEE Trans. Image Process., vol. 16, no. 11, pp. 2778–2786, Nov. 2007, doi: 10.1109/ TIP.2007.906002.

[5] H. Chen et al., “Low-dose CT with a residual encoder-decoder convolutional neural network,” IEEE Trans. Med. Imag., vol. 36, no. 12, pp. 2524–2535, Dec. 2017, doi: 10.1109/TMI.2017.2715284.

[27] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, Nov. 2004, doi: 10.1002/cpa.20042.

[6] Y. Han and J. C. Ye, “Framing u-net via deep convolutional framelets: Application to sparse-view CT,” IEEE Trans. Med. Imag., vol. 37, no. 6, pp. 1418–1429, Jun. 2018, doi: 10.1109/TMI.2018.2823768.

[28] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proc. 27th Int. Conf. Mach. Learn., 2010, pp. 399–406.

[7] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, Jul. 2017, doi: 10.1109/TIP.2017. 2662206. [8] T. Yokota, H. Hontani, Q. Zhao, and A. Cichocki, “Manifold modeling in embedded space: An interpretable alternative to deep image prior,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 3, pp. 1022–1036, Mar. 2022, doi: 10.1109/TNNLS.2020. 3037923. [9] K. C. Kusters, L. A. Zavala-Mondragón, J. O. Bescós, P. Rongen, P. H. de With, and F. van der Sommen, “Conditional generative adversarial networks for low-dose CT image denoising aiming at preservation of critical image content,” in Proc. 43rd Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC), Piscataway, NJ, USA: IEEE Press, 2021, pp. 2682–2687, doi: 10.1109/EMBC46164.2021.9629600. [10] H. Gupta, K. H. Jin, H. Q. Nguyen, M. T. McCann, and M. Unser, “CNN-based projected gradient descent for consistent CT image reconstruction,” IEEE Trans. Med. Imag., vol. 37, no. 6, pp. 1440–1453, Jun. 2018, doi: 10.1109/TMI. 2018.2832656. [11] M. T. McCann, K. H. Jin, and M. Unser, “Convolutional neural networks for inverse problems in imaging: A review,” IEEE Signal Process. Mag., vol. 34, no. 6, pp. 85–95, Nov. 2017, doi: 10.1109/MSP.2017.2739299. [12] J. C. Ye, Y. Han, and E. Cha, “Deep convolutional framelets: A general deep learning framework for inverse problems,” SIAM J. Imag. Sci., vol. 11, no. 2, pp. 991–1048, 2018, doi: 10.1137/17M1141771. [13] J. C. Ye and W. K. Sung, “Understanding geometry of encoder-decoder CNNs,” in Proc. 36th Int. Conf. Mach. Learn., PMLR, Cambridge, MA, USA, 2019, pp. 7064– 7073. [14] S. Mohan, Z. Kadkhodaie, E. P. Simoncelli, and C. Fernandez-Granda, “Robust and interpretable blind image denoising via bias-free convolutional neural networks,” in Proc. Int. Conf. Learn. Representations, 2020. [Online]. 
Available: https://iclr.cc/ virtual_2020/poster_HJlSmC4FPS.html [15] M. Unser, “From kernel methods to neural networks: A unifying variational formulation,” 2022, arXiv:2206.14625. [16] V. Papyan, Y. Romano, and M. Elad, “Convolutional neural networks analyzed via convolutional sparse coding,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 2887–2938, 2017. [17] K. Mentl et al., “Noise reduction in low-dose CT using a 3D multiscale sparse denoising autoencoder,” in Proc. IEEE 27th Int. Workshop Mach. Learn. Signal Process. (MLSP), Piscataway, NJ, USA: IEEE Press, 2017, pp. 1–6, doi: 10.1109/ MLSP.2017.8168176. [18] F. Fan, M. Li, Y. Teng, and G. Wang, “Soft autoencoder and its wavelet adaptation interpretation,” IEEE Trans. Comput. Imag., vol. 6, pp. 1245–1257, Aug. 2020, doi: 10.1109/TCI.2020.3013796. [19] L. A. Zavala-Mondragón, P. Rongen, J. O. Bescos, P. H. De With, and F. Van der Sommen, “Noise reduction in CT using learned wavelet-frame shrinkage networks,” IEEE Trans. Med. Imag., vol. 41, no. 8, pp. 2048–2066, Aug. 2022, doi: 10.1109/ TMI.2022.3154011.

[29] I. Daubechies, R. DeVore, S. Foucart, B. Hanin, and G. Petrova, “Nonlinear approximation and (deep) ReLU networks,” Constructive Approximation, vol. 55, no. 1, pp. 127–172, 2022, doi: 10.1007/s00365-021-09548-z. [30] J. C. Ye, J. M. Kim, K. H. Jin, and K. Lee, “Compressive sampling using annihilating filter-based low-rank interpolation,” IEEE Trans. Inf. Theory, vol. 63, no. 2, pp. 777–801, Feb. 2017, doi: 10.1109/TIT.2016.2629078. [31] A. Cichocki, R. Zdunek, and S.-i. Amari, “Nonnegative matrix and tensor factorization [Lecture Notes],” IEEE Signal Process. Mag., vol. 25, no. 1, pp. 142–145, 2008, doi: 10.1109/MSP.2008.4408452. [32] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds. Cham, Switzerland: Springer International Publishing, 2015, pp. 234–241. [33] T. Kwon and J. C. Ye, “Cycle-free cyclegan using invertible generator for unsupervised low-dose CT denoising,” IEEE Trans. Comput. Imag., vol. 7, pp. 1354–1368, 2021, doi: 10.1109/TCI.2021.3129369. [34] L. A. Zavala-Mondragón et al., “On the performance of learned and fixed-framelet shrinkage networks for low-dose CT denoising,” Med. Imag. Deep Learn., 2022. [Online]. Available: https://openreview.net/pdf?id=WGLqD0zHXy9 [35] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” 2012, arXiv:1207.0580. [36] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. 13th Int. Conf. Artif. Intell. Statist., JMLR Workshop Conf. Proc., 2010, pp. 249–256. [37] M. Scetbon, M. Elad, and P. Milanfar, “Deep K-SVD denoising,” IEEE Trans. Image Process., vol. 30, pp. 5944–5955, Jun. 2021, doi: 10.1109/TIP. 2021.3090531. [38] S. Herbreteau and C. 
Kervrann, “DCT2net: An interpretable shallow CNN for image denoising,” IEEE Trans. Image Process., vol. 31, pp. 4292–4305, Jun. 2022, doi: 10.1109/TIP.2022.3181488. [39] K. Zhang, Y. Li, W. Zuo, L. Zhang, L. Van Gool, and R. Timofte, “Plug-and-play image restoration with deep denoiser prior,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 6360–6376, Oct. 2022, doi: 10.1109/TPAMI.2021.3088914. [40] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo, “Multi-level wavelet-CNN for image restoration,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. Workshops, 2018, pp. 773–782. [41] V. Monga, Y. Li, and Y. C. Eldar, “Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing,” IEEE Signal Process. Mag., vol. 38, no. 2, pp. 18–44, Mar. 2021, doi: 10.1109/MSP.2020.3016905. [42] M. Malfait and D. Roose, “Wavelet-based image denoising using a Markov random field a priori model,” IEEE Trans. Image Process., vol. 6, no. 4, pp. 549–565, Apr. 1997, doi: 10.1109/83.563320. [43] A. Pižurica and W. Philips, “Estimating the probability of the presence of a signal of interest in multiresolution single- and multiband image denoising,” IEEE Trans. Image Process., vol. 15, no. 3, pp. 654–665, Mar. 2006, doi: 10.1109/TIP.2005.863698.

[20] L. A. Zavala-Mondragón, P. H. de With, and F. van der Sommen, “Image noise reduction based on a fixed wavelet frame and CNNs applied to CT,” IEEE Trans. Image Process., vol. 30, pp. 9386–9401, 2021, doi: 10.1109/TIP.2021.3125489.

[44] A. Pižurica, “Image denoising algorithms: From wavelet shrinkage to nonlocal collaborative filtering,” in Wiley Encyclopedia of Electrical and Electronics Engineering. Hoboken, NJ, USA: Wiley, 1999, pp. 1–17.

[21] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Process., vol. 26, no. 9, pp. 4509–4522, Sep. 2017, doi: 10.1109/TIP.2017.2713099.

[45] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image restoration by sparse 3D transform-domain collaborative filtering,” in Proc. SPIE Image Process., Algorithms Syst. VI, International Society for Optics and Photonics, Bellingham, WA, USA, 2008, vol. 6812, pp. 62–73.

[22] M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., Piscataway, NJ, USA: IEEE Press, 2007, pp. 1–8, doi: 10.1109/CVPR.2007.383157. [23] E. J. Candes and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Process. Mag., vol. 25, no. 2, pp. 21–30, Mar. 2008, doi: 10.1109/ MSP.2007.914731. [24] L. Sendur and I. W. Selesnick, “Bivariate shrinkage functions for wavelet-based denoising exploiting interscale dependency,” IEEE Trans. Signal Process., vol. 50, no. 11, pp. 2744–2756, Nov. 2002, doi: 10.1109/TSP.2002.804091. [25] G. Steidl and J. Weickert, “Relations between soft wavelet shrinkage and total variation denoising,” in Proc. Joint Pattern Recognit. Symp., Berlin, Germany: Springer-Verlag, 2002, pp. 198–205, doi: 10.1007/3-540-45783-6_25.

[46] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit. (CVPR), Piscataway, NJ, USA: IEEE Press, 2005, vol. 2, pp. 60–65, doi: 10.1109/ CVPR.2005.38. [47] D. Yang and J. Sun, “BM3D-net: A convolutional neural network for transformdomain collaborative filtering,” IEEE Signal Process. Lett., vol. 25, no. 1, pp. 55–59, 2018, doi: 10.1109/LSP.2017.2768660. [48] H. Lee, H. Choi, K. Sohn, and D. Min, “KNN local attention for image restoration,” in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit., 2022, pp. 2139–2149.

TIPS & TRICKS

David Shiung, Jeng-Ji Huang, and Ya-Yin Yang

Tricks for Designing a Cascade of Infinite Impulse Response Filters With an Almost Linear Phase Response

Designing filters with perfect frequency responses (i.e., flat passbands, sharp transition bands, highly suppressed stopbands, and linear phase responses) is the ultimate goal of any digital signal processing (DSP) practitioner. High-order finite impulse response (FIR) filters may meet these requirements when we put no constraint on implementation complexity. In contrast to FIR filters, infinite impulse response (IIR) filters, owing to their recursive structures, provide an efficient way to achieve high-performance filtering at reduced complexity. However, also due to their recursive structures, IIR filters inherently have nonlinear phase responses, and this limits their applicability. In this article, we propose two tricks for cascading a prototype IIR filter with a few shaping all-pass filters (APFs) to obtain an almost linear phase response over the passband. With a careful design of the prototype and shaping filters, we approach perfect filtering with reduced complexity.

Digital Object Identifier 10.1109/MSP.2023.3290772. Date of current version: 3 November 2023. 1053-5888/23©2023IEEE

Preliminaries

Over the past decades, we have witnessed the power of DSP in various fields of application, e.g., wireless communication [1], [2], seismology [3], and biomedical sciences [4]. Digital filtering undoubtedly plays an important role in realizing these applications. When filtering with a linear phase response, the intended signal merely experiences a constant group delay and preserves its waveform. This is a vital feature for many applications, e.g., the denoising of electrocardiography (ECG) records [4] and seismologic signals [3].

Traditionally, high-performance filtering with linear phase responses is achievable by high-order FIR filters. This, in turn, increases the system complexity (the number of adders and multipliers), although there are techniques to cut the system complexity in half by folding the symmetric filter coefficients [5]. The filter cascade technique can be used for designing filters with reduced complexity, e.g., the composite filter in [6] and interpolated FIR filters [5], [7]. A comprehensive survey of the design techniques for FIR filters is presented in [8]. Among these design techniques, the works in [9] and [10] extend the idea of interpolated FIR filters by first designing a bandpass filter and then shaping this model filter with some masking filters. This technique is an attractive candidate for obtaining a filter with a sharp transition band. The results show superiority over other designs in terms of filtering performance and implementation complexity. However, there is still room for further improvement.

In contrast to FIR filters, IIR filters achieve high-performance filtering with low system complexity due to their recursive structures. But also due to their recursive structures, IIR filters are unable to provide linear phase responses. In this article, we provide our solution to the problem of perfect filtering with reduced complexity. Our solution is realized by cascading a prototype IIR filter with a few shaping APFs for an almost linear phase response over the passband. The proposed composite filter can be used to replace any ordinary FIR filter with fixed filter coefficients. We then approach perfect filtering with reduced complexity.

APFs have wide applications in the fields of DSP and communication [11], [12], [13]. Ideally, an APF has a constant magnitude response over the entire frequency band. The novelty of this article is that a cascade of some IIR filters can produce a composite filter with an almost linear phase response over the filter passband. In particular, the filtering performance of this composite filter can be quite close to that of a high-order (thus, high-complexity) FIR filter. This composite filter provides a way to approach perfect filtering with limited complexity.

We first introduce the design of a composite low-pass filter (LPF), starting with a prototype filter that meets the design specifications for the magnitude response. Then, the prototype filter is cascaded with a few shaping APFs to rebuild an almost linear phase response over the filter passband. This idea is then extended to designing a high-pass filter (HPF). The example filter shows that the intended signal waveform is preserved, while the unwanted signal is highly suppressed.

The transfer function of a feasible APF suitable for our composite filter is of the form [14]

$$H_{ap}(z) = \frac{z^{-1} - a^{*}}{1 - a z^{-1}} \tag{1}$$

where $a = re^{j\theta}$ is the complex pole of $H_{ap}(z)$ and $0 \le \theta < 2\pi$. For stability, we need $0 \le r < 1$. By substituting $z = e^{j\omega}$ into (1), the frequency response of the APF can be written as

$$H_{ap}(\omega) = \frac{e^{-j\omega} - a^{*}}{1 - a e^{-j\omega}} = e^{-j\omega}\,\frac{1 - a^{*} e^{j\omega}}{1 - a e^{-j\omega}}. \tag{2}$$

We can confirm that $H_{ap}(\omega)$ has a unity magnitude that is independent of $\omega$, because the numerator $1 - a^{*}e^{j\omega}$ is the complex conjugate of the denominator $1 - ae^{-j\omega}$. Reorganizing (2), the phase function of $H_{ap}(\omega)$ is

$$\angle H_{ap}(\omega) = -\omega - 2\tan^{-1}\left(\frac{r\sin(\omega - \theta)}{1 - r\cos(\omega - \theta)}\right). \tag{3}$$

The group delay and phase delay associated with a system with frequency response $H(\omega)$ are defined as

$$\tau_{gr}(\omega) \triangleq -\frac{d\angle H(\omega)}{d\omega} \tag{4}$$

and

$$\tau_{ph}(\omega) \triangleq -\frac{\angle H(\omega)}{\omega} \tag{5}$$

respectively. The group delay and phase delay have an important implication for an APF. If a narrow-band sequence $x(n) = s(n)\cos(\omega_0 n)$ is passed through an APF, the filter output $y(n)$ becomes [14]

$$y(n) = s(n - \tau_{gr}(\omega_0))\cos(\omega_0(n - \tau_{ph}(\omega_0))). \tag{6}$$

By definition, the group delay of $H_{ap}(\omega)$ can be presented as

$$\mathrm{grd}[H_{ap}(\omega)] = \frac{1 - r^2}{1 + r^2 - 2r\cos(\omega - \theta)} = \frac{1 - r^2}{\left|1 - re^{j\theta}e^{-j\omega}\right|^2}. \tag{7}$$

Since $0 \le r < 1$ for a stable APF, by (7), we confirm that the group delay is positive at all frequencies.

Consider the design of a real-coefficient APF. If the filter order is one, we can set the pole angle $\theta$ to zero or $\pi$. If the filter order is two, its transfer function can be the product of those of two first-order APFs with complex conjugate poles $a$ and $a^{*}$, respectively. This is because the roots of a real-coefficient polynomial occur in conjugate pairs. Thus, the transfer function of a real-coefficient shaping filter can be defined as

$$H_S(z) \triangleq \begin{cases} \dfrac{z^{-1} - r}{1 - rz^{-1}}, & \theta = 0 \\[2ex] \dfrac{z^{-1} - a^{*}}{1 - az^{-1}} \cdot \dfrac{z^{-1} - a}{1 - a^{*}z^{-1}}, & 0 < \theta < \pi \\[2ex] \dfrac{z^{-1} + r}{1 + rz^{-1}}, & \theta = \pi. \end{cases} \tag{8}$$

Note that the shaping filter defined in (8) still has a magnitude response independent of $\omega$. Extending the result in (7) to the transfer function defined in (8), we have

$$\mathrm{grd}[H_S(\omega)] = \begin{cases} \dfrac{1 - r^2}{1 + r^2 - 2r\cos(\omega)}, & \theta = 0 \\[2ex] \dfrac{1 - r^2}{1 + r^2 - 2r\cos(\omega - \theta)} + \dfrac{1 - r^2}{1 + r^2 - 2r\cos(\omega + \theta)}, & 0 < \theta < \pi \\[2ex] \dfrac{1 - r^2}{1 + r^2 + 2r\cos(\omega)}, & \theta = \pi. \end{cases} \tag{9}$$

By (7) and (9), we know that the group delay contributed by a real-coefficient APF is always positive. From (4), we also know that rebuilding the phase response of a composite filter is equivalent to rebuilding its group delay. The group delays of four shaping filters with parameters $(r, \theta) = (0.9, 0)$, $(0.9, 0.2\pi)$, $(0.8, 0.3\pi)$, and $(0.9, \pi)$ are shown in Figure 1, where the frequency is normalized to $\omega = \pi$ (radians/sample). Here, $\theta$ corresponds to the peak of the group delay curve, while $r$ controls its shape. The choices of $r$ and $\theta$ provide degrees of freedom in shaping the group delay of the composite filter. Obviously, our design freedom increases when more shaping APFs are used.

FIGURE 1. The group delays of the shaping APFs for $(r, \theta) = (0.9, 0)$, $(0.9, 0.2\pi)$, $(0.8, 0.3\pi)$, and $(0.9, \pi)$, respectively. (Axes: group delay in samples versus normalized frequency.)

Properties of cascaded filters

The central idea of the cascade technique for filter design is to build a high-performance filter by cascading a number of low-performance filters [5]. This technique can be used to sharpen the transition band or to suppress the stopband of the prototype filter [6], [15], [16], [17]. In this article, we focus on cascading a prototype filter with $M$ shaping filters for an almost linear phase response. The relationship of the input and output sequences for the composite filter is presented in Figure 2. The z-domain input and output sequences are denoted by $X(z)$ and $Y(z)$, respectively. In particular, the prototype filter with a transfer function $H_P(z)$ is an IIR filter for meeting the specifications regarding the magnitude response. A family of $M$ shaping APFs with transfer functions $H_{S,m}(z)$, $m = 1, \ldots, M$, is for remodeling the phase response of the composite filter. The aggregate transfer function of the composite filter is

$$H_C(z) = \frac{Y(z)}{X(z)} = H_P(z)\prod_{m=1}^{M} H_{S,m}(z). \tag{10}$$

Clearly, the magnitude response of the composite filter obtained by substituting $z = e^{j\omega}$ into (10) is identical to that of the prototype filter; i.e., $|H_C(\omega)| = |H_P(\omega)|$. The phase response of the composite filter is related to those of the prototype filter and the $M$ shaping APFs by

$$\angle H_C(\omega) = \angle H_P(\omega) + \sum_{m=1}^{M} \angle H_{S,m}(\omega). \tag{11}$$

Taking the negative derivative with respect to $\omega$ on both sides of (11), we relate the group delay of the composite filter to those of its component filters as follows:

$$\mathrm{grd}[H_C(\omega)] = \mathrm{grd}[H_P(\omega)] + \sum_{m=1}^{M} \mathrm{grd}[H_{S,m}(\omega)]. \tag{12}$$

One trick for designing a filter with a near perfect frequency response is cascading a number of simple APFs with a prototype filter. The other trick is using a Chebyshev type 2 filter as the prototype filter for efficient compensation of the group delay. We elaborate on these two points in the following.
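The closed-form expressions (7) and (9) and the additivity in (12) can be checked numerically. The following is a minimal sketch in Python using only the standard library; the function names are illustrative and not from the article, and the prototype group delay passed to `composite_group_delay` is assumed to be supplied by the reader as a callable.

```python
import cmath
import math


def apf_response(r, theta, w):
    """Frequency response (2) of a first-order APF with pole a = r*exp(j*theta)."""
    a = r * cmath.exp(1j * theta)
    e = cmath.exp(-1j * w)
    return (e - a.conjugate()) / (1 - a * e)


def apf_group_delay(r, theta, w):
    """Closed-form group delay (7), in samples."""
    return (1 - r * r) / (1 + r * r - 2 * r * math.cos(w - theta))


def shaping_group_delay(r, theta, w):
    """Group delay (9) of the real-coefficient shaping filter (8)."""
    if theta == 0 or theta == math.pi:
        return apf_group_delay(r, theta, w)  # first order, real pole at +r or -r
    # second order: conjugate poles at angles +theta and -theta
    return apf_group_delay(r, theta, w) + apf_group_delay(r, -theta, w)


def numerical_group_delay(r, theta, w, h=1e-6):
    """Negative phase slope of (2) from a central difference.

    Dividing the two responses yields the phase difference directly,
    so no phase unwrapping is needed for a small step h.
    """
    ratio = apf_response(r, theta, w + h) / apf_response(r, theta, w - h)
    return -cmath.phase(ratio) / (2 * h)


def composite_group_delay(prototype_gd, shaping_params, w):
    """Additivity (12): prototype delay plus the delays of the shaping APFs.

    prototype_gd is a caller-supplied function of w (an assumption of this sketch);
    shaping_params is a list of (r, theta) pairs.
    """
    return prototype_gd(w) + sum(shaping_group_delay(r, th, w) for r, th in shaping_params)


if __name__ == "__main__":
    # The four (r, theta) pairs plotted in Figure 1, evaluated at one frequency.
    for r, theta in [(0.9, 0.0), (0.9, 0.2 * math.pi), (0.8, 0.3 * math.pi), (0.9, math.pi)]:
        w = 0.25 * math.pi
        print(r, theta, abs(apf_response(r, theta, w)), shaping_group_delay(r, theta, w))
```

At $\omega = \theta$, (7) reduces to $(1 + r)/(1 - r)$, so a pole radius of $r = 0.9$ contributes a peak of 19 samples from that pole alone.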

Compensate the prototype filter

In this context, we use the terms constant group delay and linear phase interchangeably for ease of explanation. Although an FIR filter can have a linear phase response over the entire frequency band, this is unnecessary for a band-limited signal. Actually, the filter phase response can be relaxed to be linear only within the passband. This is because the other bands are already suppressed by the prototype filter. Consider the problem of designing a low-pass composite filter whose phase response is linear up to the passband edge frequency $\omega_p$. Figure 3 explains the idea of compensating the group delay of the prototype filter within the frequency $\omega_p$. The group delay margin for compensation is in green. In essence, this is an optimization problem of finding the best parameters $r$ and $\theta$ for the $M$ shaping filters. Clearly, as the number of shaping filters increases, we expect a better fill for the margin. In addition, the geometry of the compensated margin also impacts the error caused by compensation. This idea is further verified in the design examples, and the results are promising when $M \ge 3$ for the considered design specifications. The mean of the synthesized group delay of the composite filter over the passband $[0, \omega_p]$ can be written as

$$\mu_{GD} = \frac{1}{\omega_p}\int_{0}^{\omega_p} \mathrm{grd}[H_C(\omega)]\,d\omega. \tag{13}$$

The flatness of the synthesized group delay over the passband can be defined as the root mean square (RMS) group delay error of $H_C(\omega)$:

$$\Delta_{GD} \triangleq \sqrt{\frac{1}{\omega_p}\int_{0}^{\omega_p} \left\{\mathrm{grd}[H_C(\omega)] - \mu_{GD}\right\}^2 d\omega}. \tag{14}$$

Here, $\Delta_{GD}$ can be used in the filter design specifications to quantify how well the composite filter performs.
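The passband mean in (13) and the RMS error in (14) are straightforward to approximate numerically. The sketch below uses a midpoint rule with the Python standard library; the function names and the number of integration points `n` are assumptions made for illustration, and `group_delay` stands for any callable that returns the composite group delay in samples at a given frequency.

```python
import math


def mean_group_delay(group_delay, w_p, n=2000):
    """Midpoint-rule approximation of the passband mean in (13)."""
    dw = w_p / n
    return sum(group_delay((k + 0.5) * dw) for k in range(n)) / n


def rms_group_delay_error(group_delay, w_p, n=2000):
    """Midpoint-rule approximation of the RMS group delay error in (14)."""
    dw = w_p / n
    mu = mean_group_delay(group_delay, w_p, n)
    mse = sum((group_delay((k + 0.5) * dw) - mu) ** 2 for k in range(n)) / n
    return math.sqrt(mse)
```

A perfectly flat group delay gives an error of zero, which is the sense in which a smaller value means a more nearly linear phase over $[0, \omega_p]$.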

FIGURE 2. The input and output relationship when cascading the prototype filter with M shaping APFs.

Design a low-pass composite filter

The design specifications include the following:
1) Passband edge frequency $\omega_p = 0.1\pi$ (radians/sample).
2) Stopband edge frequency $\omega_s = 0.15\pi$ (radians/sample).
3) The passband peak-to-peak ripple is less than 0.05 dB.
4) The suppression in the stopband is no less than 73 dB.
5) The RMS group delay error over $[0, \omega_p]$ is less than 0.2; i.e., $\Delta_{GD} \le 0.2$.
6) The total filter order number is no greater than 20.
7) The number of shaping APFs is no greater than three; i.e., $M \le 3$.

FIGURE 3. Compensating the group delay of a prototype filter within $\omega_p$ by using a number of shaping APFs. (Axes: group delay in samples versus frequency $\omega$; curves: prototype filter and composite filter, with the margin for compensation marked.)

The prototype filter is for meeting the frequency magnitude response of the design specifications, while the shaping APFs are for the linear phase response. To facilitate the design, we divide the design specifications into two groups. The specifications in the first group, including constraints 1–4, are for designing the prototype filter; the specifications in the second group, including constraints 5–7, are for designing the shaping filters. Some popular IIR filters, e.g., Butterworth filters, elliptic filters, least-pth-norm filters, and Chebyshev filters, are all candidates for the prototype filter. We arbitrarily choose the least-pth-norm filter and the Chebyshev type 2 filter as the candidate prototype filters. We can use a software package, e.g., MATLAB, to design a prototype filter satisfying specifications 1–4. The frequency magnitude responses are given in Figure 4. The filter orders are eight and 12 for the candidate least-pth-norm IIR filter and Chebyshev type 2 filter, respectively. We also arbitrarily choose a least-squares FIR filter of filter order 200 as a baseline for comparison. Note that all three filters meet specifications 1–4, and the baseline FIR filter is far more complex than the other two IIR filters, considering its high filter order. We can easily improve the stopband suppression for the two IIR filters by cascading them with a complementary comb filter [6], [15], [16], [17]. The frequency magnitude responses of the prototype filters are identical to those of the corresponding composite filters.

FIGURE 4. A comparison of the frequency magnitude responses for a least-squares FIR filter, least-pth-norm IIR filter, and Chebyshev type 2 IIR filter. (Axes: magnitude response in dB versus normalized frequency; curves: least-squares FIR, order 200; least-pth-norm IIR, order 8; Chebyshev type 2 IIR, order 12.)

The group delays of the baseline FIR filter and the other two prototype filters are provided in Figure 5. We can see that the group delays of the two prototype filters are monotonically increasing over the passband $[0, \omega_p]$. In particular, the least-pth-norm IIR filter has a large group delay margin to be filled by the shaping APFs as compared with that of the Chebyshev type 2 filter. The baseline FIR filter undoubtedly has a constant group delay throughout the entire frequency band, due to its symmetric filter coefficients. But we show in the example that a linear phase over the filter passband is enough to preserve the intended signal waveform.

FIGURE 5. A comparison of the group delays for a least-squares FIR filter, least-pth-norm IIR filter, and Chebyshev type 2 IIR filter. (Axes: group delay in samples versus normalized frequency.)

Provided that we know the filter orders of the two candidate prototype filters, the problem of designing a composite filter is then reduced to finding the optimal parameters $(r_1, \ldots, r_M, \theta_1, \ldots, \theta_M)$ for a cascade of $M$ shaping APFs satisfying specifications 5–7. The objective function of the optimization problem is

$$\min_{r_1, \ldots, r_M,\ \theta_1, \ldots, \theta_M} \Delta_{GD}(r_1, \ldots, r_M, \theta_1, \ldots, \theta_M) \tag{15}$$

subject to

$$1 \le M \le 3 \tag{16}$$

$$0 \le r_1, \ldots, r_M \le 1 \tag{17}$$

$$0 \le \theta_1, \ldots, \theta_M \le \pi. \tag{18}$$

Equation (15) ensures that the RMS group delay error is minimized over the solution space, that is, (16)–(18). Note that $M$ is a positive integer, and the variables $(r_1, \ldots, r_M, \theta_1, \ldots, \theta_M)$ are real numbers. In addition, one shaping filter contributes an additional filter order of two to the composite filter. Thus, specification 6 is satisfied if the composite filter is constrained by (16); for the order 12 prototype, for example, the total order is at most $12 + 2 \times 3 = 18$. Note that (15) is not a linear function, and we cannot use linear programming to solve the problem. To simplify the problem, we set $M$ to one, two, and three, respectively, and search through the reduced solution space constrained by (17) and (18). The solution space can be first sliced using a coarse grid for finding candidate solutions. Then, the solution space around the candidate solutions is sliced using a fine grid. The process is repeated for a few rounds, as in the work in [18]. This method is especially useful for well-behaved functions, and the optimal solution can be obtained in a few rounds.

Figure 6 shows the RMS group delay errors of using the two candidate prototype filters when $M = 1, \ldots, 4$. We find that choosing the Chebyshev type 2 filter as the prototype filter always achieves a lower RMS group delay error than that of the least-pth-norm filter. This is because the margin of the group delay to be compensated is smaller for the Chebyshev type 2 filter than for the least-pth-norm filter. Thus, we get $M = 3$, and $(r_1, r_2, r_3, \theta_1, \theta_2, \theta_3) = (0.8820, 0.8869, 0.8892, 0.0514, 0.1553, 0.2625)$. The corresponding RMS group delay error is 0.1196. Using these numerical results, we accomplish the design of the composite filter and obtain the frequency phase response in Figure 7. As compared with the baseline FIR filter, we find that the composite filter can achieve an almost linear phase response over the filter passband $[0, \omega_p]$. Figure 8 illustrates the group delays of the composite filter and those of its component filters. In this case, the group delay of the composite filter over the passband is around 58.5 samples, which is smaller than that of the baseline FIR filter. The component filter parameters of the low-pass composite filter are tabulated in Table 1.

Table 2 provides a comparison of the complexity among five LPFs, which are 1) our composite LPF, 2) a least-squares FIR filter of order 200, 3) an equiripple FIR filter of order 134, 4) an FIR filter of order 182 designed by the window design method, and 5) a narrow transition-band FIR filter designed in [9]. From (8), we know that one shaping APF needs four multipliers and four adders. With an order 12 prototype filter, the composite filter needs a total of 36 ($4 \times 3 + 12 \times 2 = 36$) multipliers and 36 ($4 \times 3 + 12 \times 2 = 36$) adders.
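The coarse-to-fine grid search used to solve (15) under (17) and (18) can be sketched as follows. This is a minimal illustration in standard-library Python and not the authors' implementation: the grid size, the number of rounds, and the shrink factor are arbitrary choices, the upper bound on $r$ is set to 0.99 here to keep the pole strictly inside the unit circle, and `hypothetical_prototype_gd` is a made-up stand-in for a prototype group delay that rises toward the passband edge (it is not the Chebyshev type 2 filter of the article).

```python
import itertools
import math

W_P = 0.1 * math.pi  # passband edge from specification 1


def coarse_to_fine(objective, bounds, points=11, rounds=6, shrink=0.25):
    """Minimize objective over a box by re-gridding around each round's best point."""
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    best_x, best_f = None, math.inf
    for _ in range(rounds):
        axes = [[min(h, l + (h - l) * k / (points - 1)) for k in range(points)]
                for l, h in zip(lo, hi)]
        round_x, round_f = None, math.inf
        for x in itertools.product(*axes):
            f = objective(x)
            if f < round_f:
                round_x, round_f = x, f
        if round_f < best_f:
            best_x, best_f = round_x, round_f
        # Re-center a smaller box on this round's winner, clipped to the original bounds.
        new_lo, new_hi = [], []
        for (b_lo, b_hi), l, h, c in zip(bounds, lo, hi, round_x):
            half = 0.5 * shrink * (h - l)
            new_lo.append(max(b_lo, c - half))
            new_hi.append(min(b_hi, c + half))
        lo, hi = new_lo, new_hi
    return best_x, best_f


def pair_group_delay(r, theta, w):
    """Second-order case of (9); at theta = 0 or pi this models a double real pole."""
    num = 1 - r * r
    return (num / (1 + r * r - 2 * r * math.cos(w - theta))
            + num / (1 + r * r - 2 * r * math.cos(w + theta)))


def hypothetical_prototype_gd(w):
    """Made-up prototype group delay (samples), used only to exercise the search."""
    return 30.0 + 25.0 * (w / W_P) ** 2


def rms_error(params, n=200):
    """RMS group delay error (14) for the stand-in prototype plus one shaping APF."""
    r, theta = params
    dw = W_P / n
    gd = [hypothetical_prototype_gd((k + 0.5) * dw) + pair_group_delay(r, theta, (k + 0.5) * dw)
          for k in range(n)]
    mu = sum(gd) / n
    return math.sqrt(sum((g - mu) ** 2 for g in gd) / n)


if __name__ == "__main__":
    best, err = coarse_to_fine(rms_error, [(0.0, 0.99), (0.0, math.pi)])
    print("r, theta =", best, "RMS error =", err)
```

For $M$ shaping filters the same routine applies with $2M$ bounds, although the number of grid points grows as `points ** (2 * M)` per round, so a coarser grid per round is advisable for $M = 3$.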

FIGURE 6. A comparison of the RMS group delay errors by using different prototype filters. (Axes: RMS group delay error versus M; curves: Chebyshev type 2 LPF and least-pth-norm LPF.)

FIGURE 7. A comparison of the frequency phase responses for a least-squares FIR filter and our composite filter. The prototype filter is a Chebyshev type 2 IIR filter. (Axes: phase response in radians versus normalized frequency.)

FIGURE 8. The group delays of the composite filter and associated shaping filters. (Axes: group delay in samples versus normalized frequency; curves: prototype filter, shaping filters 1–3, and composite filter.)

Table 1. The component filter parameters of the low-pass composite filter.

| Filter Parameter | Prototype Filter | Shaping Filter 1 | Shaping Filter 2 | Shaping Filter 3 |
|---|---|---|---|---|
| Filter order | 12 | 2 | 2 | 2 |
| Passband peak-to-peak ripple (dB) | 0.05 | 0 | 0 | 0 |
| Stopband attenuation (dB) | 73 | 0 | 0 | 0 |
| Passband edge frequency (radians/sample) | 0.1π | N/A | N/A | N/A |
| Stopband edge frequency (radians/sample) | 0.15π | N/A | N/A | N/A |

0.4r (radians/sample), and w (n) is a Hamming window; ~ 1 and ~ 2 are located at the filter passband, while ~ 3 is at the stopband. The Hamming window of length L + 1 is defined as [5], [14] w (n) = 

*

0.54 - 0.46 cos ` 2rn j, 0 # n # L, L . 0, otherwise (20)

The complete input sequence is defined as

x (n) = x 1 (n) + x 2 (n - L - 1)  + x 3 (n - 2L - 2), n $ 0. (21)

In fact, x (n) can be regarded as a simulation of an ECG record from a field trial. Parts x 1 (n) and x 2 (n) denote the intended signals, while x 3 (n) is an interfering signal. A composite filter with an almost linear phase response

Composite LPF, M = 3

Least-Squares Equiripple FIR, Order FIR, Order 200 134

Window Interpolated Method FIR, Bandpass Order 182 Method [9]

Number of multipliers

36

101

68

97

52

Number of adders

36

200

134

182

100

Complexity/ Group Delay

Filter Type

Timing complexity

Low

Low

Low

Low

Low

Group delay (samples)

58.5

100

67

91

85.5

0.1 0.5 0 –0.5 –1

Filtering results



x 1 (n) = w (n) cos (~ 1 n) (19a)



x 2 (n) = w (n) cos ` ~ 2 n - r j (19b) 2



x 3 (n) = w (n) cos ` ~ 3 n + r j (19c) 5

where ~ 1 = 0.07r (radians/sample), ~ 2 = 0.02r (radians/sample), ~ 3 =

0

50

100 150 Sample Number, n (a)

200

250

25 20 |X(ω)|

Figure 9(a) shows an input sequence x (n) consisting of three narrow-band pulses of sinusoids. The pulses are given as follows:

does preserve the intended waveform and remove the interfering signal. This can be vital for correct diagnosis. The discrete-time Fourier transform (DTFT) of windowed sinusoids equals the convolution of the DTFT of the window with that of an infinitely long sinusoid. A windowed sinusoid thus has a spread of spectrum, depending on the window length, around the center frequency of the sinusoid. The corresponding DTFT magnitude ; X (~) ; when L = 60 is in Figure 9(b). Note that the filter passband is full of a wideband signal, with an accompanied outof-band signal around ~ = ~ 3 = 0.4r (radians/sample). Figure 10 compares the output sequences filtered by 1) a Chebyshev type 2 filter, 2) the composite filter, and 3) a least-squares FIR filter. All three ­filters meet the filter magnitude ­specifications. The output sequences for each filter are denoted as y 1 (n), y 2 (n), and y 3 (n),

Table 2. A comparison of the complexity among five LPFs.

x(n)

The other three low-pass FIR filters (i.e., cases 2, 3, and 4) are designed using MATLAB. All five of them meet the same design specifications, e.g., passband/stopband edge frequencies, passband peak-to-peak ripple, suppression in the stopband, and so on, as outlined in the preceding. Clearly, the composite filter is significantly simpler than the other four FIR filters. Note that we fold the FIR filter architectures for cases 2, 3, and 4 by utilizing the symmetric filter coefficients presented in [5] for reduced complexity. Also shown in Table 2 are a comparison of the timing complexity and the group delay of the five designs. Cases 1–4 all have constant filter coefficients, and there is no need to update them constantly. Thus, the timing control and timing complexity for all fives designs are low. From Figure 8, we know that the group delay for the signals inside the passband of the composite filter is around 58.5, which is smaller than for the other four designs. Table 2 is based on the central idea of using the minimum filter order for each design so as to satisfy the same design specifications. DSP practitioners then leverage the constraints on the implementation platform and freely choose among the feasible designs. The philosophy of our comparison is fairly common and widely used in commercial software packages, e.g., the Filter Design and Analysis Tool in MATLAB, although we, indeed, can set all the filter orders to be fixed and compare their frequency responses. In addition, our comparisons belong to the architecture level. This means we can bypass circuit-level concerns, and it is a fair comparison.

FIGURE 9. The input sequence for verifying example filters: the (a) waveform of signal $x[n]$ and (b) corresponding discrete-time Fourier transform magnitude $|X(\omega)|$ versus normalized frequency.

IEEE SIGNAL PROCESSING MAGAZINE

|

November 2023

|

69

respectively. The Chebyshev type 2 filter is actually the prototype filter of the composite filter; it is of filter order 12. The composite filter has three shaping APFs and is of filter order 18. The least-squares FIR filter has a linear phase response over the whole frequency band and is of filter order 200; it is used as a benchmark for performance comparison. Comparing Figure 10(a) with Figure 9(a), we can see that the Chebyshev type 2 filter is unable to maintain an undistorted waveform for the wideband signal, due to its nonlinear phase response. Comparing $y_2(n)$ with $y_3(n)$, we see that both filters do suppress the out-of-band signals while preserving the in-band signals with high fidelity. The output sequences for both filters are similar except that the baseline FIR filter produces an additional delay of approximately 42 samples ($100 - 58 = 42$).

FIGURE 10. A comparison of the output sequences (amplitude versus sample number $n$) filtered by a (a) Chebyshev type 2 IIR filter, $y_1(n)$, with visible distortion; (b) composite filter, $y_2(n)$; and (c) least-squares FIR filter, $y_3(n)$.

This is due to the composite filter having a lower in-band group delay than the least-squares FIR filter.

Pole-zero diagram of the composite filter

The pole-zero diagram of the composite filter for $M = 3$ is presented in Figure 11. There is a total of six ($2 \times M$) pole-zero pairs scattered over the band $[-0.1\pi, 0.1\pi]$, as expected. All the poles are within the unit circle, so the composite filter is stable.

FIGURE 11. A pole-zero diagram of the composite filter (imaginary part versus real part), with the pole-zero pairs of the shaping APFs marked.

Designing high-pass and other types of composite filters

The procedures for designing a high-pass composite filter are similar to those for a low-pass composite filter. Assume that the filter specifications are
1) Stopband edge frequency $\omega_s = 0.82\pi$ (radians/sample).
2) Passband edge frequency $\omega_p = 0.88\pi$ (radians/sample).
3) Maximum passband peak-to-peak ripple $A_p = 1$ (dB).
4) Minimum stopband suppression $A_s = 70$ (dB).
5) The RMS group delay error over the passband $[\omega_p, \pi]$ is less than 0.3.
6) The total filter order is less than 28.
7) The number of shaping APFs is no greater than four; i.e., $M \le 4$.

We can arbitrarily choose an IIR filter that meets specifications 1–4 as the prototype filter. For example, we choose a Butterworth IIR filter and a Chebyshev type 2 IIR filter as the candidate filters, which results in a filter order of 20 and 10, respectively. The objective function of the optimization problem can be formulated as

$$\min_{r_1, \ldots, r_M,\ \theta_1, \ldots, \theta_M} \Delta_{\mathrm{GD}}(r_1, \ldots, r_M, \theta_1, \ldots, \theta_M) \tag{22}$$

subject to

$$1 \le M \le 4 \tag{23}$$
$$0 \le r_1, \ldots, r_M \le 1 \tag{24}$$
$$0 \le \theta_1, \ldots, \theta_M \le \pi. \tag{25}$$

For the two candidate prototype filters, specification 6 is met by the constraint (23). The frequency magnitude responses of the two candidate prototype filters are shown in Figure 12. Also shown in Figure 12 for comparison is an equiripple FIR filter of filter order 82. The frequency magnitude responses of the composite filters are the same as those of their corresponding prototype filters. Figure 13 shows the group delays of the three filters. The Chebyshev type 2 IIR filter has a strictly decreasing group delay over the passband $[\omega_p, \pi]$ and, for a given number of shaping APFs, is easier to compensate than the Butterworth IIR filter. Figure 14 compares the RMS group delay errors for the two candidate prototype filters. We see that the RMS group delay error of the Chebyshev type 2 filter is lower than that of the Butterworth filter. Cascading the Chebyshev type 2 filter with $M = 4$ shaping APFs results in an RMS group delay error of 0.2191, and the resultant filter order is only 18. Compared with the baseline FIR filter of filter order 82, the composite filter shows a significant reduction in complexity. However, when using the Butterworth IIR filter as the prototype filter, the RMS group delay error for $M = 4$ is 0.7421, which does not meet the design specifications.

With $M = 4$, the parameters obtained are $(r_1, r_2, r_3, r_4, \theta_1, \theta_2, \theta_3, \theta_4) = (0.8928, 0.9175, 0.9038, 0.8845, 3.0135, 2.822, 2.92, 3.1005)$. Figure 15 describes how the group delay of the composite filter is synthesized by the five component filters (one prototype filter and four shaping filters). The in-band group delay of the composite filter is around 66. We confirm that the composite HPF is stable. The component filter parameters of the high-pass composite filter are tabulated in Table 3. This completes the design of a high-pass composite filter.

Table 4 provides a comparison of the complexity among four HPFs, which are 1) the composite HPF, 2) an equiripple FIR filter of order 82, 3) a generalized equiripple FIR filter of order 104, and 4) an FIR filter of order 146 designed by the window design method. With an order 10 prototype filter, the composite filter needs a total of 36 ($4 \times 4 + 10 \times 2 = 36$) multipliers and a total of 36 ($4 \times 4 + 10 \times 2 = 36$) adders. The three high-pass FIR filters are all designed using MATLAB, and all meet the same design specifications as the composite filter. Clearly, the composite filter is the simplest among the four filters. Notice that we already simplified the three FIR filters by using the folded architectures presented in [5].

FIGURE 12. A comparison of the frequency magnitude responses (in dB, versus normalized frequency) for an equiripple FIR filter (order 82), a Butterworth IIR filter (order 20), and a Chebyshev type 2 IIR filter (order 10).

FIGURE 13. A comparison of the group delays (in samples, versus normalized frequency) for an equiripple FIR filter, a Butterworth IIR filter, and a Chebyshev type 2 IIR filter.

FIGURE 14. A comparison of the RMS group delay errors for two candidate prototype filters (Chebyshev type 2 HPF and Butterworth HPF) as a function of $M$.

If we want to design a bandpass or band-stop composite filter, the design procedures are the same as those of the two example filters. We find that Chebyshev type 2 filters inherently have relatively lower group delay margins for compensation than the least-pth-norm and Butterworth filters. It is beneficial to select a Chebyshev type 2 filter as the prototype filter when designing a composite filter. The proposed composite filter can be used to replace any ordinary FIR filter with fixed filter coefficients. This means our composite filter is not suited for adaptive filters whose coefficients are constantly changed by some optimization algorithm. This is because the recursive structures of IIR filters inherently have a good memory for samples. Nevertheless, a recursive structure does not necessarily result in a slow operating speed from a very large-scale integration (VLSI) hardware perspective. In fact, the peak operating speed of a filter is constrained by its critical path. The peak operating speed of an IIR filter can be increased by shortening the critical path by using, e.g., pipelining or retiming techniques, and so can that of the composite filter [19].

Conclusions

This article presented two tricks for approaching a perfect filter (i.e., flat passband, sharp transition band, highly suppressed stopband, and linear phase) with reduced complexity. This goal was realized by first designing a prototype filter to meet the design specifications regarding the frequency magnitude response. The phase function of the prototype filter was then remodeled by a cascade of carefully designed shaping filters. After cascading the prototype IIR filter with the shaping APFs, we obtained a composite filter with an almost linear phase response over the filter passband. Two example filters with highly reduced complexity were demonstrated. We found that the Chebyshev type 2 filter is an appealing candidate for the prototype filter of the composite filter. Our composite filter shows filtering performance quite similar to that of the baseline FIR filter, which has significantly higher complexity. The composite filter provides a way to approach perfect filtering using limited complexity and is especially useful for replacing any ordinary FIR filter with fixed filter coefficients. VLSI implementations focusing on operating speed, hardware cost, and so on can be the next steps to further investigate the benefit of the proposed composite filter.

FIGURE 15. The synthesis of the group delay of the composite filter (group delay in samples versus normalized frequency, for the composite filter, the prototype filter, and shaping filters 1–4).
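The group delay of a cascade is the sum of the group delays of its components, which is why Figure 15 can show the composite delay as a synthesis of component curves. The following is a minimal sketch of the shaping-filter contribution only, assuming (this is an assumption, not stated in the text) that each $(r_i, \theta_i)$ is the radius and angle of the complex-conjugate pole pair of a real second-order all-pass section. The function names are illustrative, and the prototype filter is not modeled.

```python
import math


def allpass2_group_delay(r, theta, omega):
    """Group delay (samples) of a real 2nd-order all-pass section.

    Assumes poles at r*exp(+j*theta) and r*exp(-j*theta); each pole
    contributes (1 - r^2) / (1 - 2 r cos(omega -/+ theta) + r^2).
    """
    def term(phi):
        return (1.0 - r * r) / (1.0 - 2.0 * r * math.cos(omega - phi) + r * r)
    return term(theta) + term(-theta)


# (r_i, theta_i) values quoted in the text for the high-pass example, M = 4.
SHAPING = [(0.8928, 3.0135), (0.9175, 2.822), (0.9038, 2.92), (0.8845, 3.1005)]


def shaping_group_delay(omega):
    """Summed group delay of the four shaping sections at frequency omega."""
    return sum(allpass2_group_delay(r, th, omega) for r, th in SHAPING)


# Delay contributed by the shaping sections across the passband [0.88*pi, pi].
passband = [0.88 * math.pi + i * 0.12 * math.pi / 200 for i in range(201)]
delays = [shaping_group_delay(w) for w in passband]

# Sanity check: averaged over the full circle, an all-pass cascade of
# total order 2M has a mean group delay of 2M samples (here 8).
grid = [2.0 * math.pi * i / 4096 for i in range(4096)]
mean_delay = sum(shaping_group_delay(w) for w in grid) / len(grid)
```

Adding the prototype filter's group delay to `delays` would give the composite curve; the optimization in (22) then amounts to choosing the $(r_i, \theta_i)$ so that this sum is as flat as possible over the passband.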

Table 3. The component filter parameters of the high-pass composite filter.

| Filter Parameter | Prototype Filter | Shaping Filter 1 | Shaping Filter 2 | Shaping Filter 3 | Shaping Filter 4 |
|---|---|---|---|---|---|
| Filter order | 10 | 2 | 2 | 2 | 2 |
| Passband peak-to-peak ripple (dB) | 1 | 0 | 0 | 0 | 0 |
| Stopband attenuation (dB) | 70 | 0 | 0 | 0 | 0 |
| Passband edge frequency (radians/sample) | 0.88π | N/A | N/A | N/A | N/A |
| Stopband edge frequency (radians/sample) | 0.82π | N/A | N/A | N/A | N/A |

Table 4. A comparison of the complexity among four HPFs.

| Complexity | Composite HPF, M = 4 | Equiripple FIR, Order 82 | Generalized Equiripple FIR, Order 104 | Window Method FIR, Order 146 |
|---|---|---|---|---|
| Number of multipliers | 36 | 42 | 53 | 74 |
| Number of adders | 36 | 82 | 104 | 146 |

Authors

David Shiung ([email protected]. tw) received his Ph.D. degree from National Taiwan University, Taipei, in 2002. He is an associate professor at National Changhua University of Education, Changhua 500, Taiwan. His research interests include signal processing for wireless communication and astronomical imaging. He is a Member of IEEE.

Jeng-Ji Huang ([email protected]. tw) received his Ph.D. degree from National Taiwan University, Taipei, in 2004. He is a professor at National Taiwan Normal University, Taipei 106, Taiwan. His research interests include 5G, LoRaWAN, and vehicular ad hoc networks. He is a Member of IEEE.

Ya-Yin Yang (ivyyang64@gmail.com) received her Ph.D. degree in electrical engineering from National Taiwan University, Taipei, in 2009. She is currently an assistant researcher with the Institute of Computer and Communication Engineering, National Cheng Kung University, Tainan 701, Taiwan. Her research interests include channel estimation, radio resource allocation, and interference cancellation for wireless communication systems.

References

[1] M. Vaezi, Z. Ding, and H. V. Poor, Multiple Access Techniques for 5G Wireless Networks and Beyond, 1st ed. Cham, Switzerland: Springer-Verlag, 2019.

[2] A. Ghosh et al., “Reconfigurable signal processing and DSP hardware generator for 5G transmitters,” in Proc. IEEE Nordic Circuits Syst. Conf. (NorCAS), Oslo, Norway, Oct. 2022, pp. 1–7, doi: 10.1109/NorCAS57515.2022.9934696.

[3] S. Bose, A. De, and I. Chakrabarti, “Area-delay-power efficient VLSI architecture of FIR filter for processing seismic signal,” IEEE Trans. Circuits Syst., II, Exp. Briefs, vol. 68, no. 11, pp. 3451–3455, Nov. 2021, doi: 10.1109/TCSII.2021.3081257.

[4] T. M. Chieng, Y. W. Hau, and Z. Omar, “The study and comparison between various digital filters for ECG de-noising,” in Proc. IEEE-EMBS Conf. Biomed. Eng. Sci. (IECBES), Sarawak, Malaysia, Dec. 2018, pp. 226–232, doi: 10.1109/IECBES.2018.8626661.

[5] R. G. Lyons, Understanding Digital Signal Processing, 3rd ed. Boston, MA, USA: Pearson, 2011.

[6] W.-S. Lu and T. Hinamoto, “Design of least-squares and minimax composite filters,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 3, pp. 982–991, Mar. 2018, doi: 10.1109/TCSI.2017.2772345.

[7] Y. Neuvo, D. Cheng-Yu, and S. K. Mitra, “Interpolated finite impulse response filters,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 3, pp. 563–570, Jun. 1984, doi: 10.1109/TASSP.1984.1164348.

[8] S. Roy and A. Chandra, “A survey of FIR filter design techniques: Low-complexity, narrow transition-band and variable bandwidth,” Integration, vol. 77, pp. 193–204, Mar. 2021, doi: 10.1016/j.vlsi.2020.12.001.

[9] S. Roy and A. Chandra, “Design of narrow transition band digital filter: An analytical approach,” Integration, vol. 68, pp. 38–49, Sep. 2019, doi: 10.1016/j.vlsi.2019.06.002.

[10] S. Roy and A. Chandra, “On the order minimization of interpolated bandpass method based narrow transition band FIR filter design,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 11, pp. 4287–4295, Nov. 2019, doi: 10.1109/TCSI.2019.2928052.

[11] J. G. Proakis and M. Salehi, Fundamentals of Communication Systems, 2nd ed. Boston, MA, USA: Pearson, 2014.

[12] S. R. Aghazadeh, H. Martinez, and A. Saberkari, “5GHz CMOS all-pass filter-based true time delay cell,” Electronics, vol. 8, no. 1, Jan. 2019, Art. no. 16, doi: 10.3390/electronics8010016.

[13] S. M. Perera et al., “Wideband N-beam arrays using low-complexity algorithms and mixed-signal integrated circuits,” IEEE J. Sel. Topics Signal Process., vol. 12, no. 2, pp. 368–382, May 2018, doi: 10.1109/JSTSP.2018.2822940.

[14] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed. Boston, MA, USA: Pearson, 2010.

[15] D. Shiung, “A trick for designing composite filters with sharp transition bands and highly suppressed stopbands,” IEEE Signal Process. Mag., vol. 39, no. 5, pp. 70–76, Sep. 2022, doi: 10.1109/MSP.2022.3165960.

[16] D. Shiung, Y.-Y. Yang, and C.-S. Yang, “Cascading tricks for designing composite filters with sharp transition bands,” IEEE Signal Process. Mag., vol. 33, no. 1, pp. 151–162, Jan. 2016, doi: 10.1109/MSP.2015.2477420.

[17] D. Shiung, Y.-Y. Yang, and C.-S. Yang, “Improving FIR filters by using cascade techniques,” IEEE Signal Process. Mag., vol. 33, no. 3, pp. 108–114, May 2016, doi: 10.1109/MSP.2016.2519919.

[18] D. Shiung, P.-H. Hsieh, and Y.-Y. Yang, “Parallels between wireless communication and astronomical observation,” in Proc. IEEE 29th Annu. Int. Symp. Pers., Indoor Mobile Radio Commun. (PIMRC), Bologna, Italy, Sep. 2018, pp. 1–6, doi: 10.1109/PIMRC.2018.8580926.

[19] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, 1st ed. New York, NY, USA: Wiley, 1999.

Ruiming Guo and Thierry Blu 

Super-Resolving a Frequency Band

This article introduces a simple formula that provides the exact frequency of a pure sinusoid from just two samples of its discrete-time Fourier transform (DTFT). Even when the signal is not a pure sinusoid, this formula still works in a very good approximation (optimally after a single refinement), paving the way for the high-resolution frequency tracking of quickly varying signals or simply improving the frequency resolution of the peaks of a discrete Fourier transform (DFT).

Single-Frequency Estimation

We learn (and teach!) in college that the frequency content of a signal is encrypted in its Fourier transform (FT) and that the main frequency modes of a sampled signal are the peaks of its DFT. We also learn that the accuracy of these frequency values is limited by the inverse of the time range of the signal (Heisenberg uncertainty), which correlates nicely with the inherent frequency resolution of a DFT: $2\pi/N$, if $N$ is the number of samples of the signal. This knowledge is so deeply rooted that it is hard to reconcile with the fact that the frequencies of a signal made of a sum of $K$ complex exponentials can be recovered exactly from as few as $2K$ samples by using a two-century-old method due to Gaspard de Prony. This apparent contradiction is resolved by recognizing that Heisenberg uncertainty relies on a much weaker signal assumption (basically, that its time and frequency uncertainties are finite) than the sum-of-exponentials model. And it is only when this model is inexact that the estimated frequencies may be inaccurate, with their uncertainty now given by the Cramér-Rao lower bound, which assumes unbiased estimators and known noise statistics.

The contrast between the Fourier approach (analytic, intuitive, but Heisenberg limited) and Prony’s method (algebraic, black box, but exact) has made it difficult to envision a higher frequency resolution that would rely on the DFT. Yet, given that the DFT coefficients are just samples of a very smooth function (the DTFT), analytic considerations suggest that this function can be approximated locally by a quadratic polynomial, leading to an estimate of its off-grid peak location; as few as three samples around the maximum of the DFT already provide a very good estimate of this frequency [1], [2].

Digital Object Identifier 10.1109/MSP.2023.3311592. Date of current version: 3 November 2023.

Box 1: Notations

• Single-frequency signal: $x_n = a_0 e^{jn\omega_0}$, $n = 0, 1, \ldots, N-1$
• Uncertainty band: $\omega_0 \in [\omega_1, \omega_2]$, with $\omega_2 - \omega_1 = \text{integer} \times 2\pi/N$
• Discrete-time Fourier transform (DTFT): $X(\omega) = \sum_{n=0}^{N-1} x_n e^{-jn\omega}$
• Discrete Fourier transform (DFT): $X(2\pi k/N)$, where $k = 0, 1, \ldots, N-1$
• Peak of the DFT: $k_0 = \arg\max_k |X(2\pi k/N)|$, which gives $\omega_{\mathrm{DFT}} = 2\pi k_0/N$

The Trick

The main motivation for this article is to put forward a formula that provides the frequency value of the maximum of the DTFT from just two DTFT coefficients of a single-frequency signal; see the detailed setting in “Box 1: Notations.” Not only is this formula exact, but it is also very robust to the inaccuracies of the model, as we shall see later. More precisely, if we assume that the unknown frequency $\omega_0$ of the signal $x_n$ lies inside an “uncertainty band” $[\omega_1, \omega_2]$, where $\omega_2 - \omega_1$ is an integer multiple of $2\pi/N$, this formula specifies how the amplitudes of the DTFT of $x_n$ at $\omega_1$ and $\omega_2$ should be combined so as to recover $\omega_0$ (see the visualization in Figure 1):

$$\omega_0 = \frac{\omega_1 + \omega_2}{2} + 2\arctan\left(\tan\left(\frac{\omega_2 - \omega_1}{4}\right)\frac{|X(\omega_2)| - |X(\omega_1)|}{|X(\omega_2)| + |X(\omega_1)|}\right). \tag{1}$$

FIGURE 1. The frequency $\omega_0$ (blue) of a single-frequency signal is obtained from the DTFT values $|X(\omega_1)|$ and $|X(\omega_2)|$ at the endpoints (red) of a frequency interval, the uncertainty band, known to contain $\omega_0$ (assumption: $\omega_2 - \omega_1 = \text{integer} \times 2\pi/N$). Dotted line: a full period of the DTFT $X(\omega)$ of the signal.

Box 2: Step-by-Step Didactic Proof of (1)

1. Geometric sum: $\sum_{n=0}^{N-1} z^n = \dfrac{z^N - 1}{z - 1}$ with $z = e^{j(\omega_0 - \omega)}$ leads to
$$X(\omega) = a_0 \frac{e^{jN(\omega_0 - \omega)} - 1}{e^{j(\omega_0 - \omega)} - 1}.$$
2. Euler’s formula: $\sin\theta = \dfrac{e^{j\theta} - e^{-j\theta}}{2j}$ leads to
$$X(\omega_0 - \omega) = a_0 e^{j\frac{N-1}{2}\omega}\,\frac{\sin(N\omega/2)}{\sin(\omega/2)}, \quad \text{then} \quad |X(\omega_0 - \omega)| = |a_0| \left|\frac{\sin(N\omega/2)}{\sin(\omega/2)}\right|.$$
3. Notation: $\omega_{12} = (\omega_1 + \omega_2)/2$, $B = (\omega_2 - \omega_1)/4$, and $u = (\omega_0 - \omega_{12})/2$ leads to
$$|X(\omega_1)| = |a_0| \left|\frac{\sin(N(u + B))}{\sin(u + B)}\right|, \qquad |X(\omega_2)| = |a_0| \left|\frac{\sin(N(u - B))}{\sin(u - B)}\right|.$$
4. $\pi$-periodicity of $|\sin x|$: $B = \text{integer} \times \dfrac{\pi}{2N}$ leads to
$$\frac{|X(\omega_2)|}{|X(\omega_1)|} = \left|\frac{\sin(u + B)}{\sin(u - B)}\right|.$$
5. Sign of sin: $\pm u \le B$ leads to
$$\frac{|X(\omega_2)|}{|X(\omega_1)|} = \frac{\sin(B + u)}{\sin(B - u)}.$$
6. Trigonometry: $\sin(a \pm b) = \sin a \cos b \pm \cos a \sin b$ leads to
$$\frac{|X(\omega_2)|}{|X(\omega_1)|} = \frac{\tan B + \tan u}{\tan B - \tan u}.$$
7. Algebraic resolution: $\tan u = \tan B \times \dfrac{|X(\omega_2)| - |X(\omega_1)|}{|X(\omega_2)| + |X(\omega_1)|}$, which leads to
$$u = \frac{\omega_0 - \omega_{12}}{2} = \arctan\left(\tan\left(\frac{\omega_2 - \omega_1}{4}\right)\frac{|X(\omega_2)| - |X(\omega_1)|}{|X(\omega_2)| + |X(\omega_1)|}\right).$$

FIGURE 2. A refined trick: a double application of (1) achieves a quality that is equivalent to the best unbiased single-frequency estimation algorithm. The flowchart steps are: 1) choose $\omega_1$ and $\omega_2 = \omega_1 + 2\pi/N$ such that $\omega_1 \le \omega_0 \le \omega_2$ (for instance, $[\omega_1, \omega_2] = \omega_{\mathrm{DFT}} + [-\pi/N, \pi/N]$); 2) calculate the magnitude of the DTFT at $\omega_1$ and $\omega_2$; 3) calculate the estimate $\bar{\omega}_0$ of the frequency $\omega_0$ using (1); 4) choose $\bar{\omega}_1 = \bar{\omega}_0 - \pi/N$ and $\bar{\omega}_2 = \bar{\omega}_0 + \pi/N$; 5) calculate the magnitude of the DTFT at $\bar{\omega}_1$ and $\bar{\omega}_2$; and 6) calculate the refined estimate of $\omega_0$ using (1).

A proof is provided in “Box 2: Step-by-Step Didactic Proof of (1),” requiring only elementary electrical and electronics engineering math knowledge. This formula becomes even simpler when the uncertainty bandwidth is small (i.e., $\omega_2 - \omega_1 \ll \pi$):

$$\omega_0 \approx \omega_1 \times \frac{|X(\omega_1)|}{|X(\omega_1)| + |X(\omega_2)|} + \omega_2 \times \frac{|X(\omega_2)|}{|X(\omega_1)| + |X(\omega_2)|}$$

which is what intuition would suggest: a weighting of the end frequencies based on the relative magnitude of their DTFT. In practice, this formula is most useful when the uncertainty band is smallest, i.e., $\omega_2 - \omega_1 = 2\pi/N$. Then, a straightforward procedure for estimating a single frequency from $N$ uniform signal samples is to
1) determine $\omega_{\mathrm{DFT}}$, the frequency of the peak of the signal DFT
2) apply (1) with $\omega_1 = \omega_{\mathrm{DFT}} - \pi/N$ and $\omega_2 = \omega_{\mathrm{DFT}} + \pi/N$.

This works because, for a single-frequency signal, the peak of the DTFT is always within $\pm\pi/N$ of the maximum of the DFT. But it would also be possible to bypass the computation of the full DFT if a rough estimate of $\omega_0$ were available, e.g., when the frequency of the signal is continuously changing (tracking between successive signal windows, as in radar applications) or when the frequency is a priori known up to some perturbation (physical resonance experiments, laser-based optical measurements, etc.).
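The two-step procedure (locate the DFT peak, then apply (1) to the band of width $2\pi/N$ centered on it) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are made up for the example, the DTFT is evaluated by direct summation rather than an FFT, and frequencies follow the DFT convention $[0, 2\pi)$.

```python
import cmath
import math


def dtft(x, omega):
    """DTFT X(omega) = sum_n x[n] * exp(-j*n*omega) of a finite sequence."""
    return sum(xn * cmath.exp(-1j * n * omega) for n, xn in enumerate(x))


def two_point_estimate(x, omega1, omega2):
    """Formula (1): combine the DTFT magnitudes at the band endpoints."""
    m1 = abs(dtft(x, omega1))
    m2 = abs(dtft(x, omega2))
    ratio = (m2 - m1) / (m2 + m1)
    return 0.5 * (omega1 + omega2) + 2.0 * math.atan(
        math.tan(0.25 * (omega2 - omega1)) * ratio)


def estimate_frequency(x):
    """Step 1: find the DFT peak. Step 2: apply (1) around that peak."""
    n_samples = len(x)
    dft_mag = [abs(dtft(x, 2.0 * math.pi * k / n_samples))
               for k in range(n_samples)]
    k0 = max(range(n_samples), key=dft_mag.__getitem__)
    omega_dft = 2.0 * math.pi * k0 / n_samples
    return two_point_estimate(x, omega_dft - math.pi / n_samples,
                              omega_dft + math.pi / n_samples)


# Noise-free single-frequency signal with an off-grid frequency.
N = 32
omega0 = 1.2345
x = [0.7 * cmath.exp(1j * (n * omega0 + 0.3)) for n in range(N)]
omega_hat = estimate_frequency(x)
```

For a noise-free complex exponential the estimate matches $\omega_0$ up to floating-point rounding, whatever the amplitude and phase, since only the ratio of the two magnitudes enters (1). When a rough estimate of $\omega_0$ is already available, `two_point_estimate` can be called directly and the DFT skipped.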

Robustness to Inaccuracies

When the single-exponential model is not exact, (1) is still a very robust estimator of its frequency. This is particularly so when the uncertainty bandwidth $\omega_2 - \omega_1$ is reduced to $2\pi/N$, an assumption that we will make from now on. Indeed, consider a noise model $x_n = a_0 e^{jn\omega_0} + b_n$, where the complex-valued samples $b_n$ are independent realizations of a Gaussian random variable with variance $\sigma^2$ (the “additive white Gaussian noise” assumption). When the number of samples $N$ is large enough, a linearization of (1) makes it possible to calculate (tediously; not shown here) the standard deviation $\Delta\omega$ of the frequency estimation error. In particular, in the case where $\omega_0$ is at the center of the interval $[\omega_1, \omega_2]$, this error behaves according to

$$\Delta\omega = \frac{\pi^2}{2\sqrt{2}\,N\sqrt{N}\,\mathrm{SNR}} = \frac{\pi^2}{4\sqrt{6}}\,\Delta\omega_{\mathrm{CR}} \tag{2}$$

where $\mathrm{SNR} = |a_0|/\sigma$ and where $\Delta\omega_{\mathrm{CR}}$ is the Cramér-Rao lower bound of the problem (see [3] for a calculation). Obviously, $\Delta\omega$ is very close to $\Delta\omega_{\mathrm{CR}}$, within less than 1%. When $\omega_0$ is closer to the extremities of the interval $[\omega_1, \omega_2]$, $\Delta\omega$ deviates from the Cramér-Rao lower bound by up to 80%, still a very low error in absolute terms. In fact, a simple refinement of the trick, as depicted in Figure 2, shows how to attain this lower bound, outlining the near optimality of this procedure; no other unbiased single-frequency estimation algorithm would be able to improve this performance by more than 1%. This is confirmed in Figure 3 by simulations that consist of 1 million tests where
1) The number of samples, $N$, is random (uniform) between 10 and 1,000.
2) The frequency $\omega_0$ is random (uniform) between $-\pi$ and $\pi$.
3) $a_0 = 1$.
4) The standard deviation $\sigma$ of the noise is such that the SNR is random (uniform) between -5 dB and 40 dB.
5) The noise $b_n$ is drawn from an independent identically distributed statistic (Gaussian).

In each test, the uncertainty band before refinement is set by $\omega_1 = \omega_{\mathrm{DFT}} - \pi/N$ and $\omega_2 = \omega_{\mathrm{DFT}} + \pi/N$. For comparison purposes, Figure 3 also shows the distribution of errors of Jacobsen’s estimator [1], [2], which is based on three consecutive DFT coefficients around $\omega_{\mathrm{DFT}}$. The better performance of our formula is likely due to the higher SNR enjoyed by the two DFT coefficients around the maximum of the DTFT in comparison to the three DFT coefficients used by Jacobsen’s formula, one of which has a significantly lower SNR because it is further away from the DTFT peak by more than $2\pi/N$. A somewhat milder difference is also that (1) assumes an exact single-frequency model, whereas Jacobsen’s formula is but a local quadratic approximation.

Beyond just noise, when the single-frequency model is rendered inaccurate due to, e.g., quantization, sample windowing, or the addition of other sinusoidal/polynomial terms, the frequency estimation error of (1) is controlled by $\varepsilon$, the maximum error of the magnitude of the DTFT at the frequencies $\omega_1$ and $\omega_2$, according to

$$|\bar{\omega}_0 - \omega_0| \le \frac{4\varepsilon}{N|a_0|}\tan\left(\frac{\pi}{2N}\right) \approx \frac{2\pi\varepsilon}{N^2|a_0|} \ \text{ for large } N \tag{3}$$

where $|a_0|$ is the amplitude of the single-frequency “ground-truth” signal. (See the proof in the “Error Bound (3)” section in “Other Proofs.”)

Note that because it is valid for every single “noise” instance, this bound is of a very different nature than the statistical result (2), which is an average over infinitely many additive white Gaussian noise realizations. Interestingly, the inaccuracies of the model outside the uncertainty band do not contribute to the estimation error, which suggests that (3) can be used to predict the accuracy of a multiple-frequency estimation problem that uses the single-frequency trick.
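The refinement of Figure 2 and the deterministic bound (3) can be checked numerically. The sketch below is illustrative only: the helper names are invented, and the "model inaccuracy" is a hypothetical weak second exponential rather than random noise, so that the run is reproducible. It computes $\varepsilon$ directly as the largest magnitude error of the DTFT at the two endpoints, and evaluates the constant $\pi^2/(4\sqrt{6})$ that appears in (2).

```python
import cmath
import math


def dtft_mag(x, omega):
    """Magnitude |X(omega)| of the DTFT of the finite sequence x."""
    return abs(sum(xn * cmath.exp(-1j * n * omega) for n, xn in enumerate(x)))


def formula_1(m1, m2, omega1, omega2):
    """Formula (1): frequency estimate from the endpoint magnitudes m1, m2."""
    return 0.5 * (omega1 + omega2) + 2.0 * math.atan(
        math.tan(0.25 * (omega2 - omega1)) * (m2 - m1) / (m2 + m1))


def refined_estimate(x, omega1, omega2):
    """Return (first estimate, refined estimate), following Figure 2."""
    n = len(x)
    first = formula_1(dtft_mag(x, omega1), dtft_mag(x, omega2), omega1, omega2)
    # Recentre a band of width 2*pi/N on the first estimate and apply (1) again.
    w1, w2 = first - math.pi / n, first + math.pi / n
    second = formula_1(dtft_mag(x, w1), dtft_mag(x, w2), w1, w2)
    return first, second


N, a0, omega0 = 64, 1.0, 0.8
omega1, omega2 = 2.0 * math.pi * 8 / N, 2.0 * math.pi * 9 / N  # contains omega0
clean = [a0 * cmath.exp(1j * n * omega0) for n in range(N)]
# Hypothetical model inaccuracy: a weak second exponential far from the band.
noisy = [c + 0.05 * cmath.exp(2.5j * n) for n, c in enumerate(clean)]

# epsilon in (3): largest magnitude error of the DTFT at the two endpoints.
eps = max(abs(dtft_mag(noisy, w) - dtft_mag(clean, w)) for w in (omega1, omega2))
bound_3 = 4.0 * eps * math.tan(math.pi / (2 * N)) / (N * abs(a0))

first, refined = refined_estimate(noisy, omega1, omega2)
clean_first, clean_refined = refined_estimate(clean, omega1, omega2)

# Ratio between the standard deviation in (2) and the Cramer-Rao bound.
cr_ratio = math.pi ** 2 / (4.0 * math.sqrt(6.0))
```

Here `abs(first - omega0)` stays below `bound_3`, as (3) guarantees for any perturbation, and `cr_ratio` evaluates to roughly 1.007, which is the "within less than 1%" statement attached to (2).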

FIGURE 3. Histograms (PDF-normalized count versus normalized estimation error $\Delta\omega/\Delta\omega_{\mathrm{CR}}$) from 1 million random tests (number of samples $N$, SNR, frequency, and noise). The error that results from using Jacobsen’s frequency estimator [1], the trick (1), or the refined trick (Figure 2) is further normalized by the Cramér-Rao lower bound of the estimation problem ($\Delta\omega_{\mathrm{CR}} = 2\sqrt{3}\,N^{-3/2}\,\mathrm{SNR}^{-1}$). The standard deviations of the three estimators are 1.5325, 1.3008, and 1.0092, respectively. PDF: probability density function.

Multiple Frequencies

This formula can easily be used in a multiple-frequency scenario provided that the frequencies to estimate are sufficiently separated. A straightforward procedure consists of first locating the isolated peaks of the DFT of the signal (e.g., using MATLAB’s findpeaks function) and then applying (1) to refine each frequency individually; see an example in Figure 4.

FIGURE 4. Multiple frequencies (three complex exponentials, 12 samples) can be accurately estimated by first locating the isolated peaks (e.g., using findpeaks in MATLAB) and then applying (1) individually. The plot shows the DTFT, the DFT samples, the uncertainty bands, the DTFT at the band endpoints, the estimated frequencies, and the ground-truth frequencies. The actual estimation errors of the three frequencies (from left to right) are roughly $(0.01, 0.02, 0.03) \times 2\pi/12$, well below the resolution $2\pi/12$ of the DFT; for comparison, the upper bound (4) provides the much more conservative values $(0.66, 0.45, 0.58) \times 2\pi/12$.

The estimation error of each frequency can be quantified by using the bound (3) where, in the absence of other noise, the data inaccuracy $\varepsilon$ in the neighborhood of that frequency is essentially caused by the tail of the DTFT of the other frequencies. An example of such a calculation is shown in the “Error Bound (4)” section in “Other Proofs,” leading to the following statement. Assume that the frequencies $\omega_k$ of the signal are distant from each other (modulo $(-\pi, \pi]$) by at least $\delta\omega > \pi/N$ and that the amplitude of the dominant sinusoid is $A$; then, the estimation error of any of the frequencies of the signal is bounded according to

$$|\bar{\omega}_k - \omega_k| \le \underbrace{\frac{2\pi}{N}}_{\text{DFT resolution}} \times \underbrace{\frac{2(K-1)\tan(\pi/(2N))}{\pi \sin((\delta\omega - \pi/N)/2)}\,\frac{A}{|a_k|}}_{\text{“super-resolution” coefficient}}. \tag{4}$$

Here, $\bar{\omega}_k$, $\omega_k$, and $|a_k|$ are the estimated frequency, the ground-truth frequency, and its amplitude, respectively. Despite its coarseness (see Figure 4), this inequality already demonstrates super-resolution potential since the “super-resolution” coefficient is usually smaller than one and, in fact, tends to zero when $N$ tends to infinity (for fixed $\delta\omega$). Empirically, a minimum value of $4\pi/N$ for $\delta\omega$, or two DFT bins, seems to be sufficient to obtain good frequency estimates. Of course, this cheap approach to high-resolution multifrequency estimation is not optimal, yet it could be used as the starting point of any iterative algorithm designed to maximize the likelihood of the problem.
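The multiple-frequency procedure (find the isolated DFT peaks, then apply (1) around each) can be sketched as follows. This is only an illustration under stated assumptions: the peak picker is a simple stand-in for MATLAB's findpeaks that keeps the `num_freqs` largest circular local maxima, the number of frequencies is assumed known, and the test signal (three unit-amplitude exponentials, 64 samples, at least 15 DFT bins apart) is hypothetical rather than the one of Figure 4.

```python
import cmath
import math


def dtft_mag(x, omega):
    """Magnitude |X(omega)| of the DTFT of the finite sequence x."""
    return abs(sum(xn * cmath.exp(-1j * n * omega) for n, xn in enumerate(x)))


def formula_1(m1, m2, omega1, omega2):
    """Formula (1): frequency estimate from the endpoint magnitudes m1, m2."""
    return 0.5 * (omega1 + omega2) + 2.0 * math.atan(
        math.tan(0.25 * (omega2 - omega1)) * (m2 - m1) / (m2 + m1))


def estimate_frequencies(x, num_freqs):
    """Refine the num_freqs strongest DFT peaks of x with formula (1)."""
    n = len(x)
    mags = [dtft_mag(x, 2.0 * math.pi * k / n) for k in range(n)]
    # Circular local maxima of the DFT magnitude (mags[-1] wraps around).
    peaks = [k for k in range(n)
             if mags[k] > mags[k - 1] and mags[k] >= mags[(k + 1) % n]]
    peaks = sorted(peaks, key=mags.__getitem__, reverse=True)[:num_freqs]
    estimates = []
    for k in peaks:
        w_dft = 2.0 * math.pi * k / n
        w1, w2 = w_dft - math.pi / n, w_dft + math.pi / n
        estimates.append(formula_1(dtft_mag(x, w1), dtft_mag(x, w2), w1, w2))
    return sorted(estimates)


# Three well-separated, off-grid frequencies, expressed in DFT bins.
N = 64
true_freqs = [2.0 * math.pi * b / N for b in (5.3, 20.3, 40.3)]
x = [sum(cmath.exp(1j * n * w) for w in true_freqs) for n in range(N)]

estimates = estimate_frequencies(x, len(true_freqs))
errors = [abs(e - w) for e, w in zip(estimates, true_freqs)]
resolution = 2.0 * math.pi / N  # DFT resolution, for comparison
```

In this setting every entry of `errors` is a small fraction of `resolution`, consistent with the bound (4); the errors are not zero, because the tails of the other exponentials perturb the two magnitudes used by (1).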

Conclusion

The frequency of a single complex exponential can be found exactly using the magnitude of only two samples of its DTFT, as this article shows. In the presence of noise or other inaccuracies, the trick that we provide is very robust and can even be iterated once to reach the theoretical optimum (Cramér-Rao lower bound), up to less than 1%. The robustness of this formula makes it possible to, e.g., refine the peaks of the DFT of a signal, but we also anticipate that it can be used as a tool for high-resolution frequency estimation. For teaching purposes, we provide a step-by-step proof that requires only undergraduate signal processing knowledge.

Other Proofs

Error Bound (3)

A direct proof of this inequality uses the fact that $|\arctan a - \arctan b| \le |a - b|$ and the triangle inequality $|a + b| \le |a| + |b|$. More specifically, denoting by $X_1, X_2$ the discrete-time Fourier transform (DTFT) of the “ground-truth” signal at $\omega_1, \omega_2$, and by $\varepsilon_1, \varepsilon_2$ the errors (caused by noise or otherwise) on $|X_1|, |X_2|$, we have

$$|\bar{\omega}_0 - \omega_0| = \left|2\arctan\left(\tan\left(\frac{\pi}{2N}\right)\frac{|X_1| + \varepsilon_1 - |X_2| - \varepsilon_2}{|X_1| + \varepsilon_1 + |X_2| + \varepsilon_2}\right) - 2\arctan\left(\tan\left(\frac{\pi}{2N}\right)\frac{|X_1| - |X_2|}{|X_1| + |X_2|}\right)\right|$$

$$\le 2\tan\left(\frac{\pi}{2N}\right)\Bigg|\underbrace{\frac{|X_1| + \varepsilon_1 - |X_2| - \varepsilon_2}{|X_1| + \varepsilon_1 + |X_2| + \varepsilon_2} - \frac{|X_1| - |X_2|}{|X_1| + |X_2|}}_{=\ \dfrac{2\varepsilon_1(|X_2| + \varepsilon_2) - 2\varepsilon_2(|X_1| + \varepsilon_1)}{(|X_1| + |X_2|)(|X_1| + \varepsilon_1 + |X_2| + \varepsilon_2)}}\Bigg|$$

$$\le \underbrace{2\tan\left(\frac{\pi}{2N}\right)\frac{\max(|2\varepsilon_1|, |2\varepsilon_2|)\,(|X_1| + \varepsilon_1 + |X_2| + \varepsilon_2)}{(|X_1| + |X_2|)(|X_1| + \varepsilon_1 + |X_2| + \varepsilon_2)}}_{=\ 4\tan\left(\frac{\pi}{2N}\right)\dfrac{\max(|\varepsilon_1|, |\varepsilon_2|)}{|X_1| + |X_2|}}$$

which leads to the inequality (3) after noticing that $|X_1| + |X_2| \ge N|a_0|$ (because $\omega_0 \in [\omega_1, \omega_2]$).

Error Bound (4)

Denoting by $\omega_1, \omega_2, \ldots, \omega_K$ the $K$ different frequencies and $a_1, a_2, \ldots, a_K$ the associated (complex-valued) amplitudes, the DTFT of the samples $x_n$ is given by

$$X(\omega) = \sum_{k=1}^{K} a_k \frac{e^{jN(\omega_k - \omega)} - 1}{e^{j(\omega_k - \omega)} - 1}.$$

Evaluating the estimation error of the frequency $\omega_{k_0}$ using (1) requires calculating the bound $\varepsilon$ in (3), i.e., the maximum error between $X(\omega)$ and the DTFT of a single-frequency model, when $|\omega - \omega_{k_0}| \le \pi/N$ (with the hypothesis that the minimum distance between $\omega_{k_0}$ and the other $\omega_k$ is at least $\delta\omega > \pi/N$):

$$\left|X(\omega) - a_{k_0}\frac{e^{jN(\omega_{k_0} - \omega)} - 1}{e^{j(\omega_{k_0} - \omega)} - 1}\right| = \left|\sum_{k \ne k_0} a_k \frac{e^{jN(\omega_k - \omega)} - 1}{e^{j(\omega_k - \omega)} - 1}\right|$$

$$\le \sum_{k \ne k_0} |a_k| \left|\frac{\sin(N(\omega_k - \omega)/2)}{\sin((\omega_k - \omega)/2)}\right| \quad \text{(using the triangle inequality and Euler’s formula)}$$

$$\le \sum_{k \ne k_0} \frac{A}{|\sin((\omega_k - \omega)/2)|} \quad \text{(denoting } A = \max_{k = 1 \ldots K} |a_k| \text{)}$$

$$\le \frac{(K-1)A}{\min_{k \ne k_0} |\sin((\omega_k - \omega)/2)|}$$

$$\le \frac{(K-1)A}{\sin((\delta\omega - \pi/N)/2)} \quad \text{(since } |\omega - \omega_{k_0}| \le \pi/N < \delta\omega \text{)}.$$

The right-hand side of the last inequality provides an upper bound for $\varepsilon$, which we can use in (3) to find

$$|\bar{\omega}_{k_0} - \omega_{k_0}| \le \frac{2\pi}{N} \times \frac{2(K-1)\tan(\pi/(2N))}{\pi \sin((\delta\omega - \pi/N)/2)}\,\frac{A}{|a_{k_0}|}.$$

Acknowledgment

Thierry Blu is the corresponding author.

Authors

Ruiming Guo (ruiming.guo@imperial.ac.uk) received his Ph.D. degree in electronic engineering at the Chinese University of Hong Kong (CUHK), Hong Kong Special Administrative Region, China, in 2021. He is currently a postdoctoral research associate at the Department of Electrical and Electronic Engineering, Imperial College London, SW7 2AZ London, U.K., working with Prof. Ayush Bhandari on computational imaging and modulo sampling. He worked as a postdoctoral research fellow with Prof. Thierry Blu at the Electronic Engineering (EE) Department at CUHK, from 2021 to 2022. He received the Postgraduate Student Research Excellence Award from the EE Department of CUHK in 2022. His research interests include sparse signal processing, sampling theory, inverse problems, modulo sampling, and computational sensing and imaging.

Thierry Blu ([email protected]) received his Ph.D. degree from Télécom Paris (ENST) in 1996. He is a professor in the Department of Electronic Engineering, the Chinese University of Hong Kong, Sha Tin New Territories, Hong Kong Special Administrative Region, China, where he has been since 2008. He received two Best Paper Awards (2003 and 2006) and is the coauthor of a paper that received a Young Author Best Paper Award (2009), all from the IEEE Signal Processing Society. His research interests include wavelets, approximation and sampling theory, sparse representations, biomedical imaging, optics, and wave propagation. He is a Fellow of IEEE.

References

[1] E. Jacobsen and P. Kootsookos, “Fast, accurate frequency estimators [DSP Tips & Tricks],” IEEE Signal Process. Mag., vol. 24, no. 3, pp. 123–125, May 2007, doi: 10.1109/MSP.2007.361611.
[2] Ç. Candan, “A method for fine resolution frequency estimation from three DFT samples,” IEEE Signal Process. Lett., vol. 18, no. 6, pp. 351–354, Jun. 2011, doi: 10.1109/LSP.2011.2136378.
[3] P. Stoica and A. Nehorai, “MUSIC, maximum likelihood, and Cramér-Rao bound,” IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 5, pp. 720–741, May 1989, doi: 10.1109/29.17564.


Shlomo Engelberg

Implementing Moving Average Filters Using Recursion

A Tip and a Trick

Moving average filters output the average of N samples, and it is easy to see (and to prove) that they are low-pass filters. A simple, causal moving average filter satisfies

$$y_n = \frac{1}{N} \sum_{k=0}^{N-1} x_{n-k}. \tag{1}$$

Because of their simplicity and intuitive appeal, they are often preferred to more complicated low-pass filters when one would like to remove high-frequency noise and the demands on the filter are not too great.

Digital Object Identifier 10.1109/MSP.2023.3294721
Date of current version: 3 November 2023

Background
In this column, we explain how implementing a moving average filter using a recursive formulation (where the current output is a function of the current and previous inputs and previous outputs) is possible. We then show why it can be problematic and how to deal with that problem (and this is the “tip”). Finally, we describe circumstances under which the problem does not exist even though one might have thought that it should (and this is the “trick”).

The Moving Average Filter
Let $x_k = x(kT_s)$, $T_s > 0$, be samples of $x(t)$. One can define a moving average filter by using (1), and this can be used as a template for implementing a moving average filter; looking at (1), one would be inclined to calculate each sample of the output by summing the last N values sampled and dividing the sum by N. One can, however, express the output of a moving average filter, $y_n$, as

$$
\begin{aligned}
y_n &= \frac{1}{N} \sum_{k=0}^{N-1} x_{n-k} \\
    &= x_n/N + \frac{1}{N} \sum_{k=1}^{N-1} x_{n-k} \\
    &= x_n/N + \frac{1}{N} \sum_{k=0}^{N-2} x_{(n-1)-k} \\
    &= x_n/N + \frac{1}{N} \sum_{k=0}^{N-1} x_{(n-1)-k} - x_{n-N}/N
\end{aligned}
$$

which is equivalent to

$$y_n = x_n/N + y_{n-1} - x_{n-N}/N \tag{2}$$

and this points to a different possible implementation. One can implement the filter by setting the current output, $y_n$, equal to the sum of the previous output value, $y_{n-1}$, and the current sample of the input divided by N, $x_n/N$, less the oldest sample of the input that was “part of” the previous value of the output divided by N, $x_{n-N}/N$. If one would like to minimize the number of operations needed to calculate each value of the output, $y_n$ (if an efficient implementation of the moving average filter is important), the method based on (2) is to be preferred. (See, for example, [2].)

Stability
Is a moving average filter stable? The short answer is that it must be. Considering (1), it is clear that the filter is a finite-impulse-response (FIR) filter, and all FIR filters are stable [1]. Considering the (two-sided) Z-transform of (2), we find that

$$Y(z) = X(z)/N + z^{-1} Y(z) - z^{-N} X(z)/N.$$

The transfer function of the filter would seem to be

$$H(z) = \frac{1}{N}\, \frac{1 - z^{-N}}{1 - z^{-1}}, \quad z \ne 0, 1.$$

Note, however, that the apparent singularity at z = 1 is removable, and in fact, it is not properly speaking a singular point of the transfer function. Dividing the numerator of H(z) by the denominator, one finds that

$$H(z) = \left(1 + z^{-1} + \cdots + z^{-(N-1)}\right)/N, \quad z \ne 0$$

and this is the version of the transfer function one arrives at if one starts from (1) and proceeds naïvely. Here it is clear that H(1) is perfectly well defined and is equal to 1. If one would like to write the transfer function in closed form, the correct way to do so is to write

$$
H(z) =
\begin{cases}
\dfrac{1}{N}\, \dfrac{1 - z^{-N}}{1 - z^{-1}}, & z \ne 0, 1 \\[1ex]
1, & z = 1.
\end{cases}
\tag{3}
$$

When written this way, it is clear that 1 is contained in the transfer function’s region of convergence, and this is yet another way to see that the filter is stable [1]. As we find in the section “An Implementation Issue With the Recursive Implementation,” when rounding errors are added to the picture, the system’s stability becomes a somewhat more involved question.
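The direct form (1) and the recursive form (2) can be compared side by side in code. The following is a minimal sketch, not the author's program: the names `ma_state`, `ma_init`, `ma_update`, and `ma_direct` and the length `MA_LEN` are hypothetical, samples before time 0 are assumed to be zero, and the running sum $N y_n$ is kept unscaled in integer arithmetic so that the two forms can be checked for exact agreement.

```c
#include <assert.h>
#include <stddef.h>

#define MA_LEN 32u  /* N, the number of samples averaged (assumed value) */

/* Hypothetical state for the recursive form (2): a circular buffer holding
   the last MA_LEN inputs, plus their running sum (equal to N * y_n). */
typedef struct {
    long buf[MA_LEN];
    long sum;      /* N * y_n, kept unscaled */
    size_t head;   /* slot holding the oldest sample, x_{n-N} */
} ma_state;

void ma_init(ma_state *s) {
    assert(s != NULL);
    for (size_t i = 0; i < MA_LEN; i++) s->buf[i] = 0;
    s->sum = 0;
    s->head = 0;
}

/* Recursive update: N*y_n = N*y_{n-1} + x_n - x_{n-N}. Returns N*y_n. */
long ma_update(ma_state *s, long x) {
    s->sum += x - s->buf[s->head];   /* add newest, drop oldest */
    s->buf[s->head] = x;             /* newest replaces oldest */
    s->head = (s->head + 1u) % MA_LEN;
    return s->sum;
}

/* Direct form (1): sum x[n], x[n-1], ..., treating x[k] as 0 for k < 0.
   Returns N*y_n so that it can be compared with ma_update(). */
long ma_direct(const long *x, size_t n) {
    long sum = 0;
    for (size_t k = 0; k < MA_LEN && k <= n; k++) sum += x[n - k];
    return sum;
}
```

The recursive update costs one addition and one subtraction per sample regardless of N, whereas the direct form costs up to N additions; dividing the returned value by `MA_LEN` gives $y_n$.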

1053-5888/23©2023IEEE

Rounding Errors in Floating Point Calculations
Floating point numbers such as C’s float data type are versatile, but when adding a small floating point number to a large one, one generally loses some accuracy. Floating point numbers generally have a certain number of bits allocated to storing a “number,” called the mantissa, and a certain number of bits allocated to storing an exponent, the power of two by which the mantissa, a binary fraction, is multiplied. If num is the number to be stored, we find that

$$\text{num} = \text{mantissa} \times 2^{\text{exponent}}.$$

The absolute value of the mantissa is often required to be greater than or equal to one and less than two, so that its first digit is always a one (which makes it possible not to store that binary digit). In order to fix ideas, consider a simple, nonpractical, example. Suppose one has four bits dedicated to the mantissa and two to the exponent. Consider the sum $1.01_b \times 2^{11_b} + 1.01_b \times 2^{00_b}$. Writing the summands out in binary, we find that we are adding $1010_b$ to $1.010_b$. As the mantissa is limited to four bits, the sum is $1011_b = 1.011_b \times 2^{11_b}$. Because of the limited number of bits the mantissa has, the sum is not accurate. It is off by $0.01_b$.
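The same loss of accuracy can be observed with C's float type. The sketch below assumes that float is an IEEE 754 single-precision number with a 24-bit significand and round-to-nearest-even, which holds on most current platforms but is not guaranteed by the C standard; the function name `add_f32` is hypothetical.

```c
#include <assert.h>

/* Add a small float to a large one. Storing the result in a volatile float
   forces it to be rounded to single precision instead of being kept in a
   wider register. */
float add_f32(float big, float small) {
    volatile float sum = big + small;
    return sum;
}
```

Under these assumptions, `add_f32(16777216.0f, 1.0f)` returns 16777216.0f: the sum $2^{24} + 1$ needs 25 significant bits, so the added 1 is rounded away. By contrast, `add_f32(8388608.0f, 1.0f)` returns 8388609.0f exactly because $2^{23} + 1$ still fits in 24 bits.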

An Implementation Issue With the Recursive Implementation
Suppose that one uses the recursive implementation based on (2), but at each stage, there is a small amount of noise caused by rounding errors. Then (2) becomes

$$y_n = x_n/N + y_{n-1} - x_{n-N}/N + r_n \tag{4}$$

where $r_n$ is the noise term. The Z-transform of (4) is

$$Y(z) = X(z)/N + z^{-1} Y(z) - z^{-N} X(z)/N + R(z)$$

or

$$N(z - 1) Y(z) = \left(z - z^{-(N-1)}\right) X(z) + N z R(z).$$

Thus, we find that

$$Y(z) = \frac{1}{N}\, \frac{z - z^{-(N-1)}}{z - 1}\, X(z) + \frac{z}{z - 1}\, R(z), \quad z \ne 0, 1.$$

It is clear that the first term on the right-hand side is the same as the transfer function in (3), and it is the transfer function of a stable system. The second term is, however, another story. This term truly has a single pole at z = 1, and this is the sign of a marginally stable system. In fact,

$$\frac{z}{z - 1}, \quad |z| > 1$$

is the transfer function of a summer. Thus, the output of a moving average filter that is implemented recursively and that starts operating at time n = 0 is composed of the desired output (related to the input signal, $x_n$) and the sum of all the previous rounding errors; it is

$$y_n = \frac{1}{N} \sum_{k=0}^{N-1} x_{n-k} + \sum_{k=0}^{n} r_k.$$

If there are no rounding errors, the last term does not cause any problems. If there are rounding errors, as one would expect when adding and subtracting floating point numbers, for example, then one expects to see the sum of these (generally very small) errors affect the output of the system by causing a small change to the output, and one expects the maximum size of this change to increase with time.

As $y_n$ is the sum of N terms of the form $x_n/N$ and of noise-related terms, $y_n$ should be substantially larger than any given $x_n/N$. As we saw in the section “Rounding Errors in Floating Point Calculations,” adding small floating point numbers to large ones can lead to rounding errors. Thus, (4), which includes a noise term, seems to be an appropriate model for a moving average filter implemented using floating point numbers; the moving average filter is marginally stable with respect to rounding errors, and we expect the output voltage to tend to drift slightly with time as the sum of the (very, very small) rounding errors changes (very slowly).

The Tip: Avoiding Error the Fixed-Point Way
If one stores all of one’s numbers as fixed-point numbers or integers, then as long as there is no overflow or malfunction when adding and subtracting, there is no “rounding error,” (2) is a complete description of the filter, and the filter is stable. Thus, if one works entirely in integer or fixed point arithmetic, the filter is properly characterized by (2), and one can implement the filter recursively without any stability issues.

Thus, the tip is to use some form of fixed point or integer arithmetic when implementing a moving average filter via recursion. When working in the C programming language, for example, use ints rather than floats in the moving average filter. (See [2], for example, for a somewhat different presentation of the problem and this solution.)

When implementing a moving average filter in C, it is often actually convenient to work with C ints rather than C floats. On the microcontroller we used (the ADuC841, a member of the 8051 family), when working with ints, it is convenient to store the (integer) values read from the analog to digital converter (ADC) in an array and not convert the values to voltages. When the time comes to output values to the digital to analog converter (DAC), the values are already properly scaled, as the ADC and DAC can be set to scale values in the same way. Working with ints makes a lot of sense from the point of view of effective and efficient programming, even if it leads to there being no point at which the int-based program “knows” the numerical value of the voltage of the input or the output (in volts) and to the engineer needing to work with unnormalized quantities and not volts.

The Trick: The Unreasonable Effectiveness of floats
We now consider a case where the tip turns out to be unnecessary, and that brings us to the trick. We wanted to write a program to demonstrate the problem with using floating point numbers when implementing a moving average filter recursively by implementing such a filter on an ADuC841, which has a 12-bit ADC. To calculate the voltage corresponding to the 12-bit value returned by the ADC, one must multiply the 12-bit value (when considered as an integer) by the ADuC841’s reference voltage, 2.5 V, and divide by $2^{12}$. In order to demonstrate the problem with using floats, we stored the measured voltages and used them (and not the 12-bit values returned by the ADC) in our calculations. We confidently expected to find that the output of the filter would be the expected output and a small “dc drift” because of the buildup of the rounding error. We did not find such a drift.

The reason is actually fairly clear and has to do with the way floating point operations are implemented. In the C we used, a floating point number is stored as a single sign-bit, (what is effectively) a 24-bit mantissa (where the mantissa is actually stored in 23 bits, and the first bit of the mantissa, the “unstored bit,” is always one), and an 8-bit exponent [3]. Multiplication by 2.5 is the same as adding twice a number to half the number, and for binary numbers, this is the same as shifting the number left by one and adding it to the number shifted right by one. Multiplying a floating point number by 2.5 lengthens its mantissa by two bits. As the number read from the ADC uses not more than 12 out of the 24 bits of the mantissa that it is “entitled to,” this is not a problem. Similarly, dividing by $4{,}096 = 2^{12}$ is simply changing the exponent (reducing it by 12), and again, no precision is lost. In total, at most 14 bits of the mantissa are in use, so even after adding 32 such numbers to one another (as we do in our moving average filter), no more than 19 of the 24 bits are in use. Adding the new measurement does not lead to any imprecision, and the subtractions also come off without a hitch.

If you want to see the effects of rounding error, divide by 3,000 (and multiply by it when necessary) instead of by 4,096. Then you should see a very slow drift in the “dc” value of the signal.

Numerical Example
We wrote a program that implements a moving average filter that averages the last 32 samples using C floats, and we examined the output of the filter when its input was a sine wave “riding on” 1 V (dc). The formula used to convert samples to floating point was

    sample_val = (ADCDATAH*256 + ADCDATAL) * 2.5 / CONVERSION_FACTOR;

where ADCDATAL contains the lower eight bits of the ADC reading, and ADCDATAH contains the upper four bits of the ADC reading (and in the program we wrote, its four most significant bits are always zero). At compile time, the user could choose to make CONVERSION_FACTOR either 4,096.0 (the correct conversion factor) or 3,000.0 (as though the correct conversion factor was 3,000.0) by making use of the following preprocessor commands and either leaving in or commenting out the first line (that #defines EXACT).

    #define EXACT
    #ifdef EXACT
    #define CONVERSION_FACTOR 4096.0
    #else
    #define CONVERSION_FACTOR 3000.0
    #endif

When CONVERSION_FACTOR was set to 4,096.0, we expected to see the dc offset remain constant. When CONVERSION_FACTOR was set to 3,000.0, we expected to see a very slow drift because of the information in the least significant bits that was ignored.

We measured the output of the system using the PicoScope 2204A. When setting CONVERSION_FACTOR to 4,096.0 and inputting a 10-Hz sine wave for a period of just over 2 h, we found a change in the dc value of about 1.0 mV (and from time to time, the dc value returned to the value from which it started). When using a CONVERSION_FACTOR of 3,000.0, we found that after a bit more than 1 h and 40 min, the dc value had increased by almost 17 mV.
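The exactness argument can be checked on a desktop compiler without the ADuC841. The sketch below is not the author's program: it assumes IEEE 754 single-precision floats, replaces the ADC with a hypothetical pseudorandom source of 12-bit codes, and uses the invented names `recursive_float_drift` and `N_AVG`. It runs the recursive form (2) in float and reports how far the result has wandered from the average recomputed directly from the last 32 stored samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_AVG 32u   /* filter length from the numerical example */

/* Run the recursive form (2) in single precision on synthetic 12-bit ADC
   codes. Returns (recursive output) - (average recomputed directly from the
   last N_AVG stored samples). conv plays the role of CONVERSION_FACTOR. */
double recursive_float_drift(float conv, size_t n_steps) {
    float buf[N_AVG] = {0.0f};   /* last N_AVG converted samples */
    size_t head = 0;             /* slot holding the oldest sample */
    volatile float y = 0.0f;     /* volatile: round to float at every step */
    uint32_t lcg = 12345u;       /* hypothetical stand-in for the ADC */

    assert(conv > 0.0f);
    for (size_t n = 0; n < n_steps; n++) {
        lcg = lcg * 1103515245u + 12345u;          /* simple LCG step */
        uint32_t raw = (lcg >> 16) & 0x0FFFu;      /* 12-bit code */
        volatile float x_new = (float)raw * 2.5f / conv;
        float x_old = buf[head];
        buf[head] = x_new;
        head = (head + 1u) % N_AVG;
        y = y + x_new / N_AVG - x_old / N_AVG;     /* equation (2) */
    }

    double direct = 0.0;         /* reference: equation (1) in double */
    for (size_t k = 0; k < N_AVG; k++) direct += (double)buf[k] / N_AVG;
    return (double)y - direct;
}
```

With `conv` set to 4096.0f, every intermediate value is a multiple of a fixed power of two that fits in the 24-bit significand, so the returned drift is exactly zero however many steps are run. With 3000.0f the conversion itself is rounded, and the returned value shows whatever rounding error the recursion has accumulated on the platform at hand.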


Conclusions
In this column, we provide a description of a problem that moving average filters can suffer from, and we explain why that problem should not actually rear its head if the values read in from the A/D are stored as integers and why we expect trouble when they are stored as floating point numbers. The tip, which is not new, is to use ints rather than floats. We then explained why under certain conditions, one can store the numbers as floats and not suffer any ill effects. In particular, when using the ADuC841, the most natural formula for conversion from the integer value of the A/D reading to the floating point value of the measured voltage leads to no loss of accuracy and no rounding errors, and even adding 32 such measurements does not bring us to the point where there will be any rounding errors. To receive a copy of the code used to implement the moving average filter, please send an e-mail to [email protected].

Author
Shlomo Engelberg ([email protected]. il) received his bachelor’s and master’s degrees in engineering from The Cooper Union, New York, and his Ph.D. degree in mathematics from New York University’s Courant Institute. He is a member of the Department of Electrical and Electronics Engineering, Jerusalem College of Technology, Jerusalem 9116001, Israel. His research interests include applied mathematics, instrumentation and measurement, signal processing, coding theory, and control theory. He is a Senior Member of IEEE.

References

[1] S. Engelberg, Digital Signal Processing: An Experimental Approach. London, U.K.: Springer-Verlag, 2008.
[2] S. W. Smith, The Scientist and Engineer’s Guide to Digital Signal Processing, 2nd ed. San Diego, CA, USA: California Technical Publishing, 1999. Accessed: Dec. 18, 2022. [Online]. Available: https://www.analog.com/en/education/education-library/scientist_engineers_guide.html
[3] “Floating-point numbers.” ArmKeil. Accessed: Dec. 19, 2022. [Online]. Available: https://www.keil.com/support/man/docs/c51/c51_ap_floatingpt.htm

Yeonwoo Jeong, Behnam Tayebi, and Jae-Ho Han

Sub-Nyquist Coherent Imaging Using an Optimizing Multiplexed Sampling Scheme

Several techniques have been developed to overcome the limitation of sensor bandwidth for 2D signals [1]. Though compressive sensing is an attractive technique that reduces the number of measurements required to record information on a sparse signal basis [2], [3], recording information beyond the Nyquist frequency remains difficult when working with nonsparse signals. Given this constraint, this article focuses on the use of the physical bandwidth of a coherent signal in the complex form instead of its intensity form. The resulting trick combines holographic multiplexing with sampling scheme optimization to obtain the information in a 2D coherent signal from beyond the Nyquist frequency range. The prerequisites for understanding this article are a knowledge of basic algebra and the Fourier transform. Familiarity with holography is also beneficial.

Digital Object Identifier 10.1109/MSP.2023.3310710
Date of current version: 3 November 2023

Background
As shown in Figure 1, when two light waves, $E_1$ and $E_2$, approach each other and finally meet at the recording point, P, this situation can be mathematically described by using the principle of superposition of electromagnetic fields and the wave intensity at that point. The principle of superposition in optics states that when multiple waves overlap in a medium, the resultant amplitude is equal to the algebraic sum of the individual wave amplitudes. The intensity is defined as the radiant power density of the light signal detected by a device such as a camera or an eye, and it can be expressed as the time average of the wave amplitude squared. As this article deals only with coherent light, electromagnetic waves can be assumed to maintain a fixed phase displacement over a period of time. Thus, the principle of superposition and the intensity form the basis for recording the interference of coherent light waves and comprise the core of holography [4].

[Figure 1: two light waves, $E_1$ and $E_2$, travel distances $d_1$ and $d_2$ to the recording point P. The panel lists $E_P = E_1 + E_2$ and $I_P = E_P E_P^* = I_1 + I_2 + I_{12}$, where $I_1 = E_1 E_1^*$, $I_2 = E_2 E_2^*$, $I_{12} = E_1 E_2^* + E_2 E_1^*$, and $*$ denotes the complex conjugate.]

FIGURE 1. Interference of two light waves and the intensity. Here d is the distance traveled by each beam, $\omega$ is the angular frequency, k is the angular wavenumber, and $\varphi$ represents the phase of the light at time 0. The intensity at point P has the interference term, $I_{12}$, in addition to the individual intensities.

The holography technique was developed to preserve the depth information in an object signal, which cannot be captured by a normal camera [5]. When an object image is captured by an image sensor, the recorded intensity is proportional to the square of the object signal amplitude, causing the phase information of the object signal to be lost. By contrast, holography maintains the phase information using a reference signal. For example, a microscopy technique using holography, as illustrated in Figure 2, can be divided into two setup components: magnification and frequency modulation. Light from the coherent light source is scattered and passed through the object, and the objective lens magnifies the resulting object signal. The reference beam, $E_R$, can be subsequently produced using a grating to duplicate the magnified signal and reorient it to a specific angle. The two signals after grating are physically transformed into the frequency domain at the Fourier plane, while an analog filter removes all information except the center intensity of the duplicated signal to obtain the reference signal, $E_R$. Finally, a second lens is applied to cause the original signal to interfere with the reference signal, with the intensity of this combined signal, $|E_R + E_O|^2$, recorded by the image sensor. The recorded object signal $E_O$ can be represented with its complex magnitude, $|E_O(x, y)|$, and the phase component, $e^{i\theta_O(x, y)}$, while the magnitude and the phase of the recorded reference signal, $E_R$, are $|E_R|$ and $e^{i\theta_R x}$, respectively, where $\theta_R$ is obtained by dividing $2\pi$ by the grating period.

As indicated by the recorded frequency scheme shown in Figure 2(b), the radius of the physical bandwidth circle describing the object signal intensity is twice that of the object signal amplitude because the Fourier transform of the two multiplied signals is equivalent to the convolution operation between the individual Fourier transformed signals. The phase information of the object is preserved in the bandwidths of the twin sidelobes, expressed as $\tilde{U}_O(f_x \pm \theta, f_y)$, simply reflecting the original frequency information of the object shifted by $\pm\theta$. Therefore, in contrast to the case of direct imaging [Figure 2(a)], which preserves only the magnitude of the object signal, holography also facilitates the full recovery of the phase information from the object signal by programmatically extracting the bandwidth from one of the sidelobes [6].

[Figure 2: optical setup (coherent light source, object, objective lens, grating, first lens, analog filter at the Fourier plane, second lens, image sensor) and the resulting frequency-domain layouts. $U_0$, $K_p$, and $k_p$ denote the ranges of the sensor bandwidth for direct imaging, the physical bandwidth of the object signal intensity, and the physical bandwidth of the object signal amplitude, respectively; $f_{\max}$ is the maximum digitized frequency, and $\Lambda$ is the grating period.]

FIGURE 2. The frequency scheme of (a) direct imaging and (b) a single hologram with an optical system.

According to the Nyquist theorem, distortion can be avoided in digital intensity recording systems by ensuring that the sampling rate used to digitize a signal is at least twice the maximum frequency range denoting the physical bandwidth of the signal. Consequently, when an object signal is captured directly, the frequency range of the sensor bandwidth $U_0$ in Figure 2(a) should be at least four times the frequency range denoting the physical bandwidth of the object signal amplitude $k_p$ (i.e., at least twice the frequency range denoting the physical bandwidth of the object signal intensity $K_p$). As a result, the full sensor bandwidth required to record the signal intensity without distortion should be more than 16 (4 × 4) times the physical bandwidth of the amplitude, as indicated by the blue box in Figure 2(a); thus, the cost of direct intensity recording is extremely high. The situation is more adverse when considering the Nyquist theorem for a single hologram. As shown in Figure 2(b), the recorded intensity and its Fourier transform can be respectively expressed as follows:

$$I = E_R E_R^* + E_O E_O^* + E_O E_R^* + E_R E_O^* \tag{1}$$

$$F(I) = F\{E_R E_R^*\} + F\{E_O E_O^*\} + F\{E_O E_R^*\} + F\{E_R E_O^*\}. \tag{2}$$

The frequency information recorded by the sensor includes the amplitudes of the twin sidelobes in the single-hologram scheme shown in Figure 2(b) expressed as the last two terms on the right side of (2). In addition, the direct intensity information of the object, which is represented as the central lobe in Figure 2(b), is expressed as the second term on the right side of (2). After applying the Nyquist theorem, the radii of the sidelobes double, causing the required frequency range for the digitizing sensor in the single hologram scheme, defined as $U_1$ in Figure 2(b), to be at least twice that in the direct-imaging scheme [$U_0$ in Figure 2(a)], and expanding the required sensor bandwidth accordingly. Therefore, this article aims to introduce a solution to increase the ratio of the recorded amplitude bandwidth to the sensor bandwidth to overcome the inefficiencies otherwise associated with the Nyquist constraint.
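Returning to the two-wave picture of Figure 1, substituting the wave forms given there into the interference term makes its dependence on the path difference explicit. The following short derivation is a sketch that assumes real, positive amplitudes $|E_1|$ and $|E_2|$ and uses the symbols from the Figure 1 caption:

```latex
\begin{align*}
E_1 E_2^{*}
  &= |E_1||E_2|\, e^{-i(kd_1 - \omega t + \varphi_1)}\, e^{+i(kd_2 - \omega t + \varphi_2)}
   = |E_1||E_2|\, e^{-i[k(d_1 - d_2) + \varphi_1 - \varphi_2]}, \\
I_{12}
  &= E_1 E_2^{*} + E_2 E_1^{*}
   = 2\,\mathrm{Re}\{E_1 E_2^{*}\}
   = 2|E_1||E_2| \cos\!\big(k(d_1 - d_2) + \varphi_1 - \varphi_2\big).
\end{align*}
```

The $\omega t$ terms cancel, so for coherent light the interference term does not vary in time; it depends only on the path difference $d_1 - d_2$ and the initial phase difference $\varphi_1 - \varphi_2$.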

Solution Two primary techniques have been developed to improve the available bandwidth with respect to the total frequency area of an image sensor: one based on frequency multiplexing [7] and the other based on sampling scheme optimization [8]. Frequency multiplexing is accomplished by obtaining multiple holograms using one reference beam. With this approach, the recorded intensity of N independent holograms at the image sensor can be expressed as   I = ; E R + E 1 + E 2 + g + E N ; 2 (3) = E R E )R +

N

N

N

/ / E i E )j + / E R E i)

i=1 j=1

   

i=1

N

+

/ E i E )R .

i=1

Furthermore, the following theoretical constraint is set to ensure that the sum of the N hologram signals is equal to the total object signal:

EO =

N

/ E i .(4)

i=1

Thus, when N = 4, as illustrated in Figure 3, frequency multiplexing of the holograms can be realized by magnifying and separating the object images into four distinct patches and recording them simultaneously by overlapping the associated signals. Compared to the normal imaging in Figure 3(a), as magnification is inversely proportional to the recorded frequency range, the frequency range required to capture the object decreases according to the magnification of the image, as shown in Figure 3(b). In addition, overlapping the signals enables the image sensor, which uses the same size as for direct imaging, to capture the full area of the object without decreasing the field of view (FOV), i.e., the recorded image area, which is possible because the bandwidth of each signal has been IEEE SIGNAL PROCESSING MAGAZINE

|

November 2023

|

located in a separate area using the holography technique, facilitating the reconstruction of the original object image by extracting each of these signals from the frequency area and combining them together, as shown in Figure 3(c). However, increasing the number of holograms does not necessarily decrease the size of the dead zone in the sensor frequency domain. Indeed, the sensor bandwidth including the sensor bandwidth for the two holograms in Figure 4(c) is larger than that for direct imaging, as shown in Figure 4(a) (i.e., U 2 2 U 0), with a considerable portion of the frequency domain remaining unutilized. Figure 4(d) shows a geometrically optimized scheme using a single hologram that exploits the repetitive pattern of the Fourier domain to improve the utilization of the frequency; however, U 3 2 U 0 . Notably, none of the sampling schemes in Figure 4(b)–(d) can record amplitude information beyond the Nyquist constraint, regardless of their utilization of the frequency domain. A relationship between bandwidths must be established to optimize the bandwidth available for recording the amplitude of the signal beyond the Nyquist frequency. Here, the required total bandwidth is defined as the sum of the bandwidths of sidelobes of the same size, and each bandwidth in the 2D frequency domain is defined as the square of the length along a single direction of frequency range. Thus, the maximum frequency range along one axis of the total bandwidth k d can be written as

k 2d = N ^k 2d, ih (5)

where k d, i represents the maximum frequency range along one axis of sidelobe i and can be expressed by

k d, i = k d . (6) N

The relationship between U0 and kd,i can be expressed by

U 0 = ck d, i (7)

where c denotes the optimal geometrical factor of the 2D sampling scheme. 83

As kd can be considered the digitized amplitude of the signal, denoted Kp, (7) can be rewritten as

U0 =

c

N

K p / CK p . (8)

Note that (8) relies upon the quantitative relationship between the maximum range along one axis of the sensor bandwidth and the maximum amplitude range of the signal, expressed by c. Thus, the ratio of c to the square root of the number of holograms can be defined as the effective coherent Nyquist factor, C, to quantify the efficiency of the sensor bandwidth utilization. The ideal case in a 1D scheme can be evaluated to provide a logical basis for optimization in a higher dimensional scheme. As shown in Figure 5, assuming that the sidelobes for N holograms are tightly packed in the 1D Fourier domain, the relationship between the maximum frequency range of the sensor, U0,1d, and that of the digital bandwidth of each sidelobe, kd,i,1d, can be expressed as

yielding the optimal geometrical factor c 1d = 2 ^ N + 1 h . Furthermore, as K p, 1d = Nk d, i, 1d , (9) can be rewritten as

U 0, 1d = 2 ^ N + 1h k d, i, 1d (9)

U 0, 1d =

2 ^ N + 1h K d, i, 1d (10) N

indicating that the effective coherent Nyquist factor for the 1D case, $C_{1d}$, is $2(N+1)/N$. Therefore, for a large number of holograms, the single-axis range of the sensor cannot be less than twice that of the single intensity because only an infinite number of holograms ($N \to \infty$) can achieve a $C_{1d}$ value of 2. The higher dimensional scheme optimization does not exhibit such simple behavior. For example, in contrast to the nonoptimized scheme in Figure 4(c), Figure 4(e) shows the optimal frequency-sampling scheme for a two-hologram technique, which can be easily produced based on the fact that the radius of the central lobe should be twice that of the sidelobes, corresponding to a c of 2.71. Accordingly, the relationship between the frequency range along the axis of the digitized amplitude signal and its minimum frequency range on one axis of the sensor bandwidth can be expressed as $U_0 = 1.92\,K_p$, where $1.92 \approx 2.71/\sqrt{2}$. As the resulting C is less than 2, this optimization achieves the sub-Nyquist condition. Similarly, Figure 4(f) and (g) illustrate the optimal sampling schemes for three and four holograms with $U_0 = 1.87K_p$ and $U_0 = 1.73K_p$, respectively. In an extreme case with 16 tightly packed holograms [Figure 4(i)], the values of c and C are 6 and 1.5, respectively. Indeed, all optimized schemes in Figure 4(e)–(i) can be used to design a sub-Nyquist coherent imaging system.
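The factors quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming only the definitions given in the text ($C = c/\sqrt{N}$ for the 2D scheme and $C_{1d} = 2(N+1)/N$ for the tightly packed 1D scheme); the function names are hypothetical and chosen for illustration.

```python
import math


def effective_nyquist_factor(c, n_holograms):
    """2D case: C = c / sqrt(N), with c the geometrical factor of the scheme."""
    return c / math.sqrt(n_holograms)


def effective_nyquist_factor_1d(n_holograms):
    """Tightly packed 1D case: C_1d = 2 (N + 1) / N."""
    return 2 * (n_holograms + 1) / n_holograms


# Two optimized holograms, c = 2.71 (value quoted in the text)
c_two = effective_nyquist_factor(2.71, 2)

# Sixteen tightly packed holograms, c = 6 (value quoted in the text)
c_sixteen = effective_nyquist_factor(6, 16)

# The 1D factor only approaches 2 as N grows
c_1d_values = [effective_nyquist_factor_1d(n) for n in (1, 10, 1000)]

print(round(c_two, 2), c_sixteen, c_1d_values)
```

The 1D values stay above 2 for every finite N, which is the sense in which the 1D arrangement cannot reach the sub-Nyquist condition while the 2D arrangements can.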

Discussion

To evaluate the efficiency of the proposed method, the percentage of the sensor occupied by the required bandwidth for digitizing the signal, relative to that occupied by the bandwidth required for the direct imaging, can be expressed as

N



T = 100 #

k 2d

c / k d, i m 2

i=1 k 2d, 0

(11)

where $k_{d,0}$ is $k_d$ for the direct imaging scheme.

FIGURE 3. Comparison of direct imaging and the method using frequency multiplexing. (a) Direct imaging, (b) direct imaging of the magnified object, and (c) imaging the combined four patches of the object using frequency multiplexing with image reconstruction. FOV: field of view.


Table 1 compares the efficiencies of different sampling schemes when recording frequency information using the same sensor. According to the Nyquist theorem, the metric T should remain less than 25%. However, the values of T for the multiplexed holography based on the sampling schemes presented in Figure 4(c)–(i) are 12.5%, 15.26%, 27.23%, 28.76%, 33.41%, 44.44%, and 33.33%, respectively. Therefore, except for the first four cases listed in Table 1, the frequency

FIGURE 4. 2D frequency schemes of (a) direct imaging, (b) single hologram without optimization, (c) two holograms without optimization, (d) single hologram with optimization, and (e) two holograms with optimization. The sampling schemes are optimized for (f) three, (g) four, (h) 12, and (i) 16 holograms. The physical bandwidth of the intensity and amplitude of the single hologram are shown in brown and gray, respectively, and its digitized intensity and amplitude are shown in violet and black, respectively. The digitized intensity and amplitude of the multiplexed hologram are shown in light and dark green, respectively. SH: single hologram; MH: multiple holograms.

FIGURE 5. The ideal case in the 1D scheme.


information was successfully recorded beyond the Nyquist constraint. As mentioned previously, increasing the number of holograms can be expected to reduce the value of C to less than two. The ideal C for the sampling scheme can be inferred from Figure 4(i), and it can be expressed as

$$C \approx \sqrt{2}\left(\frac{\sqrt{N+2}}{\sqrt{N}}\right). \tag{12}$$
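As a quick numerical check, the sketch below evaluates (12) taken in the form $C \approx \sqrt{2}\sqrt{(N+2)/N}$, which reproduces $C = 1.5$ for 16 holograms and tends to $\sqrt{2}$ for large N. The helper `occupancy_percent` uses $T = 100/C^2$; this relation is an assumption inferred from the tabulated pairs of C and T in Table 1 (for example, C = 2 with 25% and C = 4 with 6.25%), not a formula stated in the text. Both function names are hypothetical.

```python
import math


def ideal_nyquist_factor(n_holograms):
    """Ideal effective coherent Nyquist factor, reading (12) as sqrt(2 (N + 2) / N)."""
    return math.sqrt(2.0) * math.sqrt((n_holograms + 2) / n_holograms)


def occupancy_percent(c_factor):
    """Assumed relation T = 100 / C^2, inferred from the values in Table 1."""
    return 100.0 / c_factor ** 2


for n in (1, 2, 4, 16, 10_000):
    c_factor = ideal_nyquist_factor(n)
    print(n, round(c_factor, 3), round(occupancy_percent(c_factor), 2))
```

Under these assumptions, the occupancy rises toward 50% only as the number of holograms grows without bound, in line with the limit discussed in the text.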

Thus, for an infinite number of holograms, $C = \sqrt{2}$, and the maximum value of T is 50%. However, as this value corresponds to an infinite sensor area with an infinite number of holograms, it cannot be achieved owing to the presence of an autocorrelation term and the resolution limits in the frequency domain. Therefore, the lower limit of the effective coherent Nyquist factor is $\sqrt{2}$. Figure 6 depicts C as a function of the number of captured holograms. The red squares show the optimal value of C manually obtained for the Fourier domain schemes using the single and multiple holograms in Figure 4(d)–(g). The results in Table 1 indicate that an increase in N causes the value of C to decrease toward a lower limit of $\sqrt{2}$, demonstrating that multiplexing the signal and optimizing the sampling scheme significantly improve the quantity of information recorded by the sensor.

Table 1. Comparison of different sampling schemes.

No.   2D Frequency Scheme   Number of Holograms   C      T        Ref.
1     Figure 4(a)           Direct imaging        2      25%      –
2     Figure 4(b)           1                     4      6.25%    [6]
3     Figure 4(c)           2                     2.83   12.5%    [7]
4     Figure 4(d)           1                     2.56   15.26%   [8]
5     Figure 4(e)           2                     1.92   27.23%   [9]
6     Figure 4(f)           3                     1.87   28.76%   –
7     Figure 4(g)           4                     1.73   33.41%   –
8     Figure 4(h)           12                    1.5    44.44%   –
9     Figure 4(i)           16                    1.73   33.33%   –
10    –                     Infinite hologram     1.41   50%      –

FIGURE 6. Effective coherent Nyquist factor (C) as a function of the number of captured holograms. The red squares (manual) show the optimal value of C for recording one, two, three, and four holograms. (Horizontal axis: number of holograms; vertical axis: effective coherent Nyquist factor; plotted series: manual values and (12).)

As an example implementation, Figure 7 shows a sub-Nyquist coherent system using four holograms. In the first stage of the optical system, the object signal is magnified and divided into four distinct patches using an objective lens and masks. Although the FOV decreases in inverse proportion to the magnification, in this implementation, all patch signals are represented by holograms and finally gathered at the image sensor. If only one reference beam is used as in (3), the sideband cannot be moved independently. Therefore, the light path is split using mirrors and beam splitters to create as many reference beams as there are holograms. Accordingly, the modified intensity expression is given by

$$I = \left|E_{R_1} + E_{S_1} + E_{R_2} + E_{S_2} + E_{R_3} + E_{S_3} + E_{R_4} + E_{S_4}\right|^2 = \sum_{i=1}^{4}\sum_{j=1}^{4} E_{R_i}E_{R_j}^{*} + \sum_{i=1}^{4}\sum_{j=1}^{4} E_{S_i}E_{S_j}^{*} + \sum_{i=1}^{4}\sum_{j=1}^{4} E_{R_i}E_{S_j}^{*} + \sum_{i=1}^{4}\sum_{j=1}^{4} E_{S_i}E_{R_j}^{*} \tag{13}$$

where $E_{S_i}$ and $E_{R_i}$ represent the object and reference signals, respectively. Each reference and object patch signal can be realized using two gratings and an analog filter. As the grating replicates the incoming light at specified angles and different intensities along a single axis, two gratings with appropriate specifications can set the required modulation frequency in the 2D Fourier domain. In the physical Fourier plane, the signals duplicated by the gratings are passed through analog filters, leaving only two signals. The four zeroth-order signals are preserved as they represent the object signal, while the first-order signals remain as they represent the modulation frequency. Thus, using this technique, four different sets of object patches and their corresponding reference signals can be provided, allowing the sidebands to be located independently. The first and second terms of (13) represent the bandwidth of the central circle, whereas the third and fourth terms represent the twin sidelobe bandwidths. The required information is contained in the four sidebands, each capturing a patch of the object image. However, according to (13), as the four holograms share the same reference beam and are therefore intermingled, they must be separated. For example, as indicated by

$$\sum_{i=1}^{4} E_{S_i}E_{R_1}^{*} = \left(E_{S_1} + E_{S_2} + E_{S_3} + E_{S_4}\right)E_{R_1}^{*}, \tag{14}$$

the bandwidth of the first hologram carried by the first reference signal is colocated with the other three holograms because they are also carried by the first reference signal.

To solve this problem, each set of object and reference signals must be marked to avoid interference. Therefore, the optical system in Figure 7 uses two light sources with different wavelengths, polarization beam splitters, and dichroic mirrors. The dichroic mirror is a specially designed mirror that reflects light in a specific band, and the polarization beam splitter is a specially designed beam splitter that splits light into two orthogonally polarized beams. These components are used because two light sources with different wavelengths do not interfere with each other, and neither do two orthogonally polarized beams. Exploiting these properties of light, the two dichroic mirrors at the front of the optical system split one light path into two paths with different wavelengths. Next, the polarization beam splitter before the masks splits each light path again into two orthogonally polarized beams. Therefore, each reference beam ends up carrying only the corresponding patch signal, and (14) is changed to

$$\sum_{i=1}^{4} E_{S_i}E_{R_1}^{*} = E_{S_1}E_{R_1}^{*}, \quad \because\; E_{S_2}E_{R_1}^{*} = E_{S_3}E_{R_1}^{*} = E_{S_4}E_{R_1}^{*} = 0. \tag{15}$$
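The vanishing cross terms in (15) can be illustrated numerically. The sketch below is a toy model under stated assumptions: fields are represented as two-component Jones vectors, the interference term is taken as the inner product of the object field with the conjugated reference field, and the two-wavelength case is modeled as the time average of a beat term over a whole number of beat periods with arbitrary normalized frequencies. All names and numerical values are hypothetical.

```python
import numpy as np


def cross_term(e_s, e_r):
    """Interference term: sum over polarization components of E_S * conj(E_R)."""
    # np.vdot conjugates its first argument
    return np.vdot(e_r, e_s)


# Jones vectors for two orthogonal linear polarizations
horizontal = np.array([1.0, 0.0], dtype=complex)
vertical = np.array([0.0, 1.0], dtype=complex)

e_r1 = horizontal                          # first reference beam
e_s1 = 0.6 * np.exp(0.3j) * horizontal     # object patch sharing its polarization
e_s2 = 0.8 * np.exp(1.1j) * vertical       # object patch with orthogonal polarization

same = cross_term(e_s1, e_r1)              # survives: carries the hologram
orthogonal = cross_term(e_s2, e_r1)        # vanishes: no interference

# Two different wavelengths: the detector integrates the beat term, which
# averages out over an integer number of beat periods (toy frequencies 5 and 7).
t = np.arange(1000) / 1000.0
beat = np.mean(np.exp(2j * np.pi * (5 - 7) * t))

print(abs(same), abs(orthogonal), abs(beat))
```

In this model, only the matched object and reference pair contributes a sideband, which is the behavior (15) describes for each reference beam.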

In summary, after dividing the magnified object signal into several patches using masks, an optimal scheme can be realized by freely moving the sidebands via the gratings and manipulating the interference by selecting or changing the state of light using multiple light sources and specially designed optical components. Likewise, as the number of holograms increases, the additional components required to avoid interference among holograms, and the space to install them, inevitably increase.
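The multiplexing and reconstruction steps summarized above (separate sidebands, mask the frequency domain, move the sideband to the center, inverse Fourier transform) can be sketched in one dimension. This is a simplified illustration rather than the authors' implementation: it assumes two band-limited complex patches, plane-wave references with integer carrier frequencies, and no interference between different object and reference pairs, as in (15). The function names, the carrier choice, and the signal sizes are hypothetical.

```python
import numpy as np


def bandlimited_signal(n, half_bw, rng):
    """Random complex signal whose spectrum occupies FFT bins -half_bw..half_bw."""
    spectrum = np.zeros(n, dtype=complex)
    bins = np.arange(-half_bw, half_bw + 1)
    spectrum[bins % n] = rng.normal(size=bins.size) + 1j * rng.normal(size=bins.size)
    return np.fft.ifft(spectrum)


def record(signals, carriers):
    """Sum of hologram intensities |E_S + E_R|^2, one reference per patch."""
    n = signals[0].size
    x = np.arange(n)
    intensity = np.zeros(n)
    for s, f in zip(signals, carriers):
        reference = np.exp(-2j * np.pi * f * x / n)
        intensity += np.abs(s + reference) ** 2
    return intensity


def recover(intensity, carrier, half_bw):
    """Mask the sideband at +carrier, move it to the center, and invert."""
    n = intensity.size
    spectrum = np.fft.fft(intensity)
    bins = np.arange(-half_bw, half_bw + 1)
    centered = np.zeros(n, dtype=complex)
    centered[bins % n] = spectrum[(bins + carrier) % n]
    return np.fft.ifft(centered)


rng = np.random.default_rng(0)
n, half_bw = 256, 8
# The central (intensity) lobe spans bins -2*half_bw..2*half_bw, so the first
# sideband starts just outside it and the second is packed right after the first.
carriers = [3 * half_bw + 1, 5 * half_bw + 2]
patches = [bandlimited_signal(n, half_bw, rng) for _ in carriers]
sensor = record(patches, carriers)
recovered = [recover(sensor, f, half_bw) for f in carriers]

print([bool(np.allclose(r, p)) for r, p in zip(recovered, patches)])
```

Because the sidebands are packed next to the central lobe without overlapping it, each patch is recovered from the single recorded intensity; moving a carrier closer than `3 * half_bw + 1` would make the sideband overlap the central lobe and corrupt the reconstruction.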

FIGURE 7. An example implementation of the sub-Nyquist coherent system using four holograms. It consists of three parts: magnification, multiplexing with optimization, and recording. The optimized sampling scheme for four holograms can be implemented by physically splitting and combining the magnified object signal.

Conclusions

This article presented a trick for realizing sub-Nyquist coherent imaging based on an optimized multiplexing hologram scheme. The Nyquist theorem dictates that the digital imaging bandwidth requires four times the frequency range denoting the physical bandwidth of the object signal amplitude to avoid distortion as the intensity is proportional to the square of the amplitude when using coherent light sources. Thus, the full sensor bandwidth must be at least 16 (4 × 4) times the bandwidth of the original object signal amplitude to record intensity, wasting significant frequency space. To overcome this limitation, a sub-Nyquist imaging scheme was proposed by exploiting multiplexed holograms with optimization. Several optimized schemes were presented based on two to 16 sub-Nyquist holograms, and a theoretical effective coherent Nyquist factor limit of $\sqrt{2}$ was derived from the ideal cases. Finally, an example implementation of the proposed sub-Nyquist coherent imaging technique was provided using four holograms with two different coherent light sources.

Acknowledgment

This work was supported by the Ministry of Science and ICT, Korea, under the Information Technology Research Center support program (IITP-2023-RS-2022-00156225) and under the ICT Creative Consilience program (IITP-2023-2020-0-01819) supervised by the Institute for Information and Communications Technology Planning and Evaluation (IITP). Correspondence should be sent to Behnam Tayebi and Jae-Ho Han (corresponding authors). Yeonwoo Jeong and Behnam Tayebi contributed equally to this work.

Authors

Yeonwoo Jeong (forresearch4220@gmail.com) received his B.S. degree from the School of Electrical Engineering, Korea University, Seoul, South Korea, in 2017. He is currently pursuing his Ph.D. degree with the Department of Brain and Cognitive Engineering, Korea University, Seoul 02841, South Korea. His research interests include artificial intelligence and signal processing.

Behnam Tayebi (behnamty@gmail.com) received his Ph.D. degree in applied physics and optics from Yonsei University, Seoul, South Korea, in 2015. He was with Korea University as a research professor, and he is currently working as an optical engineering lead at Inscopix, Mountain View, CA 94043 USA. His current research interests include nanoimaging, holography, and signal processing.

Jae-Ho Han ([email protected]) received his Ph.D. degree in electrical and computer engineering from Johns Hopkins University, Baltimore, MD, USA. He is a full professor with the Department of Brain and Cognitive Engineering, Korea University, Seoul 02841, South Korea. His current research interests include novel optical imaging technologies and image processing for various fields in biomedicine and neuroscience research. He is a Member of IEEE.

References

[1] M. Mishali and Y. C. Eldar, “Sub-Nyquist sampling,” IEEE Signal Process. Mag., vol. 28, no. 6, pp. 98–124, Nov. 2011, doi: 10.1109/MSP.2011.942308.
[2] R. G. Baraniuk, “Compressive sensing,” IEEE Signal Process. Mag., vol. 24, no. 4, pp. 118–121, Jul. 2007, doi: 10.1109/MSP.2007.4286571.
[3] H. E. A. Laue, “Demystifying compressive sensing,” IEEE Signal Process. Mag., vol. 34, no. 4, pp. 171–176, Jul. 2017, doi: 10.1109/MSP.2017.2693649.
[4] F. L. Pedrotti, L. M. Pedrotti, and L. S. Pedrotti, “Interference of light,” in Introduction to Optics, 3rd ed. London, U.K.: Pearson Education, 2006.
[5] J. W. Goodman, “Holography,” in Introduction to Fourier Optics, 2nd ed. New York, NY, USA: McGraw-Hill, 1996.
[6] B. Bhaduri et al., “Diffraction phase microscopy: Principles and applications in materials and life sciences,” Adv. Opt. Photon., vol. 6, no. 1, pp. 57–119, 2014, doi: 10.1364/AOP.6.000057.
[7] P. Girshovitz and N. T. Shaked, “Doubling the field of view in off-axis low-coherence interferometric imaging,” Light Sci. Appl., vol. 3, no. 3, Mar. 2014, Art. no. e151, doi: 10.1038/lsa.2014.32.
[8] K. Ishizuka, “Optimized sampling schemes for off-axis holography,” Ultramicroscopy, vol. 52, no. 1, pp. 1–5, Sep. 1993, doi: 10.1016/0304-3991(93)90017-R.
[9] B. Tayebi, F. Sharif, A. Karimi, and J.-H. Han, “Stable extended imaging area sensing without mechanical movement based on spatial frequency multiplexing,” IEEE Trans. Ind. Electron., vol. 65, no. 10, pp. 8195–8203, Oct. 2018, doi: 10.1109/TIE.2018.2803721.

SP EDUCATION

Sharon Gannot, Zheng-Hua Tan, Martin Haardt, Nancy F. Chen, Hoi-To Wai, Ivan Tashev, Walter Kellermann, and Justin Dauwels

Data Science Education: The Signal Processing Perspective

Digital Object Identifier 10.1109/MSP.2023.3294709
Date of current version: 3 November 2023
1053-5888/23©2023IEEE

In the last decade, the signal processing (SP) community has witnessed a paradigm shift from model-based to data-driven methods. Machine learning (ML) methodologies, and more specifically deep learning, are nowadays widely used in all SP fields, e.g., audio, speech, image, video, multimedia, and multimodal/multisensor processing, to name a few. Many data-driven methods also incorporate domain knowledge to improve problem modeling, especially when computational burden, training data scarceness, and memory size are important constraints. Data science (DS), as a research field, emerged from several scientific disciplines, namely, mathematics (mainly statistics and optimization), computer science, electrical engineering (primarily SP), industrial engineering, biomedical engineering, and information technology. Each discipline offers an independent teaching program in its core domain with a segment dedicated to DS studies. In recent years, numerous institutes worldwide have started to provide dedicated and comprehensive DS teaching programs with diverse applications.

Motivation and significance

We believe that there is a unique SP perspective of DS that should be reflected in the education given to our students. Moreover, we think that now is the right time to start defining our needs and inspirations that will reflect the direction the field of SP will take in years to come. In this article, following a successful panel at IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2022) held in Singapore, we focus on these education aspects and draft a manifesto for an SP-oriented DS curriculum. We hope this article will encourage discussions among SP educators worldwide and promote new teaching programs in the field.

DS, ML, and SP: Interrelations

DS is an interdisciplinary field that can be taught from different perspectives. Indeed, DS-oriented material can be a segment of many existing teaching programs in science, technology, engineering, and mathematics. In this article, we aim at the more ambitious task of defining a complete and comprehensive teaching program in DS that takes the unique SP perspective. To put things in context, SP is concerned with extracting information and knowledge from signals. Common SP tasks are the analysis, modification, enhancement, prediction, and synthesis of signals [see also https://signalprocessingsociety.org/volunteers/constitution (Article II)]. In parallel to the evolution of the SP methodology, we are witnessing a fast-growing interest in the field of ML. ML is not a new field of knowledge. Perhaps its most widely known definition dates back to 1959 (paraphrased from Arthur Samuel [1]): “Learning algorithms to build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so.” ML is, thus, a method of data analysis that uses algorithms to enable computer systems to identify patterns in data, learn from them, and make predictions or decisions based on that learning.

For the SP community, data come in the form of signals. While the definition of signals as the carriers of information remains unchanged, the variety of signal types is rapidly growing. Signals can be either 1D or multidimensional; can be defined over a regular grid (time or pixels) or on an irregular graph; can be packed as vectors, matrices, or higher dimensional tensors; and can represent multimodal data. As discussed, a significant component of SP is dedicated to extracting, representing, and transforming (raw) data to information that accentuates certain properties beneficial to downstream tasks. While, traditionally, SP focuses on processing raw data that have a physical grounding on planet Earth [e.g., audio, speech, radar, sonar, image, video, electrocardiogram, electroencephalogram, magnetoencephalography, and econometric data], one may not need to be limited to this standard practice.

A broader and more general definition of signals should include semantic data. Semantic information ultimately stems from the cognitive space in the human mind, which originates from neurophysiological activities in our brains. Cognitive neuroscience is currently not advanced enough to pinpoint how to map semantic information represented in a text to brain activation. Still, this limitation does not prevent one from applying the essence of SP approaches to understanding, representing, and modeling text data or, more generally, semantic information. (Text is, ultimately, just a human-made representation for encoding language and knowledge.) Moreover, multimodal signals are jointly analyzed and processed in some modern applications. Audiovisual SP is an excellent example of two physical signals that are jointly processed. Image captioning is a good example that involves both physical signals and semantic information and, hence, should be processed using methodologies adopted from both computer vision and natural language processing (NLP) disciplines. We, therefore, claim that the ICASSP 2020 motto, “From Sensors to Information, at the Heart of Data Science,” can be further extended to all types of data: physical, which is indeed captured by sensors, as well as cognitive and semantic. The essence of the processing tasks and the underlying methods remain similar.

The principles of DS education from SP and ML perspectives

This section is dedicated to our view of the essential principles of DS education. Among other topics, we highlight the importance of SP and ML in the DS discipline.

SP and ML methods

Traditionally, we may think of two complementary lists of DS methods stemming from the SP and ML disciplines. A noncomprehensive list can include the following:
■ SP: convolution, time-frequency analysis (Fourier transform and wavelets), linear systems, state-space representations, the fusion of modalities and sensors, Wiener and Kalman filters, and graph SP.
■ ML: (variational) expectation maximization, deep learning, reinforcement learning, end-to-end processing, attention, transformers, graphical neural networks, generative models, dimensionality reduction, kernel methods, subspace, and manifold learning.

This dichotomy between the lists is rather artificial. Recent trends have shown that these two paradigms are converging and are now strongly interrelated by routinely borrowing ideas and practices from each other. We believe that modern teaching programs should, therefore, emphasize the SP and ML aspects of DS without sacrificing other essential and fundamental elements, namely, optimization, statistics, linear algebra, multilinear (tensor) algebra, artificial intelligence (AI), algorithms, data handling, transmission and storage, and programming skills. From the SP perspective, rigorous training in DS should bring students to think more fundamentally about where the data at hand come from; what the data points and distributions represent; and how to model, sample, represent, and visualize such information robustly so that it is insensitive to various sources and types of noise for different applications and tasks. Just as important as the technical skills, students must become aware of ethical issues related to DS, e.g., the privacy of the data, biases in collecting the data, and the implications their future techniques and developments might impose on society and humanity.

Teaching methodologies

All modern teaching programs, and perhaps DS teaching programs specifically, should give special attention to teaching methodologies that can be more relevant and attractive to the younger generation of students. While we certainly do not claim that “traditional” teaching methods (namely, a teacher lecturing in front of a class) should be abandoned, we encourage educators to incorporate diverse teaching techniques in their curricula. A nonexhaustive list of teaching methodologies may include online courses, labs with interactive programming exercises, flipped classrooms, and hands-on experience that may involve projects and teamwork. As the DS discipline is vast and cannot usually be fully covered by one institute, we encourage educators to consider student exchange programs and joint programs between universities (especially with other countries) and to include internships in the industry. Needless to say, science has no borders, and students will greatly benefit from learning in different schools and listening to many points of view from world-leading experts in their respective disciplines.

Learning outcomes

The graduates of the program are expected to master the theory and practice (including programming skills) of modern and classical SP and ML tools for handling various types of data, most notably, data that originate from signals. They are also expected to thoroughly understand the field’s underlying mathematical and statistical foundations as well as related fields, e.g., data handling, storage (databases and clouds), and transmission (over the network), including reliability and privacy preservation. With rigorous training in SP and ML, graduates will be able to identify and apply the correct tools for DS problems. Graduates should specialize in several advanced topics in the general field of DS and become acquainted with several domain-specific applications. Graduates will, thus, be able to address complex DS problems considering ethical aspects and the sustainability of our global environment.

DS undergraduate curriculum: A proposal from the SP perspective

In this section, we draft a proposed curriculum for DS studies from the SP perspective. We are, of course, aware of the different education systems around the world. Nevertheless, we hope that such a list can serve as a source of inspiration to educators and policy leaders in academic institutes. In the following, we propose a four-year program (in Europe, it is common to have three years of undergraduate studies plus two years of graduate studies) comprising three layers:
■ mandatory: a strong background in math and statistics, hands-on programming skills, basic data handling and AI, SP and ML, and ethics
■ elective tracks: data sharing and communication over networks; advanced algorithms and optimization; security, reliability, and privacy preservation; and ML and DL hardware and software tools
■ DS applications: in diverse domains.

We next discuss each layer in detail and give a list of relevant courses. Naturally, each institute will pave its own way toward the most suitable curriculum.

Mandatory areas (with lists of proposed courses)

We believe that each student should be extensively exposed to the field’s theoretical foundations and develop basic hands-on and programming skills:
■ mathematics: calculus, linear algebra, combinatorics, set theory and logic, harmonic analysis, differential equations (ordinary and partial), numerical analysis, numerical algebra, multilinear algebra, algebraic structures, optimization, and complex functions
■ statistics: probability theory, statistics, random processes, information theory, parameter estimation, and statistical theory
■ computer skills and algorithms: programming basics, data structures and algorithms, Python (including libraries and packages such as PyTorch, NumPy, SciPy, and more), object-oriented programming, computer architecture, computability, and cloud computing
■ hands-on: labs and tools as well as annual projects with real data
■ data handling and AI: introduction to DS (including the data processing cycle), meetings with industry (R&D in DS, ethics, practical and real-world problems, and needs), data analysis and visualization, data mining, data representations, and introduction to AI
■ SP and ML: representations and types of signals and systems, SP in the time-frequency domain (Fourier and wavelet transform, filter banks), ML and pattern recognition, statistical algorithms in SP, statistical and model-based algorithms in ML, adaptive SP, generative models, supervised and unsupervised learning, deep learning, time series and sequences analysis and processing, graphical models, and ML operations
■ ethics: ethical and legal aspects of DS, explainability, General Data Protection Regulation, bias, privacy, and approval processes.

Elective specialization tracks (with lists of proposed courses)

Students should elect courses from two or three specialization tracks to advance their knowledge in the field. Specialization tracks may include advanced SP, ML, and optimization algorithms (we split the SP and ML courses into two lists: basic materials and an elective specialization track); dedicated DS-related hardware and software tools; and data sharing and storing methodologies considering security and privacy preservation:
■ data sharing and communication over networks: detection theory, communication, wireless communication, ML for communications, computer networks, mathematical analysis of networks, social networks, cloud data handling, and federated learning
■ advanced algorithms and optimization: online algorithms, advanced algorithms, streaming algorithms, big data, quantum learning, graph theory, advanced databases, game theory, deterministic and stochastic methods in operations research, analysis and mining of processes, distributed computation, and cloud computing
■ advances in SP and ML: array SP, blind source separation and independent component analysis, data fusion (multiple sensors/modalities), reinforcement learning, distributed processing over networks (federated learning), graph SP, and graph neural networks
■ security, reliability, and privacy preservation: coding, cryptography, privacy-preserving computing and communications, safe computing, and anomaly detection
■ ML and DL hardware and software tools: digital signal processors; field programmable gate arrays; CPUs; GPUs; neuromorphic processing systems; parallel computing architectures; parallel computing platform and application programming interface (CUDA); and Python, C, and C++ computer languages.

Domain-specific DS applications
This track offers a noncomprehensive list of courses in knowledge domains that extensively apply DS tools. Students are encouraged to take several courses from this list to become acquainted with real-life applications: econometrics, business intelligence, smart cities, blockchain and cryptocurrency, electrooptics, materials, bioinformatics, AI in health care and medical data mining, biomedical SP, DS in brain imaging, audio/speech analysis and processing, music SP and music information retrieval, NLP, image processing and computer vision, computer graphics, wireless communications, and autonomous vehicles. [Students are required to choose only a small number of courses (e.g., three or four) from the list to become acquainted with several domains that apply DS methodologies.]

Summary and further reading
In this article, we proposed a DS curriculum focusing on SP and ML. We believe such a program can be relevant to many educators and researchers in the IEEE Signal Processing Society. This article follows a panel held at ICASSP’22 [2]. There have been several attempts to define the DS discipline and the required curriculum for a major in DS. Interested readers may refer to recent reports by the U.S. National Academies of Sciences, Engineering, and Medicine [3]; Park City Math Institute [4]; and Israeli Academy of Sciences and Humanities [5]. An overview of the history of DS, its prospective future, and some guidelines for educating in the discipline can be found in [6]. All these references address the DS discipline in general. In our article, we attempt to focus on the SP and ML perspectives. Readers are also referred to an interesting discussion between Prof. Alfred Hero and Prof. Anders Lindquist about the impact of ML on SP and control systems, which can be found online [7].

Several institutes worldwide already offer study programs in DS with an SP flavor. A nonexhaustive list of study programs follows. The electrical and computer engineering faculty at the University of Michigan offers an ML curriculum [8]. A new undergraduate program proposed by Bar-Ilan University, Israel, follows the guidelines proposed in this article. This program will open in the 2023–2024 academic year. (The full program in Hebrew can be found in [9].) A recent presentation on AI curriculum [10] explores several AI and DS teaching programs at both the undergraduate and graduate levels, including at the Technical University of Denmark [11], Carnegie Mellon [12], and the Massachusetts Institute of Technology [13]. Friedrich-Alexander University Erlangen-Nuremberg offers an elite M.Sc. degree program in advanced SP and communications [14]. The corresponding M.Sc. degree program in communications and SP has been taught at Ilmenau University of Technology since 2009 [15], and a similar M.Sc. degree program in signals and systems is offered at Delft University of Technology [16]. While not attempting to be exhaustive, this list demonstrates the broad interest of leading academic institutes in developing study programs in SP-oriented DS for both the undergraduate and graduate levels. The authors of this article hope that the ideas and guidelines presented here can inspire DS and SP educators to develop new teaching programs in this fascinating field.

Acknowledgment
The authors are grateful to Prof. Mor Peleg from the University of Haifa, Israel, for fruitful discussions and for drawing our attention to some of the references listed in the article as well as to Dr. Ran Gelles for fruitful discussions about the new undergraduate program proposed by Bar-Ilan University, Israel.

Authors
Sharon Gannot (sharon.gannot@biu.ac.il) received his Ph.D. degree in electrical engineering from Tel-Aviv University, Israel, in 2000. He is a professor with the Faculty of Engineering, Bar-Ilan University, Ramat-Gan 5290002, Israel, where he heads the data science program; he also serves as the faculty vice dean and served as the deputy director of the Data Science Institute. He served as the chair of the Audio and Acoustic Signal Processing Technical Committee. He will be the general cochair of Interspeech, to be held in Jerusalem in 2024; currently, he is the chair of the IEEE Signal Processing Society (SPS) Data Science Initiative and a member of the SPS Education Center Editorial Board and the EURASIP Signal Processing for Multisensor Systems TAC. He also serves as associate editor and senior area chair for several journals. He is a Fellow of IEEE.

Zheng-Hua Tan ([email protected]) received his Ph.D. degree in electronic engineering from Shanghai Jiao Tong University. He is a professor in the Department of Electronic Systems, the Machine Learning Research Group leader, and a cohead of the Centre for Acoustic Signal Processing Research at Aalborg University, 9220 Aalborg, Denmark, as well as a colead of the Pioneer Centre for AI, Denmark. He is an associate editor for the IEEE Journal of Selected Topics in Signal Processing inaugural special series on “Artificial Intelligence in Signal and Data Science - Toward Explainable, Reliable, and Sustainable Machine Learning.” He is a TPC vice chair for ICASSP 2024 and was the general chair for IEEE MLSP 2018 and a TPC cochair for IEEE SLT 2016. His work has been recognized by the prestigious IEEE Signal Processing Society 2022 Best Paper Award. His research interests include deep representation learning. He is a Senior Member of IEEE.

Martin Haardt (martin.haardt@tu-ilmenau.de) received his Doktor-Ingenieur (Ph.D.) degree from Munich University of Technology in 1996. He has been a full professor in the Department of Electrical Engineering and Information Technology and head of the Communications Research Laboratory at Ilmenau University of Technology, 98684 Ilmenau, Germany, since 2001. He received the 2009 Best Paper Award from the IEEE Signal Processing Society; the Vodafone (formerly Mannesmann Mobilfunk) Innovations Award for outstanding research in mobile communications; the ITG Best Paper Award from the Association of Electrical Engineering, Electronics, and Information Technology; and the Rohde & Schwarz Outstanding Dissertation Award. He has served as a senior editor for IEEE Journal of Selected Topics in Signal Processing since 2019. His research interests include wireless communications, array signal processing, high-resolution parameter estimation, and tensor-based signal processing. He is a Fellow of IEEE.

Nancy F. Chen ([email protected]) received her Ph.D. degree in biomedical engineering from the Massachusetts Institute of Technology and Harvard in 2011. She is a Fellow, Senior Principal Scientist, Principal Investigator and Group Leader at the Institute for Infocomm Research and Centre for Frontier AI Research, Agency for Science, Technology, and Research (A*STAR), Singapore, 138632. She leads research efforts in generative artificial intelligence with a focus on speech language technology with applications in education, healthcare, and defense. Speech evaluation technology from her team has been deployed at the Ministry of Education in Singapore to support home-based learning and led to commercial spinoffs. She has received numerous awards, including being named among the Singapore 100 Women in Tech in 2021, the Young Scientist Best Paper Award at MICCAI 2021, the Best Paper Award at SIGDIAL 2021, and the 2020 P&G Connect + Develop Open Innovation Award. She is currently the program chair of the International Conference on Learning Representations (ICLR), an IEEE Distinguished Lecturer, a Board Member of the International Speech Communication Association (ISCA), and a Senior Area Editor of IEEE/ACM Transactions on Audio, Speech, and Language Processing. She is a Senior Member of IEEE.

Hoi-To Wai ([email protected]) received his Ph.D. degree from Arizona State University (ASU) in electrical engineering in 2017. He is an assistant professor with the Department of Systems Engineering and Engineering Management at the Chinese University of Hong Kong, Hong Kong, China. His research interests include signal processing, machine learning, and distributed optimization, with a focus on their applications to network science. His dissertation received the 2017 Dean’s Dissertation Award from the Ira A. Fulton Schools of Engineering at ASU, and he was a recipient of a Best Student Paper Award at ICASSP 2018. He is a Member of IEEE.

Ivan Tashev (ivantash@microsoft.com) received his Ph.D. degree in computer science from the Technical University of Sofia, Bulgaria, in 1990. He is a partner software architect and leads the Audio and Acoustics Research Group in Microsoft Research, Redmond, WA 98052 USA; is an affiliate professor at the University of Washington in Seattle; and is an honorary professor at the Technical University of Sofia, Bulgaria. He also coordinates the Brain–Computer Interfaces project in Microsoft Research. He has published two books, two book chapters, and more than 100 scientific papers, and he is listed as an inventor for 50 U.S. patents. His research interests include audio signal processing, machine learning, multichannel transducers, and biosignal processing. He is a member of the Audio Engineering Society and the Acoustical Society of America and a Fellow of IEEE.

Walter Kellermann ([email protected]) received his Dr.-Ing. degree in electrical engineering from Technical University Darmstadt, Germany, in 1988. He is a professor of communications at the University of Erlangen-Nuremberg, 91058 Erlangen, Germany. His service to the IEEE Signal Processing Society includes Distinguished Lecturer (2007–2008), Chair of the Technical Committee for Audio and Acoustic Signal Processing (2008–2010), Member of the IEEE James L. Flanagan Award Committee (2011–2014), Member at Large SPS Board of Governors (2013–2015), Vice President Technical Directions (2016–2018), Member SPS Nominations and Appointments Committee (2019–2022), and Member of the SPS Fellow Evaluation Committee (2023–). He has served as the general chair of eight mostly IEEE-sponsored workshops and conferences. He is a corecipient of 10 best paper awards, was awarded the Julius von Haast Fellowship by the Royal Society of New Zealand in 2012, and received the Group Technical Achievement Award of the European Association for Signal Processing (EURASIP) in 2015. His research interests include speech signal processing, array signal processing, and machine learning, especially for acoustic signal processing. He is a fellow of EURASIP and a Life Fellow of IEEE.

Justin Dauwels (j.h.g.dauwels@tudelft.nl) received his Ph.D. degree in electrical engineering from the Swiss Polytechnical Institute of Technology in Zurich in 2005. He is an associate professor of signal processing systems, Department of Microelectronics, Delft University of Technology, 2628 CD Delft, The Netherlands. He is an associate editor of IEEE Transactions on Signal Processing, an associate editor of the Elsevier journal Signal Processing, a member of the editorial advisory board of International Journal of Neural Systems, and an organizer of IEEE conferences and special sessions. His research team has won several best paper awards at international conferences and from journals. His research interests include data analytics with applications to intelligent transportation systems, autonomous systems, and analysis of human behavior and physiology.

References

[1] A. L. Samuel, “Some studies in machine learning using the game of checkers,” IBM J. Res. Develop., vol. 3, no. 3, pp. 210–229, Jul. 1959, doi: 10.1147/rd.33.0210.
[2] S. Gannot, Z.-H. Tan, M. Haardt, N. F. Chen, H.-T. Wai, and I. Tashev, “Data science education: The signal processing perspective,” Panel at IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2022. [Online]. Available: https://rc.signalprocessingsociety.org/conferences/icassp-2022/SPSICASSP22VID1984.html?source=IBP
[3] Envisioning the Data Science Discipline: The Undergraduate Perspective. Washington, DC, USA: National Academy Press, 2018.
[4] R. D. De Veaux et al., “Curriculum guidelines for undergraduate programs in data science,” Annu. Rev. Statist. Appl., vol. 4, no. 1, pp. 15–30, Mar. 2017, doi: 10.1146/annurev-statistics-060116-053930.
[5] N. Ahituv, J. Ben-Dov, Y. Benjamini, Y. Bronner, Y. Dudai, D. Raban, and R. Sharan, Teaching Data Science in Universities in All Disciplines. Jerusalem, Israel: Israel Academy of Sciences and Humanities, 2020. [Online]. Available: https://www.academy.ac.il/SystemFiles2015/2-1-21-English.pdf
[6] D. Donoho, “50 years of data science,” J. Comput. Graphical Statist., vol. 26, no. 4, pp. 745–766, Aug. 2017, doi: 10.1080/10618600.2017.1384734.
[7] C. June, “Machine learning and systems: A conversation with 2020 Field Award winners Alfred Hero and Anders Lindquist,” Elect. Comput. Eng., Univ. of Michigan, Ann Arbor, MI, USA, Oct. 2019. [Online]. Available: https://ece.engin.umich.edu/stories/machine-learning-and-systems-a-conversation-with-2020-field-award-winners-al-hero-and-anders-lindquist
[8] C. June, “Teaching machine learning in ECE,” Elect. Comput. Eng., Univ. of Michigan, Ann Arbor, MI, USA, Mar. 2022. [Online]. Available: https://ece.engin.umich.edu/stories/teaching-machine-learning-in-ece
[9] “Data engineering - Bachelor’s degree,” Faculty Eng., Bar-Ilan Univ., Ramat Gan, Israel, 2023. [Online]. Available: https://engineering.biu.ac.il/datascience
[10] Z.-H. Tan, “On artificial intelligence curriculum and problem-based learning [Slides],” Aalborg Univ., Aalborg, Denmark, 2021. [Online]. Available: https://people.es.aau.dk/~zt/online/AI-curriculum-Tan.pdf
[11] [Online]. Available: https://www.dtu.dk/english/education/undergraduate/undergraduate-programmes-in-danish/bsc-eng-programmes/artificial-intelligence-and-data
[12] “B.S. in artificial intelligence,” School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, 2023. [Online]. Available: https://www.cs.cmu.edu/bs-in-artificial-intelligence/
[13] “Interdisciplinary programs,” Massachusetts Inst. Technol., Cambridge, MA, USA, 2022–2023. [Online]. Available: http://catalog.mit.edu/interdisciplinary/undergraduate-programs/degrees/
[14] “Elite master’s study programme: Advanced signal processing and communications engineering,” Inst. Digit. Commun., Erlangen, Germany, 2023. [Online]. Available: https://www.asc.studium.fau.de
[15] “Master of science in communications and signal processing,” TU Ilmenau, Ilmenau, Germany, 2023. [Online]. Available: https://www.tu-ilmenau.de/mscsp
[16] “Track: Signals & systems,” TU Delft, Delft, The Netherlands, 2023. [Online]. Available: https://www.tudelft.nl/en/education/programmes/masters/electrical-engineering/msc-electrical-engineering/track-signals-systems


SP COMPETITIONS
Davide Cozzolino, Koki Nagano, Lucas Thomaz, Angshul Majumdar, and Luisa Verdoliva

Synthetic Image Detection: Highlights from the IEEE Video and Image Processing Cup 2022 Student Competition

Digital Object Identifier 10.1109/MSP.2023.3294720
Date of current version: 3 November 2023

The Video and Image Processing (VIP) Cup is a student competition that takes place each year at the IEEE International Conference on Image Processing (ICIP). The 2022 IEEE VIP Cup asked undergraduate students to develop a system capable of distinguishing pristine images from generated ones. The interest in this topic stems from the incredible advances in the artificial intelligence (AI)-based generation of visual data, with tools that allow the synthesis of highly realistic images and videos. While this opens up a large number of new opportunities, it also undermines the trustworthiness of media content and fosters the spread of disinformation on the Internet. Recently, there has been strong concern about the generation of extremely realistic images by means of editing software that incorporates recent diffusion models [1], [2]. In this context, there is a need to develop robust and automatic tools for synthetic image detection.

In the literature, there has been an intense research effort to develop effective forensic image detectors, and many of them, if properly trained, appear to provide excellent results [3]. Such results, however, usually refer to ideal conditions and rarely withstand the challenge of real-world application. First of all, testing a detector on images generated by the very same models seen in the training phase leads to overly optimistic results. In fact, this is not a realistic scenario. With the evolution of technology, new architectures and different ways of generating synthetic data are continuously proposed [4], [5], [6], [7], [8]. Therefore, detectors trained on some specific sources will end up working on target data of a very different nature, often with disappointing results. In these conditions, the ability to generalize to new data becomes crucial to keep providing a reliable service. Moreover, detectors are often required to work on data that have been seriously impaired in several ways. For example, when images are uploaded to social networks, they are normally resized and compressed to meet internal constraints. These operations tend to destroy important forensic traces, calling for detectors that are robust to such events and whose performance degrades gracefully. To summarize, to operate successfully in the wild, a detector should be robust to image impairments and, at the same time, able to generalize well on images coming from diverse and new models.

In the scientific community, there is still insufficient (although growing) awareness of the centrality of these aspects in the development of reliable detectors. Therefore, we took the opportunity of this VIP Cup to push further along this direction. In designing the challenge, we decided to consider an up-to-date, realistic setting with test data including 1) both fully synthetic and partially manipulated images and 2) images generated by both established generative adversarial network (GAN) models and newer architectures, such as diffusion-based models. With the first dichotomy, we ask that the detectors be robust to the occurrence of images that are only partially synthetic, thus with limited data on which to base the decision. As for architectures, there is already a significant body of knowledge on the detection of GAN-generated images [9], but new text-based diffusion models are now gaining the spotlight, and generalization becomes the central issue.

With the 2022 IEEE VIP Cup, we challenged teams to design solutions that are able to work in the wild, as only a fraction of the generators used in the test data are known in advance. In this article, we present an overview of this challenge, including the competition setup, the teams, and their technical approaches. Note that all of the teams were composed of a professor, at most one graduate student (tutor), and undergraduate students (from a minimum of three to a maximum of 10 students).

Tasks, resources, and evaluation criteria

Tasks
The challenge consisted of two stages: an open competition (split into two phases), in which any eligible team could participate, and an invitation-only final. Phase 1 of the open competition was designed to provide teams with a simplified version of the problem at hand to familiarize themselves with the task, while phase 2 was designed to tackle a more challenging task: synthetic data generated using architectures not present in the training. The synthetic images included in phase 1 were generated using five known techniques, while the generative models used in phase 2 were unknown. For the final, the three highest-scoring teams from the open competition were selected and allowed to provide another submission, graded on a new test set. Information about the challenge is also available at https://grip-unina.github.io/vipcup2022/.

1053-5888/23©2023IEEE

Resources
Participants were provided with a labeled training dataset of real and synthetic images. In particular, the dataset available for phase 1 comprised real images from four datasets (FFHQ [4], ImageNet [17], COCO [18], and LSUN [19]), while synthetic images were generated using five known techniques: StyleGAN2 [11], StyleGAN3 [12], GLIDE [5], Taming Transformers [10], and inpainting with Gated Convolution [13]. All the images of the test data were randomly cropped and resized to 200 × 200 pixels and then compressed using JPEG at different quality levels. This pipeline was used to simulate a realistic scenario where images are randomly resized and compressed, as happens when they are uploaded to a social network. In addition, they all had the same dimensions to avoid leaking information on the generators used (some models only generate data at certain specific resolutions). Some examples of generated images used during the competition are shown in Figure 1. Teams were provided with Python scripts to apply these same operations to the training dataset.

For phase 2, no datasets were made available since the generative models in this case were unknown to the teams. However, participants were free to use any external data in addition to the competition data. Participants were also allowed to use any available state-of-the-art methods and algorithms to solve the problems of the challenge.

Teams were requested to provide executable code to the organizers to test the algorithms on the evaluation datasets. The Python code was executed inside a Docker container with a 16-GB GPU and a time limit of one hour to process a total of 5,000 images. The teams were allowed to submit their code and evaluate their performance five times during the period from 8 August to 5 September 2022.
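The test-data pipeline described above (random crop, resize to 200 × 200 pixels, JPEG compression at a varying quality level) can be sketched in a few lines of Python. This is only an illustration, not the organizers' script: the function names, the crop-size range (`min_frac`), and the JPEG quality range are assumptions chosen for the example, and the image step assumes the Pillow library is installed.

```python
import io
import random


def sample_crop_and_quality(width, height, rng, min_frac=0.6, quality_range=(60, 95)):
    """Draw a random square crop box and a JPEG quality factor.

    min_frac and quality_range are illustrative assumptions, not the
    settings used in the competition.
    """
    side = max(1, int(rng.uniform(min_frac, 1.0) * min(width, height)))
    left = rng.randint(0, width - side)
    upper = rng.randint(0, height - side)
    quality = rng.randint(*quality_range)
    return (left, upper, left + side, upper + side), quality


def simulate_upload(image, rng, out_size=200):
    """Random crop -> resize to out_size x out_size -> JPEG round trip.

    `image` is a PIL.Image.Image. Pillow is imported lazily so that the
    sampling logic above can be used and tested without it.
    """
    from PIL import Image

    box, quality = sample_crop_and_quality(image.width, image.height, rng)
    patch = image.convert("RGB").crop(box).resize((out_size, out_size), Image.BICUBIC)
    buffer = io.BytesIO()
    patch.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    degraded = Image.open(buffer)
    degraded.load()  # decode now, while the in-memory buffer is still alive
    return degraded
```

Applying the same function to every training image, with a seeded `random.Random` instance for reproducibility, gives training data that matches the fixed size and compression artifacts of the test images, which is the purpose the article attributes to the provided scripts.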

Evaluation criteria
The submitted algorithms were scored by means of balanced accuracy for the detection task (score = 0.7 × accuracy on phase 1 + 0.3 × accuracy on phase 2). The three highest-scoring teams from the open competition stage were selected as finalists. These teams had the opportunity to make an additional submission on 8 October on a new dataset and were invited to compete in the final stage of the challenge at ICIP 2022 on 16 October 2022 in Bordeaux. Due to some travel limitations, on that occasion, they could make a live or prerecorded presentation, followed by a round of questions from a technical committee. The event was hybrid to ensure a wide participation and allow teams who had visa issues to attend virtually. In the final phase of the challenge, the judging committee considered the following parameters for the final evaluation (maximum score was 12 points):
■■ the innovation of the technical solution (one to three points)
■■ the performance achieved in phase 1 of the competition, where only known models were used to generate synthetic data (one to three points)
■■ the performance achieved in phase 2 of the competition, where unknown models were used to generate synthetic data (one to three points)
■■ the quality and clarity of the final report, a four-page full conference paper in the IEEE format (one to three points)
■■ the quality and clarity of the final presentation (either prerecorded or live), a 15-min talk (one to three points).

FIGURE 1. Examples of synthetic images from the datasets used in the open competition. The first row shows samples from GLIDE [5], Taming Transformers [10], StyleGAN2 [11], StyleGAN3 [12], and inpainting with Gated Convolution [13]. The second row shows samples from BigGAN [14], DALL-e mini [6], Ablated Diffusion Model [15], Latent Diffusion [7], and LaMa [16]. The images in the fifth column are only locally manipulated (the regions outlined in red are synthetic).

2022 VIP Cup statistics and results
The VIP Cup was run as an online class through the Piazza platform, which allowed easy interaction with the teams. In total, we received 82 registrations for the challenge, 26 teams accessed the Secure CMS platform, and 13 teams made at least one valid submission. Teams were from 10 different countries across the world: Bangladesh, China, Germany, Greece, India, Italy, Poland, Sri Lanka, United States of America, and Vietnam. Figure 2 presents the accuracy results obtained by the 13 teams participating in the two phases of the open competition. First, we can observe that the performance on test set 1, which includes images from known generators, was much higher than that obtained in an open set scenario, where generators are unknown. More specifically, there were accuracy drops of around 10% for the best techniques, confirming the difficulty of detecting synthetic images coming from unknown models. Then, we noted that even for the simpler scenario, only four teams were able to achieve an accuracy above 70%, which highlights that designing a detector that can operate well on both fully and locally manipulated images is not an easy task.

FIGURE 2. The anonymized results in terms of accuracy of the 13 teams on the two open competition datasets. [Bar charts: accuracy on test set 1 (%) and accuracy on test set 2 (%), one bar per anonymized team ID.]

In Figure 3, we present some additional analyses of all of the submitted algorithms. Figure 3(a) aims at understanding how much computational complexity (measured by the execution time to process 10,000 images) impacts the final score. Interestingly, there is only a weak correlation between computation effort and performance, with methods that achieve the same very high score (around 90%) with very different execution times. Figure 3(b), instead, shows the results of each method on test set 1 and test set 2. In this case, a strong correlation is observed: if an algorithm performs well/poorly on test set 1, the same happens on test set 2, even if the datasets do not overlap and are completely separated in terms of generating models. Finally, in Figure 4, we study in some more detail the performance of the three best performing techniques, reporting the balanced accuracy for each method on each dataset. For test set 1 (known models), the most difficult cases are those involving local manipulations. The same holds for test set 2 (unknown models) with the additional problem of images fully generated using diffusion models, where performances are on average lower than those obtained on images created by GANs. We also provide results in terms of area under the receiver operating characteristic curve in Figure 5. In this situation, we can note that the first and second places reverse on test set 2, which underlines the importance of properly setting the threshold for the final decision. A proper choice of the validation set is indeed very important to carry out a good calibration.

Highlights of the technical approaches
In this section, we present an overview of the approaches proposed by all of the participating teams for the challenge. All proposed methods relied on learning-based approaches and trained deep neural networks on a large dataset of real and synthetic images. Many diverse architectures were considered: GoogLeNet, ResNet, Inception, Xception, DenseNet, EfficientNet, MobileNet, ResNeXt, ConvNeXt, and the more recent Vision Transformers. The problem was often treated as a binary classification task (real versus fake), but some teams approached it as a multiclass classification problem with the aim to increase the degrees of
freedom for the predicting model and also to include an extra class for unknown models. To properly capture the forensic traces that distinguish pristine images from generated ones, the networks considered multiple inputs, not just the RGB image. Indeed, it is well known that generators fail to accurately reproduce the natural correlation among color bands [20] and also that the upsampling operation routinely performed in most generative models gives rise to distinctive spectral peaks in the Fourier domain [21]. Therefore, some solutions considered as input the image represented in different color spaces, i.e., HSV and YCbCr, or computed the co-occurrence matrices on the color channels. Moreover, to exploit frequency-based features, two-stream networks were adopted, using features extracted from the Fourier analysis in the second stream. A two-branch network was also used to work both on local and global features, which were fused by means of an attention module as done in [22]. In general, attention mechanisms were included in several solutions. Likewise, the ensembling of multiple networks was largely used to increase diversity and boost performance. Different aggregation strategies were pursued with the aim to generalize to unseen models and favor decisions toward the real image class, as proposed in [23].

The majority of the teams trained their networks on the data made available for the challenge; however, some of them enlarged this dataset by generating additional synthetic images using new generative models, such as other architectures based on GANs and new ones based on diffusion models. Of course, including more generators during training helped to improve the performance, even if some approaches were able to obtain good generalization ability by adding only a few more models. In addition, augmentation was always carried out to increase diversity and improve generalization. Beyond standard operations, like image flipping, cropping, resizing, and rotation, most teams used augmentation based on Gaussian blurring and JPEG compression, found to be especially helpful in the literature [24], but also changes of saturation, contrast, and brightness, as well as CutMix and random cutout.

FIGURE 3. The results of all of the submitted algorithms: (a) score versus time and (b) accuracy on test set 1 versus accuracy on test set 2. [Scatter plots: (a) score (%) versus time (min); (b) accuracy (%) on test set 2 versus accuracy (%) on test set 1.]

FIGURE 4. The balanced accuracy of the three best performing methods on images from test set 1 and test set 2. [Bar charts of balanced accuracy (%) per generator and on average (AVG) for teams 27,686, 27,749, and 27,768; one panel per test set.]

FIGURE 5. The area under the receiver operating characteristic curve (AUC) of the three best performing methods on images from test set 1 and test set 2. [Bar charts of AUC (%) per generator and on average (AVG) for teams 27,686, 27,749, and 27,768; one panel per test set.]

Finalists
The final phase of the 2022 IEEE VIP Cup took place at ICIP in Bordeaux, on 16 October 2022. Figure 6 shows the members of the winning team while receiving the award. In the following, we describe the three finalist teams listed according to their final ranking: FAU Erlangen-Nürnberg (first place), Megatron (second place), and Sherlock
(third place). We will also present some details on their technical approach.

FAU Erlangen-Nürnberg

■ Affiliation: Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
■ Supervisor: Christian Riess
■ Tutor: Anatol Maier
■ Students: Vinzenz Dewor, Luca Beetz, ChangGeng Drewes, and Tobias Gessler
■ Technical approach: an ensemble of vision transformers pretrained on ImageNet-21k and fine-tuned on a large dataset of 400,000 images. To extract generalizable features, a procedure based on weighted random sampling was adopted during training, aimed at balancing the data distribution. Models included during training were the five known techniques StyleGAN2 [11], StyleGAN3 [12], GLIDE [5], Taming Transformers [10], and inpainted images with Gated Convolution [13]. In addition, images generated using DALL∙E [25] and VQGAN [10] were used.

FIGURE 6. The winning team (FAU Erlangen-Nürnberg) during the award ceremony at ICIP 2022 in Bordeaux.

Megatron

■ Affiliation: Bangladesh University of Engineering and Technology, Bangladesh
■ Supervisor: Shaikh Anowarul Fattah
■ Students: Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, and Zaber Ibn Abdul Hakim
■ Technical approach: a multiclass classification scheme and an ensemble of convolutional neural networks and transformer-based architectures. An extra class was introduced to detect synthetic images coming from unknown models. Knowledge distillation and test time augmentation were also included in the proposed solution. The training set included, beyond the five known techniques, additional images coming from the following generators: ProGAN [26], ProjectedGAN [27], CycleGAN [28], DDPM [29], Diffusion-GAN [30], Stable Diffusion [31], Denoising Diffusion GAN [32], and GauGAN [33].
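As a rough illustration of how a multiclass ensemble with an extra "unknown" class and test time augmentation can be reduced to a real-versus-synthetic decision, consider the sketch below. It is not the team's implementation: the class list, the stub models, the choice of a horizontal flip as the only augmentation, and all function names are assumptions made so that the example runs on its own.

```python
import numpy as np

# Assumed label set: one "real" class, the five known generators, and an extra
# class for synthetic images coming from unknown models.
CLASSES = ["real", "StyleGAN2", "StyleGAN3", "GLIDE",
           "Taming Transformers", "Gated Convolution", "unknown"]


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def make_stub_model(seed):
    """Hypothetical stand-in for a trained network: image -> class logits."""
    weights = np.random.default_rng(seed).normal(size=len(CLASSES))

    def model(image):
        # Use the left half only so that flipping the image changes the output.
        left_half = image[:, : image.shape[1] // 2]
        return weights * left_half.mean()

    return model


def predict_with_tta(model, image):
    """Average class probabilities over the image and its horizontal flip."""
    views = [image, image[:, ::-1]]
    return np.mean([softmax(model(view)) for view in views], axis=0)


def ensemble_predict(models, image):
    """Average the probabilities of all models, then collapse to real vs. fake."""
    probs = np.mean([predict_with_tta(m, image) for m in models], axis=0)
    p_fake = 1.0 - probs[0]   # every class except "real" counts as synthetic
    return probs, p_fake


models = [make_stub_model(seed) for seed in range(3)]
image = np.random.default_rng(42).random((32, 32))   # stand-in for a test image
probs, p_fake = ensemble_predict(models, image)
print(CLASSES[int(np.argmax(probs))], round(float(p_fake), 3))
```

Averaging probabilities rather than hard decisions keeps the attribution information available while still yielding a single synthetic-image score.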

Sherlock

■ Affiliation: Bangladesh University of Engineering and Technology, Bangladesh
■ Supervisor: Mohammad Ariful Haque
■ Students: Fazle Rabbi, Asif Quadir, Indrojit Sarkar, Shahriar Kabir Nahin, Sawradip Saha, and Sanjay Acharjee
■ Technical approach: a two-branch convolutional neural network that took as input features extracted in the spatial and in the Fourier domain. The adopted architectures were EfficientNet-b7 and MobileNet-v3. In addition, strong augmentation was performed, which included CutMix in addition to standard operations. During training, only the five known generation techniques were considered.

Conclusions

This article describes the 2022 VIP Cup that took place last October at ICIP. The aim of the competition was to foster research on the detection of synthetic images, in particular, focusing on images generated using the recent diffusion models [7], [8], [15], [34]. These architectures have shown an impressive ability to generate images guided by textual descriptions or pilot sketches, and there is very limited work on their detection [35], [36], [37]. Below, we highlight the main take-home messages that emerged from the technical solutions developed in this competition:

■ The best-performing models are pretrained very deep networks that rely on a large dataset of real and synthetic images coming from several different generators. Indeed, increasing diversity during training was a key aspect of the best approaches.
■ Augmentation represents a fundamental step to make the model more robust to post-processing operations and make it work in realistic scenarios.
■ Generalization is still a main issue in synthetic image detection. In particular, it has been observed that one main problem is how to set the correct threshold in the more challenging scenario of generators unseen during training.
■ The detection task can benefit from attribution, which aims at identifying the model that was used for synthetic generation.

We believe that the availability of the dataset (https://github.com/grip-unina/DMimageDetection) created during the challenge can stimulate research on synthetic image detection and motivate other researchers to work in this interesting field. The advancements in generative AI make the distinction between real and fake very thin, and it is very important to push the community to continuously search for effective solutions [38]. In particular, the VIP Cup has shown the need to develop models that can be used in the wild to detect synthetic images generated by new architectures, such as the recent diffusion models. In this respect, it is important to design explainable methods that can highlight which forensic artifacts the detector is exploiting [39]. We hope that more and more methods will be published in the research community and will be inspired by the challenge proposed in the 2022 IEEE VIP Cup at ICIP.

Acknowledgment

The organizers express their gratitude to all participating teams, to the local organizers at ICIP 2022 for hosting the VIP Cup, and to the IEEE Signal Processing Society Membership Board for the continuous support. Special thanks go to Riccardo Corvi and Raffaele Mazza from University Federico II of Naples, who helped to build the datasets. The authors also acknowledge the projects that support this research: DISCOVER within the SemaFor program funded by DARPA under Agreement FA8750-20-2-1004; Horizon Europe vera.ai funded by the European Union, Grant Agreement 101070093; a TUM-IAS Hans Fischer Senior Fellowship; and PREMIER, funded by the Italian Ministry of Education, University, and Research within the PRIN 2017 program. This work is also funded by FCT/MCTES through national funds and, when applicable, cofunded by EU funds under Projects UIDB/50008/2020 and LA/P/0109/2020.

Authors

Davide Cozzolino (davide.cozzolino@unina.it) is an assistant professor with the Department of Electrical Engineering and Information Technology, University Federico II, 80125 Naples, Italy. He was cochair of the IEEE CVPR Workshop on Media Forensics in 2020. He was part of the teams that won the 2013 IEEE Image Forensics Challenge (both detection and localization) and the 2018 IEEE Signal Processing Cup on camera model identification. His research interests include image processing and deep learning, with main contributions in multimedia forensics. He is a Member of IEEE.

Koki Nagano (knagano@nvidia.com) is a senior research scientist at NVIDIA Research, Santa Clara, CA 95051 USA. He works at the intersection of graphics and AI, and his research focuses on realistic digital human synthesis and trustworthy visual computing including the detection and prevention of visual misinformation. He is a Member of IEEE.

Lucas Thomaz (lucas.thomaz@co.it.pt) is a researcher at Instituto de Telecomunicações, 2411-901 Leiria, Portugal, and an associate professor in the School of Technology and Management, Polytechnic of Leiria, Leiria 2411-901, Portugal. He is a member of the Student Services Committee of the IEEE Signal Processing Society, supporting the VIP Cup and the IEEE Signal Processing Cup, and the chair of the Engagement and Career Training Subcommittee. He is a consulting associate editor for IEEE Open Journal of Signal Processing. He is a Member of IEEE.

Angshul Majumdar (angshul@iiitd.ac.in) received his Ph.D. from the University of British Columbia. He is a professor at Indraprastha Institute of Information Technology, New Delhi 110020, India. He has been with the institute since 2012. He is currently the director of the Student Services Committee of the IEEE Signal Processing Society. He has previously served the Society as chair of the Chapter’s Committee (2016–2018), chair of the Education Committee (2019), and member-at-large of the Education Board (2020). He is an associate editor for IEEE Open Journal of Signal Processing and Elsevier’s Neurocomputing. In the past, he was an associate editor for IEEE Transactions on Circuits and Systems for Video Technology. He is a Senior Member of IEEE.

Luisa Verdoliva ([email protected]) is a professor with the Department of Electrical Engineering and Information Technology, University Federico II, 80125 Naples, Italy. She was an associate editor for IEEE Transactions on Information Forensics and Security (2017–2022) and is currently deputy editor in chief for the same journal and senior area editor for IEEE Signal Processing Letters. She is the recipient of a Google Faculty Research Award for Machine Perception (2018) and a TUM-IAS Hans Fischer Senior Fellowship (2020–2024). She was chair of the IFS TC (2021–2022). Her scientific interests are in the field of image and video processing, with main contributions in the area of multimedia forensics. She is a Fellow of IEEE.

References

[1] A. Mahdawi, “Nonconsensual deepfake porn is an emergency that is ruining lives,” Guardian, Apr. 2023. [Online]. Available: https://www.theguardian.com/commentisfree/2023/apr/01/ai-deepfake-porn-fake-images

[2] J. Vincent, “After deepfakes go viral, AI image generator Midjourney stops free trials citing ‘abuse’,” Verge, Mar. 2023. [Online]. Available: https://www.theverge.com/2023/3/30/23662940/deepfake-viral-ai-misinformation-midjourney-stops-free-trials

[3] L. Verdoliva, “Media forensics and deepfakes: An overview,” IEEE J. Sel. Topics Signal Process., vol. 14, no. 5, pp. 910–932, Aug. 2020, doi: 10.1109/JSTSP.2020.3002101.

[4] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit., 2019, pp. 4396–4405, doi: 10.1109/CVPR.2019.00453.

[5] A. Q. Nichol et al., “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,” in Proc. Int. Conf. Mach. Learn., 2022, pp. 16,784–16,804.

[6] B. Dayma et al. “DALL-E mini.” GitHub. Accessed: Dec. 14, 2022. [Online]. Available: https://github.com/borisdayma/dalle-mini

[7] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit., 2022, pp. 10,684–10,695.

[8] Y. Balaji et al., “eDiff-I: Text-to-image diffusion models with ensemble of expert denoisers,” 2022, arXiv:2211.01324.

[9] D. Gragnaniello, D. Cozzolino, F. Marra, G. Poggi, and L. Verdoliva, “Are GAN generated images easy to detect? A critical analysis of the state-of-the-art,” in Proc. IEEE Int. Conf. Multimedia Expo, 2021, pp. 1–6, doi: 10.1109/ICME51207.2021.9428429.

[10] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit., 2021, pp. 12,873–12,883, doi: 10.1109/CVPR46437.2021.01268.

[11] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of StyleGAN,” in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit., 2020, pp. 8110–8119.

[12] T. Karras et al., “Alias-free generative adversarial networks,” in Proc. Adv. Neural Inf. Process. Syst., 2021, vol. 34, pp. 852–863.

[13] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Free-form image inpainting with gated convolution,” in Proc. IEEE/CVF Int. Conf. Comput. Vision, 2019, pp. 4471–4480, doi: 10.1109/ICCV.2019.00457.

[14] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” in Proc. Int. Conf. Learn. Representations, 2019, pp. 1–11.

[15] P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” in Proc. Adv. Neural Inf. Process. Syst., 2021, vol. 34, pp. 8780–8794.

[16] R. Suvorov et al., “Resolution-robust large mask inpainting with Fourier convolutions,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vision, 2022, pp. 3172–3182, doi: 10.1109/WACV51458.2022.00323.

[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848.

[18] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vision, 2014, pp. 740–755.

[19] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, “LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop,” 2015, arXiv:1506.03365.

[20] H. Li, B. Li, S. Tan, and J. Huang, “Identification of deep network generated images using disparities in color components,” Signal Process., vol. 174, Sep. 2020, Art. no. 107616, doi: 10.1016/j.sigpro.2020.107616.

[21] X. Zhang, S. Karaman, and S.-F. Chang, “Detecting and simulating artifacts in GAN fake images,” in Proc. IEEE Int. Workshop Inf. Forensics Secur., 2019, pp. 1–6, doi: 10.1109/WIFS47025.2019.9035107.

[22] Y. Ju, S. Jia, L. Ke, H. Xue, K. Nagano, and S. Lyu, “Fusing global and local features for generalized AI-synthesized image detection,” in Proc. IEEE Int. Conf. Image Process., 2022, pp. 3465–3469, doi: 10.1109/ICIP46576.2022.9897820.

[23] S. Mandelli, N. Bonettini, P. Bestagini, and S. Tubaro, “Detecting GAN-generated images by orthogonal training of multiple CNNs,” in Proc. IEEE Int. Conf. Image Process., 2022, pp. 3091–3095, doi: 10.1109/ICIP46576.2022.9897310.

[24] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. Efros, “CNN-generated images are surprisingly easy to spot… for now,” in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit., 2020, pp. 8692–8701, doi: 10.1109/CVPR42600.2020.00872.

[25] A. Ramesh et al., “Zero-shot text-to-image generation,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 8821–8831.

[26] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of GANs for improved quality, stability, and variation,” in Proc. Int. Conf. Learn. Representations, 2018, pp. 1–12.

[27] A. Sauer, K. Chitta, J. Müller, and A. Geiger, “Projected GANs converge faster,” in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 1–13.

[28] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vision, 2017, pp. 2242–2251, doi: 10.1109/ICCV.2017.244.

[29] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proc. Adv. Neural Inf. Process. Syst., 2020, pp. 6840–6851.

[30] Z. Wang, H. Zheng, P. He, W. Chen, and M. Zhou, “Diffusion-GAN: Training GANs with diffusion,” in Proc. Int. Conf. Learn. Representations, 2023, pp. 1–13.

[31] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. “Stable diffusion.” GitHub. Accessed: Dec. 14, 2022. [Online]. Available: https://github.com/CompVis/stable-diffusion

[32] Z. Xiao, K. Kreis, and A. Vahdat, “Tackling the generative learning trilemma with denoising diffusion GANs,” in Proc. Int. Conf. Learn. Representations, 2022, pp. 1–15.

[33] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit., 2019, pp. 2332–2341.

[34] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” 2022, arXiv:2204.06125.

[35] R. Corvi, D. Cozzolino, G. Zingarini, G. Poggi, K. Nagano, and L. Verdoliva, “On the detection of synthetic images generated by diffusion models,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2023, pp. 1–5.

[36] Z. Sha, Z. Li, N. Yu, and Y. Zhang, “DE-FAKE: Detection and attribution of fake images generated by text-to-image diffusion models,” 2022, arXiv:2210.06998.

[37] J. Ricker, S. Damm, T. Holz, and A. Fischer, “Towards the detection of diffusion model deepfakes,” 2022, arXiv:2210.14571.

[38] M. Barni et al., “Information forensics and security: A quarter century long journey,” IEEE Signal Process. Mag., vol. 40, no. 5, pp. 67–79, Jul. 2023, doi: 10.1109/MSP.2023.3275319.

[39] R. Corvi, D. Cozzolino, G. Poggi, K. Nagano, and L. Verdoliva, “Intriguing properties of synthetic images: From generative adversarial networks to diffusion models,” in Proc. IEEE Comput. Vision Pattern Recognit. Workshops, 2023, pp. 973–982.


DATES AHEAD

Please send calendar submissions to: Dates Ahead, Att: Samantha Walter, Email: [email protected]

2023

NOVEMBER

Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2023)
31 October–3 November, Taipei, Taiwan.
General Chairs: Jing-Ming Guo, Gwo-Giun Lee, Shih-Fu Chang, and Anthony Kuh
URL: https://www.apsipa2023.org

19th International Conference on Advanced Video and Signal-Based Surveillance (AVSS 2023)
6–9 November, Daegu, South Korea.
General Chairs: Jenq-Neng Hwang and Michael S. Ryoo
URL: https://www.avss2023.org

DECEMBER

IEEE International Workshop on Information Forensics and Security (WIFS 2023)
4–7 December, Nuremberg, Germany.
General Chairs: Marta Gomez-Barrero and Christian Riess
URL: https://wifs2023.fau.de/

Ninth IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP 2023)
10–13 December, Costa Rica.
General Chairs: M. Haardt and André de Almeida
URL: https://www.tuwien.at/etit/tc/en/camsap-2023/

Workshop on Automatic Speech Recognition and Understanding (ASRU 2023)
16–20 December, Taipei, Taiwan.
General Chairs: Chi-Chun Lee, Yu Tsao, and Hsin-Min Wang
URL: http://www.asru2023.org

2024

APRIL

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)
14–19 April, Seoul, Korea.
General Chairs: Hanseok Ko and Monson Hayes
URL: https://2024.ieeeicassp.org/

Digital Object Identifier 10.1109/MSP.2023.3321570 Date of current version: 3 November 2023

The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024) will be held in Seoul, Korea, 14–19 April 2024.

MAY

IEEE Conference on Computational Imaging Using Synthetic Apertures (CISA 2024)
3–6 May, Boulder, CO, USA.
General Chairs: Alexandra Artusio-Glimpse, Paritosh Manurkar, Sam Berweger, Kumar Vijay Mishra, and Peter Vouras
URL: https://2024.ieeecisa.org/

IEEE International Symposium on Biomedical Imaging (ISBI 2024)
27–30 May, Athens, Greece.
General Chairs: Konstantina S. Nikita and Christos Davatzikos
URL: https://biomedicalimaging.org/2024/

JUNE

IEEE Conference on Artificial Intelligence (CAI 2024)
25–27 June, Singapore.
General Chairs: Ivor Tsang, Yew Soon Ong, and Hussein Abbass
URL: https://ieeecai.org/2024/

JULY

IEEE 13th Sensor Array and Multichannel Signal Processing Workshop (SAM 2024)
8–11 July, Corvallis, OR, USA.
General Chairs: Yuejie Chi and Raviv Raich
URL: https://attend.ieee.org/sam-2024/

IEEE International Conference on Multimedia and Expo (ICME 2024)
15–19 July, Niagara Falls, Canada.
General Chairs: Junsong Yuan, Jiebo Luo, and Xiao-Ping Zhang
URL: https://2024.ieeeicme.org/

SEPTEMBER

IEEE 25th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC 2024)
10–13 September, Lucca, Italy.
General Chair: Luca Sanguinetti
URL: https://spawc2024.org/

OCTOBER

IEEE International Conference on Image Processing (ICIP 2024)
27–30 October, Abu Dhabi, UAE.
General Chairs: Mohammed Al-Mualla and Moncef Gabbouj
URL: https://2024.ieeeicip.org/
