IEEE Open Journal of Signal Processing: A Short Papers Submission Category and Review Track for ICASSP 2024
The IEEE Open Journal of Signal Processing (OJ-SP) has introduced a Short Papers submission category and review track, with a limit of eight pages plus an additional page for references (8+1). This is intended as an alternative publication venue for authors who would like to present at ICASSP 2024 but who prefer Open Access or a longer paper format than the traditional ICASSP 4+1 format. Short papers submitted to OJ-SP by the ICASSP 2024 submission deadline, 6 September 2023, and that are specifically indicated as being submitted to this review track, will receive an expedited review to ensure that a decision is available in time for them to be included in the ICASSP 2024 program. Accepted manuscripts will be published in OJ-SP and included as part of the ICASSP 2024 program. For additional details of this OJ-SP/ICASSP partnership, scan the QR code above.
OJ-SP was launched in 2020 as a fully open-access publication of the IEEE Signal Processing Society, with a scope encompassing the full range of technical activities of the Society. Manuscripts submitted before the end of 2023, which includes all submissions to the alternative ICASSP review track, are eligible for a discounted Article Processing Charge (APC) of USD $995. For more about the APC discount, please visit the OJ-SP homepage.
The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) is a flagship conference of the IEEE Signal Processing Society. ICASSP 2024 will be held in Seoul, Korea, from 14 April to 19 April 2024. For further information, please scan the QR code for ICASSP 2024.
Paper Submission Firm Deadline: 6 September 2023
Paper Acceptance Notification: 13 December 2023
Digital Object Identifier 10.1109/MSP.2023.3290697
Contents
Volume 40 | Number 5 | July 2023
SPECIAL SECTION: 75TH ANNIVERSARY OF SIGNAL PROCESSING SOCIETY SPECIAL ISSUE
8 FROM THE GUEST EDITORS
Rodrigo Capobianco Guido, Tulay Adali, Emil Björnson, Laure Blanc-Féraud, Ulisses Braga-Neto, Behnaz Ghoraani, Christian Jutten, Alle-Jan Van Der Veen, Hong Vicky Zhao, and Xiaoxing Zhu
12 AUDIO SIGNAL PROCESSING IN THE 21ST CENTURY
Gaël Richard, Paris Smaragdis, Sharon Gannot, Patrick A. Naylor, Shoji Makino, Walter Kellermann, and Akihiko Sugiyama
27 TWENTY-FIVE YEARS OF EVOLUTION IN SPEECH AND LANGUAGE PROCESSING
Dong Yu, Yifan Gong, Michael Alan Picheny, Bhuvana Ramabhadran, Dilek Hakkani-Tür, Rohit Prasad, Heiga Zen, Jan Skoglund, Jan “Honza” Černocký, Lukáš Burget, and Abdelrahman Mohamed
40 THE FOUNDATIONS OF COMPUTATIONAL IMAGING
W. Clem Karl, James E. Fowler, Charles A. Bouman, Müjdat Çetin, Brendt Wohlberg, and Jong Chul Ye
54 SUPERRESOLUTION IMAGE RECONSTRUCTION
Xin Li, Weisheng Dong, Jinjian Wu, Leida Li, and Guangming Shi
67 INFORMATION FORENSICS AND SECURITY
Mauro Barni, Patrizio Campisi, Edward J. Delp, Gwenaël Doërr, Jessica Fridrich, Nasir Memon, Fernando Pérez-González, Anderson Rocha, Luisa Verdoliva, and Min Wu
80 SIGNAL PROCESSING FOR BRAIN–COMPUTER INTERFACES
Le Wu, Aiping Liu, Rabab K. Ward, Z. Jane Wang, and Xun Chen
92 NETWORKED SIGNAL AND INFORMATION PROCESSING
Stefan Vlaski, Soummya Kar, Ali H. Sayed, and José M.F. Moura
106 SEVENTY YEARS OF RADAR AND COMMUNICATIONS
Fan Liu, Le Zheng, Yuanhao Cui, Christos Masouros, Athina P. Petropulu, Hugh Griffiths, and Yonina C. Eldar
ON THE COVER
This issue continues celebrating the SP Society’s 75th Anniversary. We remember the past as we engage with the present and build the future.
COVER IMAGE: ©SHUTTERSTOCK.COM/G/DOLPHINNY
IEEE SIGNAL PROCESSING MAGAZINE (ISSN 1053-5888) (ISPREG) is published bimonthly by the Institute of Electrical and Electronics Engineers, Inc., 3 Park Avenue, 17th Floor, New York, NY 10016-5997 USA (+1 212 419 7900). Responsibility for the contents rests upon the authors and not the IEEE, the Society, or its members. Annual member subscriptions included in Society fee. Nonmember subscriptions available upon request. Individual copies: IEEE Members US$20.00 (first copy only), nonmembers US$248 per copy. Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limits of U.S. Copyright Law for private use of patrons: 1) those post-1977 articles that carry a code at the bottom of the first page, provided the per-copy fee is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923 USA; 2) pre-1978 articles without fee. Instructors are permitted to photocopy isolated articles for noncommercial classroom use without fee. For all other copying, reprint, or republication permission, write to IEEE Service Center, 445 Hoes Lane, Piscataway, NJ 08854 USA. Copyright © 2023 by the Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Periodicals postage paid at New York, NY, and at additional mailing offices. Printed in the U.S.A. Postmaster: Send address changes to IEEE Signal Processing Magazine, IEEE, 445 Hoes Lane, Piscataway, NJ 08854 USA. Canadian GST #125634188
Digital Object Identifier 10.1109/MSP.2023.3271044
COLUMNS
7 Society News
New Society Editors-in-Chief Named for 2024
DEPARTMENTS
4 From the Editor
IEEE Signal Processing Society 75th Anniversary During ICASSP 2023
Christian Jutten and Athina Petropulu
123 Dates Ahead
The IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP) will be held in Rome, Italy, 17–20 September 2023.

IEEE Signal Processing Magazine
EDITOR-IN-CHIEF
Christian Jutten—Université Grenoble Alpes, France
AREA EDITORS
Feature Articles: Laure Blanc-Féraud—Université Côte d’Azur, France; Special Issues: Xiaoxiang Zhu—German Aerospace Center, Germany; Columns and Forum: Rodrigo Capobianco Guido—São Paulo State University (UNESP), Brazil, and H. Vicky Zhao—Tsinghua University, R.P. China; e-Newsletter: Hamid Palangi—Microsoft Research Lab (AI), USA; Social Media and Outreach: Emil Björnson—KTH Royal Institute of Technology, Sweden
EDITORIAL BOARD
Massoud Babaie-Zadeh—Sharif University of Technology, Iran; Waheed U. Bajwa—Rutgers University, USA; Caroline Chaux—French Center of National Research, France; Mark Coates—McGill University, Canada; Laura Cottatellucci—Friedrich-Alexander University of Erlangen-Nuremberg, Germany; Davide Dardari—University of Bologna, Italy; Mario Figueiredo—Instituto Superior Técnico, University of Lisbon, Portugal; Sharon Gannot—Bar-Ilan University, Israel; Yifan Gong—Microsoft Corporation, USA; Rémi Gribonval—Inria Lyon, France; Joseph Guerci—Information Systems Laboratories, Inc., USA; Ian Jermyn—Durham University, U.K.; Ulugbek S. Kamilov—Washington University, USA; Patrick Le Callet—University of Nantes, France; Sanghoon Lee—Yonsei University, Korea; Danilo Mandic—Imperial College London, U.K.; Michalis Matthaiou—Queen’s University Belfast, U.K.; Phillip A. Regalia—U.S. National Science Foundation, USA; Gaël Richard—Télécom Paris, Institut Polytechnique de Paris, France; Reza Sameni—Emory University, USA; Ervin Sejdic—University of Pittsburgh, USA; Dimitri Van De Ville—Ecole Polytechnique Fédérale de Lausanne, Switzerland; Henk Wymeersch—Chalmers University of Technology, Sweden
ASSOCIATE EDITORS—COLUMNS AND FORUM
Ulisses Braga-Neto—Texas A&M University, USA; Cagatay Candan—Middle East Technical University, Turkey; Wei Hu—Peking University, China; Andres Kwasinski—Rochester Institute of Technology, USA; Xingyu Li—University of Alberta, Edmonton, Alberta, Canada; Xin Liao—Hunan University, China; Piya Pal—University of California San Diego, USA; Hemant Patil—Dhirubhai Ambani Institute of Information and Communication Technology, India; Christian Ritz—University of Wollongong, Australia
ASSOCIATE EDITORS—e-NEWSLETTER
Abhishek Appaji—College of Engineering, India; Subhro Das—MIT-IBM Watson AI Lab, IBM Research, USA; Behnaz Ghoraani—Florida Atlantic University, USA; Panagiotis Markopoulos—The University of Texas at San Antonio, USA
IEEE SIGNAL PROCESSING SOCIETY
Athina Petropulu—President; Min Wu—President-Elect; Ana Isabel Pérez-Neira—Vice President, Conferences; Roxana Saint-Nom—Vice President, Education; Kenneth K.M. Lam—Vice President, Membership; Marc Moonen—Vice President, Publications; Alle-Jan van der Veen—Vice President, Technical Directions
IEEE SIGNAL PROCESSING SOCIETY STAFF
Richard J. Baseil—Society Executive Director; William Colacchio—Senior Manager, Publications and Education Strategy and Services; Rebecca Wollman—Publications Administrator
IEEE PERIODICALS MAGAZINES DEPARTMENT
Sharon Turk, Journals Production Manager; Katie Sullivan, Senior Manager, Journals Production; Janet Dudar, Senior Art Director; Gail A. Schnitzer, Associate Art Director; Theresa L. Smith, Production Coordinator; Mark David, Director, Business Development Media & Advertising; Felicia Spagnoli, Advertising Production Manager; Peter M. Tuohy, Production Director; Kevin Lisankie, Editorial Services Director; Dawn M. Melley, Senior Director, Publishing Operations
Digital Object Identifier 10.1109/MSP.2023.3271042
SCOPE: IEEE Signal Processing Magazine publishes tutorial-style articles on signal processing research and applications as well as columns and forums on issues of interest. Its coverage ranges from fundamental principles to practical implementation, reflecting the multidimensional facets of interests and concerns of the community. Its mission is to bring up-to-date, emerging, and active technical developments, issues, and events to the research, educational, and professional communities. It is also the main Society communication platform addressing important issues concerning all members.
IEEE prohibits discrimination, harassment, and bullying. For more information, visit http://www.ieee.org/web/aboutus/whatis/policies/p9-26.html.
On 2 June 1948 the Professional Group on Audio of the IRE was formed, establishing what would become the IEEE society structure we know today. 75 years later, this group — now the IEEE Signal Processing Society — is the technical home to nearly 20,000 passionate, dedicated professionals and a bastion of innovation, collaboration, and leadership.
Celebrate with us: Digital Object Identifier 10.1109/MSP.2023.3290698
FROM THE EDITOR
Christian Jutten | Editor-in-Chief | [email protected]
Athina Petropulu | IEEE Signal Processing Society President | [email protected]
IEEE Signal Processing Society 75th Anniversary During ICASSP 2023
Remembering the past, engaging with the present, and building the future
The ICASSP 2023 conference in Rhodes, Greece, was remarkable from multiple perspectives. Notably, this was the first fully in-person ICASSP after three consecutive virtual conferences, which were necessitated by the COVID-19 pandemic. Attendees fully embraced the opportunity to engage in live interactions and reestablish their networks. Moreover, this conference held special significance as it coincided with the 75th anniversary of the IEEE Signal Processing Society (SPS), established on 2 June 1948. During the opening ceremony, Petros Maragos, general chair of ICASSP 2023, discussed the unprecedented growth of ICASSP, evidence of a vibrant and growing Signal Processing (SP) community. He also presented the technical program and announced several innovative features and an abundance of social events, emphasizing the significance of the SPS 75th anniversary (Figure 1). Subsequently, Athina Petropulu, president of the SPS, reviewed the significant milestones that marked the incredible growth of SP over the past seven decades, starting from its inception in 1948 (Figure 2). She highlighted the strong interconnections between advancements in
Digital Object Identifier 10.1109/MSP.2023.3286188 Date of current version: 14 July 2023
microelectronics, computer science, and industrial and personal applications, and provided insights gained from signal and image processing, all of which have shaped the evolution of the SPS. Further details regarding this historical journey can be found in the article “Empowering the Growth of Signal Processing: The Evolution of the IEEE Signal Processing Society” [1], which she coauthored and published in the June issue of IEEE Signal Processing Magazine (SPM).
Several events were organized to commemorate this momentous anniversary, including two special sessions dedicated to the SPS anniversary. The first session featured presentations by three pioneers of digital SP (DSP), namely, Alan Oppenheim, Ron Schafer, and Tony Constantinides, who shared their perspectives on the history of DSP. Oppenheim, Schafer, and Mos Kaveh (on behalf of Tony Constantinides, who, unfortunately, could not travel to Rhodes) shared their first-hand experiences and
FIGURE 1. Petros Maragos, ICASSP 2023 General Chair, presented the ICASSP technical program and planned events.
stories about the origins of DSP in various locations in the United States and Europe. They emphasized that, in addition to military applications like radar and sonar, audio and speech processing played a prominent role in the early stages of SP expansion, especially after the Second World War. The birth and history of DSP were closely intertwined with issues in audio and speech, as Oppenheim and Schafer conveyed during their talks (Figure 3). This can be attributed to factors such as the ease of recording audio signals using tape recorders and the relatively low data rates of audio, which were compatible with the technological tools available at that time. Their discussions also underscored the strong correlation between the development of SP and advancements in computers and integrated circuits, particularly microprocessors. They traced the evolution from massive computing rooms with limited performance at institutions like the Massachusetts Institute of Technology and Bell Labs to the advent of the first microprocessors. Additionally, they highlighted the significance of programming languages in SP, starting with Fortran, Basic, and Pascal and progressing to MATLAB, which has now been surpassed by languages like Python and Julia. Both Oppenheim and Schafer emphasized the significance of their interactions with other researchers in shaping their lives and contributions. In Oppenheim’s eloquent words, “All of us were at the right place, at the right time, with the right people.” This highlights the importance of collaborations and the role they played in their achievements.
Tony Constantinides recounted the challenges he faced in getting “out-of-the-box” ideas accepted. He shared an anecdote from a 1966 industry meeting where he spoke about the implementation of digital filters and received a dismissive response: “A waste of time and money. Nobody is and will ever be interested in such digital computer-based communication systems.” Similarly, in 1967, when predicting the high impact of digital techniques in telecommunications, he was rebuffed with the statement, “You are quite wrong. Digital techniques will never be used in telecommunications.” And in the late 1970s, during a conference, a well-known scientist in SP remarked, “Digital signal processing . . . an interesting mathematical curiosity.” These experiences demonstrate the resistance and skepticism faced by innovators who introduce groundbreaking ideas. In one of
his slides, Tony cited a quote by Lord Rayleigh in 1897: “The history of science shows that important original work is liable to be overlooked, and perhaps the more liable the higher the degree of originality.” More than a half-century later, Tony’s experiences emphasize that innovators still face roadblocks in the pursuit of progress. The publication of the first book on DSP in 1975 (Figure 4) was
undoubtedly a pivotal milestone in the field, coinciding with the early stages of computer science. The initial release of this book, along with its subsequent editions, laid the groundwork for the continued progress and expansion of SP. The panel audience, composed of both young and seasoned members, deeply recognized the importance of this book and eagerly gathered around the stage at the end to get the autographs of Oppenheim and Schafer, acknowledging their significant contributions to the field. The second session involved a discussion on the past, present, and future of SP, which brought together six distinguished scientists and recipients of the Norbert Wiener Society Award: Alex Acero, Ray Liu, Jose Moura, Ali Sayed, Sergios Theodoridis, and Rabab Ward. The discussion on the evolution of SP featured the topic of SP in the artificial intelligence (AI) era, which was also the theme of ICASSP 2023. The speakers emphasized that SP plays a unique role in AI, particularly that a key factor in designing explainable and reliable AI algorithms is exploiting domain knowledge—something very familiar to SP researchers and practitioners. The conference plenary talks also highlighted the role of SP in the AI era and the relevance of explainability in deep neural networks. In her “Disrupting NextG”
FIGURE 2. Athina Petropulu presented milestones in the history of signal and image processing.
FIGURE 3. Alan Oppenheim and Ron Schafer.
FIGURE 4. The first book on DSP, published in 1975.
FIGURE 5. The 75th anniversary lounge at ICASSP 2023.
(continued on page 11)
SOCIETY NEWS
New Society Editors-in-Chief Named for 2024
The following volunteers have been named editors-in-chief of IEEE Signal Processing Society publications. The term for these editors-in-chief will run from 1 January 2024 through 31 December 2026.
Tulay Adali has been named editor-in-chief for IEEE Signal Processing Magazine. She is a Fellow of the IEEE and is with the University of Maryland Baltimore County, MD, USA. She is succeeding Christian Jutten, University of Grenoble-Alpes, France, who has held the post of editor-in-chief since 2021.
Benoit M. Macq is the new editor-in-chief of IEEE Transactions on Image Processing. He is a Fellow of the IEEE and is with the Université catholique de Louvain, Belgium. He is succeeding Alessandro Foi, Tampere University, Finland, who has been the editor-in-chief since 2021.
Zhi Tian has taken on the role of editor-in-chief for IEEE Transactions on Signal Processing. She is a Fellow of the IEEE and is with George Mason University, VA, USA. She is succeeding Wing-Kin (Ken) Ma, The Chinese University of Hong Kong, Hong Kong, who has held the post of editor-in-chief since 2021. SP
Digital Object Identifier 10.1109/MSP.2023.3273477 Date of current version: 14 July 2023
FROM THE GUEST EDITORS
Rodrigo Capobianco Guido, Tulay Adali, Emil Björnson, Laure Blanc-Féraud, Ulisses Braga-Neto, Behnaz Ghoraani, Christian Jutten, Alle-Jan Van Der Veen, Hong Vicky Zhao, and Xiaoxing Zhu
IEEE Signal Processing Society: Celebrating 75 Years of Remarkable Achievements (Part 2)
It is our great pleasure to introduce the second part of this special issue to you! The IEEE Signal Processing Society (SPS) has completed 75 years of remarkable service to the signal processing community. The eight selected articles included in this second part are clear portraits of that. As the review process for these articles took longer, however, they could not be included in the first part of the special issue, and we are glad to bring them to you now.
In the first article, “Audio Signal Processing in the 21st Century,” Richard et al. [A1] provide an overview of the long history of research on audio and acoustics, including the analysis and modeling of room acoustics, generation of artificial reverberation, spatial rendering, echo cancellation, dereverberation, acoustic feedback control, source separation, music information retrieval, plus other related and relevant topics.
Next is the second article, “Twenty-five Years of Evolution in Speech and Language Processing,” by Yu et al. [A2], who describe major breakthroughs in each of the following speech processing subfields: language processing, automatic speech recognition, speech synthesis, speech coding, speech enhancement, speaker recognition, language identification, language understanding, dialog systems, and deep learning. They also comment on the main driving forces that
Digital Object Identifier 10.1109/MSP.2023.3285483 Date of current version: 14 July 2023
led to the current state of the art in the field. The societal impacts and potential future directions are complementarily discussed by them.
The third article in this special issue is “The Foundations of Computational Imaging,” where Fowler et al. [A3] present historical perspectives on the field of computational sensing and imaging, providing some context on how it has arrived at its present state as well as on its role within the SPS. Physics-driven imaging and explicit inverse operators, optimization formulation, and model-based reconstruction, in addition to data-driven models and machine learning for image processing, are among the main details discussed.
“Superresolution Image Reconstruction: Selective Milestones and Open Problems” is the title of the next article, in which Li et al. [A4] present a systematic review of the evolution of superresolution methodology in the past 25 years with an emphasis on theoretical insights, complemented with various well-cited superresolution algorithms, and the progression in both model- and learning-based approaches, in addition to open challenges in the field.
The fifth article, “Information Forensics and Security: A Quarter Century Long Journey,” is authored by Barni et al. [A5]. They present an introductory section providing the context in the 1990s, where readers could find the main knowledge and technological challenges, and focus areas such as
digital watermarking, steganography, steganalysis, biometrics, multimedia forensics, and adversarial signal processing. Finally, they present future trends in the domain and a discussion about the unethical use of information security tools.
In the next article, “Signal Processing for Brain–Computer Interfaces: A Review and Current Perspectives,” Wu et al. [A6] cover the wide field of brain–computer interfaces, particularly discussing the history, types, and general flow of those interfaces, including key related aspects such as signal filtering, blind source separation, time-frequency analysis, compressive sensing, and machine learning. Future directions in the field, with pros, cons, and tradeoffs, are also presented by the authors.
“Networked Signal and Information Processing,” authored by Vlaski et al. [A7], overviews the very significant advances in networked signal and information processing that have enabled extending decision making and inference, optimization, control, and learning to the increasingly ubiquitous environments of distributed agents. Taxonomies, networked algorithms, and stochastic optimization are among the key aspects explored by the authors, who carefully address the most relevant aspects that have dominated the field over the previous decades.
The final article in this special issue is “Seventy Years of Radar and Communications: The Road from Separation
to Integration,” where Liu et al. [A8] present an introduction to the field accompanied by key concepts such as information delivery and acquisition, basic principles of radar and communications, and the integration of sensing and communications. The early development of radar and communications, spectrum engineering, and multiple-input, multiple-output antenna arrays are additional relevant topics discussed by the authors, who conclude their article with a discussion on open challenges and future research directions in the field.
This concludes the second part of this special issue. Once again, we express our gratitude to all the contributing authors and reviewers, in addition to our administrative staff: Rebecca Wollman, who consistently helped us with all the administrative details, and the efficient team led by Sharon Turk, who carefully supervised the editorial process, taking care of every detail. We sincerely hope you enjoy reading this second part of the special issue and that you, as a member of the SPS, feel represented by the articles we have selected for your perusal.
Acknowledgment Rodrigo Capobianco Guido is the lead guest editor of this special issue.
Guest Editors
Rodrigo Capobianco Guido ([email protected]) received his Ph.D. degree in computational applied physics from the University of São Paulo (USP), Brazil, in 2003. Following two postdoctoral programs in signal processing at USP, he obtained the title of associate professor in signal processing, also from USP, in 2008. Currently, he is an associate professor at São Paulo State University, São José do Rio Preto, São Paulo, 15054-000, Brazil. He has been an area editor of IEEE Signal Processing Magazine and was recently included in Stanford University’s rankings of the world’s top 2% scientists. His research interests include signal and speech processing
based on wavelets and machine learning. He is a Senior Member of IEEE.
Tulay Adali ([email protected]) received her Ph.D. degree in electrical engineering from North Carolina State University. She is a distinguished university professor at the University of Maryland, Baltimore County, Baltimore, MD 21250 USA. She is chair of IEEE Brain and past vice president of technical directions for the IEEE Signal Processing Society (SPS). She is a Fulbright Scholar and an SPS Distinguished Lecturer. She received a Humboldt Research Award, an IEEE SPS Best Paper Award, the University System of Maryland Regents’ Award for Research, and a National Science Foundation CAREER Award. Her research interests include statistical signal processing and machine learning and their applications, with an emphasis on applications in medical image analysis and fusion. She is a Fellow of IEEE and a fellow of the American Institute for Medical and Biological Engineering.
Emil Björnson ([email protected]) is a full (tenured) professor of wireless communication at the KTH Royal Institute of Technology, Stockholm, 100 44, Sweden. He received the 2018 and 2022 IEEE Marconi Prize Paper Awards in Wireless Communications, the 2019 EURASIP Early Career Award, the 2019 IEEE Communications Society Fred W. Ellersick Prize, the 2019 IEEE Signal Processing Magazine Best Column Award, the 2020 Pierre-Simon Laplace Early Career Technical Achievement Award, the 2020 Communication Theory Technical Committee Early Achievement Award, the 2021 IEEE Communications Society Radio Communications Committee Early Achievement Award, and the 2023 IEEE Communications Society Outstanding Paper Award. His work has also received six Best Paper Awards at conferences. He is a Fellow of IEEE, and a Digital Futures and Wallenberg Academy fellow.
Laure Blanc-Féraud ([email protected]) received her Ph.D. degree and habilitation to conduct research in inverse problems in image processing from University Côte d’Azur in 1989 and 2000, respectively. She is a researcher with Informatique Signaux et Systèmes at Sophia Antipolis (I3S) Lab, the University Côte d’Azur, Centre national de la recherche scientifique (CNRS), Sophia Antipolis, 06900 France. She served/serves on the IEEE Biomedical Image and Signal Processing Technical Committee (2007–2015; 2019–) and has been general technical chair (2014) and general chair (2021) of the IEEE International Symposium on Biomedical Imaging. She has been an associate editor of SIAM Imaging Science (2013–2018) and is currently an area editor of IEEE Signal Processing Magazine. She headed the French national research group GDR Groupement de recherche–Information, Signal, Image et ViSion (ISIS) of CNRS on Information, Signal Image and Vision (2021–2018). Her research interests include inverse problems in image processing using partial differential equation and optimization. She is a Fellow of IEEE.
Ulisses Braga-Neto ([email protected]) received his Ph.D. degree in electrical and computer engineering from Johns Hopkins University in 2002. He is a professor in the Electrical and Computer Engineering Department, Texas A&M University, College Station, TX 77843 USA. He is founding director of the Scientific Machine Learning Lab at the Texas A&M Institute of Data Science. He is an associate editor of IEEE Signal Processing Magazine and a former elected member of the IEEE Signal Processing Society Machine Learning for Signal Processing Technical Committee and the IEEE Biomedical Imaging and Signal Processing Technical Committee. He has published two textbooks and more than 150 peer-reviewed journal articles and
conference papers. He received the 2009 National Science Foundation CAREER Award. His research focuses on machine learning and statistical signal processing. He is a Senior Member of IEEE.
Behnaz Ghoraani ([email protected]) received her Ph.D. from the Department of Electrical and Computer Engineering, Ryerson University, Toronto, Canada, followed by a postdoctoral fellowship with the Faculty of Medicine, University of Toronto, Toronto, Canada. She is an associate professor of electrical engineering and computer science at Florida Atlantic University, Boca Raton, FL 33431 USA, with a specialization in biomedical signal analysis, machine learning, wearable and assistive devices for rehabilitation, and remote home monitoring. She is an associate editor of IEEE Journal of Biomedical and Health Informatics and BioMedical Engineering OnLine Journal. Her research has received recognition through multiple best paper awards and the Gordon K. Moe Young Investigator Award. Her research has been funded by grants from the National Institutes of Health, the National Science Foundation (including a CAREER Award), and the Florida Department of Health. She is an esteemed member of the Board of Scientific Counselors of the National Library of Medicine, as well as the IEEE SPS Biomedical Signal and Image Professional Technical Committee. She has also taken on the role of the IEEE Women in Signal Processing Committee Chair and an Area Editor for the IEEE SPM eNewsletter.
Christian Jutten ([email protected]) received his Ph.D. and Doctor es Sciences degrees from Grenoble Polytechnic Institute, France, in 1981 and 1987, respectively. He was an associate professor (1982–1989) and a professor (1989–2019), and has been a professor emeritus since September 2019 at University Grenoble Alpes, Saint-Martin-d’Hères 38400. He was an organizer or program chair of many international conferences, including the first Independent Component Analysis Conference in 1999 (ICA’99) and the 2009 IEEE International Workshop on Machine Learning for Signal Processing. He was the technical program cochair of ICASSP 2020. Since 2021, he has been editor-in-chief of IEEE Signal Processing Magazine. Since the 1980s, his research interests have been in machine learning and source separation, including theory and applications (brain and hyperspectral imaging, chemical sensing, and speech). He is a Fellow of IEEE and a fellow of the European Association for Signal Processing.
Alle-Jan van der Veen ([email protected]) received his Ph.D. in system theory at the Circuits and Systems Group, Department of Electrical Engineering, TU Delft, The Netherlands, with a postdoctoral research position at Stanford University, USA. He is a professor and chair of the Signal Processing Systems group at Delft University of Technology, Delft, 2628, The Netherlands. He was editor-in-chief of IEEE Transactions on Signal Processing and IEEE Signal Processing Letters. He was an elected member of the IEEE Signal Processing Society (SPS) Board of Governors. He was chair of the IEEE SPS Fellow Reference Committee, chair of the IEEE SPS Signal Processing for Communications Technical Committee, and technical cochair of ICASSP 2011 (Prague). He is currently the IEEE SPS vice president of technical directions (2022–2024). His research interests are in the areas of array signal processing and signal processing for communication, with applications to radio astronomy and sensor network localization. He is a Fellow of IEEE and a fellow of the European Association for Signal Processing.
Hong Vicky Zhao ([email protected]) received her Ph.D. degree in electrical engineering from the University of Maryland, College Park, in 2004. Since May 2016, she has been an associate professor with the Department of Automation, Tsinghua University, Beijing, 100084, China. She received the IEEE Signal Processing Society 2008 Young Author Best Paper Award. She is the coauthor of “Multimedia Fingerprinting Forensics for Traitor Tracing” (Hindawi, 2005), “Behavior Dynamics in Media-Sharing Social Networks” (Cambridge University Press, 2011), and “Behavior and Evolutionary Dynamics in Crowd Networks” (Springer, 2020). She was a member of the IEEE Signal Processing Society Information Forensics and Security Technical Committee and the Multimedia Signal Processing Technical Committee. She is the senior area editor, area editor, and associate editor of IEEE Signal Processing Letters, IEEE Signal Processing Magazine, IEEE Transactions on Information Forensics and Security, and IEEE Open Journal of Signal Processing. Her research interests include media-sharing social networks, information security and forensics, digital communications, and signal processing.
Xiaoxing Zhu ([email protected]) received her Dr.-Ing. degree and her “Habilitation” in signal processing from the Technical University of Munich (TUM), in 2011 and 2013, respectively. She is the chair professor for data science in Earth observation at TUM, Munich, 80333, Germany. She was founding head of the “EO Data Science” Department at the Remote Sensing Technology Institute, German Aerospace Center. Since October 2020, she has served as a director of the TUM Munich Data Science Institute. She is currently a visiting artificial intelligence professor at the European Space Agency’s Phi Lab. Her research interests include remote sensing and Earth observation, signal processing, machine
learning, and data science, with their applications to tackling societal grand challenges, e.g., global urbanization, the United Nations’ sustainable development goals, and climate change. She is a Fellow of IEEE.
Appendix: Related articles
[A1] G. Richard, P. Smaragdis, S. Gannot, P. A. Naylor, S. Makino, W. Kellermann, and A. Sugiyama, “Audio signal processing in the 21st century,” IEEE Signal Process. Mag., vol. 40, no. 5, pp. 12–26, Jul. 2023, doi: 10.1109/MSP.2023.3276171.
[A2] D. Yu et al., “Twenty-five years of evolution in speech and language processing,” IEEE Signal Process. Mag., vol. 40, no. 5, pp. 27–39, Jul. 2023, doi: 10.1109/MSP.2023.3266155.
[A3] W. C. Karl, J. E. Fowler, C. A. Bouman, M. Çetin, B. Wohlberg, and J. C. Ye, “The foundations of computational imaging,” IEEE Signal Process. Mag., vol. 40, no. 5, pp. 40–53, Jul. 2023, doi: 10.1109/MSP.2023.3274328.
[A4] X. Li, W. Dong, J. Wu, L. Li, and G. Shi, “Superresolution image reconstruction,” IEEE Signal Process. Mag., vol. 40, no. 5, pp. 54–66, Jul. 2023, doi: 10.1109/MSP.2023.3271438.
[A5] M. Barni et al., “Information forensics and security,” IEEE Signal Process. Mag., vol. 40, no. 5,
pp. 67–79, Jul. 2023, doi: 10.1109/MSP.2023.3275319.
[A6] L. Wu, A. Liu, R. K. Ward, Z. J. Wang, and X. Chen, “Signal processing for brain–computer interfaces,” IEEE Signal Process. Mag., vol. 40, no. 5, pp. 80–91, Jul. 2023, doi: 10.1109/MSP.2023.3278074.
[A7] S. Vlaski, S. Kar, A. H. Sayed, and J. M. F. Moura, “Networked signal and information processing,” IEEE Signal Process. Mag., vol. 40, no. 5, pp. 92–105, Jul. 2023, doi: 10.1109/MSP.2023.3267896.
[A8] F. Liu, L. Zheng, Y. Cui, C. Masouros, A. P. Petropulu, H. Griffiths, and Y. C. Eldar, “Seventy years of radar and communications,” IEEE Signal Process. Mag., vol. 40, no. 5, pp. 106–121, Jul. 2023, doi: 10.1109/MSP.2023.3272881. SP
FROM THE EDITOR (continued from page 6)
plenary talk, Andrea Goldsmith emphasized two important future developments. First, that SP will play an outsized role in next-generation wireless technologies. And second, that machine learning can be viewed as a tool in the SP toolbox, while knowledge about the application and the data can lead to more effective and explainable machine learning algorithms for wireless communications. Richard Baraniuk’s talk, “The Local Geometry of Deep Learning,” discussed a new way to view the geometry of deep learning through the lens of approximation theory via splines. This approach provides a window to the inner workings of those algorithms. Michael Jordan provided the keynote talk, “An Alternative View on AI: Collaborative Learning, Incentives, and Social Welfare,” sharing his view of a future AI that is more collective and autonomous, with particular attention on statistical inference, such as prediction-powered inference, for computing valid confidence intervals. The IEEE Historical Center exhibited photographs of pioneers and early contributions in SP at the 75th anniversary lounge (Figure 5).
In this issue
The second part of this SPM special issue on the SPS 75th anniversary includes eight articles that will help readers appreciate the diversity of SP, including how its expansion is impacted by technological progress, especially in microelectronics and computer science, and on many application domains that impact our everyday lives. The contents of these articles are presented in more detail in the “From the Guest Editors” column [A1]. Here is a summary of the key factors that illustrate the evolution of SP, with the emergence of new domains and technologies that have touched all aspects of our lives.
■■ Audio, speech, and language processing and radar and communications have a long history, which began before the term SP appeared, but they continued to evolve quite dramatically with technological innovations and societal needs becoming increasingly synergistic.
■■ Major technological advancements such as computer technologies, the cloud, and the Internet of Things have recently spawned new SP domains, such as computational imaging, superresolution image
reconstruction, information forensics and security, and networked information processing.
■■ Brain–computer interfaces, a concept introduced by Vidal in 1973 [3], required both technological and SP advances, illustrating that complex technologies impact human health and also come with complex ethical issues related to the development of science.
We finish this editorial with the help of Constantinides, who concluded his talk with the message, “Keep calm and carry on. The future is yours.”
Appendix: Related article
[A1] R. C. Guido, “IEEE Signal Processing Society: Celebrating 75 years of remarkable achievements (Part 2),” IEEE Signal Process. Mag., vol. 40, no. 5, pp. 8–11, Jul. 2023, doi: 10.1109/MSP.2023.3285483.
References
[1] A. Petropulu, J. M. F. Moura, R. K. Ward, and T. Argiropoulos, “Empowering the growth of signal processing: The evolution of the IEEE Signal Processing Society,” IEEE Signal Process. Mag., vol. 40, no. 4, pp. 14–22, Jun. 2023, doi: 10.1109/MSP.2023.3262905.
[2] A. Oppenheim and R. Schafer, Digital Signal Processing. London: Pearson, 1975.
[3] J. Vidal, “Toward direct brain-computer communication,” Annu. Rev. Biophys. Bioengineering, vol. 2, no. 1, pp. 157–180, 1973, doi: 10.1146/annurev.bb.02.060173.001105.
SP
75TH ANNIVERSARY OF SIGNAL PROCESSING SOCIETY SPECIAL ISSUE
Gaël Richard, Paris Smaragdis, Sharon Gannot, Patrick A. Naylor, Shoji Makino, Walter Kellermann, and Akihiko Sugiyama
Audio Signal Processing in the 21st Century
The important outcomes of the past 25 years
Digital Object Identifier 10.1109/MSP.2023.3276171 Date of current version: 14 July 2023

Audio signal processing has passed many landmarks in its development as a research topic. Many are well known, such as the development of the phonograph in the second half of the 19th century and technology associated with digital telephony that burgeoned in the late 20th century and is still a hot topic in multiple guises. Interestingly, the development of audio technology has been fueled not only by advancements in the capabilities of technology but also by high consumer expectations and customer engagement. From surround sound movie theaters to the latest in-ear devices, people love sound and soon build new audio technology into their daily lives as an essential and expected feature.
Some of the major outcomes of the research in audio and acoustic signal processing (AASP) prior to 1997 were summarized in a landmark paper published on the occasion of the 50th anniversary of the IEEE Signal Processing Society (SPS) [1]. At that time, the vast majority of the work was driven by the objective to build models that capture the essential characteristics of the analyzed audio signal and to represent it with a limited set of parameters and components. The field has now evolved beyond the essential characteristics explored in the past. For instance, a wide variety of speech/audio signal models have since been proposed, in particular around signal decomposition/factorization models and sparse signal representations. Nevertheless, the entire research domain covered by the IEEE Technical Committee (TC) on AASP is witnessing a paradigm shift toward data-driven methods based on machine learning and, especially, deep learning. In many applications, such data-driven models obtain state-of-the-art results if appropriate data are available to train the models. This has been accompanied by sustained efforts to gather highly valuable and public data collections (and, in particular, annotated data), which are, in fact, essential for data-driven algorithms. Concurrently, to promote reproducible research and identify state-of-the-art methods, a number of challenges have arisen, for instance, in acoustic characterization of environments (ACE), reverberant speech processing (REVERB), acoustic source localization and tracking, source separation
(SiSEC), acoustic echo cancellation (AEC), deep noise suppression dedicated to single-microphone noise reduction, and the detection and classification of acoustic scenes and events (DCASE), which has been the subject of a yearly event since 2016 (SPS data challenges: https://signalprocessingsociety.org/publications-resources/data-challenges; REVERB Challenge: http://reverb2014.dereverberation.com; SiSEC challenge: https://sisec.inria.fr; and DCASE challenges: https://dcase.community/challenge2022). Without aiming for exhaustiveness, the article provides a view of the important outcomes of the field in the past 25 years, also illustrating the emergence of purely data-driven models. In particular, the article covers the research addressed in signal models and representations; the modeling, analysis, and synthesis of acoustic environments and acoustic scenes; signal enhancement and separation; music information retrieval (MIR); and Detection and Classification of Acoustic Scenes and Events (DCASE). The overall structure of the article is as follows. We discuss, in the “Advances and Highlights (Evolution and Breakthrough)” section, the main axes of progress and highlights of the domain underlining the evolution and breakthroughs of the field. We then focus, in the “Emerging Topics” section, on the new topics that have mostly emerged in the past 25 years, before suggesting some conclusions and perspectives.

Advances and highlights (evolution and breakthrough)
Building upon the achievements prior to 1997, already discussed in [1], we summarize, in this section, the key advances and highlights of recent years.

Modeling and representation
We first discuss the developments in audio coding and signal modeling, with a focus on multichannel audio coding. We then describe some of the important work pursued in modeling, analysis, and synthesis of acoustic environments, with specific highlights on room impulse response (RIR) analysis and synthesis.

Coding and signal modeling
Audio coding is a long-standing topic in the field and has led to several international standards. [The International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) audio coding standards in the following are accessible at https://www.iso.org/standards.html by providing the search window with the numbers and years in the parentheses.] The field had its golden age in the 1990s, with the first international standard of audio coding, MPEG-1 Audio (11172-3:1993), and its extension to multichannel signals of up to five channels, MPEG-2 Audio (13818-3:1995). MPEG-2 Audio was developed for multichannel and multilingual applications, such as digital radio broadcasting in Europe, with backward compatibility with MPEG-1. However, without the backward compatibility constraint, much higher subjective quality was successfully achieved with MPEG-2 Advanced Audio Coding (AAC) (13818-7:1997). It is still the foundation of today’s audio coding algorithms and is employed in terrestrial TV broadcasting in Japan and Latin America. From a viewpoint of applications, MPEG-4 AAC (14496-3:2009) and MPEG-4 High-Efficiency AAC (HE-AAC) (14496-3:2009/Amd 7:2018) achieve sufficient audio quality at 64 kbit/s and 32 kbit/s, respectively, for mobile applications and are most widely used today. One of the major improvements is brought by bandwidth extension (BWE), also known as subband replication (SBR), which encodes only the low-frequency subband plus high-frequency power envelope information, thereby reducing the bitrate with inaudible quality degradation. The decoder copies the low-frequency spectrum to the high-frequency band and adjusts the envelope by the transmitted envelope information to reconstruct the full-band audio (see Figure 1).

FIGURE 1. The BWE principle. [Block diagram: the bitstream feeds an AAC decoder and an analysis filter bank; the decoded low-frequency band drives high-frequency generation, whose envelope is adjusted using the transmitted envelope information before the synthesis filter bank outputs the decoded audio.]

MPEG-4 AAC and HE-AAC are used in various consumer products, such as PCs, tablet PCs, mobile phones, and car navigation systems, to name a few. The history of MPEG-1 Audio through MPEG-4 HE-AAC was to remove redundancy of the input audio in the frequency domain (transform coding), time domain (prediction), and spatial domain (multichannel coding). The next stage of MPEG Audio, MPEG Surround (MPS) (23003-1:2007), exploits further redundancy in the spatial domain, based on binaural cue coding [2]. A multichannel audio signal is decomposed into a monaural signal and additional spatial information in the form of the interaural level difference (ILD) and interaural time difference (ITD) in multiple time-frequency tiles (segments). The monaural data are encoded by MPEG-4 AAC, with a little side information representing the ILD and ITD. MPS achieves comparable quality to MPEG-4 AAC at one-third of the MPEG-4 AAC bitrate. The absolute subjective quality is transparent to the
source signal, which is suitable for content delivery between geographically distributed studios and broadcasting stations. MPEG Spatial Audio Object Coding (SAOC) (23003-2:2010) removes the redundancy of the input audio, based on the composition of each audio object. The input audio signal consists of multiple audio objects, which are independent audio sources, such as individual musical instruments. Each audio object is expressed in multiple frequency tiles by object-level differences (OLDs) and interobject cross coherences (IOCs). The OLD is the relative energy to the energy of the downmix signal that is a combination of the audio objects. The IOC is the cross correlation to the downmix signal. The downmix signal of multiple objects is encoded by MPEG-4 AAC, whereas the OLD and IOC of each object are encoded as side information. The decoder recovers each object from the downmix signal, OLD, and IOC. A direct link to MPEG SAOC can also be made with the line of work developed simultaneously on (coding-based) informed source separation [3].
Until MPEG SAOC, speech-dominant audio signals and more general audio signals had been encoded with different algorithms. MPEG Unified Speech and Audio Coding (USAC) (14496-3:2009/Amd 3:2012) is the first audio coding framework that automatically switches between the speech-oriented algorithm and the audio-oriented algorithm, based on the input signal analysis result in multiple time-frequency tiles. The most recent member of the MPEG Audio family is MPEG-H (23008-3:2019), which is generic coding, including 3D audio (higher-order ambisonics or HoA).
The most successful application of audio coding is portable audio players, represented by Apple’s iPod. The first prototype was the Silicon Audio, developed in 1994, which was a precursor of the iPod first put in the market in 2001. Audio players were later extended to include video data processing. The iPhone, released in 2007, was the first in the world and was combined with a large display to make a tablet PC or with a tiny display to make a smart watch. A history of these handy personal terminals can be found in [4]. Nevertheless, despite their immense success, audio players are now gradually being replaced by music streaming.

Acoustic environments modeling, analysis, and synthesis

Modeling and analysis of acoustic impulse responses
Sound propagation in acoustic enclosures is characterized by multiple reflections and the addition of noise, both associated with the acoustic environment. When an acoustic signal propagates in an echoic environment, it is reflected by the room facets and objects in the enclosure, resulting in the reverberation phenomenon. The acoustic impulse responses (AIRs) that relate sound sources and microphones are usually a few hundred milliseconds in duration, corresponding to a few thousand taps in discrete-time filtering at typical sampling rates. The decay rate of acoustic energy in an acoustic environment is measured by the reverberation time, T60, the time it takes for the exponentially decaying power profile of the reverberation tail to decay by 60 dB from its initial value. Typical offices have a T60 around 300–400 ms, and larger rooms can approach 1 s, depending on the volume, shape, and materials. The perceived reverberation also depends on the ratio between the direct path (including the early reflections) and the power of the tail, denoted as the direct-to-reverberant ratio (DRR). In the same environment, distant sources will exhibit a lower DRR and be perceived as more reverberant.
Reverberation can degrade the quality of a speech signal and, in severe cases, particularly in noise, its intelligibility. The word error rate (WER) of automatic speech recognition (ASR) systems is usually severely impacted by high reverberation levels, especially for a low DRR.
An AIR encompasses the entire reflection pattern, consisting of the direct path, the early reflections (consisting of several distinguishable arrivals), and the late reflection tail, with an exponentially decaying power profile. The latter part is the main cause of the reverberation phenomenon. When an acoustic environment is a room, its AIR is referred to as an RIR. Room acoustics, even in mild reverberation conditions, should be taken into account when designing acoustic signal processing algorithms, and failing to do so may severely degrade their performance. Modeling and accurately analyzing the properties of the RIR is therefore of crucial importance.

Room simulators, RIR datasets, and sound field generators
Acoustic signal processing algorithms should be evaluated under reverberant conditions. This can be achieved either by using recorded RIRs or using room simulators. The outcome of such simulators may be less accurate, but using them allows researchers in the field to generate a vast number of examples. This has recently become extremely important with the emergence of machine learning algorithms that require a large volume and diversity of training data. The field has evolved from the pioneering work in acoustics by Schröder (frequency-domain modeling), Polack (time-domain modeling), and Allen and Berkley (the image method) [5]. Based on these models (especially the image method), many RIR generators were developed: the RIR generator (https://github.com/ehabets/RIR-Generator), PyRoomAcoustics (https://pyroomacoustics.readthedocs.io/en/pypi-release/pyroomacoustics.room.html), and gpuRIR (https://github.com/DavidDiazGuerra/gpuRIR). Using these generators, one can evaluate the performance of audio processing algorithms and also train data-driven methods. Recent advances improve the RIR generation using data-driven methods, usually generative adversarial networks (GANs).
Databases of real-world RIRs are also available, facilitating reliable evaluation of algorithms (https://www.dreams-itn.
eu/index.php/dissemination/science-blogs/24-rir-databases, https://github.com/RoyJames/room-impulse-responses, and https://asap.ite.tul.cz/downloads/mirage). In parallel, noise field generators were also proposed, including isotropic noise (https://github.com/ehabets/INF-Generator) and wind noise (https://github.com/ehabets/Wind-Generator).
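To make this simulation workflow concrete, here is a minimal sketch of generating a synthetic RIR with the image method, assuming the pyroomacoustics ShoeBox interface linked above; the room dimensions, source/microphone positions, and target T60 are illustrative values, not taken from the article.

```python
# Minimal image-method RIR simulation sketch (assumes the pyroomacoustics
# ShoeBox API linked in the text; all numeric values are illustrative).
import pyroomacoustics as pra

fs = 16000                      # sampling rate in Hz
room_dim = [6.0, 5.0, 3.0]      # shoebox room dimensions in meters
rt60_target = 0.4               # target reverberation time (typical office)

# Convert the target T60 into a wall absorption and reflection order (Sabine).
e_absorption, max_order = pra.inverse_sabine(rt60_target, room_dim)

room = pra.ShoeBox(
    room_dim,
    fs=fs,
    materials=pra.Material(e_absorption),
    max_order=max_order,
)
room.add_source([2.0, 3.0, 1.5])         # loudspeaker position
room.add_microphone([4.5, 1.5, 1.2])     # single microphone position

room.compute_rir()
rir = room.rir[0][0]            # RIR for microphone 0 and source 0
print(f"RIR length: {len(rir)} taps ({len(rir) / fs * 1e3:.0f} ms)")
```

Convolving a dry speech recording with such an impulse response then yields the reverberant signal used to stress-test enhancement or ASR front ends, or to build training data for data-driven methods.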
Inference of room characteristics The parameters characterizing the acoustic properties of an enclosure can be inferred from the AIR and the reverberant sound itself. These parameters can be used in the development of audio processing algorithms and also in rendering acoustic scenes. The reverberation time, T60, and DRR were mentioned in the preceding. The coherent-to-diffuse power ratio (CDR) is another attribute of the sound field that determines the impact of reverberation and depends on the source–microphone distance and reverberation time. If the direct path and early reflections are dominant, the sound is perceived as more coherent, less diffuse, and less reverberant. The ACE Challenge (http://www.ee.ic.ac.uk/naylor/ACEweb) was dedicated to developing and benchmarking estimation procedures for the preceding room acoustic parameters. A recent database of RIRs with annotated reflections (“dEchorate”) can be used to advance research further in this direction (https://zenodo.org/ record/4626590#.Y1cMoOxByAQ).
Generation of artificial reverberation Another thriving research direction is the generation of artificial reverberation, with the most popular method being feedback delay networks [6]. Traditionally (from the pioneering work of Schröder), these algorithms have been widely used in music production and now find applications in new fields, such as game audio, including virtual and augmented reality. A different angle of research would rather consider geometric approaches, which rely on physics-based models. The image method remains intractable for modeling late reverberation, especially that of large rooms. The radiance transfer method (RTM) was introduced to overcome this limitation, as it can model the diffuse reflections and sound energy decay of the late reverberation [7]. Although complex, it was later shown that the RTM can be linked to feedback delay networks to build efficient geometry-based reverberators [8].
Analysis of acoustic scenes Here, we explore the field of acoustic scene analysis, using microphone arrays that are either arranged in structured constellations (e.g., spherical and circular) or arbitrarily distributed in the acoustic enclosure. We discuss the localization of sound sources and basic concepts of data-independent spatial filtering. We further discuss wave domain representations using the cylindrical or spherical harmonics domain [9]. While originating from sound field rendering and microphone array beamforming, these representations are now frequently used for, e.g., source localization, echo cancellation, active noise control (ANC), and blind source separation (BSS), which are discussed in the following.
Acoustic sensor networks
Recent technological advances in the design of miniature and low-power devices enable the deployment of so-called wireless acoustic sensor networks (WASNs). A WASN consists of multiple (often battery-powered) microphone nodes, each of which is equipped with one or more microphones, a signal processing unit, and a wireless communication module. The large spatial distribution of such microphone constellations yields a large amount of spatial information and consequently increases the probability that a subset of the microphones (a node) is close to a relevant sound source. Many everyday devices are now equipped with multiple microphones and considerable audio processing capabilities. These technological advancements significantly pushed the research forward. WASNs find applications in hearing devices, speech communication systems, acoustic monitoring, ambient intelligence, and more. However, new challenges arise in these new ad hoc architectures. Typically, for a spatially extended network, the utility of sensors for a given task should be assessed, and for coherent signal processing of multiple sensor nodes, the signals must be synchronized. In particular, when data centralization is not possible, due either to the lack of a dedicated central processing device or to overly demanding transmission/processing requirements, one must rely on distributed processing, where nodes share only compressed/fused microphone signals with one another. The corresponding modifications for the various algorithms, e.g., for beamforming, will be discussed along with their nondistributed versions in the following. First steps have also been taken to consider a moving robot as part of an acoustic sensor network.
Localization and tracking
Speaker localization algorithms, mainly time difference of arrival (TDoA) and direction of arrival (DoA) estimation, emerged in the 1970s, with solutions based on the normalized cross correlation between the signals received by a pair of microphones, the so-called generalized cross correlation, and were later extended to multimicrophone solutions, most notably the steered response power phase transform [10], which steers a beam toward all candidate directions. Especially for simultaneously localizing multiple sources, generic frequency estimation and direction-finding algorithms (such as MUSIC and ESPRIT) were also adapted to acoustic applications, most prominently to the cylindrical and spherical harmonics domain. While TDoA and DoA estimation dominate localization efforts, efficient range estimation based on sound field characteristics, e.g., the CDR, has been demonstrated and applied for position estimation in WASNs [11]. In later years, there were many attempts to incorporate statistical methods that can also facilitate the tracking of sources in dynamic scenarios, including Bayesian methods, e.g., nonlinear extensions of the Kalman filter, particle filters, and probability hypothesis density filters, and non-Bayesian methods, e.g., recursive expectation maximization (EM). Acoustic reflections may degrade the performance of localization and tracking algorithms, especially in highly
reverberant environments and when multiple speakers are concurrently active. There are two paradigms in the literature to mitigate the effects of reverberation on localization accuracy. The first focuses on extracting the direct path of the sound propagation from the source to the microphones while trying to minimize the effects of the long AIR. Under the second paradigm, more general features are extracted from the microphone signals. These features characterize sound propagation. Then, a mapping from these high-dimensional features to the source location is learned. Manifold learning-based methods adopt this paradigm (see the 2019 European Signal Processing Conference tutorial at https://sharongannot.group/wp-content/uploads/2021/06/Speaker-Localization-on-Manifolds.pdf). This is part of the trend toward data-driven methods, specifically deep neural network (DNN)-based algorithms, that infer the source location from a feature vector [12]. A recent survey [13] explores many of these methods. Under the same paradigm, simultaneous localization and mapping (SLAM) can be used in the acoustic domain (acoustic SLAM) to enable devices equipped with microphones, such as robots, to move within their environment to explore, adapt to, and interact with sound sources of interest [14].
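As an illustration of the classical TDoA approach mentioned at the beginning of this subsection, the following sketch implements the generalized cross correlation with phase transform (GCC-PHAT) for a single microphone pair; the regularization constant and zero-padding choice are illustrative.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the time difference of arrival between two microphone signals."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12                     # phase transform weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                                  # delay in seconds
```

Scanning a grid of candidate directions and summing such correlations across microphone pairs leads to the steered response power phase transform [10].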
Spatial filtering
Essentially all multichannel algorithms, implicitly or explicitly, use the spatial diversity of the sensor arrangement for spatially selective signal processing. Referring to later sections for the treatment of other spatial filtering methods, such as data-dependent beamforming and multichannel source separation and signal extraction, here, we limit the consideration to data-independent linear spatial filtering, which was portrayed as an active area of research in [1]. Since then, notable advances in this area include the exploitation of the spherical harmonics domain [9], [15] as well as differential microphone arrays [16], [17], due to their high directivity. Further advances include the introduction of polynomial beamforming for efficient and flexible beamsteering; the use of powerful optimization algorithms for noniterative designs of beamformers that meet robustness constraints, e.g., on white noise gain; and the incorporation of object-related transfer functions, e.g., head-related transfer functions (HRTFs), into the beamformer design. While these data-independent techniques were conceived for microphone array signal processing, they can also be used for sound reproduction by loudspeaker arrays. For the latter, more reproduction-specific techniques are discussed in the following.
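As a minimal example of data-independent spatial filtering, the sketch below implements a frequency-domain delay-and-sum beamformer for a far-field plane wave; it assumes the direction vector points from the array toward the source and omits windowing and overlap-add details.

```python
import numpy as np

def delay_and_sum(frames, mic_positions, direction, fs, c=343.0):
    """Frequency-domain delay-and-sum beamformer for one block of samples.

    frames: (M, N) microphone signals; mic_positions: (M, 3) in meters;
    direction: unit vector pointing from the array toward the far-field source.
    """
    M, N = frames.shape
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    # A plane wave from `direction` reaches microphone m advanced by (p_m . d) / c
    advances = mic_positions @ direction / c                   # (M,)
    X = np.fft.rfft(frames, axis=1)                            # (M, F)
    align = np.exp(-2j * np.pi * np.outer(advances, freqs))    # undo the per-mic advances
    return np.fft.irfft(np.mean(align * X, axis=0), n=N)
```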
Synthesis of acoustic scenes

Listener-centric binaural rendering
Binaural rendering usually refers to the process of spatial sound reproduction with headphones. One popular approach is based on the use of HRTF filters. Such filters contain all the cues that allow a listener to localize a sound source (and, in particular, spectral cues and interaural differences in time
and intensity) [18]. The binaural signals are then obtained, for each ear, by filtering the input monophonic signal by the HRTF corresponding to a given position in space. The rendering for reverberant environments is more complex since it should superimpose different HRTFs for each direction of the early reflections. This approach is, however, facing major challenges: the difficulty of acquiring large databases of HRTFs, the difficulty of obtaining generic and nonindividualized HRTFs, and the necessity to limit the computational complexity for high-quality rendering. These challenges have fueled extensive research in several complementary directions: 1) obtaining more generic HRTFs, 2) obtaining means to adapt generic HRTFs to individuals (for instance, by averaging sets of HRTFs, using anthropometric measurements, and resorting to physical models), and 3) selecting an appropriate set of HRTFs from a large database by, e.g., subjective tests [19].
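The core rendering operation is a pair of convolutions: the sketch below binauralizes a monophonic signal with a head-related impulse response (HRIR) pair for one direction. The HRIRs are assumed to be given (e.g., taken from a public HRTF database) and of equal length.

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono, hrir_left, hrir_right):
    """Render a mono source at the direction encoded by the given HRIR pair.

    Returns a (2, N) array with the left and right ear signals.
    """
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=0)
```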
Sound field rendering
Beyond the universal numerical methods based on finite elements and finite differences, the signal processing of sound fields started to take advantage of wave domain representations, especially using the cylindrical or spherical harmonics domain [9], and has now been applied to address many key challenges in sound field rendering. An important class of sound rendering techniques relies on a specific setting of distributed loudspeakers surrounding the listening area. Specific formats were developed based on stereophonic principles for a variety of configurations: six channels, including an additional one for low frequencies (5.1); eight channels (7.1); 12 channels (10.2); and 24 channels (22.2). These formats are associated with directional sound field encoding, which imposes strict constraints on the loudspeaker positions. Also, in practice, the spatial illusion is correct only in a rather small area around the center of the room (called the sweet spot). Outside this sweet spot, the sound is perceived as coming from the closest loudspeaker. The approaches based on sound field reproduction, such as ambisonics, originally proposed by Gerzon in 1973, and wave field synthesis, introduced in the 1980s by Berkhout, and, in a more general representation, the spatial frequency domain [20] overcome some of these constraints by taking into account the actual position of the speakers and creating virtual speakers for each required direction. In practice, these approaches can rely on object-based coding and have a much wider sweet spot. Since their introduction, these methods have received much attention and led to many extensions for sound field reproduction with parametric and nonparametric methods, with potentially small-size microphone arrays for the recording to arbitrary loudspeaker layouts [21]. Once sound field rendering also accounts for the acoustic environment, room equalization techniques become necessary, which have been studied in [22].
Acoustic signal enhancement
In this section, we explore both single- and multimicrophone approaches for acoustic signal enhancement, addressing multiple sources of interference, namely, echo, feedback,
reverberation, noise, and competing signals. A generic view of an acoustic signal processing architecture together with sound field synthesis, which was discussed in the preceding, is depicted in Figure 2.

FIGURE 2. A typical multichannel sound system. On the analysis side, a spatially and/or spectrally selective acquisition is applied, including noise reduction, speaker separation (using either beamforming or independent component analysis), and dereverberation. Echo signals are also removed, and sources may be localized. On the synthesis side, a spatially selective rendering is applied, and noise can be actively canceled.
Echo cancellation
Echo cancellation emerged in the 1960s but has seen radical progress in the past 50 years. Many of the advances in the field of AEC were explored at the SPS 50th anniversary [1], including recursive least squares, affine projection, subband and frequency-domain adaptive filters, and double-talk detectors. AECs became the enabling technology of hands-free telecommunication systems, especially modern video conference systems. Several important challenges were then tackled to take into account the nonlinearities of the reproduction system [23], [24], the latter also harnessing DNNs to improve performance. A global approach for combining (residual) echo cancellation, dereverberation, and noise reduction, usually by applying a postfiltering stage, was also a topic of extensive research. The classical spectral postfiltering may be substituted with modern structures, such as DNNs, to further improve performance. In multimicrophone settings with additive noise present, it is important to design the AECs and beamforming stages such that their cross interference is minimized. Step size control continued to develop, from double-talk detection [25] to Kalman filter-based approaches and, more recently, to Kalman filters with deep learning-based step size optimization. Stereophonic AEC, as discussed in Sondhi's seminal work, was extended to the multichannel case [26] and multiple-input, multiple-output AEC in the wave domain. Comprehensive surveys of the AEC field, its achievements, and remaining challenges can be found in [26] and [27]. The International Workshop on Acoustic Echo and Noise Control (https://www.iwaenc.org), begun in 1989 and held at two-year intervals, was originally dedicated to AEC, but its scope was rapidly extended to other audio signal processing domains, and the name was accordingly changed to the International Workshop on Acoustic Signal Enhancement.
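The adaptive-filtering core of an AEC can be illustrated with a basic normalized least-mean-squares (NLMS) update, a simpler relative of the affine projection and recursive least squares algorithms mentioned above. The filter length and step size below are illustrative, and a practical canceller adds double-talk detection and step size control.

```python
import numpy as np

def nlms_aec(far_end, mic, filt_len=1024, mu=0.5, eps=1e-6):
    """Single-channel NLMS echo canceller.

    Adapts an FIR estimate of the echo path from the far-end (loudspeaker)
    signal and subtracts the estimated echo from the microphone signal.
    """
    w = np.zeros(filt_len)
    x_buf = np.zeros(filt_len)
    e = np.zeros(len(mic))
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        y_hat = w @ x_buf                      # estimated echo
        e[n] = mic[n] - y_hat                  # near-end speech plus residual echo
        w += mu * e[n] * x_buf / (x_buf @ x_buf + eps)
    return e, w
```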
Acoustic feedback and ANC
Acoustic feedback occurs when a microphone signal is played back by a loudspeaker (e.g., in public announcement systems and hearing aids). This creates a closed loop that limits the amount of amplification that can be applied in the loop before the system becomes unstable and produces the howling effect [28]. This problem is well known to hearing aid wearers, who report it as one of the main drawbacks, especially for those requiring high gain due to moderate to severe hearing impairment. In the first step, a good “closed” fitting of a hearing aid can usually provide for a stable increase in useful gain. To go beyond this, adaptive processing was introduced in the 1990s to cancel the feedback components, and this approach has been advancing in recent years through the use of better models of the feedback path and better methods to control feedback-canceling algorithms. Usable gains have risen by as much as 10 dB in some cases, providing corresponding benefits to the hearing impaired.

ANC systems are based on microphones that capture the sound outside a volume and render “antisound” to create a quiet zone. Research in the field was boosted by commercial products, e.g., noise-canceling headphones and aircraft and automotive applications. Aside from just suppressing noise in a given zone, multizone rendering became a topic of significant theoretical and practical interest [29]: here, in each zone, only one of multiple simultaneously active sources should be audible, i.e., forming a “bright” zone, whereas all others should be suppressed, each forming a “dark” zone. This technology finds applications in entertainment, business, and health settings. For example, the sound from multiple TVs in the same hospital room may be zoned separately to each patient's bed. Also, the sound level and rendering strategy of a movie may be zoned differently to different seats in the listening room, creating a “bright zone” and a “dark zone.” Different languages for the dialogue may also be rendered in specific zones. Note that as soon as the reference information on the undesired sound in a certain zone does not need to be acquired by microphones but can be estimated from an observable sound source and modeled and measured sound propagation path characteristics (e.g., impulse responses), the creation of dark and bright zones reduces to a spatial filtering task.

Dereverberation
Related to the objective of AEC, the topic of dereverberation has received growing attention due to the clear need to remove reverberation from audio signals, particularly in speech processing tasks. Dereverberation, as opposed to AEC, is a blind estimation problem, as no reference signal for the anechoic signal is available. While only a few dereverberation algorithms were available in the late 1990s, dereverberation has become a flourishing field of research and reached some level of maturity, as reflected by a dedicated and highly cited book summarizing a decade of intensive activity [30] and, later, by the community-wide REVERB Challenge.
Both single- and multimicrophone dereverberation algorithms have been proposed and evaluated. Statistical modeling of the decaying tail of the RIR has been used to derive spectral methods for single-microphone dereverberation [31].

In the multichannel case, dereverberation can be treated as a blind equalization problem. Hence, either the RIR coefficients or, alternatively, the inverse of a matrix of impulse responses should be estimated. Estimation procedures for the multichannel equalization system include subspace methods, i.e., extracting the RIRs from the null subspace of the spatial correlation matrix of the received microphone signals, and least-squares methods for (partially) equalizing the multichannel RIRs and, consequently, the reverberation effects. The anechoic signal and (time-varying) RIRs can also be jointly estimated by applying a (recursive) EM algorithm in parallel to Kalman filtering.

The weighted prediction error (WPE) method [32] realized blind dereverberation of time-varying colored audio sources, such as speech, based on multichannel linear prediction (MCLP). To enable MCLP to handle such a source, the WPE introduced two necessary extensions into it: a nonstationary Gaussian source model and a delayed prediction that protects inherent source correlation from being whitened by MCLP. The WPE established a new effective MCLP algorithm called variance-normalized delayed linear prediction. Several extensions to this method, including joint BSS and dereverberation and the incorporation of DNNs, were also proposed.

In recent years, several successful data-driven methods based on DNNs were proposed [33]. We believe that this research direction will continue, exploring aspects including the noisy and time-varying nature of real-world scenarios, probably combining model-based and data-driven paradigms.

Noise suppression
Noise reduction algorithms gained momentum in the late 1970s, with the pioneering single-channel spectral subtraction method published by Boll and by Berouti et al. A few years later, with the introduction of the seminal papers by Ephraim and Malah on the estimation of the spectral amplitude and the log-spectral amplitude (LSA), statistically optimal methods became dominant. Beyond the statistically optimal estimation under the Gaussian assumption on the speech spectral components, these papers also introduced novel concepts related to estimation under signal presence uncertainty as well as the decision-directed approach for the a priori signal-to-noise ratio (SNR) estimation. Extensions to other probability distributions, e.g., super-Gaussian, were later presented. Comprehensive surveys of the state of the art in the first decade of the 21st century can be found in [34] and [35].

While it was assumed for many years that the estimation of the phase is unimportant and that it is sufficient to estimate the amplitude spectrum of the speech and augment it with the noisy phase, recent findings have shown that it is beneficial to estimate the phase as well [36].

All-pole modeling of the speech signal, widely used in traditional speech compression algorithms, was adopted by Lim and Oppenheim to develop an iterative scheme, alternating between the estimation of the speech autoregressive coefficients and enhancing the speech signal using Wiener filtering. The same speech model was later used under the EM framework, with a Kalman filter substituting the Wiener filter.

An early data-driven model for speech enhancement was proposed in [37]. In this work, rather than using a specific model for the LSA of the speech, a mixture of Gaussians model is inferred in a training stage using the entire TIMIT database. In recent years, the field of single-microphone speech enhancement (including noise reduction) has been dominated by DNN-based algorithms. Many of these algorithms recast the noise reduction problem as a mask estimation problem. The ideal binary mask (IBM) determines for each time-frequency bin whether it is dominated by speech or noise. Another popular mask is the ideal ratio mask (IRM), which is a softer version of the IBM. A survey of many noise reduction algorithms can be found in [38], where other masks, e.g., the complex IRM, which is also sensitive to the phase, are explored and compared. Although already achieving remarkable results, there are still many challenges left. Many of the algorithms require huge amounts of speech and noise data for training, and the resulting models are usually very large. There is a growing interest in developing “thin” models that can be deployed in edge devices, such as cellular phones, and even simpler devices that are used as nodes in WASNs. Moreover, in most telecommunication applications, low latency is mandatory, rendering utterance-level algorithms inadequate. There are many challenging acoustic environments that require further algorithmic improvements. One example is busy cafés and bars, usually characterized by babble noise. Another example is factories and mines, characterized by extreme noise levels. A third example is transient noise, e.g., keyboard typing and wind noise.
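As a minimal single-channel example in the spirit of the early spectral subtraction methods, the sketch below estimates the noise spectrum from the first few frames, subtracts an oversubtracted version of it, and reuses the noisy phase; the STFT parameters, oversubtraction factor, and spectral floor are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, oversub=2.0, floor=0.05):
    """Classic magnitude spectral subtraction with noisy-phase reconstruction."""
    _, _, Y = stft(noisy, fs=fs, nperseg=512)
    noise_psd = np.mean(np.abs(Y[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    mag2 = np.abs(Y) ** 2 - oversub * noise_psd
    mag2 = np.maximum(mag2, floor * noise_psd)     # spectral floor against musical noise
    S_hat = np.sqrt(mag2) * np.exp(1j * np.angle(Y))
    _, enhanced = istft(S_hat, fs=fs, nperseg=512)
    return enhanced
```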
Spatial filtering (beamforming)
The enhancement and separation capabilities offered by multichannel interfaces are usually greater than those of single-channel interfaces, although DNN-based single-microphone solutions now offer competitive performance. Data-independent beamformers were explored in the preceding; this section is dedicated to data-dependent beamformers, namely, beamformers that adapt to the received microphone signals. Early multimicrophone speech enhancement and speaker separation solutions adopted beamforming techniques with free-field propagation models [1]. Early attempts to incorporate statistically optimal solutions in the beamformer design as well as advanced speaker localization algorithms are summarized in [39].
As discussed in the preceding, sound fields in acoustic enclosures are typically characterized by high-order multipath propagation. If the number of microphones is too small to form narrow beams, using only the direct path of the AIR may provide insufficient sound quality. It therefore became common to take into consideration the entire AIR in the beamformer design. The concept of designing a matched filter toward multiple reflections of the sound was first introduced by Jan and Flanagan in 1996, but without discussing AIR estimation procedures. In [40], the acoustic transfer function (ATF) relating the speaker and a microphone array was estimated using a subspace tracking procedure and used in the design of a minimum variance distortionless response (MVDR) beamformer. The relative transfer function (RTF) was later introduced and used in the MVDR design as a substitute for the ATF. The RTF encompasses the relevant information regarding the acoustic propagation between the source and a pair of microphones. Multiple optimal design criteria were used in the literature of microphone arrays, namely, the MVDR, the multichannel Wiener filter (MWF) and its variant the speech distortion-weighted MWF [41], the maximum SNR, and the linearly constrained minimum variance (LCMV). The latter addresses the speaker extraction problem, which is closely related to (semi-) blind speaker separation, as discussed in the next section of this article. Here, we only briefly note that microphone array processing and BSS paradigms are now strongly interrelated and routinely borrow ideas from each other. Further elaboration on spatial processing algorithms can be found in [42] and [43], including spatial processing criteria and algorithms and the relation to blind speaker separation. While general-purpose multimicrophone speech enhancement algorithms aim at selectively enhancing the desired speech source and suppressing interfering sources and ambient background noise, the objective of binaural algorithms is also to preserve the auditory impression of the acoustic scene. This can be achieved by preserving the so-called binaural cues of the desired speech source, interfering sources, and background noise such that the binaural hearing advantage of the auditory system can be exploited, and confusions due to a mismatch between acoustic and visual information are avoided. A range of multichannel filters to achieve this goal is surveyed in [43, Ch. 18]. All criteria discussed in the preceding were designed for centralized processing. In WASNs, when such processing becomes too expensive, either optimal or suboptimal distributed algorithms should be applied instead. The outcome of the optimal distributed algorithms should be identical to their centralized counterparts, while for suboptimal algorithms, some performance degradation may result. The advantage of the latter family of algorithms is reduced communication bandwidth and sometimes even a lower local computational load. The challenges typical to WASN processing, several important applications, and several efficient node fusion schemes can be found in [44]. Distributed versions of many of the preceding
criteria can be found in the literature. In WASN processing, sampling rate synchronization may be crucial for guaranteeing the proper operation of the system. Multiple resynchronization schemes can be found in the literature. A large number of DNN-based spatial processing algorithms were proposed in recent years. Three main trends can be found in the current literature. In the first line of work, the DNN is used for estimating the building blocks of the statistically optimal beamformers. In the second line of work, e.g., in [45], the DNN directly estimates the multichannel weights of the beamformer. The advantage of the latter is the ability to go beyond the conventional second-order statistics and implement a beamformer with perceptually more meaningful cost functions (or with the WER as a loss function in ASR applications). However, this may not be as robust as the DNN-controlled beamformers. In the third line of work, the DNN is directly applied to the multichannel data, and the beamformer structure is not preserved.
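A minimal narrowband sketch of the MVDR design discussed above is given below; it computes, for one frequency bin, the weights from an estimated noise covariance matrix and a steering vector (an ATF or RTF) and assumes both are already available.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Narrowband MVDR beamformer weights for one frequency bin.

    noise_cov: (M, M) noise spatial covariance matrix;
    steering: (M,) acoustic or relative transfer function toward the desired source.
    """
    inv_cov = np.linalg.pinv(noise_cov)
    num = inv_cov @ steering
    return num / (steering.conj() @ num)   # distortionless toward the steering direction
```

The beamformer output is obtained by applying the conjugate weights to the stacked microphone STFT coefficients in each bin; the MWF and LCMV criteria mentioned above lead to closely related closed-form expressions.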
Audiovisual signal enhancement
The visual modality can clearly support the enhancement task. As an example, visual cues from the face of the speaker, and particularly the lips, can be used to extract the desired speaker from background noise and competing speakers [46].
Signal separation
Source separation and blind source separation (BSS) were topics of growing interest in the mid-1990s and gradually moved from determined and overdetermined cases to the more challenging underdetermined case, where there are potentially more sources than observed mixtures [47].
Determined case
BSS started as an application of independent component analysis (ICA). A series of ICA conferences began in 1999 and were held at 1.5-year intervals, playing an important role in promoting the field. Audio signals are, due to TDoAs of the source signals arriving at different sensors and reverberation, convolutively mixed in a room. Because a convolutive mixture in the time domain can be converted to instantaneous mixtures in the frequency domain, the frequency-domain ICA approach converts time-domain signals into the time-frequency domain by using a short-time Fourier transform (STFT). ICA theory inherently includes two ambiguities: output order (permutation) and output amplitude (scaling). Both become serious problems in frequency-domain ICA. To solve the permutation problem, spatial information and spectral information of the sources are key. It was further shown that ICA-based BSS forms a null directivity pattern toward the interfering source and suppresses it [48]. An interesting framework for multichannel blind signal processing for convolutive mixtures, known as Triple-N ICA for Convolutive Mixtures [49], defines an information-theoretic cost function and enables the utilization of three fundamental signal properties, namely, nonwhiteness, non-Gaussianity, and nonstationarity. Nonnegative matrix factorization (NMF) [50]
separates sources by using common frequency patterns as frequency bases. Independent low-rank matrix analysis [51] separates sources by using spatial information of ICA and spectral information of NMF. As in most fields of audio processing, deep learning methods are now widely used, and some of them are improved variants of classical algorithms. For instance, the multichannel variational autoencoder (VAE) [52] combines spatial information of ICA and spectral information of DNNs. Audio source separation methods and algorithms are surveyed in [43] and [53].
Monophonic separation
Although multichannel separation provided a way to invert mixing, the case in which the input mixture is presented in a single channel only, known as monophonic separation, posed a new challenge. Techniques that emerged in this area utilized either generative modeling or variations of masking approaches to recover the intended source. This problem also brought into the spotlight the idea of trained separation algorithms as opposed to blind methods. An early successful approach along these lines came from models based on NMF [50]. These models were pretrained using sound examples, learned a target-specific spectral dictionary, and were able to isolate and reconstruct such a target from an input mixture. Variations of this approach included multichannel versions, convolutional models, models trained on a variety of spectrotemporal representations, Markov models, probabilistic formulations, and more [54], [55]. Although generative models performed well at the time, an alternative approach came from a technique that was first used for multichannel separation. W-disjoint orthogonality [56] took advantage of sparsity in the time-frequency representation of most sounds to directly apply a binary mask on a spectrogram and isolate the desired sound. First formulated for stereo recordings, this idea became a cornerstone for approaches based on NNs and resulted in a discriminative approach to solving the separation problem, where each time-frequency point is classified as useful or not. A popular NN model that made use of this idea was deep clustering [57], which projected mixtures in a space where time-frequency bins could be clustered and labeled accordingly as belonging to independent sources. Other NN models dispensed with the clustering step, thereby losing some generality, and directly attempted to predict a mask given just an input mixture [38]. The latter approach has dominated the source separation research of late, providing many approaches with impressive-sounding results, ranging in their application from small and efficient on-device speech enhancers that are commonly used for most voice communication today to larger high-quality offline models, such as those used for the award-winning restorations of historical Beatles recordings. Models along these lines have explored many of the new neural architectures (the U-net, transformers, and so on) and span a wealth of extensions, such as the use of soft masks, models that learn a latent space as opposed to using an STFT [58], models that resolve ambiguity in the order of output sources (permutation-invariant training, conditional models that are guided toward a
target by a user, models that directly optimize perceptual metrics, and more). In Figure 3, several examples of approaches for monophonic separation are given.

A special case of these models has had a significant impact on music processing. The release of easy-to-use music-oriented source separation models (https://research.deezer.com/projects/spleeter) has resulted in a wealth of free and commercial software that allows users to decompose a music recording into its constituent instrument tracks and freely remix and manipulate. Aside from being a very useful tool, this has enhanced the way we interact with recorded music and opened new avenues of media interactivity that are still being explored.

Although discriminative models offer superior performance with relative ease of use, their downside as compared to generative methods is that they are prone to overspecialization and cannot be easily extended and redeployed for alternative uses. Some open questions still remain on how to make universal separators, learn with limited training data, extend a trained model to work out-of-distribution, and so on. Despite the impressive-sounding demos, there is still a lot of work to be done in this space.

FIGURE 3. Approaches for monophonic separation. (a) NMF models decompose inputs based on trained dictionaries and then use that information to reconstruct selected parts of the input. (b) Deep clustering projects the time-frequency points of a mixture to a latent space adapted such that different sources cluster separately, and it then uses the cluster labels to reconstruct each source. (c) Finally, discriminative separation, mostly based on NNs, predicts masking functions directly from input signals such that it can mute interference and isolate target sounds.
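The dictionary-based view of Figure 3(a) can be sketched as follows: with target and interference dictionaries pretrained (e.g., by NMF on isolated examples), only the activations are fitted on the mixture, and a soft mask reconstructs the target. The update rule shown is the standard multiplicative update for the Euclidean cost; dictionary sizes and iteration counts are illustrative.

```python
import numpy as np

def nmf_activations(V, W, n_iter=100, eps=1e-10):
    """Fit nonnegative activations H such that V is approximately W @ H, with W fixed."""
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)      # multiplicative update, Euclidean cost
    return H

def supervised_nmf_separate(mix_mag, W_target, W_noise):
    """Separate a target from a mixture magnitude spectrogram given pretrained dictionaries."""
    W = np.concatenate([W_target, W_noise], axis=1)
    H = nmf_activations(mix_mag, W)
    k = W_target.shape[1]
    target_est = W_target @ H[:k]
    total_est = W @ H
    mask = target_est / (total_est + 1e-10)       # soft (Wiener-like) mask
    return mask * mix_mag
```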
Emerging topics
Another viewpoint on the evolution and breakthroughs discussed in the preceding is the emergence of new topics that were almost absent in the 1990s and that today are among the most popular fields.
Objective evaluation
Objective evaluation of speech and audio quality has emerged as a highly relevant topic in the past 25 years. While the ultimate means for speech/audio quality evaluation and intelligibility assessment is a human perceptual test, such tests are costly and tedious to organize. This has motivated the community to develop objective metrics for sound quality that are better correlated with perception. For instance, led by the speech coding community, several speech quality metrics were developed (and standardized), including Perceptual Evaluation of Speech Quality, Perceptual Objective Listening Quality Assessment, and Virtual Speech Quality Objective Listener. An overview of objective perceptual measures is provided in [59]. There is also widespread adoption of speech intelligibility measures for hearing aids, such as Short-Time Objective Intelligibility (STOI), together with binaural extensions such as modified binaural STOI. These measures are the de facto standard for assessing the impact of speech enhancement algorithms in human interface devices. Similarly, several metrics were proposed to evaluate audio quality (such as Perceptual Evaluation of Audio Quality and Perception Model-Based Quality) and the performance of an audio source separation algorithm (the scale-invariant signal-to-distortion ratio, signal-to-artifact ratio, and signal-to-interference ratio) [60]. Other interesting objective measures were also proposed, in particular for hearing-impaired listeners (see [61] for an overview). More recently, we have also seen the incorporation of trained models that output perceptual scores [62]. These models can be
trained on audio inputs to directly predict user responses and provide a rapid alternative to listener tests and otherwise slow-to-compute evaluation methods. When used with differentiable models, these evaluation methods can also be directly incorporated into algorithm optimization, providing new possibilities for training perceptually relevant systems. Finally, when any of the approximations in the preceding are not deemed sufficient, audio algorithm designers can resort to modern crowdsourcing tools that can reach thousands of listeners and conduct experiments with unprecedented sample sizes. The ability to do this has revolutionized how audio products are evaluated today and provides stronger statistical results than ever before.
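Among the separation metrics mentioned above, the scale-invariant signal-to-distortion ratio is simple enough to state directly; the sketch below follows the common definition with mean removal and a small regularization constant.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference                 # projection of the estimate on the reference
    noise = estimate - target
    return 10.0 * np.log10((target @ target + eps) / (noise @ noise + eps))
```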
MIR
MIR is defined as a field that covers all the research topics involved in the understanding and modeling of music and that uses information processing methodologies (see the MIR road map at http://www.mires.cc/wiki/index1a1d.html?title=Roadmap&oldid=2137). It is, in essence, an interdisciplinary domain involving machine learning, signal processing, and/or musicology. The nature of the processed music can also be very diverse, including the raw audio signal, a symbolic representation of the music score or recording (for example, in the musical instrument digital interface format), an image (for example, as a scanned version of the music score), and even 3D trajectory movements (for example, as gestures of performers). While the MIR domain initially focused on symbolic music processing, some early studies paved the way for many subsequent works on raw audio signals, for example, in speech/music discrimination, beat tracking [63], and music analysis and recognition [64], to name a few. The early approaches often took inspiration from speech recognition methods, mostly using mel-frequency cepstral coefficients (MFCCs) as features, with statistical models such as Gaussian mixture models (GMMs), hidden Markov models (HMMs), support vector machines (SVMs), and more. Similarly, for underdetermined source separation, major progress was made in using dedicated low-rank and sparse decomposition, such
as based on NMF and matching pursuit and its variants. With the exception of some early papers that exploited NNs (see, for example, [65] for multipitch estimation), the advent of deep learning is rather recent (see Figure 4). Today, the major trend is to consider deep learning for nearly all applications, with remarkable achievements in polyphonic music source separation, music transcription (estimation of melody, harmony, rhythm, lyrics, and so on), music style transfer, and music synthesis, for instance, [66]. As in speech recognition, the field has also seen great interest in end-to-end deep learning approaches, which even replace the traditional feature extraction step with a data-driven representation learning paradigm. The variety and complexity of music signals also motivate the development of new tailored methods for representation learning and unsupervised learning to avoid the particularly cumbersome stage of music signal annotation. A particularly interesting approach was recently introduced for self-supervised pitch estimation [67]. Besides the main historic domains of MIR, music synthesis is becoming a stronger field with impressive results, especially around new generative models. In recent years, we have witnessed the emergence of approaches at the crossroads of DNNs and classical generative models in so-called deep generative models. Some of the most popular models include different forms of autoencoders (including VAEs), autoregressive models, and GANs. A concurrent trend, especially for music generation, revisits the use of classic audio signal models, such as, for instance, the source–filter model of speech production and the harmonic + noise model. In fact, such models have great potential in hybrid neural architectures integrating audio models in the form of differentiable signal processing blocks [68]. Hybrid architectures are indeed particularly attractive and already show great promise. For instance, the use of differentiable source generative models opens the path to data-efficient fully unsupervised music source separation paradigms [69].
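A typical front end of the early MIR systems mentioned above can be sketched in a few lines; the example uses the librosa package (assumed available; any MFCC implementation would do), with illustrative parameter values, and summarizes a clip by per-coefficient statistics before a GMM, HMM, or SVM back end.

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do

def mfcc_clip_features(path, sr=22050, n_mfcc=13):
    """Clip-level MFCC summary, the classic feature vector of early MIR classifiers."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, n_frames)
    # A common clip-level summary: mean and standard deviation per coefficient
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```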
DCASE
Nevertheless, the most recent and strongest growth has been in the field of DCASE [70]. This growing interest is visible
[Figure 4 shows a timeline of MIR milestones spanning 1964 to 2020, from early work on musical timbre perception, through GMM/SVM-based genre, instrument, and tempo/beat analysis in the late 1990s and 2000s, to DNN-based beat tracking, multipitch estimation, music source separation with deep U-nets, end-to-end synthesis with autoregressive models and GANs, and, most recently, unsupervised learning and hybrid DNNs built on differentiable signal models.]

FIGURE 4. MIR: a rather early adoption of DNNs. RNN: recurrent NN; LSTM: long short-term memory.
in the increase of the DCASE community and the success of its DCASE workshop, a series launched in 2016 (with attendance growing from 68 in 2016 to 201 in 2019, with an average of 50% from the industry), and its companion international challenge (with continuous growth of the number of submitted systems, from 84 in 2016 to 470 in 2020). (Note, though, that the very first DCASE challenge was organized in 2013, but it became an annual event only from 2016.) This steady increase of interest is clearly visible in the number of submissions to ICASSP: in 2022, DCASE was by far the field with the highest number of submissions, with up to 23.5% of all submissions in audio. Although very important work on the perception of sound objects was reported by Schaeffer in his treatise on musical objects in the 1960s, one often refers to computational auditory scene analysis and the work on acoustic scene analysis by Bregman in the early 1990s as the most emblematic initial work in DCASE. As illustrated in Figure 5, this field has seen a similar (although much faster) evolution from speech recognition-inspired methods to fully data-driven deep learning methods, with a particularly strong axis on weakly supervised approaches [71]. With the notable exception of work by Sawhney and Maes in 1997, which exploited NNs, most of the studies until 2015 relied on more traditional clustering and machine learning paradigms, for instance, based on the SVM, GMM, and HMM. Also, similar to the domains of audio source separation and MIR at the dawn of the 21st century, many works have exploited approaches to obtain compact and informative audio signal representations. Sparse decomposition methods, image-based features, and NMFs have been particularly popular. Then, since 2014, deep learning has gained strong momentum and very rapidly become the mainstream architecture. In the DCASE 2016 challenge, all submitted systems for acoustic scene classification but four involved NNs, even if they were not yet defining the state of the art. Two years later, in the 2018 challenge, the top 30 performing systems were DNN-based, confirming the indisputable supremacy of NNs for such a task.
Although DCASE often refers to a single domain, it considers, in practice, multiple applications, which have their own specifics and constraints. In acoustic scene recognition, a more mature application, numerous approaches were proposed to operate at low complexity, and in that regard, the use of network compression, pruning, and knowledge distillation, for instance, exploiting teacher–student frameworks, is among the most successful developments. For the task of acoustic event detection and localization, there is easy access to huge weakly annotated databases. This has obviously accompanied the emergence of an anthology of weakly supervised and few-shot learning approaches, for instance, around prototypical networks and mean teacher architectures, which are particularly efficient for few-shot learning, weakly supervised learning, and domain adaptation. Finally, it is worth mentioning the wide use of data augmentation techniques, which have proved, in many domains, to be very effective at reducing model overfitting. Popular data augmentation techniques include SpecAugment (with feature warping and time-frequency masking), pitch shifting, time stretching, mix-up and channel confusion in the case of multichannel recordings, random noise addition, and many more.

[Figure 5 shows a timeline of DCASE milestones spanning roughly 1983 to 2020: auditory sound analysis in perception and psychology, computational ASA with auditory periphery models, early audio scene recognition with PLP/filter-bank features and RNN or k-NN classifiers, MFCC + HMM/GMM acoustic scene recognition, methods exploiting sparsity, NMF, and image features, and, finally, large-scale DNN-based acoustic event recognition with weakly supervised, few-shot, and domain adaptation approaches built on prototypical networks and transformers.]

FIGURE 5. DCASE: from perceptual auditory sound analysis to large-scale deep learning algorithms. ASA: auditory scene analysis; PLP: perceptual linear predictive; k-NN: k-nearest neighbors.

Powerful consumer electronics devices and fast Internet connections
Finally, recent years have witnessed a very fast deployment of powerful consumer electronics devices with audio processing capabilities and usually with more than a single microphone. Example devices are laptops, tablet PCs, cellular phones, smartphones and smart watches, smart speakers, hearing devices and hearables, smart loudspeakers (Amazon Echo, Apple HomePod, and Google Home), and virtual and augmented reality glasses. Dedicated multimicrophone hardware, e.g., spherical microphone arrays, is also available (see the Eigenmike at https://mhacoustics.com). Concurrently, the rapid deployment of fast Internet connections, specifically with data over the cellular network, dramatically changed the way we communicate. Rather than
communicating over the wired telephone network and, later, the cellular network, we now widely use voice over Internet Protocol (VoIP) as a cheap and reliable alternative. Moreover, teleconferencing tools, e.g., Google Meet, Skype, and Zoom, have become very popular, as recently demonstrated during the COVID-19 pandemic, allowing everyone to work from home and remotely communicate with colleagues and coworkers. The VoIP technology promoted research on audio coding, packet loss concealment, and echo cancellation over IP. Similarly, the widespread use of the Internet has revolutionized the consumption of music through new applications, such as audio and music retrieval and music identification [e.g., the popular Shazam service (https://www.shazam.com)], as well as streaming services with automatic recommendation and playlist generation.
Conclusions and perspectives
The domain of AASP is clearly experiencing growing interest, with a broad range of specific and interdisciplinary research and development. This growth has been accompanied by the AASP TC, whose “mission is to support, nourish and lead scientific and technological development in all areas of audio and acoustic signal processing.” Over the years, and especially recently, the domain has shifted toward more data-driven methods for nearly all speech and audio applications. In some cases, the methods developed are pure end-to-end approaches, where all the “knowledge” is extracted from data. We believe that this is a very strong trend that will be further developed in the future but probably with a different angle. In fact, pure end-to-end deep neural approaches are complex, often overparametrized, and, in many cases, remain rather unexplainable. There is thus an interest in moving toward more frugal, interpretable, and controllable data-driven systems. A potential path is to combine the strength of data-driven paradigms with efficient signal models to build new model-based (and hybrid) deep neural architectures. For example, in MIR, it is possible to associate differentiable sound production models and deep learning architectures to design interpretable, more frugal, and yet efficient methods. This may be one of the future paths toward developing new algorithms and technologies that will be in accordance with sustainable and ecological development and compliant with high ethical standards, which we believe will become general concerns of major importance. Another future research direction that should receive growing interest in audio processing is federated (or collaborative) learning [72]. In fact, massive amounts of data are now stored on devices. As a result, more models can now be directly trained on devices (often referred to as on the edge). This allows us to better take into account privacy concerns (recorded data are not stored centrally) but also brings a number of challenges for audio applications, particularly in global optimization with communication constraints, learning with heterogeneous data (audio data recorded from diverse and heterogeneous recording devices), and learning with partial and missing data. Federated learning, which gathers techniques for machine learning and
statistical signal processing using multiple distributed devices, then, appears as a particularly promising framework for future audio processing applications. Stronger edge devices, with more powerful processing units and faster communication capabilities, will certainly support this trend. We also expect that multimodal processing will become more prominent and that we will witness, in the near future, more algorithms that utilize vision to support speaker localization and separation. Beyond audiovisual processing, other modalities will be more extensively used, e.g., brain-informed speech separation using electroencephalography signals [73].
Authors Gaël Richard ([email protected]) received his State Engineering degree from Telecom Paris, France, in 1990, and his Ph.D. degree from the University of ParisSaclay, in 1994. He is now a full professor of audio signal processing at Telecom-Paris, Institut Polytechnique de Paris, 91120 Palaiseau, France, and the scientific codirector of the Hi! PARIS interdisciplinary center on artificial intelligence and data analytics. He is the past chair of the IEEE Signal Processing Society Technical Committee for Audio and Acoustic Signal Processing. He received, in 2020, the IMT– National Academy of Science grand prize for his research contribution in sciences and technologies. In 2022, he was awarded an advanced European Research Council grant for a project on hybrid deep learning for audio. His research interests include machine learning, audio and music signal processing. He is a Fellow of IEEE. Paris Smaragdis ([email protected]) received his Ph.D. degree from the Massachusetts Institute of Technology in 2001. He is now a professor of computer science at the University of Illinois Urbana-Champaign, Champaign, IL 61801, USA. He was an IEEE Distinguished Lecturer (2016– 2017) and has previously chaired the IEEE Data Science Initiative, IEEE Audio and Acoustics Signal Processing Technical Committee, and IEEE Machine Learning for Signal Processing Technical Committee. He has also served as a member of the IEEE Signal Processing Society Board of Governors. He is currently the editor-in-chief of IEEE Transactions on Audio, Speech, and Language. His research interests include machine learning and signal processing. He is a Fellow of IEEE. Sharon Gannot ([email protected]) received his Ph.D. degree in electrical engineering from Tel-Aviv University, Israel in 2000. He is currently a full professor in the Faculty of Engineering, Bar-Ilan University, Ramat Gan 5290002, Israel. He currently serves as a senior area chair for IEEE Transactions on Audio, Speech, and Language Processing; a member of the senior editorial board of IEEE Signal Processing Magazine; and the chair of the IEEE Signal Processing Society Data Science Initiative. He also served as the chair of the IEEE Audio and Acoustic Signal Processing Technical Committee in 2017–2018. He was the general cochair of the 2010 International Workshop on Acoustic Signal Enhancement and 2013 IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics. He is a recipient of the 2022 European Association for Signal Processing Group Technical Achievement Award. His research interests include statistical signal processing and machine learning methods for single- and multi-microphone arrays, with applications to speech enhancement, noise reduction and speaker separation, and diarization, dereverberation, speaker localization, and tracking. He is a Fellow of IEEE. Patrick A. Naylor ([email protected]) received his Ph.D. degree from Imperial College London in 1990. He is now a professor of speech and acoustic signal processing at Imperial College London, SW7 2AZ London, U.K. He has served on the IEEE Signal Processing Society Board of Governors, as chair of the IEEE Audio and Acoustic Signal Processing Technical Committee, as an associate editor of IEEE Signal Processing Letters, and as a senior area editor of IEEE Transactions on Audio, Speech, and Language Processing. He is a past president of the European Association for Signal Processing. His research interests include microphone array signal processing, speaker diarization and localization, and multichannel speech enhancement for application to binaural hearing aids. He is a Fellow of IEEE. Shoji Makino ([email protected]) received his Ph.D. degree from Tohoku University in 1993. He is currently a professor at Waseda University, Kitakyushu 808-0135 Japan. He has served on the IEEE Signal Processing Society (SPS) Board of Governors, SPS Technical Directions Board, SPS Awards Board, and SPS Fellow Evaluation Committee. He has received 30 awards, including the SPS Leo L. Beranek Meritorious Service Award in 2022, SPS Best Paper Award in 2014, IEEE Machine Learning for Signal Processing Competition Award in 2007, and ICA Unsupervised Learning Pioneer Award in 2006. His research interests include adaptive filtering technologies, acoustic signal processing, and machine learning for speech and audio applications. He is an SPS Distinguished Lecturer and a Fellow of IEEE. Walter Kellermann ([email protected]) received his Dr.-Ing. degree from the Technical University Darmstadt in 1988. He is currently a professor at the University of Erlangen-Nürnberg, 91058 Erlangen, Germany. He was a Distinguished Lecturer and a Vice President Technical Directions of the IEEE Signal Processing Society. He is a corecipient of ten best paper awards and a recipient of the Group Technical Achievement Award of the European Association for Signal Processing (EURASIP). His main research interests focus on physical model-based and data-driven multichannel methods for acoustic signal processing and speech enhancement. He is a Fellow of EURASIP and a Life Fellow of IEEE. Akihiko Sugiyama ([email protected]) received his Dr. Eng. from Tokyo Metropolitan University in 1998. He is currently working for Yahoo Japan Corporation, Tokyo 1028282, Japan. He served as the chair of the IEEE Audio and Acoustic Signal Processing Technical Committee, an associate editor of IEEE Transactions on Signal Processing, and a member of IEEE Fellow Committee. He was a technical
program chair for ICASSP 2012, past IEEE Signal Processing Society (SPS) Distinguished Industry Speaker, and past SPS and IEEE Consumer Technology Society Distinguished Lecturer. His research interests include audio coding and interference/noise control. He is a Fellow of IEEE.
75TH ANNIVERSARY OF SIGNAL PROCESSING SOCIETY SPECIAL ISSUE
Dong Yu, Yifan Gong, Michael Alan Picheny, Bhuvana Ramabhadran, Dilek Hakkani-Tür, Rohit Prasad, Heiga Zen, Jan Skoglund, Jan “Honza” Černocký, Lukáš Burget, and Abdelrahman Mohamed
Twenty-Five Years of Evolution in Speech and Language Processing
In this article, we summarize the evolution of speech and language processing (SLP) in the past 25 years. We first provide a snapshot of popular research topics and the associated state of the art (SOTA) in various subfields of SLP 25 years ago, and then highlight the shift in research topics over the years. We describe the major breakthroughs in each of the subfields and the main driving forces that led us to the SOTA today. Societal impacts and potential future directions are also discussed.
Introduction
Digital Object Identifier 10.1109/MSP.2023.3266155 Date of current version: 14 July 2023
The year 2023 marks the 75th anniversary of the IEEE Signal Processing Society (SPS). Technologies have advanced significantly in these 75 years, and society has been greatly impacted by these advances. For example, the mobile Internet has greatly changed people’s lifestyles. Researchers and practitioners in signal processing have contributed their share to this progress. In this article, we concentrate on the field of SLP, which is the scope covered by the IEEE Speech and Language Processing Technical Committee (SLTC), and summarize the major technological developments in the field and the key societal impacts of these advances in the past 25 years. As part of the SPS, the SLTC serves, promotes, and influences all the technical areas of SLP, including automatic speech recognition (ASR), speech synthesis [often referred to as text to speech (TTS)], speaker recognition (SPR) and diarization, language identification (LID), speech enhancement, speech coding, speech perception, language understanding, and dialog systems. The SLTC can trace its roots back to the Institute of Radio Engineers Audio Group, founded in 1947. In 1969, this audio group established the Speech Processing and Sensory Aids Technical Committee. “Sensory Aids” was dropped from the name in the early 1970s. For more than 30 years, it remained the Speech Processing Technical Committee. In 2006, its scope was expanded, and its name was officially changed to the SLTC. Today, more than 50 years after the formation of the SLTC and 16 years after the recent name change, the field has been
significantly expanded and greatly reshaped by new thoughts and techniques. In fact, we have observed rapid progress in SLP in the past decade, largely driven by deep learning, big data, high-performance computation, and application demands. For example, ASR accuracy has surpassed the adoption threshold in the close-talk setup, where the microphone is close to the mouth. TTS can now generate natural-sounding speech that is hard to distinguish from human speech [1]. The performance of natural language processing tasks has been greatly improved with huge, pretrained language models (LMs). The remainder of the article is organized as follows. In the “Status of the Field in 1998” section, we provide an overview of the field and summarize the main knowledge and key issues of each major subfield 25 years ago. In the “Main Driving Forces Over the Last Decades” section, we describe the main driving forces that have reshaped the field and caused a paradigm change in the past decade. In the “Major Technical Breakthroughs in Each Subfield” section, we summarize major breakthroughs and the current SOTA in each subfield. In the “Conclusion” section, we conclude the article with comments on the societal impact and perspectives on future developments in the domain.
Status of the field in 1998
SLP has been an active research area since the 1950s. By 1998, the field had already made great leaps, and many of the key technologies that we know today had been developed by then. In this section, we provide an overview of the field, summarize the main knowledge, and point out the key technical obstacles at that time.
Overview of the field
IEEE played an important role in pushing the SOTA of the field around 1998. IEEE Transactions on Speech and Audio Processing and ICASSP were the flagship journal and conference, respectively. In 1997, the SLTC extended the scope of the IEEE Automatic Speech Recognition Workshop and renamed it the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). Almost all of the popular subfields we study today had been extensively studied by 1998. The proceedings of ICASSP indicate that in 1998, the popular topics were ASR, speech enhancement, speech coding and perception, speech synthesis and analysis, speaker and LID, and speech-to-speech translation. The topic of spoken language understanding (SLU) was better covered in ASRU.
Speech coding
The task of speech coding is to compress speech signals for efficient storage and transmission. The major speech coding event around 1998 was the launch of the mixed-excitation linear prediction (MELP) codec [2], which is based on the linear prediction coder (LPC) but with five additional features. It was presented in 1997 as the winning candidate of the U.S. Department of Defense contest to select a new federal standard for narrow-band (8-kHz sampling frequency) voice coding at 2.4 kbps, replacing the previous LPC10 codec from 1984. In a way, this contest ended a decade-long golden age of speech coding, as the growth of digital mobile telephony applications in the late 1980s through the 1990s had required rapid development in the field.
At the time, speech coding algorithms could be categorized into two classes: model-based parametric codecs (also called source codecs or vocoders) such as MELP, operating mainly at rates up to 5 kbps, and waveform-matching codecs at rates above 5 kbps. Model-based parametric codecs usually consider the source-filter model of speech production and preserve only the spectral properties of speech. By sending only the source/excitation type and the filter parameters, model-based parametric codecs can achieve very low bit rates. Waveform-matching codecs aim to reproduce the speech waveform as faithfully as possible. They usually do not rely on prior knowledge about what might have created the audio. Even though the waveform-matching, linear-predictive codecs based on code-excited linear prediction (CELP) [3] employed a model separating the speech signal into an excitation signal driving a linear synthesis filter, the analysis by synthesis offered monotonically increased fidelity with an increased bit rate. Basically, all mobile telephony standards are based on the CELP methodology. At bit rates below 5 kbps, waveform matching performs poorly, and better quality is achieved by model-based parametric coding with efficient quantization of extracted speech features and vocoder synthesis. Parametric codecs have a tendency to produce speech that sounds a bit unnatural and robotic. The speech quality is limited by the model; hence, quality stops improving beyond a certain rate. The major application for parametric coders was secure voice communication, where bandwidth was limited and speech intelligibility was more important than high-fidelity quality.
After 1998, research interest in narrow-band, low-rate speech coding subsided. For example, a subsequent International Telecommunication Union Telecommunication Standardization Sector effort to standardize a new 4-kbps speech codec a few years later was essentially abandoned as irrelevant after the emergence of Voice over Internet Protocol (IP), and future mobile communication standards offered higher bandwidths and encouraged speech compression efforts toward higher bit rates and sampling rates.
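Purely as an illustration of the linear-prediction analysis shared by LPC10-style vocoders and CELP-style waveform coders, the following minimal sketch estimates an all-pole model from a single speech frame via the autocorrelation method. It is not taken from any codec standard; the function names, frame length, and toy signal are illustrative assumptions.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Estimate predictor coefficients a[1..order] of an all-pole model
    from one windowed speech frame, using the autocorrelation method
    (solving the Toeplitz normal equations)."""
    frame = np.asarray(frame, dtype=float)
    # Biased autocorrelation r[0..order].
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    residual_energy = r[0] - np.dot(a, r[1:order + 1])
    return a, residual_energy

# Toy usage: analyze a synthetic vowel-like 30-ms frame sampled at 8 kHz.
fs = 8000
t = np.arange(240) / fs
frame = np.hanning(240) * (np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 700 * t))
a, e = lpc_coefficients(frame, order=10)
print("LPC coefficients:", np.round(a, 3), "residual energy:", round(float(e), 3))
```

In a parametric coder, only quantized versions of such filter parameters (plus excitation information) would be transmitted, which is why the achievable bit rate can be so low.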
ASR
The task of ASR is to convert the speech audio sequence into the corresponding text sequence. The modern field of speech recognition, as a whole, had its origins in information theory as far back as the early 1970s. For a fascinating view of how this approach took hold, the reader is referred to [4]. By 1998, data-driven approaches employing complex statistical models had gained broad acceptance in the community. Indeed, ASR was already being viewed as a largely solved problem, with a large push being made toward the commercialization of the technology. Around 1998, a typical speech recognition system comprised the following components:
■ Feature vector extractor: mel-frequency cepstral coefficients (MFCCs) or perceptual linear prediction coefficients
■ Acoustic model (AM): a context-dependent, phonetic hidden Markov model (HMM) using Gaussian mixture models (GMMs) to model feature vector probabilities
■ LM: a backoff-based n-gram LM, sometimes a class-based LM
■ Hypothesis search: a Viterbi decoder, beam search; single or sometimes multipass (see the decoding sketch after this list)
■ Adaptation mechanisms: an AM adaptation with maximum-likelihood linear regression or maximum a posteriori (MAP); LM adaptation using cache-based interpolation with an add-word mechanism to handle out-of-vocabulary words.
It should be noted that neural network (NN) technology had achieved some limited success by 1998, but that its performance was not good enough to replace GMMs, much less HMMs. The early systems were all trained using the maximum-likelihood criterion. By 1998, discriminative training criteria already existed, but performance gains were small and expensive to achieve. Back in 1998, “large” systems were trained on a few hundred hours of speech.
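To make the hypothesis search component concrete, here is a minimal Viterbi decoder over an HMM with precomputed emission log scores. It is a toy sketch, not the decoder of any particular system; the transition matrix, state count, and random emission scores are illustrative assumptions.

```python
import numpy as np

def viterbi(log_emissions, log_trans, log_init):
    """Most likely HMM state sequence.
    log_emissions: (T, S) frame-by-state emission log-probabilities
    log_trans:     (S, S) state-transition log-probabilities
    log_init:      (S,)   initial-state log-probabilities"""
    T, S = log_emissions.shape
    delta = log_init + log_emissions[0]          # best score ending in each state
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # rows: previous state, columns: next state
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emissions[t]
    # Trace back the best path from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(delta.max())

# Toy usage with 3 states and 5 frames of random emission scores.
rng = np.random.default_rng(0)
log_em = np.log(rng.dirichlet(np.ones(3), size=5))
log_tr = np.log(np.array([[0.8, 0.15, 0.05], [0.1, 0.8, 0.1], [0.05, 0.15, 0.8]]))
log_init = np.log(np.array([0.9, 0.05, 0.05]))
print(viterbi(log_em, log_tr, log_init))
```

A production decoder of that era would additionally apply beam pruning and compose the LM, lexicon, and AM scores, but the dynamic-programming core is the same.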
TTS synthesis
The goal of TTS is to render naturally sounding speech given an arbitrary text. Two data-driven, corpus-based speech synthesis approaches were proposed in the 1990s: an exemplar-based, unit-selection approach, with which speech is synthesized by concatenating scaled pronunciation-unit samples from a corpus, and a model-based, generative approach.
Shortly after the proposal of the first data-driven unit-selection TTS [5] in 1992, an HMM-based, trainable unit-selection speech synthesis was proposed [6], where decision tree-clustered, context-dependent HMMs were used for unit segmentation and cost functions. The formulation and trainable framework of unit selection made it popular in R&D for the next two decades. At the same time, the first approach toward generative TTS was proposed in 1995 [7], where probabilities of acoustic features (vocoder parameters) given linguistic features (context-dependent phonemes) were modeled and generated using HMMs. The generative TTS’s flexibility to change its voice characteristics was demonstrated in [8].
The generative approach was still incomplete in 1998 as it lacked prosody modeling and generation. Prosody refers to the duration, intonation, and intensity patterns of speech associated with the sequence of syllables, words, and phrases. Without proper prosody modeling and generation, long sentences will sound unnatural. Although the unit-selection approach could synthesize naturally sounding speech for in-domain texts (those covered well by the corpus), due to data sparsity, its quality could degrade significantly for out-of-domain texts, with discontinuities in the generated speech caused by uncovered units. Furthermore, having multiple speakers, emotions, and speaking styles was difficult as it required an ample number of recordings with these characteristics. On the other hand, the generative approach had already demonstrated a way to change its voice characteristics. However,
the naturalness of synthesized speech was limited by the quality of vocoders.
SPR, identification, and diarization
The task of SPR is to infer speaker identity from the speech signal. The most straightforward task is speaker verification, which aims to determine whether two recordings were spoken by the same speaker or different ones. A range of other tasks can be derived from speaker verification, such as speaker identification (closed or open set), speaker tracking (determining speaker trajectories), and speaker search (determining where a specific voice is speaking). Two basic settings are text dependent (the speaker needs to provide a predetermined key phrase) and text independent. Speaker diarization is a derived, more complicated task as it aims to segment a recording into regions spoken by one speaker and generate speaker labels (such as “A,” “B,” and so on) consistently over the recording.
The status around 1998 is covered in an excellent tutorial by Campbell [9]. A typical SPR system was a statistical pipeline containing feature extraction, pattern matching, and decision making (see Figure 1). As in related fields, all the candidate features (LPC, MFCC, line spectral pairs, and so on) were investigated, and several matching techniques (Gaussian modeling, distance computation, and dynamic time warping) competed. R&D usually involved feature selection, testing, and the fusion of several matching techniques. Several researchers experimented with NNs, but without much success. The National Institute of Standards and Technology’s (NIST’s) Speaker Recognition Evaluation series started in 1996 and has since become a platform to evaluate SPR technology.
In speaker diarization, a typical system in 1998 already contained components similar to current ones (excluding the end-to-end ones): segmentation and automatic clustering of segments. The Kullback–Leibler (KL) distance was widely used. For segmentation, a sequence of cepstral features was extracted from the input speech and split into 2-s windows. A segment boundary was detected if the KL distance between Gaussian distributions estimated for the neighboring windows was above a threshold. The same distance was also used for agglomerative clustering of the resulting segments, i.e., initially treating each segment as a cluster and then gradually merging clusters. The 1996 DARPA Hub 4 Broadcast News Evaluation was a popular task used to evaluate diarization.
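A minimal sketch of the KL-distance-based change detection just described follows. The window length, hop, threshold, and synthetic features are illustrative assumptions, not values from the article: diagonal-covariance Gaussians are fit to neighboring windows of cepstral features, and a boundary candidate is declared when their symmetric KL divergence exceeds a threshold.

```python
import numpy as np

def symmetric_kl_diag_gauss(m1, v1, m2, v2):
    """Symmetric KL divergence between two diagonal-covariance Gaussians."""
    kl12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m2 - m1) ** 2) / v1 - 1.0)
    return kl12 + kl21

def detect_boundaries(features, win=200, threshold=25.0):
    """features: (num_frames, dim) cepstral features; with 10-ms frames,
    win=200 corresponds roughly to a 2-s window. Returns candidate change frames."""
    boundaries = []
    for t in range(win, len(features) - win, win // 4):
        left, right = features[t - win:t], features[t:t + win]
        d = symmetric_kl_diag_gauss(left.mean(0), left.var(0) + 1e-6,
                                    right.mean(0), right.var(0) + 1e-6)
        if d > threshold:
            boundaries.append(t)
    return boundaries

# Toy usage: two synthetic "speakers" with different feature statistics.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0.0, 1.0, (600, 13)), rng.normal(1.5, 0.7, (600, 13))])
print(detect_boundaries(feats))
```

The same distance can then drive agglomerative clustering of the resulting segments, as the text notes.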
LID
LID, also termed spoken language recognition or just language recognition (LR), aims to determine the language in a particular speech segment. Engineers usually depend on linguists and politicians (https://en.wikipedia.org/wiki/A_language_is_a_dialect_with_an_army_and_navy) to answer the “Language or dialect?” question and consider every class labeled with the same label as a language. Around 1998, two standard LID approaches were defined [10]: acoustic, aiming at the classification of a sequence of acoustic feature vectors into a class, and phonotactic, first tokenizing the input sequence into discrete units (phones)
using one or several phone recognizers and then performing language (phonotactic) modeling. Each class (language) was modeled by a GMM in the case of the acoustic approach, and by an n-gram LM in the case of phonotactic systems. The latter obviously depended on the accuracy of phone recognition. Similar to SPR, LID was driven by the NIST Speaker Recognition Evaluation series started in 1996.
Speech enhancement and separation
Real-world speech signals are often contaminated by interfering speakers, noises, and reverberation. The task of speech enhancement and separation, which aims at extracting clean speech signals from a mixture, is thus very important for both human-to-human and human–machine communication. Conventionally, people refer to speech separation as the problem of segregating one or more target speakers from other interfering speakers and/or noises, and speech enhancement as the problem of removing noises and/or reverberation from noisy speech.
The dominant techniques for speech enhancement were purely signal processing-based in 1998 [11]. Under this framework, enhancement of noisy speech signals is essentially a problem of estimating a clean speech signal from a given sample function of the noisy signal by minimizing the expected value of some distortion measure between the clean and estimated signals. Although these techniques (e.g., the Wiener filter) differ in the statistical models (e.g., Gaussian and hidden Markov processes) assumed, the distortion metric (e.g., minimum mean-square error) used, the domain (e.g., time, frequency, and magnitude domain) in which the enhancement is carried out, and the way the noise and speech statistics are estimated (e.g., minimum statistics-based noise estimator), they often assume that speech and noise follow statistically independent processes.
Speech separation is usually considered a more difficult problem because the target speech and the interfering speech share very similar characteristics. At that time, the main approach to speech separation, which focused on blind source separation, was independent component analysis, which aims at recovering a set of maximally independent sources from the observed mixtures without knowing the source signals or the mixing parameters. Also in that time period, the perceptual principles of human auditory scene analysis (ASA) were extensively studied. Many of these principles were later adopted in the computational ASA (CASA) [12] approach.
It is important to point out that the majority of the works at that time exploited only the information in the current audio stream, which is very different from today’s machine learning (ML)-based SOTA techniques that also take advantage of large training corpora collected or simulated over time. Furthermore, most of the work at the time concentrated on monaural speech processing. This is because microphone arrays were considered expensive and were rarely used in practical systems, except in meeting scenarios. The single-microphone setup was believed to be more important and relevant. Both constraints limited the performance of the then-SOTA systems.
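To ground the classical, signal processing-based view of enhancement described above, here is a minimal single-channel sketch of a Wiener-style spectral gain in the STFT domain. It is illustrative only: the noise is assumed stationary and estimated from the first few frames, and the frame size, hop, and parameter values are assumptions rather than settings from any published system.

```python
import numpy as np

def wiener_enhance(noisy, frame=256, hop=128, noise_frames=10):
    """Minimal STFT-domain Wiener-style enhancement of a 1-D signal.
    The noise power spectrum is estimated from the first frames,
    which are assumed to contain noise only."""
    window = np.hanning(frame)
    n_frames = 1 + (len(noisy) - frame) // hop
    spec = np.array([np.fft.rfft(window * noisy[i * hop:i * hop + frame])
                     for i in range(n_frames)])
    noise_psd = np.mean(np.abs(spec[:noise_frames]) ** 2, axis=0)
    snr = np.maximum(np.abs(spec) ** 2 / (noise_psd + 1e-12) - 1.0, 1e-3)  # crude a priori SNR
    gain = snr / (1.0 + snr)                      # Wiener gain per time-frequency bin
    enhanced_spec = gain * spec
    # Overlap-add resynthesis.
    out = np.zeros(len(noisy))
    for i in range(n_frames):
        out[i * hop:i * hop + frame] += np.fft.irfft(enhanced_spec[i], n=frame) * window
    return out

# Toy usage: a sinusoid in white noise, preceded by a noise-only segment.
rng = np.random.default_rng(2)
t = np.arange(16000) / 8000.0
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = np.concatenate([np.zeros(2000), clean]) + 0.2 * rng.standard_normal(18000)
enhanced = wiener_enhance(noisy)
print(noisy.shape, enhanced.shape)
```

The independence assumption mentioned in the text appears here implicitly: the noisy power spectrum is treated as the sum of speech and noise power spectra when forming the gain.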
SLU and dialog systems
Interacting with machines in natural language has been of continued interest to mankind since the early days of computing. Around 1998, the popular architecture for dialog systems included language understanding, a dialog manager, and natural language generation. Language understanding aims to interpret user utterances into a semantic representation that can be converted to back-end knowledge and task-completion resources. Then, the dialog manager may formulate a query to the back end and predict the next system action based on the results of the query and the dialog context.
[Figure 1 timeline: spectrographic analysis (1962, Kersta, spectrum analysis); statistical models (2000, Reynolds, GMM-UBM; 2008, Kenny, joint factor analysis; 2011, Kenny, i-vector); DNNs (2013, Google, d-vector; 2017, Snyder, x-vector; 2020, Desplanques, ECAPA-TDNN); pretrained models (2020, Meta, wav2vec; 2021, Meta, HuBERT; 2021, Microsoft, WavLM).]
FIGURE 1. The progress of SPR systems. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation time-delay NN; STFT: short-time Fourier transform; Fbank: filter-bank; AAM: additive angular margin; FC: fully connected (layer); GMM-UBM: Gaussian mixture model MAP-adapted from a universal background model; DNN: deep NN; CNN: convolutional NN.
Finally, natural language generation produces the natural language utterance that conveys the system action.
For language understanding, approaches that rely on data and ML methods became feasible with the availability of annotated datasets, such as the air-travel-related queries of the DARPA airline travel information systems project. Traditionally, language understanding consists of a set of tasks: intent/domain classification and slot tagging [identifying contiguous spans of words in an utterance that correspond to certain parameters (i.e., slots) of a user query], which have been treated as sequence classification and sequence-tagging problems, respectively [13]. The most popular technique for slot tagging at that time was the conditional random field (CRF). For language-understanding tasks, annotated data enabled the combination or replacement of the earlier symbolic approaches with Bayesian classifiers and HMMs [14].
For dialog management, the common approaches around 1998 relied on dialog flows [15] that were designed by engineers to represent the interactions between the machine and humans. Approaches that rely on ML-based methods for learning dialog policies, such as [16], which proposed reinforcement learning for predicting the next system action, were just starting to appear. For language generation in dialog systems, the majority of the work was also template or grammar based. ML-based methods, which separated sentence planning and realization for language generation and aimed to reduce the high cost of handcrafting knowledge-based generation systems, started appearing.
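As a minimal illustration of treating intent classification as a data-driven classification problem, here is a bag-of-words naive Bayes sketch. The utterances and intent labels are made up for illustration; this is not the DARPA ATIS data or any system from the article.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: utterances paired with intent labels.
utterances = [
    "show me flights from boston to denver",
    "i want to fly to seattle tomorrow morning",
    "what is the cheapest fare to chicago",
    "how much does a ticket to miami cost",
    "cancel my reservation to atlanta",
    "please drop my booking for friday",
]
intents = ["find_flight", "find_flight", "get_fare", "get_fare", "cancel", "cancel"]

# Bag-of-words features followed by a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(utterances, intents)

print(model.predict(["what flights go to denver on monday"]))   # expected: find_flight
print(model.predict(["how expensive is a ticket to boston"]))   # expected: get_fare
```

Slot tagging, in contrast, would assign a label to every word in the utterance, which is why sequence models such as CRFs were preferred for it.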
Main driving forces over the last decades
SLP fields underwent a slow development period of roughly a decade and then moved onto a fast track after 2010. Since 2010, we have seen rapid progress with various new modeling techniques and significantly improved performance. This progress has been driven by the ability to relax previously existing modeling constraints through a combination of deep learning, big data, and high-performance computing.
Time for a paradigm change
Big data
The Internet and various digital applications significantly increased the amount of data available to improve SLP systems. It was estimated that 2.5 quintillion bytes of data would be created every day in 2022. For example, in the 1998 time frame, typical large ASR systems would be trained on a few hundred hours of speech and 300 million words of text. Today, it is not uncommon to see systems trained on 100,000 h of speech, with some sites [17] using more than a million hours of speech. Although more data alone may not guarantee performance improvements, when combined with model-size increases, significant performance improvements result.
High bandwidth
Back in 1998, communication bandwidths were still at modem speeds (56 kb/s!), whereas today, bandwidth is on the order of 200 Mb/s, even for a relatively low-end connection. This huge increase in bandwidth enables almost instantaneous uploading and downloading of speech signals and models, making it practical to utilize extremely large and accurate models in the cloud, resulting in dramatically increased SLP performance.
Affordable, high-performance computing
In a related development, the amount of computing power now available has also dramatically increased. Clock speed alone has increased by more than a factor of 100 over the last 25 years. The ability to pack multiple computing cores in a single processor and/or coprocessor has added even more computing capability. This enabled efficient parallelization of the large number of matrix operations required in deep learning. It allows some very powerful SLP tasks to be done locally, with reduced latency, and also permits even bigger models to be employed in the cloud to achieve even greater performance gains and yet still run in real time.
Open source tools
Another driver of performance improvements has been the community’s emphasis on research reproducibility and providing open source implementations of newly developed techniques, e.g., Kaldi (https://github.com/kaldi-asr/kaldi). Investments in deep learning platforms like TensorFlow and PyTorch have additionally sped up the rate of development of new speech processing toolkits, which powered a new wave of fundamental research, e.g., ESPnet (https://github.com/espnet/espnet), SpeechBrain (https://github.com/speechbrain/speechbrain), and Fairseq (https://github.com/facebookresearch/fairseq).
Deep learning and big models
Large-scale data sets and powerful computing infrastructure have enabled the adoption of deep learning techniques. Relative to the previous generation of technology, they have higher modeling capacity and can thus better leverage big data and offer significantly improved generalization abilities.
Neural architectures
Before the 2000s, single hidden layer, feedforward NNs were used in speech systems, either as a replacement for GMMs or as feature extractors [18], [19]. Later in the 2010s, deep feedforward NNs with many layers of latent representations began to replace latent variable models, e.g., GMMs, partially observable Markov decision processes, and i-vectors in speech systems, offering significant improvements across multiple benchmarks. Unlike GMMs, NNs do not make any assumptions about the input data distribution. They are able to use far more data to constrain each parameter because the output for each training case is sensitive to a large fraction of the weights. Recurrent NNs (RNNs), with and without explicit memory, e.g., simple RNNs and long short-term memories, can capture long-range dependencies between inputs and bring finer-grain integration of temporal information for speech and language representation
[20]. In RNNs, connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes. However, the sequential nature of recurrent networks made it harder to leverage the all-parallel world of GPUs, which revolutionized the deep learning community with massive computational increases. Transformer layers [21] leverage GPUs much better by utilizing positionwise feedforward layers (i.e., the same feedforward layers are used for each position in the sequence) and replacing the recurrence operation with a multihead self-attention sublayer (which runs an attention mechanism several times in parallel) that captures input dependencies using a constant number of sequential operations.
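A minimal sketch of the scaled dot-product self-attention computation at the core of such sublayers follows (a single head, with no learned projections or masking shown; purely illustrative and not from any particular toolkit).

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention over a sequence.
    x: (seq_len, d_model). Queries, keys, and values are the inputs themselves;
    a real transformer applies learned linear projections first."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                     # (seq_len, seq_len) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over key positions
    return weights @ x                                # each output attends to all inputs in parallel

# Toy usage: 5 frames of 8-dimensional features.
rng = np.random.default_rng(3)
print(self_attention(rng.standard_normal((5, 8))).shape)  # (5, 8)
```

Because every position is computed from the same matrix products, the whole sequence can be processed in parallel on a GPU, which is the practical advantage over recurrence noted above.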
End-to-end modeling and optimization
Typical speech recognition, generation, translation, and dialog systems combine many logical components, e.g., feature representation, AM, LM, speaker, pronunciation, translation models, waveform generation, and hypothesis search, to name just a few. Modeling each logical component explicitly in classical systems ensured tight performance control and enabled easier integration of human knowledge. The first wave of research on neural models for spoken language systems represented each module with a neural counterpart, offering solid gains while maintaining the modularity of existing systems. Advances in numerical methods for optimizing NNs (e.g., layer normalization and residual connection) enabled neural models to combine multiple logical functions. Recent approaches trained single end-to-end models (instead of optimizing each component separately) to represent entire systems, e.g., ASR [20], [22], dialog systems [23], and TTS [24].
Big, pretrained models
One major bottleneck for deep neural models is their reliance on large volumes of labeled data, which is aggravated in end-to-end models that rely exclusively on data and skip low-level domain knowledge, as feature engineering is accomplished automatically in the network instead of by human design. The help came from semisupervised training (which utilizes both labeled and unlabeled data) and self-supervised training (which obtains supervisory signals from the data itself), which, by leveraging massive, unlabeled speech data, have recently reached unprecedented performance levels. Pseudolabeling, also known as student–teacher distillation, trains a teacher model on a few hours of labeled data and then trains a student model to track the outputs the teacher generates on unlabeled data [25]. Self-supervised approaches [26] utilize pretext tasks to pretrain models generatively, contrastively, or predictively. These representation learning models impacted a wide range of downstream spoken language tasks, e.g., ASR, speaker diarization, SLU, and spoken question answering.
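A minimal sketch of the pseudolabeling recipe on generic feature vectors rather than speech follows; the synthetic data, classifier choice, and confidence threshold are illustrative assumptions, not details from [25].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# A small labeled set and a much larger unlabeled set (synthetic 2-D data).
x_labeled = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y_labeled = np.array([0] * 20 + [1] * 20)
x_unlabeled = np.vstack([rng.normal(-1, 1, (500, 2)), rng.normal(+1, 1, (500, 2))])

# 1) Train a teacher on the labeled data only.
teacher = LogisticRegression().fit(x_labeled, y_labeled)

# 2) Let the teacher generate pseudolabels on unlabeled data; keep confident ones.
probs = teacher.predict_proba(x_unlabeled)
confident = probs.max(axis=1) > 0.9
pseudo_labels = probs.argmax(axis=1)[confident]

# 3) Train a student on labeled plus pseudolabeled data.
student = LogisticRegression().fit(
    np.vstack([x_labeled, x_unlabeled[confident]]),
    np.concatenate([y_labeled, pseudo_labels]),
)
print("pseudolabeled examples used:", int(confident.sum()))
```

In speech systems, the labeled examples would be transcribed utterances and the student would typically be a larger network trained on far more unlabeled audio than the teacher ever saw transcripts for.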
Generative modeling
Parallel to the efforts in learning representations using unlabeled speech and audio data, there was an active line of research for modeling data distributions and learning high-quality generative models. Autoregressive (AR) generative models predict future values based on past values, both
given and predicted, in previous steps. They factorize a high-dimensional data distribution into a product of AR conditional distributions. Variational autoencoders (autoencoders whose input is encoded as a distribution over the latent space instead of a single point) combine an encoder, which converts the input into a latent representation, with a decoder, which reconstructs the input from that representation. Thanks to reparameterization, a technique that handles otherwise undifferentiable expectations by rewriting them so that the distribution with respect to which we take the gradient is independent of the model parameters, the encoder and decoder can be trained jointly to reconstruct input data from samples of the learned latent distributions. Generative adversarial networks (GANs) mimic the input data distribution through an adversarial game between a generator and a discriminator that aims to distinguish realistic inputs from fake generated ones [27]. Utilizing these generative approaches has led to significant progress in generative models of speech that are controllable and of realistic quality.
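A minimal sketch of the reparameterization step itself follows; the encoder and decoder here are trivial stand-ins for real networks, and all shapes are illustrative. Instead of sampling the latent variable z directly from N(mu, sigma^2), one samples eps from N(0, I) and computes z = mu + sigma * eps, so the sampling noise is separated from the parameters being optimized.

```python
import numpy as np

rng = np.random.default_rng(5)

def encoder(x):
    """Stand-in encoder: maps an input vector to the mean and log-variance
    of a diagonal Gaussian over the latent space (a real VAE uses an NN)."""
    mu = 0.5 * x[:4]
    log_var = -np.abs(x[4:8])
    return mu, log_var

def decoder(z):
    """Stand-in decoder: maps a latent sample back to input space."""
    return np.concatenate([2.0 * z, -2.0 * z])

x = rng.standard_normal(8)
mu, log_var = encoder(x)

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

x_hat = decoder(z)
reconstruction_error = np.mean((x - x_hat) ** 2)
kl_to_prior = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
print(round(float(reconstruction_error), 3), round(float(kl_to_prior), 3))
```

The two printed terms correspond to the reconstruction and regularization parts of the usual VAE training objective; in a real model both would be backpropagated through mu and log_var.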
Major technical breakthroughs in each subfield
Although deep learning has reshaped the whole field, different subfields have different problems to solve. In this section, we introduce the major technical breakthroughs in each major subfield in the past decades and describe the current SOTA.
Speech coding
As higher bandwidths became available for speech communication, first through Voice over IP applications such as voice and video conferencing over the Internet, and later in mobile communication such as Voice over LTE, bit-rate-scalable codecs were introduced. These codecs could operate at rates and bandwidths ranging from 5 to 6 kbps for narrow band up to hundreds of kilobits per second for full-band speech and general stereo audio. Examples are Opus (https://opus-codec.org/), unified speech and audio coding [28], and enhanced voice services [29]. To achieve the ability to operate over this wide range of bit rates and signal bandwidths, these codecs combine linear predictive time-domain methods from low-rate speech coding with transform coding (such as the modified discrete cosine transform) common in high-quality general audio coding. Transform coding compresses audio data by representing the original signal with a small number of transform coefficients. These are all still based on traditional digital signal processing techniques.
As ML became successful in other speech processing areas, it also made its way into speech coding. WaveNet [30] showed that generative modeling can achieve impressive speech quality when conditioned on traditional speech analysis features such as spectral envelopes and pitch parameters. In [31], it was shown that traditional low-rate parametric speech codec features could drive a WaveNet neural synthesis and produce high-quality wideband speech at 2.4 kbps. Since then, other methods have been presented, driving down the complexity and bit rate further. Recent work [32] has produced high-quality speech at ~400 bps. Most of today’s neural speech codecs are excellent at reproducing clean speech; however, in a practical speech coding system, in addition to complexity constraints for running in
real time on devices, the system should also be able to handle speech in different background scenarios, e.g., in noisy and/or reverberant environments. This has been a challenge for neural speech codecs. In [33], an end-to-end system based on an autoencoder with GAN losses was proposed, and this method achieved good quality for both clean and noisy speech.
ASR
Initial advances in deep learning-based speech recognition were based on extensions to the existing GMM-HMM framework [34]. The basic HMM nature was not touched; deep learning was only applied to output distributions in the HMM framework. Over time, there were increasingly more attempts to replace the HMM framework with one based solely on deep learning.
A more speech-focused methodology called connectionist temporal classification (CTC) [35] combined HMM concepts with sequence-to-sequence mapping. It took almost 10 years to demonstrate competitive performance over hybrid models, with the realization that phone-based models worked better for CTC than state-based ones. CTC also produced a competitive performance with grapheme-based units, eliminating the need for costly pronunciation dictionaries (at least for systems with adequate numbers of training data).
Benefiting from the monotonic relationship between ASR inputs and outputs, the RNN transducer [20] took the modeling process further by augmenting the AM with a prediction network, which replaced the need for an LM and was trained jointly within the whole “end-to-end” ASR model.
Encoder–decoder models with attention [21], a processing mechanism that allows a network unit at a layer to pay more attention (with a higher weight) to specific units at other layers, were then successfully adapted from the translation community, but such networks’ freedom to reorder outputs would sometimes introduce new types of speech recognition errors.
Another significant advance in speech recognition occurred when it was realized that self-supervised learning concepts, as embodied in Bidirectional Encoder Representations from Transformers (BERT), could be adapted to improve speech recognition performance. To achieve that, the concept of masking discrete elements in a text stream, as is done in BERT, needed to be extended to speech, which is a continuous signal. More generally, self-supervised methods were further extended to the pretraining of speech models [26] so that the more accessible, unlabeled speech data may be exploited to improve speech processing performance. Again, the main challenge here was extending models originally developed for discrete units (text) to continuous units (speech) without obvious reconstruction targets, as there are in a text stream.
TTS
Unit-selection TTS was popular in R&D in the early 2000s. There were many commercial unit-selection TTS systems and several open source software toolkits. In generative TTS, statistical parametric speech synthesis [36] with high-quality vocoders gained popularity in the late 2000s. The Blizzard Challenge (https://www.synsig.org/index.php/Blizzard_Challenge), an annual event that evaluates TTS systems with a common training speech corpus and a set of test sentences, started in 2005 and helps researchers and developers compare different technologies on the same ground.
Deep learning was first introduced to replace the HMM-based AM in generative TTS [37]. In 2016, WaveNet [30], an AR generative model for raw audio, demonstrated that it could integrate an AM and a vocoder into a single generative model and synthesize more naturally sounding speech than conventional unit-selection and generative TTS systems. In parallel, the AR encoder–decoder models with attention were successfully adapted as an AM of generative TTS [38], like other sequence-to-sequence mapping problems. A combination of the encoder–decoder model with the WaveNet-based vocoder model achieved near-human-level synthetic speech [39]. Recently, non-AR generative models demonstrated that they could achieve the same or better performance than these AR generative models, both in AMs and vocoders [1]. Finally, integrating these two components into a single model to make the entire system fully end to end is being actively investigated [24].
Some SOTA, NN-based TTS systems have demonstrated human parity in the reading-speech synthesis domain (in contrast to conversational speech, which features wide prosody variations). Current research in TTS targets harder speech generation tasks, such as synthesizing texts in low-resource languages, handling code mixing (the embedding of linguistic units such as words and morphemes of one language into an utterance of another language) and code switching (alternating between two or more languages) within a sentence, synthesizing long-form texts, realizing expressiveness, and synthesizing nonverbal vocalizations such as laughter. Humans are still significantly better than TTS at these tasks. Developments in data collection, analysis, and modeling can help us tackle these hard tasks.
SPR, identification, and diarization
In SPR, the beginning of the 2000s was dominated by Gaussian mixture speaker models MAP-adapted from a universal background model (GMM-UBM) [40]. Two important variations of using adapted GMMs existed: 1) direct evaluation of the likelihood of an utterance, where the verification score was the log-likelihood ratio between the GMM-UBM and the speaker-adapted GMM, and 2) extracting the adapted GMMs’ parameters as “speaker supervectors” and using them as the input to another classifier [e.g., a support vector machine (SVM)].
Many techniques, such as feature mapping and nuisance attribute projection, were developed to compensate for channel variability so that speaker variability could be better identified. In [41], joint factor analysis (JFA) was introduced as an improvement to the previous GMM-UBM/MAP approach,
where large GMM models could be robustly and independently adapted to the speaker and/or channel of an utterance. Similar to the eigenvoices adaptation used in ASR, in which each speaker is represented as a linear combination of latent basis vectors named eigenvoices, only low-dimensional speaker and channel latent vectors needed to be estimated from an input utterance. The subsequent i-vector approach directly used the latent vectors as low-dimensional, fixed-length embeddings of speech utterances [42]: i-vectors defined only one “totalvariability” space, and supervectors of GMM means were projected into such a space by a total-variability matrix trained on a huge number of unlabeled speaker data. I-vectors, however, included both wanted speaker and unwanted channel information, so scoring had to be implemented by probabilistic linear discriminant analysis (PLDA), rather than by simple distance metrics. I-vectors dominated the field for more than a decade, and they became popular elsewhere, from LR to the adaptation of ASR systems (even those based on deep learning). SPR was actually one of the last ML fields where GMMs surrendered to NNs. Efforts have been made for more than a decade, and researchers have registered partial victories (such as NN-based features and NN alignments, instead of using Gaussian components), but the true switch to NNs came after Snyder et al. [43] trained a time-delay NN on a large pool of speakers with a speaker identification criterion. The NN has several blocks: 1) extracting frame-by-frame hidden representations; 2) pooling over time, resulting in a fixed-length representation of an utterance; 3) adding a few more NN layers to produce the embedding (x-vector); and 4) during training, the “x-vector” is connected to a linear classification layer, which was discarded once the “x-vector” extractor was trained. Since the introduction of x-vectors, the SPR standard architecture has stabilized with the chain feature-extraction—embedding extraction—back end. Current work in SPR is compatible with the other ML fields and includes research in data augmentation, novel network architectures (often taken from the computer vision community, such as ResNet34), training criteria, end-to-end systems (including trainable signal processing blocks), and the use of pretrained models. In diarization, the Bayesian information criterion (BIC) [44] has long been used for both segmentation and clustering. Diarization has closely followed the developments in SPR: i-vectors (or x-vectors) were used to represent the speech segments, and PLDA was used to evaluate the similarity for segment clustering. Also, the BIC-based segmentation was replaced by simpler uniform segmentation, where i- and x-vectors are extracted every 0.25 s from a window of approximately 1.5 s. Hierarchical agglomerative clustering, spectral clustering, or clustering based on Bayesian HMMs were typically used to cluster the segments. Variational Bayes (VB) diarization (Bayesian HMM with eigenvoice priors [45]) was unique as it did not perform separate segmentation and clustering steps. VB techniques worked excellently with deep NN (DNN)-generated x-vectors representing segments of fixed length. Current work in diarization also targets end-to-end architectures, and promising results have been obtained with target-speaker voice activity detection. 34
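A minimal numpy sketch of the x-vector-style computation described above follows (frame-level layers, statistics pooling over time, and an utterance-level embedding). The weights are random stand-ins for trained parameters, the layer sizes are illustrative, and the classification head used during training is omitted, as the text notes it is discarded after training.

```python
import numpy as np

rng = np.random.default_rng(6)

def relu(a):
    return np.maximum(a, 0.0)

# Random stand-in weights for two frame-level layers and one segment-level layer.
W1 = rng.standard_normal((24, 64)) * 0.1   # frame features (24-dim) -> hidden (64)
W2 = rng.standard_normal((64, 64)) * 0.1
W3 = rng.standard_normal((128, 32)) * 0.1  # pooled stats (2 * 64) -> 32-dim embedding

def xvector_embedding(frames):
    """frames: (num_frames, 24) acoustic features for one utterance."""
    h = relu(frames @ W1)          # frame-level representation 1
    h = relu(h @ W2)               # frame-level representation 2
    stats = np.concatenate([h.mean(axis=0), h.std(axis=0)])  # statistics pooling over time
    return relu(stats @ W3)        # fixed-length utterance embedding ("x-vector")

# Two utterances of different lengths map to embeddings of the same size.
e1 = xvector_embedding(rng.standard_normal((300, 24)))
e2 = xvector_embedding(rng.standard_normal((750, 24)))
cosine = float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-9))
print(e1.shape, e2.shape, round(cosine, 3))
```

In a trained system, the fixed-length embeddings would then be compared with a back end such as PLDA (or, in simpler setups, cosine scoring) rather than with random-weight outputs as in this toy example.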
LID
Significant improvements were made to the acoustic approach of LID in the early 2000s by reusing discriminative training that had previously been tested in ASR. As phone recognition was the first speech field where NNs achieved significant success, it is not surprising that the phonotactic approach of LID benefited from the development of reliable phone recognizers. LID then evolved alongside SPR because the same groups and people typically worked on both techniques, and some of this evolution is covered by [46]. LID based on JFA and i-vectors virtually eliminated the need for phonotactic approaches. Although some attempts to use DNNs for LID still found it advantageous to combine with GMMs and use NNs as feature extractors, others have shown the superiority of NNs, leading to neural approaches dominating the LID field earlier than SPR. X-vectors have also proven their modeling power for LID [47], with simple, discriminative Gaussian classifiers used as the back end. The interest in LID also initiated several data collection efforts, from the extraction of telephone calls from broadcasts done by Brno University of Technology and the Linguistic Data Consortium around 2009, to the recent VoxLingua107 data collection. As for SPR, LID currently witnesses developments in data augmentation, new NN architectures, and end-to-end systems.
Speech enhancement and separation
Over the past two decades, we have observed significant progress in speech enhancement and separation. Most of the developments are summarized in [48], [49], [50], [51], and [52]. In 2001, nonnegative matrix factorization (NMF) [52], an unsupervised data-driven technique, was introduced under the assumption that the audio spectrogram has a low-rank structure that can be represented with a small number of nonnegative bases. Under certain conditions, the decomposition in NMF is unique, without making orthogonality or independence assumptions. The main difference between NMF and previous signal model-based approaches (e.g., the Wiener filter) is that NMF uses clean speech and noise streams to learn the bases, and then applies these bases during the testing phase.
Around the same time, CASA was proposed [12]. In CASA, certain segregation rules based on perceptual grouping cues (e.g., pitch and harmonics that can be used to distinguish different speakers) are designed to operate on low-level features such as a spectrogram to estimate a time-frequency (T-F) mask for each signal component belonging to different speakers. This mask is then used to reconstruct the signal by multiplying it with the input. Although CASA has many limitations [50], the idea of estimating T-F masks, when combined with data-driven approaches, has reshaped the direction of speech enhancement.
Deep learning has also led to a paradigm shift in speech enhancement and separation. The key idea is to convert the original problem into a supervised learning problem [49], [50]. As the target clean speech is seldom available in real-world recordings, the training sets are usually synthesized by mixing various clean speech, noise, and reverberation conditions. The task then becomes extracting the clean speech from the synthesized mixture. As the mixing sources and parameters are known during
the training phase, various training objectives (mostly T-F mask and signal-matching based) can thus be directly defined if only one speaker needs to be extracted from the mixture. However, if two or more speakers need to be separated and extracted from the mixture and there is no information (e.g., speaker embedding, face, or location) during the segregation process to identify the order of the set of extracted speech streams, some technique is needed to solve the permutation-ambiguity issue [50]. The two most effective approaches for solving this problem are deep clustering [53] and permutation-invariant training (PIT) [54]. Unfortunately, the synthesized mixture, although it can be close, is different from real recordings. To exploit real recordings that do not come with separation or enhancement targets, mixture-invariant training [55] was proposed and achieved significant success when combined with PIT.
Another observation in the past two decades is the improved exploitation of additional information. For example, the research on multichannel processing [48] and multimodal processing [51] has significantly increased. Multichannel processing can utilize spatial information to improve the performance of speech enhancement and separation. Beamforming based solely on signal processing was the dominant multichannel technique 10 years ago. Today, deep learning has been exploited to estimate sound statistics, learn a dynamic beamformer, and introduce additional target clues such as speaker embedding and multimodal information.
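A minimal sketch of a permutation-invariant loss for two sources follows; it is illustrative only (a real system evaluates this on network outputs such as estimated T-F masks or waveforms), but it shows the core idea: the loss is computed for every possible speaker ordering and the smallest value is used for training, which removes the permutation ambiguity.

```python
import numpy as np
from itertools import permutations

def pit_mse_loss(estimates, references):
    """Permutation-invariant MSE between estimated and reference sources.
    estimates, references: (num_sources, num_samples)."""
    num_sources = estimates.shape[0]
    best = np.inf
    for perm in permutations(range(num_sources)):
        loss = np.mean([np.mean((estimates[i] - references[p]) ** 2)
                        for i, p in enumerate(perm)])
        best = min(best, loss)
    return best

# Toy usage: the estimates equal the references in swapped order plus noise,
# so the permutation-invariant loss stays small despite the swap.
rng = np.random.default_rng(7)
refs = rng.standard_normal((2, 1000))
ests = refs[::-1] + 0.01 * rng.standard_normal((2, 1000))
print(round(float(pit_mse_loss(ests, refs)), 5))
```

Deep clustering solves the same ambiguity differently, by learning per-bin embeddings that are clustered into speakers rather than by searching over output permutations.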
SLU and dialog systems
The past two decades have been flourishing for dialog systems. Due to the advancements in speech and language technology, several commercial applications, such as customer service applications, virtual personal assistants, and social bots, have been launched, resulting in a huge number of interactions. These have resulted in even more research interest in dialog systems as they have enabled us to identify remaining challenges and new conversational application domains and scenarios. For language understanding, the methods relying on SVMs and CRFs were followed by the use of DNNs [56] and large, pretrained LMs fine-tuned for language understanding [57]. Similar to language understanding, pretrained LMs have proven to be useful for other dialog tasks as well; for example, dialog-state tracking and response generation. Inspired by these works, zero-shot (with no training samples) and few-shot (with only several training samples) methods that rely on fine-tuning question answering [58], prompt tuning, and instruction tuning [59] were shown to be effective. In parallel, end-to-end methods based on deep learning [23], for both task-oriented and open-domain dialog systems, became popular. Most recently, ChatGPT, a general-purpose, open-domain dialog system based on the generative pretrained transformer [60], has shown great potential and become prevalent.
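As a minimal illustration of the zero-shot flavor of language understanding mentioned above, the sketch below uses the Hugging Face transformers zero-shot-classification pipeline, which frames the task as natural language inference; the model name, utterance, and candidate intents are illustrative assumptions rather than the setups of the cited works.

```python
from transformers import pipeline

# Zero-shot intent detection: no task-specific training samples are used.
# The underlying NLI model scores each candidate label against the utterance.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")  # illustrative choice

utterance = "Can you book me a table for two at an Italian place tonight?"
candidate_intents = ["restaurant_reservation", "weather_query",
                     "play_music", "set_alarm"]

result = classifier(utterance, candidate_labels=candidate_intents)
print(result["labels"][0], result["scores"][0])  # top-ranked intent and its score
```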
Conclusion
In this article, we reviewed major advances made in the SLP fields in the past 25 years. The availability of more data, higher
computation power, and advancements in deep learning techniques have accounted for the majority of the progress made in SLP in the last decade. In this section, we summarize the state of the field today, with comments on future developments.
Comments on the field today
Figure 2 compares the ICASSP SLP paper theme category percentages (2023 versus 1998). We observe that the percentage of categories such as ASR, speech synthesis, SPR, language modeling, speech enhancement, and speech analysis has drastically increased over the last 25 years. Language understanding, emotion recognition, voice conversion, multilingual ASR, speaker diarization, speech corpora and resources, ML for language processing, and self-supervised learning emerge as significant theme categories. In contrast, ASR robustness (now achieved with a large amount of realistic training data), features for ASR (feature engineering is now part of data-driven model training), NNs (speech modeling with NNs has become a universal tool), and speech coding no longer take up a noticeable percentage as theme categories. Figure 3 shows the evolution of ICASSP SLP paper submission statistics for the past 20 years. We observe that the number of paper submissions has nearly tripled. In particular, we have seen a roughly 20% year-over-year increase over the last five years. The developments in SLP have enabled many scenarios and significantly improved our daily lives. For example, ASR techniques, given their significantly improved accuracy, are now widely used in smart assistants such as Siri, Alexa, and Google Now; in-car entertainment systems; voice search systems; and medical transcription systems. ASR techniques have also enabled many other scenarios, such as speech-to-speech translation. Due to the high quality of synthesized speech, TTS techniques have significantly improved the multimodal and multimedia content generation process. They are widely used in audiobooks, digital humans, and dialog systems.
FIGURE 2. Evolution of the percentage of paper categories of ICASSP SLP papers (1998 versus 2023), ordered by 2023 category percentage.

FIGURE 3. Evolution of ICASSP SLP paper submission statistics (from 2002 to 2023).

Perspectives on future developments
We have observed the convergence of techniques in most of the subfields of SLP. Only decades ago, these subfields were based on very different theories and models; today, most of them are based on the same set of deep learning techniques. When a new effective model or algorithm is developed in one of these subfields, it is quickly applied to other subfields and brings progress to them. We welcome this trend and believe it will continue because, although many problems seem to be different on the surface, they are identical at a higher and more abstract level. At the same time, we believe problems in different subfields come with different assumptions and have different structures. These assumptions and structures should be taken into consideration when designing models to advance the SOTA.
Given this convergence, one of the main impacts we expect to see in the coming years is an increasing number of systems that may have different preprocessing and postprocessing modules for different modalities, but share a common central architecture. This will facilitate cross-modal learning and data sharing, something humans do easily but with which current systems still struggle.
For language understanding and dialog-response generation, although these new methods (e.g., ChatGPT) have resulted in significant improvements, several challenges remain, such as maintaining the factual accuracy of the responses, modeling longer context that is important for future interactions, and common-sense reasoning. Another area that still requires significant technological advancement is catastrophic forgetting. The SOTA today in speech and other modalities involves fine-tuning a large, pretrained network; this results in biasing
the network to the new data with a resultant loss in robustness. Again, although there have been some attempts to deal with this problem, we have a long way to go.
We would also like to point out that the majority of the progress has been driven by data-driven techniques. This, however, does not mean that theoretical models are useless or less meaningful. In fact, we think theories of deep learning should be established to explain the models and results we have thus far, and new theories should be developed to guide further development of the fields in which they apply. It is also beneficial to combine theoretical models with data-driven techniques, e.g., in the speech enhancement and separation fields.
As the performance of the systems continues to improve, it is very important to maintain ethical standards. For example, as synthesized speech is no longer distinguishable from human speech under some conditions, we need to make sure that advanced TTS techniques are not used to deceive people or for illegal financial gain. The research community should have clear guidance on how to evaluate the benefits and side effects of new research and techniques. Both technical and legal mechanisms are also needed to prevent malicious actors from gaining access to powerful techniques, and to identify AI-generated content.
Acknowledgment
Dong Yu is the corresponding author.
Authors
Dong Yu ([email protected]) received his Ph.D. degree in computer science from the University of Idaho. He is a distinguished scientist and vice general manager at Tencent AI Lab, Bothell, WA 98011 USA. Prior to joining Tencent in 2017, he was a principal researcher at Microsoft Research, Redmond, WA. He has two monographs and more than 300 papers to his credit. His work has been widely cited and recognized by the IEEE Signal Processing Society Best Paper Award in 2013, 2016, 2020, and 2022, as well as by the 2021 NAACL Best Long Paper Award. He was elected chair of the IEEE Speech and Language Processing Technical Committee from 2021 to 2022. His research focuses on speech and natural language processing. He is a Fellow of IEEE and a fellow of the Association for Computing Machinery and the International Speech Communication Association.
Yifan Gong ([email protected]) received his Ph.D. degree in computer science from the Department of Mathematics and Computer Science, University of Nancy I, France. He leads a speech modeling team developing machine learning and speech technologies across scenarios/tasks at Microsoft, Redmond, WA, USA. Prior to joining Microsoft in 2004, he worked as a senior research scientist at the National Center of Scientific Research, France, and then as a senior member of the technical staff at Texas Instruments. He has authored and coauthored more than 300 publications in books, journals, and conferences, and has more than 70 granted patents. He serves on the Senior Editorial Board of IEEE Signal Processing Magazine and is the elected chair of the IEEE Speech and Language Processing Technical Committee
(2023–2024). His research focus is on speech processing. He is a Fellow of IEEE.
Michael Alan Picheny ([email protected]) received his Sc.D. degree in electrical engineering and computer science from the Massachusetts Institute of Technology. He has worked in the speech recognition area since 1981, joining the IBM Thomas J. Watson Research Center after finishing his doctorate degree. After retiring from IBM in 2019, he joined NYU Courant Computer Science and the Center for Data Science, New York, USA, as a part-time research professor. He has published numerous papers in both journals and conferences on almost all aspects of speech recognition. He was chair of the IEEE Speech Technical Committee from 2002 to 2004 and a member of the International Speech Communication Association (ISCA) Board from 2005 to 2014. His research focus is on speech recognition. He is a Fellow of IEEE and a fellow of the ISCA.
Bhuvana Ramabhadran ([email protected]) received her Ph.D. degree in electrical engineering from the University of Houston. She leads a team of researchers focusing on semisupervised learning for speech recognition and multilingual speech recognition at Google, New York, NY 10011 USA. Prior to joining Google, she was a distinguished research staff member and manager at IBM Research AI, IBM Thomas J. Watson Research Center, New York, where she led a team of researchers in the Speech Technologies Group and coordinated activities across IBM's worldwide laboratories in the areas of speech recognition, synthesis, and spoken-term detection. She was elected chair of the IEEE Speech and Language Processing Technical Committee from 2015 to 2016. Her research interests include speech recognition and synthesis algorithms, statistical modeling, signal processing, and machine learning. She is a Fellow of IEEE and a fellow of the International Speech Communication Association.
Dilek Hakkani-Tür ([email protected]) received her Ph.D. degree in computer engineering from Bilkent University. She is a senior principal scientist focusing on enabling natural dialogs with machines at Amazon Alexa AI, Sunnyvale, CA, USA. Prior to joining Amazon, she was a researcher at Google, Microsoft Research, the International Computer Science Institute at the University of California, Berkeley, and AT&T Labs-Research. She received best paper awards for publications she coauthored on conversational systems from the IEEE Signal Processing Society, International Speech Communication Association (ISCA), and European Association for Signal Processing. Recently, she served as editor-in-chief of IEEE Transactions on Audio, Speech, and Language Processing. Her research interests include conversational artificial intelligence, natural language and speech processing, spoken dialog systems, and machine learning for language processing. She is a Fellow of IEEE and a fellow of the ISCA.
Rohit Prasad ([email protected]) received his M.S. degree in electrical engineering from the Illinois Institute of Technology. He is a senior vice president at Amazon, Boston, USA, where he is head scientist for Amazon Alexa. He leads R&D in artificial intelligence technologies for enriching the
daily lives of everyone, everywhere. Prior to Amazon, he was the deputy manager and senior director of the Speech, Language and Multimedia Business Unit at Raytheon BBN Technologies. He is a named author on more than 100 scientific articles and holds several patents. He was listed at number nine in Fast Company's "100 Most Creative People in Business" in 2017 for leading the "voice-controlled revolution." In 2021, he was listed as one of the 50 most influential people in technology as part of "Future Tech Awards." His research interests include speech processing and dialog systems. He is a Senior Member of IEEE.
Heiga Zen ([email protected]) received his Ph.D. degree in computer science from the Nagoya Institute of Technology in 2006. He is a research scientist with the Google Brain team, Tokyo 150-0002, Japan. After receiving his Ph.D. degree, he joined Toshiba Cambridge Research Laboratory, U.K., in 2008. He was with the Text-to-Speech team at Google, London, between 2011 and 2018, then moved to the Brain team in Tokyo as one of its founding members. He is one of the early explorers in generative model-based speech synthesis, one of the original authors of the HMM/DNN-based Speech Synthesis System (HTS), and its first maintainer. He served as a member of the IEEE Speech and Language Processing Technical Committee between 2012 and 2014. He is a fellow of the International Speech Communication Association. He is a Senior Member of IEEE.
Jan Skoglund ([email protected]) received his Ph.D. degree in information theory from the School of Electrical and Computer Engineering of Chalmers University of Technology, Sweden, in 1998. He leads a team that develops speech and audio signal processing components for capture, real-time communication, storage, and rendering (deployed in products such as Meet and Chromebooks) at Google, San Francisco, CA, USA. After receiving his Ph.D. degree, he worked on low bit rate speech coding at AT&T Labs-Research, Florham Park, NJ. He was with Global IP Solutions (GIPS), San Francisco, from 2000 to 2011, working on speech and audio processing tailored for packet-switched networks. GIPS' audio and video technology is found in many deployments by, e.g., IBM, Google, Yahoo, WebEx, Skype, and Samsung, and was open sourced as WebRTC after a 2011 acquisition by Google. His research interests include speech and audio signal processing. He is a Senior Member of IEEE.
Jan "Honza" Černocký ([email protected]) received his Ph.D. degree in electronics from the University Paris XI. He is a full professor and head of the Department of Computer Graphics and Multimedia, Faculty of Information Technology (FIT), Brno University of Technology (BUT), Brno, Czech Republic. He also serves as managing director of the BUT Speech@FIT research group. He is responsible for signal and speech processing courses at FIT BUT. In 2006, he cofounded Phonexia. He was general chair of INTERSPEECH 2021 in Brno. His research interests include artificial intelligence, signal processing, and speech data mining (speech, speaker, and language recognition). He is a Senior Member of IEEE.
Lukáš Burget ([email protected]) received his Ph.D. degree in information technology from Brno University of Technology. He is an associate professor with the Faculty of Information Technology (FIT), Brno University of Technology (BUT), Brno, Czech Republic, and research director of the BUT Speech@FIT group. He was a visiting researcher with OGI Portland and SRI International. He was on numerous European Union- and U.S.-funded projects and currently leads the NEUREM3 project, which is supported by the Czech Science Foundation. In 2006, he cofounded Phonexia. His research interests are in the field of speech processing, concentrating on acoustic modeling for speech, speaker, and language recognition. He is a Member of IEEE. Abdelrahman Mohamed ([email protected]) received his Ph.D. in computer science from the University of Toronto. He is a research scientist with the Facebook Artificial Intelligence Research group at Meta, Seattle, WA, USA. Prior to joining Meta, he was a principal scientist/ manager at Amazon Alexa AI and a researcher at Microsoft Research. He was a part of the team that started the deep learning revolution in spoken language processing in 2009. He has more than 70 research journal and conference publications with more than 35,000 citations. He is the recipient of the 2016 IEEE Signal Processing Society Best Journal Paper Award. He currently serves as a member of the IEEE Speech and Language Processing Technical Committee. His research work spans speech recognition; representation learning using weakly, semi-, and self-supervised methods; language understanding; and modular deep learning. He is a Member of IEEE.
References
[1] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” 2020, arXiv:2006.04558. [2] L. Supplee, R. Cohn, J. Collura, and A. McCree, “MELP: The new federal standard at 2400 bps,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1997, vol. 2, pp. 1591–1594, doi: 10.1109/ICASSP.1997.596257. [3] M. Schroeder and B. Atal, “Code-excited linear prediction (CELP): High-quality speech at very low bit rates,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1985, pp. 937–940, doi: 10.1109/ICASSP.1985.1168147. [4] F. Jelinek, “Some of my best friends are linguists,” Lang. Resour. Eval., vol. 39, no. 1, pp. 25–34, Feb. 2005, doi: 10.1007/s10579-005-2693-4. [5] Y. Sagisaka, N. Kaiki, N. Iwahashi, and K. Mimura, “ATR o-talk speech synthesis system,” in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), 1992, pp. 483–486. [6] R. Donovan and P. C. Woodland, “Automatic speech synthesiser parameter estimation using HMMs,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1995, pp. 640–643, doi: 10.1109/ICASSP.1995.479679. [7] K. Tokuda, T. Kobayashi, and S. Imai, “Speech parameter generation from HMM using dynamic features,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1995, pp. 660–663, doi: 10.1109/ICASSP.1995.479684. [8] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, “Voice characteristics conversion for HMM-based speech synthesis system,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1997, pp. 1611–1614, doi: 10.1109/ ICASSP.1997.598807. [9] J. P. Campbell, “Speaker recognition: A tutorial,” Proc. IEEE, vol. 85, no. 9, pp. 1437–1462, Sep. 1997, doi: 10.1109/5.628714. [10] M. A. Zissman, “Comparison of four approaches to automatic language identification of telephone speech,” IEEE Trans. Speech Audio Process., vol. 4, no. 1, p. 31, Jan. 1996, doi: 10.1109/TSA.1996.481450. [11] P. C. Loizou, Speech Enhancement: Theory and Practice. Boca Raton, FL, USA: CRC Press, 2007. [12] G. Hu and D. Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” IEEE Trans. Neural Netw., vol. 15, no. 5, pp. 1135–1150, Sep. 2004, doi: 10.1109/TNN.2004.832812.
[13] G. Tur and R. De Mori, Spoken Language Understanding: Systems for Extracting Semantic Information From Speech. Hoboken, NJ, USA: Wiley, 2011.
[38] J. Sotelo, S. Mehri, K. Kumar, J. Santos, K. Kastner, A. Courville, and Y. Bengio, “Char2Wav: End-to-end speech synthesis,” in Proc. ICLR Workshop, 2017.
[14] W. Minker, S. Bennacef, and J.-L. Gauvain, “A stochastic case frame approach for natural language understanding,” in Proc. 4th Int. Conf. Spoken Lang. Process. (ICSLP), 1996, vol. 2, pp. 1013–1016, doi: 10.1109/ICSLP.1996.607775.
[39] J. Shen et al., “Natural TTS synthesis by conditioning wavenet on MEL spectrogram predictions,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2018, pp. 4779–4783, doi: 10.1109/ICASSP.2018.8461368.
[15] P. C. Constantinides, S. Hansma, C. Tchou, and A. I. Rudnicky, “A schema based approach to dialog control,” in Proc. 5th Int. Conf. Spoken Lang. Process. (ICSLP), 1998, Paper 0637, doi: 10.21437/icslp.1998-68.
[40] D. A. Reynolds, “Channel robust speaker verification via feature mapping,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2003, pp. II-53– II-56, doi: 10.1109/ICASSP.2003.1202292.
[16] E. Levin, R. Pieraccini, and W. Eckert, “A stochastic model of human-machine interaction for learning dialog strategies,” IEEE Trans. Speech Audio Process., vol. 8, no. 1, pp. 11–23, Jan. 2000, doi: 10.1109/89.817450.
[41] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, “A study of interspeaker variability in speaker verification,” IEEE Trans. Audio, Speech, Language Process., vol. 16, no. 5, pp. 980–988, Jul. 2008, doi: 10.1109/TASL. 2008.925147.
[17] S. H. K. Parthasarathi and N. Ström, “Lessons from building acoustic models with a million hours of speech,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2019, pp. 6670–6674, doi: 10.1109/ICASSP.2019.8683690. [18] H. A. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach. Norwell, MA, USA: Kluwer, 1993.
[42] N. Dehak, R. Dehak, P. Kenny, N. Brummer, P. Ouellet, and P. Dumouchel, “Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification,” in Proc. Interspeech, 2009, 1559–1562, doi: 10.21437/Interspeech.2009-385.
[19] H. Hermansky, D. P. Ellis, and S. Sharma, “Tandem connectionist feature extraction for conventional hmm systems,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2000, vol. 3, pp. 1635–1638, doi: 10.1109/ ICASSP.2000.862024.
[43] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2018, pp. 5329–5333, doi: 10.1109/ICASSP.2018.8461375.
[20] A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2013, pp. 6645–6649, doi: 10.1109/ICASSP.2013.6638947.
[44] S. Chen et al., “Speaker, environment and channel change detection and clustering via the Bayesian information criterion,” in Proc. DARPA Broadcast News Transcription Understanding Workshop, 1998, vol. 8, pp. 127–132.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017, vol. 30.
[45] M. Diez, L. Burget, and P. Matejka, “Speaker diarization based on Bayesian hmm with eigenvoice priors,” in Proc. Odyssey, 2018, pp. 147–154.
[22] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2016, pp. 4960–4964, doi: 10.1109/ICASSP.2016.7472621. [23] R. T. Lowe, N. Pow, I. Serban, L. Charlin, C.-W. Liu, and J. Pineau, “Training end-to-end dialogue systems with the ubuntu dialogue corpus,” Dialogue Discourse, vol. 8, no. 1, pp. 31–65, Jan. 2017, doi: 10.5087/dad.2017.102. [24] J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in Proc. 38th Int. Conf. Mach. Learn. (ICML), 2021, pp. 5530–5540. [25] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 27, 2014. [26] A. Mohamed et al., “Self-supervised speech representation learning: A review,” 2022, arXiv:2205.10643. [27] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2014, vol. 27, pp. 2672–2680. [28] M. Neuendorf et al., “The ISO/MPEG unified speech and audio coding standard – Consistent high quality for all content types and at all bit rates,” J. Audio Eng. Soc., vol. 61, no. 12, pp. 956–977, Dec. 2013. [29] M. Dietz et al., “Overview of the EVS codec architecture,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2015, pp. 5698–5702, doi: 10.1109/ICASSP.2015.7179063.
[46] H. Li, B. Ma, and K. A. Lee, “Spoken language recognition: From fundamentals to practice,” Proc. IEEE, vol. 101, no. 5, pp. 1136–1159, May 2013, doi: 10.1109/ JPROC.2012.2237151. [47] D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, D. Povey, and S. Khudanpur, “Spoken language recognition using x-vectors,” in Proc. Odyssey, 2018, pp. 105–111. [48] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ ACM Trans. Audio, Speech, Language Process., vol. 25, no. 4, pp. 692–730, Apr. 2017, doi: 10.1109/TASLP.2016.2647702. [49] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 10, pp. 1702–1726, Oct. 2018, doi: 10.1109/TASLP.2018.2842159. [50] Y.-M. Qian, C. Weng, X.-K. Chang, S. Wang, and D. Yu, “Past review, current progress, and challenges ahead on the cocktail party problem,” Frontiers Inf. Technol. Electron. Eng., vol. 19, no. 1, pp. 40–63, Jan. 2018, doi: 10.1631/ FITEE.1700814. [51] D. Michelsanti, Z.-H. Tan, S.-X. Zhang, Y. Xu, M. Yu, D. Yu, and J. Jensen, “An overview of deep-learning-based audio-visual speech enhancement and separation,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 29, pp. 1368– 1396, Mar. 2021, doi: 10.1109/TASLP.2021.3066303. [52] D. Seung and L. Lee, “Algorithms for non-negative matrix factorization,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2001, vol. 13, pp. 556–562.
[30] A. van den Oord et al., “WaveNet: A generative model for raw audio,” 2016, arXiv:1609.03499.
[53] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2016, pp. 31–35, doi: 10.1109/ ICASSP.2016.7471631.
[31] W. B. Kleijn, F. S. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, “Wavenet based low rate speech coding,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2018, pp. 676–680, doi: 10.1109/ ICASSP.2018.8462529.
[54] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2017, pp. 241–245, doi: 10.1109/ICASSP.2017.7952154.
[32] A. Polyak, Y. Adi, J. Copet, E. Kharitonov, K. Lakhotia, W.-N. Hsu, A. Mohamed, and E. Dupoux, “Speech resynthesis from discrete disentangled selfsupervised representations,” 2021, arXiv:2104.00355.
[55] S. Wisdom, E. Tzinis, H. Erdogan, R. Weiss, K. Wilson, and J. Hershey, “Unsupervised sound separation using mixture invariant training,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2020, vol. 33, pp. 3846–3857.
[33] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 30, pp. 495–507, 2022, doi: 10.1109/TASLP.2021.3129994.
[56] G. Mesnil et al., "Using recurrent neural networks for slot filling in spoken language understanding," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 23, no. 3, pp. 530–539, Mar. 2015, doi: 10.1109/TASLP.2014.2383614.
[34] G. Hinton et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012, doi: 10.1109/MSP.2012.2205597.
[57] Q. Chen, Z. Zhuo, and W. Wang, “BERT for joint intent classification and slot filling,” 2019, arXiv:1902.10909.
[35] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. 23rd Int. Conf. Mach. Learn. (ICML), 2006, pp. 369–376, doi: 10.1145/1143844.1143891.
[58] M. Namazifar, A. Papangelis, G. Tur, and D. Hakkani-Tür, “Language model is all you need: Natural language understanding as question answering,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2021, pp. 7803–7807, doi: 10.1109/ICASSP39728.2021.9413810.
[36] H. Zen, K. Tokuda, and A. Black, “Statistical parametric speech synthesis,” Speech Commun., vol. 51, no. 11, pp. 1039–1064, Nov. 2009, doi: 10.1016/j.specom. 2009.04.004.
[59] P. Gupta, C. Jiao, Y.-T. Yeh, S. Mehri, M. Eskenazi, and J. P. Bigham, “Improving zero and few-shot generalization in dialogue through instruction tuning,” 2022, arXiv:2205.12673.
[37] Z.-H. Ling, S. Kang, H. Zen, A. Senior, M. Schuster, X.-J. Qian, H. Meng, and L. Deng, “Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends,” IEEE Signal Process. Mag., vol. 32, pp. 35–52, May 2015, doi: 10.1109/MSP.2014.2359987.
[60] OpenAI, "GPT-4 technical report," 2023. [Online]. Available: https://arxiv.org/abs/2303.08774
SP
75TH ANNIVERSARY OF SIGNAL PROCESSING SOCIETY SPECIAL ISSUE
W. Clem Karl, James E. Fowler, Charles A. Bouman, Müjdat Çetin, Brendt Wohlberg, and Jong Chul Ye
The Foundations of Computational Imaging
A signal processing perspective
Twenty-five years ago, the field of computational imaging arguably did not exist, at least not as a standalone arena of research activity and technical development. Of course, the idea of using computation to form images had been around for several decades, largely thanks to the development of medical imaging—such as magnetic resonance imaging (MRI) and X-ray tomography—in the 1970s and synthetic-aperture radar (SAR) even earlier. Yet, a quarter of a century ago, such technologies would have been considered to be a subfocus of the wider field of image processing. This view started to change, however, in the late 1990s with a series of innovations that established computational imaging as a scientific and technical pursuit in its own right.
Introduction
In this article, we provide a signal processing perspective on the area of computational imaging, focusing on its emergence and evolution within the signal processing community. First, in the “Historical Perspective” section, we present an overview of the technical development of computational imaging wherein we trace the historical development of the field from its origins through to the present day. Then, in the “Signal Processing Society Involvement” section, we provide a brief overview of the involvement of the IEEE Signal Processing Society (SPS) in the field. Finally, in the “Conclusions” section, we make several concluding remarks.
Historical perspective
We begin our discourse by tracing the history of computational image formation. We start this historical perspective with its origins in physics-dependent imaging and then proceed to model-based imaging, including the impact of priors and sparsity. We next progress to recent data-driven and learning-based image formations, finally coming full circle back to how physics-based models are being merged with big data and machine learning for improved outcomes. Computational imaging can be defined in a number of ways, but for the purposes of the present article, we define
it as the creation of an image from measurements wherein computation plays an integral role in the image-formation process. In contrast to historical "standard" imaging in which optics play the central role in image formation, in computational imaging, it is computation, in the form of algorithms running on computers, that assumes the primary burden of producing images. Well-known examples would include X-ray-based tomography, SAR, and MRI. In such cases, the data produced by the sensing instrument are generally not images, and thus, require processing to produce the desired useful output in the form of an image. To fix ideas and notation, we denote the unknown but desired image by $x \in \mathbb{R}^N$, such that $x$ contains $N$ pixels. In our problems of interest, we cannot directly observe $x$, but instead observe a set of data $y \in \mathbb{R}^M$, which has been measured through a process connected to a sensor $\mathcal{C}$. This relationship can be represented mathematically by

$y = \mathcal{C}(x) + n$  (1)

where $n$ is a noise signal. The goal of computational imaging is then to estimate the image $x$ from knowledge of both the data $y$ as well as the imaging system or measurement process, $\mathcal{C}$, i.e., it naturally involves solving an inverse problem. Several examples of computational imaging are depicted in Figure 1, illustrating their sensing process and resulting images.
For many common imaging problems, $\mathcal{C}$ in (1) is (or can be well approximated by) a linear operator, i.e., a matrix, such that

$\mathcal{C}(x) = Cx, \quad C \in \mathbb{R}^{M \times N}$.  (2)

In this case, when $C$ is invertible and well conditioned, the inverse problem is mathematically straightforward, although even in this case, it is often computationally challenging due to the size of the problem. For example, an image of modest size—say, $1{,}024 \times 1{,}024$—corresponds to an $x$ with 1 million variables and a $C$ with $10^{12}$ elements. The inverse problem becomes even more difficult when $C$ is an ill-conditioned matrix or a nonlinear operator or produces an observation $y$ with fewer samples than the number of unknowns in $x$.
In the sequel, we provide a roughly chronological road map of computational imaging from a signal processing perspective. Four broad domains, as illustrated in Figure 2, will be discussed in turn: physics-driven imaging, model-based image reconstruction (MBIR), data-driven models and learning, and learning-based filters and algorithms. Our discussion concludes with current-day algorithmic methods that effectively join the right side of Figure 2 back up to the left, i.e., techniques that couple physical models with learned prior information through algorithms.

Physics-driven imaging: Explicit inversion
We first focus on physics-driven imaging, wherein images are created based on idealized imaging models. The idea here is to form images using the physical inversion operator with the key challenge being the design of efficient algorithms for the calculation of $\hat{x} = \mathcal{C}^{-1}(y)$ as an approximation to the desired image $x$ in (1). When data are plentiful and of high quality, and the system is designed to closely approximate certain simple classes of sensing operators, this direct-inversion approach can work reasonably well. Such image-formation methods often represent the first approach taken in the development of a new imaging modality and will often invoke very little (or very rudimentary) prior information, if at all. We discuss four example modalities in greater detail next.

Analog cameras
In a film camera, lenses are used to bend the light and focus it onto a film substrate, as depicted in Figure 3(a). Digital cameras work largely the same way by simply placing a digitizer at the film plane. In terms of (1)–(2) then, this example has an ideal linear model with $C = I$ so that the final image is really just the observation, assuming negligible noise. Traditionally, the primary path to improving the image quality was through improvements to the optical path itself, that is, through the use of better lenses that bring the physical sensing process closer to the ideal identity $C = I$.

X-ray tomography
X-ray tomography is used in applications such as nondestructive evaluation, security, and medical imaging and is, in essence, an image of the attenuation produced by an object as X-rays pass through it, as illustrated in Figure 3(b). The negative log ratio of the attenuated output to the input incident energy is taken as the observation, with a simplified physical model being

$y(L) = \int_L x(s)\, ds$  (3)

where $L$ is a given ray path. Mathematically, the observation (or projection), $y(L)$, is a line integral of the desired attenuation image $x(s)$ along the line $L$. The collection of all such projections for all lines $L$ (i.e., in every direction) defines what is called the Radon transform of the image $x$. The Radon transform is a linear operation and, assuming no noise, can be represented by (1), with $\mathcal{C}$ in this case being defined by the integral operator in (3). An explicit analytic inverse of the Radon transform (i.e., $\mathcal{C}^{-1}$) exists and forms the basis of the image-formation method used by commercial scanners. This inversion approach, called the filtered back-projection (FBP) algorithm, is very efficient in both computation and storage, requiring only simple filtering operations on the projections followed by an accumulation process wherein these filtered projections are summed back along their projection directions [4]; however, it assumes the existence of an infinite
(2) observation, with a simplified physical model being In this case, when C is invertible and well conditioned, the inverse problem is mathematically straightforward, although y (L) = # x (s) ds (3) L even in this case, it is often computationally challenging due to the size of the problem. For example, an image of modest where L is a given ray path. Mathematically, the observation size—say, 1, 024 # 1, 024-corresponds to an x with 1 mil(or projection), y (L), is a line integral of the desired attenulion variables and a C with 10 12 elements. The inverse problem ation image x (s) along the line L. The collection of all such projections for all lines L (i.e., in every direction) defines becomes even more difficult when C is an ill-conditioned mawhat is called the Radon transform of the image x. The Ratrix or a nonlinear operator or produces an observation y with don transform is a linear operation and, assuming no noise, x . fewer samples than the number of unknowns in can be represented by (1), with C in this case being defined In the sequel, we provide a roughly chronological road map of computational imaging from a signal processing perspecby the integral operator in (3). An explicit analytic inverse of tive. Four broad domains, as illustrated in Figure 2, will be the Radon transform (i.e., C -1) exists and forms the basis of discussed in turn: physics-driven imaging, model-based image the image-formation method used by commercial scanners. reconstruction (MBIR), data-driven models and learning, and This inversion approach, called the filtered back-projection learning-based filters and algorithms. Our discussion con(FBP) algorithm, is very efficient in both computation and cludes with current-day algorithmic methods that effectively storage, requiring only simple filtering operations on the projoin the right side of Figure 2 back up to the left, i.e., techniques jections followed by an accumulation process wherein these that couple physical models with learned prior information filtered projections are summed back along their projection through algorithms. directions [4]; however, it assumes the existence of an infinite IEEE SIGNAL PROCESSING MAGAZINE
|
July 2023
|
41
FIGURE 1. Examples of computational-imaging modalities. (a) X-ray tomography. (b) Ultrasound imaging. (c) MRI. (d) Seismic imaging. (e) Radar imaging. (f) Computational microscopy. (g) Light-field imaging. (h) Astronomical imaging. (i) Coded-aperture imaging. Sources: (a) Mart Production (left); MindwaysCT Software, CC BY-SA 3.0 (right). (b) Mart Production (left); Daniel Kondrashin (right). (c) Mart Production. (d) Adapted from [1] (right). (e) Wclxs11, CC BY-SA 3.0 (left); NASA/JPL (right). (f) Adapted with permission from [2]; ©The Optical Society. (g) Dcoetzee, CC0 (left); Doodybutch, CC BY-SA 4.0 (right). (h) Adapted from [3].
continuum of projections, which is not possible in practice. Nevertheless, excellent reconstructed images are possible if the sampling of $y(L)$ is sufficiently dense. Thus, higher quality images are obtained by making the X-ray machine better approximate the underlying assumptions of the Radon-transform inversion.
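As a toy illustration of this physics-driven pipeline, the following Python sketch simulates densely sampled projections of a phantom with the forward Radon transform and inverts them with FBP using scikit-image; the phantom, the angular sampling, and the filter_name argument (available in recent scikit-image versions) are illustrative assumptions.

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, rescale

# Simulate the acquisition: one projection y(L) per view angle (the sinogram).
image = rescale(shepp_logan_phantom(), 0.5)           # 200 x 200 phantom
angles = np.linspace(0.0, 180.0, 180, endpoint=False)  # dense angular sampling
sinogram = radon(image, theta=angles)

# Physics-driven explicit inversion: filtered back-projection.
reconstruction = iradon(sinogram, theta=angles, filter_name="ramp")

rmse = np.sqrt(np.mean((reconstruction - image) ** 2))
print(f"FBP RMSE with 180 views: {rmse:.4f}")
```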
FIGURE 2. An overview of the historical evolution of computational imaging. (Timeline, left to right: Physics-Driven Imaging, Model-Based Image Reconstruction, Data-Driven Models and Learning, Deep Learning-Based Algorithms.)

MRI
As depicted in Figure 3(c), MRI images the anatomy of the body through radio-frequency excitation and strong magnetic fields—unlike X-ray imaging, no ionizing radiation is used. Classical MRI acquisition is usually modeled as producing observations according to the equation

$y(f) = \int x(s)\, e^{-i 2\pi s f}\, ds$  (4)

where it can be seen that the observations $y(f)$ are values of the desired image in the Fourier domain. The basic MRI acquisition acquires samples of these Fourier values line by line in the Fourier space, called the k-space, and once sufficient Fourier samples are obtained, an image is produced by the application of an inverse Fourier transform. As with X-ray tomography, the image formation follows from an analytic formula for the inverse of the Fourier-based sensing operator such that improved imagery is obtained through the denser and more complete sampling of the k-space [5].
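The idealized Cartesian case can be sketched in a few lines of Python: with a completely sampled k-space, reconstruction reduces to an inverse FFT. The random stand-in image and the fftshift conventions below are illustrative assumptions.

```python
import numpy as np

# Idealized Cartesian MRI: the acquisition is a 2D Fourier transform of the
# image (fully sampled k-space), so reconstruction is just an inverse FFT.
rng = np.random.default_rng(0)
image = rng.random((256, 256))                          # stand-in for the true image x(s)

k_space = np.fft.fftshift(np.fft.fft2(image))           # acquisition: Fourier samples
recon = np.abs(np.fft.ifft2(np.fft.ifftshift(k_space))) # reconstruction: inverse FFT

print(np.allclose(recon, image))                        # True: exact for complete sampling
```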
FIGURE 3. Four common forms of physics-driven imaging with explicit inversions. (a) Analog camera. (b) X-ray tomography. (c) MRI. (d) SAR. CCD: charge-coupled device. Sources: (b) Image from Nevit Dilmen, CC BY-SA 3.0. (c) MRI image from the IXI Dataset (https://brain-development.org/ixi-dataset/), CC BY-SA 3.0.
Spotlight-mode SAR
Spotlight-mode SAR, as shown in Figure 3(d), is able to create high-resolution images of an area day or night and in all types of weather and is thus widely used in remote sensing applications. SAR works by transmitting microwave chirp pulses toward the ground and has a resolution that is largely independent of the distance to the region of interest. SAR finds use in mapping and monitoring vegetation and sea ice and in NASA planetary missions as well as in military applications, just to name a few. The SAR data-acquisition process can be modeled as

$y_\theta(t) = \int q_\theta(r)\, e^{-j\Omega(t) r}\, dr$  (5)

where $\Omega(t)$ is a time-dependent "chirped" frequency, and

$q_\theta(r) = \int_{L(\theta, r)} x(s)\, ds$  (6)

is the projection of the scattering field $x(s)$ along the line $L$ at angle $\theta$ and range $r$ such that (5) is a 1D Fourier transform of the projection along the range direction. Thus, the observations for SAR are again related to the values of the desired image $x(s)$ in the Fourier domain, similar to MRI. Combining (5) and (6), one can show that these observations are Fourier values of $x(s)$ on a polar grid; consequently, the standard image-formation algorithm for such SAR data is the polar-format algorithm [6], which resamples the acquired polar Fourier data onto a rectangular grid and then performs a standard inverse Fourier transform on the regridded data. As in our other examples, the image-formation process follows from an analytic formula for the inverse of the (Fourier-based) sensing operator, and improved imagery is obtained by extending the support region of the acquired Fourier samples, which are related to both the angular sector of SAR observation (related to the flight path of the sensor) as well as the bandwidth of the transmitted microwave chirp.
In all the examples that we have just discussed, image formation comprises analytic inversion following from the physics of the measurement operator [$\mathcal{C}$ in (1)]. These inversion approaches operate on the observed data, but the algorithms themselves are not dependent on the data; i.e., the structure of the algorithm, along with any parameters, is fixed at the time of the algorithm design based exclusively on the inversion of the measurement operator and is not learned from image data. These examples of the early period of computational imaging can be characterized in our taxonomy by (comparatively) low computation (e.g., Fourier transforms) and the presence of "small-data" algorithms (i.e., just the observations). When data are complete and of high quality and the system is designed to closely approximate the assumptions underlying the inversion operators, these approaches can work very well. Yet, the physical systems are constrained by the algorithmic assumptions on which the image formation is built, and if the quality or quantity of observed data is reduced, much of the corresponding imagery can exhibit confounding artifacts. An example would be standard medical tomography—since the system will create an image using the FBP, the system needs to be such that a full 180° of projections are obtained at a sufficient sampling density.

MBIR: The rise of computational imaging
Computational imaging really flourished during the next phase we consider. This phase has been called MBIR and, in contrast to the situation discussed in the "Physics-Driven Imaging: Explicit Inversion" section, is characterized by the use of explicit models of both sensing physics as well as image features. The major conceptual shift was that image formation was cast as the solution to an optimization problem rather than as the output of a physically derived inversion algorithm. Next, we present the specifics of this optimization problem, consider the role of prior models in its formulation, and explore several algorithms for its solution.

Image formation as optimization with explicit models
In general, in MBIR, the image-formation optimization problem can be taken to be of the form

$\hat{x} = \arg\min_x L(y, \mathcal{C}(x)) + R(x)$  (7)

where $\hat{x}$ is the resulting output image (an approximation to the true image); $L$ is a loss function that penalizes any discrepancy between the observed measurement $y$ and its prediction $\mathcal{C}(x)$; and $R$ is a regularization term that penalizes solutions that are unlikely according to prior knowledge of the solution space. This optimization is depicted schematically in Figure 4.

FIGURE 4. A general framework for MBIR.

Arguably, the simplest example of this formulation is Tikhonov regularization

$\hat{x} = \arg\min_x \|y - \mathcal{C}(x)\|_2^2 + \lambda \|\mathcal{D}(x)\|_2^2$  (8)
where $\lambda$ is a parameter controlling the amount of regularization, and $\mathcal{D}$ is a problem-specific operator [7], often chosen to be a derivative. This formulation also connects to probabilistic models of the situation. In particular, the first term in (8) can be associated with a log-likelihood $\log p(y \mid x, \mathcal{C})$ and the second term with a log prior $\log p(x)$ under Gaussian assumptions. With this association, (8) represents a maximum a posteriori (MAP) estimate of the image given a likelihood and a prior, i.e.,
$\hat{x} = \arg\max_x p(x \mid y, \mathcal{C}) = \arg\max_x \log p(y \mid x, \mathcal{C}) + \log p(x)$.  (9)

There are a number of major advantages gained from the conceptual shift represented by viewing image formation as the solution of (7). One advantage is that this view separates out the components of the formulation from the algorithm used to solve it; i.e., the overall problem is partitioned into independent modeling and optimization tasks. Indeed, there are many approaches that can be used to solve (7), allowing discussion of the algorithm to be decoupled from the debate about the problem formulation [although obviously some function choices in (7) correspond to easier problems, and thus, simpler algorithms].
Another advantage of this explicit focus on models is that one can consider a much richer set of sensing operators since we are no longer limited to operators possessing simple, explicit, and closed-form inverse formulations. For example, in X-ray tomography, the FBP algorithm is an appropriate inversion operator only when a complete (uniform and densely sampled) set of high-quality projection data is obtained—it is not an appropriate approach if data are obtained, for example, over only a limited set of angles. On the other hand, the formulation (7) is agnostic to such issues, requiring only that the $\mathcal{C}$ operator accurately captures the actual physics of acquisition. Thus, model-based inversion approaches have been successfully applied in situations involving novel, nonstandard, and challenging imaging configurations that could not have been considered previously. Furthermore, one can now consider the joint design of sensing systems along with inversion, as occurs in computational photography.
A third advantage of model-based image formation is that (7) can be used to explicitly account for noise or uncertainty in the data—the connection to statistical methods and MAP estimation as alluded to previously [i.e., (9)] makes this connection obvious. For example, rather than using a loss function corresponding to a quadratic penalty arising from Gaussian statistics (as is common), one can consider instead a log-likelihood associated with Poisson-counting statistics, which arises naturally in photon-based imaging. The use of such models can provide superior noise reduction in low-signal situations.
A final advantage of model-based inversion methods is the explicit use of prior-image-behavior information, as captured by the term $R(x)$. We focus on the impact of prior models next.

Focus on prior models and the emergence of sparsity
The growth of model-based methods led to a rich exploration of choices for the prior term $R(x)$ in (7). The simplest choice is perhaps a quadratic function of the unknown $x$, as illustrated in (8). Such quadratic functions can be viewed as corresponding to a Gaussian assumption on the statistics of $\mathcal{D}(x)$. While simple and frequently leading to efficient solutions of the corresponding optimization problem, the resulting image estimates can suffer from blurring and the loss of image structure as these types of priors correspond to aggressive global penalties on image or edge energy.
Such limitations led to a surge in the development and use in the late 1980s and 1990s of nonquadratic functions that share the property that they penalize large values less than the quadratic penalty does. Additionally, when applied to image derivatives, they promote edge formation. In Table 1, we present a number of the nonquadratic penalty functions that arose during this period, separated into those that are convex and those that are not. (The penalties tabulated in Table 1 are functions on scalars. To form $R(x)$, these scalar penalties could be applied element by element to $x = [x_1\ x_2\ \cdots\ x_N]^T$, e.g., $R(x) = \sum_i \varphi(x_i)$. Alternatively, referring to (8), they could likewise be applied to elements of $\mathcal{D}(x)$.) In general, convex functions result in easier optimization problems, while nonconvex functions possess more aggressive feature preservation at the expense of more challenging solution computation.

Table 1. A selection of nonquadratic prior penalties $\varphi(t)$.
Convex penalties: $|t|$ [8]; $\log\cosh t$ [11]; $\min\{t^2, 2|t| - 1\}$ [10], [13]; $|t|^p,\ p \ge 1$ [16].
Nonconvex penalties: $t^2/(1 + t^2)$ [9], [10]; $\min\{t^2, 1\}$ [12]; $\log(1 + t^2)$ [14]; $|t|/(1 + |t|)$ [15]; $|t|^p,\ 0 < p < 1$.

A key property that these nonquadratic penalties promoted was the sparsity of the corresponding quantity. In particular, when applied to $x$ itself, the resulting solution becomes sparse, or when applied to $\mathcal{D}(x)$, the quantity $\mathcal{D}(x)$ becomes sparse. A common example is to cast $\mathcal{D}$ as an approximation to the gradient operator such that the edge field then becomes sparse, resulting in piecewise constant (mainly flat) solutions [16]. Eventually, interest nucleated around this concept of sparsity and, in particular, on the use of the $\ell_0$ and $\ell_1$ norms as choices in defining $R(x)$ (the $\ell_1$ norm corresponds to the last row in Table 1) [17].
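To make the quadratic (Tikhonov) special case of (7)–(8) concrete, the following Python sketch solves a small 1D deblurring problem in closed form via the normal equations, with the regularization operator taken to be a first-difference matrix; the blur kernel, noise level, and regularization weight are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Forward model C: a small 1D blur (convolution) written as a banded matrix.
kernel = np.array([0.25, 0.5, 0.25])
C = sum(np.eye(n, k=k - 1) * w for k, w in enumerate(kernel))

# Regularization operator D: first differences, penalizing rough solutions.
D = np.eye(n) - np.eye(n, k=1)

x_true = np.cumsum(rng.standard_normal(n) > 1.5).astype(float)  # piecewise-constant signal
y = C @ x_true + 0.01 * rng.standard_normal(n)                  # noisy observation

# Tikhonov / quadratic-prior solution of (8): (C^T C + lam * D^T D) x = C^T y
lam = 0.5
x_hat = np.linalg.solve(C.T @ C + lam * D.T @ D, C.T @ y)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```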
A common example is to cast C as an approximation Thus, model-based inversion approaches have been successto the gradient operator such that the edge field then becomes fully applied in situations involving novel, nonstandard, and sparse, resulting in piecewise constant (mainly flat) solutions challenging imaging configurations that could not have been [16]. Eventually, interest nucleated around this concept of considered previously. Furthermore, one can now consider the sparsity and, in particular, on the use of the , 0 and , 1 norms joint design of sensing systems along with inversion, as occurs as choices in defining R (x) (the , 1 norm corresponds to the in computational photography. last row in Table 1) [17]. A third advantage of model-based image formation is that (7) can be used to explicitly account for noise or uncertainty in the data—the connection to statistical methods and MAP Table 1. A selection of nonquadratic prior penalties {(t). estimation as alluded to previously [i.e., (9)] makes this Convex Penalties Nonconvex Penalties connection obvious. For example, rather than using a loss t2 function corresponding to a quadratic penalty arising from t [8] [9], [10] 1 + t2 Gaussian statistics (as is common), one can consider instead min " t 2, 1 , [12] log cosh t [11] a log-likelihood associated with Poisson-counting statistics, 2 log ^ 1 + t 2 h [14] min # t , 2 t - 1 - [10], [13] which arises naturally in photon-based imaging. The use of such models can provide superior noise reduction in lowt [15] signal situations. 1+ t p p A final advantage of model-based inversion methods t , p $ 1 [16] t , 0 1 p 11 is the explicit use of prior-image-behavior information, as IEEE SIGNAL PROCESSING MAGAZINE
|
July 2023
|
45
One of the more visible applications of sparsity in model-based reconstruction is compressed sensing (CS) [18], [19]. In brief, under certain conditions, CS permits the recovery of signals from their linear projections into a much lower dimensional space. That is, we recover $x$ from $y = Cx$, where $x$ has length $N$, $y$ has length $M$, and $C$ is an $M \times N$ measurement matrix with the subsampling rate (or subrate) being $S = M/N$ with $M \ll N$. Because the number of unknowns is much larger than the number of observations, recovering every $x \in \mathbb{R}^N$ from its corresponding $y \in \mathbb{R}^M$ is impossible in general. The foundation of CS, however, is that, if $x$ is known to be sufficiently sparse in some domain, then exact recovery of $x$ is possible. Such sparsity can be with respect to some transform $T$ such that, when the transform is applied to $x$, only $K < M \ll N$ coefficients in the set of transform coefficients $X = Tx$ are nonzero. Relating this situation back to (8), we can formulate the CS recovery problem as
$\hat{X} = \arg\min_X \|y - C T^{-1} X\|_2^2 + \lambda \|X\|_p$  (10)
with the final reconstruction being $\hat{x} = T^{-1}\hat{X}$. Ideally, for a $T$-domain sparse solution, we set $p = 0$ in (10), invoking the $\ell_0$ pseudonorm, which counts nonzero entries. Since this choice results in an NP-hard optimization, however, it is common to use $p = 1$, thereby applying a convex relaxation of the $\ell_0$ problem. Interestingly, if the underlying solution is sufficiently sparse, it can be shown that the two formulations yield the same final result [18], [19], [20]. Additionally, for exact recovery, it is sufficient that the image transform $T$ and the measurement matrix $C$ be "mutually incoherent" in the sense that $C$ cannot sparsely represent the columns of the transform matrix $T$. Accordingly, an extensive number of image transforms $T$ have been explored; additionally, large-scale images can be reconstructed by applying the formulation in a block-by-block fashion (e.g., [21]). CS has garnered a great deal of interest in the computational-imaging community in particular due to the demonstration of devices (e.g., [22]) that conduct the compressed signal-sensing process $y = Cx$ entirely within optics, thereby acquiring the signal and reducing its dimensionality simultaneously with little to no computation.

Optimization algorithms
One of the challenges with MBIR-based approaches is that the resulting optimization problems represented by (7) must be solved. Fortunately, solutions of such optimization problems have been well studied, and a variety of methods exist, including the projected gradient-descent algorithm, the Chambolle-Pock primal-dual algorithm, and the alternating direction method of multipliers (ADMM) algorithm as well as others. These methods are, in general, iterative algorithms composed of a sequence of steps that are repeated until a stopping condition is reached. Of particular interest is the ADMM algorithm as well as other similar methods exploiting proximal operators [23], [24], [25]. Such methods split the original problem into a series of pieces by way of associated proximal operators. Specifically, the ADMM algorithm for solving (7) recasts (7) with an additional variable $z$ and an equality constraint

$\hat{x} = \arg\min_{x, z} L(y, \mathcal{C}(z)) + R(x)$, such that $x = z$  (11)

which is solved via the iterations

$x^{(k+1)} \leftarrow \arg\min_x \frac{t}{2}\|x - z^{(k)} + u^{(k)}\|_2^2 + R(x) = \operatorname{prox}_{R,\, t/2}\big(z^{(k)} - u^{(k)}\big)$  (12)

$z^{(k+1)} \leftarrow \arg\min_z \frac{1}{t} L(y, \mathcal{C}(z)) + \frac{1}{2}\|x^{(k+1)} - z + u^{(k)}\|_2^2$  (13)

$u^{(k+1)} \leftarrow u^{(k)} + x^{(k+1)} - z^{(k+1)}$  (14)

where $u$ is the scaled dual variable, and $t$ is a penalty parameter. We have indicated previously that (12) is a proximal operator, effectively performing smoothing or denoising to its argument. We will return to this insight later as we consider the incorporation of learned information into computational imaging. Note that the ADMM algorithm comprises an image-smoothing step (12), a data- or observation-integration step (13), and a simple reconciliation step (14).
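The following Python sketch mirrors the splitting in (11)–(14) for the particular choices R(x) = λ‖x‖₁ and L(y, C(z)) = ½‖y − Cz‖₂²: the x-update (12) becomes soft thresholding (the proximal operator of the ℓ1 norm) and the z-update (13) a linear solve. The problem sizes and parameters are illustrative assumptions, a sketch rather than a definitive implementation.

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1, used for the x-update in (12)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(2)
m, n = 80, 200
C = rng.standard_normal((m, n)) / np.sqrt(m)        # measurement operator
x_true = np.zeros(n)
x_true[rng.choice(n, 10, replace=False)] = rng.standard_normal(10)
y = C @ x_true + 0.01 * rng.standard_normal(m)

lam, t = 0.05, 1.0                                   # l1 weight and ADMM penalty
x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
A = C.T @ C / t + np.eye(n)                          # system matrix for the z-update

for _ in range(200):
    x = soft_threshold(z - u, lam / t)               # (12): prox of R (denoising step)
    z = np.linalg.solve(A, C.T @ y / t + x + u)      # (13): data-integration step
    u = u + x - z                                    # (14): dual / reconciliation step

print("nonzeros in estimate:", np.count_nonzero(np.round(x, 2)))
```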
In the model-based approach to image reconstruction discussed in this section, image formation is accomplished through the solution of an optimization problem, and underlying models of acquisition and image are made explicit. The prior-image models can serve to stabilize situations with poor data, and conversely, the observed data can compensate for overly simplistic prior-image models. These MBIR methods, including the use of nonquadratic models and the development of CS, have had a profound impact on the computational-imaging field. They have allowed the coupled design of sensing systems and inversion methods wherein sensor design can be integrated with algorithm development in ways not previously possible. The impact has been felt in fields as disparate as SAR, computed tomography, MRI, microscopy, and astronomical science. These methods are characterized in our taxonomy by relatively high computation (resulting from the need to solve relatively large optimization problems iteratively) and the use of "small-data" algorithms (again, just the observations).

Data-driven models and learning: Dictionaries
The next phase we consider is that in which data start to take on an important role in modeling, with approaches characterized by the increasing use of models derived from data rather than physics or statistics. This increased focus on data in modeling rather than on analytic models provides the
These methods are, in general, iterative algorithms terized by the increasing use of models derived from data composed of a sequence of steps that are repeated until a rather than physics or statistics. This increased focus on data stopping condition is reached. Of particular interest is the in modeling rather than on analytic models provides the 46
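To make the structure of (12)-(14) concrete, the following is a minimal NumPy sketch (not from the article) of the scaled-form ADMM iteration for a linear forward operator C and a squared-error loss. The function prior_prox stands for the smoothing step (12); in the plug-and-play approach discussed later, this proximal map is simply swapped for an off-the-shelf denoiser.

```python
import numpy as np

def admm_mbir(y, C, prior_prox, rho=1.0, n_iter=50):
    """Scaled-form ADMM for  min_x 0.5*||y - C x||^2 + R(x).

    prior_prox(v, rho) should return argmin_x R(x) + (rho/2)*||x - v||^2,
    i.e., the image-smoothing/denoising step (12).
    """
    n = C.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    # Normal equations for the data-integration step (13):
    #   z = argmin_z 0.5*||y - C z||^2 + (rho/2)*||x - z + u||^2
    A = C.T @ C + rho * np.eye(n)
    for _ in range(n_iter):
        x = prior_prox(z - u, rho)                        # (12) smoothing/denoising
        z = np.linalg.solve(A, C.T @ y + rho * (x + u))   # (13) data integration
        u = u + x - z                                     # (14) reconciliation
    return x

# Illustrative prior: R(x) = lam*||x||_1, whose proximal map is soft-thresholding.
def soft_threshold(v, rho, lam=0.1):
    return np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
```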
Data-driven models and learning: Dictionaries
The next phase we consider is that in which data start to take on an important role in modeling, with approaches characterized by the increasing use of models derived from data rather than physics or statistics. This increased focus on data in modeling, rather than on analytic models, provides the opportunity for an increase in explanatory richness. Perhaps the simplest data-driven extension of the MBIR paradigm can be found in the development of dictionary learning [17], [26], [27], [28]. The idea behind dictionary learning is that a noisy version of a signal x can be approximated by a sparse linear combination of a few elements of an overcomplete dictionary D. This problem can be cast as an MBIR-type formulation as
\hat{\alpha} = \arg\min_{\alpha} \; \| x - D\alpha \|_2^2 + \lambda \| \alpha \|_0 \qquad (15)
wherein the dictionary D has, in general, many more columns than rows, and thus, enforcing sparsity selects the most important columns. The final estimate is obtained as x̂ = Dα̂. Data-based learning is introduced into this framework by employing a large set of training samples {x_1, x_2, ..., x_N} to learn the dictionary D. This dictionary-learning process can be cast conceptually as another MBIR-style sparsely constrained optimization
\hat{D}, \{\hat{\alpha}_i\} = \arg\min_{D, \{\alpha_i\}} \; \sum_{i=1}^{N} \| x_i - D\alpha_i \|_2^2 + \lambda \| \alpha_i \|_0 \quad \text{s.t.} \quad \| [D]_j \|_2 \le 1 \qquad (16)

where ‖[D]_j‖_2 denotes the norm of column j of D, this norm constraint being necessary to avoid unbounded solutions for D. Ultimately, this formulation uses training data as well as a sparsity constraint to learn an appropriate representational framework for a signal class, and the resulting dictionary can then be used as a model for such signals in subsequent reconstruction problems. We note that while (16) conveys the aim of dictionary learning, in practice, a variety of different formulations are used.
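For concreteness, the following is a toy NumPy sketch of the alternating structure behind (16); the greedy coder and the simple least-squares dictionary update are stand-ins for the K-SVD and online methods actually used in practice [17], [27], and all sizes are illustrative.

```python
import numpy as np

def greedy_code(D, x, k):
    """Pick k atoms of D to approximate x: a crude stand-in for the
    l0-constrained sparse-coding step of (16)."""
    r, idx = x.copy(), []
    coef = np.zeros(0)
    for _ in range(k):
        idx.append(int(np.argmax(np.abs(D.T @ r))))
        coef, *_ = np.linalg.lstsq(D[:, idx], x, rcond=None)
        r = x - D[:, idx] @ coef
    a = np.zeros(D.shape[1])
    a[idx] = coef
    return a

def learn_dictionary(X, n_atoms=64, k=5, n_iter=10, seed=0):
    """Alternate between sparse coding of the columns of X and a least-squares
    dictionary refit, renormalizing columns so that ||[D]_j||_2 <= 1."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        A = np.stack([greedy_code(D, x, k) for x in X.T], axis=1)  # codes
        D = X @ np.linalg.pinv(A)                                  # dictionary update
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)          # column constraint
    return D
```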
Learning-based filters and algorithms: The appearance of deep learning
Renewed interest in the application of neural networks (NNs) to challenging estimation and classification problems occurred
when AlexNet won the ImageNet Challenge in 2012. The availability of large datasets combined with deep convolutional NNs (CNNs) as well as advanced computing hardware such as graphics processing units (GPUs) has enabled a renaissance of NN-based methods with outstanding performance and has created a focus on data-driven methods. Beyond classification, deep CNNs have achieved state-of-the-art performance in computer-vision tasks ranging from image denoising to image superresolution. Naturally, these deep learning models have made their way into the world of computational imaging in a variety of ways, several of which we survey next.
Estimate post-processing
Perhaps the simplest way of folding deep learning into computational-imaging problems is to apply a deep learning-based image enhancement as a post-processing step after a reconstructed image has been formed. In doing so, one uses an existing inversion scheme—such as FBP for X-ray tomography or inverse Fourier transformation for MRI—to create an initial reconstructed image. A deep network is then trained to bring that initial estimate closer to a desired one by removing artifacts or noise. The enhancement can be applied directly to the formed image or, as has become more common recently, the enhancement network can be trained on residual images between initial estimates and high-quality targets. Approaches in this vein are perhaps the most straightforward way to include deep learning in image formation and were thus some of the first methods developed. Example application domains include X-ray tomography [29], [30] with subsampled and low-dose data as well as MRI [29], [31] with subsampled Fourier data. Figure 5 illustrates this learning-driven post-processing approach for subsampled MRI along with a physics-driven explicit inversion as well as an MBIR-based reconstruction.
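As an illustration of this idea (and not the specific architectures of [29], [30], [31]), a small residual post-processing network in PyTorch might look as follows; the depth, width, and mean-squared-error loss are placeholder choices.

```python
import torch
import torch.nn as nn

class ResidualEnhancer(nn.Module):
    """Small DnCNN-style post-processor: given an artifact-laden initial
    reconstruction (e.g., a zero-filled inverse FFT for subsampled MRI),
    predict the artifact image and subtract it (residual learning)."""
    def __init__(self, channels=1, features=64, depth=5):
        super().__init__()
        layers = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(features, features, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(features, channels, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x_init):
        return x_init - self.body(x_init)

# Training sketch on (initial reconstruction, high-quality target) pairs:
# model = ResidualEnhancer()
# opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = nn.functional.mse_loss(model(x_init), x_target); loss.backward(); opt.step()
```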
FIGURE 5. Subsampled MRI reconstruction and performance in terms of signal-to-noise ratio (SNR). (a) Inverse Fourier transform of complete data. (b) Reconstruction of sixfold subsampled data via the inverse Fourier transform as described in the "MRI" section; SNR = 11.82 dB. (c) Reconstruction of sixfold subsampled data via a convex optimization regularized with a total-variation criterion, i.e., an MBIR reconstruction in the form of (7); SNR = 15.05 dB. (d) Reconstruction via the inverse Fourier transform of the subsampled data followed by post-processing enhancement with a CNN as described in the "Estimate Post-processing" section; SNR = 17.06 dB. (Source: Adapted from [29], ©2017 IEEE.)

Data preprocessing
Another use of deep learning in computational imaging is as a preprocessing step. In these methods, learning is used to "correct" imperfect, noisy, or incomplete data. Accordingly,
data can be made to more closely match the assumptions that underlie traditional physics-based image-formation methods, which can then be used more effectively. Consequently, such an approach can leverage existing conventional workflows and dedicated hardware. One example can be found in X-ray tomography [32] wherein projection samples in the sinogram domain that are missed due to the sparse angular sampling used to reduce dose are estimated via a deep network; afterward, these corrected data are used in a conventional FBP algorithm.
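A small scikit-image sketch of this preprocessing pattern is given below, with simple interpolation along the view angle standing in for the learned sinogram-completion network of [32]; the phantom, angle counts, and interpolation are illustrative only.

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon

image = shepp_logan_phantom()
full_angles = np.linspace(0.0, 180.0, 180, endpoint=False)
sparse_angles = full_angles[::4]                     # simulated sparse-view (low-dose) scan

sparse_sino = radon(image, theta=sparse_angles)      # measured projections only

# Stand-in for a learned completion network: fill the missing view angles by
# 1D interpolation along the angle axis of the sinogram.
completed = np.empty((sparse_sino.shape[0], full_angles.size))
for r in range(sparse_sino.shape[0]):
    completed[r] = np.interp(full_angles, sparse_angles, sparse_sino[r])

recon_sparse = iradon(sparse_sino, theta=sparse_angles)   # conventional FBP, sparse views
recon_filled = iradon(completed, theta=full_angles)       # FBP after data completion
```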
Learned explicit priors
Yet another use of deep learning in computational imaging has been to develop explicit data-derived learned priors that can be used in an MBIR framework. An example of this approach can be found in [33], wherein a K-sparse patch-based autoencoder is learned from training images and then used as a prior in an MBIR reconstruction method. An autoencoder (depicted in Figure 6) is a type of NN used to learn efficient reduced-dimensional codings (or representations) of information and is composed of an encoder E and a decoder D such that D(E(x)) ≈ x, with E(x) being of much lower dimension than x. The idea is that one preserves the "essence" of x in creating E(x), which can be considered a model for x, and that, ideally, D(E(x)) removes only useless artifacts or noise. A sparsity-regularized autoencoder can be obtained from a set of training data {x_k} as the solution of the optimization

\hat{E}, \hat{D} = \arg\min_{E, D} \; \sum_k \| x_k - D(E(x_k)) \|_2^2 \quad \text{s.t.} \quad \| E(x_k) \|_0 \le K \qquad (17)
where E and D are both NNs whose parameters are learned via solving (17). Once E and D of the autoencoder are found, they can be used as a prior to reconstruct an image using an MBIR formulation
\hat{x} = \arg\min_{x} \; \| y - C(x) \|_2^2 + \lambda \| x - D(E(x)) \|_2^2 \quad \text{s.t.} \quad \| E(x) \|_0 \le K \qquad (18)
Such an approach was adopted, for example, in [33], which applied (18) using a formulation based on image patches, solving the resulting optimization via an alternating minimization.
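A minimal PyTorch sketch of using an autoencoder as a prior in the spirit of (18) is shown below. Note that [33] works on image patches with a hard K-sparsity constraint and an alternating minimization; this sketch instead drops the constraint and runs plain gradient descent, the encoder and decoder are stand-ins for a pretrained pair, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

# Placeholders for a pretrained patch autoencoder (in practice trained as in (17)).
E = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 8))
D = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 64))

def reconstruct(y, C, lam=0.1, steps=200, lr=1e-2):
    """Gradient-descent sketch of the relaxed objective
        min_x ||y - C x||^2 + lam * ||x - D(E(x))||^2 ,
    for a signal of dimension 64 and a linear forward operator C."""
    x = torch.zeros(C.shape[1], requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((y - C @ x) ** 2).sum() + lam * ((x - D(E(x))) ** 2).sum()
        loss.backward()
        opt.step()
    return x.detach()
```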
FIGURE 6. A generic autoencoder structure.
Learned inverse mappings
Deep learning can also be used to train a network that directly implements the inverse mapping from the data y to the estimate x̂ (e.g., [34]). That is, we learn a function F such that x̂ = F(y). This approach was taken in [35] with a method termed AUTOMAP. Four cases motivated by tomography and MRI were considered: tomographic inversion, spiral k-space MRI data, undersampled k-space MRI data, and misaligned k-space MRI data. The AUTOMAP framework used a general feed-forward deep NN architecture composed of fully connected layers followed by a sparse convolutional autoencoder. The network for each application was learned from a set of corresponding training data without the inclusion of any physical or expert knowledge. Taking k-space MRI as an example, although we know that the inverse Fourier transform produces the desired image, the AUTOMAP network has to discover this fact directly from the training data alone. While the results are intriguing, the approach suffers from the large number of training parameters required by the multiple fully connected layers, which seems to limit its application to relatively small problems.
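A toy direct-inversion network, loosely inspired by the AUTOMAP structure of fully connected layers followed by a convolutional stage, might be sketched in PyTorch as follows; all sizes are illustrative, and the dense layers dominate the parameter count, which is exactly the limitation noted above.

```python
import torch
import torch.nn as nn

class DirectInverseNet(nn.Module):
    """Map a measurement vector y directly to an image estimate x_hat = F(y)."""
    def __init__(self, n_meas, img_size=64):
        super().__init__()
        n_pix = img_size * img_size
        self.img_size = img_size
        self.dense = nn.Sequential(
            nn.Linear(n_meas, n_pix), nn.Tanh(),
            nn.Linear(n_pix, n_pix), nn.Tanh(),
        )
        self.refine = nn.Sequential(     # crude stand-in for the convolutional autoencoder
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, y):
        x = self.dense(y).view(-1, 1, self.img_size, self.img_size)
        return self.refine(x)
```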
Deep image priors
The final approach that we consider here is the deep image prior (DIP) [36], which replaces the explicit regularizer R(x) in the MBIR optimization (7) with an implicit prior in the form of an NN. Specifically, as in the autoencoders discussed previously, the DIP NN is a generator or decoder, D(·), that maps a reduced-dimensional "code" vector w to a reconstructed image, i.e., x̂ = D(w). The DIP formulation solves for the decoder so as to minimize the loss with respect to the observation y, i.e.,

\hat{D} = \arg\min_{D} \; L(y, D(w)) \qquad (19)
For example, one might use a loss in the form of L(y, D(w)) = ‖y − C(D(w))‖_2^2, as in (8). While the optimization might also include w, usually the code w is chosen at random and kept fixed; additionally, the initial network D is also typically chosen at random. In essence, (19) imposes an implicit regularization such that R(x) = 0 for images x that can be produced by the deep NN D and R(x) = ∞ otherwise. DIP formulations have been applied to a number of image-reconstruction tasks, including denoising, superresolution, and inpainting [36], as well as in applications such as MRI, computed tomography, and radar imaging, along with other computational-imaging modalities illustrated in Figure 1. The DIP approach has the advantage that the decoder D is learned during the DIP optimization (19) as applied to the specific observed y in question; that is, no large body of training data is required. On the other hand, the number of iterations conducted while solving (19) must be carefully controlled so as to avoid the overfitting of x̂ = D(w) to y.
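A bare-bones DIP loop in PyTorch, assuming a differentiable forward operator forward_op, could look as follows; the decoder architecture and iteration budget are placeholders, and in practice the loop is stopped early to avoid the overfitting mentioned above.

```python
import torch
import torch.nn as nn

def dip_reconstruct(y, forward_op, img_shape, n_iter=2000, lr=1e-3):
    """Fit a small convolutional decoder D, with a fixed random code w, so that
    the forward model of D(w) matches the observation y, as in (19)."""
    D = nn.Sequential(
        nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 3, padding=1),
    )
    w = torch.randn(1, 8, *img_shape)            # random code, kept fixed
    opt = torch.optim.Adam(D.parameters(), lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        loss = ((y - forward_op(D(w))) ** 2).sum()   # L(y, C(D(w))) as in (8)
        loss.backward()
        opt.step()
    return D(w).detach()
```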
Folding learning into algorithms: Coming full circle
In the "Physics-Driven Imaging: Explicit Inversion" section, we noted that the early stages of imaging were dominated by the drive for physics-derived algorithms that achieve the inverse of the physical sensing operator, which is then applied to the observed data. This emphasis on algorithmic inversion was subsequently replaced with the optimization framework of MBIR, allowing more complex sensing configurations and the inclusion of prior information. In the service of increasing the role of learning in computational imaging, researchers have recently shifted their focus from optimization back to algorithms, but this time using learned—rather than analytically derived—elements. We now discuss some examples of recent developments in this vein, with the aim of presenting representative examples of these methods rather than an exhaustive catalog.

Perhaps the most straightforward example of this paradigm is to start with a physically derived inverse algorithm and then replace parts of the original algorithm with alternative learned elements. For example, in [37] and [38], the physics-appropriate FBP algorithm of tomography is mapped into an NN architecture wherein the back-projection stage is held fixed while the filter (i.e., the combining weights) is instead learned from data to minimize a loss function.

A particularly impactful and popular approach has been plug-and-play priors (PPPs) [39], [40], which was motivated by advances in image denoising within the image-processing community. Effectively, PPP incorporated into the MBIR image-formation framework the power of advanced denoisers—such as those based on nonlocal means [41] or block-matching and 3D filtering (BM3D) [42]—even though these denoisers are not simple solutions of an underlying optimization problem. Specifically, the PPP framework originated from the ADMM approach [(12), (13), and (14)] for solving the MBIR problem (7). ADMM entails iteration over two main subproblems, with one subproblem (13) involving the observation and sensor loss function L(y, C(x)), while the other (12) involves a proximal operator of the prior (or regularization) term R(x), which has the form of a MAP denoiser. This insight led to the replacement of the prior proximal operator derived from R(x) in (12) with alternative state-of-the-art denoisers, even though these denoisers might have no corresponding explicit function R(x). The result was an extremely flexible framework leading to state-of-the-art performance in a very diverse range of applications.

The PPP framework allows one to replace explicit image priors specified by the function R(x) with powerful image-enhancement mappings, including those potentially learned from training data (e.g., [43]). Notable developments within this framework or inspired by it include general proximal algorithms, regularization by denoising (RED), and multiagent consensus equilibrium (MACE). General proximal algorithms arose when it was demonstrated that, while the original development of PPP focused on the ADMM algorithm, the same approach can be applied to any proximal algorithm [24], including, in particular, the accelerated proximal gradient method or the fast iterative shrinkage/thresholding algorithm (ISTA) [44], which are far more suitable for problems with a nonlinear forward operator and which have allowed the development of online variants of PPP [45], i.e., those that use only a subset of the observations y, rather than the full set, at each iteration. Additionally, RED was proposed as an alternative formulation for exploiting a denoiser in a way that does have an explicit cost function [46]. While this formulation is tenable in only some special cases [47], it has proven to be a popular alternative to PPP, with a growing range of applications and extensions.

Finally, MACE [48] extended the original PPP framework in two distinct ways: extending the number of terms (or "agents") that could be addressed via a formulation that is closely related to ADMM consensus [23, Ch. 7] and introducing a more theoretically sound interpretation of the approach based on fixed-point algorithms rather than as optimization with unknown or nonexistent regularizers. The general nature of MACE has also been used to include learned models on both the observation side of inverse problems as well as in the prior-image information [49], allowing balanced inclusion of both types of constraints—Figure 7 shows an example.

Additionally, interest has grown in a set of techniques that can be collected under the labels "algorithm unrolling," "deep unrolling," or "unfolding" [50], as depicted in Figure 8. The idea of these methods is to view a fixed number of iterations of an algorithm as a set of layers (or elements) in a network. The resulting network then performs the steps of the algorithm, while the parameters or steps of the original algorithm are collectively learned from training data or replaced by learned alternatives. One stated benefit of such an approach is that the resulting overall networks, obtained from underlying optimization algorithms, can be interpreted in ways that typical black-box deep networks with many parameters cannot.

Though the number of works in this spirit is now so large that it is impossible to touch on them all, we briefly mention a few final examples. An early instance was the work in [51], which was based on the ISTA for sparse coding and may be the first to refer to algorithm "unrolling." In [51], a small fixed number of ISTA iterations are used for the network structure, and the associated linear weights are learned. In another example, [52] starts with a projected gradient-descent method to minimize a regularized optimization problem similar to (8), and then the projector onto the space of regularization constraints in the algorithm is replaced by a learned CNN. In a similar vein, the authors of [28] instead start with a solution to (7) provided by 10 iterations of the Chambolle-Pock primal-dual algorithm, and then the functions of the primal and dual steps are replaced by learned CNNs. Another example is found in [53], which unrolls a half-quadratic splitting algorithm for the solution of a blind-deconvolution problem to obtain an associated network with the underlying parameters then being learned.
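The following PyTorch sketch conveys the unrolling idea in the spirit of the ISTA-based network of [51]: a fixed number of soft-thresholding layers whose linear weights and thresholds are learned from training pairs; the layer count and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnrolledISTA(nn.Module):
    """A few ISTA-style iterations viewed as layers of a network, with the
    per-layer linear operators and thresholds treated as learnable parameters."""
    def __init__(self, n_meas, n_coef, n_layers=5):
        super().__init__()
        self.W_y = nn.ModuleList([nn.Linear(n_meas, n_coef, bias=False) for _ in range(n_layers)])
        self.W_x = nn.ModuleList([nn.Linear(n_coef, n_coef, bias=False) for _ in range(n_layers)])
        self.theta = nn.Parameter(torch.full((n_layers,), 0.1))

    def forward(self, y):
        x = torch.zeros(y.shape[0], self.W_x[0].in_features, device=y.device)
        for k in range(len(self.W_x)):
            v = self.W_y[k](y) + self.W_x[k](x)                             # gradient-like update
            x = torch.sign(v) * torch.relu(torch.abs(v) - self.theta[k])    # soft threshold
        return x
```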
We note that, while many more examples are discussed in [50], they all largely retain physics-derived observation models while incorporating prior information implicitly through learning or training processes for algorithmic parameters.

While a variety of NN models have been used in this context, there has been significant recent interest in generative models, including various forms of generative adversarial networks [54], normalizing flows [55], and score-based diffusion models [56], [57]. The stochastic nature of these methods also makes them attractive for uncertainty characterization in computational imaging, another alternative for which is to use Bayesian NNs [58]. Finally, the challenges associated with access to the ground truth required for supervised learning have led to the development of self-supervised or unsupervised methods for learning-based computational imaging [59], [60], an important current direction.

The rise of learning-based methods discussed in this section represents a great increase in the richness of the prior information that can be included in computational-imaging problems. Additionally, the use of training data coupled with high-dimensional models can compensate for imperfect analytical knowledge in a problem. Consequently, these methods are an ongoing area of active research and promise to impact a wide range of application areas. In our taxonomy of computational-imaging approaches, these techniques are characterized by relatively high computation (for parameter learning) coupled with the use of big data.

SPS involvement

This section provides a brief history of SPS initiatives and activities related to computational imaging, including the establishment of IEEE Transactions on Computational Imaging (TCI) and the SPS Computational Imaging Technical Committee (CI TC) as well as support for community-wide conference and seminar activities.
FIGURE 7. Tomographic reconstructions with challenging limited-angle scanning in a baggage-screening security application. (a) Ground truth (FBP reconstruction using complete scanning). (b) FBP reconstruction followed by CNN post-processing using half-scan data. (c) PPP reconstruction with an image-domain learned denoiser using half-scan data. (d) MACE reconstruction including both learned data- and image-domain models using half-scan data. (Source: Adapted from [49], ©2021 IEEE.)
FIGURE 8. A general framework for algorithm unrolling. Left: An iterative algorithm composed of iterations of the fixed function h(·) with parameter set θ. Right: An unrolling of N iterations into a network with multiple sublayers of h_k, where each h_k now represents a (possibly structured) network with (possibly) different parameters θ_k; these networks are then learned through a training process. (Source: Adapted from [50].)
TCI
Motivated by the rapid growth of computational imaging as a research and technology area distinct from image processing, the creation of a new journal on computational imaging was first proposed to the SPS Technical Activities Board Periodicals Committee in 2013 in an effort led by three serving and prior editors-in-chief of IEEE Transactions on Image Processing: Charles Bouman, Thrasos Pappas, and Clem Karl. The journal was launched in 2015, with Clem Karl as its inaugural editor-in-chief.

TCI focuses on solutions to imaging problems in which computation plays an integral role in the formation of an image from sensor data. The journal's scope includes all areas of computational imaging, ranging from theoretical foundations and methods to innovative computational-imaging system design. Topics of interest include advanced algorithms and mathematical methodology, model-based data inversion, methods for image recovery from sparse and incomplete data, techniques for nontraditional sensing of image data, methods for dynamic information acquisition and extraction from imaging sensors, and software and hardware for efficient computation in imaging systems.

TCI has grown rapidly since its inception. With around 40 submissions a month and an impact factor of 4.7, it is now one of the leading venues for the publication of computational-imaging research. TCI is somewhat unique within the SPS publications portfolio in that it draws submissions from a broad range of professional communities beyond the SPS, including SIAM, Optica (formerly the Optical Society of America), and SPIE, and that it has connections to a broad range of domains, including radar sensing, X-ray imaging, optical microscopy, and ultrasound sensing. The editorial board similarly includes members from these diverse communities.

CI TC

The Computational Imaging Special Interest Group was established within the SPS in 2015 and was promoted to TC status in 2018. The goal of the CI TC is the promotion of computational imaging as well as the formation of a community of computational-imaging researchers that crosses the traditional boundaries of professional-society affiliations and academic disciplines, with activities including the sponsorship of workshops and special sessions, assistance with the review of papers submitted to the major society conferences, and the support and promotion of TCI. The CI TC currently consists of the chair, the vice chair (or the past chair), and 40 regular voting members. Additionally, there are nonvoting advisory members, associate members, and affiliate members. To promote collaboration across communities, the CI TC also appoints liaisons to other IEEE and non-IEEE professional groups whose interests intersect with computational imaging.

Those interested in becoming involved with the CI TC can become an affiliate member via an easy web-based registration process (see https://signalprocessingsociety.org/community-involvement/computational-imaging/affiliates). Affiliates are nonelected nonvoting members of the TC, and affiliate membership is open to IEEE Members of all grades as well as to members of certain other professional organizations in interdisciplinary fields within the CI TC's scope.

Vision and outlook

Although computational imaging has only recently emerged as a distinct area of research, it rests upon several decades of work performed in separate research and technology communities. Accordingly, a broad collaboration of researchers—hailing from signal processing, machine learning, statistics, optimization, optics, and computer vision with domain expertise in various application areas ranging all the way from medical imaging and computational photography to remote sensing—is essential for accelerating progress in this area. This progress is exactly what TCI and the CI TC aim to catalyze. Most recently, the community has seen an increasing involvement and impact of machine learning in computational imaging, a theme to which TC members and TCI authors make significant contributions.

Conclusions

In conclusion, computational imaging has emerged as a significant and distinct area of research. With the increasing demand for improved information extraction from existing imaging systems coupled with the growth of novel imaging applications and sensing configurations, researchers have turned to advanced computational and algorithmic techniques. These techniques, including deep learning methods, have led to the development of new imaging modalities as well as the ability to process and analyze large datasets. The SPS has played an important role in this growing area through the creation of a new highly ranked journal, a new energetic TC, and support for new cross-society conference and seminar activities. The continued advancement of computational imaging will impact a wide range of applications, from health care to science to defense and beyond.

Authors

W. Clem Karl ([email protected]) received his Ph.D. degree in electrical engineering and computer science from the Massachusetts Institute of Technology. He is currently the chair of the Electrical and Computer Engineering Department and a member of the Biomedical Engineering Department, Boston University, Boston, MA 02215 USA. He was the inaugural editor-in-chief of IEEE Transactions on Computational Imaging and the editor-in-chief of IEEE Transactions on Image Processing. He has served in many society roles, including on the IEEE Signal Processing Society Board of Governors and the IEEE Signal Processing Society Publications Board. He is a Fellow of IEEE and the American Institute for Medical and Biological Engineering.
James E. Fowler ([email protected]) received his Ph.D. degree in electrical engineering from The Ohio State University. He is currently a William L. Giles Distinguished Professor in the Department of Electrical and Computer Engineering, Mississippi State University, Mississippi State, MS 39762 USA. He was previously the editor-in-chief of IEEE Signal Processing Letters, a senior area editor for IEEE Transactions on Image Processing, and an associate editor for IEEE Transactions on Computational Imaging. He is currently the chair of the Computational Imaging Technical Committee of the IEEE Signal Processing Society. He is a Fellow of IEEE.

Charles A. Bouman ([email protected]) received his Ph.D. degree from Princeton University. He is the Showalter Professor of Electrical and Computer Engineering and Biomedical Engineering at Purdue University, West Lafayette, IN 47907-1285 USA. He has served as the IEEE Signal Processing Society's vice president of technical directions as well as the editor-in-chief of IEEE Transactions on Image Processing. He is a Fellow of IEEE, the American Institute for Medical and Biological Engineering, the Society for Imaging Science and Technology (IS&T), and SPIE and a member of the National Academy of Inventors.

Müjdat Çetin ([email protected]) received his Ph.D. degree from Boston University. He is a professor of electrical and computer engineering and of computer science, the director of the Goergen Institute for Data Science, and the director of the New York State Center of Excellence in Data Science, University of Rochester, Rochester, NY 14627-0001 USA. He is currently serving as the editor-in-chief of IEEE Transactions on Computational Imaging and a senior area editor for IEEE Transactions on Image Processing. He previously served as the chair of the IEEE Computational Imaging Technical Committee. He is a Fellow of IEEE.

Brendt Wohlberg ([email protected]) received his Ph.D. degree in electrical engineering from the University of Cape Town. He is currently a staff scientist with the Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545-1663 USA. He was the chair of the Computational Imaging Special Interest Group (now the Computational Imaging Technical Committee) of the IEEE Signal Processing Society and was the editor-in-chief of IEEE Transactions on Computational Imaging. He is currently the editor-in-chief of IEEE Open Journal of Signal Processing. He is a Fellow of IEEE.

Jong Chul Ye ([email protected]) received his Ph.D. degree from Purdue University. He is a professor of the Kim Jaechul Graduate School of Artificial Intelligence and an adjunct professor at the Department of Mathematical Sciences and the Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejon 305-701, Korea. He has served as an associate editor of both IEEE Transactions on Image Processing and IEEE Transactions on Computational Imaging. He was the chair of the Computational Imaging Technical Committee of the IEEE Signal Processing Society. He is a Fellow of IEEE.
References
[1] Z. Wang and G. AlRegib, “Interactive fault extraction in 3-D seismic data using the Hough transform and tracking vectors,” IEEE Trans. Comput. Imag., vol. 3, no. 1, pp. 99–109, Mar. 2017, doi: 10.1109/TCI.2016.2626998. [2] L. Tian, X. Li, K. Ramchandran, and L. Waller, “Multiplexed coded illumination for Fourier Ptychography with an LED array microscope,” Biomed. Opt. Exp., vol. 5, no. 7, pp. 2376–2389, Jul. 2014, doi: 10.1364/BOE.5.002376. [3] K. L. Bouman, “Portrait of a black hole: Here’s how the event horizon telescope team pieced together a now-famous image,” IEEE Spectr., vol. 57, no. 2, pp. 22–29, Feb. 2020, doi: 10.1109/MSPEC.2020.8976898. [4] L. A. Shepp and B. F. Logan, “The Fourier reconstruction of a head section,” IEEE Trans. Nucl. Sci., vol. 21, no. 3, pp. 21–43, Jun. 1974, doi: 10.1109/ TNS.1974.6499235. [Online]. Available: http://ieeexplore.ieee.org/document/ 6499235/ [5] Z.-P. Liang and P. C. Lauterbur, Principles of Magnetic Resonance Imaging: A Signal Processing Perspective. New York, NY, USA: IEEE Press, 2000. [6] C. V. Jakowatz, D. E. Wahl, P. A. Thompson, P. H. Eichel, and D. C. Ghiglia, Spotlight-Mode Synthetic Aperture Radar: A Signal Processing Approach. New York, NY, USA: Springer, 1996. [7] M. Vauhkonen, D. Vadasz, P. Karjalainen, E. Somersalo, and J. Kaipio, “Tikhonov regularization and prior information in electrical impedance tomography,” IEEE Trans. Med. Imag., vol. 17, no. 2, pp. 285–293, Apr. 1998, doi: 10.1109/42.700740. [Online]. Available: http://ieeexplore.ieee.org/document/ 700740/ [8] J. Besag, “Digital image processing: Towards Bayesian image analysis,” J. Appl. Statist., vol. 16, no. 3, pp. 395–407, Jan. 1989, doi: 10.1080/02664768900000049. [9] S. Geman and D. E. McClure, “Statistical methods for tomographic image reconstruction,” in Proc. 46th Session Int. Statist. Inst., 1987, vol. 52, pp. 5–21. [10] P. Charbonnier, L. Blanc-Feraud, G. Aubert, and M. Barlaud, “Deterministic edge-preserving regularization in computed imaging,” IEEE Trans. Image Process., vol. 6, no. 2, pp. 298–311, Feb. 1997, doi: 10.1109/83.551699. [Online]. Available: https://ieeexplore.ieee.org/document/551699/ [11] P. Green, “Bayesian reconstructions from emission tomography data using a modified EM algorithm,” IEEE Trans. Med. Imag., vol. 9, no. 1, pp. 84–93, Mar. 1990, doi: 10.1109/42.52985. [Online]. Available: http://ieeexplore.ieee.org/ document/52985/ [12] A. Blake, “Comparison of the efficiency of deterministic and stochastic algorithms for visual reconstruction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 1, pp. 2–12, Jan. 1989, doi: 10.1109/34.23109. [Online]. Available: http://ieeexplore.ieee.org/document/23109/ [13] P. J. Huber, “Robust estimation of a location parameter,” Ann. Math. Statist., vol. 35, no. 1, pp. 73–101, Mar. 1964, doi: 10.1214/aoms/1177703732. [Online]. Available: http://projecteuclid.org/euclid.aoms/1177703732 [14] T. Hebert and R. Leahy, “A generalized EM algorithm for 3-D Bayesian reconstruction from Poisson data using Gibbs priors,” IEEE Trans. Med. Imag., vol. 8, no. 2, pp. 194–202, Jun. 1989, doi: 10.1109/42.24868. [Online]. Available: http:// ieeexplore.ieee.org/document/24868/ [15] D. Geman and G. Reynolds, “Constrained restoration and the recovery of discontinuities,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 3, pp. 367–383, Mar. 1992, doi: 10.1109/34.120331. [Online]. Available: http://ieeexplore.ieee.org/ document/120331/ [16] C. Bouman and K. 
Sauer, “A generalized Gaussian image model for edge- preserving MAP estimation,” IEEE Trans. Image Process., vol. 2, no. 3, pp. 296– 310, Jul. 1993, doi: 10.1109/83.236536. [Online]. Available: http://ieeexplore.ieee. org/document/236536/ [17] J. Mairal, “Sparse modeling for image and vision processing,” Found. Trends® Comput. Graphics Vision, vol. 8, nos. 2–3, pp. 85–283, Dec. 2014, doi: 10.1561/ 0600000058. [Online]. Available: http://www.nowpublishers.com/articles/foundations -and-trends-in-computer-graphics-and-vision/CGV-058 [18] D. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006, doi: 10.1109/TIT.2006.871582. [Online]. Available: http:// ieeexplore.ieee.org/document/1614066/ [19] E. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006, doi: 10.1109/TIT.2005.862083. [Online]. Available: http://ieeexplore.ieee.org/document/1580791/ [20] D. Malioutov, M. Cetin, and A. Willsky, “Optimal sparse representations in general overcomplete bases,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Montreal, QC, Canada: IEEE, 2004, vol. 2, pp. ii–793-6, doi: 10.1109/ ICASSP.2004.1326377. [21] S. Mun and J. E. Fowler, “Block compressed sensing of images using directional transforms,” in Proc. IEEE Int. Conf. Image Process., Cairo, Egypt, Nov. 2009, pp. 3021–3024, doi: 10.1109/ICIP.2009.5414429. [22] D. Takhar, J. N. Laska, M. B. Wakin, M. F. Duarte, D. Baron, S. Sarvotham, K. F. Kelly, and R. G. Baraniuk, “A new compressive imaging camera architecture
using optical-domain compression,” in Proc. Comput. Imag. IV (SPIE), San Jose, CA, USA, Jan. 2006, p. 606509, doi: 10.1117/12.659602. [23] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Found. Trends® Mach. Learn., vol. 3, no. 1, pp. 1–122, Jul. 2011, doi: 10.1561/2200000016. [Online]. Available: https://www.nowpublishers.com/article/ Details/MAL-016 [24] N. Parikh and S. Boyd, “Proximal algorithms,” Found. Trends Optim., vol. 1, no. 3, pp. 127–239, Jan. 2014, doi: 10.1561/2400000003. [25] P. L. Combettes and J.-C. Pesquet, “Proximal splitting methods in signal processing,” in Proc. Fixed-Point Algorithms Inverse Probl. Sci. Eng., H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, Eds. New York, NY, USA: Springer, 2011, vol. 49, pp. 185–212. [26] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, Jun. 1996, doi: 10.1038/381607a0. [Online]. Available: http://www. nature.com/articles/381607a0 [27] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, Dec. 2006, doi: 10.1109/TIP.2006.881969. [Online]. Available: http:// ieeexplore.ieee.org/document/4011956/ [28] J. Adler and O. Oktem, “Learned primal-dual reconstruction,” IEEE Trans. Med. Imag., vol. 37, no. 6, pp. 1322–1332, Jun. 2018, doi: 10.1109/TMI.2018.2799231. [Online]. Available: https://ieeexplore.ieee.org/document/8271999/ [29] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Process., vol. 26, no. 9, pp. 4509–4522, Sep. 2017, doi: 10.1109/TIP.2017.2713099. [Online]. Available: http://ieeexplore.ieee.org/document/7949028/ [30] C. You et al., “Structurally-sensitive multi-scale deep neural network for lowdose CT denoising,” IEEE Access, vol. 6, pp. 41,839–41,855, Jul. 2018, doi: 10.1109/ACCESS.2018.2858196. [31] G. Yang et al., “DAGAN: Deep de-aliasing generative adversarial networks for fast compressed sensing MRI reconstruction,” IEEE Trans. Med. Imag., vol. 37, no. 6, pp. 1310–1321, Jun. 2018, doi: 10.1109/TMI.2017.2785879. [Online]. Available: https://ieeexplore.ieee.org/document/8233175/ [32] M. U. Ghani and W. C. Karl, “Deep learning-based sinogram completion for low-dose CT,” in Proc. IEEE 13th Image, Video, Multidimensional Signal Process. Workshop (IVMSP), Zagorochoria, Greece: IEEE, Jun. 2018, pp. 1–5, doi: 10.1109/IVMSPW.2018.8448403. [33] D. Wu, K. Kim, G. El Fakhri, and Q. Li, “Iterative low-dose CT reconstruction with priors trained by artificial neural network,” IEEE Trans. Med. Imag., vol. 36, no. 12, pp. 2479–2486, Dec. 2017, doi: 10.1109/TMI.2017.2753138. [Online]. Available: https://ieeexplore.ieee.org/document/8038851/ [34] Y. Wu and Y. Lin, “InversionNet: An efficient and accurate data-driven full waveform inversion,” IEEE Trans. Comput. Imag., vol. 6, pp. 419–433, 2020, doi: 10.1109/TCI.2019.2956866. [Online]. Available: https://ieeexplore.ieee.org /document/8918045 [35] B. Zhu, J. Z. Liu, S. F. Cauley, B. R. Rosen, and M. S. Rosen, “Image reconstruction by domain-transform manifold learning,” Nature, vol. 555, no. 7697, pp. 487–492, Mar. 2018, doi: 10.1038/nature25988. [Online]. Available: http://www. 
nature.com/articles/nature25988 [36] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” Int. J. Comput. Vision, vol. 128, pp. 1867–1888, Jul. 2020, doi: 10.1007/s11263020-01303-4. [37] T. Würfl, F. C. Ghesu, V. Christlein, and A. Maier, “Deep learning computed tomography,” in Proc. Med. Image Comput. Comput.-Assisted Intervention (MICCAI), S. Ourselin, L. Joskowicz, M. R. Sabuncu, G. Unal, and W. Wells, Eds. Cham, Switzerland: Springer International Publishing, 2016, vol. 9902, pp. 432–440. [38] D. H. Ye, G. T. Buzzard, M. Ruby, and C. A. Bouman, “Deep back projection for sparse-view CT reconstruction,” in Proc. IEEE Global Conf. Signal Inf. Process. (GlobalSIP), Anaheim, CA, USA: IEEE, Nov. 2018, pp. 1–5, doi: 10.1109/ GlobalSIP.2018.8646669. [39] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, “Plug-and-play priors for model based reconstruction,” in Proc. IEEE Global Conf. Signal Inf. Process., Dec. 2013, pp. 945–948, doi: 10.1109/GlobalSIP.2013.6737048. [40] S. Sreehari, S. V. Venkatakrishnan, B. Wohlberg, G. T. Buzzard, L. F. Drummy, J. P. Simmons, and C. A. Bouman, “Plug-and-play priors for bright field electron tomography and sparse interpolation,” IEEE Trans. Comput. Imag., vol. 2, no. 4, pp. 408–423, Dec. 2016, doi: 10.1109/TCI.2016.2599778. [Online]. Available: http://ieeexplore.ieee.org/document/7542195/ [41] A. Buades, B. Coll, and J.-M. Morel, “Non-local means denoising,” IPOL J. Image Process. On Line, vol. 1, pp. 208–212, Sep. 2011, doi: 10.5201/ipol.2011. bcm_nlm. [Online]. Available: https://www.ipol.im/pub/art/2011/bcm_nlm/?utm _source=doi
[42] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image restoration by sparse 3D transform-domain collaborative filtering,” in Proc. Image Process., Algorithms Syst. VI, J. T. Astola, K. O. Egiazarian, and E. R. Dougherty, Eds. San Jose, CA, USA, Feb. 2008, p. 681207. [43] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, Jul. 2017, doi: 10.1109/TIP.2017.2662206. [Online]. Available: https://ieeexplore.ieee.org/document/7839189/ [44] U. S. Kamilov, H. Mansour, and B. Wohlberg, “A plug-and-play priors approach for solving nonlinear imaging inverse problems,” IEEE Signal Process. Lett., vol. 24, no. 12, pp. 1872–1876, Dec. 2017, doi: 10.1109/LSP.2017.2763583. [Online]. Available: http://ieeexplore.ieee.org/document/8068267/ [45] Y. Sun, B. Wohlberg, and U. S. Kamilov, “An online plug-and-play algorithm for regularized image reconstruction,” IEEE Trans. Comput. Imag., vol. 5, no. 3, pp. 395–408, Sep. 2019, doi: 10.1109/TCI.2019.2893568. [46] Y. Romano, M. Elad, and P. Milanfar, “The little engine that could: Regularization by denoising (RED),” SIAM J. Imag. Sci., vol. 10, no. 4, pp. 1804– 1844, Jan. 2017, doi: 10.1137/16M1102884. [47] E. T. Reehorst and P. Schniter, “Regularization by denoising: Clarifications and new interpretations,” IEEE Trans. Comput. Imag., vol. 5, no. 1, pp. 52–67, Mar. 2019, doi: 10.1109/TCI.2018.2880326. [Online]. Available: https://ieeexplore.ieee. org/document/8528509/ [48] G. T. Buzzard, S. H. Chan, S. Sreehari, and C. A. Bouman, “Plug-and-play unplugged: Optimization-free reconstruction using consensus equilibrium,” SIAM J. Imag. Sci., vol. 11, no. 3, pp. 2001–2020, Jan. 2018, doi: 10.1137/ 17M1122451. [49] M. U. Ghani and W. C. Karl, “Data and image prior integration for image reconstruction using consensus equilibrium,” IEEE Trans. Comput. Imag., vol. 7, pp. 297–308, Mar. 2021, doi: 10.1109/TCI.2021.3062986. [Online]. Available: https://ieeexplore.ieee.org/document/9366922/ [50] V. Monga, Y. Li, and Y. C. Eldar, “Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing,” IEEE Signal Process. Mag., vol. 38, no. 2, pp. 18–44, Mar. 2021, doi: 10.1109/MSP.2020.3016905. [Online]. Available: https://ieeexplore.ieee.org/document/9363511/ [51] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proc. 27th Int. Conf. Mach. Learn., Haifa, Israel: Omnipress, Jun. 2010, pp. 399–406. [52] H. Gupta, K. H. Jin, H. Q. Nguyen, M. T. McCann, and M. Unser, “CNNbased projected gradient descent for consistent CT image reconstruction,” IEEE Trans. Med. Imag., vol. 37, no. 6, pp. 1440–1453, Jun. 2018, doi: 10.1109/ TMI.2018.2832656. [Online]. Available: https://ieeexplore.ieee.org/document/ 8353870/ [53] Y. Li, M. Tofighi, V. Monga, and Y. C. Eldar, “An algorithm unrolling approach to deep image deblurring,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Brighton, U.K.: IEEE, May 2019, pp. 7675–7679, doi: 10.1109/ICASSP.2019.8682542. [54] S. Lunz, O. Öktem, and C.-B. Schönlieb, “Adversarial regularizers in inverse problems,” in Proc. 32nd Int. Conf. Neural Inf. Process. Syst., Red Hook, NY, USA: Curran Associates, Inc., Dec. 2018, pp. 8516–8525. [55] H. Sun and K. L. Bouman, “Deep probabilistic imaging: Uncertainty quantification and multi-modal solution characterization for computational imaging,” in Proc. AAAI Conf. Artif. Intell., May 2021, vol. 35, no. 3, pp. 
2628–2637, doi: 10.1609/aaai.v35i3.16366. [Online]. Available: https://ojs.aaai.org/index.php/ AAAI/article/view/16366 [56] Y. Song, L. Shen, L. Xing, and S. Ermon, “Solving inverse problems in medical imaging with score-based generative models,” in Proc. Int. Conf. Learn. Representations, Mar. 2022. [Online]. Available: https://openreview.net/ forum?id=vaRCHVj0uGI [57] H. Chung and J. C. Ye, “Score-based diffusion models for accelerated MRI,” Med. Image Anal., vol. 80, Aug. 2022, Art. no. 102479, doi: 10.1016/j. media.2022.102479. [Online]. Available: https://linkinghub.elsevier.com/retrieve/ pii/S1361841522001268 [58] C. Ekmekci and M. Cetin, “Uncertainty quantification for deep unrolling-based computational imaging,” IEEE Trans. Comput. Imag., vol. 8, pp. 1195–1209, Dec. 2022, doi: 10.1109/TCI.2022.3233185. [59] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2Noise: Learning image restoration without clean data,” in Proc. 35th Int. Conf. Mach. Learn., PMLR, Jul. 2018, pp. 2965–2974. [Online]. Available: https://proceedings.mlr.press/v80/lehtinen18a.html [60] J. Liu, Y. Sun, C. Eldeniz, W. Gan, H. An, and U. S. Kamilov, “RARE: Image reconstruction using deep priors learned without groundtruth,” IEEE J. Sel. Topics Signal Process., vol. 14, no. 6, pp. 1088–1099, Oct. 2020, doi: 10.1109/ JSTSP.2020.2998402. [Online]. Available: https://ieeexplore.ieee.org/document/ 9103213/
75TH ANNIVERSARY OF SIGNAL PROCESSING SOCIETY SPECIAL ISSUE
Xin Li, Weisheng Dong, Jinjian Wu, Leida Li, and Guangming Shi
Superresolution Image Reconstruction
Selective milestones and open problems
In multidimensional signal processing, such as image and video processing, superresolution (SR) imaging is a classical problem. Over the past 25 years, academia and industry have been interested in reconstructing high-resolution (HR) images from their low-resolution (LR) counterparts. In this tutorial article, we review the development of SR technology based on the evolution of key insights associated with the prior knowledge or regularization, from analytical representations to data-driven deep models. The coevolution of SR with other technical fields, such as autoregressive modeling, sparse coding, and deep learning, will be highlighted in both model-based and learning-based approaches. Model-based SR includes geometry-driven, sparsity-based, and gradient-profile priors; learning-based SR covers three types of neural network (NN) architectures, namely residual networks (ResNets), generative adversarial networks (GANs), and pretrained models (PTMs). Model-based and learning-based SR are treated in a unified way by highlighting their limitations from the perspective of model-data mismatch. Our new perspective allows us to maintain a healthy skepticism about current practice and advocate for a hybrid approach that combines the strengths of model-based and learning-based SR. We will also discuss several open challenges, including arbitrary-ratio, reference-based, and domain-specific SR.
Introduction
Digital Object Identifier 10.1109/MSP.2023.3271438 Date of current version: 14 July 2023
In image processing, SR refers to techniques that increase image resolution. SR imaging can be implemented in hardware (e.g., optical solutions) or in software (e.g., digital zooming or image scaling). Software-based (also called computational) SR imaging approaches can be classified in several ways according to the assumptions about the relationship between LR images and HR images: single image versus multiframe, nonblind versus blind, fixed versus arbitrary scaling ratios, and so on. In the past quarter-century, SR techniques have evolved into two categories: model based (1998 to present) [2], [7], [12], [20], [31], [37]
and learning based (2014 to present) [4], [6], [14], [18], [19], [22], [27], [32], [33], [39]. Model-based approaches rely on mathematical models to connect LR and HR data; the main difference is in how the LR observation and the HR image prior are characterized. Learning a nonlinear mapping between LR and HR image data can be greatly facilitated by the simple idea of skip connections (i.e., ResNet) in learning-based approaches. Recently, researchers have focused on developing novel network architectures [e.g., Generative Latent Bank (GLEAN) [3] and nonlocal blocks [25], [34]] and applying them to realistic scenarios [e.g., locally discriminative learning (LDL) [21]].

A systematic review of SR's evolution over the last 25 years is presented in this tutorial. The purpose of this article is not to provide a comprehensive review of image SR; interested readers are referred to three recent survey articles [1], [23], [36]. We aim to highlight the rich connections between image processing and other technical fields rather than focusing on a wide range of topics. SR has evolved with Wiener filtering, compressed sensing, and NN design since 1998 (the 50th anniversary of the IEEE Signal Processing Society). SR is a class of inverse problems extensively studied in the literature from a mathematical perspective. As a scientific concept, SR is related to the Rayleigh criterion, a limit for diffraction in optical imaging systems. Engineering applications of SR range from biomedical imaging to consumer electronics. Smartphones, high-definition television (HDTV), remote sensing, and smart health are examples of SR technology in our daily lives.

These four perspectives can be used to justify the importance of studying SR. SR imaging has a wide range of applications, ranging from nanometer to light-year scale (for example, SR microscopy won the Nobel Prize in Chemistry in 2014). Watson and Crick's discovery of DNA's double-helix structure would have been trivial if SR microscopy could reveal DNA's detailed structure on a nanometer scale. In terms of technology, SR imaging shows how expensive hardware (i.e., optical zoom) can be traded for more cost-effective software (i.e., SR algorithms). Single-lens reflex cameras are phasing out as SR technology advances, giving way to smartphone photography. SR imaging has also been applied to a variety of engineering systems, including Mars Curiosity and NASA's James Webb Space Telescope. Last but not least, SR image reconstruction is a class of inverse problems that has been extensively studied by mathematicians, and its solutions often have profound implications for related inverse problems, such as blind image deconvolution and medical image reconstruction.

There are two main motivations behind this tutorial article. Instead of mathematically approximating LR and HR images with nonlinear mappings f: X_LR → X_HR, SR has evolved toward data-driven or learning-based methods that determine surrogate models. During the past seven years, extensive research has been conducted along the following two lines. First, skip connections and squeeze-and-excitation modules have been introduced into ResNet-like NN architectures to alleviate the vanishing-gradient problem. Second, model-based approaches can be leveraged to provide new insights, such as the importance of exploiting higher-order attention and nonlocal dependency [4]. Model-based SR can also be unfolded directly into deep NNs (DNNs) [27]. In contrast, learning-based SR has coevolved with other fields in computer vision and machine learning. Using a discriminative model, SRGAN intelligently separates the ground truth (real HR) from the result of the SR reconstruction (fake HR), a capability made possible by the invention of the GAN. The attention mechanism has sparked interest in transformer-based models, which have been successfully applied to SR (e.g., [8]). Recent advances in blind image restoration have renewed interest in solving the blind real-world SR problem with an LDL approach [21]. By simultaneously estimating the blur kernel and the HR image, a network is unfolded to solve the joint optimization problem.

Following are the key new insights offered in this tutorial in addition to the scientific challenges and key milestones of SR:
■■ The first is a selective review of SR milestones in the past 25 years with an emphasis on theoretical insights: i.e., how can missing high-frequency information be approximated or recovered?
■■ The second is a healthy skepticism toward well-cited SR algorithms. To illustrate progress in both model-based and learning-based approaches, we will highlight failure examples.
■■ Three open challenges have been selected in the field of SR image reconstruction: arbitrary-ratio, reference-based, and domain-specific SR. We will discuss the current state of the art and future directions for each challenge.

Problem formulation

Observation model

Generally speaking, the problem of single-image SR (SISR) refers to the reconstruction of an HR image from its corresponding LR observation [refer to Figure 1(a)]. For a layperson, SISR is widely known as digital zoom, in contrast to optical zoom. Digital zoom and optical zoom represent software- and hardware-based approaches to enhance the resolution of digital images; the latter is often conceived as the upper bound for the former when optical imaging systems operate within the diffraction limit. From a computational imaging perspective, SISR or digital zoom represents a cost-effective approximation of optical zoom. Closing the gap between software-based and hardware-based approaches has been the holy grail of SR technology in the past 25 years.
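Anticipating the parametric model Y = DHX + n introduced below, a toy NumPy/SciPy realization of such a synthetic degradation (Gaussian blur, decimation, and additive noise) is sketched here; the blur width, scale factor, and noise level are illustrative, and real-world degradations are usually more complicated.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(hr, scale=2, blur_sigma=1.0, noise_std=0.01, rng=None):
    """Synthesize an LR observation Y = D H X + n from an HR image X."""
    rng = np.random.default_rng() if rng is None else rng
    blurred = gaussian_filter(hr, sigma=blur_sigma)          # H X
    lr = blurred[::scale, ::scale]                           # D H X
    return lr + noise_std * rng.standard_normal(lr.shape)    # + n
```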
|
July 2023
|
55
We note that the SISR problem formulation has made several simplifying assumptions to make it technically more tractable. Depending on the assumption about the LR observation model, we can formulate SISR as image interpolation, where the LR image Y is simply a decimated version of the HR image X, as shown in Figure 1(b), or as SR image reconstruction, where the LR image is obtained from the HR image by several operators (e.g., warping, blur, and downsampling), as shown in Figure 1(c). When the degradation is unknown (i.e., the so-called blind or real-world scenario), the problem of SISR is more challenging than its nonblind formulation (i.e., with complete knowledge of the LR observation model). Blind or real-world SISR [23] is one of the frontiers of SR research these days.

In the framework of Bayesian inference, a maximum a posteriori (MAP) estimation of an HR image X from the LR observation Y can be formulated as arg max_X P(X|Y) = arg max_X P(Y|X)P(X) using the Bayesian formula. The LR observation model deals with the likelihood term P(Y|X) that characterizes the degradation process of LR image acquisition. For example, one might start with a parametric observation model Y = DHX + n, where D and H denote downsampling and blurring operators, respectively, and n is additive noise. Note that the image interpolation problem is a special case in which H and n are skipped; even so, spatially invariant blur H is already an oversimplified abstraction of image degradation in the real world. In the meantime, the source of the additive noise n can be sensor related (e.g., shot noise in a raw color filter array) or transmission related (e.g., image compression artifacts). In the formulation of blind problems, the blurring kernel H is unknown and even spatially varying; therefore, we have to address the problem of estimating the blurring kernel and reconstructing the HR image simultaneously.

Image prior

The key technical challenge of SISR lies in the construction of an image prior P(X) (also called the regularization functional in the literature of image restoration and inverse problems) [28]. During the past 25 years, the effort devoted to image prior construction can be classified into two paradigms: model based (1998 to present) and learning based (2014 to present). In the paradigm of model-based SR, the unifying theme is to construct mathematical models (e.g., geometry driven, sparsity based, or gradient domain) for the class of HR images. In the paradigm of learning-based SR, the common objective is to learn a nonlinear mapping [e.g., NNs consisting of several building blocks such as convolution and max-pooling layers, rectified linear units (ReLUs), and batch normalization modules] from the space of LR images to that of HR images. The paradigm shift from model based to learning based has been catalyzed by rapid advances in data science (e.g., the large-scale collection of training data such as ImageNet) and deep learning (i.e., the replacement of Moore's law for CPU acceleration by Huang's law for GPU acceleration). (Huang's law is an observation in computer science and engineering that advancements in GPUs are growing at a rate much faster than with traditional CPUs.) Image prior/regularizer construction or learning represents the state of the art in developing analytical or numerical representations to explain intensity distributions in images, regardless of the model- or learning-based paradigm. Wavelets, partial differential equations, Markov random fields, and NNs serve only as tools to communicate ideas abstracted from the physical properties of visual information. Such abstraction, generative or discriminative, allows us to handle a wide range of image data regardless of their semantic contents (e.g., biometrics or surveillance), acquisition conditions (e.g., camera distance and illumination), or physical origins (e.g., optical sensors versus MRI scanners).
FIGURE 1. The problem formulation of SISR. (a) The abstract relationship between LR and HR from the pinhole imaging model. (b) A simplified SISR formulation (model-based image interpolation), where LR is a decimated version of HR. (c) Degradation modeling for more accurate characterization of LR observation from HR (learning-based image SR).
Model-based SR: From edge directed to sparsity based
In this section, we review model-based SR based on geometry-driven, sparsity-based, and gradient-profile priors that were developed during the first decade of the new millennium. They are constructed from varying insights about the prior knowledge of unknown HR images.

Adaptive image interpolation via geometric invariance

In the simplified image interpolation situation, LR pixels correspond directly to the decimated version of HR, as shown in Figure 2. For a scaling factor of two, the task of image interpolation boils down to guessing the missing pixels that occupy three-quarters of the spatial locations. The new edge-directed interpolation (NEDI) [20] extends classic Wiener filtering [mathematically equivalent to least-squares (LS) estimation] from prediction to interpolation. As shown in Figure 2, missing pixels at unknown HR sampling locations are denoted by yellow dots. Each yellow pixel (labeled "0") must be predicted from the linear combination of its four surrounding black pixels (labeled "1" to "4"). Wiener filtering or LS-based estimation of the weighting coefficients requires the calculation of local covariance at the HR scale (marked by solid lines with different colors), which is infeasible due to the missing yellow pixels. Based on the observation that edge orientation is scale invariant, NEDI calculates the local covariances at the LR scale (marked by dashed lines with different colors) and uses them as surrogate covariances to drive the LS-based estimation at the HR scale.

The effectiveness of NEDI can be interpreted from the following perspectives. First, local geometric information about the direction of an edge can be viewed as being implicitly embedded in the four linear weights of the LS formula. Such implicit exploitation of the geometry-related prior (i.e., the scale-invariant property of edge orientation) makes the NEDI model a good fit for an arbitrarily oriented edge. Second, there is an elegant duality between step 1 and step 2 of the NEDI implementation—they are geometrically isomorphic (up to a rotation by 45° clockwise). Note that the pixels interpolated in step 1 are treated the same as the given LR pixels (i.e., yellow pixels in step 1 become black ones in step 2). Such geometric duality demonstrates the potential of quincunx sampling as an improved strategy to hierarchically organize visual information compared to conventional raster sampling. (Quincunx is a geometric pattern consisting of five points arranged in a cross, with four of them forming a square or rectangle and a fifth at its center.)

FIGURE 2. A geometric invariance property exploited by model-based SR, such as NEDI [20]. On the basis of the observation that edge orientation is scale invariant, we can replace the fine-scale correlation (marked by solid lines) with their coarse-scale counterparts (marked by dashed lines). In other words, a multiscale extension of classic Wiener filtering was at the core of NEDI to adapt the interpolation based on the local covariance estimates (implicitly conveying the information about local edge direction). Note that the correspondence between the LR and HR pixel pairs is marked by different colors, and step 2 is isomorphic to step 1 (up to the rotation of 45° clockwise). (a) Step 1 of NEDI. (b) Step 2 of NEDI.

The limitations of NEDI are summarized next. First, NEDI is a localized model that ignores the nonlocal dependency within natural images. Second, the geometry-related prior exploited by NEDI matches only a certain class of image structures; for example, the edge-directed insight is not applicable to irregular texture patterns whose local statistics are more sophisticated and violate the scale-invariant assumption. Third, the two-step implementation of NEDI is open loop, ignoring the issue of possible inconsistency between adjacent windows. A closed-loop optimization of LS-based autoregressive models was later studied in the literature (e.g., [38]).
Image SR via sparse coding
The birth of compressed sensing theory around 2006 has inspired many novel applications of sparse representations, including SISR. A key observation obtained from the theory of sparse coding or compressed sensing is that image patches can be decomposed into a sparse linear combination of elements from an overcomplete dictionary. Such observations suggest that the sparse representation can be faithfully recovered from the downsampled signals under mild conditions (the theoretical foundation for SR image reconstruction). In [37], a sparse coding-based approach to SR is developed by jointly training two dictionaries for the LR and HR image patches. Unlike geometry-driven NEDI, the new insight is to enforce the similarity of sparse representations between the LR and HR image patch pairs with respect to their own dictionaries. Along this line of reasoning, the
one of the frontiers of SR research these days.
In the simplified image interpolation situation, LR pixels correspond directly to the decimated version of HR, as shown in Figure 2. For a scaling factor of two, the task of image interpolation boils down to guessing the missing pixels that occupy three-quarters of the spatial locations. The new edge-directed interpolation (NEDI) [20] extends the classic Wiener filtering [mathematically equivalent to least-square (LS) estimation] from prediction to interpolation. As shown in Figure 2, missing pixels as unknown HR sampling locations are denoted by yellow dots. Each yellow pixel (labeled “0”) must be predicted from the linear combination of its four surrounding black pixels (labeled as “1” to “4”). Wiener filtering or LS-based e stimation of weighting coefficients requires the calculation of local covariance at the HR (marked by solid lines with different colors), which is infeasible due to the missing yellow pixels. Based on the observation that edge orientation is scale i nvariant, NEDI calculates the local covariances at the LR (marked by dashed lines with different colors) and uses them as the surrogate covariance to drive the derivation of LS-based estimation at the HR. The effectiveness of NEDI can be interpreted from the following perspectives. First, local geometric information on the direction of the edge can be viewed as being implicitly embedded in the four linear weights in the LS formula. Such an implicit exploitation of the geometry-related prior (i.e., the scale-invariant property of edge orientation) makes the NEDI model a good fit for an arbitrarily oriented edge. Second, there is an elegant duality between step 1 and step 2 of NEDI implementation—they are geometrically isomorphic (up to a rotation by 45° clockwise). Note that the pixels interpolated from step 1 will be treated the same as the given LR (i.e., yellow pixels in step 1 become black ones in step 2). Such geometric duality demonstrates the potential of quincunx sampling as an improved strategy to hierarchically organize visual information compared to conventional raster sampling. (Quincunx
FIGURE 2. A geometric invariance property exploited by model-based SR, such as NEDI [20]. On the basis of the observation that edge orientation is scale invariant, we can replace the fine-scale correlation (marked by solid lines) with their coarse-scale counterparts (marked by dashed lines). In other words, a multiscale extension of classic Wiener filtering was at the core of NEDI to adapt the interpolation based on the local covariance estimates (implicitly conveying the information about local edge direction). Note that the correspondence between the LR and HR pixel pairs is marked by different colors, and step 2 is isomorphic to step 1 (up to the rotation of 45° clockwise). (a) Step 1 of NEDI. (b) Step 2 of NEDI.
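The following NumPy sketch illustrates step 1 of a NEDI-style interpolator under simplifying assumptions: a fixed square training window, no boundary handling, and arbitrary placeholder values; it is a didactic approximation, not the full algorithm of [20].

# Step 1 of a NEDI-style interpolation (simplified sketch): estimate four directional
# weights by LS from the LR covariance structure, then predict the missing HR pixel
# at (2i+1, 2j+1) from its four diagonal LR neighbors.
import numpy as np

def nedi_step1(lr, i, j, win=4):
    rows, cols = [], []
    for m in range(i - win, i + win):
        for n in range(j - win, j + win):
            # each LR pixel is predicted from its four diagonal neighbors at the coarse scale
            rows.append([lr[m - 1, n - 1], lr[m - 1, n + 1],
                         lr[m + 1, n - 1], lr[m + 1, n + 1]])
            cols.append(lr[m, n])
    C, y = np.array(rows), np.array(cols)
    a, *_ = np.linalg.lstsq(C, y, rcond=None)   # LS estimate of the four weights
    # reuse the weights at the fine scale (scale-invariant edge orientation)
    neigh = np.array([lr[i, j], lr[i, j + 1], lr[i + 1, j], lr[i + 1, j + 1]])
    return float(a @ neigh)

lr = np.random.rand(32, 32)
print(nedi_step1(lr, 10, 10))  # interpolated HR value between four LR pixels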
Along this line of reasoning, the sparse representation of an LR image patch can be used as a surrogate model for the HR image patch dictionary to generate an HR image patch. As shown in Figure 3, the learned dictionary pair is a more compact representation of the image patch pairs, substantially reducing the computational cost. Further development along this line of research includes an adaptive selection of the sparse domain and nonlocal extension, as presented in [7].

The performance of SR via sparse representations is tightly coupled with the quality of the training dataset used for dictionary learning. The selection of the patch size and the optimal dictionary size for SR image reconstruction remain open issues to address. For example, a special dictionary was learned for face hallucination in [37]; can we generalize such a result to other specific application domains? Algorithm 1 [Figure 3(b)] uses an initial SR X0 as the stepping stone; can a related reference image refine such an estimate? Furthermore, the observation model in the problem formulation assumes a fixed scaling factor. A different dictionary needs to be trained for a different scaling factor. These weaknesses will be addressed in the three open problems later.

Image SR via gradient profile prior
Gradient-domain image processing, also known as Poisson image editing, deals with image gradients rather than original intensity values. The mathematical foundation of gradient-domain image processing is the numerical solution to the Poisson equation. Conceptually, the horizontal and vertical gradient fields can be viewed as a redundant representation of the original image (each pixel is associated with a pair of gradients). Image reconstruction from gradient profiles can be interpreted as a nontrivial back-projection operation from the gradient space to the image space. In the context of gradient-domain image processing, we can address the problem of SISR by prioritizing gradient profiles instead of intensity values, as shown in Figure 4.

The key observation behind the gradient profile prior (GPP) is that the sharpness of natural images can be characterized by a parametric model such as a generalized exponential distribution. To impose such an image prior, it is possible to design a gradient-field transformation, as shown on the right of Figure 4. The role of the gradient transform is to match the distribution of gradient fields between the target and the observed images. The transformed gradient field is then used to reconstruct the enhanced images. In this way, the objective of the SR image reconstruction is achieved in the gradient domain. Similar to other geometry-driven priors (e.g., total-variation models), the performance of the GPP often degrades for the class of texture images.

Learning-based SR: Evolution of NN architectures
A rise in deep learning can be seen in 2015. SR via convolutional NN (SRCNN) [5] represented a pioneering work in deep learning-based SR, as shown in Figure 5. Since then, there has been an explosion of literature related to learning-based SR. Due to space limitations, we have to selectively review the most representative work from the perspective of the evolution of network architectures.
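As a concrete illustration of such a nonlinear mapping, the following PyTorch sketch assembles an SRCNN-like three-layer network; the 9-1-5 kernel sizes and 64/32 channel widths follow the commonly reported configuration and are meant only as a simplified stand-in, not the exact model of [5].

# SRCNN-like network: feature extraction, nonlinear mapping, reconstruction (sketch only).
import torch
import torch.nn as nn

class SRCNNLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=9, padding=4),   # patch extraction and representation
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),             # nonlinear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=5, padding=2),   # HR reconstruction
        )

    def forward(self, upscaled_lr):
        # SRCNN operates on a bicubic-upscaled LR input and predicts the HR image.
        return self.net(upscaled_lr)

x = torch.randn(1, 1, 64, 64)  # stands in for a bicubic-upscaled LR patch
print(SRCNNLike()(x).shape)    # torch.Size([1, 1, 64, 64])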
[Figure 3(a): LR/HR training data feed dictionary learning and sparse coding.]

Algorithm 1 (SR via sparse representation) [Figure 3(b)]:
1: Input: training dictionaries Dh and Dl, a low-resolution image Y.
2: For each 3 × 3 patch y of Y, taken starting from the upper-left corner with 1-pixel overlap in each direction:
   • Compute the mean pixel value m of patch y.
   • Solve the optimization problem with D̃ and ỹ defined in (8): min_α ||D̃α − ỹ||₂² + λ||α||₁.
   • Generate the high-resolution patch x = Dh α*. Put the patch x + m into a high-resolution image X0.
3: End
4: Using gradient descent, find the closest image to X0 which satisfies the reconstruction constraint X* = arg min_X ||SHX − Y||₂² + c||X − X0||₂².
5: Output: SR image X*.
FIGURE 3. (a) and (b) SISR via sparse representation [37]. The key idea is to enforce the similarity of sparse representations between the LR and HR image patches with respect to their own dictionaries.
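A minimal sketch of the patch-wise reconstruction in Algorithm 1, with two assumptions flagged up front: random matrices stand in for the jointly trained dictionary pair Dh/Dl of [37], and a generic ISTA loop stands in for the sparse-coding solver referred to as (8).

# Patch-wise sketch of sparse-coding SR (placeholder dictionaries, ISTA as the lasso solver).
import numpy as np

def ista(D, y, lam=0.1, iters=200):
    """Approximately solve min_a 0.5*||D a - y||^2 + lam*||a||_1 by iterative soft thresholding."""
    a = np.zeros(D.shape[1])
    step = 1.0 / np.linalg.norm(D, 2) ** 2        # 1 / Lipschitz constant of the gradient
    for _ in range(iters):
        z = a - step * (D.T @ (D @ a - y))        # gradient step on the data term
        a = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(0)
Dl = rng.standard_normal((9, 128))    # LR dictionary: 3x3 patches, 128 atoms (placeholder)
Dh = rng.standard_normal((36, 128))   # HR dictionary: 6x6 patches for a 2x factor (placeholder)

y = rng.standard_normal(9)            # one vectorized 3x3 LR patch
m = y.mean()
alpha = ista(Dl, y - m)               # sparse code shared by the LR/HR patch pair
x = Dh @ alpha + m                    # reconstructed HR patch (then tiled into X0)
print(x.shape)                        # (36,)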
FIGURE 4. Image SR via a gradient profile prior (GPP) [31]. (a)–(d) Gradient profile: (a) two edges with different sharpness; (b) gradient maps; (c) 1D gradient profiles; and (d) tracing the curve of the gradient profile requires subpixel interpolation. (e) Gradient transformation (before and after correspond to before and after imposing the GPP). By imposing an image prior via gradient transformation, the GPP-based image SR achieves the objective of sharpening edges in the reconstructed HR images.

SR image reconstruction via residue refinement
The first category is inspired by the celebrated ResNet architecture. The simple yet elegant idea behind ResNet is to alleviate the notorious vanishing gradient problem via skip connections (mathematically equivalent to predictive coding). This idea is naturally consistent with the objective of SR image reconstruction because missing high-frequency information can be interpreted as residue signals, the target of nonlinear mapping, as shown in Figure 5(a). If we make an analogy between traffic flow (from source to destination) and information flow (from input to output), the construction of the network architecture for SISR shares an objective similar to that of the transportation network. The common objective is to maximize the flow capacity of a transportation network or the amount of residue information in an image reconstruction network.

Many well-cited papers have been published under the framework mentioned previously, as shown in Figure 5(b). Early work such as the deep recursive convolutional network (DRCN) [17], the deep recursive residual network (DRRN) [32], the enhanced deep SR network (EDSR) [22], and the Laplacian pyramid SR network (LapSRN) [18] focused on the construction of network architectures to facilitate the prediction of high-frequency residuals (e.g., via recurrent layers [17], [32] and multiscale decomposition [18], [22]). This line of research was further enhanced by the introduction of the squeeze and excitation module in residual channel attention networks (RCANs) [39] and residual dense networks (RDNs) [40]. Other improvements include considering the error feedback mechanism in deep back-projection networks (DBPNs) [14] and higher order attention mechanisms such as the second-order attention network (SAN) [4].

There are two open questions related to the construction of ResNet-inspired SR networks. First, what is the fundamental limit of this residual refinement strategy? An improved theoretical understanding of what can be learned (i.e., what missing high-frequency information can be recovered?) will offer valuable guidance to the design of a high-order attention mechanism in DNNs. The latest work on iterative refinement with denoising diffusion probabilistic models [15], [29] contains some promising results. The second is related to the interpretability of the network design. From a practical perspective, a transparent design is expected to help strike an improved tradeoff between cost and performance. In our recent work [27], we have presented a model-guided deep unfolding network (MoG-DUN) implementation, which achieves an improved tradeoff between the performance of the SR reconstruction [measured by the peak signal-to-noise ratio (PSNR) values] and the cost (measured by the number of parameters).
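A minimal PyTorch sketch of the skip-connection idea underlying this family of residue-refinement networks; the block below is EDSR-flavored and purely illustrative, whereas the cited models differ in depth, recursion, attention, and normalization choices.

# Minimal residual block: the skip connection lets the convolutions model only the residue.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # identity path + learned high-frequency residue
        return x + self.body(x)

features = torch.randn(1, 64, 32, 32)
print(ResidualBlock()(features).shape)  # torch.Size([1, 64, 32, 32])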
Perceptual optimization via adversarial learning
The second category is inspired by the influential GAN architecture. In the pioneering work of SRGAN [19], the objective of perceptual optimization was achieved by introducing an adversarial loss, which pushes the superresolved image closer to the manifold of natural images. In SRGAN, as shown in Figure 6(a), a dedicated discriminator network is trained to differentiate between superresolved images (the fake sample) and original photorealistic images (the real sample). Note that ideas inspired by ResNet, such as the residual-in-residual dense block, have also been incorporated into SRGAN, further improving the performance of adversarial learning. Other ideas, such as relativistic GAN and improved perceptual loss, have also shown impressive performance improvements in enhanced SRGAN (ESRGAN) [35].

In the context of perceptual optimization, the most controversial issue will be the compromise between improving the details of the image and avoiding the generation of artifacts [21]. Differentiating between texture-like signals and artifact-like noise requires the sophisticated modeling of visual perception by a human vision system. LDL [21] represents the first step toward explicitly discriminating visual artifacts from realistic details. However, LDL requires an HR image as a reference during the discrimination process; how to relax such a requirement (e.g., using a related HR image as a reference) is an interesting topic worthy of further study.
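A sketch of the adversarial term in an SRGAN-style objective, assuming a standard binary real/fake formulation in PyTorch; the published losses additionally include content/perceptual terms (and ESRGAN uses a relativistic variant), both omitted here.

# Adversarial loss sketch: the discriminator separates real HR from superresolved images,
# while the generator is rewarded when its output is classified as real.
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    """d_real/d_fake are discriminator logits for HR (real) and superresolved (fake) batches."""
    real = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real + fake

def generator_adversarial_loss(d_fake):
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

d_real, d_fake = torch.randn(8, 1), torch.randn(8, 1)
print(discriminator_loss(d_real, d_fake).item(), generator_adversarial_loss(d_fake).item())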
Large-factor SISR via PTMs
More recently, PTMs such as StyleGAN have been proposed as a latent bank to improve the restoration quality of large-factor image SR (e.g., PULSE [26]). Unlike existing SISR approaches that attempt to generate realistic textures through adversarial learning, Generative Latent Bank (GLEAN) [3] made a significant departure by directly leveraging rich and diverse priors encapsulated in a PTM. GLEAN also differs from the prevalent GAN inversion methods that require expensive image-specific optimization at runtime because it needs only a single forward pass to generate the SR image.
FIGURE 5. Learning-based SR. (a) An early attempt, such as SRCNN [5], learns a nonlinear mapping from the space of LR images to HR images. (b) The latest advances achieve SR via U-Net-based iterative residue refinement using denoising diffusion probabilistic models [29].
As shown in Figure 7, GLEAN can be easily incorporated into a simple encoder-bank-decoder architecture with multiresolution skip connections, making it versatile with images from various categories. Despite the impressive performance achieved by GLEAN (e.g., as much as 5 dB of PSNR improvement over PULSE on certain classes of images), it still suffers from two fundamental
limitations. First, the performance of GLEAN on real-world LR images has remained poor due to a strong assumption of paired training data. In a real-world scenario, SISR is blind because only unpaired LR and HR images are available for training. Degradation learning plays an equally important role as prior learning.
FIGURE 6. Adversarial learning-based SR. (a) SRGAN [19] uses an HR image as the reference (real) to distinguish it from a reconstructed SR image (fake). (b) Perceptual optimization of the GAN-based SR model [21]. PReLU: parametric rectified linear unit; BN: batch normalization.
How to jointly optimize the interacting components of degradation and prior learning in a Bayesian framework is the key to the next milestone in realistic SR. Recent work has reported some initial success along this line of research [23]. Second, the generality of the GLEAN model makes it suboptimal for a specific class of images (e.g., human faces). It is possible to design a more powerful and optimized generative prior for face images alone (e.g., the generative face prior GFP-GAN).

Most recently, denoising diffusion probabilistic models [15] have been successfully applied to perform SR through a stochastic iterative denoising process in SR3 [29]. The key idea behind SR3 is to iteratively refine the reconstructed HR image with a U-Net architecture trained on denoising at various noise levels and conditioned on the LR input image. SR3 has demonstrated strong SR performance at varying magnification factors and for diverse image contents.

Additional recent advances extend SISR to blind SR through a joint MAP formulation in KULNet [9] and deep constrained least squares (DCLS) [24]. To estimate the unknown kernel and the HR image simultaneously, KULNet introduces uncertainty learning in the latent space to facilitate the estimation of the blur kernel; the joint MAP estimator is then unfolded into a deep CNN-based implementation with a learned Laplacian scale mixture prior and the estimated kernel. DCLS reformulates the degradation model so that the deblurring kernel estimation can be transferred into the space of LR images; the reconstructed feature and the LR image feature are then jointly fed into a dual-path structured SR network to restore the final HR image.
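A conceptual sketch of such iterative refinement, with heavy simplifications: an untrained convolution stands in for the conditional U-Net denoiser and a plain subtraction stands in for the DDPM sampling rule, so only the overall structure (start from noise, repeatedly denoise conditioned on the upsampled LR input) mirrors SR3.

# Toy iterative-refinement loop for SR (structure only; not a faithful DDPM sampler).
import torch
import torch.nn as nn
import torch.nn.functional as F

denoiser = nn.Conv2d(2, 1, kernel_size=3, padding=1)  # placeholder for a trained conditional U-Net

def refine(lr, steps=10):
    cond = F.interpolate(lr, scale_factor=4, mode="bilinear", align_corners=False)
    x = torch.randn_like(cond)                            # start from pure noise
    for t in range(steps, 0, -1):
        noise_level = t / steps
        eps_hat = denoiser(torch.cat([x, cond], dim=1))   # noise prediction, conditioned on LR
        x = x - noise_level * eps_hat                     # simplified update for illustration
    return x

lr = torch.rand(1, 1, 16, 16)
print(refine(lr).shape)  # torch.Size([1, 1, 64, 64])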
Open problems: Arbitrary-ratio SR, reference-based SR, and domain-specific SR
Despite the rapid progress of SR in the last 25 years, there are still many open problems in the field. From the signal processing perspective, we have handpicked the three most significant challenges based on their potential impact in real-world applications. In this section, we will discuss why they are important and what the promising attacks are.

Arbitrary-ratio SR: Beyond integer factors
Most articles published in the literature on SISR have considered only integer factors (e.g., ×2, ×3, ×4). Such integer-factor constraints are simplified situations that make it easier to develop SISR algorithms. In practice, digital zooming often requires noninteger scenarios, e.g., upsampling a 640 × 480 image to 1,024 × 768 will require a fractional factor of 8/5. Meta-SR [16] is one of the few methods that can deal with an arbitrary scaling ratio r (Figure 8). In this method, local projection, weighted prediction, and feature mapping are jointly exploited to implement the noninteger meta-upscale module r. Note that such meta-upscale modules with fractional ratios offer an intellectually appealing alternative to integer-factor upscaling, e.g., 2 = 8/7 × 7/6 × 6/5 × 5/4. Therefore, a particularly promising direction to work with small fractional factors r > 1 is the exploitation of local self-similarity (LSS), as advocated in [11].

There are several open questions related to the development of meta-upscale modules. First, the training dataset is obtained by bicubic resampling of popular DIV2K images (available at https://data.vision.ee.ethz.ch/cvl/DIV2K/). It is desirable to collect a more realistic training dataset by varying the focal length of a digital camera. We believe that the ultimate objective of arbitrary-ratio SR is to provide a cost-effective solution to optical zoom. Second, the design of local projection, weighted prediction, and feature mapping can be optimized end to end. For example, if we consider the dual operator of meta-upscale f(m/n), namely meta-downscale g = f(n/m), the concatenation of f and g should become an identity operator [16]. Third, natural images are characterized by LSS, as shown in [11]. Such an LSS is best preserved for factors close to unity ((m/n) → 1). The question of how to exploit LSS using nonlocal NNs [34] is a fascinating topic.
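The coordinate mapping that any arbitrary-ratio upscaler must perform can be sketched with plain bilinear resampling, as below; Meta-SR [16] replaces the fixed bilinear weights with weights predicted from the fractional offsets, which this illustration does not attempt.

# Bilinear resampling to an arbitrary (possibly fractional) ratio r (illustrative only).
import numpy as np

def resample(img, r):
    h, w = img.shape
    out_h, out_w = int(round(h * r)), int(round(w * r))
    ys = np.clip((np.arange(out_h) + 0.5) / r - 0.5, 0, h - 1)   # project HR grid onto LR grid
    xs = np.clip((np.arange(out_w) + 0.5) / r - 0.5, 0, w - 1)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    tl = img[np.ix_(y0, x0)]; tr = img[np.ix_(y0, x0 + 1)]
    bl = img[np.ix_(y0 + 1, x0)]; br = img[np.ix_(y0 + 1, x0 + 1)]
    return (1 - wy) * ((1 - wx) * tl + wx * tr) + wy * ((1 - wx) * bl + wx * br)

lr = np.random.rand(480, 640)
print(resample(lr, 8 / 5).shape)  # (768, 1024), matching the fractional 640x480 -> 1,024x768 example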
FIGURE 7. SISR via pretrained GAN. GLEAN [3] uses the generator as a dictionary and conditions it on the convolutional features provided by the encoder. With a pretrained GAN that captures the natural image prior to processing, GLEAN is capable of achieving a large-scale SISR (a scaling factor of ×8 is shown).
Reference-based SR via knowledge distillation
Since SISR is an ill-posed inverse problem, it is generally challenging to accurately reconstruct the missing high-frequency details of the unknown HR images from the LR observation. A more plausible approach to recover missing high-frequency details is to "borrow" them from a reference HR image with similar content. With additional help from the reference image, this class of reference-based SR (RefSR), as well as guided image SR [42], has the potential to overcome the fundamental limitations of SISR. A different perspective is to view RefSR as a constrained formulation of example-based SR; instead of working with a whole dataset, we aim at utilizing the most relevant reference (containing similar content) to generate rich textures. The key technical challenge is how to pass on the missing high-frequency details from the teacher (reference HR image) to the student (reconstructed SR image). Similarity Search and Extraction Network (SSEN) [30] represents an example solution to RefSR based on knowledge distillation.
FIGURE 8. SISR with an arbitrary scaling ratio [16]. Note that in real-world applications, the magnification ratio is often not an integer but a fractional number.
FIGURE 9. RefSR. The availability of a reference image provides relevant side information to facilitate the reconstruction of the SR image [30].
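A toy NumPy sketch of the matching-and-borrowing idea behind RefSR; an exhaustive correlation search over a single-channel feature map stands in for the learned, deformable-convolution-based alignment of SSEN [30], whose details follow below.

# For each LR feature patch, find the most similar patch in the reference feature map
# and borrow it (a brute-force stand-in for learned alignment).
import numpy as np

def match_reference(lr_feat, ref_feat, patch=3):
    h, w = lr_feat.shape
    aligned = np.zeros_like(lr_feat)
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            query = lr_feat[i:i + patch, j:j + patch]
            best, best_score = None, -np.inf
            for m in range(ref_feat.shape[0] - patch + 1):
                for n in range(ref_feat.shape[1] - patch + 1):
                    cand = ref_feat[m:m + patch, n:n + patch]
                    score = np.sum(query * cand)          # correlation as a similarity measure
                    if score > best_score:
                        best, best_score = cand, score
            aligned[i:i + patch, j:j + patch] = best      # borrowed (aligned) reference patch
    return aligned

lr_feat, ref_feat = np.random.rand(12, 12), np.random.rand(24, 24)
print(match_reference(lr_feat, ref_feat).shape)  # (12, 12)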
As shown in Figure 9, SSEN uses a Siamese network with shared parameters as the backbone of the teacher-student network. Inspired by the feature alignment capability of deformable convolution, RefSR can be formulated as an integrative reconstruction process of matching similar contents between input and reference features and extracting the reference features (distilled knowledge) in aligned form. Since similar patches can occur at different scales, a multiscale search with a progressively larger receptive field can be achieved by stacking deformable convolution layers. The combination of a multiscale structure with nonlocal blocks makes it convenient to estimate the offsets for deformable convolution kernels. Note that SSEN can also take the input LR image as a self-reference, which is conceptually similar to the idea of self-similarity-based SISR [13].

One of the open challenges in RefSR is the selection of a suitable reference image. Dual-camera zoom in modern smartphone design offers a natural choice in that a pair of images, with different zoomed observations, can be acquired simultaneously. The one with more zoom (telephoto) can serve as a reference for the other with less zoom (short focus). Such a problem formulation of RefSR admits a self-supervised learning-based solution because the telephoto, with proper alignment, serves as a self-supervision reference for a digital zoom of short focus. Another closely related extension of RefSR is from image based to video based. With adjacent frames available, video-based SR faces the challenge of fusing relevant information from multiple reference images. How to jointly optimize the interaction between image alignment and SR reconstruction has remained an under-researched topic.

Domain-specific SR: Connecting domain knowledge with network architecture
The last category for which SR is likely to attract increasing attention is computational imaging in physical and biological sciences. SR imaging is the key to enhancing mankind's vision capability at extreme scales (e.g., nanometers and light years) by breaking the barrier in the physical world. From microscopy to astronomical imaging, domain-specific SR includes a class of customized design challenges where SR imaging has to be jointly optimized with the imaging modality and for specific applications. The central question is how to incorporate domain knowledge (related to the object of interest or the imaging modality itself) into domain-specific SR algorithms.

SR microscopy is best known for winning the 2014 Nobel Prize in Chemistry. The development of superresolved fluorescence microscopy overcomes the barrier of diffraction limits and brings optical microscopy into the nanodimension. SR microscopy has become an indispensable tool for understanding biological functions at the molecular level in the biomedical research community. One can imagine that the great discovery (the double-helix structure of DNA) made by Watson and Crick indirectly using an XRD image would have been almost straightforward if we could directly observe the DNA structure at nanometer scales. From the signal processing perspective, one of the emerging opportunities is multiframe SR image reconstruction [10]. To break the diffraction limit, one can utilize fluorescent probes that switch between active and inactive states so that only a small optically resolvable fraction of the fluorophores is detected in every snapshot. Such a stochastic excitation strategy ensures that the positions of the active sites can be determined with high precision from the center positions of the fluorescent spots. With multiple snapshots of the sample, each capturing a random subset of the object, a final SR image can be reconstructed from the accumulated positions.

Astronomical imaging is another promising domain on the other scale of physics (distances measured by light years) where SR has great potential in practice. In 2019, for the first time, mankind obtained a photo of a black hole [see Figure 10(a)] that was captured by the Event Horizon Telescope. Due to the far distance, it is not trivial to peek at the supermassive black hole in the M87 galaxy, which is 6.5 billion times more massive than our sun. The recent launch of the James Webb Space Telescope has equipped humans with unprecedented capabilities to probe deep space. However, SR techniques, if cleverly combined with optical hardware, can further break the fundamental limit of physical laws (conceptually similar to the microscopic world).
FIGURE 10. Domain-specific SR for astronomical imaging. (a) The first photo of a black hole captured by the Event Horizon Telescope. (b) SR image reconstruction from burst imaging (joint optimization of registration and reconstruction will be needed to suppress undesirable artifacts).
Computational imaging techniques such as SR still have much to offer to transform the practice of observing deep space from Earth. Figure 10(b) illustrates an emerging concept called burst imaging. By trading space with time, one can improve the spatial resolution of an image by combining the information acquired from multiple timings.
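A minimal shift-and-add sketch of multiframe SR in NumPy, assuming the interframe shifts are known and fall on the finer grid; practical burst pipelines estimate subpixel shifts jointly with the reconstruction, as the caption of Figure 10(b) notes.

# Shift-and-add multiframe SR: accumulate shifted, decimated frames onto a finer grid.
import numpy as np

def shift_and_add(frames, shifts, scale=2):
    h, w = frames[0].shape
    acc = np.zeros((h * scale, w * scale))
    cnt = np.zeros_like(acc)
    for frame, (dy, dx) in zip(frames, shifts):
        ys = np.arange(h) * scale + dy                 # shift expressed on the HR grid
        xs = np.arange(w) * scale + dx
        acc[np.ix_(ys, xs)] += frame
        cnt[np.ix_(ys, xs)] += 1
    return acc / np.maximum(cnt, 1)                    # average where samples exist

frames = [np.random.rand(16, 16) for _ in range(4)]
shifts = [(0, 0), (0, 1), (1, 0), (1, 1)]              # half-pixel offsets at the LR scale
print(shift_and_add(frames, shifts).shape)             # (32, 32)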
Conclusion
In this article, we review the evolution of SR technology in the last 25 years from model-based to learning-based SISR. A priori knowledge about HR images, representing the abstraction of 2D data, can be incorporated into the regularization functional in analytical models or loss functions in NNs. Model-based approaches enjoy the benefit of excellent interpretability but suffer from the limitation of a potential mismatch with real-world data. As G. Box once said, "All models are wrong; some of them are useful." On the contrary, learning-based approaches are conceptually closer to the data (but there is still a potential mismatch between training and test data) at the sacrifice of transparency. Perhaps a hybrid approach, combining the strengths of model-based and learning-based paradigms (e.g., [41]), can achieve both good generalization and interpretability.

Looking ahead, what will we see in the next 25 years? Bayesian deep learning provides a new theoretical framework for quantifying various uncertainty factors in deep learning models. By unfolding Bayesian iterative optimization into a DNN-based implementation, we can achieve a principled approach to model and estimate uncertainty for learning-based SR. On the application end, we can foresee that SR technology will reach a higher impact in computational imaging by finding novel applications from the two extreme scales, nanometers and light years. Because most of the information processed by the human brain is visual, we expect that SR imaging will continue to be a key enabling technology in human adventures to the unexplored territories in biological and physical sciences.
Acknowledgment
This work was supported in part by the National Key R&D Program of China under Grant 2018AAA0101400 and the Natural Science Foundation of China under Grant 61991451, Grant 61632019, Grant 61621005, and Grant 61836008. Xin Li's work is partially supported by the NSF Awards CMMI-2146015 and IIS-2114644.
Authors
Xin Li received his B.S. degree (highest honors) in electronic engineering and information science from the University of Science and Technology of China, Hefei, China, in 1996 and his Ph.D. degree in electrical engineering from Princeton University, Princeton, NJ, USA, in 2000. Since January 2003, he has been a faculty member with the Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506-6109 USA. He was a member of the technical staff with Sharp Laboratories of America, Camas, WA, USA, from August 2000 to December 2002. He is a Fellow of IEEE.
Weisheng Dong received his B.S. degree in electronic engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2004 and his Ph.D. degree in circuits and systems from Xidian University, Xi'an, China, in 2010. In 2010, he joined the School of Electronic Engineering, Xidian University, Xi'an 710071, China, as a lecturer, where he has been a professor since 2016. He was a visiting student at Microsoft Research Asia, Beijing, China, in 2006. From 2009 to 2010, he was a research assistant with the Department of Computing, Hong Kong Polytechnic University, Hong Kong. His research interests include inverse problems in image processing, sparse signal representation, and image compression. He was a recipient of the Best Paper Award at SPIE Visual Communication and Image Processing (VCIP) in 2010. He is currently serving as an associate editor of IEEE Transactions on Image Processing. He is a Member of IEEE.

Jinjian Wu received his B.Sc. and Ph.D. degrees from Xidian University, Xi'an, China, in 2008 and 2013, respectively. Since July 2015, he has been an associate professor with the School of Electronic Engineering, Xidian University, Xi'an 710071, China. From September 2011 to March 2013, he was a research assistant at Nanyang Technological University, Singapore. From August 2013 to August 2014, he was a postdoctoral research fellow at Nanyang Technological University. From July 2013 to June 2015, he was a lecturer at Xidian University. His research interests include visual perceptual modeling, saliency estimation, quality evaluation, and just noticeable difference estimation. He has served as the special section chair for IEEE Visual Communications and Image Processing (VCIP) 2017 and section chair/organizer/TPC member for ICME2014-2015, PCM2015-2016, ICIP2015, and QoMEX2016. He was awarded the Best Student Paper at ISCAS 2013. He is a Member of IEEE.

Leida Li received his B.S. and Ph.D. degrees from Xidian University, Xi'an, China, in 2004 and 2009, respectively. He is currently a professor with the School of Artificial Intelligence, Xidian University, Xi'an 710071, China. From 2014 to 2015, he was a visiting research fellow with the Rapid-Rich Object Search (ROSE) Laboratory, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, where he was a senior research fellow from 2016 to 2017. His research interests include multimedia quality assessment, affective computing, information hiding, and image forensics. He was the senior program committee member for IJCAI 2019-2020, session chair for ICMR 2019 and PCM 2015, and TPC for AAAI 2019, ACM MM 2019-2020, ACM MM-Asia 2019, ACII 2019, and PCM 2016. He is an associate editor for the Journal of Visual Communication and Image Representation and the EURASIP Journal on Image and Video Processing. He is a Member of IEEE.

Guangming Shi received his B.S. degree in automatic control in 1985, his M.S. degree in computer control, and his Ph.D. degree in electronic information technology, all from Xidian University in 1988 and 2002, respectively. Presently, he is the deputy director of the School of Electronic Engineering, Xidian University, Xi'an 710071, China, and the academic
leader in the subject of circuits and systems. He joined the School of Electronic Engineering, Xidian University, in 1988. Since 2003, he has been a professor in the School of Electronic Engineering at Xidian University, and in 2004, he became the head of the National Instruction Base of Electrician and Electronic (NIBEE). From 1994 to 1996, as a research assistant, he cooperated with the Department of Electronic Engineering at the University of Hong Kong. From June to December 2004, he studied in the Department of Electronic Engineering, University of Illinois at Urbana Champaign.
References
[20] X. Li and M. T. Orchard, “New edge-directed interpolation,” IEEE Trans. Image Process., vol. 10, no. 10, pp. 1521–1527, Oct. 2001, doi: 10.1109/ 83.951537. [21] J. Liang, H. Zeng, and L. Zhang, “Details or artifacts: A locally discriminative learning approach to realistic image super-resolution,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2022, pp. 5657–5666. [22] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 136 –144, doi: 10.1109/CVPRW. 2017.151. [23] A. Liu, Y. Liu, J. Gu, Y. Qiao, and C. Dong, “Blind image super-resolution: A survey and beyond,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 5, pp. 5461–5480, May 2022, doi: 10.1109/TPAMI.2022.3203009. [24] Z. Luo, H. Huang, L. Yu, Y. Li, H. Fan, and S. Liu, “Deep constrained least squares for blind image super-resolution,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 17,642–17,652 , doi: 10.1109/CVPR52688.2022.01712.
[1] S. Anwar, S. Khan, and N. Barnes, “A deep journey into super-resolution: A survey,” ACM Comput. Sur v., vol. 53, no. 3, pp. 1–34, May 2020, doi: 10.1145/3390462.
[25] Y. Mei, Y. Fan, and Y. Zhou, “Image super-resolution with non-local sparse attention,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 3517–3526, doi: 10.1109/CVPR46437.2021.00352.
[2] S. Baker and T. Kanade, “Limits on super-resolution and how to break them,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1167–1183, Sep. 2002, doi: 10.1109/TPAMI.2002.1033210.
[26] S. Menon, A. Damian, S. Hu, N. Ravi, and C. Rudin, “PULSE: Selfsupervised photo upsampling via latent space exploration of generative models,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 2437–2445, doi: 10.1109/CVPR42600.2020.00251.
[3] K. C. K. Chan, X. Wang, X. Xu, J. Gu, and C. C. Loy, “GLEAN: Generative latent bank for large-factor image super-resolution,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 14,245–14,254, doi: 10.1109/ CVPR46437.2021.01402. [4] T. Dai, J. Cai, Y. Zhang, S.-T. Xia, and L. Zhang, “Second-order attention network for single image super-resolution,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 11,065–11,074, doi: 10.1109/CVPR.2019.01132. [5] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Feb. 2016, doi: 10.1109/TPAMI.2015.2439281. [6] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland: Springer-Verlag, 2016, pp. 391–407, doi: 10.1007/978-3-319-46475-6_25. [7] W. Dong, L. Zhang, G. Shi, and X. Wu, “Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization,” IEEE Trans. Image Process., vol. 20, no. 7, pp. 1838–1857, Jul. 2011, doi: 10.1109/ TIP.2011.2108306. [8] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 12,873–12,883, doi: 10.1109/CVPR46437.2021.01268. [9] Z. Fang, W. Dong, X. Li, J. Wu, L. Li, and G. Shi, “Uncertainty learning in kernel estimation for multi-stage blind image super-resolution,” in Proc. 17th Eur. Conf. Comput. Vis., Tel Aviv, Israel: Springer-Verlag, Oct. 2022, pp. 144–161, doi: 10.1007/978-3-031-19797-0_9. [10] S. Farsiu, M. D. Robinson, M. Elad, and P. Milanfar, “Fast and robust multiframe super resolution,” IEEE Trans. Image Process., vol. 13, no. 10, pp. 1327– 1344, Oct. 2004, doi: 10.1109/TIP.2004.834669. [11] G. Freedman and R. Fattal, “Image and video upscaling from local self-examples,” ACM Trans. Graph ., vol. 30, no. 2, pp. 1–11, Apr. 2011, doi: 10.1145/1944846.1944852. [12] W. T. Freeman, T. R. Jones, and E. C. Pasztor, “Example-based super-resolution,” IEEE Comput. Graph. Appl., vol. 22, no. 2, pp. 56–65, Mar./Apr. 2002, doi: 10.1109/38.988747. [13] D. Glasner, S. Bagon, and M. Irani, “Super-resolution from a single image,” in Proc. IEEE 12th Int. Conf. Comput. Vis., 2009, pp. 349–356, doi: 10.1109/ ICCV.2009.5459271. [14] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1664–1673, doi: 10.1109/CVPR.2018.00179. [15] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proc. Adv. Neural Inf. Process. Syst., 2020, vol. 33, pp. 6840–6851. [16] X. Hu, H. Mu, X. Zhang, Z. Wang, T. Tan, and J. Sun, “Meta-SR: A magnification-arbitrary network for super-resolution,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1575–1584, doi: 10.1109/CVPR.2019.00167. [17] J. Kim, J. K. Lee, and K. M. Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1637–1645, doi: 10.1109/CVPR.2016.181. [18] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep Laplacian pyramid networks for fast and accurate super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 624–632, doi: 10.1109/CVPR.2017.618. [19] C. Ledig et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proc. 
IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4681–4690.
[27] Q. Ning, W. Dong, G. Shi, L. Li, and X. Li, “Accurate and lightweight image super-resolution with model-guided deep unfolding network,” IEEE J. Sel. Topics Signal Process., vol. 15, no. 2, pp. 240 –252, Feb. 2021, doi: 10.1109/ JSTSP.2020.3037516. [28] S. C. Park, M. K. Park, and M. G. Kang, “Super-resolution image reconstruction: A technical overview,” IEEE Signal Process. Mag., vol. 20, no. 3, pp. 21–36, May 2003, doi: 10.1109/MSP.2003.1203207. [29] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 4, pp. 4713– 4726, Apr. 2023, doi: 10.1109/TPAMI. 2022.3204461. [30] G. Shim, J. Park, and I. S. Kweon, “Robust reference-based super-resolution with similarity-aware deformable convolution,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 8425–8434, doi: 10.1109/CVPR42600. 2020.00845. [31] J. Sun, Z. Xu, and H.-Y. Shum, “Image super-resolution using gradient profile prior,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8, doi: 10.1109/CVPR.2008.4587659. [32] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3147– 3155, doi: 10.1109/CVPR.2017.298. [33] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in Proc. Asian Conf. Comput. Vis., Cham, Switzerland: Springer-Verlag, 2015, pp. 111–126. [34] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7794–7803, doi: 10.1109/CVPR.2018.00813. [35] X. Wang et al., “ESRGAN: Enhanced super-resolution generative adversarial networks,” in Proc. Eur. Conf. Comput. Vis. Workshops, 2018, pp. 63–79, doi: 10.1007/978-3-030-11021-5_5. [36] Z. Wang, J. Chen, and S. C. Hoi, “Deep learning for image super-resolution: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 10, pp. 3365–3387, Oct. 2021, doi: 10.1109/TPAMI.2020.2982166. [37] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010, doi: 10.1109/TIP.2010.2050625. [38] X. Zhang and X. Wu, “Image interpolation by adaptive 2-D autoregressive modeling and soft-decision estimation,” IEEE Trans. Image Process., vol. 17, no. 6, pp. 887–896, Jun. 2008, doi: 10.1109/TIP.2008.924279. [39] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 286–301. [40] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2472–2481, doi: 10.1109/CVPR.2018.00262. [41] Z. Zhang, S. Yu, W. Qin, X. Liang, Y. Xie, and G. Cao, “Self-supervised CT super-resolution with hybrid model,” Comput. Biol. Med., vol. 138, Nov. 2021, Art. no. 104775, doi: 10.1016/j.compbiomed.2021.104775. [42] M. Zhou, K. Yan, J. Pan, W. Ren, Q. Xie, and X. Cao, “Memory-augmented deep unfolding network for guided image super-resolution,” Int. J. Comput. Vis., vol. 131, no. 1, pp. 215–242, Jan. 2023, doi: 10.1007/s11263-022-01699-1.
75TH ANNIVERSARY OF SIGNAL PROCESSING SOCIETY SPECIAL ISSUE
Mauro Barni , Patrizio Campisi , Edward J. Delp , Gwenaël Doërr , Jessica Fridrich, Nasir Memon , Fernando Pérez-González , Anderson Rocha , Luisa Verdoliva , and Min Wu
Information Forensics and Security
A quarter-century-long journey
Information forensics and security (IFS) is an active R&D area whose goal is to ensure that people use devices, data, and intellectual properties for authorized purposes and to facilitate the gathering of solid evidence to hold perpetrators accountable. For over a quarter century, since the 1990s, the IFS research area has grown tremendously to address the societal needs of the digital information era. The IEEE Signal Processing Society (SPS) has emerged as an important hub and leader in this area, and this article celebrates some landmark technical contributions. In particular, we highlight the major technological advances by the research community in some selected focus areas in the field during the past 25 years and present future trends.
Introduction
The rapid digitization of society during recent decades has fundamentally disrupted how we interact with media content. How can we trust recorded images/videos/speeches that can be easily manipulated with a piece of software? How can we safeguard the value of copyrighted digital assets when they can be easily cloned without degradation? How can we preserve our privacy when ubiquitous capturing devices that jeopardize our anonymity are present everywhere? How our identity is verified or identified in a group of people has also significantly changed. Biometric identifiers, used at the beginning of the 20th century for criminal investigation and law enforcement purposes, are now routinely employed as a means to automatically recognize people for a much wider range of applications, from banking to electronic documents and from automatic border control systems to consumer electronics. While the issues related to the protection of media content and the security of biometric-based systems can be partly addressed using cryptography-based technologies, complementary signal processing techniques are needed to address them fully. It is those technical challenges that gave birth to the IFS R&D community. Primarily driven, at their early stage, by the need for copyright protection solutions, IFS contributions were published in various venues and journals that were not dedicated to this area. Although some dedicated
conferences (SPIE/IST Conference on Security, Steganography, and Watermarking of Multimedia Contents; ACM Multimedia and Security Workshop; and ACM Workshop on Information Hiding and Multimedia Security) emerged, this nascent community lacked a well-identified forum where researchers, engineers, and practitioners could exchange the latest advances in the area, which is multidisciplinary by nature. A call for contributions to IEEE Transactions on Signal Processing in 2003 attracted enthusiastic responses to fill three special issues on secure media. It was time to create a platform to advance the research and technology development of signal processing-related security and forensic issues.

To foster broader community building and strive for a bigger and lasting impact, a collective effort by a group of volunteer leaders of the SPS charted a road map in 2004 for creating IEEE Transactions on Information Forensics and Security (T-IFS) and a corresponding IFS Technical Committee, both of which were launched in 2006. It was written in the proposal to IEEE that the new journal would aim at examining IFS issues and applications "both through emerging network security architectures and through complementary methods including, but not limited to: biometrics, multimedia security, audio-visual surveillance systems, authentication and control mechanisms, and other means." A few years later, in 2009, the first edition of the IEEE Workshop on Information Forensics and Security was held, in London, U.K.

The IFS community has established a strong presence in the SPS and is attracting submissions from a variety of domains. In view of the page budget allocated to this retrospective article, rather than surveying, exhaustively but briefly, each individual IFS area, we opt for a more focused review of selected domains that experienced major breakthroughs over the past 25 years and that are expected to be more aligned with the technical background of the IEEE Signal Processing Magazine readership. While this choice

text, or something other—to transmit information that could later be recovered robustly, even if the watermarked content had been modified [1].

This new research area rapidly attracted contributions from related domains: perceptual modeling, digital communications, audio/video coding, pattern recognition, and so on. Early watermarking methods used very simple rules, e.g., least-significant-bit replacement, thereby providing almost no robustness to attacks. Significant progress was made when the IFS research community realized that the retrieval of the embedded watermark could be framed as a digital communications problem. A seminal watermarking contribution, coined as spread-spectrum watermarking [2], leverages a military communications model well known for its resilience to jamming. The underlying principle is to spread each watermark bit across many dimensions of the host media content to achieve robustness; for a given bit b ∈ {±1} to be embedded and an input (host) n-dimensional vector x, additive spread-spectrum outputs a watermarked vector y such that

y = x + bw    (1)

where w is an n-dimensional carrier secret to adversaries. Spreading is achieved when n ≫ 1. Due to intentional and inadvertent attacks (e.g., content transcoding), a legitimate decoder (that knows the carrier w) gets access only to a distorted version of y from which it must extract the embedded bit b with the highest possible reliability. By further exploiting connections with statistical detection and coding, it has been possible to derive optimal ways to extract the embedded watermark information for various hosts and increase robustness using channel coding. It should be kept in mind, however, that watermarking deviates from standard communications theory in that
1) The watermark embedding process must remain imperceptible, so standard power constraints (i.e., < w
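A minimal NumPy sketch of additive spread-spectrum embedding per (1), together with a correlation detector; the signal length, carrier amplitude, and attack noise level are illustrative placeholders rather than values from the article.

# Additive spread-spectrum watermarking: embed one bit b over an n-dimensional host with a
# secret carrier w, then detect it by correlating the (possibly distorted) content with w.
import numpy as np

rng = np.random.default_rng(42)
n = 4096
x = rng.standard_normal(n)            # host content (e.g., transform coefficients)
w = rng.standard_normal(n) * 0.05     # secret carrier, small amplitude for imperceptibility
b = -1                                # bit to embed, b in {+1, -1}

y = x + b * w                         # embedding, eq. (1)
y_attacked = y + rng.standard_normal(n) * 0.1   # inadvertent distortion (e.g., transcoding noise)

b_hat = np.sign(np.dot(y_attacked, w))          # correlation detector that knows the carrier w
print(int(b_hat))                               # recovers -1 with high probability when n >> 1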
In view of the page budget allocated to this retstatistical detection and coding, it has been possible to derive rospective article, rather than surveying, exhaustively but optimal ways to extract the embedded watermark information briefly, each individual IFS area, we opt for a more focused for various hosts and increase robustness using channel coding. review of selected domains that experienced major breakIt should be kept in mind, however, that watermarking deviates throughs over the past 25 years and that are expected to be from standard communications theory in that more aligned with the technical background of the IEEE 1) The watermark embedding process must remain impercepSignal Processing Magazine readership. While this choice tible, so standard power constraints (i.e., < w