Cognitive Computing of Visual and Auditory Information (Reports of China’s Basic Research) [1st ed. 2023] 9819932270, 9789819932276

This book discusses fruitful achievements in basic cognitive theories, processing technologies of visual and auditory information, and research platforms.


English Pages 124 [119] Year 2023


Table of contents:
Reports of China’s Basic Research Editorial Board
Editor-in-Chief
Associate Editors
Editors
Preface to the Series
Preface
Embracing the New Era of Artificial Intelligence
Contributors
Contents
1 Program Overview
1.1 Program Introduction
1.2 Overall Scientific Objectives
1.3 Core Scientific Issues
1.4 Program Layout
1.5 Comprehensive Integration
1.6 Situation of Interaction Between Disciplines
1.7 Significant Progress
2 “Cognitive Computing of Visual and Auditory Information”: State of the Art
2.1 Research Status
2.2 Development Tendency
2.2.1 Cognitive Computing Theories and Methods for Visual and Auditory Information Based on Neuromorphic Computing
2.2.2 Enhancing Perception of Visual and Auditory Information in Cognitive Computing Systems
2.2.3 Improved Ability of Discovering Knowledge in Big Data
2.2.4 Enhanced Ability of Natural Language Comprehension
2.2.5 Man-in-the-Loop, Human–Computer Cooperation and Hybrid Augmented Intelligence
2.2.6 Development of More Powerful and Reliable Autonomous Smart Cars (Robots)
2.2.7 Study of Driving Brain with Selective Attention Memory Mechanism
2.2.8 Theoretical Capabilities and Limitations of Cognitive Computing for Visual and Auditory Information
2.2.9 Advances in Hardware Architecture Design for Cognitive Computing of Visual and Auditory Information
2.3 Trend of Field Development
3 Major Research Achievements
3.1 Leaping Development of Basic Theories of Cognitive Computing
3.1.1 Creation and Development of “Global-First” Object-Based Attention Theory
3.1.2 Research on Visual Computing Models and Methods Inspired by Biological Cognitive Mechanism and Characteristics
3.2 Breakthroughs in Key Technologies of the Brain-Computer Interface
3.2.1 Research on Key Technologies and Applications of the Bidirectional Multidimensional Brain-Computer Interface
3.2.2 Research on Theory and Key Technology of Intuitive Bionic Manipulation of Intelligent Prostheses Based on the Brain-Computer Interface
3.3 Breakthroughs in Visual Information Processing
3.3.1 Statistical Learning Method for Visual Attention Based on Scene Understanding
3.3.2 Supporting the Recognition and Understanding of Traffic Signs for Driverless Vehicles
3.3.3 Graphics and Text Detection, Recognition and Understanding of Traffic Signs Under Complex Conditions
3.4 Leaping Development of Natural Language Processing
3.4.1 Research on Semantic Calculation Methods for Chinese Text Comprehension
3.4.2 Theoretical and Methodological Research on Integration Between Human-Like Hearing Mechanism and Deep Learning
3.4.3 Research on Theory and Technology of Speech Separation, Content Analysis and Understanding in Multi-person and Multi-party Dialogue
3.5 Breakthroughs in Theory and Technology Research of Multi-modal Information Processing
3.6 Breakthroughs in Key Technology and Platform of Autonomous Vehicles
3.6.1 Theory, Method and Integration and Comprehensive Experimental Platform of Autonomous Vehicle
3.6.2 Key Technology Integration and Comprehensive Verification Platform for Driverless Vehicles Based on Visual and Auditory Cognitive Mechanism
3.6.3 Intelligent Test Evaluation and Environment Design of Driverless Vehicles
4 Outlook
4.1 Deficiencies and Strategic Needs
4.1.1 Deficiencies
4.1.2 Strategic Needs
4.2 Ideas and Suggestions for In-Depth Study
4.2.1 The Ideas of In-Depth Research
4.2.2 Suggestions for the Next Step
References
Index


Reports of China’s Basic Research

Nanning Zheng Editor

Cognitive Computing of Visual and Auditory Information

Reports of China’s Basic Research

Editor-in-Chief: Wei Yang, National Natural Science Foundation of China, Beijing, China; Zhejiang University, Hangzhou, Zhejiang, China

The National Natural Science Foundation of China (NSFC) was established on February 14, 1986. Upon its establishment, NSFC was an institution directly under the jurisdiction of the State Council, tasked with administering the National Natural Science Fund from the Central Government. In 2018, it came under the management of the Ministry of Science and Technology (MOST) but kept its due independence in operation. Since its establishment, NSFC has comprehensively introduced and implemented a rigorous and objective merit-review system to fulfill its mission of supporting basic research, fostering talented researchers, developing international cooperation, and promoting socioeconomic development. Featuring science, basics, and advances, the series Reports of China’s Basic Research is organized by the NSFC to present the overall level and pattern of China’s basic research, share innovative achievements, and illustrate excellent breakthroughs in key fields. It covers various disciplines including, but not limited to, computer science, materials science, life sciences, engineering, environmental sciences, mathematics, and physics. The series will show the core contents of the final reports of the Major Programs and the Major Research Plans funded by NSFC, and will closely follow the frontiers of basic research developments in China. If you are interested in publishing your book in the series, please contact Qian Xu (Email: [email protected]) and Mengchu Huang (Email: mengchu. [email protected]).


Editor
Nanning Zheng
Xi’an Jiaotong University, Xi’an, China

ISSN 2731-8907 ISSN 2731-8915 (electronic)
Reports of China’s Basic Research
ISBN 978-981-99-3227-6 ISBN 978-981-99-3228-3 (eBook)
https://doi.org/10.1007/978-981-99-3228-3

Jointly published with Zhejiang University Press. The print edition is not for sale in China (Mainland). Customers from China (Mainland) please order the print book from: Zhejiang University Press.

© Zhejiang University Press 2023. This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publishers, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Reports of China’s Basic Research Editorial Board

Editor-in-Chief: Wei Yang

Associate Editors: Ruiping Gao, Yu Han

Editors: Changrui Wang, Qidong Wang, Xuelian Feng, Liexun Yang, Junlin Yang, Liyao Zou, Zhaotian Zhang, Xiangping Zhang, Yongjun Chen, Yanze Zhou, Ruijuan Sun, Jianquan Guo, Longhua Tang, Guoxuan Dong, Zhiyong Han, Ming Li

Preface to the Series

As Lao Tzu said, “A huge tree grows from a tiny seedling; a nine-storied tower rises from a heap of earth”. Basic research is the fundamental approach to fostering innovation-driven development, and its level is an important yardstick for measuring the overall scientific and national strength of a country. Since the beginning of the twenty-first century, China’s overall strength in basic research has been consistently increasing. With respect to input and output, China’s input in basic research increased by 14.8 times, from 5.22 billion yuan in 2001 to 82.29 billion yuan in 2016, with an average annual increase of 20.2%. In the same period, the number of China’s scientific papers included in the Science Citation Index (SCI) increased from fewer than 40,000 to 324,000, and China rose from 6th to 2nd place in the global ranking of published papers. In regard to the quality of output, in 2016, China ranked No. 2 in the world in terms of citations in 9 disciplines, among which materials science ranked No. 1; as of October 2017, China ranked No. 3 in the world in the number of both Highly Cited Papers (top 1%) and Hot Papers (top 0.1%), with the latter accounting for 25.1% of the global total. In talent cultivation, in 2016, China had 175 scientists (136 of whom were from the Chinese mainland) included in Thomson Reuters’ list of Highly Cited Researchers, ranking 4th globally and 1st in Asia. Meanwhile, we should also be keenly aware that China’s basic research still faces great challenges. First, funding for basic research in China is still far less than that in developed countries: only about 5% of China’s R&D funds go to basic research, much lower than the 15%–20% typical of developed countries. Second, China’s capacity for original innovation is insufficient. Major original scientific achievements with global impact are still rare.
Most scientific research projects are follow-ups or imitations of existing research rather than groundbreaking work. Third, the development of disciplines is not balanced, and China’s research level in some disciplines is noticeably lower than the international level: China’s Field-Weighted Citation Impact (FWCI) across disciplines reached only 0.94 in 2016, below the world average of 1.0.


The Chinese government attaches great importance to basic research. In the 13th Five-Year Plan (2016–2020), China established scientific and technological innovation as a priority in all-round innovation and made strategic arrangements to strengthen basic research. General Secretary Xi Jinping put forward a grand blueprint for making China a world-leading power in science and technology in his speech at the National Conference on Scientific and Technological Innovation in 2016, and emphasized at the 19th CPC National Congress on October 18, 2017, that “we should aim for the frontiers of science and technology, strengthen basic research, and make major breakthroughs in pioneering basic research and groundbreaking and original innovations”. Through more than 30 years of unremitting exploration, the National Natural Science Foundation of China (NSFC), one of the main channels for supporting basic research in China, has gradually shaped a funding pattern covering research, talent, tools, and convergence, and has acted vigorously to promote basic frontier research and the growth of scientific research talent, reinforce the building of innovative research teams, deepen regional cooperation and exchanges, and push forward multidisciplinary convergence. As of 2016, nearly 70% of China’s published scientific papers were funded by the NSFC, accounting for one-ninth of all papers published worldwide. Facing the new strategic target of building China into a strong country in science and technology, the NSFC will conscientiously reinforce forward-looking planning and enhance the efficiency of evaluation, so as to achieve the strategic goal of bringing China progressively to the same level as major innovative countries in total research volume, contribution, and groundbreaking researchers by 2050.
The series Advances in China’s Basic Research and the series Reports of China’s Basic Research, proposed and planned by the NSFC, emerged against this background. Featuring science, basics, and advances, the two series are aimed at sharing innovative achievements, showcasing the performance of basic research, and leading breakthroughs in key fields. They closely follow the frontiers of basic research developments in China and publish excellent innovation achievements funded by the NSFC. The series Advances in China’s Basic Research mainly presents the important original achievements of the programs funded by the NSFC and demonstrates the breakthroughs and forward guidance in key research fields; the series Reports of China’s Basic Research shows the core contents of the final reports of Major Programs and Major Research Plans funded by the NSFC, providing a systematic summary of, and a strategic outlook on, the achievements in the NSFC’s funding priorities. We hope not only to comprehensively and systematically introduce the backgrounds, scientific significance, discipline layouts, and frontier breakthroughs of the programs, together with a strategic outlook for subsequent research, but also to summarize innovative ideas, enhance multidisciplinary convergence, foster the continuous development of research in the fields concerned, and promote original discoveries. As Hsun Tzu remarked, “When earth piles up into a mountain, wind and rain will originate thereof. When waters accumulate into a deep pool, dragons will come to live in it”. The series of Advances in China’s Basic Research and Reports of China’s
Basic Research are expected to become the “historical records” of China’s basic research. They will provide researchers with abundant scientific research material and innovative vitality, and will certainly play an active role in making China’s basic research prosper and in building China’s strength in science and technology.

Wei Yang
Academician of the Chinese Academy of Sciences
Beijing, China

Preface

Embracing the New Era of Artificial Intelligence

The research of human visual and auditory cognitive mechanisms is an important part of cognitive science, and the machine understanding and computing of human visual and auditory information has always been a main research topic in the field of artificial intelligence, playing very important roles in the national economy, social development, national security, and beyond. Artificial intelligence aims to make machines understand, think, and learn like humans, that is, to use computers to simulate human intelligence. It covers a wide range of disciplines and areas, such as cognition and reasoning (including various kinds of physical and social common sense), computer vision, natural language understanding and communication (including hearing), and machine learning. Cognitive computing of visual and auditory information is therefore an important research area within artificial intelligence, and understanding its mechanisms and establishing visual and auditory cognitive models is of great significance for the development of frontier theories and core algorithms of artificial intelligence. Artificial intelligence is probably familiar to most people by now, thanks to news stories of recent years: AlphaGo beat a human champion at the game of Go; driverless cars obtained test licenses and hit the road; and many artificial intelligence colleges and research institutes were established in universities. Twenty years ago, however, the public’s understanding of artificial intelligence was neither as deep as it is today nor as deep as one might expect. Even in the early 1990s, when personal computers appeared and became popular, artificial intelligence did not achieve its expected goals and encountered problems such as a significant reduction in research funding, after which it experienced a cold winter.
In 1999, the National Natural Science Foundation of China started the preliminary thinking and top-level design of the major research plan “Cognitive Computing of Visual and Auditory Information”. Because scientists initially did not reach a basic consensus, it took nine years of arduous preparatory research and many discussions before, with the joint efforts of the National Natural Science Foundation of China and
experts, the major research plan of “Cognitive Computing of Visual and Auditory Information” was officially launched in 2008. This major research plan is one of the milestones of fundamental artificial intelligence research in China, marking the official establishment of the artificial intelligence research “national team” in China. At the beginning of this major research plan, in order to ensure the realization of its scientific goals and provide key technological innovations for national security, public safety, and the development of information services and related industries, its major scientific issues were determined as follows: to study and construct new computing models and methods, to improve computers’ ability to understand unstructured visual and auditory perceptual information and their efficiency in processing massive heterogeneous information, and to resolve the bottlenecks and difficulties in image, speech, and text (language) information processing. To this end, it is necessary to start from the visual and auditory cognitive mechanisms of human beings and focus on three core scientific issues: “extraction, expression, and integration of perceptual features”; “machine learning and understanding of perceptual data”; and “collaborative computing of multi-modal information”, which relate to basic scientific problems such as “expression” and “computation” in the cognitive process. Through the study of these three core scientific issues, the major research plan has achieved a series of landmark results in cognitive mechanisms and models, visual and auditory information processing, and natural language (Chinese) understanding. Taking “cognitive computing and brain-computer interface” and “driverless driving and intelligent testing” as breakthrough points, 5 integrated projects were deployed.
According to incomplete statistics, the 10 project teams undertaking the integrated projects have published many papers in international core academic journals across information science, cognitive science, psychology, neuroscience, physics, life science, and other fields, which fully reflects the interdisciplinary character and academic level of the research. In addition, the Information Science Department of the National Natural Science Foundation of China created a relaxed environment for cooperation among experts in different fields through innovations in the management mechanism for implementing the major research plan. It is particularly worth mentioning that, in order to further push research out of the laboratory and produce major original achievements, the major research plan created two competition platforms: the “China Intelligent Vehicle Future Challenge” (IVFC) and the “China Brain-Computer Interface Competition” (BCI). Over the years, 10 IVFC competitions and 2 BCI competitions have been held. This is a new way to verify theoretical results in real physical environments and to solve the problems of complex cognition and intelligent behavioral decision-making in actual environments. It has thus changed the traditional mode of simple paper summarization or laboratory demonstration and promoted the organic combination of fundamental research applications with physically realizable systems. The major research plan has also cultivated a large number of outstanding young and middle-aged talents in computer vision, brain-computer interfaces, driverless technology, artificial intelligence, and related areas. Especially after
10 years of exploration and practice, the IVFC has become an important brand in the research and development of driverless intelligent vehicles in China and has cultivated a large number of outstanding young and middle-aged technological backbones in this field. It is worthy of being called the “Huangpu Military Academy” for the research and development of driverless intelligent vehicles in China. Since its establishment, the major research plan has had an obviously interdisciplinary character, spanning information science, neuroscience, cognitive psychology, mathematical science, and other disciplines. In fact, interdisciplinary research is needed not only to solve scientific problems but also to deal with the profound social problems brought by artificial intelligence. Because artificial intelligence blurs the boundaries among physical reality, data, and individuals, it raises complex ethical, legal, and security issues. The gradual popularization and in-depth application of artificial intelligence will inevitably have a psychological impact on people and create social and cultural risks. This is no longer a problem that can be solved by traditional engineering safety methods. In these fields, therefore, the disciplines of the humanities, society, and philosophy will play a huge role. When this major research plan was established, artificial intelligence had not yet formed the research boom now sweeping the world, which is enough to reflect the academic insight and strategic foresight of the National Natural Science Foundation of China and the related experts.
Looking back at the past 10 years, through the funding of the major research plan, China has developed vigorously in the field of artificial intelligence in terms of theory, method, technology, and application, driving original innovation in information science and technology in China and improving China’s capability for basic innovation and its international influence in the field of artificial intelligence. In particular, this major research plan has produced fruitful achievements in basic cognitive theories, processing technologies of visual and auditory information, and research platforms, providing strong support for the research and development of artificial intelligence in major national projects, playing important roles in national application systems such as unmanned systems and smart cities, and laying a solid foundation for the development of artificial intelligence in China.

Nanning Zheng
Academician of the Chinese Academy of Engineering
Beijing, China

Contributors

Qianshun Chang, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
Lin Chen, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
Dewen Hu, National University of Defense Technology, Changsha, Hunan, China
Deyi Li, Academy of Military Sciences, Beijing, China
Fuchun Sun, Tsinghua University, Beijing, China
Jiaguang Sun, Tsinghua University, Beijing, China
Jingmin Xin, Xi’an Jiaotong University, Xi’an, Shaanxi, China
Jingyu Yang, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
Zheng’an Yao, Sun Yat-sen University, Guangzhou, Guangdong, China
Chengqing Zong, Institute of Automation, Chinese Academy of Sciences, Beijing, China


Chapter 1

Program Overview

1.1 Program Introduction

“Cognitive Computing of Visual and Auditory Information” is a major research plan started by the National Natural Science Foundation of China (NSFC) in 2008, and it is one of the major research plans launched by the NSFC during the 11th Five-Year Plan period. It was jointly organized by the Department of Information Sciences, the Department of Life Sciences, and the Department of Mathematical and Physical Sciences of the NSFC, with a total fund of 190 million yuan. Since its official launch, the major research plan of “Cognitive Computing of Visual and Auditory Information” has funded 539 projects, including 401 cultivation projects, 125 key support projects, and 13 integration projects. The initial thinking, preparation, and top-level design of the major research plan started in 1999. After arduous preliminary research, it was finally and officially launched in 2008. This major research plan is one of the milestones in the development of China’s fundamental artificial intelligence research, marking the official establishment of the “national team” of artificial intelligence research in China. Artificial intelligence is a strategic technology leading the future. Major developed countries regard the development of artificial intelligence as a major strategy to enhance national competitiveness and national security, and are speeding up the formulation of plans and policies. In 2016, the White House Office of Science and Technology Policy of the United States successively released three heavyweight reports: “Preparing for the Future of Artificial Intelligence”, “The National Artificial Intelligence Research and Development Strategic Plan”, and “Artificial Intelligence, Automation, and the Economy”.
In July 2017, the State Council of China released the "New Generation Artificial Intelligence Development Plan" to formulate and implement a national strategy for the development of artificial intelligence, and to carry out overall planning and top-level design for artificial intelligence development at the national level.

© Zhejiang University Press 2023
N. Zheng (ed.), Cognitive Computing of Visual and Auditory Information, Reports of China's Basic Research, https://doi.org/10.1007/978-981-99-3228-3_1

At the beginning of the establishment of this major research




plan, artificial intelligence had not yet formed the worldwide research boom seen today. The establishment of the plan in that year reflects the academic insight and strategic vision of the NSFC and the experts involved.

In recent years, many developed countries have established large-scale research programs to promote research on and applications of brain science. For example, in 2013, the Obama administration announced the BRAIN Initiative (Brain Research through Advancing Innovative Neurotechnologies), which aimed to explore the working mechanism of the human brain, draw a full map of brain activity, and ultimately develop treatments for brain diseases. In the same year, the European Union launched the 10-year Human Brain Project with the participation of 135 cooperating institutions from 26 countries. In 2014, Japanese scientists initiated a national neuroscience research program with an announced investment of 40 billion yen (about 365 million US dollars) over 10 years. In March 2016, China listed "Brain Science and Brain-like Research" as a "Major Project of Scientific and Technological Innovation 2030" in the 13th Five-Year Plan Outline.

However, owing to the complexity of the human brain, progress in brain science and cognitive science has been slow. Among the results published in Nature and Science in related fields over the past 10 years, few original achievements came from scholars on the Chinese mainland. This major research plan starts from the integration of cognitive science and information science; it has made a series of breakthroughs in the field of brain-computer interfaces and published them in PNAS, which is of great significance for enhancing China's international reputation in brain science and cognitive science. It has also played an important and far-reaching role in stabilizing the scientific research team and in cultivating and retaining research talent. As shown in Fig.
1.1, this major research plan started from human visual and auditory cognitive mechanisms and focused on three core scientific problems: the "extraction, expression, and integration of perceptual features"; "machine learning and understanding of perceptual data"; and "collaborative computing of multi-modal information". These relate to the basic scientific problems of "expression" and "computation" in the cognitive process. Its goal is to develop and construct new computing models and algorithms that provide scientific support for improving the computer's ability to understand unstructured perceptual information, as well as its efficiency in processing massive amounts of heterogeneous information.

This major research plan has gathered and cultivated a team of high-level artificial intelligence talents, improved China's basic innovation capability and international influence in the field of artificial intelligence, provided strong scientific support for artificial intelligence research and development in major national projects, and played an important role in national application systems such as driverless vehicles, drones, aerospace, smart cities, and finance.


[Fig. 1.1 lays out the overall structure of the plan. The research objective starts from the human visual and auditory cognitive mechanism, with image and video, speech and text (Chinese) as research objects. The three core scientific problems are the extraction, expression, and integration of perceptual features; machine learning and understanding of perceptual data; and collaborative computing of multi-modal information. These yield computing models of human visual and auditory cognitive mechanisms, understanding of visual and auditory information in driverless scenes, and language cognition and natural language (Chinese) processing. Cross-disciplinary research objects and methods draw on information science, life science, mathematical science, and psychology. Breakthroughs in fundamental theoretical issues are verified on two physical carriers and integration platforms: the driverless driving system and its intelligent test platform, with the China Intelligent Vehicle Future Challenge (IVFC), and the brain-computer interface and rehabilitation training system, with the China Brain-Computer Interface Competition (BCI).]

Fig. 1.1 Overview of the major research plan

1.2 Overall Scientific Objectives

The goals of this major research plan are to focus on the major needs of the country; to give full play to the complementary strengths of information science, life science, and mathematical science; and, starting from human visual and auditory cognitive mechanisms, to study and construct new computing models and methods that improve the computer's ability to understand unstructured visual and auditory perceptual information and its efficiency in processing massive heterogeneous information, thereby overcoming the bottlenecks faced in image, speech, and text (language) information processing. In this way the plan aims to contribute to national security and public safety, promote the development of information services and related industries, and improve people's living standards and health.

Specifically, the objectives are reflected in the following aspects. Significant progress should be made in fundamental theoretical research on visual and auditory information processing. Breakthroughs should be made in three key technologies: collaborative computing of visual and auditory information, understanding of natural language (Chinese), and brain-computer interfaces related to visual and auditory cognition. By integrating these research results, driverless vehicle verification platforms with natural-environment perception and intelligent behavior decision-making capability should be developed, with main performance indicators reaching the world advanced level, so as to enhance China's overall research strength in visual and auditory information processing, cultivate outstanding talents and teams with international influence, and provide a research environment and technical support for national security and social development.



1.3 Core Scientific Issues

To achieve the above overall goals, the major research plan studies the following three core scientific issues.

1. Extraction, expression, and integration of perceptual features

The extraction, expression, and integration of perceptual features are problems common to image, speech, and language information processing, and are the basis for establishing computational models of information processing. Perceptual features are the basic information that humans extract through the perceptual channels. In the case of images, they are features such as depth, motion, texture, and color, rather than abstract mathematical quantities obtained from images by mathematical transformations (such as the wavelet transform). The fundamental reason why computers cannot effectively process perceptual information is that the basic features of real scenes cannot be extracted reliably, and that general representations of these features and theories for effectively integrating features across modalities are lacking. Because these basic problems remain unsolved, computers can handle only simple problems under relatively ideal conditions, and struggle with the complex problems encountered in the real world.

The extraction, expression, and integration of perceptual features are the key steps in transforming perceptual information from unstructured to structured or semi-structured form. Taking visual surveillance as an example, a popular research direction is so-called behavior analysis. However, research in recent years has shown that, apart from some simple behaviors in ideal environments from which meaningful semantic information can be obtained, computers are almost powerless to analyze behaviors in complex real-world scenarios.
The main cause is that, when the background is complex, it is very difficult for computers even to detect moving objects against the background, let alone analyze their behavior. As for the research status of cognitive science, much research has been done on the brain, and the neural mechanisms of feature extraction are partially understood, which lays a certain foundation for establishing computational models of basic feature extraction. However, understanding of how the brain processes high-level semantics is still very unsystematic, and it is difficult to make breakthroughs in the corresponding computational models in the short term. It is therefore technically feasible to position the focus of this major research plan at the level of "extraction, expression, and integration of perceptual features".

2. Machine learning and understanding of perceptual data

Machine learning is an important method of unstructured information processing, which aims to discover rules and mine knowledge from observational data, laying the foundation for comprehensive decision-making based on the processed information. In recent years, although methods such as machine learning and artificial



neural networks have made certain progress in high-dimensional data visualization, feature extraction, data clustering, and feature-subspace analysis, determining the essential dimensionality of unstructured data remains an open problem. Current machine learning research is concerned with the following issues: seeking adaptive, problem-dependent kernel functions; principles and methods of dimensionality reduction for feature-space extraction; and theories and methods for dimensionality-reducing mappings and localized feature extraction for massive data. The unstructured and semi-structured nature of image, speech, and language data makes it difficult for computers to move from the data layer to the semantic layer; building new machine learning methods is an effective way to achieve this transformation. Research on machine learning and understanding of perceptual data faces great challenges at three different levels: model, algorithm, and implementation.

(1) Machine learning and understanding modeling of perceptual data

At the model level, the field developed from early knowledge-based expert systems characterized by sequential reasoning to the model-based computational intelligence, characterized by distributed parallel computing, that came to prevail in the second half of the twentieth century; this trend continues and remains mainstream. A variety of artificial neural network and fuzzy system models play a role in unstructured information processing, and a series of simulated evolutionary algorithms are widely used to search complex unstructured data. Methods based on the statistical framework, such as Bayesian networks, hidden Markov models, and random field models, are among the most representative modern results.
In addition, kernel methods, represented by the support vector machine (SVM), can effectively extend many traditional linear models to the nonlinear case and partially overcome the "curse of dimensionality" in the traditional sense. The principle of structural risk minimization (SRM) embodied in these methods still plays an irreplaceable role in guiding the construction of new data models. The above models and methods have the following two limitations: a single fixed model often cannot solve well the problems faced by unstructured information processing, and the choice of appropriate structural parameters (such as the number of hidden-layer units in a neural network, the membership functions of a fuzzy system, or the kernel function in an SVM) still lacks a theoretical basis or practical guidance. It is especially worth noting that existing models and methods basically stay at the data level; how to establish new learning models and methods that realize, at the model level, the transformation from data computation to semantic computation remains a key unsolved issue.

(2) Construction and verification of new efficient algorithms

At the algorithm level, mainstream research is developing rapidly along the direction "empirical risk minimization (ERM) → structural risk minimization (SRM) → cross-validation stability". This development has benefited from the two most outstanding results of modern learning theory: Vapnik's statistical learning theory and the cross-validation stability theory of learning algorithms established by Poggio et al.
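The role of the kernel function discussed above can be seen in a minimal synthetic sketch, not drawn from the plan's projects: kernel ridge regression, a close relative of the SVM, separates XOR-patterned data that defeats any linear model. All data and parameter values here are illustrative.

```python
import numpy as np

# Toy data with an XOR pattern: no linear decision function can separate it.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)  # XOR-style labels

def gram(A, B, kind="rbf", gamma=2.0):
    """Gram matrix for a linear or Gaussian (RBF) kernel."""
    if kind == "linear":
        return A @ B.T
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def fit_predict(kind):
    # Kernel ridge regression used as a classifier (sign of the score).
    # Swapping the kernel is the only change needed to go nonlinear.
    K = gram(X, X, kind)
    alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X)), y)
    return np.sign(gram(X, X, kind) @ alpha)

acc_linear = (fit_predict("linear") == y).mean()  # near chance on XOR data
acc_rbf = (fit_predict("rbf") == y).mean()        # high with the RBF kernel
```

The structural parameters mentioned in the text (here `gamma` and the ridge strength) are exactly the quantities whose choice, as the paragraph above notes, still lacks a theoretical basis.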



Besides learning theory, active learning and constrained learning are drawing wide attention in algorithm design. However, how to find multidisciplinary support for algorithm design so as to form new, more efficient machine learning algorithms; how to select the optimal algorithm for a specific problem to achieve the best effect; and how to verify these algorithms fully, effectively, and reasonably are still unresolved key issues.

(3) Self-adaptation of perceptual-data understanding systems

At the implementation level, perceptual information processing systems are usually constructed for specific purposes. This specificity limits the system's adaptability to changes in the objects it handles, as well as its portability and extensibility to other domains. How to realize dynamic adaptation and capability evolution of a system as its domain changes is therefore the key problem at this level.

3. Collaborative computing of multi-modal information

In a sense, the "excellence" of human beings is reflected in a high degree of coordination among systems, and the basic condition of such coordination is the fusion and integration of visual, auditory, and other perceptual channels. With the development of sensing, storage, and transmission technology, more and more perceptual information appears in multi-modal form. Multi-modal information can compensate for the uncertainty and incompleteness of single-modal information, improve perception of the environment, and express richer content. Existing information processing methods, however, are mainly single-modal, and the processing of multi-modal information basically stays at decision-level fusion of the outputs of separate single-modal processing pipelines.
Studying the coordination mechanisms of multi-modal information and proposing corresponding computational models is an effective way to improve perceptual information processing. Military applications and intelligent robots are two typical fields of multi-modal information integration. A prominent feature of modern warfare is the use of various detection means to obtain battlefield data, including synthetic aperture radar (SAR), multi-spectral imaging radar, inertial navigation systems (INS), the Global Positioning System (GPS), and other sensors that collect a variety of heterogeneous data; analyzing, synthesizing, or fusing these data serves interpretation, judgment, and real-time decision-making. How to adopt multi-modal information fusion to improve a weapon system's environmental perception and the precise positioning of tracked or targeted objects has become a top priority in national defense. In the field of intelligent robots, laser, infrared, ultrasonic, and tactile sensors have become basic sensing devices on the robot body, yet there is still no truly effective computational framework for making comprehensive use of this sensing information.
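The decision-level fusion described above can be sketched minimally (all probabilities are synthetic, for illustration only): each modality produces class posteriors independently, and only these final outputs are combined.

```python
import numpy as np

# Decision-level fusion: each modality classifies on its own, and fusion
# touches only the final per-class probabilities, not the raw signals.
p_vision = np.array([0.45, 0.55])   # ambiguous visual evidence (synthetic)
p_audio = np.array([0.90, 0.10])    # confident auditory evidence (synthetic)

def fuse(probs, weights):
    """Weighted average of per-modality class posteriors."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    fused = sum(wi * pi for wi, pi in zip(w, probs))
    return fused / fused.sum()

p_fused = fuse([p_vision, p_audio], weights=[0.5, 0.5])
decision = int(np.argmax(p_fused))
```

Here the confident auditory channel overrides the ambiguous visual one, illustrating how a second modality can compensate for the uncertainty of the first; the limitation the text points out is that nothing earlier than these final posteriors is ever shared between the channels.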



1.4 Program Layout

With the main goal of improving the computer's ability to understand complex perceptual information and its efficiency in processing massive heterogeneous information, this major research plan takes image, speech, and text (language) information related to visual and auditory perception as its research objects. Starting from human visual and auditory cognitive mechanisms, it establishes new computing models to improve computing capability and change processing methods, contributing to the major needs of social development, the national economy, and national security.

Moreover, the results of cognitive mechanism research need to be verified by biological experiments, while computing models inspired by, abstracted from, and simplifying cognitive mechanisms can only be verified by artificial systems. For these reasons, the plan built two major integrated innovation verification platforms, driverless driving and brain-computer interface, which not only verify the results of the relevant key support and cultivation projects, but also provide physical carriers through which theoretical results can be validated in the real world. Under this plan the "China Intelligent Vehicle Future Challenge" has been held successfully for 10 consecutive years, and the integrated verification platform for driverless intelligent cars and related technologies has become an important, comprehensive verification carrier for China's smart cars and smart transportation, and for current artificial intelligence research in particular.
The characteristic of this major research plan thus lies in using scientific tasks to drive basic research on visual and auditory cognitive mechanisms and breakthroughs in related key technologies, while taking driverless intelligent vehicles and brain-computer interfaces as the physical carriers on which the plan's research results are centrally verified and demonstrated.

Since its official launch in 2008, the plan has followed the general approach of "limited goals, stable support, integrated sublimation, and leaping development", taking the "expression" and "computation" of visual and auditory information as the research keys. It has focused on the core scientific issues of "extraction, expression, and integration of perceptual features"; "machine learning and understanding of perceptual data"; and "collaborative computing of multi-modal information". On this basis it determined priority research directions and carried out innovative research on frontier foundations, key technologies, and integrative sublimation across interdisciplinary fields such as information science, life science, and mathematical and physical science. The principles for selecting research directions were to give priority to research with innovative ideas on the core scientific issues, a good foundation, mature conditions, and prospects of near-term breakthroughs, as well as to interdisciplinary integrated research; these choices play a decisive role in achieving the overall objectives of the plan. Project approval under the plan was divided into two stages.



During the first four years (2008–2011), goal- and needs-oriented annual project guidelines were issued and research work was deployed early. According to innovativeness, research value, and contribution to the overall goal of the plan, projects were funded as cultivation projects, key support projects, or integration projects (the types differ in funding strength and in the goals to be achieved). A mid-term evaluation of the plan was carried out in the fifth year of implementation (2012), and partial adjustments were made to the implementation of the plan according to its results. In the four years after the mid-term evaluation (2013–2016), the plan favored research directions for project approval that play a decisive role in achieving the overall goals. Building on the achievements of the earlier cultivation and key support projects, it focused on integrated innovation research and provided greater support through combinations of integration projects and key support projects. In the last two years in particular, the whole plan carried out the inheritance and sublimation of projects and results.

This major research plan has achieved a series of landmark results in cognitive mechanisms and models, visual and auditory information processing, and natural language (Chinese) understanding. In terms of visual cognitive mechanisms, a topological definition of perceptual objects and a topological explanation of the attentional blink were proposed; in visual and auditory information processing and computing, a new computational model of visual-attention statistical learning and a new theory of salient object detection were established.
In Chinese natural language understanding, a new theoretical framework for semantic computing was created, and a series of language interaction systems for public security were successfully developed. In particular, to push the research out of the laboratory and toward major original results, the plan created two competition platforms: the "China Intelligent Vehicle Future Challenge", held for 10 sessions, and the "China Brain-Computer Interface Competition", held for 2 sessions. By verifying theoretical results in real physical environments and solving problems of complex cognition and intelligent behavior decision-making in actual settings, the plan has moved beyond the traditional mode of paper summaries and laboratory demonstrations, promoting the organic combination of basic research and physically realizable systems.

1.5 Comprehensive Integration

Taking "cognitive computing and brain-computer interface" and "driverless driving and intelligent testing" as both entry points and destinations, this major research plan has always paid attention to the comprehensive integration of its research results. A total of 5 integration projects were deployed, focusing



on supporting and cultivating the plan's superior research teams and preliminary research results. The main comprehensive integration work is as follows.

1. Comprehensive integration of cognitive computing and brain-computer interface

(1) Mobile rehabilitation system based on cognitive mechanisms and non-invasive brain-computer interface technology

This integration project takes as its overall goal a mobile rehabilitation system for motor rehabilitation and assistance of disabled persons, combining visual and auditory perceptual signal processing with EEG-based control. It improves the motor function of disabled persons and helps them carry out rehabilitation training; explores theory and technology related to visual and auditory cognitive mechanisms and non-invasive brain-computer interfaces and their application in rehabilitation systems; and realizes the organic combination of advanced brain-computer interface technology with motor function rehabilitation. The project researched practical, convenient, interference-resistant brain signal acquisition techniques; efficient brain signal decoding algorithms with multi-modal brain signal fusion; reliable brain-computer interface paradigms with high human-machine interaction efficiency for operating the brain-controlled rehabilitation manipulator; and rehabilitation methods for disabled persons' motor function based on brain-computer interface technology. A mobile platform system and a mechanical arm system for rehabilitation training were designed and developed to assist disabled persons in movement; the system's response time, the stability and plasticity of its assisted movement behavior, and its resistance to environmental interference reach the world's leading level overall.
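As a toy illustration of what the brain signal decoding mentioned above involves (fully synthetic data; not the project's actual algorithm), the sketch below detects motor imagery from the drop in mu-band (8–12 Hz) power over sensorimotor cortex, a standard effect known as event-related desynchronization (ERD):

```python
import numpy as np

fs = 250  # sampling rate in Hz, typical of EEG amplifiers
t = np.arange(0, 2.0, 1.0 / fs)
rng = np.random.default_rng(1)

# Synthetic single-channel EEG: rest shows a strong 10 Hz mu rhythm,
# motor imagery shows the same rhythm suppressed (ERD), plus noise.
rest = 2.0 * np.sin(2 * np.pi * 10 * t) + rng.normal(0, 1, t.size)
imagery = 0.3 * np.sin(2 * np.pi * 10 * t) + rng.normal(0, 1, t.size)

def band_power(x, fs, lo, hi):
    """Average spectral power of x in the [lo, hi] Hz band."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, 1.0 / fs)
    sel = (freqs >= lo) & (freqs <= hi)
    return spec[sel].mean()

p_rest = band_power(rest, fs, 8, 12)
p_imag = band_power(imagery, fs, 8, 12)
detected_imagery = p_imag < 0.5 * p_rest  # ERD: mu power drops during imagery
```

Real decoders in such systems must of course cope with multi-channel data, artifacts, and inter-subject variability; the band-power feature here is only the simplest building block of such a pipeline.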
Through further study of brain-computer control theory and human-machine collaborative control methods, the project realized motion navigation and target-operation functions of the rehabilitation system based on brain-computer interface technology, optimized the above brain signal processing, human-machine system theory, and experimental paradigms through extensive scientific verification, and brought the developed rehabilitation system to clinical application.

(2) Key technology research and system integration verification of brain-controlled rehabilitation robots based on the reconstruction of visual- and auditory-induced motor nerve channels

As shown in Fig. 1.2, this integration project completed three research tasks: the inducing mechanism and brain-computer interface of audio-visual-motor psychological synergy based on biological motion; multi-source noise enhancement and recognition of weak psychological signals based on stochastic resonance; and brain-computer interaction mechanisms and collaborative rehabilitation technology induced by biological motion imagery. Together with cooperating units, the project completed the integration and verification of brain-controlled rehabilitation robots based on the reconstruction of visual- and auditory-induced motor nerve channels, including limb electromyography drive control and


[Fig. 1.2 depicts the integration platform: audiovisual-motor center stimulus induction and stochastic-resonance noise enhancement of brain stimulation feed the research on key technologies, including brain-controlled movement navigation, EEG-EMG upper limb rehabilitation, motion navigation for the upper and lower limb rehabilitation system, brain health assessment based on brain-computer information fusion, and brain-computer interactive collaborative rehabilitation of the lower limbs, integrated into a motion navigation and upper and lower limb rehabilitation platform.]

Fig. 1.2 Integration platform of the brain-controlled rehabilitation robot system based on the reconstruction of visual- and auditory-induced motor nerve channels

rehabilitation techniques based on brain-computer information decoding; rehabilitation training evaluation and optimization based on brain-computer fusion information; and clinical verification of the brain-controlled rehabilitation robots. On the basis of the corresponding technical and theoretical research, the project developed a brain-controlled intelligent wheelchair and a Chinese spelling system, designed a low-cost lightweight rehabilitation manipulator system, developed a brain-controlled active-passive cooperative lower-limb rehabilitation training robot, and built a verification platform for the key technologies of brain-controlled rehabilitation robots based on the reconstruction of visual- and auditory-induced motor nerve channels.

The verification system developed by the project has been deployed demonstratively in several hospitals, including Xijing Hospital and the Shaanxi Hospital of Traditional Chinese Medicine. Comparative tests at Xijing Hospital showed that, compared with traditional rehabilitation methods, patients using this innovative rehabilitation technology improved the function of paralyzed limbs by more than 20%, greatly improving the efficiency and effectiveness of rehabilitation training after brain injury, and the system received good evaluations. The brain-controlled speller won first place in the second China Brain-Computer Interface Competition sponsored by the NSFC. Brain-controlled lower



limb rehabilitation training robots and myoelectrically controlled biological prostheses were reported by CCTV and many other media, and brain-controlled wheelchairs and muscle-controlled artificial hands were repeatedly invited to large exhibitions such as the Shenzhen High-Tech Fair and the Western Trade Fair, with good social impact.

2. Comprehensive integration of the driverless intelligent vehicle verification platform and intelligent testing and evaluation

(1) Key technologies and integrated verification platform for autonomous driving vehicles

The project's integrated verification platform carried out simulation and experimental testing, conducted autonomous driving experiments in open city, intercity, and highway road environments, and completed the integration and joint debugging of the various cooperating units. In basic theory and methods, the project carried out in-depth research on new theories and algorithms for environmental perception, decision-making, and path-tracking control of autonomous vehicles, as well as brain-computer interfaces and auditory information processing. Important progress was made in many directions: visual attention modeling based on frequency-domain scale analysis and its application to scene understanding; comprehensive environmental perception models and methods based on road-structure constraints; new traffic sign recognition methods based on block kernel-function features; multi-objective reinforcement learning methods for optimizing driving strategies; accurate recognition and tracking of road markings in complex scenes; vehicle speed tracking algorithms based on model predictive control; fast information input based on non-invasive brain-computer interfaces; and sound context-based environmental state recognition methods.
The completed project focused on basic scientific problems such as environmental perception and understanding for driverless vehicles and intelligent decision-making and control in the normal traffic environments of urban and intercity roads (including branch roads, primary roads, ring roads, expressways, and highways), and carried out innovative, comprehensive integrated research spanning theory, key technology, and engineering optimization. It also integrated relevant research results from the brain-computer interface field and built an integrated driverless vehicle verification platform with an open, modular architecture (as shown in Figs. 1.3, 1.4 and 1.5), realizing long-distance autonomous driving in normal traffic environments and weather with a human intervention ratio below 2%.

(2) Key technologies and platforms of driverless vehicles for comprehensive urban environments

This integration project established an open, modular driverless vehicle architecture with four levels (perception, decision-making, control, and execution) and developed a driverless vehicle test system covering test content, test methods, and evaluation indicators. It constructed an automatic fault diagnosis system for driverless vehicles covering the component, subsystem, and system layers, and developed a dual-way redundant



[Fig. 1.3 shows the sensor configuration of the platform: a video camera together with single-line, 4-line, and 32-line lidars and a millimeter-wave radar.]

Fig. 1.3 Integrated platform for autonomous driving vehicle based on HQ-3

Fig. 1.4 A rear-end-collision and lane-deviation warning system with independent intellectual property rights, developed jointly with FAW Group (labeled functions: parking guidance line, self-parking, and lane-deviation warning)

driverless controller. As shown in Fig. 1.6, many platforms were developed, including the Intelligent Pioneer, Kuafu, GAC Trumpchi, and Jianghuai IEV6S driverless vehicles and the Inner Mongolia First Machinery Group 4 × 4 driverless light tactical vehicle. Experiments and tests were conducted in Hefei, Guangzhou, Xi’an, Baotou, and other cities, and the integration project also participated in the China IVFC sponsored by the NSFC, completing a total of more than 2,000 km of autonomous


Fig. 1.5 An autonomous driving system developed on the platform of Dongfeng Fengshen AX-7

driving. Under third-party software monitoring, 86 autonomous driving tests were conducted on various urban roads with a cumulative testing mileage of 1,228 km, completing the predetermined development target.

(3) Intelligent testing, evaluation, and environment design for driverless vehicles

The integrated project completed the integration of intelligent test evaluation and environment design for driverless vehicles in terms of test theory, test technology, site construction, services for the major research plan, and personnel training.

Fig. 1.6 Integrated verification platform for major projects of visual and auditory cognitive computing (labeled platforms: Intelligent Pioneer II, Intelligent Pioneer III, Kuafu I, IEV6S driverless car, 4 × 4 driverless car, GA5 driverless vehicle of GAC Motor, security patrol platform, and production inspection platform)


Firstly, by fusing the two existing definitions of driverless-vehicle intelligence (scenario-based testing versus human-like intelligence), the connections among driverless intelligence evaluation, task evaluation, and capability evaluation were revealed; a hierarchical task-representation system for driverless intelligence was established; and the test system and specific evaluation methods were defined.

Secondly, environment-design technology was studied for field tests and simulation tests of driverless vehicles, used to test and evaluate the individual and integrated intelligence of driverless vehicles. In addition, as shown in Figs. 1.7 and 1.8, the project proposed a parallel test architecture that realizes collaborative testing between the field test base and a simulation test space called the “digital special-nine-grid”, improving the coverage, pertinence, and flexibility of testing.

Thirdly, the on-site test base and the intelligent comprehensive testing and verification platform for driverless vehicles provided venues to meet the intelligent testing and competition needs of driverless vehicles over the next five years and to integrate with a real urban intelligent-traffic area. The site area is larger than 2 km² and includes all kinds of typical traffic scenarios. Based on V2X technology, it realizes real-time collection of test data from driverless vehicles, reduces test cost, and improves the scientific rigor and fairness of evaluation. In addition, a driverless simulation testing platform was built, which provides offline testing and visualizes the cognitive intelligence of driverless vehicles. The proposed TSD-max real traffic scene dataset can meet the needs of 10,000 km of driverless offline testing.

The project provided a number of guarantee services around the overall goal of this major research plan.
In 2016 and 2017, the platform assisted the NSFC and the guidance expert group of the major research plan in successfully holding the 2016 and 2017 China IVFC, and held two sessions of offline tests of basic visual-information environment-cognition ability at the same time. In addition, the project provided technical support for on-site and offline testing for the preliminary acceptance of

Fig. 1.7 Virtual test environment (labeled elements: physics nine-grid and digital nine-grid; high-precision road-network data, reorganization of road space, scene point-cloud data, labeling, scene video, scene-video generation, fixed testing site, and limited scene changes)

Fig. 1.8 “Special-nine-grid” intelligent car test site (labeled scenarios: starting point, end point, terminal parking spot, U-turn, T-junction with two-phase light, four-phase light intersection with right-turn phase, stops for passengers, construction enclosure, school speed limit, simulated tunnel, roadside pedestrians, operating vehicles, simulated non-motor vehicles, simple construction road, and semi-closed construction; scale 1:2,000)

the unmanned driving verification platform integration project of the major research plan.

1.6 Situation of Interaction Between Disciplines

The implementation process and achievements of this major research plan fully reflected their multidisciplinary character. According to incomplete statistics from 10 project teams, papers were published in core international academic journals spanning information science, cognitive science, psychology, neuroscience, physics, and life science, fully reflecting the multidisciplinary character and academic level of the research work.

1. Combination of computational vision with biological vision

The visual attention mechanism is an important feature of biological vision. Early research focused on psychology, cognitive science, and neurophysiology; after the 1980s, the topic attracted scholars in fields such as computer vision and artificial intelligence. In response to the demand for visual navigation of autonomous vehicles, several research groups carried out in-depth, multidisciplinary research combining computational vision and biological vision and achieved a number of important academic results.


In human-machine driving-model fusion research, Tsinghua University studied the cognitive mechanisms of driver perceptual information processing and fusion. Jilin University simulated the cognitive processing mechanism by which real drivers form expected trajectories and studied local path planning for driverless vehicles. The National University of Defense Technology studied the modeling of the visual attention mechanism and proposed an evaluation method for such models.

2. Intersection of geographic information science and computational intelligence

Wuhan University, drawing on its strength in surveying and mapping science, made full use of urban geographic information databases (including vehicle driving-road information) and high-precision GPS dynamic positioning to provide a variety of prior knowledge for multi-sensor heterogeneous information cognition and smart-car dynamic driving planning. These applications greatly reduce the difficulty of the related processing and analysis and improve efficiency.

3. Overlapping and integration of scientific theory, technology, and engineering

The core scientific problem of driverless-vehicle (i.e., intelligent-vehicle) research is the cognition of visual and auditory information, a problem that itself involves information science, cognitive science, psychology, and neurobiology. A driverless vehicle is also a physical system designed and implemented on the basis of these multidisciplinary scientific principles. It therefore involves a series of technical and engineering problems including sensors, control systems, network communication, computing software, vehicle control, system integration, testing, and evaluation. The predetermined goals can thus be achieved only by organically combining scientific principles with engineering.

4.
Intersection of information science and biology

The creation and development of the “Global-First” object-based attention theory comes from the key project “Cognitive Mechanism and Neural Expression of Attention”, undertaken by the Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, which focuses on the input-output characteristics, spatiotemporal transfer function, and neural mechanisms of the non-classical receptive field, as well as information processing and feature extraction for large numbers of complex images. On this basis, the corresponding non-classical receptive-field computing model can be used to solve basic problems in computer vision. The project embodies the intersection of information science and biology in the modeling and verification of visual and auditory cognitive computing.

The key project “Neural Network Model and Verification Experiment of Visual Function Organization Mechanism”, undertaken by Fudan University, is also representative of this intersection. A wide gap separates machine-learning algorithms, neural computing models, and brain-function animal experiments in the information field, and interdisciplinary research will promote the in-depth intersection and integration of “use brain”,


“simulate brain”, and “know brain”. To meet the urgent requirements of visual-cognition foundations and neural-expression theory, this research applies unsupervised machine learning driven by natural images to self-organizing neural computing networks. The model can predict the visual-cortex response induced by unknown stimuli; it is tested accurately against the living animal’s visual cortex using techniques such as intrinsic-signal optical imaging, two-photon imaging, and electrophysiological recording, enabling two-way discussion and verification between model and experiment on the expression mechanisms of visual information. At the same time, the study used the inherent regularities of the computational model to derive brain-function maps with fine organizational rules from the noisy brain-imaging signals measured in experiments and verified them with a variety of imaging methods. In addition, with the neural computing model at its core, it explored the functional organization of the visual cortex, opened an interface for integrating the three fields in practice, considered their mutual reinforcement in applications, and promoted the in-depth study of “cognitive computing of visual and auditory information” through close multidisciplinary intersection.

5. Intersection of information science, life science, and medicine

“Basic Theory and Key Technology of Stroke Assistance Rehabilitation Robot Based on Brain-Computer-Muscle Information Loop” is a key project undertaken by Tianjin University and a typical representative of the intersection of information science, life science, and medicine. Stroke is a common, high-incidence disease that seriously endangers human life and health, and exploring new mechanisms and rehabilitation treatments has become a worldwide focus, especially in China. Research on the rehabilitation mechanism of brain plasticity is also a frontier hotspot of current neuroscience.
Focusing on the problem that existing rehabilitation training cannot effectively use subjective intention to induce coordinated brain-muscle movement, the project studied the neural-information fusion mechanism of brain plasticity in stroke rehabilitation by combining functional electrical stimulation with mechanical exoskeleton technology and developed new electromechanical rehabilitation-assistance methods. Based on frontier BCI technology, the project established an interactive brain-computer-muscle information-loop model for a walking-aid rehabilitation neuro-robot; carried out detection and analysis of plasticity changes based on brain-function networks and EEG-EMG coherence; explored key measures to improve close human-computer information interaction and the induction of central-nervous plasticity; enhanced the brain-computer-muscle coordinated functional rehabilitation of patients; and established objective standards for new rehabilitation evaluation. This study will provide a new scientific basis and application model for the in-depth study of central-nervous plasticity mechanisms in rehabilitation and lay a key technical foundation for developing a new brain-computer-muscle


cooperative rehabilitation robot system with multiple functions such as intelligent control and safety monitoring.

6. Intersection of information science, life science, and mathematics

The project “Key Technology and Integration Verification Platform for Driverless Vehicle in Rural Road Environment”, undertaken by Nanjing University of Science & Technology, is representative of the intersection of information science, life science, and mathematics. The project studied the key technologies of driverless vehicles in the rural road environment; its core problem was the image and visual-information processing needed for environmental understanding, focusing mainly on understanding and modeling with multi-sensor integration in rural road environments. The research exploited the intersection of information science, life science, and mathematics and constructed new computing models and methods starting from human visual-cognition mechanisms. It improved the computer’s ability to understand unstructured visual-perception information and the processing efficiency for heterogeneous information, overcame bottlenecks in image-information processing, and made important progress in the basic theory of visual-information processing. In addition, the project conducted in-depth research on other key technologies of the driverless-vehicle verification platform, including the driverless-vehicle architecture and intelligent decision system, path- and behavior-planning technology in rural road environments, and system-integration technology. A driverless-vehicle verification platform was built to integrate and verify the above research results, and its main performance indexes reached the world’s advanced level.

7. Fusion of psychology and driverless-vehicle simulation

The integration of cognitive computing and driverless research is inseparable from the support of psychology.
The key project “Key Technologies of Artificial Cognition and Integrated Verification Platform for Driverless Vehicles”, undertaken by Tsinghua University, addressed the three key scientific issues of sensing, perception, and decision-making involved in developing an integrated verification platform for driverless vehicles. It drew on new mechanisms and models for processing and fusing human driving-perception information, models of attention focusing and selection based on human cognitive mechanisms in the driving environment, and cognitive-psychology models for overall human-vehicle-road decision-making. Three key technologies were studied: (i) in-vehicle multi-modal perception-data fusion and new vehicle-model fusion based on human driving-perception information processing; (ii) a driving-intention-oriented ACS system structure and selective perceptual-information processing; (iii) cognitive-psychology decision-optimization methods guided by learning models and advanced cognitive models based on overall human-vehicle-road decision-making. On this basis, and especially around the key technology of artificial cognition, a new-generation driverless-vehicle integrated verification platform with artificial-cognition capability was developed, and field tests under highway conditions were carried out.
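As a toy illustration of the multi-modal perception-data fusion idea in item (i) above (not the project's actual fusion algorithm), two independent noisy estimates of the same quantity, say a lidar range and a camera-derived range, can be combined by inverse-variance weighting. The sensor names and numbers below are hypothetical.

```python
def fuse_estimates(x1, var1, x2, var2):
    """Minimum-variance (inverse-variance weighted) fusion of two
    independent estimates of the same quantity.

    Returns (fused estimate, fused variance); the fused variance is
    always smaller than either input variance.
    """
    w1, w2 = 1.0 / var1, 1.0 / var2
    fused = (w1 * x1 + w2 * x2) / (w1 + w2)
    return fused, 1.0 / (w1 + w2)

# Hypothetical readings: a lidar range of 10.2 m (variance 0.04) and a
# camera-derived range of 9.8 m (variance 0.16).
distance, variance = fuse_estimates(10.2, 0.04, 9.8, 0.16)
# distance ≈ 10.12 m, variance ≈ 0.032
```

The fused estimate is pulled toward the more reliable sensor, which is the basic reason multi-sensor platforms outperform any single modality.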


1.7 Significant Progress

The six representative achievements of this major research plan are summarized as follows, covering “cognitive computing”, “visual information processing”, “brain-computer interface”, “natural language processing”, “collaborative computing of multi-modal information”, and “autonomous driving”.

1. Enriched and developed the basic theories of visual and auditory cognition
a. Topological definitions of perceptual objects;
b. Topological explanation of the attentional blink;
c. Neural expression of temporal organization in audio-visual fusion.

2. Established a visual-attention statistical learning method for scene understanding

3. Conducted research on efficient computing models and methods for multi-modal information collaboration
a. Cross-media expression and modeling based on hypergraphs and matrices;
b. Sparse selection of high-dimensional structures;
c. Cross-media learning mechanism from a geometric view;
d. New retrieval mechanism for cross-media data;
e. Construction and application of a cross-media vertical search-engine platform;
f. Demonstration application of the search and service system of the China National Intellectual Property Administration (CNIPA).

4. Created a new theoretical framework and implementation method for Chinese semantic computing

(1) Multi-level, cross-language semantic representation and computation of Chinese
a. Context-sensitive semantic representation and computation of words;
b. Clause semantic-representation learning and computing for implicit discourse-relation recognition;
c. Sentence semantic representation and computation inspired by the human attention mechanism;
d. The first unsupervised cross-language semantic-representation learning method proposed worldwide.

(2) Automatic acquisition of semantic knowledge based on query logs and cross-language data
a. Automatic acquisition of Chinese semantic knowledge based on query logs and cross-language data;
b. Chinese semantic-knowledge acquisition method based on cross-language mapping.


(3) Application of syntactic and semantic methods in neural machine translation
a. Context-aware semantic smoothing method for neural machine translation;
b. Source-dependency representation method for neural machine translation;
c. Syntax-oriented attention method for neural machine translation.

(4) Chinese syntactic-analysis method combining neural networks and feature templates
a. Deterministic attention mechanism for the syntactic-analysis task;
b. Combination of neural-network-based and traditional syntactic-analysis methods;
c. Construction of a functional-component treebank and functional-component analysis of Chinese sentences.

(5) Template-based understanding and reasoning model for mathematical word problems

(6) Application: domain-specific question-answering systems
a. Experimental Q&A system for geographic common sense based on distributed representation;
b. Task-based dialogue system for travel life;
c. Q&A system for the medical and health field.

5. Intuitive bionic control theory of intelligent prostheses based on the brain-computer interface
a. EEG command decoding;
b. Multi-source physical-quantity detection and sensory feedback of the bionic prosthesis;
c. Real-time movement planning based on EEG instructions;
d. Prosthetic movement based on skill learning and expansion.

6. Development of key technologies and an integrated verification platform for autonomous driving vehicles
a. Environmental perception theory and methods for autonomous driving vehicles;
b. Optimized decision-making and control of autonomous driving vehicles;
c. Structure and reliability of the autonomous driving system;
d. Integration of brain-computer interface technology into the vehicle’s autonomous driving system;
e. Integration of auditory information processing into the vehicle’s autonomous driving system.

After the completion of the plan, the development trend in the field can be seen in Table 1.1.

Table 1.1 Comparison of field development trends after the plan

Core scientific problem: Research on cognitive mechanism

Research status at the start of the plan: When this major research plan was launched, there was no clear definition of mechanisms such as perceptual objects and the attentional blink in China’s cognitive-science community. (i) “What is a perceptual object” is a controversial central issue in cognitive science; after decades of research, researchers still could not describe it accurately. Object-based attention theory therefore often faced challenges from space-based theory, or failed to give a unified, systematic explanation of various phenomena, and was long excluded from the mainstream of modern research. How to define perceptual objects scientifically and accurately is thus a core problem in attention research and in cognitive science as a whole. (ii) The attentional blink refers to the phenomenon that, in a rapidly presented series of visual stimuli, the detection rate for a second target drops sharply within 200–500 ms after the first target is correctly detected.

Research status at the end of the plan: At the end of this major research plan, great breakthroughs were made on the above basic problems of cognitive mechanism. (i) Through the multiple-object-tracking (MOT) experimental paradigm, it was found that changes in the topological properties of figures do lead to a significant decline in tracking performance. This shows that a change of topological properties is indeed perceived as the emergence of a new perceptual object, interfering with the continuity of perceptual-object representation; the extraction of topological properties is thus the starting point for forming perceptual-object representations. Brain functional imaging further found that the anterior temporal lobe is responsible for representing perceptual objects, consistent with our previous results. The results were published in the top international journal PNAS. (ii) A series of 16 experiments supported the topological definition of perceptual objects, excluding non-topological factors such as local properties, spatial factors, color, semantics, and top-down influences.

International research status at the end of the plan: Some international progress has been made on the above cognitive mechanisms. (i) On perceptual cognition, most international scholars still hold the local-first perception theory; research on the global-first topological-property perception theory remains a hot, cutting-edge topic. (ii) International research on the attentional-blink mechanism rests on two papers: the PNAS paper “Language affects patterns of brain activation associated with perceptual decision” indicates that lexical color differences in the English language family may increase or decrease perceptual differences, and the Journal of Vision paper “Is there a lateralized category effect for color?” shows that changes in non-topological properties such as color have no effect on the emergence of new objects or on cueing attention.

Advantages and gaps compared with international research status: China is ahead of its international counterparts on these basic problems of cognitive mechanism. (i) A series of MOT experiments strongly supports the global-first theory of topological-property perception, challenging the local-first theory of mainstream international scholars. (ii) The proposed theory has been supported by many international scholars, and many subsequent papers have reached similar conclusions, further confirming its correctness; this conclusion was reached earlier than by international counterparts, showing that China leads in research on the attentional-blink mechanism.


Research status at the start of the plan (continued): Research on the attentional blink has become the hottest issue in the attention field, yet after more than 20 years its mechanism remains disputed, with increasingly divergent explanations.

Research status at the end of the plan (continued): Scholars found that changes in topological properties produce new objects and an attentional blink. This conclusion supports the above analysis and can also explain well-known puzzles in attentional-blink research, such as the “lag-1 sparing” phenomenon. It also shows that a scientific definition of perceptual objects helps solve classical problems of cognitive science. The experimental results were published in top international journals such as PNAS and Human Brain Mapping.

Core scientific problem: Visual information processing

Research status at the start of the plan: At the start of this major research plan, signal processing and data labeling of traffic signs in autonomous driving were urgent visual-information processing problems. (i) For image completion of traffic signs, traditional matrix-completion methods based on the nuclear norm usually needed to meet certain theoretical assumptions, but these assumptions were often hard to satisfy in practice.

Research status at the end of the plan: A series of important research achievements were made in visual information processing for autonomous driving; representative progress includes: (i) a fast matrix-completion method based on the truncated nuclear norm, proposed to recover traffic images with data missing due to occlusion and pollution.

International research status at the end of the plan: A series of important achievements were made internationally in visual information processing. Representative results related to this project include: (i) papers such as “Principal component analysis with missing data and its application to polyhedral object modeling”, heuristic methods based on the nuclear norm that can achieve good performance.

Advantages and gaps compared with international research status: China reached the international leading level on some problems of autonomous driving. (i) For missing traffic-image data, although current international methods achieve good performance, they often reach only sub-optimal solutions, because the theoretical requirements of the nuclear-norm heuristic (for example, incoherence) are often not met in practice.


Research status at the start of the plan (continued): It was therefore a difficulty that existing matrix-completion algorithms could not accurately recover the low-rank structure of an image. (ii) Non-negative matrix factorization (NMF) is an unsupervised learning algorithm and is thus unsuitable for many practical problems in which domain experts can supply only limited labeled knowledge. Machine-learning researchers have found that combining unlabeled data with a small amount of labeled data can significantly improve learning accuracy; since labeling costs may make a fully labeled training set infeasible while a small labeled set is relatively cheap, semi-supervised learning has great practical value. Extending NMF to a semi-supervised mode is therefore highly beneficial.

Research status at the end of the plan (continued): The algorithm relaxes the restrictive assumptions: as long as a low-rank solution exists, it can obtain the optimal solution by minimizing the truncated nuclear norm. Professor Cai Deng further proposed an efficient iterative procedure to accelerate the solution of this optimization problem. This work was published in IEEE T-PAMI, the top international journal of computer vision and pattern recognition, and has been cited 154 times by other scholars. (ii) Chinese scholars innovatively proposed a semi-supervised matrix-factorization algorithm, constrained non-negative matrix factorization, which takes label information as a constraint. The algorithm effectively fuses the label information so that the decomposed factors have stronger discriminative ability. This work was also published in IEEE T-PAMI and has been cited 216 times by other scholars.

International research status at the end of the plan (continued): This shows that heuristic methods based on the nuclear norm have good research prospects. (ii) Papers such as “Learning sparse representations by non-negative matrix factorization and sequential cone programming” obtained NMF optimization schemes based on sequential quadratic and second-order cone programming. A key advantage of this approach is that NMF can be extended by incorporating prior knowledge as additional constraints within the same optimization framework; the additional constraint restricts the coefficients of each class, and of each labeled point belonging to it, to a cone around the class center.

Advantages and gaps compared with international research status (continued): The fast matrix-completion method based on the truncated nuclear norm proposed by Chinese scholars is therefore better suited to practical scenarios and achieves better performance. (ii) Many international scholars have proposed semi-supervised learning algorithms based on NMF with good results; Chinese scholars creatively put forward a new NMF-based semi-supervised learning algorithm, providing new ideas for the study of semi-supervised learning algorithms worldwide and promoting the development of this field.


Core scientific problem: Brain-computer interface

Research status at the start of the plan: At the start of this major research plan, consciousness detection for patients with disorders of consciousness generally used traditional paradigms such as SSVEP or P300. These traditional paradigms had their own defects, so better and more complete methods were needed to overcome them.

Research status at the end of the plan: A hybrid paradigm based on P300 and SSVEP was proposed to detect awareness in patients, clearly outperforming the traditional P300-only or SSVEP-only paradigms. In addition, a visual-auditory brain-computer interface was proposed and successfully applied to detecting command-following and number-processing ability in patients with disorders of consciousness, with good results. The related results were published in the top journal Proceedings of the IEEE and in authoritative international journals such as Scientific Reports and BMC Neurology.

International research status at the end of the plan: International research on consciousness detection mainly adopts single-modal methods; for example, “Continuous three-dimensional control of a virtual helicopter using a motor imagery based brain-computer interface” (2011) and other papers use a single modality to detect left-hand, right-hand, and both-hands information to control multi-dimensional objects and perform complex operations.

Advantages and gaps compared with international research status: Compared with mainstream international methods, the multi-modal methods adopted in this major research plan provide more robust information. Multi-modal methods can overcome the defects of single-modal methods in consciousness detection for disabled patients, further improving system performance.

24 1 Program Overview

Table 1.1 (continued)

Core scientific problem under scientific objectives: Natural language processing

Research status at the start of the plan: At the start of this major research plan, most research on natural language processing in China used traditional methods. Although many valuable research achievements had been made, language processing still suffered from insufficient Chinese semantic knowledge resources, the sparsity of symbolic semantic representation and computation methods, and the inability of existing reasoning algorithms to take global information into account. In speech processing, only knowledge from information science was used; knowledge from life science was not integrated.

Research status at the end of the plan: After years of research, at the end of this major research plan, effective solutions had been put forward for natural language processing, supporting the needs of key national fields. (i) Harbin Institute of Technology combined traditional methods of Chinese text understanding with currently popular deep machine learning methods, studied three levels including semantic representation and acquisition, semantic computing models, and semantic reasoning, and created a new theoretical framework and implementation method for Chinese semantic computing. A task-oriented dialogue system for travel has been realized, namely a Q&A dialogue system for three domains (train tickets, air tickets, and hotel reservations), along with an automatic Q&A system for the medical and health field. (ii) The Institute of Automation, Chinese Academy of Sciences conducted research on human-like hearing mechanisms and deep learning algorithms and effectively integrated the two, giving full play to the cross-disciplinary advantages of information science and life science and promoting new breakthroughs in intelligent speech technology.

International research status at the end of the plan: At the end of this major research plan, other countries had also made a series of achievements in natural language processing. (i) According to the paper "Ask me anything: dynamic memory networks for natural language processing" published at ICML 2016, a dynamic memory network can take input sequences and questions, form episodic memories, and generate relevant answers. (ii) The paper "State-of-the-art speech recognition with sequence-to-sequence models" published at ICASSP 2018 folds the acoustic, pronunciation, and language model components of a traditional automatic speech recognition (ASR) system into a single neural network, demonstrating that this architecture matches the most advanced ASR systems on dictation tasks and achieves the best results on more complex voice search and dictation tasks.

Advantages and gaps compared with international research status: Through the implementation of this major research plan, China has put forward practical methods in Chinese text understanding and speech processing. It has achieved the best results in Chinese text understanding and, for the first time internationally, proposed a new loss function for constructing semantic representation models by matching the hidden state values of the source and target languages. Without any parallel corpus or seed translation dictionary, it can automatically learn cross-lingual distributed semantic representations. In speech processing, drawing on the auditory attention mechanism of the human ear, an auditory attention model integrating voiceprint memory and attention selection was built, realizing both stimulus-driven (passive, e.g., hearing one's name called) and task-oriented (active, e.g., chatting with friends) auditory attention selection tasks.
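The cross-lingual representation result above is described only at a high level; as a hedged illustration of the general idea of mapping one language's distributed representations onto another's (not the plan's actual loss-matching method), the classic orthogonal Procrustes solution aligns two embedding spaces with a single SVD. Names here are illustrative:

```python
import numpy as np

def procrustes_align(X, Y):
    """Return the orthogonal matrix W minimizing ||X @ W - Y||_F,
    e.g. rotating source-language word vectors X onto target-language
    vectors Y for rows assumed to be translation pairs."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```

The orthogonality constraint preserves distances within the source space, which is why this simple closed-form map works well for embedding alignment.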


Table 1.1 (continued)

Core scientific problem under scientific objectives: Multi-modal information collaborative computing

Research status at the start of the plan: At the start of this major research plan, research on multi-modal information processing in China was at an initial stage. Although many valuable research achievements had been made, most directly applied existing single-modal processing methods to single-modal data, and research on the cross-correlations between multi-modal data was weak. Especially with the arrival of the big data era, real life produces multi-dimensional, high-order, and massive text, image, and video data. The expression and modeling of the complex relations of multi-modal data, dimension reduction and selection of high-dimensional features, multi-modal content mining, and cross-media retrieval were all weak, and no effective solutions yet existed.

Research status at the end of the plan: At the end of this major research plan, a series of effective solutions had been put forward for the expression and modeling of multi-modal information, dimension reduction and selection of high-dimensional features, etc., supporting the needs of key national fields. For example, Zhejiang University proposed cross-media representation and modeling based on hypergraphs and matrices, which can effectively condense massive cross-media data and extract robust and maximally effective information features from it. For cross-media data retrieval, cross-media sparse hash indexing, cross-media bidirectional structure sorting, and a cross-media relevance feedback algorithm based on historical logs were proposed and applied to the construction of a cross-media vertical search engine platform, which has been successfully deployed at Xinhua News Agency and the National Intellectual Property Administration. The related research results were published in IEEE T-PAMI and other famous international journals, were cited more than 200 times by other scholars, and won the second prize of the National Natural Science Award.

International research status at the end of the plan: At the end of this major research plan, other countries had also made a series of achievements in cross-media retrieval. For example, the paper "On the role of correlation and abstraction in cross-modal multimedia retrieval" published in IEEE T-PAMI in 2014 equates the design of a cross-modal retrieval system with the design of an isomorphic feature space for different content modalities; three new solutions to the cross-modal retrieval problem are derived (correlation matching, semantic matching, and semantic correlation matching combining both) to solve problems such as searching text with images. As another example, the paper "Multimodal similarity-preserving hashing" published in IEEE T-PAMI in 2014 uses a coupled Siamese neural network structure to learn within-class and between-class similarity; this hashing method is not limited to binary linear mappings and can handle more general forms.

Advantages and gaps compared with international research status: Through the implementation of this major research plan, China has put forward feasible and effective methods for the expression and modeling of complex relations in cross-media data, dimension reduction and selection of high-dimensional features, cross-media retrieval, etc., and applied them in engineering practice to construct a cross-media vertical search engine platform deployed at Xinhua News Agency and the National Intellectual Property Administration. The work on cross-media data expression and modeling has been cited and positively evaluated by peers.
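The "correlation matching" idea from the cross-modal retrieval literature cited above can be sketched with plain canonical correlation analysis, which maps two modalities into a shared space where matching items correlate. This is a simplified illustration under standard assumptions, not the cited papers' exact systems:

```python
import numpy as np

def cca(X, Y, k=1, reg=1e-3):
    """Canonical correlation analysis: return projections A, B mapping
    the centered modalities X, Y (rows = paired samples) into a shared
    k-dimensional space where their correlation is maximized."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = len(X)
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # regularized covariances
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # whiten each modality, then SVD the whitened cross-covariance
    Wx = np.linalg.cholesky(np.linalg.inv(Cxx))
    Wy = np.linalg.cholesky(np.linalg.inv(Cyy))
    U, _, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return Wx @ U[:, :k], Wy @ Vt[:k].T
```

Retrieval then amounts to nearest-neighbor search between projected image and text features in the shared space.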


Table 1.1 (continued)

Core scientific problem under scientific objectives: Driverless driving verification platform

Research status at the start of the plan: At the beginning of this major research plan, research on driverless driving and testing in China was still in its infancy, relying mainly on traditional positioning and navigation technologies such as GPS. Although the traditional methods had produced some valuable achievements, they still had great limitations in detecting and tracking the road environment and had difficulty adapting to varied road traffic scenes. It was therefore very important to develop driverless vehicles with natural environment perception and intelligent behavior decision-making.

Research status at the end of the plan: At the end of this major research plan, effective solutions existed for the environmental perception of driverless vehicles and for evaluating their intelligence level and decision-making ability. (i) Xi'an Jiaotong University realized an environment-aware computing framework for driverless vehicles with a selective attention mechanism and multi-sensor information interaction; explored a dynamic dual cooperative computing model and a local dynamic scene modeling method for primary visual information processing; established a knowledge base of primary driving behavior; and realized real-time control and autonomous driving of driverless vehicles on the basis of environment perception and behavior decision-making. Two driverless vehicle prototype systems were successfully developed that can drive autonomously in urban traffic while obeying traffic rules, with a top speed of 60 km/h. They participated in the "China Intelligent Vehicle Future Challenge Competition" (2009–2017), sponsored by the National Natural Science Foundation of China, for nine consecutive years and performed well. (ii) The Army Military Transportation University of the PLA broke through the technical problem of multi-vehicle interactive cooperative driving based on visual and auditory information. In theory, a selective attention mechanism was established for the perception and recognition of visual and auditory information; a variable-scale grid method was proposed to fuse multi-source sensor information; a competition and sharing mechanism for road resources (right of way) among running vehicles was put forward; and a multi-vehicle interaction strategy was proposed, establishing a multi-vehicle cooperative driving mechanism.

International research status at the end of the plan: At the end of this major research plan, other countries had also made achievements in the research and industrialization of driverless vehicles. (i) The VisLab institute of the University of Parma, Italy, uses computer vision technology instead of laser radar to realize environment perception for driverless cars. In July 2013, VisLab's BRAiVE car was successfully tested on Italian country roads, city streets, and highways. (ii) The University of Oxford uses neither GPS nor embedded infrastructure (such as beacons), instead navigating with algorithms, including machine learning and probabilistic reasoning, to build maps of the surroundings.

Advantages and gaps compared with international research status: Through the implementation of this major research plan, the gap between China and developed countries in driverless vehicle research has gradually narrowed. For example, more than 20 driverless vehicle experimental platforms have been developed; the relationship between brain cognition and cognitive computing has been explored; and new schemes such as intelligent computing models and the driving brain have been put forward, solving the problems of real-time perception of traffic elements and understanding of the driving environment in large-scale traffic scenes based on multi-source information fusion, decision-making for human-like driving behavior in dense traffic flow, and motion planning and trajectory tracking.
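The variable-scale grid fusion mentioned above builds on standard occupancy-grid mapping. A minimal sketch of the Bayesian log-odds update for a single grid cell (names are illustrative, not from the source):

```python
import math

def update_logodds(logodds, p_occ):
    """Fuse one independent sensor reading (probability p_occ that the
    cell is occupied) into the cell's running log-odds belief."""
    return logodds + math.log(p_occ / (1.0 - p_occ))

def occupancy(logodds):
    """Convert a log-odds belief back to an occupancy probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(logodds))
```

Two consistent "hit" readings drive the belief toward occupied, while an equally strong "miss" exactly cancels a hit, which is why the log-odds form is the usual choice for fusing many sensor scans.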


Chapter 2

“Cognitive Computing of Visual and Auditory Information”: State of the Art

Artificial intelligence enables machines to understand, think, and learn like people, that is, to simulate human intelligence with computers. It is a very broad field that, generally speaking, covers cognition and reasoning (including various kinds of physical and social common sense), computer vision, natural language understanding and communication (including the auditory sense), machine learning, and other subjects. The cognitive computing of visual and auditory information is therefore an important research topic in artificial intelligence. Understanding human audio-visual cognition and establishing computable audio-visual cognitive models provide great inspiration for the core algorithms of artificial intelligence. The layout of the NSFC is very forward-looking: with the support of the major research program "Cognitive Computing of Visual and Auditory Information", it has greatly promoted the vigorous development of theories, methods, technologies, and applications in the field of artificial intelligence in China.

2.1 Research Status

In 1955, researchers in the United States proposed the concept of "artificial intelligence (AI)" in a research proposal, and the definition of artificial intelligence was first given at the Dartmouth College summer workshop in 1956; its perception component included the original idea of processing visual and auditory information. Since then, AI research has gone through three technological waves. The first wave focused on handcrafted knowledge. In the 1980s, it centered on rule-based expert systems for clearly defined domains, in which knowledge was collected from human experts, encoded as "if–then" rules, and then executed in hardware. This type of systematic reasoning can be applied successfully to narrowly defined problems, but it has no ability to learn or to deal with uncertainty. Nevertheless, it can still yield important solutions, and development of this technology remains active today. The third wave of artificial intelligence research began in the twenty-first century and is characterized by the rise of machine learning. Artificial intelligence research is now divided into vision, hearing, cognitive reasoning, knowledge engineering, machine learning, and so on. During the burst of the third wave, cognitive computing of visual and auditory information has played an important role. The availability of massive amounts of data, relatively cheap massively parallel computing power, and improved learning techniques have all advanced AI on tasks such as image and text recognition, speech understanding, and human language translation. These achievements mainly come from the processing of visual and auditory information, including image and speech recognition and understanding, machine translation, etc. Today, AI systems often outperform humans at specialized tasks. Major milestones at which AI surpassed human performance for the first time include chess (1997), trivia (2011), Atari games (2013), image recognition (2015), speech recognition (2015), IBM Watson's diagnosis of acute myeloid leukemia (text processing and knowledge reasoning, 2015), AlphaGo (2016), and Texas Hold'em poker (2017). In visual and auditory application fields such as face recognition, speech recognition, and unmanned driving, China has consistently held a leading position. These achievements of artificial intelligence are supported by a strong foundation of fundamental research. From 2013 to 2015, the number of scientifically indexed journal articles mentioning "deep learning" increased sixfold, and applications of artificial intelligence in the visual and auditory field are now creating considerable benefits for large enterprises. However, despite this great progress, AI systems still have their limitations.

© Zhejiang University Press 2023 N. Zheng (ed.), Cognitive Computing of Visual and Auditory Information, Reports of China's Basic Research, https://doi.org/10.1007/978-981-99-3228-3_2
Almost all visual and auditory advances have been made in "narrow AI", which can effectively perform specialized tasks, while "general AI", which could function effectively across many cognitive domains, has made little progress. Even narrow-AI progress in cognitive computing of visual and auditory information is uneven: AI systems for image recognition require significant human effort to label thousands of sample answers, whereas most people learn in a "one-shot" fashion that requires only a few examples. While most machine vision systems are easily confused by complex scenes with overlapping objects, children can easily perform "scene analysis". Scenes that are easy for humans to understand are still hard for machines. The field of artificial intelligence is now at the beginning of the third wave, and its main growth point is the processing of visual and auditory information. Currently, much attention is paid to interpretability and to general artificial intelligence technology. The goal of these methods is to strengthen learning models through explanation and interface modification, clarify the basis and reliability of outputs, operate with high transparency, move beyond narrow artificial intelligence, and obtain capabilities that can be applied across wider fields. If this goal is achieved, engineers can create cognitive computing systems for visual and auditory information that construct explanatory models of real-world phenomena, communicate with people naturally, learn and think when encountering new tasks and situations, and solve new problems from past experience. Such models can enable rapid learning in artificial intelligence systems and provide them with "meaning" or "understanding", so that they can obtain more general functionality.

2.2 Development Tendency

With the development of the Internet, big data, cloud computing, and new sensing technologies, the cognitive computing of visual and auditory information is triggering scientific breakthroughs that can produce chain reactions, spawning a batch of disruptive technologies, fostering new momentum for economic development, shaping a new industrial system, and accelerating a new round of technological revolution and industrial transformation. This performance leap in cognitive computing for visual and auditory information has been accompanied by impressive progress in hardware technology for basic operations such as sensing, perception, and target recognition. New platforms and markets for data-driven products, as well as economic incentives for discovering new products and markets, have also promoted the emergence of artificial-intelligence-driven technologies. All these trends are driving the following hot research areas.

2.2.1 Cognitive Computing Theories and Methods for Visual and Auditory Information Based on Neuromorphic Computing

Von Neumann's computational model for traditional computers separates the input/output, instruction-processing, and memory modules. With the achievements of deep neural networks on a range of tasks, alternative computing models, especially those inspired by biological neural networks, are being actively sought to improve hardware efficiency and the stability of computing systems. Currently, these "neuromorphic" computers have not yet clearly demonstrated great success and are only beginning to be commercialized, but they may become commonplace in the near future (even if only as companions to Von Neumann machines). Deep neural networks are in great demand in application fields. When these networks can be trained and executed on specialized neuromorphic hardware, instead of being simulated on a standard Von Neumann architecture as they are today, the era of artificial intelligence based on neuromorphic computing may arrive.
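A minimal leaky integrate-and-fire neuron illustrates the event-driven style of computation that neuromorphic hardware targets. This is a textbook sketch, not any particular chip's model, and all names are illustrative:

```python
def lif_run(inputs, dt=1.0, tau=20.0, v_th=1.0, v_reset=0.0):
    """Simulate a leaky integrate-and-fire neuron over a sequence of
    input currents; returns a 0/1 spike train. The membrane potential
    leaks toward the input and fires/resets on crossing the threshold."""
    v, spikes = 0.0, []
    for i_t in inputs:
        v += (dt / tau) * (-v + i_t)   # leaky integration step
        if v >= v_th:
            spikes.append(1)
            v = v_reset                # reset after a spike
        else:
            spikes.append(0)
    return spikes
```

Because information is carried by sparse spike events rather than dense activations, such models map naturally onto low-power, event-driven silicon.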


2.2.2 Enhancing Perception of Visual and Auditory Information in Cognitive Computing Systems

Perception is the window through which an intelligent system faces the world. Perception (which may be distributed) comes from sensor data and takes many forms, such as the state of the system itself or relevant information about the environment. In artificial intelligence systems, sensor data is typically processed and integrated with prior knowledge and models to extract task-relevant information such as geometric features, attributes, position, and speed. Synthesized perceptual data forms environmental awareness, providing a comprehensive knowledge and world-state model for cognitive computing systems of visual and auditory information to help them plan and perform tasks efficiently and safely. Artificial intelligence systems will benefit greatly from advances in hardware and algorithms for more stable and reliable perception. Sensors must be able to capture real-time data at high resolution over long distances. The perception system should be able to synthesize data from various sensors and other sources (including the computing cloud) to determine what the cognitive computing system of visual and auditory information is currently perceiving and to predict its future state. It remains challenging to detect, classify, identify, and confirm objects, especially under cluttered and dynamic conditions. In addition, the right combination of sensors and algorithms can greatly augment human perception and allow AI systems to work with humans more effectively. Throughout the perception process, a framework for computing and propagating uncertainty is needed to quantify the confidence an AI system has in its situational awareness and to improve its accuracy.
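One standard way to "compute and propagate uncertainty" across sensors is inverse-variance weighting of independent estimates; a minimal sketch (function names illustrative):

```python
def fuse_estimates(estimates):
    """Fuse independent (value, variance) sensor estimates of the same
    quantity; more certain sensors get proportionally more weight."""
    weights = [1.0 / var for _, var in estimates]
    fused = sum(w * v for w, (v, _) in zip(weights, estimates)) / sum(weights)
    # fused variance is smaller than any single input's variance
    return fused, 1.0 / sum(weights)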

2.2.3 Improved Ability of Discovering Knowledge in Big Data

The data-driven paradigm has achieved great success and has replaced the traditional artificial intelligence paradigm. Procedures such as theorem proving and logic-based knowledge representation and reasoning have received less and less attention, in part because of the ongoing challenge of grounding them in the real world. This line of work was a mainstay of AI research in the 1970s and 1980s but has received little attention since, partly because it relies heavily on modeling assumptions and struggles to meet the needs of practical applications. Many new methods and technologies are required to enable intelligent data understanding and knowledge discovery. Further progress is required in developing more advanced machine learning algorithms to mine all the useful information in big data. Many open research questions involve the creation and use of data, including the accuracy and appropriateness of training cognitive computing systems on visual and auditory information. Accuracy is particularly challenging when dealing with large amounts of data, which are difficult to evaluate and to extract information from. Although many studies employ data quality assurance methods to ensure the accuracy of data cleaning and knowledge discovery, further research is needed to improve the efficiency of data cleaning techniques, to establish methods for detecting data inconsistencies and anomalies, and to make them compatible with human feedback. Researchers need to explore new ways to leverage data and associated metadata. Many applications of cognitive computing for visual and auditory information are interdisciplinary in nature and utilize heterogeneous data. Further research on multimodal machine learning is required to realize knowledge discovery from different types of data (such as discrete, continuous, text, spatial, temporal, spatiotemporal, and graph data). Cognitive computing of visual and auditory information must determine the amount of data required for training, correctly handle big-data and long-tail data requirements, and determine how to identify and deal with small-probability events by means beyond purely statistical methods. In practice, knowledge sources (that is, any type of information that explains the world, such as knowledge of the law of gravity or of social norms) are used alongside data sources, and models and ontologies are combined in the learning process. When big data sources are not available, limited data can still be used to obtain effective learning results.
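A common building block for the anomaly-detection step in data cleaning described above is robust outlier flagging with the median absolute deviation. This small sketch uses conventional choices (the 0.6745 scale factor and a 3.5 threshold) that are not from the source:

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score, based on the median
    absolute deviation (MAD), exceeds the threshold -- robust to the
    outliers themselves, unlike a mean/std-based z-score."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    if mad == 0:
        return []                      # degenerate: too many ties
    return [x for x in values if abs(0.6745 * (x - med) / mad) > threshold]
```

Because the median and MAD are barely affected by the anomalies being hunted, this check stays reliable even when a sizable fraction of the data is corrupted.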

2.2.4 Enhanced Ability of Natural Language Comprehension

Natural language comprehension, along with automatic speech recognition, is often regarded as a very active field of machine perception. Thanks to large datasets, it has quickly become a mainstream language technology. Google has reported that 20% of its mobile queries are now spoken, and recent demonstrations have proved the possibility of real-time translation. Research is now turning toward developing sophisticated and capable systems that can interact with people through dialogue, rather than merely responding to stylized requests.

2.2.5 Man-in-the-Loop, Human–Computer Cooperation and Hybrid Augmented Intelligence

Because of the uncertainty, fragility, and openness of many problems faced by human beings, intelligent machines of any degree cannot completely replace humans. It is therefore necessary to introduce human roles or human cognitive models into cognitive computing intelligence systems for visual and auditory information, forming hybrid augmented intelligence, which is a feasible and important growth model for artificial intelligence or machine intelligence. The development and use of such human–robot collaborative intelligent systems face significant research challenges in planning, coordination, control, and scalability. Planning technology for multi-agent systems must be fast enough to operate in real time and adapt to changes in the environment. Research on human–computer collaborative systems studies models and algorithms that help develop autonomous systems able to work with other systems and with humans; it relies on developing formal collaboration models and learning the capabilities required to make a system an effective partner. Applications that can exploit the complementary strengths of humans and machines are attracting growing interest: intelligent systems for cognitive computing of visual and auditory information can help humans overcome their limitations, while humans can expand the abilities and activities of the agents. Human–machine hybrid augmented intelligence takes two basic forms: human–machine collaborative hybrid augmented intelligence with a man-in-the-loop, and hybrid intelligence based on cognitive computing, in which a cognitive model is embedded in a machine learning system. Its basic elements include intuitive reasoning and causal models; memory and knowledge evolution; the role and rationale of intuitive reasoning in solving complex problems; cognitive learning networks for visual scene understanding based on memory and reasoning; and competitive-adversarial cognitive learning methods. These methods have a wide range of applications in areas such as autonomous driving.
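A minimal man-in-the-loop pattern is active learning with uncertainty sampling: the machine ranks unlabeled items by predictive entropy and asks the human to label the most ambiguous ones first. A sketch under simple assumptions (names illustrative):

```python
import math

def entropy(dist):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def uncertainty_order(predictions):
    """Indices of unlabeled items, most uncertain first, so a human
    annotator's limited labeling effort goes where it helps most."""
    return sorted(range(len(predictions)),
                  key=lambda i: -entropy(predictions[i]))
```

The loop then alternates: the model predicts, the human labels the top-ranked items, and the model retrains, which is one concrete form of the man-in-the-loop cooperation described above.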

2.2.6 Development of More Powerful and Reliable Autonomous Smart Cars (Robots)

Researchers need to better understand the perception of driverless smart vehicles (robots) in order to extract information from various sensors and perceive the surroundings in real time. Driverless smart cars (robots) require more advanced cognitive and reasoning capabilities so that they can better understand and interact with the physical world. Improved adaptation and learning abilities will enable driverless smart cars (robots) to self-summarize and self-evaluate, and to learn driving and motion skills from humans. Advances in technology have enabled driverless smart vehicles (robots) to work closely with humans. Today, artificial intelligence technologies have demonstrated their ability to supplement, enhance, or emulate human physical capabilities and human intelligence. However, scientists still need to make these driverless smart car (robot) systems more powerful, reliable, and easier to use.

2.2.7 Study of Driving Brain with Selective Attention Memory Mechanism

Brain science mainly studies three aspects: understanding the brain, protecting the brain, and creating an artificial brain. The car of the future is a networked wheeled robot. As the driver's intelligent agent, the driving brain of a driverless smart car can fully possess the driver's driving cognition and selective attention, as well as memory cognition, interactive cognition, computational cognition, and self-learning ability. The driving brain can in turn assist brain science in researching brain cognition: where human skills and wisdom come from, what time and space scales memory and thinking activities operate on, and how human cognitive abilities can be studied and evaluated.

2.2.8 Theoretical Capabilities and Limitations of Cognitive Computing for Visual and Auditory Information

Although the ultimate goal of many intelligent algorithms for cognitive computing of visual and auditory information is to solve open challenges with human-like solutions, we still understand little about the theoretical capabilities and limitations of intelligence, or about how feasibly such human-like solutions can be used in conjunction with cognitive algorithms for visual and auditory information. Theoretical work is needed to better understand why intelligent techniques for this kind of cognitive computing (especially machine learning) generally work well in practice. Although different disciplines (including mathematics, control science, and computer science) are studying this problem, the field currently lacks a unified theoretical model or framework for understanding the performance of cognitive computing intelligence systems based on visual and auditory information. More research is needed on computational solvability: understanding the types of problems that can theoretically be solved by cognitive computing algorithms of visual and auditory information, and, equally, the types they cannot solve. This understanding must be formed in the context of existing hardware, so as to understand how the hardware affects the performance of these algorithms. Understanding which problems are theoretically unsolvable can lead researchers to develop approximate solutions to them, and even open new lines of research into new hardware for intelligent systems for cognitive computing of visual and auditory information.

2.2.9 Advances in Hardware Architecture Design for Cognitive Computing of Visual and Auditory Information

Although AI research is most often associated with software development, the performance of intelligent systems for cognitive computing of visual and auditory information depends largely on hardware. The current revival of deep machine learning is directly related to advances in graphics processing unit (GPU)-based hardware technology and its improved memory, input/output, clock speed, parallelism, and energy efficiency. Hardware optimized for cognitive computing algorithms of visual and auditory information will achieve higher performance levels than GPUs. One example is the "neuromorphic" processor, which is inspired by the organization of brain tissue and in some cases optimizes the operation of neural networks. Hardware upgrades can also improve the performance of highly data-intensive cognitive computing intelligence methods for visual and auditory information. Improved hardware can yield more powerful cognitive computing intelligence systems for visual and auditory information, and intelligent systems can in turn increase the performance of hardware. This reciprocity will lead to further improvements in hardware performance, since addressing the physical constraints of computing requires new hardware design approaches. Intelligent cognitive computing approaches based on visual and auditory information may be particularly important for improving the operation of high-performance computing (HPC) systems; the design of such algorithms is required to enable HPC systems to operate online and at scale.

2.3 Trend of Field Development

From the perspective of development trends in this field, cognitive computing of visual and auditory information, together with the development of the Internet, big data, cloud computing, and new sensing technologies, is triggering scientific breakthroughs, generating chain reactions, and accelerating a new round of technological revolution and industrial change. The focus of cognitive computing intelligence systems for visual and auditory information is shifting toward visual and auditory modeling inspired by neural computing and toward big data-driven artificial intelligence. The underlying theory, key technologies, and broad applications of cognitive computing intelligence for visual and auditory information, together with the resulting questions of security, reliability, and ethics related to physical and social elements, are being systematically explored. This will lead to a new generation of powerful AI methods and techniques. To achieve the goals of robust artificial intelligence research, it is necessary to conduct comprehensive and unified research on data acquisition, perception and understanding, planning, decision-making, and the software and hardware architecture design that supports large-scale intelligent computing in the dynamic development environment of artificial intelligence systems. Regarding national-level basic research investment in artificial intelligence, the United States, the United Kingdom, the European Union, Japan, and others are all investing continuously, and artificial intelligence remains the commanding height of competition in the information field.


In October 2016, the White House released the National Artificial Intelligence Research and Development Strategic Plan. To prioritize investments in the next generation of AI that will foster new discoveries and insights while maintaining the USA's world leadership in AI, the plan proposed the following strategies: make long-term investments in AI research, including cognitive computing for visual and auditory information; develop effective methods for human-AI collaboration; understand and address the ethical, legal, and societal implications of AI; ensure the safety and reliability of AI systems; develop public datasets and environments for AI training and testing; and develop standards and benchmarks to measure and evaluate artificial intelligence technologies. In September 2016, Stanford University released Artificial Intelligence and Life in 2030, a report on the impact of AI over the next decade. The report covers the definition and research trends of artificial intelligence, eight application areas of artificial intelligence (including transportation, health, and education), and the formulation of related policies. Carnegie Mellon University announced in May 2018 the opening of the first undergraduate major in artificial intelligence in the USA. The goal of this major is to teach students to think broadly about how tasks across disciplines should be accomplished; the curriculum focuses on how to make decisions based on complex inputs such as vision, language, and large databases in order to enhance human capabilities.
In October 2018, the Massachusetts Institute of Technology (MIT) announced a $1 billion new school of computing and artificial intelligence, repositioning MIT, bringing the power of computing and AI to all areas of the school's research, and enabling insights from other disciplines and perspectives to shape the future of computing and artificial intelligence, so as to ensure that the USA can lead in shaping these powerful and transformative technologies. European countries have also introduced plans to promote the development of artificial intelligence. In March 2018, the European Political Strategy Centre released the report "The Age of Artificial Intelligence: Towards a European Strategy for Human-Centered Machines" (hereinafter the "European AI Strategy"). It clearly stated that Europe should establish a framework to support AI investment while setting global AI quality standards, in order to address the internal and external challenges facing European AI. Subsequently, the European Commission announced three major measures in the field of artificial intelligence: first, to invest 1.5 billion euros by 2020 and promote the participation of public and private capital, with total investment expected to reach 20 billion euros; second, to upgrade education and training systems to adapt to the changes artificial intelligence brings to employment; and third, to study and formulate new ethical standards for artificial intelligence in order to defend European values. In October 2017, the UK released a report on the development of its artificial intelligence industry, planning to increase investment in research and development, strengthen the management and sharing of big data, and cultivate more artificial intelligence talent to ensure that the UK remains a world leader in the field.
In May 2016, Japan formulated an advanced comprehensive intelligent platform plan, proposed a comprehensive development plan integrating artificial intelligence, big data, the Internet of Things, and network security, and designated 2017 as its "first year of AI". In March 2017, Japan took artificial intelligence as the core cutting-edge technology of the "Fourth Industrial Revolution" for promoting economic growth and formulated a roadmap for the industrialization of artificial intelligence, planning to promote the application of artificial intelligence within three years to significantly improve efficiency in production, distribution, medicine, space, and mobility.
In June 2014, General Secretary Xi Jinping delivered an important speech at the opening ceremony of the 17th Academician Conference of the Chinese Academy of Sciences and the 12th Academician Conference of the Chinese Academy of Engineering. He noted that the new generation of information technologies such as big data, cloud computing, mobile Internet, and robotics is rapidly converging; that 3D printing and artificial intelligence are developing quickly; that the software and hardware technologies for manufacturing robots are maturing, with costs continuously falling and performance continuously improving; that military drones, self-driving cars, and service robots have become a reality; and that some artificial intelligence robots already possess a considerable degree of autonomous thinking and learning ability. We must assess the situation, consider the overall picture, plan in good time, and move forward steadily. This was the first time that the top leader of the party and the country spoke so highly of artificial intelligence and related intelligent technologies, a solemn call for and vigorous promotion of the development of artificial intelligence and intelligent robot technology.
In September 2016, President Xi Jinping delivered a keynote speech at the G20 Business Summit, pointing out that a new round of technological and industrial revolution centered on the Internet is poised to take place, that new technologies such as artificial intelligence and virtual reality are advancing by the day, and that the combination of the real economy and virtual reality will bring revolutionary changes to people's modes of production and ways of life. As artificial intelligence develops from an academic topic into an exploding industrial economy, the world is marching toward a fourth industrial revolution represented by the Internet, artificial intelligence, 3D printing, autonomous driving, and other technologies. China is a leader in the global innovation arena and one of the countries moving fastest toward the intelligent age. In July 2017, the State Council of China decided to implement the "New Generation Artificial Intelligence Development Plan" and to elevate artificial intelligence to a national strategy. In November 2018, the US Department of Commerce announced a framework for its technology export control system, soliciting public opinion on 14 categories of representative technologies; the USA will continue to steadily tighten its policy restrictions on technology transfer to China. Among them, there


are seven major parts that involve cognitive computing of visual and auditory information. These restrictions place greater demands on related research: it must be independent and self-reliant and, while integrating the world's technological achievements, study the basic theory, methods, technology, and applications of cognitive computing of visual and auditory information. From this point of view, the major research plan "Cognitive Computing of Visual and Auditory Information", laid out by the NSFC in the field of artificial intelligence in China, is highly forward-looking, and was among the earlier large-scale national supporting projects of its kind internationally. Since the implementation of this major research plan, the development of cognitive computing research on visual and auditory information in China has not only been in line with the international direction of the past 10 years but has also shown strong momentum. Good momentum has formed on three levels: basic theory, key technologies and platforms, and applied practice, producing a number of achievements that lead or parallel the world. In its analysis of the challenges and opportunities of the artificial intelligence industry, Europe ranks China as a research competitor second only to the United States. In terms of basic theory, this major research plan has provided landmark solutions to the four major scientific issues of concern, which have had a considerable international impact. Over the past 10 years, in the famous international journals PNAS, Cell, TPAMI, and IJCV and at the international conferences NIPS, ICML, CVPR, and ICCV, the proportion of papers published by Chinese scholars has increased by an order of magnitude.
In particular, Chinese scholars have taken the lead in developing the "global-first" theory, a new mathematical framework suitable for describing cognition (involving various branches of mathematics that describe the basic units of cognition, basic cognitive relations, and basic cognitive operations), established new mathematical foundations for a new generation of artificial intelligence, and formed systematic results in cognitive computing. We believe that the major research plan "Cognitive Computing of Visual and Auditory Information" has achieved the expected source innovation in basic research and has improved our country's overall innovation capability and international competitiveness in basic artificial intelligence research. With the support of this major research plan, in terms of key technologies and platforms, whether in individual techniques and tools or in overall platforms, Chinese scholars have achieved a number of results that lead or parallel international benchmarks. The China IVFC has been held once a year for 10 years since its launch in 2009; the China BCI competition has been held four times since 2010. These competitions have greatly raised the level of scientific research in related fields, giving our country's driverless smart car research an important position in the world and keeping research on brain-computer interfaces and related topics at the world's forefront. In the application and practice of artificial intelligence, this major research plan has achieved the expected research results for the country's major needs, and has


been applied in cognitive computing, brain-computer interfaces, driverless intelligent vehicles, and intelligent robots. Applications represented by brain-computer interfaces, brain-controlled intelligent robots, and unmanned intelligent vehicles are in a leading position worldwide in scale and efficiency. From the perspective of development trends in artificial intelligence, China's research on cognitive computing of visual and auditory information has reached the international frontier and is already at the forefront of the core competition shaping the future of the field. In the face of international competition, it is of great significance to consolidate and expand the important achievements of the major research plan "Cognitive Computing of Visual and Auditory Information" and to maintain and improve the country's good momentum in cognitive computing of visual and auditory information; this is important for raising our country's science and technology level and for promoting major scientific discoveries and technological innovation. As discussed above, these efforts have important and far-reaching strategic significance.

Chapter 3

Major Research Achievements

3.1 Leaping Development of Basic Theories of Cognitive Computing

Taking the new theories, new structures, new methods, and new technologies involved in cognitive computing as breakthrough points, original innovations in scientific theory and experimental technique have been realized, driving the leap-forward development of the basic theories of cognitive computing in China. On some important basic scientific issues, the plan has obtained a number of internationally leading or advanced research results, published a series of papers with original innovative ideas in important international academic journals, led international academic development, and secured a place in international basic research on cognitive computing.

© Zhejiang University Press 2023 N. Zheng (ed.), Cognitive Computing of Visual and Auditory Information, Reports of China’s Basic Research, https://doi.org/10.1007/978-981-99-3228-3_3

3.1.1 Creation and Development of "Global-First" Object-Based Attention Theory

This major research plan takes the basic expression of visual and auditory perception as the core scientific problem and explores ways to realize autonomous driving, reflecting the following scientific consensus: the key crux of autonomous driving lies not in information or physical technology, but in understanding the cognitive neural mechanisms of visual and auditory information processing. Starting from the self-established "global-first" theory of topological property detection in visual perception, this project takes the definition of the basic cognitive unit, with "perceptual objects defined by topological properties" at its core, and analyzes the bottlenecks of visual and auditory information processing from a new perspective. It provides useful enlightenment for the establishment of a more effective


information processing framework for automatic driving and offers new ideas for the important breakthroughs of this major research plan. Its contributions to the core scientific problems, and the degree of its leaping development, are reflected as follows.
1. Contributions to solving the core scientific problems
Major contribution 1: topological definition of perceptual objects, with evidence from MOT experiments.
"What is the perceptual object" is a controversial central issue in cognitive science; after decades of research, scholars are still unable to describe exactly what a perceptual object is. The concept of the perceptual object originated in the Gestalt tradition. Although its connotation is profound and reasonable, it is an intuitive concept that often carries a flavor of circularity and lacks the power of theoretical analysis and prediction. As a result, object-based attention theories are often challenged by spatial theories, or fail to give unified or systematic explanations for various phenomena, and have long been excluded from the mainstream of modern research. Therefore, how to define perceptual objects scientifically and precisely is a core problem of current attention research and indeed of cognitive science as a whole. Based on the "global-first" theory of the perception of topological properties, it is held that the basic unit of attentional operation is the perceptual object, and a topological definition of "the perceptual object" is proposed: the property that remains invariant under topological transformation. The core meaning of the concept of "the perceptual object", a global identity that remains unchanged under shape-changing transformations, can thus be scientifically and precisely described in terms of topologically invariant properties, such as connectivity.
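The claim that object identity tracks topological invariants can be made concrete with a small sketch (our illustrative code, not the authors' experimental software): on a binary grid, turning a solid figure into a hollow one changes the hole count, a topological invariant, while a brightness change leaves the figure's topology untouched.

```python
# Count holes (enclosed background regions) in a binary grid by flood
# filling the background from the border; any background component left
# unmarked is enclosed by the figure, i.e., a hole.

def count_holes(grid):
    rows, cols = len(grid), len(grid[0])
    seen = set()

    def flood(sr, sc):
        stack = [(sr, sc)]
        while stack:
            r, c = stack.pop()
            if (r, c) in seen or not (0 <= r < rows and 0 <= c < cols):
                continue
            if grid[r][c] != 0:
                continue
            seen.add((r, c))
            stack.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])

    # Mark all background connected to the border (not holes).
    for r in range(rows):
        for c in (0, cols - 1):
            flood(r, c)
    for c in range(cols):
        for r in (0, rows - 1):
            flood(r, c)

    # Remaining background components are holes.
    holes = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0 and (r, c) not in seen:
                holes += 1
                flood(r, c)
    return holes

solid = [[1] * 5 for _ in range(5)]      # solid square: no hole
hollow = [[1, 1, 1, 1, 1],
          [1, 0, 0, 0, 1],
          [1, 0, 0, 0, 1],
          [1, 0, 0, 0, 1],
          [1, 1, 1, 1, 1]]               # hollow square: one hole
dimmed = [[1] * 5 for _ in range(5)]     # dimmer square: same binary shape

print(count_holes(solid), count_holes(hollow), count_holes(dimmed))  # → 0 1 0
```

Solid-to-hollow changes the invariant (0 holes to 1 hole), mirroring the prediction that such a change is seen as a new object, whereas a pure luminance change does not alter the invariant at all.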
According to this definition, a basic inference follows: a change in topology, such as a solid figure becoming hollow, will be interpreted by the visual system as a new object, whereas changes in local properties, such as brightness or even color, do not disrupt the perceived continuity of the object. Such inferences challenge the dominant "local-first" theoretical line of early feature analysis and run counter to traditional common sense. Using the experimental paradigm of multiple object tracking (MOT), we found that changes in the topological properties of figures do cause a significant decrease in tracking performance. This indicates that a change of topological properties is indeed registered as the generation of a new perceptual object, interfering with the continuity of perceptual object representation, and shows that the extraction of topological properties is the starting point for the formation of perceptual object representations.
Major contribution 2: attentional blink experiments.
The phenomenon of attentional blink occurs in the rapid serial presentation of visual stimuli: when a second target appears 200–500 ms after a correctly detected first target, the accuracy of detecting the second target drops sharply. Attentional blink has become one of the hottest problems in attention research, but what is its mechanism? After more than 20 years of research in the field, increasingly divergent explanations have arisen. Based on the topological definition of perceptual objects, we propose a topological explanation for attentional


blink: a change of topological properties between the target and the distractor figure is perceived as a new object, causing attentional blink. We carried out a series of 16 experiments supporting the topological definition of perceptual objects. These experiments excluded various non-topological factors, such as local properties, spatial factors, color, semantics, and top-down influences, and consistently found that changes of topological properties produced new objects and attentional blink. This conclusion not only supports the above analysis but also explains many well-known puzzles in attentional blink research, such as the phenomenon of "lag-1 sparing". It fully demonstrates how a scientific definition of perceptual objects helps solve classic problems of cognitive science.
Major contribution 3: neural expression of temporal organization in visual and auditory fusion.
At present, our research on the topological hypothesis of perceptual objects mainly concerns organizational relations in the spatial dimension, but temporal organization is another very important dimension. Our previous auditory work showed that the brain uses a time window that conforms to the statistical characteristics of the temporal organization of the speech signal to segment and process the continuous speech stream in real time. Such a time window reflects perceptual organization in the time dimension, namely the "temporal Gestalt". Our latest work uses natural film fragments, including dialogue scenes, to extend the research on temporal organization from audition to audio-visual fusion. The result shows that audio-visual fusion is also based on a specific time scale; that is, the natural audio-visual stream is tracked and processed across modalities by adjusting the phase of the time window in real time.
To be specific, in addition to modulating oscillations of the auditory cortex, the auditory signal also modulates the oscillation phase of the visual cortex in real time, so that visual stream signals consistent with the temporal organization of the auditory stream are enhanced within the time window suited to processing, achieving real-time continuous fusion across modalities. Visual stream signals use the same mechanism to modulate the time-window phase of the auditory cortex. This work identified, for the first time in human subjects, a neural mechanism that may underlie the solution of the "cocktail party effect" (the auditory scene analysis problem). In a larger sense, cross-modal fusion based on the time window reflects the important role of temporal organizational structure in establishing perceptual objects in complex multi-modal dynamic scenes. Temporal and spatial organization, as the main carrier dimensions of information flow, are decisive factors for the generation, establishment, and invariance of perceptual objects in multi-modal dynamic scenes.
Major contribution 4: other related studies.
Focusing on the cognitive mechanisms and neural expression of attention, we also carried out a series of related studies, including attention research on driving behavior and research on brain imaging methods. Among them, the study "Basic Research and Clinical Application of Minimally Invasive Neurosurgery for the Protection of Brain Cognitive Function in Craniocerebral Surgery", which


was conducted in cooperation with Tiantan Hospital, successfully applied the imaging method developed in this research to neurosurgery and won the second prize of the National Science and Technology Progress Award in 2009. Other representative studies are as follows.
(1) Research on age differences in driving control under reduced visibility. This research attempts to establish a direct connection between attention research and the scientific goals of this major research plan. In a simulated driving scenario, we investigated the extent to which the driving ability of subjects of different ages was affected by environmental changes. We found that when visibility is reduced, driving accuracy decreases for both young and old drivers, but the accuracy of young drivers remains higher than that of old drivers. The results provide behavioral evidence for the thesis that older drivers are more likely to cause traffic accidents through mis-operation when visual conditions are poor.
(2) Study of a non-traditional functional magnetic resonance imaging sequence. Traditional functional magnetic resonance imaging uses a gradient-echo EPI sequence to obtain the BOLD functional signal, but this method is prone to image distortion and signal loss in some areas, and its functional localization accuracy is limited. We used an improved HASTE method, combined with parallel imaging technology, to develop an alternative sequence that overcomes these shortcomings, largely avoids image distortion and signal loss, and greatly improves the accuracy of functional localization.
2. Degree of achieving leaping development
Related research papers have been published in top international journals, including two papers in PNAS, one in PLoS Biology, and one in NeuroImage; in addition, a new patent, "Parallel transmitting and receiving RF interface circuit and phased array transmitter-receiver coil", was granted.

3.1.2 Research on Visual Computing Models and Methods Inspired by Biological Cognitive Mechanism and Characteristics

This project has carried out in-depth research from two aspects: methods inspired by biological cognitive mechanisms and methods inspired by biological cognitive characteristics. For methods inspired by biological cognitive mechanisms, the focus is on feed-forward and feedback models of visual information processing in the brain, including the feed-forward model of the visual ventral pathway, the recurrent model of cells in area V1, the hierarchical recurrent model of the visual ventral pathway, and the attentional feed-forward and feedback model. In terms of the methods


inspired by biological cognitive characteristics, with the help of sparse representation, deep learning, and other tools, the focus is on the visual saliency computation model, the local representation model of visual information, and robust classification methods for visual information. The contributions to the solution of core scientific problems and the degree of leaping development are embodied as follows.
1. Contributions to solving core scientific problems
This project carried out in-depth research on visual computing problems inspired by biological cognitive mechanisms and characteristics, and achieved a number of results, mainly including the following.
Major contribution 1: a matrix decomposition method based on the double nuclear norm was proposed, which used a low-rank hypothesis to simultaneously describe the true image and the noisy image; that is, it assumed that the matrix composed of all image vectors is low-rank, and that each error image caused by occlusion is also low-rank. Experiments on occlusion removal from ground images and background extraction from video surveillance showed the effectiveness of the proposed model. The results were published in IEEE Trans Image Process.
Major contribution 2: an approximate estimation method for the tree-structured nuclear norm was proposed. Experiments on face restoration and recognition demonstrated the advantages of the proposed method over previous ones; the results were published in IEEE Trans Image Process.
Major contribution 3: a saliency fusion method using a double low-rank matrix recovery model was proposed. Experiments on five datasets showed that it consistently outperforms each individual saliency detection method and other mainstream saliency fusion methods. The results were published in IEEE Trans Image Process.
Major contribution 4: a pose-aligned face representation method based on an orthogonal Procrustes regression model was proposed.
We adopted an effective alternating iterative algorithm, and experimental results on three general face databases (the CMU PIE, CMU Multi-PIE, and LFW databases) proved the effectiveness of the proposed method. The results were published in IEEE Trans Image Process.
Major contribution 5: a learned discriminative singular value decomposition representation was proposed. Extensive experiments on three general face databases demonstrated the effectiveness of the method in handling illumination change, occlusion, disguise, and face sketch recognition tasks. The results were published in Pattern Recognition.
Major contribution 6: a matrix regression method based on the nuclear norm was proposed. Experimental results on multiple datasets showed that the method has clear advantages in image recognition problems with large-area occlusion and illumination changes. The results were published in IEEE Transactions on Pattern Analysis and Machine Intelligence, the top journal in the field.
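Several of the contributions above rest on the nuclear norm (the sum of a matrix's singular values) as a convex surrogate for rank. Its proximal operator is singular value thresholding (SVT), which can be sketched in a few lines. This is an illustrative sketch of the standard operator, not the papers' code, and the example matrices are made up.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: the proximal operator of
    tau * (nuclear norm). Shrinks every singular value by tau and
    clips at zero, which both denoises and lowers the rank."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# On a diagonal matrix the effect is exact and easy to read off:
# singular values [5, 3, 1] shrink to [3, 1, 0] under tau = 2.
X = np.diag([5.0, 3.0, 1.0])
Y = svt(X, 2.0)
print(np.round(np.linalg.svd(Y, compute_uv=False), 6))  # → [3. 1. 0.]

# A larger tau drives more singular values to zero, reducing rank:
Z = svt(X, 3.5)  # singular values become [1.5, 0, 0]
print(np.linalg.matrix_rank(Z))  # → 1
```

In low-rank recovery methods of the kind described above, an SVT step typically serves as the low-rank update inside an alternating scheme, paired with a sparse or regression update on the residual.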


Major contribution 7: a robust nuclear norm matrix regression model was proposed. Experimental results on multiple datasets showed that the method has clear advantages in image recognition problems with large-area occlusion and illumination changes. The results were published in IEEE Trans Image Process, an authoritative journal in the field.
Major contribution 8: a matrix regression model based on low-rank constraints on hidden variables was proposed, with advantageous experimental results on multiple face, fingerprint, and object classification datasets. The results were published in IEEE Trans Image Process.
Major contribution 9: a sparse deep stacking network was proposed. Experimental results confirmed that the method is superior to related algorithms, achieving in particular a recognition rate of 98.8% on the Scene-15 database. This result was published at the international conference AAAI 2015.
Major contribution 10: it was found that mid-level saliency features outperform low-level features in predicting image saliency, while high-level features are weaker than low-level ones; for video, both mid-level and high-level features outperform low-level features. In both cases, fusing mid-level and high-level saliency gives better results. The results were published in IEEE Trans Image Process.
Major contribution 11: a convolutional neural network (CNN) with intra-layer feedback was proposed for object recognition, achieving peak accuracy on four commonly used small-scale datasets. Related papers were published at the IEEE CVPR conference.
Major contribution 12: a convolutional neural network with intra-layer feedback was proposed for scene segmentation. This method directly achieves end-to-end scene segmentation and obtained very good results, published at the NIPS conference.
Major contribution 13: a recurrent neural network method, LightRNN, with high storage and computation efficiency was proposed, and its language models were evaluated on standard datasets. We found that LightRNN reduces the parameter size to as little as one percent of that of previous models while obtaining equal or even higher accuracy. This result was published at the NIPS conference.
Major contribution 14: a persistent memory network based on residual learning was proposed for image restoration. Multiple experiments showed that the method is superior to other existing image restoration algorithms in both performance metrics and visual quality of restoration. The results were published at the top international conferences CVPR 2017 and ICCV 2017.
Major contribution 15: a human pose estimation method based on generative adversarial networks was proposed. Quantitative and qualitative results on multiple
Major contribution 13: a recurrent neural network method was proposed with high storage and calculation efficiency, and the language model was evaluated on different standard datasets. We found that LightRNN reduces the parameter size by up to one percent compared to the previous model, and at the same time obtains lossless or even higher accuracy. This result was published at the Artificial Intelligence Conference NIPS. Major contribution 14: a persistent memory network based on residual learning was proposed for image restoration. Multiple experimental results showed that our method is superior to other existing image restoration algorithms in terms of performance indicators and restoration of visual effects. Related results were published at the top international conferences in the field, CVPR 2017 as well as ICCV 2017. Major contribution 15: a human pose estimation method was proposed based on Generative Adversarial Networks. The quantitative and qualitative results on multiple


datasets showed that the new method generates significantly more plausible poses than previous convolutional neural networks. Related papers were published at ICCV, a top conference in the field.
Major contribution 16: an importance-aware loss function was proposed. Experimental results on the CamVid and Cityscapes datasets showed that, with this loss, existing FCN, SegNet, and ENet models can more accurately segment predefined important target categories in autonomous driving scenarios. The results were published at IJCAI, a top artificial intelligence conference.
Major contribution 17: a deep learning model based on the Gram matrix was proposed for image style classification. Experiments showed that the model with this Gram-matrix bypass is significantly better than the model without it. Moreover, control experiments showed that the improvement does not come simply from an increase in parameters, which fully illustrates the advantage of the Gram matrix for classifying picture styles. The results were published in IEEE Trans Image Process.
2. Degree of achieving leaping development
The phased achievements include 26 SCI journal papers with 105 SCI citations, 1 ESI highly cited paper, 15 papers in top journals (including IEEE Trans Pattern Anal Mach Intell and IEEE Trans Image Process), 17 papers in important international journals and at important international conferences (including NIPS, CVPR, and ICCV), and 3 granted patents. In addition, we won the first prize of the 2017 Natural Science Outstanding Achievement Award for Scientific Research in Higher Education Institutions ("The Theory and Method of Visual Information Representation Learning").
Significant progress has also been made in personnel cultivation: one person was selected as a leading talent in the National “Ten Thousand Talents Program”, and one person was selected as a distinguished professor in Jiangsu Province.
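The Gram-matrix descriptor behind major contribution 17 can be illustrated with a minimal NumPy sketch. This is not the published model (which computes the Gram matrix inside a CNN bypass branch); shapes and names are illustrative only.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Compute the normalized Gram matrix of a CNN feature map.

    features: array of shape (C, H, W) -- C channels of an H x W feature map.
    Returns a (C, C) matrix of channel-wise inner products, which captures
    feature co-occurrence (texture/style) while discarding spatial layout.
    """
    c, h, w = features.shape
    f = features.reshape(c, h * w)   # flatten the spatial dimensions
    return (f @ f.T) / (h * w)       # normalize by the number of positions

# Toy usage: two constant channels give an easily checked Gram matrix.
fmap = np.zeros((2, 4, 4))
fmap[0] = 1.0
fmap[1] = 2.0
g = gram_matrix(fmap)   # g[0,0]=1, g[0,1]=g[1,0]=2, g[1,1]=4
```

Because the spatial dimensions are summed out, the Gram matrix describes which features co-occur rather than where they occur, which is why it suits style rather than content classification.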

3.2 Breakthroughs in Key Technologies of the Brain-Computer Interface This major research plan takes the key technologies of the brain-computer interface as a key entry point and an important foothold. It integrates the leading research teams and earlier results of the relevant integration, key-support and cultivation projects in the major research plan, conducts in-depth research on the key technologies of the brain-computer interface from three aspects (the basic theoretical system, methods and platform architecture, and typical application demonstrations) and achieves systematic, innovative research results. All these efforts effectively promote the standardization of technical methods, the interaction and integration of academic teams, and the consolidation and elevation of research results. The contributions
to solving core scientific problems and the degree to which it achieves leaping development are reflected as follows.

3.2.1 Research on Key Technologies and Applications of the Bidirectional Multidimensional Brain-Computer Interface The research content of this project is fully in line with the overall goals of the major research plan, and its contributions are as follows. Brain signal analysis algorithms centered on sparse signal representation, tensor analysis, and multivariate pattern analysis and decoding were developed, providing algorithmic support for brain-computer interface systems. Different brain-signal paradigms, or different brain signals, were combined into a multi-modal hybrid brain-computer interface (MBI) to realize multi-dimensional, multi-functional target control. The project also researched neurofeedback: it designed a neurofeedback-based user training model, established a bidirectional brain-computer interface based on neurofeedback, and applied it to motor rehabilitation training for the disabled. Application-oriented brain-computer interface systems were developed, including a brain-computer interface mouse and web browser, brain-computer interface virtual car control, wheelchairs for the disabled integrating driverless technology, a virtual-environment rehabilitation training system for the disabled, and a brain-computer interface rehabilitation robot training system for upper-limb movement of the handicapped. 1. Contributions to solving core scientific problems The team has conducted research on brain signal analysis and on key technologies and applications of the brain-computer interface for nearly 4 years, and has obtained a series of research results: papers published in internationally authoritative journals, authorized patents, and multiple prototype systems for brain-computer interface applications. These results are mainly reflected in the following aspects.
Major contribution 1: multi-modal brain-computer interface and multi-degree-of-freedom control Two EEG modes, P300 and motor imagery, were used to generate control signals for the brain-computer interface. This method has two innovations: the two control signals are generated by different mechanisms and are almost independent, and the two modes can work in coordination. When this method is used to control a cursor, the cursor can be moved from any position to any target position (at the time, no other work internationally on cursor control could do this). If the signals of the two modes, P300 and motor imagery, are combined, the selection function of the left mouse button can be realized, and its performance is substantially improved compared with single-mode methods.
The accuracy of selection and rejection can reach 94% within 2 s. Based on this multi-modal two-dimensional cursor control method, we successfully developed a brain-computer interface web browser and mail system. Both completely imitated mouse control, a world-first trial. This work was published in the journal J Neural Eng and was selected by the journal as one of its 16 outstanding papers of 2012. Based on the multi-modal method, simultaneous control of a wheelchair's direction and speed was successfully realized; this was the world's first use of a multi-modal brain-computer interface to control a wheelchair. The result was published in IEEE Trans Neural Syst Rehabil Eng. The brain-controlled wheelchair has been shown at external exhibitions many times and has attracted widespread media attention both at home and abroad; a special episode about it was produced by the CCTV program I Love Invention. Major contribution 2: feature selection of brain signals based on sparse representation A series of brain-signal feature selection algorithms based on sparse representation was established, achieving effects that traditional methods cannot: all features of interest can be selected; according to the signs of the sparse-representation weights, the selected features are divided into two categories corresponding to the two experimental stimuli or tasks; and, combined with a nonparametric statistical test, unselected features are removed. Results were published in PLoS One, IEEE Signal Process Mag, etc. Major contribution 3: integration of visual and auditory information in the brain The project studied the integration of visual and auditory information in face recognition and explored its role in the neural representation of facial semantic features.
Through multiple fMRI experiments and the decoding of facial features (gender or emotion) from fMRI data, we found that visual-auditory integration strengthened the neural representation of features related to the experimental task, and that selective attention played an important regulatory role in this integration. The relevant results were published in the international authoritative journal Cereb Cortex, among others. Major contribution 4: visual and auditory brain-computer interface Based on the human brain's mechanism of visual-auditory integration, a visual-auditory brain-computer interface was established. Data from healthy subjects showed that its performance is better than that of single-mode visual or auditory brain-computer interfaces. The reason is that visual-auditory integration in the brain strengthens the P300, N200 and P100 components, thereby enhancing the differentiation between target and non-target brain signals. The relevant results were published in the international authoritative journal Sci Rep. Major contribution 5: consciousness detection in patients with disorders of consciousness based on the multi-modal brain-computer interface For patients with disorders of consciousness, a variety of brain-computer interface paradigms were proposed for consciousness detection and auxiliary diagnosis. For example, a hybrid paradigm based on P300 and SSVEP was proposed, which was
significantly better in performance than the traditional P300-only or SSVEP-only paradigms. The brain-computer interface was used for consciousness detection in patients with disorders of consciousness, and its detection performance was better than that of the behavioral scales commonly used in clinics. Furthermore, the brain-computer interface was used to assess cognitive functions such as number recognition and mental arithmetic in these patients, with good results. Related results were published in the top IEEE journal Proceed IEEE and the international authoritative journals Sci Rep, BMC Neurol, etc. Major contribution 6: robust tensor decomposition and feature extraction based on a full probability model Feature extraction from noisy EEG signals is a core technology for brain-computer interaction. This project formulated the tensor decomposition problem using the temporal, spatial and frequency multi-mode characteristics of EEG signals, and gave a tensor decomposition algorithm together with an adaptive algorithm for tensor rank estimation under a full probability model of the EEG signals. The advantage of this model is that it can estimate the hidden factors and their number even with missing data. The result is suitable for feature extraction from complex data with high noise and interference, and has been applied to dynamic pattern recognition of EEG, visual event recognition, etc. The results were mainly published in IEEE Trans Pattern Anal Mach Intell, etc. Major contribution 7: higher-order partial least squares method and its application Higher-order partial least squares (HOPLS) is a generalized multi-linear regression method, mainly used to predict one tensor from another by establishing a regression model whose basic idea is to find hidden variables.
Unlike existing regression models, this model controls its complexity through a set of orthogonal tensors and orthogonal loading parameters to prevent overfitting. The low-dimensional hidden space is obtained by optimizing the rotation sequence and shrinking the dimensionality, so that it forms the common subspace with the greatest correlation between the input and output variables. In the dimensionality-reduction optimization, instead of decomposing the input and output variables independently, higher-order singular value decomposition uses the generalized cross-covariance to find the optimal orthogonal loading values. Experiments with synthetic data and real biological data show that HOPLS outperforms existing regression methods in prediction ability and robustness against noise, especially on small-sample data. 2. Degree of achieving leaping development The phased results include 55 published SCI papers, with more than 10 articles in top journals and 124 articles in important international journals (including IEEE Trans Pattern Anal Mach Intell, Proceed IEEE and IEEE Signal Process Mag) or at important international conferences. The research team has given 23 invited lectures at influential conferences in the field of
robotics. The project leader was elevated to IEEE Fellow in 2016 for his contributions to brain signal analysis and brain-computer interfaces, and won the first prize of the Guangdong Natural Science Award in the same year. The brain-controlled wheelchair designed by the team has been shown at external exhibitions many times and has attracted widespread attention from Chinese and overseas media; a special episode about it was produced by the CCTV program I Love Invention.
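The sparse-representation feature selection of major contribution 2 (nonzero weights mark selected features; weight signs split them between the two stimuli/tasks) can be illustrated with a minimal sketch. This is not the published algorithm: it uses a plain L1-penalized least-squares model solved by ISTA on synthetic data, with illustrative names throughout.

```python
import numpy as np

def lasso_ista(X, y, lam, iters=2000):
    """Sparse weight estimation via ISTA (iterative soft-thresholding).

    Minimizes 0.5*||X w - y||^2 + lam*||w||_1. Features with nonzero
    weight are 'selected'; the sign of each selected weight indicates
    which of the two stimuli/tasks the feature is associated with.
    """
    lr = 1.0 / np.linalg.eigvalsh(X.T @ X).max()  # step from Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - lr * (X.T @ (X @ w - y))          # gradient step on data term
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w

# Toy EEG-like features: feature 0 is tied to condition +, feature 1 to
# condition -, feature 2 is pure noise and should receive (near-)zero weight.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(1000)
w = lasso_ista(X, y, lam=300.0)
selected = np.flatnonzero(np.abs(w) > 1e-6)       # surviving (selected) features
```

In the actual pipeline the surviving features would additionally be screened with a nonparametric statistical test, as the contribution describes.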

3.2.2 Research on Theory and Key Technology of Intuitive Bionic Manipulation of Intelligent Prostheses Based on the Brain-Computer Interface This project conducted research on the basic issues and key technologies of intelligent bionic prostheses, and refined and developed the theory of fine control of robotic bionic hands, which has important theoretical significance and practical value for high-precision control of such hands. The project's contributions to solving core scientific problems and the degree to which it achieves leaping development are reflected as follows. 1. Contributions to solving core scientific problems Major contribution 1: completed theoretical innovation A bionic prosthetic manipulation theory for the brain-computer interface was proposed, including a theory of bionic prosthetic information fusion and neural feedback, a skill-transfer bionic control theory, and the mechanism of neural plasticity. It provided a technical basis for the intelligent operation of prostheses and put forward algorithms for skill matching, transfer and extension. These mainly include the use of optimization algorithms, variable-trajectory techniques and model-reference adaptive control to match the robot controller to operators with different characteristics, and the use of tactile recognition and variable impedance control to transfer the dexterity and versatility of the robot's operating skills. This laid a technical foundation for applying bionic prostheses to assist the elderly and the disabled. Major contribution 2: achieved the technical indicators and equipment upgrades This project developed a bionic prosthesis with 15 degrees of freedom.
The prosthesis adopts a lightweight design and anthropomorphic shape with high-sensitivity touch and position sensing; it supports brain-computer interface operation and closed-loop control and can sense contact force, vibration, fine-point contact, and temperature or heat flow. A wearable attachment was designed to support the movement and carrying capacity of the prosthesis. The project completed the design of the EEG command channel and the prosthetic sensory feedback channel, realized the EEG decoding algorithms and prosthetic sensory feedback, recorded various physical and biological signals, decoded sensory, muscle, and non-implanted EEG signals,
and established a recognition model. Both tactile recognition and variable impedance control were used to transfer the dexterity and multi-purpose operating skills of the robot. 2. Degree of achieving leaping development Main progress 1: decoding of EEG commands One of the main goals of this project is autonomous, intuitive control of a multi-degree-of-freedom prosthesis based on limb movement information decoded from EEG commands. The research in this part covers the extraction of EEG features and high-performance pattern classification algorithms based on those features. Feature extraction derives a set of feature information from the EEG signal to describe the signal pattern; action classification identifies the type of voluntary action by decoding that feature information. First, an action classifier based on a pattern recognition algorithm is trained with the signal features. Then the trained classifier decodes the EEG signals in real time, and its output selects the type of body movement. Neural pattern recognition is used for action recognition: a deep neural network extracts features from the pre-processed EEG signal. Its advantages are that it is unsupervised and multi-level and can autonomously extract structural information hidden in complex signals, which effectively improves classifier accuracy and reduces the difficulty of hand-crafting features. Feature selection chooses the best feature subset from the initial features and forms a feature vector, which is input to the classifier; the classifier then distinguishes different mental tasks according to the input feature vector and outputs the corresponding category number, which is called the logic control signal.
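The decoding pipeline just described (feature extraction, then a trained classifier emitting a category number as the logic control signal) can be sketched as follows. This is a toy illustration, not the project's deep-network decoder: it uses hand-picked band-power features, a nearest-centroid classifier, and synthetic single-channel epochs, with all names and the sampling rate assumed.

```python
import numpy as np

FS = 250  # sampling rate in Hz (illustrative)

def band_power(epoch, fs, lo, hi):
    """Mean spectral power of a 1-D EEG epoch in the [lo, hi] Hz band."""
    spec = np.abs(np.fft.rfft(epoch)) ** 2
    freqs = np.fft.rfftfreq(epoch.size, 1.0 / fs)
    mask = (freqs >= lo) & (freqs <= hi)
    return spec[mask].mean()

def features(epoch):
    """Feature vector: power in the mu (8-12 Hz) and beta (18-26 Hz) bands."""
    return np.array([band_power(epoch, FS, 8, 12), band_power(epoch, FS, 18, 26)])

def nearest_centroid(train_feats, train_labels, test_feat):
    """Decode a category number (the 'logic control signal') by nearest class mean."""
    classes = np.unique(train_labels)
    dists = [np.linalg.norm(test_feat - train_feats[train_labels == c].mean(axis=0))
             for c in classes]
    return int(classes[int(np.argmin(dists))])

# Toy epochs: class 0 = strong 10 Hz rhythm, class 1 = strong 22 Hz rhythm.
t = np.arange(FS) / FS
rng = np.random.default_rng(1)
def make_epoch(f_hz):
    return np.sin(2 * np.pi * f_hz * t) + 0.1 * rng.standard_normal(t.size)

train = np.array([features(make_epoch(10)) for _ in range(10)] +
                 [features(make_epoch(22)) for _ in range(10)])
labels = np.array([0] * 10 + [1] * 10)
pred = nearest_centroid(train, labels, features(make_epoch(10)))
```

The category number returned by the classifier is what the text calls the logic control signal; in the real system it would select the prosthetic movement type.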
To give the equipment adaptive ability, an intelligent EEG recognition module was designed and developed. It uses linear and non-linear recognition methods and combines offline training data with an online adaptation process to improve the effectiveness of the collected signals, providing the bionic prosthesis with more precise command signals. The module comprises the intelligent pattern recognition method, the module design and the corresponding software. A variety of paradigms were used to obtain human intention information, and multi-modal information was used to improve the accuracy of feature classification. For different EEG signals, different feature extraction and classification methods were adopted; the performance of the various methods was compared, and the best processing method for each kind of EEG signal was selected and recommended to users, who can also choose a method according to their needs. The multi-modal EEG signals are fused at the signal, feature and decision levels, so that the information is fully complementary and the recognition ability of the system improves. Main progress 2: multi-source physical quantity detection and sensory feedback of the bionic prosthesis
Intelligent tactile perception and information fusion methods were proposed, which realize high-speed collection of multi-source physical information, eliminate some physical differences between different kinds of information, obtain a unified understanding of the external environment, and realize accurate analysis of that environment and accurate manipulation of objects. By embedding sensing, collection, communication, perception and computing devices in the prosthesis, multi-scale, distributed collection and perception of the physical information of the external environment is carried out, achieving a close integration of computing and physics and making the tactile sensors sensitive, miniaturized, integrated and intelligent. The work focuses on the design of five subsystems (the environment perception module, preprocessing module, network communication module, cognitive computing module and neurofeedback) and on their integration; later experiments verified the performance of the sensor array. Main progress 3: real-time motion planning based on EEG commands The motion planning of the bionic prosthesis based on EEG commands adopted the following methods and technical routes: (i) the physical limits of the prosthesis's joints are considered and transformed into a single double-ended inequality constraint; (ii) obstacle avoidance is considered and transformed into linear inequality constraints; (iii) the optimization of various EEG-command objective functions is considered and transformed into quadratic form; (iv) the various problem descriptions and redundancy resolution schemes are unified as quadratic optimization problems driven by EEG commands. The dynamical system can be realized in hardware (usually by a circuit).
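The simplest instance of the quadratic-optimization formulation above, the minimum-velocity-norm scheme, has a closed-form solution via the Jacobian pseudoinverse. The sketch below illustrates that case only; the double-ended joint limits are enforced here by crude clipping rather than by the full inequality-constrained QP, and the Jacobian and limits are toy values.

```python
import numpy as np

def min_velocity_norm_step(J, xdot, qdot_min, qdot_max):
    """One planning step for a redundant limb.

    Solves min ||qdot||^2 subject to J @ qdot = xdot (closed form: the
    pseudoinverse), then enforces the double-ended joint-velocity limits
    by clipping -- a simplification of the inequality-constrained QP.
    """
    qdot = np.linalg.pinv(J) @ xdot            # minimum-norm exact solution
    return np.clip(qdot, qdot_min, qdot_max)   # respect physical joint limits

# Toy 3-joint planar arm: Jacobian maps joint rates to a 2-D hand velocity.
J = np.array([[1.0, 0.5, 0.2],
              [0.0, 1.0, 0.4]])
xdot = np.array([0.3, 0.1])                    # desired hand velocity
qdot = min_velocity_norm_step(J, xdot, -1.0, 1.0)
```

Because the arm has three joints but the task is two-dimensional, infinitely many joint-rate vectors realize the commanded hand velocity; the QP picks one according to the chosen cost (here, the smallest velocity norm).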
For the feasibility analysis, the various minimum-energy redundancy resolution schemes were unified into a quadratic optimization problem, covering the minimum-torque scheme, the inertia-inverse-torque scheme, the minimum-acceleration-norm scheme, the minimum-kinetic-energy scheme, the minimum-velocity-norm scheme and the repetitive-motion-planning scheme. Main progress 4: prosthetic movement based on skill learning and extension A data glove was used to collect the functional movements of the prosthesis wearer's healthy arm, together with electromyographic and peripheral-nerve electrical signals, to understand the person's behavior patterns and extract personalized movement-control characteristics such as the dynamics of finger movement control, movement trajectory planning and average movement speed. Using the collected electromyographic and positional information of the healthy arm, a motor neuron model was designed for planning and executing operational tasks, and simulation experiments, in particular grasping experiments, were conducted with an existing robotic arm for verification. In the experiments, the robot's upper limbs were controlled by the output of the motor neuron model; the vision module captured the image of the object in the robot's eye camera, and the tactile module corresponded to the robot's hand skin sensors. The motion-control data of the robotic arm were compared with the data
collected from the healthy arm, and the model parameters were further adjusted to match the motion characteristics of the healthy arm. The research team published 23 SCI papers, including 5 ESI highly cited papers and 15 papers in top journals (7 in IEEE Trans Ind Electron and 3 in IEEE Trans Cybern); 16 papers were published at important international conferences and 6 invention patents were authorized.

3.3 Breakthroughs in Visual Information Processing 3.3.1 Statistical Learning Method for Visual Attention Based on Scene Understanding Attention detection based on the biological visual cognitive mechanism is one of the basic scientific problems in visual computing, involving two important parts: the representation of salient visual patterns and their computation. The classic "bottom-up" visual attention computing model was proposed by Professor L. Itti of the University of Southern California in the late 1990s and has been widely influential. However, because the correspondence between visual features and salient visual stimuli, and the synergy among multiple visual features, were not clearly represented, such models easily run into difficulties such as poor accuracy and weak adaptability. 1. Contributions to core scientific problems To solve these problems, visual attention detection is defined as the detection of salient objects in scene images. Through large-scale supervised learning, the local, regional and global salient visual features of visual patterns (namely multi-scale contrast, center-surround histogram and color spatial distribution) are revealed, and a conditional random field model for the optimal fusion of these features was established. "Top-down" attention control in salient target detection was thereby realized; the best-performing salient target detection algorithm of its time was given, a new direction of visual saliency detection was opened, and important theoretical problems such as the automatic extraction and semantic annotation of salient targets and their fine contours from visual perception data were solved.
A large-scale image database (more than 20,000 images) for the quantitative evaluation of salient target detection algorithms was established; it has become one of the most widely used benchmark databases in the field for evaluating visual attention detection algorithms. Figure 3.1 shows the statistical learning method for visual attention, Fig. 3.2 shows the automatic extraction of the proposed visually salient targets and their fine contours, and Fig. 3.3 shows the established large-scale salient target detection database (more than 20,000 images).
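One of the three saliency features named above, local multi-scale contrast, can be sketched as the squared deviation of each pixel from its local mean, accumulated over several window sizes. This is a simplified illustration (box windows instead of the published feature's exact formulation; image and radii are toy values).

```python
import numpy as np

def box_blur(img, r):
    """Mean-filter an H x W image with a (2r+1)^2 box via an integral image."""
    p = np.pad(img, r, mode="edge")
    c = np.cumsum(np.cumsum(p, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))            # prepend a zero row and column
    h, w = img.shape
    win = 2 * r + 1
    s = (c[win:win + h, win:win + w] - c[:h, win:win + w]
         - c[win:win + h, :w] + c[:h, :w])     # window sums by inclusion-exclusion
    return s / win ** 2

def multiscale_contrast(img, radii=(1, 2, 4)):
    """Local multi-scale contrast: squared difference from the local mean,
    accumulated over several scales (one of the three saliency features)."""
    sal = np.zeros_like(img, dtype=float)
    for r in radii:
        sal += (img - box_blur(img, r)) ** 2
    return sal

# Toy image: a bright 4x4 square on a dark background is the salient object.
img = np.zeros((16, 16))
img[6:10, 6:10] = 1.0
sal = multiscale_contrast(img)
```

In the full method this feature map would be fused with the center-surround histogram and color-spatial-distribution features by the learned conditional random field.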

Fig. 3.1 Statistical learning method for visual attention: (b) salient representation (local multi-scale contrast, center-surround area, global color distribution); (c) computing model

Fig. 3.2 Automatic extraction result of the proposed visually significant targets and their fine contours (rows: input picture; user input (red box) method; user-selected standard point (green cross) method; our method)

2. Degree of achieving leaping development This work proposed learning-based selective visual attention detection, formed a new theory of salient target detection, opened up a new research direction for visual attention research and led the second wave of visual attention research. The above work published in IEEE Trans Pattern Anal Mach Intell has been cited more than 2000 times on Google Scholar, and the papers published at IEEE CVPR have been cited more than 600 times on Google Scholar; they have been widely cited by scholars in Europe, the United States and Japan. It is a well-recognized representative work on using statistical learning to study visual attention detection. For example, the authoritative review papers by Professor L. Itti of the University of Southern California, the pioneer and leader of visual attention computing models, devoted nearly 100 words to introducing our work. Professor P. H. S. Torr of Oxford University, a winner of the Marr Prize, the highest award in the field of computer vision, published a paper describing this work as "a major benchmark for the evaluation of visual attention detection algorithms".

Fig. 3.3 The established large-scale significant target detection database (more than 20,000 images)

3.3.2 Supporting the Recognition and Understanding of Traffic Signs for Driverless Vehicles Although the content concerns the recognition and understanding of traffic signs, it is an important part of the core perception of driverless technology. The technology is
mainly aimed at the perception of visual signals and can be effectively applied to key components of driverless smart vehicles: lane line detection, obstacle detection and signal light detection. The project's contributions to solving core scientific problems and the degree of leaping development it achieves are reflected as follows. 1. Contributions to solving core scientific problems This project completed its research work according to the scheduled plan, mainly carrying out algorithm research and system development for the detection and recognition of road traffic signs for driverless smart vehicles. The sub-items are listed below. Major contribution 1: in order to reduce the impact on the final results of poor-quality image and video acquisition, and to lay a good foundation for subsequent detection and recognition, we adopted a series of effective techniques in the pre-processing stage. For example, for image loss caused by truncation, we proposed a fast matrix completion method based on the truncated nuclear norm; to effectively recognize low-resolution traffic sign images, we proposed a super-resolution approach based on a multi-scale dictionary. Major contribution 2: in terms of data representation and feature extraction, to find effective visual features for traffic sign image recognition, we put forward an enhanced visual word model that considers cluster size. To solve the problem of fast, accurate matching of feature points between consecutive key frames in real-time 3D reconstruction of driving scenes, we proposed a fast feature-point matching method based on binary feature retrieval with convolutional Treelets.
Major contribution 3: for the rapid recognition of traffic sign images, we first proposed a hash-based nearest-neighbor search method and made a series of algorithmic improvements. Later, exploiting the independent learning and feature representation ability of deep neural networks, we designed and built a detection and recognition system for road traffic signs, which can be roughly summarized as coarse detection (sliding window), fine detection (CNN binary classification), tracking and recognition (CNN multi-class classification). Major contribution 4: to meet the system's real-time and robustness requirements and to speed up detection and classification as much as possible while maintaining accuracy, we added a saliency detection technique to reduce the search space for traffic signs and the false detection rate. For deep neural networks, the sparse learning ability of the current stochastic optimization algorithm was improved in order to learn sparse deep neural network models. In addition, we tried using the KNN classifier with a newly designed hash index method for nearest-neighbor search to improve speed. Major contribution 5: we collected 3 batches of real data, expanded and organized them appropriately, and put them into experimental analysis and running tests to
assist in improving the system composition and scheme design and enhancing performance and practical value. Major contribution 6: based on the research results obtained, we developed a prototype system and tested and evaluated it in the unmanned driving system; the performance was good and the functions achieved the expected effect. 2. Degree of achieving leaping development 76 SCI journal papers were published, including 3 ESI highly cited papers and 8 papers in top journals (such as IEEE Trans Pattern Anal Mach Intell and IEEE Trans Image Process); 22 papers were published in other important international journals or at important international conferences, and 3 patents were authorized. The research content of this project focuses on the theory of target detection and image recognition, as well as applied basic research and the development of a road traffic sign recognition system supporting driverless vehicles. The main progress and outstanding achievements are detailed below. Main progress 1: in driverless vehicles, the traffic sign pictures taken by the camera are often blurry, so improving the resolution of the image plays a very important role in target detection and recognition. Traditional reconstruction-based methods often show unreasonable details when the picture is greatly enlarged. The project team proposed a new sparse nearest-neighbor selection method for super-resolution reconstruction. Borrowing the idea of sparse representation, by introducing local structure information and using the sparsity and robustness of the L1 norm, the k nearest neighbors in the relevant subset are appropriately selected for reconstruction, which has great advantages over ordinary KNN-based algorithms: the reconstruction speed is improved while quality is maintained. The paper was published in the top international computer vision journal IEEE Trans Image Process.
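The example-based reconstruction idea behind this progress can be sketched with plain k-NN neighbor selection over a paired low-/high-resolution patch dictionary. This is a toy: the published method selects neighbors with an L1-based sparse criterion, whereas here ordinary Euclidean k-NN and least-squares weights are used, and the dictionary and upsampling are synthetic.

```python
import numpy as np

def knn_reconstruct(lr_patch, lr_dict, hr_dict, k=3):
    """Example-based super-resolution of one patch.

    Finds the k nearest low-resolution dictionary atoms, fits least-squares
    combination weights on them, and applies the same weights to the paired
    high-resolution atoms.
    """
    d = np.linalg.norm(lr_dict - lr_patch, axis=1)
    idx = np.argsort(d)[:k]                               # k nearest atoms
    w, *_ = np.linalg.lstsq(lr_dict[idx].T, lr_patch, rcond=None)
    return hr_dict[idx].T @ w                             # same weights on HR pairs

# Toy paired dictionary: HR atoms are the LR atoms upsampled by repetition.
rng = np.random.default_rng(2)
lr_dict = rng.standard_normal((50, 4))                    # 50 atoms, 2x2 LR patches
hr_dict = np.repeat(lr_dict, 4, axis=1)                   # matching 4x4 HR patches
lr_patch = lr_dict[7]                                     # query already in dictionary
hr_patch = knn_reconstruct(lr_patch, lr_dict, hr_dict)
```

When the query coincides with a dictionary atom, the fitted weights reproduce it exactly and the output is the paired high-resolution atom; real queries land between atoms and are interpolated.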
Main progress 2: in target tracking and recognition, image completion is used to solve the image-loss problem caused by occlusion, stains and other factors. Restoring the missing information of an incomplete image can be seen as a matrix completion problem. Traditional nuclear-norm-based matrix completion methods need to satisfy certain theoretical assumptions that are often hard to meet in practice. Therefore, to address the difficulty that current matrix completion algorithms cannot accurately recover the low-rank structure of an image, we proposed a new matrix completion algorithm based on truncated nuclear norm regularization, which relaxes these assumptions. This work was published in IEEE Trans Pattern Anal Mach Intell, a top international journal on computer vision and pattern recognition. Main progress 3: non-negative matrix factorization is an effective data representation technique that can uncover latent data structure and obtain parts-based representations. It has a wide range of applications in pattern recognition, information retrieval, computer vision and other fields. However, non-negative matrix factorization is an unsupervised learning algorithm, and it is difficult for it to make effective use of labeled data.
Therefore, the project team took the labeled data as constraints and innovatively proposed a semi-supervised matrix factorization algorithm, called constrained non-negative matrix factorization. This algorithm can effectively fuse label information, so that the decomposed matrix factors have stronger discriminating ability. This work was published in IEEE Trans Pattern Anal Mach Intell, a top international journal on computer vision and pattern recognition.
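The unsupervised factorization that the constrained variant builds on can be sketched with the standard Lee-Seung multiplicative updates. This shows plain NMF only, on synthetic data; the project's algorithm additionally ties the coefficient vectors of samples that share a label.

```python
import numpy as np

def nmf(V, rank, iters=1000, seed=0):
    """Plain (unsupervised) NMF via multiplicative updates: V ~ W @ H,
    with W, H element-wise non-negative throughout."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 0.1
    H = rng.random((rank, m)) + 0.1
    eps = 1e-9                                   # avoid division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)     # update coefficients
        W *= (V @ H.T) / (W @ H @ H.T + eps)     # update basis
    return W, H

# Toy non-negative data with an exact rank-2 factorization.
rng = np.random.default_rng(3)
V = rng.random((20, 2)) @ rng.random((2, 30))
W, H = nmf(V, rank=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Because the updates are multiplicative, non-negativity of the initial factors is preserved automatically, which is what yields the parts-based representations mentioned above.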

3.3.3 Graphics and Text Detection, Recognition and Understanding of Traffic Signs Under Complex Conditions

During the execution of the project, deep learning made a breakthrough in the field of image recognition. The research group used deep learning to solve the image recognition problem: on the one hand, it improved recognition performance; on the other hand, it avoided the cumbersome process of manually designing features, greatly improving the effectiveness and robustness of image recognition. The project's contributions to solving core scientific problems, and the extent to which it achieved leaping development, are reflected as follows.

1. Contributions to solving core scientific problems

Major contribution 1: algorithms for removing rain, snow and fog from images/videos

In processing traffic sign graphics and text for driverless smart vehicles, rain, snow and fog seriously affect the automatic detection, recognition and understanding of traffic sign images. For rain and snow, we adopted multiple low-pass filtering methods for removal; for heavy fog and haze, we used a defogging method based on the fog and haze imaging model. The proposed defogging algorithm is based on a physical model: through estimation of the global atmospheric light, calculation of the atmospheric transfer function and restoration of the scene radiance, the real scene can be recovered from the fog-degraded image. This method effectively removes the influence of haze in the image and greatly improves the image contrast. In addition, the restored scene radiance not only maintains the color of the real scene, but also preserves the details of objects in the scene to the maximum extent. At the same time, the method has high processing speed.
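The three restoration steps described above (estimate the global atmospheric light A, estimate the transmission t, then invert the haze model I = J·t + A·(1 − t)) can be sketched roughly as below, here using a dark-channel-style transmission estimate for illustration; the project's actual estimator is not specified, so every parameter and helper is an assumption.

```python
import numpy as np

def dehaze(I, omega=0.95, t0=0.1, patch=7):
    """Restore scene radiance from the haze model I = J*t + A*(1 - t).
    I: HxWx3 image in [0, 1]. Illustrative sketch only."""
    H, W, _ = I.shape
    # dark channel: per-pixel min over color channels, then a min filter
    dark = I.min(axis=2)
    pad = patch // 2
    padded = np.pad(dark, pad, mode='edge')
    dc = np.stack([padded[i:i + H, j:j + W]
                   for i in range(patch) for j in range(patch)]).min(axis=0)
    # global atmospheric light A: mean color of the brightest dark-channel pixels
    n_top = max(1, int(0.001 * H * W))
    flat = dc.ravel().argsort()[-n_top:]
    A = I.reshape(-1, 3)[flat].mean(axis=0)
    # transmission estimate, then scene radiance restoration J = (I - A)/t + A
    t = 1.0 - omega * (I / A).min(axis=2)
    J = (I - A) / np.maximum(t, t0)[..., None] + A
    return np.clip(J, 0.0, 1.0)
```

The lower bound t0 on the transmission keeps the division stable in dense haze, which is what preserves color fidelity while the contrast is restored.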
On an Intel(R) Core(TM) i5 CPU (3.2 GHz main frequency, 8 GB memory), a C++ implementation processes 600 × 400 video at 30 fps, which meets the requirements of real-time processing systems.

Major contribution 2: enhancement algorithms for night images

For night image enhancement, an improved algorithm based on Retinex is proposed. By using a Sigmoid function instead of the log function of the original Retinex algorithm, the improved algorithm can not only omit the later correction steps, but


also ensure that details in both high- and low-brightness regions are not ignored. The noise in the dark parts of a night image is mainly caused by the very low brightness of the surrounding area, which means that the details of such regions, largely lost by common compression algorithms, can produce large amounts of noise or useless block artifacts after recovery. When enhancing the night image, the present algorithm reduces the noise in the dark parts of the enhanced image by increasing the weight of the original image. In addition, for details in highlights, it assigns different image weights to different regions, which achieves better night enhancement effects. The improved Retinex-based algorithm can keep the color perception of the image consistent while compressing its dynamic range. The enhanced image shows greatly increased global and local contrast and a significantly improved visual effect.

Major contribution 3: traffic sign recognition algorithms under complex conditions

For complex conditions (including faded and defaced traffic signs), an image contrast enhancement algorithm is adopted for preprocessing, and a traffic sign recognition system based on a deep learning model is designed. It selects the convolutional neural network, the model most suitable for image recognition problems under complex conditions. The image is taken as the input layer, and automatic learning from feature extraction to classification is formed through multiple convolutional layers, down-sampling layers and a final classifier. Currently, a convolutional neural network whose layer channel numbers are 70, 110 and 180 respectively is used. The GPU is used for back-propagation training, and the final loss function is a large-margin hinge loss.
Because neural network computation is slow and cannot meet the real-time recognition requirements of driverless smart vehicles, a series of acceleration techniques are used to improve the recognition speed of the system: (i) the picture obtained by the camera is divided into several sub-pictures without any loss, and each sub-picture is sent to a computing node for detection, so the detection speed is greatly improved through this multi-core parallel method; (ii) for each computing unit, a cascaded classifier is designed to speed up the detection process. The designed traffic sign recognition system achieved excellent results on internationally public datasets and in the driverless intelligent vehicle competition. On the German traffic sign recognition dataset, the system reached a test-set accuracy of 99.65%, breaking the previous best record on this dataset, held by European scholars.

2. Degree of achieving leaping development

We published 40 papers, among which 14 papers were published in top journals (including TITS and other journals) and 22 papers in important international journals or at influential international conferences (including AAAI, SIGKDD and ICML).


As noted above, the traffic sign recognition system designed by the research group achieved excellent results on international public datasets and in the driverless intelligent vehicle competition, including the record-breaking 99.65% test-set accuracy on the German traffic sign recognition dataset. We also won second place in the Traffic Scene Image Enhancement Challenge and third place in the Traffic Sign Image Recognition Challenge of the 2015 China Blurred Image Processing Competition.
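The tile-and-dispatch parallelization described above (split the frame losslessly into sub-pictures, detect on each computing node, then merge) can be sketched as below. The per-tile detector here is a trivial bright-blob placeholder standing in for the cascaded classifier; tile size, overlap and all names are assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def detect_tile(tile):
    """Placeholder per-tile detector returning boxes (x, y, w, h) of bright
    blobs; stands in for the cascaded classifier on each computing node."""
    ys, xs = np.nonzero(tile > 0.9)
    if xs.size == 0:
        return []
    return [(xs.min(), ys.min(), xs.max() - xs.min() + 1, ys.max() - ys.min() + 1)]

def parallel_detect(img, tile=64, overlap=16, workers=4):
    """Split the image into overlapping tiles, detect on each in parallel,
    and shift the boxes back to full-image coordinates."""
    H, W = img.shape
    step = tile - overlap          # overlap so signs on tile borders are seen
    jobs = [(x0, y0, img[y0:y0 + tile, x0:x0 + tile])
            for y0 in range(0, H, step) for x0 in range(0, W, step)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        results = ex.map(lambda j: [(j[0] + x, j[1] + y, w, h)
                                    for x, y, w, h in detect_tile(j[2])], jobs)
    return [b for boxes in results for b in boxes]
```

The overlap between tiles is the standard trick to keep the split lossless: an object that straddles one tile boundary still falls wholly inside a neighboring tile.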

3.4 Leaping Development of Natural Language Processing

Taking the new theories, new structures, new methods and new technologies involved in natural language processing as breakthrough points, original innovations in scientific theories and experimental techniques have been realized, which have driven the development of the basic theories of natural language processing in China. On several important basic scientific issues, a number of internationally leading or advanced research results have been obtained, a series of papers with original innovative ideas have been published in important international academic journals, and China has secured a place in international basic research on natural language processing.

3.4.1 Research on Semantic Calculation Methods for Chinese Text Comprehension

This project combines traditional methods for Chinese text understanding with currently popular deep learning methods, with the overall goal of establishing an effective, high-performance new approach to Chinese semantic computing. The research is carried out at three levels: semantic representation and acquisition, semantic computing models, and semantic reasoning. Its contributions to solving core scientific problems and the extent of achieving leaping development are reflected as follows.

1. Contributions to solving core scientific problems

Through the research of this project, we created a new set of theoretical frameworks and implementation methods for Chinese semantic calculation, including: solving the problem of insufficient Chinese semantic knowledge resources by creating an effective Chinese semantic knowledge base; solving the sparsity problem of existing symbol-based semantic representation and calculation methods by creating new semantic representation and calculation methods based on deep learning and distributed representation; and solving the problem that existing reasoning algorithms cannot take global information into account. The existing semantic logic calculation unit is extended from the word level to the sentence and even the text level


to meet the needs of more complex intelligent text reasoning. On the basis of the above theoretical frameworks and methodological research, a high-level automatic question-answering system for geography and health consultation was realized. The contributions are mainly divided into the following aspects.

Major contribution 1: multi-level and cross-language Chinese semantic representation and computation

(1) Semantic representation learning and computation of context-sensitive words. This study defines a word vector as two parts: (i) a semantic center; (ii) a context bias. The semantic center of the word vector is similar to the traditional method, while the context bias is determined by the specific context. The core idea is to use the context to calculate a bias weight that acts on the semantic center of the word, shifting the semantic center and forming the vector semantic representation of the word in the current context, thereby addressing the polysemy problem.

(2) Clause semantic representation learning and calculation for implicit text relationship recognition. We proposed a text analysis method based on multi-view learning. From the perspective of semantic relationships among text units, this method models the relationships among text semantic units from the views of relation classification and relation translation. In addition, from the view of semantic feature representation, the method learns the semantic representation of text units at the levels of words, phrases, clauses and sentences.

(3) Sentence semantic representation and computation inspired by human attentional mechanisms. We study a distributed semantic representation of sentences based on unsupervised integration of attentional mechanisms. The method first assumes that the semantic representation of a sentence can be obtained by combining the semantic representations of its words.
According to this assumption, the distributed semantic representation of sentences is broken down into three sub-problems: (i) the semantic representation of words; (ii) the attention weight of each word; (iii) the way semantics are combined from words into sentences.

(4) The first unsupervised cross-language semantic representation learning method. We proposed a new loss function to construct a semantic representation model by matching the hidden state values of the source language and target language. While learning the semantic representations of the two languages, we dynamically estimate the distribution characteristics of the word embeddings of the two languages and continuously reduce the differences between them through the standard back-propagation algorithm.

Major contribution 2: automatic acquisition of semantic knowledge based on query logs and cross-language mapping

(1) Automatic acquisition of Chinese semantic knowledge based on massive query logs. We proposed a method for mining Chinese syntactic dependencies from massive query logs. Experimental results show that our method effectively improves the performance of the Chinese syntactic analysis model by using


the query log. Insufficient domain adaptability is also an important problem in the research and development of Chinese syntactic analyzers. We proposed a method for mining domain syntactic knowledge from unlabeled data on the Internet, which effectively improves the performance of the syntactic analysis model on texts from new domains.

(2) Chinese semantic knowledge acquisition based on cross-language mapping. The knowledge graph is one of the key resources of semantic computing, but at present the number of Chinese knowledge graphs is seriously insufficient. We proposed a knowledge graph translation method based on the topic model, which automatically maps massive English semantic knowledge to Chinese and effectively compensates for the lack of Chinese semantic knowledge.

Major contribution 3: application of syntactic and semantic methods to neural machine translation

(1) Context-aware semantic smoothing approach for NMT. We proposed a new context-aware smoothing approach, which relies on the local context words of the sentence to dynamically learn a sentence-specific vector for each word, and then uses this vector to smooth the corresponding word. The smoothed word vector is a sentence-specific representation used as the NMT word embedding layer. Providing NMT with contextually smoothed word representations to strengthen the word embedding layer helps improve translation performance.

(2) Source-side dependency representation for NMT. To capture long-distance constraints on the source side, we extract a dependency tuple for each word from the dependency tree of the source sentence. With this tuple, the dependency constraints of each source-side word can be encoded explicitly, yielding a source sentence vector that encodes more source-side information.
To learn the semantic vector of this dependency unit, a standard CNN is used to encode the corresponding sequence into a combined semantic vector with the same dimension as the word vector, which is called the source-side dependency representation.

(3) Syntax-oriented attention method for NMT. Syntactic distance constraints are used to guide the learning of attention alignment, so that the translation context can focus on syntactically more relevant source-side annotation vectors, thereby improving translation. First, the decoder learns the source-side position to be aligned at the current time step and the corresponding alignment weight, and then adjusts the original attention alignment weights. Second, the source-side annotation vectors that are syntactically related to the aligned source words are used to calculate the context vector. Finally, this context vector is used to predict the target word at the current time step.
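The syntax-oriented reweighting step can be sketched in isolation: given hop distances on the dependency tree, positions syntactically close to the currently aligned source word receive a bonus before the attention weights are renormalized. The Gaussian bias and all function names are our own illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def tree_distances(heads):
    """All-pairs hop distance on a dependency tree given head indices (root = -1)."""
    n = len(heads)
    adj = [[] for _ in range(n)]
    for i, h in enumerate(heads):
        if h >= 0:
            adj[i].append(h)
            adj[h].append(i)
    D = np.full((n, n), np.inf)
    for s in range(n):
        D[s, s] = 0
        stack = [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if D[s, v] == np.inf:
                    D[s, v] = D[s, u] + 1
                    stack.append(v)
    return D

def syntax_guided_attention(scores, dep_distance, aligned_pos, sigma=1.0):
    """Reweight attention: positions syntactically close (in dependency-tree
    hops) to the currently aligned source word get a Gaussian bonus."""
    d = dep_distance[aligned_pos]           # tree distance from aligned word
    bias = np.exp(-d ** 2 / (2 * sigma ** 2))  # 1 at the aligned word, decaying
    weighted = scores * bias
    return weighted / weighted.sum()
```

The effect is that the context vector is computed mostly from annotation vectors of words that are syntactic neighbors of the aligned word, rather than merely adjacent in the surface string.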


Major contribution 4: Chinese syntactic analysis based on neural networks and feature templates

(1) Deterministic attention mechanism for syntactic analysis tasks. A new attention mechanism, the deterministic attention mechanism, is proposed. It can select words that are genuinely helpful for syntactic analysis as context without destroying the conciseness of the sequence-to-sequence model.

(2) Syntactic analysis combining neural networks with traditional methods. A syntactic analysis method is proposed that combines the traditional feature-template-based approach with the sequence-to-sequence neural approach. This combination avoids the shortcomings of each method while retaining the advantages of both.

(3) Construction of a Chinese sentence functional component treebank and functional component analysis. To better understand Chinese sentences, a complete, nested annotation scheme for a Chinese sentence functional component treebank has been established, covering seven major components: subject, predicate, object, adverbial, attributive, head word and complement. Under this scheme, the Penn Chinese Treebank is transformed into a treebank with functional component annotations, on the basis of which three methods are used to realize functional component analysis of Chinese sentences.

Major contribution 5: template-based model for understanding and reasoning about mathematical word problems

A template-based reasoning model is used to solve mathematical word problems. Mapping a math word problem to equations generally requires two steps: first, select a template to define the overall structure of the equation system, and then instantiate it with the numerals and nouns in the text.
Second, the derivation from a word problem to its answer can be defined as a triple (T, p, a), in which T is the selected system template, p is an alignment from template T to word problem x, and a is the answer produced by instantiating the template with x through the alignment p.

2. Degree of achieving leaping development

Related papers were published at international conferences, including the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-2017) and the International Joint Conference on Artificial Intelligence (IJCAI-2016), and in the international academic journal IEEE Trans Audio Speech Lang Process.
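The (T, p, a) pipeline of major contribution 5 can be illustrated with a deliberately tiny toy: a keyword heuristic stands in for the template classifier, number extraction stands in for the alignment p, and evaluating the template yields the answer a. Every template, trigger word and function here is a made-up stand-in, not the project's model.

```python
import re

# hypothetical micro version of the template-based pipeline:
# T = equation template, p = alignment of slots to numbers, a = answer
TEMPLATES = {
    "sum":     lambda n1, n2: n1 + n2,   # "x = n1 + n2"
    "product": lambda n1, n2: n1 * n2,   # "x = n1 * n2"
}

def solve(problem):
    """Return the triple (T, p, a) for a two-number word problem."""
    nums = [float(t) for t in re.findall(r"\d+(?:\.\d+)?", problem)]
    # crude trigger-word template selection, standing in for the classifier
    name = "sum" if any(w in problem.lower()
                        for w in ("altogether", "total", "in all")) else "product"
    p = {"n1": nums[0], "n2": nums[1]}   # alignment: template slots -> text numbers
    a = TEMPLATES[name](p["n1"], p["n2"])
    return name, p, a
```

In the real model the template is chosen by a learned scorer and the alignment p is searched over, but the structure of the derivation, template first, alignment second, answer by instantiation, is the same.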


3.4.2 Theoretical and Methodological Research on Integration Between Human-Like Hearing Mechanisms and Deep Learning

This project focuses on integrating the human auditory mechanism with deep learning algorithms, gives full play to the cross-disciplinary advantages of information science and life science, and promotes new breakthroughs in intelligent speech technology. The contributions to solving core scientific problems and the degree of realizing leaping development are as follows.

1. Contributions to solving core scientific problems

Main contribution 1: auditory attention model incorporating voiceprint memory and attention selection

In complex environments, people separate and select different sound sources through the joint action of the auditory system and the brain. When using neural networks to solve the speech separation problem, attention selection and memory mechanisms are also very important means of improving the performance of speech separation technology. The memory unit used in existing neural network frameworks is a kind of temporary built-in memory: all the memory information is a hidden-layer representation obtained by applying the trained network parameters to the current data, which suffers from limited duration, low application value and low reusability. In addition, practical applications often involve data that are unseen or rarely seen in the training set, which is difficult to capture in the parameters of an encoder-decoder model. Based on this, we proposed a network equipped with an external long-term memory unit to encode and memorize speech data. This method can save and consolidate longer histories of data: it can not only use the sound source information stored in memory, learned from training data, for high-accuracy selection and speech separation, but also flexibly identify and record rare sound sources that have never appeared before. The auditory attention model incorporating voiceprint memory and attention selection can realize both stimulus-driven auditory attention selection (a passively noticed sound stimulus, such as one's name being called) and task-oriented selection (actively chatting with friends). Related work was published at AAAI 2018, a top conference on artificial intelligence.
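The external long-term memory idea can be sketched as a key-value store of speaker (voiceprint) embeddings that is read by cosine similarity and can enroll a previously unseen source instead of failing, which is exactly the behavior the paragraph attributes to the memory unit. The class below is a minimal illustrative sketch under our own assumptions, not the published model.

```python
import numpy as np

class VoiceprintMemory:
    """External long-term memory sketch: stores one embedding per speaker and
    retrieves (or enrolls) by cosine similarity, independent of any training."""

    def __init__(self, threshold=0.8):
        self.keys, self.names, self.threshold = [], [], threshold

    def query(self, emb, name=None):
        emb = emb / np.linalg.norm(emb)
        if self.keys:
            sims = np.array([k @ emb for k in self.keys])
            best = int(sims.argmax())
            if sims[best] >= self.threshold:
                # running average keeps the stored voiceprint up to date
                k = self.keys[best] + emb
                self.keys[best] = k / np.linalg.norm(k)
                return self.names[best], float(sims[best])
        # unseen (rare) sound source: record it instead of failing
        self.keys.append(emb)
        self.names.append(name or f"speaker_{len(self.keys)}")
        return self.names[-1], 0.0
```

Because the store lives outside the network parameters, its contents persist across utterances, which is the "save and consolidate longer histories" property contrasted above with temporary hidden-state memory.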
Main contribution 2: top-down auditory attention model of "hearing first", then "inferring", then "attending"

When solving cocktail party problems in multi-talker auditory scenes, traditional models often use a feed-forward neural network to construct a multi-label regression model that fits the target speaker's speech. There is no goal orientation in this process, and all channels need to be separated. Based on the mechanism of auditory attention, we reconstructed an auditory pathway framework of "hearing first", then "inferring", then "attending", which infers multiple candidate speakers bottom-up and pays top-down auditory attention to their respective speech pathways. It solves a series of problems caused by the lack of goal orientation in


auditory attention tasks, and overcomes the limitation that the number of speakers must be given in advance. The related work was published at IJCAI 2018, a top conference on artificial intelligence.

Main contribution 3: an auditory attention model with long-term, goal-driven and self-play learning mechanisms

Using deep reinforcement learning and generative adversarial networks to train an intelligent agent that simulates human speech information processing, through constant trial and error and self-play, we constructed an auditory attention model with long-term, goal-driven and self-play learning mechanisms. The game among adversarial networks is used to learn the speech distribution of specific speakers, and reinforcement learning is used to deal with the inconsistency between the training objective and the test metric, so as to strengthen the optimization target of attending to speech and ultimately bring the auditory model as close as possible to, or beyond, the human level of auditory processing. The related work was published at IJCNN 2018, the International Joint Conference on Neural Networks.

2. Degree of achieving leaping development

Related papers were published at the 27th International Joint Conference on Artificial Intelligence (IJCAI). The paper "Modeling Attention and Memory for Auditory Selection in a Cocktail Party Environment" was published at the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-2018).

3.4.3 Research on Theory and Technology of Speech Separation, Content Analysis and Understanding in Multi-person and Multi-party Dialogue

This project works closely around auditory perception and speech information processing, studies theories and methods for the processing and understanding of multi-person and multi-party dialogue speech in specific scenes, and has made a series of advances. Through research on the basic theory of auditory perception and the core technologies of speech information processing, the overall research strength of China in the field of audio-visual information processing has been enhanced, and outstanding talents and teams with international influence have been cultivated, providing a good research environment and strong technical support for national security and social development. The contributions to solving core scientific problems and the degree of realizing leaping development are reflected as follows.

1. Contributions to solving core scientific problems

Main contribution 1: study on the mechanisms of auditory perception in noise

In view of key scientific problems involved in auditory anti-noise mechanisms and methods, a series of in-depth studies have been carried out, and remarkable


progress has been made in the following aspects: the mechanism of auditory informational masking, the mechanism of content activation in speech masking, the anti-interference mechanism of speech understanding under reverberation, the binaural hearing mechanism, the intelligibility of noisy speech, multi-channel audio coding and so on. Around these studies, more than 20 academic papers were published in important journals and at international conferences in this field, and 5 national invention patents were granted. Many reports on the relevant achievements were given at international academic exchanges and were well received by international peers.

Main contribution 2: research on sound source localization

Sound source localization technology is widely used in fields such as speech signal pickup and automatic video surveillance. Low computational efficiency and environmental acoustic interference are the two main challenges faced by broadband single-source localization technology. A single-source localization method that is both computationally efficient and robust is proposed to estimate the angle of arrival of the sound source relative to the array. The robustness of the proposed method is addressed from two aspects: a source signal enhancement method based on the eigen-space is used to eliminate the influence of acoustic interference on localization, and weight coefficients are then used to reduce the influence of outliers in the paired time delays on localization. The algorithm was tested in both simulated and real environments. Experiments showed that the algorithm is significantly more robust than the classical maximum steered response power beamforming method and about 10 times more computationally efficient on a nine-element array.
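As background for the paired time delays mentioned above: the delay between two microphones is classically estimated with GCC-PHAT, whose phase-transform whitening makes the correlation peak robust to reverberation. The sketch below illustrates that classical baseline, not the project's eigen-space method.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay of `sig` relative to `ref` with classical
    GCC-PHAT (cross-correlation with phase-transform whitening)."""
    n = sig.size + ref.size              # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-15), n=n)   # PHAT: keep phase only
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs    # peak position in seconds
    return tau
```

Given the delay for each microphone pair and the array geometry, the angle of arrival follows from simple trigonometry; that geometric step is what the weighting scheme above makes robust to outlier delays.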
Main contribution 3: research on front-end processing for far-field speech communication and recognition

The core technologies include microphone array-based speech enhancement, separation and localization, as well as full-duplex echo cancellation and dereverberation. The system also adopts a priori target sound source localization from the wake-up word, and original directional speech activity detection technology. Through the comprehensive application of these technologies, high signal-to-noise-ratio pickup and high-quality front-end processing for the recognition system can be achieved indoors (3–5 m).

Main contribution 4: speaker recognition research

This project studied robust voiceprint modeling under changing acoustic conditions and feature extraction for speaker recognition in complex channel environments. Traditional speaker recognition uses the basic MFCC feature and its derived features (first- and second-order differences). These features are easy to compute, but stitching them together introduces correlations that hinder back-end modeling. The correlation is removed by a two-dimensional discrete cosine transform, yielding time-frequency cepstral features. The male speaker data of NIST SRE08 were used for testing. Without increasing the feature dimension, the recognition rate of the proposed two-dimensional time-frequency feature is about 10% higher than that of the traditional MFCC feature.
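The 2-D DCT decorrelation step can be sketched directly: apply an orthonormal DCT-II along the time axis and along the cepstral axis of a stacked MFCC context block, then keep a low-order coefficient block as the feature. Block sizes and the low-order selection here are illustrative assumptions.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    M = np.cos(np.pi * (2 * np.arange(n)[None, :] + 1) * k / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

def tf_cepstral_features(mfcc_context, kt=3, kf=4):
    """Decorrelate a stacked (frames x mfcc_dim) context block with a 2-D DCT
    and keep the low-order kt x kf time-frequency coefficients as the feature."""
    T, F = mfcc_context.shape
    C = dct_matrix(T) @ mfcc_context @ dct_matrix(F).T   # separable 2-D DCT
    return C[:kt, :kf].ravel()
```

Because the DCT basis is orthonormal, neighboring stacked frames that were correlated in the raw representation map to nearly independent coefficients, which is the stated benefit for the back-end model.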


Main contribution 5: research on acoustic modeling based on deep belief networks

The structure of a deep network is closer to the mechanism by which humans perceive speech, and it can extract high-order information from speech signals. At the same time, neural networks have better discriminative learning ability, yielding better model classification performance. Within the HMM framework, state probabilities are estimated with a deep neural network instead of a Gaussian mixture model. To address the difficulty of training multi-layer neural networks, restricted Boltzmann machines are used for pre-training. Compared with a traditional HMM/GMM system trained with the MPE criterion, performance on the telephone natural spoken speech test set improved, with the word error rate reduced by 5–11%.

Main contribution 6: research on language modeling based on neural networks

Statistical n-gram models have been widely used in speech recognition systems. Although simple and effective, they are insufficient for modeling long histories and high-level linguistic features. Neural network-based deep learning is a new line of exploration that overcomes the disadvantages of n-gram language models by modeling language in a continuous space. A language model based on a recurrent neural network was established; compared with the traditional n-gram statistical model, perplexity is reduced by 20–30%.

Main contribution 7: research on fast decoding

The recurrent neural network language model is applied to second-pass decoding in speech recognition. On the 863 natural spoken dialogue test set, the word error rate dropped from 34.8% to 33.6%. The speed of the second-pass decoding is optimized with a proposed multi-candidate rescoring algorithm based on a prefix tree, which is about 11 times faster than the traditional method.
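The prefix-tree speedup rests on a simple counting argument: N-best hypotheses share long prefixes, and a trie lets the language model score each distinct prefix extension once instead of once per candidate. The sketch below only counts LM calls to make that saving visible; the structure and names are our own illustration, not the project's decoder.

```python
class TrieNode:
    __slots__ = ("children",)
    def __init__(self):
        self.children = {}

def count_lm_calls(candidates):
    """Compare LM evaluations: naive rescoring scores every word of every
    candidate; a prefix tree scores each shared prefix extension only once."""
    naive = sum(len(c) for c in candidates)
    root, trie = TrieNode(), 0
    for cand in candidates:
        node = root
        for w in cand:
            if w not in node.children:
                node.children[w] = TrieNode()
                trie += 1          # the LM is called only for new trie edges
            node = node.children[w]
    return naive, trie
```

With realistic N-best lists, whose hypotheses typically differ only near the end, the trie count is a small fraction of the naive count, which is the source of the reported order-of-magnitude speedup.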
Main contribution 8: research on robust dialogue understanding

Most traditional dialogue understanding work is based mainly on text sentences, ignoring other information sources that strongly influence intention understanding, such as various kinds of non-linguistic information and contextual information. In addition, speech recognition results are likely to contain errors in practical applications. Our intention recognition method is based on multi-modal information fusion. On the one hand, for extracting key concepts from text, a keyword extraction and confirmation method based on the N-best syllable lattice is adopted, and a syllable distance measure based on the vowel recognition confusion matrix improves the detection rate and reduces the false alarm rate. On the other hand, context information is used to assist flexible command parsing, and a decision-level fusion method with confidence-weighted superposition of multi-modal information is adopted, which accounts for the confidence of different information sources and improves the accuracy of recognizing the speaker's emotional state.
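The confidence-weighted superposition at the decision level reduces to a weighted sum of per-modality class scores, as in this minimal sketch (the weighting scheme is an assumption; the project's exact formula is not given):

```python
import numpy as np

def fuse(scores, confidences):
    """Decision-level fusion sketch: per-modality class scores weighted by the
    modality's confidence, then summed and renormalized."""
    scores = np.asarray(scores, dtype=float)     # shape: (modalities, classes)
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()                              # normalize modality confidences
    fused = (w[:, None] * scores).sum(axis=0)    # weighted superposition
    return fused / fused.sum()
```

A low-confidence modality, for example error-prone ASR text, is thereby down-weighted, so the final decision leans on the more reliable sources, which is the intended robustness gain.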


2. Degree of achieving leaping development

This project mainly studies theories and methods for the processing and understanding of multi-person and multi-party dialogue speech in specific scenes, including: speech analysis methods and theory in multi-source reverberant scenes; robust voiceprint modeling and speech recognition under acoustically variable conditions; and language understanding and intention analysis in multi-person, multi-party dialogue environments. The main innovations are as follows.

Main progress 1: for speech analysis in multi-source reverberant scenes, we carried out in-depth analysis and exploration, focusing on auditory perceptual excitation, binaural hearing mechanisms, the intelligibility of noisy speech, the influence of speech enhancement on intelligibility, speech enhancement in complex acoustic environments, sound source localization and endpoint detection, multi-channel speech decoding, etc. A series of research advances was made, providing theoretical and methodological guidance for speech processing.

Main progress 2: in speech recognition, the project studied noise-robust features, pronunciation features, acoustic modeling, decoding and keyword recognition. In speaker recognition, robust voiceprint modeling under acoustically variable conditions and feature extraction in complex channel environments were studied. The results showed that the performance and efficiency of speech recognition and speaker recognition are significantly improved.

Main progress 3: for language understanding and intention analysis, a key concept extraction method based on the N-best syllable lattice and an information fusion method based on confidence-weighted superposition of multi-source information are proposed, realizing speech understanding based on multi-modal information fusion.
For dialogue management modeling, SCXML was used to describe the dialogue process, and statistical modeling was studied.

Main progress 4: during the implementation of this project, two prototype systems were established: a speech interaction system and an audio retrieval system. These two prototype systems have been deployed in practical applications in many fields. Through cooperation with enterprises (such as Baidu, Tencent and Alibaba), they have been widely used in the market, and they have also been applied in many national engineering projects, where they played an important role.

(1) Audio content recognition and understanding service platform in the cloud computing environment. With the popularization and development of the Internet, audio content recognition and understanding are widely needed in national security and civil fields. This work focuses on information acquisition, content analysis and mining of real users' massive audio data under the sea-cloud computing framework, as well as audio content services and intelligent interaction technology based on semantic understanding. It breaks through key technologies such as audio information mining and search, and progressive semi-automatic and fully automatic online self-learning, establishing an audio content recognition and understanding service platform based on the
sea–cloud computing environment with independent intellectual property rights. The core technology research was driven by real user data, with emphasis on practical problems such as the robustness of speech recognition to noise, channel and accent, and large-scale concurrent processing. In the Internet field, the project successfully cooperated with three major Internet companies in China, namely Baidu, Tencent and Alibaba, and launched applications such as speech input, speech and music retrieval, and speech customer service. Among them, the intelligent speech customer service system has been applied to Alibaba's Alipay customer service business since May 30, 2013, and in 2013 it stood the test of hundreds of thousands of customer service speech calls on November 11. In addition to Alibaba, cooperation in this field also covers Huawei, Shanghai Telecom, Centrin Data Systems, Dongfang Guoxin and other companies.

(2) Multilingual massive audio information automatic processing platform. According to urgent national needs, core technologies such as speech keyword recognition, speaker and language recognition, and fixed audio retrieval were developed. An acoustic model based on pronunciation characteristics was proposed by combining rule-based clustering with data-driven methods, and an acoustic modeling method based on deep neural networks was realized. A Chinese keyword retrieval algorithm based on dynamic word matching and a token-passing scheme based on topological sorting were proposed to improve the WFST word graph. A speech activity detection strategy based on statistical modeling and unsupervised online learning was proposed. A language recognition algorithm based on parallel phonemes was proposed and improved. A speaker recognition algorithm based on sparse mixture probabilistic linear discriminant analysis was proposed.
The project also put forward a method to generate Uyghur sub-words by combining grammar-based morphological analysis of Uyghur words with statistical word segmentation based on the MDL criterion. The above core technologies effectively improve the adaptability of the recognition system to noise, channel and accent, and its recognition performance in practical application environments. An automatic understanding system for multilingual massive audio was built that conforms to the actual business process. In 2013, the work won first prizes in provincial and ministerial-level scientific and technological progress awards. In summary, this project made a series of achievements: 47 journal papers were published, including 34 SCI-indexed papers (in important journals in this field such as J Acoust Soc Am, Neurosci Biobehav Rev, Hear Res, Ear Hear, IEEE Trans Audio Speech Lang Process and Speech Commun), together with 94 EI-indexed papers; 32 invention patents were filed and 30 granted. The project trained 44 graduate students, including 31 doctoral and 13 master's students. The relevant achievements won three provincial and ministerial first prizes and one third prize.
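The speech activity detection strategy mentioned above combines statistical modeling with unsupervised online learning. As a rough illustration of the idea (not the project's algorithm; the frame energies, smoothing factor and margin below are invented), a detector can track a running noise-floor estimate and update it only on frames judged non-speech:

```python
import numpy as np

def vad(frames_energy_db, alpha=0.95, margin_db=6.0):
    """Toy speech activity detector: flag frames whose energy exceeds a
    running noise-floor estimate by a fixed margin.  The noise floor is
    updated online only on frames judged non-speech, a crude stand-in
    for unsupervised online learning."""
    noise = frames_energy_db[0]
    flags = []
    for e in frames_energy_db:
        is_speech = e > noise + margin_db
        if not is_speech:
            # exponential smoothing of the noise-floor estimate
            noise = alpha * noise + (1 - alpha) * e
        flags.append(is_speech)
    return flags

# Frame energies (dB): quiet background near -60 dB, a speech burst near -30 dB
energies = [-60, -59, -61, -30, -28, -29, -60, -62]
print(vad(energies))  # [False, False, False, True, True, True, False, False]
```

A statistical detector would replace the fixed margin with a likelihood-ratio test between speech and noise models, but the online-update structure is the same.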


3.5 Breakthroughs in Theory and Technology Research of Multi-modal Information Processing

This project focuses on solving the basic scientific and theoretical problems of Internet-oriented cross-media mining and search engines. Its contributions to solving core scientific problems and realizing leaping development are reflected as follows.

1. Contributions to solving core scientific problems

Multi-dimensional, high-order and massive data such as text, images and videos on the Internet are widely and intricately cross-related, exhibiting cross-media characteristics. Analyzing and mining the cross-correlations among Internet web pages, different types of multimedia data and user interaction information is an urgent core problem for next-generation search engines.

Major contribution 1: cross-media representation and modeling based on hypergraphs and matrices

For massive cross-media data, effectively condensing complex information and extracting robust, information-rich features is an important route to efficiently and accurately understanding and mining the knowledge and semantics contained in such data, and a challenge for technologies such as web text, image and video content mining. Traditional methods usually express cross-media data in vector form, but vector representations cannot describe the rich context attributes and high-order associations contained in cross-media data. To overcome this deficiency, the project team introduced matrix- and hypergraph-based methods into cross-media semantic understanding, modeled the association relationships in cross-media data through matrices and hypergraphs, and established a learning mechanism that works across data types and sources.

Major contribution 2: structural sparse selection of high-dimensional features

Given massive cross-media data, it is very challenging to extract high-dimensional, heterogeneous features, select the most discriminative ones, and establish a more interpretable model for understanding the rich semantics of cross-media data. The project team focused on two difficulties in heterogeneous feature selection: (i) making full use of structural priors in high-dimensional features to strengthen feature selection; and (ii) ensuring the consistency of selection results.
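Structural priors of the simplest kind, disjoint feature groups, can be made concrete with the proximal operator of the group-lasso penalty, which zeroes out or shrinks whole feature blocks jointly instead of one coordinate at a time. This is a minimal textbook sketch, not the project's (non-convex, overlapping-structure) algorithms:

```python
import numpy as np

def prox_group_lasso(w, groups, lam):
    """Block soft-thresholding: the proximal operator of the penalty
    lam * sum_g ||w_g||_2.  Each group is kept (and shrunk) or
    discarded as a whole, selecting entire feature blocks."""
    out = np.zeros_like(w)
    for g in groups:
        norm = np.linalg.norm(w[g])
        if norm > lam:
            out[g] = (1 - lam / norm) * w[g]
    return out

w = np.array([3.0, 4.0, 0.1, -0.2])
groups = [[0, 1], [2, 3]]              # two feature blocks
print(prox_group_lasso(w, groups, lam=1.0))
# first group shrunk (norm 5 > 1), second group zeroed (norm ≈ 0.22 < 1)
```

Plugging this operator into a proximal-gradient loop yields group-sparse regression; structured variants (trees, graphs, overlapping groups) replace the per-group norm with more elaborate penalties.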
To achieve these objectives, the research group applied structured sparsity-inducing penalties to select heterogeneous features over structures such as groups, overlapping groups, hierarchies (trees), graphs and paths at both the input and output ends, thus forming structured input/output regularization factors, non-convex structured group sparsity, and graph-structure-based sparsity constraints with input normalization. At the same time, theoretical proofs of the "oracle properties" of the feature selection, such as continuity, unbiasedness and consistency of results, were given to promote the consistency of feature selection results. This research
overcomes the shortcoming of traditional sparse selection algorithms (such as the Lasso), which select each feature individually and ignore the structural attributes between features, and establishes an interpretable semantic understanding mechanism for cross-media processing.

Major contribution 3: cross-media learning mechanism from a geometric point of view

A parallel vector field learning theory was proposed and applied to semi-supervised learning, dimensionality reduction, multi-task learning, cross-media retrieval and other problems, with good experimental results. In the semantic understanding and processing of media data such as images, videos and audio, it is very important to reveal the inherent geometric structure of the data. Traditional manifold learning methods usually ignore the high-order geometric information of the data. The graph Laplacian operator is widely used at present, but it is too coarse a measure of function smoothness, and the resulting algorithms converge slowly. Recent theoretical research shows that constraining the second derivative of a function can more accurately approximate the true regression function. To describe second derivatives of functions on manifolds, the vector field is introduced as a powerful tool. A vector field on a manifold assigns a tangent vector to each point, and a smooth vector field further requires these tangent vectors to depend smoothly on the point. The project team found a close relationship between vector fields on a manifold and its geometric and topological structure, so that properties of functions on the manifold can be described by vector fields.

Major contribution 4: new retrieval mechanism for cross-media data

Traditional media retrieval mainly deals with the same type of data, or with heterogeneous features within the same type of data.
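One common way to index heterogeneous media features for retrieval is locality-sensitive hashing. The sketch below uses random hyperplanes, an illustrative stand-in for the project's sparse hashing algorithm, and assumes image and text features have already been embedded in a shared space (learning that shared space is the hard part the project addresses):

```python
import numpy as np

rng = np.random.default_rng(42)

def make_hasher(dim, n_bits, rng):
    """Random-hyperplane LSH: the sign pattern of random projections
    gives a binary code whose Hamming distance approximates the
    angular distance between feature vectors."""
    planes = rng.standard_normal((n_bits, dim))
    return lambda x: (planes @ x > 0).astype(np.uint8)

hash64 = make_hasher(dim=64, n_bits=32, rng=rng)

img = rng.standard_normal(64)                        # an image feature
txt_similar = img + 0.1 * rng.standard_normal(64)    # semantically close text
txt_other = rng.standard_normal(64)                  # unrelated text

h_img, h_sim, h_oth = hash64(img), hash64(txt_similar), hash64(txt_other)
print(int(np.sum(h_img != h_sim)), int(np.sum(h_img != h_oth)))
# the semantically close pair has the smaller Hamming distance
```

At retrieval time, candidates are fetched from hash buckets in constant time and only the short candidate list is re-ranked with the full features.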
Different from traditional structured and unstructured data, data on the network appears in various forms such as text, images and videos. These different types of media, obtained from different channels, together with their related social attribute information, are ever more closely mixed together and presented in "cross-media" form. Establishing a new retrieval mechanism for cross-media data is therefore a great challenge. The project team conducted in-depth research on cross-media hash indexing, cross-media ranking and cross-media relevance feedback, and put forward algorithms such as cross-media sparse hash indexing, cross-media bidirectional structural ranking and cross-media relevance feedback based on historical logs.

Major contribution 5: establishment and application of a cross-media vertical search engine platform

Focusing on the task of demonstrating the application of cross-media retrieval technology on an extensible distributed cross-media retrieval experimental platform, the project team implemented an extensible distributed engine architecture based on the TRS big data management system.


In this architecture, the project team introduced a multi-engine mechanism (management engines for text, images, audio, relational databases, etc.), so that suitable engines can be selected to serve different cross-media applications. The system adopts a flat design with fully peer-to-peer connections between nodes, all of which can provide external services, and supports elastic expansion. It supports a multi-copy data mechanism, has anomaly-perception and self-recovery capabilities, and is compatible with the Hadoop standard. According to the query characteristics of the application, it can automatically partition and index data; it supports memory tables and column storage, hybrid indexes, distributed caching and asynchronous retrieval, and gives full play to the advantages of modern multi-core servers with large memory. This vertical search engine platform has been applied at Xinhua News Agency and the State Intellectual Property Office.

Demonstrational application in the Xinhua News Agency Multimedia Database project: this vertical search engine platform and the intelligent text mining technology related to this project, including automatic classification and duplicate checking, have been successfully applied in the Xinhua News Agency Multimedia Database project. The Xinhua News Agency Multimedia Database is the largest multimedia, multilingual news information database in Asia. It collects Xinhua News Agency's main resources, such as text, pictures, audio and video, newspapers and magazines, and contains more than 150 million original Xinhua news items and a large volume of multilingual data resources, with a total data volume of 400 TB.
Demonstrational application in the retrieval and service system of the State Intellectual Property Office: this vertical search engine platform and the core technologies related to this project, such as intelligent text mining, design patent retrieval and mechanical drawing retrieval, have been successfully applied in the retrieval and service system of the State Intellectual Property Office. The system already contains patent documents from 98 countries and regions around the world, with a total of more than 100 million patent records.

2. Degree of achieving leaping development

A total of 86 papers were published: in international journals, including the top authoritative journals in this field IEEE Trans Pattern Anal Mach Intell (3 papers), IEEE Trans Image Process (5 papers), IEEE Trans Multimedia (5 papers), IEEE Trans Knowl Data Eng (2 papers), IEEE Trans Circuits Syst Video Technol (2 papers), Pattern Recognition, J Mach Learn Res, VLDB J and ACM Trans Multimed Comput Commun Appl (2 papers); and at top international academic conferences, including ACM Multimedia (21 papers, of which 9 are full papers), AAAI (7), IJCAI (2), CVPR (4), NIPS, SIGIR, WWW, SIGKDD, etc. The published high-level research papers have also been recognized by international peers, with honors including the AAAI 2012 best paper, the ACM MM 2013 best paper nomination, the ACM MM 2012 best paper nomination, the ACM MM 2010 best paper nomination, the MMM 2013 best student paper, and best papers of the 7th Joint Academic Conference on Harmonious Man–Machine Environment (2013)
and the 6th Joint Academic Conference on Harmonious Man–Machine Environment (2012), etc. Some research achievements of the project won the second prize of the 2010 National Science and Technology Progress Award ("Multimedia Technology and Intelligent Service System of Million Digital Libraries"). During the research period, the project won one of the Ministry of Education's 100 Outstanding Doctoral Theses in 2012 (for the thesis "Research on Key Technologies of Cross-Media Retrieval and Intelligent Processing"); the Excellent Doctoral Thesis award of the China Computer Federation in 2012 (for "Image Semantic Understanding Based on Graph Model Expression and Sparse Feature Selection"); the nomination award for Excellent Doctoral Thesis of the China Computer Federation in 2010 (for "Video Semantic Understanding of Multimodal Feature Fusion and Variable Selection"); and the special prize of the 2012 Dean's Award of the Chinese Academy of Sciences together with the Excellent Doctoral Thesis of the Chinese Academy of Sciences in 2013, whose thesis ("Collaborative Search and Recommendation Technology Research for Social Media") was selected for Springer's excellent doctoral thesis series. During the research period, one team member was appointed as a 973 Program Chief Scientist, two won the National Outstanding Youth Fund, one was elected IEEE Fellow, one was selected among the top young talents of the Central Organization Department's Ten Thousand Talents Plan, and one was selected for the Ministry of Education's New Century Excellent Talents Support Program. To promote exchange with international peers in cross-media related fields, the project team held a panel on "Cross-Media Analysis and Mining" at the 21st ACM Multimedia conference (ACM MM 2013) and launched a special issue on "Cross-Media Analysis" in the international journal Int J Multimed Inf Retr.

3.6 Breakthroughs in Key Technology and Platform of Autonomous Vehicles

Taking driverless driving in the real road environment as the physical verification carrier for the study of unstructured information processing and behavior computing, the organic combination of applied basic research and physically realizable systems has laid a solid foundation for verifying cognitive computing theories and methods for visual and auditory information, and produced major original research results that meet major national needs. On some important basic scientific issues, a number of internationally leading or advanced scientific research achievements have been obtained, and a series of papers with original innovative ideas have been published in important international academic journals, which greatly promoted the research and development process of
driverless intelligent vehicles in China and narrowed the gap with developed countries and regions such as the United States and Europe.

3.6.1 Theory, Method, Integration and Comprehensive Experimental Platform of Autonomous Vehicles

This major research plan deployed the integration project "Key Technologies and Integration Verification Platform for Autonomous Driving Vehicles" as a key entry point and an important foothold. Led by the National University of Defense Technology of the Chinese People's Liberation Army, the project integrated the advantageous research teams and early results of the relevant integration, key support and cultivation projects in the major research plan. Aiming at the basic scientific problems of environmental perception and understanding and of intelligent decision-making and control for autonomous vehicles, it carried out innovative comprehensive research integrating theory, key technology and engineering optimization, obtained systematic and innovative research results, effectively promoted the standardization and consolidation of technical methods, and promoted the interactive integration of academic teams and the integration and sublimation of research results. The contributions of this project to solving core scientific problems and the degree of realizing leaping development are as follows.

1. Contributions to solving core scientific problems

Major contribution 1: theory and methods of environmental perception for autonomous vehicles

(1) Visual attention model and its application in environment understanding of autonomous vehicles

Starting from the visual attention model based on multi-scale frequency-domain analysis proposed by the project team, the attention model is further improved by introducing top-down information. The project studies the relationship between the attention mechanism in cognitive science and environment understanding for autonomous vehicles, and explores the feedback attention mechanism and its computational model. It studies the application of the attention model to scene understanding in autonomous driving tasks, captures the cognitive (attention) patterns of human drivers with eye trackers, and uses them to guide the design of the perception capabilities of autonomous vehicles.

(2) Overall perception model and algorithm for urban and intercity road environments

On the basis of the existing environment perception system, an overall perception model and algorithm for urban and intercity road environments are studied. According to the environmental characteristics of urban and intercity roads, and combining the environmental information needs of autonomous vehicles, an overall perception model of the urban and intercity road environment for autonomous vehicles is established.


The project mainly carries out highly adaptive road environment perception with tight coupling of multiple sensors, so as to realize highly reliable overall environment perception and road scene understanding for autonomous vehicles.

(3) Multi-scale heterogeneous information fusion method

The project studies effective fusion methods for multi-scale heterogeneous information, registers and fuses multi-scale heterogeneous interactive information, and further improves the representation and application of knowledge to meet the reliability and real-time requirements of the autonomous driving system.

(4) Identification methods for road entrances and exits, traffic signs, road maintenance signs and tunnel entrances

The project studies identification methods for urban and intercity road entrances and exits, traffic signs, road maintenance signs and tunnel entrances, focusing on rapid traffic sign identification based on sparse representation and rapid acquisition of regions of interest based on frequency-domain multi-scale analysis with an attention-focusing mechanism. Under normal weather conditions, the project studies lane line detection, image-based detection of vehicles ahead, and real-time detection, recognition and understanding of road traffic signs and graphic information.

Major contribution 2: optimal decision-making and control of autonomous vehicles

(1) Simulation system for autonomous driving functions based on machine learning and a vehicle dynamics model

The project studies autonomous driving control strategies based on machine learning, using a simulation environment and machine learning technology to optimize the driving control strategy of autonomous vehicles, especially for overtaking and merging into traffic flow.
(2) Adaptive optimal control of the autonomous driving system

According to the various road conditions of urban and intercity roads, obstacle information sensed by multiple sensors is used rationally, and safety standards for autonomous driving are constructed, so as to improve safety under abnormal conditions such as occasional decision-making mistakes during driving.

(3) Safety specification for vehicle autonomous driving based on multi-sensor fusion

Major contribution 3: architecture and reliability of the autonomous driving system

The project redesigns the overall software and hardware framework of the autonomous driving system to improve its overall intelligence level and system reliability.


(1) Hardware architecture design of the autonomous vehicle

By upgrading to a new hardware system, the computing capacity of the autonomous driving system is improved to meet the computing needs of perception, decision-making and planning under large-scale, complex road conditions. The project improves the real-time performance, reliability and flexibility of information transmission, adds map-based global planning and guidance, and upgrades the hardware of the environment-aware system to meet task requirements.

(2) Software architecture design of the autonomous vehicle

The project improves the hardware-independent software of the autonomous driving system, enhances the flexibility and reliability of system integration, and upgrades the debugging functions of the autonomous driving system to improve debugging efficiency. Studying a system software architecture with knowledge processing at its core lays a solid foundation for multi-level integration of man–machine intelligence.

Major contribution 4: research on integrating brain-computer interface technology into the vehicle autonomous driving system

The project designs visual stimulation methods for driving navigation, studies online adaptive BCI learning algorithms and strategies, and realizes a brain-computer interface combined with video technology. It studies EEG detection and anti-interference technology under vehicle conditions, covering electrode structure, amplifier modification and special noise reduction processing. Eye movement information is used to obtain the gaze direction, which is combined with EEG information to realize a brain-computer interface system for assisted driving. For typical road conditions such as multiple intersections and road maintenance ahead, the above BCI-based driving assistance functions are realized.
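The text does not specify the BCI paradigm. Assuming an SSVEP-style setup, in which each navigation option flickers at a distinct frequency, detecting the driver's choice reduces to finding the dominant stimulation frequency in the EEG spectrum. The following toy sketch (with invented sampling rate, frequencies and noise level) shows that reduction:

```python
import numpy as np

def detect_choice(eeg, fs, candidate_freqs):
    """Toy SSVEP-style detector: return the candidate stimulation
    frequency with the most spectral power in the EEG segment."""
    spectrum = np.abs(np.fft.rfft(eeg)) ** 2
    freqs = np.fft.rfftfreq(eeg.size, d=1.0 / fs)
    powers = []
    for f in candidate_freqs:
        band = (freqs > f - 0.5) & (freqs < f + 0.5)   # narrow band around f
        powers.append(spectrum[band].sum())
    return candidate_freqs[int(np.argmax(powers))]

# Simulated 2-s EEG segment: a 10 Hz response buried in noise
fs = 250
t = np.arange(0, 2, 1 / fs)
rng = np.random.default_rng(1)
eeg = np.sin(2 * np.pi * 10 * t) + 0.8 * rng.standard_normal(t.size)
print(detect_choice(eeg, fs, [8.0, 10.0, 12.0]))  # 10.0
```

Practical systems replace the simple band-power comparison with methods such as canonical correlation analysis and adapt thresholds online, which is where the adaptive learning mentioned above comes in.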
Major contribution 5: research on integrating auditory information processing into the vehicle autonomous driving system

In accordance with the national standard, the project detects various sirens, analyzes their changing tracks, and understands their semantics. It studies microphone-array-based noise elimination and sound source localization in driving noise environments, as well as fast acoustic event detection under conditions of insufficient environmental prior knowledge, providing auxiliary information for the vehicle autonomous driving system.

2. Degree of achieving leaping development

Focusing on the basic scientific issues of environmental perception and understanding and of intelligent decision-making and control for driverless vehicles in the normal traffic environment of urban roads and intercity highways, the project carried out innovative comprehensive research integrating theory, key technology and engineering optimization. At the same time, it integrated relevant research results in the field of brain-computer interfaces to build an open integrated verification platform for driverless vehicles with a modular architecture, which realizes long-distance
autonomous driving (more than 2000 km) in the normal traffic environment and normal weather conditions of real urban and intercity roads, with a manual intervention proportion of less than 2%. Important research progress has been made in visual attention modeling based on frequency-domain scale analysis and its application to scene understanding, a comprehensive environment perception model and method based on road structure constraints, a new traffic sign recognition method based on block kernel function features, a driving strategy optimization method based on multi-objective reinforcement learning, accurate recognition and tracking of road signs in complex scenes, a vehicle speed tracking algorithm based on model predictive control, rapid information input based on a non-invasive brain-computer interface, and an environment status recognition method based on sound context. Specific features and innovations are as follows.

Main progress 1: highly reliable highway environment perception and speed control technology, and experimental verification of long-distance autonomous driving.

In view of the extremely high requirements on environmental perception reliability for autonomous vehicles on expressways, the project integrated environmental factors such as road conditions, vehicles on the road and roadside obstacles, established a two-dimensional local environment model of expressways for autonomous vehicles, quantitatively analyzed the safety of autonomous vehicles merging into traffic flow using a probability model, and established the quantitative relationship between the safe distance and speed for merging into traffic and the performance parameters of environmental perception. Furthermore, a comprehensive highway environment sensing technology based on road structure constraints was put forward. This technology is cross-verified across the sensing subtasks, which greatly improves the reliability of the sensing results.
Among them, the success rate of vehicle detection within 120 m ahead is as high as 99.65%, and accurate lane information can be provided even in previously difficult situations such as occlusion, defaced markings and bridge crossings, providing sufficient and reliable environmental information for long-distance autonomous driving and safe merging into dense traffic flow. To meet the accuracy requirements of high-speed autonomous driving on steering and speed control under the nonlinear, time-varying characteristics of vehicle dynamics, a steering instruction optimization method based on road width constraints and an adaptive speed adjustment technology based on a numerical acceleration/deceleration model were proposed. The steering control keeps the lateral control error below 0.5 m (at a speed of 120 km/h), and the adaptive speed regulation achieves a steady-state speed tracking error of less than 5% of the expected speed in the expressway environment. Based on the two-dimensional local environment model of expressways and the safety analysis of merging traffic flow, the first long-distance expressway autonomous driving experiment under normal traffic conditions in China was successfully completed. In this experiment, the Hongqi HQ-3 autonomous vehicle was used as the experimental platform. The starting point was Changsha Yangzichong Toll Station,
and the ending point was Wuhan West (Wuchang) Toll Station. The driverless distance was 286 km, covered in 3 h and 22 min, with the computer system controlling the speed and direction of the vehicle throughout. Because of the complex road conditions, the maximum speed was set to 110 km/h, with manual monitoring and intervention only under special circumstances. The measured average speed of autonomous driving was 85 km/h, and the manual intervention distance was 2140 m, 0.75% of the total autonomous driving mileage. During the experiment, the Hongqi HQ-3 driverless platform carried out 67 successful overtaking maneuvers, passing 116 vehicles, and was overtaken 148 times by other, non-experimental vehicles. It safely merged into dense traffic flow 12 times and drove safely over long distances in dense traffic.

Main progress 2: high-performance hardware-in-the-loop simulation system for safety analysis and optimized decision-making in freeway autonomous driving.

To assist the development of the highway driverless vehicle verification platform and comprehensively carry out simulation tests of the highway autonomous driving system under all working conditions, a hardware embedding scheme for a hardware-in-the-loop simulation system based on the xPC Target master–slave structure was proposed, and a three-dimensional virtual test bench for the intelligent driving control system was established, covering planning and decision-making, real-time control and a high-precision vehicle dynamics model. Dynamic model parameter acquisition and vehicle dynamics verification experiments for the Hongqi HQ-3 car were completed. Following the national standard and the characteristics of driverless vehicles, the lateral average accuracy of this model reached 78.1%, verified through steering wheel angle pulse, steady-state circle, steering wheel angle step and on-center handling tests.
Through acceleration tests, braking tests and random tests over 100 km, the maximum longitudinal acceleration error was 3.4426 m/s², the average error 0.3874 m/s² and the mean square error 0.7264 m/s². A self-developed vehicle dynamics model with 14 degrees of freedom was used, and parameter matching and dynamics verification tests were completed. According to the highway design code, a typical highway road database was established, which supports arbitrary splicing and simulation of different road slopes, roughness, turning radii and road adhesion coefficients. The established highway vehicle simulation system is used to simulate various complex traffic scenes and decision-making control schemes for autonomous driving, and a machine-learning-based autonomous optimization decision-making method for driverless vehicles on highways was proposed. On the one hand, key data can be obtained and safety analysis carried out through simulation, which greatly reduces the number of real-vehicle experiments and the dangers they may involve. On the other hand, high-precision simulation improves the learning and optimization efficiency of the decision-making control system and provides comprehensive decision support for complex driving behaviors such as merging into traffic flow and overtaking. With the support of the simulation environment, the project proposed a rolling optimization control method for driverless vehicles based on differential flatness, and obtained tire blowout (puncture) control and
emergency obstacle avoidance strategies. Through bench experiments under various random working conditions, control strategies for tire blowout and emergency obstacle avoidance were formulated and calibrated, providing a reliable basis for the safe driving and stable control of vehicles during blowouts and emergency obstacle avoidance.

Main progress 3: visual attention model based on global suppression, and real-time detection and recognition of traffic signs.

This study applies new theories of the visual attention mechanism from cognitive science and puts forward a new attention model: a visual attention model based on global inhibition. In 2005, Yantis and Beck published an article in Nature Neuroscience pointing out that competition and mutual inhibition exist between different areas of the human visual system; in this competition most areas are suppressed, and only a few eventually capture our attention. In our research, experiments verified that the human visual system has a global inhibition capability. Starting from visual cortical inhibition, we proposed a brand-new visual attention model based on global inhibition. In this model, the concept of a spectrum scale space is put forward, and the amplitude spectrum is treated as being as important as the phase spectrum. Specifically, peaks in the amplitude spectrum correspond to suppressed trivial patterns in area V1, and convolving the frequency-domain amplitude spectrum with a Gaussian kernel is equivalent to computing the spatial attention distribution. The hypercomplex Fourier transform is used to represent the signal in the frequency domain, and a frequency-domain scale space is built on the amplitude spectrum for optimal scale selection of the Gaussian kernel.
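A simplified single-channel sketch of the amplitude-spectrum idea is given below. The full model uses the hypercomplex Fourier transform over color channels and selects the Gaussian scale automatically from the spectrum scale space; here the scale is a fixed, assumed parameter and the input is a grayscale toy image:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spectral_saliency(image, scale=3.0):
    """Smooth the amplitude spectrum with a Gaussian kernel (suppressing
    spectral peaks, i.e. repeated 'trivial' background patterns), keep
    the original phase, and transform back to get a saliency map."""
    F = np.fft.fft2(image)
    amplitude, phase = np.abs(F), np.angle(F)
    smoothed = gaussian_filter(amplitude, sigma=scale)
    sal = np.abs(np.fft.ifft2(smoothed * np.exp(1j * phase))) ** 2
    return gaussian_filter(sal, sigma=2.0)   # post-smooth the map

# A flat background with one small bright blob: the blob should pop out
img = np.zeros((64, 64))
img[30:34, 30:34] = 1.0
sal = spectral_saliency(img)
peak = np.unravel_index(np.argmax(sal), sal.shape)
print(peak)  # near the blob centre
```

The point of the scale space is that different Gaussian widths suppress repeated patterns of different spatial frequencies, so choosing the scale amounts to choosing what counts as "background".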
Experiments show that this model predicts eye-movement and attention results, as well as human labeling of regions of interest, and its performance exceeds that of most models published in the current literature. In the model research, a new evaluation mechanism for attention models is proposed. The related papers were published in IEEE TPAMI as a regular paper and were well received by reviewers. On the basis of the theoretical research, a real-time detection and recognition algorithm was designed for traffic signs in urban and expressway environments. At present, 55 kinds of common traffic signs have been detected and recognized in the Changsha urban road environment under normal lighting conditions. The experimental results show that the recognition rate of the traffic sign recognition system is above 91.16%, and the average processing time for a single frame is 181 ms. An efficient recognition method for arrow-shaped traffic lights is also proposed. The regions of interest of light boards and traffic lights are tracked by a Kalman filter, and a hidden Markov model is built from representative observation sequences. The current state of the traffic lights is estimated by combining the recognition and tracking results. Recognition experiments under normal lighting conditions show that the recognition rate of common urban traffic lights is over 92.41%, the average processing time for a single frame is 152 ms, and the state of the traffic lights can be estimated effectively.
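The idea of estimating the light state by combining per-frame recognition with a temporal model can be sketched with a tiny HMM forward filter. The state set, transition matrix and classifier scores below are hypothetical illustrations, not the project's parameters:

```python
import numpy as np

# Hypothetical three-state light cycle (illustrative values only)
STATES = ["red", "green", "yellow"]
A = np.array([[0.90, 0.09, 0.01],    # red    -> mostly stays red
              [0.01, 0.90, 0.09],    # green  -> stays, then yellow
              [0.09, 0.01, 0.90]])   # yellow -> stays, then red

def forward_filter(frame_scores, prior=None):
    """HMM forward filtering: fold noisy per-frame classifier scores
    into a belief over the current traffic-light state."""
    belief = np.full(len(STATES), 1.0 / len(STATES)) if prior is None else prior
    for scores in frame_scores:
        belief = scores * (A.T @ belief)   # predict, then weight by evidence
        belief = belief / belief.sum()     # renormalize
    return belief
```

A single misclassified frame barely moves the belief, which is why the temporal model makes the per-frame recognizer more robust.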

3.6 Breakthroughs in Key Technology and Platform of Autonomous Vehicles


3.6.2 Key Technology Integration and Comprehensive Verification Platform for Driverless Vehicles Based on Visual and Auditory Cognitive Mechanism

This major research plan has deployed the integration project of “Key Technology Integration and Comprehensive Verification Platform for Driverless Vehicles Based on Visual and Auditory Cognitive Mechanism”, which is regarded as a key entry point and an important foothold. Facing real urban and intercity roads and normal traffic scenes, research on driverless vehicles integrating theory, key technologies, engineering optimization and road experiments is carried out, providing a comprehensive integrated verification platform for realizing the overall scientific objectives of the major research plan. The contributions of this research to solving core scientific problems and the degree of realizing leaping development are as follows.

1. Contributions to solving core scientific problems

In the aspect of environment perception, a bionic visual computing model based on selective attention and sparse perception mechanisms is constructed, which realizes the perception and understanding of traffic signs among traffic elements, and an environment modeling method based on multi-sensor information fusion is studied. Lane detection and tracking based on inverse perspective transformation and time–space matching, obstacle detection based on four-dimensional spatial filtering, real-time detection of road boundaries based on a B-spline model and local map construction based on three-dimensional lidar are realized. The study puts forward a fast construction method for all-lane high-precision maps of driverless urban environments, as well as an accurate positioning method based on point-set registration, car body state and GPS, realizing a positioning and navigation method that integrates various information sources. At the same time, the recognition of automobile taillights based on images and state time-sequence information, the recognition of road direction signs based on lane-line position relationships and the recognition of the horn signals of specific vehicles are also realized. Based on the above research results, a multi-information fusion environment perception system suitable for driverless vehicles is finally constructed, which completes real-time perception of traffic elements and understanding of the driving environment for driverless vehicles in large-scale traffic scenes. It has passed real-vehicle tests in different road environments such as cities, highways and villages, and in different weather conditions such as sunny, cloudy and rainy days, which verifies the effectiveness of the method. The papers were published at the international conferences IJCNN, IROS, IV, ICMA, ROBIO and ITSC and in the international academic journals TITS, SENSOR, Neurocomputing, Robot, etc.
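One common way such multi-source positioning (registration against the map, car-body state, GPS) is fused is a Kalman-style measurement update; the sketch below is a generic illustration under that assumption, not the project's actual filter:

```python
import numpy as np

def fuse_pose(pose_pred, P_pred, meas, R):
    """One Kalman-style measurement update: fuse a predicted pose
    (e.g. dead reckoning from car-body state) with a position fix
    from GPS or from point-set registration against the HD map."""
    K = P_pred @ np.linalg.inv(P_pred + R)      # gain: relative trust
    pose = pose_pred + K @ (meas - pose_pred)   # corrected pose
    P = (np.eye(len(pose_pred)) - K) @ P_pred   # reduced uncertainty
    return pose, P
```

With a confident measurement (small R relative to P_pred), the fused pose moves mostly toward the measurement; with a noisy one, it stays near the prediction.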
In the aspect of decision planning, based on the cognitive model of the overall man–vehicle–road situation, a hybrid decision-making model is established, and an optimized decision-making method based on intention driving and decision-making risk assessment is used to construct a humanoid driving decision-making system,
which realizes safe driving-behavior decisions in complex environments. At the same time, path planning technology based on various kinds of static geographic information and dynamic traffic guidance information is studied to improve the efficiency of driverless-vehicle path planning. Through the above research, task planning, behavior decision-making and motion planning algorithms are established for the integrated driving environment of driverless vehicles, and the unit technologies have passed real-vehicle tests, which verifies the effectiveness of the algorithms. The papers were published at the international conferences IV and ROBIO and in the international academic journals IJCA, IJRA, Robot and so on. In the aspect of control technology, longitudinal and lateral motion control models were established and the control algorithms of the autonomous driving control system for driverless vehicles were designed. Based on the control models, the modular processing of control algorithms and a multi-system integration scheme are studied, along with methods to improve control accuracy and stability. In view of the shortcomings of existing longitudinal motion control algorithms for driverless vehicles, such as a lack of human-likeness, neglect of comfort and poor adaptability to road slope characteristics, a longitudinal motion control algorithm for driverless vehicles based on multiple road conditions is proposed, and the corresponding control model is established. To address the inability of existing lateral motion control methods to account for road shape characteristics, a lateral motion control algorithm for driverless vehicles based on road curvature characteristics is proposed, and the corresponding control model is established.
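The receding-horizon flavor of longitudinal control can be sketched as follows. This toy scores a few constant-acceleration candidates over a short horizon and applies the best first action; it is an illustrative stand-in, with hypothetical costs and gains, for the model-based algorithms described above:

```python
def rolling_optimize(speed, target_speed, horizon=10, dt=0.1,
                     candidates=(-3.0, -1.0, 0.0, 1.0, 3.0)):
    """One receding-horizon step: score constant-acceleration
    candidates (tracking error plus control effort) over a short
    horizon and return the best first action; re-run every cycle."""
    best_a, best_cost = 0.0, float("inf")
    for a in candidates:
        cost, v = 0.0, speed
        for _ in range(horizon):
            v = max(0.0, v + a * dt)                    # simple speed model
            cost += (v - target_speed) ** 2 + 0.05 * a * a
        if cost < best_cost:
            best_cost, best_a = cost, a
    return best_a
```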
Based on the longitudinal and lateral control models of driverless vehicles, a multi-system fusion scheme for the driverless vehicle control algorithm and a method to improve the accuracy and stability of the lateral control algorithm are proposed. The papers were published at the international conference IV and in the international academic journal Adv Robot. The study established an open, modular driverless vehicle architecture with four layers: perception, decision, control and execution. The study developed a driverless vehicle test system covering test contents, test methods and evaluation indexes. An automatic fault diagnosis system for driverless vehicles covering the component, subsystem and system layers was established, and a dual-redundant driverless controller was developed. The study developed the Intelligent Pioneer driverless platform, the Kuafu No.1 driverless platform, the GAC MOTOR driverless vehicle platform, the JAC IEV6S driverless vehicle platform, the Inner Mongolia First Machinery 4 × 4 driverless light tactical vehicle platform and other platforms. Experiments and tests were carried out in Hefei, Guangzhou, Xi’an, Baotou and other cities, and researchers continuously participated in the “China Smart Car Future Challenge” hosted by the National Natural Science Foundation of China, completing more than 2000 km of autonomous driving. Under third-party software monitoring, 86 autonomous driving tests
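For the lateral side, a curvature-aware steering law can be sketched as a feedforward term from road curvature plus a feedback term on tracking error (here Stanley-style; the gains and wheelbase are illustrative assumptions, not the project's controller):

```python
import math

def steering_angle(lateral_err, heading_err, curvature,
                   speed=10.0, wheelbase=2.7, k_e=0.3):
    """Curvature feedforward plus Stanley-style feedback: the first
    term follows the road shape, the second pulls the vehicle back
    onto the lane centerline."""
    feedforward = math.atan(wheelbase * curvature)          # road shape
    feedback = heading_err + math.atan2(k_e * lateral_err, speed)
    return feedforward + feedback
```

The feedforward term is what distinguishes a curvature-based controller from pure error feedback: on a curve with zero tracking error, it still commands a nonzero steering angle.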
of various urban roads have been completed, with a mileage of 1228 km. The study achieved its predetermined development goals.

2. Degree of achieving leaping development

Focusing on the basic scientific problems of natural environment perception and understanding, intelligent decision-making and autonomous control of driverless vehicles in the normal traffic environment of urban roads and intercity highways, this project carries out innovative comprehensive integration and experimental verification research integrating theory, key technologies, engineering optimization and road experiments, establishes an open and highly reliable comprehensive integration verification platform for driverless vehicles, and realizes long-distance autonomous driving in the normal traffic environment of real urban and intercity roads, so that research on driverless vehicles in China generally reaches the world advanced level. Specifically, according to the research plan, in 2014, on the basis of investigation and exchange at home and abroad, the overall scheme was designed and demonstrated, and the open, modular architecture of driverless vehicles was established. The environment perception system, intelligent decision-making system and autonomous control system were built, the open system platform was set up, and the Intelligent Pioneer and Kuafu No.1 driverless vehicles were built. In 2015, the environment perception system, intelligent decision-making system and autonomous control system were improved, environmental element modeling and environment awareness for different traffic environments were realized, and a behavioral decision-making model capable of multi-attribute decision-making in unknown states was established. A traffic element detection method based on spatial–temporal correlation and a motion planning method based on RRT were proposed, and a behavioral decision-making model based on a hierarchical finite state machine and the multi-attribute decision-making method was established, thus realizing accurate positioning combined with visual information.
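The RRT-based motion planning mentioned above can be illustrated by a minimal 2-D sketch; the sampling domain, step size and goal bias are arbitrary illustrative choices:

```python
import math
import random

def rrt(start, goal, is_free, step=1.0, iters=2000, goal_tol=1.5, seed=0):
    """Minimal 2-D RRT: repeatedly extend the nearest tree node one
    step toward a random sample (with 10% goal bias); is_free is a
    caller-supplied collision check."""
    rng = random.Random(seed)
    nodes, parent = [start], {0: None}
    for _ in range(iters):
        target = goal if rng.random() < 0.1 else (rng.uniform(0, 20), rng.uniform(0, 20))
        i = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], target))
        d = math.dist(nodes[i], target)
        if d == 0.0:
            continue
        new = (nodes[i][0] + step * (target[0] - nodes[i][0]) / d,
               nodes[i][1] + step * (target[1] - nodes[i][1]) / d)
        if not is_free(new):
            continue                      # skip nodes in collision
        parent[len(nodes)] = i
        nodes.append(new)
        if math.dist(new, goal) < goal_tol:   # reached: backtrack the path
            path, k = [], len(nodes) - 1
            while k is not None:
                path.append(nodes[k])
                k = parent[k]
            return path[::-1]
    return None
```

In a real planner the collision check would query the local map built by the perception system, and the raw path would then be smoothed to respect vehicle kinematics.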
The driverless vehicle platform based on GAC MOTOR was successfully developed, the mobile robot perception test system was initially built, the environmental perception integration scheme was completed, and a large number of real-vehicle experiments were carried out. In the same year, the project hosted the “2015 Research Progress Report and Academic Exchange Conference of the Major Research Plan ‘Cognitive Computing of Visual and Auditory Information’”, at which the project’s objectives and progress, key technological breakthroughs and the next work plan were reported. The leaders of the National Natural Science Foundation of China and the members of the expert group listened to the project report and inspected the driverless cars developed in the project. In 2016, the environment awareness system, intelligent decision-making system and autonomous control system were optimized, and the comprehensive performance of the integrated platform was improved. The project successfully developed enterprise driverless vehicle platforms such as the JAC IEV6S driverless vehicles and 4 × 4 driverless tactical vehicles, and applied SLAM, laser guidance and motion control technologies for driverless vehicles to many new products such as the security patrol robots, security service robots and rescue robots of Arctech Smart City Technology (China) Co., Ltd., which stimulated investment in driverless technology in Anhui
Province and Guangdong Province. In this year, the integrated platform was tested to realize driving behaviors such as keeping lanes, changing lanes and avoiding obstacles, giving way at intersections, avoiding pedestrians, passing complex obstacles and passing electronic toll stations in urban road environment. It was tested in Hefei, Xi’an, Guangzhou, Baotou and other cities through several driverless platforms, and the accumulated autonomous driving mileage was over 2000 km.

3.6.3 Intelligent Test Evaluation and Environment Design of Driverless Vehicles

This major research plan has deployed the integration project of “Intelligent Test Evaluation and Environmental Design of Driverless Vehicles”, which is taken as a key entry point and important foothold. Led by Tsinghua University, it provides single and comprehensive intelligence classification, test and evaluation methods and supporting environment design technologies for “developing a driverless vehicle verification platform with natural environment perception and intelligent behavior decision-making”, and provides a comprehensive integrated verification platform for realizing the overall scientific objectives of the major research plan. The contributions of this research to solving core scientific problems and the degree of realizing leaping development are as follows.

1. Contributions to solving core scientific problems

We proposed that the intelligence of driverless vehicles should be defined by a semantic relation network. The semantic relation network is divided into four parts: scene, task, function atom and function set. Among them, tasks, which link scenes and function atoms, should be the core of research. Through the semantic relation network, we put forward a complete definition of driverless vehicle intelligence, divided current driverless vehicle testing into two schools, the scene-testing school and the function-testing school, and combined them, with special emphasis on the key tasks.

Scene test: in systems research, scenes are mostly defined as specific systems in specific time and space. A traffic scene generally refers to a specific traffic system composed of many traffic participants and a specific road environment. If the tested vehicle can drive through the traffic system autonomously, it is said to pass the driving test of that specific scene. For example, the DARPA 2005 Driverless Vehicle Challenge selected a 212 km desert road as the test scene, and the DARPA 2007 Driverless Vehicle Challenge selected 96 km of urban roads as the test scene. The implicit assumption of scene testing is that after passing one or several driving tests in a certain scene, the vehicle will pass that scene smoothly in the future. But this assumption does not always hold.

Function test: according to the functional classification of human intelligence, driving intelligence is divided into three broad categories: information perception, information processing and decision-making, and decision execution. For
example, path planning belongs to the single intelligence of information processing and decision-making. This definition emphasizes the commonality of the methods and technologies that realize each single intelligence. However, it is not related to specific traffic scenes or driverless test tasks, and it is insufficient for measuring the intelligence level of driverless vehicles. A task originally refers to assigned work. Driving tasks refer not only to general driving tasks such as following, collision avoidance, lane changing and stopping, but also to a specific driving task in a specific environment. If the tested vehicle can drive autonomously to complete a specific task, it is said to pass the driving test of that specific task. Compared with a driving scene, a driving task is more specific, with a clearer range in time and space. A specific driving scene usually contains multiple driving tasks. For example, the “China Smart Car Future Challenge” in the past two years has carefully designed several tasks to test the specific capabilities of driverless vehicles. Passing a test task is still not equal to testing driverless intelligence and driving ability. Driving ability generally refers to the ability to complete a certain driving behavior. To complete a specific driving task, the tested vehicle usually needs multiple driving abilities. Based on different scenes and tasks, each driving ability can be quantitatively evaluated, and the driving ability of the whole driverless vehicle can be obtained by further aggregation. In this semantic relation network, along the forward connections between scenes, tasks and ability modules, we can derive specific driving abilities from driving scenes, and subdivide and standardize quantifiable driving-ability indicators, so as to establish a complete test system.
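A toy version of such a semantic relation network makes the two traversal directions concrete. All scene, task and ability names below are hypothetical examples, not the project's actual taxonomy:

```python
# Hypothetical toy taxonomy; the real network has four layers
# (scene, task, function atom, function set) and far more entries.
SCENE_TASKS = {
    "urban_intersection": ["yield_at_intersection", "avoid_pedestrian"],
    "highway_merge": ["merge_into_traffic", "keep_lane"],
}
TASK_ABILITIES = {
    "yield_at_intersection": ["traffic_light_recognition", "gap_selection"],
    "avoid_pedestrian": ["pedestrian_detection", "emergency_braking"],
    "merge_into_traffic": ["gap_selection", "lane_change_control"],
    "keep_lane": ["lane_detection", "lateral_control"],
}

def abilities_for_scene(scene):
    """Forward traversal: scene -> tasks -> the abilities to be tested."""
    out = []
    for task in SCENE_TASKS.get(scene, []):
        for ability in TASK_ABILITIES[task]:
            if ability not in out:
                out.append(ability)
    return out

def scenes_testing_ability(ability):
    """Reverse traversal: ability -> tasks -> scenes exercising it."""
    tasks = {t for t, abil in TASK_ABILITIES.items() if ability in abil}
    return sorted(s for s, ts in SCENE_TASKS.items() if tasks & set(ts))
```

The forward query derives the ability checklist for a scene; the reverse query answers which scenes (via their tasks) exercise a given ability, which is the basis for generating test environments matched to a function test.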
Along the reverse connections from abilities through tasks to scene modules, we can automatically generate reasonable driving tasks and even driving scenes according to the requirements of the function test, solving the problem of automatically designing a driving environment matched to the test. After the driving scene is determined, the matching driving environment can be generated automatically and virtually for the simulation test and field test of driverless intelligence. The scene test is located at the left end of the semantic relation network, while the function test is located at the right end. In effect, the driverless intelligence system we propose integrates the two existing definitions of driverless intelligence into one, and the two complement each other. After the theory and the corresponding test framework were put forward, they drew an enthusiastic response from relevant researchers at home and abroad. Related papers ranked among the top 5 in the monthly download list of IEEE Trans Intell Veh for several consecutive months.

2. Degree of achieving leaping development

This project provides single and comprehensive intelligence classification, testing and evaluation methods and supporting environment design technologies around “developing a driverless vehicle verification platform with natural environment perception and intelligent behavior decision-making”, which makes the research
of driverless vehicles in China reach the world advanced level in general. The specific features and innovations are as follows.

(1) This project integrates the two existing definitions of driverless vehicle intelligence (scene-based testing vs humanoid intelligence), reveals the relationships among driverless intelligence evaluation, task evaluation and capability evaluation, and establishes a hierarchical task expression system, a definition and test system and specific evaluation methods for driverless intelligence.

(2) This project studies environment design technology for the field tests and simulation tests of driverless vehicles, which is used to test and evaluate their single and comprehensive intelligence. Furthermore, this project proposes a parallel test architecture, which realizes collaborative testing between the field test base and the simulation test space “Digital Nine-Grid”, and improves the coverage, pertinence and flexibility of the tests.

(3) A field test base and an intelligent comprehensive test and verification platform for driverless vehicles have been built in this project, providing a site for the intelligent testing and competitions of driverless vehicles over the next five years that is integrated with a real urban intelligent transportation environment. The site covers an area of more than 2 square kilometers and contains various typical traffic scenes. Based on V2X technology, the test data of driverless vehicles can be collected in real time, which reduces the test cost and makes the evaluation more scientific and fair.

(4) This project has built a driverless simulation test platform, which can test and visualize the cognitive intelligence of driverless vehicles offline. The proposed TSD-max real traffic scene dataset can meet the needs of driverless off-line testing for tens of thousands of kilometers.
(5) Around the overall goal of the major research plan, this project provided extensive supporting services. In November 2016 and November 2017, this project assisted the National Natural Science Foundation of China and the guidance expert group of the major research plan in successfully holding the China Smart Car Future Challenge in 2016 and 2017, and held the first and second off-line tests of the basic ability of environment cognition for visual information at the same time. In addition, this project also provided technical support for the field tests and off-line tests of the preliminary acceptance of the integration project of the driverless verification platform in the major research plan.

Chapter 4

Outlook

4.1 Deficiencies and Strategic Needs

4.1.1 Deficiencies

Although this major research plan has achieved its expected goal, from the perspective of the overall international situation, cognitive computing research on visual and auditory information in China still lags behind and remains at the catching-up stage.

1. Deficiencies still exist in the basic theory and technology of cognitive computing of visual and auditory information

Compared with research on artificial intelligence worldwide, the cognitive computing of visual and auditory information in our country has made significant progress on the technical level, and a number of technical indicators are in a leading or parallel position in the world. There are many incremental innovations on the technical level, reflected in performance improvements. However, the integrated technical system has relatively large shortcomings. For example, the United States places research on brain science and brain-like science at the forefront, while China’s independent research and development capabilities in this area are relatively weak, with gaps in breakthroughs and innovations; the research accumulation inspired by cognitive computing of visual and auditory information is insufficient; research on the basic theories and methods of cognitive computing of visual and auditory information in open dynamic environments remains limited in China; and research on machine learning theories and methods for dynamic interactive collaboration also has weaknesses. These innovative basic theories and technologies urgently need to be strengthened.

© Zhejiang University Press 2023 N. Zheng (ed.), Cognitive Computing of Visual and Auditory Information, Reports of China’s Basic Research, https://doi.org/10.1007/978-981-99-3228-3_4


2. Core tools and platforms for cognitive computing of visual and auditory information are insufficient

The study and optimization of technologies and tools in the field of cognitive computing of visual and auditory information rely on powerful and efficient learning and reasoning tools. At present, most Chinese researchers in cognitive computing of visual and auditory information work with the help of widely circulated deep learning platforms such as TensorFlow and Caffe. Developing basic core tools takes a long time and usually requires several years of accumulated work; there are not only hard theoretical problems but also large-scale development and optimization problems. Most core tools and platforms for cognitive computing of visual and auditory information come from foreign countries, a situation that is not conducive to the sustainable development of artificial intelligence in our country. Compared with the open-source and third-party tools that have emerged abroad, the tools developed in this major research plan are not yet highly usable, and there is still a certain gap in their application across many fields and in the range of services they provide.

3. Inadequate development of chips and systems for cognitive computing of visual and auditory information

Currently, most intelligent computing processors for visual and auditory information have the following limitations: (i) they are designed specifically to optimize neural network calculations while ignoring the diversity and complementarity of intelligent computing; (ii) they unilaterally pursue the theoretical peak performance of numerical computing, without combining the hierarchical and structured characteristics of the intelligent (cognitive) computing framework to coordinate and manage resources; (iii) their support for the many graph data structures found in intelligent computing is poor.
In the future, it is necessary to explore ways to promote the integration of statistical models and neural networks at the basic level to improve the information mining capabilities of the system. Under the framework of intelligent (cognitive) computing, the time and space management and scheduling of resources should be performed dynamically according to task service quality and the selective attention mechanism, realizing the collaborative organization of flexible and efficient heterogeneous computing, storage and communication resources. The research groups are designing a communication architecture tightly coupled with the computing unit, which supports efficient storage of graph data structures and is highly scalable, together with a flexible deep neural network computing unit. With regard to these new challenges, our country’s technological level is slightly behind.

4. There are still shortcomings in the application of cognitive computing technology of visual and auditory information in cross-domain integration

Artificial intelligence is an enabling technology. The cognitive computing of visual and auditory information is highly integrated with various disciplines, and its applications are relatively wide, including control, communication, finance, network security, driverless driving, robotics, unmanned aerial vehicles, national defense
fields, and so on. The integrative development of cognitive computing of visual and auditory information with these various fields requires continuous improvement through long-term research accumulation and application practice. Most of these application requirements lie in key national areas. In many fields, compared with the wide and strong demand, the translation and diffusion of achievements in cognitive computing technology of visual and auditory information are insufficient. It is necessary to strengthen the public services built on cognitive computing technology of visual and auditory information.

4.1.2 Strategic Needs

In the field of artificial intelligence, the requirements for cognitive computing of visual and auditory information mainly include two aspects: on the one hand, it is necessary to consolidate, extend, radiate and sublimate the results of cognitive computing of visual and auditory information achieved in this major research plan; on the other hand, the continuous development of basic theories and methods, core tools and platforms, and AI chips and systems is demanded by the development of artificial intelligence itself. In the wave of global informatization, artificial intelligence provides the core energy for a new technological revolution and becomes an important cornerstone of an intelligent world, while cognitive computing of visual and auditory information provides the core basic theories and methods for artificial intelligence. As a technology carrier, artificial intelligence has become the origin of many innovations in various fields. It needs to be continuously developed and maintained through the innovations of various industries; meanwhile, these industries put forward more requirements and demands for the cognitive computing of visual and auditory information. The rise and fall of artificial intelligence depend on whether its algorithms can support the demands placed on them. The most important things for artificial intelligence technology are to achieve robust performance and to improve the efficiency of industry. This results in a continuous demand for research on cognitive computing in the field of AI, especially for visual and auditory information. Artificial intelligence requires robust cognitive computing algorithms for visual and auditory information, sufficient hardware computing power, and data resources. Its driving force mainly comes from the needs of industrial upgrading and technological change.
The development of artificial intelligence mainly draws on problem orientation, computing platforms, and human intelligence. In a future world of technological change, artificial intelligence will become ubiquitous and a core element. In order to respond quickly to complex, diverse and dynamic application demands, and to deal with the challenges brought by big data and intelligent applications, artificial intelligence should better support basic theories, core tools, platforms, chips and other intelligent computing and management capabilities, and better provide intelligent application mechanisms. AI must effectively address the challenges posed by the above aspects, and cognitive computing of visual and auditory information is one
of the effective ways. In order to adapt to major changes in platform environments and application models, artificial intelligence systems are gradually evolving into a new type of complex system that is open, multi-tasking, dynamic, fragile and uncertain, and that exhibits new characteristics such as autonomy, man–machine collaboration, knowledge evolution, deep situational understanding and spontaneity. To adapt to new open and dynamic application scenarios, new ubiquitous, complex and rapid application requirements, and the new thinking modes brought about by the Internet and big data, it is urgent to develop the basic theories, technologies and methods of artificial intelligence, and especially to achieve breakthroughs in the basic theories and methods of cognitive computing of visual and auditory information. In the face of the increasing demands of practical applications, and the complexity, dynamics and autonomy of application scenarios, the cognitive computing of visual and auditory information should have stronger abilities of representation, execution and decision-making, more in line with the basic theories, construction methods and evaluation methods of human intelligence.

1. Challenges to the basic theory of cognitive computing of visual and auditory information: the relationship between cognition and computing

Summarizing the progress of this major research plan over the past decade, the core basic scientific problem for the development of a new generation of AI is “the relationship between cognition and computing”. For example, answering the widely discussed question of whether artificial intelligence will replace human intelligence is, boiled down to a scientific question, essentially answering the question of the relationship between cognition and computing.
The relationship between cognition and computing can be further refined into four aspects.

(1) The relationship between the basic unit of cognition and the basic unit of computing

To study any kind of process and establish a scientific theory of it, we must answer a fundamental question: what is the basic unit of its operation? Each basic science has its specific basic units, such as the elementary particles of high-energy physics, the genes of genetics, the symbols of computational theory, and the bits of information theory. For cognitive science, one must answer what the basic unit of the cognitive process is. A large number of cognitive science experiments show that the basic unit of cognition is neither the symbol of computation nor the bit of information theory. Going beyond intuitive understanding, scientifically and accurately defining basic cognitive units, and establishing a unified basic cognitive unit model are fundamentally important for the development of a new generation of artificial intelligence.

(2) The relationship between the anatomical structure of cognitive neural expression and the architecture of artificial intelligence computing

The brain is composed of several brain regions that are relatively independent in structure and function. This modular structure is different from the computer’s central

4.1 Deficiencies and Strategic Needs

processor and unified memory architecture. If deep learning was inspired by the hierarchical structure of the nervous system, then whole-brain imaging, which starts from specific cells and goes beyond individual brain regions, will provide deeper and richer inspiration for the architecture of the new generation of artificial intelligence.

(3) The relationship between the specific mental phenomena of cognitive emergence and the peculiar information-processing phenomena of computational emergence

For example, consciousness is a peculiar phenomenon of cognitive emergence. How can the intuitive and speculative discussion of whether artificial intelligence will produce consciousness be turned into scientific experimental research? We propose comparative research on consciousness, covering its cognitive mechanism, neural expression, evolution, and abnormalities.

(4) The relationship between the mathematical basis of cognition and the mathematical basis of computing

Both the classical theory of computation and the popular Bayesian theory of computation hold that the essence of the cognitive process is the calculation of discrete symbols in the sense of Turing; in other words, the mathematical basis of computing is the Turing machine. It is necessary to develop a new mathematical framework suitable for describing cognition (involving various mathematical branches to describe the basic units, the basic relations, and the basic operations of cognition) and to explore the mathematical basis for a new generation of artificial intelligence.

2. Challenges of human–computer collaboration: human-in-the-loop, human–computer collaborative hybrid intelligence

Intelligent machines have become companions of human beings, and the interaction and mixing of humans and intelligent machines is the developing form of the future society. However, many problems facing humans are uncertain, fragile, and open.
Human beings are the service objects of intelligent machines and the final arbiters of "value judgment"; human intervention in machines is therefore continuous. Even if sufficient or even unlimited data resources are provided to an artificial intelligence system, human intervention cannot be ruled out. For example, to handle the nuances and ambiguities of human language in human–computer interaction systems, and especially in major application areas of artificial intelligence (such as industrial risk management, medical diagnosis, and criminal justice), avoiding the risks, loss of control, and even harm caused by the limitations of artificial intelligence technology requires the introduction of human supervision and interaction. Applications should allow people to participate in verification, improve confidence in intelligent systems, and form a human-in-the-loop hybrid augmented intelligence paradigm that uses human knowledge to achieve the optimal balance between human intelligence and the computing power of machines. Human-in-the-loop hybrid

4 Outlook

augmented intelligence organically combines human cognitive ability with the fast computing and mass storage capacity of computers. In the real world, people cannot model all problems. There are qualification problems and ramification problems: it is impossible to enumerate all the prerequisites of a behavior, or to list all of its ramifications. Machine learning cannot match human capabilities in understanding the real-world environment, processing incomplete information, and handling complex, spatiotemporally associated tasks. Furthermore, both the plasticity of the neural network structure of the human brain and the interaction between non-cognitive factors and cognitive function in the brain are difficult or even impossible to describe with formal systems. The human brain's handling of non-cognitive factors derives more from intuition and is influenced by experience and long-term knowledge accumulation. The natural form of biological intelligence in the human brain provides important enlightenment for improving the adaptability of machines to complex dynamic environments, for the processing of incomplete and unstructured information, for autonomous learning, and for the construction of hybrid augmented intelligence based on cognitive computing.

A cognitive computing architecture can combine complex planning and problem-solving with perception and action modules. It has the potential to explain or implement certain human or animal behaviors and the ways they learn and act in new environments, and it can support artificial intelligence systems that require much less computation than existing programs. Within the framework of cognitive computing, more complete large-scale data processing and more diversified computing platforms can be built; it can also address planning and learning models for multi-agent systems and provide a new model for machine collaboration in new task environments.

3. Challenges to cognitive computing systems for visual and auditory information in open environments

Although great progress has been made, intelligent agents (smart cars, robots) still face the following challenges in adapting to complex open environments.

(1) Accurate perception of complex and uncertain environments

AI systems are used in complex environments with a large number of potential states that cannot be thoroughly inspected or tested. A system may face conditions that were never considered during its design.

(2) Response to "unexpected encounters" based on pre-behavior

For AI systems that learn after deployment, the behavior of the system may be determined mainly by the learning phase under unsupervised conditions, in which case the behavior may be difficult to predict. Human drivers often convey their driving intentions through pre-behavior. For example, a driver can judge whether the driver ahead is a veteran or a novice, and so decide "whether to stay away
from him or her"; current driverless technology, however, finds it difficult to interpret or understand these subtleties.

(3) Natural human–car interaction

In many cases, the performance of AI systems is strongly affected by human interaction, and changes in human responses may affect the safety of the system. To address these issues, additional investment is needed to improve AI safety and reliability, including interpretability and transparency, trust, verification and validation, security against attacks, and long-term AI safety and optimization.

(4) The risks of artificial intelligence security

Autonomous driving that uses the cloud to access and update maps will be at greater risk. In addition, we hope that local governments can provide a qualified road-test environment for driverless vehicles and better space for innovation. For example, some areas can be opened up and temporary license plates provided to allow driverless vehicles to conduct road tests, so as to promote continuous improvement and innovation of the technology.

4. Safety and reliability of cognitive computing intelligent systems for visual and auditory information

Before cognitive computing intelligent systems for visual and auditory information are widely deployed, it is necessary to ensure that they operate safely, reliably, and in a controlled manner. Research is needed to meet this challenge: creating cognitive computing systems for visual and auditory information that are reliable, trustworthy, and secure. Such systems face important security challenges.

(1) Complex and uncertain environments

In many cases, cognitive computing intelligent systems for visual and auditory information are used in complex environments with a large number of potential states that cannot be thoroughly inspected or tested. A system may face conditions that were never considered during its design.
(2) Emergent behavior

For intelligent systems that learn after deployment, the behavior of the system may be determined mainly by the learning phase under unsupervised conditions, in which case the behavior may be difficult to predict.

(3) Limitations of human–computer interaction

In many cases, the performance of AI systems is strongly affected by human interaction, and changes in human responses may affect the safety of the system. To address these issues, additional investment is needed to improve AI
security and reliability, including interpretability and transparency, trust, verification and validation, and security against attacks.

Improve interpretability and transparency. Many algorithms, including deep learning, are opaque to users, and only a few mechanisms exist to interpret their results.

Increase trust. Designers of systems for cognitive computing of visual and auditory information need to create accurate and reliable systems with informative and user-friendly interfaces, and operators must take the time to be adequately trained to understand system operation and performance limitations. A complex system (such as an intelligent vehicle) that is widely trusted by its users operates in a way that is visible to the user, credible (the user accepts the output of the system), auditable (the system can be evaluated), reliable (the system acts as the user expects), and recoverable (user control can be restored when needed). Currently and for the foreseeable future, a major challenge for intelligent systems for cognitive computing of visual and auditory information remains the unstable quality of software production technology. As stronger connections between humans and intelligent systems are built, the challenges of trust will keep pace with growing capabilities. For technological advances to be adopted and used in the long term, management principles and policies must be formulated for the best practice of research, design, construction, and use, including appropriate training of operators.

Enhance verifiability and confirmability. New methods are needed to verify and confirm intelligent systems for cognitive computing of visual and auditory information: "verifiability" establishes that the system meets its formal specifications, while "confirmability" means that the system meets the user's operational requirements.
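One concrete form such verification can take is runtime monitoring: wrapping a system so that every output is checked against a formal safety invariant before it takes effect. A minimal sketch in Python follows; the controller, the invariant, and all thresholds are illustrative assumptions, not from this report.

```python
# Minimal runtime-monitor sketch: a wrapper that checks every output of a
# (hypothetical) controller against a formal safety invariant. The invariant
# here -- "commanded speed never exceeds a limit when an obstacle is close" --
# is an illustrative assumption.

def monitor(controller, speed_limit=10.0, min_gap=5.0):
    """Wrap a controller so each command is checked against the invariant."""
    def checked(obstacle_distance):
        speed = controller(obstacle_distance)
        # Verifiability: the output must satisfy the formal specification.
        if obstacle_distance < min_gap and speed > speed_limit:
            raise RuntimeError("safety invariant violated")
        return speed
    return checked

def naive_controller(obstacle_distance):
    # Hypothetical learned policy: drive fast unless very close.
    return 30.0 if obstacle_distance > 2.0 else 5.0

safe = monitor(naive_controller)
print(safe(50.0))   # far away: the command passes the check
try:
    safe(3.0)       # close obstacle but fast command: the monitor rejects it
except RuntimeError as e:
    print("blocked:", e)
```

Because the monitor enforces the specification independently of how the controller was built, the approach is attractive for learned components whose internals are opaque.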
Safety-critical intelligent systems may require new methods of assessment (to determine whether a system is malfunctioning, perhaps because it is operating outside its expected parameters), diagnosis (to determine the cause of a failure), and maintenance (to adjust the system to resolve the failure). Systems that operate autonomously over extended periods may encounter conditions their designers never considered; such systems may require the ability to self-evaluate, self-diagnose, and self-repair in order to remain robust and reliable.

Protect the system from attack. Cognitive computing intelligence for visual and auditory information embedded in critical systems must be durable in order to cope with accidents, but it must also be secure against large-scale deliberate cyber attacks. Security engineering involves understanding the vulnerabilities of a system and the actions of those who intend to attack it. For example, a key research area is adversarial machine learning, which explores how much harm can be done to an intelligent system for cognitive computing of visual and auditory information by "contaminating" training data, modifying algorithms, or making small changes to a target that prevent it from being correctly recognized (for example, a prosthetic disguise that deceives a facial recognition system). Applying cognitive computing intelligence for visual and auditory information in network security systems that require a high degree of autonomy is also an area that needs further research. A recent example in
this area is DARPA's Cyber Grand Challenge, which involved intelligent agents for cognitive computing of visual and auditory information that autonomously analyze and resist cyber attacks.

Achieve long-term AI safety and optimization. Intelligent systems for cognitive computing of visual and auditory information may eventually be capable of recursive self-improvement, in which large numbers of software modifications are made by the software itself rather than by human programmers. To ensure the safety of self-modifying systems, additional research is needed on: self-monitoring architectures that check the system's behavior for consistency with the original goals of its human designers; confinement strategies that prevent the system from being released during evaluation; value learning, in which user values, goals, or intentions can be inferred by the system; and value structures that are provably resistant to self-modification.

5. Challenges concerning the ethical, legal, and social impacts of intelligent systems based on cognitive computing of visual and auditory information

When an intelligent system acts on its own, we expect it to act according to our formal and informal human norms. Law and morality, as forces underpinning the basic social order, can not only inform but also judge the behavior of cognitive computing intelligent systems for visual and auditory information. The primary research need is an understanding of the ethical, legal, and social implications of intelligent systems, and intelligent designs must be developed in ways that comply with ethical, legal, and social principles. Privacy issues must also be considered. As with other technologies, legal and ethical principles will inform the acceptable uses of intelligent systems; how to apply these principles to this new technology is a challenge, especially for technologies involving autonomy, agency, and control.
Research in this field can benefit from the multidisciplinary perspectives of experts in computer science, the social and behavioral sciences, ethics, the biomedical sciences, psychology, economics, law, and policy, as well as from the challenges of key information technology research in the field.

(1) Improve fairness and transparency, and design accountability mechanisms

There is widespread concern about the susceptibility of data-intensive intelligent algorithms for cognitive computing of visual and auditory information to error and abuse, and about the possible consequences for gender, age, race, or economic status. In this regard, properly collecting and using the data of cognitive computing intelligent systems for visual and auditory information is an important challenge. However, in
addition to purely data-related issues, the larger issues that arise in the intelligent design of cognitive computing of visual and auditory information are essentially those of fairness, transparency, and accountability.

(2) Establish ethical cognitive computing intelligence for visual and auditory information

Beyond the basic assumptions of justice and fairness, there are concerns about whether intelligent systems for cognitive computing of visual and auditory information are capable of behaving in ways tolerated by general ethical principles. How can the framework of intelligence be improved with respect to new "machine-related" issues in ethics, and what uses of intelligence might be considered unethical? Ethics is essentially a philosophical question, while AI technologies depend on, and are limited by, engineering. Therefore, within the scope of what is technically feasible, researchers must strive to develop algorithms and architectures consistent with existing laws, social norms, and ethics, which is obviously a very challenging task.

(3) Design intelligent architectures for ethical cognitive computing of visual and auditory information

Further progress must be made in basic research to determine how best to design architectures for AI systems that include moral reasoning. As intelligent systems for cognitive computing of visual and auditory information become more common, their architectures may include subsystems that address ethical issues at multiple levels, so researchers need to focus on how best to approach the overall design of AI systems that meet ethical, legal, and social goals.

In short, cognitive computing of visual and auditory information will play a huge role in a new round of industrial upgrading and social progress. It not only faces great challenges to its basic theories and methods from all sides, but also presents an opportunity for technological innovation in China's information industry.
Continuous investment in the field of cognitive computing of visual and auditory information will vigorously promote progress in new theories, methods, tools, chip architectures, and applications, so as to secure opportunities and advantages in the coming wave of industrial and information development.
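As a concrete illustration of the adversarial attacks discussed under challenge 4 above, a toy linear classifier shows how a small, bounded perturbation of an input can flip a prediction. The weights, input, and perturbation budget below are illustrative assumptions, and the perturbation rule is the fast-gradient-sign idea specialized to a linear score, not a method taken from this report.

```python
# Toy adversarial example on a linear classifier: perturbing each feature by
# at most eps, in the direction that lowers the classification score, flips
# the predicted label. All numbers are illustrative.

def sign(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def predict(w, b, x):
    """Linear classifier: label 1 if w.x + b > 0, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0

def perturb(w, x, eps):
    """Move each feature by eps against the gradient of the score (FGSM idea)."""
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]

w, b = [0.5, -0.25, 1.0], 0.0
x = [0.4, 0.2, 0.3]              # correctly classified as 1
x_adv = perturb(w, x, eps=0.3)   # each feature changed by at most 0.3

print(predict(w, b, x))          # 1
print(predict(w, b, x_adv))      # 0: the bounded perturbation flips the label
```

The same principle, applied to high-dimensional images rather than three features, is what allows visually negligible changes to defeat recognition systems.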


4.2 Ideas and Suggestions for In-Depth Study

4.2.1 The Ideas of In-Depth Research

In view of the overall goal of this major research plan, the research achievements, and the remaining problems, it is suggested to increase support for the following research directions.

1. Theories and methods of cognitive science

The question of the relationship between cognition and computing requires long-term exploration and should receive continuous support. Given the disciplinary characteristics of cognitive science, attention is recommended to the following three aspects.

(1) Attention should be paid to the study of new variables, new concepts, and new principles of cognitive science. A current trend in brain science is that biological data about the brain are appearing in unprecedented quantities, and the concepts of computing and information are widely used in the analysis of cognitive processes. A relatively neglected question, however, is how to understand these unprecedentedly rich biological data at the level of cognitive psychology. Such a question goes beyond the level of biology and must be answered with new concepts and principles, at the level of cognitive science, that are suitable for describing cognition.

(2) Basic research on cognitive computing of visual and auditory information should emphasize the study of systems, wholes, and behavior, with humans as the primary subjects and animals as auxiliary ones, focusing on the macro level in combination with the micro level.

(3) Questions about the relationship between cognition and computing, such as what the basic unit of cognition is, cannot be solved by computational analysis alone; they can only be answered through experiments. Basic research on artificial intelligence should therefore especially support experimental research in cognitive science and pay attention to interdisciplinary research based on cognitive science experiments.

2. Cognitive computing theory of visual and auditory information in open dynamic environments

Research on perception and representation should remedy the defects of independent perception channels, lack of channel collaboration, isolated data sources, poor model comprehensibility, and low semantic level. Research on learning and reasoning should effectively address common problems such as open and dynamic environments and large numbers of samples with scarce, low-quality labels; enhance comprehensibility; and integrate knowledge acquisition with knowledge reasoning. Research on decision-making and control technology is needed to
overcome the problems of heavy reliance on expert knowledge, poor environmental adaptability, and low sample utilization.

3. Human-in-the-loop and human–computer-collaborative hybrid augmented intelligence

Human-in-the-loop hybrid augmented intelligence can handle highly unstructured information and deliver more accurate and credible results than a single machine intelligence system. Research should:

(1) Enable people to train machines in a natural way within the intelligent loop, breaking through the barrier of human–computer knowledge interaction.

(2) Combine human decision-making and experience with the advantages of intelligent machines in logical reasoning, deduction, and induction, to achieve efficient and meaningful human–computer collaboration.

(3) Build cross-task and cross-domain contextual relationships.

(4) Establish task- or concept-driven machine learning methods, so that machines can learn not only from massive training samples but also from human knowledge, and use the learned knowledge to complete highly intelligent tasks.

4. Construct large-scale neuromorphic computing systems

Key questions include: the dynamic interaction between cognitive functions and different distributed areas of the brain network; the internal mechanisms of the formation and dissolution of brain functional networks, and of the connection and separation of brain structural networks; the effective working modes of cooperation, competition, and coordination within brain functional networks during complex cognitive behavior; the functional roles of different brain tissues and the basic mathematical principles relating these roles, including the acquisition, representation, and storage of knowledge; the internal mechanism of memory enhancement in the brain; and the energy consumption patterns the brain uses to process external stimuli.
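Large-scale neuromorphic systems of the kind proposed above are typically built from spiking units. As a toy illustration, a leaky integrate-and-fire neuron, one of the simplest spiking models, can be simulated in a few lines of Python; all constants are arbitrary choices for the sketch, not values from this report.

```python
# Toy leaky integrate-and-fire neuron: input current charges the membrane
# potential, which leaks over time and emits a spike (then resets) when it
# crosses a threshold. Constants are illustrative.

def simulate_lif(currents, leak=0.9, threshold=1.0, v_reset=0.0):
    """Return the spike train (0/1 per step) for a sequence of input currents."""
    v, spikes = 0.0, []
    for i in currents:
        v = leak * v + i          # leaky integration of the input
        if v >= threshold:        # threshold crossing -> spike
            spikes.append(1)
            v = v_reset           # membrane potential resets after a spike
        else:
            spikes.append(0)
    return spikes

print(simulate_lif([0.3] * 10))   # → [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```

A constant input thus produces a regular spike train whose rate encodes the input strength, which is the kind of spike coding the next paragraph discusses.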
Biological neural networks are spiking neural networks: input pulses received by a neuron raise the membrane potential of the cell body, and when the potential exceeds a certain threshold, a nerve impulse is sent down the axon; neurotransmitters released at synapses then act on the dendrites of downstream neurons to affect their membrane potentials. The spike is the signal transmitted between neurons, and researching and understanding how information is encoded in spike trains will help us better understand how the brain works and develop human–computer interaction technology. The key issues of neuromorphic engineering are to understand the morphology of single neurons, neuronal circuits, and the overall structure; to create and obtain the computing power required for different tasks; to complete the
expression of information; and to learn and develop adaptive plasticity and changes favorable to evolution.

5. High-performance, low-power brain-like computing chips and platforms

Brain-like computing requires advances from high-performance computing to highly intelligent computing, and from high-power to low-power computing. The measure of computing power changes accordingly from floating-point operations per second (FLOPS) to synaptic operations per second (SOPS). The human brain has about 1 × 10^11 neurons, each with about 1 × 10^4 synaptic connections. If nerve impulses are released at 10 Hz, the computational load is about 1 × 10^16 SOPS. Assuming that each nerve impulse operation requires about 1 × 10^2 numeric calculations, a high-performance computing system delivering a total of 1 × 10^18 operations per second would be required to simulate the computing power of the entire brain. At present, the computing power of the fastest high-performance computer, Tianhe-2, is (3.86–5.49) × 10^16 FLOPS.

6. Theories and methods of cognitive computing of visual and auditory information inspired by brain cognition

New visual computing models and processing architectures should be designed and studied based on the information processing capabilities of the visual pathway, especially the retina, and on the network structure of neural connections in the brain. The component units of such an architecture include frame-driven and event-driven information acquisition units (intelligent computing); attention-selection or event-driven modes of information access; spatiotemporal dynamic information coding; networked, distributed, and dynamic information processing; network structures combining long-term and short-term memory functions; constraints on elements and conditions; and effective control and guidance. The architecture should realize the mapping of the brain's structural, functional, and effective networks at different levels of visual processing.

7. Natural language understanding

Research should study the mechanisms of human understanding, representation, and cognition of natural language; investigate means of knowledge acquisition in the service of natural language processing, in order to acquire world knowledge, domain knowledge, and linguistic knowledge; develop mathematical modeling methods for natural language processing, such as models of lexis, syntax, semantics, and discourse; and study knowledge-based natural language processing methods, such as in-depth understanding of natural language at the discourse level, question answering, and dialogue.

8. Brain functional networks in natural scene understanding and the deep neural networks they inspire

Research should study the cross-level relationships from attributes and objects to scenes in order to discover representation and reasoning methods for the corresponding visual knowledge; investigate hierarchical recognition of scenes and automatic discovery of related categories and attributes; and conduct research
on recognition and learning methods with the ability to learn by analogy; establish alignments between the temporal and spatial characteristics of visual objects and language expressions, and combine them with computational linguistics to achieve a seamless transition from perception to cognition. Multi-modal brain functional imaging (such as fMRI and EEG) fusion methods and technologies enable observation of brain functional activity at high spatial and temporal resolution. Research should study the identification of brain functional networks in the process of natural scene understanding, the enhancement and inhibition of each sub-network, and the causal and modulatory relationships between sub-networks, which can help verify the structural design of large-scale neural networks and inspire efficient training methods.

9. Intelligent testing mechanisms

Research should develop new theories and methods that break through the classical framework of conceptual representation; discover the mathematical principles behind human dialogue; establish a new computational framework for man–machine dialogue; and provide a theoretical basis for computers to pass the Turing test. Dialogue systems should be developed that conform to the principles of human dialogue (such as the principles of cooperation and politeness), so that the probability of success of man–machine dialogue can be obtained from theory rather than from engineering tests alone.

10. Intelligent vehicles

(1) Comprehensive perception of complex traffic scenes

Under all conditions, no matter how the weather changes or how complicated the road conditions are, a driverless car must reliably perceive the surrounding scene and respond safely. In other words, autonomous driving "must be an artificial intelligence system that will not make any mistakes".

(2) Understanding of "pre-behavior"

Drivers often convey driving intentions through pre-behavior.
For example, a driver can judge whether the driver ahead is a veteran or a novice and so decide "whether to stay away from the car in front", but current driverless technology finds it difficult to interpret or understand these subtleties.

(3) Response to "unexpected encounters"

For example, under temporary traffic control, it is difficult for a driverless car to recognize the hand gestures of a traffic police officer directing vehicles. Likewise, if a pedestrian suddenly decides to cross the road, it is very difficult for a computer system to predict and instantaneously understand the situation, because rule-based autonomous driving cannot encode every scene in advance.

(4) Natural human–car interaction

Autonomous driving must communicate with humans in a natural way to achieve barrier-free communication between vehicle and passengers. When passengers take
a driverless car, the autonomous driving system needs to communicate with them, learn the destination they are going to, and understand and answer the questions they raise, rather than performing simple "point-to-point driving".

(5) Risks to cyber security

Autonomous driving that uses the cloud to access and update maps will face greater risk.

11. A driving brain with a selective-attention memory mechanism

The car of the future is a networked wheeled robot. Research should study a driving brain with selective attention, memory cognition, interactive cognition, computational cognition, and self-learning abilities. As the driver's intelligent agent, it should fully possess the driver's driving recognition and selective attention. On this basis, further study should address where human skills and wisdom come from, what time and space scales should be used to express memory and thinking activities, and how to study and evaluate human cognitive ability.
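The brain-scale arithmetic quoted under direction 5 above (about 10^11 neurons, 10^4 synapses per neuron, 10 Hz firing, and 10^2 numeric operations per synaptic event) can be reproduced directly:

```python
# Back-of-envelope estimate of whole-brain computing demand, using the
# figures quoted in this chapter.
neurons = 1e11              # neurons in the human brain
synapses_per_neuron = 1e4   # synaptic connections per neuron
firing_rate_hz = 10         # assumed mean spike rate
ops_per_spike_event = 1e2   # numeric operations per synaptic event

sops = neurons * synapses_per_neuron * firing_rate_hz   # synaptic ops/s
total_ops = sops * ops_per_spike_event                  # numeric ops/s

print(f"{sops:.0e} SOPS")        # 1e+16 SOPS
print(f"{total_ops:.0e} ops/s")  # 1e+18 operations per second
```

The result, 10^18 operations per second, is roughly two orders of magnitude above the (3.86–5.49) × 10^16 FLOPS figure quoted for Tianhe-2, which is what motivates the SOPS-oriented, low-power chip designs discussed above.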

4.2.2 Suggestions for the Next Step

1. Further deepen interdisciplinary work

It is necessary to further intensify the strength and depth of interdisciplinary cooperation, so as to achieve closer and more substantive collaboration among disciplines and strive for important progress in the field of cognitive computing of visual and auditory information.

2. Focus on the practical results of academic exchanges

In addition to strengthening multidisciplinary intersection, we should also attach importance to academic exchanges among research teams. In general, these exchanges have not gone deep enough and have not become the norm. The annual "China Intelligent Vehicle Future Challenge Competition" is a good opportunity for communication, through which the strengths and weaknesses of each team can be understood intuitively.

3. Promote research on multi-modal information processing technology through appropriate means

It is recommended to pursue the goals originally proposed by the major research plan in multi-modal information processing, including speech (especially in-vehicle speech), vision, behavior, text, and emotion, so that this part of the work keeps up with the overall development of the major research plan; this is an issue that requires special attention.


© Zhejiang University Press 2023 N. Zheng (ed.), Cognitive Computing of Visual and Auditory Information, Reports of China’s Basic Research, https://doi.org/10.1007/978-981-99-3228-3


Index

A
Augmented intelligence, 33, 34, 91, 92, 98
Autonomous smart cars, 34

C
Cognitive computing, 1, 8, 9, 13, 17–19, 29–37, 39–41, 53, 74, 83, 87–90, 92–97, 99, 101

D
Driverless vehicles, 3, 11–14, 16, 18, 56, 58, 77–79, 81–86, 93

M
Multi-person and multi-party dialogue, 66, 69

N
Natural language comprehension, 33
Neuromorphic computing, 31, 98

V
Visual and auditory information, 1, 3, 6–8, 16, 17, 29–37, 39, 40, 42, 49, 74, 87–90, 92–97, 99, 101
