LNAI 14089
De-Shuang Huang · Prashan Premaratne · Baohua Jin · Boyang Qu · Kang-Hyun Jo · Abir Hussain (Eds.)
Advanced Intelligent Computing Technology and Applications 19th International Conference, ICIC 2023 Zhengzhou, China, August 10–13, 2023 Proceedings, Part IV
Lecture Notes in Computer Science
Lecture Notes in Artificial Intelligence Founding Editor Jörg Siekmann
Series Editors Randy Goebel, University of Alberta, Edmonton, Canada Wolfgang Wahlster, DFKI, Berlin, Germany Zhi-Hua Zhou, Nanjing University, Nanjing, China
14089
The series Lecture Notes in Artificial Intelligence (LNAI) was established in 1988 as a topical subseries of LNCS devoted to artificial intelligence. The series publishes state-of-the-art research results at a high level. As with the LNCS mother series, the mission of the series is to serve the international R & D community by providing an invaluable service, mainly focused on the publication of conference and workshop proceedings and postproceedings.
De-Shuang Huang · Prashan Premaratne · Baohua Jin · Boyang Qu · Kang-Hyun Jo · Abir Hussain Editors
Advanced Intelligent Computing Technology and Applications 19th International Conference, ICIC 2023 Zhengzhou, China, August 10–13, 2023 Proceedings, Part IV
Editors
De-Shuang Huang, Department of Computer Science, Eastern Institute of Technology, Zhejiang, China
Prashan Premaratne, University of Wollongong, North Wollongong, NSW, Australia
Baohua Jin, Zhengzhou University of Light Industry, Zhengzhou, China
Boyang Qu, Zhong Yuan University of Technology, Zhengzhou, China
Kang-Hyun Jo, University of Ulsan, Ulsan, Korea (Republic of)
Abir Hussain, Department of Computer Science, Liverpool John Moores University, Liverpool, UK
ISSN 0302-9743 ISSN 1611-3349 (electronic)
Lecture Notes in Artificial Intelligence
ISBN 978-981-99-4751-5 ISBN 978-981-99-4752-2 (eBook)
https://doi.org/10.1007/978-981-99-4752-2
LNCS Sublibrary: SL7 – Artificial Intelligence

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
The International Conference on Intelligent Computing (ICIC) was started to provide an annual forum dedicated to emerging and challenging topics in artificial intelligence, machine learning, pattern recognition, bioinformatics, and computational biology. It aims to bring together researchers and practitioners from both academia and industry to share ideas, problems, and solutions related to the multifaceted aspects of intelligent computing. ICIC 2023, held in Zhengzhou, China, August 10–13, 2023, constituted the 19th International Conference on Intelligent Computing. It built upon the success of ICIC 2022 (Xi’an, China), ICIC 2021 (Shenzhen, China), ICIC 2020 (Bari, Italy), ICIC 2019 (Nanchang, China), ICIC 2018 (Wuhan, China), ICIC 2017 (Liverpool, UK), ICIC 2016 (Lanzhou, China), ICIC 2015 (Fuzhou, China), ICIC 2014 (Taiyuan, China), ICIC 2013 (Nanning, China), ICIC 2012 (Huangshan, China), ICIC 2011 (Zhengzhou, China), ICIC 2010 (Changsha, China), ICIC 2009 (Ulsan, South Korea), ICIC 2008 (Shanghai, China), ICIC 2007 (Qingdao, China), ICIC 2006 (Kunming, China), and ICIC 2005 (Hefei, China). This year, the conference concentrated mainly on theories and methodologies as well as emerging applications of intelligent computing. Its aim was to unify the picture of contemporary intelligent computing techniques as an integral concept that highlights the trends in advanced computational intelligence and bridges theoretical research with applications. Therefore, the theme for this conference was “Advanced Intelligent Computing Technology and Applications”. Papers that focused on this theme were solicited, addressing theories, methodologies, and applications in science and technology. ICIC 2023 received 828 submissions from 12 countries and regions. All papers went through a rigorous peer-review procedure and each paper received at least three review reports. 
Based on the review reports, the Program Committee finally selected 337 high-quality papers for presentation at ICIC 2023 and inclusion in five volumes of proceedings published by Springer: three volumes of Lecture Notes in Computer Science (LNCS) and two volumes of Lecture Notes in Artificial Intelligence (LNAI). This volume, LNAI 14089, includes 68 papers. The organizers of ICIC 2023, including Eastern Institute of Technology, China, Zhongyuan University of Technology, China, and Zhengzhou University of Light Industry, China, made an enormous effort to ensure the success of the conference. We would like to thank the members of the Program Committee and the referees for their collective effort in reviewing and soliciting the papers. In particular, we would like to thank all the authors for contributing their papers. Without the high-quality submissions from the authors, the success of the conference would not have been possible. Finally,
we are especially grateful to the International Neural Network Society and the National Science Foundation of China for their sponsorship.

June 2023
De-Shuang Huang Prashan Premaratne Boyang Qu Baohua Jin Kang-Hyun Jo Abir Hussain
Organization
General Co-chairs
De-Shuang Huang, Eastern Institute of Technology, China
Shizhong Wei, Zhengzhou University of Light Industry, China
Program Committee Co-chairs
Prashan Premaratne, University of Wollongong, Australia
Baohua Jin, Zhengzhou University of Light Industry, China
Kang-Hyun Jo, University of Ulsan, Republic of Korea
Abir Hussain, Liverpool John Moores University, UK
Organizing Committee Co-chair
Hui Jing, Zhengzhou University of Light Industry, China
Organizing Committee Members
Fubao Zhu, Zhengzhou University of Light Industry, China
Qiuwen Zhang, Zhengzhou University of Light Industry, China
Haodong Zhu, Zhengzhou University of Light Industry, China
Wei Huang, Zhengzhou University of Light Industry, China
Hongwei Tao, Zhengzhou University of Light Industry, China
Weiwei Zhang, Zhengzhou University of Light Industry, China
Award Committee Co-chairs
Michal Choras, Bydgoszcz University of Science and Technology, Poland
Hong-Hee Lee, University of Ulsan, Republic of Korea
Tutorial Co-chairs
Yoshinori Kuno, Saitama University, Japan
Phalguni Gupta, Indian Institute of Technology Kanpur, India
Publication Co-chairs
Valeriya Gribova, Far Eastern Branch of Russian Academy of Sciences, Russia
M. Michael Gromiha, Indian Institute of Technology Madras, India
Boyang Qu, Zhengzhou University, China
Special Session Co-chairs
Jair Cervantes Canales, Autonomous University of Mexico State, Mexico
Chenxi Huang, Xiamen University, China
Dhiya Al-Jumeily, Liverpool John Moores University, UK
Special Issue Co-chairs
Kyungsook Han, Inha University, Republic of Korea
Laurent Heutte, Université de Rouen Normandie, France
International Liaison Co-chair
Prashan Premaratne, University of Wollongong, Australia
Workshop Co-chairs
Yu-Dong Zhang, University of Leicester, UK
Hee-Jun Kang, University of Ulsan, Republic of Korea
Publicity Co-chairs
Chun-Hou Zheng, Anhui University, China
Dhiya Al-Jumeily, Liverpool John Moores University, UK
Jair Cervantes Canales, Autonomous University of Mexico State, Mexico
Exhibition Contact Co-chair
Fubao Zhu, Zhengzhou University of Light Industry, China
Program Committee Members Abir Hussain Antonio Brunetti Antonino Staiano Bin Liu Bin Qian Bin Yang Bing Wang Binhua Tang Bingqiang Liu Bo Li Changqing Shen Chao Song Chenxi Huang Chin-Chih Chang Chunhou Zheng Chunmei Liu Chunquan Li Dahjing Jwo Dakshina Ranjan Kisku Dan Feng Daowen Qiu Dharmalingam Muthusamy Dhiya Al-Jumeily Dong Wang
Liverpool John Moores University, UK Polytechnic University of Bari, Italy Università di Napoli Parthenope, Italy Beijing Institute of Technology, China Kunming University of Science and Technology, China Zaozhuang University, China Anhui University of Technology, China Hohai University, China Shandong University, China Wuhan University of Science and Technology, China Soochow University, China Harbin Medical University, China Xiamen University, China Chung Hua University, Taiwan Anhui University, China Howard University, USA University of South China, China National Taiwan Ocean University, Taiwan National Institute of Technology Durgapur, India Huazhong University of Science and Technology, China Sun Yat-sen University, China Bharathiar University, India Liverpool John Moores University, UK University of Jinan, China
Dunwei Gong Eros Gian Pasero Evi Sjukur Fa Zhang Fengfeng Zhou Fei Guo Gaoxiang Ouyang Giovanni Dimauro Guoliang Li Han Zhang Haibin Liu Hao Lin Haodi Feng Hongjie Wu Hongmin Cai Jair Cervantes Jixiang Du Jing Hu Jiawei Luo Jian Huang Jian Wang Jiangning Song Jinwen Ma Jingyan Wang Jinxing Liu Joaquin Torres-Sospedra Juan Liu Jun Zhang Junfeng Xia Jungang Lou Kachun Wong Kanghyun Jo Khalid Aamir Kyungsook Han L. Gong Laurent Heutte
China University of Mining and Technology, China Politecnico di Torino, Italy Monash University, Australia Beijing Institute of Technology, China Jilin University, China Central South University, China Beijing Normal University, China University of Bari, Italy Huazhong Agricultural University, China Nankai University, China Beijing University of Technology, China University of Electronic Science and Technology of China, China Shandong University, China Suzhou University of Science and Technology, China South China University of Technology, China Autonomous University of Mexico State, Mexico Huaqiao University, China Wuhan University of Science and Technology, China Hunan University, China University of Electronic Science and Technology of China, China China University of Petroleum, China Monash University, Australia Peking University, China Abu Dhabi Department of Community Development, UAE Qufu Normal University, China Universidade do Minho, Portugal Wuhan University, China Anhui University, China Anhui University, China Huzhou University, China City University of Hong Kong, China University of Ulsan, Republic of Korea University of Sargodha, Pakistan Inha University, Republic of Korea Nanjing University of Posts and Telecommunications, China Université de Rouen Normandie, France
Le Zhang Lejun Gong Liang Gao Lida Zhu Marzio Pennisi Michal Choras Michael Gromiha Ming Li Minzhu Xie Mohd Helmy Abd Wahab Nicola Altini Peng Chen Pengjiang Qian Phalguni Gupta Prashan Premaratne Pufeng Du Qi Zhao Qingfeng Chen Qinghua Jiang Quan Zou Rui Wang Saiful Islam Seeja K. R. Shanfeng Zhu Shikui Tu Shitong Wang Shixiong Zhang Sungshin Kim Surya Prakash Tatsuya Akutsu Tao Zeng Tieshan Li Valeriya Gribova
Vincenzo Randazzo
Sichuan University, China Nanjing University of Posts and Telecommunications, China Huazhong Univ. of Sci. & Tech., China Huazhong Agriculture University, China University of Eastern Piedmont, Italy Bydgoszcz University of Science and Technology, Poland Indian Institute of Technology Madras, India Nanjing University, China Hunan Normal University, China Universiti Tun Hussein Onn Malaysia, Malaysia Polytechnic University of Bari, Italy Anhui University, China Jiangnan University, China GLA University, India University of Wollongong, Australia Tianjin University, China University of Science and Technology Liaoning, China Guangxi University, China Harbin Institute of Technology, China University of Electronic Science and Technology of China, China National University of Defense Technology, China Aligarh Muslim University, India Indira Gandhi Delhi Technical University for Women, India Fudan University, China Shanghai Jiao Tong University, China Jiangnan University, China Xidian University, China Pusan National University, Republic of Korea IIT Indore, India Kyoto University, Japan Guangzhou Laboratory, China University of Electronic Science and Technology of China, China Institute of Automation and Control Processes, Far Eastern Branch of Russian Academy of Sciences, Russia Politecnico di Torino, Italy
Waqas Haider Wen Zhang Wenbin Liu Wensheng Chen Wei Chen Wei Peng Weichiang Hong Weidong Chen Weiwei Kong Weixiang Liu Xiaodi Li Xiaoli Lin Xiaofeng Wang Xiao-Hua Yu Xiaoke Ma Xiaolei Zhu Xiangtao Li Xin Zhang Xinguo Lu Xingwei Wang Xinzheng Xu Xiwei Liu Xiyuan Chen Xuequn Shang Xuesong Wang Yansen Su Yi Xiong Yu Xue Yizhang Jiang Yonggang Lu Yongquan Zhou Yudong Zhang Yunhai Wang Yupei Zhang Yushan Qiu
Kohsar University Murree, Pakistan Huazhong Agricultural University, China Guangzhou University, China Shenzhen University, China Chengdu University of Traditional Chinese Medicine, China Kunming University of Science and Technology, China Asia Eastern University of Science and Technology, Taiwan Shanghai Jiao Tong University, China Xi’an University of Posts and Telecommunications, China Shenzhen University, China Shandong Normal University, China Wuhan University of Science and Technology, China Hefei University, China California Polytechnic State University, USA Xidian University, China Anhui Agricultural University, China Jilin University, China Jiangnan University, China Hunan University, China Northeastern University, China China University of Mining and Technology, China Tongji University, China Southeast Univ., China Northwestern Polytechnical University, China China University of Mining and Technology, China Anhui University, China Shanghai Jiao Tong University, China Huazhong University of Science and Technology, China Jiangnan University, China Lanzhou University, China Guangxi University for Nationalities, China University of Leicester, UK Shandong University, China Northwestern Polytechnical University, China Shenzhen University, China
Yunxia Liu Zhanli Sun Zhenran Jiang Zhengtao Yu Zhenyu Xuan Zhihong Guan Zhihua Cui Zhiping Liu Zhiqiang Geng Zhongqiu Zhao Zhuhong You
Zhengzhou Normal University, China Anhui University, China East China Normal University, China Kunming University of Science and Technology, China University of Texas at Dallas, USA Huazhong University of Science and Technology, China Taiyuan University of Science and Technology, China Shandong University, China Beijing University of Chemical Technology, China Hefei University of Technology, China Northwestern Polytechnical University, China
Contents – Part IV
Knowledge Discovery and Data Mining

Ponzi Scheme Identification of Smart Contract Based on Multi Feature Fusion . . . 3
Xiaoxiao Jiang, Mingdong Xie, Shulin Wang, and Sheng Yang

An Adaptive Method for Generating the Traffic State Thresholds on Road Networks . . . 15
Jiacheng Wu, Wang Zhu, and Jianli Xiao

RNL: A Robust and Highly-Efficient Model for Time-Aware Web Service QoS Prediction . . . 27
Jiajia Mi and Hao Wu

A Time-Aware Graph Attention Network for Temporal Knowledge Graphs Reasoning . . . 40
Shuxin Cao, Chengwei Liu, Xiaoxu Zhu, and Peifeng Li

Multivariate Time Series Anomaly Detection Method Based on mTranAD . . . 52
Chuanlei Zhang, Yicong Li, Jie Li, Guixi Li, and Hui Ma

Proximal Symmetric Non-negative Latent Factor Analysis: A Novel Approach to Highly-Accurate Representation of Undirected Weighted Networks . . . 64
Yurong Zhong, Zhe Xie, Weiling Li, and Xin Luo

Information Extraction System for Invoices and Receipts . . . 77
QiuXing Michelle Tan, Qi Cao, Chee Kiat Seow, and Peter Chunyu Yau

Missing Data Analysis and Soil Compressive Modulus Estimation via Bayesian Evolutionary Trees . . . 90
Wenchao Zhang, Peixin Shi, Xiaoqi Zhou, and Pengjiao Jia

Music Emotion Recognition Using Multi-head Self-attention-Based Models . . . 101
Yao Xiao, Haoxin Ruan, Xujian Zhao, Peiquan Jin, and Xuebo Cai

Deep Reinforced Active Learning for Time Series Anomaly Detection . . . 115
Haojie Li, Hongzuo Xu, and Wei Peng
Dynamic Label Propagation Density Peak Clustering Based on the Tissue-Like P Systems . . . 129
Qing Du and Xiyu Liu

TAP-AHGNN: An Attention-Based Heterogeneous Graph Neural Network for Service Recommendation on Trigger-Action Programming Platform . . . 141
Zijun Huang, Jiangfeng Li, Huijuan Zhang, Chenxi Zhang, and Gang Yu

Online Unsupervised Anomaly Detection in Stream Data with Spiking Neural Networks Using Dynamic Scoring . . . 153
Yaling Li and Jintian Ge

Research on Double Input Electric Load Forecasting Model Based on Feature Fusion . . . 165
Zi Wang, Tao Zhang, Sheng Zeng, and Bing Wang

Machine Learning

K-means Based Transfer Learning Algorithm . . . 179
Yuanyuan Du, Bo Li, and Zhonghua Quan

HSIC Induced LncRNA Feature Selection . . . 191
Anjie Guo and Bo Li

2D-DLPP Algorithm Based on SPD Manifold Tangent Space . . . 201
Xiaohang Li, Bo Li, and Zonghui Wang

Cluster Equality Validity Index and Efficient Clustering Optimization Strategy . . . 213
Zebin Huang, Ning Yu, Qingqiang Wu, and KunHong Liu

Modeling Portraits of Students and Exercises for Exercise Recommendation . . . 226
Weiwei Gao, Huifang Ma, Yan Zhao, Jing Wang, and Quanhong Tian

EduAction: A College Student Action Dataset for Classroom Attention Estimation . . . 237
Kunhong Liu, Bin Chen, Liyan Chen, Yong Xu, Lu Lin, Fan Gao, and Yudi Zhao

Zero-Shot Learning Based on Weighted Reconstruction of Hybrid Attribute Groups . . . 249
Jiarui Zhang, Ruilin Li, Nannan Yu, Jian Liu, and Yi Kong
Adaptive Clustering-Based Collusion Detection in Crowdsourcing . . . 261
Ruoyu Xu, Gaoxiang Li, Wei Jin, Austin Chen, and Victor S. Sheng

Improving Causality Explanation of Judge-View Generation Based on Counterfactual . . . 276
Qinhua Huang and Weimin Ouyang

Instance Weighting-Based Noise Correction for Crowdsourcing . . . 285
Qiang Ji, Liangxiao Jiang, and Wenjun Zhang

Improvement of Graph Convolution Network of Missing Data Based on P Systems . . . 298
Runpu Chi and Xiyu Liu

Explainable Artificial Intelligence 101: Techniques, Applications and Challenges . . . 310
Wiktor Kurek, Marek Pawlicki, Aleksandra Pawlicka, Rafał Kozik, and Michał Choraś

MSAM: Cross-Domain Recommendation Based on Multi-Layer Self-Attentive Mechanism . . . 319
XiaoBing Song, JiaYu Bao, Yicheng Di, and Yuan Li

TDRConv: Exploring the Trade-off Between Feature Diversity and Redundancy for a Compact CNN Module . . . 333
Haigen Hu, Deming Zhou, Hui Xu, Qi Chen, Qiu Guan, and Qianwei Zhou

Community Detection Using Revised Medoid-Shift Based on KNN . . . 345
Jiakang Li, Xiaokang Peng, Jie Hou, Wei Ke, and Yonggang Lu

Adaptive Graph Augmentation for Graph Contrastive Learning . . . 354
Zeming Wang, Xiaoyang Li, Rui Wang, and Changwen Zheng

A No Parameter Synthetic Minority Oversampling Technique Based on Finch for Imbalanced Data . . . 367
Shoukun Xu, Zhibang Li, Baohua Yuan, Gaochao Yang, Xueyuan Wang, and Ning Li

Terminology-Enriched Meta-curriculum Learning for Domain Neural Machine Translation . . . 379
Zheng Chen and Yifan Wang
Automatic Model Selection Algorithm Based on BYY Harmony Learning for Mixture of Gaussian Process Functional Regressions Models . . . 391
Xiangyang Guo, Tao Hong, and Jinwen Ma

GNAT: Leveraging Weighted Negative Sampling for Improved Graph Attention Network Performance . . . 404
Yujin Lu, Qi Wang, Wanyi Zhou, and Jeffrey Zheng

Natural Language Processing and Computational Linguistics

SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples . . . 419
Hao Wang and Yong Dou

MISC: A Multimodal Approach for Sentiment Classification of Classical Chinese Poetry . . . 432
Chang Su, Shupin Liu, and Chalian Luo

A Sentence Quality Evaluation Framework for Machine Reading Comprehension Incorporating Pre-trained Language Model . . . 443
Fan-Jun Meng, Ji-Fei He, Xing-Jian Xu, Ya-Juan Zhao, and Li-Jun Sun

UCM: Personalized Document-Level Sentiment Analysis Based on User Correlation Mining . . . 456
Jiayue Qiu, Ziyue Yu, and Wuman Luo

Multi-modal Rumor Detection on Modality Alignment and Multi-perspective Structures . . . 472
Boqun Li, Zhong Qian, Peifeng Li, and Qiaoming Zhu

Recognizing Functional Pragmatics in Chinese Discourses on Enhancing Paragraph Representation and Deep Differential Amplifier . . . 484
Yu Lu, Yaxin Fan, Peifeng Li, Xiaomin Chu, and Qiaoming Zhu

Learning from Patterns via Pre-trained Masked Language Model for Semi-supervised Automated Essay Scoring . . . 497
Jingbo Sun, Weiming Peng, Tianbao Song, and Jihua Song

Multi-granularity Prompts for Topic Shift Detection in Dialogue . . . 511
Jiangyi Lin, Yaxin Fan, Xiaomin Chu, Peifeng Li, and Qiaoming Zhu

Recognizing Functional Pragmatics of Chinese Discourses on Data Augmentation and Dependency Graph . . . 523
Yu Lu, Feng Jiang, Xiaomin Chu, Peifeng Li, and Qiaoming Zhu
Simple but Effective: Keyword-Based Metric Learning for Event Sentence Coreference Identification . . . 536
Tailai Peng, Rui Chen, Zhe Cui, and Zheng Chen

Automatic Text Extractive Summarization Based on Text Graph Representation and Attention Matrix . . . 551
Yuan-Ching Lin and Jinwen Ma

Transition-Based Mention Representation for Neural Coreference Resolution . . . 563
Qingqing Li and Fang Kong

Speaker-Aware Dialogue Discourse Parsing with Meta-Path Based Heterogeneous Graph Neural Network . . . 575
Shaoming Ji and Fang Kong

RA-KD: Random Attention Map Projection for Knowledge Distillation . . . 587
Linna Zhang, Yuehui Chen, Yi Cao, and Yaou Zhao

Nucleus Beam Search for Machine Translation Decoding . . . 597
Zheng Chen, Ruiwen Tao, and Yifan Wang

A Survey on Multimodal Named Entity Recognition . . . 609
Shenyi Qian, Wenduo Jin, Yonggang Chen, Jiangtao Ma, Yaqiong Qiao, and Jinyu Lu

HSRG-WSD: A Novel Unsupervised Chinese Word Sense Disambiguation Method Based on Heterogeneous Sememe-Relation Graph . . . 623
Meng Lyu and Shasha Mo

Employing Beautiful Sentence Evaluation to Automatic Chinese Essay Scoring . . . 634
Yaqiong He, Xiaomin Chu, and Peifeng Li

A Data Augmentation Method Based on Sub-tree Exchange for Low-Resource Neural Machine Translation . . . 646
Chuncheng Chi, Fuxue Li, Hong Yan, Hui Guan, and Zhongchao Zhao

Improving Neural Machine Translation by Retrieving Target Translation Template . . . 658
Fuxue Li, Chuncheng Chi, Hong Yan, and Zhen Zhang

Chinese Named Entity Recognition Based on Multi-feature Fusion . . . 670
Zhenxiang Sun, Runyuan Sun, Zhifeng Liang, Zhuang Su, Yongxin Yu, and Shuainan Wu
Knowledge Graph Construction for Supply Chain Management in Manufacturing Industry . . . 682
Wenyan Chen, Lianglun Cheng, Tao Wang, and Jianfeng Deng

Leveraging Inter-class Differences and Label Semantics for Few-Shot Text Classification . . . 694
Xinran Xie, Rui Chen, Tailai Peng, Zhe Cui, and Zheng Chen

Simplifying Aspect-Sentiment Quadruple Prediction with Cartesian Product Operation . . . 707
Jigang Wang, Aimin Yang, Dong Zhou, Nankai Lin, Zepeng Wang, Weifeng Huang, and Boyu Chen

A Content Word Augmentation Method for Low-Resource Neural Machine Translation . . . 720
Fuxue Li, Zhongchao Zhao, Chuncheng Chi, Hong Yan, and Zhen Zhang

STADEE: STAtistics-Based DEEp Detection of Machine Generated Text . . . 732
Zheng Chen and Huming Liu

Multimodal Event Detection on Chinese Glyphs . . . 744
Qianqian Si, Zhongqing Wang, and Peifeng Li

Weak Positive Sampling and Soft Smooth Labeling for Distractor Generation Data Augmentation . . . 756
Jiayun Wang, Jun Bai, Wenge Rong, Yuanxin Ouyang, and Zhang Xiong

Chinese Event Extraction with Small-Scale Language Model . . . 768
Quanlin Chen, Jun Jia, and Shuo Fan

Exploiting Query Knowledge Embedding and Trilinear Joint Embedding for Visual Question Answering . . . 780
Zheng Chen and Yaxin Wen

Beyond the Label Distribution Prior for Long-Tailed Recognition . . . 792
Ming Li and Liujuan Cao

KDCE: Effective Lifelong Learning for Code Pre-train Language Model . . . 804
Jiadong Feng and Hui Li

Sliding Window GBDT for Electricity Demand Forecasting . . . 817
Qing Yin, Chengyu Sun, Min Cao, Yong Duan, Rongrong Jia, Wenjin Yu, Xin Yi, Lu Yin, Jiangsheng Huang, and Zhihong Zhang
Knowledge Graph Completion for Power Grid Main Equipment Using Pretrained Language Models . . . 828
Chenxiang Lin, Zhou Zheng, Shitao Cai, Li Fu, Wei Xie, Teng Ma, and Zhihong Zhang

Author Index . . . 839
Knowledge Discovery and Data Mining
Ponzi Scheme Identification of Smart Contract Based on Multi Feature Fusion Xiaoxiao Jiang1 , Mingdong Xie2 , Shulin Wang1 , and Sheng Yang1(B) 1 College of Computer Science and Electronic Engineering, Hunan University,
Changsha 410000, Hunan, China [email protected] 2 Hunan Mingyan Culture Communication Co., Ltd., No. 39, Jianshan Road, Changsha 410000, Hunan, China
Abstract. Due to the anonymity and imperfect supervision of the blockchain, crimes committed on the blockchain are difficult to investigate. In recent years, scams based on blockchain smart contracts have emerged one after another, among which the losses caused by Ponzi schemes have reached millions of dollars. However, there is little research on smart contract fraud identification at present, and existing detection methods do not use information comprehensively: they rely on a single feature, or simply fuse multiple features directly without considering the redundancy and relationships between features. To solve these problems, in this paper we propose a multi-feature fusion scheme identification model (MFFSI) for smart contracts. Our contributions mainly include the following two points: 1) in terms of information use, the operands containing information about the jump relationships of opcode execution are retained; 2) in feature fusion, operation code (opcode) and application binary interface (ABI) sequence features are extracted, and attention modules are used to guide the fusion of features to alleviate the interference of irrelevant features on classification. The results show that the proposed model achieves good detection performance: its F1 score is higher than 86%, better than previous methods. Keywords: Blockchain · smart contract · Ponzi scheme · deep learning · multi feature fusion
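The attention-guided fusion described in the abstract can be sketched in miniature as follows. This is an illustrative toy, not the MFFSI implementation: the scoring vector `w` is a hypothetical stand-in for learned attention parameters. It scores the opcode and ABI feature vectors, softmax-normalizes the two scores, and returns the attention-weighted sum, so the feature source the attention module deems less relevant contributes less to the fused representation.

```python
import math

def attention_fuse(opcode_feat, abi_feat, w):
    """Fuse two equal-length feature vectors via softmax attention weights."""
    # Score each feature vector with a (hypothetical) learned weight vector w.
    score_op = sum(wi * xi for wi, xi in zip(w, opcode_feat))
    score_abi = sum(wi * xi for wi, xi in zip(w, abi_feat))
    # Softmax over the two scores (max-subtracted for numerical stability).
    m = max(score_op, score_abi)
    e_op, e_abi = math.exp(score_op - m), math.exp(score_abi - m)
    a_op, a_abi = e_op / (e_op + e_abi), e_abi / (e_op + e_abi)
    # Attention-weighted combination of the two feature vectors.
    return [a_op * x + a_abi * y for x, y in zip(opcode_feat, abi_feat)]

# Toy example: the weight vector favors the first dimension, so the opcode
# features receive the larger attention weight and dominate the fused vector.
opcode_feat = [1.0, 0.0]
abi_feat = [0.0, 1.0]
w = [1.0, 0.0]
fused = attention_fuse(opcode_feat, abi_feat, w)
```

In the paper's actual model the attention weights are learned jointly with the classifier; this sketch only shows the weighting mechanism itself.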
1 Introduction

In 2008, Satoshi Nakamoto proposed a decentralized cryptocurrency, Bitcoin [1]. In 2009, the Bitcoin system began to operate as a decentralized and anonymous payment system. In 2014, the Turing-complete Ethereum system [2] emerged. Unlike Bitcoin, it is not limited to realizing the transfer of value: real transaction logic can be realized using a smart contract, which is automatically executed on the blockchain once its preset conditions are met [3]. This anonymous and trusted system, which is not controlled by a third party, has attracted the attention of a large number of investors, and the assets involved in contracts continue to grow [4]. At the same time, criminals use loopholes, hacker attacks and Ponzi scams
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 3–14, 2023. https://doi.org/10.1007/978-981-99-4752-2_1
4
X. Jiang et al.
to gain benefits. An investigation report shows that the proportion of losses caused by Ponzi fraud is as high as 76% [5]. Ponzi fraud is a traditional investment fraud: it uses the money of later investors to pay interest and returns to earlier investors. Once new investment is insufficient to cover interest payments, the chain of payments breaks, and later investors are doomed to lose most or even all of their funds. This obviously undermines the smooth operation of the economy and society. Due to the anonymity and lack of supervision of the blockchain, criminals can easily use smart contracts on the blockchain to disguise a Ponzi scheme as a high-value investment plan. Each smart contract has its own account characteristics and transaction data characteristics. At present, there are three main approaches to identifying smart contract scams. The first is empirical research: the first empirical analysis of Bitcoin scams [6] found 192 scams that illegally profited at least $11 million from 13,000 different victims. M. Bartoletti et al. [7] analyzed the commonalities of Ponzi fraud smart contracts from three aspects (descriptive information, source code and transaction records) and conducted the first empirical study of Ponzi fraud on the blockchain. The second is feature extraction based on machine learning and manual design. Torres et al. [8] used symbolic execution and well-defined heuristics to detect smart contract honeypots. Tsikerdekis et al. [9] used transaction-related data, combining data science and machine learning methods, to detect fraud contracts. K. Toyoda et al. [10] used machine learning methods to screen malicious accounts that may be involved in fraud based on account transaction data. Chen et al. [11, 12] used XGBoost and random forest (RF) models to detect contract fraud based on smart contract account characteristics and opcode frequency characteristics, respectively.
The F1 value reached 86%, which is far more efficient than manual detection. Fan et al. [13] used an n-gram method for feature extraction on the opcode sequence of the smart contract and a Boosting algorithm to classify scams. These machine learning based methods rely on human cognition for the extraction of smart contract features, so some deeper features cannot be found and extracted. As a result, although they are more efficient and accurate than manual detection, their improvement in accuracy is limited. The third approach is based on deep learning. Huang et al. [14] proposed a word-embedding model that extracts textual semantic information as a fraud detection model for smart contracts; Chen et al. [15], based on the source code of the smart contract, proposed a combined TextCNN and Transformer [16] model to detect Ponzi fraud. Wang et al. [17] and Zhang et al. [18] used LSTM models to classify contract frauds according to account and opcode frequency characteristics and opcode sequence characteristics, respectively. Bian et al. [19] converted the bytecode sequence, opcode frequency and application binary interface call sequence into corresponding grayscale images, mapped them to three color channels to generate RGB images, and used an SE-CapsNet model to detect contract fraud with image-based methods. Different from previous studies, in this paper we propose the MFFSI model to detect fraud. First, in terms of data use, it is based neither on the historical information of contract transactions, which lags, nor on source code, which is not always fully open, nor on single-feature data, but on the opcode and ABI sequence data that exist and are open once each contract is created. Secondly, in data processing, this paper retains the operands containing
Ponzi Scheme Identification of Smart Contract
5
the information about the jump relationships of opcode execution. Third, DCNN-Bert is used to extract semantic features: DCNN extracts local features, and Bert [20] obtains full-text features. This avoids the complexity and slow training convergence of extracting features from very long texts using Bert alone. Fourthly, in data fusion, since there are duplications and connections between features, an attention mechanism is used to generate ABI attention features guided by the opcodes before fusion, to mitigate the negative interaction between features.
2 MFFSI Model

2.1 Overall Framework

The overall process of Ponzi fraud contract detection mainly includes two stages. Data acquisition and preprocessing: first obtain the original data from the Ethereum browser; then disassemble the hexadecimal bytecode to obtain the opcodes and remove meaningless characters; the ABI call sequence is extracted by depth-first search. Multi-feature fusion scheme identification model, as shown in Fig. 1, which is divided into two modules: 1) Feature extraction module: according to their respective word-embedding matrices, DCNN-Bert extracts the semantic features of the opcode and ABI call sequences; 2) Feature fusion and classification module: the opcode text feature guides the generation of attention features over the ABI sequence text feature to mitigate feature duplication; the opcode text feature, ABI call sequence text feature and attention feature are then concatenated and fused; after that, a two-layer fully connected network is used, and Softmax gives the final classification.
Fig. 1. Overall architecture of MFFSI
2.2 Data Acquisition and Data Preprocessing

Data Acquisition. We use a Python program to obtain the bytecode sequence and ABI call sequence corresponding to each smart contract from the Ethereum browser [21] according to the smart contract account, and save them as text documents.

Data Preprocessing. Data preprocessing is divided into two parts: the preprocessing of the opcode sequence and of the ABI call sequence.
Opcode sequence. The preprocessing of the opcode sequence proceeds in the following steps. The first step is to disassemble the bytecode into the corresponding opcodes, including instructions (such as PUSH1) and operands (such as 0x26). The second step is to remove irrelevant characters. Previous work usually removed the operands as irrelevant characters. However, this paper argues that when using opcodes to detect Ponzi scams, the operands contain the jump relationships of the opcodes, so they cannot be removed. To let the model understand these jump relationships, we keep only the starting address of each JUMPDEST code segment (because JUMP and JUMPI are effective jumps only when the jump address is a JUMPDEST). The third step is to simplify the opcodes. According to the Ethereum Yellow Paper [22], there are 142 operation instructions grouped into 10 functional categories. If instructions are used directly, the large number of instructions may cause a dimension disaster. Therefore, we classify opcodes by function to simplify them.

ABI call sequence. The preprocessing of the ABI call sequence is divided into two steps. The first step is to search depth-first for the relevant characters of the ABI call sequence, and remove irrelevant characters and punctuation marks. The second step is to simplify the ABI according to IDF ranking. Because the dimension of each contract's ABI call sequence is too large, it needs to be cleaned to avoid a dimension disaster. This study only adds words with smaller IDF values to the dictionary, and replaces words with larger IDF values with "null". Because the contract classes are extremely imbalanced, if all contracts are pooled together the ordinary contracts dominate and the statistics become inaccurate. Therefore, we divide the contracts into two parts according to whether they are frauds, calculate the IDF value of each part separately, and add characters with an IDF greater than 5.4 to the stop words.
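The disassemble-and-simplify steps can be sketched as follows; the category names and the tiny instruction table are illustrative placeholders, not the paper's actual mapping of all 142 instructions:

```python
# Sketch of opcode-sequence preprocessing: instructions are grouped into
# functional categories, and operands are dropped except the addresses of
# JUMPDEST code segments. Category names here are illustrative.
CATEGORY = {
    "PUSH": "STACK", "POP": "STACK", "DUP": "STACK", "SWAP": "STACK",
    "ADD": "ARITH", "SUB": "ARITH", "MUL": "ARITH", "DIV": "ARITH",
    "JUMP": "JUMP", "JUMPI": "JUMP", "JUMPDEST": "JUMPDEST",
    "SLOAD": "STORAGE", "SSTORE": "STORAGE",
}

def base_name(instr):
    # PUSH1..PUSH32, DUP1..DUP16, SWAP1..SWAP16 share one base instruction
    return instr.rstrip("0123456789")

def simplify(disassembly):
    """disassembly: list of (address, instruction, operand-or-None) tuples."""
    tokens = []
    for addr, instr, operand in disassembly:
        name = base_name(instr)
        tokens.append(CATEGORY.get(name, "OTHER"))
        if name == "JUMPDEST":          # keep only jump-target addresses
            tokens.append(hex(addr))
    return tokens

seq = [(0x00, "PUSH1", "0x26"), (0x02, "JUMP", None),
       (0x26, "JUMPDEST", None), (0x27, "SLOAD", None)]
print(simplify(seq))  # → ['STACK', 'JUMP', 'JUMPDEST', '0x26', 'STORAGE']
```

Grouping PUSH1–PUSH32 into one token keeps the vocabulary small while the retained JUMPDEST addresses preserve the jump structure the text describes.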
Finally, the stop words of each contract's ABI call sequence are removed. For each contract, the IDF is calculated by formulas (1) and (2):

IDF_{0,i} = log(D_0 / (D_{0,i} + 1))   (1)

IDF_{1,i} = log(D_1 / (D_{1,i} + 1))   (2)

where 0 and 1 denote ordinary and fraud contracts respectively, and i is a call instruction such as inputs; D_0 and D_1 represent the total number of documents corresponding to ordinary and fraud contracts respectively; D_{0,i} and D_{1,i} represent the number of ABI-sequence documents containing instruction i for ordinary and fraud contracts respectively; and IDF_{0,i} and IDF_{1,i} are the IDF values of instruction i for ordinary and fraud contracts.
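The class-conditional IDF computation of Eqs. (1)–(2) can be sketched as follows; the toy ABI token lists are invented for illustration:

```python
import math

def idf_per_class(docs):
    """docs: list of ABI-token lists for one class (ordinary OR fraud).
    Returns IDF_{c,i} = log(D_c / (D_{c,i} + 1)) per Eqs. (1)-(2)."""
    D = len(docs)
    df = {}
    for doc in docs:
        for tok in set(doc):          # document frequency, not term frequency
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(D / (n + 1)) for tok, n in df.items()}

def stop_words(docs, threshold=5.4):
    # tokens whose IDF exceeds the paper's 5.4 cutoff become stop words
    return {tok for tok, v in idf_per_class(docs).items() if v > threshold}

ordinary = [["inputs", "outputs"], ["inputs", "payable"], ["inputs"]]
idf = idf_per_class(ordinary)
# "inputs" appears in all 3 documents, so its IDF = log(3/4) is negative
print(round(idf["inputs"], 3))
```

Computing the IDF separately per class, as the text requires, prevents the far more numerous ordinary contracts from dominating the statistic.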
2.3 Word Embedding

Pre-trained word embeddings use other corpora to obtain more prior knowledge, but initializing with pre-trained vectors requires the current data to be similar to the corpus data. Since there is currently no pre-trained model for smart contract code, this paper uses random initialization to generate the word-embedding matrices corresponding to the opcode sequence and the ABI call sequence.
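A minimal sketch of the random-initialization step; the uniform range and helper names are illustrative choices, while dim=256 matches the setting reported in Sect. 3.2:

```python
import random

def build_embedding(vocab, dim=256, seed=0):
    """Randomly initialized word-embedding lookup table (no pre-trained
    vectors exist for smart-contract code, per the text)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.05, 0.05) for _ in range(dim)] for w in vocab}

def embed(tokens, table):
    # A text of length n becomes the n-row matrix x1 ⊕ x2 ⊕ … ⊕ xn
    return [table[t] for t in tokens]

table = build_embedding(["STACK", "JUMP", "JUMPDEST"], dim=4)
matrix = embed(["STACK", "JUMP"], table)
print(len(matrix), len(matrix[0]))  # → 2 4
```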
Each word of each text is represented as a vector, and each text is represented as a matrix. The word-embedding matrix corresponding to a text of length n can be represented as x_{1:n} = x_1 ⊕ x_2 ⊕ … ⊕ x_n, where ⊕ is the concatenation operator and x_i is the word vector of the i-th word. The word-embedding matrices obtained from random initialization of the opcode and ABI call sequences are T_o and T_a.

2.4 Feature Extraction Module

According to the word-embedding matrix T generated in the previous step, local features are first extracted by DCNN, then global features are extracted by the Bert [20] network, and finally a feature vector F containing local and global information is obtained. Bert extracts the contextual semantic information of the sequence using the encoder layers of a multi-layer bidirectional Transformer. Bert uses the self-attention mechanism of the Transformer encoder to extract context information, and its multi-layer, bidirectional structure can fully learn the contextual semantics of the text. Each Transformer encoder in Bert consists of two sublayers, as shown in Fig. 2: the first is a multi-head attention layer, and the second is a fully connected layer. After each of the two sublayers, residual connection and layer normalization are applied. Q, K and V are obtained by adding the position encoding to the word-embedding matrix and multiplying the result by different weight matrices. The output of each vector in multi-head attention is obtained from all input vectors through matrix operations. Both Bert and BiLSTM can learn the contextual information of a language. However, BiLSTM is a sequential structure that relies on the output of previous time steps to learn context and cannot be computed in parallel; when the sequence is too long, vanishing or exploding gradients may occur.
Bert's learning of context is based on all input vectors and does not depend on the output of previous time steps; therefore, Bert can be computed in parallel.
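The scaled dot-product attention at the heart of each Transformer encoder can be sketched in plain Python; this toy version omits the multi-head split and the learned Q/K/V projection matrices:

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors. Every output
    position depends on all inputs at once, which is why the computation
    parallelizes across positions (unlike a BiLSTM's sequential recurrence)."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
print(out)
```

Each output row is a convex combination of all value vectors, so the attention weights for one row always sum to 1.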
Fig. 2. Transformer encoder structure
The structure of the DCNN-Bert layer is shown in Fig. 3. It is a deep architecture containing multiple convolutional layers, each followed by a max-pooling layer that downsamples the feature map to discard useless features and retain important ones. Position information is then added, and Bert is used to capture the overall dependencies. Finally, the matrix containing global information is pooled by global average pooling to obtain the final text feature F. After the above operations, the opcode sequence and ABI call sequence yield the text features F_o and F_a containing local
and global semantic information, respectively:

F_o = DCNN-Bert(T_o)   (3)

F_a = DCNN-Bert(T_a)   (4)
Fig. 3. DCNN-Bert structure
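The convolution-pooling pipeline of the DCNN stage can be sketched numerically; the kernel values and window sizes below are toy choices rather than the trained parameters (the paper uses 256 kernels of size 5 with pooling size 5):

```python
def conv1d(xs, kernel):
    """Valid (no-padding) 1-D convolution extracting local features."""
    k = len(kernel)
    return [sum(kernel[j] * xs[i + j] for j in range(k))
            for i in range(len(xs) - k + 1)]

def max_pool(xs, size):
    """Keep the strongest response in each non-overlapping window."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

def global_avg_pool(xs):
    """Collapse the (Bert-processed) sequence into one summary value."""
    return sum(xs) / len(xs)

seq = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feat = conv1d(seq, [0.5, 0.5])   # local features
feat = max_pool(feat, 2)         # downsample, keeping important responses
print(feat, global_avg_pool(feat))  # → [2.5, 4.5] 3.5
```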
2.5 Feature Fusion Module

The contributions of the individual ABI-sequence features to classification are not equal, so different weights need to be assigned to each feature. Because the opcode-sequence feature contains strong semantic information, we use it to participate in generating the attention feature of the ABI feature, in order to reduce the interference of classification-irrelevant ABI features. First, F_o is passed through a neuron layer with the nonlinear activation function ReLU, and the result is fed into Softmax to obtain the weight of each feature. Finally, the weights are multiplied element-wise with the ABI sequence feature F_a to obtain the weighted attention feature A_a. The formulas are as follows:

h_k = relu(w_k t_k + b_k), t_k ∈ F_o   (5)

W_k = e^{h_k} / Σ_p e^{h_p}   (6)

A_a = W ⊙ F_a   (7)

where W_k is the attention weight of the k-th ABI feature as guided by the opcode feature, corresponding to the importance of each element of F_a, and A_a is the attention representation of the ABI feature generated with the participation of the opcode feature. The opcode feature vector F_o, the ABI sequence feature F_a, and the opcode-guided ABI attention feature A_a are concatenated along the row direction through a Concat layer to obtain the vector F fusing the three features:

F = Concat(F_o, F_a, A_a) = F_o ⊕ F_a ⊕ A_a   (8)
2.6 Classification Module

The fused feature F is passed through two fully connected layers, with a ReLU activation layer and a Dropout layer between them to prevent overfitting. The probability of each category is finally output through the Softmax function. The fraud detection studied in this paper is a binary classification problem. Due to the extreme imbalance of the smart contract fraud data, an improved loss function is used: the focal loss [23] proposed by Lin et al. The principle is to give a greater weight α to the minority class and, in order to mine hard samples, to apply a modulating factor with exponent γ to the predicted class probability, which reduces the loss contribution of easily classified samples:

L = -(1/N) Σ_{i=1}^{N} [ (1 − α)(1 − y_i) · p_i^γ · log(1 − p_i) + α · y_i · (1 − p_i)^γ · log p_i ]   (9)
where y_i is the label of sample i (1 for the positive class, 0 for the negative class), p_i is the predicted probability that sample i is positive, and N is the total number of samples.
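Equation (9) is small enough to state directly in code; this is a plain-Python sketch of the binary focal loss, not the authors' training code:

```python
import math

def focal_loss(ys, ps, alpha=8 / 9, gamma=2.0):
    """Binary focal loss of Eq. (9). alpha = 8/9 corresponds to the best
    (1-alpha):alpha = 1:8 weight ratio reported in Sect. 3.2."""
    total = 0.0
    for y, p in zip(ys, ps):
        if y == 1:   # fraud (minority) class, weighted by alpha
            total += -alpha * (1 - p) ** gamma * math.log(p)
        else:        # ordinary class; easy samples are down-weighted by p**gamma
            total += -(1 - alpha) * p ** gamma * math.log(1 - p)
    return total / len(ys)

# A confident correct prediction contributes almost nothing to the loss,
# while a confident mistake on a fraud sample is penalized heavily.
print(focal_loss([0], [0.01]), focal_loss([1], [0.01]))
```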
3 Experimental Evaluation

3.1 Data Set

This paper uses the Ponzi fraud data set of XBlock (http://xblock.pro/#/), an open-source blockchain data platform, which includes 3790 smart contracts. Based on the contract addresses, the corresponding contract bytecode and call sequences are crawled from Etherscan. After removing duplicate and unlabeled contracts, 3776 smart contracts remain, including 200 Ponzi scheme smart contracts and 3576 ordinary smart contracts.

3.2 Experiment

The parameters of this model are set as follows: the maximum text length of the embedding layer is 1000, and the dimension of each word vector is 256. The DCNN convolution for the opcode has two layers with kernel sizes [5, 5]; each layer has 256 convolution kernels with the same dimension as the word vector, and each max-pooling size is 5. The DCNN one-dimensional convolution for the ABI call sequence has four layers in total, with kernel sizes [2, 3, 3, 3]. The Bert layer has six encoder layers in total; the number of attention heads in each layer is 8, and the number of neurons in the hidden layer is 256. Gelu is used for nonlinear activation between every two fully connected layers. Dropout is used to prevent overfitting, with its value set to 0.3. The training parameters are set as
follows: the initial learning rate is 1e-5, the batch size is 32, the total number of training epochs is 120, and the focal loss parameters are set as (1 − α):α = 1:8 and γ = 2. For evaluation, 5-fold cross validation is used: the data set is randomly and evenly divided into five parts, and in each round four parts are used for training and one for testing (a 4:1 split). This process is repeated five times, so that each fold serves once as the test set, reducing variance and bias. The five test results are averaged, and the averages are listed in Table 1. The maximum value of each column is shown in bold.

Table 1. Comparison of different models

Model     Precision  Recall  F1-score
RF        0.95       0.69    0.79
XGBoost   0.87       0.76    0.81
SVM       0.94       0.23    0.37
BiLSTM    0.78       0.81    0.80
Bert      0.83       0.81    0.82
Resnet    0.92       0.33    0.49
MFFSI     0.88       0.83    0.86
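The 5-fold evaluation protocol described above can be sketched as follows; the seed and helper name are illustrative:

```python
import random

def five_fold_indices(n, seed=42):
    """Randomly and evenly split n samples into five folds; each fold is
    used once as the test set while the remaining four folds train."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    for k in range(5):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

sizes = []
for train, test in five_fold_indices(3776):  # data-set size from Sect. 3.1
    sizes.append(len(test))
print(sum(sizes))  # → 3776, every sample is tested exactly once across folds
```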
To study how the Bert model and the use of attention before feature fusion affect performance, three experiments were conducted: the first does not use attention before feature fusion; the second uses a BiLSTM model when extracting global information; the third is the full model with Bert. The experimental results are shown in Table 2.

Table 2. Attention vs. no attention

Method                Precision  Recall  F1-score
MFFSI (No Attention)  0.93       0.78    0.85
MFFSI (BiLSTM)        0.75       0.83    0.79
MFFSI                 0.88       0.83    0.86
When extracting global information, Bert is clearly better than BiLSTM: even after deep convolution the sequence length is 100, which is still long for BiLSTM, so its effect is mediocre. Second, the F1 when using attention before feature fusion is 1% higher than without it, so using attention is clearly better. This shows that attention can indeed weaken the influence of irrelevant features on the model during feature fusion. To examine the impact of the opcode sequence and the ABI call sequence on performance, experiments are run using only a single feature or all features. The comparison of experimental results is shown in Table 3.

Table 3. Comparison of different characteristics

Feature      Precision  Recall  F1-score
Just Opcode  0.91       0.81    0.85
Just ABI     0.77       0.75    0.76
All          0.88       0.83    0.86
As the table shows, when only one feature is used, the opcode achieves the best result; the opcode text feature is therefore the most important of all features. The ABI features include classification-irrelevant components, so opcode-guided attention is used to generate the ABI attention features and weaken the role of irrelevant features in fraud detection.

To study the validity of the focal loss function, the performance of the model under different α settings is shown in Table 4.

Table 4. Comparison of different α in the loss function

(1 − α) : α  Precision  Recall  F1-score
1:17         0.90       0.75    0.82
1:15         0.90       0.75    0.82
1:12         0.96       0.72    0.83
1:9          0.91       0.81    0.85
1:8          0.88       0.83    0.86
1:7          0.88       0.81    0.84
1:4          0.93       0.69    0.79
1:1          1.00       0.00    0.00
The ratio of fraud samples to ordinary samples is about 1:17. When the ratio of positive and negative loss weights is 1:1, the recall is 0: because of the extreme sample imbalance, the model cannot converge and classifies all samples as ordinary contracts. As the weight ratio approaches the inverse of the class frequencies, model performance first improves and then worsens. When the fraud loss weight is too small, the model tends to learn the characteristics of the majority ordinary contracts; when it is too large, the model tends to learn only the fraud contracts, preventing it from learning comprehensive characteristics. Therefore, there is an optimal fraud loss weight, and the optimal ordinary:fraud sample loss weight ratio is 1:8, where the best F1 of 86% is reached.

3.3 Honeypot Contract Detection

To show that this model also applies to other types of contract scams, we use it to detect another kind of Ethereum fraud: the honeypot contract. Such fraud makes speculative users who do not understand the contract well believe that it has exploitable loopholes through which they can obtain tokens. However, when users actually try to exploit the loophole for profit, they find that the vulnerability is disguised: instead of profiting, they lose money. We use the open-source honeypot contract data set [24], which contains 1161 honeypot contracts; after removing duplicates, 855 honeypot contracts form the fraud data set. For normal contracts, we selected 3413 non-honeypot ordinary contracts from Ethereum, and divided the data into training, validation and test sets at a ratio of 7:1:2. The data distribution is shown in Table 5.

Table 5. Data information of honeypot contracts

Data set           Training set  Validation set  Testing set
Normal contract    2376          351             686
Honeypot contract  611           76              168
After the data set is divided, the training set is used to train the model, the validation set is used to find the optimal parameters, and the test set is used to evaluate the generalization ability of the model. Table 6 shows the results of MFFSI on the final test set for honeypot contract detection.

Table 6. MFFSI identifies honeypot contracts

Feature      Precision  Recall  F1-score
Just Opcode  1.00       0.96    0.98
Just ABI     0.92       0.97    0.95
All          0.98       0.98    0.98
The F1 of the model fusing the two features reaches 98%. This model can thus be used not only to detect Ponzi fraud but also to detect honeypot contracts. In addition, compared with using only the opcode, fusing the two features increases the recall by about 2% at the same F1. To reduce the risk of missed detections, recall takes priority in this setting: at the same F1, the greater the recall, the better. Therefore, the model fusing two features is clearly superior to using only one feature.
4 Conclusion

The MFFSI model proposed in this paper extracts two types of modal data from smart contracts: one is the opcode sequence containing semantics and jump relationships; the other is the ABI function call sequence. A DCNN-Bert model is designed for each to extract its semantic features, and an attention mechanism is used to alleviate the interference of irrelevant features on classification. Finally, the three features are fused to detect fraud. The experimental results show that the proposed model can effectively detect smart contract Ponzi schemes, with an F1 score significantly improved over previous work. In addition, it is also applicable to the detection of other smart contract scams.
References

1. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system (2023). https://bitcoin.org/en/bitcoin-paper
2. Buterin, V.: A next-generation smart contract and decentralized application platform (2014)
3. Aitzhan, N.Z., Svetinovic, D.: Security and privacy in decentralized energy trading through multi-signatures, blockchain and anonymous messaging streams. IEEE Trans. Dependable Secure Comput. 15(5), 840–852 (2018)
4. Zheng, Z., Xie, S., Dai, H.N., Chen, X., Wang, H.: Blockchain challenges and opportunities: a survey. Int. J. Web Grid Serv. 14(4), 352 (2018)
5. Norta, A.: Creation of smart-contracting collaborations for decentralized autonomous organizations. In: Matulevičius, R., Dumas, M. (eds.) BIR 2015. LNBIP, vol. 229, pp. 3–17. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-21915-8_1
6. Vasek, M., Moore, T.: There's no free lunch, even using bitcoin: tracking the popularity and profits of virtual currency scams. In: Böhme, R., Okamoto, T. (eds.) FC 2015. LNCS, vol. 8975, pp. 44–61. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-47854-7_4
7. Bartoletti, M., Carta, S., Cimoli, T., Saia, R.: Dissecting Ponzi schemes on Ethereum: identification, analysis, and impact. arXiv preprint arXiv:1703.03779 (2017)
8. Torres, C.F., Steichen, M., State, R.: The art of the scam: demystifying honeypots in Ethereum smart contracts. arXiv preprint arXiv:1902.06976 (2019)
9. Tsikerdekis, M., Zeadally, S., Schlesener, A., Sklavos, N.: Approaches for preventing honeypot detection and compromise. In: 2018 Global Information Infrastructure and Networking Symposium (GIIS), pp. 1–6. IEEE, Thessaloniki (2018)
10. Toyoda, K., Ohtsuki, T., Mathiopoulos, P.T.: Multi-class bitcoin-enabled service identification based on transaction history summarization. In: 2018 IEEE International Conference on Blockchain, pp. 1153–1160. IEEE, Halifax (2018)
11.
Chen, W., Zheng, Z., Cui, J., Ngai, E., Zheng, P., Zhou, Y.: Detecting Ponzi schemes on Ethereum: towards healthier blockchain technology. In: Proceedings of the 2018 World Wide Web Conference, pp. 1409–1418. ACM (2018)
12. Chen, W., et al.: SADPonzi: detecting and characterizing Ponzi schemes in Ethereum smart contracts. In: ACM SIGMETRICS 2021, pp. 35–36. Association for Computing Machinery, New York (2021)
13. Fan, S., Fu, S., Xu, H., Cheng, X.: Al-SPSD: anti-leakage smart Ponzi schemes detection in blockchain. Inf. Process. Manage. 58(4), 102587–102599 (2021) 14. Huang, B., Liu, Q., He, Q., Guang, Z., Chen, J.: Towards automatic smart-contract codes classification by means of word embedding model and transaction information. Acta Automatica Sinica. 43(9), 1532–1543 (2017) 15. Chen, W., Zheng, Z., Ngai, E.C.-H., Zheng, P., Zhou, Y.: Exploiting blockchain data to detect smart Ponzi schemes on ethereum. IEEE Access 7, 37575–37586 (2019) 16. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017) 17. Wang, L., Cheng, H., Zheng, Z., Yang, A., Zhu, X.: Ponzi scheme detection via oversampling-based long short-term memory for smart contracts. Knowl.-Based Syst. 228, 107312 (2021) 18. Zhang, H., Qi, W., Dengyue, W.: Honeypot contract detection of blockchain based on deep learning. J. Commun. 24(1), 194–202 (2022) 19. Bian, L., Zhang, L., Zhao, K., Wang, H., Gong, S.: Image-based scam detection method using an attention capsule network. IEEE Access 9, 33654–33665 (2021) 20. Devlin, J., Chang, M., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) 21. Ethereum blockchain browser (2022). https://etherscan.io/ 22. Wood, G.: Ethereum: a secure decentralised generalised transaction ledger. Ethereum Project Yellow Paper 151, 1–32 (2014) 23. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 42(2), 318–327 (2020) 24. HoneyBadger Data Set (2019). https://github.com/christoftorres/HoneyBadger
An Adaptive Method for Generating the Traffic State Thresholds on Road Networks Jiacheng Wu, Wang Zhu, and Jianli Xiao(B) School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, 516 Jun Gong Road, Shanghai 200093, China [email protected]
Abstract. Traffic state identification on road networks is always a key issue in intelligent transportation systems. Most traffic state identification methods use fixed traffic state thresholds, set manually, to identify the traffic state of road networks. The accuracy of traffic state identification depends heavily on these thresholds: better thresholds can greatly improve identification accuracy, so it is very important to select reasonable ones. In this paper, an adaptive method for generating the traffic state thresholds on road networks is proposed, which can automatically generate reasonable traffic state thresholds and ensure traffic state identification accuracy. In the proposed method, a Hidden Markov Model is first constructed to predict the traffic state, and the distribution of the traffic state over speed is then fitted by a two-dimensional Gaussian distribution. Finally, the thresholds are generated according to the properties of the two-dimensional Gaussian distribution. The rationality of the generated thresholds is verified by experiments, and the traffic state identification accuracy obtained with them is also evaluated by experiments. All the results show that the proposed method can generate reasonable traffic state thresholds and ensure the traffic state identification accuracy of road networks. Keywords: Urban road networks · Traffic state prediction · Hidden Markov Model · Two-dimensional Gaussian distribution · Threshold generation
1 Introduction Intelligent transportation systems play an important role in ensuring the smooth operation of urban transportation. In intelligent transportation systems, the planning and management of urban traffic need to rely on the real-time traffic state [1]. Traffic state is a comprehensive index to evaluate road capacity, which is determined by traffic flow speed, travel time, road type and other factors. In the actual traffic management, managers usually divide the traffic state into three types: blocked, congested and unblocked. Traffic state can directly reflect the urban traffic situation, and also provide guidance for people’s travel [2]. Therefore, it is of great significance to accurately predict the state of urban traffic according to road data. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 15–26, 2023. https://doi.org/10.1007/978-981-99-4752-2_2
16
J. Wu et al.
Most existing traffic state prediction methods actually predict traffic speed [3]. Machine learning or deep learning methods, such as the Kalman filter [4], time series models [5] and neural network models [6], are used to fit the speed data and make predictions. After the predictive results are obtained, the traffic state is identified by manually setting a speed threshold [7]. Setting thresholds manually based on experience has many limitations. For example, thresholds vary greatly across time periods and road networks, yet in most cases they are never modified once set. The accuracy of traffic state identification depends heavily on these thresholds: better thresholds can greatly improve identification accuracy, so it is very important to select reasonable ones. Based on this situation, this paper proposes an adaptive threshold generation method for road networks. The method uses a Hidden Markov Model (HMM) to predict the traffic state, and then uses a two-dimensional Gaussian distribution model to fit the traffic state data. Traffic data is a kind of time series data, and the order of the data matters [8]. The traffic speed is observable, while the traffic state is usually a vague quantity that cannot be expressed directly in a mathematical way. Therefore, the traffic state prediction problem can be regarded as an HMM problem, where speed corresponds to the observed state and the traffic state corresponds to the hidden state. HMM is a model for sequence data analysis and prediction [9]. In the field of transportation, traffic data has certain similarities to financial data; if the problem is properly described, an HMM should achieve good results [10].
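The HMM formulation described above, with speed as the observation and traffic state as the hidden variable, can be decoded with a standard Viterbi pass; all probabilities below are invented for illustration:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden traffic-state sequence for observed speed bins."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # best previous state leading into s for this observation
            p, prev = max((V[-2][ps] * trans_p[ps][s] * emit_p[s][o], ps)
                          for ps in states)
            V[-1][s] = p
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

states = ("unblocked", "congested", "blocked")
start = {"unblocked": 0.5, "congested": 0.3, "blocked": 0.2}
trans = {s: {t: (0.8 if s == t else 0.1) for t in states} for s in states}
emit = {"unblocked": {"high": 0.7, "mid": 0.2, "low": 0.1},
        "congested": {"high": 0.2, "mid": 0.6, "low": 0.2},
        "blocked":   {"high": 0.1, "mid": 0.2, "low": 0.7}}
print(viterbi(["high", "high", "low", "low"], states, start, trans, emit))
# → ['unblocked', 'unblocked', 'blocked', 'blocked']
```

The sticky self-transitions (0.8) reflect that a road segment's state persists between time steps, so a single noisy speed reading does not flip the decoded state.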
After the HMM model for traffic state prediction is established, a series of state labels corresponding to speeds is obtained, and the correlation between them needs to be mined by statistical models. Generally speaking, the statistics of speed data conform to a Gaussian distribution; namely, there is relatively little data in the low- and high-speed segments and more data in the middle-speed segment. For the three states of unblocked, congested and blocked, the traffic state can be represented by using speed as its symbol. In this way, the statistical distribution of traffic state data can be obtained, and a two-dimensional Gaussian distribution model can be used for data fitting. Furthermore, after selecting the appropriate statistical model, it is important to generate thresholds according to its statistical properties and to verify their accuracy and rationality.
2 Methodological Framework and the Establishment of Models 2.1 Methodological Framework This paper is devoted to generating traffic state thresholds automatically, which can be used to identify the traffic state of road networks. The framework of the proposed method is shown in Fig. 1. The method mainly consists of two parts; the first part is the state prediction part. According to the road speed data, the HMM model is established to predict the traffic state, and the predictive results of the traffic state are obtained. The second part is the threshold generation part. According to the results of the HMM model, the upper and lower limits of the speed in each state are obtained, and their 3D histogram statistics can be drawn. Then, the two-dimensional
An Adaptive Method for Generating the Traffic State Thresholds
distribution model is used to fit the results, and the threshold generation process is carried out according to the statistical properties of the model.
[Fig. 1 pipeline: road speed data → HMM model → predictive results of traffic state → upper and lower limits of the speed in each state → 3D statistical histograms → two-dimensional Gaussian distribution fitting.]
Fig. 1. The framework of the proposed method.
2.2 HMM Model for Traffic State Prediction A basic HMM model can be expressed as:

λ = (A, B, Π)  (1)

where A is the hidden state transition probability matrix, B is the observed state generation probability matrix and Π is the hidden state initial probability matrix. The premise of establishing the HMM model according to this basic form is to determine the hidden state and observed state of the problem. For the traffic state prediction problem, the hidden state is the traffic state, denoted as qi, and the observed state is the speed, denoted as vk. Therefore, the hidden state transition probability matrix A can be expressed as:

A = [aij]N×N  (2)

where aij = P(it+1 = qj | it = qi); it represents the probability of moving from state qi at time t to state qj at time t + 1, that is, the change of the traffic state over time. The observed state generation probability matrix B can be expressed as:

B = [bj(k)]N×M  (3)

where bj(k) = P(ot = vk | it = qj); it represents the probability of generating the observed state vk from the hidden state qj, that is, the speed corresponding to the traffic state. The hidden state initial probability matrix takes the form:

Π = [π(i)]N  (4)

where π(i) = P(i1 = qi); it represents the distribution of the traffic state at time t = 1.
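For concreteness, the three parameter matrices of λ = (A, B, Π) can be written out for a toy instance with N = 3 hidden traffic states and M = 4 discretized speed bins; all probability values below are hypothetical placeholders, not parameters estimated from data:

```python
import numpy as np

# Hypothetical parameters for N = 3 hidden traffic states
# (q1 = blocked, q2 = congested, q3 = unblocked) and M = 4 speed bins.
A = np.array([[0.7, 0.2, 0.1],       # hidden state transition matrix, N x N
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
B = np.array([[0.6, 0.3, 0.1, 0.0],  # observation (speed-bin) matrix, N x M
              [0.2, 0.4, 0.3, 0.1],
              [0.0, 0.1, 0.3, 0.6]])
Pi = np.array([0.3, 0.4, 0.3])       # initial state distribution, length N

# Every row of A and B, and Pi itself, must sum to one.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(Pi.sum(), 1.0)
```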
The HMM model established according to these three parameter matrices is shown in Fig. 2. Three hidden states are set: q1 corresponds to the red blocked state, q2 corresponds to the yellow congested state, and q3 corresponds to the green unblocked state. Speed is the observation sequence, and a different speed at each moment corresponds to a different observation vk. aij and bj(k) represent the state transition probability and the probability of generating the speed observation from the state, respectively.
Fig. 2. The structure of the HMM model.
After establishing the HMM model, it is necessary to determine the parameters of the model before using it to solve the state prediction problem. For traffic data, only the observation sequence, i.e. the speed data, is known, while the hidden states, i.e. the traffic state data, are unknown. This is an unsupervised learning problem, so the Baum-Welch algorithm is needed to solve for the parameters. The principle of the algorithm is similar to the EM algorithm [11]: in step E, the expectation of log P(O, I | λ) under the conditional probability P(I | O, λ̄) is calculated first, and the expression is as follows:

L(λ, λ̄) = Σ_I P(I | O, λ̄) log P(O, I | λ)  (5)

where I is the hidden state sequence, O is the observed state sequence and λ̄ is the currently updated model parameter. Then, this expectation is maximized in step M, and the updated model parameters are obtained as follows:

λ = arg max_λ Σ_I P(I | O, λ̄) log P(O, I | λ)  (6)

This process is repeated until the parameter λ converges. After the parameters of the HMM model are obtained, the hidden state sequence can be solved with the Viterbi algorithm [12]. The Viterbi algorithm is a dynamic programming algorithm which defines two local states. The first local state is the maximum probability over all state transition paths (i1, i2, ..., it) that end at time t in hidden state i, denoted as δt(i), with the form:

δt(i) = max_{i1,i2,...,it−1} P(it = i, it−1, it−2, ..., i1, ot, ot−1, ..., o1 | λ), i = 1, 2, ..., N  (7)
The recursive expression of δ can be obtained from the definition of δt(i):

δt+1(i) = max_{i1,i2,...,it} P(it+1 = i, it, it−1, ..., i1, ot+1, ot, ..., o1 | λ) = [max_{1≤j≤N} δt(j) aji] bi(ot+1)  (8)

The second local state is obtained recursively from the first local state. Suppose that among all state transition paths with hidden state i at time t, the hidden state of node t − 1 on the most probable path is ψt(i); its recursive expression can be written as:

ψt(i) = arg max_{1≤j≤N} [δt−1(j) aji]  (9)
With these two local states, we can recurse from the initial time and then backtrack using the most likely previous state nodes recorded by ψt(i) until the optimal hidden state sequence is found; the predicted traffic states are thereby obtained. According to the results of the HMM model, each state has a different speed interval, and the size of the interval is determined by the lower and upper limits of the speed. Therefore, for each traffic state, the lower and upper limits of the speed interval can be used as a two-dimensional variable to characterize the state; a schematic diagram is shown in Fig. 3.
Fig. 3. Traffic states characterized by the upper and lower limits of speed.
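The Viterbi recursion and backtracking described above can be sketched in a few lines of numpy; the transition, observation, and initial probabilities used in the demo are hypothetical placeholders, not parameters learned by Baum-Welch:

```python
import numpy as np

def viterbi(A, B, Pi, obs):
    """Most likely hidden-state sequence of a discrete-observation HMM,
    following the recursions for delta and psi in Eqs. (7)-(9)."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))           # delta[t, i]: best path probability ending in state i
    psi = np.zeros((T, N), dtype=int)  # psi[t, i]: best predecessor of state i at step t
    delta[0] = Pi * B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A      # trans[j, i] = delta[t-1, j] * a_ji
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    path = np.empty(T, dtype=int)              # backtrack from the most probable final state
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

# Hypothetical parameters: states 0 = blocked, 1 = congested, 2 = unblocked;
# observations 0..3 are speed bins from slow to fast.
A = np.array([[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]])
B = np.array([[0.6, 0.3, 0.1, 0.0], [0.2, 0.4, 0.3, 0.1], [0.0, 0.1, 0.3, 0.6]])
Pi = np.array([0.3, 0.4, 0.3])
print(viterbi(A, B, Pi, [0, 0, 1, 3, 3]).tolist())  # → [0, 0, 0, 2, 2]
```

With these toy parameters, a run of slow observations decodes to the blocked state and a run of fast observations to the unblocked state, which is the behavior the paper relies on.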
2.3 Threshold Generation Based on Two-Dimensional Gaussian Distribution The generation of urban traffic thresholds needs to be based on the state prediction results of multiple roads in the road network, and the statistical model should be established according to the relationship between the state and its symbols. Assume that the statistical model here is the two-dimensional Gaussian distribution; the joint probability density function of the Gaussian distribution for the two-dimensional variable X = (x1, x2) can be expressed as:

f(X) = 1 / (2π |C|^{1/2}) · exp(−(1/2) (X − µ)^T C^{−1} (X − µ))  (10)
where µ = (μ1, μ2) is the mean vector and C is the covariance matrix, which describes the degree of correlation between the variables of each dimension. The establishment of the two-dimensional Gaussian distribution depends on the mean vector and covariance matrix. For the ith traffic state, suppose that the lower limit vector of all n roads' speed data is v_l^i = (v_{l1}^i, v_{l2}^i, ..., v_{ln}^i) and the upper limit vector is v_m^i = (v_{m1}^i, v_{m2}^i, ..., v_{mn}^i); then the mean vector can be expressed as:

μ_i = (v̄_l^i, v̄_m^i)  (11)

where the overbar denotes the mean over the n roads. The covariance matrix can be expressed as:

C_i = [ cov(v_l^i, v_l^i)   cov(v_l^i, v_m^i) ]
      [ cov(v_m^i, v_l^i)   cov(v_m^i, v_m^i) ]  (12)
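Estimating the model for one state then reduces to computing µ_i and C_i from the n roads' (lower, upper) limit pairs and evaluating Eq. (10); a minimal numpy sketch, using for concreteness the six congested-state limit pairs later listed in Table 1:

```python
import numpy as np

# Lower and upper speed limits of the congested state on six roads (Table 1).
v_l = np.array([25.1, 19.4, 26.1, 10.3, 19.3, 26.2])
v_m = np.array([41.9, 29.6, 41.5, 29.0, 36.3, 40.5])

mu = np.array([v_l.mean(), v_m.mean()])  # mean vector, Eq. (11)
C = np.cov(np.vstack([v_l, v_m]))        # 2 x 2 covariance matrix, Eq. (12)

def gaussian2d_pdf(x, mu, C):
    """Joint density of the two-dimensional Gaussian, Eq. (10)."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(C) @ d) / (2 * np.pi * np.sqrt(np.linalg.det(C)))

# The density peaks at the mean vector and decays away from it.
assert gaussian2d_pdf(mu, mu, C) > gaussian2d_pdf(mu + np.array([30.0, 30.0]), mu, C)
```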
If the two-dimensional Gaussian distribution model established can fit the distribution of data, then it can be proved that the relationship between traffic state and its symbols conforms to the statistical model of two-dimensional Gaussian distribution. Therefore, the adaptive threshold generation process can be carried out according to the characteristics of the statistical model. The mean value of the two-dimensional Gaussian distribution represents the center of each dimensional variable, that is, the mean value of the upper and lower limits of speed. If the traffic state data conforms to the two-dimensional Gaussian distribution, the overall level can be represented by the mean vector, that is, the speed range corresponding to the traffic state. For the three types of states, their mean vectors are as follows:
μ_1 = (v̄_l^1, v̄_m^1),  μ_2 = (v̄_l^2, v̄_m^2),  μ_3 = (v̄_l^3, v̄_m^3)  (13)

where superscript 1 represents the blocked state, 2 represents the congested state, and 3 represents the unblocked state. The two thresholds that need to be generated can be obtained from the mean of v̄_m^1 and v̄_l^2 as well as the mean of v̄_m^2 and v̄_l^3, which are as follows:

T_1 = (v̄_m^1 + v̄_l^2) / 2,  T_2 = (v̄_m^2 + v̄_l^3) / 2  (14)
3 Experiments and Results Analysis 3.1 Experiments on Traffic State Prediction of the HMM Model The raw GPS data from the Shanghai Traffic Information Center is used for experiments, with a data sampling period of two minutes. The raw GPS records include time, road section number, speed, etc. The speed data from 0:00 to 24:00 on some roads throughout
a day is used to construct the HMM model. In the experimental part of the HMM model, three traffic states are selected, representing the three states of unblocked, congested and blocked. In order to verify and compare the experimental results of roads in different regions, ChangPing Road in Jing’an District, HengShan Road in XuHui District, AiHui Road in BaoShan District and JinShaJiang Road in PuTuo District were selected as experimental roads by taking administrative districts as boundaries. The speed data of these roads in one day is input into the HMM model as the observation sequence, and the model can obtain the hidden state sequence corresponding to the observation sequence through parameter learning and state prediction.
Fig. 4. Predictive results of HMM model.
The hidden state sequences obtained by the HMM model are collections of labels; after visualization, the results are shown in Fig. 4. For each road, the upper figure shows the change of speed over the course of a day, and the lower figure shows the corresponding HMM model predictions. As can be seen from the figures, although the distribution and trend of speed values on different roads vary, the predictive results of the model have something in common. For the predictive results of the four roads, the yellow points, which reflect the congested state, are densely distributed and concentrated in the middle speed segment, while the red points (blocked state) and the green points (unblocked state) are sparsely distributed, and the corresponding speed interval spans are large. This result accords with the characteristics of a Gaussian distribution. In addition, the distinction between states is relatively obvious, and the traffic states can be divided by a reasonable choice of speed values. The experimental results on local roads verify the feasibility of determining the traffic state from the traffic speed. Next, the scope needs to be extended to the whole city to determine the traffic thresholds of urban road networks. Therefore, it is necessary to select more urban roads reasonably, expand the data set, and conduct statistics and analysis. In this paper, it is assumed that the results obtained by the HMM model are accurate, so that the subsequent experiments and results analysis can be carried out.
3.2 Experiments on Two-Dimensional Gaussian Distribution Fitting of Traffic State The selection of road data should be able to reflect the overall traffic situation of the city, and the principle is to select the major roads such that the whole city is evenly covered. After data screening and preprocessing, a total of 58 major urban roads are selected, and the corresponding traffic speed data of one day is extracted. After using the HMM model for state prediction of all roads, the traffic states and their corresponding two-dimensional symbols can be obtained. According to the predictive results of the HMM model, a total of 58 sets of road data can be obtained by taking the lower and upper limits of speed in the three states as road attributes. Part of the results is shown in Table 1. In the table, each road has three states; in each state, the top value is the upper limit and the bottom value is the lower limit. The upper and lower limits can be regarded as the two-dimensional symbols of each state. In order to evaluate the overall state distribution of all roads, a 3D histogram is used to obtain the distribution range and count of upper and lower limits in different speed ranges. The results are shown in Fig. 5.

Table 1. Traffic state of several roads and the corresponding upper and lower limits of speed.

                         AiHui  BaoChang  ChangPing  ChangYang  HengShan  JinShaJiang  …
Blocked state    upper   29.7   25        34.1       17.3       25.9      30.8         …
                 lower   9.8    9.1       10.6       3.9        10.8      12.5         …
Congested state  upper   41.9   29.6      41.5       29         36.3      40.5         …
                 lower   25.1   19.4      26.1       10.3       19.3      26.2         …
Unblocked state  upper   61     41.2      47.5       48.9       58.8      55.1         …
                 lower   38     29        37.3       22.8       23.2      36.9         …
Fig. 5. 3D histogram statistics of traffic state.
It can be seen that in Fig. 5(a), the lower limit range of the speed in the blocked state is 0–25, and the upper limit range is 10–45, with the greatest number of distribution concentrated in the part where the lower limit is 15 and the upper limit is 30. In Fig. 5(b), the lower limit range of speed in the congested state is 10–40, and the upper limit range
is 10–60, with the greatest number of distribution concentrated in the part where the lower limit is 25 and the upper limit is 40. In Fig. 5(c), the lower limit range of speed in the unblocked state is 10–50, and the upper limit range is 20–80, with the greatest number of distribution concentrated in the part where the lower limit is 35 and the upper limit is 60. The 3D histograms of the three states basically show the characteristic of being high in the central area and low on the sides. The ranges of upper and lower limits are large, and the greatest number of distribution is concentrated in the central part. Based on such statistical characteristics, the two-dimensional Gaussian distribution is considered to fit the data. According to the road data, the mean vectors and covariance matrices of the three states are calculated, and the corresponding two-dimensional Gaussian distribution surfaces are drawn in Fig. 6. In Fig. 6(a)–(c), the left figures show the fitting effect of the two-dimensional Gaussian distributions on the 3D histogram statistics, and the right figures show their projections on the bottom planes. It can be seen that the two-dimensional Gaussian distribution surface basically covers the area of the 3D histogram, and the shapes are basically the same. Moreover, the center position of the bottom projection also coincides with the position of the largest number of histogram bins. The results show that the two-dimensional Gaussian distribution surface fits the 3D histogram well, and the traffic state data does conform to the two-dimensional Gaussian distribution. Therefore, the statistical model of the two-dimensional Gaussian distribution can be used to generate the thresholds.
Fig. 6. Two-dimensional Gaussian distribution fitting of traffic state.
3.3 Threshold Adaptive Generation and Result Analysis Threshold Adaptive Generation. After determining that the data conforms to the two-dimensional Gaussian distribution, the properties of the two-dimensional Gaussian distribution can be used for threshold generation. The mean value of the two-dimensional Gaussian distribution represents the center of each dimensional variable; in the projection figure, it corresponds to the position of the yellow center point. The two coordinate axes of the projection plane represent the upper limit and lower limit of the speed respectively,
so the mean vector composed of them reflects the range of speed corresponding to the state. The mean vectors of the two-dimensional Gaussian distributions for the three states can be calculated from the road data, and the results are shown in Table 2.

Table 2. Means of upper and lower limits of speed for three states.

                           Blocked state  Congested state  Unblocked state
Mean value of upper limit  29.4448        38.5466          55.7259
Mean value of lower limit  11.2845        24.0448          32.1759
It can be seen that the mean value of the upper limit for the blocked state is greater than the mean value of the lower limit for the congested state, and the mean value of the upper limit for the congested state is greater than the mean value of the lower limit for the unblocked state. The speed ranges corresponding to the three states overlap. However, in actual traffic scenarios, traffic thresholds must be definite values to identify traffic states. Therefore, in order to obtain the thresholds for identifying the three states of unblocked, congested and blocked, an average approximation method is used. We first average the mean value of the lower limit for the congested state and the mean value of the upper limit for the blocked state, then average the mean value of the lower limit for the unblocked state and the mean value of the upper limit for the congested state. Thus, two thresholds to identify the three states are obtained, namely 26.74 and 35.36 respectively. The process of threshold adaptive generation is shown in Fig. 7.
Fig. 7. The process of threshold adaptive generation.
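As a quick numerical check, the averaging just described reproduces the reported thresholds from the Table 2 means:

```python
# Means of the state-wise speed limits taken from Table 2.
upper = {"blocked": 29.4448, "congested": 38.5466, "unblocked": 55.7259}
lower = {"blocked": 11.2845, "congested": 24.0448, "unblocked": 32.1759}

# Each threshold averages the adjacent states' overlapping limit means.
t1 = (upper["blocked"] + lower["congested"]) / 2    # blocked / congested boundary
t2 = (upper["congested"] + lower["unblocked"]) / 2  # congested / unblocked boundary
print(round(t1, 2), round(t2, 2))  # → 26.74 35.36
```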
Threshold Accuracy Analysis. To check the accuracy of the thresholds for road state identification, the generated traffic state thresholds are applied to the one-day speed data of the experimental roads. The comparison with the results of the HMM model is shown in Fig. 8. It can be seen that the state identification results of the thresholds are basically consistent with the results of the HMM
model, indicating that the threshold values can accurately divide the state of most roads. Of course, for some roads with high traffic speed or slow traffic speed, the results of state identification will be biased, because the thresholds reflect an average situation for most urban roads.
Fig. 8. Comparison of road state identification results.
4 Conclusions In this paper, an adaptive method for generating the traffic state thresholds on road networks is proposed, which can automatically generate reasonable traffic state thresholds and ensure the traffic state identification accuracy. The proposed method can generate the thresholds adaptively from the road speed data, which can avoid the problem that the thresholds set manually are not accurate and need to be replaced frequently. In the result analysis part of the thresholds, the accuracy of the thresholds is verified by comparing the state identification results of the thresholds with the predictive results of the HMM model. Acknowledgments. This work is supported by China NSFC Program under Grant NO. 61603257 and 61906121.
References 1. Cui, H., Meng, Q., Teng, T.H., et al.: Spatiotemporal correlation modelling for machine learning-based traffic state predictions: state-of-the-art and beyond. Transp. Rev. 2023, 1–25 (2023) 2. Lujak, M., Giordani, S., Ossowski, S.: Route guidance: bridging system and user optimization in traffic assignment. Neurocomputing 151, 449–460 (2015)
3. Yin, R.R., Yuan, H.L., Wang, J., et al.: Modeling and analyzing cascading dynamics of the urban road traffic network. Physica A 566, 125600 (2020) 4. Okutani, I., Stephanedes, Y.J.: Dynamic prediction of traffic volume through Kalman filtering theory. Transp. Res. Part B Methodological. 18(1), 1–11 (1984) 5. Shekhar, S., Williams, B.M.: Adaptive seasonal time series models for forecasting short-term traffic flow. Transp. Res. Rec. 2024(1), 116–125 (2007) 6. He, S., Cheng, H., Song, G.-j, Xie, K.-q, Sun, Y.-z: Real-time short-term traffic flow forecasting based on process neural network. In: Sun, F., Zhang, J., Tan, Y., Cao, J., Wen, Y. (eds.) Advances in Neural Networks - ISNN 2008, pp. 560–569. Springer Berlin Heidelberg, Berlin, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87734-9_64 7. Vlahogianni, E.I., Karlaftis, M.G., Golias, J.C.: Optimized and meta-optimized neural networks for short-term traffic flow prediction: a genetic approach. Transp. Res. Part C: Emerg. Technol. 13(3), 211–234 (2005) 8. Li, Y.N., Xiao, J.L.: Traffic peak period detection using traffic index cloud maps. Physica A 553, 124277 (2020) 9. Zhu, G.Y., Song, K., Zhang, P., et al.: A traffic flow state transition model for urban road network based on hidden Markov model. Neurocomputing 214, 567–574 (2016) 10. Jin, Z.Q., Chen, Y.Y., Li, C., et al.: Trip destination prediction based on hidden Markov model for multi-day global positioning system travel surveys. Transp. Res. Rec. 2677(2), 577–587 (2023) 11. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. 39(1), 1–22 (1977) 12. Schuster-Böckler, B., Bateman, A.: An introduction to hidden Markov models. Curr. Protoc. Bioinform. 18(1), A-3A (2007)
RNL: A Robust and Highly-Efficient Model for Time-Aware Web Service QoS Prediction Jiajia Mi and Hao Wu(B) School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China [email protected]
Abstract. Accurately and efficiently predicting unknown time-varying quality-of-service (QoS) data based on observed data is essential in time-aware Web service selection. A nonnegative latent factorization of tensors (NLFT) model is one of the most attractive approaches to tackle this issue owing to its excellent scalability. Current NLFT models generally adopt the L2-norm loss to measure the discrepancy between the predicted time-varying QoS values and observed ones. However, the L2-norm is sensitive to the outlier data that often exist in observed QoS data, which may significantly impair prediction accuracy. Moreover, a gradient descent-based optimization method is frequently used to build a NLFT-based QoS predictor, yet it may suffer from low computational efficiency. To address these issues, this work proposes a Robust Nonnegative Latent factorization of tensors (RNL) model. It adopts two-fold ideas: a) adopting Cauchy loss to build the objective function for achieving strong robustness and high prediction accuracy; b) applying the alternating direction method of multipliers (ADMM) principle to design the parameter learning scheme for obtaining high computational efficiency. Experimental results on two time-varying QoS datasets demonstrate that the proposed RNL is superior to state-of-the-art models in both prediction accuracy and computational efficiency when used to predict unknown QoS. Keywords: Nonnegative Latent Factorization of Tensors · Cauchy Loss · ADMM · Time-varying QoS Prediction
1 Introduction As the Internet infrastructure is upgraded iteratively, a huge set of Web services with similar functions are supplied by various service providers [1]. Thus, how to select the best Web service from a large pool of candidates has become a popular topic [2, 3]. Quality-of-service (QoS) is a non-functional characteristic, e.g., capacity and price, which measures the performance of a Web service [4]. In general, user-viewed QoS data, e.g., throughput and response time, indicates the service performance experienced by users, which can help users select Web services wisely. Nevertheless, due to practical restrictions like time costs and invoking expenses, such data is typically high-dimensional and incomplete (HDI) [4, 5]. Consequently, generating predictions for unobserved QoS data based on historical ones becomes essential and indispensable. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 27–39, 2023. https://doi.org/10.1007/978-981-99-4752-2_3
J. Mi and H. Wu
Up to now, various researchers have proposed a wide range of solutions to the above issues. For example, Jia et al. [5] present a deep collaborative filtering-based QoS prediction method that incorporates the global and local location information of services and users into the interaction layer of a multilayer perceptron. Yang et al. [6] propose a factorization machine-based QoS predictor, which fuses the neighbor QoS information of services and users. Although the aforementioned QoS predictors are effective, they are defined on static data and do not take into account the dynamic variations of time-varying QoS data. In order to model the temporal dynamics of QoS data, researchers have proposed some representative models [7–10]. For example, Tong et al. [3] propose a time-aware collaborative filtering model that introduces a similarity computation method to predict dynamic QoS data. Wu et al. [11] design various regularized NLFT models for predicting QoS data by adopting different regularization methods. Zou et al. [12] utilize feature integration to realize time-aware service QoS prediction based on deep learning. Luo et al. [13] propose a NLFT-based time-varying QoS prediction model with fused linear biases. Among the multitude of superior QoS prediction methods mentioned above, NLFT-based ones are increasingly popular owing to their excellent scalability and accuracy [14, 15]. Concretely, a NLFT-based QoS predictor adopts a HDI "user × service × time" tensor to describe the observed time-varying QoS data, where the spatiotemporal pattern of QoS data can be well modeled. As a result, its objective function and parameter learning scheme are defined only on the known data of the HDI tensor while ignoring unobserved ones, which is significantly different from traditional tensor decomposition models defined on a full tensor without missing elements.
As a result, in order to build an accurate NLFT-based time-varying QoS prediction model, it is vital to design its objective function and parameter learning scheme [16, 17]. Currently, a NLFT model mostly adopts the L2-norm to build an objective function that achieves accurate and steady prediction for the missing data of an HDI tensor. However, the robustness of a NLFT model cannot be guaranteed if an HDI tensor is mixed with outlier data. Unfortunately, due to the limitations of the network environment and other factors, the time-varying QoS data generated by requesting real Web services usually contains outliers [18, 19]. According to [20], although a L1-norm-based learning model has intrinsic robustness to outliers, its solution is unstable and it is more difficult to optimize. Moreover, a gradient descent-based optimization method is frequently adopted by a NLFT model to design the parameter learning scheme, yet it can lead to low computational efficiency owing to its inherent nature [8, 9]. As shown in prior research [15], compared with both the L2-norm loss and the L1-norm loss, the Cauchy loss is more robust to outliers. Theoretically, Cauchy loss gives incorrect results only if nearly half of the observations are outliers, and it is rare that almost half of the observed QoS values are outliers. Moreover, an alternating direction method of multipliers (ADMM) is frequently used to construct learning models with constraints. An ADMM-based parameter learning scheme can achieve a high convergence rate if the optimization task is properly designed; nevertheless, it is mostly studied under complete data [21–23].
Motivated by the above discussions, the following research question is proposed: Research Question. Is it possible to adopt Cauchy loss to build the objective function and incorporate the principle of ADMM into a NLFT model to obtain a robust and highly efficient time-varying QoS prediction model? To answer it, this paper presents a Robust Nonnegative Latent factorization of tensors (RNL) model. Its main idea is to build a robust learning objective by employing Cauchy loss and to design a highly efficient learning scheme under the ADMM framework, thereby obtaining a time-varying QoS predictor with high accuracy and efficiency. In summary, the main contributions of this paper include: 1) Proposing a RNL model with high accuracy and strong robustness in predicting missing time-varying QoS data. 2) Designing a highly efficient parameter learning scheme according to the principle of ADMM. Experimental results on two real-world time-varying QoS datasets show that, compared with state-of-the-art QoS prediction methods, RNL gains high prediction accuracy and computational efficiency for the missing data of a QoS tensor. The rest of the paper is organized as follows: Sect. 2 introduces the preliminaries, Sect. 3 proposes our method, Sect. 4 depicts our experimental results in detail, and the last section concludes this paper.
2 Preliminary 2.1 High-Dimensional and Incomplete QoS Tensor In this study, when performing time-aware QoS prediction, a HDI "user × service × time" QoS tensor is used as the input data source, where the set of observed QoS entries is denoted as Λ and the set of unknown ones is denoted as Γ. Naturally, as depicted in Fig. 1, since time-varying QoS data is collected from real scenes, it is usually non-negative. Hence, the corresponding HDI tensor is also filled with nonnegative data; it is defined as follows: Definition 1 (HDI User × Service × Time Tensor): Given a set of users I, a set of services J, and a set of time slots K, a QoS tensor Y of size |I| × |J| × |K| denotes a "user × service × time" tensor, where each element y_ijk represents the QoS scored by user i ∈ I on service j ∈ J at time slot k ∈ K, and Y is a HDI QoS tensor if |Λ| << |Γ|.
2.2 Nonnegative Latent Factorization of Tensors In this study, the tensor Canonical/Polyadic decomposition (CPD) is used to perform the NLFT on a HDI tensor Y. In terms of CPD, we decompose tensor Y into R rank-one tensors A1, …, AR, in which R is the rank of the approximate tensor Ŷ (of size |I| × |J| × |K|) to Y.
Definition 2 (Rank-One Tensor): Ar denotes a rank-one tensor which could be calculated by the outer product of three latent feature (LF) vectors ur , sr , and t r as Ar = ur ° sr ° t r .
The LF matrices U (|I| × R), S (|J| × R), and T (|K| × R) are made of the R LF vectors ur, sr, and tr with lengths |I|, |J|, and |K|. Accordingly, the element-wise expression of Ar can be obtained as:

a_ijk^r = u_ir s_jr t_kr.  (1)

With the R rank-one tensors, i.e. {Ar | r ∈ {1, 2, …, R}}, we obtain the rank-R approximation Ŷ of Y, each element of which is formulated as:

ŷ_ijk = Σ_{r=1}^{R} a_ijk^r = Σ_{r=1}^{R} u_ir s_jr t_kr.  (2)
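The rank-R reconstruction in Eq. (2) is simply a sum over r of elementwise products of latent-feature entries; a small numpy sketch with random nonnegative factors (the sizes and rank are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 3, 2            # illustrative tensor sizes and rank
U = rng.random((I, R))             # user latent-feature matrix
S = rng.random((J, R))             # service latent-feature matrix
T = rng.random((K, R))             # time latent-feature matrix

# Eq. (2): Y_hat[i, j, k] = sum_r U[i, r] * S[j, r] * T[k, r]
Y_hat = np.einsum('ir,jr,kr->ijk', U, S, T)

# Spot-check one element against the explicit sum of Eqs. (1)-(2).
i, j, k = 1, 2, 0
assert np.isclose(Y_hat[i, j, k], sum(U[i, r] * S[j, r] * T[k, r] for r in range(R)))
assert (Y_hat >= 0).all()          # nonnegative factors yield nonnegative predictions
```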
To obtain the required LF matrices U, S, and T, it is common to use the L2-norm to measure the discrepancy between Ŷ and Y. Since only a few elements of Y are known, an NLFT model's objective function is defined on Λ only, according to the principle of density-oriented modeling [13]. Further, non-negativity constraints on U, S, and T need to be introduced to reflect their real meaning (e.g., QoS data are non-negative). Thus, the L2-norm-based objective function is given as

  ε = Σ_{y_ijk∈Λ} (y_ijk − ŷ_ijk)² = Σ_{y_ijk∈Λ} (y_ijk − Σ_{r=1}^{R} u_ir s_jr t_kr)²,   (3)

  s.t. ∀i ∈ I, ∀j ∈ J, ∀k ∈ K, r ∈ {1, …, R}: u_ir ≥ 0, s_jr ≥ 0, t_kr ≥ 0.

By adopting an appropriate learning scheme (e.g., non-negative multiplicative updating) to minimize (3) with respect to U, S, and T [11–13], an NLFT model can be obtained.
Fig. 1. An example of HDI QoS tensor for response time.
RNL: A Robust and Highly-Efficient Model

3 Our Method

3.1 Objective Function

As shown in prior research [10], the Cauchy loss is more resistant to outliers than both the L1-norm and L2-norm losses. To make our method more robust to outlying time-varying QoS data, we adopt the Cauchy loss as the measure of the difference between the predicted QoS values and the observed ones. Therefore, the objective function constructed with the Cauchy loss can be formulated as

  ε = Σ_{y_ijk∈Λ} ln( 1 + (y_ijk − Σ_{r=1}^{R} u_ir s_jr t_kr)² / γ² ),   (4)

  s.t. ∀i ∈ I, ∀j ∈ J, ∀k ∈ K, r ∈ {1, 2, …, R}: u_ir ≥ 0, s_jr ≥ 0, t_kr ≥ 0,

where γ is a constant. According to [11, 13], the stability of a learning model can be boosted by introducing linear biases into the model. Considering an HDI QoS tensor Y^{|I|×|J|×|K|}, three linear bias vectors a, b, and c of lengths |I|, |J|, and |K| can be used to represent the linear biases [21–23]. Thus, by incorporating the bias vectors into ε, we have:

  ε = Σ_{y_ijk∈Λ} ln( 1 + (y_ijk − Σ_{r=1}^{R} u_ir s_jr t_kr − a_i − b_j − c_k)² / γ² ),   (5)

  s.t. ∀i ∈ I, ∀j ∈ J, ∀k ∈ K, r ∈ {1, 2, …, R}: u_ir ≥ 0, s_jr ≥ 0, t_kr ≥ 0, a_i ≥ 0, b_j ≥ 0, c_k ≥ 0.

However, (5) makes U, S, T, a, b, and c act as model decision parameters and output parameters simultaneously, so the learning process itself is subject to the non-negativity constraints. Thus, auxiliary variables Ũ, S̃, T̃, ã, b̃, and c̃ are introduced into (5) as decision parameters, splitting the non-negativity constraints from the learning process, and (5) is reformulated as:

  ε = Σ_{y_ijk∈Λ} ln( 1 + (y_ijk − Σ_{r=1}^{R} ũ_ir s̃_jr t̃_kr − ã_i − b̃_j − c̃_k)² / γ² ),   (6)

  s.t. ∀i ∈ I, ∀j ∈ J, ∀k ∈ K, r ∈ {1, 2, …, R}:
    u_ir = ũ_ir, s_jr = s̃_jr, t_kr = t̃_kr, a_i = ã_i, b_j = b̃_j, c_k = c̃_k,
    u_ir ≥ 0, s_jr ≥ 0, t_kr ≥ 0, a_i ≥ 0, b_j ≥ 0, c_k ≥ 0,

where ũ_ir, s̃_jr, t̃_kr, ã_i, b̃_j, and c̃_k respectively denote elements of the auxiliary variables Ũ, S̃, T̃, ã, b̃, and c̃. Note that (6) keeps the model output parameters (i.e., U, S, T, a, b, and c) non-negative, while the optimization of ε with respect to the auxiliary variables Ũ, S̃, T̃, ã, b̃, and c̃ is unconstrained. The decision parameters and output parameters are connected via the equality constraints. According to the principle of ADMM [21–23], the augmented Lagrangian of (6) can be presented as:

  L = Σ_{y_ijk∈Λ} (1/2) ln( 1 + (y_ijk − Σ_{r=1}^{R} ũ_ir s̃_jr t̃_kr − ã_i − b̃_j − c̃_k)² / γ² )
    + Σ_{(i,r)} (τ_i/2)(ũ_ir − u_ir + φ_ir/τ_i)² + Σ_{(j,r)} (υ_j/2)(s̃_jr − s_jr + ρ_jr/υ_j)² + Σ_{(k,r)} (ω_k/2)(t̃_kr − t_kr + ψ_kr/ω_k)²
    + Σ_i (α_i/2)(ã_i − a_i + χ_i/α_i)² + Σ_j (β_j/2)(b̃_j − b_j + ϕ_j/β_j)² + Σ_k (δ_k/2)(c̃_k − c_k + σ_k/δ_k)² − ϑ,   (7)

where ϑ collects the constant terms:

  ϑ = Σ_{(i,r)} φ_ir²/(2τ_i) + Σ_{(j,r)} ρ_jr²/(2υ_j) + Σ_{(k,r)} ψ_kr²/(2ω_k) + Σ_i χ_i²/(2α_i) + Σ_j ϕ_j²/(2β_j) + Σ_k σ_k²/(2δ_k),   (8)

where φ_ir, ρ_jr, ψ_kr, χ_i, ϕ_j, and σ_k are elements of the Lagrangian multiplier matrices matching the equality constraints in (6). Furthermore, the known data distribution of an HDI QoS tensor is imbalanced, i.e., different users (or services, time slots) are associated with different numbers of known elements in Y. Therefore, to precisely control the augmentation effect from the perspective of data density, the augmentation constants τ_i, υ_j, ω_k, α_i, β_j, and δ_k are set as:

  τ_i = α_i = λ|Λ(i)|,  υ_j = β_j = λ|Λ(j)|,  ω_k = δ_k = λ|Λ(k)|,   (9)
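The robustness claim behind (4) and (5) can be seen numerically: for a large residual, the squared loss explodes while the Cauchy loss grows only logarithmically. A small illustration (the value of γ here is arbitrary):

```python
import math

def cauchy_loss(residual, gamma):
    """Per-entry Cauchy loss, ln(1 + residual^2 / gamma^2), as in (4)/(5)."""
    return math.log(1.0 + residual ** 2 / gamma ** 2)

def l2_loss(residual):
    """Per-entry squared loss, as in (3)."""
    return residual ** 2

# A normal residual vs. an outlier residual: L2 amplifies the outlier 100x,
# while the Cauchy loss bounds its influence on the objective.
for r in (1.0, 10.0):
    print(r, l2_loss(r), cauchy_loss(r, gamma=1.0))
```

This bounded growth is why a single outlying QoS record cannot dominate the gradient of (5) the way it dominates the gradient of (3).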
where Λ(i), Λ(j), and Λ(k) denote the known entries associated with user i, service j, and time slot k, respectively, and λ is a scaling constant.

3.2 Parameter Learning Scheme

a) Auxiliary Variables. Since (7) is not analytically solvable owing to its non-convexity, we adopt the alternating learning strategy [16]: fixing all other variables in (7) and solving the optimization problem for one auxiliary variable at a time, so that (7) becomes a convex problem with an analytic solution. Considering ũ_ir as the active decision parameter, the partial derivative of L with respect to it is

  ∂L/∂ũ_ir = −Σ_{y_ijk∈Λ(i)} Φ_ijk s̃_jr t̃_kr (y_ijk − ŷ_ijk) + τ_i(ũ_ir − u_ir) + φ_ir
           = −Σ_{y_ijk∈Λ(i)} Φ_ijk s̃_jr t̃_kr (y_ijk − ŷ^r_ijk) + Σ_{y_ijk∈Λ(i)} Φ_ijk (s̃_jr t̃_kr)² ũ_ir + τ_i ũ_ir − τ_i u_ir + φ_ir,   (10)
where ŷ^r_ijk = Σ_{n=1,n≠r}^{R} ũ_in s̃_jn t̃_kn + ã_i + b̃_j + c̃_k, and Φ_ijk = 1/(γ² + (y_ijk − ŷ_ijk)²). Then, ∀i ∈ I, r ∈ {1, …, R}, setting ∂L/∂ũ_ir = 0 yields

  ũ_ir = ( Σ_{y_ijk∈Λ(i)} Φ_ijk s̃_jr t̃_kr (y_ijk − ŷ^r_ijk) + τ_i u_ir − φ_ir ) / ( τ_i + Σ_{y_ijk∈Λ(i)} Φ_ijk (s̃_jr t̃_kr)² ).   (11)

Likewise, ∀j ∈ J, ∀k ∈ K, r ∈ {1, …, R}, we obtain the update rules of s̃_jr and t̃_kr as:
  s̃_jr = ( Σ_{y_ijk∈Λ(j)} Φ_ijk ũ_ir t̃_kr (y_ijk − ŷ^r_ijk) + υ_j s_jr − ρ_jr ) / ( υ_j + Σ_{y_ijk∈Λ(j)} Φ_ijk (ũ_ir t̃_kr)² ),

  t̃_kr = ( Σ_{y_ijk∈Λ(k)} Φ_ijk ũ_ir s̃_jr (y_ijk − ŷ^r_ijk) + ω_k t_kr − ψ_kr ) / ( ω_k + Σ_{y_ijk∈Λ(k)} Φ_ijk (ũ_ir s̃_jr)² ).   (12)

Similarly, ∀i ∈ I, ∀j ∈ J, ∀k ∈ K, we obtain the update rules of ã_i, b̃_j, and c̃_k as:
  ã_i = ( Σ_{y_ijk∈Λ(i)} Φ_ijk (y_ijk − Σ_{r=1}^{R} ũ_ir s̃_jr t̃_kr − b̃_j − c̃_k) + α_i a_i − χ_i ) / ( α_i + Σ_{y_ijk∈Λ(i)} Φ_ijk ),

  b̃_j = ( Σ_{y_ijk∈Λ(j)} Φ_ijk (y_ijk − Σ_{r=1}^{R} ũ_ir s̃_jr t̃_kr − ã_i − c̃_k) + β_j b_j − ϕ_j ) / ( β_j + Σ_{y_ijk∈Λ(j)} Φ_ijk ),   (13)

  c̃_k = ( Σ_{y_ijk∈Λ(k)} Φ_ijk (y_ijk − Σ_{r=1}^{R} ũ_ir s̃_jr t̃_kr − ã_i − b̃_j) + δ_k c_k − σ_k ) / ( δ_k + Σ_{y_ijk∈Λ(k)} Φ_ijk ).   (14)

b) Output Parameters. For the output parameters U, S, T, a, b, and c, we apply the same alternating principle. In particular, the output parameters are projected onto the non-negative real field to satisfy the non-negativity constraints:

  u_ir = max(0, ũ_ir + φ_ir/τ_i),  s_jr = max(0, s̃_jr + ρ_jr/υ_j),  t_kr = max(0, t̃_kr + ψ_kr/ω_k),   (15)

  a_i = max(0, ã_i + χ_i/α_i),  b_j = max(0, b̃_j + ϕ_j/β_j),  c_k = max(0, c̃_k + σ_k/δ_k).   (16)

c) Lagrangian Multipliers. Following the ADMM principle, the dual gradient ascent method is employed to update the Lagrangian multipliers [17]:

  φ_ir = φ_ir + ητ_i(ũ_ir − u_ir),  ρ_jr = ρ_jr + ηυ_j(s̃_jr − s_jr),  ψ_kr = ψ_kr + ηω_k(t̃_kr − t_kr);

  χ_i = χ_i + ηα_i(ã_i − a_i),  ϕ_j = ϕ_j + ηβ_j(b̃_j − b_j),  σ_k = σ_k + ηδ_k(c̃_k − c_k).   (17)
Fig. 2. Dimension-oriented subtask sequence design.
3.3 ADMM-Based Learning Sequence

To guarantee the high computational efficiency of RNL, we elaborate an efficient learning strategy following the ADMM principle [23]. Concretely, as illustrated in Fig. 2, we split the optimization task into R + 1 subtasks by building a dimension-oriented subtask learning sequence, where each subtask is tackled based on the solution of the previous one. As shown in Fig. 2, the linear biases are updated first, as the foundation of the R-dimensional latent feature space. Then, in each subsequent subtask, the variables in {Ũ, S̃, T̃} are updated dimension by dimension. Therefore, we place the auxiliary-variable update rules given in (11)-(14) into the R + 1 subtasks. Moreover, since the updates of the output parameters and Lagrangian multipliers depend on the auxiliary variables, they are attached to the subtask sequence accordingly.
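The R + 1 subtask schedule can be sketched as a plain loop; the update callables here are trivial placeholders that only record the execution order:

```python
def run_subtask_sequence(n_sweeps, R, state, update_bias, update_dim):
    """One sweep = the bias subtask followed by R dimension-oriented
    subtasks, each building on the previous solution."""
    for _ in range(n_sweeps):
        state = update_bias(state)
        for r in range(R):
            state = update_dim(state, r)
    return state

trace = run_subtask_sequence(
    n_sweeps=1, R=3, state=[],
    update_bias=lambda s: s + ["bias"],
    update_dim=lambda s, r: s + [f"dim-{r}"],
)
print(trace)  # ['bias', 'dim-0', 'dim-1', 'dim-2']
```

In RNL itself, `update_bias` would apply (13)-(14) and `update_dim` would apply (11)-(12) for one latent dimension, each followed by the attached output-parameter and multiplier updates.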
4 Experimental Results and Analysis

4.1 General Setting

Datasets. We conduct empirical studies on two time-varying QoS datasets collected by WSMonitor [25]. They record the throughput and response times of 4532 Web services experienced by 142 real users over 64 different time slots. Each dataset contains 30,287,611 QoS records. The response-time dataset's values range from 0 to 20 s with a mean of 3.1773 s, and the throughput dataset's values range from 0 to 1,000 kbps with a mean of 9.609 kbps. Thus, we can build two "user × service × time" tensors of size 4532 × 142 × 64. To achieve objective experimental results, each dataset is split into disjoint training, validation, and testing sets, as shown in Table 1. For example, the train:valid:test ratio for D1.1 is 16%:4%:80%, which indicates that we randomly select 16% of D1 as the training set and 4% as the validation set to build the model and tune the hyper-parameters. The remaining 80% of D1.1 is predicted
to evaluate its performance. Moreover, in order to eliminate possible biases caused by data splitting, the above process is repeated 20 times, and the standard deviations are also recorded in the experimental results. Note that a tested model is judged to have converged if its validation error difference between two consecutive iterations becomes smaller than 10⁻⁵, or the iteration count exceeds the preset threshold of 1000.

Table 1. Detailed Settings of Testing Cases.

Dataset | Case | Train:Valid:Test | #Train    | #Valid    | #Test
D1      | D1.1 | 16%:4%:80%       | 4,846,017 | 1,211,504 | 24,230,088
D1      | D1.2 | 20%:5%:75%       | 6,057,522 | 1,514,380 | 23,851,493
D1      | D1.3 | 24%:6%:70%       | 7,269,026 | 1,817,256 | 22,715,708
D2      | D2.1 | 16%:4%:80%       | 4,846,017 | 1,211,504 | 24,230,088
D2      | D2.2 | 20%:5%:75%       | 6,057,522 | 1,514,380 | 23,851,493
D2      | D2.3 | 24%:6%:70%       | 7,269,026 | 1,817,256 | 22,715,708
Evaluation Metrics. In this study, we pay attention to prediction accuracy and computational efficiency. Therefore, we adopt the Mean Absolute Error (MAE) and Mean Relative Error (MRE) to evaluate how accurately a model predicts the missing time-varying QoS data. If ŷ_ijk and y_ijk denote the estimated and actual values respectively, the two metrics are

  MAE = (1/N) Σ |y_ijk − ŷ_ijk|,  MRE = (1/N) Σ (|y_ijk − ŷ_ijk| / y_ijk),

where N is the size of the testing set and the sums run over its entries. Note that for a tested model, smaller MAE and MRE indicate higher accuracy in predicting missing time-varying QoS data. Besides, to test computational efficiency, we measure CPU running time. All experiments are conducted on a tablet with a 2.5 GHz i5 CPU and 16 GB RAM.
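The two metrics are straightforward to compute over the testing set; a sketch (the pairs of actual and predicted values are illustrative):

```python
def mae_mre(pairs):
    """MAE and MRE over (actual, predicted) pairs from the testing set."""
    n = len(pairs)
    mae = sum(abs(y - yh) for y, yh in pairs) / n
    mre = sum(abs(y - yh) / y for y, yh in pairs) / n
    return mae, mre

pairs = [(2.0, 1.5), (4.0, 5.0)]
print(mae_mre(pairs))  # (0.75, 0.25)
```

MRE normalizes each error by the actual value, so it weights errors on small QoS values (e.g., fast response times) more heavily than MAE does.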
4.2 Comparison with State-of-the-Art Models

We compare our model with four QoS prediction models. Their details are as follows:

a) M1: A high-dimension-oriented time-varying QoS prediction model proposed in [7]. It exploits the structural relationships of multi-dimensional QoS data and is based on concepts from multi-linear algebra.
b) M2: A QoS predictor for service-oriented environments [8]. It constructs a multi-dimensional relationship between QoS evaluation factors and service-provider QoS values according to the collected reference reports.
c) M3: An NLFT-based QoS prediction model [9]. It adopts a learning-depth adjustment scheme to handle the training fluctuation problem.
d) M4: A tensor CP decomposition-based robust QoS prediction method [10]. It takes into account the effect of outliers on predicting missing QoS data.
e) M5: The RNL model proposed in this paper.

Experimental Results. The MAE and MRE of M1–M5 are summarized in Tables 2 and 3, respectively, and the total time costs are displayed in Fig. 3. We can draw the following conclusions from the experimental results.

a) Owing to the adoption of the Cauchy loss, M5 achieves higher prediction accuracy for missing time-varying QoS data than its peers. As shown in Tables 2 and 3, M5 outperforms the other models in all testing cases. For example, on D1.2, the MAE of M5 is 1.0305 ± 0.0024, an accuracy improvement of 24.88%, 19.13%, 14.82%, and 11.69% over M1's 1.3718 ± 0.0418, M2's 1.2742 ± 0.0103, M3's 1.2098 ± 0.0130, and M4's 1.1670 ± 0.0012, respectively. By analogy, the MRE of M5 is 0.3245 ± 0.0007, an accuracy improvement of 24.83%, 19.39%, 15.11%, and 11.67% over M1–M4, respectively. On D2.2, the MAE of M5 is 3.3018 ± 0.0234, an accuracy improvement of 21.79%, 11.64%, 3.91%, and 7.49% over M1–M4's 4.2222 ± 0.0715, 3.7370 ± 0.1351, 3.4362 ± 0.0380, and 3.5690 ± 0.0136, respectively. Similarly, M5's MRE is 0.3495, which is 23.59%, 25.14%, 18.60%, and 7.47% lower than those of M1–M4, respectively. Similar conclusions can be drawn from the other cases, as summarized in Tables 2 and 3.

b) Owing to its ADMM-based parameter learning scheme, M5 achieves higher computational efficiency than its peers. For example, on D1.1, as shown in Fig. 3(a), M5 takes 1164 s in total to converge in MAE, which is 49.69% of M1's 2314 s, 84.23% of M2's 7380 s, 28.45% of M3's 1627 s, and 55.59% of M4's 2621 s. Correspondingly, as depicted in Fig. 3(c), M5's total time cost in MRE is 51.08%, 84.66%, 29.16%, and 22.69% of M1–M4's total time costs, respectively. As illustrated in Fig. 3(b), on D2.1, M5 takes 636 s in total in MAE, which is 46.6% of M1's 1191 s, 82.26% of M2's 3976 s, 6.05% of M3's 677 s, and 35.1% of M4's 980 s. Correspondingly, as shown in Fig. 3(d), M5's total time cost in MRE is 51.97%, 84.04%, 12.13%, and 38.82% of M1–M4's total time costs, respectively. Similar results can be observed in the remaining testing cases in Fig. 3.
Table 2. MAE of Each Model on All Testing Cases.

Testing Case | M1              | M2              | M3              | M4              | M5
D1.1         | 1.4139 ± 0.0419 | 1.2962 ± 0.0076 | 1.2189 ± 0.0060 | 1.1762 ± 0.0049 | 1.0338 ± 0.0034
D1.2         | 1.3718 ± 0.0418 | 1.2742 ± 0.0103 | 1.2098 ± 0.0130 | 1.1670 ± 0.0012 | 1.0305 ± 0.0024
D1.3         | 1.3482 ± 0.0477 | 1.2686 ± 0.0112 | 1.1893 ± 0.0113 | 1.1588 ± 0.0068 | 1.0207 ± 0.0015
D2.1         | 5.5500 ± 0.1509 | 3.8569 ± 0.1127 | 3.5106 ± 0.0395 | 3.6108 ± 0.0303 | 3.3781 ± 0.0255
D2.2         | 4.2222 ± 0.0715 | 3.7370 ± 0.1351 | 3.4362 ± 0.0380 | 3.5690 ± 0.0136 | 3.3018 ± 0.0234
D2.3         | 4.1838 ± 0.0514 | 3.6656 ± 0.1104 | 3.3854 ± 0.0230 | 3.5379 ± 0.0180 | 3.2688 ± 0.0141
Table 3. MRE of Each Model on All Testing Cases.

Testing Case | M1              | M2              | M3              | M4              | M5
D1.1         | 0.4450 ± 0.0131 | 0.4096 ± 0.0024 | 0.3852 ± 0.0018 | 0.3698 ± 0.0019 | 0.3254 ± 0.0011
D1.2         | 0.4317 ± 0.0131 | 0.4026 ± 0.0032 | 0.3823 ± 0.0041 | 0.3674 ± 0.0004 | 0.3245 ± 0.0007
D1.3         | 0.4243 ± 0.0150 | 0.3950 ± 0.0098 | 0.3802 ± 0.0012 | 0.3662 ± 0.0034 | 0.3212 ± 0.0005
D2.1         | 0.4892 ± 0.0133 | 0.4014 ± 0.0117 | 0.3612 ± 0.0024 | 0.3182 ± 0.0026 | 0.2978 ± 0.0023
D2.2         | 0.3807 ± 0.0097 | 0.3886 ± 0.0140 | 0.3574 ± 0.0039 | 0.3144 ± 0.0012 | 0.2909 ± 0.0021
D2.3         | 0.3740 ± 0.0059 | 0.3752 ± 0.0118 | 0.3524 ± 0.0023 | 0.3119 ± 0.0016 | 0.2882 ± 0.0013
4.3 Summary

According to the aforementioned empirical results, the proposed RNL model achieves both high prediction accuracy and high computational efficiency owing to its use of the Cauchy loss and the ADMM framework, and it is superior to its peers in overall prediction performance. Hence, RNL is a highly efficient time-varying QoS prediction model.
Fig. 3. Total time cost of each model on all testing cases.
5 Conclusions

Aiming at accurately predicting the missing data of an HDI QoS tensor, this paper proposes a robust nonnegative latent factorization of tensors (RNL) model. It employs the Cauchy loss to construct a robust objective function and designs a highly efficient parameter learning scheme based on the ADMM principle. Its ability to predict missing time-varying QoS data is validated on two HDI QoS tensors, and the results show that it notably outperforms the comparison models in both prediction accuracy and computational efficiency. Moreover, improving its robustness by adopting different robust losses remains an open issue [29, 30], which we will tackle in the future.

Acknowledgements. This work was supported in part by the National Natural Science Foundation of China under Grant 62102086, and in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2022A1515110579.
References

1. Ghahramani, M.H., Zhou, M.C., Hon, C.T.: Toward cloud computing QoS architecture: analysis of cloud systems and cloud services. IEEE/CAA J. Automatica Sinica 4(1), 6–18 (2017)
2. Osypanka, P., Nawrocki, P.: QoS-aware cloud resource prediction for computing services. IEEE Trans. Serv. Comput. (2022). https://doi.org/10.1109/TSC.2022.3164256
3. Tong, E., Niu, W., Liu, J.: A missing QoS prediction approach via time-aware collaborative filtering. IEEE Trans. Serv. Comput. 15(6), 3115–3128 (2021)
4. Wu, H., Luo, X., Zhou, M.: Discovering hidden pattern in large-scale dynamically weighted directed network via latent factorization of tensors. In: Proceedings of IEEE 17th International Conference on Automation Science and Engineering, pp. 1533–1538 (2021)
5. Jia, Z., Li, J., Zhang, Y., et al.: Location-aware web service QoS prediction via deep collaborative filtering. IEEE Trans. Comput. Soc. Syst. (2022). https://doi.org/10.1109/TCSS.2022.3217277
6. Yang, Y., Zheng, Z., Niu, X., et al.: A location-based factorization machine model for web service QoS prediction. IEEE Trans. Serv. Comput. 14(5), 1264–1277 (2018)
7. Wang, S., Ma, Y., Cheng, B., et al.: Multi-dimensional QoS prediction for service recommendations. IEEE Trans. Serv. Comput. 12(1), 47–57 (2016)
8. Su, X., Zhang, M., Liang, Y., et al.: A tensor-based approach for the QoS evaluation in service-oriented environments. IEEE Trans. Netw. Serv. Manage. 18(3), 3843–3857 (2021)
9. Luo, X., Chen, M., Wu, H., et al.: Adjusting learning depth in nonnegative latent factorization of tensors for accurately modeling temporal patterns in dynamic QoS data. IEEE Trans. Autom. Sci. Eng. 18(4), 2142–2155 (2021)
10. Ye, F., Lin, Z., Chen, C., et al.: Outlier-resilient web service QoS prediction. In: Proceedings of the Web Conference 2021, pp. 3099–3110 (2021)
11. Wu, H., Luo, X., Zhou, M.: Advancing non-negative latent factorization of tensors with diversified regularizations. IEEE Trans. Serv. Comput. 15(3), 1334–1344 (2020)
12. Zou, G., Li, T., et al.: DeepTSQP: temporal-aware service QoS prediction via deep neural network and feature integration. Knowl.-Based Syst. 241, 108062 (2022)
13. Luo, X., Wu, H., et al.: Temporal pattern-aware QoS prediction via biased non-negative latent factorization of tensors. IEEE Trans. Cybern. 50(5), 1798–1809 (2019)
14. Peng, Z., Wu, H.: Non-negative latent factorization of tensors model based on β-divergence for time-aware QoS prediction. In: Proceedings of 2022 IEEE International Conference on Networking, Sensing and Control, pp. 1–6 (2022)
15. Li, X., Lu, Q., Dong, Y., et al.: Robust subspace clustering by Cauchy loss function. IEEE Trans. Neural Netw. Learn. Syst. 30(7), 2067–2078 (2018)
16. Luo, X., Zhou, M., Wang, Z., et al.: An effective scheme for QoS estimation via alternating direction method-based matrix factorization. IEEE Trans. Serv. Comput. 12(4), 503–518 (2016)
17. Cao, X., Chen, Y., Zhao, Q., et al.: Low-rank matrix factorization under general mixture noise distributions. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1493–1501 (2015)
18. Wu, H., Luo, X.: Instance-frequency-weighted regularized, nonnegative and adaptive latent factorization of tensors for dynamic QoS analysis. In: Proceedings of 2021 IEEE International Conference on Web Services, pp. 560–568 (2021)
19. Wu, D., Luo, X., Shang, M., et al.: A data-characteristic-aware latent factor model for web services QoS prediction. IEEE Trans. Knowl. Data Eng. 34(6), 2525–2538 (2020)
20. Zhao, Q., Meng, D., Xu, Z., et al.: L1-norm low-rank matrix factorization by variational Bayesian method. IEEE Trans. Neural Netw. Learn. Syst. 26(4), 825–839 (2015)
21. Luo, X., Zhou, M., Li, S., et al.: Non-negativity constrained missing data estimation for high-dimensional and sparse matrices from industrial applications. IEEE Trans. Cybern. 50(5), 1844–1855 (2019)
22. Wu, H., Luo, X., Zhou, M., et al.: A PID-incorporated latent factorization of tensors approach to dynamically weighted directed network analysis. IEEE/CAA J. Automatica Sinica 9(3), 533–546 (2021)
23. Luo, X., Hao, W., Wang, Z., Wang, J., Meng, D.: A novel approach to large-scale dynamically weighted directed network representation. IEEE Trans. Pattern Anal. Mach. Intell. 44(12), 9756–9773 (2022)
24. Wu, H., Luo, X., Zhou, M.: Neural latent factorization of tensors for dynamically weighted directed networks analysis. In: Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, pp. 3061–3066 (2021)
25. Zhang, Y., Zheng, Z., Lyu, M.R.: WSPred: a time-aware personalized QoS prediction framework for web services. In: Proceedings of 2011 IEEE 22nd International Symposium on Software Reliability Engineering, pp. 210–219 (2011)
26. Wu, H., Xia, Y., Luo, X.: Proportional-integral-derivative-incorporated latent factorization of tensors for large-scale dynamic network analysis. In: Proceedings of 2021 China Automation Congress, pp. 2980–2984 (2021)
27. Mohamadi, N., Dong, M., ShahbazPanahi, S.: Low-complexity ADMM-based algorithm for robust multi-group multicast beamforming in large-scale systems. IEEE Trans. Signal Process. 70, 2046–2061 (2022)
A Time-Aware Graph Attention Network for Temporal Knowledge Graphs Reasoning Shuxin Cao, Chengwei Liu, Xiaoxu Zhu(B) , and Peifeng Li Soochow University, Soochow 215026, China [email protected], {xiaoxzhu,pfli}@suda.edu.cn
Abstract. Temporal Knowledge Graphs (TKGs) have been widely used in various domains to describe dynamic facts using quadruples (subject, relation, object, timestamp). The TKG reasoning task aims to predict the missing entity, i.e., the "?" in a quadruple (entity, relation, ?, future time), from known facts. Although existing temporal knowledge graph reasoning models consider the quadruple information of each timestamp separately, they fail to fully exploit the temporal information as well as the information hidden in the known graph structure. To address this issue, we propose an end-to-end encoder-decoder framework that incorporates temporal information and two-hop neighbor information into the entity embedding representation. Specifically, we design a time-aware graph attention network (TA-GAT) as the encoder. Unlike existing models that deal with each quadruple independently, we integrate two-hop neighbor nodes into TA-GAT to capture the hidden properties of the target entity's neighborhood. To further improve our model, we enhance the convolutional neural network-based knowledge graph embedding model ConvKB into our decoder ConvTKB. Experimental results show that our model TA-GAT outperforms the state-of-the-art models on three datasets, i.e., WIKI, YAGO, and ICEWS14.

Keywords: Temporal knowledge graphs · Time-aware graph attention network · Two-hop neighbor nodes
1 Introduction

A knowledge graph (KG) is a knowledge base that represents real-world facts and their relations in a graph structure. KGs represent facts in the form of triples (es, r, eo), e.g., (Lebron James, player of, Lakers), where es (subject) and eo (object) denote nodes (entities) and r (relation) denotes the type of edge between es and eo. In recent years, KGs have received considerable attention as a promising method for storing factual knowledge. However, the relationships between entities are not static and may change over time. For example, if Lebron James were no longer playing for the Lakers, the triple (Lebron James, player of, Lakers) would no longer hold. To address this issue, Temporal Knowledge Graphs (TKGs) have been developed, which allow dynamically evolving triples (es, r, eo) to incorporate temporal information. These extended quadruples, represented as (es, r, eo, t), provide a solution to the problem of static triples in traditional knowledge graphs.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 40–51, 2023. https://doi.org/10.1007/978-981-99-4752-2_4
Fig. 1. For the aggregation of the target node “L.A Lakers”, we consider its associated two-hop node information as shown in the dashed line. αij denotes the relative attention value that represents the importance of this neighbor node.
We note that many existing TKG reasoning methods treat each quadruple individually, and many of them embed entities and relations in temporal-evolution order [1]. Obviously, these methods ignore the structural information of the TKG as a graph knowledge base and cannot well encapsulate the rich semantics and potential relationships inherent in the vicinity of the target entities. To address these issues, we propose an end-to-end encoder-decoder framework that incorporates temporal information and two-hop neighbor information into the entity embedding representation. Specifically, for a given quadruple (es, r, ?, t) (or (?, r, eo, t)), we employ a time-aware graph attention network to model the TKG up to the given time t in order to infer the entity that should complete it. The time-aware mechanism integrates temporal and relational information into the entity representation and assigns different weights to different entity nodes based on temporal and relational variability. The two-hop neighbor nodes address the situation where some nodes have few direct neighbors: incorporating two-hop neighbor nodes into the target entity embedding enriches the entity representation and extends the attention mechanism. Since ConvKB, an entity and relation embedding model for static KG reasoning, has been proven to achieve optimal results on several datasets, we exploit its good link-prediction performance and migrate it to TKGs as our decoder ConvTKB. Experimental results show that our model TA-GAT outperforms the state-of-the-art models on three datasets, i.e., WIKI, YAGO, and ICEWS14. In summary, our main contributions are as follows:
• We introduce a graph attention mechanism to capture entity relations and features in multi-hop domains, while also incorporating a time-aware mechanism. • We evaluate our model TA-GAT on three datasets and the experimental results show that our model has higher accuracy in comparison with various state-of-the-art models.
2 Related Work

There are two different settings for TKG reasoning: interpolation and extrapolation [2]. Assuming that the timestamps of a given TKG range from t0 to tn, interpolation, also known as TKG completion, means predicting a quadruple q = (es, r, ?, t) (or (?, r, eo, t)) that occurs within the known range, i.e., t0 ≤ t ≤ tn. Conversely, extrapolation means that the quadruples to be predicted occur at a time t after the given range, i.e., t > tn. For the first setting, many models have proposed their own solutions. HyTE [3] encodes temporal information directly in the learned embeddings in order to predict the temporal scope of previously unscoped KG beliefs. TeMP [4] uses a WGCN in the encoding phase to learn the structural features of nodes. DE-SimplE [5] uses diachronic embeddings to represent entities at different timestamps and uses the same scoring function as SimplE. For extrapolation tasks, RE-NET [6], RE-GCN [7], HIP [8], and EvoKG [9] use GNNs or RNNs to capture temporal relationships in TKGs. TANGO [10] proposed a graph transition layer to model edge formation and dissolution in TKGs, which effectively improved model performance. CyGNet [11] investigated the phenomenon that temporal facts tend to recur, and improved recognition accuracy by combining two reasoning modes through a time-aware copy-generation mechanism.
3 Method

In this section, we first present the notation and definitions used in this paper. Then, we introduce the two components of our model: the encoder, a Time-Aware Graph Attention Network (TA-GAT), and the decoder, a Convolutional Neural Network for Temporal Knowledge Bases (ConvTKB). As shown in Fig. 2, for a given target entity node in a TKG, we classify the remaining nodes into three types: direct (one-hop) neighbor nodes, two-hop neighbor nodes, and other nodes. TA-GAT consists of two layers in total. In the first layer, every node receives information from its direct neighbors. In the second layer, each entity receives information from surrounding nodes that have already aggregated their own direct neighbors; hence, after the two layers, the information of nodes within two hops has accumulated at the target node. After encoder training, we obtain the embeddings of entities, relations, and temporal information.

3.1 Background

Formally, a TKG is denoted by G = (E, R, T, Q), where E, R, T, and Q represent entities (nodes), relations (edges), timestamps, and quadruples, respectively. Each quadruple q = (es, r, eo, t) represents subject es and object eo with relation r at time t, where es ∈ E, r ∈ R, eo ∈ E, t ∈ T, and q ∈ Q. The goal of an embedding model is to learn valid representations of entities and relations and an evaluation function f such that, for a given input quadruple q = (es, r, eo, t), f(q) gives the possibility that q is valid. Taking a missing object as an example, the goal of the reasoning task is to predict
Fig. 2. Illustration of our Encoder-Decoder model for TKG reasoning, where blue represents the entity embedding vector, yellow represents the relationship embedding vector, green represents the temporal embedding vector and αij represents the attention weight. (Color figure online)
the missing eo in q = (es, r, ?, t). We generate (N−1) erroneous quadruples by replacing eo with each ei ∈ E \ {eo}. Then, we add the correct quadruple and assign scores to these N quadruples. Finally, we sort the scores in ascending order to obtain our prediction results.

3.2 Time-Aware GAT Encoder

As the relationships within a temporal knowledge graph change continually over time, it is not feasible to construct a static graph for each moment. We therefore use start-and-end time pairs [tb, te] to provide an auxiliary description of edges, as shown in Fig. 1. In contrast to the conventional modeling approach, where each edge consists only of the relation and the time (r, t), we use (r, [tb, te]) to describe edges. When making predictions for g0 = (?, r, eo, t0), we only need to focus on edges with tb ≤ t0. Also, the further away an edge is from the time of our quadruple g0, i.e., the larger (t0 − te), the less attention we pay to it. Finally, using (te − tb) we can also calculate the duration of a quadruple, and obviously the longer the duration, the more important it is for our reasoning. We map all entities, relations, and times in G = (E, R, T, Q) into the vector space R^d, where d denotes the dimensionality of the vector space. Entity embeddings are represented by a matrix E ∈ R^{Ne×d}, where Ne is the total number of entities. With similar constructions, the relation and time embeddings are represented by matrices R ∈ R^{Nr×d} and T ∈ R^{Nt×d}. During the encoding process, we incorporate both relational and temporal information by averaging the entities and their corresponding temporal relationships within the target entity ei and its surrounding neighborhood to generate a comprehensive entity representation. In particular, we consider the two-hop entity relationships as
a complement when considering the neighborhood. As shown in Fig. 1, although there is no indication in the TKG that Jenny Bass is associated with the Lakers, she is in fact a member of the management team, and thus has a supporting role in the description of the target entity. The complete input embedding of the target entity ei, composed by averaging and concatenating the entity, relation, and temporal components, can be represented as:

  h^in_ei = [ (1/(|N_i^e|+1)) Σ_{e_j∈N_i^e∪{e_i}} h_ej || (1/|N_i^r|) Σ_{r_j∈N_i^r} h_rj || (1/|N_i^τ|) Σ_{τ_j∈N_i^τ} (h_τj^s || h_τj^e) ],   (1)
where N_i^e denotes the valid entities in the neighborhood of ei, N_i^r denotes the relations between them and the target entity, and N_i^τ denotes the time steps; we use h_τj^s and h_τj^e to denote the embeddings of the start and end times of the relationship, respectively, so as to improve the time sensitivity of the model. In our work, the self-attention mechanism uses temporal and relational information to weight node ej with respect to node ei, with the specific weighted importance β_{i,j} given by:

  β_{i,j} = ω^T · [ h^in_ei || h_ej || (1/|L_ij|) Σ_{r_m∈L_ij} h_rm || h_τj^s || h_τj^e ],   (2)

where L_ij represents the set of relations between entities ei and ej, and ω is the shared attention vector.
$$\alpha_{i,j} = \mathrm{softmax}\left(\mathrm{LeakyReLU}(\beta_{i,j})\right) = \frac{\exp\left(\mathrm{LeakyReLU}(\beta_{i,j})\right)}{\sum_{e_m \in N_i^e} \exp\left(\mathrm{LeakyReLU}(\beta_{i,m})\right)} \quad (3)$$
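The scoring and normalization of Eqs. (2)–(3) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names, the per-neighbor feature layout, and the dimensions are assumptions made for demonstration.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(h_i, neighbor_feats, omega):
    # beta_{i,j} = omega^T [h_i || h_j || h_rel || h_ts || h_te]  (Eq. 2)
    betas = np.array([omega @ np.concatenate([h_i] + feats)
                      for feats in neighbor_feats])
    # alpha_{i,j} = softmax_j(LeakyReLU(beta_{i,j}))  (Eq. 3)
    return softmax(leaky_relu(betas))

rng = np.random.default_rng(0)
d = 4
h_i = rng.normal(size=d)
# each neighbor contributes [h_j, h_rel, h_ts, h_te], all of dimension d
neighbors = [[rng.normal(size=d) for _ in range(4)] for _ in range(3)]
omega = rng.normal(size=5 * d)   # matches the concatenated dimension
alpha = attention_weights(h_i, neighbors, omega)
assert np.isclose(alpha.sum(), 1.0)
```

The weights always sum to one over the neighborhood, which is what makes the aggregation in the following equations a proper weighted average.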
We use a multi-head attention mechanism to improve the stability of the model's self-attention learning, where M is the number of attention heads and σ represents the non-linear function. The new embedding of entity $e_i$ is a weighted sum incorporating the attention weights of the neighboring nodes as follows:

$$h_{e_i} = \big\Vert_{m=1}^{M} \, \sigma\!\left( \sum_{j \in N_i^e} \alpha_{i,j}^{m} h_{e_j}^{in} \right) \quad (4)$$
In the first (k − 1) attention layers of our model, we concatenate the products of entity features and attention coefficients. In the last layer, we average the outputs of the multiple attention heads to obtain the final output vector of the entity, shown as follows:

$$\vec{h}_{e_i}^{\,out} = \sigma\!\left( \frac{1}{M} \sum_{m=1}^{M} \sum_{j \in N_i^e} \alpha_{i,j}^{m} h_{e_j}^{in} \right) \quad (5)$$
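The two head-fusion strategies of Eqs. (4) and (5) — concatenation in intermediate layers, averaging before the nonlinearity in the last layer — can be sketched as follows. This is a hypothetical NumPy illustration, with tanh standing in for the unspecified nonlinearity σ and all names and dimensions assumed for demonstration.

```python
import numpy as np

def aggregate_hidden(neighbor_h, alphas):
    # Eq. 4 (intermediate layers): apply sigma per head, then concatenate
    return np.concatenate([np.tanh(a @ neighbor_h) for a in alphas])

def aggregate_final(neighbor_h, alphas):
    # Eq. 5 (last layer): average the head outputs before the nonlinearity
    return np.tanh((alphas @ neighbor_h).mean(axis=0))

rng = np.random.default_rng(1)
M, n, d = 4, 3, 8                               # heads, neighbors, dimension
neighbor_h = rng.normal(size=(n, d))            # neighbor embeddings h^in
alphas = rng.random(size=(M, n))
alphas /= alphas.sum(axis=1, keepdims=True)     # normalized per head (Eq. 3)

assert aggregate_hidden(neighbor_h, alphas).shape == (M * d,)
assert aggregate_final(neighbor_h, alphas).shape == (d,)
```

Concatenation preserves per-head information (output dimension M·d), while averaging in the final layer returns to the base dimension d.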
A Time-Aware Graph Attention Network for Temporal Knowledge
To prevent entities from losing their original embedding information after multiple layers of attention mechanisms, we use a weight matrix to fuse the original embedding of each entity with the output embedding of TA-GAT as follows:

$$\vec{h}_{e_i} = W \cdot \vec{h}_{e_i}^{\,out} + \vec{h}_{e_i}^{\,in} \quad (6)$$
where $\vec{h}_{e_i}^{\,in}$ denotes the input entity embedding mentioned in the previous section, $\vec{h}_{e_i}^{\,out}$ denotes the final output embedding after the model, and W denotes the weight matrix.

Encoder Learning Objective. Inspired by the static knowledge graph model TransE [14], which argues that a valid triple $(e_s, r, e_o)$ of entity vectors should satisfy $\vec{h}_{e_s} + \vec{h}_r \approx \vec{h}_{e_o}$, we assume that a valid quadruple $g_{ij} = (e_i, r, e_j, t_0)$ should satisfy a similar relation. Therefore, we learn entity, relation and temporal embeddings by minimizing the $L_1$-norm $d_{q_{ij}} = \Vert \vec{h}_{e_i} + \vec{h}_r + \vec{h}_{t_0} - \vec{h}_{e_j} \Vert_1$. We train our model using the hinge loss, which is given by the following equation:

$$L = \sum_{q_{ij} \in Q} \sum_{q'_{ij} \in Q'} \max\left( d_{q_{ij}} - d_{q'_{ij}} + \gamma, \, 0 \right) \quad (7)$$

where γ is a margin hyperparameter, Q is the set of valid quadruples, and Q′ is the set of invalid quadruples obtained by replacing the subject or object with one that does not appear in the TKG.
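The translation distance and margin loss of Eq. (7) can be sketched directly. This is a minimal illustration under the assumption that corrupted quadruples are scored with the same distance function; the function names and dimensions are hypothetical.

```python
import numpy as np

def quad_distance(h_s, h_r, h_t, h_o):
    # d_q = || h_s + h_r + h_t - h_o ||_1  (L1 translation distance)
    return np.abs(h_s + h_r + h_t - h_o).sum()

def hinge_loss(valid, corrupted, gamma=1.0):
    # Eq. 7: push valid quadruples below corrupted ones by a margin gamma
    return sum(max(quad_distance(*q) - quad_distance(*qc) + gamma, 0.0)
               for q in valid for qc in corrupted)

rng = np.random.default_rng(2)
h_s, h_r, h_t = (rng.normal(size=5) for _ in range(3))
h_o = h_s + h_r + h_t                     # a perfectly valid quadruple: d = 0
h_bad = rng.normal(size=5)                # corrupted object entity
loss = hinge_loss([(h_s, h_r, h_t, h_o)], [(h_s, h_r, h_t, h_bad)])
assert loss >= 0.0
```

When the valid quadruple already scores lower than every corrupted one by more than the margin, the loss is zero and the embeddings receive no further push.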
3.3 ConvTKB Decoder

We use a modified ConvKB [13] with added temporal features as the decoder, called ConvTKB. The purpose of the decoder's convolution layer is to obtain global embedding features of $q_{ij}$ across multiple spatial dimensions, and the score function integrates multiple feature maps as follows:

$$f(q_{ij}) = \left( \big\Vert_{m=1}^{\Omega} \mathrm{ReLU}\!\left( [\vec{h}_{e_i}, \vec{h}_r, \vec{h}_{e_j}, \vec{h}_t] * \omega_m \right) \right) \cdot W \quad (8)$$

where $\omega_m$ represents the m-th convolutional filter, Ω is the number of filters, ∗ is the convolution operator, and W represents the linear transformation used to compute the score of a quadruple. The difference from the original model is that temporal information is added to the convolution process. The model uses the soft-margin loss as follows:

$$L = \sum_{q_{ij} \in \{Q \cup Q'\}} \log\left( 1 + \exp\left( l_{q_{ij}} \times f(q_{ij}) \right) \right) + \frac{\lambda}{2} \Vert W \Vert_2^2 \quad (9)$$

$$\text{where } l_{q_{ij}} = \begin{cases} 1, & q_{ij} \in Q \\ -1, & q_{ij} \in Q' \end{cases}$$
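The convolutional scoring of Eq. (8) can be illustrated as below. This is a simplified sketch, not the actual ConvTKB implementation: it assumes 1×4 filters slid across the rows of the stacked d×4 embedding matrix, and all names are hypothetical.

```python
import numpy as np

def conv_score(h_s, h_r, h_o, h_t, filters, w):
    # Eq. 8 (sketch): stack the four embeddings into a d x 4 matrix,
    # convolve each 1x4 filter across the rows, apply ReLU,
    # concatenate the feature maps, and project with w.
    A = np.stack([h_s, h_r, h_o, h_t], axis=1)            # (d, 4)
    feature_maps = [np.maximum(A @ f, 0.0) for f in filters]  # each (d,)
    return float(np.concatenate(feature_maps) @ w)

rng = np.random.default_rng(3)
d, n_filters = 6, 3
embeds = [rng.normal(size=d) for _ in range(4)]           # h_s, h_r, h_o, h_t
filters = [rng.normal(size=4) for _ in range(n_filters)]
w = rng.normal(size=d * n_filters)
score = conv_score(*embeds, filters, w)
assert np.isfinite(score)
```

Each filter sees one row of the stacked matrix at a time, i.e. the same dimension of the subject, relation, object, and time embeddings together, which is how temporal information enters the convolution.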
4 Experimentation

4.1 Experimental Settings

Dataset. We selected three datasets that are commonly used for TKG reasoning, namely ICEWS14 [15], WIKI [16], and YAGO [17]. Among them, ICEWS14 uses timestamps to record mainly political events between 01/01/2014 and 12/31/2014. WIKI and YAGO are sourced from Wikipedia and YAGO3, and contain a variety of factual and temporal information from reality, much of which has long durations. Table 1 shows the relevant information for the three datasets.

Table 1. Statistics of the datasets.

Data    | Entities | Relations | Training | Validation | Test    | Granularity
ICEWS14 | 12498    | 260       | 323,859  | –          | 341,490 | 24 hours
WIKI    | 12554    | 24        | 539,286  | 67,538     | 63,110  | 1 year
YAGO    | 10623    | 10        | 161,540  | 19,523     | 20,026  | 1 year
Baselines. We compare the method in this paper with a total of 10 models of KG and TKG reasoning. The static methods include TransE [14], DistMult [18], ComplEx [19], and ConvE [12]. Among them, the TransE model models triple information based on distance. DistMult simplifies RESCAL by restricting matrices to diagonal matrices. ComplEx introduces complex-valued embeddings to extend DistMult. ConvE complements static knowledge graph embedding with convolution operations. The dynamic approaches include RE-NET [6], TeMP [4], RE-GCN [7], CyGNet [11], EvoKG [9], and HIP [8].

Training and Evaluation Protocol. We employed the TransE model to embed entities, relations, and events. The training process of this model involves two key stages. Initially, we used our TA-GAT to encode the quadruples, after which we trained the decoder ConvTKB to accomplish TKG reasoning. The encoder we developed differs from the original GAT in that it incorporates the auxiliary information of the two-hop neighborhood and the temporal aspects of the quadruple. Furthermore, we optimized the parameters of the encoder with Adam. During the TKG reasoning task, our aim is to predict the missing entity es or eo in a quadruple, denoted as q = (es, r, eo, t0). To train our decoder model, ConvTKB, we replace the missing entity with the remaining (N − 1) invalid entities, score and rank all N possible quadruples, and select the quadruple with the highest score as the predicted entity. We evaluate the effectiveness of our model using two metrics: mean reciprocal rank (MRR) and Hits@n, which measures the proportion of correct entities among the top n predicted entities [20]. Our ultimate goal is to accurately predict the missing entities in a given quadruple.
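Both evaluation metrics can be computed directly from the rank of the correct entity in each sorted candidate list. A small sketch follows; the ranks shown are made-up examples, not results from the paper.

```python
def mrr_and_hits(ranks, n=10):
    """ranks: 1-based rank of the correct entity for each test query."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits_at_n = sum(r <= n for r in ranks) / len(ranks)
    return mrr, hits_at_n

ranks = [1, 3, 2, 15]                       # e.g. four test queries
mrr, hits10 = mrr_and_hits(ranks, n=10)
# mrr = (1 + 1/3 + 1/2 + 1/15) / 4 = 0.475; hits10 = 3/4 = 0.75
```

MRR rewards placing the correct entity near the top of the ranking, while Hits@n only checks whether it appears in the top n.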
4.2 Results and Analysis

Table 2 shows the experimental results of our model and the baseline models on three datasets. Compared to the baseline models, our model achieves a significant improvement on the WIKI dataset. The main reason is that the nodes in the WIKI dataset have fewer neighbors, and our model can stack the information of two-hop neighbor nodes through two layers of graph attention networks to effectively supplement the information for nodes with little neighbor information. Secondly, 42.91%, 8.03%, and 6.52% of the entities in WIKI, YAGO, and ICEWS14, respectively, are unseen in the test set. The baseline models cannot handle these unseen entities, but our model can reason about them well by integrating the relationship and time information between them through the graph networks.

Table 2. Experimental results on the WIKI, YAGO and ICEWS14 test sets. H@N values are in percentage.

WIKI:
Models   | MRR   | H@1   | H@3   | H@10
TransE   | 46.68 | 36.19 | 49.17 | 51.71
DistMult | 46.12 | 37.24 | 49.81 | 51.38
ComplEx  | 47.84 | 38.15 | 50.08 | 51.39
ConvE    | 47.57 | 38.76 | 50.10 | 50.53
RE-NET   | 51.97 | 48.01 | 52.07 | 53.91
TeMP     | 49.61 | 46.96 | 50.24 | 52.06
RE-GCN   | 44.68 | 39.82 | 46.75 | 48.75
CyGNet   | 45.50 | 50.48 | 50.79 | 52.80
EvoKG    | 50.66 | 12.21 | 63.84 | 85.91
HIP      | 54.71 | 53.82 | 54.73 | 56.26
TA-GAT   | 84.35 | 79.44 | 88.36 | 91.30

YAGO:
Models   | MRR   | H@1   | H@3   | H@10
TransE   | 48.94 | 46.23 | 62.45 | 66.05
DistMult | 59.47 | 52.97 | 60.91 | 65.26
ComplEx  | 61.29 | 54.88 | 62.28 | 66.82
ConvE    | 62.32 | 56.19 | 63.97 | 65.60
RE-NET   | 65.16 | 63.29 | 65.63 | 68.08
TeMP     | 62.25 | 55.39 | 64.63 | 66.35
RE-GCN   | 65.69 | 59.98 | 68.70 | 70.36
CyGNet   | 63.47 | 64.26 | 65.71 | 68.95
EvoKG    | 55.11 | 54.37 | 81.39 | 92.73
HIP      | 67.55 | 66.32 | 68.49 | 69.89
TA-GAT   | 69.12 | 66.56 | 69.02 | 76.21

ICEWS14:
Models   | MRR   | H@1   | H@3   | H@10
TransE   | 18.65 | 1.12  | 31.34 | 47.07
DistMult | 19.06 | 10.09 | 22.00 | 36.41
ComplEx  | 24.17 | 16.13 | 27.49 | 41.09
ConvE    | 40.73 | 33.20 | 43.92 | 54.35
RE-NET   | 45.71 | 38.42 | 49.06 | 59.12
TeMP     | 43.13 | 35.67 | 45.79 | 56.12
RE-GCN   | 32.37 | 24.43 | 35.05 | 48.12
CyGNet   | 48.63 | 41.77 | 52.50 | 60.29
EvoKG    | 18.30 | 6.30  | 19.47 | 39.37
HIP      | 50.57 | 45.73 | 54.28 | 61.65
TA-GAT   | 51.49 | 46.26 | 53.48 | 62.78
Traditional static KG methods often lose the time information in the quadruples when modeling temporal facts, which is clearly disadvantageous for reasoning. Our model incorporates time information in different parts of the embedding, which improves its sensitivity to time. Therefore, compared to static reasoning models, our model improves performance on all indicators across the different datasets. Compared to the TKG reasoning models in the baseline, our model makes better use of the graph structure in the knowledge graph. Unlike these models, which consider each quadruple separately in chronological order, our model fully utilizes the known information for modeling. At the same time, assigning different weight coefficients to different nodes according to actual conditions increases the interpretability of the model. Therefore, it outperforms these models on MRR, the most important indicator. In Table 2, TA-GAT has the best performance on the WIKI dataset, and there are also improvements on the other two datasets. This indicates that our model makes progress on the TKG reasoning task, and that incorporating information from two-hop neighbor nodes and fusing the initial embedding information of entities have a significant impact on the accuracy of the model.
4.3 Ablation Study

To demonstrate the validity of each component of the model, we present three sets of ablation studies and compare their performance on two datasets, WIKI and YAGO. We use TransE to replace the encoder to verify the validity of TA-GAT. We also remove the time-aware mechanism from the entity representation and use a base GAT to encode the TKG, to demonstrate the usefulness of the time-aware mechanism. In addition, we delete the information of two-hop neighbor nodes and keep only the information of directly related neighbors to verify the usefulness of two-hop neighbor node information. As shown in Table 3, compared to TransE as an encoder, our model improves MRR by 31.72 and 19.38 on the WIKI and YAGO datasets, respectively, indicating that graph networks are more suitable for encoding TKGs than traditional modeling approaches.

Table 3. Results of three sets of ablation experiments on two datasets, WIKI and YAGO

                      |              WIKI             |              YAGO
Models                | MRR   | H@1   | H@3   | H@10  | MRR   | H@1   | H@3   | H@10
TransE-ConvTKB        | 52.63 | 45.36 | 50.45 | 54.68 | 49.74 | 45.26 | 48.69 | 53.68
Base-GAT-ConvTKB      | 76.78 | 68.49 | 72.95 | 79.62 | 59.85 | 50.36 | 59.25 | 61.72
TA-GAT(1-hop)-ConvTKB | 78.12 | 70.56 | 76.23 | 81.56 | 60.48 | 56.78 | 60.21 | 63.47
TA-GAT-ConvTKB        | 84.35 | 79.44 | 88.36 | 91.30 | 69.12 | 66.56 | 69.02 | 76.21
Compared to the GAT without the time-aware mechanism, our model improves MRR by 7.57 and 9.27 on the two datasets, respectively, demonstrating that the time-aware mechanism is effective for the TKG reasoning task. When we consider only the immediate 1-hop neighbors in the node representation, the MRR values of our model drop by 6.23 and 8.64, respectively. When there are many immediate neighbors around the target entity, the entity is already rich in information and the two-hop neighbors have less influence; when there are fewer entities around the target entity, the two-hop neighbors play an important reference role in complementing the entity information. Figure 3 illustrates how the MRR values of the four models in the ablation experiment vary with epochs, indicating that our model outperforms the ablation models over multiple epochs. We have extended the graph attention network for TKG reasoning in our model. Our experimental results show that the TA-GAT encoder is the most significant factor influencing the accuracy of the model, followed by the two-hop neighbor information fusion approach. The reason is that our encoder integrates both structural and temporal information from the graph network during the training phase, and the two-hop node information effectively complements the limited number of direct neighbors around each entity node.

4.4 Case Study

As shown in Fig. 4, we selected Fabio Firmani, the entity with the lowest frequency of appearance in the WIKI training set, and tried to infer which team he played for in 2006. Only the TA-GAT model predicted the correct result: S.S. Lazio. When there
Fig. 3. Variation of MRR values with epoch for the four ablation models on the WIKI dataset

Fig. 4. An instance in the WIKI dataset: reasoning which team Fabio Firmani was playing for in 2006, based on transfer information about Fabio Firmani between 1995 and 2005
are few direct neighbors, we cannot obtain useful information about candidate entities from the quadruples directly related to the target entity, and we need to obtain effective information from multi-hop neighborhoods. Traditional TKG reasoning models embed quadruples in chronological order. Although RE-GCN [7] uses a graph network structure, it treats each neighboring entity equally and ignores multi-hop neighbor information. These factors leave the traditional models unable to accurately reason about the missing entity in this example.
5 Conclusion

In this paper, we propose an end-to-end framework for TKG reasoning, which consists of a time-aware graph attention network (TA-GAT) as encoder and a modified convolutional neural network embedding model (ConvTKB) as decoder. For the specificity of TKGs, we incorporate the time-aware mechanism and the entity information in the two-hop neighborhood into the GAT, so that the graph structure information in the TKG can be better captured at the encoding stage. At the same time, we improve ConvKB to incorporate temporal vector embeddings, so that it can efficiently perform the TKG reasoning task. Experiments show that our method outperforms traditional methods on several datasets. In future work, we will extend our method to mine the hidden higher-order information in graph models.
Acknowledgement. The authors would like to thank the three anonymous reviewers for their comments on this paper. This research was supported by the National Natural Science Foundation of China (Nos. 62006167, 62276177 and 61836007), and Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
References

1. Jiang, T., et al.: Encoding temporal information for time-aware link prediction. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2350–2354 (2016)
2. Nguyen, G.H., Lee, J.B., Rossi, R.A., Ahmed, N.K., Koh, E., Kim, S.: Continuous-time dynamic network embeddings. In: Companion Proceedings of the Web Conference 2018, pp. 969–976 (2018)
3. Dasgupta, S.S., Ray, S.N., Talukdar, P.P.: HyTE: hyperplane-based temporally aware knowledge graph embedding. In: Conference on Empirical Methods in Natural Language Processing, pp. 2001–2011 (2018)
4. Wu, J., Cao, M., Cheung, J.C.K., Hamilton, W.L.: TeMP: temporal message passing for temporal knowledge graph completion. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 5730–5746 (2020)
5. Goel, R., Kazemi, S.M., Brubaker, M., Poupart, P.: Diachronic embedding for temporal knowledge graph completion. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3988–3995 (2020)
6. Jin, W., Qu, M., Jin, X., Ren, X.: Recurrent event network: autoregressive structure inference over temporal knowledge graphs. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 6669–6683 (2020)
7. Li, Z., et al.: Temporal knowledge graph reasoning based on evolutional representation learning. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 408–417 (2021)
8. He, Y., Zhang, P., Liu, L., Liang, Q., Zhang, W., Zhang, C.: HIP network: historical information passing network for extrapolation reasoning on temporal knowledge graph. In: International Joint Conference on Artificial Intelligence, pp. 1915–1921 (2021)
9. Park, N., Liu, F., Mehta, P., Cristofor, D., Faloutsos, C., Dong, Y.: EvoKG: jointly modeling event time and network structure for reasoning over temporal knowledge graphs. In: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pp. 794–803 (2022)
10. Han, Z., Ding, Z., Ma, Y., Gu, Y., Tresp, V.: Learning neural ordinary equations for forecasting future links on temporal knowledge graphs. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8352–8364 (2021)
11. Zhu, C., Chen, M., Fan, C., Cheng, G., Zhang, Y.: Learning from history: modeling temporal knowledge graphs with sequential copy-generation networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4732–4740 (2021)
12. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D knowledge graph embeddings. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1811–1818 (2018)
13. Nguyen, D.Q., Nguyen, T.D., Nguyen, D.Q., Phung, D.: A novel embedding model for knowledge base completion based on convolutional neural network. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 327–333 (2018)
14. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems, pp. 2787–2795 (2013)
15. Trivedi, R., Dai, H., Wang, Y., Song, L.: Know-Evolve: deep temporal reasoning for dynamic knowledge graphs. In: International Conference on Machine Learning, pp. 3462–3471 (2017)
16. Leblay, J., Chekol, M.W.: Deriving validity time in knowledge graph. In: Companion Proceedings of the Web Conference 2018, pp. 1771–1776 (2018)
17. Mahdisoltani, F., Biega, J., Suchanek, F.: YAGO3: a knowledge base from multilingual Wikipedias. In: 7th Biennial Conference on Innovative Data Systems Research, pp. 1–11 (2014)
18. Yang, B., Yih, W.T., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases. In: 3rd International Conference on Learning Representations, pp. 1–12 (2015)
19. Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex embeddings for simple link prediction. In: International Conference on Machine Learning, pp. 2071–2080 (2016)
20. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: LINE: large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077 (2015)
Multivariate Time Series Anomaly Detection Method Based on mTranAD Chuanlei Zhang1(B) , Yicong Li1 , Jie Li2 , Guixi Li1 , and Hui Ma3 1 Tianjin University of Science and Technology, Tianjin, China
[email protected]
2 China Unicom Research Institute, Beijing, China 3 Yunsheng Intelligent Technology Co., Ltd., Tianjin, China
Abstract. Multivariate time series anomaly detection is a crucial area of research in several domains, including finance, logistics, and manufacturing. Successfully identifying abnormal behaviors or events can help prevent disruptions, but the high false positive rate in this field is a significant challenge that affects detection accuracy. In this paper, we propose a novel method, mTranAD, which improves upon the TranAD algorithm by leveraging the benefits of Transformer and variational autoencoder (VAE) in multivariate unsupervised anomaly detection. Specifically, mTranAD replaces TranAD’s autoencoder structure with a VAE and trains it using the VAE’s loss function. The incorporation of latent variables in the VAE model enables accurate reconstruction of data and mapping of data to a lower dimensional latent space, allowing for a more efficient description of input data complexity with fewer parameters. By utilizing these latent variables, the model can effectively handle high-dimensional, complex data and exhibit greater flexibility when generating new data. We conduct experiments on four public datasets (NAB, MBA, SMAP and WADI) and compare mTranAD’s performance with 11 other state-of-the-art methods, including TranAD, MERLIN, LSTM-NDT, OmniAnomaly, USAD, and DAGMM. The experimental results demonstrate that mTranAD outperforms these methods in terms of performance, accuracy, and reliability. The primary purpose of this paper is to improve the TranAD algorithm and enhance the accuracy of multivariate time series anomaly detection by reducing the false positive rate. Keywords: Anomaly detection · Transformer · Variational autoencoder · mTranAD
1 Introduction Anomaly detection is an important and challenging problem in many disciplines, and anomaly detection methods have been widely applied in fields such as fault diagnosis, network intrusion, and monitoring detection. Time series data is a widely used data type in various fields, covering finance, healthcare, network monitoring, and many other domains. It can reflect the dynamic changes of a system or process over time, providing valuable information for decision-making and problem-solving. However, interpreting time series data can sometimes be challenging, particularly when there are outliers in © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 52–63, 2023. https://doi.org/10.1007/978-981-99-4752-2_5
the data. Outliers in time series data refer to patterns that clearly deviate from normal behavior, which may indicate problems such as errors, malfunctions, or security threats. Therefore, accurately identifying and handling outliers in time series data is crucial for effective utilization [1]. Deep learning-based methods, especially those based on generative adversarial networks, have become hotspots in anomaly detection. However, adversarial training has problems such as mode collapse and non-convergence that require more research [2]. Unsupervised anomaly detection for multi-sensor time series data is important in machine learning [20]. Traditional time series anomaly detection methods rely on statistical theory, but the emergence of deep learning has led to increased exploration of its application in this field. Multi-layer neural networks, such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and Convolutional Neural Networks (CNN), can mine feature information more effectively, improving accuracy and robustness. Different deep learning models and algorithms have been proposed for various application scenarios, including Generative Adversarial Network (GAN)- and Variational Autoencoder (VAE)-based methods. Reconstruction-based deep learning algorithms, particularly VAEs, have become popular due to their ability to detect anomalous data that violates temporal dependencies. RNNs and LSTMs have also shown promising performance in handling time series tasks. Developing a system that can efficiently and accurately identify anomalies is challenging. While the TranAD model addresses high data volatility and ultra-low inference time, its autoencoder structure lacks robustness to adversarial noise and missing values, as well as strong generalization ability. To address these limitations, we replace the AE structure with a VAE, whose inherent stochasticity makes it robust to noise and missing values.
2 Related Work

Time series anomaly detection is a complex task and has been extensively studied. Since labeled anomalous data is scarce for supervised training algorithms, most anomaly detection methods are unsupervised. We can categorize unsupervised methods into four types: linear model-based methods, distance-based methods, probability and density estimation-based methods, and deep learning-based methods. Unsupervised anomaly detection methods can use Principal Component Analysis (PCA) [12] or Partial Least Squares (PLS) to analyze process measurements and identify anomalies. However, these methods only work well with highly correlated data that follows a multivariate Gaussian distribution. Distance-based methods like k-Nearest Neighbors (KNN) and Cluster-based Local Outlier Factor (CBLOF) are also used for anomaly detection, but they perform better when there is prior knowledge about the anomaly duration and number. Probabilistic and density estimation-based methods like Angle-Based Outlier Detection (ABOD) and Feature Bagging (FB) are improvements on distance-based methods because they focus more on data distribution, but they are not suitable for analyzing short-term temporal correlations in multivariate time series data [13]. Deep learning-based unsupervised anomaly detection methods are important in many fields. The Transformer model has gained popularity for time series analysis, as it can
leverage self-attention to discover long-term temporal dependencies. It is also used for modeling and detecting anomalies based on reconstruction criteria. Traditional techniques like PCA, k-means, and OCSVM cannot capture temporal dependencies due to their inability to pass time memory states. However, standard autoencoders may not generalize well due to the presence of noise and anomalies in normal samples, and due to the separation of feature extraction from prediction model building, which can lead to local optima [13]. Autoencoders comprise an encoder that learns a data representation and a decoder that reconstructs the input. MAD-GAN [16] and MTAD-GAT [17] are GAN models that use LSTM and graph attention networks, respectively, to model time series. CAE_M [18] uses a convolutional autoencoder memory network with bidirectional LSTM to capture temporal trends, while USAD [15] employs an autoencoder with two decoders for quick training. GDN [19] utilizes a graph deviation network with attention-based prediction, and MERLIN [20] compares sub-sequences. DAGMM [21] compresses input data into a latent space for Gaussian mixture modeling, while LSTM-NDT [22] uses LSTM with dynamic threshold methods. However, recurrent models like LSTM are computationally expensive. MTAD-GAT and GDN use deep neural networks with window inputs, but small fixed-size windows limit performance. A fast model is needed to capture high-level temporal trends with minimal overheads.
3 Methods

The structure diagram of the mTranAD model is shown in Fig. 1.
Fig. 1. The mTranAD Model.
We deeply re-architected the Transformer for anomaly detection in time-series data, incorporating attention-based transformations and a Variational Autoencoder. VAE is a type of generative model that is specifically designed to encode datasets with high dimensionality into a lower-dimensional space called the latent space. This encoded
information is then utilized to generate novel samples that resemble the original data by decoding the latent representation. The VAE uses an encoder network qϕ(z|x) to map the input data x to a lower-dimensional latent representation z, and a decoder network pθ(x|z) to map the latent representation back to the original data space. The latent variable is sampled via the reparameterization trick:

z = μ + ε ∗ σ   (1)
The loss function of the VAE consists of two parts. The first part is the Kullback-Leibler divergence (KL divergence) between the learned posterior distribution qϕ(z|x) and the Gaussian prior p(z) that the latent variable z is assumed to follow; the KL divergence DKL[qϕ(z|x)||p(z)] quantifies the difference between the two distributions. The second part is the reconstruction loss, which measures the error between the generated data and the input data given the latent variable z, defined as Eq[log pθ(x|z)]. We use the MSE loss ‖O2 − W‖2 instead of cross-entropy as the reconstruction loss. The overall loss function of the VAE is a linear combination of the two parts, with a trade-off parameter β balancing the importance of each part; here β equals 1. It can be expressed as:

L(θ, ϕ, x) = −DKL[qϕ(z|x)||p(z)] + Eq[log pθ(x|z)]   (2)
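Equations (1)–(2) together form the standard VAE objective for a diagonal Gaussian posterior, where the KL term has a closed form. A NumPy sketch follows; this is not the mTranAD implementation — `vae_step` and its shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_step(x, mu, log_var, x_recon, beta=1.0):
    """One VAE loss evaluation (sketch of Eqs. 1-2).
    mu, log_var parameterize q(z|x); x_recon is the decoder output."""
    eps = rng.normal(size=mu.shape)
    z = mu + eps * np.exp(0.5 * log_var)          # Eq. 1: z = mu + eps * sigma
    # KL( q(z|x) || N(0, I) ) in closed form for diagonal Gaussians
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    recon = np.mean((x - x_recon) ** 2)           # MSE reconstruction loss
    return z, recon + beta * kl

x = rng.normal(size=8)
mu, log_var = np.zeros(4), np.zeros(4)            # q(z|x) = N(0, I) -> KL = 0
z, loss = vae_step(x, mu, log_var, x)             # perfect reconstruction
assert np.isclose(loss, 0.0)
```

Sampling z through the reparameterization keeps the loss differentiable with respect to μ and σ, which is what allows the encoder to be trained with gradient descent.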
We have combined TranAD and the VAE with the Transformer architecture for processing time-series data. By adversarially training the discriminator to detect anomalies in real samples, we can simplify downstream neural network inference operations using softmax compression. Our approach uses scaled dot-product attention to reduce weight variance and positional encoding for the input matrix [6]. F is matched to the dimensions of W, followed by concatenation via zero addition. The input H1 is obtained using positional encoding, and the first encoder pass performs the following operations:

H11 = LayerNorm(H1 + MultiHeadAtt(H1, H1, H1))   (3)

H12 = LayerNorm(H11 + FeedForward(H11))   (4)
(4)
Apply multi-head self-attention to matrix H1 and add it to itself via matrix addition. In the window encoder, position encoding is used to generate input matrix H2 , and the self-attention mechanism in the encoder is adjusted to ensure the decoder cannot access future timestamps in the data during training. H21 = Mask(MultiHeadAtt(H2 , H2 , H2 ))
(5)
H22 = LayerNorm(H21 , H2 )
(6)
H23 = LayerNorm(MultiHeadAtt H12 , H12 , H12 + H22 )
(7)
Operations (5)–(7) are similar to operations (3)–(4): the complete time series encoding H12 is used as the values and keys by the window encoder, while the encoded input window is used as the query matrix.

Od = Sigmoid(FeedForward(H23))   (8)
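The building blocks of Eqs. (3)–(8) — layer normalization, (masked) scaled dot-product attention, and a residual block — can be sketched in NumPy as follows. This is a single-head, illustrative version, with a ReLU standing in for the feed-forward sublayer; names and dimensions are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(q, k, v, mask=None):
    # scaled dot-product attention; the mask hides future timestamps (Eq. 5)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

def encoder_block(H):
    # Eqs. 3-4: residual self-attention followed by a feed-forward sublayer
    H1 = layer_norm(H + attention(H, H, H))
    ff = np.maximum(H1, 0.0)              # stand-in for FeedForward
    return layer_norm(H1 + ff)

rng = np.random.default_rng(3)
H = rng.normal(size=(5, 8))               # 5 timestamps, model dimension 8
out = encoder_block(H)
causal = np.tril(np.ones((5, 5), dtype=bool))   # window-encoder mask
masked = attention(H, H, H, mask=causal)
assert out.shape == H.shape and masked.shape == H.shape
```

With the lower-triangular mask, the first timestamp can only attend to itself, so the first row of the masked output equals its own value vector — exactly the property that keeps the decoder from peeking at future timestamps during training.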
Here, d ∈ {1, 2} refers to the first and second decoder, respectively. We use the sigmoid activation function to constrain the output to the range [0, 1] and normalize it to match the input window. The Transformer model can be used for reconstructive prediction of input time series windows. In this model, each timestamp is processed by an encoder-decoder network for prediction, while VAE latent variables are introduced between the encoder and decoder. The encoder transforms the input time series window W into a compressed representation using a context-based attention mechanism, similar to a transformer model. An activation function then transforms this compressed representation into the outputs O1 and O2. We introduce the concept of adversarial training, using two independent decoders. The second decoder distinguishes between the input window and the candidate reconstruction generated by the first decoder in phase 1 by maximizing the difference ‖O2 − W‖2. The aim of the first decoder is to deceive the second decoder while minimizing the reconstruction error of the self-conditioned output; conversely, the aim of the second decoder is to maximize the reconstruction error of the self-conditioned output.
Algorithm 1. The mTranAD training algorithm. Inputs: encoder E, decoders D1 and D2 with their parameters, a Gaussian prior p(z), dataset W, and hyperparameter ε. The weights of E, D1 and D2 are initialized; in each epoch, every window is encoded, reconstructed as O1 and O2, and the parameters are updated with the VAE loss, until the epoch counter n reaches P.

To calculate the anomaly score st, we reconstruct the input window as Od and calculate the deviation between Wd and Od. For each window Wd, we use the autoencoder to map it to Od; we then calculate the reconstruction error (i.e., the deviation between Wd and Od) and use it as the anomaly score st for window Wd.

4.4 Evaluation Indicators

To evaluate the overall performance of the mTranAD model in detecting anomalies, we use precision, recall, area under the receiver operating characteristic curve (ROC/AUC), and F1 score as evaluation metrics:

Precision = TP / (TP + FP)   (10)

Recall = TP / (TP + FN)   (11)

F1 = 2 × (Precision × Recall) / (Precision + Recall)   (12)
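Equations (10)–(12) computed from raw counts; the counts used here are made-up examples, not results from the paper.

```python
def prf1(tp, fp, fn):
    # Eqs. 10-12: precision, recall, and their harmonic mean (F1)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf1(tp=8, fp=2, fn=2)   # 8 true alarms, 2 false, 2 missed
```

Because F1 is the harmonic mean, it is dragged down by whichever of precision or recall is lower — which is why reducing false positives (FP) directly improves the decisive metric.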
where TP is the number of correctly detected abnormal samples, FP is the number of samples detected as abnormal that are actually normal, FN is the number of abnormal samples that are not detected, and TN is the number of samples correctly detected as normal.

4.5 Experimental Results

Table 1 presents the precision, recall, ROC/AUC, and F1 score of mTranAD and the baseline models for all datasets. mTranAD outperforms the baselines in terms of AUC and F1 score on all four datasets. mTranAD achieves the highest anomaly detection accuracy in terms of F1 score, the decisive evaluation metric, on the NAB, MBA, SMAP, and WADI datasets. On average, the mTranAD model has an F1 score of 0.8802 and an AUC of 0.9084. On the SMAP dataset, the TranAD model achieved the highest AUC score (0.9921), but mTranAD had the highest F1 score among all datasets. Compared to the TranAD model, mTranAD achieves an improvement of 0.5126%, 0.0920%, 1.144%, and 41.18%, respectively, on the above-mentioned datasets. Table 1 reveals poor performance of several methods on the WADI dataset due to its large sequence lengths and diverse data formats. The POT technique, used in mTranAD, TranAD, and related models, considers local peaks in the data to set more precise thresholds; it is widely used in anomaly detection to reduce false alarms and improve identification of significant anomalous datapoints, leading to deeper data analysis and insights. It is important to note that the use of the POT technique requires parameter tuning based on the actual situation to achieve optimal results. The mTranAD model we propose is based on the TranAD
C. Zhang et al.
model, which introduces the latent variables and loss calculation of VAE. It transforms the features of the encoded data into a multivariate latent distribution, and then reconstructs it through the decoder. Compared with the TranAD model, our model can reduce the risk of overfitting and lower the false positive rate to improve the accuracy of multivariate time series anomaly detection.

Table 1. Performance comparison of the mTranAD and baseline methods on the four datasets. P: precision, R: recall, AUC: area under the ROC curve, F1: F1 score.

NAB
Method         P       R       AUC     F1
MERLIN         0.8013  0.7262  0.8414  0.7619
LSTM-NDT       0.6400  0.6667  0.8322  0.6531
DAGMM          0.7622  0.7292  0.8572  0.7453
OmniAnomaly    0.8421  0.6667  0.8330  0.7442
MSCRED         0.8522  0.6700  0.8401  0.7502
MAD-GAN        0.8666  0.7012  0.8478  0.7752
USAD           0.8421  0.6667  0.8330  0.7442
MTAD-GAT       0.8421  0.7272  0.8221  0.7804
CAE_M          0.7918  0.8019  0.8019  0.7968
GDN            0.8129  0.7872  0.8542  0.7998
TranAD         0.8889  0.9892  0.9541  0.9364
mTranAD        0.8889  1.0000  0.9996  0.9412

MBA
Method         P       R       AUC     F1
MERLIN         0.9846  0.4913  0.7828  0.6555
LSTM-NDT       0.9207  0.9718  0.9780  0.9456
DAGMM          0.9475  0.9900  0.9858  0.9683
OmniAnomaly    0.8561  1.0000  0.9570  0.9225
MSCRED         0.9272  1.0000  0.9799  0.9623
MAD-GAN        0.9396  1.0000  0.9836  0.9689
USAD           0.8953  0.9989  0.9701  0.9443
MTAD-GAT       0.9018  1.0000  0.9721  0.9484
CAE_M          0.8442  0.9997  0.9661  0.9154
GDN            0.8832  0.9892  0.9528  0.9332
TranAD         0.9569  1.0000  0.9885  0.9780
(continued)
Multivariate Time Series Anomaly Detection Method
Table 1. (continued)

MBA (continued)
Method         P       R       AUC     F1
mTranAD        0.9587  1.0000  0.9890  0.9789

SMAP
Method         P       R       AUC     F1
MERLIN         0.1577  0.9999  0.7426  0.2725
LSTM-NDT       0.8523  0.7326  0.8602  0.7879
DAGMM          0.8069  0.9891  0.9885  0.8888
OmniAnomaly    0.8130  0.9419  0.9889  0.8728
MSCRED         0.8175  0.9216  0.9821  0.8664
MAD-GAN        0.8157  0.9216  0.9891  0.8654
USAD           0.7480  0.9627  0.9890  0.8419
MTAD-GAT       0.7991  0.9991  0.9844  0.8880
CAE_M          0.8193  0.9567  0.9901  0.8827
GDN            0.7480  0.9891  0.9864  0.8518
TranAD         0.8043  0.9999  0.9921  0.8915
mTranAD        0.8211  1.0000  0.9895  0.9017

WADI
Method         P       R       AUC     F1
MERLIN         0.0636  0.7669  0.5912  0.1174
LSTM-NDT       0.0138  0.7823  0.6721  0.0271
DAGMM          0.0760  0.9981  0.8563  0.1412
OmniAnomaly    0.3158  0.6541  0.8198  0.4260
MSCRED         0.2513  0.7319  0.8412  0.3741
MAD-GAN        0.2233  0.9124  0.8026  0.3588
USAD           0.1873  0.8296  0.8723  0.3056
MTAD-GAT       0.2818  0.8012  0.8821  0.4169
CAE_M          0.2782  0.7918  0.8728  0.4117
GDN            0.2912  0.7931  0.8777  0.4260
TranAD         0.3829  0.8296  0.8968  0.4951
mTranAD        0.6040  0.8296  0.9084  0.6990
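The precision, recall, and F1 metrics of Sect. 4.4 that underlie Table 1 can be computed from binary ground-truth and predicted labels; the following is a minimal illustrative helper, not the authors' evaluation code:

```python
# Precision, recall, and F1 from binary label sequences, following the
# definitions of Eqs. (10)-(12).
def prf1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = prf1([1, 0, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0])  # each equals 2/3 here
```

Note that F1 is the harmonic mean of precision and recall, which is why it is treated as the decisive metric above: it penalizes a model that trades one for the other.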
The effectiveness of anomaly detection is shown in Fig. 2, which presents the predicted and true labels for the NAB test set using the mTranAD model. Purple represents the true anomaly values, green represents the anomaly score, blue represents the true values, and orange represents the predicted values. We can observe that the timestamps with high anomaly scores also have true anomaly values. In other words, when there is
an anomalous fluctuation in the green line, there is a corresponding fluctuation in the blue line, and this fluctuation is also within the purple area.
Fig. 2. Real Prediction Tags for NAB Datasets in the mTranAD Model.
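The anomaly scoring shown in Fig. 2 reduces to computing the deviation between each window and its reconstruction and thresholding it. A minimal sketch follows; a plain empirical quantile stands in for the POT-based threshold of the paper, as an illustrative simplification rather than the authors' implementation:

```python
import math

# Anomaly score s_t for a window Wd: the Euclidean deviation between Wd and
# its reconstruction Od. A high-quantile threshold over the scores stands in
# for the POT-derived threshold (illustrative simplification).
def anomaly_score(window, recon):
    return math.sqrt(sum((w - o) ** 2 for w, o in zip(window, recon)))

def flag_anomalies(scores, q=0.5):
    ranked = sorted(scores)
    threshold = ranked[int(q * (len(ranked) - 1))]
    return [s > threshold for s in scores]

pairs = [([1.0, 2.0], [1.1, 2.0]),   # near-perfect reconstruction
         ([0.0, 0.0], [0.2, 0.1]),   # small deviation
         ([1.0, 2.0], [4.0, 6.0])]   # large deviation, anomalous
scores = [anomaly_score(w, o) for w, o in pairs]
flags = flag_anomalies(scores)       # only the last window is flagged
```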
5 Summary

This paper proposes a method for detecting abnormal data by combining the Transformer with the VAE model based on the idea of adversarial training. The method first normalizes the raw data and handles its missing values, and then trains on the data through the Transformer encoder, VAE resampling, and decoder. After the model is fitted, the test data is inputted, and the reconstruction error is alternately maximized and minimized to ensure training stability. We verify the method using four publicly available datasets: NAB, MBA, SMAP, and WADI. These datasets use multi-sensor time series for anomaly detection. The experimental results show that the mTranAD model has better detection performance for multivariate time series data. The source code will be made available at https://github.com/Liyicong98/mTranAD.git.
Proximal Symmetric Non-negative Latent Factor Analysis: A Novel Approach to Highly-Accurate Representation of Undirected Weighted Networks Yurong Zhong, Zhe Xie, Weiling Li, and Xin Luo(B) School of Computer Science and Technology, Dongguan University of Technology, Dongguan, China [email protected]
Abstract. An undirected weighted network (UWN) is constantly encountered in various big data-related applications. A UWN's topological information can be expressed by a Symmetric, High-dimensional and Incomplete (SHDI) matrix, upon which the representation learning task is essential for knowledge acquisition. However, existing models mostly fail in modeling its intrinsic symmetry or low data density, resulting in a model's weak ability to learn representations of its numerical features. For addressing this vital issue, this study presents a Proximal Symmetric Non-negative Latent-factor-analysis (PSNL) model with three-fold ideas: a) building a proximal term-incorporated, symmetry-aware, and data density-oriented learning objective subject to non-negativity constraints for ensuring its high representation learning ability; b) designing an Alternating Direction Method of Multipliers (ADMM)-based learning scheme for solving the learning objective on the premise of fast convergence; c) implementing self-adaptation of the model's multiple hyper-parameters via the Tree-structured Parzen Estimators (TPE) algorithm, thus enabling its high scalability. Empirical studies on four UWNs from real applications demonstrate that the proposed PSNL model achieves higher representation accuracy than state-of-the-art models do, as well as promising computational efficiency.

Keywords: Undirected Weighted Network · Representation Learning · Symmetric Non-negative Latent Factor Analysis · Alternating Direction Method of Multipliers · Tree-structured Parzen Estimators · Missing Data Estimation
1 Introduction

An Undirected Weighted Network (UWN) is common in big data-related applications like protein networks in protein-protein interaction prediction [1, 19], social networks in community detection [2, 3] and knowledge networks in natural language processing [4, 5]. Note that such a network can be quantified as a Symmetric High-Dimensional and Incomplete (SHDI) matrix whose inherent characteristics are given as follows:

1. Symmetry. It is a topological characteristic, i.e., it is a symmetric matrix;

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 64–76, 2023. https://doi.org/10.1007/978-981-99-4752-2_6
Proximal Symmetric Non-negative Latent Factor Analysis
2. High-Dimensionality. Its node set is large. For example, there are many proteins in the protein-protein interaction network related to a species [1, 19]; and

3. Incompleteness. Most of its entries are unobserved, i.e., it is an extremely sparse matrix. Note that its data are non-negative in most cases, as with gene expression [6].

In spite of its incompleteness, useful knowledge can be extracted from it, such as unfound yet vital protein-protein interactions [1, 19]. Hence, a representation learning model should consider SHDI's inherent characteristics during its modeling. For well representing a non-negative SHDI matrix generated by a UWN, the models proposed in recent studies can mainly be split into three categories:

1) Graph Convolutional Network (GCN)-based models [7, 8]. For example, He et al. [7] discard two common designs in GCN, i.e., feature transformation and nonlinear activation, thereby achieving a simplified and more appropriate model. Kong et al. [8] propose a GCN whose design includes both linear and non-linear embedding propagation, chosen by a gating module for better accuracy gain. Though these models are able to extract nonlinear features from a non-negative SHDI matrix, they do not consider its non-negativity and symmetry characteristics;

2) Symmetric Non-negative Matrix Factorization (SNMF)-based models [3, 9, 10]. He et al. [3] rescale the multiplicative term of a Non-negative and Multiplicative Update (NMU) algorithm by combining the initial state of each involved Latent Factor (LF) with the multiplicative term, thereby proposing a novel NMU algorithm for efficiently solving an SNMF problem. Yang et al. [9] incorporate graph regularization into an SNMF model's learning objective for well describing the local topology of an undirected network. Li et al. [10] adopt a penalized term to build a symmetry-aware NMF model and then design an accelerated Hierarchical Alternating Least Squares (HALS) algorithm to efficiently solve the proposed model's learning objective.
Hence, these models are able to precisely represent the symmetry and non-negativity of an SHDI matrix. However, they do not consider the incompleteness of an SHDI matrix, i.e., they need to prefill its missing data before their training, thereby resulting in unnecessary storage and computational cost, as well as loss of information; and

3) Non-negative Latent Factor (NLF)-based models [11, 12]. A standard NLF model's learning objective, subject to non-negativity constraints, is defined on the known data of an incomplete matrix only, and then a single LF-dependent NMU algorithm is designed to solve the learning objective with high efficiency in terms of storage and computation [11]. For improving NLF's generalization ability, Luo et al. [12] further propose an improved NLF model by connecting the imbalanced information of the target incomplete matrix with its regularization effects. Hence, these models consider an SHDI matrix's incompleteness and non-negativity, but they cannot precisely describe its symmetry.

Hence, the above-mentioned models fail in modeling an SHDI matrix's intrinsic symmetry or low data density. Moreover, they consume unnecessary computational cost in manually tuning their multiple hyper-parameters for good performance. For addressing these crucial issues, this paper proposes a Proximal Symmetric Non-negative Latent-factor-analysis (PSNL) model. With it, this paper aims to answer the following questions:

1. How to implement a highly-accurate model with fast convergence on an SHDI matrix following the main principle of the Alternating Direction Method of Multipliers (ADMM)?
Y. Zhong et al.
2. How to implement hyper-parameter self-adaptation in PSNL for enabling its high scalability; and

3. Can it achieve fine performance on an SHDI matrix generated from a UWN?

By answering the above questions, this paper has achieved the following contributions:

1. A PSNL model. It is an adaptive and highly-accurate model via the following improvements: 1) proximal term-incorporated modeling; 2) consideration of SHDI's inherent characteristics; 3) an ADMM-incorporated learning scheme; and 4) hyper-parameter self-adaptation implemented by the Tree-structured Parzen Estimators (TPE) algorithm;

2. Detailed algorithm design and analysis for a PSNL model are provided;

3. Empirical studies on four SHDI matrices generated from UWNs demonstrate that a PSNL model achieves higher representation accuracy than the state-of-the-art models do, as well as promising computational efficiency.

The remaining parts of this paper are organized as follows. Section 2 introduces the preliminaries. Section 3 presents the PSNL model. Section 4 gives the experimental results. Finally, Sect. 5 concludes this paper.
2 Preliminaries

2.1 Notations

Table 1 summarizes the adopted symbols of this manuscript.

Table 1. Adopted symbols and their descriptions.

Symbol                                   Description
U                                        Node set of a UWN
f                                        Dimension of the LF space
Y^{|U|×|U|}                              An SHDI matrix
Ŷ^{|U|×|U|}                              Rank-f approximation to Y
Λ, Γ                                     Known and unknown entry sets of Y
Λ(m)                                     The subset of Λ related to ∀m ∈ {1,2,…,|U|}
A^{|U|×f}, B^{|U|×f}, X^{|U|×f}          Optimization parameters
W^{|U|×f}                                Lagrangian multiplier
y_{m,n}, ŷ_{m,n}                         Single elements in Y and Ŷ, ∀m, n ∈ {1,2,…,|U|}
a_{m,d}, b_{n,d}, x_{m,d}, w_{m,d}       Single elements in A, B, X, W, ∀m, n ∈ {1,2,…,|U|}
λ                                        Regularization coefficient
α_u                                      Augmentation coefficient
μ                                        Proximal coefficient
η                                        Rescaling coefficient of the learning rate
ε                                        Augmented Lagrangian function
k                                        Training iteration count
2.2 Problem Formulation

Given a UWN's node set U, an SHDI matrix filled with non-negative data is taken as the input.

Definition 1 (A non-negative SHDI matrix): Given U, Y describes certain interactions, whose values are non-negative, among its nodes. Given Y's known entry set Λ and unknown one Γ, Y is a non-negative SHDI matrix if |Λ| ≪ |Γ|.

Definition 2 (An NLF model): Given Y and Λ, an NLF model [11] builds a data density-oriented learning objective for acquiring the rank-f approximation ŷ_{m,n} = Σ_{d=1}^{f} a_{m,d} b_{n,d}, where a_{m,d} ≥ 0 and b_{n,d} ≥ 0.

Due to the ill-posedness caused by extracting non-negative LFs from an incomplete matrix, an L2-norm-based regularization [12, 20] is adopted for avoiding overfitting. Hence, with Euclidean distance [3, 9, 10, 21], its learning objective F is given as:

F = (1/2) Σ_{y_{m,n}∈Λ} ( ( y_{m,n} − Σ_{d=1}^{f} a_{m,d} b_{n,d} )² + λ Σ_{d=1}^{f} ( a_{m,d}² + b_{n,d}² ) ),   (1)

s.t. ∀m, n ∈ {1,2,…,|U|}, d ∈ {1,2,…,f}: a_{m,d} ≥ 0, b_{n,d} ≥ 0,

where the regularization coefficient λ is positive.
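Objective (1) accumulates the regularized squared error over the known entries only, which is what makes the model data density-oriented. A minimal pure-Python sketch follows; storing the known entries as a dict from index pairs to values is an assumption made here for illustration, not the authors' data structure:

```python
# Data-density-oriented loss of Eq. (1): the squared error plus L2
# regularization is summed over the known entries of Y only, never over
# the missing ones.
def nlf_loss(known, A, B, lam):
    total = 0.0
    for (m, n), y in known.items():
        pred = sum(a * b for a, b in zip(A[m], B[n]))          # rank-f estimate
        reg = sum(a * a for a in A[m]) + sum(b * b for b in B[n])
        total += 0.5 * ((y - pred) ** 2 + lam * reg)
    return total

known = {(0, 1): 1.0, (1, 0): 1.0}        # a tiny symmetric, sparse Y
A = [[0.5, 0.5], [0.5, 0.5]]
B = [[0.5, 0.5], [0.5, 0.5]]
loss = nlf_loss(known, A, B, lam=0.1)     # 2 * 0.5 * (0.25 + 0.1) = 0.35
```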
3 The Proposed PSNL

3.1 Learning Objective

As shown in (1), an NLF model is not designed for Y's symmetry. To address this issue, setting A = B in (1) transforms NLF into a symmetry-aware model. Hence, the learning objective F can be reformulated as:

F = (1/2) Σ_{y_{m,n}∈Λ} ( ( y_{m,n} − Σ_{d=1}^{f} a_{m,d} a_{n,d} )² + λ Σ_{d=1}^{f} ( a_{m,d}² + a_{n,d}² ) ),   (2)

s.t. ∀m, n ∈ {1,2,…,|U|}, d ∈ {1,2,…,f}: a_{m,d} ≥ 0, a_{n,d} ≥ 0.

With (2), a symmetry-aware NLF model is achieved. Then, according to the previous studies [13, 22], a non-negative constraint applied directly to the output LFs affects the model's representation accuracy. Hence, we introduce X^{|U|×f} into (2) to separate the non-negative constraint from the generalized loss:

F = (1/2) Σ_{y_{m,n}∈Λ} ( ( y_{m,n} − Σ_{d=1}^{f} x_{m,d} x_{n,d} )² + λ Σ_{d=1}^{f} ( x_{m,d}² + x_{n,d}² ) ),   (3)

s.t. ∀m, n ∈ {1,2,…,|U|}, d ∈ {1,2,…,f}: a_{m,d} ≥ 0, a_{n,d} ≥ 0, X = A.

In (3), the generalized loss is achieved with X, while the non-negative constraints are applied to A, i.e., a_{m,d} ≥ 0 and a_{n,d} ≥ 0, ∀m, n ∈ {1,2,…,|U|} and d ∈ {1,2,…,f}. Hence, introducing X = A into (2) facilitates the minimization of the generalized loss. According to the main principle of ADMM [13, 22], the Lagrangian multiplier matrix W^{|U|×f} is introduced for the equality constraint X = A, thereby achieving the following augmented Lagrangian function ε:

ε = (1/2) Σ_{y_{m,n}∈Λ} ( ( y_{m,n} − Σ_{d=1}^{f} x_{m,d} x_{n,d} )² + λ Σ_{d=1}^{f} ( x_{m,d}² + x_{n,d}² ) ) + Σ_{u=1}^{|U|} Σ_{d=1}^{f} w_{u,d} ( x_{u,d} − a_{u,d} ) + Σ_{u=1}^{|U|} (α_u/2) Σ_{d=1}^{f} ( x_{u,d} − a_{u,d} )²,   (4)

s.t. ∀u ∈ {1,2,…,|U|}, d ∈ {1,2,…,f}: a_{u,d} ≥ 0,

where α_u controls the augmentation effects and is set as α_u = γ|Λ(u)|. Note that |Λ(u)| denotes the size of Λ's subset related to ∀u ∈ {1,2,…,|U|}, and γ is used to rescale the augmentation effects for the constraint X = A. Hence, considering α_u, the augmentation effects are connected with Λ's imbalanced distribution, i.e., each row/column corresponds to a different known entry count. Afterwards, following the previous study [14], proximal terms are able to improve the representation learning ability of a model designed by the main principle of ADMM. Hence, we introduce a proximal term related to X into (4), thereby reformulating (4) as
follows:

ε = ε(X^k, A^k, W^k) + (μ/2) Σ_{u=1}^{|U|} Σ_{d=1}^{f} ( x_{u,d} − x_{u,d}^k )²,   (5)

s.t. ∀u ∈ {1,2,…,|U|}, d ∈ {1,2,…,f}, a_{u,d} ∈ A: a_{u,d} ≥ 0,

where X^k, A^k and W^k denote the statuses of X, A and W at the k-th iteration, and the proximal coefficient μ is positive. From (5), we evidently see that the proximal term confines the distance between the current status and the previous status of x_{u,d}, i.e., it is able to take the model's historical training information into consideration, thereby achieving the model's high representation learning ability. With (5), a PSNL model is achieved.

3.2 ADMM-incorporated Learning Scheme

Following solution algorithms of an augmented Lagrangian function [13, 22], element-wise alternating least squares is firstly adopted for updating the optimized parameters from X and A in PSNL, ∀m, n ∈ {1,2,…,|U|}, d ∈ {1,2,…,f}. Hence, we have:

x_{m,d}^{k+1} = ( Σ_{n∈Λ(m)} x_{n,d}^{k} ( y_{m,n} − Σ_{l=1,l≠d}^{f} x_{m,l}^{k} x_{n,l}^{k} ) + α_m a_{m,d}^{k} − w_{m,d}^{k} + μ x_{m,d}^{k} ) / ( Σ_{n∈Λ(m)} ( x_{n,d}^{k} )² + λ + α_m + μ ),   (6a)

a_{m,d}^{k+1} = max( 0, x_{m,d}^{k+1} + w_{m,d}^{k} / α_m ).   (6b)
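Updates (6a) and (6b), together with the dual ascent step (7) that follows, can be sketched as one pure-Python sweep over a latent dimension. This is an illustrative simplification, not the authors' implementation: all reads use the frozen iteration-k copy of X, which slightly simplifies the subtask sequencing of the full scheme.

```python
# One column-wise PSNL sweep for latent dimension d: primal updates (6a)-(6b)
# and the dual gradient-ascent step (7). Y_known maps each node m to its known
# entries {n: y_mn} (the set Lambda(m)); alpha[m] = gamma * |Lambda(m)|.
def psnl_sweep(Y_known, X, A, W, d, lam, alpha, mu, eta):
    Xk = [row[:] for row in X]                         # X at iteration k (frozen)
    f = len(Xk[0])
    for m, neigh in Y_known.items():                   # (6a)
        num = alpha[m] * A[m][d] - W[m][d] + mu * Xk[m][d]
        den = lam + alpha[m] + mu
        for n, y in neigh.items():
            resid = y - sum(Xk[m][l] * Xk[n][l] for l in range(f) if l != d)
            num += Xk[n][d] * resid
            den += Xk[n][d] ** 2
        X[m][d] = num / den
    for m in Y_known:
        A[m][d] = max(0.0, X[m][d] + W[m][d] / alpha[m])   # (6b): non-negative truncation
        W[m][d] += eta * alpha[m] * (X[m][d] - A[m][d])    # (7): dual ascent

# A tiny symmetric example: two nodes, one known interaction y_01 = y_10 = 1.
Y_known = {0: {1: 1.0}, 1: {0: 1.0}}
X, A, W = [[0.5], [0.5]], [[0.5], [0.5]], [[0.0], [0.0]]
psnl_sweep(Y_known, X, A, W, d=0, lam=0.01, alpha={0: 1.0, 1: 1.0}, mu=0.1, eta=1.0)
```

By symmetry of the toy input, both rows of X receive the same value, and the truncation in (6b) guarantees the output LFs in A stay non-negative.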
Note that (6b) adopts a non-negative truncation method for ensuring a_{m,d} ≥ 0. Then, the optimized parameters from W are updated via dual gradient ascent, thereby achieving the following update rule:

w_{m,d}^{k+1} = w_{m,d}^{k} + ηα_m ( x_{m,d}^{k+1} − a_{m,d}^{k+1} ),   (7)

where the rescaling coefficient of the learning rate η is positive. Afterwards, ∀d ∈ {1,2,…,f}, following a standard ADMM-incorporated learning sequence [13, 22], PSNL splits the whole training task into f disjoint subtasks, and then designs the following sequence for the d-th subtask, which is composed of three jobs:

1. Job One:
X_{·,d}^{k+1} ← argmin_{X_{·,d}} ε( X_{·,1}^{k+1}, …, X_{·,(d−1)}^{k+1}, X_{·,d}, X_{·,(d+1)}^{k}, …, X_{·,f}^{k}, A_{·,d}^{k}, W_{·,d}^{k} ), solved via (6a);   (8a)
2. Job Two:

A_{·,d}^{k+1} ← argmin_{A_{·,d} ≥ 0} ε( X_{·,d}^{k+1}, A_{·,d}, W_{·,d}^{k} ), solved via (6b);   (8b)
3. Job Three:

W_{·,d}^{k+1} ← W_{·,d}^{k} + ηα_m ∇_{W_{·,d}} ε( X_{·,d}^{k+1}, A_{·,d}^{k+1}, W_{·,d}^{k} ), solved via (7),   (8c)
where (8) is designed with the following principle: 1) the involved optimized parameters from X, A and W are updated in a column-wise way; 2) X's optimized parameters in each subtask are updated based on the previously solved ones; and 3) the updates of the optimized parameters from A or W in each subtask follow a standard ADMM process [13, 22].

3.3 Hyper-parameter Adaptation

It is crucial to find a model's optimal hyper-parameters for achieving good performance [16]. According to previous studies [17], compared with manual tuning, Bayesian Optimization (BO) is a convenient and better way to automatically search for the optimal hyper-parameters of a model. Hence, we adopt one kind of BO method, i.e., the commonly-adopted Tree-structured Parzen Estimators (TPE) algorithm, to implement the adaptation of PSNL's hyper-parameters, i.e., λ, γ, μ and η. Let s = {λ, γ, μ, η} and let b denote the loss computed from y_{m,n} and ŷ_{m,n} with s. Given the observation set C = {(s_1, b_1), (s_2, b_2), …, (s_β, b_β)}, TPE defines p(s|b) via the following densities:

p(s|b) = l(s) if b < b*;   p(s|b) = g(s) if b ≥ b*,   (9)

where b* is chosen as the θ-quantile of b. With (9), TPE can separate C into two parts, each of which is an interval in the space of b whose probability density function is being estimated. Then, TPE seeks to optimize the following Expected Improvement (EI):

EI_{b*}(s) = ∫_{−∞}^{b*} (b* − b) p(b|s) db.   (10)

With p(b < b*) = θ, we have

EI_{b*}(s) ∝ ( θ + (g(s)/l(s)) (1 − θ) )^{−1}.   (11)

Following (11), TPE selects the candidate s with the greatest EI and adds the result into C. When the gap of the model's loss between the current iteration and the latest iteration is smaller than a certain value like 10^{−6}, TPE returns the s related to the minimal b value in C. Note that TPE is implemented by the "Hyperopt" software package in [17].
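The TPE procedure of (9)-(11) can be illustrated with a toy one-dimensional sketch: the observation set C is split at the θ-quantile of the loss, Parzen (Gaussian-kernel) densities stand in for l(s) and g(s), and the candidate maximizing the ratio l(s)/g(s), i.e., maximizing EI in (11), is suggested next. This is a deliberate simplification of the Hyperopt implementation [17]:

```python
import math

def parzen(points, s, bw=0.1):
    # Gaussian-kernel (Parzen) density estimate at s.
    norm = len(points) * bw * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((s - p) / bw) ** 2) for p in points) / norm

def tpe_suggest(observations, candidates, theta=0.15):
    # Split C at the theta-quantile b* of the losses: l(s) models the "good"
    # trials (b < b*), g(s) the remaining ones (b >= b*).
    obs = sorted(observations, key=lambda sb: sb[1])
    cut = max(1, int(theta * len(obs)))
    good = [s for s, _ in obs[:cut]]
    bad = [s for s, _ in obs[cut:]]
    # Maximizing l(s)/g(s) maximizes the EI surrogate (11).
    return max(candidates, key=lambda s: parzen(good, s) / max(parzen(bad, s), 1e-12))

# Toy quadratic loss with minimum at s = 0.3, observed on a uniform grid.
obs = [(i / 20, (i / 20 - 0.3) ** 2) for i in range(21)]
best = tpe_suggest(obs, candidates=[0.05, 0.3, 0.6, 0.95])   # suggests 0.3
```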
4 Experimental Results and Analysis 4.1 General Settings Tasks and Performance Metrics. The main task of this paper is to estimate missing data of an SHDI matrix generated from a UWN. For example, given a protein-protein
interaction network, our task aims to estimate the unobserved interactions of an SHDI matrix generated by such a network. Hence, the metric that validates the prediction accuracy of an involved model is the root mean squared error (RMSE) [18]. Note that a low RMSE stands for a model's high prediction accuracy for missing data. Meanwhile, the time cost of each involved model is recorded for testing its computational efficiency.

Data. Four SHDI matrices are adopted in our experiments, and their details are given in Table 2. Each SHDI matrix's known entry set is randomly split into ten disjoint subsets for tenfold cross-validation. More specifically, each time seven subsets are chosen as the training set, one is chosen as the validation set, and the remaining two are chosen as the test set. We repeat this process ten times to achieve the finally averaged results.

Settings. The following settings are utilized to achieve fair comparisons:

1. Settings of TPE's hyper-parameters in PSNL are the same as the default settings of the software package Hyperopt [17], i.e., β is initialized as 30 and θ = 0.15;

2. The LF dimension f is set at 20;
Table 2. SHDI matrices generated from UWNs.

No.  Type       |Λ|        |U|     Density (|Λ|/|U|²)  Source
D1   Protein    1,120,028  7,963   1.77%               [15]
D2   Protein    1,021,786  4,181   5.85%               [15]
D3   Material   322,905    13,965  0.17%               [4]
D4   Knowledge  57,002     2,427   0.97%               [4]
3. The training process of each model terminates if: 1) the iteration count reaches a threshold, i.e., 1000; and 2) the RMSE gap in two consecutive iterations gets smaller than 10^{−5}.

4.2 An Ablation Study

This set of experiments aims to discuss the effects of PSNL's λ, γ, μ, η and TPE, respectively. First of all, we manually tune λ, γ, μ and η with grid-search, as shown in Figs. 1–4, where k denotes PSNL's converging iteration count. From them, we have the following findings:

(1) PSNL's performance relies heavily on its choice of λ, γ and μ. As shown in Figs. 1 and 3, as the value of λ or μ becomes too large, PSNL's RMSE rises dramatically. Moreover, as shown in Fig. 2, as γ is set to an inappropriate value, PSNL suffers from great accuracy loss. Meanwhile, the converging iteration count of PSNL varies along with different choices of λ, γ and μ on different datasets.

(2) η slightly affects PSNL's performance. For example, as shown in Fig. 4(c), PSNL's RMSE almost keeps consistent with the same converging iteration count as η varies
on D3. Meanwhile, the effects of η are data-dependent, i.e., the choice of η affects PSNL’s RMSE and converging iteration count on D1, D2, and D4, yet such effects are not great, as shown in Figs. 4(a), (b) and (d).
Fig. 1. Effects of λ
Fig. 2. Effects of γ
Fig. 3. Effects of μ
Fig. 4. Effects of η
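The manual grid-search behind Figs. 1–4 amounts to an exhaustive sweep over candidate values of the four hyper-parameters. A minimal sketch follows; the objective passed in is a hypothetical stand-in for training PSNL with one setting and returning its validation RMSE:

```python
import itertools

# Exhaustive grid-search over (lambda, gamma, mu, eta); the train_and_eval
# callable is a hypothetical placeholder returning the validation RMSE of
# one hyper-parameter setting.
def grid_search(train_and_eval, grids):
    best_rmse, best_setting = float("inf"), None
    for setting in itertools.product(*grids.values()):
        rmse = train_and_eval(dict(zip(grids.keys(), setting)))
        if rmse < best_rmse:
            best_rmse, best_setting = rmse, setting
    return best_setting, best_rmse

grids = {"lam": [2 ** -18, 2 ** -10, 2 ** -5],
         "gamma": [2 ** -10, 2 ** -4],
         "mu": [2 ** -8, 1.0],
         "eta": [2 ** -5, 1.0]}
# Dummy objective purely for illustration: it prefers the smallest values.
setting, rmse = grid_search(lambda s: sum(s.values()), grids)
```

Such a sweep trains one model per grid point, which is exactly the expensive manual effort that the TPE-based self-adaptation of Sect. 3.3 is designed to avoid.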
Hence, it can be clearly found that PSNL's λ, γ, μ and η should be tuned with care for achieving good performance. Note that the uniform ranges of PSNL's λ, γ, μ and η in TPE on D1–4 are (2^{−18}, 2^{−5}) for λ, (2^{−10}, 2^{−4}) for γ, (2^{−8}, 1) for μ and (2^{−5}, 1) for η, respectively. For presenting the significance of TPE in PSNL, Figs. 5 and 6 describe the effects of TPE in PSNL. From them, TPE is able to greatly reduce its total time cost
Fig. 5. Effects of TPE in RMSE
Fig. 6. Effects of TPE in total time cost
Fig. 7. Symmetry validation results of M1–8 on D3.
with a slight accuracy loss (D1–3) or even better accuracy gain (D4), which contributes to the automatic tuning of PSNL’s λ, γ , μ and η with their uniform range on D1–4, i.e., avoiding expensive time cost caused by their manual grid-search. 4.3 Comparison Against State-of-the-Art Models This set of experiments compares the proposed PSNL model (marked as M8) with the following state-of-the-art models: NMFC [13], NIR [12], β-SNMF [3], GSNMF [9], HSNMF [10], LightGCN [7] and HMLET [8] (marked as M1–7). Figure 7 gives symmetry validation results of M1–8 on D3, and similar results can be found on D1, D2 and D4. Tables 3 and 4 summarize M1–8’s RMSE and time cost on D1–4, respectively. From them, we have the following findings:
Table 3. RMSE of M1–8 on D1–4.

No.   D1              D2              D3              D4
M1    0.1329±3.8E–4   0.1334±4.1E–4   0.0821±2.1E–5   0.0971±1.3E–3
M2    0.1275±7.9E–4   0.1292±4.2E–4   0.0751±3.8E–4   0.0705±5.2E–4
M3    0.1652±6.3E–4   0.1543±2.8E–4   0.0741±1.1E–5   0.1093±6.5E–5
M4    0.1813±1.8E–4   0.1576±1.5E–4   0.0764±8.4E–5   0.1255±1.5E–3
M5    0.1611±3.7E–4   0.1518±4.0E–5   0.0762±5.1E–5   0.0936±5.1E–5
M6    0.1297±1.4E–4   0.1318±1.6E–4   0.1037±2.2E–4   0.0928±1.1E–3
M7    0.1358±2.7E–4   0.1350±5.1E–4   0.0960±3.9E–5   0.1018±4.9E–3
M8    0.1266±9.1E–5   0.1278±6.0E–5   0.0736±1.5E–4   0.0631±2.8E–4
Table 4. Time cost of M1–8 on D1–4 (Sec.). Per: time cost per iteration; Total: total time cost.

No.          D1               D2               D3               D4
M1  Per      7.31±0.85        1.89±0.13        20.84±2.69       0.66±7.3E–2
    Total    138,649±11,031   24,886±1,992     403,176±24,296   14,805±1,191
M2  Per      0.34±4.5E–2      0.26±3.8E–2      0.13±3.8E–2      1.9E–2±2.7E–3
    Total    8,494±805        5,372±480        1,560±162        421±60
M3  Per      5.25±0.42        1.48±9.6E–2      15.78±1.98       0.41±6.2E–2
    Total    12,757±1,265     2,564±208        22,392±2,068     1,195±138
M4  Per      9.54±1.03        2.28±0.27        25.54±2.68       0.83±7.7E–2
    Total    57,524±5,317     10,708±949       151,346±14,734   6,602±721
M5  Per      4.94±0.35        1.44±9.3E–2      14.47±1.82       0.35±5.7E–2
    Total    1,795±263        545±63           1,234±316        159±17
M6  Per      7.61±0.78        6.12±0.59        1.12±0.14        0.23±3.5E–2
    Total    78,015±5,945     61,179±7,104     8,393±718        2,195±173
M7  Per      20.73±2.85       17.01±1.65       3.26±0.37        0.32±4.3E–2
    Total    255,124±20,643   285,698±21,856   5,994±534        8,487±752
M8  Per      1.28±9.3E–2      0.92±8.7E–2      0.41±5.6E–2      2.5E–2±3.4E–3
    Total    3,753±351        2,389±195        887±77           414±38.11
(1) PSNL is able to precisely describe the symmetry of an SHDI matrix. This finding is supported by Fig. 7(h), since M8's predicted data distribution strictly keeps consistent with the line ŷ_{m,n} = ŷ_{n,m}. (2) PSNL's representation accuracy is significantly higher than that of state-of-the-art models. This finding is supported by Table 3. M8's highest accuracy gain on
D1–4 contributes to its full consideration of SHDI's inherent characteristics and its incorporation of the proximal term. (3) PSNL's computational efficiency is promising. From Table 4, M8's total time cost is the least on D3, and it has the second least total time cost on D1, D2 and D4, i.e., M8's computational efficiency is promising, which contributes to its data density-oriented modeling (the reason for M8's low time cost per iteration) and its adaptive training process.
5 Conclusion

This paper proposes a PSNL model whose learning objective fully considers an SHDI matrix's inherent characteristics and incorporates a proximal term. Then an adaptive ADMM-incorporated algorithm is designed for solving the learning objective with high efficiency. Empirical studies indicate that a PSNL model achieves higher representation accuracy than state-of-the-art models do, as well as promising computational efficiency. The research on representation of an SHDI matrix generated from a UWN remains in its infancy, and its applications have great potential, which is worthy of further investigation in the future.

Acknowledgments. This work was supported by the Guangdong Basic and Applied Basic Research Foundation under Grants 2022A1515110579 and 2021B1515140046.
Information Extraction System for Invoices and Receipts

QiuXing Michelle Tan1, Qi Cao2(B), Chee Kiat Seow2, and Peter Chunyu Yau2

1 Computing Science, Singapore Institute of Technology - University of Glasgow, Singapore, Singapore
2 School of Computing Science, University of Glasgow, Glasgow, Scotland, UK
[email protected]
Abstract. Rapid growth in the digitization of documents, such as paper-based invoices or receipts, has amplified the demand for methods to process information accurately and efficiently. It has become impractical for humans to extract the data manually, as it is labor-intensive and time-consuming. Digital documents contain various components such as tables, key-value pairs and figures. Existing optical character recognition (OCR) methods can recognize texts, but it is challenging to extract the key-value pairs in unformatted digital invoices or receipts. Hence, developing an information extraction system with intelligent algorithms would be beneficial, as it can increase workflow efficiency for knowledge discovery and data recognition. In this paper, a pipeline of an information extraction system is proposed with intelligent computing and deep learning approaches, which first classifies key-value entities and then links the key-value pairs. Two key-value pairing rules are developed in the proposed pipeline. Various experiments with intelligent algorithms are conducted to evaluate the performance of the pipeline of the information extraction system.

Keywords: Information Extraction · Key-Value Pairs · Knowledge Discovery · Documents Digitization · Intelligent Algorithms
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 77–89, 2023. https://doi.org/10.1007/978-981-99-4752-2_7

1 Introduction

On a daily basis, many organizations deal with large numbers of paper documents such as receipts or invoices [1]. Currently, most documents must be processed manually, which is time-consuming and expensive [2]. One of the most difficult aspects of invoice processing for logistics organizations is the time-consuming, labor-intensive in-house procedure, which requires a great deal of manual work to extract and key data into various internal software systems. In addition, capturing data from logistics invoices poses a particular difficulty because they are received in multiple formats. This is a significant issue due to the intricacy of the invoices. Manually inspecting and correcting scanning errors greatly lengthens the correction and processing time [3].

Digital invoices contain various components such as tables, key-value pairs and figures. A key-value pair is made up of two connected data elements: a key which is a
constant defining the data set; and a value which is a variable that is part of the set. A fully formed key-value pair looks like "Gross Weight: 123 kg", where "Gross Weight" is the key and "123 kg" is the value. A corporation can benefit from digitising and extracting key information on a number of levels. Business owners can better track their processes, provide better customer services, increase employee productivity, and cut costs.

Optical character recognition (OCR), also known as text recognition, is the process of extracting text information from scanned documents and images. Current OCR systems, such as Tesseract [4] and EasyOCR, can recognize raw text in unformatted digital documents or images, but they are not capable of extracting information such as key-value pairs from unformatted data. Key-value pairs are the most significant components in digital cargo invoices, and they make raw texts more understandable. Existing solutions, such as Amazon Web Services (AWS) Textract, provide such a service [5], but off-the-shelf solutions may not fit specific domains and lack the flexibility to be customized or tailored for different requirements. Moreover, AWS Textract is a service hosted on a cloud platform: users need to upload their digital invoices and receipts to the cloud in order to get them processed. This raises privacy concerns, as some companies might treat the data as private and business sensitive. Hence, automating the extraction of key-value pairs from unformatted digital invoices is proposed, as it can significantly reduce manpower and costs while ensuring the reliability of the retrieved data. Deep learning algorithms have been implemented in various applications with great success [6, 7]. In this paper, an information extraction system with deep learning approaches that is capable of extracting key-value pairs is presented to improve the overall performance.
The proposed information extraction system would be beneficial as it would help companies achieve workflow efficiency and resource utilization, and eliminate costly errors. The remainder of this paper is organized as follows. Section 2 presents prior works in the literature. Section 3 explains the pipeline of the proposed information extraction system. Section 4 analyses and discusses the experiment results. Lastly, Sect. 5 concludes this paper.
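As a minimal illustration of the target structured output (a sketch, not the system's implementation), a colon-delimited OCR line such as "Gross Weight: 123 kg" should be turned into a (key, value) pair:

```python
def split_key_value(line: str):
    """Split an OCR text line of the form 'Key: value' into a (key, value)
    pair; return None when the line holds no colon-delimited pair."""
    key, sep, value = line.partition(":")
    if not sep or not key.strip() or not value.strip():
        return None
    return key.strip(), value.strip()

print(split_key_value("Gross Weight: 123 kg"))  # -> ('Gross Weight', '123 kg')
print(split_key_value("INVOICE"))               # -> None
```

Real invoices are far messier than this sketch suggests, which is why the paper explores learned classification and layout-aware linking rather than string splitting alone.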
2 Related Work

There are three different techniques for key-value pair extraction: regular expressions, natural language processing (NLP) and layout detection. OCR quality affects the results of key-value pair extraction, as misrecognized words might be wrongly classified; hence, OCR plays a crucial role in key-value pair extraction. Kumar et al. [8] use Tesseract OCR on images of bill receipts taken with a mobile phone to extract the text information, with some image pre-processing such as binarization and removal of shadows. OCR has also been used to extract text information from scanned documents in [9]. Similarly, the scanned invoices are pre-processed by sharpening the images, thresholding and binarization.
2.1 Key-Value Pair Extraction Using Regular Expression

Regular expressions are patterns used to match character combinations in strings, for example to find text followed by a colon; if the regular expression finds such a word, it is treated as a key. A key-value searching system is developed with the open-source Tesseract OCR engine and post-processing techniques with regular expressions [10]. This method can be used to create a low-cost office automation system for invoice processing. It learns patterns in the dataset, collates all the different types of patterns and puts them into a pattern dictionary. The pattern dictionary can be expanded over time to learn a wider variety of patterns, as more datasets are included and more patterns are added.

Using regular expressions to extract key-value pairs is efficient for finding specific patterns or text. However, when a new text or pattern is introduced, the system does not understand it and cannot extract it. Additionally, some keys and values may appear in a variety of different patterns, so there is a need to manually look through the dataset and update the patterns to extract the key-value pairs correctly. Furthermore, with many patterns declared, the readability and performance of the code degrade.

2.2 Key-Value Pair Extraction Using NLP

Bidirectional Encoder Representations from Transformers (BERT) [11] utilizes the Transformer, an attention mechanism that discovers contextual relationships between words in a text. The Transformer has two independent working parts: an encoder that reads text inputs, and a decoder that generates a task prediction. BERT has two training tasks: Masked Language Model (MLM) and Next Sentence Prediction. The Robustly Optimized BERT Pretraining Approach (RoBERTa) [12] is another approach for pretraining NLP models with an architecture similar to BERT, trained with bigger batch sizes and longer sequences.
BERT is extended to another model, StructBERT, by incorporating language structures into pre-training [13] and leveraging structural information in addition to the existing masking method. Two structural objectives are added to model pre-training, focusing on inner-sentence and inter-sentence structures. This allows StructBERT to represent language structures explicitly and to reconstruct the correct order of words and sentences for accurate predictions. The NLP approach only takes in the text from the document and does not incorporate the position of the text. In the task of extracting key-value pairs, the position of the text is useful; however, StructBERT is able to understand the key-value pair structure when used together with layout-awareness algorithms to improve the accuracy.

2.3 Key-Value Pair Extraction Using Layout Detection

LayoutLM [14] is reported to perform labeling using positional information, text-based information, and image information. As an upgrade of LayoutLM, LayoutLMv2 pre-trains text, layout and image in a multi-modal framework [15]. Unlike other Visually-rich Document Understanding (VrDU) approaches, which aim to analyze scanned documents, LayoutLMv2 helps learn cross-modality interaction and
incorporates a spatially-aware self-attention mechanism into the Transformer design. It comprehends the relative positioning relationships between different text blocks. The performance of LayoutLMv2 is studied with multiple datasets, including the open-source FUNSD dataset, which consists of different scanned forms [16]. Another approach, LAMBERT, is reported in [17]; it tackles the challenges of comprehending documents where a non-trivial layout affects the local semantics. It is a Layout-Aware Language Model which combines NLP methods with layout-understanding mechanisms. LAMBERT uses the layout information of the document and trains with the pretrained RoBERTa. Influenced by LayoutLM, the pre-trained approach StructuralLM utilizes cells and document layouts from scanned documents [18]. It uses cell-level 2D-position embeddings to represent the layout information of cells, and it introduces a cell-position classification task which attempts to predict cell locations and their semantic relationships.
3 Methodology

3.1 Proposed Pipeline of the Information Extraction System

The flow chart of the proposed information extraction system is shown in Fig. 1. It consists of two portions: key-value pair classification and linking of the key-value pairs. The key-value pair classification portion explores models to classify the text from OCR into "Question", "Answer" and "Others". The second portion links the key-value pairs with the use of layout spatial awareness such as the bounding box positions.

To begin the pipeline of the proposed system, data collection is performed by taking images of invoices or receipts, with a quality check to discard blurry images. After that, the invoices are annotated into the FUNSD format [16] using an open-source tool, Banksy [19], which outputs the Named Entity Recognition (NER) labels, the Named Entity Linking (NEL), and a box region on the image. The NER labels each text as "Question", "Answer" or "Others". The NEL is the task of linking the "Question" text to the "Answer" text. Lastly, a bounding box is drawn for each region of text. The next step is to perform data cleaning and data preprocessing to format the data into a model-ready form, splitting the dataset into a train set (80%) and a test set (20%).

For the key-value pair classification step in the pipeline, different intelligent algorithms are evaluated and the best-performing model is chosen. For the linking of key-value pairs step, both the regular expression channel and the pairing-by-nearby-bounding-boxes channel are evaluated, and the better-performing channel is selected to output the final key-value pairs. The output is then compared with the ground truth. If the performance of the output from the proposed information extraction pipeline is not satisfactory, it goes through another iteration to improve the accuracy with different model hyperparameters.
The iterative process continues until a satisfactory accuracy is achieved, and the optimal model is selected by the proposed pipeline.
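An annotated record produced by this step looks roughly like the following sketch. The field names follow the public FUNSD schema; the values and the small NEL-resolution snippet are illustrative, not the actual tool output:

```python
# Illustrative FUNSD-style annotation for one question-answer pair.
annotation = {
    "form": [
        {
            "id": 0,
            "text": "Gross Weight:",
            "label": "question",
            "box": [10, 100, 90, 112],   # [x0, y0, x1, y1]
            "linking": [[0, 1]],         # question 0 -> answer 1
        },
        {
            "id": 1,
            "text": "123 kg",
            "label": "answer",
            "box": [95, 100, 130, 112],
            "linking": [[0, 1]],
        },
    ]
}

# NEL: resolve each link into a (question text -> answer text) mapping.
by_id = {entry["id"]: entry for entry in annotation["form"]}
pairs = {by_id[q]["text"]: by_id[a]["text"]
         for entry in annotation["form"] for q, a in entry["linking"]}
print(pairs)  # -> {'Gross Weight:': '123 kg'}
```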
Fig. 1. Flow chart for the pipeline of proposed information extraction system.
3.2 Algorithms Evaluated in the Pipeline

1) Key-Value Pair Classification Step of the Pipeline. Two types of techniques are evaluated in the pipeline: NLP methods, and NLP methods combined with layout spatial awareness. NLP methods are able to understand context-sensitive human language; by using NLP, data can be effectively extracted from text-based documents [18].

For the NLP methods, both the BERT model and the Universal Language Model Fine-tuning (ULMFit) model are explored in the pipeline. The BERT model analyzes the left and right sides of a word to infer its context. Additionally, the BERT model uses MLM, which covers or masks a word in a sentence; MLM enforces bidirectional learning from text by requiring BERT to use the words on either side of the covered word to predict it [20]. Applied to our constructed cargo invoice dataset, the model can learn the structure of the key-value pairs; for example, the term "Product" could be next to the word "Number". The ULMFit model is trained on a general-domain corpus to capture overall language properties at several layers and then learns task-specific features [21]. It uses a 3-layer weight-dropped Long Short-Term Memory network (LSTM).

For the other technique, NLP with layout spatial awareness, the pipeline evaluates two algorithms. The first is YOLOv4 combined with BERT, which is able to understand the position of the text as well as to use BERT text embeddings. The second is LayoutLMv2, which not only considers the text and layout information, but also integrates new text-image alignment and text-image matching tasks that help to learn cross-modality interaction [15].

2) Linking of Key-Value Pair Step in the Pipeline. Both the regular expression and the Pairing via Nearby Bounding Box methods are incorporated. The regular expression is efficient in finding the specific patterns or text that are common in invoices and receipts, for example, "Product No.: 123".
Keys and values are most commonly separated by a colon.
For the regular expression algorithm, an analysis is done on the common key-value pairs and their patterns in order to extract the key-value pairings. The workflow of the regular expression algorithm is as follows:

• Find all key-value pairs.
• Calculate the Levenshtein distance to the given identifiers to see which one is the most likely identifier.
• Return the key-value pair with the lowest normalized Levenshtein distance.

The Pairing via Nearby Bounding Box algorithm pairs the key and value by finding the nearest bounding box according to the pairing rules. It combines word-level text into sentence-level text to make sense of the full question and answer, based on the bounding box positions and the labels. There are two pairing rules in the proposed pipeline. The flow of the first key-value pairing rule is shown in Fig. 2. Given the words of the keys and values, it first checks whether a "Question"/"Answer" has a right neighbor. Then the pairing rule checks whether the y coordinates of the neighbors are the same. Next, it checks whether the coordinates of the right side of one bounding box (bbox) are identical to the left side of the other bounding box. The pairing results are then returned.
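The regular-expression linking workflow described above can be sketched as follows. The colon pattern, the identifier normalization and the sample invoice text are illustrative assumptions, not the authors' exact implementation:

```python
import re

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def best_match(text: str, identifier: str):
    """Find all 'key: value' pairs, then return the pair whose key is
    closest to the target identifier by normalized edit distance."""
    pairs = re.findall(r"([A-Za-z ./]+?)\s*:\s*([^\n:]+)", text)
    def score(pair):
        key = pair[0].strip().lower()
        return levenshtein(key, identifier.lower()) / max(len(key), len(identifier))
    return min(pairs, key=score) if pairs else None

invoice = "Gross wght: 88,500 kg\nNet WT: 270 lb\nPO: PO2000004328"
print(best_match(invoice, "gross weight"))  # -> ('Gross wght', '88,500 kg')
```

Normalizing the distance by key length lets abbreviated keys such as "Gross wght" still win over unrelated keys when matched against the canonical identifier.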
Fig. 2. Flowchart of the first key-value pairing rule.
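The first pairing rule can be sketched as follows. The (x0, y0, x1, y1) coordinate convention and the pixel tolerance are assumptions made for illustration:

```python
from dataclasses import dataclass

@dataclass
class Box:
    text: str
    label: str   # "Question" or "Answer"
    x0: float
    y0: float
    x1: float
    y1: float

def pair_rule_one(question, candidates, tol=2.0):
    """First pairing rule: pick the 'Answer' box directly to the right of
    the question box -- same y extent, right edge touching left edge."""
    for c in candidates:
        same_line = abs(c.y0 - question.y0) <= tol and abs(c.y1 - question.y1) <= tol
        adjacent = abs(c.x0 - question.x1) <= tol
        if c.label == "Answer" and same_line and adjacent:
            return c
    return None

q = Box("Gross Weight:", "Question", 10, 100, 90, 112)
a = Box("123 kg", "Answer", 91, 100, 130, 112)
print(pair_rule_one(q, [a]).text)  # -> 123 kg
```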
After performing the first key-value pairing rule, the derived results are split into successfully paired outputs and unsuccessfully paired outputs. The unsuccessfully paired results are passed to the second key-value pairing rule to pair the questions and answers, whose flowchart is shown in Fig. 3. Firstly, it checks whether the center of each bounding box is within the range of the other bbox in the y direction. Then it checks whether both bounding boxes overlap in the x direction, or whether the gap in the x direction is less than 50% of the width of the smaller bounding box. If so, the key and value are paired successfully. The final output of this step consists of the labels, bounding boxes, text, left_neighbour, right_neighbour, the width, the height, the pair_with variable, and the center attribute.

3.3 Datasets

The datasets used for the experiments are the FUNSD dataset [16] and the Cargo Invoices dataset, which is constructed in this research from cargo invoice images obtained from warehouses. Table 1 lists the number of training and testing samples in these datasets.
Fig. 3. Flowchart of the second key-value pairing rule.

Table 1. Number of data in these two datasets.

Dataset                   Training   Testing   Total Data
FUNSD [16]                149        50        199
Cargo Invoices Dataset    484        121       605
4 Experiment Results and Analysis

4.1 Evaluation Metrics

The evaluation metrics used in the ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction [22] are employed to evaluate the experimental performance in this paper: Precision, Recall and F1 Score. The extracted text is compared to the ground truth for each test image. If the content and category of the extracted text match the ground truth, it is marked as correct; otherwise, it is labelled as incorrect. Furthermore, the algorithms are also evaluated with the classification report and confusion matrix. Precision, recall, F1 score, and support are the four most important headers to pay attention to in the classification reports. Precision refers to the ability of a classifier not to label a negative instance as positive [23]. Recall is the ability of a classifier to find all positive instances. The F1 score is a weighted harmonic mean of precision and recall. The support is the number of actual class occurrences in the specified dataset.

4.2 Evaluations for Classification Models in the Pipeline

The dataset is split into 80% training and 20% testing. Four classification models are evaluated in the pipeline: ULMFit, BERT, YOLOv4 combined with BERT, and LayoutLMv2, each of which classifies texts into "Question", "Answer" and "Others".

1) Results of the ULMFit Model. The precision, recall, F1 score and support of the ULMFit model on the Cargo Invoice dataset are shown in Table 2.

Table 2. ULMFit Results with Cargo Invoice Dataset.

Class      Precision   Recall   F1 Score   Support
Question   99%         98%      98%        882
Answer     88%         96%      92%        882
Other      93%         80%      86%        568
Accuracy                        93%        2332

Overall,
the performance of this model is good, with 93% accuracy. However, the model does not perform as well on the "Other" class as on the other two classes. This might be because there are fewer "Other" labels compared to "Question" and "Answer".

2) Results of the BERT Model. The experiment results of the BERT classification model on the Cargo Invoice dataset are shown in Table 3. It achieves 88% accuracy. This model performs well on the "Other" class in terms of precision, recall and F1 score, but not as well on the "Answer" class.

Table 3. BERT Results with Cargo Invoice Dataset.

Class      Precision   Recall   F1 Score   Support
Question   85%         91%      88%        2128
Answer     83%         76%      79%        1387
Other      96%         95%      95%        1493
Accuracy                        88%        5008
3) Results of the YOLOv4 Combined with BERT Model. The classification results of the YOLOv4 combined with BERT model are shown in Table 4. The overall performance is worse than that of the other classification models, and it performs worst on the "Other" label. This might be due to the imbalanced dataset, in which "Other" has comparatively the least amount of data. The results of this experiment show that the combination of the YOLOv4 and BERT algorithms does not enhance the performance compared with the BERT model alone.

Table 4. Results of YOLOv4 Combined with BERT Model.

Class      Precision   Recall   F1 Score   Support
Answer     52%         42%      46%        882
Other      46%         39%      40%        568
Question   49%         40%      44%        882
Accuracy                        40.27%     2332

4) Results of the LayoutLMv2 Model. Different from the other three models, the labeling method of the LayoutLMv2 model follows the BIOES tagging scheme, where B means beginning, I means in the middle, O means others, E means the ending, and S means a single word representing a full sequence [24]. For example, "Product No." will be split into "Product", which is tagged B-QUESTION, and "No.", which is tagged E-QUESTION. The experiment results of the LayoutLMv2 classification model are shown in Table 5. The LayoutLMv2 model achieves an accuracy of 96%, which is the highest of the four classification models. All the labels achieve more than 90%
for the precision, recall and F1 score. It shows that the LayoutLMv2 model can predict well and distinguish each label properly.

Table 5. LayoutLMv2 Results with Cargo Invoice Dataset.

Class        Precision   Recall   F1 Score   Support
B-ANSWER     93%         94%      94%        331
B-QUESTION   98%         98%      98%        507
E-ANSWER     90%         96%      93%        331
E-QUESTION   98%         98%      98%        507
I-ANSWER     96%         98%      97%        915
I-QUESTION   96%         97%      97%        104
O            97%         93%      94%        1387
S-ANSWER     95%         97%      96%        502
S-QUESTION   96%         98%      97%        375
Accuracy                          96%        4959
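The BIOES labeling described above can be sketched as a small helper function (an illustrative sketch, not the authors' code):

```python
def bioes_tags(tokens, entity):
    """Assign BIOES tags to the tokens of one entity span.
    B = beginning, I = inside, E = end, S = single-token entity."""
    if len(tokens) == 1:
        return [f"S-{entity}"]
    return [f"B-{entity}"] + [f"I-{entity}"] * (len(tokens) - 2) + [f"E-{entity}"]

print(bioes_tags(["Product", "No."], "QUESTION"))  # -> ['B-QUESTION', 'E-QUESTION']
print(bioes_tags(["123"], "ANSWER"))               # -> ['S-ANSWER']
```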
5) Experiment Results with the FUNSD Dataset. The classification experiments are also performed on the FUNSD dataset to compare the performance of the four classification models, as shown in Table 6.

Table 6. Classification Results with the FUNSD Dataset.

Model                       Precision   Recall   F1 Score
ULMFit                      77%         76%      76%
BERT                        55%         67%      60%
YOLOv4 combined with BERT   49%         40%      44%
LayoutLMv2                  80%         85%      83%
The results show that the LayoutLMv2 model performs the best of the four models on both datasets. This demonstrates that integrating the spatial-aware self-attention mechanism into the Transformer architecture allows the LayoutLMv2 model to fully understand the relative positional relationships among different text blocks. On the other hand, YOLOv4 combined with the BERT model does not perform well. Even though it is trained with the text along with the bounding boxes of the texts, it is not able to find the relative positional relationships as accurately as the LayoutLMv2 model. Thus, the LayoutLMv2 model is chosen by the proposed pipeline for the key-value pair classification task.
4.3 Evaluation for Linking of Key-Value Pair Algorithms in the Pipeline

Some analysis is done on the dataset to understand the key-value patterns and how they are paired, in order to enhance the extraction process. The keys appear in various formats, such as short forms. Hence, a list of the different forms of keys is created; some examples are shown in Table 7. After finding the different formats of the keys in the dataset, the next step is understanding the value patterns of these common keys. Some examples of value patterns are shown in Table 8. After performing the key-value pairing, the results are evaluated by comparing the derived final key-value pairs with the ground truth. The experiment results are compared in Table 9.

Table 7. Some Examples of Different Formats of Keys.

Word           Different Formats of Keys
Gross Weight   G/W, G/Weight, Gross wt, Gross wght, Gross wght(kg)
Net Weight     Net WT, Net wght, Net weight(kg), Net wt(kg), Net wght(kg), N/W
Dimension      Dim, Dims, Dim(mm), DIM (CM), Dimensions(cm)
Table 8. Pattern of Values.

Key            Example Values                   Patterns
Gross Weight   G/W 2134; 23 kg/g; 88,500 kg     Always an integer/float followed by the unit (kg/lb/g); sometimes "G/W" appears in front of the number
Net Weight     270 lb; 88,500 kg; 3.840 kg      Similar to gross weight: an integer/float followed by the unit (kg/lb/g)
PO             PO-IMA-21007; PO2000004328       "PO" followed by letters then integers, or just integers
Dimension      228.00 × 148.00 × 165.00 CM;     integer/float × integer/float × integer/float
               89.76 × 58.26 × 64.96 in.;
               3815 × 2150 × 2550 mm
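The value patterns described above can be captured with regular expressions; the following sketch is illustrative and not the authors' exact pattern dictionary:

```python
import re

# Illustrative regexes for the value patterns in Table 8.
PATTERNS = {
    "weight":    re.compile(r"^(?:G/W\s*)?\d[\d,]*(?:\.\d+)?\s*(?:kg|lb|g)\b", re.I),
    "po":        re.compile(r"^PO[-A-Z]*\d+$", re.I),
    "dimension": re.compile(
        r"^\d+(?:\.\d+)?\s*[x×]\s*\d+(?:\.\d+)?\s*[x×]\s*\d+(?:\.\d+)?", re.I),
}

def match_value(value: str):
    """Return the names of all patterns that the given value matches."""
    return [name for name, rx in PATTERNS.items() if rx.match(value.strip())]

print(match_value("88,500 kg"))                    # -> ['weight']
print(match_value("PO2000004328"))                 # -> ['po']
print(match_value("228.00 × 148.00 × 165.00 CM"))  # -> ['dimension']
```

As the paper notes, such hand-written patterns must be maintained manually as new invoice formats appear, which motivates the learned pairing approach evaluated next.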
Table 9. Results for Key-Value Pairing in the Pipeline.

Algorithm                         Precision   Recall   F1 Score
Regular Expression                63%         60%      66%
Pairing via Nearby Bounding Box   73%         72%      70%
It is observed from the results that the second algorithm, Pairing via Nearby Bounding Box, performs better than the regular expression algorithm, with a precision of 73%, recall of 72% and F1 score of 70%. The reason the regular expression does not perform well might be that there are limited patterns in the system. Even though the Pairing via Nearby Bounding Box algorithm performs better, it can still be further improved, as on some occasions the questions and answers are very far apart in the horizontal direction. Furthermore, a few mistakes in the OCR results cause the classification models to misclassify some labels and hence corrupt the key-value pairs.
5 Conclusion

In this paper, an end-to-end pipeline of an information extraction system for extracting key-value pairs from invoices or receipts is presented. The proposed system uses deep learning and intelligent computing approaches for knowledge discovery, classifying key-value pairs, and linking the key-value pairs. First, a few deep learning approaches are employed to explore key-value label classification and the linking of key-value pairs. Experiments are conducted to evaluate and compare the performance of each model for key-value pair classification in the pipeline. It is observed from the
results that the LayoutLMv2 model performs the best. This shows that the LayoutLMv2 architecture, with its layout spatial awareness and word embeddings, improves the results of standard text classification. Afterwards, experiments are conducted for the two methods of linking key-value pairs in the pipeline: the regular expression, and a unique method linking the key-value pairs by finding nearby bounding boxes. Evaluating the performance, the Pairing via Nearby Bounding Box algorithm performs better, with a precision of 73%, recall of 72% and F1 score of 70%.

Several recommendations can be made for future improvements to the overall pipeline. For the key-value label classification, LayoutLMv2 [15] has recently been followed by a new version known as LayoutLMv3 [25], which may be implemented to investigate its efficacy in enhancing the key-value label classification results compared to the current model in the pipeline. For linking key-value pairs, most key-value pairs are typically located beside each other on the same horizontal line. However, there are some cases where the values are below the keys vertically. Currently, the proposed pipeline has not explored that case and can only find nearby bounding boxes horizontally. Hence, to further improve the system, it needs to be able to pair nearby key-value pairs vertically as well.

Acknowledgment. The first author would like to thank her intern supervisor Mr. Eric Tan of Infocomm Media Development Authority (IMDA) Singapore, for his guidance and dedicated support in the project.
References 1. Turner, R.: The myth of the paperless office. New Libr. World 104(3), 120–121 (2003) 2. Klein, B., Agne, S., Dengel, A.: Results of a study on invoice-reading systems in Germany. In: International Workshop on Document Analysis Systems, pp. 451–462 (2004) 3. Document Recognizer to modernize information processing. https://crossmasters.com/en/ blog/document-recognizer-to-modernize-information-processing/. Accessed 16 July 2022 4. Kay, A.: Tesseract: an open-source optical character recognition engine. Linux Journal (2007) 5. Amazon Web Services: Form Data (Key-Value Pairs). https://docs.aws.amazon.com/textract/ latest/dg/how-it-works-kvp.html. Accessed 15 June 2022 6. Qing, Y., Zeng, Y., Cao, Q., Huang, G.-B.: End-to-end novel visual categories learning via auxiliary self-supervision. Neural Netw. 139, 24–32 (2021) 7. Xu, D., Li, Z., Cao, Q.: Object-based illumination transferring and rendering for applications of mixed reality. Vis. Comput. 1–15 (2021). https://doi.org/10.1007/s00371-021-02292-2 8. Kumar, V., Kaware, P., Singh, P., et al.: Extraction of information from bill receipts using optical character recognition. In: International Conference on Smart Electronics and Communication, pp. 72–77 (2020) 9. Kamisetty, V.S.R., Chidvilas, B.S., Revathy, S., et al.: Digitization of data from invoice using OCR. In: 6th International Conference on Computing Methodologies and Communication, pp. 1–10 (2022) 10. Kaló, Á.Z., Sipos, M.L.: Key-value pair searching system via tesseract OCR and post processing. In: 19th World Symposium on Applied Machine Intelligence and Informatics, pp. 000461–000464 (2021)
11. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. North American Chapter of the Association for Computational Linguistics (2019) 12. Liu, Y., Ott, M., Goyal, N., et al.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 (2019) 13. Wang, W., Bi, B., Yan, M., et al.: StructBERT: incorporating language structures into pre-training for deep language understanding. In: International Conference on Learning Representations (2020) 14. Xu, Y., Li, M., Cui, L., et al.: LayoutLM: pre-training of text and layout for document image understanding. In: 26th ACM International Conference on Knowledge Discovery & Data Mining, pp. 1192–1200 (2020) 15. Xu, Y., Xu, Y., Lv, T., et al.: LayoutLMv2: multi-modal pre-training for visually-rich document understanding. In: 59th Annual Meeting of the Association for Computational Linguistics and 11th International Joint Conference on Natural Language Processing (2021) 16. Jaume, G. Ekenel H. K., Thiran, J.: FUNSD: a dataset for form understanding in noisy scanned documents. In: International Conference on Document Analysis and Recognition Workshops, pp. 1–6 (2019) 17. Garncarek, Ł., et al.: LAMBERT: layout-aware language modeling for information extraction. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12821, pp. 532–547. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86549-8_34 18. Li, C., Bi, B., Yan, M., et al.: StructuralLM: structural pre-training for form understanding. In: 59th Annual Meeting of the Association for Computational Linguistics and 11th International Joint Conference on Natural Language Processing (2021) 19. Banksy Annotation Tool. https://github.com/AboutGoods/Banksy-annotation-tool. Accessed 18 June 2022 20. Muller, B.: BERT 101 state of the art NLP model explained. https://huggingface.co/blog/ber t-101. Accessed 21 June 2022 21. 
Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: 56th Annual Meeting of the Association for Computational Linguistics (2018) 22. ICDAR 2019 robust reading challenge on scanned receipts OCR and information extraction. https://rrc.cvc.uab.es/?ch=13&com=tasks. Accessed 13 June 2022 23. Agrawal, S.: Metrics to evaluate your classification model to take the right decisions. https://www.analyticsvidhya.com/blog/2021/07/metrics-to-evaluate-your-classific ation-model-to-take-the-right-decisions/. Accessed 21 June 2022 24. Johansen, B.: Named-entity recognition for Norwegian. In: 22nd Nordic Conference on Computational Linguistics, pp. 222-231 (2019) 25. Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: LayoutLMv3: Pre-training for document AI with unified text and image masking. arXiv:2204.08387 (2022)
Missing Data Analysis and Soil Compressive Modulus Estimation via Bayesian Evolutionary Trees Wenchao Zhang, Peixin Shi(B) , Xiaoqi Zhou, and Pengjiao Jia Soochow University, Suzhou, Jiangsu, China {pxshi,pjjia}@suda.edu.cn, [email protected]
Abstract. Soil compression modulus (E s ) is an elemental parameter in geotechnical designs. During E s prediction, missing values in some parameters are inevitable. By analyzing the missing mechanism of the parameters, it is found that some missing values contain important domain knowledge information. A new method for predicting E s with missing values in the parameters is proposed. This method, called BETm, modifies the recursive partitioning scheme during the construction of genetic programming (GP) trees and Bayesian additive regression trees (BARTs) to incorporate missing data into partitioning rules. Compared to traditional interpolations or imputations, it can directly mine the domain information of missing values, avoiding the introduction of unnecessary errors and noise that can obscure this information. It can construct higher-order feature variables to explore the interactions between parameters through GP and introduces beneficial uncertainty into an ensemble model through BART. To estimate E s , a database of 2955 geotechnical samples from 101 boreholes, including five input parameters with missing values, is constructed. A comparison is conducted with hybrid models of interpolation (e.g., imputation using average values, multivariate imputation) and ensemble learning (e.g., random forest, extreme gradient boosting). The result reveals that BETm achieves an R2 value of 0.978, MAE of 0.045 MPa, and MAPE of 0.318, outperforming the other hybrid models. The proposed model also provides variable importance analysis with Bayesian uncertainty. Keywords: Compression modulus · Missing value · Genetic programming · Bayesian additive regression tree
1 Introduction Soil compression modulus (E s ) is defined as the ratio of vertical additional stress to corresponding strain increment of a column with full lateral limits. It is one of the most important geotechnical parameters for determining soil compressibility which in turn determines ground and structural deformation during underground constructions. While it can be determined through in-situ testing, laboratory testing and empirical correlations, it is one of the parameters most difficult to determine due to its uncertainties. Some engineers argue that different sites cannot be treated as repeated events because each site © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 90–100, 2023. https://doi.org/10.1007/978-981-99-4752-2_8
is unique in its way. The spatial variability of ground properties and other geotechnical uncertainties can be modeled using random variables or fields. The characteristics of geotechnical data are described as MUSIC-X: multivariate, uncertain and unique, sparse, incomplete, and potentially corrupted, with X denoting the time and/or spatial dimension [1–3]. As such, the input parameters are multivariate and may be essentially incomplete (i.e., contain missing values), so an accurate prediction of E s is a huge challenge. Such heterogeneous incompleteness poses a great challenge for conventional algorithms, which are often unable to handle datasets with missing values directly; the data must be processed by other means before these algorithms can be executed [4]. There are several ways to handle missing data. List-wise deletion removes every record with a missing entry, but this drastically reduces the dataset and is therefore not feasible [5]. In addition, several imputation strategies are used, such as unconditional mean, conditional mean, single imputation, and multiple imputation. The missing entries can also be imputed with their conditional expectation using an expectation-maximization algorithm [6]. These settings begin with the optimal predictor on incomplete data and attempt to recover the missing values while reducing bias. Multiple imputation methods are more efficient, including multiple imputation via chained equations [7] and MissForest [8]. More powerful methods have also been proposed, such as NeuMiss [9], a deep neural network capable of capturing the conditional links between observed and unobserved variables despite the pattern of missing values, and Generative Adversarial Imputation Nets [10]. The imputation methods mentioned above can be computationally expensive and can introduce bias if the imputed data are unreasonable.
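As an illustration of the simplest of these strategies, unconditional (column-mean) imputation can be written in a few lines of NumPy. This is a generic sketch, not the paper's pipeline:

```python
import numpy as np

def mean_impute(X):
    """Unconditional column-mean imputation: replace each NaN by its
    column's observed mean. Simple, but it can bias downstream models
    when the missingness itself is informative."""
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)          # means over observed entries only
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]
    return X

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
print(mean_impute(X))  # NaNs replaced by column means 2.0 and 6.0
```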
Thus, methods that deal directly with missing values without imputation are preferred. For example, surrogate splits utilize another variable to induce a data partition similar to the original, as in the default method of Recursive Partitioning [11]. Block propagation selects a split on observed values and then sends all incomplete observations as a block to the side that minimizes error [12]. 'Missing incorporated in attribute' (MIA) [11] is another method that is effective when missing patterns are informative, allowing a cut of missing versus non-missing data when computing the best splits. MIA is implemented in both 'ctree' and 'cforest' [14], as well as in XGBoost [15]. To reduce the computational cost of imputation and fully exploit the domain information in the missing values, a new method is proposed from a Bayesian perspective to predict E s using geotechnical data with uncertainty and missing values. The new method is the Bayesian evolutionary trees enhanced with MIA, named BETm. It is a domain-knowledge-guided method for estimating E s that deals directly with missing values without imputation.
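A minimal sketch of how an MIA-style split can be scored, assuming a squared-error criterion (the concrete criteria and tree machinery in ctree/cforest and XGBoost differ):

```python
import numpy as np

def mia_split_sse(x, y, threshold):
    """Score one candidate split under 'missing incorporated in attribute'
    (MIA): try routing NaNs with the left child, with the right child, and
    a pure missing-vs-observed split; return the lowest sum of squared
    errors over the two children."""
    def sse(mask):
        total = 0.0
        for side in (mask, ~mask):
            if side.any():
                total += ((y[side] - y[side].mean()) ** 2).sum()
        return total

    miss = np.isnan(x)
    left_obs = (~miss) & (x <= threshold)
    candidates = [
        left_obs | miss,  # missing values go left
        left_obs,         # missing values go right
        miss,             # split purely on missingness
    ]
    return min(sse(c) for c in candidates)
```

When the missingness pattern carries information (as argued for the geotechnical data here), the third candidate — splitting purely on missingness — can beat any threshold on the observed values.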
2 Related Work In general, the determination of compressibility parameters from the oedometer test takes a relatively long time and leads to a very demanding experimental work program in the laboratory [16]. To reduce the reliance on repetitive and time-consuming conventional laboratory tests, artificial intelligence techniques are increasingly applied to geotechnical parameter analysis. Statistical Learning Models: Probabilistic theory remains the primary approach to addressing the sparsity, heterogeneity, and uncertainty of rock and soil. Within the Bayesian framework, one proposed approach probabilistically integrates site-specific data with prior knowledge (e.g., previous engineering experience and project-specific test data) and employs Markov Chain Monte Carlo (MCMC) simulation to generate a large sample dataset, allowing for quantification of parameter correlations [17–19]. A novel method based on Multiple Point Statistics [20] accurately interpolates subsurface soil stratigraphy from sparse measurements while also quantifying uncertainty. Yu et al. [21] conduct an extensive parametric study and develop a statistical table for sample size determination considering spatial variation and correlation using Bayesian compressive sensing or sampling. A Bayesian supervised learning method is used to interpolate site-specific geotechnical data from sparse measurements; it can directly deal with non-stationary data, bypasses model parameter estimation, and performs better than kriging. Additionally, a Bayesian regularization neural network is designed to predict the compression index, a necessary parameter in the settlement calculation of clays [16]. Machine Learning Approaches: Machine learning methods can automatically detect patterns in data and use these patterns to improve predictions of future trends or facilitate better decision-making [22].
Therefore, machine learning models have increasingly been applied to predict geotechnical parameters over the last decade. Artificial Neural Networks (ANNs) are used to estimate clay sensitivity with piezocone penetration test data in Southwest Sweden [23]. Zhang et al. [24] apply Extreme Gradient Boosting (XGBoost) and Random Forest (RF) ensemble learning methods to capture undrained shear strength from various basic soil parameters. Multi-gene genetic programming, a robust technique that integrates the capabilities of standard GP and classical regression, has been used to predict the compression index of fine-grained soils [25]. Recently, some scholars have begun to apply deep learning to the analysis of geotechnical parameters. Based on a spatial-geological stratigraphic graph, a spatial-geological graph attention network is constructed for node-level estimation and edge-level correction of Es prediction, which includes two attention layers to properly capture the dependencies between spatial location and lithology [26].
3 Methodology 3.1 Definition of Missing Value For a variable with missing values, we cannot observe its complete vector X. To exploit the useful information in the missing values, we introduce the indicator vector M ∈ {0, 1}d , where M j = 1 if and only if X j is missing, and M j = 0 otherwise [6, 27, 28]. The observed entries are denoted by X o , defined by X o = X j only if M j = 0, and X m denotes the missing component of the complete input vector X. The incomplete variable is denoted X̃, defined as X̃ j = NA if M j = 1, and X̃ j = X j otherwise. The main goal of statistical learning with 'missingness' is to estimate E[Y| X o , X m ]. Unfortunately, mixed data types, such as X o ∪ {NA}, make the problem difficult; indeed, many learning algorithms cannot work with mixed data types containing missingness [6].
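In NumPy terms, with NaN standing in for NA, the notation above amounts to:

```python
import numpy as np

# Sketch of the notation in Sec. 3.1: for one incomplete sample x,
# build the missingness indicator M and the observed part X_o.
x = np.array([1.2, np.nan, 3.4, np.nan])   # incomplete vector (NaN plays NA)
M = np.isnan(x).astype(int)                # M_j = 1 iff X_j is missing
X_o = x[M == 0]                            # observed entries
print(M)     # [0 1 0 1]
print(X_o)   # [1.2 3.4]
```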
3.2 BETm We propose a competing algorithm, BETm, which can directly handle incomplete datasets and does not need imputation. It modifies the recursive partitioning scheme during the construction of GP trees and BART to incorporate missing data into the partitioning rules. This offers a model-based algorithm that can deal with mixed data types including NA. For clarity, we present an example in Fig. 1.
Fig. 1. BETm
In this paper, we focus on useful information about the missingness in our dataset and propose the BETm statistical learning framework to address this issue. The primary objective of this framework is to estimate the expected value of Y, given the observations and missing data represented by X o and X m , respectively. Conditional on (X o , X m ), including both observations and missingness, our model partitions the full data likelihood as

P(Y, X m | X̃, θ, γ ) = P(Y | X o , X m , θ ) · P(X m | X o , γ )    (1)
where θ and γ are parameter vectors and assumed distinct. In this framework, each of the j ∈ {1, …, p} covariates with missingness is assumed to have its own missing data mechanism (MDM). Thus, the MDM for the whole covariate space, P(X m | X o , γ ), can be arbitrarily convoluted among X m , and each MDM relationship may be highly non-linear with complicated features after GP feature construction.
4 Experiments 4.1 Data Description This study employs data obtained from 101 boreholes containing soil samples subjected to both in-situ and laboratory tests in Suzhou, China. In total, 2955 records are collected and summarized in Table 1. The input parameters utilized for predicting E s include depth (h), natural density (ρ), natural water content (ω), specific gravity (Gs ), curvature coefficient (C c ), coefficient of uniformity (C u ), liquidity index (I L ), plasticity index (I P ), and the standard penetration test N-value (SPT N-value) of the soil. There are significant amounts of missing values for C u , C c , I L , I P , and N-value, with respective missing proportions of 0.64, 0.64, 0.14, 0.14, and 0.52. The deepest boreholes reach 88 m, with N-values ranging from 1.8 to 33.3 blows and E s values ranging from 1.1 to 17.6 MPa, which suggests that there are large differences in parameters between different types of soil samples, and therefore clustering is necessary. Table 1. Summary of datasets
| Variable | n | miss | mean | sd | median | mad | min | max | range | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| h | 2955 | 0 | 33.17 | 19.75 | 30.2 | 21.5 | 1.2 | 87.8 | 86.6 | 0.5 | −0.58 |
| Cu | 1058 | 0.64 | 12.71 | 5.21 | 14.14 | 6.53 | 1.86 | 24 | 22.14 | −0.02 | −1.07 |
| Cc | 1058 | 0.64 | 1.49 | 1.08 | 0.92 | 0.4 | 0.17 | 5.28 | 5.11 | 1.64 | 1.49 |
| ω | 2955 | 0 | 30.01 | 3.15 | 30.6 | 3.11 | 21 | 49.3 | 28.3 | −0.16 | 0.32 |
| Gs | 2955 | 0 | 2.71 | 0.01 | 2.72 | 0.01 | 2.68 | 2.74 | 0.06 | −0.36 | −0.9 |
| ρ | 2955 | 0 | 1.92 | 0.04 | 1.91 | 0.04 | 1.72 | 2.06 | 0.34 | 0.54 | 0.02 |
| Ip | 2530 | 0.14 | 12.81 | 3.15 | 13.6 | 2.22 | 5 | 18.7 | 13.7 | −0.8 | −0.09 |
| IL | 2530 | 0.14 | 0.8 | 0.3 | 0.82 | 0.24 | 0.07 | 2.12 | 2.05 | 0.12 | 0.4 |
| Types | 2955 | 0 | – | – | – | – | – | – | – | – | – |
| Es | 2955 | 0 | 7.24 | 2.93 | 6.2 | 2.58 | 1.1 | 18.7 | 17.6 | 0.88 | 0.03 |
| N-value | 1433 | 0.52 | 11.45 | 6.83 | 9.1 | 6.67 | 1.8 | 33.3 | 31.5 | 0.71 | −0.61 |

mad: median absolute deviation, sd: standard deviation, miss: missingness proportion
4.2 Missing Mechanism in the Parameters Missing Mechanism Analysis: There is ineluctable missing information in input parameters, especially in C u , C c , I L , I P , and N-value for different types of soils. In
this section, we introduce the different missing mechanisms of the parameters. Rubin [29] defines the missing-value mechanism based on the relationship between missingness and observed values: if they are independent, the mechanism is said to be missing completely at random (MCAR); if the missingness depends only on observed values, it is missing at random (MAR); otherwise, it is missing not at random (MNAR) [6, 13, 30]. Figure 2(a) shows that the missing proportions for C u , C c , I L , I P , and N-value are about 0.63, 0.63, 0.12, 0.12, and 0.51, respectively. Figure 2(b) shows the proportion of missing values for variable combinations. The proportion of the missing combinations in C u , C c , and N-value is 0.39, among which missing in C u and C c is 0.25, in I L and I P is 0.10, in N-value is 0.09, in I L , I P , and N-value is 0.04, in C u , C c , I L , I P , and N-value is 0.003, and in C u , C c , I L , and I P is 0.002. These missing values occur in certain combinations, which may contain valid geotechnical information.
Fig. 2. Shadow matrix between variables with missing values
Considering that the missing values in the above parameters contain information about geotechnical properties and that the missing data are not completely random (i.e., the missing mechanism is MAR or MNAR), both dropping individual rows with missing values and imputing the missing values would compromise geotechnical information. Across a wide range of data types (both non-informative and informative missing values) and sources and amounts of missingness, MIA has consistently shown good performance [5, 6, 13].
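The variable combinations visualized in Fig. 2(b) can be tallied directly from the raw data without any imputation; a small sketch (the function name and the NaN-for-NA convention are ours):

```python
import numpy as np
from collections import Counter

def missing_patterns(X, names):
    """Count rows by their missingness pattern (the combinations shown
    in the shadow matrix of Fig. 2), leaving the data untouched."""
    patterns = Counter()
    for row in np.isnan(X):
        key = tuple(n for n, m in zip(names, row) if m) or ("complete",)
        patterns[key] += 1
    return patterns

X = np.array([[1.0, np.nan, np.nan],
              [2.0, 3.0, 4.0],
              [np.nan, np.nan, 5.0]])
print(missing_patterns(X, ["Cu", "Cc", "N"]))
```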
4.3 Experimental Setup To highlight the advantages of the proposed algorithms, we compare them with commonly used methods for managing missing data, such as MIA, imputation using average values (IA) and multivariate imputation (MI) [7], as well as popular ensemble algorithms for data analysis like RF, XGBoost, and BART. The BART and RF models are unable to run on incomplete datasets, hence imputation is utilized to complete the datasets with IA
and MI models selected as the imputation techniques. We determine the hyperparameters through a grid search coupled with 5-fold cross-validation and convergence analysis [24, 31]. Table 2 presents a summary of the optimal hyperparameters obtained for the algorithms. Table 2. Hyperparameters setting during training.
| Model | Hyperparameters | Value | Explanations |
|---|---|---|---|
| BETm | Population_size | 2000 | Population of individuals |
| | Generations | 20 | – |
| | Function_set | '+', '–', '·', '/', 'sqrt', 'log' | The available function set |
| | Metric | Spearman | Correlation coefficient |
| | m | 50 | Number of trees |
| | Number burn-in | 250 | Number burn-in |
| | Post. Samples | 1000 | Number iteration after burn-in |
| RF | num_trees | 50 | Number of trees |
| RF-MIA | ntree | 50 | Number of trees |
| | MIA | TRUE | The treatment of NA as a category in a split |
| XGBoost | num_trees | 50 | Number of trees |
| | missing | NA | Considering NA as missingness |
| BART | num_trees | 50 | Number of trees |
| | Number burn-in | 250 | Number burn-in |
| | Post. Samples | 1000 | Number iteration after burn-in |
| BART-MIA | num_trees | 50 | Number of trees |
| | use_missing_data | TRUE | Using MIA to handle missingness (NA) |

Others: Keep the default values of all parameters except the above ones
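The grid-search/cross-validation loop used to select these hyperparameters can be sketched generically as follows; `fit`, `score`, and the toy "model" below are illustrative stand-ins, not the paper's actual training code:

```python
import numpy as np
from itertools import product

def grid_search_cv(fit, score, X, y, grid, k=5, seed=0):
    """Hypothetical sketch of hyperparameter selection by grid search with
    k-fold cross-validation. `fit(Xtr, ytr, **params)` trains a model and
    `score(model, Xte, yte)` returns an error to minimize; the parameter
    combination with the lowest mean CV error wins."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    best_err, best_params = np.inf, None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        errs = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(X[train_idx], y[train_idx], **params)
            errs.append(score(model, X[test_idx], y[test_idx]))
        if np.mean(errs) < best_err:
            best_err, best_params = np.mean(errs), params
    return best_err, best_params

# Toy demo: a "model" that predicts the training mean plus a shift.
X, y = np.zeros((20, 1)), np.arange(20.0)
fit = lambda Xtr, ytr, shift: ytr.mean() + shift
score = lambda m, Xte, yte: np.mean(np.abs(yte - m))
err, params = grid_search_cv(fit, score, X, y, {"shift": [-5.0, 0.0, 5.0]})
print(params)
```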
4.4 Results Table 3 presents the comparative analysis of the aforementioned algorithms on the test sets. We note that for all models in this section, 80% of the observations are randomly selected for training, while the remaining 20% are used for testing. After employing the IA and MI imputation techniques, the RF and BART models can be computed. Notably, MI does not demonstrate significant advantages over IA, except for RF. This finding suggests that when missing values contain domain information, such as X m : C u , C c , I L ,
I P , and N-value, which offer insights into geotechnical properties, direct imputation of these values could obscure the domain information, thereby undermining the E s analysis. Consequently, our algorithmic framework leverages the GP transformer to construct higher-order feature variables to enhance BETm's performance. Table 3. Summary of metrics for algorithms with or without imputation
| Imputation | Algorithm | MAPE | MAE | R2 |
|---|---|---|---|---|
| IA | RF | 0.447 | 0.058 | 0.940 |
| IA | XGBoost | 0.341 | 0.048 | 0.965 |
| IA | BART | 0.344 | 0.049 | 0.966 |
| MI | RF | 0.349 | 0.050 | 0.965 |
| MI | XGBoost | 0.405 | 0.054 | 0.950 |
| MI | BART | 0.341 | 0.050 | 0.964 |
| Raw data | RF-MIA | 0.353 | 0.051 | 0.966 |
| Raw data | XGBoost | 0.334 | 0.047 | 0.967 |
| Raw data | BART-MIA | 0.332 | 0.048 | 0.967 |
| Raw data | **Ours (BETm)** | **0.318** | **0.045** | **0.978** |

Remark: Raw data has much missingness; IA/MI indicates using average values/multivariate imputation to deal with missing values; bold indicates the optimal item
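The three metrics reported in Table 3 can be computed with their standard definitions (the paper does not spell out its exact formulas, so these are assumptions):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAPE, MAE, and R^2 as used in Table 3, under standard definitions."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    mape = np.mean(np.abs((y_true - y_pred) / y_true))   # mean absolute percentage error
    mae = np.mean(np.abs(y_true - y_pred))               # mean absolute error
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                           # coefficient of determination
    return mape, mae, r2
```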
Fig. 3. Variable importance plot with 25 to 75% quantile interval
Two higher-order variables (log(ω)/I P − log(ω) and ρ − ρ/√Gs ) are constructed by the GP transformer, which in turn removes the corresponding lower-order variables, improving the competitiveness of the model while reducing the number of variables.
Figure 3 shows the ranking of the new set of variables. The variable log(ω)/I P − log(ω) mainly reflects the water content of the soil (high or low water content affects whether the soil is soft or hard) and its plasticity (soils with a high I P tend to be clay, those with a lower I P tend to be silt). The variable ρ − ρ/√Gs combines the specific gravity and density of the soil (the compactness of the calibrated soil can be used as a typical soil index property). The newly constructed higher-order variables are ranked higher, which demonstrates the advantage of the GP-transformer-constructed features. 'Types' is the most important variable for predicting E s ; the second most important is log(ω)/I P − log(ω). However, compared with the other variables, they possess large quantile intervals. This highlights that these variables exhibit a higher degree of uncertainty, and consequently their importance measures should be interpreted with caution. This is an advantage that traditional methods for assessing variable importance lack.
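Reading the printed expressions as log(ω)/I_P − log(ω) and ρ − ρ/√Gs, the two constructed features can be evaluated directly; the sample values below are approximate means from Table 1 and purely illustrative:

```python
import numpy as np

def gp_features(omega, Ip, rho, Gs):
    """The two higher-order features reported for BETm, as we read them:
    f1 = log(omega)/Ip - log(omega),  f2 = rho - rho/sqrt(Gs)."""
    f1 = np.log(omega) / Ip - np.log(omega)
    f2 = rho - rho / np.sqrt(Gs)
    return f1, f2

f1, f2 = gp_features(omega=30.0, Ip=12.8, rho=1.92, Gs=2.71)
print(round(f1, 3), round(f2, 3))  # -3.135 0.754
```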
5 Conclusion When predicting E s , missing variables are inevitable, and some of them contain important domain information, so the crude use of traditional interpolation/imputation methods introduces errors and obscures the internal geotechnical information. In this paper, in-situ and laboratory test data of 2955 geotechnical samples from 101 boreholes, comprising 32,505 data points (including 6,166 missing values) in Suzhou, China, are collected and compiled. The missing patterns of these parameters are analyzed, and it is found that they are not MCAR. The correlation between the missing values in C u , C c and I L , I P is about −0.6; the correlation between the missing values in C u , C c and N-value is about 0.2; and that between I L , I P and N-value is about −0.2. The missing values within a dataset represent important domain-specific information that cannot be disregarded in analysis, and imputing these values directly may conceal vital domain information. In contrast, the MIA methodology enables comprehensive use of the missing data within the original dataset, proving a more effective and competitive approach. The higher-order feature variables (log(ω)/I P − log(ω) and ρ − ρ/√Gs ) are constructed through the GP transformer to enhance the competitiveness of our model (R2 = 0.978, MAE = 0.045 MPa, MAPE = 0.318). The constructed variables rank in the top 5 in terms of variable importance and have a strong positive correlation with E s . Acknowledgments. The authors would like to thank the reviewers, whose contributions have greatly improved the paper. The research described in this paper was supported by the National Natural Science Foundation of China (52278405, 52108380) and the Natural Science Foundation of Jiangsu Province (BK20210721).
References 1. Ching, J., Phoon, K.K.: Bayesian data mining for a generic geotechnical database. In: Proceedings of the 6th International Symposium on Reliability Engineering and Risk Management (6ISRERM), p. 8, Singapore (2018)
2. Phoon, K.-K., Ching, J., Wang, Y.: Managing risk in geotechnical engineering – from data to digitalization. In: Proceedings of the 7th International Symposium on Geotechnical Safety and Risk (ISGSR 2019), pp. 13–34 (2019). https://doi.org/10.3850/978-981-11-2725-0-SL-cd 3. Ching, J., Phoon, K.K.: Measuring similarity between site-specific data and records from other sites. ASCE-ASME J. Risk Uncertainty Eng. Syst., Part A Civ. Eng. 6(2), 04020011 (2020). https://doi.org/10.1061/AJRUA6.0001046 4. Bertsimas, D., Delarue, A., Pauphilet, J.: Beyond impute-then-regress: adapting prediction to missing data (2021). https://www.semanticscholar.org/paper/Beyond-Impute-Then-Reg ress%3A-Adapting-Prediction-to-Bertsimas-Delarue/d92d58e6b461ba503af4b8b1870f13 b1cb7ffa20. Accessed 21 Dec 2022 5. Mehrabani-Zeinabad, K., Doostfatemeh, M., Ayatollahi, S.M.T.: An efficient and effective model to handle missing data in classification. Biomed. Res. Int. 2020, e8810143 (2020). https://doi.org/10.1155/2020/8810143 6. Josse, J., Prost, N., Scornet, E., Varoquaux, G.: On the consistency of supervised learning with missing values ArXiv (2019). https://www.semanticscholar.org/paper/On-the-consis tency-of-supervised-learning-with-Josse-Prost/ad5f2818f76e5fbbf390b37369af7d45a900 efa7. Accessed 21 Dec 2022 7. van Buuren, S., Groothuis-Oudshoorn, K.: Mice: multivariate imputation by chained equations in R. J. Stat. Soft. 45(3) (2011). https://doi.org/10.18637/jss.v045.i03 8. Stekhoven, D.J., Buhlmann, P.: MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1), 112–118 (2012). https://doi.org/10.1093/bioinform atics/btr597 9. Morvan, M.L., Josse, J., Scornet, E., Varoquaux, G.: What’s a good imputation to predict with missing values? ArXiv (2021). https://www.semanticscholar.org/paper/What’s-a-goodimputation-to-predict-with-missing-Morvan-Josse/c9aae8aaa2b19394faacb8c91d5e3e194 7224b98. Accessed 21 Dec 2022 10. 
Yoon, J., Jordon, J., Schaar, M.: GAIN: Missing Data Imputation using Generative Adversarial Nets. ArXiv (2018) https://www.semanticscholar.org/paper/GAIN%3A-Missing-DataImputation-using-Generative-Nets-Yoon-Jordon/a89f0a78f86077864e108a1bd2c4e670c 85907f8. Accessed 21 Dec 2022 11. Therneau, T.M., Atkinson, E.J., Foundation, M.: An Introduction to Recursive Partitioning Using the RPART Routines (2022) 12. Ke, G., et al.: LightGBM: a highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems 30 (2017). https://papers.nips.cc/paper/2017/hash/ 6449f44a102fde848669bdd9eb6b76fa-Abstract.html. Accessed 05 Jan 2023 13. Twala, B.E.T.H., Jones, M.C., Hand, D.J.: Good methods for coping with missing data in decision trees. Pattern Recogn. Lett. 29(7), 950–956 (2008). https://doi.org/10.1016/j.patrec. 2008.01.010 14. Hothorn, T., Zeileis, A.: partykit: a modular toolkit for recursive partytioning in R. J. Mach. Learn. Res. 16(118), 3905–3909 (2015) 15. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco California USA, pp. 785–794 (2016). https://doi.org/10.1145/2939672.2939785 16. Fikret Kurnaz, T., Kaya, Y.: The comparison of the performance of ELM, BRNN, and SVM methods for the prediction of compression index of clays. Arab. J. Geosci. 11(24), 1–14 (2018). https://doi.org/10.1007/s12517-018-4143-9 17. Wang, Y., Cao, Z.: Probabilistic characterization of Young’s modulus of soil using equivalent samples. Eng. Geol. 159, 106–118 (2013). https://doi.org/10.1016/j.enggeo.2013.03.017 18. Wang, Y., Akeju, O.V.: Quantifying the cross-correlation between effective cohesion and friction angle of soil from limited site-specific data. Soils Found. 56(6), 1055–1070 (2016). https://doi.org/10.1016/j.sandf.2016.11.009
19. Wang, Y., Zhao, T.: Bayesian assessment of site-specific performance of geotechnical design charts with unknown model uncertainty. Int. J. Numer. Anal. Meth. Geomech. 41(5), 781–800 (2017). https://doi.org/10.1002/nag.2658 20. Shi, C., Wang, Y.: Nonparametric and data-driven interpolation of subsurface soil stratigraphy from limited data using multiple point statistics. Can. Geotech. J. 58(2), 261–280 (2021). https://doi.org/10.1139/cgj-2019-0843 21. Wang, Y., Guan, Z., Zhao, T.: Sample size determination in geotechnical site investigation considering spatial variation and correlation. Can. Geotech. J. 56(7), 992–1002 (2019). https:// doi.org/10.1139/cgj-2018-0474 22. Jong, S.C., Ong, D.E.L., Oh, E.: State-of-the-art review of geotechnical-driven artificial intelligence techniques in underground soil-structure interaction. Tunn. Undergr. Space Technol. 113, 103946 (2021). https://doi.org/10.1016/j.tust.2021.103946 23. Abbaszadeh Shahri, A.: An optimized artificial neural network structure to predict clay sensitivity in a high landslide prone area using piezocone penetration test (CPTu) data: a case study in Southwest of Sweden. Geotech. Geol. Eng. 34(2), 745–758 (2016). https://doi.org/ 10.1007/s10706-016-9976-y 24. Zhang, W., Wu, C., Zhong, H., Li, Y., Wang, L.: Prediction of undrained shear strength using extreme gradient boosting and random forest based on Bayesian optimization. Geosci. Front. 12(1), 469–477 (2021). https://doi.org/10.1016/j.gsf.2020.03.007 25. Mohammadzadeh S, D., Bolouri Bazaz, J., Vafaee Jani Yazd, S.H., Alavi, A.H.: Deriving an intelligent model for soil compression index utilizing multi-gene genetic programming. Environ. Earth Sci. 75(3), 1–11 (2015). https://doi.org/10.1007/s12665-015-4889-2 26. Wang, M., Wang, E., Liu, X., Wang, C.: Topological graph representation of stratigraphic properties of spatial-geological characteristics and compression modulus prediction by mechanism-driven learning. Comput. Geotech. 153, 105112 (2023). 
https://doi.org/10.1016/j.compgeo.2022.105112 27. Ding, Y., Simonoff, J.S.: An investigation of missing data methods for classification trees applied to binary response data. J. Mach. Learn. Res. 11, 131–170 (2010) 28. Kapelner, A., Bleich, J.: Prediction with missing data via Bayesian additive regression trees. Can. J. Statist. 43(2), 224–239 (2015). https://doi.org/10.1002/cjs.11248 29. Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976). https://doi.org/10.1093/biomet/63.3.581 30. Tierney, N.J., Harden, F.A., Harden, M.J., Mengersen, K.L.: Using decision trees to understand structure in missing data. BMJ Open 5(6), e007450 (2015). https://doi.org/10.1136/bmjopen-2014-007450 31. Zhang, P., Yin, Z.-Y., Jin, Y.-F., Chan, T.H.T., Gao, F.-P.: Intelligent modelling of clay compressibility using hybrid meta-heuristic and machine learning algorithms. Geosci. Front. 12(1), 441–452 (2021). https://doi.org/10.1016/j.gsf.2020.02.014
Music Emotion Recognition Using Multi-head Self-attention-Based Models
Yao Xiao1, Haoxin Ruan1, Xujian Zhao1(B), Peiquan Jin2, and Xuebo Cai3
1 Southwest University of Science and Technology, Mianyang 621010, Sichuan, China
[email protected]
2 University of Science and Technology of China, Hefei 230026, Anhui, China 3 Sichuan University of Culture and Arts, Mianyang 621000, Sichuan, China
Abstract. Music Emotion Recognition (MER) has been a major challenge in Music Information Retrieval (MIR) and is essential in many fields, such as music psychotherapy, individualized instruction, and music recommendation. Some existing approaches aim to extract emotion-related features through deep neural networks. Unfortunately, these methods perform poorly in outputting music emotion continuously, which is important for emotion perception. More recently, self-attention-based models have been proposed to learn emotion-related information by determining which part should be the focus and ignoring irrelevant and noisy data. However, since music emotion has much multi-dimensional semantic information, they cannot fully excavate the multi-dimensional implicit semantic information of the feature sequence and will ignore some of the emotion-related information. Aiming at addressing these issues, we present a new approach for MER that can extract continuous features and then effectively excavate multi-dimensional information sensitive to emotion. Firstly, the study suggests a neural network-based method to extract continuous features, intending to mine emotion-abundant features. After that, we design a classifier with a multi-head self-attention-based model to excavate multi-dimensional information for music emotion recognition. Finally, we conduct experiments on a real dataset, EMOPIA. Experimental results show that our methods surpass the baselines on the corresponding modalities (symbolic and acoustic) by 6.1% and 4.4% in accuracy on the same dataset, which verifies the superiority of our method. Keywords: Music emotion recognition · Arousal-Valence · Deep learning · Short-chunk CNN · Multi-head self-attention
1 Introduction

Music emotion recognition has been a major research subject in MIR (Music Information Retrieval) [1], especially when building an automatic MER (Music Emotion Recognition) system. Emotions are one of the primary motivators for people to engage and interact with music. In addition, as such systems have the potential to make individualized music instruction more accessible, the importance of MER systems is becoming more apparent.

Y. Xiao and H. Ruan—These authors contributed equally to this work.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 101–114, 2023. https://doi.org/10.1007/978-981-99-4752-2_9
Y. Xiao et al.
Deep learning methods have become increasingly popular in recent years. However, they still face many challenges and limitations when performing music emotion recognition. For example, Convolutional Neural Networks (CNNs) have difficulty recognizing emotion continuously, and the temporal nature of music emotion is not fully considered in Long Short-Term Memory (LSTM) models, which also neglect the influence of locally critical information on music emotion. In other words, these models do not perform well at resolving the problem of continuity recognition, which is essential for emotion perception. Therefore, they are not promising for MER tasks.

Like other kinds of multimedia, such as pictures, videos, and audio, music contains rich multi-dimensional semantic information. So far, self-attention-based models have been proposed to learn the emotion-related information of music by letting the model focus on different parts of the sequence to determine which part should be the focus [19]. Such models can help attend to emotional expression in music and ignore irrelevant and noisy information. However, when faced with the MER task, they cannot fully excavate the multi-dimensional implicit semantic details of the feature sequence and will ignore some of the emotion-related information. More specifically, they can only pay attention to part of the emotion-related information and dismiss the rest, as Fig. 1 shows.
Fig. 1. How multi-head self-attention (b) works by noting multi-dimensional semantic information compared with self-attention (a).
Generally, there are two main challenging issues when formulating MER models, especially for piano music emotion recognition in our paper: (1) How can we extract music features that contain more continuous information than current methods do? (2) How can we capture rich multi-dimensional emotional information between different parts of the feature sequence and identify the multiple dimensions of emotional expression?

This study presents a novel approach to address these challenges. Firstly, we propose a learning-based method to extract continuous features, intending to excavate emotion-abundant features. After that, we design a classifier with the Multi-Head Self-Attention (MHSA) based model to excavate multi-dimensional emotion-related information.
Music Emotion Recognition Using Multi-head Self-attention
In summary, we make the following contributions in this paper:

(1) Aiming to extract continuous information sensitive to emotion, we design feature extractors according to the characteristics of different neural networks. Specifically, Short-chunk CNN-based and Bidirectional LSTM (BiLSTM) based music feature extractors are presented for the acoustic and symbolic domains, respectively.
(2) Aiming to capture abundant multi-dimensional emotional information between different parts of the music feature and identify multiple aspects of emotional expression, we exploit a multi-head self-attention-based model with a varying number of heads to strengthen the model's ability to recognize music emotions.
(3) We conduct experiments on EMOPIA [11], a real dataset with four-category emotion labels. The experimental results show that our methods outperform the baselines: for the audio and MIDI branches, our method improves accuracy over the baseline by 4.4% and 6.1%, respectively. Additionally, we develop a comprehensive benchmark for both the MIDI and audio music formats through reliable ablation experiments to evaluate which settings boost accuracy to the maximum extent and to provide valuable insights into the performance.
2 Related Works

We divide previous work on music emotion recognition into the following three categories.

2.1 Symbolic Music vs. Acoustic Music

Symbolic music representations use abstract representations of musical elements such as notes, rhythms, chords, and melodies and do not involve audio signals. This makes them easy to process, modify, and use across different platforms. Symbolic music is typically represented in a MIDI file with text-like characteristics, making it possible to apply Natural Language Processing (NLP) models like BERT and GPT for improved results. Ferreira [8] used a combination of LSTM and GPT-2 to deal with the issue of emotion classification, while Qiu [17] proposed the MIDIGPT model based on earlier work [7].

Open datasets have traditionally been limited by copyright laws, which interfere with training MER models because researchers are deterred from sharing the raw audio material needed for training and evaluation [1]. Proper solutions for this situation include providing metadata for the datasets and preprocessed audio features like MIDI.

2.2 Deep Learning Methods for MER

With the development of deep learning, the accuracy of deep learning methods for identifying music emotions has greatly improved in recent years. In music and speech recognition, researchers use neural networks such as CNNs to learn high-level features from audio data [1]. More recently, LSTM networks have gained significant attention in video, speech, and music recognition [5, 8, 17]. They were introduced to address the long-term dependency issue of RNNs. Meanwhile, the AVEC 2017 competition winners showed that
LSTM-RNN could reduce feature engineering efforts and improve performance for continuous emotion recognition [5]. BiLSTM is an extension of LSTM that processes sequences in both forward and backward directions, allowing it to capture contextual information from both the past and the future. As the internal correlations in music emotion are strong and the current state depends not only on the previous but also on the future state, the BiLSTM network is an ideal solution to this issue [10].

2.3 Attention Mechanism for MER

Self-attention models are effective in MER as they can weigh each position in an input sequence and learn the dependence relationships between different time steps. The Multiscale Context-based Attention model (MCA) proposed by [4] fuses different time scales with attention to dynamically model the representation of music structure, and experiments show its effectiveness. The attentive LSTM of [3] incorporates a modified attention mechanism that considers only the hidden states before a certain moment, resulting in significantly improved results compared to existing literature. These innovative approaches highlight the potential of self-attention models in MER.

Traditional attention models are designed to rely heavily on external information [4], and they may not fully account for music's complex and diverse emotional characteristics. Generally, existing studies do not perform well at resolving the problem of continuity recognition, which is important in music emotion recognition. Besides, current attention models used in MER cannot capture abundant multi-dimensional emotional information or identify the multiple dimensions of emotional expression. In this paper, our proposal extracts continuous features more efficiently and then excavates multi-dimensional emotion-related information, making it comprehensive for MER.
3 Methodology As Figure 2 shows, our method for MER consists of three consecutive tasks: Preprocessing, Feature Extraction, and Emotion Recognition.
Fig. 2. An overview architecture of MER
3.1 Preprocessing

First, music data are preprocessed to obtain general representations, which can be used as the input of DNNs for the following tasks. In this paper, we model the music data through feature representation to obtain Mel-spectrogram representations from audio files and MIDI-like [15] representations from MIDI files.

Mel-Spectrogram Feature Representation. Audio features are the earliest and most widely studied representations in the MER field, mainly extracted from waveform files [9]. The most commonly used timbre representation is the Mel-spectrogram, which has a nonlinear correspondence with Hz frequency. It is widely used in speech recognition and MIR tasks because it provides a more meaningful representation of the spectral content of sound signals. Our implementation uses TorchAudio to obtain the Mel-spectrogram with 128 bins, using a 1024-point FFT (with a Hanning window) and a 512 hop size at a 22,050 Hz sampling rate. We randomly sample three-second audio chunks to generate input matrices for the following model.

Transcription to MIDI and MIDI-Related Feature Representation. The MIDI files are transcribed from the original audio using the high-resolution piano transcription model developed by [12]. After acquiring the MIDI files, we manually obtain the MIDI-like [15] representation using TorchAudio. Each MIDI file is then converted into a vector, denoted as V^(1×len), where the length of the vector depends on the MIDI file.
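As a hedged, framework-free illustration of these Mel-spectrogram settings (128 mel bins, a 1024-point FFT with a Hann-family window, hop size 512, 22,050 Hz sampling rate), the sketch below mirrors what TorchAudio computes; all helper names are our own, and the triangular-filterbank construction is a common simplification rather than the paper's exact code:

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 22050, 1024, 512, 128

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank():
    # Triangular filters spaced evenly on the mel scale up to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(SR / 2), N_MELS + 2)
    bins = np.floor((N_FFT + 1) * mel_to_hz(mel_pts) / SR).astype(int)
    fb = np.zeros((N_MELS, N_FFT // 2 + 1))
    for i in range(N_MELS):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(y):
    # Frame the signal, window it, take the power spectrum, apply mel filters.
    win = np.hanning(N_FFT)
    n_frames = 1 + (len(y) - N_FFT) // HOP
    frames = np.stack([y[t * HOP:t * HOP + N_FFT] * win
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, 513)
    return mel_filterbank() @ power.T                 # (N_MELS, n_frames)

chunk = np.random.randn(SR * 3)  # stand-in for a random 3-second audio chunk
S = mel_spectrogram(chunk)
print(S.shape)  # (128, 128): 128 mel bins x 128 frames for 3 s of audio
```

The resulting matrix matches the Q^(128×len) input shape described in Sect. 3.2, with len determined by the sample rate, FFT window, and hop size.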
3.2 Feature Extraction

After that, we train feature extraction networks to extract continuous features abundant in emotional information. Specifically, we leverage these representations in modality-specific networks, enabling independent extraction of music emotion from different inputs. In detail, we introduce two DNNs as continuous feature extractors to extract emotion-abundant features for the audio-domain and MIDI-domain representations, respectively.

Short-Chunk CNN-Based Feature Extraction for Acoustic Music. The proposed feature extraction method utilizes the Short-chunk CNN [21] architecture, shown in Fig. 3, which early experiments have demonstrated to be strongly robust against noise and other changes [21]. It comprises 7-layer CNNs with residual connections, and the output is a fixed 128-dimensional vector obtained through max pooling, followed by two fully connected layers with ReLU activation for classification. Notably, using 3 × 3 small filters allows for a deeper network with a wider field of vision and the extraction of more detailed music information, contributing to the effectiveness and efficiency of the feature extraction process [20]. Consequently, we adopt the Short-chunk CNN as our backbone model for audio-domain feature extraction.

To get continuous implicit semantic features, we feed the Mel-spectrogram matrices, denoted as Q^(128×len), into the backbone model. The length of the matrix depends on the audio file's sample rate and the window size of the FFT from
Fig. 3. Short-chunk CNN network for feature extraction (c/s/p stands for channel/stride/padding, respectively)
preprocessing. After that, we obtain a 512-dimensional vector V^(512×1) for the following MHSA-based classifier.

BiLSTM-Based Feature Extraction for Symbolic Music. We also design a symbolic-domain method similar to the audio-domain one. Considering the relationship between emotion and performance time, we use the BiLSTM in Fig. 4 to model the feature matrix W from MIDI-like representations; the output dimension of W is l × 512, where l is the length of a MIDI-like file.
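A minimal NumPy sketch of this bidirectional extraction, assuming a hidden size of 256 per direction so the concatenated output has the 512 feature columns described above (weights here are random stand-ins, not trained parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(X, W, U, b):
    """Single-layer LSTM over X: (T, d_in) with hidden size H.
    W: (d_in, 4H), U: (H, 4H), b: (4H,). Returns all hidden states (T, H)."""
    T = X.shape[0]
    H = U.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    out = np.zeros((T, H))
    for t in range(T):
        z = X[t] @ W + h @ U + b
        i = sigmoid(z[:H])          # input gate
        f = sigmoid(z[H:2 * H])     # forget gate
        g = np.tanh(z[2 * H:3 * H]) # candidate cell state
        o = sigmoid(z[3 * H:])      # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
        out[t] = h
    return out

def bilstm(X, params_fwd, params_bwd):
    # Concatenate a forward pass with a backward pass over the reversed
    # input, so each step sees both past and future context (cf. Sect. 2.2).
    fwd = lstm_forward(X, *params_fwd)
    bwd = lstm_forward(X[::-1], *params_bwd)[::-1]
    return np.concatenate([fwd, bwd], axis=1)  # (T, 2H)

rng = np.random.default_rng(1)
d_in, H, T = 64, 256, 32
make = lambda: (rng.standard_normal((d_in, 4 * H)) * 0.05,
                rng.standard_normal((H, 4 * H)) * 0.05,
                np.zeros(4 * H))
W_out = bilstm(rng.standard_normal((T, d_in)), make(), make())
print(W_out.shape)  # (32, 512): an l x 512 feature matrix, as in the text
```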
3.3 Emotion Recognition

Last, these features are learned by an MHSA-based classifier with a feed-forward layer and then fully connected layers for music emotion recognition. In this paper, we incorporate a multi-head self-attention layer, along with a feed-forward block and two dense layers, to classify the music data. Furthermore, the design enables adjusting the number of heads used in the attention block for different dimensions of music emotion. By capturing multiple aspects of emotional expression through attending to different dimensions of musical emotion in each head, our proposal enhances the accuracy and robustness of MER models. In more detail, by utilizing multiple heads, abundant multi-dimensional emotional information can be captured between different parts of the music feature, identifying multiple aspects of emotional expression. Additionally, the multiple heads allow our classifier to attend to several aspects of emotional expression in parallel, enabling more efficient and effective modeling of music emotions.

We feed the deep-semantic features extracted in the previous steps into the MHSA-based classifier to predict their emotion class. Specifically, the methods armed with our MHSA-based classifier are called SCMA and BiLMA for the Short-chunk CNN-based and BiLSTM-based feature extractors, respectively, as illustrated in Fig. 5. Because MIDI files are not sliced into fixed lengths, the dimensions of the matrix W generated by BiLSTM vary. As a result, the dimensions of each W_z matrix generated from W by the multi-head self-attention layer also vary. The dimensions of W_z and W both equal batchsize × n × m, where m differs across batches. To ensure consistency across all dimensions of the features, we apply Eq. 1 to each batch, so that W_z and W become M with dimension n × n:

M = W · W_z^T    (1)
Fig. 4. BiLSTM network for feature extraction
Fig. 5. Structure of SCMA/BiLMA.
Finally, after modeling feed-forward and fully connected layers, the emotion classification is performed, as depicted in Fig. 5.
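To make the attention block concrete, here is a hedged NumPy sketch of multi-head self-attention over a length-16 sequence of 512-dimensional features. The four heads match Table 2's setting, but the random weights and helper names are illustrative, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads=4):
    """X: (seq_len, d_model). Each head attends over its own
    d_model // n_heads slice of the projected queries/keys/values,
    so different heads can focus on different emotional dimensions."""
    L, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(dh)  # (L, L) attention map
        heads.append(softmax(scores) @ V[:, s])     # (L, dh) per head
    return np.concatenate(heads, axis=1) @ Wo       # (L, d) fused output

rng = np.random.default_rng(0)
d = 512  # feature dimension produced by the extractors in Sect. 3.2
X = rng.standard_normal((16, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.02 for _ in range(4))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads=4)
print(out.shape)  # (16, 512)
```

Each of the four heads produces its own (L, L) attention map, which is the multi-dimensional behavior sketched in Fig. 1(b).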
4 Experiments and Results

To evaluate the performance of our proposal, we experiment on the EMOPIA dataset. In this section, we discuss the performance evaluation of our algorithm.

4.1 Dataset

The EMOPIA dataset [11] is a piano music dataset for symbolic and acoustic music emotion recognition. The emotion labels are assigned using the 4Q model [18], which consists of Q1, Q2, Q3, and Q4. Note that the audio files are gathered from the Internet via the provided metadata. Unfortunately, we could only access 845 audio clips out of the total 1,087 clips for our research due to copyright limitations. While this may have impacted the baseline accuracy of our study, we took great care to reproduce the baseline work for music emotion recognition using the remaining 845 clips. However, the repository includes complete MIDI-format files. The MIDI files in the EMOPIA dataset are transcribed from the original audio using the high-resolution piano transcription model developed by [12]. The dataset creators manually checked the transcription results for a random set of clips and found the accuracy in note pitch, velocity, and duration satisfactory. After that, songs with engineered ambient effects were removed from the collection, as the resulting transcription could be fragmented and undesirable. More information about the EMOPIA dataset is summarized in Table 1.

4.2 Experimental Settings

The modules armed with our proposed MHSA-based classifier, SCMA for audio and BiLMA for MIDI, share the configuration shown in Table 2. We evaluate the models on the validation set after each training epoch. Training is stopped early when the validation accuracy for emotion recognition shows no improvement for T consecutive epochs, where T = 0.05 × N and N denotes the maximum number of training epochs. The checkpoint achieving the best accuracy in the validation
Table 1. Summary of EMOPIA [11].

| INFO | EMOPIA |
|---|---|
| Number of MIDI | 1,087 |
| Number of mp3 | 845 clips used (in 1,087) |
| Emotional Label | 4Q taxonomy |
| Train-validation-test splits | 7:2:1 |
| Source | Youtube |
| Piano Music Type | pop and multicultural |
| Single Duration | About 30 s |
Table 2. Hyper-parameters in detail.

| Parameters in detail | Detail |
|---|---|
| optimizer | Adam |
| weight decay | 1e-4 |
| max epoch | 200 |
| global seed (SCMA/BiLMA) | 42/43 |
| learning rate (SCMA/BiLMA) | 1e-4/1e-3 |
| batch size (SCMA/BiLMA) | 32/8 |
| number of heads (SCMA/BiLMA) | 4/4 |
set during the training procedure is saved and evaluated on the test set. All experiments are repeated with the same random seed for different numbers of epochs.

Evaluation Metrics. The performance of our model is measured in terms of F1-score and AUROC (Area Under the Receiver Operating Characteristic curve) [2]. The F1-score, defined by Eq. 4, is a performance metric that summarizes the balance between a classifier's precision and recall in a single value. In Eqs. 2, 3, and 5, TP stands for True Positive, the number of positive instances correctly predicted as positive by the classification model; similarly, FN stands for False Negative, FP for False Positive, and TN for True Negative. These terms are essential for understanding the calculation of various classification metrics and evaluating the performance of classification models.

Precision = TP / (TP + FP)    (2)

Recall = TP / (TP + FN)    (3)

F1-score = 2 × Precision × Recall / (Precision + Recall)    (4)
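As a hedged illustration (the helper names are ours, not the paper's code), these metrics can be computed directly from confusion counts and prediction scores; AUROC is computed here via its rank interpretation, the probability that a randomly chosen positive is scored above a randomly chosen negative:

```python
# F1 from confusion counts per Eqs. (2)-(4), and a rank-based AUROC.
def f1_from_counts(tp, fp, fn):
    precision = tp / (tp + fp)  # Eq. (2)
    recall = tp / (tp + fn)     # Eq. (3)
    return 2 * precision * recall / (precision + recall)  # Eq. (4)

def auroc(scores, labels):
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(round(f1_from_counts(8, 2, 2), 6))          # 0.8
print(auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0
```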
AUROC is a performance metric for classification tasks that evaluates a model's ability to distinguish between classes by analyzing its performance at multiple thresholds, unlike fixed-threshold measures such as F1-score. AUROC is calculated from the relationship between the True Positive Rate (TPR) and the False Positive Rate (FPR) at different classification thresholds, as shown in Eq. 5.

TPR = TP / (TP + FN),    FPR = FP / (FP + TN)    (5)
AUROC provides a comprehensive measure of a model's classification ability by considering its performance at different thresholds. Therefore, our study employs both AUROC and F1-score, along with accuracy, as evaluation metrics.

4.3 Performance of MHSA-Based Models for MER

Table 3 summarizes the performance of our MER model for the audio branch compared to the baseline [21] without an MHSA-based classifier. Our SCMA model outperforms the baseline model with an accuracy of 0.714, an F1-score of 0.712, and an AUROC score of 0.933. In contrast, the baseline model achieves an accuracy of 0.670, an F1-score of 0.634, and an AUROC score of 0.902. This suggests that our SCMA model can effectively capture relevant information from features for emotion recognition tasks. Moreover, the MHSA-based classifier enhances its ability to focus on essential emotion information from different perspectives.

Table 3. Comparison between our audio domain model and Short-chunk CNN (baseline) [21].

| Audio branch models | acc | f1 | auroc |
|---|---|---|---|
| Short-chunk CNN [21] | 0.670 | 0.634 | 0.902 |
| SCMA (our method for audio music) | 0.714 | 0.712 | 0.933 |
Table 4 shows that the BiLMA model for the symbolic domain outperforms all current models on the EMOPIA dataset in accuracy, improving by 6.1% over the baseline model. It is worth noting that the F1-score of MT-MIDIBERT (2022) [17] marginally outperforms our method. However, it should be acknowledged that pre-trained models like BERT and GPT require significantly larger training datasets and entail higher time overheads than our lightweight model, which can be trained in a few minutes on a normal GPU without pre-training. While the small performance gains of pre-trained models come at a high cost, our model offers a more efficient and practical solution for certain applications. Further, Fig. 6 shows the detailed classification results of typical models.
Table 4. Comparison between existing symbolic domain approaches.

| MIDI branch models | acc | f1 |
|---|---|---|
| LSTM-Attn (Baseline) [14] | 0.647 | 0.563 |
| SVM (2013) [13] | 0.477 | 0.476 |
| SVM (2013) [16] | 0.398 | 0.362 |
| MIDIGPT (2020) [8] | 0.587 | 0.572 |
| MIDIBERT-Piano (2021) [6] | 0.634 | 0.628 |
| MT-MIDIGPT (2022) [17] | 0.625 | 0.611 |
| MT-MIDIBERT (2022) [17] | 0.676 | 0.664 |
| BiLMA (our method for symbolic music) | 0.708 | 0.631 |
4.4 Ablation Experiments

In this section, we conduct experiments on the EMOPIA dataset to study the impact of the network architecture, MIDI-related features, number of heads, and training epochs.

Verify the Superiority of our Feature Extraction Networks. As Table 5 shows, our model with the MIDI-like feature achieves the best accuracy against REMI [10]. Regarding accuracy, our model for the symbolic branch outperforms the baseline model on both MIDI-like and REMI features. However, the improvement is more significant for MIDI-like features, where the BiLMA model outperforms the baseline model by 6.1%. For the REMI feature, the BiLMA model shows a gain of 5.7% over the baseline model.

Table 5. Comparison of different MIDI features.

| MIDI Models | features | acc | f1 |
|---|---|---|---|
| LSTM-Attn [14] (baseline) | remi | 0.583 | 0.481 |
| LSTM-Attn [14] (baseline) | midi-like | 0.647 | 0.563 |
| BiLMA (ours) | remi | 0.640 | 0.537 |
| BiLMA (ours) | midi-like | 0.708 | 0.631 |
As Table 6 shows, when our audio branch network's architecture changes, accuracy drops to some extent. Besides, the SCMABiL model shows only marginal performance improvements compared to the baseline model, indicating that the BiLSTM layer may add complexity to the model without providing significant benefits.

Verify the Superior Training Procedure of our Networks. Table 7 shows how the number of heads and the number of training epochs influence the SCMA model's emotion recognition results. The results indicate that increasing the number of heads from 2 to 4 improves the model's performance on all three metrics. When the head number equals four, and
Fig. 6. The result of baseline models for audio and MIDI is shown in (a) and (c), respectively. The performance of models incorporated with a multi-head self-attention classifier is shown in (b) and (d). The audio model effectively recognizes Q2 but less for Q3, while the MIDI model exhibits the opposite trend.
Table 6. Comparison of audio models: Impact of different combinations.

| Audio Model | acc | f1 | auroc |
|---|---|---|---|
| Short-chunk CNN (SC) [21] (baseline) | 0.670 | 0.634 | 0.902 |
| SC + MA (SCMA) (ours) | 0.714 | 0.712 | 0.933 |
| SC + MA + BiLSTM (SCMABiL) | 0.690 | 0.677 | 0.898 |
the epoch equals 118, it reaches the best accuracy. Moreover, even when the head number equals 8, the lowest accuracy effectively surpasses the baseline shown in Table 3.

Table 7. Influence of the number of heads and training epochs on the SCMA model.

| Number of heads | acc | f1 | auroc |
|---|---|---|---|
| 2 | 0.677 | 0.677 | 0.898 |
| 4 (62 epochs) | 0.705 | 0.728 | 0.918 |
| 4 (118 epochs) | 0.714 | 0.712 | 0.933 |
| 4 (148 epochs) | 0.670 | 0.625 | 0.870 |
| 8 | 0.636 | 0.650 | 0.886 |
| 16 | 0.670 | 0.686 | 0.870 |
Table 8 indicates that increasing the number of training epochs from 60 to 100 significantly improves accuracy and F1-score for the MIDI-like feature. However, increasing the number of epochs further to 250 does not improve the model’s performance.
Table 8. Influence of the number of training epochs on the BiLMA model.

| BiLMA Epoch | acc-remi | f1-remi | acc-midi-like | f1-midi-like |
|---|---|---|---|---|
| 60 | 0.640 | 0.537 | 0.674 | 0.591 |
| 100 | 0.570 | 0.440 | 0.708 | 0.631 |
| 200 (SGD) | 0.571 | 0.440 | 0.697 | 0.609 |
| 250 (SGD) | 0.537 | 0.458 | 0.651 | 0.542 |
4.5 Discussion

Symbolic music is considered a better form for emotion recognition [17] since it intrinsically contains information such as pitch, duration, speed, and severity, which can be used to analyze emotion [9]. Besides, pre-trained models perform well on text-like data, and MIDI is text-like. However, our results show that acoustic music's accuracy is higher than symbolic music's in emotion recognition: compared with the current MIDI branch models in Table 4, our audio accuracy in Table 3 still surpasses that of symbolic music in all metrics. This suggests that, beyond pitch, duration, speed, and severity, there may be other information in audio that determines the emotion of the music.
5 Conclusion

In this paper, we propose an efficient approach for MER. Firstly, a method to extract continuous features is presented, intending to excavate emotion-abundant features. After that, we design a classifier with the MHSA-based model to excavate multi-dimensional information for music emotion recognition. Experimental results demonstrate our proposal's effectiveness: it achieves state-of-the-art performance on the EMOPIA dataset, setting a new benchmark in the field. Based on our approach, future research directions could include multimodal methods incorporating audio, MIDI, and even videos to understand music emotion from various perspectives. Such research may yield valuable insights and contribute to developing more sophisticated MER systems.

Acknowledgments. This paper is supported by the Humanities and Social Sciences Foundation of the Ministry of Education (17YJCZH260), the Sichuan Science and Technology Program (2020YFS0057), and the National Innovation Training Program for Undergraduate Students (202210619023).
References

1. Cañón, J.S.G., et al.: Music emotion recognition: toward new, robust standards in personalized and context-sensitive applications. IEEE Sig. Process. Mag. 38, 106–114 (2021)
2. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 30(7), 1145–1159 (1997)
3. Chaki, S., Doshi, P., Patnaik, P., Bhattacharya, S.: Attentive RNNs for continuous-time emotion prediction in music clips. In: Chhaya, N., Jaidka, K., Healey, J., Ungar, L., Sinha, A.R. (eds.) Proceedings of the 3rd Workshop on Affective Content Analysis (AffCon 2020), co-located with the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), New York, USA, 7 February 2020. CEUR Workshop Proceedings, vol. 2614, pp. 36–46. CEUR-WS.org (2020)
4. Chang, W.H., Li, J.L., Lin, Y.S., Lee, C.C.: A genre-affect relationship network with task-specific uncertainty weighting for recognizing induced emotion in music. In: 2018 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6 (2018)
5. Chen, S., Jin, Q., Zhao, J., Wang, S.: Multimodal multi-task learning for dimensional and continuous emotion recognition. In: Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, AVEC 2017, pp. 19–26. Association for Computing Machinery, New York (2017)
6. Chou, Y.H., Chen, I.C., Chang, C.J., Ching, J., Yang, Y.H.: MidiBERT-Piano: large-scale pre-training for symbolic music understanding. arXiv preprint arXiv:2107.05223 (2021)
7. Fan, Y., Lu, X., Li, D., Liu, Y.: Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, ICMI 2016, pp. 445–450. Association for Computing Machinery, New York (2016)
8. Ferreira, L.N., Lelis, L.H.S., Whitehead, J.: Computer-generated music for tabletop role-playing games. In: Lelis, L., Thue, D. (eds.) Proceedings of the Sixteenth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, AIIDE 2020, Virtual, 19–23 October 2020, pp. 59–65. AAAI Press (2020)
9. Han, D., Kong, Y., Han, J., Wang, G.: A survey of music emotion recognition. Front. Comput. Sci. 16(6), 166335 (2022)
10. Huang, Y.S., Yang, Y.H.: Pop music transformer: beat-based modeling and generation of expressive pop piano compositions. In: Proceedings of the 28th ACM International Conference on Multimedia, MM 2020, pp. 1180–1188. Association for Computing Machinery, New York (2020)
11. Hung, H., Ching, J., Doh, S., Kim, N., Nam, J., Yang, Y.: EMOPIA: a multi-modal pop piano dataset for emotion recognition and emotion-based music generation. In: Lee, J.H., et al. (eds.) Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021, Online, 7–12 November 2021, pp. 318–325 (2021)
12. Kong, Q., Li, B., Song, X., Wan, Y., Wang, Y.: High-resolution piano transcription with pedals by regressing onset and offset times. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 3707–3717 (2021)
13. Lin, Y., Chen, X., Yang, D.: Exploration of music emotion recognition based on MIDI. In: de Souza Britto Jr., A., Gouyon, F., Dixon, S. (eds.) Proceedings of the 14th International Society for Music Information Retrieval Conference, ISMIR 2013, Curitiba, Brazil, 4–8 November 2013, pp. 221–226 (2013)
14. Lin, Z., et al.: A structured self-attentive sentence embedding. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017, Conference Track Proceedings. OpenReview.net (2017)
15. Oore, S., Simon, I., Dieleman, S., Eck, D., Simonyan, K.: This time with feeling: learning expressive musical performance. Neural Comput. Appl. 32(4), 955–967 (2020)
16. Panda, R.E.S., Malheiro, R., Rocha, B., Oliveira, A.P., Paiva, R.P.: Multi-modal music emotion recognition: a new dataset, methodology and comparative analysis. In: 10th International Symposium on Computer Music Multidisciplinary Research (CMMR 2013), pp. 570–582 (2013)
17. Qiu, J., Chen, C., Zhang, T.: Novel multi-task learning method for symbolic music emotion recognition. arXiv preprint arXiv:2201.05782 (2022)
18. Russell, J.A.: A circumplex model of affect. J. Pers. Soc. Psychol. 39, 1161–1178 (1980)
19. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 2017, pp. 6000–6010. Curran Associates Inc., Red Hook (2017)
20. Won, M., Choi, K., Serra, X.: Semi-supervised music tagging transformer. In: Lee, J.H., et al. (eds.) Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021, Online, 7–12 November 2021, pp. 769–776 (2021)
21. Won, M., Ferraro, A., Bogdanov, D., Serra, X.: Evaluation of CNN-based automatic music tagging models. In: Proceedings of the 17th Sound and Music Computing Conference (2020)
Deep Reinforced Active Learning for Time Series Anomaly Detection Haojie Li, Hongzuo Xu, and Wei Peng(B) College of Computer, National University of Defense Technology, Changsha 410073, China {lihaojie,xuhongzuo13,wpeng}@nudt.edu.cn
Abstract. Massive time series data are recorded in various domains. Identifying exceptional data objects within time series can help avoid potential faults, dangers, and accidents, which is significant in maintaining the target system's health and stability. Recent years have witnessed a long list of unsupervised time-series anomaly detection models that, once deployed, can only work without any control. This motivates us to consider an intriguing question: can we devise a model that adaptively and iteratively evolves according to its interaction with human analysts? We intend to install a handle on the model, thus realizing a "human-in-the-loop" learning paradigm. However, this is still a non-trivial task due to the difficulty of (i) accurately exploring valuable data from the unlabeled set as query samples and (ii) fully exploiting the returned annotated data. To tackle these challenges, this paper proposes a novel reinforced active time series anomaly detection algorithm. We first propose to use an ensemble of unsupervised anomaly scoring, and by leveraging the derived anomaly scores, we devise two reward strategies. The learning process is guided by these reward strategies, during which the agent is encouraged to explore possible anomalies hidden in the unlabeled set. These potential anomalies are submitted as queries for human labeling and further exploited in our reward functions to supervise the agent in taking expected actions. Extensive experiments on real-world datasets demonstrate that our method substantially outperforms five state-of-the-art competitors, obtaining 12.1%–60.3% F1-score improvement and 11.5%–59.1% AUC-PR improvement.

Keywords: Active learning · Reinforcement learning · Time series · Anomaly detection
1 Introduction

Time series are ubiquitous in many real-world scenarios, ranging from data centers [1] to spacecraft [2]. Anomaly Detection (AD), a technique that aims at identifying unexpected items or events in data, is often deployed in these scenarios to alarm on potential faults, risks, and accidents of target systems, thus ensuring the health and stability of the systems. Since the volume of time series data is tremendous, and labeling all anomalies in them is laborious and time-consuming, most popular studies focus on unsupervised scenarios.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 115–128, 2023.
https://doi.org/10.1007/978-981-99-4752-2_10
Unsupervised methods [3–6] usually treat the training data as normal and learn a model to identify samples that differ from the training data. However, due to the lack of any prior knowledge about anomalies, many detected samples are data noises, leading to high false alarms or low detection recall [7]. Different from the unsupervised scenario, this paper considers introducing human effort to realize a "human-in-the-loop" learning paradigm based on Active Learning (AL). AL can improve model efficiency and reduce the cost of labeling by selecting the most useful samples to label [8]. With the help of human analysts, AL can continuously absorb knowledge from the unlabeled set. Moreover, since a small amount of labeled data is not difficult to obtain, we also consider utilizing readily accessible labeled data during the cold-start phase. However, adapting AL methods to work well on time series AD tasks is still intractable, which poses two key challenges: (1) how to accurately explore valuable data from the unlabeled set, and (2) how to fully exploit the annotated data. As for the first facet, existing AL-based work [9–11] selects the most uncertain samples in the unlabeled set as queries and submits these samples to humans for labeling. Although this query strategy can help improve the model's performance with limited data, it needs to compute over the whole data set at every step to find the most uncertain samples [8], which incurs significant computational overhead. In addition, since anomalies are rare events, suspected anomalies are more "valuable" to the model than uncertain samples, but the above studies do not consider the anomaly degree of the query samples. As for the second facet, prior works [9, 10] have to re-train the model on the refined training set after receiving the labeled candidates, which also largely increases the computational burden.
To tackle the challenges mentioned above, in this work we propose a novel time series AD model that is trained continually through REinforced Active Learning in the weakly supervised scenario, termed REAL. Specifically, we build an RL model based on Deep Q-Network (DQN) [12] as the time series anomaly detector, which can be trained end-to-end. Besides, we use an ensemble of unsupervised anomaly detectors to provide anomaly scores, which are calculated quickly at the beginning of each episode and can be reused in the following steps. By leveraging these scores, we design two RL reward functions, namely intrinsic and external reward functions. The two reward functions guide the model learning process: (1) the intrinsic reward function encourages the agent to explore suspicious anomalies in the unlabeled set, and meanwhile the explored anomalies are sent to human analysts for annotation; (2) the annotated samples are further exploited by the external reward function to train the agent to perform the expected actions. The contributions of our work are summarized as follows:
• We propose an end-to-end time series AD model based on reinforced active learning. REAL inherits the advantages of both AL and RL, which enables the model to learn interactively with human analysts and to explore potential anomalies in the unlabeled set to update the model's experience.
• To explore suspicious anomalies and make the best use of the labeled data, we propose two novel RL reward functions with the help of the ensemble unsupervised anomaly scores. The learning process is guided by these functions, and the agent learns from both labeled and unlabeled data.
• We conduct experiments on publicly available datasets to compare REAL against five competing anomaly detection methods. The experimental results show that REAL achieves state-of-the-art results in terms of AUC-PR and F1 score.
2 Related Work

2.1 Unsupervised Based Methods

Unsupervised methods normally learn the data normality by assuming all the unlabeled data are normal. When the distance between the test data and the predicted or reconstructed value exceeds a pre-defined threshold, the data point is regarded as an anomaly. Traditional unsupervised methods are based on different anomaly measures, such as the partition-based iForest [3], the distance-based KNN [13], and the probability-based COPOD [14], which are simple and fast to train. Owing to the rapid development of deep learning techniques, more advanced neural networks have been adopted to learn complex representations of training data, including Transformers [4] and GNNs [5]. However, unsupervised methods are unable to use human-labeled data even when human knowledge is available, thus leading to sub-optimal performance.

2.2 Active Learning Based Methods

Due to the lack of labeled data, some works propose to gather more supervision by asking queries of human analysts. Görnitz et al. [9] adopt AL to filter candidates for labeling: they query the model's low-confidence decisions to guide users in the labeling process. Huang et al. [10] propose SLA-VAE, which employs a VAE and AL to update the model with a small number of uncertain samples. However, they have to re-train the model upon obtaining feedback from humans. RLAD [11] combines AL and RL and can continuously update the RL model via newly labeled data, but the model can only be trained on labeled data in a supervised manner. The above studies mainly adopt model uncertainty as the query strategy. Since anomalies account for a very small proportion of the unlabeled set, model uncertainty does not consider the abnormal degree of the selected samples, so it provides a large amount of normal data for human labeling, which cannot maximize the utility of AL.
2.3 Weakly Supervised Based Methods

In addition to AL, utilizing partially labeled data for training is also a category of weakly supervised learning [15]. DeepSAD [16] uses deep neural networks to extract anomaly-oriented representations of the training data and devises an objective function that ensures a larger (or smaller) distance between the center of the hypersphere and the labeled anomalous (or normal) samples. DevNet [17] uses the normal distribution as the prior distribution of normality scores and enforces the z-score values of abnormal samples to be far from the mean of the normal distribution. PReNet [7] defines a two-stream ordinal regression neural network to predict anomaly scores of input instance pairs. However, the existence of unlabeled anomalies is ignored in weakly supervised learning, which may
disturb the normality learning process. Moreover, traditional weakly supervised methods are static and cannot obtain up-to-date knowledge from humans; therefore, the model still has room for improvement.
3 The Proposed Approach: REAL

3.1 Problem Statement

Given a dataset D = {S1, ..., ST} that contains T time series, each time series is St = (X, Y), where X = (x1, ..., xN) ∈ R^(N×m) denotes the values, Y ∈ R^(N×1) denotes the labels, and m is the dimension of X. Our goal is to learn a function f(·): D → {0, 1} that outputs whether an observation is an anomaly, where f(xi) = 1 indicates xi is an anomaly and f(xj) = 0 indicates xj is normal.

We adopt an adjacent-window concatenating method to let the model learn temporal dependencies. We split X = (x1, ..., xN) by length w into a list of sub-sequences S = (s1, s2, ...), where st = (xt, ..., xt+w−1). Then we concatenate the w adjacent sub-sequences again as a state in RL: s̃t = (st, ..., st+w−1). We set the label of a state to −1 if it contains no labeled data, to 1 if it contains labeled anomalies, and to 0 otherwise. The state is the smallest anomaly detection unit; we put the labeled states into Da and the unlabeled states into Du, and the final training set is D = {Da, Du}.

3.2 Overall Framework of REAL

The overall framework of REAL is shown in Fig. 1. We construct an RL agent based on DQN, since the traditional DQN architecture suffices for the AD task without more complicated RL models. In the cold start phase, the training set is Da, which contains all labeled data. The agent receives a time series state s̃t from Da, then takes an action at ∈ {0, 1} to determine whether s̃t is anomalous. The external reward function gives r^e_t to the agent, and Da sends the next state s̃t+1 to it. After the cold start process, we train our agent on both labeled and unlabeled data. In the model training stage, when the agent receives a state from Da, it still uses r^e to calculate the reward. When the agent receives a state from Du, it first takes an action and then determines whether it needs to query the human analysts, according to the action and the anomaly scores output by EUAD (the ensemble of unsupervised anomaly detectors). Then the intrinsic reward function r^i is used to calculate the reward. Finally, REAL returns an agent that can be used as an anomaly detector: the agent performs one forward pass in its network and outputs the action for each observation to indicate whether it is an anomaly.
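The windowed state construction of Sect. 3.1 can be sketched in a few lines. This is an illustrative sketch assuming stride-1 sliding sub-sequences over a univariate series; the function and variable names are our own, not the paper's.

```python
import numpy as np

def build_states(x, w):
    """Split a 1-D series into length-w sub-sequences, then concatenate
    w adjacent sub-sequences into one RL state (a w x w window)."""
    n_sub = len(x) - w + 1                      # stride-1 sub-sequences
    subs = np.stack([x[t:t + w] for t in range(n_sub)])
    n_state = n_sub - w + 1                     # states over w adjacent sub-sequences
    states = np.stack([subs[t:t + w] for t in range(n_state)])
    return states                               # shape: (n_state, w, w)

states = build_states(np.arange(10.0), w=3)
print(states.shape)  # (6, 3, 3)
```

Each state thus covers w × w raw points, matching the w × w point count used later in Eq. (4).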
Fig. 1. The Framework of REAL
3.3 Specific Implementation of REAL

Cold Start. In the cold start process, we construct an RL model using an LSTM as the agent network to learn temporal dependencies and feed s̃t = (st, ..., st+w−1) as a state to it. We define the external reward function r^e as follows:

    r^e =   5,  if y = 1, a = 1
           −5,  if y = 1, a = 0
            1,  if y = 0, a = 0
           −1,  if y = 0, a = 1        (1)

where y is the label of state s̃t. Reward r^e takes TP, FN, TN, and FP into account. Note that although there are very few labeled data, after a certain number of training steps, our model gains the ability to preliminarily distinguish normal and abnormal samples.

Model Training. This process is the core of REAL. We train our agent on the whole set D = {Da, Du}. The most important part is how to give rewards in the unlabeled situation, where we can no longer use r^e. To tackle this challenge, we utilize several unsupervised anomaly detectors to provide an ensemble of anomaly scores, and by leveraging the derived scores, we design a new intrinsic reward function r^i as follows:
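The external reward in Eq. (1) translates directly into a small function. A minimal sketch; the function name is illustrative:

```python
def external_reward(y, a):
    """External reward r^e of Eq. (1): large reward/penalty for decisions
    on labeled anomalies (y=1), small ones on labeled normal data (y=0)."""
    if y == 1:
        return 5 if a == 1 else -5   # true positive / false negative
    return 1 if a == 0 else -1       # true negative / false positive

print(external_reward(1, 1), external_reward(0, 1))  # 5 -1
```

The asymmetric magnitudes (±5 vs ±1) reflect the rarity of anomalies: a decision on a labeled anomaly carries five times the weight of one on normal data.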
    r^i =  0.5,  if score < τ, a = 0
             5,  if score > τ, a = 1, AL(s) = 1
            −1,  if score > τ, a = 1, AL(s) = 0
             0,  otherwise               (2)
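A corresponding sketch of the intrinsic reward in Eq. (2); the analyst annotation AL(s) is modeled here as an optional argument, and the names are illustrative:

```python
def intrinsic_reward(score, a, tau, analyst_label=None):
    """Intrinsic reward r^i of Eq. (2). When the EUAD score exceeds tau and
    the agent flags an anomaly (a=1), the state is sent to a human analyst
    and analyst_label (AL(s)) is the returned annotation."""
    if score < tau and a == 0:
        return 0.5                       # agree with the pseudo-normal label
    if score > tau and a == 1:           # suspected anomaly -> query the analyst
        return 5 if analyst_label == 1 else -1
    return 0

print(intrinsic_reward(0.9, 1, 0.7, analyst_label=1))  # 5
```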
120
H. Li et al.
Firstly, we compute the anomaly scores of each state st using EUAD. Then we set a threshold τ and we consider the current state as pseudo-abnormal when its anomaly score is above τ and as pseudo-normal otherwise. In our experiment, we find that the probability that a pseudo-normal is truly normal is greater than the probability that a pseudo-abnormal is truly abnormal. Therefore, we give a small positive reward to the agent when its anomaly score is lower than τ and the agent takes action at = 0. In order to encourage the agent to explore the suspicious anomalies in Du , we need to give a large positive reward if it successfully finds an anomaly. The query strategy is based on both the results of EUAD and the action of the agent. Because the agent has been pretrained in the cold start process, it will take action a = 1 when it encounters a sample similar to labeled anomalies. As for anomalies different from the known anomalies, EUAD will give high scores to them. Then we make queries and submit those data that have high anomaly scores and are considered abnormal by the agent to human analysts for labeling, and give the agent a high reward (5) if they are indeed anomalies, otherwise punish it (−1). With the help of r i , the RL model is inspired to take action a = 1 if the current sample is a suspicious anomaly according to its experience, and the agent will ask queries during the learning and exploring process. Lastly, in the case that the current state is labeled, we still use r e to calculate the reward. Algorithm 1 gives the details of the DQN training process. At the beginning of each episode, we randomly select a time series S from the data set and iterate over all its sliding windows. We compute the anomaly scores of all timestamps, then we take a sliding window from S as a state st , and we compute its Q-value using the main network Q. 
Afterward, we take an action following the epsilon-decay strategy: with probability 1 − ε we choose the action that maximizes Q(s̃t, at; θ), and with probability ε we choose an action randomly. If state s̃t has a label, we compute the reward r^e_t using (1); otherwise, we compute the reward r^i_t using (2). We take the next sliding window as the next state s̃t+1 and store the transition (s̃t, at, rt, s̃t+1) into the experience pool. Then we take a batch of data from the pool and compute the target using Q̂. Every K steps, the target network is updated with the main network's parameters.
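The training loop just described can be sketched as follows. This is a schematic only: a linear Q-network stands in for the paper's LSTM agent, random vectors stand in for real states and rewards, and the hyper-parameter values are illustrative, not the paper's settings.

```python
import random
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
state_dim, n_actions, gamma, K, eps = 4, 2, 0.5, 50, 0.4
theta = rng.normal(size=(state_dim, n_actions)) * 0.1   # main network Q
theta_target = theta.copy()                             # target network Q-hat
pool = deque(maxlen=50_000)                             # experience pool

def q_values(s, w):
    return s @ w                                        # Q(s, .) for both actions

for step in range(200):
    s = rng.normal(size=state_dim)
    # epsilon-greedy: random action with prob eps, else greedy on Q
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(q_values(s, theta)))
    r = rng.normal()                    # stands in for r^e (labeled) or r^i (unlabeled)
    s_next = rng.normal(size=state_dim)
    pool.append((s, a, r, s_next))      # store the transition
    if len(pool) >= 32:
        for bs, ba, br, bs2 in random.sample(list(pool), 32):
            target = br + gamma * np.max(q_values(bs2, theta_target))
            td_error = target - q_values(bs, theta)[ba]
            theta[:, ba] += 1e-3 * td_error * bs        # SGD on the TD error
    if step % K == 0:
        theta_target = theta.copy()     # sync target network every K steps
    eps = max(0.01, eps * 0.99)         # anneal epsilon toward 0.01

print(theta.shape)
```

The structural elements (epsilon annealing, replay pool, periodic target-network sync) mirror the description of Algorithm 1; only the function approximator and data are simplified.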
Ensemble Unsupervised Anomaly Detectors. We choose several classical unsupervised AD algorithms, including iForest [3], KNN [13], and COPOD [14], as EUAD. These methods have been confirmed effective by many papers, and most abnormalities can be detected by them, which ensures the validity of the ensemble scoring results. For all points in the time series S, we calculate the average of the anomaly scores obtained by the three methods:

    score(S) = (1/3)(iforest(S) + knn(S) + copod(S))        (3)
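Equation (3) is a plain average over the three detectors. Because real detector outputs live on different scales (e.g. KNN distances vs. iForest scores), this sketch adds a per-detector min-max normalisation step before averaging; that normalisation is our assumption, not something stated in the paper.

```python
import numpy as np

def ensemble_score(per_detector_scores):
    """Average per-point scores from several detectors (Eq. 3), with a
    min-max normalisation per detector (assumed) so scales are comparable."""
    normalised = []
    for s in per_detector_scores:                 # one score array per detector
        s = np.asarray(s, dtype=float)
        normalised.append((s - s.min()) / (s.max() - s.min() + 1e-12))
    return np.mean(normalised, axis=0)

scores = ensemble_score([[0.1, 0.9, 0.5],        # e.g. iForest scores
                         [10.0, 90.0, 50.0],     # e.g. KNN distances
                         [0.2, 0.8, 0.5]])       # e.g. COPOD scores
print(scores)   # the middle point gets the highest ensemble score
```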
Then we calculate the anomaly score for each state. For example, given a state s̃t = (st, ..., st+w−1), there are a total of w × w points in s̃t, and its anomaly score is defined as follows:

    score(s̃t) = (1/2)( max score(s̃t) + (1/(w × w)) Σ_{i=1}^{w×w} score(xi) )        (4)

First, we find the maximum point score in s̃t and calculate the average score; we then take the mean of the two as the final anomaly score. This ensures we do not miss any possible anomalies. Next, we rank all anomaly scores and use the pth percentile of the anomaly scores as the threshold to convert scores into binary labels. p is a hyper-parameter, and we set p to 70 after several experiments, which means that we believe the proportion of pseudo-normal samples is greater than 70%.
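Equation (4) and the percentile threshold can be sketched on toy data as follows; the data here are random stand-ins, and the names are illustrative:

```python
import numpy as np

def state_score(point_scores):
    """Eq. (4): mix the maximum point score in a state (so no single spike
    is missed) with the mean over its w*w points."""
    return 0.5 * (point_scores.max() + point_scores.mean())

rng = np.random.default_rng(1)
all_states = rng.random((100, 64))            # 100 toy states, w*w = 64 points each
s = np.array([state_score(st) for st in all_states])
tau = np.percentile(s, 70)                    # p = 70, as in the paper
pseudo_abnormal = s > tau                     # states treated as pseudo-abnormal
print(pseudo_abnormal.mean())                 # roughly 30% of states flagged
```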
4 Experiments

In this section, we empirically evaluate the effectiveness of REAL against five competitors. We then conduct an ablation study to illustrate the contribution of key components in REAL, and use parameter tests to analyze the impact of hyper-parameters. We first describe the datasets and experimental setup before detailing the empirical findings.

4.1 Datasets and Evaluation Metrics

Datasets
• Yahoo benchmark dataset [18]. It is publicly released by Yahoo for time series AD research. It consists of four sub-datasets, each containing 1,400–1,700 timestamps with all data labeled. We use the A2, A3, and A4 benchmarks in our experiments and train a model for each benchmark.
• KPI [19]. It is released by the AIOps Challenge Competition and contains dozens of KPI curves with labeled anomaly points; each KPI curve has a different size and pattern. The data were collected every 1 min or 5 min from several large Internet companies including Sogou, Tencent, eBay, etc. Similarly, we train a model for each KPI curve.

Table 1 gives a summary of the adopted datasets.

Table 1. Summary of datasets

                 A2        A3        A4        KPI
  Points         141,900   167,800   167,800   3,004,066
  Anomaly ratio  0.33%     0.56%     0.50%     2.65%
Evaluation Metrics. Following [6, 7, 20], the F1 score and the Area Under the Precision-Recall Curve (AUC-PR) are used as our experimental metrics. We adopt the point-adjustment strategy [21]; specifically, all the points in an anomaly segment are regarded as retrieved if the detection model alerts on any point within this segment. F1 is the harmonic mean of precision and recall. We enumerate all possible F1 values by employing each anomaly score as the threshold and report the best one, as has been done in mainstream previous studies. AUC-PR, on the other hand, represents the average performance by simultaneously considering precision and recall.
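The point-adjustment protocol can be sketched as follows; this is a minimal implementation of the segment-level adjustment as described (the best-F1 threshold search is omitted), with illustrative names:

```python
import numpy as np

def point_adjust(pred, label):
    """If any point inside a true anomaly segment is alerted, the whole
    segment counts as detected (point-adjustment strategy [21])."""
    pred, label = pred.copy(), np.asarray(label)
    i = 0
    while i < len(label):
        if label[i] == 1:                       # start of an anomaly segment
            j = i
            while j < len(label) and label[j] == 1:
                j += 1                          # find the segment's end
            if pred[i:j].any():                 # any alert inside the segment?
                pred[i:j] = 1                   # -> whole segment retrieved
            i = j
        else:
            i += 1
    return pred

label = np.array([0, 1, 1, 1, 0, 0, 1, 0])
pred  = np.array([0, 0, 1, 0, 0, 0, 0, 0])
print(point_adjust(pred, label))  # [0 1 1 1 0 0 0 0]
```

Note that the single-point anomaly at index 6 remains missed because no alert fell inside it.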
• Ensemble of unsupervised AD method (EUAD) is an ensemble method that consists of three unsupervised models including iForest [3], KNN [13], and COPOD [14]. • GDN [5] is an unsupervised anomaly detector that leverages GNN to forecast the value of each timestamp and detect anomalies using the prediction errors. REAL uses a two-layer LSTM with 64 hidden units in each layer as the network in DQN, and the window size is 8. It is trained with 300 episodes in the cold start stage, and 900 episodes in the model training stage, and each episode consists of 1,500 steps. The size of the experience pool is 50,000, and the network starts to learn when the transition in the pool is larger than 10,000. The target network is updated every K = 5,000 steps. Besides, the epsilon greedy exploration is used in the cold start and model training stage, with annealed from 0.4 to 0.01. The batch size is 32, the discount factor γ is set to 0.5, and we set P to 70. We use the Adam optimizer with a learning rate of 0.005 to train our model. All comparison weakly-based models adopt a two-layers TCN network with 128 hidden units followed by a ReLU activation network. The other parameter settings are determined according to the default settings in the corresponding paper. Note that we recorded the number of labels required by REAL, and to be fair, for the weakly supervised methods DeepSAD, DevNet, and PReNet, we provide them with the same amount of labeled data in the training process.
4.3 Experimental Results and Analysis

Table 2 presents the F1 scores and AUC-PR of all models on the Yahoo and KPI datasets. The best results are marked in bold, and the second-best results are underlined. We can see that REAL outperforms all comparison methods on F1 score and AUC-PR. Particularly, in terms of F1 score, REAL on average substantially outperforms the weakly supervised methods DeepSAD (by 18.6%), DevNet (24.0%), and PReNet (17.6%), and obtains a 60.3% improvement over the unsupervised detector EUAD and 12.1% over GDN. In terms of AUC-PR, REAL substantially outperforms the weakly supervised detectors by about 27.6%–41.9%, EUAD by 59.1%, and GDN by 11.5% on average. Notably, the result of REAL on A2 is almost 1.0. This is because REAL can select possible anomalies from the unlabeled set for annotation, and as the number of training steps increases, the agent improves its ability to recognize anomalies. None of the other methods can explore anomalies from the unlabeled set.

Table 2. F1 scores and AUC-PR (PR) performance of all methods

  Methods     A2            A3            A4            KPI
              F1     PR     F1     PR     F1     PR     F1     PR
  DeepSAD     0.789  0.677  0.906  0.908  0.783  0.764  0.731  0.681
  DevNet      0.715  0.535  0.869  0.845  0.809  0.790  0.687  0.616
  PReNet      0.726  0.583  0.883  0.877  0.797  0.782  0.836  0.808
  EUAD        0.537  0.641  0.699  0.634  0.511  0.527  0.658  0.601
  GDN         0.902  0.893  0.835  0.847  0.799  0.868  0.768  0.741
  REAL        0.999  0.999  0.957  0.954  0.922  0.924  0.908  0.931
We show in Table 3 the number of samples that REAL selects for manual annotation during training and the proportion of anomalies among these annotations. On average, the number of labeled samples is 7.57% of the total samples in the Yahoo dataset and 14.9% in KPI. In addition, for the Yahoo dataset, the proportion of anomalies among the labeled samples is more than 10 times the proportion of anomalies in the original set, and for the KPI set it is more than 5 times, indicating that the query strategy in REAL can select more anomalous data for human analysts.

4.4 Ablation Study

To understand the influence of key designs in REAL, we compare REAL with its ablated variants. The F1 score and AUC-PR are demonstrated in Fig. 2.
• w/o cold-start: We remove the cold-start module in REAL and train the model directly on D = {Da, Du}.
• w/o query strategy (w/o qs): We replace REAL's query strategy with margin sampling [22], a commonly used AL query strategy based on model uncertainty.
Table 3. The number of labeled samples and the proportion of anomalies; the number of data points in the original set and the proportion of anomalies in them.

  Dataset  Labeled samples  Anomalies in labeled samples (%)  Total samples in Org  Anomalies in Org (%)
  A2       4,608            154 (3.34%)                       141,900               466 (0.328%)
  A3       5,209            514 (9.87%)                       167,800               943 (0.562%)
  A4       5,098            485 (9.51%)                       167,800               835 (0.498%)
  KPI      16,510           2,466 (14.9%)                     3,004,066             79,554 (2.65%)
Specifically, a0 and a1 are the two actions output by the agent when seeing state s, and we compute the margin M as follows: M = |Qπ(a0, s; θ) − Qπ(a1, s; θ)|. We set a threshold m = 2 for the Yahoo dataset (m = 9 for KPI); when M is smaller than m, we ask queries and compute r^i as follows:

    r^i =  0.5,  if score < τ, a = 0
             5,  if score > τ, M < m, AL(s) = 1, a = 1
            −1,  if score > τ, M < m, AL(s) = 1, a = 0
             0,  otherwise               (5)
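The margin computation used by this ablated variant can be sketched as follows (an illustrative sketch; the function name and inputs are our own):

```python
import numpy as np

def should_query(q_values, m):
    """Margin sampling: query the analyst when the gap between the two
    Q-values, M = |Q(a1) - Q(a0)|, falls below the threshold m."""
    a0, a1 = np.sort(q_values)[-2:]        # the two action values
    return abs(a1 - a0) < m                # small margin = uncertain state

print(should_query(np.array([3.1, 3.8]), m=2.0))   # True  (margin 0.7)
print(should_query(np.array([1.0, 9.0]), m=2.0))   # False (margin 8.0)
```

Unlike REAL's query strategy, the margin says nothing about how anomalous a state is, which is exactly the weakness the ablation is designed to expose.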
Without the cold start process, the F1 scores on A4 decrease slightly by 1%, those on A2 and A3 decrease by 2% to 8.4%, and those on KPI decrease by 30.1%. Additionally, the AUC-PR decreases by 46.8% on KPI and by 3.6% on the Yahoo dataset on average. The results indicate that the cold start process is necessary to allow the model to initially learn the abnormal patterns, especially on the KPI dataset. Using margin sampling results in an average decrease of about 1.6% in the F1 score on the Yahoo dataset and 16.5% on KPI. The margin sampling strategy cannot provide sufficient anomalous samples; therefore, the model cannot achieve F1 scores and PR as high as REAL's.
Fig. 2. Ablation Study Results. F1 scores and AUC-PR performance of REAL and its two ablated versions.
4.5 Parameter Test

We use the Yahoo dataset to investigate the impact of different hyper-parameter settings of REAL on its detection performance. We test four hyper-parameters in REAL, i.e., window size, batch size, training epoch, and P. Figure 3 shows the F1 scores; the AUC-PR performance shows the same trend and is omitted here. With the increase of the window size, the F1 scores on A3 and A4 first rise and then decline. With the increase of the batch size, the performance on A2 decreases slightly at first and then goes up, while the F1 scores on A3 and A4 first increase and then decrease. Increasing the epoch number can improve the overall performance, and we recommend setting the epoch to 700 or even larger than 900. Finally, we set P to 40–80, meaning that we consider the abnormal proportion of the data set to account for 60% down to 20%. When P = 90, the F1 scores on A3 decrease, indicating that most anomalies in A3 have anomaly scores greater than the 80th percentile. When setting P to 40–60, the F1 scores on A4 decrease, and when P is greater than 60, the F1 scores rise, indicating that the anomaly scores of most anomalies in A4 may be greater than the 60th percentile.
Fig. 3. Parameter Test Results. F1 scores of REAL on Yahoo dataset with different hyperparameters (window size, batch size, epochs and P).
5 Conclusion

This paper introduced REAL, a novel reinforced active time series anomaly detection method. Our method harnesses the knowledge of human analysts in the process of time series anomaly detection. We proposed to use ensemble unsupervised anomaly scoring to devise two reward functions for the RL agent, helping the agent learn in the weakly supervised scenario and explore suspicious anomalies in the unlabeled set while utilizing the labeled data to update its experience. Extensive experiments show that REAL achieves 12.1%–60.3% F1 score improvement and 11.5%–59.1% AUC-PR improvement over its comparison methods, with a minimal amount of labeled data. In the future, we
plan to explore more applications of REAL, such as its practicality in medicine [23], education [24], and AIOps [25].

Acknowledgments. This work is supported by the National Natural Science Foundation of China (No. 61972412).
References

1. Yu, F., Xu, H., Jian, S., Huang, C., Wang, Y., Wu, Z.: DRAM failure prediction in large-scale data centers. In: JCC, pp. 1–8 (2021)
2. Hundman, K., Constantinou, V., Laporte, C., Colwell, I., Soderstrom, T.: Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding. In: KDD, pp. 387–395 (2018)
3. Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data (TKDD) 6(1), 1–39 (2012)
4. Tuli, S., Casale, G., Jennings, N.R.: TranAD: deep transformer networks for anomaly detection in multivariate time series data. Proc. VLDB Endow. 15(6), 1201–1214 (2022)
5. Deng, A., Hooi, B.: Graph neural network-based anomaly detection in multivariate time series. In: AAAI, pp. 4027–4035 (2021)
6. Xu, H., Pang, G., Wang, Y., Wang, Y.: Deep isolation forest for anomaly detection. IEEE Trans. Knowl. Data Eng., 1–14 (2023)
7. Pang, G., Shen, C., Jin, H., van den Hengel, A.: Deep weakly-supervised anomaly detection. arXiv preprint arXiv:1910.13601 (2019)
8. Zhan, X., Wang, Q., Huang, K., Xiong, H., Dou, D., Chan, A.B.: A comparative survey of deep active learning. arXiv preprint arXiv:2203.13450 (2022)
9. Görnitz, N., Kloft, M., Rieck, K., Brefeld, U.: Toward supervised anomaly detection. J. Artif. Intell. Res. 46, 235–262 (2013)
10. Huang, T., Chen, P., Li, R.: A semi-supervised VAE based active anomaly detection framework in multivariate time series for online systems. In: WWW, pp. 1797–1806 (2022)
11. Wu, T., Ortiz, J.: RLAD: time series anomaly detection through reinforcement learning and active learning. arXiv preprint arXiv:2104.00543 (2021)
12. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
13. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967)
14. Li, Z., Zhao, Y., Botta, N., Ionescu, C., Hu, X.: COPOD: copula-based outlier detection. In: ICDM, pp. 1118–1123 (2020)
15. Jiang, M., et al.: Weakly supervised anomaly detection: a survey. arXiv preprint arXiv:2302.04549 (2023)
16. Ruff, L., et al.: Deep semi-supervised anomaly detection. In: ICLR (2020)
17. Pang, G., Shen, C., van den Hengel, A.: Deep anomaly detection with deviation networks. In: KDD, pp. 353–362 (2019)
18. Laptev, N., Amizadeh, S., Flint, I.: Generic and scalable framework for automated time-series anomaly detection. In: KDD, pp. 1939–1947 (2015)
19. AIOps competition (2018). https://github.com/iopsai/iops/tree/master/phase2_env
20. Pang, G., van den Hengel, A., Shen, C., Cao, L.: Toward deep supervised anomaly detection: reinforcement learning from partially labeled anomaly data. In: KDD, pp. 1298–1308 (2021)
21. Xu, H., et al.: Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications. In: WWW, pp. 187–196 (2018)
22. Scheffer, T., Decomain, C., Wrobel, S.: Active hidden Markov models for information extraction. In: IDA, pp. 309–318 (2001)
23. Zhang, Y., Liu, S., Qu, X., Shang, X.: Multi-instance discriminative contrastive learning for brain image representation. In: NCAA, pp. 1–14 (2022)
24. Zhang, Y., An, R., Liu, S., Cui, J., Shang, X.: Predicting and understanding student learning performance using multi-source sparse attention convolutional neural networks. IEEE Trans. Big Data, 118–132 (2023)
25. Xu, H., Wang, Y., Jian, S., Liao, Q., Wang, Y., Pang, G.: Calibrated one-class classification for unsupervised time series anomaly detection. arXiv preprint arXiv:2207.12201 (2022)
Dynamic Label Propagation Density Peak Clustering Based on the Tissue-Like P Systems

Qing Du1 and Xiyu Liu2(B)

1 Business School, Shandong Normal University, Jinan, China
2 Academy of Management Science, Shandong Normal University, Jinan, China
[email protected]
Abstract. The density peak clustering (DPC) algorithm proposed in 2014 has attracted extensive discussion and research. The DPC algorithm considers the connectivity of objects from the perspective of object density and continuously expands clusters based on connectivity to obtain the final clustering results. However, the DPC algorithm also has drawbacks. It requires an appropriate cutoff distance parameter dc for different datasets, and it is prone to chain reactions after an object is misclassified. This paper proposes a new method called dynamic label propagation density peak clustering based on the tissue-like P systems (TP-DLDPC). The entire method operates within the framework of the tissue-like P systems. Firstly, the local density is calculated using a fuzzy kernel function to reduce the parameter sensitivity of the method. Secondly, object assignment is completed over multiple iterations using a dynamic label propagation assignment strategy. Comparative experiments are carried out on seven datasets, and the results show that the proposed method has good clustering performance.

Keywords: Density peaks clustering · Label propagation · Tissue-like P system · Membrane computing
1 Introduction

Membrane computing, also known as membrane systems or P systems, is a branch of natural computing [1]. Membrane computing is a new area of research in the mathematical sciences, and its development provides a rich computational framework for biomolecular computing. P systems comprise three important classes: cell-like P systems, tissue-like P systems, and neural-like P systems [2]. In recent years, owing to its powerful parallelism, membrane computing has grown rapidly in multifarious areas [3–5] and has been extensively combined with clustering [6–8]. Clustering is a typical unsupervised learning method that aims to make the similarities between objects within the same cluster as large as possible and those between different clusters as small as possible. Because of its ability to uncover hidden structural information in data, clustering is broadly applied in areas such as image processing [9], data mining [10], medical applications [11], information security [12], and pattern recognition [13]. As scholars have studied clustering analysis methods in depth,

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 129–140, 2023.
https://doi.org/10.1007/978-981-99-4752-2_11
Q. Du and X. Liu
clustering can be classified into partition-based clustering [14], hierarchy-based clustering [15], density-based clustering [16], grid-based clustering [17] and model-based clustering [18], according to the criteria used to group the clustered objects. The density peaks clustering (DPC) [19] proposed by Alex Rodriguez and Alessandro Laio in 2014 is a typical density-based clustering method. DPC can recognize arbitrarily shaped clusters with high computational efficiency. However, the DPC algorithm also has some disadvantages: 1) the cutoff distance parameter dc needs to be set manually for different datasets; 2) the original allocation strategy of DPC assigns all data points in a single pass without subsequent adjustment, which makes it prone to chain reactions. Scholars from various countries have made improvements to DPC. Mingjing Du et al. [20] introduce KNN into DPC to provide an alternative formula for local density and use PCA to reduce the dimensionality of large datasets. Jiazhao Zhao et al. [21] introduce fuzzy neighborhoods into DPC, providing a new form of object similarity calculation and improving the accuracy of cluster center selection. Abdulrahman Lotfi et al. [22] use KNN to form cluster backbones to avoid chain reactions and use label assignment to allocate the remaining points. To address the above shortcomings of DPC, this paper presents a new method called dynamic label propagation density peak clustering based on the tissue-like P systems (TP-DLDPC). The entire method operates within the framework of the tissue-like P systems. Firstly, a fuzzy kernel function, formed by combining the fuzzy neighborhood with the k nearest neighbors, is used to calculate the local density. This function replaces the cutoff distance parameter dc with the parameter k, which reduces the parameter sensitivity of the kernel function. Secondly, a dynamic label propagation allocation strategy is used to pass clustering labels to the objects through multiple iterations.
This allocation strategy overcomes DPC's tendency toward chain reactions and improves clustering performance. The rest of this paper is organized as follows: Sect. 2 describes the framework of the tissue-like P systems, the concept of DPC and the label propagation algorithm. Section 3 details the TP-DLDPC algorithm presented in this paper. Section 4 analyses the results of the experiments on seven different datasets. Section 5 draws conclusions.
2 Related Work 2.1 Tissue-Like P Systems The tissue-like P systems are a further extension of the cell-like P systems, with objects present both in the cells and in the environment. A tissue-like P system can be seen as a web-like structure: the cells and the environment are nodes, and cells are connected to each other by structures that can be regarded as directed arcs. Each unit is considered a processor that processes objects and communicates with other units along pre-assigned channels [23]. A formal representation of a tissue-like P system (of degree q > 0) with antiport/symport rules is as follows:
Π = (O, μ, ω1, ..., ωq, R1, ..., Rq, R′, i0)
where
(1) O is an alphabet containing a finite number of objects, each of which is a d × k dimensional vector, where k is the number of clusters and d is the dimension of the feature space.
(2) μ is a membrane structure; each membrane is assigned a label i.
(3) ωi (1 ≤ i ≤ q) is a finite set of objects consisting of strings over the alphabet O, expressing the multiset of objects initially present in membrane i.
(4) Ri (1 ≤ i ≤ q) is a finite set of evolution rules applied in membrane i.
(5) R′ is a finite set of communication rules of the form (i, u/v, j), governing communication between membrane i and membrane j.
(6) i0 indicates the output region of the P system.
2.2 Density Peaks Clustering The DPC algorithm rests on two basic assumptions: 1) the density of objects is higher near a cluster center; 2) cluster centers are distant from each other. Therefore, local density and relative distance are used as the horizontal and vertical coordinates of a decision graph, and the object points in the upper right corner of the decision graph are selected as the cluster centers. The remaining object points are allocated based on the nearest-neighbor principle. Let X = {x1, x2, ..., xn} be a dataset with n samples, where xi (1 ≤ i ≤ n) has m attributes. The local density ρi and the relative distance δi of xi are defined below.
Definition 1: The local density ρi of object xi is the number of object points within a given cutoff distance dc of xi:

ρi = Σ_{j≠i} χ(dij − dc),  where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise   (1)

where dij is the distance between xi and xj.
Definition 2: The relative distance δi of object xi is the minimum distance from xi to any object of higher density (for the object of highest density, δi is taken as the maximum distance to any other object):

δi = min_{j: ρj > ρi} dij   (2)
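DPC's two per-object quantities, local density ρ and relative distance δ, can be sketched in a few lines of NumPy. This is an illustrative implementation of the standard definitions; the function name and toy data are ours, not the paper's, and ties in ρ are not broken.

```python
import numpy as np

def dpc_quantities(X, dc):
    """Compute DPC local density rho (Eq. 1) and relative distance delta (Eq. 2)
    for an (n, m) data matrix X and cutoff distance dc."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    n = len(X)
    # rho_i: number of points strictly closer than dc (excluding the point itself)
    rho = (D < dc).sum(axis=1) - 1
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]      # points of higher density
        if higher.size:
            delta[i] = D[i, higher].min()       # distance to nearest denser point
        else:
            delta[i] = D[i].max()               # convention for the global peak
    return rho, delta

# toy usage: two tight groups of points
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
rho, delta = dpc_quantities(X, dc=0.5)
```

Points with both high ρ and high δ land in the upper-right corner of the decision graph and are chosen as cluster centers.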
2.3 Label Propagation Label propagation is a semi-supervised learning algorithm consisting of two stages: 1) constructing the similarity matrix; 2) conducting label dissemination. Label propagation formulates the problem as a graph, with each node representing an object and each edge carrying the probability of a label passing from node i to node j. Nodes propagate labels to neighboring nodes based on similarity [24].
An edge encodes the similarity between i and j; the greater the weight of the edge, the more similar the two nodes are, and the easier it is for a label to propagate:

wij = exp(−‖xi − xj‖² / α²)   (3)
where α is a hyperparameter. Define an n × n probability transfer matrix P:

Pij = P(i → j) = wij / Σ_{k=1}^{n} wik   (4)
where Pij denotes the transmission probability from node i to node j. Suppose there are C classes, L labelled objects and U unlabelled objects. Define an L × C label matrix YL and a U × C unlabelled matrix YU, and combine YL and YU to obtain an N × C soft label matrix F (N = L + U). Label propagation then works as follows:
Step 1: Propagate F ← PF. The matrix F is multiplied by the matrix P, so each node propagates its label to the other nodes with the probabilities given by P.
Step 2: Clamp the labelled data, FL = YL. After each propagation, the L objects that are initially labelled recover their original labels.
Step 3: Repeat from Step 1 until F converges.
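The propagate-and-clamp loop above can be sketched directly in NumPy. This is a minimal illustration of Eqs. (3)–(4) and the three steps, with a fixed iteration count standing in for a convergence test; the toy data are ours.

```python
import numpy as np

def label_propagation(X, y_init, labeled_idx, alpha=1.0, n_iter=100):
    """y_init: (n, C) matrix with one-hot rows for labelled points, zeros
    otherwise; labeled_idx: indices of the L labelled objects."""
    # Eq. (3): Gaussian similarity matrix
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-D2 / alpha ** 2)
    # Eq. (4): row-normalise W into a probability transfer matrix P
    P = W / W.sum(axis=1, keepdims=True)
    F = y_init.astype(float).copy()
    for _ in range(n_iter):
        F = P @ F                              # Step 1: F <- PF
        F[labeled_idx] = y_init[labeled_idx]   # Step 2: clamp labelled rows
    return F.argmax(axis=1)

# two clusters, one labelled seed in each
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
y_init = np.zeros((4, 2))
y_init[0, 0] = 1.0   # point 0 -> class 0
y_init[2, 1] = 1.0   # point 2 -> class 1
labels = label_propagation(X, y_init, labeled_idx=[0, 2])
```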
3 TP-DLDPC 3.1 Identifying Cluster Centers The local density formula of DPC defines local density over a crisp, sharply bounded range. This crisp measure is very sensitive to the parameter dc and is therefore inflexible when dealing with different datasets. To address this issue, inspired by [20–22], this paper combines the fuzzy neighborhood with the k nearest neighbors to propose a fuzzy kernel function. Unlike the original local density function, this kernel function replaces the parameter dc with the parameter k to reduce the sensitivity of the kernel function, while the neighbor relation takes the local structural information of the data into account, reducing the misclassification of objects and improving clustering performance.
ρi = Σ_{xj ∈ kNN(xi)} max(1 − (1/k) d(xi, xj), 0)   (5)
where kNN(xi) represents the k nearest neighbors of object xi and d(xi, xj) is the Euclidean distance between xi and xj.

kNN(xi) = { xj | d(xi, xj) ≤ d(xi, xk) }   (6)
where xk is the k-th nearest neighbor of xi. The original DPC local density measure cannot assign appropriate membership values to neighboring points according to their distance from the center, whereas the fuzzy kernel function based on the k nearest neighbors can assign different membership values to the neighboring points of an object. This fuzzy kernel function therefore effectively reduces the impact of outliers on the local density calculation, identifies cluster centers effectively and improves clustering performance. In this paper, we use the original relative distance formula of DPC to calculate δi. A simple and fast strategy is used for the automatic determination of cluster centers: the object points are ranked by a score computed from ρi and δi, and the top c objects are selected as cluster centers.

score(xi) = ρi · δi   (7)
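Equations (5)–(7) can be sketched together as follows. Note the exact fuzzy membership in Eq. (5) is our reconstruction of a garbled formula (we assume max(1 − d/k, 0) for each of the k nearest neighbours), so treat this as an illustration rather than the paper's definitive kernel.

```python
import numpy as np

def fuzzy_knn_density(D, k):
    """Fuzzy-kernel local density (Eq. 5) from a precomputed (n, n) distance
    matrix D, under the assumed membership max(1 - d/k, 0)."""
    n = len(D)
    rho = np.zeros(n)
    for i in range(n):
        knn = np.argsort(D[i])[1:k + 1]   # Eq. (6): k nearest neighbours
        rho[i] = np.maximum(1.0 - D[i, knn] / k, 0.0).sum()
    return rho

def select_centers(rho, delta, c):
    """Eq. (7): rank objects by score = rho * delta and take the top c."""
    return np.argsort(rho * delta)[::-1][:c]

# toy 1-D usage: three close points and one distant outlier
pts = np.array([0.0, 0.1, 0.2, 10.0])
D = np.abs(pts[:, None] - pts[None, :])
rho = fuzzy_knn_density(D, k=2)
centers = select_centers(np.array([3.0, 1.0, 2.0, 0.5]),
                         np.array([5.0, 0.1, 4.0, 0.2]), c=2)
```

The outlier's neighbours are far away, so its fuzzy memberships clip to zero and its density vanishes, which is exactly the outlier-suppression the text claims.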
With this measure, high scores are attached to object points with both high ρi and high δi, which effectively reduces human intervention in the clustering process and improves the accuracy of cluster center selection. 3.2 Dynamic Label Propagation The allocation of labels to objects is divided into two main steps: 1) propagation of labels from the identified cluster centers to neighboring object points; 2) dynamic label allocation based on a weighted graph. The main objective of step 1 is to construct a graph structure using KNN, where each object is a node connected to its k nearest neighbors. A distinct clustering label is assigned to each of the top c cluster centers selected in Sect. 3.1, and the cluster centers then propagate their labels to their nearest-neighbor objects through the nearest-neighbor relationship.
Label(xi) = Label(peakj), if xi ∈ kNN(peakj); Label(xi) = 0, otherwise   (8)
where peaks represents the set of cluster centers and kNN(peakj) denotes the set of k neighbors of cluster center j. The KNN graph considers the local rather than the global structure of the data, which reduces the computational complexity and improves the accuracy of object assignment. The main objective of step 2 is to propagate labels from the core points to their nearest neighbors through a graph-based label propagation method, repeating this process for the neighbors of the nearest neighbors until all nodes have a label. It is first necessary to define a weighted graph G = (V, E, W), where V is a set of vertices representing objects, E is a set of edges, and Wi,j is an edge weight in the range [0, 1]:

W(kNN)i,j = exp(−d(xi, xj)² / (uσ²)), if xj ∈ kNN(xi); W(kNN)i,j = 0, otherwise   (9)
where Wi,j represents the similarity of vertex xi to xj, d(xi, xj) denotes the distance between xi and xj, u and σ are hyperparameters, and an adaptive kernel function is applied to obtain the hyperparameter σ:

σi,j = b · Avg({knnd(xi), knnd(xj)})   (10)
where Avg({knnd(xi), knnd(xj)}) represents the average distance to the k nearest neighbors of xi and xj, and b is a hyperparameter. The values of the parameters b and k can be obtained by experiment. A probability transfer matrix P is defined by normalizing Wi,j:

Pi,j = W(kNN)i,j / Σ_{k∈V} W(kNN)i,k   (11)
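Equations (9)–(11) amount to a kNN-restricted Gaussian graph with an adaptive bandwidth, row-normalised into a transition matrix. A sketch (the averaging of the two points' kNN distances in Eq. (10) is our reading of the formula; u and b follow the paper's names):

```python
import numpy as np

def knn_graph_transition(X, k, u=1.0, b=1.0):
    """Build the kNN similarity matrix W of Eq. (9) with the adaptive sigma
    of Eq. (10), then normalise rows into the transition matrix P of Eq. (11)."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    knn = np.argsort(D, axis=1)[:, 1:k + 1]            # indices of k neighbours
    knnd = np.take_along_axis(D, knn, axis=1)          # distances to them
    W = np.zeros((n, n))
    for i in range(n):
        for j in knn[i]:
            # Eq. (10): sigma adapts to the average kNN distance of xi and xj
            sigma = b * 0.5 * (knnd[i].mean() + knnd[j].mean())
            W[i, j] = np.exp(-D[i, j] ** 2 / (u * sigma ** 2))   # Eq. (9)
    row = W.sum(axis=1, keepdims=True)
    P = np.divide(W, row, out=np.zeros_like(W), where=row > 0)   # Eq. (11)
    return W, P

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W, P = knn_graph_transition(X, k=2)
```

Each row of P sums to one, and non-neighbours get zero weight, which keeps the subsequent propagation local.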
where Pi,j is the probability that xi is connected to xj. Before implementing the label propagation strategy, the object points are divided into labelled objects Xl and unlabelled objects Xu: Xl refers to the object points that were assigned labels in step 1 and Xu to the points that have not yet been labelled. Define an initial label matrix Y0 = [Y(l); Y(u)] ∈ R^(n×c), where n is the number of object points, c is the number of clusters, Y(l) is the label matrix of the labelled objects and Y(u) is the label matrix of the unlabelled objects. The dynamic label propagation strategy used in this paper combines the local structure of the objects (i.e. the transition matrix P) with the correlation between objects (i.e. YY^T) in each label iteration. In each iteration, the labels of the unlabelled objects are updated as follows:

Yt+1 = Ft · Yt   (12)

where Yt is the label matrix and Ft is a fusion graph:

Ft+1 = P, if t = 0; Ft+1 = P(Ft + α·Yt·Yt^T)·P^T, if t > 0   (13)
where Yt·Yt^T captures the pairwise correlation between objects and α ∈ [0, 1] controls the correlation rate between objects. After each iteration of label propagation, the object points that are initially labelled recover their original labels:

Y(l)t+1 = Y(l)0   (14)
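The dynamic update of Eqs. (12)–(14) can be sketched as follows. The toy graph, α value and iteration count are our assumptions for illustration; the key point is that the fusion graph F mixes the transition matrix with the evolving label correlations.

```python
import numpy as np

def dynamic_label_propagation(P, Y0, labeled_idx, alpha=0.05, T=10):
    """P: row-stochastic transition matrix (Eq. 11); Y0: initial label matrix
    with one-hot rows for labelled points; labeled_idx: their indices."""
    Y = Y0.astype(float).copy()
    F = P.copy()                                  # Eq. (13), t = 0
    for t in range(T):
        Y = F @ Y                                 # Eq. (12)
        Y[labeled_idx] = Y0[labeled_idx]          # Eq. (14): clamp labelled rows
        F = P @ (F + alpha * (Y @ Y.T)) @ P.T     # Eq. (13), t > 0
    return Y.argmax(axis=1)

# two 2-point clusters; one cluster-center label in each
W = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.9, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.9],
              [0.0, 0.0, 0.9, 1.0]])
P = W / W.sum(axis=1, keepdims=True)
Y0 = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
labels = dynamic_label_propagation(P, Y0, labeled_idx=[0, 2])
```

Because F is rebuilt every iteration from both P and Y·Y^T, a label that was wrong at step t can be outvoted at step t+1, which is how the strategy avoids the chain reactions described above.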
The dynamic label propagation strategy is based on KNN connectivity, which preserves the natural structure between object points and improves the accuracy of object assignment, while combining object correlation with local structure in each iteration. If an object is incorrectly labelled, the dynamic label propagation strategy can correct the label in a subsequent iteration until every object has a correct label, thus avoiding chain reactions. 3.3 The Tissue-Like P Systems Framework for TP-DLDPC Based on the concepts in Sect. 2.1, the initial arrangement of cells and environment in the tissue-like P system designed for TP-DLDPC is shown in Fig. 1.
Fig. 1. Initial configuration of the tissue-like P systems
cell0 and celln+1: empty units. cell1, ..., celln: each cell contains one object of the dataset. We define the following rules:
R1: Use Eqs. (6), (5), (2) and (7) to calculate the k nearest neighbors, local density, relative distance, and score of the object points respectively, and input them into cell1, ..., celln.
R2: Select the top c object points as cluster centers according to Eq. (7), and divide celln+1 into c new cells celln+1, ..., celln+c according to the membrane division rule.
R3: Enter the selected c cluster centers and their neighbourhood information into celln+1, ..., celln+c, and assign a clustering label to each of these c new cell units.
R4: Propagate the clustering labels in celln+1, ..., celln+c to their k neighbors according to Eq. (8).
R5: Fuse the cells containing the c cluster centers with celln+1, ..., celln+c.
R6: Calculate the similarity of sample xi to xj using Eq. (9) to form the similarity matrix Wi,j, and input it into cell1, ..., celln.
R7: Normalize the similarity matrix Wi,j using Eq. (11) to obtain the probability transfer matrix Pi,j, which is fed into cell1, ..., celln.
R8: Update the labels of the unlabelled objects using Eq. (12) so that each object eventually has a clustering label, and input the final clustering result into cell0.
The process of implementing the tissue-like P systems is displayed in Fig. 2. The main steps of TP-DLDPC are as follows:
Input: dataset X, sample percentage p, number of iterations T.
Output: clustering result Y.
Step 1: Calculate knn(xi), ρi, δi and the score of the samples according to rule R1.
Step 2: Select c cluster centers according to rules R2 and R3, and assign distinct clustering labels to them.
Step 3: Diffuse the labels of the cluster centers to their respective neighbors according to rule R4.
Step 4: Calculate the similarity Wi,j between samples according to rule R6 and generate a probability transfer matrix Pi,j according to rule R7.
Fig. 2. The process of implementing the tissue-like P systems
Step 5: Assign labels to the remaining points according to rule R8 until each sample has a clustering label. Step 6: The calculation is terminated and the clustering result Y is output.
4 Experimental Results and Analysis
4.1 Experimental Settings
To validate its clustering performance, TP-DLDPC is evaluated on four synthetic datasets and three UCI datasets. The experimental results of TP-DLDPC are compared with three well-known algorithms: DBSCAN [25], K-means [26], and DPC [19]. The reported results are the best obtained over multiple runs of each algorithm. All experiments in this paper are carried out in Matlab 2018a. To quantify the clustering performance of each algorithm, we use three well-known evaluation metrics: Accuracy (ACC), Purity and F. The higher the value of a measure, the better the clustering effect. The characteristics of the datasets used in the experiments are given in Table 1.

Table 1. Description of the datasets

Dataset   | #records | #attributes | #clusters | Source
Flame     | 240      | 2           | 2         | Synthetic
Smile     | 266      | 2           | 3         | Synthetic
Compound  | 399      | 2           | 6         | Synthetic
D31       | 3100     | 2           | 31        | Synthetic
Wine      | 178      | 2           | 3         | UCI
WDBC      | 198      | 32          | 2         | UCI
Banknote  | 1372     | 4           | 2         | UCI
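The evaluation metrics named in Sect. 4.1 are standard external clustering measures. As the paper does not spell out its formulas, here is a sketch under the usual definitions of Purity (majority class per predicted cluster) and ACC (best cluster-to-class mapping, found by brute force here, which is fine for small cluster counts):

```python
import numpy as np
from itertools import permutations

def purity(y_true, y_pred):
    """Fraction of points covered by each predicted cluster's majority class."""
    total = 0
    for c in np.unique(y_pred):
        members = y_true[y_pred == c]
        total += np.bincount(members).max()
    return total / len(y_true)

def accuracy(y_true, y_pred):
    """Best agreement over all injective mappings of cluster ids to class ids."""
    clusters, classes = np.unique(y_pred), np.unique(y_true)
    best = 0.0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        best = max(best, np.mean([mapping[p] == t
                                  for p, t in zip(y_pred, y_true)]))
    return best

y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 0, 0, 1])   # same partition, different cluster ids
```

Both measures are invariant to the arbitrary numbering of clusters, which is why a raw label-match rate cannot be used directly.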
4.2 Experimental Results Figures 3, 4, 5 and 6 show the clustering results of TP-DLDPC and the three comparison algorithms on the Flame, Smile, Compound and D31 datasets. As shown in Fig. 3, only the TP-DLDPC algorithm gives the best clustering result on the Flame dataset; K-means and DPC misallocate points from the left-hand cluster to the right-hand cluster. In Fig. 4, all four algorithms divide the object points correctly. Figure 5 shows that TP-DLDPC achieves the best clustering result of the four algorithms, although the delineation of the two clusters on the right is unsatisfactory; K-means and DPC do not identify the correct cluster centers and do not take the internal structure of the object points into account when assigning the remaining points, resulting in misclassification, while DBSCAN performs worst. In Fig. 6, TP-DLDPC shows the best clustering result on the D31 dataset.
Fig. 3. Clustering performances on the Flame dataset.
Fig. 4. Clustering performances on the Smile dataset.
Table 2 and Table 3 compare the clustering performance of TP-DLDPC with K-means, DBSCAN and DPC in terms of the three evaluation measures on the synthetic datasets and the UCI datasets respectively. On the F indicator, TP-DLDPC achieves the best clustering results on all four synthetic datasets, and the highest values on two UCI datasets, Wine and Banknote. On the ACC indicator, TP-DLDPC obtains the best results
Fig. 5. Clustering performances on the Compound dataset.
Fig. 6. Clustering performances on the D31 dataset.
on five datasets; on the WDBC dataset, TP-DLDPC achieves the second-best result, just 0.0404 behind the best. On the Purity indicator, TP-DLDPC fails to achieve the best result only on the Flame and Banknote datasets, and even there its Purity value on the Flame dataset is as high as 0.9875.

Table 2. Clustering performances of four algorithms on synthetic datasets.

            Flame                        Smile
Algorithm | F      | ACC    | Purity  | F | ACC | Purity
TP-DLDPC  | 0.9769 | 0.9875 | 0.9875  | 1 | 1   | 1
K-means   | 0.7638 | 0.8583 | 0.8583  | 1 | 1   | 1
DBSCAN    | 0.9763 | 0.9708 | 1       | 1 | 1   | 1
DPC       | 0.6784 | 0.7875 | 0.7875  | 1 | 1   | 1

            Compound                     D31
Algorithm | F      | ACC    | Purity  | F      | ACC    | Purity
TP-DLDPC  | 0.8745 | 0.8546 | 0.8747  | 0.9360 | 0.9671 | 0.9671
K-means   | 0.6062 | 0.5338 | 0.7794  | 0.8281 | 0.8174 | 0.8710
DBSCAN    | 0.8556 | 0.7544 | 0.8421  | 0.8365 | 0.8065 | 0.8065
DPC       | 0.6249 | 0.6341 | 0.8346  | 0.9126 | 0.9313 | 0.9313
Table 3. Clustering performances of four algorithms on UCI datasets.

            Wine                         WDBC
Algorithm | F      | ACC    | Purity  | F      | ACC    | Purity
TP-DLDPC  | 0.9659 | 0.9831 | 0.9831  | 0.7288 | 0.7222 | 0.7626
K-means   | 0.8686 | 0.9326 | 0.9326  | 0.6018 | 0.6364 | 0.7626
DBSCAN    | 0.5052 | 0.3989 | 0.3989  | 0.7776 | 0.7626 | 0.7626
DPC       | 0.7575 | 0.7303 | 0.7303  | 0.5655 | 0.5101 | 0.7626

            Banknote
Algorithm | F      | ACC    | Purity
TP-DLDPC  | 0.6760 | 0.7923 | 0.7923
K-means   | 0.5512 | 0.6122 | 0.6122
DBSCAN    | 0.0914 | 0.1472 | 0.9628
DPC       | 0.6472 | 0.7413 | 0.7413
5 Conclusion In response to the drawbacks that DPC requires manually set parameter values and is prone to chain reactions, this paper presents a dynamic label propagation density peak clustering method based on the tissue-like P systems. The entire method operates within the framework of the tissue-like P systems. The fuzzy neighborhood is combined with the k nearest neighbors to form a fuzzy kernel function, which is used to calculate the local density. Using a dynamic label propagation allocation strategy, the clustering labels are passed to the object points through multiple iterations. Experimental results on seven datasets show that TP-DLDPC outperforms K-means, DBSCAN and DPC in terms of clustering performance. Although TP-DLDPC has better clustering performance, it still requires the parameter k. Future research will focus on adaptive parameter selection and parameter-free clustering. Acknowledgment. This study is supported by the Social Science Fund Project of Shandong (16BGLJ06, 11CGLJ22).
References 1. Zhang, G.X., et al.: Evolutionary membrane computing: a comprehensive survey and new results. Inf. Sci. 279, 528–551 (2014) 2. Song, B.S., Li, K.L., Zeng, X.X.: Monodirectional evolutional symport tissue p systems with promoters and cell division. IEEE Trans. Parallel Distrib. Syst. 33(2), 332–342 (2022) 3. Cai, Y.L., et al.: An unsupervised segmentation method based on dynamic threshold neural P systems for color images. Inf. Sci. 587, 473–484 (2022) 4. Dong, J.P., et al.: A distributed adaptive optimization spiking neural P system for approximately solving combinatorial optimization problems. Inf. Sci. 596, 1–14 (2022)
5. Long, L.F., et al.: A time series forecasting approach based on nonlinear spiking neural systems. Int. J. Neural Syst. 32(08) (2022) 6. Guo, P., Jiang, W.J., Liu, Y.C.: AP system for hierarchical clustering. Int. J. Mod. Phys. C 30(8) (2019) 7. Jiang, Z.N., Liu, X.Y., Sun, M.H.: A density peak clustering algorithm based on the k-nearest Shannon entropy and tissue-like P system. Math. Probl. Eng. 2019 (2019) 8. Zhang, X.L., Liu, X.Y.: Multiview clustering of adaptive sparse representation based on coupled P systems. Entropy 24(4) (2022) 9. Tao, X.N., et al.: SVDD boundary and DPC clustering technique-based oversampling approach for handling imbalanced and overlapped data. Knowl.-Based Syst. 234 (2021) 10. Chen, J.G., et al.: A disease diagnosis and treatment recommendation system based on big data mining and cloud computing. Inf. Sci. 435, 124–149 (2018) 11. Precup, R.E., et al.: Evolving fuzzy models for prosthetic hand myoelectric-based control. IEEE Trans. Instrum. Meas. 69(7), 4625–4636 (2020) 12. Yun, U., Ryang, H., Kwon, O.C.: Monitoring vehicle outliers based on clustering technique. Appl. Soft Comput. 49, 845–860 (2016) 13. Wang, H., et al.: Pattern recognition and classification of two cancer cell lines by diffraction imaging at multiple pixel distances. Pattern Recogn. 61, 234–244 (2017) 14. Lei, T., et al.: Significantly fast and robust fuzzy C-means clustering algorithm based on morphological reconstruction and membership filtering. IEEE Trans. Fuzzy Syst. 26(5), 3027– 3041 (2018) 15. Giacoumidis, E., et al.: Blind nonlinearity equalization by machine-learning-based clustering for single- and multichannel coherent optical OFDM. J. Lightwave Technol. 36(3), 721–727 (2018) 16. Gowanlock, M., et al.: A hybrid approach for optimizing parallel clustering throughput using the GPU. IEEE Trans. Parallel Distrib. Syst. 30(4), 766–777 (2019) 17. 
Singh, S.K., Kumar, P., Singh, J.P.: An energy efficient protocol to mitigate hot spot problem using unequal clustering in WSN. Wirel. Pers. Commun. 101(2), 799–827 (2018). https://doi.org/10.1007/s11277-018-5716-3 18. Chen, T., et al.: Model-based multidimensional clustering of categorical data. Artif. Intell. 176(1), 2246–2269 (2012) 19. Rodriguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344(6191), 1492–1496 (2014) 20. Du, M.J., Ding, S.F., Jia, H.J.: Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowl.-Based Syst. 99, 135–145 (2016) 21. Zhao, J., et al.: Density peaks clustering algorithm based on fuzzy and weighted shared neighbor for uneven density datasets. Pattern Recogn. 139 (2023) 22. Lotfi, A., Moradi, P., Beigy, H.: Density peaks clustering based on density backbone and fuzzy neighborhood. Pattern Recogn. 107 (2020) 23. Peng, H., et al.: An automatic clustering algorithm inspired by membrane computing. Pattern Recogn. Lett. 68, 34–40 (2015) 24. Zhu, X.: Semi-supervised learning with graphs. Doctoral Dissertation. Carnegie Mellon University, CMU–LTI–05–192 (2005) 25. Ester, M., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. KDD 96, 226–231 (1996) 26. MacQueen, J.: Some methods for classification and analysis of multivariate observations. Stat. Probab. 281–297 (1967)
TAP-AHGNN: An Attention-Based Heterogeneous Graph Neural Network for Service Recommendation on Trigger-Action Programming Platform Zijun Huang1 , Jiangfeng Li1(B) , Huijuan Zhang1 , Chenxi Zhang1 , and Gang Yu2 1 School of Software Engineering, Tongji University, Shanghai 201804, China
[email protected] 2 SILC Business School, Shanghai University, Shanghai 201800, China
Abstract. Trigger-Action Programming (TAP) is a popular IoT programming paradigm that enables users to connect IoT services and automate IoT workflows by creating if-trigger-then-action rules. However, with the increasing number of IoT services, specifying trigger and action services to compose TAP rules becomes progressively challenging for users due to the vast search space. To facilitate users in programming, a novel method named TAP-AHGNN is proposed to recommend feasible action services to auto-complete the rule based on the user-specified trigger service. Firstly, a heterogeneous TAP knowledge graph is designed, from which five meta-paths can be extracted to construct services' neighborhoods. Then, the model incorporates a multi-level attention-based heterogeneous graph convolution module that selectively aggregates neighbor information, and a transformer-based fusion module that enables the integration of multiple types of features. With these two modules, the final representations of services capture both semantic and structural information, which helps generate better recommendation results. Experiments on a real-world dataset demonstrate that TAP-AHGNN outperforms the most advanced baselines on HR@k, NDCG@k and MRR@k. To the best of our knowledge, TAP-AHGNN is the first method for service recommendation on TAP platforms using the heterogeneous graph neural network technique. Keywords: Trigger-Action Programming (TAP) · Internet of Things (IoT) · Heterogeneous Graph Neural Network · Smart Service Recommendation
1 Introduction Trigger-Action Programming (TAP), a classic programming paradigm in Internet of Things (IoT), can support users in simplifying the cooperation between IoT devices by formulating IF-THEN rules in the form of “IF Device1.TriggerService is triggered, THEN Device2.ActionService is executed”. In the TAP paradigm, IoT devices are usually called channel, and smart services provided by devices can be divided into trigger © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 141–152, 2023. https://doi.org/10.1007/978-981-99-4752-2_12
and action, which are respectively used to monitor environmental conditions and perform specific tasks [1]. Users can customize a rule (also known as a recipe), such as "IF MobilePhone.Location has left home, THEN Light.TurnOff". In this scenario, the light will automatically turn off if the phone's location is detected to have left home. TAP has gained wide support in the IoT community; platforms such as IFTTT, Zapier and Mozilla have demonstrated the usability of the TAP paradigm. TAP recommendation becomes necessary as the number of services on TAP platforms increases rapidly and the difficulty of finding feasible services to compose rules grows exponentially [2, 3]. Numerous researchers have proposed recommendation methods for TAP platforms [4–7]. For instance, paper [7] suggests predicting appropriate devices based on user-specified devices by leveraging the node2vec method. Meanwhile, in works [4–6], researchers prompt users to input their requirements and subsequently recommend corresponding services based on those requirements, using NLP-based techniques to tackle the task. In conclusion, current works on TAP recommendation mainly face two significant challenges. Firstly, these studies neglect users' need to receive recommendations for feasible action services that auto-complete a rule once a trigger service has been selected on a TAP platform. Secondly, most existing works analyze the data on TAP platforms either from the perspective of text or from the perspective of homogeneous graphs, failing to capture both the semantic knowledge and the structural knowledge underlying the heterogeneous TAP network. To tackle the challenges mentioned above, our research focuses on smart service recommendation on TAP platforms, aiming to recommend feasible action services to auto-complete rules based on trigger services specified by users. To achieve this, a novel method TAP-AHGNN is proposed.
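To make the if-trigger-then-action structure concrete, a TAP rule can be modelled as a pair of (device, service) entries. This is purely an illustrative representation of the paradigm described above, not the schema of any actual TAP platform:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Service:
    device: str   # the channel, e.g. "Light"
    name: str     # the trigger or action service it provides, e.g. "TurnOff"

@dataclass(frozen=True)
class Rule:
    trigger: Service
    action: Service

    def __str__(self):
        return (f"IF {self.trigger.device}.{self.trigger.name} "
                f"THEN {self.action.device}.{self.action.name}")

rule = Rule(Service("MobilePhone", "LocationLeftHome"),
            Service("Light", "TurnOff"))
```

The recommendation task studied in this paper is then: given `rule.trigger`, rank candidate `Service` objects to fill in `rule.action`.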
Firstly, a heterogeneous TAP knowledge graph (TAHG) is designed to capture the complex relationships among TAP entities, from which five meta-paths are extracted to create service neighborhoods. Subsequently, the services' title, function and device attributes are used as initial features and processed by independent channels. Finally, with the multi-level attention-based heterogeneous graph convolution module aggregating neighbor information and the transformer-based fusion module integrating multiple features, TAP-AHGNN generates final service representations that capture both semantic and structural information from the original data, leading to better recommendation results. The main contributions of this paper are as follows: • A novel method TAP-AHGNN is proposed. It is composed of a multi-level attention-based heterogeneous graph convolution module and a transformer-based fusion module, and aims to recommend feasible action services to users based on user-specified trigger services. To the best of our knowledge, TAP-AHGNN is the first work to tackle the TAP service recommendation task from both semantic and heterogeneous graph structure perspectives. • A TAP heterogeneous knowledge graph (TAHG) is constructed. TAHG enables the modeling of complex relationships between heterogeneous entities on TAP platforms such as devices, services and rules, allowing new patterns between them to be discovered.
• Five types of meta-paths are designed based on TAHG, covering co-rule, co-device and co-keyword relationships. These meta-paths enable TAP-AHGNN to construct service neighborhoods with diverse semantics, thereby facilitating more accurate representations of services.
2 Problem Definition 2.1 Formal Definition Let G = (U, E) be the heterogeneous knowledge graph (TAHG) on TAP platforms, as shown in Fig. 1, where U is the set of nodes and E is the set of edges. The node set U consists of five types of nodes: rules R, trigger services T, action services A, devices C, and keywords W. That is, U = {R, T, A, C, W} and |U| = 5. We note that most previous works only used keywords as node features, but we believe that keywords are particularly important for measuring the similarity of smart services; we therefore treat them as a node type.
Fig. 1. A toy example of a TAP knowledge graph (TAHG). There are five types of entities and five types of meta-paths marked by various colors.
Suppose V = (T, A) is the service set in TAHG, containing trigger services T and action services A. This definition is made because trigger and action services are the target objects of the service recommendation task. Let P = {TRA, TCT, TWT, ACA, AWA} be the meta-path set containing the five meta-paths extracted from TAHG, where TRA denotes Trigger-Rule-Action, TCT denotes Trigger-Device-Trigger, TWT denotes Trigger-KeyWord-Trigger, ACA denotes Action-Device-Action and AWA denotes Action-KeyWord-Action. They are abbreviated as T −co-rule→ A, T −co-device→ T, T −co-keyword→ T, A −co-device→ A and A −co-keyword→ A. Suppose N_v = {N_v^Pi | Pi ∈ P} is the neighborhood of the service node v ∈ V and o is a neighbor of the service node v; specifically, N_v^Pi denotes the set of nodes reachable from v through meta-path Pi. Suppose X = {x_v^t | x_v^t ∈ {x_v^d, x_v^c, x_v^r}, v ∈ V} is the set of features associated with V, where feature x_v^d represents the function description of service v, x_v^c represents information about the device to which service v belongs, and x_v^r represents the title of service v. Finally, let n be the number of graph convolution layers.
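A meta-path neighborhood N_v^Pi for a symmetric length-2 meta-path such as TCT (co-device) reduces to "two services are neighbours when they share a middle node". A sketch with made-up toy data (this helper is illustrative, not code from the paper):

```python
from collections import defaultdict

def metapath_neighbors(service_to_middle):
    """Given a mapping from each service to its set of middle nodes (devices,
    keywords, ...), return each service's co-* neighbour set."""
    middle_to_services = defaultdict(set)
    for service, middles in service_to_middle.items():
        for m in middles:
            middle_to_services[m].add(service)
    neighbors = {}
    for service, middles in service_to_middle.items():
        nbrs = set()
        for m in middles:
            nbrs |= middle_to_services[m]
        nbrs.discard(service)          # a service is not its own neighbour
        neighbors[service] = nbrs
    return neighbors

# toy co-device (TCT) example: Rain and Snow both belong to the Weather device
service_device = {"Weather.Rain": {"Weather"},
                  "Weather.Snow": {"Weather"},
                  "Light.Dimmed": {"Light"}}
nbrs = metapath_neighbors(service_device)
```

The co-keyword neighborhoods are built the same way with keywords as the middle nodes; the asymmetric TRA path additionally distinguishes trigger and action endpoints.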
144
Z. Huang et al.
The reason for using meta-paths is that they connect pairs of nodes through composite relationships for semantic exploration, which is revealing in recommendation tasks. For example, from the long path "Weather.Rain −co-device→ Weather.Snow −co-rule→ Window.Close", composed of two meta-paths, we can reasonably infer that the action "Window.Close" is suitable for the trigger "Weather.Rain". This shows that, based on meta-paths, the probability of finding a feasible action service to match a trigger service is higher, which helps generate better recommendations.
2.2 Problem Statement
Given TAHG G's service set V, the service neighborhoods N constructed from the meta-path set P, and the service features X, the goal of TAP-AHGNN is to learn a neural function Φ: V → Z that maps each smart service in V to a low-dimensional representation vector Z^(n). Then the dot products between service pairs are ranked and the top-k action services are recommended for each trigger service, which is formulated as follows:
S = Z_T^(n) · (Z_A^(n))^T   (1)
R = [R_ij] = topk(S)   (2)
where Z_T^(n) and Z_A^(n) denote the representations of trigger and action services, S denotes the score matrix and R denotes the recommended list. R_ij refers to the index of the j-th action service recommended to the i-th trigger service.
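Eqs. (1)-(2) can be sketched numerically: score every (trigger, action) pair by a dot product and keep the k best-scoring actions per trigger. This is an illustrative NumPy sketch with random embeddings, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_triggers, n_actions, d = 4, 6, 8
Z_T = rng.normal(size=(n_triggers, d))   # trigger representations Z_T^(n)
Z_A = rng.normal(size=(n_actions, d))    # action representations Z_A^(n)

S = Z_T @ Z_A.T                          # score matrix, Eq. (1)
k = 3
R = np.argsort(-S, axis=1)[:, :k]        # top-k action indices per trigger, Eq. (2)
```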
Fig. 2. Overview of TAP-AHGNN structure. (a) Services’ initial features are processed by multiple channels respectively. (b) The module aggregates information selectively from neighbors along various meta-paths with multi-level attention mechanisms. (c) The module fuses multiple channels of features based on transformer. (d) The part recommends top-k action services for each trigger service.
TAP-AHGNN: An Attention-Based Heterogeneous Graph Neural Network
145
3 Proposed Model
Our objective is to auto-complete TAP rules by recommending feasible action services to users based on their specified trigger services. To achieve this, a novel method, TAP-AHGNN, is proposed, as shown in Fig. 2. In the graph convolution module, information is selectively aggregated from neighbors and meta-paths by multi-level attention mechanisms. Subsequently, in the fusion module, the three types of features, each processed by an independent channel, are incorporated to generate the final service representations. Finally, the dot products of service pairs (trigger, action) are ranked to recommend the top-k action services for each trigger service. In the following sections, we discuss these modules in detail.
3.1 Attention-Based Heterogeneous Graph Convolution Module
To transfer the different feature spaces into a new unified feature space, we define a transformation matrix W_t that maps the initial feature representation x_v^t into a new feature space:
m_v^t = W_t · x_v^t   (3)
where m_v^t denotes the representation of node v's feature t in the new feature space.
3.1.1 Node-Level Attention
As shown in Fig. 3, the node-level attention mechanism is utilized to learn attention coefficients and selectively aggregate the representations of neighbors on a specific meta-path, forming a semantic-specific feature embedding of node v. Specifically, on the channel processing feature t, the attention coefficient e_{vo,Pi}^t between a neighboring node pair (v, o) on meta-path Pi can be represented as:
e_{vo,Pi}^t = att_η(m_v^t, m_o^t; Pi) = σ(a_Pi^T · [m_v^t || m_o^t])   (4)
where att_η is a node-level attention operation and a_Pi is a node-level attention vector. Then the normalized attention coefficients α_{vo,Pi}^t can be obtained:
α_{vo,Pi}^t = softmax(e_{vo,Pi}^t) = exp(e_{vo,Pi}^t) / Σ_{j∈N_v^Pi} exp(e_{vj,Pi}^t)   (5)
The neighbors of node v along the specific meta-path Pi are aggregated as follows:
z_{v,Pi}^t = σ(Σ_{j∈N_v^Pi} α_{vj,Pi}^t · m_j^t)   (6)
where z_{v,Pi}^t is the semantic-specific feature embedding of service v for meta-path Pi. However, the semantic-specific representation z_{v,Pi}^t can only reflect feature t of node v from that specific aspect.
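The node-level attention of Eqs. (4)-(6) can be sketched for a single node and one meta-path as follows. This is an illustrative NumPy sketch, with σ taken as the logistic sigmoid; the names are ours, not the paper's.

```python
import numpy as np

def node_level_attention(m_v, m_neigh, a):
    """Eqs. (4)-(6): attention over neighbors on one meta-path."""
    sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
    # e_{vo} = sigma(a . [m_v || m_o]) for each neighbor o, Eq. (4)
    e = np.array([sigma(a @ np.concatenate([m_v, m_o])) for m_o in m_neigh])
    alpha = np.exp(e) / np.exp(e).sum()                   # softmax, Eq. (5)
    return sigma((alpha[:, None] * m_neigh).sum(axis=0))  # aggregation, Eq. (6)

rng = np.random.default_rng(1)
d = 4
m_v = rng.normal(size=d)              # projected feature of node v, Eq. (3)
m_neigh = rng.normal(size=(3, d))     # neighbors reachable via one meta-path
a = rng.normal(size=2 * d)            # node-level attention vector a_{P_i}
z = node_level_attention(m_v, m_neigh, a)
```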
3.1.2 Semantic-Level Attention
As shown in Fig. 3, to obtain a comprehensive feature embedding of the node, semantic-level attention is introduced to learn the importance of each meta-path and incorporate all the semantic-specific feature representations into the final representation. For a given channel processing feature t, the attention coefficient w_Pi^t is acquired for each meta-path Pi as follows:
w_Pi^t = (1/|V|) Σ_{v∈V} b^T · tanh(W · z_{v,Pi}^t + t)   (7)
where b is the semantic-level attention vector, W denotes the weight matrix and t represents the bias vector. Then the normalized coefficients β_Pi^t can be obtained:
β_Pi^t = softmax(w_Pi^t) = exp(w_Pi^t) / Σ_{j=1}^{|P|} exp(w_Pj^t)   (8)
The normalized attention coefficients of the meta-paths indicate the importance of each meta-path. Using these learned weights, we obtain the final representation Z_v^t of node v's feature t that incorporates all the semantics:
Z_v^t = Σ_{Pi∈P} β_Pi^t · z_{v,Pi}^t   (9)
It should be noted that Z_v^t is an abbreviation denoting the feature representation after one-hop propagation, while Z_v^t(n) denotes the feature representation after n rounds of propagation.
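A minimal sketch of the semantic-level attention of Eqs. (7)-(9), assuming NumPy arrays and illustrative shapes (the variable names are ours):

```python
import numpy as np

def semantic_level_attention(z_paths, b, W, t_bias):
    """Eqs. (7)-(9): weight |P| semantic-specific embeddings and combine them.

    z_paths: array (|P|, n_nodes, d) of per-meta-path embeddings z_{v,P_i}.
    """
    # w_{P_i} = mean over v of b . tanh(W z_{v,P_i} + t), Eq. (7)
    w = np.array([(np.tanh(z @ W.T + t_bias) @ b).mean() for z in z_paths])
    beta = np.exp(w) / np.exp(w).sum()           # softmax over meta-paths, Eq. (8)
    return np.tensordot(beta, z_paths, axes=1)   # Z_v = sum_i beta_i z_{v,P_i}, Eq. (9)

rng = np.random.default_rng(2)
P, n, d, h = 3, 5, 4, 6
z_paths = rng.normal(size=(P, n, d))
W = rng.normal(size=(h, d))       # weight matrix
b = rng.normal(size=h)            # semantic-level attention vector
t_bias = rng.normal(size=h)       # bias vector
Z = semantic_level_attention(z_paths, b, W, t_bias)
```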
Fig. 3. Details of Node-Level Attention and Semantic-Level Attention in Heterogeneous Graph Conv.
3.2 Transformer-Based Fusion Module
Following the same procedure as described above, the final feature representations Z_v^d(n), Z_v^c(n) and Z_v^r(n) of node v can be derived. Based on the effectiveness of the transformer model in selecting and merging word vectors, as demonstrated in [11], we propose a transformer-based fusion module. This module selectively merges the three feature representations to generate the final node representation of v, which is formulated as follows:
Z_v^(n) = ψ(F_θ(ψ(Z_v^d(n), Z_v^c(n), Z_v^r(n))))   (10)
where F_θ denotes a two-layer multi-layer perceptron (MLP), and ψ represents the residual connection [12] and layer normalization [13] operations.
3.3 Model Optimization
During model training, we define actions that have interacted with the trigger as positive samples, while actions that have not interacted with the trigger are considered negative samples. To balance the numbers of positive and negative samples, we randomly select one negative sample for each positive sample. The objective of the loss function is to make the interaction probability of positive samples exceed that of negative samples, which can be formulated as follows:
L = Σ_{(T_o,A_p)∈P, A_q∈N_{T_o}} max(0, 1 − (σ(Z_{T_o}^(n) · Z_{A_p}^(n)) − σ(Z_{T_o}^(n) · Z_{A_q}^(n))))   (11)
where P denotes the set of positive service pairs (trigger, action) and N_{T_o} represents the set of negative samples of the trigger service T_o. The purpose of the hinge loss is to restrict the model from learning irrelevant negative samples.
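The hinge loss of Eq. (11), with one sampled negative per positive pair, can be sketched as follows. This is an illustrative NumPy version with random embeddings; the paper's implementation uses PyTorch Geometric.

```python
import numpy as np

def pairwise_hinge_loss(z_trig, z_pos, z_neg):
    """Eq. (11): push sigma(positive score) above sigma(negative score) by margin 1."""
    sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
    pos = sigma((z_trig * z_pos).sum(axis=-1))   # sigma(Z_To . Z_Ap)
    neg = sigma((z_trig * z_neg).sum(axis=-1))   # sigma(Z_To . Z_Aq)
    return np.maximum(0.0, 1.0 - (pos - neg)).sum()

rng = np.random.default_rng(0)
z_trig, z_pos, z_neg = (rng.normal(size=(8, 16)) for _ in range(3))
loss = pairwise_hinge_loss(z_trig, z_pos, z_neg)
```

Because each sigmoid lies in (0, 1), every pair contributes at most 2 to the loss, so minimizing it drives positive pairs to score above their sampled negatives.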
4 Evaluation
4.1 Experimental Setup
4.1.1 Dataset
Our proposed TAP-AHGNN method was evaluated on a large-scale IFTTT dataset [1], which is widely accepted and utilized in the field and is considered representative. Cross-validation and repeated experiments were conducted to ensure the consistency of the model's performance. To obtain commonly used recipes, we only analyzed recipes with "like" counts exceeding 100. After processing, the dataset consisted of 9,884 recipes (rules), 1,249 triggers, 748 actions, and 431 channels (devices). Additionally, NLP tools were utilized to extract 3,980 keywords from the titles and descriptions of the services, which were utilized as nodes to construct TAHG.
4.1.2 Evaluation Metrics
In the evaluation process, five negative actions were randomly generated for each positive sample (trigger, action). HR@k was used to measure the system's ability to identify
positive action services, and MRR@k and NDCG@k were used to evaluate the quality of the ranking.
• HR@k is defined as the percentage of correctly matched action services in the top-k recommendation list among all action services:
HR@k = |O_hit^k| / |O_all|   (12)
where |O_hit^k| denotes the count of correctly matched action services in the top-k list and |O_all| denotes the total number of action services.
• MRR@k focuses on the first relevant object in the top-k recommendation list:
MRR@k = (1/|O_all|) Σ_{i=1}^{|O_all|} 1/rank_i   (13)
where rank_i denotes the rank of the first relevant action service in the list.
• NDCG@k rewards placing the most suitable objects at the top of the list to improve the recommendation results:
DCG@k = Σ_{i=1}^{k} (2^rel_i − 1)/log2(i + 1),  NDCG@k = DCG@k / IDCG@k   (14)
where rel_i represents the relevance of the i-th action in the top-k list, and IDCG@k is the DCG@k of REL_k, the ideal descending ordering of the top-k results.
4.1.3 Training Details
We employed two graph convolution layers, with 16 neighbors sampled at the first hop and 8 neighbors sampled at the second hop. In each layer, the output embedding dimension d was set to 64 and a dropout rate of 0.2 was applied. The model was trained with the Adam optimizer at a learning rate of 0.01 for up to 256 epochs. Our implementation uses PyTorch Geometric [14] on a Linux platform with an Nvidia GeForce RTX 3080 GPU.
4.2 Performance Comparison
TAP-AHGNN was compared with popular feature-based ranking methods and state-of-the-art recommendation methods based on isomorphic and heterogeneous graph techniques: MLP [15], GCN [8], GIN [16], SAGE [9] and HAN [10]. The performance of all methods was evaluated on the IFTTT dataset using HR@k, NDCG@k, and MRR@k (k = 3, 5, 10, 15), as presented in Table 1. Across the four values of k, TAP-AHGNN consistently and significantly outperforms the other baselines.
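The three metrics defined in Sect. 4.1.2 can be sketched as follows. This is an illustrative re-implementation with toy data, not the evaluation code used in the paper; here each trigger is assumed to have a single ground-truth action.

```python
import numpy as np

def hr_at_k(ranked, relevant, k):
    """HR@k, Eq. (12): fraction of ground-truth actions found in the top-k."""
    hits = sum(1 for lst, rel in zip(ranked, relevant) if rel in lst[:k])
    return hits / len(relevant)

def mrr_at_k(ranked, relevant, k):
    """MRR@k, Eq. (13): mean reciprocal rank of the first relevant action."""
    rr = []
    for lst, rel in zip(ranked, relevant):
        rank = next((i + 1 for i, a in enumerate(lst[:k]) if a == rel), None)
        rr.append(1.0 / rank if rank else 0.0)
    return float(np.mean(rr))

def ndcg_at_k(rels, k):
    """NDCG@k, Eq. (14): DCG of the list divided by DCG of its ideal ordering."""
    dcg = sum((2 ** r - 1) / np.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum((2 ** r - 1) / np.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

ranked = [["a", "b", "c"], ["b", "a", "c"]]   # per-trigger recommendation lists
relevant = ["a", "c"]                          # one ground-truth action each
```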
Compared with the HAN method, our approach achieves an average improvement of 1.2%, which we attribute to the multiple features and the transformer-based fusion module. Unlike HAN and our method, recommendation methods based on isomorphic graph techniques fail to capture the relationships between specific TAP entities without meta-paths, leading to worse performance; this highlights the importance of meta-paths. Among the various baselines, we observe that the GNN variants (SAGE, GCN, GIN) outperform the MLP model, demonstrating that aggregating neighbor information through the graph structure yields considerable benefits.
Table 1. Performance of various models at HR@k, NDCG@k and MRR@k.

Metrics       MLP     GCN     GIN     SAGE    HAN     Ours
@3   HR       0.6962  0.6843  0.7290  0.8077  0.8290  0.8440
     NDCG     0.6546  0.6521  0.7236  0.8789  0.9229  0.9416
     MRR      0.6948  0.7041  0.7809  0.8977  0.9251  0.9381
@5   HR       0.8272  0.8010  0.8291  0.8990  0.9073  0.9162
     NDCG     0.7119  0.7011  0.7600  0.8929  0.9319  0.9475
     MRR      0.7004  0.7120  0.7831  0.8988  0.9251  0.9381
@10  HR       0.9366  0.9314  0.9346  0.9516  0.9618  0.9668
     NDCG     0.7538  0.7581  0.8055  0.9004  0.9429  0.9551
     MRR      0.7004  0.7139  0.7831  0.8989  0.9251  0.9382
@15  HR       0.9593  0.9565  0.9610  0.9742  0.9822  0.9862
     NDCG     0.7613  0.7662  0.8149  0.9057  0.9459  0.9577
     MRR      0.7004  0.7139  0.7831  0.8989  0.9251  0.9382
4.3 Ablation Experiment
In this section, we performed ablation experiments, obtaining five variants of the TAP-AHGNN model:
• Ours-d: TAP-AHGNN without function description features.
• Ours-c: TAP-AHGNN without device features.
• Ours-r: TAP-AHGNN without title features.
• Ours-A: TAP-AHGNN without the multi-level attention-based heterogeneous graph convolution module.
• Ours-F: TAP-AHGNN without the transformer-based fusion module, using simple concatenation for feature fusion.
Table 2 shows the comparison results of the different variants in terms of HR@k, NDCG@k, and MRR@k (k = 3, 5, 10, 15). The full TAP-AHGNN model performs best among all variants. Therefore, for the TAP service recommendation task, a joint model must be established to capture the device features, function description features, and title features of services, while using the multi-level attention-based heterogeneous graph convolution module and the transformer-based multi-feature fusion module for information propagation and fusion. Moreover, Ours-d, Ours-c, and Ours-r perform significantly worse than TAP-AHGNN, indicating that these three types of features are indeed beneficial for modeling the TAP service representation. Lastly, TAP-AHGNN outperforms Ours-A and Ours-F, demonstrating the effectiveness of the multi-level attention-based heterogeneous graph convolution module and the transformer-based multi-feature fusion module.
Table 2. Ablation studies for sub-modules of TAP-AHGNN.

Metrics       Ours-d  Ours-c  Ours-r  Ours-A  Ours-F  Ours
@3   HR       0.7928  0.8367  0.8018  0.7876  0.8186  0.8440
     NDCG     0.8479  0.9239  0.8794  0.8540  0.9008  0.9416
     MRR      0.8820  0.9212  0.8970  0.9006  0.9194  0.9381
@5   HR       0.8772  0.9156  0.8977  0.8867  0.8942  0.9162
     NDCG     0.8639  0.9283  0.8968  0.8783  0.9116  0.9475
     MRR      0.8820  0.9213  0.8970  0.9007  0.9194  0.9381
@10  HR       0.9519  0.9610  0.9502  0.9561  0.9607  0.9668
     NDCG     0.8872  0.9312  0.9039  0.9029  0.9306  0.9551
     MRR      0.8821  0.9213  0.8970  0.9007  0.9195  0.9382
@15  HR       0.9720  0.9843  0.9752  0.9701  0.9805  0.9862
     NDCG     0.8914  0.9361  0.9107  0.9117  0.9341  0.9577
     MRR      0.8821  0.9213  0.8970  0.9007  0.9195  0.9382
4.4 Hyperparameter Study
Figure 4 (1) shows the experimental results of TAP-AHGNN with varying numbers of graph convolution layers, which equal the number of neighbor hops. The results demonstrate that the performance of TAP-AHGNN (2-hop neighbors) surpasses that of TAP-AHGNN-0 (no neighbors) and TAP-AHGNN-1 (only 1-hop neighbors). This indicates that neighbor aggregation can effectively refine service representations. On the other
Fig. 4. (1) Performance of TAP-AHGNN with different graph layers in the top-3 and top-5 recommendation task. (2) Performance of TAP-AHGNN with different embedding dimensions in the top-3 and top-5 recommendation task.
hand, TAP-AHGNN-3 (3-hop neighbors) underperforms TAP-AHGNN (2-hop neighbors), suggesting that including neighbors that are too far away may introduce noise into the final representation. From Fig. 4 (2), it can be observed that increasing the embedding dimension d improves performance; however, beyond a certain point, the gains start to slow down and in some cases may even decrease. This suggests that increasing the dimension d has limitations due to noise and overfitting.
5 Discussion
TAP-AHGNN addresses the challenge of recommending suitable action services in a large search space, providing personalized recommendations based on user-specified trigger services. By considering both semantic and structural information, TAP-AHGNN generates accurate recommendations, enhancing TAP-based IoT workflows. Our research focuses on utilizing the public rules shared by users, so privacy is not the primary objective of this study. In the future, we will consider both public and private rules to recommend feasible services through techniques such as federated learning [17].
6 Conclusion
In this paper, a novel method named TAP-AHGNN is proposed to tackle the TAP service recommendation problem. To capture the complex structural information among services, a heterogeneous TAP knowledge graph (TAHG) is designed, from which five meta-paths are extracted to construct service neighborhoods with various semantics. The proposed model is composed of a multi-level attention-based heterogeneous graph convolution module and a transformer-based fusion module, and leverages services' function, device and title attributes as initial inputs. To the best of our knowledge, TAP-AHGNN is the first attempt to regard the TAP service recommendation problem as a recommendation problem on heterogeneous graphs, which can be solved from both semantic and graph structure perspectives.
Acknowledgements. This work is partially supported by National Key Research and Development Program of China (2021YFC3340601), National Natural Science Foundation of China (Grant No. 61972286, 62172301 and 61772371), the Science and Technology Program of Shanghai, China (Grant No. 22511104300, 20ZR1460500, 21511101503, 21ZR1423800, 22410713200), the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100) and the Fundamental Research Funds for the Central Universities.
References
1. Mi, X., Qian, F., Zhang, Y., et al.: An empirical characterization of IFTTT: ecosystem, usage, and performance. In: Proceedings of the 2017 Internet Measurement Conference, pp. 398–404 (2017)
2. Zhang, L., He, W., Morkved, O., et al.: Trace2TAP: synthesizing trigger-action programs from traces of behavior. Proc. ACM Interact. Mob. Wearable Ubiquit. Technol. 4(3), 1–26 (2020)
3. Makhshari, A., Mesbah, A.: IoT bugs and development challenges. In: 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pp. 460–472. IEEE (2021)
4. Yusuf, I.N.B., Jiang, L., Lo, D.: Accurate generation of trigger-action programs with domain-adapted sequence-to-sequence learning. In: Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension, pp. 99–110 (2022)
5. Yusuf, I.N.B., Jamal, D.B.A., Jiang, L., et al.: RecipeGen++: an automated trigger action programs generator. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1672–1676 (2022)
6. Zhang, H., Zhu, L., Zhang, L., et al.: Smart objects recommendation based on pre-training with attention and the thing–thing relationship in social Internet of things. Future Gener. Comput. Syst. 129, 347–357 (2022)
7. Kim, S., Suh, Y., Lee, H.: What IoT devices and applications should be connected? Predicting user behaviors of IoT services with node2vec embedding. Inf. Process. Manag. 59(2), 102869 (2022)
8. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
9. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst. 30 (2017)
10. Wang, X., Ji, H., Shi, C., et al.: Heterogeneous graph attention network. In: The World Wide Web Conference, pp. 2022–2032 (2019)
11. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
12. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
13. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
14. Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428 (2019)
15. Seiffert, U.: Multiple layer perceptron training using genetic algorithms. In: ESANN, pp. 159–164 (2001)
16. Xu, K., Hu, W., Leskovec, J., et al.: How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018)
17. Zhang, Y., Xu, Y., Wei, S., et al.: Doubly contrastive representation learning for federated image recognition. Pattern Recogn. 139, 109507 (2023)
Online Unsupervised Anomaly Detection in Stream Data with Spiking Neural Networks Using Dynamic Scoring Yaling Li1 and Jintian Ge2(B) 1 School of Information Science and Engineering, University of Jinan, Jinan 250022, China 2 Industry School of Standardization, University of Jinan, Jinan 250002, China
[email protected]
Abstract. Unsupervised anomaly discovery in stream data is a challenging task, as it involves detecting unusual patterns in data that is constantly evolving over time. Online data-stream outlier detection is even more difficult: new data points arrive continuously, and the outlier detection algorithm must process them in real time. Our idea is to use an online evolving spiking neural network classifier together with dynamic outlier score normalization to determine whether the current data point is an outlier. The evolutionary mechanism helps the network adapt to changing patterns in the data over time, while the dynamic outlier score normalization adjusts the outlier score based on the distribution of data points over a moving time window. Our approach proves more effective than the other solutions provided in the literature when applied to data streams from the Numenta Anomaly Benchmark repository. Keywords: Unsupervised anomaly detection · Data stream · Evolving Spiking Neural Networks · Dynamic scoring
1 Introduction
With the increasing use of Internet of Things (IoT) devices, there will be a large amount of streaming data that requires mining and analysis. Outlier detection is one of the essential tasks in streaming data mining and can help identify abnormal patterns or events in real-time data. Online outlier detection of stream data is more difficult and challenging: (i) the training process must be carried out gradually with the passage of time; (ii) incoming new samples can only be processed once; (iii) the distribution of the data will change over time. We argue that an online unsupervised anomaly detector should have an adaptive memory for historical input values: remember important historical data and forget data that will not affect the output. Thus the brain-inspired SNN [1, 2], by using trains of spikes (binary temporal events) transmitted among spatially located synapses and neurons, can be considered the ultimate inspiration for the development of new machine learning techniques for these issues. eSNN [3] is a neural network method based on SNN in which the learning processes, neuronal communication and classification of data instances are based solely on the transmission of spikes from input neurons to output neurons. Recently, an online evolving spiking neural network for unsupervised anomaly detection (OeSNN-UAD) [4] was proposed; it enables more effective detection of anomalies, runs entirely online, makes fast detections among data stream input values, and works efficiently in environments with restrictive memory limits.
One of the most representative approaches is the hierarchical temporal memory (HTM) Numenta anomaly detector [5], a prediction-based outlier detector using a theoretical framework called HTM [6] for sequence learning in the cortex. It is worth noting that the anomaly likelihood (AL) introduced in [5], a novel incremental threshold, significantly improves the NAB score. Scoring outliers based on prediction-error modeling therefore seems to help anomaly detectors achieve better results. To implement different methodologies for online data normalization and online outlier scoring, EORELM-AD [7] was developed by implementing the steps of the proposed framework over an ensemble of online recurrent extreme learning machines. The EORELM-AD detector's performance is competitive with several state-of-the-art outlier detection algorithms.
Considering all the positive features of EORELM-AD and OeSNN-UAD, our idea is to adapt the Online evolving Spiking Neural Network for Unsupervised Anomaly Detection algorithm (OeSNN-UAD) to a dynamic anomaly scoring mechanism and improve its prediction process. On this basis, the model is further improved to better adapt to data with concept drift characteristics.
The paper is structured as follows. In Sect. 2, we overview the related work. In Sect. 3, we present the architecture of the Online evolving Spiking Neural Network for Unsupervised Anomaly Detection, whose adaptation will then be used in our method. In Sect. 4, we present our new ideas and discuss the proposed algorithm in detail. Section 5 presents the results of a comparative experimental evaluation of the proposed detector and state-of-the-art unsupervised and semi-supervised anomaly detectors. We conclude our work in Sect. 6.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 153–164, 2023. https://doi.org/10.1007/978-981-99-4752-2_13
2 Related Work
Very few algorithms proposed thus far meet the stringent requirements of online processing for the detection of outliers in time-series stream data. Several references can be found in this field, represented by a referential paper on outlier analysis [8] and a recently published review of outlier/anomaly detection in time-series data streams [9]. Among the scarce proposals fulfilling these requirements, we highlight several of the most representative methods in the current state of the art. OLAD [10] identifies an anomaly whenever an observation deviates from the modeled normality. By embedding the method in a nonparametric Bayesian framework, OLAD gains good properties such as a nonparametric representation and analytic predictive distributions.
Skyline [11] is an algorithm designed for detecting anomalies in data streams. The outlier detectors used in Skyline include Grubb’s test for outliers and a simple comparison of the current input value against the deviation from the average of past values. TwitterADVec [12] is a method for anomaly detection based on the Seasonal Hybrid ESD (S-H-ESD) algorithm [13]. The S-H-ESD algorithm is used to detect anomalies in time series data by first calculating extreme Student deviates [14] of the time series values. Extreme Student deviates are values that are significantly different from the rest of the data. Yahoo EGADS [15] is a powerful and flexible algorithm for detecting anomalies in time series data. To detect outliers, Yahoo EGADS uses a time series modeling module that decomposes the data into trend, seasonal, and residual components. Bayesian Changepoint [16] works by first modeling the data using a probability distribution. The algorithm calculates the probability of a changepoint occurring at each data point and selects the point with the highest probability as the most likely changepoint. ContextOSE [17] compares the context of the most recent subsequence of data with the contexts of past subsequences of data. If the context of the most recent subsequence differs significantly from the contexts of past subsequences, it is classified as anomalous. The anomaly detector proposed in [18] detects whether a current data stream value is an anomaly or not based only on a given number of recent input values. Some of the above presented methods and algorithms are directly compared to our approach in the experimental evaluation provided in Sect. 5.
3 Online Evolving Spiking Neural Networks for Unsupervised Anomaly Detection
The presentation of our approach is preceded by an overview of the architecture of the OeSNN-UAD network, which uses a modified version of OeSNN [20]. OeSNN-UAD uses an input layer to encode time series data, then maps the encoded input neurons to the output layer and generates several candidate output neurons. It designs an evolutionary mechanism to store and iterate over a certain number of output neurons. The output neuron repository is used to predict the current data state and determine whether it is abnormal.
3.1 Input and Output Layers of the OeSNN-UAD Network
Input Layer. The input layer adopts the encoding method of OeSNN and consists of so-called Gaussian Receptive Fields (GRFs) and input neurons. Given an input value, a GRF, discussed in [19], encodes it as firing times and firing order values of input neurons. The set of input neurons is denoted by NI. The number of input neurons is determined by the user-given parameter NI_size. Let X denote an input stream of values to be classified and x_t denote the newest value of that stream at time t. By W_t we denote a window containing the newest value x_t of data stream X as well as the previous W_size − 1 values of that data stream, where W_size is a user-given parameter denoting the size of window W_t. Clearly, at time t, W_t contains the values x_{t−(W_size−1)}, x_{t−(W_size−2)}, …, x_t of the data stream. In window W_t, the maximum and minimum values are denoted by I_max^W and I_min^W, respectively. Then, the
firing times and firing order values of input neurons can be calculated from the values of the excitation functions obtained for the input W_t. The excitation function of the j-th GRF, where j = 0, …, NI_size − 1, for input value x_t is denoted by Exc_j(x_t) and is defined as the following Gaussian function:
Exc_j(x_t) = exp(−(1/2) · ((x_t − μ_j)/σ_j)^2)   (1)
where μ_j stands for the j-th GRF's mean, expressed by Eq. 2, and σ_j stands for the j-th GRF's standard deviation, expressed by Eq. 3:
μ_j = I_min^W + ((2j − 3)/2) · ((I_max^W − I_min^W)/(NI_size − 2))   (2)
σ_j = (1/β) · ((I_max^W − I_min^W)/(NI_size − 2))   (3)
The parameter β in the equation defining σ_j is used to control the degree to which the Gaussian Receptive Fields overlap. μ_j is also called the center value of the j-th GRF, while σ_j is also called its width. The firing time function defined in Eq. 4 assigns earlier firing time values to input neurons associated with GRFs having higher excitation values. The firing time function for input neuron n_j, where j = 0, …, NI_size − 1, is denoted by T_{n_j}(x_t) and is defined as follows:
T_{n_j}(x_t) = TS · (1 − Exc_j(x_t))
(4)
where TS is a user-given basic synchronization time of firings of input neurons and TS > 0. The firing times of input neurons imply their distinct firing order values; namely, input neurons with shorter firing times are assigned smaller firing order values, which are integers in {0, …, NI_size − 1}. The firing order value of input neuron n_j is denoted by order(n_j). Output Layer. For each value x_t of the input data stream, a candidate output neuron n_c is created, which is characterized by five attributes: weights of synapses, post-synaptic potential threshold, output value, update counter, and lower approximation of post-synaptic potential. The definition of the post-synaptic potential threshold γ of all candidate output neurons is given in Eq. 5: γ ← C ·
(1 − mod^(2·NI_size)) / (1 − mod^2)
where C is a user fixed value from the interval (0, 1).
(5)
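The GRF encoding of Eqs. (1)-(4) can be sketched as follows. This is an illustrative NumPy sketch; the parameter defaults are arbitrary, not values recommended by the paper.

```python
import numpy as np

def grf_encode(x_t, w_min, w_max, ni_size=5, ts=1.0, beta=1.5):
    """Eqs. (1)-(4): encode x_t into firing times and firing orders of NI neurons."""
    j = np.arange(ni_size)
    mu = w_min + (2 * j - 3) / 2 * (w_max - w_min) / (ni_size - 2)   # Eq. (2)
    sigma = (w_max - w_min) / (beta * (ni_size - 2))                  # Eq. (3)
    exc = np.exp(-0.5 * ((x_t - mu) / sigma) ** 2)                    # Eq. (1)
    times = ts * (1.0 - exc)                                          # Eq. (4)
    order = np.argsort(np.argsort(times))   # earlier firing -> smaller order value
    return times, order

times, order = grf_encode(x_t=0.4, w_min=0.0, w_max=1.0)
```

The GRF whose center is closest to x_t has the highest excitation, hence the earliest firing time and firing order value 0.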
Weights of synapses between each input neuron n_j and a candidate output neuron n_c are initialized according to the firing order values of the input neurons, as shown in Eq. 6:
w_{n_j, n_c} = mod^(order(n_j))
(6)
where mod is a user-given modulation factor within the range (0, 1) and order(n_j) is n_j's firing order value obtained as a result of the x_t encoding. The vector [w_{n_0, n_c}, …, w_{n_{NI_size−1}, n_c}] of weights of synapses connecting the input neurons in NI with candidate output neuron n_c is denoted by w_{n_c}. The update time τ_{n_c} of the candidate output neuron is set to the value t, as in Eq. 7:
τ_{n_c} = t
(7)
The update counter Mnc is set to 1 according to Eq. 8: Mnc = 1
(8)
The definition of lower approximation of post-synaptic potential PSP nc of nc is given in Eq. 9: PSP nc =
Σ_{j=0}^{NI_size−1} w_{n_j, n_c} · mod^(order(n_j))
(9)
The lower approximation PSP_{n_c} is obtained after firing all input neurons. A candidate output neuron n_c will be fired when PSP_{n_c} is greater than γ; the fired neuron is denoted by n_f. Generation of Values of Candidate Output Neurons. The initial output value v_{n_c} of the candidate output neuron created for x_t is taken randomly from a normal distribution whose mean x̄_{W_t} and variance s²_{W_t} are determined based on the input values present in W_t, according to Eq. 10:
v_{n_c} ∼ N(x̄_{W_t}, s²_{W_t})
(10)
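Creating a candidate output neuron (Eqs. 5-10) can be sketched as follows. The dictionary fields are our own naming and the parameter values are arbitrary, not defaults from the paper.

```python
import numpy as np

def make_candidate(order, window, mod=0.6, c=0.7, t=0):
    """Sketch of candidate output neuron creation; field names are illustrative."""
    ni_size = len(order)
    w = mod ** order                          # synapse weights, Eq. (6)
    psp = float((w * mod ** order).sum())     # lower approx. of PSP, Eq. (9)
    gamma = c * (1 - mod ** (2 * ni_size)) / (1 - mod ** 2)   # threshold, Eq. (5)
    # initial output value drawn from N(mean(W_t), std(W_t)), Eq. (10)
    v = float(np.random.default_rng(0).normal(np.mean(window), np.std(window)))
    return {"w": w, "psp": psp, "gamma": gamma, "value": v, "tau": t, "M": 1}

cand = make_candidate(order=np.array([2, 1, 0, 3, 4]), window=[0.3, 0.4, 0.5])
```

Since order is a permutation of {0, …, NI_size − 1}, the lower approximation of the PSP at creation equals the geometric sum (1 − mod^(2·NI_size))/(1 − mod^2), so it always exceeds the threshold γ = C times that sum for C < 1, and the candidate fires.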
3.2 The Construction and Evolution Process of Output Neurons
Evolving Output Neurons. In OeSNN-UAD, the set of output neurons is denoted by NO. The maximal number of output neurons is given by NO_size, which is also a user-specified parameter. Each candidate output neuron n_c is either added to the repository NO or merged with some output neuron in NO; an output neuron in NO is denoted by n_i. First, the current counter CNO_size of output neurons in the output repository NO is set to 0. Next, the input neurons NI as well as n_c are updated with input value W_t. When CNO_size is less than the maximum number NO_size of output neurons in the output repository, n_c is added to the output repository and the value of CNO_size is increased by 1. When CNO_size is greater than 0, NO is not empty. At this point, a
158
Y. Li and J. Ge
similarity algorithm is used to find the neuron in NO that is most similar to the current nc , represented as ns . The similarity algorithm is given in Eq. 11:
n_s = argmin{ D_{n_c, n_i} | i = 0, ..., C_NOsize − 1 }
D_{n_c} = [dist(w_{n_c}, w_{n_0}), ..., dist(w_{n_c}, w_{n_{C_NOsize−1}})]   (11)
dist(w_{n_c}, w_{n_i}) = sqrt( Σ_{m=0}^{NI_size−1} (w^m_{n_c} − w^m_{n_i})² )
n_s is updated by merging with n_c if the Euclidean distance between the weight vectors of n_c and n_s is not greater than sim, where sim is a user-given similarity threshold. In that case, an update rule adjusts the neuron's weight vector, output value, update time, and update counter, as given in Eq. 12:

w_{n_s} = (w_{n_c} + M_{n_s} · w_{n_s}) / (M_{n_s} + 1)
v_{n_s} = (v_{n_c} + M_{n_s} · v_{n_s}) / (M_{n_s} + 1)   (12)
τ_{n_s} = (τ_{n_c} + M_{n_s} · τ_{n_s}) / (M_{n_s} + 1)
M_{n_s} = M_{n_s} + 1
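The nearest-neighbour search of Eq. 11 and the averaging merge of Eq. 12 can be sketched as follows, assuming a simple dict-based neuron representation (the field names 'w', 'v', 'tau', 'M' are ours; the paper does not prescribe a data structure):

```python
import numpy as np

def merge_or_add(NO, cand, sim):
    """Merge candidate n_c into its nearest neighbour in NO, or append it."""
    if NO:
        dists = [np.linalg.norm(cand['w'] - n['w']) for n in NO]   # Eq. 11
        i = int(np.argmin(dists))
        if dists[i] <= sim:                                        # merge, Eq. 12
            ns, M = NO[i], NO[i]['M']
            ns['w'] = (cand['w'] + M * ns['w']) / (M + 1)
            ns['v'] = (cand['v'] + M * ns['v']) / (M + 1)
            ns['tau'] = (cand['tau'] + M * ns['tau']) / (M + 1)
            ns['M'] = M + 1
            return NO
    NO.append(cand)                      # otherwise n_c becomes a new neuron
    return NO

NO = merge_or_add(
    [{'w': np.array([1.0, 0.0]), 'v': 2.0, 'tau': 1.0, 'M': 1}],
    {'w': np.array([1.0, 0.1]), 'v': 4.0, 'tau': 3.0, 'M': 1},
    sim=0.5)
```

The counter M weights the running average, so a frequently merged neuron changes more slowly than a fresh one.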
Anomaly Detection Phase. First, the output neuron n_f ∈ NO that fired first is obtained. If none of the output neurons fired, then the input value x_t is immediately classified as an anomaly. Otherwise, the output value of the output neuron that fired first is reported as the network prediction y_t for the input value x_t. Finally, the anomaly classification module classifies x_t as anomalous or not, using a prediction error (the absolute difference between x_t and y_t) together with a threshold calculated from the errors of the recent W_size predictions. Additionally, if x_t is not classified as anomalous, the generated initial output value v_{n_c} of the candidate output neuron is corrected by the value correction module. The error e_t between x_t and its prediction y_t is calculated as the absolute difference between the two values: e_t = |x_t − y_t|. Let e be the subset of the set E = {e_{t−(W_size−1)}, ..., e_{t−1}} of those W_size prediction error values that were obtained for input values classified as non-anomalous. If e is not empty, then the mean x̄_e and the standard deviation s_e of the error values in e are calculated and used to classify x_t as either an anomaly or not. If the difference between e_t and x̄_e is greater than ε · s_e, where ε is a user-specified anomaly classification factor, then x_t is classified as an anomaly; otherwise it is not.

Correction of an Output Value of an Output Neuron. When x_t is not classified as an anomaly, the initial output value of the candidate neuron n_c is adjusted as follows: v_{n_c} ← v_{n_c} + (x_t − v_{n_c}) · ξ. In this formula, ξ is a user-given value correction factor within the range [0, 1]. If ξ = 0, the initial output value of n_c does not change. If ξ = 1, v_{n_c} becomes equal to x_t.
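The error-threshold rule above can be sketched as follows; the function signature and the flags array marking previously anomalous inputs are our own framing of the bookkeeping:

```python
import numpy as np

def classify(x_t, y_t, errors, flags, eps):
    """OeSNN-UAD style anomaly decision from recent prediction errors.

    errors : the recent W_size prediction errors (set E)
    flags  : booleans, True where the matching input was itself anomalous
    eps    : user-specified anomaly classification factor epsilon
    """
    e_t = abs(x_t - y_t)
    e = errors[~flags]              # keep errors of non-anomalous inputs only
    if e.size == 0:                 # no reference errors available
        return False
    return bool(e_t - e.mean() > eps * e.std())

errors = np.array([0.1, 0.2, 0.1, 0.2, 5.0])
flags = np.array([False, False, False, False, True])
is_anom = classify(x_t=3.0, y_t=1.0, errors=errors, flags=flags, eps=3.0)
```

Note how the error 5.0, produced by an input already flagged as anomalous, is excluded from the statistics so that one anomaly does not inflate the threshold for the next.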
Online Unsupervised Anomaly Detection in Stream Data
159
4 The Proposed Algorithm for Online Unsupervised Anomaly Detection

Through analysis of OeSNN-UAD, we identified two issues. First, in the output value generation module, using the mean and standard deviation of the entire window to parameterize the normal distribution that generates the output value cannot guarantee closeness to the real value; if the input window W_t contains many unbalanced factors, the generation of output values is seriously affected. Second, in the anomaly detection module, applying an anomaly classification factor as a threshold to unbounded raw prediction errors seems like the easiest solution in practice. However, choosing a threshold value is not straightforward and can lead to many false positives (FPs) if the value is not tailored to the target dataset. This weakness was the main motivation for our streaming scoring method, described in Sect. 4.3. Therefore, we propose an output value generation method based on k-means++ and an anomaly classification method combined with dynamic anomaly scoring.

4.1 The Proposed Algorithm for Online Unsupervised Anomaly Detection

In this section, our proposed algorithm, shown in Fig. 1, is presented and discussed in detail. The values of the first window W_t are not classified as anomalies. For each input value x_t of X, where t ≥ W_size + 1, first, window W_t is updated with input value x_t, which becomes the subject of anomaly classification, and the input neurons are determined as described in Sect. 3.1. Next, the output neuron n_f ∈ NO that fires first is obtained. If no output neuron is activated, value x_t is classified as anomalous, and the prediction y_t of our model as well as the error value e_t are set to NULL and +∞, respectively. Otherwise, the prediction y_t is assigned the output value v_{n_f}, the error e_t is set to the absolute difference between x_t and y_t, and our proposed 'anomaly detection based on dynamic anomaly scoring' procedure from Sect. 4.3 is invoked.
The procedure returns a Boolean value u_t indicating the presence or absence of an anomaly for input value x_t. Next, a new candidate output neuron n_c is created and then initialized in our proposed 'generation of values of candidate output neurons based on k-means++' procedure, presented in Sect. 4.2. It first creates synapses between the candidate output neuron n_c and each input neuron in NI. Then, the weights of the created synapses are calculated according to the firing-order values of the input neurons in NI obtained for input value x_t. Next, a temporary window is computed by k-means++ with parameter k. Then, the output value v_{n_c} of n_c is generated from a normal distribution created based on the temporary window W_t^tmp (as presented in Sect. 4.2), and finally the update time τ_{n_c} is set to the current input time t. Next, an output neuron n_s ∈ NO is found such that the Euclidean distance between the synapse weight vectors of n_c and n_s is the smallest. If D_{n_c,n_s} is less than or equal to the similarity threshold value sim, then n_s is merged with n_c according to the 'evolving output neurons' procedure presented in Sect. 3.2. Finally, Scores is calculated by Eq. 14 in Sect. 4.3.
Fig. 1. The proposed architecture
4.2 Generation of Values of Candidate Output Neurons Based on k-Means++

We perform k-means clustering on W_t and divide it into k classes, where k is a user-defined parameter representing the number of clusters. The data of the category containing x_t, denoted by the temporary window W_t^tmp, is similar to x_t. Then the mean x̄_{W_t^tmp} and the variance s²_{W_t^tmp} of W_t^tmp are calculated. A value of the candidate output neuron is randomly generated from a normal distribution with mean x̄_{W_t^tmp} and variance s²_{W_t^tmp}, as in Eq. 13:

v_{n_c} ~ N(x̄_{W_t^tmp}, s²_{W_t^tmp})   (13)
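A sketch of the Eq. 13 procedure, using a small hand-rolled 1-D k-means (k-means++ seeding plus Lloyd iterations) so the example stays dependency-free; a production version would typically call a library implementation, and the window values below are invented:

```python
import numpy as np

def kmeans1d(x, k, iters=20, rng=np.random.default_rng(0)):
    """Tiny 1-D k-means with k-means++ seeding (illustrative, not a library)."""
    centers = [x[rng.integers(len(x))]]
    for _ in range(k - 1):               # k-means++: distance-weighted seeding
        d2 = np.min((x[:, None] - np.array(centers)[None, :]) ** 2, axis=1)
        centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
    c = np.array(centers, dtype=float)
    for _ in range(iters):               # Lloyd iterations
        labels = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                c[j] = x[labels == j].mean()
    return labels

def generate_output_value(Wt, k, rng=np.random.default_rng(1)):
    """Eq. 13: draw v_nc from the cluster of W_t that contains x_t (= Wt[-1])."""
    labels = kmeans1d(Wt, k)
    tmp = Wt[labels == labels[-1]]       # temporary window W_t^tmp
    return rng.normal(tmp.mean(), tmp.std())

Wt = np.array([1.0, 1.1, 0.9, 10.0, 10.2, 1.05])   # x_t = 1.05
v = generate_output_value(Wt, k=2)
```

Because only the cluster containing x_t parameterizes the distribution, the two outlying values around 10 no longer inflate the mean and variance the way the full-window statistics of Eq. 10 would.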
4.3 Anomaly Detection Based on Dynamic Anomaly Scoring

We adopt the dynamic sigma scoring technique proposed in [7], which was shown to have the best generalization and adaptation capabilities among the methods considered in [7]. In our work, we use anomaly scores not only as thresholds for individual outliers, but also as scoring criteria for long anomaly windows. On the basis of the anomaly detection module of OeSNN-UAD, we improve the anomaly judgment as follows. If e is not empty, which means the model detects a long anomaly window, the next step should not be to roughly classify the next data point as normal, but to determine its anomaly score and then derive the judgment for the next data point. In this way, we use the current score_t and ε (a user-given threshold of anomaly scoring) to classify x_t as either an anomaly or not. Otherwise, if the difference between e_t and x̄_e is greater than score_t, then x_t is classified as an anomaly; otherwise it is not. In our approach, the prediction error values E are used to calculate anomaly scores according to the dynamic anomaly scoring algorithm. The definition of the dynamic sigma score score_t is given in Eq. 14:

score_t = exp(−(ln 2 / (3σ_t)²) · |e_t − μ_t|²)   (14)
Accordingly, the dynamic mean μ_t and dynamic standard deviation σ_t are computed as discussed in [7], as follows:

μ_t = μ_{t−1} + (e_t − μ_{t−1}) / t
s_t = s_{t−1} + (e_t − μ_{t−1}) · (e_t − μ_t)   (15)
σ_t = sqrt(s_t / (t − 1))
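Eqs. 14-15 can be maintained incrementally with a Welford-style update; the helper below is our own framing of one update step, and the error sequence is invented. Note that, read literally, Eq. 14 yields a score near 1 when the error is close to the dynamic mean:

```python
import math

def update_score(e_t, t, mu, s):
    """One step of the dynamic sigma score (Eqs. 14-15); t starts at 1.

    mu is the running mean and s the running sum of squared deviations.
    Returns (score_t, new_mu, new_s).
    """
    mu_new = mu + (e_t - mu) / t                  # dynamic mean
    s = s + (e_t - mu) * (e_t - mu_new)           # Welford-style accumulator
    sigma = math.sqrt(s / (t - 1)) if t > 1 else 0.0
    if sigma == 0.0:                              # degenerate start-up case
        return (1.0 if e_t == mu_new else 0.0), mu_new, s
    score = math.exp(-(math.log(2) / (3 * sigma) ** 2) * abs(e_t - mu_new) ** 2)
    return score, mu_new, s

mu, s, scores = 0.0, 0.0, []
for t, e in enumerate([1.0, 1.2, 0.8, 1.1, 6.0], start=1):
    sc, mu, s = update_score(e, t, mu, s)
    scores.append(sc)
```

The update is O(1) per input value, which keeps the scoring compatible with the online setting of the detector.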
Each score_t in the set of historical anomaly scores, denoted by Scores, is used either as a dynamic threshold or as a score of a long anomalous subsequence. It should be noted that we removed the error correction module of OeSNN-UAD, because its corrected E would affect the accuracy of the scoring.
5 Experiments

The goal of our first experiment is to compare the performance of our approach with some state-of-the-art algorithms reported in the literature on online unsupervised anomaly detection. Additionally, a second experiment is conducted to evaluate the time efficiency of our approach (the time required to train the model and determine whether an input data point is an outlier). In this section, we present the details of our experimental setup. The experimental results concerning both our proposed detector and all other anomaly detectors used for comparative assessment were obtained after tuning their parameters.

5.1 Datasets

To design the experiments, we use the Numenta Anomaly Benchmark (NAB) repository, which contains 6 categories of labeled datasets: artificialWithAnomaly, realAdExchange, realAWSCloudwatch, realKnownCauses, realTraffic, and realTweets.

5.2 Evaluation Metrics

We use five measures of detection quality: precision, recall, F-measure, balanced accuracy (BA), and Matthews correlation coefficient (MCC). Their formulas are given in Eqs. 16-20:

Precision = TP / (TP + FP)   (16)
Recall = TP / (TP + FN)   (17)
F1 = 2 · (Precision · Recall) / (Precision + Recall)   (18)
BA = (1/2) · (TP / (TP + FN) + TN / (TN + FP))   (19)
MCC = (TP · TN − FP · FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))   (20)
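Under the standard confusion-matrix definitions, Eqs. 16-20 compute as follows; the counts below are illustrative, not from the experiments:

```python
import math

def metrics(TP, FP, TN, FN):
    """Eqs. 16-20 from confusion-matrix counts."""
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    ba = 0.5 * (TP / (TP + FN) + TN / (TN + FP))
    mcc = (TP * TN - FP * FN) / math.sqrt(
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return precision, recall, f1, ba, mcc

p, r, f1, ba, mcc = metrics(TP=8, FP=2, TN=85, FN=5)
```

BA and MCC are included because anomaly labels are heavily imbalanced: plain accuracy would be dominated by the true negatives.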
5.3 Obtained Anomaly Detection Results

Performance Compared to Other State-of-the-Art TSOD Methods. In Table 1, we show the anomaly detection results of our method on the Numenta Anomaly Benchmark repository as well as those of the other unsupervised anomaly detection methods and algorithms. As in [7], we report the mean F-measure obtained for each category of data files for each compared detector. As follows from Table 1, our approach outperforms the results obtained by the other detectors in terms of F-measure for each category of data files. Table 2 presents the obtained precision and recall values for the selected data files from the Numenta Anomaly Benchmark repository. For some data files, our method provides much higher recall than OeSNN-UAD. In addition, our detector achieves much larger F1, BA, and MCC values, and is thus much more effective in detecting anomalies than the compared detector OeSNN-UAD.

Time Efficiency of the Proposed Algorithm. We conclude our analysis by inspecting the time required by the different approaches to process data in an online or incremental fashion. The results of this benchmark are summarized in Table 3. Although our model is slightly slower than PEWMA, SD-EWMA, TSSD-EWMA, KNN-ICAD, and EORELM-AD, all models are able to process a data point in less than half a second. Therefore, we conclude that our method is efficient for online processing, because the smallest collection period among the NAB datasets is 5 s (Tables 2 and 3).

Table 1. Comparison of average F-measure values obtained for unsupervised anomaly detection of streaming time series data and the proposed method for the NAB (the results for methods marked with * are given in [8]). The highest scores for each dataset category are highlighted in bold.
Table 2. Precision, recall, F1, BA, MCC values obtained for Numenta Anomaly Benchmark stream data using the selected unsupervised anomaly detector OeSNN-UAD and using our proposed detector. The results for the detectors marked with * were reported in [4].
Table 3. Time efficiency over one point processing using incremental processing algorithms.
6 Conclusions

In this article, we proposed a new detector of anomalies in data streams. Our detector, combining dynamic anomaly scoring and online evolving spiking neural networks, is designed for univariate streaming time series data. Based on our experiments and the insights drawn from our study of the time efficiency of the proposed detector, we conclude that the proposed framework can serve as a reference tool for the community for adapting online time series prediction algorithms to outlier detection.
References

1. Hopfield, J.J.: Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. 79(8), 2554–2558 (1982)
2. Hodgkin, A.L., Huxley, A.F.: A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117(4), 500–544 (1952)
3. Kasabov, N.: Evolving Connectionist Systems: The Knowledge Engineering Approach. Springer-Verlag, New York (2007)
4. Maciag, P.S., Kryszkiewicz, M.: Unsupervised anomaly detection in stream data with online evolving spiking neural networks. Neural Netw. 139, 118–139 (2021)
5. Ahmad, S., Lavin, A., Purdy, S., Agha, Z.: Unsupervised real-time anomaly detection for streaming data. Neurocomputing 262(SI), 134–147 (2017)
6. Hawkins, J., Ahmad, S.: Why neurons have thousands of synapses, a theory of sequence memory in neocortex. Front. Neural Circuits 10 (2016)
7. Iturria, A., Labaien, J., Charramendieta, S., Lojo, A., Del Ser, J., Herrera, F.: A framework for adapting online prediction algorithms to outlier detection over time series. Knowl. Based Syst. 256 (2022)
8. Gupta, M., Gao, J., Aggarwal, C.C., Han, J.: Outlier detection for temporal data: a survey. IEEE Trans. Knowl. Data Eng. 26(9), 2250–2267 (2014)
9. Blazquez-Garcia, A., Conde, A., Mori, U., Lozano, J.A.: A review on outlier/anomaly detection in time series data. ACM Comput. Surv. 54(3) (2022)
10. Xu, Z., Kersting, K., Ritter, L.V.: Stochastic online anomaly analysis for streaming time series. In: Twenty-Sixth International Joint Conference on Artificial Intelligence (2017)
11. Stanway, A.: Etsy Skyline (2015). https://github.com/etsy/skyline
12. Kejariwal, A.: Introducing practical and robust anomaly detection in a time series (2015). https://blog.twitter.com/engineering/en_us/a/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series
13. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 1–58 (2009)
14. Rosner, B.: Percentage points for a generalized ESD many-outlier procedure. Technometrics 25(2), 165–172 (1983)
15. Laptev, N., Amizadeh, S., Flint, I.: Generic and scalable framework for automated time-series anomaly detection. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1939–1947. KDD 2015, Association for Computing Machinery, New York, NY, USA (2015)
16. Adams, R.P., MacKay, D.J.C.: Bayesian online changepoint detection (2007)
17. Smirnov, M.: Contextual anomaly detector (2016)
18. Zhang, L., Zhao, J., Li, W.: Online and unsupervised anomaly detection for streaming data using an array of sliding windows and PDDs. IEEE Trans. Cybern. 51(4), 2284–2289 (2021)
19. Lobo, J.L., Lana, I., Del Ser, J., Bilbao, M.N., Kasabov, N.: Evolving spiking neural networks for online learning over drifting data streams. Neural Netw. 108, 1–19 (2018)
20. Lobo, J.L., Del Ser, J., Bifet, A., Kasabov, N.: Spiking neural networks and online learning: an overview and perspectives. Neural Netw. 121, 88–100 (2020)
Research on Double Input Electric Load Forecasting Model Based on Feature Fusion Zi Wang1(B) , Tao Zhang1 , Sheng Zeng1 , and Bing Wang2 1 School of Electrical and Information Engineering, Wanjiang University of Technology,
Maanshan 243031, People’s Republic of China [email protected] 2 School of Electrical and Information Engineering, Anhui University of Technology, Maanshan 243002, People’s Republic of China
Abstract. Electric load forecasting is foundational work in the power industry worldwide and has an important impact on the operation of the power system. With the development of social production, electricity consumption in daily life, factories, and enterprises is continuously increasing, which also increases the difficulty of electric load forecasting. Traditional methods struggle to analyze the huge and complex electricity consumption data, and existing methods cannot fully learn the potential information in the data. To solve these problems, this research proposes a deep learning-based method with a double-input structure of Conv1D + Attention + GRU and a BP network. These two structures are used to extract the time-series characteristics of electric load data and the characteristics of meteorological factor data, respectively. Feature fusion is performed at the output end of the two structures, and the prediction of future electric load is obtained through a sigmoid function. After testing and analysis, the proposed method's R-squared reaches 0.974 and all error indicators are minimal, so it outperforms the existing methods. This work has great significance for the maintenance and operation of power systems.

Keywords: Electric load · Time series · Deep learning
1 Introduction

1.1 Background

As the construction process in various industries accelerates, electricity consumption is also greatly increasing [1]. In 2022, the total electricity consumption of the whole society reached 8,637.2 billion kWh, a year-on-year growth of 3.6%. To ensure the orderly progress of construction, it is necessary to make reasonable plans for the operation of the power system, and electric load forecasting is an important basis for this work.

Electric load forecasting means that the national electric power departments take the electric load as the object and, through the data collected by the electrical equipment installed in enterprises, residential areas, and other buildings,
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 165–175, 2023. https://doi.org/10.1007/978-981-99-4752-2_14
166
Z. Wang et al.
accurately describe the value of the electricity consumed and implement a series of plans for the forecasting work [2]. The specific work requires analyzing and processing the temporal distribution of the electric load, focusing mainly on predicting future power demand and consumption and providing an important theoretical basis for the planning and operation of the power system. The power system diagram is shown in Fig. 1.
Fig. 1. Power system diagram. It describes the operation content and process of the power system.
However, the speed of social construction and electricity consumption keep rising, which increases the difficulty of electric load forecasting. At the same time, meteorological factors such as temperature, humidity, and rainfall have potential impacts on the electric load, so it is necessary to extract this hidden information [3]. Traditional electric load forecasting methods struggle to analyze the huge and complex electric load data of the new era. To fully extract the main information in the electric load and meteorological factor data, deep learning methods with effective learning structures can be used for processing and analysis.

Deep learning has become a popular research direction in recent years; it encompasses various neural networks suitable for processing different types of data [4]. In this research, considering that electric load data has time-series attributes and meteorological factor data carries potential information, we build a two-input network model for electric load forecasting. One-dimensional convolution combined with a gated recurrent unit is used to analyze the electric load data, and a BP neural network is used to process the meteorological factor data.

1.2 Related Work

Researchers have carried out related work on electric load forecasting and proposed effective methods. Zhiyong Li proposed a knowledge discovery method based on decision trees, which can effectively predict electric load [5]. In 2016, A. I. Saleh proposed a load forecasting strategy employing new outlier rejection and feature selection methodologies;
Research on Double Input Electric Load Forecasting Model
167
it proved that data mining technology can predict electric load accurately [6]. Two years later, Petra proposed using support vector regression to predict short-term electric load, which achieved great results on multiple datasets [7]. The idea of the above work is to explore the relationship between input and output, but the temporal distribution information of the electric load data is equally important. Pan, Lina proposed a new hybrid forecasting approach that combines ensemble empirical mode decomposition (EEMD) and back-propagation neural network (BPNN) [8]; the improved neural network showed better results in prediction tasks. L. Sehovac selected a recurrent neural network to learn the sequence information of the electric load, which provided a theoretical basis for related work [9]. In 2021, K. Wu designed a model based on CNN, LSTM, and BiLSTM to effectively learn the information of the electric load series, obtaining optimal predictions [10]. The next year, D. Niu added an attention mechanism to the CNN-BiGRU model, which has fewer parameters [11]; the experiments showed that the attention mechanism adjusted the parameter weights and produced a better prediction effect. These methods analyze the data in terms of time series, but other factors were not considered, and the feature extraction structure of the models can still be optimized.
2 Method

2.1 Data Preparation

The data in this research describe the historical electric load and meteorological factors of a region from January 1, 2018 to January 17, 2021. The electric load data were collected from 0 to 24 o'clock every day, with a sampling frequency of once every 15 min. The meteorological factor data have five characteristics: daily maximum temperature, minimum temperature, average temperature, relative humidity, and rainfall. Data preparation includes three steps: data cleaning, data standardization, and data construction.

Data cleaning involves detecting missing values and outliers. The primary data are read in with Pandas and checked for missing values through the isnull function; boxplots are then used to observe the data distribution [12]. After checking the data, there are no missing values, and the extremely few outliers are eliminated. The data distribution is shown in Fig. 2. It can be seen from the data distribution that the development trends of several variables are similar. A total of 1106 days of electric load data and meteorological factor data are available for this research. The dimensions of the electric load data are 1106 rows and 96 columns; the dimensions of the meteorological factor data are 1106 rows and 5 columns.

Data standardization unifies the physical dimensions of the variables [13]. Since electric load data and meteorological factor data are both non-negative, the min-max standardization method is selected to process the data. The min-max standardization formula is
x − xmin xmax − xmin
(1)
where x min is the minimum value of the variable, x max is the maximum value of the variable, and x * is the corresponding value of x after standardization. For each variable,
find its maximum value and minimum value respectively. Put each value into the corresponding position in the formula, calculate the standardized value. Then the original data are linearly compressed to the range of [0,1].
Fig. 2. Data distribution of variables. Subgraph (a) represents the electric load data, subgraph (b) represents the maximum temperature, subgraph (c) represents the minimum temperature, subgraph (d) represents the average temperature, subgraph (e) represents the relative humidity and subgraph (f) represents the rainfall.
The correlation heat map between the standardized variables is shown in Fig. 3. It can be seen that part of the meteorological factors have a high correlation with the electric load, indicating that it is meaningful to add meteorological factors to the analysis.

The dimensions of the input and output tensors need to be set for data construction. Considering that the data are in chronological order [14], the electric load data of the latest 96 moments and the meteorological factor data of the previous day are used as the input, and the corresponding electric load value of the next moment is used as the output. NumPy is used for the conversion of input and output. On the input side, the array storing the electric load data has 106080 rows and 96 columns, and the array storing the meteorological factor data has 106080 rows and 5 columns. On the output side, the array storing the electric load prediction targets has 106080 rows and 1 column. The details are shown in Table 1.

2.2 Method Principle

Deep learning is a popular research direction of machine learning; it uses deep neural networks to solve feature representation. Its main purpose is to simulate human senses and brain processing mechanisms, so that machines can think and learn by themselves. Compared with the traditional feature extraction methods of machine learning, deep learning integrates feature representation and pattern prediction into one model and establishes an end-to-end learning method that can automatically extract deeper and richer features.
Fig. 3. Correlation heat map between variables.

Table 1. Data information table for each variable.

Variable name       | Unit | Distribution     | Dimension
Electric load       | KW   | [1684.4, 8965.7] | (106080, 96, 1)
Maximum temperature | °C   | [6.2, 54.9]      | (106080, 1)
Minimum temperature | °C   | [3.9, 28.8]      | (106080, 1)
Average temperature | °C   | [5.5, 32.1]      | (106080, 1)
Relative humidity   | %RH  | [21.0, 99.0]     | (106080, 1)
Rainfall            | mm   | [0.0, 236.1]     | (106080, 1)
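The sliding-window construction behind the shapes in Table 1 (96 past load values plus the previous day's weather per sample, next load value as the target) can be sketched with NumPy on a small synthetic series; the variable names and toy sizes are ours:

```python
import numpy as np

window = 96                                  # one day of 15-min load readings
load = np.arange(288, dtype=float)           # 3 synthetic days, flattened
weather = np.arange(15, dtype=float).reshape(3, 5)   # 5 weather features/day

# Load input: sliding windows over the latest 96 values; target: next value.
X_load = np.stack([load[i:i + window] for i in range(len(load) - window)])
y = load[window:]

# Weather input: the features of the day preceding each target time step.
day = np.arange(window, len(load)) // window - 1
X_weather = weather[day]
```

Applied to the real series (1106 days × 96 readings), this yields the 106080-sample arrays listed in Table 1.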
One branch of the model proposed in this research includes one-dimensional convolution layers, maximum pooling layers, and a gated recurrent unit. Multiple one-dimensional convolution [15] layers are used to extract the deep temporal information of the electric load data; the convolution is calculated by the following formula:

v^x_{ij} = b_{ij} + Σ_m Σ_{p=0}^{P_i−1} w^p_{ijm} · v^{x−p}_{(i−1)m}   (2)

where i represents the position of the convolution layer, j represents the channel position of the current convolution layer, P_i represents the width of the convolution kernel, m indexes the feature maps from the previous layer to the current layer, b is the bias, and w is the weight matrix. The maximum pooling layer is used for dimension reduction; it reduces the number of parameters in the network [16]. The gated recurrent unit has two gates, the update gate z_t and the reset gate r_t, as shown in Fig. 4. The update gate controls how much state information from the previous moment is sent into the current moment [17]; through this mechanism, the long-term dependence of the time series can be learned. The reset gate controls how much of the state information at the previous moment is ignored, so that the
short-term dependence of the time series can be learned. The formulas of the reset gate and update gate can be expressed as follows:

r_t = σ(W_ir x_t + b_ir + W_hr h_{t−1} + b_hr)   (3)
z_t = σ(W_iz x_t + b_iz + W_hz h_{t−1} + b_hz)   (4)
h̃_t = tanh(W_in x_t + b_in + r_t ∗ (W_hn h_{t−1} + b_hn))   (5)
h_t = (1 − z_t) ∗ h̃_t + z_t ∗ h_{t−1}   (6)

where W is the weight and b is the bias of the current unit, and σ is the sigmoid activation function. At time t, the update gate z_t and reset gate r_t are jointly determined by the current input and the hidden state of the previous time step.
Fig. 4. Structure diagram of gated recurrent unit. It contains the reset gate r t and the update gate zt .
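Eqs. 3-6 can be checked with a direct NumPy implementation of a single GRU step; the parameter container P, the key names, and the random initialization are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, P):
    """One GRU step following Eqs. 3-6. P maps names like 'Wir' to arrays."""
    r = sigmoid(P['Wir'] @ x_t + P['bir'] + P['Whr'] @ h_prev + P['bhr'])  # (3)
    z = sigmoid(P['Wiz'] @ x_t + P['biz'] + P['Whz'] @ h_prev + P['bhz'])  # (4)
    n = np.tanh(P['Win'] @ x_t + P['bin']
                + r * (P['Whn'] @ h_prev + P['bhn']))                      # (5)
    return (1 - z) * n + z * h_prev                                       # (6)

rng = np.random.default_rng(0)
H, D = 4, 3                    # hidden size, input size (illustrative)
P = {}
for g in ('r', 'z', 'n'):      # reset, update, and candidate-state parameters
    P['Wi' + g] = rng.normal(size=(H, D))
    P['Wh' + g] = rng.normal(size=(H, H))
    P['bi' + g] = rng.normal(size=H)
    P['bh' + g] = rng.normal(size=H)

h = gru_step(rng.normal(size=D), np.zeros(H), P)
```

Because h_t interpolates between the candidate state and h_{t−1} via z_t (Eq. 6), the hidden values stay bounded and gradients flow across long time spans.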
In this research, the gated recurrent unit is used to learn the time-series features extracted by the one-dimensional convolution and to effectively transmit unit information through the gating mechanism. The other branch is composed of a BP neural network. A BP neural network is a kind of multilevel neural network [18]; the neurons between layers are fully connected to complete forward and back propagation. It can map the input data nonlinearly and has strong generalization ability. Therefore, the BP neural network is used to analyze the feature information of the meteorological factor data.

2.3 Model Establishment

The structure of the electric load forecasting model in this research is shown in Fig. 5. The input of the upper branch is the electric load data; each sample's length along the time dimension is 96. It is processed by three one-dimensional convolution layers with a kernel size of 3, whose function is to extract the time-series characteristics of the electric load data. To accelerate iterative convergence and reduce the tensor dimension, each convolution layer is followed by a batch normalization layer and a maximum pooling layer [19]. The extracted temporal information [20] is transmitted to the gated recurrent unit, where the long-term associated information of the series is learned.
Fig. 5. The structure diagram of the model proposed in this research. Its upper branch analyzes the electric load data, its lower branch processes the meteorological factor data. Concat operation is performed on the feature vector output of the two branches, finally obtain the predicted value.
The input of the lower branch is the meteorological factor data, whose characteristic weights are adjusted in the fully connected layers. The feature vectors of the electric load data and the meteorological factor data are obtained at the output ends of the two branches, respectively. To combine the extracted information, a concat operation is used for feature fusion, producing a new feature vector that fuses the vectors from the two branches. Finally, the predicted value of the whole model is calculated by a sigmoid function. In addition, a dropout layer is used to prevent overfitting during training, and an attention mechanism is placed behind the last convolutional layer. In the attention mechanism, the temporal features are multiplied by the output of a fully connected layer whose activation function is softmax. The formula of this function is

Softmax(z_i) = e^{z_i} / Σ_{c=1}^{C} e^{z_c}   (7)

where z_i is the output value of the i-th neuron and C is the number of output neurons. Each weight corresponds to the probability value represented by a neuron, so the model focuses on learning the information that is important.
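A minimal NumPy sketch of the fusion idea: two branch feature vectors are concatenated and passed through a sigmoid output unit, with a softmax attention weighting over the load window (Eq. 7). The dense stand-ins replace the real Conv1D/GRU branch for brevity, and all weights are random placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b, act=None):
    """A plain fully connected layer with an optional activation."""
    y = x @ W + b
    if act == 'relu':
        return np.maximum(y, 0.0)
    if act == 'sigmoid':
        return 1.0 / (1.0 + np.exp(-y))
    if act == 'softmax':                      # Eq. 7
        e = np.exp(y - y.max())
        return e / e.sum()
    return y

# Upper branch stand-in: softmax attention over the 96-step load window,
# then a dense projection to an 8-dim feature vector.
load_window = rng.random(96)
att = dense(load_window, 0.1 * rng.normal(size=(96, 96)), np.zeros(96), 'softmax')
feat_load = dense(load_window * att, 0.1 * rng.normal(size=(96, 8)),
                  np.zeros(8), 'relu')

# Lower branch: BP network over the 5 meteorological features.
weather = rng.random(5)
feat_weather = dense(weather, 0.1 * rng.normal(size=(5, 8)), np.zeros(8), 'relu')

# Feature fusion (concat) followed by the sigmoid output unit.
fused = np.concatenate([feat_load, feat_weather])
y_hat = dense(fused, 0.1 * rng.normal(size=(16, 1)), np.zeros(1), 'sigmoid')[0]
```

Because the targets are min-max scaled to [0, 1], the sigmoid output range matches the label range; predictions are inverse-transformed back to kW afterwards.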
3 Results and Discussion

3.1 Environment Configuration

All experimental procedures are run in the environment configuration shown in Table 2. The training set and the test set were divided in a ratio of 7:3, and 10% of the training set was taken as the validation set. The chronological order of the samples was not disturbed in

Table 2. Environment configuration list of the program.

Name of the software/hardware | Parameter/Version number
Processor                     | Intel(R) Core(TM) i5-10400M
Graphics card                 | GeForce GTX 1660 (6G)
Internal storage              | 8G
Operating system              | Windows 10
Deep learning framework       | Keras 2.2.5
Computing architecture        | CUDA 10.1
the training process. The processed electric load data and meteorological factor data were fed into the two branches of the model, respectively. The Adam optimizer was used to train the model with a learning rate of 0.0001.

3.2 Performance Comparison

The model proposed in this research is compared with existing methods on both the training set and the test set. The metrics used for comparison are the mean absolute error (MAE) [21], root mean square error (RMSE) [22], and R-squared (R2) [23]. Table 3 shows the results of the comparison.

Table 3. Comparison of the performance of each method.

Method     | Train: MAE | RMSE   | R2    | Test: MAE | RMSE   | R2
DTR        | 416.76     | 586.27 | 0.913 | 478.05    | 611.74 | 0.906
SVR        | 398.50     | 548.95 | 0.921 | 446.18    | 579.93 | 0.915
LSTM       | 365.08     | 485.83 | 0.941 | 373.35    | 487.66 | 0.938
GRU        | 343.84     | 456.53 | 0.947 | 359.64    | 475.84 | 0.943
CNN + LSTM | 280.23     | 379.76 | 0.964 | 301.74    | 412.82 | 0.955
CNN + GRU  | 240.71     | 363.03 | 0.967 | 285.16    | 410.61 | 0.958
Proposed   | 192.56     | 274.31 | 0.982 | 212.13    | 318.87 | 0.974
It can be seen from Table 3 that the method proposed in this research gives the best predictions: it has the smallest MAE and RMSE on both the training set and the test set. The LSTM-based method also performs well, but it requires more computation than the GRU, so the model built with the GRU is lighter. This research also examines the generalization ability of the proposed method. Figure 6 shows the change curve of each evaluation index over time. The distribution of the curves indicates that the proposed
Research on Double Input Electric Load Forecasting Model
Fig. 6. Generalization ability of the proposed method and the CNN+GRU method over the test set. Subgraph (a) shows the variation curve of MAE, subgraph (b) the variation curve of RMSE, subgraph (c) the variation curve of R-squared, and subgraph (d) the distribution of the error.
method has better generalization ability than the CNN+GRU model, which shows that the attention mechanism is effective: the main information is weighted so that it can be delivered efficiently. To examine the influence of the attention mechanism on the extraction of time-series information, this research outputs the tensor processed by the attention mechanism and uses autocorrelation analysis to observe the distribution of the feature vectors. Figure 7 shows the relationship between the autocorrelation of the feature vectors and the attention mechanism.
Fig. 7. Autocorrelogram of feature vectors. The left panel shows the model without the attention mechanism; the right panel shows the model with it.
The blue area in the figure represents the 95% confidence interval. Values near or outside the blue area indicate a certain degree of autocorrelation with statistical support. With the addition of the attention mechanism, the temporal structure of the feature vectors extracted by the convolutional layer becomes more pronounced. This indicates that the attention mechanism can effectively learn and transmit temporal feature information.
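The autocorrelation analysis above can be sketched in NumPy; the feature vector below is a hypothetical periodic signal standing in for the attention-layer output, and 1.96/sqrt(n) approximates the 95% confidence band drawn in Fig. 7:

```python
import numpy as np

def autocorr(x, max_lag):
    """Sample autocorrelation r_k for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.sum(x * x)
    return np.array([np.sum(x[:len(x) - k] * x[k:]) / denom
                     for k in range(max_lag + 1)])

# Hypothetical 1-D feature vector with a 24-step period plus noise
rng = np.random.default_rng(0)
t = np.arange(200)
feature = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(200)

r = autocorr(feature, max_lag=48)
ci = 1.96 / np.sqrt(len(feature))  # approximate 95% confidence band
significant = np.abs(r[1:]) > ci   # lags outside the band show autocorrelation
```

A strongly periodic feature produces large autocorrelation at its period (lag 24 here), which is the "more pronounced temporal structure" the paper observes after adding attention.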
4 Conclusion

The method proposed in this research analyzes electric load data together with meteorological factor data. It effectively extracts the time-series information of the electric load data and the characteristic information of the meteorological data. Using the fused data, the proposed model achieves higher accuracy than existing methods. Moreover, the model can accurately predict the future trend of the electric load from recent data, with a good degree of fit and good generalization ability. In conclusion, the method in this research can provide a strong basis for power departments to plan energy distribution, which has important practical significance.
References 1. Zhou, X., Gao, Y., Yao, W., et al.: A robust segmented mixed effect regression model for baseline electricity consumption forecasting. J. Mod. Power Syst. Clean Energy 10(1), 71–80 (2022) 2. Wang, T., Li, X.: Research on short-term electric load forecasting based on extreme learning machine. In: International Conference on Advances in Energy and Environmental Research (2018) 3. Eljazzar, M.M., Hemayed, E.E.: Impact of economic, social and meteorological factors on load forecasting in different timeframes-a survey. In: 2020 5th IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE). IEEE (2020) 4. Mousavi, N.S., Vaferi, B., Romero-Martinez, A.: Prediction of surface tension of various aqueous amine solutions using the UNIFAC model and artificial neural networks. Ind. Eng. Chem. Res. 28, 60 (2021) 5. Li, Y.Z.: An empirical study of knowledge discovery on daily electrical peak load using decision tree. Adv. Mater. Res. 433–440, 4898–4902 (2012) 6. Saleh, A.I., Rabie, A.H., Abo-Al-Ez, K.M.: A data mining based load forecasting strategy for smart electrical grids. Adv. Eng. Inform. 30(3), 422–448 (2016) 7. Vrablecová, P., et al.: Smart grid load forecasting using online support vector regression. Comput. Electr. Eng. 65, 102–117 (2018) 8. Pan, L., Feng, X., Sang, F., et al.: An improved back propagation neural network based on complexity decomposition technology and modified flower pollination optimization for short-term load forecasting. Neural Comput. Appl. 31(7), 2679–2697 (2017) 9. Sehovac, L., Grolinger, K.: Deep learning for load forecasting: sequence to sequence recurrent neural networks with attention. IEEE Access 8, 36411–36426 (2020) 10. Wu, K., Wu, J., Feng, L., et al.: An attention-based CNN-LSTM-BiLSTM model for shortterm electric load forecasting in integrated energy system. Int. Trans. Electr. Energy Syst. 1, 31 (2021) 11. 
Niu, D., Yu, M., Sun, L., et al.: Short-term multi-energy load forecasting for integrated energy systems based on CNN-BiGRU optimized by attention mechanism. Appl. Energy 313, 118801 (2022)
12. Gueta, T., Carmel, Y., et al.: Quantifying the value of user-level data cleaning for big data: a case study using mammal distribution models. Ecol. Inform. 34, 139–145 (2016) 13. Hong, X., Wang, J., Qiu, S.: Authenticating cherry tomato juices—discussion of different data standardization and fusion approaches based on electronic nose and tongue. Food Res. Int. 60(6), 173–179 (2014) 14. Jo, D.W., Kim, M.H.: Linked legal data construction and connection of LOD cloud. J. Korea Soc. Comput. Inf. 21(5), 11–18 (2016) 15. Wang, Y., Huang, S., Dai, J., et al.: A novel bearing fault diagnosis methodology based on SVD and one-dimensional convolutional neural network. Shock. Vib. 2020, 1–17 (2020) 16. Li, C., Yang, S.X., et al.: Hyperspectral remote sensing image classification based on maximum overlap pooling convolutional neural network. Sensors 18(10), 3587 (2018) 17. Sun, C., Zhang, Y., Huang, G., et al.: A soft sensor model based on long & short-term memory dual pathways convolutional gated recurrent unit network for predicting cement specific surface area. ISA Trans. 130, 293–305 (2022) 18. Zhang, S., Zhang, L., Gai, T., et al.: Aberration analysis and compensate method of a BP neural network and sparrow search algorithm in deep ultraviolet lithography. Appl. Opt. 61(20), 6023–6032 (2022) 19. Jiang, Y.D.: Classification of Alzheimer's disease via eight-layer convolutional neural network with batch normalization and dropout techniques. J. Med. Imaging Health Inf. 10(5), 1040–1048 (2020) 20. Ma, X., Wang, Q., Tong, X., et al.: A deep learning model for incorporating temporal information in haze removal. Remote Sens. Environ. 274, 113012 (2022) 21. Malallah, F.L., Shareef, B.T., Saeed, M.G., et al.: Contactless core-temperature monitoring by infrared thermal sensor using mean absolute error analysis. Recent Pat. Eng. 4, 15 (2021) 22.
Jobst, L.J., Heine, C., Auerswald, M., et al.: Effects of multivariate non-normality and missing data on the root mean square error of approximation. Struct. Equ. Model. Multi. J. 28(6), 851–858 (2021) 23. Irandoukht, A.: Optimum ridge regression parameter using R-squared of prediction as a criterion for regression analysis. J. Stat. Theor. Appl. 20(2) (2021)
Machine Learning
K-means Based Transfer Learning Algorithm

Yuanyuan Du1(B), Bo Li1, and Zhonghua Quan2

1 School of Computer Science and Technology, Wuhan University of Science and Technology,
Huangjiahu West Road. 2, Wuhan 430070, China [email protected], [email protected] 2 Deyang Vocational College of Technology and Trade, Department of Intelligent Engineering, Sanxingdui Avenue. 122, Sichuan 618300, China [email protected]
Abstract. Focused on the issue that most transfer learning methods ignore the intra-domain distribution structures of the target domain, an algorithm based on K-means (K-means Transfer Learning, KTL) is proposed to enhance the transfer learning performance of classification algorithms. In view of the poor clustering behavior of K-means on non-convex data sets, a multi-core version of KTL (Multi-core K-means Transfer Learning, MKTL) is further proposed, which clusters the data into more, smaller clusters to better fit non-convex data distributions. The experimental results show that MKTL achieves the best average accuracy on all 3 data sets. Compared with the original methods (kNN, TCA, GFK, JDA), MKTL improves performance by 2.5–12.8 percentage points while maintaining high computational efficiency. Keywords: Transfer Learning · Domain Adaptation · Clustering · K-means
1 Introduction

Although machine learning has achieved great success in the past decades [1, 2], traditional machine learning algorithms not only require a large amount of well-labeled training data, but also rely heavily on the assumption that the training and test data have the same or similar distributions, i.e., that they are independent and identically distributed (i.i.d.) [3]. However, in many real-world scenarios, manually labeling sufficient training data is expensive or even impractical [4], and it is also difficult to guarantee that the training data and the test data are i.i.d. Hence, we expect to take advantage of the information in well-labeled source data and transfer it to the target data for classification. Transfer learning emerges to meet this need. Transfer learning focuses on transferring effective information from a labeled source domain to a different but related unlabeled target domain [5, 6]. A number of recent efforts concentrate on transferring information through feature transformation [3–7]. This strategy mitigates the gap between the source domain and the target domain as much as possible, so that the classification information of the source domain can be applied directly to the target domain. However, it has the distinct shortcoming of ignoring the structure
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 179–190, 2023. https://doi.org/10.1007/978-981-99-4752-2_15
information of the target domain, which contributes greatly to cross-domain transfer learning [6, 8]. Although some works have noticed this issue and put forward solutions [6, 9], most of them pay attention mainly to domain-invariant features and show little interest in their structures. A direct way to address the neglect of target domain structures in transfer learning is to explore and exploit them, and clustering can be adopted for this. Clustering is a typical unsupervised method that automatically groups data into clusters according to the structures of the original data [10], realizing the structural utilization of data. Therefore, we employ clustering to approach the structures of the target data. After interpreting the structure information of the target domain and the classification information of the source domain, the two are fused for information transfer: the classifier trained with the source data classifies the target data clusters. In order to successfully exploit the structural information of the target data clusters, a label-sharing paradigm is also introduced: all data in a cluster share the same classification result. The main notations and their definitions are as follows: X: data set; y: label set corresponding to X; x: data sample; y: label of a data sample; s: source domain; t: target domain; C: core net; c: core; n_d: number of data samples; n_c: number of categories; m: number of feature dimensions; Q: data cluster; f(·): decision function/classifier. To sum up, the main contributions and novelty of this paper are three-fold: 1. We adopt K-means for clustering in transfer learning and accordingly put forward a K-means Transfer Learning (KTL) method. In the proposed KTL, the core net generated from the source data is taken as the initial mean vectors for the target data clustering, which transfers the category location information of the source domain to the target data clustering. 2.
In order to improve the poor clustering performance of K-means on non-convex data, we also propose a multi-core version of KTL, called Multi-core K-means Transfer Learning (MKTL). 3. Extensive experiments have been carried out to validate the performance of the proposed KTL and MKTL.
2 Related Works

Most works focus on transferring information by narrowing the distribution discrepancy between the source domain and the target domain. Pan et al. [11] proposed Transfer Component Analysis (TCA), which aims to minimize the marginal discrepancy between different domains by using Maximum Mean Discrepancy (MMD) [12] as the discrepancy metric. However, TCA only narrows the marginal distribution difference between the source domain and the target domain, ignoring the conditional distribution discrepancy between the two domains. On this basis, Long et al. [13] advanced Joint Distribution Adaptation (JDA), which uses sufficient statistics and pseudo-labels to reduce the difference in conditional distribution, combined with TCA to minimize the difference in marginal distribution at the same time. The disadvantage of JDA is that it treats the marginal distribution adaptation and the conditional distribution adaptation as equally important. Balanced Distribution Adaptation (BDA) [7] argues that the importance of the two is
different in different situations, and introduces a balance factor to dynamically adjust the importance of the two. The above methods are based on MMD; Manifold Criterion guided Transfer Learning (MCTL) [4] holds that MMD ignores the local structures of data and overcomes this shortcoming by satisfying the manifold criterion proposed in it. Besides MCTL, many works have applied manifold learning to narrow the differences between the two domains. GFK (Geodesic Flow Kernel) [14] regards the two domains as two points in a high-dimensional space, takes an appropriate number of intermediate points on the geodesic between them, and gradually transforms the source domain into the target domain through these intermediate points. Some recent works hold that the internal structures of data can effectively improve the performance of transfer learning. DTLC (Discriminative Transfer feature and Label Consistency) [9] argues that feature distortion is caused in the process of minimizing domain discrepancy through distance measurement formulas; such distortion may damage the internal category structures of the data, resulting in performance degradation. Its solution is to ensure feature alignment and feature classification simultaneously. GSL (Guide Subspace Learning) [15] argues that the reason many methods do not make good use of the internal structures of data is that they are all "one-stage", and proposes a "two-stage" algorithm consisting of a "teaching stage" and a "feedback stage", which achieves better performance through continuous iteration of the two stages. EasyTL (Easy Transfer Learning) [8] avoids negative transfer by using the intra-domain structures of data. Moreover, EasyTL can also improve the performance of existing transfer learning methods by using them as final classifiers.
The above algorithms all use the feature alignment strategy, which ignores the structural information of the target domain and may distort the internal structures of the target data in the process of feature alignment [16, 17]. This paper addresses this problem by using a clustering algorithm to explore and exploit the target domain structure information and then transferring it to other transfer learning algorithms.
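For context, the MMD metric that TCA and JDA minimize can be sketched in its simplest (linear-kernel) form, where it reduces to the squared distance between the empirical mean embeddings of the two domains; the sample data below are synthetic and purely illustrative:

```python
import numpy as np

def linear_mmd(Xs, Xt):
    """Squared MMD with a linear kernel: ||mean(Xs) - mean(Xt)||^2."""
    return float(np.sum((Xs.mean(axis=0) - Xt.mean(axis=0)) ** 2))

rng = np.random.default_rng(1)
Xs = rng.normal(0.0, 1.0, size=(100, 5))   # synthetic source samples
Xt = rng.normal(0.5, 1.0, size=(100, 5))   # mean-shifted target samples
print(linear_mmd(Xs, Xt))  # positive when the domain means differ
```

Feature-transformation methods search for a projection that drives this quantity toward zero; as the paper notes, doing so says nothing about the cluster structure inside the target domain.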
3 Algorithm

The algorithm consists of three parts: (1) using a classifier to interpret the classification information of the source domain; (2) using clustering to exploit the structure information of the target domain; (3) fusing the structure information of the target domain with the classification information of the source domain for transfer learning, in other words, classifying the target data clusters with the classifier trained on the source data.

3.1 Problems

K-means is an unsupervised machine learning method that groups highly similar data into clusters by exploring and exploiting the distribution structures of the original data. However, K-means has the following disadvantages: 1. K-means and other clustering algorithms cannot assign class labels to the clustered data.
2. K-means randomly selects samples as the initial mean vectors, and it is sensitive to this selection (Fig. 1). 3. K-means performs poorly on data with non-convex structures (Fig. 3(c)). The following content solves these problems respectively.

3.2 Solution to Label Classification of Clustering Algorithms

This section solves the problem that clustering algorithms cannot label data. The solution of this paper is to use the label classification information contained in the source data to label the clustering results of the target data. Target data clusters can be classified by the following strategies:

Mass: Mass represents a data cluster by a selection of samples from it. The classification of Mass is: use the classifier (trained with the source data) to classify the selected samples of all target data clusters. The probability vector of a data cluster can then be obtained in one of the following ways: voting with the labels of its selected samples, or calculating the mean of the probability vectors of these selected samples.

Agent: Instead of using the samples in a data cluster to stand for the cluster, Agent uses a feature vector generated from the samples in the cluster to represent it. The classification of Agent is: use the classifier to classify the generated vectors of the target data clusters and obtain the corresponding probability vectors. In this paper, for Agent, we take the mean vector of a data cluster as its generated vector. As for Mass, we take all the data in a cluster as its selected samples, and we take the mean of their probability vectors as its classification.

3.3 Solution to Random Initialization of K-means

This paper's solution to the random initialization of K-means is to provide a set of relatively good initial mean vectors for the target data clustering according to the information in the source data. These vectors are calculated as:

C_s = {c_s^{(1)}, c_s^{(2)}, ..., c_s^{(n_c)}},  c_s^{(i)} = (1 / n_s^{(i)}) \sum_{j=1}^{n_s^{(i)}} x_{s_j}^{(i)}    (1)

where X_s^{(i)} = {x_{s_1}^{(i)}, x_{s_2}^{(i)}, ..., x_{s_{n_s^{(i)}}}^{(i)}} denotes all the source samples with label i, and n_s^{(i)} is the number of source samples with label i. These vectors are named cores in this paper, and the core net composed of all cores is used as the initial mean vectors for the target data clustering. Compared to randomly selected samples, cores are more likely to be close to the centers of the data clusters that we expect to obtain. Figure 1 shows an example: the randomly selected initial mean vectors provide the wrong category location information (left and right), while the core net offers comparatively correct category location information (up and down). Therefore, the core net is more likely to perform better. After clustering the target data and obtaining the label classification information of the source domain, part 1 and part 2 of the algorithm model are complete. Then the Mass or Agent strategy fuses the target structure information with the source label classification information. These are the processes of KTL, and its illustration is shown in Fig. 2.
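Eq. (1) can be sketched directly in NumPy: one core per class, computed as the mean of the source samples carrying that label. The toy data below are illustrative only:

```python
import numpy as np

def core_net(Xs, ys):
    """Eq. (1): one core per class = mean of the source samples with that label."""
    classes = np.unique(ys)
    return np.vstack([Xs[ys == c].mean(axis=0) for c in classes])

# Toy source data with two labels (hypothetical values)
Xs = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [6.0, 6.0]])
ys = np.array([0, 0, 1, 1])
C = core_net(Xs, ys)  # class means: (0.5, 0.5) and (5.0, 5.0)
```

The rows of `C` are the cores; stacked together they form the core net used to initialize K-means on the target data.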
Fig. 1. Clustering results with and without core net. (a) Distribution of the original data. (b) Randomly select samples from the target data as the initial mean vectors. (c) The clustering result for the target data with the initial mean vectors in (b). (d) Calculate the mean vectors of the source data clusters as the core net. (e) The clustering result by clustering the target data with the core net in (d).
Fig. 2. Illustration of KTL. (a) Cluster the source data to get the core net; (b) with the core net obtained in (a), cluster the target data and obtain the target data clusters; (c) classify the target data clusters with the Agent or Mass strategy.
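The three steps of Fig. 2 can be sketched end to end. This is a minimal NumPy illustration under simplifying assumptions: the classifier here is a nearest-class-mean rule rather than the kNN/GFK/JDA classifiers used in the paper, the K-means loop is a plain fixed-iteration version, and the toy blobs are invented data:

```python
import numpy as np

def kmeans(X, init, n_iter=50):
    """Plain K-means with the given initial mean vectors (the core net)."""
    centers = init.copy()
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

def ktl_agent(Xs, ys, Xt):
    """KTL with the Agent strategy: classify each cluster's mean vector,
    then share that label with every sample in the cluster."""
    classes = np.unique(ys)
    cores = np.vstack([Xs[ys == c].mean(axis=0) for c in classes])  # core net
    clusters, centers = kmeans(Xt, init=cores)
    # Agent: a (nearest-class-mean) classifier labels each cluster center
    cluster_labels = classes[np.argmin(
        ((centers[:, None, :] - cores[None]) ** 2).sum(-1), axis=1)]
    return cluster_labels[clusters]  # label sharing inside each cluster

# Toy example: two source classes and a slightly shifted target domain
rng = np.random.default_rng(0)
Xs = np.vstack([rng.normal(0, .3, (50, 2)), rng.normal(3, .3, (50, 2))])
ys = np.array([0] * 50 + [1] * 50)
Xt = np.vstack([rng.normal(0.5, .3, (50, 2)), rng.normal(3.5, .3, (50, 2))])
yt_pred = ktl_agent(Xs, ys, Xt)
```

Because every sample in a cluster inherits the cluster's label, the target domain's own structure, not per-sample distances, drives the final classification.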
3.4 Solution to Poor Performance of K-means on Non-convex Data

The reason K-means is not suitable for clustering non-convex structures is that the data clusters obtained by K-means tend to be circular (Fig. 3(c)). The solution of this paper is to cluster the data into more small clusters so as to approach the non-convex structures as closely as possible [18]. The number of target data clusters can be increased by increasing the number of cores (Fig. 3(e)). Thus, the process of obtaining the core net is: 1. Randomly select n^{(i)} samples from the i-th category of source data (n^{(i)} is set manually; non-convex distribution structures can be well approached by setting a proper n^{(i)}; this paper sets the number of cores for all labels to the same number n, i.e., n^{(1)} = n^{(2)} = ... = n^{(n_c)} = n). Denote these selected source samples as O_s^{(i)} = {o_{s_1}^{(i)}, o_{s_2}^{(i)}, ..., o_{s_{n^{(i)}}}^{(i)}}.
Fig. 3. Illustration of MKTL. (a) Distribution of non-convex data. (b) Each category of the source data generates one core. (c) The clustering result obtained by clustering the target data with the core net in (b). (d) Each category of the source data generates four cores. (e) The clustering result obtained by clustering the target data with the core net in (d).

2. Perform K-means on the i-th category of source data with O_s^{(i)} as the initial mean vectors. After clustering, obtain the data clusters Q_s^{(i)} = {Q_{s_1}^{(i)}, Q_{s_2}^{(i)}, ..., Q_{s_{n^{(i)}}}^{(i)}} for the i-th category of source data. 3. Calculate the mean vectors C_s^{(i)} = {c_{s_1}^{(i)}, c_{s_2}^{(i)}, ..., c_{s_{n^{(i)}}}^{(i)}} of the corresponding source data clusters as:

c_{s_j}^{(i)} = (1 / n_j^{(i)}) \sum_{x_s \in Q_{s_j}^{(i)}} x_s,  j = 1, 2, ..., n^{(i)}    (2)

where n_j^{(i)} denotes the number of data samples in the j-th data cluster of Q_s^{(i)}. The above three steps are carried out for all categories of the source data, yielding the mean vectors of all source data clusters C_s = {C_s^{(1)} ∪ C_s^{(2)} ∪ ... ∪ C_s^{(n_c)}}. These vectors are the cores, which form the core net. Except for the generation of the core net, all other processes of MKTL are the same as in KTL: target data clustering, classification and information fusion. KTL is just a special case of MKTL: when the number of cores for each category in MKTL is 1 (n^{(1)} = n^{(2)} = ... = n^{(n_c)} = 1), MKTL becomes KTL. MKTL/KTL is summarized in Algorithm 1.
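The three steps above (select n^{(i)} seeds per class, run per-class K-means, take the cluster means of Eq. (2) as cores) can be sketched as follows; the toy data are illustrative and the fixed-iteration K-means is a simplification:

```python
import numpy as np

def multi_core_net(Xs, ys, n_cores, n_iter=20, seed=0):
    """Per-class K-means: each class contributes n_cores cores, so
    non-convex class shapes are approximated by several small clusters."""
    rng = np.random.default_rng(seed)
    cores = []
    for c in np.unique(ys):
        Xc = Xs[ys == c]
        centers = Xc[rng.choice(len(Xc), n_cores, replace=False)]  # step 1
        for _ in range(n_iter):                                    # step 2
            lab = np.argmin(((Xc[:, None] - centers[None]) ** 2).sum(-1), axis=1)
            for k in range(n_cores):
                if np.any(lab == k):
                    centers[k] = Xc[lab == k].mean(axis=0)          # step 3, Eq. (2)
        cores.append(centers)
    return np.vstack(cores)

# Toy source data; with n_cores=1 this reduces to KTL's single core per class,
# i.e. the class means (0.5, 0.5) and (5.0, 5.0)
Xs = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [6.0, 6.0]])
ys = np.array([0, 0, 1, 1])
cores = multi_core_net(Xs, ys, n_cores=1)
```

With `n_cores > 1` the returned core net has `n_cores` rows per class, matching the multi-core construction in Fig. 3(d).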
4 Experiments

4.1 Data Sets

The experiments use three popular transfer learning benchmark data sets: Image-CLEF DA [19], Office+Caltech10 [14], and ImageNet+VOC2007 [20]. For a fair comparison with the state-of-the-art methods [21], this paper uses the deep features commonly used in existing works: ImageNet+VOC2007 and Office+Caltech10 use the DeCAF6 feature (the activation of the sixth fully connected layer of AlexNet [22] trained on ImageNet [23]), and Image-CLEF DA uses features extracted by ResNet50 [24]. The statistics of the three data sets are listed in Table 1. All experiments are conducted on a computer configured with an Intel i7-8550U CPU, 1.80 GHz, 16 GB RAM. The programming language is MATLAB.

Table 1. Statistics descriptions of the used data sets.

Data set         | #Feature | #Class | Domain(#Samples)
Image-CLEF DA    | 2048     | 12     | C(600), I(600), P(600)
Office+Caltech10 | 4096     | 10     | A(958), W(295), D(157), C(1123)
ImageNet+VOC2007 | 4096     | 5      | I(7341), V(3376)
4.2 Results

To verify the effectiveness of the proposed algorithm, this experiment compares it with several standard methods and the state-of-the-art. The compared methods include: kNN, TCA (Transfer Component Analysis) [11], GFK (Geodesic Flow Kernel) [14], JDA (Joint Distribution Adaptation) [13], BDA (Balanced Distribution Adaptation) [7], MCTL (Manifold Criterion guided Transfer Learning) [4], EasyTL (Easy Transfer Learning) [8], GSL (Guide Subspace Learning) [15], DTLC (Discriminative Transfer feature and Label Consistency) [9], and CAMLP (Classifier Adaptation-based Modified Label Propagation) [25]. Except for CAMLP, the results of the compared methods are obtained by reproducing the original papers as faithfully as possible.

Table 2. Accuracy (%) on the Image-CLEF DA data set.

Tasks | kNN  | GFK  | JDA  | BDA  | MCTL | EasyTL | GSL  | CAMLP | MKTL(kNN) | MKTL(GFK) | MKTL(JDA)
I → C | 93.2 | 90.7 | 93.3 | 94.2 | 93.8 | 96.0   | 93.3 | 94.7  | 96.7      | 96.7      | 96.7
P → C | 90.0 | 82.5 | 84.7 | 83.5 | 92.7 | 95.0   | 91.8 | 94.3  | 94.7      | 94.7      | 94.7
C → I | 87.5 | 85.5 | 90.7 | 90.3 | 89.0 | 91.5   | 88.3 | 93.7  | 91.7      | 91.7      | 91.7
P → I | 86.8 | 78.2 | 80.3 | 78.8 | 90.5 | 90.3   | 88.2 | 92.3  | 91.5      | 91.5      | 91.5
C → P | 75.8 | 73.0 | 76.7 | 76.0 | 76.5 | 77.7   | 76.5 | 76.4  | 77.2      | 77.2      | 77.2
I → P | 78.2 | 75.3 | 78.0 | 75.7 | 78.7 | 78.7   | 78.0 | 77.7  | 78.0      | 77.8      | 77.5
Avg   | 85.3 | 80.9 | 83.9 | 83.1 | 86.9 | 88.2   | 86.0 | 88.2  | 88.3      | 88.3      | 88.2
The experiment results on Image-CLEF DA are shown in Table 2: MKTL significantly improves the average accuracy of the underlying classification algorithms by 3.0 (kNN), 7.2 (GFK) and 4.3 (JDA) percentage points, respectively. Although MKTL achieves the best performance in just 1 task, MKTL (kNN and GFK) achieves the highest average accuracy. The experiment results on Office+Caltech10 are shown in Table 3: MKTL also significantly improves the average accuracy of the underlying classification algorithms, by 6.6 (kNN), 8.5 (GFK) and 2.5 (JDA) percentage points, respectively, and achieves the best performance in 8 out of 12 tasks. More importantly, all three MKTLs achieve the highest average performance (90.9%), 1.1 percentage points higher than the second best (89.8%, CAMLP). As shown in Table 4, MKTL significantly improves the average accuracy of the underlying classification algorithms by 8.1 (kNN), 7.4 (TCA), 12.8 (GFK) and 2.5 (JDA) percentage points, respectively, and MKTL (kNN) achieves the highest average accuracy. MKTL can significantly improve the performance of methods that behave weakly on the transfer learning data sets and achieve higher classification accuracy, which verifies that MKTL can enhance the transfer learning performance of the underlying classification algorithms. Interestingly, with the help of MKTL, algorithms with different

Table 3. Accuracy (%) on the Office+Caltech10 data set.

Tasks | kNN  | GFK   | JDA   | BDA   | EasyTL | GSL   | CAMLP | MKTL(kNN) | MKTL(GFK) | MKTL(JDA)
A → C | 84.2 | 79.7  | 81.7  | 82.4  | 81.7   | 85.2  | 89.5  | 85.5      | 85.0      | 85.7
W → C | 74.0 | 73.6  | 82.8  | 82.4  | 67.3   | 72.1  | 87.8  | 83.4      | 83.4      | 83.4
D → C | 77.7 | 77.6  | 86.2  | 83.7  | 74.1   | 79.4  | 87.4  | 85.0      | 85.0      | 85.0
C → A | 90.8 | 89.4  | 89.7  | 88.6  | 90.5   | 91.2  | 92.2  | 92.7      | 92.7      | 92.7
W → A | 77.9 | 77.5  | 90.2  | 90.6  | 74.5   | 75.3  | 92.4  | 91.6      | 91.6      | 92.8
D → A | 84.8 | 83.7  | 92.4  | 90.5  | 83.4   | 87.8  | 92.0  | 93.0      | 93.1      | 92.9
C → W | 79.0 | 75.6  | 85.1  | 88.1  | 75.6   | 86.8  | 88.3  | 89.8      | 89.2      | 89.5
A → W | 75.3 | 68.1  | 80.7  | 85.1  | 72.9   | 84.8  | 85.1  | 86.8      | 86.4      | 89.5
D → W | 99.0 | 99.3  | 99.3  | 99.3  | 93.2   | 99.3  | 91.8  | 99.7      | 99.7      | 99.7
C → D | 87.9 | 87.3  | 86.6  | 89.8  | 81.5   | 92.4  | 93.0  | 93.0      | 94.9      | 89.8
A → D | 81.5 | 77.1  | 86.0  | 84.7  | 84.1   | 90.5  | 84.2  | 89.8      | 89.8      | 89.8
W → D | 99.4 | 100.0 | 100.0 | 100.0 | 96.2   | 100.0 | 100.0 | 100.0     | 100.0     | 100.0
Avg   | 84.3 | 82.4  | 88.4  | 88.8  | 81.2   | 87.1  | 89.8  | 90.9      | 90.9      | 90.9
performance achieve almost the same performance, especially on the Image-CLEF DA and Office+Caltech10 data sets.

Table 4. Accuracy (%) on the ImageNet+VOC2007 data set.

Tasks | kNN  | TCA  | GFK  | JDA  | EasyTL | DTLC | MKTL(kNN) | MKTL(TCA) | MKTL(GFK) | MKTL(JDA)
I → V | 64.4 | 64.4 | 59.7 | 64.5 | 61.0   | 64.8 | 70.7      | 69.9      | 68.5      | 67.7
V → I | 74.7 | 75.3 | 67.2 | 75.3 | 75.1   | 85.8 | 84.5      | 84.5      | 84.5      | 84.5
Avg   | 69.5 | 69.8 | 63.4 | 69.9 | 68.1   | 75.3 | 77.6      | 77.2      | 76.5      | 76.1
4.3 Computational Efficiency

To evaluate the computational efficiency of the proposed MKTL, this experiment compares MKTL's average running time on the above three data sets with that of several methods. For the iterative methods, the results list the time required for a single iteration. Table 5 shows the running times: MKTL (kNN) has the shortest average running time on all three data sets, and the average running time of MKTL (3.0 s) is less than one seventh of that of EasyTL (22.1 s, ranking second). The experiment results thus verify the high computational efficiency of MKTL.

Table 5. Average running time (seconds) on all data sets (* denotes iterative methods).

Data Set         | TCA   | GFK   | JDA*  | BDA*  | EasyTL | GSL*  | MKTL(kNN)
Image-CLEF DA    | 0.9   | 2.2   | 0.9   | 1.3   | 3.3    | 10.0  | 0.2
Office+Caltech10 | 1.2   | 12.7  | 1.1   | 1.8   | 27.6   | 20.1  | 0.7
ImageNet+VOC2007 | 164.6 | 122.1 | 159.8 | 218.0 | 35.3   | 489.2 | 8.0
Avg              | 55.6  | 45.7  | 53.9  | 73.7  | 22.1   | 173.1 | 3.0
4.4 Ablation Analysis on Core Net

This experiment conducts an ablation analysis to evaluate the contribution of the proposed core net. The experiment includes three groups: group Core Net, group Random and group K-means++, where the latter two are used for comparison. In group Core Net (MKTL), the core net generated from the source data is used to cluster the target data, with Agent as the information fusion strategy and kNN as the classifier. Group Random randomly selects target samples as the initial mean vectors for the target data clustering. Group K-means++ adopts K-means++ [44] for the target data clustering. All results are the average accuracy over all tasks on the Image-CLEF DA data set, and each task in every group is run ten times with the mean taken as the final result.
Fig. 4. Results of ablation analysis on core net (Image-CLEF). The two figures show the average accuracy for all tasks on Image-CLEF DA data set (left) and the running time of target data clustering (right).
As shown in Fig. 4, when the parameter n is small, the core net not only improves the convergence speed of K-means but also significantly improves its clustering performance. When n is large, the core net degrades both the convergence speed and the clustering performance of K-means. A possible reason is that with more cores, the distribution of the core net becomes more consistent with the source domain than with the target domain; since there is a distribution discrepancy between the two domains, this increases the difficulty of clustering convergence.
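The sensitivity to initialization that motivates the core net can be illustrated on synthetic data: with seeds near the true class locations (a core-net-like initialization), a single K-means step already reaches a low within-cluster inertia, whereas a poor random initialization with both seeds inside one cluster needs more iterations. All values below are invented for illustration:

```python
import numpy as np

def kmeans_inertia(X, init, n_iter):
    """Run n_iter K-means steps from init and return the final inertia."""
    centers = init.astype(float).copy()
    for _ in range(n_iter):
        lab = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(len(centers)):
            if np.any(lab == k):
                centers[k] = X[lab == k].mean(axis=0)
    lab = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    return float(((X - centers[lab]) ** 2).sum())

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, .5, (100, 2)), rng.normal(5, .5, (100, 2))])

good = np.array([[0.0, 0.0], [5.0, 5.0]])   # core-net-like initialization
bad = np.array([[4.8, 4.8], [5.2, 5.2]])    # both seeds inside one cluster
print(kmeans_inertia(X, good, 1), kmeans_inertia(X, bad, 1))
```

After one iteration the good initialization already separates the two blobs, while the bad one leaves a much larger inertia, mirroring the convergence-speed gap in Fig. 4.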
5 Conclusion

This paper proposes a novel K-means transfer learning approach that incorporates K-means and a classification algorithm into an information fusion model. The proposed approach outperforms other standard methods and the state-of-the-art on three benchmark data sets. The running times verify the computational efficiency of MKTL, and the ablation study demonstrates the effectiveness of the proposed core net: it can improve both the convergence speed and the performance of K-means.
References 1. Wick, C.: Deep Learning. Informatik-Spektrum 40(1), 103–107 (2016). https://doi.org/10.1007/s00287-016-1013-2 2. Gopalan, R., Li, R., Chellappa, R.: Domain adaptation for object recognition: an unsupervised approach. In: 2011 International Conference on Computer Vision, pp. 999–1006. IEEE Computer Society, Barcelona (2011) 3. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2010) 4. Zhang, L., Wang, S., Huang, G., Zuo, W., Yang, J., Zhang, D.: Manifold criterion guided transfer learning via intermediate domain generation. IEEE Trans. Neural Netw. Learn. Syst. 30, 3759–3773 (2019) 5. Zhuang, F., et al.: A comprehensive survey on transfer learning. Proc. IEEE 109, 43–76 (2019) 6. Liang, J., He, R., Sun, Z., Tan, T.: Aggregating randomized clustering-promoting invariant projections for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 41, 1027–1042 (2019) 7. Wang, J., Chen, Y., Hao, S., Feng, W., Shen, Z.: Balanced distribution adaptation for transfer learning. In: 2017 IEEE International Conference on Data Mining (ICDM), pp. 1129–1134. IEEE Computer Society, New Orleans (2017) 8. Wang, J., Chen, Y., Yu, H., Huang, M., Yang, Q.: Easy transfer learning by exploiting intra-domain structures. In: 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 1210–1215. IEEE, Shanghai (2019) 9. Li, S., et al.: Discriminative transfer feature and label consistency for cross-domain image classification. IEEE Trans. Neural Netw. Learn. Syst. 31, 4842–4856 (2020) 10. Xu, D., Tian, Y.: A comprehensive survey of clustering algorithms. Ann. Data Sci. 2(2), 165–193 (2015). https://doi.org/10.1007/s40745-015-0040-1 11. Pan, S.J., Tsang, I.W., Kwok, J.T., Yang, Q.: Domain adaptation via transfer component analysis. IEEE Trans. Neural Netw. 22, 199–210 (2009) 12.
Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H., Schölkopf, B., Smola, A.: Integrating structured biological data by Kernel maximum mean discrepancy. Bioinformatics 22, e49-57 (2006) 13. Long, M., Wang, J., Ding, G., Sun, J., Yu, P.S.: Transfer feature learning with joint distribution adaptation. In: 2013 IEEE International Conference on Computer Vision, pp. 2200–2207. IEEE Computer Society, Sydney (2013) 14. Gong, B., Shi, Y., Sha, F., Grauman, K.: Geodesic flow kernel for unsupervised domain adaptation. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2066– 2073. IEEE Computer Society, Providence (2012) 15. Zhang, L., Fu, J., Wang, S., Zhang, D., Dong, Z.Y., Philip Chen, C.L.: Guide subspace learning for unsupervised domain adaptation. IEEE Trans. Neural Netw. Learn. Syst. 31, 3374–3388 (2020)
190
Y. Du et al.
16. Tang, H., Chen, K., Jia, K.: Unsupervised domain adaptation via structurally regularized deep clustering. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8722–8732. Computer Vision Foundation/IEEE, Seattle (2020) 17. Meng, M., Wu, Z., Liang, T., Yu, J., Wu, J.: Exploring fine-grained cluster structure knowledge for unsupervised domain adaptation. IEEE Trans. Circuits Syst. Video Technol. 32, 5481–5494 (2022) 18. Chaudhuri, D., Chaudhuri, B.B.: A novel multiseed nonhierarchical data clustering technique. In: IEEE Transactions on Systems, Man, and Cybernetics. Part B, Cybernetics: A Publication of the IEEE Systems, Man, and Cybernetics Society. vol. 27, pp. 871–876 (1997) 19. Caputo, B., et al.: ImageCLEF 2014: overview and analysis of the results. In: Kanoulas, E., et al. (eds.) CLEF 2014. LNCS, vol. 8685, pp. 192–211. Springer, Cham (2014). https://doi. org/10.1007/978-3-319-11382-1_18 20. Fang, C., Xu, Y., Rockmore, D.N. Unbiased metric learning: on the utilization of multiple datasets and web images for softening bias. In: 2013 IEEE International Conference on Computer Vision, pp. 1657–1664. IEEE Computer Society, Sydney (2013) 21. Donahue, J., et al.: DeCAF: a deep convolutional activation feature for generic visual recognition. In: Proceedings of the 31th International Conference on Machine Learning (ICML), pp. 647–655. JMLR.org, Beijing (2013) 22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2012) 23. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE Computer Society, Miami (2009) 24. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. 
IEEE Computer Society, Las Vegas (2015) 25. Du, Y., et al.: Classifier Adaptation Based on Modified Label Propagation for Unsupervised Domain Adaptation. Wirel. Commun. Mob. Comput. 2022, 2963195 (2022) 26. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 1027--1035. SIAM, New Orleans (2007)
HSIC Induced LncRNA Feature Selection

Anjie Guo(B) and Bo Li

School of Computer Science and Technology, Wuhan University of Science and Technology, Huangjiahu West Road 2, Wuhan 430070, China
[email protected], [email protected]
Abstract. It has been confirmed that mutations of long non-coding RNAs (lncRNAs) are closely related to the development of cancer in humans, so cancer development can be predicted by analyzing lncRNAs. However, lncRNA data are characterized by a limited number of samples and a large number of gene expression features, among which there is much redundancy. This makes cancer prediction difficult. To address this problem, this paper proposes a feature selection method for lncRNA based on the Hilbert-Schmidt independence criterion (HSIC). The method quantifies the correlation between two random variables in order to select a subset of lncRNA features that yields good prediction accuracy for cancer. Experiments on multiple gene datasets show that the proposed method is effective.

Keywords: lncRNA · feature selection · HSIC
1 Introduction

Relevant research has shown that proteins are the primary biomarkers currently used in clinical practice for cancer diagnosis and treatment [1]. However, given that only 2% of the human genome encodes proteins, modern genetic research has shifted its focus to the non-coding genome, which is primarily composed of long non-coding RNAs (lncRNAs); these make up 80% of the non-coding genome. Experimental evidence has demonstrated that lncRNAs play crucial regulatory roles in eukaryotic cells [2, 3], and they are also heavily involved in the occurrence and progression of cancer [4–6]. Therefore, the abnormal overexpression of lncRNAs may serve as an important indicator for predicting lymph node metastasis in cancer. A common challenge in lncRNA datasets is the large number of features and small sample size, which can lead to overfitting, low prediction accuracy, and the curse of dimensionality [7]. Feature selection can therefore improve the accuracy and efficiency of predicting lymph node metastasis. Traditional methods use mutual information to measure the correlation between feature vectors [8–10], but in some challenging problems probabilities and probability densities cannot be estimated effectively. In contrast, this study uses HSIC to compute and optimize a correlation indicator between feature vectors, with the aim of achieving better feature selection performance.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 191–200, 2023. https://doi.org/10.1007/978-981-99-4752-2_16
2 HSIC
Let $L(\mathcal{X}) = \{ f \mid f : \mathcal{X} \to \mathbb{R}, \int |f(x)|^2 \, dx < +\infty \}$ be a square-integrable function space, on which an inner product can be defined such that $\mathcal{H} = (L(\mathcal{X}), \langle \cdot, \cdot \rangle)$ is a complete inner product space, that is, a Hilbert space. The inner product is defined as

$$\langle f, g \rangle = \int f(x) g(x) \, dx. \quad (1)$$
Let $\mathcal{H} = (L(\mathcal{X}), \langle \cdot, \cdot \rangle)$ be a Hilbert space. If a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfies the following two conditions:
• $\forall x \in \mathcal{X}$, $k_x = k(\cdot, x) \in \mathcal{H}$;
• $\forall x \in \mathcal{X}, \forall f \in \mathcal{H}$, $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$,
then $\mathcal{H}$ is called a Reproducing Kernel Hilbert Space (RKHS) and $k$ is called the reproducing kernel of $\mathcal{H}$. $\mathcal{H}$ must be separable and possess a complete orthonormal system. If $\mathcal{H}$ is an RKHS and $k$ is its reproducing kernel, then for $\forall x \in \mathcal{X}$, $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$, a feature map $\phi : \mathcal{X} \to \mathcal{H}$ can be defined as

$$\phi(x) = k(\cdot, x) = k_x \in \mathcal{H}. \quad (2)$$
Thus, using the reproducing property, the inner product of two elements of $\mathcal{H}$ can be expressed as

$$\langle \phi(x), \phi(x') \rangle = \langle k_x, k(\cdot, x') \rangle = k_{x'}(x) = k(x, x'). \quad (3)$$

An RKHS can be generated by a kernel function. Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$; if $k$ satisfies the following conditions:
• $\forall x, y \in \mathcal{X}$, $k(y, x) = k(x, y)$;
• $\forall x \in \mathcal{X}$, $k_x = k(\cdot, x)$ is square-integrable;
• $\forall x \in \mathcal{X}$, $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$, the associated Gram matrix

$$G = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{bmatrix} \quad (4)$$

is positive definite,
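To make the kernel conditions above concrete, the following small sketch (not from the paper) builds the Gram matrix of Eq. (4) for a Gaussian kernel; symmetry and positive semi-definiteness can then be checked numerically.

```python
import numpy as np

def gram_matrix(xs, k):
    """Gram matrix G with G[i, j] = k(x_i, x_j), as in Eq. (4)."""
    n = len(xs)
    return np.array([[k(xs[i], xs[j]) for j in range(n)] for i in range(n)])

# A valid kernel must yield a symmetric positive (semi-)definite Gram matrix;
# e.g. for a Gaussian kernel with unit bandwidth:
k = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)
```

For any finite point set, `gram_matrix(xs, k)` should be symmetric with non-negative eigenvalues; this is a quick sanity check when designing new kernels.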
then $k$ is called a kernel function. Similarly, let $\mathcal{G}$ be a second separable Hilbert space on the separable space $\mathcal{Y}$, with kernel $l(\cdot, \cdot)$ and feature mapping $\psi : \mathcal{Y} \to \mathcal{G}$. Let $C : \mathcal{H} \to \mathcal{G}$ be a linear operator. If the sum converges, the Hilbert-Schmidt (HS) norm of $C$ is defined as

$$\|C\|_{HS}^2 := \sum_{i,j} \langle C v_i, u_j \rangle_{\mathcal{G}}^2, \quad (5)$$

where $\{v_i\}$ and $\{u_j\}$ are orthonormal bases of $\mathcal{H}$ and $\mathcal{G}$, respectively.
Let $f \in \mathcal{H}$, $g \in \mathcal{G}$; then the tensor product operator $f \otimes g : \mathcal{G} \to \mathcal{H}$ can be defined as

$$(f \otimes g)t := f \langle g, t \rangle_{\mathcal{G}}, \quad \forall t \in \mathcal{G}. \quad (6)$$

In addition, according to the definition of the HS norm, the HS norm of $f \otimes g$ can be calculated by

$$\|f \otimes g\|_{HS}^2 = \langle f \otimes g, f \otimes g \rangle_{HS} = \langle f, f \rangle_{\mathcal{H}} \, \langle g, g \rangle_{\mathcal{G}}. \quad (7)$$
2.1 The Cross-Covariance Operator

Let $X$ be a random vector on $\mathcal{X}$ and $Y$ a random vector on $\mathcal{Y}$. The cross-covariance operator between $X$ and $Y$ is defined as follows. Let $\Phi : HS(\mathcal{H} \to \mathcal{G}) \to \mathbb{R}$ be given, for $\forall C \in HS(\mathcal{H} \to \mathcal{G})$, by

$$\Phi(C) = E_{xy} \langle \phi(x) \otimes \psi(y), C \rangle_{HS}. \quad (8)$$

If $E_{xy} \|\phi(x) \otimes \psi(y)\|_{HS}$ is finite, then $\Phi$ is a continuous linear functional on $HS(\mathcal{H} \to \mathcal{G})$. For $\forall x \in \mathcal{X}, y \in \mathcal{Y}$ we have $\phi(x) \in \mathcal{H}$ and $\psi(y) \in \mathcal{G}$, so by Eq. (6) the tensor product $\phi(x) \otimes \psi(y)$ lies in $HS(\mathcal{H} \to \mathcal{G})$, and $\langle \phi(x) \otimes \psi(y), C \rangle_{HS}$ is a scalar. When $x$ and $y$ are random vectors, $\langle \phi(x) \otimes \psi(y), C \rangle_{HS}$ becomes a function of the random vectors $x$ and $y$, i.e., a random variable, while $E_{xy} \langle \phi(x) \otimes \psi(y), C \rangle_{HS}$ is a numerical value expressing the mathematical expectation of that random variable.
Fig. 1. Relationship diagram of the cross-covariance operator $C_{xy}$ and the mathematical expectations $\mu_x$ and $\mu_y$.
From the Riesz representation theorem, we know that there is a unique element $\mu_x \in \mathcal{H}$ which satisfies

$$E_x \langle \phi(x), f \rangle = \langle f, \mu_x \rangle_{\mathcal{H}}, \quad (9)$$

where $\mu_x$ is called the mathematical expectation of $x$ in $\mathcal{H}$. Similarly, the mathematical expectation of $y$ in $\mathcal{G}$ is denoted $\mu_y$. The relationship between the cross-covariance operator $C_{xy}$ and the mean functions $\mu_x$ and $\mu_y$ is shown in Fig. 1.
2.2 Definition of HSIC

Let the standard orthonormal bases of the reproducing kernel Hilbert spaces $\mathcal{H}$ and $\mathcal{G}$ be $\{v_i\}$ and $\{u_j\}$ respectively; then HSIC can be defined as the squared HS-norm of the covariance operator $C_{xy}$:

$$\mathrm{HSIC}(x, y) = \|C_{xy}\|_{HS}^2 := \sum_{i,j} \langle v_i, C_{xy} u_j \rangle_{\mathcal{H}}^2 = \sum_{i,j} \mathrm{Cov}^2\!\left(v_i(x), u_j(y)\right). \quad (10)$$

$\mathrm{HSIC}(x, y)$ in the above formula equals zero if and only if $\mathrm{Cov}(H(x), G(y)) = 0$ for all $(H, G) \in \mathcal{H} \times \mathcal{G}$. Jacod et al. proved that, for all bounded continuous functionals $H$ and $G$, $\mathrm{Cov}(H(x), G(y)) = 0$ holds if and only if $x$ and $y$ are independent [11]. It follows that HSIC provides a statistical estimate of the HS operator on the RKHS, and HSIC equal to zero represents independence between the two variables. With two random vectors $x$ and $y$, the above expression can be rewritten as

$$\mathrm{HSIC}(x, y) = \left\| E_{xy} \left[ (\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y) \right] \right\|_{HS}^2. \quad (11)$$

Instead of directly measuring the covariance $E_{xy}$ of the two random vectors $x$ and $y$, $\mathrm{HSIC}(x, y)$ maps $x$ and $y$ through the functions $\phi$ and $\psi$ into the two RKHSs $\mathcal{H}$ and $\mathcal{G}$ respectively, and then measures the covariance between $\phi(x)$ and $\psi(y)$. Choosing appropriate functions $\phi$ and $\psi$ can better reveal intrinsic characteristics of $x$ and $y$, and HSIC is the covariance that measures these intrinsic characteristics. To facilitate the calculation of HSIC in feature selection, it can be expressed through the kernel functions $k(\cdot, \cdot)$ and $l(\cdot, \cdot)$ associated with $\mathcal{H}$ and $\mathcal{G}$:

$$\mathrm{HSIC}(x, y) = E_{xx'yy'}\!\left[k(x, x')\, l(y, y')\right] + E_{xx'}\!\left[k(x, x')\right] E_{yy'}\!\left[l(y, y')\right] - 2 E_{xy}\!\left[ E_{x'}\!\left[k(x, x')\right] E_{y'}\!\left[l(y, y')\right] \right], \quad (12)$$

where $(x', y')$ is an independent identically distributed copy of $(x, y)$, and $\mathrm{HSIC}(x, y)$ depends only on the density of $(x, y)$. Gretton et al. [12] discussed this RKHS-based measure of statistical dependence and proved that a sufficient condition for HSIC equal to zero to represent independence is that the RKHSs $\mathcal{H}$ and $\mathcal{G}$ induced by the kernel functions $k(\cdot, \cdot)$ and $l(\cdot, \cdot)$ are dense in the spaces of bounded continuous functions mapping from $\mathbb{R}^p$ and $\mathbb{R}^q$ to $\mathbb{R}$. Such $k(\cdot, \cdot)$ and $l(\cdot, \cdot)$ are called universal kernel functions, but the universality condition is restrictive and applies only to compact domains.
At present, several kernels are commonly used in feature selection algorithms, such as radial basis function kernels (also known as Gaussian kernels), Laplace kernels, Matérn kernels, and exponential kernels, because these kernels better characterize independence on both compact and non-compact sets. The HSIC-based feature selection of lncRNA datasets in this article uses radial basis function kernels. Let $g_s$ denote the density of the standard Gaussian distribution on $\mathbb{R}^s$, defined for $x \in \mathbb{R}^s$ as

$$g_s(x) = (2\pi)^{-s/2} \exp\!\left(-\frac{1}{2} \sum_{i=1}^{s} x_i^2\right). \quad (13)$$
For $\forall \gamma = (\gamma_1, \ldots, \gamma_p) \in (0, +\infty)^p$ and $\forall \delta = (\delta_1, \ldots, \delta_q) \in (0, +\infty)^q$, where $p$ and $q$ denote dimensions, for $\forall x \in \mathbb{R}^p$, $y \in \mathbb{R}^q$ we can define

$$\phi_{\gamma}(x) = (\gamma_1 \cdots \gamma_p)^{-1} g_p\!\left(\frac{x_1}{\gamma_1}, \ldots, \frac{x_p}{\gamma_p}\right). \quad (14)$$

$$\varphi_{\delta}(y) = (\delta_1 \cdots \delta_q)^{-1} g_q\!\left(\frac{y_1}{\delta_1}, \ldots, \frac{y_q}{\delta_q}\right). \quad (15)$$

Then, for $\forall x, x' \in \mathbb{R}^p$ and $\forall y, y' \in \mathbb{R}^q$, the radial basis function kernels are respectively defined as

$$k_{\gamma}(x, x') = \phi_{\gamma}(x - x'). \quad (16)$$

$$l_{\delta}(y, y') = \varphi_{\delta}(y - y'). \quad (17)$$

In practical applications, it is not feasible to compute HSIC from the density $f$ of $(x, y)$, because $f$ is unknown. Therefore, given $n$ independent and identically distributed samples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn from $f$, HSIC can be estimated by estimating each mathematical expectation in Eq. (12). To this end, U-statistics are introduced for the unbiased estimation of Eq. (12):

$$\widehat{\mathrm{HSIC}}_1 = [n(n-1)]^{-1} \sum_{(i,j) \in I_2^n} k_{\gamma}(x_i, x_j)\, l_{\delta}(y_i, y_j). \quad (18)$$

$$\widehat{\mathrm{HSIC}}_2 = [n(n-1)(n-2)]^{-1} \sum_{(i,j,c) \in I_3^n} k_{\gamma}(x_i, x_j)\, l_{\delta}(y_j, y_c). \quad (19)$$

$$\widehat{\mathrm{HSIC}}_3 = [n(n-1)(n-2)(n-3)]^{-1} \sum_{(i,j,c,d) \in I_4^n} k_{\gamma}(x_i, x_j)\, l_{\delta}(y_c, y_d), \quad (20)$$

where $I_r^n$ is the set of all $r$-tuples drawn without replacement from $\{1, 2, \ldots, n\}$. HSIC can then be estimated by the U-statistic

$$\widehat{\mathrm{HSIC}}(f) = \widehat{\mathrm{HSIC}}_1 + \widehat{\mathrm{HSIC}}_3 - 2\,\widehat{\mathrm{HSIC}}_2. \quad (21)$$
The above HSIC estimators are usually used to construct independence tests measuring the correlation between variables. It has been shown that when $x$ and $y$ are independent of each other, the asymptotic distribution of the HSIC estimate can be approximated by a Gaussian distribution, whose parameters are easy to estimate. However, to make the independence test easier, it is more convenient to use a biased empirical estimate, replacing the U-statistics with V-statistics. Accordingly, HSIC can be rewritten as

$$\mathrm{HSIC} = \frac{1}{m^2} \sum_{i,j}^{m} k_{ij} l_{ij} + \frac{1}{m^4} \sum_{i,j,c,d}^{m} k_{ij} l_{cd} - \frac{2}{m^3} \sum_{i,j,c}^{m} k_{ij} l_{ic} \quad (22)$$

$$= \frac{1}{m^2} \mathrm{tr}(K_x K_y) + \frac{1}{m^4} \mathrm{tr}(K_x \mathbf{1} K_y \mathbf{1}) - \frac{2}{m^3} \mathrm{tr}(K_x \mathbf{1} K_y) = \frac{1}{m^2} \mathrm{tr}(K_x J K_y J),$$

where $K_x = [k(x_i, x_j)]$ and $K_y = [l(y_i, y_j)]$ are symmetric matrices of size $m \times m$, abbreviated as $K_x$ and $K_y$; $\mathrm{tr}(\cdot)$ denotes the trace of a matrix; $J = I_m - \frac{1}{m}\mathbf{1}$, where $I_m$ is the identity matrix of order $m$ and the bold $\mathbf{1}$ denotes the $m \times m$ all-ones matrix. When the finite-sample formula (22) is used, the estimate is biased: as the sample size decreases, the sample distribution fluctuates less than the population distribution, so the variance is underestimated. To correct this deviation, the bias-corrected estimate of HSIC is used:

$$\mathrm{HSIC} = (m-1)^{-2}\, \mathrm{tr}(K_x J K_y J). \quad (23)$$
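The centered-trace form of Eq. (23) is straightforward to implement. The sketch below is not from the paper; the kernel bandwidth `sigma` is a free parameter that the paper does not fix. It computes the biased empirical HSIC with radial basis function kernels:

```python
import numpy as np

def rbf_kernel(A, sigma=1.0):
    """Gaussian (RBF) Gram matrix for the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC: (m-1)^-2 * tr(Kx J Ky J), Eq. (23)."""
    m = X.shape[0]
    Kx = rbf_kernel(X, sigma)
    Ky = rbf_kernel(Y, sigma)
    J = np.eye(m) - np.ones((m, m)) / m   # centering matrix
    return np.trace(Kx @ J @ Ky @ J) / (m - 1) ** 2
```

Because $J^2 = J$, the statistic equals $\|J K_x J \cdot J K_y J\|$-type trace of a product of two positive semi-definite matrices, so it is always non-negative, and dependent pairs score visibly higher than independent ones.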
2.3 Feature Selection Using the HSIC Criterion

Having defined the feature selection criterion, we now describe an algorithm that selects features on the basis of this dependence measure. The feature selection method takes a group of original data $X = \{x_1, \ldots, x_m\} \in \mathbb{R}^D$ in the high-dimensional space $\mathbb{R}^D$ and, according to the HSIC criterion proposed above, finds a group of data $Y = \{y_1, \ldots, y_m\} \in \mathbb{R}^d$ in a low-dimensional space as the feature selection result, where the selected dimension is smaller than the original one, i.e., $d < D$. In other words, the HSIC-based feature selection objective can be described as

$$\arg\max_{Y}\, \mathrm{HSIC}(X, Y). \quad (24)$$

That is, the low-dimensional representation in $\mathbb{R}^d$ found after feature selection should be linearly related to the original high-dimensional data in $\mathbb{R}^D$ as much as possible, so that more of the original key information is retained and better feature selection results are achieved. In Eq. (24), the result $Y$ after feature selection is hidden inside the kernel matrix $K_Y$, which is not conducive to solving Eq. (24). Therefore, in order to represent $Y$ explicitly, the kernel function of $Y$ is defined in HSIC as $k_Y : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, and for $\forall y_i, y_j \in \mathbb{R}^d$, $i, j \in [1, m]$, $k_Y$ can be expressed as

$$k_Y(y_i, y_j) = y_i^{\top} y_j + c\,\eta(y_i, y_j), \quad (25)$$

where $c > 0$, $\eta(y_i, y_j) = 1$ if and only if $y_i = y_j$, and $\eta(y_i, y_j) = 0$ otherwise. The term $\eta$ is added to make $k_Y$ positive definite.
Combining the above discussion, HSIC can be further expressed as

$$\mathrm{HSIC}(Y, X) = (m-1)^{-2}\, \mathrm{tr}\!\left(Y^{\top} Y J K_x J\right) + c(m-1)^{-2}\, \mathrm{tr}(J K_x J) = (m-1)^{-2}\, \mathrm{tr}\!\left(Y J K_x J Y^{\top}\right) + c(m-1)^{-2}\, \mathrm{tr}(J K_x J). \quad (26)$$

Since $\mathrm{tr}(J K_X J)$ is independent of $Y$, and $c$ and $m$ are also independent of $Y$, maximizing the objective function in Eq. (26) is equivalent to finding

$$\arg\max_{Y}\, \mathrm{tr}\!\left(Y J K_X J Y^{\top}\right). \quad (27)$$

From a geometric point of view, the matrix multiplication $YJ$ in the above formula means that $Y$ is centered in the low-dimensional feature space $\mathbb{R}^d$. For feature selection, that is, feature dimensionality reduction, using $Y$ instead of $YJ$ has the same effect, so Eq. (27) can be further simplified to

$$\arg\max_{Y}\, \mathrm{tr}\!\left(Y K_X Y^{\top}\right). \quad (28)$$
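Under an additional orthonormality constraint $Y Y^{\top} = I_d$ — an assumption not stated explicitly in the text — the maximizer of the trace objective in Eq. (28), applied to the centered kernel of Eq. (27), is given by the leading eigenvectors of $J K_X J$. A minimal sketch:

```python
import numpy as np

def hsic_embedding(Kx, d):
    """Maximize tr(Y Kx Y^T) over Y in R^{d x m} with Y Y^T = I_d.

    Under this (assumed) orthonormality constraint, the maximizer is
    given by the top-d eigenvectors of the centered symmetric kernel."""
    m = Kx.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    Kc = J @ Kx @ J                       # centering, cf. YJ vs. Y in Eq. (27)
    w, V = np.linalg.eigh((Kc + Kc.T) / 2)
    Y = V[:, np.argsort(w)[::-1][:d]].T   # rows = leading eigenvectors
    return Y
```

The achieved objective value then equals the sum of the $d$ largest eigenvalues of the centered kernel matrix, which is the standard Rayleigh-quotient argument for trace maximization.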
3 Experimental Results

3.1 Data Sets

The dataset utilized in this article comprises lncRNA data downloaded from the TCGA database. Patients have been classified into three subcategories based on the T (Tumor), N (Lymph Node) and M (Metastasis) indexes of the international universal cancer classification. These subcategories include lymph node metastasis, lymph node non-metastasis, and normal samples, which are represented by the "+" sign, "-" sign, and "*" sign, respectively. The details of the 12 lncRNA datasets are presented in Table 1. All 12 datasets used in this article come from the Illumina sequencing platform (https://www.illumina.com.cn/), and feature selection experiments are conducted on these 12 lncRNA datasets. The expression of related genes in lncRNAs was quantified as the cancer expression profile and recorded in the corresponding dataset. Because the data come from numerous sources and some information was lost during manual collection, when analyzing the cancer data, any lncRNA whose missing values account for more than 30% of all samples is discarded, while the missing values of the remaining lncRNAs are imputed with scikit-learn's KNN imputation (impute.KNNImputer). This approach has the advantage of not requiring complex data preprocessing or the configuration of many parameters: missing data points are determined from the nearest neighboring points. This helps to avoid errors caused by missing values in subsequent feature selection results. In addition, as Gencode (https://www.gencodegenes.org/) provides many experimentally confirmed, high-quality reference gene annotations, the relevant information from this website was used as a reference in this chapter. The original dimension of all 12 lncRNA datasets was 60483. Subsequent feature selection experiments were conducted on the original dimensions to reduce the dimensionality.
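The preprocessing described above — discarding lncRNAs with more than 30% missing values and KNN-imputing the rest — can be sketched in plain NumPy. This mimics what scikit-learn's `impute.KNNImputer` does; the neighbor count `k` and the 30% threshold are the paper's stated parameters, while the nan-aware distance is an implementation assumption:

```python
import numpy as np

def knn_impute(X, max_missing=0.30, k=5):
    """Drop features with > max_missing missing fraction, then fill each
    remaining missing entry with the mean of that feature over the k
    nearest samples (nan-aware squared Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    keep = np.isnan(X).mean(axis=0) <= max_missing   # features retained
    X = X[:, keep].copy()
    n = X.shape[0]
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        diff = X - X[i]                  # nan wherever either entry is nan
        valid = ~np.isnan(diff)
        d = np.array([np.mean(diff[j][valid[j]] ** 2) if valid[j].any()
                      else np.inf for j in range(n)])
        d[i] = np.inf                    # exclude the sample itself
        nbrs = np.argsort(d)[:k]
        for f in np.where(miss)[0]:
            vals = X[nbrs, f]
            vals = vals[~np.isnan(vals)]
            X[i, f] = vals.mean() if vals.size else np.nanmean(X[:, f])
    return X, keep
```

In practice one would call the scikit-learn implementation directly; this sketch only illustrates the two-step filter-then-impute logic.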
Table 1. LncRNA expression profile data set.

Dataset name | Sample number classification | Feature dimension
TCGA-BLCA | 179 (26+, 134-, 19*) | 60483
TCGA-BRCA | 714 (148+, 454-, 112*) | 60483
TCGA-CESC | 110 (27+, 80-, 3*) | 60483
TCGA-COAD | 324 (41+, 242-, 41*) | 60483
TCGA-HNSC | 210 (78+, 88-, 44*) | 60483
TCGA-KIRC | 284 (11+, 201-, 72*) | 60483
TCGA-LUAD | 343 (53+, 231-, 59*) | 60483
TCGA-LUSC | 348 (40+, 259-, 49*) | 60483
TCGA-PAAD | 82 (58+, 20-, 4*) | 60483
TCGA-READ | 107 (18+, 79-, 10*) | 60483
TCGA-STAD | 261 (126+, 103-, 32*) | 60483
TCGA-THCA | 330 (127+, 145-, 58*) | 60483
3.2 Experiments

To further verify, through experiments, the effect of HSIC in eliminating redundant features, this chapter compares it with several feature selection methods for gene data proposed in related studies: chi-square filtering and joint hypothesis testing, which score the intrinsic attributes of genes with statistical tests and compute correlation indicators; a fast wrapper feature selection algorithm, which uses the selected feature subset as the evaluation factor and its convergence behavior and ability as the evaluation index; the embedded recursive feature elimination (RFE) method, which relies on the attributes of an underlying classifier to assign gene relevance and couples feature selection with algorithm training; and, last but not least, the mutual information (MI) method. With each of the above methods, features are extracted from the 12 original lncRNA datasets, and the XGBoost classifier is used to classify the resulting low-dimensional data. Following these experimental steps, the k-fold cross-validation method (k = 2, ..., 10) is used to evaluate the generalization ability of the classification prediction model trained on the selected data. The final statistical results are shown in Table 2. From the comparative experimental results of the different methods, it can be seen that the proposed HSIC method achieves ideal results on datasets such as TCGA-BLCA, TCGA-BRCA, TCGA-CESC, TCGA-COAD, TCGA-KIRC, TCGA-LUSC, TCGA-LUAD, TCGA-PAAD, TCGA-THCA and TCGA-STAD, but is relatively average on the TCGA-HNSC and TCGA-READ subsets. Based on Table 2, it is evident that the HSIC method proposed in this chapter is stable on most lncRNA datasets and outperforms the other comparison methods. This indicates that the selected feature subsets contain fewer redundant features and are more discriminative for the prediction of lymph node metastasis.
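The evaluation protocol — k-fold cross-validation of a classifier trained on the selected features, reporting mean accuracy and double standard deviation as in Table 2 — can be sketched as follows. The nearest-centroid classifier here is only a toy stand-in for the XGBoost model used in the paper:

```python
import numpy as np

def kfold_scores(X, y, fit_predict, k=5, seed=0):
    """k-fold cross-validation; returns (mean accuracy, double standard
    deviation), the two statistics reported in Table 2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = fit_predict(X[tr], y[tr], X[te])
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs)), 2.0 * float(np.std(accs))

def nearest_centroid(Xtr, ytr, Xte):
    """Toy stand-in classifier: assign each test point to the class whose
    training centroid is nearest."""
    labels = np.unique(ytr)
    C = np.stack([Xtr[ytr == c].mean(axis=0) for c in labels])
    d = ((Xte[:, None, :] - C[None]) ** 2).sum(-1)
    return labels[d.argmin(1)]
```

Any function with the `fit_predict(X_tr, y_tr, X_te)` signature (e.g. a wrapper around an XGBoost model) can be dropped in place of `nearest_centroid`.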
Table 2. Experimental results of different feature dimensionality reduction algorithms on lncRNAs datasets (average value, double standard deviation and feature number).

Dataset name | Chi-square Filtering | Joint Hypothesis Test | MI | Embedding | RFE | The Proposed
TCGA-BLCA | 0.7963 ± 0.1048 (1054) | 0.7747 ± 0.1648 (11113) | 0.7963 ± 0.0370 (12393) | 0.7976 ± 0.1934 (466) | 0.7778 ± 0.0907 (1036) | 0.8333 ± 0.0224 (8030)
TCGA-BRCA | 0.8275 ± 0.0627 (3772) | 0.8277 ± 0.0641 (17729) | 0.8186 ± 0.0348 (11938) | 0.8091 ± 0.0709 (1770) | 0.8231 ± 0.0290 (5786) | 0.8279 ± 0.0228 (3183)
TCGA-CESC | 0.6500 ± 0.2708 (1351) | 0.5774 ± 0.1185 (4757) | 0.6250 ± 0.1867 (9606) | 0.6905 ± 0.2405 (379) | 0.5625 ± 0.1184 (963) | 0.6905 ± 0.0476 (3967)
TCGA-COAD | 0.8272 ± 0.1056 (4821) | 0.8367 ± 0.0408 (14567) | 0.8373 ± 0.0745 (18152) | 0.8163 ± 0.1022 (623) | 0.8264 ± 0.0316 (9475) | 0.8470 ± 0.0538 (10395)
TCGA-HNSC | 0.6500 ± 0.3052 (1740) | 0.5859 ± 0.3210 (15420) | 0.6051 ± 0.2768 (16390) | 0.6190 ± 0.1347 (835) | 0.6350 ± 0.0202 (2563) | 0.6346 ± 0.1946 (12505)
TCGA-KIRC | 0.9536 ± 0.0319 (5607) | 0.9551 ± 0.1132 (20601) | 0.9548 ± 0.1002 (21813) | 0.9432 ± 0.1093 (296) | 0.9546 ± 0.0910 (4257) | 0.9556 ± 0.0994 (8958)
TCGA-LUAD | 0.7860 ± 0.0818 (1992) | 0.8052 ± 0.1203 (20312) | 0.7958 ± 0.0683 (19041) | 0.8053 ± 0.0818 (864) | 0.8148 ± 0.1280 (17452) | 0.8151 ± 0.0764 (6432)
TCGA-LUSC | 0.8953 ± 0.0171 (4895) | 0.8957 ± 0.0595 (18228) | 0.8955 ± 0.0522 (16785) | 0.8864 ± 0.0884 (721) | 0.8960 ± 0.0743 (8492) | 0.8963 ± 0.0976 (3326)
TCGA-PAAD | 0.6667 ± 0.3563 (91) | 0.7407 ± 0.3884 (3189) | 0.7500 ± 0.1732 (10741) | 0.7143 ± 0.5553 (341) | 0.7083 ± 0.3938 (2047) | 0.8000 ± 0.2828 (6741)
TCGA-READ | 0.7273 ± 0.0772 (3532) | 0.7574 ± 0.0147 (13064) | 0.6944 ± 0.2257 (18580) | 0.6949 ± 0.1397 (268) | 0.6970 ± 0.2268 (3096) | 0.7085 ± 0.1773 (7959)
TCGA-STAD | 0.5196 ± 0.2382 (1011) | 0.5693 ± 0.2265 (18091) | 0.5607 ± 0.2716 (16788) | 0.6088 ± 0.2188 (965) | 0.5679 ± 0.1453 (1654) | 0.6106 ± 0.0372 (5164)
TCGA-THCA | 0.7099 ± 0.2506 (4132) | 0.6566 ± 0.0756 (20727) | 0.6970 ± 0.0495 (15197) | 0.6979 ± 0.3171 (1176) | 0.6386 ± 0.2542 (2735) | 0.7172 ± 0.0543 (6836)
Therefore, this method has more practical significance in bioinformatics research on patient condition discovery and prognosis.
4 Conclusion

This article proposes a feature selection method for lncRNA based on the HSIC criterion. This unsupervised algorithm takes the correlation between genes into account and maximizes the correlation between the selected features and the original data by means of HSIC, which helps to eliminate redundant features and identify key ones. The method has several advantages. Firstly, the calculation of HSIC is simple and straightforward: if the lncRNA dataset is viewed as a concrete realization of random vectors, then HSIC is the trace of the product of two kernel matrices, where each kernel matrix comprises the values of a kernel function on the data sample; the HSIC formula also shows that HSIC depends on both the data and the kernel functions. Secondly, the empirical estimate of HSIC converges to the population value at the rate of $m^{-1/2}$ [13], where $m$ is the sample size, so experiments using HSIC as an independence measure are not affected by slow learning rates. In particular, as the sample size increases, HSIC can detect any existing correlation with high probability. In other words, as a feature selection measure, HSIC is highly representative: it reduces the redundancy of the raw high-dimensional lncRNA data and makes the extracted key features more recognizable.
References

1. Golla, U., et al.: ABHD11-AS1: an emerging long non-coding RNA (lncRNA) with clinical significance in human malignancies. Non-Coding RNA 8(2), 21 (2022)
2. Wang, J., et al.: LncRNA HOXA-AS2 and its molecular mechanisms in human cancer. Clin. Chim. Acta 485, 229–233 (2018)
3. Han, X., et al.: The lncRNA Hand2os1/Uph locus orchestrates heart development through regulation of precise expression of Hand2. Development 146(13), 176198 (2019)
4. Chen, X., et al.: LncRNA-AC02278.4 is a novel prognostic biomarker that promotes tumor growth and metastasis in lung adenocarcinoma. Front. Oncol. 12, 860961 (2022)
5. Tamgue, O., et al.: Non-coding RNAs in the etiology and control of major and neglected human tropical diseases. Front. Immunol. 12, 703936 (2021)
6. Gu, Z., Wu, S., Wang, J., Zhao, S.: Long non-coding RNA LINC01419 mediates miR-519a-3p/PDRG1 axis to promote cell progression in osteosarcoma. Cancer Cell Int. 20, 147 (2020)
7. Jia, W., Sun, M., Lian, J., Hou, S.: Feature dimensionality reduction: a review. Complex Intell. Syst. 8, 2663–2693 (2022)
8. Ozdenizci, O., Erdoğmuş, D.: Stochastic mutual information gradient estimation for dimensionality reduction networks. Inf. Sci. 570, 298–305 (2021)
9. Lei, J., Cai, Z., He, X., Zheng, W., Liu, J.: An approach of gene regulatory network construction using mixed entropy optimizing context-related likelihood mutual information. Bioinformatics 39, btac717 (2021)
10. Wang, Z., Ma, P., Chi, Z., Li, D., Yang, H., Du, W.: Multi-attention mutual information distributed framework for few-shot learning. Expert Syst. Appl. 202, 117062 (2022)
11. Jacod, J., Protter, P.: Probability Essentials. Springer, Heidelberg (2004)
12. Gretton, A., et al.: Kernel constrained covariance for dependence measurement. In: International Conference on Artificial Intelligence and Statistics (2005)
13. Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Stochastic Modelling and Applied Probability. Springer (1996)
2D-DLPP Algorithm Based on SPD Manifold Tangent Space

Xiaohang Li1(B), Bo Li1, and Zonghui Wang2

1 School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430064, China
[email protected], [email protected]
2 Zhejiang University, 66 Yuhangtang Rd, Hangzhou 310058, China
[email protected]
Abstract. The manifold tangent space-based algorithm has emerged as a promising approach for processing and recognizing high-dimensional data. In this study, we propose a new algorithm based on the manifold tangent space, called the manifold tangent space-based 2D-DLPP algorithm. This algorithm embeds the covariance matrix into the tangent space of the SPD manifold and utilizes Log-Euclidean Metric Learning (LEM) to fully extract feature information, thus enhancing the discriminative ability of 2D-DLPP. Comparative experiments were conducted to evaluate the algorithm, and the results showed superior recognition ability compared to other existing algorithms. Experiments also demonstrate that the algorithm can retain the local nonlinear structure of the manifold and improve the class separability of samples. Keywords: Face Recognition · SPD Manifold Learning · Feature Extraction · 2D-DLPP
1 Introduction

With the advancement of technology and information technology, computer vision has become a key modern research area. Feature extraction is an important direction in computer vision [1–3], and it can be divided into linear and nonlinear feature extraction. Linear feature extraction methods include principal component analysis, linear discriminant analysis, and independent component analysis; nonlinear feature extraction includes local linear embedding, Laplacian eigenmaps, and others. Manifold learning belongs to the nonlinear feature extraction methods [4–6]; it is a class of dimensionality reduction methods that exploit the local Euclidean structure of manifolds. A manifold is a space that is locally homeomorphic to Euclidean space; in other words, it has the properties of Euclidean space locally, and local distances can be computed with Euclidean distances. The goal of manifold learning methods is to make the data maintain local geometric structure in the neighborhood and successfully find the intrinsic features of nonlinear
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 201–212, 2023. https://doi.org/10.1007/978-981-99-4752-2_17
manifold. Locality Preserving Projection (LPP) [7], Neighborhood Preserving Embedding (NPE) [8] and Local Linear Embedding (LLE) [9] are common manifold learning methods that preserve locality. However, these methods share a limitation: they ignore discriminant information that is important for recognition problems. LPP is a linear approximation of the Laplacian eigenmaps method; it preserves the local structure of the sample space by constructing a neighborhood graph over the samples, so that samples that are neighbors in the original space remain close to each other after projection. The Two-dimensional Locality Preserving Projection (2DLPP) [10] is a two-dimensional extension of LPP. Operating directly on the two-dimensional image matrix, 2DLPP not only avoids the singularity problem of the sample matrix but also preserves structural information present in the original two-dimensional image, thereby achieving higher recognition rates. Discriminative Locality Preserving Projection (DLPP) [11] aims to find the subspace that best distinguishes different classes by minimizing the within-class distance and maximizing the between-class distance. Two-dimensional Discriminative Locality Preserving Projection (2D-DLPP) [12] is an extension of DLPP to 2D feature extraction: it can extract 2D information and preserve the geometric structure of the original data. However, the commonly used 2D-DLPP based on Euclidean space does not take into account the geometric structure of SPD matrices on the Riemannian manifold, and its results are not satisfactory. In this paper, we introduce Log-Euclidean Metric Learning (LEM) [13] and the logarithmic mapping of the SPD matrix into the 2D-DLPP algorithm. By embedding the covariance matrix into the tangent space, the feature information is fully extracted.
When dealing with high-dimensional data, better discriminant performance is achieved while maintaining the local nonlinear structure of the manifold.
2 The Proposed Methods

2.1 Two-Dimensional Discriminant Locality Preserving Projection (2D-DLPP)

2D-DLPP is the 2D extension of DLPP. The main advantage of 2D-DLPP over DLPP is that it provides a more accurate approximation of the original data, thus avoiding the loss of information important for recognition. Assume that $X = [x_1^s, x_2^s, \cdots, x_{n_s}^s] \in \mathbb{R}^{m \times n}$ is a set of $N$ training sample matrices in Euclidean space, divided into $Z$ categories, where the number of samples in the $s$-th category is $n_s$. The objective function of 2D-DLPP is

$$J(Y) = \frac{\sum_{s=1}^{Z} \sum_{i,j=1}^{n_s} \left(Y_i^s - Y_j^s\right) \left(Y_i^s - Y_j^s\right)^{\top} W_{ij}^s}{\sum_{i,j=1}^{Z} \left(M_i - M_j\right) \left(M_i - M_j\right)^{\top} B_{ij}}, \quad (1)$$
2D-DLPP Algorithm Based on SPD Manifold Tangent Space

203

where Y_i^s denotes the projected feature matrix of the i-th sample in class s, and M_i denotes the mean feature matrix of class i. W_{ij}^s and B_{ij} are the within-class weight matrix and the between-class weight matrix, respectively. B is the weight matrix between any two classes' mean matrices and is defined as B_{ij} = \exp(-\|M_i - M_j\|^2 / \sigma). W_{ij}^s is the weight matrix between any two samples in the s-th class, defined as:

W_{ij}^s = \exp\left( \frac{-\|X_i^s - X_j^s\|^2}{\sigma} \right).   (2)

where σ > 0 is an empirically determined parameter. Assuming that A denotes a two-dimensional transformation matrix of d × d dimensions, the objective function can be rewritten via the linear transformation Y = A^T X as:

J(A) = \frac{A^T X L X^T A}{A^T F H F^T A} = \frac{A^T S_w A}{A^T S_b A}.   (3)

The projection direction can be obtained by minimizing the objective function, which can be transformed into the generalized eigenvalue problem:

X L X^T A = \lambda F H F^T A.   (4)
Therefore, A = [α_1, α_2, ..., α_d] is the solution of Eq. (4), obtained by ordering the eigenvectors according to their eigenvalues λ_1, λ_2, ..., λ_d.

2.2 The Tangent Space of SPD Matrices

Let X = [x_1, x_2, ..., x_n] be an image set with n images, where x_i is the i-th vectorized instance, a d-dimensional vector. The image set X can be represented by the d × d covariance matrix [14–16]:

C = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T.   (5)

where \bar{x} denotes the mean of the samples in the image set. Since the number of samples in the image set may be smaller than the sample dimensionality, the covariance matrix may be singular, so a small perturbation value needs to be added. The covariance matrix C calculated by Eq. (5) lies on a nonlinear Riemannian manifold, the SPD manifold Sym_n^{++}, so Euclidean calculations are not applicable. Arsigny et al. [13] proposed to transform the manifold of covariance matrices into the vector space of symmetric matrices using the matrix logarithm mapping:

\log(C) = U \log(\Lambda) U^T.   (6)

where U \Lambda U^T = C is the eigen-decomposition of the covariance matrix C, \Lambda is the diagonal matrix of eigenvalues, U is the corresponding eigenvector matrix, and \log(\Lambda) is the diagonal matrix of the logarithms of the eigenvalues. Machine learning algorithms developed for Euclidean spaces cannot be applied directly to Riemannian manifolds. Sym_n^{++} does not constitute a linear space, but a metric can be defined on Sym_n^{++} for measuring the distance between
204
X. Li et al.
two SPD matrices. Among the similarity metrics on SPD manifolds, the Log-Euclidean Metric is a widely used distance measure that corresponds to the Euclidean metric in the logarithmic domain. The distance between two points C_i, C_j on Sym_n^{++} under the LEM is:

d_{LEM}(C_i, C_j) = \| \log(C_i) - \log(C_j) \|_F.   (7)

The tangent space is a Euclidean space, so the Euclidean metric can be applied directly as a distance metric, and the LEM is an approximate geodesic distance.

2.3 Tangent Space Discriminant Learning

Our goal is to learn an embedding function φ that maximizes the discriminative power while preserving the category information. Following the objective function of 2D-DLPP, the within-class scatter S_w and between-class scatter S_b in the feature space can be defined as:

S_w = \sum_{i,j} Dis(\varphi(C_i^s), \varphi(C_j^s)) \, W_{ij}^{\varphi}.   (8)
S_b = \sum_{i,j} Dis(\varphi(M_i), \varphi(M_j)) \, B_{ij}^{\varphi}.   (9)
where Dis(φ(C_i^s), φ(C_j^s)) denotes the distance between two embedded expressionlets φ(C_i^s) and φ(C_j^s), and M_i = (1/n_i) \sum_{k=1}^{n_i} C_k^i denotes the average matrix of the i-th class. Thus, the vector representation of the i-th expression in category s, C_i^s, can be obtained; let \tilde{x}_i^s be the vector spanned by \log(C_i^s). A 2D linear feature extraction method projects the image matrix onto the matrix A. Therefore, the features embedded in the classical Euclidean space and their distances can be re-expressed as:

\varphi(C_i^s) = \tilde{x}_i^s A, \quad \varphi(C_j^s) = \tilde{x}_j^s A.   (10)

Dis(\varphi(C_i^s), \varphi(C_j^s)) = \| \tilde{x}_i^s A - \tilde{x}_j^s A \|^2.   (11)

Dis(\varphi(M_i), \varphi(M_j)) = \| \tilde{m}_i A - \tilde{m}_j A \|^2.   (12)
Next, it is sufficient to learn the projection matrix A instead of φ, by maximizing the inter-class scatter Sb and minimizing the within-class scatter Sw .
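A minimal numpy sketch of the representation pipeline above — Eq. (5) with the perturbation term, the log map of Eq. (6), the LEM distance of Eq. (7), and the tangent vectors \tilde{x} of Eqs. (10)–(12) — follows. Function names and the perturbation value `eps` are our assumptions:

```python
import numpy as np

def covariance_rep(X, eps=1e-3):
    """Eq. (5): covariance of an image set (columns of X are vectorized
    images), plus eps*I -- the small perturbation mentioned in the text
    (the value 1e-3 is our assumption)."""
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc @ Xc.T / (n - 1) + eps * np.eye(X.shape[0])

def log_map(C):
    """Eq. (6): log(C) = U log(Lambda) U^T via eigen-decomposition."""
    w, U = np.linalg.eigh(C)
    return U @ np.diag(np.log(w)) @ U.T

def lem_distance(Ci, Cj):
    """Eq. (7): d_LEM(Ci, Cj) = ||log(Ci) - log(Cj)||_F."""
    return np.linalg.norm(log_map(Ci) - log_map(Cj), "fro")

rng = np.random.default_rng(0)
set_a = rng.normal(size=(6, 20))        # 20 vectorized images of dim 6
set_b = rng.normal(size=(6, 15)) + 1.0  # a second, shifted image set

Ca, Cb = covariance_rep(set_a), covariance_rep(set_b)
x_tilde = log_map(Ca).ravel()           # tangent vector, cf. Eqs. (10)-(12)
d_ab = lem_distance(Ca, Cb)
```

Once every image set is reduced to a tangent vector `x_tilde`, the scatter matrices above become ordinary weighted sums over these vectors.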
The objective function of 2D-DLPP is transformed as follows:

\max J(A) = \frac{A^T S_b A}{A^T S_w A}
= \arg\max_A \frac{\sum_{i,j=1}^{Z} (\tilde{m}_i A - \tilde{m}_j A)^T (\tilde{m}_i A - \tilde{m}_j A) B_{ij}^{\varphi}}{\sum_{s=1}^{Z} \sum_{i,j=1}^{n_s} (\tilde{x}_i^s A - \tilde{x}_j^s A)^T (\tilde{x}_i^s A - \tilde{x}_j^s A) W_{ij}^{\varphi s}}
= \arg\max_A \frac{A^T \tilde{M}^T (H^{\varphi} \otimes I_m) \tilde{M} A}{A^T \tilde{X}^T (L^{\varphi} \otimes I_m) \tilde{X} A}
= \arg\max_A \frac{A^T S_{H^{\varphi}} A}{A^T S_{L^{\varphi}} A}.   (13)

where L^φ = D − W^φ is the Laplacian matrix; D is a diagonal matrix whose entries are the column (or row) sums of W^φ, D_{ii}^s = \sum_{j=1}^{n_s} W_{ij}^{\varphi s}; and H^φ = E − B^φ, where E is a diagonal matrix whose entries are the column (or row) sums of B^φ, E_{ii} = \sum_{j=1}^{Z} B_{ij}^{\varphi}. Therefore, the projection matrix A satisfying the objective function can be obtained by solving the characteristic equation:

S_{L^{\varphi}} A = \lambda S_{H^{\varphi}} A.   (14)
The optimal projection matrix A = [α_1, α_2, ..., α_d] is the solution of this equation; the transformation vectors α_1, α_2, ..., α_d are the eigenvectors corresponding to the minimum eigenvalues λ_1, λ_2, ..., λ_d of the generalized eigenvalue problem.

2.4 Feature Extraction and Classification

Feature extraction is the projection of an image onto the space spanned by the column vectors of the matrix A. Given any image X_i ∈ R^{m×n}, the corresponding low-dimensional description Y_i is:

Y_i = A^T X_i, \quad i = 1, 2, ..., Z.   (15)

Similarly, the low-dimensional description Y^* of a test image X^* is obtained by the same equation. Classification is based on the similarity between the low-dimensional descriptions Y_i and Y^*, defined as:

d(Y_i, Y^*) = \sum_{s=1}^{d} \| \log(y_s^{(i)}) - \log(y_s^{(*)}) \|^2.   (16)

where y_s^{(i)} and y_s^{(*)} represent the s-th columns of Y_i and Y^*, respectively. This LEM-style logarithmic distance is used instead of the Euclidean distance, and KNN is then used for classification.
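The classification step can be sketched as below. The column-wise logarithm of Eq. (16) is only defined for positive entries, so this sketch assumes strictly positive feature matrices (the paper does not state how other cases are handled); names are ours:

```python
import numpy as np

def log_column_distance(Yi, Ystar):
    """Eq. (16): sum over columns s of ||log(y_s^(i)) - log(y_s^(*))||^2,
    with log taken elementwise (assumes strictly positive entries)."""
    diff = np.log(Yi) - np.log(Ystar)
    return float((diff ** 2).sum())

def knn_classify(Y_train, labels, Y_test, k=1, dist=log_column_distance):
    """KNN over low-dimensional descriptions with a pluggable distance."""
    d = np.array([dist(Y, Y_test) for Y in Y_train])
    nearest = np.argsort(d)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

rng = np.random.default_rng(1)
Y_train = [np.abs(rng.normal(size=(4, 4))) + 0.1 for _ in range(6)]
labels = [0, 0, 0, 1, 1, 1]
pred = knn_classify(Y_train, labels, Y_train[0])   # nearest to itself
```

Making the distance a parameter keeps the sketch usable with the plain Frobenius distance as well, which is what the Euclidean baselines in Sect. 3 amount to.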
2.5 Algorithm

Table 1. The proposed method.
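The training stage summarized in Table 1 can be sketched as follows. This is our reading of Eqs. (2) and (8)–(14), not the authors' code: the ⊗ I_m block structure of Eq. (13) is dropped (plain tangent vectors are used), and the small ridge added to S_H is our own regularization to keep the solve well-posed:

```python
import numpy as np

def tangent_dlpp(Xt, y, sigma=1.0, d=2):
    """Learn the projection A of Eq. (14): S_L A = lambda * S_H A,
    keeping the eigenvectors of the smallest eigenvalues.
    Xt: (N, D) tangent vectors (vectorized log-covariances); y: (N,) labels."""
    classes = np.unique(y)
    n_samples, n_feat = Xt.shape
    # Within-class Gaussian weights (Eq. (2) applied inside each class).
    W = np.zeros((n_samples, n_samples))
    for c in classes:
        idx = np.flatnonzero(y == c)
        for i in idx:
            for j in idx:
                W[i, j] = np.exp(-np.sum((Xt[i] - Xt[j]) ** 2) / sigma)
    # Class means and between-class Gaussian weights B.
    M = np.stack([Xt[y == c].mean(axis=0) for c in classes])
    B = np.exp(-((M[:, None] - M[None]) ** 2).sum(-1) / sigma)
    # Laplacians L = D - W and H = E - B, then the scatter matrices.
    L = np.diag(W.sum(axis=1)) - W
    H = np.diag(B.sum(axis=1)) - B
    S_L = Xt.T @ L @ Xt
    S_H = M.T @ H @ M
    # Generalized eigenproblem; the ridge on S_H is our addition, since
    # S_H has rank at most (number of classes - 1).
    vals, vecs = np.linalg.eig(np.linalg.solve(S_H + 1e-6 * np.eye(n_feat), S_L))
    order = np.argsort(vals.real)
    return vecs[:, order[:d]].real

rng = np.random.default_rng(0)
Xt = rng.normal(size=(12, 5))
y = np.array([0] * 6 + [1] * 6)
A = tangent_dlpp(Xt, y, sigma=2.0, d=2)   # (5, 2) projection matrix
```

At test time a new image set is mapped to its tangent vector with the covariance/log-map step of Sect. 2.2, projected with A, and classified with KNN as in Sect. 2.4.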
3 Experiments

To verify the performance of the proposed algorithm, this section conducts experimental comparisons on the Yale and ORL face datasets. The proposed method is compared with LPP-related improved algorithms such as DLPP, 2DLPP and 2D-DLPP, and with other classical 2D dimensionality reduction algorithms such as 2DLDA and 2DPCA.

3.1 Datasets

The Yale dataset is a face image dataset collected by Yale University. It consists of face images of 15 individuals, each with 11 different images covering differences in expression, lighting conditions, and whether glasses are worn. The entire dataset contains 165 images, each of size 320 × 243 px. In the experiments, each image is manually cropped, scaled and normalized to 64 × 64 px.
The ORL database contains face images of 40 people, with 10 face images per person, each 112 × 92 px in size. Some images were taken at different times, and there are various differences between images, such as variations in expression (smiling or not, eyes open or closed), changes in facial details such as whether glasses are worn, and subtle rotational changes in some images. For the proposed algorithm, the feature dimension is d (d = 1, 2, ..., 20) and its feature matrix is d × d; the feature matrices of the other algorithms are 112 × d.

3.2 Results and Discussion

Since the parameter selection of the weight matrix is involved in DLPP, 2DLPP, 2D-DLPP and the proposed algorithm, in order to better evaluate the performance of several
methods, the parameter σ is experimentally selected in the interval [0, 2]. Table 2 shows the parameter settings of this paper, including the number of training samples n, the number of test samples m, the number of nearest neighbor points k, the number of feature dimensions d, and σ.

Table 2. Experimental parameters.

Dataset | Sample size | n     | m     | k     | d      | σ
Yale    | 64 × 64     | 3 ~ 7 | 8 ~ 4 | 4 ~ 8 | 0 ~ 20 | 0 ~ 2
ORL     | 112 × 92    | 2 ~ 6 | 8 ~ 4 | 3 ~ 7 | 0 ~ 20 | 0 ~ 2
First, the relationship between the recognition rate and the number of eigenvalues is investigated. With the number of training samples fixed, the effect of different numbers of eigenvalues on the recognition rate is examined. For each class, n = 6 images in the Yale database and n = 5 images in the ORL database are randomly selected as the training set, and the rest are used as the test set. Ten random divisions are performed and the experiments are repeated to obtain the average recognition rate; the variation of the average recognition rate of each algorithm with the number of eigenvalues on the Yale and ORL databases is presented in Figs. 1 and 2, respectively.
Fig. 1. Variation of recognition rate on Yale dataset with the number of eigenvalues when n = 6.
To further illustrate the performance of the proposed algorithm, the effect of the number of training samples on the recognition rate of the algorithm is then compared. Similarly, the dataset is randomly divided 10 times and the best recognition rate is averaged. The highest recognition rates and the corresponding feature dimensions of the algorithms on the Yale and ORL databases are given in Tables 3 and 4, respectively, for different numbers of training samples.
Fig. 2. Variation of recognition rate on ORL dataset with the number of eigenvalues when n = 5.

Table 3. The best recognition rates (%) of different training samples on the Yale dataset and the corresponding feature dimensions.

Methods  | n = 3           | n = 4           | n = 5           | n = 6           | n = 7
2DLDA    | 66.28 (13 × 13) | 67.30 (13 × 13) | 73.49 (16 × 16) | 73.11 (16 × 16) | 74.65 (15 × 15)
2DPCA    | 67.66 (10 × 10) | 69.96 (13 × 13) | 73.14 (13 × 13) | 70.47 (10 × 10) | 66.68 (15 × 15)
2DLPP    | 64.00 (10 × 10) | 65.47 (12 × 12) | 72.42 (14 × 14) | 73.69 (12 × 12) | 74.35 (14 × 14)
DLPP     | 56.22 (14 × 14) | 59.77 (16 × 16) | 60.19 (16 × 16) | 63.65 (13 × 13) | 64.11 (14 × 14)
Proposed | 67.70 (8 × 8)   | 70.19 (13 × 13) | 73.25 (8 × 8)   | 74.26 (10 × 10) | 75.04 (10 × 10)
Table 4. The best recognition rates (%) of different training samples on the ORL dataset and the corresponding feature dimensions.

Methods  | n = 2           | n = 3            | n = 4           | n = 5           | n = 6
2DLDA    | 87.15 (112 × 4) | 90.79 (112 × 5)  | 93.64 (112 × 4) | 96.28 (112 × 7) | 96.33 (112 × 6)
2DPCA    | 84.91 (112 × 4) | 89.375 (112 × 5) | 91.94 (112 × 4) | 93.92 (112 × 5) | 96.63 (112 × 6)
2DLPP    | 88.55 (112 × 3) | 91.81 (112 × 5)  | 92.63 (112 × 6) | 94.57 (112 × 6) | 96.80 (112 × 5)
DLPP     | 63.55 (112 × 5) | 73.15 (112 × 5)  | 80.48 (112 × 5) | 83.90 (112 × 5) | 84.27 (112 × 5)
Proposed | 87.56 (5 × 5)   | 91.28 (7 × 7)    | 94.26 (9 × 9)   | 97.11 (9 × 9)   | 97.03 (7 × 7)
Since the recognition performance of the proposed algorithm depends on the number of nearest neighbors k, experiments are conducted by varying k while keeping the number of training samples fixed. For the selection of nearest neighbors, 2D-DLPP uses the Euclidean distance between samples, while the proposed method uses the LEM. We therefore present experimental results for both algorithms to better illustrate the effectiveness of the LEM.

Table 5. Comparison of the best average recognition rates (%) of 2D-DLPP with different training samples and nearest neighbor points on the ORL dataset.

      | k = 3 | k = 4 | k = 5 | k = 6 | k = 7
n = 2 | 83.65 | 80.22 | 78.22 | 80.53 | 80.53
n = 3 | 86.38 | 85.66 | 87.45 | 84.23 | 84.23
n = 4 | 90.33 | 90.19 | 90.19 | 90.44 | 90.92
n = 5 | 93.19 | 92.19 | 92.61 | 93.19 | 93.19
mean  | 88.39 | 87.07 | 87.12 | 87.10 | 87.22
Table 6. Comparison of the best average recognition rates (%) of the proposed method with different training samples and nearest neighbor points on the ORL dataset.

      | k = 3 | k = 4 | k = 5 | k = 6 | k = 7
n = 2 | 82.74 | 80.85 | 81.51 | 81.94 | 81.51
n = 3 | 87.57 | 89.44 | 88.11 | 87.22 | 87.22
n = 4 | 91.97 | 92.64 | 92.56 | 91.97 | 91.38
n = 5 | 92.51 | 93.51 | 91.51 | 93.52 | 93.09
mean  | 88.70 | 89.11 | 88.42 | 88.66 | 88.30
Table 5 displays the average optimal recognition rate of the 2D-DLPP algorithm based on Euclidean distance for different numbers of nearest neighbors and training samples. Table 6 shows the recognition rate of the algorithm presented in our research.
In Fig. 3, the best average recognition rates of the two algorithms with different k values and corresponding training samples are compared. It can be seen that the recognition rate of the proposed algorithm under different k-value conditions is better than that of the 2D-DLPP algorithm based on Euclidean distance.
Fig. 3. Comparison of recognition rates of two algorithms taking different k values on the ORL dataset.
Based on the comparative experiments on the two face datasets, the proposed algorithm achieves better recognition performance than the other algorithms under changes in lighting or facial expression. This paper compares the impact of the feature vector dimension on the recognition rate; the experimental results show that, on both the Yale and ORL datasets, the recognition rates of all algorithms increase significantly as the feature vector dimension increases. Overall, the proposed algorithm has the highest average recognition rate among the compared algorithms.
In addition, extensive experiments were conducted with different numbers of training samples. On the Yale dataset, except for n = 5, where 2DLDA achieves the highest recognition rate (Table 3), the proposed algorithm outperforms the other algorithms. On the ORL dataset, 2DLPP has the highest recognition rate when the number of training samples is 2 or 3, and the proposed algorithm outperforms the other algorithms in all other cases (Table 4).
Regarding the impact of the nearest neighbor selection, the proposed algorithm was compared with 2D-DLPP. The results show that, for different numbers of nearest neighbors, nearest-sample selection based on the LEM is superior to selection based on the Euclidean distance.
Overall, the advantages of the proposed algorithm are mainly reflected in the following aspects:
(1) Metric based on the SPD manifold. The Log-Euclidean metric on the SPD manifold captures complex nonlinear features and provides a better feature representation than the usual Euclidean distance, thus improving the accuracy and stability of the algorithm.
(2) Local structure preservation. By using the Laplacian matrix and the tangent space of the SPD manifold to construct the within-class scatter matrix and between-class scatter matrix, the local structure of the samples can be better preserved, improving the separability of sample classes and enhancing the discriminative ability of the algorithm.
4 Conclusion

Exploiting the fact that the SPD manifold tangent space retains more local information of the data, this paper introduces manifold-tangent-space-based discriminative information into the 2D-DLPP algorithm, obtaining a 2D-DLPP algorithm based on the manifold tangent space. The algorithm extracts the required information directly from image matrices and preserves the nonlinear structure of the manifold, so the local nonlinear structure can be well preserved when handling high-dimensional data. In comparative experiments, the recognition rate and its relationship with the feature vector dimension and the number of training samples were compared against other algorithms, and the proposed algorithm shows advantages on all of these indicators. In conclusion, the proposed manifold-tangent-space-based algorithm demonstrates superior discriminative performance compared with existing algorithms. The ability to preserve the nonlinear structure of the manifold while extracting the required information directly from image matrices makes it a promising approach for handling high-dimensional data.

Acknowledgments. This work was supported by the Key Research and Development Plans of Guangxi Province (Grant No. AB22080077).
References

1. Pisal, A., Sor, R., Kinage, K.S.: Facial feature extraction using hierarchical MAX (HMAX) method. In: 2017 International Conference on Computing, Communication, Control and Automation (ICCUBEA), pp. 1–5 (2017)
2. Wang, R., Wu, X., Chen, K., Kittler, J.: Multiple manifolds metric learning with application to image set classification. In: 2018 24th International Conference on Pattern Recognition (ICPR), pp. 627–632. IEEE Computer Society, Beijing (2018)
3. Jin, Y., Dong, Y., Zhang, Y., Hu, X.: SSMD: dimensionality reduction and classification of hyperspectral images based on spatial-spectral manifold distance metric learning. IEEE Trans. Geosci. Remote Sens. 60, 1–16 (2022)
4. Faaeq, A., Gürüler, H., Peker, M.: Image classification using manifold learning based non-linear dimensionality reduction. In: 2018 26th Signal Processing and Communications Applications Conference (SIU), pp. 1–4. IEEE, Izmir (2018)
5. Li, C., Lv, J., Zhao, H., Chen, R., Zhan, J., Lin, K.: Dimensionality reduction with extreme learning machine based on manifold preserving. In: International Conference on Advances in Brain Inspired Cognitive Systems (2019)
6. Ghojogh, B., Ghodsi, A., Karray, F., Crowley, M.: Laplacian-based dimensionality reduction including spectral clustering, Laplacian eigenmap, locality preserving projection, graph embedding, and diffusion map: tutorial and survey. ArXiv, abs/2106.02154 (2021)
7. He, X., Niyogi, P.: Locality preserving projections. Adv. Neural Inf. Process. Syst. 16, 234–241 (2003)
8. He, X., Cai, D., Yan, S., Zhang, H.: Neighborhood preserving embedding. In: Tenth IEEE International Conference on Computer Vision (ICCV 2005), vol. 2, pp. 1208–1213. IEEE Computer Society, Beijing (2005)
9. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000). https://doi.org/10.1126/science.290.5500.2323
10. Chen, S., Zhao, H., Kong, M., Luo, B.: 2D-LPP: a two-dimensional extension of locality preserving projections. Neurocomputing 70(4–6), 912–921 (2007)
11. Yu, W., Teng, X., Liu, C.: Discriminant locality preserving projections: a new method to face representation and recognition. In: 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 201–207 (2005)
12. Zhi, R., Ruan, Q.: Facial expression recognition based on two-dimensional discriminant locality preserving projections. Neurocomputing 71(7–9), 1730–1734 (2008)
13. Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Log-Euclidean metrics for fast and simple calculus on diffusion tensors. Magn. Reson. Med. 56(2), 411–421 (2006)
14. Wang, R., Guo, H., Davis, L.S., Dai, Q.: Covariance discriminative learning: a natural and efficient approach to image set classification. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2496–2503. IEEE Computer Society, Providence (2012)
15. Harandi, M.T., Salzmann, M., Hartley, R.I.: From manifold to manifold: geometry-aware dimensionality reduction for SPD matrices. ArXiv, abs/1407.1120 (2014)
16. Harandi, M.T., Salzmann, M., Hartley, R.I.: Dimensionality reduction on SPD manifolds: the emergence of geometry-aware methods. IEEE Trans. Pattern Anal. Mach. Intell. 40, 48–62 (2016)
Cluster Equality Validity Index and Efficient Clustering Optimization Strategy Zebin Huang1 , Ning Yu1 , Qingqiang Wu1,2,3(B) , and KunHong Liu3 1 School of Informatics, Xiamen University, Xiamen, China
[email protected]
2 Key Laboratory of Digital Protection and Intelligent Processing of Intangible
Cultural Heritage of Fujian and Taiwan Ministry of Culture and Tourism, Xiamen University, Xiamen, China 3 School of Film, Xiamen University, Xiamen, China
Abstract. Cluster analysis is an important research direction in data mining. Research on cluster analysis can be divided into two parts: cluster validity indexes and clustering algorithms. We propose a novel external cluster validity index called CEI (Cluster Equality Index) and an efficient strategy called COS (Clustering Optimization Strategy). CEI constructs an equal confusion matrix by treating all clusters equally and performing greedy cluster pairing; this matrix is then used to match the clustering results against the ground truth labels for clustering quality evaluation. COS judges whether a cluster should be re-clustered by comparing the density at a cluster's centroid with that at its edges. This optimization strategy can be combined with existing clustering algorithms to further improve the quality of their clusters. Publicly available artificial datasets and real-world datasets are used in the experiments. The results show that CEI offers more accurate evaluation than other indexes, and that COS can further improve the quality of clustering. Keywords: Cluster analysis · Cluster Equality Index · Clustering Optimization Strategy · K-means Algorithm
1 Introduction

Cluster analysis is widely used in various data mining tasks, with the goal of dividing a dataset into several clusters [1]. The cluster validity index and the clustering algorithm are important research directions in cluster analysis [2].
The external cluster validity index is one of the most effective kinds of index. It characterizes the accuracy of clustering results by calculating the similarity between the clustering results and the ground truth labels, and can be categorized into pair-counting measures, information theory measures, and set matching measures [3]. The pair-counting measures evaluate the clustering effect by counting the numbers of sample pairs in various situations. The Rand Index (RI) takes the proportion of point pairs on which the clustering results are consistent with the ground truth labels as the value of the cluster validity

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 213–225, 2023. https://doi.org/10.1007/978-981-99-4752-2_18

214

Z. Huang et al.

index [4]. However, RI penalizes random partitions poorly. The Adjusted Rand Index (ARI) normalizes RI by its expected value, which effectively solves this problem [5]. The information theory measures use information entropy to measure the difference between two partitions. Mutual Information (MI) calculates the difference between the ground truth labels and the clustering results, while Normalized Mutual Information (NMI) restricts the index value to [0, 1] by normalizing MI, which enables effective comparison and analysis [6]. The set matching measures evaluate the clustering effect by calculating the similarity of each paired cluster. Purity evaluates the clustering effect by calculating the purity of each cluster [7]. The Pair Sets Index (PSI) uses the Hungarian algorithm to perform global matching and calculates the similarity of paired clusters [8].
The K-means algorithm is one of the most popular clustering algorithms. However, K-means has a critical limitation: the assignment of centroids and the number of clusters greatly affect the quality of the clustering results. There have been a number of studies on improving the accuracy of K-means [9]. Gengzhang et al. proposed an improved K-means algorithm based on density canopy, in which density parameters are incorporated to enhance the anti-noise performance and ensure the reliability of the algorithm [10]. Fabregas et al. focused on enhanced initial centroids for the K-means algorithm, integrating the computation of a weighted mean into the initial centroids to improve cluster quality [9]. Sieranoja et al. proposed two projection methods: one randomly selects two points to define the projection, and the other projects onto the line connecting the pair of points with the farthest distance [11]. The above cluster validity indexes and clustering algorithms are proposed from various angles.
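To make the pair-counting family concrete, the following self-contained sketch computes RI and ARI from the contingency table (these are the textbook formulas, not code from [4, 5]):

```python
from math import comb
from collections import Counter

def rand_and_adjusted_rand(labels_true, labels_pred):
    """Rand Index and Adjusted Rand Index from the contingency table."""
    n = len(labels_true)
    nij = Counter(zip(labels_true, labels_pred))  # contingency counts
    a = Counter(labels_true)                      # row sums
    b = Counter(labels_pred)                      # column sums
    sum_ij = sum(comb(c, 2) for c in nij.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    total = comb(n, 2)
    # RI: fraction of sample pairs on which the two partitions agree
    # (together in both, or apart in both).
    ri = (total + 2 * sum_ij - sum_a - sum_b) / total
    # ARI: the pair-counting statistic corrected for chance agreement.
    expected = sum_a * sum_b / total
    ari = (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
    return ri, ari

ri, ari = rand_and_adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0])
# identical partitions up to relabelling -> RI = ARI = 1
```

The chance correction is exactly the weakness of plain RI mentioned above: for random partitions RI stays well above 0, while ARI is centered at 0.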
However, these indexes do not consider the influence of cluster sizes, and these algorithms all lack an effective criterion for assessing the cluster centroids. In this paper, we propose a new external validity index called CEI and an efficient strategy called COS. CEI constructs an equal confusion matrix based on the idea that all clusters are equal and performs greedy cluster pairing to evaluate the clustering effect. COS determines whether a cluster needs to be re-clustered by comparing the density at its centroid with that at its edges.
The remainder of the paper is structured as follows. Section 2 describes CEI and COS in detail. The experiment settings are introduced in Sect. 3, and Sect. 4 gives a detailed result analysis. Section 5 concludes this study.
2 The Proposed Index and Optimization Strategy

2.1 Cluster Equality Validity Index

Let X = {x_1, ..., x_n} denote a dataset containing n samples, and let P = {P_1, ..., P_{k_p}} and G = {G_1, ..., G_{k_g}} represent its clustering results and ground truth labels, with k_p and k_g clusters, respectively. CEI constructs the cluster-equal confusion matrix by calculating the proportion of each cluster and performs greedy cluster pairing; the result is then used to calculate the similarity between the ground truth labels and the clustering results. The details of CEI are shown in Algorithm 1.
Cluster Equality Validity Index and Efficient
215
Firstly, the confusion matrix is deployed to show the classification results of the algorithm, where the rows represent the ground truth and the columns represent the clustering results. Each value in the matrix is the number of samples in the intersection of a ground truth class and a cluster:

n_{ij} = |G_i \cap P_j|.   (1)

where G_i represents a ground truth class and P_j represents a cluster in the clustering results. In order to eliminate the effect of the number of samples on the cluster validity index, we use the confusion matrix to calculate the proportion of each cluster and construct the cluster equality confusion matrix:

cn_{ij} = \frac{n_{ij}}{\sum_{j=1}^{k_p} n_{ij}}.   (2)
where n_{ij} represents the value in the confusion matrix, and k_p represents the number of clusters. In the calculation of the similarity of each cluster, a greedy strategy is designed to obtain the similarity of matched clusters. Let G_i and P_j denote two matched clusters in G and P, respectively:

sim(G_i, P_j) = \frac{2 \, cn_{ij}}{1 + \sum_{i=1}^{k_g} cn_{ij}}.   (3)
Then, the cluster pair with the highest similarity is added to the set, the selected clusters are excluded from further pairing, and the sim values of the corresponding pairs are set to 0. The specific process is:

s = \{ [i, j] \mid i = \arg\max_i sim_{ij}, \; j = \arg\max_j sim_{ij} \}.   (4)

S = S \cup s.   (5)

\{ sim_{i1}, ..., sim_{i k_p} \} = \{0, ..., 0\}, \; \forall [i] \in S.   (6)

\{ sim_{1j}, ..., sim_{k_g j} \} = \{0, ..., 0\}, \; \forall [j] \in S.   (7)
This process is repeated until all clusters are paired. Finally, the similarity values of the paired clusters are summed and normalized to [0, 1]:

CEI = \frac{\sum_{[i,j] \in S} sim(G_i, P_j)}{\max(k_g, k_p)}.   (8)

where S is the set of paired clusters.
Algorithm 1 CEI (Cluster Equality Index)
Input: Dataset X, ground truth labels G, clustering results P.
Step 1: Calculate the confusion matrix by Eq. (1).
Step 2: Get the cluster equality confusion matrix by Eq. (2).
Step 3: Obtain the similarity value of each matched cluster pair by Eq. (3).
Step 4: Generate an empty set S = {}.
repeat
  Step 5: Select the cluster pair with the highest similarity and add it to the set S by Eqs. (4) and (5).
  Step 6: Set the sim values of the selected cluster pair to 0 by Eqs. (6) and (7).
until all clusters are paired.
Step 7: Calculate the global similarity by Eq. (8).
Output: The value of CEI.
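Algorithm 1 translates almost directly into numpy. Note that the similarity of Eq. (3) is partly garbled in the source text; the form used below, sim_ij = 2·cn_ij / (1 + Σ_i cn_ij), is our reading of it:

```python
import numpy as np

def cei(labels_true, labels_pred):
    """Sketch of Algorithm 1 (CEI); steps follow Eqs. (1)-(8)."""
    gs = sorted(set(labels_true))
    ps = sorted(set(labels_pred))
    kg, kp = len(gs), len(ps)
    # Eq. (1): confusion matrix n_ij = |G_i ∩ P_j|.
    n = np.zeros((kg, kp))
    for t, p in zip(labels_true, labels_pred):
        n[gs.index(t), ps.index(p)] += 1
    # Eq. (2): row-normalise so that every cluster counts equally.
    cn = n / n.sum(axis=1, keepdims=True)
    # Eq. (3): similarity of every candidate cluster pair (our reading).
    sim = 2 * cn / (1 + cn.sum(axis=0, keepdims=True))
    # Eqs. (4)-(7): greedy pairing -- repeatedly take the best remaining
    # pair and zero out its row and column.
    total = 0.0
    for _ in range(min(kg, kp)):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        total += sim[i, j]
        sim[i, :] = 0.0
        sim[:, j] = 0.0
    # Eq. (8): normalise by the larger number of clusters.
    return total / max(kg, kp)

score = cei([0, 0, 1, 1], [1, 1, 0, 0])  # relabelled but perfect clustering
```

A perfect clustering scores 1 regardless of label names, and splitting or merging clusters lowers the score through the max(k_g, k_p) normalization.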
2.2 Clustering Optimization Strategy

COS evaluates the quality of the cluster centroids and further improves the quality of clustering. It compares the density at each cluster's centroid with that at the cluster's edges to find the clusters that need to be re-clustered. The details of COS are illustrated in Algorithm 2.
First, a clustering algorithm is used to form K clusters C = {C_1, ..., C_K} with cluster centroids u = {u_1, ..., u_K} on the dataset X. Then, two edge points are calculated for each cluster. The first edge point is the point farthest from the centroid of the cluster:

x_{max1}^{j} = \arg\max_{x_i \in C_j} \| u_j - x_i \|^2.   (9)
The second edge point is the point farthest from the first edge point:

x_{max2}^{j} = \arg\max_{x_i \in C_j} \| x_{max1}^{j} - x_i \|^2.   (10)

Based on the two edge points and the clustering results, the circle intersection region of the cluster centroid and the first edge point is:

Index_{max1}^{j} = \{ i \mid \| x_i - x_{max1}^{j} \|^2 \le \| u_j - x_{max1}^{j} \|^2 \}.   (11)

On the basis of this region, we calculate the first centroid radius r_{j,1} and the first edge radius r_1. These radii characterize the density at the cluster centroid and at the cluster edge, respectively:

r_{j,1} = \frac{ \sum_{i \in Index_{max1}^{j}} \| x_i - u_j \|^2 }{ n_j }.   (12)

r_1 = \frac{ \sum_{i \in Index_{max1}^{j}} \| x_i - x_{max1}^{j} \|^2 }{ n_j }.   (13)
Algorithm 2 COS (Clustering Optimization Strategy)
Input: Dataset X, the number of clusters K.
Step 1: Perform the clustering algorithm to get the clustering results C and centroids u.
repeat
  Step 2: Get the first/second cluster edge point by Eq. (9) and Eq. (10).
  Step 3: Calculate the circle intersection region of the first cluster edge point and the cluster centroid by Eq. (11). Then obtain the corresponding centroid radius (r_{j,1}) and edge radius (r_1) by Eq. (12) and Eq. (13).
  Step 4: Obtain the second centroid radius (r_{j,2}) and the second edge radius (r_2) by Eqs. (14), (15) and (16).
  Step 5: Compare the centroid radii with the edge radii; if r_{j,1} > r_1 and r_{j,2} > r_2, the strategy re-clusters the cluster.
until there are no clusters to be re-clustered.
Output: The clustering results.
In the same way, the second centroid radius r_{j,2} and the second edge radius r_2 are obtained as:

Index_{max2}^{j} = \{ i \mid \| x_i - x_{max2}^{j} \|^2 \le \| u_j - x_{max2}^{j} \|^2 \}.   (14)

r_{j,2} = \frac{ \sum_{i \in Index_{max2}^{j}} \| x_i - u_j \|^2 }{ n_j }.   (15)

r_2 = \frac{ \sum_{i \in Index_{max2}^{j}} \| x_i - x_{max2}^{j} \|^2 }{ n_j }.   (16)
Finally, the centroid radii and the edge radii are compared; if r_{j,1} > r_1 and r_{j,2} > r_2, the cluster is re-clustered.
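For a single cluster, the COS test of Eqs. (9)–(16) can be sketched as below (function and variable names are ours):

```python
import numpy as np

def cos_should_recluster(points, centroid):
    """Sketch of the COS test for one cluster C_j.
    points: (n_j, d) members of the cluster; centroid: its centre u_j.
    Returns True when both centroid radii exceed the edge radii, i.e.
    the cluster centre is less dense than its edges."""
    # Eqs. (9)-(10): the two edge points.
    e1 = points[np.argmax(((points - centroid) ** 2).sum(1))]
    e2 = points[np.argmax(((points - e1) ** 2).sum(1))]
    n_j = len(points)

    def radii(edge):
        # Eq. (11)/(14): points at least as close to the edge point as
        # the centroid is (the circle intersection region).
        mask = ((points - edge) ** 2).sum(1) <= ((centroid - edge) ** 2).sum()
        region = points[mask]
        # Eqs. (12)-(13) / (15)-(16): centroid radius vs. edge radius.
        r_c = ((region - centroid) ** 2).sum() / n_j
        r_e = ((region - edge) ** 2).sum() / n_j
        return r_c, r_e

    (r1c, r1e), (r2c, r2e) = radii(e1), radii(e2)
    return bool(r1c > r1e and r2c > r2e)

rng = np.random.default_rng(0)
blob = rng.normal(size=(60, 2)) * 0.1            # compact Gaussian cluster
theta = np.linspace(0, 2 * np.pi, 40, endpoint=False)
ring = np.c_[np.cos(theta), np.sin(theta)]       # dense edges, empty centre
```

On these two toy shapes the test behaves as intended: the compact blob keeps its centroid, while the ring (whose mass sits at the edges) is flagged for re-clustering.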
3 Datasets and Evaluation Measures

3.1 The Datasets of CEI

Two kinds of artificial datasets are produced as introduced in [8]: random partitions and cluster-size-imbalanced partitions. To compare the cluster validity indexes fairly, we also compare the ability of each index to find the optimal K on the UCI datasets [12].
Artificial Datasets of Random Partitions. These datasets are generated to study the influence of balance and imbalance in random partitions. As illustrated in Fig. 1, the samples are randomly allocated to K clusters, where K ranges from 1 to 20.
Artificial Datasets of Imbalanced Partitions. These datasets are generated to study the impact of the imbalance problem on the cluster validity index. As illustrated in Fig. 2, the numbers of samples in the first category (red in G) and the second category (orange
Fig. 1. Schematic diagram of random partitions. The upper part represents a balanced distribution, and the lower part an unbalanced distribution. G represents the ground truth labels and P the clustering result labels; P1 to P5 cover the range of K in [1, 18]. Different colors represent different clusters, and the numbers indicate the cluster sample counts.
Fig. 2. Schematic diagram of cluster size imbalanced partitions. G represents the ground truth labels, and P represents the clustering results labels. Different colors represent different clusters, and the number indicate the number of cluster samples.
in G) in the ground truth label remains at 1000, while the number of samples in the third category (grey in G) increases from 50 to 2000 at intervals of 50.

Real-World Datasets. Eight datasets are deployed as described in [12]. Their sample sizes range from 214 to 2310, the numbers of clusters from 3 to 10, and the attribute dimensions from 5 to 34. The details are shown in Table 1.

Table 1. 8 UCI Datasets Description.

Dataset   Instances  Features  Clusters
Car       1728       6         4
Dermat    366        34        6
Ecoli     336        7         8
Segment   2310       19        7
Statlog   2310       19        7
Yeast     1484       8         10
Cluster Equality Validity Index and Efficient
219
3.2 The Datasets of COS

Experiments on artificial datasets and real-world datasets are conducted to verify the effectiveness of COS. The details are as follows.
Fig. 3. The effect of the cluster validity indexes on random partitions. The upper part represents the result of balanced partitions, the lower part represents the result of unbalanced partitions.
Artificial Datasets. Six publicly available artificial datasets are used, named A1, A2, A3, S1, S2, and S3. As illustrated in Table 2, A1, A2, and A3 are used to study the effect of the clustering strategy as the number of clusters gradually increases [13]. S1, S2, and S3 are artificial datasets specifically used to study the impact of overlapping on the clustering strategy [14].

Real-World Datasets. The real-world datasets of COS are the same as those of CEI. The details are shown in Table 1.
3.3 The Evaluation Measures of CEI

Three methods are deployed to analyze the effectiveness of each cluster validity index on the clustering results of the artificial datasets.
Table 2. Public Artificial Datasets Description.

Dataset  Instances  Features  Clusters
A1       3000       2         20
A2       5250       2         35
A3       7500       2         50
S1       5000       2         15
S2       5000       2         15
S3       5000       2         15
Fig. 4. The effect of the cluster validity indexes on cluster size imbalanced partitions.
Overall trend analysis: by analyzing whether the overall trend of each cluster validity index meets expectations, we can judge the pros and cons of the index.

Mutational point analysis: we analyze whether each cluster validity index has a knee point and how it performs at the knee point, which helps us understand the characteristics of the index in detail.

Slope analysis: to compare the cluster validity indexes more deeply, it is necessary to analyze the change of the slope, which reveals the sensitivity of the index to various factors.

For experiments on real-world datasets, we compare the performance of the different cluster validity indexes, measured mainly by the hit rate (HITRate) of the optimal K and the mean squared error (MSE) of the optimal K. The HITRate of the optimal K counts how often the K determined by the cluster validity index equals the actual number of clusters. The specific process is as follows:

k_opt = argmax_{k ∈ [min(2, k_g − 10), k_g + 10]} CVI(k)    (17)

I(k_opt, k) = 1 if k_opt = k, and 0 otherwise    (18)
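The selection of the optimal K in Eq. (17) amounts to scanning a window of candidate values around the ground-truth number of clusters k_g and keeping the one that maximizes the validity index. A minimal sketch, with a hypothetical toy CVI of our own (the real index would be CEI or a competitor evaluated on actual clusterings):

```python
def select_optimal_k(cvi, k_g):
    """Eq. (17): pick the K in the window around k_g that maximizes the
    cluster validity index `cvi` (a callable K -> score)."""
    candidates = range(min(2, k_g - 10), k_g + 11)
    return max(candidates, key=cvi)

# Toy validity index that peaks at K = 5, purely for illustration
print(select_optimal_k(lambda k: -abs(k - 5), k_g=6))  # 5
```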
HITRate = ( Σ_{i=1}^{n} I(k_opt^i, k) ) / n    (19)

where n represents the number of executions of the clustering algorithm. The MSE of the optimal K is the squared error between the optimal K determined by each cluster validity index and the actual number of clusters. It is defined as:

MSE = ( Σ_{i=1}^{n} (k_opt^i − k)² ) / n    (20)

where n is the number of runs.

3.4 The Evaluation Measures of COS

We use COS to judge the effectiveness of clusters and to recluster them, and we use Purity to evaluate the clustering effect before and after applying COS; we then calculate the corresponding mean and variance. Purity is defined as:

Purity = (1/n) Σ_{j=1}^{k_p} max_i n_ij    (21)

where n represents the total number of samples and n_ij is the number of samples of ground-truth class i in cluster j.
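The three evaluation measures above, Eqs. (19)–(21), can be sketched as follows. This is an illustrative implementation with helper names of our own, not the authors' code:

```python
import numpy as np

def hit_rate(k_opt_runs, k_true):
    """HITRate (Eq. 19): fraction of runs whose selected K equals the true K."""
    return float(np.mean(np.asarray(k_opt_runs) == k_true))

def mse(k_opt_runs, k_true):
    """MSE (Eq. 20): mean squared error between the selected K and the true K."""
    return float(np.mean((np.asarray(k_opt_runs) - k_true) ** 2))

def purity(labels_true, labels_pred):
    """Purity (Eq. 21): sum, over predicted clusters, of the size of the
    dominant ground-truth class, divided by the total number of samples."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    total = sum(np.bincount(labels_true[labels_pred == c]).max()
                for c in np.unique(labels_pred))
    return total / len(labels_true)

print(hit_rate([4, 4, 5, 3, 4], 4))                    # 0.6
print(mse([4, 4, 5, 3, 4], 4))                         # 0.4
print(purity([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1]))  # (2 + 2) / 6 ≈ 0.667
```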
4 Experimental Analysis

4.1 Experiment and Analysis of CEI on Artificial Datasets

Experiment and Analysis of CEI on Random Partitions. The experimental results of CEI and the other four cluster validity indexes on the random partitions are illustrated in Fig. 3. We analyze the results from three perspectives.

Overall trend analysis: the indexes should be low overall. Figure 3 shows that Purity is significantly higher than the other cluster validity indexes, while CEI, PSI, NMI, and ARI are relatively low (Table 3).

Mutational point analysis: CEI and PSI have a mutational point at k = 3. We believe the amount of randomness differs across random partitions. Considering the three random partitions k = 1, k = 3, and k = 20, the random partition of k = 3 obviously contains more information than those of k = 1 and k = 20.

Slope analysis: as mentioned in the mutational point analysis, there should be a process of slope change near k = 3, which reflects the sensitivity of the validity index to different k. CEI and PSI both show slope changes, of which CEI changes the most.

Experiments and Analysis of CEI on Imbalanced Partitions. The experimental results are illustrated in Fig. 4. We again analyze the results from three perspectives.

Overall trend analysis: because only the size of the third cluster is changed, the cluster validity indexes should remain unchanged. We find that CEI and PSI meet this expectation.
Mutational point analysis: as mentioned in the overall trend analysis, none of the indexes should change, so there should be no mutational point. All the cluster validity indexes meet this expectation.

Slope analysis: the slope of a cluster validity index should remain stable on the imbalanced partitions, but only CEI and PSI meet this expectation.

The experimental results show that, compared with the other indexes, CEI performs best on both the random partitions and the imbalanced partitions.

4.2 Experiment and Analysis of CEI on Real-World Datasets

We run the K-means algorithm 100 times on each dataset [15] and use HITRate and MSE to evaluate the ability to find the optimal K. As Table 3 shows, CEI performs better than the other cluster validity indexes on most datasets.

Table 3. HITRate and MSE on the 8 UCI Datasets.

Dataset   ARI          NMI          PSI          Purity       CEI
Car       0.25(2.510)  0.02(27.35)  0.57(0.850)  0.03(14.32)  0.65(0.380)
Dermat    0.00(62.19)  0.00(72.64)  0.04(10.23)  0.00(38.39)  0.38(0.960)
Ecoli     0.00(11.99)  0.00(9.200)  0.35(0.830)  0.35(0.840)  0.18(1.320)
Segment   0.00(15.74)  0.00(52.73)  0.30(0.730)  0.01(3.570)  0.30(0.700)
Statlog   0.00(18.15)  0.00(52.39)  0.40(0.630)  0.10(3.510)  0.40(0.630)
Yeast     0.00(8.550)  0.13(9.470)  0.35(1.730)  0.14(1.950)  0.39(1.130)
4.3 Experiment and Analysis of COS on Artificial Datasets

COS mainly optimizes the results of initial cluster centroids. This study uses four different initialization algorithms: random cluster centroids [16], random partition [17], sort initialization [18], and K-means++ [19]. We calculate the mean and variance of Purity for COS and for the corresponding initial clustering algorithms. The experimental results are illustrated in Table 4. We find that COS outperforms the corresponding clustering algorithm in most cases.

4.4 Experiment and Analysis of COS on Real-World Datasets

We perform the initial clustering algorithms and COS on each dataset, then calculate the mean and variance of Purity, as shown in Table 5. In every two rows, the first row shows the results of the initial clustering algorithm on the corresponding dataset, and the second shows the clustering results after applying COS. Bold indicates the relatively better clustering results. The results show that COS performs better than the initial clustering algorithms.
Table 4. Purity on the 6 Public Artificial Datasets.

Dataset  Partition        Random           K-means++        Sort
A1       0.5437 ± 0.0655  0.8640 ± 0.0526  0.9265 ± 0.0366  0.8619 ± 0.0451
COS      0.5353 ± 0.0609  0.8742 ± 0.0480  0.9280 ± 0.0344  0.8686 ± 0.0421
A2       0.2989 ± 0.0335  0.8402 ± 0.0403  0.9135 ± 0.0287  0.8481 ± 0.0382
COS      0.2958 ± 0.0353  0.8613 ± 0.0350  0.9165 ± 0.0272  0.8701 ± 0.0335
A3       0.2083 ± 0.0253  0.8185 ± 0.0335  0.9093 ± 0.0224  0.8299 ± 0.0312
COS      0.2107 ± 0.0302  0.8457 ± 0.0309  0.9163 ± 0.0216  0.8539 ± 0.0274
S1       0.4477 ± 0.0803  0.5280 ± 0.0679  0.9348 ± 0.0411  0.5068 ± 0.0702
COS      0.4493 ± 0.0805  0.5303 ± 0.0736  0.9411 ± 0.0420  0.5070 ± 0.0774
S2       0.5414 ± 0.0604  0.8387 ± 0.0636  0.9054 ± 0.0399  0.8513 ± 0.0726
COS      0.5534 ± 0.0635  0.8428 ± 0.0670  0.9095 ± 0.0409  0.8567 ± 0.0720
S3       0.6049 ± 0.0343  0.7148 ± 0.0506  0.8013 ± 0.0334  0.7174 ± 0.0574
COS      0.6154 ± 0.0221  0.7153 ± 0.0508  0.8024 ± 0.0335  0.7175 ± 0.0574
Table 5. Purity on the 8 UCI Datasets.

Dataset  Partition        Random           K-means++        Sort
Car      0.7210 ± 0.0252  0.7252 ± 0.0303  0.7225 ± 0.0281  0.7179 ± 0.0261
COS      0.7221 ± 0.0254  0.7269 ± 0.0312  0.7236 ± 0.0286  0.7194 ± 0.0274
Dermat   0.3936 ± 0.0334  0.3697 ± 0.0286  0.3634 ± 0.0252  0.3702 ± 0.0308
COS      0.3906 ± 0.0305  0.3787 ± 0.0268  0.3754 ± 0.0295  0.3807 ± 0.0282
Ecoli    0.8109 ± 0.0255  0.8085 ± 0.0275  0.8131 ± 0.0270  0.8145 ± 0.0230
COS      0.8137 ± 0.0257  0.8180 ± 0.0274  0.8148 ± 0.0260  0.8150 ± 0.0249
Segment  0.5601 ± 0.0330  0.5527 ± 0.0328  0.5191 ± 0.0444  0.5574 ± 0.0330
COS      0.5660 ± 0.0287  0.5593 ± 0.0327  0.5404 ± 0.0401  0.5628 ± 0.0333
Statlog  0.5472 ± 0.0407  0.5573 ± 0.0314  0.5289 ± 0.0444  0.5608 ± 0.0314
COS      0.5540 ± 0.0363  0.5604 ± 0.0336  0.5445 ± 0.0387  0.5654 ± 0.0325
Yeast    0.5168 ± 0.0143  0.5204 ± 0.0121  0.5120 ± 0.0197  0.5223 ± 0.0121
COS      0.5175 ± 0.0140  0.5210 ± 0.0125  0.5148 ± 0.0183  0.5224 ± 0.0128
5 Conclusion

In this paper, we propose a novel cluster validity index called CEI and an efficient clustering optimization strategy called COS. CEI deploys the proportion of matched samples in clusters to treat each cluster equally, and uses a greedy search strategy to match the clustering results with the ground truth labels. It calculates the similarity of
each clustering result and the ground truth label to generate their overall similarity. COS selects the clusters that need to be reclustered based on the assumption that the density at the centroid of a cluster should be greater than that at its edge. According to the experimental results, CEI and COS perform better than the other indexes and algorithms. It should be noted that the density-based strategy is best suited to density-based clustering algorithms. In the future, we will further explore new density-based indexes and combine them to improve density-based clustering algorithms.

Acknowledgements. This research was supported by the School of Film Xiamen University - Dongbo Future Artificial Intelligence Research Institute Co., Ltd. Joint Laboratory for Creating the Metaverse (School Agreement No. 20223160C0026), the School of Film Xiamen University - Xiaozhi Deep Art Artificial Intelligence Research Institute Co., Ltd. Computational Art Joint Laboratory (School Agreement No. 20213160C0032), and the School of Information Xiamen University - Xiamen Yinjiang Smart City Joint Research Center (School Agreement No. 20213160C0029).
References

1. Ahmed, M., Seraj, R., Islam, S.M.S.: The k-means algorithm: a comprehensive survey and performance evaluation. Electronics 9(8), 1295 (2020)
2. Xiong, H., Li, Z.: Clustering validation measures. In: Aggarwal, C.C., Reddy, C.K. (eds.) Data Clustering: Algorithms and Applications, pp. 571–606. Chapman and Hall/CRC (2018)
3. Amelio, A., Pizzuti, C.: Correction for closeness: adjusting normalized mutual information measure for clustering comparison. Comput. Intell. 33(3), 579–601 (2017)
4. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)
5. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010)
6. Kvalseth, T.O.: Entropy and correlation: some comments. IEEE Trans. Syst. Man Cybern. 17(3), 517–519 (1987)
7. Rendón, E., Abundez, I., Arizmendi, A., et al.: Internal versus external cluster validity indexes. Int. J. Comput. Commun. 5(1), 27–34 (2011)
8. Rezaei, M., Fränti, P.: Set matching measures for external cluster validity. IEEE Trans. Knowl. Data Eng. 28(8), 2173–2186 (2016)
9. Fabregas, A.C., Gerardo, B.D., Tanguilig, B.T., III: Enhanced initial centroids for k-means algorithm. Int. J. Inf. Technol. Comput. Sci. 1, 26–33 (2017)
10. Zhang, G., Zhang, C., Zhang, H.: Improved K-means algorithm based on density Canopy. Knowl.-Based Syst. 145, 289–297 (2018)
11. Sieranoja, S., Fränti, P.: Random projection for k-means clustering. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds.) Artificial Intelligence and Soft Computing: 17th International Conference, ICAISC 2018, Zakopane, Poland, June 3–7, 2018, Proceedings, Part I, pp. 680–689. Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-91253-0_63
12. Dua, D., Graff, C.: UCI machine learning repository (2017)
13. Kärkkäinen, I., Fränti, P.: Dynamic local search algorithm for the clustering problem. University of Joensuu, Joensuu, Finland (2002)
14. Fränti, P., Virmajoki, O.: Iterative shrinking method for clustering problems. Pattern Recogn. 39(5), 761–775 (2006)
15. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. Stanford (2006)
16. Forgy, E.W.: Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21, 768–769 (1965)
17. Gupta, M.K., Chandra, P.: P-k-means: k-means using partition based cluster initialization method. SSRN Electron. J. (2019). https://doi.org/10.2139/ssrn.3462549
18. Cao, F., Liang, J., Jiang, G.: An initialization method for the K-means algorithm using neighborhood model. Comput. Math. Appl. 58(3), 474–483 (2009)
19. Ismkhan, H.: Ik-means+: an iterative clustering algorithm based on an enhanced version of the k-means. Pattern Recogn. 79, 402–413 (2018)
Modeling Portraits of Students and Exercises for Exercise Recommendation Weiwei Gao1 , Huifang Ma1(B) , Yan Zhao1 , Jing Wang1 , and Quanhong Tian2 1 College of Computer Science and Engineering, Northwest Normal University, Lanzhou, China
[email protected] 2 School of Educational Technology, Northwest Normal University, Lanzhou, China
Abstract. Exercise recommendation aims at providing personalized recommendation to assist students with explicit learning directions, which has become an important and popular component of online learning platforms. However, previous efforts have largely followed the traditional collaborative filtering paradigm, which often sees students (exercises) as separate from each other, and the implicit connections between students (exercises) have been largely ignored. To this end, in this paper, we target at developing a new paradigm for exercise recommendation via Modeling Portraits of Students and Exercises (MPSE), to reveal the latent connections between student-student and exercise-exercise. In addition, the diversity of key factors that measure the dissimilarity between recommended exercises is appropriately introduced to ensure student satisfaction. Technically, a collaborative student exercise graph is created and a new random walk strategy is implemented to take full advantage of the spectral properties of nearly uncoupled Markov chains. This allows for full exploration of both similar exercises that students have completed and connections between students (exercises) with similar portraits. In addition, a simulated annealing framework is presented to obtain different exercise suggestions. Extensive experiments on two public datasets demonstrated that MPSE can satisfy novelty, accuracy, and diversity. Keywords: Exercise Recommend · Joint Random Walk · Optimization · Personalized Learning
1 Introduction

With the increasing popularity of online education, various teaching resources are flourishing on the Internet, which greatly facilitates students' learning and makes online learning acceptable and adaptable to students. However, with the growth of network resources, how to recommend suitable resources [1–3] to students has become a hot research topic. Exercise, as one of the most important resources, plays a vital role in consolidating and improving students' conceptual knowledge. Recently, numerous research efforts have been dedicated to exercise recommendation [4–6]. Classical exercise recommendation algorithms attempt to learn vector representations of students and exercises.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 226–236, 2023. https://doi.org/10.1007/978-981-99-4752-2_19

HB-DeepCF [7] is a typical hybrid recommendation method via a deep collaborative filtering model, which obtains a student's (an exercise's)
embedding by mapping from pre-existing ratings. More recently, researchers have adopted cognitive diagnosis models to reveal students' mastery of knowledge concepts and advance recommendation performance [8, 9]. For instance, KCP-ER [10] designs a knowledge concept prediction module to obtain embeddings of students' knowledge concept coverage and mastery, and then an exercise set filtering module is presented to filter the exercises. LSTMCQP [11] designs a knowledge tracing method to capture students' knowledge states and then adopts a personalized LSTM method to perform exercise recommendation. We argue that an inherent limitation of such methods is that fine-grained representations of both students and exercises are not encoded in a proper way. As a result, the recommendation results may not be sufficient to capture 'suitable' preferences for students.
Fig. 1. A toy example of exercise recommendation. (a) Students’ mastery of knowledge with past response records. (b) Difficulty of exercises pertinent to each knowledge concept and the exercises-knowledge concepts indication.
Exercise recommendation commits to recommending 'suitable' exercises to students. Essentially, an ideal exercise recommender system should satisfy novelty, accuracy, and diversity. Figure 1 reveals the importance of these three elements during the recommendation process. Novelty means that a recommended exercise may contain a new knowledge concept that the student does not know (or answered incorrectly before), which helps the student learn new knowledge. For example, in Fig. 1(a) and (b), e2 and e4 are recommended to Ann to practice the new knowledge concepts k5 and k6. Accuracy suggests that the exercises recommended to students should be of appropriate difficulty. Too-difficult exercises may exert a negative impact on students' learning motivation, while too-easy exercises will reduce students' learning interest. Thus, the exercises recommended to students should be of appropriately accurate difficulty. As can be seen from Fig. 1(a) and (b), Ann has answered e1 correctly, and e4 is of similar difficulty to e1, so it is reasonable to recommend e4 to Ann. Diversity reflects that the knowledge concepts of the recommended exercises should differ. Recommending a list of exercises that cover different knowledge concepts will make students feel enthusiastic and will eventually assist them in mastering the knowledge comprehensively. Towards these insights, in this work we keep abreast of the ongoing developments of cognitive diagnosis [12] and relevant random walk techniques [13, 14], and propose a novel algorithm MPSE (Model Portraits of Students and Exercises) for exercise recommendation. Our aim is to incorporate the above three recommendation goals into the recommendation process. Specifically, fine-grained representations of both students and exercises are modelled, based on which the Collaborative
228
W. Gao et al.
Student Exercise Graph (CSEG) is constructed. Furthermore, we consider a joint random walk mechanism with the spectral properties of nearly uncoupled Markov chains [15] for improving recommendation accuracy and meeting novelty requirement. Subsequently, the desired list of recommended exercises is obtained by solving an optimization problem to satisfy diversity. The major contributions of this paper are summarized as follows: • We highlight the importance of explicitly modeling the fine-grained portraits to construct CSEG, which is conducive to better recommendation results. • We develop a new method MPSE, which exploits the joint random walk mechanism with the spectral properties of nearly uncoupled Markov chains. • We conduct extensive experiments on two public datasets to demonstrate that MPSE can provide students with novel, accurate, and diverse exercises.
2 Notations and Problem Statement

We first summarize important notations and their definitions, followed by the concept of CSEG. Let S = {s1, s2, s3, ..., sN} be the set of N students, E = {e1, e2, e3, ..., eM} be the set of M exercises, and K = {k1, k2, k3, ..., kZ} be the set of Z knowledge concepts. The student-exercise interaction matrix is defined as R = [Rij] ∈ {0, 1}, where Rij = 1 indicates that student si correctly answers exercise ej, and otherwise Rij = 0. Besides, the exercise-knowledge concept incidence matrix is defined as Q = [Qij], where Qij = 1 if exercise ei relates to knowledge concept kj, and otherwise Qij = 0.

Student-Exercise Bipartite Graph: It is defined as G1 = (S ∪ E, ε1), where ε1 denotes the set of edges that connect a student and an exercise based on R, indicating whether there is an observed interaction between s and e.

Student-Student Similarity Graph: The student-student similarity graph is defined as G2 = (S, ε2), where ε2 denotes the set of edges. An edge in ε2 is built if and only if the similarity between the students exceeds ω.

Exercise-Exercise Similarity Graph: The exercise-exercise similarity graph is defined as G3 = (E, ε3), where ε3 denotes the set of edges. An edge in ε3 is built if and only if the similarity between the exercises exceeds θ.

Collaborative Student Exercise Graph (CSEG): The CSEG encodes students' response behaviors together with student (exercise) relationships as a unified heterogeneous graph G = (S ∪ E, ε1 ∪ ε2 ∪ ε3), as illustrated in Fig. 2.
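As a concrete illustration of the notation, the edge set ε1 of the bipartite graph G1 can be read directly off R. The toy matrix below is our own invention, not taken from the datasets:

```python
import numpy as np

# Toy interaction matrix R (N=3 students x M=4 exercises): R[i, j] = 1 means
# student s_i correctly answered exercise e_j. Values are illustrative only.
R = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 1]])

# Edges of the student-exercise bipartite graph G1 = (S ∪ E, ε1)
edges = [(f"s{i}", f"e{j}") for i, j in zip(*np.nonzero(R))]
print(edges)
# [('s0', 'e0'), ('s0', 'e2'), ('s1', 'e1'), ('s2', 'e0'), ('s2', 'e1'), ('s2', 'e3')]
```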
3 MPSE Model

We now formulate the exercise recommendation task addressed in this paper as follows: given the student-exercise response matrix R and the exercise-knowledge concept incidence matrix Q, the goal is to provide each student with novel, accurate, and diverse exercises. The framework is shown in Fig. 2.
3.1 Construction of CSEG

In this subsection, we introduce the way of building portraits for both students and exercises by utilizing key factors from past response records. On the one hand, the two most important factors that determine a student's exercise preference are the knowledge proficiency level and the knowledge coverage. On the other hand, the recommended exercises are determined by both their difficulty level and their related knowledge concepts. Thus, a natural idea is to portray the profiles of students and exercises from these aspects.
Fig. 2. The model framework of the presented MPSE, which consists of three main tasks: (1) the fine-grained portrait of the student and the exercise construction of CSEG (Sect. 3.1); (2) the importance ranking of the exercises through a joint random walk (Sect. 3.2); (3) the final list of exercise recommendations with a multi-objective optimization (Sect. 3.3).
Here we employ NeuralCD [12], a widely used cognitive diagnosis method that yields accurate and interpretable diagnosis results. It is an effective way not only to parameterize students' knowledge proficiency but also to project exercises into difficulty vector representations. After the parameter estimation training of NeuralCD, both the students' proficiency and the exercises' difficulty matrices are obtained, which serve as part of the student profile and the exercise profile, respectively. Specifically, we denote the proficiency matrix as A ∈ R^{N×Z} for all students and the difficulty matrix as B ∈ R^{M×Z} for all exercises. Note that each entry of A/B is continuous ([0, 1]), indicating the student's proficiency (the exercise's difficulty) on a knowledge concept.

Similarity Between Students. With all students' mastery of the knowledge concepts A, a student can be partly represented by the degree of mastery of knowledge concepts as m_si = softmax(x_si^T · A). We then denote the knowledge concept coverage of a student as another feature portrait: c_si = softmax(x_si^T · R · Q). After obtaining the above two representations m_si and c_si, a criterion for similar students should be defined to provide students with suitable exercises. Typically, similar students share similar levels of knowledge mastery, which enables similar students to refer to each other. However, in order to recommend novel exercises, the difference between the exercises correctly answered by students should be as large as possible, since such students are inclined to reveal novel exercises. Hence, similar students
are defined as students with similar knowledge mastery and the greatest possible difference in knowledge coverage. In a word, similar students should satisfy the following two conditions: (1) similar mastery of knowledge concepts; (2) correctly answering exercises with large differences in knowledge concept coverage. Consequently, the similarity between si and sj is defined with the Gaussian kernel function as:

w^s_ij = exp(−‖m_si − m_sj‖² / 2σ²) − exp(−‖c_si − c_sj‖² / 2σ²)    (1)

where σ is the bandwidth that controls the local scope of the Gaussian kernel. Students whose similarity w^s_ij is higher than ω are defined as similar students and form the student similarity matrix W^s; otherwise w^s_ij = 0.

Similarity Between Exercises. Exercise factors are divided into two categories. The first factor is exercise difficulty, which is crucial for keeping the recommendation results accurate. We can take each row of the exercise difficulty matrix B as d_ej = sigmoid(x_ej^T · B), where x_ej is the exercise's one-hot vector. The second factor is knowledge relevancy, suggesting the connection between exercises and knowledge concepts; it is given in advance (e.g., the Q-matrix). We denote it as q_ej = x_ej^T · Q, indicating the relevancy between the exercise and the knowledge concepts. The comprehensive representation of ej is acquired from the above two representations as ej = d_ej ⊙ q_ej, where ⊙ is element-wise multiplication. We then define the similarity between exercises ei and ej with the Gaussian kernel function:

w^e_ij = exp(−‖e_i − e_j‖² / 2σ²)    (2)

Exercises for which w^e_ij is higher than θ are defined as similar exercises and form the exercise similarity matrix W^e; otherwise w^e_ij = 0.
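The two Gaussian-kernel similarities of Eqs. (1) and (2) can be sketched as follows. This is an illustrative implementation with function names and toy vectors of our own:

```python
import numpy as np

def student_similarity(m_i, m_j, c_i, c_j, sigma=1.0):
    # Eq. (1): kernel on mastery minus kernel on coverage, so the score is
    # high when mastery is similar but the covered exercises differ.
    k_m = np.exp(-np.sum((m_i - m_j) ** 2) / (2 * sigma ** 2))
    k_c = np.exp(-np.sum((c_i - c_j) ** 2) / (2 * sigma ** 2))
    return k_m - k_c

def exercise_similarity(e_i, e_j, sigma=1.0):
    # Eq. (2): plain Gaussian kernel on the difficulty/relevancy portraits.
    return np.exp(-np.sum((e_i - e_j) ** 2) / (2 * sigma ** 2))

m = np.array([0.9, 0.1])
c1, c2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(student_similarity(m, m, c1, c2) > 0)   # True: same mastery, different coverage
print(exercise_similarity(c1, c1))            # 1.0: identical portraits
```

Note how the subtraction in Eq. (1) makes the score largest when mastery vectors coincide while coverage vectors differ, matching condition (2) above.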
3.2 Joint Random Walk

In this subsection, we present a novel joint random walk framework on the CSEG for exercise recommendation, which exploits the implicit potential of student (exercise) associations. Our model endows the joint random walk with the inherent ability to propagate these relations across the student (exercise) space and exploits the rich network of interactions they form. Consequently, a walker jumps not only between students and exercises but also between students (and between exercises). In other words, students (exercises) with similar portraits give valuable clues to the walker, making it possible to fully explore similar students (exercises).

Initially, the block transition matrix of the student-exercise bipartite graph is defined as H = [0, R; R^T, 0]. We then normalize H to obtain J_R as follows:

J_R = Diag(H · 1)^{-1} · H = [0, Diag(R · 1)^{-1} · R; Diag(R^T · 1)^{-1} · R^T, 0]    (3)
where 1 denotes the all-ones vector.

We then construct a block transition matrix M from the student similarity matrix W^s and the exercise similarity matrix W^e as M = [W^s, 0; 0, W^e]. Equally, a similar normalization operation is performed on M to get J_SE:

J_SE = Diag(M · 1)^{-1} · M    (4)

The transition probability matrix J controlling the walk behavior can then be represented as a weighted sum of the above transition probability matrices J_R and J_SE:

J = α · J_R + (1 − α) · J_SE    (5)

where α controls the degree of involvement of the two components in the final random walk process. Finally, we apply a restart policy to the joint random walk, defined as v_{t+1} = β · J · v_t + (1 − β) · v_0, where v_0 is a vector that contains the element 1 at the position corresponding to the target student and zeros elsewhere, and v_{t+1} is the node visiting probability vector at step t + 1. A higher value in v_{t+1} indicates that the corresponding exercise is more likely to be recommended to the student.

It is worth noting that the above recommendation paradigm explores the CSEG with the advantage of the spectral properties of a nearly uncoupled Markov chain, which can be stated in the following theorem.

Theorem 1. Let J be the joint transition probability matrix with α ∈ (0, 1) defined over the CSEG, and let λ(J) be the set of eigenvalues of J. Irrespective of the similar students (exercises) used to define the matrix W^s (W^e), it holds that: (a) 1 − 2α ∈ λ(J); (b) the Markov chain with probabilistic transition matrix J will be almost decoupled into at least 2 blocks when α is sufficiently small.

Clearly, for small values of the parameter α, the walk chain is nearly uncoupled into two blocks, thereby allowing the random walk dynamics towards equilibrium to disentangle into a slow-mixing and a fast-mixing component. In summary, the joint random walk prolongs the influence of the personalized initialization on the successive K-step landing probabilities of the walker, and thus produces exercise recommendations with novelty and accuracy by eliminating the need to stop the walks early.
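A minimal sketch of Eqs. (3)–(5) plus the restart iteration follows. The function name, toy matrices, and the choice to multiply by J^T in the iteration (since our J is built row-stochastic) are our own assumptions, not the authors' implementation:

```python
import numpy as np

def joint_rwr(R, Ws, We, alpha=0.01, beta=0.7, steps=200):
    """Sketch: mix the bipartite walk J_R (Eq. 3) with the similarity walk
    J_SE (Eq. 4) into J (Eq. 5), then run the restart iteration from a
    one-hot vector placed on the target student (here: the first student)."""
    N, M = R.shape
    H = np.block([[np.zeros((N, N)), R],
                  [R.T, np.zeros((M, M))]])
    JR = H / np.maximum(H.sum(axis=1, keepdims=True), 1e-12)       # Eq. (3)
    Msim = np.block([[Ws, np.zeros((N, M))],
                     [np.zeros((M, N)), We]])
    JSE = Msim / np.maximum(Msim.sum(axis=1, keepdims=True), 1e-12)  # Eq. (4)
    J = alpha * JR + (1 - alpha) * JSE                              # Eq. (5)
    v0 = np.zeros(N + M); v0[0] = 1.0
    v = v0.copy()
    for _ in range(steps):
        v = beta * (J.T @ v) + (1 - beta) * v0                      # restart policy
    return v[N:]   # visiting probabilities of the exercises, used for ranking

# Toy graph: student s0 answered e0 and e2; s1 answered e1; identity similarities
R = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
scores = joint_rwr(R, np.eye(2), np.eye(3))
print(scores.argmax())  # 0 (e0 and e2 tie; e1, connected only to s1, gets no mass)
```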
3.3 Optimize the Recommendation List

In this subsection, in order to get a diverse list of recommendations, we first choose, from the stationary distribution of the random walk vector, the exercises with the top-P
largest scores as the candidate recommendation list, denoted D0. We take D0 as a high-dimensional space in which each exercise represents a point. We then follow the simulated annealing algorithm [16] to solve a combinatorial optimization problem for obtaining a diverse recommendation list. Some exercises are replaced randomly, the optimization factors are compared, and a new version is continuously generated by updating the exercises in the list one by one. Concretely, D1 is a replicated version of the top-L exercises of D0. We first replace some of the exercises in the original D1 to get D2. Then we calculate D'1 and D'2, the distance matrices of the exercises in D1 and D2, respectively. If the exercise distance of D2 is larger than that of D1, we accept D2 as the new version. Otherwise, we calculate the acceptance probability γ and compare it with a randomly generated r to decide whether to accept D2. γ is calculated by:

γ = exp( −(mean(D'1) − mean(D'2)) / (k_B × T) )    (6)

where mean(D'1) indicates the average value of the distance matrix of list D1, k_B is the Boltzmann constant, and T represents the temperature. The mean values of the distance matrices of all accepted versions are compared, and the version with the largest distance divergence is kept as the final recommendation list D with diversity.
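The acceptance rule around Eq. (6) can be sketched as follows, under our reading of the reconstructed equation (the function name is ours; a full annealing loop would wrap this in a temperature schedule and random swaps):

```python
import math
import random

def anneal_accept(mean_d1, mean_d2, T, kB=1.0):
    """Accept a candidate list D2 over D1: always if its average pairwise
    distance improved; otherwise with probability gamma from Eq. (6)."""
    if mean_d2 > mean_d1:
        return True
    gamma = math.exp(-(mean_d1 - mean_d2) / (kB * T))
    return random.random() < gamma

random.seed(0)
print(anneal_accept(0.8, 0.9, T=1.0))  # True: the swap improved diversity
```

At high temperature T, even distance-reducing swaps are often accepted, letting the search escape local optima; as T drops, only improving swaps survive.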
4 Experiment We evaluate our proposed method on two real-world datasets. We aim to answer the following research questions: • RQ1: How does MPSE perform compared to the baseline methods? • RQ2: How much does each component in MPSE contribute? • RQ3: How do the hyper-parameters in MPSE affect the performance? 4.1 Experimental Settings Datasets. To evaluate the performance of the proposed MPSE, we conduct comprehensive experiments on two real-world datasets: ASSISTments 2009–2010 [17] and Algebra 2006–2007 [18]. Detailed dataset statistics are presented in Table 1.

Table 1. Real dataset statistics.

Dataset       Students   Exercises   Knowledge Concepts   Records
ASSISTments   4,163      17,746      123                  278,607
Algebra       1,338      91,771      491                  222,314
Baselines. To demonstrate the effectiveness, we compare our MPSE with the following state-of-the-art methods from three groups: Classical recommendation methods (KNN
Modeling Portraits of Students and Exercises for Exercise Recommendation
233
[19] and KGEB-CF [20]); Cognitive diagnosis models (NeuralCD [12] and DKT [21]); Exercise recommendation models (HB-DeepCF [7] and KCP-ER [10]). Evaluation Metrics and Settings. Similar to KCP-ER [10], we evaluate our method and the baselines from the following three aspects: novelty, accuracy, and diversity. In addition, the student similarity threshold is ω = 0.6, the exercise similarity threshold is θ = 0.6, the joint random walk jump probability is set to α = 0.01, and the restart probability is β = 0.7.
4.2 Performance Comparison (RQ1) We present the performance results of all methods in Table 2, where the results of our MPSE and of the best-performing baseline are highlighted in bold and underlined, respectively. From the results, we summarize several important observations: MPSE consistently outperforms all types of baselines on both datasets in terms of all evaluation metrics. We attribute the significant performance improvements to the following reasons: 1) through proper portraits of students (exercises), MPSE captures the explicit dependencies in a customized manner; 2) the designed joint random walk paradigm incorporates the spectral properties of the nearly uncoupled Markov chain, which prolongs the influence of the personalized initialization; 3) with the help of the simulated annealing algorithm, the objective of diversity can be fulfilled.

Table 2. Performance comparison on the two real-world datasets.

             ASSISTments 2009–2010            Algebra 2006–2007
             novelty  accuracy  diversity     novelty  accuracy  diversity
KNN          0.934    0.888     0.254         0.783    0.747     0.407
KGEB-CF      0.912    0.879     0.524         0.676    0.651     0.674
DKT          0.602    0.880     0.466         0.621    0.855     0.583
NeuralCD     0.583    0.894     0.495         0.645    0.859     0.668
HB-DeepCF    0.914    0.823     0.758         0.739    0.645     0.619
KCP-ER       0.957    0.895     0.765         0.818    0.863     0.743
MPSE         0.959    0.897     0.781         0.821    0.865     0.758
4.3 Ablation Study (RQ2) In order to verify the effectiveness of integrating student (exercise) dependencies under the random walk prototype, we build three variants of MPSE by removing parts of its modules: 1) variant MPSE-s is built by removing the student-student relation; 2) variant MPSE-e removes the exercise-exercise relation; 3) variant MPSE-s-e is constructed by removing both the student-student and the exercise-exercise relations.
Fig. 3. The influence of similar students (exercises).
We present the evaluation results in Fig. 3 with the following key observations: MPSE always outperforms its variants. This confirms the effectiveness of our portrait modeling paradigm in capturing the complex dependency relations between students (exercises). Removing the student-student relation exerts a more significant impact on accuracy, which validates that exercises performed by students with similar levels of knowledge mastery provide imperative reference value for the target students. Moreover, MPSE-e underperforms MPSE-s w.r.t. diversity, which indicates that removing the exercise-exercise relation has a negative impact on diversity in MPSE. 4.4 Sensitivity Evaluation (RQ3) In particular, we evaluate the effect of the candidate recommendation list size top-P and the recommendation list size top-L in MPSE. To verify the effect of the candidate exercise set and the final recommendation list on the recommendation results, we vary top-P in the range {20, 30, 40, 50} and top-L in the range {2, 4, 6, 8, 10}. The performance comparison on the ASSISTments 2009–2010 dataset is illustrated in Fig. 4. When top-P increases from 20 to 30, the diversity performance improves, demonstrating that too few candidate exercises are not beneficial to the diversity of recommendations. When top-P increases from 30 to 50, the performance becomes poorer, which makes sense since too many candidates can bring in inappropriate exercises. As top-L increases, the performance first rises and then declines, because the increase in the diversity of exercises makes the accuracy and novelty decrease. When the candidate exercise set size is 30 and the final recommendation list size is 6, the results best match our joint requirement for novelty, accuracy and diversity: too few exercises hurt diversity, while with too many exercises the novelty wears off.
Fig. 4. Top-P candidate (Top-L recommendation) exercise number impact on performance.
5 Conclusion In this paper, we propose Modeling Portraits of Students and Exercises for Exercise Recommendation (MPSE) to address the problem of recommending exercises with novelty, accuracy and diversity. MPSE comprises three modules: construction of the CSEG, the joint random walk, and optimization of the recommendation list. First, the CSEG is constructed from the portraits of students and exercises. Then we select a candidate set of exercises via a joint random walk. Finally, a diverse list of recommended exercises is obtained by a simulated annealing algorithm. Compared with existing methods, the advantages of MPSE are validated on several real-world datasets used in educational data mining. Acknowledgment. This work is supported by the Industrial Support Project of Gansu Colleges (2022CYZC-11), Gansu Natural Science Foundation Project (21JR7RA114), the National Natural Science Foundation of China (622760736, 1762078, 61363058) and NWNU Teachers Research Capacity Promotion Plan (NWNU-LKQN2019–2).
References 1. Nabizadeh, A., Leal, J., Rafsanjani, H., Shah, R.: Learning path personalization and recommendation methods: a survey of the state-of-the-art. Expert Syst. Appl. 159, 113596 (2020) 2. Zhang, Q., Lu, J., Zhang, G.: Recommender systems in E-learning. J. Smart Environ. Green Comput. 1, 76–89 (2021) 3. Ma, H., Huang, Z., Tang, W., Zhang, X.: Exercise recommendation based on cognitive diagnosis and neutrosophic set. In: 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD), pp. 1467–1472. IEEE (2022)
4. Huang, S., Liu, Q., Chen, J., Hu, X., Liu, Z., Luo, W.: A design of a simple yet effective exercise recommendation system in K-12 online learning. In: Rodrigo, M.M., Matsuda, N., Cristea, A.I., Dimitrova, V. (eds.) Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners' and Doctoral Consortium. AIED 2022. Lecture Notes in Computer Science, vol. 13356. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-11647-6_36 5. Li, Z., et al.: Exercise recommendation algorithm based on improved collaborative filtering. In: 21st International Conference on Advanced Learning Technologies (ICALT), pp. 47–49. IEEE (2021) 6. Li, Z., et al.: Exercise recommendation method based on machine learning. In: 21st International Conference on Advanced Learning Technologies (ICALT), pp. 50–52. IEEE (2021) 7. Gong, T., Yao, X.: Deep exercise recommendation model. Int. J. Model. Optim. 9(1), 18–23 (2019) 8. Wang, W., Ma, H., Zhao, Y., Li, Z., He, X.: Tracking knowledge proficiency of students with the calibrated Q-matrix. Expert Syst. Appl. 192, 116454 (2022) 9. Wang, W., Ma, H., Zhao, Y., Yang, F., Chang, L.: PERM: pre-training question embeddings via relation map for improving knowledge tracing. In: Bhattacharya, A., et al. (eds.) Database Systems for Advanced Applications. DASFAA 2022. Lecture Notes in Computer Science, vol. 13247. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-00129-1_22 10. Wu, Z., Li, M., Tang, Y., Liang, Q.: Exercise recommendation based on knowledge concept prediction. Knowl. Based Syst. 210, 106481 (2020) 11. Huo, Y., Wong, D., Ni, L., Chao, L., Zhang, J.: Knowledge modeling via contextualized representations for LSTM-based personalized exercise recommendation. Inf. Sci. 523, 266–278 (2020) 12. Wang, F., et al.: Neural cognitive diagnosis for intelligent education systems. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 6153–6161 (2020) 13. Hang, Y., Ma, H., Jiang, Y., Li, Z.: Learning to recommend via random walk with profile of loan and lender in P2P lending. Expert Syst. Appl. 174, 114763 (2021) 14. Nikolakopoulos, A., Karypis, G.: RecWalk: nearly uncoupled random walks for top-N recommendation. In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM), pp. 150–158. ACM (2019) 15. Stewart, G.: On the sensitivity of nearly uncoupled Markov chains. In: Numerical Solution of Markov Chains, pp. 105–119. CRC (2021) 16. Debuse, J., Rayward-Smith, V.: Feature subset selection within a simulated annealing data mining algorithm. J. Intell. Inf. Syst. 9(1), 57–81 (1997) 17. Ghosh, A., Heffernan, N.T., Lan, A.S.: Context-aware attentive knowledge tracing. In: 26th International Conference on Knowledge Discovery and Data Mining (KDD), pp. 2330–2339. ACM (2020) 18. Stamper, J., Niculescu-Mizil, A., Ritter, S., Gordon, G.J., Koedinger, K.R.: Challenge data set from KDD Cup 2010 Educational Data Mining Challenge, Algebra I 2006–2007. http://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp 19. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theor. 13(1), 21–27 (1967) 20. Zhu, M., Zhen, D.-S., Tao, R., Shi, Y.-Q., Feng, X.-Y., Wang, Q.: Top-N collaborative filtering recommendation algorithm based on knowledge graph embedding. In: Uden, L., Ting, I.-H., Corchado, J.M. (eds.) KMO 2019. CCIS, vol. 1027, pp. 122–134. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-21451-7_11 21. Piech, C., et al.: Deep knowledge tracing. In: Advances in Neural Information Processing Systems 28 (NIPS), pp. 505–513. MIT (2015)
EduAction: A College Student Action Dataset for Classroom Attention Estimation
Kunhong Liu1, Bin Chen1, Liyan Chen1(B), Yong Xu2, Lu Lin1, Fan Gao1, and Yudi Zhao1
1 Xiamen University, Xiamen, Fujian, China {lkhqz,chenliyan}@xmu.edu.cn, {24320191152511,22920192204195,37020221150175}@stu.xmu.edu.cn 2 Fujian University of Technology, Fuzhou, Fujian, China [email protected]
Abstract. With the development of action recognition techniques, it has become possible to automatically analyze college students' attention status by recognizing their behavior in the classroom, which is of great significance for the evaluation of in-class teaching quality. Considering that college student behavior datasets are scarce, in this paper we set up a novel dataset of college students' actions in the classroom for attention estimation, named the EduAction dataset. The EduAction dataset consists of 7 types of actions and 718 action clips, collected in a real college classroom environment. Furthermore, an improved two-stream ConvNet is evaluated on this dataset with 5-fold cross-validation. Our benchmark model achieves an overall accuracy of 83.01%. This spontaneous dataset will be a great aid for teaching quality analysis in the learning environment. Keywords: Action Recognition · Classroom Activity · Teaching environment · Deep Learning · Two-stream Model
1 Introduction Based on the analysis of multi-dimensional semantic information drawn from students' participation in the classroom, the estimation of classroom attention is important in supporting the evaluation of teaching quality. Many researchers have tried to estimate classroom attention in different ways. For example, Monkaresi et al. [1] used remote heart rate measurement and expression detection to analyze participants, and Xu et al. [2] estimated the head posture of students to score attention based on face orientation. Besides, Chen et al. [3] proposed a new method of automatically measuring attention through facial feature points. However, no study has mined the action information of students in the classroom for attention estimation, and the reason may lie in the fact that the K. Liu and B. Chen—These authors contributed equally to this work. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 237–248, 2023. https://doi.org/10.1007/978-981-99-4752-2_20
238
K. Liu et al.
traditional observation method for actions in the classroom is very time-consuming. Consequently, the automatic recognition of classroom behavior is of great importance. Moreover, action and attention are closely related, making the recognition of classroom behavior a necessary prerequisite for judging students' attention in class. The development of deep learning provides a solution for automatic action recognition in the classroom. Action recognition has recently become an increasingly popular research field in computer vision, and more and more machine learning methods have been proposed for video action recognition. In the early stage, some methods [4, 5] manually extracted features to identify different actions with an SVM. By comparison, deep learning-based models, such as CNNs and LSTMs, can automatically learn visual features. One of these deep learning models is the two-stream ConvNet model [6], which combines two types of features: spatial and temporal. Recently, Kim et al. [7] proposed a Spatio-Temporal SlowFast Self-Attention network, and Wei et al. [8] proposed an improved bilinear pooling method for action recognition. As deep learning-based models need a large amount of training data, many public action datasets have been released for action recognition, among which HMDB51 [9] and UCF101 [10] are the most popular benchmark datasets. In recent years, a large number of action datasets focusing on certain types of actions, such as cooking [11, 12] and sports [13, 14], have become available to researchers. However, there are still not enough action data for student action recognition collected in real educational environments. Recently, some datasets obtained in college classrooms have been introduced. Li et al. [15] introduced a college students' action recognition database; however, their students sat at several round tables in the classroom, which differs from most university classroom environments. Sharma et al.
[16] proposed the EduNet dataset for understanding human activity in the classroom environment, but their data were collected online, and it is not an action dataset specifically for college students. Hence, we introduce a novel dataset, collected from real classroom activities, for recognizing the actions of college students for attention estimation. This college students' action dataset, named EduAction, contains the seven most common types of college students' actions in the classroom. It is used to support the training of action recognition models in the classroom for attention estimation. As the EduAction dataset is collected from a real college classroom environment, it guarantees students' spontaneous actions. In order to effectively evaluate the usefulness and quality of the EduAction dataset, an improved two-stream ConvNet is designed as the benchmark model. Experimental results show that our model achieves an overall accuracy of 83.01% using 5-fold cross-validation, which confirms that our dataset can serve as a reliable basis for attention estimation. In short, the contributions of this study are summarized as follows: • A real-world action dataset called EduAction. It consists of video clips collected from college students' real classroom activities, covering the seven most common actions of modern students in a real, uncontrolled college classroom environment. In addition to helping to determine students' concentration level in the classroom, the dataset can be used in psychological research, human-computer interaction, and medical rehabilitation.
EduAction: A College Student Action Dataset
239
• Diversified environmental conditions. The data was collected over one semester during the COVID-19 epidemic, providing abundant changing factors for data analysis. For example, students' outfits vary from T-shirts to coats, with and without face masks. Furthermore, the lighting and occlusion conditions change greatly across clips. Such diversity provides more variety for the analysis of students' actions. • An improved two-stream ConvNet model. It is used as the benchmark model and achieves an overall accuracy of 83.01%, which confirms that our dataset can well support attention estimation from action information in the classroom and would be a useful aid for improving teaching quality.
2 Dataset We describe the data collection and annotation process in this section. Table 1 summarizes the characteristics of the EduAction dataset.

Table 1. Summary of the characteristics of the EduAction dataset

Characteristics                      Value
Types of actions                     7
Number of clips                      718
Min number of frames in one clip     34
Max number of frames in one clip     1196
Mean number of frames in one clip    113.65
Size of clip frames                  81.6K
Frame rate (per second)              25
Resolution                           224 × 224
Clip selection method                Manual

Fig. 1. Process of data collection and annotation (action location → clips generation → clips normalization and annotation).
2.1 Source Data The source data comes from video recordings of the real classroom environment of a digital media major course at a college in China. The video recording of the course lasted for three months. The course has 48 sessions in total, each lasting 45 minutes. Two cameras (Sony alpha 7 III) are used for the source data collection, one located on the teacher's podium and the other in the left upfront corner of the classroom. They record at a resolution of 1440 × 1080 with a frame rate of 25 fps. The entire data collection process was approved by the students taking part in the experiment. About 120 college students participate in the experiment, with a male-to-female ratio of 7:3. All students are Chinese, and their ages range from 21 to 23. Because of the change of seasons, the students' clothes changed from T-shirts to coats. Each student has different characteristics, such as wearing glasses, hats, or face masks; some students even dyed their hair. This diversity in the data proves helpful for the analysis of students' attention status. 2.2 Data Collection and Annotation
Table 2. Number of clips per action class in the EduAction dataset.

Class ID   Action Categories            Sample Size
I          Sleeping                     80
II         Listening to lectures        115
III        Drinking water               140
IV         Talking with others          60
V          Watching the computer        84
VI         Playing with mobile phones   90
VII        Writing                      149
We carefully analyze the source data and identify the seven most common actions of college students in the real classroom environment: sleeping, listening to lectures, drinking water, talking with others, watching the computer, playing with mobile phones, and writing. The whole process of obtaining action clips is shown in Fig. 1 and can be divided into three steps: action location, clips generation, and clips normalization and annotation. To obtain the action clips, we first need to locate them in the source video. Specifically, when a college student performs one of the seven actions during a certain period in the video, we use the video editing software Format Factory to box the area where the student is located and manually extract the period s_i–e_i of the action, where s_i and e_i are the starting and ending times of the i-th action clip. We then generate the action clip and name the file in the following format: X_Clips_i.mp4, where X is the name prefix of the source video and i indicates the i-th action clip.
Fig. 2. Examples of the action clips in each of the seven categories in the EduAction dataset. The figure shows five frames of images sampled equidistantly from the beginning to the end of each clip.
At last, in order to ensure the consistency of the dataset, each frame in all action clips is resized to 224 × 224. After action clips have been generated, we annotate each action clip and put it into the folder named after the corresponding label. To ensure the annotation quality, each clip is validated by more than two skilled individuals, so as to confirm the correctness of the label annotated on each clip. As a result, the EduAction dataset contains seven classes. Table 2 shows the number of clips in each class of the EduAction dataset. It is worth mentioning that the action clips with too few frames are deleted because they are not suitable for training deep learning models. Figure 2 presents the examples of the action clips for each class in the dataset.
Fig. 3. Sketch of the two-stream baseline model.
2.3 Dataset Evaluation In order to avoid problems caused by an unreasonable division of the dataset, we use 5-fold cross-validation to evaluate the EduAction dataset. That is, we split the data into five folds, each containing an even proportion of every category. Then we in turn treat the clips in each fold as the validation set and those in the other four folds as the training set. Finally, the results obtained from the five folds are averaged.
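A sketch of such a stratified split, using the per-class clip counts from Table 2. Round-robin dealing is one simple way to obtain an even proportion of each class per fold; it is an illustrative choice, not necessarily the authors' exact procedure:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=42):
    """Split clip indices into k folds, each with an even share of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label, indices in by_class.items():
        rng.shuffle(indices)
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)   # deal shuffled indices round-robin
    return folds

# per-class clip counts from Table 2 (classes I..VII)
counts = {"I": 80, "II": 115, "III": 140, "IV": 60, "V": 84, "VI": 90, "VII": 149}
labels = [c for c, n in counts.items() for _ in range(n)]
folds = stratified_folds(labels)
print([len(f) for f in folds])   # fold sizes differ by at most a couple of clips
```

Each fold then serves once as the validation set while the remaining four form the training set, and the five accuracies are averaged.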
3 Benchmark Model To establish benchmark performances on the EduAction dataset, we implement the improved two-stream ConvNet [6] as described below. This section introduces the implementation details of the benchmark model. 3.1 Model Input The input of the two-stream ConvNet can be decomposed into spatial and temporal components. In the spatial process, we randomly select the τ-th frame in the action clip as the individual frame appearance for data augmentation:

τ = random(1, N − 32) (1)

where N denotes the number of frames in the clip. In the temporal process, L consecutive temporal frames that are d frames apart are first sampled. Then the horizontal and vertical optical flows u and v of adjacent frames in the sequence {τ, τ + d, ..., τ + (L − 1)d} are calculated. Finally, the flow channels u and v of the L sampled frames are stacked to form a total of 2L input channels. More formally, let w and h be the width and height of the clip images; the input of the temporal process is represented as I_t ∈ R^{w×h×2L}. In our model, L is set to 4 and d is set to 8. We use the TV-L1 optical flow method [17].
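The assembly of the 2L-channel temporal input can be sketched as follows. The `flow` function here is only a stand-in (a real pipeline would use a TV-L1 optical flow implementation), and we assume flow is computed between each sampled frame and its immediate successor:

```python
import numpy as np

def flow(frame_a, frame_b):
    # stand-in for a real optical flow routine (the paper uses TV-L1 [17]);
    # returns an (h, w, 2) array of horizontal (u) and vertical (v) flow
    return np.stack([frame_b - frame_a, frame_a - frame_b], axis=-1)

def temporal_input(frames, tau, L=4, d=8):
    """Stack the u, v flow channels of L frames sampled d apart -> (h, w, 2L)."""
    channels = []
    for k in range(L):
        uv = flow(frames[tau + k * d], frames[tau + k * d + 1])
        channels.append(uv[..., 0])   # horizontal flow u
        channels.append(uv[..., 1])   # vertical flow v
    return np.stack(channels, axis=-1)

N, h, w = 120, 224, 224
frames = np.random.rand(N, h, w).astype(np.float32)
tau = np.random.randint(1, N - 32)   # random start frame, as in Eq. (1)
it = temporal_input(frames, tau)
print(it.shape)  # (224, 224, 8) for L = 4
```

With L = 4 the temporal stream therefore receives an 8-channel input, while the spatial stream receives the single RGB frame at position τ.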
Fig. 4. The block module of inception_V3.
3.2 Model Framework An overall sketch of the model framework is shown in Fig. 3. The details of the spatial and temporal stream ConvNets are given as follows. • Spatial stream ConvNet. Since the function of the spatial ConvNet is image classification, the Inception V3 backbone [18] is applied to generate the feature map of individual frames, and the backbone is pre-trained on the ImageNet challenge dataset, a large image classification dataset. The block module of Inception V3 is shown in Fig. 4. The feature map is then processed by a global average pooling layer followed by one fully-connected layer of size 128 to generate the features F_s of the spatial process. • Temporal stream ConvNet. The features F_t of the temporal process are generated by a CNN, which consists of five convolutional layers with 3 × 3 kernels, stride 1, and ReLU activation [19], followed by a fully-connected layer of size 128. The five convolutional layers have 32, 64, 64, 128 and 128 output channels, respectively. A 25% dropout operation [20] is applied to the first, third, and fifth layers, each followed by a max-pooling layer with 3 × 3 kernels. • Fusion. The features of the two streams, F_s and F_t, are concatenated to form a new feature vector with 256 dimensions. Let f_cat denote the function that concatenates the two streams:

F = f_cat(F_s, F_t) (2)

Then F is sent to a fully-connected layer that maps the features to 128 dimensions. The outputs are fed to a softmax layer with 7 dimensions to generate the final prediction results. To sum up, compared with the original two-stream ConvNet, the network is improved in two aspects. The first is the deployment of the Inception V3 backbone for the generation of the spatial-stream feature map. The use of the pre-trained backbone makes the model converge more easily and faster.
The other is the fusion of the two streams based on the fully-connected layer, which provides better integration than the simple voting method of the original model. 3.3 Training Configuration Each fold in the 5-fold cross-validation uses the same training configuration. The benchmark model is trained using the SGD optimizer with momentum (coefficient 0.9) for 50 epochs. The learning rate is set to 1e-4. At each iteration, a mini-batch of 8 samples is constructed by sampling 8 training clips on a single GPU, for each of which the start frame is randomly selected as shown in (1). During testing, given an action clip, 10 inputs with randomly selected start frames are constructed; the output softmax scores of the 10 inputs are then averaged to obtain the prediction for the action clip.
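The test-time averaging over 10 random starts can be sketched as below, with a dummy logit function standing in for the trained two-stream network (the names `predict_clip` and `fake_logits` are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_clip(logits_fn, clip, n_starts=10):
    """Average the softmax scores of 10 randomly started inputs, then argmax."""
    starts = rng.integers(0, 100, size=n_starts)   # random start frames
    scores = np.mean([softmax(logits_fn(clip, s)) for s in starts], axis=0)
    return int(np.argmax(scores))

# dummy stand-in for the trained network: 7 noisy logits favouring class 2
fake_logits = lambda clip, start: rng.normal(size=7) + np.eye(7)[2] * 5.0
pred = predict_clip(fake_logits, clip=None)
print(pred)
```

Averaging the softmax scores rather than hard votes keeps the per-start confidence information, which makes the final prediction more stable across the random start frames.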
Table 3. The accuracy of each fold and all folds in seven classes.

Action Categories            Fold 1   Fold 2    Fold 3   Fold 4   Fold 5   All folds
Sleeping                     81.25%   87.50%    93.75%   81.25%   93.75%   87.50%
Listening to lectures        78.26%   78.26%    78.26%   95.65%   82.61%   82.61%
Drinking water               96.43%   100.00%   89.29%   89.29%   89.29%   92.86%
Talking with others          75.00%   50.00%    91.67%   83.33%   58.33%   71.67%
Watching the computer        82.35%   94.12%    76.47%   82.35%   81.25%   83.33%
Playing with mobile phones   94.44%   61.11%    94.44%   72.22%   94.44%   83.33%
Writing                      80.00%   60.00%    76.67%   76.67%   86.21%   75.84%
Overall                      84.72%   77.08%    84.72%   83.33%   85.21%   83.01%
4 Result and Analysis The benchmark model is trained and evaluated with 5-fold cross-validation. The performances of the different folds are shown in Table 3, which reports the accuracy of each fold and the average over all folds. As observed from Table 3, the overall accuracy of folds 1–5 is 84.72%, 77.08%, 84.72%, 83.33%, and 85.21%, respectively, and the overall accuracy of our benchmark model is 83.01%. Moreover, the experimental results show that our model achieves the highest accuracy on the class Drinking water and the lowest accuracy on the class Talking with others. The complexity of the environment in a clip also affects the performance of our model. For example, the environment in most clips of Writing is cluttered by other items, such as computers or cups, which leads to lower accuracy than for other classes even though Writing has the largest sample size among the seven classes. Furthermore, the different movement ranges of different action categories also affect the per-category results. By examining the raw data, we find that most clips of Talking with others involve a large range of movement, resulting in a high misclassification rate. On the other hand, clips of Sleeping have a small range of movement, making prediction easier and yielding a relatively high accuracy score. To further compare the recognition performance of the different categories, Fig. 5 shows the confusion matrix over all folds, where I–VII represent the seven classes, respectively. We find that the largest misclassified proportion of the three classes IV, VI, and VII goes to class III in the benchmark model, resulting in their relatively low accuracy scores. The reason is that in the real classroom environment, the object "water cup" is visible in some clips of these action classes, which leads the benchmark model to wrongly judge the original class as the Drinking water class.
It is worth noting that different actions have different importance for classroom attention estimation. The two classes Sleeping and Playing with mobile phones are negative actions for attention estimation, while Listening to lectures and Writing are two positive action categories. Therefore, it is important to design strategies, such as the application of attention mechanisms, to extract accurate features for the
Fig. 5. Confusion matrix of different categories over all folds (measured by recognition rate %).
identification of these classes so that the model can gain higher accuracy scores for these kinds of actions. The remaining three action classes could show clear tendencies towards either positive or negative effects on students’ attention estimation according to the specific situation. For example, students may not pay attention to the lectures when drinking water. For the actions of Talking with others and Watching the computer, it is necessary to analyze whether students are discussing or browsing the content related to the classroom lecturing. From the analysis of our dataset, about 60% and 30% of the clips of Talking with others and Watching the computer are positive actions. For example, in the action clip “talking_Clips_001.mp4”, it is clear from the audio portion of the classroom video that the instructor is not engaged in a classroom group discussion at this time. Therefore, the talking action is considered irrelevant to the classroom and leads to a decrease in students’ attention. The action clip “watching_computer_Clips_008.mp4”, based on the context of the video, shows the student manipulating the mouse and making eye contact with the teacher. Thus, the watching computer action clip is considered relevant to the class content and has a high level of student concentration. The clip named “watching_computer_Clips_016.mp4” is not related to the teacher’s lecture because the student’s hands are on the keyboard. The clip was considered irrelevant to the class content. Furthermore, in order to further investigate the dataset, we use the clustering technique based on the autoencoder algorithm for outlier analysis. Firstly, we extract the middle frame of each video clip as the input of the autoencoder model. Figure 6 shows the structure of the autoencoder model. We use the encoder module to map input data to lower dimensional features, and then use the K-means algorithm for cluster analysis. There were seven types of actions included in the sample. 
The number of clusters for the K-means algorithm is set to 7 and the maximum number of iterations to 300. Figure 7 shows the clustering results, using the t-SNE algorithm [21] to map the features into a 2D space. We find that the K-means algorithm divides the data well into 7 clusters, and the distribution of the real labels is relatively close in the feature space extracted by the autoencoder. However, there are some outliers, which also bring more challenges to model training. Figure 8 shows two examples of the outliers, which in most cases are combinations of two or more actions. For example, Fig. 8 (a) shows a student who drinks
Fig. 6. The structure of the autoencoder model. The kernel sizes of all conv2D layers both in the encoder and decoder modules are 3 × 3. The pooling sizes of all maxPooling2D layers in the encoder module and all upSamling2D layers in the decoder module are 2 × 2. We flatten the features of middle layer to get the final features.
Fig. 7. Clustering results using the t-SNE algorithm.
Fig. 8. Examples of outliers.
water while watching the computer simultaneously. Figure 8 (b) shows a student talks while watching the computer at the same time.
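The clustering step above can be sketched as follows. The paper's convolutional autoencoder is not reproduced here, so a fixed random projection stands in for the trained encoder (an assumption for illustration only); the K-means settings (7 clusters, at most 300 iterations) follow the text:

```python
import numpy as np

def kmeans(feats, k=7, max_iter=300, seed=0):
    """Plain K-means on encoder features (k = 7 clusters, up to 300
    iterations, matching the setting used for the outlier analysis)."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(max_iter):
        # assign every feature vector to its nearest center
        dists = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.stack([feats[labels == j].mean(axis=0)
                                if (labels == j).any() else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# A trained encoder maps middle frames to low-dimensional features; here a
# fixed random projection stands in for it (an assumption, not the paper's model).
rng = np.random.default_rng(1)
frames = rng.standard_normal((70, 64 * 64))      # 70 flattened middle frames
encoder = rng.standard_normal((64 * 64, 32)) / 64
features = frames @ encoder
labels, _ = kmeans(features, k=7)
```

The 2D t-SNE visualization of `features` would then be produced separately, as in Fig. 7.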
EduAction: A College Student Action Dataset
5 Conclusion
In this paper, we present EduAction, a challenging action recognition dataset for attention estimation in college classrooms. To build this dataset, source data was collected in an actual classroom environment, and the 7 most common classroom actions were annotated. We then manually cut the 7 classes of video clips out of the source data and labeled each clip with its action class. We study the action recognition task by training a benchmark model on our dataset, which achieves an overall accuracy of 83.01% under 5-fold cross-validation. This dataset makes it possible to use action information to estimate classroom attention. In the future, we will make the study generalizable to non-engineering majors by collecting data from various majors. We will also explore attention models to improve the recognition accuracy of the four key actions for attention estimation, and provide further insight into the analysis of attention status.
Acknowledgment. This work is supported by the National Natural Science Foundation of China (No. 61772023), the National Key Research and Development Program of China (No. 2019QY1803), and the Fujian Science and Technology Plan Industry-University-Research Cooperation Project (No. 2021H6015).
References
1. Monkaresi, H., et al.: Automated detection of engagement using video-based estimation of facial expressions and heart rate. IEEE Trans. Affect. Comput. 8(1), 15–28 (2016)
2. Xu, X., Teng, X.: Classroom attention analysis based on multiple Euler angles constraint and head pose estimation. In: Ro, Y.M., et al. (eds.) MultiMedia Modeling. Lecture Notes in Computer Science, vol. 11961, pp. 329–340. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-37731-1_27
3. Chen, L., Yang, H., Liu, K.: Classroom attention estimation method based on mining facial landmarks of students. In: Þór Jónsson, B., et al. (eds.) MultiMedia Modeling. Lecture Notes in Computer Science, vol. 13142, pp. 255–266. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98355-0_22
4. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3551–3558 (2013)
5. Laptev, I., et al.: Learning realistic human actions from movies. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008)
6. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941 (2016)
7. Kim, M., Kim, T., Kim, D.: Spatio-temporal SlowFast self-attention network for action recognition. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 2206–2210. IEEE (2020)
8. Wu, W., Yu, J.: An improved bilinear pooling method for image-based action recognition. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 8578–8583. IEEE (2021)
9. Kuehne, H., et al.: HMDB: a large video database for human motion recognition. In: International Conference on Computer Vision, pp. 2556–2563. IEEE (2011)
10. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
11. Kuehne, H., Arslan, A., Serre, T.: The language of actions: recovering the syntax and semantics of goal-directed human activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014)
12. Damen, D., et al.: Scaling egocentric vision: the EPIC-KITCHENS dataset. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. Lecture Notes in Computer Science, vol. 11208, pp. 720–736. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_44
13. Karpathy, A., et al.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
14. Rodriguez, M.D., Ahmed, J., Shah, M.: Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008)
15. Li, X., et al.: A students' action recognition database in smart classroom. In: 2019 14th International Conference on Computer Science & Education (ICCSE), pp. 523–527. IEEE (2019)
16. Sharma, V., et al.: EduNet: a new video dataset for understanding human activity in the classroom environment. Sensors 21(17), 5699 (2021)
17. Zach, C., Pock, T., Bischof, H.: A duality based approach for real-time TV-L1 optical flow. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) Pattern Recognition. DAGM 2007. Lecture Notes in Computer Science, vol. 4713, pp. 214–223. Springer, Berlin, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74936-3_22
18. Szegedy, C., et al.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
19. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814 (2010)
20. Srivastava, N., et al.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
21. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11), 2579–2605 (2008)
Zero-Shot Learning Based on Weighted Reconstruction of Hybrid Attribute Groups
Jiarui Zhang1,2(B), Ruilin Li3, Nannan Yu1, Jian Liu2, and Yi Kong2
1 School of Electrical Engineering and Automation, Jiangsu Normal University, Xuzhou 221116, Jiangsu, China
[email protected]
2 School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China
3 State Key Laboratory for Geomechanics and Deep Underground Engineering, China University of Mining and Technology, Xuzhou 221116, Jiangsu, China
Abstract. For zero-shot learning, the rational description and utilization of attributes are effective ways to build a bridge between training classes and testing classes. To improve the descriptive ability of attributes and construct an appropriate mapping between attributes and features, a novel zero-shot learning method based on weighted reconstruction of hybrid attribute groups (WRHAG) is proposed. First, the original semantic attributes are divided into groups by hierarchical clustering, and the grouped semantic attributes are further enhanced by broad learning. The semantic attribute groups and enhanced attribute groups together constitute hybrid attribute groups, which effectively improve the attribute description ability. Then, the mutual mapping between attributes and features is obtained by constructing a weighted autoencoder, in which the structured sparse L21 norm and attribute group coefficients are adopted to select discriminative attributes and account for the differences between attribute groups. Finally, zero-shot classification is achieved by calculating the similarity between the features of a testing sample and the predicted class features in the feature space. Comparative experiments on the CUB dataset demonstrate that the proposed WRHAG model yields better performance in zero-shot image classification.
Keywords: Zero-shot learning · Broad learning · Attribute grouping · Attribute enhancement · Autoencoder
1 Introduction
For traditional supervised image classification problems, the training classes and testing classes always remain the same. The reliability of supervised classification is built on the acquisition of a large number of training samples. However, this requirement sometimes cannot be guaranteed in practice. For instance, massive image acquisition for some endangered species is usually very difficult.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 249–260, 2023. https://doi.org/10.1007/978-981-99-4752-2_21
As an extreme case,
J. Zhang et al.
when the training classes and testing classes are completely disjoint, the problem becomes typical zero-shot classification [1, 2]. To classify testing images without corresponding training samples, it is necessary to make full use of auxiliary information between image features and class labels. This kind of information, shared between training classes and testing classes, can be semantic attributes [3], word vectors [4], etc. Semantic attributes are widely used and easily understood. For example, "fox" and "chimpanzee" both have the attributes "muscle" and "agility", while "chihuahua" and "giant panda" do not. Therefore, a classifier obtained by training on "fox" and "chihuahua" can help identify "chimpanzee" and "giant panda". So far, attributes have played an important role in image retrieval [5], face estimation [6], etc.
Attribute-based zero-shot image classification is realized by comparing the similarity between the predicted attributes of a testing sample and the predefined attributes of the testing classes (or between the features of the testing sample and the predicted class features), before which the mapping between image features and attributes needs to be learned. Therefore, both the attribute description and the mapping between features and attributes have a significant influence on classification performance.
On the one hand, the descriptive ability of attributes can be effectively improved by mining relationships among existing attributes. Jayaraman et al. [7] achieved zero-shot image classification by combining feature sharing within attribute groups with feature competition between attribute groups. Wang et al. [8] implemented attribute prediction by constructing relationships between classes and performing multi-task learning. However, these methods divide semantic attributes into groups manually according to attribute type, which is subjective and scales poorly.
On the other hand, the attribute description ability can also be improved by reasonably enhancing the existing semantic attributes. Currently, semantic attributes are usually obtained through manual annotation, which entails a heavy workload and high cost. At the same time, the number of labeled semantic attributes is often insufficient in the case of massive training and testing classes, or when many attributes are similar between classes. Zhang et al. [9] obtained enhanced attributes by using the elastic network constraint to construct the relationship between attributes and features. By regarding class labels as extended attributes, Wang et al. [8] comprehensively used the relationship between classes and attributes, and the relationship between training and testing classes, to achieve zero-shot classification. The above methods, however, do not consider the prior knowledge reflecting the relationships between attributes when performing semantic attribute enhancement.
In addition, effectively learning the mapping between features and attributes is also key to improving the classification performance. The structured joint embedding method, proposed by Akata et al. [10], obtained the mapping from features to attributes by learning a bilinear compatibility function that optimizes the structured SVM loss. Kodirov et al. [11] proposed the semantic autoencoder (SAE) method, which constructs the mutual mappings between features and attributes by assigning semantics to the hidden layer of an autoencoder. The projection function learned by SAE shows better generalization to new testing classes. The latent space encoding method, proposed by Yu et al. [5], adopted an encoder-decoder framework to connect
semantic relations with different forms. The performance of the above zero-shot classification methods demonstrates that it is more effective to consider the mutual mapping between features and attributes. Besides, a combination with a high-quality attribute description could further improve zero-shot classification performance. However, none of the above methods attempts to describe attributes sufficiently.
Taking the above issues into account, we propose a novel zero-shot learning method based on weighted reconstruction of hybrid attribute groups (WRHAG). Our method aims to fully consider the correlation between attributes, the limited number of attributes, and the mapping between features and attributes. To summarize, the main contributions of the proposed WRHAG method are outlined as follows.
1) The semantic attributes are grouped automatically by hierarchical clustering, and then enhanced by group using the broad structure. The grouped semantic attributes, together with the enhanced attribute groups, constitute the hybrid attribute groups.
2) To consider the different contributions of the hybrid attribute groups, the mapping between the attribute space and the feature space is constructed by a weighted autoencoder.
3) The structured sparse L21 norm is adopted in the objective function to remove attribute redundancy.
4) Zero-shot image classification is achieved by calculating the similarity between the features of a testing sample and the predicted class features in the feature space.
2 Proposed WRHAG
2.1 Framework of WRHAG
The proposed WRHAG method aims to mine attribute relationships through grouping and enhancing semantic attributes, and to improve the discrimination and integrity of attributes across classes. In addition, the weighted autoencoder is used to construct the mutual mapping between attributes and features, and then to implement the feature prediction. During this process, weighted sparse hybrid attribute groups are adopted to properly remove attribute redundancy and to consider the different contributions of each attribute group, so as to improve the accuracy of zero-shot image classification. The framework of zero-shot image classification based on WRHAG mainly consists of four stages: semantic attribute grouping, attribute enhancement, weighted reconstruction of hybrid attribute groups, and zero-shot image classification.
2.2 Semantic Attribute Grouping
The grouping of semantic attributes is achieved by employing the hierarchical clustering method. The original attributes are layered to form a hierarchy by similarity, based on which the attribute groups can be obtained for a given number of groups. The agglomerative clustering approach is chosen to construct the attribute hierarchy in WRHAG. The aforementioned similarity between attributes is calculated by using the average-linkage criterion, which merges clusters according to average between-group distances and thus reduces sensitivity to outliers.
Let A = (a_1, a_2, ···, a_u) ⊂ R^{r×u} denote the semantic attributes, where u is the number of semantic attributes and r is the total number of classes. When the semantic attributes are divided into v groups, the numbers of semantic attributes in the groups are u_1, u_2, ···, u_v, respectively, where u_1 + u_2 + ··· + u_v = u. Therefore, the first semantic attribute group can be represented as:

A_1 = (a_1, a_2, ···, a_{u_1}) ∈ R^{r×u_1}    (1)

Analogously, the v-th semantic attribute group can be written as:

A_v = (a_{u−u_v+1}, a_{u−u_v+2}, ···, a_u) ∈ R^{r×u_v}    (2)

For the sake of simplicity, it is assumed that attribute groups do not intersect, i.e., each attribute appears in only one group [7]:

A_i ∩ A_{i'} = ∅,  i ≠ i',   s.t.   ∪_{i=1}^{v} A_i = A    (3)
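The agglomerative average-linkage grouping over the attribute columns of A can be sketched as follows; `average_linkage_groups`, the toy matrix, and the use of Euclidean distance between attribute columns are illustrative assumptions, not the authors' code:

```python
import numpy as np

def average_linkage_groups(A, n_groups):
    """Agglomerative clustering of the columns (attributes) of the
    class-attribute matrix A (r x u) under the average-linkage criterion."""
    clusters = [[j] for j in range(A.shape[1])]
    # pairwise Euclidean distances between attribute columns
    D = np.linalg.norm(A[:, :, None] - A[:, None, :], axis=0)
    while len(clusters) > n_groups:
        best, pair = np.inf, (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average linkage: mean pairwise distance between two clusters
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters[b]   # merge the two closest clusters
        del clusters[b]
    return clusters

# toy example: attributes 0/1 and 2/3 are near-duplicates (r = 2 classes)
A = np.array([[0.0, 0.1, 5.0, 5.1],
              [0.0, 0.0, 5.0, 5.0]])
groups = average_linkage_groups(A, n_groups=2)
```

Running the full hierarchy (down to one cluster) and cutting it at the desired number of groups, as described in the text, is equivalent to stopping the loop at `n_groups`.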
Based on the above symbolic representation, the specific steps of implementing semantic attribute grouping by the hierarchical clustering method are detailed as follows. First, calculate the pairwise distances of all attributes a_h (h = 1, 2, ···, u) by the average-linkage criterion, and merge the two attributes with the shortest distance into a new "attribute". Then, distances between the new "attribute" and the remaining attributes are calculated, and the two attributes with the shortest distance are again combined into a new "attribute". In this manner, the iteration continues until all attributes are clustered into one group, which ultimately forms a complete attribute hierarchy. Finally, attribute groups can be obtained from the attribute hierarchy when the number of groups is given in advance.

2.3 Attribute Enhancement
The semantic attribute groups are enhanced by BLS. The basic procedures of typical BLS are: 1) map the input data to mapped features, 2) expand them into enhancement nodes directly or by group, and 3) combine all the nodes and project them to the output. The mapping weights for obtaining the enhancement nodes are generated randomly in the second procedure. Based on this approach, enhanced attribute groups, represented as enhancement nodes of broad learning, can be obtained by taking the grouped semantic attributes as feature nodes. The specific procedures are detailed as follows. Denoting the attribute groups A_i (i = 1, 2, ···, v) as feature nodes of the broad learning, the enhancement nodes can be represented as Eq. (4) after expanding the feature nodes by group:

E_i = ξ(A_i W_{e_i} + β_{e_i}),  i = 1, 2, ···, v    (4)

where E_i (i = 1, 2, ···, v) are the enhanced attribute groups, and the weights W_{e_i} and β_{e_i} are generated randomly. Assume that the number of final hybrid attributes in each group is qũ, where q represents the multiple of attribute enhancement and ũ = max{u_i}. Therefore, the first group of enhanced attributes can be expressed as:

E_1 = (e_1, e_2, ···, e_{qũ−u_1}) ∈ R^{r×(qũ−u_1)}    (5)

Similarly, the v-th enhanced attribute group can be written as:

E_v = (e_{qũ(v−1)−(u−u_v)+1}, e_{qũ(v−1)−(u−u_v)+2}, ···, e_{qũv−u}) ∈ R^{r×(qũ−u_v)}    (6)

Furthermore, the semantic attributes and enhanced attributes are combined by group:

B_i = [A_i | E_i],  i = 1, 2, ···, v    (7)

to constitute the hybrid attributes:

B = ∪_{i=1}^{v} B_i ∈ R^{r×qũv}    (8)
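Eqs. (4)-(8) can be sketched as follows. The nonlinearity ξ is not fixed in the text, so tanh is assumed here; the function name and toy shapes are illustrative, with the weights W_{e_i} and β_{e_i} drawn randomly as in broad learning:

```python
import numpy as np

def build_hybrid_attributes(groups, q, seed=0):
    """Expand each semantic attribute group A_i into enhancement nodes
    E_i = xi(A_i W_ei + beta_ei) with randomly generated weights (Eq. (4)),
    then concatenate B_i = [A_i | E_i] (Eq. (7)) so every hybrid group has
    q * u_tilde columns; xi = tanh is an assumption for illustration."""
    rng = np.random.default_rng(seed)
    u_tilde = max(g.shape[1] for g in groups)      # u~ = max_i u_i
    hybrid = []
    for A_i in groups:
        n_enh = q * u_tilde - A_i.shape[1]         # enhancement columns needed
        W_ei = rng.standard_normal((A_i.shape[1], n_enh))
        beta_ei = rng.standard_normal(n_enh)
        E_i = np.tanh(A_i @ W_ei + beta_ei)        # Eq. (4)
        hybrid.append(np.hstack([A_i, E_i]))       # Eq. (7)
    return np.hstack(hybrid)                       # B, shape (r, q*u_tilde*v)

# e.g. r = 5 classes, two groups with u_1 = 2 and u_2 = 3 attributes, q = 2
rng = np.random.default_rng(1)
groups = [rng.random((5, 2)), rng.random((5, 3))]
B = build_hybrid_attributes(groups, q=2)           # shape (5, 12)
```

Note that every hybrid group ends up with the same width qũ, so groups with fewer semantic attributes receive more enhancement nodes.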
The hybrid attributes, composed of the existing semantic attributes and the obtained enhanced attributes by group, are capable of integrally reflecting the relationship between classes and attributes.

2.4 Weighted Autoencoder Based on Hybrid Attribute Groups
The mapping between the attribute space and the feature space is realized by using the weighted autoencoder, in which the attribute space is represented by hybrid attribute groups. The encoder of the weighted autoencoder constructs the mapping from hybrid attributes to class features, while the decoder obtains the reconstruction from class features back to hybrid attributes. Meanwhile, the L21 norm and attribute group coefficients are introduced into the objective function to remove the redundancy of the hybrid attributes and to obtain weighted attribute groups, respectively.
Firstly, all the sample features of each class in the training set are clustered to obtain class features:

X = [X_1, X_2, ···, X_s]^T ∈ R^{s×p}    (9)
where s denotes the number of training classes, and p is the dimension of the features. Secondly, by taking the hybrid attributes B as the input data matrix of the autoencoder, and regarding the class features X as the latent representation (i.e., the only shared hidden layer between the encoder and decoder), the objective function minimizing the reconstruction error of the hybrid attributes can be represented as:

min_{W_X, W_X^*} ‖B − B W_X W_X^*‖²_F   s.t.   B W_X = X    (10)

where W_X ∈ R^{qũv×p} denotes the projection matrix from attributes to features, and W_X^* ∈ R^{p×qũv} projects features back to attributes. The model can be further simplified by tying the weights [12]:

W_X^* = W_X^T    (11)
the objective function thus becomes:

min_{W_X} ‖B − B W_X W_X^T‖²_F   s.t.   B W_X = X    (12)

The hard constraint B W_X = X in Eq. (12) makes the problem difficult to solve directly. Therefore, the hard constraint is relaxed to a soft one, and the objective function is rewritten as:

min_{W_X} ‖X − B W_X‖²_F + λ_1 ‖B − X W_X^T‖²_F    (13)
where λ_1 is the weight coefficient controlling the ratio between the two terms. In order to deal with the redundancy of hybrid attributes and the proportional relation between attribute groups, the structured sparse L21 norm and attribute group coefficients are introduced into the objective function, respectively. Therefore, the objective function is finally expressed as:

min_{W_X, σ_j}  Σ_k ‖X_k − B_k W_X‖²_F + λ_1 Σ_k ‖B_k − X_k W_X^T‖²_F + λ_2 Σ_{j=1}^{v} Σ_{i=1}^{qũ} ‖W^j_{B_i}‖²_2 / σ_j    (14)

where k indexes the classes, λ_2 is the weight coefficient balancing the proportion of the first two terms, and W^j_{B_i} is the projection vector in W_X corresponding to the i-th attribute in the j-th group of the hybrid attributes B. σ_j in Eq. (14) stands for the weight coefficient of the attribute groups, and can be represented as:

σ_j = ( Σ_{i=1}^{qũ} ‖W^j_{B_i}‖_2 ) / ( Σ_{j=1}^{v} Σ_{i=1}^{qũ} ‖W^j_{B_i}‖_2 )   s.t.   σ_j ≥ 0,  Σ_{j=1}^{v} σ_j = 1    (15)
Thirdly, the optimization of the objective function is achieved by alternately optimizing W_X and σ_j during training [13]. The specific steps are given as follows.
1) For fixed σ_j, take the derivative of Eq. (14) with respect to W_X and set it to zero, which leads to:

( B_k^T B_k + λ_2 diag(σ_1^{−1}, σ_2^{−1}, ···, σ_v^{−1}) ) W_X + W_X ( λ_1 X_k^T X_k ) = (1 + λ_1) B_k^T X_k    (16)

Eq. (16) has the form of the well-known Sylvester equation. Thus, letting k = 1, 2, ···, s, W_X can be solved by using the Bartels-Stewart algorithm [14].
2) For fixed W_X, σ_j can be obtained by Eq. (15).
3) Iterate the above two steps until convergence, which results in the optimal W_X^* and σ_j^* for the objective function Eq. (14).
In addition, the predicted class features of the testing set in the feature space, X̂^Z, can be obtained as the product of the hybrid attributes and the optimal W_X^*:

X̂^Z = B W_X^*    (17)
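The alternating optimization of Eqs. (14)-(16) can be sketched as follows. This is a simplified illustration, not the authors' implementation: all classes are stacked into a single B and X rather than iterating k = 1, ···, s, the Sylvester equation (16) is solved through its vectorized (Kronecker) linear system instead of Bartels-Stewart (for which `scipy.linalg.solve_sylvester` exists), and a small ε guards against vanishing group weights:

```python
import numpy as np

def solve_sylvester(A, B, Q):
    """Solve A W + W B = Q via the vectorized linear system
    (I kron A + B^T kron I) vec(W) = vec(Q), column-major vec."""
    n, p = Q.shape
    M = np.kron(np.eye(p), A) + np.kron(B.T, np.eye(n))
    return np.linalg.solve(M, Q.flatten(order="F")).reshape(n, p, order="F")

def fit_wrhag(B_groups, X, lam1=1000.0, lam2=0.6, iters=10, eps=1e-8):
    """Alternate between Eq. (16) (fixed sigma -> W_X) and Eq. (15)
    (fixed W_X -> sigma), with all classes stacked into one B and X."""
    B = np.hstack(B_groups)
    sizes = [g.shape[1] for g in B_groups]
    starts = np.cumsum([0] + sizes[:-1])
    sigma = np.full(len(sizes), 1.0 / len(sizes))   # uniform group weights
    for _ in range(iters):
        # step 1: expand per-group sigma_j^-1 onto that group's columns
        d = np.concatenate([np.full(s, 1.0 / (sigma[j] + eps))
                            for j, s in enumerate(sizes)])
        W = solve_sylvester(B.T @ B + lam2 * np.diag(d),   # Eq. (16)
                            lam1 * (X.T @ X),
                            (1.0 + lam1) * (B.T @ X))
        # step 2: Eq. (15), sigma_j proportional to the group's norm mass
        norms = np.array([np.linalg.norm(W[i0:i0 + s], axis=1).sum()
                          for i0, s in zip(starts, sizes)])
        sigma = norms / (norms.sum() + eps)
    return W, sigma

rng = np.random.default_rng(0)
B_groups = [rng.random((6, 3)), rng.random((6, 4))]   # 6 classes, 2 groups
X = rng.random((6, 2))                                # 2-D class features
W, sigma = fit_wrhag(B_groups, X, iters=3)
```

The Kronecker solve is O((qũv·p)³) and only suitable for small toy sizes; Bartels-Stewart is the practical choice at the scale of the CUB experiments.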
2.5 Zero-Shot Image Classification
Figure 1 illustrates the zero-shot image classification based on the WRHAG model, which mainly consists of the attribute space and the feature space. The attribute space is composed of the grouped semantic attributes A and the grouped enhanced attributes E, while the feature space is composed of the features X. The grouped semantic attributes A in the attribute space are learnt by hierarchical clustering of the semantic attributes. On the basis of the semantic attribute groups, the grouped enhanced attributes E are obtained by broad learning. The mapping between the attribute space and the feature space is constructed by the weighted autoencoder.

Fig. 1. Illustration of zero-shot image classification based on WRHAG model.
As shown in Fig. 1, the zero-shot image classification is achieved by finding the minimum distance between the testing sample x_m and the predicted class features of each class in the feature space, that is:

φ(x_m) = arg min_{k=1,2,···,t} D(x_m, X̂_k^Z)    (18)

where X̂_k^Z denotes the predicted class features of the k-th class in the testing set, and t represents the number of testing classes.
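Eq. (18) reduces to a nearest-neighbor search over the predicted class features; a minimal sketch, assuming Euclidean distance for D (the excerpt does not fix the metric):

```python
import numpy as np

def classify(x_m, X_hat_Z):
    """Eq. (18): assign testing sample x_m to the class whose predicted
    class feature X_hat_Z[k] is nearest in the feature space."""
    return int(np.linalg.norm(X_hat_Z - x_m, axis=1).argmin())

# predicted class features for t = 3 testing classes (toy values)
X_hat_Z = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
label = classify(np.array([9.0, 8.5]), X_hat_Z)
```

In the full pipeline, `X_hat_Z` would be the rows of B W_X^* corresponding to the testing classes (Eq. (17)).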
3 Experimental Result and Analysis
3.1 Dataset and Parameter Setting
The Caltech-UCSD Birds-200-2011 dataset (CUB dataset) [15] is chosen to evaluate the classification performance of the proposed WRHAG model. A total of 8855 samples from 150 classes are selected as the training set, and 2933 samples from 50 classes are used as the testing set. Each class of birds is described by a set of 312 continuous attributes, and each image is represented by 1024-dimensional deep features extracted by the GoogLeNet network [16]. The optimal values λ_1 = 1000 and λ_2 = 0.6 for the WRHAG model are obtained by employing the grid search strategy.
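The grid search for λ_1 and λ_2 can be sketched as follows; `eval_acc` is a hypothetical callback that would train and validate WRHAG for one hyper-parameter setting, and the toy surface used here only stands in for real validation accuracy:

```python
import itertools

def grid_search(eval_acc, lam1_grid, lam2_grid):
    """Return the (lambda1, lambda2) pair with the best validation score;
    eval_acc(lam1, lam2) is a user-supplied evaluation callback."""
    return max(itertools.product(lam1_grid, lam2_grid),
               key=lambda pair: eval_acc(*pair))

# toy stand-in accuracy surface peaking at (1000, 0.6)
acc = lambda l1, l2: -abs(l1 - 1000) - abs(l2 - 0.6)
best = grid_search(acc, [10, 100, 1000], [0.2, 0.6, 1.0])
```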
3.2 Attribute Grouping and Enhancing
The semantic attribute grouping in the WRHAG model is realized by the hierarchical clustering method. According to the hierarchy of semantic attributes, Table 1 gives the relationship between the number of attribute groups, v, and the maximum number of attributes within attribute groups, ũ. It can be observed from Table 1 that the maximum number of attributes within attribute groups decreases as the number of attribute groups increases.

Table 1. Number of attribute groups vs. maximum number of attributes within attribute groups.

v  | 12  | 42  | 72 | 102 | 132 | 162 | 192 | 222 | 252 | 282
ũ  | 159 | 117 | 77 | 42  | 42  | 42  | 42  | 24  | 11  | 7
When the semantic attributes in the CUB dataset are divided into 12 groups, a total of 153 attributes are distributed across 11 groups, as demonstrated in Fig. 2, while the remaining 159 attributes are clustered together as one group. Note that the number after each attribute in Fig. 2 stands for the total number of attributes that share the same keyword, e.g., "brown: 6" means a total of 6 attributes are related to "brown", such as "has_underparts_color::brown", "has_breast_color::brown", "has_throat_color::brown", etc. Figure 2 indicates that semantic attributes with the same color tend to be grouped together, such as the groups with the single color of white, black, blue, and yellow. The groups obtained by the WRHAG model are not exactly the same as previous groups divided by part locations, e.g., grouped by underparts_color, breast_pattern, tail_shape, and bill_length.
A reasonable grouping of semantic attributes is critical to the WRHAG model since it has a direct influence on the subsequent attribute enhancement. Since the number of attribute groups, v, and the multiple of attribute enhancement, q, both influence zero-shot image classification accuracy, it is important to determine their optimal values after obtaining the hierarchy of semantic attributes; the optimal values obtained are v = 12 and q = 3. Appropriate attribute grouping and enhancement help improve zero-shot image classification performance; however, the specific fineness of attribute grouping and the quantity of attribute enhancement need to be determined according to the given number of semantic attributes.
3.3 Zero-Shot Image Classification
To evaluate the zero-shot image classification performance of WRHAG, we investigate seven related methods for comparison on the CUB dataset. In order to compare the results fairly, the same experimental settings are used for all the comparative experiments.
The selected seven baselines are DAP [17], DeSVA [7], MTEAG [8], SAE [11], WRSAG (weighted reconstruction of semantic attribute groups), RHA (reconstruction of hybrid attributes), and RHAG (reconstruction of hybrid attribute groups). The WRSAG, RHA, and RHAG models are essentially special cases of the proposed WRHAG model.
Fig. 2. Grouping of semantic attributes. Part of groups (11 out of 12 groups) in CUB dataset.
Table 2 gives the average accuracy of the eight models for zero-shot image classification.

Table 2. Comparison of mean accuracy (%) of zero-shot image classification.

Model             | DAP [17] | DeSVA [7] | MTEAG [8] | SAE [11] | WRSAG | RHA   | RHAG  | WRHAG
Mean accuracy (%) | 10.74    | 32.15     | 33.28     | 61.40    | 52.85 | 64.03 | 64.51 | 64.75

The following conclusions can be drawn from Table 2.
1) The accuracies of DeSVA and MTEAG are both higher than that of DAP, indicating that attribute grouping achieves good classification performance. In addition, the accuracy of RHAG is higher than that of RHA, suggesting that attribute grouping through hierarchical clustering achieves better classification performance.
2) Compared with WRSAG, WRHAG achieves higher classification accuracy, which indicates that the enhanced hybrid attributes have a better ability to express and discriminate classes.
3) WRHAG is superior to RHAG in zero-shot image classification. This stems from the weighted attribute groups, which account for the differences between attribute groups across classes.
4) The accuracies of SAE, WRSAG, RHA, RHAG, and WRHAG are all much higher than those of DAP, DeSVA, and MTEAG, which indicates that reconstruction by the autoencoder better learns the mapping between features and attributes.
5) Since WRHAG combines the advantages of semantic attribute grouping, attribute enhancement, hybrid attribute group weighting, and weighted-autoencoder-based attribute reconstruction, it achieves the highest accuracy.
The confusion matrices are employed to further evaluate the zero-shot image classification performance of the eight models, as they give more detailed per-class information. Figure 3 visualizes the confusion matrices of four of the proposed models. The number located on the main diagonal of the confusion matrix
stands for the number of testing samples classified into the correct class, while the off-diagonal numbers are misclassifications. A larger number in a block (with a darker background) means more testing samples were assigned to it. Since the number of testing classes in the CUB dataset is 50, Fig. 3 only illustrates part of the confusion matrices, covering 10 testing classes for clarity: "Brewer_Blackbird", "Black_billed" (short for "Black_billed_Cuckoo"), "Pacific_Loon", "Baltimore_Oriole", "Western_Wood" (short for "Western_Wood_Pewee"), "Great_Grey" (short for "Great_Grey_Shrike"), "Tree_Sparrow", "Golden_winged" (short for "Golden_winged_Warbler"), "Cedar_Waxwing", and "American_Three" (short for "American_Three_toed_Woodpecker").
Fig. 3. Confusion matrices of zero-shot image classification results. (a) WRSAG. (b) RHA. (c) RHAG. (d) WRHAG.
For the 10 testing classes in the CUB dataset, comparing Fig. 3(a)-(c) with Fig. 3(d) shows that the WRHAG model obtains the optimal classification performance in 7 of the classes, and the remaining 3 classes exhibit accuracies very close to the optimal values. As representative results, a total of 53 images in "Great_Grey" are classified correctly, an accuracy of 88.33%, while 43 images in "American_Three" are classified correctly, an accuracy of 86.00%. These results further indicate that the proposed WRHAG model achieves overall better classification performance across the testing classes.
4 Conclusion
Zero-shot image classification arises under the condition that the training classes and testing classes are given without intersection, and attributes are one kind of critical auxiliary information for solving it. Fully mining and expressing the semantic attribute relationships, as well as learning a reasonable mapping between features and attributes, are both important for the classification performance of attribute-based zero-shot learning. By considering these two points, we proposed a novel zero-shot learning method called WRHAG. The main advantages of our method are as follows.
1) The limited number of attributes is addressed by hybrid attribute groups, which are constructed by grouping and enhancing semantic attributes.
2) The quality of the mapping between features and attributes is significantly improved by the weighted autoencoder, in which the weight coefficients account for the proportions of the attribute groups.
3) The redundancy of hybrid attributes is removed by introducing the structured sparse constraint term, which promotes the selection of discriminative attributes.
4) Zero-shot image classification is achieved by calculating the similarity between the features of testing samples and the predicted class features in the feature space.
The comparison experiments show that the proposed WRHAG model effectively improves the accuracy of zero-shot image classification on the CUB dataset.
Acknowledgements. This work was supported by the Natural Science Foundation of Jiangsu Higher Education Institutions of China under Grant 21KJB520005; the Jiangsu Normal University Foundation under Grant 21XSRS001; the Natural Science Foundation of Jiangsu Province under Grant BK20200632; and the National Natural Science Foundation of China under Grant 41902273.
References

1. Xian, Y., Lampert, C.H., Schiele, B., Akata, Z.: Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell. 41(9), 2251–2265 (2019)
2. Xie, G., Zhang, Z., Xiong, H., Shao, L., Li, X.: Towards zero-shot learning: a brief review and an attention-based embedding network. IEEE Trans. Circuits Syst. Video Technol. 33(3), 1181–1197 (2023)
3. Yang, H., et al.: Iterative class prototype calibration for transductive zero-shot learning. IEEE Trans. Circuits Syst. Video Technol. 33(3), 1236–1246 (2023)
4. Reed, S., Akata, Z., Lee, H., Schiele, B.: Learning deep representations of fine-grained visual descriptions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 49–58. IEEE, Las Vegas (2016)
5. Yu, Y., Ji, Z., Guo, G., Zhang, Z.: Zero-shot learning via latent space encoding. IEEE Trans. Cybern. 49(10), 3755–3766 (2019)
6. Han, H., Jain, A.K., Wang, F., Shan, S., Chen, X.: Heterogeneous face attribute estimation: a deep multi-task learning approach. IEEE Trans. Pattern Anal. Mach. Intell. 40(11), 2597–2609 (2018)
7. Jayaraman, D., Sha, F., Grauman, K.: Decorrelating semantic visual attributes by resisting the urge to share. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1629–1636. IEEE, Columbus (2014)
8. Wang, X., Li, Q., Gong, P., Cheng, Y.: Zero-shot learning based on multitask extended attribute groups. IEEE Trans. Syst. Man Cybern. Syst. 51(3), 2003–2011 (2021)
9. Zhang, J., Wang, X., Cheng, Y.: Broad attribute prediction model with enhanced attribute and feature. IEEE Access 7, 124606–124620 (2019)
10. Akata, Z., Reed, S., Walter, D., Lee, H., Schiele, B.: Evaluation of output embeddings for fine-grained image classification. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2927–2936. IEEE, Boston (2015)
11. Kodirov, E., Xiang, T., Gong, S.: Semantic autoencoder for zero-shot learning. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3174–3183. IEEE, Honolulu (2017)
12. Ranzato, M., Boureau, Y.L., LeCun, Y.: Sparse feature learning for deep belief networks. In: Proceedings of the 20th International Conference on Neural Information Processing Systems, pp. 1185–1192. Curran Associates Inc., Vancouver (2009)
13. Kim, S., Xing, E.P.: Tree-guided group lasso for multi-task regression with structured sparsity. In: Proceedings of the 27th International Conference on Machine Learning, pp. 543–550. Omnipress, Haifa (2010)
14. Bartels, R.H., Stewart, G.W.: Solution of the matrix equation AX + XB = C [F4]. Commun. ACM 15(9), 820–826 (1972)
15. Wah, C., Branson, S., Perona, P., Belongie, S.: Multiclass recognition and part localization with humans in the loop. In: Proceedings of the 2011 International Conference on Computer Vision, pp. 2524–2531. IEEE, Barcelona (2011)
16. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., et al.: Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. IEEE, Boston (2015)
17. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by between-class attribute transfer. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 951–958. IEEE, Miami (2009)
Adaptive Clustering-Based Collusion Detection in Crowdsourcing Ruoyu Xu1(B) , Gaoxiang Li1 , Wei Jin2 , Austin Chen3 , and Victor S. Sheng1 1 Computer Science Department, Texas Tech University, Lubbock, TX, USA
{ruoyxu,gaoli,victor.sheng}@ttu.edu
2 Computer Science and Engineering Department, University of North Texas, Denton, TX, USA
[email protected] 3 Lubbock High School, 2004 19th Street, Lubbock, TX, USA
Abstract. Crowdsourcing is a popular approach in which crowd workers collaborate to get tasks done. However, some workers communicate with each other and share answers during the crowdsourcing process, which is referred to as "collusion". Copying from others and submitting repeated answers are detrimental to the quality of the tasks. Existing studies on collusion detection focus on ground truth problems (e.g., labeling tasks) and require a fixed threshold to be set in advance. In this paper, we aim to detect the collusion behavior of workers in an adaptive way, and propose an Adaptive Clustering Based Collusion Detection approach (ACCD) for a broad range of task types and data types solved via crowdsourcing (e.g., continuous rating with or without distributions). Extensive experiments on both real-world and synthetic datasets show the superiority of ACCD over state-of-the-art approaches.

Keywords: Crowdsourcing · Collusion detection · Colluders · Collusion
1 Introduction

Crowdsourcing is the concept of providing collaborative services by utilizing the intelligence, skills, and creativity of a large group of people [10]. It has been shown to aid in solving problems and scientific challenges that machines cannot solve very well. In academia and industry, crowdsourcing has become a popular way of addressing tasks that are tough for computers but simple for humans [17], such as image labeling [6], sentiment analysis [14] and handwriting recognition [8, 21]. A slew of studies has found that the quality of solutions is inextricably linked to the quality of workers. The quality of workers here refers not only to their level of knowledge and responsible attitude, but also to whether they are honest. Research shows that normal collaboration in crowdsourcing improves solution quality [18]. However, some participants may converse via social media and even duplicate other participants' answers while doing tasks. This is referred to as "collusion". There are many reasons or motivations for collusion. For example, participants may want to get paid with less effort, or they may have malicious goals (such as inflating the rating of a product). It is obvious that collusion can decrease the quality of crowdsourced solutions [11]. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 261–275, 2023. https://doi.org/10.1007/978-981-99-4752-2_22
In order to reduce the impact of participant collusion and improve the quality of crowdsourced solutions, only a few cutting-edge collusion detection methods are available, such as FINDCOLLUDERS [11], Collusion-Proof [5] and PROCAP (Pairwise Recognition Of Collusion with Asymptotic Promise) [19]. However, their applications are constrained by their specific requirements, including the kinds of tasks and the corresponding collaboration mechanisms. Specifically, FINDCOLLUDERS is applicable only to rating problems. The Collusion-Proof approach works only if all conspirators give the same answer. PROCAP is designed for labeling tasks. Having all conspirators provide the same solution is just the most straightforward, naive form of collusion; conspirators usually play smart and try to evade collusion detection in different ways, for example, by adding some noise to their responses. In this paper, we propose a new method, an adaptive clustering-based collusion detection approach (ACCD), for a broad range of task types and data types solved via crowdsourcing.
2 Related Work

It is undeniable that collaboration can improve the solution quality of crowdsourcing. However, collusion occurs when crowd workers cooperate improperly. Several studies [3, 11, 13, 16, 23] illustrate the threat of collusion to crowdsourcing from the perspective of investigative tasks. Collusion via collaborative crowdsourcing can be categorized into three types [5, 9]:

Duplicate Submission. On public crowdsourcing platforms like Amazon Mechanical Turk, workers' income is determined by the quality of their contributions. As a result, a group of workers may respond to a task with the same answers [4]. To ensure that everyone gets paid, group members submit duplicate responses to tasks. This significantly harms the solution quality of the tasks due to the lack of diversity.

Group Plagiarism. Some workers only want to make money, so they are more inclined to form a colluding group [22]. Some group members simply copy one worker's responses after he or she completes all the tasks. This becomes worse if nobody in the group carefully processes the tasks, so that members simply make up answers at random.

Spam Accounts. Identical responses can be submitted to a task multiple times using multiple accounts on a crowdsourcing platform [1].

Recent research has focused on the harm caused by collusion, and several collusion detection methods have been presented. FINDCOLLUDERS [11], Collusion-Proof [5] and PROCAP [19] are the most popular. FINDCOLLUDERS starts from workers' responses to identify similarities between workers. Collusion-Proof focuses on worker performance with duplicate responses and finds collusion through the effect of repeated responses on the overall outcome. PROCAP detects collusive behaviors based on pairwise profile likelihood estimation.
In the following subsection, we briefly overview these three approaches and discuss their shortcomings.

2.1 Overview and Shortcoming Analysis

FINDCOLLUDERS [11] is designed for opinion-based rating problems. Its main idea is simple: calculate the cosine similarity between the answers of any two workers. If
the similarity of any two workers is greater than a fixed threshold (e.g., 0.85), the two workers are considered to be in collusion. Collusion-Proof [5] identifies collusion by calculating the rate of change of the results on the original data before and after removing duplicate answers. The basic idea of this approach is that colluding people are more likely than honest workers to provide the same answer. If the worker performance change rate is greater than or equal to a predefined threshold (e.g., 0.028), the workers are considered to be colluding. PROCAP [19] focuses on the quality of ground truth inference for labeling tasks. Its collusion detection relies on pairwise profile likelihood estimation with an adaptive LASSO penalty to ensure sparsity. PROCAP can achieve good performance in ground truth inference and collusion detection. However, it requires a relatively large amount of data and must adopt a proper baseline model to ensure good performance, which makes the method insufficiently flexible. It still requires parameter tuning and a threshold setting to eliminate false detections.

These algorithms have some common limitations. First, they are designed for specific tasks. FINDCOLLUDERS focuses on opinion-based rating questions, which makes it unsuitable for many other types of tasks, e.g., object identification and image segmentation. Collusion-Proof detects colluders who submit repeated answers, but colluders frequently modify some responses to avoid detection. PROCAP is designed for labeling tasks and may not work well for rating and ranking tasks. In addition, these approaches judge collusion using a fixed threshold, which makes them inaccurate and hard to deploy. The fixed threshold must be determined through many tests, and the determination process depends heavily on the design of the experiments and the experimental datasets. The threshold needs to be re-tuned for different datasets.
However, threshold adjustment is tough. To summarize, most existing studies rely on preset thresholds and are designed for specific types of crowdsourcing tasks. This motivates us to develop a novel Adaptive Clustering-Based Collusion Detector (ACCD), described as follows.
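As an illustration, the fixed-threshold pairwise rule that FINDCOLLUDERS-style detectors apply can be sketched as follows. The 0.85 threshold matches the example above, but the code and the toy rating vectors are ours, not the original implementation.

```python
# Sketch of fixed-threshold pairwise collusion detection: flag any pair of
# workers whose answer vectors exceed a preset cosine-similarity threshold.
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def flag_colluders(answers, threshold=0.85):
    """answers: {worker_id: [rating per task]} -> set of flagged workers."""
    flagged = set()
    for (wa, ua), (wb, ub) in combinations(answers.items(), 2):
        if cosine(ua, ub) > threshold:
            flagged |= {wa, wb}
    return flagged

# Toy example: w1 and w2 submit identical ratings, w3 answers independently.
answers = {"w1": [1, 2, 9, 1], "w2": [1, 2, 9, 1], "w3": [9, 1, 2, 8]}
flagged = flag_colluders(answers)
```

The weakness discussed above is visible here: whether w1 and w2 are flagged depends entirely on the hard-coded threshold, which must be re-tuned per dataset.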
3 Adaptive Clustering-Based Collusion Detector ACCD

Before introducing the details of ACCD, we summarize the notations used in this paper in Table 1 and give the settings of the approach. Let W = {w_1, …, w_i, …, w_n} be the worker set, where w_i is the ith worker and n is the total number of workers; T = {t_1, …, t_j, …, t_m} the task set, where t_j is the jth task and m is the total number of tasks; and A = {a_11, …, a_ij, …, a_nm} the answer set, where a_ij ∈ A is the answer that worker i reports for task j. Given a label set L = {l_1, …, l_i, …, l_n}, where l_i is the collusion label of worker i, we use binary classification to define worker behavior categories: l_i = −1 indicates that worker i is an independent worker; otherwise, l_i ≥ 0 means that worker i is a colluder. The same holds for the inference label set L̂ = {l̂_1, …, l̂_i, …, l̂_n}, where l̂_i is the inferred collusion label of worker i: l̂_i = −1 means that worker i is predicted to be an independent worker, and l̂_i ≥ 0 means that worker i is predicted to be a colluder.

Crowdsourcing Setting. In this study, we examine the crowdsourcing scenario in which people (referred to as requesters) seek employees (referred to as workers) to carry
Table 1. Notations

Symbol | Description
-------|------------
W      | Worker set
T      | Task set
A      | Answer set
n      | Number of workers
m      | Number of tasks
w_i    | The ith worker
t_j    | The jth task
a_ij   | Answer of the ith worker to the jth task
L      | Label set of workers' answers
l_i    | Label of the ith worker in L
L̂      | Label set of the inference result
l̂_i    | Label of the ith worker in L̂
out tasks in exchange for a reward. A requester creates a list of human intelligence tasks and publishes them to workers via a crowdsourcing platform (e.g., Amazon Mechanical Turk) [20].

Worker Setting. We assume that workers are randomly selected from the application pool. The selected workers include both independent and collusive participants, which is not known in advance.

Task Setting. We consider a variety of crowdsourcing tasks whose outputs are all in a numerical format, including rating problems (e.g., product rating [11] and ranking [15]) and ground truth problems (e.g., image labeling [6], object identification and handwriting recognition [8, 21]), which have a true answer that is unknown a priori.

Answer Setting. We consider two distinct types of answers. (1) The first category is honest responses, which are provided by independent workers based on their own experience and prior knowledge. (2) The second category is given by insincere people who work dishonestly. These collusive answers come from the three types of collusive behavior: duplicate submission, group plagiarism, and spam accounts.

3.1 Collusion Detection

From the above analysis of the collusion detection methods in Sect. 2.1, it is obvious that their major issue is requiring a fixed threshold to be set in advance, against which a calculated value is compared to determine collusion. As stated previously, the threshold is an empirical value derived through a vast number of experiments and is inextricably linked to the dataset used; a threshold found for one dataset cannot be directly applied to other datasets. Therefore, we propose an Adaptive Clustering-Based Collusion Detector (ACCD) to solve this issue.
Inspired by Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [2], we develop ACCD on top of HDBSCAN to adaptively detect the collusion behavior of colluders across a variety of data distributions. For traditional density-based clustering methods such as DBSCAN [7] and OPTICS [12], the quality of the clustering results depends heavily on tuning parameters such as min_cluster_size and min_samples. The former specifies the minimum cluster size and is a critical parameter of the clustering method: the larger this parameter, the fewer clusters there will be, and any group of points smaller than it is deemed "noise". The second parameter, min_samples, specifies the minimum number of samples a point must have in its neighborhood to be considered a core point. If a method's results are sensitive to its parameter choices, the method is not flexible or adaptive across scenarios. Our HDBSCAN-based ACCD method, however, is not sensitive to these parameters, so it achieves both flexibility and adaptability.

Unlike previous density-based clustering methods, the core idea of HDBSCAN is to measure distance differently, using two key definitions: core distance and mutual reachable distance. The core distance of a sample is its distance to the kth nearest neighboring sample point. The mutual reachable distance of two sample points is the maximum of their core distances and the distance between them: d_mreach(a, b) = max{d_core(a), d_core(b), d(a, b)}, where d(a, b) denotes the distance between points a and b. Distances within dense regions are unchanged, while distances in sparse regions grow, which makes it easier for the algorithm to handle noise points and increases its robustness to them. The specific parameter settings and approach development are given in Sect. 4.

Here we take the real-world dataset (20 tasks, 123 participants and 6 non-overlapping colluding groups) as an example to further explain ACCD, setting the input parameters min_cluster_size and min_samples to 3 and 4, respectively. The specific steps are as follows:

1) Transform the space. Figure 1(a) visualizes the raw data. We first recompute the distance between points using the mutual reachable distance. The goal is to keep dense points (with lower core distances) at the same distance from each other while pushing sparser points away, so that their core distances are at least as far from any other point as possible.
2) Build the minimum spanning tree. A minimum spanning tree is built from the new mutual reachable distances to find areas of high data density, as shown in Fig. 1(b).
3) Build the cluster hierarchy. The cluster hierarchy then converts the minimum spanning tree into a graph-splitting hierarchy, as illustrated in Fig. 1(c).
4) Condense the cluster tree. The cluster tree is condensed from top to bottom through tree traversal (Fig. 1(d)): a compressed node is split into two parts if and only if both clusters produced by the split are larger than min_cluster_size (e.g., 3).
5) Extract the clusters. The next step is to extract the clusters. Following the algorithm's stability function and the theory of hierarchical clustering, the clustering result is taken where cluster stability is optimal. In this example, 7 clusters are extracted, as seen in Fig. 1(e).
6) Visualize the clustering results. Finally, the collusive clustering is visualized through the labels produced by the HDBSCAN algorithm (Fig. 1(f)): the 7 clusters are collusive clusters, and the noise points represent independent workers.
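The distance transform in step 1) can be sketched in a few lines. This is an illustrative reimplementation of the definitions above (core distance to the k-th nearest neighbour, and d_mreach(a, b) = max{d_core(a), d_core(b), d(a, b)}), not the library's code; the point set is a toy example.

```python
# Sketch of HDBSCAN's mutual reachability transform on a toy 2-D point set.
import math

def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def mutual_reach(points, i, j, k):
    """d_mreach(a, b) = max{d_core(a), d_core(b), d(a, b)}."""
    return max(core_distance(points, i, k),
               core_distance(points, j, k),
               math.dist(points[i], points[j]))

# Three points in a dense cluster plus one far-away outlier.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
```

Note the effect described in the text: the distance between the two dense points barely changes, while any distance involving the outlier is inflated by its large core distance.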
Fig. 1. The procedure of ACCD with HDBSCAN: (a) Raw Data Visualization; (b) Minimum Spanning Tree; (c) Single Linkage Tree; (d) Condensed Tree; (e) Select Clusters; (f) Cluster Result.
The pseudocode of our ACCD algorithm is presented in Algorithm 1. The answer set A is an n × m matrix containing the answers of the whole crowdsourcing task, and the label set L is an n × 1 matrix containing the group number of every worker. Given the answer set A from the real or synthetic datasets, HDBSCAN (lines 2 to 11) returns the clustering results, i.e., the inference result set denoted as L̂. For each l̂_i ∈ L̂, l̂_i = −1 means that worker i is an independent worker; any other value indicates that the worker has colluded, and the number specifies the group he or she belongs to. Since we mainly care about whether workers are colluding, we divide the inference results into two categories, marking independent workers as 0 and colluders as 1.
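The final binarization step described above can be sketched as follows, assuming the clustering has already produced HDBSCAN-style labels (-1 for noise, i.e., independent workers; any non-negative value is a colluding-group index). The label list is a made-up example.

```python
# Sketch of ACCD's post-processing: collapse group indices into a binary
# collusion label (0 = independent, 1 = colluder).
def binarize(cluster_labels):
    """Map HDBSCAN-style labels to 0 (independent) / 1 (colluder)."""
    return [0 if label == -1 else 1 for label in cluster_labels]

# Hypothetical clustering output for six workers: two groups (0 and 1)
# plus two noise points.
binary_labels = binarize([-1, 0, 0, 1, -1, 1])  # [0, 1, 1, 1, 0, 1]
```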
4 Experiments

In this section, we conduct experiments to investigate the performance of our ACCD in comparison with three baseline approaches (i.e., FINDCOLLUDERS, Collusion-Proof and PROCAP). We first construct simulated datasets, and then conduct experiments on both real-world and synthetic datasets. Finally, we present the performance of these approaches on the various datasets in terms of precision, recall, F1 score and accuracy.

4.1 Experimental Setup

Real and Synthetic Datasets. Obtaining reliable real-world data to evaluate collusion detection methods is challenging. The reasons for the lack of such datasets can be summarized in three main points. First, marketing companies and product manufacturers do not release crowdsourced product rating data, because sharing data may carry the risk of reducing their competitive business value. Second, labeling data for collusion groups requires significant time and effort, as well as personnel with the appropriate expertise; in crowdsourcing tasks, there is no dedicated person responsible for this. Third, since collusion is often unacceptable and may even lead to penalties (e.g., cancellation of commissions or bans from future assignments), colluders may not admit their behavior, which makes it harder to obtain the ground truth.
The real dataset is the result of a product rating problem from an e-commerce company [11] and is the only public dataset where workers admit their collusion behavior. It contains 20 different rating tasks, each with a score range of 1 to 10. The data includes 123 participants: 87 of them are independent, and 36 are suspected of colluding. The 36 colluders are grouped into 6 groups (Groups A-F), with each group consisting of 3 to 11 members. The dataset contains 4 missing values distributed across 4 rows. Because the overall data is intrinsically limited (i.e., 123 rows), we perform imputation, replacing each missing value with the mean of its corresponding column.

Due to the limited availability of real data, we construct multiple synthetic datasets. To simulate a variety of crowdsourcing problems and investigate the performance of the tested collusion detectors, the synthetic datasets include both rating problems and ground truth problems. Rating problems are more subjective inquiries in which consumers give a product a personal rating based on their own opinion and experience. These kinds of problems elicit many different responses, and there is no one-size-fits-all solution. Ground truth problems are those that have actual answers, which are responses based on prior knowledge and common scientific sense. Crowdsourced responses to this type of question usually do not differ much, and most of them are correct (i.e., the same as the ground truth). To further investigate the performance of our ACCD, for both rating problems and ground truth problems, we generate two types of responses: categorical and continuous. A categorical data type has a limited number of categories or groups to choose from. A continuous data type (i.e., a numeric variable) has infinitely many continuous values. In summary, we create a collection of simulated datasets that include the answers from non-colluding and colluding workers, controlled by the following four parameters:

• Number of tasks: the total number of tasks.
• Number of workers: the total number of people involved in these tasks.
• Non-collusion ratio: the percentage of workers who did not collude.
• Number of collusion groups: the total number of collusion groups in these tasks.
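As an illustration only, the four parameters above could drive a toy generator that lays out which workers are independent and which belong to each colluding group. All names here are hypothetical and this is not the generator actually used in the experiments.

```python
# Toy sketch: turn (workers, non-collusion ratio, groups) into a worker layout
# of independents plus randomly sized, non-empty colluding groups.
import random

def layout(n_workers, non_collusion_ratio, n_groups, seed=0):
    rng = random.Random(seed)
    n_independent = round(n_workers * non_collusion_ratio)
    colluders = list(range(n_independent, n_workers))
    rng.shuffle(colluders)
    # Cut the colluder list into n_groups non-empty groups at random boundaries.
    cuts = sorted(rng.sample(range(1, len(colluders)), n_groups - 1))
    groups = [colluders[a:b] for a, b in zip([0] + cuts, cuts + [len(colluders)])]
    return list(range(n_independent)), groups

# Benchmark-like setting: 250 workers, 70% independent, 4 colluding groups.
independents, groups = layout(n_workers=250, non_collusion_ratio=0.7, n_groups=4)
```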
As mentioned before, the answers to rating problems reflect people's subjective feelings. Thus, we randomly create the independent workers' answers using a uniform distribution. Within a collusion group, the member who produces the solutions is referred to as the leader; everyone else is a follower, whose responses are very likely to be identical or similar to the leader's. Accordingly, the leader's answers are generated the same way as an independent worker's. Some of a follower's responses are matched to the group leader's, while the others are generated randomly: for each task assigned to a follower, a predetermined probability ϕ_r, where ϕ_r ∼ N(0.5, 0.2), determines whether the answer to that particular task is generated randomly. We believe that some colluders will purposely offer random responses to confuse detectors and avoid being identified.

For ground truth problems, we assume that the correct solutions are known and that tasks are assigned to workers at random. To generate answers for independent workers, Gaussian noise p, where p ∼ N(0, 1), is added to the actual answer. In the collusion groups, we separate group members into a leader and followers. We generate the leader's answer the same way as for independent workers, i.e., by adding noise p to the ground truth answer. A fraction ϕ_g of the answers given by the followers of the same group is set to match the leader's, and the remaining 1 − ϕ_g are answered randomly to avoid exposure, where ϕ_g ∼ N(0.7, 0.1). Because the set of correct answers is fixed, the difference between the answers from independent workers and those from colluders is minimal; identifying colluders is therefore harder than for rating questions.

Collusion Detection Method. For our Adaptive Clustering-Based Collusion Detector (ACCD), we set min_cluster_size to 3, which signifies that the smallest allowed cluster size across all groups is three.
The larger min_samples is, the more conservative the clustering: more points will be labeled as noise, and clusters will be restricted to progressively denser areas. In this case, we set min_samples to 4. Then, we fit the answers to our ACCD as stated in Sect. 3.1 and obtain the labeled results.
Then, we compute precision, recall, F1 score and accuracy by comparing these results to the original labeled dataset, as we do for the other three approaches.

Evaluation Metrics. We evaluate the performance of the collusion detectors in terms of multiple metrics: the precision, recall, and F1 score of detecting collusion behaviors, and the overall accuracy of determining collusion or non-collusion behaviors.
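The four evaluation metrics can be computed with a short self-contained routine, treating "colluder" as the positive class; the example labels below are hypothetical.

```python
# Precision, recall, F1 and accuracy for binary collusion labels
# (1 = colluder, 0 = independent), with "colluder" as the positive class.
def metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, f1, accuracy

# Toy example: two true colluders, one of whom is detected.
p, r, f1, acc = metrics([1, 1, 0, 0], [1, 0, 0, 0])
```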
4.2 Evaluation on Real-World and Synthetic Datasets

We first conduct experiments to compare the performance of FINDCOLLUDERS, Collusion-Proof, PROCAP and our ACCD on the real-world dataset. The results are illustrated in Table 2, which shows that our ACCD consistently performs better than the other three methods in terms of all measures. Specifically, ACCD achieves the highest precision (0.829), recall (0.944), F1 score (0.883) and accuracy (0.927) on the real-world dataset, followed by PROCAP and FINDCOLLUDERS. Collusion-Proof performs the worst on all measures.

Table 2. Performance of Different Detection Methods on the Real-World Dataset

Methods         | Precision | Recall | F1 Score | Accuracy
----------------|-----------|--------|----------|---------
FINDCOLLUDERS   | 0.775     | 0.861  | 0.816    | 0.886
Collusion-Proof | 0.583     | 0.389  | 0.467    | 0.740
PROCAP          | 0.804     | 0.736  | 0.768    | 0.772
ACCD            | 0.829     | 0.944  | 0.883    | 0.927
On the synthetic datasets, we first assume there are 50 tasks with the same fixed scale (i.e., 1 to 10) and equal difficulty. A total of 250 workers participate, and 30% of them are colluders. There are 4 collusion groups, and the number of members in each group is determined at random. That is, we set the number of tasks to 50, the number of workers to 250, the non-collusion ratio to 0.7 and the number of collusion groups to 4. We conduct experiments on the four types of crowdsourcing problems mentioned before, i.e., categorical rating problems, continuous rating problems, categorical ground truth problems and continuous ground truth problems. The results are shown in Table 3. Note that the collusion detection process of PROCAP highly depends on the workers' confusion matrix, which makes PROCAP unable to fit data with continuous problems; therefore, we add PROCAP as a comparison baseline only for categorical problems.

The experimental results in Table 3 illustrate that our ACCD performs the best in terms of all measures for the four different types of crowdsourcing problems. Its lowest recall, 0.933, is achieved on the categorical rating problems; on both the continuous rating problems and the categorical ground truth problems, its recall is 1.00. FINDCOLLUDERS performs well in terms of precision but badly in terms of recall, which means many colluders are not detected. PROCAP and Collusion-Proof perform better than FINDCOLLUDERS in terms of F1 score and accuracy, but their overall performance is still much lower than ACCD's. These results show that our ACCD is a promising approach that can detect almost all colluders.

Table 3. Performance of Different Detection Methods for Different Types of Problems

Problem Types                     | Methods         | Precision | Recall | F1 Score | Accuracy
----------------------------------|-----------------|-----------|--------|----------|---------
Categorical Rating Problems       | FINDCOLLUDERS   | 1.000     | 0.027  | 0.052    | 0.708
                                  | Collusion-Proof | 0.900     | 0.120  | 0.212    | 0.732
                                  | PROCAP          | 0.685     | 0.631  | 0.656    | 0.760
                                  | ACCD            | 1.000     | 0.933  | 0.966    | 0.980
Continuous Rating Problems        | FINDCOLLUDERS   | 1.000     | 0.667  | 0.800    | 0.720
                                  | Collusion-Proof | 0.676     | 0.333  | 0.446    | 0.752
                                  | ACCD            | 0.915     | 1.000  | 0.955    | 0.972
Categorical Ground Truth Problems | FINDCOLLUDERS   | 1.000     | 0.107  | 0.193    | 0.732
                                  | Collusion-Proof | 0.554     | 0.671  | 0.607    | 0.737
                                  | PROCAP          | 0.800     | 0.693  | 0.743    | 0.810
                                  | ACCD            | 0.872     | 1.000  | 0.932    | 0.956
Continuous Ground Truth Problems  | FINDCOLLUDERS   | 1.000     | 0.093  | 0.171    | 0.728
                                  | Collusion-Proof | 0.000     | 0.000  | 0.000    | 0.700
                                  | ACCD            | 0.935     | 0.960  | 0.947    | 0.968
4.3 Experiments on Synthetic Datasets with Various Settings

To further evaluate the performance of our ACCD at detecting collusion behavior, we conduct more experiments with various settings and observe its fluctuations in accuracy. We create 4000 simulated datasets in total, 1000 for each type of problem (i.e., categorical rating, continuous rating, categorical ground truth and continuous ground truth problems). The settings vary the number of tasks (10, 20, 30, 40, 50), the number of workers (100, 150, 200, 250, 300), the non-collusion ratio (0.5, 0.6, 0.7, 0.8, 0.9), and the number of collusion groups (2, 3, 4, 5, 6, 7, 8, 9). For each type of problem, we choose the combination of 50 tasks, 250 workers, a non-collusion ratio of 0.7 and 4 collusion groups as the benchmark, and we keep three of the data generator's four variables constant while changing only the remaining one to generate the datasets. Finally, the accuracy of the four methods on all four types of problems is used for performance evaluation.
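The one-factor-at-a-time design described above can be sketched as follows; the benchmark and grid values come from the text, while the function and variable names are our own hypothetical choices.

```python
# Sketch of the experiment grid: hold three of the four generator parameters
# at the benchmark and vary the fourth over its listed values.
BENCHMARK = {"tasks": 50, "workers": 250, "non_collusion_ratio": 0.7, "groups": 4}
GRID = {
    "tasks": [10, 20, 30, 40, 50],
    "workers": [100, 150, 200, 250, 300],
    "non_collusion_ratio": [0.5, 0.6, 0.7, 0.8, 0.9],
    "groups": [2, 3, 4, 5, 6, 7, 8, 9],
}

def sweep():
    """Yield one setting dict per run, varying a single parameter at a time."""
    for param, values in GRID.items():
        for v in values:
            setting = dict(BENCHMARK)
            setting[param] = v
            yield setting

settings = list(sweep())  # 5 + 5 + 5 + 8 = 23 settings per problem type
```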
Based on the following results, we can clearly see that ACCD outperforms the other methods and is applicable to a variety of dataset types.

Various Numbers of Tasks. In this experiment, the number of tasks varies from 10 to 50 in steps of 10. The results, shown in Fig. 2, indicate that ACCD achieves higher accuracy than PROCAP and Collusion-Proof. Compared with FINDCOLLUDERS, the accuracy of ACCD is slightly lower when the number of tasks is fewer than 20. However, whereas the performance of FINDCOLLUDERS decreases as the number of tasks increases, the performance of our ACCD increases, and ACCD gradually outperforms FINDCOLLUDERS once the number of tasks exceeds 20.
Fig. 2. The Accuracy of Different Methods against the Number of Tasks. Panels: (a) Categorical Rating, (b) Continuous Rating, (c) Categorical Ground Truth, (d) Continuous Ground Truth.
Various Numbers of Workers. In this experiment, the number of workers varies from 100, 150, 200, 250 to 300. Our experimental results are shown in Fig. 3. From Fig. 3, we can see that ACCD consistently performs very well: better than PROCAP on the categorical rating and categorical ground truth problems, and better than FINDCOLLUDERS and Collusion-Proof on all types of problems.

Various Non-Collusion Ratios. In this experiment, the non-collusion ratio varies from 0.5 to 0.9. Our experimental results are shown in Fig. 4. The results show that ACCD consistently performs best under all non-collusion ratios. As the non-collusion ratio increases, the detection task becomes relatively easier; the performance of FINDCOLLUDERS, Collusion-Proof and PROCAP improves but remains lower than that of ACCD.

Various Numbers of Collusion Groups. In this experiment, the number of collusion groups varies from 2 to 9. Our experimental results are shown in Fig. 5. From Fig. 5, we can see that ACCD not only performs better than FINDCOLLUDERS, Collusion-Proof and PROCAP, but also maintains consistently high performance on all types of problems.
R. Xu et al.
Fig. 3. The Accuracy of Different Methods against the Number of Workers. Panels: (a) Categorical Rating, (b) Continuous Rating, (c) Categorical Ground Truth, (d) Continuous Ground Truth.
Fig. 4. The Accuracy of Different Methods against the Non-Collusion Ratios. Panels: (a) Categorical Rating, (b) Continuous Rating, (c) Categorical Ground Truth, (d) Continuous Ground Truth.
4.4 Experimental Study on Parameter Sensitivity of ACCD

As mentioned, our ACCD method is not sensitive to the parameters (i.e., min_cluster_size and min_samples) used during the clustering process, which gives ACCD flexibility and adaptability. In this section, we test this sensitivity by running ACCD on a real-world dataset with different parameter settings. The experimental results are listed in Table 4. Table 4 shows that, across different pre-set parameters, ACCD remains stable and consistently outperforms the other collusion detection methods listed in Table 2.
Fig. 5. The Accuracy of Different Methods against the Number of Collusion Groups. Panels: (a) Categorical Rating, (b) Continuous Rating, (c) Categorical Ground Truth, (d) Continuous Ground Truth.
Table 4. Performance of ACCD with Various Parameter Settings on the Real-World Dataset

Parameters                               Accuracy
min_cluster_size = 3, min_samples = 3    0.921
min_cluster_size = 3, min_samples = 4    0.927
min_cluster_size = 3, min_samples = 5    0.913
min_cluster_size = 3, min_samples = 6    0.910
min_cluster_size = 3, min_samples = 7    0.911
min_cluster_size = 4, min_samples = 4    0.906
min_cluster_size = 5, min_samples = 4    0.906
min_cluster_size = 6, min_samples = 4    0.906
min_cluster_size = 7, min_samples = 4    0.904
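The sweep behind Table 4 can be reproduced in outline with a small harness. The parameter names min_cluster_size and min_samples match the HDBSCAN algorithm of Campello et al. [2], but the toy_detect stand-in below is our own invention for illustration, not the real ACCD clustering step.

```python
import numpy as np

def sweep_sensitivity(detect, answers, param_grid, truth):
    """Run a collusion detector over a grid of (min_cluster_size,
    min_samples) pairs and report accuracy for each, as in Table 4."""
    return {
        (mcs, ms): float(np.mean(detect(answers, mcs, ms) == truth))
        for mcs, ms in param_grid
    }

def toy_detect(answers, min_cluster_size, min_samples):
    """Toy stand-in: flag a worker as colluding if its answer vector has
    at least `min_samples` near-duplicates (ignores min_cluster_size)."""
    d = np.linalg.norm(answers[:, None] - answers[None, :], axis=-1)
    close = (d < 0.1).sum(axis=1) - 1  # exclude self-distance
    return (close >= min_samples).astype(int)

# 20 independent workers plus 5 colluders submitting near-identical answers.
honest = np.arange(100, dtype=float).reshape(20, 5)
colluders = 1000.0 + 0.001 * np.arange(25, dtype=float).reshape(5, 5)
answers = np.vstack([honest, colluders])
truth = np.array([0] * 20 + [1] * 5)

grid = [(3, 3), (3, 4), (3, 5)]
print(sweep_sensitivity(toy_detect, answers, grid, truth))
```

With this toy detector, accuracy stays at 1.0 until min_samples exceeds the colluding group's size, illustrating the kind of stability the table reports for the real method.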
5 Conclusions

In this paper, we analyzed the effectiveness of existing collusion detection methods on different types of datasets and found that existing methods cannot handle problem types beyond the ones they were designed for very well. To support collusion detection across different real-world applications, we designed an efficient clustering-based collusion detection method called the Adaptive Clustering-Based Collusion Detector (ACCD). Our experimental results showed that ACCD not only performs much better than existing detection methods, but also maintains consistently high performance across different types of problems under different settings.
References

1. Adams, S.A.: Maintaining the collision of accounts: crowdsourcing sites in health care as brokers in the co-production of pharmaceutical knowledge. Inf. Commun. Soc. 17(6), 657–669 (2014)
2. Campello, R.J., Moulavi, D., Sander, J.: Density-based clustering based on hierarchical density estimates. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 160–172. Springer (2013). https://doi.org/10.1007/978-3-642-37456-2_14
3. Celis, L.E., Reddy, S.P., Singh, I.P., Vaya, S.: Assignment techniques for crowdsourcing sensitive tasks. In: Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pp. 836–847 (2016)
4. Chang, J.C., Amershi, S., Kamar, E.: Revolt: collaborative crowdsourcing for labeling machine learning datasets. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 2334–2346 (2017)
5. Chen, P.P., Sun, H.L., Fang, Y.L., Huai, J.P.: Collusion-proof result inference in crowdsourcing. J. Comput. Sci. Technol. 33(2), 351–365 (2018)
6. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
7. Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, pp. 226–231 (1996)
8. Fang, Y., Sun, H., Li, G., Zhang, R., Huai, J.: Effective result inference for context-sensitive tasks in crowdsourcing. In: Navathe, S., Wu, W., Shekhar, S., Du, X., Wang, X., Xiong, H. (eds.) Database Systems for Advanced Applications. DASFAA 2016. Lecture Notes in Computer Science, vol. 9642, pp. 33–48. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-32025-0_3
9. Gadiraju, U., Kawase, R., Dietze, S., Demartini, G.: Understanding malicious behavior in crowdsourcing platforms: the case of online surveys. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 1631–1640 (2015)
10. Howe, J., et al.: The rise of crowdsourcing. Wired Mag. 14(6), 1–4 (2006)
11. KhudaBukhsh, A.R., Carbonell, J.G., Jansen, P.J.: Detecting non-adversarial collusion in crowdsourcing. In: Second AAAI Conference on Human Computation and Crowdsourcing (2014)
12. Kriegel, H.P., Kröger, P., Sander, J., Zimek, A.: Density-based clustering. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 1(3), 231–240 (2011)
13. Lev, O., Polukarov, M., Bachrach, Y., Rosenschein, J.S.: Mergers and collusion in all-pay auctions and crowdsourcing contests. In: Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pp. 675–682 (2013)
14. Liu, X., Lu, M., Ooi, B.C., Shen, Y., Wu, S., Zhang, M.: CDAS: a crowdsourcing data analytics system. arXiv preprint arXiv:1207.0143 (2012)
15. Marcus, A., Karger, D., Madden, S., Miller, R., Oh, S.: Counting with the crowd. Proc. VLDB Endow. 6(2), 109–120 (2012)
16. Niazi Torshiz, M., Amintoosi, H.: Collusion-resistant worker selection in social crowdsensing systems. Comput. Knowl. Eng. 1(1), 9–20 (2018)
17. Nouri, Z., Wachsmuth, H., Engels, G.: Mining crowdsourcing problems from discussion forums of workers. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 6264–6276 (2020)
18. Sheng, V.S., Provost, F., Ipeirotis, P.G.: Get another label? Improving data quality and data mining using multiple, noisy labelers. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 614–622 (2008)
19. Song, C., Liu, K., Zhang, X.: Collusion detection and ground truth inference in crowdsourcing for labeling tasks. J. Mach. Learn. Res. 22(190), 1–45 (2021)
20. Sun, H., Dong, B., Zhang, B., Wang, W.H., Kantarcioglu, M.: Sensitive task assignments in crowdsourcing markets with colluding workers. In: 2018 IEEE 34th International Conference on Data Engineering (ICDE), pp. 377–388. IEEE (2018)
21. Von Ahn, L., Maurer, B., McMillen, C., Abraham, D., Blum, M.: reCAPTCHA: human-based character recognition via web security measures. Science 321(5895), 1465–1468 (2008)
22. Wang, G., et al.: Serf and turf: crowdturfing for fun and profit. In: Proceedings of the 21st International Conference on World Wide Web, pp. 679–688 (2012)
23. Xiang, Q., Nevat, I., Zhang, P., Zhang, J.: Collusion-resistant spatial phenomena crowdsourcing via mixture of Gaussian processes regression. In: TRUST@AAMAS, pp. 19–30 (2016)
Improving Causality Explanation of Judge-View Generation Based on Counterfactual Qinhua Huang(B) and Weimin Ouyang School of AI and Law, Shanghai University of Political Science and Law, Shanghai 201701, China {hqh,oywm}@shupl.edu.cn
Abstract. Legal judgment prediction (LJP) has attracted wide attention in both the AI research community and legal research. With general pre-trained models employed in many NLP tasks and achieving state-of-the-art performance, many methods for judge-view generation in LJP have been developed. These methods generally improved prediction accuracy, but issues remain; one fatal problem that may hinder application in real scenarios is the causality explanation from facts to judgments. Research has shown that large models with good prediction performance can easily make decisions based on spurious correlations, inferring correct results from the wrong text content in fact descriptions. This weakens a model's interpretability, accountability and trustworthiness, and may hinder the adoption of AI LJP applications in real legal scenarios. Inspired by the idea of counterfactuals in causal inference, we investigate its usage in legal AI applications. We introduce a method of counterfactual generation that intervenes on the raw data to address the data imbalance problem in the vertical LJP domain. Combined with the generalization ability of large language models, the resulting embeddings of case facts are more expressive, reducing the probability of spurious correlations such as a result view inferred from unrelated fact description text. We conduct a comparison experiment to illustrate the performance of our method. The results show that with counterfactuals intervening on the raw data, the performance of legal judge-view generation can be improved.

Keywords: Pre-trained Model · Counterfactual · Judge View · Causality
1 Introduction

Since the birth of AI, legal intelligence has maintained a central position in the AI research community. The idea of using an AI model as a judge can be traced back to 1958, when Lucien Mehl proposed the problem of an automatic judge [1, 2]. Legal intelligence can be roughly divided into two stages. As early as the 1980s, joint research across the disciplines of AI and law had been established. In line with the popular AI methods of the time, much attention was put on rule-based systems and logical reasoning. Typical systems were developed, such as RBR systems, CBR systems, HYPO, CABARET, and so on. Due to problems such as data sparsity and the lack of common-sense knowledge,

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 276–284, 2023.
https://doi.org/10.1007/978-981-99-4752-2_23
legal intelligence soon met big challenges, as happened in other typical AI domains. With the development of deep learning methods in recent years, much research has turned to building legal AI models on pre-trained NLP models, trained on large datasets of real legal documents. With the great progress made in large language models such as ELMo, BERT and GPT-2, much research now follows the approach of employing an LLM to obtain basic embedding representations and then fine-tuning on a specific vertical-domain task. Liu Zonglin et al. [6] noticed that crime prediction and legal recommendation are important sub-tasks of legal judgment prediction, and proposed a multi-task learning model that models the two tasks jointly and integrates crime keyword information, which improves the model's accuracy on both tasks. In the field of legal question answering, especially legal advice, there is sometimes a need to answer questions that do not depend on a particular fact. McElvain studied and built the West Plus system, which provides question-answering capabilities using IR and NLP technologies rather than relying on a structured knowledge base; this is a continuation of the idea of legal concept search. Due to the high degree of data openness, Chinese research on automatic trial prediction is relatively active, while for reasons of data policy and others, automatic judge systems in other languages are relatively few. Ilias Chalkidis combined data of the European Court of Human Rights with other data and set up a deep learning model based on the BERT language model. From a human perspective, traditional feature-based machine learning models are more comprehensible, while deep learning models may have better accuracy and performance; the biases of deep learning models are also hard to explain and may be unpredictable.
This is critical in the legal area. Haoxi Zhong [7] discussed the application of topological learning in trial prediction. An actual legal trial is generally composed of several sub-tasks, such as the application of laws, charges, fines, and sentences; based on the dependencies between sub-tasks, a topological mapping can be established and a trial model constructed. In recent years, natural language processing has made great progress in general reading comprehension, for example through the BERT model and attention mechanisms. Shangbang Long [8] studied legal reading comprehension: following the judge's working mechanism, predictions are given according to the descriptions of facts, complaints and laws, thereby achieving automatic prediction. Wenmin Yang and other researchers, building on the topological structure of sub-tasks and combining it with a word-matching attention mechanism, proposed a deep learning framework that obtained a large performance improvement. With the growth of various kinds of data corpora, the question of how to fully exploit these original resources arises. For a specific AI task, supervised learning is more reliable than other methods, but labeling the corpus is a heavy burden. Self-supervised learning methods in NLP were designed to use the context of each word to predict a hidden part of a specific text [23], greatly reducing the cost of expensive text-tagging tasks. With the pre-training approach, the representation of a text unit can be reused rather than serving only one specific task. Approaches to pre-training on text corpora include word2vec, GloVe, subword representations, and the Transformer. After pre-training, a text, word or sentence can be represented as a float vector in the output matrix. This is called word
representation learning. Most of these models are context-free, so a word representation loses its context in use; context-sensitive methods were therefore developed. ELMo greatly improved context-sensitive representation by outputting combinations of the hidden-layer representations of a pre-trained bi-LSTM. Adding ELMo to specific ML tasks improved six typical NLP tasks: sentiment analysis, natural language inference, semantic tagging, co-reference resolution, named-entity recognition and question answering. To ease adaptation to specific NLP tasks, GPT was designed as a general task-agnostic model for context-sensitive representation embedding. Based on the Transformer decoder, GPT pre-trains a model to represent sequences of text; for a specific NLP task, fine-tuning passes the model output to an additional linear layer to predict output labels. To overcome the limitation of GPT's unidirectional autoregression, BERT was proposed. BERT represents tokens by encoding bidirectional context in text. For a specific supervised-learning task, it passes the representation embedding to an additional output layer and then fine-tunes the pre-trained Transformer parameters, while the additional layer learns from scratch. The effectiveness of BERT has been demonstrated on four kinds of NLP tasks: single-text classification, text-pair classification, question answering, and text tagging. Alongside the achievements made in many domains using LLMs, one important issue has emerged: a typical LLM has over hundreds of billions of parameters, and understanding the causality behind such a model's decision process is becoming an ever harder challenge for humans. Without basic reasoning clues for the potential causality, such a model is sure to be rejected by humans in many domains, such as the legal domain.
2 Related Works

In this section, we briefly introduce the judge-view generation models developed so far. Court-view generation is an important task in the legal intelligence field. It is a typical text-to-text natural language generation problem, mapping a fact description to the judge's point of view. The judge's point of view consists of the supported charges and the corresponding rationales. For the legal intelligence domain the rationales are the more important part: they provide the confidence for a judge to make a decision on a specific charge. According to most legal systems in the world, a charge decision should be made based on the facts. Accurate rationales save much of a judge's resources and thus improve the judicial system's efficiency, accuracy and trustworthiness. This is especially true for judicial supervision: due to limited resources, in most situations it is hard for a supervisor reviewing a case to start from the beginning. This requirement demands that the rationales have some basic support. One element is causality and consequence, which reconstructs the logic of events; the other is the related details. Generally, plenty of details are hidden in the facts, especially when the fact description is provided by the original documents. Yao treats the task of abstracting fact details as document summarization; this may lose the causal logic hidden in the documents, since important deduced information can be neglected [20]. Court-view generation research can be traced back to the Seq2Seq model: by labeling (fact description, labeled court view) pairs, a basic model can be trained. But the Seq2Seq model has difficulty generating rationales. Ye proposed
a method that improves the quality of the tagged corpus to promote model performance: by tagging extra charge labels, better fact-description-based rationale generation can be achieved. In the model construction, Ye encoded fact-description-to-rationale pairs with an encoder-decoder, and furthermore encoded the charges to label selected rationale pairs. With an attention mechanism fused into the Seq2Seq model, it achieved state-of-the-art performance. Note, however, that confounding bias in the data generation mechanism may be neglected. Given the importance of demonstrating a model's causality and explanations in the legal domain, much research takes causality analysis as a tool in the search for solutions. As the top level of causality analysis, counterfactual research has received more and more focus recently. Wu proposed a method using a counterfactual mechanism: in their model, the encoder input is obtained by leveraging the plaintiff's claim together with the facts, so the encoder captures the information in the fact description that is tied closely to the claims. To relieve the effect of confounding bias, a counterfactual decoder was proposed to perform causality analysis [21]. Linan Yue went a step further in formatting the training data, dividing the court view into adjudging circumstances (ADC) and sentencing circumstances (SEC). ADC and SEC should be clearly identified because they are unbalanced across cases and thus strongly affect the forming of the court view. Based on this observation, they designed a circumstance selector to identify ADC and SEC; the resulting court view is generated by merging the two types of circumstance [22]. To address the eXplainable AI (XAI) problem for deep learning on image data, which had attracted less attention from the AI research community, Eoin Kenny [26] proposed a method for generating plausible counterfactuals and semi-factuals for black-box CNN classifiers in computer vision.
They modify all exceptional features in a test image to be normal in order to generate plausible counterfactual images. Amir-Hossein Karimi [25] adopted the idea that counterfactual explanations should give causal information to support causal reasoning. They proposed a paradigm shift from recourse via nearest counterfactual explanations to recourse through minimal interventions, shifting the focus from explanations to interventions. Chen [24] investigated spurious-correlation errors in LJP models. They proposed reorganizing the information in the fact description by performing information extraction: the fact description is first processed with co-reference resolution and OIE tools to identify the correct correlations, which are then built into a graph. The graph-building process is designed to discard correlations with overlap or mismatch. A graph-to-sequence step is then performed, and finally the data is used to train a classifier, producing a deep neural network prediction model. In machine learning, the data distribution affects a model's performance, especially in domains with imbalanced data. Nitay Calderon [27] proposed an algorithm to generate counterfactuals of the input text, suitable for domain adaptation tasks. They evaluate which terms to mask by assigning masking scores, which decide whether each term will be masked. To adapt to the target domain, the embeddings of the masked sentences are concatenated with the target-domain embedding as input to a language model that generates the counterfactuals.
The idea of generating counterfactuals to overcome the shortcomings of the data imbalance problem has attracted much interest. Zee [29] also reported a method based on large language models. Building on Reif et al.'s work, they fed three parts into a large language model: a small fixed set of prompts that demonstrate the rewriting task, the piece of text to be rewritten, and an instruction with a style like "make aa bb-tive". At most, the LLM returns up to 16 attempts at rewriting the input text.
3 Problem Statement

The court-view generation problem is to generate the assisting text of a judicial document. It presents the charge and the related explanation under a specified legal system; the judicial explanation is called the rationale. With a court-view generation function, an ODR system can quickly obtain a discrimination reference for all attending parties. The fact description is the text in the document provided by each of the disputing parties, which should be processed into some format that emphasizes the facts of the related disputes. We denote the fact description as f = (f_1, ..., f_n). Obviously, in a legal judgment document the charges are the central part; we represent the charges in a case as c = (c_1, ..., c_n). For every charge on which the judge makes a decision, the legal reasoning should be presented as well; we denote the corresponding rationales for the charges in the case as y = (y_1, ..., y_n). The task of court-view generation is to maximize the likelihood

ŷ = argmax_y P(y | f, c),

where P(y | f, c) is the conditional probability of the rationales y given the fact description f and charges c. To better interpret the structure of the court view from a logical point of view, we classify the rationales into two categories: rationales with strong causality and rationales with correlation. To make the court view interpretable, each charge should be supported by at least one strong explanation, which can be regarded as a causal inference, denoted r_caus. A charge usually also has supporting evidence in the form of less direct reasons, which may have an indirect effect on the judgment; we call these correlation inferences, denoted r_corr. So we divide the rationale y into two parts, y = {r_caus, r_corr}. Formally, a court view is denoted by v, with v = F(c, r_caus, r_corr). Following the representation of [22], the fact is f = G(r_caus, r_corr, noise). To balance the data distribution, we adopt the DoCoGen method to generate counterfactuals and intervene in the inference process. Let f′ = {r′_caus, r′_corr} be the counterfactual case description and f_tr = {f, f′} be the facts in the augmented training data.
4 Model

Our model is composed of two stages. The first stage generates the counterfactuals of the training data. To create the destination domain, we randomly select 20% of the training set as orientation-domain data. The orientation embeddings are concatenated with the masked embeddings of the training data and sent into the large language model to generate the counterfactuals:

f′ ∼ P_D(f | V = v)
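A toy sketch of the masking step that precedes rewriting; the frequency-ratio masking score below is an illustrative simplification of our own, not DoCoGen's exact scorer, and the token counts are invented.

```python
from collections import Counter

def masking_scores(tokens, source_counts, target_counts):
    """Score each token by how source-domain-indicative it is: frequent
    in the source corpus but rare in the orientation (target) corpus."""
    return {
        tok: source_counts.get(tok, 0)
             / (source_counts.get(tok, 0) + target_counts.get(tok, 0) + 1)
        for tok in set(tokens)
    }

def mask(tokens, scores, threshold=0.5, mask_token="[MASK]"):
    """Replace domain-indicative tokens; the LLM later fills the masks
    conditioned on the orientation-domain embedding."""
    return [mask_token if scores[t] > threshold else t for t in tokens]

source = Counter({"theft": 9, "defendant": 5, "the": 20})   # invented counts
target = Counter({"fraud": 8, "defendant": 5, "the": 20})
tokens = ["the", "defendant", "committed", "theft"]
print(mask(tokens, masking_scores(tokens, source, target)))
```

Only the domain-specific term is masked; shared vocabulary survives, so the rewritten text keeps the case's structure while its domain content can change.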
In the second stage, we mix the counterfactuals and the facts to obtain f_tr for training. From our problem description, the maximum likelihood of the court view is written as P(v | f_tr):

P(v | f_tr) = P(F(c, r_caus, r_corr) | f_tr)

According to our definitions, the causal inference is the more important part, while the correlation inferences should be pruned by a ranking of correlations. Thus we get:

P(v | f_tr) = P(v, c | r_caus) · argmax_{r_corr} P(v | r_corr).

Here we choose only the highest-ranked r_corr as the correlation-analysis explanation used to form the court view; the argmax could also be replaced by the top 2 or more. In the model design, we use the facts to predict the violated articles: we first employ BERT to represent the facts and then connect a fully connected linear layer to predict the violations. For attending, we use the soft alignment of the attention mechanism, i.e., a weighted-average alignment, with attention weights e_ij:

e_ij = mlp(f_tr)^T mlp(c).

In the comparison step, we compare the tokens in the facts with each token in the violations. After the comparison step, we aggregate the two comparison vectors to fuse all the information and obtain the logic. Using the attention head, we concatenate the violation outputs together as input to generate the rationales; to generate them, we connect the attention layer to a fully connected linear layer.
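The soft alignment above can be sketched with NumPy; the one-layer ReLU "mlp", the dimensions, and the random weights are placeholders of ours, not the model's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mlp(x, W, b):
    # One ReLU layer stands in for the unspecified "mlp" in the text.
    return np.maximum(0.0, x @ W + b)

def soft_align(F, C, Wf, bf, Wc, bc):
    """Soft alignment attention: scores e_ij = mlp(f_i)^T mlp(c_j),
    then a weighted average of charge embeddings per fact token."""
    E = mlp(F, Wf, bf) @ mlp(C, Wc, bc).T   # (n_fact, n_charge) scores
    A = softmax(E, axis=1)                  # attention weights over charges
    return A @ C                            # aligned charge context per fact token

rng = np.random.default_rng(1)
F = rng.normal(size=(6, 8))   # 6 fact tokens, embedding dim 8 (placeholder)
C = rng.normal(size=(3, 8))   # 3 charge/violation embeddings (placeholder)
Wf, bf = rng.normal(size=(8, 8)), np.zeros(8)
Wc, bc = rng.normal(size=(8, 8)), np.zeros(8)
ctx = soft_align(F, C, Wf, bf, Wc, bc)
print(ctx.shape)  # (6, 8)
```

Each output row is a convex combination of the charge embeddings, which is what "weighted average align" means here.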
5 Training Method

In the first stage, to generate the counterfactuals, the input fact-description data and the masked fact-description data are embedded by a large language model and then concatenated as the input. This input is sent to the LegalBert model to generate the fact description. This training step is self-supervised: the output is a fact description, and the goal is to learn to reconstruct the fact description. In the second stage, we combine the facts and the counterfactuals as input to generate the judge view. The loss function is the same as in [22], the binary cross-entropy:

Loss = Σ_i [ −y_i log ŷ_i − (1 − y_i) log(1 − ŷ_i) ]
Each y_i corresponds to a rationale. We rank the rationales by the importance of each sentence to distinguish causality from correlation; for our supervised learning task, the sentence with the highest score is r_caus.
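The binary cross-entropy objective above, written out directly as a minimal NumPy version (a training run would use a framework implementation, and the labels and probabilities below are made up):

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy over rationale sentences:
    Loss = sum_i [ -y_i*log(y_hat_i) - (1 - y_i)*log(1 - y_hat_i) ]."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # guard against log(0)
    return float(np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)))

y = np.array([1.0, 0.0, 1.0])    # gold labels for three sentences
p = np.array([0.9, 0.2, 0.6])    # predicted probabilities (illustrative)
print(round(bce_loss(y, p), 4))  # → 0.8393
```

The clipping step is a standard numerical guard, not part of the formula in the text.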
6 Experiment and Result

6.1 Dataset

We evaluate our model on the tagged dataset of [22]. The dataset originates from legal documents published on China Judgments Online. The fact descriptions, charges and rationales were extracted using regular expressions. The training set has 50312 cases and the test set has 22617 cases.
An example of the data format is shown in Fig. 1. The text is written in Chinese; its meaning is given in the first column. The fact description describes the facts investigated by the court. The charge gives the names of the violations against the law, and the articles give the details of the violations under the charges. The court view is given by the judge, with a short causal reason and some correlative reasons organized by a certain logic. In a typical judge view, the causal reason is short but indispensable, while the correlative reasons usually provide relatively weaker support for the charge but may contain more indirect details. Here, for privacy, we substitute specific names and dates with xx, but in training the information is kept in the same form as on China Judgments Online.
Fig. 1. Data format example.
6.2 Experimental Setup The maximum sentence length set to 150. To evaluate the performance of generation, we adopt BLEU scores in experiment. The pre-training language model selected is bert-base-chinese model. To illustrate the effect of intervention of generating the counterfactual, we compare the BLEU scores of these two methods. 6.3 Results Table 1 showed the Bleu scores of the model. The result is at an average level between C2VG, Transformer and AttS2S models. Row 2 of Table 1 showed when our constructed counterfactuals intervened to the training, the performance of our model can be promoted.
Table 1. Results of Court view Generation with counterfactuals. Method
BLEU B-1
BLEU B-2
BLEU B-N
No intervention
51.3
40.1
37.2
Two stages with intervention
52.5
40.5
39.2
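For reference, BLEU can be sketched at the sentence level as follows; real evaluations should use an established implementation (e.g., NLTK or sacrebleu), and the example sentences are our own, not from the dataset.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Minimal sentence-level BLEU with a brevity penalty: the geometric
    mean of clipped n-gram precisions for n = 1..max_n."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "the court holds the defendant liable".split()
ref = "the court holds that the defendant is liable".split()
print(round(bleu(cand, ref), 3))
```

Here BLEU B-1 in Table 1 corresponds to max_n = 1, B-2 to max_n = 2, and so on.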
7 Conclusion

In this paper, we studied the problem of improving LJP inference through counterfactual construction. In judge-view generation with large language models, the inference can easily be led astray by spurious correlations, which harm the explainability, accountability and trustworthiness of the model in application. We carefully investigated the typical generation algorithms and then proposed a method that intervenes on the raw training data by generating counterfactuals. The benefit is that it mitigates the imbalance in the data distribution, which is the main source of the underlying spurious correlations. Our process can also be transferred to other legal judgment prediction domains where data sources are relatively scarce in a specified vertical area. Testing on real datasets shows that our model design benefits from the intervention, improving LJP performance. In the future, we will test on more datasets in terms of scalability and variety, and keep optimizing our strategies for better performance. One potential direction is to make the counterfactual generation more robust and general; the effects of different large language models are also worth further investigation.
References

1. Katsh, E.: Online dispute resolution: building institutions in cyberspace. Univ. Connecticut Law Rev. 28, 953–980 (1996)
2. Mehl, L.: Automation in the legal world. In: Conference on the Mechanisation of Thought Processes, Teddington, England (1958)
3. Ji, G., He, S., Xu, L., Liu, K., Zhao, J.: Knowledge graph embedding via dynamic mapping matrix. In: ACL, pp. 687–696 (2015)
4. Singhal, A.: Introducing the Knowledge Graph: Things, Not Strings. Google Official Blog (2012). Accessed 6 Sept 2014
5. From web: Zhejiang to comprehensively promote the construction of digital courts. https://www.chinacourt.org/index.php/article/detail/2021/04/id/5949110.shtml (2021)
6. Liu, Z., Zhang, M., Zhen, R., Gong, Z., Yu, N., Fu, G.: Multi-task learning model for legal judgment predictions with charge keywords. J. Tsinghua Univ. (Sci. Technol.) 59(7), 497–504 (2019)
7. Zhong, H.: Legal judgment prediction via topological learning. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3540–3549 (2018)
8. Long, S., Tu, C., Liu, Z., Sun, M.: Automatic judgment prediction via legal reading comprehension. In: Sun, M., Huang, X., Ji, H., Liu, Z., Liu, Y. (eds.) Chinese Computational Linguistics. Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence), vol. 11856, pp. 558–572. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32381-3_45
9. Xiao, H., Huang, M., Yu, H., Zhu, X.: From one point to a manifold: knowledge graph embedding for precise link prediction. In: IJCAI, pp. 1315–1321 (2016)
10. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: NIPS, pp. 2787–2795 (2013)
11. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by translating on hyperplanes. In: AAAI, pp. 1112–1119 (2014)
12. Jenatton, R., Roux, N.L., Bordes, A., Obozinski, G.R.: A latent factor model for highly multi-relational data. In: Proceedings of NIPS, pp. 3167–3175 (2012)
13. Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings for knowledge graph completion. In: AAAI, pp. 2181–2187 (2015)
14. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In: NIPS, pp. 693–701 (2011)
15. Zhao, S.-Y., Zhang, G.-D., Li, W.-J.: Lock-free optimization for nonconvex problems. In: AAAI, pp. 2935–2941 (2017)
16. Kazemi, S.M., Poole, D.: SimplE embedding for link prediction in knowledge graphs. In: Advances in Neural Information Processing Systems (2018)
17. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of KDD, pp. 1247–1250 (2008)
18. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
19. Wang, P., Fan, Y., Niu, S., Yang, Z., Zhang, Y., Guo, J.: Hierarchical matching network for crime classification. In: SIGIR 2019, July 21–25, Paris, France (2019)
20. Ye, H., Jiang, X., Luo, Z., Chao, W.: Interpretable charge predictions for criminal cases: learning to generate court views from fact descriptions. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) (2018)
21. Wu, Y., et al.: De-biased court's view generation with causality. In: Proceedings of the 2020 Conference on EMNLP (2020)
22. Yue, L., et al.: Circumstances enhanced criminal court view generation. In: SIGIR 2021, pp. 1855–1859 (2021)
23. Zhang, A., Lipton, Z.C., Li, M., Smola, A.J.: Dive into Deep Learning. arXiv preprint arXiv:2106.11342 (2021)
24. Chen, H., Zhang, L., Chen, F., Yu, Y.: Knowledge is power: understanding causality makes legal judgment prediction models more generalizable and robust. arXiv:2211.03046 (2022)
25. Karimi, A.-H.: Algorithmic recourse: from counterfactual explanations to interventions. arXiv:2002.06278v4 [cs.LG]. Accessed 8 Oct 2020
26. Kenny, E.M., Keane, M.T.: On generating plausible counterfactual and semi-factual explanations for deep learning. In: The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21) (2021)
27. Calderon, N., Ben-David, E., Feder, A., Reichart, R.: DoCoGen: domain counterfactual generation for low resource domain adaptation. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7727–7746, Dublin, Ireland. Association for Computational Linguistics (2022)
Wu, Y., et al.: De-biased court’s view generation with causality. In: Proceedings of the 2020 Conference on EMNLP (2020) 22. Yue, L., et al.: Circumstances enhanced criminal court view generation. In: SIGIR2021, pp. 1855–1859 (2021) 23. Zhang, A., Lipton, Z.C., Li, M., Smola, A.J.: Dive into Deep Learning, arXiv preprint arXiv: 2106.11342 (2021) 24. Chen, H., Zhang, L., Chen, F., Yu, Y.: Knowledge is power: understanding causality makes legal judgment prediction models more generalizable and robust. arXiv:2211.03046 (2022) 25. Karimi, A.-H.: Algorithmic recourse: from counterfactual explanations to interventions. arXiv:2002.06278v4 [cs.LG]. Accessed 8 Oct 2020 26. Keny, E.M., Keane, M.T.: On generating plausible counterfactual and semi-factual explanations for deep learning. In: The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21) (2021) 27. Calderon, N., Ben-David, E., Feder, A., Reichart, R.: DoCoGen: domain counterfactual generation for low resource domain adaptation. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7727–7746, Dublin, Ireland. Association for Computational Linguistics (2022)
Instance Weighting-Based Noise Correction for Crowdsourcing Qiang Ji, Liangxiao Jiang(B) , and Wenjun Zhang School of Computer Science, China University of Geosciences, Wuhan 430074, China {jq,ljiang,wjzhang}@cug.edu.cn
Abstract. In crowdsourcing scenarios, researchers can obtain each instance's multiple noisy label set from crowd workers and then infer its integrated label via label integration. To further improve the quality of integrated labels, many noise correction algorithms have been proposed in recent years. Most of them try to obtain a clean set and a noise set, and then train classifiers on the clean set to correct the instances in the noise set. However, the class distribution of the clean set is often inconsistent with that of the noise set, which leads to a poor correction performance of the trained classifiers. To reduce the inconsistency between the class distributions of the clean set and the noise set, this paper proposes an instance weighting-based noise correction (IWNC) algorithm. IWNC first calculates each class's weight based on the class distribution of the clean set. Then IWNC calculates each instance's weight using the weight of the class that its integrated label belongs to and its multiple noisy label set. Finally, IWNC trains a classifier on the instance-weighted clean set to correct the instances in the noise set. The experimental results on 34 simulated and two real-world datasets indicate that IWNC significantly outperforms all the other state-of-the-art noise correction algorithms. Keywords: Crowdsourcing learning · Noise correction · Instance weighting · Class distribution
1 Introduction
In supervised learning [9], researchers can improve the performance of algorithms by increasing the amount of labeled data or improving its quality. However, due to the limited budget, it is difficult to ensure both the quantity and the quality of the labeled data. Thus, it is urgent for researchers to find an efficient way to get plenty of high-quality labeled data. The emergence and development of crowdsourcing platforms [1, 2] such as Amazon Mechanical Turk (AMT) and CrowdFlower has exactly met this need. Researchers can publish crowdsourced tasks on these online crowdsourcing platforms and then collect plenty of labels from different crowd workers. Since a single crowd worker does not have extensive domain knowledge, for each instance in a dataset, researchers usually employ multiple crowd workers to label it
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 285–297, 2023. https://doi.org/10.1007/978-981-99-4752-2_24
286
Q. Ji et al.
to get a multiple noisy label set. Subsequently, label integration algorithms are utilized to obtain the integrated label for each instance. Several label integration algorithms have been proposed, including the simplest, Majority Voting (MV) [17]. To further improve the label quality, researchers have designed several other novel label integration algorithms from different perspectives [3, 8, 18, 21, 24–26]. Although label integration algorithms are often effective, there is still a certain proportion of noise in the inferred integrated labels [11]. Noise here means that the label of an instance inferred by a specific label integration algorithm is not its original true label, which makes trained classifiers less effective. Thus, it is necessary to correct the noise. So far, researchers have proposed several novel noise correction algorithms [4, 6, 10, 12, 13, 20, 22]. Most of them try to obtain a clean set and a noise set from the original dataset, and then train one or several classifiers on the clean set to correct instances in the noise set. However, the class distribution of the clean set is often inconsistent with that of the noise set, which leads to a poor correction performance of the trained classifiers. To reduce the inconsistency between the class distributions of the clean set and the noise set, this paper proposes an instance weighting-based noise correction (IWNC) algorithm. IWNC first calculates each class's weight based on the class distribution of the clean set. Then IWNC calculates each instance's weight using the weight of the class that its integrated label belongs to and its multiple noisy label set. Finally, IWNC trains a classifier on the instance-weighted clean set to correct the instances in the noise set. The extensive experimental results indicate that IWNC is significantly superior to all its competitors. In general, the contributions of this paper include: 1.
This paper identifies a new problem: in crowdsourcing scenarios, the class distribution of the clean set is often inconsistent with that of the noise set. To our knowledge, this paper is the first attempt to address this problem. 2. This paper proposes an instance weighting-based noise correction (IWNC) algorithm. IWNC reduces the inconsistency between the class distributions of the clean set and the noise set and thus improves the correction performance of the classifier trained on the clean set. 3. This paper conducts extensive experiments to evaluate the proposed IWNC on simulated and real-world crowdsourced datasets. The results indicate that IWNC is significantly superior to all its competitors. The rest of this paper is organized as follows. Section 2 reviews some related works. Section 3 presents IWNC in detail. Section 4 describes the detailed setup and results of the experiments. Section 5 summarizes this paper and outlines directions for future work.
2 Related Work
As stated before, there is still a certain proportion of noise in integrated labels. More noise means lower label quality and directly makes the trained classifiers less effective. Thus, to improve the label quality, noise correction must be performed. Most noise correction algorithms for crowdsourcing consist of noise filtering and noise
correction. In the step of noise filtering, the original dataset is divided into a clean set and a noise set using a noise filter. In the step of noise correction, the clean set is used to train one or several classifiers to correct the noise instances in the noise set. Several noise correction algorithms have already been proposed. Nicholson et al. proposed polishing labels (PL), self-training correction (STC) and cluster-based correction (CC) [13]. PL first divides the original dataset into ten folds and trains ten distinct classifiers on them. These classifiers then give a prediction for each instance in the dataset, and each instance is corrected by majority voting over these predictions. STC first uses a conventional noise filter to filter the noise in the dataset. Subsequently, STC uses a classifier trained on the filtered dataset to predict the probability of each noise instance belonging to each class; if the probability exceeds a predetermined threshold, the instance is relabeled as that class. CC first applies one or several clustering algorithms to the dataset. Then, instances belonging to the same cluster obtain the same weight set, which is calculated from the size of the cluster and the distribution of labels within it. CC finally relabels each instance with the label that has the largest weight in its weight set. Different from PL and STC, the performance of CC does not depend on the original label quality, thanks to the advantage of clustering. Recently, Zhang et al. proposed adaptive voting noise correction (AVNC) [22]. AVNC first estimates the quality of each crowd worker, and then the approximate proportion of noise in the dataset. After filtering out the noise, AVNC trains multiple weak classifiers on the filtered dataset. Subsequently, the trained classifiers are used to predict noise instances.
Finally, each noise instance is corrected by majority voting. Xu et al. proposed cross-entropy-based noise correction (CENC) [20]. CENC first calculates label distributions for instances. Then, CENC calculates an entropy for each instance based on its label distribution. The predicted labels and calculated entropies are used to filter each noise instance. Finally, the cross-entropy is calculated to correct each instance. Dong et al. proposed co-training-based noise correction (CTNC) [6]. CTNC first utilizes a conventional noise filter to filter the noise in the original dataset. Subsequently, CTNC trains several random trees on the filtered dataset; these random trees are used to build a second view for the dataset. Finally, CTNC corrects each noise instance with two classifiers trained in a co-training framework. Chen et al. proposed label distribution-based noise correction (LDNC) [4]. LDNC filters noise instances by calculating the margin between the first and second largest label probabilities. After filtering the noise instances, LDNC trains a classifier on the filtered dataset to correct each noise instance. Li et al. proposed multi-view-based noise correction (MVNC) [12]. MVNC first utilizes a conventional noise filter to filter the noise in the original dataset, obtaining a clean set and a noise set. Then MVNC generates a second attribute view for the clean set and trains a classifier on each view of the clean set. These classifiers are used to correct instances in the noise set, and the corrected instances are added to the clean set to retrain the classifiers. Most existing noise correction algorithms mentioned above try to obtain a clean set and a noise set from the original dataset and then train one or several classifiers on the clean set to correct the instances in the noise set. However, they do not notice that the class distribution of the clean set is often inconsistent with that of the noise
set and thus leads to a poor correction performance of the trained classifiers. To reduce the inconsistency between the class distributions of the clean set and the noise set and improve the correction performance of the trained classifiers, this paper proposes an instance weighting-based noise correction (IWNC) algorithm in Sect. 3.
3 Instance Weighting-Based Noise Correction
3.1 Motivation
As mentioned in Sect. 2, most existing noise correction algorithms first divide the original dataset into a clean set and a noise set. Meanwhile, they naively assume that the class distribution of the clean set is consistent with that of the noise set and directly train one or several classifiers on the clean set to correct instances in the noise set. However, they do not notice that when performing noise filtering, the label quality varies across classes, so the proportion of instances filtered out often differs from class to class. Therefore, in a more realistic scenario, the class distribution of the clean set is often inconsistent with that of the noise set. In this paper, the Kullback-Leibler (KL) Divergence is applied to measure the inconsistency between the class distributions of the clean set and the noise set, which is defined as follows:

$$\mathrm{KL}(P, P') = 0.5 \times \big(\mathrm{KL}(P \parallel P') + \mathrm{KL}(P' \parallel P)\big) = \begin{cases} 0, & \text{if } \prod_{q=1}^{Q} p_q \times \prod_{q=1}^{Q} p'_q = 0 \\ 0.5 \times \left( \sum_{q=1}^{Q} p_q \log \frac{p_q}{p'_q} + \sum_{q=1}^{Q} p'_q \log \frac{p'_q}{p_q} \right), & \text{otherwise} \end{cases} \tag{1}$$
where $P$ represents the class distribution of the clean set, and $P'$ represents the class distribution of the noise set. The larger the value of Eq. (1), the more inconsistent the two distributions are. For example, consider three class distributions $P_1 = \{0.15, 0.2, 0.25, 0.2, 0.2\}$, $P_2 = \{0.2, 0.2, 0.2, 0.2, 0.2\}$ and $P_3 = \{0.1, 0.4, 0.2, 0.1, 0.2\}$. According to Eq. (1), the KL Divergence between $P_1$ and $P_2$ is $0.5 \times (0.15 \log \frac{0.15}{0.2} + 0.25 \log \frac{0.25}{0.2} + 0.2 \log \frac{0.2}{0.15} + 0.2 \log \frac{0.2}{0.25}) \approx 0.018$, where the terms with equal probabilities vanish. Similarly, the KL Divergence between $P_1$ and $P_3$ is 0.17. Based on these results, the inconsistency between $P_1$ and $P_3$ is larger than that between $P_1$ and $P_2$.

Independent and identically distributed (iid) sampling is an important hypothesis in machine learning: the training data should be independent of and identically distributed with the test data. This hypothesis guarantees that a classifier trained on the training data can perform well on the test set. In noise correction for crowdsourcing, the clean set can be seen as the training data and the noise set as the test data. The inconsistency between the class distributions of the clean set and the noise set obviously violates this hypothesis, so the classifier trained on the clean set cannot obtain a good correction performance on the noise set. Thus, to improve the correction performance of the trained classifier, the first problem to
solve is how to reduce the inconsistency between the class distributions of the clean set and the noise set. To solve this problem, this paper calculates a mixed weight for each instance in the clean set and trains a classifier on the instance-weighted clean set. When calculating the mixed weight, this paper takes into account not only the impact of the class distribution of the clean set, but also the impact of each instance's multiple noisy label set. To correct the noise instances in the noise set, the class membership probabilities are first estimated from the multiple noisy label set of each instance in the noise set. Meanwhile, this paper trains a classifier on the instance-weighted clean set and uses it to predict each instance in the noise set, obtaining its predicted class membership probabilities. Thus, the class membership probabilities of each instance in the noise set maintain a portion of its own estimated class membership probabilities while absorbing a portion of its predicted class membership probabilities. Motivated by the above discussions, this paper proposes IWNC.
3.2 The Proposed IWNC
In crowdsourcing scenarios, a dataset is generally denoted by $D = \{(x_i, L_i)\}_{i=1}^{N}$. Each instance $x_i$ is associated with a multiple noisy label set $L_i = \{l_{ir}\}_{r=1}^{R}$, where $l_{ir}$ is the label given to $x_i$ by the crowd worker $u_r$ $(r = 1, 2, \cdots, R)$, and $l_{ir}$ takes a value from the set $\{-1, c_1, c_2, \cdots, c_Q\}$, where $Q$ is the number of classes and $-1$ denotes that $u_r$ does not label $x_i$. Subsequently, the integrated label $y_i$ of $x_i$ is inferred by a specific label
integration algorithm. Noise correction then starts from the dataset $D = \{(x_i, L_i, y_i)\}_{i=1}^{N}$. The first step of IWNC is to use a conventional noise filter to divide the original dataset $D$ into a clean set $D_c$ and a noise set $D_n$. As previously discussed, the class distribution of the clean set is inconsistent with that of the noise set. To reduce this inconsistency and thus improve the correction performance of the trained classifier, IWNC takes into account the class distribution of $D_c$ and calculates a weight $w_{c_q}$ for each class $c_q$ by Eq. (2):

$$w_{c_q} = \begin{cases} 0, & \text{if } \sum_{i=1}^{|D_c|} \delta(y_i, c_q) = 0 \\ \dfrac{\max_{c_{q'} \in \{c_1, c_2, \cdots, c_Q\}} \sum_{i=1}^{|D_c|} \delta(y_i, c_{q'})}{\sum_{i=1}^{|D_c|} \delta(y_i, c_q)}, & \text{otherwise} \end{cases} \tag{2}$$

where $|D_c|$ represents the number of instances in $D_c$, and $\delta(\cdot)$ is an indicator function that returns 1 if its two parameters are equal and 0 otherwise. By Eq. (2), minority-class instances in $D_c$ obtain a relatively large weight, which balances the impact of majority-class and minority-class instances in $D_c$ on the classifier. Subsequently, IWNC calculates a weight $w_i^1$ for $x_i$ in $D_c$ based on $w_{c_q}$, which is defined
by Eq. (3):

$$w_i^1 = \frac{w_{y_i}}{\sum_{q=1}^{Q} w_{c_q}} \tag{3}$$
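The class-weighting step of Eqs. (2)–(3) can be sketched as follows (a minimal illustration; the function names and the dictionary-based representation are ours, not from the paper):

```python
from collections import Counter

def class_weights(integrated_labels, classes):
    # Eq. (2): each class is weighted by (largest class count) / (its own count)
    # over the clean set; classes absent from the clean set get weight 0.
    counts = Counter(integrated_labels)
    largest = max(counts[c] for c in classes)
    return {c: largest / counts[c] if counts[c] > 0 else 0.0 for c in classes}

def instance_weight_w1(y_i, w_c):
    # Eq. (3): normalize the weight of the instance's integrated-label class
    # by the sum of all class weights.
    return w_c[y_i] / sum(w_c.values())

# A clean set with 6 'a', 3 'b' and 1 'c' instance: the minority class 'c'
# gets the largest class weight, so its instances count more when training G.
w_c = class_weights(['a'] * 6 + ['b'] * 3 + ['c'], ['a', 'b', 'c'])
print(w_c)                           # {'a': 1.0, 'b': 2.0, 'c': 6.0}
print(instance_weight_w1('c', w_c))  # 6/9
```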
On the other hand, IWNC takes into account the impact of the multiple noisy label set of each instance in $D_c$. For example, consider two distinct instances $x_1$ and $x_2$ labeled by seven different crowd workers with $\{c_2, c_1, c_4, c_5, c_2, c_1, c_1\}$ and $\{c_1, c_5, c_4, c_2, c_1, c_1, c_1\}$, respectively. If IWNC chooses MV to infer their integrated labels, $y_1$ and $y_2$ are both $c_1$. However, under the assumption of MV that each crowd worker has the same quality, the probability of $y_1$ being correct is 3/7 and the probability of $y_2$ being correct is 4/7, so $y_2$ is more likely to be correct than $y_1$. Thus, to take full advantage of the information contained in the multiple noisy label set, IWNC calculates a weight $w_i^2$ for $x_i$ in $D_c$, which is defined by Eq. (4):

$$w_i^2 = \frac{\sum_{r=1}^{R} \delta(l_{ir}, y_i)}{\sum_{r=1}^{R} \mathbb{I}(l_{ir} \neq -1)} \tag{4}$$
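Eq. (4) can be checked against the paper's own example (a sketch; encoding the classes as integers and using -1 for "not labeled" are our choices):

```python
def instance_weight_w2(noisy_labels, y_i):
    # Eq. (4): among the workers who actually labeled x_i (label != -1),
    # the fraction whose label agrees with the integrated label y_i.
    answered = [l for l in noisy_labels if l != -1]
    return sum(1 for l in answered if l == y_i) / len(answered)

# The paper's example: both instances integrate to c1 under MV, but the
# second label set supports c1 more strongly (4/7 vs. 3/7).
print(instance_weight_w2([2, 1, 4, 5, 2, 1, 1], y_i=1))  # 3/7
print(instance_weight_w2([1, 5, 4, 2, 1, 1, 1], y_i=1))  # 4/7
```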
Finally, IWNC mixes the weight $w_i^1$ and the weight $w_i^2$ into the mixed weight $w_i$, which is defined by Eq. (5):

$$w_i = \frac{2 \times w_i^1 \times w_i^2}{w_i^1 + w_i^2} \tag{5}$$
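Eq. (5) is simply the harmonic mean of the two weights, which stays small whenever either component is small (a one-line sketch under the same assumptions as above; the zero-guard is our addition):

```python
def mixed_weight(w1, w2):
    # Eq. (5): harmonic mean of the class-based weight w1 and the
    # label-set-based weight w2; return 0 if both are 0 to avoid
    # division by zero.
    return 2 * w1 * w2 / (w1 + w2) if w1 + w2 > 0 else 0.0

print(mixed_weight(0.5, 0.5))  # 0.5
print(mixed_weight(1.0, 0.1))  # ~0.18, dominated by the smaller weight
```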
In the step of noise correction, IWNC first estimates the class membership probabilities $P(c_q|L_i)$ of each instance $x_i$ in $D_n$ from $L_i$ by Eq. (6):

$$P(c_q|L_i) = \frac{\sum_{r=1}^{R} \delta(l_{ir}, c_q)}{\sum_{r=1}^{R} \mathbb{I}(l_{ir} \neq -1)} \tag{6}$$
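Eq. (6) estimates a class distribution for a noise-set instance directly from its noisy label set (a sketch; the dict-over-class-ids representation and -1 for "not labeled" are our conventions):

```python
def estimated_probs(noisy_labels, classes):
    # Eq. (6): P(c_q | L_i) is the fraction of the non-missing noisy
    # labels that vote for class c_q.
    answered = [l for l in noisy_labels if l != -1]
    return {c: sum(1 for l in answered if l == c) / len(answered)
            for c in classes}

# Reusing the earlier example label set {c2, c1, c4, c5, c2, c1, c1}:
# class c1 receives probability 3/7, c2 gets 2/7, c4 and c5 get 1/7 each.
print(estimated_probs([2, 1, 4, 5, 2, 1, 1], classes=[1, 2, 3, 4, 5]))
```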
Meanwhile, IWNC trains a classifier $G$ on the instance-weighted $D_c$ and uses the trained classifier $G$ to predict the class membership probabilities $P_G(c_q|x_i)$ of each instance $x_i$ in $D_n$. Finally, IWNC updates the integrated label of $x_i$ in $D_n$ by Eq. (7):

$$y_i = \mathop{\arg\max}_{c_q \in \{c_1, c_2, \cdots, c_Q\}} \left[ \alpha P(c_q|L_i) + (1 - \alpha) P_G(c_q|x_i) \right] \tag{7}$$
where $\alpha$ is a controlling factor that adjusts the proportions of $P(c_q|L_i)$ and $P_G(c_q|x_i)$. When $\alpha$ is greater than 0.5, the result of Eq. (7) leans toward the estimated class membership probabilities; when $\alpha$ is less than 0.5, it leans toward the predicted class membership probabilities. The overall framework of IWNC is shown in Fig. 1. IWNC first uses a conventional noise filter to divide the original dataset $D$ into a clean set $D_c$ and a noise set $D_n$. Then, IWNC calculates each class's weight $w_{c_q}$ based on the class distribution
of the clean set. After that, IWNC calculates each instance's weight using the weight of the class that its integrated label belongs to and its multiple noisy label set. Finally, IWNC trains a classifier $G$ on the instance-weighted $D_c$ to correct each instance in $D_n$. The learning procedure of IWNC is described by Algorithm 1.
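The correction step of Eq. (7) blends the two probability estimates and relabels with the argmax class. A minimal sketch (here `pred_probs` stands in for the output of the classifier G trained on the instance-weighted clean set, and α = 0.7 follows the experimental setup; the names are ours):

```python
def correct_label(est_probs, pred_probs, alpha=0.7):
    # Eq. (7): blend the label-set estimate P(c_q | L_i) with the classifier
    # prediction P_G(c_q | x_i); alpha > 0.5 favors the label-set estimate.
    return max(est_probs,
               key=lambda c: alpha * est_probs[c] + (1 - alpha) * pred_probs[c])

est = {'c1': 0.6, 'c2': 0.4, 'c3': 0.0}   # from the noisy label set, Eq. (6)
pred = {'c1': 0.1, 'c2': 0.8, 'c3': 0.1}  # hypothetical output of classifier G
print(correct_label(est, pred))  # 'c2': 0.7*0.4 + 0.3*0.8 = 0.52 > 0.45 for 'c1'
```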
Fig. 1. Overall framework of IWNC
4 Experiments and Results
In this section, extensive experiments are conducted on simulated and real-world crowdsourced datasets to evaluate the performance of IWNC. IWNC is compared with seven noise correction algorithms, namely PL, STC, CC, AVNC, CENC, LDNC and MVNC, in terms of the noise ratio. In this paper, the noise ratio is defined as the percentage of instances in a dataset whose integrated labels are not their original true labels. For a fair comparison, IWNC, CENC, LDNC and MVNC are implemented on the Crowd Environment and its Knowledge Analysis (CEKA) [23] platform, which already provides the implementations of the other four algorithms. The implementations of C4.5 (J48) [14] and the Classification Filter (CF) [7] are available on the Waikato Environment for Knowledge Analysis (WEKA) [19] platform and the CEKA platform. To ensure that the experiments are as fair as possible, the simplest label integration algorithm, MV, is utilized to get the initial integrated labels. IWNC adopts C4.5 as its classifier. Considering that the choice of the noise filter does not have a significant impact on the performance of IWNC, this paper chooses one of the most commonly used conventional noise filters, CF. The controlling factor $\alpha$ in IWNC is set to 0.7. The number of subsets into which CF divides the dataset is set to 10, and C4.5 is adopted as the classifier for CF. The experimental setups and parameters of the other algorithms follow the corresponding papers.
4.1 Experiments on Simulated Datasets
The simulated experiments are conducted on all 34 datasets published on the open-source CEKA platform [23]. To simulate the whole process of crowdsourcing and make the simulation realistic, this paper first hides the true label of each instance and then applies multiple simulated crowd workers with different qualities to label it.
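The worker simulation described here can be sketched as follows (our own minimal reading of the setup: a worker with quality p gives the true label with probability p and otherwise a uniformly random wrong class; the function names and the details of the error model are assumptions, not from the paper):

```python
import random

def simulate_workers(true_label, classes, n_workers=7, rng=None):
    # Each simulated worker's quality is drawn uniformly from [0.55, 0.75];
    # with that probability the worker returns the true label, otherwise a
    # uniformly random wrong class.
    rng = rng or random.Random()
    labels = []
    for _ in range(n_workers):
        quality = rng.uniform(0.55, 0.75)
        if rng.random() < quality:
            labels.append(true_label)
        else:
            labels.append(rng.choice([c for c in classes if c != true_label]))
    return labels

labels = simulate_workers('c1', ['c1', 'c2', 'c3'], rng=random.Random(42))
print(labels)  # seven labels, 'c1' being the most frequent on average
```

MV over such a label set then yields the initial integrated label for the instance.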
To make the labels given by the simulated crowd workers more consistent with those given by real crowd workers, the quality of each simulated crowd worker is randomly generated from a uniform distribution on the interval [0.55, 0.75]. Then, to save on worker overhead, the number of simulated crowd workers $R$ is set to 7. After getting the labels provided by the simulated crowd workers, the MV algorithm is utilized to get the initial integrated labels. Subsequently, the eight noise correction algorithms, including IWNC, are used to correct the noise instances. Finally, this paper records the corresponding noise ratio of each algorithm on each dataset. To overcome the effect of random factors, each group of experiments is repeated ten times independently and the average experimental results are reported. The detailed comparison results in terms of the noise ratio are shown in Table 1, with the averages recorded in the last row. Based on these results, Wilcoxon signed-rank tests [5, 9] were conducted to further compare each pair of algorithms; Table 2 records the results. In Table 2, the symbol ● indicates that the algorithm in the row significantly outperforms the algorithm in the corresponding column, and the symbol ◯ indicates the exact opposite. The significance levels of the lower and upper diagonal are
0.05 and 0.1, respectively. These experimental results verify the effectiveness of IWNC. The following highlights can be summarized: 1. The average noise ratio of IWNC on all datasets is only 9.91%, which is much lower than that of PL (20.66%), STC (15.94%), CC (14.07%), AVNC (13.30%), CENC (13.11%), LDNC (10.62%) and MVNC (10.75%). This indicates that IWNC is clearly superior to the other seven competitors in terms of the noise ratio. 2. Based on the Wilcoxon test results, IWNC is significantly superior to its competitors in terms of the noise ratio, which strongly validates its effectiveness.
Table 1. Noise ratio (%) comparisons on the uniform distribution for IWNC versus PL, STC, CC, AVNC, CENC, LDNC and MVNC.

Dataset        PL     STC    CC     AVNC   CENC   LDNC   MVNC   IWNC
letter          9.58  10.29   3.28   8.33   7.57   1.01   0.33   0.51
mushroom        5.31   4.71   1.54   0.01   0.10   5.12   0.18   3.13
waveform       13.21  21.41  14.82  13.77  13.79   8.14   6.27   6.34
spambase       31.16  15.77  10.77   7.14   7.73  10.23   9.28   9.49
hypothyroid     1.28  12.67   9.68   0.56   0.49   3.54   1.63   1.15
sick            2.16   6.59   5.76   1.66   1.66   5.86   2.78   2.52
kr-vs-kp       23.99   7.50  12.84   1.59   1.88   6.25   6.77   4.74
segment         4.43   4.26   3.23   2.97   2.22   0.70   0.55   0.58
car            21.44  13.52  21.45   9.00   8.70   7.29   6.88   6.78
biodeg         28.25  20.42  15.51  13.63  14.82  14.13  13.65  12.16
credit-g       27.60  25.13  24.00  24.07  23.60  18.57  18.53  18.35
vowel          29.16  16.45   3.08   9.97  10.90   3.37   1.69   2.74
tic-tac-toe    33.97  20.47  17.13  16.52  16.02  14.05  15.27  15.46
anneal         19.63   8.55  14.62  10.01   9.08   4.37   7.83   4.73
vehicle        21.39  22.92  19.50  18.58  17.39   6.96   8.11   6.09
diabetes       24.01  21.43  20.65  23.24  21.95  15.04  18.10  15.70
breast-w        5.06  12.22   4.54   4.68   4.75   7.90   4.82   7.00
credit-a       16.41  15.88  13.67  12.78  11.93  10.88  10.84  11.45
balance-scale  23.71  14.66  13.98  16.80  15.49  10.67  14.24  10.69
vote            5.33  11.36   9.91   4.41   4.46   7.03   5.03   4.78
horse-colic    17.72  17.47  19.29  15.24  14.81  13.70  16.11  13.78
ionosphere     17.07  15.84  10.57  11.71  11.48  12.02  11.17  10.09
heart-c        18.78  20.56  18.15  19.34  18.81  16.93  15.15  16.07
heart-h        28.84  22.24  20.48  18.06  17.55  16.02  15.34  15.10
breast-cancer  27.83  25.70  24.06  25.77  25.59  19.23  24.06  19.48
heart-statlog  15.33  19.89  19.04  17.11  17.15  14.30  12.81  12.52
audiology      34.73  18.41  24.12  26.46  25.93  16.99  20.09  16.24
sonar          37.45  22.60  20.00  21.59  24.09  19.28  16.92  18.61
autos          42.98   8.78  23.76  18.24  18.59   7.07  10.00   7.46
hepatitis      18.90  19.81  17.23  18.52  17.55  15.29  15.48  15.03
iris           22.33   7.40   3.33   5.60   5.33   4.07   3.27   3.07
lymph          22.77  20.61  17.64  22.09  22.23  17.50  18.51  18.18
zoo            19.80  14.75   5.25  11.39  11.49   9.41  13.76   9.60
labor          30.88  21.75  15.79  21.58  20.70  18.42  20.35  17.37
Average        20.66  15.94  14.07  13.30  13.11  10.62  10.75   9.91
Table 2. Noise ratio (%) comparisons on the uniform distribution using Wilcoxon tests for IWNC versus PL, STC, CC, AVNC, CENC, LDNC and MVNC (empty cells: no significant difference at the corresponding level).

        PL    STC   CC    AVNC  CENC  LDNC  MVNC  IWNC
PL      -     ◯    ◯    ◯    ◯    ◯    ◯    ◯
STC     ●     -     ◯    ◯    ◯    ◯    ◯    ◯
CC      ●     ●     -                 ◯    ◯    ◯
AVNC    ●     ●           -     ◯    ◯    ◯    ◯
CENC    ●     ●     ●          -     ◯    ◯    ◯
LDNC    ●     ●     ●     ●     ●     -           ◯
MVNC    ●     ●     ●     ●     ●           -     ◯
IWNC    ●     ●     ●     ●     ●     ●     ●     -
4.2 Experiments on Real-World Datasets
The inconsistency between the clean set and the noise set is more pronounced when the label quality varies across classes. To further confirm the effectiveness of IWNC in realistic scenarios, this paper collects two real-world crowdsourced datasets, “LabelMe” and “Music Genre”, whose label quality differs across classes, from the Amazon Mechanical Turk (AMT) platform, and conducts experiments on them. The “LabelMe” dataset [15] is a multi-class dataset that contains 1000 instances and 512 attributes; the AMT platform collected 2547 labels for this dataset from 59 crowd workers. The “Music Genre” dataset [16] is also a multi-class dataset that contains 700
Fig. 2. The noise ratio (%) comparisons for IWNC versus PL, STC, CC, AVNC, CENC, LDNC and MVNC on the (a) “LabelMe” and (b) “Music Genre” datasets
instances and 31 attributes; the AMT platform collected 2946 labels for this dataset from 44 crowd workers. Figure 2 shows the detailed experimental results. As seen from Fig. 2, the noise ratio of IWNC (21.20%) is appreciably lower than that of all competitors on the “LabelMe” dataset, and the noise ratio of IWNC (24.57%) is likewise lower on the “Music Genre” dataset. These results also demonstrate the effectiveness of IWNC in realistic scenarios.
5 Conclusions and Future Work
This paper proposed an instance weighting-based noise correction (IWNC) algorithm. IWNC first calculates each class's weight based on the class distribution of the clean set. Then IWNC calculates each instance's weight using the weight of the class that its integrated label belongs to and its multiple noisy label set. Finally, IWNC trains a classifier on the instance-weighted clean set to correct the instances in the noise set.
296
Q. Ji et al.
All the experimental results show that IWNC successfully reduces the inconsistency between the class distributions of the clean set and the noise set and thus improves the correction performance of the classifier trained on the clean set. Although this paper has identified that, in crowdsourcing scenarios, the class distribution of the clean set often does not match that of the noise set, and has come up with a working solution, there is still room for improvement. For example, the concept of class imbalance could be introduced into noise correction, which is an important direction for future work.
References
1. Buecheler, T., Sieg, J.H., Füchslin, R.M., Pfeifer, R.: Crowdsourcing, open innovation and collective intelligence in the scientific method - a research agenda and operational framework. In: Proceedings of the Twelfth International Conference on the Synthesis and Simulation of Living Systems, ALIFE 2010, Odense, Denmark, August 19–23, 2010, pp. 679–686. MIT Press (2010)
2. Buhrmester, M., Kwang, T., Gosling, S.D.: Amazon's mechanical turk: a new source of inexpensive, yet high-quality, data? Perspect. Psychol. Sci. 6(1), 3–5 (2011)
3. Chen, Z., Jiang, L., Li, C.: Label augmented and weighted majority voting for crowdsourcing. Inf. Sci. 606, 397–409 (2022)
4. Chen, Z., Jiang, L., Li, C.: Label distribution-based noise correction for multiclass crowdsourcing. Int. J. Intell. Syst. 37(9), 5752–5767 (2022)
5. Demsar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
6. Dong, Y., Jiang, L., Li, C.: Improving data and model quality in crowdsourcing using co-training-based noise correction. Inf. Sci. 583, 174–188 (2022)
7. Gamberger, D., Lavrac, N., Groselj, C.: Experiments with noise filtering in a medical domain. In: Bratko, I., Dzeroski, S. (eds.) Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Bled, Slovenia, June 27–30, 1999, pp. 143–151. Morgan Kaufmann (1999)
8. Jiang, L., Zhang, H., Tao, F., Li, C.: Learning from crowds with multiple noisy label distribution propagation. IEEE Trans. Neural Networks Learn. Syst. 33(11), 6558–6568 (2022)
9. Jiang, L., Zhang, L., Li, C., Wu, J.: A correlation-based feature weighting filter for naive Bayes. IEEE Trans. Knowl. Data Eng. 31(2), 201–213 (2019)
10. Li, C., Jiang, L., Xu, W.: Noise correction to improve data and model quality for crowdsourcing. Eng. Appl. Artif. Intell. 82, 184–191 (2019)
11. Li, C., Sheng, V.S., Jiang, L., Li, H.: Noise filtering to improve data and model quality for crowdsourcing. Knowl.-Based Syst. 107, 96–103 (2016)
12. Li, X., Li, C., Jiang, L.: A multi-view-based noise correction algorithm for crowdsourcing learning. Information Fusion 91, 529–541 (2023)
13. Nicholson, B., Sheng, V.S., Zhang, J.: Label noise correction and application in crowdsourcing. Expert Syst. Appl. 66, 149–162 (2016)
14. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993)
15. Rodrigues, F., Lourenço, M., Ribeiro, B., Pereira, F.C.: Learning supervised topic models for classification and regression from crowds. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2409–2422 (2017)
16. Rodrigues, F., Pereira, F.C., Ribeiro, B.: Learning from multiple annotators: distinguishing good from random labelers. Pattern Recognit. Lett. 34(12), 1428–1436 (2013)
17. Sheng, V.S., Provost, F.J., Ipeirotis, P.G.: Get another label? Improving data quality and data mining using multiple, noisy labelers. In: Li, Y., Liu, B., Sarawagi, S. (eds.) Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24–27, 2008, pp. 614–622. ACM (2008)
18. Tao, D., Cheng, J., Yu, Z., Yue, K., Wang, L.: Domain-weighted majority voting for crowdsourcing. IEEE Trans. Neural Networks Learn. Syst. 30(1), 163–174 (2019)
19. Witten, I.H., Frank, E., Hall, M.A.: Data Mining: Practical Machine Learning Tools and Techniques, 3rd edn. Morgan Kaufmann, Elsevier (2011)
20. Xu, W., Jiang, L., Li, C.: Improving data and model quality in crowdsourcing using cross-entropy-based noise correction. Inf. Sci. 546, 803–814 (2021)
21. Yang, W., Li, C., Jiang, L.: Learning from crowds with robust support vector machines. Science China Inf. Sci. 66(3), 1–17 (2023)
22. Zhang, J., Sheng, V.S., Li, T., Wu, X.: Improving crowdsourced label quality using noise correction. IEEE Trans. Neural Networks Learn. Syst. 29(5), 1675–1688 (2018)
23. Zhang, J., Sheng, V.S., Nicholson, B., Wu, X.: CEKA: a tool for mining the wisdom of crowds. J. Mach. Learn. Res. 16, 2853–2858 (2015)
24. Zhang, J., Sheng, V.S., Wu, J., Wu, X.: Multi-class ground truth inference in crowdsourcing with clustering. IEEE Trans. Knowl. Data Eng. 28(4), 1080–1085 (2016)
25. Zhang, J., Wu, X., Sheng, V.S.: Imbalanced multiple noisy labeling. IEEE Trans. Knowl. Data Eng. 27(2), 489–503 (2015)
26. Zhang, Y., Jiang, L., Li, C.: Attribute augmentation-based label integration for crowdsourcing. Front. Comp. Sci. 17(5), 175331 (2023)
Improvement of Graph Convolution Network of Missing Data Based on P Systems Runpu Chi1 and Xiyu Liu2(B) 1 Business School, Shandong Normal University, Jinan, China 2 Academy of Management Science, Shandong Normal University, Jinan, China
[email protected]
Abstract. The graph convolutional network (GCN) has achieved great success since its proposal. Because GCNs can model non-Euclidean data, they extend convolutional networks to real-world applications. Graph data is a prevalent data structure in the real world and is widely used in various fields. Nowadays, most GCN models take the data as a complete structure for input. However, real-world data is often incomplete for various reasons, and some data has missing features. Therefore, we propose a GCN model for completing missing data (PGCN) based on coupled P systems. It expresses the missing features of the data using a Gaussian mixture model and an attention mechanism. In addition, based on this input, a new activation function is computed in the first layer of the GCN. The proposed PGCN method performs the node classification task on three datasets, and the results show that its performance is better than that of existing missing-data processing methods. Keywords: Graph convolutional network · Attention mechanism · P systems
1 Introduction

Graph data is an essential abstraction of real-world information that represents the relationships (edges) between objects (vertices). Moreover, with the continuous development of deep learning, graph data has been widely used in real-world applications such as recommendation systems [1, 2], search engines [3], node classification [4], link prediction [5] and text classification [6]. The types of graphs are diverse, such as friend relationship graphs in social networks [7]. Graph Convolutional Networks (GCNs) (Kipf & Welling, 2017) are an effective application of convolutional neural networks (CNNs) to graphs, based on a first-order approximation of spectral graph convolutions [8]. Much non-Euclidean data exists nowadays, such as traffic networks and the World Wide Web. But non-Euclidean data lacks translation invariance, and the structure of each node is different, so the Graph Convolution Network (GCN) was proposed to perform convolution operations on non-Euclidean data. However, in the real world, many graph structures have missing data, especially missing attributes of some nodes. We propose a solution for node classification with missing data named PGCN. PGCN produces missing data values by the © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 298–309, 2023. https://doi.org/10.1007/978-981-99-4752-2_25
Gaussian mixture model. We add an attention mechanism after the Gaussian mixture model generates the missing features, to avoid over-similarity of nodes. Membrane computing, also known as P systems, is a major branch of natural computing proposed by Gheorghe Păun in 1998 [9]. P systems are computational models inspired by the functions and structures of living cellular systems, and they exhibit maximal parallelism. A P system consists of a membrane structure, which in turn consists of many cell membranes. The contributions of this paper are:
• We propose a GCN model with a GMM and an attention mechanism based on P systems.
• We propose adding an attention mechanism to avoid excessive similarity of the data after completion. When inputting missing data, we use the node representation matrix with attention coefficients instead of the plain node representation matrix.
• We perform experimental validation of node classification on three datasets and find that the performance of our method is superior.
The remainder of the article is structured as follows. Section 2 introduces the concepts related to graph convolutional networks, missing data processing, and P systems. A graph convolutional network based on coupled P systems is proposed in Sect. 3. In Sect. 4, we conduct experiments on three datasets and report the results. We provide the conclusion in Sect. 5.
2 Related Work

2.1 Graph Convolutional Network

In GCN, the undirected graph is denoted as $G = (V, E)$, where $V$ is the set of nodes, $V = \{v_i \mid i = 1, \ldots, N\}$, and $E$ is the set of edges. $A$ is the adjacency matrix: $A_{ij} = 1$ means there is an edge between nodes $v_i$ and $v_j$, and $A_{ij} = 0$ means there is no edge between them. $X \in \mathbb{R}^{N \times D}$ is the feature matrix of the nodes and $D$ is the number of features. The general graph convolution is implemented using the following equation:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \quad (1)$$

In Eq. (1), $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph with self-connections, and $I_N$ is the identity matrix of the nodes. $\tilde{D}$ is the degree matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $W^{(l)}$ is the trainable weight matrix, and $\sigma(\cdot)$ denotes the activation function, such as ReLU, LeakyReLU, or ELU. $H^{(l)} \in \mathbb{R}^{N \times d}$ is the activation matrix of the $l$-th layer; $H^{(0)} = X$. Taking a two-layer graph convolution with activation functions as an example, the model can be defined by the formula:

$$Z = f(X, A) = \mathrm{softmax}\left(\hat{A}\, \mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right), \quad (2)$$

where $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$.
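As a concrete illustration of Eqs. (1) and (2), a minimal NumPy sketch of the two-layer forward pass; the weights here are random placeholders, not trained parameters:

```python
import numpy as np

def normalize_adj(A):
    # A_hat = D~^{-1/2} (A + I_N) D~^{-1/2}, the renormalized adjacency of Eq. (1)
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A, X, W0, W1):
    # Two-layer GCN of Eq. (2): softmax(A_hat ReLU(A_hat X W0) W1)
    A_hat = normalize_adj(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)           # first layer + ReLU
    Z = A_hat @ H @ W1                            # second layer (logits)
    Z = np.exp(Z - Z.max(axis=1, keepdims=True))  # numerically stable softmax
    return Z / Z.sum(axis=1, keepdims=True)

# toy graph: 3 nodes in a path, 2 input features, 2 classes
rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
Z = gcn_forward(A, rng.normal(size=(3, 2)), rng.normal(size=(2, 4)), rng.normal(size=(4, 2)))
```

Each row of `Z` is a probability distribution over the classes for the corresponding node.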
2.2 P Systems

There are various P systems, which can be classified as cell-like P systems, tissue-like P systems, and neural-like P systems. This paper uses a combination of tissue-like and cell-like P systems. The first is the cell-like P system, a model of membrane computing abstracted from biological cells. The outermost membrane of the system is called the skin membrane. A membrane that contains no other membranes is called an elementary membrane, the most basic structure; a membrane that contains other membranes is called a non-elementary membrane. The region of an elementary membrane is the space it encloses, while the region of a non-elementary membrane is the space between it and the membranes it contains. The structure of cell-like P systems is shown in Fig. 1. Membranes 2, 4, 6, 7, 8 and 9 are elementary membranes; membranes 3 and 5 are non-elementary membranes.
Fig. 1. The structure of a cell-like P system
The second is the tissue-like P system, which is an extension of the cell-like P system: a tissue-like P system consists of multiple cells, while the underlying cell-like P system is composed of a single cell [10]. There are various studies on tissue-like P systems nowadays. The tissue-like P system has a fission rule, so its number of cells increases as the computation proceeds, which indicates the good computational performance of tissue-like P systems, although they have a more complex structure than cell-like P systems. In tissue-like P systems, cells communicate with other cells or with the environment, and the rules are activated when the required state is reached. The structure of the tissue-like P system is shown in Fig. 2.
Fig. 2. The structure of the tissue-like P systems

3 Methods

PGCN uses structures abstracted from biological mechanisms as a basis for the completion and convolution operations on missing data. In this section, we describe the proposed PGCN model in more detail, starting with an explanation of the P system involved (the coupled P system). The entire processing of the missing data is performed
in the P systems. Not only the subsequent GCN layers but also the preceding feature computation are performed in the P systems. We first input the data information into cell1. Then the missing data is input into cell2, where a preliminary node representation is generated using the GMM. The generated data is then input into cell3, where each node aggregates neighboring features using the attention mechanism; the operations in cell3 yield a more comprehensive representation of the missing data. Next, the generated data is input into cell4 for the first layer of the convolution operation. Because the data features are completed using the GMM and the attention mechanism, the convolution operation in cell4 differs from that of a normal GCN. After the information is aggregated in cell4, a new node feature representation is formed. The node representation generated in the first layer is then input into the second convolution layer (cell5) for convolution. Finally, the classification results are stored in cell6. Fig. 3 illustrates the general framework of the PGCN model.
Fig. 3. The general framework of the PGCN model
3.1 The Coupled P Systems

We first introduce the basic framework of the coupled P system. A coupled P system is a combination of a cell-like P system and a tissue-like P system, and the GCN with Gaussian mixture model and attention mechanism (PGCN) in this paper runs in a coupled P system.
The coupled P system can be expressed in the following form:

$$\Pi = (O, \eta, \mu, \mathrm{syn}, \sigma_1, \cdots, \sigma_m, R, \mathrm{in}, \mathrm{out})$$

where
1. $O$ denotes the alphabet by which all elements in the coupled P system are represented: $O = \{l, w, D, A, \mu_j^k, \Sigma^k, \sigma_j^k, E_{ij}, v_j, X, P_{ij}^k, Q_{ij}^k, L, k, \alpha_{ij}\}$.
2. $\eta$ is the set of all initial objects of the coupled P system, $\eta = (X, W, L, k, l) \in O$.
3. $\mu$ denotes the membrane structure.
4. syn represents the cell-to-cell synaptic connections through which information transfer and communication between cells take place.
5. $\sigma_1, \cdots, \sigma_m$ are the basic cell units of the coupled P system; the number of nodes in each dataset determines the size of $m$.
6. $R$ is the set of all communication rules and evolution rules.
7. in denotes cell1, the input membrane; out denotes cell6, the output membrane.

Communication rules move objects from one cell to another, while evolution rules change the objects within a cell.

3.2 Evolutionary Rules

This section presents the completion and classification operations for missing data based on the P systems. Within the framework of the P systems, we perform the Gaussian mixture model and the attention mechanism, and in cells 4 and 5 we perform the convolution operations; the first-layer convolution formula is introduced in detail below.

Cell1 contains the initialization settings, such as the initial parameters, as well as the initial data input. $R_1: (X_{ij}, W, L, k, l)$, where $X_{ij}$ is the feature matrix of the nodes, $W$ is the convolutional layer weight of the GCN, $L$ is the node aggregation matrix, $k$ is the number of Gaussian mixture components, and $l$ is the number of convolutional layers. Cell6 contains the final node classification results: $R_6 = \emptyset$.

Evolutionary Rules for Nodes. Evolutionary rules in Cell2: There are two cases for the input data.
One is that the data is complete and needs no special completion; the other is that some features are missing. The input from cell1 to cell2 is the missing data. For data with missing features, we fill in the features using a Gaussian mixture model, which is a probabilistic model assuming that all data points come from a mixture of a finite number of Gaussian distributions with unknown parameters. Here we use the expectation-maximization (EM) algorithm to solve for the desired $\pi_k$, $\mu^{[k]}$, $\Sigma^{[k]}$. We take the feature vector $X$ of a node to be generated from the Gaussian mixture:

$$X \sim \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(\mu^{[k]}, \Sigma^{[k]}\right), \quad (3)$$

where $\mu^{[k]} = \left(\mu_1^{[k]}, \ldots, \mu_D^{[k]}\right)^{T}$ denotes the mean vector of all elements and $\mu_i^{[k]}$ denotes the mean of the $i$-th element; $\Sigma^{[k]} = \mathrm{diag}\!\left(\sigma_1^{[k]2}, \ldots, \sigma_D^{[k]2}\right)$ denotes the variance vector of all elements and $\sigma_i^{[k]2}$ denotes the variance of the $i$-th element of the $k$-th Gaussian component; $\pi_k$ are the mixing coefficients with the constraint $\sum_k \pi_k = 1$; $K$ is the number of Gaussian mixture components ($K = 5$), and $D$ is the number of features. The representation $\tilde{X}_{ij}$ is then generated according to Eq. (4), and the missing data entered in cell1 can be represented as follows:
$$R_3: \tilde{X}_{ij} \leftarrow \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(Q_{ij}^{[k]}, P_{ij}^{[k]}\right), \quad (4)$$

where $Q^{[k]} \in \mathbb{R}^{N \times D}$ denotes the mean matrix of the elements and $P^{[k]} \in \mathbb{R}^{N \times D}$ is the variance matrix of the elements. If the feature of a node is missing, then $Q_{ij}^{[k]} = \mu_j^{[k]}$ and $P_{ij}^{[k]} = \sigma_j^{[k]2}$.

Evolutionary rules in Cell3: Cell3 calculates the importance of different nodes through the node attention mechanism, which further optimizes the data generated using the GMM. The attention mechanism can aggregate information from neighbors with different levels of importance, thus avoiding excessive similarity of nodes. We calculate the attention coefficient between node $i$ and its neighbors:

$$R_{31}: E_{ij} = \mathrm{att}\!\left(W h_i, W h_j\right), \quad (5)$$

where $E_{ij}$ denotes the attention score, $\mathrm{att}(\cdot)$ denotes the attention function indicating the importance of the features of node $j$ to node $i$, and $j$ is a neighbor of node $i$ on the graph [11]. We also perform a masked attention operation to incorporate the graph structure into the attention mechanism, and we use the softmax function to normalize over all $j$. This makes it easy to compare the attention scores between different nodes and to distinguish the important neighbors of node $i$:

$$R_{32}: \alpha_{ij} = \mathrm{softmax}_j\!\left(E_{ij}\right) = \frac{\exp\!\left(E_{ij}\right)}{\sum_{k \in N_i} \exp\!\left(E_{ik}\right)}, \quad (6)$$

$$R_{33}: X_{ij} = \sum_{j} \alpha_{ij} \cdot \tilde{X}_{ij}, \quad (7)$$
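A small NumPy sketch of how Eqs. (4)–(7) could be realized, under assumptions we make explicit: the mixture parameters π and μ are taken as already fitted by EM, each missing entry is filled with the mixture mean (the expectation of the distribution in Eq. (4)), and att(·) is a simple dot-product score over concatenated projections rather than the full GAT scoring function. The names `impute_with_gmm`, `attention_refine` and `a_vec` are ours, not from the paper:

```python
import numpy as np

def impute_with_gmm(X, mask, pi, mu):
    # Eq. (4), taken in expectation: replace each missing entry (mask == True)
    # by the GMM mean sum_k pi_k * mu^{[k]} of the corresponding feature.
    X_tilde = X.copy()
    mixture_mean = pi @ mu                    # (D,) expected feature vector
    rows, cols = np.where(mask)
    X_tilde[rows, cols] = mixture_mean[cols]
    return X_tilde

def attention_refine(X_tilde, A, a_vec, W):
    # Eqs. (5)-(7): masked attention over graph neighbors, softmax-normalized.
    H = X_tilde @ W                           # projected features W h_i
    X_out = np.zeros_like(X_tilde)
    for i in range(A.shape[0]):
        nbrs = np.where(A[i] > 0)[0]
        if nbrs.size == 0:                    # isolated node: keep as-is
            X_out[i] = X_tilde[i]
            continue
        # Eq. (5): E_ij = att(W h_i, W h_j), here a dot product with a_vec
        scores = np.array([a_vec @ np.concatenate([H[i], H[j]]) for j in nbrs])
        alpha = np.exp(scores - scores.max()) # Eq. (6): softmax over neighbors
        alpha /= alpha.sum()
        X_out[i] = alpha @ X_tilde[nbrs]      # Eq. (7): weighted aggregation
    return X_out
```

In a full implementation the scoring function would typically include a LeakyReLU, as in GAT; the simplified score here only illustrates the data flow of rules R31–R33.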
Eqs. (6) and (7) yield a more accurate completion of the missing data.

Evolutionary rules in Cell4: This cell of the P systems contains the first convolution operation of the GCN. The complete node feature matrix $X_{ij}$ is obtained by the previous operations, and we then pass the data into the next cell (cell4) for computation. The expected activation of the neurons in the first layer is slightly different from a normal GCN, since its input is the new $X_{ij}$ produced with the attention mechanism. According to the literature [16], it is possible to obtain

$$\mathbb{E}\!\left[\mathrm{ReLU}\!\left(\mathcal{N}\!\left(\mu, \sigma^2\right)\right)\right] = \sigma \, \mathrm{NR}\!\left(\frac{\mu}{\sigma}\right), \quad (8)$$

where $\mathrm{NR}(x) = x\,\Phi(x) + \phi(x)$, with $\Phi$ and $\phi$ the standard normal CDF and density.
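The identity in Eq. (8) can be checked numerically. A short sketch (with NR(x) = xΦ(x) + φ(x)) comparing the closed form against a Monte Carlo estimate:

```python
import math
import random

def nr(x):
    # NR(x) = x * Phi(x) + phi(x); Phi/phi: standard normal CDF and density
    phi = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * Phi + phi

def expected_relu(mu, sigma):
    # Closed form of Eq. (8): E[ReLU(Z)] for Z ~ N(mu, sigma^2)
    return sigma * nr(mu / sigma)

# Monte Carlo cross-check of the closed form
random.seed(0)
mu, sigma = 0.3, 1.2
mc = sum(max(0.0, random.gauss(mu, sigma)) for _ in range(200_000)) / 200_000
# expected_relu(mu, sigma) and mc should agree to about two decimal places
```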
Then we can update the first-layer convolution function of the GCN according to Eq. (8) to obtain Eq. (9):

$$R_{41}: \mathrm{ReLU}\!\left(\hat{A} X W\right)_{ij} = \sum_{k=1}^{K} \alpha_{ij} \pi_k \sqrt{\hat{P}_{ij}^{[k]}}\; \mathrm{NR}\!\left(\frac{\hat{Q}_{ij}^{[k]}}{\sqrt{\hat{P}_{ij}^{[k]}}}\right), \quad (9)$$
Evolutionary rules in Cell5: The activated data are convolved once more to obtain Eq. (10), where we use the most common two-layer convolution with layer weights $W^{(0)}$, $W^{(1)}$ and the self-connected adjacency matrix $\tilde{A}$ of the undirected graph, followed by a final softmax normalization to achieve node classification:

$$R_5: Z = \mathrm{softmax}\!\left(\tilde{A} \sum_{k=1}^{K} \alpha_{ij} \pi_k \sqrt{\hat{P}_{ij}^{[k]}}\; \mathrm{NR}\!\left(\frac{\hat{Q}_{ij}^{[k]}}{\sqrt{\hat{P}_{ij}^{[k]}}}\right) W^{(1)}\right), \quad (10)$$

When the data structure is complete at the input of the feature matrix,

$$P_{ij}^{[k]} = 0, \quad Q_{ij}^{[k]} = (AXW)_{ij}, \quad (11)$$

For node classification, we use a cross-entropy function to evaluate the classification of the nodes and update the parameters using gradient descent.

3.3 Communication Rules Between Different Cells

Cells communicate with each other through synapses, and each cell has different inputs and outputs, so rules are defined to allow the cells to communicate.

Rule1: $(1, u/\lambda, 2)$. The feature matrix $X$ of the initial data nodes, the node aggregation matrix $L$, and the number of Gaussian mixture components $k$ are input from cell1 to cell2.
Rule2: $(2, u/\lambda, 3)$. The node feature representation $\tilde{X}_{ij}$ completed by the Gaussian mixture model in cell2 is passed from cell2 into cell3.
Rule3: $(3, u/\lambda, 4)$. The final completed node feature representation $X_{ij}$ is obtained by computing the attention mechanism in cell3, and the data is passed from cell3 into cell4.
Rule4: $(4, u/\lambda, 5)$. The expected activation of the neurons is computed in cell4 on the complete data coming from cell3, and the new node representation is then input into cell5.
Rule5: $(5, u/\lambda, 6)$. The second layer of convolution is computed in cell5 to form the final node representation, and node classification is then performed. The final node classification results are stored in cell6.
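The cross-entropy evaluation over the labelled nodes mentioned above can be sketched as follows (the function name and the `1e-12` stabilizer are our choices):

```python
import numpy as np

def masked_cross_entropy(Z, y, train_idx):
    # Z: softmax outputs of Eq. (10), shape (N, C); y: integer class labels.
    # The loss is averaged over the labelled (training) nodes only.
    p = Z[train_idx, y[train_idx]]
    return -np.mean(np.log(p + 1e-12))

# toy check: confident, correct predictions give a small loss
Z = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
y = np.array([0, 1, 0])
loss = masked_cross_entropy(Z, y, np.array([0, 1]))  # ≈ 0.164
```

In practice this loss would be minimized by gradient descent over the GCN weights, as the paper states.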
4 Experiment

To evaluate the performance of this model, the data were preprocessed on three datasets with different levels of missing data. We randomly selected node features in each dataset and removed them from the feature matrix $X$ of the nodes, obtaining graph structures with different missing rates.

4.1 Datasets

In this paper, three datasets, Cora, Citeseer and AmaComp, are used. The details of the datasets are shown in Table 1.

Table 1. Statistical information about the datasets

| Dataset  | Nodes | Edges  | Classes | Features |
| Cora     | 2708  | 5429   | 7       | 1433     |
| Citeseer | 3312  | 4732   | 6       | 3703     |
| AmaComp  | 13752 | 287209 | 10      | 767      |
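The random feature removal described above can be sketched as follows; the function name and the zero fill value are our assumptions, since the paper does not specify the fill value:

```python
import numpy as np

def mask_features(X, missing_rate, seed=0):
    # Randomly select entries of the node feature matrix X and remove them
    # (here: set to zero), returning the mask so the removed positions are known.
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < missing_rate
    X_missing = X.copy()
    X_missing[mask] = 0.0
    return X_missing, mask
```

Running this for missing rates from 0.1 to 0.5 yields the graph variants evaluated in Tables 2–4.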
4.2 Baselines

The comparison methods in this paper first form a complete dataset by imputation and then feed the dataset into the original GCN.

GAIN [12] (2018): GAIN is a generative adversarial network based approach for missing data imputation.
GINN [13] (2020): This is an imputation technique that uses a graph encoder to perform denoising.
GAT [11] (2018): This is a graph neural network based on an attention mechanism, which assigns corresponding weights to different neighboring nodes. Here we directly input the data into GAT for the classification prediction of nodes.
GCN-LPA [14] (2021): This is an algorithm that combines a label propagation algorithm with a graph convolutional neural network. It uses LPA regularization to help the GCN learn edge weights and thus update the node representations for node classification.
VE-GCN [15] (2022): This is a method that uses variational autoencoders to mitigate missing information in data, focusing on the missing-information problem in CF-based recommendation. It differs from a VAE in that it uses historical interactions to produce prior information, which is then used for adequate learning.

4.3 Experimental Implementation

The experimental parameters are set as follows: the GCN has two convolution layers; the number of hidden units is 16 on the Cora and Citeseer datasets and 64 on AmaComp; the parameters are optimized using Adam with a learning rate of 0.01; an early-stopping strategy with a patience of 100 epochs is adopted to prevent overfitting; the maximum number of epochs is 200; and the missing rate of the data increases from 0.1 to 0.5 in steps of 0.1. On the AmaComp dataset, we randomly select 50 nodes in each class for training, 600 nodes for validation, and the remaining data for testing. The accuracies of the node classification tasks performed on the datasets by the different methods are shown below.

Table 2. The accuracies (%) of the node classification task on the Cora

| Missing rate | 0.1   | 0.2   | 0.3   | 0.4   | 0.5   |
| GAIN         | 79.43 | 79.86 | 78.65 | 76.94 | 74.43 |
| GINN         | 79.35 | 80.07 | 77.24 | 76.58 | 72.32 |
| GAT          | 77.82 | 75.69 | 72.47 | 67.78 | 60.45 |
| GCN-LPA      | 78.56 | 77.29 | 75.03 | 74.82 | 73.51 |
| VE-GCN       | 79.49 | 78.31 | 77.94 | 77.58 | 75.87 |
| PGCN         | 81.69 | 81.31 | 80.09 | 79.12 | 76.39 |
Table 3. The accuracies (%) of the node classification task on the Citeseer

| Missing rate | 0.1   | 0.2   | 0.3   | 0.4   | 0.5   |
| GAIN         | 69.37 | 68.56 | 67.45 | 64.68 | 64.79 |
| GINN         | 69.69 | 67.68 | 66.43 | 63.77 | 60.76 |
| GAT          | 65.44 | 63.66 | 60.78 | 57.36 | 53.24 |
| GCN-LPA      | 68.57 | 65.33 | 64.04 | 63.95 | 62.38 |
| VE-GCN       | 69.63 | 68.35 | 67.34 | 66.24 | 64.43 |
| PGCN         | 72.34 | 70.11 | 69.23 | 67.07 | 65.66 |
Table 4. The accuracies (%) of the node classification task on the AmaComp

| Missing rate | 0.1   | 0.2   | 0.3   | 0.4   | 0.5   |
| GAIN         | 80.23 | 79.90 | 78.49 | 75.36 | 71.24 |
| GINN         | 80.47 | 79.36 | 78.87 | 74.36 | 70.40 |
| GAT          | 77.23 | 73.46 | 71.53 | 68.88 | 66.79 |
| GCN-LPA      | 78.36 | 75.97 | 74.36 | 72.87 | 71.87 |
| VE-GCN       | 79.68 | 78.32 | 75.65 | 72.21 | 71.57 |
| PGCN         | 82.09 | 81.96 | 81.81 | 79.07 | 77.76 |
The accuracies obtained by the different methods are presented in Tables 2, 3 and 4; the best accuracies are bolded and the second best are underlined. The performance of our PGCN is significantly better than that of the other methods, and the superiority of our method becomes more evident as the missing rate grows. This is because our method generates the missing data features while avoiding node over-similarity, so the performance degradation is not significant when the missing rate increases. As the missing rate increases, other methods of processing missing data can make the completed features too similar, leading to a decrease in classification performance. GAT has the lowest classification performance because the data are missing and the missing features are not completed. As for GINN, the performance decreases significantly when the missing rate changes from forty to fifty per cent: the interpolation approach requires a strong correlation between data, and the increase of missing data weakens this correlation. GAIN and VE-GCN generate missing features directly for the missing data, without assigning different attention to different neighboring nodes for aggregation. This may lead to excessive similarity between nodes, so their performance is inferior to our method's. Moreover, our method is robust, while the other methods show considerably larger variation in classification accuracy. VE-GCN handles missing data using an interactive approach, but its performance is not superior in the downstream task of node classification, and the difference in classification accuracy is greater on the larger AmaComp dataset. GCN-LPA has a higher computational cost in generating and propagating new node labels along the graph structure, and its running time is considerably longer than that of the other methods (Fig. 4). We also investigated the effect of the number of GCN layers on this model; as Fig. 4 shows, the two-layer GCN model performs best in every case. This is because convolution over many layers may cause an over-smoothing phenomenon, diluting the characteristics of the nodes.
As a result, all nodes are too similar, leading to the degradation of the classification performance.
Fig. 4. Comparison of accuracy of models with different numbers of layers
4.4 Ablation Experiments

In this section we subject our model PGCN to ablation experiments and compare its performance under different configurations. We name the variants of PGCN as follows:
PGCN: the original model in this paper, without any changes.
w/o attention: PGCN with the attention mechanism removed.
w/o GMM: PGCN with the Gaussian mixture model removed.
w/o GMM & attention: PGCN with both the Gaussian mixture model and the attention mechanism removed.
We ran all methods five times and report the average classification accuracy in the figure. Removing the attention mechanism decreases the accuracy of the model by about 2% to 5% on the three datasets. Removing the Gaussian mixture model decreases the accuracy by about 4% to 10%. Removing both the attention mechanism and the Gaussian mixture model decreases the accuracy by about 5% to 15%. When the missing rate increases, the accuracy decreases significantly. This indicates that the attention mechanism and the Gaussian mixture model both have a positive effect on PGCN.
Fig. 5. Ablation experiments performed on the Cora, Citeseer and AmaComp datasets
We analyze the time complexity of the whole model. In the Gaussian mixture model part, the running time would increase greatly because we have to calculate the mean and variance of $k$ Gaussian distributions and then the attention coefficients of $n$ nodes. However, by exploiting the computational parallelism of the P systems, the GMM computes the parameters of the $k$ Gaussian distributions in parallel across cells, and the attention coefficients of the $n$ nodes are computed in parallel in the attention cell. In this way, PGCN does not add extra running time and remains similar to the original GCN model. Thus, in theory, the running efficiency is improved.
5 Conclusion

This paper proposes a GCN model based on P systems, called PGCN, which can process data with missing features. PGCN first generates missing node features using a Gaussian mixture model and an attention mechanism while avoiding excessive similarity of nodes. At the end of the paper, we perform an experimental analysis of node classification on three datasets. Our approach performs excellently in handling data with different missing rates. In this paper, we do not study the classification performance of PGCN for data with a missing rate greater than 50%; we will continue to explore this issue in future studies. In the future, we will also test PGCN with deeper structures and higher-dimensional features.

Acknowledgment. This work was financially supported in part by the National Natural Science Foundation of China (Nos. 621722622, 61876101, 61802234 and 61806114), the Social Science Foundation of Shandong Province (16BGLJ06, 11CGLJ22), China Postdoctoral Science Foundation Projects (2017M612339, 2018M642695), the Natural Science Foundation of Shandong Province (ZR2019QF007), the China Postdoctoral Special Funding Program (2019T120607), and the Youth Fund for Humanities and Social Sciences of the Ministry of Education (19YJCZH244).
References

1. Ying, R., et al.: Graph convolutional neural networks for web-scale recommender systems. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 974–983 (2018)
2. He, X., et al.: BiRank: towards ranking on bipartite graphs. IEEE Trans. Knowl. Data Eng. 29(1), 57–71 (2017)
3. Sun, H., et al.: Open domain question answering via semantic enrichment. In: Proceedings of the 24th International Conference on World Wide Web, pp. 1045–1055 (2015)
4. Wang, Z., et al.: SINE: second-order information network embedding. IEEE Access 8, 139044–139051 (2020)
5. Chen, H., et al.: Multi-level graph convolutional networks for cross-platform anchor link prediction. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1503–1511 (2020)
6. Tran, L., et al.: Text classification problems via BERT embedding method and graph convolutional neural network. In: 2021 International Conference on Advanced Technologies for Communications (ATC), pp. 260–264 (2021)
7. Kim, J., Hastak, M.: Social network analysis: characteristics of online social networks after a disaster. Int. J. Inf. Manage. 38(1), 86–96 (2018)
8. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2017)
9. Păun, G.: Computing with membranes. J. Comput. Syst. Sci. 61(1), 108–143 (2000)
10. Ye, L., et al.: Solving the 0–1 knapsack problem by using tissue P system with cell division. IEEE Access 7, 66055–66067 (2019)
11. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: 6th International Conference on Learning Representations (2018)
12. Yoon, J., Jordon, J., van der Schaar, M.: GAIN: missing data imputation using generative adversarial nets. In: ICML, pp. 5689–5698 (2018)
13. Spinelli, I., Scardapane, S., Uncini, A.: Missing data imputation with adversarially-trained graph convolutional networks. Neural Netw. 129, 249–260 (2020)
14. Wang, H., Leskovec, J.: Combining graph convolutional neural networks and label propagation. ACM Trans. Inf. Syst. 40(4), 1–27 (2021)
15. Xiong, X., et al.: Handling information loss of graph convolutional networks in collaborative filtering. Inf. Syst. 109, 102051 (2022)
Explainable Artificial Intelligence 101: Techniques, Applications and Challenges Wiktor Kurek1,2, Marek Pawlicki1,2(B), Aleksandra Pawlicka2,3, Rafał Kozik1,2, and Michał Choraś1,2 1 Bydgoszcz University of Science and Technology, Bydgoszcz, Poland 2 ITTI Sp. z o.o., Poznań, Poland
[email protected] 3 University of Warsaw, Warsaw, Poland
Abstract. Artificial Intelligence (AI) systems have grown commonplace in modern life, with various applications from customized suggestions to self-driving vehicles. As these systems get more complicated, the necessity for transparency in their decision-making processes becomes more critical. Explainability refers to an AI system's ability to explain how and why it made a certain judgement or prediction. Recently, there has been a surge of interest in constructing explainable AI (XAI) systems that give insights into the decision-making processes of machine learning models. This paper discloses and elaborates upon a selection of XAI techniques, and identifies current challenges and possible future directions in XAI research. Keywords: Artificial Intelligence · Cybersecurity · Explainability · xAI
1 Introduction

The evolution of Artificial Intelligence (AI) has revolutionized people’s ways of living and working, by making difficult activities quicker and simpler to do. As deep learning, machine learning, and other techniques have advanced, AI has become more sophisticated, and started to make judgements that have far-reaching consequences in human lives. Thus, as AI is widely used in many industries such as banking, security or healthcare, it is getting increasingly necessary to guarantee that these systems are transparent, trustworthy, and responsible [6, 15, 24].

1.1 Background on Explainable Artificial Intelligence

Explainable Artificial Intelligence (XAI) is a study area that seeks to overcome this problem by giving insight into how AI systems make choices, and refers to a set of strategies and procedures used to improve the transparency and interpretability of AI systems for human specialists. The primary purpose of XAI is to assist users in understanding how an AI system arrived at a certain choice, as well as to enhance the system’s decision-making procedure, to be more readily available and reliable.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 310–318, 2023. https://doi.org/10.1007/978-981-99-4752-2_26
1.2 Importance of XAI in Machine Learning Models

In the context of machine learning, XAI is increasingly used due to the need for transparency and accountability. XAI techniques can greatly enhance the interpretability of machine learning models, making them more accessible for stakeholders and building trust in the models [35]. In traditional black-box models, the decision-making mechanism is complex and opaque, which is the key culprit behind the models’ lack of adoption [36].

1.3 The Purpose of the Paper

The purpose of this paper is to provide an overview of the latest developments in explainable artificial intelligence and their potential applications. The section concerning XAI techniques discusses the latest XAI techniques, such as LIME, SHAP, and others. The nature and performance of these techniques, as well as their strengths and limitations, are explained. The paper wraps up with sections on challenges and future directions, which summarize the latest XAI research. In the following, the paper highlights the current limitations of XAI techniques and potential areas for improvement, such as the development of more efficient and accurate XAI methods.
2 XAI Techniques

Figure 1 shows the breakdown of the XAI techniques presented in this work. The main categories are model-based and rule-based.
Fig. 1. The breakdown of the XAI techniques featured in this work
2.1 Rule-Based Techniques

Rule-based approaches are a sort of explainable AI that works by creating a collection of rules that explicitly express the model’s decision-making process [3]. Humans can comprehend the rules in this technique, and their logic is simple to grasp [3]. These rules can be established manually by domain experts or learnt by a rule-learning algorithm [25]. There are various advantages to using rule-based strategies over other machine learning models. They are uncomplicated to read, and their decision-making process is transparent, making it simple to find and repair model mistakes. They are also highly scalable and capable of dealing with both continuous and discrete data [13]. Nonetheless, rule-based approaches have several disadvantages. They need extensive prior knowledge of the domain and may be incapable of capturing complex interactions between variables. They are also unsuitable for dealing with noisy or missing data, and they are susceptible to changes in data distribution [7].

Decision Trees. The decision tree method, which constructs a tree-like model of decisions and their potential repercussions, is a common example of a rule-based technique. Each node in the tree indicates a choice based on a certain data aspect or attribute, and the edges reflect the various consequences of that decision [25]. Roth et al. [28] developed a reinforcement learning algorithm for calculating a collision-free route for a robot. To eliminate mistakes, they converted the algorithm into a decision tree, naming their method XAI-N. A different approach was presented by Schaaf et al. [30]: in order to improve decision tree matching for deep neural networks, L1-orthogonal regularization was applied during network training. Another field is cybersecurity, where a decision tree model was used to improve the trust management of machine learning algorithms [19].
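As a concrete illustration of how a decision tree exposes its decision-making process, the short sketch below fits a deliberately shallow tree on a standard dataset and prints the learned decision paths as nested if-then rules. It assumes scikit-learn is available and is not tied to any of the cited systems.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a deliberately shallow tree so the extracted rules stay human-readable.
data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# export_text renders every decision path as nested if-then rules that a
# domain expert can inspect directly.
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```

Limiting the depth trades a little accuracy for rules short enough to audit, which is exactly the accuracy/explainability trade-off discussed later in this paper.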
The authors argue that AI makes judgments by analyzing massive volumes of data in order to find possibly hidden patterns and weak signals.

Rule Lists. The production rule system, often known as a rule-based expert system, is another common rule-based approach [5]. This system is made up of a collection of production rules which specify the connections between input and output variables. Production rules are often written in the form of “if-then” declarations, with the antecedent being the condition or collection of requirements that must be satisfied for the rule to be used and the consequent being the action executed if the rule is followed [26]. Expert systems are frequently deployed in diagnostic and decision-making applications [1]. The interpretability of rule lists is their primary advantage. They also tend to be more effective than decision trees, since they only need to examine the input features once to categorize the input [4]. Bahani et al. implemented a fuzzy algorithm to classify heart disease diagnoses [2].

RuleFit. RuleFit is a machine learning approach that combines decision trees with linear models to build a hybrid model capable of capturing both linear and non-linear data correlations. To identify the non-linear relationships in the data, the RuleFit algorithm first constructs a decision tree ensemble, often a random forest. Using a process known as rule extraction, the decision tree ensemble is then turned into a collection of rules. To develop a hybrid model, the extracted rules are merged with linear models such
as linear regression or logistic regression. The linear models are utilized to represent the linear patterns of the data, while the non-linear correlations are captured by the rules retrieved from the decision tree ensemble. The final model is made up of these two components, whose weights are learnt during training [10]. Luo et al. implemented this approach to predict cancer survival. The findings demonstrated that the RuleFit-based nomogram correctly predicted survival in individuals with nasopharyngeal carcinoma. The nomogram outperformed previous models that did not incorporate inflammatory markers in terms of discrimination and calibration [18].

Certifai. Certifai is a universal tool that can be used with any black-box model and any sort of input data, and it provides CERScore, a black-box model robustness score that outperforms approaches that have access to model internals [31]. The algorithm constrains the values of the sampled points based on user selections, allowing the generated counterfactuals to reflect a user’s notion of how much they may modify their features. Continuing the work, a framework called “Cortex Certifai” was created [14].
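The RuleFit recipe above — tree ensemble, rule extraction, sparse linear fit over rules plus raw features — can be sketched as follows. This is a simplification of Friedman and Popescu’s algorithm (leaf membership stands in for full rule extraction), and all dataset and layer sizes are illustrative, not taken from the cited study.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

# Step 1: a tree ensemble captures the non-linear structure of the data.
forest = RandomForestRegressor(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# Step 2: simplified rule extraction -- each leaf of each tree becomes a
# binary "rule" feature: 1 if the sample lands in that leaf, else 0.
leaf_ids = forest.apply(X)  # shape (n_samples, n_trees)
rule_features = np.hstack([
    (leaf_ids[:, t][:, None] == np.unique(leaf_ids[:, t])[None, :]).astype(float)
    for t in range(leaf_ids.shape[1])
])

# Step 3: merge the rules with the raw (linear) features and fit a sparse
# linear model; the Lasso penalty keeps only the most useful terms.
hybrid = np.hstack([X, rule_features])
model = Lasso(alpha=0.1, max_iter=10000).fit(hybrid, y)
print("kept terms:", int(np.sum(model.coef_ != 0)))
```

The surviving terms are directly readable: a raw-feature coefficient is a linear effect, and a rule coefficient is the contribution of one interpretable leaf condition.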
2.2 Model-Based Techniques

Model-based techniques rely on the representation of AI models using mathematical or logical models that are easier to interpret. These models can be derived from the original AI models or built from scratch using the available data. The techniques can be broadly classified into two categories: model-agnostic and model-specific methods [27]. Model-agnostic methods can be used with any machine learning model, independent of architecture or training technique. Model-specific algorithms are those that are adapted to certain machine learning models or architectures [21]. One of the key advantages of model-based techniques is that they allow for more efficient and effective analysis of large datasets. By using mathematical models, researchers can quickly identify patterns and relationships within the data that might not be immediately apparent through other methods [27].

Neural networks are complex models that consist of multiple layers of interconnected nodes (also called neurons). These models are often used in tasks such as image recognition, natural language processing, and speech recognition. Neural networks are less interpretable than decision trees and linear models because they involve a large number of parameters and non-linear transformations [11].

Linear Models are mathematical models that describe the relationship between a dependent variable and one or more independent variables. These models are often used in regression tasks to predict the value of a dependent variable based on the values of the independent variables. Linear models are interpretable because they provide information about the magnitude and sign of the coefficients of the independent variables.

LIME (Local Interpretable Model-Agnostic Explanations) is an algorithm developed in 2016 that provides explanations for the predictions of complex machine learning models.
This is achieved by generating a “local” model that approximates the behavior of the original model around a specific input instance. LIME is model-agnostic, allowing it to be used with any type of machine learning model, including deep neural networks, decision trees and support vector machines. In the first step of the LIME algorithm, the
instance to be explained is selected. Next, perturbed versions of the instance are generated by randomly masking or adding noise to the features. The number of perturbed instances depends on the complexity of the original model and the desired level of accuracy. Then, the weights of the interpretable features are computed. This is accomplished by training a linear model on the perturbed instances, where the interpretable features are used as input and the output is the predicted probability of the original model. The weights of the linear model are then used to compute the importance of each feature in the prediction. The final step is the creation of the local model. This is achieved by selecting a subset of interpretable features based on their importance weights and training a simple interpretable model. The local model is then used to explain the prediction by displaying the contribution of each feature to the output [21].

SHAP (SHapley Additive exPlanations) is a machine learning technique that assigns a contribution score to each input feature to explain the prediction. The SHAP score is the difference between the expected prediction when the feature is present and the expected prediction when the feature is absent, averaged over all possible combinations of features. In the algorithm, the first step is to define the baseline prediction, which represents the average prediction of the model over the whole dataset. Contribution values are then assigned to each feature using the Shapley value, a concept from cooperative game theory. Taking into account the interaction with other features, the Shapley value represents the marginal contribution of a feature to the prediction. The final step is to combine the values to obtain a complete interpretation of the prediction. This can be done by displaying these values for each feature on a bar or summary plot [21]. Mitrović et al. have used this framework [20].
On the wave of popularity of ChatGPT, the authors used this framework to detect whether a text was written by a human.
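The Shapley computation described above can be made concrete with a brute-force implementation that enumerates every feature coalition; “absent” features are replaced by a baseline value standing in for the dataset average. The black-box model below is a known linear function purely so the result is checkable — for a linear model the Shapley value of feature i reduces to w_i·(x_i − baseline_i). This is an illustrative sketch, not the SHAP library’s optimized estimators.

```python
import itertools
import math
import numpy as np

def shapley_values(predict, x, baseline):
    """Exact Shapley values of one instance's prediction: for every feature,
    average its marginal contribution over all coalitions of the remaining
    features, with absent features set to the baseline."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in itertools.combinations(others, size):
                weight = (math.factorial(size) * math.factorial(n - size - 1)
                          / math.factorial(n))
                without_i = baseline.copy()
                without_i[list(subset)] = x[list(subset)]
                with_i = without_i.copy()
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi

# Toy black box: a linear model with known weights, so the Shapley values
# can be verified by hand.
w = np.array([2.0, -1.0, 0.5])
predict = lambda v: float(v @ w)
baseline = np.ones(3)          # stands in for the dataset-average input
x = np.array([3.0, 0.0, 2.0])
phi = shapley_values(predict, x, baseline)
print(phi)  # equals w * (x - baseline): [4.0, 1.0, 0.5]
```

Note the additivity property: the values sum to f(x) − f(baseline), which is what makes a SHAP bar plot a complete decomposition of a single prediction.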
3 Challenges in Implementing XAI

The implementation of explainable artificial intelligence poses several challenges that need to be addressed in order to effectively integrate explainability into machine learning models. Key challenges include:

– Trade-off between accuracy and explainability: Highly accurate machine learning models are often complex and difficult to explain, while more transparent models may sacrifice accuracy. This is one of the key challenges in the implementation of XAI [8, 32].
– Interpretation of the XAI output: The output of an XAI technique can be difficult to interpret by non-experts, such as business managers, policymakers, or end-users. In the same vein, developing user-friendly and effective visualization techniques is challenging [17].
– Computational overhead: The computational cost of some XAI techniques can be very high, with significant processing power and time required to generate explanations. A challenge is to optimize the XAI techniques for efficiency [34].
– Diversity of ML models and architectures: Various machine learning models and architectures are available, each requiring a tailored XAI technique for explainability. The challenge is to develop multi-model XAI techniques that can be applied to multiple architectures and models [12].
– Lack of standardization: The implementation of XAI techniques and the corresponding guidelines are currently not standardized. Therefore, XAI implementation can be inconsistent across different industries, organizations, and applications [9].
– Privacy and security concerns: In some cases, when the workings of machine learning models are explained in detail, sensitive information regarding the input data, model parameters, or decision-making processes can be revealed. A key challenge is ensuring the privacy and security of the data [27].
4 Future of XAI

With the increasing demand for transparency and accountability in machine learning models, the importance of explainable artificial intelligence is expected to increase in the coming years. The future of XAI is being shaped by several trends, such as:

• Standardization: The lack of standardization in XAI techniques is a challenge for the widespread adoption of explainable models. Efforts to develop standardization frameworks and best practices for XAI techniques are underway. With increasing standardization, the development and implementation of XAI models will become more accessible to businesses and organizations [29].
• Continued research and development: The development of XAI techniques is still in its initial stages, and new methods and approaches are expected to emerge in the coming years. Researchers are working on more efficient, effective, and privacy-preserving XAI techniques [8, 33].
• Collaboration between experts in different fields: XAI development and implementation rely on a collaborative effort among experts in machine learning, explainability, ethics and human-computer interaction. XAI’s importance is expected to grow with the use of interdisciplinary collaboration [22].
• Integration into automated decision-making systems: As the integration of machine learning models into automated decision-making systems becomes more widespread, the need for XAI will also increase. XAI can enhance the transparency and interpretability of these systems, and strengthen trust among stakeholders [23].
• Adoption in high-stakes domains: XAI is especially relevant in high-stakes domains, where machine learning models make significant decisions whose effects can be felt in the public realm. The adoption of XAI in these domains is anticipated to increase as stakeholders demand more transparency and accountability [16].
5 Conclusions

Applying XAI techniques has substantially helped to address the challenges of explainability. Employing such methods as rule-based or model-based approaches has demonstrated the potential of XAI in enhancing the interpretability and transparency of AI models. However, XAI still comes with a number of possible limitations. They include the dilemma of balancing accuracy against explainability and the need for domain-specific knowledge. In addition, the data used for training AI models might be biased, too.
To overcome these challenges, further research and development are required to enable the widespread adoption of XAI techniques. The role of XAI techniques is likely to become even more important in the future, continuously contributing to the development of trustworthy and ethical AI systems. As demand for explainable AI continues to grow, XAI is expected to evolve and expand, offering more sophisticated and effective solutions to the challenges of interpretability and transparency in AI. Acknowledgements. This work is funded under the AI4Cyber project, which has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070450.
References

1. Ambhaikar, A.: A survey on health care and expert system. Math. Statist. Eng. Appl. 72(1), 451–461 (2023)
2. Bahani, K., Moujabbir, M., Ramdani, M.: An accurate fuzzy rule-based classification systems for heart disease diagnosis. Sci. African 14, e01019 (2021)
3. Baydin, A.G., Pearlmutter, B.A., Radul, A.A., Siskind, J.M.: Automatic differentiation in machine learning: a survey. J. Mach. Learn. Res. 18, 1–43 (2018)
4. Burkhardt, S., Brugger, J., Wagner, N., Ahmadi, Z., Kersting, K., Kramer, S.: Rule extraction from binary neural networks with convolutional rules for model validation. Front. Artif. Intell. 4, 642263 (2021)
5. Cambra Baseca, C., Sendra, S., Lloret, J., Tomas, J.: A smart decision system for digital farming. Agronomy 9(5), 216 (2019)
6. Choraś, M., Pawlicki, M., Puchalski, D., Kozik, R.: Machine learning – the results are not the only thing that matters! What about security, explainability and fairness? In: Krzhizhanovskaya, V.V., et al. (eds.) Computational Science – ICCS 2020. LNCS, vol. 12140, pp. 615–628. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50423-6_46
7. Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012)
8. Doshi-Velez, F., Kim, B.: Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608 (2017)
9. Dwivedi, R., et al.: Explainable AI (XAI): core ideas, techniques, and solutions. ACM Comput. Surv. 55(9), 1–33 (2023)
10. Friedman, J.H., Popescu, B.E.: Predictive learning via rule ensembles. Ann. Appl. Stat. 2, 916–954 (2008)
11. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
12. Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A survey of methods for explaining black box models. ACM Comput. Surv. 51(5), 1–42 (2018)
13. Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques, 3rd edn. Morgan Kaufmann (2012)
14. Henderson, J., et al.: Certifai: a toolkit for building trust in AI systems. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 5249–5251 (2021)
15. Liao, Q.V., Gruen, D., Miller, S.: Questioning the AI: informing design practices for explainable AI user experiences. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–15 (2020)
16. Liao, Q.V., Varshney, K.R.: Human-centered explainable AI (XAI): from algorithms to user experiences. arXiv preprint arXiv:2110.10790 (2021)
17. Lipton, Z.C.: The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery. Queue 16(3), 31–57 (2018)
18. Luo, C., et al.: RuleFit-based nomogram using inflammatory indicators for predicting survival in nasopharyngeal carcinoma, a bi-center study. J. Inflamm. Res. 15, 4803–4815 (2022)
19. Mahbooba, B., Timilsina, M., Sahal, R., Serrano, M.: Explainable artificial intelligence (XAI) to enhance trust management in intrusion detection systems using decision tree model. Complexity 2021, 1–11 (2021)
20. Mitrović, S., Andreoletti, D., Ayoub, O.: ChatGPT or human? Detect and explain. Explaining decisions of machine learning model for detecting short ChatGPT-generated text. arXiv preprint arXiv:2301.13852 (2023)
21. Molnar, C.: Interpretable Machine Learning. Lulu.com (2020)
22. Nalepa, G., Araszkiewicz, M., Nowaczyk, S., Bobek, S.: Building trust to AI systems through explainability: technical and legal perspectives (2019)
23. Nwakanma, C.I., et al.: Explainable artificial intelligence (XAI) for intrusion detection and mitigation in intelligent connected vehicles: a review. Appl. Sci. 13(3), 1252 (2023)
24. Panesar, A.: Machine Learning and AI for Healthcare. Springer (2019). https://doi.org/10.1007/978-1-4842-3799-1
25. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1, 81–106 (1986)
26. Reddy, B., Fields, R.: From past to present: a comprehensive technical review of rule-based expert systems from 1980–2021. In: Proceedings of the 2022 ACM Southeast Conference, pp. 167–172 (2022)
27. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?” Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)
28. Roth, A.M., Liang, J., Manocha, D.: XAI-N: sensor-based robot navigation using expert policies and decision trees. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2053–2060. IEEE (2021)
29. Samek, W., Wiegand, T., Müller, K.R.: Explainable artificial intelligence: understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296 (2017)
30. Schaaf, N., Huber, M., Maucher, J.: Enhancing decision tree based interpretation of deep neural networks through L1-orthogonal regularization. In: 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 42–49. IEEE (2019)
31. Sharma, S., Henderson, J., Ghosh, J.: Certifai: counterfactual explanations for robustness, transparency, interpretability, and fairness of artificial intelligence models. arXiv preprint arXiv:1905.07857 (2019)
32. Szczepański, M., Choraś, M., Pawlicki, M., Pawlicka, A.: The methods and approaches of explainable artificial intelligence. In: Paszynski, M., Kranzlmüller, D., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds.) Computational Science – ICCS 2021. LNCS, vol. 12745, pp. 3–17. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-77970-2_1
33. Szczepański, M., Pawlicki, M., Kozik, R., Choraś, M.: New explainability method for BERT-based model in fake news detection. Sci. Rep. 11(1), 23705 (2021)
34. Van der Velden, B.H., Kuijf, H.J., Gilhuijs, K.G., Viergever, M.A.: Explainable artificial intelligence (XAI) in deep learning-based medical image analysis. Med. Image Anal. 79, 102470 (2022)
35. Vouros, G.A.: Explainable deep reinforcement learning: state of the art and challenges. ACM Comput. Surv. 55(5), 1–39 (2022)
36. Zhang, Z., Hamadi, H.A., Damiani, E., Yeun, C.Y., Taher, F.: Explainable artificial intelligence applications in cyber security: state-of-the-art in research. arXiv preprint arXiv:2208.14937 (2022)
MSAM: Cross-Domain Recommendation Based on Multi-Layer Self-Attentive Mechanism XiaoBing Song, JiaYu Bao, Yicheng Di, and Yuan Li(B) Jiangnan University, Wuxi, Jiangsu, China [email protected]
Abstract. In recent years, recommendation systems have been extensively implemented across multiple platforms. They can extract useful information from vast amounts of data and recommend appropriate products to users based on their preferences. Typically, recommendation systems are plagued by data scarcity and cold start issues, which pose a serious challenge. To solve these problems, cross-domain recommendation (CDR) has received a lot of attention. Typically, CDR aims to leverage data from other domains, or to map user preferences from other domains to the target domain, to improve recommendation quality and alleviate the data sparsity and cold start problems of new services. However, the majority of existing cross-domain recommendation methods are based on matrix decomposition, which can only learn shallow linear features and cannot adequately address the challenges associated with cross-domain recommendation. Therefore, this paper proposes a multi-layer self-attentive mechanism (MSAM) cross-domain recommendation method that makes more accurate predictions by fusing and passing information between different domains. The framework is composed primarily of a feature extraction layer, a multilayer perceptron, and a feature fusion layer, which discover the latent factors of users and items and fuse the user latent factors from various domains. In addition, we employ the Wasserstein self-attentive mechanism and the multi-headed self-attentive mechanism in the feature extraction layer and the feature fusion layer, respectively, to better extract key user features and learn the affinity of user latent factors across domains. We conducted multiple experimental validations on two real-world datasets to demonstrate the model’s efficacy and superiority. Keywords: Cross-domain recommendation · MLP · Self-attentive mechanism
1 Introduction

With the emergence of the knowledge age, it has become imperative to extract useful information from vast quantities of data in order to assist users in navigating the vast array of options. As a result, recommendation systems [14] are in widespread use on numerous online platforms. Cross-domain recommender systems are more complex than general recommender systems. In a traditional recommendation system, we only need to consider building a recommendation model in the current domain for analysis, whereas in cross-domain recommendation, we need to focus on what information to choose for © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 319–332, 2023. https://doi.org/10.1007/978-981-99-4752-2_27
migration and how to migrate this information between different domains, which are very critical aspects of cross-domain recommendation systems. Consequently, cross-domain recommendation models can be categorized based on the kinds of information migrated and the migration methodologies, as well as the types of overlapping information between domains. In recent years, cross-domain recommendation systems have received extensive attention [11, 12]. Users can have historical interactions in one domain (i.e., the source domain) but not necessarily in the other (i.e., the target domain), and these users are designated “cold start users” in the target domain. However, since the source and target domains are related, the source domain’s feedback might be used to formulate important suggestions for the target domain. The primary objective of cross-domain recommendation is to transfer user preferences across two related domains. To accomplish the mapping, existing methods such as EMCRD [12], CDLFM [13], and RC-DFM [2] encode user preferences as individual vectors and map them as a whole across domains. As illustrated in Fig. 1, existing solutions discover user/item representations in the source and target domains, respectively, and then discover cross-domain representation mappings based on overlapping users. Importantly, a direct mapping between source and target domain user representations cannot explicitly convey the diverse, fine-grained views of users across domains. We propose a multi-layer self-attentive mechanism (MSAM) for cross-domain recommendation in this paper. Initially, a feature extraction layer extracts user and item features of the various domains from the original rating matrix of the respective domain. The extracted features are then fed into a multilayer perceptron (MLP) to learn user and item latent factor vectors for the various domains.
The user latent factor vectors from the various domains are then combined into a single user latent factor vector by the attention mechanism. The final step is to generate the predicted ratings based on the user and item latent factor vectors. We employ the Wasserstein self-attention mechanism in the feature extraction layer and the multi-headed self-attention mechanism in the feature fusion layer, respectively, for improved extraction of important user features and to learn the affinity of user latent factors between different domains at a multifaceted level, and the MSAMCDR model is proposed. In Sect. 3, the process of implementing the model is described in detail and the benefits of using each module of the model are examined. In Sect. 4, experiments are conducted with multiple datasets and ablation experiments are conducted to demonstrate the feasibility and robustness of our model. The following are the primary contributions of this paper:

(1) We propose an MSAM cross-domain recommendation method capable of passing features between domains for more accurate prediction.
(2) Using the Wasserstein self-attention mechanism, we extract user and item features for the various domains from the original rating matrix of the corresponding domain, which captures the relationships between the interacting users and items. We then employ an MLP to learn user and item latent factors in a deep and non-linear manner.
(3) Using a self-attentive mechanism, we fuse user latent factors from the various domains. In addition, the MSAMCDR model is constructed using a multi-headed self-attentive mechanism to learn the connections between user latent factors in distinct domains.
(4) We compare the proposed algorithm with existing cross-domain recommendation methods, and the experimental findings demonstrate that the proposed method is substantially superior to existing methods.
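The pipeline outlined above — per-domain user latent factors, attention-based fusion, then score prediction — can be sketched as follows. The single-query attention over two domain vectors and the inner-product scoring are simplifying assumptions for illustration, not the exact MSAM formulation; all dimensions are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d = 8
rng = np.random.default_rng(0)
u_domains = rng.normal(size=(2, d))   # one user's latent vector per source domain
v_item = rng.normal(size=d)           # target-domain item latent vector

# Attention-based fusion: weight each domain's user vector by its affinity
# with the target item, then combine into a single user representation.
alpha = softmax(u_domains @ v_item / np.sqrt(d))
u_fused = alpha @ u_domains

# Predicted rating as the inner product of fused user and item factors.
r_hat = float(u_fused @ v_item)
print(alpha, r_hat)
```

The fusion weights sum to one, so the fused vector stays in the convex hull of the per-domain representations; a learned multi-head version replaces the single affinity score with several learned projections.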
Fig. 1. Current workflow in classic cold-start user cross-domain recommendation
2 Related Work

2.1 Self-Attentive Mechanism

The self-attention mechanism has been utilized extensively in the field of computer science, particularly for tasks such as machine translation, text summarization, and sentence classification. In recommendation systems [15, 16], the application of the self-attentive mechanism can effectively mitigate the cold start and long-tail problems in recommendation. Using the self-attention mechanism, we can map the user’s historical behavior sequence and the feature vectors of items into a common vector space. This has the advantage of reducing the number of operations while better reflecting the relationships between them. When modeling, we can use the self-attentive mechanism as part of the model and obtain the appropriate parameters through training. This enables personalized recommendations in recommender systems, thus increasing user satisfaction and business value. When using the self-attention mechanism, we also need to optimize and tune the model to get better performance and results. The general formula is as follows:

SA(Q, K, V) = softmax(QK^T / √d_k) V   (1)

where Q, K, V denote the query vectors, key vectors, and value vectors of the input sequence, and d_k represents the dimensionality of the key vectors.
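Equation (1) can be implemented directly; the minimal NumPy sketch below computes scaled dot-product self-attention for a toy sequence (the learned linear projections producing Q, K, V are omitted, and the sequence is random illustrative data).

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention, Eq. (1): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

# In self-attention, Q, K and V all come from the same input sequence
# (here 4 items embedded in d_k = 8 dimensions, projections omitted).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, weights = self_attention(X, X, X)
print(out.shape, weights.shape)  # (4, 8) (4, 4)
```

Each output row is a weighted mixture of all value vectors, with the weights in each row forming a probability distribution over the sequence positions.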
2.2 Cross-Domain Recommendation

Cross-domain recommendation seeks to address the issue of recommending information from one domain to another, and in recent years a steady stream of scholars has been devoted to it. With a neural network-based model, Hu et al. [1] proposed a new migration learning method for cross-domain recommendation; however, the interpretability of the model is low, and it is difficult to comprehend and articulate how the model makes decisions and recommendations. Di et al. [10] proposed the use of meta-learning methods to address the cross-domain recommendation problem; however, the use of a federated architecture reduces the efficacy of the model’s recommendations. Fu et al. [2] proposed a deep fusion model based on reviews and content, called RC-DFM, for cross-domain recommendation; it motivates us to use auxiliary information and source information for recommendation in our paper. Meanwhile, Chen et al. [3] consider a more practical scenario for performing cross-domain recommendation. Influenced by the notion of domain adaptation, Feng et al. [4] propose a deep domain adaptation model (DARec) that can extract and transfer patterns from the rating matrix alone, requiring no additional information. Kang et al. [5] proposed SSCDR, a novel CDR framework based on semi-supervised mapping that is effective in learning cross-domain relationships even when few labeled data are available. For this reason, Zhu et al. [6] proposed the new bi-objective cross-domain recommendation framework DTCDR. However, such formulations for cross-domain transfer are typically more complex than a unified approach and must be carefully optimized. Cheng et al. [7] proposed the Preference Propagation GraphNet (PPGN) to address the aforementioned issue. To this aim, Zhu et al. [8, 9] proposed a deep framework dubbed DCDCSR for cross-domain and cross-system recommendations, based on matrix factorization (MF) models and fully connected deep neural networks (DNNs).
In addition, a thorough analysis of extant CDR approaches, including obstacles, research progress, and future prospects, is presented in [9].

2.3 Symbols and Problem

Assume that, in a cross-domain recommendation task, there are K domains with the same total number of users but with distinct item sets. There is a rating matrix W^k ∈ R^{M×N^k} between the M users and the N^k items in the k-th domain, where each rating lies in the range 0 to 5: W^k(m, n) = 5 means that in the k-th domain user m has rated item n with a score of 5, and W^k(m, n) = 0 means that the user has not rated the item. Here U^k = [u_1^k, ..., u_M^k] = W^k and V^k = [v_1^k, ..., v_{N^k}^k] = (W^k)^T are the user rating matrix and the item rating matrix of the k-th domain, respectively, where the rating vector u_m^k represents the rating relationship between user m and the items in the k-th domain, and the item rating vector v_n^k represents the rating relationship between item n and the users in the k-th domain. Finally, W^k is predicted by the predictive rating matrix Ŵ^k ∈ R^{M×N^k}.

The most challenging aspect of cross-domain recommendation tasks is extracting potential factors from different domains and fusing them into the target domain. The majority of extant cross-domain recommendation methods concentrate on extracting public information from diverse domains. However, it is difficult to construct connections between unrelated domains. Therefore, we can take into account the potential
MSAM: Cross-Domain Recommendation
323
factors extracted from different domains to construct a multi-layer self-attentive mechanism for cross-domain recommendation, with the objective of predicting users' ratings for unrated items and recommending highly rated items to them.
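To make the notation above concrete, here is a small NumPy sketch with toy sizes and synthetic ratings; all variable names and values are our illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N_k = 4, 6                                          # M users, N^k items in domain k
W_k = rng.integers(0, 6, size=(M, N_k)).astype(float)  # ratings in [0, 5]; 0 = unrated

U_k = W_k        # user rating matrix: row m is the rating vector u_m^k
V_k = W_k.T      # item rating matrix: row n is the rating vector v_n^k

u_m = U_k[2]     # rating vector of user m = 2 over all items in domain k
v_n = V_k[5]     # rating vector of item n = 5 over all users in domain k
observed = W_k > 0   # mask of rated entries; the model predicts the remaining ones
```

The prediction task is then to fill in `W_k` where `observed` is false, producing the matrix Ŵ^k.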
Fig. 2. Cross-domain recommendation model with multi-layer self-attentive mechanism
3 Model

In this chapter, we use a cross-domain recommendation model based on a multi-layer self-attentive mechanism to address the cross-domain problem, as shown in Fig. 2. First, the Wasserstein self-attentive mechanism is used to extract user and item features from the original rating matrices of different domains; then, using an MLP to learn user and item potential factors from the encoding results, the user potential factor vectors from different domains are fused into a hypothetical factor vector by the attention layer; and lastly, the predicted scores are derived from the user interest factor vector and the item interest factor vector.

3.1 Wasserstein Self-Attention Layer

Instead of one-hot vectors, the original rating matrix is used as input in this article. Each row (resp. column) of the rating matrix represents the rating information of a user (resp. item), so analogous users (resp. items) will have comparable inputs. The user rating matrix U^k is initially transmitted to the encoder using the following encoding formula:

U_L^k = σ(A_L^E · U_{L−1}^k + B_L^E)    (2)

where A_1^E, ..., A_L^E and B_1^E, ..., B_L^E are the encoder's weight matrices and bias matrices, respectively, L represents the total number of encoding layers, U_0^k = U^k, and σ(·) is the ReLU activation function.
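The stacked encoder of Eq. (2) can be sketched as follows; the layer count and dimensions are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def encode(U, weights, biases):
    """Stacked encoder of Eq. (2): U_l = relu(A_l^E @ U_{l-1} + B_l^E)."""
    h = U
    for A, B in zip(weights, biases):
        h = relu(A @ h + B)
    return h

rng = np.random.default_rng(1)
M, N = 4, 6
U0 = rng.random((M, N))                                 # U_0^k = U^k
# two encoding layers (L = 2); square weight matrices keep the sketch simple
A1, A2 = rng.standard_normal((M, M)), rng.standard_normal((M, M))
B1, B2 = np.zeros((M, N)), np.zeros((M, N))
UL = encode(U0, [A1, A2], [B1, B2])
```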
Liu et al. [17] suggested using the Wasserstein self-attentive layer to extract user and item attributes from each domain, thereby increasing the accuracy of the extracted representations and allowing analogous users (respectively, items) to have similar extraction results. Consequently, more of the essential original collaborative relationships between users and items can be captured, as follows:

U_a^k = WSA(U_L^k; θ)    (3)

where U_a^k ∈ R^{M×n} is the result of passing U_L^k through the Wasserstein self-attention layer, and WSA(·) is the Wasserstein self-attention function with dimension-related parameter θ. The features extracted for the items, V_a^k ∈ R^{N^k×m}, are obtained similarly.
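As a simplified illustration of the idea behind Wasserstein self-attention [17] (not the authors' exact formulation): embeddings are treated as diagonal Gaussians, pairs are scored by their squared 2-Wasserstein distance, and closer distributions receive higher attention weight:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def wasserstein_self_attention(mu, sigma):
    """Each row i is a Gaussian embedding N(mu[i], diag(sigma[i]^2)).
    For diagonal Gaussians, W2^2 = ||mu_i - mu_j||^2 + ||sigma_i - sigma_j||^2;
    attention scores are the negative squared distances."""
    d_mu = ((mu[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    d_sig = ((sigma[:, None, :] - sigma[None, :, :]) ** 2).sum(-1)
    attn = softmax(-(d_mu + d_sig), axis=-1)   # closer distributions weigh more
    return attn @ mu                           # aggregate the mean embeddings

rng = np.random.default_rng(2)
mu = rng.standard_normal((5, 8))               # 5 users, 8-dimensional embeddings
sigma = np.abs(rng.standard_normal((5, 8))) + 1e-3
out = wasserstein_self_attention(mu, sigma)
```

Because the score is a distance between distributions rather than a dot product, similar users attend strongly to one another, which is the property the paper relies on.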
3.2 Multi-Layer Perceptron Learning Potential Factors

To learn user and item hidden variables from the retrieved features, the encoding results are fed into an MLP, which operates as follows to derive the matrix of user latent factors in the k-th domain:

X_L^k = σ(A_L^X · X_{L−1}^k + B_L^X)    (4)

where A_1^X, ..., A_L^X and B_1^X, ..., B_L^X are the MLP's weight matrices and bias matrices, respectively, L is the number of MLP layers, X_0^k = U_a^k, σ(·) is the ReLU activation function, and the output X^k is the MLP-learned user potential factor matrix for the k-th domain. Similarly, the item potential factor matrix Y^k can be obtained.

3.3 Multi-Headed Self-Attentive Layer

The input sequence is divided into multiple heads, each head independently calculates its self-attention weights, and the results calculated by the different heads are then concatenated to form the final feature representation. Multi-head self-attention can better handle long and complex sequences and enhance the model's performance and generalizability, because each head can focus on different input components. In this paper, we use H parallel attention layers to capture the affinity of user potential factors between the various domains at multiple levels. The vector of user potential factors is divided into H vectors of equal length, so that U_m^p = [u_1^p, ..., u_h^p, ..., u_H^p] and U_m^q = [u_1^q, ..., u_h^q, ..., u_H^q] represent the potential factors of user m in the p-th and q-th domains, respectively. Thus, for each pair of domains of each user we can obtain H self-attentive matrices Φ = [A_1, ..., A_H], where A_h = [A_h^{p,q}] ∈ R^{K×K}, calculated as follows:

A_h^{p,q} = (X_h^p · X_h^{qT}) / √d    (5)
where d is the scaling factor. The head-h representation of user m in the p-th domain can then be computed as follows:

g_h^p = Σ_{q=1}^{K} [ exp(A_h^{p,q}) / Σ_{k=1}^{K} exp(A_h^{p,k}) ] · X_h^q    (6)

The final potential factor vector of user m is obtained with the following equation:

G_m = Σ_{k=1}^{K} [g_1^k, ..., g_h^k, ..., g_H^k]    (7)
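Eqs. (5)–(7) for a single user m can be sketched as follows; the tensor layout and the einsum formulation are ours, and the sizes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_domain_fusion(X):
    """X[p, h] is head h of user m's latent factor in domain p.
    Returns g[p, h]: the fused head-h representation of domain p (Eq. 6)."""
    K, H, d = X.shape
    # Eq. (5): A[h, p, q] = X_h^p . X_h^q / sqrt(d)
    A = np.einsum('phd,qhd->hpq', X, X) / np.sqrt(d)
    W = softmax(A, axis=-1)                   # normalize attention over domains q
    g = np.einsum('hpq,qhd->phd', W, X)       # Eq. (6): weighted sum of X_h^q
    return g

rng = np.random.default_rng(3)
K, H, d = 3, 4, 8                             # 3 domains, 4 heads, head size 8
X = rng.standard_normal((K, H, d))
g = multihead_domain_fusion(X)
# Eq. (7): concatenate the heads of each domain, then sum over domains
G_m = g.reshape(K, H * d).sum(axis=0)
```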
3.4 Projections and Losses

Finally, the predicted rating between user m and item n in the k-th domain can be determined. The equation is as follows:

Ŵ^k(m, n) = G_m · (Y_{n,:}^k)^T    (8)

where Y_{n,:}^k denotes the n-th row vector of the matrix Y^k, which represents the potential factors of item n in the k-th domain obtained from the MLP.

Given that our goal is to estimate users' ratings of unrated items, the objective is to minimize the loss between the predicted rating matrix Ŵ^k and the source rating matrix W^k, as follows:

min L = (1/n) Σ_{k=1}^{K} ||W^k − Ŵ^k||² + λ_1 ||θ||_1 + λ_2 ||θ||_2²    (9)
where λ_1 and λ_2 are the regularization factors and ||θ||_1 and ||θ||_2² are the L1 and L2 norms, respectively, used to control the strength of the regularization. By adding the regularization terms, the model can be made simpler and more robust, since optimizing the loss function then prefers smaller parameter values.

3.5 Model Analysis

In the data feature processing phase, we employ the Wasserstein self-attention mechanism so that the extracted representations are more accurate, thereby improving the model as a whole. Different user scenarios and item characteristics can be viewed as different feature representations in cross-domain recommendation. Applying the Wasserstein self-attention mechanism consists primarily of two steps: feature mapping and self-attention computation. Feature mapping maps the feature representations of various items into a common hidden space, whereas the self-attention computation uses the Wasserstein distance to measure the similarity and correlation between different features to obtain the feature representations. The Wasserstein self-attentive mechanism
has the following advantages: it can handle complex interaction information between different items; it can learn the weights between features adaptively, so as to represent the correlations between different items more accurately; and it can effectively manage high-dimensional and sparse data by measuring the similarity and distance between features. The multi-headed self-attentive mechanism used in the data fusion stage is a relatively advanced self-attentive mechanism, and its effectiveness in recommendation algorithms has been widely studied and verified. Each attention head can learn different features, which improves the model's ability to represent different semantic information. Specifically, the multi-head self-attentive mechanism can learn the correlations between different features and thus improve the feature representation. In comparison, other attention mechanisms, such as soft attention mechanisms, have some disadvantages in recommendation algorithms: soft attention mechanisms must calculate attention weights based on past user behavior, so significant changes in user actions may render the recommendation effect unstable. A multi-headed self-attention mechanism can therefore capture the pertinent characteristics of users and items more accurately. Overall, the use of multiple self-attention layers in recommendation models can improve their performance, allowing them to capture various levels of semantic information more accurately and to enhance the feature representation. By integrating features from multiple levels across multiple self-attentive layers, more complex feature representations can be learned, thereby enhancing the generalization ability of the model; the interaction details between users and items can also be learned effectively, increasing the accuracy of recommendation in recommendation systems.
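To make Sect. 3.4 concrete, here is a minimal NumPy sketch of the prediction rule of Eq. (8) and the regularized objective of Eq. (9); the shapes, the normalization by the number of entries, and the hyper-parameter values are our illustrative assumptions:

```python
import numpy as np

def predict_ratings(G, Y):
    """Eq. (8): the predicted rating of user m for item n is G_m . Y_{n,:}^T."""
    return G @ Y.T                       # (M, d) x (d, N) -> (M, N)

def loss(W_list, W_hat_list, theta, lam1=1e-4, lam2=1e-4):
    """Eq. (9): squared error summed over all K domains plus L1/L2 regularization."""
    err = sum(((W - W_hat) ** 2).sum() for W, W_hat in zip(W_list, W_hat_list))
    n = sum(W.size for W in W_list)      # normalization constant (our assumption)
    return err / n + lam1 * np.abs(theta).sum() + lam2 * (theta ** 2).sum()

rng = np.random.default_rng(4)
G = rng.standard_normal((4, 8))          # user potential factors G_m
Y = rng.standard_normal((6, 8))          # item potential factors Y^k
W_hat = predict_ratings(G, Y)
W = rng.integers(0, 6, size=(4, 6)).astype(float)
theta = rng.standard_normal(20)          # stand-in for all model parameters
L = loss([W], [W_hat], theta)
```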
4 Experiment

Before starting the experiments we introduce the evaluation metrics. The validity of the rating predictions is assessed using two metrics, the mean absolute error (MAE) and the root mean square error (RMSE), defined as follows:

MAE = (1/T) Σ_{k,m,n} |W^k(m, n) − Ŵ^k(m, n)|    (10)

RMSE = ( (1/T) Σ_{k,m,n} (W^k(m, n) − Ŵ^k(m, n))² )^{1/2}    (11)
where T represents the number of test ratings.

4.1 Dataset

In the experimental section of this paper, we use three real-world datasets to validate the proposed model, described in detail below:

(1) Amazon: This dataset comprises Amazon product reviews and metadata; since we only need rating data, we downloaded the ratings-only version. Since the dataset is too
large, more relevant information must be extracted from it: we first remove users with fewer than 4 ratings, and we randomly sample 20% of the items in each domain.

(2) MovieLens: A dataset of complete data describing the 5-star ratings of the movie recommendation service MovieLens. The dataset is divided into multiple genres, of which we selected three for this experiment, namely Action (Motion), Adventure (Venture), and Comedy. We randomly sampled a portion of the data in each domain due to the volume of data.

(3) MeiTuan: A user evaluation dataset. Due to the lack of a public dataset, we processed the data by treating stores of the same type as one domain and the stores in it as items, constructing a new, smaller dataset.

Amazon and MovieLens are two traditional datasets used in cross-domain recommendation, and seven comparative cross-domain tasks are defined over the three datasets, with 80% of the data arbitrarily allocated to the training set and 20% to the test set: Task 1: Book-Movie; Task 2: Book-Phone; Task 3: Movie-Phone; Task 4: Motion-Venture; Task 5: Motion-Comedy; Task 6: Venture-Comedy; Task 7: FastFood-Pasta. Details are provided in Table 1:

Table 1. The statistical characteristics of the three datasets.

Datasets  | Domains  | User # | Item # | Rating #
----------|----------|--------|--------|---------
Amazon    | Book     | 6328   | 49877  | 63245
          | Movie    | 6328   | 70833  | 95246
          | Phone    | 6328   | 15323  | 47525
MovieLens | Motion   | 5664   | 6984   | 99425
          | Venture  | 5664   | 8983   | 353245
          | Comedy   | 5664   | 7452   | 83785
MeiTuan   | FastFood | 526    | 158    | 10325
          | Pasta    | 526    | 132    | 9784
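The experimental protocol above (a random 80/20 train/test split, evaluated with the MAE and RMSE of Eqs. (10) and (11)) can be sketched as follows; the rating triples and the constant predictor are purely illustrative:

```python
import numpy as np

def split_ratings(ratings, train_frac=0.8, seed=0):
    """Randomly assign 80% of the rating records to training and 20% to test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(ratings))
    cut = int(train_frac * len(ratings))
    return [ratings[i] for i in idx[:cut]], [ratings[i] for i in idx[cut:]]

def mae(true, pred):
    """Eq. (10): mean absolute error over the T test ratings."""
    true, pred = np.asarray(true), np.asarray(pred)
    return np.abs(true - pred).mean()

def rmse(true, pred):
    """Eq. (11): root mean square error over the T test ratings."""
    true, pred = np.asarray(true), np.asarray(pred)
    return np.sqrt(((true - pred) ** 2).mean())

# hypothetical (user, item, rating) triples standing in for one domain
ratings = [(m, n, float((m + n) % 5 + 1)) for m in range(10) for n in range(10)]
train, test = split_ratings(ratings)

true = [r for (_, _, r) in test]
pred = [3.0] * len(test)    # a trivial constant predictor, for illustration only
print(round(mae(true, pred), 3), round(rmse(true, pred), 3))
```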
4.2 Contrast Baseline

For comparison, we have chosen some of the more typical baselines from current cross-domain recommendation:

• CMF [19]: By integrating user rating vectors from various fields and sharing overlapping user embeddings across domains, CMF facilitates the integration of knowledge across domains.

• DCDCSR [8]: This model builds on the concept of EMCDR, but learns the mapping function from the target domain to a benchmark domain, where the benchmark domain is an embedding space combining both the source and target domains.
• DARec [4]: DARec extracts and transfers rating patterns from the rating matrix alone, relying on no additional data.

• MVDNN [21]: This is a multi-view deep learning technique for cross-domain user representation in recommender systems. Deep learning is used to map users and products to a latent space so as to maximize the similarity between users and their favorite items.

• CDRIB [18]: This model proposes a new framework aiming to learn unbiased representations of domain-invariant information based on the variational information bottleneck principle.

• PTUPCDR [20]: PTUPCDR uses a meta-learner to extract personalized preference bridges from users' interaction records in the source domain. Using these personalized bridges, the source-domain embeddings are then mapped to the target domain.

4.3 Results and Discussion

For the six baselines listed above, we either used the default settings or tuned the most relevant parameters of each approach according to the original authors' recommendations. The parameter settings of our proposed method were optimized through experimentation: the sizes of the MLP layers were selected as [512, 256, 128, 64], the batch size was set to 500, and the number of iterations was set to 50. Table 2 displays the outcomes of the experiments thus designed.

Table 2. Comparative results of MAE and RMSE for the six baseline methods and our model (best results are bolded)

Task  | Metric | CMF  | DCDCSR | DARec | MVDNN | CDRIB | PTUPCDR | Ours
------|--------|------|--------|-------|-------|-------|---------|-----
Task1 | MAE    | 1.66 | 1.48   | 1.67  | 1.63  | 1.34  | 1.15    | 1.07
      | RMSE   | 2.18 | 1.83   | 1.89  | 1.84  | 1.46  | 1.55    | 1.39
Task2 | MAE    | 1.42 | 1.23   | 1.45  | 1.39  | 1.17  | 0.91    | 0.97
      | RMSE   | 1.84 | 1.65   | 1.97  | 1.70  | 1.52  | 1.22    | 1.03
Task3 | MAE    | 1.89 | 1.68   | 1.53  | 1.55  | 1.53  | 1.23    | 1.18
      | RMSE   | 2.14 | 1.87   | 1.75  | 1.82  | 1.75  | 1.68    | 1.56
Task4 | MAE    | 1.43 | 1.26   | 0.95  | 0.87  | 1.05  | 0.89    | 0.74
      | RMSE   | 1.65 | 1.43   | 1.21  | 1.05  | 1.22  | 1.14    | 1.06
Task5 | MAE    | 2.33 | 1.27   | 1.14  | 0.87  | 1.11  | 0.85    | 0.81
      | RMSE   | 2.73 | 1.59   | 1.45  | 1.19  | 1.38  | 1.14    | 1.08
Task6 | MAE    | 2.32 | 1.26   | 1.06  | 1.15  | 1.24  | 0.93    | 0.96
      | RMSE   | 3.27 | 1.84   | 1.73  | 1.68  | 1.85  | 1.57    | 1.62
Task7 | MAE    | 3.13 | 2.31   | 2.25  | 2.17  | 2.46  | 2.11    | 1.78
      | RMSE   | 3.87 | 3.13   | 3.16  | 3.21  | 3.25  | 2.56    | 2.26
Observing the comparative MAE and RMSE results of the six baseline methods, the table demonstrates that the approach we propose outperforms the other comparable methods, obtaining lower MAE and RMSE values in the majority of the dataset's domains. However, our model performs less well than the PTUPCDR model in Task 6, ranking only second. Moreover, these experiments do not by themselves show which part of the model contributes most to the results, so further experiments are needed for clarification.

4.4 Ablation Experiment

From the experimental results alone, we cannot be sure whether the added components significantly improve the model. Here, we replace the Wasserstein self-attentive mechanism and the multi-headed self-attentive mechanism of the model with the standard self-attentive mechanism in order to evaluate the effect of these two components on the model as a whole. We consider four model variants:

• MSAM-base: as the base model, we use only two ordinary self-attentive mechanisms to build the model and observe its effectiveness on the cross-domain problem.

• MSAM-W: add the Wasserstein distance to the first self-attentive mechanism to judge its effect on the model.

• MSAM-H: apply the multi-headed self-attentive mechanism to the latter self-attentive mechanism, in order to better capture the user relationships between the various domains and observe its effect on the model.

• MSAM-FNN: a feedforward neural network is employed in place of the multi-layer perceptron for comparison, while all other components remain unchanged.

Figures 3 and 4 depict the outcomes of the ablation study, for which all settings are identical to the original experimental setup. In each ablation task, we discovered that
Fig. 3. Performance of the four model variants in MAE
each component of the model contributes significantly to the model’s aggregate effect, demonstrating the efficacy and robustness of our model construction.
Fig. 4. Performance of the four model variants in RMSE
5 Conclusion

We put forward a cross-domain recommendation technique centered on a multi-layer self-attentive mechanism (MSAM) that seeks to improve prediction accuracy by integrating and transferring information across domains. Through a feature extraction layer, a multi-layer perceptron, and a feature fusion layer, the potential characteristics of users and items can be learned, and the potential factors of users from various domains can be combined. In the feature extraction and fusion layers, Wasserstein and multi-head self-attention mechanisms are utilized. The Wasserstein self-attention mechanism helps the model extract essential user characteristics and capture user behavior patterns across domains. The multi-headed self-attention mechanism helps the model learn the affinity of user potential factors across domains, in order to better understand and manage user behavior in cross-domain recommendation tasks. Our framework not only learns the potential factors of users and items, but also enhances the precision and efficacy of cross-domain recommendation by integrating user potential factors from different domains. Nonetheless, the model has flaws: the use of multiple self-attention layers increases model complexity, thereby increasing computational cost and decreasing interpretability. In the future, we will attempt to use other techniques, such as knowledge distillation, to address the issues of computational complexity, overfitting, and interpretability that arise when employing the attention mechanism multiple times, and to further enhance the model's performance.
References

1. Hu, G., Zhang, Y., Yang, Q.: CoNet: collaborative cross networks for cross-domain recommendation. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 667–676 (2018)
2. Fu, W., Peng, Z., Wang, S., et al.: Deeply fusing reviews and contents for cold start users in cross-domain recommendation systems. Proceedings of the AAAI Conference on Artificial Intelligence 33(01), 94–101 (2019)
3. Gao, C., Chen, X., Feng, F., et al.: Cross-domain recommendation without sharing user-relevant data. In: The World Wide Web Conference, pp. 491–502 (2019)
4. Yuan, F., Yao, L., Benatallah, B.: DARec: deep domain adaptation for cross-domain recommendation via transferring rating patterns. arXiv preprint arXiv:1905.10760 (2019)
5. Kang, S.K., Hwang, J., Lee, D., et al.: Semi-supervised learning for cross-domain recommendation to cold-start users. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1563–1572 (2019)
6. Zhu, F., Chen, C., Wang, Y., et al.: DTCDR: a framework for dual-target cross-domain recommendation. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1533–1542 (2019)
7. Zhao, C., Li, C., Fu, C.: Cross-domain recommendation via preference propagation GraphNet. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2165–2168 (2019)
8. Zhu, F., Wang, Y., Chen, C., et al.: A deep framework for cross-domain and cross-system recommendations. arXiv preprint arXiv:2009.06215 (2020)
9. Zhu, F., Wang, Y., Chen, C., et al.: Cross-domain recommendation: challenges, progress, and prospects. arXiv preprint arXiv:2103.01696 (2021)
10. Di, Y., Liu, Y.: MFPCDR: a meta-learning-based model for federated personalized cross-domain recommendation. Appl. Sci. 13(7), 4407 (2023)
11. Su, X., Khoshgoftaar, T.M.: A survey of collaborative filtering techniques. Advances in Artificial Intelligence 2009 (2009)
12. Man, T., Shen, H., Jin, X., et al.: Cross-domain recommendation: an embedding and mapping approach. In: IJCAI, vol. 17, pp. 2464–2470 (2017)
13. Wang, X.: CDLFM: cross-domain recommendation for cold-start users via latent feature mapping. Knowl. Inf. Syst. 62(5), 1723–1750 (2019). https://doi.org/10.1007/s10115-019-01396-5
14. Huang, L., Zhao, Z.L., Wang, C.D.: LSCD: low-rank and sparse cross-domain recommendation. Neurocomputing 366, 86–96 (2019)
15. Zhang, T., Zhao, P., Liu, Y., et al.: Feature-level deeper self-attention network for sequential recommendation. In: IJCAI, pp. 4320–4326 (2019)
16. Zhong, S.T., Huang, L., Wang, C.D.: An autoencoder framework with attention mechanism for cross-domain recommendation. IEEE Trans. Cybern. 52(6), 5229–5241 (2020)
17. Fan, Z., Liu, Z., Wang, Y., et al.: Sequential recommendation via stochastic self-attention. In: Proceedings of the ACM Web Conference 2022, pp. 2036–2047 (2022)
18. Cao, J., Sheng, J., Cong, X., et al.: Cross-domain recommendation to cold-start users via variational information bottleneck. In: 2022 IEEE 38th International Conference on Data Engineering (ICDE), pp. 2209–2223. IEEE (2022)
19. Singh, A.P., Gordon, G.J.: Relational learning via collective matrix factorization. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 650–658 (2008)
20. Zhu, Y., Tang, Z., Liu, Y., et al.: Personalized transfer of user preferences for cross-domain recommendation. In: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pp. 1507–1515 (2022)
21. Elkahky, A.M., Song, Y., He, X.: A multi-view deep learning approach for cross domain user modeling in recommendation systems. In: Proceedings of the 24th International Conference on World Wide Web, pp. 278–288 (2015) 22. Zhao, C., Li, C., Xiao, R., et al.: CATN: cross-domain recommendation for cold-start users via aspect transfer network. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 229–238 (2020)
TDRConv: Exploring the Trade-off Between Feature Diversity and Redundancy for a Compact CNN Module Haigen Hu , Deming Zhou, Hui Xu, Qi Chen, Qiu Guan , and Qianwei Zhou(B) College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310014, Zhejiang, China [email protected]
Abstract. Rich or even redundant features, without losing the diversity of feature maps, can undoubtedly help to improve network performance. In this work, we propose a compact CNN module, namely TDRConv, by exploring the trade-off between feature diversity and redundancy, to retain and generate features with moderate redundancy and rich diversity while requiring less computation. Specifically, the input features are split into a main part and an expansion part by a certain proportion, where the main part extracts intrinsic and diverse features in different ways, while the expansion part enhances the ability to extract diverse information. Finally, a series of experiments are conducted to verify the effectiveness of the proposed TDRConv on CIFAR10 and ImageNet. The results show that networks equipped with TDRConv can outperform the state-of-the-art methods in terms of accuracy, but with significantly lower FLOPs and fewer parameters. More importantly, the proposed TDRConv can readily replace existing convolution modules as a plug-and-play component, and it is promising for extending CNNs to wider scenarios. Keywords: Compact CNN Module · Model Compression · Feature Diversity
1 Introduction

Redundant or similar features can help to speed up convergence and reduce the empirical error, but carry a higher risk of high computational complexity and overfitting in neural networks [1, 2]. Previous studies [3, 4] have shown that over-sized deep neural network models easily lead to a heavy computing burden and a high level of overfitting, owing to their dependence on many redundant features that can be either shifted versions of each other or very similar with little or no variation. To address this issue, common approaches mostly focus on reducing the effective number of parameters or constructing efficient operations by using model compression methods like transferred/compact convolutional filters [5–7] and regularization strategies including weight decay [8], Dropout [9] and DropConnect [10]. Most of them aim to remove redundant or correlated (i.e., similar) features. More specifically, to minimize the extraction of redundant features during training, it is necessary to inhibit the learning of redundant filters and strengthen © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 333–344, 2023. https://doi.org/10.1007/978-981-99-4752-2_28
334
H. Hu et al.
feature diversity. In fact, feature diversity generally tends to produce better generalization ability, and the existing literature [2, 11] has indicated that a decrease in feature diversity may degrade the performance of Deep Neural Networks (DNNs). However, there is an opposite view that rich or even redundant information can also contribute to a comprehensive understanding of the input data. For example, the Ghost module [13] was introduced to process the intrinsic features and generate redundant features using the Depthwise Convolutional Filter (DWC) [12] with fewer parameters, while maintaining a certain number of intrinsic features; it can effectively reduce the computing burden through such cheap operations. SPConv [14] provides another scheme for processing the redundancy of the feature maps: it uses a split-based convolutional operation, instead of directly removing uncertain redundant features, to tolerate features with similar patterns while requiring less computation. Meanwhile, Tied Block Convolution (TBC) [15] showed that excessive redundancy can not only reduce the actual capacity to process complex data, but also lead to inefficiency in capturing diverse information. Thus, it is necessary to keep some redundant features while preserving appropriate feature diversity as training progresses. However, the biggest challenge lies in how to find a trade-off between the redundancy and diversity of features. In this paper, a novel convolution module, namely TDRConv, is proposed to capture diverse information while maintaining the right amount of feature redundancy, and the optimal trade-off between redundancy and diversity can be found by regulating three hyper-parameters. Our contributions can be summarized as threefold: (1) We found that the intrinsic features generated by standard convolution operations suffer from a lack of information diversity.
(2) A new TDRConv module is proposed by exploring the trade-off between feature diversity and redundancy, to retain and generate features with moderate redundancy and rich diversity while requiring less computation. (3) The proposed TDRConv can readily replace the convolution modules of existing CNNs as a plug-and-play component.
2 Related Work

In recent years, designing compact convolutional filters has attracted broad attention owing to the drastic reductions in redundant parameters and FLOPs. These methods usually explore the dependencies across spatial and channel dimensions to optimize the structure of CNNs with large redundancy. In fact, it is a common practice to explore dependencies across channels in the community of convolutional filters, and several convolutional filters, including the Groupwise Convolutional Filter (GWC) [16], the Depthwise Convolutional Filter (DWC) [12] and Pointwise Convolution (PWC) [17], were proposed to reasonably reduce the redundant connections between channels. For example, in GWC [16], a filter is convolved with only a subset of the input features to reduce the redundant connections between channels. Thus, the parameters in a GWC layer can be reduced and adjusted by choosing the number of channel groups along the channel dimension. In DWC [12], each 2D kernel is convolved with a single channel of the input feature map to produce the corresponding output feature map. When the number of groups equals the number of channels in the input features, GWC becomes DWC; DWC is thus a special case of GWC, and both of them can greatly reduce model
TDRConv
335
redundancy by adjusting connection strategies across channels. PWC [17] can integrate information across channels by using a 1 × 1 convolution operator. The above three convolution modules are usually used to design deeper or more efficient network models without increasing the model parameters and FLOPs, such as MobileNet [18], ResNeXt [19], ShuffleNet [20] and GhostNet [13]. Although these approaches can achieve good comparable performance while maintaining a very small number of parameters, their features are limited to the intrinsic features generated by common convolutions and suffer from a lack of interaction between channels, thereby leading to substantial feature redundancy and a lack of information richness. To address this issue, our proposed TDRConv attempts to capture diverse information while maintaining the right amount of feature redundancy.

[Fig. 1 architecture diagram: the input feature maps X (C×W×H) are split into a main part X_M (αC×W×H), processed by the Feature Extraction Block (base branch: standard conv; diversity branch: 3×3 groupwise conv) followed by channel attention, and an expansion part X_E ((1−α)C×W×H), processed by the Feature Expansion Block (pointwise conv, then a 1×1 groupwise conv); the branch outputs are concatenated and added to form the output feature maps Y (C′×W′×H′). Legend: α — ratio of splitting; β — ratio of output; γ — times of expansion.]
Fig. 1. An architecture illustration of the proposed TDRConv. Firstly, the input feature maps are split into the main part and the expansion part by certain proportion α. The main part can respectively extract intrinsic and diverse features by using standard convolution of the base branch and groupwise convolution of diversity branch in the Feature Extraction Block, then the feature channels of the obtained maps are focused on by adopting channel attention after a concat operation. While the expansion part can enhance the extraction ability of diverse information by using pointwise convolution of the Feature Expansion Block, and complement the final output features by using add operation.
3 TDRConv: Proposed Module

The proposed TDRConv module consists of the Feature Extraction Block, the Feature Expansion Block and a channel attention block; the corresponding architecture is illustrated in Fig. 1. The input feature map is divided into two parts at a certain ratio in channel order, and the two parts are processed and fused along the two branches. Let the input feature maps be X ∈ R^{C×W×H}, where C, W and H are the number of input channels, the width and the height of the feature map, respectively. Suppose a hyper-parameter α is introduced as the split ratio; then the main part is denoted as X_M ∈ R^{αC×W×H}, while the expansion part is denoted as X_E ∈ R^{(1−α)C×W×H}.
3.1 Feature Extraction Block

A parallel structure is designed for the two functional branches (i.e., the base branch and the diversity branch), as shown in Fig. 1. The base branch utilizes a k × k standard convolution with relatively high computational complexity (here a 3 × 3 convolution) to extract the intrinsic features with certain redundancy, while the diversity branch uses GWC to capture diversity information. Compared with the DWC in GhostNet [13], GWC can achieve a relative balance between reducing the number of parameters and extracting channel interaction information. The output features generated by the base branch and the diversity branch are then concatenated along the channel dimension instead of being combined by the common add operation. In this way, the feature channels generated by each branch form only part of the final output channels rather than all of them, thereby reducing the number of parameters. For example, when the number of output channels is C′ and the base branch and the diversity branch each output C′/2 channels, the number of channels after concatenating the two parts is C′, whereas element-wise summation would require each part to have C′ channels. Let the total output features of the Feature Extraction Block be denoted as Y_M ∈ R^(C′×W′×H′), where C′, W′ and H′ are the number of total output channels, the width and the height of the output feature maps, respectively. The hyper-parameter β is introduced to represent the ratio of the output of the base branch, so the numbers of channels of the intrinsic features and the diversity features are βC′ and (1 − β)C′, respectively. The kernel sizes of the standard and group convolutions are both k × k in this work. For the base branch, the output Y_SC of the standard convolution (SC) is defined as:

$Y_{SC} = F_{conv}^{sc}(X_M) \in R^{\beta C' \times W' \times H'}$  (1)
where $F_{conv}^{sc}(\cdot)$ denotes the standard convolution operation. For the diversity branch, in addition to X_M, the input features are further expanded along the channel dimension by the Feature Expansion Block to improve the effect of obtaining diverse information. Thus the output of the GWC can be derived as:

$Y_{GWC} = F_{conv}^{gwc}(X_{GWC})$  (2)

where $F_{conv}^{gwc}(\cdot)$ denotes the GWC operation, and X_GWC is the input of the GWC, obtained by a concat operation, given as:

$X_{GWC} = X_M \oplus Y_E$  (3)
where ⊕ denotes the concat operation, and Y_E is the output of the Feature Expansion Block branch that expands the channels for the diversity branch. To make the number of groups g of the diversity branch increase layer by layer, g is correlated with the number of channels in each layer, and a Greatest Common Divisor grouping of channels is proposed, defined as:

$g = GCD(C_1, C_2)$  (4)

where GCD represents the Greatest Common Divisor function, and C_1 and C_2 represent the output channels of the base branch and the diversity branch, respectively. These two
values are chosen as C_1 and C_2 because group convolution requires that the number of groups exactly divide the number of its input channels, and that the number of output channels be a multiple of the number of groups. Therefore, the corresponding output Y_M of the Feature Extraction Block can be derived as:

$Y_M = Y_{SC} \oplus Y_{GWC} = F_{conv}^{sc}(X_M) \oplus F_{conv}^{gwc}(X_M \oplus Y_E)$  (5)
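The grouping rule of Eq. (4) can be sketched in a few lines of plain Python (an illustrative sketch, not the authors' code; the channel numbers below are hypothetical):

```python
import math

def gcd_groups(c1: int, c2: int) -> int:
    """Eq. (4): number of groups g for the GWC, taken as the greatest
    common divisor of the base-branch (c1) and diversity-branch (c2)
    output channel counts."""
    return math.gcd(c1, c2)

def check_group_conv(in_channels: int, out_channels: int, g: int) -> bool:
    """Group convolution requires g to divide both the input and the
    output channel counts."""
    return in_channels % g == 0 and out_channels % g == 0

# Hypothetical layer: C = C' = 256 with alpha = beta = 0.5, gamma = 2.
c_base_out = 128                     # beta * C'
c_div_out = 128                      # (1 - beta) * C'
g = gcd_groups(c_base_out, c_div_out)
# The GWC input is X_M concatenated with Y_E: 0.5*C + gamma*(1-beta)*C'.
gwc_in = 128 + 256
print(g, check_group_conv(gwc_in, c_div_out, g))
```

Because g is tied to the per-layer channel counts, it automatically grows in deeper (wider) stages, which is the layer-by-layer increase the text describes.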
3.2 Feature Expansion Block

In MobileNetV2 [18], it was observed that much valuable information is lost when the feature dimension is small. This is very intractable for lightweight models, because their feature dimensions are severely restricted. Motivated by this, a Feature Expansion Block is proposed for the expansion part X_E, where PWC is utilized to linearly expand the input features. Besides, a bypass branch flowing to the diversity branch can increase its number of input channels, and a hyper-parameter γ is introduced to denote the expansion factor (shown in Fig. 1). When γ = 1, the number of feature channels generated by this block is the same as the number of output channels of the diversity branch, i.e., (1 − β)C′. It has been shown in [13, 14] that redundant features usually contain necessary details. To sufficiently utilize the redundant features of the expansion part, we supplement the features of the final output by adding the output of the Feature Expansion Block to Y_M; this branch from the PWC can further compensate for and add the necessary information. To match the channels of this output and Y_M, a 1 × 1 GWC with a very small number of parameters is inserted into the branch, as shown in Fig. 1. Correspondingly, the output Y_E of the Feature Expansion Block can be obtained as:

$Y_E = F_{conv}^{pwc}(X_E) \in R^{\gamma(1-\beta)C' \times W \times H}$  (6)

where $F_{conv}^{pwc}(\cdot)$ denotes the PWC operation. Finally, we obtain the channel-matched output $Y'_E$ by using the 1 × 1 GWC, given as:

$Y'_E = F_{conv1\times1}^{gwc}(Y_E)$  (7)

where $F_{conv1\times1}^{gwc}(\cdot)$ denotes the 1 × 1 GWC operation.

3.3 Channel Attention Module

In the Feature Extraction Block, the parallel structure can obtain more information with higher dissimilarity while maintaining basic information extraction with a lower total number of parameters. As shown in Fig. 1, the base branch and the diversity branch are used to extract intrinsic features and diversity features, respectively. However, some necessary and important details are easily covered up in the feature maps obtained by concatenating the two branches, and it is necessary to focus on this information in subsequent processing. Therefore, a channel attention module is used to calibrate the attention weights of this information, and the Squeeze-and-Excitation module [21] is
adopted in this work owing to its outstanding performance (shown in Fig. 1). For feature maps with c channels, we first perform global average pooling (GAP) over the spatial dimensions to obtain c constants, which represent the global information of the channels. Then fully-connected and sigmoid operations are performed, and we obtain c weights in the range (0, 1), which represent the importance of each channel of the input features. Finally, the obtained weights are multiplied by the corresponding input features, which strengthens the necessary details and weakens the invalid information. Correspondingly, the channel attention weights w are calculated as follows:

$w = \sigma(W_2^f\,ReLU(W_1^f\,GAP(Y_M)))$  (8)

where $W_1^f$ and $W_2^f$ are the corresponding weights of the fully-connected layers. Then, the obtained channel attention weights are multiplied by the corresponding channels of the output Y_M, given as:

$Y'_M = Y_M \cdot w$  (9)
Finally, we can obtain the final output Y by adding $Y'_M$ and $Y'_E$, given as follows:

$Y = Y'_M + Y'_E \in R^{C' \times W' \times H'}$  (10)
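The squeeze-and-excitation computation of Eqs. (8)–(9) can be sketched numerically as follows (plain Python with tiny hypothetical feature maps and weights; a real implementation would use a deep-learning framework):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feat, w1, w2):
    """Eqs. (8)-(9) in miniature: feat is a list of c channels, each an
    h x w grid; w1 (r x c) and w2 (c x r) are the fully-connected
    weights W1 and W2 (illustrative values, not learned ones)."""
    c = len(feat)
    # Global average pooling over the spatial dimensions -> c constants.
    gap = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    # First fully-connected layer followed by ReLU.
    hidden = [max(0.0, sum(w1[j][i] * gap[i] for i in range(c)))
              for j in range(len(w1))]
    # Second fully-connected layer + sigmoid -> one weight in (0, 1) per channel.
    w = [sigmoid(sum(w2[i][j] * hidden[j] for j in range(len(hidden))))
         for i in range(c)]
    # Eq. (9): rescale each channel by its attention weight.
    scaled = [[[v * w[i] for v in row] for row in feat[i]] for i in range(c)]
    return scaled, w

# Two hypothetical 2x2 channels with different magnitudes.
feat = [[[1.0, 1.0], [1.0, 1.0]],
        [[4.0, 4.0], [4.0, 4.0]]]
out, w = channel_attention(feat, w1=[[0.5, 0.5]], w2=[[1.0], [-1.0]])
```

Each channel ends up multiplied by its own weight in (0, 1), which is exactly the strengthening/weakening behaviour described above.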
3.4 Complexity Analysis
For an input tensor X ∈ R^(C×W×H) and an output tensor Y ∈ R^(C′×W′×H′), the parameters of a standard convolution with kernel size k × k can be calculated as:

$P_{SC} = C \times C' \times k \times k$  (11)
To facilitate the analysis, we set the hyper-parameters to α = 0.5, β = 0.5 and γ = 2, and set the number of groups in the GWC to g. For the proposed TDRConv, the parameters can then be calculated as follows:

$P_{TDRC} = \left(0.25C \times C' + 0.5C' \times \frac{0.5C + C'}{g}\right) \times k \times k + 0.5C \times C' = C'\left(\frac{g+1}{4g}C + \frac{1}{2g}C'\right) \times k \times k + 0.5C \times C'$  (12)
Taking TDRConv-ResNet-50 as an example, where the input channels of TDRConv in all stages equal the output channels, i.e., C = C′, and g is extremely large, Eq. (12) can be simplified as:

$P_{TDRC} = 0.25C^2(k \times k + 2)$  (13)
Similarly, the parameters of the standard convolution in Eq. (11) can be simplified as:

$P_{SC} = C^2(k \times k)$  (14)

Therefore, TDRConv can effectively reduce the parameters.
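The simplification from Eq. (12) to Eq. (13) can be checked numerically. The sketch below (plain Python; the channel size 256 and kernel size 3 are hypothetical) compares the exact parameter counts with the asymptotic ratio 0.25(k² + 2)/k²:

```python
import math

def params_standard(c_in: int, c_out: int, k: int) -> int:
    # Eq. (11): parameters of a k x k standard convolution.
    return c_in * c_out * k * k

def params_tdrconv(c_in: int, c_out: int, k: int, g: int) -> float:
    """Eq. (12) with alpha = beta = 0.5 and gamma = 2:
    base SC + diversity GWC + expansion PWC."""
    base = 0.25 * c_in * c_out * k * k
    gwc = 0.5 * c_out * (0.5 * c_in + c_out) / g * k * k
    pwc = 0.5 * c_in * c_out
    return base + gwc + pwc

C, k = 256, 3
g = math.gcd(C // 2, C // 2)            # Eq. (4) grouping, here g = 128
ratio = params_tdrconv(C, C, k, g) / params_standard(C, C, k)
limit = 0.25 * (k * k + 2) / (k * k)    # Eq. (13) / Eq. (14) as g grows
print(round(ratio, 3), round(limit, 3))
```

With a large g the exact ratio sits close to the asymptotic value of roughly 0.31 for k = 3, i.e., TDRConv needs under a third of the parameters of a standard 3 × 3 convolution.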
4 Experiments and Results

4.1 Datasets and Experimental Settings

To verify the effectiveness and generalization of the proposed TDRConv, we select two public datasets: CIFAR-10 [22] and ImageNet [23]. We evaluate the performance of the proposed TDRConv on four criteria: Top-1 and Top-5 accuracy (Acc), FLOPs, and the number of parameters (Params). All experiments are conducted on a server with an 18-core Intel(R) Xeon(R) Gold 6240 processor and six GeForce GTX 2080 graphics cards. A GPU implementation accelerates the forward-propagation and back-propagation routines using the SGD optimizer under the PyTorch framework in all experiments. For comparison, the intrinsic 3 × 3 standard convolutions of the backbones are replaced with the proposed TDRConv module in the different approaches. The three hyper-parameters α, β and γ are used to adjust the proportions of the channel allocations among the branches. For the sake of fairness, all experiments adopt the same hyper-parameters.

4.2 Image Classification on CIFAR-10 and ImageNet

Performance on CIFAR-10. In these experiments, the two widely used networks VGG-16 [24] and ResNet-56 [25] are selected as the backbones. Similar to [14], all 3 × 3 convolutions of the backbones are replaced with our TDRConv while leaving other settings unchanged; the resulting models are called TDRConv-VGG-16 and TDRConv-ResNet-56, respectively. The training strategy is the same as in [25]. Table 1 shows the comparative results for various groups of hyper-parameters, where α and β are fixed (i.e., α = 1/2 and β = 1/2) and the expansion ratio γ is set to 1, 2 and 4 for comparison. In general, as the expansion ratio increases, the number of parameters and the FLOPs increase, and the obtained accuracy also increases.
For ResNet-56, our TDRConv combined with the standard architecture achieves accuracy comparable to the baseline ResNet-56 when γ = 4, and our method also outperforms the state-of-the-art SPConv [14] and GhostNet [13] with fewer parameters and lower FLOPs. For VGG-16, the proposed TDRConv with different γ obtains slightly higher accuracy than the original VGG-16 model, while the smallest configuration uses less than 30% of the parameters and 31% of the FLOPs. Compared with the state-of-the-art methods, our TDRConv also surpasses SPConv [14] and GhostNet [13] in accuracy, number of parameters and FLOPs when γ = 2. In general, the proposed TDRConv combined with the standard architectures gives significant reductions in FLOPs and parameter counts while even improving accuracy. Compared with other similar state-of-the-art methods, the proposed TDRConv also obtains competitive results. Performance on ImageNet. In this experiment, ResNet-50, ResNet-101 and ShuffleNetV2 are selected as the backbones, and all 3 × 3 convolutions in ResNet-50 are replaced with our proposed TDRConv. The batch size, weight decay and momentum are 256, 1e-4 and 0.9, respectively. The learning rate is initially set to 0.1 and then decays by a factor of 10 every 30 epochs, for a total of 90 epochs. The hyper-parameters with the second-best performance are selected for comparison (i.e., α = 1/2, β = 1/2, and γ
Table 1. Performance evaluation of different variants of ResNet-56 and VGG-16 adopting various hyper-parameters on CIFAR-10. The standard architectures ResNet-56 and VGG-16 are selected as the reference for calculating the reduction in parameter and FLOP counts.

Model                                    Params  FLOPs    Acc. (%)
ResNet-56-Baseline                       0.85M   127M     92.78
SPConv-ResNet-56-α1/2 [14]               0.32M   49M      91.83
Ghost-ResNet-56-s2 [13]                  0.43M   63M      92.70
EKConv-ResNet-56 [29]                    0.43M   128M     92.70
TDRConv-ResNet-56-α1/2-β1/2-γ1 (Ours)    0.29M   44M      92.43
TDRConv-ResNet-56-α1/2-β1/2-γ2* (Ours)   0.32M   49M      92.66
TDRConv-ResNet-56-α1/2-β1/2-γ4 (Ours)    0.40M   64M      92.82
VGG-16-Baseline                          14.7M   314M     93.60
SPConv-VGG-16-α1/2 [14]                  5.3M    116M     93.77
BFConv-VGG-16 [28]                       8.2M    307.15M  93.56
Ghost-VGG-16-s2 [13]                     7.7M    158M     93.70
TDRConv-VGG-16-α1/2-β1/2-γ1 (Ours)       4.4M    97M      93.71
TDRConv-VGG-16-α1/2-β1/2-γ2 (Ours)       4.8M    111M     93.81
TDRConv-VGG-16-α1/2-β1/2-γ4 (Ours)       5.7M    142M     93.88

* In the hyper-parameters of TDRConv-ResNet-56, α = 1/2, β = 1/2, and γ = 2, respectively. And so on in other cases.
= 2, respectively). The results are shown in Table 2. The proposed TDRConv combined with the standard architecture outperforms the original ResNet-50 by 0.78% with only 70.1% of the parameters and 69.3% of the FLOPs. Compared with other state-of-the-art methods, our TDRConv-ResNet50-α1/2-β1/2-γ2 obtains the best Top-1 and Top-5 accuracy. For example, although Ghost-ResNet50-s2 [13] and Thinet-ResNet-50 [26] have fewer parameters and lower FLOPs, our TDRConv-ResNet50-α1/2-β1/2-γ2 obtains much higher Top-1 and Top-5 accuracy. As for HetConv-ResNet-50-P4 [27] and SPConv-ResNet-50-α1/2 [14], our method outperforms them in Top-1 and Top-5 accuracy while reducing many more parameters and FLOPs. Overall, the proposed TDRConv reduces the FLOPs and the number of parameters with a significant rise in Top-1 and Top-5 accuracy based on ResNet-50 on ImageNet.
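The step schedule used for the ImageNet runs (initial rate 0.1, divided by 10 every 30 epochs over 90 epochs) can be expressed as a simple function (an illustrative sketch, not the authors' training code):

```python
def learning_rate(epoch: int, base_lr: float = 0.1,
                  decay: float = 0.1, step: int = 30) -> float:
    """Step-decay schedule: base_lr multiplied by `decay` once per
    completed `step`-epoch interval."""
    return base_lr * (decay ** (epoch // step))

# Rates at a few representative epochs of the 90-epoch run.
schedule = [learning_rate(e) for e in (0, 29, 30, 60, 89)]
print(schedule)
```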
4.3 Ablation Study

To test the impacts of the three hyper-parameters and of the different components on performance, a series of ablation experiments are conducted on CIFAR-10 based on VGG-16. Table 3 illustrates the effects of the three hyper-parameters α, β and γ.
Table 2. Classification performance comparisons between various methods based on ResNet-50, ResNet-101 and ShuffleNetV2 on ImageNet.

Model                                    Params   FLOPs   Top-1 Acc. (%)  Top-5 Acc. (%)
ResNet-50-Baseline                       25.56 M  4.11 G  76.15           92.93
Thinet-ResNet-50 [26]                    16.90 M  2.60 G  72.10           90.30
SPConv-ResNet-50-α1/2 [14]               18.34 M  2.97 G  76.26           93.05
HetConv-ResNet-50-P4 [27]                -        2.85 G  76.16           -
Ghost-ResNet-50-s2 [13]                  13.00 M  2.20 G  75.00           92.30
TiedResNet-50 [15]                       17.07 M  -       75.78           92.72
GConv-ResNet-50 [16]                     20.49 M  4.1 G   76.55           93.14
GWSConv-24w-4s-ResNet-50 [28]            20.5 M   4.1 G   76.60           93.17
BFConv-19w-4s-ResNet-50 [28]             18.52 M  3.52 G  76.79           93.37
TDRConv-ResNet-50-α1/2-β1/2-γ2 (Ours)    17.93 M  2.85 G  76.93           93.40
ResNet-101-Baseline                      44.55 M  7.84 G  77.37           -
SE-ResNet-101 [21]                       49.33 M  -       77.62           -
TDRConv-ResNet-101-α1/2-β1/2-γ2 (Ours)   30.16 M  5.23 G  77.64           93.73
ShuffleNetV2-Baseline                    2.28 M   0.15 G  67.99           -
TDRConv-ShuffleNetV2 (Ours)              2.35 M   0.16 G  69.35           88.78
Table 3. Impacts of each hyper-parameter on performance on CIFAR-10 based on VGG-16.

Model     Hyper-parameter              Params  FLOPs  Acc. (%)
TDRConv   VGG-16 (base)                14.7M   314M   93.6
          α = 1/2, β = 1/2, γ = 1      4.4M    97M    93.71
          α = 1/2, β = 1/2, γ = 2      4.8M    111M   93.81
          α = 1/2, β = 1/2, γ = 4      5.7M    142M   93.88
          α = 1/4, β = 1/2, γ = 2      3.5M    93M    93.6
          α = 1/8, β = 1/2, γ = 2      2.8M    82M    93.52
          α = 1/8, β = 1/4, γ = 2      3.6M    102M   93.28
The hyperparameter γ has a great influence on the performance of the model. The accuracy increases with the increase of the expansion ratio γ, and the number of parameters increases accordingly, which means that the Feature Expansion Block can help the model to improve performance. In fact, the increase of the expansion ratio γ can enhance the extraction of diversity information in the Diversity branch, thereby effectively improving the model performance and increasing the number of parameters. Thus
it is necessary to find a balance between the performance improvement and the parameter reduction. In our experiments, we found that an acceptable balance is reached when γ = 2. The hyper-parameter α represents the ratio of the feature maps flowing to the branch of the Feature Extraction Block, i.e., the proportion of the main part in the total input feature maps. In the experiments, when α decreases, the number of model parameters decreases significantly while the accuracy decreases only very slightly, which indicates that our model has a good parameter-reduction effect. The hyper-parameter β represents the proportion, among all output feature maps, of the channels of the base branch, which generates the intrinsic features containing certain redundancy. When β decreases, the accuracy decreases significantly, which illustrates the importance of the inherent features containing certain redundancy.

Table 4. Ablation experiments of different components and branches on CIFAR-10 based on VGG-16.
Model: TDRConv (α = 1/2, β = 1/2, γ = 2)

No  BPG1  BPO2  CA3  Params  FLOPs  Acc. (%)
1   –     –     –    4.5M    99M    92.07
2   √     –     –    4.6M    111M   92.96
3   –     √     –    4.5M    99M    93.55
4   –     –     √    4.8M    99M    92.19
5   √     √     –    4.6M    111M   93.38
6   √     –     √    4.8M    111M   93.13
7   –     √     √    4.8M    99M    93.59
8   √     √     √    4.8M    111M   93.81

1 BPG denotes the branch from the PWC of the Feature Expansion Block to the GWC of the Diversity branch.
2 BPO denotes the branch from the PWC of the Feature Expansion Block to the Output feature maps.
3 CA represents the Channel Attention after the Feature Extraction Block.
In the proposed TDRConv module, the different components support each other to construct a whole that contributes to improved performance. The two branches and the channel attention module are selected for the ablation experiments. Among them, BPG and BPO play important roles in expanding the channels for the diversity branch and supplementing features for the final output, respectively. Table 4 shows the results. When each component is used alone, BPG and BPO improve the performance significantly, while the improvement from the channel attention module is not very obvious. The combinations of two components all outperform the original method; among them, the combination of BPO and CA obtains the best performance, while the other combinations are even worse than BPO alone, which may be caused by the interference of invalid diversity information with the feature supplement in the Feature Expansion Block. When the channel attention module is applied to the
output of the Feature Expansion Block, the performance is significantly improved, by 0.43%, compared with the combination of BPG and BPO. This is mainly because the channel attention module enhances the weight of the effective diversity information, thereby inhibiting the interference.
5 Conclusions

It is still a challenge to find a trade-off between the redundancy and the diversity of features. In this paper, we proposed a new TDRConv module from the perspective of network structure design; it can capture diversity information while acquiring intrinsic features with some redundancy. The proposed TDRConv module consists of the Feature Extraction Block, the Feature Expansion Block and a channel attention block, and it can effectively split and expand input features to retain and generate features with moderate redundancy and rich diversity while requiring less computation. Extensive experiments on various datasets (CIFAR-10 and ImageNet) and network architectures have demonstrated the effectiveness and generalization of our TDRConv. The results show that the networks equipped with TDRConv can all outperform the state-of-the-art methods in terms of accuracy, but with significantly lower FLOPs and fewer parameters. More importantly, the proposed TDRConv can be readily plugged into existing CNNs as a plug-and-play component, and it is promising to further extend CNNs to wider scenarios. Limitations: the proposed TDRConv is mainly designed for the widely used k × k convolutions such as 3 × 3, not including 1 × 1 convolution. Considering that 1 × 1 convolution generates only a small number of parameters, we will design a better network architecture by integrating the advantages of 1 × 1 convolution in the future.

Acknowledgements. This work was supported in part by the Zhejiang Provincial Natural Science Foundation of China (Grant Nos. LGF22F030016 and LGF20H180002), and in part by the National Natural Science Foundation of China (Grant Nos. 62271448 and 61972354).
References
1. Tang, C., Xue, D., Chen, D.: Feature diversity learning with sample dropout for unsupervised domain adaptive person re-identification. CoRR abs/2201.10212 (2022)
2. Ayinde, B.O., Inanc, T., Zurada, J.M.: Regularizing deep neural networks by enhancing diversity in feature extraction. IEEE Trans. Neural Netw. Learn. Syst. 30(9), 2650–2661 (2019)
3. Ayinde, B.O., Zurada, J.M.: Nonredundant sparse feature extraction using autoencoders with receptive fields clustering. Neural Netw. 93, 99–109 (2017)
4. Ogundijo, O.E., Elmas, A., Wang, X.: Reverse engineering gene regulatory networks from measurement with missing values. EURASIP J. Bioinf. Syst. Biol. 2017(1), 1–11 (2017)
5. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016)
6. Dieleman, S., De Fauw, J., Kavukcuoglu, K.: Exploiting cyclic symmetry in convolutional neural networks. In: ICML 2016, vol. 48, pp. 1889–1898 (2016)
7. Zhai, S., Cheng, Y., Lu, W., Zhang, Z.M.: Doubly convolutional neural networks. In: NIPS 2016, pp. 1090–1098 (2016)
8. Ayinde, B.O., Zurada, J.M.: Deep learning of constrained autoencoders for enhanced understanding of data. IEEE Trans. Neural Netw. Learn. Syst. 29(9), 3969–3979 (2018)
9. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(56), 1929–1958 (2014)
10. Wan, L., Zeiler, M.D., Zhang, S., LeCun, Y., Fergus, R.: Regularization of neural networks using DropConnect. In: ICML (3). JMLR Workshop and Conference Proceedings, vol. 28, pp. 1058–1066. JMLR.org (2013)
11. Mellor, J., Turner, J., Storkey, A., Crowley, E.J.: Neural architecture search without training. In: International Conference on Machine Learning, pp. 7588–7598 (2021)
12. Sifre, L., Mallat, S.: Rigid-motion scattering for texture classification. arXiv preprint arXiv:1403.1687 (2014)
13. Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., Xu, C.: GhostNet: more features from cheap operations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1580–1589 (2020)
14. Zhang, Q., et al.: Split to be slim: an overlooked redundancy in vanilla convolution. arXiv preprint arXiv:2006.12085 (2020)
15. Wang, X., Stella, X.Y.: Tied block convolution: leaner and better CNNs with shared thinner filters. In: AAAI 2021, vol. 35, pp. 10227–10235 (2021)
16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)
17. Szegedy, C., et al.: Going deeper with convolutions. In: CVPR 2015, pp. 1–9 (2015)
18. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
19. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks.
In: CVPR 2017, pp. 1492–1500 (2017)
20. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: CVPR 2018, pp. 6848–6856 (2018)
21. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: CVPR 2018, pp. 7132–7141 (2018)
22. Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-10 (Canadian Institute for Advanced Research) (2010). http://www.cs.toronto.edu/kriz/cifar.html
23. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Li, F.F.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255. IEEE Computer Society (2009)
24. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
25. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR 2016, pp. 770–778 (2016)
26. Luo, J.H., Wu, J., Lin, W.: ThiNet: a filter level pruning method for deep neural network compression. In: ICCV 2017, pp. 5058–5066 (2017)
27. Singh, P., Verma, V.K., Rai, P., Namboodiri, V.P.: HetConv: heterogeneous kernel-based convolutions for deep CNNs. In: CVPR 2019, pp. 4835–4844 (2019)
28. Yang, D., Yu, X., Sun, Y., Zhuang, F., He, Q., Ye, S.: BFConv: improving convolutional neural networks with butterfly convolution. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds.) ICONIP 2021. LNCS, vol. 13111, pp. 40–50. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-92273-3_4
29. Yang, D., Chen, Z., Sun, Y., He, Q., Ye, S., Chen, D.: EKConv: compressing convolutional neural networks with evolutionary kernel convolution. In: Journal of Physics: Conference Series, vol. 2425, p. 012011. IOP Publishing (2023)
Community Detection Using Revised Medoid-Shift Based on KNN

Jiakang Li1, Xiaokang Peng1, Jie Hou1, Wei Ke2, and Yonggang Lu1(B)

1 School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, Gansu, China
[email protected]
2 Gansu New Vispower Technology Co. Ltd, No. 1689 Yanbei Road, Lanzhou 730000, Gansu, China
Abstract. Community detection has become an important problem with the booming of social networks. The Medoid-Shift algorithm preserves the benefits of Mean-Shift and can be applied to problems based on a distance matrix, such as community detection. One drawback of the Medoid-Shift algorithm is that there may be no data points within the neighborhood region defined by a distance parameter. To deal with this problem, a new algorithm called Revised Medoid-Shift (RMS) is proposed. During the process of finding the next medoid, the RMS algorithm uses a neighborhood defined by KNN, while the original Medoid-Shift uses a neighborhood defined by a distance parameter. Since the neighborhood defined by KNN is more stable than the one defined by the distance parameter in terms of the number of data points within it, the RMS algorithm may converge more smoothly. The RMS algorithm is tested on two kinds of datasets: community datasets with known ground-truth partitions and community datasets without ground-truth partitions. The experimental results show that the proposed RMS algorithm generally produces better results than Medoid-Shift, as well as some state-of-the-art and most classic community detection algorithms, on different kinds of community detection datasets. Keywords: Clustering · Medoid-Shift · Community Detection · KNN
1 Introduction Social networks have become ubiquitous in our day-to-day lives through such platforms as Facebook, Twitter, and Instagram. These networks can be modeled as graphs, with nodes representing individuals and edges representing their interconnections. Within these complex graphs, certain subgraphs exhibit particularly high density, where individuals are more closely interconnected than elsewhere. These subgraphs are commonly referred to as communities [1]. In recent years, a plethora of community detection algorithms have been proposed to mine hidden information within networks [2]. These algorithms are generally categorized into two types: overlapping and non-overlapping methods. To address the problem of community detection in network analysis, researchers have employed a variety © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 345–353, 2023. https://doi.org/10.1007/978-981-99-4752-2_29
of approaches, including hierarchical divisive, hierarchical agglomerative, and random walk-based methods [3], among others. To evaluate the performance of these algorithms, researchers have proposed various detection metrics. Among these, modularity [4] is a critical metric used to assess the quality of the community generated by different methods. A higher modularity value indicates better community creation [5]. Modularity measures the degree to which nodes within a community are more densely connected than nodes outside that community. Additionally, Normalized Mutual Information (NMI) [6] is a crucial evaluation metric when ground truth partitions exist for the dataset. The higher the NMI, the better the match with the ground truth partition. Community detection poses a formidable challenge due to the complexity and scale of network structures. Notably, contemporary complex networks primarily comprise graphs, a form of structured data lacking coordinates that precludes the direct utilization of coordinate-based algorithms, such as the Mean-Shift algorithm, for community detection [7]. While the Medoid-Shift [8] algorithm proposed subsequently can address distance matrix-based issues, its application to community detection problems remains largely unexplored. Additionally, the Medoid-Shift algorithm may encounter a critical challenge of no data points existing within the neighborhood region defined by its distance parameter, leading to suboptimal performance on community detection problems. Therefore, to address the challenges above, this paper has proposed a new community detection algorithm named RMS, which extracts the characteristics from both k-nearest neighbors (KNN) [9] and Medoid-Shift while focusing on detecting the non-overlapping community. In contrast to the traditional Medoid-Shift algorithm, our proposed method employs a modified approach for determining the neighborhood of a given node. 
Specifically, borrowing the idea of KNN, we define a parameter k for RMS. The parameter k is used to define the medoid's neighborhood, rather than using a distance parameter as in the conventional Medoid-Shift algorithm. This modification effectively mitigates the issue of an unstable number of data points within the defined neighborhood region, which is a known limitation of the original Medoid-Shift method. Moreover, during the shifting of the medoid, the RMS algorithm calculates similarities between each point p in the neighborhood of the current medoid and the KNN of p. Because the KNN of p does not depend on the current medoid, the stability of the shifting process is enhanced compared to the original Medoid-Shift method. The remaining portions of this paper are organized as follows: Sect. 2 discusses related works in community detection. Section 3 discusses the process of the RMS algorithm and its data pre-processing in detail. In Sect. 4, we present the experimental results and analyze them in detail. Section 5 discusses the conclusion and future work.
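The contrast between the two neighborhood definitions can be sketched as follows (plain Python over a hypothetical distance matrix): a radius-based neighborhood may be empty, while a KNN neighborhood always contains exactly k points:

```python
def radius_neighborhood(dist, i, r):
    """Medoid-Shift style: all points within distance r of point i."""
    return [j for j in range(len(dist)) if j != i and dist[i][j] <= r]

def knn_neighborhood(dist, i, k):
    """RMS style: the k points closest to point i."""
    others = sorted((j for j in range(len(dist)) if j != i),
                    key=lambda j: dist[i][j])
    return others[:k]

# Hypothetical 4-point symmetric distance matrix.
dist = [[0.0, 5.0, 6.0, 9.0],
        [5.0, 0.0, 2.0, 8.0],
        [6.0, 2.0, 0.0, 7.0],
        [9.0, 8.0, 7.0, 0.0]]
print(radius_neighborhood(dist, 0, 1.0))  # empty: no point within r = 1
print(knn_neighborhood(dist, 0, 2))       # always two points
```

This stable neighborhood size is what lets the revised shift step always have candidate medoids to move to.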
2 Related Work

Research on uncovering the community structure of real social networks has been a hot topic, which has spawned many community detection algorithms. This section first introduces classical algorithms, KNN-based algorithms, and distance-matrix-based algorithms, and then introduces Medoid-Shift.
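For reference, the modularity metric [4] discussed in the introduction has the standard closed form Q = (1/2m) Σᵢⱼ (Aᵢⱼ − kᵢkⱼ/2m) δ(cᵢ, cⱼ), which can be computed directly from an edge list; the toy graph below is hypothetical:

```python
def modularity(edges, communities):
    """Newman-Girvan modularity for an undirected graph given as an
    edge list and a node -> community mapping."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for u, v in edges:                    # intra-community edge term
        if communities[u] == communities[v]:
            q += 1.0 / m
    for u in degree:                      # null-model (degree) term
        for v in degree:
            if communities[u] == communities[v]:
                q -= degree[u] * degree[v] / (4.0 * m * m)
    return q

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
part = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, part))
```

Splitting the graph at the bridge gives a clearly positive Q, while placing every node in one community gives Q = 0, matching the intuition that higher modularity indicates a better partition.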
2.1 Classical Algorithms in Community Detection

In 2004, Newman and Girvan developed the Girvan–Newman algorithm [10], which is a well-known method for discovering communities. The algorithm employs divisive hierarchical clustering and iteratively removes the edges with the highest betweenness score to partition the network into subgroups. In 2010, Louvain [11] introduced a community detection algorithm that focuses on optimizing modularity. The algorithm aims to maximize the modularity of the entire network through an iterative process of optimizing the partition of the network into communities. Besides these two, numerous community detection algorithms have been proposed during the past two decades, each with its own characteristics and advantages. For instance, module-based optimization [12], spectral clustering [13], hierarchical clustering [11], label propagation [14], and information theory-based [15] algorithms have become classical algorithms in community detection.

2.2 Medoid-Shift Algorithm

Derived from the idea of Mean-Shift, Medoid-Shift is very similar to Mean-Shift in theory and process. Both calculate the shift toward regions of greater data density, find the cluster centers by iteration, and determine the number of clusters automatically. The biggest difference is that Mean-Shift shifts to a location according to the Mean-Shift vector, while Medoid-Shift shifts to a certain point in the neighborhood. Furthermore, Medoid-Shift can be directly applied to distance-based community detection problems, but Mean-Shift cannot. Brief Introduction of the Medoid-Shift Core Algorithm. Given an N × N symmetric matrix D(i, j), which is the distance between points i and j, then starting from point i, an index for each point j is calculated as follows:

$S(i, j) = \sum_{k=1}^{N} D(j, k)\,\varphi(D(i, k))$  (1)
The next point to shift to from i is the point j with the minimum value of S(i, j). {<P, QO1>, <P, QO2>, ..., <P, QOt>}. P = (p1, p2, ..., pN) is the input passage, and the question is spliced with the i-th option to give QOi = (q1, ..., qM, o1, ..., oK) as the input of the model, where pi, qi and oi each represent a token. Next, as shown in Eq. (2), to extract effective supporting sentences for answering the question, we use P and QOi as the input of the passage-sentence quality evaluation method. The specific implementation is described in Sect. 4.2, resulting in a new pair P and QOi. It should be noted that the spliced vectors of P and QOi
A Sentence Quality Evaluation Framework for Machine Reading
447
will be used as the input of the model in advance. And a special token ([CLS]) is added at the beginning of the sentence to indicate the beginning of a sentence. And [CLS] can be used as an aggregate representation. It is suitable for classification tasks. A special separator [SEP] is added between P and QOi to distinguish between the two. As shown in Eq. (3). P, QOi = PQE(P, QOi )
(2)
input = [CLS] + P + [SEP] + QOi + [SEP]
(3)
In Eq. (4), MacBERT distinguishes between upper and lower sentences and incorporates positional information by concatenating segment embeddings and position embeddings onto token embeddings. Ei = Etok (xi ) + Eseg (xi ) + Epos (xi )
(4)
where Etok(xi) ∈ R^dmodel denotes the token embedding, Eseg(xi) ∈ R^dmodel denotes the segment embedding, and Epos(xi) ∈ R^dmodel denotes the position embedding; dmodel denotes the hidden layer size, and xi denotes pi, qi, or oi.

T = Transformer12(E)    (5)
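The embedding sum of Eq. (4), which feeds the Transformer stack of Eq. (5), is an element-wise addition of three lookup tables. A toy sketch (dimensions and random tables are illustrative, not MacBERT's actual weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size, num_segments, max_position = 8, 100, 2, 16
E_tok = rng.normal(size=(vocab_size, d_model))    # token embeddings
E_seg = rng.normal(size=(num_segments, d_model))  # segment embeddings
E_pos = rng.normal(size=(max_position, d_model))  # position embeddings

token_ids = np.array([5, 17, 3])
segment_ids = np.array([0, 0, 1])   # 0 = passage side, 1 = QO_i side
position_ids = np.arange(len(token_ids))

# E_i = E_tok(x_i) + E_seg(x_i) + E_pos(x_i), per Eq. (4)
E = E_tok[token_ids] + E_seg[segment_ids] + E_pos[position_ids]
```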
After the 12 layers of the Transformer structure, a vector representation that fuses global context information is obtained for each token, as shown in Eq. (6).

T = {TC, TP1, . . . , TPN, T[SEP], TQ1, . . . , TQM, TO1, . . . , TOK, T[SEP]}
(6)
where the final vector representation of each token satisfies Ti ∈ R^dmodel and T ∈ R^(V×dmodel). V denotes the total length of the input sequence, V = M + N + K + 2. The Multi-choice MRC task is in essence a multiclass classification task, as shown in Eqs. (7) and (8): the output pooler_output of MacBERT is the last-layer hidden state of the first token of the sequence ([CLS]), which is further processed by a linear layer and a Tanh activation function to obtain the final vector representation.

TC = Tanh(linear(TC))    (7)

TC = MacBERT(P, QOi)    (8)
where the dimensionality of TC remains constant, TC ∈ R^dmodel. Therefore, after obtaining the output of the interaction layer, the probability distribution over the options is calculated. The aggregated information representation Ai ∈ R^dmodel of the < P, QOi > text pair is obtained for the i-th option Oi. Let the correct answer be Or. The cross-entropy loss function is used; the calculation is shown in Eqs. (9) and (10).

Ai = MacBERT(P, QOi)
(9)
448
F.-J. Meng et al.
L(Or | P, Q) = −log( exp(W^T Ar) / Σ_{i=1}^{t} exp(W^T Ai) )    (10)
where W ∈ R^dmodel denotes the learnable parameters and, as mentioned earlier, t denotes the number of options.

4.2 Passage-Sentence Quality Evaluation
Fig. 2. Overall process framework of the PQE-based method
As shown in Fig. 2, the contribution of each passage sentence to answering the question is used as the evaluation criterion. The experiment uses several classical unsupervised algorithms to calculate the similarity between each sentence in the passage and each of the four options; this similarity is taken as the sentence's contribution to answering the question. Finally, the sentences with high contribution are extracted and sorted in their original order in the passage to form a new passage as the input of the model. The passage is sliced into the sentence set {s1, s2, . . . , sg}, and question Q is spliced with each option to obtain the comparison set {QO1, QO2, . . . , QOt}; each QOi is compared for similarity against each sentence si of the passage. First, word segmentation and stop-word removal are performed on the two parts separately. Then unsupervised algorithms are used to obtain scores
for both parts. The unsupervised methods include Jaccard, TF-IDF, and BM25. Finally, the maximum length of the extracted text is limited to 512: the top-k sentences by contribution are extracted, and a new passage is formed in the original order as the input of the model.

Jaccard
Jaccard [19] is a method based on the number of co-occurring words. The score is mainly determined by the size of the overlap between the two input utterances: the more words the two utterances share, the higher their similarity.

Jaccard(A, B) = |A ∩ B| / |A ∪ B|    (11)
where A ∩ B takes the intersection of the word sets of A and B, A ∪ B takes their union, and |·| denotes the size of a set.

TF-IDF
TF-IDF is a classical method for calculating text similarity. The core idea is to evaluate the importance of a word for a document: the importance of a word comes from its frequency of occurrence in the document and its frequency of occurrence in the corpus [20].

TF-IDF = TF × IDF    (12)

TF = n / Σ_k n_k    (13)

IDF = log( |D| / (|{i ∈ D : j ∈ i}| + 1) )    (14)
where, to obtain the importance level corresponding to a word, n denotes the number of occurrences of the word in the document, Σ_k n_k denotes the total number of occurrences of all words in the document, |D| denotes the total number of passages in the corpus, and |{i ∈ D : j ∈ i}| denotes the total number of sentences containing the word. Adding 1 prevents the denominator from being 0.

BM25
BM25 is a probabilistic retrieval model commonly used for relevance scoring, and it remains prevalent in search engines due to its strong performance. A retrieval query Q is first preprocessed by word segmentation to generate the query words wi. An importance score is generated for each wi, and the scores of all wi with respect to the document are weighted and summed to obtain the final importance score of query Q with respect to document D [21].

B(D, Q) = Σ_{i=1}^{n} IDF(wi) × R(wi, D)    (15)
IDF(wi) = log( (N − n(wi) + 0.5) / (n(wi) + 0.5) )    (16)

R(wi, D) = ( fi × (k1 + 1) ) / ( fi + k1 × (1 − b + b × LD / LD_avg) )    (17)
where B(D, Q) denotes the importance score of query Q with respect to document D; higher scores indicate higher relevance. IDF(wi) denotes the weight of word wi, and R(wi, D) denotes the importance score of word wi. After weighting and summing the importance scores of all words, the final utterance score relative to the document is obtained. In Eq. (16), N denotes the number of all documents and n(wi) denotes the number of documents in which wi appears. For a given set of documents, the more documents a word appears in, the smaller its IDF(wi), i.e., the less important the word. In Eq. (17), fi denotes the frequency of occurrence of the word in document D, LD denotes the current document length, and LD_avg denotes the average length of all documents. k1 is an adjustment factor that controls the degree of influence of fi on the final importance score: as k1 tends to zero, fi is taken into account less. b is also a moderator: when b = 0, the final importance score is not affected by the length of the document. Usually, k1 ∈ [1.2, 2.0] and b = 0.75. In this study, k1 = 1.5 and b = 0.75.
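The three scorers and the top-k extraction can be sketched as follows (a simplified illustration on pre-tokenized English sentences; the paper applies them to Chinese text after word segmentation and stop-word removal, and the helper names and the +1 IDF smoothing are our assumptions, not the paper's exact code):

```python
import math
from collections import Counter

def jaccard(a, b):
    """Eq. (11): overlap ratio of two token lists."""
    A, B = set(a), set(b)
    return len(A & B) / len(A | B) if A | B else 0.0

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Score `doc` against `query` per Eqs. (15)-(17); `docs` is the
    sentence collection used for the IDF statistics. The +1 inside the
    log keeps IDF non-negative (a common smoothing variant)."""
    N = len(docs)
    avg_len = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for w in query:
        n_w = sum(1 for d in docs if w in d)
        idf = math.log((N - n_w + 0.5) / (n_w + 0.5) + 1)
        f = tf[w]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avg_len))
    return score

def pqe_top_k(sentences, query, k):
    """Keep the k highest-scoring sentences, in original passage order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: bm25_score(query, sentences[i], sentences),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

Restoring the kept sentences to their original order, as `pqe_top_k` does, matches the paper's requirement that the extracted sentences form a new passage in the order they appear in the original passage.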
5 Experimental Analysis

5.1 Dataset
The related studies were all conducted on the Chinese MRC dataset C3 [6] introduced earlier. Table 1 shows the specific information related to C3.

Table 1. C3 sample information

| Num   | C3M Passage | C3M Question | C3D Passage | C3D Question | C3 Passage | C3 Question |
|-------|-------------|--------------|-------------|--------------|------------|-------------|
| Train | 3138        | 6013         | 4885        | 5856         | 8023       | 11869       |
| Dev   | 1046        | 1991         | 1628        | 1825         | 2674       | 3816        |
| Test  | 1045        | 2002         | 1627        | 1890         | 2672       | 3892        |
| All   | 5229        | 10006        | 8140        | 9571         | 13369      | 19577       |
5.2 Evaluation Indicator
In the evaluation of Multi-choice MRC models, accuracy is adopted as the evaluation metric. A higher value indicates better model performance.

Accuracy = (1/N) Σ_{i=1}^{N} I(ŷi = yi)    (18)
where N denotes the total number of samples, ŷi denotes the predicted label of the i-th sample, and yi denotes the true label of the i-th sample. I(ŷi = yi) is the indicator function that determines whether ŷi and yi are equal: if they are equal, its value is 1; otherwise, it is 0.

5.3 Experimental Setup
Training was conducted using a single NVIDIA TESLA A40 card. Table 2 shows the hyperparameter settings.

Table 2. Important parameter settings

| Parameter             | Value |
|-----------------------|-------|
| Learning rate         | 2e-5  |
| Train batch size      | 8     |
| Evaluation batch size | 8     |
| Epoch                 | 5     |
| Max sequence length   | 512   |
| Warmup proportion     | 0.1   |
5.4 Experimental Result
From Table 3, it can be seen that on the Test set the MacBERT model answers questions better than the other models. Although it is not optimal on the Dev set, the gap there is small.

Table 3. Model experimental result

| Model                  | Dev (Acc/%) | Test (Acc/%) |
|------------------------|-------------|--------------|
| BERT-base-chinese [16] | 63.9        | 64.6         |
| BERT-wwm-base [17]     | 63.2        | 64.4         |
| BERT-wwm-ext-base [17] | 66.5        | 67.3         |
| MacBERT-base [18]      | 66.4        | 68.3         |
The researchers investigated the Chinese MRC dataset and found that answering its questions requires matching the text with some a priori knowledge. The questions are therefore categorized into matching and a priori knowledge (linguistic, domain-specific, and general world), with 8 subcategories under the latter. In addition, based on the minimum number of supporting sentences, the questions are divided into single-sentence, multiple-sentence, and independent categories.
Table 4. Performance of BERT-wwm-ext and MacBERT on different problem types on the test set

| Type               | Num | BERT-wwm-ext (Acc/%) | MacBERT (Acc/%) |
|--------------------|-----|----------------------|-----------------|
| Matching           | 41  | 80.5                 | 97.6            |
| Prior knowledge    | 324 | 66.0                 | 65.7            |
| *Linguistic        | 102 | 66.7                 | 66.7            |
| *Domain-specific   | 2   | 50.0                 | 0.0             |
| *General world     | 220 | 65.9                 | 65.9            |
| Arithmetic         | 16  | 43.8                 | 18.8            |
| Connotation        | 10  | 50.0                 | 70.0            |
| Cause-effect       | 32  | 75.0                 | 78.1            |
| Implication        | 63  | 66.7                 | 71.4            |
| Part-whole         | 19  | 73.7                 | 73.3            |
| Precondition       | 14  | 64.3                 | 64.3            |
| Scenario           | 56  | 69.4                 | 62.5            |
| Other              | 10  | 50.0                 | 70.0            |
| Single sentence    | 110 | 71.8                 | 74.5            |
| Multiple sentences | 184 | 66.8                 | 68.5            |
| Independent        | 6   | 16.7                 | 33.3            |
Table 4 presents the performance comparison of BERT-wwm-ext and MacBERT on various question types, along with their respective accuracies. MacBERT demonstrates a notable enhancement in the broad matching category, achieving an accuracy of 97.6%. While the answers to matching questions can usually be found in the passage, the models may face challenges in discerning interfering information. For instance, consider the passage "2000年, 周晓娟19岁, 在甘肃省计划学校会计专业就读……" ("In 2000, Zhou Xiaojuan was 19 and studying accounting at the Gansu Provincial Planning School...") and the question "周晓娟学的是什么专业?" ("What major did Zhou Xiaojuan study?"). The answer choices include "计划学" ("planning") and "会计学" ("accounting"), which can cause confusion. BERT-wwm-ext may incorrectly select "计划学" as the answer, whereas MacBERT can better capture the semantic link between "会计专业" and "会计学" through the synonym substitution it performs during the pre-training stage. In general, MacBERT exhibits different advantages and disadvantages across the various subcategories. The arithmetic category demands models with numerical analysis and computational abilities. For example, consider the passage "……已经表演了三个, 再表演三个就结束了。我多么希望多演两个呀!" ("...three skits have already been performed; after three more it will be over. How I wish they would perform two more!") and the corresponding question "一共表演几个小品?" ("How many skits are performed in total?"). The model must extract the relevant numerical information from the passage and perform the corresponding operations to obtain the correct answer; contextual language models still struggle to solve such simple math problems described in language. However, MacBERT can capture expressions like "谁知道啊!" ("Who knows!") in conversations across different contexts, from which it is clear that the interlocutor is currently in a "不知道" ("not knowing") state of knowledge. Overall, MacBERT performs better on cause-effect,
implication, and other categories that require language models equipped with better text understanding and reasoning capabilities.

The impact of the various PQE methods on the performance of the MacBERT-based model can be observed from Table 5. After careful consideration, the BM25 algorithm was selected as the most suitable PQE method for this study. This method can effectively identify and prioritize the information in the passage that is relevant to answering the question, thereby aiding the answering process.

Table 5. PQE experiment effect

| Method       | Dev (Acc/%) | Test (Acc/%) |
|--------------|-------------|--------------|
| MacBERT-base | 66.4        | 68.3         |
| + Jaccard    | 67.7        | 69.0         |
| + TF-IDF     | 68.1        | 68.6         |
| + BM25       | 68.1        | 69.1         |
6 Conclusion
A MacBERT-based Multi-choice MRC method is proposed to capture contextual information for the passage, question, and options without expanding the model parameters, which improves performance on the C3 Chinese MRC dataset. However, performance decreases on categories such as arithmetic, a challenging area for language models that requires further exploration. We also conducted a PQE study using an unsupervised approach based on word-level similarity matching to evaluate the quality of passage sentences. It has low computing-power requirements while being able to filter valid information and reduce noise. However, the statistical calculation is performed only at the word level and does not reach the semantic level. We hope future work can incorporate deep learning language models into PQE to include contextual semantic information.

Acknowledgements. This work was supported by grants from the Natural Science Foundation of Inner Mongolia (2023LHMSS06011), the Fundamental Research Funds for Inner Mongolia Normal University (2022JBQN105), the Inner Mongolia JMRH Project (JMRKX202201), and the Natural Science Foundation of Inner Mongolia (2023MS06016).
References
1. Hermann, K.M., Kocisky, T., Grefenstette, E., et al.: Teaching machines to read and comprehend. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
2. Rajpurkar, P., Zhang, J., Lopyrev, K., et al.: SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016)
3. Nguyen, T., Rosenberg, M., Song, X., et al.: MS MARCO: a human generated MRC dataset. CoCo@ NIPS (2016)
4. Zhang, Z., Huang, Y., Zhu, P., et al.: Effective character-augmented word embedding for MRC. In: CCF International Conference on Natural Language Processing and Chinese Computing, pp. 27–39. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99495-6_3
5. Lai, G., Xie, Q., Liu, H., et al.: RACE: large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683 (2017)
6. Sun, K., Yu, D., Yu, D., et al.: Investigating prior knowledge for challenging Chinese MRC. Trans. Assoc. Comput. Linguist. 8, 141–155 (2020)
7. Ran, Q., Li, P., Hu, W., et al.: Option comparison network for multiple-choice reading comprehension. arXiv preprint arXiv:1903.03033 (2019)
8. Zhang, S., Zhao, H., Wu, Y., et al.: DCMN+: dual co-matching network for multi-choice reading comprehension. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, issue 05, pp. 9563–9570 (2020)
9. Zhu, P., Zhang, Z., Zhao, H., et al.: DUMA: reading comprehension with transposition thinking. IEEE/ACM Trans. Audio, Speech Lang. Process. 30, 269–279 (2021)
10. Brown, P.F., Della Pietra, V.J., Desouza, P.V., et al.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–480 (1992)
11. Ando, R.K., Zhang, T., Bartlett, P.: A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. 6(11) (2005)
12. Blitzer, J., McDonald, R., Pereira, F.: Domain adaptation with structural correspondence learning. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 120–128 (2006)
13. Mikolov, T., Sutskever, I., Chen, K., et al.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, vol. 26 (2013)
14. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation.
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
15. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
16. Devlin, J., Chang, M.W., Lee, K., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
17. Cui, Y., Che, W., Liu, T., et al.: Pre-training with whole word masking for Chinese BERT. IEEE/ACM Trans. Audio, Speech Language Process. 29, 3504–3514 (2021)
18. Cui, Y., Che, W., Liu, T., et al.: Revisiting pre-trained models for Chinese natural language processing. arXiv preprint arXiv:2004.13922 (2020)
19. Tian, X., Zheng, J., Zhang, Z.: Jaccard text similarity algorithm based on word embedding. Comput. Sci. 45(07), 186–189 (2018)
20. Wang, C., Li, Q.: Research on hybrid recommendation algorithm based on restricted Boltzmann machine and term frequency-inverse document frequency. J. Nanjing Univ. Sci. Technol. 45(05), 551–557 (2021)
21. Shun, Y.: A study into the BM25 model & borrowing behavior prediction model based on book retrieval sorting algorithm. Libr. J. 35(10), 63–68 (2016)
22. Jin, D., Gao, S., Kao, J.Y., et al.: MMM: multi-stage multi-task learning for multi-choice reading comprehension. arXiv preprint arXiv:1910.00458 (2019)
23. Zhang, S., Zhao, H., Wu, Y., et al.: Dual co-matching network for multi-choice reading comprehension. arXiv preprint arXiv:1901.09381 (2019)
24. Guo, S., Zhang, H., Qian, Y., et al.: Semantic relevancy between sentences for Chinese reading comprehension on college entrance examinations. J. Tsinghua Univ. (Sci. Technol.) 57(6), 575–579 (2017)
25. Xiong, C., Zhong, V., Socher, R.: Dynamic co-attention networks for question answering. In: International Conference on Learning Representations, Toulon (2017)
26. Yang, Z., Li, C., Zhang, H., et al.: Question answering for overview questions based on CFN and discourse topic. J. Chinese Inf. Process. 34(12), 73–81 (2020)
UCM: Personalized Document-Level Sentiment Analysis Based on User Correlation Mining Jiayue Qiu, Ziyue Yu, and Wuman Luo(B) Macao Polytechnic University, Macao SAR, China [email protected]
Abstract. Personalized document-level sentiment analysis (PDSA) is important in various fields. Although various deep learning models for PDSA have been proposed, they failed to consider the correlations of rating behaviors between different users. It can be observed that in the real world, users may give different rating scores for the same product, but their rating behaviors tend to be correlated over a range of products. However, mining user correlation is very challenging due to real-world data sparsity, and so far no model has utilized user correlation for PDSA. To address these issues, we propose an architecture named User Correlation Mining (UCM). Specifically, UCM contains two components, namely Similar User Cluster Module (SUCM) and Triple Attributes BERT Model (TABM). SUCM is responsible for user clustering. It consists of two modules, namely Latent Factor Model based on Neural Network (LFM-NN) and Spectral Clustering based on Pearson Correlation Coefficient (SC-PCC). LFM-NN predicts the missing values of the sparse user-product rating matrix. SC-PCC clusters users with high correlations to get the user cluster IDs. TABM is designed to classify the users' sentiment based on user cluster IDs, user IDs, product IDs, and user reviews. To evaluate the performance of UCM, extensive experiments are conducted on three real-world datasets, i.e., IMDB, Yelp13, and Yelp14. The experiment results show that our proposed architecture UCM outperforms other baselines.

Keywords: Personalized Document-level Sentiment Analysis · User Correlation · Latent Factor Model · Spectral Clustering · BERT
1 Introduction
Personalized Document-level Sentiment Analysis (PDSA) aims to classify the personal sentiment of users based on long document texts (e.g., user reviews) [17], where each personal sentiment class is a discrete rating score. Nowadays, PDSA plays an important role in various fields such as social media [8], medicine [9], intelligent education [11], finance [13], etc. So far, various models based on deep learning have been proposed for PDSA. Tang et al. [35] proposed a CNN-based model named UPNN to capture the user and the product features in document-level reviews. Dou et al. [12] proposed a model named UPDMN based on Memory Network (MN) to capture the semantic information in a
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 456–471, 2023. https://doi.org/10.1007/978-981-99-4752-2_38
document. Chen et al. [6] designed a framework called NSC + UPA based on Long Short-term Memory (LSTM). HUAPA [37] and CHIM [2] adopted Bidirectional LSTM (BiLSTM) for PDSA. Zhang et al. [39] designed MA-BERT based on Bidirectional Encoder Representations from Transformers (BERT) for PDSA. However, these works assumed that different users’ rating behaviors are independent of each other and ignore their correlations. It can be observed that although users show varying degrees of preferences for the same product, their rating behaviors for a range of products can be highly correlated. For example, suppose two users A and B, and four products P_a, P_b, P_c, and P_d. The rating scores given by user A and user B for these products are as follows: user A : (P_a = 3, P_b = 4, P_c = 5, P_d = 4) user B : (P_a = 1, P_b = 2, P_c = 3, P_d = 2) Although the scores for each product given by user A and user B are different from each other, it is easy to discover that these two users’ rating behaviors are highly correlated. If someone already knows that user B rated 3 for product e, that person can predict user A will rate 5 for product e with high confidence. The correlation of user rating behaviors is critical for PDSA. However, it is challenging to fully capture this correlation due to the sparsity problem of available real-world datasets (e.g., IMDB [35], Yelp13 [35], and Yelp14 [35]). Although the numbers of users and products are huge in these datasets, the user’s historical purchasing records are relatively quite limited. For example, Yelp14 contains 4818 users and 4194 products, while the number of the users’ historical purchasing records is only 231163, which is much smaller than 4818 × 4194. The data sparse rate of user historical purchasing records in Yelp14 is over 98.8%. Therefore, it is challenging to extract enough information about different user rating behaviors for user correlation. 
In this paper, we propose an architecture called User Correlation Mining (UCM) for PDSA. To the best of our knowledge, we are the first to adopt user correlation as an extra attribute for PDSA. Specifically, UCM contains two components, namely Similar User Cluster Module (SUCM) and Triple Attributes BERT Model (TABM). SUCM is used to cluster users in terms of calculated correlation degrees. It contains two modules, i.e., Latent Factor Model based on Neural Network (LFM-NN) and Spectral Clustering based on Pearson Correlation Coefficient (SC-PCC). LFM-NN is designed to deal with data sparsity. It can predict the missing values of the sparse attribute matrix, whose entries are the rating scores of users towards products. SC-PCC is designed to calculate the user correlation degrees and cluster the highly correlated users. As the output of SUCM, different user clusters are generated, and each user cluster is assigned a unique ID. TABM accepts user reviews, user IDs, product IDs, and user cluster IDs from SUCM. Then TABM will classify users’ sentiment levels to output user-product rating scores. In summary, the main contributions of our paper are as follows: – We propose an architecture named UCM for PDSA based on user correlations. UCM contains two main components, namely SUCM and TABM. – We design SUCM to cluster users based on their correlation degrees. SUCM consists of two modules, i.e., LFM-NN and SC-PCC. LFM-NN predicts the missing values
of the sparse attribute matrix. SC-PCC computes the user correlation degrees and clusters highly correlated users. Finally, TABM classifies the users’ sentiment levels. – The experiments are conducted on three benchmark datasets, i.e., IMDB, Yelp13, and Yelp14. Accuracy and Root Mean Square Error (RMSE) are evaluation metrics. The experiment results show that UCM outperforms other baselines. The remainder of this paper is organized as follows. In Sect. 2, we review related works of Sentiment Analysis (SA) and PDSA. The description of our proposed architecture UCM is given in Sect. 3. The experiments are performed and the performance is evaluated in Sect. 4. Finally, we summarize this paper in Sect. 5.
2 Related Work In this section, we discuss the existing works on SA and PDSA, respectively. In Sect. 2.1, previous works of SA are briefly introduced. In Sect. 2.2, the existing works on PDSA and the major issue of these existing works are described. 2.1 Sentiment Analysis Sentiment analysis aims to classify people’s emotions. It can be divided into three granularities [22], i.e., document-level, sentence-level, and aspect-level. The usual way is based on the text length and objective: (a) Document-level Sentiment Analysis (DSA) uses the whole document to classify the sentiment [4, 40], (b) Sentence-level Sentiment Analysis (SSA) aims to classify the sentiment of each sentence [3, 36], (c) Aspect-level Sentiment Analysis (ASA) identifies the sentiment polarity of the entity in the context [28, 30]. Deep learning models have been widely used in sentiment analysis. Rhanoui et al. [29] proposed the combination of CNN and Bi-LSTM models for DSA. Kim et al. [16] studied CNN with hyperparameter tuning and pre-trained word vectors for SSA. Sun et al. [34] presented a Neural Network (NN) and BiLSTM to identify the sentiment polarity of opinion words for ASA. Although these models are beneficial for SA, they ignore the different impacts of each word/sentence on semantic analysis. To address the issues of the previous works, the attention models were introduced to give different weights to sentences or words. A recurrent attention LSTM model [40] was presented to get the key sentiment words for DSA. Wang et al. [36] introduced sentence-level attention to look at the differences between each sentence for SSA. Ren et al. [28] proposed the lexicon-enhanced attention network (LEAN) based on BiLSTM for ASA. Most recently, pre-trained language models (PLMs) demonstrated superiority in natural language processing tasks. Lyu et al. [20] used the contextual feature and extra knowledge feature in the BERT model for DSA. Shen et al. 
[32] performed SSA by obtaining contextualized embeddings via the BERT model. Their results showed that the PLM-based models outperformed other models in SA.
2.2 Personalized Document-Level Sentiment Analysis Traditional sentiment analysis only considers the text of the document, while PDSA not only considers the text but also considers other attributes such as user and product attributes. The PDSA methods in this paper are divided into two categories in terms of the considered attributes: (1) user attribute, and (2) user and product attributes. For the first category, Seyler et al. [31] built a CNN + LSTM language model to capture the user’s reviewing style and incorporate language model with users’ historical reviews and rating scores. Zhou et al. [41] utilized the BiLSTM model and attention mechanism to group similar users based on rating behaviors. However, these works ignored the product information that is also useful for PDSA. For the second category, a series of methods [2, 6, 12, 18, 21, 22, 35, 37–39] have been proposed to consider both user and product attributes. These works can be further divided into four types based on the way of using attributes. The first type [35] incorporated external knowledge by concatenating user and product attributes to modify the representations of the words and the documents. The second type [6, 12, 27, 37] used attribute features as a bias in self-attention mechanisms to extract the relations between words and attributes. Chen et al. [6] used a hierarchical NN and the attention mechanism with the LSTM model to consider the user and product attributes. Wu et al. [37] built two hierarchical NN to separately consider user and product attributes with the BiLSTM model and attention mechanism. Ma et al. [21] designed the multiway attention model to generate document representation with the user and product attributes. Different from the above two types, the third type [2, 39] injected the user and product attributes into the model. Zhang et al. [39] used bilinear interaction to inject multiple attributes into attention maps and input token representations. Amplayo et al. 
[2] presented attributes as chunk-wise importance weight matrices and considered embedding, encoding, attention, and classifier locations in the model for injecting attributes. The fourth type [18, 38] used the memory model to learn the user and product representations. Long et al. [18] learned the dual information of users and products by using separate memory networks. Yuan et al. [38] inferred user and product representations from the memory slot. The major problem of these existing works is that they ignored the correlations of rating behaviors between different users. To address this issue, we propose an effective architecture UCM to detect the correlation of users’ rating behavior, which can be useful for PDSA along with the user and product information.
3 Methodology
In this section, our proposed architecture UCM is introduced in detail. First, the problem formulation of PDSA is defined in Sect. 3.1. Second, we introduce the overall structure of UCM in Sect. 3.2. Third, SUCM, which contains the LFM-NN and SC-PCC modules, is discussed in Sect. 3.3. Finally, we describe TABM in Sect. 3.4.
3.1 Problem Formulation
Let ui be the unique ID of the i-th user in the user set U. Let pj be the unique ID of the j-th product in the product set P. Let UR^k_ij be the k-th document-level review of user ui for product pj. Suppose that UR^k_ij contains m sentences, i.e., UR^k_ij = {s1, s2, . . . , sm}, and the m-th sentence sm contains n words, i.e., sm = {w1, w2, . . . , wn}. Let r^k_ij be the rating score given by user ui to product pj in the k-th document-level review. Given ui, pj, and UR^k_ij, the problem of PDSA aims at predicting r^k_ij.
3.2 UCM Architecture
Figure 1 shows the overall architecture of UCM. UCM consists of two parts, namely SUCM and TABM. SUCM clusters users with high correlations in their purchase preferences and rating behaviors. TABM accepts the user cluster IDs (the outputs from SUCM), user IDs, product IDs, and user reviews to predict user rating scores. SUCM contains two modules, i.e., LFM-NN and SC-PCC. LFM-NN is used to address the issue of data sparsity and generate missing user-product rating scores. SC-PCC is used to calculate the correlation degrees between users regarding their rating behaviors. Then SC-PCC performs user clustering in terms of the calculated users' correlation degrees.

3.3 SUCM
SUCM is designed for user clustering. It has two modules: (1) LFM-NN, which is responsible for interpolating missing data, and (2) SC-PCC, which is in charge of clustering the data.

LFM-NN. To interpolate the missing values of the sparse matrix, we use the LFM-NN model to predict the missing entries of the matrix. Specifically, a user-product rating matrix is constructed based on the training data of the dataset. The matrix has two dimensions: the first dimension is the user ID, and the second dimension is the product ID. Each entry of the matrix is a user-product rating score. As discussed earlier, this matrix is very sparse. Let m and n be the number of rows (i.e., users) and the number of columns (i.e., products), respectively. Let e be the number of missing entries of the matrix. The sparse rate (SR) of the user-product rating matrix can be calculated as follows:

SR = e / (m × n)    (1)

Due to the huge number of missing values in the user-product rating matrix, the value of SR tends to be very high (e.g., > 98.8% in Yelp14). It is therefore impossible to calculate the correlations between users' rating behaviors directly upon the original sparse matrix.
To distinguish the original sparse matrix from the matrix after data interpolation, the original matrix is called the sparse attribute matrix and the matrix after data interpolation is called the completed attribute matrix. Let R̂ be the sparse attribute matrix. All the products in R̂ are assumed to be described by the same set of features, e.g., {size, color, price}. Inspired by Matrix Factorization [23, 24] and the Latent Factor Model (LFM) [33], R̂ is factorized into two
lower-dimensional submatrices U and P:

R̂ = U × P^T    (2)
where U is the user latent matrix and P is the product latent matrix. Each row of U denotes a specific user, and each column of U denotes different users' preference degrees toward a specific feature of the products. Conversely, each column of P^T denotes a specific product, and each row of P^T denotes the existing degree of a specific feature across the different products. Let ui be the i-th user (i-th row) in U and pj be the j-th product (j-th column) in P^T. Thus, ui is a user feature vector, and pj is a product feature vector. Let ui,k be the preference degree of user ui toward the k-th product feature. Let pk,j be the existing degree of the k-th product feature in product pj. Let K be the total number of product features. The product of ui and pj can therefore be represented as follows:

ui · pj = Σ_{k=1}^{K} ui,k · pk,j    (3)
In the real world, some users are pickier and tend to give lower rating scores, while others are more generous and tend to give higher rating scores. Similarly, a product with higher historical rating scores tends to be predicted by the model to get a high rating score, even if the current user experience is not good. Based on this, two bias parameters, b_{u_i} and b_{p_j}, are introduced to denote the rating biases of user u_i and product p_j, respectively. Let \hat{r}_{ij} be the predicted value of the entry at the i-th row and j-th column in \hat{R}. Then \hat{r}_{ij} can be represented as the sum of u_i p_j, b_{u_i}, and b_{p_j}:

\hat{r}_{ij} = u_i p_j + b_{u_i} + b_{p_j}    (4)
Then, the mean squared error (MSE) with a penalty term is used as the loss function to train the LFM-NN model. MSE is suitable for our regression task of calculating the missing values of the sparse attribute matrix, and its quadratic form penalizes large errors more heavily than small ones. Let L(f) be the loss function:

L(f) = \sum_{1 \le i \le m, 1 \le j \le n} \left( r_{i,j} - \hat{r}_{i,j} \right)^2 + \lambda \left( \|U\|^2 + \|P\|^2 + b_{u_i}^2 + b_{p_j}^2 \right)    (5)

where \lambda is the regularization rate, and the penalty term \lambda ( \|U\|^2 + \|P\|^2 + b_{u_i}^2 + b_{p_j}^2 ) avoids over-fitting by penalizing the magnitude of the parameters. Our objective is to minimize the defined loss function:

\arg\min_{U, P, b_u, b_p} L(f)    (6)
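The factorization objective of Eqs. (2)–(6) can be sketched in NumPy as follows. This is a minimal sketch, not the paper's implementation: the observed ratings, dimensions, and hyperparameters are illustrative, and plain SGD stands in for the Adam optimizer the paper uses.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K = 4, 5, 3              # users, products, latent features (toy sizes)
lam, lr = 0.01, 0.05           # regularization rate and learning rate

# Observed entries of the sparse attribute matrix: (user i, product j, rating).
obs = [(0, 1, 4.0), (0, 3, 5.0), (1, 0, 2.0), (2, 2, 3.0), (3, 4, 4.0)]

U = rng.normal(scale=0.1, size=(m, K))   # user latent matrix
P = rng.normal(scale=0.1, size=(n, K))   # product latent matrix (rows = products)
bu, bp = np.zeros(m), np.zeros(n)        # user / product rating biases

for _ in range(300):
    for i, j, r in obs:
        err = r - (U[i] @ P[j] + bu[i] + bp[j])   # residual of Eq. (4)
        # SGD step on Eq. (5): squared error plus L2 penalty on all parameters.
        grad_u = err * P[j] - lam * U[i]
        grad_p = err * U[i] - lam * P[j]
        U[i] += lr * grad_u
        P[j] += lr * grad_p
        bu[i] += lr * (err - lam * bu[i])
        bp[j] += lr * (err - lam * bp[j])

# Interpolate one missing entry to fill the completed attribute matrix.
r_hat = U[0] @ P[2] + bu[0] + bp[2]
```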
Finally, we use the trained LFM-NN model to interpolate the missing values of the sparse attribute matrix and thereby form the completed attribute matrix.

462

J. Qiu et al.

Fig. 1. The overall architecture of UCM. It consists of two components, namely SUCM and TABM. SUCM is designed for user clustering. SUCM contains two modules, i.e., LFM-NN and SC-PCC. LFM-NN interpolates missing data, and SC-PCC is in charge of data clustering. TABM encodes the attributes, such as the user cluster ID generated by the SC-PCC module, the user ID and the product ID, while processing the document-level review.

The completed attribute matrix is composed of the interpolated rating scores and the original rating scores. As shown in Fig. 1, the dark color represents the interpolated rating scores predicted by the trained LFM-NN model, and the light color represents the original rating scores.

SC-PCC. To cluster the users with high correlations, SC-PCC is proposed to calculate the users' correlations and cluster users with high correlation. Specifically, the Pearson Correlation Coefficient Matrix of users' correlations is computed based on the completed attribute matrix.
Then, we use spectral clustering to cluster users based on the Pearson Correlation Coefficient Matrix rather than on the affinity matrix as in the traditional way. The spectral clustering algorithm [25] clusters users with high correlations; it is based on spectral graph theory and groups similar data into the same cluster. It is suitable for nonlinear cluster shapes because it makes no assumption about the shapes of the clusters [1]. This fits our problem perfectly, since the shape of the user clusters is not known in advance. The Pearson Correlation Coefficient (PCC) [5] reflects the linear correlation between users; in particular, it can reflect the linear correlation between two users' ratings over the products both have purchased. In principle, PCC is the ratio between the covariance of two variables and the product of their standard deviations. Let \rho be the value of PCC for a pair of random variables (X, Y). It is calculated as follows:

\rho_{X,Y} = \frac{cov(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}    (7)
Algorithm 1 briefly summarizes the process of SC-PCC. The detailed steps of the SC-PCC algorithm are described as follows:

(i) Construct the Pearson Correlation Coefficient Matrix C. Let \bar{r}_{u_i} be the average rating score of the i-th user, and let r_{u_i,w} denote the rating score of user u_i for the w-th product. Let P_{ij} be the set of products reviewed by both the i-th user and the j-th user. The correlation degree between the i-th user and the j-th user is calculated as follows:

C_{ij} = \frac{\sum_{w \in P_{ij}} (r_{u_i,w} - \bar{r}_{u_i})(r_{u_j,w} - \bar{r}_{u_j})}{\sqrt{\sum_{w \in P_{ij}} (r_{u_i,w} - \bar{r}_{u_i})^2} \sqrt{\sum_{w \in P_{ij}} (r_{u_j,w} - \bar{r}_{u_j})^2}}    (8)

The range of C_{ij} is [-1, 1]. A value greater than 0 indicates a positive correlation, a negative value indicates a negative correlation, a value of 0 means there is no correlation between the users, and a value of 1 indicates a perfect positive correlation. Then, negate C to get the distance matrix T:

T = -C    (9)
(ii) Transform the distance matrix T into the affinity matrix AM by using the Gaussian Kernel:
AM = \exp\left( -\frac{x^2}{2\delta^2} \right)    (10)

where \delta = 0.5 is a free parameter controlling the width of the Gaussian kernel and x is an entry value of the distance matrix T.

(iii) Construct the similarity matrix S by computing the weighted k-nearest-neighbor graph [19] of the affinity matrix AM.

(iv) Build the adjacency matrix A based on S, using a threshold of 0.5 to evaluate the degree of similarity. If an entry in S is greater than 0.5, the corresponding adjacency matrix entry is 1; otherwise, it is 0:

A_{ij} = \begin{cases} 1 & S_{ij} > 0.5 \\ 0 & S_{ij} \le 0.5 \end{cases}    (11)

Then, calculate the Laplacian matrix L from the degree matrix D and A as follows:

L = D - A    (12)
(v) Calculate the eigenvalues and eigenvectors of the Laplacian matrix:

L v = \lambda v    (13)
where \lambda is an eigenvalue and v the corresponding eigenvector.

(vi) Construct the feature matrix F, which contains the eigenvectors as columns, where k is the number of clusters, and then standardize it:

F = Std([v_1, v_2, v_3, \ldots, v_k])    (14)
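Steps (i)–(vi) can be sketched in NumPy as follows. This is a simplified sketch: the kNN weighting in step (iii) and the choice of the k smallest-eigenvalue eigenvectors in step (vi) are assumptions about details the text leaves open.

```python
import numpy as np

def sc_pcc_features(R_completed, k, delta=0.5, knn=3):
    """Steps (i)-(vi): completed attribute matrix -> standardized feature matrix F."""
    C = np.corrcoef(R_completed)            # (i)  Pearson Correlation Coefficient Matrix, Eq. (8)
    T = -C                                  # Eq. (9): distance matrix
    AM = np.exp(-T**2 / (2 * delta**2))     # (ii) Gaussian-kernel affinity, Eq. (10)
    S = np.zeros_like(AM)                   # (iii) keep each row's knn strongest affinities
    for i, row in enumerate(AM):
        top = np.argsort(row)[-knn:]
        S[i, top] = row[top]
    S = np.maximum(S, S.T)                  # symmetrize the kNN graph
    A = (S > 0.5).astype(float)             # (iv) thresholded adjacency matrix, Eq. (11)
    L = np.diag(A.sum(axis=1)) - A          # Eq. (12): Laplacian L = D - A
    _, vecs = np.linalg.eigh(L)             # (v)  eigendecomposition, Eq. (13)
    F = vecs[:, :k]                         # (vi) k eigenvectors as columns
    return (F - F.mean(axis=0)) / (F.std(axis=0) + 1e-12)   # Eq. (14): standardize

# Rows of the returned F are then clustered (e.g., by k-means) in steps (vii)-(viii).
```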
(vii) Cluster all m user points based on F. Each row of F represents a user feature vector. Let u_i be the vector corresponding to the i-th row of F. Cluster the user point u_i to obtain its user cluster ID.

(viii) Output the overall clustering results as the users' cluster IDs.

3.4 TABM

As shown in Fig. 1, the TABM model encodes the user cluster ID generated by the SC-PCC module. It is based on the MA-BERT model [39]. Unlike traditional BERT [10], which can only process documents, MA-BERT can process documents together with multiple attributes. Let E_i denote an attribute embedding, where i \in {1, 2, 3}: E_1 is the user cluster ID embedding, E_2 is the user ID embedding, and E_3 is the product ID embedding. The document-level review is input into the BERT model to obtain the hidden states of the review. In the TABM module, the attribute attention adopts bilinear feature interaction [15] in Q_i, K_i, and V_i to learn the interaction information between the document and the triple attributes:

Q_i = UR \cdot W_{q,i} E_i    (15)
K_i = UR \cdot W_{k,i} E_i    (16)

V_i = UR \cdot W_{v,i} E_i    (17)
where UR is the representation of the document-level review; Q_i, K_i, and V_i are the queries, keys, and values corresponding to the i-th attribute; and W_{q,i}, W_{k,i}, and W_{v,i} are the query, key, and value weights of the i-th attribute attention. The Attribute BERTSelfOutput consists of two feed-forward network layers, a normalization layer, and a dropout layer. The Attribute BERTIntermediate includes a feed-forward network layer followed by a Gaussian Error Linear Unit (GELU) [14]. The Attribute BERTOutput class has a feed-forward network layer, a normalization layer, and a dropout layer. It outputs the final document-level review representation h_{[CLS]}, which is combined with the attribute information. Finally, a classifier consisting of a linear projection and a softmax activation function performs sentiment classification.
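One plausible NumPy reading of the attribute attention in Eqs. (15)–(17) is sketched below. Interpreting "·" as element-wise modulation of the review representation by the projected attribute, as well as all dimensions, are assumptions; see MA-BERT [39] for the exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 16                       # review length, hidden size (toy sizes)
UR = rng.normal(size=(seq_len, d))       # document-level review representation
E = rng.normal(size=(3, d))              # cluster-ID, user-ID, product-ID embeddings
Wq = rng.normal(size=(3, d, d)) / np.sqrt(d)   # per-attribute query weights
Wk = rng.normal(size=(3, d, d)) / np.sqrt(d)   # per-attribute key weights
Wv = rng.normal(size=(3, d, d)) / np.sqrt(d)   # per-attribute value weights

def attribute_attention(i):
    # Eqs. (15)-(17): bilinear interaction between the review and attribute i.
    Q = UR * (Wq[i] @ E[i])              # Q_i = UR . (W_{q,i} E_i)
    K = UR * (Wk[i] @ E[i])              # K_i = UR . (W_{k,i} E_i)
    V = UR * (Wv[i] @ E[i])              # V_i = UR . (W_{v,i} E_i)
    scores = Q @ K.T / np.sqrt(d)        # scaled dot-product over tokens
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = scores / scores.sum(axis=-1, keepdims=True)
    return weights @ V                   # attribute-aware review representation

h = attribute_attention(0)               # e.g., attention for the user cluster ID
```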
4 Experiment

4.1 Experiment Setting

Datasets. The three real-world datasets (i.e., IMDB, Yelp13, and Yelp14) were all collected by Tang et al. [35] and are used to evaluate the performance. These datasets contain user reviews, user IDs, and product IDs. Each dataset is split into train, dev, and test sets at a ratio of 8:1:1. Table 1 shows the details of the three real-world datasets.
Table 1. The details of IMDB, Yelp13 and Yelp14 datasets.

| Dataset | #Doc    | #Train  | #Dev   | #Test  | #User | #Product | #Word/Doc |
|---------|---------|---------|--------|--------|-------|----------|-----------|
| IMDB    | 84,919  | 67,426  | 8,381  | 9,112  | 1,310 | 1,635    | 394.6     |
| Yelp13  | 78,966  | 62,522  | 7,773  | 8,671  | 1,631 | 1,633    | 189.3     |
| Yelp14  | 231,163 | 183,019 | 22,745 | 25,399 | 4,818 | 4,194    | 196.9     |
Evaluation Metrics. Accuracy measures the overall sentiment classification performance. RMSE measures the error divergence between the predicted labels and the ground-truth labels.

Accuracy = \frac{T}{N}    (18)

RMSE = \sqrt{\frac{\sum_{k=1}^{N} (y_k - \hat{y}_k)^2}{N}}    (19)
where T is the number of correctly predicted sentiment labels, N is the total number of labels, y_k is the ground-truth label, and \hat{y}_k is the predicted label.
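The two metrics of Eqs. (18)–(19) can be sketched as follows (the toy labels are illustrative):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Eq. (18): T correct predictions out of N labels.
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def rmse(y_true, y_pred):
    # Eq. (19): root of the mean squared label error.
    d = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

y_true = [1, 2, 3, 4, 5]
y_pred = [1, 2, 4, 4, 3]
print(accuracy(y_true, y_pred))   # 3 of 5 correct -> 0.6
print(rmse(y_true, y_pred))       # sqrt((0+0+1+0+4)/5) = 1.0
```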
Baselines. Seventeen closely related approaches are used as baselines, divided into (a) approaches without user and product attributes and (b) approaches with user and product attributes. Firstly, there are six baselines without user and product attributes: UPNN (no up) [35] is based on CNN and only accepts document-level reviews. NSC [6] designed a Neural Sentiment Classification (NSC) model based on LSTM. NSC + LA [6] implemented local semantic attention (LA) with NSC. NSC + LA (BiLSTM) [37] is based on BiLSTM and NSC. BERT [39] adopted the pre-trained bidirectional transformer with masked attention. ToBERT [26] modified BERT for unlimited sequence length. Secondly, there are 11 baselines with user and product attributes: UPNN (CNN) [35] is based on CNN to capture attribute features. LUPDR [7] used temporal sequences to learn the distributed representation of attributes. UPDMN [12] captured attribute information based on a deep memory network. NSC + UPA (BiLSTM) [37] combined attribute features based on BiLSTM, NSC, and an attention mechanism. PMA [27] used parallel multi-feature attention to capture attribute information. DUPMN [18] adopted a dual memory network to learn the dual information of attributes. CMA [21] considered the cascading effect of attributes for document representation. CHIM [2] injected the attribute embeddings into different positions of the model. HUAPA [37] used hierarchical attribute attention to encode information. RRP-UPM [38] adopted a memory network to learn attribute knowledge representations. MA-BERT [39] incorporated multi-attribute knowledge into BERT for attribute representation learning.
4.2 Implementation Details

Programming Language. All experiments are conducted in Python with the PyTorch, MXNet, NumPy, and scikit-learn packages.

Data Preprocessing. The following steps are performed to preprocess the data: (1) Map each user ID and each product ID to a unique integer. (2) Extract each user ID, product ID, and rating score at different times from the dataset. (3) Construct the sparse attribute matrix based on the extracted results; each entry of the matrix is the average rating score of a user toward a product. (4) Fill in the missing values of the sparse attribute matrix to construct the completed attribute matrix.

Hyperparameters. LFM-NN is trained with the Adam optimizer using an initial learning rate of 0.002, a weight decay of 0.0001, and 20 epochs. The cluster number is 40 for IMDB and Yelp13, and 120 for Yelp14 when training SC-PCC. TABM uses the Adam optimizer with a learning rate of 0.00002, a warm-up rate of 0.1, a weight decay of 0.01, early stopping with a patience of 3, and a batch size of 8. The dropout rate is 0.2 to avoid overfitting.

Hardware. The experiments are conducted on two NVIDIA Quadro RTX 8000 GPUs, each with 48 GB of GPU memory.
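The preprocessing steps (1)–(3) can be sketched as follows (the raw records are hypothetical; repeated ratings of the same user-product pair are averaged, per step (3)):

```python
import numpy as np
from collections import defaultdict

# Hypothetical raw records: (user ID, product ID, rating score, timestamp).
records = [("u1", "p1", 4, 10), ("u1", "p1", 5, 20), ("u2", "p2", 3, 15)]

# (1) Map each user ID and product ID to a unique integer.
user_ids = {u: i for i, u in enumerate(sorted({r[0] for r in records}))}
prod_ids = {p: j for j, p in enumerate(sorted({r[1] for r in records}))}

# (2) Collect each user's rating scores per product across time.
scores = defaultdict(list)
for u, p, r, _ in records:
    scores[(user_ids[u], prod_ids[p])].append(r)

# (3) Sparse attribute matrix: each entry is the average rating; NaN = missing.
R = np.full((len(user_ids), len(prod_ids)), np.nan)
for (i, j), rs in scores.items():
    R[i, j] = np.mean(rs)
# (4) NaN entries are later interpolated by the trained LFM-NN model.
```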
4.3 Comparative Results

Table 2 shows the experimental results of our architecture and the baselines. The results are divided into two parts, i.e., models without user and product attributes and models with user and product attributes.

In the first part, we compare models without user and product attributes. As shown in Table 2, UPNN (no up) has the worst performance in terms of both accuracy and RMSE. This is because it only uses a traditional CNN model to analyze document-level reviews, which is weaker than the other models in terms of information extraction capability. NSC, NSC + LA, and NSC + LA (BiLSTM) all outperform UPNN (no up), which shows that LSTM and BiLSTM are more suitable for sequential data analysis and document representation. Within this part, ToBERT and BERT achieve the best performance over all other baselines on the IMDB, Yelp13, and Yelp14 datasets. These results demonstrate the effectiveness and superiority of PLMs.

In the second part, we compare models with user and product attributes. Our proposed architecture UCM outperforms all 17 baselines in terms of accuracy and RMSE on the three datasets. Though UPNN (CNN) is the worst model in this part, it performs better than UPNN (no up), which shows that user and product attributes are helpful for the representation of document-level reviews. UPDMN is better than UPNN (CNN), which proves that a deep memory network is more suitable than CNN for PDSA. LUPDR outperforms UPDMN because it adopts temporal sequences to learn the distributed representations of users and products. DUPMN is better than LUPDR because it learns the dual information of users and products. PMA outperforms DUPMN on the IMDB dataset but performs worse on the Yelp13 and Yelp14 datasets. CMA is better than all the previous baseline models because it considers the cascading effect of user and product attributes. HUAPA outperforms CMA on all three datasets, which shows that hierarchical user attention and product attention are better than cascading multiway attention. CHIM achieves higher accuracy than HUAPA on the IMDB and Yelp14 datasets, which shows that the injection position of the user and product attribute embeddings is important for PDSA. RRP-UPM outperforms all the previous baselines on the Yelp13 dataset. MA-BERT is retrained in our experimental environment and achieves the best performance over all the other 16 baselines because it incorporates multi-attribute knowledge into the transformer layers of BERT. These results show that incorporating user and product attributes can enhance the classification performance of models.

In summary, as Table 2 shows, our proposed architecture UCM outperforms all the baselines on the three datasets, which proves that the proposed architecture is superior and effective for PDSA. Specifically, compared with MA-BERT, UCM improves the accuracy by 2.1% on IMDB, 1.4% on Yelp13, and 1.3% on Yelp14, and lowers the RMSE by 1.4% on IMDB, 0.7% on Yelp13, and 4.7% on Yelp14. The proposed architecture UCM is based on PLMs and considers user and product attributes. In addition, it considers the correlation between users' rating behaviors to capture more implicit information from the data. The results prove that user correlation is crucial for capturing implicit useful information from document-level reviews. To the best of our knowledge, UCM is the first architecture that mines user correlation and uses the user cluster ID as an extra attribute for PDSA. Despite sparse data, UCM is still capable
Table 2. Experimental results of baselines and UCM. The best performance is presented in bold. The values with * are retrieved from the corresponding references.

| Models | IMDB Accuracy | IMDB RMSE | Yelp13 Accuracy | Yelp13 RMSE | Yelp14 Accuracy | Yelp14 RMSE |
|---|---|---|---|---|---|---|
| *Models without user and product attributes* |  |  |  |  |  |  |
| UPNN (no up) [35] | 0.405* | 1.629* | 0.577* | 0.812* | 0.585* | 0.808* |
| NSC [6] | 0.443* | 1.465* | 0.627* | 0.701* | 0.637* | 0.686* |
| NSC + LA [6] | 0.487* | 1.381* | 0.631* | 0.706* | 0.630* | 0.715* |
| NSC + LA (BiLSTM) [37] | 0.490* | 1.325* | 0.638* | 0.691* | 0.646* | 0.678* |
| ToBERT [26] | 0.508* | 1.194* | 0.667* | 0.626* | 0.669* | 0.620* |
| BERT [39] | 0.518* | 1.191* | 0.677* | 0.627* | 0.672* | 0.630* |
| *Models with user and product attributes* |  |  |  |  |  |  |
| UPNN (CNN) [35] | 0.435* | 1.602* | 0.596* | 0.784* | 0.608* | 0.764* |
| UPDMN [12] | 0.465* | 1.351* | 0.639* | 0.662* | 0.639* | 0.662* |
| LUPDR [7] | 0.488* | 1.451* | 0.639* | 0.694* | 0.639* | 0.688* |
| NSC + UPA (BiLSTM) [37] | 0.529* | 1.247* | 0.655* | 0.672* | 0.669* | 0.654* |
| DUPMN [18] | 0.539* | 1.279* | 0.662* | 0.667* | 0.676* | 0.639* |
| PMA [27] | 0.540* | 1.301* | 0.658* | 0.668* | 0.675* | 0.641* |
| CMA [21] | 0.540* | 1.191* | 0.664* | 0.677* | 0.676* | 0.637* |
| HUAPA [37] | 0.550* | 1.185* | 0.683* | 0.628* | 0.686* | 0.626* |
| CHIM [2] | 0.564* | 1.161* | 0.678* | 0.646* | 0.692* | 0.629* |
| RRP-UPM [38] | 0.562* | 1.174* | 0.690* | 0.629* | 0.691* | 0.621* |
| MA-BERT [39] | 0.565 | 1.059 | 0.697 | 0.595 | 0.708 | 0.601 |
| UCM | **0.577** | **1.044** | **0.707** | **0.591** | **0.717** | **0.573** |
of mining correlations among users and clustering them. Besides, the results show that UCM consistently outperforms the baselines on both large and small datasets.

4.4 Ablation Study

To further evaluate the effectiveness of user correlation, an experiment with our architecture UCM without user cluster IDs is conducted. Table 3 shows the experimental results on the three datasets, i.e., IMDB, Yelp13, and Yelp14. Specifically, embedding user cluster IDs improves accuracy by 1.4% on IMDB, by 0.7% on Yelp13, and by 1.4% on
Yelp14. Besides, embedding user cluster IDs lowers the RMSE by 3.4% on IMDB, by 1.2% on Yelp13, and by 3.0% on Yelp14. The results verify that user correlation is crucial for PDSA. It is beneficial to embed the user cluster ID into the TABM module for capturing information about the person within a specific cluster. Furthermore, the study shows that user correlation is helpful when performing experiments on datasets of different sizes.

Table 3. Ablation study results of UCM. 'w/o' means 'without'.

| Models | IMDB Accuracy | IMDB RMSE | Yelp13 Accuracy | Yelp13 RMSE | Yelp14 Accuracy | Yelp14 RMSE |
|---|---|---|---|---|---|---|
| w/o user cluster IDs | 0.569 | 1.080 | 0.702 | 0.598 | 0.707 | 0.591 |
| UCM | 0.577 | 1.044 | 0.707 | 0.591 | 0.717 | 0.573 |
4.5 Case Study

To further evaluate the effectiveness of SUCM, a subset of users with the same user cluster ID is chosen from the Yelp13 dataset. The results indicate that there is in fact a correlation between these users. Specifically, five users (i.e., users A to E) are randomly selected from the same cluster, and their correlations are computed with our SC-PCC. Figure 2(a) shows the correlation heat map of the five users. All five users have high correlations (> 0.50) with each other, and more than half of the pairwise correlations are greater than 0.80. The correlation between user D and user E, at 0.64, is relatively low among the five users. We therefore randomly select five products that both user D and user E have rated and display their rating scores in Fig. 2(b). The rating scores of the five products are 4, 4, 5, 5, 5 from user D and 3, 2, 4, 3, 3 from user E, which is consistent with the calculated correlation value between user D and user E.
Fig. 2. The correlation values between the users. (a) The user correlation heat map. (b) The rating scores of five products from user D and user E.
5 Conclusion

In this paper, an architecture named UCM is proposed. UCM exploits the correlations between users' rating behaviors and clusters users with high correlations. UCM consists of two modules, namely SUCM and TABM. To evaluate the performance of UCM, experiments are conducted on three real-world datasets. The experimental results show that our proposed architecture achieves better performance than the baselines. In the future, other clustering algorithms can be compared against SC-PCC to evaluate its effectiveness, and more comprehensive interpretability analysis is needed to further improve our architecture.

Acknowledgment. This work was supported in part by the Macao Polytechnic University – Big Data-Driven Intelligent Computing (RP/ESCA-05/2020).
References

1. Aggarwal, C.C., Reddy, C.K.: Data Clustering: Algorithms and Applications. CRC Press (2014)
2. Amplayo, R.K.: Rethinking attribute representation and injection for sentiment classification (2019)
3. Appel, O., Chiclana, F., Carter, J., Fujita, H.: A hybrid approach to the sentiment analysis problem at the sentence level. Knowl.-Based Syst. 108, 110–124 (2016)
4. Behdenna, S., Barigou, F., Belalem, G.: Document level sentiment analysis: a survey. CASA 4(13) (2018)
5. Benesty, J., Chen, J., Huang, Y., Cohen, I.: Pearson correlation coefficient. In: Noise Reduction in Speech Processing, pp. 1–4 (2009)
6. Chen, H., Sun, M., Tu, C., Lin, Y., Liu, Z.: Neural sentiment classification with user and product attention. In: EMNLP, pp. 1650–1659 (2016)
7. Chen, T., Xu, R., He, Y., Xia, Y., Wang, X.: Learning user and product distributed representations using a sequence model for sentiment analysis. In: IEEE CIM (2016)
8. Crisci, A., Grasso, V., Nesi, P., Pantaleo, G., Paoli, I., Zaza, I.: Predicting TV programme audience by using twitter based metrics. Multimedia Tools Appl. 77(10), 12203–12232 (2018)
9. Denecke, K., Deng, Y.: Sentiment analysis in medical settings: new opportunities and challenges. Artif. Intell. Med. 64(1), 17–27 (2015)
10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (2019)
11. Dolianiti, F.S., Iakovakis, D., Dias, S.B., Hadjileontiadou, S., Diniz, J.A., Hadjileontiadis, L.: Sentiment analysis techniques and applications in education: a survey. In: TECH-EDU, pp. 412–427 (2018)
12. Dou, Z.Y.: Capturing user and product information for document level sentiment analysis with deep memory network. In: EMNLP, pp. 521–526 (2017)
13. Du, C.H., Tsai, M.F., Wang, C.J.: Beyond word-level to sentence-level sentiment analysis for financial reports. In: ICASSP, pp. 1562–1566 (2019)
14. Hendrycks, D., Gimpel, K.: Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016)
15. Huang, T., Zhang, Z., Zhang, J.: FiBiNET: combining feature importance and bilinear feature interaction for click-through rate prediction. In: RecSys, pp. 169–177 (2019)
16. Kim, Y.: Convolutional neural networks for sentence classification. In: EMNLP, pp. 1746–1751 (2014)
17. Li, G., Hoi, S.C., Chang, K., Jain, R.: Micro-blogging sentiment detection by collaborative online learning. In: ICDM, pp. 893–898 (2010)
18. Long, Y., Ma, M., Lu, Q., Xiang, R., Huang, C.R.: Dual memory network model for biased product review classification. In: WASSA (2018)
19. Lucińska, M., Wierzchoń, S.T.: Spectral clustering based on k-nearest neighbor graph. In: Cortesi, A., Chaki, N., Saeed, K., Wierzchoń, S. (eds.) CISIM 2012. LNCS, vol. 7564, pp. 254–265. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33260-9_22
20. Lyu, C., Ji, T., Graham, Y.: Incorporating context and knowledge for better sentiment analysis of narrative text. In: Text2Story@ECIR, pp. 39–45 (2020)
21. Ma, D., Li, S., Zhang, X., Wang, H., Sun, X.: Cascading multiway attentions for document-level sentiment classification. In: IJCNLP, pp. 634–643 (2017)
22. Medhat, W., Hassan, A., Korashy, H.: Sentiment analysis algorithms and applications: a survey. Ain Shams Eng. J. 5(4), 1093–1113 (2014)
23. Mehta, R., Rana, K.: A review on matrix factorization techniques in recommender systems. In: CSCITA. IEEE (2017)
24. Mnih, A., Salakhutdinov, R.R.: Probabilistic matrix factorization. In: Advances in Neural Information Processing Systems (2007)
25. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Advances in Neural Information Processing Systems, vol. 14 (2001)
26. Pappagari, R., Zelasko, P., Villalba, J., Carmiel, Y., Dehak, N.: Hierarchical transformers for long document classification. In: IEEE ASRU (2019)
27. Pengcheng, Z., Yujiu, Y.: Parallel multi-feature attention on neural sentiment classification. In: SoICT, pp. 181–188 (2017)
28. Ren, Z., Zeng, G., Chen, L., Zhang, Q., Zhang, C., Pan, D.: A lexicon-enhanced attention network for aspect-level sentiment analysis. IEEE Access 8, 93464–93471 (2020)
29. Rhanoui, M., Mikram, M., Yousfi, S., Barzali, S.: A CNN-BiLSTM model for document-level sentiment analysis. Mach. Learn. Knowl. Extract. 1(3), 832–847 (2019)
30. Schouten, K., Frasincar, F.: Survey on aspect-level sentiment analysis. IEEE TKDE 28(3), 813–830 (2015)
31. Seyler, D., Shen, J., Xiao, J., Wang, Y., Zhai, C.: Leveraging personalized sentiment lexicons for sentiment analysis. In: ICTIR, pp. 109–112 (2020)
32. Shen, J., Liao, X., Tao, Z.: Sentence-level sentiment analysis via BERT and BiGRU. In: 2019 International Conference on Image and Video Processing, and Artificial Intelligence, pp. 658–663 (2019)
33. Song, K., Feng, S., Gao, W., Wang, D., Yu, G., Wong, K.F.: Personalized sentiment classification based on latent individuality of microblog users. In: IJCAI (2015)
34. Sun, K., Zhang, R., Mensah, S., Mao, Y., Liu, X.: Aspect-level sentiment analysis via convolution over dependency tree. In: EMNLP-IJCNLP, pp. 5679–5688 (2019)
35. Tang, D., Qin, B., Liu, T.: Learning semantic representations of users and products for document level sentiment classification. In: ACL-IJCNLP, pp. 1014–1023 (2015)
36. Wang, P., Li, J., Hou, J.: S2SAN: a sentence-to-sentence attention network for sentiment analysis of online reviews. Decis. Support Syst. 149, 113603 (2021)
37. Wu, Z., Dai, X.Y., Yin, C., Huang, S., Chen, J.: Improving review representations with user attention and product attention for sentiment classification. In: AAAI (2018)
38. Yuan, Z., Wu, F., Liu, J., Wu, C., Huang, Y., Xie, X.: Neural review rating prediction with user and product memory. In: CIKM, pp. 2341–2344 (2019)
39. Zhang, Y., Wang, J., Yu, L.C., Zhang, X.: MA-BERT: learning representation by incorporating multi-attribute knowledge in transformers. In: ACL-IJCNLP, pp. 2338–2343 (2021)
40. Zhang, Y., Wang, J., Zhang, X.: Conciseness is better: recurrent attention LSTM model for document-level sentiment analysis. Neurocomputing 462, 101–112 (2021)
41. Zhou, D., Zhang, M., Zhang, L., He, Y.: A neural group-wise sentiment analysis model with data sparsity awareness. In: AAAI, pp. 14594–14601 (2021)
Multi-modal Rumor Detection on Modality Alignment and Multi-perspective Structures Boqun Li, Zhong Qian, Peifeng Li, and Qiaoming Zhu(B) School of Computer Science and Technology, Soochow University, Suzhou, China [email protected], {qianzhong,pfli,qmzhu}@suda.edu.cn
Abstract. Due to the rapid spread of rumors on social media and their negative impact on society, real-time rumor detection is of utmost importance. Although some rumor detection methods have exploited temporal or graphic structural information, they do not consider multiple structures to obtain better representations. Besides, since authors may post only text in real-world social scenarios, image modalities can be inaccessible in multi-modal rumor detection. To solve the above issues, we propose a Multi-Modal rumor detection model on Modality Alignment and multi-Perspective Structures (M3APS). The model uses an image generation method to fill the inaccessible image modalities in the multi-modal heterogeneous node pairs and fuses each node pair to obtain multi-modal features. Then, the "debunkers" extracted from the perspectives of the temporal structure and the graphic structure query the events described in the source tweet, respectively. Experimental results on three popular datasets show that our model M3APS is superior to the state-of-the-art baselines.

Keywords: Multi-modal rumor detection · Temporal structure · Graphic structure · Modality alignment
1 Introduction

With the popularity of social media and news websites, rumors and false information are becoming more and more widespread on the Internet. Some false information may cause social unrest, mislead people's decision-making, and even cause serious social consequences. Therefore, accurately identifying and refuting rumors has become an important task for protecting public security and maintaining healthy public opinion. In recent years, multi-modal rumor detection has attracted wide attention; it combines different modalities of information (e.g., texts, images, and videos) to obtain more comprehensive and accurate information, thereby improving the performance of rumor detection. However, current multi-modal rumor detection methods still have some issues. As shown in Fig. 1, the image modalities of tweet b and tweet d are inaccessible, so the model has to fill them with random vectors, which inevitably introduces noise. Moreover, tweet d behaves differently under different structures. In the temporal structure, tweet d is considered as important as other reply nodes and can interact directly with the source tweet a.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 472–483, 2023. https://doi.org/10.1007/978-981-99-4752-2_39

However, in graphic
structure, tweet d can only interact with the source tweet a through tweet b. Hence, the temporal structure and the graphic structure should complement each other in rumor detection.
Fig. 1. An example of multi-modal tweets
Regarding the issue of inaccessible modalities, several methods have been proposed to address it: (1) training separate models for each specific scenario, which is inefficient and scales poorly; (2) imputation, which replaces inaccessible values with average or random values of the modality, inevitably introducing some noise. For example, Sun et al. [14] used random values as pseudo-features to handle inaccessible information. In terms of structure, most previous methods used only temporal or graphic information. For example, Ma et al. [11] and Kumar et al. [8] organized the source tweets and their corresponding replies into a tree structure. Khoo et al. [7] argued that the conversations in reply branches also affect global information and flattened the conversation tree into a temporal sequence structure. To address the above issues, we design a Multi-Modal rumor detection model on Modality Alignment and multi-Perspective Structures (M3APS). The model absorbs the advantages of the temporal and graphic structures and complements the inaccessible image modality. Specifically, the model uses Multi-Perspective Modeling (MPM) to model each sample as a temporal structure and a multi-modal heterogeneous graphic structure, respectively. For each pair of multi-modal nodes, the Modality Alignment Module (MAM) completes and fuses the inaccessible modalities to obtain a multi-modal representation. Then, the Debunker Extraction Module (DEM) obtains query features from the two propagation perspectives of the temporal structure and the graphic structure, respectively. Finally, in the Multi-Perspective Query (MPQ) module, the temporal and graphic debunking features are used to query the events in the source tweet, and the model feeds the results into the classifier. Experimental results on three popular datasets show that our model M3APS is superior to the state-of-the-art baselines.
To sum up, our contributions are as follows: (1) We introduce modality alignment to circulate information between the text domain and the image domain, and account for the noise caused by inaccessible modalities.
474
B. Li et al.
(2) We construct and infer samples from multiple perspectives of temporal and graphic structures, avoiding the bias brought by a single structure.
2 Related Work 2.1 Text-Modal Rumor Detection Various methods using text modality have been proposed in rumor detection. Among them, two propagation structures are mainly used. Temporal Structure. Li et al. [9] detected rumors by incorporating user information, attention mechanism, and multi-task learning, adding user credibility information to the rumor detection layer, and giving more attention to important tweets using attentionbased LSTM. Xia et al. [19] considered the dynamic evolution of events and captured different features for different time states to adapt to different stages of event development. Wang et al. [16] extracted the propagation information of time series through a recursive neural network. Graphic structure. Ma et al. [11] organized the source tweets and their corresponding replies into a tree structure, used a recursive neural network to simulate the propagation of information in the propagation tree, and aggregated signals from different nodes in a bottom-up or top-down manner. Han et al. [4] considered fraudulent nodes in the conversation tree and used a self-attention mechanism to give nodes credibility, proposed a credibility-based stance-aware recursive tree model. To capture social features, SBAG [5] trained a multilayer perceptron and used the learned feature network graph to simulate the early rumor propagation path of posts for rumor detection. BiGCN [2] used top-down and bottom-up graph convolutional networks to handle the propagation and diffusion patterns of rumors. Sun et al. [15] modeled the dynamics of message propagation and background knowledge to identify rumors.
2.2 Multi-modal Rumor Detection

Various methods that exploit both text and image modalities have also been proposed for rumor detection. EANN [17] directly concatenated visual and textual features to obtain multimodal features. However, this connection strategy disrupts the relationship between textual and visual features. Albalawi et al. [1] also used concatenation to fuse multimodal features in the field of Arabic rumor detection, but they integrated image features through two pre-trained models. Jin et al. [6] proposed a multimodal detection model, att-RNN, and used an attention mechanism to balance visual features and textual/social context features. Sun et al. [14] extracted entities from text and linked them to a knowledge graph, fused knowledge features with tweet content through an attention mechanism, and captured features where the background information is inconsistent with the tweet content.
Multi-modal Rumor Detection on Modality Alignment
475
Fig. 2. Multi-modal rumor detection model on Modality Alignment and multi-Perspective Structures (M3APS).
3 Approach

The rumor detection task can be defined as a binary classification problem, aiming to determine whether the event described in the source tweet on social media is a rumor or a non-rumor. In this paper, the tweet set is defined as M = {(E0, V0), (E1, V1), ..., (En, Vn), Gt, Gg}, where E0 represents the source tweet that initiates the message chain, Ei (i > 0) is the i-th relevant responsive post, Vi is the image attached to Ei (which may be empty, depending on whether the user has posted an image), Gt represents the temporal structure, and Gg represents the graphic structure. We need to learn a model f: M → Y that classifies each source tweet into one of the predefined categories Y = {0, 1}, the ground-truth label of the source tweet.

Figure 2 shows the framework of M3APS. The workflow is as follows: for a given sample, MPM models it into the temporal structure and the graphic structure; MAM then aligns the inaccessible image modality and obtains a robust multi-modal representation for each pair of multi-modal heterogeneous nodes. Subsequently, DEM extracts the "debunker" features from both the temporal structure and the graphic structure, which can reasonably evaluate the weight of nodes in their respective structures. Finally, in the MPQ module, the "debunker" features of the temporal and graphic structures are used to judge the source tweet.

3.1 Multi-perspective Modeling Module (MPM)

For each sample, MPM constructs a temporal propagation chain to represent the temporal structure and a propagation graph to represent the graphic structure.
In the temporal structure Gt (including {(E0, V0), E1, ..., En}), only the source tweet E0 is linked with the image node V0 to form a multimodal heterogeneous node pair, and the reply tweets {E1, ..., En} are linked in temporal order. In the graphic structure Gg (including {(E0, V0), ..., (En, Vn)}), each tweet is treated as a node node_E linked with the image node node_V to form a multimodal heterogeneous node pair. An edge between node_E^j and node_E^k indicates a reply or retweet relation between the adjacent text nodes Ej and Ek.

3.2 Modality Alignment Module (MAM)

Image Generator. We use DALL-E mini¹, an attempt to replicate OpenAI's DALL-E model [12], as the image generator. The text content Ei is preprocessed to remove irrelevant tags to improve the quality of the generated image Vi'. When the original image modality is inaccessible, the image generator produces new images closely related to the text to reduce random noise and increase the number of sample pairs with high text-image correlation. When the original image is present, the generated images interact with the original image to explore the correlation between the original and generated images.

Text Generator. To better understand both the original image and the generated image, our proposed model adopts a dual understanding method based on both the image perspective and the text perspective. Specifically, the model uses the traditional image captioning method [20] from the field of image analysis, which enhances the model's understanding of images by generating descriptive text Ei' that is strongly associated with the image Vi.

Image Encoder. The image encoder generates image features for the original and generated images. The ViT model [18] is used for image encoding: each image is first resized to 224 × 224 pixels and then input into the vit-base-patch16 model to obtain image features Ii, Ii' ∈ R^768.

Text Encoder.
The task of generating text features for the original and generated text is performed by the text encoder. The model inputs the text into the Bert-Base-Uncased model [3] to obtain contextual information and uses the "[CLS]" token to represent the text features Ti, Ti' ∈ R^768.

Intramodal Interaction. Our proposed model uses an intramodal interaction mechanism to obtain fusion features between the original text features Ti and the generated text features Ti', as well as between the original image features Ii and the generated image features Ii'. For the textual part, a weight α is learned from the user-posted tweet Ei and the generated text Ei'; it is assigned to the original text features Ti to compute the textual interaction feature ti ∈ R^768. Note that we treat the generated image Vi' as the original image when the original image Vi is inaccessible. The textual intramodal interaction steps are as follows.

c = Ti ⊕ Ti'   (1)

¹ https://huggingface.co/flax-community/dalle-mini
Multi-modal Rumor Detection on Modality Alignment
477
α = σ(Wc c + bc)   (2)

ti = αTi   (3)
where σ(·) refers to the Sigmoid function, and Wc and bc are learnable parameters. A similar method is used to obtain the visual interaction feature vi ∈ R^768. To balance the textual interaction feature ti and the visual interaction feature vi, an adaptive learning method automatically allocates a weight e and obtains the multimodal feature Si ∈ R^768. The specific calculation steps are as follows.

e = softmax(We(ti ⊕ vi) + be)   (4)

Si = (1 − e)ti + e·vi   (5)
where We and be are learnable parameters, and ⊕ denotes feature concatenation.
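As a concrete illustration, the intramodal interaction and adaptive fusion above (Eqs. 1–5) can be sketched in NumPy as follows. The random parameter initialization, the reuse of one gate for both modalities, and the feature dimension are our own assumptions for the sketch, not the authors' released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 768  # feature size used in the paper

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

# Learnable parameters (randomly initialised here for illustration).
W_c, b_c = rng.normal(size=(2 * dim, dim)) * 0.01, np.zeros(dim)
W_e, b_e = rng.normal(size=(2 * dim, dim)) * 0.01, np.zeros(dim)

def intramodal_fusion(T, T_gen, I, I_gen):
    # Eqs. (1)-(2): concatenate original and generated text features, learn alpha
    alpha = sigmoid(np.concatenate([T, T_gen], axis=-1) @ W_c + b_c)
    t = alpha * T                  # Eq. (3): textual interaction feature
    # analogous gate on the visual side (parameter sharing is an assumption)
    beta = sigmoid(np.concatenate([I, I_gen], axis=-1) @ W_c + b_c)
    v = beta * I
    # Eqs. (4)-(5): adaptive weight e balances text and image channels
    e = softmax(np.concatenate([t, v], axis=-1) @ W_e + b_e)
    return (1 - e) * t + e * v     # multimodal feature S_i

T, T_gen, I, I_gen = (rng.normal(size=(dim,)) for _ in range(4))
S = intramodal_fusion(T, T_gen, I, I_gen)
print(S.shape)  # (768,)
```

When the original image Vi is inaccessible, the generated features would simply be passed in for both I and I_gen, matching the substitution described above.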
3.3 Debunker Extraction Module (DEM)

In the temporal structure, we adopt max-pooling to extract the information of each reply node from the information propagation chain and obtain the debunking vector "debunker_t" of the temporal structure. Specifically, we use the MAM module to obtain the multimodal encoding of the source node and use Bert-Base-Uncased to encode the remaining temporal reply nodes. Then, the encoding features of all reply nodes interact with the multimodal encoding of the source node to obtain the debunking features "debunker_t" of the temporal structure. The specific implementation steps are as follows.

P = MaxPool(T1, ..., Tn)   (6)

debunker_t = Wd(S0 ⊕ P) + bd   (7)
where P represents the reply information, MaxPool(·) denotes the max-pooling operation, S0 represents the multimodal features of the source tweet and its attached image, ⊕ represents feature concatenation, and Wd and bd are learnable parameters. Max-pooling is a simple and effective feature extraction method, but it considers neither the dependency between nodes nor the importance of different nodes. To further improve the effectiveness of the "debunker" features, it is necessary to consider the importance of different nodes in the propagation structure and extract the "debunker" from other perspectives.

In the graphic structure, inspired by Sun et al. [14], we calculate the cross-modal relevance wi between text and image based on similarity, and obtain the "debunker" of the graphic structure through graph convolution. The specific steps are as follows.

wi = (ti · vi) / (‖ti‖ ‖vi‖)   (8)
where ‖·‖ denotes the magnitude of a vector. The DEM module introduces a feature matrix X associated with the correlation values and an adjacency matrix A representing the set of edges. Let H^(i) be the input feature matrix of the i-th layer of the GCN (H^(0) = X). The graph convolution process is as follows.

W = softmax(w1, ..., wn)   (9)

X = [S1, ..., Sn]W   (10)

H^(i+1) = σ(D^(−1/2)(I + A)D^(−1/2) H^(i) W)   (11)
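The graph-side steps of Eqs. (8)–(11) can be sketched numerically as follows. The toy reply tree, the feature size, the random parameters, and the choice of D as the degree matrix of I + A are our own assumptions for illustration; the paper stacks two GCN layers, and only one is shown here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 16   # 5 nodes, toy feature size (768 in the paper)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Eq. (8): cross-modal relevance = cosine similarity of text/image features
t = rng.normal(size=(n, d)); v = rng.normal(size=(n, d))
w = (t * v).sum(-1) / (np.linalg.norm(t, axis=-1) * np.linalg.norm(v, axis=-1))

# Eqs. (9)-(10): normalise relevances and scale the multimodal node features
W_rel = np.exp(w) / np.exp(w).sum()
S = rng.normal(size=(n, d))     # multimodal features S_1..S_n from MAM
X = S * W_rel[:, None]          # relevance-weighted feature matrix

# Eq. (11): one GCN layer H' = sigma(D^-1/2 (I + A) D^-1/2 H W)
A = np.zeros((n, n))
for child, parent in [(1, 0), (2, 0), (3, 1), (4, 1)]:  # a toy reply tree
    A[child, parent] = A[parent, child] = 1
A_hat = np.eye(n) + A
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(1)))
W_gcn = rng.normal(size=(d, d)) * 0.1
H1 = sigmoid(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W_gcn)
debunker_g = H1[0]              # root-node features serve as the graph debunker
print(debunker_g.shape)  # (16,)
```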
After two layers of GCN, the DEM module selects the aggregated features of the root node as the graph's debunking features, named "debunker_g". Here, σ(·) denotes the Sigmoid function, I is the identity matrix, D is the diagonal degree matrix, and W is the trainable weight matrix.

3.4 Multi-perspective Query (MPQ)

The debunking vectors "debunker_t" and "debunker_g" effectively incorporate information from the entire temporal structure and graphic structure, reducing the influence of untrustworthy tweet nodes and allowing the debunking vectors to infer the events that need to be judged in the source tweet E0. Taking the temporal structure as an example, the features r_t are obtained as follows.

Table 1. Distribution of Datasets.

                            PHEME    Twitter15    Twitter16
  # of events               6425     1490         818
  # of rumors               2402     1116         613
  # of non-rumors           4023     374          205
  # of images               7239     4917         1333
  Average length of tweets  13.6     15.8         15.9
  Average depth of events   3.2      2.8          2.7
R_t^i = Attention(debunker_t W_i^Q, T0 W_i^K, T0 W_i^V)   (12)

r_t = (R_t^1 ⊕ ... ⊕ R_t^h) W^O   (13)
Similarly, r_g of the graphic structure is obtained, and the temporal judgment feature r_t and the graphic judgment feature r_g are uniformly sent to the classifier.

res = classifier(r_t, r_g)   (14)
Table 2. Performance comparison of rumor detection on three datasets. We used the t-test with a 95% confidence interval for the significance test, and all improvements of M3APS over Sun2021 and BiGCN are significant (p < 0.02).

  Dataset     Method    Acc     P       R       F1
  PHEME       BiGCN     0.880   0.873   0.878   0.875
              EANN      0.824   0.813   0.833   0.818
              Sun2021   0.893   0.892   0.881   0.886
              M3APS     0.919   0.897   0.928   0.909
  Twitter15   BiGCN     0.908   0.871   0.909   0.887
              EANN      0.866   0.824   0.886   0.843
              Sun2021   0.912   0.876   0.912   0.891
              M3APS     0.935   0.911   0.923   0.917
  Twitter16   BiGCN     0.876   0.825   0.832   0.829
              EANN      0.839   0.827   0.689   0.722
              Sun2021   0.861   0.804   0.823   0.812
              M3APS     0.891   0.842   0.863   0.852
where R_t^i represents the i-th attention head's features in the temporal structure, ⊕ represents the concatenation operation, h represents the number of attention heads and is set to 8, W^O, W_i^Q, W_i^K, W_i^V are trainable parameters, and classifier(·) is a container consisting of (dense-dropout-dense-tanh-dropout-dense) layers.
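The multi-head query of Eqs. (12)–(13) can be sketched as follows. We assume standard scaled dot-product attention and random toy parameters; the dimensions and the number of source-tweet tokens are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
d, h = 64, 8          # toy model size; h = 8 heads as in the paper
d_k = d // h

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

debunker_t = rng.normal(size=(1, d))   # query: temporal debunking vector
T0 = rng.normal(size=(12, d))          # keys/values: source-tweet representations

heads = []
for i in range(h):
    Wq, Wk, Wv = (rng.normal(size=(d, d_k)) * 0.1 for _ in range(3))
    Q, K, V = debunker_t @ Wq, T0 @ Wk, T0 @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # Eq. (12): attend over the source tweet
    heads.append(attn @ V)                   # R_t^i

W_O = rng.normal(size=(d, d)) * 0.1
r_t = np.concatenate(heads, axis=-1) @ W_O   # Eq. (13): concatenate heads, project
print(r_t.shape)  # (1, 64)
```

The graphic-side feature r_g would be produced the same way from debunker_g before both are fed to the classifier of Eq. (14).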
4 Experimentation

4.1 Experimental Settings

We evaluate the proposed model on PHEME, Twitter15, and Twitter16, whose distributions are shown in Table 1. For PHEME, the data is randomly split into 80% for training, 10% for validation, and 10% for testing, following the same method as Sujana et al. [13]. For Twitter15 and Twitter16, we adopt a 6:2:2 split ratio because of their smaller size. Similar to Ma et al. [10], we report accuracy, precision, recall, and F1 score to evaluate performance. Images are resized to 224 × 224 pixels for lower cost and normalized. The Adam optimizer is used to update the parameters, dropout is 0.3, and the learning rate is 10^−5.

4.2 Experimental Results

To verify the effectiveness of our proposed M3APS, the following baselines are used for fair comparison.

BiGCN [2]: a GCN-based model that uses the two key features of rumor propagation and dispersion to capture the global structure of the rumor tree.

EANN [17]: a multi-modal model that uses an event adversarial neural network to extract features.
Sun2021 [14]: a multi-modal model that explores inconsistency between texts and images and spots information between posts and knowledge.

Considering the different partition methods of the datasets and that the crawled images may differ, we rerun some experiments using the source code provided by the authors. Table 2 shows the performance of all models. The proposed method achieves the best results on both Acc and F1, and we can draw the following observations:

Table 3. Performance comparison among different variants on three datasets.

              PHEME           Twitter15       Twitter16
  Variant     Acc     F1      Acc     F1      Acc     F1
  TStr        0.872   0.862   0.912   0.891   0.861   0.812
  GStr        0.891   0.886   0.908   0.887   0.854   0.800
  Rand        0.854   0.841   0.927   0.909   0.847   0.793
  Gene        0.860   0.852   0.893   0.864   0.877   0.829
  Part        0.919   0.909   0.935   0.917   0.891   0.852
(1) BiGCN considers the original structural information of the propagation graph and the potential correlation between features, and therefore performs well on all three datasets.

(2) Among the multimodal methods, Sun2021 is better than EANN. This suggests that a simple concatenation strategy forces text and images into the same semantic space, thus breaking their relationship.

(3) Sun2021 performs better than BiGCN on the PHEME and Twitter15 datasets, but worse on Twitter16, which may be due to the smaller number of images on Twitter16: the feature correlation between images and text cannot be learned well by the model. The modality alignment scheme designed in M3APS attempts to solve this problem.

(4) M3APS achieves satisfactory performance compared to all baseline methods. This advantage can be attributed to two aspects: (a) M3APS completes the inaccessible modalities, reduces the noise caused by random filling, and further strengthens the connection between modalities; (b) M3APS considers the impact of reply information from different perspectives on the importance of the source event by adopting both the temporal structure and the graphic structure.

4.3 Ablation Study

In this section, we compare variants of M3APS from different aspects to demonstrate the effectiveness of modality alignment and multiple structural perspectives. The ablation experiments are shown in Table 3. "TStr" indicates that the model only uses the temporal structure during the modeling phase, "GStr" indicates that the model only uses the graphic structure, "Rand" indicates that the model uses random vectors to fill in the inaccessible image modality, "Gene" indicates that the model uses generated images to replace all image modalities, and "Part" indicates our proposed
model, which uses both temporal and graphic structures and generates images only for the inaccessible image modality. After analysis, we can draw the following conclusions:

(1) The models based on the temporal or graphic structure alone achieve good results, but each has limitations: the temporal structure struggles to handle long-distance reply information, and the graphic structure needs a clear propagation path to determine important reply information.

(2) In the absence of the image modality, random vector initialization cannot effectively improve performance. One reason is the noise introduced by randomly initialized vectors; another is that, in our model, randomly initialized vectors carry no text-related information and therefore cannot strengthen the modality connection.

(3) When all image modalities are replaced by generated images, the results are poor. The images generated from tweets are strongly related to the tweet text; this scheme treats each pair of multimodal information as equally important and ignores that some images published by authors may carry obvious emotional bias and falsehood.

(4) Our proposed model uses the generation method only when the image modality is inaccessible. The generated image strengthens the feature connection between modalities, while the model considers the importance of reply information in the temporal and graphic structures.
Fig. 3. Illustration of the case study.
Based on the above analysis, the following conclusions can be drawn: (a) the Modality Alignment Module is crucial to reduce data noise and strengthen the modality connection; (b) the perspectives of the temporal and graphic structures help the model generalize better to different samples.
4.4 Case Study

To demonstrate the effectiveness of the proposed model in terms of both structure and modality alignment, we analyze a non-rumor case in Fig. 3. The source tweet a reported the event "White smoke has been spotted above the Kremlin" and attached an image as evidence. In the temporal structure, tweet b did not express a clear opinion, but its subsequent reply tweet d explicitly confirmed the correctness of the event reported in the source tweet. In the graphic structure, tweet b absorbed the information from tweet d through one layer of information aggregation, and in the next layer of graph convolutional aggregation, the information transferred from tweet b to the source tweet a was weakened, thereby reducing the influence of b. In the figure, a yellow border indicates an image generated by the model. The generated image of tweet d matches that of the source tweet a, further supporting the authenticity of the source tweet. In the process of fusing a generated image with text, such as fusing the generated image in tweet b with the text "Putin", the information correlation between the image and text modalities is strengthened.
5 Conclusion

In this paper, we proposed M3APS, a multi-modal rumor detection model based on modality alignment and multi-perspective structures. It introduces the Modality Alignment Module, which takes into account the possible absence of image modalities on real social platforms and obtains a robust multimodal representation. When the image modality exists, the Modality Alignment Module can also further strengthen the connection between modalities and provide more strongly correlated multimodal sample pairs for the model. In terms of structure, information is reasonably gathered from the perspectives of the temporal structure and the graphic structure, and the source event is queried. In future work, we will continue to study how to introduce external knowledge information from different structural perspectives.

Acknowledgements. The authors would like to thank the three anonymous reviewers for their comments on this paper. This research was supported by the National Natural Science Foundation of China (Nos. 62006167, 62276177 and 61836007), and a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
References

1. Albalawi, R.M., Jamal, A.T., Khadidos, A.O., Alhothali, A.M.: Multimodal Arabic rumors detection. IEEE Access 11, 9716–9730 (2023)
2. Bian, T., et al.: Rumor detection on social media with bi-directional graph convolutional networks. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence, pp. 549–556 (2020)
3. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186 (2019)
4. Han, X., Huang, Z., Lu, M., Li, D., Qiu, J.: Rumor verification on social media with stance-aware recursive tree. In: Proceedings of the 14th International Conference on Knowledge Science, Engineering and Management, pp. 149–161 (2021)
5. Huang, Z., Lv, Z., Han, X., Li, B., Lu, M., Li, D.: Social bot-aware graph neural network for early rumor detection. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 6680–6690 (2022)
6. Jin, Z., Cao, J., Guo, H., Zhang, Y., Luo, J.: Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In: Proceedings of the 2017 ACM on Multimedia Conference, pp. 795–816 (2017)
7. Khoo, L.M.S., Chieu, H.L., Qian, Z., Jiang, J.: Interpretable rumor detection in microblogs by attending to user interactions. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence, pp. 8783–8790 (2020)
8. Kumar, S., Carley, K.M.: Tree LSTMs with convolution units to predict stance and rumor veracity in social media conversations. In: Proceedings of the 57th Conference of the Association for Computational Linguistics, pp. 5047–5058 (2019)
9. Li, Q., Zhang, Q., Si, L.: Rumor detection by exploiting user credibility information, attention and multi-task learning. In: Proceedings of the 57th Conference of the Association for Computational Linguistics, pp. 1173–1179 (2019)
10. Ma, J., et al.: Detecting rumors from microblogs with recurrent neural networks. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence, pp. 3818–3824 (2016)
11. Ma, J., Gao, W., Wong, K.: Rumor detection on Twitter with tree-structured recursive neural networks. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 1980–1989 (2018)
12. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. CoRR abs/2204.06125 (2022)
13. Sujana, Y., Li, J., Kao, H.: Rumor detection on Twitter using multi-loss hierarchical BiLSTM with an attenuation factor. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 18–26 (2020)
14. Sun, M., Zhang, X., Ma, J., Liu, Y.: Inconsistency matters: a knowledge-guided dual-inconsistency network for multi-modal rumor detection. In: Findings of the Association for Computational Linguistics, pp. 1412–1423 (2021)
15. Sun, M., Zhang, X., Zheng, J., Ma, G.: DDGCN: dual dynamic graph convolutional networks for rumor detection on social media. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence, pp. 4611–4619 (2022)
16. Wang, B., Wei, H., Li, R., Liu, S., Wang, K.: Rumor detection model fused with static spatiotemporal information. J. Intell. Fuzzy Syst. 44(2), 2847–2862 (2023)
17. Wang, Y., et al.: EANN: event adversarial neural networks for multi-modal fake news detection. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 849–857 (2018)
18. Wu, B., et al.: Visual transformers: token-based image representation and processing for computer vision. CoRR abs/2006.03677 (2020)
19. Xia, R., Xuan, K., Yu, J.: A state-independent and time-evolving network for early rumor detection in social media. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 9042–9051 (2020)
20. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning, pp. 2048–2057 (2015)
Recognizing Functional Pragmatics in Chinese Discourses on Enhancing Paragraph Representation and Deep Differential Amplifier

Yu Lu, Yaxin Fan, Peifeng Li, Xiaomin Chu, and Qiaoming Zhu(B)

Soochow University, Soochow, China
{ylu990424,yxfansuda}@stu.suda.edu.cn, {pfli,xmchu,qmzhu}@suda.edu.cn
Abstract. Discourse functional pragmatics recognition focuses on identifying the functions of discourse paragraphs, which is a significant research direction in natural language processing. To obtain better paragraph representations and alleviate the issue of imbalanced data distribution, we propose a Chinese discourse Functional Pragmatics Recognition model (FPR) based on enhanced paragraph representation and a deep differential amplifier. More specifically, we first combine two different encoding methods to obtain a paragraph encoding that contains much richer word-level information. We then apply a deep differential amplifier, consisting of residual and iterative structures, to amplify the differences between paragraphs. Additionally, we give more attention to the minority classes by adjusting the weights of the minority and majority classes in the objective function. Experimental results on MCDTB 2.0 show that our model FPR outperforms the state-of-the-art models.

Keywords: Chinese Discourse Functional Pragmatics Recognition · Deep Differential Amplifier · Iterative Refinement · Residual Representation · Weighted Objective Function
1 Introduction

Functional grammar [1] is functionally and semantically oriented and treats the discourse as a language unit, which is more difficult to abstract. Therefore, there are only a few studies on functional pragmatics. Theories of macro discourse functional pragmatics [2, 3] presume that each discourse unit has its own role and function in the whole discourse. The analysis of functional pragmatics helps dig out valuable information in the text and can in turn assist downstream tasks such as information extraction [4], text summarization [5], and automatic essay scoring [6].

Dijk [7] proposed the news schema theory, which describes the function of paragraphs in news articles. Pan et al. [8] presented a framing-based approach that provides four structural dimensions for the analysis of news discourse. White [9] introduced a structure of news articles centered on headlines and introductions. Inspired by previous studies,

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 484–496, 2023. https://doi.org/10.1007/978-981-99-4752-2_40
Recognizing Functional Pragmatics in Chinese Discourses
485
Chu et al. [3] and Jiang et al. [10] proposed a macro discourse functional pragmatic structure and annotated 720 news reports to form the Macro Chinese Discourse TreeBank (MCDTB). Subsequently, Du et al. [11] annotated another 480 news reports and finally constructed the MCDTB 2.0 corpus.
Fig. 1. A tree of discourse functional pragmatic structure from the article chtb_0010 whose text is shown in Appendix A.
To briefly introduce the functional pragmatic structure, we choose a news report (chtb_0010) from MCDTB 2.0 as an example. In Fig. 1, a news report is represented by a tree. The leaf nodes represent the function of each paragraph, while each non-leaf node represents the function of the discourse unit formed by all of its descendant nodes together. The direction of an arrow indicates the Nucleus-Satellite relation between the two connected discourse units, pointing from the satellite unit to the nucleus unit. Additionally, when several discourse units are equally important, the arrows point from the non-leaf node to its child nodes. Specifically, there are five paragraphs in this news report: the function of P1 is "Lead", P2 corresponds to "Supplement", and so on. Then, "Summary" is formed by P1 and P2 together, and their Nucleus-Satellite relation points from P2 to P1. The other non-leaf nodes are composed in the same way.

In previous studies, most models encoded each paragraph separately and then built the interaction of paragraph information on paragraph vectors. Obviously, a paragraph can obtain paragraph-level information from other paragraph vectors but not word-level information. Though encoding the whole article with XLNet could alleviate this problem, too much exposure to other paragraphs' information may weaken the model's grasp of the important information in the current paragraph. Another common issue is that the numbers of instances of different classes vary widely in both the Chinese and English corpora. This problem is evident in the MCDTB 2.0 corpus, where the minority categories account for more than half of all categories. Thus, almost all minority categories have poor recognition performance.

To address the above issues, we propose a Chinese discourse Functional Pragmatics Recognition model (FPR) based on enhancing paragraph representation and a deep differential amplifier.
Firstly, we encode each paragraph separately and also encode the whole article, obtaining two types of paragraph vectors; these two paragraph vectors are then combined by a gate control network. On the other hand, a differential amplifier module and a weighted objective function are utilized to alleviate the general issue of unbalanced data. The differential amplifier module consists of many residual layers, where a paragraph representation and the average of the other paragraph representations serve as
486
Y. Lu et al.
two inputs. When calculating the weighted loss function, we divide the categories into three coarse groups according to their percentages in MCDTB 2.0 and set a coefficient for each group. As shown in the experimental results, our model exhibits optimal performance compared to state-of-the-art models and better recognizes some minority classes.
2 Related Work

2.1 Theories of Discourse Functional Pragmatics

Most previous research on functional grammar focused on genre-specific domains. Dijk [7], Pan et al. [8], and White [9] extensively studied the functional discourse structure of articles in the journalism domain. Wilbur et al. [12], de Waard et al. [13], and Liakata et al. [14] proposed different annotation schemes of functional structures that could be applied to biological articles. Rhetorical status and argument types were used by Teufel et al. [15], Liddy et al. [16], and Kircz et al. [17] to define functional theories, which were applied to annotate scientific articles. The news schema theory introduced by Dijk [7] was the most influential, and many subsequent studies were based on it. Yarlott et al. [18] constructed a corpus containing 50 articles. Choubey et al. [19] annotated 802 articles to form a corpus for the task of event co-reference parsing. Song et al. [6] developed a set of rules for annotating the function of sentences in essays and constructed a corpus to serve the essay scoring task. Chu et al. [3] and Jiang et al. [10] proposed a macro discourse functional pragmatic structure containing 18 functional pragmatic types in total and annotated 720 news reports to form the Macro Chinese Discourse Treebank (MCDTB). Du et al. [11] extended the MCDTB to MCDTB 2.0 with another 480 news reports.

2.2 Discourse Functional Pragmatics Recognition

Yarlott et al. [18] and Banisakher et al. [20] applied traditional machine learning methods with manual features to identify the function of news paragraphs. Choubey et al. [19] and Song et al. [6] proposed hierarchical Bi-LSTM models with different attention mechanisms. Guided by Choubey's studies, Zhang et al. [21] applied functional discourse structure to sentence deletion tasks and achieved remarkable performance. Du et al.
[11, 22] successively proposed a functional pragmatics recognition model [11] based on global and structured information and a joint learning model [22] that combined a functional pragmatics recognition task and a text segmentation task.
3 Approach

To obtain better paragraph representations and relieve the issue of unevenly distributed data, we propose the FPR model, as shown in Fig. 2. Below are the definition of our task and the components of our model: (1) Task Definition: we introduce the definition of our task. (2) Word Level Paragraph Representation (WLPR): we combine two encoding methods to obtain paragraph representations with
more character-level information. (3) Paragraph Interactive Encoding (PIE): paragraph representations interact at this layer. (4) Deep Differential Amplifier Module (DDA): this module is applied to amplify and capture the differences between paragraphs. (5) Weighted Objective Function: the attention paid to different classes is adjusted by a weighted loss function.

3.1 Task Definition

The task of discourse functional pragmatics recognition is defined as recognizing the function of each paragraph in a news article. Specifically, for an article containing K paragraphs P = {p1, p2, ..., pK}, we need to learn a model f: P → Y that assigns each paragraph a predefined functional pragmatic category, where Y = {y1, ..., yK} is the set of ground-truth labels of the discourse functional pragmatics, such as "Lead", "Supplement" and "Situation".
Fig. 2. Framework of discourse functional pragmatics recognition.
3.2 Word Level Paragraph Representation

For the purpose of obtaining better paragraph representation, we combine different encoding methods. XLNet [23] is used as the encoder in this paper. A document containing k paragraphs is denoted as P = {p_1, p_2, ..., p_k}, where p_i = {w_{i,1}, ..., w_{i,m}} and m is the number of tokens contained in the i-th paragraph. First, we encode each paragraph separately. In detail, we add <sep> and <cls> at the end of each paragraph, input them into XLNet, and obtain the paragraph encoding P^{local} = {p_1^{local}, p_2^{local}, ..., p_k^{local}}.
Y. Lu et al.
Then, we encode the whole article together. We add a <sep> token at the end of each paragraph and a <cls> token at the end of the whole article, so the input takes the form P = {p_1 <sep> p_2 <sep> ... p_k <sep> <cls>}. We input P into XLNet and take the vector corresponding to the <sep> at the end of each paragraph as the paragraph encoding P^{global} = {p_1^{global}, p_2^{global}, ..., p_k^{global}}.

After obtaining the paragraph vectors P^{local} and P^{global}, a gate control network is applied to combine them as follows.

g_i = Sigmoid(p_i^{local} W_1 + p_i^{global} W_2)    (1)

r_i = g_i p_i^{local} + (1 − g_i) p_i^{global}    (2)

where p_i^{local}, p_i^{global}, r_i, g_i ∈ R^{1×d_h} and W_1, W_2 ∈ R^{d_h×d_h}. Sigmoid(·) denotes the activation function. The inputs of the gate control network are P^{local} and P^{global}, and the output is R = {r_1, r_2, ..., r_k}. As a brief note, we use lowercase letters x_i for the i-th paragraph vector and uppercase letters X for all paragraph vectors in an article, and X^{(m)} represents that X is at the m-th layer of a module.
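The gate fusion of Eqs. (1)–(2) can be sketched in plain Python. The weight matrices and the tiny dimension d_h = 2 used here are illustrative assumptions, not the paper's trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(v, W):
    # row vector v (1 x d) times weight matrix W (d x d)
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def gate_fuse(p_local, p_global, W1, W2):
    # Eq. (1): g = Sigmoid(p_local W1 + p_global W2)
    g = [sigmoid(a + b) for a, b in zip(matvec(p_local, W1), matvec(p_global, W2))]
    # Eq. (2): r = g * p_local + (1 - g) * p_global (element-wise gate)
    return [gi * pl + (1 - gi) * pg for gi, pl, pg in zip(g, p_local, p_global)]

# With identity weights and zero inputs the gate is 0.5 everywhere,
# so the fused vector equals the (shared) input.
I = [[1.0, 0.0], [0.0, 1.0]]
print(gate_fuse([0.0, 0.0], [0.0, 0.0], I, I))  # [0.0, 0.0]
```

The gate g_i lies in (0, 1), so each fused component is a convex combination of the local and global encodings of the same paragraph.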
3.3 Paragraph Interactive Encoding

After obtaining the combined encoding R = {r_1, ..., r_k} ∈ R^{k×d_h}, the Transformer [24] encoder is used to enhance the interaction of paragraph-level contextual information. The encoder consists of one positional encoding layer and several attention layers. The formula for the positional encoding layer is as follows:

s_i = Drop(r_i + PosEmb(r_i))    (3)

where r_i, s_i ∈ R^{1×d_h}, PosEmb(·) is the sinusoidal position embedding and Drop(·) represents the dropout layer. After the positional encoding layer, we can obtain S = {s_1, ..., s_k} ∈ R^{k×d_h}. Assuming that the input of the encoder at the m-th layer is S^{(m)} = {s_1^{(m)}, ..., s_k^{(m)}} ∈ R^{k×d_h}, the m-th encoder layer is designed as follows.

U^{(m)} = MulHAtt(S^{(m)} W_q^{(m)}, S^{(m)} W_k^{(m)}, S^{(m)} W_v^{(m)})    (4)

V^{(m)} = LN(U^{(m)} + S^{(m)})    (5)

S^{(m+1)} = LN(FF(V^{(m)}) + V^{(m)})    (6)

where W_q^{(m)}, W_k^{(m)}, W_v^{(m)} ∈ R^{d_h×d_h}, U^{(m)}, V^{(m)} ∈ R^{k×d_h}, MulHAtt(·) denotes the multi-headed attention mechanism, and LN(·) and FF(·) represent the layer normalization and the feed-forward layer respectively. After M layers of encoding, we obtain S = S^{(M+1)} ∈ R^{k×d_h}.
3.4 Deep Differential Amplifier

The differential amplifier consists of several stacked layers that highlight key information while preserving the original paragraph representation. Residual learning is just a variant of the differential amplifier, which contains a residual mapping R(x) and an identity mapping x as follows.

F(x) = R(x) + x    (7)

F(x) = V_out,  R(x) = (V_in^+ − V_in^−) W_d    (8)

According to the output of the Paragraph Interactive Encoding module S = {s_1, ..., s_k}, we regard a paragraph vector in the article as V_in^+ and the average of the other paragraph vectors as V_in^−. Then, we replace the original data of the differential amplifier mapping R(x) with the above V_in^+ and V_in^−. Finally, we can obtain the formula of the deep differential amplifier applied in our model as follows.

V_in^+ = s_i,  V_in^− = Σ_{j∈{1,2,...,K}\{i}} s_j / (K − 1)    (9)

F(s_i) = (s_i − Σ_{j∈{1,2,...,K}\{i}} s_j / (K − 1)) W_d + s_i    (10)

where W_d ∈ R^{d_h×d_h} and s_i, s_j, F(s_i) ∈ R^{1×d_h}. In order to reinforce the representation of pivotal information, an iterative structure is applied in our model. For a paragraph s_i, the formula of the basic iterative unit is as follows:

F(s_i)^{(n)} = R(s_i)^{(n)} + s_i^{(n)}    (11)

s_i^{(n+1)} = F(s_i)^{(n)}    (12)

where F(s_i)^{(n)}, R(s_i)^{(n)}, s_i^{(n)} ∈ R^{1×d_h}. The output of the n-th iteration becomes the input of the (n+1)-th iteration, and we iteratively refine the representation s_i for N times. We can thus obtain Q = {Q^{(1)}, ..., Q^{(N)}}, where Q^{(i)} = S^{(i+1)} = {s_1^{(i+1)}, ..., s_k^{(i+1)}}. After several iterations, we add a linear layer and an activation layer at the end of our model as follows.

pred_j^{(i)} = Sigmoid(q_j^{(i)} W_3)    (13)

where W_3 ∈ R^{d_h×d_h} and pred_j^{(i)}, q_j^{(i)} ∈ R^{1×d_h}. Sigmoid(·) represents the activation function. The output of our model is Pred = {Pred^{(1)}, ..., Pred^{(N)}}, where Pred^{(i)} = {pred_1^{(i)}, ..., pred_k^{(i)}}. Although we iterate many rounds, only the result of the last iteration is utilized to predict discourse functional pragmatic categories, while the outputs of each iteration are applied to compute the loss function.
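The core of Eqs. (9)–(10), amplifying each paragraph by its difference from the average of the other paragraphs, can be sketched as follows; W_d is taken as the identity purely for illustration:

```python
def diff_amplify(S):
    # Eq. (9): V_in+ = s_i, V_in- = mean of the other paragraph vectors
    # Eq. (10) with W_d = identity: F(s_i) = (s_i - V_in-) + s_i
    K = len(S)
    out = []
    for i, s_i in enumerate(S):
        others = [S[j] for j in range(K) if j != i]
        v_minus = [sum(col) / (K - 1) for col in zip(*others)]
        out.append([(a - b) + a for a, b in zip(s_i, v_minus)])
    return out

# One-dimensional toy "paragraph vectors": the middle one equals the
# mean of the other two, so its difference term vanishes.
print(diff_amplify([[1.0], [3.0], [5.0]]))  # [[-2.0], [3.0], [8.0]]
```

A paragraph identical to the average of its neighbours passes through unchanged, while distinctive paragraphs are pushed further from the average, which is exactly the amplification the module is after.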
3.5 Weighted Objective Function

Due to the uneven distribution of classes, we further designed a weighted objective function to avoid bias in the data. There are so many classes in the MCDTB 2.0 corpus that we divide them into minority, common and majority classes. In detail, classes that make up less than 2% are considered minority classes, classes that make up more than 8% are viewed as majority classes, and the rest are regarded as common classes. A weight calculated from the majority and minority classes in an article is used to adjust the importance of different classes. It is important to emphasize that the weights are calculated separately for each article. Below are our loss equations:

weight = Major / Minor    (14)

Loss = Σ_{k=1}^{K} { weight^{W(y)} · (1/M) Σ_{i=1}^{M} [ Σ_{n=1}^{N} y_n log(r_n^i) ] }    (15)

y ∈ {Major, Com, Minor},  W(y) = {−1, 0, 1}    (16)

where Major and Minor stand for the majority and minority classes respectively, and K is the number of iterations.
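Per Eqs. (14)–(16), the weighting amounts to scaling a class's loss term by weight^{W(y)}: minority classes are up-weighted, majority classes down-weighted, and common classes left untouched. A sketch with made-up per-article counts:

```python
def loss_multiplier(n_major, n_minor, group):
    # Eq. (14): weight = Major / Minor, computed separately per article
    weight = n_major / n_minor
    # Eq. (16): W(y) = -1 for majority, 0 for common, +1 for minority
    W = {"Major": -1, "Com": 0, "Minor": 1}[group]
    return weight ** W

# An article with 10 majority-class and 2 minority-class paragraphs:
print(loss_multiplier(10, 2, "Minor"))  # 5.0  (minority up-weighted)
print(loss_multiplier(10, 2, "Major"))  # 0.2  (majority down-weighted)
print(loss_multiplier(10, 2, "Com"))    # 1.0  (common unchanged)
```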
4 Experimentation

4.1 Experiment Settings

In this paper, Micro-F1 and Macro-F1 scores are used as the standard metrics to evaluate the recognition quality of discourse functional pragmatics, and all our experiments were conducted on the MCDTB 2.0 corpus containing 1200 news articles, 720 from the Penn Discourse Treebank and another 480 from Gigaword 2.0. A total of 6,763 functional pragmatics of paragraphs were marked, covering 15 functional types. The percentages of each functional pragmatics on the MCDTB 2.0 corpus are shown in Table 1. Following Du, 80% of the samples were chosen as the training set and the remaining 20% as the testing set. The core parameters are shown in Table 2.

Table 1. Percentages of each functional pragmatics.

| Function   | Percentage | Function     | Percentage | Function    | Percentage |
|------------|------------|--------------|------------|-------------|------------|
| Background | 3.70%      | Illustration | 1.27%      | Situation   | 49.50%     |
| Behavior   | 0.35%      | Lead         | 15.70%     | Statement   | 1.32%      |
| Cause      | 2.19%      | Progression  | 0.28%      | Sub-Summary | 6.03%      |
| Comment    | 2.75%      | Purpose      | 0.55%      | Sum-up      | 0.68%      |
| Contrast   | 0.44%      | Result       | 2.11%      | Supplement  | 13.12%     |
Table 2. Parameter settings.

| Parameter Name  | Parameter Value |
|-----------------|-----------------|
| Training Epoch  | 20              |
| Learning Rate   | 1e-5            |
| Dropout         | 0.1             |
| Encoder Layers  | 12              |
| Attention Head  | 8               |
| Residual Layers | 4               |
Table 3. Results of different models on MCDTB 2.0. We used the t-test with a 95% confidence interval for the significance test, and all improvements of our FPR over Du are significant (p < 0.05).

(l ∈ {Label, Topic, Turn})    (6)

where W_T = {w_T^1, ..., w_T^n} denotes the sequence of tokens of the target sentence. Then we feed them to the decoding layer to obtain the decoder hidden state H_l as follows.

H_l = T5-Encoder(T_l)    (7)

Finally, we use a linear layer generator with softmax to produce the predicted target sentence, where the last [MASK] of the target sentence will be predicted as "SHIFT" or "NON-SHIFT".
J. Lin et al.
3.3 Model Training

We train the above two modules jointly. The loss function of the classifier (L_Class) and of the multi-granularity generator (L_Label, L_Topic, L_Turn) is the cross-entropy loss, and the total loss (Loss) is the sum of both, as follows.

Loss = L_Class + L_Label + L_Topic + L_Turn    (8)
4 Experimentation

4.1 Datasets and Experimental Settings

We evaluate our model on two datasets, the Chinese CNTD and the English TIAGE. Following previous work, we used the same dataset partitioning on English TIAGE and Chinese CNTD. Based on the CNTD and TIAGE datasets, we extract (context, response) pairs from each dialogue as input and the label of the response as the target for the response-known task. In our experiments, every utterance except the first utterance of the dialogue can be considered as a response. For evaluation in all experiments of this paper, we report Precision (P), Recall (R), and Macro-F1 scores.

Our experiments are all performed on a 3090Ti and use PyTorch and Huggingface as deep learning frameworks, with 2 BiLSTM layers in encoding for both English and Chinese. For Chinese, we use mT5 (base) [22] as our T5 model, which is pre-trained on the mC4 corpus covering 101 languages; it is a T5 (base) model with 12 encoder layers and 12 decoder layers. Besides, the model file used by KeyBERT [21] is paraphrase-multilingual-MiniLM-L12-v2 for Chinese and paraphrase-MiniLM-L6-v2 for English. For each experiment, we set the batch size to 2 and the number of training epochs to 20. In addition, we used the warm-up strategy as well as the AdamW optimizer and set the decay factor to 0.01.

4.2 Experimental Results

In the task of dialogue topic shift detection, Xie et al. [9] is the only work that established a benchmark using the T5 model on TIAGE. Due to the similarity of this task to topic segmentation, we also attempted to utilize hierarchical models along with pre-trained models as our baselines for topic shift detection, while among the pre-trained models, T5 is considered the SOTA model.
Hence, we adopt the following baselines for comparison: 1) RoBERTa [23], an improvement on BERT; 2) T5 [9], a modification of the Transformer structure that casts various NLP tasks as text-to-text tasks; 3) Hier-BERT [24], a hierarchical structure based on the Transformer model; 4) BERT+BiLSTM [24], a combination of BERT for text encoding and a bi-directional LSTM for deep bi-directional language representation; 5) BERT [25], a bidirectional encoder based on the Transformer for text encoding.

The results are presented in Table 2 and indicate that the pre-trained models show inconsistent performance, with RoBERTa exhibiting the poorest results and T5 having the highest performance with a noteworthy F1 score of
Multi-granularity Prompts for Topic Shift Detection in Dialogue
81.1. Nevertheless, compared to a single pre-trained model, it is evident that both Hier-BERT and BERT+BiLSTM, which incorporate a hierarchical structure, attain improved performance, recording F1 scores of 81.7 and 82.4, respectively. These results suggest that models incorporating a hierarchical structure provide more consistent results in the task of dialogue topic detection. Moreover, our model (Ours) further outperforms the best baseline BERT + BiLSTM significantly (p < 0.01), with a 3.0 improvement in F1-score. This result verifies the effectiveness of our proposed model.

Table 2. Results of the baselines and ours on CNTD (p < 0.01).

| Model         | P    | R    | F1   |
|---------------|------|------|------|
| BERT          | 82.9 | 79.2 | 80.8 |
| RoBERTa       | 84.4 | 75.4 | 78.6 |
| T5            | 83.0 | 79.7 | 81.1 |
| BERT + BiLSTM | 82.8 | 82.0 | 82.4 |
| Hier-BERT     | 85.6 | 79.0 | 81.7 |
| Ours          | 85.7 | 83.8 | 84.7 |
Table 3. Results of the baselines and ours on TIAGE (p < 0.01).

| Model         | P    | R    | F1   |
|---------------|------|------|------|
| BERT          | 68.5 | 65.4 | 66.6 |
| T5            | 76.5 | 72.2 | 73.9 |
| BERT + BiLSTM | 75.8 | 70.8 | 72.7 |
| Hier-BERT     | 73.8 | 69.6 | 71.2 |
| Ours          | 73.8 | 77.2 | 76.2 |
In addition, we also evaluate our model and the baselines on the English TIAGE, as shown in Table 3. Compared with BERT, both hierarchical structure models, Hier-BERT and BERT + BiLSTM, obtain better performance. However, different from the results in Chinese, T5 is better than the other three baselines in English. Our proposed model outperforms the best baseline T5 significantly, with a 2.3 improvement in F1-score. This result further verifies the effectiveness of our proposed model.

4.3 Ablation Study on Classification and Generation

We statistically analyzed the performance of the classification and generation modules in our proposed model. The results are shown in Table 4, where cla and gen indicate the classification module and the generation module, respectively. As shown in Table 4, it is
surprising that cla and gen achieve the same precision, recall, and F1 score. This indicates that the generation model can achieve performance equivalent to the classification model. Our proposed model combining classification and generation (cla + gen) is better than either cla or gen alone. This result shows that the combination of classification and generation is also an effective way for dialogue topic shift detection and that these two modules can interact with and promote each other.

Table 4. Results of the classification and generation on CNTD.

| Model            | P    | R    | F1   |
|------------------|------|------|------|
| gen              | 83.8 | 81.1 | 82.3 |
| cla              | 83.8 | 81.1 | 82.3 |
| gen + cla (Ours) | 85.7 | 83.8 | 84.7 |
4.4 Ablation Study on Different Levels of Prompt

The results are shown in Table 5. In the case of the single-level prompt, all the results of label-level, topic-level, and turn-level prompts are better than the basic T5, especially the topic level. This indicates that all three levels of prompts are effective for dialogue topic shift detection. Moreover, the performance of the topic level reaches 83.9 in F1 and gains the highest improvement (+1.7). It demonstrates that the key information from the topic block carries more effective topic information to help the model distinguish different topic shift situations.

Table 5. Ablation experiments at different levels of granularity on CNTD.

| Model             | P    | R    | F1   |
|-------------------|------|------|------|
| T5                | 84.5 | 80.5 | 82.2 |
| +Label            | 82.9 | 81.9 | 82.4 |
| +Topic            | 85.4 | 82.6 | 83.9 |
| +Turn             | 82.6 | 82.9 | 82.7 |
| +Label+Topic      | 84.5 | 82.3 | 83.4 |
| +Label+Turn       | 83.3 | 81.7 | 82.5 |
| +Topic+Turn       | 86.1 | 83.1 | 84.4 |
| +Label+Topic+Turn | 85.7 | 83.8 | 84.7 |
In addition, it can be noted that both the combination of the label-level and Topic-level prompt (Label + Topic) and the combination of the label-level and Turn-level prompt (Label + Turn) will harm the performance, in comparison with the single-level prompt.
This indicates that the information of Label and Topic/Turn partly overlaps and can even have a negative impact. It may also be caused by the different forms of target sentences at different granularities: with only two granularities, the different forms of target sentences interfere with each other, leading to a degradation of performance, while with three granularities the model is dominated by the second form of target sentences, so the performance can be improved. On the contrary, the combination of the topic-level and turn-level prompts (Topic + Turn) is better than the single-level prompts Topic and Turn. This indicates that these two prompts can promote each other. Moreover, if we combine all three prompts (Label + Topic + Turn), the F1 score improves further in comparison with the above combinations.
5 Conclusion

In this paper, we introduce a prompt-based model with multi-granularity to detect topic shift in dialogues, which consists of a classification module and a generation module based on T5. Experimental results on our annotated Chinese dataset CNTD and the publicly available English TIAGE dataset show that the proposed model outperforms the baselines. Further experiments show that the information extracted at different levels of granularity effectively helps the model comprehend the conversation topics. However, analysis of the information we extracted at different granularities makes it clear that this key information contains errors and noise. Our future work will focus on improving the reliability of dialogue information mining, and also on exploring finer-grained topic shift scenarios.

Acknowledgements. The authors would like to thank the three anonymous reviewers for their comments on this paper. This research was supported by the National Natural Science Foundation of China (Nos. 62276177 and 61836007), and a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
References

1. Dai, S., Wang, G., Park, S., Lee, S.: Dialogue response generation via contrastive latent representation learning. In: Proceedings of the 3rd Workshop on NLP for ConvAI, pp. 189–197 (2021) 2. Li, J., et al.: DADgraph: a discourse-aware dialogue graph neural network for multi-party dialogue machine reading comprehension. In: Proceedings of IJCNN, pp. 1–8. IEEE (2021) 3. Li, Y., Zhao, H.: Self- and pseudo-self-supervised prediction of speaker and key-utterance for multi-party dialogue reading comprehension. In: Findings of EMNLP 2021, pp. 2053–2063 (2021) 4. Ghandeharioun, A., et al.: Approximating interactive human evaluation with self-play for open-domain dialog systems. In: Proceedings of NIPS, vol. 32 (2019) 5. Einolghozati, A., Gupta, S., Mohit, M., Shah, R.: Improving robustness of task oriented dialog systems. arXiv preprint arXiv:1911.05153 (2019) 6. Liu, B., Tür, G., Hakkani-Tür, D., Shah, P., Heck, L.: Dialogue learning with human teaching and feedback in end-to-end trainable task-oriented dialogue systems. In: Proceedings of NAACL-HLT, pp. 2060–2069 (2018)
7. Xing, L., Carenini, G.: Improving unsupervised dialogue topic segmentation with utterance-pair coherence scoring. In: Proceedings of SIGdial, pp. 167–177 (2021) 8. Hearst, M.A.: TextTiling: segmenting text into multi-paragraph subtopic passages. Comput. Linguist. 23(1), 33–64 (1997) 9. Xie, H., Liu, Z., Xiong, C., Liu, Z., Copestake, A.: TIAGE: a benchmark for topic-shift aware dialog modeling. In: Findings of EMNLP 2021, pp. 1684–1690 (2021) 10. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020) 11. Xu, Y., Zhao, H., Zhang, Z.: Topic-aware multi-turn dialogue modeling. In: Proceedings of the AAAI, pp. 14176–14184 (2021) 12. Lin, J., et al.: Topic shift detection in Chinese dialogues: corpus and benchmark. arXiv preprint arXiv:2305.01195 (2023) 13. Wang, X., Li, C., Zhao, J., Yu, D.: NaturalConv: a Chinese dialogue dataset towards multi-turn topic-driven conversation. In: Proceedings of the AAAI, pp. 14006–14014 (2021) 14. Zhang, S., et al.: Personalizing dialogue agents: I have a dog, do you have pets too? In: Proceedings of ACL, pp. 2204–2213 (2018) 15. Budzianowski, P., et al.: MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In: Proceedings of EMNLP, pp. 5016–5026 (2018) 16. Eric, M., Krishnan, L., Charette, F., Manning, C.D.: Key-value retrieval networks for task-oriented dialogue. In: Proceedings of SIGdial, pp. 37–49 (2017) 17. Eisenstein, J., Barzilay, R.: Bayesian unsupervised topic segmentation. In: Proceedings of EMNLP, pp. 334–343 (2008) 18. Du, L., Buntine, W., Johnson, M.: Topic segmentation with a structured topic model. In: Proceedings of NAACL-HLT, pp. 190–200 (2013) 19. Badjatiya, P., Kurisinkel, L.J., Gupta, M., Varma, V.: Attention-based neural text segmentation. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds.) ECIR 2018. LNCS, vol. 10772, pp. 180–193. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-76941-7_14 20. Arnold, S., Schneider, R., Cudré-Mauroux, P., Gers, F.A., Löser, A.: SECTOR: a neural model for coherent topic segmentation and classification. Trans. Assoc. Comput. Linguist. 7, 169–184 (2019) 21. Grootendorst, M.: KeyBERT: minimal keyword extraction with BERT (2020) 22. Xue, L., et al.: mT5: a massively multilingual pre-trained text-to-text transformer. In: Proceedings of NAACL-HLT, pp. 483–498 (2021) 23. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019) 24. Lukasik, M., Dadachev, B., Papineni, K., Simoes, G.: Text segmentation by cross segment attention. In: Proceedings of EMNLP, pp. 4707–4716 (2020) 25. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019)
Recognizing Functional Pragmatics of Chinese Discourses on Data Augmentation and Dependency Graph

Yu Lu1, Feng Jiang2, Xiaomin Chu1, Peifeng Li1, and Qiaoming Zhu1(B)

1 Soochow University, Soochow, China
[email protected], {xmchu,pfli,qmzhu}@suda.edu.cn 2 The Chinese University of Hong Kong, Shenzhen, China [email protected]
Abstract. The study of discourse functional pragmatic structure attaches importance to the function of discourse units. Existing models perform poorly on the functional pragmatics recognition of minority categories and ignore the discourse dependency structure that could enhance the representation of discourse units. To address the above issues, we propose a Functional Pragmatic Recognition model based on Dependency Structure (FPRDS) to recognize the functional pragmatic structures of Chinese discourses. Specifically, we first propose a data augmentation approach based on adversarial training and subtree swapping to enhance the recognition performance of minority categories. Then we use graph convolutional networks to incorporate discourse dependency information to better enhance the representations of discourse units. The experimental results show that our FPRDS outperforms the state-of-the-art models, especially for minority categories.

Keywords: Functional Pragmatics · Discourse Dependency · Data Augmentation · Graph Convolution Networks
1 Introduction

The theory of discourse functional pragmatics [1, 2] assumes that discourse functional pragmatics reflects the role of the discourse unit in the whole discourse. And functional pragmatic analysis can help mine valuable article information for downstream tasks, such as information extraction [3] and text summarization [4]. Dijk's news schema theory [5] describes the role of paragraphs in news articles and lays the foundation for the subsequent research. Based on the theory, Yarlott et al. [6] and Choubey et al. [7] constructed corpora of English news. Chu et al. [2] and Jiang et al. [8] proposed a macro discourse functional pragmatic structure and annotated the Macro Chinese Discourse TreeBank (MCDTB). This paper focuses on recognizing Chinese discourse functional pragmatics, with the purpose of analyzing the role of the discourse units in the article. Our task is defined as identifying the function of each paragraph. Specifically, for an unstructured article containing K paragraphs P = [p_1, p_2, ..., p_K], we train a functional pragmatic recognition model to identify the function of each paragraph Fun = [f_1, f_2, ..., f_K].

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 523–535, 2023. https://doi.org/10.1007/978-981-99-4752-2_43
Fig. 1. Example of discourse in MCDTB corpus (chtb_0307).
In this paper, we take a news report (chtb_0307) from MCDTB, which is shown in Fig. 1, as an example to elaborate on the macro discourse functional pragmatic structure. The functional pragmatic structure of the example in Fig. 1 can be represented as the structure tree shown in Fig. 2. The root node indicates the function of the whole article. The leaf nodes represent the functional pragmatics of each paragraph, while the non-leaf nodes represent the functional pragmatics of the discourse unit formed by all descendant nodes of that node together. The direction of the arrow indicates the Nucleus-Satellite relationship between the two discourse units connected to it. Specifically, the side with the arrow points to the nucleus unit, while the side without the arrow points to the satellite unit. There may also be cases where multiple discourse units are equally important, in which case the non-leaf nodes point to each discourse unit.
Fig. 2. Discourse functional pragmatic structure tree constructed from Fig. 1.
On the one hand, the existing work uses the semantic and positional information of paragraphs to identify functional pragmatics and has not yet utilized the discourse dependency structure. We first incorporate the dependency information of discourse structure into paragraph encoding through graph convolution networks. In this way, we can purposefully strengthen the interaction between dependent paragraphs. On the other
hand, to address the problem that the amount of paired functional pragmatics is relatively small, we propose a method of exchanging paired nodes to obtain augmented texts with different dependency structures, thus helping our model recognize similar texts with different structures. Besides, the semantics of the text may become slightly incoherent due to changes in structure, so adversarial training, which adds noise at the encoding stage, is used to blur some semantic-coherence features. The experimental results show that our model achieves the best performance compared with the state-of-the-art models, and is able to identify some minority classes, such as "Behavior" and "Purpose", which are difficult for existing models to identify.
2 Related Work

2.1 Discourse Functional Pragmatics

The main theories related to discourse functional pragmatics in English include the news schema theory [5], the framing-based approach [9], news structure centered on headlines and introductions [10], and so on. Yarlott et al. [6] and Banisakher et al. [11] applied traditional machine learning methods with some manual features to identify the function of news paragraphs. Choubey et al. [7] proposed a hierarchical Bi-LSTM model combined with an attention mechanism to identify the function of news sentences and serve the event co-reference task. In Chinese, Song et al. [12] annotated the function of sentences in essays and constructed a corpus to serve the essay scoring task. Chu et al. [2] proposed a macro discourse functional pragmatic structure. Based on this structure, Jiang et al. [8] labeled 720 news reports to form the Macro Chinese Discourse Treebank, with a total of 18 functional pragmatic types. Then, Du et al. [13] extended the MCDTB to MCDTB 2.0 with a further 480 news reports and proposed a joint learning model.

2.2 Discourse Dependency Structure

Xiao et al. [14] improved the attention module in the Transformer based on the dependencies of the discourse structure tree, and proposed a document-level discourse-based attention mechanism. In order to capture dependencies between discourse units, Xu et al. [15] transformed the RST discourse structure into discourse dependencies. Huang and Kurohashi [16] proposed a heterogeneous graph-based extractive summarization model that combines discourse dependencies and co-reference relations. Sun et al. [17] proposed a discourse-aware graph neural network for a multi-party conversational sentiment recognition task.

2.3 Data Augmentation

Considering the problem of unbalanced data, we want to expand the number of minority categories through data augmentation. Below are some common text augmentation methods. EDA [18], AEDA [19] and TinyBERT [23] applied character-level operations to the original text to obtain new text. Back-Translation [20] was used to
obtain enhanced texts by translating back and forth from one language to another. Mixup [21] interpolates sample-label pairs to produce new data. VDA [22] used a masked language model combined with Gaussian noise to produce vector representations. In order to make good use of structural information, we propose a method of swapping pairwise functional pragmatics to obtain augmented texts.

3 Model

With the purpose of utilizing discourse dependency structures to facilitate functional pragmatic recognition and alleviating the problem of insufficient data, we propose a Functional Pragmatic Recognition model based on Dependency Structure (FPRDS) in this paper, as shown in Fig. 3. The whole model consists of four components: (1) Data Augmentation: augmented texts are obtained by swapping paired functional pragmatics, and adversarial training is used to help our model better utilize the augmented texts. (2) Text Encoding: XLNet is used to encode the whole article for the sake of capturing character-level contextual information. (3) Interactive Encoding: the Encoder module is applied to enhance paragraph-level encoding interaction. (4) Dependency Graph Encoding: as the dependency graph is constructed, interactions between discourse units where dependencies exist can be purposefully enhanced.
3 Model With the purpose of utilizing discourse dependency structures to facilitate functional pragmatic recognition and alleviating the problem of insufficient data, we propose a Functional Pragmatic Recognition model based on Dependency Structure (FPRDS) in this paper, as shown in Fig. 3. The whole model consists of four components: (1) Data Augmentation: Augmentation texts are obtained by swapping paired functional pragmatics and adversarial training is used to help our model to better utilize augmentation texts. (2) Text Encoding: XLNet is used to encode the whole article for the sake of capturing character-level contextual information. (3) Interactive Encoding: The Encoder module is applied to enhance paragraph-level encoding interaction. (4) Dependency Graph Encoding: As the dependency graph is constructed, interactions between discourse units where dependencies exist could be purposefully enhanced.
Fig. 3. Framework of discourse functional pragmatics recognition.
3.1 Data Augmentation

In the MCDTB corpus, the number of pairwise functional pragmatics is very small and their recognition performance is poor. To make better use of the structural information, we swap paired functional pragmatics to expand the data. Pairs of functional pragmatics usually appear as a whole, and switching their order does not have a significant impact on
the overall flow of the discourse. But swapping other functional pragmatics may cause significant influence. For instance, the three "Situation" paragraphs in Fig. 4 tell three steps of the national women's soccer team and must appear in a fixed order. There are three pairs of functional pragmatics on the MCDTB corpus: "Behavior-Purpose", "Illustration-Statement" and "Cause-Result". For the discourse functional pragmatic structure tree of an article, we process the data from the root node in a top-down manner. If the relationship between the two children of a non-leaf node is paired, the two children are swapped and the Nucleus-Satellite relationship of that non-leaf node is also changed. With the change of the structure tree, the corresponding dependency graph will also change. For example, in the discourse functional pragmatic structure tree shown in Fig. 4 (left), two sub-nodes of the non-leaf node "Story" have the functional pragmatics of "Purpose" and "Behavior" respectively. These two sub-nodes are swapped, and then we can get the new discourse functional pragmatic structure tree shown in Fig. 4 (right). The original P3 to P5 become P2 to P4, and the original P2 becomes P5. The Nucleus-Satellite relationship has also changed from "S-N" to "N-S". Since the original coherent semantics may be slightly damaged due to the altered paragraph order, we utilize PGD [24] to add perturbations to the word embedding layer. With the help of adversarial training, noise added at the encoding end blurs the semantic-coherence features, so that our model can make better use of the augmented text.
Fig. 4. Subtree node swapping
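The top-down swap of paired sub-nodes illustrated in Fig. 4 can be sketched on a toy tree. The dict-based tree representation ("func", "nuc", "children") is our own illustrative assumption, not the MCDTB storage format:

```python
# The three swappable pairs named in Sec. 3.1
PAIRED = {frozenset(p) for p in [("Behavior", "Purpose"),
                                 ("Illustration", "Statement"),
                                 ("Cause", "Result")]}
FLIP = {"N-S": "S-N", "S-N": "N-S"}

def swap_paired(node):
    kids = node.get("children", [])
    if len(kids) == 2 and frozenset((kids[0]["func"], kids[1]["func"])) in PAIRED:
        node["children"] = [kids[1], kids[0]]           # swap the paired children
        node["nuc"] = FLIP.get(node.get("nuc"), node.get("nuc"))  # flip N-S
    for child in kids:
        swap_paired(child)                              # continue top-down
    return node

tree = {"func": "Story", "nuc": "S-N",
        "children": [{"func": "Purpose"}, {"func": "Behavior"}]}
swap_paired(tree)
print([c["func"] for c in tree["children"]], tree["nuc"])
# ['Behavior', 'Purpose'] N-S
```

Non-paired children (such as consecutive "Situation" nodes) are deliberately left untouched, since their order is fixed by the narrative.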
3.2 Text Encoding

To obtain character-level contextual information, XLNet [25] is used to encode the entire document in the text encoding layer. A document containing k paragraphs is denoted as D = {P_1, ..., P_i, ..., P_k}, where P_i = {w_i^1, ..., w_i^j, ..., w_i^n} and n represents the number of tokens contained in the i-th paragraph. We add a <sep> token at the end of each paragraph and a <cls> token at the end of the whole document. Therefore, the input of the encoding layer is I = {P_1 <sep> P_2 <sep> ... P_k <sep> <cls>}. We input I into XLNet and take the vector of the <sep> at the end of each paragraph as the paragraph encoding Q = {Q_1, ..., Q_k} ∈ R^{k×d_h}, where d_h is the hidden layer dimension.
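The <sep>/<cls> placement for the document-level input can be sketched as simple string assembly (actual subword tokenization is handled by the XLNet tokenizer; this only illustrates where the special tokens go):

```python
def build_input(paragraphs):
    # a <sep> after every paragraph, a single <cls> at the document end
    return " ".join(p + " <sep>" for p in paragraphs) + " <cls>"

print(build_input(["P1", "P2", "P3"]))  # P1 <sep> P2 <sep> P3 <sep> <cls>
```

The per-paragraph <sep> positions are the ones whose output vectors are later read off as the paragraph encodings Q_1, ..., Q_k.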
3.3 Interactive Encoding

In this layer, we use the Encoder module of the Transformer [26] to obtain paragraph-level contextual information. After encoding, we get the new paragraph encoding QE = {Q1E, . . . , QkE} ∈ Rk×dh. The Encoder module consists of a positional encoding layer and multiple Encoder layers. The formula for the positional encoding layer is as follows:

S = Dropout(Q + PE(Q))    (1)

where PE(·) is the sinusoidal position embedding. Assuming that the input of the Encoder at the m-th layer is S(m) = {S1(m), . . . , Sk(m)} ∈ Rk×dh, the m-th Encoder layer is designed as follows:

Ui(m) = MulHAtt(Si(m)Wq(m), Si(m)Wk(m), Si(m)Wv(m))    (2)

Si(m+1) = LN(FF(Ui(m)) + LN(Ui(m) + Si(m)))    (3)

where MulHAtt(·) denotes the multi-headed attention mechanism, and LN(·) and FF(·) represent layer normalization and the feed-forward layer, respectively. After M layers of encoding, we obtain QE = S(M+1) ∈ Rk×dh.

3.4 Dependency Graph Encoding

To enhance the interaction between discourse units with dependency relationships, a discourse dependency graph is constructed to store dependency information. The construction of the discourse dependency graph is based on the underlying discourse structure tree. We set the rule that the dependency relation points from the Satellite unit to the Nucleus unit. For two discourse units in an equally important relationship, the dependency relation points from the latter to the former. Besides, multiple discourse units can also be juxtaposed on the MCDTB corpus, so we set a further rule that every such discourse unit except the first points to the previous discourse unit.
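The rules above can be sketched as follows. The dict-based tree encoding (`children`, `nuclearity`, `leaf` fields) is an illustrative assumption, not the paper's data format:

```python
# Sketch of converting a discourse structure tree into dependency edges:
# the Satellite points to the Nucleus, and in multi-nuclear (equally
# important / juxtaposed) relations every unit except the first points to
# the previous one. Leaves carry their paragraph index.

def head(node):
    """Index of the paragraph that heads this subtree (its Nucleus)."""
    if "leaf" in node:
        return node["leaf"]
    for child, ns in zip(node["children"], node["nuclearity"]):
        if ns == "N":
            return head(child)      # first Nucleus child heads the subtree
    return head(node["children"][0])

def edges(node, out):
    if "leaf" in node:
        return out
    ch, ns = node["children"], node["nuclearity"]
    if "S" in ns:                   # Nucleus-Satellite relation
        n = ns.index("N")
        for i, tag in enumerate(ns):
            if tag == "S":
                out.append((head(ch[i]), head(ch[n])))
    else:                           # multi-nuclear: point to previous unit
        for i in range(1, len(ch)):
            out.append((head(ch[i]), head(ch[i - 1])))
    for c in ch:
        edges(c, out)
    return out

def adjacency(tree, k):
    G = [[0] * k for _ in range(k)]
    for i, j in edges(tree, []):
        G[i][j] = 1                 # dependency of P_i pointing to P_j
    return G

# Toy tree: P1 is the Nucleus of the whole text; within the Satellite
# subtree, P3 is the Satellite of P2. Edges: P2 -> P1 and P3 -> P2.
tree = {"children": [{"leaf": 0},
                     {"children": [{"leaf": 1}, {"leaf": 2}],
                      "nuclearity": ["N", "S"]}],
        "nuclearity": ["N", "S"]}
G = adjacency(tree, 3)
```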
Fig. 5. Discourse Tree to Dependency Graph
With these rules, the functional pragmatic structure tree can be transformed into the discourse dependencies shown in Fig. 5. An adjacency matrix is used to preserve
Recognizing Functional Pragmatics of Chinese Discourses
the dependencies between the nodes of the directed graph in this paper: first, we construct an all-zero square matrix G of order K, where K is the number of leaf nodes in the discourse structure tree. Then, if there is a dependency relationship between two leaf nodes Pi and Pj with Pi pointing to Pj, we set G[i][j] = 1. The utilization of discourse dependencies further enhances the interaction between related paragraphs, i.e., structural and semantic information can be obtained from passages with dependencies. In addition, since the Encoder lacks information interaction between distant paragraphs, we enhance this interaction with a discourse graph encoder (DGE). For the constructed dependency graph G = (N, E), a node in N represents a paragraph of the article and a directed edge in E represents the dependency between the two nodes it connects. After inputting the paragraph encoding Q into the DGE module, a graph convolutional network [27] is used to update the encoding vector of each paragraph according to the dependency graph. Finally, we obtain the final paragraph encoding QG = {Q1G, . . . , QkG} ∈ Rk×dh, which incorporates the structural dependency information.

The DGE module consists of multiple DGE layers. Suppose the input of the n-th DGE layer is Q(n) = {Q1(n), . . . , Qk(n)} ∈ Rk×dh and the output of this layer is Q(n+1) = {Q1(n+1), . . . , Qk(n+1)} ∈ Rk×dh. The formulas for the n-th layer are as follows:

Vi(n) = LN(Qi(n) + Dropout(FF(Qi(n))))    (4)

Wi(n) = ReLU( (1/|Gi|) Σ_{j∈Gi} W1(n)Vj(n) + b1(n) )    (5)

Qi(n+1) = LN(Dropout(Wi(n)) + Vi(n))    (6)

where ReLU(·), LN(·), and FF(·) denote the activation function, layer normalization, and feed-forward layer, respectively, and Gi represents the set of all nodes in the graph that have a dependency relationship with the i-th node. The input of the first DGE layer is Q = {Q1, . . . , Qk} ∈ Rk×dh. After N layers of graph encoding, we obtain QG = Q(N+1) = {Q1(N+1), . . . , Qk(N+1)} ∈ Rk×dh.

3.5 Training and Inference

After obtaining the interactive encoding QE and the dependency graph encoding QG, we concatenate the two encodings and feed them to a linear layer for prediction, obtaining the prediction vectors pred = {pred1, . . . , predk} ∈ Rk×dc:

predi = Linear(Tanh(Concat(QiE, QiG)))    (7)

where Concat(·) denotes vector concatenation, and Linear(·) and Tanh(·) denote the linear layer and activation function, respectively.

The use of discourse structure trees differs between the training and inference stages. In the training stage, our model uses a discourse dependency graph transformed
by a standard discourse structure tree. In the testing stage, the discourse structure parsing model of [28] is used to parse the discourse, and the generated discourse structure tree is transformed into a discourse dependency graph.
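As a concrete illustration of Eq. (7), here is a pure-Python sketch of the prediction head; the weights and dimensions are illustrative stand-ins, not trained parameters:

```python
# Sketch of Eq. (7): the interactive encoding Q_i^E and the graph encoding
# Q_i^G are concatenated, passed through Tanh, and a linear layer scores
# the d_c functional-pragmatics classes for paragraph i.
import math

def predict(qe_i, qg_i, W, b):
    x = [math.tanh(v) for v in qe_i + qg_i]          # Tanh(Concat(.))
    return [sum(w * v for w, v in zip(row, x)) + bi  # Linear(.)
            for row, bi in zip(W, b)]

qe_i, qg_i = [0.2, -0.1], [0.5, 0.3]
W = [[1.0, 0.0, 0.0, 0.0],   # toy setup: d_c = 2 classes, d_h = 2
     [0.0, 1.0, 0.0, 0.0]]
b = [0.0, 0.0]
scores = predict(qe_i, qg_i, W, b)   # argmax gives the predicted class
```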
4 Experimentation

4.1 Experimental Settings

In this paper, the recognition performance of Chinese discourse functional pragmatics is evaluated on the macro discourse corpus MCDTB 2.0 with Micro-F1 and Macro-F1 scores. The corpus contains a total of 1200 news reports, 720 from the Penn Discourse Treebank and another 480 from Gigaword 2.0. A total of 6,763 functional pragmatics of paragraphs are annotated, covering 15 functional types. Following Du, 80%/20% of the samples were chosen as the training/validation set, i.e., the training/validation set has 960/240 texts, respectively. The model is based on the PyTorch framework, the learning rate is set to 1e-5, the number of training epochs is 15, and the dropout is set to 0.5. XLNet-base is used in the Text Encoding module, and the parameters of XLNet are fine-tuned with the Adam optimizer during the training stage. The number of Encoder layers is 12 and the head number of the multi-headed attention mechanism in the Encoder is set to 8. The number of layers of the graph convolution network is 2.

4.2 Experimental Results

To verify the effectiveness of the model, we compare it with existing state-of-the-art models as follows:

Song [12]: a sequence labeling model that identifies the function of sentences, consisting of a hierarchical Bi-LSTM module and different attention mechanisms.

Choubey [7]: a model that uses a hierarchical Bi-LSTM to obtain character-level, sentence-level, and discourse-level interaction information, and a classifier to identify the functional pragmatics of sentences.

Du [13]: a joint learning model that combines the text segmentation and functional pragmatics recognition tasks. In view of the news genre, it utilizes text segmentation information and location information.

XLNet+Bi-LSTM: XLNet is used as the encoder to get character representations, and a Bi-LSTM with an attention mechanism is applied to obtain paragraph representations used to identify the functional pragmatics.

XLNet+Encoder: XLNet is utilized as the encoder to get paragraph encodings, and the interaction of paragraph encodings is enhanced by the Transformer Encoder.

The results of the experiments are shown in Table 1; our FPRDS model achieves the best performance on both Micro-F1 and Macro-F1, with 71.42 and 35.11, respectively. Compared to Choubey's model, our FPRDS model improves by 6.29 and 16.38 on Micro-F1 and Macro-F1, respectively. The improvement is 6.39 on Micro-F1 and 9.04 on Macro-F1 when compared to the better-performing
Table 1. Results of different models. We used the t-test with a 95% confidence interval for the significance test, and all improvements of our FPRDS over Du and XLNet + Encoder are significant (p < 0.02).

Model            Micro-F1  Macro-F1
SVM              52.23     16.72
Song             64.35     18.27
Choubey          65.13     18.73
Du               68.19     22.63
XLNet + BiLSTM   64.36     25.05
XLNet + Encoder  65.03     26.07
FPRDS (our)      71.42     35.11
-DA              70.46     32.80
-DA-DGE          68.33     26.56
XLNet+Encoder model. Even compared to the state-of-the-art model, Du, our model still improves by 3.23 and 12.48 on Micro-F1 and Macro-F1. In contrast to these baseline models, our FPRDS model encodes the entire article rather than segments, so it can obtain character-level contextual information not only from the paragraph but also from the whole article. We also make full use of discourse dependency structure information to purposefully strengthen the interaction between related paragraphs. Additionally, to make better use of structural information, this paper further improves the model's ability to recognize functional pragmatics in articles with different structure trees by combining text augmentation with adversarial training.

4.3 Analysis

The end of Table 1 shows the results of the ablation experiments in this paper. FPRDS denotes our complete model, -DA denotes removing text augmentation and adversarial training from our model, and -DA-DGE further indicates that XLNet encodes the entire article and only the Encoder is used to obtain paragraph-level interaction information. As can be seen in Table 1, the performance of the -DA model improves over that of the -DA-DGE model by 2.13 on the Micro-F1 score and by 6.24 on the Macro-F1 score. Compared to the -DA model, the FPRDS model improves by 0.96 and 2.31 on the Micro-F1 and Macro-F1 scores, respectively.

Table 2 shows the performance on functional pragmatics belonging to the minority categories; the number in brackets is the percentage of that functional pragmatics on the MCDTB 2.0 corpus. As can be seen in Table 2, the FPRDS model improves on the vast majority of classes compared to the -DA-DGE model, indicating that the use of structural dependency information and structure-specific data augmentation in this paper is effective.
Compared with the -DA model, the FPRDS model identifies the "Behavior" category, which the -DA model fails to identify, and improves on the
"Purpose", "Statement", and "Cause" categories. This shows that the data augmentation approach in this paper for pairwise functional pragmatics is indeed effective.

Table 2. Performance improvement of different functional pragmatics.

Model     Behavior (0.35%)  Purpose (0.55%)  Illustration (1.27%)  Statement (1.32%)
FPRDS     22.22             30.00            21.05                 46.15
-DA       0                 26.27            26.67                 37.04
-DA-DGE   0                 0                0                     25.00

Model     Cause (0.35%)     Result (0.55%)   Sub-Summary (1.27%)   Supplement (1.32%)
FPRDS     20.83             5.88             75.82                 51.81
-DA       16.00             5.66             68.97                 47.98
-DA-DGE   20.88             5.71             53.33                 44.31
Besides the pairwise functional pragmatics, the "Sub-Summary" category shows a remarkable improvement. The -DA model misidentifies 17.4% of "Sub-Summary" instances as "Situation" and the -DA-DGE model up to 34.9%, while FPRDS reduces the proportion to 11.6%. By observing the structure tree and the discourse dependency graph, we find that the "Sub-Summary" class often appears as the first leaf node among the descendants of a non-leaf node whose functional pragmatics is paired. As shown in the tree structure on the left side of Fig. 6, "Cause" is located at a non-leaf node, while "Sub-Summary" is the first leaf node of all descendants of "Cause". Looking at the dependency graph on the left side of Fig. 6, we can see that "Sub-Summary" is at the center of P2–P4. Comparing the -DA-DGE and -DA models, the F1-score of "Sub-Summary" increases by 15.64 with the help of the discourse dependency structure. After the nodes are swapped, "Sub-Summary" remains at the core of P2–P4 in the dependency graph, which differs from the original one (bottom right of Fig. 6). Thus, when our data augmentation approach produces texts with different structures, the "Sub-Summary" class always remains at the structural core. Most of the classes are improved, but the performance on "Illustration" is reduced because the FPRDS model identifies more "Illustration" instances as "Situation". By definition, an "Illustration" lists events to illustrate the object of a "Statement", while the "Situation" category is defined as a detailed description of an event or episode; it is therefore often difficult to distinguish between the two. Moreover, the adversarial training adds noise that blurs the semantic features, which further increases the recognition difficulty.
Fig. 6. Structure tree and dependency graph before and after pairwise functional pragmatics swapping.
5 Conclusion

In this paper, we incorporate discourse dependency structure into functional pragmatics recognition for the first time and propose the FPRDS model. Our model considers character-level and paragraph-level contextual information and incorporates dependency structure information, greatly enriching the contextual information at different granularities of discourse. In addition, we improve the model's ability to recognize functional pragmatics in different structures by augmenting texts with different structures and applying adversarial training. Experiments show that our FPRDS model achieves the best performance on both Micro-F1 and Macro-F1 scores. Our model is able to identify the minority classes that the baseline models cannot identify and shows significant improvement on the vast majority of classes. In future work, it remains a great challenge to improve the categories that have a significant amount of data but poor recognition.

Acknowledgements. The authors would like to thank the two anonymous reviewers for their comments on this paper. This research was supported by the National Natural Science Foundation of China (Nos. 62276177 and 61836007) and a Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
References

1. Halliday, M.A.K., Hasan, R.: Cohesion in English. Routledge (2014)
2. Chu, X., Xi, X., Jiang, F., Xu, S., Zhu, Q., Zhou, G.: Research of macro discourse structure representation schema and resource construction. J. Software 31(2), 321–343 (2019)
3. Zou, B., Zhou, G., Zhu, Q.: Negation focus identification with contextual discourse information. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 522–530 (2014)
4. Cohan, A., Goharian, N.: Scientific article summarization using citation-context and article's discourse structure. In: Empirical Methods in Natural Language Processing, pp. 390–400 (2015)
5. Dijk, T.A.V.: News as Discourse. University of Groningen (1988)
6. Yarlott, W.V., Cornelio, C., Gao, T., Finlayson, M.: Identifying the discourse function of news article paragraphs. In: Proceedings of the Workshop Events and Stories in the News, pp. 25–33 (2018)
7. Choubey, P.K., Lee, A., Huang, R., Wang, L.: Discourse as a function of event: profiling discourse structure in news articles around the main event. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5374–5386 (2020)
8. Jiang, F., Xu, S., Chu, X., Li, P., Zhu, Q., Zhou, G.: MCDTB: a macro-level Chinese discourse treebank. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 3493–3504 (2018)
9. Pan, Z., Kosicki, G.M.: Framing analysis: an approach to news discourse. Polit. Commun. 10(1), 55–75 (1993)
10. White, P.R.: Telling media tales: the news story as rhetoric. Department of Linguistics, Faculty of Arts, University of Sydney (1998)
11. Banisakher, D., Yarlott, W.V., Aldawsari, M., Rishe, N., Finlayson, M.: Improving the identification of the discourse function of news article paragraphs. In: 1st Joint Workshop on Narrative Understanding, Storylines, and Events, pp. 17–25 (2020)
12. Song, W., Song, Z., Fu, R., Liu, L., Cheng, M., Liu, T.: Discourse self-attention for discourse element identification in argumentative student essays. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 2820–2830 (2020)
13. Du, M., Jiang, F., Chu, X., Li, P.: Discourse functional pragmatics recognition based on news schemata. In: Proceedings of the 21st Chinese National Conference on Computational Linguistics, pp. 120–131 (2022)
14. Xiao, W., Huber, P., Carenini, G.: Do we really need that many parameters in transformer for extractive summarization? Discourse can help! In: Proceedings of the First Workshop on Computational Approaches to Discourse, pp. 124–134 (2020)
15. Xu, J., Gan, Z., Cheng, Y., Liu, J.: Discourse-aware neural extractive text summarization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5021–5031 (2019)
16. Huang, Y.J., Kurohashi, S.: Extractive summarization considering discourse and coreference relations based on heterogeneous graph. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 3046–3052 (2021)
17. Sun, Y., Yu, N., Fu, G.: A discourse-aware graph neural network for emotion recognition in multi-party conversation. In: Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2949–2958 (2021)
18. Wei, J., Zou, K.: EDA: easy data augmentation techniques for boosting performance on text classification tasks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 6382–6388 (2019)
19. Karimi, A., Rossi, L., Prati, A.: AEDA: an easier data augmentation technique for text classification. In: Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2748–2754 (2021)
20. Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 86–96 (2015)
21. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. In: International Conference on Learning Representations (2018)
22. Zhou, K., Zhao, W.X., Wang, S., Zhang, F., Wu, W., Wen, J.-R.: Virtual data augmentation: a robust and general framework for fine-tuning pre-trained models. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3875–3887 (2021)
23. Jiao, X., et al.: TinyBERT: distilling BERT for natural language understanding. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4163–4174 (2019)
24. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: International Conference on Learning Representations (2018)
25. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst. 32, 13063–13075 (2019)
26. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017)
27. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations (2016)
28. Zhang, L., Kong, F., Zhou, G.: Adversarial learning for discourse rhetorical structure parsing. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 3946–3957 (2021)
Simple but Effective: Keyword-Based Metric Learning for Event Sentence Coreference Identification

Tailai Peng1,2,3(B), Rui Chen1,2,3, Zhe Cui1,2,3, and Zheng Chen1,2,3

1 Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610041, China
[email protected]
2 University of Chinese Academy of Sciences, Beijing 101408, China
3 School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
Abstract. Event sentence coreference identification (ESCI) is a fundamental task of news event detection and tracking which aims to group sentences according to events they refer to. Most recent efforts address this task by means of identifying coreferential event sentence pairs. Currently, frameworks based on pre-trained language models like Sentence-BERT (SBERT) are widely used for sentence pair tasks. However, SBERT lacks keyword awareness, while the local features of sentences can demonstrate a strong correlation with the event topic. In addition, the strategy of encoding the whole sentence is less flexible and more time-consuming. After reconsidering the significance of keywords in ESCI task, we propose KeyML, a simple keyword-based metric learning approach which leverages both lexical and semantic features of keywords to capture subject patterns of events. Specifically, a Siamese network is adapted to optimize distance metrics of keyword embeddings, resulting in more separable similarity of event sentence pairs. Then, KeyML considers keywords of data with different granularity and exploits three training strategies, along with their corresponding sampling methods, to investigate co-occurrence relationships. Experimental results show that KeyML outperforms SBERT and SimCSE on three datasets and demonstrate the effectiveness and rationality of our method. Keywords: Event Sentence Coreference Identification · Metric Learning · Siamese Network · Attention Mechanism
1 Introduction

News reports about a specific event tend to pop up on different news sites at different times, isolated and scattered. Thus, it is inconvenient to get a comprehensive view of the coverage of the same event in a mass of news. In this regard, the task of event sentence coreference identification (ESCI) aims to group sentences according to the events contained in them. A good grouping result can benefit many real-world applications, including paraphrase mining, subject discovery [1], event extraction and summarization [2], causal relation extraction [3], and news recommendation [4].

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 536–550, 2023. https://doi.org/10.1007/978-981-99-4752-2_44
Given that events reported in sentences usually contain specific information such as times, places, and participants, keywords can be fully utilized for event sentence discrimination. Traditional lexical methods such as TF-IDF [5] and LDA [6] demonstrate the importance of using the co-occurrence relationships among words and topics for text clustering, but their limitation is that they neglect the semantic similarity of words. RNN-based models like LSTM [7] and GRU [8] can encode texts and provide more comprehensive semantic information. However, RNNs' limited ability to capture long-distance dependencies [9] and their inability to compute in parallel may hurt both performance and time complexity. To overcome the shortcomings of RNN-based models, pre-trained language models like BERT [10], built on the Transformer [11], have been proposed and achieve outstanding results. Yet BERT is unsuitable for large-scale semantic comparison tasks such as ESCI, due to the rather poor sentence embeddings it yields and the excessive inference computation caused by pairing sentences. Reimers and Gurevych [12] present Sentence-BERT (SBERT), which utilizes a Siamese network structure [13] to derive fixed-length sentence embeddings with higher semantic discrimination, enabling semantic differences between sentences to be compared by cosine similarity. In ESCI, however, keywords are often closely related to the event topic, so using the embedding of the whole sentence may introduce more noise. To this end, we propose a simple but effective strategy for the ESCI task that optimizes word-level semantic correlation and groups event sentences using both lexical and semantic features of keywords, which we refer to as Keyword-based Metric Learning (KeyML). Our proposed method mainly focuses on utilizing keyword information.
Through metric learning on keywords, the model can not only capture the key information but also learn the co-occurrence relationships of keywords under different event topics. More importantly, KeyML has a low demand for training data, which keeps it robust when labeled data is insufficient. Following previous work [14–17], we use Accuracy (ACC) and Normalized Mutual Information (NMI) to assess the performance of KeyML on three experimental settings based on GoogleNews-TS [18]. It turns out that KeyML achieves better performance and less training time than SBERT, and achieves results comparable to SimCSE. In particular, when trained with only 10% of the data from GoogleNews-TS, KeyML improves by 9% on NMI and 9.8% on ACC compared to SBERT. The contributions of our work are summarized as follows: (1) We propose a simple but effective approach for event sentence coreference identification, which leverages both lexical and semantic features to optimize distance metrics of keywords. (2) We explore three metric learning strategies for KeyML, showing experimentally that the single-keyword-based strategy is far more effective than the others. This highlights the need to rethink and re-evaluate simplistic, traditional keyword-based methods in this day and age. (3) We provide in-depth analysis and demonstrate how KeyML effectively captures event features through keywords to achieve smaller intra-class distances and larger inter-class distances.
2 Related Work

Event sentence coreference identification (ESCI) was introduced as a shared task in the workshop on Automated Extraction of Sociopolitical Events from News (AESPEN) at the Language Resources and Evaluation Conference (LREC 2020) [19]. The original task clusters the sentences in a given news article into groups. Like that task, we aim to cluster event sentences that refer to the same event; the difference is that we recognize further reports on known events by utilizing their existing news sentences, which is necessary when only limited relevant reports are accessible. To the best of our knowledge, relatively little work has focused on the ESCI task. On the similar task of event coreference resolution, many studies calculate the similarity between event mentions in the classification stage and are very close to our work; some closely related methods for event coreference resolution are therefore also included in the following survey. Building on traditional approaches such as TF-IDF, early studies [20, 21] utilize lexical features like string matching and basic semantic features like the bag-of-words model to resolve event coreference. With advances in word embedding methods [10, 22], many studies have adopted semantic similarities as important features to train classifiers over event mentions or sentences. Örs et al. [23] deconstructed the task into two steps: predicting whether a sentence pair is coreferential, and using the predictions as scores for clustering. This approach is also common in other event coreference resolution work [24]. Following this work, Tan et al. [25] employed the same process with richer BERT-based text embeddings and used linguistic similarity measures for sentence pairs. Lu et al. [26] adopted SpanBERT [27] to learn similar span representations of two coreferential entity mentions.
Following the emergence of the pre-trained language model BERT, SBERT, based on the Siamese BERT architecture, was proposed to alleviate BERT's low inference efficiency for semantic comparison. Our work is motivated by SBERT: the successful application of the metric learning framework to sentence pair tasks creates new opportunities for ESCI research.
3 Keyword-Based Metric Learning

3.1 Overview

Problem Statement. Let S = {s1, . . . , sN} be a sentence set and E = {e1, . . . , eM} its corresponding event set, where N and M are the numbers of sentences in S and events in E, respectively. Each event ej is associated with a certain number of sentences in S. The goal of ESCI is to group together the event sentences that refer to the same event. In our KeyML method, we formulate ESCI as a pairwise similarity identification task.

Importance of Keywords. Despite the prevailing preference for overall semantic representations in the era of deep learning, particular tasks may benefit from an emphasis on specific keywords for more robust feature representation. For example, people typically rely on key terms such as named entities or noun phrases within a sentence to determine which event the sentence is related to, without paying excessive attention to other
information. Therefore, we have reconsidered the importance of keywords in ESCI and proposed a keyword-based metric learning architecture.
Fig. 1. The overall architecture of KeyML, composed primarily of a Siamese network.
Architecture. The overall architecture of our proposed KeyML method is illustrated in Fig. 1. In the training stage, we first extract the core keywords of each known event cluster and event sentence with the statistical method TF-IDF. Training sample pairs are then constructed for three different metric learning (ML) strategies, as presented in Sect. 3.5. Second, we utilize a Siamese BERT with shared weights as the keyword-based encoder to extract hidden contextual representations, and aggregate the keyword representations via mean pooling. Finally, a classifier takes the output representations of a sample pair and their element-wise difference to make predictions. In the test stage, the keywords of each sentence are likewise extracted first and then encoded by the trained Siamese BERT, together with the keyword sets of the known sentences of each event. After aligning the embedding pairs, we calculate the cosine similarity and apply a clustering method to the unknown sentences.
3.2 Keywords Extraction

Assume that each event corresponds to a set of core keywords. We first collect all relevant event sentences of each event ej to form its sentence group Gj. The sentences in Gj are then considered as a whole to compute the TF-IDF values of all words. We manually set
a threshold δ, extract the words whose TF-IDF values are greater than δ, and regard the set of extracted keywords KeyGj = {kw̃1, kw̃2, . . . , kw̃k} as the k core keywords of the group Gj. In addition, we record the core keywords of all event sentence groups in a keyword dictionary; the keywords of each news sentence si are then obtained from the dictionary and denoted as KeySi = {kw1, kw2, . . . , kwl}.
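The extraction step can be sketched as follows; the whitespace tokenization, the raw TF-IDF formula, and the threshold value are illustrative simplifications of the statistical method described above:

```python
# Sketch of the keyword-extraction step: each event's sentence group is
# treated as one document, TF-IDF is computed over the groups, and words
# scoring above a threshold delta become the group's core keywords.
import math
from collections import Counter

def core_keywords(groups, delta):
    docs = [" ".join(g).lower().split() for g in groups]
    df = Counter(w for d in docs for w in set(d))      # document frequency
    keys = []
    for d in docs:
        tf = Counter(d)
        scores = {w: (tf[w] / len(d)) * math.log(len(docs) / df[w]) for w in tf}
        keys.append({w for w, s in scores.items() if s > delta})
    return keys

groups = [["the team won the match", "the match was close"],
          ["the new policy passed", "the policy vote passed"]]
keys = core_keywords(groups, delta=0.05)
# words shared by every group ("the") get idf = 0 and are filtered out
```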
3.3 Metric Learning Framework

Deep metric learning is an approach that uses neural networks to learn a representation function mapping objects into a new embedding space while preserving the similarity among them. A basic architecture for implementing metric learning is the Siamese network, a symmetric neural network consisting of two sub-networks with the same parameters. The pipeline of deep metric learning mainly consists of three parts: 1) preparation of input sample pairs; 2) design of the network model structure; 3) choice or design of the metric loss function. In this work, correspondingly, we i) design the sampling method for positive and negative sample pairs; ii) adapt the Siamese network SBERT as the backbone model; and iii) use the cross-entropy loss function to optimize the Softmax classifier.

3.4 Three ML Strategies

We formulate three metric learning strategies based on different levels of granularity. The most coarse-grained and intuitive one is KeyML-Set, which directly measures the distance between the keyword sets of two event sentences. Considering that the importance of the same keyword changes in different event contexts and keyword weights need to be learned, an attention mechanism between words is introduced in KeyML-Sent. Finally, to explore the influence of the distance between a single keyword and others, we propose KeyML-Single, which applies metric learning only at the word granularity.

KeyML-Set: In general, the descriptions of two events usually contain different sets of keywords. As shown in Fig. 2, we regard each event sentence group as a whole which contains the keyword sets of the corresponding event sentences as well as a core keyword set. During training, positive and negative sample pairs are formed as described in Sect. 3.5. For a given keyword set {kw1, kw2, . . . , kwn}, we first use BERT to encode the keywords separately to obtain the embedding set {h1, h2, . . . , hn}.
To measure the distance between two event sentence groups, the representation of a keyword set is aggregated by an additional mean pooling layer:

hi = pool1(BERT(kwi)),   (1)

Then another mean pooling function is applied over these vectors to obtain the global key information representation hg:

hg = pool2(h1, h2, . . . , hn).   (2)
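Equations (1)–(2) amount to two nested mean-pooling steps, which can be sketched with NumPy (the constant token arrays below are placeholder BERT outputs, not real encodings):

```python
import numpy as np

def mean_pool(x, axis=0):
    return x.mean(axis=axis)

# Hypothetical token-level BERT outputs for n = 3 keywords,
# each with a different number of word-piece tokens (768-dim).
kw_tokens = [np.ones((4, 768)), 2 * np.ones((2, 768)), 3 * np.ones((5, 768))]

# Eq. (1): pool over each keyword's tokens -> one vector h_i per keyword.
h = np.stack([mean_pool(t) for t in kw_tokens])

# Eq. (2): pool over the keyword vectors -> global representation h_g.
h_g = mean_pool(h)
```

Because both steps are plain averages, `h_g` is invariant to the order of the keywords in the set.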
[Figure 2 panels: (a) Training structure of KeyML-Set; (b) Training structure of KeyML-Sent; (c) Training structure of KeyML-Single. Each panel feeds keyword pairs through Siamese BERT encoders with mean pooling and passes (hg1, hg2, |hg1 − hg2|) to a classifier.]
Fig. 2. Training structures of three KeyML strategies. All three architectures are based on the Siamese BERT network and trained with a linear classification head.
KeyML-Sent: Consistent with the idea of KeyML-Set, we construct training samples based on sets of keywords. As shown in Fig. 2, the only difference is the introduction of inter-word attention, achieved by concatenating the keywords in each set into a sentence in order. For a keyword set {kw1, kw2, . . . , kwn}, the reconstructed sentence is:

Sent = Concatenation(kw1, kw2, . . . , kwn).   (3)

After encoding by BERT, the global key information hg is obtained by mean pooling:

hg = pool1(BERT(Sent)).   (4)
Though the reconstructed sentence lacks grammatical integrity, the model can still capture the co-occurrence relationships of keywords through training.

KeyML-Single: To capture the co-occurrence patterns between an individual keyword and other relevant terms in specific event contexts, sample pairs consisting of two independent keywords are used to learn the correspondence between keywords and events. As shown in Fig. 2, the embedding of each keyword is obtained by BERT with mean pooling across all its tokens:

hg = pool1(BERT(kw)).   (5)
Through the proposed approach, we can perform word-level metric learning that effectively pulls the core keywords of the same event closer together during training, while pushing the core keywords of different events apart. We discuss this further in Sect. 5.2.
T. Peng et al.
The inference stage of all three strategies, as illustrated in Fig. 3, exclusively uses the global key information representation hg for cosine similarity comparisons, with no linear layers involved. Additionally, KeyML-Single and KeyML-Set share the same inference framework.

[Figure 3 panels: (a) Inference structure of KeyML-Set and KeyML-Single; (b) Inference structure of KeyML-Sent. Keywords are extracted from each sentence, encoded by BERT with mean pooling, and the resulting hg1 and hg2 are compared by cosine similarity.]
Fig. 3. Inference structures of three KeyML strategies. The linear layer is removed, and only hg is utilized for cosine similarity comparisons.
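The inference stage described above reduces to a cosine-similarity comparison between the two hg vectors; a minimal sketch (the 0.5 decision threshold is illustrative, not taken from the paper):

```python
import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# At inference the classifier head is discarded: two sentences are judged
# to describe the same event when the cosine similarity of their global
# key representations h_g exceeds a decision threshold.
hg1 = np.array([0.2, 0.9, 0.1])
hg2 = np.array([0.25, 0.85, 0.05])
same_event = cosine_sim(hg1, hg2) > 0.5
```

In a clustering setting, the same scores can feed any similarity-based grouping procedure.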
3.5 Positive and Negative Sampling

KeyML-Set and KeyML-Sent: For KeyML-Set, we use two methods to form positive samples. First, two sentence keyword sets related to the same event constitute a positive sample pair. Second, for an event sentence group Gi with core keywords KeyGi and contained sentences {Si}, KeyGi and each keyword set of the sentences in {Si} form positive sample pairs. Conversely, two core keyword sets or two sentence keyword sets related to different events are used to create negative pairs. The sole difference between KeyML-Sent and KeyML-Set is that each set of keywords in a sample pair is concatenated into a sentence in KeyML-Sent (Fig. 4).

KeyML-Single: Three methods are employed to construct positive samples for the single-keyword strategy. First, any two words in the same core keyword set KeyGi are regarded as a positive sample pair. Second, any two words in the same sentence keyword set also form a positive sample pair. Third, for the i-th event, a single word from KeyGi and another word from a relevant sentence form a positive sample pair. With regard to negative samples, any two keywords drawn from KeyGi and KeyGj, respectively, constitute a negative sample pair.
[Figure 4 panels: (a) KeyML-Single; (b) KeyML-Set; (c) KeyML-Sent. Legend: E = encoder; positive and negative instances are sampled from keywords, sentences, and clusters as described in Sect. 3.5.]
Fig. 4. Sampling methods of three KeyML strategies.
4 Experiments

4.1 Data and Baselines

The dataset we use is GoogleNews-TS [17], which contains titles and snippets of 11,108 news articles related to 152 events. Since models that perform well with only a small amount of training data are more robust in practice, we divide GoogleNews-TS into three sub-datasets: GoogleNews-TS-10 (TRAIN: 10%, TEST: 90%), GoogleNews-TS-20 (TRAIN: 20%, TEST: 80%), and GoogleNews-TS-30 (TRAIN: 30%, TEST: 70%). "Largest" and "Smallest" denote the sizes of the largest and smallest cluster in the different sub-datasets, and "Mean" denotes the average size of all clusters. Each cluster is divided so that all 152 events are covered in both the training and test data.

Since ESCI is a relatively new task with few baselines, we compare our method with the following models by reproducing them on the ESCI task. SBERT [12] uses a Siamese network to derive semantically meaningful sentence embeddings by mapping sentences to a vector space suitable for common similarity measures. SimCSE [28] achieves data augmentation through dropout, constructs sample pairs for contrastive learning, and thereby improves sentence representations.

4.2 Experimental Setup

We assemble a Siamese bert-base-uncased model to obtain 768-dimensional token-level embeddings. During training, the dropout rate on BERT embeddings is 0.5, and we use the AdamW optimizer [29] with a learning rate of 2 × 10⁻⁵ and an epsilon of 1 × 10⁻⁶. Batch shuffling is applied to the training set. All linear layers are initialized by the default init weight function in the transformers package. The experiments are conducted on two NVIDIA GeForce RTX 3090 GPUs using PyTorch 1.10 on Ubuntu 18.04. Since keywords are encoded separately in the KeyML-Set strategy, we use padding, masking, and batched matrix operations to improve parallelism in the implementation.
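The per-cluster split that keeps all 152 events in both partitions can be sketched as below; the paper does not specify its exact split procedure, so this is only one plausible construction:

```python
import random

def split_clusters(clusters, train_frac=0.1, seed=0):
    """Split each event cluster so that every event appears in both the
    training and test data (sketch of the GoogleNews-TS-10/20/30 setup)."""
    rng = random.Random(seed)
    train, test = [], []
    for event_id, sents in clusters.items():
        sents = sents[:]
        rng.shuffle(sents)
        # Keep at least one sentence per event on each side.
        k = max(1, int(round(train_frac * len(sents))))
        k = min(k, len(sents) - 1)
        train += [(event_id, s) for s in sents[:k]]
        test += [(event_id, s) for s in sents[k:]]
    return train, test
```

With `train_frac` set to 0.1, 0.2, or 0.3 this mirrors the three sub-dataset ratios described above.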
4.3 Main Results

Table 1 shows our main results on the three datasets. We first compare our three proposed metric learning methods with the baselines, which capture information about whole sentences. The results achieved by SBERT are already strong, and SimCSE, with its improved text representation, outperforms SBERT. Our keyword-based strategies achieve competitive or even better results in most of the experimental settings. Compared with SBERT, KeyML-Single significantly improves the performance, with an accuracy improvement of 4.17% on average and an NMI improvement of 3.77. Meanwhile, KeyML-Single outperforms SimCSE by 0.43% on ACC and achieves comparable results on NMI on average.

In the comparison of the three KeyML methods, KeyML-Sent with inter-word attention performs better than the basic KeyML-Set, indicating that introducing learnable weights on keywords is helpful. KeyML-Single has the best performance, possibly because metric learning on keyword embeddings and cluster centers indirectly completes the weight adaptation, and can effectively capture the importance of keywords in different event contexts.

Table 1. The overall results on three datasets.

Methods                    GNTS10        GNTS20        GNTS30
                           ACC    NMI    ACC    NMI    ACC    NMI
SBERT          –           84.6   86.8   93.2   94.7   94.7   96.2
SimCSE         –           93.8   95.6   94.8   96.4   95.1   96.8
KeyML-Set      δ = 0.25    89.0   90.5   92.9   94.2   94.3   95.5
               δ = 0.2     88.7   90.2   92.8   93.9   93.7   95.2
               δ = 0.15    84.8   86.4   91.8   93.1   93.9   95.3
KeyML-Sent     δ = 0.25    90.5   92.6   93.5   94.7   93.8   95.2
               δ = 0.2     90.8   92.6   93.8   95.0   94.2   95.6
               δ = 0.15    90.1   91.9   93.3   94.8   94.2   95.6
KeyML-Single   δ = 0.25    92.4   94.0   93.8   95.3   94.3   95.7
               δ = 0.2     93.8   95.3   95.0   96.2   95.0   96.3
               δ = 0.15    94.3   95.7   95.2   96.5   95.5   96.8
In addition, all three KeyML methods outperform baselines on GNTS10 with only 10% training data. This implies that keyword-based metric learning has less dependence on the amount of training data, so it is more suitable for practical applications. The above results provide strong evidence for the effectiveness of keywords, indicating that simple and direct keyword information has an advantage over the global information of sentences in specific tasks such as ESCI. Additionally, the results suggest that finer granularity of the extracted local information during training leads to better performance of the model.
5 Analysis

5.1 Comparison of Sentence Features

KeyML-Single Leads to Better Separated and Less Dispersed Groups. To further investigate why KeyML-Single performs so well, we compute both the intra-class distance and the inter-class distance in the representation space and visualize the embedded results with t-SNE [30]. For a given group of event sentences, the intra-class distance is the average Euclidean distance between the centroid and all of the event sentences within the group, while the inter-class distance is the average Euclidean distance between the centroid of the given group and the centroids of the other groups. For SBERT and SimCSE, each embedding vector represents the overall information of a sentence; for our three KeyML methods, each embedding vector represents the keyword information of a sentence.
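The two distance measures defined above can be computed directly from embeddings and cluster labels; a NumPy sketch:

```python
import numpy as np

def class_distances(embeddings, labels):
    """Average intra-class distance (centroid to its members) and
    inter-class distance (centroid to the other centroids)."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(0) for c in classes])
    # Intra-class: mean distance from each centroid to its own members.
    intra = np.mean([
        np.linalg.norm(embeddings[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(classes)
    ])
    # Inter-class: mean pairwise distance between distinct centroids.
    diff = centroids[:, None, :] - centroids[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    inter = d[~np.eye(len(classes), dtype=bool)].mean()
    return float(intra), float(inter)
```

A smaller intra-class and larger inter-class value corresponds to the tighter, better-separated groups reported for KeyML-Single.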
Table 2. Intra-class distance and inter-class distance on GoogleNews-TS-10 (GNTS10).

Methods        Intra-class Distance   Inter-class Distance   GNTS10 ACC   NMI
SBERT          8.07                   60.03                  84.6         86.8
SimCSE         3.66                   62.83                  93.8         95.6
KeyML-Set      6.80                   61.16                  89.0         90.5
KeyML-Sent     5.96                   61.12                  90.8         92.6
KeyML-Single   2.49                   62.76                  94.4         95.8
Fig. 5. t-SNE visualization of sentences in the first 10 event sentence groups of the GoogleNews-TS-10 dataset.
As shown in Fig. 5 and Table 2, we can conclude the following: (i) All three of our strategies achieve a smaller intra-class distance than SBERT on the ground-truth event sentence groups. This demonstrates the ability of keyword-based metric learning to bring sentences belonging to the same event together. (ii) Like KeyML-Set, KeyML-Sent also achieves a larger inter-class distance than SBERT, which proves the effectiveness of adding an attention mechanism among keywords. (iii) Our proposed KeyML-Single achieves the
smallest intra-class distance and nearly the largest inter-class distance simultaneously. This demonstrates the ability of the single-keyword strategy both to bring sentences about the same event closer and to push sentences about different events further apart.

5.2 Comparison of Keyword Features

KeyML-Single Has a Better Understanding of the Co-Occurrence Relationship Between Keywords. Since our core aim is to enhance the model's attention to keyword information under different event topics, we calculate both the intra-cluster and inter-cluster Euclidean distances between keyword embeddings from the first 10 clusters of GoogleNews-TS-10 and visualize them with t-SNE [30]. As shown in Fig. 6 and Table 3, we observe that KeyML-Single does boost the model's ability to capture the relevance of keywords. Compared with SBERT and SimCSE, KeyML-Single has a smaller intra-class keyword distance and a larger inter-class keyword distance, indicating that KeyML-Single understands the co-occurrence relationships between keywords better than the baselines.
Table 3. Intra-class and inter-class keyword distance (KD) on GoogleNews-TS-10 (GNTS10).

Methods        Intra-class KD   Inter-class KD   GNTS10 ACC   NMI
SBERT          28.07            37.28            84.6         86.8
SimCSE         37.04            29.94            93.8         95.6
KeyML-Single   8.51             50.21            94.4         95.8
Fig. 6. t-SNE visualization of keywords of the first 10 event sentence groups in the GoogleNews-TS-10 dataset.
5.3 Discussion of the Threshold

The threshold δ on the TF-IDF value determines the scope and quantity of keyword extraction. In our extensive experiments, we also investigated the impact of different values of the threshold δ. As shown in Fig. 7, the optimal δ values of KeyML-Set, KeyML-Sent, and KeyML-Single are 0.25, 0.2, and 0.15, respectively. This implies that KeyML-Single is more
adaptable to low δ values than the other two strategies. Therefore, we further explore the effect of KeyML-Single with a lower threshold (δ = 0.1). The statistical results are shown in Table 4.
Fig. 7. Performance of the three metric learning strategies under different δ thresholds.

Table 4. Further analysis of the effect of different δ on KeyML-Single.

Threshold   GNTS30        GNTS20        GNTS10
            ACC    NMI    ACC    NMI    ACC    NMI
δ = 0.25    94.3   95.7   93.8   95.3   92.4   94.0
δ = 0.20    94.9   96.3   95.0   96.2   93.8   95.3
δ = 0.15    95.5   96.8   95.2   96.5   94.3   95.7
δ = 0.10    95.7   97.0   95.5   96.7   94.4   95.8
Table 4 shows that decreasing δ benefits the results of KeyML-Single, which means that more keywords lead to better results. Even so, the lower bound of KeyML-Single's performance is still competitive. To ensure both good results and high training efficiency, an appropriate δ value can be selected based on the device's capacity.
6 Conclusion

In this paper, we have proposed KeyML, a straightforward keyword-based metric learning framework that can effectively capture the subject information of events. To fine-tune on event sentence datasets, three training strategies, KeyML-Set, KeyML-Sent, and KeyML-Single, were designed to perform word-level metric learning at different granularities. Experimental results verified that connections between keywords and events can be better established with our metric learning strategies, and that the performance on the ESCI task is significantly improved as a consequence. Our analysis provides good insights into rethinking and re-evaluating the local information of words instead of the global representation of sentences.

Acknowledgements. We thank the anonymous reviewers for providing insightful comments, suggestions and feedback. This research was supported by the Sichuan Province Scientific and Technological Achievements Transfer and Transformation Demonstration Project, grant number 2022ZHCG0007.
References

1. Kim, H.G., Lee, S., Kyeong, S.: Discovering hot topics using twitter streaming data: social topic detection and geographic clustering. In: 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013), pp. 1215–1220 (2013). https://doi.org/10.1109/ASONAM.2013.6785858
2. Wadden, D., Wennberg, U., Luan, Y., Hajishirzi, H.: Entity, relation, and event extraction with contextualized span representations. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5784–5789. Association for Computational Linguistics, Hong Kong, China, November 2019
3. Blanco, E., Castell, N., Moldovan, D.: Causal relation extraction. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). European Language Resources Association (ELRA), Marrakech, Morocco, May 2008. http://www.lrec-conf.org/proceedings/lrec2008/pdf/87_paper.pdf
4. Bouras, C., Tsogkas, V.: Improving news articles recommendations via user clustering. Int. J. Mach. Learn. Cybern. 8(1), 223–237 (2014). https://doi.org/10.1007/s13042-014-0316-3
5. Ramos, J.E.: Using tf-idf to determine word relevance in document queries (2003)
6. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
8. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Association for Computational Linguistics, Doha, Qatar, October 2014
9. Bengio, Y., Boulanger-Lewandowski, N., Pascanu, R.: Advances in optimizing recurrent networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8624–8628 (2013)
10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota, June 2019
11. Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
12. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992. Association for Computational Linguistics, Hong Kong, China, November 2019
13. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823 (2015)
14. Xu, J., Xu, B., Wang, P., Zheng, S., Tian, G., Zhao, J.: Self-taught convolutional neural networks for short text clustering. Neural Netw. 88, 22–31 (2017)
15. Hadifar, A., Sterckx, L., Demeester, T., Develder, C.: A self-training approach for short text clustering. In: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pp. 194–199. Association for Computational Linguistics, Florence, Italy, August 2019
16. Rakib, M.R.H., Zeh, N., Jankowska, M., Milios, E.E.: Enhancement of short text clustering by iterative classification. Natural Lang. Process. Inf. Syst. 12089, 105–117 (2020)
17. Zhang, D., et al.: Supporting clustering with contrastive learning. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5419–5430. Association for Computational Linguistics, Online, June 2021
18. Yin, J., Wang, J.: A model-based approach for text clustering with outlier detection. In: 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 625–636 (2016)
19. Hürriyetoğlu, A., Zavarella, V., Tanev, H., Yörük, E., Safaya, A., Mutlu, O.: Automated extraction of socio-political events from news (AESPEN): workshop and shared task report. In: Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020, pp. 1–6. European Language Resources Association (ELRA), Marseille, France, May 2020. https://aclanthology.org/2020.aespen-1.1
20. Bejan, C., Harabagiu, S.: Unsupervised event coreference resolution with rich linguistic features. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1412–1422. Association for Computational Linguistics, Uppsala, Sweden, July 2010. https://aclanthology.org/P10-1143
21. Lee, H., Recasens, M., Chang, A., Surdeanu, M., Jurafsky, D.: Joint entity and event coreference resolution across documents. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 489–500. Association for Computational Linguistics, Jeju Island, Korea, July 2012. https://aclanthology.org/D12-1045
22. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Association for Computational Linguistics, Doha, Qatar, October 2014
23. Örs, F.K., Yeniterzi, S., Yeniterzi, R.: Event clustering within news articles. In: Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020, pp. 63–68. European Language Resources Association (ELRA), Marseille, France, May 2020. https://aclanthology.org/2020.aespen-1.11
24. Barhom, S., Shwartz, V., Eirew, A., Bugert, M., Reimers, N., Dagan, I.: Revisiting joint modeling of cross-document entity and event coreference resolution (2019)
25. Tan, F.A., Gollapalli, S.D., Ng, S.K.: NUS-IDS at CASE 2021 task 1: improving multilingual event sentence coreference identification with linguistic information. In: Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021), pp. 105–112. Association for Computational Linguistics, Online, August 2021
26. Lu, J., Ng, V.: Conundrums in event coreference resolution: making sense of the state of the art. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1368–1380. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021
27. Joshi, M., Chen, D., Liu, Y., Weld, D.S., Zettlemoyer, L., Levy, O.: SpanBERT: improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguist. 8, 64–77 (2020)
28. Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021
29. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2017)
30. van der Maaten, L., Hinton, G.E.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Automatic Text Extractive Summarization Based on Text Graph Representation and Attention Matrix

Yuan-Ching Lin and Jinwen Ma(B)

Department of Information and Computational Sciences, School of Mathematical Sciences and LMAM, Peking University, Beijing 100871, China
{yuanchinglin,jwma}@pku.edu.cn
Abstract. Automatic text summarization via representing a text as a graph has been investigated for over ten years. With the developments of the attention mechanism and Transformer in natural language processing (NLP), it is possible to make a connection between the graph and attention structures for a text. In this paper, we propose an attention matrix text graph model for extractive text summarization. Specifically, an attention matrix between all the sentences of the whole text, which can be computed using a pre-trained language model, is adopted as the weighted adjacency matrix of a fully connected graph of the text in which each node represents a sentence. A graph convolutional network (GCN) is further applied to the text graph model to classify all the nodes and find the salient sentences of the text to generate a summary. The experimental results on two typical datasets demonstrate that our proposed model achieves competitive results in comparison with state-of-the-art models.

Keywords: Text summarization · Graph convolutional network · Attention mechanism
1 Introduction

As a major task of Natural Language Processing (NLP), automatic text summarization has recently attracted more and more attention thanks to the development of deep learning and artificial intelligence. It aims at condensing a text into a summary or an abstract automatically. In general, there are two kinds of methods for automatic text summarization. The first kind generates an abstractive summary directly from the whole text, but the results are not so satisfactory, being unstable and unreadable. The second kind extracts some salient sentences from the text to form a combined summary, so the results are more stable and readable and usually gain higher scores on the evaluation indexes.

In fact, the sentences extracted or selected by an extractive summarization method ensure that the summary is meaningful and informative. Although the summary may contain certain redundant information, it can represent the key ideas of the whole text. Moreover, an extractive summarization method can be considered as a multi-label
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 551–562, 2023. https://doi.org/10.1007/978-981-99-4752-2_45
classifier, which is more efficient to train due to its fewer parameters and relatively simple structure.

Currently, neural network architectures are the mainstream of NLP, obtaining several outstanding results [5, 19, 24]. In particular, the Transformer [25] and attention mechanism [1] deliver extraordinary performance in extracting the features within a text, making breakthroughs in all the evaluation indexes on various NLP tasks. Such an approach has already been adopted in the text summarization task, and experiments demonstrate that the attention mechanism can locate the text with rich semantic information. Moreover, pre-trained language models based on the attention mechanism [7, 23] can collect text information in an unsupervised learning mode; with proper fine-tuning, we can use limited computation resources to reach high-quality results.

In this paper, we propose an attention matrix text graph model for extractive text summarization. We analyze the text summarization task from the perspective of text structure through a graph model. Inspired by the development and application of the Graph Convolutional Network (GCN) [14], we represent a text as a graph and select the key sentences through the message passing and node embedding in the GCN. The cross-attention among the sentences in a text can be regarded as structural data, and we can discover such a feature in an attention-based pre-trained language model. Additionally, our model is not simply a combination of the GCN and attention mechanism, as we use the attention from a pre-trained model. We then investigate the connection between the attention and the graph. An attention matrix between all the sentences is constructed as the weighted adjacency matrix of a fully connected graph of the text in which each node represents a sentence. The node features and the adjacency matrix can be obtained through the sentence embedding and attention layers in the pre-trained Transformer, respectively.
We further apply the GCN to the text graph model to classify all the nodes and find the salient sentences of the text to form a summary. The experimental results demonstrate that our proposed model achieves comparable results on two datasets, CNN/DailyMail and Newsroom, under the ROUGE index [15]. What is more, we obtain such results with only about 5 million parameters, roughly 1% of the state-of-the-art model, with an acceptable loss in precision. As a result, our approach to extractive text summarization is more efficient.
2 Related Work

2.1 Representing a Text as a Graph

Text information can be obtained by transferring the unstructured data into a topological system or structure. Several studies have designed text graphs for various tasks such as information retrieval [2, 3, 28] and speech recognition [26], as well as text summarization. The main idea for constructing a graph from a text is to define the vertices and edges based on text elements and statistics. The vertices can be defined as elements of the text such as words or sentences, while the edges can be defined by a measurement of the relationship between two vertices. Commonly used measurements include co-occurrence [4], document collection [3], and similarity. Similarity can be measured at different levels, such as the sentence level using the term frequencies
[18] and the word level using the cosine distances among pre-trained word vectors. Each kind of graph representation has its advantages on a specific task [20] and provides the possibility to analyze the text data from a different comprehensive angle.

2.2 Extractive Summarization Models

Neural Network Models. A neural network model can embed a text into dense vector features by training on an unpaired text corpus, which is useful for training an extractive model for the text summarization task. Among classic deep learning models, the Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) can integrate the text information from the words of a text and obtain high-level features. Kageback et al. [13] and Nallapati et al. [19] adopted RNN-based models which produce the sentence features by recursively inputting the words, whereas Cheng and Lapata [5] utilized CNN-based models which collect the word features from a local field and sum them up with pooling. Both kinds of models combine the obtained sentence features through classifiers to predict the text summaries. Transformer, a self-attention based language model [25], integrates the text features in another way, such that the measurements of relationship among the words are computed in parallel and the sentence vectors are formed as weighted sums. The experimental results given by Liu [17] and Zhong et al. [30] have demonstrated the great success of extracting text summaries by Transformer.

Graph-Based Models. The early studies of graph-based methods focus on variants of PageRank [21], a well-known web search algorithm. It assumes that important sentences connect to each other and can be ranked by random walk. TextRank [18] implements such an idea on the text summarization task: it builds the text graph via sentence similarity measured by tf-idf [12], ranks all the sentences, and returns the few sentences with top scores as the summary.
Plaza, Diaz and Gervas [22] and Esuli and Sebastiani [12] also extended this idea and represented the graph with semantic matching networks. Recent studies connect sentence representations with graphs thanks to the improvement in generating text features from neural networks. Yasunaga et al. [27] applied the GCN model to find the salient sentences, using different rules (including sentence similarity) to build the text graph and encoding the sentences into fixed-dimensional vectors through an RNN. However, building the sentence relationships by those rules may leave out important semantic information. The statistical methods (such as tf-idf and discourse relations [6]) utilize only part of the information of the text, and are therefore insensitive to word meaning, as in the case of synonyms or antonyms. In this paper, we try to solve this issue by constructing the text graph with the learned attention weights from a pre-trained language model. This attention matrix contains much more information, and it is also more flexible in representing the sentence relationships in different contexts.
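The PageRank-style ranking behind TextRank [18] can be sketched in a few lines; the damping factor d = 0.85 and the toy similarity matrix below are illustrative values, not taken from any of the cited papers:

```python
import numpy as np

def textrank(sim, d=0.85, iters=50):
    """PageRank-style ranking over a sentence-similarity matrix.

    `sim` is symmetric with a zero diagonal; rows are normalized into
    transition probabilities before the power iteration.
    """
    n = len(sim)
    w = sim / np.maximum(sim.sum(axis=1, keepdims=True), 1e-12)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (w.T @ r)  # random-walk update with damping
    return r

# Toy graph: sentence 0 is strongly connected to both others.
sim = np.array([[0.0, 0.8, 0.6],
                [0.8, 0.0, 0.1],
                [0.6, 0.1, 0.0]])
scores = textrank(sim)
summary_idx = np.argsort(scores)[::-1][:1]  # top-scoring sentence as summary
```

The most central sentence accumulates the highest score and is selected for the summary.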
3 Attention Matrix Text Graph Model

In this section, we present our proposed attention matrix text graph model for extractive text summarization. As an extractive text summarization model, it aims at ranking all the sentences with reasonable scores, so that we can select a number of top-scoring sentences to form an effective summary. To this end, its major job can be divided into two parts: (1) constructing the attention matrix text graph; (2) selecting the salient sentences of the summary via the GCN model. We now describe how our proposed model fulfills these two parts of the job, respectively.

3.1 Text Graph Construction

We begin by constructing a graph for a text with the attention matrix, where a node represents a sentence in the text while an edge represents the relationship between two nodes. Specifically, we employ the self-attention sub-layers of the BERT model to interpret the relationships between sentences. The innovation of our model is that we use the attention extracted from the pre-trained BERT model, rather than training an attention mechanism from scratch. To measure the relationship between a sentence si and any other one in the set of all the sentences of the text, d = (s1, . . . , sn), we consider these self-attention sub-layers by inputting all the sentences to the BERT model, in which the input encoding layer is set in the way of BertSum [17]. Specifically, the BERT model takes three types of encoding as the input: token, segmentation, and position encoding, which can be described as follows. A text is split into a number of tokenized sentences such that each sentence is prefixed with a classification token ([CLS]) and suffixed with a separation token ([SEP]). During the segmentation embedding process, segment ids 1 and 0 alternate sentence by sentence, distinguishing adjacent sentences by their positions. Positional embedding remains the same as in the vanilla BERT model.
We visualize the three input components of the employed BERT model in Fig. 1. After the input encoding process, we take the attention sub-layers of the encoder layers for further analysis.
Fig. 1. Three input components of the employed BERT model.
The attention sub-layer of each encoder layer contains certain semantic information which should be analyzed independently. According to the self-attention mechanism, the attention sub-layer generates an attention matrix reflecting the interactions of all the tokens. By picking out all the [CLS] tokens, an inter-sentence attention matrix can be obtained. To ensure that the sentence-level attention forms a probability distribution,
Automatic Text Extractive Summarization
555
we feed it into the softmax function. Figure 2 visualizes our approach to generating the inter-sentence attention matrix.
Fig. 2. Constructing the inter-sentence attention matrix from the BERT attention matrix. For an original text of 3 sentences, the [CLS] token is selected from each row and column to form a 3 × 3 matrix.
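The reduction from the token-level attention matrix to the inter-sentence matrix can be sketched in NumPy as follows (a minimal sketch; the function and variable names are ours, not from the paper):

```python
import numpy as np

def inter_sentence_attention(token_attn, cls_positions):
    """Reduce a token-level attention matrix (T x T) to an inter-sentence
    matrix (S x S): keep only the rows and columns of the [CLS] tokens,
    then apply a row-wise softmax so each row is a probability distribution."""
    idx = np.asarray(cls_positions)
    sent_attn = token_attn[np.ix_(idx, idx)]           # pick [CLS] rows/columns
    e = np.exp(sent_attn - sent_attn.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# toy example: 6 tokens, with [CLS] tokens at positions 0 and 3
attn = np.random.rand(6, 6)
S = inter_sentence_attention(attn, [0, 3])
print(S.shape)   # (2, 2)
```

In practice `token_attn` would be one head of one encoder layer taken from the pre-trained BERT model.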
From the visualized inter-sentence attention (Fig. 3), we can see that each sentence pays the most attention to itself and less attention to the others. Moreover, the attention matrix captures some basic semantic information representing the structure of the text. After setting a threshold, we binarize each attention value to 1 or 0. This binary attention matrix can then be taken as the adjacency matrix of the sentences, so that a directed graph over the sentences (as nodes) is constructed from the attention matrix. The graph is directed because the attention of one sentence to another is directional. Generally, the GCN model processes undirected graph data; however, since its processing can be regarded as information transmission, either a symmetric (undirected) or an asymmetric (directed) adjacency matrix can be fed into the GCN model. Figure 4 shows the connection structure of the obtained directed graph for a given text. It is worth noting that no parameter training is involved in generating the adjacency matrix, which helps to stabilize the GCN model and reduce computational cost.
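The thresholding step can be sketched as follows (the 0.1 threshold is an arbitrary illustration, not a value from the paper):

```python
import numpy as np

def attention_to_adjacency(sent_attn, threshold=0.1):
    """Binarize the inter-sentence attention matrix into a directed
    adjacency matrix: an edge i -> j exists when sentence i attends to
    sentence j with weight above the threshold."""
    return (np.asarray(sent_attn) > threshold).astype(int)

A = attention_to_adjacency([[0.80, 0.05, 0.15],
                            [0.30, 0.60, 0.10],
                            [0.25, 0.05, 0.70]], threshold=0.1)
print(A)   # [[1 0 1] [1 1 0] [1 0 1]]
```

Note that the diagonal survives the thresholding here, consistent with each sentence attending most strongly to itself.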
Fig. 3. The two subfigures show the attention matrices of the same text from two different attention heads. Each head focuses on different information and generates a different attention result; therefore, integrating more heads into the model improves semantic richness.
Fig. 4. A typical adjacency matrix of a text structure.
In our actual BERT model for language pre-training, there are 12 encoder layers, each containing a self-attention sub-layer with 12 different attention heads, and each head represents a kind of relationship between sentences. We choose the attention heads from the first layer to produce our adjacency matrices because they retain the most information from the original text data. We finally integrate the information from all the different attention representations in our graph model.

3.2 Selecting the Salient Sentences via GCN

With the general assumption that salient sentences in a text receive more attention from the others, we can run a GCN model on the text graph to obtain, at the output layer, the essential information or importance of each sentence. For an effective extractive summary, the selected sentences should support the theme of the text and occupy important positions in the obtained graph of the text. In the light of information
transmission in GCN, we consider that after running the GCN on the text graph, the information is finally aggregated into the salient sentences (nodes), so that these salient sentences can form a summary of the text. To reach this goal, we train the GCN model as a softmax classifier on the text dataset in which, for each text, a sentence inside the summary is labeled 1 and a sentence outside the summary is labeled 0.

Specifically, the GCN model takes two inputs: the node vectors and the adjacency matrix. For the node vectors, we use the [CLS] positions of the BERT output as the representative sentence embeddings. The adjacency matrix is generated by the attention method above. Let X ∈ R^{N×d} be the matrix of the vectors of the N sentences and A ∈ {0, 1}^{N×N} be the adjacency matrix, where d is the dimensionality of the node vectors. The general processing operation of the GCN can be expressed by

GCN(X, Ã) = relu(Ã · FC(X)),  (1)

where Ã is the normalized adjacency matrix (refer to [14] for the detailed expression) and FC is a fully connected neural network. To merge the adjacency information from all the attention heads, we take the following node updating rule:

H1 = concat(GCN1(X, Ã1), GCN1(X, Ã2), ..., GCN1(X, Ã12)),  (2)

H2 = concat(GCN2(X, Ã1), GCN2(X, Ã2), ..., GCN2(X, Ã12)),  (3)

where the same GCN layer shares the same parameters. If d1 and d2 are the output dimensionalities of GCN1 and GCN2, respectively, we obtain H1 ∈ R^{N×(12·d1)} and H2 ∈ R^{N×(12·d2)} as above.¹ We then design a readout function which combines the results from different layers of the GCN model in the same way as in the MPNN structure [9]. Defining FC1 and FC2 as two independent fully connected layers, the readout function is given by

R = σ(FC1([H2 | X])) · FC2(H2),  (4)

where · is the element-wise product and σ is the sigmoid function. Through a fully connected layer MLP, we condense R into a one-dimensional score per sentence and obtain the prediction

y = MLP(R),  (5)

where yi ∈ [0, 1], the ith element of y, is the predicted salience score of the ith sentence. All the sentences can thus be ranked by their salience scores, and we finally select a number of top-scoring sentences as the salient sentences to form an effective summary.

¹ In our experiment, the input X is generated by the BERT sentence embedding with dimensionality 768, and the two GCN layers before concatenation contain 50 neurons each (d1 = d2 = 50), so the dimensionalities of H1 and H2 are 600, close to that of the input X.
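The head-wise concatenation of Eq. (2) can be sketched at the shape level in NumPy (a minimal sketch with random placeholder inputs; the row normalization used here is a simplification we chose for the illustration, not necessarily the exact normalization of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def normalize(A):
    """Add self-loops and row-normalize the (possibly directed) adjacency
    matrix -- a simple stand-in for the normalized matrix A~."""
    A = A + np.eye(A.shape[0])
    return A / A.sum(axis=1, keepdims=True)

def gcn_layer(X, A_norm, W):
    # Eq. (1): GCN(X, A~) = relu(A~ . FC(X)), with FC as a linear map here
    return relu(A_norm @ (X @ W))

N, d, d1, heads = 4, 768, 50, 12
X = rng.standard_normal((N, d))                      # sentence embeddings
A_heads = [normalize(rng.integers(0, 2, (N, N)).astype(float))
           for _ in range(heads)]                    # one adjacency per head
W1 = rng.standard_normal((d, d1)) * 0.01             # shared GCN1 weights

# Eq. (2): concatenate the GCN1 outputs over all 12 attention heads
H1 = np.concatenate([gcn_layer(X, A, W1) for A in A_heads], axis=1)
print(H1.shape)   # (4, 600): 12 heads x 50 units, close to the 768-d input
```

The readout of Eqs. (4)-(5) would then combine such concatenated representations with the input X through two further fully connected layers.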
4 Experimental Results

4.1 Datasets

We test our proposed model on two typical benchmark datasets for text summarization: CNN/Dailymail [11] and Newsroom [10]. To match our proposed model, we transform the generative datasets into extractive ones with a greedy algorithm: we pick the K sentences with the highest ROUGE-2 scores from a text to form its summary, where K is the average number of summary sentences in each dataset. Because of the widely varying text lengths in the Newsroom dataset, we conduct two experiments in this case: one on the original dataset, and one on a modified dataset in which we keep only samples of moderate text length and omit the extreme ones.

CNN/Dailymail is a summarization dataset of long texts; each text can have multiple summaries. The texts are mainly news articles, while the summaries are written by specialists. There are 287,226 text-summary pairs for training and 11,490 pairs for testing. A text contains about 30 sentences on average, with about 3.72 summary sentences per text.

Newsroom is collected from web texts such as social media and news. Its characteristic is that the texts come from a variety of publishing pipelines and have various lengths. In total, the dataset contains about 1.3 million text-summary pairs with an average text length of 658 words (about 15 sentences) and an average summary length of 1.33 sentences.

4.2 Implementation Details

For each dataset, we choose the K sentences with the top K ROUGE-2 scores from every text to serve as its summary. We then train and test our proposed model on the text-summary pairs from the original or modified dataset. Specifically, we utilize BERT-base as our language model to obtain the attention matrices. All experiments are conducted with the TensorFlow 2.3.0 framework on an NVIDIA GeForce GTX 1080 (8 GB) GPU.
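The greedy oracle construction described above can be sketched as follows; the bigram-overlap count is a rough stand-in for the actual ROUGE-2 score, and all names are ours:

```python
def bigrams(tokens):
    """Set of adjacent word pairs in a token list."""
    return set(zip(tokens, tokens[1:]))

def greedy_extractive_labels(sentences, reference, k):
    """Label the k sentences sharing the most bigrams with the reference
    summary as oracle summary sentences (1), everything else as 0."""
    ref = bigrams(reference.split())
    scores = [len(bigrams(s.split()) & ref) for s in sentences]
    top = sorted(range(len(sentences)), key=scores.__getitem__, reverse=True)[:k]
    return [1 if i in top else 0 for i in range(len(sentences))]

doc = ["the cat sat on the mat", "dogs bark loudly", "a cat sat here"]
print(greedy_extractive_labels(doc, "the cat sat on a mat", k=1))   # [1, 0, 0]
```

A full implementation would substitute a proper ROUGE-2 scorer (with stemming and recall/F1 computation) for the raw bigram overlap.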
4.3 Summarization Performance and Comparison

We now present the experimental results of our proposed model and compare it with the baseline and mainstream summarization methods. Tables 1 and 2 summarize the results of our proposed model and the competing methods on the CNN/Dailymail and Newsroom datasets, respectively. The method "Lead" is our baseline, which simply takes the first three sentences of a text as its summary, since most authors state the key points at the beginning of a text. Our evaluation and comparison are based on F1 scores computed with ROUGE [15], a statistical index evaluating the similarity of two texts. Specifically, we evaluate each model with the F1 scores of ROUGE-1, ROUGE-2 and ROUGE-L. Moreover, we also compare our proposed model with some generative models, even though they are not directly comparable with extractive ones.
On CNN/Dailymail, our proposed model surpasses the baseline and achieves competitive results against the compared methods. NN-SE [5] and BANDITSUM [8] are extractive, and our proposed model achieves comparable, and in some cases slightly better, results. GPT-2 [23], C2F-ALTERNATE [16] and PEGASUS [28] are generative; their results are generally better than ours, but also differ considerably from one another.

Table 1. Experimental results of our proposed model, the baseline model and competing methods on the CNN/Dailymail dataset; each number is an F1 score.
Method           F1 of ROUGE-1   F1 of ROUGE-2   F1 of ROUGE-L
GCN_Attn         29.85           10.76           23.85
Lead             22.22            8.32           21.17
NN-SE            28.4            10.0            25.0
GPT-2            29.34            8.27           26.58
BANDITSUM        30.7            11.6            27.4
C2F-ALTERNATE    31.1            15.4            28.8
PEGASUS          44.17           21.47           41.11
As for the Newsroom dataset, we obtain a different picture. Before the data modification, our proposed model performs similarly to the baseline and to Pointer-N [24], a pointer-generator model. However, our proposed model improves dramatically on the modified dataset: it surpasses the baseline and comes close to the modified version of Pointer-N, though it still lags behind PEGASUS, the state-of-the-art generative model. A likely reason is that the Newsroom dataset contains some extremely long sentences, which leads to unstable training and therefore reduces the ability to label reasonable summaries.

Table 2. Experimental results of our proposed model, the baseline model and competing methods on the Newsroom dataset, where Pointer-N is the pointer-generator method proposed in [24] and Pointer-N* is the same method trained on the selected dataset.
Method       F1 of ROUGE-1   F1 of ROUGE-2   F1 of ROUGE-L
GCN_Attn     27.42           19.54           26.65
GCN_Attn*    36.75           29.39           36.29
Lead         28.43           20.57           28.66
Pointer-N    26.02           13.25           22.43
Pointer-N*   39.11           27.95           36.17
PEGASUS      45.15           33.51           41.33
4.4 Discussions

We further discuss the performance of our proposed model from different angles. The experimental results show that shorter texts, as well as data filtering, improve the performance of the model. The reason may lie in the limitation of the BERT-base model, whose input is capped at 512 tokens, which is insufficient for longer texts. When we limit the number of tokens per text to a moderate value, the learning result improves accordingly.

From the comparison results we draw three points. First, our proposed model beats the baseline, which has no learning mechanism; this shows that adopting the attention mechanism and the graph model is valuable and feasible. Second, compared with a neural extractive model, our proposed model often obtains a tied result; it demonstrates another way to analyze text structure, from a graph view, instead of representing the text by integrating local information. Finally, although our proposed model cannot reach the performance of some generative models, it guarantees that the obtained summaries are readable. Generative models tend to capture the key words or phrases of a given text and achieve higher scores on statistical indices like ROUGE, but readability usually remains a challenge for them.

In addition, our proposed model offers two benefits: few parameters and interpretability. Fewer parameters lead to faster training and inference: PEGASUS, the state-of-the-art model, contains 568 million parameters, while our GCN model contains only 504,839, which is far smaller. Under limited computational resources and memory, our GCN model is therefore of more practical value. Moreover, the graph structure makes it easier to visualize and interpret the connections among text elements, giving a better understanding of the learning process of the model.
5 Conclusions

We have proposed a new way to construct the adjacency matrix for a GCN in the text summarization task. No additional training of the language model is needed: the pre-trained attention matrix already contains the semantic information, and the adjacency matrix can be computed directly from it. As a result, each text can be represented as a graph whose nodes are sentences, and the important sentences can be selected as the summary by their top-K scores. We have designed a GCN model that realizes this idea for extractive text summarization and achieves results competitive with mainstream summarization methods of recent years, with the added advantages of fewer parameters and faster training. Several aspects of the relationship between graphs and language models remain to be explored, and we will continue this research in an attempt to improve the accuracy and usefulness of graph-based summarization models.

Acknowledgment. This work was supported by the National Key Research and Development Program of China under grant 2018AAA0100205.
References

1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
2. Blanco, R., Lioma, C.: Graph-based term weighting for information retrieval. Inf. Retrieval 15(1), 54–92 (2012)
3. Bordag, S., Heyer, G., Quasthoff, U.: Small worlds of concepts and other principles of semantic search. In: Böhme, T., Heyer, G., Unger, H. (eds.) IICS 2003. LNCS, vol. 2877, pp. 10–19. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-39884-4_2
4. Ferrer i Cancho, R., Solé, R.V.: Two regimes in the frequency of words and the origins of complex lexicons: Zipf's law revisited. J. Quant. Linguist. 8(3), 165–173 (2001)
5. Cheng, J., Lapata, M.: Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252 (2016)
6. Christensen, J., Soderland, S., Etzioni, O., et al.: Towards coherent multi-document summarization. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1163–1173 (2013)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
8. Dong, Y., Shen, Y., Crawford, E., van Hoof, H., Cheung, J.C.K.: BanditSum: extractive summarization as a contextual bandit. arXiv preprint arXiv:1809.09672 (2018)
9. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: International Conference on Machine Learning, pp. 1263–1272. PMLR (2017)
10. Grusky, M., Naaman, M., Artzi, Y.: Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies. arXiv preprint arXiv:1804.11283 (2018)
11. Hermann, K.M., et al.: Teaching machines to read and comprehend. arXiv preprint arXiv:1506.03340 (2015)
12. Jones, K.S.: A statistical interpretation of term specificity and its application in retrieval. J. Doc. (1972)
13. Kågebäck, M., Mogren, O., Tahmasebi, N., Dubhashi, D.: Extractive summarization using continuous vector space models. In: Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC), pp. 31–39 (2014)
14. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
15. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)
16. Ling, J.: Coarse-to-fine attention models for document summarization. Ph.D. thesis (2017)
17. Liu, Y.: Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318 (2019)
18. Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411 (2004)
19. Nallapati, R., Zhai, F., Zhou, B.: SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents. arXiv preprint arXiv:1611.04230 (2016)
20. Osman, A.H., Barukub, O.M.: Graph-based text representation and matching: a review of the state of the art and future challenges. IEEE Access 8, 87562–87583 (2020). https://doi.org/10.1109/ACCESS.2020.2993191
21. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. Technical report, Stanford InfoLab (1999)
22. Plaza, L., Díaz, A., Gervás, P.: Concept-graph based biomedical automatic summarization using ontologies. In: Coling 2008: Proceedings of the 3rd Textgraphs Workshop on Graph-Based Algorithms for Natural Language Processing, pp. 53–56 (2008)
23. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
24. See, A., Liu, P.J., Manning, C.D.: Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368 (2017)
25. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
26. Vitevitch, M.S., Rodríguez, E.: Neighborhood density effects in spoken word recognition in Spanish. J. Multilingual Commun. Disorders 3(1), 64–73 (2005)
27. Yasunaga, M., et al.: Graph-based neural multi-document summarization. arXiv preprint arXiv:1706.06681 (2017)
28. Zhang, J., Zhao, Y., Saleh, M., Liu, P.: PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In: International Conference on Machine Learning, pp. 11328–11339. PMLR (2020)
29. Zhong, M., Liu, P., Chen, Y., Wang, D., Qiu, X., Huang, X.: Extractive summarization as text matching. arXiv preprint arXiv:2004.08795 (2020)
Transition-Based Mention Representation for Neural Coreference Resolution

Qingqing Li1,2 and Fang Kong1,2(B)

1 Laboratory for Natural Language Processing, Soochow University, Suzhou, China
[email protected], [email protected]
2 School of Computer Science and Technology, Soochow University, Suzhou, China
Abstract. Coreference resolution plays an important role in text understanding. Recently, various neural approaches have been proposed and have achieved success. Although most researchers agree that mention extraction and representation strongly impact the performance of coreference resolution, existing neural architectures consider all possible spans up to a maximum length and employ only a simple flat word-level span representation. In this way, information that proved effective in earlier non-neural coreference resolution, such as structural information, has been largely ignored. In this paper, we propose a unified transition-based approach that extracts and represents mentions simultaneously for coreference resolution. In particular, we propose a Simplified Constituent Parse Tree (SCPT) scheme for each sentence that keeps only the local detail inside the mentions and the coarse frame structure outside them, so that each mention corresponds to a constituent of the obtained SCPT. We then employ a transition-based strategy to construct the SCPT structure in a bottom-up manner. In this way, various potential mentions (i.e., constituents) can be obtained, and the corresponding transition action sequences, embedded with internal structural information, can serve as their representations. Finally, we employ these potential mentions and their transition-based representations for neural coreference resolution.

Keywords: Transition-based Approach · Neural Coreference Resolution · Nested Noun Phrases
1 Introduction

Coreference resolution aims to identify mentions in a text that refer to the same real-world entity. As a fundamental task, it plays an important role in text understanding and has been one of the key areas of NLP for two decades [2, 9, 11, 15, 18, 19]. Since the proposal of the first end-to-end neural coreference resolution system [12], various neural approaches have been proposed and have achieved considerable success [7, 13, 20]. Figure 1 illustrates an excerpt, with its gold-standard coreference chains, from article cctv 0000 of the OntoNotes corpus. From this sample snippet, which contains six coreference chains shown in different colors, we can observe the following.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 563–574, 2023. https://doi.org/10.1007/978-981-99-4752-2_46
Fig. 1. A coreference example in Chinese.
– The mentions occurring in coreference chains take various forms, e.g., pronouns ("他们/they"), named entities ("香港/Hong Kong"), and common noun phrases ("过去百年的电影史/the past 100 years of film history").
– Nested mentions are very common. In this example, the common noun phrase "这条印记着香港百年电影史的星光大道/the Avenue of Stars, memorializing Hong Kong's 100-year film history" contains four nested NPs: "香港/Hong Kong", "香港百年电影史/Hong Kong's 100-year film history", "星光大道/the Avenue of Stars", and "这条印记着香港百年电影史的星光大道/the Avenue of Stars, memorializing Hong Kong's 100-year film history" itself.
– Moreover, nested mentions can appear on different coreference chains. In this example, the mention "星光大道/the Avenue of Stars" is the headword of the mention "这条印记着香港百年电影史的星光大道/the Avenue of Stars, memorializing Hong Kong's 100-year film history", so the two refer to the same entity, while the other two mentions, "香港/Hong Kong" and "香港百年电影史/Hong Kong's 100-year film history", refer to different entities. These four mentions therefore appear on three coreference chains.

As is well known, mention detection and representation strongly impact the overall performance of coreference resolution. To keep as many mentions as possible, Lee et al. [12] considered all text spans as potential mentions; their first end-to-end neural coreference resolution system used a head-finding attention mechanism to represent mentions. In addition, work on improving span representation with pre-trained embeddings such as BERT [5] or SpanBERT [4] showed that improved mention representations can contribute much to coreference resolution. However, most neural architectures still employ a simple flat word-level mention (or span) representation, which largely restricts the effectiveness of neural architectures for coreference resolution.

Another issue is that, as noted in Fig. 1, nested mentions appearing on the same or different coreference chains are common. Table 1 shows the mention statistics of the OntoNotes corpus: nested mentions account for more than 30% of all data sets in Chinese. Intuitively, internal structural information is important for the
Table 1. Mention distribution over nested and non-nested types in the OntoNotes corpus.

Language   Type         Train     Dev     Test
Chinese    Nested       38159     4778    4842
           Non-Nested   64694     8405    7959
           SUM          102853    14183   12801
English    Nested       28974     3857    3864
           Non-Nested   126584    15298   15900
           SUM          155558    19155   19764
nested mentions. However, all previous work [3, 7, 8, 10] extracted syntactic patterns directly from the parse trees to obtain structural information. Is the complete parse tree structure really necessary for better coreference resolution? We argue that internal structural information of the mentions (i.e., local detail) may be more important for representing mentions, while the coarse frame structure outside the mentions is sufficient for coreference resolution.

In this paper, we propose a unified neural transition-based model for mention extraction and representation. We first simplify a sentence to an SCPT by keeping only the local detail of the mentions and the global outline of the sentence, where each mention corresponds to a constituent. We then employ a transition-based strategy to construct the SCPT structure in a bottom-up manner. Here, we use a Stack-augmented Parser-Interpreter Neural Network (SPINN) model [17] to generate transition action sequences and obtain vector representations of each constituent as our mention representations for coreference resolution. Experiments on the dataset of the CoNLL-2012 shared task show the effectiveness of our proposed approach.

2 Unified Model for Mention Extraction and Representation

Given a sentence, the goal of our unified model is to obtain all potential mentions and their representations simultaneously. To reduce the impact of external parsing, we first map the parse tree to an SCPT and then construct the shift-reduce system with reference to the SCPT.

2.1 Simplified Constituent Parse Tree Generation

We argue that coreference resolution does not require the complete parse tree: too much structural detail can introduce noise and relies heavily on syntactic parsers. Statistics of the OntoNotes corpus show that the average depth of the syntactic parse tree of a sentence is about 8.4, while the depth of nested mentions is no more than 3. Intuitively, for the mention itself we need to focus on its internal structural details, while for its context we only need the coarse outline of the sentence. Accordingly, we propose a sentence simplification strategy. Statistics of the OntoNotes corpus show that the five most frequent tags of mentions account for more than 95% of the total; we select them as our reserved tag set, shown in Table 2. Given the
2 Unified Model for Mention Extraction and Representation Given a sentence, the goal of our unified model is to obtain all potential mentions and their representations simultaneously. In order to reduce the impact of external parsing, we firstly map the parse tree to the SCPT. Then, we construct the shift-reduce system referring to the SCPT. 2.1 Simplified Constituent Parse Tree Generation We argue that coreference resolution does not require the complete parse tree. Too much structural detail can introduce noise and rely heavily on syntactic parsers. Statistics of the OntoNotes corpus show that, the average depth of the syntactic parse tree corresponding to each sentence is about 8.4, while the depth of the nested mentions is no more than 3. Intuitively, for the mention itself, we need to focus on the internal structural details of the mention, while for its context, we only need the coarse outlines of the sentence. Accordingly, we propose a sentence simplification strategy. Statistics of the OntoNotes corpus show that the five most frequent tags of the mentions account for more than 95% of the total. We select them as our reserved tag set shown in Table 2. Given the
parse tree of the sentence and the reserved tag set, we then employ Algorithm 1 to simplify the structural information. For convenience, we add a virtual root node to each syntactic parse tree. Figure 2 illustrates the results of our parse tree simplification algorithm: the blue and yellow nodes in (a) carry reserved tags; the yellow nodes are redundant nodes, which the algorithm replaces with their lower reserved child nodes. Comparing the original parse tree with the SCPT, we can see that the nodes we care about are retained, the useless structural details have been trimmed away, and the obtained tree is much shallower. Using the SCPTs, the average depth drops to about 3.0 for English and 2.7 for Chinese. In this way, we achieve our goal of keeping local details and global outlines.

Table 2. Reserved tag sets for different languages.

Language   Reserved tag set
Chinese    {NP, PN, NN, NR, NT}
English    {NP, PRP, NNP, PRP$, DT}
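Algorithm 1 itself is not reproduced in this excerpt; based on the description above, the pruning might look roughly like the following sketch (our own reconstruction with a minimal Node class; the real algorithm may differ in detail):

```python
RESERVED = {"NP", "PN", "NN", "NR", "NT"}   # Chinese reserved tag set (Table 2)

class Node:
    def __init__(self, tag, children=None, word=None):
        self.tag, self.children, self.word = tag, children or [], word

def simplify(node):
    """Return the simplified subtree(s) for one parse-tree node: keep
    nodes with reserved tags, keep words (leaves), and splice out every
    other node by promoting its children."""
    if node.word is not None:                        # leaf: always keep
        return [node]
    kept = [n for c in node.children for n in simplify(c)]
    if node.tag in RESERVED:
        # collapse a redundant chain: a reserved node with a single
        # reserved child is replaced by that lower child
        if len(kept) == 1 and kept[0].tag in RESERVED:
            return kept
        return [Node(node.tag, kept)]
    return kept                                      # splice out this node

# IP(NP(NN "x"), VP(VV "y")) -> the non-reserved IP and VP are spliced out
tree = Node("IP", [Node("NP", [Node("NN", word="x")]),
                   Node("VP", [Node("VV", word="y")])])
print([n.tag for n in simplify(tree)])   # ['NN', 'VV']
```

Attaching the surviving nodes under the virtual root mentioned above would then yield the SCPT.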
2.2 Unified Transition-Based Model

In the obtained SCPT, all the leaf nodes correspond to words, and all the non-terminal nodes carry tags from the reserved tag set, so we can extract the non-terminal nodes as potential mentions. We binarize the SCPT with right-branching and then generate the transition action sequences using an oracle function, which performs a post-order traversal and outputs SHIFT when it meets a leaf node and REDUCE when it meets a non-terminal node. Afterwards, we use a SPINN model [17] to generate the transition action sequences and extract the outputs at the corresponding time steps as the representations of the potential mentions.

SPINN uses a tracking LSTM to record the history of the whole parsing process. Each state of the tracking LSTM contains a stack and a buffer: the stack stores the partially completed subtrees under processing, while the buffer stores the tokens to be processed. At the very beginning, the stack is empty and the buffer is initialized with the sequence of all tokens of the sentence in order. At every time step, we concatenate the representations of the top two elements of the stack and the first element of the buffer as the input of the tracking LSTM, and apply an MLP with softmax to the tracking output h_t^{tracking} at time step t to predict the action of the next time step, as in Eqs. (1)-(2):

h_t^{MLP} = max(0, W^{MLP} h_t^{tracking} + b^{MLP}),  (1)

p_t = softmax(h_t^{MLP}).  (2)
Fig. 2. Mapping a sentence to its SCPT: (a) the original constituent parse tree; (b) the simplified constituent parse tree.
Obviously, we construct the shift-reduce system with reference to the SCPT: once the predicted action is REDUCE, a potential mention is obtained. At the same time, the top two elements of the stack are popped and combined using the TreeRNN-style composition function of Eqs. (3)-(5) to produce the representation of a new tree node, the parent of the two popped nodes. The output of the composition function is exactly what we need: the representation of the obtained potential mention.

[i; f_l; f_r; o; g] = [σ; σ; σ; σ; tanh](W^{comp} [h_l; h_r; h_{t-1}^{tracking}]),  (3)

c = f_l · c_l + f_r · c_r + i · g,  (4)

h = o · tanh(c),  (5)

where h_l, c_l and h_r, c_r are the hidden states and memory cells of the left and right popped children, and h, c are those of the new parent node.
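A NumPy sketch of this TreeLSTM-style composition (bias terms are omitted for brevity, and the variable names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compose(h_l, c_l, h_r, c_r, h_track, W):
    """Composition of Eqs. (3)-(5): gate the two popped children (l, r)
    and the tracking state into a new parent node (h, c)."""
    d = h_l.shape[0]
    z = W @ np.concatenate([h_l, h_r, h_track])      # pre-activations, shape (5d,)
    i, f_l, f_r, o = (sigmoid(z[k * d:(k + 1) * d]) for k in range(4))
    g = np.tanh(z[4 * d:5 * d])
    c = f_l * c_l + f_r * c_r + i * g                # Eq. (4)
    h = o * np.tanh(c)                               # Eq. (5)
    return h, c

d = 4
rng = np.random.default_rng(1)
h, c = compose(*(rng.standard_normal(d) for _ in range(5)),
               rng.standard_normal((5 * d, 3 * d)))  # W^comp maps 3d -> 5d
print(h.shape, c.shape)   # (4,) (4,)
```

Each REDUCE step calls such a function once, so the representation of every potential mention is produced as a by-product of parsing.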
2.3 Additional Points

Our unified transition-based model can work both on parse trees obtained by external parsers and on unparsed raw text. Figure 3 gives an overview of how our transition-based model works during testing. There are three cases.

SPINN: we use the automatically generated actions of auto SCPTs, determined by the MLP operating on the tracking outputs.
Traditional Parser: we obtain automatic parse trees using external parsers and then extract the auto SCPTs. After that, we convert the auto SCPTs into action sequences to get the representation of each candidate mention (i.e., constituent).
SCPT Parser: we extract gold SCPTs from the same large-scale syntactic parser training data and then convert these SCPTs into action sequences.
Fig. 3. Our transition-based model.
3 Coreference Resolution Model

We use the higher-order coreference model proposed by Lee et al. [13] as our baseline. Similar to Lu and Ng [14], we use the publicly available implementation of the resolver and modify it to jointly learn anaphoricity and coreferential links through the loss function. Figure 4 illustrates our neural model. Specifically, after the contextual embedding layer, we introduce an additional role labeling module (distinguishing true anaphors, singletons and others) using a CRF-based tagging approach. Finally, we employ a gating mechanism to weight the losses of the three parts, i.e., role labeling, mention detection and antecedent identification. We call the system without transition-based mention extraction and representation "LeeDup".
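One possible reading of this loss gating, sketched in NumPy (the exact gate parameterization in the paper may differ; all names here are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_loss(l_role, l_mention, l_antecedent, gate_logits):
    """Combine the three task losses (role labeling, mention detection,
    antecedent identification) with learned gate weights that are
    squashed by a sigmoid and normalized to sum to 1."""
    g = sigmoid(np.asarray(gate_logits, dtype=float))
    g = g / g.sum()
    return float(g @ np.array([l_role, l_mention, l_antecedent], dtype=float))

print(gated_loss(3.0, 6.0, 9.0, [0.0, 0.0, 0.0]))   # equal gates -> 6.0
```

The gate logits would be trainable parameters, letting the model shift emphasis between the three tasks during training.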
4 Experimentation

In this section, we systematically evaluate our proposed approach to neural coreference resolution.

4.1 Experimental Settings

All experiments are conducted on the data from the CoNLL-2012 shared task [16]. Specifically, we use the publicly available pre-trained embeddings for Chinese, i.e., ELMo and BERT. We report precision (the proportion of mentions correctly matched to their antecedents among all matched mentions), recall (the proportion of mentions correctly matched to their antecedents among all possible mentions), and F1 (a balance between precision and recall) for the standard MUC, B3 and CEAFφ4 metrics, using the official CoNLL-2012 evaluation scripts and taking the average F1 of the three metrics as the main evaluation standard.
Q. Li and F. Kong
Fig. 4. Our neural coreference resolution system.
4.2 Experimental Results

We analyze the results from three perspectives. We first show the effectiveness of the proposed unified transition-based model. Then we compare our coreference resolution system with the state-of-the-art (SOTA) systems. Finally, we discuss the contribution of our transition-based model to mention detection and representation, especially for nested mentions.

Effectiveness of Our Transition-based Approach. Table 3 and Table 4 illustrate the results for Chinese and English respectively. We list four settings:
– SPINN: during the test stage, only raw texts are given; the transition-based approach conducts SCPT parsing and mention extraction simultaneously.
– Berkeley: during the test stage, raw texts and the corresponding parse trees are known; we convert the automatic parse trees into SCPTs.
– SCPT: we retrain the Berkeley parser using large-scale gold SCPTs and obtain the auto SCPTs directly during the test stage.
– Gold: the SCPTs are obtained using the gold-standard parse trees.

We can conclude that: For both Chinese and English, the unified transition-based model can improve the performance of coreference resolution by enriching mention extraction and representation. Comparing the two languages under the Gold setting, the effect of our transition-based approach is more significant for Chinese, which may be due to the higher percentage of nested mentions in Chinese. Comparing the "SPINN" setting with the "Berkeley" setting, our transition-based approach performs better under "Berkeley", which means that better parse trees contribute more. Comparing "SCPT" with "Berkeley", the performance under "Berkeley" is slightly better. We also find that our coreference resolution model achieves better recall under "Berkeley" on all the metrics.
Transition-Based Mention Representation
Table 3. Performance of the systems with different word embeddings in Chinese on the test set from the CoNLL-2012 shared task (P, R and F1 in %).

| Embedding  | Systems   | MUC P | MUC R | MUC F1 | B3 P | B3 R | B3 F1 | CEAFφ4 P | CEAFφ4 R | CEAFφ4 F1 | CoNLL Avg. F1 |
|------------|-----------|-------|-------|--------|------|------|-------|----------|----------|-----------|---------------|
| GloVe+ELMo | LeeDup    | 77.5  | 66.5  | 71.3   | 70.4 | 57.6 | 63.0  | 66.5     | 56.7     | 60.9      | 65.1          |
|            | +SPINN    | 78.4  | 73.4  | 75.8   | 68.6 | 61.8 | 65.0  | 62.7     | 59.0     | 60.8      | 67.2          |
|            | +SCPT     | 81.2  | 77.1  | 79.1   | 73.1 | 66.9 | 69.9  | 66.9     | 65.4     | 66.1      | 71.7          |
|            | +Berkeley | 81.4  | 79.5  | 80.4   | 72.2 | 69.5 | 70.8  | 68.2     | 67.1     | 67.6      | 72.9          |
|            | +Gold     | 85.2  | 80.0  | 82.2   | 78.2 | 70.4 | 73.5  | 72.1     | 70.8     | 71.5      | 75.7          |
| BERT       | LeeDup    | 82.7  | 70.1  | 75.5   | 77.1 | 62.9 | 68.5  | 73.1     | 62.8     | 67.3      | 70.4          |
|            | +SPINN    | 83.5  | 82.8  | 83.2   | 74.5 | 74.1 | 74.3  | 71.9     | 70.6     | 71.3      | 76.3          |
|            | +SCPT     | 85.2  | 80.2  | 82.5   | 80.1 | 72.8 | 76.2  | 76.3     | 74.9     | 75.6      | 78.1          |
|            | +Berkeley | 85.8  | 84.8  | 85.3   | 78.3 | 77.9 | 78.1  | 76.4     | 74.2     | 75.3      | 79.6          |
|            | +Gold     | 87.3  | 83.5  | 85.3   | 82.4 | 77.4 | 79.7  | 80.0     | 78.4     | 79.3      | 81.4          |
Table 4. Performance of the systems with different word embeddings in English on the test set from the CoNLL-2012 shared task (P, R and F1 in %).

| Embedding  | Systems   | MUC P | MUC R | MUC F1 | B3 P | B3 R | B3 F1 | CEAFφ4 P | CEAFφ4 R | CEAFφ4 F1 | CoNLL Avg. F1 |
|------------|-----------|-------|-------|--------|------|------|-------|----------|----------|-----------|---------------|
| GloVe+ELMo | LeeDup    | 81.4  | 79.5  | 80.4   | 72.2 | 69.5 | 70.8  | 68.2     | 67.1     | 67.6      | 73.0          |
|            | +SPINN    | 81.3  | 80.9  | 81.1   | 71.4 | 71.2 | 71.5  | 69.0     | 69.1     | 69.0      | 73.9          |
|            | +SCPT     | 84.7  | 79.4  | 81.9   | 76.1 | 69.0 | 72.2  | 70.8     | 68.4     | 69.5      | 74.5          |
|            | +Berkeley | 82.5  | 81.8  | 82.1   | 72.0 | 72.1 | 72.2  | 69.8     | 69.2     | 69.5      | 74.6          |
|            | +Gold     | 84.6  | 83.3  | 83.9   | 76.7 | 74.1 | 75.4  | 71.7     | 73.4     | 72.5      | 77.3          |
| BERT       | LeeDup    | 83.5  | 82.8  | 83.2   | 74.5 | 74.1 | 74.3  | 71.9     | 70.6     | 71.3      | 76.3          |
|            | +SPINN    | 86.8  | 82.7  | 84.7   | 78.4 | 73.6 | 75.7  | 74.5     | 71.9     | 73.2      | 77.9          |
|            | +SCPT     | 87.3  | 83.3  | 85.0   | 75.0 | 80.5 | 77.0  | 75.8     | 74.3     | 75.2      | 79.1          |
|            | +Berkeley | 86.2  | 88.9  | 87.5   | 75.6 | 80.9 | 78.8  | 72.8     | 77.6     | 75.3      | 80.5          |
|            | +Gold     | 86.4  | 89.4  | 88.1   | 78.4 | 82.1 | 80.6  | 75.9     | 79.0     | 77.6      | 82.1          |
| SpanBERT   | LeeDup    | 85.8  | 84.8  | 85.3   | 78.3 | 77.9 | 78.1  | 76.4     | 74.2     | 75.3      | 79.6          |
|            | +SPINN    | 85.7  | 86.2  | 86.0   | 77.5 | 79.6 | 78.8  | 77.2     | 76.2     | 76.7      | 80.5          |
|            | +SCPT     | 87.3  | 83.5  | 85.3   | 77.4 | 82.4 | 79.7  | 80.0     | 78.4     | 79.3      | 81.4          |
|            | +Berkeley | 87.5  | 90.3  | 89.1   | 77.6 | 83.8 | 81.2  | 78.9     | 80.5     | 79.9      | 83.4          |
|            | +Gold     | 89.1  | 93.0  | 91.3   | 80.0 | 82.3 | 81.4  | 73.8     | 81.2     | 78.2      | 83.6          |
Comparison with the SOTAs. We compare our model with the following SOTA models, grouped by the pre-trained embeddings. Table 5 illustrates their average F1.
– GloVe: Clark and Manning [1], which employs reinforcement learning to directly optimize a neural mention-ranking model; Kong and Fu [7], which refines the span representation by encoding contextual information in the traversal node sequence.
– GloVe+ELMo: Lee et al. [13], which refines the span-ranking architecture by modeling higher-order interactions between spans in predicted clusters.
– BERT: Kantor and Globerson [6], which represents each mention by an entity equalization mechanism to better capture properties of entity clusters.
– SpanBERT: Joshi et al. [4], which uses a transformer to encode fixed-length non-overlapping segments; Wu et al. [20], which casts coreference resolution as a span prediction task in a question-answering manner.
Table 5. CoNLL average F1 of the state-of-the-art systems from the CoNLL-2012 shared task.

| Language | Systems                  | Avg F1 |
|----------|--------------------------|--------|
| Chinese  | Clark and Manning [1]    | 63.9   |
| Chinese  | Kong and Fu [7]          | 63.9   |
| English  | Clark and Manning [1]    | 65.7   |
| English  | Kong and Fu [7]          | 68.6   |
| English  | Lee et al. [13]          | 73.0   |
| English  | Kantor and Globerson [6] | 76.6   |
| English  | Joshi et al. [4]         | 79.6   |
| English  | Wu et al. [20]           | 83.1   |
Contributions of our transition-based model to mention detection and representation are as follows:
1. For span detection, the recalls of both nested and non-nested mentions are improved significantly, which shows that almost all potential mentions are extracted. The precision of SPD also improves, which means that impossible spans are better filtered under the "Berkeley" setting.
2. For mention detection, the recall improvements on both nested and non-nested mentions are more pronounced in Chinese. This may be due to the higher proportion of the nested phenomenon.
5 Conclusion

In this paper, we focus on better mention extraction and representation for neural coreference resolution. In particular, we first map each sentence into an SCPT. Then we use a transition-based strategy to construct the SCPT structure in a bottom-up manner. In this way, the potential mentions can be obtained and the transition action sequences can be viewed as their representations.
References
1. Clark, K., Manning, C.D.: Deep reinforcement learning for mention-ranking coreference models. In: Proceedings of EMNLP-2016 (2016)
2. Clark, K., Manning, C.D.: Improving coreference resolution by learning entity-level distributed representations. In: Proceedings of ACL-2016 (2016)
3. Hobbs, J.R.: Resolving pronoun references. Lingua 44(4), 311–338 (1978)
4. Joshi, M., Chen, D., Liu, Y., Weld, D.S., Zettlemoyer, L., Levy, O.: SpanBERT: improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguist. 8, 64–77 (2020)
5. Joshi, M., Levy, O., Zettlemoyer, L., Weld, D.: BERT for coreference resolution: baselines and analysis. In: Proceedings of EMNLP-IJCNLP-2019, Hong Kong, China, pp. 5803–5808 (2019)
6. Kantor, B., Globerson, A.: Coreference resolution with entity equalization. In: Proceedings of ACL-2019, Florence, Italy, pp. 673–677 (2019). https://doi.org/10.18653/v1/P19-1066, https://www.aclweb.org/anthology/P19-1066
7. Kong, F., Fu, J.: Incorporating structural information for better coreference resolution. In: Proceedings of IJCAI-2019, pp. 5039–5045 (2019)
8. Kong, F., Zhou, G.: Pronoun resolution in English and Chinese languages based on tree kernel. J. Software 23(5), 1085–1099 (2012)
9. Kummerfeld, J.K., Klein, D.: Error-driven analysis of challenges in coreference resolution. In: Proceedings of EMNLP-2013 (2013)
10. Lappin, S., Leass, H.J.: An algorithm for pronominal anaphora resolution. Comput. Linguist. 20(4) (1994). http://aclweb.org/anthology/J94-4002
11. Lee, H., Chang, A., Peirsman, Y., Chambers, N., Surdeanu, M., Jurafsky, D.: Deterministic coreference resolution based on entity-centric, precision-ranked rules. Comput. Linguist. 39(4), 885–916 (2013)
12. Lee, K., He, L., Lewis, M., Zettlemoyer, L.: End-to-end neural coreference resolution. In: Proceedings of EMNLP-2017 (2017)
13. Lee, K., He, L., Zettlemoyer, L.: Higher-order coreference resolution with coarse-to-fine inference. In: Proceedings of NAACL-2018, Volume 2 (Short Papers) (2018)
14. Lu, J., Ng, V.: Conundrums in entity coreference resolution: making sense of the state of the art. In: Proceedings of EMNLP-2020, pp. 6620–6631 (2020). https://doi.org/10.18653/v1/2020.emnlp-main.536
15. Ng, V., Cardie, C.: Improving machine learning approaches to coreference resolution. In: Proceedings of ACL-2002 (2002). http://aclweb.org/anthology/P02-1014
16. Pradhan, S., Moschitti, A., Xue, N., Uryupina, O., Zhang, Y.: CoNLL-2012 shared task: modeling multilingual unrestricted coreference in OntoNotes. In: Proceedings of EMNLP-CoNLL-2012, pp. 1–40 (2012)
17. Samuel, R.B., Abhinav, R., Raghav, G., Christopher, D.M., Christopher, P.: A fast unified model for parsing and sentence understanding. In: Proceedings of ACL-2016 (2016)
18. Soon, W.M., Lim, D.C.Y., Ng, H.T.: A machine learning approach to coreference resolution of noun phrases. Comput. Linguist. 27(4) (2001). http://aclweb.org/anthology/J01-4004
19. Wiseman, S., Rush, A.M., Shieber, S., Weston, J.: Learning anaphoricity and antecedent ranking features for coreference resolution. In: Proceedings of ACL-IJCNLP-2015 (2015)
20. Wu, W., Wang, F., Yuan, A., Wu, F., Li, J.: CorefQA: coreference resolution as query-based span prediction. In: Proceedings of ACL-2020, pp. 6953–6963 (2020)
Speaker-Aware Dialogue Discourse Parsing with Meta-Path Based Heterogeneous Graph Neural Network

Shaoming Ji1,2 and Fang Kong1,2(B)

1 Laboratory for Natural Language Processing, Soochow University, Suzhou, China
[email protected], [email protected]
2 School of Computer Science and Technology, Soochow University, Suzhou, China
Abstract. Dialogue discourse parsing aims to identify the discourse links and relations between utterances, and has attracted more and more research interest in recent years. However, the speaker, the most essential characteristic of dialogue, has not been fully considered in most previous works. Accordingly, we propose a speaker-aware meta-path based heterogeneous graph neural network, including node-level aggregation and meta-path level aggregation. Concretely, we first construct a novel heterogeneous graph and define two meta-paths: an intra-speaker meta-path and an inter-speaker meta-path. Node-level aggregation then aggregates the corresponding neighbors along each meta-path, and meta-path level aggregation uses a fusion gate to obtain the optimal combination of neighbors from the different meta-paths. Experiments on Molweni and STAC demonstrate the effectiveness of our model compared with previous state-of-the-art systems.

Keywords: Dialogue Discourse Parsing · Speaker · Heterogeneous Graph Neural Network
1 Introduction

Dialogue discourse parsing aims to find all the discourse links and corresponding relations in a dialogue. It can thus reveal the discourse structure of a dialogue, which is helpful for many other NLP tasks, such as dialogue comprehension [6], emotion recognition [15] and so on. Most studies on discourse parsing are based on Rhetorical Structure Theory (RST) [11], the Penn Discourse TreeBank (PDTB) [13] or Segmented Discourse Representation Theory (SDRT) [2]. Because the whole dialogue structure is needed and crossing dependencies exist in dialogues, SDRT [2] is more appropriate for dialogue discourse parsing. Figure 1 shows an example from Molweni [7] annotated with its dependency structure. In dialogue discourse parsing, each elementary discourse unit (EDU) corresponds to an utterance in a conversation and each arc between EDUs represents a discourse relation. In this example, there is a pair of crossing dependencies, U3 → U5 and U2 → U4.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 575–586, 2023. https://doi.org/10.1007/978-981-99-4752-2_47
S. Ji and F. Kong
Compared with monologues and plain texts, dialogues are less coherent and more sophisticated to understand because multiple participants are involved. Therefore, speakers in a dialogue deserve special attention. However, most previous works have not fully considered speakers. Many works [1, 5, 12, 14, 17] only leverage whether two EDUs are from the same speaker but do not capture features across different speakers. Liu and Chen [9] and Chi and Rudnicky [4] concatenate speaker names with the original EDU texts to enhance representations, while the semantic interaction between speakers is ignored. Yu et al. [19] propose a second-stage pre-training task for the same speaker and a speaker-context interaction joint encoding module for different speakers, which may not be straightforward and convenient due to the division of speaker modeling into two stages. Besides, same-speaker prediction for second-stage pre-training is data-consuming because it requires an extra unlabeled dialogue corpus.
Fig. 1. An example from the Molweni [7] corpus. Solid, dashed and dotted arrows respectively denote Comment, Clarification question and Question-answer pair.
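The crossing dependencies in this example (U3 → U5 together with U2 → U4) can be detected mechanically: two arcs (a, b) and (c, d), each with the smaller index first, cross if and only if one arc starts strictly inside the other and ends strictly outside it. A minimal illustrative check, not part of the authors' model:

```python
# Check whether a set of dependency arcs between EDUs contains a
# crossing pair: arcs (a, b) and (c, d) cross iff a < c < b < d
# (or symmetrically c < a < d < b).
def has_crossing(arcs):
    norm = [tuple(sorted(arc)) for arc in arcs]
    for idx, (a, b) in enumerate(norm):
        for c, d in norm[idx + 1:]:
            if a < c < b < d or c < a < d < b:
                return True
    return False
```

For the example dialogue, `has_crossing([(3, 5), (2, 4)])` holds, which is exactly why a tree-shaped discourse structure cannot represent it and SDRT-style dependency graphs are preferred.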
In summary, although the effectiveness and necessity of speaker modeling in dialogue discourse parsing have been proved in these studies, previous speaker modeling either under-considers or even ignores speaker interaction, or requires additional dialogue corpora. Accordingly, we continue focusing on speakers and propose a speaker-aware meta-path based heterogeneous graph neural network that considers the semantic interaction between EDUs of the same speaker and of different speakers at the same time, without additional dialogue corpora. Concretely, we first define two meta-paths: an intra-speaker meta-path and an inter-speaker meta-path. Afterwards we aggregate the neighboring nodes along each meta-path on the designed heterogeneous graph, and then aggregate the node information from the different meta-paths through a fusion gate to obtain the optimal combination. This process is also called hierarchical aggregation. In summary, the contributions of this paper are:
– We propose a novel speaker-aware model, and this is the first attempt to adopt a heterogeneous graph neural network for dialogue discourse parsing.
– Our model can extract semantic interactive information from the same speaker and from different speakers without an additional dialogue corpus, which is convenient and less data-consuming.
Speaker-Aware DDP with Meta-Path Based HGNN
– We prove the effectiveness of our model and show that it is more effective in link prediction as the number of dialogue speakers grows.
2 Model

In this section, we describe the details of our model, which includes four main modules: utterance encoder, sequential modeling, heterogeneous graph neural network and discourse parsing. Figure 2 shows the architecture of our model.

2.1 Utterance Encoder

The first part of our model is the utterance encoder. In view of the strong performance of pre-trained language models in sentence representation, we use a pre-trained model to encode each utterance. Following Wang et al. [17], we adopt the ELECTRA-small pre-trained checkpoint for a fair comparison. Concretely, given a dialogue D consisting of n utterances {e_i}_{i=1}^n, the input is the tokens of an utterance e_i and the utterance encoder outputs the utterance representation u_i via mean pooling. As shown in Fig. 2, u_1 ∼ u_7 are obtained in this way. Following previous works, we add a dummy root u_0, represented by a zero vector, to mark the beginning of the dialogue.

2.2 Sequential Modeling

Because dialogues are developed and organized sequentially by nature, it is necessary to capture sequential contextual information along the timeline of utterances. For this goal, after the utterance encoder, we adopt a BiGRU to learn sequential context-aware utterance representations:

g_i = BiGRU(g_{i±1}, u_i)    (1)

where u_i is the output of the utterance encoder and g_i denotes the sequential context-aware utterance representation. Up to now, the speaker of each utterance has not been involved; speaker modeling will be discussed in detail in the next module.

2.3 Heterogeneous Graph Neural Network

In this module, we put our efforts on speaker modeling. We propose a meta-path based heterogeneous graph neural network with hierarchical aggregation, i.e., node-level aggregation and meta-path level aggregation, to learn speaker-aware representations of each utterance.
We will first describe how to construct a graph based on a dialogue and introduce two kinds of meta-paths: the intra-speaker meta-path and the inter-speaker meta-path. Then we will
Fig. 2. The architecture of our model, including the utterance encoder, sequential modeling, speaker modeling and discourse parsing. We use the dialogue in Fig. 1 to explain our method. Utterances are colored according to their speakers. In discourse parsing, we take h_5 as an example for a clear presentation of link prediction and relation classification.
describe the detailed processes of node-level aggregation and meta-path level aggregation respectively.

Graph Construction.
Nodes: There are two types of nodes in our graph: speaker nodes and utterance nodes. Each utterance node g_i is initialized by the output of the sequential modeling module, while each speaker node s_i serves as a bridge or intermediate node connecting two utterances and has no practical vector representation. Taking the dialogue in Fig. 1 as an example, its heterogeneous graph is shown in Fig. 2, where eight utterances shown in different colors are spoken by four speakers. Note that the speaker of the dummy root is Root, denoted by s_0.
Edges: The heterogeneous graph also has two kinds of edges. The first connects each speaker node to the utterance nodes spoken by that speaker. The second is set between two arbitrary speaker nodes to make sure that any two utterances are mutually accessible. As shown in Fig. 1, U2, U5 and U7 are all uttered by anto9us, so s_2, denoting anto9us, connects to g_2, g_5 and g_7 in Fig. 2. Also in Fig. 2, s_0, s_1, s_2 and s_3 are connected in pairs. Note that g_0 is only linked with s_0, which indicates that speaker Root only utters the dummy root u_0.
Meta-paths: A meta-path Φ can be defined as a path of the form A_1 →^{R_1} A_2 →^{R_2} ... →^{R_l} A_{l+1} (abbreviated as A_1 A_2 ... A_{l+1}) [18], where A_i and R_i represent a node type and an edge relation respectively. This path involves a composite relation R = R_1 ∘ R_2 ∘ ... ∘ R_l between the source node A_1 and the target node A_{l+1}, where ∘ denotes the composition operator. In this paper, to capture speaker-aware features in a dialogue graph, we define two kinds of meta-paths, named the intra-speaker meta-path and the inter-speaker meta-path:
Intra-speaker meta-path means two utterances are spoken by the same speaker. From Fig. 2, we can see that two utterances from the same speaker can be connected through
a speaker node, hence the intra-speaker meta-path can be described as Utterance-Speaker-Utterance (abbreviated as USU). For example, g_1 − s_1 − g_4 is an instance of the intra-speaker meta-path.
Inter-speaker meta-path means two utterances are uttered by different speakers. Figure 2 shows that two utterances from different speakers can be connected through two different speaker nodes, so the inter-speaker meta-path can be denoted as Utterance-Speaker-Speaker-Utterance (abbreviated as USSU). And g_1 − s_1 − s_2 − g_5 is an example of the inter-speaker meta-path.
Meta-path Based Neighbors: Along a given meta-path, each node has a set of neighbors, known as meta-path based neighbors. In this paper, there are two types of meta-path based neighbors according to our predefined meta-paths: intra-speaker meta-path based neighbors and inter-speaker meta-path based neighbors. Taking g_5 in Fig. 2 as an example, its intra-speaker meta-path based neighbors are g_2, g_5 and g_7, while its inter-speaker meta-path based neighbors are g_0, g_1, g_3, g_4, g_5 and g_6. It is worth noting that a node itself is also included in its meta-path based neighbors.
Node-level Aggregation. Given a meta-path Φ, the meta-path based neighbors of each node contribute differently to learning the node representation, so different importance should be attached to the neighbors of each node. Here we adopt self-attention [16] to learn the weights among different nodes. Before calculating the weights, we follow self-attention [16] to obtain the input (i.e., queries, keys and values) of scaled dot-product attention by projecting utterances into different spaces:

Q^Φ / K^Φ / V^Φ = W^Φ_{Q/K/V} G    (2)

where G = {g_0, g_1, ..., g_n} is the output of sequential modeling, W^Φ_Q, W^Φ_K and W^Φ_V denote different projection matrices on meta-path Φ, and Q^Φ, K^Φ and V^Φ denote the queries, keys and values on meta-path Φ. Since Fan et al. [5] and Wang et al. [17] have demonstrated the effectiveness of modeling distances in the dialogue discourse parsing task, we also take the temporally relative distance between two utterances into consideration when calculating the weight of a node pair (j, i) connected via meta-path Φ. Equations 3–6 show the whole process of updating the node representation on a given meta-path Φ. First, we calculate the part of the weights caused by temporally relative distances: compute the relative distance of the node pair, map this distance into a distance vector, and project this vector into a weight dist^Φ_{ij} through a linear transformation. Second, the final weights of node pairs are obtained by adding the weight of scaled dot-product attention and the relative distance weight dist^Φ_{ij}. Finally, a softmax function is leveraged to normalize the weights over all neighboring nodes of the target node via masked attention, and all neighbors based on meta-path Φ are aggregated by the related weights to get the updated utterance node representation h^Φ_i:

h^Φ_i = Σ_{j∈N^Φ_i} a^Φ_{ij} · v^Φ_j    (3)

dist^Φ_{ij} = Linear^Φ(Embedding^Φ(j − i))    (4)

e^Φ_{ij} = (q^Φ_i)^T · k^Φ_j / √d + dist^Φ_{ij}    (5)

a^Φ_{ij} = exp(e^Φ_{ij}) / Σ_{k=1}^{|N^Φ_i|} exp(e^Φ_{ik})    (6)

where q^Φ_i, k^Φ_j and v^Φ_j respectively denote the corresponding query, key and value vectors from Eq. 2, dist^Φ_{ij} is the weight from the relative distance j − i on meta-path Φ, d denotes the model dimension, N^Φ_i is the set of neighboring nodes of the target node i on meta-path Φ, a^Φ_{ij} is the normalized weight of the node pair (j, i) connected via meta-path Φ while e^Φ_{ij} is the unnormalized one, and h^Φ_i is the aggregated node representation on meta-path Φ. For more stable training and a more robust model, node aggregation is extended to multi-head attention as follows:

h^Φ_i = concat(h^Φ_{i1}, h^Φ_{i2}, ..., h^Φ_{iK}) · W^Φ_O    (7)

where concat denotes the concatenation of vectors, K is the number of heads, and W^Φ_O is a weight matrix. Considering that there are two kinds of meta-paths in our devised graph, we obtain two groups of node representations. Here we use H^S = {h^S_1, h^S_2, ..., h^S_n} and H^O = {h^O_1, h^O_2, ..., h^O_n} to represent the node representations collected by the intra-speaker and inter-speaker meta-path respectively.
Meta-path Level Aggregation. The two groups of node representations from the intra-speaker and inter-speaker meta-paths can be seen as complementary with respect to speakers. Therefore, in order to fuse the node representations from the two meta-paths, we introduce an information fusion gate [8] in which heuristic features are adopted to capture discrepancy and consistency signals. The fusion gate p_i is calculated as follows:

h^1_i = ReLU(Linear([g_i, h^S_i, g_i − h^S_i, g_i ⊙ h^S_i]))    (8)

h^2_i = ReLU(Linear([g_i, h^O_i, g_i − h^O_i, g_i ⊙ h^O_i]))    (9)

p_i = Sigmoid(Linear([h^1_i, h^2_i]))    (10)

where ⊙ denotes the element-wise product, [·,·] denotes concatenation, g_i is the original output of sequential modeling, h^S_i and h^O_i are the updated representations of node i from the intra-speaker and inter-speaker meta-paths, h^1_i preserves the comparison signals between g_i and h^S_i, and h^2_i those between g_i and h^O_i.
After obtaining the fusion gate p_i, the two complementary node representations are fused to get the optimal combination of neighbors from the two meta-paths:

h_i = p_i ⊙ h^S_i + (1 − p_i) ⊙ h^O_i    (11)

where h_i denotes the final representation after meta-path level aggregation. Taking g_5 in Fig. 2 as an example, the hierarchical aggregation is demonstrated in Fig. 3 for better understanding.
Fig. 3. Explanation of hierarchical aggregation
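The hierarchical aggregation just described can also be sketched numerically. In the toy code below, the learned attention of Eqs. 3–6 is replaced by uniform weights and the learned gate of Eqs. 8–10 by a fixed scalar, so only the neighbor-set and gating logic is shown; the speaker labels s0–s3 are generic placeholders inferred from Fig. 2 (u0 is the dummy root, and u2/u5/u7 share a speaker):

```python
# USU (intra-speaker) and USSU (inter-speaker) neighbor sets computed
# from speaker assignments; a node is included among its own neighbors,
# matching the note in the text.
def metapath_neighbors(speakers, i):
    intra = sorted(j for j, s in enumerate(speakers) if s == speakers[i])
    inter = sorted([j for j, s in enumerate(speakers) if s != speakers[i]] + [i])
    return intra, inter

def aggregate(vectors, neighbors):
    # uniform weights stand in for learned attention, for illustration only
    dim = len(vectors[0])
    return [sum(vectors[j][k] for j in neighbors) / len(neighbors) for k in range(dim)]

def fuse(p, h_intra, h_inter):
    # Eq. 11 with the learned gate collapsed to a scalar p
    return [p * a + (1 - p) * b for a, b in zip(h_intra, h_inter)]

speakers = ["s0", "s1", "s2", "s3", "s1", "s2", "s3", "s2"]   # u0..u7
g = [[float(i)] for i in range(8)]       # toy 1-d utterance representations
intra, inter = metapath_neighbors(speakers, 5)
h5 = fuse(0.5, aggregate(g, intra), aggregate(g, inter))
```

Running this for node 5 reproduces the neighbor sets given in the text: the intra-speaker neighbors are nodes 2, 5 and 7, and the inter-speaker neighbors are 0, 1, 3, 4, 5 and 6.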
2.4 Discourse Parsing

Following previous works, dialogue discourse parsing can be divided into two sub-tasks, i.e., link prediction and relation classification. These two sub-tasks are generally similar, but the number of final classes differs. In this paper, we adopt a multi-layer perceptron (MLP) over the output H = {h_1, h_2, ..., h_n} of the graph encoder to score links or relations between the current EDU and its preceding EDUs. We compute the scores for predicting links and relations with the following equations:

s_{ij} = W_2 · Tanh(W_1 · [h_i, h_j] + b_1) + b_2    (12)

p_{ij} = softmax(s_{ij})    (13)

where W_1 and W_2 are weight matrices and b_1 and b_2 are bias terms.
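As a toy illustration of Eqs. 12–13, the sketch below replaces the learned MLP with a given scoring function: each EDU i is scored against every preceding EDU j (including the dummy root 0), and a softmax over those scores yields a distribution over candidate parents. The distance-based scorer at the end is purely a placeholder:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def predict_parent(i, score):
    candidates = list(range(i))               # preceding EDUs incl. root 0
    probs = softmax([score(i, j) for j in candidates])
    return max(candidates, key=lambda j: probs[j])

# e.g. a dummy scorer that prefers the closest preceding EDU
parent = predict_parent(3, lambda i, j: -abs(i - j))
```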
3 Experiments

3.1 Datasets

We conduct our experiments on two datasets: Molweni [7] and STAC [3]. Molweni annotates 9000, 500 and 500 dialogues for training, development and testing respectively; all of these dialogues are from the Ubuntu chat corpus. STAC is much smaller than Molweni and contains 1062 and 111 dialogues for training and testing respectively; all dialogues in STAC are collected from an online game. We follow Shi and Huang [14] to preprocess the two datasets.
3.2 Experimental Settings

Experiments are conducted using the PyTorch 1.10.1 framework, and an NVIDIA Tesla V100 is used for computational acceleration. In the utterance encoder, following Wang et al. [17], ELECTRA-small is used and fine-tuned for better performance, having been proved effective on some tasks despite its small model size. During training, the number of epochs is 40 and the batch size is 32 for Molweni and 16 for STAC, because STAC is much smaller than Molweni. The learning rate of ELECTRA-small is set to 1e−5 and the learning rate of the other modules is set to 3e−4. The hidden size of the model and the dimension of the relative distance vector are both 256. The head number K in node aggregation is 4. To prevent overfitting, dropout is 0.4 and the L2 weight is 0.01. For stable training, a warmup strategy is adopted with its ratio set to 0.05. We optimize model parameters with AdamW. Following previous works, we use micro F1 as the evaluation metric.

3.3 Baselines

To verify the effectiveness of our model, we compare it with previous SOTA systems:
– MST [1]: adopts a MaxEnt model to learn local probability distributions and then obtains a dependency tree with a maximum spanning tree algorithm.
– ILP [12]: obtains a more general DAG by integer linear programming.
– DeepSequential [14]: relies on neural networks instead of hand-crafted features to predict dependency links and construct a discourse structure alternately and jointly.
– Hierarchical [9]: uses a hierarchical encoder to get context-enhanced EDU representations.
– SSA [17]: uses graph neural networks with fully connected graphs to avoid the error propagation of DeepSequential, with two auxiliary losses used to improve performance.
– DAMT [5]: attempts to combine the advantages of transition-based and graph-based parsers during encoding and decoding.
– SA_DPMD [19]: It focuses on speakers.
It proposes a second-stage pre-training task (i.e., same speaker prediction) and a speaker-context joint encoding module to capture features from the same speaker and from different speakers, respectively.
– SDDP [4]: performs structured dialogue discourse parsing during training and inference and achieves the best performance.

3.4 Main Results

Table 1 shows the results of our model as well as previous SOTA systems on the two benchmark datasets Molweni and STAC. The Link metric evaluates only dependency links, and the Link&Rel metric evaluates dependency links together with discourse relations. From Table 1, we can draw the following conclusions: In terms of methods, our model and the SSA model are both based on graph neural networks but differ in graph construction and the corresponding updating method. As the results show, our model improves over SSA by 1.5% (Link) and 1.0% (Link&Rel) on Molweni and obtains comparable results on STAC, which indicates that our speaker-based heterogeneous graph may be better than the fully connected homogeneous graph adopted in SSA. Turning to research focus, both our model and SA_DPMD are devoted to speaker modeling. From the results, our model is comparable to SA_DPMD but our
Table 1. Performance comparison. The results of DAMT and SA_DPMD are from the original papers, and the results of the other models are taken from Chi and Rudnicky [4].

| Models         | Molweni Link | Molweni Link&Rel | STAC Link | STAC Link&Rel |
|----------------|--------------|------------------|-----------|---------------|
| MST            | 69.0         | 48.7             | 69.6      | 52.1          |
| ILP            | 67.3         | 48.3             | 69.0      | 53.1          |
| DeepSequential | 76.1         | 53.3             | 73.2      | 54.4          |
| Hierarchical   | 80.1         | 56.1             | 73.1      | 57.1          |
| SSA            | 81.6         | 58.4             | 73.4      | 57.3          |
| DAMT           | 82.5         | 58.9             | 73.6      | 57.4          |
| SA_DPMD        | 83.7         | 59.4             | 73.0      | 57.4          |
| SDDP           | 83.5         | 59.9             | 74.4      | 59.6          |
| Ours           | 83.1         | 59.4             | 74.0      | 57.2          |
model is less data-consuming since it needs no additional unlabeled dialogue corpus for second-stage pre-training. Compared with the SOTA (i.e., SDDP), our model obtains comparable results on Molweni, but on STAC our model falls behind SDDP, especially in Link&Rel. We speculate that this is because STAC is much smaller than Molweni and the speaker information of STAC (with an average speaker number of 2.92) is not as rich as that of Molweni (with an average speaker number of 3.51).

3.5 Additional Analyses

Performance on Dialogues with Different Speaker Numbers. To further analyze the necessity of speaker interactions, we investigate the link accuracy on dialogues with different numbers of speakers on Molweni. The baselines for comparison are DeepSequential, SSA and SDDP, in which speaker information is leveraged in three representative ways but speaker interactions are ignored or considered incompletely. The detailed results are shown in Fig. 4(a). Across all models, the performance on dialogues with more than two speakers is much lower than that on dialogues with two speakers, indicating that dialogues with two speakers are relatively easier to understand, while dialogues involving more speakers may be too disorderly and complex to obtain their structures. Compared with all baselines, our model outperforms them on conversations with at least 3 speakers, especially on conversations with at least 5 speakers. This result demonstrates that taking the semantic interaction between speakers into account brings more benefits for parsing conversations involving more speakers.

Performance on Dependency Links with Different Distances. To demonstrate the effectiveness of our meta-path based heterogeneous graph neural network, we investigate
the link accuracy of dependency links with different distances on Molweni; the detailed results can be seen in Fig. 4(b). The baseline we choose is SSA, which is similar to our model but differs in graph construction and the corresponding updating mechanism.
Fig. 4. Analysis including (a) link accuracy of dialogues with different speaker numbers and (b) link accuracy of dependency links with different distances.
Both models generally show a decreasing tendency, which demonstrates that long-range dependency links are difficult to recognize. Our model outperforms SSA at distances greater than 3 and does not drop as sharply. This reflects that, for long-range link prediction, our speaker-aware heterogeneous graph encoder is more effective than the fully connected homogeneous graph encoder of SSA.

3.6 Ablation Study

To compare the importance of the intra-speaker and inter-speaker meta-paths, we conduct an ablation study on these two meta-paths. Additionally, we conduct an ablation study on distance to demonstrate that taking distances into consideration is beneficial for dialogue discourse parsing. The results are shown in Table 2.

Table 2. Ablation study. -intra and -inter denote deleting the intra-speaker and inter-speaker meta-path respectively; -both denotes deleting both meta-paths; -distance denotes deleting distance information.

Models    | Molweni Link | Molweni Link&Rel | STAC Link | STAC Link&Rel
final     | 83.1         | 59.4             | 74.0      | 57.2
-intra    | 77.9         | 55.9             | 71.5      | 54.9
-inter    | 80.7         | 58.0             | 72.3      | 55.3
-both     | 71.2         | 50.7             | 69.3      | 51.6
-distance | 81.2         | 58.5             | 72.3      | 54.0
Speaker-Aware DDP with Meta-Path Based HGNN
585
When comparing the results of -intra and -inter, we find that on both datasets, removing the intra-speaker meta-path causes a much larger performance drop, so the intra-speaker meta-path is relatively more significant than the inter-speaker one. This may be because focusing on utterances from the same speaker helps the model better understand the development of the whole dialogue involving that speaker, which also explains why previous works [10, 14, 17] that model only the same speaker can achieve good results. The fact that -both drops the most demonstrates the effectiveness and necessity of our meta-path based heterogeneous graph neural network. When distance information is removed, the results on both datasets drop, which reflects that taking distances into consideration is beneficial for dialogue discourse parsing.
4 Conclusion

In this paper, we propose a speaker-aware meta-path based heterogeneous graph neural network for dialogue discourse parsing. First, we define two meta-paths: the intra-speaker meta-path and the inter-speaker meta-path. Then we learn utterance representations by hierarchical aggregation. Experiments on benchmark datasets show the effectiveness of our method. In the future, we will further explore the influence of injecting structured biases into dialogue discourse parsing.
References

1. Afantenos, S., Kow, E., Asher, N., Perret, J.: Discourse parsing for multi-party chat dialogues. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015)
2. Asher, N., Lascarides, A.: Logics of Conversation. Peking University Press (2003)
3. Asher, N., Hunter, J., Morey, M., Benamara, F., Afantenos, S.: Discourse structure and dialogue acts in multiparty dialogue: the STAC corpus. In: 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 2721–2727 (2016)
4. Chi, T.C., Rudnicky, A.: Structured dialogue discourse parsing. In: Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 325–335 (2022)
5. Fan, Y., Li, P., Kong, F., Zhu, Q.: A distance-aware multi-task framework for conversational discourse parsing. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 912–921 (2022)
6. He, Y., Zhang, Z., Zhao, H.: Multi-tasking dialogue comprehension with discourse parsing (2021)
7. Li, J., et al.: Molweni: a challenge multiparty dialogues-based machine reading comprehension dataset with discourse structure. arXiv preprint arXiv:2004.05080 (2020)
8. Liu, L., Zhang, Z., Zhao, H., Zhou, X., Zhou, X.: Filling the gap of utterance-aware and speaker-aware representation for multi-turn dialogue. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 13406–13414 (2021)
9. Liu, Z., Chen, N.F.: Improving multi-party dialogue discourse parsing via domain integration. arXiv e-prints (2021)
10. Ma, X., Zhang, Z., Zhao, H.: Structural characterization for dialogue disentanglement. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 285–297 (2022)
11. Mann, W.C., Thompson, S.A.: Rhetorical structure theory: toward a functional theory of text organization. Text – Interdiscipl. J. Study Discourse 8(3), 243–281 (1988)
12. Perret, J., Afantenos, S., Asher, N., Morey, M.: Integer linear programming for discourse parsing. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2016)
13. Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Webber, B.L.: The Penn discourse treebank 2.0. In: Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco (2008)
14. Shi, Z., Huang, M.: A deep sequential model for discourse parsing on multi-party dialogues. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 7007–7014 (2019)
15. Sun, Y., Yu, N., Fu, G.: A discourse-aware graph neural network for emotion recognition in multi-party conversation. In: Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2949–2958 (2021)
16. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (2017)
17. Wang, A., et al.: A structure self-aware model for discourse parsing on multi-party dialogues. In: International Joint Conference on Artificial Intelligence (2021)
18. Wang, X., et al.: Heterogeneous graph attention network (2019)
19. Yu, N., Fu, G., Zhang, M.: Speaker-aware discourse parsing on multi-party dialogues. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 5372–5382 (2022)
RA-KD: Random Attention Map Projection for Knowledge Distillation Linna Zhang , Yuehui Chen(B)
, Yi Cao , and Yaou Zhao
School of Information Science and Engineering, University of Jinan, Jinan, China {yhchen,isecaoy,isezhaoyo}@ujn.edu.cn
Abstract. Pretrained language models such as BERT, ELMO, and GPT have proven effective for natural language processing tasks. However, deploying them on computationally constrained devices poses a challenge, and the practical application of these models is limited by the cost of training and deploying large-scale pretrained language models. While intermediate-layer knowledge distillation (KD) can improve standard KD techniques, especially for large-scale pretrained language models, it brings excessive computational burden as well as the engineering difficulty of mapping teacher layers onto a student model with a different number of layers. The attention map is one of the essential blocks in the intermediate layer. To address these problems, we propose an approach called random attention map projection, in which intermediate layers are randomly selected from the teacher model and their attention-map knowledge is distilled into the student model's attention blocks. This approach enables the student model to capture deeper semantic information while reducing the computational cost of intermediate-layer distillation. We conducted experiments on the GLUE benchmark to verify the effectiveness of our approach. Our proposed RA-KD approach performs considerably better than other KD approaches in both accuracy and training time. Keywords: Model Compression · Knowledge Distillation · BERT
1 Introduction

Pretrained language models such as BERT [1], GPT [2], and ELMO [3] have been shown to have strong text representation capabilities. They are usually trained on large-scale unlabeled datasets. Although BERT has become a general paradigm for NLP, due to its millions of parameters and the computational resources consumed during training, the difficulty of deploying it has become a recognized problem on edge and computationally constrained devices. The original BERT-Base model has 12 layers and 110 million parameters; training it from scratch usually takes four days on 4 to 16 Cloud TPUs. The large model size and long inference time greatly limit the scope of application of BERT. There are various approaches to model compression of BERT, such as pruning [4], quantization [5], and knowledge distillation [6]. These approaches aim to reduce the size © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 587–596, 2023. https://doi.org/10.1007/978-981-99-4752-2_48
588
L. Zhang et al.
and latency of the model while maintaining performance comparable to the original model. Inspired by this, we make knowledge distillation (KD) the focus of this work. Knowledge distillation is a neural network compression method: small student models are trained by fine-tuning on downstream tasks under the guidance of a large pretrained teacher model. In the earliest KD techniques, the teacher's output was used as soft labels during student training. At the same time, some methods align intermediate layers so that the student model learns the teacher model's abilities more uniformly. While these approaches are effective, they often lack appropriate strategies for selecting matching layers in the teacher and student models; the usual solutions are methods such as fixed layer combinations and contrastive learning. Unlike previous methods, we adopt a random attention map projection scheme. In each round, we encourage the student model to randomly select which teacher layers' attention maps to learn from, capturing deeper semantic information. Our method is not limited by the number of student layers: in each epoch, we select K of the N intermediate layers of the teacher model and map their attention maps to the corresponding student layers. Since the choice of layers is random, every attention feature map has a chance to be selected. We add no data augmentation or other techniques that would increase computational cost. We test on the GLUE development set to verify the effectiveness of our approach.
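The per-epoch layer selection just described can be sketched as follows. This is a minimal illustration; the function name and the use of Python's `random.sample` are our assumptions, not the authors' released code.

```python
import random

def sample_teacher_layers(num_teacher_layers, num_student_layers, rng=random):
    """Randomly pick one teacher layer per student layer for the current epoch.

    Sorting the sampled indices keeps the bottom-up ordering, so lower teacher
    layers are distilled into lower student layers.
    """
    chosen = rng.sample(range(num_teacher_layers), num_student_layers)
    return sorted(chosen)

# Example: pair 6 of BERT-base's 12 transformer layers with a 6-layer student.
mapping = sample_teacher_layers(12, 6)
```

Because the sample is redrawn every epoch, each teacher layer is eventually visited, which is the property the paper relies on.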
2 Related Work

2.1 Large Language Model Pre-training

With the development of deep learning, pretrained methods have been widely used in text representation learning. Large language models have achieved remarkable success in several natural language processing fields, such as sentiment classification, machine translation, and text summarization. Previous work can be divided into two categories: feature-based methods and fine-tuning methods.

Feature-based methods mainly focus on learning text representations: (1) context-free word representations such as Word2Vec [7], GLOVE [8] and FastText [9]; (2) sentence-level representations; (3) word representations with contextual semantic information such as ELMO [3]. Among them, ELMO learns contextual associations with a deep bidirectional language model to improve word representations.

BERT and GPT represent fine-tuning-based methods. BERT is a large pretrained multi-layer transformer structure, pretrained on a large corpus and then fine-tuned on downstream tasks for better results. BERT has achieved state-of-the-art performance on various natural language understanding tasks by pretraining via masked language modeling and next-sentence prediction. The BERT model decomposes the input sentence into WordPiece [10] tokens. Specifically, WordPiece divides complex words into sub-words, which helps to improve the representation of the input vocabulary and reduce the storage of embeddings. Through these sub-words, words never seen during training can also be formed, making the model more robust to terms outside the vocabulary. A [CLS] classification token is inserted at the start of the sentence. For sentence pairs,
insert [SEP] at the end of each single sentence to distinguish the two sentences. The input of BERT contains three embeddings computed from the WordPiece result: token embedding, position embedding, and segment embedding. These embeddings are summed, and the final output representation is obtained through the transformer blocks. However, BERT-Base has twelve transformer layers and 110 million parameters, and BERT-Large has twenty-four transformer layers and 340 million parameters. Such large language models usually have millions of parameters, which significantly limits their practical application. In this paper, we aim to compress BERT into a lightweight model while preserving its performance as much as possible. The proposed method can also be applied to other large language models with transformer structures.

2.2 Knowledge Distillation

Lightweight models reduce model size and speed up inference while maintaining performance. Recently, a series of model compression methods for BERT have emerged, such as pruning, quantization, low-rank decomposition, and knowledge distillation. Hinton et al. [6] propose knowledge distillation: by transferring the knowledge of a large model (the teacher) to a small model (the student), the student can imitate the teacher's behavior to obtain higher performance. DistilBERT [11] adds a cosine similarity loss between the teacher and student models. MobileBERT [12] adopts the same number of layers as BERT-Large and adds a bottleneck structure to each layer's Transformer to make it narrower. TinyBERT [13] adopts a two-stage training method: the model computes loss functions at multiple intermediate points to align teacher and student as much as possible, and the corpus is augmented with additional data, yielding significant progress in model performance.
BERT-of-Theseus [14] uses a module substitution method to achieve an effect similar to that of knowledge distillation. In PKD, Sun et al. [15] distill 12-layer BERT into 6-layer BERT using specific mapping rules; this work ignores the effect of choosing a fixed set of layers. In MiniLM [16], attention maps are used as an essential knowledge distillation objective to improve the performance of student models, but MiniLM uses only the last layer's attention map rather than building a layer-to-layer mapping. Our proposed method adds no computational cost to the knowledge distillation process and does not require extensive experiments to find the best mapping scheme. We cannot directly compare with TinyBERT and MiniLM because they use a pre-training knowledge distillation stage, which significantly improves the model's text representation but also requires far more computational resources.
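The soft-label objective from Hinton's original formulation, which the works above build on, can be sketched in a few lines of pure Python; the temperature value and toy logits below are illustrative assumptions, not values from any of the cited papers.

```python
import math

def softmax(logits, temperature=1.0):
    # A higher temperature softens the distribution, exposing the teacher's
    # relative preferences among non-target classes ("dark knowledge").
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened predictions against the
    teacher's softened output distribution (the soft labels)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher_out = [3.0, 1.0, 0.2]  # toy logits for a 3-class task
student_out = [2.5, 1.2, 0.3]
kd_loss = soft_label_kd_loss(teacher_out, student_out)
```

By Gibbs' inequality, the loss is minimized exactly when the student's softened distribution matches the teacher's, which is what drives the student to imitate the teacher.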
3 Method

3.1 Random Projection Knowledge Distillation

In knowledge distillation, adding intermediate-layer feature distillation, in which the intermediate layers of the teacher model impart knowledge to the student, can further improve the student's performance beyond the original knowledge distillation. Existing methods adopt one-to-one or many-to-one intermediate-layer distillation. Discarding some of the teacher's intermediate layers may cause the loss of important information; moreover, searching for the teacher layers best suited for distillation incurs a huge search overhead. At the same time, to maximize compression, we usually set a lower-dimensional student model to significantly reduce the number of parameters, but the intermediate-layer features then have different dimensions, and knowledge distillation cannot be carried out on them directly.
Fig. 1. Random mapping knowledge distillation process of RA-KD.
To solve the complexity of intermediate-feature distillation mapping, we use the random projection knowledge distillation method, which avoids the search overhead of selecting a suitable set of teacher intermediate layers. As shown in Fig. 1, in each training round, several attention maps of teacher layers are randomly selected and ordered for knowledge distillation. For ease of implementation, we randomly choose as many teacher attention maps as there are student layers. Our method can be generalized to knowledge distillation of other attention-based models.

3.2 Attention Map Projection

The attention mechanism, as a key component of modern natural language processing, is also crucial to pretrained language models. For the Transformer-based BERT model, the attention maps capture rich linguistic information during pre-training. MiniLM [16] performs knowledge distillation on the last layer's attention map and achieves good performance. By distilling knowledge from attention maps, the student model can quickly learn the text representation ability of the large model (Fig. 2).
Fig. 2. We randomly select attention maps matching the number of student layers, map the attention information to the student's attention maps, and perform knowledge distillation through the Q-K matrix.
In each layer, the Transformer uses multiple attention heads to aggregate the previous layer's output. For the l-th Transformer layer, the attention map output of attention head a, with a ∈ [1, A_h] and A_h the number of attention heads, is computed as follows:

Q_{l,a} = H^{l−1} W_{l,a}^{Q},    (1)

K_{l,a} = H^{l−1} W_{l,a}^{K},    (2)

A = Q_{l,a} K_{l,a}^{T}.    (3)

The dimensions of the teacher model's Query matrix are A_h × |x| × |d_k|, the Key matrix has the same dimensions as the Query matrix, and the student model's Query and Key matrices are shaped analogously. The dimensions of Q K^T are A_h × |x| × |x|. For the knowledge distillation between the two attention feature maps, we choose the KL divergence loss function, which measures the difference between distributions:

L_QK = (1 / (A_h |x|)) Σ_{a=1}^{A_h} Σ_{j=1}^{|x|} D_KL(A^T_{l,a,j} ‖ A^S_{l,a,j}),    (4)

where |x| represents the sequence length, A_h represents the number of attention heads, and A represents the dot product of the Query and Key matrices. In addition to making the layer mapping more flexible, distilling the dot-product attention map of the Query-Key matrices also avoids the problem that direct knowledge distillation is impossible when the feature dimensions of the teacher's and student's intermediate layers are inconsistent.
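A minimal pure-Python sketch of Eqs. (1)–(4): it builds the dot-product map A = Q K^T for toy matrices and computes the averaged KL term. Since KL divergence is defined over probability distributions, each row of A is softmax-normalized here first; that normalization step and all the toy values are our own assumptions, not the paper's released code.

```python
import math

def attention_logits(query, key):
    """A = Q K^T for one head; query and key are |x| x d_k nested lists."""
    return [[sum(qi * ki for qi, ki in zip(q, k)) for k in key] for q in query]

def row_softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention_kl_loss(teacher_maps, student_maps):
    """KL divergence of Eq. (4), averaged over heads and sequence positions."""
    total, count = 0.0, 0
    for a_t, a_s in zip(teacher_maps, student_maps):   # heads a = 1..A_h
        for row_t, row_s in zip(a_t, a_s):             # positions j = 1..|x|
            p, q = row_softmax(row_t), row_softmax(row_s)
            total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
            count += 1
    return total / count

# Toy teacher head: Q and K are 2 x 2, so A is 2 x 2.
teacher_a = attention_logits([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]])
student_a = [[0.5, 0.5], [0.1, 0.9]]
loss_same = attention_kl_loss([teacher_a], [teacher_a])  # identical maps
loss_diff = attention_kl_loss([teacher_a], [student_a])
```

The loss vanishes when student and teacher attention distributions coincide and grows as they diverge, which is the behavior the distillation objective relies on.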
3.3 Loss Function

The attention feature map loss L_QK is combined with the original knowledge distillation loss L_distil, which transfers knowledge from the output of the teacher model T to the output of the student model S, together with the cross-entropy loss L_CE on the labels. The total loss function for training the student model is:

L = α L_CE + (1 − α) L_distil + β L_QK,    (5)

where α and β are hyper-parameters of our model, tuned to minimize the total loss.
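Read literally, Eq. (5) is a weighted sum of the three losses; a one-line sketch follows. The default α and β are merely values from the search ranges reported later in Sect. 4.2, not recommended settings.

```python
def total_loss(l_ce, l_distil, l_qk, alpha=0.7, beta=50.0):
    """Eq. (5): weighted sum of the task cross-entropy, the vanilla KD loss,
    and the attention-map loss; alpha and beta are hyper-parameters."""
    return alpha * l_ce + (1 - alpha) * l_distil + beta * l_qk

example = total_loss(1.0, 2.0, 0.01, alpha=0.5, beta=10.0)
```

Note that α trades off the hard-label and soft-label terms against each other, while β independently scales the attention-map term.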
4 Experiment

4.1 Datasets

We evaluate RA-KD on six tasks from the General Language Understanding Evaluation (GLUE) benchmark: MNLI and RTE are natural language inference tasks that predict the relationship between sentence pairs; QQP and MRPC are sentence-pair classification tasks; QNLI is a question-answering inference task; and SST-2 is single-sentence classification.

4.2 Implementation Details

We used the original BERT-base (N = 12, d = 768, and h = 12) as the teacher network, which contains 110 M parameters. To ensure that the student network can learn both high-dimensional and low-dimensional information, we adopt the mapping g(m) = 3 ∗ m as a baseline against random mapping. We then defined a 6-layer BERT as the student network for comparison with PKD and BERT-of-Theseus. During each training epoch, six attention layers between 1 and 11 are randomly selected from the 12 layers of BERT. The randomly obtained attention layer indices are sorted from small to large, and a linear transformation maps the original dimensions to a 128-dimensional space to reduce model parameters. After that, the RA-KD loss between BERT12 and BERT6 is calculated and normalized. In our method, the range of α is {0.5, 0.7, 1} and the range of β is {10, 50, 100}. We searched the learning rate over {1e−5, 2e−5, 5e−5} and the batch size over {8, 16, 32}, and fixed the number of epochs to 50 for all experiments. We ran all experiments 5 times and report the average score to verify the reliability of our results. All experiments were run on a single NVIDIA 2080ti GPU using mixed-precision training and the PyTorch framework. Our results show that random attention feature projection not only provides consistently better results than deterministic mapping techniques such as PKD, but also has less computational overhead during training, while avoiding a large number of search experiments to find the optimal mapping.
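The hyper-parameter search described above amounts to enumerating a Cartesian product of the stated ranges; a sketch of that enumeration follows (organizing the search as a full grid is our reading of "searched", not something the paper states explicitly).

```python
import itertools

# Ranges reported in Sect. 4.2.
alphas = [0.5, 0.7, 1.0]
betas = [10, 50, 100]
learning_rates = [1e-5, 2e-5, 5e-5]
batch_sizes = [8, 16, 32]

# Each tuple (alpha, beta, lr, batch_size) is one candidate configuration.
grid = list(itertools.product(alphas, betas, learning_rates, batch_sizes))
```

With 3 values per axis, the grid contains 3^4 = 81 configurations per task, each of which would be run 5 times under the paper's protocol.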
4.3 Baselines

As shown in Table 1, we compare the number of layers, parameters, loss functions, and whether pretraining is required. First, we fine-tune the original BERT directly to obtain a 6-layer BERT model. We then compress the 12-layer BERT to obtain 6-layer BERT-PKD, DistilBERT, and PD-BERT, where both BERT-PKD and PD-BERT do not require pre-selection training. In contrast to the above methods, BERT-of-Theseus uses a random module replacement method.

Table 1. Comparison of different BERT compression approaches. "CE" and "MSE" stand for Cross Entropy and Mean Square Error, and "KL" stands for KL divergence. "KD" indicates the loss is for Knowledge Distillation. The subscripts TASK, MLM and NSP indicate cross entropy calculated on the downstream task, masked language modeling, and next-sentence prediction, respectively.

Method          | #Layer | #Param | Loss Function          | Pretraining
BERT-base       | 12     | 109 M  | CEMLM + CENSP          | –
DistilBERT      | 6      | 66 M   | CEKD + CosKD + CEMLM   | ✓
PD-BERT         | 6      | 66 M   | CEMLM + CEKD + CETASK  | ✓
Finetuning      | 6      | 66 M   | CETASK                 | ✗
Vanilla KD      | 6      | 66 M   | CEKD + CETASK          | ✗
BERT-PKD        | 6      | 66 M   | CEMLM + PTKD + CETASK  | ✗
BERT-of-Theseus | 6      | 66 M   | CETASK                 | ✗
RA-KD           | 6      | 66 M   | CEKD + KLatten         | ✗
4.4 Experimental Results on GLUE

Table 2 shows the performance of the models on GLUE tasks. We consider two types of compressed models: pretraining-based and finetuning-based. All baseline results are taken from the original papers. Overall, our method retains 98.4% and 98.3% of the BERT-base performance on the GLUE development set. On each GLUE task, our model is significantly better than the finetuning baseline, indicating that with the same loss function our proposed method achieves effective knowledge distillation. In addition, our model is significantly better than the vanilla KD and PKD methods on several datasets, although it still falls slightly short on some, such as QNLI and RTE, and overall approaches the performance of BERT-base. It is worth noting that our model achieves good performance both on large datasets with more than 350k samples and on small datasets with fewer than 4k samples, verifying the robustness of our method.
Table 2. DEV performance on the GLUE benchmark. Bold marks the best results.

Method                | MNLI | QQP  | QNLI | SST-2 | RTE  | MRPC | Macro-score
Teacher / pretrained model:
BERT-base             | 83.5 | 89.5 | 91.2 | 91.5  | 68.6 | 89.5 | 85.6
Model compression while pretraining:
DistilBERT [11]       | 79.0 | 84.9 | 85.3 | 90.7  | 59.9 | 87.5 | 81.2
PD-BERT               | 83.0 | 89.1 | 89.0 | 91.1  | 66.7 | 87.2 | 84.4
Model compression while finetuning:
Finetuning            | 80.1 | 87.8 | 86.9 | 89.6  | 62.1 | 86.0 | 82.0
Vanilla KD            | 80.1 | 88.1 | 88.0 | 90.5  | 64.9 | 86.2 | 83.0
BERT-PKD [15]         | 81.3 | 88.4 | 88.4 | 91.3  | 66.5 | 85.7 | 83.6
BERT-of-Theseus [14]  | 82.3 | 88.6 | 87.5 | 91.2  | 68.2 | 87.0 | 84.1
RA-KD (ours)          | 83.3 | 89.5 | 87.4 | 91.3  | 67.4 | 87.0 | 84.3
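The Macro-score column appears to be the unweighted mean of the six task scores; a quick sanity check against two rows of Table 2 (our assumption about how the column is computed):

```python
def macro_score(scores):
    """Unweighted mean over the six GLUE tasks."""
    return sum(scores) / len(scores)

# Rows of Table 2, ordered (MNLI, QQP, QNLI, SST-2, RTE, MRPC).
bert_base = [83.5, 89.5, 91.2, 91.5, 68.6, 89.5]
ra_kd = [83.3, 89.5, 87.4, 91.3, 67.4, 87.0]
```

Averaging these lists and rounding to one decimal reproduces the Macro-score entries 85.6 and 84.3 for BERT-base and RA-KD.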
4.5 Impact of Random Layer Selection

To evaluate the impact of random attention-layer selection on model distillation, we conducted experiments on three small datasets; the specific results are shown in Table 3.

Table 3. Average scores (5 runs) of BERT6 models on the three smallest GLUE datasets.

Method | MRPC | RTE  | SST-2
PKD    | 85.7 | 66.5 | 91.1
PKD+RA | 86.1 | 65.4 | 90.1
RA-KD  | 87.0 | 67.4 | 91.3
From Table 3, it can be seen that random selection brings clear gains on the small GLUE datasets, and combining RA with PKD improves over the original PKD on MRPC. We also compare different loss functions over the values in the self-attention module; the experimental results for the three tasks are shown in Table 4. Different loss functions yield different performance: specifically, our approach brings about a 1.0% improvement on the MRPC benchmark. In addition, our method does not introduce additional parameters. We transfer the same knowledge and adopt the same uniform teacher-to-student layer mapping strategy as PKD to perform layer-to-layer distillation. As the table shows, the KL-divergence-based loss function achieves better results. At the same time, this model resolves the difficulty of mapping between the teacher model and the student model.
Table 4. Comparison of different loss functions: KL-divergence over the value relation (KL), mean squared error (MSE) over values, and mean absolute error (MAE). The finetuning results are an average of 5 runs for each task.

Architecture    | Model     | MRPC | RTE  | SST-2
M = 6; dh = 512 | Value-KL  | 87.0 | 67.4 | 91.3
                | Value-MSE | 86.1 | 66.2 | 90.1
M = 5; dh = 512 | Value-KL  | 83.5 | 65.3 | 90.2
                | Value-MSE | 81.6 | 64.1 | 89.9
M = 4; dh = 512 | Value-KL  | 80.7 | 63.7 | 89.3
                | Value-MSE | 79.4 | 63.1 | 89.2
M = 3; dh = 512 | Value-KL  | 77.1 | 61.5 | 88.3
                | Value-MSE | 76.5 | 61.3 | 88.3
5 Conclusion and Future Work

We propose a simple and efficient approach called random attention feature map projection, which is superior to the vanilla knowledge distillation method in improving student performance while reducing training time. RA-KD randomly selects from the teacher model as many attention feature maps as the student model has intermediate layers, sorts them, and distills their knowledge into the student's attention blocks. This randomness gives every attention feature map of the teacher model a chance to be selected, making our method suitable for knowledge distillation of large-scale pretrained models with attention modules and improving generalization on more natural language understanding (NLU) tasks. In the future, we will further study the portability of attention feature maps to explore more possibilities for knowledge distillation approaches based on them.

Acknowledgments. This work was supported in part by the University Innovation Team Project of Jinan (2019GXRC015) and the Shandong Provincial Natural Science Foundation, China (ZR2021MF036).
References 1. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of Deep Bidirectional Transformers for Language Understanding (2019). http://arxiv.org/abs/1810.04805, https:// doi.org/10.48550/arXiv.1810.04805 2. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog. 1, 24 (2019)
3. Peters, M.E., et al.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/N18-1202
4. Lin, Z., Liu, J.Z., Yang, Z., Hua, N., Roth, D.: Pruning redundant mappings in transformer models via spectral-normalized identity prior (2020). http://arxiv.org/abs/2010.01791, https://doi.org/10.48550/arXiv.2010.01791
5. Fan, A., et al.: Training with Quantization Noise for Extreme Model Compression
6. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
7. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013). http://arxiv.org/abs/1301.3781
8. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/D14-1162
9. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 427–431. Association for Computational Linguistics (2017)
10. Wu, Y., et al.: Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)
11. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019). https://doi.org/10.48550/arXiv.1910.01108
12. Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., Zhou, D.: MobileBERT: a compact task-agnostic BERT for resource-limited devices. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2158–2170. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.195
13. Jiao, X., et al.: TinyBERT: distilling BERT for natural language understanding. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4163–4174. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.372
14. Xu, C., Zhou, W., Ge, T., Wei, F., Zhou, M.: BERT-of-Theseus: compressing BERT by progressive module replacing (2020). http://arxiv.org/abs/2002.02925
15. Sun, S., Cheng, Y., Gan, Z., Liu, J.: Patient knowledge distillation for BERT model compression (2019). http://arxiv.org/abs/1908.09355
16. Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., Zhou, M.: MiniLM: deep self-attention distillation for task-agnostic compression of pre-trained transformers (2020). https://doi.org/10.48550/arXiv.2002.10957
Nucleus Beam Search for Machine Translation Decoding Zheng Chen(B)
, Ruiwen Tao, and Yifan Wang
School of Information and Software Engineering, University of Electronic Science and Technology of China, No. 4, Section 2, North Jianshe Road, Chengdu, Sichuan, People’s Republic of China [email protected]
Abstract. Beam search is the most widely-used decoding algorithm for machine translation. Its success, however, may be attributed to the inadvertent implementation of the Uniform Information Density (UID) hypothesis. The UID hypothesis suggests that humans prefer sentences with evenly distributed information across the linguistic signal, while adhering to grammatical constraints. This paper presents Nucleus Beam Search, a novel machine translation decoding algorithm aimed at achieving the UID objective. By combining nucleus filtering with beam search, our approach effectively expands the search space without violating the UID hypothesis, enabling the generation of lengthier and more comprehensive translations. Experimental results reveal that Nucleus Beam Search outperforms traditional decoding algorithms in terms of BLEU, METEOR, ROUGE-L and CIDEr scores. Nevertheless, our findings also suggest that information density is not the sole determinant of translation quality, with beamwidth playing a significant role as well. Keywords: Machine Translation · Decoding · Beam Search
1 Introduction

In recent years, the rapid development of deep neural networks has significantly improved the state of the art in machine translation [17]. With extensive supervised training on massive parallel corpora, deep neural networks can accurately generate the probability distribution of tokens given the input text and contextual conditions. The decoding algorithm is then responsible for selecting one or several tokens from this probability distribution to form the output sequence. This approach has proven to be highly effective in improving the quality of machine translations. Beam search is the most commonly used decoding algorithm for machine translation systems, as reported by several studies [7, 14]. Unlike exhaustive search, it can reduce space and time consumption by eliminating some lower-quality nodes and retaining some higher-quality ones at each expansion step. The beamwidth, or the width of the beam search, plays a significant role in determining the quality of the translation output. Generally, a smaller beamwidth explores a narrower space and produces lower-quality
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 597–608, 2023. https://doi.org/10.1007/978-981-99-4752-2_49
598
Z. Chen et al.
output, whereas a larger beamwidth explores a wider space and results in higher-quality output. However, numerous experimental results have shown that an excessively large beamwidth may not be the optimal choice for most machine translation models [3, 12, 23, 28]. In fact, larger beamwidths often lead to shorter results, which can seriously impact the quality of translation. Therefore, it is vital to carefully choose an appropriate beamwidth that balances the trade-off between exploration space and output quality. However, such a choice is primarily based on experience and lacks theoretical support. The most direct cause of degraded translation quality is the shortening of the output sequence. To address this issue, the most straightforward solution is to correct the sequence length [18]. In standard beam search, the score of a candidate sequence e is calculated using the following equation:

s(e) = Σ_{i=1}^{m} log P(e_i | e_{1:i−1}),  (1)
where ei represents the i-th token of the sequence e. It is evident that as the sequence length increases, the score will decrease monotonically. As a result, shorter sentences tend to receive higher scores. In their translation system, Jean et al. [11] were the first to introduce length normalization [1] by dividing the score by the length of the sequence, as follows:
s′(e) = s(e)/m,  (2)
where m represents the decoded sequence length. As a result, it is possible to obtain a decent score even for a lengthy sequence. This straightforward and effective length normalization strategy has been widely employed in most machine translation systems [22, 27]. A similar idea has also been introduced in Google's NMT system [26], which includes a more complex incentive term. Another proposal, the "word reward" approach suggested by [8], follows a similar idea, except that it utilizes an additive reward. In recent years, there has been new thinking among researchers regarding this issue. According to Meister et al. [15], the degradation of beam search can be attributed to an unexpected distribution of information density rather than the shortening of the generated sequence length. In Cognitive Science, the uniform information density (UID) hypothesis states that, subject to the constraints of the grammar, humans prefer utterances that distribute information (in the sense of information theory) equally across the linguistic signal, e.g., a sentence [10]. Meister et al. [15] argue that the reason why beam search may output satisfactory results is the unintentional implementation of the UID hypothesis. Furthermore, they developed a UID Decoding approach, which utilizes a series of regularizers to explicitly encourage evenly distributed surprisal in generated text. Their decoder has shown promising improvements compared to length-normalized beam search. However, the complexity and time-consuming nature of their approach have limited its widespread adoption. This paper introduces a streamlined and effective technique called Nucleus Beam Search, designed to directly achieve the UID objective. The concept behind the nucleus method is inspired by nucleus sampling, a widely employed decoding algorithm for open-ended text generation.
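To see why the raw score of Eq. (1) favors short outputs and how the length normalization of Eq. (2) corrects this, consider a small numeric sketch. The per-token probabilities below are made up for illustration and do not come from the paper:

```python
import math

def raw_score(token_logprobs):
    # s(e) = sum of per-token log-probabilities (Eq. 1);
    # every added token makes the score more negative.
    return sum(token_logprobs)

def length_normalized_score(token_logprobs):
    # s'(e) = s(e) / m (Eq. 2): average log-probability per token.
    return raw_score(token_logprobs) / len(token_logprobs)

# Hypothetical candidates: a short one with p = 0.5 per token,
# and a longer, individually more confident one with p = 0.6 per token.
short = [math.log(0.5)] * 4
long = [math.log(0.6)] * 8

assert raw_score(short) > raw_score(long)  # raw score prefers the short candidate
assert length_normalized_score(long) > length_normalized_score(short)  # normalization flips the ranking
```

The assertions show the length bias directly: the longer candidate has a better per-token probability, yet loses under Eq. (1) simply because it has more tokens.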
In Nucleus Beam Search, node expansion considers only the candidates within the smallest subset of tokens that have an accumulated probability
of no less than p, referred to as the nucleus. This strategy retains only a dynamic number of candidates, which helps to avoid the inclusion of tokens with small probabilities (and high information content) in the search space. Through this simple modification, our approach ensures that even when employing a large beamwidth to expand the search space, high-information tokens are not incorporated into the decoding results. Consequently, Nucleus Beam Search produces output sequences with more uniform information density. Our experiments reveal that Nucleus Beam Search surpasses both length-normalized beam search and UID Decoding, excelling not only in BLEU but also in a range of other common metrics.
2 Approach

In the vanilla beam search, a heuristic breadth-first approach is employed to construct the search tree. At each time step, all successor tokens of the current states (prefixes) are generated. However, only a predetermined number of the best states (prefixes concatenated with the generated tokens), denoted by the beamwidth k, are retained. These selected states are then expanded in the next time step. In our Nucleus Beam Search, we introduce an additional parameter, p, to work in conjunction with the beamwidth k in constraining the search process. Given a distribution P(x | x_{1:i−1}), we define its top-p vocabulary V^p ⊂ V as the smallest set that satisfies the following condition:

Σ_{x ∈ V^p} P(x | x_{1:i−1}) ≥ p.  (3)
At each time step, the decision to retain a token is contingent upon two factors: first, it must fall within the top k states, and second, it must belong to the top-p vocabulary subset. This small collection of tokens, with an accumulated probability of at least p, comprises a significant portion of the probability mass, and is consequently referred to as the nucleus. The size of this set is dynamically adjusted based on the probability distribution at each time step. The pseudocode of Nucleus Beam Search can be found in Algorithm 1. In most machine translation decoding scenarios, the translation model is highly deterministic when generating the next output token, which often results in an extremely uneven probability distribution. Under such circumstances, the nucleus may contain only a few tokens. By filtering out low-probability (high information content) tokens that lie outside the nucleus, the search can avoid pursuing unexpected routes, which typically lead to shortcuts and, in turn, produce shorter and incomplete translations. When the translation model is uncertain about generating the next translated token, it yields a more uniform probability distribution. In these cases, only the top k states are chosen, as opposed to all tokens within the nucleus, to proceed with the search process. In practice, the search procedure is typically implemented in a batched manner. Hence, a predetermined parameter k, representing the maximum batch size, is crucial for practical implementation.
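The selection rule described above can be sketched in a few lines of Python. This is an illustrative, unbatched sketch rather than the authors' implementation (Algorithm 1): the token distributions, tuple prefixes, and length-normalized ranking here are simplifying assumptions of our own:

```python
import math

def top_p_vocab(probs, p):
    """Smallest set of tokens whose cumulative probability is >= p (Eq. 3)."""
    nucleus, cum = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        nucleus.append(tok)
        cum += prob
        if cum >= p:
            break
    return nucleus

def nucleus_beam_step(beam, next_probs, k, p):
    """One expansion step of Nucleus Beam Search.

    beam: list of (prefix_tuple, cumulative_log_prob) states.
    next_probs: maps each prefix to its next-token distribution (a dict).
    Returns at most k states, each extended by one in-nucleus token.
    """
    candidates = []
    for prefix, logp in beam:
        probs = next_probs[prefix]
        for tok in top_p_vocab(probs, p):  # nucleus filtering first
            candidates.append((prefix + (tok,), logp + math.log(probs[tok])))
    # then keep the k best states by length-normalized score (Eq. 2)
    candidates.sort(key=lambda c: c[1] / len(c[0]), reverse=True)
    return candidates[:k]
```

With the Fig. 1-style distribution {"Last": 0.6, "The": 0.3, "A": 0.1}, p = 0.7, and k = 3, only two tokens fall inside the nucleus, so the step returns two states rather than three, mirroring the behavior described for the initial time step.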
Fig. 1. An example of Nucleus Beam Search with p = 0.7 and k = 3.
In Fig. 1, an illustration of Nucleus Beam Search with p = 0.7 and k = 3 is presented. The histogram illustrates the probability distribution of tokens. Tokens in light grey have been filtered out because they fall outside the nucleus; tokens in dark grey have been filtered out because they fall beyond the top-k range. Beam search only considers the teal-colored tokens. A strikethrough on a token denotes that the cumulative probability of the token, in conjunction with its prefix, is outside the top-k range, and it is thus filtered out by beam search.
At the initial time step, only two tokens reside within the nucleus. Consequently, the search proceeds solely with these two paths, instead of the expected three (k = 3).
At the second time step, four tokens are found within the nucleus of the prefix "Last". However, only three of them are considered by beam search, as k = 3. In combination with the single token from the other path, the joint probabilities of these tokens, along with their respective prefixes, are computed (with length normalization), and the three best ones are chosen for the next step. This iteration continues until all paths reach their respective endpoints. Table 1 presents a comparison of decoding results obtained using Nucleus Beam Search and Length-Normalized Beam Search. The Length-Normalized Beam Search generated a shorter output, as it took a shortcut at the token "costume", resulting in a potentially more fluent but less comprehensive translation. Conversely, Nucleus Beam Search excluded the token "costume", since it was situated outside the nucleus, thereby yielding a lengthier and more complete translation.

Table 1. An example of decoding results using Nucleus Beam Search compared with length-normalized beam search.

Source sentence:       ...
Beam Search:           Last week, the costume drama "Beauty's Private Kitchen" was temporarily…
Nucleus Beam Search:   Last week, the ancient costume drama "Beauty's Private Dishes" was temporarily…
3 Experiments

3.1 Translation Model and Test Dataset

We utilize the opus-mt-cs-en and opus-mt-en-cs models, provided by Helsinki-NLP [24], as the machine translation models for our decoding experiments. To ensure reproducibility, we refrained from performing any additional fine-tuning on these models. To evaluate translation performance, we employed the newstest2018-enzh corpus, as provided by WMT18.

3.2 Evaluation Metrics

In this study, we utilize four evaluation metrics, namely BLEU, METEOR, ROUGE-L, and CIDEr, to assess the performance of the decoding algorithms. BLEU, or Bilingual Evaluation Understudy, is a widely used metric that evaluates the similarity between machine-generated translations and human reference translations. It measures the precision of n-grams in the candidate translation compared to those in the reference.
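For intuition, the core of BLEU (modified n-gram precision) can be computed with a few lines of standard-library Python. This is a simplified sketch that omits BLEU's brevity penalty and the geometric mean over n-gram orders; the example sentences are our own:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision: each reference n-gram can be matched
    at most as many times as it occurs in the reference (clipping)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
p1 = ngram_precision(cand, ref, 1)  # 5 of the 6 candidate unigrams appear in the reference
p2 = ngram_precision(cand, ref, 2)  # 3 of the 5 candidate bigrams appear in the reference
```

The clipping step (`min(count, ref[gram])`) is what prevents a degenerate candidate like "the the the the" from scoring a perfect unigram precision.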
METEOR, or Metric for Evaluation of Translation with Explicit ORdering, is another evaluation metric for machine-generated translations. It calculates the harmonic mean of unigram precision and recall while also considering synonymy and phrase matches, which can result in better correlation with human judgment.
ROUGE-L, or Recall-Oriented Understudy for Gisting Evaluation with Longest Common Subsequence, is an evaluation metric used for text generation tasks. It measures the quality of a generated text by comparing it to one or more human-generated references. ROUGE-L specifically uses the Longest Common Subsequence to calculate the overlap between the generated text and the reference.
CIDEr, or Consensus-based Image Description Evaluation, is an evaluation metric designed for image captioning tasks, but it can also be applied to machine translation. It measures the similarity between machine-generated text and a set of human-generated references by considering n-gram occurrences in both, while also taking into account the consensus among the reference texts.

Table 2. The BLEU scores of machine translation decoded using Nucleus Beam Search.

(a) English-to-Chinese

       P=0.1  P=0.2  P=0.3  P=0.4  P=0.5  P=0.6  P=0.7  P=0.8  P=0.9  P=1.0
K=1      -      -      -      -      -      -      -      -      -    .2280
K=2      -      -      -      -      -      -      -      -      -    .2358
K=3      -      -      -      -      -      -      -      -      -    .2372
K=5      -      -      -      -      -      -      -      -      -    .2371
K=10   .2289  .2319  .2335  .2362  .2369  .2375  .2367  .2352  .2344  .2353
K=15   .2286  .2324  .2332  .2354  .2381  .2379  .2366  .2354  .2347  .2345
K=20   .2287  .2327  .2328  .2347  .2380  .2383  .2375  .2361  .2347  .2345
K=30   .2284  .2318  .2336  .2352  .2388  .2393  .2375  .2347  .2338  .2339
K=40   .2284  .2312  .2323  .2350  .2387  .2392  .2374  .2328  .2334  .2333
K=50   .2283  .2310  .2321  .2358  .2387  .2388  .2372  .2301  .2302  .2309

(b) Chinese-to-English

       P=0.1  P=0.2  P=0.3  P=0.4  P=0.5  P=0.6  P=0.7  P=0.8  P=0.9  P=1.0
K=1      -      -      -      -      -      -      -      -      -    .1483
K=2      -      -      -      -      -      -      -      -      -    .1523
K=3      -      -      -      -      -      -      -      -      -    .1557
K=5      -      -      -      -      -      -      -      -      -    .1592
K=10   .1484  .1505  .1522  .1535  .1552  .1568  .1571  .1589  .1605  .1591
K=15   .1484  .1507  .1522  .1545  .1561  .1573  .1574  .1590  .1607  .1590
K=20   .1485  .1507  .1523  .1548  .1558  .1565  .1587  .1592  .1611  .1584
K=30   .1485  .1508  .1524  .1550  .1563  .1569  .1580  .1599  .1616  .1580
K=40   .1485  .1508  .1527  .1548  .1562  .1570  .1582  .1603  .1618  .1580
K=50   .1485  .1506  .1528  .1550  .1560  .1575  .1584  .1603  .1617  .1576
3.3 Hyperparameter Search

We first conducted a hyperparameter search on the newstest2018 development set. Table 2 displays the BLEU scores of the decoded results for different k and p hyperparameters. It is worth noting that when p = 1.0, Nucleus Beam Search degrades to length-normalized beam search, as we have implemented length normalization by default. In the case of English-Chinese translation, length-normalized beam search attains its maximum BLEU score when the beamwidth is 3; beyond this point, the BLEU score exhibits a gradual decline as the beamwidth enlarges. Nucleus filtering allows higher BLEU scores to be achieved. For each row, the maximum value is situated in the columns with p = 0.5 or 0.6. The global maximum value is found at k = 30 and p = 0.6, which is 0.89 percent higher than the optimal length-normalized beam search with a beamwidth of 3. Similar outcomes are discernible in the Chinese-English translation results. Nucleus Beam Search with k = 40 and p = 0.9 achieves the highest BLEU score, which is 1.63 percent greater than the best length-normalized beam search with a beamwidth of 5. In conclusion, we adopt k = 30 and p = 0.6 for English-Chinese translation decoding, and k = 40 and p = 0.9 for Chinese-English translation decoding in subsequent evaluations.

3.4 Results
Table 3. The results of translations decoded using Greedy Search, Beam Search, UID Decoding, and Nucleus Beam Search.

(a) English-to-Chinese

                       BLEU    METEOR  ROUGE-L  CIDEr
Greedy Search          .2278   .2664   .4917    1.849
Beam Search            .2373   .2704   .4997    1.982
UID Decoding           .2390   .2716   .5008    1.984
Nucleus Beam Search    .2395   .2711   .5010    1.995

(b) Chinese-to-English

                       BLEU    METEOR  ROUGE-L  CIDEr
Greedy Search          .1546   .2685   .3859    1.341
Beam Search            .1611   .2726   .3916    1.412
UID Decoding           .1635   .2740   .3957    1.419
Nucleus Beam Search    .1665   .2751   .3958    1.432
We report the performance comparison between Greedy Search, Beam Search, UID Decoding, and Nucleus Beam Search in Table 3. For UID Decoding, we also
performed a hyperparameter search within k ≤ 50 and recorded their best results for a fair comparison. For the English-Chinese translation task, Nucleus Beam Search achieved the best performance on almost all evaluation metrics, with a BLEU score of 0.2395, a ROUGE-L score of 0.5010, and a CIDEr score of 1.995. Although UID Decoding had a slightly higher METEOR score (0.2716 versus 0.2711), Nucleus Beam Search demonstrated superior overall performance.
Fig. 2. The standard deviation of surprisals per decoded sequence as a function of the nucleus size p: (a) Chinese-English; (b) English-Chinese.
In the Chinese-English translation task, Nucleus Beam Search again outperformed the other methods, achieving the highest scores on all evaluation metrics: BLEU (0.1665), METEOR (0.2751), ROUGE-L (0.3958), and CIDEr (1.432). The results suggest that Nucleus Beam Search consistently provides better translation quality across both language pairs compared to the other decoding techniques. These findings indicate that Nucleus Beam Search is an effective decoding method for machine translation tasks, demonstrating its robustness and superiority over the other tested methods in terms of translation quality, as evidenced by the performance scores across various evaluation metrics.
Nucleus Beam Search for Machine Translation Decoding
605
3.5 Analysis

In this section, we investigate the effectiveness of nucleus filtering in achieving better uniform information density, which serves as the driving force behind Nucleus Beam Search. We illustrate the standard deviation of surprisals as a function of the nucleus size p in Fig. 2. In information theory, a token's surprisal is formally defined as the negative of its log-probability. Hence, the average standard deviation of surprisals per sentence can be regarded as a practical implementation of UID [15, 16]. As illustrated in the figure, for a fixed k, the nucleus size p significantly influences the standard deviation of surprisals: a smaller nucleus size leads to a reduced standard deviation of surprisals, and vice versa. This finding supports our hypothesis that nucleus filtering is an effective approach for achieving more uniform information density. However, the results also indicate that information density is not the only factor affecting translation quality. While a smaller p may lead to lower information density, it simultaneously restricts the search space, thereby complicating the process of discovering high-quality translations. This observation further clarifies why Nucleus Beam Search achieves optimal performance with a larger beamwidth.
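The UID measure used in this analysis is easy to reproduce. The sketch below uses hypothetical per-token probabilities (not the paper's data) to show that a sequence with evenly distributed information has a lower standard deviation of surprisals:

```python
import math
from statistics import pstdev

def surprisals(token_probs):
    """Surprisal of each token: the negative of its log-probability."""
    return [-math.log(p) for p in token_probs]

def uid_score(token_probs):
    """Std. deviation of surprisals per sequence: lower means the
    information is more uniformly distributed (closer to the UID ideal)."""
    return pstdev(surprisals(token_probs))

even = [0.4, 0.4, 0.4, 0.4]     # constant surprisal: perfectly uniform
spiky = [0.9, 0.05, 0.9, 0.05]  # alternating low- and high-surprisal tokens
assert uid_score(even) < uid_score(spiky)
```

Note that the two sequences above have comparable total probability mass; the score reacts to how unevenly the information is spread, not to its total amount.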
4 Related Works

4.1 Uniform Information Density

Uniform Information Density (UID) is a linguistic hypothesis that has garnered significant attention in recent years. Jaeger et al. [10] were among the first to propose the UID hypothesis, which is grounded in the principles of information theory. Smith et al. [21] proposed an entropy-based approach to model UID in statistical machine translation systems, and demonstrated that incorporating UID into their models led to improvements in translation quality. In the realm of text generation, Tily and Piantadosi [25] found that UID-driven text simplification algorithms produced more coherent and fluent text. Furthermore, Feng and Hirst [5] applied the UID hypothesis to the task of automatic summarization, showing that their UID-based summarization system generated summaries that were more informative and easier to read. In summary, the UID hypothesis has inspired a wide range of research in psycholinguistics and NLP, and incorporating UID principles into NLP models and algorithms has shown potential for enhancing language processing and understanding.

4.2 Text Generation Decoding

There are two major categories of text generation tasks: constrained text generation and open-ended text generation. Both require a language model to produce a probability distribution and then a decoding algorithm to select tokens from that distribution. Sampling-based decoding algorithms, like pure sampling, top-k sampling [4], temperature sampling [2, 6], and nucleus sampling [9], are typically used in open-ended
606
Z. Chen et al.
text generation tasks, while in constrained text generation tasks, search-based decoding algorithms, like greedy search and beam search, are typically used. Pure sampling and greedy search are the original and most primitive ones. Researchers soon found that sampling/searching within the top-k candidate tokens could significantly improve the quality of generated texts; thus, top-k sampling and beam search were introduced. Later, researchers found that modifying the probability distribution could also improve decoding, introducing temperature sampling, UID Decoding, and other label smoothing technologies [13, 19]. Recently, researchers found that sampling from a dynamic subset of tokens, i.e., the nucleus, could achieve even better quality in open-ended text generation tasks. Since nucleus sampling is already the go-to decoding algorithm for open-ended text generation, researchers are naturally curious about how such a strategy performs in constrained text generation. Shaham and Levy [20] introduced Dynamic Beam Search, which utilizes the nucleus to determine the beamwidth. However, their work has not yielded promising results, which we believe is due to the absence of a predetermined maximum beamwidth.
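To make two of the strategies in this taxonomy concrete, here is a minimal, hypothetical sketch of temperature scaling and top-k filtering. It operates on raw logits/probabilities and is not tied to any particular model; the numbers in the usage are invented:

```python
import math

def apply_temperature(logits, t):
    """Temperature sampling: divide logits by t, then renormalize.
    t < 1 sharpens the distribution; t > 1 flattens it."""
    scaled = [l / t for l in logits]
    z = sum(math.exp(l) for l in scaled)
    return [math.exp(l) / z for l in scaled]

def top_k_filter(probs, k):
    """Top-k filtering: keep only the k most probable tokens, renormalized."""
    kept = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

sharp = apply_temperature([2.0, 1.0, 0.1], 0.5)  # peakier than the original softmax
flat = apply_temperature([2.0, 1.0, 0.1], 2.0)   # flatter than the original softmax
filtered = top_k_filter([0.5, 0.3, 0.2], 2)      # drops the least likely token
```

The contrast with the nucleus approach is that top-k keeps a fixed number of tokens regardless of the distribution's shape, whereas the nucleus keeps a dynamic number determined by cumulative probability mass.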
5 Conclusion

This paper introduced Nucleus Beam Search, a novel machine translation decoding algorithm inspired by the uniform information density hypothesis and nucleus sampling. By filtering the candidate tokens through a dynamic nucleus of the probability distribution, our algorithm outperforms traditional decoding algorithms without sacrificing simplicity.

Acknowledgements. This work is supported by the Sichuan Science and Technology Program (2022ZHCG0007) and the Natural Science Foundation of Sichuan Province (2022NSFSC0503).
References
1. Boulanger-Lewandowski, N., Bengio, Y., Vincent, P.: Audio chord recognition with recurrent neural networks. In: Proceedings of the 14th International Society for Music Information Retrieval Conference, ISMIR 2013, pp. 335–340 (2013)
2. Caccia, M., Caccia, L., Fedus, W., Larochelle, H., Pineau, J., Charlin, L.: Language GANs falling short. In: Proceedings of the 8th International Conference on Learning Representations (2020). http://arxiv.org/abs/1811.02549
3. Cohen, E., Beck, J.C.: Empirical analysis of beam search performance degradation in neural sequence models. In: 36th International Conference on Machine Learning, ICML 2019, pp. 2294–2312 (2019)
4. Fan, A., Lewis, M., Dauphin, Y.: Hierarchical neural story generation. In: ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers), vol. 1, pp. 889–898 (2018)
5. Feng, V.W., Hirst, G.: Text-level discourse parsing with rich linguistic features. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 60–68 (2012)
6. Ficler, J., Goldberg, Y.: Controlling linguistic style aspects in neural language generation. In: Proceedings of the Workshop on Stylistic Variation, pp. 94–104 (2017)
7. Graves, A.: Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711 (2012). http://arxiv.org/abs/1211.3711
8. He, W., He, Z., Wu, H., Wang, H.: Improved neural machine translation with SMT features. In: 30th AAAI Conference on Artificial Intelligence, AAAI 2016, no. 10, pp. 151–157 (2016)
9. Holtzman, A., Buys, J., Du, L., Forbes, M., Choi, Y.: The curious case of neural text degeneration. In: The International Conference on Learning Representations (ICLR) (2020)
10. Jaeger, T., Levy, R.: Speakers optimize information density through syntactic reduction. Adv. Neural Inf. Process. Syst. 19, 849–856 (2007)
11. Jean, S., Firat, O., Cho, K., Memisevic, R., Bengio, Y.: Montreal neural machine translation systems for WMT'15. In: 10th Workshop on Statistical Machine Translation, WMT 2015 at EMNLP 2015 - Proceedings, pp. 134–140 (2015)
12. Koehn, P., Knowles, R.: Six challenges for neural machine translation. In: First Workshop on Neural Machine Translation, pp. 28–39 (2017)
13. Lukasik, M., et al.: Semantic label smoothing for sequence to sequence problems. In: EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp. 4992–4998 (2020)
14. Luong, M.T., Sutskever, I., Le, Q.V., Vinyals, O., Zaremba, W.: Addressing the rare word problem in neural machine translation. In: ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Proceedings of the Conference, vol. 1, pp. 11–19 (2015)
15. Meister, C., Cotterell, R., Vieira, T.: If beam search is the answer, what was the question? In: EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp. 2173–2185 (2020)
16. Meister, C., Pimentel, T., Haller, P., Jäger, L., Cotterell, R., Levy, R.: Revisiting the uniform information density hypothesis. In: EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp. 963–980 (2021)
17. Mohamed, S.A., Elsayed, A.A., Hassan, Y.F., Abdou, M.A.: Neural machine translation: past, present, and future. Neural Comput. Appl. 33(23), 15919–15931 (2021). https://doi.org/10.1007/s00521-021-06268-0
18. Murray, K., Chiang, D.: Correcting length bias in neural machine translation. In: WMT 2018 - 3rd Conference on Machine Translation, Proceedings of the Conference, vol. 1, pp. 212–223 (2018)
19. Peters, B., Martins, A.F.T.: Smoothing and shrinking the sparse Seq2Seq search space. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2642–2654 (2021)
20. Shaham, U., Levy, O.: What do you get when you cross beam search with nucleus sampling? In: Proceedings of the Third Workshop on Insights from Negative Results in NLP, Dublin, Ireland, pp. 38–45. Association for Computational Linguistics (2022). https://doi.org/10.18653/v1/2022.insights-1.5
21. Smith, N.J., Levy, R.: The effect of word predictability on reading time is logarithmic. Cognition 128(3), 302–319 (2013)
22. Stahlberg, F.: Neural machine translation: a review. J. Artif. Intell. Res. 69, 343–418 (2020)
23. Stahlberg, F., Byrne, B.: On NMT search errors and model errors: cat got your tongue? In: EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pp. 3356–3362 (2019)
24. Tiedemann, J., Thottingal, S.: OPUS-MT - building open translation services for the world. In: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal (2020)
25. Tily, H., Piantadosi, S.: Refer efficiently: use less informative expressions for more predictable meanings. In: Proceedings of the Workshop on the Production of Referring Expressions: Bridging the Gap Between Computational and Empirical Approaches to Reference (2009)
26. Wu, Y., et al.: Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016). http://arxiv.org/abs/1609.08144
27. Yang, S., Wang, Y., Chu, X.: A survey of deep learning techniques for neural machine translation. arXiv preprint arXiv:2002.07526 (2020). http://arxiv.org/abs/2002.07526
28. Yang, Y., Huang, L., Ma, M.: Breaking the beam search curse: a study of (re)scoring methods and stopping criteria for neural machine translation. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, pp. 3054–3059 (2018)
A Survey on Multimodal Named Entity Recognition Shenyi Qian1 , Wenduo Jin1 , Yonggang Chen2 , Jiangtao Ma1(B) , Yaqiong Qiao3,4 , and Jinyu Lu5 1 College of Computer and Communication Engineering, Zhengzhou University of Light
Industry, Zhengzhou 450002, China {qsy,majiangtao}@zzuli.edu.cn, [email protected] 2 The State Information Center, Beijing 100045, China [email protected] 3 School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450046, China 4 Henan Key Laboratory of Cyberspace Situation Awareness, Zhengzhou 450001, China 5 Henan Province Platform Economy Development Guidance Center, Zhengzhou, China
Abstract. Multimodal Named Entity Recognition (MNER) is a task of identifying entities with specific semantic types from natural language text and using image information to improve the accuracy and robustness of entity recognition. Named entity recognition is an important application of multimodal learning and a fundamental problem in the field of natural language processing. This article reviews existing multimodal named entity recognition techniques for social media. We first introduce commonly used datasets for MNER. Then, we classify existing MNER techniques into five categories: pre-trained models, single-modal representation, multimodal representation, multimodal fusion and main models. Next, we investigate the most representative methods applied in MNER. Finally, we present the challenges faced by MNER and discuss the future directions of this field. Keywords: Multimodal · Named Entity Recognition · Natural Language Processing
1 Introduction

Text in social media tweets is often short, informal, and ambiguous, while the accompanying images can provide rich semantic information and contextual clues for the text. Therefore, recent research has attempted to improve the accuracy of NER models by utilizing image information in tweets, resulting in a new research direction called multimodal named entity recognition (MNER) [1]. Although MNER research has been developing for several years, existing reviews mainly focus on the single-modality setting of pure text; to the best of our knowledge, there have been few surveys of the MNER field. This gap has prompted us to conduct a survey reporting on the current status of MNER research [2].
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 609–622, 2023. https://doi.org/10.1007/978-981-99-4752-2_50
610
S. Qian et al.
Therefore, we conducted an in-depth review of MNER to inspire and guide researchers and practitioners in this field. Specifically, we integrated MNER datasets in a table format to provide useful resources for future MNER research. Then, we reviewed key technologies in existing MNER by proposing a new classification method that covers pre-trained models, single-modal representation, multimodal representation, multimodal fusion, and main models. Next, we introduced the most representative methods applied in MNER, such as attention mechanisms, graph neural networks, contrastive learning, and gating mechanisms. Finally, we presented the challenges that MNER faces and provided an outlook on the future directions of this field. To summarize, we make the following contributions in this survey:
– We provide the first comprehensive review of MNER to inspire and guide researchers and practitioners in this field.
– Based on a survey of representative papers on MNER, we propose a new classification approach covering pre-trained models, single-modal representations, multimodal representations, multimodal fusion, and main models.
– We introduce the most representative MNER methods, such as attention mechanisms, graph neural networks, contrastive learning, and gating mechanisms, to facilitate researchers' understanding of applicable methods in MNER.
2 Background

2.1 What is MNER?

Multimodal named entity recognition (MNER) is an important branch of the named entity recognition (NER) task. It assumes that when textual information is inadequate, image data can help identify ambiguous named entities. For instance, the named entity "Rob" in the phrase "Handsome Rob after a fish dinner" could refer to either a person or an animal, making it challenging to determine its type. However, with the help of an accompanying image (as illustrated in Fig. 1), we can quickly classify it as MISC [3].
Fig. 1. Example of multimodal named entity recognition [3]. Text: Handsome [Rob MISC] after a fish dinner.
2.2 Why Study MNER?

The named entity recognition (NER) task has been studied for many years, and various mature models have been developed. Although these methods can achieve state-of-the-art performance on pure text, the results are not satisfactory when using the above
A Survey on Multimodal Named Entity Recognition
methods on the social media platform Twitter, as the context of tweets is not rich enough. Therefore, some research suggests using external resources (such as images) to assist named entity recognition in social media text. Studies have shown that models with external resources indeed perform better than previous work.

2.3 MNER Resources: Datasets

As a fundamental NLP tool, MNER has received increasing attention in recent years. Below, we summarize the datasets used for MNER.
• Twitter-2015: This dataset [4] includes four types of entities (Person, Location, Organization, and Miscellaneous) collected from 8,257 tweets.
• Twitter-2017: This dataset [5] includes four types of named entities; each data instance consists of a sentence-image pair, and the names in the sentences were manually annotated by three expert annotators.
• Twitter-2017c: Due to the absence of images in some samples of the Twitter-2017 dataset, Zhang et al. obtained a clean version by removing these samples, namely the Twitter-2017c dataset [6].
• SnapCaptions: This dataset [1] includes image-caption pairs (SNAP) generated by 10k users, with named entities in the captions manually annotated by expert annotators (entity types: person, location, organization, and miscellaneous).
• Snap Research: This dataset [5] was collected from Twitter and Snapchat, and includes entities of type person, location, organization, and miscellaneous.
• TRC: In this dataset [7], the authors annotated tweets with four types of text-image relations: text represents the image; text does not represent the image; image adds meaning to the tweet; image does not add meaning to the tweet.
• Twitter100k: This dataset consists of 100,000 image-text pairs randomly scraped from Twitter [8]. Each pair includes an image and text that appear together in a tweet.
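The Twitter-style datasets above are commonly distributed in a CoNLL-like BIO format, with one token-tag pair per line and an image identifier attached to each sentence. The exact layout varies between releases, so the reader below is only a sketch under that assumed format (the `IMGID:` prefix and tab separator are assumptions modeled on common releases):

```python
# Sketch: reading a BIO-style MNER file where each token line is
# "token\ttag", sentences are blank-line separated, and a leading
# "IMGID:<id>" line links the sentence to its image.

def read_mner_file(lines):
    """Yield (image_id, tokens, tags) triples from BIO-formatted lines."""
    image_id, tokens, tags = None, [], []
    for line in lines:
        line = line.strip()
        if line.startswith("IMGID:"):
            image_id = line[len("IMGID:"):]
        elif line:
            token, tag = line.split("\t")
            tokens.append(token)
            tags.append(tag)
        elif tokens:  # blank line ends a sentence
            yield image_id, tokens, tags
            image_id, tokens, tags = None, [], []
    if tokens:
        yield image_id, tokens, tags

sample = [
    "IMGID:42",
    "Handsome\tO",
    "Rob\tB-MISC",
    "after\tO",
    "a\tO",
    "fish\tO",
    "dinner\tO",
    "",
]
records = list(read_mner_file(sample))
print(records[0][0], records[0][2])  # 42 ['O', 'B-MISC', 'O', 'O', 'O', 'O']
```

In practice the image identifier is used to look up the corresponding image file, whose features are extracted separately by an image encoder.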
3 MNER

3.1 Pre-trained Models

In the MNER task, depending on the modality used, pre-trained models can be divided into two types: single-modal pre-trained models and multi-modal pre-trained models.

Single-Modal Pre-trained Models

Text Pre-training Models. Among traditional word embedding models, the bag-of-words model [9] is a simplified representation method used in natural language processing and information retrieval. The Word2Vec [10] pre-trained model can train word vectors from unlabeled corpora. The GloVe [11] model is an unsupervised word embedding method. Fasttext [12] is a word vector and text classification tool open-sourced by Facebook. Among contextual word embedding models, the ELMO [13] model can train different word vectors for the same word through different sentences. CoVe [14] is first pre-trained on the encoder-decoder machine translation task. XLM [15] is a cross-lingual pre-training
S. Qian et al.
model proposed by Facebook in 2019. GPT [16] can achieve impressive results on very complex NLP tasks. Among bidirectional language models, BERT [17] uses a new masked language model (MLM), instead of a traditional unidirectional language model or a shallow concatenation of two unidirectional language models, for pre-training, in order to generate deep bidirectional semantic representations. RoBERTa [18] is an enhanced, more finely tuned version of the BERT model. ALBERT [19] is a variant of BERT that greatly reduces the number of model parameters while maintaining performance, making it more practical (Table 1).

Image Pre-training Models. LeNet-5 [20] is the first convolutional neural network (CNN) successfully applied to digit recognition problems. AlexNet [21] is a deep CNN consisting of 5 convolutional layers and 3 fully connected layers. The VGG [22] model is a deep CNN composed of multiple repeated small convolution kernels. The Inception [23] model is a deep CNN composed of parallel convolutional branches, aimed at extracting and fusing multi-level features at different scales to improve image classification performance. ResNet [24] is a deep CNN that uses residual blocks to solve the vanishing gradient problem in deep networks and achieves outstanding performance in image recognition tasks. ResNeXt [25] is a deep learning model that combines ResNet and Inception. DenseNet [26] is a deep learning model whose core idea is the skip connection of Highway Networks. Dual-Path-Net (Dual Path Networks, DPN) [27] is a simple, efficient, and modular network for image classification with a novel internal connection path topology. Vision Transformer (ViT) [28] is a model proposed by Google in 2020 that directly applies the Transformer to image classification, and many subsequent works build on ViT.
ViT brings the Transformer model to the field of computer vision, enabling tasks such as image classification and object detection by dividing an image into fixed-size patches and converting them into sequence data (Table 2).

Multi-modal Pre-training Models. Multimodal pre-training models can be classified into two categories based on the way information is fused: single-stream and two-stream models. Among single-stream models, VisualBERT [29] is regarded as the first image-text pre-training model. It uses Faster R-CNN [30] to extract visual features, concatenates the visual features and text embeddings, and then feeds the concatenated features into a transformer initialized by BERT [31]. Subsequently, in 2020, Unicoder-VL [32] was proposed by Li et al. as a pre-training model for the image-text domain. It continues to use a stacked transformer structure and is trained on a large number of image-text pairs. Among two-stream models, the ViLBERT [33] model was the first to extend the BERT structure to a multimodal two-stream model. Similar to ViLBERT, the LXMERT [34] model also applies two Transformers to image and text and uses a third Transformer for fusion. Radford et al. [35] proposed the CLIP (Contrastive Language-Image Pre-Training) model, which uses 400 million image-text pairs from the web to train the model by treating text as image labels. Li et al. [36] proposed the ALBEF model, which introduces an intermediate image-text contrastive loss (Table 3).
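To make the contrastive idea behind CLIP and ALBEF concrete, the sketch below (plain Python, not any author's code; batch size and dimensions are toy values) computes a symmetric image-text InfoNCE objective, in which matched pairs in a batch act as positives and all other pairings as negatives:

```python
import math

# Illustrative sketch of a CLIP/ALBEF-style symmetric image-text
# contrastive loss over a small batch of embedding pairs.

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = [l2_normalize(v) for v in img_emb]
    txt = [l2_normalize(v) for v in txt_emb]
    n = len(img)
    # cosine-similarity logits, scaled by temperature
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]

    def xent(rows):  # cross-entropy with the diagonal as the target class
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    cols = [[logits[j][i] for j in range(n)] for i in range(n)]  # text->image
    return 0.5 * (xent(logits) + xent(cols))

# Perfectly aligned pairs give a much lower loss than mismatched ones.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

Real implementations compute the same quantity with batched matrix multiplications on GPU and a learnable temperature.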
Table 1. Text pre-training models

| Category | Model | Year | Advantages | Disadvantages |
|---|---|---|---|---|
| Traditional word embedding models | Bag-of-words | 2012 | Simple and fast, easy to understand | Unable to handle context and semantic information |
| | Word2Vec | 2013 | Able to capture semantic relationships | Ineffective for rare words and very similar words |
| | GloVe | 2014 | Better at handling global context | Unable to handle semantic information in individual sentences |
| | Fasttext | 2016 | Better modeling of rare and unseen words | Ineffective for very similar words and long texts |
| Contextual word embedding models | ELMO | 2017 | Captures contextual information | Computationally expensive |
| | CoVe | 2018 | Effective for transfer learning | Requires pre-trained word embeddings |
| | GPT | 2018 | State-of-the-art for language generation | May suffer from context fragmentation |
| | XLM | 2019 | Effective for cross-lingual learning | May not perform as well on monolingual tasks |
| Bidirectional language models | BERT | 2019 | Achieves state-of-the-art performance on many NLP tasks | Large size and high computational cost |
| | RoBERTa | 2019 | Improves upon BERT by using a larger training corpus | Similar issues with large size and high computational cost as BERT |
| | ALBERT | 2020 | Reduces model size and computational cost | May require longer training time and tuning for specific tasks |
3.2 Single-Modal Representation

Text Representation. In MNER research, Tian et al. [37] extended the word embedding layer of BERT to process each token embedding as a multimodal representation of the corresponding entity. Sun et al. [38] encoded a tweet text into a vector by concatenating word and character embeddings using a BiLSTM. Wu et al. [39] represented each word in a sentence by combining character embeddings with word embeddings. Suman et al.
Table 2. Image pre-training models

| Category | Model | Year | Advantages | Disadvantages |
|---|---|---|---|---|
| Classical convolutional neural networks | LeNet-5 | 1998 | Lightweight, good for small images | Limited depth and complexity |
| | AlexNet | 2017 | Pioneering work in deep learning, easy to implement | May overfit on small datasets, requires GPUs to train efficiently |
| Deep convolutional neural networks | VGG | 2014 | Strong performance, easy to understand and implement | Computationally expensive, requires a lot of memory |
| | Inception | 2016 | High accuracy, efficient use of computational resources | Complex architecture, difficult to interpret |
| | ResNet | 2016 | Very deep, high accuracy, easy to train | Computationally expensive, may suffer from degradation problem |
| | ResNeXt | 2017 | High accuracy, can be trained on limited data | Computationally expensive, requires careful tuning |
| | DenseNet | 2017 | Strong performance, efficient use of parameters | Computationally expensive |
| | Dual-Path-Net | 2017 | High accuracy, can be trained on limited data | Complex architecture, computationally expensive |
| Transformer models | ViT | 2020 | Strong performance on image classification tasks, can be applied to other domains | Limited interpretability, may require large amounts of training data |
[40] used pre-trained word embeddings in vector form to obtain word-level feature representations. Zheng et al. [41] extracted character representations using a BiLSTM with the character sequence of each token as input. Shahzad et al. [42] used contextualized embeddings (ELMO) to obtain word representations from two stacked BiLSTM layers, extracted character representations using a CNN, and extracted sentence representations using a sentence-level encoder.

Image Representation. Image representation can be divided into three types: visual global representation, visual regional representation, and visual object representation. Visual global representation refers to a D-dimensional static vector
Table 3. Multimodal pre-training models

| Category | Model | Year | Advantages | Disadvantages |
|---|---|---|---|---|
| Single-stream models | VisualBERT | 2019 | Achieves high accuracy in visual question answering tasks; flexible in incorporating additional modalities | High computational cost; requires large amounts of training data |
| | Unicoder-VL | 2020 | Good at handling large-scale data; supports multiple languages | High computational cost; limited interpretability |
| Two-stream models | ViLBERT | 2019 | Effective in multimodal tasks; can capture fine-grained relationships between different modalities | Requires large amounts of training data; limited interpretability |
| | LXMERT | 2019 | Achieves high accuracy in visual question answering tasks; can handle multiple input formats | High computational cost; requires large amounts of training data |
| | CLIP | 2021 | Achieves state-of-the-art results in image classification tasks; can handle a wide range of visual and textual data | High computational cost; requires large amounts of training data |
| | ALBEF | 2021 | Good at handling large-scale data; can capture fine-grained relationships between different modalities | Requires large amounts of training data; limited interpretability |
extracted from the high-level network of an image encoder to represent an image. Moon et al. [1] encoded the entire image as a global feature vector for visual global representation and designed effective attention mechanisms to extract visual information related to the text. Visual regional representation refers to extracting a set of D-dimensional vectors from the high-level network of an image encoder to represent an image, with each D-dimensional vector representing an equally sized region of the image. Lu et al. [5] utilized a Bi-LSTM+CRF model framework and a pre-trained ResNet model to extract such a set of region vectors. Visual object representation extracts a set of D-dimensional vectors to represent
objects in an image. Wu et al. [43] utilized Faster-RCNN to extract visual object features from images, which were then fed into an adaptive collaborative attention network.
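The three granularities can be illustrated with a toy sketch (plain Python, invented dimensions): regional vectors are the spatial cells of a CNN feature map, and a global vector is obtained by average-pooling them; object-level vectors would instead come from a detector such as Faster R-CNN and are omitted here:

```python
# Sketch of global vs. regional visual representations, starting from a
# CNN feature map of shape (D, H, W). Toy numbers are used; in practice
# a ResNet conv5 output has roughly D=2048, H=W=7.

def regional_vectors(fmap):
    """Return H*W region vectors, each of dimension D."""
    D, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[d][h][w] for d in range(D)]
            for h in range(H) for w in range(W)]

def global_vector(fmap):
    """Average-pool the regions into one D-dimensional global vector."""
    regions = regional_vectors(fmap)
    D = len(regions[0])
    return [sum(r[d] for r in regions) / len(regions) for d in range(D)]

# D=2 channels over a 2x2 spatial grid
fmap = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0
        [[5.0, 6.0], [7.0, 8.0]]]   # channel 1
print(len(regional_vectors(fmap)))  # 4
print(global_vector(fmap))          # [2.5, 6.5]
```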
3.3 Multi-modal Representation

Extensive research has demonstrated that combining textual and visual representations into multimodal representations can significantly enhance the performance of MNER [39]. The simplest approach to combining visual and textual representations is concatenation. However, the differences in vector space between vision and language may result in semantic drift. To address this issue, Collell et al. [44] proposed learning a mapping function from text to vision, whose output is then used in the multimodal representation. Despite its effectiveness, the mapping function may become a bottleneck when the text and image are unrelated. To overcome this limitation, Wu et al. [39] suggested using object labels as embeddings to bridge the gap between vision and language.

3.4 Multi-modal Fusion

Multimodal named entity recognition utilizes both image and text information to identify entities, so how to better fuse the image and text modalities is critical to MNER tasks. Yu et al. [45] introduced a multimodal interaction module to integrate the two modalities and an entity span detection module to filter out visual noise. Chen et al. [46] used external knowledge bases and attention-guided visual layers to obtain the final fused multimodal representation. Sun et al. [7] utilized an improved BERT encoder to obtain fused representations and introduced a text-image relation classification sub-task to determine the usefulness of image features. Xu et al. [3] utilized a cross-modal fusion module to fuse the representations of the two modalities and input them into a conditional random field layer to obtain the final prediction results.

3.5 Main Models

Multimodal named entity recognition (MNER) models can handle both text and image data simultaneously. Zhang et al. [4] proposed an adaptive co-attention network. Lu et al. [5] proposed an MNER model based on visual attention mechanisms. Yu et al.
[45] proposed a multimodal named entity model based on the Transformer architecture. Wu et al. [39] proposed a neural network that combines object-level image information and character-level text information to predict entities. Zheng et al. [41] proposed an adversarial gated bilinear attention neural network. Sun et al. [7] introduced a text-image relationship propagation method into a multimodal BERT model. Lu et al. [47] introduced a flat multimodal interaction transformer for MNER. Chen et al. [46] presented a novel neural network architecture that leverages image attributes and knowledge to enhance the performance of named entity recognition. Zhang et al. [6] introduced a novel approach called the Unified Multi-Modal Graph Fusion method for graph encoding. Zhao et al. [48] introduced a graph convolutional network model that incorporates relation-enhanced features to improve the accuracy of MNER tasks. Chen et al. [49] introduced a novel approach named
Hierarchical Visual Prefix Fusion Network for enhancing entity and relation extraction through visual information. Wang et al. [50] introduced a cleaner architecture for an end-to-end MNER framework based on transformers. Jia et al. [51] introduced a novel end-to-end framework named MNER-QG. Xu et al. [3] introduced a universal framework called Matching and Alignment (Table 4).

Table 4. MNER models on the Twitter-2015 and Twitter-2017 datasets (P/R/F1 in %).

| Year | Model | Twitter-2015 P | R | F1 | Twitter-2017 P | R | F1 | Ref. |
|---|---|---|---|---|---|---|---|---|
| 2018 | ACOA | 72.75 | 68.74 | 70.69 | 84.16 | 80.24 | 82.15 | [4] |
| 2018 | VG | 73.96 | 67.90 | 70.80 | 83.41 | 80.38 | 81.87 | [5] |
| 2020 | OCSGA | 74.71 | 71.21 | 72.92 | – | – | – | [39] |
| 2020 | UMT | 71.67 | 75.23 | 73.41 | 85.28 | 85.34 | 85.31 | [45] |
| 2021 | Object-AGBAN | 74.13 | 72.39 | 73.25 | – | – | – | [41] |
| 2021 | IAIK | 74.78 | 71.82 | 73.27 | – | – | – | [46] |
| 2021 | RpBERT | 71.15 | 74.30 | 72.69 | 82.85 | 84.38 | 83.61 | [7] |
| 2021 | UMGF | 74.49 | 75.21 | 74.85 | 86.54 | 84.50 | 85.51 | [6] |
| 2022 | MAF | 71.86 | 75.10 | 73.42 | 86.13 | 86.38 | 86.25 | [3] |
| 2022 | R-GCN | 73.95 | 76.18 | 75.00 | 86.72 | 87.53 | 87.11 | [48] |
| 2022 | HVPNeT | 73.87 | 76.82 | 75.32 | 85.84 | 87.93 | 86.87 | [49] |
| 2022 | FMIT | 75.11 | 77.43 | 76.25 | 87.51 | 86.08 | 86.79 | [47] |
| 2022 | CAT-MNER | 78.75 | 78.69 | 78.72 | 90.27 | 90.67 | 90.47 | [50] |
| 2022 | MNER-QG | 77.76 | 72.31 | 74.94 | 88.57 | 85.96 | 87.25 | [51] |
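The P, R, and F1 columns in Table 4 are tied together by the usual harmonic mean, which gives a quick way to sanity-check a row (here the CAT-MNER Twitter-2015 scores):

```python
# F1 is the harmonic mean of precision and recall.

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(78.75, 78.69), 2))  # 78.72, matching the reported F1
```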
4 Methods Applied in MNER

4.1 Attention Mechanisms for MNER

The attention mechanism has found extensive application in various deep learning tasks. For multimodal named entity recognition, Zheng et al. [41] introduced a bilinear attention network into the multimodal NER task to capture the correlation between visual objects and textual entities. Cadene et al. [52] employed the multimodal attention mechanism of visual question answering to identify the most pertinent region in an image based on the associated text. Tian et al. [37] employed an attention mechanism, modeled after that used in VQA, to facilitate greater semantic interaction across modalities. Wu et al. [39] introduced a dense co-attention mechanism that facilitates comprehensive interactions between visual objects and textual entities, resulting in improved performance for MNER.
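A common core of these approaches is scaled dot-product attention from text queries to image-region keys and values; the following toy sketch (invented embeddings, plain Python) shows a single word attending over two image regions:

```python
import math

# Toy sketch of cross-modal (text-to-image) scaled dot-product attention:
# a word embedding queries image-region features, and the attention
# weights show which region the word attends to.

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_modal_attention(query, keys, values):
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(d_v)]
    return weights, context

word = [1.0, 0.0]                   # query: embedding of one word
regions = [[1.0, 0.0], [0.0, 1.0]]  # keys: two image regions
feats = [[10.0, 0.0], [0.0, 10.0]]  # values: region features
weights, context = cross_modal_attention(word, regions, feats)
print(weights[0] > weights[1])  # True: the word attends to region 0
```

Bilinear and co-attention variants replace the plain dot product with learned interaction matrices, but the query-key-value pattern is the same.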
4.2 Graph Neural Networks for MNER

Recently, graph neural networks (GNNs), including gated graph neural networks, graph convolutional networks, and graph attention networks, have been proven effective in various tasks. In the MNER task, Zhang et al. [6] introduced a unified multimodal graph fusion technique that employs a single multimodal graph to represent input images and sentences, effectively capturing the various semantic connections between multimodal semantic units such as words and visual objects. Zhao et al. [48] introduced a relation-enhanced graph convolutional network, which builds inter-modality and intra-modality relation graphs by gathering image data pertinent to the current text and image from the dataset, and utilizes multimodal interaction and fusion to predict NER label sequences.

4.3 Contrastive Learning for MNER

Contrastive learning is an emerging research area that has achieved remarkable progress in diverse computer vision (CV) and natural language processing (NLP) applications. In recent years, with the development of multimodal pre-training models, many researchers have started to incorporate multimodal contrastive learning into their methods [51, 52]. However, many methods use standard cross-modal contrastive learning based on random samples, or only perform text-based data augmentation, lacking optimization from the perspective of visual objects. This is a challenge faced by MNER. To address it, Zhang et al. [53] presented a method for reducing bias in named entity recognition by leveraging bias-reducing contrastive learning.

4.4 Gating Mechanisms for MNER

The gating mechanism works in multimodal named entity recognition (MNER) much as it does in other fields. Arshad et al. [54] introduced a gated multimodal fusion module that dynamically selects information from both text and visual features.
Yu et al. [45] introduced a visual gate to dynamically control visual features, and combined HBiLSTM-CRF with visual context to obtain lexical-aware visual representations through a visual attention mechanism and a visual gate. Chen et al. [46] combined text, attribute, knowledge, and image information using attention and gating fusion modules. Chen et al. [49] designed a dynamic gate to generate image-related paths and used aggregated hierarchical multi-scale visual features as a visual prefix to enhance MNER.
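The shared idea behind these visual gates can be sketched as a sigmoid gate, computed from the concatenated text and visual vectors, that scales how much visual information enters the fused representation (weights and dimensions below are invented for illustration):

```python
import math

# Minimal sketch of a visual gate: a sigmoid gate computed from the text
# and visual vectors decides, per dimension, how much of the visual
# feature flows into the fused output.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(text_vec, vis_vec, W):
    """fused = text + g * visual, with g = sigmoid(W @ [text; visual])."""
    concat = text_vec + vis_vec
    gate = [sigmoid(sum(w * x for w, x in zip(row, concat))) for row in W]
    return [t + g * v for t, g, v in zip(text_vec, gate, vis_vec)]

text_vec = [0.5, -0.2]
vis_vec = [1.0, 2.0]
# One gate row per output dimension; 4 inputs (2 text + 2 visual)
W = [[-10.0, 0.0, 0.0, 0.0],   # strongly closes the gate for dim 0
     [10.0, 0.0, 0.0, 0.0]]    # strongly opens the gate for dim 1
fused = gated_fusion(text_vec, vis_vec, W)
print(fused[0] < 0.6)  # True: gate near 0, visual mostly blocked
```

In trained models, W is learned, so the gate can suppress visual noise for text-only entities and pass visual evidence through for grounded ones.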
5 Challenges and Future Directions

5.1 Challenges

• Data sparsity. The acquisition cost of data from different modalities is often high, and the data from different modalities may not be aligned, leading to data sparsity and affecting the performance of multimodal named entity recognition.
• Cross-modal feature fusion. Different modalities express features in different ways. Therefore, fusing features from different modalities to improve the performance of multimodal named entity recognition is a challenge.
• Modality imbalance. The quantity and quality of data are not consistent across modalities, resulting in modality imbalance. How to address modality imbalance and improve the performance of MNER is a challenge.
• Multilingual support. MNER requires support for multiple languages, and the expression of named entities varies across languages. How to achieve multilingual support is a challenge for this field.

5.2 Future Directions

• Multimodal feature fusion: Future research in MNER needs to further explore the fusion of multimodal features. Improving MNER performance by integrating features from different modalities is an important direction.
• Multilingual support: In the future, multilingual support will be necessary for MNER. Exploring cross-lingual MNER and improving models' versatility and applicability is an important direction.
• Model adaptability: Future multimodal named entity recognition requires model adaptability: models should handle data from different domains, modalities, and languages, and perform transfer learning and adaptation across different tasks.
• Wide applicability: In the future, multimodal named entity recognition needs to be widely applied in fields such as healthcare, smart homes, agriculture, and finance, to provide better services for each domain.
6 Conclusion

This survey reviews recent research on multimodal named entity recognition (MNER) to help new researchers establish a comprehensive understanding of the field. This paper introduces the background, current research status, challenges, and future research directions of MNER. First, we consolidate existing MNER datasets and present them in tabular form. Second, we provide a preliminary introduction to the definition of the MNER task, the reasons for studying MNER, and evaluation metrics. Third, we categorize MNER research according to a new classification scheme. Then, we further examine representative methods applied to MNER in recent years. Finally, we present the challenges and future directions for MNER. We hope this study will serve as a good reference for research on MNER methods.

Acknowledgement. This research was funded by the National Natural Science Foundation of China, grant number 62272163; the Songshan Laboratory Pre-research Project, grant number YYJC012022023; the Henan Province Science Foundation, grant numbers 232300420150 and 222300420230; the Open Foundation of the Henan Key Laboratory of Cyberspace Situation Awareness, grant number HNTS2022005; the Henan Province Science and Technology Department Foundation, grant number 222102210027; the Science and Technology Plan Projects of the State Administration for Market Regulation, grant number 2021MK067; and the Undergraduate Universities Smart Teaching Special Research Project of Henan Province, grant number 489–29.
References

1. Moon, S., Neves, L., Carvalho, V.: Multimodal named entity recognition for short social media posts. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), vol. 1, pp. 852–860 (2018) 2. Li, J., Sun, A., Han, J., Li, C.: A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 34, 50–70 (2022) 3. Xu, B., Huang, S., Sha, C., Wang, H.: MAF: a general matching and alignment framework for multimodal named entity recognition. In: WSDM 2022 – Proceedings of the 15th ACM International Conference on Web Search and Data Mining, pp. 1215–1223 (2022) 4. Zhang, Q., Fu, J., Liu, X., Huang, X.: Adaptive co-attention network for named entity recognition in tweets. In: 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pp. 5674–5681 (2018) 5. Lu, D., Neves, L., Carvalho, V., Zhang, N., Ji, H.: Visual attention model for name tagging in multimodal social media. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), ACL 2018, vol. 1, pp. 1990–1999 (2018) 6. Zhang, D., Wei, S., Li, S., Wu, H., Zhu, Q., Zhou, G.: Multi-modal graph fusion for named entity recognition with targeted visual guidance. In: 35th AAAI Conference on Artificial Intelligence, AAAI 2021, vol. 16, pp. 14347–14355 (2021) 7. Sun, L., Wang, J., Zhang, K., Su, Y., Weng, F.: RpBERT: a text-image relation propagation-based BERT model for multimodal NER. In: 35th AAAI Conference on Artificial Intelligence, AAAI 2021, vol. 15, pp. 13860–13868 (2021) 8. Hu, Y., Zheng, L., Yang, Y., Huang, Y.: Twitter100k: a real-world dataset for weakly supervised cross-media retrieval. IEEE Trans. Multimed. 20, 927–938 (2018) 9. Gálvez-López, D., Tardós, J.D.: Bags of binary words for fast place recognition in image sequences. IEEE Trans. Robot. 28, 1188–1197 (2012) 10.
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: 1st International Conference on Learning Representations, ICLR 2013 Workshop Track Proceedings, pp. 1–12 (2013) 11. Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, pp. 1532–1543. Association for Computational Linguistics (2014) 12. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017 - Proceedings of Conference, pp. 427–431 (2017) 13. Peters, M.E., et al.: Deep contextualized word representations. In: NAACL HLT 2018 - 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, pp. 2227–2237 (2018) 14. McCann, B., Bradbury, J., Xiong, C., Socher, R.: Learned in translation: Contextualized word vectors. In: Advances in Neural Information Processing Systems, pp. 6295–6306 (2017) 15. Conneau, A., Lample, G.: Cross-lingual language model pretraining. In: Advances in Neural Information Processing Systems, pp. 1–11 (2019) 16. Radford, A., Narasimhan, K.: Improving Language Understanding by Generative PreTraining. Presented at the (2018) 17. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Human Language Technologies – Proceedings of the Conference, vol. 1, pp. 4171–4186 (2019)
18. Liu, Y., et al.: RoBERTa: a robustly optimized Bert pretraining approach. ArXiv.abs/1907.1 (2019) 19. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. ArXiv.abs/1909.1 (2019) 20. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE. 86, 2278–2323 (1998) 21. Misra, I., van der Maaten, L.: Self-supervised learning of pretext-invariant representations. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6706–6716 (2020) 22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR. abs/1409.1 (2014) 23. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826 (2016) 24. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 25. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated Residual Transformations for Deep Neural Networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5987–5995 (2017) 26. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely Connected Convolutional Networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269 (2017) 27. Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. In: Advances in Neural Information Processing Systems. pp. 4468–4476 (2017) 28. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. ArXiv. abs/2010.1 (2020) 29. 
Li, L.H., Yatskar, M., Yin, D., Hsieh, C.-J., Chang, K.-W.: Visualbert: a simple and performant baseline for vision and language. arXiv Prepr. arXiv1908.03557. (2019) 30. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2017) 31. Chen, F.-L., et al.: Vlp: A survey on vision-language pre-training. Mach. Intell. Res. 20, 38–56 (2023) 32. Li, G., Duan, N., Fang, Y., Gong, M., Jiang, D.: Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In: AAAI 2020 - 34th AAAI Conference on Artificial Intelligence, pp. 11336–11344 (2020) 33. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, pp. 1–11 (2019) 34. Tan, H., Bansal, M.: LXMERT: learning cross-modality encoder representations from transformers. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5100–5111. Association for Computational Linguistics (2019) 35. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (2021) 36. Li, J., Selvaraju, R.R., Gotmare, A.D., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: vision and language representation learning with momentum distillation. In: Advances in Neural Information Processing Systems, pp. 9694–9705 (2021) 37. Tian, Y., Sun, X., Yu, H., Li, Y., Fu, K.: Hierarchical self-adaptation network for multimodal named entity recognition in social media. Neurocomputing 439, 12–21 (2021)
38. Sun, L., et al.: RIVA: a pre-trained tweet multimodal model based on text-image relation for multimodal NER. In: COLING 2020 - 28th International Conference on Computational Linguistics, Proceedings of the Conference, pp. 1852–1862 (2020) 39. Wu, Z., Zheng, C., Cai, Y., Chen, J., Leung, H.F., Li, Q.: Multimodal representation with embedded visual guiding objects for named entity recognition in social media posts. In: MM 2020 – Proceedings of the 28th ACM International Conference on Multimedia, pp. 1038–1046 (2020) 40. Suman, C., Reddy, S.M., Saha, S., Bhattacharyya, P.: Why pay more? A simple and efficient named entity recognition system for tweets. Expert Syst. Appl. 167, 114101 (2021) 41. Zheng, C., Wu, Z., Wang, T., Cai, Y., Li, Q.: Object-aware multimodal named entity recognition in social media posts with adversarial learning. IEEE Trans. Multimed. 23, 2520–2532 (2021) 42. Shahzad, M., Amin, A., Esteves, D., Ngomo, A.C.N.: InferNER: an attentive model leveraging the sentence-level information for Named Entity Recognition in Microblogs. Proceedings of the International Florida Artificial Intelligence Research Society Conference, FLAIRS, vol. 34 (2021) 43. Wu, H., Cheng, S., Wang, J., Li, S., Chi, L.: Multimodal aspect extraction with region-aware alignment network. In: Zhu, X., Zhang, M., Hong, Y., He, R. (eds.) NLPCC 2020. LNCS (LNAI), vol. 12430, pp. 145–156. Springer, Cham (2020). https://doi.org/10.1007/978-3030-60450-9_12 44. Collell, G., Zhang, T., Moens, M.: Imagined visual representations as multimodal embeddings. Proc. AAAI Conf. Artif. Intell. 31, 4378–4384 (2017) 45. Yu, J., Jiang, J., Yang, L., Xia, R.: Improving multimodal named entity recognition via entity span detection with unified multimodal transformer. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 3342–3352 (2020) 46. Chen, D., Li, Z., Gu, B., Chen, Z.: Multimodal named entity recognition with image attributes and image knowledge. 
In: Jensen, C.S., et al. (eds.) DASFAA 2021. LNCS, vol. 12682, pp. 186–201. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73197-7_12 47. Lu, J., Zhang, D., Zhang, J., Zhang, P.: Flat Multi-modal interaction transformer for named entity recognition. In: Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, pp. 2055–2064. International Committee on Computational Linguistics (2022) 48. Zhao, F., Li, C., Wu, Z., Xing, S., Dai, X.: Learning from Different text-image Pairs: A Relation-enhanced Graph Convolutional Network for Multimodal NER. Association for Computing Machinery (2022) 49. Chen, X., et al.: Good visual guidance makes a better extractor: hierarchical visual prefix for multimodal entity and relation extraction. In: Findings of the Association for Computational Linguistics. NAACL 2022 - Find., pp. 1607–1618 (2022) 50. Wang, X., et al.: CAT-MNER: multimodal named entity recognition with knowledge-refined cross-modal attention. In: 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6 (2022) 51. Jia, M., et al.: MNER-QG: an end-to-end MRC framework for Multimodal named entity recognition with query grounding. arXiv Prepr. arXiv2211.14739 (2022) 52. Cadene, R., Ben-younes, H., Cord, M., Thome, N.: MUREL: multimodal relational reasoning for visual question answering Sorbonne Universit. In: Conservatoire National des Arts et M. Cvpr 2019, pp. 1989--1998 (2019) 53. Zhang, X., Yuan, J., Li, L., Liu, J.: Reducing the Bias of Visual Objects in Multimodal Named Entity Recognition. Association for Computing Machinery (2023) 54. Arshad, O., Gallo, I., Nawaz, S., Calefati, A.: Aiding intra-text representations with visual context for multimodal named entity recognition. In: Proceedings International Conference Document Analysis, Recognition, ICDAR, pp. 337–342 (2019)
HSRG-WSD: A Novel Unsupervised Chinese Word Sense Disambiguation Method Based on Heterogeneous Sememe-Relation Graph

Meng Lyu and Shasha Mo(B)

School of Cyber Science and Technology, Beihang University, Beijing 100191, China
{lyumeng,moshasha}@buaa.edu.cn
Abstract. Word sense disambiguation (WSD) plays a crucial role in natural language processing. Unsupervised WSD approaches based on knowledge bases like HowNet offer improved applicability compared to supervised learning, but existing research tends to oversimplify disambiguation and neglect hierarchical sememe relationships, which leads to an inability to accurately differentiate between senses composed of the same combination of sememes. This paper presents an unsupervised Chinese word sense disambiguation method based on a Heterogeneous Sememe-Relation Graph (HSRG) that leverages sememe hierarchical relationships to uncover the intrinsic connections between sememes. Additionally, we incorporate cross-word sememe relationships and semantic dependency relationships, establishing both indirect and direct contextual associations while mitigating the influence of syntactic structures. This integration enhances the representation of ambiguous words and improves disambiguation outcomes. Furthermore, our study innovatively combines the principles of graph contrastive learning with node selection algorithms, employing heterogeneous graph neural networks to effectively represent graph models and facilitate unsupervised selection of accurate sense vertices in the HSRG. The proposed model is evaluated on the HowNet-based Chinese WSD dataset, demonstrating superior performance over competitive baselines.

Keywords: Word Sense Disambiguation · HowNet · Sememe · Heterogeneous Graph · Unsupervised Learning
1 Introduction

The phenomenon of polysemy is pervasive. However, due to the more complex and diverse grammatical and lexical features of the Chinese language, the proportion of ambiguous words in Chinese texts is higher than in English texts. Consequently, Chinese WSD is a particularly complicated challenge in natural language processing [18]. Traditional Chinese WSD methods usually utilize corpora or dictionaries to distinguish word senses. Supervised learning requires manually annotated corpora and is constrained by the quantity and quality of the annotated data [1, 5, 19]. Unsupervised corpus-based learning, in turn, can only classify ambiguous words into sense categories without

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 623–633, 2023.
https://doi.org/10.1007/978-981-99-4752-2_51
624
M. Lyu and S. Mo
accurately capturing their precise meanings [13]. Disambiguation methods based on knowledge bases can utilize any existing knowledge resources as disambiguation repositories, which enhances their applicability and operability [7, 11]; they are currently the mainstream algorithms for disambiguation. HowNet [4], a knowledge base composed of sememes, is widely used in Chinese WSD. Sememes are the smallest semantic units in linguistics, and linguists believe that every word can be decomposed into a finite set of sememes. After the expansion by Qi et al. [10], the updated version of HowNet predefines 2,540 sememes and employs them to annotate over 200,000 words to elucidate the fundamental meanings of word concepts. Therefore, compared to other knowledge bases, HowNet can describe ambiguous words by capturing the essence of their meanings. As shown in Fig. 1, the ambiguous word has two senses, both of which can be represented by basic sememes.
Fig. 1. Sememe tree of ambiguous word “单位”.
Currently, there are numerous research works using HowNet for unsupervised Chinese WSD. Hou et al. [6] treated the sememes under a sense as a whole and scored the sense using the masked-language-model task of a pre-trained language model. However, using only sememe information is insufficient for fully representing ambiguous words. Zhou et al. [21] improved disambiguation performance by augmenting synonym sets, building upon the work presented in [6]. Similarly, Zhang et al. [20] incorporated multiple senses of ambiguous words' translations based on [6]. The studies mentioned above all simplified the sememe trees (as shown in Fig. 2), which disregards the internal relationships among sememes and limits the ability to explain the disambiguated words.
A Novel Unsupervised Chinese Word Sense Disambiguation
625
Fig. 2. Neglecting the hierarchical relationships among sememes.
Being composed of sememes, senses, and relations, the network structure of HowNet allows semantics to be modeled by graph-theoretic methods without loss of edge information. Tang et al. [13] constructed a sememe graph based on sememe trees and employed DeepWalk to learn the graph structure. Yang et al. [16] incorporated translation relationships, semantic relationships, and co-occurrence relationships, constructing a heterogeneous relation network graph that leverages the advantages of different kinds of knowledge. Lu et al. [9] utilized three types of similarity to connect sememes and construct disambiguation graphs. However, the aforementioned studies employed node ranking algorithms similar to PageRank, which cannot distinguish edge and node types and are unsuitable for node importance ranking in heterogeneous semantic graphs. We propose a novel unsupervised Chinese WSD method based on a Heterogeneous Sememe-Relation Graph (HSRG-WSD), which contains three types of nodes and three types of edges. Our approach not only uses sememe hierarchical relationships to accurately describe senses but also incorporates cross-word sememe relationships (the sememe relationships between different words in the context) to capture deep semantic connections, enriching sense representation and sharpening the distinctions between words. Besides, we employ Semantic Dependency Parsing (SDP) instead of the Dependency Parsing (DP) used in other sememe-relation graphs, to directly associate the context and further reduce the impact of syntax. Compared to the traditional methods they used, the Heterogeneous Graph Attention Network (HAN) [15] captures the comprehensive graph information of the heterogeneous sememe graph constructed from multiple kinds of semantic knowledge, while also differentiating the importance of the various semantic relationship edges and nodes for disambiguation [14].
Drawing upon graph contrastive learning [3], we obtain negative contrast examples by deleting the sense to be selected [17] and construct a scoring function based on the cosine similarity of positive and negative examples for further node selection. The main contributions of this paper are as follows: (1) We propose an innovative unsupervised Chinese WSD model, HSRG-WSD, that effectively exploits sememe hierarchical relationships to deliver precise sense descriptions, surpassing all baseline performances on the HowNet-based Chinese WSD dataset. (2) Our method skillfully integrates cross-word sememe relationships and harnesses the power of SDP to establish both immediate and extended contextual associations,
thereby enriching sense representation and effectively minimizing the impact of syntax. (3) We devise an advanced node scoring function by leveraging the capabilities of HAN and incorporating node deletion strategies, which allows for a more nuanced differentiation of semantic relationships across edges and nodes, thereby facilitating the efficient selection of nodes in heterogeneous semantic graphs.
2 Methodology

2.1 Overview of the Proposed Method

A detailed illustration of the overall architecture of the model is presented in Fig. 3, which consists of three key modules: (a) the heterogeneous sememe-relation graph construction module, responsible for generating semantic graphs based on hyponymy relations and semantic dependency relations for a given input, followed by integrating the graphs formed by these two types of relationships; (b) the unsupervised node selection module, comprising two parts. Firstly, the node deletion part treats each sense of an ambiguous word, together with its sememes, as a complete entity and sequentially performs deletion operations to obtain a collection of subgraphs. Secondly, the node selection part is based on graph comparison: after learning the graph representation of each subgraph with HAN, the model assesses the importance of the deleted node sets by comparing the heterogeneous semantic graphs before and after node deletion, ultimately selecting the correct sense. In the following sections, we provide a detailed description of each module.
Fig. 3. Overall architecture of HSRG-WSD. (a) The construction of the HSRG is based on sememe and semantic dependency relationships, comprising three distinct types of nodes and edges. (b) Node deletion, a HAN encoder, and the node ranking algorithm are used to accurately identify the appropriate word sense node.
2.2 Heterogeneous Sememe-Relation Graph Construction

The process of graph construction comprises two distinct steps, followed by an integration: constructing a sememe-relation graph (SRG), building a semantic dependency relation graph (SDRG), and ultimately integrating the two graphs into a unified heterogeneous sememe-relation graph (HSRG).

SRG Construction. This part consists of connecting intra-word sememe relationships and connecting the context. Firstly, we utilize sememe knowledge in HowNet to represent the ambiguous word and its context. Taking the sentence “下达所属单位” as an example, the ambiguous word “单位” has two distinct meanings, and its context has three words: {下达, 所属, 单位}. Each word can be represented by a sememe tree like Fig. 1, where the nodes from top to bottom represent the word, senses, and sememes, respectively. Nodes are connected by the relationships defined in HowNet, with edge weights corresponding to specific sememe relationships. Next, we proceed with the context connection. We calculate the relevance between the context sememes and the ambiguous word's sememes, using it to connect cross-word sememes and associate the context. The calculation thoroughly exploits the tree-structured characteristics of the sememe tree, computing the relevance between two sememes based on their depths and distance within the tree. The calculation formula is as follows:

StructSimSem(s_i, s_j) = α · (d(s_i) + d(s_j)) / (α · (d(s_i) + d(s_j)) + dist(s_i, s_j) + |d(s_i) − d(s_j)|)    (1)

where d(s_i) represents the depth of sememe s_i in the sememe tree, dist(s_i, s_j) is the distance between sememes s_i and s_j in the sememe tree, and α is a tuning parameter that determines the influence of depth on the relatedness calculation. Finally, we obtain the graph G_SR = (V_SR, E_SR, T_SR), where V_SR = {v_i | i ∈ {1, 2, ..., N_SR}} represents the nodes in the graph, with the total number of nodes denoted as N_SR, and E_SR = {e_i | i ∈ {1, 2, ..., M_SR}} represents the edges, with M_SR the total number of edges.
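As a concrete illustration, the relatedness of Eq. (1) can be computed from node depths and the tree distance between two sememes. The sketch below is a minimal reading of the formula, assuming the sememe tree is given as a child-to-parent mapping; the toy tree, node names, and the value α = 1 are illustrative, not from the paper.

```python
def depth(tree, node):
    """Depth of a node in a sememe tree given as a child -> parent dict (root has parent None)."""
    d = 0
    while tree[node] is not None:
        node = tree[node]
        d += 1
    return d

def tree_distance(tree, a, b):
    """Number of edges on the path between two nodes via their lowest common ancestor."""
    ancestors = {}
    steps, node = 0, a
    while node is not None:
        ancestors[node] = steps
        node, steps = tree[node], steps + 1
    steps, node = 0, b
    while node not in ancestors:
        node, steps = tree[node], steps + 1
    return ancestors[node] + steps

def struct_sim_sem(tree, si, sj, alpha=1.0):
    """Eq. (1): deeper, closer sememes at similar depths receive higher relatedness."""
    ds_i, ds_j = depth(tree, si), depth(tree, sj)
    num = alpha * (ds_i + ds_j)
    return num / (num + tree_distance(tree, si, sj) + abs(ds_i - ds_j))

# Toy sememe tree (illustrative node names)
tree = {"entity": None, "institution": "entity", "human": "entity", "group": "institution"}
print(round(struct_sim_sem(tree, "group", "human"), 3))  # → 0.429
```

Note that two identical sememes yield a relatedness of exactly 1, and the |d(s_i) − d(s_j)| term penalizes pairs at very different depths even when their path is short.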
T_SR = (RV_SR, RE_SR) is the type set, where RV_SR = {word, sense, sememe} denotes the set of node types and RE_SR = {relation, relevance} is the edge type set. To avoid gradient explosion or vanishing-gradient problems in graph neural networks, the edge weights of the relevance type are normalized to the range [0, 1] using min-max normalization:

x′ = (x − x_min) / (x_max − x_min)    (2)
where x′ represents the normalized weight, x refers to the initial value, x_min is the minimum value in the initial data, and x_max is the maximum value in the initial data.

SDRG Construction. Semantic dependency analysis overcomes the limitations of syntactic structure, presenting the semantic relationships between words in a sentence directly through dependency structures. We use the Language Technology Platform (LTP) [2] to extract keywords from the ambiguous sentence and conduct semantic dependency analysis. The context keywords serve as vertices, semantic dependency relationships act as directed edges, and edge labels represent the dependency relations. Then
we obtain the graph G_SDR = (V_SDR, E_SDR), where V_SDR = {v_i | i ∈ {1, 2, ..., N_SDR}} represents the nodes in the graph, with the total number of nodes equaling the total number of words, and E_SDR = {e_i | i ∈ {1, 2, ..., M_SDR}} represents the edges in the graph, with the total number of edges denoted as M_SDR.

HSRG Construction. The HSRG integrates the SDRG and the SRG. Since the SDRG is a directed graph and the SRG is an undirected graph, we first convert the undirected graph G_SR into a directed graph G′_SR: each edge (u, v) in E_SR is transformed into two directed edges (u, v) and (v, u), each assigned the original weight. This operation ensures that the converted directed graph G′_SR and the original undirected graph have equivalent topology and edge-weight information. Then, we merge the nodes of G′_SR and G_SDR to obtain the node set V_0 of the merged graph G_0. Likewise, we merge the edges of G′_SR and G_SDR to obtain the edge set E_0 of G_0. Since identical edges in G′_SR and G_SDR may carry different semantic information, all edges are retained. Finally, we obtain the merged HSRG G_0 = (V_0, E_0, T_0), where V_0 = {v_i | i ∈ {1, 2, ..., N_0}} represents the N_0 nodes in the graph and E_0 = {e_i | i ∈ {1, 2, ..., M_0}} represents the M_0 edges. The type set T_0 = (RV_G, RE_G), where RV_G = {word, sense, sememe} is the node type set and RE_G = {relation, relevance, sdr} is the edge type set.

2.3 Unsupervised Node Selection Algorithm

Inspired by graph contrastive learning, we employ a node deletion operation on the heterogeneous semantic graph model and construct a node importance scoring model based on similarity computation.

Node Deletion. In the described process, we obtain an HSRG by integrating all the senses of the target ambiguous word into the initial semantic graph G_0. Assuming the polysemous word has n senses, we treat each sense and its sememes as a node set, thus obtaining n node sets S = {S_1, ..., S_i, ..., S_n}.
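The deletion of each node set S_i can be sketched in plain Python: each candidate sense vertex and its sememe vertices are removed from a copy of the merged graph, yielding one subgraph per sense. The node names and edge types below are illustrative placeholders, not the actual HowNet annotations.

```python
def build_subgraphs(nodes, edges, sense_node_sets):
    """Delete each sense's node set (the sense vertex plus its sememes) from G0.

    Edges incident to a deleted node are dropped with it, giving one subgraph
    per candidate sense, which is later paired with G0 for scoring.
    """
    subgraphs = []
    for node_set in sense_node_sets:
        kept_nodes = nodes - node_set
        kept_edges = [(u, v, t) for (u, v, t) in edges
                      if u in kept_nodes and v in kept_nodes]
        subgraphs.append((kept_nodes, kept_edges))
    return subgraphs

# Toy HSRG for the ambiguous word "单位" with two candidate senses (illustrative)
nodes = {"单位", "sense1", "sense2", "institution", "measure"}
edges = [("单位", "sense1", "relation"), ("单位", "sense2", "relation"),
         ("sense1", "institution", "relation"), ("sense2", "measure", "relation")]
subgraphs = build_subgraphs(nodes, edges,
                            [{"sense1", "institution"}, {"sense2", "measure"}])
```

Each element of `subgraphs` keeps the word node, the remaining senses, and their sememes, so comparing it against the full graph measures how much the deleted sense contributed.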
Subsequently, we perform a deletion operation on each node set, obtaining n subgraphs G_child = {G_1, ..., G_n}. Finally, each subgraph in G_child is paired with the original semantic graph G_0, resulting in a set of candidate sample pairs G_score = {[G_1, G_0], ..., [G_i, G_0], ..., [G_n, G_0]}. Using the example of “下达所属单位”, the ambiguous word “单位” has two distinct senses, each contributing one node set. By performing deletion operations on each node set sequentially, we obtain two subgraphs, G_1 and G_2, as illustrated in Fig. 3. By pairing each subgraph with the original semantic graph G_0, we obtain the candidate sample pairs G_score = {[G_1, G_0], [G_2, G_0]}.

Node Selection. We encode the HSRG G_0 and the subgraphs {G_1, ..., G_n} with a HAN encoder. In heterogeneous graphs, two nodes can be connected through different semantic paths, referred to as meta-paths [12]. The HSRG and its subgraphs all have three types of nodes: word, sense, and sememe. Based on the disambiguation task, we extract a meta-path φ = word − sense − sememe. HAN separately produces node-level representations and edge-level representations based on the meta-path φ and the heterogeneous graphs
{G_0, ..., G_n}. Next, it combines the node-level and edge-level representations by concatenation and, after average pooling, obtains the graph representations {h_0, ..., h_n}. We design the scoring function for node set S_i as follows:

Score(S_i) = log(1 / sim_0,i)    (3)

sim_0,i = (h_0 · h_i) / (‖h_0‖ ‖h_i‖)    (4)

where sim_0,i is the cosine similarity between the subgraph G_i and the entire graph G_0. Here, · denotes the dot product, while ‖h_0‖ and ‖h_i‖ denote the L2 norms of the vectors h_0 and h_i, respectively. A lower similarity implies a higher relevance between S_i and the context, resulting in a higher score. By computing scores for all node sets, we identify the node set with the highest score as corresponding to the correct sense of the ambiguous word.
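A minimal sketch of the scoring step in Eqs. (3) and (4): given graph representations from the encoder, the sense whose deletion changes the representation most (lowest cosine similarity to the full graph) receives the highest score. The vectors below are made-up placeholders for illustration, not actual HAN outputs.

```python
import math

def cosine(h0, hi):
    """Eq. (4): cosine similarity between two graph-representation vectors."""
    dot = sum(a * b for a, b in zip(h0, hi))
    norm0 = math.sqrt(sum(a * a for a in h0))
    normi = math.sqrt(sum(b * b for b in hi))
    return dot / (norm0 * normi)

def score(h0, hi):
    """Eq. (3): lower similarity -> higher score (assumes positive similarity)."""
    return math.log(1.0 / cosine(h0, hi))

# h0 is the full-graph representation; each candidate sense has a subgraph representation
h0 = [1.0, 0.0, 1.0]
subgraph_reps = {"sense1": [1.0, 0.1, 0.9], "sense2": [0.3, 1.0, 0.2]}
best_sense = max(subgraph_reps, key=lambda s: score(h0, subgraph_reps[s]))
# Deleting sense2 changes the representation most, so sense2 is selected here
```

Because the score is a monotone decreasing function of the similarity, taking the argmax of Score(S_i) is equivalent to taking the argmin of sim_0,i.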
3 Experiments

3.1 Dataset

In this study, we use the HowNet-based Chinese WSD dataset [6], which is constructed from the annotated corpus of SemEval-2007 Task 5. The dataset encompasses 2,969 cases for 36 ambiguous target words, including 17 nouns and 19 verbs.

3.2 Implementation Details

We follow the parameter settings of the original HAN, randomly initialize the parameters, and optimize the model using the Adam optimizer [8]. We configure the HAN with a learning rate of 0.005, a regularization parameter of 0.001, a 128-dimensional semantic-level attention vector q, and K = 8 attention heads. Additionally, we apply a 0.6 dropout rate for attention and employ early stopping with a patience threshold of 100.

3.3 Baselines

In this paper, we select representative unsupervised word sense disambiguation models based on HowNet from recent years as baselines.

– Random. A control group consisting of random selections, serving as a baseline comparison.
– MultiGraph [9]. This method combines three types of similarity calculations to construct a disambiguation graph and uses PageRank to select the highest-scoring word sense as the correct one.
– BERT-WSD [6]. This model leverages the masked language model (MLM) in BERT to score target word senses and uses MLM prediction scores to evaluate the suitability of the intended sense within the given context.
– Chinese-BERT-wwm-WSD [21]. This model replaces BERT in BERT-WSD with Chinese-BERT-wwm and adds synonym-set information to improve disambiguation performance.
– Translation-BERT-WSD [20]. This model builds on BERT-WSD by incorporating a translation-based scoring function to enhance disambiguation effectiveness.
– HSRG-WSD. The algorithm proposed in this paper.

Following prior research, we select micro-F1 and macro-F1 scores as the evaluation metrics for our study.

3.4 Experimental Results

The results of our model and the baselines are presented in Table 1 and Table 2. The tables indicate that our model's performance surpasses the best baseline, demonstrating the effectiveness of the proposed approach. We observed that all baselines performed worse on verbs, possibly because verbs have more senses than nouns (average: 5.53 vs. 3.35). For example, in the HSRG-WSD experiment, the disambiguation results for the two-sense noun “中医” (micro-F1: 68.92, macro-F1: 62.45) and for “动摇” (micro-F1: 68.91, macro-F1: 62.50) are similar, both significantly outperforming the verb “出” (micro-F1: 18.89, macro-F1: 10.35), which has nine senses.

Table 1. WSD results on the overall dataset (%F1 score). The highest values in each column are shown in bold.

Model                 | Micro-F1 | Macro-F1
Random                | 26.98    | 27.38
MultiGraph            | 49.20    | 41.30
BERT-WSD              | 52.98    | 45.04
Chinese-BERT-wwm-WSD  | 54.00    | 45.60
Translate-BERT-WSD    | 55.55    | 45.19
HSRG-WSD              | 56.98    | 49.36
Table 2. WSD results on the nouns and verbs (%F1 score). The highest values in each column are shown in bold.

Model                 | Nouns Micro-F1 | Nouns Macro-F1 | Verbs Micro-F1 | Verbs Macro-F1
Random                | 37.24          | 34.83          | 20.54          | 20.72
MultiGraph            | 52.10          | 40.80          | 42.90          | 38.80
BERT-WSD              | 53.76          | 41.71          | 52.50          | 48.02
Chinese-BERT-wwm-WSD  | 54.40          | 42.30          | 53.60          | 49.20
Translate-BERT-WSD    | 54.19          | 41.97          | 56.40          | 48.08
HSRG-WSD              | 57.37          | 46.65          | 56.72          | 49.96
3.5 Ablation Study

In order to examine the influence of sememe relations and semantic dependency relations on WSD, we carried out separate experiments employing the SDRG and the SRG.

Table 3. WSD results on different graphs (%F1 score). The highest values in each column are shown in bold.

Model     | Micro-F1 | Macro-F1
SRG-WSD   | 54.16    | 46.30
SDRG-WSD  | 10.96    | 3.28
HSRG-WSD  | 56.98    | 49.36
From Table 3, we can observe the following: (a) when each relation is used individually, the sememe relation achieves the highest disambiguation accuracy, indicating that sememe relations possess stronger disambiguation capabilities; (b) when semantic dependency relations are added to the network graph, the system's disambiguation accuracy increases, because the added relations strengthen the connection between the context and the interpretation of word meanings at the level of the construction principles of ambiguous words. The optimal strategy for word sense disambiguation is thus the optimized combination of the two types of relations.
4 Conclusion This study proposes the HSRG-WSD model, an unsupervised word sense disambiguation method based on HowNet. By integrating sense-sememe relationships, cross-word sememe relatedness, and semantic dependency relationships to construct a heterogeneous sememe relationship graph model, this model can represent word senses more
comprehensively and understand contextual meanings more precisely. After graph construction, it combines node deletion and graph comparison methods to achieve unsupervised node selection in the heterogeneous graph. Performance evaluation on the HowNet-based Chinese WSD dataset indicates better performance compared to other baseline models. Ablation experiments show that incorporating sememe knowledge and semantic dependency relationships can influence the model’s results favorably.
References

1. Bevilacqua, M., Navigli, R.: Quasi bidirectional encoder representations from transformers for word sense disambiguation. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pp. 122–131 (2019)
2. Che, W., Feng, Y., Qin, L., Liu, T.: N-LTP: an open-source neural language technology platform for Chinese. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 42–49. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021. https://doi.org/10.18653/v1/2021.emnlp-demo.6, https://aclanthology.org/2021.emnlp-demo.6
3. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
4. Dong, Z., Dong, Q.: HowNet - a hybrid language and knowledge resource. In: International Conference on Natural Language Processing and Knowledge Engineering, Proceedings, pp. 820–824. IEEE (2003)
5. Hadiwinoto, C., Ng, H.T., Gan, W.C.: Improved word sense disambiguation using pre-trained contextualized word representations. arXiv preprint arXiv:1910.00194 (2019)
6. Hou, B., Qi, F., Zang, Y., Zhang, X., Liu, Z., Sun, M.: Try to substitute: an unsupervised Chinese word sense disambiguation method based on HowNet. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 1752–1757 (2020)
7. Jain, G., Lobiyal, D.: Word sense disambiguation using cooperative game theory and fuzzy Hindi WordNet based on ConceptNet. Trans. Asian Low-Resource Lang. Inf. Process. 21(4), 1–25 (2022)
8. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
9. Lu, W., et al.: Graph-based Chinese word sense disambiguation with multi-knowledge integration. Comput. Mater. Continua 61(1), 197–212 (2019)
10. Qi, F., Yang, C., Liu, Z., Dong, Q., Sun, M., Dong, Z.: OpenHowNet: an open sememe-based lexical knowledge base. arXiv preprint arXiv:1901.09957 (2019)
11. Su, Y., Zhang, H., Song, Y., Zhang, T.: Multilingual word sense disambiguation with unified sense representation. arXiv preprint arXiv:2210.07447 (2022)
12. Sun, Y., Han, J., Yan, X., Yu, P.S., Wu, T.: PathSim: meta path-based top-k similarity search in heterogeneous information networks. Proc. VLDB Endow. 4(11), 992–1003 (2011)
13. Tang, G., Yu, D., Xun, E.: An unsupervised word sense disambiguation method based on sememe vector in HowNet. J. Chin. Inf. Process 29(6), 23–29 (2015)
14. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
15. Wang, X., et al.: Heterogeneous graph attention network. In: The World Wide Web Conference, pp. 2022–2032 (2019)
16. Yang, Z., Huang, H.: WSD method based on heterogeneous relation graph. J. Comput. Res. Dev. 50(2), 437–444 (2013)
17. You, Y., Chen, T., Sui, Y., Chen, T., Wang, Z., Shen, Y.: Graph contrastive learning with augmentations. Adv. Neural. Inf. Process. Syst. 33, 5812–5823 (2020)
18. Zeng, L.B., Su, J.W., Yang, C., Qian, Y.: A review of natural language processing technology for Chinese language and literature. In: 2022 International Communication Engineering and Cloud Computing Conference (CECCC), pp. 1–6. IEEE (2022)
19. Zhang, G., Lu, W., Peng, X., Wang, S., Kan, B., Yu, R.: Word sense disambiguation with knowledge-enhanced and local self-attention-based extractive sense comprehension. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 4061–4070 (2022)
20. Zhang, X., Hauer, B., Kondrak, G.: Improving HowNet-based Chinese word sense disambiguation with translations. In: Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 4530–4536 (2022)
21. Zhou, Y., Du, J., Xue, Z., Li, A., Guan, Z.: Chinese word sense embedding with SememeWSD and synonym set. In: Fang, L., Povey, D., Zhai, G., Mei, T., Wang, R. (eds.) CICAI 2022. LNCS, vol. 13606, pp. 236–247. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20503-3_19
Employing Beautiful Sentence Evaluation to Automatic Chinese Essay Scoring

Yaqiong He, Xiaomin Chu, and Peifeng Li(B)

School of Computer Science and Technology, Soochow University, Suzhou, China
[email protected], {xmchu,pfli}@suda.edu.cn
Abstract. Language learning relies on language sense. To write a good essay, an author needs not only skills but also original accumulation. In this paper, we approach the problem from the two perspectives of essay scoring and essay writing, utilizing beautiful sentence evaluation and generation to assist both tasks. We first establish a beautiful sentence dataset and train a beautiful sentence evaluation model on it, and then apply the evaluation model to automatic Chinese essay scoring. The experimental results demonstrate the effectiveness of beautiful sentence features in essay scoring tasks. Moreover, we train a beautiful sentence generator using the pre-trained model GPT-2 to improve the writing ability of essay writers. The experimental results demonstrate the effectiveness of the beautiful sentence generator in essay writing assistance.

Keywords: Beautiful sentence generation · Beautiful sentence evaluation · Automatic Chinese essay scoring
1 Introduction

In the essay scoring standard of the college entrance exam, the development level requires that excellent essays be creative in language and style, flexible in sentence structure, and adept in the use of rhetorical techniques. This is a higher-level standard built on the basic-level requirement of language fluency, and it is an important hallmark of excellent essays. Beautiful sentences can enhance the expressiveness and appeal of an essay, improve its readability, and make it easier for readers to understand and appreciate its meaning. If candidates can use language fluently and beautifully, they can enhance the quality of their essays and receive a better evaluation. Whether a sentence has literary grace can easily be judged subjectively. In Table 1, when describing an ideal, Exa1 is plain and fluent, and can only meet the basic-level requirement for language in the evaluation standards for high school entrance exams. On the other hand, Exa2 is beautiful and fluent, using metaphor and comparison to express the importance of the ideal, which meets the requirements for language use at the development level. Most existing research has focused only on evaluating the writing level of essays, which cannot fundamentally improve students' writing abilities. In this paper, we demonstrate the importance of language use for both essay scoring and essay writing from two

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 634–645, 2023.
https://doi.org/10.1007/978-981-99-4752-2_52
Employing Beautiful Sentence Evaluation
635
Table 1. Two examples of essay fragments.
perspectives. We first establish a beautiful sentence dataset and train a beautiful sentence evaluation model on it, and then apply the evaluation model to automatic Chinese essay scoring. The experimental results demonstrate the effectiveness of beautiful sentence features in essay scoring tasks. Moreover, we train a beautiful sentence generator using the pre-trained model GPT-2 to improve the writing ability of essay writers. The experimental results demonstrate the effectiveness of the beautiful sentence generator in essay writing assistance. Our main contributions are as follows:

• We constructed a dataset of beautiful sentences and trained a beautiful sentence evaluation model to evaluate sentences in essays.
• We are the first to apply beautiful sentence evaluation to automatic Chinese essay scoring, and the results show that it can effectively enhance the performance of automatic essay scoring tasks.
• We built a beautiful sentence generator to assist essay writing.
2 Related Work

2.1 Beautiful Sentence Generation

Previously, figurative language generation was primarily template-based and mainly produced metaphors. A few studies [1, 2] used “A is like B” expressions and template-like structures, which limited their ability to handle the variability and creativity inherent in natural language. Recently, there has been a shift towards neural end-to-end approaches in figurative language modeling, demonstrating high levels of creativity, particularly in pun and metaphor generation [3, 4]. To provide better transparency, Zhou et al. [5] proposed a neural-based pipeline for generating idioms. Impressive results in
636
Y. He et al.
figurative language generation have been achieved using pre-trained models, including T5 and BART, and fine-tuning these models has proven successful for generating metaphors [6, 7], similes [8], and hyperbole [9]. These studies have mainly focused on generating individual figurative forms and modeling the generation between literal and figurative language. However, our goal is to generate beautiful sentences that may employ various rhetorical devices and may or may not convey philosophical messages, while still utilizing parallel literal-figurative data for single forms. 2.2 Automatic Essay Scoring In the past decade or so, researchers in natural language processing have conducted numerous studies on automatic essay scoring. Research on automatic English essay scoring began earlier, with early studies mainly utilizing regression, classification, or ranking algorithms in conjunction with manual features [10–15] to score essays. In recent years, many studies [16–22] have shown that deep learning models can more effectively capture deep semantic meaning of essays, achieving better performance. The above research methods mainly train on essays that contain specific prompts, but it is difficult to obtain a large number of essays with specific prompts in reality. Therefore, some research [17, 23] has begun to focus on cross-prompt essay scoring. Ridley et al. [23] proposed a single-stage neural network-based method that uses some features unrelated to the prompt to score essays with a target prompt. In order to obtain more feedback, some studies have begun to study trait score. For example, Ke et al. [24] evaluated the persuasiveness of essay arguments and explained them using relevant attributes. Kumar et al. [25] and Ridley et al. [26] used multi-task learning to simultaneously obtain overall scores and scores for each trait. Kumar et al. 
[25] also used multi-task learning to treat the overall score as the main task and the traits score as an auxiliary task to score articles with specific prompts. Research on automatic Chinese essay scoring started relatively late and has been less studied. Early research extracted essay features and used machine learning techniques to score essays [27–29]. With the development of deep learning techniques, neural network models have gradually been introduced into this field, offering powerful feature extraction capabilities that eliminate the need for handcrafted features. Fu et al. [30] improved the accuracy of Chinese essay automatic scoring by identifying beautiful sentences and using them as features for essay scoring tasks. Ma et al. [31] used support vector machines (SVM) and BP neural networks to solve the Chinese essay scoring task. Song et al. [32] trained models to evaluate the organization level of high school argumentative essays. Song et al. [33] used a cross-prompt scoring approach to score four sets of articles containing thematic prompts as a whole.
3 Corpus Collection

We collected beautiful sentences from the high school argumentative essays, prose, post-reading essays, and biographical writing on leleketang1 (these sentences were marked in the text with wavy underlines), as well as some beautiful sentences from other

1 https://www.leleketang.com/.
Employing Beautiful Sentence Evaluation
637
websites. After data pre-processing, we obtained 17,040 sentences, with details listed in Table 2. The average length of these sentences was 60, with the shortest sentence being 8 and the longest being 446.

Table 2. Data preprocessing steps.

1. Deleting redundant punctuation
2. Correcting erroneous punctuation
3. Removing special symbols
4. Adding missing punctuation
5. Deleting incomplete sentences
6. Merging separated sentences
7. Deleting sentences with unclear meaning
8. Deleting duplicate sentences
9. Removing extraneous text introduced during scraping
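Several of these cleaning steps can be sketched in a few lines; the regular expressions and the minimum-length threshold below are our own illustrative assumptions, not the authors' rules:

```python
import re

def clean_sentences(sentences, min_len=8):
    """Apply simplified versions of several preprocessing steps in Table 2."""
    seen, cleaned = set(), []
    for s in sentences:
        s = re.sub(r"[#*@^~]+", "", s)              # remove special symbols
        s = re.sub(r"\s+", " ", s).strip()          # normalize whitespace
        s = re.sub(r"([,。!?,.!?])\1+", r"\1", s)   # collapse redundant punctuation
        if len(s) < min_len:                         # drop short/incomplete fragments
            continue
        if s in seen:                                # drop duplicates
            continue
        seen.add(s)
        cleaned.append(s)
    return cleaned
```

A pass over raw scraped sentences then deduplicates and normalizes them in one sweep.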
Fig. 1. Flowchart of our model.
4 Model

The process of our work is shown in Fig. 1 and consists of three parts: (1) beautiful sentence evaluation, (2) automatic essay scoring (AES) with beautiful sentence evaluation, and (3) the beautiful sentence generator. For essay scoring, we use the pre-trained XLNet model and a BiLSTM to train a beautiful sentence evaluation model. The model evaluates all sentences in an essay, and features extracted from the evaluation results are applied to the automatic essay scoring model to analyze the effectiveness of beautiful sentence features in automatic essay scoring. For essay writing, we use the pre-trained GPT-2 model to obtain a beautiful sentence generator. The generator's effectiveness is assessed using the beautiful sentence evaluation model and human evaluation. The generated beautiful sentences are then used to assist in essay writing, and their effectiveness is analyzed through experiments.

We first split the sentences and use XLNet to encode each word sequence si = {wi1, wi2, ..., wim} of each sentence. The last word representation is taken as the sentence representation. Then, all the sentence representations are concatenated and input into a BiLSTM, and the average of the hidden layer is taken as the final representation of
the beautiful sentence. The output is then passed through a dense layer for prediction, resulting in a classification of “Yes” or “No”. Eqs. (1)–(3) show the process as follows:

si = XLNet(si)   (1)
Table 3. Beautiful sentence features.

Feature     Meaning
Favg>0.5    Average probability of beautiful sentences (probability greater than 0.5)
Fratio      Proportion of beautiful sentences in the essay
Fnum        Number of beautiful sentences in the essay
Favg        Average probability over all sentences in the essay
hi = BiLSTM(si, hi−1)   (2)

t = Dense(mean(h1, h2, ..., hi))   (3)
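The mean(·) in Eq. (3) is an element-wise average over the BiLSTM hidden states; a minimal plain-Python sketch:

```python
def mean_pool(hidden_states):
    """Element-wise average of equal-length hidden vectors,
    i.e. mean(h1, h2, ..., hi) in Eq. (3)."""
    n = len(hidden_states)
    return [sum(dim) / n for dim in zip(*hidden_states)]
```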
We use the cross-entropy loss to train the model, as follows:

Loss = −(1/N) Σi [ yi log(pi) + (1 − yi) log(1 − pi) ]   (4)
where yi represents the gold label and pi the predicted probability.

4.1 Automatic Essay Scoring on Beautiful Sentence Evaluation

The beautiful sentence evaluation model is used to predict each sentence in an essay, and the features extracted from all prediction results are used as beautiful sentence features in the automatic essay scoring model. The features are shown in Table 3. For the baseline and MPE [34] models, the beautiful sentence features are concatenated at the end of the model and fed into a dense layer for classification. For the HMTS(-langu-feat) [35] model, the beautiful sentence features are used as language-specific features and concatenated with the language task representation flang to form the final language task representation, as shown in Eq. (5):

Elang = Concatenate(flang, Favg>0.5, Fratio, Favg)   (5)

where flang is the language task representation and Elang is the final language task representation.
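A sketch of the feature extraction of Table 3 and the concatenation of Eq. (5); the function names are ours, and the per-sentence probabilities are assumed to come from the evaluation model:

```python
def beauty_features(probs):
    """Compute the beautiful-sentence features of Table 3 from
    per-sentence probabilities predicted by the evaluation model."""
    beautiful = [p for p in probs if p > 0.5]
    f_avg_gt = sum(beautiful) / len(beautiful) if beautiful else 0.0  # Favg>0.5
    f_ratio = len(beautiful) / len(probs)                             # Fratio
    f_num = len(beautiful)                                            # Fnum
    f_avg = sum(probs) / len(probs)                                   # Favg
    return [f_avg_gt, f_ratio, f_num, f_avg]

def language_representation(f_lang, features):
    """Eq. (5): concatenate the language-task representation with
    the features Favg>0.5, Fratio, and Favg."""
    f_avg_gt, f_ratio, _, f_avg = features
    return f_lang + [f_avg_gt, f_ratio, f_avg]
```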
4.2 Beautiful Sentence Generator

We use GPT-2 to pre-train the beautiful sentence generator on our dataset. GPT-2 mainly consists of the Transformer’s decoder module, as shown in Fig. 2. First, the input is subjected to token embedding and position embedding. Token embedding converts each word to a fixed-size vector so that the neural network can process it. Position embedding helps the model understand the order of the input data. The token embedding and position embedding are then added together to get the final input I, as shown in Eq. (6):

I = TE(s) + PE(s)   (6)

where TE(·) is the token embedding and PE(·) is the position embedding.
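Eq. (6) simply adds a token embedding and a position embedding at each input position; a toy illustration with hand-built lookup tables (the tiny vocabulary and dimensions are invented for illustration):

```python
def embed(tokens, token_emb, pos_emb):
    """Eq. (6): I = TE(s) + PE(s), summed position by position."""
    return [
        [t + p for t, p in zip(token_emb[tok], pos_emb[i])]
        for i, tok in enumerate(tokens)
    ]
```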
Fig. 2. GPT-2 model architecture
The input is then fed into the decoder layer, consisting of the Masked Self-Attention (MSA) layer and the Feed-Forward Neural Network (FFNN) layer. The input is first passed through the MSA layer, which integrates the textual sequence information so that each character in the transformed sequence is related to the entire sequence. The processed information is then passed to the FFNN layer, and the output is passed to the next decoder module. After multiple decoder layers, the model completes one iteration and outputs a word. After multiple iterations, the final generated sequence is obtained. The computations in each decoder module are shown in Eqs. (7) and (8):

Im = MSA(I)   (7)

Ie = FFNN(Im)   (8)
where Im is the representation of the input after passing through MSA, and Ie is the content representation of the input after passing through one decoder layer.
We use the negative log-likelihood loss, computed as follows:

L(U) = − Σ_{i=1}^{n} log P(ui | ui−k, ..., ui−1; θ)   (9)

where the conditional probability P(ui | ui−k, ..., ui−1; θ) represents the probability of the i-th word given the (i−k)-th to (i−1)-th words.
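Eq. (9) is the standard language-modeling objective: sum the negative log-probabilities the model assigns to each gold next token. A plain-Python sketch, taking the per-token probabilities as given:

```python
import math

def nll_loss(token_probs):
    """Eq. (9): negative log-likelihood of a sequence, given the model's
    probability P(u_i | u_{i-k}, ..., u_{i-1}; theta) for each gold token."""
    return -sum(math.log(p) for p in token_probs)
```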
5 Experimentation

This section first introduces the experimental settings and baseline systems, then presents the experimental results, and finally conducts an error analysis.

5.1 Experimental Settings

The beautiful sentence evaluation model used the data from Sect. 3 as beautiful sentences and 6,000 sentences from “Bad” essays on leleketang as non-beautiful sentences. The dataset was split 8:2 for training and testing, with 10 training epochs, a learning rate of 0.001, and the Adam optimizer for parameter updates. The XLNet word vector dimension is set to 768, and the BiLSTM hidden size is set to 256. Evaluation metrics include accuracy, recall, F1, and human comparison evaluation. The beautiful sentence generator was trained on the dataset collected in Sect. 3, using 400 randomly selected sentences as the validation set. The BERT tokenizer was used to process Chinese characters, with 12 decoder layers, a 768-dimensional word vector, and 10 training epochs. The AdamW optimizer was used with a learning rate of 1e-4, and perplexity (PPL) served as the evaluation metric. For the application of beautiful sentence features in the automatic essay scoring task, the dataset is the same as in HMTS [35]. Evaluation metrics include QWK and F1.

5.2 Baselines

For the beautiful sentence evaluation model, we compare it with CNN, BiLSTM, and XLNet. For automatic essay scoring on beautiful sentence evaluation, we introduce the beautiful sentence features to the following models.
1) CNN_LSTM_att [20]: a leading overall scoring model for prompt-specific scoring, which treats input essays as sentence-document hierarchies;
2) Song2020 [32]: proposed for organization evaluation of Chinese argumentative student essays, utilizing a hierarchical multi-task approach for joint discourse element identification and organization evaluation;
3) XLNet: uses the pre-trained XLNet model to obtain paragraph representations, which are concatenated and fed into a dense layer to predict the essay grade;
4) MPE [34]: provides the overall score of an essay from three perspectives: semantics, organization, and logic;
5) HMTS(-langu-feat) [35]: utilizes multi-task learning to provide the overall score of an essay, as well as four trait scores for organization, topic, logic, and language. We use the language features we designed to verify their effectiveness.
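The QWK metric used in Sect. 5.1 is quadratic weighted kappa: it compares the observed score-confusion matrix against the matrix expected by chance, with quadratic penalties for larger disagreements. A self-contained sketch, assuming integer scores in a known range:

```python
def qwk(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted kappa between two lists of integer scores."""
    n = max_score - min_score + 1
    total = len(rater_a)
    # Observed confusion matrix
    obs = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        obs[a - min_score][b - min_score] += 1
    # Expected matrix from the marginal histograms
    hist_a = [0] * n
    hist_b = [0] * n
    for a in rater_a:
        hist_a[a - min_score] += 1
    for b in rater_b:
        hist_b[b - min_score] += 1
    exp = [[hist_a[i] * hist_b[j] / total for j in range(n)] for i in range(n)]
    # Quadratic disagreement weights
    w = [[(i - j) ** 2 / (n - 1) ** 2 for j in range(n)] for i in range(n)]
    num = sum(w[i][j] * obs[i][j] for i in range(n) for j in range(n))
    den = sum(w[i][j] * exp[i][j] for i in range(n) for j in range(n))
    return 1.0 - num / den
```

Perfect agreement yields 1.0; systematic disagreement drives the value negative.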
We first analyze the results of the beautiful sentence evaluation model, then examine the application of beautiful sentence features in automatic essay scoring, and finally evaluate the output of the beautiful sentence generator and use it to assist writing.

5.3 Results of Beautiful Sentence Evaluation

As shown in Table 4, compared with CNN and BiLSTM, the pre-trained XLNet model performed better, indicating the effectiveness of pre-trained models in evaluating beautiful sentences. In addition, our beautiful sentence evaluation model (Ours) achieved the best performance by adding a BiLSTM on top of XLNet, with gains of 3.6 in accuracy, 4.2 in recall, and 3.9 in macro F1. Furthermore, as shown in Table 5, the beautiful sentence evaluation model can accurately identify beautiful and non-beautiful sentences.

Table 4. Experimental results of the beautiful sentence evaluation.

Model    Accuracy  Recall  F1 Score
CNN      66.85     67.10   66.71
BiLSTM   80.98     78.83   79.54
XLNet    92.26     91.90   92.07
Ours     95.63     95.76   95.57
Table 5. Cases of beautiful sentence evaluation.
Table 6. Results of automatic essay scoring on beautiful sentence evaluation, where BS refers to the beautiful sentence features from beautiful sentence evaluation.
5.4 Results of Automatic Essay Scoring on Beautiful Sentence Evaluation

As shown in Table 6, all models showed performance improvements after adding the beautiful sentence features. We used a t-test with a 95% confidence interval for the significance test, and all improvements of +BS over the original models are significant (p < 0.03). For the MPE model, adding the features increased its QWK value by 1.6 and macro F1 by 1.2. For the HMTS(-langu-feat) model, the features increased its QWK value by 0.9 and macro F1 by 1.3. This is because beautiful sentences reflect the language level and rhetorical skill of an essay and thus play an important role in essay evaluation. Therefore, adding beautiful sentence features to automatic essay scoring models better reflects the language level of an essay and improves scoring performance.

The baseline models CNN_LSTM_att, Song2020, and XLNet showed larger performance improvements than the MPE and HMTS(-langu-feat) models. This is because CNN_LSTM_att is designed for English essays, with a weaker ability to extract information from Chinese texts; Song2020 only evaluates the organization of essays; and XLNet constitutes merely the lowest layer of MPE and HMTS(-langu-feat). In contrast, MPE and HMTS(-langu-feat) already consider language evaluation in their designs: MPE includes semantic expression evaluation, while HMTS(-langu-feat) evaluates language as a separate task. This results in smaller performance improvements after adding the beautiful sentence features.

5.5 Results of Beautiful Sentence Generator

For the beautiful sentence generator, the final training loss is 0.81, the PPL on the validation set is 1.97, and the validation loss is 0.68. However, it is difficult to judge the generation ability of the model through these indicators alone.
Therefore, we use the beautiful sentence generator to produce sentences and analyze the concrete generation results, as shown in Table 7. We find that the beautiful sentence generator can not only generate sentences containing beautiful vocabulary and rhetorical
Table 7. Results of beautiful sentence generator.
devices but also generate implicit and philosophical sentences, which are indispensable for students seeking to improve their language ability. To assess the generated content, we used the beautiful sentence generator to produce 50 sentences, which were evaluated by the beautiful sentence evaluation model and compared with manual evaluation. The evaluation model rated 42 sentences as beautiful, while human evaluators rated 48 sentences as beautiful. The sentences the model rated as beautiful were consistent with the human judgments, but some sentences with deep meaning expressed in plain language were misjudged by the evaluation model.
Fig. 3. An essay example improved by our beautiful sentence generator, where the red text marks the modified parts.
6 Conclusion

Existing research on automatic essay scoring focuses only on evaluating students’ writing ability and cannot fundamentally improve it. We start from the perspectives of both essay scoring and essay writing, using beautiful sentence evaluation and generation to assist both tasks. The experimental results show that beautiful sentence features can effectively improve the performance of automatic essay scoring. The beautiful sentence generator can effectively generate beautiful sentences to assist essay writing.

Acknowledgements. The authors would like to thank the two anonymous reviewers for their comments on this paper. This research was supported by the National Natural Science Foundation of China (Nos. 62276177 and 61836007), and the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
References

1. Abe, K., Sakamoto, K., Nakagawa, M.: A computational model of the metaphor generation process. In: Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 28 (2006)
2. Veale, T.: Round up the usual suspects: knowledge-based metaphor generation. In: Proceedings of the Fourth Workshop on Metaphor in NLP, pp. 34–41 (2016)
3. Yu, Z., Tan, J., Wan, X.: A neural approach to pun generation. In: ACL (1), pp. 1650–1660. Association for Computational Linguistics (2018)
4. Yu, Z., Wan, X.: How to avoid sentences spelling boring? Towards a neural approach to unsupervised metaphor generation. In: NAACL-HLT (1), pp. 861–871. Association for Computational Linguistics (2019)
5. Zhou, J., Gong, H., Bhat, S.: From solving a problem boldly to cutting the Gordian knot: idiomatic text generation. CoRR abs/2104.06541 (2021)
6. Stowe, K., Chakrabarty, T., Peng, N., Muresan, S., Gurevych, I.: Metaphor generation with conceptual mappings. In: ACL/IJCNLP (1), pp. 6724–6736. Association for Computational Linguistics (2021)
7. Chakrabarty, T., Zhang, X., Muresan, S., Peng, N.: MERMAID: metaphor generation with symbolism and discriminative decoding. In: NAACL-HLT (1), pp. 4250–4261. Association for Computational Linguistics (2021)
8. Chakrabarty, T., Muresan, S., Peng, N.: Generating similes effortlessly like a pro: a style transfer approach for simile generation. In: EMNLP (1), pp. 6455–6469. Association for Computational Linguistics (2020)
9. Zhang, Y., Wan, X.: MOVER: mask, over-generate and rank for hyperbole generation. In: NAACL-HLT (1), pp. 6018–6030. Association for Computational Linguistics (2022)
10. Page, E.B.: The imminence of... grading essays by computer. Phi Delta Kappan 47(5), 238–243 (1966)
11. Attali, Y., Burstein, J.: Automated essay scoring with e-rater v.2.0 (ETS RR-04-45). Educational Testing Service, Princeton, NJ, pp. 2–20 (2005)
12. Larkey, L.S.: Automatic essay grading using text categorization techniques. In: SIGIR, pp. 90–95 (1998)
13. Rudner, L.M., Liang, T.: Automated essay scoring using Bayes’ theorem. J. Technol. Learn. Assess. 1(2) (2002)
14. Yannakoudakis, H., Briscoe, T., Medlock, B.: A new dataset and method for automatically grading ESOL texts. In: ACL, pp. 180–189 (2011)
15. Chen, H., He, B.: Automated essay scoring by maximizing human-machine agreement. In: EMNLP, pp. 1741–1752 (2013)
16. Taghipour, K., Ng, H.T.: A neural approach to automated essay scoring. In: EMNLP, pp. 1882–1891 (2016)
17. Cummins, R., Zhang, M., Briscoe, T.: Constrained multi-task learning for automated essay scoring. In: ACL (1) (2016)
18. Alikaniotis, D., Yannakoudakis, H., Rei, M.: Automatic text scoring using neural networks. In: ACL (1) (2016)
19. Dong, F., Zhang, Y.: Automatic features for essay scoring - an empirical study. In: EMNLP, pp. 1072–1077 (2016)
20. Dong, F., Zhang, Y., Yang, J.: Attention-based recurrent convolutional neural network for automatic essay scoring. In: CoNLL, pp. 153–162 (2017)
21. Tay, Y., Phan, M.C., Tuan, L.A., Hui, S.C.: SkipFlow: incorporating neural coherence features for end-to-end automatic text scoring. In: AAAI, pp. 5948–5955 (2018)
22. Li, X., Yang, H., Hu, S., Geng, J., Lin, K., Li, Y.: Enhanced hybrid neural network for automated essay scoring. Expert Syst. J. Knowl. Eng. 39(10) (2022)
23. Ridley, R., He, L., Dai, X., Huang, S., Chen, J.: Prompt agnostic essay scorer: a domain generalization approach to cross-prompt automated essay scoring. CoRR abs/2008.01441 (2020)
24. Ke, Z., Carlile, W., Gurrapadi, N., Ng, V.: Learning to give feedback: modeling attributes affecting argument persuasiveness in student essays. In: IJCAI, pp. 4130–4136 (2018)
25. Kumar, R., Mathias, S., Saha, S., Bhattacharyya, P.: Many hands make light work: using essay traits to automatically score essays. In: NAACL-HLT (1), pp. 1485–1495 (2022)
26. Ridley, R., He, L., Dai, X., Huang, S., Chen, J.: Automated cross-prompt scoring of essay traits. In: AAAI, pp. 13745–13753 (2021)
27. Liang, M., Wen, Q.: A critical review and implications of some automated essay scoring systems. Technol. Enhanc. Foreign Lang. Educ. 5, 18–24 (2007)
28. Li, Y.: Automated essay scoring for testing Chinese as a second language. Ph.D. thesis, Beijing Language and Culture University (2006)
29. Yang, C., Cao, Y.: Current situation and prospect of automatic essay scoring. Chin. Teach. Middle Sch. 3, 78–80 (2012)
30. Fu, R., Wang, D., Wang, S., Hu, G., Liu, T.: Elegant sentence recognition for automated essay scoring. J. Chin. Inf. Process. 32(6), 88–97 (2018)
31. Ma, H., Guo, L., Peng, H.: Comparison of automatic scoring effect of writing based on SVM and BP neural network. Examin. Res. 5, 8–13 (2019)
32. Song, W., Song, Z., Liu, L., Fu, R.: Hierarchical multi-task learning for organization evaluation of argumentative student essays. In: IJCAI, pp. 3875–3881 (2020)
33. Song, W., Zhang, K., Fu, R., Liu, L., Liu, T., Cheng, M.: Multi-stage pre-training for automated Chinese essay scoring. In: EMNLP, pp. 6723–6733 (2020)
34. He, Y., Jiang, F., Chu, X., Li, P.: Automatic Chinese essay scoring on multi-perspective modeling. Comput. Sci. 50, 315–322 (2023)
35. He, Y., Jiang, F., Chu, X., Li, P.: Automated Chinese essay scoring from multiple traits. In: COLING, pp. 3007–3016 (2022)
A Data Augmentation Method Based on Sub-tree Exchange for Low-Resource Neural Machine Translation

Chuncheng Chi1, Fuxue Li2(B), Hong Yan2, Hui Guan1, and Zhongchao Zhao1

1 College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang, China
2 College of Electrical Engineering, Yingkou Institute of Technology, Yingkou, China
[email protected]
Abstract. Neural machine translation (NMT) has recently gained a lot of attention due to its ability to provide highly accurate translations. Despite its promising potential, NMT is confronted with a major hurdle in the form of insufficient training data, which can adversely affect translation performance, particularly for low-resource languages. This is a major obstacle, as it hinders the applicability of NMT across diverse domains. To alleviate this issue, a novel data augmentation (DA) method is proposed to expand the training set. It utilizes pseudo-parallel sentence pairs generated by sub-tree exchange and back-translation to enrich the diversity of the training samples. The effectiveness of the proposed method has been validated through a series of experiments. Both simulated and real low-resource translation tasks were used to evaluate the method. The results show that the proposed method outperforms other DA methods and significantly improves translation quality beyond that of a strong baseline.

Keywords: Neural machine translation · Low-resource · Data augmentation
1 Introduction

Neural machine translation (NMT) has achieved excellent performance on various language pairs by using an end-to-end architecture and large-scale parallel data. However, these machine translation systems are usually trained on tens or even hundreds of millions of parallel sentences [1–3]. Such datasets are only available for a few high-resource language pairs (HRLs), e.g., English → French. For low-resource language pairs (LRLs), e.g., Vietnamese → English, only a limited number of training samples are available. Collecting high-quality parallel bilingual corpora for LRLs is expensive and time-consuming. As a result, researchers have proposed several data augmentation (DA) methods to improve the performance of NMT models for LRLs [4–6].

Data augmentation (DA) is an important technique for generating additional training samples when the available parallel data are scarce, and it has proved useful in previous works [7–10]. It is widely applied in computer vision and was then introduced to the

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 646–657, 2023. https://doi.org/10.1007/978-981-99-4752-2_53
natural language processing field. Existing methods for neural machine translation include randomly swapping two words, dropping words, and so on [11–16]. Due to the characteristics of natural language, these random transformations can introduce noise into the text or result in serious semantic mistakes. Another class of DA methods uses monolingual data [9, 10], e.g., back-translation and self-learning. Nevertheless, gathering and collating the required quantity of monolingual data requires a significant amount of work.

Different from previous approaches that replace or randomly drop individual words in the training set, this paper proposes an effective data augmentation (DA) method named sub-tree exchange (STE). It utilizes a back-translation (BT) NMT model and new monolingual sentences generated by exchanging sub-trees to generate additional sentence pairs, and it does not need any external monolingual data or synonym dictionary. The STE method proceeds as follows: first, constructing a target-to-source NMT model (the backward NMT model); second, generating constituency parse trees with Stanford CoreNLP [17] for the target-side sentences of the original training set and applying the proposed sub-tree exchange algorithm; next, translating the newly generated sentences with the backward NMT model; finally, adding the pseudo-parallel corpus to the original training set.

The structure of this paper is as follows: related work is introduced in Sect. 2, and Sect. 3 describes the STE method. Section 4 covers the experiments and results, while Sect. 5 provides the conclusion and outlines future work.
2 Related Work

Data augmentation (DA) is a widely employed method in computer vision that effectively improves model training. For neural machine translation (NMT), DA is often used to generate noisy data to increase the model's robustness, or more diverse training samples to improve translation performance.

One class of DA methods requires the introduction of monolingual data. Sennrich et al. [10] presented a back-translation (BT) method to improve model performance, which extends the training set by translating monolingual target sentences into source sentences. Several studies have shown that BT variants are effective [11, 14, 18–20]. Currey et al. [21] proposed a DA method using monolingual data in which target sentences are copied to the source side to expand the training set. All these methods are effective, yet they require additional monolingual data. Unlike previous DA methods, the sub-tree exchange (STE) method does not need additional monolingual data. It merely exchanges sub-trees according to rules defined over the constituency parse tree and expands the training set by constructing pseudo-parallel sentence pairs.

Another category of DA is based on word replacement. Existing methods for word-level DA include random word swapping and word dropping. Fadaee et al. [8] improved NMT for low-resource translation by replacing words in the original sentence with rare vocabulary. In SwitchOut [7], words in the source and target languages are individually replaced with words uniformly sampled from the corresponding vocabulary. Artetxe et al. [12] randomly replaced words with adjacent words within a
window size, while Iyyer et al. [15] randomly removed word tokens. Xie et al. [16] proposed two ways to introduce noise into sentences: randomly substituting words with placeholder words, and substituting words with other words that have a comparable frequency distribution across the vocabulary. These methods demonstrate the ability of data noising to improve NMT model performance by introducing noise and broadening the range of training data. Other DA methods utilize language models to generate substitute words [22, 23]. A soft contextual DA method was proposed by Gao et al. [24], which replaces the word representation with a soft distribution. The STE method is similar to word swapping, except that the units swapped are phrases, and the exchanged units must come from the same sentence. Compared with other word-swapping DA methods, the pseudo-parallel sentence pairs generated by the proposed method contain less noise.
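The word-level operations discussed above (random word dropping and swapping words within a window) can be sketched as follows; the function names and the window convention are our own illustrative choices:

```python
import random

def word_drop(tokens, p=0.1, rng=random):
    """Randomly drop each token with probability p (keep at least one)."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]

def word_swap(tokens, window=3, rng=random):
    """Swap one randomly chosen token with a neighbor within `window`."""
    if len(tokens) < 2:
        return list(tokens)
    out = list(tokens)
    i = rng.randrange(len(out))
    lo, hi = max(0, i - window), min(len(out) - 1, i + window)
    j = rng.randrange(lo, hi + 1)
    out[i], out[j] = out[j], out[i]
    return out
```

Both transformations preserve the vocabulary of the sentence, which is why they can distort meaning: the words survive, but the structure may not.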
3 Method

The sub-tree exchange (STE) data augmentation (DA) method generates pseudo-parallel sentence pairs for the original training set by sub-tree exchange and back-translation. The structure of the pseudo sentences generated by the sub-tree exchange method is consistent with the original training set. The method comprises two steps: sub-tree exchange, and the generation and utilization of pseudo-parallel sentence pairs. We clarify each step in the remainder of this section.

3.1 Exchanging Sub-tree Based on the Constituency Parse Tree

Previous works have pointed out that each word plays a different role in the sentence, and different types of syntax are stored in different layers [25, 26]. Following these works, it is reasonable to assume that nodes at the same depth in the syntax tree stand in a kind of coordinate relationship in the sentence structure. Therefore, a sentence generated by exchanging nodes with the same depth and attribute is relatively reasonable in semantics and complete in structure. A constituency parse tree is shown in Fig. 1. To describe the structural details of a sentence, constituency grammar distinguishes terminal and non-terminal nodes. In a constituency-based parse tree, the non-leaf nodes correspond to the attribute labels of a sentence, while the leaf nodes correspond to its content words. As shown in Fig. 1, the leaf nodes can be represented as the set L = {It’s, the, styrofoam, and, chocolate, game, ...}, and the non-terminal nodes as the set N = {NP, VP, PRP, VBZ, ...}. The sub-tree exchange algorithm is performed on the constituency parse tree to generate the pseudo monolingual sentence. It can be summarized as follows:

1. Constructing the constituency parse tree with Stanford CoreNLP [17];
2. Setting the component attribute of the root node of the sub-tree exchange as “NP”;
3. Locating the layer of the top-most leaf node, whose depth is denoted d;
4. Adding all nodes in the constituency parse tree whose depth is d to the candidate set S;
5. For each node X in the set S, the following extension rules are applied:
   (a) If there are two or more nodes whose attribute is NP in the set S, exchange the two sub-trees with those NP nodes as roots; a new constituency parse tree is thus constructed, and this step ends;
   (b) If there is at most one NP node in S, set S = {}, add the child nodes of each node into the set S, and repeat step 5;
6. Combining all leaf nodes yields the new sentence.

As in the example of Fig. 1, exchanging the sub-trees with “NP” as the root node produces the new sentence “It is chocolate game and the styrofoam”.
Fig. 1. The constituency parse tree of an example sentence (leaf nodes and non-leaf nodes represent the words in sentences and their corresponding attributes, respectively. “NP” stands for noun phrase).
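The algorithm above can be sketched in a few dozen lines; our own minimal tree class stands in for the Stanford CoreNLP parse output:

```python
class Node:
    """Constituency-tree node; pre-terminals carry a word."""
    def __init__(self, label, children=(), word=None):
        self.label, self.children, self.word = label, list(children), word

def leaves(node):
    if node.word is not None:
        return [node.word]
    return [w for c in node.children for w in leaves(c)]

def min_leaf_depth(node, depth=0):
    if node.word is not None:
        return depth
    return min(min_leaf_depth(c, depth + 1) for c in node.children)

def subtree_exchange(root, attr="NP"):
    """Steps 3-6: start at the top-most leaf layer, then search level by
    level for two `attr` sub-trees and swap them in place."""
    level = [root]
    for _ in range(min_leaf_depth(root)):
        level = [c for n in level for c in n.children]
    while level:
        targets = [n for n in level if n.label == attr and n.word is None]
        if len(targets) >= 2:
            a, b = targets[0], targets[1]
            a.children, b.children = b.children, a.children
            a.word, b.word = b.word, a.word
            return " ".join(leaves(root))
        level = [c for n in level for c in n.children]
    return None  # no exchangeable pair found

# The example of Fig. 1: two NP sub-trees inside the coordinated NP are swapped.
tree = Node("S", [
    Node("NP", [Node("PRP", word="It")]),
    Node("VP", [
        Node("VBZ", word="is"),
        Node("NP", [
            Node("NP", [Node("DT", word="the"), Node("NN", word="styrofoam")]),
            Node("CC", word="and"),
            Node("NP", [Node("NN", word="chocolate"), Node("NN", word="game")]),
        ]),
    ]),
])
print(subtree_exchange(tree))  # It is chocolate game and the styrofoam
```

The level-by-level search mirrors step 5: each layer is inspected for two coordinate NP nodes before descending.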
3.2 Constructing Pseudo-Parallel Sentence Pairs

Section 3.1 proposed the sub-tree exchange algorithm operating on the original training set; this section describes the details of generating the pseudo-bilingual corpus. As shown in Fig. 2, the construction of the pseudo-bilingual corpus can be summarized as follows:

1. Given a parallel corpus D = {x, y}, where x and y represent sentences in the source language S and target language T, respectively, and x1, x2, x3, ..., xm and y1, y2, y3, ..., yn denote the words of the source-side and target-side sentences, train a neural machine translation (NMT) model of T → S, labelled MT→S;
2. For each pair (xi, yi), generate a new sentence from yi with the sub-tree exchange algorithm of Sect. 3.1, labelled y′i;
Fig. 2. Data augmentation method STE based on sub-tree exchange.
3. Translating y′i with the NMT model MT→S to obtain the corresponding translation x′i; the pseudo-bilingual corpus D′ = {x′i, y′i} is thus constructed.
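Putting the three steps together; here `exchange` stands in for the algorithm of Sect. 3.1 and `back_translate` for the trained T → S model MT→S (both are stubs in this sketch, not the authors' code):

```python
def build_pseudo_corpus(parallel_pairs, exchange, back_translate):
    """Steps 1-3: for each (x, y), exchange sub-trees in y to get y',
    back-translate y' to x', and collect the pseudo pairs (x', y')."""
    pseudo = []
    for x, y in parallel_pairs:
        y_new = exchange(y)
        if y_new is None or y_new == y:   # skip sentences with no exchange
            continue
        x_new = back_translate(y_new)
        pseudo.append((x_new, y_new))
    return pseudo
```

The resulting pseudo pairs are then simply appended to the original training set.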
4 Experiments
Several experiments are carried out on simulated and real low-resource translation tasks to verify the effectiveness of the proposed data augmentation approach. They primarily cover translation from German (De), Hebrew (He), Italian (It), Vietnamese (Vi), Arabic (Ar), and Turkish (Tr) into English (En), as well as English-to-Vietnamese and English-to-Turkish, drawn from the well-known IWSLT14, IWSLT15, IWSLT16, and WMT18 benchmarks. For all experiments, the Transformer model [27] is implemented with the fairseq toolkit [28] on a machine equipped with two GeForce RTX 3090 Ti GPUs.
4.1 Datasets and Settings
Datasets and Pre-processing. Table 1 displays the sentence-pair statistics of each dataset. For the IWSLT14 He → En and De → En translation tasks, the training sets are pre-processed in the same way as Gao et al. [24]. For the IWSLT14 It → En translation task, we follow the setting of Lin et al. [29]. For the IWSLT15 En → Vi translation task, we adopt the data pre-processing of Wang et al. [7]. For the IWSLT16 Ar → En translation task, the validation and test sets are tst2013 and tst2014, respectively. For all IWSLT14 and IWSLT16 tasks, we use a shared vocabulary with 10K byte-pair encoding

Table 1. The statistics of sentence pairs in each dataset. "NP" denotes the number of sentences with a successful sub-tree exchange.

| Corpus  | Pair    | Training | Validation | Test | NP     |
| IWSLT14 | He → En | 187817   | 1382       | 962  | 123558 |
| IWSLT14 | De → En | 160239   | 7283       | 6750 | 106126 |
| IWSLT14 | It → En | 184742   | 1056       | 883  | 121424 |
| IWSLT15 | Vi → En | 133317   | 1553       | 1268 | 86091  |
| IWSLT16 | Ar → En | 225002   | 1169       | 1107 | 148531 |
| WMT18   | Tr → En | 207678   | 3007       | 3000 | 172457 |
A Data Augmentation Method Based on Sub-tree Exchange
651
(BPE) [30] types. Finally, for the WMT18 translation tasks, we follow the experimental setting of Bugliarello et al. [31].
Model Settings. Throughout our experiments, we utilize the Transformer, currently the most prevalent sequence-to-sequence model. For the IWSLT14 De → En and IWSLT15 Vi → En translation tasks, we adopt the default transformer_base configuration for the NMT model, while for the other translation tasks the default transformer_small configuration is used. All Transformer models are trained with the Adam optimizer [32] and evaluated with the BLEU score [33]. The detailed parameter settings of the Transformer model are provided in Table 2.

Table 2. The detailed parameter settings of the Transformer model.

| Model                | Transformer_small | Transformer_base |
| Dimension            | 512               | 512              |
| FFN                  | 1024              | 2048             |
| Multi-head attention | 4                 | 8                |
| Label smoothing      | 0.1               | 0.1              |
| Dropout              | 0.3               | 0.3              |
| Learning rate        | 5e-4              | 5e-4             |
| Layers               | 6                 | 6                |
4.2 Results and Analysis
Performance on Different Translation Tasks. Tables 3 and 4 show the evaluation results for all translation tasks. For every task, the sub-tree exchange (STE) method improves the BLEU score by at least 0.36 over the strong baseline. Among the low-resource tasks, the IWSLT16 Ar → En and IWSLT14 He → En translation tasks show the largest gains over the baseline, with BLEU improvements of 0.77 and 1.02, respectively. To further validate the robustness of the STE method, we run two translation tasks in which the sentence pairs generated by sub-tree exchange serve as the source side. As shown in Table 4, the WMT18 En → Tr and IWSLT15 En → Vi translation tasks achieve significant improvements over the baseline, with BLEU gains of 0.44 and 1.91. On both simulated and real low-resource translation tasks, we contend that the STE method can significantly boost the performance of the NMT model across language pairs.
Table 3. Performance on several translation tasks.

| Pairs    | IWSLT14 De → En | IWSLT14 He → En | IWSLT14 It → En | IWSLT15 Vi → En | IWSLT16 Ar → En | WMT18 Tr → En |
| Baseline | 34.20 | 33.64 | 27.66 | 27.76 | 29.28 | 15.63 |
| STE      | 34.56 | 34.66 | 28.05 | 28.20 | 30.05 | 16.04 |
Table 4. Performance on several translation tasks.

| Pairs    | IWSLT15 En → Vi | WMT18 En → Tr |
| Baseline | 27.97 | 14.10 |
| STE      | 29.88 | 14.54 |
Compare with Other Data Augmentation (DA) Methods. We compare the sub-tree exchange (STE) method with several existing DA methods on several translation tasks. To ensure a fair comparison, the same data settings and evaluation metrics as in previous studies are used. As shown in Table 5, on the En → Vi translation task the STE method clearly outperforms the other methods. In Table 6, we utilize the Transformer_small model for the IWSLT14 He → En translation task; the STE method achieves results comparable to the SCA method of [26] and outperforms the other methods. Furthermore, Table 7 presents the results on the De → En translation task, which demonstrate that our method yields competitive results compared with current DA methods. This confirms its effectiveness and robustness in NMT models for low-resource languages.

Table 5. Performance of different data augmentation methods on the En → Vi translation task with the BLEU metric, based on the transformer_base model. (*) is from Wang et al. [7].

| Method               | IWSLT15 En → Vi |
| Transformer*         | 27.97 |
| +WordDropout*        | 28.56 |
| +SwitchOut*          | 28.67 |
| +RAML*               | 28.88 |
| +RAML + WordDropout* | 28.86 |
| +RAML + SwitchOut*   | 29.09 |
| STE                  | 29.88 |
Table 6. Performance of different data augmentation methods on the He → En translation task with the BLEU metric, based on the transformer_base model. (*) is from Gao et al. [24].

| Method       | IWSLT14 He → En |
| Transformer* | 33.64 |
| Swap*        | 34.25 |
| Drop*        | 34.29 |
| Blank*       | 34.37 |
| Smooth*      | 34.61 |
| LMsample*    | 34.31 |
| SCA*         | 34.91 |
| STE          | 34.66 |
Table 7. Performance of different data augmentation methods on the De → En translation task with the BLEU metric, based on the transformer_base model. (*) is from Maimaiti et al. [34].

| Method                 | IWSLT14 De → En |
| Transformer*           | 33.53 |
| BT*                    | 33.69 |
| COPY*                  | 34.62 |
| SWAP*                  | 33.98 |
| DROP*                  | 34.68 |
| BLANK*                 | 34.83 |
| SMOOTH*                | 34.85 |
| SWITCH*                | 34.75 |
| SCA*                   | 34.89 |
| Augment_source*        | 34.93 |
| Augment_target*        | 34.98 |
| Augment_source+target* | 35.14 |
| STE                    | 34.56 |
Performance of Comparative and Stacked Methods. To assess the effectiveness of the sub-tree exchange (STE) method, we conducted a set of experiments utilizing the back-translation (BT) technique, an approach commonly employed to augment the training set in low-resource neural machine translation (NMT) tasks. An additional pseudo-parallel corpus was generated by combining the STE method with the BT method, using 80k monolingual sentences from the WMT14 translation task. Specifically, a target-to-source NMT model generates translations of the monolingual data, yielding a new source-target pseudo-parallel corpus.
Results in Table 8 demonstrate a significant improvement in NMT model performance when using the BT method with extra monolingual data, compared to the baseline. We contend that the BT method with extra monolingual data enriches the diversity of the training set. Furthermore, combining the STE method with the BT method leads to further improvements in translation quality. This suggests that BT and STE are complementary and can independently improve the performance of the NMT model.

Table 8. Performance of sub-tree exchange augmentation (STE) compared to and combined with back-translation (BT).

| Method       | IWSLT16 Ar → En |
| Transformer* | 29.28 |
| +BT*         | 29.69 |
| +STE*        | 30.05 |
| +STE + BT*   | 20.35 |
Comparison of BLEU Score and Training Loss. To examine the convergence of the sub-tree exchange (STE) method, we track the BLEU score on the validation set and the training loss across epochs for the IWSLT14 It → En and IWSLT16 Ar → En translation tasks. As shown in Figs. 3 and 4, the STE method converges faster than the baseline. Notably, the STE method obtains a higher BLEU score on the validation set and a lower loss on the training set from the very beginning, which shows that convergence is significantly accelerated. In our view, enriching the diversity of the training data with pseudo-parallel sentence pairs enables the neural machine translation model to perform better.
Fig. 3. BLEU score (left) and training loss (right) at different epochs on the It → En translation task.
Case Study. Several sample sentences, translated from Turkish and Vietnamese into English, are listed in Tables 9 and 10. Compared with the baselines, the translations generated by the sub-tree exchange (STE) method for both language pairs are closer to the corresponding references, although there is still room for improvement in translation quality. Our experimental findings thus show that the STE method is advantageous for translating low-resource languages.
Fig. 4. BLEU score (left) and training loss (right) at different epochs on the Ar → En translation task.

Table 9. A Turkish example of the effectiveness of our method.

| Source Sentence | Ve genel seçimde de sıkıntı yaşayacağı pek çok neden var |
| Reference       | And there are many reasons he would struggle in a general election |
| Baseline        | And there are many reasons why they may have trouble in the general elections |
| STE             | And there are many reasons for suffering in the general election |
Table 10. A Vietnamese example of the effectiveness of our method.

| Source Sentence | Những việc các bạn làm là đun nóng thép, bạn làm chảy nhựa bitum, và nhựa bitum sẽ chảy vào các kẽ nứt siêu nhỏ này, và các viên đá sẽ kết dính lên lớp mặt trở lại |
| Reference       | Then what you do is you heat up the steel, you melt the bitumen, and the bitumen will flow into these micro-cracks, and the stones are again fixed to the surface |
| Baseline        | What you do is you heat up the steel, and the plastic will flow into these little cracks, and the rocks will stick up on the back |
| STE             | What you do is you heat up the steel, you melt the plastic, and the plastic NO will flow into these tiny cracks, and the rocks will stick to each other |
5 Conclusion
This paper proposes a straightforward yet effective method named sub-tree exchange (STE) to augment the training set for low-resource language pairs (LRLs) by generating a pseudo-parallel corpus. The STE method augments the original training set with sentences generated via two techniques: sub-tree exchange and back-translation. Specifically, the former exchanges sub-trees derived from the constituency parse tree, whereas the latter translates target-language sentences into the source language. The resulting sentences are then incorporated into the training set. Experiments on simulated (e.g., German → English) and real LRL (e.g., Turkish → English) translation tasks show substantial improvements of the STE method over strong baselines, and it is more effective
than other existing data augmentation methods. In future work, we intend to exchange other constituent attribute nodes to generate more diverse monolingual data for data augmentation methods or for other natural language generation tasks.
Acknowledgement. This work was supported by the National Natural Science Foundation of Liaoning Province, China (Grant nos. 2021-YKLH-12, 2022-YKLH-18).
References
1. Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: International Conference on Machine Learning, pp. 1243–1252. PMLR (2017)
2. Wu, Y., et al.: Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)
3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
4. Gu, J., Wang, Y., Chen, Y., Cho, K., Li, V.O.: Meta-learning for low-resource neural machine translation. arXiv preprint arXiv:1808.08437 (2018)
5. Ren, S., Chen, W., Liu, S., Li, M., Zhou, M., Ma, S.: Triangular architecture for rare language translation. arXiv preprint arXiv:1805.04813 (2018)
6. Zoph, B., Yuret, D., May, J., Knight, K.: Transfer learning for low-resource neural machine translation. arXiv preprint arXiv:1604.02201 (2016)
7. Wang, X., Pham, H., Dai, Z., Neubig, G.: SwitchOut: an efficient data augmentation algorithm for neural machine translation. arXiv preprint arXiv:1808.07512 (2018)
8. Fadaee, M., Bisazza, A., Monz, C.: Data augmentation for low-resource neural machine translation. arXiv preprint arXiv:1705.00440 (2017)
9. Zhang, J., Zong, C.: Exploiting source-side monolingual data in neural machine translation. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1535–1545 (2016)
10. Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709 (2015)
11. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
12. Artetxe, M., Labaka, G., Agirre, E., Cho, K.: Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041 (2017)
13. Author, F., Author, S.: Title of a proceedings paper. In: Editor, F., Editor, S. (eds.) CONFERENCE 2016, LNCS, vol. 9999, pp. 1–13. Springer, Heidelberg (2016)
14. Lample, G., Conneau, A., Denoyer, L., Ranzato, M.: Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043 (2017)
15. Iyyer, M., Manjunatha, V., Boyd-Graber, J., Daumé III, H.: Deep unordered composition rivals syntactic methods for text classification. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1681–1691 (2015)
16. Xie, Z., et al.: Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573 (2017)
17. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J.R., Bethard, S., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60 (2014)
18. Burlot, F., Yvon, F.: Using monolingual data in neural machine translation: a systematic study. arXiv preprint arXiv:1903.11437 (2019)
19. Cheng, Y., Cheng, Y.: Semi-supervised learning for neural machine translation. Jt. Train. Neural Mach. Transl. 25–40 (2019)
20. Cotterell, R., Kreutzer, J.: Explaining and generalizing back-translation through wake-sleep. arXiv preprint arXiv:1806.04402 (2018)
21. Currey, A., Miceli-Barone, A.V., Heafield, K.: Copied monolingual data improves low-resource neural machine translation. In: Proceedings of the Second Conference on Machine Translation, pp. 148–156 (2017)
22. Wu, X., Lv, S., Zang, L., Han, J., Hu, S.: Conditional BERT contextual augmentation. In: Rodrigues, J.M.F., et al. (eds.) ICCS 2019. LNCS, vol. 11539, pp. 84–95. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-22747-0_7
23. Kobayashi, S.: Contextual augmentation: data augmentation by words with paradigmatic relations. arXiv preprint arXiv:1805.06201 (2018)
24. Gao, F., et al.: Soft contextual data augmentation for neural machine translation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5539–5544 (2019)
25. Chen, K., Wang, R., Utiyama, M., Sumita, E.: Content word aware neural machine translation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 358–364 (2020)
26. Shi, X., Padhi, I., Knight, K.: Does string-based neural MT learn source syntax? In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1526–1534 (2016)
27. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
28. Ott, M., et al.: fairseq: a fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038 (2019)
29. Lin, Z., Wu, L., Wang, M., Li, L.: Learning language specific sub-network for multilingual machine translation. arXiv preprint arXiv:2105.09259 (2021)
30. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015)
31. Bugliarello, E., Okazaki, N.: Enhancing machine translation with dependency-aware self-attention. arXiv preprint arXiv:1909.03149 (2019)
32. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
33. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
34. Maimaiti, M., Liu, Y., Luan, H., Sun, M.: Data augmentation for low-resource languages NMT guided by constrained sampling. Int. J. Intell. Syst. 37(1), 30–51 (2022)
Improving Neural Machine Translation by Retrieving Target Translation Template

Fuxue Li1,2(B), Chuncheng Chi3, Hong Yan2, and Zhen Zhang2

1 School of Computer Science and Engineering, Northeastern University, Shenyang, China
[email protected]
2 College of Electrical Engineering, Yingkou Institute of Technology, Yingkou, China
3 College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang, China
Abstract. In the neural machine translation (NMT) paradigm, Transformer-based NMT has achieved great progress in recent years. It is based on the standard end-to-end structure and acquires translation knowledge from the parallel corpus automatically through the attention mechanism, without human intervention. Inspired by the process by which human translators translate sentences and by the successful application of translation templates in statistical machine translation, this paper proposes a novel approach to incorporate the target translation template into the Transformer-based NMT model. Firstly, a template extraction method derives the parallel template corpus from the constituency parse tree. Secondly, given a sentence to be translated, a fuzzy matching strategy is proposed to retrieve the most likely target translation template from the parallel template corpus. Finally, an effective method is proposed to incorporate the target translation template into the Transformer-based NMT model. Experimental results on three translation tasks demonstrate the effectiveness of the proposed approach, which improves translation quality significantly.
Keywords: Translation template · Neural machine translation · Fuzzy matching strategy
1 Introduction
Neural machine translation (NMT) [1–4] has been the dominant paradigm in recent years, showing better performance than statistical machine translation (SMT). Among NMT models, Transformer-based NMT [5] offers powerful performance and computational efficiency. It assumes that all translation knowledge can be learned from the bilingual corpus automatically through a self-attention mechanism. In detail, the Transformer-based NMT model first encodes the source sentence into a dense vector, then generates the target words one by one without human intervention. That is quite different from how human translators work: given a sentence to be translated, a translator usually looks for the translation of a similar sentence and modifies it as necessary to produce the final translation. Inspired by the translation process shown in Fig. 1, this paper proposes an approach that incorporates the target template into the Transformer-based NMT model to improve translation quality.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 658–669, 2023. https://doi.org/10.1007/978-981-99-4752-2_54
Improving Neural Machine Translation
659
Fig. 1. The procedure of translating a Chinese sentence to English by a human translator.
The translation method in Fig. 1 can also be regarded as an example-based or template-based translation method. On the one hand, the successful application of templates in SMT has been proved conducive to improving translation system performance [6, 7]. On the other hand, templates have been applied successfully to many natural language processing tasks in recent years [8–10]. As for template research in neural machine translation, Yang proposes a method named "ST-NMT" [11], which takes a two-pass manner to improve translation quality. TBMT [12] utilizes a multi-source NMT framework to guide the decoder. This paper adopts a one-pass manner, which is more efficient, and explores more template integration strategies for the NMT model. The main contributions can be summarized as follows:
1. A new universal, unsupervised, and language-independent template extraction method is proposed.
2. A fuzzy matching strategy based on word-level Levenshtein distance is utilized to retrieve the target translation template.
3. An effective method that incorporates the target template into the Transformer-based NMT model is proposed to improve translation quality.
2 Related Work
With regard to template extraction, several translation template extraction methods have been proposed. Kaji et al. [13] constructed translation templates with a bilingual dictionary. Liu et al. [14] extract templates with a tree-to-string algorithm that aligns the source parse tree and the target sequence via word alignment. Zhang et al. [15] also use word alignment to build translation templates from parse trees. Shang proposed the MNP-template [12], which focuses on max-length NPs. Different from these previous works based on word alignment, we propose a general and simple template extraction method based on the constituency parse tree.
Many works have focused on incorporating templates into machine translation. In SMT, example-based or template-based methods have been proven effective in improving translation quality [6, 7]. Kaijie translates a given sentence by matching translation templates; [13, 14, 16] extract translation templates to improve SMT, and [17] uses a combination of chunk-string templates (CSTs) and translations of unknown words to guide the model. In NMT, Zhang et al. employ translation fragments gathered from the matching sentence pair to guide the model [18]. Dinu et al. [19] use a terminology dictionary to impose pre-specified translation knowledge. "ST-NMT" takes a two-pass manner to improve translation quality by incorporating
660
F. Li et al.
a target template that is predicted by a Transformer, while our approach adopts a one-pass manner to generate the final translation, with better efficiency than "ST-NMT".
3 Approach
The proposed approach can be summarized as follows: firstly, the parallel template corpus is constructed by the template extraction method. Next, given a sentence to be translated, the most likely target translation template is retrieved from the parallel template corpus by the fuzzy matching strategy. Finally, the target template is integrated into the NMT decoder. The following parts of this chapter introduce the approach in detail.
3.1 Generating Templates by Constituency Parse Tree
Constructing a high-quality template corpus usually requires the participation of linguists and is a time-consuming job. Some researchers have studied the automatic generation of templates [13–15]. In this paper, a novel syntactic template extraction strategy is proposed, which is both unsupervised and language-independent.
Fig. 2. The constituency parse tree of an example sentence.
The template generated in this paper derives from the constituency parse tree, which uses terminal and non-terminal nodes of a constituency grammar to represent the structural information of the entire sentence. The content words of a sentence are the leaf nodes of the constituency-based parse tree, while the non-leaf nodes are constituent attribute labels, each covering one or more content words of the sentence. Inspired by previous work [11, 12, 20], a novel and simple strategy is proposed to extract the template of a given sentence. The main steps are as follows:
1. Construct the constituency parse tree with Stanford CoreNLP [21];
2. Locate the layer of the top-most (shallowest) leaf node and denote its depth by d;
3. Select each node X in the constituency parse tree whose depth is d, with one extension rule: if the sub-tree rooted at X contains only one leaf node, that leaf node itself is selected as the template token; otherwise, the constituent label of X is used.
The template extracted from the constituency parse tree in Fig. 2 is "I see NP VP".
3.2 Searching the Translation Template
Given a parallel corpus D = {S, T} and a source sentence s to be translated, two steps are performed. Firstly, for each sentence pair (X_i, Y_i) ∈ D and for the sentence s, we extract templates with the method proposed in Sect. 3.1, obtaining T_xi, T_yi, and T_s, respectively. Next, a similarity measure is introduced between the template of the input sentence and the source-side templates of the parallel corpus:

similarity = 1 − Distance(T_xi, T_s) / max(|T_xi|, |T_s|)   (1)

where T_xi and T_s denote the template of the i-th source sentence in the parallel corpus and the template of the sentence s, and Distance(T_xi, T_s) denotes the word-level Levenshtein distance for transforming T_s into T_xi. In detail, to speed up this computation over every sentence X_i in the parallel corpus, we first filter the candidates with the following score:

score = 1 − |T_xi ∩ T_s| / max(|T_xi|, |T_s|)   (2)

After finding the most similar source sentence in the parallel corpus, we treat the template of its target translation as the translation template of the sentence to be translated.
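The extraction rule of Sect. 3.1 and the similarity of Eq. (1) can be sketched in a few lines of self-contained Python. The tiny tuple-based tree below stands in for a Stanford CoreNLP constituency parse; the tree shape and labels are illustrative, not taken from the paper.

```python
# Toy constituency tree: an internal node is (label, child, ...); a leaf is a word.

def leaves(node):
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in leaves(child)]

def min_leaf_depth(node, depth=0):
    if isinstance(node, str):
        return depth
    return min(min_leaf_depth(c, depth + 1) for c in node[1:])

def extract_template(tree):
    """Cut the tree at the depth d of the shallowest leaf; a node covering a
    single leaf contributes that word (extension rule), otherwise its label."""
    d = min_leaf_depth(tree)
    template = []
    def walk(node, depth):
        if depth == d:
            if isinstance(node, str):
                template.append(node)
            elif len(leaves(node)) == 1:
                template.append(leaves(node)[0])  # extension rule
            else:
                template.append(node[0])          # constituent label
            return
        for child in node[1:]:
            walk(child, depth + 1)
    walk(tree, 0)
    return template

def levenshtein(a, b):
    """Word-level edit distance (the numerator of Eq. 1), by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]

def similarity(t1, t2):
    # Eq. (1): 1 - Distance(T_xi, T_s) / max(|T_xi|, |T_s|)
    return 1 - levenshtein(t1, t2) / max(len(t1), len(t2))

tree = ("S", ("NP", "I"),
             ("VP", ("VBP", "see"),
                    ("NP", "a", "cat"),
                    ("VP", "running", "fast")))
print(extract_template(tree))  # -> ['I', 'see', 'NP', 'VP']
print(similarity(["I", "see", "NP", "VP"], ["I", "see", "NP", "PP"]))  # -> 0.75
```

In the paper the tree would come from the CoreNLP parser, and the similarity would be computed between the input-sentence template and every source-side template surviving the Eq. (2) filter.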
Fig. 3. The model architecture of incorporating the template information into the Transformer.
A. Encoder. As shown in Fig. 3, the proposed method consists of two encoders: the source sentence encoder and the target template encoder. Each is composed of a stack of six blocks containing self-attention layers with residual connections, layer normalization, and a feed-forward network (FFN). The source encoder and the template encoder are computed by h_i = Block(h_{i−1}). The final output of the encoders is divided into two parts: the source sentence representation H_s and the target template representation H_t.
B. Decoder. The source representation H_s and the target template representation H_t are utilized to generate the target translation in the decoding procedure. The proposed decoder is structured similarly to the decoder in the Transformer, with the addition of an integrated layer between the masked multi-head attention and the feed-forward network layers. The integrated layer contains two parts: the template-decoder (Tem-Dec) attention and the encoder-decoder (Enc-Dec) attention. The main architecture of the proposed method is shown in Fig. 3. In other words, in addition to the source sentence representation, the decoder utilizes the template representation to guide the generation of target words. Given a source sentence s = x_1, x_2, ..., x_m and a target template t = t_1, t_2, ..., t_n, we obtain the outputs of the encoders: the source sentence representation H_s and the target template representation H_t. The template-integrated decoding procedure can be summarized as follows:

S = Att_self(E_t W^q, E_t W^k, E_t W^v)   (3)
S = Att_ed(H_s, H_s, S)   (4)
S̃ = Att_tem(H_t, H_t, S)   (5)
where E_t is the word embedding of the target words, and Att_self, Att_ed, and Att_tem denote the decoder self-attention, the encoder-decoder attention, and the template-decoder attention, respectively.
Integrated Approach. To incorporate the translation template into the Transformer model, we propose an integration method. Specifically, the encoder-decoder (Enc-Dec) attention and the template-decoder (Tem-Dec) attention are computed independently, and the two results can be integrated in two ways:
Linear Interpolation (LI). Compared with the encoder-decoder attention result S, the template-decoder attention result S̃ can be seen as a representation derived from prior knowledge for generating the target sentence. The final result is therefore revised as:

h = (1 − λ) · S + λ · S̃   (6)

where λ is an empirical parameter with a value between 0 and 1.
Gate Learning (GL). Compared with the linear interpolation method, this strategy adjusts the weight dynamically throughout the training procedure rather than by an empirical setting:

f = sigmoid(W_h [S ; S̃] + b_h)   (7)
h = f · S + (1 − f) · S̃   (8)

where [S ; S̃] denotes the concatenation of the two context vectors, and W_h and b_h are parameters learned automatically during training.
4 Experiment
To verify the effectiveness of the proposed approach, we conducted several experiments on three public datasets: WMT14 and NIST from the news domain, and IWSLT14 from TED talks. All experiments were implemented using the fairseq toolkit [24].
4.1 Datasets
IWSLT14 German-English (De-En). The training set consists of 160K sentence pairs, and the validation set consists of 7283 sentences. The test set contains 6750 sentences, with one translation reference per source sentence.
NIST Chinese-English (Zh-En). The training set contains 1.8 million sentence pairs, and MT06 is selected as the validation set. We merge the test sets MT04, MT05, and MT08 for testing (5891 sentences); each sentence has four translation references.
WMT14 German-English (De-En). The training set contains 4.5 million sentence pairs, and the validation set consists of 21K sentence pairs. The test set contains 3003 sentences, each with one translation reference.
664
F. Li et al.
4.2 Pre-processing and Settings
In the pre-processing step, Stanford CoreNLP was used to tokenize Chinese sentences, and Moses [25] was used to tokenize the other languages. For the IWSLT14 De → En, NIST Zh → En, and WMT14 En → De translation tasks, we use the small, base, and big settings of the Transformer, respectively. BPE [26] is used in all experiments. The beam size is set to 5, and we use the checkpoint with the best BLEU score on the validation set. The case-insensitive 4-gram BLEU score is used as the primary evaluation metric.
5 Results and Analysis
5.1 Performance on Three Datasets
Table 1. Performance on several translation tasks with the BLEU metric. For "Our Approach", each row gives the fuzzy matching score threshold and the number of matched test sentences per task (IWSLT14 / WMT14 / NIST).

| Method | IWSLT14 (De → En) | WMT14 (En → De) | NIST (Zh → En) |
| Transformer | 34.39 | 28.40 | 42.54 |
| Our Approach (0.9+: 86 / 439 / 396 sentences) | 35.16 | 29.62 | 43.78 |
| Our Approach (0.85+: 173 / 877 / 531 sentences) | 35.23 | 29.41 | 43.40 |
| Our Approach (0.8+: 362 / 1075 / 682 sentences) | 34.96 | 29.57 | 43.42 |
Fig. 4. The performance of various fuzzy matching scores α on the NIST (Zh → En, left) and WMT14 (En → De, right) test sets with the BLEU metric. The red lines denote the baseline.
For all translation tasks, the fuzzy matching score thresholds were set to 0.9+, 0.85+, and 0.8+, yielding the different numbers of matched test sentences listed in Table 1. The BLEU score of our proposed method outperforms the baseline on all translation tasks: target templates of high quality capture target-side structural information and guide the Transformer decoder. However, with a fuzzy matching score of 0.8+, the BLEU score is lower than with 0.9+. We therefore test the influence of different fuzzy matching scores on the NIST Zh → En and WMT14 En → De translation tasks. As shown in Fig. 4, as the fuzzy score α decreases from 0.95 to 0.85,
more templates are matched, improving the NMT. However, when α decreases to 0.8, performance drops. The main reason is that templates with low similarity may damage the performance of the NMT model. Therefore, the fuzzy matching score is set to 0.85 for the other experiments in this paper.
5.2 Evaluating the Hyper-Parameter for Linear Interpolation
Fig. 5. BLEU scores of the linear interpolation method on IWSLT14 and WMT14 test sets with different values of λ. The dashed line denotes the result of the baseline.
We vary the value of λ in the Linear Interpolation approach and evaluate its performance on the WMT14 En → De and IWSLT14 En → De translation tasks. Figure 5 depicts the experimental results. Compared with the baseline, when λ rises from 0 to 0.3, the BLEU scores improve by 0.42 and 0.32. These results suggest that the Linear Interpolation approach is effective for improving the performance of the NMT model. Furthermore, higher values of λ decrease the BLEU scores, suggesting that an excessive amount of target template information may introduce bias when predicting the target sentence. Overall, these findings demonstrate the potential benefit of incorporating target template information into the decoder of the NMT model at an appropriate level. Therefore, λ is set to 0.3 for the LI method in other experiments.
5.3 Performance on Two Integrated Approaches
Table 2. Performance of different interpolation methods on the WMT14 (En → De) translation task with the BLEU metric.

Manner         Method                      BLEU (Test set)
Baseline       Transformer                 28.40
Our Approach   Linear Interpolation (LI)   28.72
               Gate Learning (GL)          29.41
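The two integration manners can be sketched as below. The exact parameterization of the gate is not given in this excerpt, so the per-dimension sigmoid gate is an illustrative assumption; `lam` follows the λ = 0.3 setting chosen in Sect. 5.2:

```python
import numpy as np

def linear_interpolation(p_model, p_template, lam=0.3):
    """LI: fixed-weight mixture of the decoder's distribution and the
    template branch's distribution (lambda = 0.3 as in Sect. 5.2)."""
    return (1 - lam) * p_model + lam * p_template

def gate_learning(h_model, h_template, W, b):
    """GL sketch: a learned sigmoid gate decides, per dimension, how much
    template information to mix in. W (2d x d) and b (d) would be trained
    jointly with the model; here they are placeholders."""
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([h_model, h_template]) @ W + b)))
    return g * h_model + (1.0 - g) * h_template
```

LI mixes two valid probability distributions into another valid distribution with a fixed weight, while GL lets the network learn, per dimension, how much template information to admit, which matches the observation that GL performs better.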
666
F. Li et al.
As shown in Table 2, both the LI and GL methods improve the translation quality, and the GL method performs better than the LI method. We argue that the GL method exceeds the LI method at learning the fusion of template information through neural networks, leading to superior results. Both methods succeed in learning the representation of the target template and achieve the goal of guiding the decoder with it. It can be concluded that this way of using templates exploits them more effectively. Therefore, the GL method is chosen for the other experiments in this study due to its good performance.
5.4 Comparison of Different NMT Models
On the IWSLT14 De → En and WMT14 En → De translation tasks, we compare the proposed technique with several alternative NMT models, including the typical encoder-decoder framework GNMT, the RNN-based NMT model RNMT+, and so on. As shown in Table 3, for the IWSLT14 De → En task, the ST-NMT method obtains a 35.24 BLEU score, the best result, while our approach outperforms the strong baseline by +0.82 BLEU and achieves performance comparable to ST-NMT. As for the WMT14 (En → De) translation task, our approach outperforms the baseline, and its BLEU score is only slightly lower than that of ST-NMT. However, the ST-NMT approach uses a two-pass manner, which increases the computation, while our approach is simpler and effective.

Table 3. BLEU scores of different machine translation approaches on the IWSLT14 (De → En) and WMT14 (En → De) translation tasks.

Method            IWSLT14 (De → En)   WMT14 (En → De)
GNMT [2]          31.44               24.61
RNMT+ [27]        31.51               28.49
ConvS2S [3]       30.41               25.16
Rerank-NMT [28]   34.82               27.81
ST-NMT [11]       35.24*              29.68*
Transformer [5]   34.39               28.40
Our Approach      35.23               29.42
5.5 Performance on Different Layers
Previous research has shown that different layers may capture different information [26]. Consequently, in order to evaluate the performance of the proposed method on different layers, some experiments were conducted on the IWSLT14 De → En translation task, and we present the results in Table 4. For integrating the target template on different layers, raising the layers from bottom to top does not considerably enhance the performance of
Table 4. Performance of different integrated layers on IWSLT14 (De → En) with the BLEU metric.

Layer        Test set   Layer       Test set
Baseline     34.49      /           /
1 (bottom)   34.52      6 (upper)   35.13
2            34.59      1–2         34.47
3            34.68      1–3         34.46
4            34.78      4–6         35.21
5            34.95      5–6         35.23*
this model, and there is even a slight decrease in some cases, e.g., 1–2. It can be seen that template integration at the top levels tends to perform better than at the bottom levels. The best performance among all settings comes from incorporating the target template information into the upper layers (5–6), which improves the BLEU score by 0.84 over the baseline. This is consistent with the conclusion that the bottom layers are skewed toward semantics, whereas the upper layers are biased toward context information [29]. From a modeling perspective, the template of the target language provides the global structural information of a sentence, and it can be seen as a constraint on generating words in the decoder, thereby improving the translation quality. Therefore, the top two layers (5–6) are selected to incorporate the template information into the decoder in the remaining experiments in this paper.
5.6 Case Study
Table 5. A Chinese-English translation example of the proposed method.

Source                   ni dedao le gengduo de qian, yinwei nide maoyitiaojian yousuo gaishan, dan zhe ye tuidong le quanmian de chanchu
Source Template          ni VP, PP, IP
Target                   You get more money because your terms of trade have improved, but also that drives up output across the board
Target Template          You VP, but also that VP
Test sentence (Source)   nimen qude le fengshuo de chengguo, yinwei nimen fuchu le henduo, dan zhe yeshi lisuodangran de jieguo
Test template (Source)   nimen VP, PP, IP
Transformer (Baseline)   You have achieved fruitful results because you have paid a lot, but it is a natural result too
Our Approach             You have achieved fruitful results because you have paid a lot, but also it is a natural result
The translation example shown in Table 5 demonstrates the benefits of incorporating the target template representation into the NMT model. The target template “You VP, but also that VP” from the training set is used to guide the NMT model. Note that while the phrase “but also that” exists in the target template, the word “that” does not appear in the final translation. This shows that our approach can learn to filter noise in the template and produce a more accurate final translation, rather than simply copying the template. In other words, our approach learns from the template and improves upon it in the final translation.
6 Conclusions
Inspired by the process of sentence translation carried out by human translators, this paper proposes an approach that utilizes a target translation template to improve the Transformer-based NMT model. Additionally, we propose a general method for extracting templates and a fuzzy matching strategy to search for the appropriate template for a given sentence. Subsequently, we incorporate the target template into the model to guide the decoding process. We conducted experiments to demonstrate the effectiveness of our proposed approach. There are many areas for future research, such as exploring how to utilize templates for clauses in complex sentences and incorporating templates of translated fragments into the NMT model.
Acknowledgement. This work was supported by the National Natural Science Foundation of Liaoning Province, China (Grant nos. 2021-YKLH-12, 2022-YKLH-18), the Scientific Research Foundation of Liaoning Province (Grant no. LJKQZ2021184), and the High-level Talents Research Project of Yingkou Institute of Technology (Grant no. YJRC202026).
References
1. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
2. Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)
3. Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: International Conference on Machine Learning, pp. 1243–1252. PMLR (2017)
4. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
5. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
6. Nagao, M.: A framework of a mechanical translation between Japanese and English by analogy principle. Artif. Hum. Intell. 351–354 (1984)
7. Carl, M.: Inducing translation templates for example-based machine translation. In: Proceedings of Machine Translation Summit VII, pp. 250–258 (1999)
8. Duan, N., Tang, D., Chen, P., Zhou, M.: Question generation for question answering. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 866–874 (2017)
9. Wang, K., Quan, X., Wang, R.: BiSET: bi-directional selective encoding with template for abstractive summarization. arXiv preprint arXiv:1906.05012 (2019)
10. Wiseman, S., Shieber, S.M., Rush, A.M.: Learning neural templates for text generation. arXiv preprint arXiv:1808.10122 (2018)
11. Yang, J., Ma, S., Zhang, D., Li, Z., Zhou, M.: Improving neural machine translation with soft template prediction. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5979–5989 (2020)
12. Shang, W., Feng, C., Zhang, T., Xu, D.: Guiding neural machine translation with retrieved translation template. In: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE (2021)
13. Kaji, H., Kida, Y., Morimoto, Y.: Learning translation templates from bilingual text. In: COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics (1992)
14. Liu, Y., Liu, Q., Lin, S.: Tree-to-string alignment template for statistical machine translation. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pp. 609–616 (2006)
15. Zhang, M., Jiang, H., Aw, A., Li, H., Tan, C.L., Li, S.: A tree sequence alignment-based tree-to-tree translation model. In: Proceedings of ACL-08: HLT, pp. 559–567 (2008)
16. Quirk, C., Menezes, A., Cherry, C.: Dependency treelet translation: syntactically informed phrasal SMT. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pp. 271–279 (2005)
17. Khan, M.A.S., Yamada, S., Nishino, T.: Example-based machine translation for low-resource language using chunk-string templates. In: Proceedings of Machine Translation Summit XIII: Papers (2011)
18. Zhang, J., Utiyama, M., Sumita, E., Neubig, G., Nakamura, S.: Guiding neural machine translation with retrieved translation pieces. arXiv preprint arXiv:1804.02559 (2018)
19. Dinu, G., Mathur, P., Federico, M., Al-Onaizan, Y.: Training neural machine translation to apply terminology constraints. arXiv preprint arXiv:1906.01105 (2019)
20. Duan, S., Zhao, H., Zhang, D., Wang, R.: Syntax-aware data augmentation for neural machine translation. arXiv preprint arXiv:2004.14200 (2020)
21. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J.R., Bethard, S., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60 (2014)
22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
23. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
24. Ott, M., et al.: Fairseq: a fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038 (2019)
25. Koehn, P., et al.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pp. 177–180 (2007)
26. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015)
27. Chen, M.X., et al.: The best of both worlds: combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849 (2018)
28. Liu, L., Utiyama, M., Finch, A., Sumita, E.: Agreement on target-bidirectional neural machine translation. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 411–416 (2016)
29. Anastasopoulos, A., Chiang, D.: Tied multitask learning for neural speech translation. arXiv preprint arXiv:1802.06655 (2018)
Chinese Named Entity Recognition Based on Multi-feature Fusion Zhenxiang Sun1,2 , Runyuan Sun1,2(B) , Zhifeng Liang2 , Zhuang Su1,2 , Yongxin Yu1,2 , and Shuainan Wu1 1 School of Information Science and Engineering, University of Jinan, Jinan 250022, China
[email protected] 2 Shandong Provincial Key Laboratory of Network Based Intelligent Computing,
University of Jinan, Jinan 250022, China
Abstract. Pre-trained language models usher in a new era of named entity recognition, but additional relevant knowledge is needed to improve their performance on specific problems. In particular, in Chinese government named entity recognition, most entities are lengthy and have vague boundaries, and this uncertainty in entity length and boundaries makes entities difficult to recognize or causes them to be recognized incorrectly. To address this problem, this paper proposes a Chinese named entity recognition model based on multi-feature fusion, in which lexical features, word boundary features and pinyin features are fused through a multi-headed attention mechanism to enhance the model’s semantic representation of government texts. Meanwhile, this paper also studies the contribution of different features to entity recognition and finds that pinyin features have unique advantages in recognising government entities. This study provides new ideas and methods for the research and application of Chinese governmental entity recognition, and also offers some insights into named entity recognition in other language domains. The experimental results show that the model proposed in this paper performs better than the baseline models. Keywords: Government Entity Recognition · Multi-feature Fusion · Multi-headed Attention Mechanism · Semantic Representation
1 Introduction
With the emergence and development of pre-trained language models such as BERT [1] and XLNet [2], significant progress has been made in named entity recognition tasks. By using large amounts of pre-trained data, these models are able to extract more accurate semantic features and thus perform better in downstream tasks. However, there are still some intractable problems in named entity recognition. The majority of entities in Chinese government entity recognition tasks are lengthy and have boundaries that are difficult to determine. This is mainly due to the peculiarities of the Chinese language, where there are no spaces or clear separators between words. In addition, government entities often consist of multiple words or phrases and may share similar parts with other entity names, making entity boundaries not easy to determine or even ambiguous. Figure 1 provides a © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 670–681, 2023. https://doi.org/10.1007/978-981-99-4752-2_55
Chinese Named Entity Recognition Based on Multi-feature Fusion
671
simple example where a government text features two entities. The first entity is “Dean of the School of Computer Science and Engineering”, which falls into the position category. The second entity is “School of Computer Science and Engineering”, which falls under the institution category. It’s worth noting that, given the specific nature of entity recognition in government texts, there is a possibility of misidentification. For example, the term “Dean of the School of Computer Science and Engineering” could be mistakenly identified as “School of Computer Science and Engineering” or something else entirely. In addition, the ambiguity of entity boundaries and the semantic similarity between “School of Computer Science and Engineering” and “Computer Science and Engineering” in context can also lead to errors in entity identification.
Fig. 1. The following is a typical example from our dataset: ‘Introduction to the School of Computer Science and Engineering by the Dean of the School of Computer Science and Engineering.’ In this example, entity identification may result in multiple ambiguities due to the lengthy and ambiguous boundaries of the entities.
Currently, most pre-trained language models have difficulty dealing with the above problems. As pre-trained language models are usually trained on large-scale corpora, these corpora are more formal and relatively less diverse than government texts. As a result, the language patterns learned by these pre-trained language models mainly focus on the semantic domain and lack some very important information, such as lexical features, word boundary features and phonetic features. This information is particularly important for the task of government text entity recognition, as it can help the models understand the text content more accurately and thus improve the accuracy of entity recognition. In addition, pre-trained language models often have difficulty capturing the complex relationships between entities and their diverse type features when dealing with government text entities, which further exacerbates the difficulty of applying pre-trained models in this area. For example, the entities contained in government texts may cover a wide range of types, such as names of people, places, organisations and terminology, and each type has its own unique linguistic features and contextual information. Therefore, pre-trained models need to learn semantic features that better reflect the contextual information in order to adapt to the processing needs of government text entities. This paper proposes a Chinese Named Entity Recognition model based on Multi-Feature Fusion (MFF-CNER), which aims to solve the problem of lengthy entities and ambiguous boundaries in Chinese government entity recognition tasks. The method fuses lexical features, word boundary features and pinyin features to enhance the semantic representation of the model and thus deal with such problems more effectively.
The experimental results show that the model achieves significant performance improvements on two public datasets and one private government dataset. This demonstrates that
672
Z. Sun et al.
multi-feature fusion can effectively improve the overall performance of the NER model and solve some challenging problems in the Chinese government entity recognition task.
2 Related Work
Named Entity Recognition (NER) is a key task in natural language processing. It involves identifying entities in text that have a specific meaning, such as a person’s name, a place, an organization, etc. Over the past few decades, the NER approach has evolved from rule-based and dictionary-driven methods to statistical machine learning. In recent years, with the rapid development of deep learning, researchers have also been exploring more advanced techniques to further improve the performance of named entity recognition. Collobert et al. [3] proposed a deep neural network multitask learning method that treats different natural language processing tasks as a whole, sharing feature representations and parameters, thus improving model generalization and training efficiency. Chiu et al. [4] proposed a new named entity recognition network architecture, which uses a hybrid of bidirectional LSTM and CNN to automatically detect word- and character-level features, reducing feature engineering. Lewis et al. [5] applied the paradigm of large-scale unsupervised pre-training to tasks such as text translation, text summarization and dialogue generation, so that the generated models can take advantage of large-scale pre-training. Lample et al. [6] proposed a neural-network-based named entity recognition model, which uses word embeddings, character embeddings and lexical embeddings to learn text representations, and uses a multi-layer bidirectional LSTM network to model text. The model proved its effectiveness and robustness in named entity recognition. Ma et al. [7] proposed an end-to-end sequence labeling model, which combines the context features and local features learned by BiLSTM and CNN, uses a CRF to solve the label dependency problem, and achieves excellent performance in various sequence labeling tasks.
In recent years, the field of natural language processing has entered a new phase of development with the emergence of language models such as the Transformer and BERT. Vaswani et al. [8] proposed the Transformer model based on the self-attention mechanism, which eliminates the need for traditional RNN or CNN structures and allows parallel computation, significantly improving training speed. Cui et al. [9] pre-trained the model using a whole-word masking strategy, so that the model learns to solve a more challenging task than predicting word components. Sun et al. [10] proposed the ERNIE model, which adopts three masking strategies, namely character-level, phrase-level and entity-level masking, to enhance the ability to capture multi-granularity semantics. Sun et al. [11] proposed a new Chinese pre-training model that enhances the representational ability of the Chinese language by using both glyph and pinyin information in training. Yang et al. [12] adopted the “four-corner” coding method to introduce the symbol features of Chinese characters into the task of Chinese named entity recognition. Li et al. [13] proposed a Chinese named entity recognition method that integrates semantic, morphological and phonetic features in the model and addresses the problem of Chinese character substitution by representing structural features through Chinese character components. Meanwhile, in order to solve the problem of boundary detection in Chinese entity recognition and enhance the robustness of entity recognition models, some researchers
started to explore different methods. Chun et al. [14] proposed an enhancement method for boundary detection in Chinese named entity recognition, which adds a graph attention network layer to capture the dependency relationships between words and treats predicting the first and last words of entities as an auxiliary task during training. Liu et al. [15] proposed to directly integrate lexical information between BERT’s Transformer layers; integrating lexical features inside BERT allows a more complete interaction with the encoding layers. Geng et al. [16] proposed two uncertainty measures, namely Monte Carlo (MC) dropout and the Top-k tag sequence, to improve retrieval efficiency and accuracy. They select the most uncertain entity-level components in the input text and then re-predict them using a knowledge fusion model combined with the retrieved knowledge. Zheng et al. [17] embed character information and lexical information into the BERT model to improve the model’s ability to identify entity types; they then encode the input with a Transformer model that combines positional encoding and lexical boundary information to improve the model’s ability to identify entity boundaries. Guo et al. [18] proposed a named entity recognition model for agricultural diseases and pests based on prefix embedding and a self-attention mechanism. In this model, three feature extraction strategies were designed, and prefix embeddings were integrated into character embeddings as input to enrich the semantic information, while the self-attention mechanism is used to further extract longer-distance dependencies. Wu et al. [19] proposed an attention mechanism based on a lattice structure and a specific embedding method to deal with the cross-entity boundary problem in Chinese text. The experimental results show that the NFLAT model can significantly improve the performance of Chinese named entity recognition, especially in cross-entity boundary cases.
3 MFF-CNER Model
3.1 Model Overview
The model structure is shown in Fig. 2. First, the pre-trained model BERT is used as the encoder for semantic encoding. For an input Chinese text sequence of length n, the sequence is transformed into the corresponding vocabulary index encoding X = (x_1, x_2, ..., x_i, ..., x_n). Each character x_i is mapped by the encoder to an embedding vector h_i. These embedding vectors are then combined into a feature matrix as the encoded representation of the entire text sequence, denoted as the feature vector h.

h_i = encoder(x_i)  (1)
h = (h_1, ..., h_i, ..., h_n)  (2)

After the text sequence is obtained, the pinyin sequence of the corresponding characters is obtained by data pre-processing, and the pinyin sequence is transformed into the corresponding index sequence via the pinyin vocabulary, denoted L_py = (l_py^1, l_py^2, ..., l_py^i, ..., l_py^n). The vector e_py is obtained through the pinyin embedding layer.

e_py^i = e_py(l_py^i)  (3)
e_py = (e_py^1, ..., e_py^i, ..., e_py^n)  (4)
Fig. 2. Overview of the MFF-CNER. This is a typical example in our dataset, “Introduction to the School of Computer Science and Engineering by the Dean of the School of Computer Science and Engineering”, where the entity identification results in multiple ambiguities due to the lengthy and ambiguous boundaries of the entities.
While obtaining the pinyin sequence information, the text sequence is segmented into Chinese words with the Jieba toolkit, giving the lexical sequence corresponding to the text sequence; its length is denoted m, and the sequence is denoted W = (w_1, w_2, ..., w_i, ..., w_m). From the lexical sequence, the lexical feature sequence and the word boundary feature sequence corresponding to the text sequence are obtained and transformed into the corresponding index sequences via the corresponding vocabularies, denoting the lexical features L_pos = (l_pos^1, l_pos^2, ..., l_pos^i, ..., l_pos^n) and the word boundary features L_b = (l_b^1, l_b^2, ..., l_b^i, ..., l_b^n). The vectors e_pos and e_b are obtained through the lexical embedding layer and the word boundary embedding layer, respectively.

e_pos^i = e_pos(l_pos^i)  (5)
e_pos = (e_pos^1, ..., e_pos^i, ..., e_pos^n)  (6)
e_b^i = e_b(l_b^i)  (7)
e_b = (e_b^1, ..., e_b^i, ..., e_b^n)  (8)

The pinyin features e_py, lexical features e_pos and word boundary features e_b are fused through a multi-headed attention mechanism.

e_fuse = M-Attention(e_pos, e_py, e_b)  (9)

Feature overlay fusion is then performed: the fused features are superimposed on the encoder output and fitted through a fully connected layer to enhance the feature fusion of the model.

o = w_o(e_fuse + h) + b_o  (10)
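The fusion in Eq. (9) can be illustrated with single-head scaled dot-product attention in NumPy. The paper uses a multi-head version, and which feature serves as queries versus keys/values is not specified in this excerpt, so treating e_pos as the query and e_py/e_b as key-value sources is an assumption for illustration:

```python
import numpy as np

def attention_fuse(e_pos, e_py, e_b):
    """Single-head scaled dot-product sketch of Eq. (9)."""
    kv = np.stack([e_py, e_b], axis=1)                        # (n, 2, d)
    scores = np.einsum('nd,nsd->ns', e_pos, kv) / np.sqrt(e_pos.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax over sources
    return np.einsum('ns,nsd->nd', weights, kv)               # (n, d) fused
```

Each fused vector is a convex combination of the pinyin and boundary embeddings at that position, weighted by their affinity to the lexical query.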
Finally, the fused features are fed into the BiLSTM layer to obtain the long-distance dependency information in the text sequence, again constraining the feature fusion of the model. The output of the BiLSTM layer is then fed into the CRF layer for sequence decoding, yielding the decoded sequence y corresponding to the input sequence.

c = BiLSTM(o)  (11)
y = decoder(c)  (12)
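The CRF decoding step in Eq. (12) is typically implemented with the Viterbi algorithm. A minimal NumPy sketch follows; in the real model the label-to-label transition scores are learned parameters of the CRF layer:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Best label path under a linear-chain CRF.
    emissions: (n, k) per-token label scores from the BiLSTM;
    transitions: (k, k) scores for moving from one label to the next."""
    n, k = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + transitions    # (k, k): prev label -> next label
        backptr[t] = cand.argmax(axis=0)       # best predecessor per next label
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):              # follow back-pointers
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]
```

With zero transition scores this reduces to per-token argmax; the learned transitions are what let the CRF forbid invalid tag sequences such as an I-tag following O.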
3.2 Integration Strategies
Feature fusion is widely used in entity recognition tasks; it can improve the performance and robustness of models by fusing different feature sources. We compared three fusion methods: point-wise addition, concatenation, and multi-head attention fusion, all of which can fuse features of different kinds together. Point-wise addition is a simple feature fusion method that adds up the embedding parts of the different features to obtain a new embedding vector. Specifically, the embedding vectors e_py, e_b and e_pos of the three features in this paper are added dimension-wise to obtain a new embedding vector. Although additive fusion is very simple, it excels at certain tasks because it can combine information from multiple feature sources.

e_fuse = e_pos + e_py + e_b  (13)

Concatenation is another common feature fusion method widely used in entity recognition tasks. This method stitches together the embedding parts of the different features to form a longer embedding vector. Concatenation fusion can integrate information from different feature sources, but as the dimensionality of the resulting embedding vector is often high, it can easily lead to overfitting.

e_fuse = concat(e_pos, e_py, e_b)  (14)

As shown in Fig. 3, multi-head attention fusion is a feature fusion method based on an attention mechanism, which maps the embedding parts of different features into a high-dimensional space and then fuses them via attention weights. Specifically, multi-head attention fusion projects each embedding vector into different subspaces, performs attention computations within each subspace, and finally takes the attention-weighted sum of all subspaces as the final embedding vector. Multi-head attention fusion can make full use of information from different features while avoiding the high-dimensionality problem of concatenation fusion.
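The first two strategies, Eqs. (13) and (14), can be sketched in a few lines of NumPy; the embedding dimensions here are illustrative placeholders, not the model's actual sizes:

```python
import numpy as np

def fuse_add(e_pos, e_py, e_b):
    # Eq. (13): point-wise addition keeps the embedding dimension unchanged
    return e_pos + e_py + e_b

def fuse_concat(e_pos, e_py, e_b):
    # Eq. (14): concatenation triples the dimension, which can invite overfitting
    return np.concatenate([e_pos, e_py, e_b], axis=-1)
```

The dimensionality difference is the practical trade-off: addition stays compact but can blur the sources, while concatenation preserves them at the cost of a much wider downstream layer.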
4 Experiments To demonstrate the impact of different features and fusion strategies on model performance, we conducted a series of experiments and evaluated our approach from several different perspectives. We used the standard Precision, Recall and F1-score as evaluation metrics to fully assess the performance of the model in all aspects.
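The precision, recall and F1 metrics used above can be sketched at the entity level. Whether the paper scores at entity or token level is not stated in this excerpt, so exact-span matching is an assumption here:

```python
def entity_prf(gold, pred):
    """Entity-level precision/recall/F1.
    gold and pred are sets of (start, end, type) spans; an entity counts as
    correct only if both its boundaries and its type match exactly."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Under this scheme a prediction with a boundary off by one character counts as a full error, which is exactly why lengthy, vaguely bounded government entities depress these scores.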
Fig. 3. A multi-headed attentional approach to fusing phonetic features, word boundary features and lexical features.
After experimenting with two public datasets and one private dataset, our model was fully evaluated, which increases the reliability and generalisability of the results. Overall, we demonstrate that our approach is stronger than other mainstream models in the identification of named entities for government services.
4.1 Dataset
We first conducted experiments on two publicly available datasets, Resume [20] and MSRA [21]. The Resume dataset is a dataset of CV abstracts of senior managers of listed companies, filtered and manually annotated by Sina Finance, containing 1027 CV abstracts, with entity annotations divided into eight categories: personal name, nationality, origin, race, profession, degree, institution and title. The MSRA dataset is an entity recognition dataset in the field of journalism annotated by Microsoft Research Asia, containing over 50,000 Chinese entity recognition annotations, with entity categories divided into people, places and organisations. Both datasets have a strong logical pattern and fit well with the characteristics of government texts. We also constructed our own dataset, UATD, in which we used government text data to annotate named entities into categories such as people, organisations, positions and documents, in order to more fully validate the boundary recognition capability of our proposed approach when dealing with long entity texts. By using this dataset, we can more objectively evaluate the performance of our proposed model and thus further refine and optimise it. Detailed statistics for the three datasets are shown in Table 1.

Table 1. Overview of the datasets.

Dataset   Type        Category   Train   Test   Dev
Resume    Sentences   8          3821    463    477
MSRA      Sentences   3          41839   4525   4365
UATD      Sentences   4          1588    387    397
4.2 Experimental Settings In this study, we adopted a Chinese named entity recognition model based on the feature fusion framework of the multiple attention mechanism, combined with the common sequence labeling model Bert-BiLSTM-CRF. The model has excellent results, wide applicability and robustness, which has been verified in several experimental studies. The focus of this study is to test the efficiency of multi-feature fusion embedding and to explore multiple fusion strategies to further improve the performance of the model. Table 2 lists some of the hyperparameters used in the experimental training phase. In the process of model parameter optimization, these hyperparameters are selected after repeated experiments and adjustments in order to obtain better model performance. Table 2. Model hyperparameters. Items
Item                 Value
batch size           16
epochs               50
lr                   1e-5
optimizer            Adam
dropout              0.3
early stop           5
pos_embedding        768
pinyin_embedding     768
boundary_embedding   768
lstm layers          1
hidden dim           256
seq_length           128
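Table 2 gives an early-stop value of 5, which reads as a patience count. As a hedged sketch (the paper does not specify the monitored metric or the exact stopping rule, so validation F1 and reset-on-improvement are assumptions), the logic can be expressed as:

```python
class EarlyStopping:
    """Stop training once the monitored metric has not improved
    for `patience` consecutive epochs (patience = 5 in Table 2)."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric; return True when training should stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0  # counter resets on any improvement
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=5)
# A stagnating validation F1 triggers the stop after five non-improving epochs.
history = [0.90, 0.91, 0.91, 0.91, 0.91, 0.91, 0.91]
stopped_at = next(i for i, f1 in enumerate(history) if stopper.step(f1))
print(stopped_at)  # 6
```
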
5 Results and Discussion

In this paper, we conduct two experimental studies: the first evaluates the performance of our proposed model, and the second compares different fusion strategies. We analyse the experimental results in depth to give a complete picture of our research work.

5.1 Overall Performance

To test the overall performance of the model, we refer to the BERT-BiLSTM-CRF model as the "base model" and fuse different features into it; the lexical features, word boundary features and phonetic features are denoted as
Z. Sun et al.
POS, WBF and PYF respectively. To control the variables, we adopt the attention-based fusion strategy throughout this comparison; to illustrate the effectiveness of our proposed multi-feature fusion strategy and its advantages over other fusion strategies, we design corresponding experiments and explore this in depth below. Tables 3, 4 and 5 show the performance of the feature-embedding models on the different datasets. The results show that the MFF-CNER model achieves better results by progressively fusing multiple features, attaining the best performance on all three datasets, in line with our expectations.

Table 3. Experimental results on MSRA.
Models                    Precision   Recall   F1
Base model                89.52       90.12    89.82
Base model (POS)          89.65       90.26    89.94
Base model (WBF)          89.58       90.20    89.88
Base model (PYF)          89.78       90.87    90.32
Base model (POS + WBF)    89.70       90.66    90.18
Base model (POS + PYF)    89.76       90.96    90.36
Base model (WBF + PYF)    89.68       91.20    90.43
MFF-CNER (ours)           89.81       91.35    90.57
Table 4. Experimental results on Resume.

Models                    Precision   Recall   F1
Base model                94.81       94.11    94.46
Base model (POS)          94.90       94.23    94.56
Base model (WBF)          94.98       94.36    94.67
Base model (PYF)          95.30       95.02    95.16
Base model (POS + WBF)    95.21       94.76    94.98
Base model (POS + PYF)    95.44       95.28    95.36
Base model (WBF + PYF)    95.40       95.30    95.35
MFF-CNER (ours)           95.71       95.77    95.74
On the MSRA dataset, the F1 score of the base model is 89.82%. On this basis, we fused a single lexical feature, word boundary feature or pinyin feature into the base model, obtaining F1 scores of 89.94%, 89.88% and 90.32% respectively. Compared to the base model, all three single-feature models improve F1 to varying degrees. Interestingly, the model incorporating pinyin features gives the highest boost among the single-feature fusions, even 0.14% higher than the two-feature model combining lexical and word boundary features. By analysing
Table 5. Experimental results on UATD.

Models                    Precision   Recall   F1
Base model                84.75       80.07    82.23
Base model (POS)          84.82       80.27    82.48
Base model (WBF)          84.86       80.26    82.50
Base model (PYF)          85.72       81.30    83.45
Base model (POS + WBF)    85.46       81.08    83.21
Base model (POS + PYF)    85.96       81.62    83.73
Base model (WBF + PYF)    86.23       81.50    83.80
MFF-CNER (ours)           86.78       81.96    84.29
the text content, we found that where entities share the same pronunciation but differ in meaning, pinyin features help the model disambiguate them. We also found that fusing any two features improves F1 over the base model. Finally, MFF-CNER, which incorporates all of the features, improves Precision, Recall and F1 over the base model by 0.29%, 1.23% and 0.75% respectively. Notably, the performance trend on the Resume dataset almost mirrors that on MSRA, which provides further evidence that the model generalises well across datasets.

The performance trend of MFF-CNER on the private UATD dataset is broadly consistent with its trend on the open-source datasets. Compared to the base model, MFF-CNER improves Precision, Recall and F1 substantially, by 2.03%, 1.89% and 2.06% respectively. This result illustrates two points: first, the private dataset we constructed is suitable for academic research; second, our proposed model is effective and advantageous on the Chinese government named entity recognition task, achieving state-of-the-art performance (Fig. 4).
Fig. 4. MFF-CNER has found the correct boundary and identified the exact entity
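The F1 values in Tables 3-5 are the harmonic mean of precision and recall. As a quick sanity check, the MFF-CNER and base-model rows of Table 3 can be reproduced from their P and R entries:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in percent)."""
    return 2 * precision * recall / (precision + recall)

# MFF-CNER on MSRA (Table 3): P = 89.81, R = 91.35
print(round(f1_score(89.81, 91.35), 2))  # 90.57
# Base model on MSRA: P = 89.52, R = 90.12
print(round(f1_score(89.52, 90.12), 2))  # 89.82
```
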
680
Z. Sun et al.
5.2 Comparison of Fusion Strategies

Table 6 shows the results of the different fusion strategies on all datasets. The attention fusion strategy achieves the best performance on every dataset, although the effectiveness of the various strategies differs across datasets. On Resume and MSRA, all three fusion strategies slightly improve the models' F1 scores. On the UATD dataset, however, the attention fusion strategy stands out, with a larger F1 improvement than the other two strategies. Considering all three strategies, the feature fusion model based on the multi-head attention mechanism has the best overall performance.

Table 6. Comparison results of multiple fusion strategies.
Fusion method   MSRA                        Resume                      UATD
                Precision   Recall   F1     Precision   Recall   F1     Precision   Recall   F1
Add             88.40       90.26    89.32  94.63       95.22    94.92  84.92       80.20    82.49
Concatenate     88.64       90.66    89.64  94.98       95.63    95.30  85.35       80.40    82.80
Attention       89.81       91.35    90.57  95.71       95.77    95.74  86.78       81.96    84.29
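The three fusion strategies compared in Table 6 can be sketched as follows. This is a hedged illustration, not the authors' implementation: the attention variant is shown single-headed with no learned projections (the paper uses a multi-head mechanism), and all shapes are illustrative.

```python
import numpy as np

def fuse_add(h, f):
    """Element-wise addition: both feature matrices must share shape (n, d)."""
    return h + f

def fuse_concat(h, f):
    """Concatenation along the feature axis: (n, d1), (n, d2) -> (n, d1 + d2)."""
    return np.concatenate([h, f], axis=-1)

def fuse_attention(h, f):
    """Single-head attention fusion sketch: context features h act as queries,
    auxiliary features f as keys and values."""
    d = h.shape[-1]
    scores = h @ f.T / np.sqrt(d)                    # (n, n) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ f                               # (n, d) fused output

rng = np.random.default_rng(0)
h = rng.normal(size=(128, 768))   # e.g. BERT token features
f = rng.normal(size=(128, 768))   # e.g. pinyin feature embeddings
print(fuse_add(h, f).shape, fuse_concat(h, f).shape, fuse_attention(h, f).shape)
```

Note that addition requires matching dimensions, concatenation doubles the downstream input width, and attention keeps the width while reweighting the auxiliary features per token.
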
6 Conclusion

This paper proposes a new Chinese named entity recognition model based on multi-feature fusion (MFF-CNER). The model incorporates word boundary features, Chinese part-of-speech features and pinyin features, and integrates them using a multi-head attention mechanism. This approach provides additional prior knowledge for the pre-trained language model and enhances its semantic representation of government-affairs text, thus achieving superior results on Chinese government-affairs entity recognition tasks.

In future work, we will continue to explore how to further improve the performance of the multi-feature-fusion Chinese named entity recognition model. On the one hand, we will try richer features to improve recognition accuracy, such as syntactic features and glyph (font) features. On the other hand, we will explore introducing more unsupervised learning methods into model training to improve the model's generalization and adaptability. In addition, we will explore applying Chinese named entity recognition to a wider range of fields, such as finance, healthcare and journalism, where entity recognition plays an important role in information extraction, text classification, machine translation and other tasks. We therefore hope to develop a more efficient and accurate entity recognition model to better support these application scenarios.
References

1. Kenton, J.D.M.W.C., Toutanova, L.K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, vol. 1, p. 2 (2019)
2. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
3. Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning, pp. 160–167 (2008)
4. Chiu, J.P., Nichols, E.: Named entity recognition with bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist. 4, 357–370 (2016)
5. Lewis, P., et al.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474 (2020)
6. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016)
7. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354 (2016)
8. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
9. Cui, Y., et al.: A span-extraction dataset for Chinese machine reading comprehension. arXiv preprint arXiv:1810.07366 (2018)
10. Sun, Y., et al.: ERNIE: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223 (2019)
11. Sun, Z., et al.: ChineseBERT: Chinese pretraining enhanced by glyph and pinyin information. arXiv preprint arXiv:2106.16038 (2021)
12. Yang, J., Wang, H., Tang, Y., Yang, F.: Incorporating lexicon and character glyph and morphological features into BiLSTM-CRF for Chinese medical NER. In: 2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE), pp. 12–17. IEEE (2021)
13. Li, J., Meng, K.: MFE-NER: multi-feature fusion embedding for Chinese named entity recognition. arXiv preprint arXiv:2109.07877 (2021)
14. Chen, C., Kong, F.: Enhancing entity boundary detection for better Chinese named entity recognition. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 20–25 (2021)
15. Liu, W., Fu, X., Zhang, Y., Xiao, W.: Lexicon enhanced Chinese sequence labeling using BERT adapter. arXiv preprint arXiv:2105.07148 (2021)
16. Geng, Z., Yan, H., Yin, Z., An, C., Qiu, X.: TURNER: the uncertainty-based retrieval framework for Chinese NER. arXiv preprint arXiv:2202.09022 (2022)
17. Zheng, L., Ren, L.: Named entity recognition in the domain of nutrition and health using fusion rules and the BERT-FLAT model. Trans. Chin. Soc. Agric. Eng. 37(20) (2021)
18. Guo, X., Tang, Z., Diao, L., Zhou, H., Li, L.: Named entity recognition of pests and diseases based on radical insertion and attention mechanism. J. Agric. Mach. 51(S2), 335–343 (2020)
19. Wu, S., Song, X., Feng, Z., Wu, X.: NFLAT: non-flat-lattice transformer for Chinese named entity recognition. arXiv preprint arXiv:2205.05832 (2022)
20. Zhang, Y., Yang, J.: Chinese NER using lattice LSTM. arXiv preprint arXiv:1805.02023 (2018)
21. Levow, G.A.: The third international Chinese language processing bakeoff: word segmentation and named entity recognition. In: Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pp. 108–117 (2006)
Knowledge Graph Construction for Supply Chain Management in Manufacturing Industry Wenyan Chen1 , Lianglun Cheng1 , Tao Wang2(B) , and Jianfeng Deng1,3(B) 1 School of Computer Science and Technology, Guangdong University of Technology,
Guangzhou 510006, China [email protected] 2 School of Automation, Guangdong University of Technology, Guangzhou 510006, China [email protected] 3 School of Electrical Engineering, Guangzhou Railway Polytechnic, Guangzhou 510430, China
Abstract. Knowledge graph technology is crucial for enhancing supply chain management (SCM) in the manufacturing industry. However, existing SCM ontology knowledge suffers from coarse granularity, which reduces the accuracy of knowledge extraction and makes knowledge graph construction more challenging. A novel construction method for the SCM event logic knowledge graph (ELKG) is proposed to overcome these challenges. The method includes constructing an event logic ontology and annotating an SCM dataset based on that ontology. Meanwhile, a joint knowledge extraction model, FIBGN, based on a Bidirectional Graph Convolutional Network (BiGCN) with feature interaction between different spaces, is proposed. Experimental results show that this method improves the joint extraction of event argument entities and relations and outperforms other methods. Finally, the event logic knowledge graph of supply chain management in the large-scale manufacturing field is established, which provides decision support for the supply chain system.

Keywords: Event logic knowledge graph · Supply chain management · Event joint extraction
1 Introduction Supply Chain Management (SCM) has emerged as a popular approach for enterprises to enhance their competitiveness in an increasingly customer-driven and competitive market. In the knowledge-intensive manufacturing industry, SCM necessitates close collaboration among enterprises in the supply chain and maximal knowledge sharing in the supply chain [1–3]. Empirical research indicates that knowledge sharing and reuse between supply chain participants are crucial determinants of supply chain performance, strategically and operationally. Yu-Liang Chi [4] and Caiado et al. [5] suggested the combination of ontology and semantic rules, utilizing an ontological approach to design a knowledge model. Shokouhyar et al. [6] designed an expert system for reasoning and decision-making to assist users in solving complex problems in a specific domain. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 682–693, 2023. https://doi.org/10.1007/978-981-99-4752-2_56
The tremendous development of neural network technology has led to its wide use in knowledge graph problems such as knowledge acquisition [7] and price movement prediction [8]. The ELKG has strong practical value because it focuses on describing sequential or causal relationships [9]. Building an ELKG involves two essential tasks: ontology construction and event knowledge extraction. The diversity and complexity of data sources lead to heterogeneous and redundant data, which motivates building a domain ontology. Event knowledge extraction divides into event argument entity extraction (EAER) and event argument relation extraction (EARE). Because the relevant information is usually distributed across different texts and data sources, we develop a joint extraction method that completes both subtasks, improves the accuracy and integrity of the data, and ensures the quality of the event knowledge graph.

In recent years, much work has applied joint models to knowledge extraction tasks, outperforming pipeline methods by a large margin because joint models take the correlation between the two subtasks into account. Miwa and Bansal [10] proposed an end-to-end joint extraction model based on bidirectional tree-structured RNNs, using word-sequence information and the dependency tree structure. Fu et al. [11] proposed GraphRel, applying a 2nd-phase GCN to better handle overlapping relations. Whether for ontology construction or knowledge extraction, the problem of coarse granularity must be overcome. To extract fine-grained event logic knowledge, this paper uses artificial intelligence techniques to construct an ELKG for SCM. Firstly, the event ontology in the large-scale manufacturing industry lacks a consistent, structured description, resulting in knowledge extracted at a coarse granularity unsuitable for further integration.
Secondly, there is no reusable, semantically annotated dataset designed for this specific domain. Finally, coarse-grained feature extraction hinders the expression of semantic features, making knowledge extraction in this domain more challenging. Based on the above motivation, the contributions of this paper are as follows:

• For SCM cases in the large-scale manufacturing industry, a seven-step approach is used to construct an event ontology with trigger words as the core.
• Based on this ontology, we collect an SCM text corpus in the domain and define an annotation strategy to construct the dataset.
• To better perform knowledge extraction on the domain dataset for constructing the ELKG of SCM, a joint extraction model combining BiGCN with feature interaction across different spaces is proposed.
2 Methodology

2.1 Ontology Construction and the Labeling Strategy of the Dataset

This section explores how to construct an ontology model for SCM in the large-scale manufacturing industry. The seven-step method is currently widely used in ontology construction [12]. Its core steps include listing the essential terms in the field, defining classes, class properties and relationships, and defining property facets. The ontology construction proposed in this paper involves defining the concept classes of management events and their relationships.
W. Chen et al.
Each supply chain case is considered a management event. Formally, an event consists of three elements and can be represented as the tuple in Formula (1). Here, O represents the body (subject) of the event; T represents the trigger element of the event, corresponding to event trigger words in the text, such as "improvement" or "selection"; and S denotes the object dominated by the action of the event.

e = ⟨O, T, S⟩   (1)
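The tuple in Formula (1) maps naturally onto a small record type. The following is a minimal sketch; the field names `subject`, `trigger`, and `obj` are illustrative choices, not the authors' schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    """An SCM event e = <O, T, S> as defined in Formula (1)."""
    subject: str   # O: the body (subject) of the event
    trigger: str   # T: the trigger word, e.g. "improve", "select"
    obj: str       # S: the object dominated by the event's action

e = Event(subject="company", trigger="select", obj="supplier")
print(e.trigger)  # select
```

Using a frozen dataclass makes events hashable and comparable by value, which is convenient when deduplicating extracted triples.
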
The ontology diagram of SCM is shown in Fig. 1, where boxes represent classes and dashed lines represent relationships between classes. We categorize SCM management cases into two primary concept classes: organizational information and professional technology theory. Organizational information covers the essential details within the links of industry SCM, including industry, company, field, and problem. Professional technology theory decomposes into methods, models, and management index systems. Additionally, we establish the relationships between concept classes. The event argument relationships include Belong_to, Composed_of, Contrapose, Used_for, Has_attribute, and Appear; the primary event logical relationships are Followed and Lead_to. Finally, we use Protégé to construct the event logic ontology model of SCM cases.
Fig. 1. Structure of event logic ontology model for SCM
The domain dataset labeling process consists of three main stages. In the first stage, the SCM case event argument entities and relations are transformed into a sequence labeling task using the BIO labeling strategy. The second stage is normalized labeling, which addresses the multiple-semantic-word issues that arise when manually labeling data samples. The final stage is the formal labeling of the data: after resolving the issues found in the previous stage, four groups compared their labeling results in parallel to guarantee accuracy and consistency. To demonstrate our labeling method, Fig. 2 shows the labeling order and corresponding elements of a sample sentence: "Based on the hierarchical analysis method, we improve the supplier evaluation method and finally select the suitable supplier." The sentence contains several triples and three
event instances; the event arguments combine into event instances through event relations, and a 'Followed' relation holds between the events.
Fig. 2. The sample labeling strategy and its corresponding ontology diagram
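The BIO strategy of the first labeling stage can be sketched as a small conversion routine. The span format (start, end-exclusive, label) assumed below is for illustration only; the paper does not specify its exact annotation schema:

```python
def bio_tags(text, spans):
    """Convert character-level entity spans into BIO tags.

    `spans` is a list of (start, end, label) with `end` exclusive.
    Characters outside every span receive the tag "O".
    """
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # begin of entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # inside of entity
    return tags

sentence = "select the supplier"
# One trigger-word span and one object span (character offsets).
spans = [(0, 6, "Trigger_word"), (11, 19, "Object")]
print(bio_tags(sentence, spans))
```
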
2.2 Event Argument Entity and Relation Joint Extraction

Figure 3 shows the overall architecture of the proposed FIBGN. It consists of three main parts: the embedding layer, the two-phase feature interaction layer, and the BiGCN relation inference layer. First, the embedding layer processes the source text and embeds deep context features, and the entities of the first phase are predicted by the multi-head attention mechanism. The predicted entity labels and deep context information are then input to the feature interaction layer to complete feature fusion, after which the relation prediction of the first phase is completed. Next, BiGCN is introduced for relation inference, and graph convolution features are used to update the features for second-phase joint extraction. Finally, the second-phase predicted entities and deep context information are input to the feature interaction layer to complete the final relation prediction.

The Embedding Layer. We use Word2vec with the Skip-gram model, chosen because of the limited size of the training dataset. The input text is represented by the character sequence c = (c_1, c_2, ..., c_n); if a character vector is unavailable, a random value is assigned. The resulting set of character vectors is denoted by x = (x_1, x_2, ..., x_n). Then, we use stacked BiLSTM for contextual feature
embedding. The output of layer m is computed as:

i_t^(m) = σ(W_i^(m) [O_{t−1}^(m), x_t] + b_i^(m)),
f_t^(m) = σ(W_f^(m) [O_{t−1}^(m), x_t] + b_f^(m)),
o_t^(m) = σ(W_o^(m) [O_{t−1}^(m), x_t] + b_o^(m))   (2)

g_t^(m) = tanh(W_c^(m) [O_{t−1}^(m), x_t] + b_c^(m))   (3)

c_t^(m) = f_t^(m) ⊙ c_{t−1}^(m) + i_t^(m) ⊙ g_t^(m)   (4)

O_t^(m) = tanh(o_t^(m) ⊙ c_t^(m))   (5)

h_t^(m) = [→O_t^(m), ←O_t^(m)]   (6)

where σ is the sigmoid function and ⊙ denotes element-wise multiplication. W_i^(m), W_f^(m), W_o^(m) and b_i^(m), b_f^(m), b_o^(m) are the weight matrices and bias terms of the input, forget, and output gates, respectively. The deep context feature sequence h = (h_1, h_2, ..., h_n) is then obtained by the stacked BiLSTM.

To acquire character dependencies across multiple subspaces, we use a multi-head attention mechanism. The calculation is iterated N times and the results are combined to obtain the complete feature information:

Attention(Q, K, V) = softmax(QK^T / √d) V   (7)

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)   (8)

where W_i^Q, W_i^K, W_i^V, and W^O are trained matrix mapping parameters that project the input features into different subspaces.

Fig. 3. The overall structure of our proposed FIBGN model

The Two-phase Feature Interaction Layer. We address the issue of feature interaction across different spaces in both the first and second phases. The approach takes two sets
of features, H and L, as input: H represents the context sequence features trained by the stacked BiLSTM, and L represents the label features from the two stages of entity prediction. Initially, we encode L into label features L̃ using a BiLSTM. Next, we set the features H as queries and the label features L̃ as keys and values, enabling each sequence feature to be integrated with its associated label information:

O = Attention(BiLSTM(L), H, H) = softmax(BiLSTM(L) H^T / √d) H   (9)

The BiGCN Relation Inference Layer. To improve task recognition in the second phase, we use BiGCN. A relation-weighted graph is created from the relations obtained in the first phase, and the adjacency matrix is split into a forward adjacency matrix and a backward adjacency matrix according to the directionality of the relation between event arguments:

→h_u^(l+1) = ReLU( Σ_{v∈V} Σ_{r∈R} p_r(u, v) · →W_r^l →h_v^l + →b_r^l )   (10)

←h_u^(l+1) = ReLU( Σ_{v∈V} Σ_{r∈R} p_r(u, v) · ←W_r^l ←h_v^l + ←b_r^l )   (11)

h_u^(l+1) = [→h_u^(l+1), ←h_u^(l+1)]   (12)
where W_r^l and b_r^l are the kernel parameter matrix and bias of relation r in layer l, respectively, and V and R are the total numbers of entity and relation categories.

Training Loss. In the EAER task, we use a CRF to find the globally optimal labeling sequence:

p(M, Y) = Σ_{i=1}^{n} ( A_{y_{i−1}, y_i} + p_{i, y_i} )   (13)

P(Y|M) = e^{p(M,Y)} / Σ_{Y′∈f(M)} e^{p(M,Y′)}   (14)

loss_EAER = argmax_{Y′∈f(M)} log p(Y′|M)   (15)

where A is the transition probability matrix between labels and y_i is the event argument entity label predicted for the i-th character. In (14), Y′ is a possible label sequence and f(M) is the set of all possible label sequences. As shown in (15), we set the maximum likelihood estimation of P(Y|M) as the loss function of EAER.

In the EARE task, the relation probability between characters c_i and c_j is defined as shown in (16):

p(c_i, r_k, c_j) = σ( W_v tanh(W_f z_{c_i} + W_b z_{c_j}) )   (16)
where p(c_i, r_k, c_j) denotes the predicted probability of each event argument relation r_k between c_i and c_j. W_v, W_f, and W_b are the fully connected layer weight, forward, and backward relation weight matrices, respectively.
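The pairwise scoring of Eq. (16) can be sketched in NumPy as follows. All dimensions are illustrative, and the vectorized broadcasting over all (i, j) pairs is an implementation choice of this sketch, not a detail given in the paper:

```python
import numpy as np

def relation_scores(Z, Wf, Wb, Wv):
    """Pairwise relation probabilities following Eq. (16):
    p(c_i, r_k, c_j) = sigmoid(W_v tanh(W_f z_i + W_b z_j)).

    Z: (n, d) character features; Wf, Wb: (h, d); Wv: (m, h).
    Returns an (n, n, m) tensor of probabilities.
    """
    fwd = Z @ Wf.T                                        # (n, h) head projections
    bwd = Z @ Wb.T                                        # (n, h) tail projections
    hidden = np.tanh(fwd[:, None, :] + bwd[None, :, :])   # (n, n, h) pair features
    return 1.0 / (1.0 + np.exp(-(hidden @ Wv.T)))         # (n, n, m) sigmoid scores

rng = np.random.default_rng(1)
n, d, h, m = 6, 50, 32, 8   # chars, char dim, hidden dim, relation types
P = relation_scores(rng.normal(size=(n, d)),
                    rng.normal(size=(h, d)), rng.normal(size=(h, d)),
                    rng.normal(size=(m, h)))
print(P.shape)  # (6, 6, 8)
```

Each entry P[i, j, k] lies in (0, 1) and is read as the probability that relation r_k holds from character c_i to character c_j.
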
As shown in (17), the cross-entropy function is used as the loss function of the EARE task, where n and m are the numbers of characters and relation categories, and p(c_i, r_k, c_j) is the probability of relation r_k between characters c_i and c_j:

loss_EARE = − Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} Σ_{k=0}^{m−1} p(c_i, r_k, c_j) log p(c_i, r_k, c_j)   (17)

The overall model uses two types of loss: event argument entity loss and relation loss. To comprehensively consider the tasks of the two phases, we define the total loss function as:

loss_TOTAL = loss¹_EAER + loss²_EAER + loss¹_EARE + loss²_EARE   (18)
3 Experiments

3.1 Dataset and Experimental Settings

The Dataset. This paper uses historical management cases from the SCM process in the manufacturing industry as the data source. A specialized domain dataset is constructed using the BIO labeling strategy. The total number of manually labeled samples is 1500, with an average sentence length of 68.51 and a total of 82.2k characters (Table 1).

Table 1. Detailed statistics of the SCM dataset.
Event argument entities                          Event argument relations
Category       Total   Category       Total      Category        Total   Category      Total
Object         5138    Index          290        Event           6610    Composed_of   396
Trigger_word   5110    Model          454        Followed        3507    Contrapose    356
Attribute      2552    Company        224        Has_attribute   1835    Used_for      199
Field          954     Industry       212        Belong_to       1601    Appear        105
Method         643     Index_system   166        Lead_to         709
Problem        357
Hyperparameter Settings. The model is built using the TensorFlow framework. For input feature representations, we use a 50-dimensional character vector as input for all models. Our model consists of 4 layers of stacked-BiLSTM, with hidden layer dimensions of 128 for both BiLSTM and BiGCN. We set the dropout rate to 0.9 to prevent
overfitting, and the batch size is fixed at 10. This paper uses precision (P), recall (R), and F1 score as evaluation indexes.

Baselines. To extract more accurate triples, this paper verifies the performance of the baseline methods, including FIBGN, using 5-fold cross-validation. We compare FIBGN with the following baselines:

• Zhang et al. [13]: a model that uses an LSTM for the NER task and feeds entity labels to a CNN for the relation classification task.
• Liu et al. [14]: a joint extraction model based on the self-attention mechanism.
• Trans-SBLGCN [15]: a transfer-learning-based model that uses a graph convolutional network.
• CMAN [16]: a two-stage cross-modal joint extraction model based on deep stacking of multiple attention layers.
• SBALGN [17]: a model using a self-attention-based stacked BiLSTM with label weight embedding and a graph convolutional network.

3.2 Experimental Results

Overall Comparison. The results of our model against the other baselines are shown in Table 2. FIBGN outperforms all baselines, achieving an F1 score of 83.01% on the EAER task and 53.97% on the EARE task, with an average F1 score of 68.49%. We attribute the gains of FIBGN to two advantages: (1) unlike the other baselines, it integrates semantic information across two different spatial distances, which is of great significance for relation extraction, whereas the SBALGN model only performs a simple label embedding and cannot fully exploit the semantic information of relations; (2) FIBGN focuses on relation-related entities and considers the implicit features between all word pairs in the text, and its second prediction pass eliminates errors caused by predicting redundant entity pairs.

Table 2. Comparison of methods on the SCM dataset.
EAER
EARE
F1 score
P
R
F1
P
R
F1
Zhang et al. [13]
77.26
78.61
77.93
49.42
44.69
46.38
62.16
Liu et al. [14]
79.29
79.97
79.63
52.06
46.57
49.16
64.40
Trans-SBLGCN [15]
80.21
79.95
80.54
53.05
48.53
50.78
65.66
CMAN [16]
81.96
81.51
81.74
52.45
48.33
51.16
66.45
SBALGN [17]
80.60
81.24
80.54
55.05
49.02
52.08
66.50
FIBGN(Ours)
83.80
82.41
83.01
55.10
52.78
53.97
68.49
Ablation Study. We conducted ablation experiments to demonstrate the effectiveness of stacked BiLSTM, two-phase feature interaction, and BiGCN relation inference layer.
We removed one component at a time and observed its impact on the results, summarized in Table 3. (1) Stacked BiLSTM embedding in the input layer effectively provides deep syntactic information for the sentence. (2) The feature interaction layers in the first and second phases help fuse the deep semantic features with the label information, capturing fine-grained semantic relation features. (3) Removing the BiGCN relation inference layer significantly decreases accuracy, which shows that BiGCN relation inference aggregates entity nodes using the relation prediction results and adds relation information to entity recognition, thereby improving the performance of joint extraction.

Table 3. Ablation study of the FIBGN model.
Model                               P       R       F1 score
FIBGN                               69.45   67.60   68.49
— Stacked BiLSTM embedding          67.84   67.16   67.90
— First-phase feature interaction   67.05   66.54   66.84
— Second-phase feature interaction  66.12   66.01   66.47
— BiGCN relation inference layer    66.41   65.26   66.56
3.3 Tuning of Hyperparameters

Effect of Batch Size on Performance. Table 4 shows that joint extraction performance improves as the batch size increases, indicating that forming batches of an appropriate number of samples leads to more accurate gradient estimates and better results. However, performance decreases when the batch size exceeds 10, suggesting that too large a batch size can cause convergence to a poor local optimum and weaker generalization.

Table 4. Effect of batch size on performance.
Batch size   EAER                      EARE                      F1 score
             P       R       F1        P       R       F1
1            83.19   82.39   82.81     54.87   52.25   53.49     68.15
10           83.80   82.41   83.01     55.10   52.78   53.97     68.49
20           82.06   81.65   82.34     54.75   51.74   53.26     67.85
Effect of Hyperparameters of Stacked BiLSTM. Different stacked layers and the number of hidden cells in the stacked BiLSTM have varying impacts on model performance. Figure 4 illustrates our experiments where we set the number of BiLSTM layers
to 2, 3, 4, and 5 and the number of hidden cells to 32, 64, 128, and 256, respectively. The model is validated with 5-fold cross-validation. The results indicate that the model performs best with 4 stacked layers and 128 hidden units, while increasing the number of layers to 5 causes a sharp decline in performance. We therefore set the number of BiLSTM layers to 4 and the hidden dimension to 128.
Fig. 4. Performance of different layers and hidden units in stacked BiLSTM
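The layer/hidden-unit sweep above amounts to a grid search over 4 × 4 configurations. A minimal sketch follows; the `evaluate` function is a placeholder (the actual procedure trains the model and scores it with 5-fold cross-validation) whose toy score peaks at (4, 128) purely to illustrate the selection step:

```python
from itertools import product

layers_grid = [2, 3, 4, 5]
hidden_grid = [32, 64, 128, 256]

def evaluate(layers, hidden):
    """Placeholder for training + 5-fold cross-validation; returns a mock score.
    Peaks at layers=4, hidden=128 for illustration only."""
    return -abs(layers - 4) - abs(hidden - 128) / 128

# Enumerate all 16 configurations and keep the best-scoring one.
best = max(product(layers_grid, hidden_grid), key=lambda cfg: evaluate(*cfg))
print(best)  # (4, 128)
```
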
3.4 Visualization and Application of the ELKG

The application of the knowledge graph is mainly reflected in intelligent information retrieval: query sentences are used to find the specific entities and relations consistent with the query keywords, and the corresponding knowledge chain is returned. After applying the joint extraction method for event argument entities and relations, we obtain numerous event knowledge triples describing SCM problems and approaches. To establish the ELKG, these triples are stored in a Neo4j database. Figure 5 shows part of the ELKG; the detail view gives an example of the English translation of entity and relation triples. Events are connected through the extracted event and relation categories: the event (Use, event, Analytic hierarchy process) correlates with the event (Analysis, event, Key Feature), which further correlates with the event (Establish, event, Screening model). Dynamically exploiting knowledge from this graph structure improves the efficiency of SCM.
692
W. Chen et al.
[Figure 5 shows event nodes such as Supplier Selection, Supplier selection method, Hierarchical analysis method, Key Features, Screening Model, Index, Unqualified suppliers, and Parts Procurement, connected by Event, Followed, Composed_of, and Has_attribute relations.]
Fig. 5. The visualization of SCM ELKG (partial)
4 Conclusion

An ELKG is constructed for SCM in the manufacturing industry based on the event texts of SCM cases. Firstly, the event logic ontology of SCM is constructed, and the domain dataset is formulated according to the ontology. On this basis, a feature-interaction joint extraction model, FIBGN, is proposed. This model mainly improves feature-interaction performance and enhances the ability to obtain fine-grained semantic relation features across different spaces. Furthermore, experiments show that FIBGN can improve the effectiveness of joint extraction. Finally, according to the recognition results, the ELKG of SCM is established to realize the structured storage and querying of event knowledge and to provide knowledge support for SCM. Although the construction of the ELKG is realized in this paper, limitations remain. For example, in relation recognition, a sentence may contain multiple relations involving the same entity, and the model cannot identify such overlapping relations. In future work, we will further investigate the problem of relation overlap. Additionally, we plan to explore semi-supervised labeling methods to reduce the dependence on manual labeling.

Acknowledgements. Our work is supported by multiple funds in China, including the National Key R&D Project (2021YFB3301802) and the Key Program of NSFC-Guangdong Joint Funds (U2001201, U1801263). Our work is also supported by the Guangdong Provincial Key Laboratory of Cyber-Physical System (2020B1212060069).
Leveraging Inter-class Differences and Label Semantics for Few-Shot Text Classification

Xinran Xie1,2(B), Rui Chen1,2, Tailai Peng1,2, Zhe Cui1,2, and Zheng Chen3
1 Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610041, China
[email protected]
2 University of Chinese Academy of Sciences, Beijing 101408, China
3 School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
Abstract. In some few-shot text classification tasks with strong data privacy or difficult labeling, the performance of pipeline methods, which directly encode text features and perform linear classification, is limited by the feature extraction ability of models. An increasing number of studies have recognized the significance of combining text features with label semantics and achieved good results. However, these existing methods cannot be well generalized to classification tasks where the class names have weak correlations with the instance texts. In this work, we address this problem by means of an effective fusion of text-label similarity and a redesign of contrastive loss. Firstly, the semantic similarity modules of text-text and text-label are adopted for further merging to improve the feature extraction ability. Then, we introduce DLSC, an inter-class differences and label semantics contrastive loss that facilitates instance embeddings to approximate correct label semantics in vector space. Experimental results show that our approach has greatly improved F1 scores on English and Chinese datasets from six classification tasks, even in tasks where label names are not strongly correlated with texts. Keywords: Few-shot text classification · Inter-class differences · Label semantics · Contrastive learning
1 Introduction

Pre-trained models [1] have fueled the growth of Natural Language Processing (NLP), particularly in text classification. Typically, a pre-trained model is employed as an encoder to convert input texts into vectorized representations; subsequently, a classifier, such as a fully connected layer, is used to infer the relationships between texts and labels. Such approaches usually involve fine-tuning with a substantial amount of annotated data and require considerable domain expertise, which may result in high costs, while fine-tuning with limited labeled data often leads to overfitting and poor generalization. Therefore, obtaining a robust classification model for few-shot settings is still challenging.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 694–706, 2023. https://doi.org/10.1007/978-981-99-4752-2_57
Leveraging Inter-class Differences and Label Semantics
695
To address this issue, early researchers [2] have made some progress by optimizing the distance between samples in the embedding space. Beyond the original information of the samples, many other sources of auxiliary information can be used to address the poor model generalization caused by insufficient labeled data. Studies [3, 4] have utilized the semantic information of labels (known as label semantics) as an additional signal to enhance the model's prior knowledge. Specifically, these approaches employ two encoders to separately encode the input texts and label names and measure the distance between their embeddings to predict labels. Even though measuring the semantic distance between text and label vectors is naturally suitable for small-sample scenarios, its performance is unsatisfactory when the semantic correlation between texts and labels is lacking. Moreover, these methods are overly concerned with the distance between texts and labels, which leads to the neglect of the semantic correlation between the texts themselves. In a few-shot scenario, poor instance representations can exacerbate the distance deviation. As shown in Fig. 1, some instances from class A are closer to class B in the projected space, which leads the model to misclassify them.

[Figure 1 plots text embeddings from class A and class B, with example review sentences such as "Despite this minor disappointment, I highly recommend it to anyone who is serious about digital photography." and "What the G3 RAW image software gives you is a fantasy world designed to please the uncritical." landing near the wrong cluster.]
Fig. 1. The figure shows a binary classification task. If we overlook other aspects and purely concentrate on reducing the distance between texts and positive labels while increasing the gap between texts and negative labels, some text embeddings may end up being closer to the wrong labels. Inter-class differences can be added to push input texts away from incorrect clusters.
To solve this problem, we consider utilizing inter-class differences (i.e., instances from different classes should be far from each other) to cluster similar instances more tightly around their label embeddings. As shown in Fig. 1, the semantic meaning of each label is used as a cluster center to bring its corresponding instances closer together, while instances belonging to different labels are pushed away from each other. Specifically, we propose a model that considers both text-text similarity and text-label similarity to extract features of inter-class differences and label semantics. Additionally, we introduce a novel objective that serves the dual purpose of minimizing the distance between an instance and its positive label (i.e., the ground truth of the instance) while simultaneously maximizing the distance between the instance and negative instances (i.e., other instances in the same batch that belong to different
696
X. Xie et al.
classes)—referred to as inter-class Differences and Label Semantic Contrastive Loss (DLSC). We report experimental results on English and Chinese datasets from different tasks. Our contributions are summarized as follows:
(1) We propose a simple but effective architecture in which text-text and text-label semantic similarity modules are designed for further fusion to enhance the feature extraction ability of the model.
(2) We present a new contrastive objective, DLSC, that takes both label semantics and inter-class differences into account without adjusting any hyperparameter. It provides a new perspective on multi-task learning and can be easily extended to other NLP tasks.
(3) We show that our proposed objective is more robust and effective than a model fine-tuned with label semantics only. Our model outperforms prior works by 1 to 6.5 F1 points on six datasets.
2 Related Work

2.1 Few-Shot Learning

Few-shot learning is often used in scenarios where obtaining many supervised examples is difficult or even impossible. It has recently achieved promising results thanks to the emergence of pre-trained models. Commonly used approaches include meta-learning and prompt-tuning.

Meta-Learning. This is a method of training models on various richly annotated tasks and then solving new tasks using only a small number of labeled samples. Meta-learning assumes that the train and test sets come from the same distribution. Some studies [5] split one dataset into train and test sets by class, with non-overlapping categories between the two sets, typically favoring classification tasks with a larger number of classes. Others [6] use multiple datasets from different fields within the same category. Thus, meta-learning is generally limited by the number of classes and domains in practical applications. Our approach instead centers on the effective utilization of a limited number of labeled samples.

Prompt-Tuning. This is a new paradigm that has recently emerged in few-shot or zero-shot scenarios [7]. By adding prompts directly to downstream tasks [8], it not only removes fine-tuning's reliance on a large supervised corpus but also closes the gap between downstream task objectives and pre-training objectives. However, prompt engineering, which requires rich experience and manual labor, is a complex art that includes designing prompts [9, 10], answer construction and prediction, and answer-label mapping. We propose introducing label names, represented in the pattern shown in Table 1, as a form of supplementary knowledge, which may be viewed as a type of prompt. Our approach eliminates the complex prompt-answer-verbalizer process and facilitates ease of migration. In addition, we use a dual-encoder structure so that label embeddings can be precalculated, which helps save inference time.
2.2 Label Semantics

A growing body of work has recently demonstrated the value of label semantics. [11] directly used 'text [SEP] label names' as input, which brought a significant improvement on classification tasks. Beyond using label semantics in fine-tuning and prediction, [12] also achieved good results on text classification by further pre-training on labeled sentences from each domain. [3, 4] predict the labels of instances by calculating the distance between the instance vectors and the label vectors. We present a new objective that leverages inter-class differences to enhance the effectiveness of instance embeddings; it helps reduce the bias caused by inadequate correlations between label names and instances.

2.3 Contrastive Learning

Contrastive learning aims to learn effective representations by pulling semantically close samples together and pushing apart dissimilar samples. The concept first appeared in computer vision [13] and has lately been extended to NLP, where it is extensively employed in a variety of text-based applications [14]. Two popular forms are self-supervised contrastive learning [15, 16] and supervised contrastive learning [17]. In self-supervised contrastive learning, the positive examples of the anchor are created by data augmentation, while the other samples are designated as negatives. Supervised contrastive learning uses instances from the same class as positives and instances from different classes as negatives. In contrast, our method brings instances closer to their positive labels and away from negative instances: the goal is to cluster instances that belong to the same class around a positive label.
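The distance-based prediction idea shared by [3, 4] and by our method can be sketched framework-agnostically: mean-pool contextual token embeddings into a sentence vector and pick the nearest label embedding. The function names and toy shapes below are illustrative assumptions, not code from any of the cited works.

```python
import numpy as np

def mean_pool(token_embs, mask):
    """Merge contextual token embeddings (T, d) into one sentence embedding,
    ignoring padded positions (mask: (T,) with 1 for real tokens)."""
    m = mask[:, None].astype(float)
    return (token_embs * m).sum(axis=0) / m.sum()

def nearest_label(text_emb, label_embs):
    """Predict the label whose embedding is closest to the text embedding
    (maximum cosine similarity after L2 normalization)."""
    t = text_emb / np.linalg.norm(text_emb)
    L = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    return int(np.argmax(L @ t))
```

Because label names are few and fixed, their embeddings can be pre-computed once, which is the inference-time advantage of the dual-encoder structure mentioned above.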
3 Method

3.1 Problem Setting

There are N training examples with K classes in the training set Dtrain for few-shot text classification: Dtrain = {(x_i, y_k)}, i = 1, ..., N, k = 1, ..., K, where y_k is the label name of instance x_i and can be a word or a sentence. For a binary affective polarity classification task, y = {bad, good}. The pattern settings of label names for the other tasks are shown in Table 1. Instead of simply mapping input x to label y with a classification layer, we fine-tune the model to minimize the distance between instances x_i and their ground-truth label names y_k.

3.2 Architecture

Figure 2 explains the overall system. We use two encoders to obtain the text embeddings enc(x) and the label name embeddings enc(y) separately because of their significant differences in length. Each encoder consists of a pre-trained language model that generates contextual token embeddings and a mean pooler that merges the token embeddings into a sentence embedding. We use the dot product as the similarity function to calculate the similarity scores between texts and label names. Before calculating the similarity scores, we apply L2 normalization F(·) to make the features uniform.

Fig. 2. Overall structure. In the text-label similarity metrics, the dark yellow squares represent the similarity scores for positive sample pairs, and the light-yellow squares represent the similarity scores for negative sample pairs. In the text-text similarity metrics, we only keep the similarity scores for texts from different classes, as indicated by the green squares. (Color figure online)

The text-text similarity module tracks the relationships between texts. Before computing text similarity, we add a projection head proj(·) to extract more information from the text embeddings [16]. It is instantiated as a two-layer multilayer perceptron, proj(x_i) = W^(2) σ(W^(1) x_i), where σ is a ReLU nonlinearity. To take advantage of inter-class differences, we ignore the similarity scores arising from a text itself and from instances within the same class (white squares in Fig. 2) and only keep the similarity scores for instances from different classes (dark green squares in Fig. 2). In the fusion module, we fuse the text-text and text-label similarity metrics by F(·) and then select suitable similarity scores as positive or negative sample pairs, as described in Sect. 3.3. Our training goal is consistent with the idea of contrastive learning: maximizing the similarity of positive sample pairs and minimizing the similarity of negative sample pairs. At inference time, only the structure in the dotted box in Fig. 2 is used: the distances between the input texts and the pre-computed label embeddings are calculated, and the label name closest to the input text is taken as the predicted label.

3.3 Loss Function

To solve the problem of the limited resources available in a few-shot setting, one way is to use external knowledge; the other is to use existing knowledge such as label semantics and the relationships between texts. In addition, how to combine various
resources effectively and maximize their effect is what we mainly explore. We randomly sample a mini-batch of N examples and take instance x_i as the anchor point.

Label Semantic Contrastive (LSC). This objective uses cross-entropy directly to learn the mapping between instances and label names; it moves the anchor point closer to the positive label and farther away from the negative labels. Here φ(·) = F(enc(·)), · denotes the dot product, τ is a scalar temperature parameter, and y_k^+ is the ground-truth label of x_i:

L_LSC = −(1/N) Σ_{i=1}^{N} log [ exp(φ(x_i)·φ(y_k^+)/τ) / Σ_{k=1}^{K} exp(φ(x_i)·φ(y_k)/τ) ]   (1)

Inter-Class Differences Contrastive (DC). When it is difficult to separate instances from negative labels through label semantics alone, we use inter-class differences to push anchors further away from their negative labels. Inter-class differences mean that the text representations of different classes differ significantly and should be kept apart by a considerable distance in the projection space:

L_DC = −(1/N) Σ_{i=1}^{N} log [ exp(φ′(x_i)·φ(y_k^+)/τ) / ( exp(φ′(x_i)·φ(y_k^+)/τ) + Σ_{j=1, j≠i}^{N_neg} exp(φ′(x_i)·φ′(x_j^−)/τ) ) ]   (2)

where φ′(x_i) = proj(F(enc(x_i))) and x_j^− is a negative instance of x_i. The positive sample is still the positive label of the anchor, and the instance texts of different classes in the same batch are used as the negative samples.

Joint Loss. Combining two goals with a hyperparameter λ is a common approach in multi-task learning, but the choice of hyperparameter has a great influence on the results. More negative samples in contrastive learning can promote the model's ability to learn features that are sufficient to distinguish positive from negative samples [16]. Inspired by this, we propose a new type of joint loss, DLSC. Since the goals of DC and LSC are both to bring positive pairs closer and push negative pairs apart, we add inter-class differences as negative samples to LSC, which increases the number of negative sample pairs while keeping the two training objectives unchanged. The positive sample is the positive label name, and the negative samples consist of two parts: the negative labels, and the other instances from different classes in the mini-batch:

L_DLSC = −(1/N) Σ_{i=1}^{N} log [ exp(φ(x_i)·φ(y_k^+)/τ) / ( Σ_{k=1}^{K} exp(φ(x_i)·φ(y_k)/τ) + Σ_{j=1, j≠i}^{N_neg} exp(φ(x_i)·φ(x_j^−)/τ) ) ]   (3)
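The DLSC objective can be sketched end-to-end in NumPy. The helper names and toy shapes are illustrative, and applying the projection head to the text-text scores follows our reading of the architecture description rather than released code; batching, encoders, and autograd are omitted.

```python
import numpy as np

def l2_normalize(m):
    """F(·): row-wise L2 normalization so dot products are comparable."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

def projection_head(X, W1, W2):
    """proj(x) = W2 · ReLU(W1 · x), the two-layer MLP on the text branch."""
    return np.maximum(X @ W1.T, 0.0) @ W2.T

def dlsc_loss(text_embs, label_embs, labels, W1, W2, tau=1.0):
    """DLSC: positive pair = (instance, ground-truth label name); negatives =
    all other label names plus same-batch instances from different classes."""
    T = l2_normalize(text_embs)                   # φ(x_i)
    L = l2_normalize(label_embs)                  # φ(y_k)
    P = l2_normalize(projection_head(T, W1, W2))  # projected text features
    sim_tl = T @ L.T / tau                        # (N, K) text-label scores
    sim_tt = P @ P.T / tau                        # (N, N) text-text scores
    N = len(labels)
    total = 0.0
    for i in range(N):
        pos = np.exp(sim_tl[i, labels[i]])
        denom = np.exp(sim_tl[i]).sum()           # all K label scores
        for j in range(N):                        # inter-class negatives only
            if j != i and labels[j] != labels[i]:
                denom += np.exp(sim_tt[i, j])
        total += -np.log(pos / denom)
    return total / N
```

Dropping the inner loop over same-batch instances recovers plain LSC, which makes explicit that DLSC only enlarges the set of negative pairs while leaving the positive pair unchanged.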
4 Experiments

4.1 Datasets

As shown in Table 1, we use multiple English and Chinese text classification datasets covering different kinds of tasks, such as question classification, topic classification, and sentiment classification.
Table 1. Evaluation datasets in this work. K: number of classes for the classification task. N: average number of tokens in input sentences. Pattern: label names used in the experiment. Due to space limitations, only the translated label names are listed here for the Chinese datasets.

Name | Task | Lang | Train | Test | K | N | Pattern {label names}
Subj [18] | subjectivity | en | 8,000 | 2,000 | 2 | 26 | Subjective, Objective
CR [19] | sentiment | en | 1,775 | 2,000 | 2 | 22 | Bad, Good
TREC [20] | question cls | en | 5,452 | 500 | 6 | 10 | Description, Entity, Abbreviation, Human, Location, Number
Yahoo [21] | topic | en | 20,000 | 5,000 | 10 | 13 | Society & Culture, Science & Mathematics, Health, Sports, Education & Reference, Computers & Internet, Business & Finance, Entertainment & Music, Family & Relationships, Politics & Government
CEmotion [22] | emotion | ch | 27,768 | 5,000 | 6 | 44 | Angry, Happy, Neutral, Surprised, Sad, Afraid
TNews [23] | topic | ch | 2,283 | 2,010 | 15 | 21 | Story, Culture, Esports, Entertainment, Sports, Finance, Real Estate, Automobile, Education, Technology, Military, Travel, International, Stocks, Agriculture
4.2 Training Details

Following the setting of [10], 16 examples are randomly selected for each label in the training set, so the training set contains 16 * K samples in total; the validation set is the same as the training set, and the test set is the original test set in Table 1. To avoid errors caused by data selection, we run all experiments with 5 different training sets and report the mean and standard deviation. We use English and Chinese BERT-base as the encoders for all methods. proj(·) is instantiated as a multilayer perceptron with a single hidden layer of size 768 and an output vector of size 128. We use the Adam optimizer with a learning rate of 3e−5 and a batch size of 8 during training. We set the scalar temperature parameter τ = 1.0. Instead of setting a fixed maximum number of epochs, we end training when the validation loss has not decreased for 100 batches. We employ the F1-score as the evaluation metric, following other text classification works, and run all experiments on a single NVIDIA TITAN RTX GPU.
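The 16-per-class sampling protocol above can be sketched as a small utility; the function and variable names are ours, not from the paper's experiment code.

```python
import random
from collections import defaultdict

def sample_few_shot(dataset, k_per_class=16, seed=0):
    """Draw k examples per label, giving a 16 * K training set as in Sect. 4.2.
    dataset: iterable of (text, label) pairs; seed controls the split, so
    running with 5 different seeds reproduces the 5-split protocol."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in dataset:
        by_class[label].append((text, label))
    train = []
    for label in sorted(by_class):
        train.extend(rng.sample(by_class[label], k_per_class))
    return train
```

Reporting the mean and standard deviation over `seed in range(5)` then matches the evaluation protocol described above.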
4.3 Baselines

Different works may have various experimental settings; the results of the following baselines are reproduced under our unified experimental setting and therefore deviate slightly from the results reported in the original papers.

CE: The classical paradigm of text classification, in which the model is fine-tuned with cross-entropy loss only.

CE + SCL [24]: This work applies supervised contrastive learning directly to text classification, bringing instances from the same class closer together and pushing instances from different classes further apart. It also makes use of textual relations, so we include it in our comparison to verify whether inter-textual relations affect the quality of textual representations.

LSC [3]: This method makes full use of label semantics to enhance text classification performance, as described in Sect. 3.3.

LM-BFF [10]: One of the representative prompt-learning methods, which integrates prompts into a moderately sized language model such as BERT. The experiment is divided into two parts: 1) LM-BFF (man): manual prompts, consistent with the templates in Table 1; 2) LM-BFF (demo): incorporating demonstrations as additional context based on 1).
5 Results and Analysis

5.1 Overall Results

Table 2. We use 16 (per class) samples for few-shot experiments and the F1-score as the evaluation metric. We report mean (and standard deviation) performance over 5 different splits.

Method | CR | Yahoo | Subj | TREC | CEmotion | TNews
CE | 65.52(4.2) | 54.82(1.5) | 88.92(0.7) | 75.08(2.1) | 58.84(1.6) | 51.64(1.2)
CE + SCL | 66.27(4.3) | 54.86(0.7) | 88.38(1.3) | 77.92(2.9) | 60.14(2.1) | 51.44(1.0)
LSC | 78.80(8.9) | 54.44(2.7) | 88.24(1.0) | 71.76(9.9) | 61.73(1.5) | 50.34(1.4)
LM-BFF (man) | 85.10(1.4) | 57.52(0.8) | 85.82(1.0) | 74.40(6.3) | 56.78(2.5) | 50.14(1.2)
LM-BFF (demo) | 85.36(0.3) | 57.00(0.9) | 88.62(0.6) | 75.48(5.9) | 55.64(2.1) | 51.90(0.7)
DC | 76.61(6.8) | 56.18(1.3) | 81.29(8.1) | 77.88(4.5) | 60.40(2.0) | 52.28(1.1)
LSC + DC | 80.56(2.3) | 54.54(1.8) | 69.62(11.4) | 78.04(3.7) | 60.15(3.6) | 49.78(1.3)
TLSC | 74.30(3.6) | 54.40(2.3) | 87.92(3.2) | 78.04(3.6) | 60.01(2.4) | 52.98(0.7)
DLSC | 82.07(2.2) | 56.68(1.8) | 89.17(1.8) | 78.28(3.3) | 64.13(1.8) | 52.02(2.1)
As shown in Table 2, our approach, DLSC, outperforms all the baseline models on all six datasets. We observe that CE + SCL improves about 1 point on some datasets compared to CE, which shows that the quality of text embeddings can be improved by learning the
relationship between texts. LSC achieves good results on multiple datasets, yielding a 13.3-point improvement on CR and a 3.3-point improvement on CEmotion. However, LSC improves little on Yahoo and Subj, and even performs poorly on TREC, with a decrease of 4 points compared to CE. Using inter-class difference features, DC achieves better results than LSC on datasets such as TREC, Yahoo, and TNews, which have distinct text features. We also experiment with LSC + DC (λ = 0.5), but the results are not ideal and require further fine-tuning of the hyperparameter; in Sect. 5.3, we report the impact of varying values of λ on the accuracy. DLSC integrates the features of label semantics and inter-class differences without any hyperparameter adjustment. Our model performs significantly better than previous methods, by margins of 4.6, 5, and 2.85 points on average, respectively. Compared to LM-BFF, our method is more competitive on four datasets, with a 9-point improvement on CEmotion in particular. Prompt learning performs better on CR, which has simpler label names (good, bad), but it is not suitable for datasets with complex label names, especially the Chinese datasets.

5.2 Intra-class Correlations vs Inter-class Differences

Relationships between texts include intra-class correlations and inter-class differences. The former refers to instances of the same class sharing certain common characteristics and lying close to each other in the projection space; the latter refers to the significant differences between the representations of different classes of text: the farther apart they are, the easier they are to separate. Here we consider the positive label and instances of the same class as positive samples, while negative labels and instances from different classes are treated as negative samples. The resulting objective, TLSC, is:

L_TLSC = −(1/N) Σ_{i=1}^{N} log [ ( exp(φ(x_i)·φ(y_k^+)/τ) + Σ_{j=1, j≠i}^{N−N_neg} exp(φ(x_i)·φ(x_j^+)/τ) ) / ( Σ_{k=1}^{K} exp(φ(x_i)·φ(y_k)/τ) + Σ_{j=1, j≠i}^{N} exp(φ(x_i)·φ(x_j)/τ) ) ]   (4)

Observing the third and eighth rows in Table 2, we find that TLSC brings little improvement over LSC. The positive samples of TLSC contain one positive label and multiple positive instances, and the instances carry a higher weight in the loss function, which pulls the anchor closer to positive instances and weakens the experimental effect; the more positive instances we add, the worse the performance becomes. Because binary classification tasks like CR and Subj have few classes, a batch contains more positive instances, resulting in a greater decrease. The results in the eighth and ninth rows of the table show that DLSC is much more effective than TLSC, indicating that the benefit yielded from inter-class differences outweighs that from intra-class correlations.

5.3 Forms of Joint Loss

As described in Sect. 3.3, our work contains two objective functions, DC and LSC. We propose a new joint loss L_DLSC that compensates for the shortcomings of the traditional
joint loss L_DC+LSC = λ·L_DC + (1 − λ)·L_LSC, which requires parameter adjustment. The results are shown in Fig. 3: compared with any setting of the parameter λ, DLSC remains at a consistently high level. Our introduction of inter-class differences by adding negative examples to LSC is inspired by the contrastive-learning principle that enough negative pairs create a sharp contrast with the positive pairs. The negative instances we add to the negative sample pairs can serve as hard samples (a kind of negative sample closer to the anchor). As [25] notes, hard samples make models learn more discriminative textual representations, and their contribution to the performance improvement is large compared to other negative samples. Our experimental results are consistent with this conclusion.
Fig. 3. The abscissa is the parameter λ ∈ [0, 1]; the F1 score changes with λ. The red dashed line is the result of using the DLSC loss function.
5.4 Ablation Experiments

We add a text-similarity calculation module and a fusion module in training. In the text-similarity calculation module, we add the projection head proj(·) to enhance the text representation; in the fusion module, F(·) is used to unify the metrics. As shown in the first and fifth rows of Table 3, the effect decreases on all datasets without the two steps, and TREC drops the most, by about 8 points. A comparison of the third and fourth rows shows that F(·) has the greater impact on the results; this is caused by the inconsistent distributions of the similarity matrices from the two different sources. After applying the unified metric F(·), an analysis of the fourth and fifth lines reveals that the addition of proj(·) enhances the quality of the text representation and aids the improvement of accuracy across datasets.
Table 3. Ablation study on Proj(·) and F(·).
Proj(·) | F(·) | CR | Yahoo | Subj | TREC | CEmotion | TNews
- | - | 69.46(12.4) | 50.37(3.2) | 80.82(15.4) | 75.80(6.2) | 56.18(2.5) | 50.26(2.3)
√ | - | 78.03(3.1) | 54.68(2.9) | 87.82(0.9) | 69.36(10.4) | 61.43(2.7) | 50.50(1.3)
- | √ | 78.94(2.8) | 57.80(0.8) | 89.02(2.1) | 76.44(4.9) | 62.55(2.0) | 52.67(1.4)
√ | √ | 82.07(2.1) | 56.68(1.7) | 89.17(1.7) | 78.28(3.3) | 64.13(1.7) | 52.02(2.1)
5.5 Visualization

To explore how DLSC enhances the quality of text representation, we apply t-SNE to the Subj test set to map the 768-dimensional embeddings to 2D for visualization. We show the results of CE, LSC, and DLSC in Fig. 4. Comparing (a) with (b) and (c), a distance-based approach that measures the relationship between label embeddings and instance embeddings promotes tighter clustering of representations from the same class. In addition, the fact that instance embeddings closely surround label embeddings demonstrates the validity of methods such as LSC and DLSC that utilize label semantics. Comparing (b) with (c), DLSC helps the model learn more discriminative and robust representations of the text features by exploiting inter-class differences.
Fig. 4. The t-SNE plots of (a) standard fine-tuning with cross-entropy loss (CE), (b) LSC, which uses label semantics only, and (c) DLSC, which uses both inter-class differences and label semantics. Light circles are instance text embeddings and dark pentagons are label name embeddings.
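A projection like those in Fig. 4 can be produced with off-the-shelf t-SNE; a minimal sketch, assuming the 768-dimensional instance and label embeddings have already been extracted from the encoder (random vectors stand in for real embeddings here):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-ins for encoder outputs: 60 instance embeddings + 2 label-name embeddings.
instance_emb = rng.normal(size=(60, 768))
label_emb = rng.normal(size=(2, 768))

# Project instances and labels jointly so they share one 2-D space,
# allowing instance points to be drawn around their label points.
all_emb = np.vstack([instance_emb, label_emb])
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(all_emb)

print(coords.shape)  # (62, 2): the last two rows are the label-embedding coordinates
```

With real data, the first 60 rows would be colored by class and the last rows drawn as pentagons, reproducing the layout of Fig. 4.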
6 Conclusion
In this paper, we propose an architecture that utilizes both label semantics and inter-class differences, achieving a significant improvement over baselines on multiple datasets for few-shot text classification. Unlike hyperparameter adjustment in multi-task learning, our objective function DLSC performs strongly without tuning the proportion of the two tasks. Through extensive comparative experiments, we provide further evidence that inter-class differences provide the model with more
Leveraging Inter-class Differences and Label Semantics
705
clues than intra-class correlations. Our work holds practical value in scenarios with privacy-sensitive data and challenging labeling requirements, such as those found in medicine and biology. Moving forward, we plan to delve deeper into field-specific optimization techniques.
Acknowledgements. This research was supported by the Sichuan Science and Technology Program, grant number 2022ZHCG0007.
Simplifying Aspect-Sentiment Quadruple Prediction with Cartesian Product Operation Jigang Wang1, Aimin Yang1,2, Dong Zhou3(B), Nankai Lin1(B), Zepeng Wang3, Weifeng Huang3, and Boyu Chen4 1 School of Computer Science and Technology, Guangdong University of Technology,
Guangzhou 510006, Guangdong, China [email protected] 2 School of Computer Science and Intelligence Education, Lingnan Normal University, Zhanjiang 524000, Guangdong, China 3 School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, Guangdong, China [email protected] 4 Institute of Health Informatics, University College London, London, UK
Abstract. Aspect sentiment quad prediction (ASQP) is an emerging subtask of aspect-based sentiment analysis, which seeks to predict in one shot the sentiment quadruplets of aspect terms, aspect categories, associated sentiment polarities, and corresponding opinion terms. Recent studies employ text generation models to accomplish this task. However, two problems remain: how to effectively reduce the ASQP task's high complexity, and the possibility that the generative model predicts explicit terms that do not exist in the text sentence. To fill this gap, this paper proposes a novel text generation model, Cartesian-ASQP, based on the Transformer architecture. Specifically, this paper simplifies the aspect-based sentiment quad prediction task to a sentiment triplet extraction task by performing a Cartesian product operation on the aspect category and sentiment polarity sets. For sentiment quadruplet text sentences containing implicit terms, we present an implicit term processing strategy that semantically maps these terms to pronouns. On the output side, for the situation where the explicit aspect/opinion terms predicted by the model are absent from the input sentence, this paper introduces a two-stage term correction strategy to solve the problem. Experimental results on two publicly available datasets demonstrate that our proposed model outperforms various baseline methods and achieves superior performance. This work also validates that our proposed model can effectively handle aspect-based sentiment quad prediction with a large number of implicit aspect and opinion terms. Keywords: Aspect-based sentiment analysis · Cartesian product operation · Two-stage term correction
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 D.-S. Huang et al. (Eds.): ICIC 2023, LNAI 14089, pp. 707–719, 2023. https://doi.org/10.1007/978-981-99-4752-2_58
708
J. Wang et al.
1 Introduction
Aspect-based sentiment analysis (ABSA) [1, 2] is becoming an important research direction in natural language processing. ABSA is a fine-grained task that aims to extract one or more specific sentiment elements from a given text or sentence. Four essential elements are typically involved: aspect categories, aspect terms, opinion terms, and sentiment polarity. We provide a concrete example in Fig. 1, where the aspect term is "restaurant", "ambience general" and "restaurant miscellaneous" are the aspect categories corresponding to the aspect term "restaurant", "negative" is the sentiment polarity, and "busy", "cramped" and "closes early" are opinion terms. This paper focuses on extracting all four sentiment elements from a given text sentence.
Fig. 1. Example of an ASQP task, with the blue span emphasizing the gold aspect term and the orange span being the opinion term. The symbol "-" indicates negative sentiment polarity. (Color figure online)
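To make the element structure concrete, the Fig. 1 example can be written as a list of (aspect term, opinion term, sentiment polarity, aspect category) quadruplets; the pairing of each opinion term with a category below is our reading of the caption, not taken verbatim from the dataset:

```python
# Sentiment quadruplets for the Fig. 1 sentence, one per opinion term:
# (aspect term, opinion term, sentiment polarity, aspect category)
quads = [
    ("restaurant", "busy", "negative", "ambience general"),
    ("restaurant", "cramped", "negative", "ambience general"),
    ("restaurant", "closes early", "negative", "restaurant miscellaneous"),
]

for a, o, p, c in quads:
    print(f"{a} | {o} | {p} | {c}")
```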
Depending on the number of sentiment elements predicted from text comments, Zhang et al. further classified ABSA subtasks into two types: single and compound ABSA tasks. The single ABSA subtask mainly includes aspect term extraction (ATE) [3], opinion term extraction (OTE) [4], and aspect category detection (ACD) [5]. The compound ABSA subtask extracts more detailed information and jointly predicts multiple sentiment tuple elements from text comments. These include aspect opinion term pair extraction (AOPE) [6], aspect category sentiment analysis (ACSA) [7], aspect sentiment triplet extraction (ASTE) [8], aspect category sentiment detection (ACSD) [9], and the recently proposed aspect-based sentiment quad prediction (ASQP) [10–13]. Cai et al. [10] first suggested a two-stage pipeline method to perform the aspect-based sentiment quad prediction task. All possible aspect terms and opinion terms are extracted in the first stage; the extracted aspect terms and opinion terms are then used to match the correct aspect-opinion pairs. The second stage determines the corresponding sentiment polarity and aspect category for the extracted aspect-opinion term pairs. However, this pipeline approach may suffer from error propagation. Recently, Mao et al. [12] proposed a Seq2Path model that generates sentiment quadruplets by constructing a tree structure. In addition, they added extra training data using data augmentation. Zhang et al. [11] proposed combining annotated sentiment elements with pre-constructed templates and using the obtained natural language sentences as target sequences, converting the original aspect-based sentiment quad prediction task into a text generation problem. This work does not consider the case of text sentences containing implicit opinion terms and suffers from the possibility that some gold target
Simplifying Aspect-Sentiment Quadruple Prediction
709
sequences cannot be recovered as quadruplets. In addition, when the predicted terms are not present in the input sentences, neither the method of Mao et al. [12] nor that of Zhang et al. [11] may be able to process them accurately. Inspired by the above observations, we propose a novel end-to-end sentiment quadruplet prediction model named Cartesian-ASQP for the aspect-based sentiment quad prediction task. Since sentiment polarities and aspect categories are chosen from predefined sets, we simplify the aspect-based sentiment quad prediction task to a triplet prediction task by performing the Cartesian product operation [14] on these two sets. For sentiment quadruplet text sentences containing implicit terms, we present an implicit term processing strategy that semantically maps these terms to the pronouns it and null. In addition, we propose a two-stage term correction strategy for the situation where the explicit aspect/opinion terms predicted by the model are absent from the input text sentence. Our contributions can be summarized as follows: 1. This paper proposes a novel generation model, Cartesian-ASQP, to perform the aspect-based sentiment quad prediction task. We propose a two-stage term correction strategy to solve the prediction problem for terms that do not conform to the rules. The strategy and model are easy to implement and can be built on top of other pre-trained models for ABSA tasks. 2. This paper simplifies the aspect-based sentiment quad prediction task to a sentiment triplet extraction task by performing a Cartesian product operation on the aspect category and sentiment polarity sets. This simplification can significantly decrease the complexity of the ASQP task. 3. This paper conducts extensive experiments on two public benchmark datasets, ASQP and ACOS. The experimental results demonstrate that our proposed model significantly outperforms recent state-of-the-art approaches.
2 Method
We consider aspect sentiment quad prediction as a generation task (Subsect. 2.1) and use the Cartesian product operation to simplify the ASQP task into a new paradigm of sentiment triplet extraction (Subsect. 2.2). At the same time, to better mine implicit information and to handle model prediction errors, we propose an implicit term processing strategy (Subsect. 2.3) and a two-stage term correction strategy (Subsect. 2.4) on the input and output sides of the model, respectively. Finally, we describe the model's overall architecture in detail (Subsect. 2.5).
2.1 Task Formulation
Given a sentence sequence X = {x1, x2, ..., xn}, the aspect-based sentiment quad task aims to predict the set Y = {(a1, o1, p1, c1), ..., (am, om, pm, cm)} of all sentiment quadruplets in X, where ai, oi, pi and ci represent the aspect term, opinion term, sentiment polarity, and aspect category of the ith sentiment quadruplet, respectively. The aspect term a and the opinion term o usually appear in the input sentence X. The aspect category c belongs to the set of predefined categories Lc, and the sentiment polarity p ∈ Lp = {positive, neutral, negative}. n and m denote the length of the text sentence and the number of sentiment quadruplets in the target sequence, respectively.
2.2 Cartesian Product Operation
Unlike previous models, this paper simplifies the ASQP task to a triplet sentiment prediction task by performing the Cartesian product operation. Specifically, since the sentiment polarity p and aspect category c belong to the predefined sets Lp and Lc, respectively, we obtain a new predefined set Lpc as the Cartesian product Lp × Lc. This paper thus simplifies the original sentiment quadruplet (a, o, p, c) prediction task into a triplet (a, o, pc) extraction task, where pc belongs to the recomputed predefined set Lpc. We consider that this processing can effectively decrease the complexity of the ASQP task. This triplet form is similar to the direct extraction of the expected sentiment elements, but in the form of a generated text string. Thus, the original quadruplet is transformed into the final target output sequence.
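As a concrete illustration, the fused label set Lpc can be built with a standard Cartesian product; a minimal sketch (the category names below are hypothetical placeholders, not the actual ASQP label inventory):

```python
from itertools import product

# Predefined sets: sentiment polarities Lp and (hypothetical) aspect categories Lc.
Lp = ["positive", "neutral", "negative"]
Lc = ["food quality", "service general", "ambience general"]

# Lpc = Lp x Lc: every (polarity, category) combination becomes one fused label,
# so the model predicts a single pc element instead of two separate ones.
Lpc = [f"{p} {c}" for p, c in product(Lp, Lc)]

print(len(Lpc))  # |Lp| * |Lc| = 3 * 3 = 9
print(Lpc[0])    # positive food quality
```

The size of Lpc is |Lp| · |Lc|, which stays small because both source sets are small and fixed.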
Fig. 2. Two examples of target paradigm construction for the ASQP task; the first is from the ASQP dataset, and the second is from the ACOS dataset.
As shown in Fig. 2, if the input sentence X includes multiple sentiment quadruplets, we first transform each sentiment quadruplet (Label-Quad) into a triplet (Target-Triplet) processed by the Cartesian product operation. These triplets are then concatenated with the special symbol [SEP] to form the final target sequence Y, which contains all the sentiment elements predicted for the given sentence. We can express the above operation in the following form:

Y = (a1, o1, p1c1) [SEP] (a2, o2, p2c2) ... [SEP] (am, om, pmcm)   (1)

where m denotes the number of sentiment tuples contained in the target sequence Y.
2.3 Implicit Term Processing Strategy
For the case where a sentiment tuple contains implicit aspect terms or opinion terms, we define a function that maps implicit aspect terms and implicit opinion terms to the pronouns it and null, respectively. This processing makes the target output more consistent with actual natural discourse, which facilitates the model's training.
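The linearization of Eq. (1) together with the implicit term processing can be sketched as follows; the token conventions ([SEP] separator, implicit aspect → it, implicit opinion → null) come from the paper, while the helper names and the exact spacing are our own assumptions:

```python
def map_implicit(aspect, opinion):
    # Implicit term processing: implicit aspects become "it", implicit opinions "null".
    return aspect or "it", opinion or "null"

def build_target(quads):
    # quads: list of (aspect, opinion, polarity, category); None marks an implicit term.
    triplets = []
    for a, o, p, c in quads:
        a, o = map_implicit(a, o)
        triplets.append(f"({a}, {o}, {p} {c})")  # (a, o, pc) after the Cartesian fusion
    return " [SEP] ".join(triplets)

y = build_target([("restaurant", "busy", "negative", "ambience general"),
                  (None, "closes early", "negative", "restaurant miscellaneous")])
print(y)
# (restaurant, busy, negative ambience general) [SEP] (it, closes early, negative restaurant miscellaneous)
```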
2.4 Two-Stage Correction and Tuple Recovery
According to the rules of ABSA tasks, explicit aspect term and opinion term spans must be sub-spans of the text sentence X, the aspect category must belong to the predefined set Lc, and the sentiment polarity must belong to Lp = {positive, neutral, negative}. However, the generation model is not aware of these rules: the aspect term and opinion term spans it predicts may be absent from the input sentence X, and the predicted aspect category may not belong to the predefined category set Lc. To solve these problems, this paper proposes a two-stage sentiment element correction strategy. We perform token-level correction in the first stage and span-level correction in the second stage. Specifically, suppose the predicted term consists of a single token and that token does not exist in the token list of the input sentence X. In that case, we find the most similar token in the token list of X to replace the term predicted by the model. We name this token-level correction. If the predicted term S is longer than one token, the first stage iterates through each token in S and determines whether the current token exists in the token list of the input sentence X; we perform token-level correction on every token that does not. After token-level correction of the predicted term, we obtain a new predicted term S′. If the corrected term S′ is still not a sub-sequence of the input sentence X, we perform the second correction stage, which finds the most similar sub-sequence of X to replace S′. We name this span-level correction. We use the Levenshtein distance [15] in the first stage to find the most similar tokens. An example of the two-stage correction strategy is shown in Fig. 3.
Fig. 3. Example of the two-stage correction strategy.
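The two correction stages can be sketched in pure Python; this is a minimal sketch under our own assumptions (whitespace tokenization, a hand-rolled Levenshtein distance, same-length sub-spans in the span stage, and string containment as the sub-span test), since the paper does not fix these implementation details:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev, dp[j] = dp[j], cur
    return dp[n]

def token_level_correct(term_tokens, sent_tokens):
    # Stage 1: replace every predicted token absent from the sentence
    # with the most similar sentence token (minimum Levenshtein distance).
    return [tok if tok in sent_tokens
            else min(sent_tokens, key=lambda s: levenshtein(tok, s))
            for tok in term_tokens]

def span_level_correct(term_tokens, sent_tokens):
    # Stage 2: if the corrected term is still not a sub-span of the sentence,
    # pick the most similar same-length sub-span of the sentence.
    n = len(term_tokens)
    spans = [sent_tokens[i:i + n] for i in range(len(sent_tokens) - n + 1)] or [sent_tokens]
    target = " ".join(term_tokens)
    return min(spans, key=lambda sp: levenshtein(target, " ".join(sp)))

def two_stage_correct(term, sentence):
    sent_tokens = sentence.split()
    term_tokens = token_level_correct(term.split(), sent_tokens)
    # Sub-span rule checked via string containment (an approximation).
    if " ".join(term_tokens) not in " ".join(sent_tokens):
        term_tokens = span_level_correct(term_tokens, sent_tokens)
    return " ".join(term_tokens)

print(two_stage_correct("closs early", "the restaurant closes early"))
# closes early
```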
For the proposed target paradigm, if the model's predicted output Y contains multiple sentiment tuples, we first extract the individual sentiment tuples by splitting on the special symbol [SEP]. We then extract each sentiment element in a sentiment tuple using the symbols "()" and ",". If such a sentiment tuple recovery operation fails, e.g., the recovered sentiment tuple contains only two or fewer sentiment elements, we ignore that sentiment tuple prediction.
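The tuple recovery step admits an equally small sketch; the parsing conventions ([SEP] separator, parentheses, comma-separated elements, dropping tuples with two or fewer elements) follow the paper, while the function name is our own:

```python
def recover_tuples(output):
    # Split the generated sequence on [SEP], then parse each "(a, o, pc)" chunk.
    tuples = []
    for chunk in output.split("[SEP]"):
        chunk = chunk.strip()
        if not (chunk.startswith("(") and chunk.endswith(")")):
            continue  # malformed chunk: ignored, as the paper prescribes
        elems = [e.strip() for e in chunk[1:-1].split(",")]
        if len(elems) <= 2:
            continue  # two or fewer sentiment elements: recovery failed, drop it
        tuples.append(tuple(elems))
    return tuples

y = "(restaurant, busy, negative ambience general) [SEP] (it, cramped)"
print(recover_tuples(y))
# [('restaurant', 'busy', 'negative ambience general')]
```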
Fig. 4. Overall model architecture
2.5 Overall Architecture
As formulated in the previous section, the aspect-based sentiment quad prediction task can be cast as a text generation task with input X = {x1, x2, ..., xn} and output target sequence Y = {y1, y2, ..., yk}, where Y is a sequence of triplets, and n and k indicate the lengths of the input text sentence X and the output sequence Y, respectively. We use a model based on an encoder-decoder architecture to transform ASQP into a text generation problem. The overall architecture of the model is shown in Fig. 4 and can be expressed as the mapping G: X → Y:

P(Y | X) = ∏_{t=1}^{k} P(y_t | X, Y_{<t})